Computing Local Moran's I for Feature Engineering

Compute Local Moran's I for spatial feature engineering with PySAL: classify hotspots, coldspots, and spatial outliers as statistically rigorous machine learning input features.

Computing Local Moran’s I for feature engineering transforms raw geospatial observations into spatially contextualized predictors by quantifying local spatial autocorrelation. Unlike global metrics that summarize an entire dataset, Local Indicators of Spatial Association (LISA) evaluate how each observation relates to its immediate neighbors. This produces statistically significant hotspot (HH), coldspot (LL), and spatial outlier (HL/LH) classifications that serve as high-signal inputs for tree-based models, neural networks, and spatial regression pipelines. When integrated into a broader Spatial Feature Engineering for Machine Learning workflow, these metrics replace heuristic distance counts with mathematically rigorous neighborhood context, improving model generalization and enabling automated spatial drift detection in production.

Core Mechanics & Statistical Output

Local Moran’s I measures the covariance between a target value and the spatially lagged average of its neighbors. The estimator returns four critical outputs per observation:

  • Local I statistic (Is): Direction and magnitude of local autocorrelation. Positive values indicate clustering (similar values nearby); negative values indicate dispersion (dissimilar values nearby).
  • Pseudo p-value (p_sim): Significance derived from conditional permutation tests. Standard practice uses 999 permutations to approximate the null distribution of spatial randomness.
  • Z-score (z_sim): Standardized deviation from the expected mean under the null hypothesis. Useful for continuous scaling in neural networks.
  • Cluster classification (q): Categorical label mapping to spatial regimes: 1=HH (high surrounded by high), 2=LH (low surrounded by high), 3=LL (low surrounded by low), 4=HL (high surrounded by low).

These outputs directly feed into Spatial Lag and Neighborhood Statistics pipelines, where they act as engineered features that capture localized spatial dependence without requiring manual spatial joins or window aggregations.

Production-Ready Implementation

The following pipeline computes Local Moran’s I using libpysal and esda, returning a clean DataFrame aligned to the original index. It handles missing values, row-standardizes weights for stable ML scaling, and maps cluster codes to human-readable labels.

import geopandas as gpd
import libpysal
import esda
import numpy as np
import pandas as pd

def compute_local_moran_features(
    gdf: gpd.GeoDataFrame,
    value_col: str,
    w_type: str = "queen",
    k_neighbors: int = 4,
    permutations: int = 999,
    random_state: int = 42
) -> pd.DataFrame:
    """
    Computes Local Moran's I features for ML pipelines.
    Returns DataFrame with I, p-value, z-score, and cluster type.
    """
    # 1. Build spatial weights
    if w_type == "queen":
        w = libpysal.weights.Queen.from_dataframe(gdf)
    elif w_type == "knn":
        w = libpysal.weights.KNN.from_dataframe(gdf, k=k_neighbors)
    else:
        raise ValueError("Unsupported weights type. Use 'queen' or 'knn'.")

    # Row-standardize for stable ML scaling
    w.transform = "r"

    # 2. Extract values and handle missing data
    y = gdf[value_col].values.astype(float)
    mask = ~np.isnan(y)
    y_clean = y[mask]

    # Subset weights matrix to valid observations
    w_sub = w[mask, mask]

    # 3. Compute Local Moran's I (ref: https://pysal.org/esda/stable/source/generated/esda.Moran_Local.html)
    lm = esda.Moran_Local(
        y_clean, w_sub, permutations=permutations,
        n_jobs=-1, random_state=random_state
    )

    # 4. Map cluster codes to labels (1=HH, 2=LH, 3=LL, 4=HL)
    cluster_map = {1: "HH", 2: "LH", 3: "LL", 4: "HL", 0: "NS"}
    clusters = pd.Series(lm.q).map(cluster_map).fillna("NS")

    # 5. Reconstruct full DataFrame aligned to original index
    results = pd.DataFrame({
        "local_moran_i": np.nan,
        "p_sim": np.nan,
        "z_sim": np.nan,
        "cluster_type": "NS"
    }, index=gdf.index)

    valid_idx = gdf.index[mask]
    results.loc[valid_idx, "local_moran_i"] = lm.Is
    results.loc[valid_idx, "p_sim"] = lm.p_sim
    results.loc[valid_idx, "z_sim"] = lm.z_sim
    results.loc[valid_idx, "cluster_type"] = clusters.values

    return results

Statistical Validation & Feature Encoding

Raw LISA outputs require careful preprocessing before ingestion into predictive models:

  • Significance filtering: Mask local_moran_i and z_sim where p_sim > 0.05 (or your domain threshold). Non-significant observations introduce noise that degrades gradient-based optimization.
  • Target encoding for clusters: Convert cluster_type to numeric representations. For tree ensembles, one-hot encoding preserves categorical boundaries. For linear or neural architectures, use target encoding or ordinal mapping (HH=1, LH=2, LL=3, HL=4, NS=0) with explicit handling of the non-significant class.
  • Multicollinearity checks: Local Moran’s I correlates strongly with spatial lag and distance-weighted averages. Compute VIF scores post-merge; drop or regularize redundant features to prevent coefficient instability.
  • Distribution alignment: Apply PowerTransformer or QuantileTransformer to local_moran_i and z_sim before feeding them into deep learning pipelines. Spatial statistics often exhibit heavy tails that disrupt learning rate schedules.

MLOps Integration & Inference Architecture

Deploying spatial autocorrelation features requires deterministic neighbor resolution across training and inference:

  1. Weight matrix serialization: Cache the spatial weights object (w) using joblib or pickle. Recomputing contiguity or KNN graphs at scale introduces unacceptable latency.
  2. Inference neighbor resolution: For new observations, map them to existing neighbors using a spatial index (scipy.spatial.cKDTree or geopandas.sjoin_nearest). Apply the cached weights structure to maintain consistent lag calculations.
  3. Drift monitoring: Track the proportion of significant clusters (HH/LL) and the median p_sim across inference batches. A systemic shift toward non-significant (NS) classifications often signals spatial regime changes, sensor degradation, or sampling bias.
  4. Pipeline orchestration: Wrap the LISA computation in a scikit-learn-compatible transformer. Implement fit() to build and serialize weights, and transform() to apply the precomputed matrix to incoming data. This guarantees reproducibility across CI/CD deployments.

Scaling, Optimization & Pitfalls

  • Weights selection: Queen contiguity excels for areal data (census tracts, parcels). KNN adapts to irregular point distributions. Avoid fixed distance-band weights unless your domain enforces a strict search radius.
  • Boundary effects: Edge observations have fewer neighbors, which can artificially deflate local statistics. Apply toroidal correction, buffer boundaries, or explicitly model edge effects in spatial regression.
  • Permutation trade-offs: 999 permutations balance statistical rigor and compute time. For real-time inference, precompute significance thresholds or switch to analytical approximations where the spatial process meets stationarity assumptions.
  • Memory constraints: Large GeoDataFrames trigger OOM errors during weight matrix construction. Chunk spatially contiguous blocks, compute LISA independently, and concatenate results. Alternatively, use sparse matrix representations (libpysal.weights.W.sparse) to reduce footprint.
  • Version compatibility: Recent libpysal releases streamline geometry handling. If you encounter deprecation warnings, pass gdf.geometry explicitly or upgrade to the latest PySAL release. Consult the libpysal spatial weights documentation for migration guidance.

Conclusion

Computing Local Moran’s I for feature engineering bridges spatial statistics and machine learning by converting neighborhood relationships into structured, model-ready signals. When deployed alongside robust MLOps practices, these features improve predictive accuracy, enable automated spatial drift detection, and eliminate fragile manual joins. Validate significance thresholds, serialize neighbor structures, and monitor cluster distributions in production to maintain model reliability across heterogeneous landscapes.