Investigating Moran’s I Properties for Spatial Machine Learning

Jakub Nowosad, Hanna Meyer

the 28th AGILE conference

2025-06-11

Spatial Machine Learning

Traditional machine learning models (e.g., SVM, RF, GBM) lack inherent spatial awareness

Ignoring spatial structure can lead to poor predictive performance, biased predictions, or poor generalization

Incorporating spatial information:

  • Add spatial proxies (e.g., coordinates, Euclidean distances) as predictors
  • Use distance-based spatial predictors or spatial weighting matrices
  • Apply spatially aware cross-validation for feature selection and tuning
  • Use spatially enhanced models (e.g., Geographical RF, RF-GLS)
  • Use spatially aware metrics (e.g., Moran’s I) to understand spatial autocorrelation and assess model performance

Moran’s I for Spatial Machine Learning (SML)

Moran’s I is used to assess spatial autocorrelation before and after modeling

  • Pre-modeling: helps understand spatial structure in the data
  • Post-modeling: applied to residuals to assess model performance


What are the properties and limitations of Moran’s I when applied to spatial machine learning?

Moran’s I

\[ I = \frac{n}{\sum_{i}^{n} \sum_{j}^{n} w_{ij}} \times \frac{\sum_{i}^{n} \sum_{j}^{n} w_{ij} (x_i - \bar{x}) (x_j - \bar{x})}{\sum_{i}^{n} (x_i - \bar{x})^2} \]

  • \(n\): number of observations
  • \(x_i\), \(x_j\): values of the observations at locations \(i\) and \(j\)
  • \(\bar{x}\): mean value of the observations
  • \(w_{ij}\): spatial weight between the observations at locations \(i\) and \(j\)

The spatial weights define which observations are considered neighbors.

Various types of spatial weights can be used, and this choice affects the value of Moran’s I.
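To make the effect of the weight definition concrete, the formula above can be sketched in plain Python. This is a minimal illustration, not the implementation used in the study: the helper names `morans_i` and `grid_weights`, the 4 x 4 example grid, and the choice of binary rook and queen contiguity weights are all assumptions made for the example.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a dense n x n weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


def grid_weights(nrow, ncol, queen=False):
    """Binary contiguity weights for a regular grid (row-major cell order)."""
    n = nrow * ncol
    w = [[0.0] * n for _ in range(n)]
    for r in range(nrow):
        for c in range(ncol):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if (dr, dc) == (0, 0):
                        continue
                    if not queen and dr != 0 and dc != 0:
                        continue  # rook contiguity ignores diagonal cells
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < nrow and 0 <= cc < ncol:
                        w[r * ncol + c][rr * ncol + cc] = 1.0
    return w


# Hypothetical 4 x 4 grid with a smooth west-east gradient (values 0..3 per row)
values = [float(c) for r in range(4) for c in range(4)]
i_rook = morans_i(values, grid_weights(4, 4, queen=False))
i_queen = morans_i(values, grid_weights(4, 4, queen=True))
print(round(i_rook, 3), round(i_queen, 3))  # 0.667 0.524
```

The same values give two different Moran’s I results purely because the neighborhood definition changed.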

Simulation Setup

Three ranges of spatial autocorrelation (10, 50, 100 units)

\[ Y = X_1 + X_2 \cdot X_3 + X_4 + X_5 \cdot X_6 + \mathcal{E} \]

Each simulation setting was repeated 100 times.

Modeling Setup

  • Four training set sizes
  • Two training set sampling types

Random Forest modeling approach:

  • Extracted covariate and outcome values from rasters for training samples
  • Trained Random Forest (RF) models with 500 trees
  • Tuned mtry parameter (values: 2, 3, 4, 5, 6)
  • Selected final model based on lowest RMSE from out-of-bag (OOB) samples
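The tuning steps listed above could look roughly like the following sketch. It assumes scikit-learn, where `max_features` plays the role of `mtry`; the slides do not say which software was used. The training table is hypothetical: six covariates drawn as plain random numbers (not extracted from rasters) and an outcome built with the simulation formula.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical training table: 300 samples, six covariates, outcome from the formula
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 1] * X[:, 2] + X[:, 3] + X[:, 4] * X[:, 5]
     + rng.normal(scale=0.1, size=300))

oob_rmse = {}
for mtry in (2, 3, 4, 5, 6):
    rf = RandomForestRegressor(
        n_estimators=500,      # 500 trees, as on the slide
        max_features=mtry,     # number of covariates tried at each split
        oob_score=True,        # keep out-of-bag predictions
        random_state=0,
    )
    rf.fit(X, y)
    # RMSE computed only from out-of-bag predictions
    oob_rmse[mtry] = float(np.sqrt(np.mean((y - rf.oob_prediction_) ** 2)))

# Keep the setting with the lowest OOB RMSE and refit the final model
best_mtry = min(oob_rmse, key=oob_rmse.get)
final_model = RandomForestRegressor(
    n_estimators=500, max_features=best_mtry, random_state=0
).fit(X, y)
print(best_mtry, oob_rmse)
```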

Total number of models: 2400
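
This total is consistent with one final model per combination of autocorrelation range, repetition, training set size, and sampling type (an assumption about how the count was formed):

\[ 3 \ \text{ranges} \times 100 \ \text{repetitions} \times 4 \ \text{training set sizes} \times 2 \ \text{sampling types} = 2400 \]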

Validation Setup

  • Complete validation raster
  • Four test set sizes
  • Two test set sampling types
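
The results later distinguish random from cluster sampling of the testing set. The sketch below shows one possible way to draw both kinds of samples from raster cells. The 100 x 100 grid, the sample size of 30, the number of clusters, and the cluster radius are hypothetical values chosen for illustration, and the clustering rule is an assumption rather than the procedure used in the study.

```python
import math
import random


def random_sample(cells, n, rng):
    """Simple random sample of n cells without replacement."""
    return rng.sample(cells, n)


def clustered_sample(cells, n, n_clusters, radius, rng):
    """Sample n cells lying within `radius` cells of randomly placed cluster centres."""
    centres = rng.sample(cells, n_clusters)
    candidates = [
        (x, y) for x, y in cells
        if any(abs(x - cx) <= radius and abs(y - cy) <= radius for cx, cy in centres)
    ]
    if len(candidates) < n:
        raise ValueError("clusters too small for the requested sample size")
    return rng.sample(candidates, n)


def mean_nearest_distance(points):
    """Average distance from each point to its nearest other point."""
    return sum(
        min(math.dist(p, q) for q in points if q != p) for p in points
    ) / len(points)


rng = random.Random(1)
cells = [(x, y) for x in range(100) for y in range(100)]  # hypothetical raster cells
test_random = random_sample(cells, 30, rng)
test_cluster = clustered_sample(cells, 30, n_clusters=5, radius=5, rng=rng)

print(mean_nearest_distance(test_random), mean_nearest_distance(test_cluster))
```

Clustered samples sit much closer to their nearest neighbors than random samples of the same size, so a neighborhood such as "the eight nearest points" covers a very different distance in the two designs.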

Model Evaluation Metrics

RMSE

\[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]

Moran’s I

\[ I = \frac{n}{\sum_{i}^{n} \sum_{j}^{n} w_{ij}} \times \frac{\sum_{i}^{n} \sum_{j}^{n} w_{ij} (x_i - \bar{x}) (x_j - \bar{x})}{\sum_{i}^{n} (x_i - \bar{x})^2} \]


Here, we focus on the residuals of the model predictions, and thus:

\[ x_i = y_i - \hat{y}_i \]

The eight nearest cells or point samples were used as neighbors when calculating Moran’s I.
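
Both metrics can be sketched for point samples as follows. This is an illustration under stated assumptions, not the study's code: the helper names, the 200 random locations, the synthetic observations with a west-east trend, and the constant "model" prediction are all hypothetical. Weights are binary, with each point's eight nearest points as its neighbors.

```python
import math
import random


def rmse(observed, predicted):
    """Root mean squared error."""
    return math.sqrt(
        sum((y - p) ** 2 for y, p in zip(observed, predicted)) / len(observed)
    )


def knn_weights(coords, k=8):
    """Binary weights: w[i][j] = 1 if j is among the k nearest points to i."""
    n = len(coords)
    w = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(coords):
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords) if j != i
        )
        for _, j in dists[:k]:
            w[i][j] = 1.0
    return w


def morans_i(values, weights):
    """Moran's I for a list of values and a dense n x n weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / w_sum) * (num / sum(d * d for d in dev))


rng = random.Random(42)
coords = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(200)]
# Hypothetical observations with a west-east trend; the "model" predicts a constant,
# so the trend it misses stays in the residuals
observed = [0.05 * x + rng.gauss(0, 0.5) for x, _ in coords]
predicted = [2.5] * len(coords)
residuals = [y - p for y, p in zip(observed, predicted)]

weights = knn_weights(coords, k=8)
print("RMSE:", rmse(observed, predicted))
print("Moran's I of residuals:", morans_i(residuals, weights))
```

Because the unmodeled trend remains in the residuals, their Moran’s I is clearly positive; shuffling the residuals across locations would bring it close to zero while leaving the RMSE unchanged.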

Model Evaluation Metrics

Model 45: range 100, 500 random training samples, 500 random testing samples

RMSE of the training sample follows the RMSE of the complete raster


Moran’s I of the training sample is much lower than the Moran’s I of the complete raster

(but slightly higher than the Moran’s I of the testing sample)

The variability of the testing-sample RMSE, relative to the RMSE of the complete raster, decreases as the sample size increases

The variability of the testing-sample Moran’s I, relative to the Moran’s I of the complete raster, also decreases as the sample size increases, but the values themselves shift as well

More testing samples result in higher values of Moran’s I

Moran’s I values differ between random and cluster sampling of the testing set

For cluster sampling of the testing set, the correlation between Moran’s I and RMSE values increases with the sample size

For random sampling of the testing set, the correlation is low or non-existent

Conclusions

  • Moran’s I is highly sensitive to the spatial weight definition (e.g., neighborhood choice), so please report the definition used

  • In spatial ML, Moran’s I can be useful for assessing the spatial autocorrelation of residuals in the testing set

  • However, unlike RMSE, Moran’s I for the testing set does not reflect overall prediction performance

  • Instead, it is influenced by the sampling strategy and sample size (sampling density). It indicates how well the model captures spatial structure in the testing set, typically at a much finer scale than the resolution of the complete raster

  • Therefore, Moran’s I should not be used to compare performance across different studies. However, it may be useful for comparing models within the same study