R for geospatial predictive mapping

Practical workflows for reliable spatial predictions

Jakub Nowosad, https://jakubnowosad.com/

Rome R Users Group

2025-11-27

R spatial in various domains

  • Ecology
  • Earth-Observation
  • Economics
  • Demography
  • Politics
  • Journalism
  • Climatology
  • Meteorology
  • Geomorphometry
  • Hydrology
  • Urban-Planning
  • Mining
  • Archeology
  • Transport
  • Tourism
  • Marine-Studies
  • Soil-Science
  • and many more…

Geocomputational methods

  • Exploratory data analysis (EDA)
  • Data processing (e.g., adding new variables)
  • Data transformation (e.g., changing CRS, reducing size)
  • Data visualization (maps!)
  • Web application development
  • Software development (e.g., to share new methods)
  • Spatial data analysis and modeling

Based on the goal of the analysis (incomplete lists)

R spatial resources

Books

https://ipeagit.github.io/intro_access_book/ https://e-sensing.github.io/sitsbook/ https://r-spatial.org/book/ https://walker-data.com/census-r/

https://www.bigbookofr.com/chapters/geospatial

R spatial resources

geocompx project

Geocomputation with R

Additional content:

  • A website with reproducible solutions to exercises
  • R package, geocompkg, simplifying installation of the packages used in the book
  • R packages with example data: spData, spDataLarge

Translations (some made by the community, thank you!):

Blog posts

GitHub organization

Discord

Contribute & connect with the geocompx community

How you can contribute

  • Share ideas: suggest new topics, improvements, or materials
  • Fix and enhance: improve code, documentation, or visualizations
  • Translate: make materials accessible in other languages
  • Add: contribute theoretical, software, and real-world examples (e.g., as blog posts)

More about geocompx

Spontaneous growth of the geocompx FOSS4G community
Slides, Video

R’s Geospatial Kaleidoscope: Exploring Perspectives, Strengths, and Challenges
Slides


Join the community

The geocompx community grows through shared curiosity and collaboration.

Website: geocompx.org

Predictive mapping with R

Problem

  • Predictive mapping is a common task in many domains
  • The goal is to create a continuous map from point observations and spatial predictors (label every pixel)






Here, we want to predict plant species richness (the number of different species present in a given area) for South America (a hard problem!)

Code
library(sf)
library(terra)
library(CAST)
library(caret)
library(tmap)
tmap_options(scale = 1.5)
data(splotdata, package = "CAST")
south_america = rnaturalearth::ne_countries(continent = "South America")
# head(splotdata)
tm_shape(south_america) +
  tm_borders() +
  tm_shape(splotdata) +
  tm_symbols(fill = "Species_richness", size = 0.5)

Observations

Code
set.seed(21)
train_indices = sample(1:nrow(splotdata), size = 0.7 * nrow(splotdata))
splotdata_test = splotdata[-train_indices, ]
splotdata = splotdata[train_indices, ]

tm_shape(south_america) +
  tm_borders() +
  tm_shape(splotdata) +
  tm_symbols(fill = "Species_richness", size = 0.5) +
  tm_title("Training data")

Code
tm_shape(south_america) +
  tm_borders() +
  tm_shape(splotdata_test) +
  tm_symbols(fill = "Species_richness", size = 0.5) +
  tm_title("Testing data")

70/30 random split into training and testing data

Predictors

Code
dir.create("data")
wc = geodata::worldclim_global(var = "bio", res = 10, path = "data/")
wc = wc[[c(1, 12, 14)]]
elev = geodata::elevation_global(res = 10, path = "data/")
predictors = crop(c(wc, elev), st_bbox(splotdata))
names(predictors) = c("bio_1", "bio_12", "bio_14", "elev")
tm_shape(predictors) +
  tm_raster(col.scale = list(tm_scale_continuous(values = "cividis"),
                             tm_scale_continuous(values = "PuBuGn"),
                             tm_scale_continuous(values = "YlGnBu"),
                             tm_scale_continuous(values = "geyser")),
            col.legend = tm_legend(orientation = "landscape")) +
  tm_facets(ncol = 4)

bio_1: Average monthly difference between daily high and low temperatures, bio_12: Annual Precipitation, bio_14: Precipitation of Driest Month, elev: Elevation

Data preparation

Code
predictor_space = noNA(predictors, falseNA = TRUE)
tm_shape(predictor_space) +
  tm_raster()

Code
splotdata_df = st_drop_geometry(splotdata)
splotdata_coord = as.data.frame(st_coordinates(splotdata))
names(splotdata_coord) = c("x", "y")
splotdata_df = cbind(splotdata_df, splotdata_coord)
gt::gt(splotdata_df[1:18, c("Species_richness", "bio_1",
                           "bio_12", "bio_14", "elev", "x", "y")])
Species_richness bio_1 bio_12 bio_14 elev x y
40 15.000000 1199 53 2513 -79.36602 -5.343727
18 16.545834 1760 119 720 -50.93534 -28.870443
17 7.345833 992 15 3571 -73.08057 5.908890
2 5.991667 1456 23 3868 -74.15775 3.993870
53 20.020834 531 6 251 -64.54472 -29.964170
35 18.266666 2141 153 603 -52.51585 -26.773140
22 9.379167 1121 62 3384 -79.03380 -3.160160
104 16.954166 1156 17 1941 -78.16670 0.883330
10 10.045834 2175 52 328 -73.12974 -39.845370
29 9.800000 1055 51 3442 -79.31206 -3.716683
92 26.608334 2080 82 380 -74.83330 5.250000
48 17.270834 1660 83 680 -49.04789 -26.192380
110 20.208334 1291 27 1496 -68.48700 -14.988000
66 25.295834 2116 54 212 -69.27120 -12.329900
23 14.108334 1686 113 1249 -49.81775 -28.608093
103 25.441666 1927 48 194 -68.80190 -12.433400
22 7.537500 1212 70 3624 -79.16920 -2.880100
16 12.150000 1304 14 2842 -72.94137 10.247530

Simple interpolation methods

Inverse Distance Weighting (IDW) estimates values at unsampled points using nearby samples – closer points have more influence on the results.

Code
library(gstat)
m_idw = gstat(
  formula = Species_richness ~ 1,
  locations = ~ x + y,
  data = splotdata_df,
  set = list(idp = 2)
)
pred_idw = interpolate(predictor_space, m_idw, index = 1, debug.level = 0)
pred_idw = mask(pred_idw, predictor_space)
names(pred_idw) = "idw"
tm_shape(pred_idw) +
  tm_raster(col.scale = tm_scale_continuous(values = "spectral", midpoint = 0))

Geostatistics (ordinary kriging)

Code
vgm_ok = variogram(Species_richness ~ 1, locations = ~ x + y, data = splotdata_df)
plot(vgm_ok)

Code
fm_ok = fit.variogram(vgm_ok, model = vgm(model = "Exp", nugget = 500))
plot(vgm_ok, fm_ok)

Code
m_ok = gstat(
  formula = Species_richness ~ 1,
  locations = ~ x + y,
  data = splotdata_df,
  model = fm_ok
)
pred_ok = interpolate(predictor_space, m_ok, index = 1, debug.level = 0)
pred_ok = mask(pred_ok, predictor_space)
names(pred_ok) = "ordinary kriging"
tm_shape(pred_ok) +
  tm_raster(col.scale = tm_scale_continuous(values = "spectral", midpoint = 0)) +
  tm_layout(scale = 1)

Geostatistics (universal kriging)

Code
vgm_uk = variogram(Species_richness ~ bio_1 + bio_12 + bio_14 + elev, 
                   locations = ~ x + y, data = splotdata_df)
plot(vgm_uk)

Code
fm_uk = fit.variogram(vgm_uk, model = vgm(model = "Exp", nugget = 500))
plot(vgm_uk, fm_uk)

Code
m_uk = gstat(
  formula = Species_richness ~ bio_1 + bio_12 + elev,
  locations = ~ x + y,
  data = splotdata_df,
  model = fm_uk
)

pred_uk = interpolate(predictors, m_uk, index = 1, debug.level = 0)
pred_uk = mask(pred_uk, predictor_space)
names(pred_uk) = "universal kriging"
tm_shape(pred_uk) +
  tm_raster(col.scale = tm_scale_continuous(values = "spectral", midpoint = 0)) +
  tm_layout(scale = 1)

Machine learning

Machine learning methods, such as Random Forests, can capture complex relationships between predictors and the target variable.

However, they do not explicitly account for spatial relationships in the data.

Code
set.seed(10) # set seed to reproduce the model
model_rf = train(
  splotdata_df[, names(predictors)],
  splotdata_df$Species_richness,
  method = "rf",
  tuneGrid = data.frame("mtry" = 2),
  importance = TRUE,
  ntree = 50,
  trControl = trainControl(method = "cv", number = 3, savePredictions = "final")
)
pred_rf = terra::predict(predictors, model_rf, na.rm = TRUE)
names(pred_rf) = "random forest"
tm_shape(pred_rf) +
  tm_raster(col.scale = tm_scale_continuous(values = "spectral", midpoint = 0))

Which one is the best?

Code
pred_all = c(pred_idw, pred_ok, pred_uk, pred_rf)
tm_shape(pred_all) +
  tm_raster(col.scale = tm_scale_continuous(values = "spectral", midpoint = 0),
            col.legend = tm_legend(title = "Species richness",
                                   orientation = "landscape"),
            col.free = FALSE) +
  tm_facets(ncol = 4)

Which one is the best?

Code
rmse = function(observed, predicted) {
  sqrt(mean((observed - predicted)^2, na.rm = TRUE))
}

## idw
pred_idw_test = extract(pred_idw, splotdata_test)
rmse_idw = rmse(splotdata_test$Species_richness, pred_idw_test[, 2])

## kriging
pred_ok_test = extract(pred_ok, splotdata_test)
rmse_sk = rmse(splotdata_test$Species_richness, pred_ok_test[, 2])

## universal kriging
pred_uk_test = extract(pred_uk, splotdata_test)
rmse_uk = rmse(splotdata_test$Species_richness, pred_uk_test[, 2])

## machine learning
pred_rf_test = extract(pred_rf, splotdata_test)
rmse_rf = rmse(splotdata_test$Species_richness, pred_rf_test[, 2])

# which model is the best?
rmse_results = data.frame(
  Method = c("IDW", "Ordinary Kriging", "Universal Kriging", "Random Forest"),
  RMSE = round(c(rmse_idw, rmse_sk, rmse_uk, rmse_rf), 2)
)
pred_all = c(pred_idw, pred_ok, pred_uk, pred_rf)
pred_all_cent = st_centroid(st_as_sf(as.polygons(ext(pred_all))))
tm_shape(pred_all) +
  tm_raster(col.scale = tm_scale_continuous(values = "spectral", midpoint = 0),
            col.legend = tm_legend(title = "Species richness",
                                   orientation = "landscape"),
            col.free = FALSE) +
  tm_credits(paste0("RMSE: ", rmse_results$RMSE), size = 2) + 
  tm_facets(ncol = 4)

Issue #1

Code
# issue no 1: there are some negative values
tm_shape(pred_all) +
  tm_raster(col.scale = tm_scale_continuous(values = "spectral",
                                            midpoint = 0, 
                                            values.range = c(0.2, 1)),
            col.legend = tm_legend(title = "Species richness",
                                   orientation = "landscape"),
            col.free = FALSE) +
  tm_facets(ncol = 4) +
  tm_shape(splotdata) +
  tm_dots(shape = 2, size = 0.2, lwd = 0.3)

Some methods produce unrealistic negative predictions for species richness.

Issue #2

Code
tm_shape(south_america) +
  tm_borders() +
  tm_shape(splotdata) +
  tm_symbols(size = 0.5) +
  tm_title("Training data")

Our training data is clustered in space. Thus:

  • Some areas are far from any training point
Code
# issue no 2: distances, both spatial and feature
plot(geodist(splotdata, predictors)) +
  scale_x_sqrt(labels = scales::label_number())

Issue #2

Code
tm_shape(south_america) +
  tm_borders() +
  tm_shape(splotdata) +
  tm_symbols(fill = "Species_richness", size = 0.5) +
  tm_title("Training data")

Our training data is clustered in space. Thus:

  • Some areas are far from any training point
  • Values of predictors at some locations are very different from those at training points
Code
plot(geodist(splotdata, predictors, type = "feature")) +
  scale_x_sqrt(labels = scales::label_number()) + 
  scale_fill_manual(values = c("steelblue", "tomato"), 
                    name = "distance function")

Issue #3

Our testing data is also spatially clustered, and not representative of the entire study area.

Code
tm_shape(pred_all) +
  tm_raster(col.scale = tm_scale_continuous(),
            col.legend = tm_legend(title = "Species richness",
                                   orientation = "landscape"),
            col.free = FALSE) +
  tm_facets(ncol = 4) +
  tm_shape(splotdata_test) +
  tm_dots(shape = 2, size = 0.2, lwd = 0.3) +
  tm_title("Testing data")

Code
plot(geodist(splotdata, predictors, testdata = splotdata_test)) +
  scale_x_sqrt(labels = scales::label_number())

Thus, how reliable are our models’ evaluations? How can we address these issues?

Addressing the issues (1)

Solution approach: Having a random training and testing sample of the whole prediction area. (Sadly, often impossible in real-world applications)

Code
set.seed(2025-11-02)
random_sample_train = spatSample(predictor_space, size = 490, method = "random",
                                 na.rm = TRUE, exact = TRUE,
                                 as.points = TRUE, replace = FALSE)
random_sample_test = spatSample(predictor_space, size = 200, method = "random",
                                na.rm = TRUE, exact = TRUE,
                                as.points = TRUE, replace = FALSE)
Code
tm_shape(south_america) +
  tm_borders() +
  tm_shape(random_sample_train) +
  tm_symbols(size = 0.5) +
  tm_title("Training data") +
  tm_layout(scale = 2)

Code
tm_shape(south_america) +
  tm_borders() +
  tm_shape(random_sample_test) +
  tm_symbols(size = 0.5) +
  tm_title("Testing data") +
  tm_layout(scale = 2)

Addressing the issues (2)

Solution approach: Identify areas where the environment is not well represented, making predictions less trustworthy (Area of Applicability – AoA, Meyer and Pebesma, 2021).

Code
AOA = aoa(predictors,
          train = splotdata_df[c("bio_1", "bio_12", "bio_14", "elev")], 
          verbose = FALSE)
tm_shape(AOA$DI) +
  tm_raster(col.scale = tm_scale_continuous_sqrt(values = "batlow"),
            col.legend = tm_legend(title = "")) +
  tm_shape(splotdata) +
  tm_dots(shape = 2, col = "white", size = 0.2, lwd = 0.3) +
  tm_title("Dissimilarity to training data")

Code
tm_shape(AOA$AOA) +
  tm_raster(col.scale = tm_scale_categorical(),
            col.legend = tm_legend(title = "")) +
  tm_shape(splotdata) +
  tm_dots(shape = 2, col = "white", size = 0.2, lwd = 0.3) +
  tm_title("Area of Applicability")

Addressing the issues (2)

Code
pred_all_aoa = mask(pred_all, AOA[[3]], maskvalues = 0)
tm_shape(predictor_space) +
  tm_raster(col.scale = tm_scale(values = "black",
                                 labels = "Different environment"),
            col.legend = tm_legend(title = ""),
            col_alpha = 0.1) +
  tm_add_legend(labels = "Training data", fill = "black", 
                type = "symbols", shape = 2, size = 1) +
  tm_shape(pred_all_aoa) +
  tm_raster(col.scale = tm_scale_continuous(values = "spectral",
                                            midpoint = 0),
            col.legend = tm_legend(title = "Species richness",
                                   orientation = "landscape"),
            col.free = FALSE) +
  tm_facets(ncol = 4) +
  tm_shape(splotdata) +
  tm_dots(shape = 2, size = 0.2, lwd = 0.3)

Our predictions are reliable for the areas which are similar to the training data (inside the AoA).

Addressing the issues (3)

Clustered data: our previous RMSE estimates only reflect model performance in areas close to testing points.

However, we want to know how our model performs across the entire study area (except outside the AoA).

Solution approach: Use a prediction-domain adaptive validation method, e.g., kNNDM, Linnenbrink et al., 2024, to better reflect the prediction situation.

Code
set.seed(11)
indices_knndm = knndm(splotdata, predictors, k = 5)
# plot(indices_knndm)

# indices_knndm2 = knndm(splotdata[, names(predictors)], predictors, k = 5, space = "feature")
# plot(indices_knndm2)
plot(geodist(splotdata, predictors, cvfolds = indices_knndm$indx_test)) +
  scale_x_sqrt(labels = scales::label_number()) +
  ggtitle("kNNDM CV: k-fold nearest-neighbour distance matching cross-validation")

Addressing the issues (3)

kNNDM adapts cross-validation folds to better reflect the prediction situation

Code
# retrain model with knndm folds
model_rf2 = train(
  splotdata_df[, names(predictors)],
  splotdata_df$Species_richness,
  method = "rf",
  tuneGrid = data.frame("mtry" = 2),
  importance = TRUE,
  trControl = trainControl(
    method = "cv",
    index = indices_knndm$indx_train,
    savePredictions = "final"
  )
)

Here, we apply kNNDM approach (with a random forest method) to better reflect the prediction situation.

Final map

Code
tm_shape(pred_all[[4]]) +
  tm_raster(col.scale = tm_scale_continuous(values = "spectral",
                                            midpoint = 0),
            col.legend = tm_legend(title = "Species richness",
                                   orientation = "landscape")) +
  tm_title("Random Forest")

27.42

RMSE based on the (clustered) testing data

Final map

Code
tm_shape(pred_all[[4]]) +
  tm_raster(col.scale = tm_scale_continuous(values = "spectral",
                                            midpoint = 0),
            col.legend = tm_legend(title = "Species richness",
                                   orientation = "landscape")) +
  tm_title("Random Forest")

27.42

RMSE based on the (clustered) testing data

Final map

Code
pred_rf2 = terra::predict(predictors, model_rf2, na.rm = TRUE)
names(pred_rf2) = "random forest2"
pred_rf2_aoa = mask(pred_rf2, AOA[[3]], maskvalues = 0)
tm_shape(predictor_space) +
  tm_raster(col.scale = tm_scale(values = "black",
                                 labels = "Different environment"),
            col.legend = tm_legend(title = ""),
            col_alpha = 0.1) +
  tm_shape(pred_rf2_aoa) +
  tm_raster(col.scale = tm_scale_continuous(values = "spectral",
                                            midpoint = 0),
            col.legend = tm_legend(title = "Species richness",
                                   orientation = "landscape")) +
  tm_title("Random Forest")

35.15

RMSE based on the prediction-domain adaptive cross-validation (kNNDM)

Suggestions for final model evaluation

  • Independent testing set of randomly sampled points over the prediction area
Code
set.seed(2025-11-26)
random_sample_test_i = spatSample(predictor_space, size = 200, method = "random",
                                na.rm = TRUE, exact = TRUE,
                                as.points = TRUE, replace = FALSE)
random_sample_test_i$type = "test"
tm_shape(south_america) +
  tm_borders() +
  tm_shape(random_sample_test_i) +
  tm_symbols(size = 0.5,
             fill = "type",
             fill.legend = tm_legend(title = "",
                                     position = tm_pos_in("right", "bottom"))) +
  tm_title("Independent testing data") +
  tm_layout(scale = 2)

Suggestions for final model evaluation

  • Independent testing set of randomly sampled points over the prediction area
  • Random sample: split-sample validation into training and testing points over the prediction area
Code
random_sample_train$type = "train"
random_sample_test$type = "test"
random_sample = rbind(random_sample_train, random_sample_test)
tm_shape(south_america) +
  tm_borders() +
  tm_shape(random_sample) +
  tm_symbols(size = 0.5, 
             fill = "type",
             fill.legend = tm_legend(title = "",
                                     position = tm_pos_in("right", "bottom"))) +
  tm_title("Random sample") +
  tm_layout(scale = 2)

Suggestions for final model evaluation

  • Independent testing set of randomly sampled points over the prediction area
  • Random sample: split-sample validation into training and testing points over the prediction area
  • Biased or clustered sample; extrapolation: split-sample validation with training and testing points. The split should be aware of the prediction scenario (e.g., using kNNDM), and the evaluation should be supported by AoA
Code
# pak::pak("JanLinnenbrink/CAST") # new feature: work-in-progress
splotdata2 = splotdata
knndm_folds = CAST::knndm(splotdata2, modeldomain = predictors, 
                          prop_test = 0.3, tolerance = 0.1)
splotdata2$type = knndm_folds$clusters
tm_shape(south_america) +
  tm_borders() +
  tm_shape(splotdata2) +
  tm_symbols(size = 0.5, 
             fill = "type",
             fill.legend = tm_legend(title = "",
                                     position = tm_pos_in("right", "bottom"))) +
  tm_title("Clustered sample") +
  tm_layout(scale = 2)

Suggestions for final model evaluation

  • Independent testing set of randomly sampled points over the prediction area
  • Random sample: split-sample validation into training and testing points over the prediction area
  • Biased or clustered sample; extrapolation: split-sample validation with training and testing points. The split should be aware of the prediction scenario (e.g., using kNNDM), and the evaluation should be supported by AoA
  • Small biased or clustered sample; extrapolation: prediction-domain adaptive cross-validation supported by AoA only for the model evaluation
Code
set.seed(2025-11-02)
splotdata3 = splotdata[sample(1:nrow(splotdata), size = 143), ]
indices_knndm2 = knndm(splotdata3, predictors, k = 5)
splotdata3$fold = as.factor(indices_knndm2$clusters)
tm_shape(south_america) +
  tm_borders() +
  tm_shape(splotdata3) +
  tm_symbols(size = 0.5,
             fill = "fold",
             fill.legend = tm_legend(position = tm_pos_in("right", "bottom"))) +
  tm_title("Small sample") +
  tm_layout(scale = 2)

Suggestions for final model evaluation

  • Independent testing set of randomly sampled points over the prediction area
  • Random sample: split-sample validation into training and testing points over the prediction area
  • Biased or clustered sample; extrapolation: split-sample validation with training and testing points. The split should be aware of the prediction scenario, and the evaluation should be supported by AoA
  • Small biased or clustered sample; extrapolation: prediction-domain adaptive cross-validation (e.g., kNNDM) supported by AoA only for the model evaluation
  • Even smaller biased or clustered sample; extrapolation: prediction-domain adaptive cross-validation (e.g., kNNDM) supported by AoA only for the model tuning and evaluation
Code
set.seed(2025-11-02)
splotdata4 = splotdata[sample(1:nrow(splotdata), size = 80), ]
indices_knndm3 = knndm(splotdata4, predictors, k = 5)
splotdata4$fold = as.factor(indices_knndm3$clusters)
tm_shape(south_america) +
  tm_borders() +
  tm_shape(splotdata4) +
  tm_symbols(size = 0.5,
             fill = "fold",
             fill.legend = tm_legend(position = tm_pos_in("right", "bottom"))) +
  tm_title("(Even) smaller sample") +
  tm_layout(scale = 2)

R tools

CRAN Task View: Analysis of Spatial Data:

  • caret: General ML; add CAST or blockCV for spatial CV and feature selection
  • mlr3 and mlr3spatiotempcv : Object-oriented spatial ML with structured spatial CV
  • tidymodels + spatialsample: Tidyverse-style spatial resampling and evaluation
  • sperrorest/blockCV: Spatial cross-validation frameworks
  • spatialRF/RandomForestsGLS: Random Forests with spatial dependence handling
  • meteo: Spatial interpolation using nearby observations
  • gpboost: Gradient boosting + Gaussian processes for spatial effects

Summary

  • Understand sampling design, it affects model training & evaluation
  • Align evaluation with prediction context
  • Use prediction-domain adaptive cross-validation (e.g., kNNDM) for realistic assessment
  • Check prediction reliability, for example, using Area of Applicability (AoA)
  • Leverage R tools for spatial machine learning tasks

Contact

Website: https://jakubnowosad.com

Resources

Slides: https://jakubnowosad.com/rome2025

Acknowledgments

Various R developers

Contributors to the geocompx project

Members of the Remote Sensing and Spatial Modelling Research Group (especially Hanna Meyer and Jan Linnenbrink)