Elephant(s) in the room

Graph neural networks, embeddings, and foundation models in spatial data science

Jakub Nowosad, https://jakubnowosad.com/

2025-12-10

Disclaimer

Graph neural networks

Graph neural networks

Basic ideas:

  • Data represented as graphs (nodes + edges)
    • Nodes: spatial units (pixels, regions, locations) with features
    • Edges: relationships (spatial proximity, similarity, connectivity)
  • Message passing: nodes aggregate information from their neighbors to update their representations (analogous to the spatial lag in regression models; see the sketch below)
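
A minimal sketch (not from the talk) of one message-passing step in plain NumPy: averaging over graph neighbors is exactly a row-standardized spatial lag.

    import numpy as np

    # Toy graph: 4 spatial units; adjacency matrix A (1 = neighbors)
    A = np.array([
        [0, 1, 1, 0],
        [1, 0, 0, 1],
        [1, 0, 0, 1],
        [0, 1, 1, 0],
    ], dtype=float)

    # One feature per spatial unit (e.g., NDVI)
    x = np.array([0.2, 0.8, 0.5, 0.1])

    # One message-passing step = neighborhood average,
    # i.e., the row-standardized spatial lag W @ x
    W = A / A.sum(axis=1, keepdims=True)
    x_updated = W @ x
    print(x_updated)  # each node now carries its neighbors' mean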

Graph neural networks

Source: https://distill.pub/2021/gnn-intro/

Types of GNNs:

  • Graph Convolutional Networks (GCNs): generalize convolution to graphs (apply a filter over a node’s neighbors)
  • Graph Attention Networks (GATs): use attention mechanisms to weigh neighbor contributions differently
  • GraphSAGE: samples and aggregates features from a fixed-size set of neighbors
  • Graph Isomorphism Networks (GINs): use sum aggregation combined with MLPs to be as expressive as the Weisfeiler-Lehman graph isomorphism test

Code example
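
The slide's code is not reproduced here; below is a minimal sketch of a two-layer GCN for node classification, assuming PyTorch Geometric as the library (the graph, features, and labels are made up for illustration).

    import torch
    import torch.nn.functional as F
    from torch_geometric.data import Data
    from torch_geometric.nn import GCNConv

    # Toy spatial graph: 4 nodes, 3 features each, undirected edges
    edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                               [1, 0, 2, 1, 3, 2]], dtype=torch.long)
    x = torch.rand(4, 3)
    y = torch.tensor([0, 0, 1, 1])  # node labels (e.g., land cover class)
    data = Data(x=x, edge_index=edge_index, y=y)

    class GCN(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = GCNConv(3, 16)  # aggregate neighbor features
            self.conv2 = GCNConv(16, 2)  # map to class scores

        def forward(self, data):
            h = F.relu(self.conv1(data.x, data.edge_index))
            return self.conv2(h, data.edge_index)

    model = GCN()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    for _ in range(100):
        optimizer.zero_grad()
        loss = F.cross_entropy(model(data), data.y)
        loss.backward()
        optimizer.step()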

Embeddings

Embeddings

“A bunch of numbers representing an idea”

JN

Embeddings are created by training models to learn compact representations that capture the essential information in high-dimensional data. They are often a byproduct of foundation models.


Usage:

  • Semantic understanding: Capture complex relationships in data beyond traditional features
  • Similarity search: Identify locations with comparable environmental and surface characteristics
  • Change detection: Detect and quantify temporal variation by comparing embeddings across years
  • Unsupervised clustering: Group pixels into data-driven categories to reveal spatial structure
  • Classification: Generate thematic maps using a reduced amount of labeled training data

AlphaEarth Foundations

  • Integrates diverse geospatial data (optical, thermal, radar, 3D, elevation, climate, gravity, text)
  • Produces 64-dimensional, 10-m resolution embeddings for every year since 2017
  • Analysis-ready: no atmospheric correction, cloud masking, spectral transformations, speckle filtering, or other featurization techniques needed (see the access sketch below)
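
A minimal access sketch, assuming the Earth Engine Python API and the public dataset ID GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL with bands A00..A63 (treat the ID, band names, and the example coordinates as assumptions):

    import ee

    ee.Initialize()

    # Annual AlphaEarth embeddings (dataset ID assumed, see lead-in)
    embeddings = (
        ee.ImageCollection("GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL")
        .filterDate("2023-01-01", "2024-01-01")
        .mosaic()
    )

    # Sample the 64-dimensional embedding vector at one location
    point = ee.Geometry.Point([16.93, 52.41])  # example coordinates
    sample = embeddings.sample(point, scale=10).first().getInfo()
    print(sample["properties"])  # keys A00..A63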

Embeddings

Advantages:

  • Compression of high-dimensional data into manageable representations
  • They are not limited to a single type of data (e.g., optical only)

Challenges:

  • Interpretability of embeddings
  • Tools and workflows for working with embeddings are still in their infancy
  • Which embeddings to use for a given task?

Code example
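
The slide's code is not reproduced here; below is a minimal sketch of two of the uses listed above, similarity search and unsupervised clustering, run on a hypothetical array of 64-dimensional pixel embeddings (random numbers stand in for real embeddings).

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import cosine_similarity

    rng = np.random.default_rng(42)
    E = rng.normal(size=(10_000, 64))  # hypothetical pixel embeddings
    E /= np.linalg.norm(E, axis=1, keepdims=True)  # unit-length vectors

    # Similarity search: pixels most similar to a query pixel
    sims = cosine_similarity(E[0:1], E).ravel()
    top10 = np.argsort(sims)[::-1][:10]

    # Unsupervised clustering: data-driven landscape categories
    labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(E)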

Foundation models

(Geospatial) foundation models

Large, pre-trained models that learn general-purpose spatiotemporal and multimodal representations from massive amounts of unlabeled Earth observation data.

Characteristics:

  • General within the geospatial domain (support many related tasks)
  • Pre-trained on large-scale multimodal satellite data
  • Adaptable: fine-tuned or used as feature extractors with small labeled datasets
  • Zero-shot capability for some tasks

Based on Self-Supervised Learning:

  • Masked image modeling (e.g., reconstruct missing pixels; a toy sketch follows this list)
  • Multi-modal alignment (e.g., optical <-> SAR)
  • Temporal modeling (e.g., predict future states)
  • Contrastive learning across sensors, views, seasons (e.g., distinguish different locations/times)
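
A toy sketch of the masked-modeling idea (an assumption-level illustration, not any specific model): hide a random subset of input values and train a small network to reconstruct exactly the hidden entries.

    import torch
    import torch.nn as nn

    x = torch.rand(256, 64)          # batch of feature vectors
    mask = torch.rand_like(x) < 0.3  # hide 30% of the entries

    model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 64))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for _ in range(200):
        x_in = x.masked_fill(mask, 0.0)         # corrupt the input
        x_hat = model(x_in)
        loss = ((x_hat - x)[mask] ** 2).mean()  # score masked entries only
        opt.zero_grad()
        loss.backward()
        opt.step()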

(Geospatial) foundation models

Fine-tuning: small labeled datasets are used to specialize the model for tasks such as
land cover mapping, segmentation, change detection, object extraction, and more.

Embeddings: Foundation models output feature vectors reusable across tasks and regions.

Example models: TerraMind, AnySat, Prithvi, AlphaEarth Foundations, etc.
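
A minimal sketch of the feature-extractor workflow: frozen foundation-model embeddings plus a small classifier trained on few labels (the embedding and label arrays below are random placeholders, not real model output).

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    E = rng.normal(size=(500, 64))    # placeholder frozen embeddings
    y = rng.integers(0, 4, size=500)  # placeholder land cover labels

    # A small labeled set suffices; the heavy lifting sits in the embeddings
    X_tr, X_te, y_tr, y_te = train_test_split(E, y, train_size=100,
                                              random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(clf.score(X_te, y_te))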

Strengths:

  • Work with very little labeled data
  • Multimodal fusion (optical, SAR, DEM, time series)
  • Generalize across various tasks

Challenges:

  • Limited transferability to unseen geographies
  • Out-of-domain reliability is still poor
  • Traditional deep learning remains competitive when labeled data is abundant

TabPFN

Transformer-based model for tabular data (not only spatial!)

  • Pretrained once; no fine-tuning required
  • Offers foundation-model capabilities: data generation, density estimation, reusable embeddings

How It Works:

  • Pretrained on millions of synthetic datasets generated under a wide range of priors
  • Synthetic data comes from causal models and Bayesian neural networks, including noise, imbalance, missing values
  • Learns to approximate Bayesian inference through this large synthetic pretraining
  • At inference: receives the labeled training samples together with the unlabeled test samples and solves the prediction task in a single forward pass (in-context learning)

TabPFN

(Stated) advantages:

  • No task-specific training: a single pretrained model aims to generalize across diverse tabular domains
  • Robustness to data imperfections: pretraining on varied synthetic datasets exposes the model to noise, imbalance, and missing values
  • Computational efficiency: prediction tasks are solved through a single forward pass without additional optimization
  • Built-in uncertainty estimation: produces full predictive distributions rather than point estimates
  • Interpretability support: compatible with SHAP-based methods for analyzing feature contributions

Scope:

  • Best for datasets with <10k rows, <500 columns, <10 classes
  • Current version: TabPFN v2.5 (Nov 2025) — supports up to 50k rows & 2k features

Paper: https://doi.org/10.1038/s41586-024-08328-6

Code example
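
A minimal usage sketch with the tabpfn Python package: fit stores the labeled context, and prediction runs as a single forward pass of the pretrained transformer (the demo dataset is a stand-in for a spatial table).

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from tabpfn import TabPFNClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    clf = TabPFNClassifier()          # pretrained once; nothing to tune
    clf.fit(X_tr, y_tr)               # stores the labeled context
    proba = clf.predict_proba(X_te)   # full predictive distribution
    print(clf.predict(X_te)[:10])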