Elephant(s) in the room

Graph neural networks, embeddings, and foundation models in spatial data science

Jakub Nowosad, https://jakubnowosad.com/

2025-12-10

Disclaimer

Graph neural networks

Graph neural networks

Basic ideas:

  • Data represented as graphs (nodes + edges)
    • Nodes: spatial units (pixels, regions, locations) with features
    • Edges: relationships (spatial proximity, similarity, connectivity)
  • Message passing: nodes aggregate information from their neighbors to update their representations (analogous to the spatial lag in regression models; see the sketch below)
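
A minimal sketch (not from the talk) of one message-passing step in plain NumPy: averaging over graph neighbors is exactly a row-standardized spatial lag.

    import numpy as np

    # Toy graph: 4 spatial units; adjacency matrix A (1 = neighbors)
    A = np.array([
        [0, 1, 1, 0],
        [1, 0, 0, 1],
        [1, 0, 0, 1],
        [0, 1, 1, 0],
    ], dtype=float)

    # One feature per spatial unit (e.g., NDVI)
    x = np.array([0.2, 0.8, 0.5, 0.1])

    # One message-passing step = neighborhood average,
    # i.e., the row-standardized spatial lag W @ x
    W = A / A.sum(axis=1, keepdims=True)
    x_updated = W @ x
    print(x_updated)  # each node now carries its neighbors' mean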

Graph neural networks

Source: https://distill.pub/2021/gnn-intro/

Types of GNNs:

  • Graph Convolutional Networks (GCNs): generalize convolution to graphs (apply a filter over a node’s neighbors)
  • Graph Attention Networks (GATs): use attention mechanisms to weigh neighbor contributions differently
  • GraphSAGE: samples and aggregates features from a fixed-size set of neighbors
  • Graph Isomorphism Networks (GINs): use sum aggregation combined with MLPs to be as expressive as the Weisfeiler-Lehman graph isomorphism test

Code example
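
The slide's code is not reproduced here; below is a minimal sketch of a two-layer GCN for node classification, assuming PyTorch Geometric as the library (the graph, features, and labels are made up for illustration).

    import torch
    import torch.nn.functional as F
    from torch_geometric.data import Data
    from torch_geometric.nn import GCNConv

    # Toy spatial graph: 4 nodes, 3 features each, undirected edges
    edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                               [1, 0, 2, 1, 3, 2]], dtype=torch.long)
    x = torch.rand(4, 3)
    y = torch.tensor([0, 0, 1, 1])  # node labels (e.g., land cover class)
    data = Data(x=x, edge_index=edge_index, y=y)

    class GCN(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.conv1 = GCNConv(3, 16)  # aggregate neighbor features
            self.conv2 = GCNConv(16, 2)  # map to class scores

        def forward(self, data):
            h = F.relu(self.conv1(data.x, data.edge_index))
            return self.conv2(h, data.edge_index)

    model = GCN()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    for _ in range(100):
        optimizer.zero_grad()
        loss = F.cross_entropy(model(data), data.y)
        loss.backward()
        optimizer.step()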

Embeddings

Embeddings

“A bunch of numbers representing an idea”

JN

Embeddings are created by training models to learn compact representations that capture the essential information in high-dimensional data. They are often a byproduct of foundation models.


Usage:

  • Semantic understanding: Capture complex relationships in data beyond traditional features
  • Similarity search: Identify locations with comparable environmental and surface characteristics
  • Change detection: Detect and quantify temporal variation by comparing embeddings across years
  • Unsupervised clustering: Group pixels into data-driven categories to reveal spatial structure
  • Classification: Generate thematic maps using a reduced amount of labeled training data

AlphaEarth Foundations

  • Integrates diverse geospatial data (optical, thermal, radar, 3D, elevation, climate, gravity, text)
  • Produces 64-dimensional, 10-m resolution embeddings for every year since 2017
  • Analysis-ready: no atmospheric correction, cloud masking, spectral transformations, speckle filtering, or other featurization techniques needed (see the access sketch below)
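
A minimal access sketch, assuming the Earth Engine Python API and the public dataset ID GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL with bands A00..A63 (treat the ID, band names, and the example coordinates as assumptions):

    import ee

    ee.Initialize()

    # Annual AlphaEarth embeddings (dataset ID assumed, see lead-in)
    embeddings = (
        ee.ImageCollection("GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL")
        .filterDate("2023-01-01", "2024-01-01")
        .mosaic()
    )

    # Sample the 64-dimensional embedding vector at one location
    point = ee.Geometry.Point([16.93, 52.41])  # example coordinates
    sample = embeddings.sample(point, scale=10).first().getInfo()
    print(sample["properties"])  # keys A00..A63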

Embeddings

Advantages:

  • Compression of high-dimensional data into manageable representations
  • They are not limited to a single type of data (e.g., optical only)

Challenges:

  • Interpretability of embeddings
  • Tools and workflows for working with embeddings are still in their infancy
  • Which embeddings to use for a given task?

Code example
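
The slide's code is not reproduced here; below is a minimal sketch of two of the uses listed above, similarity search and unsupervised clustering, run on a hypothetical array of 64-dimensional pixel embeddings (random numbers stand in for real embeddings).

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import cosine_similarity

    rng = np.random.default_rng(42)
    E = rng.normal(size=(10_000, 64))  # hypothetical pixel embeddings
    E /= np.linalg.norm(E, axis=1, keepdims=True)  # unit-length vectors

    # Similarity search: pixels most similar to a query pixel
    sims = cosine_similarity(E[0:1], E).ravel()
    top10 = np.argsort(sims)[::-1][:10]

    # Unsupervised clustering: data-driven landscape categories
    labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(E)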

Foundation models

(Geospatial) foundation models

Large, pre-trained models that learn general-purpose spatiotemporal and multimodal representations from massive amounts of unlabeled Earth observation data.

Characteristics:

  • General within the geospatial domain (support many related tasks)
  • Pre-trained on large-scale multimodal satellite data
  • Adaptable: fine-tuned or used as feature extractors with small labeled datasets
  • Zero-shot capability for some tasks

Based on Self-Supervised Learning:

  • Masked image modeling (e.g., reconstruct missing pixels; a toy sketch follows this list)
  • Multi-modal alignment (e.g., optical <-> SAR)
  • Temporal modeling (e.g., predict future states)
  • Contrastive learning across sensors, views, seasons (e.g., distinguish different locations/times)
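
A toy sketch of the masked-modeling idea (an assumption-level illustration, not any specific model): hide a random subset of input values and train a small network to reconstruct exactly the hidden entries.

    import torch
    import torch.nn as nn

    x = torch.rand(256, 64)          # batch of feature vectors
    mask = torch.rand_like(x) < 0.3  # hide 30% of the entries

    model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 64))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for _ in range(200):
        x_in = x.masked_fill(mask, 0.0)         # corrupt the input
        x_hat = model(x_in)
        loss = ((x_hat - x)[mask] ** 2).mean()  # score masked entries only
        opt.zero_grad()
        loss.backward()
        opt.step()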

(Geospatial) foundation models

Fine-tuning: small labeled datasets are used to specialize the model for tasks such as
land cover mapping, segmentation, change detection, object extraction, and more.

Embeddings: Foundation models output feature vectors reusable across tasks and regions.

Example models: TerraMind, AnySat, Prithvi, AlphaEarth Foundations, etc.
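
A minimal sketch of the feature-extractor workflow: frozen foundation-model embeddings plus a small classifier trained on few labels (the embedding and label arrays below are random placeholders, not real model output).

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    E = rng.normal(size=(500, 64))    # placeholder frozen embeddings
    y = rng.integers(0, 4, size=500)  # placeholder land cover labels

    # A small labeled set suffices; the heavy lifting sits in the embeddings
    X_tr, X_te, y_tr, y_te = train_test_split(E, y, train_size=100,
                                              random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(clf.score(X_te, y_te))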

Strengths:

  • Work with very little labeled data
  • Multimodal fusion (optical, SAR, DEM, time series)
  • Generalize across various tasks

Challenges:

  • Limited transferability to unseen geographies
  • Out-of-domain reliability is still poor
  • Traditional deep learning remains competitive when labeled data is abundant

TabPFN

Transformer-based model for tabular data (not only spatial!)

  • Pretrained once; no fine-tuning required
  • Offers foundation-model capabilities: data generation, density estimation, reusable embeddings

How It Works:

  • Pretrained on millions of synthetic datasets generated under a wide range of priors
  • Synthetic data comes from causal models and Bayesian neural networks, including noise, imbalance, missing values
  • Learns to approximate Bayesian inference through this large synthetic pretraining
  • At inference: receives the labeled training samples together with the unlabeled test samples and solves the prediction task in a single forward pass (in-context learning)

TabPFN

(Stated) advantages:

  • No task-specific training: a single pretrained model aims to generalize across diverse tabular domains
  • Robustness to data imperfections: pretraining on varied synthetic datasets exposes the model to noise, imbalance, and missing values
  • Computational efficiency: prediction tasks are solved through a single forward pass without additional optimization
  • Built-in uncertainty estimation: produces full predictive distributions rather than point estimates
  • Interpretability support: compatible with SHAP-based methods for analyzing feature contributions

Scope:

  • Best for datasets with <10k rows, <500 columns, <10 classes
  • Current version: TabPFN v2.5 (Nov 2025) — supports up to 50k rows & 2k features

Paper: https://doi.org/10.1038/s41586-024-08328-6

Code example
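
A minimal usage sketch with the tabpfn Python package: fit stores the labeled context, and prediction runs as a single forward pass of the pretrained transformer (the demo dataset is a stand-in for a spatial table).

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from tabpfn import TabPFNClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    clf = TabPFNClassifier()          # pretrained once; nothing to tune
    clf.fit(X_tr, y_tr)               # stores the labeled context
    proba = clf.predict_proba(X_te)   # full predictive distribution
    print(clf.predict(X_te)[:10])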