Publication details
- Event: (Rennes)
Foundation models (FMs) are transforming the field of Artificial Intelligence (AI) by learning inherent information from vast amounts of unlabeled data, enabling adaptation to numerous applications. Their integration into the Earth Observation (EO) ecosystem promises to revolutionize the information value chain, impacting industry, research, and science. However, EO applications present unique challenges, including the diverse needs for detail or rapid processing, and the variety of data value and sensor characteristics. Models must handle data from multiple sensors at varying ground sampling distances (GSD). Vision Transformers (ViT), often trained using self-supervised learning (SSL), form the backbone of many modern FMs by learning from complex data patterns without explicit supervision.
We introduce FM4CS, a versatile foundation model specifically designed for climate and society EO applications. FM4CS aims to address the aforementioned challenges by supporting four different Sentinel sensors: Sentinel-1 SAR, Sentinel-2 MSI, Sentinel-3 OLCI, and Sentinel-3 SLSTR. Inspired by approaches like USat for multi-sensor data handling at native resolutions and FlexiViT for operating across a wide range of patch sizes without retraining, FM4CS employs a single ViT architecture. The model uses an individual patch embedding layer for each sensor channel, which makes it possible to process arbitrary subsets of spectral bands. It adapts the number of patches per band to the band's GSD and uses spectral group pooling to limit the token sequence length. To accommodate flexible patch sizes, FM4CS randomizes the patch size during training, so that the patch size can be chosen freely at inference time. To provide positional information for patches across different sensors, resolutions, and patch sizes, FM4CS adapts a 2D variant of the ALiBi (Attention with Linear Biases) relative positional encoding scheme.
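The interplay between GSD-dependent patch counts and a distance-based attention bias can be illustrated with a minimal NumPy sketch. It is not the FM4CS implementation: the function names, the use of Euclidean distance between patch centres, the `scale_m` normalisation, and the geometric head slopes (borrowed from the original 1D ALiBi formulation) are all assumptions made for illustration.

```python
import numpy as np

def patch_centers(image_size_px, patch_size_px, gsd_m):
    """Return (N, 2) patch-centre coordinates in metres for a square band image.

    Hypothetical helper: a coarser GSD gives fewer pixels over the same
    footprint, hence fewer patches for the same patch size in pixels.
    """
    n = image_size_px // patch_size_px
    # Centre of patch i along one axis, converted from pixels to metres.
    axis = (np.arange(n) + 0.5) * patch_size_px * gsd_m
    yy, xx = np.meshgrid(axis, axis, indexing="ij")
    return np.stack([yy.ravel(), xx.ravel()], axis=1)

def alibi_2d_bias(centers_q, centers_k, num_heads, scale_m=1000.0):
    """Return a (heads, Nq, Nk) additive attention bias.

    Assumption: the bias is minus a per-head slope times the Euclidean
    ground distance between patch centres, so it is defined for any mix
    of resolutions and patch sizes.
    """
    diff = centers_q[:, None, :] - centers_k[None, :, :]
    dist = np.linalg.norm(diff, axis=-1) / scale_m
    # Geometric slopes as in the original 1D ALiBi: 2^(-8h/H), h = 1..H.
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    return -slopes[:, None, None] * dist[None, :, :]

# Same 1.2 km footprint seen by a 10 m band and a 20 m band.
c10 = patch_centers(image_size_px=120, patch_size_px=12, gsd_m=10.0)
c20 = patch_centers(image_size_px=60, patch_size_px=12, gsd_m=20.0)
bias = alibi_2d_bias(c10, c20, num_heads=4)
print(c10.shape[0], c20.shape[0], bias.shape)
```

Because the bias depends only on ground distance, tokens from bands with different GSDs or patch sizes can attend to each other without a learned positional table per configuration. That is the property a relative scheme is meant to provide here.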
The pre-training dataset for FM4CS is curated to ensure diversity. Instead of stacking small image crops, data is sampled using the Sentinel-2 tiling grid, co-locating Sentinel-1, Sentinel-2, and Sentinel-3 imagery for given locations and time intervals. A stratified sampling approach, based on k-means clustering of ESA WorldCover maps and Sentinel-2 RGB composites, is used to capture the diversity of global land cover and address imbalances. Oceanic data sampling incorporates shipping traffic density, oil and gas installations, and areas with higher probabilities of sea ice and icebergs. The dataset also includes ERA5-Land variables to facilitate multi-modal pretext tasks, leveraging daily statistics for variables such as soil moisture, temperature, and snow cover.
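The idea behind cluster-based stratified sampling can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the actual curation pipeline: the per-tile feature vectors stand in for land cover statistics, and the functions `kmeans` and `stratified_sample`, the number of clusters, and the equal per-cluster quota are all hypothetical.

```python
import numpy as np

def kmeans(features, k, num_iters=20, seed=0):
    """Minimal Lloyd's k-means; returns one cluster label per row."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(num_iters):
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return labels

def stratified_sample(labels, per_cluster, seed=0):
    """Draw up to `per_cluster` indices from every cluster, without replacement."""
    rng = np.random.default_rng(seed)
    chosen = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        take = min(per_cluster, len(idx))
        chosen.append(rng.choice(idx, size=take, replace=False))
    return np.concatenate(chosen)

# Synthetic, imbalanced "tiles": 900 of one type and 100 of another.
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(0.0, 0.1, (900, 3)), rng.normal(5.0, 0.1, (100, 3))])
labels = kmeans(feats, k=2)
selected = stratified_sample(labels, per_cluster=50)
print(np.bincount(labels[selected], minlength=2))
```

Capping the draw per cluster keeps a dominant cluster from crowding out rarer ones, which is the imbalance problem that stratification addresses.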
FM4CS is trained using several SSL tasks. These include pixel-level input band reconstruction, similar to masked image modeling, where a lightweight ViT decoder reconstructs masked tokens. Additionally, the model predicts existing maps such as ESA WorldCover from Sentinel-1/2 data, and other land cover maps (ESA GlobCover, MOD12Q1) from the Sentinel-3 sensors using a cross-entropy loss. Image-level tasks involve the prediction of ERA5 variables, latitude, longitude and data acquisition month.
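As a rough sketch of how such pretext tasks might be combined, the snippet below computes a reconstruction loss restricted to masked tokens and a cross-entropy loss for map prediction, then forms a weighted sum. The function names, the use of mean squared error, and the task weights are assumptions for illustration; the text does not specify how FM4CS weights or aggregates its losses.

```python
import numpy as np

def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error over masked tokens only.

    pred, target: (N, D) token values; mask: (N,) bool, True = token was masked.
    """
    per_token = ((pred - target) ** 2).mean(axis=-1)
    return float((per_token * mask).sum() / max(mask.sum(), 1))

def cross_entropy_loss(logits, labels):
    """Mean cross-entropy for (N, C) logits and (N,) integer class labels."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def total_loss(losses, weights):
    """Weighted sum of per-task losses (the weights are hypothetical)."""
    return sum(weights[name] * value for name, value in losses.items())

# Toy example: 4 tokens, 2 of them masked; 2 pixels with 4 land cover classes.
pred = np.zeros((4, 3))
target = np.ones((4, 3))
mask = np.array([True, False, True, False])
logits = np.zeros((2, 4))
classes = np.array([0, 3])

losses = {
    "reconstruction": masked_reconstruction_loss(pred, target, mask),
    "land_cover": cross_entropy_loss(logits, classes),
}
print(total_loss(losses, {"reconstruction": 1.0, "land_cover": 0.5}))
```

Image-level targets such as ERA5 variables, location, or acquisition month would enter the same dictionary as additional regression or classification terms.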