The BioMassters Hosted By MathWorks


Problem description

Welcome to the forests of Finland! In this challenge, your goal is to create an algorithm that predicts yearly Aboveground Biomass (AGBM) for Finnish forests using satellite imagery. Our data partners at KU Leuven and University of Liège have provided the two primary datasets for this challenge:

  1. Feature data: Satellite imagery from the European Space Agency and European Commission's joint Sentinel-1 and Sentinel-2 satellite missions, designed to collect a rich array of Earth observation data
  2. Label data: Ground-truth AGBM measurements collected using LiDAR (Light Detection and Ranging) combined with in-situ measurements. LiDAR is able to generate high-quality biomass estimates, but is more time consuming and intensive to collect than satellite imagery.

Let's go over what you will need to biomass-ter this challenge!

Feature data

The feature data for this challenge is imagery collected by the Sentinel-1 and Sentinel-2 satellite missions for nearly 13,000 patches of forest in Finland. Each patch (also called a "chip") represents a different 2,560-by-2,560-meter area of forest. The data were collected over a period of 5 years between 2016 and 2021. Satellite imagery for the challenge is made available in public AWS S3 buckets. For download instructions, see biomassters_download_instructions.txt on the data download page.

Each label in this challenge represents a specific chip, or a distinct area of forest. LiDAR measurements are used to generate the biomass label for each pixel in the chip. For each chip, a full year's worth of monthly satellite images for that area are provided, from the previous September to the most recent August. For example, for a LiDAR-based ground truth chip from 2020, monthly satellite data is provided from September 2019 to August 2020.
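The September-to-August window described above can be sketched in a few lines of Python. This is only an illustration: the function name `feature_months` is hypothetical and not part of the challenge materials.

```python
import calendar

def feature_months(label_year):
    """Return the 12 (year, month_name) pairs that precede a label from `label_year`.

    The window runs from September of the previous year to August of
    `label_year`, as described in the text.
    """
    window = []
    for offset in range(12):
        month = (9 + offset - 1) % 12 + 1            # 9, 10, 11, 12, 1, ..., 8
        year = label_year - 1 if month >= 9 else label_year
        window.append((year, calendar.month_name[month]))
    return window

months_2020 = feature_months(2020)
print(months_2020[0], months_2020[-1])
```

For a 2020 label, the first entry is September 2019 and the last is August 2020, matching the example in the paragraph above.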

For each chip, there should be 24 images in total that can be used as input - 12 for each satellite source. However, note that there are some data outages during this time period, so not all chips will have coverage for every month.
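Because of these outages, it can help to check which images a chip actually has. The sketch below assumes the filename convention {chip_id}_{satellite}_{month_number}.tif from the metadata section and assumes month numbers run from 00 (September) to 11 (August). The helper names are hypothetical.

```python
def expected_filenames(chip_id):
    """All 24 filenames a fully covered chip would have (12 per satellite)."""
    return [
        f"{chip_id}_{satellite}_{month:02d}.tif"
        for satellite in ("S1", "S2")
        for month in range(12)
    ]

def missing_images(chip_id, available):
    """Expected filenames that are absent from `available` (any iterable of names)."""
    available = set(available)
    return [name for name in expected_filenames(chip_id) if name not in available]

# Made-up scenario: a chip whose S2 images for months 03 and 04 are unavailable.
have = [
    name for name in expected_filenames("0003d2eb")
    if name not in {"0003d2eb_S2_03.tif", "0003d2eb_S2_04.tif"}
]
print(missing_images("0003d2eb", have))
```

In practice the `available` names would come from features_metadata.csv or from a directory listing of the downloaded files.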

All of the satellite images have been geometrically and radiometrically corrected and resized to 10 meter resolution. Each resulting image is 256 by 256 pixels, and each pixel represents a 10-by-10-meter area (100 square meters). Images represent monthly aggregations and are provided as GeoTIFFs with any associated geolocation data removed.
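As a quick arithmetic check of the chip geometry, a pixel with 10 m edges covers 100 square meters, and 256 such pixels span the 2,560 m chip edge. The constant names below are illustrative only.

```python
PIXEL_SIZE_M = 10      # ground length of one pixel edge, in meters
CHIP_SIZE_PX = 256     # number of pixels along each chip edge

chip_edge_m = PIXEL_SIZE_M * CHIP_SIZE_PX         # meters along one chip edge
pixel_area_m2 = PIXEL_SIZE_M ** 2                 # square meters per pixel
pixels_per_chip = CHIP_SIZE_PX ** 2               # label values per chip
chip_area_km2 = chip_edge_m ** 2 / 1_000_000      # square kilometers per chip

print(chip_edge_m, pixel_area_m2, pixels_per_chip, chip_area_km2)
```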

You only need to generate one biomass prediction per chip, but can use as many of the chip's multi-temporal (different months) or multi-modal (Sentinel-1 or Sentinel-2) satellite images as you like. Predictions should include a yearly peak aboveground biomass (AGBM) value for each 10-by-10-meter pixel in the chip.

Information about each satellite image, including its corresponding patch, satellite, and the month in which it was captured, is recorded in features_metadata.csv. The metadata is available on the data download page.

The fields in features_metadata.csv are:

  • chip_id: A unique identifier for a single patch, or area of forest
  • filename: The filename of the corresponding image, which follows the naming convention {chip_id}_{satellite}_{month_number}.tif. (month_number corresponds to the number of months since September of the year previous to when the ground truth was captured, so 00 would represent September, 01 October, and so on, until 11, which represents August of the same year as the ground truth)
  • satellite: The satellite the image was captured by (S1 for Sentinel-1 or S2 for Sentinel-2)
  • split: Whether the image is a part of the training data or test data
  • month: The name of the month in which the image was collected
  • size: The file size in bytes
  • cksum: A checksum value to make sure the data was transmitted correctly. For more details on how to use the cksum, see the biomassters_download_instructions.txt file on the data download page.
  • s3path_us: The file location of the image in the public s3 bucket in the US East (N. Virginia) region
  • s3path_eu: The file location of the image in the public s3 bucket in the Europe (Frankfurt) region
  • s3path_as: The file location of the image in the public s3 bucket in the Asia Pacific (Singapore) region
  • corresponding_agbm: The filename of the tif that contains the AGBM values for the chip_id

Example of first two rows from features_metadata.csv (transposed):

0 1
filename 0003d2eb_S1_00.tif 0003d2eb_S1_01.tif
chip_id 0003d2eb 0003d2eb
satellite S1 S1
split train train
month September October
size 1049524 1049524
cksum 3953454613 3531005382
s3path_us s3://.../train_features/0003d2eb_S1_00.tif s3://.../train_features/0003d2eb_S1_01.tif
s3path_eu s3://.../train_features/0003d2eb_S1_00.tif s3://.../train_features/0003d2eb_S1_01.tif
s3path_as s3://.../train_features/0003d2eb_S1_00.tif s3://.../train_features/0003d2eb_S1_01.tif
corresponding_agbm 0003d2eb_agbm.tif 0003d2eb_agbm.tif
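
The naming convention and the metadata fields can be cross-checked with a short script. The sketch below is an assumption-laden illustration: it reads two sample rows from an in-memory string rather than the real features_metadata.csv, it omits the s3path_* columns, and its column order is a guess. `parse_filename` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Month names indexed by month_number, assuming 00 = September ... 11 = August.
MONTHS = ["September", "October", "November", "December", "January", "February",
          "March", "April", "May", "June", "July", "August"]

def parse_filename(filename):
    """Split '{chip_id}_{satellite}_{month_number}.tif' into its parts."""
    stem = filename.rsplit(".", 1)[0]
    chip_id, satellite, month_number = stem.split("_")
    return {
        "chip_id": chip_id,
        "satellite": satellite,
        "month_number": int(month_number),
        "month": MONTHS[int(month_number)],
    }

# The two example rows shown above, without the s3path_* columns.
SAMPLE_CSV = """filename,chip_id,satellite,split,month,size,cksum,corresponding_agbm
0003d2eb_S1_00.tif,0003d2eb,S1,train,September,1049524,3953454613,0003d2eb_agbm.tif
0003d2eb_S1_01.tif,0003d2eb,S1,train,October,1049524,3531005382,0003d2eb_agbm.tif
"""

images_by_chip = defaultdict(list)
for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    parsed = parse_filename(row["filename"])
    assert parsed["month"] == row["month"]    # filename and metadata should agree
    images_by_chip[row["chip_id"]].append(row["filename"])

print(dict(images_by_chip))
```

Grouping filenames by chip_id in this way gives one list of input images per label, which is the unit the challenge asks you to predict for.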

How satellite images are collected

Sentinel-1 and Sentinel-2 are remote sensing satellite missions that use different instruments for capturing data:

  • Sentinel-1 uses Synthetic Aperture Radar (SAR)
  • Sentinel-2 uses Multi-Spectral Imaging (MSI)

These result in a few differences in how the data for each have been prepared and how they should be used.

Sentinel-1. Image credit: ESA Sentinel Online

Sentinel-2. Image credit: ESA Sentinel Online

Sentinel-1 (S1)

The ESA Copernicus Sentinel-1 mission comprises a constellation of polar-orbiting satellites that perform C-band Synthetic Aperture Radar (SAR) imaging day and night, which lets them acquire imagery regardless of the weather. Because the satellites are in polar orbits, they spend half of each orbit traveling from the north pole towards the south pole. This direction is referred to as a descending orbit. Conversely, when the satellite is traveling from the south pole towards the north pole, it is said to be in an ascending orbit.

Sentinel-1 diagram

Image Credit: AltaMira

The provided S1 data include two bands from each of Sentinel-1's two satellites—"VV" and "VH"—for a total of four bands. These bands are captured from the sensor transmitting vertically polarized signal (represented by the first "V") and receiving either vertically (V) or horizontally (H) polarized signal.

The values in these bands represent the energy that was reflected back to the satellite measured in decibels (dB). Pixel values can range from negative to positive values. A pixel value of -9999 indicates missing data. An advantage of Sentinel-1's use of SAR is that it can acquire data across day or night, under all weather conditions. Clouds or darkness do not impede the S1's ability to collect images.
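
When working with these values, the -9999 marker has to be handled before any arithmetic. The sketch below masks missing pixels and converts decibels to linear power using the standard relation linear = 10^(dB/10). Whether to work in dB or in linear units is a modeling choice the challenge does not prescribe, and the sample values are made up.

```python
S1_NODATA = -9999  # missing-data marker stated in the text

def db_to_linear(value_db):
    """Convert a backscatter value in dB to linear power; return None if missing."""
    if value_db == S1_NODATA:
        return None
    return 10 ** (value_db / 10)

row_db = [-10.0, -9999, 0.0, 3.0]                 # illustrative values, not real data
row_linear = [db_to_linear(v) for v in row_db]
print(row_linear)
```

Treating -9999 as an ordinary number would badly skew statistics such as the band mean, which is why it is replaced with None here.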

Finally, Sentinel-1 has a 6-day revisit orbit, which means that it returns to the same area about five times per month. We have provided a single composite image from S1 for each calendar month, which is generated by taking the mean across all images acquired by S1 for the patch during that time. For more details on how to interpret SAR data, participants might find it helpful to consult NASA's guide to SAR.
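
To make the idea of a monthly mean composite concrete, here is a minimal sketch of a per-pixel mean over several acquisitions. It is not the organizers' pipeline: skipping -9999 pixels and averaging directly in dB are assumptions made for illustration, and the grids are tiny made-up examples.

```python
S1_NODATA = -9999

def monthly_mean_composite(acquisitions):
    """Per-pixel mean over a list of equally sized 2D grids (lists of rows).

    Pixels marked S1_NODATA are skipped; a pixel that is missing in every
    acquisition stays S1_NODATA in the composite.
    """
    height, width = len(acquisitions[0]), len(acquisitions[0][0])
    composite = []
    for r in range(height):
        out_row = []
        for c in range(width):
            valid = [grid[r][c] for grid in acquisitions if grid[r][c] != S1_NODATA]
            out_row.append(sum(valid) / len(valid) if valid else S1_NODATA)
        composite.append(out_row)
    return composite

# Three tiny 1x3 "acquisitions" with made-up dB values.
passes = [
    [[-12.0, -9999, -9999]],
    [[-10.0, -8.0, -9999]],
    [[-11.0, -6.0, -9999]],
]
composite = monthly_mean_composite(passes)
print(composite)
```

The first pixel averages three valid values, the second averages two, and the third has no valid observation and so remains marked as missing.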

Example of a Sentinel-1 image:

001b0634_S1_00.tif is an image from S1 provided as a part of the training dataset. The filename follows the format {chip_id}_{satellite}_{month_number}.tif, so we know that the chip_id is 001b0634 and that the image was captured by S1 in September (the month number corresponds to the number of months since September).

A diagram of how SAR uses polarization

Sentinel-1 image, by band

The corresponding row in features_metadata.csv is:
filename 001b0634_S1_00.tif
chip_id 001b0634
satellite S1
split train
month September
size 1049524
cksum 3250666344
s3path_us s3://.../train_features/001b0634_S1_00.tif
s3path_eu s3://.../train_features/001b0634_S1_00.tif
s3path_as s3://.../train_features/001b0634_S1_00.tif
corresponding_agbm 001b0634_agbm.tif

Sentinel-2 (S2)

Sentinel-2 (S2) is a high-resolution imaging mission that monitors vegetation, soil, water cover, inland waterways, and coastal areas. S2 satellites have a Multispectral Instrument (MSI) on board that collects data in the visible, near-infrared, and short-wave infrared portions of the electromagnetic spectrum. While Sentinel-2 measures spectral bands that range from 400 to 2400 nanometers, Sentinel-1's SAR instrument uses longer wavelengths that range from centimeters to meters. We have selected the best image for each month from the S2 data, as opposed to taking the mean of all images collected over the month, as is done with S1.

The following 11 bands are provided for each S2 image: B2, B3, B4, B5, B6, B7, B8, B8A, B11, B12, and CLP (a cloud probability layer). See the Sentinel-2 guide for a description of each band.

The CLP band — cloud probability — is provided because S2 cannot penetrate clouds. The cloud probability layer has values from 0-100, indicating the percentage probability of cloud cover for that pixel. In some images, this layer may have a value of 255, which indicates that the layer has been obscured due to excessive noise.
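
One way to use the CLP layer is to measure how much of an image is usable. The sketch below counts pixels with a valid cloud probability below a cutoff. The 50 percent threshold is a hypothetical choice, and the assumption that bands are stored in the order listed above is not confirmed by the challenge text.

```python
# Band names in the order listed above; the on-disk band order is an assumption.
S2_BANDS = ["B2", "B3", "B4", "B5", "B6", "B7", "B8", "B8A", "B11", "B12", "CLP"]
CLP_INVALID = 255          # marker for an unusable cloud-probability value
CLOUD_THRESHOLD = 50       # hypothetical cutoff in percent, not from the challenge

band_index = {name: i for i, name in enumerate(S2_BANDS)}

def clear_fraction(clp_values, threshold=CLOUD_THRESHOLD):
    """Fraction of pixels whose cloud probability is valid and below `threshold`."""
    clear = sum(1 for v in clp_values if v != CLP_INVALID and v < threshold)
    return clear / len(clp_values)

clp_row = [0, 10, 80, 100, 255, 49, 50, 5]        # illustrative values
print(band_index["CLP"], clear_fraction(clp_row))
```

A score like this could be used to down-weight or skip heavily clouded months when combining a chip's S2 images.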

For information on the satellite data and its sources, check out the About Page.

Example of a Sentinel-2 image:

001b0634_S2_00.tif is an image from S2 that is part of the training dataset. Its name indicates it is for chip 001b0634 and collected by S2 during September.

Examples of imagery from the Sentinel-2 satellite for the three visible bands

Sentinel-2 image, by band

The corresponding row in features_metadata.csv is:
filename 001b0634_S2_00.tif
chip_id 001b0634
satellite S2
split train
month September
size 1443550
cksum 3549883502
s3path_us s3://.../train_features/001b0634_S2_00.tif
s3path_eu s3://.../train_features/001b0634_S2_00.tif
s3path_as s3://.../train_features/001b0634_S2_00.tif
corresponding_agbm 001b0634_agbm.tif

Ground truth data

The ground truth values for this competition are yearly Aboveground Biomass (AGBM) measured in tonnes. Labels for each patch are derived from LiDAR (Light Detection and Ranging), a remote sensing technology that provides 3D information about the terrain and vegetation. The label for each patch is the peak biomass value measured during the summer.

Similarly to the feature satellite imagery, LiDAR data is provided as images that cover 2,560 meter by 2,560 meter areas at 10 meter resolution, which means they are 256 by 256 pixels in size. For the same chip ID, each pixel in the satellite data corresponds to a pixel in the same position in the LiDAR data. Note that 0 values in this dataset can represent areas with zero biomass or areas where there is missing data. The file train_agbm_metadata.csv provides the following information about AGBM images:

  • chip_id: The patch the image corresponds to
  • filename: The filename the image can be found under. The filename follows the convention {chip_id}_agbm.tif
  • size: The file size in bytes
  • cksum: A checksum value to make sure the data was transmitted correctly. For more details on how to use the cksum, see the biomassters_download_instructions.txt file on the data download page.
  • s3path_us: The file location of the image in the public s3 bucket in the US East (N. Virginia) region
  • s3path_eu: The file location of the image in the public s3 bucket in the Europe (Frankfurt) region
  • s3path_as: The file location of the image in the public s3 bucket in the Asia Pacific (Singapore) region

Example of an AGBM image:

001b0634_agbm.tif is an image from LiDAR data for the training dataset. From the name, we can tell it is for chip 001b0634:

Example AGBM image

It is represented in train_agbm_metadata.csv like this:
filename 001b0634_agbm.tif
chip_id 001b0634
size 262482
cksum 2397957274
s3path_us s3://.../train_agbm/001b0634_agbm.tif
s3path_eu s3://.../train_agbm/001b0634_agbm.tif
s3path_as s3://.../train_agbm/001b0634_agbm.tif

Note on competition rules: External data is not allowed in this competition. Participants can use pre-trained computer vision models as long as they were available freely and openly in that form at the start of the competition. Each test observation (chip_id) must be processed independently. Other pixels in the same chip_id can be used for generating predictions, but not those in other chip_ids.

Performance metric

To measure your model’s performance, we’ll use a metric called Average Root Mean Square Error (RMSE). RMSE is the square root of the mean of squared differences between estimated and observed values. RMSE will be calculated on a per-pixel basis (i.e., each pixel in your submitted tif for a patch will be compared to the corresponding pixel in the ground-truth tif for the patch). RMSE will be calculated for each image, and then averaged over all images in the test set. This is an error metric, so a lower value is better. Note: There are some outliers in this dataset, and they are included in the scoring. Pixels with a value of zero are also included in the scoring (including both true 0 biomass values and where there is missing data).

Average RMSE is defined as:

$$ \text{Average RMSE} = \frac{1}{M} \sum_{j=1}^{M} \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_{ij} - \hat{y}_{ij})^2 } $$


  • |$\hat{y}_{ij}$| is the predicted value for pixel |$i$| of image |$j$|
  • |$y_{ij}$| is the true value for pixel |$i$| of image |$j$|
  • |$N$| is the number of pixels in each image
  • |$M$| is the number of images
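
A minimal Python sketch of this metric is shown below, for local validation only. It treats each image as a flat list of pixel values, uses made-up numbers, and is not the official scoring code.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error over two equally long flat sequences of pixels."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def average_rmse(true_images, pred_images):
    """Mean of per-image RMSE values, one RMSE per (truth, prediction) pair."""
    scores = [rmse(t, p) for t, p in zip(true_images, pred_images)]
    return sum(scores) / len(scores)

# Two tiny "images" of four pixels each (made-up values).
truth = [[0.0, 0.0, 0.0, 0.0], [10.0, 20.0, 30.0, 40.0]]
preds = [[3.0, 3.0, 3.0, 3.0], [10.0, 20.0, 30.0, 44.0]]
score = average_rmse(truth, preds)
print(score)
```

Here the per-image RMSE values are 3.0 and 2.0, so the average is 2.5. Pooling all eight pixels into a single RMSE would give a different number (about 2.55), which is why the per-image averaging step matters.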

Submission format

You must submit your predictions in the form of single-band 256 x 256 tif images, zipped into a single file. The lone band should contain your predictions for the yearly AGBM for that chip.

The format for the submission is a .tar, .tgz, or .zip file containing predicted AGBM values for each chip ID in the test set. Each tif must be named according to the convention {chip_id}_agbm.tif. The order of the files does not matter. In total, there should be 2,773 tifs in your submission.

For example, the first few files in your uncompressed submission might look like:


Your submission cannot exceed 1GB and the maximum supported precision for the tifs is float32.
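
Before uploading, it may be worth sanity-checking the archive. The sketch below inspects a .zip held in memory and reports naming or coverage problems. It makes several assumptions: "1GB" is read as 10^9 bytes, files are expected at the archive root, and the chip IDs are placeholders. It does not check the tif contents (single band, 256 x 256, float32), which would need a GeoTIFF library.

```python
import io
import re
import zipfile

MAX_BYTES = 1_000_000_000                            # "1GB" read as 10**9 bytes (assumption)
NAME_PATTERN = re.compile(r"^[^/_]+_agbm\.tif$")     # {chip_id}_agbm.tif at the archive root

def check_submission(archive_bytes, expected_chip_ids):
    """Return a list of human-readable problems found in a .zip submission."""
    problems = []
    if len(archive_bytes) > MAX_BYTES:
        problems.append("archive is larger than 1GB")
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        names = archive.namelist()
    bad = [n for n in names if not NAME_PATTERN.match(n)]
    if bad:
        problems.append(f"unexpected file names: {bad}")
    found = {n[: -len("_agbm.tif")] for n in names if NAME_PATTERN.match(n)}
    missing = sorted(set(expected_chip_ids) - found)
    if missing:
        problems.append(f"missing chips: {missing}")
    return problems

def build_demo_zip(chip_ids):
    """Build a tiny in-memory zip with empty placeholder files (not real tifs)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for chip_id in chip_ids:
            archive.writestr(f"{chip_id}_agbm.tif", b"")
    return buffer.getvalue()

demo_ids = ["chipA", "chipB"]                        # placeholders; the real test set has 2,773 chips
print(check_submission(build_demo_zip(demo_ids), demo_ids))
print(check_submission(build_demo_zip(["chipA"]), demo_ids))
```

With the real test set, `expected_chip_ids` would be the test chip IDs taken from features_metadata.csv (rows whose split is test).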

Good luck!

If you're wondering how to get started, check out the MATLAB benchmark blog post!

Good luck and enjoy this problem! If you have any questions you can always visit the user forum.