On Cloud N: Cloud Cover Detection Challenge Hosted By Microsoft AI for Earth


Problem description

In this challenge, your goal is to label clouds in satellite imagery. In many uses of satellite imagery, clouds obscure what we really care about - for example, tracking wildfires, mapping deforestation, or visualizing crop health. Being able to more accurately remove clouds from satellite images filters out interference, unlocking the potential of a vast range of use cases.

The challenge uses publicly available satellite data from the Sentinel-2 mission, which captures wide-swath, high-resolution, multi-spectral imagery. The data is publicly shared through Microsoft's Planetary Computer.

Overview of the data provided for this competition:

├── train_features
│   └── ...
├── train_labels
│   └── ...
└── train_metadata.csv

Feature data

The dataset consists of Sentinel-2 satellite imagery stored as GeoTiffs. There are almost 12,000 chips in the training data, collected between 2018 and 2020. Each chip is imagery of a specific area captured at a specific point in time.

Sentinel-2 flies over the part of the Earth between 56° South (Cape Horn, South America) and 82.8° North (above Greenland), so our observations are all between these two latitudes. The chips are mostly from Africa, South America, and Australia. For more background about how data from Sentinel-2 is collected, see the About page.


The main features in this challenge are the satellite images themselves. There are four images associated with each chip. Each image within a chip captures light from a different range of wavelengths, or "band". For example, the B02 band for each chip shows the strength of visible blue light, which has a wavelength around 492 nanometers (nm). The bands provided are:

Band   Description           Center wavelength
B02    Blue visible light    497 nm
B03    Green visible light   560 nm
B04    Red visible light     665 nm
B08    Near infrared light   835 nm

Each band image is provided as a 512 x 512 GeoTIFF. The resolution, or real-world distance between pixels, is 10m. All four bands for a given chip cover the exact same area.
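As a quick sanity check on these numbers, the ground footprint of a chip follows directly from the pixel count and the resolution. The sketch below is plain arithmetic; the constant names are illustrative and are not part of any competition code.

```python
# Ground footprint of one chip, from the figures quoted above.
PIXELS_PER_SIDE = 512   # each band image is 512 x 512 pixels
RESOLUTION_M = 10       # real-world distance between pixels, in meters

side_m = PIXELS_PER_SIDE * RESOLUTION_M   # 5120 m per side
area_km2 = (side_m / 1000) ** 2           # 5.12 km x 5.12 km

print(f"Each chip covers {side_m} m x {side_m} m, about {area_km2:.1f} km^2")
```

So each chip spans roughly 5 km on a side, which is small relative to the Sentinel-2 swath mentioned earlier.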

The data download page includes instructions for how to download the training images. Within the folder train_features, there is a folder for every chip.

├── adwp  # chip with id adwp
│   ├── B02.tif
│   ├── B03.tif
│   ├── B04.tif
│   └── B08.tif
├── adwu
│   ├── B02.tif
│   ├── B03.tif
│   ├── B04.tif
│   └── B08.tif
└── ...

For example, the red visible light band of a chip with ID abcd in the training data would be saved as train_features/abcd/B04.tif.
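The folder layout above maps directly to a small path helper. This is a minimal sketch: the function name `band_path` and the `root` parameter are hypothetical conveniences, and it assumes the data has been downloaded into a local `train_features` folder.

```python
from pathlib import Path

BANDS = ("B02", "B03", "B04", "B08")  # the four bands provided per chip


def band_path(chip_id: str, band: str, root: str = "train_features") -> Path:
    """Return the expected location of one band image for a chip."""
    if band not in BANDS:
        raise ValueError(f"unknown band {band!r}; expected one of {BANDS}")
    return Path(root) / chip_id / f"{band}.tif"


print(band_path("abcd", "B04").as_posix())  # train_features/abcd/B04.tif
```

Validating the band name up front turns a typo such as "B4" into an immediate error instead of a missing-file error later.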

Each GeoTIFF contains a set of metadata including bounding coordinates, an affine transform, and its coordinate reference system (CRS) projection. In Python, you can access geospatial raster data and extract this metadata using the rasterio package.

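import rasterio

# Open one band image and read its geospatial metadata
with rasterio.open("train_features/cjge/B04.tif") as f:
    meta = f.meta      # driver, dtype, size, CRS, and affine transform
    bounds = f.bounds  # bounding coordinates in the CRS units
print("Meta:", meta, "\nBounds:", bounds)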
{'driver': 'GTiff',
  'dtype': 'uint16',
  'nodata': 0.0,
  'width': 512,
  'height': 512,
  'count': 1,
  'crs': CRS.from_epsg(32630),
  'transform': Affine(10.0, 0.0, 579060.0,
         0.0, -10.0, 3568130.0)}

BoundingBox(left=579060.0, bottom=3563010.0, right=584180.0, top=3568130.0)
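The bounding box printed above can be recovered from the affine transform and the raster size, which is a useful way to see how the two relate. The sketch below uses plain Python rather than rasterio; the function name is hypothetical, and it assumes a north-up raster with no rotation, as in this example.

```python
def bounds_from_transform(a, b, c, d, e, f, width, height):
    """Compute (left, bottom, right, top) for a north-up raster.

    The six coefficients follow the order shown in the metadata above:
    x = a * col + b * row + c and y = d * col + e * row + f.
    Assumes no rotation (b == d == 0) and a negative row step (e < 0).
    """
    if b != 0 or d != 0:
        raise ValueError("this sketch only handles unrotated rasters")
    left, top = c, f             # coordinates of the upper-left corner
    right = c + a * width        # step `width` columns to the east
    bottom = f + e * height      # step `height` rows to the south
    return left, bottom, right, top


# Values copied from the cjge B04.tif metadata printed above.
print(bounds_from_transform(10.0, 0.0, 579060.0, 0.0, -10.0, 3568130.0, 512, 512))
```

With a 10 m pixel size and 512 pixels per side, the right and bottom edges sit 5,120 m from the upper-left corner, which matches the BoundingBox output.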

For an example of how to manipulate GeoTIFF metadata, see the benchmark blog post.


Metadata is also included as a CSV, train_metadata.csv. It contains the following columns:

  • chip_id (string): A unique identifier for each chip. There is one row per chip in each of the metadata files.
  • location (string): General location of the chip, either country (e.g. Eswatini), metropolitan region (e.g. Lusaka), or country sub-region (e.g. Australia - Central)
  • datetime (datetime64[ns, UTC]): Date and time that the images in the chip were captured. These will be loaded as strings with the format %Y-%m-%dT%H:%M:%SZ (using standard string format codes). Z indicates that timestamps are in coordinated universal time (UTC)
  • cloudpath (string): The path to download the folder of chip images from the Azure Blob Storage container. For a step-by-step guide, see data_download_instructions.txt on the data download page.
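Since the datetime column loads as strings, a small parsing step is usually needed. The following is a minimal sketch using only the standard library; the helper name is hypothetical, and the sample value is the timestamp from the first row of train_metadata.csv.

```python
from datetime import datetime, timezone

DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_chip_datetime(value: str) -> datetime:
    """Parse a metadata timestamp string into a timezone-aware UTC datetime."""
    # strptime matches the trailing "Z" as a literal character, so the
    # result is naive; attach UTC explicitly.
    return datetime.strptime(value, DATETIME_FORMAT).replace(tzinfo=timezone.utc)


captured = parse_chip_datetime("2020-04-29T08:20:47Z")
print(captured.isoformat())  # 2020-04-29T08:20:47+00:00
```

Keeping the result timezone-aware avoids accidental comparisons against local times when searching for additional imagery by timestamp.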

The first few rows of train_metadata.csv are:

     chip_id  location  datetime              cloudpath
0    adwp     Chifunfu  2020-04-29T08:20:47Z  az://./train_features/adwp
1    adwu     Chifunfu  2020-04-29T08:20:47Z  az://./train_features/adwu
2    adwz     Chifunfu  2020-04-29T08:20:47Z  az://./train_features/adwz
3    adxp     Chifunfu  2020-04-29T08:20:47Z  az://./train_features/adxp
4    aeaj     Chifunfu  2020-04-29T08:20:47Z  az://./train_features/aeaj

Chifunfu is an area of Tanzania! There are up to 400 chips for each location in the data. Each location is either entirely in the train set or entirely in the test set, so every location in the test set will be new to your model.
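Because test locations never appear in training, one reasonable way to build a local validation set is to hold out whole locations rather than random chips. The sketch below is one possible approach, not an official recommendation; the function name, the `val_fraction` and `seed` parameters, and the tiny sample list (built from chip IDs shown in this document) are all illustrative.

```python
import random


def split_by_location(chips, val_fraction=0.2, seed=0):
    """Split (chip_id, location) pairs so no location is in both sets."""
    locations = sorted({location for _, location in chips})
    random.Random(seed).shuffle(locations)
    # Hold out at least one location for validation.
    n_val = max(1, round(len(locations) * val_fraction))
    val_locations = set(locations[:n_val])
    train_ids = [cid for cid, loc in chips if loc not in val_locations]
    val_ids = [cid for cid, loc in chips if loc in val_locations]
    return train_ids, val_ids


# Chip IDs and locations that appear in this document.
chips = [("adwp", "Chifunfu"), ("adwu", "Chifunfu"), ("adwz", "Chifunfu"),
         ("adxp", "Chifunfu"), ("aeaj", "Chifunfu"), ("cjge", "Bechar")]
train_ids, val_ids = split_by_location(chips, val_fraction=0.5)
print(train_ids, val_ids)
```

A split like this mimics the train/test structure described above, so validation scores should be a closer guide to performance on unseen locations.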

Feature data example

Feature information for the chip in the training set with ID cjge (taken from Bechar, Algeria):

chip_id  location  datetime              cloudpath
cjge     Bechar    2019-11-12T11:02:20Z  az://./train_features/cjge

Feature images

band  description          filepath
B02   Blue visible light   train_features/cjge/B02.tif
B03   Green visible light  train_features/cjge/B03.tif
B04   Red visible light    train_features/cjge/B04.tif
B08   Near infrared light  train_features/cjge/B08.tif

B02.tif (blue visible light band)

array([[1102, 1250, 1324, ..., 2928, 2902, 2802],
       [1160, 1252, 1326, ..., 2854, 2844, 2754],
       [1270, 1338, 1348, ..., 2732, 2800, 2814],
       ...,
       [1086, 1100, 1072, ..., 1200, 1188, 1076],
       [1118, 1118, 1082, ..., 1168, 1190, 1140],
       [1158, 1140, 1086, ..., 1222, 1208, 1148]], dtype=uint16)
B04.tif (red visible light band)

An example TIF image for the B04 band, which shows visible red light, that has clouds in the foreground

array([[3134, 3350, 3492, ..., 4074, 4036, 3978],
       [3190, 3386, 3466, ..., 4064, 4040, 3996],
       [3232, 3420, 3488, ..., 4016, 4016, 4016],
       ...,
       [2714, 2678, 2654, ..., 2942, 2968, 2822],
       [2778, 2732, 2688, ..., 2930, 2996, 2822],
       [2812, 2724, 2654, ..., 2924, 2958, 2886]], dtype=uint16)
Each TIF is a single-band image. The shape of each image array is (512, 512).

Additional data

While four bands per chip are included in the competition data, the publicly available Sentinel-2 dataset includes up to 10 bands capturing different wavelengths. You may pull in any other information from the Planetary Computer to supplement the provided data. We recommend using the Planetary Computer STAC API to access any additional data. Access to the Planetary Computer will be allowed during inference in the code execution environment. To find additional bands for a given chip, search the Planetary Computer Hub based on both the geographic coordinates and the timestamp.

For example code demonstrating how to pull in an additional band, see the tutorial posted in the Planetary Computer Hub.

The competition dataset includes the four bands with a resolution of 10m, and is exclusively from the L2A dataset. Some chips may not have every single band available.

Note on external data: External data that is not from the Planetary Computer is not allowed in this competition. Participants can use pre-trained computer vision models as long as they were available freely and openly in that form at the start of the competition.


The labels for the competition are 512 x 512 GeoTIFFs with pixels indicating cloud (1) / no cloud (0) for each chip. Each label is saved as <chip_id>.tif in the train_labels folder. For example, listing the first few tiles in train_labels in your terminal would give:

$ ls train_labels | head -n 5

Like the feature images, label GeoTIFFs contain additional metadata including bounding coordinates that can be accessed with rasterio. Any missing values in the labels have been converted to 0, or no cloud, during validation. None of the cloud cover masks in the competition dataset are publicly available in the Planetary Computer.

Labelled training data example


An example TIF image with the cloud cover ground truth label. About half of the image is covered in clouds.

array([[0, 0, 0, ..., 1, 1, 1],
       [0, 0, 0, ..., 1, 1, 1],
       [0, 0, 0, ..., 1, 1, 1],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0]], dtype=uint8)
Each label TIF is a single-band image. The shape of each image array is (512, 512).
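Once a label is loaded as an array, the share of cloudy pixels is a simple mean. The sketch below uses a synthetic stand-in array instead of a real label file, and the function name is hypothetical.

```python
import numpy as np


def cloud_fraction(label: np.ndarray) -> float:
    """Return the share of pixels labelled as cloud (value 1)."""
    return float((label == 1).mean())


# Synthetic stand-in for a label: left half cloud, right half clear.
label = np.zeros((512, 512), dtype=np.uint8)
label[:, :256] = 1
print(cloud_fraction(label))  # 0.5
```

Computing this per chip is a quick way to see how cloud cover is distributed across the training set.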

Performance metric

To measure your model’s performance, we’ll use a metric called the Jaccard index, also known as Intersection over Union (IoU). The Jaccard index is a similarity measure between two label sets. In this case, it is defined as the size of the intersection divided by the size of the union of non-missing pixels. In this competition there should be no missing data. Because it is an accuracy metric, a higher value is better. The Jaccard index can be calculated as follows:

$$J(A, B) = \frac{\left|A\cap B\right|}{\left|A\cup B\right|} = \frac{\left|A\cap B\right|}{\left|A\right|+\left|B\right|-\left|A\cap B\right|}$$

where $A$ is the set of true cloud pixels, $B$ is the set of predicted cloud pixels, and $|\cdot|$ denotes the number of pixels in a set.
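As a small worked example with made-up counts (not taken from the competition data), suppose 6 pixels are truly cloud, 5 pixels are predicted as cloud, and 4 pixels fall in both groups. Then:

$$J(A, B) = \frac{4}{6 + 5 - 4} = \frac{4}{7} \approx 0.571$$

A perfect prediction gives a score of 1, and a prediction with no overlap at all gives 0.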

In Python, you can easily calculate the Jaccard index using the scikit-learn function sklearn.metrics.jaccard_score(y_true, y_pred, average='binary'). Below is some pseudocode that demonstrates how you might calculate the Jaccard index in Python without using sklearn:

import numpy as np

# get the sum of intersection and union over all chips
intersection = 0
union = 0

for pred, actual in file_pairs:
    intersection += np.logical_and(actual, pred).sum()
    union += np.logical_or(actual, pred).sum()

# calculate the score across all chips
iou = intersection / union 

In the above, pred and actual are each an array of 512x512 containing only 0s and 1s. file_pairs is a list of tuples where each tuple is (pred, actual).
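The loop above can be wrapped into a runnable function and checked on small arrays. This is a minimal sketch: the function name is hypothetical, the 2 x 2 arrays stand in for real 512 x 512 chips, and returning 0.0 when the union is empty is an assumption of this sketch rather than a documented competition rule.

```python
import numpy as np


def pooled_jaccard(file_pairs):
    """Jaccard index with intersection and union summed over all chips."""
    intersection = 0
    union = 0
    for pred, actual in file_pairs:
        pred = np.asarray(pred).astype(bool)
        actual = np.asarray(actual).astype(bool)
        intersection += np.logical_and(actual, pred).sum()
        union += np.logical_or(actual, pred).sum()
    if union == 0:
        # Assumption of this sketch: with no cloud pixels anywhere the ratio
        # is undefined, so return 0.0 rather than dividing by zero.
        return 0.0
    return float(intersection / union)


# Tiny 2 x 2 stand-ins for the 512 x 512 arrays, as (pred, actual) pairs.
chip_1 = (np.array([[1, 0], [0, 0]]), np.array([[1, 1], [0, 0]]))
chip_2 = (np.array([[1, 1], [1, 1]]), np.array([[1, 1], [1, 1]]))
print(pooled_jaccard([chip_1, chip_2]))  # (1 + 4) / (2 + 4) = 0.8333...
```

Note that pooling is not the same as averaging per-chip scores: the two chips here score 0.5 and 1.0 individually, which would average to 0.75, whereas the pooled score weights each chip by how many cloud pixels it contributes.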

Submission format

This is a code execution challenge! Rather than submitting your predicted labels, you’ll package everything needed to do inference at the chip level and submit that for containerized execution. Your code must be able to generate predictions in the form of single-band 512x512 pixel TIFs. The lone band should consist of 1 (cloud) and 0 (no cloud) pixel values. An example prediction for one chip could look like the following:

Prediction format example

An example TIF image with the predicted cloud cover mask

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 1, 1, 1],
       [0, 0, 0, ..., 1, 1, 1],
       [0, 0, 0, ..., 1, 1, 1]], dtype=uint8)
Each prediction TIF is a single-band image. The shape of each image array is (512, 512).
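Many models output a per-pixel probability rather than a hard label, so a final conversion step is often needed before saving. The sketch below assumes such probabilities and a 0.5 cutoff; both are illustrative choices, the function name is hypothetical, and writing the result to a TIF is left out.

```python
import numpy as np


def to_prediction_mask(probabilities: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Convert per-pixel cloud probabilities into a 0/1 uint8 mask."""
    if probabilities.shape != (512, 512):
        raise ValueError(f"expected shape (512, 512), got {probabilities.shape}")
    # 1 = cloud, 0 = no cloud, matching the prediction format above.
    return (probabilities >= threshold).astype(np.uint8)


# Random values stand in for real model output.
rng = np.random.default_rng(0)
probabilities = rng.random((512, 512))
mask = to_prediction_mask(probabilities)
print(mask.shape, mask.dtype, np.unique(mask))
```

Checking the shape and restricting values to 0 and 1 before writing each file can catch format problems before a submission is executed.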

For full instructions on packaging your submission for code execution, see the Code Submission Format page.

Good luck!

Good luck and enjoy the challenge! If you have any questions you can always visit the user forum!