On Cloud N: Cloud Cover Detection Challenge Hosted By Microsoft AI for Earth


About the project

Satellite imagery is critical for a wide variety of applications, from disaster management and recovery to agriculture to military intelligence. A major obstacle for all of these use cases is the presence of clouds, which cover over 66% of the Earth's surface (Xie et al., 2020). Clouds introduce noise and inaccuracy into image-based models, and usually have to be identified and removed. Better cloud detection can unlock a much wider range of satellite imagery use cases, enabling faster, more efficient, and more accurate image-based research.

This competition is based on data from the Sentinel-2 mission, an effort to monitor land surface conditions and how they change. Sentinel-2 imagery has recently been used for critical applications like monitoring volcanic eruptions and wildfires:

Example Sentinel-2 Imagery

Left: Volcano eruption on La Palma (image source: Sky News). Right: California wildfires (image source: SciTech Daily).

The biggest challenges in cloud detection are identifying thin clouds and distinguishing between bright clouds and other bright objects (Kristollari & Karathanassi, 2020). The three most common approaches used are:

  1. Threshold methods

  2. Handcrafted models

  3. Deep learning

Threshold methods

The earliest approach, the threshold method, takes advantage of the spectral properties of clouds and sets cutoff values for various spectral bands. Air molecules scatter short (blue) wavelengths of sunlight much more strongly than longer ones through a process called Rayleigh scattering, which is why the sky appears blue. Clouds are composed of much larger water droplets, which reflect all colors similarly regardless of wavelength through Mie scattering; this is why clouds appear gray or white. (Fun fact: Rayleigh and Mie scattering are also why the sky appears whiter closer to the sun!)

Threshold methods are relatively simple, easy to apply, and tend to have high precision. However, they depend on specific cutoff values and therefore don't translate well between different conditions or camera types.
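As a toy illustration (not any operational Sentinel-2 algorithm), a threshold method might flag a pixel as cloud when every visible band exceeds a reflectance cutoff; the function name and the 0.3 cutoff below are purely illustrative:

```python
import numpy as np

def threshold_cloud_mask(red, green, blue, cutoff=0.3):
    """Flag pixels as cloud when all visible bands exceed a reflectance cutoff.

    Bands are assumed to be 2-D arrays of reflectance in [0, 1];
    the default cutoff is illustrative, not a calibrated value.
    """
    return (red > cutoff) & (green > cutoff) & (blue > cutoff)

# Synthetic 2x2 scene: top row bright (cloud-like), bottom row dark.
red = np.array([[0.80, 0.70], [0.10, 0.20]])
green = np.array([[0.75, 0.70], [0.15, 0.20]])
blue = np.array([[0.70, 0.65], [0.10, 0.10]])

mask = threshold_cloud_mask(red, green, blue)
# mask -> [[True, True], [False, False]]
```

Because the mask hinges entirely on the cutoff value, a threshold tuned for one sensor or lighting condition can fail badly on another, which is the weakness described above.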

Handcrafted models

Handcrafted models generate predictions by training a machine learning algorithm on a set of hand-crafted features. Features are commonly derived from spectral values, such as color, and from texture characteristics, such as spatial distribution. Models using Bayesian statistics, discriminant analysis, and support vector machines have all had some success.

However, researchers still have to design and test each possible model feature. The input data is highly complex imagery with many possible features to generate, making it difficult to find the best model.
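To make the idea concrete, here is a sketch of hand-crafted feature extraction (the function and the specific features are illustrative choices, not from any published model): for each image patch, compute per-band spectral statistics plus a simple texture measure, producing a fixed-length vector that a classical classifier such as an SVM could consume.

```python
import numpy as np

def window_features(window):
    """Hand-crafted features for one (bands, height, width) image patch.

    Returns the per-band means (spectral features) followed by the standard
    deviation of the band-averaged brightness (a crude texture measure).
    """
    spectral = window.mean(axis=(1, 2))   # one mean reflectance per band
    texture = window.mean(axis=0).std()   # spatial variability of brightness
    return np.concatenate([spectral, [texture]])

# Illustrative patch: 4 bands of 8x8 random reflectance values.
rng = np.random.default_rng(0)
patch = rng.uniform(0, 1, size=(4, 8, 8))
feats = window_features(patch)
# feats has 4 spectral means + 1 texture value = 5 features
```

Every entry in the feature vector is a design decision the researcher must make and validate by hand, which is exactly why this approach struggles to scale to complex imagery.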

Deep learning

Deep learning has the advantage of learning both the most important features and a classification function at the same time. Some researchers have found that convolutional neural networks can successfully identify the best complex features for prediction, and generally outperform earlier methods.
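The encoder-decoder pattern common in these networks can be sketched in a few lines of PyTorch. This is a minimal toy model in the spirit of segmentation networks like SegCloud, not the SegCloud architecture itself; the class name, layer widths, and chip size are all illustrative:

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Minimal encoder-decoder for per-pixel cloud classification (toy)."""

    def __init__(self, in_bands=13, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_bands, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                          # downsample 2x
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2),  # upsample back to input size
            nn.ReLU(),
            nn.Conv2d(16, n_classes, 1),              # per-pixel class scores
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinySegNet()
scene = torch.randn(1, 13, 64, 64)  # batch of one 13-band, 64x64-pixel chip
logits = model(scene)
# logits.shape -> (1, 2, 64, 64): a cloud / no-cloud score for every pixel
```

Unlike the threshold and handcrafted approaches, the convolution filters here are learned from labeled data rather than designed by hand.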

Example convolutional neural network

Illustration of an example convolutional neural network for cloud detection called SegCloud. The model contains an encoder network, a corresponding decoder network, and a final softmax classifier. Image source: Xie et al., 2021

About the labels

The availability of labeled data has been a major obstacle for cloud detection efforts. Existing models have often been used as a proxy for ground truth, significantly limiting performance (Zupanc, 2017). In this challenge, you have a unique opportunity to advance cutting-edge methods of cloud detection with new, high quality labeled data!

The labels for this dataset are generated using human annotation of the optical bands of Sentinel-2 imagery. In 2021, Radiant Earth Foundation ran a contest to crowdsource data labels identifying clouds in satellite imagery. This contest was sponsored by Planet, Microsoft AI for Earth, and Azavea. The result is a diverse set of Sentinel-2 scenes labeled for cloudy pixels. To simplify the crowdsourcing task, a generic “cloud” / “no cloud” classification was implemented rather than categorizing clouds by type.

The resulting crowdsourced dataset, while extensive, had varying degrees of label quality. With support from Microsoft AI for Earth, Radiant Earth worked with expert annotators at TaQadam to validate and, as needed, revise these labels. The final dataset is a high-quality human-verified set of cloud labels that spans imagery and cloud conditions across three continents (Africa, South America and Australia). The dataset has an open license (CC BY 4.0), and will be made publicly available after the competition ends.

About the features

The features are based on imagery from the Sentinel-2 mission. Data is collected by two wide-swath, high-resolution, multi-spectral imaging satellites. Each satellite carries a MultiSpectral Instrument (MSI) that measures sunlight reflected off of the Earth's surface. As the satellite moves along its path, incoming light is split by wavelength into separate spectral bands. The final data includes 13 spectral bands:

  • 4 visible bands (including red, green, and blue)
  • 6 Near-Infrared bands
  • 3 Short-Wave Infrared bands
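For example, a true-color composite can be assembled from the red, green, and blue bands, which Sentinel-2's band numbering names B04, B03, and B02 respectively. In the sketch below the band arrays are synthetic and the brightness scaling is purely illustrative:

```python
import numpy as np

# Illustrative chip: a dict mapping Sentinel-2 band names to 2-D arrays.
rng = np.random.default_rng(1)
bands = {name: rng.uniform(0, 1, size=(32, 32))
         for name in ["B02", "B03", "B04"]}  # blue, green, red

# Stack red, green, blue into an (height, width, 3) true-color image.
rgb = np.dstack([bands["B04"], bands["B03"], bands["B02"]])
rgb = np.clip(rgb / rgb.max(), 0, 1)  # naive brightness scaling for display
```

The same stacking pattern extends to the non-visible bands, which often carry the most useful signal for separating clouds from other bright surfaces.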

Electromagnetic spectrum

Electromagnetic spectrum. Sentinel-2 detects light with wavelengths ranging from roughly 440 nm (visible indigo light) to 2,200 nm (short-wave infrared).

Sentinel-2 collects data for two levels of the atmosphere. Level-1C (L1C) data captures top of the atmosphere (TOA) reflectance. Level-2A (L2A) data captures bottom of the atmosphere (BOA) reflectance. All images for this competition are from the L2A data.

Both satellites are in sun-synchronous orbits, meaning that each passes over a given point on the Earth's surface at the same local solar time, so the angle of sunlight is consistent across acquisitions. This minimizes the potential impact of shadows. The two satellites are in the same orbit, phased 180 degrees apart from one another, which increases how frequently a given land area is captured.

Sentinel-2 orbital configuration

The Twin-Satellite SENTINEL-2 Orbital Configuration (courtesy Astrium GmbH). Image source: Sentinel-2 Overview

About the project team

Microsoft AI for Earth empowers organizations and individuals working to solve global environmental challenges. The program drives innovation in environmental sustainability by developing technology to accelerate environmental science and by providing grants to organizations addressing problems in biodiversity, climate change, water, and agriculture. Since 2017, AI for Earth has awarded grants to research teams across more than 60 different countries.

The Sentinel-2 imagery used in this competition is among the environmental monitoring data made available through Microsoft's new Planetary Computer. The Planetary Computer puts global-scale environmental monitoring capabilities in the hands of scientists, developers, and policy makers. It combines a multi-petabyte catalog of analysis-ready environmental data with intuitive APIs, a flexible development environment, and applications to put actionable information in the hands of conservation stakeholders. The Planetary Computer uses the SpatioTemporal Asset Catalog (STAC) standard for organizing geospatial assets and metadata, making it easy to search for data that match spatial, temporal, or other criteria.
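As a sketch of what a STAC search looks like, the request body for a STAC API's `/search` endpoint combines a collection, a spatial extent, and a time range. The collection id `sentinel-2-l2a` is the Planetary Computer's name for its Sentinel-2 L2A catalog; the bounding box and date range below are made up for illustration:

```python
# Body for a POST to a STAC API /search endpoint. The bounding box is
# [min lon, min lat, max lon, max lat]; the values here are illustrative.
stac_query = {
    "collections": ["sentinel-2-l2a"],
    "bbox": [29.0, -2.0, 30.0, -1.0],
    "datetime": "2021-01-01/2021-06-30",
    "limit": 10,
}
```

Because every asset in the catalog carries standardized STAC metadata, the same query shape works across the Planetary Computer's many datasets, not just Sentinel-2.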

Note on available resources: The Planetary Computer Hub provides a convenient way to compute on data from the Planetary Computer. If you're interested in using the Planetary Computer Hub to develop your model, you can request access here. Make sure to include "DrivenData" in your area of study on the account request form.

Radiant Earth Foundation is a nonprofit organization that applies machine learning for Earth observation to meet the Sustainable Development Goals. The Foundation supports missions worldwide by developing openly licensed machine learning libraries, training data sets, and models through its open-access Radiant MLHub.

Helpful resources

Competition data

Cloud detection