On Cloud N: Cloud Cover Detection Challenge Hosted By Microsoft AI for Earth
About the project
Satellite imagery is critical for a wide variety of applications from disaster management and recovery, to agriculture, to military intelligence. A major obstacle for all of these use cases is the presence of clouds, which cover over 66% of the Earth's surface (Xie et al, 2020). Clouds introduce noise and inaccuracy in image-based models, and usually have to be identified and removed. Improving methods of identifying clouds can unlock a huge range of satellite imagery use cases, enabling faster, more efficient, and more accurate image-based research.
This competition is based on data from the Sentinel-2 mission, an effort to monitor land surface conditions and how they change over time. Sentinel-2 imagery has recently been used for critical applications like:
- Tracking an erupting volcano on the Spanish island of La Palma. Satellite images showed the path of lava flowing across the land and helped evacuate towns in danger
- Mapping deforestation in the Amazon rainforest and identifying effective interventions
- Monitoring wildfires in California to identify their sources and track air pollutants
The biggest challenges in cloud detection are identifying thin clouds and distinguishing between bright clouds and other bright objects (Kristollari & Karathanassi, 2020). The three most common approaches used are:
The earliest approach, the threshold method, takes advantage of the spectral properties of clouds and sets cutoff values for various spectral bands. Air molecules preferentially scatter shorter, bluer wavelengths of light through a process called Rayleigh scattering, which is why the sky appears blue. Clouds are composed of much larger water droplets, which scatter all visible wavelengths roughly equally through Mie scattering - this is why clouds appear gray or white. (Fun fact, Mie scattering is also why the sky appears whiter closer to the sun!)
Threshold methods are relatively simple, easy to apply, and tend to have high precision. However, they depend on specific cutoff values and therefore don't translate well between different conditions or camera types.
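The threshold idea can be sketched in a few lines. The cutoff values below are made up for illustration, not calibrated Sentinel-2 thresholds - in practice each cutoff is tuned per band and per sensor:

```python
import numpy as np

# Hypothetical reflectance cutoffs, chosen only for demonstration.
BLUE_CUTOFF = 0.3   # clouds are bright in the visible blue band
SWIR_CUTOFF = 0.2   # short-wave infrared helps separate clouds from bright ground

def threshold_cloud_mask(blue: np.ndarray, swir: np.ndarray) -> np.ndarray:
    """Flag a pixel as cloud when it exceeds the cutoff in BOTH bands.

    Clouds reflect strongly across visible and short-wave infrared,
    while many bright surfaces (sand, snow edges) are dimmer in one of them.
    """
    return (blue > BLUE_CUTOFF) & (swir > SWIR_CUTOFF)

blue = np.array([[0.1, 0.5], [0.4, 0.2]])
swir = np.array([[0.1, 0.4], [0.1, 0.3]])
mask = threshold_cloud_mask(blue, swir)  # True only where both bands are bright
```

Note how the result is entirely determined by the cutoffs: change the sensor or lighting conditions and the same values may over- or under-detect clouds, which is exactly the weakness described above.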
Handcrafted-feature models generate predictions by training a machine learning algorithm on a set of manually designed features. Features are commonly generated from spectral values like color and from texture characteristics like spatial distribution. Models using Bayesian statistics, discriminant analysis, and support vector machines have all had some success.
However, researchers still have to design and test each possible model feature. The input data is highly complex imagery with many possible features to generate, making it difficult to find the best model.
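A minimal sketch of this approach, using two illustrative hand-crafted features (mean brightness and local variance as a texture proxy) and a simple minimum-distance classifier standing in for the discriminant-analysis methods mentioned above. The features and toy data are assumptions for demonstration only:

```python
import numpy as np

def features(window: np.ndarray) -> np.ndarray:
    """Two hand-crafted features per pixel window: brightness and texture."""
    return np.array([window.mean(), window.var()])

def fit_class_means(X: np.ndarray, y: np.ndarray) -> dict:
    """Per-class feature means - the entire 'model' of a minimum-distance classifier."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(means: dict, x: np.ndarray):
    """Assign the class whose feature mean is nearest in feature space."""
    return min(means, key=lambda c: np.linalg.norm(x - means[c]))

# Toy training set: bright, smooth windows are "cloud" (1), dark ones "clear" (0).
windows = [np.full((3, 3), 0.9), np.full((3, 3), 0.85),
           np.full((3, 3), 0.1), np.full((3, 3), 0.15)]
X = np.array([features(w) for w in windows])
y = np.array([1, 1, 0, 0])
means = fit_class_means(X, y)
```

Every step except the final fit is designed by hand, which is the bottleneck the text describes: a better feature means redesigning and re-testing this pipeline.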
Deep learning has the advantage of learning both the most important features and a classification function at the same time. Some researchers have found that convolutional neural networks can successfully identify the best complex features for prediction, and generally outperform earlier methods.
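The building block such networks learn is the convolution: a small kernel slid across the image, with weights fitted from labeled data instead of designed by hand. The pure-NumPy sketch below applies one fixed 3x3 kernel; a real cloud-segmentation network stacks many such layers (often in a U-Net-style architecture) and learns all the kernel weights jointly with the classifier:

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid-mode 2D cross-correlation with a 3x3 kernel."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
    return out

# A classic hand-designed edge detector; a CNN learns filters like this
# (and far more complex ones) automatically from the training labels.
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])
image = np.ones((5, 5))          # a featureless patch
response = conv2d(image, edge_kernel)  # zero response: no edges to detect
```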
About the labels
The availability of labeled data has been a major obstacle for cloud detection efforts. Existing models have often been used as a proxy for ground truth, significantly limiting performance (Zupanc, 2017). In this challenge, you have a unique opportunity to advance cutting-edge methods of cloud detection with new, high quality labeled data!
The labels for this dataset are generated using human annotation of the optical bands of Sentinel-2 imagery. In 2021, Radiant Earth Foundation ran a contest to crowdsource data labels identifying clouds in satellite imagery. This contest was sponsored by Planet, Microsoft AI for Earth, and Azavea. The result is a diverse set of Sentinel-2 scenes labeled for cloudy pixels. To simplify the crowdsourcing task, a generic “cloud” / “no cloud” classification was implemented rather than categorizing clouds by type.
The resulting crowdsourced dataset, while extensive, had varying degrees of label quality. With support from Microsoft AI for Earth, Radiant Earth worked with expert annotators at TaQadam to validate and, as needed, revise these labels. The final dataset is a high-quality human-verified set of cloud labels that spans imagery and cloud conditions across three continents (Africa, South America and Australia). The dataset has an open license (CC BY 4.0), and will be made publicly available after the competition ends.
About the features
The features are based on imagery from the Sentinel-2 mission. Data is collected by two wide-swath, high-resolution, multi-spectral imaging satellites. Each satellite carries a MultiSpectral Instrument (MSI) that collects data from sunlight reflected off of the Earth's surface. As the satellite moves along its path, incoming light is split into separate bands by wavelength. The final data includes 13 spectral bands:
- 4 visible bands (including red, green, and blue)
- 6 Near-Infrared bands
- 3 Short-Wave Infrared bands
Sentinel-2 data is delivered at two processing levels. Level-1C (L1C) data captures top-of-atmosphere (TOA) reflectance. Level-2A (L2A) data captures bottom-of-atmosphere (BOA) reflectance. All images for this competition are from the L2A data.
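In practice an L2A chip often arrives as a stacked (bands, height, width) array, from which you select the bands you need. The band ordering and names below are assumptions for illustration - always confirm against the actual asset metadata when loading real data:

```python
import numpy as np

# Hypothetical mapping from Sentinel-2 band names to stack indices;
# real datasets document their own ordering.
BAND_INDEX = {"B02_blue": 1, "B03_green": 2, "B04_red": 3, "B08_nir": 7}

def select_bands(chip: np.ndarray, names: list) -> np.ndarray:
    """Pull named bands out of a (bands, height, width) stack."""
    return chip[[BAND_INDEX[n] for n in names]]

chip = np.arange(13 * 2 * 2).reshape(13, 2, 2)  # dummy 13-band, 2x2-pixel chip
rgb = select_bands(chip, ["B04_red", "B03_green", "B02_blue"])
```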
Both satellite orbits are sun-synchronous, meaning each satellite passes over a given point on the Earth's surface at the same local solar time, so the angle of sunlight is consistent across acquisitions. This minimizes variation from shadows. The two satellites share the same orbit, phased exactly opposite one another, which increases how frequently a given land area is captured.
About the project team
Microsoft AI for Earth empowers organizations and individuals working to solve global environmental challenges. The program drives innovation in environmental sustainability by developing technology to accelerate environmental science and by providing grants to organizations addressing problems in biodiversity, climate change, water, and agriculture. Since 2017, AI for Earth has awarded grants to research teams across more than 60 different countries.
The Sentinel-2 imagery used in this competition is among the environmental monitoring data made available through Microsoft's new Planetary Computer. The Planetary Computer puts global-scale environmental monitoring capabilities in the hands of scientists, developers, and policy makers. It combines a multi-petabyte catalog of analysis-ready environmental data with intuitive APIs, a flexible development environment, and applications to put actionable information in the hands of conservation stakeholders. The Planetary Computer uses the SpatioTemporal Asset Catalog (STAC) standard for organizing geospatial assets and metadata, making it easy to search for data that match spatial, temporal, or other criteria.
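As a sketch of what a STAC query looks like, the dictionary below follows the STAC API Item Search conventions (`collections`, `bbox`, `datetime`, plus the `eo:cloud_cover` property from the Electro-Optical extension). The collection id shown is the one commonly used for Sentinel-2 L2A on the Planetary Computer, but verify it, and the bounding box, against the catalog you are querying:

```python
import json

# Example search body: Sentinel-2 L2A items over an example lon/lat box,
# within 2020, with under 20% scene-level cloud cover.
search = {
    "collections": ["sentinel-2-l2a"],
    "bbox": [29.0, -2.9, 30.9, -1.0],          # [min_lon, min_lat, max_lon, max_lat]
    "datetime": "2020-01-01T00:00:00Z/2020-12-31T23:59:59Z",
    "query": {"eo:cloud_cover": {"lt": 20}},
    "limit": 10,
}
payload = json.dumps(search)  # POST this to a STAC API /search endpoint
```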
Radiant Earth Foundation is a nonprofit organization that applies machine learning for Earth observation to meet the Sustainable Development Goals. The Foundation supports missions worldwide by developing openly licensed machine learning libraries, training data sets, and models through its open-access Radiant MLHub.
References
- The Azavea Cloud Dataset
- Sentinel-2 Mission
- Sentinel-2 L2A data details
- Sentinel-2 L2A algorithm-based bands
- Cloud Detection in Satellite Imagery (Azavea, 2021)
- Convolutional neural networks for detecting challenging cases in cloud masking using Sentinel-2 imagery (Kristollari & Karathanassi, 2020)
- SegCloud: a novel cloud image segmentation model using a deep convolutional neural network for ground-based all-sky-view camera observation (Xie et al, 2020)
- Cloud detection methodologies: variants and development—a review (Majahan & Fataniya, 2020)
- Cloud Detection Based on Convolutional Neural Network using Different Bands Information for Landsat 8 OLI (Ma, Wang, Lin, & Wang, 2019)
- Improving Cloud Detection with Machine Learning (Zupanc, 2017)
- Deep Learning for Cloud Detection (Goff et al, 2017)