Tick Tick Bloom: Harmful Algal Bloom Detection Challenge

Harmful algal blooms occur all around the world, and can harm people, their pets, and marine life. Use satellite imagery to detect dangerous concentrations of cyanobacteria, and help protect public health! #climate

$30,000 in prizes
Feb 2023
1,377 joined

Problem description

Your goal in this challenge is to use satellite imagery to detect and classify the severity of cyanobacteria blooms in small, inland water bodies. This will help water quality managers better allocate resources for in situ sampling and make more informed decisions around public health warnings for drinking water and recreation.

Overview of the data provided for this competition:

.
├── metadata.csv
├── submission_format.csv
└── train_labels.csv

Metadata


Labels in this competition are based on “in situ” samples that were collected manually and then analyzed for cyanobacteria density. Each measurement is a unique combination of date and location (latitude and longitude). Samples come from water bodies throughout the continental U.S., and date back to 2013.

metadata.csv includes attributes for all of the sample measurements in the data. It has the following columns:

  • uid (str): unique ID for each row
  • latitude (float): latitude of the location where the sample was collected
  • longitude (float): longitude of the location where the sample was collected
  • date (pd.datetime): date when the sample was collected, in the format YYYY-MM-DD
  • split (str): indicates whether the row is part of the train set or the test set

Metadata is provided for all points in both the train and test sets. Labels are only provided for rows in the train set.
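
For example, a minimal sketch of loading these files with pandas and joining the labels onto their metadata rows (assuming the CSVs have been downloaded to your working directory):

import pandas as pd

# Load sample metadata and training labels
metadata = pd.read_csv("metadata.csv", parse_dates=["date"])
train_labels = pd.read_csv("train_labels.csv")

# Labels exist only for the train split; attach them to metadata rows by uid
train = metadata[metadata["split"] == "train"].merge(train_labels, on="uid")
test = metadata[metadata["split"] == "test"]

print(len(train), "train rows;", len(test), "test rows")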

The basic steps for completing a submission are to:

  1. Train your model on the train dataset
  2. Generate predictions for the test dataset
  3. Submit those predictions for scoring

Each geographic area is either entirely in the train set or entirely in the test set. This means that none of the test set locations are in the training data, so your model's performance will be measured on unseen locations.
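
The competition does not expose these area groupings, so if you want local validation to mimic the unseen-location setup, you have to approximate them yourself. A sketch using scikit-learn's GroupKFold, reusing the train dataframe from the sketch above, with groups derived by rounding coordinates (a heuristic assumption, not the competition's actual split):

from sklearn.model_selection import GroupKFold

# Heuristic location groups: round coordinates to ~0.1 degree cells so nearby
# samples stay on the same side of a validation split
train["loc_group"] = (
    train["latitude"].round(1).astype(str)
    + "_"
    + train["longitude"].round(1).astype(str)
)

gkf = GroupKFold(n_splits=5)
for fold, (fit_idx, val_idx) in enumerate(gkf.split(train, groups=train["loc_group"])):
    print(f"fold {fold}: {len(fit_idx)} fit rows, {len(val_idx)} validation rows")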

Feature data


The feature data for this challenge is a combination of satellite imagery, climate data, and elevation data. Sentinel-2 is the only required data source; using any of the additional sources below is up to you! The four approved data sources are:

  • Sentinel-2 satellite imagery
  • Landsat satellite imagery
  • High-Resolution Rapid Refresh (HRRR) climate data
  • Copernicus DEM elevation data

Participants will access all feature data through external, publicly available APIs. You may only use the approved access points specified. The section on each data source lists which external APIs it is available through. To pull data about each of the sample points, search an API using the sample's location and date from metadata.csv.

To predict the density of a given sample, you may only use data that was available at the time the sample was collected. For example, if a given sample was collected on October 18, 2022, you may only use data that was collected on or before October 18, 2022. You can include a date range in your queries to external APIs to enforce this.

Tip: When there are multiple access point locations for a data source, download from the one that is geographically closest to you to reduce download times.

Note on external data: External data that is not one of the four sources listed below is not allowed in this competition. Participants can use pre-trained computer vision models as long as they were available freely and openly in that form at the start of the competition.

Satellite imagery

The satellite data for this competition includes Sentinel-2 and Landsat. Visible spectrum imagery in Sentinel-2 has a higher resolution than Landsat (10 meters compared to 30 meters). However, for data points prior to mid-2016, only Landsat imagery is available.

Sentinel-2 and Landsat satellites revisit the same location roughly every five and eight days respectively, so in most cases imagery is available within a few days of the in situ sampling label. Imagery captured within a ten-day range is still generally an accurate representation of the sampling conditions.

For each satellite source, participants can use either level-1 or level-2 data.

  • Level-1 describes top of atmosphere reflectance, which can be thought of as the “raw” reflectance. It is a mix of light reflected off of the Earth’s surface and off of the atmosphere.

  • Level-2 describes bottom of atmosphere reflectance. To generate level-2 data, an atmospheric correction model is applied to level-1 that eliminates some of the effects of the atmosphere on reflectance values.

Level-2 data generally captures conditions on the Earth’s surface more accurately. However, the atmospheric correction algorithms applied to level-2 can also sometimes reduce the visibility of bright algal blooms.

Sentinel-2

Sentinel-2 is a wide-swath, high-resolution, multi-spectral imaging mission. Sentinel-2’s publicly available level-1 data is called Level-1C (L1C), and the publicly available level-2 data is called Level-2A (L2A).

In addition to bottom-of-atmosphere reflectance, Sentinel-2 L2A also includes algorithmic bands that describe environmental features relevant to the behavior of algal blooms.

  • Scene classification (SCL): The scene classification band sorts pixels into categories including water, high cloud probability, medium cloud probability, and vegetation. Water pixels could be used to calculate the size of a given water body, which impacts the behavior of blooms. Vegetation can indicate non-toxic marine life like sea grass that sometimes resembles cyanobacteria.

  • Cloud masks (CLM): Sentinel-2’s cloud bands can be used to filter out cloudy pixels. Pixels that are obscured by clouds are likely not helpful for algal bloom detection because the actual water surface is not captured.
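
As an illustration, a sketch of using the SCL band to screen scenes, assuming it has already been downloaded as a numpy array (the class codes below follow ESA's published L2A scene classification: 4 vegetation, 6 water, 8/9 medium/high cloud probability, 10 thin cirrus):

import numpy as np

# Toy 3x3 SCL chip; in practice this comes from the downloaded L2A product
scl = np.array([[6, 6, 8],
                [6, 9, 6],
                [4, 6, 3]])

water = scl == 6                               # candidate water-body pixels
cloud_frac = np.isin(scl, [8, 9, 10]).mean()   # share of cloud-affected pixels

# Skip mostly cloudy scenes; otherwise the water pixel count is a rough
# proxy for water body size
if cloud_frac < 0.5:
    print("usable water pixels:", int(water.sum()))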


Landsat

Landsat, a joint NASA/USGS program, provides the longest continuous space-based record of Earth’s land in existence.

There have been many Landsat missions since the original launch in 1972. The competition data goes back to 2013.

  • For data from March 2013 or later, participants may only use Landsat 8 and Landsat 9. Landsat 8 and Landsat 9 satellites are out of phase with one another, so that between the two each point on the Earth is revisited every 8 days. The data collected by Landsat 9 is very similar to Landsat 8.

  • For data from January and February of 2013, participants may also use Landsat 7 in addition to Landsat 8 and Landsat 9. These early samples may predate the launch of Landsat 8. Landsat 7 may not be used for any later data.

Participants may not use any Landsat missions besides those listed above. Participants may use either level-1 or level-2 data, but may not use level-3. In addition to bottom-of-atmosphere reflectance, Landsat level-2 also includes a measurement of surface temperature, which is relevant to the behavior of algal blooms.
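
These rules are simple to encode when deciding which collections to query. A minimal sketch (the function and mission strings are illustrative, not part of any competition API):

import datetime

def allowed_landsat_missions(sample_date: datetime.date) -> list[str]:
    """Return the Landsat missions permitted for a sample collected on sample_date."""
    missions = ["landsat-8", "landsat-9"]
    # Landsat 7 is additionally allowed only for January-February 2013 samples
    if sample_date.year == 2013 and sample_date.month <= 2:
        missions.append("landsat-7")
    return missions

print(allowed_landsat_missions(datetime.date(2013, 1, 15)))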


Climate data

Information about a location’s temperature, wind, and precipitation patterns may help your model better estimate the formation of blooms.

High-Resolution Rapid Refresh (HRRR): HRRR is a real-time, 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3-km grids with 3-km radar assimilation. It is managed by the National Oceanic and Atmospheric Administration (NOAA).
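
As one hedged example of what access can look like, the sketch below pulls a single hour of HRRR surface output from NOAA's public archive on AWS and opens it with xarray/cfgrib. The bucket layout, file naming, and variable key are assumptions to verify against the approved access points:

import urllib.request
import xarray as xr

# Assumed NOAA public S3 layout: hrrr.YYYYMMDD/conus/hrrr.tHHz.wrfsfcfFF.grib2
url = (
    "https://noaa-hrrr-bdp-pds.s3.amazonaws.com/"
    "hrrr.20220701/conus/hrrr.t12z.wrfsfcf00.grib2"
)
urllib.request.urlretrieve(url, "hrrr_sample.grib2")

# GRIB files mix many levels, so cfgrib needs filter keys; here, 2-m fields
ds = xr.open_dataset(
    "hrrr_sample.grib2",
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "heightAboveGround", "level": 2}},
)
print(ds["t2m"])  # 2-meter temperature grid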

Elevation data

Copernicus DEM (30 meter resolution): The Copernicus Digital Elevation Model (DEM) is a digital surface model (DSM), which represents the surface of the Earth including buildings, infrastructure, and vegetation. This DSM is based on radar satellite data acquired during the TanDEM-X Mission.

Labels


The labels in this competition are samples that were collected manually and then analyzed for cyanobacteria density.

train_labels.csv includes the following columns:

  • uid (str): unique ID for each row. Each row is a unique combination of date and location (latitude and longitude). uid maps each row of the labels to a row in metadata.csv.
  • region (str): U.S. region in which the sample was taken. Scores are calculated separately for each of these regions and then averaged. See the metric section for details.
  • severity (int): severity level based on the cyanobacteria density
  • density (float): raw measurement of total cyanobacteria density in cells per mL. Participants should submit predictions for severity level, NOT for the raw cell density value in cells per milliliter (mL). See the Submission format section.

Labelled training data example


The first row in train_labels.csv is:
uid       aabm
region    midwest
severity  1
density   585.0
The corresponding row in metadata.csv is:
uid        aabm
latitude   39.080319
longitude  -86.430867
date       2018-05-14
split      train
This label is for a sample collected on May 14, 2018 at location (39.080319, -86.430867), which is by a fishing pier on Lake Monroe in Indiana. When analyzed, the sample had a cyanobacteria density of 585.0 cells per mL, which is severity level 1. This means folks can fish away, cyanobacteria-worry-free!

To get feature data from external APIs, you would search for data that includes the point (39.080319, -86.430867) captured on or before May 14, 2018.
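
A sketch of that query using a STAC client; the endpoint and collection name assume Microsoft's Planetary Computer, one public host of Sentinel-2 L2A (verify against the approved access points):

from pystac_client import Client

# Assumed access point: the Planetary Computer STAC API
catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

# Sentinel-2 L2A scenes containing the sample point, captured in a 15-day
# window ending on the sampling date (never after it)
search = catalog.search(
    collections=["sentinel-2-l2a"],
    intersects={"type": "Point", "coordinates": [-86.430867, 39.080319]},
    datetime="2018-04-29/2018-05-14",
)
print(len(list(search.items())), "candidate scenes")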

Submission format


The format for submission is a .csv with the following columns:

  • uid (str): unique ID for each row. Each row is a unique combination of date and location (latitude and longitude). uid maps each row of the test data to a row in metadata.csv.
  • region (str): U.S. region in which the sample was taken. Scores are calculated separately for each of these regions and then averaged. See the metric section for details.
  • severity (int): Predicted cyanobacteria density severity, as defined below.

Predictions for this competition are cyanobacteria severity levels, each representing a range of densities. Severity levels correspond to different levels of concern for water managers. For example, any density less than 20,000 cells/mL (level 1) is generally too low to be dangerous. Higher severity levels might lead to precautions like closing a public beach. Submissions should have severity level only, NOT a prediction for raw density in cells per mL.

severity level   density range (cells/mL)
1                <20,000
2                20,000 to <100,000
3                100,000 to <1,000,000
4                1,000,000 to <10,000,000
5                ≥10,000,000
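
If your model works with raw densities internally, convert to severity before submitting. A minimal sketch of that binning:

import numpy as np

def density_to_severity(density_cells_per_ml: float) -> int:
    """Map raw cyanobacteria density (cells/mL) to severity level 1-5."""
    thresholds = [20_000, 100_000, 1_000_000, 10_000_000]
    return int(np.searchsorted(thresholds, density_cells_per_ml, side="right")) + 1

print(density_to_severity(585.0))       # 1
print(density_to_severity(20_000))      # 2 (lower bounds are inclusive)
print(density_to_severity(10_000_000))  # 5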

submission_format.csv contains one row for each uid in the test dataset. To create a submission, download the submission format and replace the placeholder values in the severity column with your predictions for the test samples. Severity values must be integers or they will not be scored correctly.
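
For example, a sketch of filling in the format file; the constant predictions are a placeholder for your model's output:

import pandas as pd

submission = pd.read_csv("submission_format.csv")

# Placeholder predictions; substitute your model's output, one value per uid
# in the same row order as submission_format.csv
predictions = [2] * len(submission)

# Severity must be integer levels 1-5 or the submission will not score correctly
submission["severity"] = pd.Series(predictions).round().clip(1, 5).astype(int)
submission.to_csv("submission.csv", index=False)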

Example submission


The first few rows of submission_format.csv are:
uid    region     severity
aabn   west       1
aair   west       1
aajw   northeast  1
1 is just a placeholder value for severity. uid maps these samples to metadata.csv:
uid    latitude    longitude    date         split
aabn   36.559700   -121.51000   2016-08-31   test
aair   33.042600   -117.07600   2014-11-01   test
aajw   40.703968   -80.29305    2015-08-26   test
Say you predict that these three samples all have a cyanobacteria severity level of 2 (20,000 to <100,000 cells per mL). The first few rows of your submitted .csv file would be:

uid,region,severity
aabn,west,2
aair,west,2
aajw,northeast,2

Performance metric


To measure your model’s performance, we’ll use Region-Averaged Root Mean Squared Error (RMSE). RMSE is the square root of the mean of squared differences between estimated and observed values. This is an error metric, so a lower value is better. RMSE is implemented in scikit-learn as the mean_squared_error function with the squared parameter set to False.

RMSE will be calculated separately for each of the four regions of the U.S. shown below and then averaged.

$$ \text{Region-Averaged RMSE} = \frac{1}{M} \sum_{m=1}^{M} \sqrt{\frac{1}{N_m} \sum_{n=1}^{N_m} (y_{nm} - \hat{y}_{nm})^2 } $$

where:

  • $\hat{y}_{nm}$ is the predicted severity level for the $n$th sample in the $m$th region
  • $y_{nm}$ is the actual severity level for the $n$th sample in the $m$th region
  • $N_m$ is the number of samples in the $m$th region
  • $M$ is the number of regions

This amounts to:

$$ \text{Region-Averaged RMSE} = \frac{\text{West RMSE} + \text{Midwest RMSE} + \text{South RMSE} + \text{Northeast RMSE}}{4} $$

Region information is provided in train_labels.csv for the train set and in submission_format.csv for the test set.
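
A sketch of computing the metric locally on held-out predictions, using the scikit-learn implementation noted above (the toy dataframe stands in for your validation results):

import pandas as pd
from sklearn.metrics import mean_squared_error

# Toy validation results: actual vs. predicted severity with region labels
scored = pd.DataFrame({
    "region":    ["west", "west", "midwest", "south", "northeast"],
    "severity":  [1, 3, 2, 4, 1],
    "predicted": [1, 2, 2, 3, 2],
})

# Note: newer scikit-learn versions expose root_mean_squared_error instead
region_rmse = scored.groupby("region").apply(
    lambda g: mean_squared_error(g["severity"], g["predicted"], squared=False)
)
print(region_rmse.mean())  # region-averaged RMSE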

Four US regions that will be scored separately, based on the US Census. West: AK, AZ, CA, CO, HI, ID, MT, NM, NV, OR, UT, WA, WY. Midwest: IA, IL, IN, KS, MI, MN, MO, ND, NE, OH, SD, WI. South: AL, AR, DC, DE, FL, GA, KY, LA, MD, MS, NC, OK, SC, TN, TX, VA, WV. Northeast: CT, MA, ME, NH, NJ, NY, PA, RI, VT.

Image source: Mappr


Good luck!


Not sure where to start? Check out the "How to compete" section on the homepage and the benchmark blog post.

Good luck and enjoy creating some toxic algal-rithms! If you have any questions you can always visit the user forum!