Tick Tick Bloom: Harmful Algal Bloom Detection Challenge

Harmful algal blooms occur all around the world, and can harm people, their pets, and marine life. Use satellite imagery to detect dangerous concentrations of cyanobacteria, and help protect public health! #climate

$30,000 in prizes
Feb 2023
1,377 joined

Problem description

Your goal in this challenge is to use satellite imagery to detect and classify the severity of cyanobacteria blooms in small, inland water bodies. This will help water quality managers better allocate resources for in situ sampling and make more informed decisions around public health warnings for drinking water and recreation.

Overview of the data provided for this competition:

.
├── metadata.csv
├── submission_format.csv
└── train_labels.csv

Metadata


Labels in this competition are based on “in situ” samples that were collected manually and then analyzed for cyanobacteria density. Each measurement is a unique combination of date and location (latitude and longitude). Samples come from water bodies throughout the continental U.S., and date back to 2013.

metadata.csv includes attributes for all of the sample measurements in the data. It has the following columns:

  • uid (str): unique ID for each row
  • latitude (float): latitude of the location where the sample was collected
  • longitude (float): longitude of the location where the sample was collected
  • date (pd.datetime): date when the sample was collected, in the format YYYY-MM-DD
  • split (str): indicates whether the row is part of the train set or the test set

Metadata is provided for all points in both the train and test sets. Labels are only provided for rows in the train set.
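
For example, a minimal sketch of loading these files with pandas and joining the labels onto their metadata rows (assuming the CSVs have been downloaded to your working directory):

import pandas as pd

# Load sample metadata and training labels
metadata = pd.read_csv("metadata.csv", parse_dates=["date"])
train_labels = pd.read_csv("train_labels.csv")

# Labels exist only for the train split; attach them to metadata rows by uid
train = metadata[metadata["split"] == "train"].merge(train_labels, on="uid")
test = metadata[metadata["split"] == "test"]

print(len(train), "train rows;", len(test), "test rows")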

The basic steps for completing a submission are to:

  1. Train your model on the train dataset
  2. Generate predictions for the test dataset
  3. Submit those predictions for scoring

Each geographic area is either entirely in the train set or entirely in the test set. This means that none of the test set locations are in the training data, so your model's performance will be measured on unseen locations.
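
The competition does not expose these area groupings, so if you want local validation to mimic the unseen-location setup, you have to approximate them yourself. A sketch using scikit-learn's GroupKFold, reusing the train dataframe from the sketch above, with groups derived by rounding coordinates (a heuristic assumption, not the competition's actual split):

from sklearn.model_selection import GroupKFold

# Heuristic location groups: round coordinates to ~0.1 degree cells so nearby
# samples stay on the same side of a validation split
train["loc_group"] = (
    train["latitude"].round(1).astype(str)
    + "_"
    + train["longitude"].round(1).astype(str)
)

gkf = GroupKFold(n_splits=5)
for fold, (fit_idx, val_idx) in enumerate(gkf.split(train, groups=train["loc_group"])):
    print(f"fold {fold}: {len(fit_idx)} fit rows, {len(val_idx)} validation rows")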

Feature data


The feature data for this challenge is a combination of satellite imagery, climate data, and elevation data. Sentinel-2 is the only required data source; using any of the additional sources below is up to you! The four approved data sources are:

  • Sentinel-2 satellite imagery
  • Landsat satellite imagery
  • High-Resolution Rapid Refresh (HRRR) climate data
  • Copernicus DEM elevation data

Participants will access all feature data through external, publicly available APIs. You may only use the approved access points specified. The section on each data source lists which external APIs it is available through. To pull data about each of the sample points, search an API using the sample's location and date from metadata.csv.

To predict the density of a given sample, you may only use data that was available at the time the sample was collected. For example, if a given sample was collected on October 18, 2022, you may only use data that was collected on or before October 18, 2022. You can include a date range in your queries to external APIs to enforce this.

Tip: When there are multiple access point locations for a data source, download from the one that is geographically closest to you to reduce download times.

Note on external data: External data that is not one of the four sources listed below is not allowed in this competition. Participants can use pre-trained computer vision models as long as they were available freely and openly in that form at the start of the competition.

Satellite imagery

The satellite data for this competition includes Sentinel-2 and Landsat. Visible spectrum imagery in Sentinel-2 has a higher resolution than Landsat (10 meters compared to 30 meters). However, for data points prior to mid-2016, only Landsat imagery is available.

Sentinel-2 and Landsat satellites revisit the same location roughly every five and eight days respectively, so in most cases imagery is available within a few days of the in situ sampling label. Imagery captured within a ten-day range is still generally an accurate representation of the sampling conditions.

For each satellite source, participants can use either level-1 or level-2 data.

  • Level-1 describes top of atmosphere reflectance, which can be thought of as the “raw” reflectance. It is a mix of light reflected off of the Earth’s surface and off of the atmosphere.

  • Level-2 describes bottom of atmosphere reflectance. To generate level-2 data, an atmospheric correction model is applied to level-1 that eliminates some of the effects of the atmosphere on reflectance values.

Level-2 data generally captures conditions on the Earth’s surface more accurately. However, the atmospheric correction algorithms applied to level-2 can also sometimes reduce the visibility of bright algal blooms.

Sentinel-2

Sentinel-2 is a wide-swath, high-resolution, multi-spectral imaging mission. Sentinel-2’s publicly available level-1 data is called Level-1C (L1C), and the publicly available level-2 data is called Level-2A (L2A).

In addition to bottom-of-atmosphere reflectance, Sentinel-2 L2A also includes algorithmic bands that describe environmental features relevant to the behavior of algal blooms.

  • Scene classification (SCL): The scene classification band sorts pixels into categories including water, high cloud probability, medium cloud probability, and vegetation. Water pixels could be used to calculate the size of a given water body, which impacts the behavior of blooms. Vegetation can indicate non-toxic marine life like sea grass that sometimes resembles cyanobacteria.

  • Cloud masks (CLM): Sentinel-2’s cloud bands can be used to filter out cloudy pixels. Pixels that are obscured by clouds are likely not helpful for algal bloom detection because the actual water surface is not captured.
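
As an illustration, a sketch of using the SCL band to screen scenes, assuming it has already been downloaded as a numpy array (the class codes below follow ESA's published L2A scene classification: 4 vegetation, 6 water, 8/9 medium/high cloud probability, 10 thin cirrus):

import numpy as np

# Toy 3x3 SCL chip; in practice this comes from the downloaded L2A product
scl = np.array([[6, 6, 8],
                [6, 9, 6],
                [4, 6, 3]])

water = scl == 6                               # candidate water-body pixels
cloud_frac = np.isin(scl, [8, 9, 10]).mean()   # share of cloud-affected pixels

# Skip mostly cloudy scenes; otherwise the water pixel count is a rough
# proxy for water body size
if cloud_frac < 0.5:
    print("usable water pixels:", int(water.sum()))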


Landsat

Landsat, a joint NASA/USGS program, provides the longest continuous space-based record of Earth’s land in existence.

There have been many Landsat missions since the original launch in 1972. The competition data goes back to 2013.

  • For data from March 2013 or later, participants may only use Landsat 8 and Landsat 9. Landsat 8 and Landsat 9 satellites are out of phase with one another, so that between the two each point on the Earth is revisited every 8 days. The data collected by Landsat 9 is very similar to Landsat 8.

  • For data from January and February of 2013, participants may also use Landsat 7 in addition to Landsat 8 and Landsat 9. These early samples may predate the launch of Landsat 8. Landsat 7 may not be used for any later data.

Participants may not use any Landsat missions besides those listed above. Participants may use either level-1 or level-2 data, but may not use level-3. In addition to bottom-of-atmosphere reflectance, Landsat level-2 also includes a measurement of surface temperature, which is relevant to the behavior of algal blooms.
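
These rules are simple to encode when deciding which collections to query. A minimal sketch (the function and mission strings are illustrative, not part of any competition API):

import datetime

def allowed_landsat_missions(sample_date: datetime.date) -> list[str]:
    """Return the Landsat missions permitted for a sample collected on sample_date."""
    missions = ["landsat-8", "landsat-9"]
    # Landsat 7 is additionally allowed only for January-February 2013 samples
    if sample_date.year == 2013 and sample_date.month <= 2:
        missions.append("landsat-7")
    return missions

print(allowed_landsat_missions(datetime.date(2013, 1, 15)))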


Climate data

Information about a location’s temperature, wind, and precipitation patterns may help your model better estimate the formation of blooms.

High-Resolution Rapid Refresh (HRRR): HRRR is a real-time, 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3-km grids with 3-km radar assimilation. It is managed by the National Oceanic and Atmospheric Administration (NOAA).
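
As one hedged example of what access can look like, the sketch below pulls a single hour of HRRR surface output from NOAA's public archive on AWS and opens it with xarray/cfgrib. The bucket layout, file naming, and variable key are assumptions to verify against the approved access points:

import urllib.request
import xarray as xr

# Assumed NOAA public S3 layout: hrrr.YYYYMMDD/conus/hrrr.tHHz.wrfsfcfFF.grib2
url = (
    "https://noaa-hrrr-bdp-pds.s3.amazonaws.com/"
    "hrrr.20220701/conus/hrrr.t12z.wrfsfcf00.grib2"
)
urllib.request.urlretrieve(url, "hrrr_sample.grib2")

# GRIB files mix many levels, so cfgrib needs filter keys; here, 2-m fields
ds = xr.open_dataset(
    "hrrr_sample.grib2",
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "heightAboveGround", "level": 2}},
)
print(ds["t2m"])  # 2-meter temperature grid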

Elevation data

Copernicus DEM (30 meter resolution): The Copernicus Digital Elevation Model (DEM) is a digital surface model (DSM), which represents the surface of the Earth including buildings, infrastructure, and vegetation. This DSM is based on radar satellite data acquired during the TanDEM-X Mission.

Labels


The labels in this competition are samples that were collected manually and then analyzed for cyanobacteria density.

train_labels.csv includes the following columns:

  • uid (str): unique ID for each row. Each row is a unique combination of date and location (latitude and longitude). uid maps each row of the labels to a row in metadata.csv.
  • region (str): U.S. region in which the sample was taken. Scores are calculated separately for each of these regions and then averaged. See the metric section for details.
  • severity (int): severity level based on the cyanobacteria density
  • density (float): raw measurement of total cyanobacteria density in cells per mL. Participants should submit predictions for severity level, NOT for the raw cell density value in cells per milliliter (mL). See the Submission format section.

Labelled training data example


The first row in train_labels.csv is:
uid       aabm
region    midwest
severity  1
density   585.0
The corresponding row in metadata.csv is:
uid        aabm
latitude   39.080319
longitude  -86.430867
date       2018-05-14
split      train
This label is for a sample collected on May 14, 2018 at location (39.080319, -86.430867), which is by a fishing pier on Lake Monroe in Indiana. When analyzed, the sample had a cyanobacteria density of 585.0 cells per mL, which is severity level 1. This means folks can fish away, cyanobacteria-worry-free!

To get feature data from external APIs, you would search for data that includes the point (39.080319, -86.430867) captured on or before May 14, 2018.
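
A sketch of that query using a STAC client; the endpoint and collection name assume Microsoft's Planetary Computer, one public host of Sentinel-2 L2A (verify against the approved access points):

from pystac_client import Client

# Assumed access point: the Planetary Computer STAC API
catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

# Sentinel-2 L2A scenes containing the sample point, captured in a 15-day
# window ending on the sampling date (never after it)
search = catalog.search(
    collections=["sentinel-2-l2a"],
    intersects={"type": "Point", "coordinates": [-86.430867, 39.080319]},
    datetime="2018-04-29/2018-05-14",
)
print(len(list(search.items())), "candidate scenes")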

Submission format


The format for submission is a .csv with the following columns:

  • uid (str): unique ID for each row. Each row is a unique combination of date and location (latitude and longitude). uid maps each row of the test data to a row in metadata.csv.
  • region (str): U.S. region in which the sample was taken. Scores are calculated separately for each of these regions and then averaged. See the metric section for details.
  • severity (int): Predicted cyanobacteria density severity, as defined below.

Predictions for this competition are cyanobacteria severity levels, each representing a range of densities. Severity levels correspond to different levels of concern for water managers. For example, any density less than 20,000 cells/mL (level 1) is generally too low to be dangerous. Higher severity levels might lead to precautions like closing a public beach. Submissions should have severity level only, NOT a prediction for raw density in cells per mL.

severity level   density range (cells/mL)
1                <20,000
2                20,000 to <100,000
3                100,000 to <1,000,000
4                1,000,000 to <10,000,000
5                ≥10,000,000
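
If your model works with raw densities internally, convert to severity before submitting. A minimal sketch of that binning:

import numpy as np

def density_to_severity(density_cells_per_ml: float) -> int:
    """Map raw cyanobacteria density (cells/mL) to severity level 1-5."""
    thresholds = [20_000, 100_000, 1_000_000, 10_000_000]
    return int(np.searchsorted(thresholds, density_cells_per_ml, side="right")) + 1

print(density_to_severity(585.0))       # 1
print(density_to_severity(20_000))      # 2 (lower bounds are inclusive)
print(density_to_severity(10_000_000))  # 5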

submission_format.csv contains one row for each uid in the test dataset. To create a submission, download the submission format and replace the placeholder values in the severity column with your predictions for the test samples. Severity values must be integers or they will not be scored correctly.
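
For example, a sketch of filling in the format file; the constant predictions are a placeholder for your model's output:

import pandas as pd

submission = pd.read_csv("submission_format.csv")

# Placeholder predictions; substitute your model's output, one value per uid
# in the same row order as submission_format.csv
predictions = [2] * len(submission)

# Severity must be integer levels 1-5 or the submission will not score correctly
submission["severity"] = pd.Series(predictions).round().clip(1, 5).astype(int)
submission.to_csv("submission.csv", index=False)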

Example submission


The first few rows of submission_format.csv are:
uid    region     severity
aabn   west       1
aair   west       1
aajw   northeast  1
1 is just a placeholder value for severity. uid maps these samples to metadata.csv:
uid    latitude    longitude    date         split
aabn   36.559700   -121.51000   2016-08-31   test
aair   33.042600   -117.07600   2014-11-01   test
aajw   40.703968   -80.29305    2015-08-26   test
Say you predict that these three samples all have a cyanobacteria severity level of 2 (20,000 to <100,000 cells per mL). The first few rows of your submitted .csv file would be:

uid,region,severity
aabn,west,2
aair,west,2
aajw,northeast,2

Performance metric


To measure your model’s performance, we’ll use Region-Averaged Root Mean Squared Error (RMSE). RMSE is the square root of the mean of squared differences between estimated and observed values. This is an error metric, so a lower value is better. RMSE is implemented in scikit-learn as the mean_squared_error function with the squared parameter set to False.

RMSE will be calculated separately for each of the four regions of the U.S. shown below and then averaged.

$$ \text{Region-Averaged RMSE} = \frac{1}{M} \sum_{m=1}^{M} \sqrt{\frac{1}{N_m} \sum_{n=1}^{N_m} (y_{nm} - \hat{y}_{nm})^2 } $$

where:

  • $\hat{y}_{nm}$ is the predicted severity level for the $n$th sample in the $m$th region
  • $y_{nm}$ is the actual severity level for the $n$th sample in the $m$th region
  • $N_m$ is the number of samples in the $m$th region
  • $M$ is the number of regions

This amounts to:

$$ \text{Region-Averaged RMSE} = \frac{\text{West RMSE} + \text{Midwest RMSE} + \text{South RMSE} + \text{Northeast RMSE}}{4} $$

Region information is provided in train_labels.csv for the train set and in submission_format.csv for the test set.
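
A sketch of computing the metric locally on held-out predictions, using the scikit-learn implementation noted above (the toy dataframe stands in for your validation results):

import pandas as pd
from sklearn.metrics import mean_squared_error

# Toy validation results: actual vs. predicted severity with region labels
scored = pd.DataFrame({
    "region":    ["west", "west", "midwest", "south", "northeast"],
    "severity":  [1, 3, 2, 4, 1],
    "predicted": [1, 2, 2, 3, 2],
})

# Note: newer scikit-learn versions expose root_mean_squared_error instead
region_rmse = scored.groupby("region").apply(
    lambda g: mean_squared_error(g["severity"], g["predicted"], squared=False)
)
print(region_rmse.mean())  # region-averaged RMSE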

Four US regions that will be scored separately, based on the US Census. West: AK, AZ, CA, CO, HI, ID, MT, NM, NV, OR, UT, WA, WY. Midwest: IA, IL, IN, KS, MI, MN, MO, ND, NE, OH, SD, WI. South: AL, AR, DC, DE, FL, GA, KY, LA, MD, MS, NC, OK, SC, TN, TX, VA, WV. Northeast: CT, MA, ME, NH, NJ, NY, PA, RI, VT.

Image source: Mappr


Good luck!


Not sure where to start? Check out the "How to compete" section on the homepage and the benchmark blog post.

Good luck and enjoy creating some toxic algal-rithms! If you have any questions you can always visit the user forum!