\$25,000

## Problem description: Trace Gas Track

The goal of this challenge is to generate daily estimates of surface-level NO2 for a set of 5km x 5km geometries across three urban areas: Delhi, Los Angeles (SoCAB), and Taipei.

These locations represent areas with readily available satellite data, but varying levels of pollution and historical data. High performing models should do well in all three locations.

There will be two separate tracks for the prediction of each pollutant, PM2.5 and NO2. This is the Trace Gas Track (NO2). You can find the Particulate Track (PM2.5) here. More information on the dataset, performance metric, and submission specifications is provided below.

Finalists and runners-up will be determined by performance on the private test set. These participants will then have the opportunity to submit their code to be audited using an out-of-sample verification set. The top 3 eligible teams that pass this final check in each track will be awarded prizes.

## Data

There are two types of pre-approved data that can be used as input to your model: satellite and ancillary (meteorological and topological) data. We provide satellite data through a public s3 bucket hosted by DrivenData. These files have already been subsetted to the correct times and geographies. Ancillary data can be accessed through public portals. You may use any approved data sources you like, but you must use at least one dataset from the list of approved satellite data.

Note that the data are provided in various spatiotemporal resolutions and formats. Malfunctioning instruments and other errors may lead to missing and incorrect values. The datasets may also cover different date ranges or geographical areas. It is your job to figure out how to best combine these datasets to build the best model!

### Approved features (inputs)

All data sources detailed below are pre-approved for both model training and inference. These data have undergone a careful selection process and are freely and publicly available. Note that only the listed data products and access locations are pre-approved for use. You must go through an approved access location when using these sources for inference.

If you are interested in using additional sources that are not listed, see the process for requesting additional data sources below.

#### Satellite Data

NASA provides many data products through the Earthdata portal. Select data sources that are likely to be of greatest use in this track are listed below. You must use at least one of these satellite sources to train your model.

The primary satellite data products related to estimating surface-level NO2 measure something called NO2 vertical column density (NO2 VCD). For NO2, the amount high in the atmosphere (in the stratosphere) is typically removed in order to provide an estimate of the column in the lower part of the atmosphere (the troposphere), resulting in Tropospheric Vertical Column Density (TVCD). The units of VCD or TVCD are molecules per square centimeter. Because these numbers are large, they may be divided by $10^{15}$. Units may also be given as moles per unit area.
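As a rough illustration of how these unit conventions relate, the sketch below converts a column density from moles per square meter to molecules per square centimeter, and then to units of $10^{15}$ molecules per square centimeter. The function names and the input value are hypothetical and chosen only for the example; check each product's documentation for the units it actually uses.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI definition)
CM2_PER_M2 = 1.0e4        # square centimeters in one square meter


def mol_per_m2_to_molec_per_cm2(vcd_mol_m2: float) -> float:
    """Convert a column density from mol/m^2 to molecules/cm^2."""
    return vcd_mol_m2 * AVOGADRO / CM2_PER_M2


def to_scaled_units(vcd_molec_cm2: float) -> float:
    """Express a column density in units of 1e15 molecules/cm^2."""
    return vcd_molec_cm2 / 1.0e15


# Hypothetical column density, not taken from any real granule
example_mol_m2 = 1.0e-4
example_molec_cm2 = mol_per_m2_to_molec_per_cm2(example_mol_m2)
print(example_molec_cm2, to_scaled_units(example_molec_cm2))
```

With these constants, one mole per square meter corresponds to about 6.022e19 molecules per square centimeter.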

OMI/Aura NO2 Cloud-Screened Total and Tropospheric Column L3 Global Gridded 0.25 degree x 0.25 degree V3 (OMNO2d):

The Ozone Monitoring Instrument (OMI) aboard NASA's Aura satellite measures criteria pollutants such as O3, NO2, SO2, and aerosols from Earth's surface to top-of-atmosphere. The OMNO2d product provides total and tropospheric column NO2 over the globe. These data are provided as Hierarchical Data Format files.

Sentinel-5P TROPOMI Tropospheric NO2 1-Orbit L2:

The TROPOspheric Monitoring Instrument (TROPOMI) is on board the Copernicus Sentinel-5 Precursor satellite. This product contains Total column NO2 and Tropospheric Column NO2, for all atmospheric conditions, and for sky conditions where cloud fraction is less than 30 percent. These data are provided as NetCDF files.

There are two products available for this dataset. S5P_L2__NO2___ is available for dates prior to August 6, 2019 at a resolution of 7km x 3.5km. S5P_L2__NO2___HiR is available for dates starting from August 6, 2019 at a higher resolution of 5.5km x 3.5km.

#### Ancillary Data

NASADEM data products were derived from original telemetry data from the Shuttle Radar Topography Mission (SRTM). In addition to Terra Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) Global Digital Elevation Model (GDEM) Version 2 data, NASADEM also relied on ground control points from the lidar shots of the Ice, Cloud, and Land Elevation Satellite (ICESat) Geoscience Laser Altimeter System (GLAS) to improve surface elevation measurements, which in turn improved geolocation accuracy.

NASADEM can be accessed through Microsoft's Planetary Computer through the nasadem collection via the STAC API. Using the PySTAC library created by Azavea, you can load, traverse, and access data within these STACs programmatically. This quickstart guide demonstrates how to search for data using the STAC API with PySTAC.

Global Forecast System: The Global Forecast System (GFS) is a National Centers for Environmental Prediction (NCEP) weather forecast model that generates data for dozens of atmospheric and land-soil variables, including temperatures, winds, precipitation, soil moisture, and atmospheric ozone concentration. The system couples four separate models (atmosphere, ocean model, land/soil model, and sea ice) that work together to accurately depict weather conditions.

Only the GFS Forecasts dataset is pre-approved for this competition (as opposed to GFS Analysis/GFS-ANL). You may use the 0.25°, 0.5°, or 1° grids. Note that the 0.25° dataset is only accessible via the NCAR servers, which require registration.

0.5° or 1° grid:

0.25° grid:

ECMWF IFS CY41r2 High-Resolution Operational Forecasts:

The Integrated Forecast System (IFS) model cycle CY41r2 High-Resolution Operational Forecast is a dataset by the European Centre for Medium-Range Weather Forecasts (ECMWF). It provides highly accurate weather forecasts with a nominal grid point spacing of nine kilometers.

Note that the approved data access location is accessible via NCAR servers, which requires registration.

Goddard Earth Observing System – Composition Forecasting:

The NASA GEOS Composition Forecasting (GEOS-CF) system produces global, three-dimensional distributions of atmospheric composition. The horizontal spatial resolution is 0.25 degrees, and the temporal resolution is hourly for most outputs. Using meteorological analyses from other GEOS systems, the GEOS-CF products include a running near-real-time estimate of surface pollutant distributions and the composition of the troposphere and stratosphere. A single five-day forecast is generated daily, beginning at 12 UTC.

Only the aqc_tavg_1hr_g1440x721_v1 data collection is approved for this challenge. This collection provides air-quality relevant outputs, including surface-level concentrations of NO2 and PM2.5. You may use either the forecast (fcast) or meteorological replay (assim) versions of this collection. The approved data access location is an OPeNDAP server. The data is provided in NetCDF format and can be read with HDF5 software. Example code is provided below.

The following sample code demonstrates how to access data in Python using the packages Pydap, numpy, and xarray. Specifically, it retrieves the historical estimate of surface-level NO2 for the month of September 2019 for a 2-degree latitude by 2-degree-longitude region over San Francisco.

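```python
import numpy as np
import pydap  # OPeNDAP client; xarray can use it as a backend
import xarray as xr

# OPeNDAP endpoint for the meteorological replay (assim) collection
url = 'https://opendap.nccs.nasa.gov/dods/gmao/geos-cf/assim/aqc_tavg_1hr_g1440x721_v1'
DATASET = xr.open_dataset(url)

s_type = 'no2'  # surface-level NO2 variable
start_time = np.datetime64('2019-09-01 00:00:00')
end_time = start_time + np.timedelta64(30, 'D')

# 2-degree by 2-degree bounding box over San Francisco
minimum_latitude = 37
minimum_longitude = -123
maximum_latitude = 39
maximum_longitude = -121

# Label-based selection on time, latitude, and longitude
DATA_SUBSET = DATASET[s_type].loc[{
    'time': slice(start_time, end_time),
    'lat': slice(minimum_latitude, maximum_latitude),
    'lon': slice(minimum_longitude, maximum_longitude),
}]
```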


Only the approved sources listed above may be used in this challenge track. Any additional data must be reviewed and approved in order to be eligible for use.

If you would like for any additional sources to be approved, you are welcome to submit an official request form and the challenge organizers will review the request. Only select sources that demonstrate a strong case for use will be considered.

To qualify for approval, data sources must meet the following minimum requirements:

• Freely and publicly available to all participants
• Produced reliably by an operational data product
• Does not incorporate reference-grade surface monitor data in any way
• Provides clear value beyond existing approved sources

Keep in mind that data sources used in this challenge cannot be derived from models that use reference-grade monitoring data as input, or include data collected from reference-grade monitors of NO2 or PM2.5 pollutants. Low-cost sensor data may be considered as a separate category, as they explicitly include sensors that are less expensive to manufacture and are typically less accurate and robust compared with reference-grade monitors. Any data in this category must be approved before use, based on clear documentation of how the data is incorporated and a demonstrated case for added value.

Any requests to add approved data sources must be received by February 21 to be considered. An announcement will be made to all challenge participants if your data source has been approved for use.

### Labels (outputs)

Reference-grade ground monitor data will be provided for all training times and geographies as target measures for model output. NO2 is reported in μg/m³, or micrograms per cubic meter.

The training period for NO2 spans Jan 2019 - Nov 2020. The test period spans two disjoint periods: Aug 2018 - Dec 2018 and Nov 2020 - Aug 2021.

train_labels.csv contains the following columns:

• datetime (string): The UTC datetime of the measurement in the format YYYY-MM-DDTHH:mm:ssZ. A value represents the average between 12:00am to 11:59pm local time. The datetime provided represents the start of that 24 hour period in UTC time. Remember that for each observation, you may only use input values that are available before this time.
• grid_id (string): A 5-character alphanumeric ID that uniquely identifies a 5x5 km grid cell
• value (float): A float indicating the average daily reference-grade monitor measurement for NO2

A unique row is identified by the combination of datetime and grid_id.

Additionally, we provide two metadata csv files.

grid_metadata.csv contains metadata about each grid cell and contains the following columns:

• grid_id (string): A 5-character alphanumeric ID that uniquely identifies a 5x5 km grid cell
• location (string): The location associated with a grid cell (Delhi, Los Angeles (SoCAB), or Taipei)
• tz: The timezone used to localize the dates. Note that the dates for Los Angeles (SoCAB) ignore daylight saving time and use the Etc/GMT+8 timezone.
• wkt (WKT): The geometry / polygonal coordinates of the grid cell
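To make the relationship between the UTC datetime in the labels and the local measurement day concrete, here is a minimal sketch using only the Python standard library. It assumes the tz column holds IANA timezone names such as Etc/GMT+8 (the only one named above); the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def parse_utc(label_datetime: str) -> datetime:
    """Parse a label datetime in YYYY-MM-DDTHH:mm:ssZ format as UTC."""
    parsed = datetime.strptime(label_datetime, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def local_day(label_datetime: str, tz_name: str):
    """Return (local start, UTC start, UTC end) of the 24-hour label window."""
    start_utc = parse_utc(label_datetime)
    end_utc = start_utc + timedelta(days=1)
    local_start = start_utc.astimezone(ZoneInfo(tz_name))
    return local_start, start_utc, end_utc


# Etc/GMT+8 is a fixed UTC-8 offset (POSIX-style sign), with no daylight saving
local_start, start_utc, end_utc = local_day("2018-09-08T08:00:00Z", "Etc/GMT+8")
print(local_start.isoformat())  # 2018-09-08T00:00:00-08:00
```

A label stamped 08:00 UTC therefore covers the local calendar day that begins at midnight in the Etc/GMT+8 zone. For locations east of UTC, the UTC start falls on the previous calendar date.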

no2_satellite_metadata.csv contains metadata about hosted satellite data. Each file for a particular dataset is referred to as a granule:

• granule_id (str): The filename for each granule
• time_start (datetime): The start time of the granule in YYYY-MM-DDTHH:mm:ss.sssZ
• time_end (datetime): The end time of the granule in YYYY-MM-DDTHH:mm:ss.sssZ
• product (str): The concise name for the satellite data source
• location (str): One of the three locations for this challenge
• us_url (str): The file location of the granule in the public s3 bucket in the US East (N. Virginia) region
• eu_url (str): The file location of the granule in the public s3 bucket in the Europe (Frankfurt) region
• as_url (str): The file location of the granule in the public s3 bucket in the Asia Pacific (Singapore) region
• cksum (int): The result of running the unix cksum command on the granule
• granule_size (int): The filesize in bytes

The cksum and granule_size columns are especially useful for confirming that downloaded files are not corrupted.

TIP: To identify the corresponding granules for a given observation from train_labels.csv, first use grid_metadata.csv to find the location of the grid cell. Then, using the location and datetime, find the s3 locations of relevant granules in no2_satellite_metadata.csv. Remember that you are not allowed to use future data, so the time_end of a granule must occur on the same day as the given observation.
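The lookup described in this tip can be sketched with the standard library alone. The rows below are made-up stand-ins for the two metadata files (the grid ID, granule filenames, and times are hypothetical), and "same day" is interpreted here as the 24-hour window that starts at the label datetime. That interpretation is an assumption, so confirm the exact cutoff against the competition rules.

```python
from datetime import datetime, timedelta, timezone


def parse_label_time(text: str) -> datetime:
    """Parse a label datetime (YYYY-MM-DDTHH:mm:ssZ) as UTC."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def parse_granule_time(text: str) -> datetime:
    """Parse a granule time with fractional seconds as UTC."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def granules_for_observation(grid_id, label_datetime, grid_rows, granule_rows):
    """Return granule IDs at the cell's location whose time_end falls in the label day."""
    location = next(r["location"] for r in grid_rows if r["grid_id"] == grid_id)
    day_start = parse_label_time(label_datetime)
    day_end = day_start + timedelta(days=1)
    return [
        g["granule_id"]
        for g in granule_rows
        if g["location"] == location
        and day_start <= parse_granule_time(g["time_end"]) < day_end
    ]


# Hypothetical metadata rows, for illustration only
grid_rows = [{"grid_id": "AAAAA", "location": "Los Angeles (SoCAB)"}]
granule_rows = [
    {"granule_id": "granule_a.nc", "location": "Los Angeles (SoCAB)",
     "time_end": "2018-09-08T20:15:00.000Z"},
    {"granule_id": "granule_b.nc", "location": "Los Angeles (SoCAB)",
     "time_end": "2018-09-09T09:00:00.000Z"},
    {"granule_id": "granule_c.nc", "location": "Delhi",
     "time_end": "2018-09-08T10:00:00.000Z"},
]

matches = granules_for_observation("AAAAA", "2018-09-08T08:00:00Z", grid_rows, granule_rows)
print(matches)  # ['granule_a.nc']
```

With the real files, the same filter could be applied to rows loaded from the CSVs, returning the regional URL column of your choice instead of the granule ID.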

## Submission format

The format for submissions is a .csv with columns: datetime, grid_id, value. The combined datetime and grid_id are the row identifiers and should exactly match the provided submission format example. You will fill in the value column with your predictions.

To create a submission, download the submission format and replace the placeholder values in the value column with your predicted values. Prediction values must be floats or they will not be scored correctly.

#### Example

For example, if you predicted a value of 1.0 for the first 5 grid cells for September 8, 2018, your predictions would look like the following:

| datetime | grid_id | value |
| --- | --- | --- |
| 2018-09-08T08:00:00Z | 3A3IE | 1.0 |
| 2018-09-08T08:00:00Z | 3S31A | 1.0 |
| 2018-09-08T08:00:00Z | 7II4T | 1.0 |
| 2018-09-08T08:00:00Z | 8BOQH | 1.0 |
| 2018-09-08T08:00:00Z | A2FBI | 1.0 |

And the first few rows and columns of the .csv file that you submit would look like:

```
datetime,grid_id,value
2018-09-08T08:00:00Z,3A3IE,1.0
2018-09-08T08:00:00Z,3S31A,1.0
2018-09-08T08:00:00Z,7II4T,1.0
2018-09-08T08:00:00Z,8BOQH,1.0
2018-09-08T08:00:00Z,A2FBI,1.0
```


You can see an example of the format that your submission must conform to, including headers and row names, in submission_format.csv.
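As a minimal sketch of producing a file in this shape, the snippet below writes the header and rows with Python's csv module, casting every prediction to float. The `write_submission` helper and the in-memory buffer are assumptions made for the example. In a real workflow the row identifiers would be read from submission_format.csv so that their order and content match exactly.

```python
import csv
import io


def write_submission(row_ids, predictions, out):
    """Write datetime, grid_id, value rows, casting each prediction to float."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["datetime", "grid_id", "value"])
    for dt, grid_id in row_ids:
        writer.writerow([dt, grid_id, float(predictions[(dt, grid_id)])])


# Row identifiers copied from the example above
row_ids = [
    ("2018-09-08T08:00:00Z", "3A3IE"),
    ("2018-09-08T08:00:00Z", "3S31A"),
]
# Integer inputs are deliberately used here to show the float cast
predictions = {key: 1 for key in row_ids}

buffer = io.StringIO()
write_submission(row_ids, predictions, buffer)
print(buffer.getvalue())
```

Passing an open file handle instead of the `io.StringIO` buffer writes the same content to disk.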

In this challenge, predictions for a given test grid cell may use approved features and predictions from other test samples that would be available at the time of inference (i.e., it is not necessary for each test sample to be processed independently without the use of information from other cases in the test set).

#### Requirements for winning solutions

Note that while only grid cells listed in the submission format are required for evaluation during the challenge, winning solutions must be able to produce predictions for the same grid cells on a single new day.

Finalists will need to deliver a solution that includes a trained model that ingests a new date, processes the relevant features, and outputs predicted pollutant concentrations on that date without additional training. Algorithms should be able to run inference within a reasonable runtime of one hour or less to predict one day’s concentrations for all three cities on a single GPU node. Submitted models will be run on an out-of-sample time period to confirm that they are able to execute successfully with comparable performance to the test set.

A readme file accompanying these solutions shall clearly describe the data sources used in both training and inference, including a complete description of where they are incorporated and how they ensure time is treated correctly during inference, along with easy-to-follow instructions for running all parts of the solution. Additional instructions will be provided to finalists at the end of the submission period.

## Performance metric

To measure your model’s performance, we’ll use a metric called the coefficient of determination R2 (R squared). R2 indicates the proportion of the variation in the dependent variable that is predictable from the independent variables. This is an accuracy metric, so a higher value is better.

R2 is defined as:

$$R^2 = 1 - \frac{\textrm{residual sum of squared errors}}{\text{total sum of squared errors}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y_i})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

• $n$ = number of values in the dataset
• $y_i$ = $i$th true value
• $\hat{y_i}$ = $i$th predicted value
• $\bar{y}$ = average of all true $y$ values
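For local validation, the definition above translates directly into a few lines of Python. This is a sketch with made-up values, not the official scoring code; the function name is hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
print(r_squared(y_true, y_pred))  # approximately 0.948
```

Here the residual sum of squares is 26 and the total sum of squares is 500. Predicting the mean of the true values for every row scores 0, and predictions worse than that baseline score below 0.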

For each submission, a secondary metric called Root Mean Square Error (RMSE) will also be reported on the leaderboard. RMSE is the square root of the mean of squared differences between estimated and observed values. This is an error metric, so a lower value is better.

RMSE is defined as:

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^N (y_i - \hat{y_i})^2 }$$

where:

• $\hat{y_i}$ is the $i$th predicted value
• $y_i$ is the $i$th true value
• $N$ is the number of samples
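The same made-up values can illustrate RMSE. This sketch averages the squared differences over all $N$ samples and takes the square root; the function name is hypothetical and this is not the official scoring code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error over paired true and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
print(rmse(y_true, y_pred))  # approximately 2.55
```

The squared errors sum to 26 across 4 samples, so the result is the square root of 6.5, expressed in the same units as the labels.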

While both R2 and RMSE will be reported, only R2 will be used to determine your official ranking and prize eligibility.

## Good luck!

Good luck and enjoy this problem! If you're looking for a place to start, you can check out our blog post tutorial on predicting NO2 with OMI data. If you have any questions you can always visit the user forum!