
# Development stage

This page outlines the data and submission format for Stage 1: Model Development. This stage provides an opportunity to experiment with modeling approaches and data sources, with a live leaderboard providing feedback based on historical data.

## Data

The features for this competition include remote sensing data, snow measurements captured by volunteer networks, climate information, and a narrow set of ground measures.

For model training, you are given snow water equivalent (SWE) measures for 2013-2019 for approximately 11K 1km x 1km grid cells identified in train_labels.csv. You will then use your model to make predictions for 2020-2021 for approximately 9K grid cells identified in submission_format.csv.

### Approved features (inputs)

All data sources and access locations detailed below are pre-approved for both model training and inference. These data have undergone a careful selection process and are freely and publicly available. Note that only the listed data products and access locations are pre-approved for use. You must go through an approved access location when using these sources for inference.

If you are interested in using additional data sources or access locations, see the process for requesting additional data sources below.

#### Ground measures

Ground measures can help to provide regularly collected, highly accurate point estimates of SWE at designated stations.

SNOTEL: The Snow Telemetry (SNOTEL) program consists of automated and semi-automated data collection sites across the Western U.S.

CDEC: The California Data Exchange Center (CDEC) facilitates the collection, storage, and exchange of hydrologic and climate information to support real-time flood management and water supply needs in California. CDEC operates data collection sites similar to SNOTEL within California.

Ground-based sites from SNOTEL and CDEC are used both as an optional input data source and as part of the ground truth labels for this competition. Sites used for evaluation are entirely distinct from those in the features data.

Approved data access location: When using these sources, you are only permitted to use the data contained in the provided ground measures csvs. On the data download page, you will find ground_measures_train_features.csv which contains SNOTEL and CDEC data for the train period, and ground_measures_test_features.csv which contains SNOTEL and CDEC data for the test period.

In addition, ground_measures_metadata.csv contains metadata about the SNOTEL and CDEC sites, with six columns:

• station_id (str): unique site identifier
• name (str): site name
• elevation_m (float): elevation in meters
• latitude (float): latitude
• longitude (float): longitude
• state (str): state
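As a sketch of how the measurement and metadata files might be combined, the snippet below joins toy measurement rows to toy metadata on `station_id`. The station IDs and values here are made up for illustration; check the real column layout against the provided CSVs.

```python
import pandas as pd

# Hypothetical miniature stand-ins for ground_measures_train_features.csv
# (one row per station, one column per collection date) and
# ground_measures_metadata.csv; values are invented.
features = pd.DataFrame(
    {"station_id": ["STATION:A", "STATION:B"],
     "2013-01-01": [4.2, 7.9]}
)
metadata = pd.DataFrame(
    {"station_id": ["STATION:A", "STATION:B"],
     "elevation_m": [2164.0, 2667.0],
     "latitude": [38.2, 38.5],
     "longitude": [-119.6, -119.8],
     "state": ["California", "California"]}
)

# Attach station metadata to the measurements so elevation and location
# can be used as model features alongside the SWE readings.
joined = features.merge(metadata, on="station_id", how="left")
```

A left merge keeps every measurement row even if a station were missing from the metadata, which makes gaps easy to spot.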

#### Remote sensing data

The features for this competition include remote sensing images from satellites. These data sources provide regular monitoring of land surfaces.

MODIS Terra MOD10A1 and Aqua MYD10A1: MODIS/Terra and MODIS/Aqua Snow Cover Daily L3 Global 500m SIN Grid. Terra's orbit around the Earth is timed so that it passes from north to south across the equator in the morning, while Aqua passes south to north over the equator in the afternoon. Snow-covered land typically has very high reflectance in visible bands and very low reflectance in shortwave infrared bands. The Normalized Difference Snow Index (NDSI) reveals the magnitude of this difference. The snow cover algorithm calculates NDSI for all land and inland water pixels in daylight using MODIS band 4 (visible green) and band 6 (shortwave near-infrared).
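The NDSI calculation described above is simple to reproduce. The sketch below assumes band reflectances are already available as fractions in [0, 1]; the helper name `ndsi` is ours, not part of any MODIS product.

```python
import numpy as np

def ndsi(green, swir):
    """Normalized Difference Snow Index from visible-green and
    shortwave-infrared reflectances (MODIS bands 4 and 6)."""
    green = np.asarray(green, dtype=float)
    swir = np.asarray(swir, dtype=float)
    return (green - swir) / (green + swir)

# Snow reflects strongly in green and weakly in shortwave infrared,
# so snowy pixels push NDSI toward 1; bare ground sits near 0.
snowy = ndsi(0.8, 0.1)
bare = ndsi(0.3, 0.25)
```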

Landsat 8 Collection 2 Level-2: Landsat, a joint NASA/USGS program, provides the longest continuous space-based record of Earth’s land in existence. Landsat Collection 2 includes scene-based global Level-2 surface reflectance and surface temperature science products.

Sentinel 2 Level-2A: Sentinel-2 is a wide-swath, high-resolution, multi-spectral imaging mission, supporting the monitoring of vegetation, soil and water cover, as well as observation of inland waterways and coastal areas. The Sentinel-2 Multispectral Instrument (MSI) samples 13 spectral bands: four bands at 10 meters, six bands at 20 meters, and three bands at 60 meters spatial resolution. The mission provides global coverage of the Earth's land surface every 5 days, making the data of great use in ongoing studies.

Sentinel 1 Ground Range Detected (GRD) Data: The Sentinel-1 mission is a constellation of C-band Synthetic Aperture Radar (SAR) satellites from the European Space Agency, launched beginning in 2014. These satellites collect observations of radar backscatter intensity day or night, regardless of the weather conditions, making them enormously valuable for environmental monitoring. These radar data have been detected, multi-looked, and projected to ground range using an Earth ellipsoid model. Ground range coordinates are the slant range coordinates projected onto the ellipsoid of the Earth, where pixel values represent detected magnitude.

Sentinel 1 Terrain Corrected Data: The Google Earth Engine Sentinel-1 collection includes GRD scenes that have been processed to generate a calibrated, ortho-corrected product. Each GRD scene has one of three resolutions (10, 25, or 40 meters), four band combinations, and three instrument modes. Each scene was processed using thermal noise removal, radiometric calibration, and terrain correction using SRTM 30, or the ASTER DEM for areas above 60 degrees latitude where SRTM is not available. The final terrain-corrected values are converted to decibels via log scaling.
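The log scaling mentioned above can be sketched as follows. The helper names are ours; before applying either conversion in a real pipeline, confirm whether the product you downloaded is already in decibels.

```python
import numpy as np

def to_db(intensity):
    """Convert linear backscatter intensity to decibels (10 * log10),
    the log scaling used for terrain-corrected Sentinel-1 values."""
    return 10.0 * np.log10(np.asarray(intensity, dtype=float))

def to_linear(db):
    """Invert the scaling, e.g. before averaging backscatter values
    (averaging should happen in linear units, not decibels)."""
    return 10.0 ** (np.asarray(db, dtype=float) / 10.0)
```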

#### Climate data

HRRR: The High-Resolution Rapid Refresh (HRRR) is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation.

#### Digital surface data

Tip: when a data source has multiple approved access locations, download from the one that is geographically closest to you to reduce download times.

Copernicus DEM (90 meter resolution): The Copernicus Digital Elevation Model (DEM) is a digital surface model (DSM), which represents the surface of the Earth including buildings, infrastructure, and vegetation. This DSM is based on radar satellite data acquired during the TanDEM-X Mission.

Climate Research Data Package (CRDP) Land Cover Gridded Map (300 meter resolution): The CRDP land cover map (2020) classifies land surface into 22 classes, which have been defined using the United Nations Food and Agriculture Organization's Land Cover Classification System (LCCS). This map is based on data from the Medium Resolution Imaging Spectrometer (MERIS) sensor on board the polar-orbiting Envisat-1 environmental research satellite by the European Space Agency. This data comes from the CCI-LC database hosted by the ESA Climate Change Initiative's Land Cover project.

Approved data access locations:

• s3://drivendata-public-assets/land_cover_map.tar.gz (US)
• s3://drivendata-public-assets-eu/land_cover_map.tar.gz (Europe)
• s3://drivendata-public-assets-asia/land_cover_map.tar.gz (Asia)

CRDP Water Bodies Map (150 meter resolution): The CRDP water bodies map (2000) classifies areas into land and water. This map is based on the Envisat ASAR water bodies indicator from the Global Forest Change and the Global Inland Water data products.

Approved data access locations:

• s3://drivendata-public-assets/water_bodies_map.tar.gz (US)
• s3://drivendata-public-assets-eu/water_bodies_map.tar.gz (Europe)
• s3://drivendata-public-assets-asia/water_bodies_map.tar.gz (Asia)

CRDP Burned Areas Occurrence Map (500 meter resolution): The CRDP burned areas occurrence map presents the percentage of burned areas occurrence as detected over the 2000-2012 period on a 7-day basis. Data originate from the GFEDv3 dataset. This data product is composed of two series of 52 layers (one per week).

Approved data access locations:

• s3://drivendata-public-assets/burned_areas_occurrence_map.tar.gz (US)
• s3://drivendata-public-assets-eu/burned_areas_occurrence_map.tar.gz (Europe)
• s3://drivendata-public-assets-asia/burned_areas_occurrence_map.tar.gz (Asia)

FAO-UNESCO Global Soil Regions Map: The global soil regions map (2005) shows the global distribution of the 12 soil orders according to the Soil Taxonomy. This map is based on a reclassification of the FAO-UNESCO Soil Map of the World combined with a soil climate map.

Approved data access locations:

• s3://drivendata-public-assets/soil_regions_map.tar.gz (US)
• s3://drivendata-public-assets-eu/soil_regions_map.tar.gz (Europe)
• s3://drivendata-public-assets-asia/soil_regions_map.tar.gz (Asia)
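Once downloaded (e.g. with the AWS CLI or an S3 client of your choice), these .tar.gz archives can be unpacked with the standard library. The sketch below uses placeholder file names and builds a synthetic archive so it runs end-to-end; the real archives' contents may differ.

```python
import tarfile
import tempfile
from pathlib import Path

def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive and list the extracted file names."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    return sorted(p.name for p in Path(dest_dir).iterdir())

# Demo with a synthetic archive standing in for a downloaded map file.
with tempfile.TemporaryDirectory() as tmp_str:
    tmp = Path(tmp_str)
    (tmp / "soil_regions_map.tif").write_bytes(b"placeholder")
    archive = tmp / "soil_regions_map.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(tmp / "soil_regions_map.tif", arcname="soil_regions_map.tif")
    out = tmp / "extracted"
    out.mkdir()
    names = extract_archive(archive, out)
```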

Additional data sources may be explored and incorporated during model training. However, only pre-approved data sources are allowed for generating predictions during real-time evaluation.

If you would like for any additional sources to be approved for use during inference, you are welcome to submit an official request form and the challenge organizers will review the request. Only select sources that demonstrate a strong case for use will be considered.

To qualify for approval, a data source must:

• Be freely and publicly available to all participants
• Be produced reliably by an operational data product
• Not incorporate SNOTEL or Airborne Snow Observatory (ASO) data
• Provide clear value beyond existing approved sources

Any requests to add approved data sources must be received by January 18 to be considered. An announcement will be made to all challenge participants if your data source has been approved for use.

#### Sample sources

As you explore additional data sources during model training, the following data sources may offer a useful place to start. Keep in mind that not all of these sample sources are eligible for approved use in the real-time stage. See details in each section below.

NRCS Snow Course: Snow Course provides historical data with accurate measures of SWE. Since Snow Course measures are co-located with SNOTEL stations, this data will not be made available for inference but may be used in model training.

**Snow campaign data**

You may optionally leverage the following data collected for specific campaigns for model training. Note that these data are highly localized and do not cover all geographies or timeframes in the training and test sets. They are not produced reliably by an operational data product so are not approved for real-time evaluation.

**Volunteer-based ground measures**

You may optionally leverage a set of crowdsourced ground measures in model training. Volunteer programs have been developed to supplement in situ snow measurements, providing near real-time information to support forecasts, warnings and alerts, and other public service programs. Consistent validation and spatiotemporal coverage are not guaranteed. If you choose to use any of these sources for real-time prediction, you must submit an official request form.

Additional remote sensing products are available to explore during model development. If you would like to use any of these sources for real-time prediction, you must submit an official request form.

You may also explore additional climate data sources during model development. A few options are listed below. If you would like to use any of these sources for real-time prediction, you must submit an official request form.

### Labels (outputs)

The labels are the SWE measurements in inches for each grid cell collected over a set period of time. For the Model Development Stage (Stage 1), we provide a labels file, train_labels.csv, which contains columns for:

• cell_id (type: str): grid cell ID
• YYYY-MM-DD (type: float): one column per collection date, containing the SWE measurement in inches for that cell and date

Training labels are provided for December 1 to June 30 of each year from 2013 to 2019. Labels are derived from a combination of ground-based snow telemetry (i.e., SNOTEL and CDEC sites) and airborne measurements (i.e., ASO data).
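Because the labels file is "wide" (one column per date), reshaping it to one row per (cell, date) observation is often a convenient first step for modeling. The sketch below uses a toy two-cell, two-date stand-in for train_labels.csv; the real file is much larger but has the same shape.

```python
import io
import pandas as pd

# Toy stand-in for train_labels.csv: one row per grid cell,
# one column per weekly collection date, SWE in inches.
wide = pd.read_csv(io.StringIO(
    "cell_id,2013-12-03,2013-12-10\n"
    "cell-a,2.0,3.5\n"
    "cell-b,0.0,0.1\n"
))

# Melt to long format: one (cell, date, swe) row per observation.
labels_long = wide.melt(id_vars="cell_id", var_name="date", value_name="swe_in")
labels_long["date"] = pd.to_datetime(labels_long["date"])
```

The long format makes it easy to sort by date within each cell and exploit the serial correlation noted above.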

Keep in mind that SWE is a cumulative process, meaning temporal (serial) correlation is expected.

grid_cells.geojson contains the coordinates and region information for all grid cells in train_labels.csv and submission_format.csv. It has the following keys:

• crs: Coordinate Reference System (CRS) for grid cells (value of EPSG:4326/WGS84 CRS)
• features: Contains cell_id, region, and geometry for each grid cell. Region options include "sierras", "central rockies", and "other" and will be used to calculate regional prizes.
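A minimal sketch of reading the region information, using a toy GeoJSON-shaped dict that mirrors the structure described above (real cell IDs and geometries come from grid_cells.geojson):

```python
from collections import Counter

# Toy stand-in for grid_cells.geojson, with the keys described above.
geojson = {
    "crs": {"properties": {"name": "EPSG:4326"}},
    "features": [
        {"properties": {"cell_id": "cell-a", "region": "sierras"},
         "geometry": {"type": "Polygon", "coordinates": [[[-120.0, 39.0],
             [-119.99, 39.0], [-119.99, 39.01], [-120.0, 39.01],
             [-120.0, 39.0]]]}},
        {"properties": {"cell_id": "cell-b", "region": "other"},
         "geometry": {"type": "Polygon", "coordinates": [[[-110.0, 44.0],
             [-109.99, 44.0], [-109.99, 44.01], [-110.0, 44.01],
             [-110.0, 44.0]]]}},
    ],
}

# Map each grid cell to its region, e.g. to track per-region validation error.
cell_region = {f["properties"]["cell_id"]: f["properties"]["region"]
               for f in geojson["features"]}
region_counts = Counter(cell_region.values())
```

For the real file, `json.load` (or geopandas, if you prefer working with geometries directly) produces the same nested structure.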

## Submission format

The submission file is a .csv with a cell_id column and one column for each weekly collection date on which SWE predictions are needed. The prediction period covers January-June 2020 and December 2020-June 2021. Predicted swe values must be floating point numbers; for example, 1.0 is a valid float, but 1 is not.

#### Example

For example, if you predicted a swe of 1.0 inch for the first five cells across the first 2 weeks, your predictions would look like the following:

| cell_id | 2020-01-07 | 2020-01-14 | ... | 2021-06-29 |
|---|---|---|---|---|
| 000863e7-21e6-477d-b799-f5675c348627 | 1.0 | 1.0 | ... | 1.0 |
| 000ba8d9-d6d5-48da-84a2-1fa54951fae1 | 1.0 | 1.0 | ... | 1.0 |
| 00146204-d4e9-4cd8-8f86-d1ef133c5b6d | 1.0 | 1.0 | ... | 1.0 |
| 00211c19-7ea8-4f21-a2de-1d6216186a96 | 1.0 | 1.0 | ... | 1.0 |
| 00226e82-e747-4f03-9c5d-3eef8ebe515e | 1.0 | 1.0 | ... | 1.0 |

And the first few rows and columns of the .csv file that you submit would look like:

cell_id,2020-01-07,2020-01-14
000863e7-21e6-477d-b799-f5675c348627,1.0,1.0
000ba8d9-d6d5-48da-84a2-1fa54951fae1,1.0,1.0
00146204-d4e9-4cd8-8f86-d1ef133c5b6d,1.0,1.0
00211c19-7ea8-4f21-a2de-1d6216186a96,1.0,1.0
00226e82-e747-4f03-9c5d-3eef8ebe515e,1.0,1.0


You can see an example of the format that your submission must conform to, including headers and row names, in submission_format.csv.
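One way to guarantee the right rows, columns, and float formatting is to load the submission format, overwrite its values with your predictions, and write it back out. The snippet below sketches this with a tiny in-memory stand-in for submission_format.csv and a constant prediction of 1.0.

```python
import io
import pandas as pd

# Tiny stand-in for submission_format.csv (real file has ~9K cells and
# all weekly dates for the test period).
fmt = io.StringIO(
    "cell_id,2020-01-07,2020-01-14\n"
    "cell-a,0.0,0.0\n"
    "cell-b,0.0,0.0\n"
)
submission = pd.read_csv(fmt, index_col="cell_id")

# Replace the placeholder values with model predictions (constant here).
submission.loc[:, :] = 1.0

# float_format guards against integer-looking values like "1",
# which the submission format does not accept.
csv_text = submission.to_csv(float_format="%.4f")
```

Keeping `cell_id` as the index preserves the original row order and header names from the format file.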

In this challenge, predictions for a given test grid cell may use approved features and predictions from other test samples if desired (i.e., it is not necessary for each test sample to be processed independently without the use of information from other cases in the test set).

## Performance metric

To measure your model’s performance, we’ll use a metric called Root Mean Square Error (RMSE), which is a measure of accuracy and quantifies differences between estimated and observed values. RMSE is the square root of the mean of the squared differences between the predicted values and the actual values. This is an error metric, so a lower value is better.

RMSE is defined as:

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y_i})^2 }$$

where:

• $\hat{y_i}$ is the $i$th predicted value
• $y_i$ is the $i$th true value
• $N$ is the number of samples

This metric is implemented in scikit-learn as mean_squared_error, with the squared parameter set to False.
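The formula is also a one-liner in numpy, which is handy for local validation without any extra dependencies:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, matching the RMSE formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Diffs: 0.5, 0.0, -0.5, -1.0 -> squared mean 0.375 -> RMSE = sqrt(0.375)
score = rmse([3.0, 1.5, 0.0, 4.0], [2.5, 1.5, 0.5, 5.0])
```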

For each submission, a secondary metric called the coefficient of determination (R²) will also be reported on the leaderboard for added interpretability. R² indicates the proportion of the variation in the dependent variable that is predictable from the independent variables. This is an accuracy metric, so a higher value is better.

$$R^2 = 1 - \frac{\textrm{residual sum of squares}}{\textrm{total sum of squares}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y_i})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

• $n$ = number of values in the dataset
• $y_i$ = $i$th true value
• $\hat{y_i}$ = $i$th predicted value
• $\bar{y}$ = average of all true $y$ values
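The definition above translates directly to numpy (scikit-learn's r2_score computes the same quantity):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination, matching the R^2 formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)   # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)
```

Perfect predictions give R² = 1; always predicting the mean of the true values gives R² = 0.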

While both RMSE and R² will be reported, only RMSE will be used to determine your official ranking and prize eligibility.

## Good luck!

Good luck and enjoy this problem! If you have any questions you can always visit the user forum!