Snowcast Showdown: Evaluation Stage

Seasonal snowpack is a critical water resource throughout the Western U.S. Help the Bureau of Reclamation estimate snow water equivalent (SWE) at a high spatiotemporal resolution using near real-time data sources.

$500,000 in prizes
Jul 2022

Problem description

The goal of this challenge is to build algorithms that can estimate snow water equivalent (SWE) at 1km resolution across the entire Western U.S. The data consist of a variety of pre-approved data sources listed below, including remote sensing data, snow measurements captured by volunteer networks, observations collected for localized campaigns, and climate information.

Competition timeline


This competition includes two tracks. The first track, the Prediction Competition, is divided into two stages; Stage 2 itself has two parts, Submission Testing and Real-time Evaluation.

  • Prediction Competition (Dec 7 - Jul 1): Train a model to estimate SWE
      • Stage 1: Model Development (Dec 7 - Feb 15; no prizes awarded): Scores are run on submissions against a historical ground truth dataset and displayed on the public leaderboard
      • Stage 2a: Submission Testing (Jan 11 - Feb 15; no prizes awarded): Test the data processing and inference pipeline used to generate predictions; submit final code and model weights
      • Stage 2b: Real-time Evaluation (Feb 15 - Jul 1): Scores are run on weekly submissions against a near real-time ground truth dataset; cumulative scores are displayed on the public leaderboard
  • Model Report Competition (Dec 7 - Mar 15): Write a report on solution methodology


Track 1: Prediction Competition (Dec 7 - Jul 1)

In the Prediction Competition, you will train a model to estimate SWE. Stage 1 provides an opportunity to build your machine learning pipeline and evaluate its performance on a historical dataset. During Stage 2, you will package everything needed to perform inference on new data, then run your model each week to generate and submit predictions until the end of the competition. You may not update your model weights after February 15, but you may implement minor fixes to your data processing pipeline to resolve any unforeseen changes to the input data (e.g., format, type, or API updates).

Stage 1: Model Development (Dec 7 - Feb 15)

During Stage 1 of the competition, build your data processing and integration pipelines, train your model using data collected up to the point of prediction, and evaluate model performance on a set of historical test labels. Use this opportunity to experiment with different data sources, refine your model architecture, and optimize computational performance. Establish a strong foundation for the Model Evaluation Stage (Stage 2) of the competition.
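As one illustration of training only on data collected up to the point of prediction, the minimal sketch below splits a dataset by date so that no future observations leak into training. The DataFrame, column names, and cutoff date are assumptions for illustration, not part of the challenge materials.

```python
# Minimal sketch of a time-aware split, assuming a hypothetical DataFrame with
# a `date` column, one feature column, and an observed `swe` label.
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-02-01", "2020-01-01", "2020-02-01"]),
    "snow_depth": [12.0, 30.0, 10.0, 28.0],   # example feature values
    "swe": [3.0, 9.0, 2.5, 8.5],              # example label values
})

cutoff = pd.Timestamp("2020-01-01")           # the "point of prediction"
train = df[df["date"] < cutoff]               # only data available before the cutoff
holdout = df[df["date"] >= cutoff]            # later observations held out for evaluation
```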

A public leaderboard will be available to facilitate model development, but prizes will not be awarded during this stage of the competition. Participating in Stage 1 before Stage 2 opens is not mandatory, though it is recommended.

For all the details you need to get started, see the Development Stage page.

Stage 2: Model Evaluation (Jan 11 - Jul 1)

Stage 2a: Submission Testing (Jan 11 - Feb 15): Package everything needed to perform weekly inference on a set of 1km grid cells. Use this stage to make any final model improvements, ensure that your model works with the pre-approved data sources, and verify that it generates predictions in the correct format. A sandbox environment will be made available to test your submission. Before the end of this stage, submit a single archive, including your model code and ❄ frozen model weights ❄, to be eligible to participate in the Real-time Evaluation Stage (Stage 2b).
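As a rough example, freezing model weights ahead of the deadline could look like the sketch below. The scikit-learn estimator, joblib serialization, and file name are assumptions, not a required workflow.

```python
# Minimal sketch of "freezing" model weights before submission, assuming a
# scikit-learn-style estimator; the model, data, and file name are placeholders.
import joblib
from sklearn.linear_model import Ridge

model = Ridge().fit([[1.0], [2.0], [3.0]], [3.0, 9.0, 2.5])  # stand-in training step
joblib.dump(model, "model_weights.pkl")  # include this file in the submitted archive

# Each subsequent weekly run loads the same frozen weights:
frozen = joblib.load("model_weights.pkl")
print(frozen.predict([[4.0]]))
```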

Stage 2b: Real-time Evaluation (Feb 15 - Jul 1): See how your model performs! After the ❄ model freeze ❄ (Feb 15), run your model each week to generate and submit near real-time predictions throughout the winter and spring season. Predictions will be evaluated against ground truth labels as they become available, and your position on the public leaderboard will be updated each week. You may need to implement minor fixes to your data processing pipeline to resolve unforeseen changes to the input data (e.g., format, type, or API updates) during this period; however, you may not update your model weights. You may be required to re-submit your model code and model weights at the end of the competition to determine prize eligibility.

Prizes will be awarded based on final private leaderboard rankings and code verification. Separate prizes will be awarded for regional performance in the Sierras and Central Rockies, and for overall performance during the evaluation period (Stage 2b).

Track 2: Model Report Competition (entries due by Mar 15)

If you are participating in the Real-time Evaluation Stage of the Prediction Competition (Stage 2b), you are eligible to submit a model report that details your solution methodology. In addition to obtaining the best possible predictions of SWE, the Bureau of Reclamation is specifically interested in model interpretability and gaining a deep understanding of which models are the most robust over a broad range of geographies and climates (based on historical data).

Final winners will be selected by a judging panel. For more information, see the Model Report description below.

Prediction competition


The goal of the Prediction Competition is to develop algorithmic approaches for most accurately estimating the spatial distribution of SWE at high resolution across the Western U.S. Participants will draw on near real-time data sources to generate weekly, 1km x 1km gridded estimates that will be compared against ground measures.

See the Evaluation Stage description for information on the ground-based, satellite, and climate data available for Stage 2: Model Evaluation.

Submission

Models will output a .csv with columns for the cell_id and predicted swe for each grid cell in the submission format. For the specific format and an example for Stage 2, see the Evaluation Stage submission description.
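As a rough illustration, a submission file could be assembled along the lines of the sketch below. The file names, the exact column layout, and the placeholder model are assumptions; defer to the Evaluation Stage submission description for the authoritative format.

```python
# Minimal sketch of filling out a submission CSV, assuming a local copy of the
# submission format with `cell_id` and `swe` columns and a placeholder model.
import pandas as pd

def predict_swe(cell_id: str) -> float:
    """Placeholder: replace with your trained pipeline's SWE estimate for this cell."""
    return 0.0

submission = pd.read_csv("submission_format.csv")            # one row per required grid cell
submission["swe"] = submission["cell_id"].map(predict_swe)   # fill in predicted SWE
submission.to_csv("submission.csv", index=False)
```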

Note that while only grid cells listed in the submission format are required for evaluation during the challenge, winning solutions must be able to produce reasonable predictions for every 1km x 1km cell in the Western U.S. (including those outside of the submission format). Finalists will need to deliver a solution that conforms with this requirement in order to be eligible for prizes.

Winners will also need to report the time it takes for their solutions to run. Solutions that can generate predictions for the entire Western U.S. in 24 hours or less (ideally in 6 hours or less) will be of greater operational value to Reclamation.
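One simple way to capture that runtime figure is to wrap the full pipeline in a timer, as in the sketch below; the `generate_predictions` entry point is a hypothetical stand-in for your own pipeline.

```python
# Minimal sketch of timing an end-to-end prediction run; the entry point
# below is a placeholder for the full data download, processing, and inference.
import time

def generate_predictions() -> None:
    """Placeholder for running the full pipeline over all required grid cells."""
    pass

start = time.perf_counter()
generate_predictions()
elapsed_hours = (time.perf_counter() - start) / 3600
print(f"Full run completed in {elapsed_hours:.2f} hours")
```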

Evaluation

To measure your model’s performance, we’ll use a metric called Root Mean Square Error (RMSE), which is a measure of accuracy and quantifies differences between estimated and observed values. RMSE is the square root of the mean of the squared differences between the predicted values and the actual values. This is an error metric, so a lower value is better.

For each submission, a secondary metric called the coefficient of determination R2 will also be reported on the leaderboard for added interpretability. R2 indicates the proportion of the variation in the dependent variable that is predictable from the independent variables. This is an accuracy metric, so a higher value is better.

While both RMSE and R2 will be reported, only RMSE will be used to determine your official ranking and prize eligibility.
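The sketch below shows how the two leaderboard metrics are computed from predictions and observations using their standard definitions; the arrays are made-up example values.

```python
# Minimal sketch of the two reported metrics, using made-up example values.
import numpy as np

y_true = np.array([10.2, 4.7, 0.0, 21.5])   # observed SWE
y_pred = np.array([9.8, 5.1, 0.3, 19.9])    # predicted SWE

rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))   # error metric: lower is better

ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1.0 - ss_res / ss_tot                        # accuracy metric: higher is better

print(f"RMSE: {rmse:.3f}  R2: {r2:.3f}")
```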

Model report


In addition to obtaining the best possible predictions of SWE, the Bureau of Reclamation is interested in gaining a deeper understanding of modeling approaches in two primary areas:

  • Interpretability is the degree to which a person can understand the outcome of the solution methodology given its inputs. For example, a solution that relies on physically-based modeling should outline a set of well-defined model governing equations.

  • Robustness is the extent to which solutions provide high performance over a broad range of conditions, such as different elevations, topographies, land cover, weather, and/or climates. Robustness may be evaluated by comparing solution performance over different regions or time periods (based on historical data).

Submission

Each report will consist of up to 8 pages, excluding visualizations and figures, delivered in PDF format. Incorporating graphics is encouraged and does not count against the page limit. Winners will also be responsible for submitting any accompanying code (e.g., notebooks used to visualize data).

At a minimum, write-ups should discuss data source selection, feature engineering, algorithm architecture, evaluation metric(s), generalizability, limitations, and machine specs for optimal performance. Reports that help illuminate which signal(s) in the data are most responsible for measuring SWE will be prioritized.

See the Model Report Template for further information.

Each participant or team that successfully submits a model to the Real-time Evaluation Stage of the Prediction Competition is eligible to submit a report. Submissions should be made through the Model Report Submission page in the Development Stage. Data and submissions for the Development Stage will also remain open through the report submission deadline to facilitate continued exploration.

Evaluation

Model report prizes will be awarded to the top 3 write-ups selected by a panel of judges including domain and data experts from the Bureau of Reclamation and collaborators.

Meet the judges!

The judging panel will evaluate each report based on the following criteria. Reports may also be filtered by leaderboard performance to ensure the underlying models perform well enough to be useful.

  • Interpretability (40%): To what extent can a person understand the outcome of the solution methodology given its inputs?
  • Robustness (40%): Do solutions provide high performance over a broad range of geographic and climate conditions?
  • Rigor (10%): To what extent is the report built on sound, sophisticated quantitative analysis and a performant statistical model?
  • Clarity (10%): How clearly are underlying patterns exposed, communicated, and visualized?

Note: Judging will be done primarily on a technical basis, rather than on grammar and language (since many participants may not be native English speakers).

Good luck!


Good luck and enjoy this problem! If you have any questions, you can always visit the user forum!