Water Supply Forecast Rodeo: Final Prize Stage

Water managers in the Western U.S. rely on accurate water supply forecasts to better operate facilities and mitigate drought. Help the Bureau of Reclamation improve seasonal water supply estimates in this probabilistic forecasting challenge!

Cross-validation submission format

This page documents the requirements for your cross-validation submission. You should submit your cross-validation predictions as a CSV file through the "Submissions" link on the left-hand navigation menu.

Your best-scoring submission will automatically be considered as your final submission. If you run into any issues, please contact DrivenData by email.

What is cross-validation?

Cross-validation is a model validation technique in statistics and supervised machine learning that is especially useful in situations where the amount of labeled data is limited.

Cross-validation contrasts with the simpler hold-out validation approach to estimating how well a model generalizes. In hold-out validation, you split your dataset into a training set and a test set: the model is trained on the training set and then evaluated on the held-out test set.

In contrast, cross-validation involves training and evaluating the model over multiple train/test splits. You split your labeled dataset into multiple "folds" (portions). Then, you iterate over the folds, using one fold as if it were a test set and the remaining folds as a training set. This way, every observation in your dataset is used as a test observation in exactly one iteration of the cross-validation, so you are able to evaluate on all of your labeled data.

Cross-validation is often used for hyperparameter selection. However, on its own, it is still a generally useful technique for evaluating models because it allows you to evaluate on a larger quantity of labeled data. To read more about cross-validation, see the Wikipedia article.
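
As a minimal, generic illustration (using scikit-learn's KFold; this is not the year-wise setup required for this challenge, which is described below):

import numpy as np
from sklearn.model_selection import KFold

X, y = np.arange(20).reshape(-1, 1), np.arange(20)   # toy data
for train_idx, test_idx in KFold(n_splits=5).split(X):
    # Fit on the training folds and evaluate on the held-out fold.
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]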

Cross-validation requirements

This section describes the cross-validation setup specific to the Final Prize Stage of this challenge. For this stage, we are requiring you to perform year-wise leave-one-out cross-validation (LOOCV) over the 20-year period from 2004 through 2023. Each cross-validation fold consists of one water year. There will be 20 iterations, each with one water year held out for testing. A diagram illustrating the setup is below.

[Diagram] Expected year-wise leave-one-out cross-validation: in each cross-validation iteration, the indicated water year is held out as a test set and the rest are used for training.

This approach of year-wise LOOCV is a standard evaluation methodology in statistical water supply forecast modeling.

If you are training on the supplemental NRCS naturalized flow data, be sure to appropriately exclude the test water year from your training data for each cross-validation iteration.
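
For example, assuming the supplemental data is loaded into a pandas DataFrame with a water_year column (illustrative file and column names, not challenge-provided code), the exclusion for one cross-validation iteration could look like:

import pandas as pd

nrcs = pd.read_csv("nrcs_naturalized_flow.csv")       # assumed file name
test_year = 2015                                      # the held-out water year
nrcs_train = nrcs[nrcs["water_year"] != test_year]    # drop the test year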

Time and data use

The task in this challenge is not a standard time series modeling task. There are two levels at which to consider time in this challenge, each with its own requirement:

  1. Across water years—water years are treated as statistically independent.
  2. Within water years—no future data should be used with respect to the issue date.

In this challenge, we use the concept of a "water year" to identify years. A water year in hydrology is defined as a 12-month period from October 1 to September 30 identified by the year in which it ends. For example, the 2005 water year begins on October 1, 2004 and ends on September 30, 2005. So, if you are issuing a forecast on 2005-03-15 for the seasonal water supply for 2005, the forecast and the seasonal water supply value are associated with the 2005 water year.
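
For example, a small helper along these lines (a sketch, not challenge-provided code) maps a date to its water year:

from datetime import date

def water_year(d: date) -> int:
    # A water year runs October 1 through September 30 and is named for
    # the calendar year in which it ends.
    return d.year + 1 if d.month >= 10 else d.year

water_year(date(2004, 10, 1))   # 2005
water_year(date(2005, 3, 15))   # 2005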

Please review the detailed explanations in the following two sections carefully. These requirements are present to prevent data leakage and ensure that models generalize.

Independent water years

At the level of years, this challenge treats water years as statistically independent observations rather than as a temporal sequence. You should think of this as standard regression modeling rather than time series modeling.

For the cross-validation, this means that we are splitting the dataset at water year boundaries, as described in the "Cross-validation requirements" section. Some consequences of this framing are:

  • It is valid to predict on a test year with a model whose training data includes water years in the future. For example, in a cross-validation iteration where 2008 is the test year, you can include water years 2009, 2010, 2011, ..., through 2023 as training data.
  • You should be careful not to treat years as ordered or to use year as a variable. Your model should not make use of the temporal relationship between years.

In the simplest case, we assume that your features are calculated from lookback windows that do not cross any water year boundary (i.e., they do not extend back beyond October 1 of the previous calendar year). If this describes your feature approach, then your cross-validation should match the diagram in the "Cross-validation requirements" section above. For example, if you are issuing a forecast for 2015-03-15 and your lookback window for calculating features spans from 2014-10-01 to 2015-03-14, inclusive, then you are treating single water years independently, and you should train your 2015-water-year model on the 19 water years that exclude 2015.

You are permitted to use longer lookback windows that cross water year boundaries when calculating features, but you must adjust your cross-validation accordingly to ensure that there is no overlap between training and test data within a cross-validation iteration. For example, if you are predicting for an issue date of 2015-03-15 with 2015 as the test water year, and your lookback window extends back to 2014-03-15, which is part of the 2014 water year, then you cannot use the 2014 water year for training. Furthermore, you would not be able to use 2016 as a training observation, because its lookback window would overlap with the 2015 water year. Effectively, you would be adding a one-year buffer before and after your test year. The diagram below illustrates this variation on the cross-validation. Note that if you are training on prior_historical_labels.csv with years prior to 2004, you will also need to avoid training on any water year that overlaps with the test water year's feature window.

[Diagram] Cross-validation variation when feature windows overlap into the previous water year: in each cross-validation iteration, neither the year immediately before nor the year immediately after the indicated test year is usable as a training observation.
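
To make the buffered variation concrete, here is a minimal sketch of selecting training years (the 20-year range and one-year buffer follow the setup above; names are illustrative):

years = range(2004, 2024)                 # the 20 water years in the CV period
for test_year in years:
    # Exclude the test year and its immediate neighbors, whose lookback
    # windows would overlap the test water year.
    train_years = [y for y in years if abs(y - test_year) > 1]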

You are responsible for clearly documenting and explaining your cross-validation approach in both your code and model report, and demonstrating that leakage is not occurring. Winning submissions will be reviewed to ensure they do not violate these constraints and may be disqualified if they are not compliant. If you have any questions, please ask on the challenge forum.

Treating years as statistically independent observations is a common evaluation methodology in seasonal water supply forecasting. For the competition, it allows more data to be available for training, especially for more recent years and for data sources that have a limited period of record, since you can train on future years and evaluate on an earlier year. Hydrologically, this is justified because water supply is primarily driven by near-term physical processes such as precipitation and snowmelt.

No future data within water years

When making a prediction for a specific issue date, your model must not use any future data from the same water year as features. For this challenge, this means that a forecast may only use feature data from before the issue date. For example, if you are issuing a forecast for 2021-03-15, you may only use feature data from 2021-03-14 or earlier.
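
For example, with pandas (the DataFrame and column names are illustrative, not challenge-provided code):

import pandas as pd

features = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-13", "2021-03-14", "2021-03-15"]),
    "swe": [10.0, 11.0, 12.0],            # illustrative feature values
})

issue_date = pd.Timestamp("2021-03-15")
available = features[features["date"] < issue_date]   # 2021-03-14 or earlier only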

You are responsible for explicitly parameterizing your code to subset feature data to the appropriate time period. Sample data reading code for some approved data sources is provided in the runtime repository. This code is clearly documented and available for your optional use; it will load data that is correctly subsetted by time for a given issue date. Winning submissions will be reviewed to ensure they do not violate these constraints and may be disqualified if they are not compliant.

Example code

Below is a minimal sketch of the cross-validation loop for the simple, default case where lookback windows do not cross a water year boundary. The file name, column names, and model choice (scikit-learn gradient boosting with quantile loss) are illustrative assumptions rather than challenge-provided code.
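
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Assumed input (illustrative names): one row per (site_id, issue_date) with a
# "water_year" column, numeric feature columns, and a "volume" column holding
# the observed seasonal water supply label.
data = pd.read_csv("features_and_labels.csv", parse_dates=["issue_date"])
feature_cols = [c for c in data.columns
                if c not in ("site_id", "issue_date", "water_year", "volume")]

folds = []
for test_year in range(2004, 2024):              # 20 iterations, one per water year
    train = data[data["water_year"] != test_year]
    test = data[data["water_year"] == test_year]

    fold = test[["site_id", "issue_date"]].copy()
    for q in (0.10, 0.50, 0.90):                 # one model per quantile
        model = GradientBoostingRegressor(loss="quantile", alpha=q)
        model.fit(train[feature_cols], train["volume"])
        fold[f"volume_{int(round(q * 100))}"] = model.predict(test[feature_cols])
    folds.append(fold)

submission = pd.concat(folds).sort_values(["site_id", "issue_date"])
submission.to_csv("submission.csv", index=False)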

Submission format

In the Final Prize Stage, the format for the submission is a CSV file where each row is a forecast for a single site (site_id) on a single issue date (issue_date). These should be your cross-validation predictions: each prediction should come from the cross-validation iteration in which its water year was held out as the test set.

Each row will have three prediction columns, volume_10, volume_50, and volume_90, containing the 0.10, 0.50, and 0.90 quantile predictions, respectively. Predictions must be floating point values. For example, 1.0 is a valid float, but 1 is not.
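
If you build your submission with pandas, one way to guarantee float-formatted values (a sketch; submission is the illustrative DataFrame from the example code above):

cols = ["volume_10", "volume_50", "volume_90"]
submission[cols] = submission[cols].astype(float)   # e.g., 0 becomes 0.0 in the CSV
submission.to_csv("submission.csv", index=False)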

Example

For example, if your predictions are...
site_id                        issue_date  volume_10  volume_50  volume_90
hungry_horse_reservoir_inflow  2004-01-01  0.0        0.0        0.0
hungry_horse_reservoir_inflow  2004-01-08  0.0        0.0        0.0
...                            ...         ...        ...        ...

The first few rows of the .csv file that you submit would look like:

site_id,issue_date,volume_10,volume_50,volume_90
hungry_horse_reservoir_inflow,2004-01-01,0.0,0.0,0.0
hungry_horse_reservoir_inflow,2004-01-08,0.0,0.0,0.0
...