Water Supply Forecast Rodeo: Forecast Stage

Water managers in the Western U.S. rely on accurate water supply forecasts to better operate facilities and mitigate drought. Help the Bureau of Reclamation improve seasonal water supply estimates in this probabilistic forecasting challenge! [Forecast Stage] #climate

$50,000 in prizes
Jul 2024
47 joined

Problem description

The aim of the Water Supply Forecast Rodeo challenge is to create a probabilistic forecast model for the 0.10, 0.50, and 0.90 quantiles of seasonal water supply at 26 different hydrologic sites. In this challenge stage, the Forecast Stage, you will submit a model that will be used to generate forecasts for 28 different issue dates for the upcoming 2024 season. Accurate forecasts with well-characterized uncertainty will help water resources managers better operate facilities and manage drought conditions in the American West.

Forecasting task

In this challenge, the target variable is cumulative streamflow volume, measured in thousand acre-feet (KAF), at selected locations in the Western United States. Streamflow is the volume of water flowing in a river at a specific point, and seasonal water supply forecasting is the practice of predicting the cumulative streamflow—the total volume of water—over a specified period of time (the "season" or "forecast period"). The forecast season of interest typically covers the spring and summer. For most of the sites used in this challenge, the forecast season is April through July.

The date that the forecast is made is called the issue date. For example, a forecast for the April–July season issued on March 1 for a given site would use the best available data as of March 1 to predict the cumulative volume that flows through that site later that year from April through July. In this challenge, you will issue forecasts on the 1st, 8th, 15th, and 22nd of each month from January through July.

Note that some issue dates will overlap with the forecast season. A portion of the seasonal water supply for some of these issue dates will be known, because the naturalized flow for full months that have passed before the issue date will be available (see "Antecedent monthly naturalized flow" section below). However, those forecasts will still have an unknown portion that is the residual (remaining) naturalized flow. For these forecasts, the target variable is still the total seasonal water supply and not just the residual. If you are predicting just the residual, you should add the known months' naturalized flow in order to generate the seasonal total. All forecasts for a given year and a given site share a single ground truth value.

The streamflow measurements of interest are of natural flow (also called unimpaired flow). This refers to the amount of water that would have flowed without influence from upstream dams or diversions. Observed flow measurements are naturalized by adding or subtracting known upstream influences.

Quantile forecasting

Rather than predicting a single value for each forecast, your task is to produce a quantile forecast: the 0.10, 0.50 (median), and 0.90 quantiles of cumulative streamflow volume. Forecasting quantiles acknowledges the uncertainty inherent in predictions and provides a range representing the distribution of possible outcomes. The 0.50 quantile (median) prediction is the central tendency of the distribution, while the 0.10 and 0.90 quantile predictions are the lower and upper bounds of a central 80% prediction interval.

Note that quantiles have the opposite notation compared with the "probability of exceedance" used in many of the operational water supply forecasts. For example, a 90% probability of exceedance is equivalent to the 0.10 quantile. Measurements should fall below the true 0.10 quantile value 10% of the time, and they should exceed it 90% of the time. For more information on probability of exceedance, see the section "Interpreting water supply forecasts" of this reference.
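
Concretely, the two conventions are related by:

$$ \text{quantile level} = 1 - \text{probability of exceedance} $$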

Labels (ground truth data)

The ground truth data are derived from naturalized streamflow data provided by the Natural Resources Conservation Service (NRCS) and the National Weather Service River Forecast Centers (RFCs) and are available in CSV format on the data download page. Each row will represent the seasonal water supply for one year's season at one site.

For the Forecast Stage, the forecast_train.csv file contains the full history of seasonal water supply values through 2023 for the 26 sites used in the challenge. The file contains the following columns:

  • site_id (str) — identifier for a particular site.
  • year (int) — the year whose season the seasonal water supply measurement corresponds to.
  • volume (float) — seasonal water supply volume in thousand acre-feet (KAF).

In the Forecast Stage, your models will be evaluated on their predictions for the 2024 seasonal water supply. Because most of the sites' forecast seasons run through July, the ground truth for the test issue dates will not be available until August. We will use the naturalized streamflow data published as of August 14, 2024 to score predictions.
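
As a quick-start illustration, here is a minimal sketch of loading these labels with pandas (assuming forecast_train.csv has been downloaded from the data download page to the working directory; the site chosen is illustrative):

```python
import pandas as pd

# Load the seasonal water supply labels: one row per (site_id, year).
train = pd.read_csv("forecast_train.csv")

# Example: full history of seasonal volumes (KAF) for a single site.
site_history = train[train["site_id"] == "hungry_horse_reservoir_inflow"].sort_values("year")
print(site_history.tail())
```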

Metadata

An additional CSV file, metadata.csv, provides important metadata about the 26 sites. The metadata file contains the following columns:

  • site_id (str) — identifier used in this challenge for a particular site.
  • season_start_month (int) — month that is the beginning of the forecast season, inclusive.
  • season_end_month (int) — month that is the end of the forecast season, inclusive.
  • elevation (float) — elevation of site in feet.
  • latitude (float) — latitude of site in degrees north.
  • longitude (float) — longitude of site in degrees east.
  • drainage_area (float) — estimated area of the drainage basin from USGS in square miles. Note that not all sites have a drainage area estimate. See basin polygons in the geospatial data as an alternative way to estimate the drainage area.
  • usgs_id (str) — U.S. Geological Survey (USGS) identifier for this monitoring location in the USGS streamgaging network. This is an 8-digit identifier. May be missing if there is no USGS streamgage at this location.
  • usgs_name (str) — USGS name for this monitoring location in the USGS streamgaging network. May be missing if there is no USGS streamgage at this location.
  • nrcs_id (str) — NRCS triplet-format identifier for this monitoring location. May be missing if the NRCS does not track station data for this location.
  • nrcs_name (str) — NRCS name for this site's location. May be missing if the NRCS does not track station data for this location.
  • rfc_id (str) — RFC identifier for this site's location. May be missing if no RFC issues a forecast for this location.
  • rfc_name (str) — RFC name for this site's location. May be missing if no RFC issues a forecast for this location.
  • rfc (str) — the specific RFC whose area of responsibility this site is located in. May be missing if no RFC issues a forecast for this location.

Update November 2, 2023: The usgs_id column has been corrected from an integer type to a string type. Correct USGS identifiers are 8 digits and may have leading zeros. To correctly read this column using pandas.read_csv, use the keyword argument dtype={"usgs_id": "string"}.
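
For example, a minimal reading sketch using that keyword argument (assuming metadata.csv has been downloaded to the working directory):

```python
import pandas as pd

# Preserve leading zeros in USGS identifiers by reading usgs_id as a string.
metadata = pd.read_csv("metadata.csv", dtype={"usgs_id": "string"})

print(metadata[["site_id", "season_start_month", "season_end_month", "usgs_id"]].head())
```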

Basin geospatial data

Geospatial vector data related to each of the 26 sites is available in a multi-layered GeoPackage file. You can read this data using libraries like fiona or geopandas. This data file contains two layers named basins and sites.

The basins layer contains one vector feature per site that delineates each site's drainage basin. A drainage basin refers to the area of land where all flowing surface water converges to a single point. A drainage basin is determined by topography and the fact that water flows downhill. Features in this layer contain the following properties:

  • site_id (str) — identifier used in this challenge for a particular site.
  • name (str) — nicely formatted human-readable name for the site.
  • area (float) — area of the basin polygon in square miles.

The sites layer contains one vector feature per site that is the point location of the site. Features in this layer contain the following properties:

  • site_id (str) — identifier used in this challenge for a particular site.
  • name (str) — nicely formatted human-readable name for the site.


Map showing the site locations (blue points) and drainage basin polygons (purple) for the 26 sites in the challenge.
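
A minimal sketch of reading both layers with geopandas follows; the file name geospatial.gpkg is an assumption, so use the name of the GeoPackage file as provided on the data download page:

```python
import geopandas as gpd

# Each layer of the GeoPackage is read separately by name.
basins = gpd.read_file("geospatial.gpkg", layer="basins")  # one polygon per site's drainage basin
sites = gpd.read_file("geospatial.gpkg", layer="sites")    # one point per site location

# Example: attach each basin's area (square miles) to the corresponding site point.
sites_with_area = sites.merge(basins[["site_id", "area"]], on="site_id")
print(sites_with_area.head())
```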

Features (predictor data)

Seasonal water supply is influenced by a range of hydrological, weather, and climate factors. Relevant data sources for streamflow volume prediction include antecedent streamflow measurements, snowpack estimates, short- and long-term meteorological forecasts, and climate teleconnection indices. You are encouraged to experiment with and use a variety of relevant data sources in your modeling. However, only approved data sources may be used as model inputs for valid submissions. See the Approved Data Sources page for an up-to-date list of approved data sources.

You will be responsible for downloading your own feature data from approved sources for model training. Example code for downloading and reading the approved data sources is available in the challenge data and runtime repository.

For code execution during the Forecast Stage, you will generally be required to use the data hosted by DrivenData that is mounted to the code execution runtime. Only data sources designated as "Direct API Access Approved" may be downloaded during the code execution run.

The format and access for each data source will be documented, and the provided sample code will allow you to download the data in the same way.

Data requests

We are no longer accepting requests for new data sources. However, if you have questions or requests about already approved data sources, such as alternative forms of access or inclusion in the mounted data drive, feel free to reach out on the challenge forum. Requests should be made before January 5, 2024 to allow sufficient time for challenge organizers to review before the close of the open submission period.

Antecedent monthly naturalized flow

The past monthly time series observations of the forecast target variable are available to be used as autoregressive modeling features. A dataset of antecedent monthly naturalized flow is available on the data download page. For the Forecast Stage, data for all past years through 2023 have been combined into a single file, forecast_train_monthly_naturalized_flow.csv. Please note that only 23 of the 26 sites have monthly data available.

For each forecast year, the dataset includes values from October of the previous year through the month before the end of the forecast season (i.e., June for sites with seasons through July, and May for the site with a season through June). This time range provided follows the standard concept of a "water year" in hydrology.

The data includes the following columns:

  • site_id (str) — identifier for a particular site.
  • forecast_year (int) — the year whose forecast season this row is relevant for.
  • year (int) — the calendar year of the month that the total streamflow value is for.
  • month (int) — the month that the total streamflow value is for.
  • volume (float) — total monthly streamflow volume in thousand acre-feet (KAF). May be missing for some months for some sites.

For example, the row with year=2004 and month=11 contains the total streamflow volume for the month of November 2004. That row has forecast_year=2005 because it is an observation that is relevant to forecasting the seasonal water supply for the 2005 season.
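
For instance, here is a minimal sketch of assembling the known antecedent monthly flow for one site and forecast year with pandas (the site, forecast year, and issue date chosen here are illustrative assumptions):

```python
import pandas as pd

monthly = pd.read_csv("forecast_train_monthly_naturalized_flow.csv")

# Illustrative choices for this sketch.
site_id, forecast_year = "hungry_horse_reservoir_inflow", 2023
issue_date = pd.Timestamp("2023-05-01")

history = monthly[(monthly["site_id"] == site_id) & (monthly["forecast_year"] == forecast_year)]

# Only months that ended before the issue date are known (here, October 2022 through April 2023).
known = history[
    (history["year"] < issue_date.year)
    | ((history["year"] == issue_date.year) & (history["month"] < issue_date.month))
]

# Known-to-date volume within an April-July season; add this to a residual forecast
# to obtain the seasonal total, as described above.
season_known = known.loc[known["month"].between(4, 7), "volume"].sum()
print(season_known)
```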

Time and data use

This section provides guidance on time and data use specific to the Forecast Stage.

When making a prediction for a specific issue date, your model must not use any future data as features. For this challenge, this means that a forecast may only use feature data from before the issue date. For example, if you are issuing a forecast for 2024-03-15, you may only use feature data from 2024-03-14 or earlier.

You are responsible for explicitly parameterizing your code to subset feature data to the appropriate time period based on the issue date. Note that your code may not always be run on the issue date itself (e.g., DrivenData may run your code on a later date, or you may be submitting a fix for a failed job). Your submission should produce the same forecast whether it is run on the issue date itself or later. Winning submissions will be reviewed to ensure they follow this requirement and may be disqualified if they are not compliant.
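
As a rough illustration, here is a minimal sketch of this kind of explicit subsetting, assuming a feature DataFrame with a datetime column named date (all names here are illustrative):

```python
import pandas as pd

def subset_features(features: pd.DataFrame, issue_date: str) -> pd.DataFrame:
    """Keep only feature rows strictly before the issue date.

    The cutoff depends only on issue_date, so the same forecast is produced
    whether the code runs on the issue date itself or on a later day.
    """
    cutoff = pd.Timestamp(issue_date)
    return features[pd.to_datetime(features["date"]) < cutoff]

# Example: a forecast issued on 2024-03-15 may use data through 2024-03-14 only.
# subset_features(snotel_df, "2024-03-15")
```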

Sample data reading code for some data sources is provided in the runtime repository. This code is clearly documented and available for your optional use; it will load data that is correctly subsetted by time for a given issue date.

Supplemental NRCS naturalized flow

Supplemental training data from 592 other NRCS monitoring sites (not among the 26 challenge sites) has been uploaded to the data download page. You may use this data as additional training data for your model. Two files are provided:

  • forecast_supplementary_nrcs_forecast_train_monthly_naturalized_flow.csv — the monthly naturalized flow time series for the supplemental sites from October through July. The Forecast Stage version of this file contains all water years through 2023.
  • supplementary_nrcs_metadata.csv — metadata about the supplemental sites.

You will be able to join between the two files using the nrcs_id identifier column.
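
For example, a minimal join sketch with pandas, assuming both files are in the working directory:

```python
import pandas as pd

supp_flow = pd.read_csv("forecast_supplementary_nrcs_forecast_train_monthly_naturalized_flow.csv")
supp_meta = pd.read_csv("supplementary_nrcs_metadata.csv")

# Attach site metadata to each monthly flow record via the shared nrcs_id column.
supp = supp_flow.merge(supp_meta, on="nrcs_id", how="left")
print(supp.head())
```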

Hydrologic Unit Codes (HUC)

Added January 9, 2024.

When processing certain feature data sources, such as the spatially averaged data from UA/SWANN, you may need to associate the forecast sites or drainage basins in the challenge to USGS Hydrologic Unit Codes (HUC). HUC definitions are a type of static metadata that are not considered to be a feature, and you do not need to use a specific approved source, though we ask that you use one that is reasonably official. Please make sure that your process for selecting or associating HUCs to forecast sites is clearly documented and reproducible in your model report and training code.

If you have any questions, please ask in this forum thread.

Updates to your model from the Hindcast Stage

You are allowed to make updates to the model that you previously submitted to the Hindcast Evaluation Arena. This includes both retraining your model using the additional training data available (years formerly in the Hindcast test set), as well as changes to your modeling approach, such as changes to the model architecture, hyperparameters, or features.

Performance metric

Prizes in the Forecast Stage will be determined based on quantitative evaluation of forecasts by the primary challenge metric: quantile loss.

Primary metric: Quantile loss

Quantile loss, also known as pinball loss, assesses the accuracy of a quantile forecast by quantifying the disparity between predicted quantiles (percentiles) and the actual values. This is an error metric, so a lower value is better. Mean quantile loss is the quantile loss averaged over all observations for a given quantile and is implemented in scikit-learn. We are using a quantile loss that has been multiplied by a factor of 2 so that the 0.5 quantile loss is equivalent to mean absolute error.

$$ \text{Mean Quantile Loss}(\tau, y, \hat{y}) = \frac{2}{n} \sum_{i=1}^{n} \left[ \tau \cdot \max(y_i - \hat{y}_i, 0) + (1 - \tau) \cdot \max(\hat{y}_i - y_i, 0) \right] $$

Where:

  • $\tau$ represents the desired quantile (e.g., 0.1 for the 10th percentile).
  • $y_i$ represents the actual observed value for the $i$th observation.
  • $\hat{y}_i$ represents the predicted value for the $\tau$ quantile of the $i$th observation.
  • $n$ represents the total number of observations.

For this challenge, mean quantile loss will be calculated for each $\tau \in \{0.10, 0.50, 0.90\}$ and averaged. The final score will be calculated as:

$$ \text{Averaged Mean-Quantile-Loss (MQL)} = \frac{\text{MQL}_{0.10} + \text{MQL}_{0.50} + \text{MQL}_{0.90}}{3} $$
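
A minimal sketch of computing this score with scikit-learn's mean_pinball_loss (the factor of 2 matches the scaling described above; array names and values are illustrative):

```python
import numpy as np
from sklearn.metrics import mean_pinball_loss

def averaged_mean_quantile_loss(y_true, pred_10, pred_50, pred_90):
    """Average of the 2x-scaled mean pinball loss over the three quantiles."""
    losses = [
        2 * mean_pinball_loss(y_true, pred_10, alpha=0.10),
        2 * mean_pinball_loss(y_true, pred_50, alpha=0.50),
        2 * mean_pinball_loss(y_true, pred_90, alpha=0.90),
    ]
    return float(np.mean(losses))

# Toy example with volumes in KAF.
y = np.array([100.0, 250.0])
print(averaged_mean_quantile_loss(y, y * 0.8, y, y * 1.2))
```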

Secondary metric: Interval coverage

Interval coverage is the proportion of true values that fall within the predicted interval (between the 0.10 and 0.90 quantile predictions). This metric provides information about the statistical calibration of the predicted intervals. It will not be used directly for ranking; well-calibrated 0.10 and 0.90 quantile forecasts should have a coverage value of 0.8.

$$ \text{Interval Coverage} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{(\hat{y}_{0.1,i} \leq y_i \leq \hat{y}_{0.9,i})} $$

Where:

  • $\mathbf{1}$ is an indicator function.
  • $y_i$ represents the actual observed value for the $i$th observation.
  • $\hat{y}_{\tau, i}$ represents the predicted value for the $\tau$ quantile of the $i$th observation.
  • $n$ represents the total number of observations.
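
A minimal numpy sketch of this calculation (array names and values are illustrative):

```python
import numpy as np

def interval_coverage(y_true, pred_10, pred_90):
    """Fraction of observations falling inside the [0.10, 0.90] quantile predictions."""
    y_true, pred_10, pred_90 = map(np.asarray, (y_true, pred_10, pred_90))
    return float(np.mean((pred_10 <= y_true) & (y_true <= pred_90)))

# Well-calibrated forecasts should give a value close to 0.8 over many observations.
print(interval_coverage([100.0, 250.0, 300.0], [80.0, 230.0, 310.0], [120.0, 270.0, 330.0]))
```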

Predictions format

In the Forecast Arena, you are required to submit a ZIP archive containing your trained model and prediction code for inference. We will then execute your code in a containerized runtime in our cloud environment on the scheduled issue dates to generate forecasts throughout the 2024 season.

The supervisor code in the runtime will call your predict function and is responsible for compiling predictions into the expected prediction CSV file. The predictions format is the same as the Hindcast Stage, except that on each scheduled run, there will only be rows for a single issue date. For most issue dates, there will be 26 rows—one for each site.

Example

The format for the submission is a CSV file where each row is a forecast for a single site (site_id) on a single issue date (issue_date). Each row will have three prediction columns: volume_10, volume_50, and volume_90 for the respective quantile predictions. Predictions for water supply volume must be floating point number values. For example, 1.0 is a valid float, but 1 is not.

For example, if the issue date being evaluated is 2024-01-15, and you predicted...

site_id                        issue_date   volume_10  volume_50  volume_90
hungry_horse_reservoir_inflow  2024-01-15   0.0        0.0        0.0
snake_r_nr_heise               2024-01-15   0.0        0.0        0.0
...                            ...          ...        ...        ...

Then the first few rows of the CSV file produced by your code run would look like:

site_id,issue_date,volume_10,volume_50,volume_90
hungry_horse_reservoir_inflow,2024-01-15,0.0,0.0,0.0
snake_r_nr_heise,2024-01-15,0.0,0.0,0.0
...
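
Purely to illustrate the format (the supervisor code assembles the real prediction file from your predict function's outputs), here is a minimal pandas sketch that writes such a CSV; the output file name is illustrative:

```python
import pandas as pd

# Illustrative rows for a single issue date; prediction values must be floats.
predictions = pd.DataFrame(
    {
        "site_id": ["hungry_horse_reservoir_inflow", "snake_r_nr_heise"],
        "issue_date": ["2024-01-15", "2024-01-15"],
        "volume_10": [0.0, 0.0],
        "volume_50": [0.0, 0.0],
        "volume_90": [0.0, 0.0],
    }
)
predictions.to_csv("predictions.csv", index=False)  # illustrative file name
```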

Frequently Asked Questions (FAQ)

When I make a prediction for a given issue date, what is the allowable time range of data to use? [Forecast Stage]

An individual forecast on a given issue date may in general use feature variables derived from past data through the day before the issue date. For example, a forecast for issue date 2024-03-15 may use any past data through 2024-03-14.

Can I upload trained model weights and/or precomputed features for the code execution run?

Model training (fitting model weights and/or feature parameters) on the training set (2023 water year and earlier) is expected to happen offline on your own hardware, and you should upload your trained model weights and feature parameters along with your code. Preprocessing features and computing predictions on the test set (2024 water year) is expected to happen in the code execution run. If you need to download data associated with the test set from direct-API-access-approved data sources, you should do so within the code execution run.

  • Should upload: Model weights computed on the training set.
  • Should upload: Feature parameters computed on the training set (e.g., mean value of some variable over the water years of the training set).
  • Should not upload: Features computed from data corresponding to a water year in the test set.
  • Should not upload: Data from approved sources corresponding to a water year in the test set.

How do I authenticate with Copernicus Climate Data Store (CDS) in order to download data?

The Copernicus Climate Data Store (CDS) is an approved access point for certain feature data sources. CDS is free but requires you to create an account and authenticate with an API key. In order to download data from CDS during the code execution run, you will need to provide your own API key as part of your submission. Only DrivenData staff will be able to access your submission contents, and we will only use your API key as part of running your submission. If you have any further questions or concerns, please let us know.

In order to properly authenticate the cdsapi client in your code, see this forum post for a few different approaches.
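
One possible approach, sketched here as an assumption (the endpoint URL and key are placeholders; the forum post referenced above describes organizer-recommended options):

```python
import cdsapi

# Placeholder credentials; replace with your own CDS API endpoint and key,
# e.g. loaded from a small config file bundled in your submission ZIP.
client = cdsapi.Client(
    url="https://cds.climate.copernicus.eu/api/v2",  # placeholder CDS API URL
    key="<your-cds-api-key>",                        # placeholder key
)
```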

What do I do during the evaluation period between January 15 and August 14 if my code is being executed automatically?

Be on the lookout for any emails notifying you about code job failures. See the code submission format page about dealing with job failures.

Otherwise, you'll have the opportunity to work on your submissions for other stages of the challenge. The next deadline will be the Hindcast model report, due January 26, 2024. (It was delayed until after the Forecast Stage open submission period to give you more space to prepare for the Forecast Stage.) You'll also be able to work on and submit to the Overall Prize evaluation (cross-validation and final model report) and the Explainability Bonus Track during this time. See the preview page for an overview of what to expect.

Good luck

Good luck and felicitous forecasting! If you have any questions, you can always visit the competition forum!