Competition complete (\$30,000 in prizes)

## Woohoo! This competition has come to a close!

Many thanks to the participants for all of their hard work and commitment to using data for good!

# Problem description

The goal of this challenge is to develop models for forecasting Dst (Disturbance Storm-Time Index) that 1) push the boundary of predictive performance, 2) work under operationally viable constraints, and 3) use specified real-time solar-wind data feeds. More information on the dataset, performance metric, and submission specifications is provided below.

Finalists will be determined by performance on the private test set. These participants will then have the opportunity to submit their code to be audited using an out-of-sample verification set. The top 4 eligible teams that pass this final check will be awarded prizes.

## The features in this dataset

### Overview

The input data for this challenge is composed of solar wind measurements collected from two satellites: NASA's Advanced Composition Explorer (ACE) and NOAA's Deep Space Climate Observatory (DSCOVR). Your goal is to predict the Disturbance Storm-Time Index (Dst), a measure of magnetic activity, from the provided data up to the time of prediction.

Dst values are measured by 4 ground-based observatories near the equator. These values are then averaged to provide a measurement of Dst for any given hour. However, these values are not always provided in a timely manner.

Thus, your task is to build a model that can predict Dst in real-time for both the current hour and the next hour. For example, if the current timestep is 10:00 am, you must predict Dst for both 10:00 am and 11:00 am using data up until but not including 10:00 am.

Forecast Dst solely from solar-wind observations at the Lagrangian (L1) position using satellite data from NOAA’s Deep Space Climate Observatory (DSCOVR) and NASA's Advanced Composition Explorer (ACE).

Note: This is a real-time prediction task. Therefore, your solution may not use data captured at or after the prediction time to predict Dst, and it may not take Dst as an input.

To ensure similar distributions between the training and test data, the data is separated into three non-contiguous periods. All data are provided with a period and timedelta multi-index which indicates the relative timestep for each observation within a period, but not the real timestamp. The period identifiers and timedeltas are common across datasets.
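As a minimal sketch of working with this index, the snippet below parses a small in-memory CSV into a period/timedelta multi-index with pandas. The helper name `read_indexed` and the sample rows are illustrative assumptions (only the first row mirrors an example from this page); the real files would be read from disk in the same way.

```python
import io

import pandas as pd

# Tiny stand-in for one of the provided CSV files. Only the first row is
# taken from this page; the other rows are made up for illustration.
SAMPLE_CSV = """period,timedelta,dst
train_a,0 days 00:00:00,-7
train_a,0 days 01:00:00,-10
train_b,0 days 00:00:00,3
"""


def read_indexed(source) -> pd.DataFrame:
    """Read a CSV and index it by (period, timedelta)."""
    df = pd.read_csv(source)
    # Strings such as "0 days 01:00:00" become pandas Timedelta values.
    df["timedelta"] = pd.to_timedelta(df["timedelta"])
    return df.set_index(["period", "timedelta"]).sort_index()


labels = read_indexed(io.StringIO(SAMPLE_CSV))

# Look up one observation by its period and relative timestep.
value = labels.loc[("train_a", pd.Timedelta(hours=1)), "dst"]
```

Because the timedeltas are relative to the start of each period, two rows are only comparable in time when they share the same period.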

### Training Data

Three different time-series datasets are provided as features, in addition to the Dst labels:

| filename | description | frequency |
| --- | --- | --- |
| solar_wind.csv | Solar wind data collected from ACE and DSCOVR satellites | minutely |
| sunspots.csv | Smoothed sunspot counts | monthly |
| satellite_positions.csv | Coordinate positions for ACE and DSCOVR | daily |
| dst_labels.csv | Dst values averaged across the four stations | hourly |

One more note about the training data: historical training data is provided to you, but the testing environment will be set-up to simulate a real-time environment. It is up to you to figure out how to align and use the appropriate training data, being careful not to leak any "future" information.
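One possible way to align minutely features with hourly labels without leaking future information is sketched below. It assigns every reading to the first hour at which it would be usable, so readings taken during [09:00, 10:00) become features for the 10:00 timestep. The function name `hourly_features`, the choice of mean and standard deviation, and the sample values are assumptions made for illustration, not part of the competition materials.

```python
import pandas as pd


def hourly_features(solar_wind: pd.DataFrame, columns) -> pd.DataFrame:
    """Aggregate minutely readings into features available at each hourly t0.

    Assumes `solar_wind` has a `period` column, a `timedelta` column of
    pandas Timedelta values, and the numeric columns named in `columns`.
    """
    columns = list(columns)
    df = solar_wind[["period", "timedelta", *columns]].copy()
    # Floor to the hour the reading falls in, then move it to the NEXT hour:
    # a reading at 09:59 may be used at t0 = 10:00, a reading at 10:00 may not.
    df["timedelta"] = df["timedelta"].dt.floor("h") + pd.Timedelta(hours=1)
    feats = df.groupby(["period", "timedelta"])[columns].agg(["mean", "std"])
    # Flatten the (column, statistic) pairs into names such as "speed_mean".
    feats.columns = [f"{col}_{stat}" for col, stat in feats.columns]
    return feats


# Made-up readings: two in the first hour, one exactly at the 01:00 mark.
sample = pd.DataFrame(
    {
        "period": ["train_a"] * 3,
        "timedelta": pd.to_timedelta(
            ["0 days 00:00:00", "0 days 00:30:00", "0 days 01:00:00"]
        ),
        "speed": [400.0, 420.0, 500.0],
    }
)
features = hourly_features(sample, ["speed"])
```

In this sample the reading stamped exactly 01:00 is excluded from the features for t0 = 01:00 and only appears at 02:00, matching the "up until but not including" rule described in the overview.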

For details on the real-time data constraints, refer to the code submission page.

#### Solar Wind Data

The primary feature data are provided in solar_wind.csv. They are composed of solar-wind readings from the ACE and DSCOVR satellites:

• bx_gse - Interplanetary-magnetic-field (IMF) X-component in geocentric solar ecliptic (GSE) coordinates (nanotesla (nT))
• by_gse - Interplanetary-magnetic-field Y-component in GSE coordinates (nT)
• bz_gse - Interplanetary-magnetic-field Z-component in GSE coordinates (nT)
• theta_gse - Interplanetary-magnetic-field latitude in GSE coordinates (defined as the angle between the magnetic vector B and the ecliptic plane, being positive when B points North) (degrees)
• phi_gse - Interplanetary-magnetic-field longitude in GSE coordinates (the angle between the projection of the IMF vector on the ecliptic and the Earth–Sun direction) (degrees)
• bx_gsm - Interplanetary-magnetic-field X-component in geocentric solar magnetospheric (GSM) coordinates (nT)
• by_gsm - Interplanetary-magnetic-field Y-component in GSM coordinates (nT)
• bz_gsm - Interplanetary-magnetic-field Z-component in GSM coordinates (nT)
• theta_gsm - Interplanetary-magnetic-field latitude in GSM coordinates (degrees)
• phi_gsm - Interplanetary-magnetic-field longitude in GSM coordinates (degrees)
• bt - Interplanetary-magnetic-field component magnitude (nT)
• density - Solar wind proton density (N/cm^3)
• speed - Solar wind bulk speed (km/s)
• temperature - Solar wind ion temperature (Kelvin)
• source - Starting in 2016, the solar wind data for any given point in time can be sourced from either DSCOVR or ACE satellites depending on the quality. "ac" denotes it was sourced from ACE, and "ds" from DSCOVR.

Example row:

| column | value |
| --- | --- |
| period | train_a |
| timedelta | 0 days 00:00:00 |
| bx_gse | -5.55 |
| by_gse | 3.0 |
| bz_gse | 1.25 |
| theta_gse | 11.09 |
| phi_gse | 153.37 |
| bx_gsm | -5.55 |
| by_gsm | 3.0 |
| bz_gsm | 1.25 |
| theta_gsm | 11.09 |
| phi_gsm | 153.37 |
| bt | 6.8 |
| density | 1.53 |
| speed | 383.92 |
| temperature | 110237.0 |
| source | ac |
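To make the latitude and longitude definitions concrete, the sketch below recomputes them from the X, Y, and Z components of the example row. The function name `imf_angles` is hypothetical, and the formulas are a direct reading of the definitions given above (latitude as the angle out of the ecliptic plane, longitude as the angle of the in-plane projection from the X axis). The results land close to, but not exactly on, the listed theta_gse and phi_gse values; one plausible reason, assumed here rather than stated by the source, is that the published angles and magnitude are averaged separately from the components.

```python
import math


def imf_angles(bx: float, by: float, bz: float):
    """Return (magnitude, latitude, longitude) of a field vector.

    Latitude and longitude are in degrees; longitude is wrapped to [0, 360).
    """
    magnitude = math.sqrt(bx * bx + by * by + bz * bz)
    if magnitude == 0.0:
        raise ValueError("angles are undefined for a zero vector")
    # Latitude: angle between the vector and the XY plane, positive northward.
    theta = math.degrees(math.asin(bz / magnitude))
    # Longitude: angle of the XY projection measured from the +X axis.
    phi = math.degrees(math.atan2(by, bx)) % 360.0
    return magnitude, theta, phi


# Components taken from the example row above.
magnitude, theta, phi = imf_angles(-5.55, 3.0, 1.25)
```

With these inputs the latitude comes out near 11.2 degrees and the longitude near 151.6 degrees, compared with 11.09 and 153.37 in the example row.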

In the code execution environment, you'll be provided one week of historical solar wind data per prediction.

Illustrative example of solar-wind data features and Dst over time. The selected features in green (up to t0) are used to predict the Dst in cyan (at t0 and at t+1). Keep in mind that your solution should work on real-time data streams with realistic noise including sensor malfunctions and data gaps.
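One simple way to cope with data gaps is sketched below: carry the last valid reading forward for a limited number of steps and keep a flag recording which readings were originally missing. The function name `fill_gaps`, the default limit of 30 steps, and the sample values are assumptions for illustration; forward filling is only one of several reasonable choices, and it has the property of never looking ahead in time.

```python
import pandas as pd


def fill_gaps(series: pd.Series, max_gap: int = 30) -> pd.DataFrame:
    """Forward-fill short gaps and record which values were missing.

    `max_gap` is the largest number of consecutive missing readings that
    will be filled; longer outages stay NaN beyond that point.
    """
    was_missing = series.isna()
    # Forward fill uses only earlier readings, so no future data is consumed.
    filled = series.ffill(limit=max_gap)
    return pd.DataFrame({"value": filled, "was_missing": was_missing})


# Made-up readings with a two-step gap; only the first step is filled here.
readings = pd.Series([1.0, float("nan"), float("nan"), 4.0])
cleaned = fill_gaps(readings, max_gap=1)
```

Keeping the `was_missing` flag lets a model distinguish measured values from filled ones instead of treating them as equally reliable.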

#### Satellite Coordinate Data

ACE and DSCOVR satellites are not stationary. They actually orbit around the L1 point, in a relatively constant position with respect to the Earth as the Earth revolves around the sun.

satellite_positions.csv records the daily positions of the DSCOVR and ACE spacecraft in geocentric solar ecliptic (GSE) coordinates for projections in the XY, XZ, and YZ planes. The columns for each spacecraft are denoted by the suffixes _ace or _dscovr.

• gse_x - Position of the satellite in the X direction of GSE coordinates (km)
• gse_y - Position of the satellite in the Y direction of GSE coordinates (km)
• gse_z - Position of the satellite in the Z direction of GSE coordinates (km)

Example row:

| period | timedelta | gse_x_ace | gse_y_ace | gse_z_ace | gse_x_dscovr | gse_y_dscovr | gse_z_dscovr |
| --- | --- | --- | --- | --- | --- | --- | --- |
| train_a | 0 days 00:00:00 | 1522376.9 | 143704.6 | 149496.7 | NaN | NaN | NaN |

Note that some dates are missing for DSCOVR. In the code execution environment, you'll be provided the last 7 days of satellite positions.

#### Sunspot Numbers

The Sun exhibits a well-known, periodic variation in the number of spots on its disk over a period of about 11 years, called a solar cycle. In general, large geomagnetic storms occur more frequently during the peak of these cycles. Sunspot numbers might allow for calibration of models to the solar cycle.

Smoothed sunspot numbers are provided at a monthly frequency. Because sunspot numbers are reliably projected, the most current month will be provided for each prediction in the real-time environment.

Sunspots are indexed according to the first corresponding day in dst_labels.csv.

Example row:

| period | timedelta | smoothed_ssn |
| --- | --- | --- |
| train_a | 0 days 00:00:00 | 65.4 |
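Because sunspot numbers arrive monthly while predictions are hourly, each prediction time needs the most recent value at or before it. A hedged sketch using pandas `merge_asof` follows; the helper name `attach_latest` and all sample values other than 65.4 are made up for illustration. The same pattern could be applied to the daily satellite positions.

```python
import pandas as pd


def attach_latest(hourly: pd.DataFrame, slow: pd.DataFrame, value_col: str) -> pd.DataFrame:
    """Attach the latest slow-moving value at or before each hourly row.

    Both frames are assumed to have `period` and `timedelta` columns, with
    `timedelta` holding pandas Timedelta values.
    """
    # merge_asof requires both sides to be sorted by the `on` key.
    left = hourly.sort_values("timedelta")
    right = slow[["period", "timedelta", value_col]].sort_values("timedelta")
    # `by` keeps periods separate; "backward" never looks into the future.
    return pd.merge_asof(left, right, on="timedelta", by="period", direction="backward")


hourly = pd.DataFrame(
    {
        "period": ["train_a", "train_a", "train_b"],
        "timedelta": pd.to_timedelta(["0 days 05:00:00", "40 days 00:00:00", "0 days 05:00:00"]),
    }
)
sunspots = pd.DataFrame(
    {
        "period": ["train_a", "train_a", "train_b"],
        "timedelta": pd.to_timedelta(["0 days", "31 days", "0 days"]),
        "smoothed_ssn": [65.4, 70.0, 20.0],
    }
)
merged = attach_latest(hourly, sunspots, "smoothed_ssn")
by_key = merged.set_index(["period", "timedelta"])["smoothed_ssn"]
```

Here the train_a row at 40 days picks up the value indexed at 31 days, while the rows at 5 hours use the value indexed at day 0 of their own period.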

## Labels

The labels are hourly Dst values, indexed using the same period and timedelta multi-index. Your task is to predict the current timestep (t0) and the following timestep (t+1). Remember that you are not allowed to use historical Dst values as an input for prediction.

Example row:

| period | timedelta | dst |
| --- | --- | --- |
| train_a | 0 days 00:00:00 | -7 |
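For training, the two targets can be derived from the label column by shifting it within each period, as sketched below. The function name `add_targets` and the column names `t0` and `t1` are illustrative assumptions, as are all sample values other than the first row. The shift assumes the labels form a gap-free hourly sequence inside each period; if hours are missing, a join on `timedelta + 1 hour` would be safer.

```python
import pandas as pd


def add_targets(labels: pd.DataFrame) -> pd.DataFrame:
    """Add the current-hour (t0) and next-hour (t1) Dst targets.

    Assumes columns `period`, `timedelta`, and `dst`, with consecutive
    hourly rows inside each period.
    """
    out = labels.sort_values(["period", "timedelta"]).copy()
    out["t0"] = out["dst"]
    # Shift within each period so a target never crosses a period boundary.
    out["t1"] = out.groupby("period")["dst"].shift(-1)
    return out


labels_frame = pd.DataFrame(
    {
        "period": ["train_a", "train_a", "train_b"],
        "timedelta": pd.to_timedelta(
            ["0 days 00:00:00", "0 days 01:00:00", "0 days 00:00:00"]
        ),
        "dst": [-7, -10, 3],
    }
)
targets = add_targets(labels_frame)
```

The last hour of each period has no following label, so its `t1` value is NaN and that row would typically be dropped before fitting.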

## Performance metric

Performance is evaluated according to Root Mean Squared Error (RMSE). RMSE will be calculated on t0 and t+1 simultaneously.

RMSE is defined as:

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 }$$

where:

• $\hat{y}_i$ is the estimated Dst value for t0 or t+1
• $y_i$ is the recorded Dst value for t0 or t+1
• $N$ is the number of samples

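This metric is implemented in scikit-learn as the mean_squared_error function, with the squared parameter set to False. Recent scikit-learn releases also provide a dedicated root_mean_squared_error function.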
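For reference, the metric can also be written out directly. The sketch below pools the t0 and t+1 predictions into a single list before computing the error, which is one reading of "calculated on t0 and t+1 simultaneously" and is an assumption here; the function name `rmse` and the sample numbers are made up for illustration.

```python
import math


def rmse(y_true, y_pred) -> float:
    """Root mean squared error over paired observations."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for two prediction times.
t0_true, t0_pred = [-7.0, -10.0], [-5.0, -10.0]
t1_true, t1_pred = [-10.0, -12.0], [-10.0, -8.0]

# Pool both horizons so every (time, horizon) pair counts as one sample.
score = rmse(t0_true + t1_true, t0_pred + t1_pred)
```

In this example the errors are 2, 0, 0, and 4, so the mean squared error is 5 and the score is the square root of 5.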

## Submission format

This is a code execution challenge! Rather than submitting your predicted labels, you'll package everything needed to do inference and submit that for containerized execution.

The execution environment will be simulating real-time conditions, subject to data availability constraints. See complete details on making your executable code submission here.

## Good luck!

Good luck and enjoy this problem! Check out the benchmark blog post for tips on how to get started. If you have any questions you can always visit the user forum.