Pushback to the Future: Predict Pushback Time at US Airports (Open Arena)

START HERE! Accurate estimates of pushback time can help air traffic management systems more efficiently use the limited capacity of airports, runways and the National Airspace System. Use air traffic and weather data to automatically predict pushback time!

Apr 2023
409 joined

Problem description

In this challenge, you will predict the pushback time of departing flights (how long until the plane departs from the gate) using features that capture air traffic and weather conditions. Your goal is to build a model that predicts the minutes until pushback time. It is useful to predict pushback time at various points before a plane’s scheduled departure, so your model will make predictions from roughly an hour before scheduled pushback until the actual pushback time. For each flight, you'll generate predictions based on the information available at a variety of different times leading up to the actual pushback. Each prediction is a unique combination of flight ID, airport, and time (here we'll call this prediction time).

Finalists from Phase 1 will be required to participate in Phase 2, during which they will work with NASA to train a federated version of their model. It is easier to combine weights for certain types of models, and therefore some types of models are more easily federated than others. Keep this in mind as you develop your solutions!

Timeline and leaderboard

All participants can enter the Open Arena of this challenge (you are here!). In the Open Arena, participants can work on their solutions and get live feedback from a public leaderboard. Participants who attest to their eligibility can enter the Prescreened Arena where participants will submit executable code submissions that determine the final rankings.

Phase 1: Open model development (Feb 1 - Apr 17, 2023)
Submit pushback predictions for the validation set to the Open Arena. Scores are displayed on the Open Arena public leaderboard.

Phase 1: Code execution, Prescreened only (Feb 22 - Apr 17, 2023)
Submit code to the Prescreened Arena, which we will execute to compute pushback predictions for the test set. These scores are displayed on the Prescreened public leaderboard and will be used to determine prize rankings.

Phase 2: Federated learning, Phase 1 finalists only (Jun 1 - Aug 31, 2023)
Finalists from Phase 1 will train a federated version of their model using NASA's federated learning platform.

Phase 1: Model development

Upon registering for the contest, participants will enter the Open Arena (you are here!). The Open Arena provides access to a training data set including 10 airports over approximately two years, as well as feature data for a withheld validation set sampled from the same time range.

Throughout this phase, participants can submit and score predicted pushback times to get feedback on their model’s performance. Once a participant submits predictions according to the Submission Format, those predictions will be compared to the ground truth and the score (mean absolute error) will be shown on the public leaderboard.

Pre-screening: In order to be eligible for the Prescreened Arena, participants must submit an attestation that they are eligible to participate according to the rules. Finalists from the Prescreened Arena will have to submit proof of eligibility.

Once successfully prescreened, participants can:

  • Enter the Prescreened Arena.
  • Make executable code submissions to the containerized test harness, which will execute submissions on the Prescreened Arena test data and produce scores for the Prescreened public leaderboard. The Prescreened private leaderboard determines final ranking. More detailed instructions will be provided in the Prescreened Arena.

Solutions in the Prescreened Arena may use both training and validation data from the Open Arena as training data for submissions. The test harness will execute solutions on a disjoint sample with a similar distribution as the Open Arena.

Phase 2: Federated Learning

Finalists from Phase 1 will be required to train a federated version of their models in Phase 2. Finalists will use a model-agnostic federated learning platform to re-train their model. Some features will remain centralized, while others will be partitioned by airline. Successfully federated models will be able to aggregate model weights from an arbitrary number of partitions. Read on for more detail about the competition's federated learning requirements.

The goal of Phase 2 is to explore federated learning techniques that enable independent clients (also known as parties, organizations, or groups) to jointly train a global model without sharing or pooling data. In this case, the clients are airlines. Federated learning is a natural fit for the problem of pushback time prediction because airlines collect a lot of information that is relevant to pushback time but too valuable or sensitive to share, like the number of passengers that have checked in for a flight or the number of bags that have been loaded onto a plane. Federated learning enables airlines to safely contribute the valuable and sensitive data they collect towards a centralized model. Each finalist from Phase 1 will be invited to translate their winning Phase 1 model into a model that can be trained in a federated manner. The data used in Phase 2 will be the same data used in Phase 1, but it will be divided into:

  • "Public" or "non-federated" variables for which all rows are available to all airlines
  • "Private" or "federated" variables for which only rows corresponding to a flight that an airline operates are available to that airline

The exact variables airlines want to protect by federating are by definition too sensitive to release for a competition. Since we don't have access to the actual airline data that would be federated, we'll simulate the federated scenario by treating a subset of variables as if they are private. All of the federated variables for Phase 2 come from the Phase 1 mfs and standtimes datasets:

Federated mfs variables

  • aircraft_engine_class
  • aircraft_type
  • major_carrier
  • flight_type

Federated standtimes variables

  • departure_stand_actual_time

Each airline will have access to its federated variables and all non-federated variables. Airlines will not have access to federated variables from other airlines. In Phase 2, these data constraints apply during training and prediction.

Below is an example of airline-specific federated variables using a sample of the KMEM_mfs.csv.bz2 file, which has the columns gufi, aircraft_engine_class, aircraft_type, major_carrier, flight_type, and is_departure.

The airline is indicated as a code at the beginning of the GUFI, e.g., "AAL" is American Airlines and "UAL" is United Airlines. During training, you will simulate an AAL client that trains using its own rows of the federated variables plus any of the public data and transmits model updates to a centralized server. A UAL client and clients for the other airlines will do the same. The centralized server decides how to aggregate all of the individual model weights into a single global model. During prediction, each airline will use the final trained model to make pushback predictions for all of the flights it operates. Your federated learning approach will be scored on the same time period and same flights as the test set from Phase 1; the only difference is that the federated variables can only be accessed by the airline that produced them.
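To make the airline partitioning concrete, here is a minimal sketch in pandas (the sample rows and GUFIs are hypothetical) that extracts the airline code from the start of each GUFI and splits the MFS metadata into per-airline partitions, as each federated client would see them:

```python
import pandas as pd

# Hypothetical sample of MFS metadata; real files are <airport>_mfs.csv.bz2.
mfs = pd.DataFrame({
    "gufi": ["AAL1008.ATL.DFW.210607.2033.0110.TFM",
             "UAL422.ORD.DEN.210607.1910.0042.TFM",
             "AAL77.MEM.LAX.210607.2101.0007.TFM"],
    "aircraft_type": ["B738", "B739", "A321"],
})

# The operating airline is the leading run of letters in the GUFI.
mfs["airline"] = mfs["gufi"].str.extract(r"^([A-Z]+)", expand=False)

# Simulate the federated partition: each client sees only its own rows.
partitions = {airline: df for airline, df in mfs.groupby("airline")}
print(sorted(partitions))      # ['AAL', 'UAL']
print(len(partitions["AAL"]))  # 2
```

In Phase 2 the non-federated variables would be visible to every client, while each partition above would stay local to its airline.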

In the spirit of experimentation, you will have some flexibility in how you translate your Phase 1 winning model during Phase 2. Do some strategies for combining the models from individual clients work better than others? Are there basic changes that your model needs to operate in a federated setting? We want to know! We will release more of the specific requirements as Phase 2 ramps up, but for now get started thinking about what it will take to turn your winning centralized model into an effective federated model!

About Federated Learning

There are a vast number of public and private organizations that collect and provide flight and airspace related data. However, privacy and intellectual property concerns prevent much of this data from being aggregated, and thus hamper the ability of models and analysts to make the best predictions and decisions.

Federated learning (FL), also known as collaborative learning, is a technique for collaboratively training a shared machine learning model across data from multiple parties while preserving each party's data privacy. Federated learning stands in contrast to the typical centralized machine learning, where the training data needs to be collected and centralized for training. Requiring the parties to share their data compromises the privacy of that data!

In Phase 1, your model does not have to be federated. However, you should choose an architecture with federation in mind. For example, linear models where weights can be aggregated with simple addition are easier to train in a federated way than tree-based models. You can learn more about federated learning on the About page.


This challenge is possible because of the effort that NASA, the FAA, airlines, and other agencies undertake to collect, process, and distribute data to decision makers in near real-time. You will be working with around two years of historical data, but any solution you develop could be translated directly into a pipeline with access to data collected in real time.

Location of airports

10 airports whose data is included in this competition

The data download page contains a tar archive for each of these airports. The structure of each archive is:

├── <airport>
│   ├── <airport>_config.csv.bz2
│   ├── <airport>_etd.csv.bz2
│   ├── <airport>_first_position.csv.bz2
│   ├── <airport>_lamp.csv.bz2
│   ├── <airport>_mfs.csv.bz2
│   ├── <airport>_runways.csv.bz2
│   ├── <airport>_standtimes.csv.bz2
│   ├── <airport>_tbfm.csv.bz2
│   └── <airport>_tfm.csv.bz2
└── train_labels_<airport>.csv.bz2

Read on to learn more about each of these files!
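Once an archive is unpacked, the .csv.bz2 files can be read directly; pandas infers bz2 decompression from the file extension. A small self-contained sketch (the filename and rows here are made up for illustration):

```python
import bz2

import pandas as pd

# Write a tiny stand-in for an <airport>_etd.csv.bz2 file (hypothetical rows).
with bz2.open("demo_etd.csv.bz2", "wt") as f:
    f.write("gufi,timestamp,estimated_runway_departure_time\n")
    f.write("AAL1.A.B.1.1.1.TFM,2021-06-08 19:15:00,2021-06-08 20:40:00\n")

# pandas handles the .bz2 compression transparently.
etd = pd.read_csv(
    "demo_etd.csv.bz2",
    parse_dates=["timestamp", "estimated_runway_departure_time"],
)
print(etd.shape)  # (1, 3)
```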

Feature data

The feature data for this competition includes information about air traffic and weather conditions.

Air traffic

This competition uses air traffic data from Fuser, a data processing platform designed by NASA as part of the ATD-2 project. Fuser processes the FAA's raw data stream and distributes cleaned, real-time data on the status of individual flights nationwide.

On the data download page there is a separate tar archive for each airport. Each tar archive contains the files listed below.

Actual departure time and runway code

<airport>_runways.csv has one row for each flight, and contains the following columns:

  • departure_runway_actual: The flight's actual departure runway as a runway code, e.g., 18R, 17C, etc.
  • departed_runway_actual_time: The time that the flight departed from the runway
  • arrival_runway_actual: The flight's actual arrival runway as a runway code, e.g., 18R, 17C, etc.
  • arrival_runway_actual_time: The time that the flight arrived at the runway
Airport configuration

<airport>_config.csv describes the active runway configuration at different times. Runway configuration is the combination of runways used for arrivals and departures and the flow direction on those runways. Each row is a different time at the given airport, and the columns are:

  • timestamp: The time that the Fuser system received the data (use this for filtering)
  • start_time: The time the configuration took effect
  • arrival_runways: The active arrival runway configuration starting from start_time as comma separated runway codes, e.g., 18R, 17C, etc.
  • departure_runway: The active departure runway configuration starting from start_time as comma separated runway codes, e.g., 18R, 17C, etc.
Estimated departure times

<airport>_etd.csv may have multiple rows for each flight and contains the following columns:

  • gufi: GUFI (Global Unique Flight Identifier)
  • timestamp: The time that the prediction was generated
  • estimated_runway_departure_time: Estimated time that the flight will depart from the runway
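Because estimates are issued repeatedly over time, only rows whose timestamp falls before the prediction time may be used. A sketch of selecting the latest usable ETD per flight in pandas, with hypothetical rows:

```python
import pandas as pd

# Hypothetical ETD rows for one flight; multiple estimates issued over time.
etd = pd.DataFrame({
    "gufi": ["F1"] * 3,
    "timestamp": pd.to_datetime(
        ["2020-11-15 03:10", "2020-11-15 03:50", "2020-11-15 04:40"]),
    "estimated_runway_departure_time": pd.to_datetime(
        ["2020-11-15 05:20", "2020-11-15 05:30", "2020-11-15 05:45"]),
})

prediction_time = pd.Timestamp("2020-11-15 04:00")

# Only estimates issued strictly before the prediction time are usable.
visible = etd[etd["timestamp"] < prediction_time]

# Take the most recently issued estimate per flight.
latest = (visible.sort_values("timestamp")
                 .groupby("gufi")
                 .last()["estimated_runway_departure_time"])
print(latest["F1"])  # 2020-11-15 05:30:00
```

Note that the 04:40 estimate is excluded: using it at an 04:00 prediction time would leak information from the future.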
Estimated arrival times

TFM (traffic flow management) and TBFM (time-based flow management) are two FAA systems that track flights in the NAS. TFM and TBFM forecast the estimated time of arrival (ETA) continuously throughout the duration of a flight.

TFM forecasts are available as <airport>_tfm.csv, and TBFM forecasts are available as <airport>_tbfm.csv. Both contain the following columns:

  • gufi: GUFI (Global Unique Flight Identifier)
  • timestamp: The time that the prediction was generated
  • scheduled_runway_estimated_time: Scheduled time that the flight will arrive at the runway
MFS event times

MFS provides actual event times and flight information during the lifecycle of a flight. MFS data is provided in two files: metadata and standtimes.

MFS metadata are available as <airport>_mfs.csv and contain critical information about the flight. MFS metadata differ from the other features in that they do not include a timestamp. It is assumed that these flight metadata are available for any flight for which a GUFI exists. However, certain uses of this metadata CSV violate the real-time constraints of the problem.

You may:

  • Look up (or "join") metadata for GUFIs that you already know exist within the valid time window from other timestamped features.
  • Use MFS data for the current flight that predictions are being generated for.

You may not:

  • Use the MFS metadata to look for information about flights for which you do not already have a GUFI from timestamped features.
  • Analyze the entire MFS metadata file to, for example, directly incorporate the distribution of aircraft type, carriers, etc. into your solution.

<airport>_mfs.csv has one row for each flight, and contains the columns:

  • gufi: GUFI (Global Unique Flight Identifier)
  • aircraft_engine_class: The class of engine
  • aircraft_type: The type of aircraft
  • major_carrier: The airline carrier
  • flight_type: The type of flight
  • is_departure: True if the flight is a departure, else False if it is an arrival

<airport>_standtimes.csv may have multiple rows for each flight and contains the actual arrival and departure times to/from a gate. For a given flight being predicted, this information is not available at the time of prediction. The columns are:

  • gufi: GUFI (Global Unique Flight Identifier)
  • timestamp: The time that the Fuser system received the data
  • arrival_stand_actual_time: The time the flight arrived at the gate at the destination airport.
  • departure_stand_actual_time: The time the flight departed the gate (the pushback time). At prediction time, this variable is known for other flights that have already pushed back, but is not available for the given flight being predicted.
First position

<airport>_first_position.csv contains each flight's first position report.

Note that while first position is tracked for all flights in the NAS, an airport's first position dataset only contains flights that are arriving at that airport, i.e., it does not contain first position for flights departing that airport.


Weather

LAMP (Localized Aviation MOS (Model Output Statistics) Program), a weather forecast service operated by the National Weather Service, will be the primary source of weather data. LAMP includes data for each of the airport facilities in the challenge. In addition to the temperature and humidity you'll find in your favorite weather app, LAMP includes quantities that are particularly relevant to aviation, such as visibility, cloud ceiling, and likelihood of lightning.

LAMP includes not only the retrospective weather, but also historical weather predictions, that is, at a point in time in the past, what we thought the weather was going to be. In other words, consider the weather at noon yesterday. In hindsight I know it was sunny (retrospective), but what was my prediction at 9 AM yesterday (historical prediction)? This distinction is critical: it ensures your models do not rely on information from the future while still giving them access to the best weather predictions available at the time. LAMP makes predictions every hour on the half hour, so 00:30, 01:30, 02:30, etc. Each prediction includes a forecast for the next 25 hours.
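A sketch in pandas (with made-up rows) of selecting a leak-free forecast: keep only forecasts issued at or before the prediction time, then take the freshest one that targets the hour of interest.

```python
import pandas as pd

# Hypothetical LAMP extract: forecasts issued on the half hour, each
# predicting conditions at a later forecast_timestamp.
lamp = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2021-06-08 08:30", "2021-06-08 09:30", "2021-06-08 10:30"]),
    "forecast_timestamp": pd.to_datetime(["2021-06-08 12:00"] * 3),
    "temperature": [71, 73, 74],
})

prediction_time = pd.Timestamp("2021-06-08 10:00")
target_time = pd.Timestamp("2021-06-08 12:00")

# Use only forecasts issued at or before the prediction time (no future
# leakage), then take the most recently issued one for the target hour.
usable = lamp[(lamp["timestamp"] <= prediction_time)
              & (lamp["forecast_timestamp"] == target_time)]
freshest = usable.sort_values("timestamp").iloc[-1]
print(freshest["temperature"])  # 73
```

The 10:30 forecast is excluded because it did not yet exist at the 10:00 prediction time, even though it is the most accurate forecast for noon.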

An extract of LAMP predictions will be available with the following format:

  • timestamp: The time that the forecast was generated
  • forecast_timestamp: The time for which the forecast is predicting weather conditions
  • temperature: Temperature in degrees Fahrenheit
  • wind_direction: Wind direction in compass heading divided by 10 and rounded to the nearest integer (to match runway codes)
  • wind_speed: Wind speed in knots
  • wind_gust: Wind gust speed in knots
  • cloud_ceiling: Cloud ceiling height in feet encoded as category indices
    • 1: <200 feet
    • 2: 200–400 feet
    • 3: 500–900 feet
    • 4: 1,000–1,900 feet
    • 5: 2,000–3,000 feet
    • 6: 3,100–6,500 feet
    • 7: 6,600–12,000 feet
    • 8: >12,000 feet
  • visibility: Visibility in miles encoded as category indices
    • 1: <½ mile
    • 2: ½–1 mile
    • 3: 1–2 miles
    • 4: 2–3 miles
    • 5: 3–5 miles
    • 6: 6 miles
    • 7: >6 miles
  • cloud: Total sky cover category
    • "BK": broken
    • "CL": clear
    • "FEW": few
    • "OV": overcast
    • "SC": scattered
  • lightning_prob: Probability of lightning
    • "N": none
    • "L": low
    • "M": medium
    • "H": high
  • precip: Boolean indicating whether precipitation is expected
    • True: precipitation is expected
    • False: no precipitation expected

Check out the Meteorological Development Lab website and the original LAMP paper for more information.


Labels

Within the tar archive for each airport, labels are provided in train_labels_<airport>.csv. The target variable is minutes_until_pushback, or minutes until actual pushback time from time of prediction for a given flight. Predictions are made every 15 minutes starting from roughly an hour before scheduled pushback time until actual pushback time.

Here is an example showing minutes_until_pushback beginning at 04:00:00 at Chicago O’Hare (KORD). Each individual flight ID or gufi has labels for multiple timestamps.

gufi                                  timestamp            airport  minutes_until_pushback
SKW5143.ORD.EAU.201031.0059.0006.TFM  2020-11-15 04:00:00  KORD     85
SKW5143.ORD.EAU.201031.0059.0006.TFM  2020-11-15 04:15:00  KORD     70
SKW5143.ORD.EAU.201031.0059.0006.TFM  2020-11-15 04:30:00  KORD     55
SKW5143.ORD.EAU.201031.0059.0006.TFM  2020-11-15 04:45:00  KORD     40
SKW5143.ORD.EAU.201031.0059.0006.TFM  2020-11-15 05:00:00  KORD     25
SKW5143.ORD.EAU.201031.0059.0006.TFM  2020-11-15 05:15:00  KORD     10

Submission format

For the Open Arena, you will be submitting a CSV of predictions for each flight and prediction time, where the index is gufi, timestamp, and airport:

minutes_until_pushback should be an integer representing the number of minutes from the time of prediction (timestamp) until the actual pushback time. To generate a submission, download the submission format from the Data download page and replace the values in the minutes_until_pushback column with your predictions. The other columns of your submission must exactly match the submission format.

Note that the submission format may be either a CSV or zipped CSV. Zipping your submission can dramatically reduce the file size and upload time.

For example, the first few rows of submission_format.csv are:

AAL1008.ATL.DFW.210607.2033.0110.TFM,2021-06-08 19:15:00,KATL,0
AAL1008.ATL.DFW.210607.2033.0110.TFM,2021-06-08 19:30:00,KATL,0
AAL1008.ATL.DFW.210607.2033.0110.TFM,2021-06-08 19:45:00,KATL,0
AAL1008.ATL.DFW.210607.2033.0110.TFM,2021-06-08 20:00:00,KATL,0

And your submission might look like this:

gufi                                  timestamp            airport  minutes_until_pushback
AAL1008.ATL.DFW.210607.2033.0110.TFM  2021-06-08 19:15:00  KATL     92
AAL1008.ATL.DFW.210607.2033.0110.TFM  2021-06-08 19:30:00  KATL     75
AAL1008.ATL.DFW.210607.2033.0110.TFM  2021-06-08 19:45:00  KATL     62
AAL1008.ATL.DFW.210607.2033.0110.TFM  2021-06-08 20:00:00  KATL     44
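A sketch of writing a zipped submission with pandas (the prediction values here are hypothetical); pandas infers the compression scheme from the .zip extension, and read_csv can load it back the same way:

```python
import pandas as pd

# Hypothetical predictions with the required columns, in the format's order.
sub = pd.DataFrame({
    "gufi": ["AAL1008.ATL.DFW.210607.2033.0110.TFM"] * 2,
    "timestamp": ["2021-06-08 19:15:00", "2021-06-08 19:30:00"],
    "airport": ["KATL", "KATL"],
    "minutes_until_pushback": [92, 75],
})

# Zipping the CSV shrinks the upload considerably; compression is
# inferred from the .zip extension.
sub.to_csv("submission.zip", index=False)

print(pd.read_csv("submission.zip").shape)  # (2, 4)
```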

Performance metric

Performance is evaluated according to Mean Absolute Error (MAE), which measures how much the estimated values differ from the observed values. MAE is the mean of the magnitude of the differences between the predicted values and the ground truth. MAE is always non-negative, with lower values indicating a better fit to the data. The competitor that minimizes this metric will top the leaderboard.

$$ MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| $$

where:

  • $N$ is the number of samples
  • $\hat{y}_i$ is the estimated minutes until pushback for the $i$th sample
  • $y_i$ is the actual minutes until pushback of the $i$th sample

In this case, each sample is a unique combination of gufi and timestamp.

In Python you can calculate MAE using the scikit-learn function sklearn.metrics.mean_absolute_error(y_true, y_pred).
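For example, computing MAE from the formula directly, with made-up values (scikit-learn's mean_absolute_error returns the same result):

```python
def mae(y_true, y_pred):
    # Mean of the absolute differences between truth and prediction.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

actual    = [85, 70, 55, 40]   # minutes until pushback (ground truth)
predicted = [90, 65, 50, 40]   # hypothetical model output

print(mae(actual, predicted))  # 3.75
```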

Keep in mind that scores displayed on the public Open Arena leaderboard while the competition is running will not be the same as the final scores on the Prescreened Arena leaderboard.

See if you can beat the baseline score on the leaderboard, which simply predicts the number of minutes until 15 minutes before the latest estimated time of departure. This simple baseline works because most flights push back around 15 minutes before departure, and most flights depart on time.
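That baseline can be sketched for a single flight as follows, assuming the latest estimated runway departure time has already been looked up from the ETD data (the timestamps here are hypothetical):

```python
import pandas as pd

# Latest estimated runway departure time known at the prediction time.
latest_etd = pd.Timestamp("2020-11-15 05:40")
prediction_time = pd.Timestamp("2020-11-15 04:00")

# Baseline: minutes from the prediction time until 15 minutes before the ETD.
baseline = (latest_etd - pd.Timedelta(minutes=15)
            - prediction_time).total_seconds() / 60
print(int(baseline))  # 85
```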

Good luck

Good luck and enjoy this challenge! Check out the benchmark blog post for tips on how to get started. If you have any questions you can always visit the user forum.