U.S. PETs Prize Challenge: Phase 2 (Pandemic Forecasting) Hosted By NIST and NSF

$575,000

Centralized Code Submission Format


This page documents the submission format for your centralized solution code for Track B: Pandemic Forecasting. Exact submission specifications may be subject to change until submissions open. Each team will be able to make one final submission for evaluation. In addition to local testing and experimentation, teams will also have limited access to test their solutions through the hosted infrastructure later in Phase 2. More details will be provided when submissions open.

Your centralized solution should be a version of your federated solution that runs with full data access and under no privacy threat. The performance of your federated solution relative to the performance of your centralized solution will be part of the evaluation of your overall solution.

The full source code and environment specification for how your code is run are available from the challenge runtime repository.

What to submit


Your submission should be a zip archive named with the extension .zip (e.g., submission.zip). The root level of the archive must contain a solution_centralized.py module that contains the following named functions:

  • fit: A function that fits your model on the training data and writes your model to disk
  • predict: A function that loads your model and performs inference on the test data
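As a rough illustration of the archive layout described above, the following sketch zips a local folder so that solution_centralized.py lands at the root of the archive rather than inside a parent folder. The helper name build_submission and the folder layout are hypothetical and not part of the challenge tooling.

```python
import zipfile
from pathlib import Path
from typing import List


def build_submission(source_dir: Path, archive_path: Path) -> List[str]:
    """Zip the files under source_dir so they sit at the archive root.

    Hypothetical helper for local packaging; not part of the challenge runtime.
    """
    module = source_dir / "solution_centralized.py"
    if not module.is_file():
        raise FileNotFoundError(f"{module} must exist so it ends up at the archive root")

    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            # Skip directories and the archive itself if it lives inside source_dir.
            if not path.is_file() or path.resolve() == archive_path.resolve():
                continue
            # arcname is relative to source_dir, so no parent folder is added.
            archive.write(path, arcname=path.relative_to(source_dir).as_posix())
        return archive.namelist()
```

Calling build_submission(Path("my_solution"), Path("submission.zip")) would return the archive's member names, which makes it easy to confirm that solution_centralized.py appears without a leading folder.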

Making a submission kicks off a containerized evaluation job. This job will run a Python main_centralized_train.py script, which will import your fit function and call it with the appropriate training data access and filesystem access. Then, the job will run a Python main_centralized_test.py script, which will import your predict function and call it with the appropriate test data access and filesystem access. Please see the following sections on the training and test stages for additional details and API specifications.

Training

Your submitted solution_centralized.py module must contain a fit function that matches the API below. This fit function will be called with paths to the training data, which you can then load and use to train your model. It will also be called with a directory path model_dir, which you should use to save your model. The later test stage will be a separate Python process that does not share any in-memory scope with training. Therefore, saving to and loading from the model_dir directory is the only way for the test stage to access your trained model.

def fit(
    person_data_path: pathlib.Path, 
    household_data_path: pathlib.Path, 
    residence_location_data_path: pathlib.Path, 
    activity_location_data_path: pathlib.Path, 
    activity_location_assignment_data_path: pathlib.Path, 
    population_network_data_path: pathlib.Path, 
    disease_outcome_data_path: pathlib.Path, 
    model_dir: pathlib.Path
) -> None:
    """Function that fits your model on the provided training data and saves 
    your model to disk in the provided directory. 

    Args:
        person_data_path (Path): Path to CSV data file for the Person table.
        household_data_path (Path): Path to CSV data file for the Household table.
        residence_location_data_path (Path): Path to CSV data file for the 
            Residence Locations table.
        activity_location_data_path (Path): Path to CSV data file for the 
            Activity Locations table.
        activity_location_assignment_data_path (Path): Path to CSV data file 
            for the Activity Location Assignments table.
        population_network_data_path (Path): Path to CSV data file for the 
            Population Network table.
        disease_outcome_data_path (Path): Path to CSV data file for the Disease
            Outcome table.
        model_dir (Path): Path to a directory that is constant between the train
            and test stages. You must use this directory to save and reload 
            your trained model between the stages.

    Returns: None
    """
    ...

For more details on the input data files, please see the data overview page.
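To make the save-to-disk contract concrete, here is a minimal sketch of a fit function that matches the signature above and persists a trivial baseline to model_dir. It is only an illustration under stated assumptions: the column names pid and state, the infected marker "I", and the file name model.json are all hypothetical, so check the data overview page for the real schema.

```python
import csv
import json
from pathlib import Path


def fit(
    person_data_path: Path,
    household_data_path: Path,
    residence_location_data_path: Path,
    activity_location_data_path: Path,
    activity_location_assignment_data_path: Path,
    population_network_data_path: Path,
    disease_outcome_data_path: Path,
    model_dir: Path,
) -> None:
    """Baseline sketch: store the fraction of people ever marked infected."""
    # ASSUMPTION: the Disease Outcome CSV has "pid" and "state" columns and
    # uses the value "I" for infected records. These names are hypothetical.
    all_pids, infected_pids = set(), set()
    with open(disease_outcome_data_path, newline="") as f:
        for row in csv.DictReader(f):
            all_pids.add(row["pid"])
            if row["state"] == "I":
                infected_pids.add(row["pid"])

    base_rate = len(infected_pids) / len(all_pids) if all_pids else 0.0

    # The test stage is a separate process, so everything it needs must be
    # written to model_dir. "model.json" is a hypothetical file name.
    model_dir.mkdir(parents=True, exist_ok=True)
    with open(model_dir / "model.json", "w") as f:
        json.dump({"base_rate": base_rate}, f)
```

A real solution would use the other tables as well. The point of the sketch is only that the trained state is written to model_dir rather than kept in memory.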

Test

Your submitted solution_centralized.py module must contain a predict function that matches the below API. This predict function will be called with paths to the data needed for test-time inference.

  • You should not be doing any additional training during the test stage.
  • You should load your saved trained model from the model_dir directory.
  • Your predict function will be called with preds_format_path, the path to a CSV file that serves as a template for your predictions. Your predictions should be written to preds_dest_path.

def predict(
    person_data_path: pathlib.Path, 
    household_data_path: pathlib.Path, 
    residence_location_data_path: pathlib.Path, 
    activity_location_data_path: pathlib.Path, 
    activity_location_assignment_data_path: pathlib.Path, 
    population_network_data_path: pathlib.Path, 
    disease_outcome_data_path: pathlib.Path, 
    model_dir: pathlib.Path,
    preds_format_path: pathlib.Path,
    preds_dest_path: pathlib.Path,
) -> None:
    """Function that loads your model from the provided directory and performs
    inference on the provided test data. Predictions should match the provided
    format and be written to the provided destination path.

    Args:
        person_data_path (Path): Path to CSV data file for the Person table.
        household_data_path (Path): Path to CSV data file for the Household table.
        residence_location_data_path (Path): Path to CSV data file for the 
            Residence Locations table.
        activity_location_data_path (Path): Path to CSV data file for the 
            Activity Locations table.
        activity_location_assignment_data_path (Path): Path to CSV data file 
            for the Activity Location Assignments table.
        population_network_data_path (Path): Path to CSV data file for the 
            Population Network table.
        disease_outcome_data_path (Path): Path to CSV data file for the Disease
            Outcome table.
        model_dir (Path): Path to a directory that is constant between the train
            and test stages. You must use this directory to save and reload 
            your trained model between the stages. 
        preds_format_path (Path): Path to CSV file matching the format you must
            write your predictions with, filled with dummy values. 
        preds_dest_path (Path): Destination path that you must write your test
            predictions to as a CSV file.

    Returns: None

    """
    ...
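As an illustration of the load-then-write flow, the sketch below implements predict for a trivial baseline. It assumes a hypothetical model.json file in model_dir holding a single base_rate value, and it assumes the format file is comma-delimited with a header row containing pid and score. Neither assumption comes from the challenge specification.

```python
import csv
import json
from pathlib import Path


def predict(
    person_data_path: Path,
    household_data_path: Path,
    residence_location_data_path: Path,
    activity_location_data_path: Path,
    activity_location_assignment_data_path: Path,
    population_network_data_path: Path,
    disease_outcome_data_path: Path,
    model_dir: Path,
    preds_format_path: Path,
    preds_dest_path: Path,
) -> None:
    """Baseline sketch: assign every individual the same stored score."""
    # Load the state saved during training. No further training happens here.
    # "model.json" and its "base_rate" key are hypothetical names.
    with open(model_dir / "model.json") as f:
        score = float(json.load(f)["base_rate"])
    score = min(1.0, max(0.0, score))  # keep the score inside [0.0, 1.0]

    # Copy the template row by row, replacing only the dummy score values so
    # the pid rows and column order stay exactly as provided.
    with open(preds_format_path, newline="") as src, \
            open(preds_dest_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row["score"] = score
            writer.writerow(row)
```

Streaming the template row by row keeps memory use flat regardless of population size. A model that scores individuals differently would look up each row's pid instead of writing a constant.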

Predictions Format

Your predict function should produce a predictions CSV file written to the provided preds_dest_path file path. Each row should correspond to one individual in the population, identified by the column pid. Each row should also have a float value in the range [0.0, 1.0] for the column score, which represents the risk that the individual will become infected during the test period of the synthetic outbreak. A higher score means higher confidence that the individual will become infected. A predictions format CSV is provided via preds_format_path that matches the rows and columns that your predictions need to have, with dummy values for score. You can load that file and replace the score values with ones from your model.

pid score
0 0.5
1 0.5
2 0.5
...
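For local testing, a small check along the following lines can catch format mistakes before submitting. This is a sketch, not the official validation: the helper name check_predictions is hypothetical, the files are assumed to be comma-delimited with a header row, and requiring the same row order as the format file may be stricter than the real evaluation. The messages report counts only, so no data values are echoed.

```python
import csv
import math
from pathlib import Path
from typing import List, Tuple


def _read(path: Path) -> Tuple[List[str], List[dict]]:
    """Return (column names, rows) for a comma-delimited CSV with a header."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        return list(reader.fieldnames or []), rows


def check_predictions(preds_format_path: Path, preds_dest_path: Path) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    expected_cols, expected = _read(preds_format_path)
    actual_cols, actual = _read(preds_dest_path)

    if "pid" not in expected_cols or "score" not in expected_cols:
        return ["format file lacks the pid and score columns"]
    if actual_cols != expected_cols:
        return ["columns do not match the format file"]

    problems = []
    # ASSUMPTION: rows must appear in the same order as in the format file.
    if [r["pid"] for r in actual] != [r["pid"] for r in expected]:
        problems.append("pid values differ from the format file")

    bad = 0
    for row in actual:
        try:
            score = float(row["score"])
        except (TypeError, ValueError):
            bad += 1
            continue
        if math.isnan(score) or not 0.0 <= score <= 1.0:
            bad += 1
    if bad:
        problems.append(f"{bad} score value(s) missing, non-numeric, or outside [0.0, 1.0]")
    return problems
```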

Data Access and Scope

Your code is called with specific scope and data access. Please note that attempts to circumvent the structure of the setup are grounds for disqualification.

  • Your code should not inspect the data or print any data to the logs.
  • Your code should not directly access any data files other than what the evaluation runner explicitly provides to each function.
  • Your code should not exceed its scope. Directly accessing or modifying any global variables or evaluation runner state is forbidden.

If in doubt about whether something is okay, you may email us or post on the forum.

Runtime specs


Your code is executed within a container that is defined in our runtime repository. The limits are as follows:

  • Your submission must be written in Python (Python 3.9.13) and use the packages defined in the runtime repository.
  • The submission must complete execution in 6 hours or less.
  • The container has access to 6 vCPUs, 56 GB RAM, and 1 GPU with 12 GB of memory.
  • The container will not have network access. All necessary files (code and model assets) must be included in your submission.
  • The container execution will not have root access to the filesystem.

Requesting additional dependencies


Since the Docker container will not have network access, all dependency packages must be pre-installed in the container image. We are happy to consider additional packages as long as they are approved by the challenge organizers, do not conflict with each other, and can build successfully. Packages must be available through conda for Python 3.9.13. To request that an additional package be added to the Docker image, follow the instructions in the runtime repository README.

Note: Since package installations need to be approved, be sure to submit any PRs requesting installation by January 4, 2023 to ensure they are incorporated in time for you to make a successful submission.


Happy building! Once again, if you have any questions or issues you can always head over to the user forum!