
# Data Track B: Pandemic Forecasting

## Transforming Pandemic Response and Forecasting

There are two data use case tracks for the PETs prize challenge. This is the data overview page for the pandemic forecasting track. In this track, innovators will develop end-to-end privacy-preserving federated learning solutions to accurately forecast an individual's risk of infection. You can find the other track, on financial crime prevention, here.

## Background

Federated learning (FL), or more generally collaborative learning, shows huge promise for machine learning applications derived from sensitive data by enabling training on distributed data sets without sharing the data among the participating parties.

The PETs Prize Challenge is focused on enhancing cross-organization, cross-border data sharing to support efforts to enable better public health related forecasting to bolster pandemic response capabilities. The COVID-19 pandemic has taken an immense toll on human lives and had an unprecedented socio-economic impact on individuals and societies around the globe. As we continue to deal with COVID-19, it has become apparent that better ways to harness the power of data by employing privacy-preserving data sharing and analytics are critical to preparing for such pandemics or public health crises.

In this Data Track, you are asked to develop innovative, privacy-preserving FL solutions utilizing a dataset of a synthetic population (a.k.a. digital twin) that simulates a population with statistical and dynamical properties similar to a real population. You will design and later develop privacy-preserving federated learning solutions that can predict an individual's risk of infection. If an individual knows that they are likely to be infected in the next week, they can take more prophylactic measures (masking, changing plans, etc.) than they usually might. Furthermore, this knowledge could help public health officials better predict interventions and staffing needs for specific regions.

By focusing on this public health use case, the Challenge aims to drive innovation in privacy-enhancing technologies by exploring their role in a set of high-impact use cases where there are currently challenging trade-offs between enabling sufficient access to data and limiting the identifiability of individuals or inference of their privacy-sensitive information within those data sets. Though novel innovation for this use case alone could achieve significant real-world impact, the challenge is designed to incentivize development of privacy technologies that can be applied to other use cases where data is distributed across multiple organizations or jurisdictions, both in public health and elsewhere. The best solutions will deliver meaningful innovation towards deployable solutions in this space, with consideration of how to evidence the privacy guarantees offered to data owners and regulators, but also have the potential to generalize to other situations.

## The Data

You will be provided with datasets representing synthetic population data that includes:

• a social contact network, capturing when and where any two people come into contact, and the duration of the contact
• demographic attributes of individuals
• observations of individuals’ health state (i.e., whether they are infected or not).

These datasets have been generated by the University of Virginia’s Biocomplexity Institute (UVA-BII). Two synthetic population datasets are provided: the first simulates the population of the state of Virginia, USA, and the second the population of the U.K. The two datasets (Virginia and U.K.) have identical structures and schemas (detailed below), but differ in size and scale. This challenge (the U.S. challenge) primarily focuses on the Virginia dataset. The U.K. dataset is provided as supplemental data for development purposes, but there will not be an equivalent dataset during Phase 2 evaluation.

Using these datasets and an agent-based outbreak simulation, the UVA-BII team created 63 days of simulated disease outbreak data, subsequently split into 56 days of training data and 7 days of target data. You will be provided with the synthetic populations and the first 56 days of the simulated outbreak datasets for model training. Your task is to predict a risk score for the binary disease state (infected or not infected) of each individual in the final week of the simulation. That is, you should predict the risk of whether each person in the population will be in an infected state on any day in the last 7 days.
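As a minimal sketch of the label construction described above (assuming a pandas workflow, hypothetical example records, and a hypothetical day range for the final week), the binary "infected during the target week" ground truth could be derived from daily disease-state records like this:

```python
import pandas as pd

# Hypothetical daily disease-state records (columns follow Table 7's schema).
disease = pd.DataFrame({
    "day":   [54, 55, 56, 57, 58, 57, 58],
    "pid":   [1,  1,  1,  1,  1,  2,  2],
    "state": ["S", "S", "I", "I", "R", "S", "S"],
})

TARGET_START, TARGET_END = 56, 62  # hypothetical day indices for the final week

target_week = disease[disease["day"].between(TARGET_START, TARGET_END)]
# A person counts as infected if they show a visible I state on any target day.
infected = (
    target_week.assign(is_infected=target_week["state"].eq("I").astype(int))
    .groupby("pid")["is_infected"].max()
)
```

Note that the actual target labels are supplied in the disease outcome target table; this sketch only illustrates how the labels relate to the daily state records.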

Details about the size of the synthetic datasets:

| | Virginia dataset | U.K. dataset |
| --- | --- | --- |
| Population size | 7.7 million | 62 million |
| Number of social contacts | 181 million | 722 million |
| Number of disease state records (upper bound based on 1 reading per individual per day) | 430 million | 3.5 billion |

Note: The challenges are based on synthetic data to minimize the security burden placed on participants during the development phase; of course the intent of the challenge is to develop privacy solutions appropriate for use on real datasets with demonstrable privacy guarantees. However, participants must adhere to a data use agreement (see competition rules for more details).

#### Disease States and Asymptomatic Infection

Each individual in the population will have a disease state for each day of the dataset. The disease state acts like a finite-state machine with the following states:

• S: indicates "susceptible". This individual is susceptible to infection.
• I: indicates "infected". This individual has been infected and is infectious. They can infect other individuals (who are susceptible) through contact.
• R: indicates "recovered". This individual has recovered and is no longer infectious or infectable.

Above: State transition diagram for disease state in the challenge dataset.

An individual disease state can only progress from S to I to R. Once an individual is in the recovered R state, they will not be able to become infected again within this simulation.

This dataset also features an additional layer of complexity around asymptomatic infection. Some individuals in the simulation will be asymptomatically infected and will transition through the I state, but that I state is hidden from the data. Such an individual is still infectious while in the I state but will appear in the data as being in the S state before transitioning to the R state. This is in contrast to individuals with the normal symptomatic infection, whose disease state is transparently visible in the data. Individuals with both normal symptomatic infection and asymptomatic infection will transition to a visible R state upon recovery.

Above: State transition diagram for disease state in the challenge dataset showing both symptomatic and asymptomatic infections. Asymptomatically infected individuals will have a hidden infected state in the simulation, but appear as susceptible in the dataset. For this reason, asymptomatic infections may appear to transition directly from susceptible to recovered.
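The visible (recorded) trajectories for symptomatic versus asymptomatic individuals can be sketched as follows (hypothetical infection and recovery days; the actual dynamics come from UVA-BII's agent-based simulator):

```python
def simulate_visible_states(days, infect_day, recover_day, asymptomatic):
    """Return the per-day state sequence as it appears in the dataset.

    The true progression is always S -> I -> R; for asymptomatic
    individuals, the I period is recorded as S (the hidden state).
    """
    visible = []
    for day in range(days):
        if day < infect_day:
            visible.append("S")
        elif day < recover_day:
            visible.append("S" if asymptomatic else "I")  # hidden I if asymptomatic
        else:
            visible.append("R")
    return visible

# Symptomatic: the I state is visible in the data.
print(simulate_visible_states(6, 2, 4, asymptomatic=False))  # ['S', 'S', 'I', 'I', 'R', 'R']
# Asymptomatic: the individual appears to jump straight from S to R.
print(simulate_visible_states(6, 2, 4, asymptomatic=True))   # ['S', 'S', 'S', 'S', 'R', 'R']
```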

#### Development and Evaluation Data

The datasets being provided are intended for local development use in both Phase 1 and Phase 2. Each dataset has been split in time—the first 56 days of the dataset is the training set, and the final 7 days of the dataset is the forecast target data. The prediction task, as detailed in a later section, is to make predictions for the final 7 days of the dataset. The ground truth is provided for the forecast target period for the development datasets.

In Phase 2, a separate and held-out dataset will be used for solution evaluation. You can expect the Phase 2 evaluation dataset to be close in size and statistical distribution to the development Virginia dataset, but to have a different contact network and an independent disease outbreak simulation. In Phase 2, you will submit code for your solution to a code execution environment. The code execution runtime will run cold-start federated training on the new dataset's training split and then run inference to generate predictions for the forecast target period. Your solution's performance will be measured by evaluating its predictions against the ground truth for the new dataset's target period.

#### Partitions

For local development, you are provided a full, unpartitioned dataset. In Phase 2 evaluation, the evaluation data will be partitioned along administrative boundaries, e.g., by grouped FIPS codes/counties. An edge exists between two people (nodes) who come into contact, as specified by the population contact network table; any cross-partition edges will be dropped. That is, each partition will only have visibility into edges between people that reside within the same partition. Cross-partition edges will continue to affect the spread of infection, but will not be visible to any of the partitions. Any partitioning of the data that you might perform in your local development experiments should take this into account. Your solution should be able to handle any number of partitions, and in Phase 2, we may evaluate your solution with a number of partitions between 1 and 10.
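For local experiments, the partitioning rule above could be approximated like this (hypothetical person-to-partition assignment and toy edges; only within-partition edges survive):

```python
import pandas as pd

# Hypothetical person-to-partition assignment (e.g., by grouped county FIPS codes).
person_partition = {1: "A", 2: "A", 3: "B", 4: "B"}

# Contact network edges (subset of the Table 6 columns).
edges = pd.DataFrame({
    "pid1": [1, 1, 3, 2],
    "pid2": [2, 3, 4, 4],
})

edges["part1"] = edges["pid1"].map(person_partition)
edges["part2"] = edges["pid2"].map(person_partition)

# Cross-partition edges are dropped entirely: neither partition sees them,
# even though they still influenced the underlying outbreak simulation.
within = edges[edges["part1"] == edges["part2"]]
partitions = {name: grp.drop(columns=["part1", "part2"])
              for name, grp in within.groupby("part1")}
```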

#### Data Details

Each synthetic population dataset can be considered as a relational database consisting of eight tables, each provided as a separate CSV file and each following a particular schema. The fields and their descriptions for each table are shown below. The eight tables are:

• Person data - Information about individuals in the dataset
• Household data - Information about individual households
• Residence location data - Location information about household residences
• Activity location data - Information about locations where non-home activities take place
• Activity location assignment data - Information about which activities each person was involved in, where each took place, and timing information. Activities repeat every day, as in the film Groundhog Day
• Population contact network data - Information about the contact network between people, including location, time and duration. This table is generated from the Activity location assignment data table, and also repeats every day
• Disease outcome training data - Information about each individual's health status on a particular day
• Disease outcome target data - Binary ground truth infected label for each individual for the forecast period
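A minimal loading sketch for the eight tables might look like the following (the file names here are hypothetical; substitute the actual CSV names from the data download):

```python
import pandas as pd

# Hypothetical file names keyed by table; adjust to match the data download.
TABLES = {
    "person": "va_person.csv",
    "household": "va_household.csv",
    "residence_location": "va_residence_location.csv",
    "activity_location": "va_activity_location.csv",
    "activity_location_assignment": "va_activity_location_assignment.csv",
    "population_contact_network": "va_population_contact_network.csv",
    "disease_outcome_training": "va_disease_outcome_training.csv",
    "disease_outcome_target": "va_disease_outcome_target.csv",
}

def load_tables(data_dir="."):
    """Load all eight CSV tables into a dict of DataFrames."""
    return {name: pd.read_csv(f"{data_dir}/{fname}")
            for name, fname in TABLES.items()}
```

Given the sizes in the table above, the full Virginia contact network may not fit comfortably in memory on a laptop; chunked reading (`pd.read_csv(..., chunksize=...)`) is one option for development.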

Table 1: Person data has the following fields:

• Household ID (hid) - An integer identifying a household
• Person ID (pid) - A unique integer identifying a person
• Person number (person_number) - Sequence ID of a person within the household. A household of size 3 will have people with person_numbers 1, 2, and 3.
• Age - Age of the person
• Sex - Gender of the person

Table 2: Household data has the following fields:

• Household ID (hid) - A unique integer identifying the household
• Residence ID (rlid) - An integer identifying the residence
• Admin 1 - ADCW ID for the admin1 region for the U.K., or state FIPS code for Virginia
• Admin 2 - Equal to admin1 for the U.K. (ADCW does not assign admin2 region IDs for the U.K.); 3-digit county FIPS code for Virginia
• Household Size (hh_size) - Number of persons in the household

Table 3: Residence location data has the following fields:

• Residence ID (rlid) - A unique integer identifying the residence
• Longitude - the longitude of the location
• Latitude - the latitude of the location
• Admin 1 - See household file description
• Admin 2 - See household file description

Table 4: Activity location data has the following fields:

• Activity location ID (alid) - Unique integer identifying the location where activity took place
• Longitude - Longitude of the location
• Latitude - Latitude of the location
• Admin 1 - See household file description
• Admin 2 - See household file description
• Work - Does location support work activity? (Value is 0 or 1)
• Shopping - Does location support shopping activity? (Value is 0 or 1)
• School - Does location support school activity? (Value is 0 or 1)
• Other - Does location support other activity? (Value is 0 or 1)
• College - Does location support college activity? (Value is 0 or 1)
• Religion - Does location support religion activity? (Value is 0 or 1)

Table 5: Activity location assignment data has the following fields:

• Household ID - Household ID of the person
• Person ID - Person ID of the person
• Activity number - Position of the activity within the person's daily activity sequence
• Activity type - Activity type (1: Home, 2: Work, 3: Shopping, 4: Other, 5: School, 6: College)
• Start Time - Start time of activity in seconds since midnight
• Duration - Duration of the activity in seconds
• Location ID - Location ID of the location where the activity takes place (rlid or alid)

Table 6: Population contact network data has the following fields:

• Person ID 1 (pid1) - Person ID number 1 of this edge
• Person ID 2 (pid2) - Person ID number 2 of this edge
• Location ID (lid) - Location ID where contact (edge) arises
• Start time - Start time of the contact between Person ID 1 and Person ID 2 measured in seconds since midnight
• Duration - Duration of the contact measured in seconds
• Activity ID 1 (activity1) - Activity type of Person ID 1 at time of contact (see above)
• Activity ID 2 (activity2) - Activity type of Person ID 2 at time of contact (see above)
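As one illustrative feature built from this table, total contact duration per person can be aggregated from the edges; since each edge is undirected, it must be counted for both endpoints (toy data below, column names following the schema above):

```python
import pandas as pd

# Toy subset of the population contact network table.
contacts = pd.DataFrame({
    "pid1": [1, 1, 2],
    "pid2": [2, 3, 3],
    "duration": [600, 300, 1200],  # contact durations in seconds
})

# Each undirected edge contributes to both endpoints, so stack pid1 and pid2.
both_ends = pd.concat([
    contacts.rename(columns={"pid1": "pid"})[["pid", "duration"]],
    contacts.rename(columns={"pid2": "pid"})[["pid", "duration"]],
])
total_contact = both_ends.groupby("pid")["duration"].sum()
```

Aggregates like this (total exposure time, number of distinct contacts, exposure to known-infected contacts) are natural per-person features for the risk model.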

Table 7: Disease outcome training data has the following fields:

• Day - Simulation day
• Person ID - ID of the person in the Person data or the Contact Network data
• Disease state - Disease-related state of the person on that day with possible values S for "susceptible", I for "infected", and R for "recovered". See previous section for details.

Table 8: Disease outcome target data has the following fields:

• Person ID - ID of the person in the Person data or the Contact Network data
• Infected - Binary variable with 1 if the person was infected in the final week of the simulation and 0 otherwise

Above: Overall data model of the synthetic population data. The arrows indicate dependencies between files; for example, the hid element in the person file should be constrained to the values available in the household file.

## Threat Profile

You will design and develop end-to-end solutions that preserve privacy across a range of possible threats and attack scenarios, through all stages of the machine learning model lifecycle. You should therefore carefully consider the overall privacy of your solution, focusing on the protection of sensitive information held by all parties involved in the federated learning scenario. The solutions you design and develop should include comprehensive measures to address the threat profiles described below. These measures will provide an appropriate degree of resilience to a wide range of potential attacks defined within the threat profile. For more information on threat profiles, please visit the privacy threat profile section of the problem description.

### Scope of sensitive data

Your solution must prevent the unintended disclosure of sensitive information in the population dataset to any other party, including other insider stakeholders (i.e., the other federation units) and outsiders.

The following information in the dataset should be treated as sensitive:

• All personally identifiable information, including the identity of a person, their housing information, their demographics such as age, etc.
• Location activities of an individual
• The health state of an individual
• Social contact information, including the location of any social contact, and the identities of who was involved in the social contact.

## Forecasting Individual Risk of Infection

The key analytical objective of the privacy-preserving federated learning (PPFL) task is to effectively train a model that can predict the risk of infection for individuals in a population over a period of time in a privacy-preserving manner. Such a predictive capability is important from the perspective of an individual; if they know they are likely to be infected in the next week, they may choose to take additional preventative measures (such as wearing a face covering, reducing social contacts, or taking antivirals prophylactically). Additionally, this capability can help inform the measures public health authorities implement to respond to such outbreaks.

### Prediction Target and Evaluation Metric

The target variable for the modeling task is a risk score (between 0.0 and 1.0) for each individual in the population. The risk score corresponds to a confidence that the individual enters a symptomatic infected I disease state at any time during the final week of the simulation (between days 56 and 63).

Note that, as discussed in the previous section, some individuals become infected asymptomatically. Those individuals count as negative cases (not infected) for the purposes of the ground truth.

The evaluation metric will be Area Under the Precision–Recall Curve (AUPRC), also known as average precision (AP), PR-AUC, or AUCPR. This is a commonly used metric for binary classification that summarizes model performance across all operating thresholds. This metric rewards models which can consistently assign positive cases with a higher risk score than negative cases. AUPRC is computed as follows:

$$\text{AUPRC} = \sum_n (R_n - R_{n-1}) P_n$$

where $P_n$ and $R_n$ are the precision and recall, respectively, when thresholding at the score of the $n$th individual, with thresholds sorted in order of increasing recall.
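A minimal NumPy implementation of this sum (ignoring tie handling among equal scores, which production metrics libraries such as scikit-learn's `average_precision_score` treat more carefully) can make the formula concrete:

```python
import numpy as np

def auprc(y_true, y_score):
    """Average precision: sum over n of (R_n - R_{n-1}) * P_n."""
    order = np.argsort(-np.asarray(y_score, dtype=float))  # descending by risk score
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                                      # true positives at each cut
    precision = tp / np.arange(1, len(y) + 1)
    recall = tp / y.sum()
    # Prepend R_0 = 0 so the first step contributes (R_1 - 0) * P_1.
    return float(np.sum(np.diff(recall, prepend=0.0) * precision))

# A perfect ranking (all positives scored above all negatives) gives 1.0.
print(auprc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # 1.0
```

Unlike ROC AUC, AUPRC is sensitive to class imbalance, which matters here since only a minority of the population is infected in the target week.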

### Partitioned Datasets for Federated learning

The dataset will be horizontally partitioned into local datasets belonging to different federation units. This is intended to mirror how data is distributed across different health districts, hospitals, etc. These federation units have access to the full data tables described above. Each person belongs to precisely one federation unit, so a health district can see everywhere that one of its individuals goes, but does not have any access to information about people who reside outside of the health district. As a result, contacts between individuals from two different federation units are not represented in the population contact network of either local dataset.

The federation units work jointly to train a global FL model, and can take a common approach to technical design, infrastructure, etc., but are not able to access each other's raw data. In the real world there are a number of barriers that might prevent this; the parties and the datasets they contribute for PPFL may be subject to a variety of privacy, competition, and industry-specific regulations (e.g., HIPAA), may be operating in different jurisdictions, and may have legitimate commercial and ethical reasons for not sharing customer data with other parties.

Above: Diagram comparing the centralized model (MC) with a privacy-preserving federated learning model (MPF).

The key task of this challenge is to design a privacy solution so that the collaborating parties can jointly train and deploy such a model without compromising the privacy requirements (more details on the requirements, and an associated threat model, are described in the problem description).

For the purposes of the challenge, you should demonstrate your solution by training two models:

• $M_C$ = a centralized model trained on the datasets in a non-privacy preserving way
• $M_{PF}$ = a privacy-preserving federated model trained using your privacy solution

### Example Centralized Baseline

You will be provided an example centralized model developed by the UVA-BII team, as well as the code used to train this model. You are permitted to use this example code as the basis of your PPFL solution, or take an entirely different analytical approach.

In either case, the core of the evaluation will be assessing the comparison between a centralized model $M_C$, and an alternative model $M_{PF}$ that combines a federated learning approach with innovative privacy-preserving techniques.

The evaluation approach will be further detailed in an Evaluation Methodology document to be supplied at the start of Phase 2. For Phase 1, the solution evaluation will consider:

• The ability of the solution to deliver (and evidence) relevant privacy properties
• The accuracy of model $M_{PF}$ compared to $M_C$.
• The performance/computational cost of training $M_{PF}$ compared to $M_C$.
• The scalability, usability, and adaptability of the solution.

Full details on evaluation criteria can be found in the problem description.

For additional reference, here is a technical brief for this use case track provided through the U.K. challenge. This brief has been assembled collaboratively by the U.K. and U.S. challenge organizers. It may help to give a sense of the use case and capabilities expected, though note that details in the brief may not match exactly how the U.S. challenge will operate.

## Good luck

Good luck and enjoy this problem! For more details on submission and evaluation, visit the problem description page. If you have any questions, you can always ask the community by visiting the DrivenData user forum or the cross-U.S.–U.K. public Slack channel. You can request access to the Slack channel here.