Differential Privacy Temporal Map Challenge: Sprint 2 (Prescreened Arena) Hosted By NIST PSCR

Completed competition · $39,000 in prizes

Woohoo! This competition has come to a close!

Many thanks to the participants for all of their hard work and commitment to using data for good!

Problem description

In this contest, your task is to develop algorithms that preserve data utility while guaranteeing individual privacy is protected. Submissions will be assessed based on their ability to prove they satisfy differential privacy and the accuracy of output data as compared with ground truth.



Timeline and leaderboard


This contest sprint will proceed in two phases:

Development Phase (Jan 6 - Feb 15, 2021)

Upon registering for the contest, participants will enter the Open Arena. You will have access to a training ground truth data set and data dictionary, and have the opportunity to submit privatized versions of the data for scoring. For purposes of privacy, the ground truth data set is considered previously released publicly available data that can be used for algorithm design and testing, without impact to the privacy loss budget. DrivenData will use a different data set with the same feature schema (but not necessarily the same distributions) for final scoring.

Throughout the Development Phase, participants can submit privatized data sets using the contest-specified values of delta, epsilon, and max_records_per_individual (the maximum number of records associated with one person in the data set) provided in the parameters.json configuration file. The scoring metric below will compare privatized data to ground truth data to produce a similarity score for each submission, which will be used to update the public leaderboard.

Pre-screening: In order to be eligible for final scoring, participants must have their approach Prescreened as differentially private during the Development Phase. Submitted write-ups will be reviewed and validated by NIST staff or their delegates to confirm that the participants have an essentially correct understanding of differential privacy as applied to their submission, or to provide a brief explanation why the algorithm or proof is incorrect.

Once successfully Prescreened, participants will have the ability to:

  • Enter the Prescreened Arena of the contest sprint (you are here!)
  • Make executable code submissions to the containerized test harness, which will execute submissions on the public challenge data and produce scores for the Prescreened leaderboard (these scores take precedence in determining Progressive Prizes)
  • Make a submission for Final Scoring, which will be used to determine final rankings

For more information on scoring and pre-screening submissions, see the Submission Format section below.

Progressive prizes: In addition, challenge leaders will have the opportunity to win progressive prizes partway through the Development Phase. Four progressive prizes will be awarded based on top scores achieved by January 25, 2021 at 10:00 pm EST. Rankings will be determined first by position on the Prescreened leaderboard (which requires achieving Prescreened status and making a successful executable code submission), and will proceed to rankings on the open leaderboard as needed. Pre-screening write-ups must be submitted by 10:00 pm EST on January 19, 2021 to ensure results are returned before the progressive prize deadline.

Final Scoring Submission (Feb 15 - 22, 2021)

Final prize rankings will be determined by the results of the final scoring phase. All Prescreened participants will have the opportunity to make a final submission to run on a separate data set that has not been shared with the challenge participants. This data set has the same schema as the public challenge data provided to all participants, though map segments (PUMAs) and time segments (years) may be different. To robustly evaluate solutions during final scoring, final submissions may be run multiple times per epsilon and on additional values of epsilon.

During the submission period, participants must submit:

  • their final executable code submission to the containerized test harness on DrivenData
  • a clear and complete report including the algorithm description and mathematical privacy proof
  • source code of their solution
  • a code guide "README" file which gives the mapping between the steps of the algorithm write-up and the analogous steps in the source code, and identifies which parts of the source code perform pre-processing, privatization and post-processing operations.

The report will be subject to a final validation review to verify that the algorithm correctly satisfies differential privacy at the stated values of delta and epsilon, and that the code implements the stated algorithm. The review will be performed by NIST staff or their delegates, who may contact participants for clarifications or minor corrections. Validated submissions will be ranked by their score to determine the final leaderboard used for awarding prizes.


  • Privatized data scoring (Jan 6 - Feb 15, 2021): Scores run on de-identified data submissions are calculated from the provided ground truth data set and displayed on the public leaderboard.
  • Executable code scoring, Prescreened only (Jan 6 - Feb 15, 2021): Scores run on executable code submissions are calculated from the provided ground truth data set and displayed on the Prescreened public leaderboard.
  • Final scoring submission, Prescreened only (Feb 15 - 22, 2021): Prescreened participants may submit a final code submission and write-up for evaluation. Final, validated scores will be displayed on the final leaderboard and used to determine prize rankings.


Correctness in Differential Privacy


The definition of epsilon-delta differential privacy used in this contest is as follows:

An algorithm M is (ϵ,δ)-differentially private if

P(M(D) ∈ S) ≤ e^ϵ · P(M(D′) ∈ S) + δ

where D, D′ are data sets with the same schema differing by one individual’s records, S is any set of possible outputs, ϵ is the privacy-loss parameter, and δ is negligible in the number of database rows n, i.e. δ < 1/p(n) for any polynomial p. For this challenge, the values of ϵ and δ are provided in the parameters.json file.

There are several resources available that provide a useful overview, beginner’s guide, and primer for working with differential privacy, as well as open source algorithms from the 2018 Differential Privacy Synthetic Data Challenge (DeID1) for working with synthetic data. For additional resources, see the About page.

In addition, here are a few tips to keep in mind as you develop your approach.

Tip 1: The algorithm should not look at the data itself to determine a valid set of values for a given feature; rather it must take the schema as external input. Otherwise changing one individual's data could change the schema and control flow of the algorithm, and this would leak information in a way that is not covered by privatization noise.

At the same time, keep in mind that the dataset you’re given during the development phase is considered to be previously released, public data. You can take general lessons learned from working with the publicly released data and hardcode them into your algorithm. For instance, you could use the public dataset to learn a similarity-based clustering of features and then build your algorithm to use this clustering. Because the private data set used for final scoring shares the same schema, if there are data distribution properties that can be expected to generalize across different years of data, it is fair game to learn them during the development phase in order to apply them during final scoring. Of course, as always, overfitting to the development data is a risk.

Tip 2: The algorithm should be cautious when using functions such as group-by, min, max, or mean whose outputs could be changed if one individual’s records in the data are changed. All functions whose outputs are "sensitive" to the records in the data must be considered when computing the sensitivity of the algorithm and adding appropriate noise.

For some operations, the global sensitivity (i.e. the worst-case impact of changing one individual’s records) may be quite large, making any algorithm that contains them impractical to privatize. Differential privacy often requires using different versions of common operations, and different approaches to common algorithms, that have been intentionally designed to reduce sensitivity.
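To make this concrete, here is a minimal, hypothetical sketch (not part of the challenge materials) of a Laplace-noised sum whose per-individual influence is bounded by clipping; the function name and clip bound are illustrative:

```python
import numpy as np

def dp_sum(values, clip_bound, epsilon, max_records_per_individual=7):
    """Laplace-noised sum with bounded sensitivity.

    An unclipped sum has unbounded global sensitivity, so each value is
    first clipped to [0, clip_bound]. One record can then shift the sum
    by at most clip_bound, and one individual (who may have up to
    max_records_per_individual records) by at most
    clip_bound * max_records_per_individual.
    """
    clipped = np.clip(np.asarray(values, dtype=float), 0, clip_bound)
    sensitivity = clip_bound * max_records_per_individual
    return clipped.sum() + np.random.laplace(0.0, sensitivity / epsilon)
```

The same clipping idea underlies differentially private versions of mean, min, and max: the sensitivity is fixed from the schema and parameters, never computed from the data.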

Tip 3: Histograms (built on an externally-provided schema) can be quite useful. The sensitivity of creating a histogram is equal to the total quantity (across all histogram bins) that can change in the worst case when one individual's records are changed; in this sprint that is equal to the max_records_per_individual parameter. Check out the provided baseline solution and the beginner's guide tutorial above for more information on differentially private algorithms using histograms.
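A sketch of this tip (illustrative only; the provided baseline is the official example), assuming the bins come from the schema and the sensitivity equals max_records_per_individual:

```python
import numpy as np

def dp_histogram(values, bins, epsilon, sensitivity):
    """Differentially private histogram via the Laplace mechanism.

    `bins` must come from the external schema, never from the data
    itself (see Tip 1). `sensitivity` is the worst-case total change in
    counts when one individual's records change -- in this sprint, the
    max_records_per_individual parameter.
    """
    values = np.asarray(values)
    counts = np.array([(values == b).sum() for b in bins], dtype=float)
    noise = np.random.laplace(0.0, sensitivity / epsilon, size=len(bins))
    return counts + noise

# e.g. SEX has schema values [1, 2]; sensitivity is 7 per parameters.json
noisy_sex = dp_histogram([1, 1, 2, 1, 2], bins=[1, 2], epsilon=1.0, sensitivity=7)
```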

Tip 4: Post-processing is also likely to be important. Once data has been privatized (sufficient noise has been added for the given epsilon, delta, and sensitivity), any further operations can be performed on it safely—with no further risk to privacy—as long as they only interact with the privatized data or publicly available data (and do not interact with the sensitive ground truth data in any fashion). Post-processing operations are perfect for reducing positive bias and handling negative and non-integer values.
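For instance, a small illustrative sketch of post-processing already-privatized counts (clipping negatives, optionally rescaling, and rounding), which costs no additional privacy budget:

```python
import numpy as np

def postprocess_counts(noisy_counts, total=None):
    """Post-process privatized histogram counts: clip negatives, optionally
    rescale to a target total, and round to integers. This is safe because
    it touches only the privatized values, never the ground truth."""
    counts = np.clip(np.asarray(noisy_counts, dtype=float), 0.0, None)
    if total is not None and counts.sum() > 0:
        counts = counts * (total / counts.sum())
    return np.rint(counts).astype(int)
```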

Data


The provided data set for this sprint consists of a 1,033,968 row CSV file which records quantitative survey information about a sample of the American population from 2012 - 2018. The data is adapted from the American Community Survey (ACS) provided by IPUMS USA, with simulated individuals connecting records longitudinally across years. As a reminder, for purposes of differential privacy, the data set provided during the development phase is considered to be previously released, public data. Lessons learned from working with this dataset (e.g. similarity of features) can be included in your algorithm without loss of privacy.

NOTE: For this sprint, final scoring will use the same time segments as the public data, but different map segments (i.e. final scoring will be done on different states). That means the public data won't be useful for learning about specific map segments in this sprint.

Records

The main dataset contains demographic and financial survey data representing a subset of IPUMS American Community Survey data for Ohio and Illinois from 2012-2018. The data includes a large feature set of quantitative survey variables along with simulated individuals (each with a sequence of records across years), time segments (years), and map segments (PUMAs). Solutions in this sprint will need to produce a list of records (i.e. synthetic data) with corresponding time/map segments.

There are 36 total columns in the ground truth data, including PUMA, YEAR, 33 additional survey features, and an ID denoting simulated individuals.

Here is what an example row of the ground_truth.csv data looks like:

PUMA 17-1001
YEAR 2012
HHWT 30
GQ 1
PERWT 34
SEX 1
AGE 49
MARST 1
RACE 1
HISPAN 0
CITIZEN 0
SPEAKENG 3
HCOVANY 2
HCOVPRIV 2
HINSEMP 2
HINSCAID 1
HINSCARE 1
EDUC 7
EMPSTAT 1
EMPSTATD 10
LABFORCE 2
WRKLSTWK 2
ABSENT 4
LOOKING 3
AVAILBLE 5
WRKRECAL 3
WORKEDYR 3
INCTOT 10000
INCWAGE 10000
INCWELFR 0
INCINVST 0
INCEARN 10000
POVERTY 36
DEPARTS 1205
ARRIVES 1224
sim_individual_id 1947

The columns are as follows:

  • PUMA (str) — Identifies the Public Use Microdata Area (PUMA) where the housing unit was located.
  • YEAR (uint32) — Reports the four-digit year when the household was enumerated or included in the survey.
  • HHWT (float) — Indicates how many households in the U.S. population are represented by a given household in an IPUMS sample.
  • GQ (uint8) — Classifies all housing units, falling into one of three main categories: households, group quarters, or vacant units.
  • PERWT (float) — Indicates how many persons in the U.S. population are represented by a given person in an IPUMS sample.
  • SEX (uint8) — Reports whether the person was male or female.
  • AGE (uint8) — Reports the person's age in years as of the last birthday.
  • MARST (uint8) — Gives each person's current marital status.
  • RACE (uint8) — Reports what race the person considers himself/herself to be.
  • HISPAN (uint8) — Identifies persons of Hispanic/Spanish/Latino origin and classifies them according to their country of origin when possible.
  • CITIZEN (uint8) — Reports the citizenship status of respondents, distinguishing between naturalized citizens and non-citizens.
  • SPEAKENG (uint8) — Indicates whether the respondent speaks only English at home, and also reports how well the respondent, who speaks a language other than English at home, speaks English.
  • HCOVANY, HCOVPRIV, HINSEMP, HINSCAID, HINSCARE (uint8) — Indicates whether persons had any health insurance coverage at the time of interview, and whether they had private, employer-provided, Medicaid or other government insurance, or Medicare coverage, respectively.
  • EDUC (uint8) — Indicates respondents' educational attainment, as measured by the highest year of school or degree completed.
  • EMPSTAT, EMPSTATD, LABFORCE, WRKLSTWK, ABSENT, LOOKING, AVAILBLE, WRKRECAL, WORKEDYR (uint8) — Indicates whether the respondent was a part of the labor force (working or seeking work), whether the person was currently unemployed, their work-related status in the last week, whether they were informed they would be returning to work (if not working in the last week), and whether they worked during the previous year.
  • INCTOT, INCWAGE, INCWELFR, INCINVST, INCEARN (int32) — Reports each respondent's total pre-tax personal income or losses from all sources for the previous year, as well as the income from wages, welfare, investment, and wages or a person's own business or farm, respectively.
  • POVERTY (uint32) — Expresses each family's total income for the previous year as a percentage of the Social Security Administration's inflation-adjusted poverty threshold.
  • DEPARTS, ARRIVES (uint32) — Reports the time that the respondent usually left home for work and arrived at work last week, measured using a 24-hour clock.
  • sim_individual_id (int) — Unique, synthetic ID for the notional person to which this record was attributed. The largest number of records attributed to a single simulated resident is provided in the parameters.json file as max_records_per_individual.

For further information on the possible values for each column, you can refer to the schema in the parameters file as described below.

Parameters

In addition to the CSV file, there is also a parameters.json file which serves both as a codebook for the dataset described above (under the schema key) as well as a configuration describing which outputs are expected to be produced by your code (under the runs key).

Here is a bit of what the schema (or "codebook") part of the file looks like:

"schema": {
    "PUMA": {
      "dtype": "str",
      "values": [
        "17-1001",
        "17-104",
        "17-105",
        ...
      ]
    },
    "YEAR": {
      "dtype": "uint32",
      "values": [
        2012,
        2013,
        ...
      ]
    },
    ...
  }
}

And here is the entire runs part:

[
    {
      "epsilon": 0.1,
      "delta": 3.4498908254380166e-11,
      "max_records": 1350000,
      "max_records_per_individual": 7
    },
    {
      "epsilon": 1.0,
      "delta": 3.4498908254380166e-11,
      "max_records": 1350000,
      "max_records_per_individual": 7
    },
    {
      "epsilon": 10.0,
      "delta": 3.4498908254380166e-11,
      "max_records": 1350000,
      "max_records_per_individual": 7
    }
]

As you can see, the delta value (just under 1/n^2, where n is the number of simulated individuals in the data set), max_records (the maximum allowed number of synthetic records for each value of epsilon, roughly 250 records per PUMA-year beyond the ground truth), and max_records_per_individual (the maximum number of records associated with one person in the data set) are identical across all three runs for this ground truth data set. The epsilon values show what privacy level your code is required to implement for each of the three runs.

For the Open Arena CSV submission, you will simply upload a submission generated using these values. In the Prescreened Arena, your executable code may assume that this file exists at data/parameters.json and should use its contents to output a submission file accordingly. Note that this means you should not hard-code these values, as they may change for final scoring and your code will need to use the new values. (Delta and max_records will change, and epsilon values may change, in final scoring. The schema will remain the same. See below for more details about the submission format.)
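A minimal sketch of reading the configuration at runtime rather than hard-coding values (load_runs is an illustrative name, not part of the challenge materials):

```python
import json

def load_runs(path="data/parameters.json"):
    """Read the contest configuration. Delta, max_records, and possibly
    epsilon change for final scoring, so always take them from this file."""
    with open(path) as f:
        parameters = json.load(f)
    schema = parameters["schema"]  # codebook: dtype and valid values per column
    runs = parameters["runs"]      # one privatized output expected per run
    return schema, runs

# schema, runs = load_runs()
# for run in runs:
#     privatize(schema, run["epsilon"], run["delta"],
#               run["max_records"], run["max_records_per_individual"])
```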

External data

As described in the Challenge Rules, external datasets and pre-trained models are allowed for use in the competition as long as they are freely and publicly available to all participants and do not have the same schema or source as the provided contest data. Any additional data sources used other than the provided data must be indicated for approval in privacy write-ups for pre-screening and final scoring.

Performance metric


Synthetic data submissions will be compared to ground truth values to evaluate similarity using a version of the k-marginal evaluation metric. The intent behind the metric is to quantify the ability of each differential privacy algorithm to conserve clustering characteristics of the dataset being privatized. The original version of the metric was developed to score solutions in the NIST DeID1 Synthetic Data Challenge.

The k-marginal evaluation tool is a randomized heuristic that measures similarity between two high-dimensional data sets by considering all correlations of k or fewer variables. For each run, the method selects a set of k features, computes the distribution of individuals that fall into each possible combination of those features (with numerical features grouped into range bins), and then takes the absolute difference between the two resulting densities. By repeating across randomly selected marginals, the metric aggregates performance across a variety of features to create a total score.

Here is a simplified illustration of the k-marginal metric where k=3:

[metric diagram]

Calculation

1. K-marginal scoring:

For this sprint, the k-marginal metric will be applied for k=4, where two of the four features are always PUMA (map segment) and year (time segment), and the other two features are selected at random.

A single run will compare the marginal density for two variables selected randomly along with PUMA and year. During the Development Phase, provisional scores are shown on the leaderboard for a set of 50 runs. Final Scoring will use additional runs for broader data coverage. The differences in densities between the synthetic and ground truth data will be evaluated by PUMA-year, so that each temporal map slice of the data has an associated score.

For instance, one run of the k-marginal metric might feature the variables RACE and INCTOT. The number of individuals that fall into each racial category and income bin will be calculated per PUMA per YEAR, both for the ground truth data set and the synthetic data submission. The distribution of the resulting cells will be normalized and compared by PUMA-year. Since the score takes the absolute difference of each pair of cells, the best score for each PUMA-year will be 0.0 (i.e. a perfect match between the density distributions) and the worst score for each PUMA-year (the density distributions do not overlap at all) will be 2.0.
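As a simplified, unofficial illustration of that comparison (the authoritative implementation is the metric script in the competitor pack), the per-PUMA-year density difference can be sketched as:

```python
import numpy as np

def density_difference(ground_truth_counts, synthetic_counts):
    """Total absolute difference between two normalized cell distributions
    for one PUMA-year: 0.0 means identical densities, 2.0 means the
    distributions do not overlap at all."""
    gt = np.asarray(ground_truth_counts, dtype=float)
    syn = np.asarray(synthetic_counts, dtype=float)
    return np.abs(gt / gt.sum() - syn / syn.sum()).sum()
```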

2. Apply bias penalty (BP):

Each temporal map segment (PUMA-year) will also be checked to ensure the number of individuals represented is not too far from the actual population in the ground truth data.

For any PUMA-year where the synthetic population differs from the actual population by at least 250 records, the score for that PUMA-year will be set to the worst possible score (2.0) as a violation of the bias constraint.

3. Aggregate score

Finally, interim PUMA-year scores will be transformed into an aggregate leaderboard score. Raw scores will be averaged for all PUMA-years, runs, and values of epsilon, resulting in a value between 0.0 (best) and 2.0 (worst). As a reminder, epsilon values can be found in the parameters.json configuration file, and solutions may be run on additional values of epsilon during final scoring. The leaderboard score will be calculated by subtracting this raw_score value from 2.0, then normalizing the result between 0 (worst) and 1,000 (best).
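One plausible reading of that transformation, shown for illustration only:

```python
def leaderboard_score(raw_score):
    """Map the averaged raw score (0.0 best .. 2.0 worst) onto the
    leaderboard scale (1,000 best .. 0 worst), assuming a linear
    normalization of (2.0 - raw_score)."""
    return (2.0 - raw_score) / 2.0 * 1000.0

# Under this reading, a raw average difference of 0.3 maps to roughly 850.
```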

Keep in mind: while 1,000 represents the best theoretically possible accuracy score, scores close to 1,000 are not achievable given privacy-preserving guarantees! For more information on privacy write-ups, see the submissions section below.

For further details and transparent implementation of the performance metric, see the metric script in the competitor pack.

Baseline Privacy Solution: A naive but valid baseline approach to satisfying differential privacy on temporal map data is to simply create a random set of individuals that adhere to the constraints provided in the schema section of the configuration file, without ever looking at the data. Code for this baseline approach is provided in the competitors pack, and a score from this baseline is shown on the leaderboard.
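That baseline can be sketched in a few lines (illustrative names only; the official baseline code is in the competitors pack):

```python
import random

def random_baseline(schema, n_records, columns):
    """Naive baseline: sample each column uniformly from its schema values,
    without ever reading the ground truth data. Because the output is
    independent of the data, it is trivially differentially private."""
    return [
        {col: random.choice(schema[col]["values"]) for col in columns}
        for _ in range(n_records)
    ]
```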

Use

In the Open Arena, this score will be calculated based on the provided 2012-2018 data for Ohio and Illinois as the ground truth data set and the submitted file as the synthetic data set.

In the Prescreened Arena, scores during the Development Phase will be calculated based on the provided 2012-2018 data as the ground truth data, and the synthetic data set produced by running the submitted executable code on the ground truth data and configurations file. During Final Scoring, final scores will be determined by (1) running the submitted executable code to generate a synthetic data set for each epsilon value on separate held-out ground truth data, identical in schema to the 2012-2018 Ohio and Illinois data but with records from other states and/or years, (2) evaluating the metric described above for each privatized output on the appropriate ground truth, and (3) taking the arithmetic mean of the resulting values. To robustly evaluate solutions during final scoring, final submissions will be run on more k-marginal permutations than during the Development Phase, and may be run multiple times per epsilon and on additional values of epsilon.

Submission format


There are four types of submissions accepted throughout this contest:

  • Privatized data set (Open Arena)
  • Pre-screening write-up (Open Arena)
  • Executable code submission (Prescreened Arena)
  • Final scoring (Prescreened Arena)

The submissions for this arena are described here. For a description of Open Arena submissions click here.

Executable code submission

The Prescreened Arena is a code execution arena! Rather than submitting a privatized data set, participants will package everything needed to perform the privatization and submit for containerized execution on the cloud. Check out the separate code submission format page where all the details are encapsulated.

You may submit as many times as the stated submission limit allows. Keep in mind that code submissions in the Prescreened Arena should be the same basic algorithm that passed pre-screening. You may continue to tune and tinker with your approach after it has been prescreened, but if you're switching to a substantially different algorithm you should resubmit to pre-screening in order to avoid accidental privacy violations. Algorithms that have been changed may no longer pass validation of differential privacy and so would not be eligible for prizes.

Final scoring

Directly following the end of the Development Phase, Prescreened participants will have one week to make a single final submission during the Final Scoring phase. Each Prescreened team can make a single final, successfully-executed code submission through the Prescreened Arena submission acceptor. The Final Scoring submission will generate a private score based on performance on the private ground truth data. As a reminder, this data set has the same schema as the public challenge data provided to all participants, though map segments (PUMAs) and time segments (years) may be different. This score will be revealed after evaluation is complete and will be used to determine final prize rankings.

The Final Scoring submission should be a zip file containing the following materials:

  • Final executable submission that runs in the containerized test harness
  • All source code used in the solution, structured logically with an obvious point of entry, an extremely clear README file with clear instructions, dependencies and requirements identified, and including the final submission made through the DrivenData platform in the bullet above
  • Differential privacy write-up, containing:
    • a 1-2 sentence high-level description of the approach;
    • a clear and complete written explanation of the algorithm used in the final submission, including a summary of what the solution is doing at each step in the process to generate the privatized data submission;
    • a brief description of any additional data sources used other than the provided data, including a link to the data;
    • a clear, correct mathematical proof that their solution satisfies differential privacy; and
    • a code guide which gives the mapping between the steps of the algorithm write-up and the analogous steps in the source code, and identifies which parts of the source code perform pre-processing, privatization and post-processing operations.
  • Optionally, an eligibility form completed by your team's Official Representative (required only if you believe you are prize-eligible*).

* As a reminder, to be eligible for a prize, the Official Representative (individual or team lead, in the case of a group project) must be age 18 or older and a U.S. citizen or permanent resident of the United States or its territories at the time of entry. See Official Rules for full information. Note that you are NOT required to be prize-eligible in order to make a final scoring submission and have your performance displayed on the leaderboard.

A sample privacy write-up was provided in the Competitor Pack from Sprint 1. Additional examples like this one can be found in the open solutions list from the DeID1 Challenge. If you are also looking for tools to help organize your code for reproduction and sharing, we have a data science project template which may be helpful.

Final scoring submissions will be reviewed carefully by subject matter experts to ensure that the algorithm and proof correctly satisfy differential privacy and to verify that the code is a good faith implementation of the algorithm without significant errors. Validated submissions will be scored on the private ground truth data per the scoring function above to determine final prize rankings. To robustly evaluate solutions during final scoring, final submissions will be run on more k-marginal permutations than during the Development Phase, and may be run multiple times per epsilon and on additional values of epsilon. For more information see the Challenge Rules.

Good luck!


If you're wondering how to get started, check out the baseline solution in the competitor pack.

Good luck and enjoy this problem! If you have any questions you can always visit the user forum!