Differential Privacy Temporal Map Challenge: Sprint 2 (Open Arena)

START HERE! Help public safety agencies share data while protecting privacy. This is part of a series of contests to develop algorithms that preserve the usefulness of temporal map data while guaranteeing individual privacy is protected.


Problem description

In this contest, your task is to develop algorithms that preserve data utility while guaranteeing individual privacy is protected. Submissions will be assessed based on their ability to prove they satisfy differential privacy and the accuracy of output data as compared with ground truth.



Timeline and leaderboard


This contest sprint will proceed in two phases:

Development Phase (Jan 6 - Feb 15, 2021)

Upon registering for the contest, participants will enter the Open Arena (you are here!). You will have access to a training ground truth data set and data dictionary, and have the opportunity to submit privatized versions of the data for scoring. For purposes of privacy, the ground truth data set is considered previously released publicly available data that can be used for algorithm design and testing, without impact to the privacy loss budget. DrivenData will use a different data set with the same feature schema (but not necessarily the same distributions) for final scoring.

Throughout the Development Phase, participants can submit privatized data sets using the contest-specified values of delta, epsilon, and max_records_per_individual (the maximum number of records associated with one person in the data set) provided in the parameters.json configuration file. The scoring metric below will compare privatized data to ground truth data to produce a similarity score for each submission, which will be used to update the public leaderboard.

Pre-screening: In order to be eligible for final scoring, participants must have their approach Prescreened as differentially private during the Development Phase. Submitted write-ups will be reviewed and validated by NIST staff or their delegates to confirm that the participants have an essentially correct understanding of differential privacy as applied to their submission, or to provide a brief explanation why the algorithm or proof is incorrect.

Once successfully Prescreened, participants will have the ability to:

  • Enter the Prescreened Arena of the contest sprint
  • Make executable code submissions to the containerized test harness, which will execute submissions on the public challenge data and produce scores for the Prescreened leaderboard (these scores take precedence in determining Progressive Prizes)
  • Make a submission for Final Scoring, which will be used to determine final rankings

For more information on scoring and pre-screening submissions, see the Submission Format section below.

Progressive prizes: In addition, challenge leaders will have the opportunity to win progressive prizes partway through the Development Phase. Four progressive prizes will be awarded based on top scores achieved by January 25, 2021 at 10:00 pm EST. Rankings will be determined first by position on the Prescreened leaderboard (which requires achieving Prescreened status and making a successful executable code submission), and will proceed to rankings on the open leaderboard as needed. Pre-screening write-ups must be submitted by 10:00 pm EST on January 19, 2021 to ensure results are returned before the progressive prize deadline.

Final Scoring Submission (Feb 15 - 22, 2021)

Final prize rankings will be determined by the results of the final scoring phase. All Prescreened participants will have the opportunity to make a final submission to run on a separate data set that has not been shared with the challenge participants. This data set has the same schema as the public challenge data provided to all participants, though map segments (PUMAs) and time segments (years) may be different. To robustly evaluate solutions during final scoring, final submissions may be run multiple times per epsilon and on additional values of epsilon.

During the submission period, participants must submit:

  • their final executable code submission to the containerized test harness on DrivenData
  • a clear and complete report including the algorithm description and mathematical privacy proof
  • source code of their solution
  • a code guide "README" file which gives the mapping between the steps of the algorithm write-up and the analogous steps in the source code, and identifies which parts of the source code perform pre-processing, privatization and post-processing operations.

The report will be subject to a final validation review to verify that the algorithm correctly satisfies differential privacy at the stated values of delta and epsilon, and that the code implements the stated algorithm. The review will be performed by NIST staff or their delegates, who may contact participants for clarifications or minor corrections. Validated submissions will be ranked by their score to determine the final leaderboard used for awarding prizes.


  • Privatized data scoring (Jan 6 - Feb 15, 2021): Scores run on de-identified data submissions are calculated from the provided ground truth data set and displayed on the public leaderboard.
  • Executable code scoring, Prescreened only (Jan 6 - Feb 15, 2021): Scores run on executable code submissions are calculated from the provided ground truth data set and displayed on the Prescreened public leaderboard.
  • Final scoring submission, Prescreened only (Feb 15 - 22, 2021): Prescreened participants may submit a final code submission and write-up for evaluation. Final, validated scores will be displayed on the final leaderboard and used to determine prize rankings.


Correctness in Differential Privacy


The definition of epsilon-delta differential privacy used in this contest is as follows:

An algorithm M is (ϵ,δ)-differentially private if

P(M(D) ∈ S) ≤ e^ε · P(M(D′) ∈ S) + δ

where D and D′ are data sets with the same schema that differ by one individual's records, ε is the privacy-loss parameter, and δ is negligible in the number of database rows n (i.e. δ < 1/p(n) for any polynomial p). For this challenge, the values of ε and δ are provided in the parameters.json file.
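As a quick, unofficial intuition for the epsilon values used in this sprint: e^ε is the factor by which the probability of any output can change when one individual's records change (ignoring the small δ term). The snippet below simply evaluates that factor:

import math

# e^epsilon bounds the multiplicative change in the probability of any output
# when one individual's records are added, removed, or changed (ignoring delta).
for epsilon in [0.1, 1.0, 10.0]:
    print(f"epsilon = {epsilon:>4}: output probabilities can shift by at most a factor of {math.exp(epsilon):,.2f}")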

There are several resources available that provide a useful overview, beginner's guide, and primer for working with differential privacy, as well as open source algorithms from the 2018 Differential Privacy Synthetic Data Challenge (DeID1) for working with synthetic data. For additional resources, see the About page.

In addition, here are a few tips that may help to keep in mind as you develop your approach.

Tip 1: The algorithm should not look at the data itself to determine a valid set of values for a given feature; rather it must take the schema as external input. Otherwise changing one individual's data could change the schema and control flow of the algorithm, and this would leak information in a way that is not covered by privatization noise.

At the same time, keep in mind that the dataset you’re given during the development phase is considered to be previously released, public data. You can take general lessons learned from working with the publicly released data and hardcode them into your algorithm. For instance, you could use the public dataset to learn a similarity-based clustering of features and then build your algorithm to use this clustering. Because the private data set used for final scoring shares the same schema, if there are data distribution properties that can be expected to generalize across different years of data, it is fair game to learn them during the development phase in order to apply them during final scoring. Of course, as always, overfitting to the development data is a risk.
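As an illustrative (non-official) sketch of Tip 1, using the parameters.json layout described later on this page, the valid values for a feature should come from the schema rather than from the ground truth itself:

import json

# Tip 1 sketch: valid feature values come from the externally provided schema,
# never from the sensitive data itself.
with open("data/parameters.json") as f:
    parameters = json.load(f)

schema = parameters["schema"]

# Good: histogram bins for a categorical feature come from the schema
# (assuming the feature lists its allowed values, as PUMA and YEAR do).
race_values = schema["RACE"]["values"]

# Bad: df["RACE"].unique() would make the algorithm's control flow depend on
# which individuals happen to be in the data -- an unprivatized information leak.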

Tip 2: The algorithm should be cautious when using functions such as group-by, min, max, or mean whose outputs could be changed if one individual’s records in the data are changed. All functions whose outputs are "sensitive" to the records in the data must be considered when computing the sensitivity of the algorithm and adding appropriate noise.

For some operations, the global sensitivity (i.e. worst-case impact of changing one individual’s records), may be quite large and thus any algorithm containing these operations will be impractical to privatize. Differential privacy often requires using different versions of common operations and different approaches to common algorithms that have been intentionally designed to reduce sensitivity.

Tip 3: Histograms (built on an externally-provided schema) can be quite useful. The sensitivity of creating a histogram is equal to the total quantity (across all histogram bins) that can be changed in the worst case when one individual's records are changed. In this sprint that is equal to the max_records_per_individual parameter. Check out the provided baseline solution and beginner's guide tutorial above for more information on differentially private algorithms using histograms.
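To make Tip 3 concrete, here is a minimal sketch (not the provided baseline) of a Laplace-noised histogram over schema-defined bins; the function and parameter names are illustrative, and the noise scale assumes the entire epsilon budget is spent on this single histogram:

import numpy as np

def dp_histogram(values, bins, epsilon, max_records_per_individual, rng):
    # Counts per bin, where `bins` comes from the external schema (see Tip 1).
    counts = np.array([np.sum(values == b) for b in bins], dtype=float)
    # Changing one individual can move at most max_records_per_individual counts,
    # so that is the sensitivity used to calibrate the Laplace noise.
    scale = max_records_per_individual / epsilon
    return counts + rng.laplace(loc=0.0, scale=scale, size=len(bins))

# Illustrative usage with made-up values:
rng = np.random.default_rng(0)
ages = np.array([23, 47, 47, 61])                      # pretend ground-truth column
noisy_counts = dp_histogram(ages, bins=np.arange(0, 100),
                            epsilon=1.0, max_records_per_individual=7, rng=rng)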

Tip 4: Post-processing is also likely to be important. Once data has been privatized (sufficient noise has been added for the given epsilon, delta, and sensitivity), any further operations can be performed on it safely—with no further risk to privacy—as long as they only interact with the privatized data or publicly available data (and do not interact with the sensitive ground truth data in any fashion). Post-processing operations are perfect for reducing positive bias and handling negative and non-integer values.
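A small post-processing sketch in the same spirit, assuming `noisy_counts` is an already-privatized histogram (e.g. from the sketch above):

import numpy as np

# Everything below touches only privatized values, so it adds no privacy cost.
noisy_counts = np.array([12.3, -4.1, 150.8, 0.2])      # already privatized

counts = np.clip(noisy_counts, 0.0, None)              # clamp negatives introduced by noise
counts = np.round(counts)                              # counts should be whole numbers
total = counts.sum()
probs = counts / total if total > 0 else np.full(len(counts), 1.0 / len(counts))
# `probs` can now be used, for example, to sample synthetic records for this feature.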

Data


The provided data set for this sprint consists of a 1,033,968 row CSV file which records quantitative survey information about a sample of the American population from 2012 - 2018. The data is adapted from the American Community Survey (ACS) provided by IPUMS USA, with simulated individuals connecting records longitudinally across years. As a reminder, for purposes of differential privacy, the data set provided during the development phase is considered to be previously released, public data. Lessons learned from working with this dataset (e.g. similarity of features) can be included in your algorithm without loss of privacy.

NOTE: For this sprint, final scoring will use the same time segments as the public data but different map segments (i.e. final scoring will be done on different states). That means the public data won't be useful for learning about specific map segments in this sprint.

Records

The main dataset includes survey data, including demographic and financial features, representing a subset of IPUMS American Community Survey data for Ohio and Illinois from 2012-2018. The data includes a large feature set of quantitative survey variables along with simulated individuals (with a sequence of records across years), time segments (years), and map segments (PUMA). Solutions in this sprint will need to produce a list of records (i.e. synthetic data) with corresponding time/map segments.

There are 36 total columns in the ground truth data, including PUMA, YEAR, 33 additional survey features, and an ID denoting simulated individuals.

Here is what an example row of the ground_truth.csv data looks like:

PUMA 17-1001
YEAR 2012
HHWT 30
GQ 1
PERWT 34
SEX 1
AGE 49
MARST 1
RACE 1
HISPAN 0
CITIZEN 0
SPEAKENG 3
HCOVANY 2
HCOVPRIV 2
HINSEMP 2
HINSCAID 1
HINSCARE 1
EDUC 7
EMPSTAT 1
EMPSTATD 10
LABFORCE 2
WRKLSTWK 2
ABSENT 4
LOOKING 3
AVAILBLE 5
WRKRECAL 3
WORKEDYR 3
INCTOT 10000
INCWAGE 10000
INCWELFR 0
INCINVST 0
INCEARN 10000
POVERTY 36
DEPARTS 1205
ARRIVES 1224
sim_individual_id 1947

The columns are as follows:

  • PUMA (str) — Identifies the Public Use Microdata Area (PUMA) where the housing unit was located.
  • YEAR (uint32) — Reports the four-digit year when the household was enumerated or included in the survey.
  • HHWT (float) — Indicates how many households in the U.S. population are represented by a given household in an IPUMS sample.
  • GQ (uint8) — Classifies all housing units into one of three main categories: households, group quarters, or vacant units.
  • PERWT (float) — Indicates how many persons in the U.S. population are represented by a given person in an IPUMS sample.
  • SEX (uint8) — Reports whether the person was male or female.
  • AGE (uint8) — Reports the person's age in years as of the last birthday.
  • MARST (uint8) — Gives each person's current marital status.
  • RACE (uint8) — Reports what race the person considers himself/herself to be.
  • HISPAN (uint8) — Identifies persons of Hispanic/Spanish/Latino origin and classifies them according to their country of origin when possible.
  • CITIZEN (uint8) — Reports the citizenship status of respondents, distinguishing between naturalized citizens and non-citizens.
  • SPEAKENG (uint8) — Indicates whether the respondent speaks only English at home and, for respondents who speak another language at home, how well they speak English.
  • HCOVANY, HCOVPRIV, HINSEMP, HINSCAID, HINSCARE (uint8) — Indicates whether persons had any health insurance coverage at the time of interview, and whether they had private, employer-provided, Medicaid or other government insurance, or Medicare coverage, respectively.
  • EDUC (uint8) — Indicates respondents' educational attainment, as measured by the highest year of school or degree completed.
  • EMPSTAT, EMPSTATD, LABFORCE, WRKLSTWK, ABSENT, LOOKING, AVAILBLE, WRKRECAL, WORKEDYR (uint8) — Indicates whether the respondent was a part of the labor force (working or seeking work), whether the person was currently unemployed, their work-related status in the last week, whether they were informed they would be returning to work (if not working in the last week), and whether they worked during the previous year.
  • INCTOT, INCWAGE, INCWELFR, INCINVST, INCEARN (int32) — Reports each respondent's total pre-tax personal income or losses from all sources for the previous year, as well as the income from wages, welfare, investment, and wages or a person's own business or farm, respectively.
  • POVERTY (uint32) — Expresses each family's total income for the previous year as a percentage of the Social Security Administration's inflation-adjusted poverty threshold.
  • DEPARTS, ARRIVES (uint32) — Reports the time that the respondent usually left home for work and arrived at work last week, measured using a 24-hour clock.
  • sim_individual_id (int) — Unique, synthetic ID for the notional person to which this record was attributed. The largest number of records attributed to a single simulated resident is provided in the parameters.json file as max_records_per_individual.

For further information on the possible values for each column, you can refer to the schema in the parameters file as described below.

Parameters

In addition to the CSV file, there is also a parameters.json file which serves both as a codebook for the dataset described above (under the schema key) as well as a configuration describing which outputs are expected to be produced by your code (under the runs key).

Here is a bit of what the schema (or "codebook") part of the file looks like:

"schema": {
    "PUMA": {
      "dtype": "str",
      "values": [
        "17-1001",
        "17-104",
        "17-105",
        ...
      ]
    },
    "YEAR": {
      "dtype": "uint32",
      "values": [
        2012,
        2013,
        ...
      ]
    },
    ...
  }
}

And here is the entire runs part:

[
    {
      "epsilon": 0.1,
      "delta": 3.4498908254380166e-11,
      "max_records": 1350000,
      "max_records_per_individual": 7
    },
    {
      "epsilon": 1.0,
      "delta": 3.4498908254380166e-11,
      "max_records": 1350000,
      "max_records_per_individual": 7
    },
    {
      "epsilon": 10.0,
      "delta": 3.4498908254380166e-11,
      "max_records": 1350000,
      "max_records_per_individual": 7
    }
]

As you can see, the delta value (just under 1/n^2, where n is the number of simulated individuals in the data set), max_records (the maximum number of synthetic records allowed for each value of epsilon, roughly 250 records per PUMA-year beyond the ground truth), and max_records_per_individual (the maximum number of records associated with one person in the data set) are identical across all three runs for this ground truth data set. The epsilon values specify the privacy level your code is required to implement for each of the three runs.

For the Open Arena CSV submission, you will simply upload a submission generated using these values. In the Prescreened Arena, your executable code may assume that this file exists at data/parameters.json and should use its contents to produce the submission file accordingly. Note that this means you should not hard code these values, as they may change for final scoring and your code will need to use the updated values. (Delta and max_records will change, and epsilon values may change, in final scoring. The schema will remain the same. See below for more details about the submission format.)
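For orientation, here is a minimal sketch of how executable code might read this configuration (using the data/parameters.json path and the key names excerpted above):

import json

with open("data/parameters.json") as f:
    parameters = json.load(f)

schema = parameters["schema"]              # codebook: dtype and valid values per column
for run in parameters["runs"]:
    epsilon = run["epsilon"]
    delta = run["delta"]
    max_records = run["max_records"]
    sensitivity = run["max_records_per_individual"]
    # ... generate one privatized data set per run using these values ...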

External data

As described in the Challenge Rules, external datasets and pre-trained models are allowed for use in the competition as long as they are freely and publicly available to all participants and do not have the same schema or source as the provided contest data. Any additional data sources used other than the provided data must be indicated for approval in privacy write-ups for pre-screening and final scoring.

Performance metric


Synthetic data submissions will be compared to ground truth values to evaluate similarity using a version of the k-marginal evaluation metric. The intent behind the metric is to quantify the ability of each differential privacy algorithm to conserve clustering characteristics of the dataset being privatized. The original version of the metric was developed to score solutions in the NIST DeID1 Synthetic Data Challenge.

The k-marginal evaluation tool is a randomized heuristic that measures similarity between two high-dimensional data sets by considering all correlations of k or fewer variables. For each run, the method selects a set of k features, computes the distribution of individuals that fall into each possible combination of those features (with numerical features grouped into range bins), and then takes the absolute difference between the two resulting densities. By repeating across randomly selected marginals, the metric aggregates performance across a variety of features to create a total score.

Here is a simplified illustration of the k-marginal metric where k=3:

[Figure: metric diagram]

Calculation

1. K-marginal scoring:

For this sprint, the k-marginal metric will be applied for k=4, where two of the four features are always PUMA (map segment) and year (time segment), and the other two features are selected at random.

A single run will compare the marginal density for two variables selected randomly along with PUMA and year. During the Development Phase, provisional scores are shown on the leaderboard for a set of 50 runs. Final Scoring will use additional runs for broader data coverage. The differences in densities between the synthetic and ground truth data will be evaluated by PUMA-year, so that each temporal map slice of the data has an associated score.

For instance, one run of the k-marginal metric might feature the variables RACE and INCTOT. The number of individuals that fall into each racial category and income bin will be calculated per PUMA per YEAR, both for the ground truth data set and the synthetic data submission. The distribution of the resulting cells will be normalized and compared by PUMA-year. Since the score takes the absolute difference of each pair of cells, the best score for each PUMA-year will be 0.0 (i.e. a perfect match between the density distributions) and the worst score for each PUMA-year (density distributions that do not overlap at all) will be 2.0.
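As a rough illustration of one such run (the authoritative implementation is the provided metric.py; this sketch omits the numeric range binning and the bias penalty):

import pandas as pd

def marginal_score(ground_truth, synthetic, features):
    # Compare the density over `features` (e.g. ["RACE", "INCTOT"]) within each
    # PUMA-year; 0.0 means identical densities, 2.0 means no overlap at all.
    scores = {}
    for (puma, year), gt_slice in ground_truth.groupby(["PUMA", "YEAR"]):
        syn_slice = synthetic[(synthetic["PUMA"] == puma) & (synthetic["YEAR"] == year)]
        gt_density = gt_slice.groupby(features).size() / max(len(gt_slice), 1)
        syn_density = syn_slice.groupby(features).size() / max(len(syn_slice), 1)
        scores[(puma, year)] = gt_density.subtract(syn_density, fill_value=0).abs().sum()
    return scores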

2. Apply bias penalty (BP):

Each temporal map segment (PUMA-year) will also be checked to ensure the number of individuals represented is not too far from the actual population in the ground truth data.

For any PUMA-year where the synthetic population differs from the actual population by at least 250 records, the score for that PUMA-year will be set to the worst possible score (2.0) as a violation of the bias constraint.

3. Aggregate score

Finally, interim PUMA-year scores will be transformed into an aggregate leaderboard score. Raw scores will be averaged for all PUMA-years, runs, and values of epsilon, resulting in a value between 0.0 (best) and 2.0 (worst). As a reminder, epsilon values can be found in the parameters.json configuration file, and solutions may be run on additional values of epsilon during final scoring. The leaderboard score will be calculated by subtracting this raw_score value from 2.0, then normalizing the result between 0 (worst) and 1,000 (best).
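Put as a small worked example (an illustrative reading of the aggregation above, not the official implementation):

# Raw per-PUMA-year scores in [0.0, 2.0], pooled across runs and epsilon values,
# with bias-penalty violations already set to 2.0 (all values here are made up).
puma_year_scores = [0.35, 0.42, 2.0, 0.51]

raw_score = sum(puma_year_scores) / len(puma_year_scores)    # 0.0 best, 2.0 worst
leaderboard_score = (2.0 - raw_score) / 2.0 * 1000           # 0 worst, 1000 best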

Keep in mind: while 1,000 represents the best theoretically possible accuracy score, scores close to 1,000 are not achievable given privacy-preserving guarantees! For more information on privacy write-ups, see the submissions section below.

For further details and transparent implementation of the performance metric, see the metric script in the competitor pack.

Baseline Privacy Solution: A naive but valid baseline approach to satisfying differential privacy on temporal map data is to simply create a random set of individuals that adhere to the constraints provided in the schema section of the configuration file, without ever looking at the data. Code for this baseline approach is provided in the competitor pack, and a score from this baseline is shown on the leaderboard.
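The idea behind that baseline looks roughly like the following sketch (the actual implementation is in the competitor pack; the min/max handling for non-enumerated columns is an assumption here):

import json
import numpy as np
import pandas as pd

with open("data/parameters.json") as f:
    parameters = json.load(f)

rng = np.random.default_rng(0)
n_rows = 1000   # arbitrary for illustration; stay under max_records in a real submission

columns = {}
for name, spec in parameters["schema"].items():
    if "values" in spec:
        # Draw uniformly from the enumerated values -- the data is never read.
        columns[name] = rng.choice(spec["values"], size=n_rows)
    else:
        # Assumed: columns without an enumerated list provide a min/max range.
        columns[name] = rng.integers(spec["min"], spec["max"] + 1, size=n_rows)

synthetic = pd.DataFrame(columns)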

Use

In the Open Arena, this score will be calculated based on the provided 2012-2018 data for Ohio and Illinois as the ground truth data set and the submitted file as the synthetic data set.

In the Prescreened Arena, scores during the Development Phase will be calculated based on the provided 2012-2018 data as the ground truth data, and the synthetic data set produced by running the submitted executable code on the ground truth data and configuration file. During Final Scoring, final scores will be determined by (1) running the submitted executable code to generate a synthetic data set for each epsilon value on a separate held-out ground truth data set, identical in schema to the 2012-2018 Ohio and Illinois data but with records from other states and/or years, (2) evaluating the metric described above for each privatized output against the appropriate ground truth, and (3) taking the arithmetic mean of the resulting values. To robustly evaluate solutions during final scoring, final submissions will be run on more k-marginal permutations than during the Development Phase, and may be run multiple times per epsilon and on additional values of epsilon.

Submission format


There are four types of submissions accepted throughout this contest:

  • Privatized data set (Open Arena)
  • Pre-screening write-up (Open Arena)
  • Executable code submission (Prescreened Arena)
  • Final scoring (Prescreened Arena)

The submissions for this arena are described here. For further description of Prescreened Arena submissions click here.

Privatized data set

All participants can submit privatized data files for scoring at any point while general submissions are open. To make a submission, sign up for the challenge and head over to the Submissions tab.

In this sprint, you are expected to produce a CSV submission that privatizes the survey records data by creating a synthetic data set of individual records with the same columns as the original data, while adhering to the privacy guarantees you lay out in your proof. This framing of the challenge is very different from an ordinary machine learning train/test/eval split, and it means that you must be very careful about the way and number of times that your code looks at the ground truth numbers which you are provided (see the Correctness in Differential Privacy section above). Code that violates the provided proof can’t win the competition!

Here is what a proper submission file looks like. If you predicted:

epsilon PUMA YEAR ... ARRIVES sim_individual_id
0.1 17-1001 2012 ... 1200 1000
0.1 17-1001 2012 ... 1200 1001
0.1 17-1001 2012 ... 1200 1002
... ... ... ... ... ...
10.0 39-910 2018 ... 1200 9999

The CSV file that you submit would look like:

epsilon,PUMA,YEAR,...,ARRIVES,sim_individual_id
0.1,17-1001,2012,...,1200,1000
0.1,17-1001,2012,...,1200,1001
0.1,17-1001,2012,...,1200,1002
...
10.0,39-910,2018,...,1200,9999

Every row in this file is a single synthetic record generated for a particular epsilon value. The first three columns of this CSV are used as the index for computing scores (comparing distribution densities by epsilon, PUMA, and YEAR), and the remaining columns correspond to each of the additional variables in the data set, as well as a final column denoting the individual the record is assigned to. (Your submitted synthetic data is expected to have a simulated individual (sim_individual_id) column like the ground truth file, but this column cannot be based on anything in the real data, and it will not be used in scoring. You are welcome to fill this column in with random numbers.)

Note: we do not guarantee that every possible value of each variable will be present in the ground truth records data, which means that the set of values appearing in the records data depends on the individuals present in the data (a privacy leak!). You should use parameters.json to make sure that you are creating a proper submission regardless of what is in the data set.

The number of data rows in this submission may vary, but may not exceed the max_records value provided in the runs part of the parameters file (this value is roughly 250 individuals per PUMA-year beyond the number of records in the ground truth). The number of columns will be one more than the ground truth file (to account for epsilon), so 36 + 1 = 37. Your submission — including columns and data types — must exactly match the submission format file.

In addition to the columns matching, you should also verify that your columns have the proper data types. If you are using pandas, you may wish to use the dtypes we specify in read_csv_kwargs.json, available on the 'Data Download' page. Furthermore, the values in each column should be limited to the allowed values or ranges specified in the parameters.json we have provided.
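One possible way to assemble the file (a sketch; `privatize` is a placeholder for your own algorithm, and read_csv_kwargs.json is assumed to hold keyword arguments for pandas.read_csv as described above):

import json
import pandas as pd

with open("read_csv_kwargs.json") as f:
    read_csv_kwargs = json.load(f)

ground_truth = pd.read_csv("ground_truth.csv", **read_csv_kwargs)

frames = []
for epsilon in [0.1, 1.0, 10.0]:
    synthetic = privatize(ground_truth, epsilon)   # hypothetical: your DP algorithm
    synthetic.insert(0, "epsilon", epsilon)        # prepend the epsilon column
    frames.append(synthetic)

pd.concat(frames).to_csv("submission.csv", index=False)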

You can validate that your submission meets the expected format using the metric.py script that we provide:

# see all the options for the scoring script
python metric.py --help
# get a score and validate the submission format
python metric.py ground_truth.csv your-submission.csv --parameters-json parameters.json

This script will also warn you if you are receiving bias penalties. If your submission's official score is lower than the one you computed locally and you are sure it's not from bias penalties, be aware that we are using a different random seed to determine which k-marginal column permutations are used. Make sure that your methodology generalizes well to other random seeds and numbers of column permutations (see the options in metric.py).

Note: we accept submissions that are zipped as long as the CSV you are submitting is the only file inside the zip archive. The submissions for this competition have many repetitive values (e.g. ",0,0,0"), so they compress very well, which will help prevent your uploads from timing out. To zip up your submission on Linux, for example, you can run `zip my-cool-submission submission.csv`, which will result in a `my-cool-submission.zip`. Bear in mind that submissions to this competition are relatively large, and computing many k-marginals can take some time. You can close the submission dialog box and check back after a few minutes to see your score.

Pre-screening write-up

To complete pre-screening, participants are required to submit a PDF containing the following information:

  • Identification of the submission and score generated by the described solution
  • A 1-2 sentence high-level description of the approach
  • A clear and complete written explanation of the algorithm, including a summary of what the solution is doing at each step in the process to generate the privatized data submission
  • A brief description of any additional data sources used other than the provided data, including a link to the data
  • A clear, correct mathematical proof that their solution satisfies differential privacy

A sample privacy write-up was provided in the Competitor Pack from Sprint 1.

Write-ups should be submitted through the Pre-screen Submission page of the challenge website. This document will be reviewed and validated by NIST staff or their delegates, who may request clarification or rewriting if documents are unclear or underspecified. Participants will receive "Prescreened" status confirming that they have an essentially correct understanding of differential privacy as applied to their submission, or a brief explanation of why their algorithm or proof is incorrect. In general it may take up to one week to receive Prescreened status or feedback.

Keep in mind: only Prescreened participants will be eligible for code submission and final scoring. For more details see the Challenge Rules.

Good luck!


If you're wondering how to get started, check out the baseline solution in the competitor pack.

Good luck and enjoy this problem! If you have any questions you can always visit the user forum!