Differential Privacy Temporal Map Challenge: Sprint 1 (Open Arena)

START HERE! Help public safety agencies share data while protecting privacy. This is part of a series of contests to develop algorithms that preserve the usefulness of temporal map data while guaranteeing individual privacy is protected.


Problem description

In this contest your task is to develop algorithms that preserve data utility while guaranteeing that individual privacy is protected. Submissions will be assessed on both a proof that they satisfy differential privacy and the accuracy of their output data compared with the ground truth.


Timeline and leaderboard


This contest sprint will proceed in two phases:

Development Phase (Oct 1 - Nov 9, 2020)

Upon registering for the contest, participants will enter the Open Arena (you are here!). You will have access to a training ground truth data set and data dictionary, and have the opportunity to submit privatized versions of the data for scoring. For purposes of privacy, the ground truth data set is considered previously released publicly available data that can be used for algorithm design and testing, without impact to the privacy loss budget. DrivenData will use a different data set with the same feature schema (but not necessarily the same distributions) for final scoring.

Throughout the Development Phase, participants can submit privatized data sets using the contest-specified values of delta, epsilon, and the max_records_per_individual (maximum number of records associated with one person in the data set) provided in the configuration file. The scoring metric below will compare privatized data to ground truth data to produce a similarity score for each submission which will be used to update the public leaderboard.

Pre-screening: In order to be eligible for final scoring, participants must have their approach Prescreened as differentially private during the Development Phase. Submitted write-ups will be reviewed and validated by NIST staff or their delegates, who will either confirm that the participant has an essentially correct understanding of differential privacy as applied to their submission, or provide a brief explanation of why the algorithm or proof is incorrect.

Once successfully Prescreened, participants will have the ability to:

  • Enter the Prescreened Arena of the contest sprint
  • Make executable code submissions to the containerized test harness, which will execute submissions on the public challenge data and produce scores for the Prescreened leaderboard (these scores take precedence in determining Progressive Prizes)
  • Make a submission for Final Scoring, which will be used to determine final rankings

For more information on scoring and pre-screening submissions, see the Submission Format section below.

Progressive prizes: In addition, challenge leaders will have the opportunity to win progressive prizes partway through the Development Phase. Four progressive prizes will be awarded based on top scores achieved by October 23, 2020 at 10:00 pm EDT. Rankings will be determined first by position on the Prescreened leaderboard (which requires achieving Prescreened status and making a successful executable code submission), and will proceed to rankings on the open leaderboard as needed. Pre-screening write-ups must be submitted by 10:00 pm EDT on October 20, 2020 to ensure results are returned before the progressive prize deadline.

Final Scoring Submission (Nov 9 - 15, 2020)

Final prize rankings will be determined by the results of the final scoring phase. All Prescreened participants will have the opportunity to make a final submission to run on a separate data set that has not been shared with the challenge participants. This data set has the same schema as the public challenge data provided to all participants. To robustly evaluate solutions during final scoring, final submissions will be run multiple times per epsilon, and may be run on additional values of epsilon.

During the submission period, participants must submit:

  • their final executable code submission to the containerized test harness on DrivenData
  • a clear and complete report including the algorithm description and mathematical privacy proof
  • source code of their solution
  • a code guide "README" file which gives the mapping between the steps of the algorithm write-up and the analogous steps in the source code, and identifies which parts of the source code perform pre-processing, privatization and post-processing operations.

The report will be subject to a final validation review to verify that the algorithm correctly satisfies differential privacy at the stated values of delta and epsilon, and that the code implements the stated algorithm. The review will be performed by NIST staff or their delegates, who may contact participants for clarifications or minor corrections. Validated submissions will be ranked by their score to determine the final leaderboard used for awarding prizes.


Privatized data scoring (Oct 1 - Nov 9, 2020): Scores run on de-identified data submissions are calculated from the provided ground truth data set and displayed on the public leaderboard.

Executable code scoring, Prescreened only (Oct 1 - Nov 9, 2020): Scores run on executable code submissions are calculated from the provided ground truth data set and displayed on the Prescreened public leaderboard.

Final scoring, Prescreened only (Nov 9 - 15, 2020): Prescreened participants may submit a final code submission and write-up for evaluation. Final, validated scores are displayed on the final leaderboard and used to determine prize rankings.


Correctness in Differential Privacy


The definition of epsilon-delta differential privacy used in this contest is as follows:

An algorithm M is (ϵ,δ)-differentially private if

P(M(D) ∈ S) ≤ e^ϵ · P(M(D′) ∈ S) + δ

where S is any set of possible outputs of M, D and D′ are data sets with the same schema that differ by one individual's records, ϵ is the privacy-loss parameter, and δ is negligible in the number of database rows n (i.e. δ < 1/p(n) for any polynomial p). For this challenge, the values of ϵ and δ are provided in the parameters.json file.
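For intuition, here is a minimal sketch (not part of the challenge materials) of the Laplace mechanism, the classic building block that satisfies (ϵ, 0)-differential privacy for a query with known sensitivity:

import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Illustrative sketch: add Laplace noise with scale sensitivity / epsilon.

    Changing one individual's records moves true_value by at most `sensitivity`,
    so the output distribution changes by at most a factor of e^epsilon,
    which is the (epsilon, 0)-DP guarantee defined above.
    """
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# e.g. one person contributes at most 20 records (max_records_per_individual),
# so a count of incidents has sensitivity 20:
noisy_count = laplace_mechanism(true_value=137, sensitivity=20, epsilon=1.0)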

There are several resources available that provide a useful overview, beginner’s guide, and primer for working with differential privacy (for more resources see the About page).

In addition, here are a few tips that may help to keep in mind as you develop your approach.

Tip 1: The algorithm should not look at the data itself to determine a valid set of values for a given feature; rather it must take the schema as external input. Otherwise changing one individual's data could change the schema and control flow of the algorithm, and this would leak information in a way that is not covered by privatization noise.

At the same time, keep in mind that the dataset you’re given during the development phase is considered to be previously released, public data. You can take general lessons learned from working with the publicly released data and hardcode them into your algorithm. For instance, you could use the public dataset to learn a similarity-based clustering of map segments, time slices, or event codes and then build your algorithm to use this clustering. Because the private data set used for final scoring shares the same schema, if there are data distribution properties that can be expected to generalize across different years of data, it is fair game to learn them during the development phase in order to apply them during final scoring. Of course, as always, overfitting to the development data is a risk.
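As a concrete illustration of Tip 1, this sketch (assuming the parameters.json layout shown in the Data section below) derives the valid codes from the externally provided schema rather than from the incident records themselves:

import json

with open("parameters.json") as f:
    schema = json.load(f)["schema"]

# Categories come from the schema, so one individual's records cannot
# change which bins exist or the algorithm's control flow.
neighborhood_codes = [n["code"] for n in schema["neighborhood"]]
incident_codes = [t["code"] for t in schema["incident_type"]]
periods = [(p["year"], p["month"]) for p in schema["periods"]]

# Anti-pattern: do NOT derive categories from the sensitive data itself, e.g.
# incident_codes = sorted(incidents["incident_type"].unique())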

Tip 2: The algorithm should be cautious when using functions such as group-by, min, max, or mean whose outputs could be changed if one individual’s records in the data are changed. All functions whose outputs are "sensitive" to the records in the data must be considered when computing the sensitivity of the algorithm and adding appropriate noise.

For some operations, the global sensitivity (i.e. the worst-case impact of changing one individual's records) may be quite large, and thus any algorithm containing these operations will be impractical to privatize. Differential privacy often requires using different versions of common operations, and different approaches to common algorithms, that have been intentionally designed to reduce sensitivity.

Tip 3: Histograms (built on an externally provided schema) can be quite useful. The sensitivity of creating a histogram is equal to the number of histogram bins that can be changed in the worst case when one individual's records are changed. In this sprint that is equal to the max_records_per_individual parameter. Check out the provided baseline solution and the beginner's guide tutorial above for more information on differentially private algorithms using histograms.

Tip 4: Post-processing is also likely to be important. Once data has been privatized (sufficient noise has been added for the given epsilon, delta, and sensitivity), any further operations can be performed on it safely—with no further risk to privacy—as long as they only interact with the privatized data or publicly available data (and do not interact with the sensitive ground truth data in any fashion). Post-processing operations are perfect for reducing positive bias and handling negative and non-integer values.
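For example, here is a minimal post-processing sketch (assuming the input counts have already been privatized with noise calibrated to ϵ, δ, and the sensitivity):

import numpy as np

def postprocess(noisy_counts):
    """Turn already-privatized counts into valid, non-negative integer counts.

    These operations touch only the privatized values, so by the
    post-processing property they add no further privacy cost.
    """
    counts = np.asarray(noisy_counts, dtype=float)
    counts = np.clip(counts, 0, None)    # true counts cannot be negative
    return np.rint(counts).astype(int)   # the submission requires integers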

Data


The provided data set for this sprint consists of a 1,455,609 row CSV file which records 911 calls that occurred in Baltimore in 2019. As a reminder, for purposes of differential privacy, the 2019 dataset provided during the development phase is considered to be previously released, public data. Lessons learned from working with the 2019 dataset (e.g. similarity of neighborhoods) can be included in your algorithm without loss of privacy.

Incidents

Here is what the first few rows of that incidents.csv data look like:


event_id year month day hour minute neighborhood incident_type sim_resident
140203235110672 2019 1 1 0 0 29 167 4081
140203737381840 2019 1 1 0 0 166 168 6115
140202952922576 2019 1 1 0 0 147 163 17498
140203118608848 2019 1 1 0 0 251 166 30987
140203196663184 2019 1 1 0 0 166 163 35984
... ... ... ... ... ... ... ... ...

The columns are as follows:

  • event_id (int) — Unique ID for each row of the data.
  • year, month, day, hour, minute (int) — Time when the call took place.
  • neighborhood (categorical [int]) — Code for the neighborhood in which the incident took place. See the codebook for the human-readable name corresponding to this code.
  • incident_type (categorical [int]) — Code for which type of incident took place. See the codebook for the human-readable name corresponding to this code.
  • sim_resident (int) — Unique, synthetic ID for the notional person to which this event was attributed. The largest number of incidents attributed to a single simulated resident is provided in the parameters.json file as max_records_per_individual.
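For orientation, here is a short pandas snippet (using the incidents.csv file described above) for loading the development data and checking the per-resident record counts:

import pandas as pd

# The 2019 development data is considered previously released public data,
# so exploratory analysis like this does not spend any privacy budget.
incidents = pd.read_csv("incidents.csv")

# Largest number of records attributed to a single simulated resident;
# this should agree with max_records_per_individual in parameters.json.
print(incidents.groupby("sim_resident").size().max())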

Parameters

In addition to the CSV file, there is also a parameters.json file which serves both as a codebook for the dataset described above (under the schema key) as well as a configuration describing which outputs are expected to be produced by your code (under the runs key).

Here is a bit of what the schema (or "codebook") part of the file looks like:

"schema": {
    "periods": [
      {
        "year": 2019,
        "month": 1
      },
      {
        "year": 2019,
        "month": 2
      },
      …
    ],
    "neighborhood": [
      {
        "name": "Abell",
        "code": 0
      },
      {
        "name": "Allendale",
        "code": 1
      },
      …
    ],
    "incident_type": [
      ...
      {
        "name": "Behavioral Crisis",
        "code": 32
      },
      {
        "name": "Bike Theft",
        "code": 33
      },
      …
    ]
}

And here is the entire runs part:

[
    {
      "epsilon": 1.0,
      "delta": 1.2244024187236405e-11,
      "max_records_per_individual": 20
    },
    {
      "epsilon": 2.0,
      "delta": 1.2244024187236405e-11,
      "max_records_per_individual": 20
    },
    {
      "epsilon": 10.0,
      "delta": 1.2244024187236405e-11,
      "max_records_per_individual": 20
    }
]

As you can see, the delta value (just under 1/n^2, where n is the number of simulated individuals in the data set) and max_records_per_individual (the maximum number of records associated with one person in the data set) are identical across all three runs for this ground truth data set. The epsilon values specify the privacy level your code is required to implement for each of the three runs.

For the Open Arena CSV submission, you will simply upload a submission generated using these values. In the Prescreened Arena, your executable code may assume that this file exists at data/parameters.json and should use its contents to produce the submission file accordingly; do not hard-code these values, as they may change for final scoring. (Delta will change and epsilon values may change in final scoring; the schema will remain the same. See below for more details about the submission format.)
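For example, an executable code submission might read the configuration like this (a sketch assuming only the keys shown in the excerpts above):

import json

with open("data/parameters.json") as f:
    parameters = json.load(f)

for run in parameters["runs"]:
    epsilon = run["epsilon"]
    delta = run["delta"]
    max_records = run["max_records_per_individual"]
    # ...generate the privatized rows for this (epsilon, delta) run here,
    # without hard-coding any of these values.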

External data

As described in the Challenge Rules, external datasets and pre-trained models are allowed for use in the competition as long as they are freely and publicly available to all participants and do not have the same schema or source as the provided contest data. Any additional data sources used other than the provided data must be indicated for approval in privacy write-ups for pre-screening and final scoring.

Performance metric


One row of the differentially private submission file, dp_i, represents a vector of non-negative incident counts for a particular neighborhood-month in the standard order (ascending by incident type code), while the corresponding ground truth vector gt_i represents the true counts.

Here’s a simplified example for one time period with only four incident counts:

[Figure: metric diagram]

We use a custom "pie chart" loss metric to compare the sets of counts. This metric is inspired by a typical public policy use case for aggregated data where insignificant entries below some percentage threshold are dropped and then the resultant entries are shown as a breakdown summing to 100%.

Calculation

After removing insignificant counts, this metric is essentially the information-theoretic Jensen–Shannon distance plus two penalties, informed by differential privacy subject matter expertise, that capture undesirable properties: (1) a misleading presence penalty (MPP), which quantifies the harm done to analyses when seemingly important counts appear that are purely an artifact of privatization, and (2) a bias penalty (BP), applied when the proportions of counts may be correct but the overall totals are unreasonably far from the truth.

The difference between two vectors is evaluated using the following steps:

Step 1: Zero out non-significant counts in each row and re-normalize.

Any incident count in either dp_i or gt_i that accounts for less than 5% of the overall incidents in that vector is set to zero. After zeroing insignificant counts, each vector is divided by its new sum to get proportions instead of raw counts.

[Figure: metric diagram]

Step 2: Calculate the individual loss components for each row:

2A. Calculate the Jensen–Shannon distance (JSD) between the renormalized vectors. Divide each vector by its sum to get probability vectors and then calculate JSD.

2B. Add a misleading presence penalty (MPP) for each time a category of incident shows up as significant in the privatized data but is actually zero in the ground truth vector. This additive penalty is 0.2 per misleading presence.

[Figure: MPP diagram]

2C. Add a bias penalty (BP) of 0.25 if the sum of the raw, un-zeroed privatized counts differs from the ground truth total by more than 500 for row i.

[Figure: bias penalty diagram]

Step 3: Sum up the loss components, subtract from 1, and clip to [0, 1]:

Putting the pieces together, we get the loss for row i by summing the loss components, then subtract this loss from 1 to get the score. The minimum score for each row is 0, so even if the sum of the penalties exceeds 1, the resulting score is clipped to [0, 1].

We start with two MxN matrices, where M is the number of rows (number of epsilons times number of neighborhoods times number of different time periods) and N is the number of incident types. These are the ground truth count matrix GT and the privatized count matrix DP. We get the count vectors dp_i and gt_i by taking row i of the corresponding matrix and applying Step 1 above to zero out insignificant counts and renormalize.

loss(dp_i, gt_i) = JSD(dp_i, gt_i) + MPP(dp_i, gt_i) + BP(dp_i, gt_i)

score(dp_i, gt_i) = max(1 - loss(dp_i, gt_i), 0)

PieChart(dp, gt) = ∑_i score(dp_i, gt_i)

Given that the sum of these individual row scores is the overall score, and that a perfect score for a row is 1 and a "failure" is a 0, the resulting overall score will lie in the range [0, M]. That said, scores close to M are not achievable given privacy-preserving guarantees!
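For reference, here is an unofficial sketch of the per-row score just described (the official implementation lives in the competition repository; base-2 Jensen–Shannon distance and the handling of all-zero rows are assumptions here):

import numpy as np
from scipy.spatial.distance import jensenshannon

def row_score(dp_counts, gt_counts, threshold=0.05, mpp_each=0.2,
              bias_penalty=0.25, bias_limit=500):
    dp = np.asarray(dp_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)

    # Step 2C: the bias penalty uses the raw, un-zeroed counts.
    bp = bias_penalty if abs(dp.sum() - gt.sum()) > bias_limit else 0.0

    # Step 1: zero out counts below 5% of each vector's total, then renormalize.
    dp_z = np.where(dp >= threshold * dp.sum(), dp, 0.0)
    gt_z = np.where(gt >= threshold * gt.sum(), gt, 0.0)
    dp_p = dp_z / dp_z.sum() if dp_z.sum() > 0 else dp_z
    gt_p = gt_z / gt_z.sum() if gt_z.sum() > 0 else gt_z

    # Step 2A: Jensen-Shannon distance between the renormalized vectors.
    jsd = jensenshannon(dp_p, gt_p, base=2)
    if np.isnan(jsd):
        jsd = 0.0  # all-zero edge case; defer to the official implementation

    # Step 2B: misleading presence penalty for significant privatized counts
    # whose ground truth count is actually zero.
    mpp = mpp_each * np.sum((dp_z > 0) & (gt == 0))

    # Step 3: sum the losses, subtract from 1, clip to [0, 1].
    return max(1.0 - (jsd + mpp + bp), 0.0)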

Note: You have an implementation of this metric available to you in the competition repository. We also provide a number of helpful scripts for working with the data. These scripts demonstrate how you can:
  • Create a proper submission format from the parameters file without "looking at" the incident data
  • Implement the naive baseline with Laplace noise
  • Score a submission that you create locally and generate a detailed score report per row
  • Visualize the results in your scoring report to diagnose issues

Baseline Privacy Solution: A useful baseline approach to satisfying differential privacy on temporal map data is to 1) create a histogram of ground truth counts of each incident type in each neighborhood-month segment, 2) add Laplace noise to the ground truth histogram, with the Laplace scale determined by the sensitivity and the privacy parameter ϵ for each chunk of the submission, and then 3) round the noisy counts to integers and change negative counts to 0 in order to get valid data.

Code for this baseline approach is provided in the competitor pack, along with an algorithm write-up and differential privacy proof. A score from this baseline is shown on the leaderboard.

50% Sampling Error Benchmark: A useful benchmark for good algorithm performance is sampling error; if the deviation between gt and dp introduced by the added privacy noise is comparable to the deviation that might be introduced by sampling error (i.e. from subsampling gt), then the added privacy noise is arguably within normally accepted tolerances for practical applications.

To get a sampling error benchmark, we create two uniform random 50% subsamples of the input event record data, and then run the scoring metric (using one sample as gt, and the other as dp) to get a measure of the amount of disparity between them according to our metric. This score should give an idea of where high utility scores might be, and again this baseline is included on the leaderboard for your reference.
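One way to reproduce a benchmark in this spirit (a sketch; the official benchmark's exact sampling procedure may differ) is to split the records into two random halves, aggregate each, and score one half against the other:

import pandas as pd

incidents = pd.read_csv("incidents.csv")

# Split the event records into two random 50% halves.
half_a = incidents.sample(frac=0.5, random_state=0)
half_b = incidents.drop(half_a.index)

# Aggregate each half into incident-type counts per neighborhood-month.
group_cols = ["neighborhood", "year", "month", "incident_type"]
counts_a = half_a.groupby(group_cols).size()
counts_b = half_b.groupby(group_cols).size()
counts_a, counts_b = counts_a.align(counts_b, fill_value=0)

# Score one half as gt and the other as dp using the pie chart metric
# (e.g. the official implementation provided in the competition repository).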

For more information on this benchmark, and further analysis on the scoring metric in general, check out the Pie Chart Metric write-up provided as an example for the utility metrics track of this challenge.

Use

In the Open Arena, this score will be calculated based on the 2019 data as gt and the submitted file as dp.

In the Prescreened Arena, scores during the Development Phase will be calculated based on the 2019 data as gt and the dp produced by running the submitted executable code on the 2019 data. During Final Scoring, final scores will be determined by (1) running the submitted executable code to generate a dp for each epsilon value for each of two subsets of held-out ground truth data identical in CSV format to the 2019 data but with records from other years, (2) evaluating the metric described above for each privatized output dp on the appropriate year’s gt, and (3) taking the arithmetic mean of the resulting values.

Submission format


There are four types of submissions accepted throughout this contest:

  • Privatized data set (Open Arena)
  • Pre-screening write-up (Open Arena)
  • Executable code submission (Prescreened Arena)
  • Final scoring (Prescreened Arena)

The submissions for this arena are described below. Prescreened Arena submissions are described separately in the Prescreened Arena problem description.

Privatized data set

All participants can submit privatized data files for scoring at any point while general submissions are open. To make a submission, sign up for the challenge and head over to the Submissions tab.

Using these inputs, you are expected to produce a CSV submission that privatizes the incident data by providing privatized counts of each type of incident, aggregated at the neighborhood/year/month level, while adhering to the privacy guarantees you lay out in your proof. This framing of the challenge is very different from an ordinary machine learning train/test/eval split: you must be very careful about how, and how many times, your code looks at the ground truth numbers you are provided (see the Correctness in Differential Privacy section above). Code that violates the provided proof can’t win the competition!

Here is what a proper submission file looks like. If your privatized counts were:

epsilon neighborhood year month 0 1 2 ... 171 172 173
1.0 0 2019 1 0 0 0 ... 0 0 0
1.0 0 2019 2 0 0 0 ... 0 0 0
1.0 0 2019 3 0 0 0 ... 0 0 0
... ... ... ... ... ... ... ... ... ... ...
10.0 277 2019 10 0 0 0 ... 0 0 0
10.0 277 2019 11 0 0 0 ... 0 0 0
10.0 277 2019 12 0 0 0 ... 0 0 0

The CSV file that you submit would look like:

epsilon,neighborhood,year,month,0,1,2,...,171,172,173
1.0,0,2019,1,0,0,0,...,0,0,0
1.0,0,2019,2,0,0,0,...,0,0,0
1.0,0,2019,3,0,0,0,...,0,0,0
...
10.0,0,2019,10,0,0,0,...,0,0,0
10.0,0,2019,11,0,0,0,...,0,0,0
10.0,0,2019,12,0,0,0,...,0,0,0

Of course your submission would contain actual counts for each incident type, not just 0.

Every row in this file corresponds to a certain epsilon value and a Baltimore neighborhood for a specific month and year. The first four columns of this CSV are the index and the remaining columns correspond to each of the incident types.

Here is some (very inefficient) pseudocode for putting together a baseline solution:

for each epsilon in epsilons:
    for each neighborhood in neighborhoods:
        for each (year, month) in periods:
            for each incident_type in incident_types:
                # add noise calibrated to the sensitivity and this epsilon
                noisy = ground_truth[neighborhood][year][month][incident_type] + laplace_noise(scale=sensitivity / epsilon)
                # make sure counts are non-negative integers
                privatized[epsilon][neighborhood][year][month][incident_type] = max(0, round(noisy))

For the Open Arena CSV submission, the number of data rows in this submission will be (number of epsilons) * (number of neighborhoods) * (number of different months) = 3 * 278 * 12 = 10,008. With the header row, that makes 10,009. The number of columns will be the 4 index columns plus the number of incident types, so 4 + 174 = 178. Your submission must exactly match the submission format file that you are provided as submission.csv — the index, columns, and data type (int) must exactly match with the only variation being the actual numbers. No missing (NaN) values are allowed, and all values must be non-negative integers.

In the Prescreened Arena, the number of rows can be calculated similarly but depends on the specific time periods and epsilons specified in parameters.json.

Note that we do not guarantee that every possible incident value within the range will be present in the actual incident data, which means that the set of values appearing in the data depends on the individuals present in the data (a privacy leak!). You should use parameters.json to make sure that you are creating a proper submission regardless of what is in the data set. A good way to test this is to check whether you get the right shape of submission even if you feed in only the first 10 lines of the actual incident data.
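For example, here is a sketch that builds the full submission skeleton from parameters.json alone, so the shape is correct regardless of which records happen to appear in the data:

import json
import pandas as pd

with open("parameters.json") as f:
    parameters = json.load(f)

schema = parameters["schema"]
epsilons = [run["epsilon"] for run in parameters["runs"]]
neighborhoods = [n["code"] for n in schema["neighborhood"]]
periods = [(p["year"], p["month"]) for p in schema["periods"]]
incident_types = [t["code"] for t in schema["incident_type"]]

# One row per (epsilon, neighborhood, year, month), built entirely from the schema.
rows = [
    (epsilon, neighborhood, year, month)
    for epsilon in epsilons
    for neighborhood in neighborhoods
    for (year, month) in periods
]
submission = pd.DataFrame(rows, columns=["epsilon", "neighborhood", "year", "month"])

# One column per incident type code, initialized to zero; fill these with
# your privatized counts before saving.
for code in incident_types:
    submission[str(code)] = 0

submission.to_csv("submission.csv", index=False)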

Pre-screening write-up

To complete pre-screening, participants are required to submit a PDF containing the following information:

  • Identification of the submission and score generated by the described solution
  • A 1-2 sentence high-level description of the approach
  • A clear and complete written explanation of the algorithm, including a summary of what the solution is doing at each step in the process to generate the privatized data submission
  • A brief description of any additional data sources used other than the provided data, including a link to the data
  • A clear, correct mathematical proof that their solution satisfies differential privacy

A sample privacy write-up is provided in the Competitor Pack.

Write-ups should be submitted through the Pre-screen Submission page of the challenge website. Each document will be reviewed and validated by NIST staff or their delegates, who may request clarification or rewriting if it is unclear or underspecified. Participants will either receive “Prescreened” status, confirming that they have an essentially correct understanding of differential privacy as applied to their submission, or a brief explanation of why their algorithm or proof is incorrect. In general it may take up to one week to receive Prescreened status or feedback.

Keep in mind: only Prescreened participants will be eligible for code submission and final scoring. For more details see the Challenge Rules.

Good luck!


If you're wondering how to get started, check out the baseline solution and write-up in the competitor pack.

Good luck and enjoy this problem! If you have any questions you can always visit the user forum!