Differential Privacy Temporal Map Challenge: Sprint 3 (Open Arena)

START HERE! Help public safety agencies share data while protecting privacy. This is part of a series of contests to develop algorithms that preserve the usefulness of temporal map data while guaranteeing individual privacy is protected.


Problem description

In this contest, your task is to develop algorithms that preserve data utility while guaranteeing individual privacy is protected. Submissions will be assessed based on their ability to prove they satisfy differential privacy and the accuracy of output data as compared with ground truth.



Timeline and leaderboard


This contest sprint will proceed in two phases:

Development Phase (March 29 - May 10, 2021)

Upon registering for the contest, participants will enter the Open Arena (you are here!). You will have access to a training ground truth data set and data dictionary, and have the opportunity to submit privatized versions of the data for scoring. For purposes of privacy, the ground truth data set is considered previously released publicly available data that can be used for algorithm design and testing, without impact to the privacy loss budget. DrivenData will use a different data set with the same feature schema (but not necessarily the same distributions) for final scoring.

Throughout the Development Phase, participants can submit privatized data sets using the contest-specified values of delta, epsilon, and max_records_per_individual (the maximum number of records associated with one person in the data set) provided in the parameters.json configuration file. The scoring metric below will compare privatized data to ground truth data to produce a similarity score for each submission, which will be used to update the public leaderboard.

Pre-screening: In order to be eligible for final scoring, participants must have their approach Prescreened as differentially private during the Development Phase. Submitted write-ups will be reviewed and validated by NIST staff or their delegates, who will either confirm that the participant has an essentially correct understanding of differential privacy as applied to their submission, or provide a brief explanation of why the algorithm or proof is incorrect.

Once successfully Prescreened, participants will have the ability to:

  • Enter the Prescreened Arena of the contest sprint
  • Make executable code submissions to the containerized test harness, which will execute submissions on the Prescreened Arena challenge data and produce scores for the Prescreened public leaderboard (these scores take precedence in determining Progressive Prizes)
  • Make a submission for Final Scoring, which will be used to determine final rankings

For more information on scoring and pre-screening submissions, see the Submission Format section below.

Progressive prizes: In addition, challenge leaders will have the opportunity to win progressive prizes partway through the Development Phase. Four progressive prizes will be awarded based on top scores achieved by April 26, 2021 at 10:00 pm EST. Rankings will be determined first by position on the Prescreened leaderboard (which requires achieving Prescreened status and making a successful executable code submission), and will proceed to rankings on the open leaderboard as needed. Pre-screening write-ups must be submitted by 10:00 pm EST on April 20, 2021 to ensure results are returned before the progressive prize deadline.

Final Scoring Submission (May 10 - 17, 2021)

Final prize rankings will be determined by the results of the final scoring phase. All Prescreened participants will have the opportunity to make a final submission to run on a separate data set that has not been shared with the challenge participants. This data set has the same schema as the public challenge data provided to all participants. To robustly evaluate solutions during final scoring, final submissions may be run multiple times per epsilon and on additional values of epsilon.

During the final scoring submission period, participants must submit to the Prescreened Arena:

  • their final executable code submission to the containerized test harness on DrivenData
  • a clear and complete report including the algorithm description and mathematical privacy proof
  • source code of their solution
  • a code guide "README" file which gives the mapping between the steps of the algorithm write-up and the analogous steps in the source code, and identifies which parts of the source code perform pre-processing, privatization and post-processing operations.

The report will be subject to a final validation review to verify that the algorithm correctly satisfies differential privacy at the stated values of delta and epsilon, and that the code implements the stated algorithm. The review will be performed by NIST staff or their delegates, who may contact participants for clarifications or minor corrections. Validated submissions will be ranked by their score to determine the final leaderboard used for awarding prizes.


Privatized data scoring (Mar 29 - May 10, 2021): Scores run on de-identified data submissions are calculated from the provided ground truth data set and displayed on the public leaderboard.

Executable code scoring, Prescreened only (Mar 29 - May 10, 2021): Scores run on executable code submissions are calculated from the Prescreened Arena ground truth data set and displayed on the Prescreened public leaderboard.

Final scoring submission, Prescreened only (May 10 - 17, 2021): Prescreened participants may submit a final code submission and write-up for evaluation. Final, validated scores will be displayed on the final leaderboard and used to determine prize rankings.


Correctness in Differential Privacy


The definition of epsilon-delta differential privacy used in this contest is as follows:

An algorithm M is (ϵ, δ)-differentially private if

P(M(D) ∈ S) ≤ e^ϵ · P(M(D′) ∈ S) + δ

for every set of outputs S, where D, D′ are data sets with the same schema differing by one individual's records, ϵ is the privacy-loss parameter, and δ is negligible in the number of database rows n (i.e. δ < 1/p(n) for any polynomial p). For this challenge, the values of ϵ and δ are provided in the parameters.json file.
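As a concrete (and intentionally minimal) illustration of how a definition like this is typically satisfied in practice, here is a sketch of the classic Laplace mechanism for a single counting query. This is illustrative only and is not a complete challenge solution; the sensitivity value of 200 assumes the query counts trips, so one taxi can change the count by at most max_records_per_individual.

import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value plus Laplace(0, sensitivity / epsilon) noise.

    This classic mechanism satisfies (epsilon, 0)-differential privacy for a
    numeric query whose output can change by at most `sensitivity` when one
    individual's records are added, removed, or changed.
    """
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a single count of trips, where one taxi contributes at most
# max_records_per_individual = 200 records, so the sensitivity is 200.
noisy_count = laplace_mechanism(true_value=5234, sensitivity=200, epsilon=1.0)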

For newcomers to differential privacy, there are several resources available that provide a useful overview, beginner's guide, and primer for working with differential privacy, as well as open source algorithms from the 2018 Differential Privacy Synthetic Data Challenge (DeID1) for working with synthetic data. For additional resources see the About page.

In addition, here are a few tips that may help to keep in mind as you develop your approach.

Tip 1: Note that individuals in the Sprint 3 data set are taxi drivers, not taxi trips. When computing sensitivity, two neighboring data sets will differ by all of the trips associated with one taxi driver. Because one taxi can have a maximum of 200 trips, successful approaches will likely rely on queries about taxis, rather than on individual trips.
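A quick way to see this in practice is to remove a single taxi from the public ground truth data and measure how much a per-shift trip-count vector changes. This is a toy check against the public development data only (which is safe, since that data is considered previously released); the column names follow the data dictionary below.

import pandas as pd

# Toy check of Tip 1 against the public development data: neighboring data
# sets differ by *all* trips of one taxi, so a per-shift trip-count vector can
# change by up to max_records_per_individual (200) in total L1 distance.
df = pd.read_csv("ground_truth.csv")

counts_all = df.groupby("shift").size()
one_taxi = df["taxi_id"].iloc[0]
counts_without = df[df["taxi_id"] != one_taxi].groupby("shift").size()

l1_change = counts_all.subtract(counts_without, fill_value=0).abs().sum()
print(l1_change)  # bounded by 200, the number of trips one taxi can contribute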

Tip 2: One subtle implication of the definition of differential privacy is that the algorithm should not look at the data itself to determine a valid set of values for a given feature; rather it must take the schema as external input. Otherwise changing one individual's data could change the schema and control flow of the algorithm, and this would leak information in a way that is not covered by privatization noise.

At the same time, keep in mind that the dataset you’re given during the development phase is considered to be previously released, public data. You can take general lessons learned from working with the publicly released data and hardcode them into your algorithm. For instance, you could use the public dataset to learn a similarity-based clustering of features and then build your algorithm to use this clustering. Because the private data set used for final scoring shares the same schema, if there are data distribution properties that can be expected to generalize across different years of data, it is fair game to learn them during the development phase in order to apply them during final scoring. Of course, as always, overfitting to the development data is a risk.
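As a sketch of what "taking the schema as external input" can look like, the snippet below builds the set of valid histogram bins from parameters.json rather than from the data. The data/parameters.json path follows the Prescreened Arena convention described in the Parameters section; adjust the path for local experimentation.

import json

# Tip 2 in practice: take valid feature values from the schema in
# parameters.json, never from the private data itself, so the set of bins
# cannot depend on which individuals happen to appear in the data.
with open("data/parameters.json") as f:
    schema = json.load(f)["schema"]

shift_values = schema["shift"]["values"]  # e.g. 0 through 20, from the codebook
shift_histogram = {value: 0 for value in shift_values}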

Tip 3: The algorithm should be cautious when using functions such as group-by, min, max, or mean whose outputs could be changed if one individual’s records in the data are changed. All functions whose outputs are "sensitive" to the records in the data must be considered when computing the sensitivity of the algorithm and adding appropriate noise.

For some operations, the global sensitivity (i.e. the worst-case impact of changing one individual's records) may be quite large, making any algorithm that contains these operations impractical to privatize. Differential privacy often requires using different versions of common operations, and different approaches to common algorithms, that have been intentionally designed to reduce sensitivity.

Tip 4: Histograms (built on an externally-provided schema) can be quite useful. The sensitivity of creating a histogram is equal to the total quantity (across all histogram bins) that can be changed in the worst case when one individual's records are changed. In this sprint, if the histogram is counting some property of taxi trips, the sensitivity will be the max_records_per_individual parameter (in this case, 200). However, if the histogram is counting some property of taxis, the sensitivity will be 1, because in this problem the taxis (as specified by taxi_id) are defined to be our individuals.
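Below is a rough sketch contrasting the two cases, using the same Laplace noise idea as the earlier sketch. It is not a complete or budget-accounted solution: in a real submission, each separate query must be covered by your privacy proof and the total epsilon must be allocated across all queries.

import numpy as np
import pandas as pd

rng = np.random.default_rng()
df = pd.read_csv("ground_truth.csv")
epsilon = 1.0

# Histogram over a trip property: one taxi can add or remove up to 200 trips,
# so the L1 sensitivity of the whole histogram is 200.
trips_per_area = df.groupby("pickup_community_area").size()
noisy_trips = trips_per_area + rng.laplace(scale=200 / epsilon, size=len(trips_per_area))

# Histogram over a taxi property: each taxi_id contributes to exactly one bin,
# so the sensitivity is 1 and far less noise is needed for the same epsilon.
taxis_per_company = df.groupby("taxi_id")["company_id"].first().value_counts()
noisy_taxis = taxis_per_company + rng.laplace(scale=1 / epsilon, size=len(taxis_per_company))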

Tip 5: Post-processing is also likely to be important. Once data has been privatized (sufficient noise has been added for the given epsilon, delta, and sensitivity), any further operations can be performed on it safely—with no further risk to privacy—as long as they only interact with the privatized data or publicly available data (and do not interact with the sensitive ground truth data in any fashion). Post-processing operations are perfect for reducing positive bias and handling negative and non-integer values.
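For example, once a noisy histogram has been released, the following operations touch only the privatized values and therefore carry no additional privacy cost:

import numpy as np

# Post-processing on an already-privatized histogram (no further privacy cost,
# because the ground truth data is never touched again).
noisy_histogram = np.array([12.7, -3.2, 0.4, 41.9])

nonnegative = np.clip(noisy_histogram, 0, None)    # counts cannot be negative
integer_counts = np.rint(nonnegative).astype(int)  # counts should be integers
# You could also rescale the bins so they sum to a separately privatized total.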

Data


The provided data set for this sprint consists of a 16 million row CSV file which records information about taxi trips in Chicago, Illinois. The data is adapted from the Chicago Open Data Portal. As a reminder, only the data set provided to participants during the development phase is considered to be previously publicly released data for differential privacy. Lessons learned from working with this dataset (e.g. similarity of features) can be included in your algorithm without loss of privacy.

For this sprint, prescreened and final scoring will use the same time and map segments as the public data, but will be sourced from different year(s) of data.

Records

The main dataset includes quantitative and categorical information about taxi trips in Chicago, including time, distance, location, payment, and service provider. The data includes several features along with time segments (trip_day_of_week and trip_hour_of_day), map segments (pickup_community_area and dropoff_community_area), and simulated individuals (taxi_id). Solutions in this sprint will need to produce a list of records (i.e. synthetic data) with corresponding time and map segments.

Note: taxi_id specifies the individual in this data, for the purposes of differential privacy (see DP tips above). In the past individuals were simulated, but in this sprint we're providing real individual temporal data (one taxi_id represents one week of trips associated with a real world Chicago taxi driver). In this sprint you will have access to real world data, and we will be asking you to synthesize realistic individual taxis as well as realistic trip distributions (see scoring metric below).

Here is what an example row of the ground_truth.csv data looks like:

taxi_id 1000139
shift 19
company_id 26
pickup_community_area 8
dropoff_community_area -1
payment_type 0
trip_day_of_week 6
trip_hour_of_day 11
fare 314
tips 0
trip_total 391
trip_seconds 9217
trip_miles 138

The columns are as follows:

  • taxi_id (int) — Simulated individual ID for each taxi, derived from taxi-weeks in the original data.
  • shift (int) — Computed variable combining trip_day_of_week and trip_hour_of_day into 21 eight-hour segments (a sketch of one possible mapping appears below this list). Each day has three shifts: "night" (20:00-4:00), "morning" (4:00-12:00), and "afternoon" (12:00-20:00). A "night" shift is contiguous; that is, it includes the last four hours of one day and the first four hours of the following day (e.g. Friday night includes late Friday night and early Saturday morning).
  • company_id (int) — Company that the taxi works for.
  • pickup_community_area (int) — Community Area where trip started.
  • dropoff_community_area (int) — Community Area where trip ended.
  • payment_type (int) — Type of payment used.
  • trip_day_of_week (int) — Day of the week that the trip occurred (0 is Monday).
  • trip_hour_of_day (int) — Hour of the day that the trip occurred on a 24-hour clock.
  • fare (int) — Fare paid to nearest dollar.
  • tips (int) — Tips paid to nearest dollar.
  • trip_total (int) — Total amount paid for the trip.
  • trip_seconds (int) — Total duration of the trip to the nearest second.
  • trip_miles (int) — Total length of the trip to the nearest mile.

For further information on the possible values for each column, refer to the schema in the parameters file as described below.
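Since submissions report shift rather than the raw day and hour, here is one plausible sketch of the mapping from (trip_day_of_week, trip_hour_of_day) to a shift index. The exact numbering convention is not spelled out here, so treat this as an assumption and verify it against the provided schema and ground truth data.

def compute_shift(trip_day_of_week: int, trip_hour_of_day: int) -> int:
    """Map (day, hour) to one of the 21 weekly shifts (3 per day, 7 days).

    Assumed numbering (verify against the ground truth data): shift =
    3 * day + block, where block 0 = night (20:00-4:00), 1 = morning
    (4:00-12:00), 2 = afternoon (12:00-20:00), and hours before 4:00 are
    attributed to the previous day's night shift.
    """
    if trip_hour_of_day < 4:
        return 3 * ((trip_day_of_week - 1) % 7)  # tail end of the prior night shift
    if trip_hour_of_day < 12:
        return 3 * trip_day_of_week + 1          # morning
    if trip_hour_of_day < 20:
        return 3 * trip_day_of_week + 2          # afternoon
    return 3 * trip_day_of_week                  # night shift starting at 20:00

# Under this convention, day 6 at 11:00 falls in the morning block: shift 19.
assert compute_shift(6, 11) == 19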

Parameters

In addition to the CSV file, there is also a parameters.json file which serves both as a codebook for the dataset described above (under the schema key) as well as a configuration describing which outputs are expected to be produced by your code (under the runs key).

Here is a bit of what the schema (or "codebook") part of the file looks like:

"schema": {
    "taxi_id": {
      "dtype": "int64",
      "kind": "id"
    },
    "shift": {
      "dtype": "uint8",
      "kind": "categorical",
      "values": [
        0,
        1,
        2,
        3,
        4,
        5,
        6,
        7,
        8,
        9,
        10,
        11,
        12,
        13,
        14,
        15,
        16,
        17,
        18,
        19,
        20
      ]
    },
    ...
  }

And here is the entire runs part:

  "runs": [
    {
      "epsilon": 1.0,
      "delta": 2.5e-05,
      "max_records": 17000000,
      "max_records_per_individual": 200
    },
    {
      "epsilon": 10.0,
      "delta": 2.5e-05,
      "max_records": 17000000,
      "max_records_per_individual": 200
    }
  ]

As you can see, the delta value (just under 1/n^2, where n is the number of individuals in the data set), max_records (the maximum allowed number of synthetic records for each value of epsilon), and max_records_per_individual (the maximum number of records associated with one taxi in the data set) are identical across both runs for this ground truth data set. The epsilon values show what privacy level your code is required to implement for each run.

For the Open Arena CSV submission, you will simply upload a submission file generated using these values. In the Prescreened Arena, your executable code may assume that this file exists at data/parameters.json and should use its contents to produce the submission file accordingly. This means you should not hard-code these values, as they may change for prescreened and final scoring and your code will need to use whatever values it is given. (delta and max_records will change, and epsilon values may change in final scoring. The possible values of company_id may also change. Otherwise the overall schema will remain the same. See below for more details about the submission format.)
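As a sketch of what that looks like in code, the loop below reads every value from the runs section instead of hard-coding anything; generate_rows is a hypothetical placeholder standing in for your own privatization algorithm.

import json

def generate_rows(epsilon, delta, max_records, max_per_individual):
    """Placeholder for your privatization algorithm; returns a list of rows."""
    return []

with open("data/parameters.json") as f:
    params = json.load(f)

all_rows = []
for run in params["runs"]:
    # delta, max_records, and possibly epsilon differ between development,
    # prescreened, and final scoring, so read them from the file every time.
    all_rows.extend(
        generate_rows(
            epsilon=run["epsilon"],
            delta=run["delta"],
            max_records=run["max_records"],
            max_per_individual=run["max_records_per_individual"],
        )
    )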

External data

As described in the Challenge Rules, external datasets and pre-trained models are allowed for use in the competition as long as they are freely and publicly available to all participants and do not have the same schema or source as the provided contest data. Any additional data sources used other than the provided data must be indicated for approval in privacy write-ups for pre-screening and final scoring.

Performance metric


Synthetic data submissions will be compared to ground truth values to evaluate their similarity in a few meaningful ways. Aspects of these approaches were informed by the metrics contest run as part of the broader Differential Privacy Temporal Map Challenge.

There are three different ways similarity will be evaluated, each contributing equally to the total score:

  1. K-marginal feature similarity: how well are clustering characteristics of the dataset conserved
  2. Graph-edge map similarity: how well are connections between pickup and dropoff locations conserved
  3. Individual higher-order conjunction: how well are types of individual taxis conserved

Each of these components will build on the temporal map nature of the data.

  • The primary map segment used for scoring is Pickup Community Area. Dropoff Community Area will also be used as a secondary map segment when evaluating graph-edge map similarity. There are 78 possible Community Areas (including a value for "unknown").
  • The primary time segment used for scoring is Shift, a computed variable summarizing the hour and day that the ride occurred. There are 21 shifts per week, reflecting 7 days and 3 contiguous eight-hour blocks per day: Night 8pm-4am, Morning 4am-12pm, Afternoon 12pm-8pm.

Below is a complete overview of each scoring component and how it is used to calculate the overall score for each submission.

Calculation

K-marginal background:

This contest uses a version of the k-marginal evaluation metric, which was originally developed to score solutions in the NIST DeID1 Synthetic Data Challenge.

The k-marginal evaluation tool is a randomized heuristic that measures similarity between two high dimensional data sets by considering all correlations of k or fewer variables. For each run, the method selects a set of k features, computes the distribution of individuals that fall into each possible combination of those features (with numerical features grouped into range bins), and then takes the absolute difference between the two resulting densities. By repeating across randomly selected marginals, the metric aggregates performance across a variety of features to create a total score.

Here is a simplified illustration of the k-marginal metric where k=3:

k-marginal metric diagram

1. K-marginal feature scoring:

For this sprint, the k-marginal metric will be applied for k=4, where two of the four features are always Pickup Community Area (map segment) and Shift (time segment).

A single run will select two features other than Pickup Community Area and Shift and compare the marginal densities over all possible combinations of their values. The differences in densities between the synthetic and ground truth data will be evaluated for each temporal map slice of the data, then averaged across all temporal map slices and runs.

For instance, one run of the k-marginal metric might feature the variables company_id and trip_total. The number of trips that fall into each taxi company and total fare bin will be calculated per pickup_community_area per shift, both for the ground truth data set and the synthetic data submission. The distribution of the resulting cells will be normalized and compared by temporal map slice. Since the score takes the absolute difference of each pair of cells, the best score is 0.0 (i.e. a perfect match between the density distributions) and the worst score is 2.0 (i.e. the density distributions do not overlap at all).
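As a rough sketch of the core comparison (ignoring the slicing by Pickup Community Area and Shift, the binning of numeric features, and the averaging over random runs, all of which are handled by the provided metric.py), the density difference for one pair of features could be computed like this:

import pandas as pd

def marginal_density_difference(ground_truth, synthetic, features):
    """Sum of absolute density differences over the joint values of `features`.

    Returns a value in [0.0, 2.0]: 0.0 for identical distributions and 2.0 for
    distributions that do not overlap at all. The official metric.py is the
    authoritative implementation; this only illustrates the core comparison.
    """
    p = ground_truth.groupby(features).size()
    q = synthetic.groupby(features).size()
    p = p / p.sum()
    q = q / q.sum()
    return p.subtract(q, fill_value=0).abs().sum()

# e.g. marginal_density_difference(gt_slice, syn_slice, ["company_id", "trip_total"])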

2. Graph-edge map scoring

A specific version of the k-marginal approach will also be used to evaluate the connections between map segments represented in each trip (Pickup and Dropoff Community Areas).

In this case the k-marginal metric will be applied for k=2, where the two features are pickup_community_area and dropoff_community_area. The score will calculate the number of trips that go from each pickup area to each dropoff area, then take the difference in densities by pickup area between the synthetic and ground truth data. The resulting values will be between 0.0 and 2.0, and will be averaged across all pickup areas.
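A simplified sketch of this component (again deferring to metric.py for the exact normalization and aggregation) might look like the following, where the dropoff distribution is computed separately for each pickup area:

import pandas as pd

def graph_edge_difference(ground_truth, synthetic):
    """Average absolute difference of dropoff densities, computed per pickup area."""

    def dropoff_density(df):
        counts = df.groupby(["pickup_community_area", "dropoff_community_area"]).size()
        totals = counts.groupby(level="pickup_community_area").transform("sum")
        return counts / totals

    diff = dropoff_density(ground_truth).subtract(dropoff_density(synthetic), fill_value=0).abs()
    # Each per-pickup-area comparison lands in [0.0, 2.0]; average them.
    return diff.groupby(level="pickup_community_area").sum().mean()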

3. Higher-order conjunction scoring

For the first time this sprint, the column for individuals (taxi_id) will also factor into the leaderboard score. This component is intended to assess how well synthetic data preserves the distribution of types of individuals (in this case, taxis) from the ground truth.

Like k-marginal, the Higher Order Conjunction (HOC) metric was developed for the first NIST Differential Privacy Synthetic Data Challenge (DeID1). Where the k-marginal metric tests for distributional similarity across a few columns at a time, the HOC metric checks for similarity across full records. In order to evaluate taxis, the HOC metric will operate over an aggregation of taxi activity data, assigning each taxi a 99-dimensional vector that counts the number of trips it takes from each pickup_community_area (78 buckets) and during each shift (21 buckets).

Here is an illustration representing a sample vector for one taxi:

HOC metric sample histogram

Where a k-marginal test begins by selecting k columns in the data, an iteration of the HOC metric begins by randomly selecting one target record (taxi activity vector) in the ground truth data. This represents the taxi 'type' we're testing. It then generates a randomized similarity constraint for this test: an array of 99 uniformly random integers (in the range [5, 50] for this sprint). Using this similarity constraint array, a taxi is defined to be 'similar' to the target taxi if, for each of the 99 columns, the absolute difference between the taxi's activity vector and the target taxi's activity vector is less than the similarity constraint value for that column.

HOC metric diagram

The percentage of taxis that are 'similar' to the target taxi type is then computed in both the ground truth data and the privatized synthetic data. The score takes the absolute difference between these densities and averages it across all of the tests. For the provisional phase, we are running HOC tests with 50 randomly selected taxi types.
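Here is a compact sketch of a single HOC test, assuming the 99-dimensional activity vectors have already been assembled (78 pickup-area counts followed by 21 shift counts per taxi). The official metric.py remains the reference implementation.

import numpy as np

def hoc_similar_fraction(activity_vectors, target_vector, constraint):
    """Fraction of taxis 'similar' to the target under one similarity constraint.

    activity_vectors: (n_taxis, 99) array of per-taxi trip counts. A taxi is
    similar if every coordinate of |vector - target| is below the constraint
    value for that coordinate.
    """
    diffs = np.abs(activity_vectors - target_vector)
    return np.mean(np.all(diffs < constraint, axis=1))

rng = np.random.default_rng(0)
constraint = rng.integers(5, 51, size=99)  # uniform integers in [5, 50]
# One test compares this fraction between ground truth and synthetic vectors;
# the component score averages the absolute difference over 50 target taxis.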

4. Aggregate score

Finally, interim scores for each of the components above will be transformed into an aggregate leaderboard score. Raw scores will be averaged for all values of epsilon (as a reminder, epsilon values can be found in the parameters.json configuration file, and solutions may be run on additional values of epsilon during final scoring). The leaderboard score will be calculated by normalizing the three scoring components between 0 (worst) and 1,000 (best) and averaging across them, giving each component equal weight.

Keep in mind: while 1,000 represents the best theoretically possible accuracy score, scores close to 1,000 are not achievable given privacy-preserving guarantees! For more information on privacy write-ups, see the submissions section below.

For further details and transparent implementation of the performance metric, see the metric script in the competitor pack.

Baseline Privacy Solution: A naive but valid baseline approach to satisfying differential privacy on temporal map data is to simply create a random set of individuals that adhere to the constraints provided in the schema section of the configuration file, without ever looking at the data. Code for this baseline approach is provided in the competitors pack, and a score from this baseline is shown on the leaderboard.
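The snippet below gives a rough sense of that baseline (the actual baseline code in the competitors pack is the reference): every categorical column is sampled uniformly from the schema, and the ground truth data is never read.

import json
import numpy as np
import pandas as pd

rng = np.random.default_rng()
with open("data/parameters.json") as f:
    schema = json.load(f)["schema"]

n_rows = 100_000  # must stay within max_records for the run

# Sample each categorical column uniformly from its schema values; numeric
# and id columns would need similar schema-driven handling.
columns = {
    name: rng.choice(spec["values"], size=n_rows)
    for name, spec in schema.items()
    if spec.get("kind") == "categorical"
}
random_submission = pd.DataFrame(columns)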

Use

In the Open Arena, this score will be calculated based on the provided data adapted from 2019 as the ground truth data set, and the submitted file as the synthetic data set.

In the Prescreened Arena, scores during the Development Phase will be calculated based on a separate ground truth data set drawn from a different year, and the synthetic data set produced by running the submitted executable code on that ground truth data and configurations file. Note that this is different from previous sprints, which used the same public ground truth data set for prescreened scoring. Using a separate data set in this sprint will help to give participants a clearer sense of their solution's performance on non-public data and encourage these solutions not to hard-code values that may change in different years.

During Final Scoring, final scores will be determined by (1) running the submitted executable code to generate a synthetic data set for each epsilon value for separate held-out ground truth data, identical in schema to the 2019 data but with records adapted from other years, (2) evaluating the metric described above for each privatized output on the appropriate ground truth, and (3) taking the arithmetic mean of the resulting values. To robustly evaluate solutions during final scoring, final submissions may be run on more k-marginal permutations, multiple times per epsilon and on additional values of epsilon.

Submission format


There are four types of submissions accepted throughout this contest:

  • Privatized data set (Open Arena)
  • Pre-screening write-up (Open Arena)
  • Executable code submission (Prescreened Arena)
  • Final scoring (Prescreened Arena)

The submissions for this arena are described here. For further description of Prescreened Arena submissions click here.

Privatized data set

All participants can submit privatized data files for scoring at any point while general submissions are open. To make a submission, sign up for the challenge and head over to the Submissions tab.

In this sprint, you are expected to produce a CSV submission that privatizes the taxi records data by creating a synthetic data set of individual records with the columns below, while adhering to the privacy guarantees you lay out in your proof. This framing of the challenge is very different from an ordinary machine learning train/test/eval split, and it means that you must be very careful about the way and number of times that your code looks at the ground truth numbers which you are provided (see the Correctness in Differential Privacy section above). Code that violates the provided proof can’t win the competition!

There are a couple differences between the ground truth data file and the synthetic file you will submit:

  • The submission file should include shift but not trip_day_of_week or trip_hour_of_day, since shift is a summary of the other two variables used for evaluation. See the data section for details on how shift is computed.
  • Every row in this file corresponds to a single synthetic taxi trip generated at a particular epsilon value. The first column of this CSV indicates epsilon, and the remaining columns correspond to each of the additional variables in the data set.
Note: we do not guarantee that every possible value of each variable will be present in the ground truth records data, which means that the set of values appearing in the records data depends on the individuals present in the data (a privacy leak!). You should use parameters.json to make sure that you are creating a proper submission regardless of what is in the data set.

Here is what a proper submission file looks like. If you predicted:

epsilon taxi_id shift ... trip_seconds trip_miles
1.0 1112727 0 ... 60262 8321
1.0 1199041 0 ... 83103 4554
1.0 1031551 0 ... 80037 5392
... ... ... ... ... ...
10.0 1004895 20 ... 76558 9714

The CSV file that you submit would look like:

epsilon,taxi_id,shift,...,trip_seconds,trip_miles
1.0,1112727,0,...,60262,8321
1.0,1199041,0,...,83103,4554
1.0,1031551,0,...,80037,5392
...
10.0,1004895,20,...,76558,9714

The number of data rows in this submission may vary, but may not exceed the max_records value provided in the runs part of the parameters file. The number of columns will be one less than the ground truth file (to remove day and hour, and add epsilon), so 13 - 2 + 1 = 12. Your submission — including columns and data types — must exactly match the submission format file.

In addition to the columns matching, you should also verify that your columns have the proper data types. If you are using pandas, you may wish to use the dtypes we specify in read_csv_kwargs.json, available on the 'Data Download' page. Furthermore, the values in each column should be limited to the allowed values or ranges specified in the parameters.json we have provided.
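A quick local sanity check along these lines might look like the snippet below, assuming read_csv_kwargs.json contains keyword arguments for pandas.read_csv (such as a dtype mapping) and that parameters.json sits alongside it:

import json
import pandas as pd

with open("read_csv_kwargs.json") as f:
    read_csv_kwargs = json.load(f)
with open("parameters.json") as f:
    schema = json.load(f)["schema"]

submission = pd.read_csv("your-submission.csv", **read_csv_kwargs)

# Check that every categorical column only contains schema-approved values.
for name, spec in schema.items():
    if name in submission.columns and "values" in spec:
        bad = ~submission[name].isin(spec["values"])
        assert not bad.any(), f"{name} contains values outside the schema"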

You can validate that your submission meets the expected format using the metric.py script that we provide:

# see all the options for the scoring script
python metric.py --help
# get a score and validate the submission format
python metric.py ground_truth.csv your-submission.csv --parameters-json parameters.json

If you are confused about why the score for your submission is lower than the one you computed locally, be aware that we are using a different random seed to determine which k-marginal column permutations are used. Make sure that your methodology generalizes well to other random seeds and numbers of column permutations (see the options in metric.py).

Note: we accept submissions that are zipped as long as the CSV you are submitting is the only file inside the zip archive. The submissions for this competition have many repetitive values (e.g. ",0,0,0") so they compress very well, which will help prevent your uploads from timing out. To zip up your submission on Linux, for example, you can run `zip my-cool-submission submission.csv`, which will produce `my-cool-submission.zip`. Bear in mind that submissions to this competition are relatively large, and computing many k-marginals can take some time. You can close the submission dialog box and check back after a few minutes to see your score.

Pre-screening write-up

To complete pre-screening, participants are required to submit a PDF containing the following information:

  • Identification of the submission and score generated by the described solution
  • A 1-2 sentence high-level description of the approach
  • A clear and complete written explanation of the algorithm, including a summary of what the solution is doing at each step in the process to generate the privatized data submission
  • A brief description of any additional data sources used other than the provided data, including a link to the data
  • A clear, correct mathematical proof that their solution satisfies differential privacy

A sample privacy write-up was provided in the Competitor Pack from Sprint 1.

Write-ups should be submitted through the Pre-screen Submission page of the challenge website. This document will be reviewed and validated by NIST staff or their delegates, who may request clarification or rewriting if documents are unclear or underspecified. Participants will receive "Prescreened" status confirming that they have an essentially correct understanding of differential privacy as applied to their submission, or a brief explanation of why their algorithm or proof is incorrect. In general it may take up to one week to receive Prescreened status or feedback.

Keep in mind: only Prescreened participants will be eligible for code submission and final scoring. For more details see the Challenge Rules.

Good luck!


If you're wondering how to get started, check out the baseline solution in the competitor pack.

Good luck and enjoy this problem! If you have any questions you can always visit the user forum!