Differential Privacy Temporal Map Challenge: Sprint 3 (Prescreened Arena) Hosted By NIST PSCR

4 weeks left
$79,000

Problem description

In this contest, your task is to develop algorithms that preserve data utility while guaranteeing individual privacy is protected. Submissions will be assessed based on their ability to prove they satisfy differential privacy and the accuracy of output data as compared with ground truth.


  • Correctness in differential privacy
  • Overview

Timeline and leaderboard


This contest sprint will proceed in two phases:

Development Phase (March 29 - May 10, 2021)

Upon registering for the contest, participants will enter the Open Arena. You will have access to a training ground truth data set and data dictionary, and have the opportunity to submit privatized versions of the data for scoring. For purposes of privacy, the ground truth data set is considered previously released publicly available data that can be used for algorithm design and testing, without impact to the privacy loss budget. DrivenData will use a different data set with the same feature schema (but not necessarily the same distributions) for final scoring.

Throughout the Development Phase, participants can submit privatized data sets using the contest-specified values of delta, epsilon, and the max_records_per_individual (maximum number of records associated with one person in the data set) provided in the paramaters.json configuration file. The scoring metric below will compare privatized data to ground truth data to produce a similarity score for each submission which will be used to update the public leaderboard.

Pre-screening: In order to be eligible for final scoring, participants must have their approach Prescreened as differentially private during the Development Phase. Submitted write-ups will be reviewed and validated by NIST staff or their delegates to confirm that the participants have an essentially correct understanding of differential privacy as applied to their submission, or to provide a brief explanation why the algorithm or proof is incorrect.

Once successfully Prescreened, participants will have the ability to:

  • Enter the Prescreened Arena of the contest sprint (you are here!)
  • Make executable code submissions to the containerized test harness, which will execute submissions on the Prescreened Arena challenge data and produce scores for the Prescreened public leaderboard (these scores take precedence in determining Progressive Prizes)
  • Make a submission for Final Scoring, which will be used to determine final rankings

For more information on scoring and pre-screening submissions, see the Submission Format section below.

Progressive prizes: In addition, challenge leaders will have the opportunity to win progressive prizes partway through the Development Phase. Four progressive prizes will be awarded based on top scores achieved by April 26, 2021 at 10:00 pm EST. Rankings will be determined first by position on the Prescreened leaderboard (which requires achieving Prescreened status and making a successful executable code submission), and will proceed to rankings on the open leaderboard as needed. Pre-screening write-ups must be submitted by 10:00 pm EST on April 20, 2021 to ensure results are returned before the progressive prize deadline.

Final Scoring Submission (May 10 - 17, 2021)

Final prize rankings will be determined by the results of the final scoring phase. All Prescreened participants will have the opportunity to make a final submission to run on a separate data set that has not been shared with the challenge participants. This data set has the same schema as the public challenge data provided to all participants. To robustly evaluate solutions during final scoring, final submissions may be run multiple times per epsilon and on additional values of epsilon.

During the final scoring submission period, participants must submit to the Prescreened Arena:

  • their final executable code submission to the containerized test harness on DrivenData
  • a clear and complete report including the algorithm description and mathematical privacy proof
  • source code of their solution
  • a code guide "README" file which gives the mapping between the steps of the algorithm write-up and the analogous steps in the source code, and identifies which parts of the source code perform pre-processing, privatization and post-processing operations.

The report will be subject to a final validation review to verify that the algorithm correctly satisfies differential privacy at the stated values of delta and epsilon, and that the code implements the stated algorithm. The review will be performed by NIST staff or their delegates, who may contact participants for clarifications or minor corrections. Validated submissions will be ranked by their score to determine the final leaderboard used for awarding prizes.


Privatized data scoring
Executable code scoring (Prescreened only)
Final scoring submission (Prescreened only)
Privatized data scoring Executable code scoring (Prescreened only) Final scoring submission (Prescreened only)
Mar 29 - May 10, 2021 Mar 29 - May 10, 2021 May 10 - 17, 2021
Scores run on de-identified data submissions are calculated from the provided ground truth data set and displayed on the public leaderboard. Scores run on executable code submissions are calculated from the Prescreened Arena ground truth data set and displayed on the Prescreened public leaderboard. Prescreened participants may submit a final code submission and write-up for evaluation. Final, validated scores will be displayed on the final leaderboard and used to determine prize rankings.


Correctness in Differential Privacy


The definition of epsilon-delta differential privacy used in this contest is as follows:

An algorithm M is (ϵ,δ)-differentially private if

P(M(D) ∈ S) ≤ eϵ P(M(D′) ∈ S) + δ

where D, D′ are data sets in the same schema but differing by one individual’s records, ϵ is the privacy-loss parameter, and δ is negligible in the number of database rows, so δ < 1/p(n) with n being the number of database rows (for this challenge, the values of and δ are provided in the parameters.json file).

For newcomers to differential privacy, there are several resources available that provide a useful overview, beginner’s guide, and primer for working with differential privacy, as well open source algorithms from the 2018 Differential Privacy Synthetic Data Challenge (DeID1) for working with synthetic data. For additional resources see the About page.

In addition, here are a few tips that may help to keep in mind as you develop your approach.

Tip 1: Note that individuals in the Sprint 3 data set are taxi drivers, not taxi trips. When computing sensitivity, two neighboring data sets will differ by all of the trips associated with one taxi driver. Because one taxi can have a maximum of 200 trips, successful approaches will likely rely on queries about taxis, rather than on individual trips.

Tip 2: One subtle implication of the definition of differential privacy is that the algorithm should not look at the data itself to determine a valid set of values for a given feature; rather it must take the schema as external input. Otherwise changing one individual's data could change the schema and control flow of the algorithm, and this would leak information in a way that is not covered by privatization noise.

At the same time, keep in mind that the dataset you’re given during the development phase is considered to be previously released, public data. You can take general lessons learned from working with the publicly released data and hardcode them into your algorithm. For instance, you could use the public dataset to learn a similarity-based clustering of features and then build your algorithm to use this clustering. Because the private data set used for final scoring shares the same schema, if there are data distribution properties that can be expected to generalize across different years of data, it is fair game to learn them during the development phase in order to apply them during final scoring. Of course, as always, overfitting to the development data is a risk.

Tip 3: The algorithm should be cautious when using functions such as group-by, min, max, or mean whose outputs could be changed if one individual’s records in the data are changed. All functions whose outputs are "sensitive" to the records in the data must be considered when computing the sensitivity of the algorithm and adding appropriate noise.

For some operations, the global sensitivity (i.e. worst-case impact of changing one individual’s records), may be quite large and thus any algorithm containing these operations will be impractical to privatize. Differential privacy often requires using different versions of common operations and different approaches to common algorithms that have been intentionally designed to reduce sensitivity.

Tip 4: Histograms (built on externally-provided schema) can be quite useful. The sensitivity of creating a histogram is equal to the total quantity (across all histogram bins) that can be changed in the worst case when one individual's records are changed. In this sprint, if the histogram is counting some property of taxi trips, the sensitivity will be max_records_per_individual parameter (in this case, 200). However, if the histogram is counting some properties of taxis, the sensitivity will be 1, because in this problem the taxis (as specified by taxi_id) are defined to be our individuals.

Tip 5: Post-processing is also likely to be important. Once data has been privatized (sufficient noise has been added for the given epsilon, delta, and sensitivity), any further operations can be performed on it safely—with no further risk to privacy—as long as they only interact with the privatized data or publicly available data (and do not interact with the sensitive ground truth data in any fashion). Post-processing operations are perfect for reducing positive bias and handling negative and non-integer values.

Data


The provided data set for this sprint consists of a 16 million row CSV file which records information about taxi trips in Chicago, Illinois. The data is adapted from the Chicago Open Data Portal. As a reminder, only the data set provided to participants during the development phase is considered to be previously publicly released data for differential privacy. Lessons learned from working with this dataset (e.g. similarity of features) can be included in your algorithm without loss of privacy.

For this sprint, prescreened and final scoring will use the same time and map segments as the public data, but will be sourced from different year(s) of data.

Records

The main dataset includes quantitative and categorical information about taxi trips in Chicago, including time, distance, location, payment, and service provider. The data includes several features along with time segments (trip_day_of_week and trip_hour_of_day), map segments (pickup_community_area and dropoff_community_area), and simulated individuals (taxi_id). Solutions in this sprint will need to produce a list of records (i.e. synthetic data) with corresponding time and map segments.

Note: taxi_id specifies the individual in this data, for the purposes of differential privacy (see DP tips above). In the past individuals were simulated, but in this sprint we're providing real individual temporal data (one taxi_id represents one week of trips associated with a real world Chicago taxi driver). In this sprint you will have access to real world data, and we will be asking you to synthesize realistic individual taxis as well as realistic trip distributions (see scoring metric below).

Here is what an example row of the ground_truth.csv data looks like:

taxi_id 1000139
shift 19
company_id 26
pickup_community_area 8
dropoff_community_area -1
trip_day_of_week 0
trip_hour_of_day 0
shift 0
payment_type 0
trip_day_of_week 6
trip_hour_of_day 11
fare 314
tips 0
trip_total 391
trip_seconds 9217
trip_miles 138

The columns are as follows:

  • taxi_id (int) — Simulated individual ID for each taxi, derived from taxi-weeks in the original data.
  • shift (int) — Computed variable combining the above two variables into 21 eight-hour segments. Each day has three shifts: "night" (20:00-4:00), "morning" (4:00-12:00), and "afternoon" (12:00-20:00). A "night" shift is contiguous; that is, it includes the last four hours of one day and the first four hours of the following day (e.g. Friday night includes late Friday night and early Saturday morning).
  • company_id (int) — Company that the taxi works for.
  • pickup_community_area (int) — Community Area where trip started.
  • dropoff_community_area (int) — Community Area where trip ended.
  • payment_type (int) — Type of payment used.
  • trip_day_of_week (int) — Day of the week that the trip occured (0 is Monday).
  • trip_hour_of_day (int) — Hour of the day that the trip occured on a 24-hour clock.
  • fare (int) — Fare paid to nearest dollar.
  • tips (int) — Tips paid to nearest dollar.
  • trip_total (int) — Total amount paid for the trip.
  • trip_seconds (int) — Total duration of the trip to the nearest second.
  • trip_miles (int) — Total length of the trip to the nearest mile.

For further information on the possible values for each column, refer to the schema in the parameters file as described below.

Parameters

In addition to the CSV file, there is also a parameters.json file which serves both as a codebook for the dataset described above (under the schema key) as well as a configuration describing which outputs are expected to be produced by your code (under the runs key).

Here is a bit of what the schema (or "codebook") part of the file looks like:

"schema": {
    "taxi_id": {
      "dtype": "int64",
      "kind": "id"
    },
    "shift": {
      "dtype": "uint8",
      "kind": "categorical",
      "values": [
        0,
        1,
        2,
        3,
        4,
        5,
        6,
        7,
        8,
        9,
        10,
        11,
        12,
        13,
        14,
        15,
        16,
        17,
        18,
        19,
        20
      ]
    },
    ...
  }

And here is the entire runs part:

    {
      "epsilon": 1.0,
      "delta": 2.5e-05,
      "max_records": 17000000,
      "max_records_per_individual": 200
    },
    {
      "epsilon": 10.0,
      "delta": 2.5e-05,
      "max_records": 17000000,
      "max_records_per_individual": 200
    }

As you can see, the delta values (this is just under 1/n^2, where n is the number of simulated individuals in the data set), max_records (maximum allowed synthetic records for each value of epsilon), and max_records_per_individual (maximum number of records associated with one taxi in the data set) are identical for the purposes of this ground truth data set. The epsilon values show what privacy level your code is required to implement for each run.

For the Open Arena CSV submission, you will simply upload a submission having used these values. In the Prescreened Arena, your executable code may assume that this file exists at data/parameters.json and should use the contents to output a submission file accordingly. Note that this means you should not hard code these values as they may change for prescreened and final scoring, requiring your code to use different values. (delta and max_records will change and epsilon values may change in final scoring. The possible values of company_id may also change. Otherwise the overall schema will remain the same. See below for more details about submission format.)

External data

As described in the Challenge Rules, external datasets and pre-trained models are allowed for use in the competition as long as they are freely and publicly available to all participants and do not have the same schema or source as the provided contest data. Any additional data sources used other than the provided data must be indicated for approval in privacy write-ups for pre-screening and final scoring.

Performance metric


Synthetic data submissions will be compared to ground truth values to evaluate their similarity in a few meaningful ways. Aspects of these approaches were informed by the metrics contest run as part of the broader Differential Privacy Temporal Map Challenge.

There are three different ways similarity will be evaluated, each contributing equally to the total score:

  1. K-marginal feature similarity: how well are clustering characteristics of the dataset conserved
  2. Graph-edge map similarity: how well are connections between pickup and dropoff locations conserved
  3. Individual higher-order conjunction: how well are types of individual taxis conserved

Each of these components will build on the temporal map nature of the data.

  • The primary map segment used for scoring is Pickup Community Area. Dropoff Community Area will also be used as a secondary map segment when evaluating graph-edge map similarity. There are 78 possible Community Areas (including a value for "unknown").
  • The primary time segment used for scoring is Shift, a computed variable summarizing the hour and day that the ride occured. There are 21 shifts per week, reflecting 7 days and 3 contiguous eight-hour blocks per day: Night 8pm-4am, Morning 4am- 12pm, Afternoon 12pm-8pm.

Below is a complete overview of each scoring component and how it is used to calculate the overall score for each submission.

Calculation

K-marginal background:

This contest uses a version of the k-marginal evaluation metric, which was originally developed to score solutions in the NIST DeID1 Synthetic Data Challenge.

The k-marginal evaluation tool is a randomized heuristic that measures similarity between two high dimensional data sets, by considering all correlations of k or fewer variables. For each run, the method selects a set of k features, computes the distribution of individuals that fall into each possible combinations of those features (with numerical features grouped into range bins), and then takes the absolute difference between the two resulting densities. By repeating across randomly selected marginals, the metric aggregates performance across a variety of features to create a total score.

Here is a simplified illustration of the k-marginal metric where k=3:

k-marginal metric diagram

1. K-marginal feature scoring:

For this sprint, the k-marginal metric will be applied for k=4, where two of the four features are always Pickup Community Area (map segment) and Shift (time segment), and the other two features are selected at random.

A single run will compare the marginal density for all possible permutations of the variables other than Pickup Community Area and Shift. The differences in densities between the synthetic and ground truth data will be evaluated for each temporal map slice of the data, then averaged across all temporal map slices and runs.

For instance, one run of the k-marginal metric might feature the variables company_id and trip_total. The number of trips that fall into each taxi company and total fare bin will be calculated per pickup_community_area per shift, both for the ground truth data set and the sythetic data submission. The distribution of the resulting cells will be normalized and compared by temporal map slice. Since the score takes the absolute difference of each pair of cells, the best score is 0.0 (i.e. perfect match between the density distributions) and the worst score is 2.0 (i.e. density distrubutions do not overlap at all).

2. Graph-edge map scoring

A specific version of the k-marginal approach will also be used to evaluate the connections between map segments represented in each trip (Pickup and Dropoff Community Areas).

In this case the k-marginal metric will be applied for k=2, where the two features are pickup_community_area and dropoff_community_area. The score will calculate the number of trips that go from each pickup area to each dropoff area, then take the difference in densities by pickup area between the synthetic and ground truth data. The resulting values will be between 0.0 and 2.0, and will be averaged across all permutations.

3. Higher-order conjunction scoring

For the first time this sprint, the column for individuals (taxi_id) will also factor into the leaderboard score. This component is intended to assess how well synthetic data preserves the distribution of types of individuals (in this case, taxis) from the ground truth.

Like k-marginal, the Higher Order Conjunction (HOC) metric was developed for the first NIST Differential Privacy Synthetic Data Challenge (DeID1). Where the k-marginal metric tests for distributional similarity across a few columns at a time, the HOC metric checks for similarity across full records. In order to evaluate taxis, the HOC metric will operate over an aggregation of taxi activity data by assigning each taxi a 99-dimension vector counting up the number of trips it takes from each pickup_community_area (78 buckets) and each shift (21 buckets).

Here is an illustration representing a sample vector for one taxi:

HOC metric sample histogram

Where a k-marginal test begins by selecting k columns in the data, an iteration of the HOC metric begins by randomly selecting one target record (taxi activity vector) in the ground truth data. This represents the taxi 'type' we're testing. It then generates a randomized similarity constraint for this test, by generating an array of 99 uniformly random integers (in the range [5,50] for this sprint). Using this similarity constraint array, a taxi is defined to be 'similar' to the target taxi if, for each of the 99 columns, the absolute difference between the taxi's activity vector and the target taxi's activity vector is less than similarity constraint value for that column.

HOC metric diagram

The percent (pool) of taxis that are 'similar' to the target taxi type in both the ground truth data and the privatized synthetic data are then computed. The score will take the absolute difference between these densities and the average across all of the tests. For the provisional phase, we we are running HOC tests with 50 randomly selected taxi types.

4. Aggregate score

Finally, interim scores for each of the components above will be transformed into an aggregate leaderboard score. Raw scores will be averaged for all values of epsilon (as a reminder, epsilon values can be found in the parameters.json configuration file, and solutions may be run on additional values of epsilon during final scoring). The leaderboard score will be calculated by normalizing the three scoring components between 0 (worst) and 1,000 (best) and averaging across them, giving each component equal weight.

Keep in mind: while 1,000 represents the best theoretically possible accuracy score, scores close to 1,000 are not achievable given privacy-preserving guarantees! For more information on privacy write-ups, see the submissions section below.

For further details and transparent implementation of the performance metric, see the metric script in the competitor pack.

Baseline Privacy Solution: A naive but valid baseline approach to satisfying differential privacy on temporal map data is to simply create a random set of individuals that adhere to the constraints provided in the schema section of the configurations file, without ever looking at the data. Code for this baseline approach is provided in the competitors pack, and a score from this baseline in shown on the leaderboard.

Use

In the Open Arena, this score will be calculated based on the provided data adapted from 2019 as the ground truth data set, and the submitted file as the synthetic data set.

In the Prescreened Arena, scores during the Development Phase will be calculated based on a separate ground truth data set drawn from a different year, and the synthetic data set produced by running the submitted executable code on that ground truth data and configurations file. Note that this is different from previous sprints, which used the same public ground truth data set for prescreened scoring. Using a separate data set will help to give participants a clearer sense of their solution's performance on non-public data and encourage these solutions not to hard-code values that may change in different years.

During Final Scoring, final scores will be determined by (1) running the submitted executable code to generate a synthetic data set for each epsilon value for separate held-out ground truth data, identical in schema to the 2019 data but with records adapted from other years, (2) evaluating the metric described above for each privatized output on the appropriate ground truth, and (3) taking the arithmetic mean of the resulting values. To robustly evaluate solutions during final scoring, final submissions may be run on more k-marginal permutations, multiple times per epsilon and on additional values of epsilon.

Submission format


There are four types of submissions accepted throughout this contest:

  • Privatized data set (Open Arena)
  • Pre-screening write-up (Open Arena)
  • Executable code submission (Prescreened Arena)
  • Final scoring (Prescreened Arena)

The submissions for this arena are described here. For a description of Open Arena submissions click here.

Executable code submission

The Prescreened Arena is a code execution arena! Rather than submitting a privatized data set, participants will package everything needed to perform the privatization and submit for containerized execution on the cloud. Check out the separate code submission format page where all the details are encapsulated.

You may submit as many times as the stated submission limit allows. Keep in mind that code submissions in the Prescreened Arena should be the same basic algorithm that passed pre-screening. You may continue to tune and tinker with your approach after it has been prescreened, but if you're switching to a substantially different algorithm you should resubmit to pre-screening in order to avoid accidental privacy violations. Algorithms that have been changed may no longer pass validation of differential privacy and so would not be eligible for prizes.

Final scoring

Directly following the end of the Development Phase, Prescreened participants will have one week to make a single final submission during the Final Scoring phase. Each Prescreened team can make a single final, successfully-executed code submission through the Prescreened Arena submission acceptor. The Final Scoring submission will generate a private score based on performance on the private ground truth data. As a reminder, this data set has the same schema as the public challenge data provided to all participants. This private score will be revealed after evaluation is complete and will be used to determine final prize rankings.

The Final Scoring submission should be a zip file containing the following materials:

  • Final executable submission that runs in the containerized test harness
  • All source code used in the solution, structured logically with an obvious point of entry, an extremely clear README file with clear instructions, dependencies and requirements identified, and including the final submission made through the DrivenData platform in the bullet above
  • Differential privacy write-up, containing:
    • a 1-2 sentence high-level description of the approach;
    • a clear and complete written explanation of the algorithm used in the final submission, including a summary of what the solution is doing at each step in the process to generate the privatized data submission;
    • a brief description of any additional data sources used other than the provided data, including a link to the data;
    • a clear, correct mathematical proof that their solution satisfies differential privacy; and
    • a code guide which gives the mapping between the steps of the algorithm write-up and the analogous steps in the source code, and identifies which parts of the source code perform pre-processing, privatization and post-processing operations.
  • Optionally, an eligibility form completed by your team's Official Representative (required only if you believe you are prize-eligible*).

* As a reminder, at the time of entry, the Official Representative (individual or team lead, in the case of a group project) must be age 18 or older and a U.S. citizen or permanent resident of the United States or its territories in order to be eligible for a prize. See Official Rules for full information on eligibility. Note that you are NOT required to be prize-eligible in order to make a final scoring submission and have your performance displayed on the leaderboard.

A sample privacy write-up was provided in the Competitor Pack from Sprint 1. Additional examples like this one can be found in the open solutions list from the DeID1 Challenge. If you are also looking for tools to help organize your code for reproduction and sharing, we have a data science project template which may be helpful.

Final scoring submissions will be reviewed carefully by subject matter experts to ensure that the algorithm and proof correctly satisfy differential privacy and to verify that the code is a good faith implementation of the algorithm without significant errors. Validated submissions will be scored on the private ground truth data per the scoring function above to determine final prize rankings. To robustly evaluate solutions during final scoring, final submissions may be run on more k-marginal permutations, multiple times per epsilon and on additional values of epsilon. For more information see the Challenge Rules.

Good luck!


If you're wondering how to get started, check out the baseline solution in the competitor pack.

Good luck and enjoy this problem! If you have any questions you can always visit the user forum!