Differential Privacy Temporal Map Challenge: Sprint 2 (Open Arena)

Help public safety agencies share data while protecting privacy. This is part of a series of contests to develop algorithms that preserve the usefulness of temporal map data while guaranteeing individual privacy is protected.


Differential Privacy Temporal Map Challenge


The goal of this challenge is to develop algorithms that preserve data utility as much as possible while guaranteeing individual privacy is protected. The challenge features a series of coding sprints to apply differential privacy methods to temporal map data, where each record is tied to a location and each individual may contribute to a sequence of events.

Why

Large data sets containing personally identifiable information (PII) are exceptionally valuable resources for research and policy analysis in a host of fields supporting America's First Responders, such as emergency planning and epidemiology.

Temporal map data is of particular interest to the public safety community in applications such as optimizing response times and personnel placement, natural disaster response, epidemic tracking, demographic analysis, and civic planning. Yet the ability to track a person's location over a period of time presents particularly serious privacy concerns.

The Solution

Sprint 2 featured census data about simulated individuals in various US states from 2012 to 2018, including 35 different survey features in addition to the identifier of the simulated individual each record belonged to. Records for the same individual were also linked across years, making it even more challenging to ensure that each person's privacy was protected.

To succeed in this competition, participants needed to build de-identification algorithms that generated a set of synthetic, privatized survey records preserving the patterns in the original data as accurately as possible. Because accuracy was evaluated with respect to all map segments (Public Use Microdata Areas, or PUMAs) and time segments (years), it was important that solutions perform well even on the more complex or sparse sections of the data.

The Results

Winning competitors found that general techniques that had proven effective in the 2018-2019 Synthetic Data Challenge (DeID1) were also useful here, although advances were necessary to address the difficulties imposed by the temporal map context.

  • Probabilistic Graphical Models (PGM): Both N-CRiPT and Minutemen used PGM-based approaches, which were also used by the first-place winner of the 2018-2019 challenge. Each team tailored their approach and applied new techniques to deal with the increased sensitivity of the temporal data.
  • Noisy Marginals: On the high epsilon setting, both DukeDP and DPSyn used approaches which first collected a set of marginal counts from the data, then privatized the counts with added noise, and finally used post-processing and sampling from the public data to produce a final consistent synthetic data set (a minimal sketch of this measure-noise-sample pattern follows this list). DPSyn also used a similar technique when they earned second place in the 2018-2019 challenge. Each team introduced new strategies to handle the increased difficulty of the problem, including new approaches to creating the final consistent data.
  • Histograms: In the 2018-2019 challenge, the DPFields solution partitioned the variables in the schema, collected (and privatized) histograms over each partition, and then used these noisy histograms as probability distributions for sampling the synthetic data. Jim King used a similar approach, carefully tailoring the choice of histograms and the level of aggregation to the more challenging temporal map problem.
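
The noisy-marginals and histogram entries above share a common measure-noise-postprocess-sample pattern. The following is a minimal Python sketch of that pattern, not any competitor's actual solution: the column names (PUMA, YEAR, SEX), the single marginal, the choice of the Laplace mechanism, and the assumption of sensitivity 1 per record are simplifications for illustration. Real entries also had to account for individuals contributing linked records across years, which raises the sensitivity and the noise required.

```python
import numpy as np
import pandas as pd

def noisy_marginal(df, cols, epsilon):
    """Measure one marginal over `cols`, add Laplace noise, and
    post-process the result into a valid probability distribution.
    Sensitivity is assumed to be 1 per record, a simplification."""
    counts = df.groupby(cols).size()
    noisy = counts.to_numpy(dtype=float) + np.random.laplace(scale=1.0 / epsilon, size=len(counts))
    noisy = np.clip(noisy, 0.0, None)  # post-process: remove negative counts
    return pd.Series(noisy / noisy.sum(), index=counts.index)

def synthesize(df, cols, epsilon, n_records):
    """Sample synthetic records from the privatized marginal."""
    dist = noisy_marginal(df, cols, epsilon)
    picks = np.random.choice(len(dist), size=n_records, p=dist.to_numpy())
    return pd.DataFrame(list(dist.index[picks]), columns=cols)

# Toy example: a single marginal over hypothetical (PUMA, YEAR, SEX) columns.
raw = pd.DataFrame({
    "PUMA": np.random.choice(["01-0101", "01-0102"], size=1000),
    "YEAR": np.random.choice([2012, 2013], size=1000),
    "SEX": np.random.choice([1, 2], size=1000),
})
synthetic = synthesize(raw, ["PUMA", "YEAR", "SEX"], epsilon=1.0, n_records=1000)
print(synthetic.head())
```

In practice, the winning teams measured many marginals or histograms, split their privacy budget across those measurements, and relied on much more elaborate post-processing to keep the final synthetic data set consistent.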

These solutions were evaluated using the "K-marginal Evaluation Metric", designed to measure how faithfully each privatization algorithm preserves the most significant patterns across all features in the data within each map/time segment. One consequence of the increased query sensitivity in the temporal map challenge is that solutions must interact with the data strategically to get the most value from each query. Techniques that made efficient use of their queries performed well, especially at higher epsilons.
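
As a rough illustration of how such a metric can work, here is a hedged sketch of a K-marginal style comparison: repeatedly pick K features at random and, within each (PUMA, YEAR) segment, compare the joint distribution of those features in the original and synthetic data. The column names, the number of trials, and the use of total variation distance for the per-marginal comparison are assumptions of this sketch rather than the official metric's definition.

```python
import numpy as np
import pandas as pd

def k_marginal_score(original, synthetic, feature_cols, k=2, n_trials=50, seed=0):
    """Average total variation distance between original and synthetic
    K-marginals, computed separately within every (PUMA, YEAR) segment.
    Lower is better in this sketch; the official metric rescales scores."""
    rng = np.random.default_rng(seed)
    distances = []
    for puma, year in original[["PUMA", "YEAR"]].drop_duplicates().itertuples(index=False):
        orig_seg = original[(original.PUMA == puma) & (original.YEAR == year)]
        synth_seg = synthetic[(synthetic.PUMA == puma) & (synthetic.YEAR == year)]
        for _ in range(n_trials):
            # Pick K features at random and compare their joint distributions.
            cols = list(rng.choice(feature_cols, size=k, replace=False))
            p = orig_seg.groupby(cols).size() / max(len(orig_seg), 1)
            q = synth_seg.groupby(cols).size() / max(len(synth_seg), 1)
            p, q = p.align(q, fill_value=0.0)  # align on a common index
            distances.append(0.5 * float(np.abs(p - q).sum()))
    return float(np.mean(distances)) if distances else float("nan")
```

Because the score averages over every map and time segment, an algorithm that only captures the dense, well-populated segments is penalized on the sparse ones.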