U.S. PETs Prize Challenge: Phase 1

Help unlock the potential of privacy-enhancing technologies (PETs) to combat global societal challenges. Develop efficient, accurate, and extensible federated learning solutions with strong privacy guarantees for individuals in the data. #privacy

$55,000 in prizes
Sep 2022
652 joined

Problem Description

The objective of the challenge is to develop a privacy-preserving federated learning solution that is capable of training an effective model while providing a demonstrable level of privacy against a broad range of privacy threats. For the purposes of this challenge, a privacy-preserving solution is defined as one which is able to provably ensure that sensitive information in the datasets remains confidential to the respective data owners across the machine learning lifecycle. This requires that the raw data be protected during training (input privacy), and that it cannot be reverse-engineered during inference (output privacy).

In Phase 1, your goal is to write a concept paper describing a privacy-preserving federated learning solution that tackles one or both of two tasks: financial crime prevention or pandemic forecasting. The challenge organizers are interested in efficient and usable federated learning solutions that provide end-to-end privacy and security protections while harnessing the potential of AI for overcoming significant global challenges.

Solutions should aim to:

  • Provide robust privacy protection for the collaborating parties
  • Minimize loss of overall accuracy in the model
  • Minimize additional computational resources (including compute, memory, and communication) compared to a non-federated learning approach

In addition, the evaluation process will reward competitors who:

  • Show a high degree of novelty or innovation
  • Demonstrate how their solution (or parts of it) could be applied or generalized to other use cases
  • Effectively prove or demonstrate the privacy guarantees offered by their solution, in a form that is comprehensible to data owners or regulators
  • Consider how their solution could be applied in a production environment

You are free to determine the set of privacy technologies used in your solution, with the exception of hardware enclaves (or similar approaches) and other specialized hardware-based solutions.

More information on the concept paper evaluation criteria and submission specifications is provided below. You can read more about the data for the separate tracks on the financial crime and pandemic forecasting data overview pages.


Timeline

In Phase 1, there are two deadlines: one for registration and the abstract, and one for the concept paper. Teams must both register and submit an abstract by the abstract deadline in order to submit a concept paper and compete in Phase 2.

Key Dates

  • Launch & Blue Team Registration Opened: July 20, 2022
  • Abstracts Due & Blue Team Registration Closed: September 4, 2022 at 11:59:59 PM UTC
  • Concept Papers Due: September 19, 2022 at 11:59:59 PM UTC
  • Winners Announced: October 24, 2022

Phase 2 Tracks and Prizes

In order to participate in the upcoming Phase 2, your team must submit a concept paper in Phase 1 (this phase) for the solution(s) you intend to develop.

In Phase 2, prizes are awarded in separate categories for top solutions in each of the two Data Tracks as well as for Generalized Solutions. Teams can win prizes across multiple of the three categories, but must declare in their Phase 1 concept papers which prize categories their solutions apply to. You can read more about the data tracks on the respective pages for Track A: Financial Crime and Track B: Pandemic Forecasting.

  • You can choose to develop a solution for either Track A or Track B, or develop two solutions total—one for each track.
    • A solution in Phase 2 is eligible to be considered for a prize for the respective track's top solutions prize category.
    • In order to have two solutions, you must submit two concept papers in Phase 1. Teams may win at most one prize in Phase 1, regardless of whether they submit one or two concept papers. However, teams with two solutions may win up to two top solution prizes in Phase 2: each solution will be considered for the prize in its respective data track.
  • Alternatively, you can choose to develop a single generalized solution.
    • A generalized solution is one where the same core privacy techniques and proofs apply to both use cases, and adaptations to specific use cases are relatively minor and separable from the shared core. (The specific privacy claims or guarantees as a result of those proofs may differ by use case.)
    • Generalized solutions may win up to three prizes—one from each of the two data track prize categories, and a third from a prize category for top generalized solutions.

Phase 2 has a total expected prize pool of $575,000. You can see the anticipated prize structure for Phase 2, with individual prize amounts, on the home page.

About Federated Learning


Federated learning (FL), also known as collaborative learning, is a technique for collaboratively training a shared machine learning model on data from multiple parties while preserving each party's data privacy. Federated learning stands in contrast to typical centralized machine learning, where the training data must be collected and centralized for training. Requiring the parties to share their data compromises the privacy of that data!

In federated learning, model training is decentralized and parties do not need to share any data. This process is generally broken into four steps:

  1. A central server that coordinates the model training starts with an initial model.
  2. The central server transmits the initial model to each of the data holders.
  3. Each data holder trains the model locally on its own data.
  4. The data holders send only their training results back to the central server, which securely aggregates these results into the final trained model (a minimal sketch of this loop follows the diagram below).


[Figure: A diagram showing the basic training process under federated learning. Adapted from Wikimedia Commons under CC BY-SA.]
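
To make the four steps above concrete, here is a minimal sketch of federated averaging in Python. It is illustrative only: the helper names (`local_update`, `fed_avg`) are hypothetical, the model is a toy linear regressor, and a production solution would typically build on an established FL framework.

```python
import numpy as np

def local_update(global_weights, X, y, lr=0.1, epochs=5):
    """Step 3: a data holder trains the model locally via gradient descent."""
    w = global_weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # least-squares gradient
        w -= lr * grad
    return w

def fed_avg(client_weights, client_sizes):
    """Step 4: the server aggregates local results, weighted by dataset size."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Three data holders with synthetic local datasets.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    clients.append((X, X @ true_w + rng.normal(scale=0.1, size=100)))

w_global = np.zeros(2)  # Step 1: the server starts with an initial model
for _ in range(20):     # Steps 2-4 repeat over several rounds
    local_results = [local_update(w_global, X, y) for X, y in clients]
    w_global = fed_avg(local_results, [len(y) for _, y in clients])

print(w_global)  # approaches [2.0, -1.0] without raw data leaving any client
```

Note that plain averaging as above is not by itself privacy-preserving: the individual local results still reach the server, which motivates the additional technologies discussed next.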

Note that while federated learning is the primary topic of this challenge, there are many other privacy-enhancing technologies, such as differential privacy, homomorphic encryption, and secure multi-party computation. These technologies can be complementary and used together with federated learning and with each other. In fact, we expect successful solutions in this challenge to leverage multiple privacy-enhancing technologies in interesting and novel ways!
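
As one illustration of such a combination, the sketch below clips a client's model update and adds Gaussian noise before it is shared, in the spirit of differentially private federated averaging. The function name and the `clip_norm`/`noise_std` values are placeholder assumptions; a real solution must calibrate the noise to a target (ε, δ) guarantee using a privacy accountant.

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_std=0.5, rng=None):
    """Clip a client's model update and add Gaussian noise before sending it
    to the aggregator (in the spirit of DP-FedAvg). clip_norm and noise_std
    are illustrative values, not a calibrated privacy guarantee."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))  # bound influence
    return clipped + rng.normal(scale=noise_std, size=update.shape)

# Example: a client privatizes its update before step 4 of the FL loop.
noisy = privatize_update(np.array([0.8, -2.3]))
```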


Privacy Threat Profiles


Overview

Federated learning produces a global model by aggregating local models obtained from distributed parties. As data from each participating party does not need to be shared, the approach provides a baseline level of privacy. However, privacy vulnerabilities exist across the FL lifecycle. For example, as the global federated model is trained, the parameters of the local model updates could be used to learn sensitive information contained in each client's training data.

Similarly, the released global model could also be used to infer sensitive information about the training datasets. Protecting privacy across the FL pipeline requires a combination of privacy-enhancing technologies and techniques that can be deployed efficiently and effectively to preserve privacy while still producing ML models with high accuracy and utility.

You will design and develop end-to-end solutions that preserve privacy across a range of possible threats and attack scenarios, through all stages of the machine learning model lifecycle. You should therefore carefully consider the overall privacy of your solution, focusing on the protection of sensitive information held by all parties involved in the federated learning scenario. The solutions you design should include comprehensive measures to address the threat profiles described below, providing an appropriate degree of resilience to the wide range of potential attacks defined within those profiles.

Lifecycle

You should consider risks across the entire lifecycle of a solution including, in particular, the following stages:

  • Training
    • Raw training data should be protected appropriately during training.
    • Sensitive information in the training data should not be left vulnerable to reverse-engineering from the local model weight updates (see the secure aggregation sketch after this list).
  • Prediction/inference
    • Sensitive information in the training data should not be left vulnerable to reverse-engineering from the global model. The privacy solution should aim to ensure that those with access to the global model cannot infer sensitive information in the training data for the lifetime of the model’s production deployment.
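
The training-stage concern above (leakage from local weight updates) is commonly addressed with secure aggregation. Below is a toy sketch of pairwise additive masking, one classical building block in the style of Bonawitz et al. (2017); key agreement, client dropout handling, and finite-field arithmetic are all omitted, so this illustrates only the cancellation idea.

```python
import numpy as np

# Toy sketch of pairwise additive masking for secure aggregation.
# Each pair of clients (i, j) shares a random mask: client i adds it and
# client j subtracts it, so the masks cancel in the server's sum while each
# individual masked update reveals little about that client's true update.

rng = np.random.default_rng(42)
n_clients, dim = 4, 3
updates = [rng.normal(size=dim) for _ in range(n_clients)]

# masks[(i, j)] is a secret shared between clients i and j (i < j).
masks = {(i, j): rng.normal(size=dim)
         for i in range(n_clients) for j in range(i + 1, n_clients)}

masked = []
for i in range(n_clients):
    m = updates[i].copy()
    for j in range(n_clients):
        if i < j:
            m += masks[(i, j)]
        elif j < i:
            m -= masks[(j, i)]
    masked.append(m)

# The server sees only `masked`, yet the sum equals the true sum of updates.
assert np.allclose(sum(masked), sum(updates))
```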

Actors and intention

You should consider threat models ranging from honest-but-curious to malicious, applied to both aggregators and participating clients, and propose solutions accordingly. While the participating organizations themselves may be trusted, such threat models capture a broad spectrum of possible risks: computation may be outsourced to an untrusted cloud, and even where trusted private cloud infrastructure is used, malicious external actors could compromise part of that infrastructure (for example, one or more banks), reducing the trustworthiness of components within the system.

Privacy attack types

Any vulnerability that could lead to the unintended exposure of private information could fundamentally undermine the solution as a whole. You should therefore consider a range of known privacy attacks, as well as any new ones relevant to the specific privacy techniques you employ or to the specific use case. You will primarily be expected to consider inference vulnerabilities and attacks, including the risks of membership inference and attribute inference.
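
To make the membership inference risk concrete, below is a toy demonstration of the classic loss-threshold attack, in the spirit of Yeom et al. (2018): an overfit model assigns lower loss to its training records, so an attacker with query access can guess membership by thresholding the loss. All data and parameter choices here are synthetic and illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 20                            # few samples, many features: overfitting
true_w = rng.normal(size=d)
X_mem, X_non = rng.normal(size=(n, d)), rng.normal(size=(n, d))
y_mem = X_mem @ true_w + rng.normal(scale=0.5, size=n)   # training members
y_non = X_non @ true_w + rng.normal(scale=0.5, size=n)   # non-members

# "Train" by least squares on the members only.
w_hat, *_ = np.linalg.lstsq(X_mem, y_mem, rcond=None)

loss_mem = (X_mem @ w_hat - y_mem) ** 2
loss_non = (X_non @ w_hat - y_non) ** 2

# Attack: guess "member" whenever a record's loss is below the mean member loss.
threshold = loss_mem.mean()
guesses = np.concatenate([loss_mem, loss_non]) < threshold
truth = np.concatenate([np.ones(n, bool), np.zeros(n, bool)])
print("attack accuracy:", (guesses == truth).mean())  # well above 0.5: leakage
```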

Instructions


Abstract instructions

You must complete two submissions: first a one-page abstract, then a concept paper. Abstracts will be screened for competition eligibility and will inform challenge planning for future stages. Feedback will not be provided.

Abstract Requirements

Successful abstracts shall include the following:

  1. Submission Title
    The title of your submission
  2. Solution Description
    A brief description of your proposed solution, including the proposed privacy mechanisms and architecture of the federated model. The abstract should describe the proposed privacy-preserving federated learning model and expected results with regard to accuracy. Successful abstracts will outline how solutions will achieve privacy while minimizing loss of accuracy.

Abstract Style Guidelines

Abstracts must adhere to the following style guidelines:

  • Abstracts must not exceed one page in length.
  • Abstracts must be submitted in PDF format using 11-point font for the main text.
  • Set the page size to 8.5”x 11” with margins of 1 inch all around.
  • Lines should be single-spaced at a minimum.

A template is provided here that you may choose to use. Abstracts can be submitted through the abstract submission page.

Concept paper instructions

The second and primary submission for this phase is a concept paper. This paper should describe in detail your solution to the tasks outlined in the financial crime prevention or pandemic forecasting tracks. Successful papers will clearly describe the technical approaches employed to achieve privacy, security, and accuracy, as well as lay out the proof-of-privacy guarantees.

You should consider a broad range of privacy threats during the model training and model use phases. Successful papers will address technical and process aspects including, but not limited to, cryptographic and non-cryptographic methods, and protection needed within the deployment environment.

There may be additional privacy threats that are specific to your technical approaches. If this applies, please clearly articulate how your solution addresses any other novel privacy attacks.

You are free to determine the set of privacy technologies used in your solutions, with the exception of hardware enclaves and other specialized hardware-based solutions. For example, de-identification techniques, differential privacy, cryptographic techniques, or combinations thereof could be part of an end-to-end solution.

Concept Paper Requirements

Successful submissions to the Phase 1 concept paper competition will include:

  1. Title
    The title of your submission, matching the abstract.
  2. Abstract
    A brief description of the proposed privacy mechanisms and federated model.
  3. Background
    The background should clearly articulate the selected track(s) the solution addresses, understanding of the problem, and opportunities for privacy technology within the current state of the art.
  4. Threat Model
    The threat model section should clearly state the threat models considered and any related assumptions, including:
    • the risks associated with the considered threat models
    • how your solution will mitigate against the defined threat models through the design and implementation of technical mitigations
    • whether technical innovations introduced in your proposed solution may introduce novel privacy vulnerabilities
    • relevant established privacy and security vulnerabilities and attacks, including any best-practice mitigations
  5. Technical Approach
    The approach section should clearly describe the technical approaches used and list any privacy issues specific to the technological approaches. Successful submissions should clearly articulate:
    • the design of any algorithms, protocols, etc. utilized
    • justifications for enhancements or novelties compared to the current state-of-the-art
    • the expected accuracy and performance of the model, including, if applicable, a comparison to the centralized baseline model
    • the expected efficiency and scalability of the privacy solution
    • the expected tradeoffs between privacy and accuracy/utility
    • how the explainability of the model might be impacted by the privacy solution
    • the feasibility of implementing the solution within the competition timeframe
  6. Proof of Privacy
    The proof of privacy section should include formal or informal evidence-based arguments for how the solution will provide privacy guarantees while ensuring high utility. Successful papers will directly address the privacy vs. utility trade-off.
  7. Data
    The data section should describe how the solution will cater to the types of data provided and articulate what additional work may be needed to generalize the solution to other types of data or models.
  8. Team Introduction
    An introduction to yourself and your team members (if applicable) that briefly details background and expertise. Optionally, you may explain your interest in the problem.
  9. References
    A reference section.

If the use of licensed software is anticipated, please acknowledge it in your paper.

Concept Paper Style Guidelines

Papers must adhere to the following style guidelines:

  • Papers must not exceed ten pages in length, including abstract, figures, and appendices. References will not count towards paper length.
  • Before submitting your paper, please ensure that it has been carefully read for typographical and grammatical errors.
  • Papers must be submitted in PDF format using 11-point font for the main text.
  • Set the page size to 8.5”x 11” with margins of 1 inch all around.
  • Lines should be single-spaced at a minimum.

A template is provided here that you may choose to use. Concept papers can be submitted through the concept paper submission page. Note that the concept paper submission form will not be enabled until your team's abstract has been screened by challenge organizers.

Evaluation and Submission


Evaluation criteria

The concept paper will be evaluated according to the following criteria, in order of importance:

  1. Privacy
    Has the concept paper considered an appropriate range of potential privacy attacks and how the solution will mitigate those?
  2. Innovation
    Does the concept paper propose a solution with the potential to improve on the state of the art in privacy enhancing technology? Does the concept paper demonstrate an understanding of any existing solutions or approaches and how their solution improves on or differs from those?
  3. Efficiency and Scalability
    Is it credible that the proposed solution can be run with a reasonable amount of computational resources (e.g., CPU, memory, storage, communication), when compared to a centralized approach for the same machine learning technique? Does the concept paper propose an approach to scalability that is sufficiently convincing from a technical standpoint to justify further consideration, and reasonably likely to perform to an adequate standard when implemented?
  4. Accuracy
    Is it credible that the proposed solution could deliver a useful level of model accuracy?
  5. Technical Understanding
    Does the concept paper demonstrate an understanding of the technical challenges that need to be overcome to deliver the solution?
  6. Feasibility
    Is it likely that the solution can be meaningfully prototyped within the timeframe of the challenge?
  7. Usability and Explainability
    Does the proposed solution show that it can be easily deployed and used in the real world, and provide a means to preserve any explainability of model outputs?
  8. Generalizability
    Is the proposed solution potentially adaptable to different use cases and/or different machine learning techniques?

Submissions

Abstracts should be submitted through the abstract submission page. You will also need to answer additional questions about your team and solution. You may use a template for the abstract, which you can find here.

Concept papers should be submitted through the Concept Paper Submission page. You may use a template for the concept paper, which you can find here. Note that the concept paper submission form will not be enabled until your team's abstract has been screened by challenge organizers.

Submissions will be reviewed and validated by NIST staff or their delegates, who may request clarification or rewriting if documents are unclear or underspecified.

Keep in mind: only teams who submit a concept paper will be eligible to participate in Phase 2: Solution Development. For more details see the Challenge Rules.

Judges and Reviewers

Submissions are evaluated by privacy and machine learning experts who make up the judge and reviewer panel. Thank you to the judges and reviewers for their time. Meet the judges and reviewers for the challenge below.

Judges:

  • David Archer – Galois, Inc.
  • Joshua Baron – DARPA
  • Vitaly Feldman – Apple ML Research
  • Chad Heilig – US Centers for Disease Control and Prevention
  • Ravi Madduri – Data Science and Learning, Argonne National Laboratory
  • Emily Shen – MIT Lincoln Laboratory
  • Heidi Sofia – National Human Genome Research Institute, NIH
  • Abhradeep Thakurta – Google Research - Brain Team

Reviewers:

  • Erman Ayday – Case Western Reserve University
  • Raef Bassily – The Ohio State University
  • Marina Blanton – University at Buffalo
  • Luis Brandao – National Institute of Standards and Technology
  • Carmela Troncoso – Ecole Polytechnique Fédérale de Lausanne
  • Amrita Roy Chowdhury – University of California, San Diego
  • Liyue Fan – UNC Charlotte
  • Matt Fredrikson – Carnegie Mellon University
  • Yanmin Gong – The University of Texas at San Antonio
  • Yuan Hong – University of Connecticut
  • Jean-Pierre Hubaux – Ecole Polytechnique Fédérale de Lausanne
  • Evgenios Kornaropoulos – George Mason University
  • Olivera Kotevska – Oak Ridge National Laboratory
  • Neal Krawetz – Hacker Factor
  • Adam O'Neill – University of Massachusetts Amherst
  • Olya Ohrimenko – The University of Melbourne
  • Catuscia Palamidessi – INRIA
  • Sean Peisert – Lawrence Berkeley National Laboratory and University of California, Davis
  • Or Sheffet – University of Alberta
  • Shuang Song – Google
  • Daniel Takabi – Department of Computer Science, Georgia State University
  • Christine Task – Knexus Research
  • Stacey Truex – Denison University
  • Alan Wang – Illinois Institute of Technology
  • Tianhao Wang – University of Virginia
  • Yu-Xiang Wang – UC Santa Barbara
  • Minhui Jason Xue – CSIRO's Data61
  • Xingliang Yuan – Monash University
  • Yupeng Zhang – Texas A&M University
  • Xingyu Zhou – Wayne State University

Good luck


Good luck and enjoy this problem! If you have any questions, you can always ask the community by visiting the DrivenData user forum or the joint U.S.–U.K. public Slack channel. You can request access to the Slack channel here.