U.K. PETs Prize Challenge: Phase 2 (Pandemic Forecasting–Federated)

Welcome to Phase 2 for prescreened participants! Help unlock the potential of privacy-enhancing technologies (PETs) to combat global societal challenges. Develop efficient, accurate, and extensible federated learning solutions with strong privacy guarantees for individuals in the data.

£205,000 in prizes
March 2023

Problem Description


The objective of the challenge is to develop a privacy-preserving federated learning solution that is capable of training an effective model while providing a demonstrable level of privacy against a broad range of privacy threats. The challenge organisers are interested in efficient and usable federated learning solutions that provide end-to-end privacy and security protections while harnessing the potential of AI for overcoming significant global challenges.

Solutions will tackle one or both of two tasks: financial crime prevention or pandemic forecasting. Teams will be required to submit both centralised and federated versions of their models.

This is the U.K. Pandemic Forecasting Track for Phase 2. The Financial Crime Track can be found here.

In Phase 2 of the challenge, you will develop a working prototype of the privacy-preserving federated learning solution that you proposed in Phase 1. As part of this phase, you will package your solution for containerised execution in a common environment provided for the testing, evaluation and benchmarking of solutions.

Overview of what teams submit


You will need to submit the following for your solution by the end of Phase 2:

  1. Executable federated solution code—working implementation of your federated learning solution that is run via containerised execution.
  2. Executable centralised solution code—working implementation of a centralised version of your solution that is run via containerised execution.
  3. Documentation zip archive—a zip archive that contains the following documentation items:
    • Technical report—updated paper that refines and expands on your Phase 1 white paper, including your own experimental privacy, accuracy, and efficiency metrics for the two models.
    • Code guide—README files which provide the mapping between components of your solution methodology and the associated components in your submitted source code, to help reviewers identify which parts of your source code perform which parts of your solution.

Each team will be able to make one final submission for evaluation. In addition to local testing and experimentation, teams will also have limited access to test their solutions through the hosted infrastructure later in Phase 2.

If your team is participating in both data tracks (whether with separate solutions or a generalised solution), you are required to submit all items for both tracks. Additional details regarding each of these items are provided below.

Data Tracks

In Phase 2, a prize will be awarded for the best performing solution in each track, and there will be 3rd and 4th place awards which will be determined independently of the track.

Teams submitting to both tracks (either two separate solutions, or a single generalised solution) could win the award for best performing solution in both the Financial Crime track and the Pandemic Forecasting track.

Code Execution


Federated Code Execution

As part of the challenge, all solutions will be deployed and evaluated on a technical infrastructure providing a common environment for the testing, evaluation, and benchmarking of solutions. To run on this infrastructure, your solution will need to adapt to the provided API specification.

You will submit a code implementation of your federated learning solution to the challenge platform. The evaluation harness will run your code under simulated federated training and inference on a single node on multiple predefined data partitioning scenarios that are the same for all teams. Runtime and accuracy metrics resulting from the evaluation run will be captured and incorporated as part of the overall evaluation of Phase 2 solutions.

The execution harness for this challenge will simulate the federation in your solution in a single containerised node using the virtual client engine from the Flower federated learning framework, a Python library. The API specification will be based on Flower's Client and Strategy interfaces. You may use federated learning frameworks other than Flower and programming languages other than Python; however, you will need to wrap such elements of your solution in Python code that conforms to the API specifications.
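For orientation, the sketch below shows the general shape of a Flower NumPyClient and a FedAvg strategy under the Flower 1.x interfaces. It is an illustration only: the challenge's actual API specification, hooks, and signatures are defined in the submission format pages, and the model and partition handling here are placeholders.

```python
# Illustrative sketch of Flower's Client/Strategy building blocks (Flower 1.x).
# The challenge's official API specification may differ in its exact hooks.
from typing import Dict, List, Tuple

import numpy as np
import flwr as fl


class SolutionClient(fl.client.NumPyClient):
    """One federation participant; the model handling here is a placeholder."""

    def __init__(self, cid: str):
        self.cid = cid
        self.weights: List[np.ndarray] = [np.zeros(10)]  # stand-in for model weights

    def get_parameters(self, config: Dict) -> List[np.ndarray]:
        # Return the current local model weights as a list of NumPy arrays.
        return self.weights

    def fit(self, parameters, config) -> Tuple[List[np.ndarray], int, Dict]:
        # Privacy-preserving local training on this client's partition goes here;
        # this sketch simply echoes the received parameters back.
        self.weights = parameters
        return self.weights, 1, {}

    def evaluate(self, parameters, config) -> Tuple[float, int, Dict]:
        # Local evaluation goes here; loss and example count are dummies.
        return 0.0, 1, {}


# Server-side aggregation lives in a Strategy; FedAvg is the standard baseline.
strategy = fl.server.strategy.FedAvg()
```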

What to submit

For details on what you need to submit for federated code execution, see the Federated Code Submission Format page for this track.

Centralised Code Execution

You will also submit code for a centralised version of your solution to the challenge platform. The evaluation harness will run training and inference on a centralised version of the evaluation dataset. Runtime and accuracy metrics resulting from the evaluation run will be captured and incorporated as part of the overall evaluation of Phase 2 solutions.

What to submit

For details on what you need to submit for centralised code execution, see the Centralised Code Submission Format page for this track.

Documentation Submissions


Technical Paper

The technical paper builds on the white paper from Phase 1. You will be expected to:

  • Update your threat model, technical approach, and privacy proofs to reflect any refinements to your solution. Changes should be relatively minor and not alter the fundamentals of your solution.
  • Describe your centralised solution, including any justifications as needed for how architecture choices and training parameters make it an appropriate baseline to compare against your federated solution.
  • Add self-reported privacy, accuracy, efficiency, and scalability metrics from local experimentation, including documentation about the experimentation environment.

You will be allowed an additional 4 pages to update your white paper. Therefore, updated technical papers shall not exceed 14 pages total, not including references.

Papers should include an "Experimental Results" section that was not present in the Phase 1 white paper. This section should include experimental privacy, accuracy, efficiency, and scalability metrics based on the development dataset. Your experimental results will be used to help determine your scores for the Privacy, Accuracy, and Efficiency/Scalability criteria. We suggest using the following metrics to measure performance in these categories:

  • Privacy: privacy parameters (e.g. ε and δ for differential privacy); success rate of membership inference attack
  • Accuracy: area under the precision-recall curve (AUPRC)
  • Efficiency: total execution time; computation time for each party; maximum memory usage for each party; communication cost for each party
  • Scalability: change in execution time and computation/communication as number of partitions increases

We ask that your experimental results include a description of the privacy–accuracy tradeoff: please define three privacy scenarios (strong, moderate, weak) as applicable to your solution, and report accuracy metrics (at a minimum, AUPRC) corresponding to these scenarios to demonstrate the tradeoff. Two possible ways of defining the privacy scenarios appear in the table below: one for differentially private solutions, and one that leverages membership inference advantage for solutions without theoretical guarantees. Please define your scenarios clearly and use the strongest possible definitions that apply to your solution (i.e. report privacy parameters for theoretical bounds when possible).

Scenario    Differential Privacy    Membership Inference
Strong      ε ≈ 0                   Adv ≈ 0
Moderate    ε ≈ 1                   Adv ≤ 0.1
Weak        ε ≈ 5                   Adv ≤ 0.2
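For solutions without theoretical guarantees, one common way to report membership inference advantage (following the Yeom et al. definition Adv = TPR - FPR) is sketched below; the attack itself, the audit set, and the numbers shown are hypothetical.

```python
# Sketch of reporting membership-inference advantage as Adv = TPR - FPR,
# given ground-truth membership labels and binary attack guesses.
import numpy as np


def membership_advantage(is_member, attack_guess):
    """Adv = TPR - FPR for a binary membership-inference attack."""
    is_member = np.asarray(is_member, dtype=bool)
    attack_guess = np.asarray(attack_guess, dtype=bool)
    tpr = attack_guess[is_member].mean()    # fraction of true members flagged
    fpr = attack_guess[~is_member].mean()   # fraction of non-members flagged
    return tpr - fpr


# Hypothetical attack outputs on a small, balanced audit set:
print(membership_advantage([1, 1, 0, 0], [1, 0, 1, 0]))  # -> 0.0
```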

Code Guide

You will be required to create a code guide in the style of a README that documents your code. The code guide should explain all of the components of your code and how they correspond to the conceptual elements of your solution. An effective code guide will provide a mapping between the key parts of your technical paper and the relevant parts of your source code. Please keep in mind that reviewers will need to be able to read and understand your code, so follow code readability best practices as much as you are able to when developing your solution.

Evaluation


Solutions should aim to:

  • Provide robust privacy protection for the collaborating parties
  • Minimise loss of overall accuracy in the model
  • Minimise additional computational resources (including compute, memory, communication), as compared to a non-federated learning approach.

In addition to this, the evaluation process will reward competitors who:

  • Show a high degree of novelty or innovation
  • Demonstrate how their solution (or parts of it) could be applied or generalised to other use cases
  • Effectively prove or demonstrate the privacy guarantees offered by their solution, in a form that is comprehensible to data owners or regulators
  • Consider how their solution could be applied in a production environment

Rubric

Initial evaluation of the developed solutions will be based on a combination of quantitative metrics and qualitative assessments by judges, according to the following criteria:

Privacy (weighting: 35/100)
  • Information leakage possible from the PPFL model during training and inference, for a fixed level of model accuracy
  • Ability to clearly evidence the privacy guarantees offered by the solution in a form accessible to a regulator and/or data owner audience

Accuracy (weighting: 20/100)
  • Absolute accuracy of the PPFL model developed (e.g., F1 score)
  • Comparative accuracy of the PPFL model compared with a centralised model, for a fixed amount of information leakage

Efficiency and scalability (weighting: 20/100)
  • Time to train the PPFL model and comparison with the centralised model
  • Network overhead of model training
  • Memory (and other temporary storage) overhead of model training
  • Ability to demonstrate scalability of the overall approach for additional nodes

Adaptability (weighting: 5/100)
  • Range of different use cases that the solution could potentially be applied to, beyond the scope of the current challenge

Usability and Explainability (weighting: 10/100)
  • Level of effort to translate the solution into one that could be successfully deployed in a real-world environment
  • Extent to which, and ease with which, privacy parameters can be tuned
  • Ability to demonstrate that the solution implementation preserves any explainability of model outputs

Innovation (weighting: 10/100)
  • Demonstrated advancement in the state of the art of privacy technology, informed by the accuracy, privacy, and efficiency factors described above

Phase 2 may include one round of interaction with the teams so that they can provide any clarification sought by the judges. Comparison bins may be created to compare similar solutions. Solutions should make a case for improvements against existing state-of-the-art solutions.

As with Phase 1, the outcomes of trade-off considerations among criteria as made in the White Papers should be reflected in the developed solution. Solutions must meet a minimum threshold of privacy and accuracy, as assessed by judges and measured quantitatively, to be eligible to score points in the remaining criteria.

The top solutions in each track, as well as the 3rd and 4th place solutions, will advance to Red Team evaluation as described below. The results of the Red Team evaluation will be used to finalise the scores above in order to determine final rankings.

Accuracy Metrics

The evaluation metric will be Area Under the Precision–Recall Curve (AUPRC), also known as average precision (AP), PR-AUC, or AUCPR. This is a commonly used metric for binary classification that summarises model performance across all operating thresholds. This metric rewards models which can consistently assign a higher confidence score to individuals who become infected during the test period than to individuals who do not.

AUPRC will be evaluated under the following scenarios:

  • Federated Solution with N1...3 partitions (a minimum of three different partitioning schemes)
  • Centralised Solution

AUPRC is computed as follows:

$$ \text{AUPRC} = \sum_n (R_n - R_{n-1}) P_n $$

where $P_n$ and $R_n$ are the precision and recall, respectively, when thresholding at the $n$th individual observation, with observations sorted in order of increasing recall.
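For local experimentation, the sketch below computes AUPRC directly from this summation and cross-checks it against scikit-learn's average_precision_score; the labels and scores shown are hypothetical, and the evaluation harness performs its own computation.

```python
# Sketch: compute AUPRC from the summation formula and cross-check with sklearn.
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve


def auprc(y_true, y_score):
    """AUPRC = sum_n (R_n - R_{n-1}) * P_n, with recall increasing in n."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    # precision_recall_curve returns points with recall in decreasing order,
    # so reverse both arrays before applying the summation formula.
    precision, recall = precision[::-1], recall[::-1]
    return float(np.sum(np.diff(recall, prepend=0.0) * precision))


# Hypothetical labels (1 = infected during the test period) and confidence scores:
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])
print(auprc(y_true, y_score), average_precision_score(y_true, y_score))  # values agree
```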

Computational Metrics

In addition, metrics will be calculated at runtime to empirically assess performance, efficiency, and scalability. These metrics may include, but are not limited to:

  • Total Training Time for Federated Solution with N1...3 partitions
  • Total Training Time for Centralised Solution
  • Peak Training Memory Usage for Federated Solution with N1...3 partitions
  • Peak Training Memory Usage for Centralised Solution
  • Total Training "Network" Disk Volume for Federated Solution with N1...3 partitions
  • Total Training "Network" File Number for Federated Solution with N1...3 partitions
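As a starting point for the self-reported efficiency metrics in your technical paper, the sketch below wraps a hypothetical train() entry point to record wall-clock time and peak resident memory. It assumes a Unix environment (on Linux, ru_maxrss is reported in kibibytes) and is not how the evaluation harness itself collects these metrics.

```python
# Sketch: record wall-clock training time and peak resident memory locally.
import resource
import time


def run_with_metrics(train_fn, *args, **kwargs):
    """Run a training callable and report wall-clock time and peak memory."""
    start = time.perf_counter()
    result = train_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    # ru_maxrss is the process's peak resident set size (kibibytes on Linux).
    peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    print(f"training time: {elapsed:.1f} s, peak memory: {peak_mib:.1f} MiB")
    return result


# Usage with a hypothetical entry point:
# model = run_with_metrics(train, partition_paths, config)
```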

Red Team Evaluation

The objective of Phase 3 is to test the strength of the privacy-preserving techniques of the developed PPFL models through a series of privacy audits and attacks. Red Team Participants will plan and launch audits and attacks against the highest-scoring solutions developed during Phase 2.

Frequently Asked Questions


What is the role of the common execution runtime?

Phase 2 of the PETs Prize challenge includes the submission of a containerized implementation of participant solutions to a common execution runtime and infrastructure. This runtime provides a common environment and logic for the testing, evaluation, and benchmarking of the solutions. Centralized and federated model submissions will be run on a separate, unseen dataset.

The quantitative results derived from this testing will be one part of the overall evaluation of solutions. Participants will also submit an updated technical paper including their own experimental privacy, accuracy, and efficiency metrics for the two models. Final scores will be determined by a panel of judges. See the evaluation section in each data track for further information.

How is Flower used in Phase 2 code execution? Why is it needed? Why isn't a light-weight Docker container sufficient?

Flower is a customizable federated learning library that is agnostic to specific machine learning frameworks. The federated evaluation harness uses Flower as an API specification and as the execution engine for simulating the federated learning workflow.

Setting a standard API and simulation engine has a few benefits for the objectives of the PETs Prize Challenge:

  • Comparable metrics for consistent evaluation—The standardization of execution allows challenge organizers to instrument the evaluation workflow in order to collect performance metrics and capture client–server communications in a standardized way. This allows for greater comparability between solution implementations.
  • Facilitation of judging and red team evaluation—The standardized API will make it easier for judges and red teams to review and understand source code. Standardized capture of client–server communications will help red teams evaluate privacy attacks that make use of that information. Additionally, judges can have greater confidence that the federation structure of solutions is properly implemented.
  • Focus on privacy techniques—The objective of the challenge is to drive innovation in privacy technologies. Teams can focus on the design and implementation of their privacy techniques, as the federated learning simulation is handled by a provided standard implementation.

The challenge organizers recognize that there are inherent tradeoffs in how the evaluation harness is designed. This design has been chosen to balance those tradeoffs in achieving the challenge's objectives.

Is this a challenge for Flower-based solutions? What if I have my own federated learning framework?

No, the challenge is for general development of privacy-preserving federated learning solutions.

One part of how solutions are evaluated is having an implementation that is tested, evaluated, and benchmarked in a standardized evaluation runtime. This submitted implementation must follow the standardized API specifications based on Flower. If you have another federated learning framework that you would like to use as part of the submitted implementation, you can wrap your code with the Flower API.

You may have other implementations of your solution that are not submitted for code execution and exclusively make use of your own framework or another framework. If such an implementation demonstrates additional strengths of your solution, you should discuss it and include experimental results as part of your technical paper. Keep in mind that your report should clearly articulate and defend the benefits of your solution. See the evaluation criteria for further information.

The evaluation does not support client peer-to-peer communication. What do I do if my federated solution uses decentralized federated learning?

You can still implement communication between clients by routing it through the server as a mediator. If this has an impact on the communication efficiency or the privacy of your solution, you should explain this clearly in your technical paper as part of your threat model definition. You should also include any relevant experimental results in your technical paper. Judges and red teams will take this into account when reviewing your solution during final evaluation.
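As an illustration only (not the official harness mechanism), a server-mediated exchange could look roughly like the sketch below under the Flower 1.x Strategy interface: clients attach an opaque bytes payload named "broadcast" to their fit() metrics, and the strategy redistributes it through the next round's fit config. All names here are hypothetical.

```python
# Hedged sketch of server-mediated "peer" messaging, assuming the Flower 1.x
# Strategy interface. Clients attach an opaque bytes payload ("broadcast") to
# the metrics returned from fit(); the strategy stores it and forwards it to
# the other clients in the next round's fit config.
from typing import Dict, List, Tuple

import flwr as fl
from flwr.common import FitIns, Parameters
from flwr.server.client_manager import ClientManager
from flwr.server.client_proxy import ClientProxy


class MediatedFedAvg(fl.server.strategy.FedAvg):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.mailbox: Dict[str, bytes] = {}  # sender cid -> payload

    def configure_fit(
        self, server_round: int, parameters: Parameters, client_manager: ClientManager
    ) -> List[Tuple[ClientProxy, FitIns]]:
        instructions = super().configure_fit(server_round, parameters, client_manager)
        relayed = []
        for client, fit_ins in instructions:
            config = dict(fit_ins.config)
            # Deliver payloads posted by the other clients in the previous round.
            for sender_cid, payload in self.mailbox.items():
                if sender_cid != client.cid:
                    config[f"msg_from_{sender_cid}"] = payload
            relayed.append((client, FitIns(fit_ins.parameters, config)))
        return relayed

    def aggregate_fit(self, server_round, results, failures):
        # Collect the payload each client attached to its fit() metrics.
        for client, fit_res in results:
            payload = fit_res.metrics.get("broadcast", b"")
            if payload:
                self.mailbox[client.cid] = payload
        return super().aggregate_fit(server_round, results, failures)
```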

What access do Red Teams have to my solution?

Red Teams will have access to the white papers submitted by Blue Teams in Phase 1, and will also be provided with submitted solutions (including source code) from Phase 2 for the finalists selected to advance to Red Team testing. All Red Teams are required to sign a non-disclosure agreement as a condition for participation. You can find a copy of the agreement here.

How can I include software dependencies that my solution depends on?

The primary way to make software dependencies available to your solution is to include them as part of the runtime container image. Please see the instructions here for opening a pull request to add additional dependencies to the runtime image.

Vendoring software dependencies by including them as part of your submission is also an available option for dependencies that do not make sense to include as part of building the runtime image. You can learn more about how Python's module search path works from the Python documentation or this guide.
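For example, a vendored package placed next to your submission could be made importable with a couple of lines at the top of your entry point; the directory name and package below are hypothetical.

```python
# Minimal sketch: prepend a hypothetical "vendored/" directory (shipped inside
# your submission) to the module search path before importing from it.
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parent / "vendored"))

import some_vendored_package  # hypothetical vendored dependency
```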

Please note that containers will not have any network access when running your code during the evaluation process.

What should I do if the design of the standardized evaluation has an impact on the performance of my solution that otherwise wouldn't apply in other deployment circumstances?

Please use the technical report to describe any impact that the evaluation process has on your solution that you believe would not be applicable under other deployment circumstances. You should include any relevant results from local experimentation. The judges will consider such claims and justifications as part of the evaluation.

Good luck


Good luck and enjoy this problem! For more details on the code submission format, visit the code submission page. If you have any questions, you can always ask the community by visiting the DrivenData user forum or the cross-U.S.–U.K. public Slack channel. You can request access to the Slack channel here. You can also reach out to CDEI at petsprizechallenges@cdei.gov.uk or Innovate UK at support@iuk.ukri.org.