U.S. PETs Prize Challenge: Phase 2 (Financial Crime–Federated)

Help unlock the potential of privacy-enhancing technologies (PETs) to combat global societal challenges. Develop efficient, accurate, and extensible federated learning solutions with strong privacy guarantees for individuals in the data. #privacy

$185,000 in prizes
mar 2023
201 joined

Data Track A: Financial Crime

Transforming Financial Crime Prevention


There are two data use case tracks for the PETs prize challenge. This is the Phase 2 data overview page for the financial crime prevention track. In this track, innovators will develop end-to-end privacy-preserving federated learning solutions to detect potentially anomalous payments, leveraging a combination of input- and output-privacy techniques.

Background


The financial crime track is focused on enhancing cross-organization, cross-border data access to support efforts to combat fraud, money laundering and other financial crime. You are asked to develop innovative, privacy-preserving solutions to enable detection of potentially anomalous payments, utilizing synthetic datasets representing data held by the SWIFT payments network and datasets held by partner banks.

“Anomalous transactions” covers a range of payments that vary significantly from the norms seen in the dataset, and thus may be indicative of fraud, money laundering, or other financial crime. Examples include a transaction that is of an unexpected amount or currency, uses unusual corridors (senders/receivers), has unusual timestamps, or contains other unusual fields. In the scope of this challenge, the problem is framed as a classification task. The training datasets are labeled with anomalies, and therefore you do not need a detailed understanding of financial crime issues.

This is a high-impact and exciting use case for novel privacy-enhancing technologies. There are currently challenging trade-offs between enabling sufficient access to data to build tools to effectively detect illegal financial activity, and limiting the identifiability of innocent individuals or inference of their sensitive information within those data sets. The scale of the problem is vast: the UN estimates that US$800-2000bn is laundered each year, representing 2-5% of global GDP.

Though novel innovation for this use case alone could achieve significant real-world impact, the challenge is designed to incentivize development of privacy technologies that can be applied to other use cases where data is distributed across multiple organisations or jurisdictions, both in financial services and elsewhere. The best solutions will deliver meaningful innovation towards deployable solutions in this space, with consideration of how to evidence the privacy guarantees offered to data owners and regulators, but also have the potential to generalize to other situations.

Data Overview


Innovators will use synthetic datasets representing data held by the SWIFT global payments network and by its partner banks. In Phase 1, you were provided with two development datasets:

  • Dataset 1: A synthetic dataset representing transaction data from the SWIFT global payment network
  • Dataset 2: Synthetic customer / account metadata, including flags, from SWIFT's partner banks

There are approximately 4 million rows across the two development datasets.

Additional development data may be released for Phase 2.

Note: The challenges are based on synthetic data to minimize the security burden placed on competitors during the development phase; of course the intent of the challenge is that privacy solutions are developed that would be appropriate for use on real data sets with demonstrable privacy guarantees. However, competitors must adhere to a data use agreement (see competition rules for more details).

Dataset 1: Transaction data held by SWIFT

In Phase 1, you were provided a synthetic dataset derived from data from the SWIFT global payment network. Each row in this dataset is an individual transaction, representing a payment from one sending bank to one receiving bank. The dataset will:

  • Contain data elements as defined in the ISO20022 pacs.008 / MT103 message format
  • Comprise transactions between fictitious originators and beneficiaries, sender and receiving banks, payment corridor, amount and timestamps

Expertise in financial crime or ISO20022 messaging is not an expected prerequisite for entering the challenge, and the assessment process will not focus on detailed understanding of the use case itself. The details in the sections below should be sufficient for understanding the data within the scope of the challenge. However, participants unfamiliar with this space may find it informative to consult a general introduction to ISO20022. You may also find the ISO20022 message definitions informative.

The dataset reflects a snapshot of transactions sent by an ordering customer or institution to credit a beneficiary customer or institution. The dataset covers roughly a month’s worth of transactions involving 50 institutions.

The synthetic data is not generated based on any real traffic and will not contain any statistical properties of the real SWIFT transaction data (SWIFT has applied normal and uniform distributions).

Dataset 1 details

Dataset 1 contains the following fields:

  • MessageId - Globally unique identifier within this dataset for individual transactions
  • UETR - The Unique End-to-end Transaction Reference—a 36-character string enabling traceability of all individual transactions associated with a single end-to-end transaction
  • TransactionReference - Unique identifier for an individual transaction
  • Timestamp - Time at which the individual transaction was initiated
  • Sender - Institution (bank) initiating/sending the individual transaction
  • Receiver - Institution (bank) receiving the individual transaction
  • OrderingAccount - Account identifier for the originating ordering entity (individual or organization) for end-to-end transaction,
  • OrderingName - Name for the originating ordering entity
  • OrderingStreet - Street address for the originating ordering entity
  • OrderingCountryCityZip - Remaining address details for the originating ordering entity
  • BeneficiaryAccount - Account identifier for the final beneficiary entity (individual or organization) for end-to-end transaction
  • BeneficiaryName - Name for the final beneficiary entity
  • BeneficiaryStreet - Street address for the final beneficiary entity
  • BeneficiaryCountryCityZip - Remaining address details for the final beneficiary entity
  • SettlementDate - Date the individual transaction was settled
  • SettlementCurrency - Currency used for transaction
  • SettlementAmount - Value of the transaction net of fees/transfer charges/forex
  • InstructedCurrency - Currency of the individual transaction as instructed to be paid by the Sender
  • InstructedAmount - Value of the individual transaction as instructed to be paid by the Sender
  • Label - Boolean indicator of whether the transaction is anomalous or not. This is the target variable for the prediction task.

End-to-end transactions

Each row in this dataset is an individual transaction, representing a payment from a sender bank to a receiver bank. An end-to-end transaction is a transaction from an originating ordering entity (a.k.a. ultimate debtor) to a final beneficiary entity (a.k.a. ultimate creditor) and may involve one or more individual transactions. The end-to-end transaction is one individual transaction in the case where the originating orderer's bank sends payment directly to the final beneficiary's bank. However, it may be the case where the payment is not directly sent, but is instead routed through one or more intermediary banks. In such a case, there are multiple individual transactions belonging to the single end-to-end transaction, with each individual transaction representing a bank-to-bank payment. Each end-to-end transaction is uniquely identified by the UETR field. In the case of a sequence of multiple individual transactions for one end-to-end transaction, all individual transactions share a value for UETR, and the Sender and Receiver banks form a chain from the originating ordering bank through one or more intermediary banks to the final beneficiary bank.

Because each end-to-end transaction is defined by one originating orderer and one final beneficiary, this means the Ordering* columns for the orderer and Beneficiary* columns for the beneficiary have been included in this dataset in a denormalized fashion—the values are duplicated across all the individual transactions (rows) belonging to the same end-to-end transaction. Additionally, this means that the OrderingAccount and BeneficiaryAccount in a given row may not necessarily belong to the bank in that row's Sender and the bank in that row's Receiver, respectively. The correct way to associate an OrderingAccount to the correct bank is to identify the Sender bank in the originating (first) individual transaction in that end-to-end transaction, and the correct way to associate a BeneficiaryAccount to the correct bank is to identify the Receiver bank in the final (last) individual transaction in that end-to-end transaction.

MessageId UETR Sender Receiver OrderingAccount BeneficiaryAccount ...
... ... ... ... ... ... ...
10 00012345-... A B 111 222 ...
11 00012345-... B C 111 222
12 00012345-... C D 111 222
... ... ... ... ... ... ...

Illustrative example showing the how to associate the originating orderer and final beneficiary information with the correct banks for one end-to-end transaction made up of three individual transactions. The orderer and beneficiary account information is duplicated across all rows in this group, and the sender and receiver banks form a chain. The bank and account information of the originating orderer is highlighted in blue, and the bank and account information for the final beneficiary is highlighted in yellow.

Dataset 2: Account data held by banks

Participants were provided access to account-related data representative of that held by banks. This dataset contains account-level information, including flags signaling whether the account is valid, suspended, etc.

Data can be linked using the Account field in the bank data and the OrderingAccount or BeneficiaryAccount in the SWIFT transaction data. Please see the previous section on end-to-end transactions for details on how to identify which bank an OrderingAccount or BeneficiaryAccount should be linked to.

Note that bank nodes will not have access to data on the SWIFT node and vice-versa—a case of vertical data partitioning. It is up to you to determine how to exchange this information in a secure and private way.

Dataset 2 details

Dataset 2 will contain the following fields:

  • Bank - Identifier for the bank
  • Account - Identifier for the account
  • Name - Name of the account
  • Street - Street address associated with the account
  • CountryCityZip - Remaining address details associated with the account
  • Flags - Enumerated data type indicating potential issues or special features that have been associated with an account. Flag definitions are provided below:
    • 00 - No flags
    • 01 - Account closed
    • 03 - Account recently opened
    • 04 - Name mismatch
    • 05 - Account under monitoring
    • 06 - Account suspended
    • 07 - Account frozen
    • 08 - Non-transaction amount
    • 09 - Beneficiary deceased
    • 10 - Invalid company ID
    • 11 - Invalid individual ID

Note that this dataset is provided unpartitioned, with all banks' data in one table.

Note that the flags may not be representative of real-world practices. For example, in the real world, banks may use different flags and may interpret or weight them differently based on appetite for risk.

Development and Evaluation Data

The datasets being provided are intended for local development use in both Phase 1 and Phase 2. The transaction dataset has been split in time–the bulk of the dataset is the training set, and the final week of the dataset is a test set. The prediction task, as detailed in a later section, is to predict a confidence score for each individual transaction in the test set as to whether it is an anomalous transaction. The ground truth is provided for both the training and set sets in the development dataset.

In Phase 2, a separate and held-out dataset will be used for solution evaluation. Some aspects of the Phase 2 evaluation data may be changed that should be learnable by your model. You will submit code for your solution to a code execution environment. The code execution runtime will run cold-start federated training on the new dataset's training split and then run inference to generate predictions for the new dataset's test split. Your solution's performance will be measured by evaluating its predictions against the ground truth for the new dataset's test split.

New Development Dataset

New development dataset published December 3, 2022.

A new development dataset has been provided by our data partners at SWIFT and is available on the data download page, indicated by the [NEW] tag. This new development dataset is a synthetic dataset that is largely similar to the previously-provided synthetic development dataset (also still available on the data download page), but has some differences that better represent what is observed in the real world. You should expect that the evaluation dataset (held out and will never be made available) will have similar distributions as this new dataset. Local development and self-reported results in the technical paper should primarily use the new development dataset.

Threat Profile


You will design and develop end-to-end solutions that preserve privacy across a range of possible threats and attack scenarios, through all stages of the machine learning model lifecycle. You should therefore carefully consider the overall privacy of your solution, focusing on the protection of sensitive information held by all parties involved in the federated learning scenario. The solutions you design and develop should include comprehensive measures to address the threat profiles described below. These measures will provide an appropriate degree of resilience to a wide range of potential attacks defined within the threat profile. For more information on threat profiles, please visit the privacy threat profile section of the problem description.

Scope of sensitive data

Your solution must prevent the unintended disclosure of

  • a) sensitive information in the SWIFT transaction dataset
  • b) sensitive information in the bank dataset, to any other party, including other insider stakeholders (for example, SWIFT and other financial institutions) and outsiders.

The sensitive information for the SWIFT dataset is all personally-identifiable information about the originating orderer (a.k.a. ultimate debtor) and final beneficiary (a.k.a. ultimate creditor) parties, including personal details like names and addresses, and group membership information. This includes but is not limited to the raw private data about the orderer/beneficiary stored directly in the account number, name, and address fields, and the transaction identifiers and timestamps.

The sensitive information for the bank datasets include all personally identifiable information about parties involved in the transactions, including names and addresses and group membership information. This includes but is not limited to the raw private / business data reflected in account numbers/names/addresses and flags.

Anomaly detection models


The analytical objective is to train a model that enables SWIFT to identify anomalous transactions. In the context of this challenge, this is a classification model to be trained on provided training data with ground truth labels. In real-world deployments, such transactions might be subject to additional verification actions or flagged for further investigation, dependent on context.

Prediction Target and Evaluation Metric

The target variable for the modeling task is a confidence score (between 0.0 and 1.0) for whether each individual transaction is anomalous. As discussed previously, anomalous is not precisely defined and should be learned by your model via supervised learning on provided training data.

The evaluation metric will be Area Under the Precision–Recall Curve (AUPRC), also known as average precision (AP), PR-AUC, or AUCPR. This is a commonly used metric for binary classification that summarizes model performance across all operating thresholds. This metric rewards models which can consistently assign anomalous transactions with a higher confidence score than negative non-anomalous transactions. AUPRC is computed as follows:

$$ \text{AUPRC} = \sum_n (R_n - R_{n-1}) P_n $$

where Pn and Rn are the precision and recall, respectively, when thresholding at the nth individual transaction sorted in order of increasing recall.

Partitioned Datasets for Federated Learning

Partitioning Overview

A number of banks are working with SWIFT to collaboratively train a model. The parties are working jointly to do this, and can take a common approach to technical design, infrastructure, etc., but are not able to enable access to each other's data. In the real world there are a number of barriers that might prevent this; banks are subject to a variety of privacy, competition and financial industry regulations, may be operating in different jurisdictions, and have legitimate commercial and ethical reasons for not sharing customer data with competitors.

The key task of this challenge is to design a privacy solution so that SWIFT can safely train and deploy such a model without compromising the privacy requirements (more details on the requirements, and an associated threat model, are described in the problem description).

Diagram comparing a centralized model with a federated model

Above: Diagram comparing the centralized model (MC) with a privacy-preserving federated learning model (MPF).

Partitioning Details

This use case features both vertical and horizontal data partitioning. Data is vertically partitioned between Dataset 1 (SWIFT) and Dataset 2 (partner banks), and it is horizontally partitioned within Dataset 2 (between each partner bank).

For local development, you were provided a full, unpartitioned dataset. In Phase 2 evaluation, evaluation will occur with predetermined partitioning along institutional boundaries. The SWIFT data will always belong a single federation unit who represents the SWIFT Data Store and only has access to the SWIFT data. Banks will be split up among federation units such that one bank's account data entirely belongs within one partition. In cases where there are fewer bank partitions than the number of banks, each bank partition may contain data from more than one bank.

Any partitioning of the data that you might perform in your local development experiments should take this into account. Your solution should be able to handle any number of bank partitions, and in Phase 2, we may evaluate your solution with a number of bank partitions between 1 and 10.

Key Task

For the purposes of the challenge, you should demonstrate your solution by training two models:

  • \(M_C\) = a centralized model trained on datasets 1 and 2 in a non-privacy preserving way
  • \(M_{PF}\) = a privacy-preserving federated model trained using your privacy solution

Example Centralized Baseline

In Phase 1, SWIFT provided sample Python code for training a centralized anomaly detection model (\(M_C\)) on the ISO20022 training data. This code snippet used the dataset provided as input and trained a simple anomaly detection model using the XGBoost Classifier.

The core of the evaluation will be assessing the comparison between a centralized model \(M_C\), and an alternative model \(M_{PF}\) that combines a federated learning approach with innovative privacy-preserving techniques.

In the real world, SWIFT may wish to train a model collaboratively with a number of banks, in order to increase the volume and variety of data being used to train the model. You should therefore aim to develop scalable solutions that enable additional nodes to be integrated into the federated network whilst incurring an acceptable additional performance overhead.

The federated learning scenario thus consists of one node hosting the SWIFT dataset, and N nodes hosting bank data. We may evaluate solutions for values of N between 1 and 10, in order to assess how well solutions scale as more banks are added to the network. During solution development, you may have full autonomy over how you partition the bank dataset in order to understand the scalability of your solution.

Full details on evaluation criteria can be found in the problem description.

For additional reference, here is a technical brief for this use case track provided through the U.K. challenge. This brief has been assembled in collaboration by the U.K. and U.S. challenge organizers. This may help to give a sense for the use case and capabilities expected, though note that details in the brief may not match exactly how the U.S. challenge will operate.

Good luck


Good luck and enjoy this problem! For more details on submission and evaluation, visit the problem description page. If you have any questions, you can always ask the community by visiting the DrivenData user forum or the cross-U.S.–U.K. public Slack channel. You can request access to the Slack channel here.