Genetic Engineering Attribution Challenge

Genetic engineering is a powerful tool that demands responsible use. Your goal is to create an algorithm that identifies the lab-of-origin for genetically engineered DNA with the highest possible accuracy.

$60,000 in prizes
October 2020
1,205 joined

Innovation Track

How would your approach perform in the real world?


Introduction

In the Innovation Track, we challenge you to bridge the gap between your work in the Prediction Track and real-world attribution problems in academia, industry, and forensics. To succeed, submissions should demonstrate how their lab-of-origin prediction models excel in domains beyond raw accuracy. Submissions will be evaluated by a panel of judges, including world leaders in synthetic biology, data science, and biosecurity policy. Winners will receive prizes equal to those provided for the Prediction Track.

How to compete – beat the benchmark

Any model that achieves a higher top-10 accuracy score than our BLAST benchmark will qualify for entry to the Innovation Track. Teams that exceed this threshold (i.e. score above 75.6% top-10 accuracy on the private leaderboard) will automatically be invited to submit a report for assessment by our panel of judges (see “Assessment”, below).
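
For concreteness, top-10 accuracy is the fraction of sequences whose true lab-of-origin appears among a model’s ten highest-ranked labs. The following is a minimal sketch of the metric, assuming predictions are available as a probability matrix; the variable names are illustrative and not part of any competition code.

    import numpy as np

    def top_k_accuracy(probs, labels, k=10):
        """Fraction of samples whose true lab is among the model's top-k predictions.

        probs  : (n_samples, n_labs) array of predicted lab-of-origin probabilities
        labels : (n_samples,) array of integer indices of each sample's true lab
        """
        top_k = np.argsort(probs, axis=1)[:, -k:]          # k highest-scoring labs per sample
        hits = np.any(top_k == labels[:, None], axis=1)    # is the true lab in the top k?
        return hits.mean()

    # A qualifying model needs top_k_accuracy(probs, labels, k=10) > 0.756
    # on the private leaderboard (the BLAST benchmark).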

If a large number of teams successfully beat the benchmark and make submissions, altLabs and our partners will perform pre-screening, based on the criteria below, to select which submissions to pass to our judges.

How to win

To succeed in the Innovation Track, you will need to convince experts from a variety of fields that your submission represents valuable progress in solving real-world attribution problems. To do this, you will need to demonstrate – in plain language – that your approach is impressive beyond raw lab-of-origin accuracy. This task is fully open-ended: you can write about whatever aspects of your model you think will most impress the judges. Some examples of important criteria you might want to address include:

  • Calibration: Does your model accurately report its uncertainty about a sample’s lab-of-origin? (One common diagnostic is sketched below.)
  • Robustness: Is your model robust to outliers, changes in the dataset, and unseen labs?
  • Interpretability & biological groundedness: Explain how your model makes its predictions in a way that would be useful to a biologist or forensic scientist. Do the features it considers important make biological sense? How might you go about integrating the outputs of your model with other biological data sources?
  • Responsible development: Does your approach thoughtfully address the practical and societal challenges which might arise in real-world applications? Have your design decisions been made with diverse stakeholders in mind? How might you continue the responsible development of your approach in the future?

This list is not exhaustive; we expect a wide variety of approaches to do well in the Innovation Track. The key is to demonstrate that your model brings us closer to concrete application.
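
As a concrete illustration of the calibration criterion above, one standard check is to compare the model’s reported confidence against its observed accuracy, for example via expected calibration error. The sketch below assumes predictions are available as a probability matrix; it is one possible diagnostic, not a required method.

    import numpy as np

    def expected_calibration_error(probs, labels, n_bins=10):
        """Average gap between reported confidence and observed accuracy.

        probs  : (n_samples, n_labs) predicted lab-of-origin probabilities
        labels : (n_samples,) true lab indices
        """
        confidences = probs.max(axis=1)            # model's reported confidence per sample
        correct = probs.argmax(axis=1) == labels   # was the top-1 prediction right?
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (confidences > lo) & (confidences <= hi)
            if in_bin.any():
                gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
                ece += in_bin.mean() * gap         # weight each bin by its share of samples
        return ece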

Note on data use: External data may be used in the Innovation Track for the purposes of showcasing the capabilities of the model post-training. External data may not be used for model training or for any use in the Prediction Track.

If you need more inspiration, we’ve provided three example scenarios below. We emphasize that these are just examples; you do not need to address these specific scenarios to succeed.

Examples

Example 1: Detecting genetic plagiarism in a multi-source plasmid

The International Genetically Engineered Machine (iGEM) competition is the world’s premier student competition in synthetic biology, in which teams of students work together to design, build, test, and measure systems of their own design using interchangeable biological parts. As part of their competition submissions, iGEM teams must provide detailed information regarding the origin of ideas and components they use. With thousands of submissions to review, detecting and investigating misattributed ideas is becoming a challenge for iGEM.

As part of their submission, one iGEM team presents composite parts containing sequences derived from multiple different sources. Some of these sequences come from the iGEM parts registry, while others are claimed to be of the team’s own design. Could your model identify which components of each plasmid come from which sources, and thus help assess whether all of the team’s work has been correctly attributed?
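
One possible way to approach this scenario, offered purely as an illustration rather than a prescribed method, is to slide a window along the plasmid and attribute each window independently. The sketch below assumes a trained model exposing a hypothetical predict_proba(sequence) method that returns a score per candidate source; the window and step sizes are likewise arbitrary.

    def attribute_segments(plasmid_seq, model, window=500, step=250):
        """Assign each window of a plasmid sequence to its most likely source.

        plasmid_seq : string of DNA bases
        model       : object with a (hypothetical) predict_proba(seq) -> {source: score}
        """
        segments = []
        for start in range(0, max(1, len(plasmid_seq) - window + 1), step):
            subseq = plasmid_seq[start:start + window]
            scores = model.predict_proba(subseq)    # e.g. {"iGEM_registry_part": 0.7, ...}
            best = max(scores, key=scores.get)
            segments.append((start, start + window, best, scores[best]))
        # Adjacent windows attributed to the same source can then be merged into segments.
        return segments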

Example 2: Accidental release

A sewage treatment plant notices a strange, unsettling green glow in the...sludge. They send a sample to a local microbiology lab, where it’s classified as E. coli, probably the most common laboratory microbe. While harmless, the bacteria have been genetically engineered to express green fluorescent protein: someone is being sloppy with biowaste disposal.

Authorities in the city turn to you to find out which lab is responsible; however, many of the labs in the city are not in your database. In the event that a precise identification is not possible, could your model provide any partial information that would help human investigators narrow the search?
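
As one illustration of what “partial information” might look like, a model could report a calibrated shortlist of the closest known labs and flag cases where little probability mass falls on any of them, suggesting an out-of-database source. The interface below is hypothetical, and the threshold value is arbitrary and would need calibration.

    import numpy as np

    def narrow_search(probs, lab_names, shortlist_size=10, known_mass_threshold=0.5):
        """Return a shortlist of candidate labs, or flag a likely out-of-database lab.

        probs     : (n_labs,) predicted probabilities for one sample
        lab_names : list of the n_labs known lab identifiers
        """
        order = np.argsort(probs)[::-1][:shortlist_size]
        shortlist = [(lab_names[i], float(probs[i])) for i in order]
        if probs[order].sum() < known_mass_threshold:
            # Little mass on any known lab: likely an unseen lab, but the nearest
            # known labs may still share techniques or parts with the true source.
            return {"likely_unseen_lab": True, "closest_known_labs": shortlist}
        return {"likely_unseen_lab": False, "candidates": shortlist}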

Example 3: Synthesis Screening

A (fictional) DNA synthesis company has a serious problem with fraud: it offers large discounts to academic institutions, but suspects that many private startups are exploiting this through academic contacts, potentially costing the company millions of dollars in revenue. Given the immense volume of orders, it is impossible to manually inspect each sequence to verify its purported lab-of-origin. So, the company has turned to attribution models to automate verification.

To be useful and economical in this context, an attribution model must overcome certain challenges: among others, it must be very computationally cheap to run, and perform well given only DNA sequence fragments (no phenotype information). Most importantly, to minimize disruption to legitimate business, it should have a very low false-positive rate, while maintaining an acceptably low rate of false negatives. Could your model fulfill these criteria?
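
To make the false-positive requirement concrete, one simple way to control it is to set the decision threshold from the score distribution on known-legitimate orders. The sketch below assumes the model emits a scalar fraud score per order; the 0.1% target is illustrative only.

    import numpy as np

    def threshold_for_target_fpr(scores_on_legitimate_orders, target_fpr=0.001):
        """Choose a flagging threshold so at most `target_fpr` of legitimate orders are flagged.

        scores_on_legitimate_orders : fraud scores on a held-out set of genuine academic orders
        """
        # Flag an order only when its score exceeds the (1 - target_fpr) quantile of
        # scores observed on legitimate traffic; the false-negative rate at this
        # threshold should then be measured separately on known-fraudulent examples.
        return float(np.quantile(scores_on_legitimate_orders, 1.0 - target_fpr))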

Assessment

At the close of the Prediction Track on October 19th, any team whose top-10 accuracy exceeds the BLAST benchmark on the private leaderboard will be invited to submit a report on their approach. This report should be at most four pages long, with at most two figures, and should:

  • Demonstrate a novel and creative approach to genetic engineering attribution;
  • Demonstrate what capabilities of their model, other than raw accuracy, would make it useful for solving realistic attribution problems;
  • Show thoughtful consideration of the societal impact and use of attribution models;
  • Discuss the limitations of the model, and how further work might improve it;
  • Be comprehensible to both technical and non-technical readers.

Assessment of submissions to the Innovation Track will be conducted by a diverse panel of distinguished judges, including top experts in synthetic biology, data science, and biosecurity policy. Each submission will be assessed by multiple judges. As such, to succeed in the Innovation Track, your report should be comprehensible to both technical and non-technical readers. Innovation Track prize-winners will be announced alongside those for the Prediction Track.

Judges

The judging panel for the Innovation Track will comprise top experts from the fields of synthetic biology, biosecurity, and data science, including:

  • George Church – Professor of Genetics, Harvard University
  • Tom Inglesby – Director, Johns Hopkins Center for Health Security
  • Nancy Connell – Senior Scholar, Professor, Johns Hopkins Center for Health Security
  • Gigi Gronvall – Senior Scholar, Associate Professor, Johns Hopkins Center for Health Security
  • Piers Millett – Vice President for Safety and Security, iGEM
  • Drew Endy – Associate Professor of Bioengineering, Stanford University
  • Kevin Esvelt – Director, Sculpting Evolution Group, MIT Media Lab
  • David Relman – Professor of Microbiology and Immunology, Stanford University
  • Rosie Campbell – Program Lead, Partnership on AI
  • James Diggans – Head of Biosecurity, Twist Bioscience
  • Megan Palmer – Executive Director, Bio Policy & Leadership Initiatives, Stanford University
  • Jason Matheny – Director, Center for Security and Emerging Technology
  • Joanna Lipinski – Director of Bioinformatics and Data Science, Twist Bioscience
  • Carroll Wainwright – Research Scientist, Partnership on AI
  • Jonathan Huggins – Assistant Professor, Department of Mathematics & Statistics, Boston University
  • Adam Marblestone – Senior Fellow, Federation of American Scientists (FAS)
  • Natalie Ma – Director of Business Development, Felix Biotechnology
  • Gregory Koblentz – Associate Professor & Director of Biodefense Graduate Programs, George Mason University
  • Dylan George – Vice President, Technical Staff, In-Q-Tel

FAQ

Who can submit to the Innovation Track? Any team that beats our BLAST benchmark on the Prediction Track can submit a report to the Innovation Track.

What is the exact benchmark value we need to beat? The BLAST benchmark for top-10 accuracy is 78.8% on the public leaderboard and 75.6% on the private leaderboard. To qualify for the Innovation Track, your model should score better than 75.6% on the private leaderboard.

What if we have multiple submissions to the Prediction Track? Each team may make at most one submission to the Innovation Track. If your team makes multiple submissions, you must select one to discuss in your Innovation Track entry.

Nobody on our team is a biologist! Can we compete? Any team that beats the BLAST benchmark in the Prediction Track is welcome to compete in the Innovation Track. Collaborating with biology or policy specialists may be helpful for this track, but is not required.

How long will we have to prepare our report? Following closure of the Prediction Track on October 19th 2020, notified participants will have up to two weeks to prepare their submissions to the Innovation Track. However, teams that are confident they’ve beaten the benchmark are strongly encouraged to prepare their submissions well in advance. The submission window for the Innovation Track will close at 11:59pm (UTC) on November 1st, 2020.

What should I submit? Participants in the Innovation Track should submit the following:

  • The code for your lab-of-origin prediction model. This should be the same model you used in your submission to the Prediction Track.
  • Your report, which should explain in plain language how your model fulfills the criteria outlined above.
  • Code and data (see below) for generating any figures you include in your report.

The code and data files do not count towards the four-page limit for the report.

How should the report be formatted? Reports submitted to the Innovation Track should be written in English, in PDF format, at most four pages long (Letter or A4), and with at most two figures. There is no minimum length requirement. To enable blind assessment of submissions, your report should not include the names of your team or team members. There are no other detailed formatting requirements, but the report should conform to the norms of a good scientific paper: intellectual contributions from outside your team should be acknowledged and cited, and methods should be reported transparently.

What data can I use for the model I submit to the Innovation Track? The model you submit to the Innovation Track should be the same as the corresponding submission to the Prediction Track. As such, the model should be trained only on the dataset provided for this competition. External data may also be used in the Innovation Track for the purposes of showcasing the capabilities of the model post-training. As with the Prediction Track, finalists for the Innovation Track will have their model performance validated against an out-of-sample verification set; teams judged to have violated rules regarding data usage will be disqualified.

What data can I use for the figures in my Innovation Track report? While the model submitted to the Innovation Track should be trained only on the data provided for the Prediction Track, you are welcome to use other data to illustrate the capabilities of that model in your Innovation Track report. For example, if your team submits a report based on Example 1 above, you would be welcome to acquire or design some multi-source plasmids to demonstrate your model’s ability to distinguish subsequences from different sources. Any additional data you use in this way must be included in your submission to the Innovation Track, alongside the code for your model and figures.

Who owns the intellectual property for my Innovation Track submission? What happens to my model after the competition? As with the Prediction Track, the IP for all prize-winning submissions will be assigned to altLabs upon receipt of prize money. After the competition, altLabs will seek input from various stakeholders – including prizewinning teams – on how best to use these results to promote responsible innovation. altLabs is a non-profit organization, and will never sell or otherwise monetize prize-winning submissions. The IP for submissions that do not win prizes remains with their respective teams.


Good luck!

Good luck and enjoy this problem! If you have any questions, you can always visit the user forum!