Genetic Engineering Attribution Challenge

Genetic engineering is a powerful tool that demands responsible use. Your goal is to create an algorithm that identifies the lab of origin for genetically engineered DNA as accurately as possible.

$60,000 in prizes
oct 2020
1,206 joined

Problem description

Genetic engineering attribution is the process of identifying the source of a genetically engineered piece of DNA. Reducing anonymity can encourage more responsible behavior within scientific and entrepreneurial communities, without stifling innovation. The objective of this competition is to predict the laboratory of origin for plasmid DNA sequences.

Overview


This competition will include two tracks.

Stage 1: Prediction Track

Your goal in this track is to predict the laboratory of origin for a given plasmid DNA sequence. Your model should output a probability for each of the 1,314 possible labs. More information on the dataset, performance metric, and submission format is provided below.

Finalists and runners-up will be determined by the private leaderboard. These participants will then have the opportunity to submit their code to be audited using an out-of-sample verification set. The top four teams that pass this final check will be awarded prizes for this track.

Stage 2: Innovation Track

Following the closing date for submissions to the Prediction Track, teams whose models achieve higher top-10 accuracy than the BLAST benchmark on the private leaderboard will be invited to submit brief reports describing how their approach addresses attribution problems in the real world. The final prizes will be awarded to the top four reports selected by a panel of expert judges. For a complete description see the Innovation Track challenge page.

Note: In this competition, the private test set is larger than the public test set and contains more labs. As a result, most models achieve slightly lower performance on the private test set. The altLabs BLAST benchmark, for example, achieves 78.8% on the public leaderboard, but only 75.6% on the private leaderboard. The size of the difference in performance varies between models; in general, you should expect your model to do a few percentage points worse on the private leaderboard.


Submissions open to Prediction Track
Prediction Track verification
Innovation Track
Participants use the competition dataset to build models and make submissions. Top-10 accuracy scores on the public test set are displayed on the public leaderboard. Top finalists on the private leaderboard submit their code for execution on an out-of-sample verification set. Participants who beat the BLAST benchmark have the opportunity to submit brief reports on their approaches, which are evaluated by an expert judging panel.


The features in this dataset


This dataset consists of DNA sequences from genetically engineered plasmids. Plasmids are small, circular DNA molecules that replicate independently from chromosomes.

There are 41 columns in this dataset. Each row corresponds to a plasmid DNA sequence, which is uniquely identified by sequence_id, a 5-character alphanumeric string. In addition to the DNA sequences provided in sequence, there are 39 binary features that provide metadata about the plasmids. All variables are described below.

  • sequence (type: string): A plasmid DNA sequence. Any Us were changed to Ts and letters other than A, T, G, C, or N were changed to Ns. Possible values: A, T, G, C, or N
  • bacterial_resistance_ampicillin, bacterial_resistance_chloramphenicol, bacterial_resistance_kanamycin, bacterial_resistance_other, bacterial_resistance_spectinomycin (type: binary): One-hot encoded columns that indicate the antibiotic resistance of the plasmid used for selecting during bacterial growth and cloning.
  • copy_number_high_copy, copy_number_low_copy, copy_number_unknown (type: binary): One-hot encoded columns that indicate the number of plasmids per bacterial cell.
  • growth_strain_ccdb_survival, growth_strain_dh10b, growth_strain_dh5alpha, growth_strain_neb_stable, growth_strain_other, growth_strain_stbl3, growth_strain_top10, growth_strain_xl1_blue (type: binary): One-hot encoded columns that indicate the strain used to clone the plasmid.
  • growth_temp_30, growth_temp_37, growth_temp_other (type: binary): One-hot encoded columns that indicate the temperature the plasmid should be grown at.
  • selectable_markers_blasticidin, selectable_markers_his3, selectable_markers_hygromycin, selectable_markers_leu2, selectable_markers_neomycin, selectable_markers_other, selectable_markers_puromycin, selectable_markers_trp1, selectable_markers_ura3, selectable_markers_zeocin (type: binary): One-hot encoded columns that indicate genes that allow non-bacterial selection (for a plasmid used outside of the cloning organism).
  • species_budding_yeast, species_fly, species_human, species_mouse, species_mustard_weed, species_nematode, species_other, species_rat, species_synthetic, species_zebrafish (type: binary): One-hot encoded columns that indicate the species the plasmid is used in, after cloning.
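
The cleaning rule described for the sequence column can be sketched as below. This is only an illustration, not the organizers' preprocessing code: the function name normalize_sequence is hypothetical, and upper-casing the input and mapping non-letter characters to N are assumptions the description does not cover.

```python
import re


def normalize_sequence(raw: str) -> str:
    """Apply the cleaning rule described for the `sequence` column.

    Assumptions (not stated in the dataset description): the input is
    upper-cased first, and non-letter characters are also mapped to N.
    """
    seq = raw.upper().replace("U", "T")   # any U becomes T
    return re.sub(r"[^ATGCN]", "N", seq)  # everything else becomes N


# "U" is an RNA base and "R" is an ambiguity code; both get rewritten.
print(normalize_sequence("GGCUGAAGCRCTG"))  # GGCTGAAGCNCTG
```

The provided sequences have already been cleaned this way, so a helper like this is mainly useful if you bring in external sequences and want them to share the same alphabet.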

Feature data example


For example, a single row in the dataset has these values:

field value
sequence GGCTGAAGCACTGCACGAAT
bacterial_resistance_ampicillin 1
bacterial_resistance_kanamycin 0
bacterial_resistance_spectinomycin 0
bacterial_resistance_chloramphenicol 0
bacterial_resistance_other 0
copy_number_high_copy 0
copy_number_unknown 1
copy_number_low_copy 0
growth_strain_dh5alpha 1
growth_strain_neb_stable 0
growth_strain_top10 0
growth_strain_stbl3 0
growth_strain_xl1_blue 0
growth_strain_dh10b 0
growth_strain_ccdb_survival 0
growth_strain_other 0
growth_temp_37 1
growth_temp_30 0
growth_temp_other 0
selectable_markers_neomycin 0
selectable_markers_puromycin 0
selectable_markers_hygromycin 0
selectable_markers_ura3 0
selectable_markers_blasticidin 0
selectable_markers_zeocin 0
selectable_markers_leu2 0
selectable_markers_trp1 0
selectable_markers_his3 0
selectable_markers_other 0
species_human 0
species_synthetic 1
species_mouse 0
species_fly 0
species_budding_yeast 0
species_zebrafish 0
species_rat 0
species_mustard_weed 0
species_nematode 0
species_other 0
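
To read a row like this at a glance, the binary columns can be grouped by their prefix and reduced to the values flagged with 1. The sketch below assumes a row is available as a plain dictionary; the helper name active_values is hypothetical, and the dictionary holds only a subset of the example row's columns. Note that a group can be empty, as selectable_markers is in the example row.

```python
PREFIXES = (
    "bacterial_resistance", "copy_number", "growth_strain",
    "growth_temp", "selectable_markers", "species",
)


def active_values(row: dict) -> dict:
    """Map each feature group to the list of values flagged with 1."""
    groups = {prefix: [] for prefix in PREFIXES}
    for column, flag in row.items():
        for prefix in PREFIXES:
            if column.startswith(prefix + "_") and int(flag) == 1:
                # Strip the "<prefix>_" part to keep just the value name.
                groups[prefix].append(column[len(prefix) + 1:])
    return groups


# Subset of the example row above.
row = {
    "bacterial_resistance_ampicillin": 1,
    "bacterial_resistance_kanamycin": 0,
    "copy_number_unknown": 1,
    "growth_strain_dh5alpha": 1,
    "growth_temp_37": 1,
    "selectable_markers_neomycin": 0,
    "species_synthetic": 1,
    "species_human": 0,
}
print(active_values(row))
```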

Labels


The labels identify the lab of origin for each DNA sequence. They are one-hot encoded, meaning there is a column for each lab ID. The correct lab of origin for each sequence_id is indicated with a 1.0, and the rest of the values in the row are 0.0. You can find an example of how to collapse these labels into two columns, sequence_id and lab_id, in the benchmark.
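
As a rough sketch of that collapse (not the benchmark's own code), the snippet below uses only the standard library. The function name collapse_labels is hypothetical, and the tiny in-memory table is made-up toy data that borrows a few IDs from the submission example for illustration.

```python
import csv
import io

# Tiny stand-in for the labels file; the real file has one column per lab ID.
# The sequence-to-lab pairings here are invented for the example.
labels_csv = io.StringIO(
    "sequence_id,00Q4V31T,012VT4JK,ZZJVE4HO\n"
    "E0VFT,0.0,1.0,0.0\n"
    "TTRK5,0.0,0.0,1.0\n"
)


def collapse_labels(handle) -> dict:
    """Return {sequence_id: lab_id} from a one-hot encoded labels table."""
    collapsed = {}
    for record in csv.DictReader(handle):
        sequence_id = record.pop("sequence_id")
        # The lab of origin is the single column holding 1.0.
        hot = [lab for lab, value in record.items() if float(value) == 1.0]
        if len(hot) != 1:
            raise ValueError(f"{sequence_id}: expected one lab, got {len(hot)}")
        collapsed[sequence_id] = hot[0]
    return collapsed


lab_of = collapse_labels(labels_csv)
print(lab_of)
```

If you work in pandas instead, calling idxmax(axis=1) on the label columns should give the same mapping.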

Performance metric


Performance will be evaluated using top-10 accuracy. This metric measures how often the true lab of origin appears among the ten labs with the highest predicted probabilities for each sequence. The submission with the highest top-10 accuracy score will top the leaderboard.
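
A minimal sketch of this metric is shown below, so you can score a validation split locally. It is not the official scoring code: the function name top_k_accuracy and the four-lab toy data are made up for illustration, and how the official scorer breaks ties between equal probabilities is not stated (here, earlier columns win).

```python
def top_k_accuracy(probabilities, true_labs, lab_ids, k=10):
    """Fraction of rows whose true lab is among the k most probable labs."""
    hits = 0
    for row, truth in zip(probabilities, true_labs):
        # Column indices sorted from highest to lowest predicted probability.
        ranked = sorted(range(len(lab_ids)), key=lambda i: row[i], reverse=True)
        top_k = {lab_ids[i] for i in ranked[:k]}
        hits += truth in top_k
    return hits / len(true_labs)


# Toy data with four labs; k=2 plays the role that k=10 plays in the contest.
lab_ids = ["A", "B", "C", "D"]
probabilities = [
    [0.10, 0.60, 0.25, 0.05],  # true lab "C" is ranked 2nd: a hit for k=2
    [0.70, 0.15, 0.10, 0.05],  # true lab "D" is ranked 4th: a miss for k=2
]
true_labs = ["C", "D"]

print(top_k_accuracy(probabilities, true_labs, lab_ids, k=2))  # 0.5
```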

Submission format


The format for the submission file consists of an index column called sequence_id and a column for each possible lab ID. Your predictions should be the probability of each lab being the true lab-of-origin for a given sequence. Note that there is only one true lab of origin for each sequence.

Lab ID is an 8-character alphanumeric string (not to be confused with the sequence_id, a 5-character string). The probabilities should be floats between 0.0 and 1.0. The final submission CSV should have 18,817 lines (including the header) and 1,315 columns (including sequence_id).

For example, if you predicted...

sequence_id 00Q4V31T 012VT4JK ... ZX06ZDZN ZZJVE4HO
E0VFT 0.0 0.0 ... 0.0 0.0
TTRK5 0.0 0.0 ... 0.0 0.0
2Z7FZ 0.0 0.0 ... 0.0 0.0
VJI6E 0.0 0.0 ... 0.0 0.0
721FI 0.0 0.0 ... 0.0 0.0

The .csv file that you submit would look like:

sequence_id,00Q4V31T,012VT4JK,...,ZX06ZDZN,ZZJVE4HO
E0VFT,0.0,0.0,...,0.0,0.0
TTRK5,0.0,0.0,...,0.0,0.0
2Z7FZ,0.0,0.0,...,0.0,0.0
VJI6E,0.0,0.0,...,0.0,0.0
721FI,0.0,0.0,...,0.0,0.0

Of course, your submission would contain actual probabilities for each lab ID, not just 0.0.
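
Before uploading, it can help to check a file against the format rules above. The sketch below is one way to do that with the standard library; the function name check_submission is hypothetical, and the defaults simply restate the numbers given above (1,314 labs plus sequence_id, and 18,817 lines minus the header). It does not check that each row sums to 1, since the rules above do not state that requirement.

```python
import csv
import io


def check_submission(handle, n_labs=1314, n_sequences=18816):
    """Raise ValueError if the file breaks the format rules stated above."""
    reader = csv.reader(handle)
    header = next(reader)
    if header[0] != "sequence_id":
        raise ValueError("first column must be sequence_id")
    if len(header) != n_labs + 1:
        raise ValueError(f"expected {n_labs + 1} columns, got {len(header)}")
    n_rows = 0
    for row in reader:
        if not row:  # ignore blank lines, e.g. a trailing newline
            continue
        if len(row) != len(header):
            raise ValueError(f"{row[0]}: wrong number of values")
        if not all(0.0 <= float(value) <= 1.0 for value in row[1:]):
            raise ValueError(f"{row[0]}: probability outside [0.0, 1.0]")
        n_rows += 1
    if n_rows != n_sequences:
        raise ValueError(f"expected {n_sequences} data rows, got {n_rows}")
    return True


# Toy file with two labs and two sequences, so the defaults are overridden.
toy = io.StringIO(
    "sequence_id,00Q4V31T,012VT4JK\n"
    "E0VFT,0.9,0.1\n"
    "TTRK5,0.25,0.75\n"
)
print(check_submission(toy, n_labs=2, n_sequences=2))  # True
```

For a real file, you would pass an open file handle and keep the default counts.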

Good luck!


If you're wondering how to get started, check out our benchmark walkthrough.

Good luck and enjoy this problem! If you have any questions, you can always visit the user forum!