competition
complete
\$30,000

## Problem description

In this challenge, your goal is to detect the presence of certain families of chemical compounds in geological material samples using gas chromatography-mass spectrometry (GCMS) data collected for Mars exploration missions. These families are mostly of organic compounds relevant to understanding Mars' potential for past habitability.

The data for this challenge comes from laboratory instruments at NASA's Goddard Space Flight Center that are operated by the Sample Analysis at Mars (SAM) science team. SAM is an instrument suite aboard the Curiosity rover on Mars. The data was collected by commercially manufactured instruments that have been configured as SAM analogs at Goddard. For more about SAM and the SAM team, see the "About" page.

### Features

Each observational unit in the dataset is a physical sample. The features for each sample are the measurements from GCMS and are provided as individual CSV files. There are three dimensions given in long format:

• time – Time in minutes since the start of data collection.
• mass – Mass-to-charge ratio (m/z) of ion being measured.
• intensity – Rate of ions detected, per second. Typically, all intensity values are compared in a relative way within one sample's analysis run.

### Example features CSV

For example, the CSV file S0001.csv for sample S0001 looks like this:

| time | mass | intensity |
| --- | --- | --- |
| 0.042183 | 45.200928 | 107402 |
| 0.042183 | 46.030167 | 16078 |
| 0.042183 | 47.020752 | 16760 |
| 0.042183 | 47.945526 | 13596 |
| 0.042183 | 48.528091 | 2244 |
| ... | ... | ... |
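
As a minimal sketch of working with this long format, the snippet below parses the rows shown above and groups them into one scan per time step. It assumes the file is comma-separated with a header row named `time,mass,intensity`; the helper names `read_long_format` and `group_by_time` are hypothetical and not part of any competition tooling.

```python
import csv
import io
from collections import defaultdict

# First rows of S0001.csv as shown above, assuming a comma-separated file
# with a header row. The remainder of the file is not reproduced here.
SAMPLE_TEXT = """time,mass,intensity
0.042183,45.200928,107402
0.042183,46.030167,16078
0.042183,47.020752,16760
0.042183,47.945526,13596
0.042183,48.528091,2244
"""


def read_long_format(handle):
    """Return a list of (time, mass, intensity) tuples as floats."""
    reader = csv.DictReader(handle)
    return [
        (float(row["time"]), float(row["mass"]), float(row["intensity"]))
        for row in reader
    ]


def group_by_time(rows):
    """Group measurements into scans: {time: [(mass, intensity), ...]}."""
    scans = defaultdict(list)
    for time, mass, intensity in rows:
        scans[time].append((mass, intensity))
    return dict(scans)


rows = read_long_format(io.StringIO(SAMPLE_TEXT))
scans = group_by_time(rows)
```

For a real file, the same functions could be called on an opened file handle instead of the in-memory string.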

Gas Chromatography-Mass Spectrometry (GCMS) is an analytical method that can be used to determine the composition of a mixed sample.

First, in most cases, the sample is placed inside a pyrolysis oven, where it is vaporized into its gaseous form. A derivatization agent, a common additive that helps capture a clearer signal during gas chromatography, may be added to the sample. Then, a non-reactive carrier gas (e.g., helium) carries the molecules of the vaporized sample through the gas chromatograph column (the "GC" column). The coating of the GC column separates the gas from the sample into its constituent components as it moves through the column. Each component is released from the column at a different time based on its chemical characteristics, such as its boiling point or polarity. The time at which a compound is released by the GC column is called the compound's "retention time."

As each component leaves the GC column, it enters the mass spectrometer. Once it enters, it is ionized—the molecules are transformed into ions (electrically charged particles). During this ionization process, fragmentation occurs as energetically unstable ions begin to dissociate. These ions pass through a device called the mass analyzer that can separate the ions by their mass-to-charge ratio (m/z), and then the intensities (often, counts) of the separated ions are measured by an ion detector. Ion intensity is measured by these instruments in units such as ion current or ion count, but scientists usually consider the intensity in relative terms.

The output measurements for a typical mass spectrometry experiment are visualized as a mass spectrum—a histogram with relative intensity on the y-axis and m/z on the x-axis. To infer the composition of the sample under analysis, scientists can use domain knowledge of how materials fragment under ionization or compare the mass spectrum to reference spectra measured from known substances.

When using gas chromatography, scientists can also use the time at which compounds are eluted (in addition to their mass and intensity) to identify them.
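
To make the link between retention time and mass spectrum concrete, here is a rough sketch that builds a relative-intensity spectrum from all measurements inside a retention-time window. It is an illustration under assumptions, not the SAM team's procedure: m/z values are binned to the nearest integer, intensities are simply summed within the window, and the function name `spectrum_in_window` and the example rows are made up.

```python
from collections import defaultdict


def spectrum_in_window(rows, t_start, t_end):
    """Sum intensity per integer m/z bin for t_start <= time <= t_end.

    rows is an iterable of (time, mass, intensity) tuples. The result maps
    each m/z bin to a relative intensity, scaled so the largest bin is 100.
    """
    totals = defaultdict(float)
    for time, mass, intensity in rows:
        if t_start <= time <= t_end:
            totals[round(mass)] += intensity  # assumed integer binning
    if not totals:
        return {}
    peak = max(totals.values())
    if peak <= 0:
        return {mz: 0.0 for mz in sorted(totals)}
    return {mz: 100.0 * value / peak for mz, value in sorted(totals.items())}


# Made-up measurements: two ions elute near t = 1.0 min, one much later.
example_rows = [
    (1.0, 43.9, 50.0),
    (1.0, 28.1, 200.0),
    (1.1, 44.2, 50.0),
    (5.0, 28.0, 999.0),
]
spectrum = spectrum_in_window(example_rows, 0.9, 1.2)
```

The resulting dictionary can be compared against reference spectra or plotted with m/z on the x-axis and relative intensity on the y-axis.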

#### Preprocessing samples: Derivatization

Before undergoing GCMS, some samples may undergo derivatization, a procedure in which a derivatization agent is added to the sample to help GCMS yield clearer results. The agent can do this in several ways, including by making the sample components more stable, which leads to more distinct peaks. Samples that have undergone derivatization may therefore have differently shaped peaks than samples with the same composition that have not. A compound that has been derivatized will have an added component in its structure, resulting in a mass spectrum that is different from that of its non-derivatized counterpart. The column derivatized in metadata.csv will have a value of 1 if the sample has been confirmed derivatized. A missing value indicates that it is unknown whether the sample was derivatized; however, samples with missing values for derivatized are more likely not to have been derivatized.
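
One possible way to handle the missing values is sketched below: a confirmed value of 1 is kept, and a missing value is encoded as 0, meaning "unknown, probably not derivatized". This encoding is a modeling choice, not something the competition prescribes. The `sample_id` column name in metadata.csv is assumed by analogy with the labels file, and the three rows are invented for illustration.

```python
import csv
import io

# Invented metadata rows; only the derivatized column is described in the text.
METADATA_TEXT = """sample_id,derivatized
S0000,
S0001,1
S0002,
"""


def derivatized_flags(handle):
    """Map sample_id to 1 (confirmed derivatized) or 0 (unknown, likely not)."""
    flags = {}
    for row in csv.DictReader(handle):
        raw = (row.get("derivatized") or "").strip()
        flags[row["sample_id"]] = 1 if raw and float(raw) == 1.0 else 0
    return flags


flags = derivatized_flags(io.StringIO(METADATA_TEXT))
```

A model that treats 0 as "definitely not derivatized" would overstate what the data says, so it may be worth checking whether this feature helps on held-out data.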

One side effect of using data from sequential experiments on the same instrument, as is the case here, is that there may be contamination across runs. In some runs that did not involve derivatization themselves but followed a run that did, the byproducts of the derivatization agent may still appear in the mass spectrum. However, the products that linger due to contamination from previous runs are usually not as prominent on the spectrum as components of the sample. This can be considered as noise in the data in the context of this problem.

#### Column Temperature

As previously discussed, different compounds are released from the column at different times based on their chemical characteristics and the temperature of the column. This will be reflected in the data by the shape and position of the peaks.

However, the time series for column temperature is not available for the experiments in this competition. Each experiment contains measurements of m/z and intensity for each time step, but not the associated temperature of the GC column at that time. Time can be used as a proxy for temperature, but the temperature ramp is neither known exactly nor the same across samples. It is always the case that observations at later times are of compounds that were released at higher temperatures. In most samples, it is expected that the temperature remains constant for the first 0-5 minutes, and then increases at an approximate rate of 5-10 degrees per minute until it reaches about 300 degrees. However, because the time before the ramp begins and the rate at which the temperature increases both vary across samples, the same time will represent different temperatures across samples.
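
The effect of an unknown ramp can be illustrated with a toy piecewise-linear model: hold, then a linear increase capped at about 300 degrees. This is only a sketch. The starting temperature is not given anywhere in this description, so `START_TEMP` is a purely hypothetical value, and `approx_column_temperature` is an invented helper, not a reconstruction of the real instrument program.

```python
def approx_column_temperature(t_min, hold_min, rate_per_min, start_temp,
                              max_temp=300.0):
    """Toy ramp: constant for hold_min minutes, then linear up to max_temp."""
    if t_min <= hold_min:
        return start_temp
    return min(start_temp + rate_per_min * (t_min - hold_min), max_temp)


START_TEMP = 40.0  # hypothetical; the real starting temperature is not stated

# The same time stamp (10 minutes) under two ramps within the stated ranges.
temp_a = approx_column_temperature(10.0, hold_min=0.0, rate_per_min=10.0,
                                   start_temp=START_TEMP)
temp_b = approx_column_temperature(10.0, hold_min=5.0, rate_per_min=5.0,
                                   start_temp=START_TEMP)
```

Under these made-up settings the two ramps differ by 75 degrees at the same time stamp, which is why features tied to absolute time may not transfer cleanly between samples.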

#### Understanding GCMS data

Some notes on how scientists typically interpret the GCMS data:

• In chemical analysis, it is common to compare the relative amounts of different substances (e.g., the chemical reaction [2H2O → 2H2 + O2] says two moles of water become two moles of molecular hydrogen and one mole of molecular oxygen). Accordingly, scientists will typically interpret mass spectrometry intensities collected from one sample in a relative way. Sometimes, mass spectra are normalized as "relative intensity" from 0 to 100 with the highest intensity value normalized to 100. (Note that this normalization should happen after background subtraction, discussed later in this section.)
• As previously discussed, peaks in ion intensities indicate an increase of the respective ions. Scientists typically look for known combinations of certain ions in certain ratios as evidence of certain fragments that derive from a compound as it elutes from the GC. The combination of ions that elute over a certain period of time is often integrated to get the highest quality mass spectra, which then can be used to identify the compounds.
• Ions of a given m/z may have more than one peak in one GCMS run. This means that different compounds containing that ion fragment eluted at different times, each leading to that ion being detected.
• Ions at a given m/z may come from more than one compound. For example, carbon dioxide (CO2), carbon monoxide (CO), and nitrogen gas (N2) all have major ions or fragments that appear at m/z 28. If a sample releases multiple gases (or there is atmospheric background), scientists have to separate out the different contributions when analyzing the data.
• Scientists will often integrate the intensity curve in the time domain for a given m/z value when considering how much of that ion was measured in the GCMS run. The integration turns the intensity from a time series of rate values (e.g., counts per second) to a quantity (e.g., counts). In some cases, the counts values can be used to determine the absolute intensity.
• Helium (He) is used as a carrier gas in all GCMS runs in this competition. That means the presence of helium is not a meaningful signal in the classification task. Helium ions will typically show up in the data as ions detected with an m/z value of 4.0 and are usually disregarded, along with most ions with m/z values of less than 12. Most ions relevant to the detection of the label classes for this competition will have m/z values below 250, but some compounds in these experiments will generate ions with m/z values between 250 and 500.
• Some ions have a background presence in the gas passing through the mass spectrometer, as evidenced by a relatively constant non-zero intensity value over the entire run, across the temperature range. This can happen for various reasons, such as contamination from the atmosphere. Scientists typically subtract this background to clean the data.
• Background intensities do not always remain constant across the experimental run. More often, they increase as the temperature increases, and often do so non-linearly. The best way to correct for background is therefore to subtract the intensity value that immediately precedes or follows a peak.
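
Several of the notes above can be combined into a small preprocessing sketch: drop ions below m/z 12 (which also removes helium at m/z 4), build one intensity trace per m/z, subtract a background estimate, and integrate over time. Two simplifications are assumed and should be read as such: m/z is binned to the nearest integer, and the background is taken as the minimum of each trace, which is cruder than the local baseline next to each peak recommended above. All function names are hypothetical.

```python
from collections import defaultdict

MIN_MZ = 12  # ions below this, including helium at m/z 4, are usually disregarded


def to_traces(rows, min_mz=MIN_MZ):
    """Return {mz: [(time, intensity), ...]} sorted by time, with integer m/z bins."""
    binned = defaultdict(lambda: defaultdict(float))
    for time, mass, intensity in rows:
        mz = round(mass)  # assumed binning to the nearest integer
        if mz >= min_mz:
            binned[mz][time] += intensity
    return {mz: sorted(series.items()) for mz, series in binned.items()}


def subtract_background(trace):
    """Subtract the trace minimum as a crude constant-background estimate."""
    floor = min(value for _, value in trace)
    return [(time, value - floor) for time, value in trace]


def integrate(trace):
    """Trapezoidal integral over time: per-second rates and minutes give counts."""
    area = 0.0
    for (t0, y0), (t1, y1) in zip(trace, trace[1:]):
        area += 0.5 * (y0 + y1) * (t1 - t0) * 60.0  # minutes to seconds
    return area


# Made-up run: a constant helium signal and one small peak at m/z 28.
example_run = [
    (0.0, 4.0, 9000.0), (0.0, 28.1, 10.0),
    (1.0, 4.0, 9000.0), (1.0, 28.0, 30.0),
    (2.0, 4.0, 9000.0), (2.0, 27.9, 10.0),
]
traces = to_traces(example_run)
clean_28 = subtract_background(traces[28])
counts_28 = integrate(clean_28)
```

Normalizing to relative intensity would come after `subtract_background`, in line with the note that normalization should follow background subtraction.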

We expect modeling these samples will be a hard task! However, the ability to accurately classify samples is important to our competition sponsors. Accordingly, competitors with the top five best-performing solutions will be considered as finalists for the bonus prize. The bonus prize will be awarded for the best modeling methodology based on a submitted write-up. See "Competition Timeline and Prizes".

### Labels

You are provided with multilabel binary labels for the training set. There are nine label classes, each indicating presence of material in the sample belonging to the respective rock, mineral, or organic compound families:

1. aromatic: sample contains aromatic compounds
2. hydrocarbon: sample contains hydrocarbon compounds
3. carboxylic_acid: sample contains carboxylic acid compounds, such as amino acids
4. nitrogen_bearing_compound: sample contains nitrogen-containing compounds, such as amines or nitriles
5. chlorine_bearing_compound: sample contains chlorine
6. sulfur_bearing_compound: sample contains sulfur
7. alcohol: sample contains alcohol compounds
8. other_oxygen_bearing_compound: sample contains compounds that contain oxygen but are not carboxylic acids or alcohols, such as esters and ethers
9. mineral: sample contains mineral compounds

Each sample can have any number of class assignments. A 1 indicates that a compound from that family is present in the sample, and a 0 indicates otherwise.

#### Labels CSV

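| sample_id | aromatic | hydrocarbon | carboxylic_acid | nitrogen_bearing_compound | chlorine_bearing_compound | sulfur_bearing_compound | alcohol | other_oxygen_bearing_compound | mineral |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S0000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| S0001 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| S0002 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| S0003 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| S0004 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |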
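
Because each sample can carry any number of labels, it can help to view the labels as a list of positive classes per sample. The sketch below does this for the five rows shown above, assuming the labels file is comma-separated with the header shown; `positive_classes` is a hypothetical helper name.

```python
import csv
import io

# The five label rows shown above, written out as comma-separated text.
LABELS_TEXT = """sample_id,aromatic,hydrocarbon,carboxylic_acid,nitrogen_bearing_compound,chlorine_bearing_compound,sulfur_bearing_compound,alcohol,other_oxygen_bearing_compound,mineral
S0000,0,0,0,0,0,0,0,0,1
S0001,0,0,0,0,0,0,0,0,0
S0002,0,0,1,1,0,0,0,0,1
S0003,0,1,0,0,0,0,0,0,0
S0004,0,0,0,0,1,0,0,0,1
"""


def positive_classes(handle):
    """Map each sample_id to the list of label classes marked 1, in column order."""
    result = {}
    for row in csv.DictReader(handle):
        sample_id = row.pop("sample_id")
        result[sample_id] = [name for name, value in row.items() if int(value) == 1]
    return result


labels = positive_classes(io.StringIO(LABELS_TEXT))
```

Note that a sample such as S0001 can have no positive classes at all, so a model should be able to predict low scores for every class.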

### Submissions and evaluation

Note that this competition's limits on rate of submissions are stricter than usual, in order to reduce the impact of overfitting and luck on the results. The dataset in this competition is relatively small, reflecting the specialized nature of the task—there is just not that much data out there for GCMS for Mars planetary science. Please see the submissions page for submission restriction details and the status of your available submissions.

#### Validation and test sets

The data for this competition is split into three sets: train, validation, and test. You will be submitting predictions for samples in the validation and test splits.

The performance metric evaluated on the validation set will be used for leaderboard ranking while the competition is open, but will not be used for final ranking and prize determination. Labels for the validation set will be released at the beginning of Phase 2: Final Training so that you will have more data available for training your final model.

Final ranking and prizes will be based on performance on the test set. There is also a bonus prize that will be awarded for the best write-up among the top five performers.

#### Submission format

The format for the submission file is a CSV file. Each row should correspond to one sample and there will be one column for each label class. For each sample and each label class, there should be a numerical score in the range [0.0, 1.0] that represents the confidence of the prediction that a compound belonging to that label class family is present in the sample.

For example, if you predicted:

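| sample_id | aromatic | hydrocarbon | carboxylic_acid | nitrogen_bearing_compound | chlorine_bearing_compound | sulfur_bearing_compound | alcohol | other_oxygen_bearing_compound | mineral |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S0809 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| S0810 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| S1582 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| S1583 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |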

then the .csv file you submit would look like:

sample_id,aromatic,hydrocarbon,carboxylic_acid,nitrogen_bearing_compound,chlorine_bearing_compound,sulfur_bearing_compound,alcohol,other_oxygen_bearing_compound,mineral
S0809,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
S0810,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
...
S1582,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
S1583,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
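
A file in this shape can be produced with a few lines of code. The sketch below writes the header and one row per sample, and rejects scores outside [0.0, 1.0]. The helper `write_submission` is hypothetical, and sorting rows by sample_id is an assumption made for reproducibility; whether the platform requires a specific row order is not stated here.

```python
import csv
import io

LABEL_CLASSES = [
    "aromatic", "hydrocarbon", "carboxylic_acid", "nitrogen_bearing_compound",
    "chlorine_bearing_compound", "sulfur_bearing_compound", "alcohol",
    "other_oxygen_bearing_compound", "mineral",
]


def write_submission(handle, predictions):
    """Write {sample_id: {label_class: score}} as a submission CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["sample_id"] + LABEL_CLASSES)
    for sample_id in sorted(predictions):
        scores = []
        for name in LABEL_CLASSES:
            score = float(predictions[sample_id][name])
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"{sample_id}/{name}: score {score} outside [0, 1]")
            scores.append(score)
        writer.writerow([sample_id] + scores)


# Constant 0.5 predictions for two of the sample IDs shown above.
predictions = {
    sample_id: {name: 0.5 for name in LABEL_CLASSES}
    for sample_id in ("S0809", "S0810")
}
buffer = io.StringIO()
write_submission(buffer, predictions)
lines = buffer.getvalue().splitlines()
```

To write to disk, pass a file opened with `open(path, "w", newline="")` instead of the in-memory buffer.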


#### Performance metric

Performance is evaluated according to multilabel aggregated log loss. Log loss (a.k.a. logistic loss or cross-entropy loss) penalizes confident but incorrect predictions. It also rewards confidence scores that are well-calibrated probabilities, meaning that they accurately reflect the long-run probability of being correct. This is an error metric, so a lower value is better.

Log loss for a single observation is calculated as follows:

$$L_{\log}(y, p) = -(y \log (p) + (1 - y) \log (1 - p))$$

where $y$ is a binary variable indicating whether the label class is actually present in the sample and $p$ is the user-predicted probability that the label is present. The loss for the entire dataset is the summed loss of individual observations.

The log loss scores across target label classes are aggregated with an unweighted average. This treats each observation and each label class equally, regardless of prevalence in the evaluation set.
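
For local validation, the metric can be approximated as in the sketch below. It makes two assumptions that are not confirmed by this description: predictions are clipped away from 0 and 1 by a small `eps` to avoid infinite losses, and the per-class loss is averaged over samples rather than summed (for a fixed evaluation set the two differ only by a constant factor). The official scoring code may differ in these details.

```python
import math


def log_loss_single(y, p, eps=1e-15):
    """Log loss for one observation; p is clipped to [eps, 1 - eps] (assumed)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def aggregated_log_loss(y_true, y_pred):
    """Mean log loss per label class, then an unweighted mean across classes.

    y_true and y_pred are lists of rows, one row per sample and one entry
    per label class.
    """
    n_samples = len(y_true)
    n_classes = len(y_true[0])
    per_class = []
    for j in range(n_classes):
        total = sum(
            log_loss_single(y_true[i][j], y_pred[i][j]) for i in range(n_samples)
        )
        per_class.append(total / n_samples)
    return sum(per_class) / n_classes


# Constant 0.5 predictions score log(2) no matter what the true labels are.
baseline = aggregated_log_loss([[0, 1], [1, 0]], [[0.5, 0.5], [0.5, 0.5]])
```

A constant 0.5 submission, like the example file above, therefore gives a useful reference score of about 0.693 under these assumptions.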

### Good luck!

Good luck and enjoy the challenge! We and our partners at NASA are looking forward to seeing your approaches, and we hope to be able to incorporate what we learn into future space missions. Check out the benchmark blog post for tips on how to get started. If you have any questions, you can always visit the user forum!