Mars Spectrometry: Detect Evidence for Past Habitability

Help NASA scientists identify the chemical composition of rock and soil samples for Mars planetary science. #science

$30,000 in prizes
Completed Apr 2022
714 joined

Problem description


In this challenge, your goal is to detect the presence of certain families of chemical compounds in geological material samples using evolved gas analysis (EGA) mass spectrometry data collected for Mars exploration missions. These families are of rocks, minerals, and ionic compounds relevant to understanding Mars' potential for past habitability.

The data from this challenge comes from laboratory instruments at NASA's Goddard Space Flight Center and Johnson Space Center that are affiliated with the Sample Analysis at Mars (SAM) science team. SAM is an instrument suite aboard the Curiosity rover on Mars. For more about SAM and the SAM team, see the "About" page.

Features


Each observational unit in the dataset is a physical sample. The features for each sample are the mass spectrometry measurements from EGA and are provided as individual CSV files. There are four dimensions given in long format:

  • time – Time in seconds since a reference start time.
  • temp – Temperature of the sample in ºC at time of the measurement.
  • m/z – Mass-to-charge ratio of ion being measured.
  • abundance – Rate of ions detected, per second. Typically, all abundance values are compared in a relative way within one sample's analysis run. (Note that different samples will have abundances in different units, discussed more later.)

Example features CSV


For example, the CSV file S0237.csv for sample S0237 looks like this:

row time temp m/z abundance
0 0.0 32.226 0.0 7.020079e-10
1 0.0 32.226 1.0 1.045714e-09
2 0.0 32.226 2.0 3.802511e-10
3 0.0 32.226 3.0 5.783763e-10
4 0.0 32.226 4.0 7.238828e-08
... ... ... ... ...
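
As a starting point, here is a minimal sketch of reading one sample's features in Python with only the standard library. The inline text stands in for a file such as S0237.csv. The column names come from the description above; the comma delimiter, the absence of a row-index column in the file, and the helper name load_sample are assumptions made for illustration.

```python
import csv
import io

# Stand-in for the contents of a features file such as S0237.csv.
# Assumption: comma-separated, header row, no row-index column.
EXAMPLE_CSV = """time,temp,m/z,abundance
0.0,32.226,0.0,7.020079e-10
0.0,32.226,1.0,1.045714e-09
0.0,32.226,2.0,3.802511e-10
0.0,32.226,3.0,5.783763e-10
0.0,32.226,4.0,7.238828e-08
"""


def load_sample(handle):
    """Read long-format EGA-MS rows as (time, temp, mz, abundance) tuples."""
    reader = csv.DictReader(handle)
    return [
        (float(r["time"]), float(r["temp"]), float(r["m/z"]), float(r["abundance"]))
        for r in reader
    ]


rows = load_sample(io.StringIO(EXAMPLE_CSV))
```

With a real file you would pass an open file handle instead of the io.StringIO object.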

About EGA-MS

Mass spectrometry (MS) is an analytical method that can be used to determine the composition of a sample. First, the substance under analysis is ionized: the molecules are transformed into ions (electrically charged particles). During this ionization process, fragmentation occurs as energetically unstable molecular ions dissociate. These ions pass through a stage called the mass analyzer that can separate the ions by their mass-to-charge ratio (m/z), and then the abundances (often, counts) of the separated ions are measured by an ion detector. The output measurements are typically visualized as a mass spectrum, a histogram with abundance on the y-axis and m/z on the x-axis. To infer the composition of the sample under analysis, scientists can use domain knowledge of how materials fragment under ionization or compare the mass spectrum to reference spectra measured from known substances.

Plot of mass spectrum for sample S0237.

Example of a mass spectrum. This mass spectrum shows a large peak at m/z=4.0 and smaller peaks at 18.0 and 32.0. Plotted data is for sample S0237 taken at a time snapshot at time=218.376.
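
A snapshot like the one in the figure can be pulled out of the long-format rows with a few lines of Python. This is a rough sketch rather than the method used to make the plot: the helper name spectrum_at and the sample values are made up, and it assumes every m/z is recorded at every timestep (true for the commercial instruments described later, not for the SAM testbed).

```python
def spectrum_at(rows, target_time):
    """Return {m/z: abundance} for the recorded time closest to target_time."""
    times = sorted({t for t, _, _, _ in rows})
    nearest = min(times, key=lambda t: abs(t - target_time))
    return {mz: ab for t, _, mz, ab in rows if t == nearest}


# Hypothetical rows in (time, temp, m/z, abundance) form; values are made up.
rows = [
    (0.0, 30.0, 4.0, 9.0e-8), (0.0, 30.0, 18.0, 1.0e-9), (0.0, 30.0, 32.0, 2.0e-9),
    (10.0, 35.0, 4.0, 9.1e-8), (10.0, 35.0, 18.0, 4.0e-9), (10.0, 35.0, 32.0, 2.5e-9),
]

spectrum = spectrum_at(rows, target_time=9.0)
# The m/z with the largest abundance in this snapshot.
base_peak = max(spectrum, key=spectrum.get)
```

For SAM testbed runs, where only one m/z is measured at a time, a short time window would probably be needed instead of a single timestamp.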

Evolved gas analysis (EGA), used in generating the data for this challenge, is an analytical technique that involves heating a sample and measuring the gases released with a mass spectrometer. By introducing temperature as an additional dimension, EGA can provide more information about a sample's chemistry than mass spectrometry alone. In EGA, the sample is steadily heated up in an oven. Gases are released by desorption, dehydration, or decomposition, and these gases flow to the mass spectrometer using a carrier gas (helium in the case of the SAM instrument). The measurements by the mass spectrometer are collected as time series, and scientists can use the mass spectra to identify the gases produced from the sample over time. Based on domain knowledge of how different materials produce gases as they are heated, the composition and mineralogy of the sample can be inferred.

The figures below show the EGA data for an example sample from the training data. The mass spectrometry data is collected as a time series, and the sample's temperature over time is also collected. In the data for this competition, the temperature is already joined to the ion abundances such that each mass spectrometer measurement has an associated temperature.

Plot of ion abundance vs. time for sample S0237.      Plot of temperature vs. time for sample S0237.

(Left) Sample S0237 ion abundance plotted over time, with each m/z plotted as a separate time series and with m/z values of 4.0, 18.0, and 32.0 highlighted. In contrast to the previous mass spectrum showing a time snapshot, we can see that m/z=18.0 and 32.0 peak at different times in the analysis run, corresponding to different temperatures of the sample. (Right) Temperature (°C) over time for sample S0237.

Understanding EGA–MS data

Some notes on how scientists typically interpret the EGA–MS data:

  • In chemical analysis, it is common to compare the relative amounts of different substances (e.g., a hydrated sulfate mineral such as gypsum [CaSO4•2H2O] releases 2 moles of water and one mole of SO2 when thermally decomposed). Accordingly, scientists will typically interpret mass spectrometry abundances collected from one sample in a relative way. Sometimes, mass spectra are normalized as "relative abundance" from 0 to 100 with the highest abundance value mapped to 100. (Note that this normalization should happen after background subtraction, discussed later in this section.)
  • As previously discussed, peaks in ion abundances indicate an increase of the respective ions. Scientists typically look for known combinations of certain ions in certain ratios as evidence of certain gases having evolved (been produced) from the sample. The temperature at which those peaks were measured is accordingly the temperature at which the respective gases evolved. This is indicative of certain chemical reactions (such as thermal decomposition) and gives information about the composition of the sample.
    • Note that the shape of the peak can matter—whether the gas evolved in a narrow or broad temperature band gives information about the underlying chemical reaction and/or mineralogy.
    • Ions of a given m/z may have more than one peak in one EGA run. This means that there were different chemical reactions at different temperatures that eventually led to that ion being detected.
    • It can be possible that ions from a given m/z result from more than one compound. For example, carbon dioxide (CO2), carbon monoxide (CO), and nitrogen gas (N2) all have major ions or fragments that appear at m/z 28. If a sample releases multiple gases (or there is atmospheric background) with such overlap, scientists have to separate out the different contributions when analyzing the data.
    • Scientists will often integrate the abundance curve in the time domain for a given m/z value when considering how much of that ion was measured in the EGA run. The integration turns the abundance from a time series of rate values (counts per second) to a quantity (counts).
  • Helium (He) is used as a carrier gas in all EGA runs in this competition. That means the presence of helium is not a meaningful signal in the classification task. Helium ions will typically show up in the data as ions detected with an m/z value of 4.0 and are usually disregarded.
  • Some ions have a background presence in the gas passing through the mass spectrometer, as evidenced by a relatively constant non-zero abundance value over the entire run, across the temperature range. This can happen for various reasons, such as contamination from the atmosphere. Scientists typically subtract this background to clean the data.
    • The simplest way to subtract the background is to take the initial value for a given m/z and subtract it from that m/z value's whole time series. This works well if the background abundance is constant over the run.
    • Background ions do not always have static abundances. Sometimes they may increase or decrease over the EGA run. In such cases, scientists may do more sophisticated subtraction, such as fitting a line or even a polynomial to the measurements and subtracting that.
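
The notes above can be strung together into a small preprocessing sketch: drop helium, subtract the initial value as background, integrate each m/z over time, and rescale so the largest value maps to 100. All function names and numbers here are illustrative assumptions. Applying the 0 to 100 scaling to integrated totals (rather than to a single spectrum) is just one possible choice, and the simple first-value baseline only suits runs with a constant background.

```python
HELIUM_MZ = 4.0  # carrier gas, usually disregarded


def subtract_background(series):
    """Subtract the first abundance value from every point (simplest method)."""
    baseline = series[0][1]
    return [(t, ab - baseline) for t, ab in series]


def integrate(series):
    """Trapezoidal integral of abundance over time (rate values -> quantity)."""
    return sum(
        0.5 * (a0 + a1) * (t1 - t0)
        for (t0, a0), (t1, a1) in zip(series, series[1:])
    )


def relative_abundance(values):
    """Scale so the largest value maps to 100 (assumes a positive maximum)."""
    peak = max(values.values())
    return {mz: 100.0 * v / peak for mz, v in values.items()}


# Hypothetical run as {m/z: [(time, abundance), ...]}; values are made up.
run = {
    4.0: [(0.0, 50.0), (1.0, 50.0), (2.0, 50.0)],
    18.0: [(0.0, 2.0), (1.0, 6.0), (2.0, 2.0)],
    32.0: [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)],
}

cleaned = {mz: subtract_background(s) for mz, s in run.items() if mz != HELIUM_MZ}
totals = {mz: integrate(s) for mz, s in cleaned.items()}
relative = relative_abundance(totals)
```

Note that the scaling happens after background subtraction, as the first note recommends. When the background drifts over the run, the baseline step would need to be replaced with a fitted line or polynomial.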

Commercial instruments vs. SAM testbed

The data for this competition was collected at multiple labs at NASA's Goddard Space Flight Center and Johnson Space Center. There are differences in the feature distributions, and this can be captured by distinguishing two kinds of instruments that were used to conduct the measurements:

  1. Commercial instruments—the data comes from commercially manufactured instruments that have been configured as SAM analogs at the Goddard and Johnson labs
  2. SAM testbed—the data comes from the SAM testbed at Goddard, a replica of the SAM instrument suite on Curiosity

The instrument type for each sample in the competition data is indicated by the instrument_type column of the metadata.csv file, with values commercial and sam_testbed.

Notable differences you will see between data collected from these two types of instruments are as follows:

  • Commercial instruments measure ion abundance as ion current in amps (Coulombs per second), while the SAM testbed measures abundance as counts per second. This results in their respective samples having drastically different orders of magnitude for their abundance values. As noted in the previous section, however, the key idea is to compare relative abundance values within one sample's run, and not to compare absolute abundance values across samples.
  • Commercial instrument runs will have ion abundance measurements for all m/z values at every timestep of measurement. The SAM testbed can only measure abundance for one m/z value at a time—the mass spectrometer scans across m/z values in ascending order and cycles through its range of detection.
  • Commercial instrument runs were generally configured to collect data for whole number m/z values from 0.0 to 100.0. The SAM testbed detects ions for a larger range of m/z values, up to 534.0 or 537.0, and data sometimes includes fractional m/z values.
    • In general, ions relevant to the detection of the label classes for this competition will be within the 0.0–100.0 range.
    • In general, if fractional m/z abundances are significant, they will be highly correlated to those of the nearest whole number m/z, e.g., m/z=1.9 will be highly correlated with 2.0. For EGA data, it is generally enough to only look at the whole number m/z values and ignore the fractional m/z values.
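
One way to put both instrument types on a common footing is to keep only whole-number m/z values in the 0.0 to 100.0 range. The sketch below assumes rows in (time, temp, m/z, abundance) form; the helper name keep_whole_mz and the sample values are hypothetical.

```python
def keep_whole_mz(rows, max_mz=100.0):
    """Keep rows whose m/z is a whole number within [0, max_mz]."""
    return [r for r in rows if r[2].is_integer() and 0.0 <= r[2] <= max_mz]


# Hypothetical SAM-testbed-like rows; values are made up.
rows = [
    (0.00, 30.0, 1.9, 12.0),
    (0.01, 30.0, 2.0, 15.0),
    (0.02, 30.0, 18.0, 40.0),
    (0.03, 30.0, 534.0, 3.0),
]

filtered = keep_whole_mz(rows)
```

This simply discards fractional values. Whether rounding them to the nearest whole number would work better is an open modeling choice.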

There are far fewer SAM testbed samples in the competition dataset than commercial samples, a consequence of the SAM testbed's uniqueness and specialized purpose. Note also that some label classes are not represented in the training data, but may be present in the test set used for final evaluation.

We expect modeling the SAM testbed samples will be a hard task! However, the ability to accurately classify samples analyzed with the SAM testbed is important to our competition sponsors. Accordingly, competitors with the top five best-performing solutions on just the SAM testbed samples in the test set will be considered as finalists for the bonus prize. The bonus prize will be awarded for the best modeling methodology based on a submitted write-up. See "Competition Timeline and Prizes".

Supplemental data

Additional unlabeled data is provided that can be used in developing your model. These samples are provided with features only. You may find these useful for unsupervised or semi-supervised methods. We look forward to seeing what you come up with!

Note that some of the samples in the supplemental dataset were run under different experimental parameters than the primary data for the competition. This may cause the physics and chemistry in the EGA to be different and result in different distributions in the data. The differences in parameters are indicated for each sample in the supplemental_metadata.csv file. In summary, the differences are:

  • Oxygen or nitrogen were used as the carrier gas. This is indicated by the carrier_gas column by o2 and n2 respectively. Samples which use helium will have he. In all samples of the primary data, helium is used.
  • The run was conducted at a different pressure. This is indicated by the different_pressure column, which will be 1 for a different pressure than the primary data and 0 for the same.

Samples which have carrier_gas as he and different_pressure as 0 can be expected to behave similarly to the primary data.
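
If you only want the supplemental samples that match the primary data's conditions, a filter along these lines may help. The carrier_gas and different_pressure columns are named above; the sample_id column, the sample identifiers, and the comma delimiter are assumptions for illustration.

```python
import csv
import io

# Stand-in for supplemental_metadata.csv.
# Assumption: a sample_id column exists; the IDs below are made up.
SUPPLEMENTAL = """sample_id,carrier_gas,different_pressure
X0001,he,0
X0002,o2,0
X0003,he,1
X0004,n2,1
"""


def comparable_samples(handle):
    """Return IDs of samples run with helium at the primary data's pressure."""
    reader = csv.DictReader(handle)
    return [
        r["sample_id"]
        for r in reader
        if r["carrier_gas"] == "he" and r["different_pressure"] == "0"
    ]


comparable = comparable_samples(io.StringIO(SUPPLEMENTAL))
```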

Labels


You are provided with multilabel binary labels for the training set. There are ten label classes, each indicating presence of material in the sample belonging to the respective rock, mineral, or ionic compound families:

  1. Basalt
  2. Carbonate
  3. Chloride
  4. Iron Oxide
  5. Oxalate
  6. Oxychlorine (chlorate, perchlorate)
  7. Phyllosilicate
  8. Silicate
  9. Sulfate
  10. Sulfide

Each sample can have any number of class assignments. A 1 indicates that a compound from that family is present in the sample, and a 0 indicates otherwise.

Labels CSV

sample_id basalt carbonate chloride iron_oxide oxalate oxychlorine phyllosilicate silicate sulfate sulfide
S0000 0 0 0 0 0 0 0 0 1 0
S0001 0 1 0 0 0 0 0 0 0 0
S0002 0 0 0 0 0 1 0 0 0 0
S0003 0 1 0 1 0 0 0 0 1 0
S0004 0 0 0 1 0 1 1 0 0 0
... ... ... ... ... ... ... ... ... ... ...
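
To make the multilabel structure concrete, this sketch turns the example rows above into a list of present classes per sample. It assumes the labels file is comma-separated with the column names shown; the helper name present_classes is hypothetical.

```python
import csv
import io

# The five example rows from the table above, written as CSV text.
LABELS_CSV = """sample_id,basalt,carbonate,chloride,iron_oxide,oxalate,oxychlorine,phyllosilicate,silicate,sulfate,sulfide
S0000,0,0,0,0,0,0,0,0,1,0
S0001,0,1,0,0,0,0,0,0,0,0
S0002,0,0,0,0,0,1,0,0,0,0
S0003,0,1,0,1,0,0,0,0,1,0
S0004,0,0,0,1,0,1,1,0,0,0
"""


def present_classes(handle):
    """Map each sample_id to the label classes marked 1 for that sample."""
    result = {}
    for row in csv.DictReader(handle):
        sample_id = row.pop("sample_id")
        result[sample_id] = [name for name, flag in row.items() if flag == "1"]
    return result


labels = present_classes(io.StringIO(LABELS_CSV))
```

Sample S0003, for example, carries three labels at once, which is why each class is scored as its own binary problem.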

Submissions and evaluation


Note that this competition's limits on rate of submissions are stricter than usual, in order to reduce the impact of overfitting and luck on the results. The dataset in this competition is relatively small, reflecting the specialized nature of the task—there is just not that much data out there for EGA for Mars planetary science. Please see the submissions page for submission restriction details and the status of your available submissions.

Validation and test sets

The data for this competition is split into three sets: train, validation, and test. You will be submitting predictions for samples in the validation and test splits.

The performance metric evaluated on the validation set will be used for leaderboard ranking while the competition is open, but will not be used for final ranking and prize determination. Labels for the validation set will be released at the beginning of Phase 2: Final Training on March 18, 2022 so that you will have more data available for training your final model.

Final ranking and prizes will be based on performance on the test set. There is also a bonus prize awarded for best modeling methodology for the SAM testbed samples, with eligible finalists selected based on their performance on the SAM testbed samples within the test set.

Submission format

The format for the submission file is a CSV file. Each row should correspond to one sample and there will be one column for each label class. For each sample and each label class, there should be a numerical score in the range [0.0, 1.0] that represents the confidence of the prediction that a compound belonging to that label class family is present in the sample.

For example, if you predicted:

sample_id basalt carbonate chloride iron_oxide oxalate oxychlorine phyllosilicate silicate sulfate sulfide
S0766 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
S0767 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
... ... ... ... ... ... ... ... ... ... ...
S1568 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
S1569 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5

then your .csv file that you submit would look like:

sample_id,basalt,carbonate,chloride,iron_oxide,oxalate,oxychlorine,phyllosilicate,silicate,sulfate,sulfide
S0766,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
S0767,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
...
S1568,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
S1569,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
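
A file in this format can be produced with Python's csv module. The sketch below is one possible approach, not an official tool: the function name write_submission and the range check are assumptions, and it writes to an in-memory buffer so it runs without touching disk.

```python
import csv
import io

CLASSES = [
    "basalt", "carbonate", "chloride", "iron_oxide", "oxalate",
    "oxychlorine", "phyllosilicate", "silicate", "sulfate", "sulfide",
]


def write_submission(handle, predictions):
    """Write {sample_id: {class_name: score}} as a submission CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["sample_id"] + CLASSES)
    for sample_id, scores in predictions.items():
        values = [scores[c] for c in CLASSES]
        if any(not 0.0 <= v <= 1.0 for v in values):
            raise ValueError(f"score outside [0.0, 1.0] for {sample_id}")
        writer.writerow([sample_id] + values)


buffer = io.StringIO()
write_submission(buffer, {
    "S0766": {c: 0.5 for c in CLASSES},
    "S0767": {c: 0.5 for c in CLASSES},
})
lines = buffer.getvalue().splitlines()
```

To write a real file, open it with newline="" and pass the file handle in place of the buffer.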

Performance metric

Performance is evaluated according to aggregated log loss. Log loss (a.k.a. logistic loss or cross-entropy loss) penalizes confident but incorrect predictions. It also rewards confidence scores that are well-calibrated probabilities, meaning that they accurately reflect the long-run probability of being correct. This is an error metric, so a lower value is better.

Log loss for a single observation is calculated as follows:

$$L_{\log}(y, p) = -(y \log (p) + (1 - y) \log (1 - p))$$

where $y$ is a binary variable indicating whether the label is truly present in the sample (1) or not (0) and $p$ is the user-predicted probability that the label is present. The loss for the entire dataset is the summed loss of individual observations.

The log loss scores across target label classes are aggregated with an unweighted average. This treats each observation and each label class equally, regardless of prevalence in the evaluation set.
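
For local validation, the metric can be approximated with a short function. Treat this as a sketch under assumptions rather than the official scorer: it averages the per-observation loss within each class before averaging across classes, and it clips probabilities by a small eps to avoid log(0). Neither detail is confirmed by the description above.

```python
import math


def log_loss(y, p, eps=1e-15):
    """Log loss for one observation; eps clipping is an assumption."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def aggregated_log_loss(y_true, y_pred):
    """Unweighted mean over label classes of the mean per-sample log loss."""
    n_classes = len(y_true[0])
    per_class = []
    for k in range(n_classes):
        losses = [log_loss(yt[k], yp[k]) for yt, yp in zip(y_true, y_pred)]
        per_class.append(sum(losses) / len(losses))
    return sum(per_class) / n_classes


# Two hypothetical samples with three label classes each.
y_true = [[1, 0, 0], [0, 1, 1]]
y_pred = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
score = aggregated_log_loss(y_true, y_pred)
```

Predicting 0.5 everywhere, as in the example submission, gives a score of ln 2 (about 0.693) under this definition, which makes a handy baseline to beat.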

Good luck!


Good luck and enjoy the challenge! We and our partners at NASA look forward to seeing your approaches and hope to incorporate what we learn into future space missions. Check out the benchmark blog post for tips on how to get started. If you have any questions, you can always visit the user forum!