Youth Mental Health Narratives: Automated Abstraction

Apply machine learning techniques to automate the process of data abstraction from youth suicide narratives.

$45,000 in prizes
Completed Nov 2024
588 joined

Code submission format

This is a code execution challenge! Rather than submitting your predicted labels, you'll package everything needed to perform inference and submit that for containerized execution. The runtime repository contains the complete specification for the runtime. The repository README includes detailed instructions for using the runtime to test your submission and request additional packages.

The general process for making a submission is:

  1. Create a properly formatted submission.zip
  2. Test your submission locally in Docker
  3. Make a smoke test submission
  4. Once you have successfully debugged your submission with steps 2 and 3, submit for scoring on the full test set!

Tracking a submission:

  • "Code jobs" tab: This tracks the generation of predictions using your code.
  • "Submissions" tab: This tracks the scoring of your predictions based on the test set ground truth. It will show scores and any errors that came up during scoring.

What to submit


Your submission will be a zip archive (e.g. submission.zip). The root level of the submission.zip file must contain a Python script named main.py, which performs inference on a CSV of test narratives and writes predictions to a file named submission.csv in the same directory as main.py.

Here's an example of the unpacked contents of a zipped submission:

submission # Root directory
├── assets/
│   └── ... # All assets needed for inference, e.g., model weights
└── main.py         # Inference script

When the archive is unzipped, main.py must appear directly in the extraction folder. It must sit at the root level of the zip archive, not inside any subfolder.
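One way to avoid an accidental wrapping folder is to build the archive with a short script instead of a file manager. The sketch below is only an illustration, not part of the official runtime: the function name build_submission and the folder layout are assumptions. It stores every file under a path relative to the source folder, then confirms that main.py is at the archive root.

```python
import zipfile
from pathlib import Path


def build_submission(src_dir, zip_path):
    """Zip the contents of src_dir so that main.py sits at the archive root.

    Hypothetical helper for illustration; returns the archive's member names.
    """
    src_dir, zip_path = Path(src_dir), Path(zip_path)
    if not (src_dir / "main.py").is_file():
        raise FileNotFoundError(f"{src_dir} does not contain main.py")

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src_dir.rglob("*")):
            # Skip directories and the archive itself if it lives in src_dir.
            if not path.is_file() or path.resolve() == zip_path.resolve():
                continue
            # Paths are stored relative to src_dir, so no parent folder is added.
            zf.write(path, path.relative_to(src_dir).as_posix())

    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    if "main.py" not in names:
        raise ValueError("main.py is not at the root of the archive")
    return names
```

Calling build_submission("my_submission", "submission.zip") on a folder that holds main.py and assets/ should list main.py with no leading folder name.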

During code execution, your submission will be unzipped and run in our cloud compute cluster. The script will have access to the following directory structure:

submission # Root directory
├── data/
│   ├── submission_format.csv
│   └── test_features.csv
├── main.py
└── <any additional assets included in the submission archive>

During execution, you will have access to data/test_features.csv and data/submission_format.csv. At the end of inference, your submission root directory should look like this:

submission # Root directory
├── data/
│   ├── submission_format.csv
│   └── test_features.csv
├── main.py
├── submission.csv # Your predictions on the test set
└── <any additional assets included in the submission archive>

For a detailed description of the submission.csv format, see the submission format section.

Test features

data/test_features.csv will contain the narratives for the test set. It has three columns:

  • uid (str): The individual's ID
  • NarrativeCME (str): A summary of the information in the coroner / medical examiner report
  • NarrativeLE (str): A summary of the information in the law enforcement report

Your main.py must read this file to perform inference. There are 1,000 test data points.
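As a rough illustration of the expected input and output, here is a minimal skeleton for main.py using only the standard library. It is a sketch under assumptions: predict is a hypothetical stand-in for a real model and simply echoes the placeholder values, run_inference is an illustrative name, and UTF-8 encoding is assumed for the CSV files.

```python
import csv
from pathlib import Path


def predict(narrative_cme, narrative_le, placeholder):
    """Hypothetical stand-in for a model: returns the placeholder values unchanged."""
    return dict(placeholder)


def run_inference(root):
    """Read the test narratives under root/data and write root/submission.csv."""
    root = Path(root)
    with open(root / "data" / "test_features.csv", newline="", encoding="utf-8") as f:
        features = {row["uid"]: row for row in csv.DictReader(f)}

    with open(root / "data" / "submission_format.csv", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames
        format_rows = list(reader)

    with open(root / "submission.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        # Iterate over the format file so its row order is preserved.
        for fmt in format_rows:
            feat = features[fmt["uid"]]
            row = predict(feat["NarrativeCME"], feat["NarrativeLE"], fmt)
            row["uid"] = fmt["uid"]
            writer.writerow(row)

# In main.py, finish with: run_inference(Path(__file__).resolve().parent)
```

Resolving paths from the script's own location, rather than the current working directory, keeps submission.csv next to main.py regardless of where the script is launched from.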

Testing your submission


The machines for executing your inference code are a shared resource across competitors, so please be conscientious in your use of them. Before you make a full submission, you should first test locally, and then test in the limited smoke test environment that has logging enabled. Make sure to cancel any jobs that will run longer than the allowed time. Smoke tests, cancelled jobs, and failed jobs won't count against your submission limit.

Testing your submission locally

To confirm that your submission runs inference successfully and to debug any issues, first test it locally using the make test-submission command in the runtime repository. This lets you work out bugs on your own machine and check that your model will run quickly enough.

Smoke tests

All code logs during inference on the test set are disabled to protect the confidentiality of the data. For additional debugging, we provide a "smoke test" environment that replicates the test inference runtime but runs only a small sample. During a smoke test, you will still have access to data/test_features.csv and data/submission_format.csv. test_features.csv will contain a sample of narratives from the training set.

Code logs are enabled in this environment. See the smoke test section of the runtime repository README for more details.

The sample of training data used in smoke tests is available on the data download page.

  • smoke_test_features.csv includes the same columns as the test features.
  • smoke_test_labels.csv includes the same columns as the test labels.

Runtime constraints


Your code will be executed within a container that is defined in the runtime repository. The limits are as follows:

  • Your submission must run on Python 3.10 using the packages defined in the runtime repository.
    • Submissions to the platform will run within the GPU environment. (CPU environments are provided within the container for local debugging only.)
  • The submission must complete execution in 8 hours or less (10 minutes or less for smoke tests). We expect most submissions will complete much more quickly and computation time per participant will be monitored to prevent abuse. If you believe that an efficient, optimized solution requires more time than this limit allows, let us know in the competition forum and we may consider increasing the time limit.
  • The container runtime has access to a single NVIDIA T4 GPU with 16GB VRAM. It also has access to 8 vCPUs with 56GB RAM.
  • The container will not have network access. All necessary files (code and model assets) must be included in your submission.
  • The container execution will not have root access to the file system.
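To put the time limit in perspective: with 1,000 test narratives and an 8-hour limit, the average budget is 28.8 seconds per narrative, before accounting for model loading. The sketch below times a small sample and extrapolates to the full test set. The names projected_runtime and predict_one are hypothetical, and timings on a local machine will not match the T4 environment, so treat the result as a rough estimate only.

```python
import time

TIME_LIMIT_SECONDS = 8 * 60 * 60  # full-submission limit stated above
N_TEST_ROWS = 1_000               # number of test narratives

# Average time available per narrative: 28,800 s / 1,000 rows = 28.8 s.
BUDGET_PER_ROW = TIME_LIMIT_SECONDS / N_TEST_ROWS


def projected_runtime(predict_one, sample, n_rows=N_TEST_ROWS):
    """Time predict_one on a non-empty sample and extrapolate to n_rows.

    Ignores one-off costs such as loading model weights.
    """
    start = time.perf_counter()
    for narrative in sample:
        predict_one(narrative)
    per_row = (time.perf_counter() - start) / len(sample)
    return per_row * n_rows
```

If projected_runtime(...) comes out close to TIME_LIMIT_SECONDS on comparable hardware, the submission is at risk of exceeding the limit.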

Requesting new packages

Since the Docker container will not have network access, all packages must be pre-installed. To add Python or other dependencies to the environment, follow the process described in the "Updating runtime packages" section of the runtime README.

Submission format


Your script should generate predictions for standard variables based on the test narratives. The NVDRS Coding Manual has more detailed definitions of each variable.

submission_format.csv contains the correct row indices and columns for the test set labels, with placeholder values. To generate a submission, replace these values with your predictions. The submission format is available on the data download page. It will also be accessible during code execution at data/submission_format.csv. The submission.csv that you generate must exactly match the row order, column order, and data types in the submission format. Column order and data types are also specified below.

submission_format.csv contains 1,000 rows and 24 columns. uid is a string. All other columns are integers. InjuryLocationType and WeaponType1 are integer encodings of categorical strings. All other fields are binary variables with a value of 0 (absent) or 1 (present).
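Because logs are disabled on the full test set, it can help to check the structure of submission.csv against the format file before submitting. The following is a minimal sketch rather than the platform's actual validation: check_submission is a hypothetical helper, and it only covers column order, row count, uid order, and integer formatting.

```python
import csv
import re

INTEGER = re.compile(r"-?\d+")


def check_submission(submission_path, format_path):
    """Return a list of problems; an empty list means these basic checks pass."""
    def read(path):
        with open(path, newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f))
        return rows[0], rows[1:]

    sub_header, sub_rows = read(submission_path)
    fmt_header, fmt_rows = read(format_path)

    if sub_header != fmt_header:
        return ["column names or column order differ from the format file"]
    if len(sub_rows) != len(fmt_rows):
        return [f"expected {len(fmt_rows)} rows, found {len(sub_rows)}"]

    problems = []
    uid_col = sub_header.index("uid")
    for line_no, (sub, fmt) in enumerate(zip(sub_rows, fmt_rows), start=2):
        if len(sub) != len(sub_header):
            problems.append(f"line {line_no}: wrong number of fields")
            continue
        if sub[uid_col] != fmt[uid_col]:
            problems.append(f"line {line_no}: uid {sub[uid_col]!r} is out of order")
        for name, value in zip(sub_header, sub):
            # Every column except uid must be written as a plain integer.
            if name != "uid" and not INTEGER.fullmatch(value):
                problems.append(f"line {line_no}: {name}={value!r} is not an integer")
    return problems
```

A value written as 1.0 is reported as a non-integer, which is a common side effect of saving a float column.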

Individual identifier:

  • uid (str): The individual's ID

Mental health history and current state:

  • DepressedMood (int, 0 or 1): The person was perceived to be depressed at the time
  • MentalIllnessTreatmentCurrnt (int, 0 or 1): Currently in treatment for a mental health or substance abuse problem
  • HistoryMentalIllnessTreatmnt (int, 0 or 1): History of ever being treated for a mental health or substance abuse problem
  • SuicideAttemptHistory (int, 0 or 1): History of attempting suicide previously
  • SuicideThoughtHistory (int, 0 or 1): History of suicidal thoughts or plans
  • SubstanceAbuseProblem (int, 0 or 1): The person struggled with a substance abuse problem. This combines AlcoholProblem and SubstanceAbuseOther from the coding manual
  • MentalHealthProblem (int, 0 or 1): The person had a mental health condition at the time

Specific mental health diagnoses: Variables indicating whether specific mental illness diagnoses applied. They are based on MentalHealthDiagnosis1/2 and MentalHealthDiagnosisOther in the coding manual (5.3.3). Only the most common diagnoses are included.

  • DiagnosisAnxiety (int, 0 or 1)
  • DiagnosisDepressionDysthymia (int, 0 or 1)
  • DiagnosisBipolar (int, 0 or 1)
  • DiagnosisAdhd (int, 0 or 1)

Contributing factors:

  • IntimatePartnerProblem (int, 0 or 1): Problems with a current or former intimate partner appear to have contributed
  • FamilyRelationship (int, 0 or 1): Relationship problems with a family member (other than an intimate partner) appear to have contributed
  • Argument (int, 0 or 1): An argument or conflict appears to have contributed
  • SchoolProblem (int, 0 or 1): Problems at or related to school appear to have contributed
  • RecentCriminalLegalProblem (int, 0 or 1): Criminal legal problem(s) appear to have contributed

Disclosure of intent:

  • SuicideNote (int, 0 or 1): The person left a suicide note
  • SuicideIntentDisclosed (int, 0 or 1): The person disclosed their thoughts and/or plans to die by suicide to someone else within the last month
  • DisclosedToIntimatePartner (int, 0 or 1): Intent was disclosed to a previous or current intimate partner
  • DisclosedToOtherFamilyMember (int, 0 or 1): Intent was disclosed to another family member
  • DisclosedToFriend (int, 0 or 1): Intent was disclosed to a friend

Incident details:

  • InjuryLocationType (int, categorical): The type of place where the suicide took place. This must be an integer between 1 and 6. Integers are used to encode the options specified in the coding manual (4.3.3) according to the mapping below. The order of integer encodings is solely alphabetical. Note that multiple uncommon categories from the coding manual are combined into "Other".

    • 1: House, apartment
    • 2: Motor vehicle (excluding school bus and public transportation)
    • 3: Natural area (e.g., field, river, beaches, woods)
    • 4: Park, playground, public use area
    • 5: Street/road, sidewalk, alley
    • 6: Other
  • WeaponType1 (int, categorical): Type of weapon used. This must be an integer between 1 and 12. Integers are used to encode the options specified in the coding manual (6.1) according to the mapping below. The order of integer encodings is solely alphabetical.

    • 1: Blunt instrument
    • 2: Drowning
    • 3: Fall
    • 4: Fire or burns
    • 5: Firearm
    • 6: Hanging, strangulation, suffocation
    • 7: Motor vehicle including buses, motorcycles
    • 8: Other transport vehicle, e.g., trains, planes, boats
    • 9: Poisoning
    • 10: Sharp instrument
    • 11: Other (e.g. taser, electrocution, nail gun)
    • 12: Unknown
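The two categorical encodings above can be kept in code as lookup tables, which also makes it easy to check that every predicted value is in range. This is a sketch for illustration: the dictionary contents are copied from the lists above, while the names INJURY_LOCATION_TYPE, WEAPON_TYPE_1, and invalid_fields are assumptions, not part of the runtime.

```python
INJURY_LOCATION_TYPE = {
    1: "House, apartment",
    2: "Motor vehicle (excluding school bus and public transportation)",
    3: "Natural area (e.g., field, river, beaches, woods)",
    4: "Park, playground, public use area",
    5: "Street/road, sidewalk, alley",
    6: "Other",
}

WEAPON_TYPE_1 = {
    1: "Blunt instrument",
    2: "Drowning",
    3: "Fall",
    4: "Fire or burns",
    5: "Firearm",
    6: "Hanging, strangulation, suffocation",
    7: "Motor vehicle including buses, motorcycles",
    8: "Other transport vehicle, e.g., trains, planes, boats",
    9: "Poisoning",
    10: "Sharp instrument",
    11: "Other (e.g. taser, electrocution, nail gun)",
    12: "Unknown",
}

CATEGORICAL_CODES = {
    "InjuryLocationType": INJURY_LOCATION_TYPE,
    "WeaponType1": WEAPON_TYPE_1,
}
BINARY_CODES = {0: "absent", 1: "present"}


def invalid_fields(row):
    """Return the names of fields in one prediction row that are out of range."""
    bad = []
    for name, value in row.items():
        if name == "uid":
            continue
        # Columns not listed as categorical are treated as binary 0/1 flags.
        allowed = CATEGORICAL_CODES.get(name, BINARY_CODES)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or value not in allowed:
            bad.append(name)
    return bad
```

For example, a row with WeaponType1 set to 0 or InjuryLocationType set to 7 would be flagged, since those codes fall outside the mappings.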