SNOMED CT Entity Linking Challenge

Link spans of text in clinical notes to concepts in the SNOMED CT clinical terminology. #health

$25,000 in prizes
mar 2024
553 joined

Code submission format

This is a code execution challenge! Rather than submitting your predicted labels, you'll package everything needed to perform inference and submit that for containerized execution. The runtime repository contains the complete specification for the runtime.

All submissions for inference must run on Python 3.10. No other languages or versions of Python are supported.

What to submit


Your final submission should be a zip archive named with the extension .zip (for example, submission.zip). The root level of the submission.zip file must contain a main.py which performs inference on the test notes and writes the predictions to a file named submission.csv in the same directory as main.py. You can see an example of this submission setup in the runtime repository.

Here's an example of the unpacked contents of a zipped submission:

submission root directory
├── assets/
│   └── ... (all assets needed for inference, e.g., model weights)
└── main.py         # Inference script

Note: be sure that when you unzip the submission main.py exists in the folder where you unzip. This file must be present at the root level of the zip archive. There should be no folder that contains it.

During code execution, your submission will be unzipped and run in our cloud compute cluster. The script will have access to the following directory structure:

submission root directory
├── data/
│   └── test_notes.csv
├── main.py
└── <additional assets included in the submission archive>

The test dataset is contained in test_notes.csv, a CSV file containing two columns: note_id containing the ID of the clinical note and text containing the text of the notes. Your main.py must read this file to perform inference but may not log any of its contents. Doing so may be grounds for disqualification.

submission.csv format

Your generated span classification predictions must be in a CSV file called submission.csv with four columns:

  • note_id (str): the ID of the clinical note
  • start (int): the start character index of the span
  • end (int): the end character index of the span
  • concept_id (int): the integer ID of the corresponding SNOMED CT concept

where each row corresponds to a labeled span. Your code-generated submission.csv must match this format exactly.

In addition, your predicted spans must be non-overlapping. Each character in the note text must be predicted to belong to at most a single concept in the SNOMED CT clinical terminology.

Here is an example of what the first six rows of your submission.csv file might look like:

note_id,start,end,concept_id
12345678-DS-9,0,10,12345
12345678-DS-9,100,110,54321
98765432-DS-1,0,10,6789
98765432-DS-1,100,110,12345
98765432-DS-1,1000,1010,987654321

You can find an example submission_format.csv on the data download page. Note that the provided submission format only has 10 rows, however the length of your submission will vary based on the number of labeled spans your model produces.

At the end of inference, your submission root directory should look like this:

submission root directory
├── data/
│   └── test_notes.csv
├── main.py
├── submission.csv
└── <additional assets included in the submission archive>

Testing your submission


Testing your submission locally

In order to ensure that your submission runs inference successfully and to debug any issues, you should first and foremost test your submission locally using the make test-submission command in the runtime container repository. This is a great way to work out any bugs locally and ensure that your model will run quickly enough.

Smoke tests

All code logs during inference on the test set are disabled. For additional debugging, we provide a "smoke test" environment that replicates the test inference runtime but runs only on a small set of notes. In the smoke test environment, test_notes.csv contains three clinical notes from the training set. Code logs are enabled in this environment, though you are still prohibited from logging any information about the notes present in the test_notes.csv file. See the smoke test section of the runtime repo README for more details.

Submission checklist


  • Submission includes main.py in the root directory of the zip. There can be extra files with more code that is called (see assets folder in the example).
  • Submission contains any model weights that need to be loaded. There will be no network access.
  • Submission does not print or log any information about the test dataset, including specific note ids, the contents of the notes, and/or aggregations such as sums, means, or token counts. Doing so may be grounds for disqualification.
  • Script loads the data for inference from the data folder. All notes for inference are in a single CSV in the root level of the data folder. This folder is read-only.
  • Script writes submission.csv to the root directory when inference is finished. The format of this file must match the submission format exactly: each row corresponds to a labeled span, spans are non-overlapping, and values for note_id are strings and all other values are integers.

Runtime


Your code is executed within a container that is defined in our runtime repository. The limits are as follows:

  • Your submission must be written in Python and run Python 3.10 using the packages defined in the runtime repository.
  • The submission must complete execution in 1 hour or less. We expect most submissions will complete much more quickly and computation time per participant will be monitored to prevent abuse. If you find yourself requiring more time than this limit allows, open a GitHub issue in the repository to let us know.
  • The container runtime has access to a single GPU. All of your code should run within the GPU environments in the container, even if actual computation happens on the CPU. (CPU environments are provided within the container for local debugging only.)
  • The container has access to 24 vCPUs powered by an AMD EPYC™ 7V13 chip and 220GB RAM.
  • The container has 1 NVIDIA A100 GPU with 80GB VRAM.
  • The container will not have network access. All necessary files (code and model assets) must be included in your submission.
  • The container execution will not have root access to the filesystem.

The VMs for executing your inference code are a shared resource across competitors. We request that you are conscientious in your use of them. Ensure that you are first testing locally and then testing in the limited smoke test environment to prevent long-running test inference jobs that fail with logging disabled. Smoke tests won't count against your submission limit, and this means more available resources to score submissions that will complete on time.

Requesting package installations


Since the docker container will not have network access, all packages must be pre-installed. We are happy to add packages as long as they do not conflict and can build successfully. Packages must be available through conda for Python 3.10. To request an additional package be added to the docker image, follow the instructions in the runtime repository.

Happy building! Once again, if you have any questions or issues you can always head on over the user forum!