SNOMED CT Entity Linking Challenge

Link spans of text in clinical notes to concepts in the SNOMED CT clinical terminology.

$25,000 in prizes
Mar 2024
553 joined

Problem description

The objective of this competition is to link spans of text in clinical notes to specific concepts in the SNOMED Clinical Terms (CT) clinical terminology. The clinical notes in this challenge come from MIMIC-IV-Note, which contains de-identified hospital discharge notes provided by the Beth Israel Deaconess Medical Center in Boston, MA. SNOMED CT, maintained by SNOMED International, is a "systematically organized computer-processable collection of medical terms providing codes, terms, synonyms and definitions used in clinical documentation and reporting" (source).

Overview


Although each of these steps is covered in detail below, it is helpful to start with an overview of what you'll need to do to get going in this competition. The steps are:

  1. Obtain access to the challenge dataset from PhysioNet and download it.
  2. Download the SNOMED CT taxonomy and get SNOMED CT working locally.
  3. Use the train_annotations.csv file and SNOMED CT to train your entity linking model. (You may also use any other publicly available data source you wish for unsupervised or supervised learning.)
  4. Use the scoring.py file to evaluate your model locally.
  5. Package and submit your code to run your model against the test set and get a leaderboard score.

Data


The features: MIMIC-IV Note

MIMIC-IV is a large repository of multi-modal clinical datasets hosted on the PhysioNet platform by the MIT Laboratory for Computational Physiology. The dataset we are using for this challenge comes from MIMIC-IV-Note, which contains 331,794 de-identified hospital discharge notes provided by the Beth Israel Deaconess Medical Center in Boston, MA.

The terms under which the MIMIC data are provided to PhysioNet preclude third parties from re-publishing them, so you will need to access the notes for this challenge via PhysioNet. Furthermore, the MIMIC license requires that users are registered for access and have completed a short training course on handling data from human subjects. Details of the access procedures that you'll need to follow are given here.

Once you have been granted access, you will be able to download the mimic-iv_notes_training_set.csv from the PhysioNet challenge page.

The training notes CSV contains the following fields:

  • note_id: a unique identifier
  • text: the hospital discharge note

A handy script to load the data into a SQLite database is provided here.
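If you prefer to do this yourself, here is a minimal sketch (ours, not the linked script; it assumes pandas and that the CSV sits in the working directory) that loads the notes and mirrors them into a local SQLite database:

```python
import sqlite3

import pandas as pd

# Load the training notes (fields: note_id, text).
notes = pd.read_csv("mimic-iv_notes_training_set.csv")
print(f"Loaded {len(notes)} notes")

# Mirror the notes into a local SQLite database for convenient querying.
with sqlite3.connect("notes.db") as conn:
    notes.to_sql("notes", conn, if_exists="replace", index=False)
```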

The terminology: SNOMED CT

SNOMED CT (Systematized Nomenclature of Medicine Clinical Terms) is a systematically organized clinical terminology widely used in the healthcare industry to facilitate the accurate recording, sharing, and analysis of clinical data. Unlike simpler medical coding systems that assign numerical codes to diseases and treatments (primarily for billing and administrative purposes), SNOMED CT offers a more comprehensive and detailed approach to medical information.

A clinical terminology like SNOMED CT is designed to encompass a vast array of clinical concepts—including diseases, symptoms, procedures, and body structures—in a structured and standardized format. This allows for precise and unambiguous communication among healthcare providers and supports various clinical activities, including patient documentation, reporting, and decision-making.

Where to get SNOMED CT

This challenge depends on using the May 2023 release of the SNOMED CT International Edition. This version of the terminology has been packaged up into SnomedCT_InternationalRF2.zip, which is available on the data download page. However, if you have access to SNOMED CT elsewhere (details here), please make sure that you are working with the May 2023 International Edition and not any other versions or country editions/extensions.

Check out the SNOMED CT resources page for details on ways to work with this terminology. Use of SNOMED CT in this challenge is governed by the terms of use described in the official rules.

Using SNOMED CT for entity linking

Unlike simple medical coding systems, SNOMED CT has been carefully curated to capture and represent medical knowledge and practice. By exploring the attributes of concepts and the relationships between them, we can arm our models with powerful prior knowledge about the clinical terms they will encounter. Here are three examples demonstrating how this knowledge can be extracted.

Terms

Every concept in SNOMED CT has a “fully specified name” (FSN). This is a descriptive term that is unique to the concept within the terminology. For example, the FSN of the concept:

4596009 |Laryngeal structure (body structure)|

is “Laryngeal structure (body structure)”. The FSN does not always read as natural language, since it is intended to closely match the logical definition of the concept.

SNOMED CT also records other terms (“synonyms”) for each concept. These are alternative descriptions that have been deemed medically acceptable. For the above concept the synonyms are: “Laryngeal structure”, “Larynx structure” and “Larynx”. The presence of multiple descriptive terms for each concept is useful for training vector embeddings or simply for string-matching.
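As a concrete sketch, the FSN and synonyms of a concept can be read from the RF2 descriptions file that ships in the release package. The file path below is an assumption based on the standard RF2 layout of the May 2023 International Edition; adjust it to match your extracted download.

```python
import csv

import pandas as pd

FSN = "900000000000003001"      # description typeId for fully specified names
SYNONYM = "900000000000013009"  # description typeId for synonyms

# RF2 files are tab-separated and unquoted; path assumed from the standard layout.
desc = pd.read_csv(
    "Snapshot/Terminology/sct2_Description_Snapshot-en_INT_20230531.txt",
    sep="\t",
    dtype=str,
    quoting=csv.QUOTE_NONE,
)
desc = desc[desc["active"] == "1"]  # keep only active descriptions

larynx = desc[desc["conceptId"] == "4596009"]
print(larynx.loc[larynx["typeId"] == FSN, "term"].iloc[0])
# -> Laryngeal structure (body structure)
print(larynx.loc[larynx["typeId"] == SYNONYM, "term"].tolist())
# -> ['Laryngeal structure', 'Larynx structure', 'Larynx']
```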

Hierarchy

Every concept in SNOMED CT exists somewhere in the concept hierarchy. SNOMED CT uses an attribute called “is a” to define the parent-child relationships. The parents and children of a concept can help us to understand it in the proper context. For example, the concept:

4596009 |Laryngeal structure (body structure)|

has the following (“is a”) parents:

  • 49928004 |Structure of anterior portion of neck (body structure)|
  • 303417002 |Larynx and/or tracheal structures (body structure)|
  • 716151000 |Structure of oropharynx and/or hypopharynx and/or larynx (body structure)|
  • 714323000 |Structure of organ in respiratory system (body structure)|

From this we learn that the concept in question is a body structure, that it forms part of the respiratory system, that it is located towards the front of the neck and that it is closely linked to the trachea. A concept’s grandparents and further ancestors can be used to discover broader information about a concept. As we navigate up the hierarchy, each “is a” relationship is like a logical statement about the target concept.

The terminology also contains concepts and other content marked as inactive, but the use of these inactive concepts is outside the scope of this competition.
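As a sketch, the "is a" edges can be read from the RF2 relationships file (path again assumed from the standard layout of the release package; inactive rows are dropped, per the note above) and walked upwards to collect a concept's ancestors:

```python
import pandas as pd

IS_A = "116680003"  # typeId of the "is a" attribute

# Path assumed from the standard RF2 layout of the release package.
rels = pd.read_csv(
    "Snapshot/Terminology/sct2_Relationship_Snapshot_INT_20230531.txt",
    sep="\t",
    dtype=str,
)
rels = rels[rels["active"] == "1"]  # keep only active relationships

def parents(concept_id: str) -> list[str]:
    """Direct 'is a' parents of a concept."""
    mask = (rels["sourceId"] == concept_id) & (rels["typeId"] == IS_A)
    return rels.loc[mask, "destinationId"].tolist()

def ancestors(concept_id: str) -> set[str]:
    """All concepts reachable by following 'is a' edges upwards."""
    seen: set[str] = set()
    frontier = [concept_id]
    while frontier:
        for parent in parents(frontier.pop()):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen

print(parents("4596009"))  # the four parents listed above
```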

Defining relationships

Every concept is modelled with zero or more sets of defining relationships, with each relationship taking the form of an (attribute, value) pair. Together with the “is a” relationships, these sets constitute a unique (both necessary and sufficient) definition of the concept.

Consider the concept:

274317003 |Laryngoscopic biopsy larynx (procedure)|

Its defining relationships (attribute → value) are:

  • 260686004 |Method (attribute)| → 129314006 |Biopsy - action (qualifier value)|
  • 405813007 |Procedure site - Direct (attribute)| → 4596009 |Laryngeal structure (body structure)|
  • 425391005 |Using access device (attribute)| → 44738004 |Laryngoscope, device (physical object)|


Suppose that an algorithm encountered the sentence: “Patient referred for a biopsy to investigate potential swelling in upper larynx”. The algorithm matches the span “larynx” to 4596009 |Laryngeal structure (body structure)| and is now considering whether to match the span “biopsy” to a concept.

An obvious choice might be 86273004 |Biopsy (procedure)|, but we also have the semantically similar 129314006 |Biopsy - action (qualifier value)|. By using the defining relationships above – combined with the presence of the span “larynx” in close proximity to the span “biopsy” – the algorithm could make a pretty good guess that the relevant concept here is 274317003 |Laryngoscopic biopsy larynx (procedure)| – which would be the correct annotation in this instance.
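A hedged sketch of that disambiguation step, reusing the rels DataFrame from the hierarchy sketch above (the concepts_with helper is ours, purely for illustration):

```python
def concepts_with(attribute_id: str, value_id: str) -> set[str]:
    """Concepts whose defining relationships include (attribute_id, value_id)."""
    mask = (rels["typeId"] == attribute_id) & (rels["destinationId"] == value_id)
    return set(rels.loc[mask, "sourceId"])

# "biopsy" near "larynx": intersect method = |Biopsy - action| with
# procedure site - direct = |Laryngeal structure|.
candidates = concepts_with("260686004", "129314006") & concepts_with("405813007", "4596009")
# 274317003 |Laryngoscopic biopsy larynx (procedure)| should be among the candidates.
print(candidates)
```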

The labels: Clinical note span annotations

For this challenge, the SNOMED International team annotated nearly 300 discharge notes from MIMIC-IV-Note. Of the roughly 219K SNOMED CT concepts in the subset of the terminology that is in scope for this challenge, approximately 7,000 appear across these documents.

The annotations have been divided into a training dataset with annotations for approximately 200 clinical notes and a test dataset with annotations for approximately 70 clinical notes, against which your submissions will be evaluated. Note that some concepts appear in the test dataset but not in the training dataset. Your models should leverage the SNOMED CT clinical terminology and the relationships it contains to generalize to concepts not seen in the training dataset.

The train_annotations.csv file is available on the PhysioNet challenge page.

The training annotations CSV contains the following fields:

  • note_id: a string identifying the discharge note to which the annotation belongs.
  • start: an integer indicating the character start index of the annotation span in the document.
  • end: an integer indicating the character end index of the annotation span in the document.
  • concept_id: an integer corresponding to the SNOMED CT Concept ID (sctid).
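To sanity-check the character offsets, here is a minimal sketch (assuming both CSVs are in the working directory) that slices an annotated span out of its note:

```python
import pandas as pd

notes = pd.read_csv("mimic-iv_notes_training_set.csv").set_index("note_id")
annotations = pd.read_csv("train_annotations.csv")

# Slice the first annotated span out of its note using the character offsets.
first = annotations.iloc[0]
span = notes.loc[first["note_id"], "text"][first["start"]:first["end"]]
print(f"{span!r} -> {first['concept_id']}")
```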

Concepts in scope

The data were annotated using the May 2023 International Release of SNOMED CT. Annotators were instructed to look only for concepts which appear in the following sub-hierarchies of the terminology:

  • Procedures
  • Body Structures
  • Clinical Findings

This equates to around 219K concepts. Notably absent from the above are administrative concepts, situational concepts and substances (the latter includes all drugs and medications).
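If you want to materialize that scope list yourself, one way is to walk the "is a" edges downwards from the three hierarchy roots. This sketch reuses the rels DataFrame from the hierarchy example above; the three root concept IDs are the standard top-level concepts of these hierarchies.

```python
# Top-level concepts of the three in-scope hierarchies.
ROOTS = {
    "71388002",   # Procedure (procedure)
    "123037004",  # Body structure (body structure)
    "404684003",  # Clinical finding (finding)
}
IS_A = "116680003"

# Invert the "is a" edges so the hierarchy can be walked downwards.
children: dict[str, set[str]] = {}
is_a = rels[rels["typeId"] == IS_A]
for child, parent in zip(is_a["sourceId"], is_a["destinationId"]):
    children.setdefault(parent, set()).add(child)

in_scope = set(ROOTS)
frontier = list(ROOTS)
while frontier:
    for child in children.get(frontier.pop(), ()):
        if child not in in_scope:
            in_scope.add(child)
            frontier.append(child)

print(len(in_scope))  # should come out at roughly 219K
```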

Example

Below is a synthetic example that illustrates how the annotations apply to the notes text.

[Figure: Annotation example]

The note_id field connects the note to its corresponding annotation rows. The start and end characters in the annotations file indicate the start and end of the span (highlighted in green), and the concept_id is the corresponding label for that span in the SNOMED CT clinical terminology.

External data

Use of external data is allowed in this competition provided it is freely and publicly available to all participants under a permissive open source license.

Note that MIMIC-III and MIMIC-IV are not considered external data, since your credentialing on PhysioNet gives you access to these datasets. You are allowed to train on them if desired.

Performance metric


For this challenge, participants may use a variety of tokenizers that lead to small variations in token start and end indices. To account for this variation, we use a character-level metric instead of a token-based one.

Performance for this challenge is evaluated according to a class macro-averaged character intersection-over-union (IoU). This is an accuracy metric, so a higher value is better.

The metric is defined as follows for character classification predictions $P$ and character ground truth classifications $G$:

$$ \text{IoU}_\text{class} = \frac{\left| P_\text{class}^\text{char} \cap G_\text{class}^\text{char} \right|}{\left| P_\text{class}^\text{char} \cup G_\text{class}^\text{char} \right|} $$

$$ \text{macro IoU} = \frac{\sum_{\text{class} \in P \cup G} \text{IoU}_\text{class}}{N_{\text{class} \in P \cup G}} $$

where $P_\text{class}^\text{char}$ is the set of characters in all predicted spans for a given class, $G_\text{class}^\text{char}$ is the set of characters in all ground truth spans for that class, and $\text{class} \in P \cup G$ ranges over the classes present in either the ground truth or the predicted spans.

Note that the predicted concept ID must match exactly—relationships between concepts are not taken into account for scoring purposes.
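For local evaluation you should rely on the provided scoring.py, but as a readability aid here is a small sketch of the metric exactly as defined above (our own implementation, not the official scorer):

```python
from collections import defaultdict

def macro_char_iou(predicted, ground_truth):
    """Class macro-averaged character IoU.

    Both arguments are iterables of (note_id, start, end, concept_id) spans.
    """
    pred_chars = defaultdict(set)
    true_chars = defaultdict(set)
    for note_id, start, end, concept_id in predicted:
        pred_chars[concept_id].update((note_id, i) for i in range(start, end))
    for note_id, start, end, concept_id in ground_truth:
        true_chars[concept_id].update((note_id, i) for i in range(start, end))

    # Average IoU over every class present in either predictions or ground truth.
    classes = set(pred_chars) | set(true_chars)
    total = sum(
        len(pred_chars[c] & true_chars[c]) / len(pred_chars[c] | true_chars[c])
        for c in classes
    )
    return total / len(classes) if classes else 0.0
```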

Submission format


This is a code execution challenge! Rather than submitting your predicted labels, you'll package everything needed to do inference and submit that for containerized execution.

Your code submission must contain a main.py script that reads in the notes in the test set, generates predicted span classifications for each note, and outputs a single submission.csv containing the predicted spans for all notes in the test set.
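In outline, main.py might look something like the following; the test-set path and the predict_spans call are placeholders for your own pipeline, and the exact interface is described in the submission format docs linked below.

```python
import pandas as pd

def predict_spans(text: str) -> list[tuple[int, int, int]]:
    """Placeholder for your model: return (start, end, concept_id) spans."""
    return []

notes = pd.read_csv("data/test_notes.csv")  # assumed location of the test notes

rows = [
    {"note_id": note_id, "start": start, "end": end, "concept_id": concept_id}
    for note_id, text in zip(notes["note_id"], notes["text"])
    for start, end, concept_id in predict_spans(text)
]

pd.DataFrame(rows, columns=["note_id", "start", "end", "concept_id"]).to_csv(
    "submission.csv", index=False
)
```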

See details on the submission format and process here.

Good luck!


If you're wondering how to get started, check out the benchmark blog post!

Good luck and enjoy this problem! If you have any questions you can always visit the user forum!