SNOMED CT Entity Linking Benchmark

A benchmark for linking text in medical notes to entities in the SNOMED CT clinical terminology. #health


Problem description

This benchmark measures the ability of a model to link spans of text in clinical notes to specific concepts in the SNOMED Clinical Terms (CT) clinical terminology. The clinical notes in this benchmark come from MIMIC-IV-Note, which contains de-identified hospital discharge notes provided by the Beth Israel Deaconess Medical Center in Boston, MA. SNOMED CT, maintained by SNOMED International, is a "systematically organized computer-processable collection of medical terms providing codes, terms, synonyms and definitions used in clinical documentation and reporting" (source).

Overview


Detailed instructions appear below; here's a quick overview to get started:

  1. Obtain access to the benchmark dataset from PhysioNet and download it.
  2. Obtain access to the SNOMED CT taxonomy and get SNOMED CT working locally.
  3. Use the train_annotations.csv file and SNOMED CT to train your entity linking model. (You may also use any other publicly available data source you wish for unsupervised or supervised learning.)
  4. Use the scoring.py file to evaluate your model locally.
  5. Package and submit your code to run your model against the test set and get a leaderboard score.

Data


The features: MIMIC-IV-Note

MIMIC-IV is a large repository of multi-modal, clinical datasets hosted on the PhysioNet platform by the MIT Laboratory for Computational Physiology. The dataset we are using for this benchmark comes from MIMIC-IV-Note, which contains 331,794 de-identified hospital discharge notes provided by the Beth Israel Deaconess Medical Center in Boston, MA.

The terms of the provision of the MIMIC data to PhysioNet preclude third parties from re-publishing it. This means that you will need to access the notes for this benchmark via PhysioNet. Furthermore, the MIMIC license requires that users are registered for access and have undertaken a short training course on handling data from human subjects. Details of the access procedures that you’ll need to follow are given here.

Once you have been granted access, you will be able to download train_notes.csv from the PhysioNet page; this file contains the annotated set of medical notes in CSV format.

The training notes csv has the following fields:

  • note_id: a unique identifier
  • text: the hospital discharge note

A handy script to load the data into a sqlite database is provided here.
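
If you prefer not to use that script, the following minimal sketch does roughly the same thing. It assumes pandas and the standard-library sqlite3 module, and that train_notes.csv sits in the working directory; the database file name is just a placeholder.

```python
import sqlite3

import pandas as pd

# Minimal sketch (not the official helper script): load the downloaded
# train_notes.csv into a local SQLite database for convenient querying.
notes = pd.read_csv("train_notes.csv")  # columns: note_id, text

with sqlite3.connect("mimic_iv_notes.db") as conn:  # placeholder file name
    notes.to_sql("notes", conn, if_exists="replace", index=False)

    # Example query: fetch one note by its identifier.
    row = conn.execute(
        "SELECT text FROM notes WHERE note_id = ?", (notes.note_id.iloc[0],)
    ).fetchone()
    print(row[0][:200])
```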

The terminology: SNOMED CT

SNOMED CT (Systematized Nomenclature of Medicine Clinical Terms) is a systematically organized clinical terminology widely used in the healthcare industry to facilitate the accurate recording, sharing, and analysis of clinical data. Unlike simpler medical coding systems that assign numerical codes to diseases and treatments (primarily for billing and administrative purposes), SNOMED CT offers a more comprehensive and detailed approach to medical information.

A clinical terminology like SNOMED CT is designed to encompass a vast array of clinical concepts—including diseases, symptoms, procedures, and body structures—in a structured and standardized format. This allows for precise and unambiguous communication among healthcare providers and supports various clinical activities, including patient documentation, reporting, and decision-making.

Where to get SNOMED CT

This benchmark depends on using the November 2025 release of the SNOMED CT International Edition. Participants must independently download this release; a copy will not be provided. See the official release notes for the SNOMED CT November 2025 International Edition. If you already have access to SNOMED CT elsewhere, please make sure that you are working with the November 2025 International Edition and not any other versions or country editions/extensions.

Check out the SNOMED CT resources page for details on ways to work with this terminology. Use of SNOMED CT in this benchmark is governed by the terms of use described in the official rules.

Using SNOMED CT for entity linking

Unlike simple medical coding systems, SNOMED CT has been carefully curated to capture and represent medical knowledge and practice. By exploring the attributes of concepts and the relationships between them, we can arm our models with powerful prior knowledge about the clinical terms they will encounter. Here are three examples demonstrating how this knowledge can be extracted.

Terms

Every concept in SNOMED CT has a “fully specified name” (FSN). This is a descriptive term that is unique to the concept within the terminology. For example, the FSN of the concept:

4596009 |Laryngeal structure (body structure)|

is “Laryngeal structure (body structure)”. The FSN does not always read as natural language because it is intended to closely match the logical definition of the concept.

SNOMED CT also records other terms (“synonyms”) for each concept. These are alternative descriptions that have been deemed medically acceptable. For the above concept the synonyms are: “Laryngeal structure”, “Larynx structure” and “Larynx”. The presence of multiple descriptive terms for each concept is useful for training vector embeddings or simply for string-matching.
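
As an illustration, here is a hedged sketch of how the FSN and synonyms might be pulled from the RF2 description snapshot that ships with the International Edition. The file name below is hypothetical (the exact name depends on the release date), and the type identifiers for FSNs and synonyms are the standard RF2 values.

```python
import csv

import pandas as pd

# Illustrative sketch: read descriptions from the RF2 snapshot of the
# International Edition. The file name is a guess; adjust it to your release.
DESCRIPTION_FILE = "sct2_Description_Snapshot-en_INT_20251101.txt"  # hypothetical name
FSN_TYPE = "900000000000003001"      # fully specified name
SYNONYM_TYPE = "900000000000013009"  # synonym

# QUOTE_NONE avoids pandas misreading terms that contain quote characters.
descriptions = pd.read_csv(
    DESCRIPTION_FILE, sep="\t", dtype=str, quoting=csv.QUOTE_NONE
)
active = descriptions[descriptions.active == "1"]

concept_id = "4596009"  # Laryngeal structure (body structure)
rows = active[active.conceptId == concept_id]

fsn = rows[rows.typeId == FSN_TYPE].term.tolist()
synonyms = rows[rows.typeId == SYNONYM_TYPE].term.tolist()
print(fsn)       # ['Laryngeal structure (body structure)']
print(synonyms)  # e.g. ['Laryngeal structure', 'Larynx structure', 'Larynx']
```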

Hierarchy

Every concept in SNOMED CT exists somewhere in the concept hierarchy. SNOMED CT uses an attribute called “is a” to define the parent-child relationships. The parents and children of a concept can help us to understand it in the proper context. For example, the concept:

4596009 |Laryngeal structure (body structure)|

has the following (“is a”) parents:

  • 49928004 |Structure of anterior portion of neck (body structure)|
  • 303417002 |Larynx and/or tracheal structures (body structure)|
  • 716151000 |Structure of oropharynx and/or hypopharynx and/or larynx (body structure)|
  • 714323000 |Structure of organ in respiratory system (body structure)|

From this we learn that the concept in question is a body structure, that it forms part of the respiratory system, that it is located towards the front of the neck and that it is closely linked to the trachea. A concept’s grandparents and further ancestors can be used to discover broader information about a concept. As we navigate up the hierarchy, each “is a” relationship is like a logical statement about the target concept.
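
One way to put the hierarchy to work is to build a parent lookup from the RF2 relationship snapshot and walk it upwards. The sketch below assumes the standard RF2 column layout and the |Is a| attribute identifier 116680003; the file name is hypothetical.

```python
from collections import defaultdict

import pandas as pd

# Illustrative sketch: build a parent lookup from the RF2 relationship
# snapshot. The file name is a guess; adjust it to your release.
RELATIONSHIP_FILE = "sct2_Relationship_Snapshot_INT_20251101.txt"  # hypothetical name
IS_A = "116680003"  # the |Is a (attribute)| concept

rels = pd.read_csv(RELATIONSHIP_FILE, sep="\t", dtype=str)
is_a = rels[(rels.active == "1") & (rels.typeId == IS_A)]

parents = defaultdict(set)
for source, destination in zip(is_a.sourceId, is_a.destinationId):
    parents[source].add(destination)

def ancestors(concept_id: str) -> set[str]:
    """All transitive 'is a' ancestors of a concept."""
    seen: set[str] = set()
    stack = list(parents[concept_id])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

print(parents["4596009"])  # direct parents of |Laryngeal structure|
```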

The terminology also contains concepts and other content marked as inactive, but the use of these inactive concepts is outside the scope of this benchmark.

Defining relationships

Every concept is modelled with zero or more sets of defining relationships, each relationship taking the form of an (attribute, value) pair. Together with the “is a” relationships, these sets constitute a unique (both necessary and sufficient) definition of the concept.

Consider the concept:

274317003 |Laryngoscopic biopsy larynx (procedure)|

Its defining relationships are:

  • 260686004 |Method (attribute)| → 129314006 |Biopsy - action (qualifier value)|
  • 405813007 |Procedure site - Direct (attribute)| → 4596009 |Laryngeal structure (body structure)|
  • 425391005 |Using access device (attribute)| → 44738004 |Laryngoscope, device (physical object)|


Suppose that an algorithm encountered the sentence: “Patient referred for a biopsy to investigate potential swelling in upper larynx”. The algorithm matches the span “larynx” to 4596009 |Laryngeal structure (body structure)| and is now considering whether to match the span “biopsy” to a concept.

An obvious choice might be 86273004 |Biopsy (procedure)|, but we also have the semantically similar 129314006 |Biopsy - action (qualifier value)|. By using the defining relationships above – combined with the presence of the span “larynx” in close proximity to the span “biopsy” – the algorithm could make a pretty good guess that the relevant concept here is 274317003 |Laryngoscopic biopsy larynx (procedure)| – which would be the correct annotation in this instance.
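
The sketch below shows one simple heuristic along these lines, not the benchmark's prescribed method: score each candidate concept by how many of its defining relationship values match concepts already linked nearby. It builds on the `rels` DataFrame and `IS_A` constant from the hierarchy sketch above.

```python
# Hypothetical heuristic: prefer candidates whose defining relationship
# values overlap with concepts already linked elsewhere in the sentence.
defining = rels[(rels.active == "1") & (rels.typeId != IS_A)]

def relationship_values(concept_id: str) -> set[str]:
    return set(defining[defining.sourceId == concept_id].destinationId)

context_concepts = {"4596009"}  # |Laryngeal structure| already linked from "larynx"
candidates = {
    "86273004": "Biopsy (procedure)",
    "129314006": "Biopsy - action (qualifier value)",
    "274317003": "Laryngoscopic biopsy larynx (procedure)",
}

scores = {
    sctid: len(relationship_values(sctid) & context_concepts)
    for sctid in candidates
}
print(max(scores, key=scores.get))  # expected to favour 274317003
```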

The labels: Clinical note span annotations

For this benchmark, the SNOMED International team annotated nearly 300 discharge notes from MIMIC-IV-Note. Of the roughly 219K SNOMED CT concepts in the subset of the terminology that is in scope for this benchmark, approximately 7,000 appear across these documents.

The annotations have been divided into a training dataset with annotations for approximately 275 clinical notes and a test dataset with annotations for approximately 25 clinical notes, against which your submissions will be evaluated. Note that there are concepts that appear in the test dataset but not in the training dataset. Your models should appropriately leverage the SNOMED CT clinical terminology and the relationships contained therein to generalize to concepts not seen in the training dataset.

The train_annotations.csv file is available on the PhysioNet page.

The training annotations csv contains the following fields:

  • note_id: a string identifying the discharge note to which the annotation belongs.
  • start: an integer indicating the character start index of the annotation span in the document.
  • end: an integer indicating the character end index of the annotation span in the document.
  • concept_id: an integer corresponding to the SNOMED CT Concept ID (sctid).
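
The following short sketch shows how these fields line up with the note text: join the annotations to train_notes.csv by note_id and slice each note with the character offsets (assuming the end index is exclusive).

```python
import pandas as pd

# Sketch: recover the annotated surface text from the character offsets.
notes = pd.read_csv("train_notes.csv").set_index("note_id")
annotations = pd.read_csv("train_annotations.csv")

annotations["span_text"] = [
    # Assumes `end` is an exclusive character index.
    notes.loc[row.note_id, "text"][row.start:row.end]
    for row in annotations.itertuples()
]
print(annotations.head())
```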

Concepts in scope

The data were annotated using the November 2025 International Release of SNOMED CT. Annotators were instructed to look only for concepts which appear in the following sub-hierarchies of the terminology:

  • Procedures
  • Body Structures
  • Clinical Findings

This equates to around 219K concepts. Notably absent from the above are administrative concepts, situational concepts and substances (the latter includes all drugs and medications).
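
Building on the `ancestors` helper from the hierarchy sketch above, a hedged way to restrict candidate concepts to these sub-hierarchies is to test whether a concept's ancestor set reaches one of the three root concepts. The root SCTIDs below are the commonly cited ones; confirm them against the release you download.

```python
# Sketch: keep only concepts under the three in-scope sub-hierarchies.
IN_SCOPE_ROOTS = {
    "404684003",  # Clinical finding (finding)
    "71388002",   # Procedure (procedure)
    "123037004",  # Body structure (body structure)
}

def in_scope(concept_id: str) -> bool:
    lineage = ancestors(concept_id) | {concept_id}
    return bool(lineage & IN_SCOPE_ROOTS)

print(in_scope("4596009"))  # True: a body structure
```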

Example

Below is a synthetic example that illustrates how the annotations apply to the notes text.

[Figure: Annotation example]

The note_id field connects the note to its corresponding annotation rows. The start and end characters in the annotations file indicate the start and end of the span (highlighted in green) and the concept_id is the corresponding label for that span in the SNOMED CT clinical terminology.

External data

Use of external data is allowed in this benchmark.

Performance metric


For this benchmark, participants may use a variety of tokenizers, which can lead to small variations in token start and end indices. To account for this variation, we use character-level metrics instead of token-based ones.

Performance for this benchmark is evaluated according to multiple character intersection-over-union (IoU) metrics. These are accuracy metrics, so a higher value is better.

For a given set of character classification predictions $P$ and character ground truth classifications $G$, the character IoU for a given class category is defined as:

$$ \text{IoU}_\text{class} = \frac{|P_\text{class}^\text{char} \cap G_\text{class}^\text{char}|}{|P_\text{class}^\text{char} \cup G_\text{class}^\text{char}|} $$

where $P_\text{class}^\text{char}$ is the set of characters in all predicted spans for a given class category, $G_\text{class}^\text{char}$ is the set of characters in all ground truth spans for that class category, and $\text{classes} \in P \cup G$ denotes the set of categories present in either the ground truth or the predicted spans.

Macro character IoU

The macro character IoU metric computes an unweighted average of the character IoU for each class:

$$ \text{macro IoU} = \frac{ \sum_{\text{classes} \in P \cup G} \text{IoU}_\text{class}}{N_{\text{classes} \in P \cup G}} $$

Span-weighted character IoU

The span-weighted character IoU metric computes a weighted average of the character IoU for each class, where the weight for each class is determined by the number of ground truth spans for that class:

$$ \text{span-weighted IoU} = \frac{ \sum_{\text{classes}} \text{IoU}_\text{class} \cdot N_{\text{spans}_\text{class}}}{N_{\text{spans} \in G}} $$

Note that the predicted concept ID must match exactly; relationships between concepts are not taken into account for scoring purposes.
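
For local evaluation you should rely on the provided scoring.py, but as a rough illustration of the macro metric, the sketch below computes macro character IoU from two DataFrames with note_id, start, end, and concept_id columns (end assumed exclusive). It is an unofficial sketch, not the reference implementation.

```python
from collections import defaultdict

import pandas as pd

def char_sets(df: pd.DataFrame) -> dict:
    """Map each concept_id to its set of (note_id, character index) pairs."""
    chars = defaultdict(set)
    for row in df.itertuples():
        chars[row.concept_id].update(
            (row.note_id, i) for i in range(row.start, row.end)
        )
    return chars

def macro_char_iou(predictions: pd.DataFrame, ground_truth: pd.DataFrame) -> float:
    """Unofficial sketch of the macro character IoU metric."""
    pred, truth = char_sets(predictions), char_sets(ground_truth)
    classes = set(pred) | set(truth)
    ious = [
        len(pred[c] & truth[c]) / len(pred[c] | truth[c])
        for c in classes
    ]
    return sum(ious) / len(classes)
```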

Submission format


This benchmark leverages code execution! Rather than submitting your predicted labels, you'll package everything needed to do inference and submit that for containerized execution.

Your code submission must contain a main.py script that reads in the notes in the test set, generates predicted span classifications for each note, and outputs a single submission.csv containing the predicted spans for all notes in the test set.

See details on the submission format and process here.
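
As a starting point, here is a minimal skeleton of what such a main.py might look like. The input and output paths are placeholders and the prediction logic is a stub; check the linked submission instructions for the exact runtime layout.

```python
"""Minimal skeleton for a main.py submission. Paths below are placeholders;
confirm the actual runtime layout in the submission instructions."""
import pandas as pd

def predict_spans(note_id: str, text: str) -> list[dict]:
    # Replace with real inference. Each prediction needs a note_id,
    # character start/end offsets, and a SNOMED CT concept_id.
    return []

def main() -> None:
    notes = pd.read_csv("data/test_notes.csv")  # placeholder path
    rows = []
    for note in notes.itertuples():
        rows.extend(predict_spans(note.note_id, note.text))
    pd.DataFrame(rows, columns=["note_id", "start", "end", "concept_id"]).to_csv(
        "submission.csv", index=False
    )

if __name__ == "__main__":
    main()
```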

Good luck!


If you're wondering how to get started, check out the benchmark blog post!

Good luck and enjoy this problem! If you have any questions you can always visit the user forum!