SNOMED CT Entity Linking Challenge

Link spans of text in clinical notes to concepts in the SNOMED CT clinical terminology.

$25,000 in prizes
mar 2024
553 joined

SNOMED CT Resources

SNOMED CT is a comprehensive, multilingual clinical healthcare terminology. The SNOMED CT logical model defines the way in which each type of SNOMED CT component and derivative is related and represented. The core component types in SNOMED CT are concepts, descriptions and relationships.
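To make the three component types concrete, here is a minimal Python sketch of how they relate. The field names loosely follow a subset of the RF2 release file columns. The classes, the `parents_of` helper, and the toy rows are assumptions made for illustration only and are not part of any official library.

```python
from dataclasses import dataclass

IS_A = "116680003"               # relationship type: "Is a"
FSN = "900000000000003001"       # description type: fully specified name
SYNONYM = "900000000000013009"   # description type: synonym


@dataclass(frozen=True)
class Concept:
    id: str
    active: bool


@dataclass(frozen=True)
class Description:
    concept_id: str   # the concept this term describes
    type_id: str      # FSN or SYNONYM
    term: str


@dataclass(frozen=True)
class Relationship:
    source_id: str
    type_id: str
    destination_id: str


def parents_of(concept_id, relationships):
    """Return the destinations of the concept's "Is a" relationships."""
    return {
        r.destination_id
        for r in relationships
        if r.source_id == concept_id and r.type_id == IS_A
    }


# Toy rows for illustration only.
concepts = [Concept("73211009", True), Concept("404684003", True)]
descriptions = [
    Description("73211009", FSN, "Diabetes mellitus (disorder)"),
    Description("73211009", SYNONYM, "Diabetes mellitus"),
]
# Simplified edge: the real hierarchy has intermediate concepts between these two.
relationships = [Relationship("73211009", IS_A, "404684003")]

print(parents_of("73211009", relationships))
```

One concept can carry several descriptions (a fully specified name plus synonyms), while relationships link concepts to each other, which is what makes hierarchy queries possible.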

The official SNOMED International Confluence site contains a wealth of information about the terminology. A good place to start if you’re new to it is the SNOMED CT Starter Guide.

Working with SNOMED CT

There are a few different ways you can work with SNOMED CT. Below is a non-exhaustive list of the options.

Option 1: In the browser

If you want to explore SNOMED CT from a web browser, you can use either the official SNOMED CT Browser or the Shrimp Browser, provided by CSIRO. The SNOMED CT Browser and Shrimp Browser are both backed by an ontology server, which is how they query the terminology.

Option 2: Running your own ontology server with Snowstorm

Running your own ontology server will enable you to query the terminology locally using the Expression Constraint Language (ECL).

The official SNOMED International terminology server is Snowstorm, which you can install from the project’s GitHub repository. SNOMED International also provides Snowstorm Lite, a lightweight version of the terminology server that runs locally in a Docker container. Pull the image from SNOMED International’s public Docker repository (snomedinternational/snowstorm-lite:latest) and run it. You can then load the terminology for this competition with the following curl command:

curl -u admin:YOUR_CHOSEN_PASSWORD \
    --form file=@SnomedCT_InternationalRF2_PRODUCTION_20230531T120000Z.zip \
    --form version-uri="http://snomed.info/sct/900000000000207008/version/20230531" \
    http://localhost:8085/fhir-admin/load-package

Once loaded, you can retrieve a concept (e.g. 73211009) as follows:

curl --location 'http://localhost:8085/fhir/CodeSystem/$lookup?code=73211009&system=http%3A%2F%2Fsnomed.info%2Fsct' 
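If you would rather issue the lookup from Python, the sketch below builds the same URL with the standard library. The function names `build_lookup_url` and `fetch_json` are hypothetical helpers, and the default base URL assumes the local Snowstorm Lite setup described above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SNOMED_SYSTEM = "http://snomed.info/sct"


def build_lookup_url(code, base_url="http://localhost:8085/fhir"):
    """Build a CodeSystem/$lookup URL with a URL-encoded query string."""
    query = urlencode({"code": code, "system": SNOMED_SYSTEM})
    return f"{base_url}/CodeSystem/$lookup?{query}"


def fetch_json(url):
    """GET the URL and decode the JSON body (requires a running server)."""
    with urlopen(url) as response:
        return json.load(response)


lookup_url = build_lookup_url("73211009")
print(lookup_url)

# With the server running, the concept could then be fetched like this:
# result = fetch_json(lookup_url)
```

Building the query with `urlencode` avoids hand-escaping the system URI.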

To retrieve the entire set of in-scope concepts for this challenge, send the following ECL query:

curl --location --get 'http://localhost:8085/fhir/ValueSet/$expand' \
    --data-urlencode 'url=http://snomed.info/sct?fhir_vs=ecl/<< 71388002 |Procedure (procedure)| OR << 123037004 |Body structure (body structure)| OR << 404684003 |Clinical finding (finding)|'
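The ECL expression contains spaces, pipes, and angle brackets, so it needs to be URL-encoded before it is sent. The Python sketch below shows one way to build the request URL and to pull codes out of a FHIR ValueSet expansion. `build_expand_url` and `codes_from_expansion` are hypothetical helper names, and `sample_response` is a hand-written stand-in for a server response, not real output. A server may also return the expansion in pages, so a single response might not contain every concept.

```python
from urllib.parse import quote, urlencode

IN_SCOPE_ECL = (
    "<< 71388002 |Procedure (procedure)| "
    "OR << 123037004 |Body structure (body structure)| "
    "OR << 404684003 |Clinical finding (finding)|"
)


def build_expand_url(ecl, base_url="http://localhost:8085/fhir"):
    """Build a ValueSet/$expand URL for an implicit ECL value set."""
    value_set_url = "http://snomed.info/sct?fhir_vs=ecl/" + ecl
    # quote_via=quote encodes spaces as %20 rather than "+".
    query = urlencode({"url": value_set_url}, quote_via=quote)
    return f"{base_url}/ValueSet/$expand?{query}"


def codes_from_expansion(value_set):
    """Extract concept codes from a FHIR ValueSet expansion (a parsed JSON dict)."""
    contains = value_set.get("expansion", {}).get("contains", [])
    return [item["code"] for item in contains]


expand_url = build_expand_url(IN_SCOPE_ECL)

# Hand-written stand-in for a server response, for illustration only.
sample_response = {
    "resourceType": "ValueSet",
    "expansion": {"contains": [{"system": "http://snomed.info/sct", "code": "73211009"}]},
}
print(codes_from_expansion(sample_response))
```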

Option 3: snomed_graph Python library

snomed_graph is a simple Python library created for this challenge to help you explore and work with the terminology. This is an alternative to using a terminology server. You can fork or download the repo here. The sample notebook shows how you can browse and select concepts.

Loading the terminology and retrieving the in-scope concepts for the challenge is as simple as:

from snomed_graph import SnomedGraph 

SG = SnomedGraph.from_rf2("SnomedCT_InternationalRF2_PRODUCTION_20230531T120000Z_Challenge_Edition") 

SG.get_descendants(71388002) | SG.get_descendants(123037004) | SG.get_descendants(404684003) 
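Conceptually, collecting the in-scope concepts amounts to taking the union of the descendant sets of the three top-level concepts. The sketch below illustrates that idea with a plain depth-first traversal over toy "Is a" edges. It is not how snomed_graph is implemented. The function names and the small integer ids are hypothetical, and the `include_self` flag is there only because ECL's `<<` operator includes the starting concept itself.

```python
from collections import defaultdict


def build_children_index(is_a_edges):
    """Map each parent id to the set of its direct children."""
    children = defaultdict(set)
    for child, parent in is_a_edges:
        children[parent].add(child)
    return children


def descendants_of(concept_id, children, include_self=False):
    """Collect all transitive children of a concept with a depth-first walk."""
    seen = set()
    stack = [concept_id]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    if include_self:
        seen.add(concept_id)
    return seen


# Hypothetical (child, parent) edges; the ids are placeholders, not real SCTIDs.
toy_edges = [(2, 1), (3, 2), (4, 1), (5, 6)]
children = build_children_index(toy_edges)

in_scope = descendants_of(1, children) | descendants_of(6, children)
print(sorted(in_scope))
```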

Option 4: Creating a flat terminology CSV

Here is Python code to create a flat terminology CSV from the provided RF2 zip file.

from pathlib import Path
import pandas as pd


def load_snomed_df(data_path):
    """
    Create a SNOMED CT concept DataFrame.

    Derived from: https://github.com/CogStack/MedCAT/blob/master/medcat/utils/preprocess_snomed.py

    Returns:
        pandas.DataFrame: SNOMED CT concept DataFrame.
    """

    def _read_file_and_subset_to_active(filename):
        with open(filename, encoding="utf-8") as f:
            entities = [[n.strip() for n in line.split("\t")] for line in f]
            df = pd.DataFrame(entities[1:], columns=entities[0])
        return df[df.active == "1"]

    active_terms = _read_file_and_subset_to_active(
        data_path / "sct2_Concept_Snapshot_INT_20230531.txt"
    )
    active_descs = _read_file_and_subset_to_active(
        data_path / "sct2_Description_Snapshot-en_INT_20230531.txt"
    )

    df = pd.merge(active_terms, active_descs, left_on=["id"], right_on=["conceptId"], how="inner")[
        ["id_x", "term", "typeId"]
    ].rename(columns={"id_x": "concept_id", "term": "concept_name", "typeId": "name_type"})

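    # label fully specified names as "P" and synonyms as "A"
    df["name_type"] = df["name_type"].replace(
        ["900000000000003001", "900000000000013009"], ["P", "A"]
    )
    # copy so that adding a column below does not trigger a SettingWithCopyWarning
    active_snomed_df = df[df.name_type.isin(["P", "A"])].copy()

    # extract the trailing semantic tag, e.g. "disorder" from "Diabetes mellitus (disorder)"
    active_snomed_df["hierarchy"] = active_snomed_df["concept_name"].str.extract(
        r"\((\w+\s?.?\s?\w+.?\w+.?\w+.?)\)$"
    )
    # keep only rows whose name ends in a semantic tag
    active_snomed_df = active_snomed_df[active_snomed_df.hierarchy.notnull()].reset_index(drop=True)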
    return active_snomed_df


# unzip the terminology provided on the data download page and specify the path to the folder here
snomed_rf2_path = Path(
    "SnomedCT_InternationalRF2_PRODUCTION_20230531T120000Z_Challenge_Edition"
)

# load the SNOMED release
df = load_snomed_df(snomed_rf2_path / "Snapshot" / "Terminology")
df.shape[0]
>> 364323

concept_type_subset = [
    "procedure",                    # top level category
    "body structure",               # top level category
    "finding",                      # top level category
    "disorder",                     # child of finding
    "morphologic abnormality",      # child of body structure
    "regime/therapy",               # child of procedure
    "cell structure",               # child of body structure
]

filtered_df = df[
    (df.hierarchy.isin(concept_type_subset)) &   # Filter the SNOMED data to the selected concept types
    (df.name_type == "P")                        # Fully specified names only (i.e. one row per concept, drop synonyms)
].copy()

filtered_df.shape[0]
>> 218467

filtered_df.hierarchy.value_counts()
>> disorder                   83431
>> procedure                  55981
>> finding                    35545
>> body structure             34914
>> morphologic abnormality     4969
>> regime/therapy              3110
>> cell structure               517
>> Name: count, dtype: int64

filtered_df.drop("name_type", axis="columns", inplace=True)
filtered_df.to_csv("flattened_terminology.csv")
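Once the flat file exists, one simple way to use it is as a name-to-concept lookup table. The sketch below reads rows in the same column layout as the CSV written above (a leading unnamed index column, then concept_id, concept_name, hierarchy) and builds a lowercase index with the semantic tag stripped. `build_name_index` is a hypothetical helper and the in-memory rows are toy data. Exact string matching like this is only a naive baseline for illustration, not a recommended linking method.

```python
import csv
import io


def build_name_index(rows):
    """Map lowercase concept names (semantic tag removed) to concept ids."""
    index = {}
    for row in rows:
        name = row["concept_name"]
        suffix = f" ({row['hierarchy']})"
        if name.endswith(suffix):
            name = name[: -len(suffix)]
        # Keep ids as strings so they are never reinterpreted as numbers.
        index[name.lower()] = row["concept_id"]
    return index


# Toy stand-in for flattened_terminology.csv, for illustration only.
toy_csv = io.StringIO(
    ",concept_id,concept_name,hierarchy\n"
    "0,73211009,Diabetes mellitus (disorder),disorder\n"
    "1,22298006,Myocardial infarction (disorder),disorder\n"
)
name_index = build_name_index(csv.DictReader(toy_csv))
print(name_index.get("diabetes mellitus"))

# To read the real file instead:
# with open("flattened_terminology.csv", newline="", encoding="utf-8") as f:
#     name_index = build_name_index(csv.DictReader(f))
```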