TissueNet: Detect Lesions in Cervical Biopsies

The objective of this challenge is to detect the most severe epithelial lesions of the uterine cervix. The challenge brings together thousands of microscopic slides from different medical centers across France.

€25,000 in prizes
October 2020
540 joined

Problem description

The data for this challenge includes thousands of microscopic slides of uterine cervical tissue from medical centers across France. Your objective is to classify each image according to the most severe category of epithelial lesion present in the sample. The classes are defined as follows:

  • 0: benign (normal or subnormal)
  • 1: low malignant potential (low grade squamous intraepithelial lesion)
  • 2: high malignant potential (high grade squamous intraepithelial lesion)
  • 3: invasive cancer (invasive squamous carcinoma)

The features in this dataset


This dataset consists of high resolution images of microscopic slides created from cervical biopsies. Additionally, you are given slide metadata as well as annotations for the training set that outline some (but not necessarily all) of the lesions present on a slide.

Training set

Images

The training set images are provided in three formats: native whole slide image formats (720 GB), pyramidal TIFs (928 GB), and downsampled JPEGs (752 MB). The JPEGs likely don't contain sufficient information for diagnosis at the 32x downsampled resolution; however, they should be a helpful reference for getting started. We also provide just the annotated regions of each slide at full resolution to further support prototyping.

All training images are hosted in a public S3 bucket in the train folder. The AWS CLI is useful for downloading images (you will probably need the --no-sign-request argument, since the buckets allow anonymous access). See train_metadata.csv for filepaths by bucket region and file format. CSVs are available on the data download page.

The data is replicated to buckets hosted in the US, the EU (Germany), and Asia (Singapore). Pick the bucket closest to your machine geographically to maximize transfer speeds.
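As a sketch of how region selection and anonymous download fit together, the snippet below builds an `aws s3 cp` command from a metadata row. The column names (`us_tif_url`, `eu_tif_url`, `asia_tif_url`) come from train_metadata.csv; the bucket names and filename shown are hypothetical placeholders, not the real paths.

```python
# Sketch: pick the bucket region nearest to your machine and build the
# AWS CLI download command. All URL values below are hypothetical examples.

def download_command(row: dict, region: str = "eu") -> str:
    """Build an `aws s3 cp` command for one slide, using anonymous access."""
    url = row[f"{region}_tif_url"]
    return f"aws s3 cp {url} . --no-sign-request"

row = {
    "filename": "example_slide.tif",                              # hypothetical
    "us_tif_url": "s3://bucket-us/train/example_slide.tif",       # hypothetical
    "eu_tif_url": "s3://bucket-eu/train/example_slide.tif",       # hypothetical
    "asia_tif_url": "s3://bucket-asia/train/example_slide.tif",   # hypothetical
}
print(download_command(row, region="eu"))
# -> aws s3 cp s3://bucket-eu/train/example_slide.tif . --no-sign-request
```

Swap `_tif_url` for `_wsi_url` or `_jpeg_url` to fetch the other formats listed in the metadata.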

The wsi_images folder contains images in their native digital pathology format, known as whole slide images (WSIs). A WSI is a digital representation of a microscopic slide at high levels of magnification. These images are extremely high resolution (e.g. 150,000 x 85,000 pixels). Software such as OpenSlide allows you to view different regions of the slide at varying zoom levels as you would with a real microscope.

There are four native WSI image formats in the dataset:

  • .mrxs (MIRAX)
  • .svs (Aperio)
  • .ndpi (Hamamatsu)
  • .tif (pyramidal TIF)

The tif_images folder contains the training images as a standardized set of compressed images in pyramidal TIF format. These images are compressed using JPEG Q=75. The pyramidal TIF format maintains a sufficient level of detail for pathologists to perform diagnoses while enabling smaller file sizes and easier loading with actively developed Python libraries such as PyVips.

The downsampled_images folder contains JPEG versions of the images that have been downsampled by 32x. While these likely do not contain sufficient information for diagnosis at this resolution, they may be particularly helpful for prototyping model pipelines.
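When prototyping against the downsampled JPEGs, you will often need to map full-resolution pixel coordinates (e.g. from annotations) onto the smaller image. A minimal sketch of that arithmetic, assuming a uniform 32x downsample factor per side:

```python
DOWNSAMPLE = 32  # the downsampled JPEGs are 32x smaller per side

def to_downsampled(x: float, y: float, factor: int = DOWNSAMPLE) -> tuple:
    """Map full-resolution pixel coordinates onto the downsampled JPEG."""
    return round(x / factor), round(y / factor)

print(to_downsampled(48_000, 16_000))  # -> (1500, 500)
```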

The annotated_regions folder contains JPEG versions of the annotated portions of the slides at full resolution. Each annotated region is defined by geometry in train_annotations.csv; see the annotations section below for more information. Combined with the downsampled images, these regions give you a great starting point without downloading sizeable amounts of data.

Check out the Data Resources page for tips and tricks on interacting with WSI and pyramidal TIF formats.

Annotations

The images are very large and contain a mixture of lesioned and otherwise normal tissue. Pathologists have annotated images to point out regions that represent lesions.

The train_annotations.csv contains the following information:

  • annotation_id - a unique identifier for each annotation
  • filename - the slide image corresponding to each annotation; each image contains multiple annotations
  • geometry - x, y coordinates of the annotation in WKT format; all geometries are closed rectangles. NOTE: x,y coordinates assume origin is bottom left corner of the image.
  • annotation_class - the class of the annotated tissue
  • us_jpeg_url - file location of the annotated region JPEG in the public S3 bucket in the US East region
  • eu_jpeg_url - file location of the annotated region JPEG in the public S3 bucket in the EU region
  • asia_jpeg_url - file location of the annotated region JPEG in the public S3 bucket in the Asia Pacific region

Annotations are only provided for the training set and will not be provided for the test set.

When working with the annotations, it's important to keep in mind the following points:

  • The annotated regions do not necessarily include all lesioned tissue in the slide. An unannotated region is not necessarily normal tissue.
  • The whole image class label and the annotation class label do not necessarily match. The annotated regions may be the image's labeled class or below. For instance, an image labeled as a class 2 lesion could have annotations representing class 0, 1, or 2. At least some of the annotated regions will represent the most severe/labeled class. All annotations on a slide with label 0 will be normal tissue.
  • The lesion may fall entirely within the square, or may extend beyond the annotation boundaries.
  • All annotations are a fixed size of 300x300 micrometers. As images have different resolutions in pixels/micrometer, annotations will have different dimensions in terms of pixels.
  • When using the geometries, it is important to know the origin of the coordinate system. See below for more detail.
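The fixed physical size of annotations means their pixel dimensions depend on each slide's resolution. A small sketch of the conversion, using the `resolution` column (microns per pixel) from train_metadata.csv; the example resolution values are illustrative, not taken from the dataset:

```python
ANNOTATION_SIZE_UM = 300.0  # annotations are a fixed 300 x 300 micrometers

def annotation_size_px(resolution_um_per_px: float) -> int:
    """Side length of an annotation in pixels for a given slide resolution."""
    return round(ANNOTATION_SIZE_UM / resolution_um_per_px)

# Illustrative values: 40x slides are often around 0.25 um/pixel and
# 20x slides around 0.5 um/pixel; use the `resolution` metadata column.
print(annotation_size_px(0.25))  # -> 1200
print(annotation_size_px(0.5))   # -> 600
```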

Annotation coordinate systems

Image processing software may assume the image origin is either the bottom left or the top left. The WKT shapes that we provide as annotations (geometry column in train_annotations.csv) are relative to the bottom left being the origin (0, 0).

If your imaging software uses the top-left as the origin (0, 0), like pyvips does, you will need to take this into account when using the annotations. Before doing any other calculations, you should recalculate the points for the annotations by changing the y-value for each of the points in the WKT to y_reflected = image_height - y.
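The reflection described above can be sketched as a small string transform on the WKT geometry. This is a minimal regex-based version, assuming the rectangle coordinates are plain numeric pairs; the polygon shown is a hypothetical example, not a real annotation:

```python
import re

def flip_wkt_y(wkt: str, image_height: int) -> str:
    """Reflect a WKT polygon's y-coordinates so the origin moves from the
    bottom-left (as in train_annotations.csv) to the top-left (as in pyvips)."""
    def flip(match: re.Match) -> str:
        x, y = float(match.group(1)), float(match.group(2))
        return f"{x:g} {image_height - y:g}"
    # Match each "x y" coordinate pair and rewrite its y-value.
    return re.sub(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)", flip, wkt)

# Hypothetical rectangle on a 10,000-pixel-tall slide:
wkt = "POLYGON ((100 200, 400 200, 400 500, 100 500, 100 200))"
print(flip_wkt_y(wkt, image_height=10_000))
# -> POLYGON ((100 9800, 400 9800, 400 9500, 100 9500, 100 9800))
```

Libraries such as Shapely can do the same with an affine transform, but the string version keeps the dependency footprint small.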

We recommend you confirm that the annotation coordinates select reasonable areas (tissue as opposed to non-tissue areas of the slide) if you are going to use them. You can also compare your identified regions with the extracted jpeg versions we make available for download.

Metadata

In addition to the images and annotation CSV, train_metadata.csv provides the following metadata fields for each image:

  • filename - the filename for each image; this connects the image metadata to the image annotations
  • width - the width of the image in pixels
  • height - the height of the image in pixels
  • resolution - the resolution of the image in microns (μm) per pixel
  • magnification - the equivalent magnification of the image (40x or 20x)
  • tif_cksum - the result of running the unix cksum command on the TIF image (useful to validate data is not corrupted)
  • tif_size - the file size in bytes of the TIF image (useful to avoid downloading large files or validate data is not corrupted)
  • us_wsi_url - file location of the WSI in the public S3 bucket in the US East region
  • us_tif_url - file location of the pyramidal TIF in the public S3 bucket in the US East region
  • us_jpeg_url - file location of the downsampled JPEG in the public S3 bucket in the US East region
  • eu_wsi_url - file location of the WSI in the public S3 bucket in the EU region
  • eu_tif_url - file location of the pyramidal TIF in the public S3 bucket in the EU region
  • eu_jpeg_url - file location of the downsampled JPEG in the public S3 bucket in the EU region
  • asia_wsi_url - file location of the WSI in the public S3 bucket in the Asia Pacific region
  • asia_tif_url - file location of the pyramidal TIF in the public S3 bucket in the Asia Pacific region
  • asia_jpeg_url - file location of the downsampled JPEG in the public S3 bucket in the Asia Pacific region

Note that the metadata CSV dimensions (width and height) aren't 100% reliable, but are generally close to the actual image dimensions. If you need the exact dimensions, load the file with PyVips or OpenSlide and read them directly. Check out the Data Resources page for a more detailed example.

Labels

Anatomical pathologists have labeled each slide according to four classes of lesion severity as classified by the World Health Organization:

  • 0: Normal or subnormal
  • 1: Low grade squamous intraepithelial lesion
  • 2: High grade squamous intraepithelial lesion
  • 3: Invasive squamous carcinoma

train_labels.csv provides the one-hot encoded class for each image. Your goal is to predict the class for each image in the test set. Remember that you are predicting at the slide level, not the annotation level.
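To recover a single class index per slide from the one-hot encoding, you can take the column whose indicator is 1. A sketch using only the standard library; the two rows and the column names ("filename" plus "0".."3") are assumptions about the CSV layout, not verified against the actual file:

```python
import csv
import io

# Hypothetical two-row excerpt shaped like train_labels.csv.
labels_csv = io.StringIO(
    "filename,0,1,2,3\n"
    "slide_a.tif,0,0,1,0\n"
    "slide_b.tif,1,0,0,0\n"
)

labels = {}
for row in csv.DictReader(labels_csv):
    # The slide's class is the column (0-3) whose one-hot indicator equals 1.
    labels[row["filename"]] = int(next(c for c in ("0", "1", "2", "3") if row[c] == "1"))

print(labels)  # -> {'slide_a.tif': 2, 'slide_b.tif': 0}
```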

Unlabeled set

There is an additional set of unlabeled data available in the unlabeled folder in the S3 bucket. A corresponding unlabeled_metadata.csv follows the same structure as described above. There are no labels or annotations for these images; however, you are free to use them to improve your models. The images are available in WSI, pyramidal TIF, and downsampled JPEG formats.

Image data example

The images vary substantially in terms of the amount of white space, coloration, and the number of slides or samples pictured. Your solution will need to handle these variations. Here are a few thumbnails that demonstrate what the whole slide images can look like:

Biopsy Example 1

Biopsy Example 2

Biopsy Example 3

And here are examples of annotated regions at full resolution:

Annotation Example 1 Annotation Example 2 Annotation Example 3

Test set

The test set images are only accessible in the runtime container and are mounted at inference/data. Test set images are only in pyramidal TIF format.

Along with the images, you are also provided test_metadata.csv which includes the following columns:

  • filename - the filename for each image
  • width - the width of the image in pixels
  • height - the height of the image in pixels
  • resolution - the resolution of the image in microns (μm) per pixel
  • magnification - the equivalent magnification of the image (40x or 20x)

The metadata file is only accessible in the runtime container. There is no annotations file for the test set.

Performance metric


Performance is evaluated according to a custom metric devised by a panel of expert pathologists. The score for each prediction equals 1 minus the error, where the error is defined by the following values set by an expert consensus within the scientific council:

Error table

                  Class 0 (pred)  Class 1 (pred)  Class 2 (pred)  Class 3 (pred)
Class 0 (actual)        0.0             0.1             0.7             1.0
Class 1 (actual)        0.1             0.0             0.3             0.7
Class 2 (actual)        0.7             0.3             0.0             0.3
Class 3 (actual)        1.0             0.7             0.3             0.0

The total score will be the average across all predictions. Note that the metric is symmetric, e.g. predicting class 3 when it's actually class 0 produces the same error as predicting class 0 when it's actually class 3.
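The scoring rule above is straightforward to implement locally for validating your models. A minimal sketch, with the error table indexed as ERROR[actual][predicted]:

```python
# The error table from above, indexed as ERROR[actual][predicted].
ERROR = [
    [0.0, 0.1, 0.7, 1.0],
    [0.1, 0.0, 0.3, 0.7],
    [0.7, 0.3, 0.0, 0.3],
    [1.0, 0.7, 0.3, 0.0],
]

def score(actual, predicted):
    """Average of (1 - error) over all predictions."""
    return sum(1.0 - ERROR[a][p] for a, p in zip(actual, predicted)) / len(actual)

print(score([0, 1, 2, 3], [0, 1, 2, 3]))  # perfect predictions -> 1.0
print(score([0, 3], [3, 0]))              # worst case -> 0.0
print(score([2, 1], [3, 2]))              # both off by one class: 1 - 0.3 each
```

Note the symmetry of the table: ERROR[a][p] == ERROR[p][a] for every pair, matching the statement above.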

Submission format and code submission


This is a code execution challenge! Rather than submitting your predicted labels, you'll package everything needed to do inference and submit that for containerized execution.

See details on the submission format and process here.

Good luck, and enjoy this challenge! If you have any questions you can always visit the user forum!