Digital pathologists must leverage specialized knowledge about medical pathology and the specific digital data formats used to represent microscope slides in their analyses. We've provided information about these domains to help you get started. If you want to dive deeper into the field of digital pathology, check the bottom of the page for additional references. Let's get started!
Pathologists examine portions of the affected tissue, structure, or organ under a microscope to diagnose melanoma and to predict the probability that the patient will relapse. Physicians also take clinical factors into account when assessing a patient's likelihood of relapse, including age, sex, and medical history of melanoma.
In order to make a prediction about the likelihood of relapse, pathologists rely primarily on two factors:
- Breslow depth: the thickness of the tumor measured in millimeters. Melanomas that are less than 1 millimeter thick are called "thin" melanomas and are usually associated with a good prognosis (>95% five-year survival rate). However, these melanomas also account for a large and poorly understood proportion of relapses and deaths. Melanomas of intermediate thickness (between 1 and 4 millimeters) have a higher risk of relapse, and are also not well understood [1]. Thicker melanomas (over 4 millimeters) have a 50% risk of relapse within 5 years.
- Ulceration: the presence of a lesion that has total loss of epidermal tissue.
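The thickness bands above can be written down as a small helper. This is only an illustrative sketch: the function name `breslow_category` and the handling of the exact 1 mm and 4 mm boundaries are assumptions, since the text says "less than 1", "between 1 and 4", and "over 4" without stating which band the boundary values belong to.

```python
def breslow_category(depth_mm):
    """Map a Breslow depth in millimeters to the band described in the text.

    Assumption: exactly 1 mm and exactly 4 mm are treated as "intermediate".
    """
    if depth_mm < 0:
        raise ValueError("Breslow depth cannot be negative")
    if depth_mm < 1.0:
        return "thin"
    if depth_mm <= 4.0:
        return "intermediate"
    return "thick"


for depth in (0.4, 2.5, 6.0):
    print(depth, breslow_category(depth))
```

A category like this is a coarse summary for exploration, not a substitute for the pathologist's measurement.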
Examples of Breslow depth measurements and ulceration appear in a set of 16 annotated slides from VisioMel (described below), which help finalists learn what pathologists pay attention to when looking at slides.
Our partners at VisioMel have provided a set of 16 annotated slides that help illustrate a pathologist's analysis. These slides also appear in the training data without any annotations. The annotations are available on the data download page in the file annotations.csv, which provides the geometries of the annotations in well-known text (WKT) and OpenCV representations, along with their respective labels. The origin for the WKT geometries is the bottom-left corner and the origin for the OpenCV geometries is the top-left corner. The page also contains visualized_annotations.pdf, which overlays these annotations on their respective TIFs.
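Because the two representations use different origins, moving a WKT geometry into image (top-left) coordinates means flipping the y axis. The sketch below is a minimal, assumption-laden illustration: the helper names are hypothetical, it only handles a simple POLYGON without holes, and it uses `image_height - y` for the flip (some pixel conventions use `image_height - 1 - y` instead). The column names in annotations.csv are not assumed here, so the example works on a plain WKT string.

```python
import re


def parse_wkt_polygon(wkt):
    """Parse the exterior ring of a simple WKT POLYGON into (x, y) tuples."""
    match = re.match(r"\s*POLYGON\s*\(\(([^()]*)\)", wkt, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("expected a simple WKT POLYGON")
    points = []
    for pair in match.group(1).split(","):
        x_str, y_str = pair.split()
        points.append((float(x_str), float(y_str)))
    return points


def flip_y(points, image_height):
    """Convert points between bottom-left and top-left origin conventions."""
    return [(x, image_height - y) for x, y in points]


polygon = parse_wkt_polygon("POLYGON ((0 0, 10 0, 10 5, 0 5, 0 0))")
print(flip_y(polygon, image_height=100))
```

The image height must come from the same resolution level that the geometry was drawn on.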
The slide images are very large and contain a mixture of lesioned and otherwise normal tissue. They vary substantially in terms of the amount of white space, coloration, and the number of slides or samples pictured. Your solution will need to handle these variations.
In looking at the annotated slides, it is important to keep a few points in mind:
- Pathologists have annotated images to point out regions that represent melanoma and normal areas. However, annotated regions do not necessarily include all lesioned tissue in the slide. An unannotated region is not necessarily normal tissue.
- The lesion may fall entirely within the square, or may extend beyond the annotation boundaries.
- As images have different resolutions in pixels/micrometer, annotations will have different dimensions in terms of pixels. When using the geometries, it is important to know the origin of the coordinate system.
- About two-thirds of skin melanomas occur in healthy skin free of blemishes or lesions. However, in 30% of cases, melanoma develops from an existing beauty spot or "nevus," a benign lesion. In this case, the malignant melanoma tissue and the healthy nevus tissue are closely associated.
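Since resolution differs between images, a length measured in pixels only becomes comparable across slides after converting it to physical units. The following sketch assumes you already know a slide's resolution in pixels per micrometer; the function name `pixels_to_mm` and the example values are hypothetical.

```python
def pixels_to_mm(length_px, pixels_per_um):
    """Convert a length in pixels to millimeters.

    pixels_per_um is the image resolution in pixels per micrometer
    (assumed to be known for the slide and level being measured).
    """
    if pixels_per_um <= 0:
        raise ValueError("resolution must be positive")
    length_um = length_px / pixels_per_um
    return length_um / 1000.0


# The same 4000-pixel length corresponds to different physical sizes
# at different resolutions.
print(pixels_to_mm(4000, pixels_per_um=4.0))
print(pixels_to_mm(4000, pixels_per_um=2.0))
```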
Artifacts in images
Some images in this challenge may have a few anomalies that participants might want to look out for. These quirks are not relevant to the task of predicting relapse.
- Some slides may be partially out of focus or highly zoomed.
- Some slides have black spots around the edges. These are from the slide holder used during the scanning process and are not relevant to the melanoma.
Working with whole slide images
Whole slide images (WSIs) are digital formats that allow glass slides to be viewed, managed, shared, and analyzed on a computer monitor. These are extremely high resolution images that are provided to you as pyramidal TIFs.
Pyramidal TIFs are a multi-resolution, tiled format. Each resolution is stored as a separate layer or "page" within the TIF. Page 0 contains the image at the full resolution of the WSI. Subsequent pages contain lower-resolution images: each layer is the previous layer downsampled by a factor of two. In other words, relative to the full resolution, the pages are downsampled 2x, 4x, 8x, 16x, and so on.
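The halving rule makes it easy to estimate the size of each page before opening it. The sketch below assumes exact factor-of-two downsampling with integer truncation, and assumes 8-bit pixels with three bands for the memory estimate; real files may round page sizes slightly differently. The helper names are hypothetical, and the example dimensions are those of the slide used in the PyVips snippet on this page.

```python
def page_dimensions(full_width, full_height, page):
    """Approximate (width, height) of a pyramid page, assuming each page
    is the previous one downsampled by a factor of two."""
    if page < 0:
        raise ValueError("page must be non-negative")
    factor = 2 ** page
    return full_width // factor, full_height // factor


def uncompressed_bytes(width, height, bands=3, bytes_per_sample=1):
    """Memory needed to hold the decoded pixels (assumes 8-bit RGB by default)."""
    return width * height * bands * bytes_per_sample


for page in range(6):
    width, height = page_dimensions(67456, 84224, page)
    gigabytes = uncompressed_bytes(width, height) / 1e9
    print(f"page {page}: {width} x {height}, about {gigabytes:.2f} GB decoded")
```

Under these assumptions the full-resolution page needs roughly 17 GB once decoded, while page 4 needs well under 100 MB.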
The full-resolution slide images are enormous, and you may not be able to load even a single full-resolution image into memory. You will need to get creative in your use of multiple scales! For example, one of the first things you will notice browsing the images is that much of the slide is blank space with no tissue whatsoever. It would be a waste of your limited compute time to analyze those parts at full resolution. One simple solution is to segment the image into tissue/non-tissue at a lower resolution and analyze only the tissue regions at higher resolutions.
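As a rough illustration of the tissue/non-tissue idea, the sketch below thresholds a low-resolution RGB array on brightness and lists the tiles that contain enough non-white pixels. Everything here is an assumption for illustration: the function names, the threshold of 220, the 10% minimum tissue fraction, and the use of a plain NumPy array as the thumbnail. A simple brightness threshold will also pick up dark artifacts such as slide-holder marks, so a real pipeline would likely need something more robust.

```python
import numpy as np


def tissue_mask(thumbnail, white_threshold=220):
    """Boolean mask that is True where a pixel is darker than the
    near-white slide background (threshold is an assumed value)."""
    gray = thumbnail[..., :3].mean(axis=-1)
    return gray < white_threshold


def tissue_tiles(mask, tile_size, min_fraction=0.1):
    """Return (x, y, width, height) boxes, in low-resolution pixels, for
    tiles whose tissue fraction is at least min_fraction."""
    boxes = []
    height, width = mask.shape
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            tile = mask[top:top + tile_size, left:left + tile_size]
            if tile.mean() >= min_fraction:
                boxes.append((left, top, tile.shape[1], tile.shape[0]))
    return boxes


# Synthetic 8 x 8 "thumbnail": white background with a darker patch.
thumbnail = np.full((8, 8, 3), 255, dtype=np.uint8)
thumbnail[0:4, 4:8] = 100
mask = tissue_mask(thumbnail)
print(tissue_tiles(mask, tile_size=4))
```

The boxes are expressed in the coordinates of the low-resolution page, so they need to be scaled up before reading the same regions from a higher-resolution page.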
PyVips is a Python binding for libvips, a low-level library for working with large images. PyVips can be used to read and manipulate pyramidal TIF formats. Here's a code snippet to perform some basic image operations using PyVips:
import numpy as np
import pyvips

# load the full-resolution image
path = "slide1.tif"
slide = pyvips.Image.new_from_file(path, page=0)
print(slide.width, slide.height)
# >> 67456 84224

# load the 16x downsampled image from page 4 (2 ** 4 = 16)
slide = pyvips.Image.new_from_file(path, page=4)
print(slide.width, slide.height)
# >> 4216 5264

# crop a region from the page-4 image; the result is a pyvips.Image
x, y = 100, 200
region_width = 500
region_height = 500
region = slide.crop(x, y, region_width, region_height)

# convert the cropped pyvips.Image to a numpy array (assumes 8-bit samples)
array = np.ndarray(
    buffer=region.write_to_memory(),
    dtype=np.uint8,
    shape=(region.height, region.width, region.bands),
)
# red, green, blue channels
print(array.shape)
# >> (500, 500, 3)

# also look into pyvips.Region.fetch for faster region reads;
# fetch returns a raw pixel buffer rather than a pyvips.Image
buffer = pyvips.Region.new(slide).fetch(x, y, region_width, region_height)
bands = 3
array = np.ndarray(
    buffer=buffer,
    dtype=np.uint8,
    shape=(region_height, region_width, bands),
)
print(array.shape)
# >> (500, 500, 3)
Cytomine allows you to display and explore native whole slide images and pyramidal TIF formats in a web browser. It also supports adding annotations and executing scripts from inside Cytomine or from any computing server using the dedicated Cytomine Python client. Cytomine can be installed locally or on any Linux server. The Cytomine GitHub repository includes examples of Python scripts demonstrating how to interact with your Cytomine instance, as well as examples of ready-to-use machine learning scripts (all S_ prefixed repos).
Tips and tricks
Here are a few tips for working with the data:
- Spend time learning how to efficiently process these enormous images. Due to the time limit on submissions, you cannot process every part of the image at the highest resolution available. Instead, consider methods to first predict which parts of the image are most important for an accurate diagnosis. The pyramidal TIF format contains multiple zoom levels that you can access with the page keyword argument to pyvips.Image.new_from_file. You might try a "multiscale" approach: detect the most important parts of the image from lower-resolution versions (pyvips.Image.new_from_file(path, page=5)) and then analyze just those parts at higher resolutions.
- Avoid using pyvips.Image.crop to read in small image regions. According to a PyVips developer and our own tests, pyvips.Region.new(image).fetch(...) is much faster than pyvips.Image.crop(...). (Requires libvips version 0.8.6 or greater.)
- Avoid OpenSlide-based data loaders (openslide.OpenSlide) in your machine learning pipelines. OpenSlide is no longer under active development, and several issues have arisen that have not been addressed.
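For the multiscale approach in the first tip, a region found on a low-resolution page has to be translated into the coordinates of the page you want to read it from. The sketch below assumes the exact factor-of-two relationship between consecutive pages described earlier; the function name `scale_box` and the (x, y, width, height) box layout are hypothetical choices for illustration.

```python
def scale_box(box, from_page, to_page):
    """Translate an (x, y, width, height) box between pyramid pages,
    assuming each page is downsampled 2x relative to the previous one."""
    x, y, width, height = box
    shift = from_page - to_page
    if shift >= 0:
        # Moving to a higher-resolution page: coordinates grow.
        factor = 2 ** shift
        return (x * factor, y * factor, width * factor, height * factor)
    # Moving to a lower-resolution page: coordinates shrink (truncated),
    # keeping the box at least one pixel wide and tall.
    factor = 2 ** (-shift)
    return (x // factor, y // factor,
            max(1, width // factor), max(1, height // factor))


# A 4 x 4 pixel box found on page 5 covers 128 x 128 pixels on page 0.
print(scale_box((4, 0, 4, 4), from_page=5, to_page=0))
```

The resulting box could then be passed to a region read on the target page, keeping the expensive full-resolution reads limited to areas that looked relevant at low resolution.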
Additional reading and research
Here are a few papers and tutorials on machine learning with WSIs that you may find helpful:
- An attention-based multi-resolution model for prostate whole slide image classification and localization
- Using deep convolutional neural networks to identify and classify tumor-associated stroma in diagnostic breast biopsies
- Assessment of machine learning of breast pathology structures for automated differentiation of breast cancer and high-risk proliferative lesions
- Histologic tissue components provide major cues for machine learning-based prostate cancer detection and grading on prostatectomy specimens
- Assessment of machine learning detection of environmental enteropathy and celiac disease in children