Where's Whale-do?

Help the Bureau of Ocean Energy Management (BOEM), NOAA Fisheries, and Wild Me accurately identify endangered Cook Inlet beluga whales from photographic imagery. Scalable photo-identification of individuals is critical to population assessment, management, and protection for these endangered whales.

$35,000 in prizes
jun 2022
442 joined

Problem description

For this competition, your goal is to identify which images in a database contain the same individual beluga whale seen in a query image.

You will be provided with a set of queries, each one specifying a single query image of a beluga whale and a corresponding database to search for matches to that same individual. The database will include images of both matching and non-matching belugas. This is a learning-to-rank information retrieval task.

For example, a query might be specified in the following way:

query_image_id: train0009
database_image_ids: [train0366, train0522, train3686, ...]

In the hypothetical database below, the beluga in train0522.jpg is the same one seen in the query image. You can verify this yourself by looking closely at the marks on the whale's back.

[Figure: a query image shown next to a database of images; train0522.jpg is marked as the match.]

Your job in this competition will be to develop a model that can identify the images in a database that most closely match the query image, using only the images along with limited metadata.

Sound like fun? We thought so too! Read on for the details.




The images in this dataset come from a sub-population of critically endangered beluga whales that live in Cook Inlet, Alaska. To help protect these animals from the risk of extinction, the Marine Mammal Laboratory at the NOAA Alaska Fishery Science Center conducts an annual photographic survey of Cook Inlet belugas to more closely monitor and track individual whales.

The photographs are then passed through an auto-detection algorithm developed by Wild Me which draws bounding boxes around each whale and then rotates and crops the image. These cropped images are what you will be working with for this competition.

The quality, resolution, and lighting conditions of the images will vary. Additionally, as you may have noticed with train0366.jpg above, some images will have missing data that appears as contiguous white pixels, due to the whale appearing at the edge of the original photograph. However, images included in the challenge dataset are expected to still show distinctive enough markings for identifying the individual whale.

There are a total of 5,902 images in the training set. If you are ready to start working with them, head on over to the Data Download page where you will find a zipped archive called images.zip.


You will also have access to a metadata.csv file containing additional information about each image. It contains the following columns, which will also be available as part of the test set:

  • image_id: a unique identifier for each image.
  • path: path to the corresponding image file.
  • height: the height of the image in pixels.
  • width: the width of the image in pixels.
  • viewpoint: the direction from which the whale is being photographed; may be "left", "right" or "top". See below for additional details.
  • date: date that the photograph was captured.

The training metadata file also contains the following additional columns, which are available only in the training set and not in the test set used for evaluation. These three columns are intended to help you generate ground truth labels for supervised training.

  • timestamp: the full timestamp that the photograph was captured with second-level resolution.
  • whale_id: an identifier for the individual whale in the photograph. However, note that an individual whale can have multiple whale IDs. See below for additional details.
  • encounter_id: an identifier for the continuous encounter of the individual whale during which the photo was taken. See below for additional details.

The first few rows of the metadata table are shown below. The final three columns (timestamp, whale_id, and encounter_id) are present only for training and not for test.

image_id   path                  height  width  viewpoint  date        timestamp            whale_id  encounter_id
train0000  images/train0000.jpg  463     150    top        2017-08-07  2017-08-07 20:38:36  whale000  whale000-000
train0001  images/train0001.jpg  192     81     top        2019-08-05  2019-08-05 16:49:13  whale001  whale001-000
train0002  images/train0002.jpg  625     183    top        2017-08-07  2017-08-07 22:12:19  whale002  whale002-000
train0003  images/train0003.jpg  673     237    top        2017-08-07  2017-08-07 20:40:59  whale003  whale003-000
train0004  images/train0004.jpg  461     166    top        2018-08-10  2018-08-10 21:45:30  whale004  whale004-000
...        ...                   ...     ...    ...        ...         ...                  ...       ...

The metadata.csv is available on the Data Download page.
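
As a starting point, here is a minimal sketch of reading the metadata and grouping image IDs by whale_id to build ground truth groups. It assumes the file is comma-separated with the header shown in the table above; the helper names load_metadata and images_by_whale are illustrative and not part of the competition materials. The inline sample reuses three rows from the table so the snippet runs without the real file.

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the table above. Assumption: the real metadata.csv
# is comma-separated and uses these column names as its header.
SAMPLE_CSV = """image_id,path,height,width,viewpoint,date,timestamp,whale_id,encounter_id
train0000,images/train0000.jpg,463,150,top,2017-08-07,2017-08-07 20:38:36,whale000,whale000-000
train0001,images/train0001.jpg,192,81,top,2019-08-05,2019-08-05 16:49:13,whale001,whale001-000
train0002,images/train0002.jpg,625,183,top,2017-08-07,2017-08-07 22:12:19,whale002,whale002-000
"""


def load_metadata(file_obj):
    """Read metadata rows into a list of dicts, converting pixel sizes to int."""
    rows = []
    for row in csv.DictReader(file_obj):
        row["height"] = int(row["height"])
        row["width"] = int(row["width"])
        rows.append(row)
    return rows


def images_by_whale(rows):
    """Map each whale_id to the list of image_ids labeled with it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["whale_id"]].append(row["image_id"])
    return dict(groups)


rows = load_metadata(io.StringIO(SAMPLE_CSV))
groups = images_by_whale(rows)
print(groups)
```

To use the real file, replace io.StringIO(SAMPLE_CSV) with an open file handle for metadata.csv. Two images count as a match for training purposes when they share a whale_id.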


Some of the images are taken from overhead by drones, and others are taken laterally from vessels. Here are some examples of what those different viewpoints can look like.

[Figure: example aerial images and example lateral images (left view and right view).]

Most of the images in the dataset are overhead photographs ("top"), with the remainder a roughly equal balance between left and right lateral views. Some of the competition test queries will test your model's skill in identifying images of the same whale for different viewpoints, for example finding an overhead image of a whale for which you only have a lateral viewpoint query image.

Top views in the dataset are oriented so that they are "tall" with the whale's head at the top of the image. Note that the images shown on this page have been rotated 90 degrees from their original orientation for a more compact layout.


The same individual whale can appear in the dataset longitudinally across several years. Whales' appearances can change over time, such as gaining new marks or scars, or having existing ones fade. Expert human reviewers take this into account when determining matches in images. The capture date of each image is provided so that your model can make use of information regarding the direction and magnitude of time differences when comparing images.

Whale ID

The whale_id is an identifier for the individual whale in an image, provided only for the training set. These IDs have been meticulously curated by a team of wildlife researchers who closely review each image and associate it with a catalogue of known belugas.

But it can get complicated, and there are some nuances to be aware of.

You may have already noted that there are approximately 300 Cook Inlet beluga whales, but as you will soon find out in exploring the data, there are nearly 800 unique whale IDs in the training dataset. Why are there so many more whale IDs than whales?

The short answer is that an individual beluga may have more than one whale ID associated with it. This can happen for a couple of reasons:

  • Whales change in appearance over time. Whale calves are given a new whale ID shortly after being born, and researchers identify them by proximity to their mother, since they usually lack the distinctive marks of older whales. As whales mature they change in shape and color, and will be assigned new whale IDs as yearlings and as young adults. So the whale ID really encodes similarity in physical appearance between images, rather than an individual identity that persists over time.
  • The other reason is that photographs only capture partial views of a whale's body. This means that for some whales, we may have some images of the upper region of the body (head, blowhole) and some images of the lower region of the body (dorsal ridge, tail), but not necessarily enough information to link the two sets of images into a single view of an individual whale. In such a case we would have two whale IDs for an individual whale, one for the upper region and one for the lower region of the body. In other instances we may have an image where only the left or right side is visible due to glare or shading. The researchers curating this data try to merge multiple whale IDs together whenever they can identify them as a single individual, but this is not possible in all cases.

What's important to remember is that the whale ID still groups together images of the same whale. If an individual whale has a second whale ID, that second set of images will look different from the first set.

A lot of painstaking work has gone into producing the whale ID labels, but as with so many datasets, there is the possibility of some errors or ambiguities remaining. Do your best to handle the uncertainty and embrace this aspect of the challenge! Also, feel free to share what you believe to be potential errors in the ground truth on the user forum. We may not be able to update the competition data, but Cook Inlet beluga researchers will be grateful for your help!

Encounter ID and Timestamp

Images of a whale captured in the same encounter are more likely to be visually similar. Furthermore, some images of a whale may be nearly identical due to being captured only seconds apart. In order to facilitate training your model on data that is representative of the task, we have provided the columns encounter_id and timestamp only in the training set to help inform selection of images. The encounter_id column has been generated per-individual-whale by associating runs of images of the whale that are separated by a short amount of time (1 hour).
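
The grouping idea behind encounter_id can be sketched as follows. This is only an illustration under a stated assumption: a new encounter starts whenever the gap between consecutive images of the same whale exceeds one hour. The exact rule the organizers used is not specified beyond the description above, and the function name assign_encounters and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def assign_encounters(timestamps, max_gap=timedelta(hours=1)):
    """Return an encounter index for each timestamp of ONE whale.

    Assumption: images are walked in time order and a new encounter starts
    whenever the gap to the previous image is larger than max_gap.
    """
    # Indices of the input sorted by capture time.
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    labels = [0] * len(timestamps)
    encounter = 0
    for pos, idx in enumerate(order):
        if pos > 0 and timestamps[idx] - timestamps[order[pos - 1]] > max_gap:
            encounter += 1
        labels[idx] = encounter
    return labels


# Hypothetical timestamps for a single whale.
stamps = [
    datetime(2017, 8, 7, 20, 38, 36),
    datetime(2017, 8, 7, 20, 40, 59),
    datetime(2017, 8, 7, 22, 12, 19),
]
labels = assign_encounters(stamps)
print(labels)
```

When building local validation queries, one reasonable use of these labels is to keep a query image and its candidate matches in different encounters, so that near-duplicate frames do not inflate local scores.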

Note that the test set will only include the date column, which gives date-level resolution of when the image was captured, in order to regularize solutions against overfitting on timestamps.

External Data

Use of external data containing any images of beluga whales from the same population (Cook Inlet) as the challenge data is prohibited. Otherwise, external data not provided through the competition website is permitted for training models as long as it is freely and publicly available with permissive open licensing. This includes pre-trained computer vision models available freely and openly in that form at the start of the competition.

If you are using any external datasets or pre-trained models, you are required to share them publicly in the competition discussion forum. If you have any questions, please ask on the forum.


Test Queries and Databases

This challenge uses a scenario-based evaluation approach to test how well your model performs under various conditions of the query image and database images. The test set consists of multiple scenarios, with each scenario defining a distinct set of query images and a distinct database. The scenarios have been constructed to test performance of the following aspects of the task:

  • Query top-view images against a database of top-view images
  • Query top-view images against a database of top-view images from the same year (e.g., 2017 against 2017; 2018 against 2018)
  • Query top-view images against a database of top-view images from a previous year (e.g., 2019 against 2018; 2019 against 2017)
  • Query lateral-view images against a database of top-view images
  • Query top-view images against a database of lateral-view images

The test set will include some images and whale IDs that you have seen in the train set, and others that are completely new.

The table below shows the number of images in the query set and database for each scenario, which may help you to estimate the computational costs of generating your solution.

scenario    query set size  database size
scenario01  1183            3290
scenario02  567             633
scenario03  616             2041
scenario04  279             1110
scenario05  306             971
scenario06  501             1209
scenario07  274             403
scenario08  282             338
scenario09  102             318
scenario10  309             111

Examples of scenarios using the training data are available on the Data Download page.
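
For local experiments, a rough sketch of assembling your own scenario from training metadata might look like the following. It is not the organizers' construction procedure: the select helper, the toy rows, and the choice of filters (viewpoint and year taken from the date column) are assumptions made for illustration.

```python
def select(rows, viewpoints, year=None):
    """Return image_ids whose viewpoint is in `viewpoints`.

    If `year` is given, keep only images whose date (YYYY-MM-DD) falls in it.
    """
    return [
        row["image_id"]
        for row in rows
        if row["viewpoint"] in viewpoints
        and (year is None or row["date"].startswith(str(year)))
    ]


# Hypothetical metadata rows, reduced to the columns used here.
toy_rows = [
    {"image_id": "img_a", "viewpoint": "top", "date": "2019-08-05"},
    {"image_id": "img_b", "viewpoint": "left", "date": "2019-08-05"},
    {"image_id": "img_c", "viewpoint": "right", "date": "2018-08-10"},
    {"image_id": "img_d", "viewpoint": "top", "date": "2018-08-10"},
]

# Lateral-view queries against a top-view database.
lateral_vs_top = {
    "queries": select(toy_rows, {"left", "right"}),
    "database": select(toy_rows, {"top"}),
}

# Top-view queries from 2019 against a top-view database from 2018.
cross_year = {
    "queries": select(toy_rows, {"top"}, year=2019),
    "database": select(toy_rows, {"top"}, year=2018),
}
print(lateral_vs_top, cross_year)
```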

Submission Format

This is a code execution challenge! Rather than submitting your image rankings, you'll package everything needed to do inference and submit that for containerized execution.

Your code submission must contain a main.py script that reads in the test scenarios, generates predictions for each query, and outputs a single submissions.csv containing image rankings for all test queries.

Each query will involve ranking a set of database images for one query image. For each query, solutions will output a ranked list of up to 20 images from the database along with confidence scores. Confidence scores should be a floating point value in the range [0.0, 1.0], with 1.0 being the highest confidence value.

Each query's inference should be independent of others; in other words, the results should not depend on the existence of any other images besides that query image and its associated database. Participants are allowed to use in-memory or on-disk caching within one submission execution in order to reduce redundant computations, so long as the independence requirement is not violated.

Note that some test scenarios have been constructed in a leave-one-out manner, meaning that the query set for that scenario is a subset of the scenario's database. The correct procedure for such scenarios is to perform inference for each query image against the database excluding that query image. In order to facilitate caching, you are allowed to derive intermediate data structures for the entire database including the query image, provided that you exclude the query image itself in your returned predictions. (Otherwise, your submission will result in a validation error upon final scoring.)
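
The ranking rules above (at most 20 results, scores in [0.0, 1.0], and the query image excluded from its own results) can be sketched as follows. This is a minimal illustration, not the required approach: it assumes you already have one embedding vector per image in a cache, and it uses cosine similarity rescaled to [0, 1] as the confidence score. The names rank_database and embeddings, and the toy vectors, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_database(query_id, database_ids, embeddings, limit=20):
    """Return up to `limit` (image_id, score) pairs, best match first."""
    scored = []
    for image_id in database_ids:
        if image_id == query_id:
            continue  # leave-one-out: never return the query itself
        sim = cosine(embeddings[query_id], embeddings[image_id])
        score = (sim + 1.0) / 2.0  # map [-1, 1] onto [0, 1]
        scored.append((image_id, min(1.0, max(0.0, score))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy cache of embeddings; the query "q" is also part of the database.
embeddings = {"q": [1.0, 0.0], "a": [1.0, 0.1], "b": [0.0, 1.0]}
result = rank_database("q", ["q", "a", "b"], embeddings)
print(result)
```

Because the embeddings depend only on each image, caching them for the whole database does not make one query's result depend on another query.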

For a full explanation of the inference process and for more details on how to make your executable code submission, see the Code Submission Format page.

Performance Metric

The competition metric will be mean average precision (mAP), a commonly used metric for query ranking tasks in information retrieval. Average precision (a.k.a. area under the precision–recall curve) will be calculated for each query, and the final score is a mean of the per-query scores. This metric rewards models which rank database images of the same individual whale as the query image higher than images of other whales.

The metric will be computed using a held-out test set of beluga whale images that no participants have access to. The queries and databases used for this test set will also be unknown to participants, but will be as described above.

Average precision is computed for each query as follows:

$$ \text{AP} = \sum_n (R_n - R_{n-1}) P_n $$

where $P_n$ and $R_n$ are the precision and recall, respectively, when thresholding at the $n$th image pair of the sequence sorted in order of increasing recall. Average precision summarizes model performance over a range of operating thresholds from the highest predicted score to the lowest predicted score. For this challenge, the integral of the area under the curve is estimated with the finite sum over every threshold in the prediction set, without interpolation.

Mean average precision is then computed across all Q queries as:

$$ \text{mAP} = \frac{\sum_{q=1}^Q \text{AP}(q)}{Q} $$

The overall metric can be thought of as a macro-average of individual database image prediction precision scores across queries, and a micro-average of per-query average precision scores across the test scenarios.

Note that this is an information retrieval metric, which is closely related to—but slightly different than—the average precision metric used for classification tasks. The main difference is that some matching images (i.e., ground truth positives) may not have been predicted, and those are assigned precision scores of 0.0. Additionally, because this challenge has a prediction limit of 20, we consider 20 correct results to be a recall of 1.0, even if the database contains more than 20 possible matches. The commonly used sklearn.metrics.average_precision_score function from scikit-learn is a classification metric and requires some minor adjustments to correctly compute the information retrieval metric. We provide a scoring script which adapts scikit-learn's metric that you can use to locally evaluate your predictions. See the relevant section of the Code Submission Format page for more details.
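
To make the adjustments above concrete, here is a small sketch of the metric as described on this page. It is one reading of the description, not the official scoring script, which remains authoritative. Recall is normalized by the number of true matches capped at the prediction limit of 20, so matches that are never predicted contribute nothing to the sum. The function names are illustrative, ranked lists are assumed to contain no duplicate image IDs, and returning 0.0 for a query with no true matches is an assumption of this sketch.

```python
def average_precision(ranked_ids, positive_ids, limit=20):
    """AP for one query: sum of precision at each hit, divided by the
    number of true matches capped at `limit`."""
    positives = set(positive_ids)
    if not positives:
        return 0.0  # assumption: the page does not define this case
    hits = 0
    precision_sum = 0.0
    for rank, image_id in enumerate(ranked_ids[:limit], start=1):
        if image_id in positives:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(len(positives), limit)


def mean_average_precision(predictions, ground_truth, limit=20):
    """Mean of per-query AP; queries with no predictions score 0.0."""
    scores = [
        average_precision(predictions.get(query_id, []), positives, limit)
        for query_id, positives in ground_truth.items()
    ]
    return sum(scores) / len(scores)


# Hits at ranks 1 and 3, two true matches: (1/1 + 2/3) / 2
ap_full = average_precision(["a", "x", "b"], {"a", "b"})
# Same ranking, but a third match "c" was never predicted: (1/1 + 2/3) / 3
ap_missed = average_precision(["a", "x", "b"], {"a", "b", "c"})
print(ap_full, ap_missed)
```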


Beluga Photo-identification Guidance

Finally, since not everyone is an expert in beluga whales, we are providing a resource from the human reviewers at NOAA Fisheries about the visual features they use to identify individual whales. You can find this document Cook Inlet Beluga Whale Manual Matching.pdf on the Data Download page. It may help to inform how you approach the problem, but don't feel constrained by these ideas—explore new directions too!

Good luck and enjoy this problem! If you have any questions, you can always visit the challenge's forum.