Deep Chimpact: Depth Estimation for Wildlife Conservation

Help conservationists monitor wildlife populations. In this challenge, you will estimate the depth of animals in camera trap videos. Better distance estimation models can rapidly accelerate the availability of critical information for wildlife monitoring and conservation.

$10,000 in prizes
nov 2021
314 joined

Problem description

Your goal in this challenge is to automatically estimate the distance between a camera trap and an animal for selected frames in camera trap videos. Automated distance estimation can rapidly accelerate access to population monitoring estimates for conservation. The primary data consists of camera trap videos from two parks in West Africa, along with hand-labelled distances and estimated bounding box coordinates for animals.

Data overview

The table below lists all of the data provided for this competition. For more detailed explanations of each piece of data, read on!

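             Train set                                  Test set
Folders      train_videos, train_videos_downsampled     test_videos, test_videos_downsampled
csv files    train_metadata.csv, train_labels.csv       test_metadata.csv, submission format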


Videos

The main features in this challenge are the videos themselves. Each video is up to a minute long and was captured automatically by a motion-triggered camera. Videos capture six different species: bushbucks, chimpanzees, duikers, elephants, leopards, and monkeys. Duikers are the most commonly seen, while leopards and elephants are the rarest. There are roughly 3,900 videos, split across the train and test sets. Note that the height, width, and length of videos are not consistent across the data. Videos also vary in quality.

The videos come from two parks: Taï National Park in Côte d'Ivoire, and Moyen-Bafing National Park in the Republic of Guinea.

Location                      Train set videos    Test set videos
Taï National Park             530                 508
Moyen-Bafing National Park    1,542               1,328

The videos are contained in two main folders:

  • train_videos (101 GB)
  • test_videos (81 GB)

Each video is named with a unique id and the file type is either .mp4 or .avi. For example, listing the first few files in train_videos in your terminal would look like:

$ ls train_videos | head -n 5

To view the .avi videos, we recommend downloading VLC media player. The .mp4 files should play without issue in any default video player.

Downsampled videos

Additionally, we provide downsampled versions for all videos. These datasets are much smaller and are a great way to get started with the competition. Downsampled videos have a frame rate of 1 frame per second, making it easy to index video arrays by time. Original videos have a frame rate of around 30 frames per second.
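For the original videos, a label's time has to be converted to a frame index before the frame can be selected. Here is a minimal sketch of that conversion, assuming a constant frame rate. The helper name time_to_frame_index is hypothetical, and the 30 fps default is only the approximate rate quoted above, so in practice the rate should be read from each video.

```python
def time_to_frame_index(time_s, fps=30.0, n_frames=None):
    """Map a time in seconds to the nearest frame index at a given frame rate.

    fps=30.0 is an assumed default based on the approximate rate of the
    original videos; pass the real rate of the video when it is known.
    """
    if time_s < 0 or fps <= 0:
        raise ValueError("time_s must be >= 0 and fps must be > 0")
    index = int(round(time_s * fps))
    if n_frames is not None:
        # clamp so a label at the very end of the video stays in range
        index = min(index, n_frames - 1)
    return index


# at 1 fps (downsampled videos) the index is simply the time in seconds
downsampled_index = time_to_frame_index(4, fps=1)
# at roughly 30 fps (original videos) the same moment is a much later frame
original_index = time_to_frame_index(4, fps=30)
print(downsampled_index, original_index)
```

Rounding to the nearest frame is one reasonable choice; flooring would also be defensible as long as it is applied consistently.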

  • train_videos_downsampled (16 GB)
  • test_videos_downsampled (13 GB)

Downloading the videos

All videos are hosted in an AWS S3 bucket. To download the video datasets for the competition:

  • Join the competition and agree to the competition guidelines
  • Download video_download_instructions.txt and video_access_metadata.csv from the data download page
  • Follow the instructions in video_download_instructions.txt


Labels

The labels in this competition are the distance in meters between the animal and the camera at selected times in the video. All videos were hand-labelled by researchers based on reference videos. You can learn more about the labelling process here.

Each frame is uniquely identified by its video name (video_id) and relative time in seconds since the start of the video (time). Each row in the labels represents one frame. There are no frames with multiple labelled animals, meaning there is only one label per frame.

Only a subset of frames in each video are labelled. This does not necessarily include all frames during which the animal is visible. Your goal is to predict the labels only at the times indicated.

train_labels.csv includes the following columns:

  • video_id (string): unique video identifier corresponding to the file name
  • time (integer): seconds since the start of the video. For downsampled videos, this will be the same as the array index since videos are downsampled to one frame per second.
  • distance (float): ground truth distance between the camera trap and the animal in meters

The labels for video zxsz.mp4 are as follows. For this video, we only have labels at times 0, 2, and 4.

video_id time distance
zxsz.mp4 0 7.0
zxsz.mp4 2 8.0
zxsz.mp4 4 10.0
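As a rough sketch of how these labels might be read, the snippet below groups the labelled frames by video. It assumes train_labels.csv is comma-separated with a header row matching the column names above. The helper load_labels is hypothetical, and the inline sample simply repeats the three rows shown for zxsz.mp4.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the example above; in practice, pass an open
# train_labels.csv file object instead of this in-memory string.
SAMPLE_LABELS = """video_id,time,distance
zxsz.mp4,0,7.0
zxsz.mp4,2,8.0
zxsz.mp4,4,10.0
"""


def load_labels(fileobj):
    """Return {video_id: [(time, distance), ...]} with frames sorted by time."""
    labels = defaultdict(list)
    for row in csv.DictReader(fileobj):
        labels[row["video_id"]].append((int(row["time"]), float(row["distance"])))
    for frames in labels.values():
        frames.sort()
    return dict(labels)


labels_by_video = load_labels(io.StringIO(SAMPLE_LABELS))
print(labels_by_video["zxsz.mp4"])
```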


Metadata

In addition to the videos and labels, you are provided with frame metadata.

Metadata is separated into two files, one for the train set (train_metadata.csv) and one for the test set (test_metadata.csv). The frames in the train metadata are the same as those in the train labels, and the frames in the test metadata are the same as those in the submission format.

Each csv includes the following columns:

  • video_id (string): unique video identifier corresponding to the video file name
  • time (integer): seconds since the start of the video. For downsampled videos, this will be the same as the array index.
  • x1, y1, x2, y2 (float between 0 and 1): an estimated bounding box for the labelled animal, generated using a pretrained object detection model based on MegaDetector. Bounding box coordinates are given as fractions of the frame height (y1 and y2) and width (x1 and x2).
  • probability (float between 0 and 1): the model's confidence in the bounding box.
  • park (string): the location of the camera trap. moyen_bafing means Moyen-Bafing National Park, Guinea, and tai means Taï National Park, Côte d'Ivoire.
  • site_id (string): unique site identifier. Each site represents one camera trap location. Each site is either entirely in the train set or the test set, so all of the video backgrounds in the test set will be entirely new.
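Because test sites never appear in the train set, one option is to hold out whole sites for validation so that validation backgrounds are also unseen. The snippet below is a minimal sketch of such a split, not a prescribed method. The helper split_by_site is hypothetical, and apart from xzi the site and video ids in the toy rows are made up.

```python
import random


def split_by_site(rows, val_fraction=0.2, seed=0):
    """Split metadata rows so each site_id lands entirely in train or validation."""
    sites = sorted({row["site_id"] for row in rows})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(sites)
    n_val = max(1, round(len(sites) * val_fraction))
    val_sites = set(sites[:n_val])
    train = [row for row in rows if row["site_id"] not in val_sites]
    val = [row for row in rows if row["site_id"] in val_sites]
    return train, val


# Toy rows: only site "xzi" comes from the example data, the rest are invented.
rows = [
    {"video_id": "zxsz.mp4", "time": 0, "site_id": "xzi"},
    {"video_id": "zxsz.mp4", "time": 2, "site_id": "xzi"},
    {"video_id": "aaaa.mp4", "time": 0, "site_id": "site_b"},
    {"video_id": "bbbb.mp4", "time": 3, "site_id": "site_c"},
    {"video_id": "cccc.mp4", "time": 1, "site_id": "site_d"},
]
train_rows, val_rows = split_by_site(rows, val_fraction=0.25)
print(len(train_rows), len(val_rows))
```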

Since bounding boxes are model-generated, they are not always correct. They are provided only as a helpful reference, not as ground truth. The MegaDetector is most likely to make mistakes when an animal is far away, obscured by objects in the foreground, or captured with poor video quality. Many cameras in Taï National Park were covered with cling film to prevent damage, which can blur or distort the image. There are a number of frames where the MegaDetector did not return any bounding boxes. For these frames, the five MegaDetector columns (bounding box coordinates and probability) are null.
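Any code that uses the boxes therefore needs a path for frames without a detection. The following sketch converts a normalized box to pixel coordinates and returns None when the box is missing. The helper box_to_pixels is hypothetical, and the 720 by 1280 frame size is an assumption for illustration, not the known size of any particular video.

```python
def box_to_pixels(box, height, width):
    """Convert a normalized [x1, y1, x2, y2] box to integer pixel coordinates.

    Returns None when the box is missing, i.e. any coordinate is None or NaN.
    """
    if box is None or any(v is None or v != v for v in box):  # v != v is True only for NaN
        return None
    # clamp to [0, 1] in case a coordinate falls slightly outside the frame
    x1, y1, x2, y2 = (min(max(float(v), 0.0), 1.0) for v in box)
    return (int(x1 * width), int(y1 * height), int(x2 * width), int(y2 * height))


# a detected box (values from the metadata example for zxsz.mp4 at time 0)
detected = box_to_pixels([0.071450, 0.318391, 0.211672, 0.839233], height=720, width=1280)
# a frame where the detector returned nothing
missing = box_to_pixels([None, None, None, None], height=720, width=1280)
print(detected, missing)
```

What to do when the result is None (skip the crop, fall back to the full frame, or something else) is a modelling decision left to the participant.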

The metadata for video zxsz.mp4 looks like:

video_id time x1 y1 x2 y2 probability park site_id
zxsz.mp4 0 0.071450 0.318391 0.211672 0.839233 0.635906 moyen_bafing xzi
zxsz.mp4 2 0.111120 0.520176 0.187030 0.770960 0.975330 moyen_bafing xzi
zxsz.mp4 4 0.045394 0.523235 0.112021 0.707161 0.973707 moyen_bafing xzi
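Since the train metadata and train labels describe the same frames, they can be matched on the pair (video_id, time). The snippet below is one possible sketch of that join using the example rows for zxsz.mp4. The helper join_on_frame and the derived box_height field are illustrative assumptions, not part of the provided data.

```python
# Rows copied from the label and metadata examples for zxsz.mp4
labels = [
    {"video_id": "zxsz.mp4", "time": 0, "distance": 7.0},
    {"video_id": "zxsz.mp4", "time": 2, "distance": 8.0},
    {"video_id": "zxsz.mp4", "time": 4, "distance": 10.0},
]
metadata = [
    {"video_id": "zxsz.mp4", "time": 0, "y1": 0.318391, "y2": 0.839233, "site_id": "xzi"},
    {"video_id": "zxsz.mp4", "time": 2, "y1": 0.520176, "y2": 0.770960, "site_id": "xzi"},
    {"video_id": "zxsz.mp4", "time": 4, "y1": 0.523235, "y2": 0.707161, "site_id": "xzi"},
]


def join_on_frame(labels, metadata):
    """Attach each label to the metadata row with the same (video_id, time)."""
    by_frame = {(m["video_id"], m["time"]): m for m in metadata}
    joined = []
    for label in labels:
        key = (label["video_id"], label["time"])
        if key not in by_frame:
            raise KeyError(f"no metadata for frame {key}")
        row = {**by_frame[key], "distance": label["distance"]}
        # illustrative feature: box height as a fraction of the frame height
        row["box_height"] = row["y2"] - row["y1"]
        joined.append(row)
    return joined


joined = join_on_frame(labels, metadata)
for row in joined:
    print(row["time"], row["distance"], round(row["box_height"], 3))
```

In these three rows the box gets shorter as the labelled distance grows, which hints at why box size might be a useful feature, although three frames are far too few to draw conclusions from.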

Here's an example of what overlaying the bounding box on the frame looks like. Check out the Helpful Code section for more explanation and resources!

import pandas as pd

# load video resampled to 1 frame per second (fps)
video_array = load_video("train_videos/zxsz.mp4", fps=1)
# select frame at time 0
frame_t0 = video_array[0]

# load train metadata
train_metadata = pd.read_csv("train_metadata.csv")
# select frame using video id and time
frame_metadata = train_metadata[
  (train_metadata.video_id == "zxsz.mp4") & (train_metadata.time == 0)
]
# get bounding box coordinates
box = frame_metadata[["x1", "y1", "x2", "y2"]].values[0]

# overlay bounding box on top of frame
add_bounding_box(frame_t0, box)

A camera trap frame of a monkey with a predicted bounding box around the animal

External data

External data is not allowed in this competition. However, participants can use pre-trained computer vision models as long as they were available freely and openly in that form at the start of the competition.

Submission format

The format for submissions is a .csv with columns: video_id, time, and distance. The video_id and time columns identify which frames must be labeled. You will fill in the distance column. You should only provide depth estimates for frames that are included in the submission format. You do not have to determine which frames contain an animal.

There may be additional frames where the animal is visible; however, you should provide distance estimates only for the frames in the submission format.

To create a submission, download the submission format and replace the placeholder values in the distance column with your predictions for the test frames. Distance labels must be floats or they will not be scored correctly.
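The steps above can be sketched as follows. This is a minimal illustration, assuming the submission format is a comma-separated file with a header row. The helper fill_submission, the 0.0 placeholder values, and the constant prediction are all hypothetical.

```python
import csv
import io


def fill_submission(format_file, predict, out_file):
    """Copy the submission format, replacing distance with float predictions."""
    reader = csv.DictReader(format_file)
    writer = csv.DictWriter(
        out_file, fieldnames=["video_id", "time", "distance"], lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # force a float so that an integer prediction such as 2 is written as 2.0
        distance = float(predict(row["video_id"], int(row["time"])))
        writer.writerow(
            {"video_id": row["video_id"], "time": row["time"], "distance": repr(distance)}
        )


# Assumed shape of the submission format; the 0.0 placeholders are a guess.
submission_format = io.StringIO("video_id,time,distance\ndudh.mp4,0,0.0\ndudh.mp4,2,0.0\n")
output = io.StringIO()
# stand-in model: predict 2 meters for every frame
fill_submission(submission_format, lambda video_id, time: 2, output)
print(output.getvalue())
```

Writing the rows in the same order as the submission format avoids any accidental reordering of frames.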

For example, if you predicted that the animal in video dudh.mp4 was 2.0 meters away for the first five labelled frames, your predictions would be:

video_id time distance
dudh.mp4 0 2.0
dudh.mp4 2 2.0
dudh.mp4 4 2.0
dudh.mp4 6 2.0
dudh.mp4 8 2.0

And the first few lines of your submitted .csv file would be:

video_id,time,distance
dudh.mp4,0,2.0
dudh.mp4,2,2.0
dudh.mp4,4,2.0
dudh.mp4,6,2.0
dudh.mp4,8,2.0


Performance metric

Performance is evaluated according to Mean Absolute Error (MAE), which measures how much the estimated values differ from the observed values. MAE is the mean of the magnitude of the differences between the predicted values and the ground truth. MAE is always non-negative, with lower values indicating a better fit to the data. The competitor that minimizes this metric will top the leaderboard.

$$ MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| $$

where

  • $N$ is the number of samples
  • $\hat{y}_i$ is the estimated distance for the $i$th sample
  • $y_i$ is the actual distance of the $i$th sample

In this case, each sample is one frame.

In Python you can calculate MAE using the scikit-learn function sklearn.metrics.mean_absolute_error(y_true, y_pred).
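For intuition, here is the same calculation written out in plain Python on a tiny example. The helper mae is hypothetical, the true distances are the labels shown earlier for zxsz.mp4, and the predictions are made up.

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences of distances."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [7.0, 8.0, 10.0]  # labels for zxsz.mp4 at times 0, 2, and 4
y_pred = [6.0, 8.0, 12.0]  # made-up predictions for the same frames
# absolute errors are 1.0, 0.0, and 2.0, so the mean is 1.0 meter
score = mae(y_true, y_pred)
print(score)
```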

Keep in mind that scores displayed on the public leaderboard while the competition is running may not be the same as the final scores on the private leaderboard, depending on how samples from the test set are used for evaluation.

Helpful code

To help you get started, here's some useful code for loading video arrays, selecting frames, and overlaying bounding boxes.

Loading video arrays

One way to load videos is with ffmpeg-python. Here's a code snippet to load a video with a specified frame rate.

import ffmpeg
import numpy as np

def load_video(filepath, fps):
    """Use ffmpeg to load a video as an array with a specified frame rate.

    Args:
        filepath (pathlike): Path to the video
        fps (float): Desired number of frames per second
    """

    def _get_video_stream(path):
        # return the first video stream reported by ffprobe, or None
        probe = ffmpeg.probe(path)
        return next(
            (stream for stream in probe["streams"] if stream["codec_type"] == "video"), None
        )

    video_stream = _get_video_stream(filepath)
    w = int(video_stream["width"])
    h = int(video_stream["height"])

    pipeline = ffmpeg.input(filepath)
    pipeline = pipeline.filter("fps", fps=fps, round="up")
    pipeline = pipeline.output("pipe:", format="rawvideo", pix_fmt="rgb24")
    # capture the raw RGB bytes from stdout instead of writing a file
    out, err = pipeline.run(capture_stdout=True, capture_stderr=True)
    arr = np.frombuffer(out, np.uint8).reshape([-1, h, w, 3])
    return arr

# load the array for an example video, with one frame per second
video_array = load_video("data/train_videos/adsm.mp4", fps=1)
video_array.shape
>> (61, 720, 1280, 3)

Specifying a frame rate of one frame per second returns 61 frames, indicating that the video is about one minute long. This video has a height of 720 pixels and width of 1280 pixels, and each frame has three channels to represent color (R, G, B). The array of the first frame is:

# look at the array of the first frame in the video
video_array[0]
array([[[ 59, 77, 89],
        [ 62, 80, 92],
        [ 57, 74, 88],
        ...,
        [205, 203, 181],
        [199, 196, 177],
        [202, 199, 180]],

       ...,

       [[221, 102,  42],
        [227, 108,  48],
        [148, 136,  95],
        ...,
        [250, 250, 250],
        [250, 250, 250],
        [250, 250, 250]]], dtype=uint8)

We can then use matplotlib to look at individual frames.

import matplotlib.pyplot as plt

# show the image of the first frame
plt.imshow(video_array[0])

The first frame of a camera trap video showing a chimpanzee

Since we resampled to one frame per second in load_video, we can easily select a frame based on the time. For example, to get the frame for 26 seconds into the video we select element 26 from the video array.

# show the frame 26 seconds into the video
plt.imshow(video_array[26])
plt.title("Frame 26 seconds from video start")

A frame 26 seconds into a camera trap video showing a chimpanzee

Plotting bounding boxes

As described in the metadata section, you are provided with estimated bounding boxes for the animal in the frame. Here's a code snippet showing how to overlay the bounding box on the frame, using the coordinates provided. The coordinates are given as fractions of the frame height and width, so this code will work even if the video has been resized.

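import cv2
import matplotlib.pyplot as plt

def add_bounding_box(img, box):
    """Add bounding box to an image and display it

    Args:
        img: image array
        box: list-like of box coordinates as fractions of height and width, format [x1, y1, x2, y2]
    """
    x1, y1, x2, y2 = box

    # add box to image
    img = cv2.rectangle(
        # draw on a copy, since arrays returned by load_video are read-only
        img.copy(),
        # (x, y) coordinates of the starting point
        (int(x1 * img.shape[1]), int(y1 * img.shape[0])),
        # (x, y) coordinates of the ending point
        (int(x2 * img.shape[1]), int(y2 * img.shape[0])),
        color=(255, 0, 0),
        # line thickness in pixels (an arbitrary choice)
        thickness=2,
    )
    plt.imshow(img)
    return img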


A frame 26 seconds into a camera trap video showing a chimpanzee outlined by a red box

Good luck!

Looking for a way to get started? Check out the Helpful Resources section, which contains recent depth estimation papers and associated GitHub repos.

Good luck and enjoy the challenge! If you have any questions you can always visit the user forum!