Hakuna Ma-data: Identify Wildlife on the Serengeti with AI for Earth

Can you predict which animals are present in camera trap images? Leverage millions of images of animals on the Serengeti to build a classifier that distinguishes between gazelles, lions, and more! #climate

$20,000 in prizes
Jan 2020
806 joined

Problem description

Your goal is to predict which animals are present in Snapshot Serengeti camera trap image sequences. There are 54 categories in total: 53 animal categories plus 1 category corresponding to no animal (empty).

Given a sequence ID and its corresponding images, your trained model should output a list of 54 probabilities corresponding to the model's confidence that each respective category is present in the sequence.
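As an illustration, here is a minimal pandas sketch of a uniform baseline that produces output of the required shape. The file name submission_format.csv and the seq_id index column are assumptions based on the data download page; the actual inference entry point is defined on the submission format page.

    import pandas as pd

    # Read the submission format to get the test sequence IDs and the
    # 54 category columns (file name and index column are assumed).
    submission = pd.read_csv("submission_format.csv", index_col="seq_id")

    # A trivial baseline: every category gets the same probability.
    submission.loc[:, :] = 1.0 / 54
    submission.to_csv("submission.csv")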

IMPORTANT: Competition timeline, leaderboard, and announcements


There are a few things that are different about this competition, so read the following carefully.

Final leaderboard

This competition is designed to reward the best generalizable solutions. Throughout the competition, you will submit code to run inference on Seasons 10 and 11, as described below. The scores for the public test data from Season 10 will appear on the public leaderboard.

However, the final leaderboard is not determined by your Season 10 results, since those labels are public; it is determined by your predictions for Season 11. You can overfit the public leaderboard all you want, but it will not help your final score. The public leaderboard is there to help you confirm that your own cross-validation matches our scoring.

During the competition, submissions will also be run against a private subset from Season 11. In addition to the public leaderboard, we will post an announcement every week or two listing the top 25 teams as evaluated against that subset, along with the score of the top team. At the end of the submission period, the top 25 teams on the private subset will have their submissions evaluated against the full test set to determine the final prize winners.

The final leaderboard that determines prizes will not be available until a few weeks after the competition ends.

Code submission

This competition requires that you submit code to perform inference along with any assets you need to load your trained model. You can find out more about this on the submission format page.

The code will be executed inside a container with a limited set of libraries. You can find the definition of the container in the runtime repository, which also contains the benchmarks.

Early stopping

There is a limited amount of computation available for this competition. Based on past competitions, we expect to have enough resources, but if not, we may close submissions early.

In the event we are approaching the limit, we will make regular announcements so that all competitors have enough heads-up to prepare their final submissions. We don't anticipate this being necessary, but it is possible.


In summary, evaluation proceeds in three stages:

1. Score public test set (Season 10): scores for the Season 10 test data are displayed on the public leaderboard.
2. Score private subset (Season 11): the top 25 teams and the top score for the Season 11 subset are shared via periodic announcements.
3. Final evaluation against full private test set (Season 11): code from the top 25 teams is run against the full test set to determine the final prize winners.


The features in this dataset


The features provided in this challenge are the images themselves as well as the date and time of each capture. Additional metadata is available for download from LILA, containing information such as the total number of frames in the sequence and which frame number each image corresponds to. You are welcome to use this metadata when training your model, but keep in mind that it will not be provided for the test set.

Images

Each image is tied to a sequence, which is identified by an alphanumeric string (e.g. SER_S1#B04#1#3) in the train set and a five-character string (e.g. ABCEY) in the test set. You will be predicting the presence or absence of wildlife species for each sequence.

The train_x.csv and test_x.csv files show to which sequence each image corresponds. These files also include the date and time for each image.

   file_name                             seq_id          datetime
0  S1/B04/B04_R1/S1_B04_R1_PICT0003.JPG  SER_S1#B04#1#3  2010-07-20 06:14:06
1  S1/B04/B04_R1/S1_B04_R1_PICT0004.JPG  SER_S1#B04#1#4  2010-07-22 08:56:06
2  S1/B04/B04_R1/S1_B04_R1_PICT0005.JPG  SER_S1#B04#1#5  2010-07-24 01:16:28
3  S1/B04/B04_R1/S1_B04_R1_PICT0006.JPG  SER_S1#B04#1#6  2010-07-24 08:20:10
4  S1/B04/B04_R1/S1_B04_R1_PICT0007.JPG  SER_S1#B04#1#7  2010-07-24 10:14:32

Sequences can have varying numbers of images, though three images per sequence is most common. Each image therefore corresponds to exactly one sequence ID, but a sequence ID does not uniquely identify an image.

The season and location can be inferred from the sequence ID in the train set but not in the test set, as the sequence IDs have been obfuscated.

The order of images within a sequence can be inferred by sorting the file names alphabetically: the last four digits in the file name indicate the order in which the images were taken. This holds for images in both the train set (e.g. S1/B04/B04_R1/S1_B04_R1_PICT0003.JPG) and the test set (e.g. ABCEY_IMAG2898.JPG).
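As a minimal pandas sketch (assuming train_x.csv is in the working directory), per-sequence image lists in capture order can be recovered like this:

    import pandas as pd

    # Load the image index; sorting file names alphabetically also orders
    # the images within each sequence by their trailing frame number.
    train_x = pd.read_csv("train_x.csv", parse_dates=["datetime"])

    sequences = (
        train_x.sort_values("file_name")
               .groupby("seq_id")["file_name"]
               .apply(list)
    )

    # Each entry is now an ordered list of image paths for one sequence.
    print(sequences.head())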

The species labels are at the level of a sequence, not an image, so the data contains some noise: sequence annotations apply to all images in that sequence. In rare cases, two of three images in a sequence contain a lion but the third is empty (lions, it turns out, sometimes walk away), yet all three images are annotated as “lion”.

Image sizes vary, so your preprocessing must handle this accordingly.
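One simple approach, sketched here with Pillow (the target size is an arbitrary choice for illustration, not part of the competition specification):

    from PIL import Image

    TARGET_SIZE = (512, 384)  # (width, height); arbitrary, tune for your model

    def load_and_resize(path):
        """Open an image and resize it to a fixed shape so batches stack."""
        with Image.open(path) as img:
            return img.convert("RGB").resize(TARGET_SIZE, Image.BILINEAR)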

Labels

There are 54 categories which may be present or absent in each sequence. If an empty label is present, all other categories will be absent. For non-empty sequences, multiple species may be present.
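A quick consistency check of this structure with pandas, assuming the training labels have been loaded into a DataFrame with one column per category (the file name train_labels.csv is our assumption; the labels file is on the data download page):

    import pandas as pd

    # File name is an assumption; the labels CSV is on the data download page.
    train_y = pd.read_csv("train_labels.csv", index_col="seq_id")

    # Whenever "empty" is present, every other category should be absent.
    empty_rows = train_y["empty"] == 1.0
    assert (train_y.loc[empty_rows].drop(columns="empty") == 0.0).all().all()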

Sequence data example


For example, a single label in the dataset may have these values, indicating the presence or absence of categories in sequence SER_S1#B04#1#10:

SER_S1#B04#1#10
aardvark 0.0
aardwolf 0.0
baboon 0.0
bat 0.0
batearedfox 0.0
buffalo 0.0
bushbuck 0.0
caracal 0.0
cattle 0.0
cheetah 0.0
civet 0.0
dikdik 0.0
duiker 0.0
eland 0.0
elephant 0.0
empty 1.0
gazellegrants 0.0
gazellethomsons 0.0
genet 0.0
giraffe 0.0
guineafowl 0.0
hare 0.0
hartebeest 0.0
hippopotamus 0.0
honeybadger 0.0
hyenaspotted 0.0
hyenastriped 0.0
impala 0.0
insectspider 0.0
jackal 0.0
koribustard 0.0
leopard 0.0
lionfemale 0.0
lionmale 0.0
mongoose 0.0
monkeyvervet 0.0
ostrich 0.0
otherbird 0.0
porcupine 0.0
reedbuck 0.0
reptiles 0.0
rhinoceros 0.0
rodents 0.0
secretarybird 0.0
serval 0.0
steenbok 0.0
topi 0.0
vulture 0.0
warthog 0.0
waterbuck 0.0
wildcat 0.0
wildebeest 0.0
zebra 0.0
zorilla 0.0

Performance metric


Performance is evaluated according to a mean aggregated binary log loss. For each possible category in a sequence, the binary log loss is computed, and the results are summed; this accounts for the potential presence of multiple species in a single sequence. The sum of these binary losses is the total loss for the sequence. The competitor who minimizes the mean value of this loss over all test cases will top the leaderboard.
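A minimal NumPy sketch of this metric as described above (the function name and the clipping epsilon are our own choices, not the official scoring code):

    import numpy as np

    def mean_aggregated_binary_log_loss(y_true, y_pred, eps=1e-15):
        """y_true, y_pred: arrays of shape (n_sequences, 54) holding 0/1
        ground-truth labels and predicted probabilities, respectively."""
        # Clip predictions away from 0 and 1 to avoid log(0).
        y_pred = np.clip(y_pred, eps, 1 - eps)
        # Binary log loss per category, summed within each sequence...
        per_sequence = -np.sum(
            y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred),
            axis=1,
        )
        # ...then averaged over all sequences.
        return per_sequence.mean()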

Submission format


This is a brand new kind of DrivenData challenge! Rather than submit your predicted labels, you'll package everything needed to do inference and submit that for containerized execution on Azure. By leveraging Microsoft Azure's cloud computing platform and Docker containers, we're moving our competition infrastructure one step closer to translating participants’ innovation into impact.

See details on the submission format and process here.

Data download


Head over to the data download page. There are 10 seasons of images available for download. Each season can be downloaded either by clicking the CDN hyperlink in the left-hand column or by using azcopy cp with the blob links given in the right-hand column.

To confirm your downloads are complete and uncorrupted, you may reference the MD5 hashes and file sizes here.

The azcopy tool is not required, but it provides optimized downloads from Azure Blob Storage. Instructions for installing azcopy are available here.

Note that these represent the same file:

  • https://lilablobssc.blob.core.windows.net/snapshotserengeti-v-2-0/SnapshotSerengeti_S01_v2_0.zip
  • https://lilacdn.azureedge.net/snapshotserengeti-v-2-0/SnapshotSerengeti_S01_v2_0.zip
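To confirm a finished download against the published hashes, a short sketch using Python's standard hashlib module (shown for the Season 1 archive above):

    import hashlib

    def md5_of(path, chunk_size=1 << 20):
        """Compute the MD5 of a large file without loading it into memory."""
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                md5.update(chunk)
        return md5.hexdigest()

    # Compare this against the published hash for the archive.
    print(md5_of("SnapshotSerengeti_S01_v2_0.zip"))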

These 10 seasons can be used for training. The images for the test set are from an unreleased season and are stored in the data folder of the Docker container that we execute. Competitors will not have access to these images during the competition.

Image metadata, training labels, and the submission format CSV are also available on the data download page.

Good luck!


If you're wondering how to get started, check out our benchmark blog post!

Good luck and enjoy this problem! If you have any questions, you can always visit the user forum!