# Problem description

Welcome to Taï National Park, data scientists! In this challenge, your goal is to classify the species that appear in camera trap images collected by our research partners at the Wild Chimpanzee Foundation and the Max Planck Institute for Evolutionary Anthropology. As mentioned in the about page, camera traps are one of the best tools available to study and monitor wildlife populations, and the enormous amounts of data they provide can be used to track different species for conservation efforts—once they are processed.

Thanks to our partners, we have a trove of images from camera traps located in different sites around Taï National Park. There are seven types of critters captured in this set of images: birds, civets, duikers, hogs, leopards, other monkeys, and rodents. There are also images that contain no animals. Your job is to build a model that can help researchers predict whether an image contains one of these seven types of species. Let's predict!

## The features in this dataset

For this challenge, you are provided with images, along with a few attributes of each image that might be helpful in setting up your training and testing sets. The images for the training and testing sets are in train_features and test_features respectively, and additional information about each image is in train_features.csv and test_features.csv. Let's take a look at a few of our featured critters!

### Dataset

Each record in the dataset corresponds to a single image captured by a camera trap. Each record has one .jpg file associated with it in either the train_features or test_features directory, depending on which dataset it is a part of. Each image is accompanied by additional information in train_features.csv and test_features.csv, which contain the following fields:

• id (string, unique identifier) - a unique identifier for each image
• filepath (string, feature) - image path including its split directory (train or test)
• site (string, feature) - the site in which the image was taken

There is no overlap between the sites in the training data and the sites in the testing data, which means that the model you build must perform well in new contexts. You may use this site feature to create your own train and test set with disjoint sites to make sure your model can make good predictions on sites it has not been trained on.

Note: External data not provided through the challenge is not allowed for use in this competition. However, participants can use pre-trained computer vision models as long as they are publicly available freely and openly without any use of challenge data.

### Feature data example

Here is an example of a single image in our dataset - ZJ000048, which captures a monkey in the act of looking cute:

Here is the associated metadata for this image in train_features.csv:

| id | filepath | site |
|---|---|---|
| ZJ000048 | train_features/ZJ000048.jpg | S0025 |

Here is an example of a blank image (ZJ000144), where there are no animals detected:

These are its accompanying characteristics in train_features.csv:

| id | filepath | site |
|---|---|---|
| ZJ000144 | train_features/ZJ000144.jpg | S0013 |

Finally, here is an image from test_features (ZJ016556)—unlike the previous two images, which were a part of the training set, there is no test_labels.csv file to look up for this one! Your model will have to do the work:

Here is the information we have from test_features.csv:

| id | filepath | site |
|---|---|---|
| ZJ016556 | test_features/ZJ016556.jpg | S0086 |

### Tips for working with images as features

Working with images as features can be an exciting challenge. Here are some tips to consider as you begin:

#### Generalizing to new sites

As noted above, a model may erroneously learn to predict a species class based on characteristics of the environment, rather than characteristics of the species itself.

In order to make sure that our models predict well in new contexts, it is important that train and test sets have entirely different environments. In this case, we ensure this by making sure that sites are entirely in the train set or the test set, but never in both. We recommend that you take a similar approach in setting up your own splits of the training set.
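A site-disjoint split like the one recommended here can be sketched with the standard library alone. This is only an illustration under stated assumptions: the `split_by_site` helper, the 25% holdout fraction, and the toy ids and sites are all made up for the example, and scikit-learn's `GroupShuffleSplit` or `GroupKFold` would serve the same purpose with `site` as the group.

```python
import random


def split_by_site(records, holdout_fraction=0.25, seed=0):
    """Split records so that every site lands entirely on one side."""
    sites = sorted({r["site"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(sites)
    # Hold out whole sites, never individual images
    n_holdout = max(1, round(len(sites) * holdout_fraction))
    holdout_sites = set(sites[:n_holdout])
    train = [r for r in records if r["site"] not in holdout_sites]
    holdout = [r for r in records if r["site"] in holdout_sites]
    return train, holdout


# Toy rows shaped like train_features.csv (these ids and sites are invented)
records = [
    {"id": f"IMG{i:04d}", "filepath": f"train_features/IMG{i:04d}.jpg", "site": f"S{i % 8:04d}"}
    for i in range(40)
]

train, holdout = split_by_site(records)
train_sites = {r["site"] for r in train}
holdout_sites = {r["site"] for r in holdout}
print(len(train), len(holdout), train_sites & holdout_sites)
```

Splitting on the site identifier rather than on individual rows is what keeps background scenery from leaking between the two sets.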

#### Image transformations

Further, there are many differences to account for among images taken even within the same site. For example, consider these different images of our leopard friends:

Here is an image of a leopard on the prowl (ZJ000102), and here is an image of a leopard hiding out (ZJ000090). Here is an image of a leopard investigating the camera (ZJ000097), and here is an image of the leopard waving goodbye (ZJ000253).

As you can see, there are significant differences in our images of leopards. The animals may be close to or far from the camera, in the sun or in the shadows, or facing toward or away from the lens, among other variations. There are also differences in the color of the image, the weather conditions in which it was taken, and the type of camera.

In order to teach the model that these images are all of leopards despite these differences, it can be helpful to perform a preprocessing step called image augmentation. Image augmentation involves transforming the training images in multiple ways: rotating, distorting color or sharpness, and zooming in or out are a few examples. These manipulations of the image can help the model make correct predictions in contexts it has not been exposed to before.
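As a rough illustration of augmentation, the sketch below applies three simple transformations with NumPy: a random horizontal flip, a brightness change, and a random crop that acts as a mild zoom. The `augment` function, the specific transformations, and the parameter ranges are assumptions chosen for the example, not the challenge's recommended recipe; image libraries such as torchvision or Albumentations offer richer versions, including rotation and sharpness changes.

```python
import numpy as np


def augment(image, rng):
    """Return a randomly transformed copy of an H x W x 3 uint8 image."""
    out = image.astype(np.float32)
    # Random horizontal flip: the animal may face either direction
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Brightness jitter: simulates sun versus shade
    out = out * rng.uniform(0.7, 1.3)
    # Random crop to 90% of each side: a mild "zoom in" without resizing
    h, w = out.shape[:2]
    ch, cw = (h * 9) // 10, (w * 9) // 10
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    out = out[top:top + ch, left:left + cw, :]
    return np.clip(out, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
# Random pixels stand in for a decoded .jpg from train_features
image = rng.integers(0, 256, size=(100, 120, 3), dtype=np.uint8)
augmented = augment(image, rng)
print(augmented.shape, augmented.dtype)
```

Augmentation is normally applied only to training images, so that validation and test images are scored as they were captured.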

## Labels

There are eight possible labels for every image. If the image does not contain any animals, it is labeled as blank. Otherwise it must be labeled as containing one of the seven species groups included in our dataset:

1. antelope_duiker
2. bird
3. civet_genet
4. hog
5. leopard
6. monkey_prosimian
7. rodent

Note that each image should be associated with one class, since each image contains at most one animal. (Taï National Park is of course a great place to make friends, but we have chosen only those images that capture a single species class.) The train_labels.csv file therefore has a single row per image, containing the following values:

• id (string, identifier): the unique identifier corresponding to the image
• antelope_duiker, bird, blank, civet_genet, hog, leopard, monkey_prosimian, rodent (float, target variables): one column for each of the possible predicted classes that indicates whether the image is of that class (1) or not (0). Each row will add up to 1, since the classes are mutually exclusive (there are no images in this dataset that have more than one species class in them).

Here are the first few rows of train_labels.csv:

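| id | antelope_duiker | bird | blank | civet_genet | hog | leopard | monkey_prosimian | rodent |
|---|---|---|---|---|---|---|---|---|
| ZJ000000 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ZJ000001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ZJ000002 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ZJ000003 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ZJ000004 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |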
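Many training pipelines want a single class name per image rather than eight indicator columns. The sketch below shows one way to recover it, assuming train_labels.csv is comma-separated with the header shown above; three of the sample rows are inlined as text so the example runs on its own, and `decode_labels` is a hypothetical helper name.

```python
import csv
import io

CLASSES = ["antelope_duiker", "bird", "blank", "civet_genet",
           "hog", "leopard", "monkey_prosimian", "rodent"]

# Three of the sample rows, inlined as CSV text (assumed file layout)
sample = """id,antelope_duiker,bird,blank,civet_genet,hog,leopard,monkey_prosimian,rodent
ZJ000000,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
ZJ000001,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
ZJ000004,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
"""


def decode_labels(csv_text):
    """Map each image id to the name of the class whose column is 1."""
    labels = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        values = [float(row[c]) for c in CLASSES]
        # Each row should sum to 1 because the classes are mutually exclusive
        if abs(sum(values) - 1.0) > 1e-9:
            raise ValueError(f"row {row['id']} does not sum to 1")
        labels[row["id"]] = CLASSES[values.index(max(values))]
    return labels


labels = decode_labels(sample)
print(labels)
```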

## Submission and evaluation

For this challenge, you will submit the probability that an image belongs to each of the eight possible classes for each image in test_features. Your performance will be based on how correctly you are able to identify the animals and blanks in these images.

### Submission format

The submission format contains nine columns:

• id (string): unique identifier for each row from test_features.csv
• antelope_duiker, bird, blank, civet_genet, hog, leopard, monkey_prosimian, rodent (float, target variables): there is a column for each of the eight possible classes containing the probability that the image is of that class.

The predictions for the target variables should be float probabilities that range between 0.0 and 1.0. Each row should add up to 1, since the classes are mutually exclusive. In submission_format.csv, the target columns are currently filled in with a random placeholder probability. To create a submission, overwrite these eight columns with your predictions for the respective classes. Your final submission should resemble train_labels.csv in form, but will be for the ids in test_features.csv.

For example, if you made the following predictions:

| id | antelope_duiker | bird | blank | civet_genet | hog | leopard | monkey_prosimian | rodent |
|---|---|---|---|---|---|---|---|---|
| ZJ016488 | 0.048233 | 0.189185 | 0.044914 | 0.199588 | 0.106118 | 0.132915 | 0.166410 | 0.112637 |
| ZJ016489 | 0.097078 | 0.061400 | 0.026409 | 0.241530 | 0.144344 | 0.051780 | 0.287811 | 0.089648 |
| ZJ016490 | 0.124658 | 0.089101 | 0.189225 | 0.174494 | 0.180540 | 0.079995 | 0.085672 | 0.076314 |
| ZJ016491 | 0.109966 | 0.048397 | 0.055598 | 0.323600 | 0.322356 | 0.063252 | 0.008160 | 0.068671 |
| ZJ016492 | 0.165742 | 0.184610 | 0.005431 | 0.136806 | 0.000389 | 0.122078 | 0.151521 | 0.233423 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ZJ020947 | 0.143675 | 0.185103 | 0.109074 | 0.158833 | 0.083497 | 0.010513 | 0.155293 | 0.154011 |
| ZJ020948 | 0.068913 | 0.150572 | 0.197232 | 0.124579 | 0.131486 | 0.152534 | 0.039096 | 0.135588 |
| ZJ020949 | 0.098239 | 0.149066 | 0.168635 | 0.086321 | 0.136863 | 0.077155 | 0.12148 | 0.162241 |
| ZJ020950 | 0.036345 | 0.115411 | 0.302392 | 0.072757 | 0.105491 | 0.183422 | 0.106307 | 0.077875 |
| ZJ020951 | 0.118298 | 0.093348 | 0.088412 | 0.110691 | 0.129448 | 0.009582 | 0.243225 | 0.206995 |

the .csv file that you submit would look like this:

id,antelope_duiker,bird,blank,civet_genet,hog,leopard,monkey_prosimian,rodent
ZJ016488,0.048233,0.189185,0.044914,0.199588,0.106118,0.132915,0.16641,0.112637
ZJ016489,0.097078,0.0614,0.026409,0.24153,0.144344,0.05178,0.287811,0.089648
ZJ016490,0.124658,0.089101,0.189225,0.174494,0.18054,0.079995,0.085672,0.076314
ZJ016491,0.109966,0.048397,0.055598,0.3236,0.322356,0.063252,0.00816,0.068671
ZJ016492,0.165742,0.18461,0.005431,0.136806,0.000389,0.122078,0.151521,0.233423

...

ZJ020947,0.143675,0.185103,0.109074,0.158833,0.083497,0.010513,0.155293,0.154011
ZJ020948,0.068913,0.150572,0.197232,0.124579,0.131486,0.152534,0.039096,0.135588
ZJ020949,0.098239,0.149066,0.168635,0.086321,0.136863,0.077155,0.12148,0.162241
ZJ020950,0.036345,0.115411,0.302392,0.072757,0.105491,0.183422,0.106307,0.077875
ZJ020951,0.118298,0.093348,0.088412,0.110691,0.129448,0.009582,0.243225,0.206995
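A file in this shape can be produced with Python's `csv` module. The sketch below is a minimal illustration, not an official template: the raw scores are invented, `write_submission` is a hypothetical helper, and dividing each row by its sum is just one simple way to make the eight values add up to 1.

```python
import csv
import io

CLASSES = ["antelope_duiker", "bird", "blank", "civet_genet",
           "hog", "leopard", "monkey_prosimian", "rodent"]


def normalize(scores):
    """Rescale non-negative scores so that they sum to 1."""
    total = sum(scores)
    if total <= 0 or min(scores) < 0:
        raise ValueError("scores must be non-negative with a positive sum")
    return [s / total for s in scores]


def write_submission(predictions, out):
    """Write one row per image id, with columns in the required order."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id"] + CLASSES)
    for image_id, scores in predictions.items():
        probs = normalize(scores)
        writer.writerow([image_id] + [f"{p:.6f}" for p in probs])


# Invented raw model scores for two test ids, one value per class
predictions = {
    "ZJ016488": [1, 4, 1, 4, 2, 3, 3, 2],
    "ZJ016489": [2, 1, 1, 5, 3, 1, 6, 1],
}

buffer = io.StringIO()  # swap in open("submission.csv", "w", newline="") to write a file
write_submission(predictions, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

Checking the header order and the row sums before uploading is a cheap way to catch formatting mistakes.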


### Performance metric

To measure your model's accuracy by looking at prediction error, we'll use a metric called log loss. This is an error metric, so a lower value is better (as opposed to an accuracy metric, where a higher value is better). Log loss can be calculated as follows:

$$loss = -\frac{1}{N}\cdot\sum\limits_{i=1}^{N}\sum\limits_{j=1}^{M} y_{ij}\log p_{ij}$$

where $N$ is the number of observations, $M$ is the number of classes (in terms of our problem $M=8$), $y_{ij}$ is a binary variable indicating whether observation $i$ belongs to class $j$, and $p_{ij}$ is the user-predicted probability that label $j$ applies to observation $i$.

In Python you can easily calculate log loss using the scikit-learn function sklearn.metrics.log_loss. R users may find the MultiLogLoss function in the MLmetrics package.
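To make the formula concrete, here is a from-scratch sketch in plain Python. It is meant for building intuition rather than for scoring. Clipping probabilities with `eps` is an assumption added here to avoid taking the log of zero, so the competition's exact handling of extreme values may differ.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Multi-class log loss: -(1/N) * sum_i sum_j y_ij * log(p_ij)."""
    n = len(y_true)
    total = 0.0
    for y_row, p_row in zip(y_true, y_pred):
        for y, p in zip(y_row, p_row):
            if y:
                # Clip so that log() never sees exactly 0 or 1 (assumed safeguard)
                total += y * math.log(min(max(p, eps), 1 - eps))
    return -total / n


# Two toy observations: the first is a bird, the second a monkey_prosimian
y_true = [
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0],
]
# A fairly confident correct guess, then a uniform guess over eight classes
y_pred = [
    [0.05, 0.65, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
    [0.125] * 8,
]

loss = log_loss(y_true, y_pred)
uniform_loss = log_loss(y_true, [[0.125] * 8, [0.125] * 8])
print(loss, uniform_loss)
```

Predicting 1/8 for every class on every image scores $\log 8 \approx 2.079$, which makes a useful baseline: a model that scores above it is doing worse than uniform guessing.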

## Good luck!

Good luck and enjoy this problem! If you have any questions you can always visit the user forum for this competition. Happy predicting!