competition
complete
\$16,000

## Woohoo! This competition has come to a close!

Many thanks to the participants for all of their hard work and commitment to using data for good!

# Problem description

We're very excited for you to bring your modeling skills to the world of penguins! Your goal is to predict the number of penguins at different sites at different times.

All of the data for this competition and the MAPPPD project comes from the hard work of scientists around the globe who are dedicated to collecting data on penguins. The data comes from the published works of these contributors, and full citations to this work can be found on the MAPPPD page.

We're running our competition with two main prize pools. The first is the prediction competition. Just like any other DrivenData competition, in the prediction competition you'll submit a model that predicts penguin population counts in Antarctica. The top 5 models by accuracy will be awarded a prize.

We're also offering separate prizes for the participants who have the best biological justification for their model. As you work on your model, think about the data you included, the modeling assumptions you made, and how your methods work. Write up your work and submit a link to us with the explanation and insight. Our expert panel will review these reports and award prizes to the two reports they find most compelling.

## External data

You may use any publicly available data that are not penguin population counts in this competition. Sea ice, weather, krill populations, and chlorophyll have all been studied as possible covariates with penguin populations. We provide you the latitude and longitude of the colony sites so that you can match other geospatial timeseries to this dataset and improve your predictions. Please document any of these datasets and include a link to the dataset in your model write-up!
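One way to match an external gridded dataset to a colony is to pick the grid point nearest to the site's coordinates. The sketch below assumes the external data come as lists of grid latitudes and longitudes in decimal degrees; the function names and the grid values are hypothetical and only illustrate the idea.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def nearest_grid_index(site_lat, site_lon, grid_lats, grid_lons):
    """Index of the grid point closest to the site."""
    distances = haversine_km(site_lat, site_lon,
                             np.asarray(grid_lats), np.asarray(grid_lons))
    return int(np.argmin(distances))


# Made-up grid points, for illustration only.
grid_lats = [-62.0, -62.5, -63.0]
grid_lons = [-60.0, -59.75, -59.0]

# Site coordinates taken from the example row for site AITC.
idx = nearest_grid_index(-62.4071, -59.7517, grid_lats, grid_lons)
```

A nearest-point match is only one option; interpolating between grid points or averaging over a radius around the colony may suit some covariates better.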

## Raw observations

You are provided with historical counts from many different sites for different species of penguins. The dataset contains all of the observations through the 2013 penguin season. In some cases, a colony was visited more than once per season. Counts performed at the same colony in the same season often differ slightly because of factors such as counting errors, birds dying, nest abandonment, and the timing of egg laying or fledging (birds going to sea).

### Explanation of columns

• site_name - A string with the name for the site.
• site_id - A unique four-character alphanumeric code for the site.
• ccamlr_region - Region of Antarctica as defined by CCAMLR.
• longitude_epsg_4326 - Longitude for the site in decimal degrees.
• latitude_epsg_4326 - Latitude for the site in decimal degrees.
• common_name - The species of penguin that was observed.
  • adelie penguin
  • chinstrap penguin
  • gentoo penguin
• day - The day of the survey. Missing information encoded as 'Unknown'. Note: Not all counts have exact dates available. Some are estimates for a particular season, and for others the dates were not reported. This can matter because a count is affected by its timing (for example, when birds lay eggs or when the chicks leave to go to sea).
• month - The month of the survey. Missing information encoded as 'Unknown'.
• year - The year of the survey. Missing information encoded as 'Unknown'.
• season_starting - The season of the count. The Antarctic "season" is the austral summer for penguin breeding and stretches from November to March of the following year. Accordingly, we denote the season by the year in which the breeding season began. For example, January 2011 is in the 2010 season and November 2010 is in the 2010 season.
• penguin_count - The number of either nests, chicks, or adults observed.
• accuracy - Not all data were collected the same way. Data accuracy can be affected by the method used (ground counts, aerial photos, or satellite-based estimates) and the time available for surveying, as well as weather conditions and terrain. The accuracy of each count is estimated by the surveyor at the time of the census and is usually recorded alongside the population estimate.
  • 1 - count is within $\pm 5\%$, $e_{n}=0.05$
  • 2 - count is within $\pm 10\%$, $e_{n}=0.10$
  • 3 - count is within $\pm 25\%$, $e_{n}=0.25$
  • 4 - count is within $\pm 50\%$, $e_{n}=0.50$
  • 5 - order of magnitude "guesstimate", $e_{n}=0.90$
• e_n - The error weights, derived from the accuracy, that are used in the scoring function. (Note: we also provide these weights for nests in the same format as the nest count timeseries.)
  • 0.05 - $e_{n}$ when accuracy is 1
  • 0.1 - $e_{n}$ when accuracy is 2
  • 0.25 - $e_{n}$ when accuracy is 3
  • 0.5 - $e_{n}$ when accuracy is 4
  • 0.9 - $e_{n}$ when accuracy is 5
  • nan - The accuracy is sometimes missing. In these cases we assume 0.05 (the most accurate) when calculating the score.
• count_type - Penguin counts can be made using different methods including counting adults, counting chicks, and counting nests. This is a categorical variable that tracks the kind of observation.
  • nests - Number of penguin nests counted
  • chicks - Number of penguin chicks counted. These species generally lay 2 eggs, so 1 nest can produce 0, 1 or 2 chicks.
  • adults - Number of adult penguins counted
• vantage - How the count was made. Different vantages mean that the counts will have different accuracies.
  • ground - A person on the ground provided the count
  • None - vantage is unknown
  • vhr - very-high resolution commercial satellite imagery
  • landsat - Landsat satellite imagery
  • ground photo - Count is from photographs that were taken on the ground
  • aerial - A person in an aircraft provided the count
  • aerial photo - Count is from photographs that were taken from an aircraft
  • offshore vessel - A person on a boat off the shore provided the count
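The mapping from accuracy codes to error weights described in the column list can be sketched with pandas. The dictionary and function names here are hypothetical, and the input series is made up; only the code-to-weight values and the 0.05 default for missing accuracy come from the description above.

```python
import numpy as np
import pandas as pd

# Accuracy code -> error weight e_n, as listed in the column descriptions.
ACCURACY_TO_EN = {1: 0.05, 2: 0.10, 3: 0.25, 4: 0.50, 5: 0.90}


def error_weights(accuracy: pd.Series) -> pd.Series:
    """Map accuracy codes to e_n, using 0.05 where the accuracy is missing."""
    return accuracy.map(ACCURACY_TO_EN).fillna(0.05)


# Made-up accuracy values, including one missing entry.
accuracy = pd.Series([1, 3, np.nan, 5])
e_n = error_weights(accuracy)
```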

### Example of a row

For example, a single row in the dataset has these values:

| Column | Row 0 |
|---|---|
| site_name | Barrientos Island (Aitcho Islands) |
| site_id | AITC |
| ccamlr_region | 48.1 |
| longitude_epsg_4326 | -59.7517 |
| latitude_epsg_4326 | -62.4071 |
| common_name | chinstrap penguin |
| day | 25 |
| month | 11 |
| year | 1997 |
| season_starting | 1997 |
| penguin_count | 4608 |
| accuracy | 1 |
| count_type | nests |
| vantage | ground |
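The relationship between `month`, `year`, and `season_starting` can be sketched as a small helper. The function name and the `split_month` parameter are hypothetical. The description only covers November to March, so the treatment of other months (anything from July onward is assigned to the season starting that year) is an assumption.

```python
def season_starting(year, month, split_month=7):
    """Return the year in which the breeding season began, or None if unknown.

    Assumption: months >= split_month belong to the season starting that
    year, and earlier months belong to the season that started the year before.
    """
    try:
        year, month = int(year), int(month)
    except (TypeError, ValueError):
        return None  # 'Unknown' or otherwise missing values
    return year if month >= split_month else year - 1


# Examples from the column description and the example row.
jan_2011 = season_starting(2011, 1)    # January 2011 is in the 2010 season
nov_2010 = season_starting(2010, 11)   # November 2010 is in the 2010 season
nov_1997 = season_starting(1997, 11)   # the example row: season 1997
```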

## Performance metric

Performance is evaluated according to an adjusted MAPE calculation. Since some penguin counts have differing accuracies, the percent error is weighted differently based on this expected accuracy.

$$\mathrm{AMAPE} = \frac{1}{N}\sum_{n=1}^{N} \frac{\left| \frac{\hat{y}_n - y_n}{\max(1,\, y_n)} \right|}{e_n}$$

where $\hat{y}_n$ is the predicted penguin count, $y_n$ is the actual penguin count, $e_n$ is the error weight that reflects our confidence in the penguin count, and $N$ is the number of counts being scored. To avoid dividing by zero, the percentage denominator is $\max(1, y_n)$, so if $y_n = 0$ we use one instead.

Note: We provide $e_n$ in the same structured format as the nest count timeseries for ease of calculating this metric.
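For local validation, the metric can be sketched with NumPy as below. This is not the official scoring code. The function name is hypothetical, and two choices are assumptions: entries with no actual count (NaN) are skipped, and missing error weights default to 0.05, following the column description.

```python
import numpy as np


def amape(y_true, y_pred, e_n):
    """Adjusted MAPE following the formula above (a local sketch)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    e_n = np.asarray(e_n, dtype=float)

    # Assumption: only entries with an actual observed count are scored.
    mask = ~np.isnan(y_true)
    y_true, y_pred, e_n = y_true[mask], y_pred[mask], e_n[mask]

    # Missing error weights are treated as 0.05 (the most accurate).
    e_n = np.where(np.isnan(e_n), 0.05, e_n)

    pct_error = np.abs(y_pred - y_true) / np.maximum(1.0, y_true)
    return float(np.mean(pct_error / e_n))


# Made-up values: the third entry has no actual count and is skipped.
score = amape(
    y_true=[100.0, 0.0, np.nan],
    y_pred=[110.0, 2.0, 50.0],
    e_n=[0.10, np.nan, np.nan],
)
```

In this example the first entry contributes $0.1 / 0.1 = 1$ and the second contributes $2 / 0.05 = 40$, so the score is their mean, about 20.5. Because the error is divided by $e_n$, mistakes on the most accurate counts are penalized most heavily.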

## Nest count data

Nest counts are the most direct reflection of the number of breeding pairs, which is the quantity that biologists are usually interested in when modeling populations. Occasionally, only counts of adults are possible, either because the count is made from an aircraft or because the count is made before nests have been established. To a first approximation, we generally assume that one adult counted at the colony corresponds to one nest, since the two adults of a pair often take turns at the colony while the other is feeding. The number of chicks does not directly relate to the number of nests because all three species lay two eggs, so each nest may yield zero, one, or two chicks in a season.

To make things a little easier, we've extracted just the nest counts from the raw data and transformed them into a timeseries. To do that, we take the following steps. First, we remove any observations that are for chicks or adults. Next, we look at each site_id+common_name pair (because some sites have colonies of more than one species). For each pair, we take the maximum nest count that we see in each season and use that as our count for the season. The columns in the dataset are now the seasons where we have at least one observation, and the values are the nest counts.
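The steps above can be sketched with pandas as follows. The function name is hypothetical and the input rows are made up; only the column names come from the raw observation data described earlier.

```python
import pandas as pd


def nest_count_timeseries(observations: pd.DataFrame) -> pd.DataFrame:
    """Build a (site_id, common_name) x season table of maximum nest counts."""
    # Step 1: keep only nest counts, dropping chick and adult observations.
    nests = observations[observations["count_type"] == "nests"]

    # Step 2: take the maximum count per site, species, and season.
    season_max = nests.groupby(
        ["site_id", "common_name", "season_starting"]
    )["penguin_count"].max()

    # Step 3: pivot seasons into columns; unobserved seasons become NaN.
    return season_max.unstack("season_starting")


# Made-up observations, for illustration only.
raw = pd.DataFrame({
    "site_id": ["AAAA", "AAAA", "AAAA", "AAAA", "BBBB"],
    "common_name": ["gentoo penguin", "gentoo penguin", "gentoo penguin",
                    "adelie penguin", "adelie penguin"],
    "season_starting": [2010, 2010, 2011, 2010, 2011],
    "penguin_count": [100, 120, 90, 500, 40],
    "count_type": ["nests", "nests", "nests", "adults", "nests"],
})
timeseries = nest_count_timeseries(raw)
```

Here the two 2010 nest counts for the first site collapse to their maximum, and the adult count is dropped before grouping.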

We provide data in this format for all of the seasons up through 2013. However, the time series for each site are variable in length and patchy due to the expensive nature of doing field work in Antarctica.

Here is a sample from the nest count data:

| site_id | common_name | 2009 | 2010 | 2011 | 2012 | 2013 |
|---|---|---|---|---|---|---|
| BENN | chinstrap penguin | NaN | NaN | NaN | NaN | NaN |
| BENT | adelie penguin | NaN | NaN | NaN | NaN | NaN |
| BERK | adelie penguin | NaN | 13423.0 | NaN | NaN | NaN |
| BERN | gentoo penguin | NaN | NaN | NaN | NaN | NaN |
| BERT | adelie penguin | NaN | 346.0 | NaN | 313.0 | NaN |
| BIEN | adelie penguin | NaN | 35466.0 | NaN | NaN | NaN |
| BISC | adelie penguin | 594.0 | 539.0 | 567.0 | 522.0 | NaN |
| BISC | gentoo penguin | 2401.0 | 2404.0 | 3081.0 | 3197.0 | NaN |
| BLAK | chinstrap penguin | NaN | NaN | NaN | NaN | NaN |
| BOLI | adelie penguin | NaN | NaN | NaN | NaN | NaN |

## Submission format

The format for the submission file is provided in submission_format.csv. We're asking you to predict nest counts for a number of site_id/common_name combinations that appear in the dataset. For each of these combinations, you must predict a float value (so zeros need to have a decimal in the CSV, e.g. 0.0) that represents the nest count for that season in that location. You will be evaluated against the nest count data that was actually collected.

We ask you to predict these nest counts for four seasons into the future: 2014, 2015, 2016, and the upcoming season, 2017.

Here is an example of what your predictions might be:

| site_id | common_name | 2014 | 2015 | 2016 | 2017 |
|---|---|---|---|---|---|
| ACUN | adelie penguin | 0.0 | 0.0 | 0.0 | 0.0 |
| ACUN | chinstrap penguin | 0.0 | 0.0 | 0.0 | 0.0 |
| AILS | chinstrap penguin | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| YANK | gentoo penguin | 0.0 | 0.0 | 0.0 | 0.0 |
| YOUN | adelie penguin | 0.0 | 0.0 | 0.0 | 0.0 |
| YTRE | adelie penguin | 0.0 | 0.0 | 0.0 | 0.0 |
| ZAWA | chinstrap penguin | 0.0 | 0.0 | 0.0 | 0.0 |
| ZIGZ | chinstrap penguin | 0.0 | 0.0 | 0.0 | 0.0 |

Your .csv file for those predictions would look like:

site_id,common_name,2014,2015,2016,2017
ACUN,chinstrap penguin,0.0,0.0,0.0,0.0
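A file in this layout can be produced with pandas, for example as in the sketch below. The two index rows and the single non-zero prediction are placeholders rather than real forecasts, and writing to an in-memory buffer is only for illustration. In practice you would probably start from the rows in submission_format.csv and write to a file on disk.

```python
import io

import pandas as pd

seasons = ["2014", "2015", "2016", "2017"]

# Placeholder rows; the real rows come from submission_format.csv.
index = pd.MultiIndex.from_tuples(
    [("ACUN", "adelie penguin"), ("ACUN", "chinstrap penguin")],
    names=["site_id", "common_name"],
)

# Start from all zeros, stored as floats so zeros are written as 0.0.
submission = pd.DataFrame(0.0, index=index, columns=seasons)

# Placeholder prediction, not a real forecast.
submission.loc[("ACUN", "adelie penguin"), "2014"] = 12.0

# Make sure every value is a float before writing.
submission = submission.astype(float)

buffer = io.StringIO()
submission.to_csv(buffer)
csv_lines = buffer.getvalue().splitlines()
```

Keeping the columns as floats is what makes pandas write `0.0` rather than `0`, which matches the requirement that predictions be float values.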