Problem description

We're very excited for you to bring your modeling skills to the world of penguins! Your goal is to predict the number of penguins at different sites at different times.

All of the data for this competition and the MAPPPD project comes from the hard work of scientists around the globe who are dedicated to collecting data on penguins. The data comes from the published works of these contributors, and full citations to this work can be found on the MAPPPD page.

We're running our competition with two main prize pools. The first is the prediction competition. Just like any other DrivenData competition, in the prediction competition you'll submit a model that predicts penguin population counts in Antarctica. The top 5 models by accuracy will be awarded a prize.

We're also offering separate prizes for the participants who have the best biological justification for their model. As you work on your model, think about the data you included, the modeling assumptions you made, and how your methods work. Write up your work and submit a link to us with the explanation and insight. Our expert panel will review these reports and award prizes to the two reports they find most compelling.

External data

You may use any publicly available data that are not penguin population counts in this competition. Sea ice, weather, krill populations, and chlorophyll have all been studied as possible covariates with penguin populations. We provide you the latitude and longitude of the colony sites so that you can match other geospatial timeseries to this dataset and improve your predictions. Please document any of these datasets and include a link to the dataset in your model write-up!
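As a rough illustration of matching an external geospatial dataset to a colony site, the sketch below picks the nearest grid point by great-circle (haversine) distance. The function names, the spherical-Earth radius, and the example grid points are assumptions made for illustration; only the site coordinates come from the example row in this description.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation (assumption)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def nearest_point(site_lat, site_lon, points):
    """Return the (lat, lon) tuple in `points` closest to the site."""
    return min(points, key=lambda p: haversine_km(site_lat, site_lon, p[0], p[1]))


# Site AITC from the example row; the grid points are made up for illustration.
grid = [(-62.5, -59.75), (-70.0, -10.0)]
closest = nearest_point(-62.4071, -59.7517, grid)
```

A covariate time series (for example sea ice concentration) at `closest` could then be joined to the site's counts by season.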

Provided data

Raw observations

You are provided data with historical counts from many different sites for different kinds of penguins. We provide you with the dataset of all of the observations through the 2013 penguin season. In some cases, a colony was visited more than once per season. Often, counts performed at the same colony in the same season can be slightly different due to a number of factors, including counting errors, birds dying, nest abandonment, and the timing of egg laying or fledging (birds going to sea).

Explanation of columns

site_name - A string with the name for the site.
site_id - A unique four-character alphanumeric code for the site.
ccamlr_region - Region of Antarctica as defined by CCAMLR
longitude_epsg_4326 - Longitude for the site in decimal degrees.
latitude_epsg_4326 - Latitude for the site in decimal degrees.
common_name - The species of penguin that was observed.
- chinstrap penguin
- gentoo penguin
- adelie penguin
day - The day of the survey. Missing information encoded as 'Unknown'. Note: Not all counts have exact dates available for when they were performed. In other words, they were either estimates for a particular season or the dates were not reported. This could be important because a count can be impacted by its timing (i.e. because of the timing of when birds lay eggs or when the chicks leave to go to sea).
month - The month of the survey. Missing information encoded as 'Unknown'.
year - The year of the survey. Missing information encoded as 'Unknown'.
season_starting - The season of the count. The Antarctic "season" is the austral summer for penguin breeding and stretches from November to March of the following year. Accordingly, we denote the season by the year in which the breeding season began. For example, January 2011 is in the 2010 season and November 2010 is in the 2010 season.
penguin_count - The number of either nests, chicks, or adults observed.
accuracy - Not all data were collected the same way. Data accuracy can be affected by the method used (ground counts, aerial photos, or satellite-based estimates) and the time available for surveying, as well as weather conditions and terrain. The accuracy of each count is estimated by the surveyor at the time of the census and is usually recorded alongside the population estimate.
- 1 - count is within |$\pm 5\%$|, |$e_{n}=0.05$|
- 2 - count is within |$\pm 10\%$|, |$e_{n}=0.10$|
- 3 - count is within |$\pm 25\%$|, |$e_{n}=0.25$|
- 4 - count is within |$\pm 50\%$|, |$e_{n}=0.50$|
- 5 - order of magnitude "guesstimate", |$e_{n}=0.90$|
e_n - These are the error weights based on the accuracy for use in the scoring function. (Note: we also provide these for nests in the same format as the nest count timeseries).
- 0.05 - |$e_{n}$| when accuracy is 1
- 0.1 - |$e_{n}$| when accuracy is 2
- 0.25 - |$e_{n}$| when accuracy is 3
- 0.5 - |$e_{n}$| when accuracy is 4
- 0.9 - |$e_{n}$| when accuracy is 5
- nan - Sometimes the accuracy is not present. In these cases we assume 0.05 (the most accurate) when calculating the score.
count_type - Penguin counts can be made using different methods including counting adults, counting chicks, and counting nests. This is a categorical variable that tracks the kind of observation.
- nests - Number of penguin nests counted
- chicks - Number of penguin chicks counted. These species generally lay 2 eggs, so 1 nest can produce 0, 1 or 2 chicks.
- adults - Number of adult penguins counted
vantage - How the count was made. Different vantages mean that the counts will have different accuracies.
- ground - A person on the ground provided the count
- None - vantage is unknown
- vhr - very-high resolution commercial satellite imagery
- landsat - Landsat satellite imagery
- ground photo - Count is from photographs that were taken on the ground
- aerial - A person in an aircraft provided the count
- aerial photo - Count is from photographs that were taken from an aircraft
- offshore vessel - A person on a boat off the shore provided the count
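The accuracy-to-|$e_{n}$| mapping and the season convention described in the column list can be sketched in a few lines. The function names are hypothetical, and the July cutoff used to assign off-season months is an assumption: the description only states that the season runs from November to March and is labelled by the year in which it began.

```python
import math

# Error weights from the accuracy table; a missing accuracy is scored as 0.05.
E_N_BY_ACCURACY = {1: 0.05, 2: 0.10, 3: 0.25, 4: 0.50, 5: 0.90}
DEFAULT_E_N = 0.05


def error_weight(accuracy):
    """Return e_n for an accuracy code, treating None/NaN as the most accurate."""
    if accuracy is None or (isinstance(accuracy, float) and math.isnan(accuracy)):
        return DEFAULT_E_N
    return E_N_BY_ACCURACY[int(accuracy)]


def infer_season(year, month):
    """Return the season a survey date falls in, or None if a field is 'Unknown'.

    Assumption (not from the data description): January-June belong to the
    season that began in the previous calendar year; July-December belong to
    the season beginning in the same calendar year.
    """
    if year == "Unknown" or month == "Unknown":
        return None
    year, month = int(year), int(month)
    return year - 1 if month <= 6 else year
```

With this convention, `infer_season(2011, 1)` and `infer_season(2010, 11)` both return 2010, matching the example given for season_starting. The dataset already provides season_starting and e_n, so these helpers are only useful for checking or for applying the same convention to external data.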

Example of a row

For example, a single row in the dataset has these values:

	0
site_name	Barrientos Island (Aitcho Islands)
site_id	AITC
ccamlr_region	48.1
longitude_epsg_4326	-59.7517
latitude_epsg_4326	-62.4071
common_name	chinstrap penguin
day	25
month	11
year	1997
season_starting	1997
penguin_count	4608
accuracy	1
count_type	nests
vantage	ground

Performance metric

Performance is evaluated according to an adjusted MAPE calculation. Since some penguin counts have differing accuracies, the percent error is weighted differently based on this expected accuracy.

$$AMAPE = \frac{1}{N}\sum_{n=1}^{N} \frac{\bigl| \frac{ \hat{y_n} - y_n }{\max(1, y_n)} \bigr|}{e_n}$$

where |$ \hat{y_n} $| is the predicted penguin count, |$y_n$| is the actual penguin count, and |$e_n$| is the error weight that reflects our confidence in the penguin count. We also will not divide by zero, so if |$y_n = 0$| we use one as the percentage denominator instead.

Note: We provide |$e_n$| in the same structured format as the nest count timeseries for ease of calculating this metric.
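For local validation, the metric can be sketched as below. This is a minimal reading of the formula, not the official scoring code; the function name is hypothetical, and skipping entries where no count was collected is an assumption based on the statement that predictions are evaluated against the nest counts that were actually collected.

```python
import math


def amape(y_true, y_pred, e_n):
    """Adjusted MAPE over the entries of y_true that are actually observed.

    Assumptions: unobserved counts (None or NaN) are skipped, and a missing
    e_n is replaced by 0.05, as described for the e_n column.
    """
    total = 0.0
    n_scored = 0
    for y, y_hat, e in zip(y_true, y_pred, e_n):
        if y is None or math.isnan(y):
            continue  # no observation to score against
        if e is None or math.isnan(e):
            e = 0.05  # missing accuracy is treated as the most accurate
        percent_error = abs(y_hat - y) / max(1, y)  # never divide by zero
        total += percent_error / e
        n_scored += 1
    if n_scored == 0:
        raise ValueError("no observed counts to score")
    return total / n_scored


score = amape([100.0, 0.0, float("nan")], [110.0, 2.0, 5.0], [0.10, 0.05, 0.25])
```

In this example, the 10% error on the first count is divided by its weight of 0.10, so a miss exactly at the edge of the stated accuracy contributes about 1 to the sum. The same absolute miss on a more accurate count is penalized more heavily.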

Nest count data

Nest counts are the most direct reflection of the number of breeding pairs, which is the quantity that biologists are usually interested in when modeling populations. Occasionally, only counts of adults are possible, either because the count is made from an aircraft or because the count is made before nests have been established. To a first approximation, we generally assume one adult counted at the colony corresponds to one nest, since the adults often take turns at the colony while the other is feeding. The number of chicks does not directly relate to the number of nests because all three species lay two eggs and so each nest may yield zero, one, or two chicks in a season.

To make things a little easier, we've extracted just the nest counts from the raw data and transformed them into a timeseries. To do that, we take the following steps. First, we remove any observations that are for chicks or adults. Next, we look at each site_id+common_name pair (because some sites have colonies of penguins of different species). For each pair, we take the maximum nest count that we see in each season and use that as our count for the season. The columns in the dataset are now the seasons where we have at least one observation, and the values are the nest counts.
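The transformation from raw observations to the nest count timeseries could look roughly like the following. It is a sketch of the steps as described, not the organizers' code; the function names are hypothetical, rows are assumed to be dicts keyed by the column names listed earlier, and the duplicate and chick rows in the example are invented for illustration.

```python
from collections import defaultdict


def nest_count_timeseries(rows):
    """Return {(site_id, common_name): {season: max nest count}} from raw rows."""
    series = defaultdict(dict)
    for row in rows:
        if row["count_type"] != "nests":
            continue  # drop chick and adult observations
        key = (row["site_id"], row["common_name"])
        season = int(row["season_starting"])
        count = float(row["penguin_count"])
        # Keep the maximum count when a colony was visited more than once.
        series[key][season] = max(count, series[key].get(season, count))
    return dict(series)


def to_table(series):
    """Return (sorted seasons, {key: [count or NaN per season]})."""
    seasons = sorted({s for counts in series.values() for s in counts})
    table = {
        key: [counts.get(s, float("nan")) for s in seasons]
        for key, counts in series.items()
    }
    return seasons, table


raw = [
    {"site_id": "BISC", "common_name": "adelie penguin", "season_starting": 2009,
     "penguin_count": 594, "count_type": "nests"},
    {"site_id": "BISC", "common_name": "adelie penguin", "season_starting": 2009,
     "penguin_count": 580, "count_type": "nests"},   # invented repeat visit
    {"site_id": "BISC", "common_name": "adelie penguin", "season_starting": 2010,
     "penguin_count": 539, "count_type": "nests"},
    {"site_id": "BISC", "common_name": "gentoo penguin", "season_starting": 2009,
     "penguin_count": 3000, "count_type": "chicks"},  # invented chick count
]
series = nest_count_timeseries(raw)
seasons, table = to_table(series)
```

Seasons with no nest observation for a given pair come out as NaN, which mirrors the patchy rows in the sample data.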

We provide data in this format for all of the seasons up through 2013. However, the time series for each site are variable in length and patchy due to the expensive nature of doing field work in Antarctica.

Here is a sample from the nest count data:

		2009	2010	2011	2012	2013
site_id	common_name
BENN	chinstrap penguin	NaN	NaN	NaN	NaN	NaN
BENT	adelie penguin	NaN	NaN	NaN	NaN	NaN
BERK	adelie penguin	NaN	13423.0	NaN	NaN	NaN
BERN	gentoo penguin	NaN	NaN	NaN	NaN	NaN
BERT	adelie penguin	NaN	346.0	NaN	313.0	NaN
BIEN	adelie penguin	NaN	35466.0	NaN	NaN	NaN
BISC	adelie penguin	594.0	539.0	567.0	522.0	NaN
BISC	gentoo penguin	2401.0	2404.0	3081.0	3197.0	NaN
BLAK	chinstrap penguin	NaN	NaN	NaN	NaN	NaN
BOLI	adelie penguin	NaN	NaN	NaN	NaN	NaN

Submission format

The format for the submission file is provided in submission_format.csv. We're asking you to predict nest counts for a number of site_id/common_name combinations that appear in the dataset. For each of these combinations, you must predict a float value (so zeros need to have a decimal in the CSV, e.g. 0.0) that represents the nest count for that season in that location. You will be evaluated against the nest count data that was actually collected.

We ask you to predict these nest counts for four seasons into the future: 2014, 2015, 2016, and the upcoming season, 2017.

Here is an example of what your predictions might be:

		2014	2015	2016	2017
site_id	common_name
ACUN	adelie penguin	0.0	0.0	0.0	0.0
ACUN	chinstrap penguin	0.0	0.0	0.0	0.0
ADAM	adelie penguin	0.0	0.0	0.0	0.0
ADAR	adelie penguin	0.0	0.0	0.0	0.0
AILS	chinstrap penguin	0.0	0.0	0.0	0.0
		. . .	. . .	. . .	. . .
YANK	gentoo penguin	0.0	0.0	0.0	0.0
YOUN	adelie penguin	0.0	0.0	0.0	0.0
YTRE	adelie penguin	0.0	0.0	0.0	0.0
ZAWA	chinstrap penguin	0.0	0.0	0.0	0.0
ZIGZ	chinstrap penguin	0.0	0.0	0.0	0.0

Your .csv file for those predictions would look like:

site_id,common_name,2014,2015,2016,2017
ACUN,adelie penguin,0.0,0.0,0.0,0.0
ACUN,chinstrap penguin,0.0,0.0,0.0,0.0
ADAM,adelie penguin,0.0,0.0,0.0,0.0
ADAR,adelie penguin,0.0,0.0,0.0,0.0
AILS,chinstrap penguin,0.0,0.0,0.0,0.0
...
YANK,gentoo penguin,0.0,0.0,0.0,0.0
YOUN,adelie penguin,0.0,0.0,0.0,0.0
YTRE,adelie penguin,0.0,0.0,0.0,0.0
ZAWA,chinstrap penguin,0.0,0.0,0.0,0.0
ZIGZ,chinstrap penguin,0.0,0.0,0.0,0.0
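One way to write a file in this shape with the standard library is sketched below. The function name and the dict-based input are assumptions made for illustration. Rows are written in sorted key order here, which is also an assumption; the row order in submission_format.csv is the authoritative one.

```python
import csv
import io

SEASONS = (2014, 2015, 2016, 2017)


def write_submission(predictions, out, seasons=SEASONS):
    """Write {(site_id, common_name): [one prediction per season]} as CSV.

    Every value is written via repr(float(v)), so whole numbers keep a decimal
    point (0 becomes "0.0").
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["site_id", "common_name", *seasons])
    for (site_id, common_name), values in sorted(predictions.items()):
        if len(values) != len(seasons):
            raise ValueError(f"expected {len(seasons)} values for {site_id}")
        writer.writerow([site_id, common_name, *(repr(float(v)) for v in values)])


buffer = io.StringIO()
write_submission(
    {
        ("ACUN", "chinstrap penguin"): [0, 0, 0, 0],
        ("ACUN", "adelie penguin"): [0, 0, 0, 0],
    },
    buffer,
)
csv_text = buffer.getvalue()
```

Passing an open file object instead of `io.StringIO()` writes the same content to disk.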

Good luck!

Good luck and enjoy this problem! If you have any questions you can always visit the user forum!

Random Walk of the Penguins

Winner

ambarishg