Pover-T Tests: Predicting Poverty
Measuring poverty is hard. Thanks to the efforts of thousands of competitors, the World Bank can now build on open source machine learning tools to help predict poverty, optimize uses of survey data, and support work to end extreme poverty.
Your models should predict whether a given household in a given country is poor. The training features are survey data from three countries. For each country, C, survey data is provided at both the household and the individual level. Each household is identified by its id, and each individual is identified by both their household id and individual iid. Most households are made up of multiple individuals.
Predictions should be made at the household level only, but data for each of the three countries is provided at both the household and individual level. You may be able to construct additional household features from the individual data that are particularly useful for predicting at the household level, which is why we provide both. There are six training files in total.
The dataset has been structured so that the id columns match across the individual and household datasets. For both datasets, an assessment of whether the household is above or below the poverty line is in the poor column. This binary variable is the target variable for the competition.
Each column in the dataset corresponds to a survey question. Each question is either multiple choice, in which case each choice has been encoded as a random string, or it is a numeric value. Many of the multiple choice questions are about consumable goods, for example whether your household has items such as Salt. Numeric questions often ask things like "How many working cell phones in total does your household own?" or "How many separate rooms do the members of your household occupy?"
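Because the multiple choice answers arrive as random strings, most models will need them converted to numbers first. Here is a minimal sketch of one common option, one-hot encoding with pandas. The column names has_salt and n_rooms and the string values are made up for illustration; they are not actual columns from the survey files.

```python
import pandas as pd

# Toy household table. Column names and the random-string values are
# hypothetical stand-ins for the real (anonymized) survey columns.
hhold = pd.DataFrame({
    "id": [1, 2, 3],
    "has_salt": ["xQzRt", "pLmNo", "xQzRt"],  # multiple choice, encoded as strings
    "n_rooms": [3, 1, 2],                     # numeric question
})

# Expand each multiple choice column into one indicator column per choice.
# Numeric columns pass through unchanged.
categorical_cols = ["has_salt"]
encoded = pd.get_dummies(hhold, columns=categorical_cols)

print(encoded.columns.tolist())
```

If you try this approach, you will probably want to encode the training and test data together (or reindex the test columns to match), since a choice that appears in only one of the files would otherwise produce mismatched columns.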
For example, a few rows from A_hhold_train.csv look like
And the corresponding data in
Note that the questions differ across the two survey levels, so generating new household-level features by aggregating the individual features in different ways may create better predictors.
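One way to build household-level features from the individual data is to group the individual rows by the household id and join the summaries back onto the household table. The sketch below uses pandas on toy data. Only id, iid, and poor come from the description above; the columns n_rooms, age, and employed are hypothetical, and the choice of aggregates is just an example.

```python
import pandas as pd

# Toy stand-ins for the household and individual tables. Only id, iid and
# poor are real column names; n_rooms, age and employed are hypothetical.
hhold = pd.DataFrame({"id": [1, 2], "n_rooms": [3, 1], "poor": [False, True]})
indiv = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],          # household id
    "iid": [1, 2, 1, 2, 3],         # individual id within the household
    "age": [34, 5, 40, 38, 12],
    "employed": ["yes", "no", "no", "no", "no"],
})

# Collapse the individual rows to one row per household id.
agg = indiv.groupby("id").agg(
    n_members=("iid", "count"),
    mean_age=("age", "mean"),
    share_employed=("employed", lambda s: (s == "yes").mean()),
).reset_index()

# Left join so every household keeps its row, even one with no individuals.
features = hhold.merge(agg, on="id", how="left")
print(features)
```

Counts, means, and shares are only a starting point; which aggregates actually help is something to check against your own validation scores.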
The test data format is the same as the training data, except that the poor column is not included. The files are
Performance is evaluated according to a mean log loss. That is, log loss scores are generated for each country, and the mean of those scores will be your overall score. The competitor that minimizes the value of this loss over all test cases will top the leaderboard.
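To check your models locally, you can approximate this metric yourself. The sketch below assumes the standard binary log loss, computes it separately for each country, and then averages the per-country scores. The clipping constant eps and the function names are our own choices for illustration, so the official scorer may differ in small details.

```python
import math
from collections import defaultdict

def log_loss(y_true, y_prob, eps=1e-15):
    """Standard binary log loss, averaged over the given rows."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip so that a prediction of exactly 0 or 1 does not hit log(0).
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def mean_country_log_loss(countries, y_true, y_prob):
    """Log loss per country, then the unweighted mean of those scores."""
    by_country = defaultdict(lambda: ([], []))
    for c, y, p in zip(countries, y_true, y_prob):
        by_country[c][0].append(y)
        by_country[c][1].append(p)
    scores = [log_loss(ys, ps) for ys, ps in by_country.values()]
    return sum(scores) / len(scores)

# Predicting 0.5 everywhere scores ln(2) no matter what the labels are.
score = mean_country_log_loss(["A", "A", "B"], [1, 0, 1], [0.5, 0.5, 0.5])
print(score)
```

Note that averaging per country means each country counts equally, regardless of how many households it has.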
The format for the submission file is id, country, and a poor prediction. The predictions for poor should be
For example, if your first five predictions were:
The first five lines of your submission .csv file would look like:
id,country,poor
418,A,0.5
41249,A,0.5
16205,A,0.5
97501,A,0.5
67756,A,0.5
17938,A,0.5
19036,A,0.5
61587,A,0.5
57571,A,0.5
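If you want to produce a file in this layout from Python, here is a minimal sketch using the standard csv module. The helper name write_submission is our own, and the rows reuse values from the example above. It writes to an in-memory buffer so you can inspect the result; in practice you would pass an open file instead.

```python
import csv
import io

def write_submission(rows, fileobj):
    """Write (id, country, poor) rows with the expected header line."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["id", "country", "poor"])
    for hh_id, country, prob in rows:
        writer.writerow([hh_id, country, prob])

# A few rows taken from the example above.
rows = [(418, "A", 0.5), (41249, "A", 0.5), (16205, "A", 0.5)]

buffer = io.StringIO()
write_submission(rows, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To write to disk instead, open the target with open(path, "w", newline="") and pass that file object to the same function.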
If you're wondering how to get started, check out our benchmark blog post!
Good luck and enjoy this problem! If you have any questions you can always visit the user forum!