Pover-T Tests: Predicting Poverty

Measuring poverty is hard. Thanks to the efforts of thousands of competitors, The World Bank can now build on open-source machine learning tools to help predict poverty, optimize uses of survey data, and support work to end extreme poverty.

$15,000 in prizes
feb 2018
2,307 joined

Problem description

Your models should predict whether a given household in a given country is poor. The training features are survey data from three countries. For each country (A, B, and C), survey data is provided at both the household and the individual level. Each household is identified by its id, and each individual is identified by both their household id and their individual iid. Most households are made up of multiple individuals.

Data Format


Training Data

Predictions should be made at the household level only, but data for each of the three countries is provided at both the household and the individual level. You may be able to use the individual data to construct additional household features that are particularly useful for predicting at the household level, which is why we provide both. There are six training files in total.

Country  Filename           Survey Level
A        A_hhold_train.csv  household
B        B_hhold_train.csv  household
C        C_hhold_train.csv  household
A        A_indiv_train.csv  individual
B        B_indiv_train.csv  individual
C        C_indiv_train.csv  individual

The dataset has been structured so that the id columns match across the individual and household datasets. In both datasets, the poor column records whether the household is below the poverty line. This binary variable is the target variable for the competition.
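As a minimal sketch of how the two survey levels could be loaded and checked against each other with pandas: the in-memory CSV text below is a made-up stand-in for A_hhold_train.csv and A_indiv_train.csv, and the column names q_rooms and q_age are hypothetical. Only id, iid, and poor come from the description above.

```python
import io

import pandas as pd

# Tiny made-up stand-ins for A_hhold_train.csv and A_indiv_train.csv.
# q_rooms and q_age are hypothetical column names used for illustration.
hhold_csv = io.StringIO(
    "id,q_rooms,poor\n"
    "101,3,True\n"
    "102,5,False\n"
)
indiv_csv = io.StringIO(
    "id,iid,q_age,poor\n"
    "101,1,34,True\n"
    "101,2,8,True\n"
    "102,1,51,False\n"
)

# With the real files, pass the file path instead of the StringIO object.
hhold = pd.read_csv(hhold_csv, index_col="id")
indiv = pd.read_csv(indiv_csv, index_col=["id", "iid"])

# Every individual's household id should also appear in the household table.
indiv_ids = indiv.index.get_level_values("id")
all_matched = bool(indiv_ids.isin(hhold.index).all())

# The target is the boolean poor column of the household table.
y = hhold["poor"]
print(all_matched, y.to_dict())
```

Indexing the household table by id and the individual table by (id, iid) mirrors the example printouts in this section and makes later joins between the two levels straightforward.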

Each column in the dataset corresponds to a survey question. Each question is either multiple choice, in which case each choice has been encoded as a random string, or it is a numeric value. Many of the multiple choice questions are about consumable goods, for example whether your household has items such as Bar soap, Cooking oil, Matches, and Salt. Numeric questions often ask things like How many working cell phones in total does your household own? or How many separate rooms do the members of your household occupy?

For example, a few rows from A_hhold_train.csv look like

      wBXbHZmp SlDKnCuu AlDbXTlZ ...   poor
id
80389    JhtDR    GUusz    aQeIm ...   True
9370     JhtDR    GUusz    cecIq ...   True
39883    JhtDR    GUusz    aQeIm ...  False
18327    JhtDR    alLXR    cecIq ...   True
88416    JhtDR    GUusz    cecIq ...   True

And the corresponding data in A_indiv_train.csv is

          HeUgMnzF CaukPfUC xqUooaNJ ...  poor
id    iid
80389 1      XJsPz    mOlYV    dSJoN ...  True
      2      XJsPz    mOlYV    JTCKs ...  True
      3      TRFeI    mOlYV    JTCKs ...  True
      4      XJsPz    yAyAe    JTCKs ...  True
9370  1      XJsPz    mOlYV    JTCKs ...  True
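Because multiple choice answers arrive as random strings, most models need them converted to numbers first. One common option (an assumption here, not something the competition prescribes) is one-hot encoding with pandas. The sketch below reuses a few values from the household example and adds a hypothetical numeric column, q_rooms.

```python
import pandas as pd

# Three rows in the shape of the household example above.
# q_rooms is a hypothetical numeric column added for illustration.
hhold = pd.DataFrame(
    {
        "wBXbHZmp": ["JhtDR", "JhtDR", "JhtDR"],
        "SlDKnCuu": ["GUusz", "GUusz", "alLXR"],
        "q_rooms": [3, 5, 2],
    },
    index=pd.Index([80389, 9370, 18327], name="id"),
)

# Treat every non-numeric column as a multiple choice question.
cat_cols = [
    col for col in hhold.columns
    if not pd.api.types.is_numeric_dtype(hhold[col])
]

# One indicator column per (question, answer) pair; numeric columns pass through.
encoded = pd.get_dummies(hhold, columns=cat_cols)
print(sorted(encoded.columns))
```

If train and test files are encoded separately, an answer that appears in only one of them produces mismatched columns, so the two encoded frames would need to be aligned to a common set of columns before modeling.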

Note that the household and individual surveys ask different questions, so aggregating the individual features in different ways to generate new household-level features may create better predictors.
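One way such an aggregation might look is sketched below: individuals are grouped by household id, summarized, and the summaries are joined onto the household table. The rows are made up in the shape of the individual example, q_age is a hypothetical numeric column, and the three aggregates are illustrative choices rather than recommended features.

```python
import pandas as pd

# Made-up individual rows indexed by (id, iid); q_age is hypothetical.
indiv = pd.DataFrame(
    {
        "id": [80389, 80389, 80389, 80389, 9370],
        "iid": [1, 2, 3, 4, 1],
        "HeUgMnzF": ["XJsPz", "XJsPz", "TRFeI", "XJsPz", "XJsPz"],
        "q_age": [40, 38, 12, 9, 27],
    }
).set_index(["id", "iid"])

# Group all individuals that share a household id.
grouped = indiv.groupby(level="id")

# A few illustrative household-level summaries of the individual data.
features = pd.DataFrame(
    {
        "n_members": grouped.size(),
        "q_age_mean": grouped["q_age"].mean(),
        "HeUgMnzF_nunique": grouped["HeUgMnzF"].nunique(),
    }
)

# Join the new features onto a household table indexed by id.
hhold = pd.DataFrame(
    {"poor": [True, True]},
    index=pd.Index([80389, 9370], name="id"),
)
hhold_plus = hhold.join(features)
print(hhold_plus)
```

Since both tables share the id index, the join adds one row of aggregates per household without changing the number of households.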

Test Data

The test data format is the same as the training data, except that the poor column is not included. The files are

Country  Filename          Survey Level
A        A_hhold_test.csv  household
B        B_hhold_test.csv  household
C        C_hhold_test.csv  household
A        A_indiv_test.csv  individual
B        B_indiv_test.csv  individual
C        C_indiv_test.csv  individual

Performance metric


Performance is evaluated according to mean log loss. That is, a log loss score is computed for each country, and the mean of those per-country scores is your overall score. The competitor who minimizes this loss over all test cases will top the leaderboard.
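To make the metric concrete, here is a small sketch of a per-country log loss followed by an unweighted mean across countries. The column names poor_true and poor_pred are hypothetical, and clipping the probabilities with eps is an assumption made to avoid log(0); the official scoring code may differ in such details.

```python
import numpy as np
import pandas as pd


def log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss; eps clipping is an assumption to avoid log(0)."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def mean_country_log_loss(df):
    """Log loss per country, then the unweighted mean of those scores."""
    per_country = {
        country: log_loss(group["poor_true"], group["poor_pred"])
        for country, group in df.groupby("country")
    }
    return float(np.mean(list(per_country.values()))), per_country


# Made-up labels and predictions for illustration only.
scores = pd.DataFrame(
    {
        "country": ["A", "A", "B", "C"],
        "poor_true": [True, False, True, False],
        "poor_pred": [0.5, 0.5, 0.9, 0.5],
    }
)
overall, per_country = mean_country_log_loss(scores)
print(overall, per_country)
```

Because each country contributes equally to the mean, a country with few households carries the same weight as one with many.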

Submission format


The submission file has three columns: id, country, and poor. The predictions for poor should be float probabilities.

For example, if your first five predictions were:

      country  poor
id
418         A   0.5
41249       A   0.5
16205       A   0.5
97501       A   0.5
67756       A   0.5

The first lines of your submission.csv file would look like:

id,country,poor
418,A,0.5
41249,A,0.5
16205,A,0.5
97501,A,0.5
67756,A,0.5
17938,A,0.5
19036,A,0.5
61587,A,0.5
57571,A,0.5
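A file in this format can be produced from a pandas DataFrame indexed by id, as in the sketch below. It writes to an in-memory buffer so the result can be inspected; with a real submission you would pass a path such as "submission.csv" to to_csv instead. The constant 0.5 predictions simply mirror the example above.

```python
import io

import pandas as pd

# The five household ids from the example above, all in country A,
# each with a placeholder probability of 0.5.
ids = [418, 41249, 16205, 97501, 67756]
submission = pd.DataFrame(
    {"country": "A", "poor": 0.5},
    index=pd.Index(ids, name="id"),
)

# Write to an in-memory buffer; the named index becomes the id column.
buffer = io.StringIO()
submission.to_csv(buffer)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines[:3])
```

Naming the index id is what makes the header come out as id,country,poor without any extra arguments.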

Good luck!


If you're wondering how to get started, check out our benchmark blog post!

Enjoy this problem! If you have any questions, you can always visit the user forum!