Pover-T Tests: Predicting Poverty Hosted By DrivenData


Woohoo! This competition has come to a close!

Many thanks to the participants for all of their hard work and commitment to using data for good!

Problem description

Your models should predict whether a given household in a given country is poor. The training features are survey data from three countries. For each country (A, B, and C), survey data is provided at both the household and the individual level. Each household is identified by its id, and each individual is identified by both their household id and an individual iid. Most households are made up of multiple individuals.

Data Format

Training Data

Predictions should be made at the household level only, but data for each of the three countries is provided at the household and individual level. You may be able to construct additional household features from the individual data that are particularly useful for predicting at the household level, which is why we provide both. There are six training files in total.

Country  Filename           Survey Level
A        A_hhold_train.csv  household
B        B_hhold_train.csv  household
C        C_hhold_train.csv  household
A        A_indiv_train.csv  individual
B        B_indiv_train.csv  individual
C        C_indiv_train.csv  individual

The dataset has been structured so that the id columns match across the individual and household datasets. In both datasets, the poor column records whether the household is below the poverty line. This binary variable is the target variable for the competition.
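
If you want to check that the ids line up, a minimal sketch with pandas might look like the following. It assumes the CSV headers are literally named id and iid, as in the description above. The function name load_country and the tiny in-memory CSVs are stand-ins for illustration; in practice you would pass paths such as A_hhold_train.csv and A_indiv_train.csv.

```python
import io

import pandas as pd


def load_country(hhold_source, indiv_source):
    """Load one country's household and individual files (paths or file-like objects)."""
    # Assumption: households are keyed by "id", individuals by ("id", "iid").
    hhold = pd.read_csv(hhold_source, index_col="id")
    indiv = pd.read_csv(indiv_source, index_col=["id", "iid"])
    return hhold, indiv


# Tiny stand-in files so the sketch runs without the real data.
hhold_csv = io.StringIO(
    "id,wBXbHZmp,poor\n"
    "80389,JhtDR,True\n"
    "9370,JhtDR,True\n"
)
indiv_csv = io.StringIO(
    "id,iid,HeUgMnzF,poor\n"
    "80389,1,XJsPz,True\n"
    "80389,2,XJsPz,True\n"
    "9370,1,XJsPz,True\n"
)

hhold, indiv = load_country(hhold_csv, indiv_csv)

# Household ids that appear in the individual file but not in the household file.
unmatched_ids = set(indiv.index.get_level_values("id")) - set(hhold.index)
```

An empty unmatched_ids set means every individual row can be joined back to a household row.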

Each column in the dataset corresponds to a survey question. Each question is either multiple choice, in which case each choice has been encoded as a random string, or numeric. Many of the multiple choice questions are about consumable goods--for example, whether your household has items such as bar soap, cooking oil, matches, and salt. Numeric questions often ask things like "How many working cell phones in total does your household own?" or "How many separate rooms do the members of your household occupy?"

For example, a few rows from A_hhold_train.csv look like

id     wBXbHZmp  SlDKnCuu  AlDbXTlZ  ...  poor
80389  JhtDR     GUusz     aQeIm     ...  True
9370   JhtDR     GUusz     cecIq     ...  True
39883  JhtDR     GUusz     aQeIm     ...  False
18327  JhtDR     alLXR     cecIq     ...  True
88416  JhtDR     GUusz     cecIq     ...  True

And the corresponding data in A_indiv_train.csv is

id     iid  HeUgMnzF  CaukPfUC  xqUooaNJ  ...  poor
80389  1    XJsPz     mOlYV     dSJoN     ...  True
       2    XJsPz     mOlYV     JTCKs     ...  True
       3    TRFeI     mOlYV     JTCKs     ...  True
       4    XJsPz     yAyAe     JTCKs     ...  True
9370   1    XJsPz     mOlYV     JTCKs     ...  True

Note that the questions differ between the two survey levels, so generating new household-level features by aggregating the individual features in different ways may create better predictors.
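
One way to do this kind of aggregation is sketched below with pandas. It is only an illustration, not a recommended feature set: the function name aggregate_individuals and the numeric column num_q (with its values) are hypothetical, while the two categorical columns reuse the sample rows shown above. It counts household members, one-hot encodes the multiple choice answers and sums them per household, and takes the mean and max of numeric answers.

```python
import pandas as pd


def aggregate_individuals(indiv, cat_cols, num_cols):
    """Collapse individual-level rows into one row of features per household id."""
    by_household = indiv.groupby(level="id")
    # Household size: number of individual rows sharing the same id.
    features = [by_household.size().rename("n_members")]
    if num_cols:
        # Mean and max of each numeric answer within the household.
        numeric = by_household[num_cols].agg(["mean", "max"])
        numeric.columns = [f"{col}_{stat}" for col, stat in numeric.columns]
        features.append(numeric)
    if cat_cols:
        # One-hot encode each choice, then count how many members gave it.
        dummies = pd.get_dummies(indiv[cat_cols], dtype=int)
        features.append(dummies.groupby(level="id").sum())
    return pd.concat(features, axis=1)


indiv = pd.DataFrame(
    {
        "id": [80389, 80389, 80389, 80389, 9370],
        "iid": [1, 2, 3, 4, 1],
        "HeUgMnzF": ["XJsPz", "XJsPz", "TRFeI", "XJsPz", "XJsPz"],
        "xqUooaNJ": ["dSJoN", "JTCKs", "JTCKs", "JTCKs", "JTCKs"],
        "num_q": [30, 28, 5, 2, 41],  # hypothetical numeric question
    }
).set_index(["id", "iid"])

household_features = aggregate_individuals(
    indiv, cat_cols=["HeUgMnzF", "xqUooaNJ"], num_cols=["num_q"]
)
```

The result is indexed by household id, so it can be joined onto the household-level data before training.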

Test Data

The test data format is the same as the training data, except that the poor column is not included. The files are

Country  Filename          Survey Level
A        A_hhold_test.csv  household
B        B_hhold_test.csv  household
C        C_hhold_test.csv  household
A        A_indiv_test.csv  individual
B        B_indiv_test.csv  individual
C        C_indiv_test.csv  individual

Performance metric

Performance is evaluated according to mean log loss. That is, a log loss score is computed for each country, and the mean of those scores is your overall score. The competitor that minimizes this loss over all test cases will top the leaderboard.
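
To estimate this score on your own validation data, a minimal sketch with NumPy is shown below. It assumes the standard binary log loss and an unweighted mean over countries, as described above. The clipping value eps is our own assumption to avoid taking log(0); the official scoring code may differ in such details.

```python
import numpy as np


def log_loss(y_true, y_prob, eps=1e-15):
    """Binary log loss; probabilities are clipped to avoid log(0) (eps is an assumption)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))


def mean_country_log_loss(y_true, y_prob, country):
    """Compute the log loss separately for each country, then average the scores."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    country = np.asarray(country)
    scores = [
        log_loss(y_true[country == code], y_prob[country == code])
        for code in np.unique(country)
    ]
    return float(np.mean(scores))


# Predicting 0.5 for every household scores ln(2), about 0.693, in every country.
score = mean_country_log_loss(
    y_true=[1, 0, 1, 0, 1],
    y_prob=[0.5, 0.5, 0.5, 0.5, 0.5],
    country=["A", "A", "B", "B", "C"],
)
```

Because the score is averaged per country rather than per household, a country with few households counts as much as a large one.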

Submission format

The submission file has three columns: id, country, and poor. The values in the poor column should be float probabilities.

For example, if your first five predictions were:

id     country  poor
418    A        0.5
41249  A        0.5
16205  A        0.5
97501  A        0.5
67756  A        0.5
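
Predictions held in this form can be written out with pandas. The sketch below is one possible way to do it; the helper name make_submission and the dict-of-Series input are assumptions for illustration. It writes to an in-memory buffer so it runs anywhere; you would pass a path such as submission.csv to to_csv instead.

```python
import io

import pandas as pd


def make_submission(preds_by_country):
    """Stack per-country probabilities (Series indexed by household id) into one table."""
    frames = []
    for country, probs in preds_by_country.items():
        frame = pd.DataFrame({"country": country, "poor": probs.astype(float)})
        frame.index.name = "id"
        frames.append(frame)
    return pd.concat(frames)


# The five example predictions for country A from the table above.
preds_a = pd.Series([0.5] * 5, index=[418, 41249, 16205, 97501, 67756])
submission = make_submission({"A": preds_a})

# Write to a buffer here; use submission.to_csv("submission.csv") for a real file.
buffer = io.StringIO()
submission.to_csv(buffer)
csv_text = buffer.getvalue()
```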

The first five lines of your submission.csv file would look like:

id,country,poor
418,A,0.5
41249,A,0.5
16205,A,0.5
97501,A,0.5

If you're wondering how to get started, check out our benchmark blog post!

Good luck and enjoy this problem! If you have any questions you can always visit the user forum!