Problem description

Your models should predict whether a given household in a given country is poor. The training features are survey data from three countries, A, B, and C. For each country, survey data is provided at both the household and the individual level. Each household is identified by its id, and each individual is identified by both a household id and an individual iid. Most households are made up of multiple individuals.

Data Format

Training Data

Predictions should be made at the household level only, but data for each of the three countries is provided at both the household and the individual level. We provide both because you may be able to construct additional household features from the individual data that are particularly useful for predicting at the household level. There are six training files in total.

Country	Filename	Survey Level
A	A_hhold_train.csv	household
B	B_hhold_train.csv	household
C	C_hhold_train.csv	household
A	A_indiv_train.csv	individual
B	B_indiv_train.csv	individual
C	C_indiv_train.csv	individual

The dataset has been structured so that the id columns match across the individual and household datasets. In both datasets, the poor column holds the assessment of whether the household is above or below the poverty line. This binary variable is the target variable for the competition.
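Because the id and poor columns are shared across the two survey levels, a quick consistency check can be worth running before any modelling. The sketch below is only illustrative: it assumes the files on disk are plain CSVs with id (and iid) as ordinary header columns, uses a trimmed copy of the A_hhold_train.csv and A_indiv_train.csv rows excerpted later in this section, and the helper name find_label_mismatches is hypothetical.

```python
import csv
import io

# Trimmed copies of rows excerpted in this section (assumed on-disk CSV layout).
# A real file would be read with open("A_hhold_train.csv", newline="").
HHOLD_SAMPLE = """id,wBXbHZmp,SlDKnCuu,AlDbXTlZ,poor
80389,JhtDR,GUusz,aQeIm,True
9370,JhtDR,GUusz,cecIq,True
"""

INDIV_SAMPLE = """id,iid,HeUgMnzF,poor
80389,1,XJsPz,True
80389,2,XJsPz,True
9370,1,XJsPz,True
"""


def find_label_mismatches(hhold_rows, indiv_rows):
    """Return (id, iid) pairs whose household id is unknown or whose
    poor label differs from the household-level label."""
    hhold_poor = {row["id"]: row["poor"] for row in hhold_rows}
    return [
        (row["id"], row["iid"])
        for row in indiv_rows
        if hhold_poor.get(row["id"]) != row["poor"]
    ]


mismatches = find_label_mismatches(
    csv.DictReader(io.StringIO(HHOLD_SAMPLE)),
    csv.DictReader(io.StringIO(INDIV_SAMPLE)),
)
print(mismatches)  # [] for the sample rows above
```

An empty list means every individual row points at a known household and carries the same label as that household.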

Each column in the dataset corresponds to a survey question. Each question is either multiple choice, in which case each choice has been encoded as a random string, or numeric. Many of the multiple choice questions are about consumable goods, for example whether your household has items such as Bar soap, Cooking oil, Matches, and Salt. Numeric questions often ask things like "How many working cell phones in total does your household own?" or "How many separate rooms do the members of your household occupy?"

For example, a few rows from A_hhold_train.csv look like this:

	wBXbHZmp	SlDKnCuu	AlDbXTlZ	...	poor
id
80389	JhtDR	GUusz	aQeIm	...	True
9370	JhtDR	GUusz	cecIq	...	True
39883	JhtDR	GUusz	aQeIm	...	False
18327	JhtDR	alLXR	cecIq	...	True
88416	JhtDR	GUusz	cecIq	...	True

And the corresponding data in A_indiv_train.csv is:

		HeUgMnzF	CaukPfUC	xqUooaNJ	...	poor
id	iid
80389	1	XJsPz	mOlYV	dSJoN	...	True
	2	XJsPz	mOlYV	JTCKs	...	True
	3	TRFeI	mOlYV	JTCKs	...	True
	4	XJsPz	yAyAe	JTCKs	...	True
9370	1	XJsPz	mOlYV	JTCKs	...	True

Note that the questions differ across the two survey levels, so aggregating the individual features in different ways to generate new household-level features may yield better predictors.
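One possible way to turn individual rows into household-level features is to count household members and to count how often each encoded answer appears within a household. The following is a minimal sketch under stated assumptions, not a recommended feature set: it assumes a plain CSV layout with id and iid header columns, treats every question as categorical (numeric questions would more naturally be summed or averaged), and the function name aggregate_individuals is hypothetical. The sample is a trimmed copy of the A_indiv_train.csv rows shown above.

```python
import csv
import io
from collections import Counter, defaultdict

# Trimmed copy of the A_indiv_train.csv excerpt (three question columns only).
SAMPLE_INDIV = """id,iid,HeUgMnzF,CaukPfUC,xqUooaNJ,poor
80389,1,XJsPz,mOlYV,dSJoN,True
80389,2,XJsPz,mOlYV,JTCKs,True
80389,3,TRFeI,mOlYV,JTCKs,True
80389,4,XJsPz,yAyAe,JTCKs,True
9370,1,XJsPz,mOlYV,JTCKs,True
"""


def aggregate_individuals(rows, skip=("id", "iid", "poor")):
    """Return {household id: {feature name: count}} built from individual rows."""
    features = defaultdict(Counter)
    for row in rows:
        household = features[row["id"]]
        household["n_members"] += 1
        for column, value in row.items():
            if column not in skip:
                # One count feature per (question, encoded answer) pair.
                household[f"{column}={value}"] += 1
    return {hid: dict(counts) for hid, counts in features.items()}


household_features = aggregate_individuals(csv.DictReader(io.StringIO(SAMPLE_INDIV)))
print(household_features["80389"]["n_members"])       # 4
print(household_features["80389"]["HeUgMnzF=XJsPz"])  # 3
```

The poor column is skipped on purpose: it is the target, so it should not leak into the features.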

Test Data

The test data format is the same as the training data, except that the poor column is not included. The files are

Country	Filename	Survey Level
A	A_hhold_test.csv	household
B	B_hhold_test.csv	household
C	C_hhold_test.csv	household
A	A_indiv_test.csv	individual
B	B_indiv_test.csv	individual
C	C_indiv_test.csv	individual

Performance metric

Performance is evaluated according to mean log loss. That is, a log loss score is computed for each country, and the mean of those scores is your overall score. The competitor who minimizes this loss over all test cases will top the leaderboard.
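To make the metric concrete, here is a small sketch of how a per-country log loss averaged across countries could be computed locally. It assumes the usual binary log loss with natural logarithms and clips probabilities with a hypothetical eps so the logarithm stays finite; the exact clipping used by the official scorer is not stated here, and the function name mean_log_loss is illustrative.

```python
import math
from collections import defaultdict


def mean_log_loss(records, eps=1e-15):
    """records: iterable of (country, is_poor, predicted_probability).

    Computes the binary log loss within each country, then returns the
    unweighted mean of the per-country scores.
    """
    per_country = defaultdict(list)
    for country, is_poor, prob in records:
        prob = min(max(prob, eps), 1.0 - eps)  # assumed clipping, keeps log() finite
        loss = -math.log(prob) if is_poor else -math.log(1.0 - prob)
        per_country[country].append(loss)
    country_scores = [sum(losses) / len(losses) for losses in per_country.values()]
    return sum(country_scores) / len(country_scores)


# Predicting 0.5 everywhere gives ln(2) regardless of the true labels.
example = [("A", True, 0.5), ("A", False, 0.5), ("B", True, 0.5), ("C", False, 0.5)]
score = mean_log_loss(example)
print(round(score, 4))  # 0.6931
```

Note that averaging per country weights each country equally, so a country with fewer households counts as much as a larger one.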

Submission format

The submission file has three columns: id, country, and poor. The values in the poor column should be floating-point probabilities.

For example, if your first five predictions were:

	country	poor
id
418	A	0.5
41249	A	0.5
16205	A	0.5
97501	A	0.5
67756	A	0.5

Your submission.csv file would begin like this, with a header row, the five predictions above, and then the rows that follow:

id,country,poor
418,A,0.5
41249,A,0.5
16205,A,0.5
97501,A,0.5
67756,A,0.5
17938,A,0.5
19036,A,0.5
61587,A,0.5
57571,A,0.5
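A file in this layout can be produced with Python's standard csv module. The sketch below is one way to do it, assuming predictions are available as (id, country, probability) tuples; the function name write_submission and the range check are illustrative additions, not requirements stated by the competition.

```python
import csv
import io


def write_submission(predictions, handle):
    """Write (id, country, probability) tuples as id,country,poor rows."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "country", "poor"])
    for household_id, country, prob in predictions:
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range for id {household_id}: {prob}")
        writer.writerow([household_id, country, float(prob)])


# Written to an in-memory buffer here; for a real submission use
# open("submission.csv", "w", newline="") as the handle.
buffer = io.StringIO()
write_submission([(418, "A", 0.5), (41249, "A", 0.5)], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```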

If you're wondering how to get started, check out our benchmark blog post!

Good luck and enjoy this problem! If you have any questions you can always visit the user forum!

Pover-T Tests: Predicting Poverty

Quick Facts

Winner: Ag100