Poverty Prediction Challenge

Estimate individual and aggregate household consumption from limited survey data. #development

$10,000 in prizes
8 weeks left
41 joined

Problem description

For this challenge, your task is to use survey data to predict both the population’s poverty rate at various thresholds as well as household-level daily per capita consumption.

This challenge mimics a common challenge faced in real-time poverty monitoring in which older surveys fully capture household consumption data but more recent surveys lack the detailed information needed to directly measure poverty rates. The process of inferring consumption data from less detailed surveys is called imputation. In this challenge, you will impute both household consumption and, with that information, the population-level poverty rate.

This competition is designed for research purposes to explore imputation methods, and has a limited test set reflecting the real-world challenges faced by researchers. As a result, the number of submissions you can make is limited and the public leaderboard for this competition may not be indicative of final rankings.

Datasets


You are provided a training dataset consisting of three panels of household surveys that were conducted in different years. Each household responds to a variety of questions, and their responses are recorded. The household’s daily consumption per capita is also recorded. Weights were assigned to each household according to the number of people in the household and the degree to which the household is representative of the overall population. From the consumption values and the weights, the consumption distribution is computed and the poverty rate–the proportion of the population with a per capita consumption below a certain consumption level or poverty threshold–was calculated.

An overview of the files available on the data download page is below, followed by more detailed information about each file:

data/
├── feature_descriptions.csv          # Survey data dictionary
├── feature_value_descriptions.csv    # Survey codes data dictionary
├── submission_format.zip             # Submission format example
├── test_hh_features.csv              # Household-level survey responses for test set
├── train_hh_features.csv             # Household-level survey responses for training set
├── train_hh_gt.csv                   # Household-level consumption for each response in each survey
└── train_rates_gt.csv                # Computed poverty rates at given thresholds for training surveys

Survey features

The household-level surveys have the following sets features, provided in train_hh_features.csv:

  • Identifiers & sampling information
  • The weights in the dataset are population-expanded weights, meaning that they reflect the probability of sampling times the number of members in the household. These are used to convert household-level survey data to accurate population-level estimates.
  • Welfare & expenditure information
  • Demographics & household composition
  • Education & employment
  • Housing & utilities
  • Geography
  • Food-consumption indicators (last 7 days)

In addition, the household-level consumption labels are provided in train_hh_gt.csv in dollars per day per capita.

The full data dictionary for the survey data is provided in feature_descriptions.csv and the values for certain coded indicators are provided in feature_value_descriptions.csv.

For your solution, you should predict the household-level consumption for each survey response in the test set in dollars per day per capita (2017 USD PPP).

Poverty thresholds

In addition to household-level per capita consumption, the training set also contains the computed poverty rates at various thresholds, provided in train_rates_gt.csv. The poverty rate is defined as the percentage of the population with a consumption strictly below each threshold. The thresholds are set approximately at the ventiles of the consumption distribution for survey 300000:

Poverty Threshold
(per capita consumption)
Poverty Rate
2017 USD PPP Survey 300000
$3.17 5%
$3.94 10%
$4.60 15%
$5.26 20%
$5.88 25%
$6.47 30%
$7.06 35%
$7.70 40%
$8.40 45%
$9.13 50%
$9.87 55%
$10.70 60%
$11.62 65%
$12.69 70%
$14.03 75%
$15.64 80%
$17.76 85%
$20.99 90%
$27.37 95%

For your solution, you should predict the percentage of the overall population strictly below (i.e., strictly less than) each of these thresholds for each of the three surveys in the test set.

Training dataset

The training set consists of three surveys (IDs 100000, 200000 and 300000) and the computed poverty rates at specified thresholds. Each survey contains approximately 35,000 responses.

Test dataset

The test set contains responses from an additional three surveys (IDs 400000, 500000 and 600000), each with approximately 35,000 responses. These survey responses, provided in test_hh_features.csv, contain the same features as the surveys in the training set, but lack the household consumption and poverty rate labels.

One of the three test set surveys will be used for validation purposes and your score on this survey panel will be reflected on the leaderboard during the competition. The scores for the other two surveys are privately withheld and only the withheld surveys are used for final ranking. Note that this may result in the leaderboard serving as a poor indicator of where you will eventually place in the competition.

You should treat each survey in the test set as an independent sample, and your predictions for households within each survey and the poverty rates predicted from a survey should be made independently of the other surveys. You may not train models on the test set data, and you may not use the test set data in any way other than to generate predictions.

Performance metric


Performance is evaluated according to a weighted average of the per capita consumption prediction error and the poverty rate ratel prediction error:

  • As accurately predicting poverty rates is the most important and challenging task for the World Bank Group, 90% of the weighted average is computed as the weighted mean absolute percentage error between your predicted poverty rates at various poverty thresholds and the actual rates at those thresholds, averaged across all surveys in the test set.
    • The poverty thresholds were derived from the ventiles of the consumption distribution for survey ID 300000 in the training dataset.
    • The weights are computed as |$w_t = 1 - \left| 0.4 - p_t \right|$| where |$p_t$| is the percentile rank corresponding to threshold |$t$|. This weighting prioritizes the poverty thresholds more the closer they are to the threshold corresponding to a 40% poverty rate in the training dataset from which the thresholds were derived.
  • The remaining 10% of the weighted average consists of a mean absolute percentage error between predicted household-level per capita consumption and actual per capita consumption across all responses in the test set.

Mathematically, this looks like:

|$\text{metric} = \frac{1}{S} \sum_s^S \left(\frac{90}{\sum w}\sum_t^T w_t \frac{\left|r_t^s-\hat{r}_t^s\right|}{r_t^s} + \frac{10}{H} \sum_h^H \frac{\left|c_h^s - \hat{c}_h^s\right|}{c_h^s}\right)$|

where |$\hat{r}_t^s$| is the predicted poverty rate at poverty threshold |$t$| for survey |$s$|, |$r_i^s$| is the actual poverty rate at poverty threshold |$t$| for survey |$s$|, |$\hat{c}_h^s$| is the predicted per capita consumption for household |$h$| in survey |$s$|, |$c_h^s$| is the actual household per capita consumption for household |$h$| in survey |$s$|, |$S$| is the total number of surveys, |$H$| is the total number of households, and |$w_t = 1 - \left| 0.4 - p_t \right|$| where |$p_t$| is the percentile rank corresponding to threshold |$t$|.

The leaderboard will show this blended metric as the primary competition metric, which will be used for rankings. In addition, a secondary indicator will show the contribution of errors in poverty rate prediction, i.e., the weighted mean absolute percentage error between your predicted poverty rates at various poverty thresholds and the actual rates at those thresholds.

Submission format


The format for the submission file is a zip file containing two CSV files at the root level:

submission.zip
├── predicted_household_consumption.csv
└── predicted_poverty_distribution.csv

The household consumption CSV will contain the survey ID, the household ID, and the corresponding predicted per capita consumption value for that household. In total, this CSV should have 3 columns and 103,024 rows. A truncated example is shown below:

survey_id,household_id,per_capita_household_consumption
400000,400001,1.23
400000,400002,4.56
...
400000,434565,7.89
500000,500001,12.3
...
500000,534245,45.6
600000,600001,78.9
...
600000,634213,0.01

The poverty distribution CSV has twenty columns and four rows, including a header row. Each row is one survey dataset, labeled by its survey id in the first column. The remaining 19 columns contain the calculated poverty rates at various poverty thresholds. A truncated example is shown below:

survey_id,pct_hh_below_3.17,pct_hh_below_3.94,...,pct_hh_below_27.37
400000,0.001,0.002,...,0.999
500000,0.001,0.002,...,0.999
600000,0.001,0.002,...,0.999

Submission limits and submission selection

Due to the limited size of the test set, you are limited to only three submissions per week over the course of the challenge. In addition, you may select only one of these submissions to use for final ranking purposes. You should choose the model you believe will best generalize unseen test data, which may not necessarily be the model with the best score on the leaderboard.

Good luck!


Good luck and enjoy this challenge! If you have any questions, you can always visit the user forum!