Countable Care: Modeling Women's Health Care Decisions Hosted By Planned Parenthood Federation of America


Problem description

Your goal is to use data from a nationally representative sample of women to predict which healthcare services respondents used in the last 12 months.

This is where you'll find specific documentation about this dataset and the problem we are trying to solve. For this competition, there are two subsections to the problem description:

A note on data obfuscation

The NSFG dataset is a public dataset. When you signed up for the competition, you agreed to use only the data that we provide while engaged in this competition. However, we don't want a bad apple to spoil the fun for everyone else, so we have obfuscated the dataset. Categorical variables have been recoded, ordinal variables have been transformed, and numeric variables have been rescaled. Our goal was to make it extremely difficult to cheat while maintaining the form of the data. Point being, it's best all around if you build your model based on what is provided.

For those interested in what features have predictive power, we'll be releasing the reversing dictionary at the end of the competition. This will allow you to reverse the obfuscation and recover the data as it is described in the NSFG codebooks.

We hope that the obfuscation will help us maintain the integrity of the competition while still letting interested parties interpret the resulting models at the end of the competition.

The features in this dataset

This data is based on the National Survey of Family Growth (NSFG) conducted by the United States Centers for Disease Control and Prevention (CDC). People across the United States were asked a series of over 1,700 questions about their demographics, pregnancies, family planning, use of healthcare services, and medical insurance. We're focusing on the respondents who are women, and each row in the provided data represents one individual. To give you a sense of the kind of data collected, the relevant sections of the NSFG survey are:

  • Section A - Demographics, household, childhood
  • Section B - Pregnancy, birth, and adoption
  • Section C - Marital and relationship history
  • Section D - Sterilizing operations and impaired fecundity
  • Section E - Contraceptive history and pregnancy wantedness
  • Section F - Family planning and medical services
  • Section G - Birth desires and intentions
  • Section H - Infertility services and reproductive health
  • Section I - Insurance, residence, religion, work history, child care, attitudes

In obfuscating this data, we generated feature names which indicate the variable type. Numeric variables start with n_, categorical variables start with c_, and ordinal variables start with o_.
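Because the prefix encodes the variable type, you can split the columns into groups before deciding how to preprocess each one. The following is a minimal sketch of that idea, not part of the official tooling; the function name, the group names, and the "id" header for the row-id column are assumptions made for illustration.

```python
from collections import defaultdict

# Prefixes as described in the text; the group names are illustrative.
PREFIX_TO_TYPE = {"n_": "numeric", "c_": "categorical", "o_": "ordinal"}


def group_columns_by_type(columns):
    """Bucket column names by type prefix; unprefixed names go to 'other'."""
    groups = defaultdict(list)
    for name in columns:
        groups[PREFIX_TO_TYPE.get(name[:2], "other")].append(name)
    return dict(groups)


# "id" is a hypothetical header for the row-id column.
example_columns = ["id", "release", "n_0002", "o_0140", "c_1370"]
groups = group_columns_by_type(example_columns)
print(groups)
```

Columns without a type prefix, such as release, end up in the "other" group so that you can handle them explicitly.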

New data is released every few years in what the CDC calls a "cycle," and we have compiled a number of different release cycles together in this dataset. The release feature includes randomly ordered levels indicating to which cycle of the CDC data each row belongs. This could be useful to uncover patterns that have a temporal trend. Columns beginning with service_ are the labels (and are in a separate .csv file).

Feature data types

Each column is one of the following datatypes, which we indicate with the column headers:

Data type | Header prefix | Description | Values | NaN value
Categorical | c_ | Variables that can take on a set of values which are not numeric or ordered, e.g. yes/no answers | A case-sensitive single letter | Empty
Numeric | n_ | Variables that have a numeric value, e.g. number of children | Scaled to the range [0, 1], or all 1 for singleton values | Empty
Ordinal | o_ | Variables where the order matters, e.g. level of education | Integer value | Empty
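If you read the .csv file yourself, the rules above translate into a small per-cell parser. This is only a sketch under our own assumptions: the function name is hypothetical, and the choice of None for a missing categorical value and NaN for a missing numeric or ordinal value is one convention among several.

```python
import math


def parse_cell(column, raw):
    """Convert one raw CSV cell to a Python value based on the column prefix."""
    if raw == "":
        # Empty cells mark missing values for every data type.
        return None if column.startswith("c_") else math.nan
    if column.startswith("n_"):
        return float(raw)
    if column.startswith("o_"):
        # Going through float() tolerates integers written as "11.0".
        return int(float(raw))
    # Categorical: keep the case-sensitive letter unchanged.
    return raw


print(parse_cell("n_0002", "0.025449"))
print(parse_cell("o_0140", "11"))
print(parse_cell("c_1370", "a"))
```

Note that categorical letters are case sensitive, so "a" and "A" must stay distinct levels.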

Feature example values

Here are a few rows of a selection of random columns of different data types:

         n_0002  o_0140  c_1370
11193  0.025449      11     NaN
11382  0.031297      11       a
16531  0.024475     NaN       a
1896   0.041694      27     NaN
18262  0.038120      17       b

You may notice that this data is sparse, a characteristic that makes it challenging (and interesting!) to work on. Many survey questions are only asked depending on the answers to earlier questions; for example, the answer for "How many children do you have?" will be represented by a NaN value if the respondent has never had children.
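One way to get a feel for the sparsity is to measure the fraction of empty cells in each column. The sketch below does this for the five example rows shown above, written out as CSV with empty cells in place of NaN; the "id" header for the row-id column and the function name are assumptions made for illustration.

```python
import csv
import io

# The example rows from above; NaN values are stored as empty cells.
SAMPLE = """id,n_0002,o_0140,c_1370
11193,0.025449,11,
11382,0.031297,11,a
16531,0.024475,,a
1896,0.041694,27,
18262,0.038120,17,b
"""


def missing_fraction(csv_text, id_column="id"):
    """Return the fraction of empty cells for each non-id column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = [c for c in rows[0] if c != id_column]
    return {c: sum(r[c] == "" for r in rows) / len(rows) for c in columns}


fractions = missing_fraction(SAMPLE)
print(fractions)
```

On the full dataset, a summary like this can help you decide which columns need imputation or an explicit "missing" level.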

The labels in this dataset

The labels for this dataset are binary outcomes representing whether or not a woman went to a healthcare provider for a particular service in the 12 months preceding the survey.

Label example (first 5 rows of training labels)

The first 4 columns, for example, look like this:

       service_a  service_b  service_c  service_d
11193          1          1          0          0
11382          0          0          0          0
16531          0          0          0          0
1896           0          0          0          1
18262          0          0          0          1

And the last 4 columns look like this:

       service_k  service_l  service_m  service_n
11193          1          0          0          0
11382          1          0          0          0
16531          1          0          0          0
1896           0          1          0          0
18262          1          1          1          0
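Since every label is a 0/1 indicator, the share of ones in a column tells you how common that service is among respondents. As a rough sketch, here is that calculation on the first four label columns of the five example rows; the "id" header and the function name are assumptions, and rates computed from five rows say nothing reliable about the full training set.

```python
import csv
import io

# First four label columns of the example rows; "id" is a hypothetical header.
LABELS = """id,service_a,service_b,service_c,service_d
11193,1,1,0,0
11382,0,0,0,0
16531,0,0,0,0
1896,0,0,0,1
18262,0,0,0,1
"""


def label_rates(csv_text, id_column="id"):
    """Return the fraction of rows equal to 1 for each label column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    labels = [c for c in rows[0] if c != id_column]
    return {c: sum(int(r[c]) for r in rows) / len(rows) for c in labels}


rates = label_rates(LABELS)
print(rates)
```

Per-label rates like these are one possible naive starting point for predicted probabilities, against which you can compare a real model.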

Submission format

Your goal is to predict a probability for each possible label in the dataset given a row of new data. Each of these probabilities goes in a separate column in the submission file. The submission must be 3661x14 where 3,661 is the number of rows in the test dataset (excluding the header) and 14 is the number of columns (excluding a first column of row ids). This format is identical to the binary matrix of labels that we provide as the training labels.

Again, you must submit a probability for each label. Valid probabilities are in the range [0, 1]. We provide a SubmissionFormat.csv file that has the exact format that your submission should have.
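Before uploading, you may want to check these constraints locally. The sketch below is not the official validator; it is a hypothetical helper that only checks the column count, the row count, and that every prediction is a number in [0, 1]. The defaults come from the numbers stated above, and the "id" header in the usage example is an assumption.

```python
import csv
import io

N_TEST_ROWS = 3661  # rows in the test set, per the text above
N_LABELS = 14       # label columns, per the text above


def check_submission(csv_text, expected_rows=N_TEST_ROWS, expected_labels=N_LABELS):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # One row-id column plus one column per label.
    if len(header) != expected_labels + 1:
        problems.append(f"expected {expected_labels + 1} columns, got {len(header)}")
    n_rows = 0
    for line_no, row in enumerate(reader, start=2):
        n_rows += 1
        if len(row) != len(header):
            problems.append(f"line {line_no}: wrong number of fields")
            continue
        for value in row[1:]:
            try:
                p = float(value)
            except ValueError:
                problems.append(f"line {line_no}: {value!r} is not a number")
                continue
            # NaN fails this comparison too, so it is reported as invalid.
            if not 0.0 <= p <= 1.0:
                problems.append(f"line {line_no}: {value!r} is outside [0, 1]")
    if n_rows != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {n_rows}")
    return problems


# Tiny usage example with two labels and two rows instead of 14 and 3,661.
good = "id,service_a,service_b\n1,0.5,0.1\n2,0.0,1.0\n"
bad = "id,service_a,service_b\n1,1.5,0.1\n"
print(check_submission(good, expected_rows=2, expected_labels=2))
print(check_submission(bad, expected_rows=2, expected_labels=2))
```

For a real submission you would pass the file contents and keep the default sizes, for example check_submission(open("my_submission.csv").read()), where the file name is a placeholder.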

Here's what the first few rows of SubmissionFormat.csv look like:

Hint: you may want to use the handy submission validator tool to cut down on submission format errors!