Countable Care: Modeling Women's Health Care Decisions Hosted By Planned Parenthood Federation of America
Woohoo! This competition has come to a close!
Many thanks to the participants for all of their hard work and commitment to using data for good!
Your goal is to use data from a nationally representative sample of women to predict which healthcare services respondents used in the last 12 months.
This is where you'll find specific documentation about this dataset and the problem we are trying to solve. For this competition, there are two subsections to the problem description:
A note on data obfuscation
The NSFG dataset is a public dataset. When you signed up for the competition, you agreed to use only the data that we provide while engaged in this competition. However, we don't want a bad apple to spoil the fun for everyone else, so we have obfuscated the dataset. Categorical variables have been recoded, ordinal variables have been transformed, and numeric variables have been rescaled. Our goal was to make it extremely difficult to cheat while maintaining the form of the data. Point being, it's best all around if you build your model based on what is provided.
For those interested in what features have predictive power, we'll be releasing the reversing dictionary at the end of the competition. This will allow you to reverse the obfuscation and recover the data as it is described in the NSFG codebooks.
We hope that the obfuscation will help us maintain the integrity of the competition while still letting interested parties interpret the resulting models at the end of the competition.
The features in this dataset
This data is based on the National Survey of Family Growth (NSFG), conducted by the United States Centers for Disease Control and Prevention (CDC). People across the United States were asked a series of over 1,700 questions about their demographics, pregnancies, family planning, use of healthcare services, and medical insurance. We're focusing on the respondents who are women, and each row in the provided data represents an individual. To give you a sense of the kind of data collected, the relevant sections of the NSFG survey are:
- Section A - Demographics, household, childhood
- Section B - Pregnancy, birth, and adoption
- Section C - Marital and relationship history
- Section D - Sterilizing operations and impaired fecundity
- Section E - Contraceptive history and pregnancy wantedness
- Section F - Family planning and medical services
- Section G - Birth desires and intentions
- Section H - Infertility services and reproductive health
- Section I - Insurance, residence, religion, work history, child care, attitudes
In obfuscating this data, we generated feature names which indicate the variable type. Numeric variables start with n_, categorical variables start with c_, and ordinal variables start with o_.
New data is released every few years in what the CDC calls a "cycle," and we have compiled a number of different release cycles together in this dataset. The release feature includes randomly ordered levels indicating to which cycle of the CDC data each row belongs. This could be useful to uncover patterns that have a temporal trend. Columns beginning with service_ are the labels, which are provided separately from the features.
Feature data types
Each column is one of the following datatypes, which we indicate with the column headers:
|Data type|Header prefix|Description|Values|NaN value|
|---|---|---|---|---|
|Categorical|c_|Variables that can take on a set of values which are not numeric or ordered, e.g. yes/no answers|A case-sensitive single letter|Empty|
|Numeric|n_|Variables that have a numeric value, e.g. number of children|Scaled to the range [0, 1], or all 1 for singleton values|Empty|
|Ordinal|o_|Variables where the order matters, e.g. level of education|Integer value|Empty|
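Since the header prefix encodes the data type, the columns can be sorted into groups before any modeling. The following is a minimal sketch of that idea using only the standard library. The column names in the example (such as c_0001) are hypothetical, and the o_ prefix for ordinal columns is treated as an assumption here.

```python
from collections import defaultdict

# Prefix-to-type mapping; "o_" for ordinal columns is an assumption.
PREFIX_TYPES = {
    "c_": "categorical",
    "n_": "numeric",
    "o_": "ordinal",
    "service_": "label",
}


def group_columns(columns):
    """Group column names by the data type implied by their prefix."""
    groups = defaultdict(list)
    for name in columns:
        for prefix, kind in PREFIX_TYPES.items():
            if name.startswith(prefix):
                groups[kind].append(name)
                break
        else:
            # Columns such as "id" or "release" carry no type prefix.
            groups["other"].append(name)
    return dict(groups)


# Hypothetical header row, for illustration only.
example = ["id", "release", "c_0001", "n_0002", "o_0003", "service_a"]
groups = group_columns(example)
```

Each group can then get its own preprocessing, for instance an encoding step for categorical columns and none for the already scaled numeric ones.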
Feature example values
Here are a few rows of a selection of random columns of different data types:
You may notice that this data is sparse, a characteristic that makes this data challenging (and interesting!) to work on. Many survey answers are predicated on answering other questions; for example, the answer for "How many children do you have?" will be represented by a NaN value if the respondent has never had children.
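One way to get a feel for this sparsity is to measure the share of empty cells per column. The sketch below does that with the standard library on a small made-up sample. The column names and values are hypothetical and are not taken from the real data; it assumes that NaN values appear as empty fields in the CSV.

```python
import csv
import io

# Hypothetical miniature of the feature file; empty fields stand for NaN.
SAMPLE = """id,c_0001,n_0002,o_0003
1,a,0.25,
2,,,3
3,B,,
4,a,1.0,
"""


def missing_fraction(csv_text):
    """Return the fraction of empty cells for every column except id."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = [c for c in reader.fieldnames if c != "id"]
    rows = list(reader)
    return {
        c: sum(1 for row in rows if row[c] == "") / len(rows)
        for c in columns
    }


fractions = missing_fraction(SAMPLE)
```

Columns that are almost entirely empty may still be informative, because the absence of an answer often reflects the answer to an earlier question.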
The labels in this dataset
The labels for this data set are binary outcomes representing whether or not a woman went to a healthcare provider for a particular service in the 12 months preceding the survey.
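Because every label is binary, the share of positive rows per service is a natural first summary, and it could also serve as a naive constant-probability baseline. The sketch below computes it on a made-up excerpt. The ids, the values, and the helper name label_prevalence are illustrative assumptions, not part of the competition materials.

```python
import csv
import io

# Hypothetical training-label excerpt; values are invented for illustration.
SAMPLE_LABELS = """id,service_a,service_b,service_c
11193,1,0,0
11382,0,0,1
16531,1,0,0
1896,1,0,1
"""


def label_prevalence(csv_text):
    """Return the fraction of positive rows for each service_ column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    label_cols = [c for c in reader.fieldnames if c.startswith("service_")]
    totals = dict.fromkeys(label_cols, 0)
    n_rows = 0
    for row in reader:
        n_rows += 1
        for col in label_cols:
            totals[col] += int(row[col])
    return {col: totals[col] / n_rows for col in label_cols}


prevalence = label_prevalence(SAMPLE_LABELS)
```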
Label example (first 5 rows of training labels)
The first 4 columns, for example, look like this:
And the last 4 columns look like this:
Your goal is to predict a probability for each possible label in the dataset given a row of new data. Each of these probabilities goes in a separate column in the submission file. The submission must be 3661x14, where 3,661 is the number of rows in the test dataset (excluding the header) and 14 is the number of columns (excluding a first column of row ids). This format is identical to the binary matrix of labels that we provide as the training labels.
Again, you must submit a probability for each label. Valid probabilities are in the range [0, 1]. We provide a SubmissionFormat.csv file that has the exact format that your submission should have.
Here's what the first few rows of SubmissionFormat.csv look like:
id,service_a,service_b,service_c,service_d,service_e,service_f,service_g,service_h,service_i,service_j,service_k,service_l,service_m,service_n
16757,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
9752,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
12839,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
7689,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
6885,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
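Before uploading, it can help to check a submission against the rules above: the expected header, 14 probability columns, the right number of rows, and every value in [0, 1]. The following is a minimal sketch of such a check. The helper names write_submission and validate_submission are hypothetical, and the example uses only three of the ids shown above rather than the full test set.

```python
import csv
import io
import string

# service_a .. service_n, matching the header of SubmissionFormat.csv.
LABELS = [f"service_{ch}" for ch in string.ascii_lowercase[:14]]
HEADER = ["id"] + LABELS


def write_submission(predictions):
    """Serialize a dict of row id -> 14 probabilities as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(HEADER)
    for row_id, probs in predictions.items():
        writer.writerow([row_id] + list(probs))
    return buf.getvalue()


def validate_submission(csv_text, expected_rows):
    """Raise ValueError if the text breaks the submission format."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or rows[0] != HEADER:
        raise ValueError("unexpected header")
    body = rows[1:]
    if len(body) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(body)}")
    for row in body:
        if len(row) != len(HEADER):
            raise ValueError(f"row {row[0]!r} has {len(row)} fields")
        for value in row[1:]:
            prob = float(value)
            # NaN fails this comparison too, so it is rejected as well.
            if not 0.0 <= prob <= 1.0:
                raise ValueError(f"row {row[0]!r}: {value} not in [0, 1]")
    return True


# Constant 0.5 baseline for a few ids, mirroring SubmissionFormat.csv.
ids = [16757, 9752, 12839]
submission = write_submission({i: [0.5] * 14 for i in ids})
is_valid = validate_submission(submission, expected_rows=len(ids))
```

For the real test set, expected_rows would be 3661.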