Problem description

Your goal is to use data from a nationally representative sample of women to predict which healthcare services respondents used in the last 12 months.

This is where you'll find specific documentation about this dataset and the problem we are trying to solve. For this competition, there are two subsections to the problem description:

A note on data obfuscation

The NSFG dataset is a public dataset. When you signed up for the competition, you agreed to use only the data that we provide while engaged in this competition. However, we don't want a bad apple to spoil the fun for everyone else, so we have obfuscated the dataset. Categorical variables have been recoded, ordinal variables have been transformed, and numeric variables have been rescaled. Our goal was to make it extremely difficult to cheat while maintaining the form of the data. Point being, it's best all around if you build your model based on what is provided.

For those interested in what features have predictive power, we'll be releasing the reversing dictionary at the end of the competition. This will allow you to reverse the obfuscation and recover the data as it is described in the NSFG codebooks.

We hope that the obfuscation will help us maintain the integrity of the competition while still letting interested parties interpret the resulting models at the end of the competition.

The features in this dataset

This data is based on the National Survey of Family Growth (NSFG), conducted by the United States Centers for Disease Control and Prevention (CDC). People across the United States were asked a series of over 1700 questions about their demographics, pregnancies, family planning, use of healthcare services, and medical insurance. We're focusing on the respondents who are women, and each row in the provided data represents one respondent. To give you a sense of the kind of data collected, the relevant sections of the NSFG survey are:

Section A - Demographics, household, childhood
Section B - Pregnancy, birth, and adoption
Section C - Marital and relationship history
Section D - Sterilizing operations and impaired fecundity
Section E - Contraceptive history and pregnancy wantedness
Section F - Family planning and medical services
Section G - Birth desires and intentions
Section H - Infertility services and reproductive health
Section I - Insurance, residence, religion, work history, child care, attitudes

In obfuscating this data, we generated feature names which indicate the variable type. Numeric variables start with n_, categorical variables start with c_, and ordinal variables start with o_.

New data is released every few years in what the CDC calls a "cycle," and we have compiled a number of different release cycles together in this dataset. The release feature includes randomly ordered levels indicating to which cycle of the CDC data each row belongs. This could be useful to uncover patterns that have a temporal trend. Columns beginning with service_ are the labels (and are in a separate .csv file).

Feature data types

Each column is one of the following datatypes, which we indicate with the column headers:

Data type	Header prefix	Description	Values	NaN value
Categorical	`c_`	Variables that can take on a set of values which are neither numeric nor ordered, e.g. yes/no answers	A single case-sensitive letter	Empty
Numeric	`n_`	Variables that have a numeric value, e.g. number of children	Scaled to the range [0, 1], or all 1 for singleton values	Empty
Ordinal	`o_`	Variables where the order matters, e.g. level of education	Integer value	Empty
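
As a minimal sketch of how the prefixes in this table might be used, the following Python snippet buckets column names by their documented type. The function name `group_columns_by_type` and the "other" bucket are illustrative assumptions, not part of the competition materials; the column names are ones that appear in this document.

```python
from collections import defaultdict

# Prefixes documented in the table above; the readable type names are ours.
PREFIX_TO_TYPE = {"c_": "categorical", "n_": "numeric", "o_": "ordinal"}


def group_columns_by_type(columns):
    """Bucket column names by documented prefix; anything else goes to 'other'."""
    groups = defaultdict(list)
    for name in columns:
        groups[PREFIX_TO_TYPE.get(name[:2], "other")].append(name)
    return dict(groups)


# Column names mentioned in this document, including the `release` feature,
# which carries no type prefix.
example_columns = ["n_0002", "o_0140", "c_1370", "release"]
groups = group_columns_by_type(example_columns)
print(groups)
```

Grouping columns this way makes it straightforward to apply a different preprocessing step to each data type.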

Feature example values

Here are a few rows of a selection of random columns of different data types:

	n_0002	o_0140	c_1370
id
11193	0.025449	11	NaN
11382	0.031297	11	a
16531	0.024475	NaN	a
1896	0.041694	27	NaN
18262	0.038120	17	b

You may notice that this data is sparse, a characteristic that makes it challenging (and interesting!) to work with. Many survey questions are asked only when a respondent gave a particular answer to an earlier question; for example, the answer for "How many children do you have?" will be represented by a NaN value if the respondent has never had children.
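
To get a feel for the sparsity, one option is to measure the share of missing values per column. The sketch below does this for the five example rows shown above, re-typed as CSV with empty fields standing in for NaN (as the data type table describes). The helper name `missing_fraction` is hypothetical, and five rows say nothing about the missing rates in the full dataset.

```python
import csv
import io

# The five example rows from this document; empty fields represent NaN.
SAMPLE = """id,n_0002,o_0140,c_1370
11193,0.025449,11,
11382,0.031297,11,a
16531,0.024475,,a
1896,0.041694,27,
18262,0.038120,17,b
"""


def missing_fraction(rows, columns):
    """Return the share of empty values for each requested column."""
    total = len(rows)
    return {col: sum(1 for r in rows if r[col] == "") / total for col in columns}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
fractions = missing_fraction(rows, ["n_0002", "o_0140", "c_1370"])
print(fractions)
```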

The labels in this dataset

The labels for this dataset are binary outcomes, each indicating whether or not a woman went to a healthcare provider for a particular service in the 12 months preceding the survey.

Label example (first 5 rows of training labels)

The first 4 columns, for example, look like this:

	service_a	service_b	service_c	service_d	…
id
11193	1	1	0	0	…
11382	0	0	0	0	…
16531	0	0	0	0	…
1896	0	0	0	1	…
18262	0	0	0	1	…

And the last 4 columns look like this:

	…	service_k	service_l	service_m	service_n
id
11193	…	1	0	0	0
11382	…	1	0	0	0
16531	…	1	0	0	0
1896	…	0	1	0	0
18262	…	1	1	1	0
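
Because each row carries one binary value per service, a respondent can have several positive labels at once. The following sketch, using only the label values visible in the two excerpts above, computes the share of positives per label and the number of positive labels per row. The function names are illustrative assumptions, and columns service_e through service_j are omitted because their values are not shown here.

```python
# Label values copied from the two excerpts above; the elided middle
# columns (service_e to service_j) are deliberately left out.
COLUMNS = ["service_a", "service_b", "service_c", "service_d",
           "service_k", "service_l", "service_m", "service_n"]
LABELS = {
    11193: [1, 1, 0, 0, 1, 0, 0, 0],
    11382: [0, 0, 0, 0, 1, 0, 0, 0],
    16531: [0, 0, 0, 0, 1, 0, 0, 0],
    1896:  [0, 0, 0, 1, 0, 1, 0, 0],
    18262: [0, 0, 0, 1, 1, 1, 1, 0],
}


def positive_rate(labels, columns):
    """Share of rows with a 1 in each label column."""
    n = len(labels)
    return {col: sum(row[i] for row in labels.values()) / n
            for i, col in enumerate(columns)}


def services_per_row(labels):
    """Number of positive labels per respondent id, among the shown columns."""
    return {row_id: sum(row) for row_id, row in labels.items()}


rates = positive_rate(LABELS, COLUMNS)
counts = services_per_row(LABELS)
print(rates)
print(counts)
```

Several of these rows have more than one positive label, which is why the task calls for a separate prediction per service rather than a single class per respondent.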

Submission format

Your goal is to predict a probability for each possible label in the dataset given a row of new data. Each of these probabilities goes in a separate column in the submission file. The submission must be 3,661 x 14, where 3,661 is the number of rows in the test dataset (excluding the header) and 14 is the number of label columns (excluding the first column of row ids). This format is identical to the binary matrix of labels that we provide as the training labels.

Again, you must submit a probability for each label. Valid probabilities are in the range [0, 1]. We provide a SubmissionFormat.csv file that has the exact format that your submission should have.

Here's what the first few rows of SubmissionFormat.csv look like:

id,service_a,service_b,service_c,service_d,service_e,service_f,service_g,service_h,service_i,service_j,service_k,service_l,service_m,service_n
16757,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
9752,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
12839,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
7689,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
6885,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5
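
Before uploading, it can help to check a submission against the rules above. What follows is a minimal, unofficial sketch: the function `validate_submission` and its messages are our own assumptions, not a tool provided by the competition. It checks the header, the row count, and that every value is a probability in [0, 1]. The demo uses the five ids shown above with `expected_rows=5`; for the real test set the expected count would be 3661.

```python
import csv
import io
import string

# service_a ... service_n: the 14 label columns in SubmissionFormat.csv.
LABELS = [f"service_{letter}" for letter in string.ascii_lowercase[:14]]
EXPECTED_HEADER = ["id"] + LABELS


def validate_submission(csv_text, expected_rows):
    """Return a list of problems found; an empty list means all checks passed."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        return ["header does not match id,service_a,...,service_n"]
    problems = []
    rows = [row for row in reader if row]
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for row in rows:
        if len(row) != len(EXPECTED_HEADER):
            problems.append(f"id {row[0]}: expected {len(LABELS)} probabilities")
            continue
        for name, value in zip(LABELS, row[1:]):
            try:
                p = float(value)
            except ValueError:
                problems.append(f"id {row[0]}, {name}: not a number")
                continue
            # NaN and infinities also fail this range check.
            if not 0.0 <= p <= 1.0:
                problems.append(f"id {row[0]}, {name}: {p} outside [0, 1]")
    return problems


# Build a small submission in the documented format, then a corrupted copy.
ids = [16757, 9752, 12839, 7689, 6885]
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(EXPECTED_HEADER)
for row_id in ids:
    writer.writerow([row_id] + [0.5] * len(LABELS))
good = buffer.getvalue()
bad = good.replace("0.5", "1.5", 1)  # first probability becomes invalid

problems_good = validate_submission(good, expected_rows=5)
problems_bad = validate_submission(bad, expected_rows=5)
print(problems_good)
print(problems_bad)
```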

Countable Care: Modeling Women's Health Care Decisions

Winner: giba