PREPARE Challenge: Data for Early Prediction (Phase 1)

Find, curate, or contribute data to help the National Institute on Aging (NIA), part of the National Institutes of Health, create representative and open datasets that can be used for the early prediction of Alzheimer's disease and related dementias. #health

$200,000 in prizes
Jan 2024
376 joined

Problem description

The challenge centers on developing better methods for predicting Alzheimer's disease and Alzheimer's disease related dementias (AD/ADRD) as early as possible. Phase 1 — Find IT! Data for Early Prediction — focuses on sourcing shareable datasets that can support novel machine learning approaches for predicting AD/ADRD earlier than is possible with standard clinical diagnosis, with an emphasis on addressing biases in existing data sources. Doing so has the potential to advance our ability to treat these diseases. Solvers will submit a dataset that can be used in later phases of the challenge.

An emphasis of the challenge is building teams that bring diverse perspectives to generating solutions benefiting populations that have been disproportionately impacted by AD/ADRD. Submissions should describe how the proposed solution might address health disparities and should reference the NIA Health Disparities Research Framework, identifying and proposing tools, technologies, and products that reflect the life course perspective as well as the relevant levels of analysis among the domains described in that framework.

Target variables

Datasets should include one or more high quality indicators of AD/ADRD that can serve the role of target variable for predictive machine learning models. Some examples include:

  • Evidence of new cognitive impairment or dementia, including clinical diagnosis of mild cognitive impairment (MCI) or dementia, or non-diagnostic measures such as subjective cognitive decline or memory measures
  • AD/ADRD-consistent biomarkers, including amyloid, tau, NfL, GFAP, the AT(N) framework, or other biomarkers with clear evidence of association with the presence or risk of AD/ADRD

Solvers have some flexibility in defining what constitutes an indicator of AD/ADRD and will need to justify their choice of target variable in their written submission.

Predictor variables

Datasets should include predictor variables with the potential to provide early signal for prediction of AD/ADRD. Predictor variables might advance the challenge goals in a number of ways, such as supporting earlier prediction, more accurate prediction, addressing a wider and/or more representative population, and being more cost effective. Some data types that could constitute interesting predictors are:

  • Electronic health records (EHR)
  • Medical claims data
  • Linguistic data, e.g., voice, handwriting, or typed writing samples
  • Social media or mobile device use
  • Financial, housing, or social services records
  • Behavior from direct-to-consumer products, e.g., online cognitive tests

Linking new data to existing research datasets, whether NIH-supported (e.g., population-representative longitudinal studies such as the Health and Retirement Study, the Alzheimer’s Disease Sequencing Project, the Alzheimer’s Disease Neuroimaging Initiative, and other open datasets) or non-NIH-supported, is another possible approach to creating a valuable new dataset.

These examples are intended to give solvers some concrete ideas for suitable directions and are by no means exhaustive. The challenge welcomes innovative approaches, and grants solvers flexibility to develop and justify the datasets in their written submissions.
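As a rough sketch of what reviewers and later-phase solvers would expect, a prediction-ready dataset is typically tabular: one row per subject (or subject-observation), predictor columns, and a target column indicating AD/ADRD. The column names and values below are invented for illustration only, not challenge requirements.

```python
# Hypothetical tabular layout for a Phase 1 dataset: each row pairs
# predictor variables with a binary AD/ADRD target (here, a notional
# MCI diagnosis flag). All field names and values are illustrative.
rows = [
    {"subject_id": 1, "age": 72, "typing_speed_wpm": 31.0, "mci_dx": 1},
    {"subject_id": 2, "age": 65, "typing_speed_wpm": 48.5, "mci_dx": 0},
    {"subject_id": 3, "age": 80, "typing_speed_wpm": 22.3, "mci_dx": 1},
    {"subject_id": 4, "age": 59, "typing_speed_wpm": 51.2, "mci_dx": 0},
]

# Separate predictor features from the target variable, as a
# machine learning model in a later phase would.
X = [[r["age"], r["typing_speed_wpm"]] for r in rows]
y = [r["mci_dx"] for r in rows]

# Report the target's distribution -- the kind of summary statistic the
# "Basic information" subsection of the data description asks for.
prevalence = sum(y) / len(y)
print(f"n={len(y)}, target prevalence={prevalence:.2f}")
```

Real submissions will of course be far larger and richer; the point is only that predictors and target(s) should be cleanly separable and the target's distribution reportable.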

Data access

The challenge is interested in sourcing data to support open innovation. Datasets should be as open as possible given privacy considerations inherent to the type of data included. Sensitive data types may require that data requesters agree to terms of use or a data license before accessing the data. At a minimum, the dataset must be freely available to all solvers in the challenge who agree to terms of use for the purpose and duration of the challenge.

Submission format

Solvers will submit a written description of an open, shareable dataset that can support novel machine learning approaches for early prediction of AD/ADRD, with an emphasis on addressing biases in existing data sources. The dataset must include a target variable that validly indicates AD/ADRD, along with predictor variables that can be used as input features for predicting the target variable. Solvers will need to ensure that the dataset is available to all solvers participating in later phases of the challenge. Submissions must include an executive summary section, a data description section, and a team introduction section. A references section and the data description subsections listed below are encouraged but not required.

Executive summary (maximum of 1 page, excluding references): Provide a summary of the dataset, including the predictor variables, their potential for innovation and utility for machine learning, the target AD/ADRD variable(s), the data source(s) (e.g., clinical interview, electronic health records (EHR), mobile app data, etc.), and information about the suitability of the dataset for use in the challenge (e.g., data ownership, any restrictions on sharing due to data sensitivity or copyright).

Drafts of the executive summary are due by January 17, 2024. Optionally, submit an executive summary draft by November 15, 2023 to be included in the midpoint review.

Data description (maximum of 6 pages, excluding references)

  • Basic information: Describe how data were produced, including descriptions of relevant experimental design, data acquisition and any computational processing of raw data. You may reference papers or external links. Include the number of files, file formats, size of files, number of subjects, total number of observations, number of observations with links to the target variable(s), and statistical distribution of the target variable(s). If applicable, provide links to external datasets and state the accessibility of those datasets (copyright, license details, cost, etc.).
  • Utility & rigor: Provide a definition of the target variable(s), along with information about their measurement and distribution in the sample. Present evidence that the target reliably and validly indicates AD/ADRD. Describe the predictor variables in the dataset and explain any theoretical or empirical links to AD/ADRD. Include any potential for validation or generalization to other data.
  • Innovation: Describe to what extent the proposed data push forward the state of the field and present novel insights and directions for further research by, for example, enabling higher accuracy, earlier predictions, lower cost, and/or greater accessibility. Describe other similar datasets, if any, and how the proposed dataset provides advancements over existing ones. Explain the potential of machine learning on the data to improve early AD/ADRD prediction.
  • Sample characteristics and representation: Describe the sample demographics (age, race/ethnicity, other known demographics) and to what extent the data address current biases in research and diagnosis of AD/ADRD and expand the representation of traditionally underrepresented/underserved populations disproportionately impacted by AD/ADRD. Include relevant information about the sampling approach, such as whether participants were compensated for participation, the geographic location of participants, how participants were contacted and recruited, and any other aspects of the study methodology designed to enhance sample representativeness.
  • Usability: Describe where data are currently stored, who owns the data, the data license, and how data would be made available for verification and use in and beyond the challenge. Note any details regarding its readiness for use and any further preparation needed for delivery. Include any known or planned restrictions on data or metadata use (e.g., inability to share clinical measure protocols due to copyright, inability to post data on the open internet due to IRB/consent). Describe any sensitive or personally identifiable information contained in the dataset, whether the dataset includes information that could be used to identify individuals (e.g., people’s names, information about their home address or frequently visited locations, birth date, face images, retinal images, voice recordings, or other), and approaches to mitigate these risks.

Team introduction (maximum 1 page): Provide information about your team, including members' roles and expertise with respect to the submission. For example, this may include information about a team member's role on the project, job title, career stage, institutional affiliation and relevant education, training, and professional or personal experiences.

Requirements

Written submissions must:

  • Consist of a single PDF file with page size set to 8.5” x 11” and at least 1-inch margins.
  • Use a font no smaller than 11-point Arial and line spacing no less than 1.0.
  • Be written in English.
  • Not use the HHS logo or official seal, or the logos of NIH or NIA, and not claim federal government endorsement.

Solvers may optionally include the following accompanying files with the written submission: a data dictionary, schema, data sample, a link to the data in a repository, and/or a data management plan. A submission with accompanying files must be a single ZIP archive containing the written submission and all accompanying files. To submit accompanying files, please email a complete ZIP to info@drivendata.org AND submit the final written submission as a PDF using the Final submission page.
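An accompanying data dictionary can be as simple as a machine-readable mapping from column names to types, units, and value codes. A minimal sketch follows; the field names and codes are illustrative assumptions, not a required format.

```python
# Hypothetical data dictionary for an accompanying file. Every field name,
# type, and code here is invented for illustration; solvers should document
# whatever columns their own dataset actually contains.
data_dictionary = {
    "subject_id": {
        "type": "int",
        "description": "Unique participant identifier",
    },
    "age": {
        "type": "int",
        "units": "years",
        "description": "Age at observation",
    },
    "mci_dx": {
        "type": "int",
        "values": {0: "no MCI diagnosis", 1: "clinical MCI diagnosis"},
        "description": "Target variable: clinician-adjudicated MCI status",
    },
}

# Render a human-readable summary of the dictionary.
for name, spec in data_dictionary.items():
    print(f"{name} ({spec['type']}): {spec['description']}")
```

A structure like this is easy to serialize to JSON or CSV for inclusion in the submission ZIP, and doubles as documentation for verification in later phases.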

In order to be eligible for a prize, solvers must:

  • Upload a draft of the executive summary to the Executive summary draft page by the Executive summary draft due date, January 17, 2024 at 11:59:59 PM UTC.
  • Upload a final written submission to the Final submission page by the Final submission due date, January 31, 2024 at 11:59:59 PM UTC.

A template for the full written submission, containing the required section headings, descriptions of each section, and guidance for writing a competitive proposal, is provided on the Data downloads page.

Evaluation criteria

Entries that are responsive and comply with the entry requirements will be scored by technical reviewers in accordance with the criteria outlined below.

Utility & Rigor - Predictor(s) (20%): What is the potential for the predictor data to provide useful signal for early prediction of AD/ADRD? What are the benefits of using this information beyond what exists today (e.g., wider population, fewer gaps in coverage, earlier identification, lower cost)?

Utility & Rigor - Target(s) (20%): How well defined is the AD/ADRD target variable? How well do applicants justify their choice of target variable? To what extent do the data demonstrate links between the predictors and the target variable(s)?

Innovation (20%): To what extent do the data push forward the state of the field and present novel insights and directions for further research?

Disproportionate Impact (20%): To what extent do the data help address current biases in research and diagnosis of AD/ADRD in populations disproportionately impacted by AD/ADRD?

Usability (20%): How broadly is the dataset licensed for access (e.g., downloading) and use by others to further research through and beyond the challenge? How well prepared will the data be for use in the challenge?

Disproportionate Impact bonus prize

NIA is particularly interested in submissions that advance equity for populations disproportionately impacted by AD/ADRD. For example, Black and Hispanic Americans have a higher prevalence of AD/ADRD compared to non-Hispanic White Americans.

Challenge finalists (i.e., not Ideas for data collection finalists) will be eligible for a bonus prize reflecting this interest in research advances for populations disproportionately impacted by AD/ADRD. The prize will be awarded to the team or entity that receives the highest score in the ‘Disproportionate Impact’ category of the evaluation criteria.

Data access and supporting verification materials

Following the judging process, selected finalists will provide the following for verification by the June 1, 2024 deadline:

Data access: Provide the challenge organizers instructions for downloading the data, with any terms for access and use defined.

Dataset documentation: Provide documentation of the dataset and how it was produced, including a data dictionary, schema, and data sample.

Associated code (optional): Provide software associated with the dataset, for example, code (and documentation) used in the collection or processing of the data, code used to generate the descriptive summaries included in the written submissions, interactive notebooks, examples, tutorials, or documentation that demonstrate how to load and work with the dataset.

Frequently asked questions

Can I submit both an existing dataset and an idea for data collection?

Yes, you can submit both an existing dataset as part of the main challenge and an idea for data collection! There are two separate submission pages, one for the main challenge (which also requires the submission of an executive summary draft) and one to collect ideas for data collection.

Can I submit more than one dataset or more than one idea for data collection?

Each team can make only one final submission of each type (i.e., one main submission of an existing dataset and one idea for data collection). Submissions can be updated an unlimited number of times prior to the submission deadline. The last entry made before the deadline will be considered the final, official submission.

Why should I submit an early draft of the executive summary by the midpoint review deadline?

Submitting your executive summary by the midpoint review deadline will provide an opportunity for you to receive initial feedback from judges so that you can improve upon your final submission. Midpoint submissions do not have any impact on the ranking of your final submission. Feedback will be returned in a format that will be accessible to all solvers.

While submitting your executive summary draft by the midpoint review deadline is optional, submitting it by the executive summary draft deadline is required. For a complete list of deadlines, refer to the home page.

Good luck

Good luck and have fun engaging with this challenge! If you have any questions, send an email to the challenge organizers at info@drivendata.org or post on the forum!