Problem description
The challenge centers on developing better methods for predicting Alzheimer's disease and Alzheimer's disease related dementias (AD/ADRD) as early as possible. Phase 3 (Put It All Together!): Proof of Principle Demonstration focuses on refining winning Phase 2 models into useful proofs of concept for early prediction methods.
In Phase 3, the winning teams from Phase 2 will spend several months refining their solutions and sharing learnings from the process. Where Phase 2 focused primarily on model performance, Phase 3 focuses on making models fairer, more equitable, and more generalizable.
Key events:
- Refinement Period (April 23 – July 15, 2025): Participants improve their Phase 2 winning models
- Submission Deadline (July 15, 2025, 11:59 p.m. UTC): Participants submit their refined codebase and supporting materials
- Virtual Pitch Event (Late July or early August 2025): Participants present their refined solutions
- In-Person Event (Late September 2025): 1st and 2nd place Overall Winners share insights and engage with NIH stakeholders
Objectives
In PREPARE — Phase 3, solvers will explore new modeling approaches and data to provide proof of principle for the early prediction of AD/ADRD. In this work, solvers can help guide research and improve detection of AD/ADRD in a real-world population. For example, which methods should be investigated more? Which new datasets should be collected? How should existing data collection processes be changed?
The primary goal is not to optimize the performance of a single trained model on a specific test set, because a single test set would not be able to capture the full population of interest. Final Phase 3 models should serve as proofs of concept for early detection by demonstrating novel methods, data usage, or other approaches.
Participants are encouraged to explore any approaches that improve model performance on a real-world population, with a particular emphasis on equitable performance. Below is a non-comprehensive list of example model refinement activities. Participants do not need to complete all of them; identify which activities are most relevant to your model, whether from this list or something else.
- Incorporating new data, particularly new data from populations disproportionately impacted by AD/ADRD or data that enables earlier predictions
- Correcting for overfitting on the Model Arena competition test set, and improving or demonstrating generalizability across data sources or populations. For example, winners from the Acoustic Track can utilize the metadata provided in the Report Arena to account for correlations between language, corpus, and diagnosis in the competition data.
- Identifying and mitigating model bias, for example by disaggregating performance metrics by subgroup (see the sketch after this list)
- Testing novel/experimental or established biomarkers
- Testing existing theoretical models of AD/ADRD
- Improving explainability and better understanding the contribution of different model features
- Addressing gaps in team expertise. For example, soliciting input from experts or clinical users to inform model updates
- Improving data cleaning and processing, particularly for data from the Acoustic Track
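For the generalizability and bias items above, the minimal sketch below shows one possible starting point: grouped cross-validation that holds out entire data sources, with metrics disaggregated by subgroup. The file name and the `corpus`, `sex`, and `diagnosis` columns are hypothetical placeholders, not part of the competition data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

df = pd.read_csv("features.csv")  # hypothetical feature table
X = df.drop(columns=["diagnosis", "corpus", "sex"])
y = df["diagnosis"]

# Hold out entire corpora so scores reflect transfer to unseen data
# sources rather than memorization of corpus-level artifacts.
cv = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y, groups=df["corpus"])):
    model = RandomForestClassifier(random_state=0)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    proba = model.predict_proba(X.iloc[test_idx])[:, 1]
    print(f"fold {fold} overall AUC: {roc_auc_score(y.iloc[test_idx], proba):.3f}")

    # Disaggregate the same fold by a demographic column to surface
    # subgroups where performance lags.
    test = df.iloc[test_idx].assign(proba=proba)
    for group, sub in test.groupby("sex"):
        if sub["diagnosis"].nunique() > 1:  # AUC needs both classes present
            auc = roc_auc_score(sub["diagnosis"], sub["proba"])
            print(f"  {group}: AUC {auc:.3f} (n={len(sub)})")
```

Holding out whole corpora is one way to address the kind of corpus-diagnosis correlations noted above for the Acoustic Track; any grouping variable that separates data sources or populations can be used.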
A few examples of activities that would not be helpful as part of Phase 3:
- Directly collecting additional data
- Increasing model performance by increasing compute (e.g., the number of ensembled models with different random seeds) rather than by changing methods
- Making an interactive dashboard to share model predictions without changing the model itself
Submission format
To be eligible for prizes, participants are required to both make a competition submission and participate in a virtual pitch event.
Each submission should be a ZIP archive containing the following components:
- Key Insights: A concise, easily digestible summary of practical learnings targeted towards AD/ADRD researchers (key_insights.pdf)
- Codebase: A well-organized codebase for training the final model and performing inference (codebase.zip or codebase.pdf)
- Phase 3 Activity Summary: A summary of what participants worked on during Phase 3, and how the model was changed from Phase 2 (activity_summary.pdf)
The ZIP archive of your submission cannot be larger than 1GB. Only one submission may be active at a time; to make changes, you can delete and re-upload your submission as many times as you like. Only the last submission will be considered.
1. Key Insights
"Key Insights" provides a summary of takeaways for AD/ADRD research. Key Insights should be targeted to an audience of AD/ADRD researchers who have subject matter expertise, but may only have basic familiarity with machine learning. The goal is that an AD/ADRD researcher should easily be able to use the insights to improve or direct research.
The final model should be used as a proof-of-concept for early prediction. Learnings from the experimentation and development process are just as important as the final product. Insights can be derived from work done during both Phase 2 and Phase 3.
Suggested sections and prompts
- Overview
- Brief overview of the final model, including model architecture, datasets used, and key data processing and feature engineering steps
- Performance
- Describe model performance metrics on different datasets (if applicable) and patient groups. How well does the solution perform? How does this compare with existing approaches?
- For which groups does the solution struggle most with accurate early prediction, and how might performance be improved?
- Other key limitations and weaknesses relevant for interpretation and clinical use
- Methodology
- Takeaways about which modeling methods do or don’t improve model performance / generalizability and why
- Data
- Takeaways about which existing datasets do or don’t improve model performance / generalizability and why
- Any recommendations of top priority additional datasets to gather
- Contributing factors
- Takeaways about how different features influence model predictions and why (i.e., analysis of feature importance and the real-world meaning of the results); one common approach is sketched after this list
- Future directions
- Recommended next steps for advancing early prediction of AD/ADRD. What methodologies should be tested? What are key limiting factors in the field overall?
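One common way to generate the feature-contribution takeaways above is permutation importance, sketched below on a synthetic stand-in dataset; swap in your own fitted model and held-out data. This is only one option among many (e.g., SHAP values), not a required method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with your own features and labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and measure the drop in
# AUC; a large drop means the model leans heavily on that feature.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=20, random_state=0
)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: mean AUC drop {result.importances_mean[i]:.3f} "
          f"(+/- {result.importances_std[i]:.3f})")
```

The real-world interpretation of the top-ranked features (not the numbers themselves) is what belongs in the Key Insights document.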
Technical requirements
- 5 pages maximum including figures and tables. References may be added on additional pages
- A PDF file named key_insights.pdf
- On paper size 8.5 x 11 inch or A4, with minimum margins of 1 inch
- Minimum font size of 11
- Minimum single-line spacing
2. Codebase
Participants should submit a well-organized codebase for their final model, including all steps for model training and for inference on at least one dataset. The goal of the codebase is to lower the barrier for other researchers using similar methods. For example, other researchers should be able to easily find starter code for things like:
- Working with useful datasets (loading, cleaning, processing, etc.)
- Implementing useful modeling methods, feature engineering, and data preprocessing
Codebases should be written to maximize reproducibility in the spirit of open science. We recommend starting with DrivenData's Cookiecutter project structure. You may even win a prize for clean code!
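As a loose illustration of what easy-to-find starter code can look like, here is a minimal sketch of a self-documenting training entry point in the spirit of the Cookiecutter layout. All paths, file names, and the `label` column are hypothetical.

```python
import argparse

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression


def main() -> None:
    # Explicit, documented arguments make the pipeline reproducible
    # without reading the source. Defaults follow a Cookiecutter-style
    # layout (data/processed/, models/).
    parser = argparse.ArgumentParser(description="Train the final model.")
    parser.add_argument("--features", default="data/processed/features.csv")
    parser.add_argument("--model-out", default="models/model.joblib")
    parser.add_argument("--seed", type=int, default=0)
    args = parser.parse_args()

    df = pd.read_csv(args.features)
    X, y = df.drop(columns=["label"]), df["label"]

    # Fix the random seed and persist the fitted model so that every run
    # is repeatable end to end.
    model = LogisticRegression(max_iter=1000, random_state=args.seed).fit(X, y)
    joblib.dump(model, args.model_out)


if __name__ == "__main__":
    main()
```

Pairing entry points like this with a pinned environment file (e.g., requirements.txt) and data setup instructions covers most of what reviewers need to reproduce a result.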
Technical requirements
- The codebase should be submitted as either:
  - codebase.pdf: A PDF file that includes a link to a public GitHub repository
  - codebase.zip: A ZIP file containing all code files, structured like a GitHub repository. Keep in mind that your whole submission ZIP cannot be larger than 1GB
- Model weights do not need to be included in the submitted code
- The submitted code should not contain any raw data, but should contain instructions for how to set up the raw data before running training and inference
3. Phase 3 Activity Summary
The "Phase 3 Activity Summary" should provide an overview of what work was done as part of Phase 3, and how solutions were updated from Phase 2 final models. The target audience is competition judges (i.e., a group familiar with both the subject matter and machine learning). "Key Insights" describes the final model and full experimentation process, while the "Phase 3 Activity Summary" describes only the changes made during Phase 3.
Suggested sections and prompts
- Activity Overview
- High-level summary of activity during Phase 3
- Summary of what changes you made to your final Phase 2 model
- Data
- Any additional datasets you experimented with and the results
- Generalizability
- How you explored making your model more generalizable outside of the competition dataset
- Experimentation to identify and mitigate bias
- Methodology
- Details about any other ways you experimented with improving model performance
- References
- Any ways that existing AD/ADRD research, theory, or practice influenced your work (e.g., literature references, user interviews to better understand clinical needs, or expert input)
Technical requirements
- 3 pages maximum including figures and tables. References may be added on additional pages
- A PDF file named activity_summary.pdf
- On paper size 8.5 x 11 inch or A4, with minimum margins of 1 inch
- Minimum font size of 11
- Minimum single-line spacing
Evaluation
Overall Prizes
Overall prize winners (1st, 2nd, and Runners-up) will be selected based on the following weighted criteria. Submissions will be judged by a panel of experts.
- Insights and Innovation (30%): What is the depth and relevance of insights gained? Does the submission include novel or innovative takeaways that can guide future directions of AD/ADRD research? This can be demonstrated through feature importance interpretation, assessment of model limitations and possible methods for improvement, demonstration of new modeling approaches, etc.
- Generalizability (25%): How likely is the model to generalize well to a real-world population of interest, in particular historically underserved segments of the population? This can be demonstrated through incorporation of new data, bias assessment and mitigation, and rigorous model performance assessment.
- Communication (25%): How clearly and effectively are findings communicated? In particular, how well are key insights communicated to an audience of AD/ADRD researchers who have subject matter expertise but may not be familiar with machine learning?
- Rigor (20%): Is the technical basis of the solution correct? How sound is the solution’s methodology?
Submissions are required to be in English, but will not be judged on English fluency. Judgment will be based on the content and ideas communicated. For example, participants may choose to write in a different language, and then use a tool like Google Translate to submit in English.
Clean Code Bonus Prizes
Clean Code Bonus Prizes will be awarded based on the "Codebase" component of the submission. Submissions will be evaluated based on how clear, reproducible, and usable the codebase is. That is: is it easy for others to understand and learn from the code? How effectively does the codebase lower the barrier for others to work with similar datasets or methods?
Below are some good resources for how to organize and write clean code in the spirit of open science:
- Cookiecutter Data Science: Logical, reasonably standard project structure for data science that reflects best practices (created by DrivenData)
- In particular, take a look at the Cookiecutter opinions about how data science should work
- Computational reproducibility course from Utrecht University
- Think we're missing a good resource? Let us know in the forum!
Data
The focus of Phase 3 is demonstrating the potential of algorithms and approaches developed in Phase 2, with key emphasis placed on solution equity, generalizability, and explainability. As part of this, participants should find and incorporate new data, data processing methods, and feature engineering.
External data is permitted in this competition provided participants have all rights, licenses, and permissions to use it as contemplated in the Competition Rules.
If you aren't sure whether a specific dataset is allowable, please reach out to us in the competition forum or send an email to info@drivendata.org.
Additional data resources
Below are some useful links and tools for finding additional data for experimentation.
General resources:
- Global Alzheimer’s Association Interactive Network (GAIN): GAIN has a data portal listing all datasets collected by partner organizations related to Alzheimer's disease. Different datasets include different types of data, including biomarkers, cognitive tests, etc.
Acoustic Track:
- DementiaBank corpus: The full DementiaBank corpus includes language samples and cases not used in the Phase 2 data as well as newly released content. Of particular note is the Delaware corpus, an ongoing study specifically designed to capture changes in speech and language abilities across the progression of dementia with a robust, updated version of the Pitt protocol. To access data, solvers must apply to join the DementiaBank consortium and use the data only for the competition’s purpose and duration; commercial use is prohibited.
- Past DementiaBank competition datasets: DementiaBank includes cleaned and curated datasets from four previous research challenges, each with distinct goals and processing pipelines. See the "Challenges" section of the DementiaBank page. To access data, solvers must apply to join the DementiaBank consortium and use the data only for the competition’s purpose and duration; commercial use is prohibited.
- Global Voice Datasets Repository Map: The NIH Bridge2AI program’s map of all publicly accessible speech datasets collected for neurological research may be a useful tool to find additional datasets. The data is organized geographically and contains licensing information.
- Yang et al. 2022 literature review: In a 2022 paper, "Speech-based AD Prediction Review", Yang et al. include a useful literature review table listing datasets by language, task / structure, and label availability.
Social Determinants of Health track:
- MHAS: MHAS shares data products publicly, including additional subjects, years, and features that were not part of the Phase 2 data. This also includes some restricted MHAS data that requires access approval, such as genetic data. To access data, follow the instructions and all terms of use on the MHAS website.
- HCAP Network: MHAS is just one of a group of studies based on the Harmonized Cognitive Assessment Protocol (HCAP). Participants can experiment with comparable HCAP-based surveys in other countries.
Additional tools
Participants may use any additional or external tools as long as they are publicly available and free to use. If you want to use a tool that is not clearly designated as open source, you must reach out to competition organizers for approval at info@drivendata.org.
Events
Virtual office hours
To support participants, there will be at least one virtual office hours session during the competition. At office hours, participants will be able to meet with and receive guidance from experts. Possible topics include bias mitigation, clean code, acoustic data, and social determinants of health data.
Details and instructions for attending will be shared on the Announcements Page.
Virtual pitches
To be eligible for prizes, participants are required to present a summary of their work in a virtual pitch event in addition to making a competition submission. The primary goal of the virtual pitch event is to share learnings with the broader AD/ADRD research community. Pitches should generally focus on the same content as the "Key Insights" submission component. Attendees may include competition organizers, competition judges, representatives from the NIH, and others in the AD/ADRD research community. Pitches should be targeted to an audience of AD/ADRD researchers with a basic knowledge of machine learning, but who may not be machine learning experts.
Technical requirements:
- Each pitch presentation should be a maximum of 5 minutes long
- There is no required format for pitch presentations. Pitches may include slides, screen sharing, a simple voiceover, or any other desired format
- If an individual or team is unable to attend the event, they may submit a pre-recorded pitch presentation instead
The virtual pitch event will be scheduled for July or August 2025. Exact timing and event logistics will be shared on the Announcements Page at a later date.
In-person winner showcase
Phase 3 will culminate in an in-person event where the winners (1st and 2nd Place) will have an opportunity to connect with the broader AD/ADRD research community. Additional participants may be invited to the event depending on capacity.
Support for travel and lodging will be provided for winners to attend (with limitations on the number of participants per team). The event will take place in the Washington, D.C. area in September 2025. Details will be shared with teams before they confirm attendance.
The in-person winner showcase will include:
- A very short presentation from each winner, followed by a longer question and answer session. Winners are required to either attend the showcase in person, or to submit a short pre-recorded presentation
- An award ceremony
- Opportunities to meet and exchange ideas with prominent professionals in the AD/ADRD research space
Good luck
Good luck and have fun engaging with this challenge! If you have any questions, send an email to the challenge organizers at info@drivendata.org or post on the forum!