Competition: On Top of Pasketti: Children’s Speech Recognition Challenge

About the Project

The On Top of Pasketti: Children's Speech Recognition Challenge is part of a broader initiative at DrivenData to develop high-quality data and models that make a meaningful impact in education, through both data science competitions and direct engagements with partner organizations.

This challenge was designed to advance Automatic Speech Recognition (ASR) that works well for children across early and elementary learning stages, and can be openly shared for broad public benefit.

The live competition is one stage in a broader, multi-phase effort to improve speech recognition technologies for children and support the development of practical, real-world tools. The effort comprises several key phases:

Landscape analysis and benchmarking. The project team conducted a review of existing child speech datasets and models, benchmarking current performance and identifying critical gaps in publicly available resources. These gaps included limited representation of dialects commonly found in U.S. classrooms, a lack of naturalistic classroom recordings, and limited availability of phonetic labels needed for diagnostic and assessment-focused applications.

User and expert engagement. The team engaged a group of advisors representing potential end users and subject matter experts, including researchers, clinicians, and education technology practitioners. These conversations helped clarify which gaps were most urgent, which use cases were most impactful in practice, and which populations were most likely to be underserved by existing tools.

Dataset curation and annotation. Guided by this input, the team curated and harmonized audio recordings of children spanning early and elementary learning stages, including children as young as pre-K, from public, research, and enterprise sources. Dedicated transcription teams produced high-quality word-level and phonetic annotations across hundreds of hours of audio, enabling both educational and diagnostic use cases.

Challenge design. Finally, the competition was designed to leverage these new datasets and engage the broader machine learning community. Two complementary tracks were created to support prioritized use cases with distinct technical requirements: word-level speech recognition and phonetic-level speech recognition.

The table below illustrates how word-level and phonetic-level ASR support different learning and assessment activities.

Learning Activity	Phonetic ASR	Word ASR
Speech Screening	Required	Helpful
Dyslexia Screening	Required	Helpful
Reading Fluency	Helpful	Required
Language Use Assessment	Not Needed	Required
Comprehension	Not Needed	Required
Problem Solving	Not Needed	Required
Reasoning	Not Needed	Required
Verbal Interaction with Tools	Not Needed	Required
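
As a rough illustration, the table above can be encoded as a small lookup structure for choosing a track by intended use case. This is only a sketch: the dictionary layout, the keys "phonetic" and "word", and the function name `activities_requiring` are hypothetical and not part of any competition tooling. The support levels are copied directly from the table.

```python
# Support levels copied from the table above.
# The structure and all names here are illustrative assumptions, not competition code.
SUPPORT = {
    "Speech Screening": {"phonetic": "Required", "word": "Helpful"},
    "Dyslexia Screening": {"phonetic": "Required", "word": "Helpful"},
    "Reading Fluency": {"phonetic": "Helpful", "word": "Required"},
    "Language Use Assessment": {"phonetic": "Not Needed", "word": "Required"},
    "Comprehension": {"phonetic": "Not Needed", "word": "Required"},
    "Problem Solving": {"phonetic": "Not Needed", "word": "Required"},
    "Reasoning": {"phonetic": "Not Needed", "word": "Required"},
    "Verbal Interaction with Tools": {"phonetic": "Not Needed", "word": "Required"},
}


def activities_requiring(track: str) -> list[str]:
    """Return the learning activities for which the given track is marked Required."""
    if track not in ("phonetic", "word"):
        raise ValueError(f"unknown track: {track!r}")
    return [name for name, levels in SUPPORT.items() if levels[track] == "Required"]


if __name__ == "__main__":
    # Activities that depend on phonetic-level ASR.
    print(activities_requiring("phonetic"))
    # Activities that depend on word-level ASR.
    print(activities_requiring("word"))
```

Read this way, the phonetic track is required for the two screening activities, while the word track is required for the remaining six.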

Refinement, analysis, and sharing. Following the challenge, the project team will refine the insights and models from the winning approaches and adapt them for real-world applications, with the aim of advancing ASR tools for children's speech. This work will include additional training, evaluation, and validation using data sources that cannot be shared publicly.

The resulting models will be packaged into an open-source software library designed to make child-focused ASR models easier to use and integrate into downstream applications. In addition, the annotations created for this competition will be released publicly to support researchers, educators, clinicians, and technologists working to build more accessible and inclusive tools for children.

Additional Resources

The resources below provide a curated starting point for participants interested in prior work on automatic speech recognition for children. They are not required reading, but may be helpful for understanding common challenges, baseline approaches, and open research questions in this domain.

Foundational and Benchmarking Papers

These papers provide useful context on the performance of modern ASR models on children's speech and common sources of error:

Fan, R., Shankar, N. B., & Alwan, A. (2024). Benchmarking children's ASR with supervised and self-supervised speech foundation models. arXiv.
https://arxiv.org/abs/2406.10507
Graave, T., Li, Z., Lohrenz, T., & Fingscheidt, T. (2024). Mixed children/adult/childrenized fine-tuning for children's ASR: How to reduce age mismatch and speaking style mismatch. In Proceedings of Interspeech 2024 (pp. 5188–5192).
https://doi.org/10.21437/Interspeech.2024-499
Jain, R., Barcovschi, A., Yiwere, M. Y., Corcoran, P., & Cucu, H. (2024). Exploring native and non-native English child speech recognition with Whisper. IEEE Access, 12, 41601–41610.
https://doi.org/10.1109/ACCESS.2024.3378738
Potamianos, A., & Narayanan, S. (2003). Robust recognition of children's speech. IEEE Transactions on Speech and Audio Processing, 11(6), 603–616.
https://doi.org/10.1109/TSA.2003.818026

Phonetic and Diagnostic Applications

This paper is particularly relevant for participants interested in phonetic-level modeling and diagnostic or assessment-oriented use cases:

Block Medin, L., Pellegrini, T., & Gelin, L. (2024). Self-supervised models for phoneme recognition: Applications in children's speech for reading learning. In Proceedings of Interspeech 2024.
https://arxiv.org/abs/2503.04710

Practitioner-Oriented Background Reading

These resources provide high-level synthesis and practical guidance for working with children's speech in ASR systems:

Holmes, L. (2025). Finetuning ASR models for child voices. The Learning Agency.
https://the-learning-agency.com/guides-resources/finetuning-asr-models-for-child-voices/
Natarajan, N. (2026). Closing the child speech recognition gap: Evidence, limitations, and paths forward. The Learning Agency.
https://the-learning-agency.com/guides-resources/closing-the-child-speech-recognition-gap-evidence-limitations-and-paths-forward/

For general guidance on working with speech data in machine learning, see our blog post on open-source packages for using speech data in ML.

Acknowledgements

We are grateful to the many individuals, organizations, and teams who contributed expertise, data, and guidance throughout the development of this project, including:

Academic, data, and research collaborators

Dr. Ahmed Attia and Dr. Jing Liu, University of Maryland
Dr. Kate Bunton, University of Arizona
Dr. Satrajit Ghosh, Massachusetts Institute of Technology
Dr. Brian MacWhinney and collaborators at TalkBank
Dr. Yaacov Petscher, Florida State University
Dr. Sameer Pradhan and Boulder Learning
Dr. Marisha Speights, Northwestern University
Linguistic Data Consortium
TeachFX research team

Annotation and transcription partners

Dr. Charlotte Moore and the DrivenData Transcription Team

ASR for Assessment User Advisory Group, including representatives from:

Magpie
PowerMyLearning
Khan Kids
Soapbox Labs
Lit
Waterford
Harvard Graduate School of Education

Funding support

We are grateful to the Gates Foundation, whose support made this project possible, and to the following organizations for additional funding support:

Valhalla Foundation
Center for Educational Data Science and Innovation at the University of Maryland

Data Attribution

Participants must provide appropriate attribution in any publication, presentation, or public description of work developed using the competition data.

Competition and platform:

DrivenData. (2026). On Top of Pasketti: Children's Speech Recognition Challenge. Retrieved [Month Day Year] from https://www.drivendata.org/competitions/group/childrens-asr-competition/.
Bull, P., Slavitt, I., & Lipstein, G. (2016). Harnessing the power of the crowd to increase capacity for data science in the social sector. arXiv. https://arxiv.org/abs/1606.07781

Dataset sources include:

Arizona Child Acoustic Database Repository
Cameron
The CMU Kids Corpus
CSLU: Kids' Speech Version 1.1
Edmonton Narrative Norms Instrument
Ellis Weismer Corpus
JIBO Kids
My Science Tutor (MyST). To obtain permission or licensing information for use of MyST data, contact mystcorpus25@gmail.com.
The Ohio Child Speech Corpus
PERCEPT-GFTA
PERCEPT-R
ReadNet
Speech Production Repository for Optimizing Use of AI Technologies (SPROUT)

Required data citations for challenge data:

Attia, A. A., Liu, J., & Wilson, C. E. (2025). RealClass: A Framework for Classroom Speech Simulation with Public Datasets and Game Engines. arXiv. https://arxiv.org/abs/2510.01462
Ayala, S. A., Eads, A., Kabakoff, H., Swartz, M., Shiller, D. M., Hill, J., Hitchcock, E. R., Preston, J. L., & McAllister, T. (2023). Auditory and Somatosensory Development for Speech in Later Childhood. Journal of Speech, Language, and Hearing Research, 66(4), 1252–1273. https://doi.org/10.1044/2022_jslhr-22-00496
balaji1312. (2024). balaji1312/Jibo_Kids: Init (v0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13964792
Benway, N. R., Hitchcock, E. R., McAllister, T., Feeny, G. T., Hill, J., & Preston, J. L. (2021). Comparing biofeedback types for children with residual /ɹ/ errors in American English: A single-case randomization design. American Journal of Speech-Language Pathology, 30(4), 1819–1845. https://doi.org/10.1044/2021_ajslp-20-00216
Benway, N. R., Preston, J. L., Hitchcock, E. R., Salekin, A., Sharma, H., & McAllister, T. (2022). PERCEPT-R: An Open-Access American English Child/Clinical Speech Corpus Specialized for the Audio Classification of /ɹ/. INTERSPEECH 2022: Proceedings of the 23rd Annual Conference of the International Speech Communication Association (ISCA), Incheon, Republic of Korea.
Bunton, K., & Story, B. H. (2016). Arizona Child Acoustic Database Repository. Folia Phoniatrica et Logopaedica, 68(3), 107–111. https://doi.org/10.1159/000452128
Cameron, C. E., Starke, K., Lewis-Jones, T., Baker, M., Didrichsen, S., & Mead, C. (2023). Technical codebook for Project Equity: A study to capture, appreciate, and understand young children's language diversity. Unpublished document.
Campbell, H., Harel, D., Hitchcock, E., & McAllister Byun, T. (2018). Selecting an acoustic correlate for automated measurement of American English rhotic production in children. International Journal of Speech-Language Pathology, 20(6), 635–643. https://doi.org/10.1080/17549507.2017.1359334
Ellis Weismer, S., Venker, C., Evans, J. L., & Moyle, M. (2013). Fast mapping in late-talking toddlers. Applied Psycholinguistics, 34, 69–89.
Guo, L. Y., & Schneider, P. (2016). Differentiating school-aged children with and without language impairment using tense and grammaticality measures from a narrative task. Journal of Speech, Language, and Hearing Research, 59(2), 317–329. https://doi.org/10.1044/2015_jslhr-l-15-0066
Heilmann, J., Ellis Weismer, S., Evans, J., & Hollar, C. (2005). Utility of the MacArthur Communicative Development Inventory in identifying children's language level. American Journal of Speech-Language Pathology, 14, 40–51.
Hitchcock, E. R., & McAllister Byun, T. (2015). Enhancing generalisation in biofeedback intervention using the challenge point framework: A case study. Clinical Linguistics & Phonetics, 29(1), 59–75. https://doi.org/10.3109/02699206.2014.956232
Hitchcock, E. R., McAllister Byun, T., Swartz, M., & Lazarus, R. (2017). Efficacy of electropalatography for treating misarticulation of /r/. American Journal of Speech-Language Pathology, 26(4), 1141–1158. https://doi.org/10.1044/2017_ajslp-16-0122
McAllister, T., Eads, A., Kabakoff, H., Scott, M., Boyce, S., Whalen, D. H., & Preston, J. L. (2022). Baseline Stimulability Predicts Patterns of Response to Traditional and Ultrasound Biofeedback Treatment for Residual Speech Sound Disorder. Journal of Speech, Language, and Hearing Research, 65(10), 1–21. https://doi.org/10.1044/2022_jslhr-22-00161
McAllister, T., Hitchcock, E. R., & Ortiz, J. A. (2021). Computer-assisted challenge point intervention for residual speech errors. Perspectives of the ASHA Special Interest Groups, 6(1), 214–229. https://doi.org/10.1044/2020_persp-20-00191
McAllister, T., Preston, J. L., Hitchcock, E. R., & Hill, J. (2020). Protocol for correcting residual errors with spectral, ultrasound, traditional speech therapy randomized controlled trial (C-RESULTS RCT). BMC Pediatrics, 20(1), 1–14. https://doi.org/10.1186/s12887-020-1951-3
McAllister Byun, T. (2017). Efficacy of visual–acoustic biofeedback intervention for residual rhotic errors: A single-subject randomization study. Journal of Speech, Language, and Hearing Research, 60(5), 1175–1193. https://doi.org/10.1044/2016_JSLHR-S-16-0038
McAllister Byun, T., & Campbell, H. (2016). Differential effects of visual-acoustic biofeedback intervention for residual speech errors. Frontiers in Human Neuroscience, 10, 567. https://doi.org/10.3389/fnhum.2016.00567
McAllister Byun, T., & Hitchcock, E. R. (2012). Investigating the use of traditional and spectral biofeedback approaches to intervention for /r/ misarticulation. American Journal of Speech-Language Pathology, 21(3), 207–221. https://doi.org/10.1044/1058-0360(2012/11-0083)
McAllister Byun, T., Hitchcock, E. R., & Ferron, J. (2017). Masked visual analysis: Minimizing type I error in visually guided single-case design for communication disorders. Journal of Speech, Language, and Hearing Research, 60(6), 1455–1466. https://doi.org/10.1044/2017_JSLHR-S-16-0012
McAllister Byun, T., Hitchcock, E. R., & Swartz, M. T. (2014). Retroflex versus bunched in treatment for rhotic misarticulation: Evidence from ultrasound biofeedback intervention. Journal of Speech, Language, and Hearing Research, 57(6), 2116–2130. https://doi.org/10.1044/2014_jslhr-s-14-0034
McAllister Byun, T., Swartz, M. T., Halpin, P. F., Szeredi, D., & Maas, E. (2016). Direction of attentional focus in biofeedback treatment for /r/ misarticulation. International Journal of Language & Communication Disorders, 51(4), 384–401. https://doi.org/10.1111/1460-6984.12215
Moyle, M. J., Ellis Weismer, S., Lindstrom, M., & Evans, J. (2007). Longitudinal relationships between lexical and grammatical development in typical and late talking children. Journal of Speech, Language, and Hearing Research, 50, 508–528.
Paradis, J., Schneider, P., & Duncan, T. S. (2013). Discriminating children with language impairment among English-language learners from diverse first-language backgrounds. Journal of Speech, Language, and Hearing Research, 56(3), 971–981. https://doi.org/10.1044/1092-4388(2012/12-0050)
Peterson, L., Savarese, C., Campbell, T., Ma, Z., Simpson, K. O., & McAllister, T. (2022). Telepractice treatment of residual rhotic errors using app-based biofeedback: A pilot study. Language, Speech, and Hearing Services in Schools, 53(2), 256–274. https://doi.org/10.1044/2021_LSHSS-21-00057
Preston, J. L., Brick, N., & Landi, N. (2013). Ultrasound biofeedback treatment for persisting childhood Apraxia of speech. American Journal of Speech-Language Pathology, 22(4), 627–644. https://doi.org/10.1044/1058-0360(2013/12-0139)
Preston, J. L., Caballero, N. F., Leece, M. C., Wang, D., Herbst, B. M., & Benway, N. R. (2023). A randomized controlled trial of treatment distribution and biofeedback effects on speech production in school-age children with apraxia of speech. Journal of Speech, Language, and Hearing Research, 66(8), 1–23. https://doi.org/10.1044/2023_JSLHR-22-00622
Preston, J. L., & Edwards, M. L. (2007). Phonological processing skills of adolescents with residual speech sound errors. Language, Speech, and Hearing Services in Schools, 38(4), 297–308.
Preston, J. L., Hitchcock, E. R., & Leece, M. C. (2020). Auditory perception and ultrasound biofeedback treatment outcomes for children with residual /ɹ/ distortions: A randomized controlled trial. Journal of Speech, Language, and Hearing Research, 63(2), 444–455. https://doi.org/10.1044/2019_jslhr-19-00060
Preston, J. L., Hull, M., & Edwards, M. L. (2013). Preschool speech error patterns predict articulation and phonological awareness outcomes in children with histories of speech sound disorders. American Journal of Speech-Language Pathology, 22(2), 173–184. https://doi.org/10.1044/1058-0360(2012/12-0043)
Preston, J. L., & Leece, M. C. (2017). Intensive Treatment for Persisting Rhotic Distortions: A Case Series. American Journal of Speech-Language Pathology, 26(4), 1066–1079. https://doi.org/10.1044/2017_AJSLP-16-0211
Preston, J. L., Leece, M. C., & Maas, E. (2016). Intensive treatment with ultrasound visual feedback for speech sound errors in childhood apraxia. Frontiers in Human Neuroscience, 10, 440. https://doi.org/10.3389/fnhum.2016.00440
Preston, J. L., Leece, M. C., McNamara, K., & Maas, E. (2017). Variable practice to enhance speech learning in ultrasound biofeedback treatment for childhood apraxia of speech: A single case experimental study. American Journal of Speech-Language Pathology, 26(3), 840–852. https://doi.org/10.1044/2017_AJSLP-16-0155
Schneider, P., & Hayward, D. (2010). Who does what to whom: Introduction of referents in children's storytelling from pictures. Language, Speech, and Hearing Services in Schools, 41, 459–473.
Schneider, P., Hayward, D., & Dubé, R. V. (2006). Storytelling from pictures using the Edmonton Narrative Norms Instrument. Journal of Speech-Language Pathology and Audiology, 30, 224–238.
Schneider, P., Rivard, R., & Debreuil, B. (2011). Does colour affect the quality or quantity of children's stories elicited by pictures? Child Language Teaching and Therapy, 27, 371–378.
Shankar, N. B., Afshan, A., Johnson, A., Mahapatra, A., Martin, A., Ni, H., Park, H. W., Perez, M. Q., Yeung, G., Bailey, A., Breazeal, C., & Alwan, A. (2024). The JIBO Kids Corpus: A speech dataset of child-robot interactions in a classroom environment. JASA Express Letters, 4(11), 115201. https://doi.org/10.1121/10.0034195
Sjolie, G. M., Leece, M. C., & Preston, J. L. (2016). Acquisition, retention, and generalization of rhotics with and without ultrasound visual feedback. Journal of Communication Disorders, 64, 62–77. https://doi.org/10.1016/j.jcomdis.2016.10.002
Speights, M., Shrivastava, V., Yedla, B., Srinivasan, S., Vroegh, E., Xu, S., Lidov, E., & Herholz, P. (2025). Speech Production Repository for Optimizing Use of FAIR AI Training - Research [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15723648
Wagner, L., Alghowinhem, S., Alwan, A., Bowdrie, K., Breazeal, C., Clopper, C. G., Fosler-Lussier, E., Jamsek, I. A., Lander, D., Ramnath, R., & Ross, J. (2025). The Ohio Child Speech Corpus. Speech Communication, 170, 103206. https://doi.org/10.1016/j.specom.2025.103206

On Top of Pasketti: Children’s Speech Recognition Challenge - Phonetic Track

Winner: qingtianbairimandihong