On Top of Pasketti: Children’s Speech Recognition Challenge - Phonetic Track

Develop automatic speech recognition models that produce phone-level transcriptions of children’s speech in IPA. #education

$50,000 in prizes
Completed Apr 2026
430 joined

About the Project

The On Top of Pasketti: Children's Speech Recognition Challenge is part of a broader initiative at DrivenData to develop high-quality data and models that make a meaningful impact in education, through both data science competitions and direct engagements with partner organizations.

This challenge was designed to advance Automatic Speech Recognition (ASR) that works well for children across early and elementary learning stages, and can be openly shared for broad public benefit.

The live competition is one stage in a multi-phase effort to improve speech recognition technologies for children and support the development of practical, real-world tools. The work leading up to the challenge included several key phases:

Landscape analysis and benchmarking. The project team conducted a review of existing child speech datasets and models, benchmarking current performance and identifying critical gaps in publicly available resources. These gaps included limited representation of dialects commonly found in U.S. classrooms, a lack of naturalistic classroom recordings, and limited availability of phonetic labels needed for diagnostic and assessment-focused applications.

User and expert engagement. The team engaged a group of advisors representing potential end users and subject matter experts, including researchers, clinicians, and education technology practitioners. These conversations helped clarify which gaps were most urgent, which use cases were most impactful in practice, and which populations were most likely to be underserved by existing tools.

Dataset curation and annotation. Guided by this input, the team curated and harmonized audio recordings of children spanning early and elementary learning stages, including children as young as pre-K, from public, research, and enterprise sources. Dedicated transcription teams produced high-quality word-level and phonetic annotations across hundreds of hours of audio, enabling both educational and diagnostic use cases.

Challenge design. Finally, the competition was designed to leverage these new datasets and engage the broader machine learning community. Two complementary tracks were created to support prioritized use cases with distinct technical requirements: word-level speech recognition and phonetic-level speech recognition.

The table below illustrates how word-level and phonetic-level ASR support different learning and assessment activities.

Learning Activity                Phonetic ASR   Word ASR
Speech Screening                 Required       Helpful
Dyslexia Screening               Required       Helpful
Reading Fluency                  Helpful        Required
Language Use Assessment          Not Needed     Required
Comprehension                    Not Needed     Required
Problem Solving                  Not Needed     Required
Reasoning                        Not Needed     Required
Verbal Interaction with Tools    Not Needed     Required

Refinement, analysis, and sharing. Following the challenge, the insights and models from the winning approaches will be refined and shared to advance ASR tools for children's speech. To adapt the competition outputs for real-world applications, the project team will conduct additional training, evaluation, and validation using data sources that cannot be shared publicly.

The resulting models will be packaged into an open-source software library designed to make child-focused ASR models easier to use and integrate into downstream applications. In addition, the annotations created for this competition will be released publicly to support researchers, educators, clinicians, and technologists working to build more accessible and inclusive tools for children.

Additional Resources

The resources below provide a curated starting point for participants interested in prior work on automatic speech recognition for children. They are not required reading, but may be helpful for understanding common challenges, baseline approaches, and open research questions in this domain.

Foundational and Benchmarking Papers

These papers provide useful context on the performance of modern ASR models on children's speech and common sources of error:

  • Fan, R., Shankar, N. B., & Alwan, A. (2024). Benchmarking children's ASR with supervised and self-supervised speech foundation models. arXiv.
    https://arxiv.org/abs/2406.10507

  • Graave, T., Li, Z., Lohrenz, T., & Fingscheidt, T. (2024). Mixed children/adult/childrenized fine-tuning for children's ASR: How to reduce age mismatch and speaking style mismatch. In Proceedings of Interspeech 2024 (pp. 5188–5192).
    https://doi.org/10.21437/Interspeech.2024-499

  • Jain, R., Barcovschi, A., Yiwere, M. Y., Corcoran, P., & Cucu, H. (2024). Exploring native and non-native English child speech recognition with Whisper. IEEE Access, 12, 41601–41610.
    https://doi.org/10.1109/ACCESS.2024.3378738

  • Potamianos, A., & Narayanan, S. (2003). Robust recognition of children's speech. IEEE Transactions on Speech and Audio Processing, 11(6), 603–616.
    https://doi.org/10.1109/TSA.2003.818026

Phonetic and Diagnostic Applications

This paper is particularly relevant for participants interested in phonetic-level modeling and diagnostic or assessment-oriented use cases:

  • Block Medin, L., Pellegrini, T., & Gelin, L. (2024). Self-supervised models for phoneme recognition: Applications in children's speech for reading learning. In Proceedings of Interspeech 2024.
    https://arxiv.org/abs/2503.04710

Practitioner-Oriented Background Reading

These resources provide high-level synthesis and practical guidance for working with children's speech in ASR systems:

For general guidance on working with speech data in machine learning, see our blog post on open-source packages for using speech data in ML.

Acknowledgements

We are grateful to the many individuals, organizations, and teams who contributed expertise, data, and guidance throughout the development of this project, including:

Academic, data, and research collaborators

  • Dr. Ahmed Attia and Dr. Jing Liu, University of Maryland
  • Dr. Kate Bunton, University of Arizona
  • Dr. Satrajit Ghosh, Massachusetts Institute of Technology
  • Dr. Brian MacWhinney and collaborators at TalkBank
  • Dr. Yaacov Petscher, Florida State University
  • Dr. Sameer Pradhan and Boulder Learning
  • Dr. Marisha Speights, Northwestern University
  • Linguistic Data Consortium
  • TeachFX research team

Annotation and transcription partners

  • Dr. Charlotte Moore and the DrivenData Transcription Team

ASR for Assessment User Advisory Group, including representatives from:

  • Magpie
  • PowerMyLearning
  • Khan Kids
  • SoapBox Labs
  • Lit
  • Waterford
  • Harvard Graduate School of Education

Funding support

We are grateful to the Gates Foundation, whose support made this project possible, as well as additional funding support from:

Data Attribution

Participants must provide appropriate attribution in any publication, presentation, or public description of work developed using the competition data.

Competition and platform:

Dataset sources include:

Required data citations for challenge data:

  • Attia, A. A., Liu, J., & Wilson, C. E. (2025). RealClass: A Framework for Classroom Speech Simulation with Public Datasets and Game Engines. arXiv. https://arxiv.org/abs/2510.01462
  • Ayala, S. A., Eads, A., Kabakoff, H., Swartz, M., Shiller, D. M., Hill, J., Hitchcock, E. R., Preston, J. L., & McAllister, T. (2023). Auditory and Somatosensory Development for Speech in Later Childhood. Journal of Speech, Language, and Hearing Research, 66(4), 1252–1273. https://doi.org/10.1044/2023_JSLHR-22-00331
  • balaji1312. (2024). balaji1312/Jibo_Kids: Init (v0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13964792
  • Benway, N. R., Hitchcock, E. R., McAllister, T., Feeny, G. T., Hill, J., & Preston, J. L. (2021). Comparing biofeedback types for children with residual /ɹ/ errors in American English: A single-case randomization design. American Journal of Speech-Language Pathology, 30(4), 1819–1845. https://doi.org/10.1044/2021_AJSLP-20-00214
  • Benway, N. R., Preston, J. L., Hitchcock, E. R., Salekin, A., Sharma, H., & McAllister, T. (2022). PERCEPT-R: An Open-Access American English Child/Clinical Speech Corpus Specialized for the Audio Classification of /ɹ/. INTERSPEECH 2022: Proceedings of the 23rd Annual Conference of the International Speech Communication Association (ISCA), Incheon, Republic of Korea.
  • Bunton, K., & Story, B. H. (2016). Arizona Child Acoustic Database Repository. Folia Phoniatrica et Logopaedica, 68(3), 107–111. https://doi.org/10.1159/000452128
  • Cameron, C. E., Starke, K., Lewis-Jones, T., Baker, M., Didrichsen, S., & Mead, C. (2023). Technical codebook for Project Equity: A study to capture, appreciate, and understand young children's language diversity. Unpublished document.
  • Campbell, H., Harel, D., Hitchcock, E., & McAllister Byun, T. (2018). Selecting an acoustic correlate for automated measurement of American English rhotic production in children. International Journal of Speech-Language Pathology, 20(6), 635–643. https://doi.org/10.1080/17549507.2017.1359334
  • Ellis Weismer, S., Venker, C., Evans, J. L., & Moyle, M. (2013). Fast mapping in late-talking toddlers. Applied Psycholinguistics, 34, 69–89.
  • Guo, L. Y., & Schneider, P. (2016). Differentiating school-aged children with and without language impairment using tense and grammaticality measures from a narrative task. Journal of Speech, Language, and Hearing Research, 59(2), 317–329. https://doi.org/10.1044/2015_JSLHR-L-14-0320
  • Heilmann, J., Ellis Weismer, S., Evans, J., & Hollar, C. (2005). Utility of the MacArthur Communicative Development Inventory in identifying children's language level. American Journal of Speech-Language Pathology, 14, 40–51.
  • Hitchcock, E. R., & McAllister Byun, T. (2015). Enhancing generalisation in biofeedback intervention using the challenge point framework: A case study. Clinical Linguistics & Phonetics, 29(1), 59–75. https://doi.org/10.3109/02699206.2014.956232
  • Hitchcock, E. R., McAllister Byun, T., Swartz, M., & Lazarus, R. (2017). Efficacy of electropalatography for treating misarticulation of /r/. American Journal of Speech-Language Pathology, 26(4), 1141–1158. https://doi.org/10.1044/2017_AJSLP-16-0113
  • McAllister, T., Eads, A., Kabakoff, H., Scott, M., Boyce, S., Whalen, D. H., & Preston, J. L. (2022). Baseline Stimulability Predicts Patterns of Response to Traditional and Ultrasound Biofeedback Treatment for Residual Speech Sound Disorder. Journal of Speech, Language, and Hearing Research, 65(10), 1–21. https://doi.org/10.1044/2022_JSLHR-22-00085
  • McAllister, T., Hitchcock, E. R., & Ortiz, J. A. (2021). Computer-assisted challenge point intervention for residual speech errors. Perspectives of the ASHA Special Interest Groups, 6(1), 214–229. https://doi.org/10.1044/2020_PERSP-20-00161
  • McAllister, T., Preston, J. L., Hitchcock, E. R., & Hill, J. (2020). Protocol for correcting residual errors with spectral, ultrasound, traditional speech therapy randomized controlled trial (C-RESULTS RCT). BMC Pediatrics, 20(1), 1–14. https://doi.org/10.1186/s12887-020-1951-3
  • McAllister Byun, T. (2017). Efficacy of visual–acoustic biofeedback intervention for residual rhotic errors: A single-subject randomization study. Journal of Speech, Language, and Hearing Research, 60(5), 1175–1193. https://doi.org/10.1044/2016_JSLHR-S-16-0038
  • McAllister Byun, T., & Campbell, H. (2016). Differential effects of visual-acoustic biofeedback intervention for residual speech errors. Frontiers in Human Neuroscience, 10, 567. https://doi.org/10.3389/fnhum.2016.00567
  • McAllister Byun, T., & Hitchcock, E. R. (2012). Investigating the use of traditional and spectral biofeedback approaches to intervention for /r/ misarticulation. American Journal of Speech-Language Pathology, 21(3), 207–221. https://doi.org/10.1044/1058-0360(2012/11-0010)
  • McAllister Byun, T., Hitchcock, E. R., & Ferron, J. (2017). Masked visual analysis: Minimizing type I error in visually guided single-case design for communication disorders. Journal of Speech, Language, and Hearing Research, 60(6), 1455–1466. https://doi.org/10.1044/2017_JSLHR-S-16-0012
  • McAllister Byun, T., Hitchcock, E. R., & Swartz, M. T. (2014). Retroflex versus bunched in treatment for rhotic misarticulation: Evidence from ultrasound biofeedback intervention. Journal of Speech, Language, and Hearing Research, 57(6), 2116–2130. https://doi.org/10.1044/2014_JSLHR-S-13-0210
  • McAllister Byun, T., Swartz, M. T., Halpin, P. F., Szeredi, D., & Maas, E. (2016). Direction of attentional focus in biofeedback treatment for /r/ misarticulation. International Journal of Language & Communication Disorders, 51(4), 384–401. https://doi.org/10.1111/1460-6984.12215
  • Moyle, M. J., Ellis Weismer, S., Lindstrom, M., & Evans, J. (2007). Longitudinal relationships between lexical and grammatical development in typical and late talking children. Journal of Speech, Language, and Hearing Research, 50, 508–528.
  • Paradis, J., Schneider, P., & Duncan, T. S. (2013). Discriminating children with language impairment among English-language learners from diverse first-language backgrounds. Journal of Speech, Language, and Hearing Research, 56(3), 971–981. https://doi.org/10.1044/1092-4388(2012/12-0050)
  • Peterson, L., Savarese, C., Campbell, T., Ma, Z., Simpson, K. O., & McAllister, T. (2022). Telepractice treatment of residual rhotic errors using app-based biofeedback: A pilot study. Language, Speech, and Hearing Services in Schools, 53(2), 256–274. https://doi.org/10.1044/2021_LSHSS-21-00057
  • Preston, J. L., Brick, N., & Landi, N. (2013). Ultrasound biofeedback treatment for persisting childhood Apraxia of speech. American Journal of Speech-Language Pathology, 22(4), 627–644. https://doi.org/10.1044/1058-0360(2013/12-0139)
  • Preston, J. L., Caballero, N. F., Leece, M. C., Wang, D., Herbst, B. M., & Benway, N. R. (2023). A randomized controlled trial of treatment distribution and biofeedback effects on speech production in school-age children with apraxia of speech. Journal of Speech, Language, and Hearing Research, 66(8), 1–23. https://doi.org/10.1044/2023_JSLHR-22-00622
  • Preston, J. L., & Edwards, M. L. (2007). Phonological processing skills of adolescents with residual speech sound errors. Language, Speech, and Hearing Services in Schools, 38(4), 297–308.
  • Preston, J. L., Hitchcock, E. R., & Leece, M. C. (2020). Auditory perception and ultrasound biofeedback treatment outcomes for children with residual /ɹ/ distortions: A randomized controlled trial. Journal of Speech, Language, and Hearing Research, 63(2), 444–455. https://doi.org/10.1044/2019_JSLHR-19-00078
  • Preston, J. L., Hull, M., & Edwards, M. L. (2013). Preschool speech error patterns predict articulation and phonological awareness outcomes in children with histories of speech sound disorders. American Journal of Speech-Language Pathology, 22(2), 173–184. https://doi.org/10.1044/1058-0360(2012/12-0043)
  • Preston, J. L., & Leece, M. C. (2017). Intensive Treatment for Persisting Rhotic Distortions: A Case Series. American Journal of Speech-Language Pathology, 26(4), 1066–1079. https://doi.org/10.1044/2017_AJSLP-16-0211
  • Preston, J. L., Leece, M. C., & Maas, E. (2016). Intensive treatment with ultrasound visual feedback for speech sound errors in childhood apraxia. Frontiers in Human Neuroscience, 10, 440. https://doi.org/10.3389/fnhum.2016.00440
  • Preston, J. L., Leece, M. C., McNamara, K., & Maas, E. (2017). Variable practice to enhance speech learning in ultrasound biofeedback treatment for childhood apraxia of speech: A single case experimental study. American Journal of Speech-Language Pathology, 26(3), 840–852. https://doi.org/10.1044/2017_AJSLP-16-0155
  • Schneider, P., & Hayward, D. (2010). Who does what to whom: Introduction of referents in children's storytelling from pictures. Language, Speech, and Hearing Services in Schools, 41, 459–473.
  • Schneider, P., Hayward, D., & Dubé, R. V. (2006). Storytelling from pictures using the Edmonton Narrative Norms Instrument. Journal of Speech-Language Pathology and Audiology, 30, 224–238.
  • Schneider, P., Rivard, R., & Debreuil, B. (2011). Does colour affect the quality or quantity of children's stories elicited by pictures? Child Language Teaching and Therapy, 27, 371–378.
  • Shankar, N. B., Afshan, A., Johnson, A., Mahapatra, A., Martin, A., Ni, H., Park, H. W., Perez, M. Q., Yeung, G., Bailey, A., Breazeal, C., & Alwan, A. (2024). The JIBO Kids Corpus: A speech dataset of child-robot interactions in a classroom environment. JASA Express Letters, 4(11), 115201. https://doi.org/10.1121/10.0034195
  • Sjolie, G. M., Leece, M. C., & Preston, J. L. (2016). Acquisition, retention, and generalization of rhotics with and without ultrasound visual feedback. Journal of Communication Disorders, 64, 62–77. https://doi.org/10.1016/j.jcomdis.2016.10.002
  • Speights, M., Shrivastava, V., Yedla, B., Srinivasan, S., Vroegh, E., Xu, S., Lidov, E., & Herholz, P. (2025). Speech Production Repository for Optimizing Use of FAIR AI Training - Research [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15723648
  • Wagner, L., Alghowinem, S., Alwan, A., Bowdrie, K., Breazeal, C., Clopper, C. G., Fosler-Lussier, E., Jamsek, I. A., Lander, D., Ramnath, R., & Ross, J. (2025). The Ohio Child Speech Corpus. Speech Communication, 170, 103206. https://doi.org/10.1016/j.specom.2025.103206