On Top of Pasketti: Children’s Speech Recognition Challenge - Phonetic Track

Develop automatic speech recognition models that produce phone-level transcriptions of children’s speech in IPA. #education

$50,000 in prizes

About the Project

The On Top of Pasketti: Children's Speech Recognition Challenge is part of a broader initiative at DrivenData to develop high-quality data and models that make a meaningful impact in education, through both data science competitions and direct engagements with partner organizations.

This challenge was designed to advance Automatic Speech Recognition (ASR) that works well for children across early and elementary learning stages, and can be openly shared for broad public benefit.

The live competition represents one stage in a broader, multi-phase effort to improve speech recognition technologies for children and support the development of practical, real-world tools. Building toward this challenge involved several key phases:

Landscape analysis and benchmarking. The project team conducted a review of existing child speech datasets and models, benchmarking current performance and identifying critical gaps in publicly available resources. These gaps included limited representation of dialects commonly found in U.S. classrooms, a lack of naturalistic classroom recordings, and limited availability of phonetic labels needed for diagnostic and assessment-focused applications.

User and expert engagement. The team engaged a group of advisors representing potential end users and subject matter experts, including researchers, clinicians, and education technology practitioners. These conversations helped clarify which gaps were most urgent, which use cases were most impactful in practice, and which populations were most likely to be underserved by existing tools.

Dataset curation and annotation. Guided by this input, the team curated and harmonized audio recordings of children spanning early and elementary learning stages, including children as young as pre-K, from public, research, and enterprise sources. Dedicated transcription teams produced high-quality word-level and phonetic annotations across hundreds of hours of audio, enabling both educational and diagnostic use cases.

Challenge design. Finally, the competition was designed to leverage these new datasets and engage the broader machine learning community. Two complementary tracks were created to support prioritized use cases with distinct technical requirements: word-level speech recognition and phonetic-level speech recognition.

The table below illustrates how word-level and phonetic-level ASR support different learning and assessment activities.

Learning Activity              Phonetic ASR  Word ASR
-----------------------------  ------------  --------
Speech Screening               Required      Helpful
Dyslexia Screening             Required      Helpful
Reading Fluency                Helpful       Required
Language Use Assessment        Not Needed    Required
Comprehension                  Not Needed    Required
Problem Solving                Not Needed    Required
Reasoning                      Not Needed    Required
Verbal Interaction with Tools  Not Needed    Required

Refinement, analysis, and sharing. Following the challenge, the insights and models from the winning approaches will be refined and shared to advance ASR tools for children's speech. The project team will adapt competition outputs for real-world applications, conducting additional training, evaluation, and validation using data sources that cannot be shared publicly.

The resulting models will be packaged into an open-source software library designed to make child-focused ASR models easier to use and integrate into downstream applications. In addition, the annotations created for this competition will be released publicly to support researchers, educators, clinicians, and technologists working to build more accessible and inclusive tools for children.
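
For participants getting oriented on the phonetic track, the sketch below shows one possible way to produce phone-level transcriptions with an off-the-shelf self-supervised model. It is a minimal illustration under stated assumptions, not an official baseline: it assumes the Hugging Face transformers and torchaudio packages and the public facebook/wav2vec2-lv-60-espeak-cv-ft checkpoint, whose outputs are eSpeak-style phone labels that are close to, but not identical to, IPA.

    # Minimal phone-level transcription sketch (illustrative only, not an
    # official baseline). Assumes: pip install transformers torchaudio
    import torch
    import torchaudio
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

    # Public checkpoint fine-tuned for phoneme recognition; emits
    # eSpeak-style phone labels that approximate IPA.
    MODEL_ID = "facebook/wav2vec2-lv-60-espeak-cv-ft"

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    model.eval()

    def transcribe_phones(wav_path: str) -> str:
        """Return a space-separated phone string for one audio file."""
        waveform, sample_rate = torchaudio.load(wav_path)
        # The model expects 16 kHz mono input.
        if waveform.shape[0] > 1:
            waveform = waveform.mean(dim=0, keepdim=True)
        if sample_rate != 16_000:
            waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
        inputs = processor(
            waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt"
        )
        with torch.no_grad():
            logits = model(inputs.input_values).logits
        # CTC decoding: take the argmax path, collapse repeats, drop blanks.
        pred_ids = torch.argmax(logits, dim=-1)
        return processor.batch_decode(pred_ids)[0]

    # Example with a hypothetical file path:
    # print(transcribe_phones("child_utterance.wav"))

Because pretrained models of this kind are trained largely on adult speech, the age and speaking-style mismatch discussed in the resources below means that fine-tuning on children's audio will likely be needed for competitive results.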

Additional Resources

The resources below provide a curated starting point for participants interested in prior work on automatic speech recognition for children. They are not required reading, but may be helpful for understanding common challenges, baseline approaches, and open research questions in this domain.

Foundational and Benchmarking Papers

These papers provide useful context on the performance of modern ASR models on children's speech and common sources of error:

  • Fan, R., Shankar, N. B., & Alwan, A. (2024). Benchmarking children's ASR with supervised and self-supervised speech foundation models. arXiv.
    https://arxiv.org/abs/2406.10507

  • Graave, T., Li, Z., Lohrenz, T., & Fingscheidt, T. (2024). Mixed children/adult/childrenized fine-tuning for children's ASR: How to reduce age mismatch and speaking style mismatch. In Proceedings of Interspeech 2024 (pp. 5188–5192).
    https://doi.org/10.21437/Interspeech.2024-499

  • Jain, R., Barcovschi, A., Yiwere, M. Y., Corcoran, P., & Cucu, H. (2024). Exploring native and non-native English child speech recognition with Whisper. IEEE Access, 12, 41601–41610.
    https://doi.org/10.1109/ACCESS.2024.3378738

  • Potamianos, A., & Narayanan, S. (2003). Robust recognition of children's speech. IEEE Transactions on Speech and Audio Processing, 11(6), 603–616.
    https://doi.org/10.1109/TSA.2003.818026

Phonetic and Diagnostic Applications

This paper is particularly relevant for participants interested in phonetic-level modeling and diagnostic or assessment-oriented use cases:

  • Block Medin, L., Pellegrini, T., & Gelin, L. (2024). Self-supervised models for phoneme recognition: Applications in children's speech for reading learning. In Proceedings of Interspeech 2024.
    https://arxiv.org/abs/2503.04710

Practitioner-Oriented Background Reading

These resources provide high-level synthesis and practical guidance for working with children's speech in ASR systems:

  • For general guidance on working with speech data in machine learning, see our blog post on open-source packages for using speech data in ML.
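
As a minimal illustration of that workflow, the sketch below loads a recording and resamples it to the 16 kHz mono format most ASR models expect. It assumes the librosa package, and the file path is hypothetical.

    # Load and resample an audio file for ASR (illustrative sketch).
    # Assumes: pip install librosa soundfile
    import librosa

    # Hypothetical path to a recording; librosa resamples to 16 kHz mono.
    audio, sr = librosa.load("recordings/sample.wav", sr=16_000, mono=True)

    duration_seconds = len(audio) / sr
    print(f"Loaded {duration_seconds:.2f}s of audio at {sr} Hz")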

Acknowledgements

We are grateful to the many individuals, organizations, and teams who contributed expertise, data, and guidance throughout the development of this project, including:

Academic, data, and research collaborators

  • Dr. Ahmed Attia and Dr. Jing Liu, University of Maryland
  • Dr. Kate Bunton, University of Arizona
  • Dr. Satrajit Ghosh, Massachusetts Institute of Technology
  • Dr. Brian MacWhinney and collaborators at TalkBank
  • Dr. Yaacov Petscher, Florida State University
  • Dr. Sameer Pradhan and Boulder Learning
  • Dr. Marisha Speights, Northwestern University
  • Linguistic Data Consortium
  • TeachFX research team

Annotation and transcription partners

  • Dr. Charlotte Moore and the DrivenData Transcription Team

ASR for Assessment User Advisory Group, including representatives from:

  • Magpie
  • PowerMyLearning
  • Khan Kids
  • TeachFX
  • Soapbox Labs
  • Lit
  • Waterford
  • Harvard Graduate School of Education

Funding support

We are grateful to the Gates Foundation, whose support made this project possible, as well as for additional funding support from:

Data Attribution

Participants must provide appropriate attribution in any publication, presentation, or public description of work developed using the competition data. A comprehensive list of required citations will be provided after the competition submission deadline.

Competition and platform:

Data sources:

  • TalkBank
  • ReadNet
  • balaji1312. (2024). balaji1312/Jibo_Kids: Init (v0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13964792
  • Bunton, K., & Story, B. H. (2016). Arizona Child Acoustic Database Repository. Folia Phoniatrica et Logopaedica, 68(3), 107–111. https://doi.org/10.1159/000452128. This work was funded by NSF BCS-1145011 awarded to Kate Bunton, Ph.D., and Brad Story, Ph.D., Principal Investigators.

Additional citations will be added here after the competition deadline.