Goodnight Moon, Hello Early Literacy Screening

Help educators provide effective early literacy intervention by scoring audio clips from literacy screeners. #education

$30,000 in prizes
Completed Jan 2025
399 joined

While most speech models focus on adults, this challenge focused on evaluating the speech of young children at an age at which there is significant variation across individuals. By advancing automated scoring for literacy screeners, these innovations have the potential to transform classrooms and improve outcomes for young learners.

— Dr. Satrajit Ghosh, Director, Senseable Intelligence Group

Why

Literacy—the ability to read, write, and comprehend language—is a fundamental skill that underlies personal development, academic success, career opportunities, and active participation in society. Many children in the United States need more support with their language skills. A national test of literacy in 2022 estimated that 37% of U.S. fourth graders lack basic reading skills.

To provide effective early literacy intervention, teachers must be able to reliably identify the students who need support. Currently, teachers across the U.S. are tasked with administering and scoring verbal literacy screeners. Manual scoring not only takes time but can also be unreliable, producing different results depending on who scores the test and how thoroughly that person was trained.

The Solution

Machine learning models that score literacy assessments can help teachers quickly and reliably identify children in need of early literacy intervention. The goal of this challenge was to develop a model that scores audio recordings from literacy screener exercises completed by students in kindergarten through 3rd grade.

In this challenge, participants trained models on anonymized audio clips; the student voices in the audio files were de-identified using voice cloning technology. At the end of the competition, the top 3 teams on the private leaderboard packaged up their training and inference code, and their models were retrained and evaluated on the raw, non-anonymized version of the competition data.
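A common recipe for this kind of audio scoring task is to fine-tune a pretrained speech encoder with a small classification head that predicts whether a student's response should be marked correct. The sketch below is only an illustration of that general approach, not a description of any winning solution; the checkpoint name and helper function are assumptions.

```python
# Illustrative sketch only -- not any team's actual solution.
# Fine-tune a pretrained speech encoder (wav2vec 2.0 here) with a binary
# classification head that predicts whether a student's response is correct.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

MODEL_NAME = "facebook/wav2vec2-base"  # assumed checkpoint, chosen for illustration
extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def score_clip(waveform: torch.Tensor, sample_rate: int = 16_000) -> float:
    """Return the predicted probability that the recorded response is correct."""
    inputs = extractor(waveform.numpy(), sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
```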

The Results

Challenge participants generated 315 submissions, and the winning solutions achieved an impressive AUROC of 0.97 on the anonymized test set. This was a substantial improvement over the transcription benchmark, which used Whisper (a general-purpose speech recognition model from OpenAI) and achieved an AUROC of 0.59.
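For context, a transcription baseline of this kind can be sketched as follows: transcribe each clip with Whisper, then score it by how closely the transcript matches the expected response. This is an assumption-laden illustration, not the official benchmark code; the file name and target word in the usage example are hypothetical.

```python
# Hedged sketch of a Whisper transcription baseline (not the official benchmark code).
from difflib import SequenceMatcher

import whisper  # pip install openai-whisper

model = whisper.load_model("base")  # general-purpose checkpoint

def transcription_score(audio_path: str, expected_text: str) -> float:
    """Score a clip by how closely its Whisper transcript matches the expected response."""
    transcript = model.transcribe(audio_path, language="en")["text"].strip().lower()
    return SequenceMatcher(None, transcript, expected_text.lower()).ratio()

# Example usage with a hypothetical clip and target word:
# print(transcription_score("clip_0001.wav", "moon"))
```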

The retrained raw models were even more successful, achieving roughly a 30% improvement in log loss over their anonymized counterparts. This shows that methods developed on anonymized data generalized well to the raw data, and that machine learning can accurately score child literacy screening exercises.
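Both metrics cited here, AUROC and log loss, are standard classification metrics. A minimal illustration using scikit-learn and made-up numbers:

```python
# Toy illustration of the two metrics cited above; the numbers are made up.
from sklearn.metrics import log_loss, roc_auc_score

y_true = [1, 0, 1, 1, 0, 1]                     # ground-truth item scores
y_prob = [0.92, 0.10, 0.85, 0.70, 0.35, 0.60]   # model-predicted probabilities

print("AUROC:   ", roc_auc_score(y_true, y_prob))   # higher is better (max 1.0)
print("Log loss:", log_loss(y_true, y_prob))        # lower is better
```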

[Figure: bar chart comparing raw and anonymized scores for the winning models]

See the results announcement for more information on the winning submissions and the participants who developed them. All of the prize-winning submissions and write-ups from this competition are linked below and available for anyone to use and learn from.


RESULTS ANNOUNCEMENT + MEET THE WINNERS

WINNING MODELS ON GITHUB