Goodnight Moon, Hello Early Literacy Screening

Help educators provide effective early literacy intervention by scoring audio clips from literacy screeners. #education

$30,000 in prizes
Completed Jan 2025
399 joined

Problem description

The data for this challenge include thousands of audio clips from literacy skill assessments of students in kindergarten through 3rd grade. You will develop an automatic scoring model for the literacy exercises. Your task is to estimate the likelihood that the student's response in each audio clip is correct.

Training set


The training data for this challenge includes audio files of students performing literacy exercises along with associated metadata.

Audio files

You are provided with over 38,000 anonymized audio recordings (in the form of .wav files) of students performing literacy screening exercises. These audio clips were collected in the development and testing of Reach Every Reader's literacy screener. Students range from kindergarten through 3rd grade. The student voices in the audio files were anonymized using voice cloning technology, where the student voices were replaced with 50 different target voices per grade level, resulting in de-identified audio files. For more details on the dataset curation, please see the About page.

The training audio files are contained in three separate tar archives on the data download page. You'll need to download all three to get the full training dataset. Each tar archive is approximately 2 GB. Once you've downloaded an archive, you can extract it by double-clicking the file in your file browser or by using tar at the command line. For example, the commands below create a new directory called train_audio and extract the audio files from the first archive into it.

mkdir train_audio
tar xzvf data/final/public/train_audio_1.tar.gz -C train_audio/
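
If you'd rather do the extraction from Python, the standard library's tarfile module can loop over all three archives. This is a minimal sketch that assumes the second and third archives follow the same train_audio_<n>.tar.gz naming pattern as the first:

import tarfile
from pathlib import Path

# Assumption: all three archives share the train_audio_<n>.tar.gz naming pattern
archives = [Path("data/final/public") / f"train_audio_{i}.tar.gz" for i in (1, 2, 3)]

out_dir = Path("train_audio")
out_dir.mkdir(exist_ok=True)

for archive in archives:
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=out_dir)  # unpack the .wav files into train_audio/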

Metadata

Alongside the audio clips, you are provided with a metadata file containing details on the literacy exercise and student's grade. Each exercise corresponds to one row in the metadata. The filename column in the metadata is the unique identifier that connects the audio files to the metadata.

train_metadata.csv contains the following variables:

  • filename (str) - unique identifier for each audio file
  • task (str) - the literacy task of the exercise (deletion, nonword_repetition, blending, or sentence_repetition)
  • expected_text (str) - the target word or phrase of the exercise
  • grade (str) - the grade of the student (KG, 1, 2, 3)

There are four types of literacy exercises (task) in the challenge data. These include:

  • Deletion: Delete an identified portion of a word and say the target word after the omission, e.g., "birdhouse without bird"
  • Blending: Combine portions of a word and say the target word out loud, e.g., "foot - ball"
  • Nonword repetition: Repeat nonwords of increasing complexity, e.g., "guhchahdurnam"
  • Sentence repetition: Repeat sentences of varying length and complexities without changing the sentence’s structure or meaning, e.g., "The girls are playing in the park"

Note: there are many possible exercises within each task. The literacy screener uses computer adaptive testing, meaning that not all students in a given grade will see the same exercises.

Train metadata example


For example, audio file hgxrel.wav has the following information in train_metadata.csv:
filename       hgxrel.wav
task           deletion
expected_text  old
grade          KG
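
A quick pandas sketch for loading the metadata and pulling up that same row (assuming train_metadata.csv sits in your working directory):

import pandas as pd

train_metadata = pd.read_csv("train_metadata.csv")

# Look up a single exercise by its filename
print(train_metadata[train_metadata["filename"] == "hgxrel.wav"])

# How the exercises break down by task and by grade
print(train_metadata["task"].value_counts())
print(train_metadata["grade"].value_counts())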

Labels

The labels in this challenge are scores manually assigned by trained evaluators, indicating whether the student's response was correct or incorrect. A response was labeled correct when the student's spoken response matched the expected_text for the exercise.

train_labels.csv contains the following variables:

  • filename (str) - unique identifier for each audio file
  • score (int) - whether the audio response was correct or incorrect, where 1 means correct and 0 means incorrect

Label example


For example, the first five rows in train_labels.csv have these values:
filename    score
hgxrel.wav  0.0
ltbona.wav  0.0
bfaiol.wav  1.0
ktvyww.wav  1.0
htfbnp.wav  1.0
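
Since filename is the shared unique identifier, the labels can be joined to the metadata with a one-to-one merge. A sketch (again assuming both CSVs are in your working directory):

import pandas as pd

train_metadata = pd.read_csv("train_metadata.csv")
train_labels = pd.read_csv("train_labels.csv")

# filename uniquely identifies each clip, so this is a one-to-one join
train = train_metadata.merge(train_labels, on="filename", validate="one_to_one")

# Fraction of responses scored correct, overall and per task
print(train["score"].mean())
print(train.groupby("task")["score"].mean())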

Test set


The test set audio files and metadata are only accessible in the runtime container and are mounted at /code_execution/data; see the sketch after the variable list below for reading them in code.

test_metadata.csv contains the following variables:

  • filename (str) - unique identifier for each audio file
  • task (str) - the literacy task of the exercise (deletion, nonword_repetition, blending, or sentence_repetition)
  • expected_text (str) - the target word or phrase of the exercise
  • grade (str) - the grade of the student (KG, 1, 2, 3)
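
Inside the container, your code can read the test metadata directly from the mount. The exact layout of the audio files under /code_execution/data is an assumption in this sketch; consult the runtime repository for the authoritative structure:

from pathlib import Path

import pandas as pd

DATA_DIR = Path("/code_execution/data")

test_metadata = pd.read_csv(DATA_DIR / "test_metadata.csv")

# Assumption: each filename refers to a .wav file directly under DATA_DIR;
# check the runtime repository for the exact audio directory layout
audio_paths = [DATA_DIR / name for name in test_metadata["filename"]]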

External data and models

Use of external data and models is allowed in this competition provided they are freely and publicly available to all participants under a permissive open source license.

However, participants may not upload competition data to any third-party services or APIs that retain the data. For example, participants cannot submit data to ChatGPT or Gemini. Participants can use external models by loading open-source model weights into an environment from which they can wipe the data afterwards, such as their local machine or a cloud compute environment.
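
For instance, one permitted pattern is to download open-source speech model weights and run inference entirely within an environment you control. The sketch below uses Wav2Vec2 via the transformers library purely as an illustration (not an endorsement); verify that any model you choose is freely and publicly available under a permissive open source license:

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Weights are downloaded once and run locally; no competition audio is sent
# to a hosted API
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()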

Performance metric


Performance is evaluated according to log loss. Log loss (a.k.a. logistic loss or cross-entropy loss) penalizes confident but incorrect predictions. It also rewards confidence scores that are well-calibrated probabilities, meaning that they accurately reflect the long-run probability of being correct. This is an error metric, so a lower value is better.

Log loss for a single observation is calculated as follows:

$$L_{\log}(y, p) = -(y \log (p) + (1 - y) \log (1 - p))$$

where $y$ is a binary variable indicating whether the literacy exercise is correct and $p$ is the predicted probability that the literacy exercise is correct. The loss for the entire dataset is the average loss across all observations.

Note: log loss can often be improved with calibration. A well-calibrated model outputs predictions that are directly interpretable as probabilities.
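
To see how the metric behaves, here is a small scikit-learn sketch with made-up numbers; note how a single confident mistake dominates the average, and how clipping predictions away from 0 and 1 bounds the worst-case penalty:

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 0, 1, 1, 1])

# Reasonably calibrated predictions
modest = np.array([0.2, 0.3, 0.7, 0.8, 0.7])
print(log_loss(y_true, modest))      # roughly 0.30

# One confident mistake (p=0.99 for a true 0) blows up the average loss
confident = np.array([0.99, 0.3, 0.7, 0.8, 0.7])
print(log_loss(y_true, confident))

# Clipping extreme probabilities limits the worst-case penalty
print(log_loss(y_true, np.clip(confident, 0.02, 0.98)))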

ROC AUC will be displayed as a secondary metric on the leaderboard, but only log loss will be used to rank submissions.

Remember that final rankings will be determined by model performance on both the anonymized and raw test sets. See the Prizes section for more details.

Submission format


This is a code execution challenge! Rather than submitting your predicted labels, package everything needed to do inference and submit that for containerized execution.

Your code submission must contain a main.py script that reads in the test set audio files, generates the likelihood that the response in each audio file is correct, and outputs a single submission.csv with one row per audio file, where predictions are floating point numbers representing probabilities.

See details on the submission format and process here.
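
As a concrete starting point, a bare-bones main.py might look something like the sketch below. The submission column names ("filename" and "score") and the constant placeholder prediction are assumptions made for illustration; confirm the required format against the submission documentation linked above before submitting:

"""Bare-bones main.py sketch: predicts a constant probability for every clip."""
from pathlib import Path

import pandas as pd

DATA_DIR = Path("/code_execution/data")


def main():
    test_metadata = pd.read_csv(DATA_DIR / "test_metadata.csv")

    # Placeholder: replace this with real inference over the .wav files
    # referenced by test_metadata["filename"]
    submission = pd.DataFrame(
        {
            "filename": test_metadata["filename"],  # assumed column name
            "score": 0.5,                           # assumed column name
        }
    )

    submission.to_csv("submission.csv", index=False)


if __name__ == "__main__":
    main()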

Good luck!


If you're wondering how to get started, check out the benchmark blog post!

Good luck and enjoy this problem! If you have any questions, you can always visit the user forum!