On Top of Pasketti: Children’s Speech Recognition Challenge - Phonetic Track

Develop automatic speech recognition models that produce phone-level transcriptions of children’s speech in IPA.

$50,000 in prizes

Problem Description

In the Phonetic Track of the On Top of Pasketti: Children's Speech Recognition Challenge, your goal is to build a model that predicts the speech sounds, or phones, spoken by children in short audio clips.

Unlike word-level ASR, which focuses on intended lexical content, phonetic ASR captures how speech is produced. This distinction is especially important for children, whose speech varies widely due to development, dialect, and speech sound disorders. Accurate phone-level transcriptions enable diagnostic and educational applications such as speech pathology screening, literacy assessment, and early intervention. This track brings together well-labeled, representative children's speech data to support models that generalize across ages, dialects, recording environments, and speech styles.

Dataset

Data for this challenge are assembled from over a dozen data sources collected under different protocols, with manual corrections and annotations added by our team. These sources have been harmonized to a common schema and released as two distinct training corpora that share the same structure but contain different data.

One corpus is hosted on the DrivenData platform; the second is provided by TalkBank. Detailed instructions for accessing both corpora, including how to submit the required data access request form for TalkBank, are available on the Data Download page.

In addition, participants are welcome to incorporate external datasets for training; however, use of external data is subject to important exceptions and caveats (see External Data and Models).

Participants are not allowed to share the competition data or use it for any purpose other than this competition. Participants cannot send the data to any third-party service or API, including but not limited to OpenAI's ChatGPT, Google's Gemini, or similar tools. For complete details, review the competition rules.

Audio

The data in this challenge are .flac audio clips of utterances from recordings of children as young as age 3 participating in a variety of controlled and semi-controlled speech tasks. These tasks include read or directed speech, prompted elicitation, and spontaneous conversational speech. The training audio for the Word Track and the Phonetic Track overlaps substantially, though each track includes audio not found in the other. Training data, including audio and transcripts, can be used across tracks.

The audio clips have been scrubbed of any personally identifying information and of any adult speech that may have been present in the original recordings.

The challenge dataset encompasses a broad range of U.S. regional dialects, accents, and socio-linguistic backgrounds, and includes both typically developing children and children presenting with speech sound disorders or other speech pathologies. Models are not expected to generalize to non-U.S. varieties of English.

The DrivenData corpus is made available on the Data Download page in smaller .zip files, each containing a random subset of audio clips. The TalkBank corpus is a single zipped archive with audio files for both the Word and Phonetic tracks.

Labels

The ground truth labels for the Phonetic Track are normalized phonetic transcriptions of individual utterances using the International Phonetic Alphabet (IPA), with a one-to-one mapping between Unicode characters and phones. Each transcription captures the full sequence of speech sounds in the corresponding audio clip and may include substitutions, omissions, or non-standard productions that are typically ignored in word-level ASR.

All phonetic labels are restricted to the predefined IPA character set used during phonetic transcription. This set is provided in the scoring script in the runtime repository for local validation of predictions.

Environmental noises and non-lexical sounds produced by the child (e.g., siren-like play sounds or sneezes) are not labeled. While efforts have been made to apply consistent normalization, the data were collected from multiple sources with differing annotation protocols, and some variation in labeling should be expected.

Metadata and labels for this track are provided as a UTF-8 encoded JSONL manifest. Each line corresponds to a single utterance and references exactly one associated audio file. Phonetic transcriptions use Unicode IPA symbols.

For each of the two corpora, the file train_phon_transcripts.jsonl contains the following fields:

  • utterance_id (str) - unique identifier for each utterance
  • child_id (str) - unique, anonymized identifier for the speaker
  • session_id (str) - unique identifier for the recording session; a single child_id may be associated with multiple session_ids
  • audio_path (str) - path to the corresponding .flac audio file relative to the /audio directory, following the pattern audio/{utterance_id}.flac
  • audio_duration_sec (float) - duration of the audio clip in seconds
  • age_bucket (str) - age range of the child at the time of recording ("3-4", "5-7", "8-11", "12+", or "unknown")
  • md5_hash (str) - MD5 checksum of the audio file, used for integrity verification
  • filesize_bytes (int) - size of the audio file in bytes
  • phonetic_text (str) - phonetic transcription of the utterance using the International Phonetic Alphabet (IPA)
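For illustration, the snippet below shows one way to read the manifest and spot-check a clip against its recorded checksum and size. It is a minimal sketch assuming a local layout with train_phon_transcripts.jsonl alongside the audio/ directory; adjust the paths for your own setup.

```python
import hashlib
import json
from pathlib import Path

DATA_ROOT = Path("data")  # assumed local layout: data/train_phon_transcripts.jsonl, data/audio/

# One JSON object per line, one utterance per object
records = []
with open(DATA_ROOT / "train_phon_transcripts.jsonl", encoding="utf-8") as f:
    for line in f:
        records.append(json.loads(line))

# Verify one clip against its md5_hash and filesize_bytes fields
rec = records[0]
audio_file = DATA_ROOT / rec["audio_path"]  # audio_path follows audio/{utterance_id}.flac
assert audio_file.stat().st_size == rec["filesize_bytes"]
assert hashlib.md5(audio_file.read_bytes()).hexdigest() == rec["md5_hash"]

print(rec["utterance_id"], rec["age_bucket"], rec["phonetic_text"])
```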

Training and Test Data

The training and test splits are drawn from multiple data sources, and some sources appear exclusively in either the training or test split. A small portion of the dataset uses synthetically anonymized children’s voices. Participants are encouraged to develop models that generalize across speakers, recording conditions, and speech types.

Although all audio files have been converted to FLAC, the clips vary in duration, and the audio format differs between splits:

  • Test data have been normalized to a 16 kHz sampling rate and a single channel (mono).
  • Training data have not been normalized and vary in sampling rate and number of channels (e.g., mono vs. stereo).
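Because the hidden test audio is always 16 kHz mono while the training audio is not, a common preprocessing step is to convert training clips to the same format before feature extraction. Below is a minimal sketch using librosa and soundfile; the choice of libraries and the file paths are assumptions, and any equivalent resampling tool works.

```python
import librosa
import soundfile as sf

def to_16k_mono(in_path: str, out_path: str, target_sr: int = 16_000) -> None:
    """Downmix a FLAC clip to mono, resample to 16 kHz, and rewrite it as FLAC."""
    # mono=True averages the channels; sr=target_sr resamples on load
    audio, _ = librosa.load(in_path, sr=target_sr, mono=True)
    sf.write(out_path, audio, target_sr, format="FLAC")

# Hypothetical usage:
# to_16k_mono("data/audio/utt_0001.flac", "data_16k/audio/utt_0001.flac")
```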

Submission Format

This is a code execution challenge! Rather than submitting your predicted labels, you will package your trained model and the prediction code and submit that for containerized execution.

Your code will generate a submission file that must be in JSONL format with one line per utterance. Each line must be a JSON object with the following fields:

  • utterance_id - the unique identifier for the utterance, matching the identifiers provided in the submission format in the runtime environment
  • phonetic_text - the predicted phonetic transcript for the utterance, using the International Phonetic Alphabet (IPA)

Be sure to only include characters defined in the scoring script's set of valid IPA characters.
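As a sketch of what the prediction code might write, the loop below serializes a dictionary of predictions to the required JSONL format, dropping any character outside the valid set. The names predictions and VALID_IPA_CHARS are placeholders; the authoritative character set and format checks live in the runtime repository's scoring script.

```python
import json

VALID_IPA_CHARS = set("pəskɛti")        # placeholder: load the real set from the scoring script
predictions = {"utt_0001": "pəskɛti"}   # placeholder: utterance_id -> predicted IPA string

with open("submission.jsonl", "w", encoding="utf-8") as f:
    for utterance_id, ipa in predictions.items():
        # Keep only characters that appear in the allowed IPA set
        cleaned = "".join(ch for ch in ipa if ch in VALID_IPA_CHARS)
        f.write(json.dumps(
            {"utterance_id": utterance_id, "phonetic_text": cleaned},
            ensure_ascii=False,
        ) + "\n")
```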

Information about the submission package, predict function, code and other details about the runtime environment can be found on the Code submission format page.

Performance Metric

Text Preprocessing

Before scoring, predictions undergo Unicode normalization to remove certain characters (such as ligatures) and retain only the phonetic symbols defined in the scoring script's set of valid IPA characters.

We provide a script to perform this normalization and compute the metric in the runtime repository. Participants are encouraged to use this script to validate their submission format and reproduce the scoring behavior locally to get a sense of their performance prior to submission.
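The script in the runtime repository is the source of truth for this normalization. As a rough illustration of the idea only, a sketch might look like the following; the normalization form and the character set passed in are assumptions, not the official behavior.

```python
import unicodedata

def normalize_ipa(text: str, valid_chars: set[str]) -> str:
    """Decompose compatibility characters (e.g., ligatures), then keep only valid IPA symbols."""
    # NFKD splits ligatures and other compatibility forms into base characters;
    # the official script may use a different normalization form.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ch in valid_chars)
```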

Metric Calculation

Performance is evaluated using IPA Character Error Rate (CER). The IPA Character Error Rate is equivalent to Phone Error Rate (PER) since the submissions are normalized to have a one-to-one character-to-phone mapping.

The metric computes the minimum number of substitutions ($S$), deletions ($D$), and insertions ($I$) required to transform the predicted character sequence into the reference sequence, divided by the total number of reference characters ($N$):

$$CER = \frac{S + D + I}{N}$$

Since this is an error metric, a lower value is better. Be sure to only include characters defined in the scoring script's set of valid IPA characters.
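For local sanity checks, a plain dynamic-programming implementation of CER is shown below. It is a reference sketch, not the official scoring script; to reproduce leaderboard behavior exactly, including the normalization step above, use the script provided in the runtime repository.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: minimum (S + D + I) edits divided by the reference length."""
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[n][m] / n  # assumes a non-empty reference

# Example: one substitution over five reference characters -> CER = 0.2
assert abs(cer("spɛti", "spɛki") - 0.2) < 1e-9
```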

Good luck and enjoy this problem! If you have any questions, you can always visit the user forum!