Problem description
For this challenge, your goal is to classify each second of a set of videos and accompanying audio transcripts according to taxonomies of instructional activities and discourse content.
After successfully submitting a data access agreement, you will be granted access to a training set containing approximately 117 hours of annotated videos of English language arts (ELA) and math lessons from elementary classrooms along with labeled audio transcripts for approximately 39 of the 117 hours. You will also be granted access to a Phase 1 testing set of approximately 26 hours of unlabeled videos (with accompanying unlabeled audio transcripts for approximately 11 of those 26 hours) of ELA and math lessons from elementary classrooms.
Competitors who successfully submit at least one Phase 1 solution will be granted access to Phase 2. In Phase 2, competitors will make one submission against a new test set. The Phase 2 test dataset contains approximately 23 hours of unlabeled classroom videos and 10.5 hours of unlabeled audio transcripts. Prior to the end of Phase 1, competitors must select one submission whose corresponding model they will use to generate predictions on the Phase 2 test set. The final prize rankings will be determined by a weighted average of 25% of the Phase 1 score and 75% of the Phase 2 score.
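In other words, the final score used for prize ranking is:

$$\text{final score} = 0.25 \times \text{Phase 1 score} + 0.75 \times \text{Phase 2 score}$$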
Datasets overview
Secure access to all of the data for this challenge is provided via Globus. For more information about data access, see the data access instructions.
Training Dataset
The training dataset consists of:
- 116 hours of video footage, provided as 203 video files
- Each video file has a corresponding metadata JSON file which contains frame-by-frame bounding boxes for the teacher and any important objects in the frame that the teacher is interacting with.
- 39 hours of timestamped transcripts, provided as 59 transcript files in Microsoft Excel format with one row per turn of talk
- `train_gt.csv`, a file containing labels for each second of video and audio data in the dataset
Training dataset updates (June 16th)
Here is a complete list of changes to the training dataset made on June 16th, 2025:
- Training video `220.044_ELA1_Year1_20170206_Part2` removed from the training dataset due to incomplete labels
- Final two seconds of labels for video `220.152_ELA2_Year1_20170308` removed for erroneously extending beyond the end of the video
- Final labelled second for video `221.217_Math3_Year1_180515_Part2` removed for erroneously extending beyond the end of the video
- Training video metadata files updated to reflect these changes
Training dataset updates (July 31st)
Here is a complete list of changes to the training dataset made on July 31st, 2025:
- A new file, `train_gt_corrected20250731.csv`, was uploaded to Globus containing corrected audio labels for clip `220.044_Math1_Year1_20170206_Part1`. Previously, the audio labels for this clip were all zero; the correct labels were present in the Excel transcript file.
Phase 1 Test Dataset
The Phase 1 test dataset consists of:
- 26 hours of video footage, provided as 48 video files
- 11 hours worth of timestamped transcripts, provided as 19 transcript files in Microsoft Excel format with one row per turn of talk
Phase 2 Test Dataset
The Phase 2 test dataset consists of:
- 23 hours of video footage, provided as 39 video files
- 10.5 hours worth of timestamped transcripts, provided as 15 transcript files in Microsoft Excel format with one row per turn of talk
It is important to note that the distribution of positive class labels differs between the Phase 1 and Phase 2 test datasets. This is due to the constraint of using real-world video data and splitting the dataset to prevent leakage. Your models should generalize well and not rely on the distribution of class labels in the provided datasets.
For Phase 1, the data file tree looks like this:
data/
├── phase1_data/
│ ├── phase1_unlabeled_transcripts.zip # compressed archive of transcript files in Excel format
│ ├── phase1_test_metadata.csv # data about video filenames and which are annotated
│ ├── phase1_subfmt.csv # template for submission format predicting absence of all classes
│ └── phase1_test_videos.zip # compressed archive of unlabeled Phase 1 test videos
└── training/
├── train_gt.csv # training ground truth labels
├── labeled_transcripts.zip # compressed archive of transcript files in Excel format
├── training_metadata.csv # metadata about video filenames and which are annotated
├── bounding_boxes.zip # compressed archive of json files with bounding box metadata
├── video_annotations.zip # eaf label files, included for completeness
└── training_videos.zip # compressed archive of labeled training videos
Labels overview
Instructional activity label definitions and action levels
There are 24 instructional activity labels. The activities and actions are described in the table below.
Instructional Activity | Definition
---|---
**Activity Type** |
Whole class activity | All students are involved in one activity, with the teacher leading the learning (e.g., lecture, presentation, carpet time).
Individual activity | All students work privately (e.g., independent practice, reading) alone at a separate desk or in small groups with no interaction between students.
Small group activity | Students working together with peers (e.g., think-pair-share, book club); this is prioritized when there are students interacting or somewhat interacting near one another.
Transition | The students and teacher transition from one instructional activity to another (e.g., whole class to small group). The teacher and students move from one spot in the room to another (e.g., from the carpet to desks). Other than specific behavioral directions, no instruction or meaningful instructional activity is occurring during the transition.
**Discourse** |
On task student talking with student | Students conversing together; can overlap with "small group activity"; this is specific to mouth movements within the parent code time interval; without teacher support.
Student raising hand | Hand up for more than 1 second; clearly and purposefully raising hand.
**Teacher Location** |
Teacher sitting | Teacher sitting (chair, stool, floor, crouching, on desk, kneeling).
Teacher standing | Teacher standing (in generally the same spot to keep the same orientation to students).
Teacher walking | Teacher walking with purpose to change orientation to students.
**Student Location** |
Student(s) sitting on carpet or floor | Students sitting on floor/carpet.
Student(s) sitting at group tables | Students sitting at tables.
Sitting at desks | Students at individual desks.
Student(s) standing or walking | Students standing up or walking around the room; can be one or multiple.
**Teacher Supporting** |
Teacher supporting one student | Teacher uses proximity to offer assistance to one student; can be verbal or non-verbal.
Teacher supporting multiple students with student interaction | Teacher uses proximity to offer assistance to multiple students; can be verbal or non-verbal; individual students are also interacting with one another.
Teacher supporting multiple students without student interaction | Teacher uses proximity to offer assistance to multiple students who are engaged in an activity; can be verbal or non-verbal (e.g., students are sitting close to one another or in a small group but are not interacting with one another).
**Representing Content** |
Using or holding book | A book is used or held by a teacher or student.
Using or holding worksheet | A worksheet is used or held by a teacher or student.
Presentation with technology | A smartboard, Elmo, or projector is used to show content.
Using or holding instructional tool | A tangible object (e.g., ruler, math manipulative; anything in someone's hand other than what is already listed; does not include pencil/pen) is used or held by a teacher or student for instructional purposes (does not include furniture).
Using or holding notebook | A notebook is used or held by a teacher or student.
Individual technology | Student or teacher using a laptop, tablet, etc.
Teacher writing | On paper or on a whiteboard/document camera; includes erasing.
Student writing | On paper or on a whiteboard; includes erasing.
Audio transcript label definitions and examples
There are five top-level discourse labels and 19 sub-labels. For this challenge, you should label audio at the sub-label level.
Discourse Labels | Definition
---|---
**Cognitive Demand** |
Analysis – Give# | Teacher provides analysis of content (e.g., compare, construct, revise, interpret, synthesize, extend, justify/explain, inference/conjecture, connect, etc.).
Report – Give | Teacher provides report of content or recitation of facts (e.g., define, recall, identify, compute, describe (part of) a method, etc.).
Analysis – Request* | Teacher asks students to analyze content (e.g., compare, construct, revise, interpret, synthesize, extend, justify/explain, inference/conjecture, connect, etc.).
Report – Request | Teacher asks students to report on content or recite facts (e.g., define, recall, identify, compute/count, describe a method, etc.).
**Questions** |
Open-Ended | Teacher asks a content-related question that is open-ended, for which there is an immediate response and no pre-scripted answer.
Closed-Ended | Teacher asks a content-related question that has an immediate closed response that is a pre-scripted answer (e.g., yes/no or agree/disagree) and/or that tests students' fluency.
Task related prompt (TRP) | Teacher reads/restates a question or prompt in an ELA/math task from the instructional materials. Does not have to be word-for-word. Teacher may read or provide an additional question/statement in the same "spirit" as the task prompt.
**Explanation and Justification** |
E/J – Teacher Request* | Teacher requests a content-related explanation or justification.
E/J – Teacher Give# | Teacher gives a content-related explanation or justification. May include a teacher providing a (partial) solution strategy.
E/J – Student Request | Student requests an explanation or justification.
E/J – Student Give | Student gives an explanation or justification.
**Classroom Community – Feedback (Judgment)** |
Affirming | Teacher gives a positive judgement of the correctness of a student's content-related response.
Disconfirming | Teacher gives a negative judgement of the correctness of a student's content-related response.
Neutral+ | Teacher gives neither a confirming nor disconfirming judgement of the correctness of a student's content-related response. Teacher may simply repeat the student's contribution.
**Classroom Community – Feedback (Elaboration)** |
Elaborated | Teacher statements that expand on students' responses or thinking. Focus is usually on students' thinking, but not always.
Unelaborated+ | Teacher evaluations, acknowledgements, repeating a question (as an indirect evaluation), or superficially using a student's idea. Focus is not on students' thinking.
**Classroom Community – Uptake** |
Restating+ | Teacher (partially) repeats or summarizes a student's contribution. Does not have to be word for word but should be in the same "spirit" as the student's contribution, without building on the student's idea. Restating could be non-verbal, such as writing or pointing to a student's contribution on the board.
Building | Teacher incorporates a student's ideas (potentially by restating their contribution) by clarifying or expanding on the idea on behalf of the student. Building does NOT include the correction of a student's written or verbal contribution.
Exploring | Teacher incorporates a student's idea (potentially by restating their contribution) into a following line of questions by asking probing questions, such as asking students to elaborate, clarify, or evaluate the idea.
*Analysis – Request and E/J – Teacher Request usually paired together.
#Analysis – Give and E/J – Teacher Give usually paired together.
+Neutral, Restating, and Unelaborated usually paired together.
Audio transcript file format
Audio transcriptions are provided as Excel files, with the .xlsx extension. Transcribed text is provided in the first column of each row and classified according to the discourse taxonomy above, with one unique label given in each of the five sub-categories.
Typically, one turn of talk is labeled. Sometimes, a turn of talk is split into multiple turns if different segments require different labels; do not split turns further, but use the existing splits within each cell for classification. Use the provided labels in the transcripts to classify turns of talk, noting that within each turn, only one label is given per category.
These labels should then be converted into the time domain using the transcript timestamps. For each second containing a turn of talk, mark `1` in the column corresponding to the label and `0` otherwise. When turns of talk overlap within the same second, label that second with all applicable labels from the overlapping turns. As a result, labels within the same category may not be mutually exclusive in the time-domain submission, unlike in the transcript domain.
Not all videos have corresponding audio data. For videos missing audio transcripts, you should predict a value of `0` for all audio label classes.
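If it helps, here is a minimal sketch of this turn-to-second conversion in Python. The timestamp column names (`start_time`, `end_time`) and the assumption that timestamps are expressed in seconds are illustrative only; adapt them to the actual layout of the transcript files.

```python
import math
import pandas as pd

def turns_to_seconds(transcript_path, label_columns, n_seconds):
    """Expand turn-level labels into per-second 0/1 indicators.

    A sketch only: the timestamp column names ("start_time", "end_time",
    assumed to be in seconds) are placeholders for whatever the real
    transcript files use.
    """
    turns = pd.read_excel(transcript_path)
    # One row per second of the clip, all labels absent by default.
    per_second = pd.DataFrame(0, index=range(n_seconds), columns=label_columns)
    for _, turn in turns.iterrows():
        start = int(math.floor(turn["start_time"]))
        end = int(math.ceil(turn["end_time"]))
        for label in label_columns:
            if turn.get(label) == 1:
                # Overlapping turns simply OR together, so one second can
                # carry several labels from the same category.
                per_second.loc[start:min(end, n_seconds) - 1, label] = 1
    return per_second
```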
Video annotation file format
The annotation files provided are in ELAN (.eaf) format, a standard used extensively for multimedia annotation tasks. Participants must parse the required action labels from these ELAN files using the fields specified within them. Libraries such as pympi or speach can be used for convenient parsing.
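As a rough sketch of what parsing with pympi (the pympi-ling package) might look like, assuming the activity labels live in the annotation tiers of each file; inspect the tier names of the real files before relying on this:

```python
import pympi

def read_elan_annotations(eaf_path):
    """Collect (tier, start_ms, end_ms, value) tuples from an ELAN file.

    Illustrative only: which tiers hold the activity labels varies by file,
    so check eaf.get_tier_names() against the actual annotation files.
    """
    eaf = pympi.Elan.Eaf(eaf_path)
    annotations = []
    for tier in eaf.get_tier_names():
        for ann in eaf.get_annotation_data_for_tier(tier):
            start_ms, end_ms, value = ann[0], ann[1], ann[2]
            annotations.append((tier, start_ms, end_ms, value))
    return annotations
```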
Bounding box data format
Bounding box annotations for this competition are provided in JSON format, which includes frame-level bounding box data specifying spatial locations for two main entities:
- Teacher: Indicates the spatial location of the teacher within each annotated video frame.
- Important Object (such as Worksheet, Presentation with Technology): Indicates the location of key objects or tools that the teacher interacts with during the video.
Each bounding box is represented by coordinates specifying the rectangle corners, provided in the following format:
[x1, y1, x2, y2]
- (x1, y1): Top-left corner coordinates of the bounding box.
- (x2, y2): Bottom-right corner coordinates of the bounding box.
Example Bounding Box JSON structure
{
"frame": 11,
"label": "Teacher",
"points": [55.75, 173.45, 264.74, 720.0],
"occluded": false,
"outside": false
}
Parsing annotations
Participants must extract and parse the bounding box information from the provided JSON files. Each JSON file includes a sequence of annotated frames with labels clearly indicating the entity (Teacher or Important Object) being annotated.
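As an illustration, a minimal parsing sketch in Python is shown below. It assumes the top-level JSON structure is a list of frame records shaped like the example above; verify the actual structure of the metadata files before use.

```python
import json
from collections import defaultdict

def load_boxes(json_path):
    """Group bounding boxes by frame index.

    Assumes the file holds a list of records like the example above
    ("frame", "label", "points", "occluded", "outside").
    """
    with open(json_path) as f:
        records = json.load(f)
    boxes_by_frame = defaultdict(list)
    for rec in records:
        if rec.get("outside"):
            continue  # box is not visible in this frame
        x1, y1, x2, y2 = rec["points"]
        boxes_by_frame[rec["frame"]].append((rec["label"], x1, y1, x2, y2))
    return boxes_by_frame
```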
Model input format
Participants may convert bounding box annotations into an input format suitable for models used for action detection. Typical methods include generating cropped frame regions or feature representations based on bounding boxes.
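For example, a cropped-region approach might look roughly like the sketch below (using OpenCV, purely illustrative). It assumes the `boxes_by_frame` mapping produced by the parsing sketch above.

```python
import cv2

def crop_teacher_regions(video_path, boxes_by_frame, label="Teacher"):
    """Yield (frame_index, cropped_image) pairs for one entity label.

    A sketch only: boxes_by_frame is assumed to map frame index to
    (label, x1, y1, x2, y2) tuples, as in the parsing sketch above.
    """
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        for box_label, x1, y1, x2, y2 in boxes_by_frame.get(frame_idx, []):
            if box_label == label:
                # Clip coordinates to the frame bounds and crop the region.
                h, w = frame.shape[:2]
                x1, y1 = max(int(x1), 0), max(int(y1), 0)
                x2, y2 = min(int(x2), w), min(int(y2), h)
                yield frame_idx, frame[y1:y2, x1:x2]
        frame_idx += 1
    cap.release()
```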
Important considerations
- Ensure alignment between bounding box annotations and corresponding frame-level activity labels.
- Accurately maintain temporal consistency and continuity of bounding box coordinates across sequential video frames.
- Bounding boxes can overlap; your model should handle overlapping cases effectively.
Performance metric
Performance is evaluated using the macro-averaged F1-score computed over all 43 labels (the 24 instructional activity labels and the 19 discourse labels). Note that this weights video classes slightly more heavily than audio classes in aggregate, because there are more video classes. This is intentional and is meant to account for a slight discrepancy in support between the audio and video classes.
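A rough local evaluation could look like the sketch below, assuming ground truth and predictions are aligned 0/1 DataFrames with one row per `clip_id` and the 43 label columns. This mirrors, but may not exactly replicate, the official scoring.

```python
from sklearn.metrics import f1_score

def macro_f1(y_true_df, y_pred_df, label_columns):
    """Macro F1 over all label columns: the unweighted mean of per-class F1."""
    return f1_score(
        y_true_df[label_columns],
        y_pred_df[label_columns],
        average="macro",
        zero_division=0,
    )
```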
Submission format
Your output predictions on the Phase 1 test set must be submitted as a single CSV file. This CSV file must adhere to the following format:
- The first column must be named `clip_id`. The `clip_id` is a unique identifier for each second of video, formed by joining the video filename (without the file extension) and a zero-padded string of the starting second of the 1-second clip with a double underscore. The first column must be in ascending sorted order.
- The next 43 columns must contain the 43 labels as column headers (24 activity labels and 19 discourse labels). These labels should be in alphabetical order as shown below.
- Your predictions must be `1` if the activity or discourse sub-topic is present and `0` if absent.
Some videos in the test set may be slightly longer, by one second or less, than indicated by the submission format clip IDs. You should only submit predictions for seconds of video contained in the submission format.
In total, your CSV should have 95,087 rows (including the header) and 44 columns (including the clip_id column). You will receive an error message if your submission does not exactly conform to the expected format.
Example CSV structure:
Here is a truncated example of a CSV submission. For a full example, reference the submission format file.
clip_id,Activity__Discourse__On task student talking with student,Activity__Discourse__Student raising hand,...,Discourse__Questions__Open-Ended,Discourse__Questions__Task Related Prompt
220.049_ELA1_Year1_20170320_Part1__00000,1,0,...,0,1
220.049_ELA1_Year1_20170320_Part1__00001,1,0,...,0,1
220.049_ELA1_Year1_20170320_Part1__00002,1,0,...,0,1
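To make the `clip_id` convention concrete, here is a small sketch that builds the identifiers for one video. The duration handling is illustrative; in practice, take the authoritative set of required `clip_id` values directly from the provided submission format file.

```python
from pathlib import Path

def clip_ids_for_video(video_filename, n_seconds):
    """Build clip_ids: '<filename without extension>__<zero-padded second>'.

    e.g. clip_ids_for_video("220.049_ELA1_Year1_20170320_Part1.mp4", 3)
    -> ['220.049_ELA1_Year1_20170320_Part1__00000', ...,
        '220.049_ELA1_Year1_20170320_Part1__00002']

    The 5-digit zero padding matches the example rows above; the ".mp4"
    extension is an assumption for illustration.
    """
    stem = Path(video_filename).stem
    return [f"{stem}__{second:05d}" for second in range(n_seconds)]
```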
Full list of expected columns
- clip_id
- Activity__Discourse__On task student talking with student
- Activity__Discourse__Student raising hand
- Activity__Representing Content__Individual technology
- Activity__Representing Content__Presentation with technology
- Activity__Representing Content__Student writing
- Activity__Representing Content__Teacher writing
- Activity__Representing Content__Using or holding book
- Activity__Representing Content__Using or holding instructional tool
- Activity__Representing Content__Using or holding notebook
- Activity__Representing Content__Using or holding worksheet
- Activity__Student Location__Sitting at desks
- Activity__Student Location__Students sitting at group tables
- Activity__Student Location__Students sitting on carpet or floor
- Activity__Student Location__Students standing or walking
- Activity__Teacher Location__Teacher sitting
- Activity__Teacher Location__Teacher standing
- Activity__Teacher Location__Teacher walking
- Activity__Teacher Supporting__Teacher supporting multiple students with student interaction
- Activity__Teacher Supporting__Teacher supporting multiple students without student interaction
- Activity__Teacher Supporting__Teacher supporting one student
- Activity__Type__Individual activity
- Activity__Type__Small group activity
- Activity__Type__Transition
- Activity__Type__Whole class activity
- Discourse__Classroom Community__Feedback-Affirming
- Discourse__Classroom Community__Feedback-Disconfirming
- Discourse__Classroom Community__Feedback-Elaborated
- Discourse__Classroom Community__Feedback-Neutral
- Discourse__Classroom Community__Feedback-Unelaborated
- Discourse__Classroom Community__Uptake-Building
- Discourse__Classroom Community__Uptake-Exploring
- Discourse__Classroom Community__Uptake-Restating
- Discourse__Cognitive Demand__Analysis-Give
- Discourse__Cognitive Demand__Analysis-Request
- Discourse__Cognitive Demand__Report-Give
- Discourse__Cognitive Demand__Report-Request
- Discourse__EJ__Student-Give
- Discourse__EJ__Student-Request
- Discourse__EJ__Teacher-Give
- Discourse__EJ__Teacher-Request
- Discourse__Questions__Closed-Ended
- Discourse__Questions__Open-Ended
- Discourse__Questions__Task Related Prompt
Submission phases
During Phase 1, you or your team can submit up to three predictions per day. Prior to the end of Phase 1, you must choose which submission, and corresponding model, you intend to use to generate Phase 2 predictions. You indicate this by selecting the submission that corresponds to the predictions generated by your chosen model. For Phase 2, you or your team can make only one submission.
You are not permitted to train your models on the Phase 1 or Phase 2 test sets, nor are you permitted to retrain or otherwise alter your Phase 1 model to generate your Phase 2 submission. In addition, you may not manually annotate any Phase 1 or Phase 2 data. For more information, please see the competition rules.
Approximate benchmarks for instructional activity labels
The table below features F1-scores from a study predicting the 24 instructional activity labels from video alone (i.e., without bounding box data). The results were obtained on a subset of approximately 244 hours of video recordings from the DAI dataset. A fuller description of the dataset, the models, and the results is available here. The code for training the model used to achieve these results is available here. The AIAI project recently developed a multimodal model that fuses this video model with an audio transcript model; it is available here.
Instructional activity | BaS-Net F1-Score | BaS-Net+ F1-Score
---|---|---
**Activity Type** | |
Whole class activity | 0.37 | 0.57
Individual activity | 0.38 | 0.61
Small group activity | 0.55 | 0.75
Transition | 0.38 | 0.53
**Discourse** | |
On task student talking with student | 0.12 |
Student raising hand | 0.36 |
**Teacher Location** | |
Teacher sitting | 0.78 |
Teacher standing | 0.64 |
Teacher walking | 0.34 |
**Student Location** | |
Student(s) sitting on carpet or floor | 0.64 |
Student(s) sitting at group tables | 0.72 |
Sitting at desks | 0.74 |
Student(s) standing or walking | 0.62 |
**Teacher Supporting** | |
Teacher supporting one student | 0.27 |
Teacher supporting multiple students with student interaction | 0.54 |
Teacher supporting multiple students without student interaction | 0.40 |
**Representing Content** | |
Using or holding book | 0.46 |
Using or holding worksheet | 0.56 |
Presentation with technology | 0.63 |
Using or holding instructional tool | 0.40 |
Using or holding notebook | 0.24 |
Individual technology | 0.35 |
Teacher writing | 0.23 |
Student writing | 0.43 |
Good luck!
Good luck and enjoy this problem! If you have any questions you can always visit the competition forum!