AIAI (Artificial Intelligence for Advancing Instruction) Challenge - Phase 1

Classify instructional activities using multimodal classroom data. #education

$70,000 in prizes

Problem description

For this challenge, your goal is to classify each second of a set of videos and accompanying audio transcripts according to taxonomies of instructional activities and discourse content.

After successfully submitting a data access agreement, you will be granted access to a training set containing approximately 117 hours of annotated videos of English language arts (ELA) and math lessons from elementary classrooms along with labeled audio transcripts for approximately 39 of the 117 hours. You will also be granted access to a Phase 1 testing set of approximately 26 hours of unlabeled videos (with accompanying unlabeled audio transcripts for approximately 11 of those 26 hours) of ELA and math lessons from elementary classrooms.

Competitors who successfully submit at least one Phase 1 solution will be granted access to Phase 2. In Phase 2, competitors will make one submission against a new test set. The Phase 2 test dataset contains approximately 23 hours of unlabeled classroom videos and 10.5 hours of unlabeled audio transcripts. Prior to the end of Phase 1, competitors must select one submission whose corresponding model they will use to generate predictions on the Phase 2 test set. The final prize rankings will be determined by a weighted average of 25% of the Phase 1 score and 75% of the Phase 2 score.
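For example, hypothetical Phase 1 and Phase 2 scores of 0.60 and 0.50 would combine to a final score of 0.25 × 0.60 + 0.75 × 0.50 = 0.525.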

Datasets overview


Secure access to all of the data for this challenge is provided via Globus. For more information about data access, see the data access instructions.

Training Dataset

The training dataset consists of:

  • 116 hours of video footage, provided as 203 video files
  • A corresponding metadata JSON file for each video, containing frame-by-frame bounding boxes for the teacher and any important objects in the frame that the teacher is interacting with
  • 39 hours of timestamped transcripts, provided as 59 transcript files in Microsoft Excel format with one row per turn of talk
  • train_gt.csv, a file containing labels for each second of video and audio data in the dataset
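
As a quick orientation, the ground-truth file can be loaded and split back into its per-video, per-second components. This is a minimal sketch assuming train_gt.csv shares the clip_id plus 43 label columns of the submission format described later in this document; verify against the actual file.

import pandas as pd

# Per-second training labels; assumed to use the same clip_id + 43 label columns
# as the submission format (check the real file before relying on this).
train_gt = pd.read_csv("data/training/train_gt.csv")

# clip_id is "<video filename>__<zero-padded starting second>", e.g.
# "220.049_ELA1_Year1_20170320_Part1__00000".
parts = train_gt["clip_id"].str.rsplit("__", n=1, expand=True)
train_gt["video"] = parts[0]
train_gt["second"] = parts[1].astype(int)

print(train_gt.groupby("video")["second"].max().head())  # labeled length per video
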
Training dataset updates (June 16th)

Here is a complete list of changes to the training dataset made on June 16th, 2025:

  • Training video 220.044_ELA1_Year1_20170206_Part2 removed from the training dataset due to incomplete labels
  • Final two seconds of labels for video 220.152_ELA2_Year1_20170308 removed for erroneously extending beyond the end of the video
  • Final labelled second for video 221.217_Math3_Year1_180515_Part2 removed for erroneously extending beyond the end of the video
  • Training video metadata files updated to reflect these changes
Training dataset updates (July 31st)

Here is a complete list of changes to the training dataset made on July 31st, 2025:

  • A new file, train_gt_corrected20250731.csv, was uploaded to Globus containing corrected audio labels for clip 220.044_Math1_Year1_20170206_Part1. Previously, the audio labels for this clip were zero. The correct labels were present in the Excel file.

Phase 1 Test Dataset

The Phase 1 test dataset consists of:

  • 26 hours of video footage, provided as 48 video files
  • 11 hours of timestamped transcripts, provided as 19 transcript files in Microsoft Excel format with one row per turn of talk

Phase 2 Test Dataset

The Phase 2 test dataset consists of:

  • 23 hours of video footage, provided as 39 video files
  • 10.5 hours of timestamped transcripts, provided as 15 transcript files in Microsoft Excel format with one row per turn of talk

It is important to note that the distribution of positive class labels across the test datasets is different. This is due to the constraint of using real-world video data and splitting the dataset to prevent leakage. Your models should generalize well and not rely on the distribution of class labels in the provided datasets.

For Phase 1, the data file tree looks like this:

data/
├── phase1_data/
│   ├── phase1_unlabeled_transcripts.zip    # compressed archive of transcript files in Excel format
│   ├── phase1_test_metadata.csv            # data about video filenames and which are annotated
│   ├── phase1_subfmt.csv                   # template for submission format predicting absence of all classes
│   └── phase1_test_videos.zip              # compressed archive of unlabeled Phase 1 test videos
└── training/
    ├── train_gt.csv                         # training ground truth labels
    ├── labeled_transcripts.zip              # compressed archive of transcript files in Excel format 
    ├── training_metadata.csv                # metadata about video filenames and which are annotated
    ├── bounding_boxes.zip                   # compressed archive of json files with bounding box metadata
    ├── video_annotations.zip                # eaf label files, included for completeness
    └── training_videos.zip                  # compressed archive of labeled training videos

Labels overview


Instructional activity label definitions and action levels

[Image: teacher with a table of the activity label taxonomy]

There are 24 instructional activity labels. The activities and actions are described below, grouped by category.

Activity Type
  • Whole class activity: All students are involved in one activity, with the teacher leading the learning (e.g., lecture, presentation, carpet time).
  • Individual activity: All students work privately (e.g., independent practice, reading), either alone at a separate desk or seated in small groups with no interaction between students.
  • Small group activity: Students working together with peers (e.g., think-pair-share, book club); this label is prioritized when students are interacting, or somewhat interacting, near one another.
  • Transition: The students and teacher transition from one instructional activity to another (e.g., whole class to small group). The teacher and students move from one spot in the room to another (e.g., from the carpet to desks). Other than specific behavioral directions, no instruction or meaningful instructional activity occurs during the transition.

Discourse
  • On task student talking with student: Students conversing together; can overlap with "small group activity"; labeled based on mouth movements within the parent code time interval; occurs without teacher support.
  • Student raising hand: Hand up for more than 1 second; clearly and purposefully raising hand.

Teacher Location
  • Teacher sitting: Teacher sitting (chair, stool, floor, crouching, on desk, kneeling).
  • Teacher standing: Teacher standing (in generally the same spot, keeping the same orientation to students).
  • Teacher walking: Teacher walking with purpose to change orientation to students.

Student Location
  • Student(s) sitting on carpet or floor: Students sitting on the floor/carpet.
  • Student(s) sitting at group tables: Students sitting at tables.
  • Sitting at desks: Students at individual desks.
  • Student(s) standing or walking: Students standing up or walking around the room; can be one or multiple.

Teacher Supporting
  • Teacher supporting one student: Teacher uses proximity to offer assistance to one student; can be verbal or non-verbal.
  • Teacher supporting multiple students with student interaction: Teacher uses proximity to offer assistance to multiple students; can be verbal or non-verbal; individual students are also interacting with one another.
  • Teacher supporting multiple students without student interaction: Teacher uses proximity to offer assistance to multiple students who are engaged in an activity; can be verbal or non-verbal (e.g., students are sitting close to one another or in a small group but are not interacting with one another).

Representing Content
  • Using or holding book: A book is used or held by a teacher or student.
  • Using or holding worksheet: A worksheet is used or held by a teacher or student.
  • Presentation with technology: A smartboard, Elmo, or projector is used to show content.
  • Using or holding instructional tool: A tangible object (e.g., ruler, math manipulative; anything in someone's hand other than what is already listed; does not include pencil/pen) is used or held by a teacher or student for instructional purposes (does not include furniture).
  • Using or holding notebook: A notebook is used or held by a teacher or student.
  • Individual technology: Student or teacher using a laptop, tablet, etc.
  • Teacher writing: On paper or on a whiteboard/document camera; includes erasing.
  • Student writing: On paper or on a whiteboard; includes erasing.

Audio transcript label definitions and examples

There are five top-level discourse labels and 19 sub-labels. For this challenge, you should label audio at the sub-label level.

Discourse Labels Definition & Examples
Cognitive Demand
Analysis – Give#
Teacher provides analysis of content (e.g., compare, construct, revise, interpret, synthesize, extend, justify/explain, inference/conjecture, connect, etc.).
Examples:
  • "Yeah, four. There are four hiding because there are six here, right."
  • "The keyword is effective to internet research. Okay. It's kind of like the main idea. We talk about the main idea in the books that we're reading. It's the same thing."
Report – Give
Teacher provides report of content or recitation of facts (e.g., define, recall, identify, compute, describe (part of) a method, etc.).
Examples:
  • "You know six and six is 12."
  • "All right, in our box method, we've got the first part of our box is all of our tens or all of our rods. The second part of our box is all of our ones."
  • "But sometimes children just read this little piece and they think that's the answer, but do you see the dot dot dot? That means it keeps talking. You want to make sure it's talking about what you want."
  • "And then what happens? She asks another question and another question."
Analysis – Request*
Teacher asks students to analyze content (e.g., compare, construct, revise, interpret, synthesize, extend, justify/explain, inference/conjecture, connect, etc.).
Examples:
  • "Is it 18 plus four that we want to do? 18 cookies plus four boxes? Does that make sense?"
  • "Why did he do four circles? How is that related to our problem?"
  • "Do you see how that's the same thing as 90?"
  • "How can we show that with blocks?"
  • "Why did she have good traits of a scientist? Tell me why."
  • "Do you think she knows what she did wrong?"
Report – Request
Teacher asks students to report on content or recite facts (e.g., define, recall, identify, compute/count, describe a method etc.).
Examples:
  • "What's 32 plus 40?"
  • "All right, and then what'd you do?"
  • "Who remembers our strategy for multiplying by three?"
  • "I was going to say, is he a troublemaker?"
  • "What does it mean when you're copying word for word? Anyone remember?"
Questions
Open-Ended
Teacher asks an open-ended, content-related question for which there is an immediate response and no pre-scripted answer.
Examples:
  • "All right, how did your partner solve it?"
  • "Who's got a strategy to solve four times 80? What can we do for that?"
  • "What's another way?"
  • "How could we describe the character?"
Closed-Ended
Teacher asks a content-related question that has an immediate closed response that is a pre-scripted answer (e.g., yes/no or agree/disagree) and/or that tests students’ fluency.
Examples:
  • "What's 32 plus 40?"
  • "Let's count by three six times."
  • "What do we do with that 150 and 18 to find our total product?"
  • "Did you figure out what the word lifespan means?"
  • "But does she look like she's enjoying her time in the chair?"
  • "So is it asking when or where?"
Task Related Prompt (TRP)
Teacher reads/restates question or prompt in an ELA/math task from the instructional materials. Does not have to be word-for-word. Teacher may read or provide an additional question/statement in the same "spirit" as the task prompt.
Examples:
  • "There are 18 cookies in each box. I ordered four boxes of cookies. How many cookies will I get?"
  • "The first problem we have is 56 times three."
  • "Let's do another one question. If you wanted to know when Rosa Parks was born, what would you type in as your keyword?"
Explanation and Justification
E/J – Teacher Request*
Teacher requests a content-related explanation or justification.
Examples:
  • "Why did he do four circles? How is that related to our problem?"
  • "And how'd you solve that?"
  • "And why is she sad?"
E/J – Teacher Give#
Teacher gives a content-related explanation or justification. May include a teacher providing a (partial) solution strategy.
Examples:
  • "All right, if I've got my box here, I put my three on the side. That's like my three rows then I have."
  • "When we multiply by four, it's like we're doubling it two times. So 80 doubled is 160. 160 doubled is 320."
E/J – Student Request
Student requests an explanation or justification.
Examples:
  • S: "How do you know if it's wrong?"
E/J – Student Give
Student gives an explanation or justification.
Examples:
  • T: "How did you know?" S: "Counted them with eight, nine, 10, that means three. And I know seven is not there."
  • T: "And why is she sad?" S: "Because her parents yelled at her."
Classroom Community – Feedback (Judgment)
Affirming
Teacher gives a positive judgement of the correctness of a student's content-related response.
Examples:
  • "I love your work that you showed for this. That's very good."
  • "Excellent."
  • "You got it right."
  • "Nice thinking."
  • "Okay. There's the connection. That's what we were trying to get at."
Disconfirming
Teacher gives a negative judgement of the correctness of a student's content-related response.
Examples:
  • "No, not just 20."
  • "Not quite."
  • "Mmm, no. What does it mean when I say keyword?"
  • "You've got too many."
Neutral+
Teacher gives neither a confirming nor disconfirming judgement of the correctness of a student's content-related response. Teacher may simply repeat the student's contribution.
Examples:
  • "Okay, times two."
  • "On the smaller side, okay."
  • "Okay...try it."
  • "Like kind of comic-y. Okay. Thank you."
Classroom Community – Feedback (Elaboration)
Elaborated
Teacher statements that expand on students' responses or thinking. Focus is usually on students' thinking but not always.
Examples:
  • "Good. And that's the same as counting by five. 5, 10, 15. Excellent."
  • "Careful. That shows 33 times three. How can you make it just 32 times three?"
  • "Yeah. A scientist is always doing experiments and always asking questions."
  • "Quotation marks. If you put those around what you're searching, that can actually help, too."
Unelaborated+
Teacher evaluations, acknowledgements, repeating a question (as an indirect evaluation), or superficially using a student's idea. Focus is not on students' thinking.
Examples:
  • "Excellent."
  • "Not quite."
  • "Ooh. A scientist."
  • "Amount of days in March. Okay."
Classroom Community – Uptake
Restating+
Teacher (partially) repeats or summarizes a student's contribution. Does not have to be word for word but should be in the same "spirit" of the student's contribution without building on the student's idea. Restating could be non-verbal such as writing or pointing to a student's contribution on the board.
Examples:
  • S: "14" T: "14."
  • "All right, and then [Student] was also telling us that we would have to do three times six to figure out how many little ones we have."
  • S: "Scientist" T: "Ooh. A scientist."
Building
Teacher incorporates a student's ideas (potentially by restating their contribution) by clarifying or expanding on the idea on behalf of the student. Building does NOT include the correction of a student's written or verbal contribution.
Examples:
  • S: "That's 12." T: "12. And that's what three times four is which goes right here."
  • S: "We would make five 10 blocks and three little blocks. No, we would make eight little blocks." T: "You got it right. We would have five 10 blocks, because that would be the 50."
  • T: "Is that normal for a child to not make a sound until they're three?" Ss: "No." T: "So, it's not typical. No. We want you to be making sounds before the age of three."
Exploring
Teacher incorporates a student's idea (potentially by restating their contribution) into a following line of questions by asking probing questions such as asking students to elaborate, clarify, or evaluate the idea.
Examples:
  • S: "Do times two." T: "Okay, times two. And then what?"
  • S: "Put the three over this side, and the seven on this side." T: "Is it just a three and a seven? What does the three really mean in the number 37?"
  • S: "And four threes over here." T: "Show me what you mean. Show me. Use these to help you."
  • S: "The cat doesn't have a clue either." T: "What makes you say that?"

*Analysis – Request and E/J – Teacher Request usually paired together.

#Analysis – Give and E/J – Teacher Give usually paired together.

+Neutral, Restating, and Unelaborated usually paired together.

Audio transcript file format

Audio transcriptions are provided as Excel files, with the .xlsx extension. Transcribed text is provided in the first column of each row and classified according to the discourse taxonomy above, with one unique label given in each of the five sub-categories.

Typically, each turn of talk is labeled as a whole. Sometimes a turn of talk is split into multiple turns when different segments require different labels; do not split turns further, but use the existing splits within each cell for classification. Use the labels provided in the transcripts to classify turns of talk, noting that within each turn, only one label is given per category.

These labels should then be converted into the time domain using the transcript timestamps. For each second containing a turn of talk, mark 1 in the column corresponding to the label and 0 otherwise. When turns of talk overlap within the same second, mark that second with all applicable labels from the overlapping turns. As a result, labels within the same category may not be mutually exclusive in the time-domain submission, unlike in the transcript domain.

Not all videos have corresponding audio data. For videos without audio transcripts, you should predict a value of 0 for all audio label classes.
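
A rough sketch of this turn-to-second conversion is below. The timestamp and category column names are placeholders (the actual Excel headers may differ), and the resulting sub-label strings would still need to be mapped onto the Discourse__ column names used in the submission format.

import numpy as np
import pandas as pd

# Placeholder column names -- inspect a real transcript file for the actual headers.
START, END = "start_time", "end_time"   # turn start/end, assumed to be in seconds
CATEGORY_COLS = ["Cognitive Demand", "Questions", "Explanation and Justification",
                 "Feedback", "Uptake"]  # five categories, one sub-label each per turn

def turns_to_seconds(transcript, n_seconds):
    """Expand turn-level labels into a per-second multi-hot table."""
    sublabels = sorted({v for c in CATEGORY_COLS for v in transcript[c].dropna().unique()})
    per_second = pd.DataFrame(0, index=range(n_seconds), columns=sublabels)
    for _, turn in transcript.iterrows():
        first = int(np.floor(turn[START]))
        last = int(np.ceil(turn[END]))
        for sec in range(first, min(last, n_seconds)):
            for c in CATEGORY_COLS:
                if pd.notna(turn[c]):
                    per_second.loc[sec, turn[c]] = 1   # overlapping turns simply OR together
    return per_second

transcript = pd.read_excel("some_labeled_transcript.xlsx")   # illustrative path
audio_labels = turns_to_seconds(transcript, n_seconds=3600)
# Videos with no transcript get 0 for every audio label column.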

Video annotation file format

[Image: teacher with the activity label taxonomy table, with active classifications highlighted]

The annotation files provided are in ELAN format, a standard used extensively for multimedia annotation tasks. Participants must parse the required action labels from these ELAN files using the fields specified within them. Libraries like pympi or speach can be used for convenient parsing.
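
For example, with pympi (installed as pympi-ling), each tier's time-aligned annotations can be read directly. The tier names and value strings in these particular .eaf files are not reproduced here, so treat the snippet below as a sketch and inspect the files to see which tiers carry the action labels.

import pympi  # pip install pympi-ling

eaf = pympi.Elan.Eaf("some_video_annotation.eaf")   # illustrative path

for tier in eaf.get_tier_names():
    # Each annotation on a time-aligned tier is (start_ms, end_ms, value, ...).
    for ann in eaf.get_annotation_data_for_tier(tier):
        start_ms, end_ms, value = ann[0], ann[1], ann[2]
        first_second, last_second = start_ms // 1000, end_ms // 1000
        # ... mark seconds first_second..last_second as positive for this label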

Bounding box data format

[Image: teacher with the activity label taxonomy table and bounding boxes around the teacher and whiteboard]

Bounding box annotations for this competition are provided in JSON format, which includes frame-level bounding box data specifying spatial locations for two main entities:

  • Teacher: Indicates the spatial location of the teacher within each annotated video frame.
  • Important Object (such as Worksheet, Presentation with Technology): Indicates the location of key objects or tools that the teacher interacts with during the video.

Each bounding box is represented by coordinates specifying the rectangle corners, provided in the following format:

[x1, y1, x2, y2]

  • (x1, y1): Top-left corner coordinates of the bounding box.
  • (x2, y2): Bottom-right corner coordinates of the bounding box.

Example Bounding Box JSON structure

{
  "frame": 11,
  "label": "Teacher",
  "points": [55.75, 173.45, 264.74, 720.0],
  "occluded": false,
  "outside": false
}

Parsing annotations

Participants must extract and parse the bounding box information from the provided JSON files. Each JSON file includes a sequence of annotated frames with labels clearly indicating the entity (Teacher or Important Object) being annotated.

Model input format

Participants may convert bounding box annotations into an input format suitable for models used for action detection. Typical methods include generating cropped frame regions or feature representations based on bounding boxes.
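
As a sketch of that idea, the snippet below reads one bounding-box metadata file, crops the teacher region from the corresponding frame, and maps the frame index onto the per-second label grid. It assumes the JSON file is a list of entries shaped like the example above and uses OpenCV; verify the actual file layout and the video frame rate before relying on it.

import json
import cv2

# Assumed layout: a list of entries like the example above.
with open("some_video_boxes.json") as f:
    annotations = json.load(f)

cap = cv2.VideoCapture("some_video.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)   # needed to align frame indices with per-second labels

for ann in annotations:
    if ann["label"] != "Teacher" or ann.get("outside", False):
        continue
    cap.set(cv2.CAP_PROP_POS_FRAMES, ann["frame"])
    ok, frame = cap.read()
    if not ok:
        continue
    x1, y1, x2, y2 = (int(round(v)) for v in ann["points"])
    teacher_crop = frame[y1:y2, x1:x2]     # cropped region for an action-detection model
    second = int(ann["frame"] / fps)       # the 1-second clip this frame falls in

cap.release()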

Important considerations

  • Ensure alignment between bounding box annotations and corresponding frame-level activity labels.
  • Accurately maintain temporal consistency and continuity of bounding box coordinates across sequential video frames.
  • Bounding boxes can overlap; your model should handle overlapping cases effectively.

Performance metric


Performance is evaluated according to the macro-averaged F1-score across all 24 instructional activity labels and 19 discourse labels. Because there are more video classes than audio classes, the video classes collectively carry slightly more weight; this is intentional and accounts for a slight discrepancy in support between the audio and video classes.
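
Assuming the metric matches scikit-learn's macro averaging (an unweighted mean of per-class F1 over all 43 columns), a local check could look like the sketch below, with dummy arrays standing in for ground truth and predictions.

import numpy as np
from sklearn.metrics import f1_score

# Dummy 0/1 arrays of shape (n_seconds, 43), one column per label in
# submission-format order; replace with real ground truth and predictions.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 43))
y_pred = rng.integers(0, 2, size=(1000, 43))

macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(round(macro_f1, 4))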

Submission format


Your output predictions on the Phase 1 test set must be submitted as a single CSV file. This CSV file must adhere to the following format:

  • The first column must be named clip_id. The clip_id is a unique identifier for each second of video and consists of a double-underscore separated concatenation of the video filename (without the file extension) and a zero-padded string of the starting second of the 1-second clip. The clip_id column must be sorted in ascending order.
  • The next 43 columns must contain the 43 labels as column headers (24 activity labels and 19 discourse labels). These labels should be in alphabetical order as shown below.
  • Your predictions must be 1 if the activity or discourse sub-topic is present and 0 if absent.

Some videos in the test set may be slightly longer, by one second or less, than indicated by the submission format clip IDs. You should only submit predictions for seconds of video contained in the submission format.

In total, your CSV should have 95,087 rows (including the header) and 44 columns (including the clip_id column). You will receive an error message if your submission does not exactly conform to the expected format.

Example CSV structure:

Here is a truncated example of a CSV submission. For a full example, reference the submission format file.

clip_id,Activity__Discourse__On task student talking with student,Activity__Discourse__Student raising hand,...,Discourse__Questions__Open-Ended,Discourse__Questions__Task Related Prompt
220.049_ELA1_Year1_20170320_Part1__00000,1,0,...,0,1
220.049_ELA1_Year1_20170320_Part1__00001,1,0,...,0,1
220.049_ELA1_Year1_20170320_Part1__00002,1,0,...,0,1
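
One low-risk way to produce a conforming file is to fill in the provided template, which already has the expected clip_ids, column order, and row count. A minimal sketch follows, with all-zero stand-in predictions; swap in your model's output.

import pandas as pd

# The template predicts absence of all classes, so it already has the exact
# clip_ids, column order, and row count the checker expects.
submission = pd.read_csv("data/phase1_data/phase1_subfmt.csv")
label_cols = [c for c in submission.columns if c != "clip_id"]

# Stand-in for model output: a 0/1 DataFrame indexed by clip_id with the same columns.
predictions = pd.DataFrame(0, index=submission["clip_id"], columns=label_cols)

# Align to the template's clip_id order; any second the model did not score
# (e.g. trailing seconds beyond the submission format) falls back to 0.
submission[label_cols] = (
    predictions.reindex(submission["clip_id"]).fillna(0).astype(int).to_numpy()
)
submission.to_csv("submission.csv", index=False)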

Full list of expected columns

  • clip_id
  • Activity__Discourse__On task student talking with student
  • Activity__Discourse__Student raising hand
  • Activity__Representing Content__Individual technology
  • Activity__Representing Content__Presentation with technology
  • Activity__Representing Content__Student writing
  • Activity__Representing Content__Teacher writing
  • Activity__Representing Content__Using or holding book
  • Activity__Representing Content__Using or holding instructional tool
  • Activity__Representing Content__Using or holding notebook
  • Activity__Representing Content__Using or holding worksheet
  • Activity__Student Location__Sitting at desks
  • Activity__Student Location__Students sitting at group tables
  • Activity__Student Location__Students sitting on carpet or floor
  • Activity__Student Location__Students standing or walking
  • Activity__Teacher Location__Teacher sitting
  • Activity__Teacher Location__Teacher standing
  • Activity__Teacher Location__Teacher walking
  • Activity__Teacher Supporting__Teacher supporting multiple students with student interaction
  • Activity__Teacher Supporting__Teacher supporting multiple students without student interaction
  • Activity__Teacher Supporting__Teacher supporting one student
  • Activity__Type__Individual activity
  • Activity__Type__Small group activity
  • Activity__Type__Transition
  • Activity__Type__Whole class activity
  • Discourse__Classroom Community__Feedback-Affirming
  • Discourse__Classroom Community__Feedback-Disconfirming
  • Discourse__Classroom Community__Feedback-Elaborated
  • Discourse__Classroom Community__Feedback-Neutral
  • Discourse__Classroom Community__Feedback-Unelaborated
  • Discourse__Classroom Community__Uptake-Building
  • Discourse__Classroom Community__Uptake-Exploring
  • Discourse__Classroom Community__Uptake-Restating
  • Discourse__Cognitive Demand__Analysis-Give
  • Discourse__Cognitive Demand__Analysis-Request
  • Discourse__Cognitive Demand__Report-Give
  • Discourse__Cognitive Demand__Report-Request
  • Discourse__EJ__Student-Give
  • Discourse__EJ__Student-Request
  • Discourse__EJ__Teacher-Give
  • Discourse__EJ__Teacher-Request
  • Discourse__Questions__Closed-Ended
  • Discourse__Questions__Open-Ended
  • Discourse__Questions__Task Related Prompt

Submission phases

During Phase 1, you or your team can submit up to three predictions per day. Prior to the end of Phase 1, your team must choose which submission, and corresponding model, it intends to use to generate Phase 2 predictions. You indicate this by selecting the submission that corresponds to predictions generated by your chosen model. For Phase 2, you or your team can make only one submission.

You are not permitted to train your models on the Phase 1 or Phase 2 test sets, nor are you permitted to retrain or alter any part of your Phase 1 model to generate your Phase 2 submission. In addition, you may not manually annotate any Phase 1 or Phase 2 data. For more information, please see the competition rules.

Approximate benchmarks for instructional activity labels


The table below features F1-scores from a study predicting the 24 instructional activity labels based solely on video annotations (i.e., no bounding box data). The results were obtained on a subset of approximately 244 hours of video recordings from the DAI dataset. A fuller description of the dataset, the models, and the results is available here. The code for training the model used to achieve these results is available here. The AIAI project recently developed a multimodal model that fuses this video model with an audio transcript model; it is available here.

Instructional activity BaS-Net F1-Score BaS-Net+ F1-Score
Activity Type
Whole class activity 0.37 0.57
Individual activity 0.38 0.61
Small group activity 0.55 0.75
Transition 0.38 0.53
Discourse
On task student talking with student 0.12
Student raising hand 0.36
Teacher Location
Teacher sitting 0.78
Teacher standing 0.64
Teacher walking 0.34
Student Location
Student(s) sitting on carpet or floor 0.64
Student(s) sitting at group tables 0.72
Sitting at desks 0.74
Student(s) standing or walking 0.62
Teacher Supporting
Teacher supporting one student 0.27
Teacher supporting multiple students with student interaction 0.54
Teacher supporting multiple students without student interaction 0.40
Representing Content
Using or holding book 0.46
Using or holding worksheet 0.56
Presentation with technology 0.63
Using or holding instructional tool 0.40
Using or holding notebook 0.24
Individual technology 0.35
Teacher writing 0.23
Student writing 0.43

Good luck!


Good luck and enjoy this problem! If you have any questions, you can always visit the competition forum!