Meta AI Video Similarity Challenge: Descriptor Track | Phase 1 Hosted By Meta


Problem Description

For this competition, your goal is to identify whether a query video contains manipulated portions of one or more videos in a reference set. You will receive a reference set of approximately 40,000 videos and a query set of approximately 8,000 videos. Some query videos contain clips derived from videos in the reference set, and some do not.

For this Descriptor Track, your task is to compute descriptors (embeddings) for each video in the query and reference sets. Your descriptors will be 32-bit floating point vectors of up to 512 dimensions. You may use multiple descriptors for each video, up to one descriptor for every second of video (see rules).
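The format limits above can be checked locally before submitting. The following is a minimal sketch, not the official validator: the function name `check_submission` and the dictionary layout are hypothetical, and it assumes the descriptor budget is the unrounded sum of video durations in seconds (how fractional seconds are rounded is not specified here). Pure-Python lists stand in for the 32-bit float arrays you would use in practice.

```python
import math

MAX_DIMS = 512  # upper bound on descriptor dimensionality stated above


def check_submission(descriptors_by_video, duration_by_video):
    """Return a list of human-readable problems (empty if none were found).

    descriptors_by_video: dict of video id -> list of descriptors (lists of floats)
    duration_by_video:    dict of video id -> duration in seconds
    """
    problems = []
    all_descriptors = [d for ds in descriptors_by_video.values() for d in ds]

    # All descriptors must share one dimensionality to be comparable.
    dims = {len(d) for d in all_descriptors}
    if len(dims) > 1:
        problems.append(f"inconsistent descriptor dimensions: {sorted(dims)}")
    if any(n > MAX_DIMS for n in dims):
        problems.append(f"descriptor dimension exceeds {MAX_DIMS}")
    if any(not math.isfinite(x) for d in all_descriptors for x in d):
        problems.append("descriptor contains NaN or infinity")

    # Assumption: the global budget is one descriptor per second of total video.
    budget = sum(duration_by_video.values())
    if len(all_descriptors) > budget:
        problems.append(
            f"{len(all_descriptors)} descriptors exceed the budget of {budget:.1f}"
        )
    return problems


ok = check_submission({"Q1": [[0.1] * 512, [0.2] * 512]}, {"Q1": 2.5})
bad = check_submission({"Q1": [[0.1] * 513]}, {"Q1": 0.5})
```

Here `ok` is an empty list, while `bad` reports both the oversized dimension and the exceeded budget.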

After receiving your descriptors, an evaluation script will compute the inner-product similarity between each pair of descriptor vectors in the query and reference sets, and then use this similarity as a confidence score to compute micro-average precision. The confidence score for each video pair is the maximum of the pairwise inner-product similarities computed between the descriptor embeddings of the two videos.

The Challenge


The primary domain of this competition is a field of computer vision called copy detection. For images, the problem of copy detection is difficult but straightforward. With the addition of a time dimension, videos contain potentially thousands of images in sequence, and the problem of not just copy detection but also localization presents a novel challenge. Our competition sponsors at Meta AI believe that this problem is ripe for innovation, particularly with regard to solutions that are faster and require fewer resources to be performant.

Phase 1

During the first phase of the competition, teams will have access to the training dataset of approximately 8,000 query videos and approximately 40,000 reference videos, with labels that specify the portions of query videos that contain derived content along with the portion of the reference video from which the content was derived. In addition, teams will have access to a distinct Phase 1 test dataset containing approximately 8,000 unlabeled query videos and approximately 40,000 reference videos. Teams will be able to make submissions to be evaluated against the ground truth labels for the Phase 1 test dataset. Scores will appear on the public leaderboard, but will not determine your team's final prize ranking.

Note: You will not be allowed to re-train your model in Phase 2, so it must be finalized by the end of Phase 1 (March 24, 2023 23:59:59 UTC). In order to be eligible for final prizes, you will need to submit a file containing code and metadata. See below for more details.

The Dataset


For this competition, Meta AI has compiled a new dataset composed of approximately 100,000 videos derived from the YFCC100M dataset. This dataset has been divided for the purposes of this competition into a training set, which participants may use to train their models, and a test set on which training is prohibited. Both the train and test sets are further divided into a set of reference videos, and a set of query videos that may or may not contain content derived from one or more videos in the reference set. Query videos are, on average, about 40 seconds long, while reference videos are, on average, about 30 seconds long. In the case of the training set, a set of labels is included indicating which query videos contain derived content.

Let's look at some examples to understand different ways in which content from reference videos may be included in query videos.

The video on the left is a reference video. The video on the right is a query video that contains a clip that has been derived from the reference video -- in this case, a portion of the query video (from 0:00.3 to 0:05.7) contains a portion of the reference video (from 0:24.0 to 0:29.43) that has been surrounded in all four corners by cartoon animals and had a static effect applied:

Left: reference video. Credit: walkinguphills
Right: query video derived from reference video. Credit: walkinguphills, Timmy and MiNe (sfmine79)

Edited query videos may have been modified using a number of techniques. The portion of the following reference video (from 0:05.0 to 0:16.6) inserted into the query video (from 0:05.6 to 0:16.9) has been overlaid with a transparency effect:

Reference video. Credit: Timo_Beil
Query video derived from reference video. Credit: Timo_Beil and Neeta Lind

Editing techniques employed on the challenge data set include:

  • blending videos together
  • altering brightness or contrast
  • jittering color or converting to grayscale
  • blurring or pixelization
  • altering aspect ratio or video speed, looping
  • rotation, scaling, cropping, padding
  • adding noise or changing encoding quality
  • flipping or stacking horizontally or vertically
  • formatting as a meme; overlaying of dots, emoji, shapes, or text
  • transforming perspective and shaking; transition effects like cross-fading
  • creating multiple small scenes; inserting content in the background

Please also note that as part of the data cleaning process, faces have been obscured from the videos using a segmentation mask.

Although synthetically created by our sponsors at Meta AI, the transformations are similar to those seen in a production setting on social media platforms, with a particular emphasis on the ones that are more difficult to detect. Each query video that does contain content derived from a reference video will contain at least five seconds of video derived from that particular reference video. It is also possible for a query video to have content derived from multiple reference videos.

Performant solutions should be able to differentiate between videos that have been copied and manipulated and videos that are simply quite similar. Below are two examples of such similar, but not copied, videos. The actual data will contain queries that contain some copied portion of a reference video, some videos quite similar to some reference videos, and some videos that have not been derived in any way from the reference set.

Reference video. Credit: thievingjoker
Similar, but not derived, query video. Credit: thievingjoker

Reference video. Credit: Gwydion M. Williams
Similar, but not derived, query video. Credit: Gwydion M. Williams

TOOL TIP: Looking for great tools to help augment your training set? The Meta AI team has released an open source data augmentation library to help build more robust AI models. Check out AugLy on GitHub to get started!

Data Phases

Data will be made available in separate tranches over the course of Phase 1 and Phase 2.

In Phase 1, you will have access to the training and test datasets, described in more detail below. Training is permitted on the training set, but is prohibited for the test set.

For your Phase 1 submissions, you will submit descriptors for all videos in the test query and reference sets, as well as code capable of generating descriptors for an unseen query video.

In Phase 2, you will have access to approximately 8,000 new query videos, which may or may not contain content derived from videos in the Phase 1 test reference corpus.


To access the Phase 1 data, first make sure you are registered for the competition, and then download the data following the instructions in the data tab. The corpus is very large, so this might take a little while. Make sure you have a stable internet connection.

You will be able to download the following:

  • The train dataset, containing:
    • reference corpus of 40,311 reference video mp4 files. Filenames correspond to each reference video id, going from R100001.mp4 to R140311.mp4.
    • query corpus containing 8,404 query video mp4 files. Filenames correspond to each query video id, going from Q100001.mp4 to Q108404.mp4 in Phase 1.
    • train_query_metadata.csv and train_reference_metadata.csv files that contain metadata about each video
    • train_descriptor_ground_truth.csv file containing the ground truth for the query videos which contain content derived from reference videos:
      • query_id gives the query video id
      • ref_id gives the id of the corresponding reference video from which the query video was derived
  • The test dataset, containing:
    • reference corpus of 40,318 reference video mp4 files. Filenames correspond to each reference video id, going from R200001.mp4 to R240318.mp4.
    • query corpus containing 8,295 query video mp4 files. Filenames correspond to each query video id, going from Q200001.mp4 to Q208295.mp4 in Phase 1.
    • test_query_metadata.csv and test_reference_metadata.csv files that contain metadata about each video

It is possible that a query video has matched segments from multiple reference videos, in which case the ground truth should contain multiple rows, one for each matched segment.
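Because one query can appear on several rows, it is convenient to group the ground truth by query id. A minimal sketch using only the standard library follows; the helper name `load_ground_truth` is hypothetical, the sample rows are made up for illustration, and the sketch assumes the file has a header row with the `query_id` and `ref_id` columns described above.

```python
import csv
import io
from collections import defaultdict


def load_ground_truth(csv_file):
    """Map each query_id to the set of ref_ids it was derived from."""
    matches = defaultdict(set)
    for row in csv.DictReader(csv_file):
        # A set collapses repeated (query, reference) rows into one pair.
        matches[row["query_id"]].add(row["ref_id"])
    return dict(matches)


# Made-up rows in the documented two-column layout, for illustration only.
sample = io.StringIO(
    "query_id,ref_id\n"
    "Q100001,R100010\n"
    "Q100001,R100020\n"
    "Q100002,R100030\n"
)
ground_truth = load_ground_truth(sample)
```

With the real file, the same function could be called as `load_ground_truth(open("train_descriptor_ground_truth.csv", newline=""))`. Query videos that do not appear in the mapping have no labeled derived content.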

In Phase 2, you will receive access to a new corpus of approximately 8,000 query videos, with filenames going from Q300001.mp4 to Q308001.mp4. Phase 2 will utilize the same corpus of reference videos as the Phase 1 test dataset.


In order to more closely simulate the needs of a system operating at production scale, and to help ensure that solutions generalize well to new data, the following rules apply for this competition. Eligible solutions must adhere to these requirements. Any violation of these rules is grounds for disqualification and ineligibility for prizes.


  • Submitted descriptors for a video may not make use of other videos (query or reference) in the test set.

The goal of the competition is to reflect real world settings where new videos are continuously added to a reference database, and old videos are removed from it. Therefore, the generation of descriptors for each video must be independent from other videos.

Number of descriptors

The limitation of one descriptor per second of video is a global limitation across all videos in the test set (~8,000 videos) and in the code execution test subset (~800 videos). You may, if you wish, choose to allocate more descriptors to some videos and fewer descriptors to others, as long as the number of descriptors you allocate to a particular video is a function of that video alone. Specifically, you may not dynamically allocate the number of descriptors calculated for a particular video based on other videos in the set.

For instance, a participant could assign descriptors for certain frames a "priority", and calibrate a priority threshold to match the total descriptor budget. Participants may consider the total dataset length (as computed from the metadata file) to compute a descriptor budget and dynamically select a threshold, but may not look at the length distribution or conduct other analysis that would violate the independence criteria.

For conducting inference on a subset of videos in the code execution runtime (and for Phase 2), the same restriction applies: you may calculate the total length of videos in the subset and use that to dynamically select a priority threshold that ensures that the number of descriptors that you submit falls within this budget, but you may not use information about the distribution of video lengths to inform your priority scores.
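The priority-threshold idea described above can be sketched as follows. This is one possible reading of the rule, not an official procedure: `select_by_priority` and its data layout are hypothetical, the budget is assumed to be the total duration rounded down to whole seconds, and each priority score is assumed to have been computed from its own video alone.

```python
import math


def select_by_priority(candidates_by_video, duration_by_video):
    """Keep the highest-priority descriptors that fit the global budget.

    candidates_by_video: dict of video id -> list of (priority, descriptor)
    duration_by_video:   dict of video id -> duration in seconds
    """
    # Only the TOTAL length is used, never the distribution of lengths.
    budget = math.floor(sum(duration_by_video.values()))

    priorities = sorted(
        (p for cands in candidates_by_video.values() for p, _ in cands),
        reverse=True,
    )
    if len(priorities) <= budget:
        threshold = -math.inf  # everything fits
    else:
        # The first priority that does not fit; keeping only strictly higher
        # priorities guarantees at most `budget` descriptors, even with ties.
        threshold = priorities[budget]

    return {
        vid: [d for p, d in cands if p > threshold]
        for vid, cands in candidates_by_video.items()
    }


candidates = {
    "A": [(0.9, "a1"), (0.2, "a2"), (0.5, "a3")],
    "B": [(0.7, "b1"), (0.1, "b2")],
}
selected = select_by_priority(candidates, {"A": 1.6, "B": 1.5})
```

In this toy example the total length is 3.1 seconds, so three descriptors are kept: two from video A and one from video B.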

Annotations and Training

"Adding annotations" is defined as creating additional labels beyond the provided ground truth. Adding annotations to query or reference videos in the training set is permitted.

  • Adding annotations to any videos in the Phase 1 test dataset is prohibited.
  • Training on the test dataset is also prohibited, including simple models such as whitening or dimensionality reduction.
  • Adding annotations to any videos used in Phase 2 is prohibited.

Participants must train all models on the training dataset. Participants may use the training dataset for training, or as a reference distribution for similarity normalization techniques.

"Augmenting" refers to applying a transformation to an image or video to generate a new image or video. Augmentation of both training videos and test set videos is permitted, provided that videos in the test set are not used for training purposes in either augmented or original form.

Data sources

External data is allowed subject to the rules governing "External Data" in the Official Rules. Note that as the source of the challenge data, the YFCC100M dataset is not external data and training using data from this source is prohibited (with the exception of the training data provided through the competition).

End of Phase 1

In order to be prize eligible, prior to the end of Phase 1 you will be prompted to select the Phase 1 submission you would like to use for Phase 2.

After Phase 1 closes, no further model development is permitted. For Phase 2, you will simply be applying the Phase 1 model you selected to the new Phase 2 query set. Additional training of your Phase 1 model or annotation of the Phase 2 data will be grounds for disqualification.

You are responsible for storing and keeping track of previous versions of your model, in case any of these end up being the version you want to use in Phase 2. Differences between your Phase 2 submission and your selected Phase 1 submission are grounds for disqualification. DrivenData will not provide a means to download previously submitted models or code.

Submission Format

This is a hybrid code execution challenge! In addition to submitting your video descriptors for the query and reference videos in the test dataset, you'll package everything needed to do inference and submit that for containerized execution on a subset of the test set.

Note that this is different from the previous Image Similarity Challenge. You will not only submit your matches or descriptors; eventually, you must also include a script to run inference for your solution, which we execute on our compute cluster to ensure solutions meet requirements for being computationally efficient. The script is not required in order to make a submission and be scored, but it will be required in order to be prize-eligible. You may choose to focus on the other parts of your solution initially, and turn to the code execution component once your solution is more developed.

To be eligible for prizes, your code submission must contain a script that is capable of generating descriptor vectors for any specified query video in the test set at a specified resource constraint of no more than 10 seconds per query video on the hardware provided in the execution environment. For a full explanation of the inference process and for more details on how to make your executable code submission, see the Code Submission Format page.

You may make submissions without executable code in order to see how your submitted descriptors are scored. Note that these submissions should not be selected at the end of Phase 1, since they are not eligible for Phase 2 prizes.

All submissions, whether or not they contain executable code, will count towards the submission limit specified on the Submissions page. This limit may be updated during the challenge based on resource availability.

As noted above, each video's descriptors must be independent of other videos. In addition, participants are allowed to use in-memory or on-disk caching within one submission execution in order to reduce redundant computations, so long as the independence requirement is not violated.

Performance Metric

Submissions will be evaluated by their micro-average precision, also known as global average precision. This is related to the common classification metric, average precision, also known as the area under the precision-recall curve or the average precision over recall values. The micro-average precision metric will be computed based on the submitted descriptors, and the result will be published on the Phase 1 leaderboard (remember that this is not the final leaderboard).

To generate predictions, the code execution environment will run an exhaustive similarity search, comparing each submitted query vector to each submitted reference vector, and using the maximum of the inner-product similarity between query and reference vectors for a given query-reference video pair as the ranking score for the prediction that the query video has content derived from the reference video.
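The ranking score for one query-reference video pair can be sketched as below. This illustrates the definition only and is not the evaluation script; the function names are hypothetical, and plain Python lists stand in for float32 arrays.

```python
def inner_product(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pair_score(query_descriptors, reference_descriptors):
    """Maximum inner product over all descriptor pairs of two videos."""
    return max(
        inner_product(q, r)
        for q in query_descriptors
        for r in reference_descriptors
    )


query = [[1.0, 0.0], [0.0, 1.0]]      # two descriptors for one query video
reference = [[0.5, 0.5], [0.0, 2.0]]  # two descriptors for one reference video
score = pair_score(query, reference)
```

The four pairwise inner products here are 0.5, 0.0, 0.5, and 2.0, so the pair score is 2.0. Because the inner product is not normalized, the norm of each descriptor affects the score as well as its direction.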

Mathematically, average precision is computed as follows:

$$ \text{AP} = \sum_k (R_k - R_{k-1}) P_k $$

where \(P_k\) and \(R_k\) are the precision and recall, respectively, when thresholding at the kth score threshold of the predictions sorted in order of decreasing score.

This metric summarizes model performance over a range of operating thresholds from the highest predicted score to the lowest predicted score. This is equivalent to estimating the area under the precision–recall curve with the finite sum over every threshold in the prediction set, without interpolation. See the Wikipedia entry on average precision for further details.

For this challenge, micro-average precision is then average precision calculated globally as a micro-average of all predictions across queries.
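A minimal sketch of the metric, following the formula above with all predictions pooled across queries, is given below. It is not the official scoring code: the function name and input layout are hypothetical, each (query, reference) pair is assumed to appear at most once in the predictions, and tied scores are simply ranked in sort order.

```python
def micro_average_precision(predictions, ground_truth):
    """Average precision over all predictions pooled across queries.

    predictions:  list of (query_id, ref_id, score), one entry per pair
    ground_truth: set of (query_id, ref_id) pairs that are true matches
    """
    if not ground_truth:
        return 0.0
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)

    true_positives = 0
    ap = 0.0
    for k, (query_id, ref_id, _score) in enumerate(ranked, start=1):
        if (query_id, ref_id) in ground_truth:
            true_positives += 1
            precision_at_k = true_positives / k
            # Recall rises by 1 / |ground truth| at each hit and is
            # unchanged otherwise, so only hits contribute to the sum.
            ap += precision_at_k / len(ground_truth)
    return ap


predictions = [("Q1", "R1", 0.9), ("Q2", "R5", 0.8), ("Q1", "R2", 0.7)]
truth = {("Q1", "R1"), ("Q1", "R2"), ("Q3", "R9")}
ap = micro_average_precision(predictions, truth)
```

In this toy example the hits occur at ranks 1 and 3 with precision 1 and 2/3, and there are three true pairs, so the result is (1 + 2/3) / 3 = 5/9. The true pair that was never predicted lowers the score because recall never reaches 1.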

The subset of descriptors generated by the participant's code within the code execution environment will also be scored. This score does not determine leaderboard ranking, but is a secondary metric provided to ensure that the code does, in fact, generate useful query descriptors.

Resource constraint

In addition to performance on this average precision metric, Meta AI also values the resource efficiency of proposed solutions. To be eligible for prizes, your inference code must generate descriptors in the containerized code execution environment at a rate of no more than 10 seconds per query video.
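A rough local check of the time budget might look like the sketch below. The helper `average_seconds_per_video` and the `process_video` callable are hypothetical placeholders for your own inference code, and timings measured on your own machine will not necessarily match those on the hardware in the execution environment.

```python
import time

SECONDS_PER_QUERY = 10.0  # limit per query video stated above


def average_seconds_per_video(process_video, video_paths):
    """Run process_video on each path and return mean wall-clock seconds."""
    start = time.perf_counter()
    for path in video_paths:
        process_video(path)
    elapsed = time.perf_counter() - start
    return elapsed / len(video_paths)


# Placeholder workload; replace the lambda with real descriptor extraction.
average = average_seconds_per_video(lambda path: None, ["a.mp4", "b.mp4"])
within_budget = average <= SECONDS_PER_QUERY
```

Measuring the whole loop rather than each video separately reflects an average constraint: a few slow videos can be offset by faster ones.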

End Phase 1 Submission

Prior to the end of Phase 1, you will be prompted to submit information about the models you will be using in Phase 2. In order to be eligible for final prizes in Phase 2, you will need to specify up to three of your submissions that you would like to carry forward into Phase 2. The reference descriptors from these submissions will be used for the purposes of the similarity score calculation when comparing to the submitted descriptors for Phase 2. As noted above, you will not be allowed to re-train your Phase 1 model when participating in Phase 2. You will simply be applying your existing Phase 1 model onto a new data set. It is your responsibility to ensure that your Phase 2 submissions contain identical models and code to those you have chosen to move forward from Phase 1. If any changes must be made for your code to successfully execute in the code execution environment, please document them. DrivenData has the discretion to determine whether such changes are permissible.

Note: As we approach the deadline for Phase 1 of this competition, you may see queue times for the code submission runtime increasing. In the event that you would like to select a submission to carry into Phase 2 that does not finish execution until after the deadline, please contact the DrivenData team via email or the forum. Any changes to submission selection must be communicated to DrivenData prior to 23:59:59 UTC on Thursday, March 30.

For Phase 2, submissions will also be required to include models and code to perform inference on a query subset. This query subset will contain the same number of videos as the analogous query subset from Phase 1. Submissions are expected to meet the same 10-seconds-per-query-video average runtime constraint. The overall time limit will include a small margin to allow for minor unexpected variability in runtime performance, but you should plan to have your solution meet the same constraint for the Phase 2 test set.

Good luck!

Good luck and enjoy this problem! If you have any questions you can always visit the user forum!

NO PURCHASE NECESSARY TO ENTER/WIN. A PURCHASE WILL NOT INCREASE YOUR CHANCES OF WINNING. The Competition consists of two (2) Phases, with winners determined based upon Submissions using the Phase II dataset. The start and end dates and times for each Phase will be set forth on this Competition Website. Open to legal residents of the Territory, 18+ & age of majority. "Territory" means any country, state, or province where the laws of the US or local law do not prohibit participating or receiving a prize in the Challenge and excludes any area or country designated by the United States Treasury's Office of Foreign Assets Control (e.g. Crimea, Donetsk, and Luhansk regions of Ukraine, Cuba, North Korea, Iran, Syria), Russia and Belarus. Any Participant use of External Data must be pursuant to a valid license. Void outside the Territory and where prohibited by law. Participation subject to official Competition Rules. Prizes: $25,000 USD (1st), $15,000 (2nd), $10,000 USD (3rd) for each of two tracks. See Official Rules and Competition Website for submission requirements, evaluation metrics and full details. Sponsor: Meta Platforms, Inc., 1 Hacker Way, Menlo Park, CA 94025 USA.