Meta AI Video Similarity Challenge: Matching Track | Phase 1

Help keep social media safe by identifying whether a video contains a manipulated clip from one or more videos in a reference set.


Problem Description

For this competition, your goal is to identify whether a query video contains manipulated portions of one or more videos in a reference set. You will receive a reference set of approximately 40,000 videos and a query set of approximately 8,000 videos. Some query videos contain clips derived from videos in the reference set, and some do not.

For the Matching Track, your task is to generate predicted matches, which consist of query-reference video pairs, timestamps that correspond to the start and end of the derived content in the query and reference videos, and a confidence score. The metric for this task is micro-average precision.


The Challenge


Background

The primary domain of this competition is a field of computer vision called copy detection. For images, copy detection is difficult but well defined. Videos add a time dimension: a single video may contain thousands of frames in sequence, so the problem becomes not just copy detection but also localization, which presents a novel challenge. Our competition sponsors at Meta AI believe that this problem is ripe for innovation, particularly with regard to solutions that are faster and require fewer resources to be performant.

Phase 1

During the first phase of the competition, teams will have access to the training dataset of approximately 8,000 query videos and approximately 40,000 reference videos, with labels that specify the portions of query videos that contain derived content along with the portions of the reference videos from which the content was derived. In addition, teams will have access to a distinct Phase 1 test dataset containing approximately 8,000 unlabeled query videos and approximately 40,000 reference videos. Teams will be able to make submissions to be evaluated against the ground truth labels for the Phase 1 test dataset. Scores will appear on the public leaderboard, but will not determine your team's final prize ranking.

Note: You will not be allowed to re-train your model in Phase 2, so it must be finalized by the end of Phase 1 (March 24, 2023 23:59:59 UTC). In order to be eligible for final prizes, you will need to submit a file containing code and metadata. See below for more details.

The Dataset


Overview

For this competition, Meta AI has compiled a new dataset composed of approximately 100,000 videos derived from the YFCC100M dataset. This dataset has been divided for the purposes of this competition into a training set, which participants may use to train their models, and a test set on which training is prohibited. Both the train and test sets are further divided into a set of reference videos, and a set of query videos that may or may not contain content derived from one or more videos in the reference set. Query videos are, on average, about 40 seconds long, while reference videos are, on average, about 30 seconds long. In the case of the training set, a set of labels is included indicating which query videos contain derived content.

Let's look at some examples to understand different ways in which content from reference videos may be included in query videos.

The video on the left is a reference video. The video on the right is a query video that contains a clip that has been derived from the reference video -- in this case, a portion of the query video (from 0:00.3 to 0:05.7) contains a portion of the reference video (from 0:24.0 to 0:29.43) that has been surrounded in all four corners by cartoon animals and had a static effect applied:

[Video pair: reference video (credit: walkinguphills); query video derived from it (credit: walkinguphills, Timmy and MiNe (sfmine79))]


Edited query videos may have been modified using a number of techniques. The portion of the following reference video (from 0:05.0 to 0:16.6) inserted into the query video (from 0:05.6 to 0:16.9) has been overlaid with a transparency effect:

[Video pair: reference video (credit: Timo_Beil); query video derived from it (credit: Timo_Beil and Neeta Lind)]


Editing techniques employed on the challenge data set include:

  • blending videos together
  • altering brightness or contrast
  • jittering color or converting to grayscale
  • blurring or pixelization
  • altering aspect ratio or video speed, looping
  • rotation, scaling, cropping, padding
  • adding noise or changing encoding quality
  • flipping or stacking horizontally or vertically
  • formatting as a meme; overlaying of dots, emoji, shapes, or text
  • transforming perspective and shaking; transition effects like cross-fading
  • creating multiple small scenes; inserting content in the background

Please also note that as part of the data cleaning process, faces have been obscured from the videos using a segmentation mask.

Although synthetically created by our sponsors at Meta AI, the transformations are similar to those seen in a production setting on social media platforms, with a particular emphasis on the ones that are more difficult to detect. Each query video that does contain content derived from a reference video will contain at least five seconds of video derived from that particular reference video. It is also possible for a query video to have content derived from multiple reference videos.

Performant solutions should be able to differentiate between videos that have been copied and manipulated and videos that are simply quite similar. Below are two examples of such similar, but not copied, videos. The actual data contains query videos that include a copied portion of a reference video, query videos that are merely similar to reference videos, and query videos that are not derived from the reference set in any way.

[Video pair: reference video and a similar, but not derived, query video (credit: thievingjoker for both)]


[Video pair: reference video and a similar, but not derived, query video (credit: Gwydion M. Williams for both)]


TOOL TIP: Looking for great tools to help augment your training set? The Meta AI team has released an open source data augmentation library to help build more robust AI models. Check out AugLy on GitHub to get started!
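For instance, here is a minimal sketch of producing augmented copies of a training query video with AugLy's video module. The function names (hflip, overlay_emoji) are taken from the AugLy documentation, and the file paths are placeholders; check the library for exact signatures and defaults before relying on them.

import os
import augly.video as vidaugs

# Path to a training query video (placeholder -- adjust to your local layout)
input_path = "train/query/Q100001.mp4"
os.makedirs("augmented", exist_ok=True)

# Horizontally flip the video and write the result to a new file
vidaugs.hflip(input_path, output_path="augmented/Q100001_hflip.mp4")

# Overlay an emoji, one of the transformation types used in the challenge data
vidaugs.overlay_emoji(input_path, output_path="augmented/Q100001_emoji.mp4")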

Data Phases

Data will be made available in separate tranches over the course of Phase 1 and Phase 2.

In Phase 1, you will have access to the training and test datasets, described in more detail below. Training is permitted on the training set, but is prohibited for the test set.

For your Phase 1 submissions, you will submit a list of predicted matches for all videos in the query test set, as well as code capable of generating predicted matches for an unseen query video and database.

In Phase 2, you will have access to approximately 8,000 new query videos, which may or may not contain content derived from videos in the Phase 1 test reference corpus.

Details

To access the Phase 1 data, first make sure you are registered for the competition, and then download the data following the instructions in the data tab. The corpus is very large, so this might take a little while. Make sure you have a stable internet connection.

You will be able to download the following:

  • The train dataset, containing:
    • reference corpus of 40,311 reference video mp4 files. Filenames correspond to each reference video id, going from R100001.mp4 to R140311.mp4.
    • query corpus containing 8,404 query video mp4 files. Filenames correspond to each query video id, going from Q100001.mp4 to Q108404.mp4 in Phase 1.
    • train_query_metadata.csv and train_reference_metadata.csv files that contain metadata about each video
    • train_matching_ground_truth.csv file containing the ground truth for the query videos which contain content derived from reference videos:
      • query_id gives the query video id
      • ref_id gives the id of the corresponding reference video that the query video was derived from
      • query_start gives the timestamp at which the derived content begins in the query video
      • query_end gives the timestamp at which the derived content ends in the query video
      • ref_start gives the timestamp at which the derived content begins in the reference video
      • ref_end gives the timestamp at which the derived content ends in the reference video
  • The test dataset, containing:
    • reference corpus of 40,318 reference video mp4 files. Filenames correspond to each reference video id, going from R200001.mp4 to R240318.mp4.
    • query corpus containing 8,295 query video mp4 files. Filenames correspond to each query video id, going from Q200001.mp4 to Q208295.mp4 in Phase 1.
    • test_query_metadata.csv and test_reference_metadata.csv files that contain metadata about each video

It is possible for a query video to have matched segments from multiple reference videos, in which case the ground truth contains multiple rows, one for each matched segment.
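As a quick sanity check, the ground truth can be loaded and inspected with pandas. This is a minimal sketch; the file path is a placeholder for wherever you saved the training data.

import pandas as pd

# Load the matching ground truth (placeholder path)
gt = pd.read_csv("train/train_matching_ground_truth.csv")

# Each row is one matched segment; a query video appears in multiple rows
# if it contains content derived from more than one reference video.
print(gt.columns.tolist())

# Query videos with matched segments from more than one reference video
refs_per_query = gt.groupby("query_id")["ref_id"].nunique()
print(refs_per_query[refs_per_query > 1].head())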

In Phase 2, you will receive access to a new corpus of approximately 8,000 query videos, with filenames going from Q300001.mp4 to Q308001.mp4. Phase 2 will utilize the same corpus of reference videos as the Phase 1 test dataset.

RULES ON DATA USE

In order to more closely simulate the needs of a system operating at production scale, and to help ensure that solutions generalize well to new data, the following rules apply for this competition. Eligible solutions must adhere to these requirements. Any violation of these rules is grounds for disqualification and ineligibility for prizes.

Independence

  • Submitted match predictions for a given query-reference video pair may not depend on any other test set videos.

The goal of the competition is to reflect real world settings where new videos are continuously added to a reference database, and old videos are removed from it. Therefore, the prediction of matching segments and their assigned scores must only depend on the videos in a given query-reference video pair and must not depend on other query or reference videos. This means that the score of a matching segment for a given video pair would be the same even if the entire dataset consisted of only that single query video and that single reference video.

Further, KNN retrieval is not allowed, since this is affected by other references. Participants should use retrieval that returns all results within a threshold similarity (or distance), as provided by the baseline solution. Selecting a distance threshold on the test dataset (e.g. to provide a certain number of candidates) is allowed.
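As an illustration, here is a minimal sketch of threshold-based retrieval over video descriptors using faiss, which supports range_search on flat CPU indexes. The descriptor dimension, threshold value, and random descriptors are placeholders; the baseline solution provides a full reference implementation.

import faiss
import numpy as np

d = 512                                                       # descriptor dimension (placeholder)
ref_descriptors = np.random.rand(1000, d).astype("float32")   # placeholder reference descriptors
query_descriptors = np.random.rand(10, d).astype("float32")   # placeholder query descriptors
faiss.normalize_L2(ref_descriptors)
faiss.normalize_L2(query_descriptors)

index = faiss.IndexFlatIP(d)              # inner product on L2-normalized vectors
index.add(ref_descriptors)

threshold = 0.5                           # similarity threshold (tunable)
lims, similarities, ref_ids = index.range_search(query_descriptors, threshold)

# Results for query i are the slice lims[i]:lims[i + 1]
for i in range(len(query_descriptors)):
    for j in range(lims[i], lims[i + 1]):
        print(i, ref_ids[j], similarities[j])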

Annotations and Training

"Adding annotations" is defined as creating additional labels beyond the provided ground truth. Adding annotations to query or reference videos in the training set is permitted.

  • Adding annotations to any videos in the Phase 1 test dataset is prohibited.
  • Training on the test dataset is also prohibited, including simple models such as whitening or dimensionality reduction.
  • Adding annotations to any videos used in Phase 2 is prohibited.

Participants must train all models on the training dataset. The training dataset may be used both for model training and as a reference distribution for similarity normalization techniques.
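One common way to use the training set as a reference distribution is score normalization: subtract a query's typical similarity to unrelated training videos from its raw similarity to a candidate reference. The sketch below illustrates how such a technique might look; it is an assumption for illustration, not a prescribed method, and it keeps each query-reference score independent of other test videos.

import numpy as np

def normalized_similarity(query_desc, ref_desc, train_descs, beta=1.0):
    """Score one query-reference pair against a training-set background.

    query_desc, ref_desc: L2-normalized 1-D descriptors.
    train_descs: matrix of L2-normalized descriptors from the training set.
    """
    raw = float(query_desc @ ref_desc)                      # cosine similarity
    background = float(np.mean(train_descs @ query_desc))   # training-set baseline
    return raw - beta * background                          # normalized score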

Augmentation
"Augmenting" refers to applying a transformation to an image or video to generate a new image or video. Augmentation of both training videos and test set videos is permitted, provided that videos in the test set are not used for training purposes in either augmented or original form.

Data sources
External data is allowed subject to the rules governing "External Data" in the Official Rules. Note that as the source of the challenge data, the YFCC100M dataset is not external data and training using data from this source is prohibited (with the exception of the training data provided through the competition).

End of Phase 1
In order to be prize eligible, prior to the end of Phase 1 you will be prompted to select the Phase 1 submission you would like to use for Phase 2.

After Phase 1 closes, no further model development is permitted. For Phase 2, you will simply be applying the Phase 1 model you selected to the new Phase 2 query set. Additional training of your Phase 1 model or annotation of the Phase 2 data will be grounds for disqualification.

You are responsible for storing and keeping track of previous versions of your model, in case any of these end up being the version you want to use in Phase 2. Differences between your Phase 2 submission and your selected Phase 1 submission are grounds for disqualification. DrivenData will not provide means to download previously submitted model or code.

Submission Format


This is a hybrid code execution challenge! Your submission will include both a csv with your predictions and a main.py script which will be executed by the platform on a subset of the challenge data.

submission.zip
├── full_matches.csv      # csv file containing test set predictions
├── main.py               # script to generate predictions
├── model_assets/         # directory containing model assets,
                          # e.g. a checkpoint file

Your full_matches.csv will contain your match predictions for the test dataset. Note that you may have multiple match predictions for a given query video. The full_matches.csv must contain the same column headings as the ground truth file, plus a confidence score (a minimal sketch of assembling this file follows the list):

  • query_id gives the query video id
  • ref_id gives the id of the corresponding reference video that the query video was derived from
  • query_start gives the timestamp at which the derived content begins in the query video
  • query_end gives the timestamp at which the derived content ends in the query video
  • ref_start gives the timestamp at which the derived content begins in the reference video
  • ref_end gives the timestamp at which the derived content ends in the reference video
  • score gives the confidence score for the given match
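For reference, here is a minimal sketch of assembling full_matches.csv with pandas; the rows shown are placeholders, not real predictions.

import pandas as pd

# Placeholder predictions: (query_id, ref_id, query_start, query_end,
#                           ref_start, ref_end, score)
predictions = [
    ("Q200001", "R200123", 3.2, 10.5, 14.0, 21.3, 0.87),
    ("Q200001", "R207654", 12.0, 18.4, 0.0, 6.4, 0.42),
]

columns = ["query_id", "ref_id", "query_start", "query_end",
           "ref_start", "ref_end", "score"]
pd.DataFrame(predictions, columns=columns).to_csv("full_matches.csv", index=False)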

Your main.py script must be capable of generating predicted and scored matches for any specified query video in the test set. For full details on how to make your executable code submission, see the Code Submission Format page.

You may make submissions without main.py in order to see how your predicted matches are scored without submitting executable code. Note that these submissions should not be selected at the end of Phase 1, since they are not eligible for Phase 2 prizes.

All submissions, whether or not they contain executable code, will count towards the submission limit specified on the Submissions page. This limit may be updated during the challenge based on resource availability.

As noted above, the prediction for each query–reference pair must be independent; in other words, the results must not depend on the existence of any other videos. Participants are allowed to use in-memory or on-disk caching within one submission execution in order to reduce redundant computations, so long as the independence requirement is not violated.

Performance Metric


Submissions will be evaluated by a segment-matching version of micro-average precision, also known as global average precision. This is related to the common classification metric, average precision, also known as the area under the precision-recall curve or the average precision over recall values. The micro-average precision metric will be computed based on the matches the participants submit, and the result will be published on the Phase 1 leaderboard (remember that this is not the final leaderboard).

Mathematically, micro-average precision is computed as follows:

$$ \text{AP} = \sum_k (R_k - R_{k-1}) P_k $$

where \(P_k\) and \(R_k\) are the precision and recall, respectively, when thresholding at the kth score threshold of the predictions sorted in order of decreasing score.

This metric summarizes model performance over a range of operating thresholds from the highest predicted score to the lowest predicted score. This is equivalent to estimating the area under the precision–recall curve with the finite sum over every threshold in the prediction set, without interpolation. See this wikipedia entry for further details.
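In code, the finite sum looks like the following minimal sketch. For the challenge metric, the precision and recall values are the segment-overlap definitions given below, not binary hit/miss counts.

import numpy as np

def average_precision(precision, recall):
    """AP = sum over thresholds of (R_k - R_{k-1}) * P_k.

    precision, recall: values at each score threshold, with predictions
    sorted by decreasing score.
    """
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    recall_gain = np.diff(recall, prepend=0.0)   # R_k - R_{k-1}, with R_0 = 0
    return float(np.sum(recall_gain * precision))

# Example with three thresholds (illustrative values only)
print(average_precision([1.0, 0.5, 0.67], [0.33, 0.33, 0.67]))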

For this challenge, micro-average precision is average precision computed globally over the pooled set of predictions across all queries, rather than averaged per query.

Computing precision and recall for the video similarity task can be a little tricky because, unlike other information retrieval tasks, we don't have a binary indicator of whether a predicted match is correct or not. Instead, it is the degree of overlap between the predictions and the ground truth that matters. For that reason, we follow the definitions of \(P_k\) and \(R_k\) from He et al.

To better understand how this works, let's start by visualizing a particular query-reference pair side-by-side:

A representation of a query video containing a portion derived from a portion of a reference video

We can visualize predicted matches as slices of each query and reference video that we believe create a derived content relationship, as shown here:

A representation of a prediction that a query video contains a portion derived from a portion of a reference video

This one-axis visualization is a bit cumbersome. We can instead treat the reference video time axis as the x-axis and the query video time axis as the y-axis to transform this into a two-dimensional visualization:

A two-axis representation of a prediction that a query video contains a portion derived from a portion of a reference video

This representation presents our matching problem as a direct analog of a computer vision image segmentation evaluation metric like Intersection over Union (IoU). However, we want a metric that is easy to compute over a dataset with potentially a large number of predictions and correct matches, where IoU can be complicated to compute. Further, we want a metric that summarizes accuracy over a range of operating points, so that participants’ scores do not critically rely on their choice of operating point. We therefore define the overall precision (and recall) as the geometric mean of the precision (and recall) of predicted segments along each axis:

A two-axis representation of a prediction that a query video contains a portion derived from a portion of a reference video, with equations specifying the relationship of the two-dimensional representations to the corresponding equations for precision and recall

The above representations correspond to a single prediction. In the fully generic case, with multiple predictions and ground truths, precision and recall are defined as follows.

$$ P_k = \sqrt{\frac{\sum_{j=0}^{n_k} L_{O_j^k}^x}{\sum_{j=0}^{n_k} L_{P_j^k}^x} \cdot \frac{\sum_{j=0}^{n_k} L_{O_j^k}^y}{\sum_{j=0}^{n_k} L_{P_j^k}^y}} $$

$$ R_k = \sqrt{\frac{\sum_{i=0}^{m} L_{O_i^k}^x}{\sum_{i=0}^{m} L_{G_i}^x} \cdot \frac{\sum_{i=0}^{m} L_{O_i^k}^y}{\sum_{i=0}^{m} L_{G_i}^y}} $$

where \(O^k\) is the set of overlapping boxes between the predictions at rank \(k\) and the ground truth; \(P^k\) is the set of all prediction boxes at rank \(k\); \(G\) is the set of all ground truth boxes; \(n_k\) is the number of prediction boxes at rank \(k\); \(m\) is the number of ground truth boxes; and \(L^x\) and \(L^y\) denote a box's total length along the reference (x) and query (y) axes, respectively. This definition appears rather complex, but it can be understood more easily when broken down into its constituent components.
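To make the idea concrete, here is a minimal sketch for the simplest case of a single prediction and a single ground-truth match; the full metric sums overlap lengths over all boxes at rank \(k\) exactly as in the formulas above.

import math

def interval_overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two 1-D intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def segment_precision_recall(pred, gt):
    """pred and gt are (query_start, query_end, ref_start, ref_end) tuples."""
    # Overlap between prediction and ground truth along each axis
    q_overlap = interval_overlap(pred[0], pred[1], gt[0], gt[1])   # query (y) axis
    r_overlap = interval_overlap(pred[2], pred[3], gt[2], gt[3])   # reference (x) axis

    pred_q_len, pred_r_len = pred[1] - pred[0], pred[3] - pred[2]
    gt_q_len, gt_r_len = gt[1] - gt[0], gt[3] - gt[2]

    # Geometric mean of the per-axis precision and recall
    precision = math.sqrt((r_overlap / pred_r_len) * (q_overlap / pred_q_len))
    recall = math.sqrt((r_overlap / gt_r_len) * (q_overlap / gt_q_len))
    return precision, recall

# Prediction covers the true segment plus some extra on both axes
print(segment_precision_recall((0.0, 10.0, 20.0, 30.0), (2.0, 8.0, 22.0, 28.0)))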

Secondary Metric

Each submission will also be scored on its performance on a subset of the test set for which matches are generated by the participant's code within the code execution environment. This is a secondary metric provided to ensure that the code does, in fact, generate useful matches.

End Phase 1 Submission


Prior to the end of Phase 1, you will be prompted to submit information about the models you will be using in Phase 2. In order to be eligible for final prizes in Phase 2, you will need to specify up to three of your submissions that you would like to carry forward into Phase 2. As noted above, you will not be allowed to re-train your Phase 1 model(s) when participating in Phase 2. You will simply be applying your existing Phase 1 model(s) to a new dataset. It is your responsibility to ensure that your Phase 2 submissions contain identical models and code to those you have chosen to carry forward from Phase 1. If any changes must be made for your code to execute successfully in the code execution environment, please document them. DrivenData has the discretion to determine whether such changes are permissible.

Note: As we approach the deadline for Phase 1 of this competition, you may see queue times for the code submission runtime increasing. In the event that you would like to select a submission to carry into Phase 2 that does not finish execution until after the deadline, please contact the DrivenData team via email or the forum. Any changes to submission selection must be communicated to DrivenData prior to 23:59:59 UTC on Thursday, March 30.

For Phase 2, submissions will also be required to include models and code to perform inference on a query subset. This query subset will contain the same number of videos as the analogous query subset from Phase 1. Submissions are expected to meet the same 10-seconds-per-query-video average runtime constraint. The overall time limit will include a small margin to allow for minor unexpected variability in runtime performance, but you should plan to have your solution meet the same constraint for the Phase 2 test set.

Good luck!


Good luck and enjoy this problem! If you have any questions you can always visit the user forum!


NO PURCHASE NECESSARY TO ENTER/WIN. A PURCHASE WILL NOT INCREASE YOUR CHANCES OF WINNING. The Competition consists of two (2) Phases, with winners determined based upon Submissions using the Phase II dataset. The start and end dates and times for each Phase will be set forth on this Competition Website. Open to legal residents of the Territory, 18+ & age of majority. "Territory" means any country, state, or province where the laws of the US or local law do not prohibit participating or receiving a prize in the Challenge and excludes any area or country designated by the United States Treasury's Office of Foreign Assets Control (e.g. Crimea, Donetsk, and Luhansk regions of Ukraine, Cuba, North Korea, Iran, Syria), Russia and Belarus. Any Participant use of External Data must be pursuant to a valid license. Void outside the Territory and where prohibited by law. Participation subject to official Competition Rules. Prizes: $25,000 USD (1st), $15,000 (2nd), $10,000 USD (3rd) for each of two tracks. See Official Rules and Competition Website for submission requirements, evaluation metrics and full details. Sponsor: Meta Platforms, Inc., 1 Hacker Way, Menlo Park, CA 94025 USA.