Meta AI Video Similarity Challenge: Descriptor Track (Open Arena)

Help keep social media safe by identifying whether a video contains a manipulated clip from one or more videos in a reference set. #society

advanced practice
dec 2023
38 joined

Problem Description

For this competition, your goal is to identify whether a query video contains manipulated portions of one or more videos in a reference set. You will receive a reference set of approximately 40,000 videos and a query set of approximately 8,000 videos. Some query videos contain clips derived from videos in the reference set, and some do not.

For this Descriptor Track, your task is to compute descriptors (embeddings) for each video in the query and reference sets. Your descriptors will be 32-bit floating point vectors of up to 512 dimensions. You may use multiple descriptors for each video, up to one descriptor for every second of video (see rules).

You will then supply your generated descriptors to a provided script, which computes the inner-product similarity between each pair of descriptor vectors in the query and reference sets and generates a list of predicted matching pairs, each with an associated confidence score. The confidence score for a video pair is the maximum of the pairwise inner-product similarities computed between the descriptor embeddings of the two videos. These predictions are then used to compute micro-average precision.

The Challenge


Background

The primary domain of this competition is a field of computer vision called copy detection. For images, the problem of copy detection is difficult but straightforward. With the addition of a time dimension, videos contain potentially thousands of images in sequence, and the problem of not just copy detection but also localization presents a novel challenge. Our competition sponsors at Meta AI believe that this problem is ripe for innovation, particularly with regard to solutions that are faster and require fewer resources to be performant.

The Dataset


Overview

For this competition, Meta AI has compiled a new dataset composed of approximately 100,000 videos derived from the YFCC100M dataset. This dataset has been divided for the purposes of this competition into a training set, which participants may use to train their models, and a test set on which training is prohibited. Both the train and test sets are further divided into a set of reference videos, and a set of query videos that may or may not contain content derived from one or more videos in the reference set. Query videos are, on average, about 40 seconds long, while reference videos are, on average, about 30 seconds long. In the case of the training set, a set of labels is included indicating which query videos contain derived content.

Let's look at some examples to understand different ways in which content from reference videos may be included in query videos.

The video on the left is a reference video. The video on the right is a query video that contains a clip that has been derived from the reference video -- in this case, a portion of the query video (from 0:00.3 to 0:05.7) contains a portion of the reference video (from 0:24.0 to 0:29.43) that has been surrounded in all four corners by cartoon animals and had a static effect applied:

reference video query video derived from reference video
Credit: walkinguphills Credit: walkinguphills, Timmy and MiNe (sfmine79)


Edited query videos may have been modified using a number of techniques. The portion of the following reference video (from 0:05.0 to 0:16.6) inserted into the query video (from 0:05.6 to 0:16.9) has been overlaid with a transparency effect:

reference video query video derived from reference video
Credit: Timo_Beil Credit: Timo_Beil and Neeta Lind


Editing techniques employed on the challenge dataset include:

  • blending videos together
  • altering brightness or contrast
  • jittering color or converting to grayscale
  • blurring or pixelization
  • altering aspect ratio or video speed, looping
  • rotation, scaling, cropping, padding
  • adding noise or changing encoding quality
  • flipping or stacking horizontally or vertically
  • formatting as a meme; overlaying of dots, emoji, shapes, or text
  • transforming perspective and shaking; transition effects like cross-fading
  • creating multiple small scenes; inserting content in the background

Please also note that as part of the data cleaning process, faces have been obscured from the videos using a segmentation mask.

Although synthetically created by our sponsors at Meta AI, the transformations are similar to those seen in a production setting on social media platforms, with a particular emphasis on the ones that are more difficult to detect. Each query video that does contain content derived from a reference video will contain at least five seconds of video derived from that particular reference video. It is also possible for a query video to have content derived from multiple reference videos.

Performant solutions should be able to differentiate between videos that have been copied and manipulated and videos that are simply quite similar. Below are two examples of such similar, but not copied, videos. The actual data will contain queries that contain some copied portion of a reference video, some videos quite similar to some reference videos, and some videos that have not been derived in any way from the reference set.

reference video similar, but not derived, query video
Credit: thievingjoker Credit: thievingjoker


reference video similar, but not derived, query video
Credit: Gwydion M. Williams Credit: Gwydion M. Williams


TOOL TIP: Looking for great tools to help augment your training set? The Meta AI team has released an open source data augmentation library to help build more robust AI models. Check out AugLy on GitHub to get started!

Details

You will have access to a training dataset of approximately 8,000 query videos and approximately 40,000 reference videos, with labels that specify the portions of query videos that contain derived content along with the portion of the reference video from which the content was derived. In addition, you will have access to a distinct test dataset containing approximately 8,000 unlabeled query videos and approximately 40,000 reference videos. Your submissions will be evaluated against the ground truth labels for this test dataset.

To access the data, first make sure you are registered for the competition, and then download the data following the instructions in the data tab. The corpus is very large, so this might take a little while. Make sure you have a stable internet connection.

You will be able to download the following:

  • The train dataset, containing:
    • reference corpus of 40,311 reference video mp4 files. Filenames correspond to each reference video id, going from R100001.mp4 to R140311.mp4.
    • query corpus containing 8,404 query video mp4 files. Filenames correspond to each query video id, going from Q100001.mp4 to Q108404.mp4
    • train_query_metadata.csv and train_reference_metadata.csv files that contain metadata about each video
    • train_descriptor_ground_truth.csv file containing the ground truth for the query videos which contain content derived from reference videos:
      • query_id gives the query video id
      • ref_id gives the id of the corresponding reference video from which the query video was derived
  • The test dataset, containing:
    • reference corpus of 40,318 reference video mp4 files. Filenames correspond to each reference video id, going from R200001.mp4 to R240318.mp4.
    • query corpus containing 8,295 query video mp4 files. Filenames correspond to each query video id, going from Q200001.mp4 to Q208295.mp4
    • test_query_metadata.csv and test_reference_metadata.csv files that contain metadata about each video

A query video may have matched segments from multiple reference videos, in which case the ground truth contains multiple rows, one for each matched segment.
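To see which query videos match more than one reference video, the ground truth rows can be grouped by query_id. The following is a minimal sketch using only the standard library. The function name `group_refs_by_query` and the sample rows are made up for illustration; only the query_id and ref_id column names come from the description above.

```python
import csv
import io
from collections import defaultdict


def group_refs_by_query(csv_text):
    """Map each query_id to the list of ref_ids it is matched to."""
    matches = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        matches[row["query_id"]].append(row["ref_id"])
    return dict(matches)


# Made-up rows in the documented two-column layout (ids are illustrative only).
sample = (
    "query_id,ref_id\n"
    "Q100001,R100010\n"
    "Q100001,R100020\n"
    "Q100002,R100030\n"
)
grouped = group_refs_by_query(sample)

# Queries that were derived from more than one reference video.
multi = {q: refs for q, refs in grouped.items() if len(refs) > 1}
print(multi)
```

For the real file, the same function could be applied to the text read from train_descriptor_ground_truth.csv.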

Submission Format


In order to generate a submission, you must first generate descriptors for the ~8,000 query videos and ~40,000 reference videos in the test set, in a format that the provided script can use to conduct a similarity search and generate matching predictions. Descriptors are expected as compressed NumPy npz files. Your query and reference descriptor files should each contain three top-level variables as follows:

  • video_ids is a list of identifiers that provides the video id to which each descriptor vector corresponds, e.g., "Q100001" (do not include the .mp4 extension). These can be either a string or an integer (version without the Q or R prefix). Your video ids must be in sorted order.
    • Note that if you are generating this array from a pandas dataframe, you may end up with the object datatype. However, object datatype arrays will result in the following error: ValueError: Object arrays cannot be loaded when allow_pickle=False. If you encounter this error, you should convert your array to strings. See this StackOverflow post for more information.
  • timestamps is a 1D or 2D array of timestamps indicating the start and end times in seconds that the descriptor describes. In the case of a 1D array, your timestamps will be treated as a range with the same start and end timestamp. These are not used in descriptor track scoring, since your predictions for this track are based on the maximum similarity across all descriptors in a query-reference pair, but are useful for diagnostics and model interpretability.
  • features is a 32-bit float ndarray of descriptor embeddings for the corresponding video_id, up to one descriptor per second of video and with maximum descriptor dimensionality of 512. The features in your features array should properly correspond to the sorted video_ids array so that they are appropriately matched.
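Before writing a file, it may help to check the three arrays against the constraints listed above. The sketch below is one possible pre-flight check, not part of the provided tooling; the function name `check_descriptors` and the exact set of checks are assumptions made for illustration.

```python
import numpy as np

MAX_DIM = 512  # maximum descriptor dimensionality stated above


def check_descriptors(video_ids, timestamps, features):
    """Return cleaned arrays, or raise ValueError if a documented constraint fails."""
    video_ids = np.asarray(video_ids)
    if video_ids.dtype == object:
        # Object arrays cannot be loaded with allow_pickle=False, so use strings.
        video_ids = video_ids.astype(str)
    timestamps = np.asarray(timestamps, dtype=float)
    features = np.asarray(features, dtype=np.float32)

    if features.ndim != 2 or features.shape[1] > MAX_DIM:
        raise ValueError("features must be 2D with at most 512 columns")
    if timestamps.ndim not in (1, 2):
        raise ValueError("timestamps must be a 1D or 2D array")
    if not (len(video_ids) == len(timestamps) == len(features)):
        raise ValueError("need one video id and one timestamp entry per descriptor")
    ids = video_ids.tolist()
    if ids != sorted(ids):
        raise ValueError("video_ids must be in sorted order")
    return video_ids, timestamps, features


# Synthetic example: ids arrive as an object array, as they might from pandas.
raw_ids = np.array(["Q200001", "Q200001", "Q200002"], dtype=object)
raw_ts = [[0.0, 1.0], [1.0, 2.0], [0.0, 1.0]]
raw_feats = np.random.default_rng(0).standard_normal((3, 64))
ids, ts, feats = check_descriptors(raw_ids, raw_ts, raw_feats)
print(ids.dtype.kind, feats.dtype, feats.shape)
```

The check does not cover the global descriptor budget, which depends on the total video length in the set.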

Note: As mentioned in the problem description, the limitation of one descriptor per second of video is a global limitation - that is, the total number of submitted or generated descriptors must be less than the total number of seconds of video in the test set or the code execution test subset. Participants can, if they so choose, distribute their descriptors among videos in a set in such a way that violates this “one frame per second of video per video” constraint, provided that the number of descriptors is still below the global threshold. For more information on how you may accomplish this, please see the rules on data use on the problem description page.

Here is an example of how you might create your descriptors and write them out as a .npz file using numpy.savez:

import numpy as np

# Illustrative only: each `...` stands for elided entries and must be replaced with real data.
qry_video_ids = [20000, 20001, ..., 29998, 29999]  # Can also be str: "Q20000", ...
qry_timestamps = [[0.0, 1.1], [1.1, 2.2], ..., [52.9, 54.9], [51.1, 58.4]]
qry_descriptors = np.array(
    [
        [0.2343, -0.8415, ..., 1.3961, -1.3243],
        [-1.5233, 0.1302, ..., -0.8566, 0.0243],
        ...,
        [1.4251, 0.1345, ..., 0.7582, -1.7841],
        [0.8537, 0.4745, ..., 0.1689, 1.3798],
    ]
).astype(np.float32)

np.savez(
    "query_descriptors.npz",
    video_ids=qry_video_ids,
    timestamps=qry_timestamps,
    features=qry_descriptors,
)
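For a variant of the listing above that runs end to end, the following sketch writes a small file of random descriptors and reads it back. All values are synthetic stand-ins: the video ids, the 512-dimensional random features, and the temporary output path are assumptions for illustration, not competition data.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
n_descriptors, dim = 6, 512

# Synthetic stand-ins: two descriptors for each of three made-up video ids.
video_ids = np.array(
    ["Q200001", "Q200001", "Q200002", "Q200002", "Q200003", "Q200003"]
)
timestamps = np.array([[0.0, 1.0], [1.0, 2.0]] * 3, dtype=np.float32)
features = rng.standard_normal((n_descriptors, dim)).astype(np.float32)

path = os.path.join(tempfile.mkdtemp(), "query_descriptors.npz")
np.savez(path, video_ids=video_ids, timestamps=timestamps, features=features)

# Loading with allow_pickle=False fails for object arrays, so a successful
# load also confirms that video_ids was stored as a plain string array.
with np.load(path, allow_pickle=False) as loaded:
    loaded_ids = loaded["video_ids"]
    loaded_features = loaded["features"]

print(loaded_ids.shape, loaded_features.shape, loaded_features.dtype)
```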

You can then use the descriptor evaluation script to generate predicted match candidates as follows:

python descriptor_eval.py \
        --query_features query_descriptors.npz \
        --ref_features ref_descriptors.npz \
        --candidates_output submission.csv

When run with descriptors for the test dataset, your submission.csv will contain your match predictions for the test dataset. Note that you may have multiple match predictions for a given query video.

The submission.csv must contain, in order, the same column headings as the ground truth file, plus a confidence score:

  • query_id gives the query video id
  • ref_id gives the id of the corresponding reference video that the query video was derived from
  • score gives the confidence score for the given match

RULES ON DATA USE

In order to more closely simulate the needs of a system operating at production scale, and to help ensure that solutions generalize well to new data, the following rules apply for this competition. Eligible solutions must adhere to these requirements. Any violation of these rules is grounds for disqualification.

Independence

  • Submitted descriptors for a video may not make use of other videos (query or reference) in the test set.

The goal of the competition is to reflect real world settings where new videos are continuously added to a reference database, and old videos are removed from it. Therefore, the generation of descriptors for each video must be independent from other videos.

Number of descriptors

The limitation of one descriptor per second of video is a global limitation across all videos in the test set (~8,000 videos) and in the code execution test subset (~800 videos). You may, if you wish, choose to allocate more descriptors to some videos and fewer descriptors to others, as long as the number of descriptors you allocate to a particular video is a function of that video alone. Specifically, you may not dynamically allocate the number of descriptors calculated for a particular video based on other videos in the set.

Participants may distribute descriptors unevenly across videos. For instance, a participant could assign descriptors for certain frames a "priority", and calibrate a priority threshold to match the total descriptor budget. Participants may consider the total dataset length (as computed from the metadata file) to compute a descriptor budget and dynamically select a threshold, but may not look at the length distribution or conduct other analysis that would violate the independence criteria.

For conducting inference on a subset of videos in the code execution runtime (and for Phase 2), the same restriction applies: you may calculate total length of videos in the subset and use that to dynamically select a priority threshold that ensures that the number of descriptors that you submit falls in this budget, but you may not use information about the distribution of video lengths to inform your priority scores.
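The priority-threshold idea can be sketched as follows. This is one hypothetical reading of the rule, not an officially endorsed procedure: the function names, the priority values, and the choice to keep strictly fewer descriptors than the total number of seconds are all assumptions. Each priority is presumed to have been computed from its own video alone; whether a given calibration scheme satisfies the independence rule is for the organizers to decide.

```python
import math


def select_threshold(priorities_by_video, total_seconds):
    """Pick a threshold so that fewer than total_seconds descriptors are kept."""
    budget = max(math.ceil(total_seconds) - 1, 0)  # strictly below total length
    ranked = sorted(
        (p for ps in priorities_by_video.values() for p in ps), reverse=True
    )
    if len(ranked) <= budget:
        return float("-inf")  # every descriptor fits in the budget
    # Keeping only priorities strictly above the (budget + 1)-th highest value
    # leaves at most `budget` descriptors, even when priorities are tied.
    return ranked[budget]


def keep_descriptors(priorities, threshold):
    """Indices of one video's descriptors whose priority exceeds the threshold."""
    return [i for i, p in enumerate(priorities) if p > threshold]


# Made-up per-descriptor priorities for three videos totalling 5 seconds.
priorities = {
    "Q1": [0.9, 0.2, 0.5],
    "Q2": [0.8, 0.1],
    "Q3": [0.7, 0.6, 0.3],
}
threshold = select_threshold(priorities, total_seconds=5.0)
kept = {vid: keep_descriptors(ps, threshold) for vid, ps in priorities.items()}
print(threshold, kept)
```

Only the total length enters the budget here; the distribution of video lengths is never inspected.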

Annotations and Training

"Adding annotations" is defined as creating additional labels beyond the provided ground truth. Adding annotations to query or reference videos in the training set is permitted.

  • Adding annotations to any videos in the test dataset is prohibited.
  • Training on the test dataset is also prohibited, including simple models such as whitening or dimensionality reduction.

Participants must train all models on the training dataset. Participants may use the training dataset for training, or as a reference distribution for similarity normalization techniques.

Augmentation

"Augmenting" refers to applying a transformation to an image or video to generate a new image or video. Augmentation of both training videos and test set videos is permitted, provided that videos in the test set are not used for training purposes in either augmented or original form.

Data sources

External data is allowed subject to the rules governing "External Data" in the Official Rules. Note that as the source of the challenge data, the YFCC100M dataset is not external data and training using data from this source is prohibited (with the exception of the training data provided through the competition).

Performance Metric


Submissions will be evaluated by their micro-average precision, also known as global average precision. This is related to the common classification metric, average precision, also known as the area under the precision-recall curve or the average precision over recall values. The micro-average precision metric will be computed based on the submitted pairwise match predictions, and the result will be published on the leaderboard.

To generate predictions, the provided descriptor evaluation script will run an exhaustive similarity search, comparing each submitted query vector to each submitted reference vector, and using the maximum of the inner-product similarity between query and reference vectors for a given query-reference video pair as the ranking score for the prediction that the query video has content derived from the reference video.
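The ranking score described above can be illustrated with a brute-force sketch in NumPy. This is not the provided evaluation script; the function name `video_pair_scores` and the toy vectors are assumptions chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np


def video_pair_scores(q_ids, q_feats, r_ids, r_feats):
    """Return {(query_id, ref_id): max inner product over their descriptors}."""
    sims = np.asarray(q_feats) @ np.asarray(r_feats).T  # all pairwise inner products
    scores = {}
    for qi, q in enumerate(q_ids):
        for ri, r in enumerate(r_ids):
            s = float(sims[qi, ri])
            if s > scores.get((q, r), -np.inf):
                scores[(q, r)] = s
    return scores


# Toy data: Q1 and R2 each have two descriptors, Q2 and R1 have one.
q_ids = ["Q1", "Q1", "Q2"]
q_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype=np.float32)
r_ids = ["R1", "R2", "R2"]
r_feats = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]], dtype=np.float32)

scores = video_pair_scores(q_ids, q_feats, r_ids, r_feats)
for pair, score in sorted(scores.items()):
    print(pair, score)
```

The double loop is written for readability; at the scale of the full dataset a vectorized or index-based search would be needed.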

Mathematically, average precision is computed as follows:

$$ \text{AP} = \sum_k (R_k - R_{k-1}) P_k $$

where \(P_k\) and \(R_k\) are the precision and recall, respectively, when thresholding at the kth score threshold of the predictions sorted in order of decreasing score.

This metric summarizes model performance over a range of operating thresholds from the highest predicted score to the lowest predicted score. This is equivalent to estimating the area under the precision–recall curve with the finite sum over every threshold in the prediction set, without interpolation. See this Wikipedia entry for further details.

For this challenge, micro-average precision is average precision computed globally: the predictions for all queries are pooled into a single ranked list, and average precision is calculated over that list.
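As a worked illustration of the formula, the sketch below computes average precision over one pooled list of predictions. It is a simplified stand-in for the official scorer, under the assumptions that every prediction is treated as its own threshold (ties are not grouped) and that recall is measured against all ground-truth pairs; the function name and the toy data are made up.

```python
def micro_average_precision(predictions, ground_truth):
    """AP over pooled (query_id, ref_id, score) predictions.

    ground_truth is a set of (query_id, ref_id) pairs that are true matches.
    """
    total_positives = len(ground_truth)
    if total_positives == 0:
        return 0.0
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    true_positives = 0
    prev_recall = 0.0
    ap = 0.0
    for k, (query_id, ref_id, _score) in enumerate(ranked, start=1):
        if (query_id, ref_id) in ground_truth:
            true_positives += 1
        precision = true_positives / k
        recall = true_positives / total_positives
        ap += (recall - prev_recall) * precision  # (R_k - R_{k-1}) * P_k
        prev_recall = recall
    return ap


# Toy example: two true pairs, three predictions pooled across queries.
truth = {("Q1", "R1"), ("Q2", "R2")}
preds = [("Q1", "R1", 0.9), ("Q1", "R3", 0.8), ("Q2", "R2", 0.7)]
ap = micro_average_precision(preds, truth)
print(round(ap, 4))  # 0.5 * 1 + 0 * 0.5 + 0.5 * (2/3) = 0.8333
```

A true pair that is never predicted lowers the score, because recall cannot reach 1 for the pooled list.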

Have fun!


Good luck and enjoy this problem! If you have any questions you can always visit the user forum!