# Problem description

### Summary

You will receive a reference set of 1 million images and a query set of 50,000 images. Some of the query images are derived from images in the reference set, and the rest are not.

For this Matching Track, your task is to build a model that detects whether a query image is derived from one of the images in a large corpus of reference images.

### The Challenge

#### Background

The primary domain of this competition is a field of computer vision called copy detection.

Arguably, copy detection has not received as much attention as other fields of computer vision like object classification or instance detection, but that's not because it's an easy problem. Our competition sponsors at Facebook AI regard copy detection as an unsolved problem with no clear "state of the art," particularly at the scale required to moderate the huge volume of content generated on social media today.

The problem is made more challenging by the fact that it can be "adversarial" -- that is, efforts to detect and flag bad content will often elicit new attempts by users to sneak past the detection system the next time around.

For more background, you should definitely check out this paper that FB AI and their research partners have produced, as it gives a very thorough description of the problem space, competition, and dataset.

#### Phase 1

During the first phase of the competition, teams will have access to the Phase 1 reference and query data sets, and be able to make submissions to be evaluated against the ground truth labels for this dataset. Scores will appear on the public leaderboard, but will not determine your team's final prize ranking.

Note: You will not be allowed to re-train your model in Phase 2, so it must be finalized by the end of Phase 1 (October 19, 2021 23:59 UTC). In order to be eligible for final prizes, you will need to submit a file containing code and metadata. See below for more details.

#### Phase 2

Phase 2 of the competition will determine the final leaderboard and will take place in a 48-hour period from October 26, 2021 00:00 UTC to October 27, 2021 23:59 UTC.

During Phase 2, you will receive a new, unseen query set of 50,000 images and make a new submission using the model you have built in Phase 1. Manual annotation of the Phase 2 query set or re-training of your model is prohibited and will result in disqualification. Prize-eligible solutions must also treat each unseen query image as an independent observation (i.e., not use any information from other query images) when producing their submission.

Phase 2 is just 48 hours. If you would like to be eligible for final prizes, please ensure in advance that your team will be available during this period.

### The Dataset

#### Overview

For this competition Facebook has compiled a new dataset, the Dataset for ISC21 (DISC21), composed of 1 million reference images and an accompanying set of 50,000 query images. A subset of the query images have been derived in some way from the reference images, and the rest of the query images have not.

Let's look at some examples.

The image on the left is a reference image. The image on the right is a query image that has been derived from the reference image -- in this case it has been cropped and combined with another image.

[Figure: reference image (left) and query image (right)]

Edited query images may have been modified using a number of techniques. The query image below has had filters, brushmarks, and script applied using photo editing software.

[Figure: reference image (left) and query image (right)]

The editing techniques employed on this dataset include cropping, rotations, flips, padding, pixel-level transformations, different color filters, changes in brightness level, and the like. One or more techniques may be used for a given image. This is not an exhaustive list, but it gives you a sense of the types of edits you will come across.

Although synthetically created by our sponsors at Facebook, the transformations are similar to those seen in a production setting on social media platforms, with a particular emphasis on the ones that are more difficult to detect.

Here are some more examples. Once again, your task is to identify which, if any, of the 1 million reference images is the source for a given query image.

[Figure: reference image (left) and query image (right)]

#### Details

Data will be made available in separate tranches over the course of Phase 1 and Phase 2.

Phase 1:

• 1 million reference images
• 50,000 query images, with identifiers from Q00000 to Q49999, consisting of:
  • 25,000 labeled images
  • 25,000 unlabeled images
• 1 million training images

Phase 2:

• 50,000 new query images, with identifiers from Q50000 to Q99999

To access the Phase 1 data, first make sure you are registered for the competition, and then download the data from the data tab. The files are very large, particularly for the reference images, so this might take a little while. Make sure you have a stable internet connection.

• reference_images corpus containing 1 million reference image JPEG files. Filenames correspond to each reference image id, going from R000000.jpg to R999999.jpg.
• query_images corpus containing 50,000 query image JPEG files. Filenames correspond to each query image id, going from Q00000.jpg to Q49999.jpg in Phase 1. In Phase 2, you will receive access to a new corpus of 50,000 query images, with filenames going from Q50000.jpg to Q99999.jpg.
• training_images corpus containing a disjoint set of 1 million image JPEG files that can be used for training models. Filenames correspond to each training image id, going from T000000.jpg to T999999.jpg.
• public_ground_truth.csv file containing the ground truth for 25,000 of the query images, with the following columns:
  • query_id gives the query image id
  • reference_id gives the id of the corresponding reference image that the query image was derived from; if the given query image was not derived from a reference image, this will be null.
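
As a minimal sketch of how the ground truth file described above could be read, the snippet below parses the two columns and separates queries that have a source image from those that do not. The sample rows are made up for illustration, and it assumes that a null reference_id appears as an empty field; check the real file before relying on that. The helper name `load_ground_truth` is hypothetical.

```python
import csv
import io

# Hypothetical sample rows in the two-column layout described above.
# Assumption: a "null" reference_id is stored as an empty field.
SAMPLE_GROUND_TRUTH = """query_id,reference_id
Q00000,R123123
Q00001,
Q00002,R000042
"""


def load_ground_truth(handle):
    """Map each query_id to its reference_id, or None when there is no match."""
    matches = {}
    for row in csv.DictReader(handle):
        reference_id = (row["reference_id"] or "").strip()
        matches[row["query_id"]] = reference_id or None
    return matches


# With the real data, pass open("public_ground_truth.csv", newline="") instead.
ground_truth = load_ground_truth(io.StringIO(SAMPLE_GROUND_TRUTH))
positives = {q: r for q, r in ground_truth.items() if r is not None}
print(len(ground_truth), "labeled queries,", len(positives), "with a source image")
```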

TOOL TIP: Looking for great tools to help augment your training set? The Facebook AI team just released a new open source data augmentation library to help build more robust AI models. Check out AugLy on GitHub to get started!

#### RULES ON DATA USE

The goal of the competition is to reflect real-world settings where new images are continuously added to the reference set, or old images are removed from it. Therefore, an approach should produce equivalent results whether all of the reference images are provided at once or in chunks over time. We restrict the focus here and only allow approaches that treat each new image of the reference set independently, without any interaction with other reference images.

Building on this focus, the following conditions apply for use of the provided data in this challenge. "Adding annotations" here refers to creating additional image labels beyond the provided ground truth. "Augmenting" images refers to applying a transformation to an image to generate a new image, such as the manner in which the query set images were derived from the reference set images.

• Adding annotations to any images used in Phase 2 is prohibited.
• Adding annotations to query images provided during Phase 1 is permitted for developing models, as long as solutions adhere to the Competition Rules.
• Augmenting reference images is permitted in the inference process of generating embeddings and matching scores, so long as each reference image is used independently without any interaction with other reference images. Use of augmented reference images for any other reason, including model training, is prohibited. Submitted individual predictions may not take into account more than one query image or more than one reference image at a time.
• A disjoint set of training images is also provided, and augmentation and annotation of this separate set is permitted for training purposes.

Additionally, as the source of the challenge data, the YFCC100M dataset is not external data and training using data from this source is prohibited (with the exception of the training data provided through the challenge). The 1 million training images have been safely drawn from this source to ensure no overlap, so these are a great resource to use for training.

### Submission Format

Your Phase 1 submission needs to be a csv file containing predictions for the 50,000 images in the Phase 1 query set. It should have the following 3 columns:

• query_id provides the query image id, e.g. Q12123 (don't include the .jpg extension)
• reference_id provides the corresponding reference image id, e.g. R323210 (don't include the .jpg extension)
• score is a float indicating the confidence that the given query image is derived from the given reference image, with higher values indicating greater confidence

Your Phase 1 submission should include scores for query images from both the labeled and unlabeled sets. Even though you already have the labels for many of these query images, we encourage you to use actual model predictions rather than simply re-submitting the labels. Re-submitting the labels will result in an artificially inflated score, and there will be no labels provided in Phase 2. Remember that the final leaderboard is determined by Phase 2 and will not be affected by your results in Phase 1.

Your submission may include up to 500,000 rows, or ten possible reference images as candidates for a given query image on average.

Keep in mind that since your submission will be evaluated using micro average precision, a higher false positive rate will reduce your evaluation metric. Therefore, it is in your interest to only submit scores for those reference images you believe are likely to be the source for the given query image. Similarly, if you do not believe a query image is derived from the reference set, do not include it in your submission.

You are not required to submit 10 reference candidates per query image on average. This is just the upper limit.

Also note that your scores do not need to be limited to the range from zero to 1 like a probability, although they can be. For example, in the Descriptor Track the scores will be Euclidean distances computed from the descriptors, which will not necessarily be between zero and 1. What's important is that the higher scores reflect higher confidence that there is a correspondence between a given image pair, with the magnitude of the score reflecting your confidence relative to other image pairs.

#### Example

Here's how your submission might look. If your model predicted the following:

| query_id | reference_id | score |
|----------|--------------|-------|
| Q00000   | R123123      | 0.95  |
| Q00001   | R123111      | 0.72  |
| Q00001   | R123222      | 0.52  |
| Q00005   | R543210      | 0.84  |


The csv file that you submit would look like:

query_id,reference_id,score
Q00000,R123123,0.95
Q00001,R123111,0.72
Q00001,R123222,0.52
Q00005,R543210,0.84


You do not need to sort the submission, e.g. by score or image ID. We will take care of that on our end.
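
The following is a rough sketch of writing predictions in the format above while checking the stated constraints (three columns, no .jpg extension, at most 500,000 rows). The function name `write_submission` and the ID patterns are assumptions inferred from the identifier ranges given on this page; the platform's own validation may differ.

```python
import csv
import io
import re

MAX_ROWS = 500_000  # row limit stated in the submission format
# Assumed ID shapes, inferred from the ranges Q00000-Q99999 and R000000-R999999.
QUERY_ID = re.compile(r"^Q\d{5}$")
REFERENCE_ID = re.compile(r"^R\d{6}$")


def write_submission(predictions, handle):
    """Write (query_id, reference_id, score) tuples as a submission CSV."""
    rows = list(predictions)
    if len(rows) > MAX_ROWS:
        raise ValueError(f"{len(rows)} rows exceeds the limit of {MAX_ROWS}")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["query_id", "reference_id", "score"])
    for query_id, reference_id, score in rows:
        if not QUERY_ID.match(query_id):
            raise ValueError(f"unexpected query_id: {query_id!r}")
        if not REFERENCE_ID.match(reference_id):
            raise ValueError(f"unexpected reference_id: {reference_id!r}")
        writer.writerow([query_id, reference_id, float(score)])
    return len(rows)


# The four predictions from the example above, written to an in-memory buffer.
example_predictions = [
    ("Q00000", "R123123", 0.95),
    ("Q00001", "R123111", 0.72),
    ("Q00001", "R123222", 0.52),
    ("Q00005", "R543210", 0.84),
]
buffer = io.StringIO()
row_count = write_submission(example_predictions, buffer)
print(buffer.getvalue())
```

To produce a file, pass `open("submission.csv", "w", newline="")` in place of the buffer. Queries with no believed match are simply left out of `example_predictions`.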

### Performance Metric

Submissions will be evaluated by their micro-average precision, also known as the area under the precision-recall curve or the average precision over recall values. This metric will be computed against a held-out test set that no participants have access to, and the result will be published on the Phase 1 leaderboard (remember that this is not the final leaderboard).

The plot below shows the precision-recall curve for a baseline model developed for this competition using naive GIST descriptors (for more details check out the baseline model repository). This is a simple model that we expect to perform well on minor image edits or mild crops, but we think the participants in this challenge can do better!

Micro-average precision is computed as follows:

$$\text{AP} = \sum_n (R_n - R_{n-1}) P_n$$

where $P_n$ and $R_n$ are the precision and recall, respectively, when thresholding at the $n$th image pair of the sequence sorted in order of increasing recall.

This metric summarizes model performance over a range of operating thresholds from the highest predicted score to the lowest predicted score. For this challenge, the integral of the area under the curve is estimated with the finite sum over every threshold in the prediction set, without interpolation.

See this Wikipedia entry for further details.

Also note that the evaluation metric for this challenge differs from the scikit-learn implementation. The reason is that the scikit-learn implementation assumes that all positives in the ground truth are included in y_true, whereas for this challenge it is possible for participants to submit predictions which do not include all of the positives in the ground truth.
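
To make the formula and the note about missing positives concrete, here is a minimal sketch of the finite sum. It is not the official eval_metrics.py: the function name and the toy data are hypothetical, and tied scores are not given any special treatment. Recall is divided by the number of positives in the ground truth, so true matches that are absent from the submission lower the score.

```python
def micro_average_precision(predictions, ground_truth_pairs):
    """Sum (R_n - R_{n-1}) * P_n over predictions ranked by descending score.

    predictions: iterable of (query_id, reference_id, score) tuples.
    ground_truth_pairs: set of (query_id, reference_id) true matches,
        including matches that the submission never mentions.
    """
    total_positives = len(ground_truth_pairs)
    if total_positives == 0:
        return 0.0
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    true_positives = 0
    previous_recall = 0.0
    average_precision = 0.0
    for n, (query_id, reference_id, _score) in enumerate(ranked, start=1):
        if (query_id, reference_id) in ground_truth_pairs:
            true_positives += 1
        precision = true_positives / n
        recall = true_positives / total_positives
        average_precision += (recall - previous_recall) * precision
        previous_recall = recall
    return average_precision


# Toy data: three true matches exist, but the submission only finds two.
toy_truth = {("Q00000", "R123123"), ("Q00001", "R123222"), ("Q00007", "R000001")}
toy_predictions = [
    ("Q00000", "R123123", 0.95),
    ("Q00001", "R123111", 0.72),
    ("Q00001", "R123222", 0.52),
    ("Q00005", "R543210", 0.84),
]
toy_ap = micro_average_precision(toy_predictions, toy_truth)
print(toy_ap)
```

In this toy case the first ranked pair contributes $\tfrac{1}{3} \cdot 1$ and the fourth contributes $\tfrac{1}{3} \cdot \tfrac{1}{2}$, while the unsubmitted match for Q00007 contributes nothing.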

In addition to micro-average precision, the leaderboard will also show each submission's recall at 0.9 precision, which is the highest recall reached for any score threshold at or exceeding 90% precision. This is a secondary metric provided for information purposes only as an indicator of a model's recall at a reasonably good precision level, but is not used for ranking purposes or prizes.
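
The secondary metric can be sketched in the same style. This is an illustrative reading of the definition above, not the leaderboard's implementation: it walks the ranked predictions and keeps the highest recall seen at any cutoff where precision is at least 0.9. The function name and toy data are hypothetical, and ties are again ignored.

```python
def recall_at_precision(predictions, ground_truth_pairs, min_precision=0.9):
    """Highest recall at any score cutoff whose precision is >= min_precision."""
    total_positives = len(ground_truth_pairs)
    if total_positives == 0:
        return 0.0
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    true_positives = 0
    best_recall = 0.0
    for n, (query_id, reference_id, _score) in enumerate(ranked, start=1):
        if (query_id, reference_id) in ground_truth_pairs:
            true_positives += 1
        if true_positives / n >= min_precision:
            best_recall = max(best_recall, true_positives / total_positives)
    return best_recall


# Toy data: only the top-ranked pair keeps precision at or above 0.9.
toy_truth_pairs = {("Q00000", "R123123"), ("Q00001", "R123222"), ("Q00007", "R000001")}
toy_ranked_input = [
    ("Q00000", "R123123", 0.95),
    ("Q00005", "R543210", 0.84),
    ("Q00001", "R123222", 0.52),
]
toy_recall = recall_at_precision(toy_ranked_input, toy_truth_pairs)
print(toy_recall)
```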

TOOL TIP: We are also providing an evaluation script (eval_metrics.py) on the data tab that you can use to generate a score for your submission locally. The score will be based only on the ground truth you already have and does not include the ground truth that has been held out from all competitors. It also does not include all of the validation and error messaging that we apply when you make your submission on the platform. But it should be a useful tool for preliminary checks on your submission, and might give you more insight into the computations applied when we evaluate your submission.

### End Phase 1 Submission

As noted above, you will not be allowed to re-train your Phase 1 model when participating in Phase 2. You will simply be applying your existing Phase 1 model to a new dataset.

In order to be eligible for final prizes in Phase 2, you will need to submit an additional zip file containing the following files prior to the end of Phase 1:

1. code directory containing the code used to produce the submission. We recommend using a structure similar to our open source data science template. The directory should include a thorough README with documentation of dependencies and any other assets you used. See the End Phase 1 Submission page for more details on how this should look.
2. Optionally, a matching-track-metadata.txt file with descriptions of the hardware, processing time, and related stats for your final models used for Phase 2. A blank template is available on the data tab. Download the template and fill out the fields prior to submitting.

Once it's ready, you can make this submission through the End Phase 1 Submission page.

## Good luck!

Good luck and enjoy this problem! If you have any questions you can always visit the user forum!

NO PURCHASE NECESSARY TO ENTER/WIN. A PURCHASE WILL NOT INCREASE YOUR CHANCES OF WINNING. The Competition consists of two (2) Phases, with winners determined based upon Submissions using the Phase II dataset. The start and end dates and times for each Phase will be set forth on this Competition Website. Open to legal residents of the Territory, 18+ & age of majority. "Territory" means any country, state, or province where the laws of the US or local law do not prohibit participating or receiving a prize in the Challenge and excludes any area or country designated by the United States Treasury's Office of Foreign Assets Control (e.g. Cuba, Sudan, Crimea, Iran, North Korea, Syria, Venezuela). Any Participant use of External Data must be pursuant to a valid license. Void outside the Territory and where prohibited by law. Participation subject to official Competition Rules. Prizes: \$50,000 USD (1st), \$30,000 (2nd), \$20,000 USD (3rd) for each of two tracks. See Official Rules and Competition Website for submission requirements, evaluation metrics and full details. Sponsor: Facebook, Inc., 1 Hacker Way, Menlo Park, CA 94025 USA.