PHASE 2 | Facebook AI Image Similarity Challenge: Matching Track

Advance the science of image similarity detection, with applications in areas including content tracing, copyright infringement, and misinformation. In the **Matching Track**, participants develop models to predict whether a query image is derived from an image in a large corpus of reference images.

$100,000 in prizes
Completed Oct 2021
1,101 joined


# Problem description

### Summary
--------

You will receive a reference set of 1 million images and a query set of 50,000 images. Some of the query images are derived from images in the reference set; the rest are not. For this **Matching Track**, your task is to build a model that detects whether a query image is derived from one of the images in a large corpus of reference images.

### The Challenge
--------

#### Background

The primary domain of this competition is a field of computer vision called **copy detection**. Arguably, copy detection has not received as much attention as other fields of computer vision such as object classification or instance detection, but that is not because it is an easy problem. Our competition sponsors at Facebook AI regard copy detection as a problem without a clear "state of the art", one that remains unsolved, particularly at the scale required to moderate the huge volume of content generated on social media today. The problem is made more challenging by the fact that it can be "adversarial" -- that is, efforts to detect and flag bad content will often elicit new attempts by users to sneak past the detection system the next time around. For more background, check out **[this paper](https://arxiv.org/abs/2106.09672)** that Facebook AI and their research partners have produced; it gives a thorough description of the problem space, competition, and dataset.

#### Phase 1

During the first phase of the competition, teams will have access to the Phase 1 reference and query data sets, and will be able to make submissions to be evaluated against the ground truth labels for this dataset. Scores will appear on the public leaderboard, but will not determine your team's final prize ranking.

**Note: You will not be allowed to re-train your model in Phase 2, so it must be finalized by the end of Phase 1 (October 19, 2021 23:59 UTC). In order to be eligible for final prizes, you will need to submit a file containing code and metadata. See below for more details.**

#### Phase 2

Phase 2 of the competition will determine the final leaderboard and will take place in a 48-hour period from October 26, 2021 00:00 UTC to October 27, 2021 23:59 UTC. During Phase 2, you will receive a new, unseen query set of 50,000 images and make a new submission using the model you built in Phase 1. Manual annotation of the Phase 2 query set or re-training of your model is prohibited and will result in disqualification. Prize-eligible solutions must also treat each unseen query image as an independent observation (i.e., not use any information from other query images) when producing their submission.
Phase 2 is just 48 hours. If you would like to be eligible for final prizes, please ensure in advance that your team will be available during this period.
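To make the task concrete, here is a minimal sketch of one common way to approach copy detection at this scale: compute a global descriptor for every reference and query image, index the reference descriptors, and retrieve nearest neighbours for each query. This is purely illustrative and not a required or recommended method; the placeholder descriptor function, the 256-dimensional embedding size, and the use of [FAISS](https://github.com/facebookresearch/faiss) are assumptions made for the example.

```python
# Illustrative sketch only: global descriptors + nearest-neighbour search.
# The embed() placeholder, the 256-d embedding size, and the use of FAISS
# are assumptions for this example, not requirements of the challenge.
import numpy as np
import faiss  # pip install faiss-cpu


def embed(image_paths):
    """Placeholder descriptor function -- swap in your own model here."""
    rng = np.random.default_rng(0)
    return rng.random((len(image_paths), 256), dtype="float32")


reference_paths = [f"reference_images/R{i:06d}.jpg" for i in range(1_000_000)]
query_paths = [f"query_images/Q{i:05d}.jpg" for i in range(50_000)]

ref_vecs = embed(reference_paths)
query_vecs = embed(query_paths)

index = faiss.IndexFlatL2(ref_vecs.shape[1])  # exact L2 search over the corpus
index.add(ref_vecs)

# Retrieve the 10 closest reference images for every query image.
distances, neighbors = index.search(query_vecs, 10)
```

Turning distances into confidence scores and deciding which candidate pairs, if any, to keep is where the real modeling work lies; the Submission Format section below describes how those pairs are submitted.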
### The Dataset
--------

#### Overview

For this competition, Facebook has compiled a new dataset, the Dataset for ISC21 (DISC21), composed of **1 million reference images** and an **accompanying set of 50,000 query images**. A subset of the query images has been derived in some way from the reference images; the rest have not. Let's look at some examples. The image on the left is a reference image. The image on the right is a query image that has been derived from the reference image -- in this case it has been cropped and combined with another image.
[Figure: a reference image alongside a query image derived from it by cropping and combining with another image]

Query images may have been edited using a number of techniques. The query image below has had filters, brushmarks, and script applied using photo editing software.
[Figure: a reference image alongside a query image with filters, brushmarks, and script applied]

The editing techniques employed on this data set include cropping, rotations, flips, padding, pixel-level transformations, different color filters, changes in brightness level, and the like. One or more techniques may be used for a given image. This is not an exhaustive list, but it gives you a sense of the types of edits you will come across. Although synthetically created by our sponsors at Facebook, the transformations are similar to those seen in a production setting on social media platforms, with a particular emphasis on those that are more difficult to detect. Here are some more examples. Once again, your task is to identify which, if any, of the 1 million reference images is the source for a given query image.
[Figure: additional pairs of reference images and derived query images]
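To get a feel for these edits during model development, you can apply similar transformations to the provided training images yourself. The following is a minimal sketch using torchvision transforms; torchvision is an assumption chosen for this example, and the sponsors' AugLy library (see the tool tip below) offers transformations closer to the social-media-style edits used to build the query set.

```python
# Illustrative sketch: simulate crop / flip / rotation / color edits on a
# training image. torchvision is an assumption for this example; the actual
# query-set edits were generated by the competition sponsors' own pipeline.
from PIL import Image
from torchvision import transforms

simulate_edits = transforms.Compose([
    transforms.RandomResizedCrop(256, scale=(0.3, 1.0)),    # aggressive crops
    transforms.RandomHorizontalFlip(p=0.5),                 # flips
    transforms.RandomRotation(degrees=30),                   # rotations
    transforms.ColorJitter(brightness=0.5, contrast=0.5,
                           saturation=0.5, hue=0.1),         # color / brightness
    transforms.Pad(padding=20, fill=0),                      # padding
])

original = Image.open("training_images/T000000.jpg").convert("RGB")
edited = simulate_edits(original)  # a synthetic "query-like" copy of the original
```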
#### Details

Data will be made available in separate tranches over the course of Phase 1 and Phase 2.

In **Phase 1**, you will have access to:

- 1 million reference images
- 50,000 query images, with identifiers from `Q00000` to `Q49999`, consisting of:
  - 25,000 labeled images
  - 25,000 unlabeled images
- 1 million training images

In **Phase 2**, you will have access to:

- 50,000 new query images, with identifiers from `Q50000` to `Q99999`

To access the Phase 1 data, first make sure you are registered for the competition, and then download the data from the [data tab](https://www.drivendata.org/competitions/79/competition-image-similarity-1-dev/data/). The files are very large, particularly for the reference images, so this might take a little while. Make sure you have a stable internet connection. You will be able to download the following:

* `reference_images` corpus containing 1 million reference image JPEG files. Filenames correspond to each reference image id, going from `R000000.jpg` to `R999999.jpg`.
* `query_images` corpus containing 50,000 query image JPEG files. Filenames correspond to each query image id, going from `Q00000.jpg` to `Q49999.jpg` in Phase 1. In Phase 2, you will receive access to a new corpus of 50,000 query images, with filenames going from `Q50000.jpg` to `Q99999.jpg`.
* `training_images` corpus containing a disjoint set of 1 million image JPEG files that can be used for training models. Filenames correspond to each training image id, going from `T000000.jpg` to `T999999.jpg`.
* `public_ground_truth.csv` file containing the ground truth for 25,000 of the query images (see the loading sketch below), with the following columns:
    * `query_id` gives the query image id
    * `reference_id` gives the id of the corresponding reference image that the query image was derived from; if the given query image was not derived from a reference image, this will be null.

**TOOL TIP**: Looking for great tools to help augment your training set? The Facebook AI team just released a new open source data augmentation library to help build more robust AI models. Check out [AugLy](https://github.com/facebookresearch/AugLy) on GitHub to get started!
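As a quick orientation to the ground truth file, here is a minimal sketch that loads `public_ground_truth.csv` and separates labeled queries that have a reference match from those that do not. The use of pandas is an assumption for this example.

```python
# Minimal sketch for exploring the public ground truth; pandas is an
# assumption for this example, not a requirement of the challenge.
import pandas as pd

gt = pd.read_csv("public_ground_truth.csv")

# A non-null reference_id means the query was derived from that reference
# image; a null reference_id means the query has no match in the reference set.
matched = gt[gt["reference_id"].notna()]
unmatched = gt[gt["reference_id"].isna()]

print(f"{len(matched)} labeled queries have a reference match")
print(f"{len(unmatched)} labeled queries have no match")
```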
#### Rules on data use

The goal of the competition is to reflect real world settings where new images are continuously added to the reference set, or old images are removed from it. Therefore, your approach should produce an equivalent result whether all of the reference images are provided at once or in chunks over time. We restrict the focus here, and only allow approaches that treat each new image of the reference set independently and without any interaction with other reference images.

Building on this focus, **the following conditions apply for use of the provided data in this challenge**. "Adding annotations" here refers to creating additional image labels beyond the provided ground truth. "Augmenting" images refers to applying a transformation to an image to generate a new image, such as the manner in which the query set images were derived from the reference set images.

- Adding annotations to any images used in Phase 2 is **prohibited**.
- Adding annotations to query images provided during Phase 1 is **permitted** for developing models, as long as solutions adhere to the Competition Rules.
- Augmenting reference images is **permitted** in the inference process of generating embeddings and matching scores, so long as each reference image is used independently without any interaction with other reference images. Use of augmented reference images for any other reason, including model training, is **prohibited**. Submitted individual predictions may not take into account more than one query image or more than one reference image at a time.
- A disjoint set of training images is also provided, and augmentation and annotation of this separate set is **permitted** for training purposes.

Additionally, as the [source of the challenge data](https://arxiv.org/pdf/2106.09672.pdf), the YFCC100M dataset is not considered external data, and training using data from this source is **prohibited** (with the exception of the training data provided through the challenge). The 1 million training images have been safely drawn from this source to ensure no overlap, so they are a great resource to use for training.

### Submission Format
--------

Your Phase 1 submission needs to be a csv file containing predictions for the 50,000 images in the Phase 1 query set. It should have the following 3 columns:

- `query_id` provides the query image id, e.g. `Q12123` (don't include the `.jpg` extension)
- `reference_id` provides the corresponding reference image id, e.g. `R323210` (don't include the `.jpg` extension)
- `score` is a float indicating the confidence that the given query image is derived from the given reference image, with higher values indicating greater confidence

Your Phase 1 submission should include scores for query images from both the labeled and unlabeled sets. Even though you already have the labels for many of these query images, we encourage you to use actual model predictions rather than simply re-submitting the labels. Re-submitting the labels will result in an artificially inflated score, and there will be no labels provided in Phase 2. Remember that the final leaderboard is determined by Phase 2 and will _not_ be affected by your results in Phase 1.

Your submission may include up to 500,000 rows, or ten possible reference images as candidates for a given query image on average. Keep in mind that since your submission will be evaluated using **micro-average precision**, a higher false positive rate will reduce your evaluation metric. Therefore, it is in your interest to only submit scores for those reference images you believe are likely to be the source for the given query image. Similarly, if you do not believe a query image is derived from the reference set, do not include it in your submission. You are not required to submit 10 reference candidates per query image on average; this is just the upper limit.

Also note that your scores do not need to be limited to the range from zero to 1 like a probability, although they can be. For example, in the [Descriptor Track](https://www.drivendata.org/competitions/80/competition-image-similarity-2-dev/page/380/) the scores will be Euclidean distances computed from the descriptors, which will not necessarily be between zero and 1. What's important is that higher scores reflect higher confidence that there is a correspondence between a given image pair, with the magnitude of the score reflecting your confidence relative to other image pairs.

#### Example

Here's how your submission might look. If your model predicted the following:
| query_id | reference_id | score |
|----------|--------------|-------|
| Q00000   | R123123      | 0.95  |
| Q00001   | R123111      | 0.72  |
| Q00001   | R123222      | 0.52  |
| Q00005   | R543210      | 0.84  |
The `csv` file that you submit would look like:
```
query_id,reference_id,score
Q00000,R123123,0.95
Q00001,R123111,0.72
Q00001,R123222,0.52
Q00005,R543210,0.84
```

You do not need to sort the submission, e.g. by score or image ID. We will take care of that on our end.
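As a rough sketch of how such a file might be assembled from model output, here is one way to write candidate matches to a submission csv. The `candidates` dictionary and the 0.5 score threshold are placeholders invented for this example.

```python
# Illustrative sketch: write a Matching Track submission from per-query
# candidate matches. The candidates dict and the 0.5 threshold are placeholders.
import csv

# query_id -> list of (reference_id, score) pairs produced by your model
candidates = {
    "Q00000": [("R123123", 0.95)],
    "Q00001": [("R123111", 0.72), ("R123222", 0.52)],
    "Q00002": [("R000042", 0.11)],  # low confidence -- likely better left out
    "Q00005": [("R543210", 0.84)],
}

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query_id", "reference_id", "score"])
    for query_id, matches in candidates.items():
        for reference_id, score in matches:
            # Only keep pairs you are reasonably confident about, since extra
            # false positives hurt micro-average precision.
            if score >= 0.5:
                writer.writerow([query_id, reference_id, score])
```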
### Performance Metric
--------

Submissions will be evaluated by their **micro-average precision**, also known as the area under the precision-recall curve or the average precision over recall values. This metric will be computed against a held-out test set that no participants have access to, and the result will be published on the Phase 1 leaderboard (remember that this is not the *final* leaderboard).

The plot below shows the precision-recall curve for a baseline model developed for this competition using naive GIST descriptors (for more details, check out the [baseline model repository](https://github.com/facebookresearch/isc2021)). This is a simple model that we expect to perform well on minor image edits or mild crops, but we think the participants in this challenge can do better!

![precision-recall curve](https://drivendata-public-assets.s3.amazonaws.com/gist-baseline-pr-curve.png "precision-recall curve")

Micro-average precision is computed as follows:

$$ \text{AP} = \sum_n (R_n - R_{n-1}) P_n $$

where $P_n$ and $R_n$ are the precision and recall, respectively, when thresholding at the $n$th image pair of the sequence sorted in order of increasing recall. This metric summarizes model performance over a range of operating thresholds, from the highest predicted score to the lowest predicted score. For this challenge, the integral of the area under the curve is estimated with the finite sum over every threshold in the prediction set, without interpolation. See this [wikipedia entry](https://en.wikipedia.org/w/index.php?title=Information_retrieval&oldid=793358396#Average_precision) for further details.

Also note that the evaluation metric for this challenge differs from the [scikit-learn implementation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#rcdf8f32d7f9d-1). The reason is that the scikit-learn implementation assumes that all positives in the ground truth are included in `y_true`, whereas for this challenge it is possible for participants to submit predictions which do not include all of the positives in the ground truth.

In addition to micro-average precision, the leaderboard will also show each submission's **recall at 0.9 precision**, which is the highest recall reached at any score threshold with precision at or above 90%. This secondary metric is provided for information purposes only, as an indicator of a model's recall at a reasonably good precision level; it is not used for ranking purposes or prizes.

**TOOL TIP**: We are also providing an **evaluation script** (`eval_metrics.py`) on the [data tab](https://www.drivendata.org/competitions/79/competition-image-similarity-1-dev/data/) that you can use to generate a score for your submission locally. The score will be based only on the ground truth you already have and does not include the ground truth that has been held out from all competitors. It also does not include all of the validation and error messaging that we apply when you make your submission on the platform. But it should be a useful tool for preliminary checks on your submission, and might give you more insight into the computations applied when we evaluate your submission.
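For local sanity checks alongside the provided `eval_metrics.py`, here is a minimal sketch of the metric as described above: predictions are sorted by decreasing score, precision and recall are computed at every threshold, and recall is measured against the full count of ground-truth positives, so positives you never submit still count against you. The function and variable names are illustrative and are not taken from the official evaluation script.

```python
# Minimal sketch of micro-average precision as described above. Names are
# illustrative; use the official eval_metrics.py for authoritative scoring.
import numpy as np


def micro_average_precision(predicted_pairs, scores, true_pairs):
    """predicted_pairs: list of (query_id, reference_id) tuples you submitted.
    scores: list of confidence scores, aligned with predicted_pairs.
    true_pairs: set of (query_id, reference_id) ground-truth matches.
    """
    order = np.argsort(scores)[::-1]  # highest score first
    hits = np.array(
        [predicted_pairs[i] in true_pairs for i in order], dtype=float
    )
    cum_hits = np.cumsum(hits)
    precision = cum_hits / np.arange(1, len(hits) + 1)
    recall = cum_hits / len(true_pairs)  # denominator: ALL ground-truth positives

    # AP = sum over thresholds of (R_n - R_{n-1}) * P_n, without interpolation
    recall_prev = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - recall_prev) * precision))
```

Unlike the scikit-learn implementation discussed above, the recall denominator here is the total number of ground-truth positives, so any positives missing from your submission limit the achievable recall.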
### End Phase 1 Submission
--------

As noted above, you will not be allowed to re-train your Phase 1 model when participating in Phase 2. You will simply be applying your existing Phase 1 model to a new data set.

**In order to be eligible for final prizes in Phase 2, you will need to submit an additional `zip` file containing the following files prior to the end of Phase 1**:

1. `code` directory containing the code used to produce the submission. We recommend using a structure similar to our [open source data science template](http://drivendata.github.io/cookiecutter-data-science/). The directory should include a thorough README with documentation of dependencies and any other assets you used. See the [End Phase 1 Submission page](https://www.drivendata.org/competitions/79/submissions/extra/) for more details on how this should look.
2. Optionally, a `matching-track-metadata.txt` file with descriptions of the hardware, processing time, and related stats for your final models used for Phase 2. A blank template is available on the [data tab](https://www.drivendata.org/competitions/79/competition-image-similarity-1-dev/data/). Download the template and fill out the fields prior to submitting.

Once it's ready, you can make this submission through the [End Phase 1 Submission page](https://www.drivendata.org/competitions/79/submissions/extra/).

## Good luck!
--------

Good luck and enjoy this problem! If you have any questions, you can always visit the [user forum](https://community.drivendata.org/t/about-the-image-similarity-challenge-category/6193)!

***

**NO PURCHASE NECESSARY TO ENTER/WIN. A PURCHASE WILL NOT INCREASE YOUR CHANCES OF WINNING.** The Competition consists of two (2) Phases, with winners determined based upon Submissions using the Phase II dataset. The start and end dates and times for each Phase will be set forth on this Competition Website. Open to legal residents of the Territory, 18+ & age of majority. "Territory" means any country, state, or province where the laws of the US or local law do not prohibit participating or receiving a prize in the Challenge and excludes any area or country designated by the United States Treasury's Office of Foreign Assets Control (e.g. Cuba, Sudan, Crimea, Iran, North Korea, Syria, Venezuela). **Any Participant use of External Data must be pursuant to a valid license.** Void outside the Territory and where prohibited by law. Participation subject to official [Competition Rules](https://www.drivendata.org/competitions/79/competition-image-similarity-1-dev/rules/). Prizes: $50,000 USD (1st), $30,000 USD (2nd), $20,000 USD (3rd) for each of two tracks. See Official Rules and Competition Website for submission requirements, evaluation metrics and full details. Sponsor: Facebook, Inc., 1 Hacker Way, Menlo Park, CA 94025 USA.