PHASE 1 | Facebook AI Image Similarity Challenge: Descriptor Track

Advance the science of image similarity detection, with applications in areas including content tracing, copyright infringement, and misinformation. In the "Descriptor Track", participants generate image descriptors that enable the problem to be solved at scale.



Problem description

Summary


You will receive a reference set of 1 million images and a query set of 50,000 images. Some of the query images are derived from images in the reference set, and the rest are not.

For this Descriptor Track, your task is to compute image descriptors (embeddings) for both the 50,000 query and 1 million reference images. Your descriptors will be floating point vectors of up to 256 dimensions.

After receiving your descriptors, the evaluation script will compute the Euclidean distance between each pair of descriptor vectors in the query and reference sets, and then use these distances as confidence scores (smaller distances indicating more confident matches) to compute micro-average precision as in the Matching Track.
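
To make that computation concrete, here is a minimal numpy sketch of the distance-based ranking on toy-sized descriptors; the actual evaluation runs a faiss similarity search over the full 50,000 x 1,000,000 pairs, as described under Performance Metric below.

import numpy as np

# Toy descriptor matrices: 5 "query" and 20 "reference" vectors of dimension 4.
# The real matrices are 50,000 x n and 1,000,000 x n with n <= 256, in float32.
rng = np.random.default_rng(0)
queries = rng.standard_normal((5, 4)).astype("float32")
references = rng.standard_normal((20, 4)).astype("float32")

# Pairwise Euclidean distances between every query and every reference vector.
distances = np.linalg.norm(queries[:, None, :] - references[None, :, :], axis=-1)

# Smaller distance = higher-confidence match.
best_ref = distances.argmin(axis=1)
print(best_ref, distances.min(axis=1))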


The Challenge


Background

The primary domain of this competition is a field of computer vision called copy detection.

Arguably, copy detection has not received as much attention as other areas of computer vision like object classification or instance detection, but that's not because it's an easy problem. Our competition sponsors at Facebook AI regard copy detection as an unsolved problem with no clear state of the art, particularly at the scale required to moderate the huge volume of content generated on social media today.

The problem is made more challenging by the fact that it can be "adversarial" -- that is, efforts to detect and flag bad content will often elicit new attempts by users to sneak past the detection system the next time around.

For more background, you should definitely check out this paper that Facebook AI and their research partners have produced, as it gives a thorough description of the problem space, the competition, and the dataset.

Phase 1

During the first phase of the competition, teams will have access to the Phase 1 reference and query data sets, and be able to make submissions to be evaluated against the ground truth labels for this dataset. Scores will appear on the public leaderboard, but will not determine your team's final prize ranking.

Note: You will not be allowed to re-train your model in Phase 2, so it must be finalized by the end of Phase 1 (October 19, 2021 23:59 UTC). In order to be eligible for final prizes, you will also need to submit a file containing code, metadata, and finalized reference descriptors. See below for more details.

Phase 2

Phase 2 of the competition will determine the final leaderboard and will take place over a 48-hour period, from October 26, 2021 00:00 UTC to October 27, 2021 23:59 UTC.

During Phase 2, you will receive a new, unseen query set of 50,000 images and make a new submission using the model you have built in Phase 1. Manual annotation of the Phase 2 query set or re-training of your model is prohibited and will result in disqualification. Prize-eligible solutions must also treat each unseen query image as an independent observation (i.e., not use any information from other query images) when producing their submission.

Phase 2 is just 48 hours. If you would like to be eligible for final prizes, please ensure in advance that your team will be available during this period.

The Dataset


Overview

For this competition, Facebook has compiled a new dataset, the Dataset for ISC21 (DISC21), composed of 1 million reference images and an accompanying set of 50,000 query images. A subset of the query images has been derived in some way from the reference images; the rest have not.

Let's look at some examples.

The image on the left is a reference image. The image on the right is a query image that has been derived from the reference image -- in this case it has been cropped and combined with another image.

[Image: a reference image (left) and the query image derived from it (right)]


Edited query images may have been modified using a number of techniques. The query image below has had filters, brush marks, and script applied using photo editing software.

[Image: a reference image (left) and an edited query image derived from it (right)]


The editing techniques employed on this data set include cropping, rotations, flips, padding, pixel-level transformations, different color filters, changes in brightness level, and the like. One or more techniques may be applied to a given image. This is not an exhaustive list, but it gives you a sense of the types of edits you will come across.
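
For intuition, here is a minimal sketch of a few edits of this kind using Pillow. This is an illustration only, not the pipeline used to build the dataset; see also the AugLy tool tip further below for a purpose-built augmentation library.

from PIL import Image, ImageEnhance, ImageOps

def toy_edit(path):
    """Apply a few dataset-style edits to a single image (illustrative only)."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    img = img.crop((int(0.1 * w), int(0.1 * h), int(0.9 * w), int(0.9 * h)))  # crop
    img = ImageOps.mirror(img)                       # horizontal flip
    img = img.rotate(15, expand=True)                # small rotation
    img = ImageEnhance.Brightness(img).enhance(1.3)  # brightness change
    img = ImageOps.expand(img, border=16, fill="white")  # padding
    return img

# edited = toy_edit("T000000.jpg")  # e.g., one of the provided training images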

Although synthetically created by our sponsors at Facebook, the transformations are similar to those seen in a production setting on social media platforms, with a particular emphasis on the ones that are more difficult to detect.

Here are some more examples. Once again, your task is to identify which, if any, of the 1 million reference images is the source for a given query image.

[Images: additional pairs of reference images (left) and the query images derived from them (right)]

Details

Data will be made available in separate tranches over the course of Phase 1 and Phase 2.

In Phase 1, you will have access to:

  • 1 million reference images
  • 50,000 query images, with identifiers from Q00000 to Q49999, consisting of:
    • 25,000 labeled images
    • 25,000 unlabeled images
  • 1 million training images

In Phase 2, you will have access to:

  • 50,000 new query images, with identifiers from Q50000 to Q99999

To access the Phase 1 data, first make sure you are registered for the competition, and then download the data from the data tab. The files are very large, particularly for the reference images, so this might take a little while. Make sure you have a stable internet connection.

You will be able to download the following:

  • reference_images corpus containing 1 million reference image JPEG files. Filenames correspond to each reference image id, going from R000000.jpg to R999999.jpg.
  • query_images corpus containing 50,000 query image JPEG files. Filenames correspond to each query image id, going from Q00000.jpg to Q49999.jpg in Phase 1. In Phase 2, you will receive access to a new corpus of 50,000 query images, with filenames going from Q50000.jpg to Q99999.jpg.
  • training_images corpus containing a disjoint set of 1 million image JPEG files that can be used for training models. Filenames correspond to each training image id, going from T000000.jpg to T999999.jpg.
  • public_ground_truth.csv file containing the ground truth for 25,000 of the query images, with the following columns:
    • query_id gives the query image id
    • reference_id gives the id of the corresponding reference image that the query image was derived from; if the given query image was not derived from a reference image, this will be null.
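
As a quick way to get oriented, you could load the ground truth file with pandas along these lines (a sketch; it assumes the column names listed above):

import pandas as pd

# Ground truth for the 25,000 labeled Phase 1 query images.
gt = pd.read_csv("public_ground_truth.csv")

# Rows with a reference_id are true copies; a null reference_id means the
# query image was not derived from any reference image.
copies = gt[gt["reference_id"].notna()]
non_copies = gt[gt["reference_id"].isna()]
print(len(gt), len(copies), len(non_copies))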

TOOL TIP: Looking for great tools to help augment your training set? The Facebook AI team just released a new open source data augmentation library to help build more robust AI models. Check out AugLy on GitHub to get started!

RULES ON DATA USE

The goal of the competition is to reflect real-world settings where new images are continuously added to the reference set, or old images are removed from it. Your approach should therefore produce equivalent results whether all of the reference images are provided at once or in chunks over time. To that end, we only allow approaches that treat each new image of the reference set independently, without any interaction with other reference images.

Building on this focus, the following conditions apply for use of the provided data in this challenge. "Adding annotations" here refers to creating additional image labels beyond the provided ground truth. "Augmenting" images refers to applying a transformation to an image to generate a new image, such as the manner in which the query set images were derived from the reference set images.

  • Adding annotations to any images used in Phase 2 is prohibited.
  • Adding annotations to query images provided during Phase 1 is permitted for developing models, as long as solutions adhere to the Competition Rules.
  • Augmenting reference images is permitted in the inference process of generating embeddings and matching scores, so long as each reference image is used independently without any interaction with other reference images (see the sketch after this list for what independent, per-image augmentation can look like). Use of augmented reference images for any other reason, including model training, is prohibited. Submitted individual predictions may not take into account more than one query image or more than one reference image at a time.
  • A disjoint set of training images is also provided, and augmentation and annotation of this separate set is permitted for training purposes.
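
As an illustration of the kind of independent, per-image augmentation permitted at inference time, here is a hedged sketch that averages descriptors over a few flips of a single reference image. The compute_descriptor function is a hypothetical stand-in for your own embedding model, not something provided by the challenge.

import numpy as np
from PIL import Image, ImageOps

def compute_descriptor(img):
    """Hypothetical placeholder for your model's embedding function."""
    raise NotImplementedError

def descriptor_with_flips(path):
    """Average descriptors over simple flips of ONE image, used independently."""
    img = Image.open(path).convert("RGB")
    variants = [img, ImageOps.mirror(img), ImageOps.flip(img)]
    descriptors = np.stack([compute_descriptor(v) for v in variants])
    return descriptors.mean(axis=0).astype("float32")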

Additionally, because the YFCC100M dataset is the source of the challenge data, it is not considered external data, and training on data from this source is prohibited (with the exception of the training data provided through the challenge). The 1 million training images have been drawn from this source in a way that ensures no overlap with the challenge data, so they are a great resource to use for training.

Submission Format


For the Descriptor Track, you will submit a single HDF5 file containing four datasets:

  • the query dataset is a 50,000 by n matrix of descriptors for the query images, where n is the number of descriptor dimensions (up to 256 are allowed).
  • the reference dataset is a 1,000,000 by n matrix of descriptors for the reference images, where n is the number of descriptor dimensions (must be same as your query dataset).
  • the query_ids dataset is a 50,000 element array of the query image IDs corresponding to each descriptor vector in query.
  • the reference_ids dataset is a 1,000,000 element array of the reference image IDs corresponding to each descriptor vector in reference.

The descriptor values must be stored in float32 format (single-precision floating-point). The filename doesn't matter, but we think "fb-isc-submission.h5" has a nice ring to it.

Important: The descriptor matrix rows must be sorted by image ID, beginning with the lowest-numbered IDs and ending with the highest-numbered IDs. In other words, the first row of the query matrix must correspond to query image Q00000, the second row to Q00001, and so on up to Q49999. The first row of the reference matrix must correspond to reference image R000000, the second row to R000001, and so on up to R999999. The query_ids and reference_ids must also be sorted from lowest to highest ID.

You must ensure that the row order you submit meets these expectations. Also note that the only values in the query and reference datasets should be the descriptor floating point values; you do not need to submit column or row index labels.

For example, if the below table represented your version of the descriptors for the query set, with a column for each dimension:

image_id  |      1       |      2      |   ...   |     255      |     256     |
----------|--------------|-------------|---------|--------------|-------------|
Q00000    |  0.37167516  | -0.7784221  |   ...   |  0.0605562   |  0.6990247  |
Q00001    |  0.31181982  | -0.2303174  |   ...   | -0.7049109   |  0.3569864  |
Q00002    | -0.81381774  |  0.8651062  |   ...   |  1.2294384   |  0.1994895  |
...
Q49997    |  0.92375445  |  0.5524814  |   ...   | -0.10801959  |  0.6297743  |
Q49998    |  1.15396512  | -0.4784533  |   ...   | -0.10089809  |  1.2083315  |
Q49999    |  0.5610036   |  1.2792902  |   ...   |  0.61131167  | -1.9094679  |

The "query" dataset in your HDF5 submission would include just the following values:

 0.37167516  | -0.7784221  |   ...   |  0.0605562   |  0.6990247  |
 0.31181982  | -0.2303174  |   ...   | -0.7049109   |  0.3569864  |
-0.81381774  |  0.8651062  |   ...   |  1.2294384   |  0.1994895  |
...
 0.92375445  |  0.5524814  |   ...   | -0.10801959  |  0.6297743  |
 1.15396512  | -0.4784533  |   ...   | -0.10089809  |  1.2083315  |
 0.5610036   |  1.2792902  |   ...   |  0.61131167  | -1.9094679  |

If you are not familiar with HDF5, fear not. It is a widely used, open-source format with extensive cross-platform and cross-language support. See below for some entry points into the documentation. Chances are it is already implemented in your language of choice.

For example, in Python, you could use the following code to create a valid (but not very high-scoring) HDF5 file from a numpy ndarray. Remember that for your actual submission the ID arrays and descriptor matrices both need to be sorted from lowest to highest ID. So you will want to use a process that couples the elements of the ID arrays with their corresponding rows in the descriptor matrices, for example, by processing the image files in order from lowest to highest ID.

import h5py
import numpy as np
from pathlib import Path

# Directory to write the submission file to; adjust as needed.
OUT_DIR = Path(".")

# Random descriptors in float32, just to illustrate the expected shapes.
M_ref = np.random.rand(1_000_000, 256).astype("float32")
M_query = np.random.rand(50_000, 256).astype("float32")

# Image IDs in ascending order, matching the row order of the matrices above.
qry_ids = ["Q" + str(x).zfill(5) for x in range(50_000)]
ref_ids = ["R" + str(x).zfill(6) for x in range(1_000_000)]

out = OUT_DIR / "fb-isc-submission.h5"
with h5py.File(out, "w") as f:
    f.create_dataset("query", data=M_query)
    f.create_dataset("reference", data=M_ref)
    f.create_dataset("query_ids", data=qry_ids)
    f.create_dataset("reference_ids", data=ref_ids)

(By the way, if you are having any trouble writing string data using h5py, you may want to double check which version of h5py you are using as the rules for writing string data changed between 2.10 and 3.0. See here for more information.)
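
Before uploading, you may also want to sanity-check your file against the format requirements above. Here is a minimal sketch; it assumes the dataset names described in this section and decodes IDs that h5py may return as bytes.

import h5py
import numpy as np

with h5py.File("fb-isc-submission.h5", "r") as f:
    query = f["query"][:]
    reference = f["reference"][:]
    query_ids = [x.decode() if isinstance(x, bytes) else x for x in f["query_ids"][:]]
    reference_ids = [x.decode() if isinstance(x, bytes) else x for x in f["reference_ids"][:]]

assert query.dtype == np.float32 and reference.dtype == np.float32
assert query.shape[0] == 50_000 and reference.shape[0] == 1_000_000
assert query.shape[1] == reference.shape[1] <= 256
assert query_ids == sorted(query_ids)          # Q00000 ... Q49999 in order
assert reference_ids == sorted(reference_ids)  # R000000 ... R999999 in order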

Performance Metric


Unlike with the Matching Track, for this track you will submit only the image descriptors, with no predictions. As with the Matching Track, you will submit descriptors for all query images, including the ones that you received labels for.

After receiving your descriptor submission, our competition platform will conduct a similarity search between the query and reference descriptors. The search uses faiss, an open-source library for efficient similarity search (written in C++ with Python wrappers) published by Facebook AI Research. We will take the 500,000 most similar query-reference image pairs globally across all query images, using Euclidean distance as the confidence score. These scores will then be evaluated in the same manner as they are for the Matching Track (the following discussion of the evaluation metric is identical to what's in the Matching Track).
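
To give a sense of what that search looks like, here is a minimal faiss sketch on toy-sized data. The platform's exact configuration may differ; note that IndexFlatL2 returns squared L2 distances, which preserve the Euclidean distance ranking.

import faiss
import numpy as np

d = 256                                                          # descriptor dimension
rng = np.random.default_rng(0)
reference = rng.standard_normal((10_000, d)).astype("float32")   # toy stand-in for the 1M references
query = rng.standard_normal((1_000, d)).astype("float32")        # toy stand-in for the 50k queries

index = faiss.IndexFlatL2(d)       # exact search over L2 distance
index.add(reference)
D, I = index.search(query, 10)     # 10 nearest references per query: distances D, reference rows I

# Keep the globally most similar query-reference pairs (smallest distances),
# analogous to the evaluation's global top-500,000 selection.
n_pairs = 500
flat = np.argpartition(D.ravel(), n_pairs)[:n_pairs]
rows, cols = np.unravel_index(flat, D.shape)
pairs = [(q, I[q, j], D[q, j]) for q, j in zip(rows, cols)]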

Submissions will be evaluated by their micro-average precision, also known as the area under the precision-recall curve or the average precision over recall values. This metric will be computed against a held-out test set that no participants have access to, and the result will be published on the Phase 1 leaderboard (remember that this is not the final leaderboard).

The plot below shows the precision-recall curve for a baseline model developed for this competition using naive GIST descriptors (for more details check out the baseline model repository). This is a simple model that we expect to perform well on minor image edits or mild crops, but we think the participants in this challenge can do better!

[Figure: precision-recall curve for the GIST descriptor baseline model]

Micro-average precision is computed as follows:

$$ \text{AP} = \sum_n (R_n - R_{n-1}) P_n $$

where $P_n$ and $R_n$ are the precision and recall, respectively, when thresholding at the $n$th image pair of the sequence sorted in order of increasing recall.

This metric summarizes model performance over a range of operating thresholds from the highest predicted score to the lowest predicted score. For this challenge, the integral of the area under the curve is estimated with the finite sum over every threshold in the prediction set, without interpolation.

See this Wikipedia entry for further details.

Also note that the evaluation metric for this challenge differs from the scikit-learn implementation. The reason is that the scikit-learn implementation assumes that all positives in the ground truth are included in y_true, whereas for this challenge it is possible for participants to submit predictions which do not include all of the positives in the ground truth.

In addition to micro-average precision, the leaderboard will also show each submission's recall at 0.9 precision, which is the highest recall reached for any score threshold at or exceeding 90% precision. This is a secondary metric provided for information purposes only as an indicator of a model's recall at a reasonably good precision level, but is not used for ranking purposes or prizes.
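
For intuition, here is a hedged sketch of how these two quantities can be computed from a list of scored query-reference pairs. It follows the definitions above (note that the recall denominator is the total number of ground-truth positives, including any that are not submitted); it is not the official evaluation script provided on the data tab.

import numpy as np

def micro_ap_and_recall_at_p90(scores, is_correct, n_positives):
    """scores: higher = more confident (for this track, e.g. negative Euclidean distance);
    is_correct: 1 if the pair is a true match, else 0;
    n_positives: total number of positives in the ground truth."""
    order = np.argsort(-np.asarray(scores))
    correct = np.asarray(is_correct, dtype=float)[order]

    tp = np.cumsum(correct)
    precision = tp / np.arange(1, len(correct) + 1)
    recall = tp / n_positives

    # AP = sum_n (R_n - R_{n-1}) * P_n, with no interpolation.
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    ap = float(np.sum((recall - prev_recall) * precision))

    # Highest recall reached at any threshold with precision >= 0.9.
    ok = precision >= 0.9
    recall_at_p90 = float(recall[ok].max()) if ok.any() else 0.0
    return ap, recall_at_p90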

TOOL TIP: We are also providing an evaluation script (eval_metrics.py) on the data tab that you can use to generate a score for your submission locally. The score will be based only on the ground truth you already have and does not include the ground truth that has been held out from all competitors. It also does not include all of the validation and error messaging that we apply when you make your submission on the platform. But it should be a useful tool for preliminary checks on your submission, and might give you more insight into the computations applied when we evaluate your submission.

End Phase 1 Submission


As noted above, you will not be allowed to re-train your Phase 1 model when participating in Phase 2. You will simply be applying your existing Phase 1 model onto a new data set.

In order to be eligible for final prizes in Phase 2, you will need to submit an additional zip file containing the following files prior to the end of Phase 1:

  1. code directory containing the code used to produce the submission. We recommend using a structure similar to our open source data science template. The directory should include a thorough README with documentation of dependencies and any other assets you used. See the End Phase 1 Submission page for more details on how this should look.
  2. h5 file in the submission format described above, containing your final version of the reference descriptors. This will be the finalized Phase 1 version of the file that you've already been submitting for evaluation.
  3. Optionally, a descriptor-track-metadata.txt file with descriptions of the hardware, processing time, and related stats for your final models used for Phase 2. A blank template is available on the data tab. Download the template and fill out the fields prior to submitting.

Once it's ready, you can make this submission through the End Phase 1 Submission tab.

Good luck!


Good luck and enjoy this problem! If you have any questions you can always visit the user forum!


NO PURCHASE NECESSARY TO ENTER/WIN. A PURCHASE WILL NOT INCREASE YOUR CHANCES OF WINNING. The Competition consists of two (2) Phases, with winners determined based upon Submissions using the Phase II dataset. The start and end dates and times for each Phase will be set forth on this Competition Website. Open to legal residents of the Territory, 18+ & age of majority. "Territory" means any country, state, or province where the laws of the US or local law do not prohibit participating or receiving a prize in the Challenge and excludes any area or country designated by the United States Treasury's Office of Foreign Assets Control (e.g. Cuba, Sudan, Crimea, Iran, North Korea, Syria, Venezuela). Any Participant use of External Data must be pursuant to a valid license. Void outside the Territory and where prohibited by law. Participation subject to official Competition Rules. Prizes: $50,000 USD (1st), $30,000 (2nd), $20,000 USD (3rd) for each of two tracks. See Official Rules and Competition Website for submission requirements, evaluation metrics and full details. Sponsor: Facebook, Inc., 1 Hacker Way, Menlo Park, CA 94025 USA.