N+1 fish, N+2 fish

Sustainable fishing means tracking every fish caught. New tools using automated video processing and artificial intelligence can help responsible fisheries comply with regulations, save time, and lower the safety risk and cost of having an auditor on board.

$50,000 in prizes · October 2017 · 461 competitors joined

Problem description

The features in this dataset


Videos

You are provided with video segments containing one or more fish each. These videos are gathered on different boats that are fishing for groundfish in the Gulf of Maine. The videos are collected from fixed-position cameras placed to look down on a ruler. A fish is placed on the ruler, the fisherman removes their hands from the ruler, and then either discards or keeps the fish based on its species and size. The camera captures 5 frames per second.

Currently, these videos are manually reviewed for both scientific and compliance purposes. The ultimate goal of this competition is to create an algorithm that can identify the number, length, and species of the fish in these videos. This will significantly reduce the human review time for the videos and increase the volume of data available to manage and protect fisheries.

Note: It is strictly prohibited to add additional annotations or hand-label the dataset. Algorithms must be created using the annotations as provided. Data augmentation that generates new videos and transforms the annotations is permitted.

Annotations

We also provide annotations for these videos marking the locations of the fish, in two formats: a single csv containing all of the annotations for all of the videos, which is used for the purposes of the competition, and per-video json files for the training set containing the same annotations. The json files can be loaded in the annotation software to view the annotations.

Human annotators review the videos following this process:

  • Identify a single frame that has a clear view of a fish
  • Label this frame with the species present
  • Draw a line along the fish that stretches the entire length from head to tail (the software then records this length in pixels)
  • Do not add any additional labels for this individual fish

For the training data, we have provided the x,y coordinates for both the start and end of the line drawn along the fish. These can be used to localize where in the image a fish appears when training models.

In addition to the fish, we have annotated some "no-fish" video frames. These frames can be used to help ensure that your algorithm correctly identifies when there is not a fish present on the ruler. These appear in the data as rows where all of the species columns are set equal to zero.

The data included in this competition can be visualized using the open source video annotation software provided by CVision, which utilizes FFMPEG for video decoding. This software guarantees frame level accuracy of annotations by constructing a map between decoding timestamps (DTS) and frame numbers when a video is loaded. This map is subsequently referenced when seeking and stepping forward and backward in the video. The imagery associated with each frame has been verified to match the imagery from decoding sequentially with both OpenCV and Scikit-Video.

When developing your algorithms, you can use the same tool to visualize your outputs. To do this, simply follow the JSON file format described in the manual. For this competition, all of your annotations in the "detections" array should be line annotations ("type" field is "line"). In this case the "w" and "h" fields actually correspond to the endpoint coordinates of the line. There should be one track per detection in the "tracks" array. Corresponding tracks and detections should share the same ID. The "global_state" field should be empty.
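
For illustration, here is a minimal Python sketch that writes one line annotation in this shape. Only the fields called out above ("detections" with "type" set to "line", "w" and "h" holding endpoint coordinates, one track per detection with a shared ID, and an empty "global_state") come from this description; every other key name is an assumption, so defer to the manual:

    import json

    # One fish annotated as a line; for "line"-type detections the
    # "w" and "h" fields hold the second endpoint, not width/height.
    # Key names beyond those quoted in the text are assumptions.
    detection = {
        "id": 1,
        "type": "line",
        "frame": 0,
        "x": 766, "y": 531,   # first endpoint
        "w": 659, "h": 405,   # second endpoint
    }
    track = {"id": 1}         # one track per detection, same ID

    output = {
        "detections": [detection],
        "tracks": [track],
        "global_state": {},   # should be empty
    }

    with open("00WK7DR6FyPZ5u3A.json", "w") as f:
        json.dump(output, f, indent=2)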

Note: for the majority of the videos there is a fish in the first frame, and for some videos there is a fish 4 frames before the end of the video. This is an artifact of the video processing.

These are the columns that appear in the training data:

  • row_id - An ID number for the row in the dataset.
  • video_id - The ID of the video. Videos appear in the folder as <video_id>.mp4 and have individual annotations for use in the annotation software stored as <video_id>.json
  • frame - The number of the frame in the video. Note: not every video software/library encodes/decodes videos with the same assumptions about frames. You should double-check that the library you use renders frames where these annotations line up with the fish (see the decoding sketch after this list).
  • fish_number - For a given video_id, this is the Nth (e.g., 1st, 2nd, 3rd) fish that appears in the video.
  • length - The length of the fish in pixels.
  • x1 - The x coordinate of the first point on the fish.
  • y1 - The y coordinate of the first point on the fish.
  • x2 - The x coordinate of the second point of the fish.
  • y2 - The y coordinate of the second point of the fish.
  • species_fourspot - 1 if the fish is a Fourspot Flounder
  • species_grey sole - 1 if the fish is a Grey Sole
  • species_other - 1 if the fish is one of a number of other species
  • species_plaice - 1 if the fish is a Plaice Flounder
  • species_summer - 1 if the fish is a Summer Flounder
  • species_windowpane - 1 if the fish is a Windowpane Flounder
  • species_winter - 1 if the fish is a Winter Flounder
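
Regarding the note on the frame column above: the imagery for each frame was verified against sequential decoding, so stepping through frames in order (rather than seeking) is the safer way to line frames up with these annotations. A minimal OpenCV sketch, with the filename following the <video_id>.mp4 convention:

    import cv2

    cap = cv2.VideoCapture("00WK7DR6FyPZ5u3A.mp4")
    frame_number = 0
    while True:
        ok, image = cap.read()  # sequential decode keeps frame_number aligned
        if not ok:
            break
        # ... look up annotations for this video_id and frame_number here ...
        frame_number += 1
    cap.release()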

Feature data example


For example, a single row in the dataset has these values:

row_id 0
video_id 00WK7DR6FyPZ5u3A
frame 0
fish_number 1
length 165.303
x1 766
y1 531
x2 659
y2 405
species_fourspot 0
species_grey sole 1
species_other 0
species_plaice 0
species_summer 0
species_windowpane 0
species_winter 0
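
As a quick sanity check on this row, here is a short pandas sketch (the training csv filename is an assumption) that recovers the species label and confirms that length is the Euclidean distance between the two endpoints:

    import numpy as np
    import pandas as pd

    train = pd.read_csv("training.csv")  # filename is an assumption

    species_cols = [c for c in train.columns if c.startswith("species_")]

    row = train.iloc[0]
    label = row[species_cols].astype(float).idxmax()  # "species_grey sole" here

    # length is the Euclidean distance between (x1, y1) and (x2, y2)
    dist = np.hypot(row["x2"] - row["x1"], row["y2"] - row["y1"])
    print(label, row["length"], dist)  # ~165.303 for both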

Performance metric


There are three tasks that are important to this project. Our ultimate goal is to create an algorithm that automatically generates annotations for video files, where the annotations comprise:

  • The sequence of fish that appear
  • The species of each fish that appears in the video
  • The length of each fish that appears in the video

Given that this is a competition, we've created an aggregate metric that gives a general sense of performance on all of these tasks. The metric is a simple weighted combination of an individual metric for each task. While the tasks carry different weights, we recommend you focus on a well-rounded algorithm that can contribute to each of them!

The following metric is calculated for every video file and then averaged to give an overall score:

$$ f(y, \hat{y}) = \alpha_N \cdot (1 - \text{EDIT}(y_N, \hat{y}_N)) + \alpha_L \cdot R^2(y_L, \hat{y}_L) + \alpha_S \cdot (2 \cdot \text{AUC}(y_S, \hat{y}_S) - 1) $$

Task 1: The sequence of fish that appear

The first task is to output the sequence of fish that appear in the video. To do this, we take the frame-by-frame predictions from the submission format, generate the sequence, and then calculate the error based on the generated sequence. This is performed with the following procedure (a sketch of the scoring step follows the list):

  • Group by the fish_number column, taking the maximum species probability value that appears for the frames with this fish number. This gives us a sequence of predicted fish (e.g., Cod, Cod, Haddock, Plaice).
  • Compute the edit distance (Levenshtein distance) between this sequence and the true sequence of fish.
  • Normalize the edit distance to [0, 1] by dividing by the actual number of fish and clipping values greater than 1.
  • Subtract the normalized edit distance from 1 so the best score is 1 and the worst score is 0.
  • Multiply this score by |$\alpha_N = 0.6$| since this is the most important task.
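
Here is a minimal pure-Python sketch of that scoring step (the official implementation may differ in details):

    def edit_distance(a, b):
        """Levenshtein distance between two sequences."""
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            curr = [i]
            for j, y in enumerate(b, 1):
                curr.append(min(prev[j] + 1,              # deletion
                                curr[j - 1] + 1,          # insertion
                                prev[j - 1] + (x != y)))  # substitution
            prev = curr
        return prev[-1]

    def sequence_score(true_seq, pred_seq):
        """1 minus the normalized edit distance, clipped to [0, 1]."""
        dist = edit_distance(true_seq, pred_seq) / len(true_seq)
        return 1.0 - min(dist, 1.0)

    # sequence_score(["plaice", "plaice", "summer"], ["plaice", "summer"]) -> 2/3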

Task 2: Identify species of fish

AUC can be used to measure performance on a multiclass classification problem. This is the metric we use to calculate how well algorithms identify the species in a single frame. AUC technically has a range of [0, 1], but an AUC of 0.5 means predictions are no better than random. To calculate this portion of the metric:

  • Calculate the per-class AUC for each species.
  • Take the mean of the class AUCs for an overall AUC.
  • Rescale from [0.5, 1] to [0, 1]
  • Multiply by |$\alpha_S = 0.3$| since this is the second most important task.

Note: There are some annotated frames without any fish present. Algorithms will be expected to predict 0 for every species in these rows.

Note: For situations where AUC is undefined, MAE is calculated instead (e.g., for a video where all fish are the same species).
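
Here is a sketch of this portion using scikit-learn. The direction of the MAE fallback is an assumption; 1 - MAE is used below so the fallback is oriented the same way as AUC:

    import numpy as np
    from sklearn.metrics import mean_absolute_error, roc_auc_score

    def species_score(y_true, y_pred):
        """Mean per-class AUC, rescaled from [0.5, 1] to [0, 1].

        y_true, y_pred: arrays of shape (n_frames, n_species).
        """
        per_class = []
        for k in range(y_true.shape[1]):
            if len(np.unique(y_true[:, k])) < 2:
                # AUC undefined (e.g., all fish the same species); fall back
                # to MAE -- the orientation used here is an assumption
                per_class.append(
                    1.0 - mean_absolute_error(y_true[:, k], y_pred[:, k]))
            else:
                per_class.append(roc_auc_score(y_true[:, k], y_pred[:, k]))
        return 2.0 * float(np.mean(per_class)) - 1.0  # rescale to [0, 1]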

Task 3: Identify length of fish

The length of a fish is given in pixels in the annotations. We evaluate an algorithm's success in this task using |$R^2$| (also known as the coefficient of determination). This is a standard metric for regression tasks, and has a range of |$(-\infty, 1]$|. For this task we clip negative values to zero. Finally, we multiply by |$\alpha_L = 0.1$| since this is the least important task (but still worth 10% of the overall metric).

Overall, our metric is on [0, 1] with 0 being the worst possible score and 1 being the best possible score.
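
Putting the pieces together for a single video, using the sequence_score and species_score helpers sketched above (the overall score then averages this value across videos):

    from sklearn.metrics import r2_score

    ALPHA_N, ALPHA_S, ALPHA_L = 0.6, 0.3, 0.1

    def video_score(true_seq, pred_seq, y_true, y_pred, len_true, len_pred):
        """Weighted per-video metric from the formula above."""
        seq = sequence_score(true_seq, pred_seq)      # task 1, in [0, 1]
        spc = species_score(y_true, y_pred)           # task 2, rescaled AUC
        r2 = max(r2_score(len_true, len_pred), 0.0)   # task 3, clipped at 0
        return ALPHA_N * seq + ALPHA_S * spc + ALPHA_L * r2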

Submission format


For each video in the test set, we ask competitors to make a submission that has predictions for every frame in the video file. These are the columns that are in the submission format.

  • row_id - A row ID for the test set file.
  • frame - The frame in the video.
  • video_id - The video ID for the video in the test set.
  • fish_number - For each video_id, this should be incremented each time a new fish appears clearly on the ruler.
  • length - The length of the fish in pixels.
  • species_fourspot - a probability between 0 and 1 that the fish is a Fourspot Flounder
  • species_grey sole - a probability between 0 and 1 that the fish is a Grey Sole
  • species_other - a probability between 0 and 1 that the fish belongs to a different species not explicitly tracked here
  • species_plaice - a probability between 0 and 1 that the fish is a Plaice Flounder
  • species_summer - a probability between 0 and 1 that the fish is a Summer Flounder
  • species_windowpane - a probability between 0 and 1 that the fish is a Windowpane Flounder
  • species_winter - a probability between 0 and 1 that the fish is a Winter Flounder

For example, a prediction for a single frame may look like:
row_id 0
frame 0
video_id 01rFQwp0fqXLHg33
fish_number 1
length 10
species_fourspot 0.5
species_grey sole 0.5
species_other 0.5
species_plaice 0.5
species_summer 0.5
species_windowpane 0.5
species_winter 0.5

The .csv file you submit should look like:

row_id,frame,video_id,fish_number,length,species_fourspot,species_grey sole,species_other,species_plaice,species_summer,species_windowpane,species_winter
0,0,01rFQwp0fqXLHg33,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,01rFQwp0fqXLHg33,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

...

918501,2631,zyjEx84aUTaBzbIX,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
918502,2632,zyjEx84aUTaBzbIX,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
918503,2633,zyjEx84aUTaBzbIX,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
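
Before uploading, it can help to sanity-check the file. A rough sketch (these checks and the filename are suggestions, not official validation rules):

    import pandas as pd

    sub = pd.read_csv("my_submission.csv")

    species_cols = [c for c in sub.columns if c.startswith("species_")]

    # every species value must be a probability in [0, 1]
    assert sub[species_cols].min().min() >= 0.0
    assert sub[species_cols].max().max() <= 1.0

    # fish_number should never decrease within a video
    assert (sub.groupby("video_id")["fish_number"].diff().fillna(0) >= 0).all()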

Good luck!


Good luck and enjoy this problem! If you have any questions you can always visit the user forum!