
# Problem description

In this challenge, you will predict roof material from drone imagery. The data consists of overhead imagery of seven locations across three countries, with labeled building footprints. Your goal is to classify each building footprint by its roof material type.

## The features in this dataset

The only features in this dataset are the images themselves and the building footprints in the GeoJSONs.

### Images

The imagery consists of seven large, high-resolution Cloud Optimized GeoTIFFs, one for each of the seven areas. The spatial resolution of the images is roughly 4 cm.
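To get a feel for how much ground one of these rasters spans, the pixel dimensions can be multiplied by the resolution. The sketch below is only a rough calculation: it assumes square pixels and treats the approximate resolution as exact, and the helper name `ground_extent_m` is made up for illustration. The numbers are the borde_rural values listed under Colombia below.

```python
from typing import Tuple

def ground_extent_m(width_px: int, height_px: int, resolution_cm: float) -> Tuple[float, float]:
    """Return the approximate (width, height) of a raster in metres.

    Assumes square pixels that each cover `resolution_cm` centimetres of ground.
    """
    metres_per_pixel = resolution_cm / 100.0
    return width_px * metres_per_pixel, height_px * metres_per_pixel

# borde_rural: 52318 x 31315 pixels at roughly 4 cm per pixel
width_m, height_m = ground_extent_m(52318, 31315, 4.0)
print(f"borde_rural spans roughly {width_m:.0f} m x {height_m:.0f} m")
```

This is the extent of the full raster, which may include empty areas around the surveyed region.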

#### Colombia

| Area Name | Resolution | Pixel Width x Height |
| --- | --- | --- |
| borde_rural | ~4 cm | 52318 x 31315 |
| borde_soacha | ~4.25 cm | 40159 x 45650 |

#### Guatemala

| Area Name | Resolution | Pixel Width x Height |
| --- | --- | --- |
| mixco_1_and_ebenezer | ~4.3 cm | 27604 x 26641 |
| mixco_3 | ~3.8 cm | 26066 x 19271 |

#### St. Lucia

| Area Name | Resolution | Pixel Width x Height |
| --- | --- | --- |
| castries | ~4.5 cm | 50027 x 62570 |
| dennery | ~4.2 cm | 21184 x 41534 |
| gros_islet | ~3.6 cm | 53492 x 90729 |

Note: Castries and Gros Islet contain labels from an unverified automated process. For this reason, images from Castries and Gros Islet are included only in the training dataset.

It is up to you to decide whether or not you want to utilize the potentially noisy labels from Castries and Gros Islet in your training data. These can easily be filtered out using the verified column in train_labels.csv or the verified attribute in the GeoJSON FeatureCollections. This column is True if the ground truth label is verified; for Gros Islet and Castries, it is False.
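As a minimal sketch of that filtering step, the snippet below keeps only rows whose `verified` value reads as true, using the standard `csv` module. The sample rows, the building IDs, the column order, and the assumption that the flag is stored as the text `True`/`False` are all illustrative and not taken from the real file.

```python
import csv
import io

# Hypothetical excerpt standing in for train_labels.csv; IDs and values are invented.
SAMPLE_CSV = """id,verified,concrete_cement,healthy_metal,incomplete,irregular_metal,other
a1,True,0.0,1.0,0.0,0.0,0.0
b2,False,1.0,0.0,0.0,0.0,0.0
c3,True,0.0,0.0,0.0,1.0,0.0
"""

def verified_rows(csv_file):
    """Yield only the rows whose `verified` column reads as true."""
    for row in csv.DictReader(csv_file):
        # Assumption: the flag is serialized as the text "True" or "False".
        if row["verified"].strip().lower() == "true":
            yield row

kept = list(verified_rows(io.StringIO(SAMPLE_CSV)))
print([row["id"] for row in kept])
```

With the real data, the `io.StringIO` object would be replaced by an open file handle for `train_labels.csv`.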

### Labels

Each image¹ corresponds to train and test GeoJSONs, where labels are encoded as FeatureCollections. metadata.csv links each image with its corresponding GeoJSON. For each area in the train set, the GeoJSON includes the unique building ID, building footprint, roof material, and verified field (see note above). For each area in the test set, the GeoJSON contains just the unique building ID and building footprint.
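A training FeatureCollection could be read along the following lines. This is a sketch under stated assumptions: the property keys `id`, `roof_material`, and `verified` are guesses based on the field names mentioned in this description, the sample feature is invented, and each footprint is assumed to be a simple `Polygon`.

```python
import json

# Invented one-building FeatureCollection; the property keys are assumptions.
SAMPLE_GEOJSON = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"id": "a1", "roof_material": "healthy_metal", "verified": true},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [0.0, 3.0], [0.0, 0.0]]]
      }
    }
  ]
}
"""

def read_buildings(geojson_text):
    """Return one dict per building with its ID, label fields, and exterior ring."""
    collection = json.loads(geojson_text)
    buildings = []
    for feature in collection["features"]:
        props = feature["properties"]
        buildings.append({
            "id": props.get("id"),
            # .get() returns None for test GeoJSONs, which carry no label fields.
            "roof_material": props.get("roof_material"),
            "verified": props.get("verified"),
            # Assumes Polygon geometry: the first ring is the exterior boundary.
            "exterior_ring": feature["geometry"]["coordinates"][0],
        })
    return buildings

buildings = read_buildings(SAMPLE_GEOJSON)
print(buildings[0]["id"], buildings[0]["roof_material"])
```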

Roof material labels are also provided in train_labels.csv, where each row contains a unique building ID followed by five roof material columns, with a 1.0 indicating that building's roof type and 0.0s in the remaining columns. Each building has only one roof type.
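Because each row is one-hot encoded, the class name can be recovered by finding the single column that holds 1.0. The sketch below shows one way to do that; the helper name `decode_label` and the example row are made up for illustration.

```python
ROOF_CLASSES = ["concrete_cement", "healthy_metal", "incomplete", "irregular_metal", "other"]

def decode_label(row):
    """Return the roof class whose column holds 1.0 in a one-hot encoded row."""
    hot = [name for name in ROOF_CLASSES if float(row[name]) == 1.0]
    if len(hot) != 1:
        # Each building has exactly one roof type, so anything else is malformed.
        raise ValueError(f"expected exactly one class set to 1.0, found {len(hot)}")
    return hot[0]

# Invented row, shaped like a line of train_labels.csv read with csv.DictReader.
example_row = {
    "id": "a1",
    "concrete_cement": "0.0",
    "healthy_metal": "0.0",
    "incomplete": "0.0",
    "irregular_metal": "1.0",
    "other": "0.0",
}
label = decode_label(example_row)
print(label)
```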

Here are examples of each roof type from the Borde Soacha area.

| Roof Material | Description | Count |
| --- | --- | --- |
| concrete_cement | Roofs are made of concrete or cement. | 1518 |
| healthy_metal | Includes corrugated metal, galvanized sheeting, and other metal materials. | 14817 |
| incomplete | Under construction, extremely haphazard, or damaged. | 669 |
| irregular_metal | Includes metal roofing with rusting, patching, or some damage. These roofs carry a higher risk. | 5241 |
| other | Includes shingles, tiles, red painted, or other material. | 308 |

### Data format

A STAC (SpatioTemporal Asset Catalog)² of the imagery and label data is provided. The STAC is organized as a root catalog containing sub-catalogs for each country. Each country catalog contains collections for the various areas within that country.

An area collection links to STAC items — one for the imagery, one for the training label data, and if that region has test building footprints, an item for those labels. The imagery STAC item geometry is the footprint of the image. The training data label items have overviews that give the class counts for each of the roof_material classes contained in the labeled data.
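The catalog, collection, and item hierarchy described above can be walked by following relative links. The sketch below does this over a tiny in-memory stand-in for the catalog: the `child` and `item` link relations come from the STAC spec, but every file path and ID here is hypothetical and the JSON is trimmed to the fields the traversal needs.

```python
import posixpath

# Hypothetical, heavily trimmed catalog: paths and IDs are invented for illustration.
FILES = {
    "catalog.json": {
        "id": "root",
        "links": [{"rel": "child", "href": "./colombia/catalog.json"}],
    },
    "colombia/catalog.json": {
        "id": "colombia",
        "links": [
            {"rel": "parent", "href": "../catalog.json"},
            {"rel": "child", "href": "./borde_rural/collection.json"},
        ],
    },
    "colombia/borde_rural/collection.json": {
        "id": "borde_rural",
        "links": [
            {"rel": "item", "href": "./borde_rural-imagery.json"},
            {"rel": "item", "href": "./borde_rural-labels.json"},
        ],
    },
    "colombia/borde_rural/borde_rural-imagery.json": {"id": "borde_rural-imagery", "links": []},
    "colombia/borde_rural/borde_rural-labels.json": {"id": "borde_rural-labels", "links": []},
}

def walk(path, load, depth=0):
    """Yield (depth, id) for a STAC node and everything below it."""
    node = load(path)
    yield depth, node["id"]
    base = posixpath.dirname(path)
    for link in node.get("links", []):
        # Only descend; "parent", "root", and "self" links would loop back up.
        if link["rel"] in ("child", "item"):
            target = posixpath.normpath(posixpath.join(base, link["href"]))
            yield from walk(target, load, depth + 1)

tree = list(walk("catalog.json", FILES.__getitem__))
for depth, node_id in tree:
    print("  " * depth + node_id)
```

Against the real catalog on disk, `load` would instead open the path and parse it with `json.load`.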

## Performance metric

To measure your model's performance, we'll use an error metric called log loss. Because it measures error, a lower value is better (as opposed to an accuracy metric, where a higher value is better). Log loss can be calculated as follows:

$$loss = -\frac{1}{N}\cdot\sum\limits_{i=1}^{N}\sum\limits_{j=1}^{M} y_{ij}\log p_{ij}$$

where $N$ is the number of observations, $M$ is the number of classes (for this problem, $M=5$), $y_{ij}$ is a binary variable indicating whether observation $i$ belongs to class $j$, and $p_{ij}$ is the user-predicted probability that label $j$ applies to observation $i$.

In Python you can easily calculate log loss using the scikit-learn function sklearn.metrics.log_loss. R users may find the MultiLogLoss function in the MLmetrics package.
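For readers who want to see the formula spelled out, here is a small pure-Python sketch that mirrors it term by term. It is not the scikit-learn implementation, and the clipping constant `eps` is an assumption added to avoid taking the log of zero; how the competition scorer handles probabilities of exactly 0 or 1 is not stated here.

```python
import math

def multiclass_log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood over N observations and M classes.

    y_true: list of one-hot rows (y_ij), y_pred: list of probability rows (p_ij).
    """
    total = 0.0
    for truth_row, pred_row in zip(y_true, y_pred):
        for y, p in zip(truth_row, pred_row):
            # Assumption: clip probabilities so log(0) never occurs.
            p = min(max(p, eps), 1.0 - eps)
            total += y * math.log(p)
    return -total / len(y_true)

# Two invented observations: the true classes receive 0.9 and 0.5 respectively.
y_true = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0]]
y_pred = [[0.9, 0.025, 0.025, 0.025, 0.025], [0.2, 0.5, 0.1, 0.1, 0.1]]
loss = multiclass_log_loss(y_true, y_pred)
print(round(loss, 4))
```

Only the probability assigned to the true class contributes to each observation's term, so confident wrong predictions are penalized heavily.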

## Submission format

The format for the submission file is the building id followed by the five roof material types, with a floating point representation of the probability that each roof type applies to the building. Since you are submitting probabilities, make sure there is a decimal point in your submission. Probabilities range from 0.0 to 1.0. Remember that this is a multiclass, but not multilabel, problem.

For example, if you predicted concrete with a probability of 0.9 for the first five buildings,

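| id | concrete_cement | healthy_metal | incomplete | irregular_metal | other |
| --- | --- | --- | --- | --- | --- |
| 7a4d630a | 0.9 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7a4bbbd6 | 0.9 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7a4ac744 | 0.9 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7a4881fa | 0.9 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7a4aa4a8 | 0.9 | 0.0 | 0.0 | 0.0 | 0.0 |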

the .csv file you submit would look like:

id,concrete_cement,healthy_metal,incomplete,irregular_metal,other
7a4d630a,0.9,0.0,0.0,0.0,0.0
7a4bbbd6,0.9,0.0,0.0,0.0,0.0
7a4ac744,0.9,0.0,0.0,0.0,0.0
7a4881fa,0.9,0.0,0.0,0.0,0.0
7a4aa4a8,0.9,0.0,0.0,0.0,0.0
⁝
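A file in this layout could be produced with the standard `csv` module, as in the sketch below. The helper name `write_submission`, the range check, and the six-decimal formatting are illustrative choices rather than competition requirements; the fixed-point formatting simply guarantees that every value contains a decimal point.

```python
import csv
import io

COLUMNS = ["id", "concrete_cement", "healthy_metal", "incomplete", "irregular_metal", "other"]

def write_submission(predictions, out_file):
    """Write {building_id: [five probabilities in column order]} as submission CSV."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(COLUMNS)
    for building_id, probs in predictions.items():
        if len(probs) != len(COLUMNS) - 1:
            raise ValueError(f"{building_id}: expected 5 probabilities, got {len(probs)}")
        if any(p < 0.0 or p > 1.0 for p in probs):
            raise ValueError(f"{building_id}: probabilities must lie in [0.0, 1.0]")
        # Fixed-point formatting keeps a decimal point even for whole numbers.
        writer.writerow([building_id] + [f"{p:.6f}" for p in probs])

# The first two building IDs from the example above, with the same predictions.
buffer = io.StringIO()
write_submission(
    {"7a4d630a": [0.9, 0.0, 0.0, 0.0, 0.0], "7a4bbbd6": [0.9, 0.0, 0.0, 0.0, 0.0]},
    buffer,
)
print(buffer.getvalue())
```

To write to disk, pass a file opened with `open(path, "w", newline="")` in place of the `io.StringIO` buffer.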

## Good luck!

Good luck and enjoy this problem! If you're planning to use MATLAB for your solution, be sure to request your complimentary license and check out the MATLAB starter code in the benchmark!

If you have any questions you can always visit the user forum!

1. Neither the Castries nor the Gros Islet area has a "test-" GeoJSON file. This is because their ground truth labels come from unverified predictions and so are not tested against.

2. This STAC uses the 0.8 version of the spec, of which a release candidate was recently published. It also utilizes the label extension, which is still in the proposal phase. The STAC is a self-contained catalog and uses relative links.