Keeping it Fresh: Predict Restaurant Inspections

Flag public health risks at restaurants by combining Yelp reviews with open city data on past inspections. An algorithmic approach discovers more violations with the same number of inspections. #civic

$5,000 in prizes
aug 2015
604 joined

Competition timeline

There are two phases for this competition. PHASE I runs like every other competition, with a public training set and a private test set. During PHASE I, submit your predictions for the restaurant inspections in SubmissionFormat.csv, and your score will appear on the PHASE I leaderboard.

However, the final leaderboard is not determined during this phase. In PHASE II, we will release the latest Yelp reviews posted between the start of the competition and the beginning of the second phase. Competitors will then have one week to submit predictions for the coming 6 weeks, after which submissions for PHASE II will close. During those 6 weeks, we will score these predictions against new inspections reported by health inspectors in the City of Boston.

PHASE I - Training
PHASE II - Submissions
PHASE II - Evaluation

PHASE I (Launch - 07/07/2015): Train your models on historical data and submit predictions for the PHASE I leaderboard.

Submissions for PHASE II open (06/27/2015): We release new data and you predict for a period 6 weeks into the future.

Submissions for PHASE II close (07/07/2015): You must have your predictions for the next 6 weeks submitted by this date.

Predictions for PHASE II evaluated against live inspections (07/07/2015 - 08/18/2015): No new submissions are allowed. As inspectors report their findings, we compare your predictions against what actually happens.

PHASE II ends, winners announced (08/18/2015): The final winner is determined by comparing predictions to actual inspection results during the PHASE II evaluation period.

The features in this dataset

The features in this dataset are the data provided by Yelp. This data includes business descriptions, restaurant reviews, restaurant tips, user review history, and check-ins. It's up to you to process these reviews into features that are predictive of hygiene inspection violations.

All of the Yelp data is structured as one JSON object per line in the file. You can find some examples of working with Yelp data on their Academic Dataset GitHub repository.
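As a minimal sketch of reading this one-object-per-line layout, the snippet below parses each non-empty line with Python's standard json module. The helper name parse_json_lines and the sample records are illustrative assumptions, not part of any Yelp tooling.

```python
import json


def parse_json_lines(lines):
    """Yield one dict per non-empty line, each line holding one JSON object."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up records for illustration. A real file could be passed in directly,
# e.g. open("yelp_academic_dataset_review.json", encoding="utf-8").
sample_lines = [
    '{"type": "review", "business_id": "b1", "stars": 4, "date": "2012-03-14"}',
    '{"type": "review", "business_id": "b2", "stars": 2, "date": "2013-07-01"}',
    '',
]
records = list(parse_json_lines(sample_lines))
print(len(records), records[0]["business_id"])
```

Because the helper is a generator, a large file can be processed one record at a time instead of being loaded into memory all at once.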


Information about businesses can be found in the yelp_academic_dataset_business.json file.
    {
        'type': 'business',
        'business_id': (business id),
        'name': (business name),
        'neighborhoods': [(hood names)],
        'full_address': (localized address),
        'city': (city),
        'state': (state),
        'latitude': latitude,
        'longitude': longitude,
        'stars': (star rating, rounded to half-stars),
        'review_count': review count,
        'categories': [(localized category names)],
        'open': True / False (corresponds to closed, not business hours),
        'hours': {
            (day_of_week): {
                'open': (HH:MM),
                'close': (HH:MM)
            },
            ...
        },
        'attributes': {
            (attribute_name): (attribute_value),
            ...
        },
    }


Reviews can be found in the yelp_academic_dataset_review.json file. Reviews contain the business, the user, the number of stars, the text of the review, and the date of the review.
    {
        'type': 'review',
        'business_id': (business id),
        'user_id': (user id),
        'stars': (star rating, rounded to half-stars),
        'text': (review text),
        'date': (date, formatted like '2012-03-14'),
        'votes': {(vote type): (count)},
    }
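
Because each review carries a date, one plausible way to build features for a given inspection is to use only the reviews written before that inspection. The sketch below assumes the reviews are already loaded as dicts with the fields listed above. The function names, the choice of mean star rating as a feature, and the sample values are all hypothetical.

```python
from datetime import date


def reviews_before(reviews, business_ids, inspection_date):
    """Return reviews for the given businesses dated strictly before the inspection."""
    wanted = set(business_ids)
    return [
        r for r in reviews
        if r["business_id"] in wanted
        and date.fromisoformat(r["date"]) < inspection_date
    ]


def mean_stars(reviews):
    """Average star rating, or None when there are no reviews."""
    if not reviews:
        return None
    return sum(r["stars"] for r in reviews) / len(reviews)


# Made-up sample data for illustration only.
sample_reviews = [
    {"business_id": "b1", "stars": 5, "text": "Great", "date": "2012-03-14"},
    {"business_id": "b1", "stars": 1, "text": "Dirty tables", "date": "2013-01-02"},
    {"business_id": "b2", "stars": 3, "text": "Fine", "date": "2012-05-05"},
]
earlier = reviews_before(sample_reviews, ["b1"], date(2012, 12, 31))
print(len(earlier), mean_stars(earlier))
```

Filtering by date keeps information from after an inspection out of the features for that inspection, which matters in PHASE II, where predictions are made for future dates.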


Users can be found in the yelp_academic_dataset_user.json file.
    {
        'type': 'user',
        'user_id': (user id),
        'name': (first name),
        'review_count': (review count),
        'average_stars': (floating point average, like 4.31),
        'votes': {(vote type): (count)},
        'friends': [(friend user_ids)],
        'elite': [(years_elite)],
        'yelping_since': (date, formatted like '2012-03'),
        'compliments': {
            (compliment_type): (num_compliments_of_this_type),
            ...
        },
        'fans': (num_fans),
    }


Check-ins can be found in the yelp_academic_dataset_checkin.json file.
    {
        'type': 'checkin',
        'business_id': (business id),
        'checkin_info': {
            '0-0': (number of checkins from 00:00 to 01:00 on all Sundays),
            '1-0': (number of checkins from 01:00 to 02:00 on all Sundays),
            ...
            '14-4': (number of checkins from 14:00 to 15:00 on all Thursdays),
            ...
            '23-6': (number of checkins from 23:00 to 00:00 on all Saturdays)
        }, # if there was no checkin for an hour-day block, it will not be in the dict
    }
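
The checkin_info keys pack an hour and a day index into one string, so a small helper can unpack them. The sketch below totals check-ins per weekday. It assumes day indices run from 0 (Sunday) to 6 (Saturday), which matches the three examples above but is inferred for the days in between. The function name and sample counts are made up.

```python
DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]


def checkins_by_day(checkin_info):
    """Sum check-in counts per weekday from keys shaped like 'hour-day'."""
    totals = {name: 0 for name in DAY_NAMES}
    for key, count in checkin_info.items():
        _hour, day = (int(part) for part in key.split("-"))
        totals[DAY_NAMES[day]] += count
    return totals


# Made-up counts; hour-day blocks without check-ins are simply absent.
sample_info = {"0-0": 2, "1-0": 1, "14-4": 5, "23-6": 3}
per_day = checkins_by_day(sample_info)
print(per_day["Sunday"], per_day["Thursday"], per_day["Monday"])
```

Starting every weekday at zero means absent hour-day blocks show up as explicit zeros, which is usually more convenient for building fixed-length feature vectors.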


Tips can be found in the yelp_academic_dataset_tip.json file.
    {
        'type': 'tip',
        'text': (tip text),
        'business_id': (business id),
        'user_id': (user id),
        'date': (date, formatted like '2012-03-14'),
        'likes': (count),
    }

The targets in this dataset

The City of Boston records health violations at three different levels *, **, and ***, which can be thought of as "minor", "major", and "severe" violations. The structure of this competition is to predict the count of each of these levels of violations for an inspection of a particular restaurant. For a restaurant, on a particular inspection date, you must predict the number of violations at each of the three levels.

In order to be able to identify restaurants across the Boston and the Yelp data, we provide you with an ID mapping that correlates restaurants in both data sets. Particular restaurants in the Boston dataset may match multiple Yelp businesses (since sometimes restaurants change names or move). The IDs listed in the columns yelp_id_* match a business_id in the businesses, reviews, check-ins, and tips files provided by Yelp.

Your target variables are integer counts for the columns *, **, and ***.

Label example

       date        restaurant_id  *  **  ***
9417   2013-12-30  rJoQVAOV       0  0   0
9309   2010-02-25  njoZKbOr       1  0   2
9462   2010-08-04  XJ3r4qOR       4  2   0
27235  2013-11-18  6K3lG5Ez       2  1   0
20652  2008-05-07  lnORDD3N       9  2   2

Matching Entities

Competitors will notice that restaurant_id does not match the business_id in the Yelp data. We include a mapping between restaurant_id (used in the violations data from the City of Boston) and business_id, which is Yelp's ID for a business. This mapping can be found in the file restaurant_ids_to_yelp_ids.csv, which maps each restaurant_id to one or more business_id values in the Yelp dataset.

restaurant_id  yelp_id_0               yelp_id_1  yelp_id_2  yelp_id_3
KAoKyPEg       Urw6NASrebP6tyFdjwjkwQ  NaN        NaN        NaN
1JEbl1oR       ktYpqtygWIJ2RjVPGTxNaA  NaN        NaN        NaN
6VOpaBEL       k_NyuJ67p9_D_7iIzsbJUw  NaN        NaN        NaN
we395l3k       O7Wgpxd8iGLKMmnhdbtJuw  NaN        NaN        NaN
KAoK96Eg       nUbPzfRH5-d01dSWRv4UZQ  NaN        NaN        NaN
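
One way to use this mapping is to load it into a dictionary from each restaurant_id to its list of Yelp business_id values, dropping the empty slots. The sketch below makes several assumptions: the first CSV column holds the restaurant ID, the remaining columns hold the yelp_id_* values, and missing entries appear either as blank cells or as the text NaN. The helper name and the second sample row are hypothetical.

```python
import csv
import io


def load_id_mapping(csv_file):
    """Map each restaurant_id to its list of Yelp business_ids, skipping blanks."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    mapping = {}
    for row in reader:
        if not row:
            continue
        restaurant_id, yelp_ids = row[0], row[1:]
        mapping[restaurant_id] = [y for y in yelp_ids if y and y != "NaN"]
    return mapping


# In-memory stand-in for restaurant_ids_to_yelp_ids.csv (second row is made up).
sample_csv = io.StringIO(
    "restaurant_id,yelp_id_0,yelp_id_1,yelp_id_2,yelp_id_3\n"
    "KAoKyPEg,Urw6NASrebP6tyFdjwjkwQ,,,\n"
    "hypoth01,yelp_a,yelp_b,NaN,NaN\n"
)
id_map = load_id_mapping(sample_csv)
print(id_map["hypoth01"])
```

With the real file, the same function could be called on open("restaurant_ids_to_yelp_ids.csv", newline=""). Keeping a list per restaurant preserves the case where one restaurant matches several Yelp businesses.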

PHASE I Submission Format

The goal is to predict integer counts for the variables *, **, and ***, which represent minor, major, and severe violations that are recorded by the public health inspectors. You must predict these counts for a given restaurant and inspection date. The pairs of restaurants and inspection dates are given in the SubmissionFormat.csv file. This file has the same columns as the training labels, and an example is provided below.

Submission values

Again, you must submit integer counts for each level of violation. Put another way, you must fill in the 0s in the submission format file with the correct number of violations recorded by the inspectors.

       date        restaurant_id  *  **  ***
11444  2011-05-10  njoZgbOr       0  0   0
9804   2013-12-30  eVOBnpOj       0  0   0
18045  2010-04-22  B1oXQmOV       0  0   0
10243  2013-05-31  VpoGjQ3r       0  0   0
23134  2014-12-05  8KoAzaO6       0  0   0
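
To make the fill-in step concrete, here is a rough baseline sketch that replaces the placeholder 0s with each restaurant's rounded historical average per violation level. It is not a recommended model, only an illustration of producing integer counts in the required columns. The function names, the fallback of 0 for unseen restaurants, and the sample rows are assumptions.

```python
from collections import defaultdict

LEVELS = ("*", "**", "***")


def restaurant_means(train_rows):
    """Average historical count per violation level for each restaurant."""
    sums = defaultdict(lambda: [0, 0, 0])
    counts = defaultdict(int)
    for row in train_rows:
        rid = row["restaurant_id"]
        counts[rid] += 1
        for i, level in enumerate(LEVELS):
            sums[rid][i] += int(row[level])
    return {rid: [s / counts[rid] for s in sums[rid]] for rid in sums}


def fill_submission(submission_rows, means, default=(0, 0, 0)):
    """Return copies of the rows with the 0s replaced by rounded means."""
    filled = []
    for row in submission_rows:
        preds = means.get(row["restaurant_id"], default)
        new_row = dict(row)
        for level, value in zip(LEVELS, preds):
            new_row[level] = int(round(value))
        filled.append(new_row)
    return filled


# Made-up training labels and submission rows for illustration.
train = [
    {"date": "2010-02-25", "restaurant_id": "r1", "*": "1", "**": "0", "***": "2"},
    {"date": "2011-03-01", "restaurant_id": "r1", "*": "3", "**": "0", "***": "0"},
]
submission = [
    {"date": "2012-01-10", "restaurant_id": "r1", "*": 0, "**": 0, "***": 0},
    {"date": "2012-01-11", "restaurant_id": "r2", "*": 0, "**": 0, "***": 0},
]
filled = fill_submission(submission, restaurant_means(train))
print(filled[0]["*"], filled[0]["**"], filled[0]["***"])
```

The rows keep the same columns and order as the submission format, so they can be written back out with csv.DictWriter without restructuring.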

PHASE II Submission Format

The submissions for PHASE II have exactly the same structure as the first phase. However, you must make predictions for all of the restaurants on every individual day during the evaluation period. You will only be evaluated against the results of the inspections that actually happen.

Submission values

Again, you must submit integer counts for each level of violation. Put another way, you must fill in the 0s in the submission format file with the correct number of violations recorded by the inspectors.

       date        restaurant_id  *  **  ***
67865  2015-06-17  WeEeRe3a       0  0   0
57758  2015-06-17  6VOpDlOL       0  0   0
26978  2015-06-17  1JEbKN3R       0  0   0
57446  2015-06-17  rJoQ0a3V       0  0   0
35632  2015-06-17  JGoNP6EL       0  0   0
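
The "every restaurant on every day" requirement amounts to a cross product of dates and restaurant IDs. The sketch below builds such placeholder rows for a short, illustrative date range. In practice the provided submission format file already defines the rows to predict, so the helper name, the date range, and the two-restaurant list here are assumptions for demonstration only.

```python
from datetime import date, timedelta
from itertools import product


def phase2_rows(restaurant_ids, start, end):
    """Build one placeholder row per (day, restaurant) pair, end date inclusive."""
    days = [start + timedelta(days=n) for n in range((end - start).days + 1)]
    return [
        {"date": day.isoformat(), "restaurant_id": rid, "*": 0, "**": 0, "***": 0}
        for day, rid in product(days, restaurant_ids)
    ]


# Two IDs from the example above and an arbitrary three-day window.
rows = phase2_rows(["WeEeRe3a", "6VOpDlOL"], date(2015, 6, 17), date(2015, 6, 19))
print(len(rows), rows[0]["date"], rows[-1]["date"])
```

Since only inspections that actually occur are scored, most of these rows never affect the result, but every row still needs a prediction.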