Competition timeline

There are two phases for this competition. PHASE I runs like every other competition with a public test set and private training set. During PHASE I, submit your predictions for the restaurant inspections in SubmissionFormat.csv, and your score will appear on the PHASE I leaderboard.

However, the final leaderboard is not determined during this phase. In PHASE II, we will release the latest Yelp reviews that occurred between the start of the competition and the beginning of the second phase. Competitors will then have one week to submit predictions for the coming 6 weeks, after which submissions for PHASE II will close. During those subsequent 6 weeks, we will score these predictions against new inspections that are reported by health inspectors in the City of Boston.

PHASE I - Training
PHASE II - Submissions
PHASE II - Evaluation

PHASE I	Submissions for PHASE II open	Submissions for PHASE II close	Predictions for PHASE II evaluated against live inspections	PHASE II ends, winners announced
Launch - 07/07/2015	06/27/2015	07/07/2015	07/07/2015 - 08/18/2015	08/18/2015
Train your models on historical data and submit predictions for the Phase I leaderboard	We release new data and you predict for a period 6 weeks into the future	You must have your predictions for the next 6 weeks submitted by this date	No new submissions are allowed. As inspectors report their findings, we compare your predictions against what actually happens.	Final winner is determined based on comparison of predictions to actual inspection results during the PHASE II evaluation period

The features in this dataset

The features in this dataset are the data provided by Yelp. This data includes business descriptions, restaurant reviews, restaurant tips, user review history, and check-ins. It's up to you to process these reviews into features that are predictive of hygiene inspection violations.

All of the Yelp data is structured as one JSON object per line in the file. You can find some examples of working with Yelp data on their Academic Dataset GitHub repository.
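Because each line is its own JSON object, the files can be read one line at a time rather than parsed as a single document. The following is a minimal sketch using only the Python standard library; the helper name `read_json_lines` and the two sample records are made up for illustration and are not part of the competition data.

```python
import io
import json


def read_json_lines(handle):
    """Yield one dict per non-empty line of a one-object-per-line JSON file."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two invented records standing in for a real file such as
# yelp_academic_dataset_tip.json (values are illustrative only).
sample = io.StringIO(
    '{"type": "tip", "business_id": "b1", "likes": 2}\n'
    '{"type": "tip", "business_id": "b2", "likes": 0}\n'
)
records = list(read_json_lines(sample))
```

With a real file, you would pass an open file handle (for example `open(path, encoding="utf-8")`) in place of the `io.StringIO` sample.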

Businesses

Information about businesses can be found in the yelp_academic_dataset_business.json file.

{
    'type': 'business',
    'business_id': (business id),
    'name': (business name),
    'neighborhoods': [(hood names)],
    'full_address': (localized address),
    'city': (city),
    'state': (state),
    'latitude': (latitude),
    'longitude': (longitude),
    'stars': (star rating, rounded to half-stars),
    'review_count': (review count),
    'categories': [(localized category names)],
    'open': True / False (corresponds to closed, not business hours),
    'hours': {
        (day_of_week): {
            'open': (HH:MM),
            'close': (HH:MM)
        },
        ...
    },
    'attributes': {
        (attribute_name): (attribute_value),
        ...
    },
}
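One way to turn the `attributes` object into model inputs is to flatten it into a single-level dictionary. The sketch below assumes that some attribute values may themselves be nested objects, which the schema above does not state; the attribute names in the sample record are invented for illustration, and `flatten` is a hypothetical helper.

```python
def flatten(mapping, prefix=""):
    """Flatten a possibly nested dict into {'a.b': value} form."""
    flat = {}
    for key, value in mapping.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Assumption: nested attribute groups are plain dicts.
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


# Invented business record; only the keys shown in the schema are real.
business = {
    "type": "business",
    "business_id": "b1",
    "attributes": {
        "Take-out": True,
        "Parking": {"street": True, "lot": False},
    },
}
features = flatten(business["attributes"], prefix="attr.")
```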

Reviews

Reviews can be found in the yelp_academic_dataset_review.json file. Reviews contain the business, the user, the number of stars, the text of the review, and the date of the review.

{
    'type': 'review',
    'business_id': (business id),
    'user_id': (user id),
    'stars': (star rating, rounded to half-stars),
    'text': (review text),
    'date': (date, formatted like '2012-03-14'),
    'votes': {(vote type): (count)},
}
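As a rough illustration of turning reviews into features, the sketch below summarizes the reviews for a set of Yelp businesses written before a given inspection date. Restricting to earlier reviews is a modeling choice assumed here, not a rule stated by the competition, and the function name `review_summary` and the sample reviews are hypothetical.

```python
from datetime import date


def review_summary(reviews, business_ids, inspection_date):
    """Count and average the stars of reviews dated before the inspection."""
    # ISO dates ('2012-03-14') sort correctly as plain strings.
    cutoff = inspection_date.isoformat()
    stars = [
        review["stars"]
        for review in reviews
        if review["business_id"] in business_ids and review["date"] < cutoff
    ]
    count = len(stars)
    mean = sum(stars) / count if count else None
    return {"review_count": count, "mean_stars": mean}


# Invented reviews that follow the schema above.
reviews = [
    {"business_id": "b1", "stars": 4, "date": "2012-03-14"},
    {"business_id": "b1", "stars": 2, "date": "2013-01-02"},
    {"business_id": "b1", "stars": 5, "date": "2014-05-05"},
    {"business_id": "b2", "stars": 1, "date": "2012-01-01"},
]
summary = review_summary(reviews, {"b1"}, date(2013, 12, 30))
```

Passing a set of business ids rather than a single id lets the same function cover a restaurant that matches more than one Yelp business.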

Users

Users can be found in the yelp_academic_dataset_user.json file.

{
    'type': 'user',
    'user_id': (user id),
    'name': (first name),
    'review_count': (review count),
    'average_stars': (floating point average, like 4.31),
    'votes': {(vote type): (count)},
    'friends': [(friend user_ids)],
    'elite': [(years_elite)],
    'yelping_since': (date, formatted like '2012-03'),
    'compliments': {
        (compliment_type): (num_compliments_of_this_type),
        ...
    },
    'fans': (num_fans),
}

Check-ins

Check-ins can be found in the yelp_academic_dataset_checkin.json file.

{
    'type': 'checkin',
    'business_id': (business id),
    'checkin_info': {
        '0-0': (number of checkins from 00:00 to 01:00 on all Sundays),
        '1-0': (number of checkins from 01:00 to 02:00 on all Sundays),
        ...
        '14-4': (number of checkins from 14:00 to 15:00 on all Thursdays),
        ...
        '23-6': (number of checkins from 23:00 to 00:00 on all Saturdays)
    }, # if there was no checkin for an hour-day block it will not be in the dict
}
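The `checkin_info` keys encode the hour first and the day of the week second, with day 0 being Sunday. A small sketch of decoding those keys and totalling check-ins per day might look like the following; the helper names and the sample counts are made up for illustration.

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]


def parse_checkin_key(key):
    """Split a key such as '14-4' into (hour, day name)."""
    hour, day = key.split("-")
    return int(hour), DAYS[int(day)]


def checkins_by_day(checkin_info):
    """Total check-ins per day; missing hour-day blocks count as zero."""
    totals = {day: 0 for day in DAYS}
    for key, count in checkin_info.items():
        _, day = parse_checkin_key(key)
        totals[day] += count
    return totals


# Invented counts for a single business.
checkin_info = {"0-0": 3, "14-4": 5, "9-4": 2, "23-6": 1}
totals = checkins_by_day(checkin_info)
```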

Tips

Tips can be found in the yelp_academic_dataset_tip.json file.

{
    'type': 'tip',
    'text': (tip text),
    'business_id': (business id),
    'user_id': (user id),
    'date': (date, formatted like '2012-03-14'),
    'likes': (count),
}

The targets in this dataset

The City of Boston records health violations at three levels (*, **, and ***), which can be thought of as "minor", "major", and "severe" violations. For each restaurant on a particular inspection date, you must predict the number of violations at each of the three levels.

To identify restaurants across the Boston and Yelp data, we provide an ID mapping that links restaurants in both datasets. Some restaurants in the Boston dataset may match multiple Yelp businesses (since restaurants sometimes change names or move). The IDs listed in the columns yelp_id_* match a business_id in the businesses, reviews, check-ins, and tips files provided by Yelp.

Your target variables are integer counts for the columns *, **, and ***.

Label example

id	date	restaurant_id	*	**	***
9417	2013-12-30	rJoQVAOV	0	0	0
9309	2010-02-25	njoZKbOr	1	0	2
9462	2010-08-04	XJ3r4qOR	4	2	0
27235	2013-11-18	6K3lG5Ez	2	1	0
20652	2008-05-07	lnORDD3N	9	2	2
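A minimal sketch of loading labels like these with the standard library follows. It assumes the label file is comma-separated with the header `id,date,restaurant_id,*,**,***`, which is an assumption based on the table above rather than a documented file layout; `load_labels` is a hypothetical helper, and the sample rows are copied from the example.

```python
import csv
import io

LEVELS = ("*", "**", "***")


def load_labels(handle):
    """Read label rows and convert the three violation columns to ints."""
    rows = []
    for row in csv.DictReader(handle):
        for level in LEVELS:
            row[level] = int(row[level])
        rows.append(row)
    return rows


# Rows taken from the label example; the comma-separated layout is assumed.
sample = io.StringIO(
    "id,date,restaurant_id,*,**,***\n"
    "9417,2013-12-30,rJoQVAOV,0,0,0\n"
    "9309,2010-02-25,njoZKbOr,1,0,2\n"
    "9462,2010-08-04,XJ3r4qOR,4,2,0\n"
)
labels = load_labels(sample)
```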

Matching Entities

Competitors will notice that restaurant_id does not match the business_id in the Yelp data. We include a mapping between the restaurant_id (used in the violations data from the City of Boston) and business_id, which is Yelp's id for a business. This mapping can be found in the file restaurant_ids_to_yelp_ids.csv. It maps each restaurant_id to one or more business_id values in the Yelp dataset.

restaurant_id	yelp_id_0	yelp_id_1	yelp_id_2	yelp_id_3
KAoKyPEg	Urw6NASrebP6tyFdjwjkwQ	NaN	NaN	NaN
1JEbl1oR	ktYpqtygWIJ2RjVPGTxNaA	NaN	NaN	NaN
6VOpaBEL	k_NyuJ67p9_D_7iIzsbJUw	NaN	NaN	NaN
we395l3k	O7Wgpxd8iGLKMmnhdbtJuw	NaN	NaN	NaN
KAoK96Eg	nUbPzfRH5-d01dSWRv4UZQ	NaN	NaN	NaN
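Since a restaurant can map to several Yelp businesses, it can help to load the mapping as a dictionary from restaurant_id to a list of business ids. The sketch below assumes a comma-separated file with the column names shown above and treats both empty cells and the literal text `NaN` as missing; the row `example01` with two Yelp ids is invented to show the multi-match case, and `load_id_mapping` is a hypothetical helper.

```python
import csv
import io


def load_id_mapping(handle):
    """Map each restaurant_id to its list of non-missing Yelp business ids."""
    mapping = {}
    for row in csv.DictReader(handle):
        restaurant_id = row.pop("restaurant_id")
        mapping[restaurant_id] = [
            value
            for column, value in sorted(row.items())
            if column.startswith("yelp_id_") and value and value != "NaN"
        ]
    return mapping


# First row is from the example above; 'example01' is made up.
sample = io.StringIO(
    "restaurant_id,yelp_id_0,yelp_id_1,yelp_id_2,yelp_id_3\n"
    "KAoKyPEg,Urw6NASrebP6tyFdjwjkwQ,,,\n"
    "example01,yelpA,yelpB,NaN,NaN\n"
)
mapping = load_id_mapping(sample)
```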

PHASE I Submission Format

The goal is to predict integer counts for the variables *, **, and ***, which represent minor, major, and severe violations that are recorded by the public health inspectors. You must predict these counts for a given restaurant and inspection date. The pairs of restaurants and inspection dates are given in the SubmissionFormat.csv file. This file has the same columns as the training labels, and an example is provided below.

Submission values

Again, you must submit integer counts for each level of violation. Put another way, you must fill in the 0s in the submission format file with the correct number of violations recorded by the inspectors.

id	date	restaurant_id	*	**	***
11444	2011-05-10	njoZgbOr	0	0	0
9804	2013-12-30	eVOBnpOj	0	0	0
18045	2010-04-22	B1oXQmOV	0	0	0
10243	2013-05-31	VpoGjQ3r	0	0	0
23134	2014-12-05	8KoAzaO6	0	0	0
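Filling in the submission file amounts to reading each row, replacing the three zeros with integer predictions, and writing the row back with the same columns. The following is a sketch under the assumption that SubmissionFormat.csv is comma-separated with the header shown above; `predict` is a hypothetical callback standing in for your model, and rounding with clipping at zero is just one possible way to obtain integer counts.

```python
import csv
import io

LEVELS = ("*", "**", "***")


def to_count(value):
    """Round a raw prediction to a non-negative integer count."""
    return max(0, int(round(value)))


def fill_submission(format_handle, out_handle, predict):
    """Copy the submission format, replacing the zeros with predictions."""
    reader = csv.DictReader(format_handle)
    writer = csv.DictWriter(out_handle, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        raw = predict(row["restaurant_id"], row["date"])
        for level, value in zip(LEVELS, raw):
            row[level] = to_count(value)
        writer.writerow(row)


# One row from the example above; the constant predictor is a placeholder.
template = io.StringIO(
    "id,date,restaurant_id,*,**,***\n"
    "11444,2011-05-10,njoZgbOr,0,0,0\n"
)
output = io.StringIO()
fill_submission(template, output, lambda restaurant_id, day: (2.6, 0.4, -0.2))
lines = output.getvalue().splitlines()
```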

PHASE II Submission Format

The submissions for PHASE II have exactly the same structure as the first phase. However, you must make predictions for all of the restaurants on every individual day during the evaluation period. You will only be evaluated against the results of the inspections that actually happen.

Submission values

id	date	restaurant_id	*	**	***
67865	2015-06-17	WeEeRe3a	0	0	0
57758	2015-06-17	6VOpDlOL	0	0	0
26978	2015-06-17	1JEbKN3R	0	0	0
57446	2015-06-17	rJoQ0a3V	0	0	0
35632	2015-06-17	JGoNP6EL	0	0	0
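To make the "every restaurant on every day" requirement concrete, the sketch below builds one row per restaurant per date over a date range. The start and end dates, the inclusive end date, and the function names are assumptions for illustration; the `id` column is left out because this page does not describe how those ids are assigned, so in practice you would start from the provided submission format file instead.

```python
from datetime import date, timedelta
from itertools import product


def daily_dates(start, end):
    """List every date from start to end, both inclusive (an assumption)."""
    days = (end - start).days
    return [start + timedelta(days=offset) for offset in range(days + 1)]


def phase2_rows(restaurant_ids, start, end):
    """One placeholder row per (date, restaurant) pair."""
    return [
        {"date": day.isoformat(), "restaurant_id": restaurant_id,
         "*": 0, "**": 0, "***": 0}
        for day, restaurant_id in product(daily_dates(start, end),
                                          restaurant_ids)
    ]


# A three-day window keeps the example small; the real period is 6 weeks.
rows = phase2_rows(["WeEeRe3a", "6VOpDlOL"], date(2015, 7, 7), date(2015, 7, 9))
```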

Keeping it Fresh: Predict Restaurant Inspections

Quick Facts

Winner

LilianaMedina
