Problem description

Your goal is to predict a DrivenData team member's productivity based on their proximity to their pet. To learn about the competition setup—and see how cute our pets are—prowl the competition About page.

Challenge dataset
Features and target variable
Training data example

Submissions and evaluation
Performance metric
Submission format
Optional: Data visualizations

Challenge dataset

The data includes four different pet-human pairs. Each row of data represents the average over a one-minute period during the work day. You are provided with two features (pet_name, distance), and there is one target variable (lines_per_sec):

id (string, identifier) - A randomly generated unique identifier for each row.
pet_name (string, feature) - The name of the DrivenData team member's pet.
distance (float, feature) - The distance between the DrivenData team member and their pet, measured in meters.

test.csv includes only the feature columns listed above. train.csv also includes the target variable:

lines_per_sec (float, target) - The productivity of the DrivenData team member, measured in lines of code per second.

Training data example

For example, a single row in the training dataset has these values:

id	AF-000000
distance	0.115839
pet_name	nico
lines_per_sec	0.76592
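As a minimal sketch of reading rows with this schema, the snippet below parses the example row using only the Python standard library. It assumes the files are comma-separated with a header row (the submission example on this page uses commas, but the delimiter of train.csv is not stated here). The helper name `load_rows` is hypothetical, and the in-memory string stands in for the real train.csv.

```python
import csv
import io


def load_rows(handle):
    """Parse rows from an open CSV file, casting numeric columns to float."""
    rows = []
    for record in csv.DictReader(handle):
        row = {
            "id": record["id"],
            "pet_name": record["pet_name"],
            "distance": float(record["distance"]),
        }
        # test.csv has no target column, so only cast it when it is present.
        if record.get("lines_per_sec") not in (None, ""):
            row["lines_per_sec"] = float(record["lines_per_sec"])
        rows.append(row)
    return rows


# In-memory stand-in for train.csv, holding only the example row shown above.
sample = io.StringIO(
    "id,distance,pet_name,lines_per_sec\n"
    "AF-000000,0.115839,nico,0.76592\n"
)
train_rows = load_rows(sample)
print(train_rows[0])
```

To read the real file, the same helper could be called on `open("train.csv", newline="")` instead of the in-memory sample.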

Performance metric

To measure your model’s performance, we’ll use a metric called Root Mean Square Error (RMSE), which is a measure of accuracy and quantifies differences between estimated and observed values. RMSE is the square root of the mean of the squared differences between the predicted values and the actual values. This is an error metric, so a lower value is better.

$$ RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^N (y_i - \hat{y_i})^2} $$

where

|$N$| is the number of samples
|$\hat{y_i}$| is the predicted lines_per_sec for the |$i$|th sample
|$y_i$| is the actual lines_per_sec for the |$i$|th sample

Please note that your cat will never be impressed by your RMSE, even if it is 0.
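For local validation, the metric can be sketched in a few lines of plain Python. This is an illustration of the RMSE definition above, not the platform's scoring code; the function name `rmse` and the sample numbers are made up for the example.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    # Squared difference for each sample, then the mean, then the square root.
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only: every prediction is off by exactly 2.
print(rmse([0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]))  # 2.0
# Perfect predictions give an error of 0.
print(rmse([0.5, 1.0], [0.5, 1.0]))  # 0.0
```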

Submission format

Your submission file should include two columns:

id (string): unique identifier for each row from test.csv
lines_per_sec (float): your prediction for lines of code written per second

test.csv contains all of the feature data for the test set. submission_format.csv contains the id column of test.csv, plus a lines_per_sec column filled with a random placeholder value. To create a submission, overwrite the lines_per_sec column with your predictions.

For example, if you predicted:

id	distance	pet_name	lines_per_sec
AF-156400	0.107008	nico	0.41
AF-156401	0.107008	nico	0.41
AF-156402	0.107008	nico	0.41
AF-156403	0.104064	nico	0.41
AF-156404	0.104064	nico	0.41

The first few lines of the .csv file that you submit would look like:

id,lines_per_sec
AF-156400,0.41
AF-156401,0.41
AF-156402,0.41
AF-156403,0.41
AF-156404,0.41
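One way to produce a file in this shape is sketched below using the standard library's csv module. The helper name `write_submission` is hypothetical, the constant 0.41 simply mirrors the example above, and the in-memory buffer stands in for a real output file.

```python
import csv
import io


def write_submission(ids, predictions, handle):
    """Write id,lines_per_sec rows to an open text file or buffer."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    # Use "\n" explicitly; csv.writer defaults to "\r\n" line endings.
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "lines_per_sec"])
    for row_id, prediction in zip(ids, predictions):
        writer.writerow([row_id, prediction])


# The ids from the example above, each with the same constant prediction.
test_ids = ["AF-156400", "AF-156401", "AF-156402", "AF-156403", "AF-156404"]
buffer = io.StringIO()
write_submission(test_ids, [0.41] * len(test_ids), buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function could be passed `open("submission.csv", "w", newline="")`, where the output filename is an arbitrary choice.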

Optional: Data Visualization Submissions

Data visualization is an important part of solving data science problems, and we are accepting submissions of any especially informative visualizations for this challenge. Head over to the Data Visualization Submission page if you would like to share yours.

Good luck!

Good luck and enjoy this problem! If you have any questions you can visit the general category of the user forum!

Pawsitive Predictive Value: Pets and Productivity

Quick Facts

Winner: gpilgrim