Pawsitive Predictive Value: Pets and Productivity

Understanding the pet–productivity connection for workers is critical to sustained long-term economic growth. Can you predict DrivenData team members' productivity based on their proximity to their pets?

advanced practice
apr 2022
95 joined

Problem description

Your goal is to predict a DrivenData team member's productivity based on their proximity to their pet. To learn about the competition setup—and see how cute our pets are—prowl the competition About page.

Challenge dataset


The data includes four different pet-human pairs. Each row of data represents the average over a one-minute period during the work day. You are provided with two features (pet_name, distance), and there is one target variable (lines_per_sec):

  • id (string, identifier) - A randomly generated unique identifier for each row.
  • pet_name (string, feature) - The name of the DrivenData team member's pet.
  • distance (float, feature) - The distance between the DrivenData team member and their pet, measured in meters.

test.csv includes only the three columns listed above (id and the two features). train.csv also includes the target variable:

  • lines_per_sec (float, target) - The productivity of the DrivenData team member, measured in lines of code per second.

Training data example


For example, a single row in the training dataset has these values:

id AF-000000
distance 0.115839
pet_name nico
lines_per_sec 0.76592
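
As a minimal sketch of how rows in this layout could be read, the snippet below parses a one-row stand-in for train.csv built from the example values above. It assumes the file has a header row whose names match the column names listed earlier; the helper name parse_training_rows is hypothetical, not part of the competition materials.

```python
import csv
import io

# One-row stand-in for train.csv, using the example values shown above.
# Assumption: the real file has a header row with these column names.
SAMPLE_TRAIN_CSV = """id,pet_name,distance,lines_per_sec
AF-000000,nico,0.115839,0.76592
"""


def parse_training_rows(handle):
    """Read rows in the train.csv layout, casting numeric columns to float."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append(
            {
                "id": record["id"],
                "pet_name": record["pet_name"],
                "distance": float(record["distance"]),  # meters
                "lines_per_sec": float(record["lines_per_sec"]),  # target
            }
        )
    return rows


rows = parse_training_rows(io.StringIO(SAMPLE_TRAIN_CSV))
print(rows[0])
```

To read the real file, the same function could be called with open("train.csv", newline="") in place of the in-memory string.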

Performance metric


To measure your model’s performance, we’ll use a metric called Root Mean Square Error (RMSE), which is a measure of accuracy and quantifies differences between estimated and observed values. RMSE is the square root of the mean of the squared differences between the predicted values and the actual values. This is an error metric, so a lower value is better.

$$ RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y_i})^2} $$

where

  • |$N$| is the number of samples
  • |$\hat{y_i}$| is the predicted productivity (lines of code per second) for the |$i$|th sample
  • |$y_i$| is the actual productivity (lines of code per second) for the |$i$|th sample
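
The definition above can be sketched in a few lines of plain Python. This is an illustration of the formula, not the platform's scoring code; the function name rmse and the sample values are made up for the example.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    # Mean of the squared differences, then the square root.
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only, not taken from the challenge data.
actual = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]
off_by_one = [2.0, 3.0, 4.0, 5.0]

print(rmse(actual, perfect))     # every prediction matches, so the error is 0.0
print(rmse(actual, off_by_one))  # every prediction is off by exactly 1, so the error is 1.0
```

Because the differences are squared before averaging, a few large misses raise the score more than many small ones.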

Please note that your cat will never be impressed by your RMSE, even if it is 0.

Submission format


Your submission file should include two columns:

  • id (string): unique identifier for each row from test.csv
  • lines_per_sec (float): your prediction for lines of code written per second

test.csv contains all of the feature data for the test set. submission_format.csv contains the id column of test.csv, plus a lines_per_sec column filled with a random placeholder value. To create a submission, overwrite the lines_per_sec column with your predictions.

For example, if you predicted:

id distance pet_name lines_per_sec
AF-156400 0.107008 nico 0.41
AF-156401 0.107008 nico 0.41
AF-156402 0.107008 nico 0.41
AF-156403 0.104064 nico 0.41
AF-156404 0.104064 nico 0.41

The first few lines of the .csv file that you submit would look like:

id,lines_per_sec
AF-156400,0.41
AF-156401,0.41
AF-156402,0.41
AF-156403,0.41
AF-156404,0.41
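
One way to produce such a file is sketched below, using only the standard library. It assumes submission_format.csv has the header id,lines_per_sec; the placeholder values in the stand-in text and the helper name build_submission are hypothetical, while the ids and the 0.41 predictions come from the example above.

```python
import csv
import io

# Stand-in for submission_format.csv; the 0.0 placeholders are made up.
SUBMISSION_FORMAT_CSV = """id,lines_per_sec
AF-156400,0.0
AF-156401,0.0
AF-156402,0.0
"""


def build_submission(format_handle, predictions, out_handle):
    """Copy ids from the format file and overwrite lines_per_sec."""
    writer = csv.DictWriter(
        out_handle, fieldnames=["id", "lines_per_sec"], lineterminator="\n"
    )
    writer.writeheader()
    for record in csv.DictReader(format_handle):
        row_id = record["id"]
        # A missing id raises KeyError, which flags an incomplete prediction set.
        writer.writerow({"id": row_id, "lines_per_sec": predictions[row_id]})


# Predictions keyed by id, matching the example rows above.
predictions = {"AF-156400": 0.41, "AF-156401": 0.41, "AF-156402": 0.41}

output = io.StringIO()
build_submission(io.StringIO(SUBMISSION_FORMAT_CSV), predictions, output)
print(output.getvalue())
```

Iterating over the format file rather than over the predictions keeps the submitted rows in the same order as submission_format.csv.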

Optional: Data Visualization Submissions


Data visualization is an important part of solving data science problems, and we are accepting submissions of any especially informative visualizations for this challenge. Head over to the Data Visualization Submission page if you would like to share yours.

Good luck!


Good luck and enjoy this problem! If you have any questions, you can visit the general category of the user forum.