competition
complete
€20,000

# Problem description

The objective of this competition is to predict the quantity of turbidity (the level of product traces or suspended solids in the effluent) that will be returned in the last rinsing phase during the cleaning of food and beverage industrial equipment. Depending on the expected level of turbidity, the cleaning station operator can adjust the length of the final rinse accordingly, enabling high cleaning standards while minimizing unnecessary use of water and chemicals.

## Overview

This competition will include two stages.

### Stage 1: Prediction Competition

The Clean-In-Place (CIP) processes constitute a method of cleaning production objects without disassembly. Within the CIP station, the cleaning material is stored in cleaning tanks connected to food and beverage production objects through different outlet and return pipelines.

Each process has up to five sequential phases:

• Pre-rinse phase: Rinse water is pushed into the cleaning object
• Caustic phase: Caustic soda is pushed into the cleaning object
• Intermediate rinse phase: Clean or rinse water is pushed into the object
• Acid phase: Nitric acid is pushed into the cleaning object
• Final rinse phase: Clean water is pushed into the object

For each cleaning process, you are given time series measurements, sampled every two seconds, as well as relevant metadata.

Your job is to predict the final_rinse_total_turbidity_liter, the total quantity of turbidity returned during the final rinsing phase multiplied by the outgoing flow during the final rinsing, for each cleaning process.

In the train set, you are given all available data for each cleaning process. However, in the test set you are only given data from select previous phases (up to a given time, t) and then asked to predict into the future.

• For 10% of the test instances, t corresponds to the end of the first (pre-rinse) phase.
• For 30% of the test instances, t corresponds to the end of the second (caustic) phase.
• For 30% of the test instances, t corresponds to the end of the third (intermediate rinse) phase.
• For 30% of the test instances, t corresponds to the end of the fourth (acid) phase.

Note: you may not use future data in making your predictions. The train and test sets are split in time (i.e., all the observations in the test set occur after the train set), so you may use all of the training set in making your predictions. However, you must be careful not to use any time series information in the test set that occurs later than the process being predicted.
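To make a validation set resemble the test set, one option is to truncate each training process at the end of a randomly chosen phase, using the 10/30/30/30 split listed above. The following is a minimal sketch under assumptions: the phase labels in `PHASE_ORDER` are assumed to match the values of the `phase` column (they are borrowed from the recipe column names), and `truncate_to_phase` and `choose_cutoff` are hypothetical helper names, not part of the competition materials.

```python
import random

# Assumed phase labels, in process order; check them against the `phase` column.
PHASE_ORDER = ["pre_rinse", "caustic", "intermediate_rinse", "acid", "final_rinse"]

# Share of test instances truncated at the end of each of the first four phases.
CUTOFF_WEIGHTS = [0.1, 0.3, 0.3, 0.3]


def truncate_to_phase(rows, last_phase):
    """Keep only the rows whose phase is at or before `last_phase`."""
    cutoff = PHASE_ORDER.index(last_phase)
    return [r for r in rows if PHASE_ORDER.index(r["phase"]) <= cutoff]


def choose_cutoff(rng):
    """Draw a cutoff phase with the same proportions as the test set."""
    return rng.choices(PHASE_ORDER[:4], weights=CUTOFF_WEIGHTS, k=1)[0]


# Made-up rows for a single process (values are illustrative only).
rows = [
    {"phase": "pre_rinse", "return_turbidity": 1.2},
    {"phase": "caustic", "return_turbidity": 0.8},
    {"phase": "acid", "return_turbidity": 0.3},
    {"phase": "final_rinse", "return_turbidity": 0.1},
]

visible = truncate_to_phase(rows, "caustic")
sampled_phase = choose_cutoff(random.Random(0))
```

A process whose recipe skips the chosen phase simply keeps every earlier phase it does have; whether the real test set was built this way is not stated, so treat the sampling as an approximation.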

### Stage 2: Modeling Report Competition

#### Report

In addition to getting the best possible predictions on turbidity, Schneider Electric is interested in getting a deeper understanding of the quantitative patterns that drive the performance of top algorithms. Specifically, they are interested in understanding which signal(s) at which moment(s) are mainly responsible for the presence of turbidity during the final rinse. Explanations of how outcomes are influenced by measurements earlier in the process can facilitate communication with process managers and potentially inform corrective actions.

Perspectives of particular interest to consider for the Modeling Report include:

• a sensitivity analysis to identify which variable(s) have the greatest impact on the rinsing phase outcome
• a clustering of physical world situations/events which can be correlated with the rinsing phase outcome
• information on level of confidence for predicted outcomes (for instance, to inform corrective action when confidence is higher)
• any other insights are welcome

#### Evaluation

In Stage 2, the top 15 finalists from Stage 1 will be invited to submit brief reports that analyze quantitative patterns in the data and help illuminate the factors that influence outcomes. Each report will consist of up to 10 slides, delivered in PDF format. The final prizes will be awarded to the top 3 reports selected by a panel of judges including domain and data experts from Schneider Electric.

The judging panel will evaluate each report based on the following criteria:

• Rigor (25%): To what extent is the report built on sound, sophisticated quantitative analysis and a performant statistical model?
• Clarity (25%): How clearly are underlying patterns exposed, communicated, and visualized?
• Insight (50%): How useful are the contents of the report for understanding the dynamics of the system and informing action?

Note: The judging will be done primarily on a technical basis, rather than on language (since many participants may not be native English speakers).

#### Meet the judging panel!

Veronique Boutin, Analytics & AI Industry Domain Leader (Schneider Electric)

Veronique Boutin is an engineer from Ecole Supérieure d'Electricite. She wrote her PhD thesis on an experimental thermodynamic solar power plant. At Schneider Electric, she designed numerous automatic systems in various industrial contexts. She then focused on innovation and has been involved in several large cooperative programs such as HOMES, dedicated to energy efficiency in buildings, and Arrowhead, dedicated to cooperative automation for industry, buildings, and infrastructures. She is part of the Analytics & AI team, where she is in charge of analytics applications for the Industry domain.

Kevin Caye, Artificial Intelligence Engineer (Schneider Electric)

Kevin Caye is an engineer from École Nationale Supérieure d'Informatique et de Mathématiques Appliquées currently working as a consultant for Schneider Electric. He received a PhD in statistical modeling for genomics from Université Grenoble Alpes in 2017. During the last year, he has been involved in organizing the data science competitions launched by Schneider Electric in collaboration with DrivenData.

Thomas Filloque, Global Solution Architect for CPG (Schneider Electric)

Thomas Filloque is an engineer who graduated from l’IFFI (the French Institute of Industrial Cooling and Air Conditioning) and ISUPFERE (an engineering school from CNAM Paris and Ecole des Mines de Paris), with a background in mechanical engineering and energy efficiency. At Schneider Electric for more than 7 years, Thomas started as a consulting engineer in several domains such as data center optimization, Food & Beverage utilities, sensors, and CIP. Moving from CIP installation audits to detailed CIP trend analysis, he specialized in CIP processes; he managed several CIP projects in multi-country environments before taking on the role of Offer Manager for a dedicated SE offer called EcoStruxure Clean-In-Place Advisor.

Claude Le Pape, Artificial Intelligence, Optimization, Dependability and Analytics Domain Leader (Schneider Electric)

Claude Le Pape coordinates the evaluation of new technologies and the recognition of experts in the Artificial Intelligence, Optimization, Dependability and Analytics domain at Schneider Electric. He received a PhD in Computer Science from University Paris XI and an MBA from "Collège des Ingénieurs" in 1988. From 1989 to 2007, he was successively a postdoctoral student at Stanford University, a consultant and software developer at ILOG S.A., and a senior researcher and R&D team leader at Bouygues S.A., Bouygues Telecom and ILOG. He contributed to the development of software tools and applications in various domains: manufacturing, construction, telecommunications, energy.

Emily Miller, Data Scientist (DrivenData)

Emily is one of DrivenData’s stellar data scientists. In addition to leading the setup of this competition on DrivenData, Emily has worked on a number of different challenges and DrivenData Labs projects including work in financial inclusion, energy modeling, conservation, and women’s economic empowerment. Emily holds a master's in International Development from The New School and has previously worked at the Bill & Melinda Gates Foundation, Stanford Center for International Development, and Brookings Institution.

## Datasets

train_values.csv and test_values.csv contain metadata on the cleaning process, phase, and object as well as time series measurements, sampled every 2 seconds. The time series data pertain to the monitoring and control of different cleaning process variables in both supply and return Clean-In-Place lines as well as in cleaning material tanks during the cleaning operations.

• process_id - Process ID
• object_id - Object ID
• phase - Phase of the cleaning process
• timestamp - Timestamp of measurement
• pipeline - Pipeline name

#### Clean-In-Place Measurements

• supply_flow - Measure of the flow of the fluid entering the pipeline
• supply_pressure - Pressure in bar of the cleaning agents in the supply line
• return_temperature - Temperature in degrees Celsius of cleaning agents in the return line
• return_conductivity - Conductivity in millisiemens of cleaning agents in the return line
• return_turbidity - Turbidity in NTU of cleaning agents in the return line
• return_flow - Flow in liter per hour of cleaning agents in the return line
• supply_pump - State (Boolean) of the supply pump on the supply line
• supply_pre_rinse - State (Boolean) of the pre rinse valve on the supply line
• supply_caustic - State (Boolean) of the caustic valve on supply line
• return_caustic - State (Boolean) of the caustic valve on return line
• supply_acid - State (Boolean) of the acid valve on supply line
• return_acid - State (Boolean) of the acid valve on return line
• supply_clean_water - State (Boolean) of the clean water valve on supply line
• return_recovery_water - State (Boolean) of the recovery water valve on return line
• return_drain - State (Boolean) of the drain valve on return line
• object_low_level - Presence (Boolean) of liquid in the cleaning object (for example, liquid will remain if the pump on the CIP return line did not fully purge the object)
• tank_level_pre_rinse - Level in percentage of the pre rinse tank
• tank_level_caustic - Level in percentage of the caustic tank
• tank_level_acid - Level in percentage of the acid tank
• tank_level_clean_water - Level in percentage of the clean water tank
• tank_temperature_pre_rinse - Temperature in degrees Celsius in the water recovery tank
• tank_temperature_caustic - Temperature in degrees Celsius in the caustic tank
• tank_temperature_acid - Temperature in degrees Celsius in the acid tank
• tank_concentration_caustic - Concentration in millisiemens in the caustic tank
• tank_concentration_acid - Concentration in millisiemens in the acid tank
• tank_lsh_caustic - State (Boolean) of the High Level Switch of the caustic tank (used to determine if the tank is full or not)
• tank_lsh_acid - State (Boolean) of the High Level Switch of the acid tank (used to determine if the tank is full or not)
• tank_lsh_clean_water - State (Boolean) of the High Level Switch of the clean water tank (used to determine if the tank is full or not)
• tank_lsh_pre_rinse - State (Boolean) of the High Level Switch of the pre rinse tank (used to determine if the tank is full or not)
• target_time_period - Indicator (Boolean) of whether the observation is included when calculating the target variable.

train_labels.csv contains the target variable, final_rinse_total_turbidity_liter. This is defined as the quantity of turbidity returned multiplied by the outgoing flow during the target_time_period. The target time period is the portion of the final rinse phase when the return caustic and return acid valves have been closed for the last time. Every process_id in the training values data has a corresponding final_rinse_total_turbidity_liter label in this file. The calculation for the target variable is as follows: sum(max(0, return_flow) * return_turbidity) where target_time_period=True.
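The target formula above can be sketched in plain Python as follows. This is an illustration, not the organizers' code: `compute_targets` is a hypothetical helper, the rows are made-up values, and the rows are assumed to be already parsed into numbers and booleans (values read straight from a CSV file arrive as strings and need converting first).

```python
from collections import defaultdict


def compute_targets(rows):
    """Return {process_id: target} from an iterable of row dicts.

    Each row is assumed to carry process_id, return_flow, return_turbidity
    and a boolean target_time_period, mirroring the columns of train_values.csv.
    """
    totals = defaultdict(float)
    for row in rows:
        if row["target_time_period"]:
            # Negative return flow is clipped to zero, as in the stated formula.
            flow = max(0.0, row["return_flow"])
            totals[row["process_id"]] += flow * row["return_turbidity"]
    return dict(totals)


# Made-up rows (values are illustrative, not taken from the dataset).
rows = [
    {"process_id": 1, "return_flow": 100.0, "return_turbidity": 2.0, "target_time_period": True},
    {"process_id": 1, "return_flow": -50.0, "return_turbidity": 3.0, "target_time_period": True},
    {"process_id": 1, "return_flow": 200.0, "return_turbidity": 1.0, "target_time_period": False},
    {"process_id": 2, "return_flow": 10.0, "return_turbidity": 0.5, "target_time_period": True},
]

targets = compute_targets(rows)
```

In this toy input, process 1 gets 100 * 2 = 200 (the negative-flow row contributes zero and the row outside the target period is skipped), and process 2 gets 10 * 0.5 = 5.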

#### Cleaning recipe data

recipe_metadata.csv contains the cleaning recipe for each process ID, with 1s indicating which cleaning phases the object is intended to go through. For more information, see this announcement.

• process_id - Process ID (can be used to match the process to the training set and test set)
• pre_rinse - Boolean of prescribed pre-rinse phase
• caustic - Boolean of prescribed caustic phase
• intermediate_rinse - Boolean of prescribed intermediate rinse phase
• acid - Boolean of prescribed acid phase
• final_rinse - Boolean of prescribed final rinse phase

### External data

External data is not allowed in this competition.

## Performance metric

The performance metric is a variant of mean absolute percentage error, called mean adjusted absolute percent error.

For each cleaning process (process_id), the adjusted absolute percent error is written as follows:

$$APE_i = \frac{|\hat{y}_i - y_i|}{\max(|y_i|, \text{threshold})}$$

where $y$ is the actual quantity of turbidity returned during the final rinsing, $\hat{y}$ is the predicted quantity of turbidity returned during the final rinsing, and the threshold is 290,000 NTU L. The use of the threshold ensures that predictions on smaller values are not excessively penalized.

The overall metric, mean adjusted absolute percent error, is the average of $APE_i$ over all test instances, where

$$MAPE = \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{y}_i - y_i|}{\max(|y_i|, 290000)}$$

• $N$ - The total number of turbidity predictions submitted
• $\hat{y}$ - The predicted turbidity value
• $y$ - The actual turbidity value
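For local validation, the metric can be sketched in a few lines of Python. This is a minimal sketch, not the official scoring code: the function name `mean_adjusted_ape` is hypothetical, and the example values are made up to show the effect of the threshold.

```python
def mean_adjusted_ape(y_true, y_pred, threshold=290_000.0):
    """Mean adjusted absolute percent error with a floor on the denominator."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    errors = [
        abs(pred - true) / max(abs(true), threshold)
        for true, pred in zip(y_true, y_pred)
    ]
    return sum(errors) / len(errors)


# Made-up values: the first actual is below the threshold, the second above it.
y_true = [100_000.0, 580_000.0]
y_pred = [390_000.0, 290_000.0]

score = mean_adjusted_ape(y_true, y_pred)
```

Here the first error is 290,000 / 290,000 = 1.0 (the threshold replaces the smaller actual value in the denominator) and the second is 290,000 / 580,000 = 0.5, giving a score of 0.75.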

## Submission format

submission_format.csv provides the structure your predictions need to satisfy to be evaluated successfully. The submission format has two columns: process_id and final_rinse_total_turbidity_liter. final_rinse_total_turbidity_liter is a floating point number (e.g. 1.0).

For example, if you predicted 1.0 for each cleaning process,

| process_id | final_rinse_total_turbidity_liter |
|---|---|
| 20000 | 1.0 |
| 20006 | 1.0 |
| 20007 | 1.0 |
| 20009 | 1.0 |
| 20010 | 1.0 |

the .csv file that you submit would look like:

process_id,final_rinse_total_turbidity_liter
20000,1.0
20006,1.0
20007,1.0
20009,1.0
20010,1.0
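A file in this layout can be produced with Python's standard `csv` module. The sketch below is only an illustration: `write_submission` is a hypothetical helper, and sorting by process_id is an assumption; a real submission should follow the IDs listed in submission_format.csv.

```python
import csv
import io


def write_submission(predictions, handle):
    """Write {process_id: prediction} as a two-column CSV to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["process_id", "final_rinse_total_turbidity_liter"])
    # Sorting by process_id is an assumption made for a stable row order.
    for process_id in sorted(predictions):
        writer.writerow([process_id, float(predictions[process_id])])


# Write to an in-memory buffer here; pass an open file to write to disk instead.
buffer = io.StringIO()
write_submission({20006: 1.0, 20000: 1.0}, buffer)
csv_text = buffer.getvalue()
```

To write to disk, open the target file with `newline=""` as the `csv` module documentation recommends, and pass the file object as `handle`.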


## Good luck!

Good luck and enjoy this problem! If you have any questions you can always visit the user forum!