€20,000

Problem description

The objective of this competition is to predict the quantity of turbidity (the level of product traces or suspended solids in the effluent) that will be returned in the last rinsing phase during the cleaning of food and beverage industrial equipment. Depending on the expected level of turbidity, the cleaning station operator can adjust the length of the final rinse accordingly, enabling high cleaning standards while minimizing unnecessary use of water and chemicals.

Overview

This competition will include two stages.

Stage 1: Prediction Competition

The Clean-In-Place (CIP) processes constitute a method of cleaning production objects without disassembly. Within the CIP station, the cleaning material is stored in cleaning tanks connected to food and beverage production objects through different outlet and return pipelines.

Each process has up to five sequential phases:

• Pre-rinse phase: Rinse water is pushed into the cleaning object
• Caustic phase: Caustic soda is pushed into the cleaning object
• Intermediate rinse phase: Clean or rinse water is pushed into the object
• Acid phase: Nitric acid is pushed into the cleaning object
• Final rinse phase: Clean water is pushed into the object

For each cleaning process, you are given time series measurements, sampled every two seconds, as well as relevant metadata.

Your job is to predict the final_rinse_total_turbidity_liter, the total quantity of turbidity returned during the final rinsing phase multiplied by the outgoing flow during the final rinsing, for each cleaning process.

In the train set, you are given all available data for each cleaning process. However, in the test set you are only given data from select previous phases (up to a given time, t) and then asked to predict into the future.

• For 10% of the test instances, t corresponds to the end of the first (pre-rinse) phase.
• For 30% of the test instances, t corresponds to the end of the second (caustic) phase.
• For 30% of the test instances, t corresponds to the end of the third (intermediate rinse) phase.
• For 30% of the test instances, t corresponds to the end of the fourth (acid) phase.

Note: you may not use future data in making your predictions. The train and test sets are split in time (i.e., all the observations in the test set occur after the train set), so you may use all of the training set in making your predictions. However, you must be careful not to use any time series information from the test set that occurs later than the process being predicted.
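To validate a model under the same conditions as the test set, training processes can be cut off at the end of an earlier phase. The sketch below is one possible way to do this using only the standard library. The phase labels in PHASE_ORDER and the helper names truncate_at_phase and sample_cutoff_phase are assumptions made for illustration; check the labels against the actual values in the phase column.

```python
import random

# Assumed phase labels; verify against the values in the `phase` column.
PHASE_ORDER = ["pre_rinse", "caustic", "intermediate_rinse", "acid", "final_rinse"]
CUTOFF_PHASES = PHASE_ORDER[:4]
# Shares of test instances cut at each phase, as stated in the list above.
CUTOFF_WEIGHTS = [0.1, 0.3, 0.3, 0.3]


def truncate_at_phase(rows, last_phase):
    """Keep only measurements from phases up to and including last_phase."""
    if last_phase not in CUTOFF_PHASES:
        raise ValueError(f"cutoff must be one of {CUTOFF_PHASES}")
    cutoff = PHASE_ORDER.index(last_phase)
    return [row for row in rows if PHASE_ORDER.index(row["phase"]) <= cutoff]


def sample_cutoff_phase(rng):
    """Draw a cutoff phase with the 10/30/30/30 split used for the test set."""
    return rng.choices(CUTOFF_PHASES, weights=CUTOFF_WEIGHTS, k=1)[0]


# Toy process that skips the intermediate rinse (values are made up).
process = [
    {"phase": "pre_rinse", "return_turbidity": 0.9},
    {"phase": "caustic", "return_turbidity": 0.4},
    {"phase": "acid", "return_turbidity": 0.2},
    {"phase": "final_rinse", "return_turbidity": 0.1},
]
visible = truncate_at_phase(process, "caustic")
print([row["phase"] for row in visible])
```

Not every process goes through all five phases, so a sampled cutoff may name a phase that a given process never ran. A real validation scheme would need to handle that case, for example by resampling.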

Stage 2: Modeling Report Competition

In addition to getting the best possible predictions on turbidity, Schneider Electric is interested in getting a deeper understanding of the quantitative patterns that drive the performance of top algorithms. Specifically, they are interested in understanding which signal(s) at which moment(s) is/are mainly responsible for the presence of turbidity during the final rinse. Explanations of how outcomes are influenced by measurements earlier in the process can facilitate communication with process managers and potentially inform corrective actions.

In Stage 2, the top 15 finalists from Stage 1 will be invited to submit brief reports that analyze quantitative patterns in the data and help illuminate the factors that influence outcomes. Each report will consist of up to 10 slides, delivered in PDF format. The final prizes will be awarded to the top 3 reports selected by a panel of judges including domain and data experts from Schneider Electric.

The judging panel will evaluate each report based on the following criteria, to be confirmed at the outset of Stage 2:

• Rigor: To what extent is the report built on sound, sophisticated quantitative analysis and a performant statistical model?
• Clarity: How clearly are underlying patterns exposed, communicated, and visualized?
• Insight: How useful are the contents of the report for understanding the dynamics of the system and informing action?

Note: The judging will be done primarily on a technical basis, rather than on language (since many participants may not be native English speakers).

Datasets

train_values.csv and test_values.csv contain metadata on the cleaning process, phase, and object as well as time series measurements, sampled every 2 seconds. The time series data pertain to the monitoring and control of different cleaning process variables in both supply and return Clean-In-Place lines as well as in cleaning material tanks during the cleaning operations.

• process_id - Process ID
• object_id - Object ID
• phase - Phase of the cleaning process
• timestamp - Timestamp of measurement
• pipeline - Pipeline name

Clean-In-Place Measurements

• supply_flow - Measure of the flow of the fluid entering the pipeline
• supply_pressure - Pressure in bar of the cleaning agents in the supply line
• return_temperature - Temperature in degrees Celsius of cleaning agents in the return line
• return_conductivity - Conductivity in millisiemens of cleaning agents in the return line
• return_turbidity - Turbidity in NTU of cleaning agents in the return line
• return_flow - Flow in liter per hour of cleaning agents in the return line
• supply_pump - State (Boolean) of the supply pump on the supply line
• supply_pre_rinse - State (Boolean) of the pre rinse valve on the supply line
• supply_caustic - State (Boolean) of the caustic valve on supply line
• return_caustic - State (Boolean) of the caustic valve on return line
• supply_acid - State (Boolean) of the acid valve on supply line
• return_acid - State (Boolean) of the acid valve on return line
• supply_clean_water - State (Boolean) of the clean water valve on supply line
• return_recovery_water - State (Boolean) of the recovery water valve on return line
• return_drain - State (Boolean) of the drain valve on return line
• object_low_level - Presence (Boolean) of liquid in the cleaning object (for example, liquid will remain if the pump on the CIP return line did not fully purge the object)
• tank_level_pre_rinse - Level in percentage of the pre rinse tank
• tank_level_caustic - Level in percentage of the caustic tank
• tank_level_acid - Level in percentage of the acid tank
• tank_level_clean_water - Level in percentage of the clean water tank
• tank_temperature_pre_rinse - Temperature in degrees Celsius in the water recovery tank
• tank_temperature_caustic - Temperature in degrees Celsius in the caustic tank
• tank_temperature_acid - Temperature in degrees Celsius in the acid tank
• tank_concentration_caustic - Concentration in millisiemens in the caustic tank
• tank_concentration_acid - Concentration in millisiemens in the acid tank
• tank_lsh_caustic - State (Boolean) of the High Level Switch of the caustic tank (used to determine if the tank is full or not)
• tank_lsh_acid - State (Boolean) of the High Level Switch of the acid tank (used to determine if the tank is full or not)
• tank_lsh_clean_water - State (Boolean) of the High Level Switch of the clean water tank (used to determine if the tank is full or not)
• tank_lsh_pre_rinse - State (Boolean) of the High Level Switch of the pre rinse tank (used to determine if the tank is full or not)
• target_time_period - Indicator (Boolean) of whether the observation is included when calculating the target variable.

train_labels.csv contains the target variable, final_rinse_total_turbidity_liter. This is defined as the quantity of turbidity returned multiplied by the outgoing flow during the target_time_period. The target time period is the portion of the final rinse phase when the return caustic and return acid valves have been closed for the last time. Every process_id in the training values data has a corresponding final_rinse_total_turbidity_liter label in this file. The calculation for the target variable is as follows: sum(max(0, return_flow) * return_turbidity) where target_time_period=True.
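The target calculation above can be sketched in a few lines of Python. This is a minimal illustration, assuming each measurement is a dictionary keyed by the column names listed above; the sample values are invented, and the function name simply mirrors the label column.

```python
def final_rinse_total_turbidity_liter(rows):
    """Sum max(0, return_flow) * return_turbidity over target-period rows."""
    total = 0.0
    for row in rows:
        if row["target_time_period"]:
            # Negative flow readings are clipped to zero before weighting.
            total += max(0.0, row["return_flow"]) * row["return_turbidity"]
    return total


# Made-up measurements for a single process.
rows = [
    {"return_flow": 100.0, "return_turbidity": 2.0, "target_time_period": True},
    {"return_flow": -5.0, "return_turbidity": 3.0, "target_time_period": True},
    {"return_flow": 50.0, "return_turbidity": 4.0, "target_time_period": False},
    {"return_flow": 10.0, "return_turbidity": 0.5, "target_time_period": True},
]
print(final_rinse_total_turbidity_liter(rows))  # 205.0
```

Here the second row contributes nothing because its flow is negative, and the third is skipped because it falls outside the target time period.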

Cleaning recipe data

recipe_metadata.csv contains the cleaning recipe for each process ID, with 1s indicating which cleaning phases the object is intended to go through. For more information, see this announcement.

• process_id - Process ID (can be used to match the process to the training set and test set)
• pre_rinse - Boolean of prescribed pre-rinse phase
• caustic - Boolean of prescribed caustic phase
• intermediate_rinse - Boolean of prescribed intermediate rinse phase
• acid - Boolean of prescribed acid phase
• final_rinse - Boolean of prescribed final rinse phase

External data

External data is not allowed in this competition.

Performance metric

The performance metric is a variant of mean absolute percentage error, called mean adjusted absolute percent error.

For each cleaning process (process_id), the adjusted absolute percent error is written as follows:

$$APE_i = \frac{|\hat{y}_i - y_i|}{\max(|y_i|, \text{threshold})}$$

where $y$ is the actual quantity of turbidity returned during the final rinsing, $\hat{y}$ is the predicted quantity of turbidity returned during the final rinsing, and the threshold is 290,000 NTU L. The use of the threshold ensures that predictions on smaller values are not excessively penalized.

The overall metric, mean adjusted absolute percent error, is the average of $APE_i$ over all test instances, where

$$MAPE = \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{y}_i - y_i|}{\max(|y_i|, 290000)}$$

• $N$ - The total number of turbidity predictions submitted
• $\hat{y}$ - The predicted turbidity value
• $y$ - The actual turbidity value
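For local validation, the metric can be reproduced with a short function. The following is a sketch rather than the official scoring code; the name mean_adjusted_ape and the sample values are illustrative assumptions.

```python
def mean_adjusted_ape(y_true, y_pred, threshold=290_000):
    """Mean of |y_hat - y| / max(|y|, threshold) over all predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    errors = [
        abs(pred - actual) / max(abs(actual), threshold)
        for actual, pred in zip(y_true, y_pred)
    ]
    return sum(errors) / len(errors)


# Made-up values: the first actual is below the threshold, the second above it.
y_true = [100_000.0, 580_000.0]
y_pred = [245_000.0, 435_000.0]
print(mean_adjusted_ape(y_true, y_pred))  # 0.375
```

In this example the first error is divided by the threshold (145,000 / 290,000 = 0.5) rather than by the smaller actual value, while the second is divided by the actual value (145,000 / 580,000 = 0.25), giving a mean of 0.375.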

Submission format

submission_format.csv provides the structure your predictions need to satisfy to be evaluated successfully. The submission format has two columns: process_id and final_rinse_total_turbidity_liter. final_rinse_total_turbidity_liter is a floating point number (e.g. 1.0).

For example, if you predicted 1.0 for each cleaning process,

process_id    final_rinse_total_turbidity_liter
20000         1.0
20006         1.0
20007         1.0
20009         1.0
20010         1.0

the .csv file that you submit would look like:

process_id,final_rinse_total_turbidity_liter
20000,1.0
20006,1.0
20007,1.0
20009,1.0
20010,1.0
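A file in this layout could be produced with Python's csv module. This is a minimal sketch; the helper name write_submission is an assumption, and it writes to an in-memory buffer here so the result can be inspected, where a real run would pass an open file instead.

```python
import csv
import io


def write_submission(predictions, out):
    """Write (process_id, prediction) pairs in the submission layout."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["process_id", "final_rinse_total_turbidity_liter"])
    for process_id, value in predictions:
        # float() ensures the prediction is written as a floating point number.
        writer.writerow([process_id, float(value)])


buffer = io.StringIO()
write_submission([(20000, 1.0), (20006, 1.0)], buffer)
print(buffer.getvalue())
```

When writing to disk, opening the file with newline="" (as the csv module documentation recommends) avoids extra blank lines on some platforms.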


Good luck!

Good luck and enjoy this problem! If you have any questions you can always visit the user forum!