GRU + SHAP - Explainable Predictive Maintenance for Irregular Multivariate Time Series
This recurrent neural network model uses historical data measured by machine sensors to forecast future usage and detect possible future faults in the machine itself. Explainability metrics target sensor groups and are powered by the SHAP library.
Sensors inside a production machine record data from its usage and activities. Each sensor produces a time series of measurements, and a neural network model is trained on each series. The model learns the sequences and patterns in the measured values and performs inference by extrapolating those patterns, suggesting the future trend of the measurements. This makes it possible to plan maintenance preemptively, right before a machine component breaks down or causes a machine fault, avoiding plant stoppages.
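As an illustration of this setup, the sketch below shows how a single sensor's series could be sliced into fixed-length input windows with one-step-ahead targets before training. The window size, horizon and synthetic readings are assumptions for the example, not values taken from the project.

```python
import numpy as np

def make_windows(series: np.ndarray, window_size: int = 48, horizon: int = 1):
    """Slice a 1-D sensor series into (input window, future target) pairs."""
    X, y = [], []
    for start in range(len(series) - window_size - horizon + 1):
        X.append(series[start : start + window_size])
        y.append(series[start + window_size + horizon - 1])
    # Shape (samples, steps, 1) matches what a Keras recurrent layer expects.
    return np.array(X)[..., np.newaxis], np.array(y)

# Hypothetical readings from a single sensor, for illustration only.
readings = np.sin(np.linspace(0, 60, 2000)) + np.random.normal(0, 0.05, 2000)
X, y = make_windows(readings)
print(X.shape, y.shape)  # (1952, 48, 1) (1952,)
```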
Default parameters from the Keras library were used as a starting point and tuned during training. The chosen optimizer is Nadam.
Network depth depends on the specific sensor being modelled, but generally ranges between 1 and 3 GRU layers.
Each GRU layer has 8, 16, 32 or, in one case, 64 units. The number of units loosely depends on the irregularity of the time series.
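A minimal Keras sketch of the architecture described above follows. The GRU stacking and the Nadam optimizer come from the description; the specific layer count and unit sizes passed in `units` are illustrative, since they vary per sensor.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_gru_model(window_size: int, n_features: int = 1,
                    units: tuple = (32, 16)) -> tf.keras.Model:
    """Stacked-GRU forecaster; len(units) layers, each with units[i] cells."""
    model = models.Sequential()
    model.add(layers.Input(shape=(window_size, n_features)))
    # Every GRU layer except the last returns full sequences so they stack.
    for i, u in enumerate(units):
        model.add(layers.GRU(u, return_sequences=(i < len(units) - 1)))
    model.add(layers.Dense(1))  # one-step-ahead forecast of the sensor value
    model.compile(optimizer="nadam", loss="mse",
                  metrics=[tf.keras.metrics.RootMeanSquaredError(),
                           "mae", "mape"])
    return model

model = build_gru_model(window_size=48)
model.summary()
```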
The metric used to evaluate the models was the mean squared error (MSE). It is a common choice for regression problems and is particularly sensitive to outliers. However, with extreme values (especially those close to 0) it can become unreliable on its own, so other metrics were computed during training for additional manual evaluation: RMSE, MAE and MAPE. The cumulative error also proved particularly useful for choosing the best model.
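The sketch below shows one way these auxiliary metrics could be computed on held-out predictions. The helper itself and the epsilon guard in MAPE (which addresses the instability near 0 mentioned above) are illustrative, not the project's code.

```python
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute the evaluation metrics on hypothetical held-out forecasts."""
    err = y_pred - y_true
    eps = 1e-8  # guards MAPE against division by values near zero
    return {
        "mse": np.mean(err ** 2),
        "rmse": np.sqrt(np.mean(err ** 2)),
        "mae": np.mean(np.abs(err)),
        "mape": np.mean(np.abs(err) / (np.abs(y_true) + eps)) * 100,
        # Running sum of absolute errors; its final value is the
        # cumulative error used here to compare candidate models.
        "cumulative_error": np.cumsum(np.abs(err))[-1],
    }
```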
To explain the models’ results, a series of explainability metrics was produced leveraging the SHAP library. In both the univariate and multivariate approaches, post-hoc explanations were built representing the minimum score registered by the models for each sensor (or group of sensors) within a selected temporal window. In the multivariate approach, the SHAP library was also employed to investigate the influence of each variable within each sensor group: this made it possible to gain insight into the relative importance of the variables and to understand the factors contributing to anomalous behavior within specific sensor groups. Finally, for the multivariate approach, a series of aggregated metrics was produced to explain the connection between sensor groups and individual anomalies, by means of the minimum silhouette score across the selected time range, weighted by the number of times each individual anomaly occurred overall. The resulting values, used to fill in a sensor-group-by-anomaly matrix, quantify the overall contribution of each group of sensors to each anomaly.
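A hedged sketch of the per-variable SHAP analysis described above, assuming a trained multivariate Keras model: the choice of `GradientExplainer` and the aggregation of absolute SHAP values over samples and time steps are assumptions, not the original pipeline.

```python
import numpy as np
import shap

def group_variable_importance(model, X_background, X_explain, feature_names):
    """Rank the variables of one sensor group by mean |SHAP| attribution.

    model         -- trained Keras model taking input of shape (steps, features)
    X_background  -- background samples drawn from the training windows
    X_explain     -- windows to explain (e.g. around an anomalous period)
    feature_names -- labels of the variables in this sensor group
    (all four arguments are hypothetical placeholders for this sketch)
    """
    explainer = shap.GradientExplainer(model, X_background)
    sv = np.asarray(explainer.shap_values(X_explain))
    # Normalize to (samples, steps, features) whether SHAP returns a
    # list-wrapped array or a bare one, then collapse samples and time.
    sv = sv.reshape(-1, sv.shape[-2], sv.shape[-1])
    importance = np.abs(sv).mean(axis=(0, 1))
    return sorted(zip(feature_names, importance), key=lambda p: -p[1])
```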