ADIOS - I-NERGY Training Model
The Docker image contains the execution script that trains the models on a dataset with extended labels.
Given the labelled alarm set, which can be extended using the labelling system described above, we train a machine learning model to predict the category of each alarm. The available alarms are randomly split in half: the first half is used as the training set and the second half as the test set, on which we evaluate performance.
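The random half split can be sketched as follows. This is a minimal illustration using only the Python standard library, not the script shipped in the image; the function name split_in_half, the seed parameter, and the sample alarm records are assumptions made for the example.

```python
import random


def split_in_half(alarms, seed=0):
    """Shuffle the alarms and return (training set, test set) halves.

    `seed` is a hypothetical parameter added here so the split is reproducible.
    """
    shuffled = list(alarms)                 # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # random order, fixed by the seed
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical labelled alarms; the field names are illustrative only.
alarms = [
    {"event_message": "voltage drop on feeder", "category": "critical"},
    {"event_message": "communication lost", "category": "critical"},
    {"event_message": "periodic self test passed", "category": "minor"},
    {"event_message": "door opened", "category": "minor"},
]

train_set, test_set = split_in_half(alarms)
print(len(train_set), len(test_set))
```

When the number of alarms is odd, this sketch puts the extra alarm in the test set.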
Alarms are mainly made of string fields (categorical data), and not all machine learning models can be fed this kind of data, so we first need to preprocess it. OneHotEncoding is applied to convert categorical data into numerical data, except for the event_message field, which is pre-processed differently. As previously mentioned, we build a vocabulary V by tokenizing all the words contained in this field and then removing meaningless words. For each alarm x, we count the occurrences of each word v ∈ V in its event_message and use the resulting |V|-dimensional vector as the feature vector for x. We apply our machine learning models either to the categorical data or to the event_message field. Experimental results have shown that the latter option delivers superior results.
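The two encodings described above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's preprocessing code: the stop-word list, the tokenization rule, the helper names, and the sample messages are all hypothetical. scikit-learn provides OneHotEncoder and CountVectorizer for the same purposes.

```python
import re
from collections import Counter

# Hypothetical list of "meaningless" words; the actual list is not given in this document.
STOP_WORDS = {"the", "on", "of", "a", "is", "at"}


def tokenize(message):
    """Lower-case the message, split it into word tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", message.lower()) if w not in STOP_WORDS]


def build_vocabulary(messages):
    """Return the vocabulary V as a sorted list of distinct tokens."""
    return sorted({tok for msg in messages for tok in tokenize(msg)})


def count_vector(message, vocabulary):
    """Return the |V|-dimensional vector of word counts for one message."""
    counts = Counter(tokenize(message))
    return [counts[v] for v in vocabulary]


def one_hot(value, categories):
    """One-hot encode a categorical value against a fixed list of categories."""
    return [1 if value == c else 0 for c in categories]


messages = [
    "Voltage drop on the feeder",
    "Communication lost at the substation",
    "Voltage restored, voltage stable",
]
vocabulary = build_vocabulary(messages)
features = [count_vector(m, vocabulary) for m in messages]

print(vocabulary)
print(features[2])                                 # "voltage" is counted twice here
print(one_hot("medium", ["high", "low", "medium"]))
```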
Once the data is ready, we train the machine learning model on it. We use the implementations provided by the widely adopted scikit-learn Python package, which offers an intuitive interface for applying machine learning algorithms. Model hyperparameters are left at their default values, as we were not provided with enough data to perform the cross-validation phase for model selection, which usually allows us to choose the best hyperparameter configuration for the given data. The evaluation on the test set uses four metrics, all described above: accuracy, precision on critical alarms, recall on critical alarms, and F1 score on critical alarms.
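Training with default hyperparameters and computing the four metrics might look like the sketch below. The choice of MultinomialNB is an assumption made only for illustration, since this document does not name the models used; the toy count vectors and the label names "critical" and "minor" are also hypothetical.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB

# Toy word-count feature vectors and labels (hypothetical data).
X_train = [[2, 0, 1], [1, 0, 0], [0, 2, 1], [0, 1, 0]]
y_train = ["critical", "critical", "minor", "minor"]
X_test = [[1, 0, 1], [0, 1, 1]]
y_test = ["critical", "minor"]

# Hyperparameters are left at their scikit-learn defaults.
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Binarize the labels so precision, recall and F1 refer to critical alarms only.
true_critical = [int(y == "critical") for y in y_test]
pred_critical = [int(y == "critical") for y in y_pred]

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision_critical": precision_score(true_critical, pred_critical, zero_division=0),
    "recall_critical": recall_score(true_critical, pred_critical, zero_division=0),
    "f1_critical": f1_score(true_critical, pred_critical, zero_division=0),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Any other scikit-learn classifier can be substituted for MultinomialNB without changing the rest of the sketch, because all of them expose the same fit and predict methods.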