4.1. Experimental Setting
First, we consider approaches that make use of static data (Pipeline A). We study the Metadata collected by the welding robot for each weld.
We also use the engineered features (EF) extracted from the time evolution of each time series: Voltage (EF), Current (EF), and Force (EF). With All (EF), we refer to the case in which all voltage-, current-, and force-related features are used together. Finally, we denote combinations of approaches with a “+” sign. For instance, Voltage (EF) + Metadata jointly uses the engineered features from voltage and the metadata. We use different ML models suited to tabular data, i.e., a decision tree (DT) with a maximum of five splits, a random forest (RF) with 100 trees, and k-nearest neighbors (K-NN), for which we employed numbers of neighbors ranging from 1 (1-NN) to 6 (6-NN).
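To make the tabular setup concrete, the following is a minimal sketch of how these models can be instantiated with sklearn; the file name, the label column, and the mapping of “maximum five splits” to a depth limit are assumptions for illustration, not the exact configuration used.

```python
# Minimal sketch (not the exact experimental code): Pipeline A models on tabular data.
# The file name and the label column are hypothetical placeholders.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# X could hold, e.g., the Current (EF) features concatenated with the Metadata columns.
X = pd.read_csv("features_and_metadata.csv")      # hypothetical file
y = X.pop("label")                                # 1 = "NOK" (defective), 0 = "Good"

models = {
    # "maximum five splits" approximated via tree depth; the exact sklearn
    # parameter corresponding to this constraint is an assumption.
    "DT": DecisionTreeClassifier(max_depth=5),
    "RF": RandomForestClassifier(n_estimators=100),
    **{f"{k}-NN": KNeighborsClassifier(n_neighbors=k) for k in range(1, 7)},
}
```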
Second, we consider the time series (TS) using Pipeline B. We consider either a single time series (Voltage (TS), Current (TS), or Force (TS)) or all the time series together (ALL (TS)). Additionally, to balance the dataset, we subsample the data points in the majority class, i.e., the “Good” welds (Sub.). As models, we employ a Time Series Forest (TSF) [7], a Canonical Interval Forest (CIF) [8], a Diverse Representation Canonical Interval Forest (DrCIF) [9], different CNN configurations, namely a simple CNN, a ResNet CNN [11], and an Inception CNN [12], all trained with a batch size of 8 for 200 epochs, and, lastly, a K-NN specialized for time series data, which considers the Euclidean distance between the curves. We point out that we also considered other distances for the K-NN (e.g., DTW) that exhibited similar or slightly worse performance and are omitted here for brevity.
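A hedged sketch of how the Pipeline B classifiers and the majority-class subsampling (Sub.) can be set up is given below; the data files, the subsampling ratio, and some class or parameter names (which vary across sktime/aeon releases) are assumptions rather than the exact configuration used.

```python
# Sketch only: Pipeline B time-series classifiers and "Sub." rebalancing.
# File names and the subsampling ratio are hypothetical; class/parameter names
# follow recent sktime/aeon releases and may differ in other versions.
import numpy as np
from sktime.classification.interval_based import (
    TimeSeriesForestClassifier, CanonicalIntervalForest, DrCIF,
)
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier
from aeon.classification.deep_learning import ResNetClassifier, InceptionTimeClassifier

# X: (n_welds, n_channels, series_length); y: 0 = "Good", 1 = "NOK".
X, y = np.load("welding_ts.npy"), np.load("labels.npy")      # hypothetical files

# "Sub.": subsample the majority ("Good") class; the 4:1 ratio is an assumption.
rng = np.random.default_rng(0)
good, nok = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
keep = np.concatenate([rng.choice(good, size=4 * len(nok), replace=False), nok])
X_sub, y_sub = X[keep], y[keep]

models = {
    "TSF":       TimeSeriesForestClassifier(),            # accepts univariate input only
    "CIF":       CanonicalIntervalForest(),
    "DrCIF":     DrCIF(),
    "1-NN":      KNeighborsTimeSeriesClassifier(n_neighbors=1, distance="euclidean"),
    "ResNet":    ResNetClassifier(n_epochs=200, batch_size=8),
    "Inception": InceptionTimeClassifier(n_epochs=200, batch_size=8),
}
```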
Third, we consider the best model based on static data and the best model based on time series, and we test three ensemble methods (Pipeline C) to exploit the joint effect of the two approaches. In more detail, we employ a simple Average Voting scheme, a Weighted Voting scheme, and a Meta-Learning approach based on a random forest classifier. The last two ensemble techniques are calibrated on the training-data predictions of the best models discovered in the two previous pipelines.
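As a rough illustration of the three ensembling schemes (the exact weighting and meta-features are not detailed here, so the choices below are assumptions):

```python
# Sketch only: the three Pipeline C ensembling schemes over the positive-class ("NOK")
# probabilities of the two best base models. Weighting and meta-features are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def average_voting(p_static, p_ts):
    """Unweighted mean of the two base models' probabilities for the "NOK" class."""
    return (p_static + p_ts) / 2.0

def weighted_voting(p_static, p_ts, w_static, w_ts):
    """Weights can be calibrated, e.g., from each model's F1 on the training folds."""
    return (w_static * p_static + w_ts * p_ts) / (w_static + w_ts)

def fit_meta_learner(p_static_train, p_ts_train, y_train):
    """Meta-learning: a random forest trained on the base models' training predictions."""
    meta = RandomForestClassifier(n_estimators=100)
    meta.fit(np.column_stack([p_static_train, p_ts_train]), y_train)
    return meta
```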
Unless otherwise stated, we use classical k-fold cross-validation; i.e., we divide the dataset into k folds of equal size, train the model over k − 1 folds, and test it on the remaining fold. The procedure is repeated k times. As evaluation metrics, we first use the F1 score, which reflects the model’s ability to correctly identify positive cases and its precision in the cases it deems positive. It is formally defined in terms of two other measures, namely, precision and recall. Precision represents the fraction of true positives across all positively classified records, i.e., Precision = TP / (TP + FP). Recall, on the other hand, represents the fraction of positively classified data records over all truly positive data records, i.e., Recall = TP / (TP + FN). The F1 score can then be defined as follows:

F1 = 2 · (Precision · Recall) / (Precision + Recall)
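In code, the evaluation loop can be sketched as follows; the classifier, the number of folds, and the feature matrix (X, y as in the earlier tabular sketch) are illustrative.

```python
# Sketch: k-fold cross-validated F1 score, with "NOK" encoded as the positive class (1).
# The classifier and the number of folds here are illustrative.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=5)
f1_scores = cross_val_score(clf, X, y, cv=10, scoring="f1")   # one F1 score per fold
print(f1_scores.mean())
```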
Note that accuracy (i.e., the fraction of correctly classified examples out of the total) is not an appropriate performance metric in a highly imbalanced setting like ours. Indeed, a trivial classifier that always outputs “Good” would achieve an accuracy of 96%.
As a second performance measure, we consider the confusion matrix, a representation used to understand how well an algorithm predicts the different classes. The rows represent the actual classes and the columns the predicted classes, so the diagonal elements are the correctly predicted instances. Since we take the “NOK” class as the “positive” class (we recall that our task is fault detection), the first row of the matrix contains the True Negatives (TNs), i.e., the correctly classified non-defective (good) welds, and the False Positives (FPs), i.e., the good welds classified as defective (in a production environment, this value should be as low as possible, as it corresponds to a false alarm). The second row contains the False Negatives (FNs), i.e., the faulty welds that were not identified by the algorithm, and the True Positives (TPs), i.e., the correctly classified faulty welds.
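For reference, this layout matches what, e.g., sklearn’s confusion_matrix produces when the labels are ordered with “Good” first (a small illustrative sketch with made-up labels):

```python
# Sketch: confusion matrix layout with "NOK" as the positive class.
# With labels=["Good", "NOK"], sklearn returns [[TN, FP], [FN, TP]].
from sklearn.metrics import confusion_matrix

y_true = ["Good", "Good", "NOK", "NOK", "Good"]   # made-up example labels
y_pred = ["Good", "NOK",  "NOK", "Good", "Good"]
cm = confusion_matrix(y_true, y_pred, labels=["Good", "NOK"])
tn, fp, fn, tp = cm.ravel()
```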
For training the static models, we employed the sklearn (https://scikit-learn.org, accessed on 5 August 2024) library, while for the time series ones, we employed sktime (https://www.sktime.net, accessed on 5 August 2024) for K-NN, TSF, CIF, and DrCIF, and aeon (https://www.aeon-toolkit.org, accessed on 5 August 2024) for the CNN-based models. For all algorithms, we kept the default hyperparameters except for those stated above. We trained all models on an Intel i7-13700H (Intel, Santa Clara, CA, USA) equipped with 16 GB of RAM and an Nvidia 4060 GPU (Nvidia Corporation, Santa Clara, CA, USA) with 6 GB of VRAM.
4.2. Training over the Entire Dataset
In Table 2, we report a comparison of the various models for Pipeline A with different sets of features when considering the entire dataset. The best overall approach in terms of the F1 score is Current (EF) + Metadata with the DT, i.e., the one combining the features extracted from the current time series and the metadata (contextual information) with a decision tree, which reaches an F1 score of 0.729. However, the best set of features on average (last column) is All + Metadata, i.e., the set including the features extracted from all the data series plus the metadata (the most comprehensive set).
In Table 3, we present the performance of the different models based on Pipeline B. Unlike in the previous case, the models that use the Voltage (TS) Sub. information perform best on average, with an F1 score of 0.658. Nevertheless, the best-performing approach is the one that uses the CIF model over the subsampled version of all time series (ALL (TS) Sub.), with an F1 score equal to 0.744, with DrCIF coming a close second (0.737 F1 score). Additionally, note that subsampling the “Good” class (Sub.) increases the average performance by 1–3% in all cases. In line with the previous results, considering ALL time series is also a good approach, comparable with the Voltage (TS) one, and superior when employing CIF, DrCIF, and the Inception CNN. Finally, note that the two empty cells in Table 3 are due to the fact that the ML library used does not allow multiple time series as input for the TSF model.
Finally, Table 4 reports the performance of the three tested ensemble techniques when combining the best model based on static data, i.e., the DT using Current (EF) + Metadata, and the best model based on time series, i.e., the CIF working over the ALL (TS) Sub. time series data. Both the Weighted Voting and the Meta-Learning ensembles are more effective than either of the two best individual models. The best ensemble model reaches an F1 score of 0.752.
We further evaluate the models by looking at the confusion matrices of the best approach from each pipeline in Figure 3. Confusion matrices provide deeper insights into the predictive capabilities of a model. Domain experts in the automotive sector are particularly interested in the second column of this matrix (FP and TP, from the top). They clearly want the TPs to be as high as possible, as they represent the correctly classified defective welds. However, they also want the FPs, i.e., the welds that are classified as defective but are actually flawless, to be as low as possible. High FP rates can lead to unnecessary production halts for inspections, thereby reducing productivity due to increased downtime. The first two approaches differ considerably with respect to their confusion matrices. The first approach (Figure 3a) correctly classifies 29 of the 79 defects in the dataset (TPs), while the second (Figure 3b) performs better and correctly identifies 36 defective welds. However, this comes at the cost of producing many more false alarms, as the number of FPs grows from only 10 in the first approach to more than double (24) in the second. There is, therefore, a trade-off between the ability to detect faults and the generation of false alarms. In Figure 3c, the Weighted Voting ensemble demonstrates its ability to combine the strengths of the two methods, maintaining a high number of TPs (34) while substantially lowering the FPs to 16 (as required by domain experts). This method, as indicated by its higher F1 score, appears to be the most promising.
Critical Analysis
As a further step in understanding the decision process of the trained models, we determined the feature importance for the DT approach and applied t-distributed stochastic neighbor embedding (t-SNE) [24] to the time series data.

The feature importance quantifies the relevance of the input features in predicting the target, and it is reported in Figure 4a for the DT. Interestingly, the most important feature is the expulsion time, which indicates whether and at what time an expulsion (ejection of molten metal) took place during the welding process. The second and third most important features refer to the (numerical) identifiers of two specific welding spots, which we refer to as #1 and #2. Additionally, the most important remaining features are characteristics of the first welding phase. When manually inspecting the two welding spots (#1 and #2), we found that spot #1 is defective in over 80% of the cases, while spot #2 is defective in over 60% of the cases. This points to a possible bias of the faults toward certain spots.
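Such importances can be obtained along these lines (a sketch; X and y denote the tabular features and labels from the earlier Pipeline A sketch and are hypothetical names):

```python
# Sketch: ranking the fitted decision tree's feature importances.
# X and y are the tabular features and labels from Pipeline A (hypothetical names).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=5).fit(X, y)
importances = pd.Series(dt.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))   # top-10 features
```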
Similar conclusions can be drawn from Figure 4b, in which t-SNE is applied to the Voltage time series. t-SNE enables the visualization of high-dimensional data (such as time series) and gives an idea of how “close” the data points are in a reduced space. In Figure 4b, we see that the defects cluster mainly in certain areas, and these areas correspond to welding spots #1 and #2.
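The embedding can be reproduced along these lines (a sketch; the variable holding the voltage series and the perplexity value are assumptions):

```python
# Sketch: 2-D t-SNE embedding of the voltage time series, each series flattened
# into a single feature vector. Perplexity is an illustrative choice.
from sklearn.manifold import TSNE

X_flat = X_voltage.reshape(len(X_voltage), -1)    # X_voltage: (n_welds, series_length)
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_flat)
# emb can then be scatter-plotted and colored by the "Good"/"NOK" label.
```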
4.3. Removing the Bias in the Dataset
These results hint that there may be a bias in the dataset at hand. Therefore, we decided to remove the two overly defective spots (#1 and #2), reducing the number of defective welds from 79 to 54. In the following, we report the same scenarios explored in Section 4.2, applying Pipeline A, Pipeline B, and Pipeline C to this filtered version of the dataset.
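Operationally, this filtering amounts to dropping all welds belonging to the two spots before re-running the pipelines (a sketch; the file name, the column name, and the spot identifiers are placeholders):

```python
# Sketch: removing the two over-defective welding spots before re-running the pipelines.
# The file, the "spot_id" column, and the identifiers are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("welds.csv")                     # hypothetical full weld table
biased_spots = {"spot_1", "spot_2"}
df_filtered = df[~df["spot_id"].isin(biased_spots)].reset_index(drop=True)
```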
Table 5 shows the results of Pipeline A. When comparing these results with those obtained on the entire dataset, we observe a consistent drop in performance. In particular, while the previous best result for Pipeline A attained an F1 score of 0.729, on the filtered dataset, the best approach for Pipeline A (i.e., the DT working over the Metadata) attains an F1 score of 0.606. The percentage performance drop amounts to ≈17%.
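This figure follows directly from the two F1 scores reported above: (0.729 − 0.606) / 0.729 ≈ 0.169, i.e., roughly a 17% relative decrease.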
Table 6 reports the results of Pipeline B. Also in this case, the performance is degraded with respect to that over the entire dataset. The previous best result in Table 3 reached an F1 score of 0.744, while in this case, we obtain at most 0.657, with the CIF model working over Voltage (TS) Sub. The percentage performance drop in this case amounts to ≈12%, suggesting that the Metadata may have been more affected by the bias. Indeed, the spot identifiers of the removed welding points were identified as the second and third most important features.
In Pipeline C, we consider the DT working over the Metadata as the best model for Pipeline A and the CIF working over the Force (TS) Sub. features as the model from Pipeline B. In Table 7, we report the results of the three considered ensemble methods. In this case, the performance is lower than that of the Pipeline B approach alone. This is likely due to the low performance of the best model from Pipeline A, which mostly introduces noise into the predictions of the ensemble models.