1. Introduction
There is a general assumption in supervised machine learning (ML) that the training and test data come from the same distribution, referred to as in-distribution (ID) [1]. However, in real-world scenarios, data distributions often deviate from the training distribution due to various reasons, such as changes in operational conditions, sensor degradation, or environmental variations [2,3,4]. This phenomenon, known as a distribution shift or domain shift, can result in out-of-distribution (OOD) data (inputs that deviate from the distribution observed during training) and lead to reduced accuracy of trained models [5,6]. This issue is particularly challenging in safety-critical applications and in industrial condition monitoring, where it may cause inaccurate predictions or undetected failures [7,8,9]. Monitoring deployed models is crucial to ensure reliable and robust condition monitoring, as it helps identify when a model requires maintenance, e.g., updating or expanding [10,11,12].
Various methods exist to address distribution shifts during testing, including transfer learning [13], domain adaptation [14], and domain generalization [15,16] techniques. However, none of these methods can fully guarantee the prevention of failures in deployed systems [15]. In condition monitoring, the risk of failure is particularly high due to the typically limited number of independent observations, the time-consuming nature of data acquisition, and the difficulty of covering all possible operational conditions [3]. Therefore, detecting potential OOD inputs is essential to improve the reliability and robustness of these systems [17,18].
Methods for OOD detection have been extensively studied, leading to the development of various approaches [19]. These methods can be applied in a post-hoc manner, both unsupervised [20,21] and supervised [22,23]. Supervised methods assume the availability of some data from the shifted distribution, allowing the model to learn the characteristics of possible shifts. In contrast, unsupervised approaches are more suitable when obtaining samples from the shifted distribution is challenging or impractical. This is often the case in condition monitoring [3], where reproducing faults under new conditions is not feasible.
OOD detection methodologies can be categorized into several approaches [19]. Classification-based methods use the model's output to differentiate between ID and OOD samples. A simple OOD baseline in this category is the maximum softmax probability (MSP) method, which identifies OOD data based on the model's highest softmax output [20]. An early extension of this approach is ODIN [21], which applies temperature scaling to improve OOD detection by adjusting the neural network's output. The energy-based method (referred to as Energy in the rest of the article) further refines this concept by using an energy score instead of a softmax-based score [24]. Another group of methods, known as density-based methods, assumes that ID data is more likely to be located in areas of high density within the feature space compared to OOD data. Techniques like the Histogram-Based Outlier Score (HBOS) [25] fall into this category, where density estimation is employed to identify anomalies.
Distance-based methods offer a different approach, relying on the idea that the embedded features of OOD samples should lie relatively far from those of ID data. These methods compute the distance between input samples and the training data in the feature space to determine whether the input is OOD [22,26]. Another category is OOD detection using Deep Ensembles [27,28,29], which leverages the intuition that disagreement or variation in the outputs of multiple models within an ensemble can effectively indicate OOD data.
Several studies address the problem using open-set classification approaches, in which models are tested under conditions where a portion of the faults is introduced only during the test phase. For instance, Li et al. [30] integrated stochastic elements to improve fault detection in high-speed trains, ensuring better sensitivity to previously unseen faults. In another approach, Zhou et al. [31] employed a Bayesian convolutional neural network to differentiate between known and unknown faults in a bearing dataset. Similarly, Wu et al. [32] employed Bayesian deep learning to identify unexpected faults in high-speed train bogies, focusing on capturing uncertainty to enhance the differentiation between known and unknown faults. Kamranfar et al. [33] proposed an anomaly detection framework using Multiple Instance Learning tailored for real-world sequential datasets. In their study using a bearing dataset, they defined the damaged states of the bearings as anomalies, while the normal bearing conditions served as the non-anomalous class.
While these studies have made significant progress, the majority are limited to single use-case designs and fail to offer generalizable solutions for realistic multi-sensor applications. To address these gaps, this paper makes the following key contributions:
We propose a multi-sensor deep ensemble framework that improves fault detection accuracy while effectively identifying domain shift and distribution shift.
We integrate hyperparameter (HP) optimization into model training to generalize the ensemble framework for various condition monitoring and fault detection tasks.
The proposed method is evaluated on multiple industrial condition monitoring datasets, demonstrating its generalizability and robustness in real-world scenarios.
The remainder of this paper is organized as follows: Section 2 reviews the background and related works, Section 3 explains the proposed methodology, Section 4 describes the datasets used, and Section 5 presents the results, followed by a discussion in Section 6. Finally, Section 7 concludes the paper and outlines future research directions.
2. Background and Related Works
2.1. Hyperparameter Optimization
Modern ML models have many parameters that are optimized during training, as well as HPs that are fixed during training but need to be tuned for each task. Hyperparameter optimization (HPO) is one of the tasks in automated ML (AutoML) [34,35] that focuses on automatically finding the optimal HPs for a given task.
For example, in deep neural networks (DNNs), weights and biases are optimized during training. However, the optimization process itself relies on several HPs that significantly influence the final performance of the trained network. These training HPs include the learning rate, learning rate schedule, regularization method, and the training optimizer.
In addition to training HPs, defining a network architecture involves numerous design decisions. For example, in convolutional neural networks (ConvNets), important architectural HPs include the number of convolutional layers, kernel size, and stride size. Neural Architecture Search (NAS) is a branch of AutoML dedicated to optimizing these architectural HPs and, more broadly, identifying the best architecture for a given task within a defined search space. The search space encompasses all valid combinations of HPs for the task.
Various NAS algorithms have been proposed, differing in how they define the search space and implement search strategies. Incorporating expert knowledge can help constrain the search space to a more efficient subset of HPs, significantly reducing computational cost while maintaining performance.
Grid search and random search are baseline strategies for HPO. However, both are particularly computationally expensive for NAS applications. Bayesian optimization [36] is a state-of-the-art strategy for the global optimization of expensive black-box functions. It consists of two main components: a surrogate model and an acquisition function. After each trial, the surrogate model is updated based on the available observations, and the acquisition function scores candidate points to select the next configuration to evaluate. In practice, the number of iterations is often limited by the computational resources allocated to the task.
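To make this loop concrete, the following minimal sketch optimizes a single HP (the learning rate, on a log scale) with a Gaussian-process surrogate and an expected-improvement acquisition. The objective function, bounds, and budget are placeholders for illustration, not the search space or toolbox used in this work.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(log_lr):
    """Placeholder validation loss as a function of log10(learning rate)."""
    return (log_lr + 3.0) ** 2 + 0.1 * np.random.randn()

rng = np.random.default_rng(0)
bounds = (-5.0, -1.0)                      # search log10(lr) in [1e-5, 1e-1]
X = rng.uniform(*bounds, size=(3, 1))      # initial random trials
y = np.array([objective(x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(20):                        # HP search budget
    gp.fit(X, y)                           # update the surrogate model
    cand = rng.uniform(*bounds, size=(256, 1))
    mu, sigma = gp.predict(cand, return_std=True)
    best = y.min()
    # Expected-improvement acquisition (for minimization).
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = cand[np.argmax(ei)]
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

print("Best log10(lr) found:", X[np.argmin(y), 0])
```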
2.2. Model Ensembles for Robust Predictions
Ensemble methods enhance the prediction accuracy of ML systems [27]. By combining predictions from multiple models, ensembles create a more robust and powerful predictive framework. There are several approaches to building ensemble models, including bagging, boosting, and stacking. Stacking, in particular, offers flexibility by allowing independent training of base models and aggregating predictions from different types of models.
Deep ensembles, which involve multiple neural networks, have demonstrated superior performance across a wide range of applications, providing greater reliability and accuracy compared to individual models [28]. Various strategies have been proposed to increase diversity within deep ensembles, as diversity is a key factor in improving performance. One early approach involved training ensembles by initializing networks with different random seeds, resulting in diverse sets of weights across models [28]. Additionally, introducing diversity through variations in training batches and data selection has been shown to enhance results [37]. Further diversity can be achieved by employing different HPs and model architectures, which boosts the performance of ensembles [38,39].
Predictions from the aggregated results can be made using various strategies, such as majority voting, averaging, or weighted combination methods. When the ensemble comprises diverse models with potentially complementary strengths, a meta-classifier can be introduced to learn how best to combine their outputs [40]. The use of a meta-classifier adds flexibility and can improve performance by leveraging patterns in the predictions of the base models. The meta-classifier can range from a simple logistic regression model to a more complex method, such as gradient boosting [41]. The choice of meta-classifier depends on factors such as the dataset's size, the base models' diversity, and computational constraints.
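As a small illustration of stacking with a meta-classifier, the sketch below combines two heterogeneous base models through their predicted class probabilities; the data and models are toy stand-ins, not the configuration used later in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Toy multi-class data standing in for features extracted from sensor signals.
X, y = make_classification(n_samples=600, n_features=20, n_classes=3,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Heterogeneous base models; the meta-classifier (here logistic regression)
# learns how to combine their predicted class probabilities.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",
)
stack.fit(X_tr, y_tr)
print("Stacked test accuracy:", stack.score(X_te, y_te))
```

Note that scikit-learn's StackingClassifier trains the meta-classifier on cross-validated predictions of the base models, which avoids the leakage that would arise from reusing in-sample predictions.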
2.3. Out-of-Distribution Detection
In this study, alongside our proposed ensemble approach, we evaluated several alternative OOD detection techniques that are based on individual neural networks.
Maximum Softmax Probability (MSP) [20] is a widely used indicator for OOD detection. This method is simple to apply since it relies on the standard neural network output without modifying the training process. The score is computed as follows:
$$ S_{\mathrm{MSP}}(x) = \max_i \frac{\exp\left(f_i(x)\right)}{\sum_j \exp\left(f_j(x)\right)}, $$
where $f_i(x)$ represents the logit (input to the softmax function) for class $i$.
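A minimal NumPy sketch of the MSP score from precomputed logits (illustrative only; the study's own implementation is in MATLAB):

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability per sample; logits has shape (n_samples, n_classes)."""
    z = logits - logits.max(axis=1, keepdims=True)            # stabilize the softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)                                   # higher -> more likely ID

logits = np.array([[4.0, 0.5, 0.2],    # confident prediction (likely ID)
                   [1.1, 1.0, 0.9]])   # flat prediction (possible OOD)
print(msp_score(logits))
```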
ODIN [21] is an early method that improves OOD detection using softmax scores by applying temperature scaling and input perturbation. It aims to enhance the separation between ID and OOD samples. The ODIN score is calculated as follows:
$$ S_{\mathrm{ODIN}}(x) = \max_i \frac{\exp\left(f_i(\tilde{x})/T\right)}{\sum_j \exp\left(f_j(\tilde{x})/T\right)}, $$
where $T$ is the temperature scaling parameter and $\tilde{x}$ is the input after a small perturbation.
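A corresponding sketch of the temperature-scaled part of the ODIN score; the gradient-based input perturbation is omitted for brevity, and the temperature value is only an example:

```python
import numpy as np

def odin_score(logits, temperature=1000.0):
    """Temperature-scaled maximum softmax score; logits has shape (n_samples, n_classes)."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)                       # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

logits = np.array([[4.0, 0.5, 0.2], [1.1, 1.0, 0.9]])
print(odin_score(logits))
```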
Energy [24] computes an energy score to measure the likelihood that a sample belongs to the ID data, with lower scores typically indicating ID samples and higher scores indicating OOD samples. The energy score is given by
$$ E(x) = -T \log \sum_i \exp\left(f_i(x)/T\right). $$
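The energy score can be sketched in the same way (here with $T = 1$, a common choice):

```python
import numpy as np
from scipy.special import logsumexp

def energy_score(logits, temperature=1.0):
    """Energy score per sample; lower values indicate ID, higher values indicate OOD."""
    return -temperature * logsumexp(logits / temperature, axis=1)

logits = np.array([[4.0, 0.5, 0.2], [1.1, 1.0, 0.9]])
print(energy_score(logits))
```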
The Histogram-Based Outlier Score (HBOS) [25] is a density-based method that uses the distribution of feature values to detect outliers. The method relies on creating histograms for each feature, where the frequency of the data points in each bin is used to compute the outlier score.
Mahalanobis distance-based OOD detection (MDS) [42] uses the features extracted from the penultimate layer of a neural network to determine whether a given sample is OOD. Instead of calculating the distance to a single mean, the Mahalanobis distance is computed for each class, and the minimum distance across all classes is used as the OOD score. The score for a sample $x$ is given by
$$ S_{\mathrm{MDS}}(x) = \min_i \left( f(x) - \mu_i \right)^{\top} \Sigma^{-1} \left( f(x) - \mu_i \right), $$
where
- $f(x)$ is the feature vector of the input sample $x$ from the penultimate layer;
- $\mu_i$ is the mean vector of the ID features for class $i$ (computed during training);
- $\Sigma^{-1}$ is the inverse of the shared covariance matrix of the ID features across all classes.
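A NumPy sketch of this score, assuming the penultimate-layer features have already been extracted (feature extraction itself is model specific and omitted):

```python
import numpy as np

def fit_mds(features, labels):
    """Class means and (inverse) shared covariance of ID penultimate-layer features."""
    classes = np.unique(labels)
    means = np.array([features[labels == c].mean(axis=0) for c in classes])
    centered = features - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(features)
    return means, np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))

def mds_score(x_feat, means, cov_inv):
    """Minimum class-conditional Mahalanobis distance; higher -> more likely OOD."""
    diffs = means - x_feat                        # (n_classes, n_features)
    return np.min(np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs))

rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 8))                 # stand-in ID features
labels = rng.integers(0, 3, size=300)             # stand-in class labels
means, cov_inv = fit_mds(feats, labels)
print(mds_score(rng.normal(size=8), means, cov_inv))          # ID-like sample
print(mds_score(10 + rng.normal(size=8), means, cov_inv))     # shifted sample
```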
OOD Detection Using Statistical Moments of Data (StatData) is a method that relies on simple features extracted from the raw data. In certain use cases, distribution shifts can be identified directly from the input data. To evaluate this approach in our benchmark, we construct a k-nearest neighbor (kNN) [43] model based on the statistical moments of the input data, including mean, variance, skewness, and kurtosis, as described in Algorithm 1. This method leverages the inherent distributional properties of sensor data to classify samples as either ID or OOD.
Algorithm 1 OOD Detection Using Statistical Moments of Data
1: Input: S sensors, in-distribution (ID) training data
2: Output: OOD score for new samples
3: for each sensor s = 1 to S do
4:    Compute the statistical moments of the signal
5: end for
6: features = concatenate statistical moments across all sensors
7: Build a kNN model (k = 1) using these features on the ID data
8: Return: OOD score from the kNN model
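A possible Python rendering of Algorithm 1, using SciPy for the higher-order moments; the two-sensor toy signals are placeholders:

```python
import numpy as np
from scipy.stats import kurtosis, skew
from sklearn.neighbors import NearestNeighbors

def moment_features(signals):
    """signals: list of arrays with shape (n_samples, signal_length), one per sensor."""
    feats = [np.column_stack([s.mean(axis=1), s.var(axis=1),
                              skew(s, axis=1), kurtosis(s, axis=1)])
             for s in signals]
    return np.hstack(feats)                       # concatenate across sensors

rng = np.random.default_rng(0)
id_signals = [rng.normal(size=(200, 1000)) for _ in range(2)]   # 2 sensors, ID data
knn = NearestNeighbors(n_neighbors=1).fit(moment_features(id_signals))

new_per_sensor = [np.vstack([rng.normal(size=1000),             # ID-like sample
                             rng.normal(loc=0.8, size=1000)])   # shifted sample
                  for _ in range(2)]
dist, _ = knn.kneighbors(moment_features(new_per_sensor))
print(dist.ravel())   # the shifted sample should receive a larger OOD score
```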
2.4. Evaluation Metrics
Three metrics are used to evaluate the proposed methods and their results. The first metric, accuracy, is used for assessing supervised classification methods, while the other two metrics address the performance of unsupervised methods.
Accuracy: To evaluate the prediction performance of models in supervised classification tasks, one of the essential metrics is accuracy. In tasks with an approximately equal representation of classes, accuracy is one of the most useful and easy-to-interpret metrics [44]. Accuracy is defined as follows:
$$ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, $$
where TP = True Positives, TN = True Negatives, FP = False Positives, and FN = False Negatives.
AUROC: The Area Under the Receiver Operating Characteristic curve (AUROC) shows the relation between the TP rate (TPR) and the FP rate (FPR) and is a threshold-invariant metric for binary classification tasks [45]. A perfect model has 100% AUROC.
FPR95: AUROC alone cannot capture all aspects of the model's behavior, particularly in handling false positives. FPR at TPR x (FPRx) measures the likelihood of misclassifying an OOD sample as an ID sample when the TPR is fixed at x%. While the choice of x can vary depending on the task requirements, 95% (FPR95) is a commonly used standard in related studies [19]. AUROC and FPR95 should be evaluated together to gain a comprehensive understanding of the model's performance.
3. The Proposed Framework
This section outlines the proposed framework designed to address diverse fault detection applications in real-world scenarios. Datasets from condition monitoring applications vary in data length and the number of sensors, and the proposed framework is designed to accommodate these differences. As illustrated in Figure 1, the method is an ensemble-based approach that incorporates embedded HPO and provides OOD scores alongside its predictions. This approach aligns with the principles of AutoML by integrating HPO and model selection as essential components of the ML process [34,35]. Rather than being tailored to a specific use case, the framework is designed to ensure broad applicability across datasets with similar characteristics.
The framework consists of four modular building blocks: the HP-driven ensemble, fusion, meta-classifier, and anomaly detection components. Each block offers flexibility in selecting methods to suit specific requirements and constraints. Common considerations influencing these choices include hardware limitations, interpretability, and prediction performance. For instance, interpretability constraints may restrict the selection of methods for the ensemble block. In this study, we demonstrate the framework using a single configuration for each block, as summarized in Table 1. However, alternative configurations can also be employed based on specific requirements. The following sections provide a detailed description of the framework and its components.
All methods and algorithms of this study were implemented in MATLAB [46]. The DNNs were developed using the MATLAB Deep Learning Toolbox [47], while Bayesian optimization was conducted using the Statistics and ML Toolbox [48]. The training of ConvNets was performed on NVIDIA Quadro RTX 5000 GPUs to ensure computational efficiency.
3.1. HP-Driven Ensemble
The HP-driven ensemble block consists of multiple base models selected from the HP search space after the HPO process. Using the flexibility of the stacking approach, the selection of base models can be customized according to specific design criteria, such as deployment constraints, computational budget, interpretability requirements, or expert domain knowledge [49]. Possible options for the base models include interpretable methods [50], DNNs [28], or custom feature engineering approaches.
This study employs DNNs as the base models for the ensemble block. Architecture designs for DNNs vary widely, ranging from multi-layer perceptrons to ConvNets [51] and transformers [52]. ConvNets have become a cornerstone of ML applications, initially revolutionizing the field of computer vision [53] and subsequently being adapted to a wide range of domains, including signal processing, natural language processing, and more [51]. Over the years, many variations of ConvNets have been proposed [54,55,56,57]. We utilize the HyperParam Convolutional Neural Network (HP-ConvNet), a 1D-ConvNet [58], as depicted in Figure 2. HP-ConvNet is an adaptive architecture designed to accommodate a wide range of signals across diverse use cases. At its core, the network employs a convolutional (Conv) block, which integrates a convolutional layer, batch normalization, ReLU activation, and a dropout layer. A comprehensive description of HP-ConvNet, including its design and parameters, is provided in Appendix B.
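For illustration, a Conv block of this kind can be sketched as follows (a PyTorch example with hypothetical channel counts and kernel sizes, not the authors' MATLAB implementation):

```python
import torch
from torch import nn

def conv_block(in_ch, out_ch, kernel_size, dropout=0.2):
    """1D Conv block: convolution -> batch normalization -> ReLU -> dropout."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
        nn.Dropout(dropout),
    )

# Two stacked Conv blocks followed by pooling and a classifier head
# (channel counts and kernel sizes are examples of tunable architectural HPs).
model = nn.Sequential(
    conv_block(1, 16, kernel_size=7),
    conv_block(16, 32, kernel_size=5),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(32, 4),          # 4 = number of classes (example)
)
x = torch.randn(8, 1, 1000)    # batch of 8 single-channel signals of length 1000
print(model(x).shape)          # torch.Size([8, 4])
```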
The HPs of the HP-ConvNet are selected using Bayesian optimization over a defined search space, as summarized in Table A1. This study primarily focuses on architectural HPs, while the training HPs, except for the learning rate (LR), remain fixed. The training optimizer for all evaluations is the Adam optimizer [59]. Given the large number of potential HPs for the HP-ConvNet, numerous valid network configurations are possible [60]. To reduce the search space and facilitate early rejection of invalid HP combinations, several constraints are applied to the optimization algorithm, as detailed in Appendix C.
3.2. Fusion
The scores or embedded features generated by the ensemble block must be aggregated to create a unified representation suitable for subsequent processing. The data fusion process can range from simple concatenation to more complex operations, such as weighted averaging or advanced transformation techniques [61]. Aggregating features from different models may require additional preprocessing steps, including dimensionality reduction or feature selection, to ensure consistency and reduce redundancy.
In this study, we assign equal weights to each base model within the ensemble block, with decision-level fusion employed as the integration point. The ensemble block outputs the softmax scores generated by each ConvNet, where all scores share the same dimension, corresponding to the number of classes in the labels. The aggregated matrix, formed by concatenating the softmax scores, has $S \cdot m \cdot K$ columns, where $S$ is the number of sensors, $m$ is the number of selected models per sensor, and $K$ is the number of classes in the classification task.
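A minimal sketch of this decision-level fusion, assuming the per-model softmax scores are already available as arrays:

```python
import numpy as np

# Softmax scores from S = 2 sensors, m = 3 models per sensor, K = 4 classes,
# for a batch of 10 samples (all values here are random placeholders).
rng = np.random.default_rng(0)
scores = [[rng.dirichlet(np.ones(4), size=10) for _ in range(3)] for _ in range(2)]

# Decision-level fusion: concatenate every model's class scores per sample.
aggregated = np.hstack([m for sensor in scores for m in sensor])
print(aggregated.shape)   # (10, S * m * K) = (10, 24)
```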
3.3. Ensemble Prediction
The primary output of the workflow consists of the predictions for the defined task, which are computed from the aggregated features. Following the stacking approach, a meta-classifier is used to generate the final prediction by combining the outputs of the ensemble block. In this study, as reported in Table 1, we utilize adaptive boosting (AdaBoost) [62] as the meta-classifier. AdaBoost is an iterative ensemble method that combines a series of weak learners, typically decision trees, to form a strong classifier. The selection of AdaBoost as the meta-classifier is motivated by its ability to handle diverse input distributions and its effectiveness on smaller datasets [63].
3.4. Ensemble-Based OOD Detection
Ensemble-based OOD detection leverages the intuition that disagreement or variation among the outputs of multiple models can effectively signal the presence of OOD data. This method involves training an ensemble of models, often neural networks with different initializations [28], HPs [39], or architectures [38], and combining their predictions to detect OOD samples.
In the proposed framework, we employ a non-parametric anomaly detection approach on the aggregated scores generated by the ensemble to compute the OOD detection scores. As outlined in Table 1, we employ k-Nearest Neighbors (kNN) as the anomaly detection method, inspired by its effectiveness in prior studies [64]. kNN is a straightforward, distance-based approach that provides interpretable results by quantifying the similarity between a new sample's aggregated features and those of the ID data. This method is particularly appealing due to its simplicity and its ability to adapt to different distributions without requiring extensive parameter tuning. However, alternative anomaly detection techniques can also be utilized within the framework, depending on specific requirements. Examples include one-class SVMs, isolation forests, or variance-based thresholds on aggregated features. Appendix D provides a comparison of four different possible methods for anomaly detection within this framework.
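A brief sketch of the kNN-based OOD score on the aggregated scores, where the distance to the nearest ID sample serves as the score (the aggregated features here are random placeholders):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Aggregated ID scores: 300 samples, 6 models x 4 classes, flattened to 24 features.
agg_id = rng.dirichlet(np.ones(4), size=(300, 6)).reshape(300, -1)
knn = NearestNeighbors(n_neighbors=1).fit(agg_id)

def ood_score(agg_features):
    """Distance to the nearest ID sample in aggregated-score space (k = 1)."""
    dist, _ = knn.kneighbors(np.atleast_2d(agg_features))
    return dist.ravel()

agg_new = rng.dirichlet(np.ones(4), size=(2, 6)).reshape(2, -1)
print(ood_score(agg_new))
```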
3.5. Training Multi-Sensor Deep Ensemble
Using the selected methods outlined in Table 1, we integrate diverse HPs within the multi-sensor framework to construct the deep ensemble model. The training process, detailed in Algorithm 2, utilizes an HP search budget of M trials per sensor. HPO of the base model (HP-ConvNet) is conducted M times for each sensor using the defined search space. Based on validation performance, the top m models are selected from the M trained models. In total, S × m base models are selected, and their softmax scores are concatenated to form the aggregated features.
The workflow described is based on the methods selected in Table 1; however, the approach remains similar for other possible configurations. The framework's model-specific element lies in selecting the base models for the HP-driven ensemble and defining the corresponding HP search space. Notably, due to the stacking-based ensemble block, a mixture of different ML methods or DNN architectures can be utilized, depending on the properties of the input signals.
The meta-classifier and anomaly detection methods are trained using the aggregated features. During inference, the trained deep ensemble model, meta-classifier, and kNN model are employed to predict class labels and compute OOD scores for each new input.
Selecting an appropriate data partitioning strategy within this framework is critical to avoid overfitting and data leakage. Specifically, the ID test data must remain excluded from the HPO and training processes. In this study, the data is divided into 50% training, 20% validation, and 30% test sets. For small datasets, cross-validation is also a viable approach to ensure robust evaluation.
Algorithm 2 Multi-Sensor Deep Ensemble Training
1: Input: S sensors, M trials for HP search per sensor
2: Output: Trained deep ensemble model, meta-classifier, and kNN model
3: for each sensor s = 1 to S do
4:    Perform HP search using HP-ConvNet with M trials for sensor s
5:    Select the top m networks based on validation performance from the M trials
6: end for
7: Concatenate the softmax scores of all selected models across sensors
8: Train the meta-classifier on the aggregated softmax scores for final prediction
9: Train the kNN model on the aggregated softmax scores for OOD detection
10: Return: Trained deep ensemble model, meta-classifier, and kNN model
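An end-to-end Python sketch of Algorithm 2, with the per-sensor HP search reduced to a toy random search over a regularization strength; the data, base models, and search space are illustrative assumptions, not the paper's HP-ConvNet pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
S, M, m, K = 2, 6, 3, 3        # sensors, HP trials per sensor, selected models, classes

# Toy per-sensor feature matrices standing in for sensor signals.
X, y = make_classification(n_samples=600, n_features=20, n_classes=K,
                           n_informative=10, random_state=0)
sensors = [X[:, :10], X[:, 10:]]
idx_tr, idx_va = train_test_split(np.arange(len(y)), test_size=0.3, random_state=0)

selected = []                   # (sensor index, trained model) pairs
for s, Xs in enumerate(sensors):
    trials = []
    for _ in range(M):          # "HP search": here just random regularization strengths
        C = 10.0 ** rng.uniform(-3, 2)
        model = LogisticRegression(C=C, max_iter=1000).fit(Xs[idx_tr], y[idx_tr])
        trials.append((model.score(Xs[idx_va], y[idx_va]), model))
    trials.sort(key=lambda t: t[0], reverse=True)          # rank by validation score
    selected += [(s, model) for _, model in trials[:m]]    # keep top m per sensor

def aggregate(sample_idx):
    """Concatenate softmax-like class probabilities of all S*m selected models."""
    return np.hstack([mdl.predict_proba(sensors[s][sample_idx]) for s, mdl in selected])

agg_tr = aggregate(idx_tr)
meta = AdaBoostClassifier(n_estimators=100, random_state=0).fit(agg_tr, y[idx_tr])
knn = NearestNeighbors(n_neighbors=1).fit(agg_tr)          # OOD scoring on ID data

agg_va = aggregate(idx_va)
print("Validation accuracy:", meta.score(agg_va, y[idx_va]))
print("Mean OOD score on ID data:", knn.kneighbors(agg_va)[0].mean())
```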
4. Datasets
Five publicly available industrial condition monitoring and fault detection datasets were used in this study:
- The ZeMA Electromechanical Axis (EA) dataset [65];
- The ZeMA Hydraulic System (HS) dataset [66];
- The Open Guided Waves (OGW) dataset [67];
- The Paderborn University Bearing (PU) dataset [68];
- The Case Western Reserve University Bearings (CWRU) dataset [69].
The aim was to select datasets with multiple sensors and multiple domains. Many industrial datasets consist of readings from multiple sensors. These recordings may capture different physical properties, such as vibration, pressure, or current, across one or more components. Vibration, in particular, is a property that can be measured in a wide range of applications and from multiple locations within a system (see Table 2).
In this study, a “sensor” is not limited to a physically separate device; it may also refer to a transformation or subdivision of an original recording. For example, Fourier or Wavelet transformations of vibration recordings can be considered separate sensors. Furthermore, different phases of a physical process may be treated as different sensors. Thus, sensors in this study are not restricted to concurrent or synchronized recordings.
Variations in force level, pressure, or temperature can be potential causes of shifts in the data distribution and can be treated as different domains. The datasets used in this study were not explicitly designed to address domain shift. This selection was made for two main reasons. First, there are no well-known datasets specifically designed for this purpose in the targeted field. Second, using frequently studied datasets highlights the presence of domain shift problems in real-world scenarios, underscoring the relevance and practicality of addressing these challenges.
To ensure comparability across datasets, we used balanced versions in this study. For datasets that were originally imbalanced, we applied downsampling to the majority class to create balanced versions.
Table 2 provides details for each dataset, including the ranges of signal lengths, the number of classes, and the defined domains and sensors. Dataset descriptions covering sensor types and defined domains are provided in Appendix A.
5. Results
We trained ensemble models for each use case listed in Table 2, treating all datasets as multivariate classification tasks using Algorithm 2. To maximize the model's performance, all available sensor data were included in the modeling process for each dataset. Following the training of the HP-driven ensemble, the top m models for each sensor were selected from the pool of all trained models. The number of models in the deep ensemble, m, is a critical HP affecting both prediction accuracy and OOD detection efficacy. Increasing m generally enhances the ensemble's generalization and anomaly-detection capabilities. While fine-tuning m for each dataset individually would be the preferred approach for addressing a specific task, we opted to set m based on the average ID accuracy across all datasets to demonstrate the generalizability of the approach. Figure 3 shows that the ID accuracy plateaued after five models, so we fixed m = 5 for the remaining experiments to balance performance and computational efficiency.
The prediction results of the models for each dataset are shown in Figure 4. Figure 4a compares the ID accuracy of the best single model with the ensemble of five models for each sensor. The accuracy improved across all datasets when using the deep ensemble. Figure 4b illustrates the accuracy of the deep ensemble model for both ID and OOD data. A clear accuracy drop from ID to OOD data is observed for all datasets, with the most significant decreases occurring in the EA and HS (Acm) datasets. The error bars in Figure 4b indicate the range between the maximum and minimum accuracy of the ensemble model across different domains. There is significant variation in the prediction accuracy of the ensemble models for most datasets. For instance, in the OGW dataset, the OOD accuracy varies between almost 50% and 100%, depending on the target domain.
As outlined in Section 3.4, the multi-sensor ensemble-based OOD detection method (Ens+kNN) employs kNN to generate OOD scores. The value of k is an important HP for kNN, and we ran the algorithm with k values ranging from 1 to 10 to examine its impact. Figure 5 presents the AUROC and FPR95 results across datasets as a function of different k values. For three datasets, namely EA, HS (Valve), and HS (Acm), the value of k shows minimal effect on performance. However, increasing k results in a decreased AUROC for the other datasets. Consequently, k = 1 was chosen for the remainder of the experiments, which is in line with the explanation in the original study [43]. Performance metrics for other anomaly detection methods are compared in Appendix D.
The AUROC and FPR95 metrics are measured for each OOD detection method described in Section 2.3 and Section 3.4 across all datasets. Table 3 presents the performance results for each method. For the MSP, ODIN, Energy, HBOS, and MDS methods, we report results from the best-trained model with the highest ID validation accuracy. The Ens+kNN method consistently achieves the highest performance for both the AUROC and FPR95 metrics, followed by StatData, which relies solely on input features. However, the other methods display relatively poor performance in FPR95, underscoring the superior reliability of Ens+kNN for OOD detection.
To visualize the results more effectively, Figure 6 shows the average AUROC and FPR95 across all datasets. The error bars represent the standard deviation for each method, and the number above each bar indicates the method's average rank across datasets. Ens+kNN consistently ranks highest in both AUROC and FPR95, with StatData ranking second but exhibiting high variability in FPR95.
Table 3 and Figure 6 show a large gap between the OOD detection performance of single models and that of the ensemble models. To explore a combined OOD detection approach, we aggregated the OOD detection outputs of the base models as features. Specifically, kNN was applied to the aggregated features from the MSP, ODIN, Energy, HBOS, and MDS methods for comparison. Figure 7 illustrates the average AUROC and FPR95 achieved by each OOD detection method when using kNN with k = 1 to aggregate features from the ensemble models. While Ens+kNN continues to achieve the highest average AUROC and lowest FPR95, the other methods also show notable improvement. Conversely, StatData now exhibits the second-lowest performance, with MSP achieving the lowest performance.
6. Discussion
The multi-sensor deep ensemble workflow performed remarkably well across datasets with a wide range of properties, benefiting from automatic HPO. The dataset parameters included sensor counts from 3 to 17 and signal lengths ranging from 60 to 13,108. For all datasets, the ID test accuracy was close to 100%, indicating the model’s ability to generalize within the provided training set.
The number of models in the final ensemble, denoted as m, is an HP that directly influences prediction outcomes. Previous research, such as [28], suggests that increasing m enhances prediction performance, but this comes with the trade-off of increased ensemble size and computational demands. It is important to strike a balance between the benefits of more models and the practicality of implementation. Moreover, while ensemble size also affects OOD detection performance, our model selection process was conducted independently of the OOD samples, focusing primarily on ID performance.
In this study, we treated each possible shift in the data distribution as a true label for OOD detection. While the true labels of distribution shifts were not available and this assumption may not perfectly represent the problem, the results effectively highlight the existence of the issue and demonstrate the method's potential for identifying shifts. Developing specialized datasets targeting distribution shifts in this field would be valuable for future research [70].
Figure 4b illustrates significant variation in model accuracy on OOD data across different domains, validating the presence of the problem in the defined scenarios. These variations differ between datasets, indicating that the severity of distribution shifts depends on the covariate variables. For example, in the HS (Valve) dataset, OOD accuracy ranged from 33% to 85%, with the lower value approximating random guessing. This variability highlights the complexity of OOD detection, as some distributional shifts pose substantial challenges to the model, while others have minimal impact on prediction results. The selection of k = 1 for kNN-based OOD detection was motivated by maximizing domain shift sensitivity, but this choice might not be optimal for all datasets. Larger k values could smooth out the influence of noise and improve OOD detection in certain contexts, warranting further exploration of dataset-specific HPO.
Despite the small gap between the ID prediction accuracy of single models and that of the multi-sensor deep ensemble, the OOD detection performance of single models was notably inferior. As shown in Figure 6, OOD detection methods based on single models consistently exhibit high FPR values, averaging around 90%. However, the presence of domain shift is evident, as reflected by the severe drop in OOD prediction accuracies, with the EA dataset dropping to around 10% and the HS (Acm) dataset ranging between 3% and 27%. Conversely, Ens+kNN achieved near-perfect detection metrics. This inconsistency suggests that focusing solely on ID accuracy for model selection in multi-sensor datasets is insufficient; additional factors must be considered to ensure robust model performance.
By aggregating information from input signals across multiple sensors, StatData has shown potential for detecting shifts in the data distribution early on. Compared to single-model OOD detection methods, StatData ranked as the second-best approach, delivering strong AUROC results, particularly for the EA and HS datasets, including both the Valve and Acm tasks. However, the method exhibited significant variability in FPR95, ranging from 10% to 80%, underscoring its inconsistency in detecting certain types of domain shift. Compared to the other aggregation-based methods, StatData had the second-highest FPR95, making it less reliable for scenarios requiring low false positive rates. Nonetheless, since it operates independently of the model architecture, StatData can serve as a pre-model-selection tool, highlighting potential shifts in the data before more computationally intensive models are deployed.
7. Conclusions
In this study, we proposed a robust framework tailored for multi-sensor applications, particularly valuable in high-stakes and costly decision-making areas such as industrial fault detection systems. Our approach demonstrated leading-edge performance across various datasets, accommodating diverse signal lengths and sensor counts. Its adaptability and reliability across the datasets highlight its promising generalizability. While a collection of widely used datasets was employed in this study, the generalizability of the framework to other industrial scenarios may depend on factors such as task specifications, sensor types, and the nature of domain shifts encountered. Future work should explore these aspects to better understand the framework’s potential and limitations in broader industrial applications.
The datasets used in this study are not explicitly designed to capture domain or distribution shifts. We treated any deviations from the initial working conditions as potential shifts in the distribution. However, further investigation is needed to quantify the degree of deviation from the training distribution within these datasets. Additionally, developing datasets specifically designed to address domain shifts in this field would be highly valuable.
The framework’s model-agnostic design allows it to be extended to ML models beyond DNNs. Although the models in this study were limited to ConvNets, exploring a broader range of DNN architectures or a combination of different models could enhance ensemble diversity, offering a promising direction for future research. Additionally, future studies could investigate different fusion methods and assess how the choice of fusion depth (defined as the stage at which features or decisions are integrated) influences overall performance.
This study’s unique contribution lies in its AutoML approach, which addresses diverse datasets from multi-sensor fault detection use cases while accounting for distribution shifts. By leveraging a flexible design, the framework effectively accommodates variations in sensor data, ensuring consistent performance across a wide range of datasets. While the findings are promising, we hope this work inspires further research to refine and expand similar approaches, enhancing the practical applicability of ML methods in critical industrial settings.