1. Introduction
Modern industrial systems are increasingly complex and prone to failure, which can lead to significant dangers or high costs. Detecting faults is crucial in these systems [
1], but it is challenging due to the vast amount and complex nature of data the systems handle and produce. Once a fault is detected, further analysis and decision-making processes are necessary to identify the specific fault type and prevent it from spreading.
In recent years, deep learning (machine learning using neural networks with many hidden layers) has showcased its remarkable ability to learn complex data representations, revolutionizing various learning tasks. Deep learning was also successfully applied to fault detection, which must handle a broad range of multivariate data [2,3]. Multivariate time-series data are a sequence of chronologically recorded observations of interrelated and interacting multidimensional variables.
Although deep learning has shown success in fault detection, researchers using deep fault-detecting models sometimes overlook the crucial temporal aspect of fault detection. Specifically, they fail to leverage the existing temporal dependencies among variables, using either only a single data sample from one time step or, conversely, too much data spanning too many time steps [4]. Consequently, they fail to optimize for the time it takes to detect a fault. In many applications, the fault-detection delay (the time gap between the actual occurrence of a fault and its recognition by the fault-detecting component) can be quite dangerous, and therefore this delay should be as small as possible [5].
If attainable, fault prediction is better than fault detection as it allows us to prevent potentially expensive faults from occurring. We can consider fault prediction as a negative fault-detection delay. However, if predictions are unattainable, a low fault-detection delay is just as important as high sensitivity and a low false alarm rate.
Quick detection of faults while minimizing false alarms is paramount in fault-tolerant systems. A functional fault-detection system raises alarms with acceptable delays. Our objective is to demonstrate the inadequacy of approaches that ignore fault-detection delays when evaluating fault-detection (and prediction) performance. We experimented on a large and renowned synthetic dataset [
6] and tried to identify the main factor influencing the accuracy and delay of the fault-detection process.
The main contributions of this paper are as follows:
- (1)
Emphasizing the significance of fault-detection delays when considering deep-learning fault-detection models. Standard loss functions disregard the temporal aspect of the problem and therefore produce solutions with little practical value.
- (2)
Introducing a methodology for estimating the fault-detection delays of deep models. In cases where the timestamps of faults are unknown, the fault-detection delay can only be estimated to a certain extent.
- (3)
Proposing a pseudo-multi-objective approach to fault detection with any deep-learning model, although we have only validated it with Long Short-Term Memory on a single dataset. Deep models should only have access to short training sequences, or they will not learn short-term relations needed for short fault-detection delays.
- (4)
Providing a clear integration of machine learning concepts with fault detection. Bridging these two domains facilitates better knowledge exchange and helps prevent experimental errors.
This paper delves into the temporal aspect of deep fault detection. Additionally, we examined the influence of data windowing on fault-detection accuracy and delay.
Section 2 introduces the concept of a monitoring component and provides an overview of artificial neural networks.
Section 3 introduces a uniform notation for describing the context and data windows in sequential data and estimating fault-detection delays for any model.
Section 4 presents a pseudo-multi-objective optimization approach that surpasses certain naive approaches observed in the literature. Finally,
Section 5 illustrates the application of the proposed concepts using the widely used Tennessee Eastman Process (TEP) dataset [
7].
3. Deep Fault-Detection Models for Time-Series Data
Data-based fault detection using machine learning can follow two paths. The first approach involves constructing a classifier using supervised learning techniques. This classifier is trained on labeled data and learns to classify a given data sample as either normal operating conditions (NOC) or a potential fault. Predicting faults in the (near) future is even more advantageous than simply detecting existing faults, and classification models can be trained accordingly. However, supervised learning relies on high-quality labeled data representing all possible system states. Obtaining such data in large quantities can be challenging, especially in many industrial environments where faults are infrequent and come at a high cost. Nevertheless, supervised deep-learning approaches for fault detection have been successfully applied in various areas, including chemical production systems [17], semiconductor manufacturing processes [18], and high-performance computing [19].
On the other hand, unsupervised learning provides a more suitable alternative for fault detection, particularly in scenarios with few or no faulty samples available for training. In this approach, autoencoders (AE) are widely used as a technique for deep fault detection [20,21]. Autoencoders are neural network architectures that encode the input signal into a latent representation and then attempt to reconstruct the original signal from this compressed information. The level of success in reconstructing the input signal serves as a measure of similarity between the test signal and the pre-learned normal signals.
By training autoencoders exclusively with normal operating condition data, they learn the underlying structure or principal components of the signals that describe the normal operation of a plant. When presented with an anomalous signal, their ability to accurately recreate the input will be lower than with non-anomalous signals used in training. This difference in reconstruction performance enables us to distinguish anomalous samples from the NOC samples based on the reconstruction error alone.
Consequently, the performance of autoencoders heavily relies on selecting an appropriate error threshold, which leads researchers to employ Receiver Operating Characteristic (ROC) analysis by varying these thresholds. Unlike classifiers that can only distinguish between normal operating conditions and expected faults, autoencoders can detect even unanticipated states of operation.
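To make the reconstruction-based scheme concrete, the following minimal sketch (an illustrative autoencoder and threshold choice, not taken from the cited studies) flags a sample as anomalous when its reconstruction error exceeds a threshold fitted on NOC data:
```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(n_features: int, latent_dim: int = 8) -> keras.Model:
    # Encode the input signal into a latent representation, then reconstruct it.
    inputs = keras.Input(shape=(n_features,))
    latent = layers.Dense(latent_dim, activation="relu")(inputs)
    outputs = layers.Dense(n_features, activation="linear")(latent)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

def fit_threshold(model: keras.Model, x_noc: np.ndarray, quantile: float = 0.99) -> float:
    # Threshold = a high quantile of the reconstruction errors on NOC data only.
    errors = np.mean((x_noc - model.predict(x_noc, verbose=0)) ** 2, axis=1)
    return float(np.quantile(errors, quantile))

def is_anomalous(model: keras.Model, x: np.ndarray, threshold: float) -> np.ndarray:
    # A sample is flagged when its reconstruction error exceeds the NOC threshold.
    errors = np.mean((x - model.predict(x, verbose=0)) ** 2, axis=1)
    return errors > threshold
```
Varying the threshold quantile corresponds to moving along the ROC curve obtained by the threshold analysis described above.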
3.1. Performance Metrics
Machine learning (ML) strongly emphasizes accuracy, primarily due to the generalization challenges that artificial neural networks face when confronted with unseen data. The main concern in ML is to mitigate incorrect outputs caused by the unstable propagation of features through ANNs, rendering them ineffective in the presence of minor input perturbations [22,23].
Several studies have exclusively concentrated on established machine learning evaluation metrics, such as accuracy, while neglecting the inherent sequential nature of the data and failing to consider fault-detection delays [24,25,26].
3.2. The Data
The time-series data are a sequence of data samples denoted by $S = (s_1, s_2, \ldots, s_T)$, where each sample $s_t$ represents the input signals at time $t$, and $T$ is the total number of time ticks. Specifically, the sample $s_t \in \mathbb{R}^N$ represents the signal values of the $N$ features at time tick $t$. The fault-detection model $f$ takes an input sample and generates the output $y_t = f(s_t)$. Each input sample corresponds to one fault-detection case. Finally, a dataset, which consists of $d$ sequences, can be denoted as $D = \{S_1, S_2, \ldots, S_d\}$.
Let us define a sliding window function that selects the last $w$ samples from a sequence $S$ at time $t$:
$W_w(S, t) = (s_{t-w+1}, s_{t-w+2}, \ldots, s_t)$ (1)
This allows us to denote $y_t = f(W_w(S, t))$. In a sequence consisting of $T$ samples, the window $W_T(S, T)$ selects the entire sequence: $W_T(S, T) = S$, and $W_1(S, t)$ selects only the last sample: $W_1(S, t) = (s_t)$. The sliding window utilized for sample selection is similar to, but should not be confused with, the fault-absorbing sliding windows [27].
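For illustration, Equation (1) translates directly into a few lines of NumPy (the function and variable names below are ours, not from the original implementation):
```python
import numpy as np

def window(S: np.ndarray, t: int, w: int) -> np.ndarray:
    """Return W_w(S, t): the last w samples of sequence S up to time tick t (1-based)."""
    assert 1 <= w <= t <= S.shape[0]
    return S[t - w:t]

S = np.random.rand(500, 52)                    # one sequence: T = 500 ticks, N = 52 features
assert window(S, 500, 500).shape == (500, 52)  # W_T(S, T) selects the entire sequence
assert window(S, 500, 1).shape == (1, 52)      # W_1(S, T) selects only the last sample
```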
3.3. The Context
All deep-learning models rely on a context that serves as the reference frame for the inputs they process. Samples preceding the most recent sample form the simplest context. Specifically, the context includes the $w$ samples visible through the window $W_w(S, t-1)$.
In the case of feed-forward (FF) models, the context is external and must be provided at each time step. One way to incorporate it is by introducing it as a second parameter to the fault-detection function, such as $y_t = f(s_t, W_w(S, t-1))$. However, for simpler implementation, machine learning flattens the context and appends it to the current sample, resulting in $y_t = f(\mathrm{flatten}(W_{w+1}(S, t)))$, or even more simply $y_t = f(W_{w+1}(S, t))$. It is important to note that flattening the external context removes the temporal axis from the data.
In contrast, recurrent models utilize a hidden state $H$, eliminating the need for an external context. The operation of recurrent models can be represented as $(y_t, H_t) = f(s_t, H_{t-1})$. As $H$ is an internal state, we can omit it from the notation, resulting in $y_t = f(s_t)$. Although recurrent models process time-series data and update their state by considering one sample at a time, they still require the processing of several consecutive samples to detect a fault because the state depends on $H_{t-1}$. Considering that all the previous states influence the internal state, we can express the entire history of processing as $y_t = f(W_t(S, t))$. Random initialization of $H_0$ makes $s_{t-w+1}$ a possible starting point for sequentially processing the sequence $W_w(S, t)$, leading (again) to $y_t = f(W_w(S, t))$. Unlike feed-forward networks, which require flattening consecutive samples into one wide input, recurrent models retain the temporal organization of the data.
The ‘context data’ size, determined by the hyperparameter $w$, limits the model’s ability to capture long relationships. Feed-forward models have a limitation in that they can only capture correlations within the constrained context ($W_{w+1}(S, t)$) and are unable to detect correlations that extend beyond it. It is important to note that the implementation of the model may impose restrictions on the context size. The context size limitation is observed in popular models like ChatGPT as well [28].
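The practical difference between the two context styles is visible in the input shapes; the following sketch (illustrative names and sizes) contrasts a flattened feed-forward input with the three-dimensional input of a recurrent model:
```python
import numpy as np

S = np.random.rand(500, 52)       # one sequence: T = 500 time ticks, N = 52 features
t, w = 100, 20                    # current time tick and context size

ctx = S[t - (w + 1):t]            # W_{w+1}(S, t): current sample plus w context samples

# Feed-forward model: flatten the temporal axis into one wide input vector.
ff_input = ctx.reshape(1, -1)     # shape (1, (w + 1) * 52)

# Recurrent model: keep the temporal organization of the data.
rnn_input = ctx[np.newaxis, ...]  # shape (1, w + 1, 52)
```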
3.4. Fault-Detection Delay
Artificial neural networks are extremely quick at generating output. Unless a highly responsive system is being monitored, the time required for this input-to-output transformation can be ignored when compared to the duration of a single time step.
To quantify the fault-detection delay $\Delta$, which refers to the time elapsed between the actual onset of a fault and its detection [5], it is necessary to possess accurate information regarding the fault’s exact occurrence time $t_f$ within the system. Unless an external source or an oracle provides this temporal data, determining the exact fault occurrence time becomes extremely challenging. Faults typically require a certain amount of time to propagate through the system and are seldom detectable immediately upon initiation. Furthermore, there are instances when the process control system actively masks the fault by compensating for its adverse effects before the system enters a visibly anomalous state that can be detected.
In machine learning, there is a common assumption that the associations within a system can be acquired by recording its features and that the comprehensive data captures all relevant relationships. Consequently, utilizing all available data to make informed decisions through the execution of $f(W_T(S, T))$ establishes the performance baseline for fault detection, with the fault-detection delay $\Delta = T - t_f$. It is important to note that the baseline model exhibits the maximum delay when applied to pre-recorded data. The true fault-detection delay occurs only when the model is employed live on a data stream.
To determine the model’s fault-detection delay when the fault time is unknown, we must identify the minimum number of samples required to provide sufficient information for accurate enough fault detection at any time step:
$w^{\ast} = \min \{\, w : f(W_w(S, t)) = f(W_T(S, T)),\ \forall S \in D,\ \forall t \ge w \,\}$ (2)
Unfortunately, the brute force approach to determine $w^{\ast}$ using Equation (2) is very slow, because we must determine $f(W_w(S, t))$ for all possible training inputs:
$K = \{\, W_w(S, t) : S \in D,\ w \le t \le T \,\}$ (3)
where $K$ is the set of all possible inputs of size $w$ obtainable from the $T$ time steps of long sequences from $D$. Using the sampling stride of 1 gives $|K| = d \cdot (T - w + 1)$ possible inputs. However, if one has knowledge of $t_f$, one can determine $\Delta$ much more simply, because the model requires $t_f + \Delta$ time steps to detect a fault:
$\Delta = t_d - t_f$ (4)
where $t_d$ denotes the time step at which the model first reports the fault.
If we do not have $t_f$, we should try to put a lower bound on $\Delta$ by measuring the delay at time step $T$. This is carried out by finding the smallest input that still produces the same output as the baseline decision $f(W_T(S, T))$; we shall denote this statistic as $\delta$:
$\delta = \min \{\, w : f(W_w(S, T)) = f(W_T(S, T)) \,\}$ (5)
A high $\delta$ suggests that the model is unstable, while a lower $\delta$ value is a sign of a simple problem. $\delta$ is not $\Delta_T$, the fault-detection delay at time $T$, but a lower bound (if the model performs fault identification, the lower bound corresponds to the highest $\delta$ of all faults) for the fault-detection delay at time $T$. Although the delays for other inputs are probably different, the worst-case fault-detection delay will be at least $\delta$:
$\max_{t} \Delta_t \ge \Delta_T \ge \delta$ (6)
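For a single recorded sequence, $\delta$ can be estimated by growing the window at time $T$ until the baseline decision is reproduced; the sketch below assumes a Keras-style classifier that accepts variable-length inputs of shape (1, length, N) and is not part of the original code:
```python
import numpy as np

def delta(model, S: np.ndarray) -> int:
    """Smallest window size at time T whose classification matches the baseline f(W_T(S, T)).

    S has shape (T, N); model is a Keras-style classifier accepting (1, length, N) inputs.
    """
    T = S.shape[0]
    baseline = int(np.argmax(model.predict(S[np.newaxis], verbose=0)))
    for w in range(1, T + 1):                 # grow the trailing window until it agrees
        y = int(np.argmax(model.predict(S[T - w:T][np.newaxis], verbose=0)))
        if y == baseline:
            return w
    return T
```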
Because no (deep) model is perfect, the decision regarding a fault can change with additional input(s). Raising premature alarms is undesirable, but it can only be mitigated by waiting for more data to confirm the alarm, which delays the decision. We therefore need a measure of how fast the model’s output stabilizes. A simple boundary can be determined by testing the model using inputs $W_t(S, t)$, which include the first $t$ samples. Let us denote with $\sigma$ the time step when the output becomes stable, i.e., subsequent samples do not change the outcome of the fault-detection process:
$\sigma = \min \{\, t : f(W_u(S, u)) = f(W_T(S, T)),\ \forall u \ge t \,\}$ (7)
In contrast to $\delta$, $\sigma$ retains older samples and discards more recent ones from the input. When the precise time of fault initiation $t_f$ is known (e.g., provided by an oracle), the fault-detection delay can be calculated by straightforward subtraction:
$\Delta = \sigma - t_f$ (8)
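$\sigma$ can be estimated analogously by classifying growing prefixes of the sequence and locating the last time step at which the decision still differs from the baseline; with a known $t_f$, Equation (8) then yields the delay (again an illustrative sketch under the same assumptions):
```python
import numpy as np

def sigma(model, S: np.ndarray) -> int:
    """First time step after which the classification equals and keeps the baseline decision."""
    T = S.shape[0]
    baseline = int(np.argmax(model.predict(S[np.newaxis], verbose=0)))
    decisions = [int(np.argmax(model.predict(S[:t][np.newaxis], verbose=0)))
                 for t in range(1, T + 1)]    # f(W_t(S, t)) for every prefix length t
    for t in range(T, 0, -1):                 # find the last time step that disagrees
        if decisions[t - 1] != baseline:
            return t + 1
    return 1

def detection_delay(model, S: np.ndarray, t_fault: int) -> int:
    return sigma(model, S) - t_fault          # Equation (8)
```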
The delay estimation strategy applies when the fault occurrence time is unknown (Figure 2a). After determining the $\delta$ and $\sigma$ statistics, we need both of them to interpret the results. Because $\delta$ and $\sigma$ represent the delay’s lower and upper bounds ($\delta \le \Delta \le \sigma$), a situation with $\sigma < \delta$ indicates problems with fault detection in the underlying model. These problems could be due to overfitting [29], instability [23], generalization problems [30], or other issues we must address. When we know the fault time $t_f$, we can measure the detection time and directly compute the delay (Figure 2b).
4. Pseudo-Multi-Objective Optimization
In multi-objective optimization, a single model cannot simultaneously achieve the best performance in all dimensions [31]. Optimizing for accuracy and low delay in fault detection involves a trade-off, as these objectives are inherently conflicting. Consequently, exploring multiple Pareto optimal solutions that offer a balanced trade-off between accuracy and delay becomes crucial.
Various universal techniques for tackling multi-objective optimization exist. One common approach involves using weighting methods, such as the adaptive weighting techniques proposed by Xie et al. [32], or using multi-objective instance weights as discussed by Lee et al. [33]. By incorporating such techniques, solutions that balance competing objectives form the Pareto front of candidate solutions to the optimization problem.
Figure 3 positions various models according to their accuracy and fault-detection delay performance. While the trivial solution, which never signals a fault, exhibits no fault-detection delay, the ideal model would detect all faults instantly. Learning increases the models’ accuracy because explicit ML loss optimization pushes models horizontally toward higher accuracy. Only an orthogonal incentive (explicit or implicit) would push the models towards short fault-detection delays.
In this context, we adopt the implicit approach. We prioritize utilizing established deep optimization techniques to maximize fault-detection accuracy while aiming for a short delay as an implicit objective. This implicit optimization for short delays aligns harmoniously with the accuracy-optimizing objectives of existing ML libraries without necessitating any modifications to the existing ML code.
When training a fault-detection model with recordings of historical data that stretch over many time steps, the conventional ML approach poses the question, “Do these historical data contain any faults?” However, in real-time fault detection, it is important to rephrase the question: “Do these historical data indicate an imminent or a recently occurred fault?” This shift in emphasis redirects the focus from analyzing the distant past to the near future or present.
When training the model with historical data with the fault introduced relatively early compared to the overall length of the sequence ($t_f \ll T$), we effectively ask the first question. To ask the second question, we would ideally need training data with the fault occurring in the last time step ($t_f = T$) or, even better, in the future ($t_f > T$). The acquisition of such data, especially in sufficient quantities for deep learning, can be very challenging, however.
However, when working with a large volume of historical data, it is important to exercise caution in its utilization. Instead of treating a single sequence as a single learning case, where the input $W_T(S, T)$ includes all $T$ samples, one can reorganize data into multiple smaller inputs [34].
Figure 4 shows a sliding window that samples at every time step and produces $T - w + 1$ distinct yet overlapping training cases $W_w(S, t)$ of size $w$, where $w \le t \le T$. We denote models trained on inputs of size $w$ as $f_w$.
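A sketch of this re-windowing step is given below; each case inherits the label of its source sequence, which mirrors the labeling design discussed in Section 5.6 (the function name and shapes are illustrative):
```python
import numpy as np

def make_cases(S: np.ndarray, label: int, w: int):
    """Split one sequence of shape (T, N) into T - w + 1 overlapping cases of size w.

    Every case inherits the label of its source sequence, even the leading NOC samples
    of a faulty run (see Section 5.6 for the consequences of this design).
    """
    T = S.shape[0]
    X = np.stack([S[t - w:t] for t in range(w, T + 1)])   # shape (T - w + 1, w, N)
    y = np.full(len(X), label)
    return X, y

# Example with the TEP training dimensions: 500 steps, 52 features, w = 5.
X, y = make_cases(np.random.rand(500, 52), label=1, w=5)
assert X.shape == (496, 5, 52)
```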
Strictly speaking, utilizing shorter inputs only partially falls under the umbrella of multi-objective optimization. Nevertheless, training with shorter inputs implicitly encourages machine learning algorithms to uncover short correlations for rapid fault detection.
5. Case Study
The data used in our study are obtained from the Tennessee Eastman Process, originally introduced by Downs and Vogel [7] and extensively described by Chiang [35]. TEP is a widely recognized benchmark for researching process monitoring and control. It replicates a real process by incorporating modified components, kinetics, and operating conditions. TEP is a synthetic dataset where all dynamic behavior arises from software-based simulations. Since its inception, the simulation code has undergone several enhancements, solidifying TEP as one of the most frequently employed benchmarks for studying highly nonlinear and strongly coupled data.
Our decision to utilize the Tennessee Eastman Process dataset in our study is based on several factors. First, TEP has been widely adopted by many researchers, as evident from the works of Heo et al. [36,37], Sun et al. [38], and Park et al. [39]. Second, the published papers we reviewed did not adequately address or fulfill the specific objectives of our research. Lastly, acquiring high-quality datasets is often a challenging endeavor. To ensure repeatability and facilitate transparency, we opted to employ an extensive recording of TEP simulation data provided by Dataverse [40]. Notably, this dataset is also employed in the MATLAB Help Center as an illustrative example for demonstrating the application of deep learning with time series and sequences using the Deep Learning Toolbox [24].
The TEP dataset represents a chemical plant, where the overarching control strategy, as outlined in Downs and Vogel [7], aims to optimize overall performance. The plant’s control system diligently monitors and logs 52 distinct features, comprising 41 sensor measurements and 11 manipulated variables, at 3-minute intervals. Within the TEP environment, the plant can operate under normal operating conditions, denoted as fault 0, or encounter any of the 20 preprogrammed faults (faults 1–20). Upon the occurrence of a fault, the control system attempts to mitigate the disturbance by either successfully restoring the system to the NOC state or allowing the fault to escalate beyond the NOC boundaries.
The dataset includes a substantial amount of training and test data. Specifically, the training phase consists of 500 simulated plant runs for each combination of normal operating conditions and 20 fault scenarios, resulting in a total of 10,500 simulation runs. Each run spans 25 h, providing 500 samples per training sequence. In the case of a faulty run, the fault is intentionally triggered 20 time steps into the normal plant operation.
The test data follow a similar structure, with 500 independent simulations conducted for each NOC/fault scenario. However, these simulations have a longer duration, consisting of 960 samples, and faults are introduced at a later stage, precisely after 160 time steps of normal operation.
The training data were divided into 400 training runs and 100 validation runs per scenario for effective model development and evaluation. Each simulation run was an individual classification case throughout all study phases, including training, validation, and testing.
The Tennessee Eastman Process has been the subject of extensive study by numerous researchers. Several authors excluded faults 3, 9, and 15 from their research [24,39]. The rationale behind this decision stems from the observation that the plant’s control system can effectively handle the disturbances caused by these faults. Consequently, distinguishing these faults from the plant’s normal operation proves to be a challenging task. In light of these findings, we also removed faults 3, 9, and 15 from our dataset. Consequently, the training and test datasets each consisted of 18 distinct plant states, resulting in 9000 recorded sequences per dataset. Since the TEP data include information about the faults and their occurrence times, the task goes beyond simple fault detection and involves fault identification through classification.
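The dataset shaping described above can be sketched as follows; the loading of the Dataverse recordings into per-run arrays of shape (T, 52) is assumed to have happened already, and all names are illustrative:
```python
import numpy as np

EXCLUDED_FAULTS = {3, 9, 15}   # masked by the plant's control system; removed as in [24,39]

def split_runs(runs, labels, train_per_class=400):
    """Drop the excluded fault classes and split each class's 500 runs into 400/100.

    `runs` is a list of arrays of shape (T, 52); `labels` holds the matching fault numbers.
    """
    train, val, counts = [], [], {}
    for run, fault in zip(runs, labels):
        if fault in EXCLUDED_FAULTS:
            continue
        counts[fault] = counts.get(fault, 0) + 1
        (train if counts[fault] <= train_per_class else val).append((run, fault))
    return train, val
```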
5.1. Setup
In developing and training our deep fault-detection models, we strictly adhered to the setup and procedure outlined in the tutorial [24], except that we employed Python in conjunction with the Keras/TensorFlow framework. The code snippet below illustrates the model creation process, faithfully reflecting the prescribed methodology. Each model had three layers of LSTM cells and 43,788 trainable parameters.
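A minimal Keras sketch of such a model is shown below; the specific layer widths (52, 40, and 25 LSTM units feeding an 18-way softmax) are our assumption, chosen so that the total of 43,788 trainable parameters matches the number reported above.
```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features: int = 52, n_classes: int = 18) -> keras.Model:
    model = keras.Sequential([
        keras.Input(shape=(None, n_features)),   # variable-length multivariate sequences
        layers.LSTM(52, return_sequences=True),
        layers.LSTM(40, return_sequences=True),
        layers.LSTM(25),                          # last hidden state only
        layers.Dense(n_classes, activation="softmax"),
    ])
    # Mean squared error loss and the Adam optimizer, as described in this section.
    model.compile(optimizer="adam", loss="mse", metrics=["accuracy"])
    return model

model = build_model()
model.summary()   # reports 43,788 trainable parameters with these assumed widths
```
With one-hot encoded fault labels, this setup corresponds to the training description that follows (batches of 32, mean squared error loss, Adam optimizer).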
During the training phase, the simulation records were processed in batches of 32. As in the tutorial [24], we employed the mean squared error loss function and the Adam optimizer to optimize the model’s parameters.
5.2. Fault-Detection Performance
The results presented in Table 1 are the average values obtained from 30 trained networks. At first glance, the selected model exhibited remarkable learning capabilities on the TEP data, yielding outstanding accuracy. This was confirmed by the Matthews Correlation Coefficient (MCC) score, which is known to be more effective in describing performance on multiclass and unbalanced datasets compared to traditional metrics like the F1 score [41].
The limit of 30 training epochs was sufficient to discover a solution that appears to be close to the global optimum, with correspondingly low training and validation losses. The test results for this model are summarized in Table 2. We will designate this specific model as Model #1, representing the baseline $f_{500}$ family, where all members are trained on cases of 500 samples.
Although the MATLAB tutorial [24] focuses exclusively on classification accuracy by employing all samples per sequence for a single fault classification, it ignores the fault-detection delay. The important question to answer is when deep models begin detecting faults.
5.3. Fault-Detection Delay Analysis
During the training and validation phase, faults were introduced in the TEP dataset at time step $t_f = 20$. However, detecting these faults was delayed until step 500, when the last training record became accessible. We conducted the $\delta$ and $\sigma$ analyses on the training, validation, and test data to investigate the behavior of the trained Model #1.
To demonstrate the $\delta$ score, Figure 5 depicts the performance of the reference classifier on a single test sequence labeled as ‘fault 1’. The sequence was divided into 960 sub-sequences, denoted as $W_w(S, 960)$ for $w = 1, \ldots, 960$, and subsequently classified using the reference Model #1. Only inputs with sufficiently many of the most recent samples accurately supported the baseline ‘fault 1’ classification.
Table 3 presents the $\delta$ scores of the reference model for different faults on the validation data, which was otherwise perfectly classified. Surprisingly, the maximum $\delta$ score of 500 was observed for ‘fault 12’, indicating that the network required at least 500 steps to correctly classify a sample belonging to the ‘fault 12’ category. This finding is unexpected because the training/validation data encompass 20 steps of NOC data at the beginning of every 500-step sample. These initial 20 steps should not have influenced fault categorization, meaning that the network is overfitting the training data.
For completeness, Table 4 displays the $\delta$ values for the reference model on the test data, including a few incorrect baseline predictions.
Figure 6 shows our reference classifier’s $\sigma$ performance on the ‘fault 1’ test sample. Slices with fewer than 425 records were insufficient to recognize ‘fault 1’; however, if 425 or more time steps were available, the network could predict ‘fault 1’ correctly.
Table 5 presents the $\sigma$ results obtained from the test data. It is important to note that the $\sigma$ metric does not assess classification accuracy but focuses on its consistency over time. Similar to the observations with $\delta$, we can again identify unexpected behavior. Despite training the network on 500-time-step sequences, the average stability of classification is only achieved well beyond 500 steps, as indicated by the average $\sigma$ values in Table 5. This counterintuitive behavior further suggests that the network is not aligning with our initial expectations.
Given that we are aware of the fault introduction time of $t_f = 160$ in the TEP test data, we can calculate the average fault-detection delay for each fault by subtracting 160 from the highest $\sigma$ score achieved by the classifier, as shown in Equation (8). On average, our best classifier would require several hundred time steps to detect a fault. It is worth noting that only the NOC data slices were detectable before the 160th time step, with an average detection occurring after just 15 samples. One NOC sample, however, required at least 939 records to be correctly classified.
Table 1, Table 2, Table 3, Table 4 and Table 5 and Figure 5 and Figure 6 are © 2021 Matej Šprogar, Matjaž Colnarič, Domen Verber, and reproduced with permission from “On Data Windows for Fault Detection with Neural Networks”. IFAC-PapersOnLine, 54/04 (2021), pp. 38–43.
5.4. Comparison with Other Studies
There is a need for more directly comparable studies on fault-detection delay. For instance, the study by Heo et al. [37] mentions that linear PCA and p-NLPCA detected fault 5 as early as time step 162, which corresponds to only two samples (equivalent to 6 min) after its introduction into the system. In contrast, our reference Model #1 needed 196 additional samples to detect the fault correctly.
More insightful are the findings from Park et al. [39], where a combined autoencoder and LSTM network was employed. According to their report, the average detection delay for faults 01, 02, 05, 06, 07, 12, and 14 was less than 30 min, showcasing superior performance compared to our baseline. However, it is worth noting that their model exhibited lower accuracy at 91.9%, which likely accounts for the disparities in the achieved detection delays. Our network, on the other hand, required more information to ensure higher classification accuracy. The question is, would shorter training cases make a difference?
5.5. Using Smaller Windows
The baseline Model #1 exhibited a notable fault-detection delay, highlighting the inadequacy of training on long sequences with faults embedded early in the process. Conversely, training on an individual sample per case fails to capture autocorrelations.
Given that our objective is not to find the optimal model but to illustrate the impact of shorter training cases, we chose to utilize the window $W_5$ to create models in the $f_5$ category. In the TEP dataset, a training case spanning 5 time steps represents 15 min of plant operation.
Generating training cases from a single sequence allowed us to augment the training dataset to include 3,571,200 cases. Similarly, the validation and test sets increased to 892,800 and 8,604,000 cases, respectively, resulting in a dataset with over 13 million fault-describing sequences. To create a representative Model #2 for the $f_5$ category of models, we followed the same procedure as when generating the baseline models of the $f_{500}$ family. The final representative Model #2 achieved a training loss of 0.00294 and a validation loss of 0.00293. With an accuracy of 97.88% and an MCC score of 0.8955, the test performance of Model #2 was lower than that of the reference Model #1.
To facilitate a more thorough comparison between the two models, we must assess their usability for fault detection. This evaluation should consider accuracy and delay based on the consecutive input samples that describe the operation of the plant.
In Figure 7, we can observe the variations in accuracy over time for the models. Figure 7a illustrates the models’ performance on 500 steps of the training and validation data, while Figure 7b depicts the corresponding analysis for the test data, encompassing 960 steps. The accuracy at each time step is calculated based on 7200, 1800, and 9000 sample recordings from the training, validation, and test datasets, respectively.
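Accuracy-over-time curves of this kind can be reproduced, in principle, by classifying growing prefixes of every sequence, as in this illustrative sketch:
```python
import numpy as np

def accuracy_over_time(model, sequences: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Accuracy at each time step t, using only the first t samples of every sequence.

    `sequences` has shape (num_runs, T, N); `labels` has shape (num_runs,).
    """
    num_runs, T, _ = sequences.shape
    acc = np.zeros(T)
    for t in range(1, T + 1):
        preds = np.argmax(model.predict(sequences[:, :t, :], verbose=0), axis=-1)
        acc[t - 1] = np.mean(preds == labels)
    return acc
```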
The training and validation datasets were structured to include an initial period of 20 steps representing a faultless operation. For the baseline Model #1, the accuracy in identifying normal operating conditions started at 20% for the first time step and improved to 86% after 20 time steps. In contrast, Model #2 achieved 92% accuracy at the first time step and reached 100% accuracy after three samples. On the test data, it took the baseline Model #1 58 time steps to achieve a perfect (100%) accuracy score, while the accuracy of Model #2 had already started to decline by then.
The introduction of faults (indicated by arrows in Figure 7) resulted in a significant decrease in the accuracy of both models. Model #1 kept classifying all inputs as NOC for a while, whereas Model #2 started to improve immediately. On the validation data, Model #2 re-reached the 95% accuracy level with a delay of 55 time steps, whereas the baseline model needed 298 steps to reach the same level of performance.
5.6. Discussion
The subpar performance of the baseline model during the initial time steps can be attributed to the inability of the $W_{500}$ window to specify the first samples as belonging to the NOC category when the whole sequence was labeled as faulty. By design, all samples, including the normal ones, were categorized as faulty. This combination of data could have caused the model to mistakenly associate the NOC state with a faulty condition, resulting in inferior start-up performance.
In all three datasets, there was a point where the baseline model started outperforming Model #2, reaching its peak accuracy at the final time step. This outcome aligns with the primary objective of machine learning, which aims to achieve high overall accuracy based on all available data. However, these graphs clearly illustrate that the standard ML loss criterion is inadequate for effective fault detection. If a model is not penalized for delays, it will prioritize marginal accuracy increases over significant delay improvements.
Recognizing faults becomes progressively easier the longer they persist in the system. However, training machine learning models to identify long-standing faults tends to result in slower detection. To address the detrimental effects of long-standing faults, we can employ data windowing and restrict the model’s access to information during training and operation. Windowing creates multiple smaller fault-detection cases with reduced information, which may not capture long autocorrelations but encourage the model to focus on learning and leveraging short-term correlations. The “less means more” principle holds, as less data can lead to shorter fault-detection delays.
All models in this study utilize LSTMs with recurrent architecture, allowing them to generate outputs starting from the first sample. However, it is important to note that their internal contexts still require a warm-up period. Interestingly, the warm-up periods for both models differ significantly. The baseline model exhibits a much slower warm-up process, requiring a larger number of samples to achieve comparable levels of accuracy.
A pseudo-multi-objective optimization can improve delay and accuracy simultaneously. It uses the data window size as a hyperparameter that significantly influences the temporal behavior of the model. Tweaking other deep-learning hyperparameters, such as the number of training epochs or the size of the neural network, can improve accuracy [36] but can also negatively affect the delay. However, as our objective did not involve finding the optimal TEP fault-detection model, we refrained from conducting an extensive analysis of various window settings.
6. Conclusions
Although a control system can handle various disturbances, developing a dedicated monitoring system specifically designed for fault detection and identification is crucial. When applying deep-learning approaches to solve fault-detection problems, it is important to consider an additional objective other than accuracy. Standard metrics are inadequate and, as a result, misleading. They primarily focus on the correctness of results but neglect the importance of delays in fault detection. Furthermore, it is unreasonable to expect machine learning toolkits to excel in all domains universally; their effectiveness can vary significantly.
The monitoring component must recognize a fault from data samples collected during the fault-detection delay. The MC receives less information in the short period after a recent fault than in the longer period after an old fault. Moreover, recent faults manifest fewer detectable anomalous traits than older ones. Consequently, long delays support better detection of older faults. Fault detection is inherently a bicriteria optimization problem, where the fault-detection delay objective conflicts with the fault-detection accuracy objective.
The comparison of two deep neural network models, trained identically in some aspects and differently in others, highlights the need to understand the fundamental limitations of the machine learning approach for optimization. In this context, we described a simple alternative to the more complex multi-objective methodology that would otherwise be required. The use of shorter training cases implicitly encourages deep-learning models to detect and leverage shorter correlations. Additionally, this approach aligns well with readily available machine learning frameworks.
Although we were able to replicate the high-accuracy results on the TEP dataset reported elsewhere, it is important to acknowledge that the baseline solution is not suitable for real-life applications. This becomes evident when observing the accuracy scores over time. The case study illustrates the flaw of the baseline deep fault-detection concept. We can produce better models only by circumventing the issue or applying multi-objective optimization. Following the No Free Lunch theorem [42], however, it is important to recognize that there is no universally best approach. We aim to highlight why fault detection and identification warrant special attention.