1. Introduction
Hydraulic structures are critical components of national infrastructure, key to resisting external flood risks and ensuring the benefits of water supply and irrigation, which makes their safe and stable operation essential [
1,
2]. Hydraulic projects in operation face numerous challenges, such as external loads, internal structural adjustments, and aging material properties. Among these, monitoring and assessing the safety status of engineering structures is an urgent problem. Digital twin construction for hydraulic engineering is an innovative information technology that systematically addresses structural response perception, physical process simulation, and comprehensive performance assessment, achieving the goals of the “four predictions” [
3]. The important foundation of digital twin construction is its data infrastructure, including geographic spatial data, basic data, monitoring data, business management data, and external shared data, among which monitoring data directly reflects the true response of the structure and is crucial for structural safety assessment.
The safety monitoring of hydraulic structures primarily involves measurements of deformation, seepage, stress–strain, and cracks. The monitoring devices currently used in hydraulic structures are typically vibrating wire and resistive sensors, with vibrating wire sensors more widely used owing to their stability and durability. In practical engineering applications, monitoring systems may encounter circuit failures or network transmission issues, leading to anomalies in data collection; incorrect wiring or signal interference, for instance, can produce abnormal readings. Analyzing these potential risks makes it possible to identify inaccurate readings caused by such technical issues, thereby reducing the impact of erroneous data and improving the overall performance and accuracy of the monitoring system (
Figure 1).
The safety monitoring of hydraulic engineering structures is a critical task in the field of hydraulic engineering, as the quality of the data directly affects the accuracy and reliability of dam safety assessments. Gross error identification is an important means of improving the quality of monitoring data: it aims to identify and remove outliers caused by measurement errors, data entry mistakes, or other abnormal factors. Currently, gross errors in monitoring data are mostly identified with traditional statistical methods or single machine learning algorithms. Traditional statistical methods, such as Grubbs’ test, are simple in principle and widely used, but their performance depends heavily on the accuracy of the underlying statistical regression model. Moreover, their fixed three-standard-deviation threshold struggles to adapt to diverse data series, resulting in poor identification accuracy. In terms of machine learning algorithms, Saihua Cai et al. proposed a new method, MFP-OD, which detects outliers by identifying rare and significantly different patterns in uncertain data streams [
4]; Ekin Can Erkus et al. introduced a new non-parametric anomaly detection technique, FOD, based on the definition of Fourier transformation, effective in detecting quasi-periodic anomalies in time series data [
5]. Zhou Z. Y. et al. presented a fast minimal infrequent pattern mining algorithm, called MIMDS_1/2, together with an efficient outlier detection method for data streams [
6]. Yuehua Huang et al. introduced a new unsupervised anomaly detection algorithm called ISOD, which enhances the interpretability and scalability of anomaly detection [
7]. Saihua Cai et al. proposed the UWFP-Outlier method, which efficiently and accurately detects outliers from uncertain weighted data streams at a low time cost [
8]. Although these methods perform well in certain specific data sequences, due to the complexity and variability of hydraulic structure monitoring data, a single model often struggles to accommodate the multidimensional features and potential complex distributions of the data, resulting in inadequate accuracy and robustness in gross error identification [
9]. Based on the application of machine learning and deep learning technologies in the field of anomaly detection for the safety monitoring data of hydraulic engineering structures, various approaches have been explored and applied to address specific issues. A. Weckenmann et al. investigated multi-sensor data fusion in dimension measurement using deep learning [
10]; C.A. Perez-Ramirez et al. studied building response prediction based on recurrent neural networks [
11]; Yongjia Xu et al. proposed a framework for earthquake damage assessment using long short-term memory networks, offering deep learning algorithms with various data fusion techniques [
12].
Given the uniqueness of hydraulic structure monitoring data, such as their varied physical mechanisms, high-dimensional features, and multiple fault types, models must satisfy broader requirements. Many gross error identification techniques rely on specific statistical models or threshold settings, which may not be suitable for multivariate, high-dimensional data sets [
13]. Some methods require strict prior assumptions and demand certain distributions for data sequences, such as normal distribution, which are difficult to meet in practical applications. Moreover, due to the interaction between the distribution of gross errors and normal data, many existing gross error identification algorithms are overly sensitive to outliers, prone to falsely classifying non-anomalous data as anomalous, especially in cases of complex data features or strong intercorrelations [
14,
15].
This paper aims to establish a pre-classification method based on data sets to address the issue of complex data distributions. Data set pre-classification involves categorizing the data set before performing gross error identification. Through pre-classification, a large and complex data set can be decomposed into several smaller, more manageable, and analyzable subsets [
16,
17,
18,
19]. Such decomposition helps reduce the overall complexity of data analysis, as the data within each subset are more homogeneous. This makes it easier for the data set to exhibit its own characteristics and patterns, allowing for the optimal selection of gross error identification methods tailored to the specific features of each subset [
20,
21,
22].
Distance-based time series classification methods classify by calculating the similarity (i.e., distance) between different time series. Batista G.E. et al. introduced the “complexity-invariant distance measure”, a method that significantly improves classification accuracy without sacrificing efficiency [
23]. The Euclidean distance method is commonly used to measure similarity; however, it has limitations in dealing with phase shifts and noise in time series data. To address these issues, Cuturi M. et al. proposed a hierarchical clustering algorithm based on DTW (Dynamic Time Warping) distance measurement, which enhanced the precision of similarity measurements and proved effective in handling large-scale time series data [
24]; Xi X. et al. proposed a fast time series classification method using a reduced number of data points, improving classification speed and accuracy by reducing the amount of training data and optimizing the size of the DTW warping window [
25]; Keogh E. et al. highlighted flaws in experimental design in time series data mining research and called for a more extensive and cautious empirical evaluation concerning the choice of data sets, the repeatability of experiments, and the transparency of evaluation criteria [
26]; and Ye L. introduced a new method for mining time series data—Time Series Shapelets [
27]. This method focuses on identifying representative shape fragments (shapelets) from time series, which effectively distinguish between different categories of sequences.
Feature-based time series classification methods first convert time series data into a set of features, then use these features for classification. STL decomposition is a method for extracting features from time series, which decomposes the characteristics of time series into seasonal, trend, and residual components [
28]. Classification methods based on local sequence features effectively solve the problem of feature redundancy caused by inputting entire data sets. Bakirtzis S et al. conducted a review of time series classification based on deep learning, summarizing the applications and the advantages and disadvantages of various network architectures such as multilayer perceptrons, convolutional neural networks, recurrent neural networks, and attention mechanisms in time series classification [
29]; Bagnall A et al. explored the potential benefits of enhancing classification accuracy through data transformation into alternative representations, proposing a transformation-based ensemble method [
30]; and LE GUENNEC A et al. discussed the application of convolutional neural networks in time series classification, improving classification performance on small data sets through data augmentation and semi-supervised learning methods [
31].
The safety monitoring projects of hydraulic structures, including deformation, seepage, and stress–strain monitoring, are influenced by factors such as age, temperature, and water loads. Temperature and water loads typically exhibit periodic change characteristics, while aging effects usually manifest as monotonic changes. These combined factors make the monitoring data predominantly display periodic and increasing (or decreasing) characteristics over time. Therefore, this study employs linear regression, wavelet transformation, and random forest classifiers for data set pre-classification. By using advanced feature extraction methods to capture the essential characteristics of time series data, subsets with similar features are delineated, and suitable algorithms are then chosen for each group for in-depth analysis.
2. Data Set Pre-Classification
Data set pre-classification helps us identify the main features and patterns of data sets early on, such as trends, periodicities, or other statistical characteristics, thus providing a basis for selecting the most suitable gross error identification algorithm [
32]. Linear regression is a basic statistical tool used to evaluate a series of data points, determining whether they are on an upward or downward trend [
33]. It can quickly provide the main trends in time series data, including slope and the statistical significance of the trend, which helps in assessing the long-term behavior of the time series. Wavelet analysis is effective for locally analyzing time series data in the time-frequency domain, particularly suited for feature extraction from non-stationary and aperiodic data [
34]. It offers a method to decompose signals on different scales, capturing transient changes and anomalous behaviors in the data, and is particularly effective in distinguishing between periodic and aperiodic components. When pre-classifying data sets, linear regression, wavelet transformation, and random forest classifiers are used to analyze the time series data set, capturing its essential characteristics through advanced feature extraction methods (
Figure 2) [
35]. This helps determine their trends (growth or decline), periodicity (periodic or aperiodic), and basic statistical features (mean, standard deviation, normality) [
36].
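As an illustration of this feature-extraction step, the sketch below computes a trend slope via linear regression, a dominant-frequency proxy via the FFT, and basic statistics including a normality test for two toy series. The feature set and the helper name `extract_features` are illustrative, not the paper's exact implementation; a wavelet-based variant would substitute the FFT step.

```python
import numpy as np
from scipy import stats

def extract_features(series):
    """Illustrative trend/periodicity/statistics features for a 1-D series."""
    t = np.arange(len(series))
    slope, _, _, trend_p, _ = stats.linregress(t, series)
    # Dominant frequency index of the FFT as a simple periodicity proxy
    spectrum = np.abs(np.fft.rfft(series - series.mean()))
    dominant_freq = int(np.argmax(spectrum[1:]) + 1)  # skip the DC component
    # Basic statistics, including a Shapiro-Wilk normality test
    _, normality_p = stats.shapiro(series)
    return {
        "slope": slope,
        "trend_p_value": trend_p,
        "dominant_freq": dominant_freq,
        "mean": float(series.mean()),
        "std": float(series.std()),
        "normality_p": normality_p,
    }

t = np.linspace(0, 4 * np.pi, 200)          # two full cycles
periodic = np.sin(t)
trending = 0.05 * np.arange(200) + np.random.default_rng(0).normal(0, 0.1, 200)

f_periodic = extract_features(periodic)
f_trending = extract_features(trending)
print(f_periodic["dominant_freq"])                           # -> 2 (two cycles in the window)
print(abs(f_periodic["slope"]) < abs(f_trending["slope"]))   # -> True
```

Feature vectors of this kind can then be fed to a random forest classifier (e.g., scikit-learn's `RandomForestClassifier`) trained on the tagged subset.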
In safety monitoring projects for hydraulic structures, including the monitoring of deformation, seepage, and stress–strain, most of the data for a dam operating stably are affected by aging, temperature, and upstream water loads. Temperature acts as a cyclic variable, and operational scheduling requirements typically cause water loads to also display cyclic characteristics. Aging effects generally manifest as monotonic characteristics that increase or decrease over time. The combination of these three factors results in monitoring data primarily exhibiting cyclic and increasing (or decreasing) trends from a time series perspective. Accordingly, this study categorizes dam monitoring data into the following types and, to more prominently display the characteristics of different types of data, selects four representative groups to generate experimental data for demonstration (
Figure 3).
- (1)
Sinusoidal Wave Cyclical Data Set: This data set exhibits changes in a sinusoidal wave pattern, one of the most common types of cyclical variations. It represents an ideal, smooth cyclic change with peaks and troughs occurring at regular intervals.
- (2)
Triangular Wave Cyclical Data Set: The changes in this data set follow a linear ascending and descending form, creating a triangular wave pattern. Compared to the sinusoidal wave, the rises and falls of a triangular wave are more direct and abrupt, with no smooth transitions.
- (3)
Seasonal Cyclical Data Set: This data set shows periodic changes that vary with the seasons. This type of cyclical change is usually associated with factors related to specific seasons, reflecting the environmental and operational conditions that vary throughout the year.
- (4)
Weakly Cyclical Growth Data Set: This data set exhibits both periodic fluctuations and a long-term upward trend. It includes data that fluctuate cyclically at a certain frequency but generally show an upward trend and data that overall exhibit a growth trend but whose cyclical pattern changes or undergoes abrupt changes during the growth process. This data type not only reflects a long-term growth trend but also demonstrates the complexity of short-term cyclical changes. These waveforms may exhibit different characteristics due to changes in the cycle’s length, amplitude, or phase over time.
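To make the four categories concrete, the following sketch generates one illustrative series of each type; the periods, amplitudes, and noise levels are arbitrary demonstration values, not taken from the paper's experimental data.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(42)
n = 365                          # one year of daily readings (illustrative)
t = np.arange(n)

# (1) Sinusoidal wave cyclical: smooth peaks and troughs at regular intervals
sinusoidal = 10 * np.sin(2 * np.pi * t / 30) + rng.normal(0, 0.5, n)

# (2) Triangular wave cyclical: linear rises and falls (width=0.5 -> triangle)
triangular = 10 * signal.sawtooth(2 * np.pi * t / 30, width=0.5) + rng.normal(0, 0.5, n)

# (3) Seasonal cyclical: one dominant annual cycle
seasonal = 15 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 0.5, n)

# (4) Weakly cyclical growth: long-term trend plus a cycle whose amplitude drifts
weakly_growing = (0.05 * t
                  + (3 + 0.01 * t) * np.sin(2 * np.pi * t / 30)
                  + rng.normal(0, 1.0, n))

print(weakly_growing[-30:].mean() > weakly_growing[:30].mean())  # -> True (upward trend)
```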
A pre-classification of data sets from actual monitoring data of hydraulic structures was conducted, involving 100 sets of data from piezometers, crack meters, thermometers, and strain gauges. For hydraulic structures, automated data collection generally occurs at fixed intervals and times. In this study, to preserve all features of the data, all valid data are retained, so the collection interval serves as the standard for parameter selection. Waveform tagging is performed on a small data set using linear regression and wavelet analysis, enabling automated extraction and analysis of data features: sinusoidal wave characteristics are confirmed through peak detection and spectral analysis; triangular waveform features are checked using the Fourier transform; and seasonal components in the time series are identified and quantified using STL (Seasonal and Trend decomposition using Loess). Finally, a random forest classifier is trained on the training set using all extracted features to pre-classify the data set.
Monitoring data for hydraulic structures are influenced in complex ways by water level loads, temperature loads, instrument stability, and the surrounding environment. Under various monitoring tasks, different devices therefore exhibit different periodic and growth characteristics in their data. For example, data might exhibit sinusoidal or seasonal cyclical patterns when primarily influenced by temperature and water level loads, or weak periodicity when affected by complex environmental factors or instrument instability. For instance, if piezometers are installed at locations affected by seasonal water level changes, their data might display sinusoidal waveforms or seasonal cyclical changes corresponding to the rising and falling water levels.
Under extreme weather conditions like heatwaves or cold snaps, or sudden changes in water levels due to heavy rainfall or high wind speeds, piezometer data might experience significant short-term fluctuations, leading to atypical and irregular waveforms. Crack meter data record physical deformations of structures, potentially related to seasonal changes in temperature or humidity, resulting in seasonal cyclical data waveforms or triangular waveforms due to the opening and closing actions of structural cracks. If crack meters and strain gauges are installed near high-stress areas of hydraulic structures or in areas sensitive to environmental changes, they might record more chaotic data signals due to minor movements or deformations of the structure, resulting in disordered weak cyclical data. By pre-classifying 100 sets of monitoring data, including piezometers, crack meters, thermometers, and strain gauges, it is possible to observe that the pre-classified results bear a high resemblance to the standard data set waveforms of the Sinusoidal, Triangular, and Seasonal Cyclical types (
Figure 4). For Weakly Cyclical Growth Data Sets, whose characteristic periodic fluctuations are combined with long-term growth trends, the complexity of the data allows not only the classification of data sets that closely resemble the standard data sets but also the definition of a category for data sets with weaker, more disordered cyclical patterns. Data whose characteristics do not fit the Sinusoidal Wave Cyclical, Seasonal Cyclical, or Triangular Wave Cyclical types are therefore defined as weakly cyclical growth data. This catch-all definition contributes significantly to distinguishing whether a data set exhibits distinct periodicity. In summary, the classification is exhaustive: each data sequence is assigned to a single, unique category.
After pre-classifying the data set, the model’s performance was evaluated on an independent test set using comprehensive metrics such as precision, recall, and F1 score [
37]. The test results showed that 44 sets of Sinusoidal Wave Cyclical Data, 20 sets of Triangular Wave Cyclical Data, 8 sets of Seasonal Cyclical Data, and 28 sets of Weakly Cyclical Growth Data were identified. Precision and recall rates were above 80% across all categories, with an overall accuracy reaching 86%, demonstrating that this method effectively enhances the accuracy and efficiency of time series data classification. Pre-classification of the data set directly reduces the complexity and uncertainty of subsequent analyses. By dividing the data set into groups with similar characteristics and then selecting the most suitable algorithms for detailed analysis of each group, the process becomes more targeted and efficient (
Table 1).
3. Outlier Detection after Pre-Classification of Data Sets
Based on the pre-classification of the actual monitoring data of hydraulic structures, observation of the distribution shapes shows that some data sets fluctuate around a long-term average or have segments that approximate a normal distribution; therefore, the statistically based 3σ algorithm was selected. The data sets also include periodic data with obvious peaks and troughs as well as dispersed points far from the center; thus, the cluster-based K-medoids algorithm was chosen. This algorithm clusters by optimizing the distance between data points and their central points and is suitable for small-to-medium data sets, especially those requiring robust outlier detection. In addition to the threshold-based and clustering methods, this paper also selected the tree-based Isolation Forest algorithm for outlier detection testing. Isolation Forest is a tree-based ensemble learning method designed specifically for anomaly detection. It isolates each data point by randomly selecting features and split points; anomalies, because their values are rare, are usually easier to isolate and thus have shorter average path lengths in the trees. This makes the method well suited to large-scale, high-dimensional data sets with complex distributions.
3.1. 3-Sigma Rule for Outlier Detection
The 3σ algorithm, based on the normal distribution assumption, asserts that 99.73% of data points lie within three standard deviations (σ) of the mean. Applying the 3σ algorithm for outlier analysis on four typical data types of actual monitoring data for hydraulic structures (
Figure 5) (
Table 2), the results indicate that the model performs best with Triangular Wave Cyclical Data Sets: the analysis shows an accuracy of 99.96%, a precision of 100.00%, a recall of 92.18%, and an F1 score of 0.8195, demonstrating that the model is extremely precise and reliable in detecting anomalies in these data, and the high recall rate shows that most true outliers are captured. In Seasonal Cyclical Data Sets, the model also performs well, with an accuracy of 99.80%, a precision of 68.67%, a recall of 90.90%, and an F1 score of 0.7843; although the recall is lower than for the Triangular Wave Data Set, the model remains efficient at identifying anomalies in such data. For Sinusoidal Wave Cyclical Data Sets, the accuracy reaches 99.57%, but the recall is low at 48.28%, yielding an F1 score of 0.5185, which may indicate that a large number of outliers are not captured. For Weakly Cyclical Growth Data Sets, the model performs poorly, with an accuracy of only 72.13%, a precision of 15.00%, a recall of 1.87%, and an F1 score of 0.0351; this indicates that the model struggles with data sets that combine complex periodic and growth characteristics and fails to effectively identify outliers in such data. Overall, the 3σ algorithm performs satisfactorily on triangular and seasonal cyclical data sets, but for weakly cyclical growth data sets with more complex dynamics it performs poorly, indicating a need for further optimization or more advanced anomaly detection techniques suited to these data characteristics.
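The rule itself reduces to a few lines. The sketch below flags injected gross errors in a toy Gaussian series; the helper name and test data are illustrative. Note that for trending data a regression residual, rather than the raw series, should be tested, which is one reason the plain rule degrades on weakly cyclical growth data.

```python
import numpy as np

def three_sigma_outliers(series):
    """Flag points more than three standard deviations from the mean
    (illustrative helper; real use would test model residuals)."""
    mu, sigma = series.mean(), series.std()
    return np.abs(series - mu) > 3 * sigma

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 1000)
data[[100, 500]] = [8.0, -9.0]        # inject two gross errors

mask = three_sigma_outliers(data)
print(np.flatnonzero(mask))           # flagged indices, including 100 and 500
```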
3.2. K-Medoids Machine Learning Algorithm for Outlier Detection
The K-medoids algorithm is an unsupervised clustering algorithm developed to address the issue of the K-Means algorithm being overly sensitive to outliers. K-Means can introduce significant errors when handling data sets with noise or extreme values, thus requiring high-quality data sets. By contrast, the K-medoids algorithm selects actual data points as cluster centers, making it less affected by extreme outliers in the data set and better at overcoming noise and gross errors in the data. This makes it particularly suitable for handling the data sets requiring gross error identification discussed in this paper. However, since each iteration involves calculating distances between multiple points, it runs slower on large data sets compared to the K-Means algorithm. The following are the implementation steps and key design elements of the K-medoids algorithm for anomaly detection (Algorithm 1).
Algorithm 1. K-medoids Machine Learning Algorithm for Outlier Detection
Input: Data set D = {x1, x2, …, xn}, number of clusters K, maximum number of iterations M
Output: Clustering results clusters, outliers outliers
1: Select initial medoids m1, m2, …, mK
2: Initialize the total cost to infinity: cost_total = ∞
3: for iteration = 1 to M do
4:  For all x ∈ D and all m ∈ medoids, calculate the distance d(x, m) // distance from each data point to each medoid
5:  For each x ∈ D, find min_m d(x, m) and assign x to the corresponding cluster // assign each point to its nearest medoid to form clusters
6:  For each cluster C, find the new medoid m′ = argmin_{x∈C} Σ_{xi∈C} d(x, xi) and calculate its cost cost_C // recompute the medoid and cost of each cluster
7:  if current cost < cost_total then // compare costs
8:   Update the medoids
9:   Update the total cost: cost_total = Σ cost_C
10:  else
11:   break // terminate if there is no cost improvement
12:  end if
13: end for
14: Set the threshold for determining outliers // chosen based on the actual problem context
15: Mark points whose distance from their medoid exceeds the threshold as outliers
16: return clusters and outliers // return results
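A minimal NumPy sketch of Algorithm 1 for a one-dimensional monitoring series follows; the quantile-based initialization and the median-distance outlier threshold are illustrative choices, since the algorithm leaves both the initial medoids and the threshold to the problem context.

```python
import numpy as np

def k_medoids_outliers(X, k=2, max_iter=100, threshold=3.0):
    """Minimal 1-D K-medoids (Algorithm 1) with distance-based outlier flagging."""
    # Spread initial medoids across the sorted data (illustrative initialization)
    order = np.argsort(X)
    medoid_idx = order[np.linspace(0, len(X) - 1, k + 2, dtype=int)[1:-1]]
    total_cost = np.inf
    for _ in range(max_iter):
        d = np.abs(X[:, None] - X[medoid_idx][None, :])  # point-to-medoid distances
        labels = d.argmin(axis=1)                        # assign to nearest medoid
        new_idx, cost = medoid_idx.copy(), 0.0
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue
            # New medoid: the member minimising summed distance to the others
            within = np.abs(X[members][:, None] - X[members][None, :])
            new_idx[c] = members[within.sum(axis=1).argmin()]
            cost += np.abs(X[members] - X[new_idx[c]]).sum()
        if cost < total_cost:                            # keep only improving updates
            medoid_idx, total_cost = new_idx, cost
        else:
            break                                        # no cost improvement
    dist = np.abs(X - X[medoid_idx][labels])
    # Outlier cutoff per cluster: multiples of the median medoid distance (illustrative)
    cutoff = np.array([threshold * np.median(dist[labels == c]) for c in range(k)])
    return labels, dist > cutoff[labels]

rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(10, 1, 200), [30.0]])
labels, outliers = k_medoids_outliers(X, k=2)
print(bool(outliers[-1]))  # -> True: the isolated reading at 30.0 is flagged
```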
The K-medoids algorithm was applied to analyze outliers in four typical data types of actual monitoring data for hydraulic structures, showing varying performances across different data sets (
Figure 6) (
Table 3). In the Sinusoidal Wave Cyclical Data Set, it achieved a high accuracy of 99.90%, a perfect precision of 100.00%, a recall of 79.31%, and an F1 score of 0.8846, demonstrating the K-medoids algorithm’s effectiveness in accurately identifying most true anomalies with almost no false positives. For the Triangular Wave Cyclical Data Set, the accuracy was similarly high at 99.96% and the precision was again 100.00%; however, the recall was lower at 40.00%, resulting in an F1 score of 0.5714. For the Seasonal Cyclical Data Set, overall performance declined slightly, with an accuracy of 99.85%, a precision of 85.00%, a recall of 72.72%, and an F1 score of 0.8095, indicating some difficulty in anomaly detection, particularly the misclassification of normal data points as anomalies. The Weakly Cyclical Growth Data Set presented an extreme case for the K-medoids algorithm, with an accuracy of only 44.85%, a very high recall of 89.40%, a precision of 30.30%, and an F1 score of 0.4526; while the algorithm identified most anomalies, it also generated a large number of false positives, resulting in very low precision.
3.3. Isolation Forest Machine Learning Algorithm for Outlier Detection
The Isolation Forest algorithm is generally used to identify anomalies in data, which is particularly suitable for the task of detecting gross errors in dam monitoring data, as discussed in this paper. The Isolation Forest algorithm isolates each data point by randomly constructing multiple isolation trees, which are a special type of binary tree. During this isolation process, because the characteristics of anomalous data are different from normal data and the quantity of anomalous data is smaller, anomalous data are more easily isolated. The algorithm mainly includes two steps, building isolation trees and calculating anomaly scores, as shown in
Figure 6. It is important to note that the data splitting mentioned in the diagram refers to dividing the data set into two subsets based on a randomly selected feature and split value from the previous step; one subset contains all data with values smaller than the split value on that feature, while the other contains all data with values greater than or equal to the split value.
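A sketch of this procedure with scikit-learn's `IsolationForest` is shown below; the two-feature representation (raw value plus first difference, so that jumps as well as extreme levels can be isolated) and the contamination rate are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
t = np.arange(1000)
series = np.sin(2 * np.pi * t / 50) + rng.normal(0, 0.05, t.size)
series[[200, 600]] += 4.0                  # inject two gross errors

# Feature matrix: each reading's value and its first difference
diff = np.diff(series, prepend=series[0])
X = np.column_stack([series, diff])

model = IsolationForest(contamination=0.005, random_state=0).fit(X)
flags = model.predict(X) == -1             # -1 marks isolated (anomalous) points
print(np.flatnonzero(flags))               # flagged indices, including the spikes at 200 and 600
```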
Using the Isolation Forest algorithm, anomaly analysis was conducted on four typical data types of actual monitoring data for hydraulic structures (
Figure 7) (
Table 4). For the Sinusoidal Wave Cyclical Data Set, the Isolation Forest algorithm demonstrated high accuracy (99.31%), indicating that the vast majority of data points were correctly classified. However, the precision was low (39.68%), suggesting that a significant portion of the detected anomalies were false positives. The recall was very high (86.20%), meaning that most actual anomalies were successfully detected, but owing to the high false positive rate the F1 score was 0.5434. In the Triangular Wave Cyclical Data Set, the Isolation Forest algorithm showed extremely high accuracy (99.99%) and perfect precision (100.00%), meaning all detected anomalies were actual anomalies with no false positives. However, the recall was only 40.00%, indicating that the algorithm failed to detect the majority of anomalies; the resulting F1 score of 0.5714 nevertheless demonstrates the algorithm’s reliability under specific conditions. For the Seasonal Cyclical Data Set, the algorithm exhibited high accuracy (99.78%) and good precision (77.27%). The recall was 70.83%, indicating that most anomalies could be detected, and the F1 score of 0.7391 shows an overall balanced performance. In the Weakly Cyclical Growth Data Set, the Isolation Forest algorithm performed poorly: the accuracy was only 75.10%, the precision 30.00%, and the recall very low at just 1.05%, giving an F1 score of only 0.0203. This indicates that the algorithm struggled to detect anomalies in this type of data set and produced a high proportion of false positives, likely because the characteristics of the Weakly Cyclical Growth Data Set make it difficult to distinguish normal from anomalous values.
3.4. Evaluation of Outlier Detection Algorithm Matching Based on Data Set Pre-Classification
For the four types of actual monitoring data of hydraulic structures, the 3σ algorithm, K-medoids algorithm, and Isolation Forest algorithm were used for anomaly detection. In the analysis of the Sinusoidal Wave Cyclical Data Set, the K-medoids algorithm demonstrated superior performance. Although the 3σ algorithm had a high accuracy rate of 99.57%, its recall rate was only 48.28%, resulting in an F1 score of 0.5185, suggesting that a large number of anomalies might not have been captured. This could be due to the typically smooth periodic changes in sinusoidal wave data, which may make it difficult for the algorithm to distinguish minor anomalous fluctuations, thus affecting the recall rate. The Isolation Forest algorithm had an accuracy rate of 99.31%, a low precision rate (39.68%), a recall rate of 86.20%, and an F1 score of 0.5434. Despite the low precision, the high recall rate indicates that the algorithm can capture most of the true anomalies, possibly because its independent processing of data points can reveal unusual patterns. The K-medoids algorithm achieved an extremely high accuracy rate of 99.90%, a precision rate of 100.00%, a recall rate of 79.31%, and an F1 score of 0.8846. Its high accuracy and recall rates demonstrate excellent detection capabilities for data sets with strong and regular periodicity, as K-medoids can identify and cluster data points with similar periodic features, effectively recognizing most of the true anomalies.
In the analysis of the Triangular Wave Cyclical Data Set, this type of data typically displays a waveform with linear rises and falls, making its changes more direct and abrupt than those of sinusoidal waves. The 3σ algorithm, which identifies anomalies by their deviation from the mean in units of the standard deviation, is suitable for data sets that approximately follow a normal distribution. It performs well on the Triangular Wave Cyclical Data Set, especially in terms of precision, demonstrating its effectiveness in flagging deviations from the central trend as anomalies. For triangular wave cyclical data with clear periodicity and distinct patterns, the 3σ algorithm can effectively detect anomalies that exceed the standard deviation range; the sharp peaks and troughs of the triangular wave make anomalies easier to detect through statistical thresholds. The Isolation Forest algorithm isolates anomalies by building isolation trees, and its high precision indicates reliability in marking anomalies, although its recall, lower than that of the 3σ algorithm, may be due to some anomalies being more difficult to isolate. The K-medoids algorithm can effectively distinguish between normal and anomalous points in triangular wave data owing to the data's periodicity and regularity, but its lower recall may arise because some minor anomalies do not form distinct clusters. Both the 3σ and K-medoids algorithms perform well on the Triangular Wave Cyclical Data Set, and the choice between them can be based on prior knowledge of the data distribution and the precision requirements of anomaly detection.
Seasonal Cyclical Data Sets typically exhibit periodic changes closely tied to seasonal variation, potentially involving fluctuations in temperature, humidity, or other cyclical environmental factors. When algorithm performance is assessed on such data, taking into account both the periodicity and the potential non-linear characteristics of the data, the Isolation Forest algorithm shows the best overall balance. This is likely because its isolation-based approach is well suited to non-linear and non-periodic anomalies superimposed on a periodic background, enabling the algorithm to identify most anomalies effectively.
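A minimal sketch of Isolation Forest on synthetic seasonal data, using scikit-learn's `IsolationForest` (the series, spike values, and contamination rate are illustrative assumptions, not this study's configuration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic seasonal series: a one-year sinusoid plus mild noise.
t = np.arange(730)
series = np.sin(2 * np.pi * t / 365) + 0.1 * rng.normal(size=t.size)

# Inject two gross errors far outside the seasonal envelope.
series[100] = 8.0
series[400] = -8.0

# Isolation Forest assigns short average isolation-path lengths to
# outliers; fit_predict returns -1 for points judged anomalous.
model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(series.reshape(-1, 1))
flagged = np.where(labels == -1)[0]
```

The `contamination` parameter fixes the fraction of points reported as anomalous, so in practice it should be tuned to the expected gross error rate of the monitoring series.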
The complexity of Weakly Cyclical Growth Data Sets lies in the fact that they contain not only periodic changes but also trends that increase or decrease over time; besides regular series, this category also includes series with poor regularity, which makes gross error identification in such data undoubtedly challenging. The three methods tested showed specific adaptability on the other three pre-classified data types but did not perform particularly well on weakly cyclical data. This may be because the structural characteristics of these series are especially complex and exceed the conventional processing capabilities of these algorithms. For regular weakly cyclical data, subsequent work may require more specialized methods, such as advanced time series models that combine trend and cyclical analysis, or more complex machine learning techniques, such as time-window-based sequence anomaly detection or deep learning models, to better capture long-term trends and cyclical changes. As for complex disordered data sets, they may be affected by a complex surrounding environment or sensor aging, leading to poor data quality. For such low-quality data, intelligent detection methods are not appropriate; it is more suitable to flag anomalies against historical extremes or the extremes of typical engineering conditions.
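The envelope check suggested above for low-quality series can be sketched as a simple bounds test against historical extremes; the readings, bounds, and margin below are illustrative placeholders, not values from this study:

```python
import numpy as np

def flag_by_extremes(series, hist_min, hist_max, margin=0.05):
    """Flag readings outside the historical envelope.

    A small relative margin widens the envelope so that values at the
    historical extremes themselves are not flagged.
    """
    series = np.asarray(series, dtype=float)
    span = hist_max - hist_min
    lower = hist_min - margin * span
    upper = hist_max + margin * span
    return np.where((series < lower) | (series > upper))[0]

# Illustrative sensor readings containing two implausible values.
readings = [12.1, 12.4, 11.9, 55.0, 12.2, -3.0, 12.0]
bad = flag_by_extremes(readings, hist_min=10.0, hist_max=14.0)
```

Because the rule depends only on fixed engineering bounds rather than the statistics of the (unreliable) series itself, it remains usable even when the data are too disordered for intelligent detection methods.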
4. Conclusions
In this study, we thoroughly explored the importance and implementation methods of gross error identification and data set pre-classification in the monitoring data of hydraulic structures. The results indicate that effective data pre-classification and precise anomaly detection algorithms can significantly enhance the accuracy and efficiency of safety monitoring for hydraulic structures.
Firstly, this study developed a preprocessing and classification method for monitoring data as a foundation for identifying gross errors in dam monitoring data. By combining linear regression and wavelet analysis with the random forest algorithm, the various waveform types in the data set, namely Sinusoidal Wave Cyclical, Triangular Wave Cyclical, Seasonal Cyclical, and Weakly Cyclical Growth, were effectively distinguished. This classification provides a more accurate foundation for subsequent gross error identification.
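A sketch of the kind of shape features such a pre-classifier can use: a trend slope from a linear fit and a dominant period from the FFT magnitude spectrum. The feature set and its wiring are illustrative assumptions, not this study's exact implementation, which additionally employs wavelet analysis and a random forest classifier:

```python
import numpy as np

def waveform_features(series):
    """Simple shape features for pre-classifying a monitoring series."""
    series = np.asarray(series, dtype=float)
    t = np.arange(series.size)

    # Trend: slope of a least-squares linear fit.
    coeffs = np.polyfit(t, series, 1)
    slope = coeffs[0]

    # Periodicity: dominant period of the detrended series from the
    # FFT magnitude spectrum (DC component excluded).
    detrended = series - np.polyval(coeffs, t)
    spectrum = np.abs(np.fft.rfft(detrended))
    k = 1 + np.argmax(spectrum[1:])
    period = series.size / k
    return slope, period

# A sinusoid with period 50 and no trend.
t = np.arange(500)
slope, period = waveform_features(np.sin(2 * np.pi * t / 50))
```

Feature vectors of this kind, computed per series, could then be fed to a random forest to separate the four waveform classes described above.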
Secondly, in the gross error identification experiments, the 3σ, K-medoids, and Isolation Forest algorithms were applied to the experimental data. Comparing the three algorithms across data types showed that each has strengths and limitations in specific situations. The K-medoids algorithm performed best on Sinusoidal Wave Cyclical Data Sets; the 3σ and K-medoids algorithms both adapted well to Triangular Wave Cyclical Data Sets; and the Isolation Forest algorithm, owing to its stability and resistance to extreme values, proved more suitable for Seasonal Cyclical Data Sets with strong seasonality and large fluctuations. For Weakly Cyclical Growth Data Sets, none of the three algorithms performed ideally, suggesting that more advanced analysis methods, or a combination of methods, may be required to handle data with such complex dynamics.
Finally, the results of this study not only improve the capability to identify gross errors in the monitoring data of hydraulic structures but also provide a valuable reference and foundation for future research. Monitoring data for hydraulic structures are now collected and gathered by automated instruments, and the main goal of our work is to use intelligent algorithms to identify anomalies in these data, thereby raising the automation level of the entire data processing workflow. In the future, the present anomaly detection work can support real-time monitoring of the data, which in turn could help predict and prevent problems with the structural integrity of hydraulic structures. Automating these algorithms and implementing real-time processing will be key steps in improving the response speed and accuracy of hydraulic structure monitoring systems. By integrating advanced data processing technologies and machine learning algorithms, we aim to further strengthen the safety monitoring of hydraulic structures, ensuring the stable operation and long-term safety of these critical infrastructures. In subsequent research, decomposing and classifying complex disordered data sets may significantly improve the effectiveness of data set pre-classification, and designing gross error identification methods tailored to the characteristics of such data sets may further improve the ability to identify gross errors in the monitoring data of hydraulic structures.