1. Introduction
As the significance of hydropower units in modern power systems grows, the urgency for accurate condition monitoring and prediction increases. However, the quality of monitoring data in hydropower units is often compromised due to interference, abnormalities or failures in data acquisition and transmission links, leading to issues like data anomalies and missing data. These problems hamper the accuracy and reliability of condition monitoring data [1].
Traditional clustering methods such as K-Means [2,3,4,5,6,7,8], subspace clustering [9,10,11,12] and Mean-Shift [13,14,15,16] struggle to adapt to the varying distributions of monitoring and collection data in industrial settings. Furthermore, statistical feature-based methods like the Lyddane criterion [17,18] and the quartile method [19,20,21] are limited in their accuracy because the fluctuating operating conditions of hydropower units affect the signal amplitude. Density-based clustering methods, such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [22,23,24], the Local Outlier Factor (LOF) algorithm [25,26], Density-Based Clustering (DENCLUE) [27], hierarchical agglomerative clustering (HAC) [28] and Ordering Points To Identify the Clustering Structure (OPTICS) [29,30], although capable of extracting clusters of arbitrary shapes and recognizing noise, still face challenges in parameter selection and do not fully meet current needs. Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [31,32,33,34,35,36], which integrates the advantages of these methods and addresses the parameter selection issue, has shown promising results in application verification.
Missing values in condition monitoring data have a great impact on both dimensionality reduction analysis and model training, and the timely handling of missing data is crucial for intelligent decision making [1,37]. Data augmentation is a technique for generating new data samples by transforming or processing the original data while keeping the data labels unchanged. It is widely used in machine learning and deep learning to enhance the generalization ability of models and mitigate overfitting, which is of great practical significance, especially when data are scarce. Missing data imputation is an important task in data processing whose goal is to estimate the missing values in a dataset by a reasonable method, thereby improving the availability and quality of the dataset. In recent years, traditional statistical methods, machine learning-based methods, time series imputation methods and multiple imputation methods, among others, have been developed.
Traditional data filling methods, including mean filling and local interpolation, often fail to account for temporal dependencies between samples, resulting in suboptimal filling accuracy. Adhikari et al.’s comprehensive survey [37] underscores the importance of accurate data handling in IoT for intelligent decision making, specifically highlighting the inadequacy of many methods in addressing temporal correlations. Little and Rubin [38] illuminate the utility of mean filling in missing data analysis, yet also point out its limitations in maintaining the inherent structure and relationships within datasets. Ratolojanahary et al. [39] demonstrate improvements in multiple imputation for datasets with high rates of missingness but note the challenge of capturing temporal dependencies. Goodfellow et al. [40], in their pioneering work on GANs, opened avenues for data augmentation but faced challenges with unstable training, impacting the quality of imputed data. Kim et al. [41] discuss the potential of deep learning methods, including GANs, for complex data structures, yet identify overfitting and the neglect of time-related characteristics as issues. Haliduola et al. [42] approach missing data imputation in clinical trials using clustering and oversampling to enhance neural network performance but do not fully address the temporal aspects of data series. Zhang et al. [43] show progress in handling sequential data for time series imputation yet struggle to accurately reflect temporal dynamics in variable datasets. Verma and Kumar [44] acknowledge the importance of capturing temporal sequences in healthcare data using LSTM networks but also highlight the computational complexity involved. Azur et al. [45] explore multiple imputation methods, revealing their limitations in handling datasets with significant time series elements. Finally, Song and Wan [46] highlight the effectiveness of certain interpolation methods in specific contexts but indicate their general inadequacy in capturing time-dependent data patterns. Collectively, these studies illustrate that while traditional data filling methods have evolved and improved in various ways, a significant gap remains in accurately addressing temporal dependencies in data series. This gap underscores the need for more advanced and nuanced approaches to data imputation.
To overcome these limitations, Generative Adversarial Networks (GANs) [47,48,49,50] have been introduced in the field of data generation; they are capable of learning data distributions and generating synthetic data with the features of real data. In the data augmentation field, many studies have shown that GANs can be used to generate additional data samples: a generative model is trained on the original dataset and then used to produce new samples that augment the original dataset, increasing the number of samples and improving the model’s generalization ability. However, traditional GANs suffer from problems such as unstable training and mode collapse. To overcome these issues, Wasserstein Generative Adversarial Networks (WGANs) were introduced [51,52,53]. By using the Wasserstein distance to measure the discrepancy between generated and real data, WGANs improve training stability and the quality of generated data. Furthermore, to increase the diversity of generated data, a gradient penalty was introduced into the WGAN, forming the Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) [54,55,56,57]. Based on the GAN and WGAN architectures, dedicated generative imputation networks were developed for data imputation, namely, the Generative Adversarial Imputation Network (GAIN) [58] and the Wasserstein Generative Adversarial Imputation Network (WGAIN) [59]. The Slim Generative Adversarial Imputation Network (SGAIN), a lightweight generative imputation architecture without a hint matrix, was proposed as an improvement on GAIN. To address the issues that traditional GANs introduce into SGAIN, the Wasserstein Slim Generative Adversarial Imputation Network (WSGAIN) and the Wasserstein Slim Generative Adversarial Imputation Network with Gradient Penalty (WSGAIN-GP) were further developed [59].
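For reference, the critic (discriminator) objective commonly used in WGAN-GP combines the Wasserstein loss with a gradient penalty term; this is the standard formulation, and the works cited above may differ in notation:

```latex
\mathcal{L}_D =
\underbrace{\mathbb{E}_{\tilde{x}\sim \mathbb{P}_g}\left[D(\tilde{x})\right]
- \mathbb{E}_{x\sim \mathbb{P}_r}\left[D(x)\right]}_{\text{Wasserstein critic loss}}
+ \lambda\,
\underbrace{\mathbb{E}_{\hat{x}\sim \mathbb{P}_{\hat{x}}}\left[\left(\left\lVert \nabla_{\hat{x}} D(\hat{x}) \right\rVert_2 - 1\right)^2\right]}_{\text{gradient penalty}},
\qquad \hat{x} = \epsilon x + (1-\epsilon)\tilde{x},\ \epsilon \sim U[0,1].
```

Here $\mathbb{P}_r$ and $\mathbb{P}_g$ are the real and generated data distributions, and the penalty weight $\lambda$ softly enforces the 1-Lipschitz constraint on $D$ instead of the weight clipping used in the original WGAN, which is what stabilizes training.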
In this study, we introduce a novel approach for enhancing the quality and usability of the condition monitoring data of hydroelectric units, addressing the limitations of traditional methods, such as inadequate accuracy, disregard for temporal dependencies and obscured data distribution characteristics. Our methodology integrates two advanced data processing techniques: anomaly detection using the HDBSCAN clustering method and data imputation through the WSGAIN-GP generative model. This combination not only retains the intrinsic characteristics of the data but also significantly improves its completeness and utility. The HDBSCAN clustering method effectively groups monitoring data according to density levels, enabling the precise identification of outliers, which is crucial for accurate data enhancement. Following this, the WSGAIN-GP generative model, through unsupervised self-learning, adeptly approximates the distribution characteristics of real monitoring data and thereby generates high-quality substitutes for missing data, addressing the gap left by traditional methods. Our contribution is noteworthy in that we are the first to apply these methods to hydropower unit condition monitoring. In doing so, we not only preserve the fidelity of the data but also augment its integrity and applicability. The enhanced data quality and accuracy provided by our approach lay a solid foundation for more reliable condition monitoring and prediction of hydropower units. This advancement is a step towards intelligent warnings for hydropower unit conditions, ultimately contributing positively to the maintenance and operational efficiency of these units.
This paper delves into the specifics of our quality enhancement methodology, using the HDBSCAN clustering method and the WSGAIN-GP generative model, and presents experimental evidence demonstrating its significant impact on improving data quality and accuracy in hydroelectric unit condition monitoring.
3. Enhancement Methodology Flow of Hydropower Unit Condition Monitoring Data Based on HDBSCAN-WSGAIN-GP
A monitoring data quality enhancement method based on HDBSCAN-WSGAIN-GP improves the quality and usability of hydropower unit condition monitoring data by combining the advantages of density clustering and a generative adversarial network. The quality enhancement process of hydropower unit monitoring data based on HDBSCAN-WSGAIN-GP is shown in Figure 3. The detailed steps are as follows:
Step 1. Data pre-processing: anomalous data detection and cleaning based on HDBSCAN.
Step 2. Initialize the network parameters for the generator and discriminator.
Step 3. Define the inputs and outputs for the generator and the discriminator.
Step 4. Define the loss functions, including the losses for the generator, discriminator and gradient penalty.
Step 5. Mark missing values and construct mask vector M.
Step 6. During model training, alternate between training the generator network and the discriminator network, updating model parameters by optimizing the loss functions, which include both the generator and discriminator losses. The generator is used to generate estimated values for imputing missing data, while the discriminator evaluates the difference between generated data and real data.
Step 7. After each training epoch, evaluate the performance of the model, including the imputation effectiveness of the generator and the discrimination accuracy of the discriminator.
Step 8. Based on the evaluation results, adjust the model’s hyperparameters or structure to further optimize its performance, ultimately obtaining an efficient WSGAIN-GP model for imputing missing data.
Step 9. Perform missing value imputation: using the trained WSGAIN-GP model, merge the imputed data generated by the generator with the existing data in the original dataset to obtain a complete dataset, thereby achieving the imputation of missing data.
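The mask construction in Step 5 and the merge in Step 9 can be sketched in NumPy. The column-mean "generator" below is only a hypothetical stand-in: in the actual method, these estimates would come from the trained WSGAIN-GP generator network.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=0.0, scale=1.0, size=(20, 3))
X[rng.random(X.shape) < 0.3] = np.nan  # inject missing values

# Step 5: mask matrix M (1 = observed, 0 = missing)
M = (~np.isnan(X)).astype(float)

# Hypothetical stand-in for the trained generator: column means.
# WSGAIN-GP would instead produce G_out from its generator network.
G_out = np.tile(np.nanmean(X, axis=0), (X.shape[0], 1))

# Step 9: merge generated values with the observed entries to obtain
# a complete dataset -- observed values are kept, gaps are filled.
X_complete = M * np.nan_to_num(X) + (1.0 - M) * G_out
```

The merge formula `M * X + (1 - M) * G_out` guarantees that imputation never overwrites measured values, which is the property Step 9 relies on.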
To quantitatively assess the consistency of the filled data sequence with the original data sequence in terms of the distribution and characteristics, KL Divergence, JS Divergence and Hellinger Distance are introduced to quantify the similarity between two distributions, as shown in Equations (16)–(18):
KL Divergence, JS Divergence and Hellinger Distance are all non-negative. KL Divergence ranges from 0 to ∞, while JS Divergence and Hellinger Distance range from 0 to 1. Smaller values of these three metrics indicate greater similarity between two distributions, with a value of 0 indicating that the distributions are identical.
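The three metrics have standard definitions for discrete distributions; the sketch below follows those textbook forms (the notation may differ from the paper's Equations (16)–(18)):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions: 0 when p == q, up to infinity."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf  # q assigns zero mass where p does not
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logarithms, bounded in [0, 1]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return (0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)) / np.log(2)

def hellinger_distance(p, q):
    """Hellinger distance, bounded in [0, 1]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2))
```

For identical distributions all three metrics are 0; for distributions with disjoint support, the JS divergence and Hellinger distance reach their maximum of 1, matching the ranges stated above.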
5. Conclusions
In addressing issues such as anomalies and missing data that compromise the quality of condition monitoring datasets in hydropower units, this paper introduces a method for enhancing data quality through the integration of HDBSCAN-WSGAIN-GP. This method capitalizes on the strengths of density clustering and generative adversarial networks to enhance the reliability and utility of the condition monitoring data.
Initially, the HDBSCAN clustering method categorizes the monitoring data based on density levels, aligned with operational conditions, to adaptively detect and cleanse anomalies in the dataset. Furthermore, the WSGAIN-GP model, through its data imputation capabilities, employs unsupervised learning to understand and replicate the features and distribution patterns of actual monitoring data, thereby generating values for missing data.
The validation analysis, conducted using an online monitoring dataset from real operational units, provides compelling evidence of the method’s effectiveness:
(1) Comparative experiments reveal that the silhouette coefficient (SCI) of the anomaly detection model based on HDBSCAN reaches 0.4935, surpassing those of the other comparative models, thereby demonstrating its superior ability to distinguish between valid and anomalous samples.
(2) The probability density distribution of the data imputed by the WSGAIN-GP model closely mirrors that of the measured data. Notably, the Kullback–Leibler (KL) divergence, Jensen–Shannon (JS) divergence and Hellinger distance between the distributions of the imputed and original data approach zero, indicating a high degree of accuracy in data representation.
(3) Comparative analyses with other filling methods, including SGAIN, GAIN and KNN, show that the WSGAIN-GP model achieves superior imputation effectiveness across various missing data rates, consistently attaining the lowest Root Mean Square Error (RMSE). This confirms the high accuracy and generalization capability of the proposed imputation model.
The findings and methodologies presented in this study lay a robust foundation for high-quality data, crucial for subsequent trend prediction and state warnings in the context of hydropower unit monitoring.