1. Introduction
Rotating machinery is one of the most widely used types of machinery in industry, forming a core part of devices such as automobiles, power plants, turbines, and pumping machines. The failure of such machines can therefore cause mechanical malfunctions that lead to system shutdowns. To avoid such problems, various approaches have been developed to enhance the reliability of rotating machinery [1,2]. In particular, approaches that integrate technologies such as digital instrumentation and control systems have emerged [3].
Recently, the field of prognostics and health management (PHM) has advanced, focusing on the development of diagnostic and prognostic models based on data-driven approaches [4,5]. In artificial intelligence (AI), machine learning-based anomaly detection methodologies, such as support vector machines, have long been widely used [6,7,8]. In addition, PHM models based on deep neural networks, such as recurrent neural networks and long short-term memory models, have been studied extensively [9,10,11].
One remarkable advancement is the emergence of Transformer networks. Originally developed for natural language processing, Transformers have recently shown excellent performance in processing time-series data, including vibration signals [12,13]. The self-attention mechanism of Transformers effectively captures long-range dependencies in vibration data, making them particularly useful for analyzing the complex patterns in the vibration signals of rotating machinery. Several studies have demonstrated that Transformer-based models can perform more accurate fault prediction and condition diagnosis than traditional RNN- or CNN-based models [14,15,16]. These advancements create new possibilities for improving the accuracy and efficiency of PHM systems.
However, effectively applying AI to PHM systems requires large amounts of data to achieve robust feature identification and high learning accuracy [17]. This requirement presents a significant challenge in implementing AI-based PHM systems, where the quantity and quality of data and the development of relevant data-processing techniques are considered essential for advancement. Learning-based methodologies, particularly deep learning, require datasets that are balanced between normal and abnormal data [18]. In plants, for example, abnormal vibration data are relatively scarce compared with normal operation data, leading to imbalanced datasets [19]. Such datasets exhibit what is called a “long-tail distribution”, in which the scarcity of information about abnormal conditions can degrade algorithm training and, in turn, negatively affect model accuracy and reliability in real operational environments.
Thus, addressing data imbalance is a critical consideration in the development of deep learning-based PHM systems. The most intuitive solution is to use generative models to produce data with patterns similar to those of abnormal vibrations [20,21]. This approach effectively extends the abnormal-state portion of the dataset, thereby providing a balanced dataset and improving training efficiency and performance. Data augmentation using generative models is expected to become an essential strategy for strengthening machine learning and deep learning algorithms, especially in scenarios where only limited data are available.
Recently, data augmentation utilizing AI has garnered attention through the use of generative models, such as variational autoencoders (VAEs) and generative adversarial networks (GANs) [22,23]. These techniques have been successfully applied in various domains, including images, time-series data, and text. While VAEs learn latent representations of data to generate new samples, GANs generate high-quality synthetic data through adversarial training between a generator and a discriminator network [24]. GAN-based approaches have shown promising results in augmenting complex time-series patterns, such as vibration data, and can significantly contribute to addressing the data imbalance issue in the PHM field [25]. Recently, GAN techniques that reflect frequency characteristics have also been developed to generate more realistic vibration data, further improving the fault detection performance in PHM systems [26].
The main contributions of this study are as follows:
(1) We propose a novel method for augmenting vibration data using a Transformer-based generative adversarial network (GAN). This approach combines the powerful time-domain feature extraction capabilities of Transformers with the data generation capabilities of GANs.
(2) We apply the multi-resolution short-time Fourier transform (multi-STFT) to effectively capture the frequency characteristics of vibration data and incorporate it into the training of the AI models.
(3) We demonstrate that combining a Transformer-based GAN with a multi-STFT-based loss function enables finer modeling and generation of the time–frequency characteristics of vibration data.
(4) We quantify and visualize the similarity between the data generated by the proposed model and the original data to compare and analyze their differences.
(5) We experimentally verify that the data generated by the proposed model contribute to improving the performance of condition diagnosis for rotating machinery.
This study shows that the Transformer-based GAN can capture the various time-domain and frequency-domain characteristics of vibration data, enabling it to generate more realistic and useful data.
4. Experiment
4.1. Evaluation Metrics
Jensen–Shannon (JS) divergence is one of the methods for measuring the similarity between two probability distributions P and Q [41]. It is defined in terms of the Kullback–Leibler (KL) divergence as follows:
JS(P ∥ Q) = (1/2) KL(P ∥ M) + (1/2) KL(Q ∥ M), where M = (P + Q)/2.
The KL divergence cannot be used as a distance metric, which is an important limitation for its use as an evaluation metric: because it is asymmetric, the resulting value varies depending on which probability distribution is chosen as the reference. Consequently, the KL divergence may not be an ideal performance measure for evaluating generative models. By contrast, the JS divergence, which is built on the KL divergence, is symmetric and always takes a finite value, which makes it readily quantifiable. In this study, we numerically evaluated the similarity between the distributions of the generated data and the real data by using the JS divergence.
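For illustration, the following minimal NumPy sketch shows one way the JS divergence between the empirical distributions of real and generated signals could be computed; the histogram binning, bin count, and variable names are illustrative assumptions rather than the exact evaluation pipeline used in this study.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions p and q."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

# Hypothetical usage: histogram the real and generated signals over shared bins
# and compare the resulting empirical distributions.
real = np.random.randn(10_000)          # placeholder for real vibration samples
fake = 1.1 * np.random.randn(10_000)    # placeholder for generated samples
bins = np.histogram_bin_edges(np.concatenate([real, fake]), bins=100)
p, _ = np.histogram(real, bins=bins)
q, _ = np.histogram(fake, bins=bins)
print(js_divergence(p, q))
```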
4.2. Dataset
We utilized a dataset that represented various types of failures in rotating machinery, the ‘Rotating Machinery Fault Type AI Dataset’ (Ministry of SMEs and Startups, Korea AI Manufacturing Platform (KAMP), KAIST, 23 December 2022). Machinery structures used in power plants operate mostly under normal conditions, and the proportion of normal-type data in this dataset exceeded 90 percent. The dataset was therefore considered suitable for analyzing the characteristics of mechanical failure types, and data corresponding to the various failure types were gathered using the purpose-built rotor testbed shown in Figure 5.
As shown in Table 2, the speed of the rotor testbed could be adjusted from 0 to 3000 rpm. In this study, data were acquired at a rotor speed of approximately 1500 rpm. In the analysis, we utilized four sensors and collected 3,772,385 data points over the course of 140 s. The dataset comprised data corresponding to four types of conditions: normal, mass imbalance, mechanical looseness, and a combination of mass imbalance and mechanical looseness. Mass imbalance was considered to occur when the centers of mass of the rotor and motor were misaligned, while mechanical looseness was considered to occur when the rotor was not properly secured or was tilted.
The sensor used for data acquisition on the rotor testbed was a smart vibration sensor manufactured by Signallink Co., Ltd. (Suwon-City, Republic of Korea), as shown in Figure 6; detailed specifications can be found in Table 3. The sensor collected data from the three bars that supported the rotor disk.
The collected sensor data underwent several preprocessing steps to ensure the consistency and accuracy of the dataset, as sketched below. First, linear interpolation was applied to synchronize the measurement times of sensors with different sampling periods, unifying the time axis and enabling comparison between the sensor data. Next, moving-average filtering was utilized to remove noise from the data while preserving the dynamic characteristics of the signal; this method effectively reduced transient fluctuations while maintaining the key features of the signal. Finally, the sensor data were transformed using min–max normalization to suit the requirements of the machine learning algorithms. This technique scaled the data values between 0 and 1, preventing distortions in the learning performance due to differences in data magnitudes.
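A minimal sketch of this preprocessing chain is given below; the window length, helper name, and the use of NumPy are illustrative assumptions rather than the exact parameters and tooling used for the dataset.

```python
import numpy as np

def preprocess(t_src, x_src, t_common, window=5):
    """Illustrative preprocessing chain: resample onto a common time axis,
    smooth with a moving average, then min-max normalize to [0, 1]."""
    # 1) Linear interpolation onto the shared time axis.
    x = np.interp(t_common, t_src, x_src)
    # 2) Moving-average filtering to suppress transient noise.
    kernel = np.ones(window) / window
    x = np.convolve(x, kernel, mode="same")
    # 3) Min-max normalization to the [0, 1] range.
    return (x - x.min()) / (x.max() - x.min() + 1e-12)
```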
4.3. Implementation
The proposed model was implemented using the PyTorch framework, and experiments were conducted on a computer with the following environment. The generator and discriminator were built on the encoder part of the Transformer architecture, and a 1D convolutional neural network (CNN) structure was used to generate data (in the generator) or to extract features (in the discriminator). Only the final layer of the discriminator included a sigmoid activation for the binary classification of real and fake data.
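A minimal PyTorch sketch of such a Transformer-encoder-based discriminator is shown below; the layer sizes, kernel parameters, and class name are assumptions for illustration, not the exact architecture used in this study.

```python
import torch.nn as nn

class TransformerDiscriminator(nn.Module):
    """Illustrative discriminator: 1D conv feature extractor, Transformer
    encoder, and a sigmoid head for real/fake classification."""
    def __init__(self, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Conv1d(1, d_model, kernel_size=16, stride=4, padding=6)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.head = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, x):                 # x: (batch, 1, signal_length)
        h = self.embed(x)                 # (batch, d_model, reduced_length)
        h = h.transpose(1, 2)             # (batch, reduced_length, d_model)
        h = self.encoder(h)               # self-attention over the sequence
        return self.head(h.mean(dim=1))   # pooled real/fake score in (0, 1)
```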
The training was performed in batches with a batch size of 32, and an additional loss function based on the multi-STFT was used in the generator. Consequently, the generator converged more slowly than the discriminator, which introduced a risk of model collapse.
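A minimal sketch of what such a multi-resolution STFT loss could look like in PyTorch is shown below; the FFT sizes, hop lengths, weighting factor, and the combination of spectral-convergence and log-magnitude terms are assumptions for illustration, not the precise formulation used in this work.

```python
import torch
import torch.nn.functional as F

def stft_mag(x, n_fft, hop, win):
    """Magnitude spectrogram of a batch of 1D signals at one STFT resolution."""
    window = torch.hann_window(win, device=x.device)
    return torch.stft(x, n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True).abs()

def multi_stft_loss(fake, real,
                    resolutions=((512, 128, 512), (1024, 256, 1024), (2048, 512, 2048))):
    """Average spectral loss over several STFT resolutions."""
    loss = 0.0
    for n_fft, hop, win in resolutions:
        X, Y = stft_mag(fake, n_fft, hop, win), stft_mag(real, n_fft, hop, win)
        sc = torch.norm(Y - X) / (torch.norm(Y) + 1e-8)            # spectral convergence
        mag = F.l1_loss(torch.log(X + 1e-8), torch.log(Y + 1e-8))  # log-magnitude distance
        loss = loss + sc + mag
    return loss / len(resolutions)

# Hypothetical generator objective combining the adversarial term with the
# spectral term (lambda_stft is an assumed weighting hyperparameter):
# g_loss = adversarial_loss + lambda_stft * multi_stft_loss(fake_signal, real_signal)
```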
Ablation studies were conducted to compare the performance of the proposed model with those of other models based on the GAN architecture. In total, six experiments were set up, and these experiments differed in terms of the model and use of the proposed multi-STFT loss function. In each of these experiments, training and validation were performed for 50 epochs under the same conditions.
4.4. Model Training Convergence
In this section, we analyze the convergence of the proposed model during the training process.
Figure 7 illustrates the training results over 150 epochs, showing the variations in the loss functions for both the discriminator and the generator. The discriminator’s loss (loss_D) is represented in the left graph of Figure 7. Initially, the loss value decreased rapidly and then declined gradually, indicating a continuous improvement in the discriminator’s performance. Notably, after epoch 20, the loss stabilized, suggesting that the model had reached a convergent state.
Conversely, the generator’s loss (loss_G) is depicted in the right graph of Figure 7. Similar to the discriminator, the generator’s loss showed a significant reduction in the initial epochs, followed by a gradual tapering off. As the epochs progressed, the loss tended to stabilize, indicating that the generator progressively produced more realistic data.
Overall, both loss functions exhibited a decreasing trend, demonstrating that the proposed model converged effectively during training. These findings are a positive indicator of the model’s generative performance and confirm the convergence and stability of the proposed model throughout its training.
4.5. Ablation Study
The proposed model was constructed as a GAN based on the encoder structure of the Transformer architecture, and an additional loss function was introduced to improve the model performance. This additional loss function uses the multi-STFT to capture the frequency characteristics of the data.
To evaluate the performance of the proposed model in terms of these two aspects, the Transformer-based architecture and the additional loss, an ablation study was conducted to verify the degree of improvement of the proposed model relative to other GAN-based models. First, the wGAN (Wasserstein GAN) and the Transformer-based GAN were used to investigate the effects of changes in the model structure on the data generation performance [42]. Additionally, to analyze the changes in performance resulting from the STFT loss function, three conditions were set: no STFT loss function, the single-STFT loss function, and the multi-STFT loss function. A total of six experiments were conducted, and the results of all experiments were analyzed quantitatively in terms of the JS divergence. The results of the experiments are summarized in Table 4.
As presented in Table 4, the average JS divergence in the experiments conducted using the wGAN model was 0.254, whereas that of the proposed Transformer-based GAN was 0.129; that is, the proposed model reduced the average JS divergence to approximately 50.78% of the baseline value. Specifically, in the performance evaluation based on the distribution difference of each model when no STFT loss function was used, the Transformer-based GAN yielded a 53.79% improvement over the wGAN, indicating that the Transformer-based GAN was more suitable for data generation.
In addition, the average JS divergence was 0.162 when using the multi-STFT loss function, 0.223 when no STFT loss was applied, and 0.189 when using the single-STFT loss function; in other words, the multi-STFT loss function reduced the average JS divergence to approximately 72.64% of the value obtained without it. Moreover, these experiments demonstrated that the two methodologies proposed in this study improved the model performance. Therefore, the use of domain-specific transformation techniques in data-driven deep learning models can lead to a higher performance than using raw data alone.
4.6. Validation of Generated Data Through Training
To check whether the vibration data generated using the proposed model were effective as input data for actual model training, we selected a representative deep learning model and conducted an experiment. The selected deep learning model was a 1D-CNN model that consisted of four 1D convolution layers and one fully connected layer [43]. In the first experimental case, only the real collected data were used, without any synthetic data. In the second and third experimental cases, ablation studies were conducted with the synthetic data generated using the wGAN-based model and the Transformer-based GAN model, respectively.
These data were grouped into four classes: normal, imbalance, mechanical looseness, and a combination of imbalance and mechanical looseness. For the quantitative performance evaluation, we used the average accuracy. The results are summarized in Table 5.
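For reference, a minimal PyTorch sketch of the kind of 1D-CNN classifier described above (four convolution blocks and a single fully connected layer) is shown below; the channel sizes, kernel widths, and pooling scheme are illustrative assumptions, not the exact classifier configuration used in the experiment.

```python
import torch.nn as nn

class VibrationCNN(nn.Module):
    """Illustrative 1D-CNN classifier: four conv blocks and one FC layer."""
    def __init__(self, n_classes=4):
        super().__init__()
        chans = [1, 16, 32, 64, 128]
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv1d(c_in, c_out, kernel_size=5, padding=2),
                       nn.ReLU(),
                       nn.MaxPool1d(2)]
        self.features = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool1d(1)           # length-independent pooling
        self.fc = nn.Linear(chans[-1], n_classes)     # single fully connected layer

    def forward(self, x):                             # x: (batch, 1, signal_length)
        h = self.pool(self.features(x)).squeeze(-1)   # (batch, 128)
        return self.fc(h)                             # class logits
```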
According to the results, an accuracy of 75.65% was achieved when the classifier was trained only on the real data. By contrast, in the best-performing case, in which the Transformer-based GAN with the multi-STFT loss function was used to generate the synthetic data, an accuracy of 85.79% was achieved, representing an improvement of more than 10 percentage points. These results indicate that, in scenarios with limited data availability, it is more effective to augment the training set with data generated by the proposed approach and to use the generated data for classification training.
4.7. Application to Real Power Plant Vibration Data
The future challenge of this study is to contribute to intelligent fault diagnosis by augmenting the vibration fault data of the machinery used inside a power plant rather than the data of general rotating machines. Therefore, we collected vibration data of the internal rotating machinery of an operational power plant and used them to train the proposed model and demonstrate its robustness to real data. The accelerometer sensor data were collected from six channels at a high sampling rate of 50,000 Hz, as shown in Table 6. Data were stored as soon as a trigger signal was received, with a pre-trigger time of 1000 ms to capture the relevant context before the trigger event.
Table 7 lists the results of the experiments conducted using the proposed model with and without the STFT loss function.
Although the overall performance was lower than that obtained with the dataset used in the ablation study, the best result, a JS divergence of 0.229, was achieved when using the model with the proposed multi-STFT loss function.
In addition, we used kernel density estimation (KDE) to visualize the data distributions obtained using the two model variants (with and without the proposed loss function) as continuous curves [44]. This step facilitated a straightforward comparison of the differences between the distributions.
To construct this plot, we used a Gaussian kernel to estimate and visualize the density function of each dataset. As illustrated in Figure 8, the first peak of the data generated using the model with the single-STFT loss function was smoothed out compared with that of the real data. However, when using the model with the proposed multi-STFT loss function, the data distribution was similar to that of the original data. Therefore, the proposed multi-STFT loss function performed well even when real power plant data were used for the model training.
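As an illustration of this visualization step, the following minimal sketch uses SciPy’s Gaussian KDE to overlay the estimated densities of real and generated samples; the grid resolution and function names are assumptions, not the exact plotting code used to produce Figure 8.

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

def plot_kde(real, generated, label="generated"):
    """Overlay Gaussian-kernel density estimates of real and generated samples."""
    grid = np.linspace(min(real.min(), generated.min()),
                       max(real.max(), generated.max()), 500)
    plt.plot(grid, gaussian_kde(real)(grid), label="real")
    plt.plot(grid, gaussian_kde(generated)(grid), label=label)
    plt.legend()
    plt.show()
```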
5. Conclusions
In this study, we proposed an approach to generating data for rotating machinery by using a Transformer-based GAN model. The proposed model leverages the frequency-domain characteristics of the data to ensure that the deep learning model can adequately capture frequency features. As a result, the model reduced the average JS divergence to approximately 50.78% of that of the baseline wGAN model, allowing it to generate all types of anomalous and normal data for rotating machinery with a high degree of similarity to real data.
We applied the multi-STFT loss function in our experiments and observed that the data generated by our model improved the classifier accuracy by more than 10 percentage points, indicating that they contributed to an enhanced classifier performance. Furthermore, we validated the robustness and performance of the proposed model by training it with real-world data from a power plant, where it achieved a JS divergence of 0.229, and the similarity of the data distributions was visually represented using a KDE plot.
In the future, we aim to acquire even small amounts of real anomaly data from power plants, generate synthetic anomaly data from them, and evaluate the anomaly diagnosis performance based on these data.