1. Introduction
One of the clean energy sources that has gained the most momentum in recent years is wind energy, whose global installed capacity grew by 53% year-on-year in 2020 to 743 GW, with more than 93 GW of newly installed capacity [1]. Wind energy is generated by wind turbines that convert the wind's kinetic energy into electrical energy. The geometry of a wind turbine blade is designed to spin the rotor hub when positioned at a certain angle of attack with respect to the wind. The torque is transmitted through a mechanical system consisting of shafts, couplings, Rolling Element Bearings (REBs) and gearboxes to an asynchronous machine specifically designed for this system: a doubly fed asynchronous machine.
The study of REB failures from the point of view of vibration analysis is a field extensively researched by the international scientific community and an essential subject in the condition monitoring of rotating machines. Kiral & Karagülle developed a method based on finite element vibration analysis to detect single or multiple faults under the action of an imbalanced force on different components of the REB structure using time- and frequency-domain features [
2]. Sawalhi & Randall conducted an in-depth study on the nature of vibration signals at different stages of rolling elements’ impacts with faults with applications in defect size estimation [
3]. Smith & Randall thoroughly analysed the Case Western Reserve University (CWRU) signals with their benchmark method, based on pre-processing them with Discrete/Random Separation and using the envelope spectrum to identify faults [
4].
In the scope of artificial intelligence [5,6], numerous papers have been published proposing different methods to train models for REB diagnosis. In the field of ML, Dong et al. proposed a method based on SVM and the Markov model to predict the degradation process of REBs [
7]. Pandya et al. extracted features based on Acoustic Emission (AE) analysis with the Hilbert–Huang Transform (HHT) and used them to train a k-NN model [
8]. Piltan et al. proposed a non-linear observer-based technique called Advanced Fuzzy Sliding Mode Observer (AFSMO) to improve the average performance of fault identification with a DT model [
9]. In the field of DL, Pan et al. used the second-generation wavelet transform to improve the robustness of DL-based fault diagnosis [
10]. Peng et al. converted vibration signals from REBs into greyscale images and used them to extract features to serve as inputs to their proposed CNN [
11]. Zhao et al. conducted a DL-based benchmark study evaluating four models: a multilayer perceptron, an auto-encoder, a CNN and a recurrent neural network [
12]. Duong et al. suggested obtaining Defect Signature Wavelet Images (DSWIs) from REBs signals, which show the visualization of the discriminated pattern for different types of faults, to train a CNN [
13].
Nevertheless, the known advantages offered by the cepstral domain for enhancing cyclostationary components in the envelope spectrum have not been sufficiently exploited by the community to extract functional condition indicators with which ML models can be trained to diagnose REB faults automatically. In this paper, we prove the usefulness of Cepstrum Pre-Whitening (CPW) combined with the envelope spectrum to extract frequency-domain features from the CWRU signal database that, combined with time-domain features, have yielded accuracies over 97% for some ML models. The frequency-domain condition indicators are the amplitudes of the impulse responses in the envelope spectrum, taken as the maximum of all impulse-response amplitudes spaced at each fault's characteristic frequency for every signal, and their log ratios: Ball Pass Frequency of the Inner Race (BPFI) Amplitude, Ball Pass Frequency of the Outer Race (BPFO) Amplitude, Ball Spin Frequency (BSF) Amplitude, Log (BPFI Amplitude/BPFO Amplitude), Log (BPFI Amplitude/BSF Amplitude) and Log (BSF Amplitude/BPFO Amplitude). The time-domain indicators are basic statistics (mean, standard deviation, Root Mean Square (RMS) and shape factor), high-order statistics (kurtosis and skewness) and impulsive metrics (peak value, impulse factor, crest factor and clearance factor). In particular, the log ratio between the maximum amplitude of the impulse responses spaced at the BPFI and the maximum amplitude of those spaced at the BPFO in the envelope spectrum has proven to be a crucial feature.
The significance of the aforementioned condition indicators has been assessed by applying two statistical methods: one-way Analysis of Variance (ANOVA) and the Kruskal–Wallis test. The ML models trained with these features are Decision Tree (DT), Support Vector Machine (SVM), k-Nearest Neighbors (k-NN) and Naïve Bayes (NB), implemented with the MATLAB®-powered Statistics and Machine Learning Toolbox 11.7™. This paper therefore focuses on the extraction, analysis and demonstration of feature validity, not on developing application-specific ML models. Their performance has been compared with that of a CNN trained with greyscale images of raw time signals, as this simple method has delivered outstanding results in previous studies. Furthermore, we provide our MATLAB® functions and processed datasets for a proper assessment of the merits of our method.
This manuscript is organised as follows:
Section 2.1 summarises the CWRU’s experimental set-up.
Section 2.2 explains the theoretical basis for frequency- and time-domain feature extraction.
Section 2.3 covers the statistical analysis of these features and concludes by explaining the method for training ML models.
Section 2.4 summarises the process for obtaining greyscale images from the raw CWRU’s time signals.
Section 2.5 summarises the proposed CNN’s architecture and defines its hyper-parameters.
Section 3.1 presents the statistical analysis results and the different ML algorithms’ validation accuracies.
Section 3.2 presents CNN’s validation accuracies.
Section 4 discusses the results, comparing them with some of the works referred to above. Finally,
Section 5 concludes by presenting the main advantages and disadvantages of the proposed method.
2. Materials and Methods
This section covers in depth the feature extraction for the ML and DL models. It also details the method by which the different ML algorithms are trained, as well as the statistical analysis to which the time- and frequency-domain features have been subjected in order to shed light on the importance of each one for the classification problem. Furthermore, the architecture of the proposed CNN and its different hyper-parameters are explained.
2.1. CWRU’s Experimental Setup Overview
The test stand layout is depicted in
Figure 1, and the different REBs geometries and faults’ characteristic frequencies are detailed in [
14].
In summary, the test rig consisted of a two-hp induction motor, a torque transducer and encoder and a dynamometer. Faults were implanted into the different REBs’ parts with an electro-discharge machine. Outer race faults were seeded at different positions relative to the REB load zone, since they are stationary, with the fault’s position having an essential effect on the motor/REB system’s vibration response. Vibration signals were captured with accelerometers. During each test, acceleration was measured at the drive-end REB housing (Drive End Accelerometer (DEA)), the fan end REB housing (Fan End Accelerometer (FEA)), and the motor support base plate (Base Accelerometer (BA)). Data were captured at 12 kHz and 48 kHz sampling rates using a 16-channel data acquisition card.
The faulty REB vibration signals to be classified belong to the drive- and fan-end REB tables [14], accompanied by healthy REB vibration signals. These vibration signals have a commonality: they were all collected at a sampling rate of 12 kHz. Thus, four groups are derived from the dataset: healthy, inner race fault, outer race fault and ball fault. Had the vibration signals collected at 48 kHz been included, they would have formed four additional groups, taking into account the difference between sampling rates; for simplicity, they have not been included in the dataset. As previously mentioned, each file contains FEA, DEA and BA data. Overall, there are 307 vibration signals: 8 corresponding to healthy REBs, 76 to inner race fault REBs, 76 to ball fault REBs and 147 to outer race fault REBs, as summarised in
Table 1.
2.2. Machine Learning Condition Indicators
From a mechanical vibration point of view, each drive train component excites the entire system at its corresponding characteristic frequency in a wind turbine. When an accelerometer is placed at a certain point in the machine, the signal obtained is a sum of the convolutions of the excitations of the different components and the transfer path of these excitations to the measuring element. The latter can be modelled mathematically as a specific impulse response function, as shown by Barszcz in [
15].
REBs are components that transmit the loads coming from the supported shaft. They consist of an outer race, an inner race, the rolling elements and a cage to maintain their relative position. When a spall appears on the surface of one of these parts due to fatigue, the rolling elements repetitively impact it. These impacts excite the system in the form of repetitive impulses at the corresponding fault’s characteristic frequency, a function of the shaft rotation frequency, the REB geometry, the number of rolling elements and the load angle, as shown in
Table 2.
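The characteristic frequencies in Table 2 follow the standard bearing kinematic formulas. As an illustrative sketch (in Python rather than the paper's MATLAB; the function name and the toy geometry below are our assumptions, not the CWRU bearing values), the four frequencies can be computed from the shaft speed and REB geometry:

```python
import math

def bearing_fault_frequencies(fr, n_balls, d, D, phi=0.0):
    """Standard REB fault frequencies from shaft speed fr (Hz), number of
    rolling elements, ball diameter d, pitch diameter D and load angle phi."""
    r = d / D * math.cos(phi)
    return {
        "BPFO": n_balls / 2 * fr * (1 - r),      # outer race
        "BPFI": n_balls / 2 * fr * (1 + r),      # inner race
        "BSF":  D / (2 * d) * fr * (1 - r * r),  # rolling element (ball spin)
        "FTF":  fr / 2 * (1 - r),                # cage (fundamental train)
    }

# toy geometry, not the CWRU bearings: 8 balls, d/D = 0.2, shaft at 10 Hz
f = bearing_fault_frequencies(fr=10.0, n_balls=8, d=1.0, D=5.0)
```

Note that BPFO and BPFI always sum to the number of rolling elements times the shaft frequency, a useful sanity check on any implementation.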
By computing the fast Fourier transform [16] of a time-domain signal captured by an accelerometer, the harmonics of such an excitation will be observed in a specific bandwidth, where the impulse's resonant frequency will be the fault's characteristic frequency. Usually, the fault information will be masked in a complex signal such as that of a wind turbine's drive train.
The envelope spectrum is a widely used tool for diagnosing REB faults in an early stage of development and limited size. Information about the different excitation sources is extracted from the spacing between impulse responses and not from excited frequencies. Therefore, the envelope spectrum consists of a series of impulse responses in the frequency domain spaced at the characteristic frequency of the different excitation sources, as shown in
Figure 2. When obtaining the envelope spectrum of a time-domain signal for diagnosing REB faults, it is essential to demodulate the signal in the bandwidth where the fault information is present. For this purpose, a tool called the Kurtogram can be used.
As mentioned above, the rolling elements’ impacts on a spall excite the mechanical system in the form of repetitive impulses at the corresponding fault’s characteristic frequency. These impulses take the form of peaks in a time signal. Spectral Kurtosis (SK) is a remarkably sensitive indicator of a signal’s peakedness. SK is a statistical method to detect non-Gaussian components in a signal, i.e., very impulsive components. Randall and Antoni demonstrated the usefulness of SK in detecting faults in rotating machines [
17]. The practical application of SK is the Kurtogram proposed by Antoni and Randall in the same paper, a 2D colour map representing the SK at different levels and bandwidths of a signal. The Kurtogram returns the signal’s most impulsive bandwidth and its centre frequency. Later, Antoni proposed an optimised version of the Kurtogram called the Fast Kurtogram [
18]. This version reduces the number of variants for which the filter parameters are calculated without affecting the result’s accuracy by applying the filter bank approach. The main drawback of SK is that it is susceptible to non-Gaussian random components external to the mechanical system, such as noise or interference of various sorts.
Sawalhi & Randall’s CPW [
19] can be applied to a raw time signal to enhance the presence of REB fault impulse responses in the envelope spectrum and to remove the harmonics and sidebands of unwanted components. Earlier in this section, REB fault signals were referred to as repetitive rather than periodic because they are not strictly periodic: they are second-order cyclostationary. This means that their second-order statistic, the variance, is periodic, even though the impulses themselves are not. The repetitive impulses of REB failures are not precisely periodic due to the slightly random location of the rolling elements within the cage's free space and the cage's inexact rotational speed caused by slippage. This is why the REB fault information is not altered by applying CPW: the cyclostationary components of a signal do not produce significant peaks in the cepstral domain. A detailed explanation of CPW can be found in the paper by Borghesani et al. [
20].
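A minimal sketch of the idea behind CPW, under its usual formulation of dividing the spectrum by its own magnitude while keeping the phase (pure Python with a naive O(N²) DFT for illustration only; practical implementations use FFTs, and the helper names are our assumptions):

```python
import cmath
import math

def dft(x):
    # naive O(N^2) discrete Fourier transform, enough for a short example
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def cepstrum_prewhiten(x, eps=1e-12):
    """Flatten every spectral magnitude to ~1 while keeping the phase, where
    the second-order cyclostationary fault content survives."""
    X = dft(x)
    return idft([c / (abs(c) + eps) for c in X])

# a strong 3-cycle sinusoid dominates the raw spectrum...
x = [math.sin(2 * math.pi * 3 * n / 32) for n in range(32)]
y = cepstrum_prewhiten(x)
# ...but after whitening its spectral line is reduced to unit magnitude
```

This is exactly why CPW suppresses dominant periodic components (gear-mesh harmonics, shaft orders) while leaving the spacing of the cyclostationary impulse responses intact.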
In summary, when a signal is demodulated correctly, a series of impulse responses will appear in the envelope spectrum spaced at the fault’s characteristic frequency. When a fault affects a specific REB part, the impacts with the fault will excite a particular frequency band in the form of repetitive impulses. Suppose this band (including the transfer path from the place of impact to the sensor) is selected for demodulation. In that case, the impulse responses in the envelope spectrum spaced at the actual fault’s characteristic frequency will have a greater amplitude than those spaced at other fault’s characteristic frequencies, as depicted in
Figure 2. Applying CPW to a raw time signal enhances the cyclostationary components' responses by filtering the periodic components' harmonics and sidebands out of the envelope spectrum. Furthermore, it sets all frequency components to the same magnitude (around
for the CWRU’s signals), as shown in
Figure 3.
The above defines a health-state condition indicator that can be calculated using MATLAB®. The idea is to set up a function [21,22] that computes and stores the amplitudes of the impulse responses in the envelope spectrum, taken as the maximum of all impulse-response amplitudes spaced at each fault's characteristic frequency for every signal (BPFI Amplitude, BPFO Amplitude and BSF Amplitude). Then, by calculating the ratio between the different amplitudes, a quantified relationship between each magnitude and the others is obtained, reflecting their proportion in each case. The logarithm of these ratios is then computed to set them on the same scale. In statistical jargon, this is known as the log ratio of two values.
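A hypothetical sketch of such a function (in Python rather than the paper's MATLAB, run on a synthetic envelope spectrum; the helper name, search tolerance and harmonic count are illustrative assumptions, not the paper's implementation):

```python
import math

def max_amplitude_at(freqs, spectrum, char_freq, n_harmonics=3, tol=1.0):
    """Maximum envelope-spectrum amplitude among the impulse responses spaced
    at a fault's characteristic frequency (first n_harmonics harmonics)."""
    best = 0.0
    for h in range(1, n_harmonics + 1):
        target = h * char_freq
        # amplitudes within +/- tol Hz of the h-th harmonic
        window = [a for f, a in zip(freqs, spectrum) if abs(f - target) <= tol]
        if window:
            best = max(best, max(window))
    return best

# synthetic envelope spectrum: harmonics spaced at 162 Hz over a flat floor
freqs = [0.5 * k for k in range(4000)]          # 0..2000 Hz grid
spectrum = [0.01] * len(freqs)
for h in (1, 2, 3):
    spectrum[int(h * 162 / 0.5)] = 1.0 / h      # decaying BPFI-like harmonics

bpfi_amp = max_amplitude_at(freqs, spectrum, 162.2)   # matches the peak train
bpfo_amp = max_amplitude_at(freqs, spectrum, 107.4)   # only the noise floor
log_bpfi_bpfo = math.log10(bpfi_amp / bpfo_amp)       # Log(BPFI/BPFO) feature
```

In this toy example the inner-race peak train yields a large positive Log(BPFI Amplitude/BPFO Amplitude), which is precisely why that ratio discriminates between fault locations.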
The combination of time- and frequency-domain features has yielded outstanding results, such as in the research by Sánchez, R.-V. et al. [
23]. Therefore, the former condition indicators have been combined with time-domain features such as basic statistics (mean, standard deviation, RMS and shape factor), high-order statistics (kurtosis and skewness) and impulsive metrics (peak value, impulse factor, crest factor and clearance factor). These have been obtained with the MATLAB®-powered Predictive Maintenance Toolbox 2.2™, ultimately leading to the MATLAB® function shown in [24,25]. Both condition indicator sets have been ranked according to one-way ANOVA [26] and the Kruskal–Wallis test [27] and used to train four widely renowned ML models (DT, SVM, k-NN and NB).
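A hedged sketch of these ten time-domain indicators (pure Python with population-style moments; the toolbox's definitions may normalise slightly differently, e.g. for skewness and kurtosis):

```python
import math

def time_domain_features(x):
    """The ten time-domain condition indicators used alongside the
    frequency-domain features (illustrative definitions)."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    rms = math.sqrt(sum(v * v for v in x) / n)
    peak = max(abs(v) for v in x)
    mean_abs = sum(abs(v) for v in x) / n
    mean_sqrt_abs = sum(math.sqrt(abs(v)) for v in x) / n
    return {
        "mean": mean,
        "std": std,
        "rms": rms,
        "shape_factor": rms / mean_abs,
        "kurtosis": sum((v - mean) ** 4 for v in x) / (n * std ** 4),
        "skewness": sum((v - mean) ** 3 for v in x) / (n * std ** 3),
        "peak": peak,
        "impulse_factor": peak / mean_abs,
        "crest_factor": peak / rms,
        "clearance_factor": peak / mean_sqrt_abs ** 2,
    }

feats = time_domain_features([1.0, -1.0, 1.0, -1.0])
```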
2.3. Machine Learning-Based Classification Method & Statistical Analysis
The most important step when approaching an ML classification problem is selecting the feature combination that achieves the most accurate results possible. This is of the utmost importance when the dataset's size is limited. The most natural questions to ask once the condition indicators have been calculated are: how meaningful are these features to the classification problem? Are any of them better than the rest? Two ranking techniques for datasets containing more than two classes will be applied to answer these questions: one-way ANOVA [26] and the Kruskal–Wallis test [27].
Comparing the means of three or more unrelated groups within a dependent variable (feature), one-way ANOVA determines whether there are statistically significant differences between them. A non-parametric alternative to one-way ANOVA is the Kruskal–Wallis test. Unlike one-way ANOVA, where means are compared, the Kruskal–Wallis test contrasts the samples' distributions to determine whether they belong to the same population. Both tests make assumptions that the data of each group in every dependent variable must meet. The Kruskal–Wallis test is typically used when one-way ANOVA's assumptions are not met for each class within every dependent variable, since ANOVA's assumptions are more stringent than those of the Kruskal–Wallis test.
One-way ANOVA tests the null hypothesis that all group means are equal (H0: μ1 = μ2 = … = μJ) against the alternative hypothesis that at least one group mean differs from the others (Ha: μi ≠ μj for at least one pair i and j) in a one-way layout, where i = 1, …, I is the observation number and j = 1, …, J is the group number. The one-way ANOVA result is F, the ratio of across-group variation to within-group variation. If F is larger than the critical value of the F-distribution [28] with (J − 1, I − J) degrees of freedom at a significance level α, the null hypothesis is rejected. Therefore, large values of F rank better if the degree of difference between the groups is considered when evaluating whether the dependent variable is suitable for a classification problem.
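The F statistic described above can be sketched as follows (illustrative Python; the paper uses MATLAB's Statistics and Machine Learning Toolbox for the actual ranking):

```python
def one_way_anova_f(groups):
    """F statistic for a one-way layout: across-group variation divided by
    within-group variation, with (J - 1, I - J) degrees of freedom."""
    I = sum(len(g) for g in groups)              # total observations
    J = len(groups)                              # number of groups
    grand_mean = sum(sum(g) for g in groups) / I
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((v - m) ** 2 for v in g) for g, m in zip(groups, means))
    return (ss_between / (J - 1)) / (ss_within / (I - J))

# two well-separated groups give a large F, ranking the feature highly
F = one_way_anova_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
```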
In the Kruskal–Wallis test, the null hypothesis states that the J groups, drawn from potentially different populations, actually derive from a similar population, at least regarding their central tendencies or medians. The alternative hypothesis is that not all groups derive from the same population. The Kruskal–Wallis test result is the test statistic H. If H is larger than the critical value of the chi-square distribution [29] with J − 1 degrees of freedom at a significance level α, the null hypothesis is rejected. Therefore, large values of H rank better if the degree of difference between the groups is to be considered.
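Analogously, the H statistic can be sketched as follows (illustrative Python, without the tie correction that full implementations apply):

```python
def kruskal_wallis_h(groups):
    """H statistic from pooled ranks (no tie correction in this sketch)."""
    pooled = sorted((v, j) for j, g in enumerate(groups) for v in g)
    N = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (_, j) in enumerate(pooled, start=1):
        rank_sums[j] += rank
    return 12.0 / (N * (N + 1)) * sum(
        rs * rs / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (N + 1)

# two well-separated groups yield a large H
H = kruskal_wallis_h([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
```

Because H is computed from ranks rather than raw values, it only assumes that the groups share a continuous distribution shape, which is why it tolerates non-normal features.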
For the one-way ANOVA results to be reliable, the residuals of the I observations belonging to the J groups need to meet the following assumptions:
They have to be normally distributed in each group being compared. In practice, the dependent variable is tested to be normally distributed in each group rather than the residuals since the results are the same.
There is homogeneity of variances (homoscedasticity) between each group of residuals. Again, the population variances in each group are tested to be equal rather than the residuals’ variances in each group.
Independence. The residuals (or rather the observations) need to be independent.
One-way ANOVA is known to be a robust test against violations of the normality assumption, especially for large datasets. When the homoscedasticity condition is violated, the Welch-corrected ANOVA test has proved reliable. The most critical assumption to fail is stated to be the lack of independence between observations. For the results of the Kruskal–Wallis test to be reliable, the observations within each group need to meet the following assumptions:
They do not have to be normally distributed in each group. However, the observations within each group have to belong to the same continuous distribution.
There is homogeneity of variances (homoscedasticity) between each group of observations.
Independence of observations.
The use of the Kruskal–Wallis test is recommended when the populations to be compared are clearly asymmetric, provided that the asymmetry is in the same direction for all of them and that the variances are homogeneous. To assess whether the data within every group and feature are suitable for the results of one-way ANOVA or the Kruskal–Wallis test to be reliable, several hypothesis tests will be applied using the MATLAB®-powered Statistics and Machine Learning Toolbox 11.7™. These are the Anderson–Darling test [30] to test the normality assumption, the Levene test [31] to test the homoscedasticity condition and the Kolmogorov–Smirnov test [32] to test the equality of continuous distributions condition.
Once the features are ranked, the procedure to investigate which features deliver the highest accuracy for every ML algorithm is as follows: each ML algorithm is first computed with a feature vector of dimension 1, and the vector's dimensionality is then increased until it reaches its maximum, following the order of importance established by each statistical test. Two tables will therefore be produced, one per statistical test, each increasing the feature vector's dimensionality in a different order. This way, the best results can be achieved with the fewest features possible, while revealing which features are valid and which are not for every algorithm.
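The incremental procedure can be sketched as a simple loop (illustrative Python; `train_and_score` is a placeholder standing in for training and cross-validating one of the ML models on a feature subset):

```python
def incremental_accuracy(ranked_features, train_and_score):
    """Grow the feature vector one ranked feature at a time and record the
    validation accuracy obtained at each dimensionality."""
    results = []
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]             # top-k features per the ranking
        results.append((subset, train_and_score(subset)))
    return results

# dummy scorer standing in for training + cross-validating an ML model
ranked = ["log_bpfi_bpfo", "kurtosis", "rms"]
table = incremental_accuracy(ranked, lambda feats: len(feats) / 10)
```

Running this once per statistical ranking produces exactly the two accuracy-versus-dimensionality tables described above.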
2.4. Deep Learning Condition Indicators Overview
In addition, a CNN has been built to classify REBs signals from a DL point of view. CNNs are a type of feed-forward neural network initially designed for image processing. Therefore, the CWRU’s healthy and faulty REBs signals must be transformed into images, with every pixel within an image being a single feature. CNNs are known to deliver outstanding results when the training dataset size is considerably large [
33]. This paper will test its performance using a 14,736-image database generated from the CWRU's raw time signals with the code shown in [
34,
35].
To transform a 1D time-domain signal into a 2D image to serve as an input to the CNN, N = m × n data points are split off from a signal L(t). The data are aligned sequentially, shaping m rows of n points, and a matrix is built by stacking the m rows one on top of the next. To create square images of acceptable size while preserving as much information as possible about the defect, each 120,000-sample signal section has been transformed into 48 images of 50 × 50 pixels. The REB signals sampled at 12 kHz contain approximately 120,000 data points, given that each reading took about 10 s. Suppose the test stand's shaft rotational speed is 1730 rpm. In that case, a single rolling element will impact the fault approximately 29 times every second, giving a total of about 290 impacts of each rolling element with the fault during these ten seconds (this pattern is not valid if the defect is on the rolling element). In mathematical terms, the transformation described above is

P[j, k] = L((j − 1) · n + k),  j = 1, …, m,  k = 1, …, n,

where P denotes the signal image and L(t) the vibration datum at time t. In this particular case, m = 50 and n = 50.
Some signal images are depicted in
Figure 4:
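The reshaping described in Section 2.4 can be sketched as follows (illustrative Python; nested lists stand in for the greyscale pixel matrices, and the function name is our assumption):

```python
def signal_to_images(signal, m=50, n=50):
    """Cut a 1-D vibration signal into consecutive blocks of m*n samples and
    reshape each block into an m-row by n-column 'image' (nested lists here)."""
    pixels = m * n
    images = []
    for start in range(0, len(signal) - pixels + 1, pixels):
        block = signal[start:start + pixels]
        images.append([block[r * n:(r + 1) * n] for r in range(m)])
    return images

# a 120,000-sample signal yields 48 images of 50 x 50 pixels
signal = [float(t) for t in range(120_000)]
images = signal_to_images(signal)
```

The 307 signals of Table 1 processed this way account for the 14,736 images mentioned above (307 × 48).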
2.5. CNN Architecture and Hyper-Parameters Overview
The architecture of the proposed CNN model, shown in [36,37], comprises one input layer responsible for receiving external data, two hidden layers accountable for filtering the inputs, a fully connected layer responsible for the classification and an output layer. Each hidden layer comprises a convolutional layer, a batch normalisation layer, an activation layer and a max-pooling layer. The convolutional layer convolves the local input regions with filter kernels and then generates the output features by computing the activation function. The ReLU function is used as the activation layer to sort out the issue of vanishing gradients. The batch normalisation layer, placed between the convolutional layer and the activation layer, helps to reduce the CNN's sensitivity to network initialisation. Finally, a max-pooling layer is placed between the first activation layer and the second convolutional layer. This is known to improve the CNN's accuracy. The fully connected layer comprises as many neurons as there are labels (four, in this case). The softmax function is used as the activation layer for the latter.
Every ML algorithm aims to minimise a loss/cost function or maximise a likelihood function, depending on the model, to find its optimal values and achieve the most accurate prediction. In a CNN, every unit within a convolutional layer is a small regression model trained by learning the filter kernel's parameters. The loss function, in this case, is defined by the task. For image classification, the cross-entropy loss function is used to compute the difference between the softmax output probability distribution and the label probability distribution. Stochastic Gradient Descent with Momentum (SGDM) is used to minimise the cross-entropy loss function; it is an upgraded version of stochastic gradient descent that accelerates gradient vectors in the right direction. The hyper-parameters of the proposed CNN are shown in
Table 3. Furthermore, the training options for the SGDM are set to a learning rate of 0.01, 15 epochs, a mini-batch size of 35 and a network validation frequency of 50 iterations. These parameters have been tuned by trial and error.
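For image classification, the cross-entropy loss on a softmax output can be sketched as follows (illustrative Python computing the loss for a single four-class prediction only, not the full CNN training loop):

```python
import math

def softmax(z):
    """Map raw network outputs to a probability distribution."""
    shifted = [v - max(z) for v in z]        # shift for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [v / total for v in exps]

def cross_entropy(probs, label_index):
    """Loss between the predicted distribution and a one-hot label."""
    return -math.log(probs[label_index])

# four classes: healthy / inner race fault / outer race fault / ball fault
probs = softmax([2.0, 1.0, 0.1, -1.0])
loss = cross_entropy(probs, 0)               # true label: healthy
```

During training, SGDM updates the filter-kernel parameters in the direction that reduces this loss averaged over each mini-batch.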
4. Discussion
This paper has trained various ML algorithms and a CNN to classify healthy and faulty REBs. Each model’s best results are summarised in
Figure 12. The different models have been trained using the CWRU bearing data center signals. These models are prepared to receive features of never-before-seen REBs’ signals to diagnose their state of health, with an accuracy corresponding to the validation result obtained for each model. Considering the results obtained, the following can be stated:
Although the frequency-domain features are ranked better than the time-domain features according to the applied statistical methods, the accuracy development of some models shows that certain frequency-domain features contribute to worse signal classification than some time-domain features.
The logarithmic ratio between the maximum amplitude of the impulse responses spaced at the BPFI and the maximum amplitude of the impulse responses spaced at the BPFO is the feature containing the most relevant information about REBs’ health status. That follows from the high accuracy obtained by computing the algorithms with this feature alone. According to the statistical methods, it is also the best-positioned feature in the rankings.
The combination of time- and frequency-domain features yields better results than either type computed separately. This can be deduced from the fact that the best results combine both types of features. To design a robust ML model that generalises correctly, the most important task is to find the feature combination providing the most relevant information about the REBs' health status.
Table 13 compares the results of some relevant papers cited in the introduction with those obtained in this paper. The Root Mean Square Error (RMSE) of the model proposed by Dong et al. suggests an almost perfect REB degradation prediction compared to the data obtained in their experiment.
The results of ML models trained with features based on AE analysis with the HHT, proposed by Pandya et al., demonstrate the usefulness of these physical features. The observer-based non-linear models of Piltan et al. have helped improve the accuracy of the DT model. However, these authors divide the CWRU dataset into numerous groups, considering fault sizes and different load regimes. The latter work overlooks the fact that the load is practically meaningless in this case, as there is no mechanism in the CWRU experiment that converts the torque into a radial load supported by the REBs. Furthermore, the usefulness of these non-physical models with more heterogeneous groups is unclear. Pan et al. obtained excellent results in their novel neural network enhanced with the second-generation wavelet transform. Peng et al. show that the simple method of converting raw time signals into greyscale images to train a CNN provides outstanding results. Once again, the CWRU dataset is subdivided into numerous groups in this work, considering different fault sizes and load regimes. Finally, the model proposed by Duong et al. demonstrates the usefulness of DSWI obtained from faulty REBs’ signals to train CNNs.
This work evaluated the applicability of physical methods, such as the CPW-enhanced envelope spectrum, for the ML-based classification problem. Feature extraction from this method has yielded excellent results demonstrating its usefulness in combination with time-domain features. These results have been superior to those of the CNN trained with greyscale images of raw time signals under the same conditions. The results obtained in this paper are to be understood as a complement to the study of relevant features for diagnosing REBs based on ML.
5. Conclusions
In general terms, excellent accuracies have been obtained from the ML models. The statistical tests’ results and the different algorithms’ accuracy have shown that the frequency-domain features, particularly the logarithmic ratios between the maximum amplitudes of the envelope spectrum around the pulses spaced at the faults’ characteristic frequency, have been the most important ones. These, in addition to some of the most critical time-domain features, have been able to yield the highest accuracies in this paper. Among the models, the k-NN and SVM classifiers were the best performers. The accuracy of these algorithms is given in
Table 8,
Table 9,
Table 10 and
Table 11. It can also be seen from
Figure 10 and
Figure 11 that signals corresponding to REBs with faulty balls have been the most difficult for the algorithms to classify. The CNN’s accuracy can be improved by increasing the dataset’s size, as mentioned in
Section 4. These models can be potentially helpful for automatically diagnosing the health status of REBs.
The main drawback of the proposed method is that the database on which the study has been conducted is imbalanced, as seen in
Table 1. This means that the classification result may be biased, as the classifiers are more sensitive to the majority class and less sensitive to the minority class. A common technique to counteract the imbalanced data problem is to oversample the dataset. Nevertheless, oversampling before cross-validation is known to lead to over-optimistic results, for which conventional model evaluation metrics such as accuracy are not good indicators of the algorithms' performance. In industry, faults in REBs are typically not evenly distributed, with malfunctions such as those affecting inner and outer races occurring more frequently than others. Therefore, dealing with imbalanced databases is a common challenge. The best practice to address this issue is to apply cross-validation while stratifying each k-fold to capture the imbalanced distribution of groups in each target feature. In this paper, as discussed, conventional five-fold cross-validation has been implemented. This is an essential point of improvement for future research.