1. Introduction
Bearings are easily damaged parts in rotating machinery, and approximately 50% of motor faults are bearing related [1,2]. Machinery running noise is a type of mechanical wave that carries a wealth of information about machine status and propagates energy to the surrounding environment through vibration [3,4]. Both noise and vibration are caused by the elastic deformations of the rotor; therefore, the machinery running noise is as good an indicator as the vibration signal [3,5]. Compared with vibration diagnostics, noise diagnostics offer non-contact measurement, convenient sensor installation, no influence on machinery operation, and online monitoring. Noise diagnostics are especially suitable for situations where the vibration signal is not easy to measure [4]. This paper studies a rotating machinery fault diagnosis method based on noise signals.
Rotating machinery noise diagnosis assesses machinery working conditions by monitoring the elastic waves induced by deformations, exfoliations, or cracks. Fault diagnosis can be regarded as a pattern recognition problem; artificial intelligence (AI) has attracted great attention and shows promise in rotating machinery fault recognition applications [6]. Rotating machinery fault diagnosis based on AI includes sensing, data acquisition, feature extraction, dimensionality reduction, and fault classification. Among these, feature extraction and dimensionality reduction are the most critical steps in the workflow [7]. They determine the upper limit of the fault identification accuracy of the subsequent classification algorithm. Too much redundant information in high-dimensional feature vectors may lead to the curse of dimensionality and increased calculation time. The principle of selection is to try not to miss a feature that may be useful, but also not to include too many features. To extract the features, many signal processing methods have been used in the area of rotating machine health monitoring and diagnosis, such as time-domain and frequency-domain feature parameter processing [8,9,10], the discrete wavelet transform (DWT) [11], empirical mode decomposition (EMD) [12], time-frequency analysis (TFA) [13], the Mel-frequency cepstrum (MFC) [14], and Shannon entropy [15]. Among them, Shannon entropy features have been widely used in machine health monitoring recently. For example, the instantaneous energy distribution-permutation entropy (IED-PE) [16], the improved multiscale dispersion entropy (IMDE) [17], the composite multi-scale weighted permutation entropy (CMWPE) [18], the stationary wavelet packet Fourier entropy (SWPFE) [19], and similarity-fuzzy entropy [20] have been proposed to construct sensitive features for rolling bearing health monitoring. However, the construction of good sensitive features requires manual experience, which is known as the feature engineering problem. With the application of deep learning, some feature self-encoding methods have been adopted [21]. However, the difficulty of deep learning is how to evaluate the contribution of representation learning to the final system output. At present, a more effective method is to use the final output layer for predictive learning and the other layers for representation learning.
Feature selection is the selection of an effective subset of the original feature set, so that the model trained on this feature subset has the highest accuracy. A direct feature selection algorithm is a subset search algorithm, and a commonly used method is to adopt a greedy strategy, such as forward search or backward search. Subset search algorithms are divided into two types: filter and wrapper. The filter method is a feature selection method that does not depend on a specific machine learning model, while the wrapper method uses the accuracy of the subsequent machine learning model as the feature selection criterion. Another kind of feature learning is feature extraction, which projects the original features into a new space to obtain a new feature representation, as in principal component analysis (PCA) and the auto-encoder. Among existing feature selection and feature extraction algorithms, PCA transforms the original data into linearly independent data via a linear transformation and can be used to extract the main feature components of the data [22]. PCA expands features in the direction in which the covariance is largest, so the obtained low-dimensional features have no corresponding physical meaning. Chen B. et al. achieved selection and dimensionality reduction of intrinsic mode function (IMF) components of motor bearings via the distance evaluation technique (DET) and used the dimensionality-reduced feature vectors as input to a support vector machine (SVM) [23]. Lei et al. proposed the compensative distance evaluation technique (CDET) with enhanced dimensionality reduction performance and applied it to feature dimensionality reduction of bearing vibration signals [24]. CDET selects the features that have the smallest distance within a cluster and the largest distance between clusters. PCA, DET, and CDET do not consider the characteristics of the classification network. Melih Kuncan et al. proposed a feature extraction method for bearing fault classification based on one-dimensional ternary patterns (1D-TP) obtained from comparisons between neighbors of each value of the vibration signal [25]. To address the problems of variable redundancy and model complexity in prediction models, Xu et al. combined a neural network and the mean impact value (MIV) for wind power prediction [26]. In addition, methods based on decision trees or GBDT for feature extraction or dimensionality reduction have been used in machinery diagnostics. Madhusudana et al. used the decision tree technique to select prominent features out of all extracted features [27]. Li et al. proposed a wrapper feature selection algorithm based on XGBoost, which used the importance measure of XGBoost as a feature subset search heuristic and was verified on 8 data sets [28]. Aiming at the problem of variable working conditions of rotating equipment, Wu et al. proposed a deep autoencoder feature learning method and applied it to fault diagnosis of rotating equipment [29].
In terms of feature classification, neural networks [30,31] and SVMs [32,33] have been widely applied in machinery diagnosis. Han et al. compared the performance of random forests, artificial neural networks, and SVM methods in the intelligent diagnosis of rotating equipment [34]. Hu et al. utilized the wavelet packet transform and SVM ensemble technology for fault diagnosis [35]. Liu et al. proposed a genetic algorithm (GA) based self-adaptive resonance demodulation technique [36]. Zhu et al. proposed a fault diagnosis method based on an SVM optimized by the GA [37]. Han et al. combined EMD, a particle swarm optimization SVM (PSO-SVM), and fractal box dimensions for gear fault feature extraction and fault classification [38]. Heuristic search methods, such as the GA, simulated annealing [39], and tabu search [40], have also been applied in feature classification. In addition, ensemble learning and deep neural networks are widely used in fault diagnosis [41]. Zhou et al. proposed a novel bearing diagnosis method based on ensemble empirical mode decomposition (EEMD) and weighted PE and further enhanced the classification accuracy by a mixed voting strategy and a similarity criterion [42]. Aiming at the problem of big data analysis, Wu et al. proposed a two-stage big data analytics framework and achieved a high level of classification accuracy [43].
Conventional rotating machinery diagnosis algorithms ignore the complementarity of the feature selection algorithm and the classification network when selecting features. To this end, this paper proposes an end-to-end feature selection and diagnosis method that organically unifies feature representation learning and machine prediction learning in one model. The method realizes a compromise between the two types of algorithms and applies it to machinery state classification. First, based on a modified MIV algorithm (MIVs), our approach not only selects noise-signal features according to the contributions of the independent variables to the classification network but also solves the randomness problem of the MIV values. By eliminating features that have little influence on the classification, this step realizes primary feature selection oriented to the classification network. Second, in order to characterize the discriminative ability of the features themselves, a between-class sorting WBDA criterion is introduced to quantify intra-class and inter-class aggregation, and a feature diversity selection strategy is proposed to prevent the selected features from all coming from the same feature class simply because their WBDA values are large. Experimental results show that this feature diversity selection strategy can effectively improve the accuracy of the algorithm. Thus, secondary selection of features is achieved through feature separability. Since there are few faulty data in industrial applications, it is hoped that the diagnosis algorithm can run online. The classification network uses an SVM to compute the actual classification accuracy and avoids local optimal solutions through the Monte Carlo method. This paper compares the proposed algorithm with the MIV algorithm for network variable selection, the variable selection algorithm CDET, and the dimensionality reduction algorithm PCA. After selecting features of the same dimension, the proposed algorithm is found to have better classification accuracy than the other methods, which verifies its superiority.
This paper is organized as follows. Section 1 introduces the background, motivation, and a brief literature review of feature learning and feature classification. Section 2 constructs the machinery noise feature set, which is used for testing in Section 6. In Section 3, a bearing noise diagnosis algorithm based on network variable selection and WBDA, named MIVs-WBDA, is proposed. Since feature classification is achieved by an SVM, Section 4 introduces two classifier parameter optimization algorithms for the SVM: the PSO algorithm and the GA. Section 5 summarizes the procedures of the MIVs-WBDA. Section 6 describes the simulation testing. Finally, Section 7 presents our conclusions and some further remarks.
2. Feature Extraction
In practical applications, it is difficult to determine in advance which features are the key ones, and classifiers based on different features may have significantly different performance. For the application considered in this paper, in order to verify whether the proposed feature selection algorithm can select the most suitable features from an undetermined feature set, a large number of features used in the previous literature are constructed as the candidate feature set. These features form a feature pool. As a test, a total of 31 features were constructed in this article, divided into 6 classes, as shown in Figure 1.
2.1. Traditional Time Domain Feature Set
Traditional time-domain and statistical features are a powerful tool for characterizing the change of bearing vibration signals when faults occur [44]. Time-domain characteristics can be obtained directly from the monitoring signal and reflect the change of energy amplitude on the time scale of the signal; they are common indices suitable for rapid diagnosis. This paper uses the 11 features shown in Table 1. Herein, $x_i$ refers to the $i$-th measurement of the time-domain signal, $X_i$ refers to the $i$-th frequency-domain value based on the short-time Fourier transform (STFT), and $\tilde{x}_i$ refers to the $i$-th value of the signal sorted in ascending order, where $N$ is an even number. The subscript $i$ takes values from 1 to $N$. $F_j$ refers to the $j$-th feature of the signal, $\bar{x}$ is the mean of signal $x$, and $\sigma^2$ is its variance. These features are calculated for every short-time frame of the bearing noise signal.
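As an illustration of how such frame-level statistics might be computed, the following Python sketch calculates a few representative time-domain features; the exact 11 features and their definitions are those given in Table 1, so the quantities chosen here are assumptions for demonstration only.

```python
import numpy as np

def time_domain_features(x):
    """Compute a few representative time-domain statistics for one
    short-time frame x (a 1-D array). Illustrative only: the exact
    11 features used in the paper are those defined in Table 1."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()                                   # mean value
    std = x.std()                                     # standard deviation
    rms = np.sqrt(np.mean(x ** 2))                    # root mean square
    peak = np.max(np.abs(x))                          # peak amplitude
    skewness = np.mean((x - mean) ** 3) / std ** 3    # skewness
    kurtosis = np.mean((x - mean) ** 4) / std ** 4    # kurtosis
    crest_factor = peak / rms                         # peak / RMS
    impulse_factor = peak / np.mean(np.abs(x))        # peak / mean absolute value
    return np.array([mean, std, rms, peak, skewness,
                     kurtosis, crest_factor, impulse_factor])
```

In practice such a function would be applied frame by frame to the segmented noise signal, producing one feature row per frame.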
2.2. Empirical Mode Decomposition Energy Entropy
Features 12 to 17 are empirical mode decomposition (EMD) energy entropies. EMD is a signal analysis method proposed by Huang in 1998 [45]. It is an adaptive data processing and mining method that is well suited to nonlinear and non-stationary time series. The EMD feature extraction procedure is as follows:
(a) Decompose the bearing noise signal into a set of IMFs $c_1(t), c_2(t), \dots, c_K(t)$.
(b) Calculate the energy of each IMF, $E_i = \sum_{j=1}^{N} |c_i(j)|^2$.
(c) Calculate the energy-entropy term of each IMF, $H_i = -p_i \log p_i$, where $p_i = E_i / \sum_{k=1}^{K} E_k$ is the relative energy of the $i$-th IMF.
(d) Calculate the energy entropy of the whole original signal, $H = \sum_{i=1}^{K} H_i = -\sum_{i=1}^{K} p_i \log p_i$.
(e) Construct the feature vector from the first six per-IMF energy-entropy terms $H_i$.
Figure 2 shows the empirical mode decomposition diagram of a sample.
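As a hedged illustration of steps (a)–(e), the sketch below uses the PyEMD package (one of several available EMD implementations); the natural logarithm and the zero-padding of missing IMFs are assumptions, not details taken from the paper.

```python
import numpy as np
from PyEMD import EMD  # pip install EMD-signal

def emd_energy_entropy_features(x, n_imfs=6):
    """Sketch of steps (a)-(e): decompose x with EMD and return the
    per-IMF energy-entropy terms of the first n_imfs IMFs, plus the
    total energy entropy of the signal."""
    imfs = EMD()(np.asarray(x, dtype=float))   # (a) IMFs, shape (K, N)
    energies = np.sum(imfs ** 2, axis=1)       # (b) E_i = sum_j c_i(j)^2
    p = energies / energies.sum()              # relative energy of each IMF
    h = -p * np.log(p)                         # (c) per-IMF energy-entropy terms
    h_total = h.sum()                          # (d) energy entropy of the signal
    feats = np.zeros(n_imfs)                   # (e) keep the first n_imfs terms,
    k = min(n_imfs, len(h))                    #     zero-padded if fewer IMFs exist
    feats[:k] = h[:k]
    return feats, h_total
```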
2.3. Permutation Entropy
Feature 18 is the permutation entropy (PE). The permutation entropy algorithm is a method for detecting abrupt changes in vibration signals; it can conveniently locate the moment at which the system behavior mutates and is sensitive to small changes in the signal.
The calculation steps of PE are as follows:
(1) Let the time series be $x_j$ ($j = 1, 2, \dots, N$) of length $N$, and define an embedding dimension $m$ and a time delay $d$.
(2) The signal is reconstructed in phase space to obtain $k = N - (m - 1)d$ reconstructed components, each represented by $X_i = \{x(i), x(i + d), \dots, x(i + (m - 1)d)\}$.
(3) The elements of each subsequence $X_i$ are sorted in increasing order, that is, $x(i + (j_1 - 1)d) \le x(i + (j_2 - 1)d) \le \dots \le x(i + (j_m - 1)d)$. When sorting, if two values are equal, they are ordered according to their subscripts. In this way, each $X_i$ is mapped to an ordinal pattern $\pi = (j_1, j_2, \dots, j_m)$, which is one of the possible ordinal patterns of $m$ numbers. Therefore, every $m$-dimensional subsequence $X_i$ is mapped to one of the $m!$ permutations.
(4) Count the number of times each permutation pattern $\pi_l$ appears among the $k$ subsequences, denoted $N(\pi_l)$; the probability of each permutation pattern is then defined as $P_l = N(\pi_l)/k$, $l = 1, 2, \dots, m!$.
(5) The permutation entropy of the time series is defined as
$$H_{PE}(m) = -\sum_{l=1}^{m!} P_l \ln P_l.$$
Obviously, $0 \le H_{PE}(m) \le \ln(m!)$. In general, $H_{PE}(m)$ is normalized to the range 0–1, and $h_{PE} = H_{PE}(m)/\ln(m!)$ is defined for this purpose.
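The following sketch is one straightforward way to implement steps (1)–(5) in Python; the default parameter values are illustrative.

```python
import math
import numpy as np

def permutation_entropy(x, m=4, d=1, normalize=True):
    """Permutation entropy of a 1-D signal x with embedding dimension m
    and time delay d, following steps (1)-(5) above."""
    x = np.asarray(x, dtype=float)
    k = len(x) - (m - 1) * d                  # number of reconstructed components
    patterns = {}
    for i in range(k):
        window = x[i:i + (m - 1) * d + 1:d]   # X_i = {x(i), x(i+d), ..., x(i+(m-1)d)}
        # stable argsort: equal values keep their subscript order, as in step (3)
        pattern = tuple(np.argsort(window, kind="stable"))
        patterns[pattern] = patterns.get(pattern, 0) + 1
    probs = np.array(list(patterns.values()), dtype=float) / k
    h = -np.sum(probs * np.log(probs))        # H_PE(m)
    return h / math.log(math.factorial(m)) if normalize else h
```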
2.4. Dispersion Entropy
Feature 19 is the dispersion entropy (DE). Rostaghi et al. [46] gave the detailed calculation steps of DE as follows. For a given univariate signal of length $N$, $x = \{x_1, x_2, \dots, x_N\}$, the DE algorithm includes 4 main steps:
(1) First, the $x_i$ are mapped to $c$ classes, labeled from 1 to $c$. There are a number of linear and nonlinear approaches for doing so; the linear mapping algorithm is the fastest one. However, when the maximum and/or minimum values of a time series are much larger or smaller than the mean/median value of the signal, the majority of the $x_i$ are assigned to only a few classes. Thus, we first employ the normal cumulative distribution function (NCDF) to map $x$ into $y = \{y_1, y_2, \dots, y_N\}$ with $y_j \in (0, 1)$. Next, we use a linear algorithm to assign each $y_j$ to an integer from 1 to $c$: for each member of the mapped signal, $z_j^c = \mathrm{round}(c \cdot y_j + 0.5)$, where $z_j^c$ is the $j$-th member of the classified time series and rounding involves either increasing or decreasing a number to the next digit. It is worth noting that this step could also be done with other linear and nonlinear mapping techniques.
(2) Each embedding vector $z_i^{m,c} = \{z_i^c, z_{i+d}^c, \dots, z_{i+(m-1)d}^c\}$, $i = 1, 2, \dots, N - (m - 1)d$, is created with embedding dimension $m$ and time delay $d$. Each vector is mapped to a dispersion pattern $\pi_{v_0 v_1 \cdots v_{m-1}}$, where $z_i^c = v_0$, $z_{i+d}^c = v_1$, ..., $z_{i+(m-1)d}^c = v_{m-1}$. The number of possible dispersion patterns that can be assigned to each vector is equal to $c^m$, since the vector has $m$ members and each member can be one of the integers from 1 to $c$.
(3) For each of the $c^m$ potential dispersion patterns, the relative frequency is obtained as follows:
$$p(\pi_{v_0 v_1 \cdots v_{m-1}}) = \frac{\#\{\, i \mid i \le N - (m - 1)d,\; z_i^{m,c} \text{ has type } \pi_{v_0 v_1 \cdots v_{m-1}} \,\}}{N - (m - 1)d}.$$
In fact, $p(\pi_{v_0 v_1 \cdots v_{m-1}})$ is the number of embedding vectors assigned to the pattern $\pi_{v_0 v_1 \cdots v_{m-1}}$, divided by the total number of embedding vectors with embedding dimension $m$.
(4) Finally, based on Shannon's definition of entropy, the DE value with embedding dimension $m$, time delay $d$, and number of classes $c$ is calculated as
$$DE(x, m, c, d) = -\sum_{\pi = 1}^{c^m} p(\pi_{v_0 v_1 \cdots v_{m-1}}) \ln p(\pi_{v_0 v_1 \cdots v_{m-1}}).$$
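A minimal Python sketch of steps (1)–(4) is given below; the default values of m, c, and d are illustrative and would need to be tuned for the bearing noise signals.

```python
import numpy as np
from scipy.stats import norm

def dispersion_entropy(x, m=3, c=6, d=1):
    """Dispersion entropy of a 1-D signal x following steps (1)-(4):
    NCDF mapping, linear assignment to c classes, dispersion patterns,
    and Shannon entropy over the observed patterns."""
    x = np.asarray(x, dtype=float)
    # (1) map x to (0, 1) with the normal CDF, then to integers 1..c
    y = norm.cdf(x, loc=x.mean(), scale=x.std())
    z = np.clip(np.round(c * y + 0.5).astype(int), 1, c)
    # (2) build embedding vectors and count dispersion patterns
    n_vec = len(x) - (m - 1) * d
    counts = {}
    for i in range(n_vec):
        pattern = tuple(z[i:i + (m - 1) * d + 1:d])
        counts[pattern] = counts.get(pattern, 0) + 1
    # (3) relative frequency of each observed pattern
    p = np.array(list(counts.values()), dtype=float) / n_vec
    # (4) Shannon entropy (patterns with zero probability contribute nothing)
    return -np.sum(p * np.log(p))
```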
2.5. Wavelet Packet Decomposition
Features 20 to 27 are the norms of the signals reconstructed from the wavelet packet decomposition coefficients. Wavelet decomposition expands the signal on a series of wavelet basis functions. In engineering applications, useful signals usually appear as low-frequency or relatively stable components, while interference usually appears at high frequencies. Therefore, the signal can be approximated by low-frequency coefficients with a small amount of data together with several layers of high-frequency coefficients. Figure 3 shows a three-layer decomposition structure diagram, where $A_j$ and $D_j$ ($j = 1, 2, 3$) denote the low-frequency and high-frequency decomposition coefficients of the $j$-th layer, respectively.
Feature extraction based on wavelet decomposition is divided into the following steps:
(1) Wavelet packet decomposition of the one-dimensional signal. Select the db1 wavelet, set the decomposition level to 3, and perform a 3-level wavelet packet decomposition of the signal x.
(2) Perform wavelet reconstruction on the decomposed coefficients. Using the low-frequency coefficients of the N-th layer and the high-frequency coefficients of the first to N-th layers, reconstruct a one-dimensional signal for each terminal node.
(3) Calculate the 2-norms of the reconstructed signals and use them as features F20–F27.
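The following sketch shows one way to implement these three steps with the PyWavelets package; the mode='symmetric' boundary handling is an assumption, as the text does not specify it.

```python
import numpy as np
import pywt

def wpd_norm_features(x, wavelet="db1", level=3):
    """Three-level wavelet packet decomposition of x; returns the 2-norm
    of the signal reconstructed from each of the 2**level terminal nodes
    (features F20-F27 for level=3)."""
    wp = pywt.WaveletPacket(data=np.asarray(x, dtype=float),
                            wavelet=wavelet, mode="symmetric", maxlevel=level)
    feats = []
    for node in wp.get_level(level, order="natural"):
        # reconstruct the contribution of a single terminal node
        single = pywt.WaveletPacket(data=None, wavelet=wavelet,
                                    mode="symmetric", maxlevel=level)
        single[node.path] = node.data
        rec = single.reconstruct(update=False)
        feats.append(np.linalg.norm(rec))      # 2-norm of the reconstructed branch
    return np.array(feats)
```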
2.6. Frequency Domain Feature Set
The frequency domain features include the sum of the spectrum amplitude, the average value of the spectrum, the standard deviation of the spectrum, and the integral of the frequency domain curve, which are represented by F28–F31, respectively.
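A short sketch of these four features is given below; the use of a one-sided FFT amplitude spectrum and the trapezoidal rule for the integral are assumptions about details the text does not specify.

```python
import numpy as np

def frequency_domain_features(x, fs):
    """Features F28-F31 as described above: sum, mean, and standard
    deviation of the spectrum amplitude, and the integral of the
    frequency-domain curve (approximated with the trapezoidal rule)."""
    spectrum = np.abs(np.fft.rfft(x))           # one-sided amplitude spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    f28 = spectrum.sum()                        # sum of spectrum amplitude
    f29 = spectrum.mean()                       # average value of the spectrum
    f30 = spectrum.std()                        # standard deviation of the spectrum
    f31 = np.trapz(spectrum, freqs)             # integral of the frequency-domain curve
    return np.array([f28, f29, f30, f31])
```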
3. Feature Selection Algorithm for Rotating Machinery Noise Diagnosis
The rotating machinery noise diagnosis process generally includes three steps: feature extraction, feature selection (or feature dimension reduction), and state classification.
Traditional feature selection separates the data from the classification and maps the original features into several features selected by a dimension reduction algorithm. Table 2 summarizes the characteristics and differences of commonly used feature filtering algorithms; each processes the data with its own emphasis.
Aiming at the problem that traditional feature selection is usually separated from the learning of the prediction model in rotating machinery noise diagnosis, this paper proposes a feature selection algorithm based on network variable selection and within-class and between-class discriminant analysis (WBDA). The proposed algorithm realizes a compromise between the two types of feature selection technique, as shown in Figure 4.
3.1. Primary Feature Selection Oriented to the Classification Network—MIVs-SVM
The selection of meaningful time-frequency features of the noise as SVM input is a key step for status prediction. The MIV is considered one of the most effective indices for evaluating the influence of variables on the output of a neural network. However, when a neural network is used as the classification network to calculate the MIV of each feature variable, the calculated MIVs are highly random, because the parameters of the neural network obtained by each training run are not the same. Figure 5 illustrates this randomness; the abscissa is the feature index, and the ordinate is the MIV value.
Since an SVM is used for fault classification, this algorithm uses the SVM network itself to calculate the MIV, and the result is named MIVs-SVM. Considering that the final output of the SVM is a class label rather than a continuous value, after the SVM classification hyperplane is obtained by training, the estimated posterior probability of a sample belonging to each class $c$ is first calculated with a Softmax Regression function, and then the probability corresponding to the true class of the sample is selected as the output result. The specific calculation method is shown in Figure 6 and is described as follows:
(a) After the network training, each feature variable in the training sample P was increased and decreased by 10% to obtain training samples P1 and P2, respectively. P1 and P2 were input into the established network, and the Softmax Regression function was applied to the output of the SVM network. The two new classification results are denoted A1 and A2.
(b) The difference between A1 and A2 was taken as the impact value (IV) of the independent variable's variation on the output.
(c) The MIV of the independent variable on the output was obtained by averaging the IVs over all monitoring cases (different fault samples), yielding the MIV of the specific feature.
(d) Repeat steps a–c to obtain the MIV of each feature variable.
(e) The effects of each independent variable on the output were evaluated based on their absolute MIV, and then the effects of the input feature on the results were evaluated, thus achieving variable selection.
Since this modified method directly uses the subsequent classification network SVM to calculate the MIV, it is called MIVs-SVM, abbreviated as MIVs.
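The sketch below illustrates the MIVs-SVM procedure using scikit-learn's SVC. Applying a softmax to the decision-function scores is one way to obtain posterior-like probabilities, and the exact form of the Softmax Regression step used in the paper may differ; the 10% perturbation follows step (a), while the multi-class decision_function shape is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

def softmax(scores):
    """Row-wise softmax over SVM decision scores."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mivs_svm(X, y, C=10.0, gamma=0.1, delta=0.10):
    """Sketch of MIVs-SVM: perturb each feature by +/-10 percent and
    average, over all samples, the change in the softmax probability
    assigned to each sample's true class. Assumes >= 3 classes so that
    decision_function returns an (n_samples, n_classes) array."""
    clf = SVC(C=C, gamma=gamma, decision_function_shape="ovr").fit(X, y)
    idx = np.searchsorted(clf.classes_, y)            # column of each sample's true class
    mivs = np.zeros(X.shape[1])
    for j in range(X.shape[1]):                       # step (d): repeat for every feature
        X1, X2 = X.copy(), X.copy()
        X1[:, j] *= 1 + delta                         # step (a): feature increased by 10 %
        X2[:, j] *= 1 - delta                         #           feature decreased by 10 %
        p1 = softmax(clf.decision_function(X1))       # posterior-like result A1
        p2 = softmax(clf.decision_function(X2))       # posterior-like result A2
        iv = p1[np.arange(len(y)), idx] - p2[np.arange(len(y)), idx]  # step (b)
        mivs[j] = iv.mean()                           # step (c): mean impact value
    return mivs                                       # step (e): rank features by |MIV|
```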
3.2. Secondary Feature Selection Based on Feature Separability—WBDA
The effects of the feature variables on the output were ranked by the network-based feature selection, which reflects the coupling between the feature selection algorithm and the feature classification algorithm and provides a reference for variable selection oriented to the classification network. Nevertheless, to evaluate the separability of features, we want the feature values of samples in the same class to be as close as possible, while the feature values of samples from different classes should be as far apart as possible. To this end, the idea of WBDA was introduced.
The idea of WBDA comes from linear discriminant analysis (LDA). The idea of LDA is simple: given a set of training samples, project the samples onto a straight line so that the projection points of samples of the same class are as close as possible, and the projection points of samples of different classes are as far apart as possible. LDA is used for feature dimensionality reduction, so it must construct an optimal linear transformation W. In our case, the purpose of the algorithm is feature selection, so the linear transformation can be omitted. The specific algorithm is described as follows.
For any feature $x_k$, define the within-class divergence
$$S_w(k) = \sum_{c=1}^{C} \sum_{i \in D_c} \bigl(x_{k,i} - \mu_{k,c}\bigr)^2,$$
where $D_c$ is the set of samples of class $c$, $\mu_{k,c}$ is the mean of feature $x_k$ over class $c$, and the inner sum is called the divergence of class $c$.
Define the between-class divergence
$$S_b(k) = \sum_{c=1}^{C} n_c \bigl(\mu_{k,c} - \mu_k\bigr)^2,$$
where $n_c$ is the number of samples in class $c$ and $\mu_k$ is the overall mean of feature $x_k$.
Therefore, the larger $S_b(k)$ and the smaller $S_w(k)$, the better. Taking these two points into consideration, the objective function is defined as
$$\mathrm{WBDA}(k) = \frac{S_b(k)}{S_w(k)}.$$
In order to prevent the calculated WBDA values of the features within one feature class from all being relatively large, which would cause the selected features to lack diversity, this paper proposes a between-class selection strategy: select the feature with the maximum WBDA value from one feature class each time, then select the maximum among the remaining classes the next time. Once a certain class has participated in the selection, it does not participate in the selection of subsequent features until a feature from each of the other classes has been selected. After that, feature selection proceeds to the next cycle. A sketch of this strategy is given below.
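The following sketch shows a per-feature WBDA score and the between-class (diversity) selection strategy; the Fisher-ratio-style weighting by class size is an assumption about the exact formula, and feature_class is a hypothetical array mapping each feature index to one of the six feature classes of Figure 1.

```python
import numpy as np

def wbda_scores(X, y):
    """Per-feature WBDA score: between-class divergence divided by
    within-class divergence (a Fisher-ratio-style criterion; the exact
    weighting used in the paper may differ)."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    s_w = np.zeros(X.shape[1])
    s_b = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        s_w += ((Xc - mu_c) ** 2).sum(axis=0)          # within-class divergence
        s_b += len(Xc) * (mu_c - overall_mean) ** 2    # between-class divergence
    return s_b / s_w

def diverse_selection(scores, feature_class, n_select):
    """Between-class selection strategy: repeatedly take the best-scoring
    unselected feature from each feature class in turn, so that the
    selected subset is spread over all feature classes."""
    n_select = min(n_select, len(scores))
    selected = []
    order = np.argsort(scores)[::-1]                   # feature indices, best score first
    while len(selected) < n_select:
        used_classes = set()                           # one pick per class per cycle
        for i in order:
            if len(selected) >= n_select:
                break
            if i not in selected and feature_class[i] not in used_classes:
                selected.append(i)
                used_classes.add(feature_class[i])
    return selected
```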
4. Classifier and Its Parameter Optimization
Feature classification was achieved using the SVM. The multi-kernel support vector machine is suitable for complex industrial environments; it requires relatively few hardware resources and offers stable classification results and good generalization performance. Let the training set be $T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, where $x_i$ is the $i$-th input vector and $y_i \in \{-1, +1\}$ is its corresponding output label. The SVM handles the nonlinear binary classification problem as follows [30]:
(1) Select an appropriate kernel function $K(x, x')$ and an appropriate penalty parameter $C > 0$, and construct the following constrained optimization problem:
$$\min_{\alpha}\; \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{N} \alpha_i$$
$$\text{s.t.}\quad \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C,\; i = 1, 2, \dots, N,$$
where $\phi(x)$ is the mapping function and $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ is the inner product of $\phi(x_i)$ and $\phi(x_j)$.
(2) Use the sequential minimal optimization (SMO) algorithm to find the optimal solution $\alpha^* = (\alpha_1^*, \alpha_2^*, \dots, \alpha_N^*)^T$ that minimizes the above objective.
(3) Calculate the normal vector of the separating hyperplane, $w^* = \sum_{i=1}^{N} \alpha_i^* y_i \phi(x_i)$; note that $w^*$ cannot be evaluated directly and explicitly because $\phi$ is implicit.
(4) Find all of the $S$ support vectors $x_s$ on the maximum-margin boundary and calculate the bias $b_s^* = y_s - \sum_{i=1}^{N} \alpha_i^* y_i K(x_i, x_s)$ corresponding to each support vector. The average of all $b_s^*$ is the final bias $b^*$. Thus, the final classification hyperplane is $\sum_{i=1}^{N} \alpha_i^* y_i K(x, x_i) + b^* = 0$, and the classification decision function is
$$f(x) = \mathrm{sign}\!\left(\sum_{i=1}^{N} \alpha_i^* y_i K(x, x_i) + b^*\right).$$
The kernel function is equivalent to transforming the original input space into a new feature space through the mapping function and learning the linear support vector machine from the training samples in the new feature space. Learning is implicitly done in the feature space. In practical applications, the choice of kernel function needs to be verified by experiments. The radial basis kernel function is chosen in this paper.
The performance of the SVM classifier is mainly affected by the penalty factor (C) and the kernel parameter (γ). The kernel function mainly reflects the complexity of the sample data in the high-dimensional space, while the penalty factor affects the generalization capability of the SVM by tuning the ratio of the confidence interval to the empirical risk in the feature space. Hence, optimization of SVM performance is usually converted into the optimal selection of the parameter pair (C, γ). Conventional optimization algorithms include the PSO algorithm and the GA.
PSO employs a swarm-based global search strategy with a speed-displacement model and involves no complicated genetic operations. The memory capability of PSO allows dynamic tracking of the current search situation. PSO can be regarded as a search by a swarm of $m$ particles in an $n$-dimensional space, where the location of each particle $X_i = (x_{i1}, x_{i2}, \dots, x_{in})$ represents a candidate solution. The best solution found so far by each particle is denoted $P_i = (p_{i1}, p_{i2}, \dots, p_{in})$, and the best solution found by the whole swarm is denoted $P_g = (p_{g1}, p_{g2}, \dots, p_{gn})$. The particle velocities are denoted $V_i = (v_{i1}, v_{i2}, \dots, v_{in})$, and the update rule of $V_i$ given these two best solutions is as follows [38]:
$$v_{id}^{t+1} = w\, v_{id}^{t} + c_1\, \mathrm{rand}()\, \bigl(p_{id} - x_{id}^{t}\bigr) + c_2\, \mathrm{rand}()\, \bigl(p_{gd} - x_{id}^{t}\bigr),$$
where $v_{id}^{t+1}$ is the velocity of the $i$-th particle at the $(t+1)$-th iteration in the $d$-th dimension, $w$ is the inertia weight, $c_1$ and $c_2$ are acceleration constants, and rand() is a random number between 0 and 1.
The GA is a parallel random search optimization approach that mimics biological evolution [42]. Individuals are chosen through selection, crossover, and mutation according to the chosen fitness function, so that individuals with good fitness are retained and individuals with poor fitness are excluded. In this way, the new generation inherits information from the old generation and outperforms it. This process is repeated until the requirements are satisfied.
In this paper, the classifier parameters are optimized using these two optimization algorithms; a PSO-based sketch is given below.
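The following sketch shows how PSO might be used to search for (C, γ) of an RBF-kernel SVM, with 5-fold cross-validation accuracy as the fitness function; the swarm size, inertia weight, acceleration constants, and search bounds are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def pso_svm_params(X, y, n_particles=20, n_iter=30,
                   w=0.7, c1=1.5, c2=1.5,
                   bounds=((0.01, 1000.0), (1e-4, 10.0))):
    """Minimal PSO sketch that searches (C, gamma) for an RBF-kernel SVM,
    using 5-fold cross-validation accuracy as the fitness function."""
    rng = np.random.default_rng(0)
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    pos = rng.uniform(lo, hi, size=(n_particles, 2))     # particle positions (C, gamma)
    vel = np.zeros_like(pos)

    def fitness(p):
        clf = SVC(C=p[0], gamma=p[1], kernel="rbf")
        return cross_val_score(clf, X, y, cv=5).mean()

    fit = np.array([fitness(p) for p in pos])
    pbest, pbest_fit = pos.copy(), fit.copy()             # personal best positions
    g = fit.argmax()
    gbest, gbest_fit = pos[g].copy(), fit[g]               # global best position
    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # velocity update rule from the formula above
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)                   # keep particles inside bounds
        fit = np.array([fitness(p) for p in pos])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        if fit.max() > gbest_fit:
            gbest, gbest_fit = pos[fit.argmax()].copy(), fit.max()
    return gbest                                           # best (C, gamma) found
```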
5. Network Variable Selection and WBDA Fusion-Oriented Rotating Machinery Noise Diagnosis Algorithm
The network variable selection and WBDA fusion-oriented rotating machinery noise diagnosis algorithm (the MIVs-WBDA algorithm) is a feature selection algorithm that fuses network variable selection and WBDA. First, features are selected according to the contributions of the independent variables to the classification network, achieving classification-network-oriented primary variable selection. Then, secondary feature selection and dimensionality reduction are carried out according to the WBDA, which reflects feature separability, and the selected features are fed to the SVM for identification. The steps are as follows:
(1) According to the calculated data feature set, the samples were randomly divided into training samples, cross-validation samples, and testing samples. Cross-validation is a statistical analysis method for validating classifier performance, and our experiments showed that SVM training based on parameters selected with the cross-validation set was more effective than training based on randomly selected parameters. Therefore, the feature MIVs were calculated on the cross-validation samples.
(2) After setting aside the N features with the most significant MIVs and discarding the features with negligible MIVs, the remaining features were ranked in descending order of between-class WBDA. According to the target dimensionality after reduction (L), a new feature vector consisting of the first L−N of these features together with the N features with significant MIVs was generated.
(3) According to the SVM optimization algorithm, the (C, γ) of the SVM was optimized using the cross-validation set.
(4) We conducted learning based on the training set and tested the identification accuracy of the current SVM.
Figure 7 shows the MIVs-WBDA algorithm flow and the relationship between the two feature selection algorithms and the other modules. The result of primary feature selection is controlled by the classifier type, while secondary feature selection is mainly conducted on the residual feature set according to the characteristics of the features themselves. The feature metric chosen for secondary feature selection is the WBDA defined in this paper. Therefore, we obtain a feature selection algorithm that fuses network variable selection and WBDA. The superiority of this method is demonstrated in Section 6.
Algorithm 1 summarizes the procedures of the network variable selection and WBDA fusion-oriented bearing noise diagnosis algorithm, including feature extraction and feature classification.
Algorithm 1. The MIVs-WBDA Algorithm
Input: Data set X, dimension after reduction L
Output: Feature set FS, classification result O, and recognition rate R
Step 1: Calculate the data feature set; randomly assign the training samples, cross-validation samples, and test samples.
Step 2: Calculate the MIVs of each feature using the cross-validation samples, and select the N most prominent features to form the feature set FS1.
Step 3: Calculate the between-class WBDA of the residual features.
Step 4: Rank the WBDA values from large to small, select the first L−N features, and form the feature set FS2; then form the new feature set FS together with FS1, noting that the L−N features should be distributed over as many feature classes as possible. FS = {FS1, FS2}.
Step 5: According to the SVM optimization algorithm, use the cross-validation set to optimize the SVM parameters (C, γ).
Step 6: Learn on the training set, and test the SVM to obtain the classification result O and the recognition accuracy R.
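For orientation, the sketch below strings Algorithm 1 together using the hypothetical helper functions sketched in earlier sections (mivs_svm, wbda_scores, diverse_selection, pso_svm_params); the data splits and defaults are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def mivs_wbda_pipeline(X, y, feature_class, L, N):
    """Sketch of Algorithm 1 using the helpers sketched earlier; splits,
    thresholds, and parameter defaults are illustrative assumptions."""
    # Step 1: split into training, cross-validation, and test samples
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, stratify=y)
    X_cv, X_te, y_cv, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp)
    # Step 2: primary selection - N features with the largest |MIV| on the CV set
    mivs = mivs_svm(X_cv, y_cv)
    fs1 = list(np.argsort(np.abs(mivs))[::-1][:N])
    # Steps 3-4: secondary selection - L-N residual features by WBDA with diversity
    residual = [i for i in range(X.shape[1]) if i not in fs1]
    scores = wbda_scores(X_tr[:, residual], y_tr)
    res_classes = [feature_class[i] for i in residual]
    fs2 = [residual[i] for i in diverse_selection(scores, res_classes, L - N)]
    fs = fs1 + fs2
    # Step 5: optimize (C, gamma) on the cross-validation set
    C, gamma = pso_svm_params(X_cv[:, fs], y_cv)
    # Step 6: train on the training set and report test predictions and accuracy
    clf = SVC(C=C, gamma=gamma, kernel="rbf").fit(X_tr[:, fs], y_tr)
    return fs, clf.predict(X_te[:, fs]), clf.score(X_te[:, fs], y_te)
```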
7. Conclusions and Future Works
Since redundant information in high-dimensional feature vectors may lead to the curse of dimensionality and increased calculation time, this paper proposes an end-to-end feature selection and dimension reduction method (MIVs-WBDA) and compares it with the popular PCA, CDET, MIV, FA, LPP, NPE, and PPCA dimensionality reduction methods. Unlike conventional feature learning algorithms, MIVs-WBDA is a feature selection method based on the fusion of network variable selection and WBDA. Moreover, it accounts for both the correlation between feature selection and the classification network and the correlation between the classification network and feature similarity. Hence, MIVs-WBDA can partially overcome the drawbacks of linear classification.

The classification effect of noise measurement depends on the operating environment. Different feature selections may affect the final classification result under different operating environments, and the selection will change when the environment changes. Common feature selection algorithms only transform the data and do not consider the influence of the data on the classifier. This paper mainly considers the influence of the features on the model classification and integrates model classification and feature selection organically. The WBDA algorithm also takes the generalization performance of the algorithm into account.

This paper reports the running time and accuracy of the MIVs-WBDA algorithm and several common feature selection algorithms. The results show that the MIVs-WBDA algorithm performs well when both time and classification accuracy are considered. The MIVs-WBDA feature selection algorithm can screen out the several features that are most conducive to classification, which has high practical application value. MIVs-WBDA selects the most important features and exhibits enhanced classification performance, realizing the unification of feature representation learning and machine prediction learning. Experiments show that, when reduced to the same dimension, the classification accuracy for rotating machinery status using the MIVs-WBDA method improves by 3% under the two feature set construction methods. The typical running time of this classification learning algorithm is less than 10 s, whereas a deep learning approach would require hours of training.

It should be noted that when the feature dimension is reduced to 1, the classification accuracy of the MIVs-WBDA algorithm is not high. This means that the best single feature is not selected in this case, and other strategies could be introduced to solve the accuracy problem when the dimension is 1. In future work, the idea of feature extraction could be combined with the proposed method to improve classification performance in low dimensions. Of course, in practical applications, the feature vector will not be reduced to a single dimension, so this limitation does not affect the use of the algorithm. The ideas of constructing a diverse feature pool and of end-to-end feature selection and prediction model learning can also be applied to other similar application scenarios.