A Novel Method for Fault Migration Diagnosis of Rolling Bearings Based on MSCVIT Model

Liu, Xiu-Yan; He, Dong-Lin; Guo, Dong-Qing; Guo, Ting-Ting

doi:10.3390/electronics13234726


¹ School of Information and Control Engineering, Qingdao University of Technology, No. 777 Jialingjiang East Road, Qingdao 266520, China

² School of Electromechanical and Automotive Engineering, Yantai University, No. 30 Qingquan Road, Yantai 264003, China

* Author to whom correspondence should be addressed.

Electronics 2024, 13(23), 4726; https://doi.org/10.3390/electronics13234726

Submission received: 25 October 2024 / Revised: 19 November 2024 / Accepted: 26 November 2024 / Published: 29 November 2024


Abstract:

The normal operation of rolling bearings is crucial to the performance and reliability of rotating machinery. However, the collected vibration signals are often mixed with complex noise, and the transformer network cannot fully extract the characteristics of the vibration signals. To solve this problem, we propose a data preprocessing method that utilizes singular value decomposition (SVD) and continuous wavelet transform (CWT), along with an improved vision transformer (ViT) model for fault diagnosis. First, SVD is applied to identify and remove the noise components and thereby improve the data quality. Then, the CWT is used to convert the denoised signal into a two-dimensional (2D) time–frequency representation (TFR) that displays the fault features more intuitively. Finally, an improved multi-scale convolutional block attention module (MSCBAM) is embedded into the ViT network to extract fault features. Experimental results on the classical Case Western Reserve University (CWRU) dataset show that the average diagnostic accuracy of the proposed method is 99.3%. In comparison with six other fault diagnosis methods, the proposed method also achieves good diagnostic results on three other datasets, so it can support the timely handling of faulty equipment and reduce downtime.

Keywords:

fault diagnosis; singular value decomposition; continuous wavelet transform; multi-scale convolutional block attention module; vision transformer

1. Introduction

Maintaining rolling bearings in optimal condition is vital to keep rotating machinery running smoothly and to ensure the stability of the entire production process [1,2]. However, many factors can result in the breakdown of machinery, such as prolonged operation, excessive loads, and environmental impacts. According to statistics, approximately 30% of rotating machinery failures are related to rolling bearings [3]. Therefore, it is of great significance to study rolling bearing fault diagnosis technology in actual industrial production.

Traditional fault diagnosis approaches are typically classified into model-based and data-driven methods. Model-based methods typically use physical models to describe the dynamic properties of mechanical systems. Shen et al. proposed a physics-based deep learning method that utilized a threshold model to assess the health classes of bearings based on the known physics of bearing faults, while a convolutional neural network (CNN) was used to predict the health condition based on high-level features extracted from the input [4]. To solve the problem that most deep learning algorithms tend to ignore physical information, Ni et al. proposed a new physics-informed residual network. This network was designed to learn the underlying physics embedded in the training and testing data to provide a physically consistent solution for incomplete data [5]. Borghesani et al. proposed a generalized bearing signal model, which mainly evaluated the influence of key model parameters corresponding to the physical properties of bearings on signals in the time and frequency domains [6]. Zhang et al. reconstructed the air-gap displacement profile based on the stator current electrical model, enabling a quantitative assessment of rolling bearing fault severity and avoiding the tedious manual calibration of fault thresholds under different power, speed, and load conditions required by traditional methods [7]. Liu et al. proposed a personalized diagnostic method based on finite element method simulations and a support vector machine, which generated simulated fault samples and performed classification, making it possible to detect faults in mechanical components in the absence of actual fault samples [8]. Keshun et al. proposed a sound-vibration physical information fusion-constrained deep learning method (PFCG-DL), which combined physical models with deep learning models to improve the accuracy, interpretability, and computational efficiency of bearing fault diagnosis, addressing the lack of physical mechanism guidance and the low interpretability of existing deep learning methods [9]. However, model-based approaches have disadvantages such as poor adaptability and strong reliance on models. In contrast, data-driven approaches use large quantities of data to find the underlying rules of a system without relying on a deep understanding of the system or on precisely built mathematical models.

As a data-driven method, deep learning can autonomously derive features from a large quantity of data and reduce manual intervention [10]. In recent years, it has increasingly been applied across different domains in artificial intelligence, including object detection [11], audio recognition [12], natural language processing [13], and disease prediction [14]. Zhang et al. introduced a method to convert one-dimensional (1D) fault signals into two-dimensional (2D) maps and send the maps to a CNN for diagnosis [15]. Li et al. proposed a hybrid diagnostic model combining a dual-stage attention-based recurrent neural network (DA-RNN) and convolutional block attention module (CBAM). They utilized the DA-RNN to extend imbalanced datasets and combined image processing with the CBAM network for fault classification, effectively addressing the issue of improving fault diagnosis accuracy under imbalanced data conditions [16]. Chen et al. utilized a combination of CNN and long short-term memory (LSTM) to reduce calculation time and eliminate the problem of large data volume and unreliable manual analysis [17]. Gu et al. adopted discrete wavelet transformation (DWT) to extract detailed fault information of different frequencies and time scales and used an LSTM network to capture the temporal relationships within the fault data [18]. An et al. proposed an LSTM gating unit for time-varying conditions to selectively forget some unimportant information [19].

Despite the success of deep learning networks in fault diagnosis, some inherent drawbacks must be considered. For example, CNNs may not be as effective as RNNs in capturing long-range dependencies, especially in tasks that require global information. Additionally, RNNs struggle with parallel computation and are prone to the vanishing gradient problem, which limits the length of sequences they can handle. In contrast, attention-based transformer models can effectively capture long-range dependencies in sequence data by supporting global interactions between each time step and all other time steps in the sequence, thereby achieving higher parallelism and computational capability. Yang et al. segmented and linearly encoded 1D vibration signals and then used the transformer model to extract features, aiming to enhance the performance of fault diagnosis [20]. Alexakos et al. utilized the short-time Fourier transform to convert 1D fault signals into 2D maps, which were subsequently classified by the transformer model [21]. Li et al. proposed a twin transformer to solve the problem of traditional deep learning models not being able to perform parallel computation in fault diagnosis [22]. Fan et al. input a gray texture image into a vision transformer (ViT) and utilized the self-attention mechanism to identify global patterns for fault classification [23]. Xie et al. utilized singular value decomposition (SVD) and energy-dispersive spectroscopy (EDS) to denoise the signal, and then employed the generalized S transform (GST) and a Res-ViT network for feature extraction and fault classification [24]. Tang et al. developed an integrated ViT model that incorporated the discrete wavelet transform (DWT) and a soft voting method, which transformed signals from different frequency bands into time–frequency images to achieve fault diagnosis [25]. To address the problem of traditional CNNs not being able to capture temporal information in rolling bearings, Weng et al. developed a 1D vision transformer architecture that incorporated a fusion of multi-scale CNNs (MCF-1DViT). This model utilized the MCF layer to capture fault features at multiple time scales and employed a transformer model to learn long-term temporal correlations [26]. Ding et al. developed the time–frequency transformer (TFT) model based on the ViT model to address the shortcomings of the traditional network in feature representation, and extracted effective information from the time–frequency representation (TFR) of vibration signals by using a new tokenizer and encoder module [27]. Xiang et al. proposed a frequency channel attention-based ViT method, which enhanced the feature extraction capability and interpretability of the model for rolling bearing fault identification by introducing a frequency domain channel-attention mechanism and a self-attention mechanism [28]. Guo et al. proposed a bidirectional parallel rolling bearing intelligent diagnosis method based on a multi-scale center-cascaded adaptive dynamic convolutional residual network and a Swin transformer. The method utilized a multi-scale center-cascaded dynamic convolutional residual block and a multi-dimensional coordinate attention mechanism to extract local fault features, while the moving window self-attention mechanism of the Swin transformer network was used to capture global features of the fault information [29]. Li et al. proposed a lightweight multi-feature fusion ViT model for rolling bearing fault diagnosis. The model utilized a multi-scale wide convolutional neural network perception module for local feature extraction, while an improved lightweight multi-feature fusion ViT was built for global feature extraction and fault recognition [30].

Although the transformer models in the literature above have made significant progress in many aspects, they still fall short in local feature extraction, which may limit their overall performance. To solve this problem, a bearing fault diagnosis model that combines a multi-scale convolutional block attention module with a vision transformer (MSCVIT) is proposed. The main contributions of this paper are as follows.

(1): First, noise is added to the original vibration data to simulate a real production environment. The order of singular values is reconstructed through the energy threshold method to achieve a denoising effect. The 1D denoised data are transformed into TFRs by continuous wavelet transform (CWT) technology to enhance the multi-dimensional representation of the data and capture the fault features in the time–frequency domain.
(2): Secondly, we improved the CBAM. We introduced an MLP with different reduction factors to enhance the capability of the channel-attention mechanism (CAM) to effectively capture the importance of different channels. Meanwhile, for the spatial attention mechanism (SAM), we combined convolutional kernels of different sizes to enhance the fusion of multi-scale features.
(3): Finally, an innovative MSCVIT model is proposed. The multi-scale processing of the multi-scale convolutional block attention module (MSCBAM) enables the model to capture local features more comprehensively, while the vision transformer effectively utilizes the self-attention mechanism to capture the dependencies between image patches and extract global information. The interaction between the MSCBAM and the ViT model effectively extracts the main features from the TFRs.

The structure of this article is as follows. Section 2 mainly introduces the basic theory of each module. The MSCBAM and the MSCVIT model diagnosis process are briefly introduced in Section 3. Section 4 mainly describes the classical dataset. The experimental results and analysis are given in Section 5, and ablation experiments on the MSCVIT model are conducted to better assess its performance. The sixth section concludes this paper.

2. Methodologies

2.1. Singular Value Decomposition

As a matrix decomposition method, the SVD plays an important role in linear algebra and numerical analysis. This technique is widely applied across fields such as signal noise reduction, image compression, and data dimensionality reduction [31]. The SVD reduces noise by identifying the signal’s main components, and the key to noise reduction is the choice of the reconstruction order, i.e., the number of singular values retained when the bearing vibration signal is reconstructed. In this paper, the energy threshold method is utilized to determine this order. By effectively preserving the signal’s temporal properties, the Hankel matrix enables accurate separation between the signal and noise. Therefore, the Hankel matrix is introduced to convert the 1D signal

A = \{a_{1}, a_{2}, a_{3}, \dots, a_{n}\}

into a 2D matrix. The Hankel matrix

H_{x \times y}

is described as follows:

H_{x \times y} = \begin{bmatrix} a_{1} & a_{2} & \cdots & a_{y} \\ a_{2} & a_{3} & \cdots & a_{y + 1} \\ \vdots & \vdots & \ddots & \vdots \\ a_{x} & a_{x + 1} & \cdots & a_{n} \end{bmatrix} = U_{x \times y} + N_{x \times y}

(1)

where

H_{x \times y}

is the constructed Hankel matrix,

n = x + y - 1

.

U_{x \times y}

represents the useful signal, and

N_{x \times y}

represents the noise signal. The energies of

U_{x \times y}

and

N_{x \times y}

mainly determine the reconstructed order of the singular value. The energy threshold method automatically determines the order of singular values that need to be retained based on the energy distribution of singular values. This process can be described thus:

p = \frac{\sum_{i = 1}^{r} σ_{i}^{2}}{\sum_{i = 1}^{q} σ_{i}^{2}}

(2)

where

\sum_{i = 1}^{q} σ_{i}^{2}

is the total energy of the signals,

σ_{i}

is the i-th singular value, whose square is the energy associated with it, and the positive integer

q

is the total order of the singular value.

p

is the energy threshold, and r is the reconstruction order, i.e., the smallest order at which the cumulative energy ratio reaches the threshold. The components associated with the first r (higher-energy) singular values are considered useful signal, while the remaining lower-energy components are considered noise.
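The procedure of Equations (1) and (2) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice of the number of Hankel rows `x`, and the anti-diagonal averaging used to map the low-rank matrix back to a 1D signal are assumptions that the paper does not specify.

```python
import numpy as np

def hankel_matrix(signal, x):
    """Build the x-by-y Hankel matrix of Eq. (1), where n = x + y - 1."""
    signal = np.asarray(signal)
    y = len(signal) - x + 1
    idx = np.arange(x)[:, None] + np.arange(y)[None, :]
    return signal[idx]

def reconstruction_order(singular_values, p):
    """Smallest r whose cumulative energy ratio (Eq. 2) reaches p."""
    energy = np.asarray(singular_values, dtype=float) ** 2
    ratio = np.cumsum(energy) / energy.sum()
    r = int(np.searchsorted(ratio, p)) + 1
    return min(r, len(energy))

def svd_denoise(signal, x, p=0.7):
    """Keep the first r singular components and rebuild a 1D signal."""
    signal = np.asarray(signal, dtype=float)
    H = hankel_matrix(signal, x)
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    r = reconstruction_order(s, p)
    H_r = (U[:, :r] * s[:r]) @ Vt[:r]
    # Assumed inverse step: average each anti-diagonal of the rank-r matrix.
    rows, cols = H_r.shape
    idx = (np.arange(rows)[:, None] + np.arange(cols)[None, :]).ravel()
    sums = np.bincount(idx, weights=H_r.ravel(), minlength=len(signal))
    counts = np.bincount(idx, minlength=len(signal))
    return sums / counts, r
```

For a signal of 1024 points, a call such as `svd_denoise(noisy, x=512, p=0.7)` returns the denoised signal together with the reconstruction order selected by the threshold.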

2.2. Continuous Wavelet Transform

A bearing vibration signal has complex time–frequency characteristics as a typical nonstationary signal. The CWT method can effectively analyze the signal’s time–frequency characteristics and construct its TFRs. Therefore, CWT is applied to convert vibration signals into TFRs, enabling the extraction of local signal features across different scales and positions in the time–frequency space [29]. The CWT is expressed as follows:

\mathrm{CWT}_{x}^{φ} (a, b) = \frac{1}{\sqrt{|b|}} \int x (t) φ^{*} \left(\frac{t - a}{b}\right) d t

(3)

where

x (t)

represents the denoised signal,

\mathrm{CWT}_{x}^{φ} (a, b)

represents the wavelet transform of

x (t)

, and

φ^{*}

is the conjugate of the wavelet function. The symbols a and b are the translation parameter and the scaling parameter of the wavelet bases, respectively. In addition, the complex Morlet wavelet cmor3-3 is chosen as the wavelet function due to its superior performance in capturing the characteristics of the signal. Cmor3-3 denotes a complex Morlet wavelet whose bandwidth parameter and center frequency are both 3. The Morlet wavelet function is described as follows:

φ (t) = \frac{σ}{\sqrt{π}} e^{- σ^{2} t^{2}} e^{i 2 π f t}

(4)

where

σ

is the wavelet bandwidth parameter,

π

is the mathematical constant (the ratio of a circle's circumference to its diameter),

t

is the sampling time, and

f

is the wavelet center frequency.
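A direct numerical evaluation of Equations (3) and (4) can be sketched as below. This is an illustrative NumPy version only: the Riemann-sum integration, the function names, and the mapping of cmor3-3 to σ = 1/√3 and f = 3 (the common convention in which the bandwidth parameter B equals 1/σ²) are assumptions rather than details given in the paper. In practice a wavelet library would normally be used instead.

```python
import numpy as np

def morlet(t, sigma=1.0 / np.sqrt(3.0), f=3.0):
    """Complex Morlet wavelet of Eq. (4)."""
    envelope = sigma / np.sqrt(np.pi) * np.exp(-(sigma * t) ** 2)
    return envelope * np.exp(2j * np.pi * f * t)

def cwt(x, scales, fs, sigma=1.0 / np.sqrt(3.0), f=3.0):
    """Evaluate Eq. (3) on a grid: rows are scales b, columns are translations a."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x)) / fs
    out = np.empty((len(scales), len(x)), dtype=complex)
    for k, b in enumerate(scales):
        # arg[i, j] = (t_j - a_i) / b, with translations a_i taken on the sample grid
        arg = (t[None, :] - t[:, None]) / b
        kernel = np.conj(morlet(arg, sigma, f))
        # Riemann sum over t with step 1 / fs, scaled by 1 / sqrt(|b|)
        out[k] = (kernel * x[None, :]).sum(axis=1) / (fs * np.sqrt(abs(b)))
    return out
```

Because the wavelet at scale b oscillates at f / b, a signal component at frequency f0 produces its largest coefficients near b = f / f0; the magnitude `np.abs(cwt(...))` is what would be rendered as the TFR image.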

2.3. Vision Transformer

The transformer model, introduced in 2017, is a pivotal advancement in natural language processing (NLP). It has become a mainstream model in the NLP field due to its efficient inference speed, and it has been widely used in tasks such as named entity recognition, text generation, classification, and speech recognition [32]. The transformer utilizes the attention mechanism to capture the dependencies across different positions in the input sequence and thus handles long-distance dependencies better. The basic components of this model are the encoder and decoder, with each encoder consisting of a multi-head attention layer and a feedforward neural network [33]. Through multi-head attention, the model can attend to various positions concurrently, while the feedforward neural network is responsible for the nonlinear transformations and mappings of features at each position. To enhance the expressiveness of the model, multiple encoders can be stacked together to extract more complex features. After the encoder, a decoder can also be added to generate sequences or make predictions for downstream tasks.

A traditional transformer is limited to fixed-length input sequences when processing images. To overcome this problem, Dosovitskiy et al. introduced the ViT architecture, which addresses this challenge by splitting the image into patches and using the resulting patch sequence as input [34]. The ViT module comprises an embedding layer, a transformer encoder, and a multilayer perceptron (MLP) head.

2.3.1. Embedding Layer

The embedding layer converts the input image patches into compact vector representations. In addition, it embeds positional information into each patch and prepends a learnable class token to the patch sequence. In this paper, the TFR is denoted as

X \in \mathbb{R}^{H \times W \times C}

, where H is the height, W is the width, and C is the number of channels of the TFR. The TFR is divided into multiple patches, and the size of each patch is

P \times P

.

The total number of patches is

N = \frac{H W}{P^{2}}

, where each flattened patch is denoted

d \in \mathbb{R}^{1 \times (P^{2} C)}

. Therefore, after flattening and passing through the linear projection layer, the input tensor

X \in \mathbb{R}^{H \times W \times C}

becomes

D \in \mathbb{R}^{N \times (P^{2} C)}

.
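The patch splitting and embedding described above can be illustrated with a short NumPy sketch. The function names, the row-major patch ordering, and the randomly initialized projection, class token, and positional embedding in the usage note are assumptions made for illustration; in a trained ViT these parameters are learned.

```python
import numpy as np

def patchify(image, P):
    """Split an (H, W, C) image into N = H*W / P^2 flattened patches."""
    H, W, C = image.shape
    if H % P or W % P:
        raise ValueError("H and W must be divisible by the patch size P")
    grid = image.reshape(H // P, P, W // P, P, C)
    # Reorder to (patch row, patch column, P, P, C), then flatten each patch.
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)

def embed(image, P, W_proj, cls_token, pos_embed):
    """Linear projection, class token, and positional embedding."""
    tokens = patchify(image, P) @ W_proj     # (N, D)
    tokens = np.vstack([cls_token, tokens])  # (N + 1, D)
    return tokens + pos_embed                # (N + 1, D)
```

For a hypothetical 32 × 32 × 3 TFR with P = 8, `patchify` yields N = 16 patches of length 192, and `embed` with a 192 × 192 projection returns a 17 × 192 token sequence.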

2.3.2. Transformer Encoder

The transformer encoder is a dedicated architecture for extracting features, consisting of several stacked encoder blocks. Each encoder block includes multi-head attention, a dropout layer, an MLP, and an activation function. Compared to the limitations of traditional CNN algorithms in capturing global features, the vision transformer integrates spatial structure into the input sequence by adding positional encodings to the patches through an embedding layer. The encoder layer can effectively capture global information from the input data through the multi-head attention. The weights in the attention mechanism are calculated as follows:

\mathrm{Attention} (Q_{i}, K_{i}, V_{i}) = \mathrm{softmax} \left(\frac{Q_{i} K_{i}^{T}}{\sqrt{d_{k}}}\right) V_{i}

(5)

(Q_{i}, K_{i}, V_{i}) = (W_{Q}, W_{K}, W_{V}) X

(6)

where

d_{k}

represents the dimension of the key vectors, whose square root serves as the scaling factor,

W_{Q}

,

W_{K}

, and

W_{V}

represent three matrices that need to be learned,

X

represents the input obtained from the embedding layer, and

Q_{i}

,

K_{i}

, and

V_{i}

represent the query, key, and value matrices, respectively. The attention weight is obtained by multiplying the output of

Q_{i}

with the transpose of

K_{i}

. Subsequently, these computed weights are scaled and normalized by the softmax function. Finally, the resulting weights are used to form a weighted combination of the rows of

V_{i}

to obtain an adaptive TFR. The multi-head attention learns various features by using multiple attention heads and combines these features after linear transformations to generate the final feature vector. This approach enables the model to attend to different aspects of the input information in parallel, thereby generating a more comprehensive representation. The multi-head attention is calculated as follows:

\mathrm{MultiHead} (Q_{i}, K_{i}, V_{i}) = \mathrm{Concat} (H_{1} W, H_{2} W, \dots, H_{n} W)

(7)

where

H

represents the attention heads and

W

is the parameter matrix of linear projection, which is used to project the connected vector into the final output feature space.
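The attention computation of Equations (5) to (7) can be sketched in NumPy as follows. The sketch treats each row of X as one token and uses the widely used formulation in which the concatenated heads are projected once by an output matrix; this layout, the function names, and the weight shapes are assumptions for illustration and may differ in detail from the authors' implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention of Eq. (5)."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.swapaxes(-1, -2) / np.sqrt(d_k))
    return weights @ V, weights

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (N, D) tokens; W_q, W_k, W_v, W_o: (D, D); D divisible by n_heads."""
    N, D = X.shape
    d_k = D // n_heads

    def split(M):
        # (N, D) -> (n_heads, N, d_k)
        return M.reshape(N, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    heads, _ = attention(Q, K, V)                     # (n_heads, N, d_k)
    concat = heads.transpose(1, 0, 2).reshape(N, D)   # concatenate the heads
    return concat @ W_o
```

Each row of the attention weight matrix sums to 1, so every output token is a convex combination of the value vectors before the final projection.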

2.3.3. MLP Head

The MLP head is made up of several fully connected layers and nonlinear activation functions. The MLP head takes the feature representation produced by the encoder as input and maps these features to the corresponding representation space through multiple fully connected layers, ultimately generating the prediction results.

3. Proposed Method

3.1. Local Feature Extraction Module MSCBAM

The MSCBAM is embedded between the embedding layer and the transformer encoder layer and mainly contains two modules: the CAM and the SAM [35]. The CAM highlights important semantic information by adaptively learning the significance weight of each channel within the TFRs. Adaptive average pooling (AAP) and adaptive maximum pooling (AMP) are employed to extract global features along the channel dimension. Subsequently, multilayer perceptrons (MLPs) with different scale factors process the output of each pooling layer, and a ReLU activation function is applied after each MLP to introduce nonlinearity. The ReLU-activated outputs are then passed through a sigmoid function to produce the channel-attention weights. The SAM captures spatial context information by operating on the feature maps that have been reweighted by the channel-attention weights from the CAM. It utilizes AMP and AAP to extract the maximum and mean values across channels at each spatial position. The two resulting feature maps are concatenated along the channel dimension to form a new feature map. Dynamic convolution is applied to capture spatial features at various scales. The outputs of all convolutions are summed, the sigmoid activation function generates the spatial attention weights, and the spatial features are adjusted by multiplying these weights with the input feature map of the SAM. The MSCBAM can capture the important channels and spatial locations in the TFR to enhance the capability of the MSCVIT model for local feature extraction. Figure 1 illustrates the design of the MSCBAM.

3.1.1. Channel-Attention Module

The core concept of the CAM is to use global pooling to calculate an importance weight for each channel, which is then applied to the respective channel. The following outlines the calculation steps of the CAM. First, the input TFRs undergo parallel processing through global average pooling (GAP) and global max pooling (GMP), implemented here by the AAP and AMP layers, with the goal of compressing the spatial dimensions of the TFR to 1 × 1, thus obtaining the mean value and the highest value of each channel. These two values are then input into MLPs with a reduction factor and different scales to compute the channel-attention weights. The reduction factor is set to 16 to decrease the channel count and thus lower the computational burden [36]. The scales are set to 1, 2, and 4, and the multi-scale channel-attention mechanism helps the network identify key channels at different scales. This lets the network attend to both fine details and broader patterns, improving its ability to express complex relationships and its performance. The mathematical expression for the number of channels is:

C_{\mathrm{out}} = \frac{C_{\mathrm{in}}}{R \times \mathrm{MLP}_{s}}

(8)

where

C_{\mathrm{in}}

represents the number of input channels and

C_{\mathrm{out}}

the number of output channels.

R

is the reduction factor, and

s

represents the different scales.

The model captures the significance of different channels by using reduction factors and different scales. After the ReLU operation is applied, the outputs are summed element-wise, and the sigmoid is then utilized to generate the channel-attention map. In this way, the CAM emphasizes the vital features of the TFR. During gradient backpropagation, the GAP layer passes feedback to every pixel of the TFR, whereas the GMP layer passes feedback only to the location with the highest response. The mathematical expression of the CAM is:

x^{'} = σ \left[\sum_{s \in \{1, 2, 4\}}^{R = 16} \mathrm{MLP}_{s} (\mathrm{AMP} (x)) + \sum_{s \in \{1, 2, 4\}}^{R = 16} \mathrm{MLP}_{s} (\mathrm{AAP} (x))\right] * x

(9)

where

x

is the input TFR,

x^{'}

is the output TFR, and

σ

is the sigmoid function.
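Equation (9) can be illustrated with the following NumPy sketch. Several details are assumptions made for illustration: each MLP is taken to be a two-layer bottleneck whose hidden width is C / (R × s), which is one reading of Equation (8); the ReLU is placed inside the bottleneck as in the original CBAM design; and the weights are random rather than learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_mlps(C, R=16, scales=(1, 2, 4), seed=0):
    """One bottleneck MLP per scale, with assumed hidden width C / (R * s)."""
    rng = np.random.default_rng(seed)
    mlps = []
    for s in scales:
        hidden = max(C // (R * s), 1)
        W1 = rng.normal(0.0, 0.1, (C, hidden))
        W2 = rng.normal(0.0, 0.1, (hidden, C))
        mlps.append((W1, W2))
    return mlps

def channel_attention(x, mlps):
    """x: (C, H, W) feature map; returns the reweighted map and the weights."""
    amp = x.max(axis=(1, 2))   # adaptive max pooling to 1 x 1
    aap = x.mean(axis=(1, 2))  # adaptive average pooling to 1 x 1
    total = np.zeros(x.shape[0])
    for W1, W2 in mlps:
        for pooled in (amp, aap):
            total += np.maximum(pooled @ W1, 0.0) @ W2  # ReLU in the bottleneck
    weights = sigmoid(total)   # one weight in (0, 1) per channel
    return weights[:, None, None] * x, weights
```

With C = 64 and R = 16, the three scales give hidden widths of 4, 2, and 1, so the coarser scales compress the channel descriptor more aggressively.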

3.1.2. Spatial Attention Module

As shown in Figure 1, after the CAM processing, the SAM applies AMP and AAP along the channel dimension to obtain two new feature maps. After the two feature maps are concatenated, they are input into the dynamic convolutional layer, which enhances the focus on important regions. The convolution kernels have different sizes to capture multi-scale information. The smaller convolution kernels focus more on detailed features, while the larger convolution kernels capture global context information. Similarly, the results are normalized by a sigmoid activation function to generate the spatial attention map, which is then applied to the feature map. The calculation equation of the SAM is:

x^{''} = σ \left[\sum_{i \in \{1, 3, 5, 7\}} \mathrm{CK}_{i} (\mathrm{Concat} (\mathrm{AMP} (x^{'}), \mathrm{AAP} (x^{'})))\right] * x^{'}

(10)

where

x^{'}

denotes the input TFR and

x^{''}

represents the SAM’s output.

\mathrm{CK}_{i}

is the convolution kernel of size i × i.
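Equation (10) can be illustrated with the NumPy sketch below. It is a simplified stand-in: ordinary zero-padded convolutions with fixed random kernels are used in place of the dynamic convolution layer, and the helper names and kernel initialization are assumptions for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, kernel):
    """Zero-padded cross-correlation. x: (C_in, H, W); kernel: (C_in, k, k), k odd."""
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    windows = sliding_window_view(xp, (k, k), axis=(1, 2))  # (C_in, H, W, k, k)
    return np.einsum("chwij,cij->hw", windows, kernel)

def spatial_attention(x, kernels):
    """x: (C, H, W) output of the CAM; kernels: list of (2, k, k) arrays."""
    # Max and mean over the channel axis, stacked as a 2-channel map.
    pooled = np.stack([x.max(axis=0), x.mean(axis=0)])
    total = sum(conv2d_same(pooled, kern) for kern in kernels)
    weights = sigmoid(total)  # (H, W) spatial attention map
    return weights[None, :, :] * x, weights
```

Passing kernels of sizes 1, 3, 5, and 7 reproduces the multi-scale sum in Equation (10); the small kernels respond to fine detail and the large ones to wider context.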

3.2. Proposed MSCVIT Framework

The features contained in the original vibration signal of rolling bearings are not obvious, and are easily disturbed by noise during machine operation. First, the SVD method is utilized to filter out the noise in the raw signal. Then, the CWT is applied to transform 1D denoised signals to generate 2D TFRs. To address the shortcomings of traditional ViT networks in local feature extraction, we have integrated the MSCBAM into the ViT network, which enhances the robustness and generalization of the diagnostic model. This module extracts features from the time–frequency representation (TFR) across the independent channel and spatial dimensions. It then passes the TFR, which contains local features, to the ViT network for global feature extraction. The improved MSCVIT model effectively combines the advantages of the MSCBAM in local feature extraction with the powerful capabilities of the ViT in capturing global features and long-range dependencies, thereby addressing the deficiencies of traditional transformers in time–frequency feature extraction. The architecture of MSCVIT is illustrated in Figure 2.

4. Datasets

4.1. CWRU Dataset

In this section, the data utilized are from the CWRU dataset. The experimental platform includes a 1.5 kW electric motor, a torque transducer, and a dynamometer [37]. At the motor’s drive end (DE), an accelerometer is installed to capture signals, with the sampling frequency set to 12 kHz. In addition to the normal condition, faults are considered at three positions of each bearing: the ball, the inner ring, and the outer ring. The outer ring fault encompasses fault points at 3, 6, and 12 o’clock, but only the 6 o’clock fault point is utilized in the experiment. Each type of fault contains three different diameters: 0.007, 0.014, and 0.021 inches. Each fault diameter is recorded under four distinct motor loads (0 HP, 1 HP, 2 HP, and 3 HP), which correspond to motor speeds of 1797, 1772, 1750, and 1730 rpm, respectively. Therefore, with three fault positions and three fault diameters plus the normal condition, the CWRU dataset contains a total of 10 operating conditions.

In Table 1, the constructed datasets are labeled A, B, and C. Each dataset is divided into training, validation, and test sets in a ratio of 5:2.5:2.5, giving 2000 training TFRs, 1000 validation TFRs, and 1000 test TFRs, for a total of 4000 TFRs; each TFR is generated from 1024 data points. The distribution details of the datasets are outlined in Table 1.

4.2. Other Datasets

4.2.1. Bearing Dataset of Jiangnan University

In this section, the bearing data of Jiangnan University (JNU) are utilized. The experiment adopted two different kinds of bearings, N205 and NU205 [38]. The normal, outer ring, and ball fault data are sampled from the N205 bearing, whereas the inner ring data are sampled from the NU205 bearing. The accelerometer operates at a sampling rate of 50 kHz, with a data collection duration of 20 s. There are four health conditions in total, each recorded at three different motor speeds: 600 rpm, 800 rpm, and 1000 rpm. In this experiment, only the data recorded at 800 rpm are utilized. The distribution details of the datasets are outlined in Table 2.

4.2.2. Gear Dataset of Connecticut University

The gear dataset of Connecticut University (CU) is collected from a two-stage reducer with replaceable gears [39]. The motor regulates the speed of the gear, and the adjustable brake regulates the torque. With the sampling frequency set to 20 kHz, nine distinct gear conditions are recorded from the pinions on the first and second input shafts. These conditions include healthy, missing teeth, cracks, peeling, and five varying degrees of wear. The distribution details of the datasets are outlined in Table 3.

4.2.3. Variable Condition Bearing Vibration Dataset of Ottawa University

The dataset used in this section is from Ottawa University (OU) and contains bearing vibration signal data collected at different motor speeds [40]. The experimental platform is composed of a motor, AC drive, encoder, coupling, rotor, and bearing. Two bearings are utilized for this experiment: a standard bearing and an experimental bearing, which can be replaced with bearings of different fault types. An accelerometer is installed on the experimental bearing housing to collect vibration signals at a sampling frequency of 200 kHz. The obtained data comprise vibration signals corresponding to three health conditions: normal, inner ring fault, and outer ring fault. These signals are recorded across four distinct speed profiles: acceleration, deceleration, acceleration followed by deceleration, and deceleration followed by acceleration. The distribution details of the datasets are outlined in Table 4.

5. Experimental Results

5.1. CWRU Experimental Results

5.1.1. SVD for Noise Reduction

A certain amount of Gaussian white noise is introduced into the raw signal to simulate the noise disturbances during the signal acquisition process in a real environment. The energy threshold method is used for reconstructing the order and data to reduce noise. In this paper, the first 500 singular values are reconstructed.
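The noise-injection step can be sketched as follows. The paper does not state the noise level, so the signal-to-noise ratio `snr_db` is a hypothetical parameter introduced here for illustration, as is the function name.

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=None):
    """Add zero-mean Gaussian white noise at a target SNR given in dB."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / 10.0 ** (snr_db / 10.0)
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```

A lower `snr_db` produces a noisier signal, which makes it possible to check how the SVD reconstruction behaves as the disturbance grows.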

SVD ensures that the principal components occupy most of the energy of the data while removing insignificant noise components. However, a higher threshold retains more singular values and may therefore retain too much noise, whereas a lower threshold discards more components and may lose details of the signal [41]. To balance data reconstruction quality and noise suppression, we refer to the SVD denoising method in the literature and set the energy threshold to 0.7, which can effectively denoise while retaining the most important information [31]. Figure 3 shows that, with the energy threshold set to 0.7, the sum of the energy of the first 20 singular values reaches 70% of the total energy. This indicates that the first 20 singular values correspond to the main features of the data. In contrast, the smaller singular values ranked after them contain less key information and have limited explanatory power, mainly corresponding to noise or unimportant features.

In Figure 4, the noise interference of the denoised signal is significantly less, the overall periodicity is more obvious, and the signal quality is significantly improved.

5.1.2. Time–Frequency Representation

In this paper, the signal denoised by SVD is converted into TFRs by the CWT algorithm. Figure 5a–j show the two-dimensional TFRs of the denoised rolling bearing signals under the 10 operating conditions. The CWT algorithm describes the local and global information of the signals across the time–frequency domain, depicting signal features across scales and identifying changes in specific frequency components. This helps models comprehensively understand the time–frequency characteristics of signals.

5.1.3. Model Comparison Results

The experimental data are from the CWRU dataset, and the diagnostic capability of the MSCVIT model is verified by experiments. The experimental setup comprises an i7-12400H CPU and an RTX 3050 Ti GPU. CUDA 11.6, Python 3.8, and PyTorch 1.13 are used to create the deep learning environment. The initial learning rate is 0.001, the weight decay factor is 5 × 10⁻⁵, and the optimization algorithm is SGD. The proposed MSCVIT model is trained for 100 iterations, and the training is repeated 5 times.

The diagnostic capability of the MSCVIT model is assessed by comparing it with six deep learning models: CNN, WDCNN, ResNet50, EfficientNetV2, MobileViT, and Swin Transformer. Each model takes TFRs as input and uses the same data preprocessing methods and hyperparameters. Table 5 reports the models’ training time, number of parameters, and diagnostic accuracy. As seen in the table, with the same data preprocessing, the proposed model balances training time and test accuracy effectively and achieves the highest test accuracy, 99.31%.

In addition, as shown in Figure 6, the proposed method gradually stabilizes after 30 rounds of iterative training. The curve converges gradually, indicating the model has strong learning and fitting abilities. These results demonstrate that the model can effectively suppress overfitting and exhibits good robustness.

In Figure 7, the model’s classification ability is assessed using 1000 test samples. The figure shows that the model’s accuracy exceeds 90% for every category. For inner ring faults, the recognition rates for labels 3 and 5 exceed 97%. For label 4, 3% of the samples are misidentified as an outer ring fault and 4% are misidentified as normal. Under a mild fault, the TFRs of healthy bearings, inner ring faults, and outer ring faults overlap in a certain frequency band. When the vibration signal characteristics are not obvious, this overlap leads to classification errors. Labels 1 and 2 (ball fault) have a recognition rate of 99%. These results show that the model is highly effective in recognizing the 10 bearing conditions, verifying the feasibility of the proposed model.
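The percentages quoted from the confusion matrix are row-normalized counts. The short sketch below shows that calculation for a single row. The 4% and 3% error shares follow the text, but the row itself is hypothetical: it assumes 100 test samples for label 4, no other errors in that row, and that the outer ring errors fall in column 7.

```python
def row_rates(row):
    """Convert one confusion-matrix row of counts into percentages."""
    total = sum(row)
    return [100.0 * count / total for count in row]


# True label 4; columns are predicted labels 0 to 9.
# Column 0 is the normal class; column 7 as the outer ring class is an
# assumption made only for this example.
label4_counts = [4, 0, 0, 0, 93, 0, 0, 3, 0, 0]
label4_rates = row_rates(label4_counts)
recognition_rate = label4_rates[4]   # share of label-4 samples predicted as 4
```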

5.1.4. Experimental Results of Different Noise Interference

To evaluate the model’s diagnostic capability under different levels of noise interference, we conduct experiments at various signal-to-noise ratios (SNRs). Figure 8 shows the diagnostic accuracy of the seven models at SNR = −15 dB, 0 dB, and 15 dB. The SNR is the ratio of signal power to noise power. It is typically expressed in decibels (dB), as follows:

$$\mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10}\left(\frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}}\right)$$

where $P_{\mathrm{signal}}$ is the signal power and $P_{\mathrm{noise}}$ is the noise power. The higher the SNR, the stronger the signal relative to the noise.
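Inverting this definition gives the noise power needed to reach a target SNR, which is one common way to generate test signals at −15 dB, 0 dB, and 15 dB. The sketch below is an assumption about how such signals could be produced, not a description of the authors' code. The function names and the synthetic sinusoid are hypothetical.

```python
import numpy as np


def add_gaussian_noise(signal, target_snr_db, rng=None):
    """Add white Gaussian noise so the result has the requested SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(signal, dtype=float)
    signal_power = np.mean(x**2)
    # From SNR_dB = 10 log10(P_signal / P_noise)
    noise_power = signal_power / (10.0 ** (target_snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=x.shape)
    return x + noise, noise


def measure_snr_db(signal, noise):
    """Evaluate the SNR definition on a signal and the noise added to it."""
    return 10.0 * np.log10(np.mean(np.square(signal)) / np.mean(np.square(noise)))


# Hypothetical example: check the three SNR levels used in the experiments
rng = np.random.default_rng(0)
t = np.arange(20000) / 12000.0
clean = np.sin(2 * np.pi * 100.0 * t)
measured = {}
for target in (-15, 0, 15):
    noisy, noise = add_gaussian_noise(clean, target, rng)
    measured[target] = measure_snr_db(clean, noise)
```

At −15 dB the noise power is about 31.6 times the signal power, which is why the lowest SNR is the hardest case.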

Figure 8 shows that as the SNR decreases from 15 dB to −15 dB, the signal becomes increasingly mixed with noise, and diagnostic accuracy generally declines. The histogram illustrates that the MSCVIT model achieves the highest diagnostic accuracy at every SNR tested, indicating that the proposed model diagnoses faults more reliably than the other network models under these noise conditions. Table 6 shows that the average diagnostic accuracy of the MSCVIT model is 0.4 percentage points higher than that of the second-ranked Swin Transformer model.

5.1.5. MSCVIT Model Migration Ability

Cross-domain experiments are designed to assess the generalization ability of the MSCVIT model. Considering the complexity and diversity of actual working conditions, one dataset is selected for training, and the trained model is then transferred to the other two datasets. With the previously defined datasets A, B, and C, this gives six cases: A-B, A-C, B-A, B-C, C-A, and C-B. Each experiment is run for 50 iterations and repeated 5 times.
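The six cases are the ordered source-target pairs of the three datasets. A small sketch of how the experiment schedule could be enumerated is given below. The variable names are hypothetical, and the training and evaluation steps themselves are omitted.

```python
from itertools import permutations

domains = ["A", "B", "C"]

# Ordered (source, target) pairs: train on the first, test on the second
transfer_cases = [f"{src}-{tgt}" for src, tgt in permutations(domains, 2)]

# Each case is repeated 5 times, as in the experiments
repeats = 5
schedule = [(case, run) for case in transfer_cases for run in range(1, repeats + 1)]
```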

As depicted in Figure 9, in the migration experiments the MSCVIT method outperforms the six other models in five of the six cases under identical experimental conditions (Table 7; CNN is slightly higher only for C-B). The MSCVIT model attains its highest diagnostic accuracy in the A-B case. In the C-A case, however, the accuracy of the seven models decreases to different degrees. According to reference [32], the difference in motor speed between dataset A and dataset C is the main reason for the reduced diagnostic accuracy.

As shown in Table 7, the MobileViT network exhibits the lowest average accuracy, at only 78.9%. Similarly, the average accuracies of the ResNet50 and EfficientNetV2 models are below 84%. In contrast, the CNN, WDCNN, and Swin Transformer models demonstrate better diagnostic accuracies, all reaching at least 90.0%. The MSCVIT model achieves the highest average diagnostic accuracy, 94.2%, which is 2.4 percentage points higher than the second-ranked Swin Transformer. This highlights the MSCVIT model’s ability in domain generalization.
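The averages and the 2.4-point margin can be reproduced from the per-case values in Table 7. The snippet below does this for the two top-ranked models. The dictionary layout and variable names are chosen only for illustration.

```python
# Accuracy (%) for A-B, A-C, B-A, B-C, C-A, C-B, copied from Table 7
table7 = {
    "Swin Transformer": [97.6, 94.3, 89.0, 91.5, 87.5, 90.8],
    "MSCVIT": [98.8, 95.0, 93.3, 95.7, 90.2, 92.1],
}

# Mean over the six transfer cases, rounded to one decimal as in the table
averages = {name: round(sum(vals) / len(vals), 1) for name, vals in table7.items()}

# Margin of MSCVIT over the second-ranked model, in percentage points
margin = round(averages["MSCVIT"] - averages["Swin Transformer"], 1)
```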

5.2. Other Datasets’ Experimental Results

5.2.1. Feature Extraction Ability on OU Dataset

To assess the performance of the MSCVIT model under variable operating conditions, experiments are carried out on the OU dataset. The experiment uses 180 test TFRs. Figure 10 shows the MSCVIT’s feature extraction capability under four speed conditions. In cases b and c, only a few normal and inner ring fault samples are misdiagnosed as outer ring faults. In contrast, in cases a and d the categories are clearly separated and all three conditions are fully identified. In general, our proposed MSCVIT model can clearly distinguish the clusters.

5.2.2. Migration Experiments on Different Datasets

The JNU, CU, and OU datasets undergo the same preprocessing as the CWRU dataset. The seven models trained on the CWRU dataset are then applied to the JNU dataset, the CU dataset, and Dataset A of the OU dataset. As shown in Figure 11, the EfficientNetV2 network achieves the worst results on both the JNU and OU datasets. Conversely, the MSCVIT model exhibits the highest diagnostic accuracy on all three datasets, surpassing 90% on each.

Table 8 shows that the average diagnostic accuracy of the EfficientNetV2, ResNet50, and WDCNN models is less than 90%. Among these, the EfficientNetV2 model has the lowest average accuracy, at 78.1%. In contrast, the average diagnostic accuracy of the CNN, MobileViT, Swin Transformer, and MSCVIT models exceeds 92%. Notably, the MSCVIT model surpasses the other six comparison models with an average accuracy of 95.8%, which is 2.8, 3.2, and 2.5 percentage points above the CNN, MobileViT, and Swin Transformer models, respectively. These results indicate that the MSCVIT model generalizes well even under complex working conditions.

5.2.3. Ablation Experiments

Each experiment removes or replaces a specific module of the overall architecture: (1) removing the SVD module, where “r” denotes removal; (2) replacing the CWT module with the Wigner–Ville distribution, where “re” denotes replacement; and (3) removing the MSCBAM and using only the ViT model for diagnosis. The datasets used in this section are again the JNU dataset, the CU dataset, and Dataset A of the OU dataset. Table 9 shows that when the trained MSCVIT model is transferred to the three datasets, removing or replacing any module reduces diagnostic accuracy. Specifically, the average diagnostic accuracy drops by 3.0 percentage points without the MSCBAM. This underscores the importance of the MSCBAM in the MSCVIT model, as it helps the model focus on key feature areas, thereby enhancing performance and generalization capability.
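For reference, the drop in average accuracy for each ablation can be read directly from the Average column of Table 9. The symbols $\Delta$ are introduced here only for illustration, and all differences are in percentage points:

```latex
\begin{aligned}
\Delta_{\mathrm{SVD}}    &= 95.8 - 80.6 = 15.2 \\
\Delta_{\mathrm{CWT}}    &= 95.8 - 88.2 = 7.6 \\
\Delta_{\mathrm{MSCBAM}} &= 95.8 - 92.8 = 3.0
\end{aligned}
```

On these averages, removing the SVD denoising step costs the most, largely because of the OU dataset result (58.3%).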

To validate the modified MSCBAM’s performance, comparative experiments are carried out with the classic CBAM on three datasets. We evaluate both models using four metrics: accuracy (Acc), precision (Pre), recall (Re), and F1 score (F1). Table 10 shows that the modified MSCBAM outperforms the traditional CBAM on all four metrics on all three datasets. Specifically, the accuracy increases by 10.9, 0.4, and 0.8 percentage points, while the F1 score improves by 11.9, 0.2, and 0.8 percentage points on the three datasets, respectively. These results confirm the effectiveness of the modified MSCBAM.
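The four metrics can all be derived from a confusion matrix. The sketch below uses macro averaging over classes, which is an assumption, since the averaging scheme is not stated in this section. The function name and the two-class counts are hypothetical.

```python
def macro_metrics(confusion):
    """Accuracy plus macro-averaged precision, recall, and F1.

    Rows are true labels and columns are predicted labels.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted = sum(confusion[i][k] for i in range(n))  # column sum
        actual = sum(confusion[k])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "Acc": correct / total,
        "Pre": sum(precisions) / n,
        "Re": sum(recalls) / n,
        "F1": sum(f1s) / n,
    }


# Hypothetical two-class example with 10 test samples per class
metrics = macro_metrics([[8, 2], [1, 9]])
```

With balanced classes, as in Tables 1 to 4, macro-averaged recall equals accuracy, which matches the closely aligned Acc and Re columns in Table 10.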

In this experiment, we vary the number of layers in the transformer encoder and the number of heads in the multi-head attention to assess their effect on the model’s performance. The transformer encoder of the MSCVIT model used in this paper consists of 8 layers, and the attention mechanism has 12 heads. On this basis, we carry out experiments on the CU dataset, setting the number of encoder layers to 4, 8, and 12 and the number of attention heads to 8, 12, and 16. As Table 11 shows, the configuration used by our model (8 layers, 12 heads) achieves the best diagnostic accuracy, 98.1%, among the nine settings tested. This result supports the choice of model parameters.
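The nine settings form a 3 × 3 grid. The sketch below pairs each (layers, heads) combination with the accuracy reported in Table 11 and picks the best one. It only organizes the published numbers and does not train any model. The variable names are illustrative.

```python
from itertools import product

encoder_layers = (4, 8, 12)
attention_heads = (8, 12, 16)

# Accuracy (%) on the CU dataset, in the column order of Table 11
reported = [95.9, 96.9, 95.8, 97.1, 98.1, 96.9, 95.4, 97.5, 97.0]

# Map each (layers, heads) pair to its reported accuracy
grid = dict(zip(product(encoder_layers, attention_heads), reported))

best_config = max(grid, key=grid.get)
best_accuracy = grid[best_config]
```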

Figure 12 shows that when the MSCVIT model diagnoses the four fault categories of the JNU dataset, the diagnostic accuracy for each category exceeds 90%. In contrast, when only the ViT model is used, its performance is lower than that of the MSCVIT model, especially the accuracy of fault category 1, which is only 80%. When diagnosing the nine fault categories in the CU dataset, the MSCVIT model achieves over 95% diagnostic accuracy for each category. When diagnosing the three fault categories in the OU dataset, the MSCVIT model achieves over 96% diagnostic accuracy for each category. However, the diagnostic accuracy of the ViT model for category 0 and category 2 is only 83% and 86%, respectively. The results further confirm the importance of the MSCBAM in the MSCVIT model.

6. Conclusions

In this paper, we present a new MSCVIT fault diagnosis network with two main contributions. First, the MSCBAM adjusts the channel and spatial importance of the input TFRs through the CAM and SAM. It employs different reduction factors and convolution kernels of various sizes to improve the model’s attention to local features, thereby addressing the shortcomings of the transformer in extracting local information. Second, the MSCBAM is embedded in the ViT, improving the generalization performance of MSCVIT. As a result, the model extracts local and global features more comprehensively and adapts more effectively to various fault types and operating conditions. The performance of MSCVIT is validated through migration experiments and model comparison tests, demonstrating its advantages and broad prospects in mechanical equipment failure detection. Although the method achieves good diagnostic results in classification, differences in motor speed between machines reduce diagnostic accuracy in the transfer experiments. Therefore, future research should explore methods to enhance the robustness of the MSCVIT model. On the one hand, more effective feature normalization strategies can be introduced to reduce the influence of load changes on the TFRs. On the other hand, domain adaptation techniques can be incorporated to strengthen the model’s generalization capacity.

Author Contributions

Conceptualization, X.-Y.L. and T.-T.G.; methodology, D.-L.H.; software, D.-Q.G.; data curation, D.-L.H. and D.-Q.G.; validation, X.-Y.L. and T.-T.G.; formal analysis, X.-Y.L. and D.-L.H.; investigation, X.-Y.L., D.-L.H., D.-Q.G. and T.-T.G.; visualization, D.-L.H.; writing—original draft preparation, D.-L.H. and D.-Q.G.; writing—review and editing, D.-L.H.; supervision, T.-T.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Shandong Province, under grants ZR2020QF008 and ZR2023QF013 and the National Natural Science Foundation of China, under grant 62001262.

Data Availability Statement

The data presented in this study are available in this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Jiang, Z.; Zhang, K.; Xiang, L.; Yu, G.; Xu, Y. A time-frequency spectral amplitude modulation method and its applications in rolling bearing fault diagnosis. Mech. Syst. Signal Process. 2023, 185, 109832.
2. Wang, L.; Li, Y.; Chen, K.; Li, C. A novel deep convolution multi-adversarial domain adaptation model for rolling bearing fault diagnosis. Measurement 2022, 191, 110752.
3. Cui, L.; Tian, X.; Wei, Q.; Liu, Y. A self-attention based contrastive learning method for bearing fault diagnosis. Expert Syst. Appl. 2024, 238, 121645.
4. Shen, S.; Lu, H.; Sadoughi, M.; Hu, C. A physics-informed deep learning approach for bearing fault detection. Eng. Appl. Artif. Intell. 2021, 103, 104295.
5. Ni, Q.; Ji, J.C.; Halkon, B.; Feng, K.; Nandi, A.K. Physics-Informed Residual Network (PIResNet) for rolling element bearing fault diagnostics. Mech. Syst. Signal Process. 2023, 200, 110544.
6. Borghesani, P.; Smith, W.A.; Randall, R.B.; Antoni, J.; El Badaoui, M.; Peng, Z. Bearing signal models and their effect on bearing diagnostics. Mech. Syst. Signal Process. 2022, 174, 109077.
7. Zhang, S.; Wang, B.; Kanemaru, M.; Lin, C.; Liu, D.; Miyoshi, M.; Teo, K.H.; Habetler, T.G. Model-Based Analysis and Quantification of Bearing Faults in Induction Machines. IEEE Trans. Ind. Appl. 2020, 56, 2158–2170.
8. Liu, X.; Huang, H.; Xiang, J. A Personalized Diagnosis Method to Detect Faults in a Bearing Based on Acceleration Sensors and an FEM Simulation Driving Support Vector Machine. Sensors 2020, 20, 420.
9. You, K.; Wang, P.; Huang, P.; Gu, Y. A sound-vibration physical-information fusion constraint-guided deep learning method for rolling bearing fault diagnosis. Reliab. Eng. Syst. Saf. 2024, 253, 110556.
10. Luczak, D. Data-Driven Rotary Machine Fault Diagnosis Using Multisensor Vibration Data with Bandpass Filtering and Convolutional Neural Network for Signal-to-Image Recognition. Electronics 2024, 13, 2940.
11. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276.
12. Wang, H.; Zhang, H.; Li, S.; Wu, D. Gated parametric neuron for spike-based audio recognition. Neurocomputing 2024, 609, 128477.
13. Sousa, S.; Karn, R. How to keep text private? A systematic review of deep learning methods for privacy-preserving natural language processing. Artif. Intell. Rev. 2023, 56, 1427–1492.
14. Yu, Z.; Wang, K.; Wan, Z.; Xie, S.; Lv, Z. Popular deep learning algorithms for disease prediction: A review. Clust. Comput. 2023, 26, 1231–1251.
15. Zhang, J.; Sun, Y. A new bearing fault diagnosis method based on modified convolutional neural networks. Chin. J. Aeronaut. 2020, 33, 439–447.
16. Li, J.; Liu, Y.; Li, Q. Intelligent fault diagnosis of rolling bearings under imbalanced data conditions using attention-based deep learning method. Measurement 2022, 189, 110500.
17. Chen, X.; Zhang, B.; Gao, D. Bearing fault diagnosis base on multi-scale CNN and LSTM model. J. Intell. Manuf. 2021, 32, 971–987.
18. Gu, K.; Zhang, Y.; Li, H. DWT-LSTM-based fault diagnosis of rolling bearings with multi-sensors. Electronics 2021, 10, 2076.
19. An, Z.; Li, S. A novel bearing intelligent fault diagnosis framework under time-varying working conditions using recurrent neural network. ISA Trans. 2020, 100, 155–170.
20. Yang, Z.; Liu, X. Research on bearing fault diagnosis method based on transformer neural network. Meas. Sci. Technol. 2022, 33, 085111.
21. Alexakos, C.; Karnavas, Y. A combined short time fourier transform and image classification transformer model for rolling element bearings fault diagnosis in electric motors. Mach. Learn. Knowl. Extr. 2021, 3, 228–242.
22. Li, J.; Liu, W.; Ji, P. Twins transformer: Cross-attention based two-branch transformer network for rotating bearing fault diagnosis. Measurement 2023, 223, 113687.
23. Fan, H.; Ma, N. New intelligent fault diagnosis approach of rolling bearing based on improved vibration gray texture image and vision transformer. Proc. Inst. Mech. Eng. Part C J. Mech. Eng. Sci. 2024, 238, 6117–6130.
24. Xie, F.; Wang, Y.; Wang, G.; Sun, E.; Fan, Q.; Song, M. Fault Diagnosis of Rolling Bearings in Agricultural Machines Using SVD-EDS-GST and ResViT. Agriculture 2024, 14, 1286.
25. Tang, X.; Xu, Z.; Wang, Z. A novel fault diagnosis method of rolling bearing based on integrated vision transformer model. Sensors 2022, 22, 3878.
26. Weng, C.; Lu, B.; Yao, J. A one-dimensional vision transformer with multiscale convolution fusion for bearing fault diagnosis. In Proceedings of the 2021 Global Reliability and Prognostics and Health Management (PHM 2021), Nanjing, China, 15–17 October 2021; pp. 1–6.
27. Ding, Y.; Jia, M. A novel time–frequency Transformer based on self–attention mechanism and its application in fault diagnosis of rolling bearings. Mech. Syst. Signal Process. 2022, 168, 108616.
28. Xiang, L.; Bing, H.; Li, X.; Hu, A. A frequency channel-attention based vision Transformer method for bearing fault identification across different working conditions. Expert Syst. Appl. 2024, 262, 125686.
29. Guo, H.; Zhao, X. Intelligent Diagnosis of Dual-channel Parallel Rolling Bearings Based on Feature Fusion. IEEE Sens. J. 2024, 24, 10640–10655.
30. Li, S.; Zhang, X. A lightweight multi-feature fusion vision transformer bearing fault diagnosis method with strong local sensing ability in complex environments. Meas. Sci. Technol. 2024, 35, 065104.
31. Xie, F.; Wang, G.; Zhu, H.; Sun, E.; Fan, Q.; Wang, Y. Rolling bearing fault diagnosis based on SVD-GST combined with vision transformer. Electronics 2023, 12, 3515.
32. Liu, W.; Zhang, Z.; Zhang, J.; Huang, H.; Zhang, G.; Peng, M. A novel fault diagnosis method of rolling bearings combining convolutional neural network and transformer. Electronics 2023, 12, 1838.
33. Vaswani, A.; Shazeer, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2023, arXiv:1706.03762.
34. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2021, arXiv:2010.11929.
35. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV 2018), Munich, Germany, 8–14 September 2018; pp. 3–19.
36. Hu, J.; Li, S.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. arXiv 2019, arXiv:1709.01507.
37. Neupane, D.; Seok, J. Bearing Fault Detection and Diagnosis Using Case Western Reserve University Dataset with Deep Learning Approaches: A Review. IEEE Access 2020, 8, 93155–93178.
38. Jiang, Y.; Xie, J.; Meng, L.; Jia, H. Multiple Working Condition Bearing Fault Diagnosis Method Based on Channel Segmentation Improved Residual Network. Electronics 2022, 12, 145.
39. Zhu, X.; Yang, D.; Pan, H.; Karimi, H.R.; Ozevin, D.; Cetin, A.E. A novel asymmetrical autoencoder with a sparsifying discrete cosine Stockwell transform layer for gearbox sensor data compression. Eng. Appl. Artif. Intell. 2024, 127, 107322.
40. Sehri, M.; Dumond, P. University of Ottawa constant and variable speed electric motor vibration and acoustic fault signature dataset. Data Brief 2024, 53, 110144.
41. Xie, B.; Xiong, Z.; Wang, Z.; Zhang, L.; Zhang, D.; Li, F. Gamma spectrum denoising method based on improved wavelet threshold. Nucl. Eng. Technol. 2020, 52, 1771–1776.

Figure 1. The structure of the CBAM.

Figure 2. The overall framework of the proposed method (“*” is the positional embedding).

Figure 3. Energy threshold curve of singular value.

Figure 4. (a) Original data; (b) signal comparison diagram before and after noise reduction.

Figure 5. The TFRs of rolling bearings: (a–c) ball fault; (d–f) inner ring fault; (g–i) outer ring fault; (j) normal bearing.

Figure 6. (a) Loss curve; (b) accuracy curve.

Figure 7. Confusion matrix diagram of ten kinds of faults.

Figure 8. Diagnostic accuracy with different SNRs.

Figure 9. Column diagram of model migration ability.

Figure 10. T-SNE dimensional reduction visualization: (a) acceleration; (b) acceleration then deceleration; (c) deceleration then acceleration; (d) deceleration.

Figure 11. The diagnostic effects of the seven models on three datasets.

Figure 12. Model migration results in the three datasets. The JNU dataset (a,b); the CU dataset (c,d); the OU dataset (e,f). (a,c,e) belong to the MSCVIT; (b,d,f) belong to the ViT.

Table 1. CWRU dataset partitioning.

Fault Condition		Normal	Ball Fault			Inner Ring Fault			Outer Ring Fault			Load	RPM
Fault diameter (inch)			0.007	0.014	0.021	0.007	0.014	0.021	0.007	0.014	0.021	/	/
Class labels		0	1	2	3	4	5	6	7	8	9	/	/
Dataset A	Train	200	200	200	200	200	200	200	200	200	200	0	1797
	Validation	100	100	100	100	100	100	100	100	100	100
	Test	100	100	100	100	100	100	100	100	100	100
Dataset B	Train	200	200	200	200	200	200	200	200	200	200	1	1772
	Validation	100	100	100	100	100	100	100	100	100	100
	Test	100	100	100	100	100	100	100	100	100	100
Dataset C	Train	200	200	200	200	200	200	200	200	200	200	2	1750
	Validation	100	100	100	100	100	100	100	100	100	100
	Test	100	100	100	100	100	100	100	100	100	100

Table 2. JNU dataset partitioning.

Bearing Labels			N205		NU205	Load	RPM
Fault location		Normal	Outer ring fault	Ball fault	Inner ring fault	/	/
Class labels		0	1	2	3	/	/
Dataset	Train	400	400	400	400	1	800
	Validation	200	200	200	200
	Test	200	200	200	200

Table 3. CU dataset partitioning.

Fault Location		Normal	Missing Tooth	Root Crack	Spalling	Chipping Tip
Chipping degree		/	/	/	/	1	2	3	4	5
Class labels		0	1	2	3	4	5	6	7	8
Dataset	Train	240	240	240	240	240	240	240	240	240
	Validation	120	120	120	120	120	120	120	120	120
	Test	120	120	120	120	120	120	120	120	120

Table 4. OU dataset partitioning.

Bearing Labels		Normal Bearing	Experimental Bearings		RPM
Fault location		Normal	Inner ring fault	Outer ring fault	/
Class labels		0	1	2	/
Dataset A	Train	300	300	300	Acceleration
	Validation	150	150	150
	Test	150	150	150
Dataset B	Train	300	300	300	Deceleration
	Validation	150	150	150
	Test	150	150	150
Dataset C	Train	300	300	300	Acceleration then deceleration
	Validation	150	150	150
	Test	150	150	150
Dataset D	Train	300	300	300	Deceleration then acceleration
	Validation	150	150	150
	Test	150	150	150

Table 5. Performance test results of seven models on the CWRU dataset.

Models	Time (s)	Accuracy (%)	Parameters
CNN	205	96.17	52,810
WDCNN	204	93.14	54,160
ResNet50	7400	91.61	23,520,000
EfficientNetV2	21,681	91.21	20,190,000
MobileViT	8705	98.96	950,000
Swin Transformer	7800	99.06	27,530,000
MSCVIT	9901	99.31	57,450,000

Table 6. The anti-noise ability of each model.

Models	SNR = −15 dB	SNR = 0 dB	SNR = 15 dB	Average
CNN	86.6	90.4	95.9	91.1
WDCNN	90.1	91.6	91.3	91.0
ResNet50	87.3	84.8	88.6	86.9
EfficientNetV2	86.4	87.4	90.2	88.0
MobileViT	97.0	97.1	96.9	97.0
Swin Transformer	97.8	97.1	98.1	97.6
MSCVIT	98.0	97.6	98.5	98.0

Table 7. Detailed diagnostic results of model migration ability.

Models	A-B	A-C	B-A	B-C	C-A	C-B	Average
CNN	95.1	92.6	91.3	89.9	81.7	93.3	90.7
WDCNN	92.4	90.4	91.1	90.6	85.6	90.1	90.0
ResNet50	89.4	82.0	88.4	85.1	72.3	80.6	83.7
EfficientNetV2	92.2	76.7	78.5	83.2	81.9	80.4	82.2
MobileViT	97.2	65.2	85.1	71.1	76.2	78.5	78.9
Swin Transformer	97.6	94.3	89.0	91.5	87.5	90.8	91.8
MSCVIT	98.8	95.0	93.3	95.7	90.2	92.1	94.2

Table 8. Migration performance of the models across three datasets.

Models	JNU Dataset	CU Dataset	OU Dataset	Average
CNN	89.2	95.2	94.7	93.0
WDCNN	84.9	91.4	92.9	89.7
ResNet50	80.6	90.3	85.5	85.5
EfficientNetV2	74.1	97.5	62.8	78.1
MobileViT	90.0	97.4	90.5	92.6
Swin Transformer	85.6	97.0	97.2	93.3
MSCVIT	91.3	98.1	98.1	95.8

Table 9. Results of ablation experiment on three different datasets.

Condition	JNU Dataset	CU Dataset	OU Dataset	Average
rSVD	89.6	94.0	58.3	80.6
WignerVille/re/CWT	90.5	84.4	89.7	88.2
ViT	89.4	97.5	91.6	92.8
MSCVIT	91.3	98.1	98.1	95.8

Table 10. Performance comparison between MSCBAM and CBAM.

Model	JNU Dataset				CU Dataset				OU Dataset
Model	Acc	Pre	Re	F1	Acc	Pre	Re	F1	Acc	Pre	Re	F1
CBAM	74.0	74.2	74.0	72.9	96.0	96.0	96.0	96.0	96.6	96.6	96.6	96.6
MSCBAM	84.9	84.9	84.9	84.8	96.4	96.2	96.3	96.2	97.4	97.4	97.4	97.4

Table 11. Comparative test of transformer encoder parameters.

Condition	CU Dataset
Encoder layer	4			8			12
Multi-head	8	12	16	8	12	16	8	12	16
Accuracy	95.9	96.9	95.8	97.1	98.1	96.9	95.4	97.5	97.0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
