1. Introduction
Maintaining rolling bearings in optimal condition is vital to keep rotating machinery running smoothly and ensure the stability of the entire production process [1, 2]. However, many factors can result in the breakdown of machinery, such as prolonged operation, excessive loads, and environmental impacts. According to statistics, approximately 30% of rotating machinery failures are related to rolling bearings [3]. Therefore, it is of great significance to study rolling bearing fault diagnosis technology in actual industrial production.
Traditional fault diagnosis approaches are typically classified into model-based and data-driven methods. Model-based methods typically use physical models to describe the dynamic properties of mechanical systems. Shen et al. proposed a physics-based deep learning method that utilized a threshold model to assess the health classes of bearings based on the known physics of bearing faults, while a convolutional neural network (CNN) was used to predict the health condition based on high-level features extracted from the input [4]. To address the tendency of most deep learning algorithms to ignore physical information, Ni et al. proposed a new physics-informed residual network. This network was designed to learn the underlying physics embedded in the training and testing data to provide a physically consistent solution for incomplete data [5]. Borghesani et al. proposed a generalized bearing signal model, which mainly evaluated the influence of key model parameters corresponding to the physical properties of bearings on signals in the time and frequency domains [6]. Zhang et al. reconstructed the air-gap displacement profile based on the stator current electrical model, enabling a quantitative assessment of rolling bearing fault severity and removing the tedious manual calibration of fault thresholds under different power, speed, and load conditions required by traditional methods [7]. Liu et al. proposed a personalized diagnostic method based on finite element method simulations and a support vector machine, which generated simulated fault samples and performed classification, making it possible to detect faults in mechanical components in the absence of actual fault samples [8]. Keshun et al. proposed a sound-vibration physical information fusion-constrained deep learning method (PFCG-DL), which combined physical models with deep learning models to improve the accuracy, interpretability, and computational efficiency of bearing fault diagnosis, addressing the lack of physical mechanism guidance and the low interpretability of existing deep learning methods [9]. However, model-based approaches have disadvantages such as poor adaptability and strong reliance on models. By contrast, data-driven approaches use large quantities of data to find the underlying rules without relying on a deep understanding of the system or on precisely built mathematical models.
As a data-driven method, deep learning can autonomously derive features from a large quantity of data and reduce manual intervention [10]. In recent years, it has increasingly been applied across different domains in artificial intelligence, including object detection [11], audio recognition [12], natural language processing [13], and disease prediction [14]. Zhang et al. introduced a method to convert one-dimensional (1D) fault signals into two-dimensional (2D) maps and send the maps to a CNN for diagnosis [15]. Li et al. proposed a hybrid diagnostic model combining a dual-stage attention-based recurrent neural network (DA-RNN) and a convolutional block attention module (CBAM). They utilized the DA-RNN to extend imbalanced datasets and combined image processing with the CBAM network for fault classification, improving fault diagnosis accuracy under imbalanced data conditions [16]. Chen et al. utilized a combination of a CNN and long short-term memory (LSTM) to reduce calculation time and to address the problems of large data volume and unreliable manual analysis [17]. Gu et al. adopted the discrete wavelet transform (DWT) to extract detailed fault information at different frequencies and time scales and used an LSTM network to capture the temporal relationships within the fault data [18]. An et al. proposed an LSTM gating unit for time-varying conditions to selectively forget some unimportant information [19].
Despite the success of deep learning networks in fault diagnosis, some inherent drawbacks must be considered. For example, CNNs may not be as effective as RNNs in capturing long-range dependencies, especially in tasks that require global information. Additionally, RNNs struggle with parallel computation and are prone to the vanishing gradient problem, which limits the length of sequences they can handle. In contrast, attention-based transformer models can effectively capture long-range dependencies in sequence data by supporting global interactions between each time step and all other time steps in the sequence, while also allowing a higher degree of parallel computation. Yang et al. segmented and linearly encoded 1D vibration signals and then used the transformer model to extract features, aiming to enhance the performance of fault diagnosis [20]. Alexakos et al. utilized the short-time Fourier transform to transform 1D fault signals into 2D maps, which were subsequently classified by the transformer model [21]. Li et al. proposed a twin transformer to solve the problem of traditional deep learning models not being able to perform parallel computation in fault diagnosis [22]. Fan et al. input a gray texture image into a vision transformer (ViT) and utilized the self-attention mechanism to identify global patterns for fault classification [23]. Xie et al. utilized singular value decomposition (SVD) and energy-dispersive spectroscopy (EDS) to denoise the signal, and then employed the generalized S transform (GST) and a Res-ViT network for feature extraction and fault classification [24]. Tang et al. developed an integrated ViT model that incorporated the discrete wavelet transform (DWT) and a soft voting method, which transformed signals from different frequency bands into time–frequency images to achieve fault diagnosis [25]. To address the problem of traditional CNNs not being able to capture temporal information in rolling bearings, Weng et al. developed a 1D vision transformer architecture that incorporated a fusion of multi-scale CNNs (MCF-1DViT). This model utilized the MCF layer to capture fault features at multiple time scales and employed a transformer model to learn long-term temporal correlations [26]. Ding et al. developed the time–frequency transformer (TFT) model based on the ViT model to address the shortcomings of traditional networks in feature representation, and extracted effective information from the time–frequency representation (TFR) of vibration signals by using a new tokenizer and encoder module [27]. Xiang et al. proposed a frequency channel attention-based ViT method, which enhanced the feature extraction capability and interpretability of the model for rolling bearing fault identification by introducing a frequency domain channel-attention mechanism and a self-attention mechanism [28]. Guo et al. proposed a bidirectional parallel rolling bearing intelligent diagnosis method based on a multi-scale center-cascaded adaptive dynamic convolutional residual network and a Swin transformer. The method utilized a multi-scale center-cascaded dynamic convolutional residual block and a multi-dimensional coordinate attention mechanism to extract local fault features, while the moving window self-attention mechanism of the Swin transformer network was used to capture global features of the fault information [29]. Li et al. proposed a lightweight multi-feature fusion ViT model for rolling bearing fault diagnosis. The model utilized a multi-scale wide convolutional neural network perception module for local feature extraction, while an improved lightweight multi-feature fusion ViT was built for global feature extraction and fault recognition [30].
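To make the idea of global interactions concrete, the following is a minimal NumPy sketch of segmenting a 1D signal, linearly encoding the segments, and applying single-head scaled dot-product self-attention. It is not the implementation of any of the cited works: the function names, the patch length of 64, the embedding size of 32, and the random weights are all illustrative assumptions, and positional encodings and multi-head attention are omitted for brevity.

```python
import numpy as np

def segment(signal, patch_len):
    """Split a 1D signal into non-overlapping patches (a tail remainder is dropped)."""
    n_patches = signal.size // patch_len
    return signal[: n_patches * patch_len].reshape(n_patches, patch_len)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Every token is scored against every other token: shape (n_tokens, n_tokens)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Subtract the row maximum before exponentiating for numerical stability
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical sizes and random weights, chosen only for illustration
rng = np.random.default_rng(0)
signal = rng.standard_normal(1024)
patches = segment(signal, patch_len=64)                  # (16, 64)
w_embed = rng.standard_normal((64, 32)) / np.sqrt(64)    # linear encoding
tokens = patches @ w_embed                               # (16, 32)
w_q, w_k, w_v = (rng.standard_normal((32, 32)) / np.sqrt(32) for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)
```

Each row of `weights` holds the attention that one segment pays to all segments, so distant parts of the signal interact in a single matrix product rather than through a step-by-step recurrence, which is what allows the computation to be parallelized.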
Although the transformer models in the literature above have made significant progress in many aspects, they still fall short in local feature extraction, which may limit their overall performance. To solve this problem, a bearing fault diagnosis model that combines a multi-scale convolutional block attention module with a vision transformer (MSCVIT) is proposed. The main contributions of this paper are as follows.
- (1) First, noise is added to the original vibration data to simulate a real production environment. The signal is then reconstructed from the singular values selected by the energy threshold method to achieve a denoising effect. The 1D denoised data are transformed into TFRs by the continuous wavelet transform (CWT) to enhance the multi-dimensional representation of the data and capture the fault features in the time–frequency domain.
- (2) Secondly, we improved the CBAM. We introduced an MLP with different reduction factors to enhance the capability of the channel-attention mechanism (CAM) to effectively capture the importance of different channels. Meanwhile, for the spatial attention mechanism (SAM), we combined convolutional kernels of different sizes to enhance the fusion of multi-scale features.
- (3) Finally, an innovative MSCVIT model is proposed. The multi-scale processing of the multi-scale convolutional block attention module (MSCBAM) enables the model to capture local features more comprehensively, while the vision transformer effectively utilizes the self-attention mechanism to capture the dependencies between image patches and extract global information. The interaction between the MSCBAM and the ViT model effectively extracts the main features from the TFRs.
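The noise-injection and SVD denoising steps in contribution (1) can be sketched as follows. This is one plausible reading rather than the exact procedure of this paper: the Hankel (trajectory) matrix embedding, the window length, the cumulative-energy rule, and the names `add_noise` and `svd_denoise` are assumptions introduced for illustration.

```python
import numpy as np

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio in dB."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)

def svd_denoise(signal, window, energy_ratio):
    """Keep the leading singular values that hold `energy_ratio` of the energy."""
    n = signal.size
    rows = n - window + 1
    # Hankel matrix (assumed embedding): row i is signal[i : i + window]
    idx = np.arange(rows)[:, None] + np.arange(window)[None, :]
    hankel = signal[idx]
    u, s, vt = np.linalg.svd(hankel, full_matrices=False)
    # Smallest k whose cumulative squared singular values reach the threshold
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(energy, energy_ratio)) + 1, s.size)
    low_rank = (u[:, :k] * s[:k]) @ vt[:k]
    # Average the anti-diagonals to map the matrix back to a 1D signal
    total = np.zeros(n)
    counts = np.zeros(n)
    np.add.at(total, idx, low_rank)
    np.add.at(counts, idx, 1.0)
    return total / counts, k

# Synthetic example: a sinusoid stands in for a vibration signal
rng = np.random.default_rng(0)
t = np.arange(1024)
clean = np.sin(2 * np.pi * 0.05 * t)
noisy = add_noise(clean, snr_db=10, rng=rng)
denoised, k = svd_denoise(noisy, window=64, energy_ratio=0.85)
```

The energy threshold trades noise suppression against signal fidelity: a lower ratio keeps fewer components and removes more noise, but it may also discard weak fault-related components, so the value used here is only a placeholder.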
The structure of this article is as follows. Section 2 introduces the basic theory of each module. The MSCBAM and the MSCVIT model diagnosis process are briefly introduced in Section 3. Section 4 describes the classical dataset. The experimental results and analysis are given in Section 5, where ablation experiments on the MSCVIT model are also conducted to better assess its performance. Section 6 concludes this paper.