1. Introduction
Mechanical equipment condition monitoring is a technology for collecting, processing, and analyzing information on the operating status of machinery, and it has been widely used to support equipment maintenance [1]. Mature application of this technology can reduce equipment damage, maintenance costs, and productivity losses and can avert catastrophic accidents, thereby saving society substantial resources. Its main application fields include precision, complex machinery such as aircraft and ship engines; large machinery operating under harsh conditions, such as large hydraulic presses; and offshore drilling platform equipment. Realizing real-time online condition monitoring is of great significance for ensuring the safe and reliable operation of important mechanical equipment, especially large ships and aircraft [2].
Many scholars have conducted extensive research on the intelligent recognition of wear particle images. Fan et al. proposed WPC-SS, an online multilabel classification model for wear particles based on semantic segmentation, which addressed the difficulty of distinguishing tiny wear particles from the background in online images, although its recognition and classification accuracy for tiny particles still needs improvement [3]. Gu Daqiang et al. proposed SVM-based pattern recognition for ferrography wear particle analysis, providing an effective method for wear particle classification [4]. That method places high demands on data preprocessing and feature extraction and must be optimized separately for each specific wear particle type; moreover, its application involves a large amount of data processing and model training, which consumes considerable time and computational resources. These studies classified and identified individual particles using manually selected features, but the accuracy is low, the effect is limited, and the outputs of the algorithms are relatively simple. The image content of ferrography wear particle analysis is complex, and different image types may be highly similar, such as fatigue wear particles and severe sliding wear particles [5].
The classification of wear particles involves categorization based on the morphology, size, chemical composition, and mechanical properties of the wear particles. However, the complexity of wear particle classification arises from the fact that wear outcomes often stem from a combination of different wear mechanisms, making the classification process highly intricate [6]. To gain a comprehensive understanding of wear results, it is essential to employ a comprehensive approach that integrates multifeature fusion, feature selection and dimensionality reduction, ensembles of multiple classifiers, and deep learning strategies [7]. These methods effectively address the combined effects of different wear mechanisms, thereby improving the accuracy and robustness of wear particle classification systems, and are of significant importance for research into wear-related issues.
With the application of CNNs to the intelligent identification and classification of ferrography wear particle images, this approach has gradually replaced manual and traditional machine learning identification. By applying CNNs to the wear particle classification task, scholars have solved many problems of traditional wear particle classification methods. Wang et al. linked an image recognition model based on a convolutional neural network with wear particle analysis and proposed a two-stage wear particle recognition model [8]. Based on the above research results, using CNNs to identify and classify wear particle images can greatly improve work efficiency. However, further study is still needed to speed up real-time model identification and classification, reduce computational complexity, improve accuracy, and address the problems of large-scale data training, hyperparameter tuning, and data imbalance under practical conditions.
Wear particle image classification plays an important role in the field of mechanical equipment fault diagnosis and early warning. Classical deep learning models such as ResNet101 [9], VGG16 [10], MobileNetV2 [11], EfficientNetV1 [12], and AlexNet [13] have achieved significant success in other image classification tasks. ResNet101 is a deep residual network that solves the problem of gradient vanishing by introducing residual connections, but it has high model complexity. The VGG16 model uses multiple 3 × 3 convolution layers for feature extraction and exhibits excellent classification accuracy, but it consumes significant computational resources. The MobileNetV2 model improves model lightness by using depthwise separable convolution, making it suitable for mobile devices, but it falls short of 70% accuracy in wear particle recognition. EfficientNetV1 balances the width, depth, and resolution of the network to achieve a better trade-off between performance and computational cost in wear particle recognition, yet it still does not resolve the issue of low wear particle recognition accuracy. AlexNet is one of the earliest deep learning models applied to image classification, combining the advantages of convolutional neural networks and SVM classifiers, but it cannot recognize similar and overlapping wear particles.
Additionally, new algorithms have been developed to improve mechanical equipment fault diagnosis and early warning. Yeping Peng et al. integrated transfer learning and SVM into a convolutional neural network model, successfully establishing a model for identifying different fault types [14]. However, this model cannot recognize new classes of wear particles that were not encountered during training. Given the limitations of existing algorithms in mechanical equipment fault diagnosis and early warning, this paper proposes a novel deep learning algorithm called SV-ERnet. This algorithm combines the characteristics of the ResNet50 and SepViT models and introduces the XAI Grad-CAM method to explain the reasoning process of the model. Through comparative experiments and result analysis, we demonstrate the superiority of SV-ERnet over traditional models in mechanical equipment fault diagnosis and early warning, providing new ideas and methods for research and application in this domain.
In recent years, the Vision Transformer (ViT), built on a pure transformer structure, has been widely used in computer vision and has achieved great success in traditional visual classification tasks through its self-attention mechanism and strong global modeling ability [15]. More and more scholars have built improvements on this basis. For example, Hao Xu et al. proposed a fine-grained classification algorithm based on a compact Vision Transformer [16], which reduces the dependence on data volume, removes the classification token, and lowers computational complexity. Jiang Lei et al. proposed a fine-grained classification algorithm based on a visual Transformer with a circular structure [17], which can greatly improve the performance of the visual Transformer without changing the parameter count. Yuan Yuan et al. studied fundus image classification based on an integrated convolutional neural network and ViT [18] and obtained better classification results by using two completely different methods to extract fundus image features. The above research contributes to image classification from multiple angles, but its application to the ferrography image classification of wear particles needs further study. In principle, ViT may outperform CNNs in wear particle classification tasks. For small wear particle images, a traditional convolutional neural network must reduce the resolution to retain information, which may cause information loss and a decline in classification accuracy; ViT needs neither subsampling nor cropping but attends to the whole image, thus making full use of the image information. For large wear particle images, a CNN may run into memory limitations, whereas ViT is not subject to this restriction: it divides the image into sub-blocks for processing and finally aggregates the results, improving the efficiency of processing large images without losing information. CNN designs emphasize the spatial locality of images and neglect correlations between different regions, whereas ViT adopts a self-attention mechanism to learn the correlations between regions of the image, thus improving classification accuracy.
To sum up, ViT offers superior image processing efficiency, comprehensive information utilization, and strong relational learning ability for wear particle classification, making it a promising route to better performance; further research is therefore needed. Building on ViT and taking efficiency, accuracy, and learning ability into account, this paper adopts SepViT (Separable Vision Transformer) and makes the following contributions:
- (1)
To solve the problem of the low resolution of wear particle images, an ESRGAN is applied to generate high-resolution images [19].
- (2)
Depthwise separable convolution [20] separates spatial and channel-wise operations and reduces the number of parameters and computations; accordingly, the depthwise separable vision transformer SepViT [21] is applied to the classification of wear particles.
- (3)
The strong local feature extraction ability of convolutional neural networks can address sparse wear particle images with inconspicuous features, so this paper applies ResNet50 to wear particle classification.
- (4)
The optimal weighting factors are calculated by an adaptive weighted fusion algorithm [22], and the SV-ERnet model is integrated using the weighted soft voting method [23]. The model extracts features of ferrography wear particle images in two completely different ways, achieving a better classification effect and effectively handling the complex backgrounds, irregular shapes, varied sizes, and high similarity of wear particle images.
2. SV-ERnet Algorithm Structure
The SV-ERnet image classification method studied in this paper comprises three main steps. First, the resolution of the image is improved by ESRGAN. Then, the SepViT and ResNet50 models are trained separately, the optimal weighting factors are calculated by the adaptive weighting algorithm, and the two models are integrated into a more powerful model, ResNet50-SepViT (SV-ERnet). Finally, the trained SV-ERnet model is used to classify wear particle images, and its performance is tested to evaluate classification accuracy and reliability. The overall algorithm structure is shown in Figure 1. The function of each module and the principle of the algorithm structure are introduced in detail in Sections 2.1 to 2.4 below.
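As a reading aid, the following is a minimal sketch of this three-step inference flow in PyTorch-style Python. The names `esrgan`, `sepvit`, and `resnet50` are placeholders for the trained models, and the function name and weight handling are illustrative assumptions, not the authors' code.

```python
import torch

def sv_ernet_predict(low_res_image, esrgan, sepvit, resnet50, w1, w2):
    """Classify one wear particle image with the fused SV-ERnet model (sketch)."""
    hr = esrgan(low_res_image)                      # step 1: ESRGAN super-resolution
    p_sepvit = torch.softmax(sepvit(hr), dim=-1)    # step 2a: SepViT class probabilities
    p_resnet = torch.softmax(resnet50(hr), dim=-1)  # step 2b: ResNet50 class probabilities
    p_fused = w1 * p_sepvit + w2 * p_resnet         # step 3: weighted soft voting
    return p_fused.argmax(dim=-1)                   # predicted wear particle class
```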
2.1. Image Enhancement Algorithm ESRGAN
ESRGAN is a GAN-based super-resolution algorithm that learns the mapping from low-resolution to high-resolution images. It exploits the adversarial learning of GANs, continuously optimizing the generator's super-resolution output through the contest between generator and discriminator. Compared with traditional interpolation algorithms, ESRGAN can generate more detailed high-resolution images. The schematic diagram of ESRGAN is shown in Figure 2; its function is to generate a high-resolution picture from a low-resolution input. The network consists of three main parts: (1) a shallow feature extraction network, in which the low-resolution picture passes through a convolution + ReLU stage that adjusts the number of input channels to 64; (2) an RRDB (residual-in-residual dense block) architecture comprising n sets of RDB (residual dense block) dense residual blocks and one residual edge, where each RDB includes five groups of convolutional layers followed by rectified linear unit (ReLU) activations; and (3) an upsampling network, which increases the height and width of the original image fourfold to improve the resolution.
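To make the RRDB structure concrete, here is a minimal PyTorch sketch following the description above (five conv + ReLU layers per RDB with dense connections, n RDBs inside an outer residual edge). The channel counts, residual scaling factor, and dense-connection pattern are assumptions based on common ESRGAN implementations, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """One RDB: five conv layers with dense connections and ReLU activations."""
    def __init__(self, channels=64, growth=32):
        super().__init__()
        self.convs = nn.ModuleList()
        for i in range(5):
            in_ch = channels + i * growth          # dense: all earlier features concatenated
            out_ch = growth if i < 4 else channels # last conv restores the channel count
            self.convs.append(nn.Conv2d(in_ch, out_ch, 3, padding=1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        feats = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(feats, dim=1))
            if i < 4:
                out = self.act(out)
            feats.append(out)
        return x + 0.2 * feats[-1]                 # residual scaling, common in RRDB designs

class RRDB(nn.Module):
    """Residual-in-residual dense block: n RDBs plus an outer residual edge."""
    def __init__(self, channels=64, n_blocks=3):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualDenseBlock(channels) for _ in range(n_blocks)])

    def forward(self, x):
        return x + 0.2 * self.blocks(x)
```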
2.2. Depthwise Separable Vision Transformer (SepViT)
2.2.1. Parameter and Computation Analysis of Depthwise Convolution
Depthwise separable convolution can be divided into two parts, as shown in Figure 3, namely depthwise convolution and pointwise convolution [24]. Unlike the conventional convolution operation, in depthwise convolution each channel of the feature map is convolved by only one convolution kernel, and the number of convolution kernels is equal to the number of channels. Therefore, the expression of depthwise convolution is as shown in Equation (1):
$$G_{m,i,j}=\sum_{w,h}K_{m,w,h}\cdot X_{m,\,i+w,\,j+h}\tag{1}$$

where G is the output feature map, K is the convolution kernel with width w and height h, X is the input feature map, m denotes the m-th channel of the feature map, (i, j) are the coordinates of the output feature map on the m-th channel, and (w, h) index the kernel weight elements of the m-th channel.
Pointwise convolution is essentially the same as ordinary convolution, except that the convolution kernel size is set to 1 × 1. The schematic diagram of depthwise separable convolution is shown in Figure 3. First, the features of each channel are extracted by the depthwise convolution, and then the features are combined across channels by pointwise convolution. In the figure, Df and N are the side length and channel number of the input feature map, Dk is the side length of the depthwise (Dw) convolution kernel, and M is the channel number of the pointwise (Pw) convolution. Depthwise separable convolution replaces standard convolution with fewer parameters and less computation; its computation is compared with that of standard convolution in Equation (2):

$$\frac{P_{1}}{P_{2}}=\frac{D_{k}^{2}\cdot N\cdot D_{f}^{2}+N\cdot M\cdot D_{f}^{2}}{D_{k}^{2}\cdot N\cdot M\cdot D_{f}^{2}}=\frac{1}{M}+\frac{1}{D_{k}^{2}}\tag{2}$$

where P1 and P2 are the computation amounts of depthwise separable convolution and standard convolution, respectively.
In feature extraction, the convolution kernel size is usually 3 × 3, so the computation and parameter count of depthwise separable convolution are roughly 1/9 of those of conventional convolution. By this comparison, SepViT, which adopts the idea of depthwise convolution, requires fewer parameters and operations than ViT and can therefore afford deeper learning.
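The roughly 1/9 saving (more precisely, 1/M + 1/Dk² per Equation (2)) can be checked numerically. The sketch below compares parameter counts for an assumed 64-to-128-channel, 3 × 3 case; the channel sizes are illustrative, not taken from the paper.

```python
import torch.nn as nn

def param_count(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

# Standard 3x3 convolution: N = 64 input channels, M = 128 output channels.
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)

# Depthwise separable equivalent: one 3x3 kernel per channel (groups = N),
# followed by a 1x1 pointwise convolution that mixes channels.
separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False),
    nn.Conv2d(64, 128, kernel_size=1, bias=False),
)

print(param_count(standard))   # 3*3*64*128 = 73,728
print(param_count(separable))  # 3*3*64 + 64*128 = 8,768, i.e. 1/M + 1/Dk^2 ~= 0.12
```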
2.2.2. SepViT Algorithm Principle
SepViT uses conditional position coding. Each stage of SepViT has an overlapping patch-merging layer for feature map downsampling, followed by a series of SepViT blocks. The spatial resolution is downsampled stage by stage with stride 4 or stride 2, reaching 32× downsampling overall, while the channel dimension is doubled at each stage. This design comes from PVT (Pyramid Vision Transformer), which, compared with ViT, introduces a CNN-like pyramid structure. Relative to the traditional ViT, the core optimization of SepViT lies in the attention computation: the internal self-attention mechanism is redesigned around depthwise convolution. Sep-attention consists of two parts, depthwise convolution self-attention and pointwise convolution self-attention, with the depthwise part mainly used for feature map extraction. The structure of SepViT is shown in Figure 4.
SepViT's DWA (depthwise self-attention) is a window-based self-attention scheme that simplifies the interaction computation between windows by introducing an independent window token. The token can be initialized as a fixed or a learnable vector; experiments show that the learnable vector outperforms strategies based on average pooling and depthwise convolution. Through DWA, the window token interacts with the pixel tokens in its window, so it can serve as the global representation of the window; an attention operation is performed on the sequence of all pixel tokens in a window together with the corresponding window token, processing the information within each window separately. Each window can be regarded as one input channel of the feature map, and different windows contain different information, so this window-wise operation is analogous to a depthwise convolution layer, which fuses the spatial information within each channel. The depthwise self-attention operation is illustrated in Figure 5.
The implementation of DWA can be summarized as in Equation (3):

$$\mathrm{DWA}(z)=\mathrm{Attention}(z\cdot W_{Q},\;z\cdot W_{K},\;z\cdot W_{V})\tag{3}$$

where z is a feature token consisting of pixel tokens and a window token; WQ, WK, and WV represent three linear layers used for the query, key, and value computations of general self-attention, respectively; and Attention refers to the standard self-attention operator that works on the local window.
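A minimal PyTorch sketch of Equation (3) follows. The window partitioning is assumed to have been done already, so each row of z holds one window's pixel tokens plus its window token; the tensor shapes and single-head attention are simplifying assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class DWA(nn.Module):
    """Depthwise self-attention within windows, per Equation (3) (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.wq = nn.Linear(dim, dim)  # W_Q
        self.wk = nn.Linear(dim, dim)  # W_K
        self.wv = nn.Linear(dim, dim)  # W_V

    def forward(self, z):
        # z: (num_windows, tokens_per_window + 1, dim), window token included
        q, k, v = self.wq(z), self.wk(z), self.wv(z)
        # standard self-attention applied independently inside each window
        return F.scaled_dot_product_attention(q, k, v)
```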
PWA (pointwise self-attention) builds cross-window interaction by mimicking pointwise convolution to associate channels, yielding the final feature map. First, the feature map and window tokens are produced by DWA; the window tokens are then used to model the attention relationships between windows. After LN (layer normalization) and Gelu activation, Q and K are obtained by two independent linear mappings, and an attention map between windows is generated. At the same time, the preceding feature map is used directly as the V of PWA (without additional processing), and the window dimension is globally weighted to compute the final output. The pointwise self-attention operation is illustrated in Figure 6.
Formally, the implementation of PWA can be described as in Equation (4):

$$\mathrm{PWA}(z,wt)=\mathrm{Attention}\big(\mathrm{Gelu}(\mathrm{LN}(wt))\cdot W_{Q},\;\mathrm{Gelu}(\mathrm{LN}(wt))\cdot W_{K},\;z\big)\tag{4}$$

where wt denotes the window tokens and the feature map z serves directly as the value.
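A corresponding sketch of Equation (4): the window tokens pass through LN and Gelu to form Q and K, the windowed feature map itself is V, and the attention map reweights the window dimension. The tensor layout is an assumption for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PWA(nn.Module):
    """Pointwise self-attention across windows, per Equation (4) (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.wq = nn.Linear(dim, dim)  # W_Q
        self.wk = nn.Linear(dim, dim)  # W_K

    def forward(self, z, wt):
        # wt: (batch, num_windows, dim); z: (batch, num_windows, tokens, dim)
        t = F.gelu(self.norm(wt))      # LN then Gelu on the window tokens
        q, k = self.wq(t), self.wk(t)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        # z is used directly as V; the window dimension is globally weighted
        return torch.einsum('bnm,bmpd->bnpd', attn, z)
```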
Group Self-Attention (GSA)
In addition to DWA and PWA, SepViT's depthwise separable self-attention mechanism also borrows the idea of grouped convolution from AlexNet [25]. As shown in Figure 7, the grouped self-attention mechanism splices adjacent subwindows into larger windows, which is similar to dividing the windows into groups. By applying DWA within a group of windows, GSA can capture long-range visual dependencies across multiple windows. In terms of computational cost and performance gain, GSA is more expensive than DSSA (depthwise separable self-attention) but also performs better. Finally, blocks with GSA are applied to SepViT, alternating with DSSA blocks in the later stages of the network.
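The window-grouping step of GSA amounts to a reshape that merges adjacent windows before DWA is applied. The sketch below assumes a flattened-window tensor layout and a fixed group size; the actual SepViT grouping may differ in detail.

```python
import torch

def group_windows(x: torch.Tensor, group_size: int = 2) -> torch.Tensor:
    """Splice `group_size` adjacent sub-windows into one larger window (sketch).

    x: (batch, num_windows, tokens_per_window, dim). After grouping, DWA run on
    the merged windows attends across all sub-windows in a group, capturing
    longer-range dependencies.
    """
    b, n, p, d = x.shape
    assert n % group_size == 0, "window count must divide evenly into groups"
    return x.reshape(b, n // group_size, group_size * p, d)
```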
2.3. ResNet50 Algorithm Principle
ResNet50 is a deep convolutional neural network that extracts image features through multiple convolution and pooling layers and feeds these features into a fully connected layer for classification. The whole network is divided into five stages and an output layer. Stage 0 includes a convolution layer and a max pooling layer that adjust the image input; the output layer consists of a global average pooling layer and a fully connected layer that output the classification results. Except for Stage 0 and the output layer, the other four stages adopt the residual network structure. The structure of the ResNet50 network is shown in Figure 8.
In ResNet50, the bottleneck is widely used in the residual network, mainly to handle differences in channel number. The bottleneck has two modes: when the numbers of input and output channels are the same, the BTNK2 mode is adopted; when they differ, the BTNK1 mode is adopted. The BTNK2 mode has two variable parameters, C and W, which respectively represent the channel number and width of the input shape (C, W, W). Let x be an input with shape (C, W, W) and let the three convolution blocks on the left side of BTNK2 (with the related BN and ReLU) be the function F(x); then the output of BTNK2 is F(x) + x, and after a ReLU activation the output shape is still (C, W, W). The BTNK1 mode includes an additional convolution layer on the right, which transforms the input x into G(x) to match the difference in channel number and then performs the summation F(x) + G(x).
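The BTNK1/BTNK2 distinction can be expressed in a single PyTorch module. The sketch below follows the standard ResNet50 bottleneck layout (1 × 1, 3 × 3, 1 × 1 convolutions with BN and ReLU), with the shortcut projection G(x) enabled only when the shapes differ; the parameter names are illustrative.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet50 bottleneck. With in_ch == out_ch and stride 1 this is the BTNK2
    case (identity shortcut, output F(x) + x); otherwise a 1x1 convolution G(x)
    on the shortcut matches the shapes (BTNK1 case)."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.f = nn.Sequential(  # F(x): the three convolution blocks
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        if stride != 1 or in_ch != out_ch:   # BTNK1: project the shortcut, G(x)
            self.g = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:                                # BTNK2: identity shortcut
            self.g = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + self.g(x))  # F(x) + G(x), then ReLU
```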
Specifically, each residual block in ResNet50 includes two paths: one directly transfers the input to the output, and the other applies convolutions and activation functions to the input and then adds the result to the original input to obtain the output. This design avoids vanishing or exploding gradients during training and helps the network learn more complex features. As a result, ResNet50 can classify input images efficiently and accurately and is therefore widely used in various image-related tasks.
2.4. Fusion Integration Model
In this study, an ensemble of the models' prediction results is used to improve classification accuracy. Research shows that the greater the difference among the models, the better the ensemble performs [26]. SepViT is a new type of ViT with the ability to mine long-distance dependencies and powerful parallel computing, but like ViT it lacks the inductive biases of convolutional neural networks, such as translation invariance and local correlation. ResNet50, on the other hand, is a powerful convolutional neural network, but its core convolution operation lacks a global understanding of the image: it cannot build dependencies between distant features, cannot make full use of context information, and its fixed convolution weights cannot dynamically adapt to changes in the input.
In order to solve the problem of wear particle image classification, this paper combines the SepViT and ResNet50 models, using two different feature extraction methods to distinguish image differences comprehensively and obtain better classification results. Through the adaptive weighting algorithm, the optimal weighting factors are determined after testing, and the improved soft voting method is used to integrate the models. In the classification process, an adaptive weighted fusion algorithm is adopted; its core idea is to adaptively find the optimal weighting factor for each classification model based on the accuracies of all classifications, so as to obtain the optimal fusion result. There are two classification models in this study, so let their variances be σ1² and σ2², respectively, and let the true value to be estimated be S. The classification accuracies of the two models are X1 and X2, which are unbiased estimates of S and independent of each other. The weighting factors of the models are W1 and W2, respectively, so the fused value Ŝ and the weighting factors satisfy the following conditions:

$$\hat{S}=W_{1}X_{1}+W_{2}X_{2},\qquad W_{1}+W_{2}=1\tag{5}$$

$$\sigma^{2}=E\big[(S-\hat{S})^{2}\big]=W_{1}^{2}\sigma_{1}^{2}+W_{2}^{2}\sigma_{2}^{2}\tag{6}$$
According to Equation (6), where E represents the expected value, the population variance σ² is a multivariate quadratic function of the weighting factors W1 and W2 of the classification models, so it must have a minimum value. Its minimum population variance is

$$\sigma_{\min}^{2}=\frac{\sigma_{1}^{2}\sigma_{2}^{2}}{\sigma_{1}^{2}+\sigma_{2}^{2}}\tag{7}$$
The corresponding optimal weighting factors are

$$W_{1}=\frac{1/\sigma_{1}^{2}}{1/\sigma_{1}^{2}+1/\sigma_{2}^{2}}\tag{8}$$

$$W_{2}=\frac{1/\sigma_{2}^{2}}{1/\sigma_{1}^{2}+1/\sigma_{2}^{2}}\tag{9}$$
After calculating the variance and the adaptive optimal weighting factor of each classification model using Equations (6)–(9), adaptive weighted fusion is carried out on the data of the classification models; the calculated value after fusion is

$$\hat{S}=\sum_{p=1}^{2}W_{p}\bar{X}_{p},\qquad \bar{X}_{p}=\frac{1}{n}\sum_{k=1}^{n}X_{p}(k)\tag{10}$$

In Equation (10), Wp represents the weighting factor corresponding to the p-th model, X̄p is the average of multiple prediction results from the p-th model, and n denotes the number of predictions.
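Equations (8)–(10) reduce to inverse-variance weighting, as the short sketch below illustrates. The variance values are made-up examples chosen to reproduce the 0.4/0.6 weights reported below, not the paper's measurements.

```python
import numpy as np

def optimal_weights(variances):
    """Optimal weighting factors per Equations (8) and (9): inverse variance,
    normalized across models."""
    inv = 1.0 / np.asarray(variances, dtype=float)
    return inv / inv.sum()

def fused_estimate(means, weights):
    """Adaptively weighted fusion of model outputs per Equation (10)."""
    return float(np.dot(weights, means))

w = optimal_weights([0.006, 0.004])      # illustrative model variances
print(w)                                 # -> [0.4, 0.6]
print(fused_estimate([0.92, 0.94], w))   # fused estimate: 0.932
```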
Equation (11) can be used to calculate the fused population variance and minimum population variance and to evaluate the accuracy after fusion. The calculated weighting factors are 0.4 and 0.6, and each sample xmn is subjected to binary soft voting under each category label. The classification probability of each sample xmn under the two models for each category label is

$$P_{mn}=\left(P_{0mn},\;P_{1mn},\;P_{2mn},\;P_{3mn},\;P_{4mn}\right)$$

where P0mn represents the probability that the m-th sample is judged as a positive-example class under the 0-th category label by the n-th model, and P1mn, P2mn, P3mn, and P4mn are defined analogously for the other labels. After obtaining Pmn, the label probabilities output by the SepViT model are multiplied by the weighting factor W1, the outputs of the CNN model are multiplied by the weighting factor W2, and the two are added; the predicted output of the sample for a label is then

$$P_{m}=W_{1}P_{mn}^{\mathrm{SepViT}}+W_{2}P_{mn}^{\mathrm{ResNet50}}$$

The final classification probability Pm of the sample xmn under the five category labels is thus obtained, and the label with the highest probability under Pm is output.
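The per-sample voting step thus reduces to a weighted sum of the two models' per-label probability vectors. A short numeric sketch follows; the pairing of 0.4 and 0.6 with SepViT and ResNet50, and the probability values, are assumptions for illustration.

```python
import numpy as np

def weighted_soft_vote(p_sepvit, p_resnet, w1=0.4, w2=0.6):
    """Fuse the per-label probabilities of the two models and pick the top label."""
    p_m = w1 * np.asarray(p_sepvit) + w2 * np.asarray(p_resnet)
    return int(np.argmax(p_m)), p_m

# Five category labels; each entry is one model's probability for that label.
label, probs = weighted_soft_vote([0.10, 0.55, 0.15, 0.10, 0.10],
                                  [0.05, 0.70, 0.10, 0.05, 0.10])
print(label, probs)  # -> 1 (the label with the highest fused probability)
```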
5. Conclusions
The proposed SV-ERnet integration model combines the optimized network structures of ResNet50 and SepViT. A convolutional neural network (CNN) takes the convolution kernel as its core and has inductive biases such as translation invariance and local sensitivity; it can capture local spatiotemporal information but lacks a global understanding of images. Compared with a CNN, the Transformer's self-attention mechanism is not limited to local interactions: it can not only mine long-distance dependencies but also compute in parallel. In this study, ResNet50 is selected as the CNN model and SepViT as the main model; the optimal weighting factors are calculated by the adaptive weighted fusion method, and the models are integrated by weighted soft voting. Experiments show that when the two models are integrated and applied to wear particle image classification, the accuracy of the integrated model is improved by 2.0% and 1.8% relative to the two individual models, the precision is improved by 1.7% and 2.0%, and the specificity is improved by 0.5%. The training curves show that the training accuracy of SepViT rises continuously, indicating that the model would improve further if more time and cost were invested. Applied to the fault diagnosis of mechanical equipment, the model can improve the working efficiency of inspectors, effectively alleviate long waiting times and the difficulty of classifying wear particles, and achieve better results in online oil analysis and fault diagnosis.
The proposed SV-ERnet model, combined with the XAI Grad-CAM visualization method, has broad potential for industrial application. In the future, it can play a positive role in fields such as fault diagnosis and early warning in the maritime and aviation industries, medical image analysis, and industrial product defect detection.