Article

Mask-Space Optimized Transformer for Semantic Segmentation of Lithium Battery Surface Defect Images

1 College of Electronic Engineering (College of AI), South China Agricultural University, Guangzhou 510642, China
2 National Citrus Industry Technical System Machinery Research Office, Guangzhou 510642, China
3 Guangzhou Agricultural Information Acquisition and Application Key Laboratory, Guangzhou 510642, China
4 Guangdong Provincial Agricultural Information Monitoring Engineering Technology Research Center, Guangzhou 510642, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(22), 3627; https://doi.org/10.3390/math12223627
Submission received: 11 October 2024 / Revised: 18 November 2024 / Accepted: 18 November 2024 / Published: 20 November 2024

Abstract:
The segmentation of surface defects in lithium batteries is crucial for enhancing the overall quality of the production process. However, the severe foreground–background imbalance in surface images of lithium batteries, along with the irregular shapes and random distribution of foreground regions, poses significant challenges for defect segmentation. Based on these observations, this paper focuses on the separation of foreground and background in surface defect images of lithium batteries and proposes a novel Mask Space Optimization Transformer (MSOFormer) for semantic segmentation of these images. Specifically, the Mask Boundary Loss (MBL) module in our model provides more efficient supervision during training to enhance the accuracy of the mask computation within the mask attention mechanism, thereby improving the model’s performance in separating foreground and background. Additionally, the Dynamic Spatial Query (DSQ) module allocates spatial information of the image to each query, enhancing the model’s sensitivity to the positions of small foreground targets in various scenes. The Efficient Pixel Decoder (EPD) ensures deformable receptive fields for irregularly shaped foregrounds while further improving the model’s performance and efficiency. Experimental results demonstrate that our method outperforms other state-of-the-art methods in terms of mean Intersection over Union (mIoU). Specifically, our approach achieves an mIoU of 84.18% on the lithium battery surface defect test set and 85.53% and 87.05% mIoUs on two publicly available defect test sets with similar defect characteristics to lithium batteries.

1. Introduction

Due to the uncontrollability of industrial equipment and the production environment, surface defects inevitably appear on products, affecting manufacturing efficiency and economic returns. Surface defect detection (SDD), which identifies defect areas in industrial product images through pixel-level prediction, has become an indispensable part of modern production lines. It plays a vital role especially in industrial manufacturing [1], product quality testing [2], urban transportation [3], and military aerospace [4]. With the increasing demand for high-precision automatic detection, deep-learning-based image semantic segmentation is gradually replacing traditional manual inspection, overcoming its shortcomings in accuracy and efficiency, and has become the mainstream detection method. At present, however, most deep neural networks are designed for general image prediction tasks, and their performance in defect detection scenarios is still limited. Therefore, it is particularly necessary to optimize deep network architectures for defect detection scenarios.
In recent years, Convolutional Neural Networks (CNNs) [5] have been widely applied to various computer vision tasks, including classification, segmentation, and detection, due to their ability to extract more discriminative and robust features within deep networks [6]. For semantic segmentation tasks, Fully Convolutional Networks (FCN) [7] have been proposed to achieve end-to-end dense pixel prediction directly through convolutional and upsampling layers, achieving breakthroughs in natural scene analysis. Due to the simplicity and outstanding performance of the FCN framework, many studies [8,9,10,11,12] have applied it to industrial defect image detection, making pioneering progress. However, the inherent locality of convolutional operations often limits their ability to explicitly model long-range interactions, which is crucial for pixel-level defect detection in complex scenarios, such as cluttered backgrounds and hard-to-identify pseudo-defects [13].
To effectively capture long-range interactions, various attention mechanisms have been introduced into industrial defect detection algorithms [14,15], facilitating rapid advancements in semantic segmentation of industrial defect images. For example, Xiao et al. [16] combined the local feature bias of CNNs with the global feature bias of Transformers, enhancing the model's capability for defect feature extraction. At the same time, Zhang et al. [17] proposed a multi-layer attention mechanism to capture spatial dependencies between multi-level features, thereby strengthening the expression of defect features. In terms of global feature representation, Zhang et al. [18] designed a context-aware attention module that integrates high-level and low-level image features within the backbone network to improve the global feature representation of defects. On the other hand, Wang et al. [19] proposed a bidirectional stripe-refined attention mechanism that utilizes stripe pooling to capture vertical and horizontal global context dependencies, enhancing the representation of edge features.
Although attention-based methods demonstrate strong long-range-dependency modeling capabilities and effectively learn global spatial representations of industrial surface defect images, they still face a performance bottleneck. This bottleneck primarily arises from the underuse of global spatial information and from the foreground–background imbalance in surface defect images. Specifically, the imbalance in area between the foreground and background regions often results in missed detections of defect regions and false detections of pseudo-defects. Numerous studies [20,21] have noted that the foreground proportion in surface defect images is significantly smaller than the background proportion, and relevant research has explored foreground object segmentation accordingly. However, achieving complete foreground–background separation remains challenging due to varying foreground proportions across different defect categories in industrial surface defect images.
Inspired by the mask classification task Mask2Former [22], we observed that the segmentation task can be designed to predict a set of binary masks, with each mask associated with a class prediction. This method naturally separates the foreground and background, clearly indicating their proportions. Meanwhile, the mask cross-attention mechanism selectively focuses on regions with high attention scores, ignoring regions with lower scores. In this paper, we introduce mask classification into the analysis of industrial lithium battery surface defect images and propose a novel Mask Spatial Optimization Transformer (MSOFormer) for the semantic segmentation of these images. MSOFormer aims to explore a new paradigm for the semantic segmentation of surface defect images through foreground–background separation. Specifically, MSOFormer introduces a mask classification pipeline to separate the foreground and background of surface defect images through predicted masks. Additionally, by studying the impact of the computational region accuracy in the mask attention mechanism on model performance, we propose a Mask Boundary Loss (MBL) module to provide supervision information to the computational regions. This not only eliminates redundant background features but also ensures the effective utilization of background information around the foreground regions. Furthermore, the proposed Dynamic Spatial Query (DSQ) module enhances the model’s sensitivity to the spatial positions of small foreground objects, while the Efficient Multi-Scale Pixel Decoder (EPD) improves the model’s ability to extract irregular foreground objects through deformable receptive fields. These components further improve the model’s foreground–background separation performance. The main contributions of this work are summarized as follows.
  • We introduce mask classification into the analysis of lithium battery surface defect images and propose a novel Mask Boundary Loss (MBL) module to aid the mask attention mechanism in learning more precise computational regions, thereby further improving the model’s foreground segmentation accuracy.
  • Considering the randomness in the generation of small defect locations in lithium battery surface defect images, the Dynamic Spatial Query (DSQ) module integrates position encoding of image features into the query, aiming to enhance the model’s sensitivity to small foreground objects.
  • The Efficient Pixel Decoder (EPD) module achieves deformable receptive fields for irregular foreground objects through the fully convolutional FPN (Feature Pyramid Network) architecture, enhancing both model performance and operational efficiency.
  • Experimental results demonstrate that the proposed MSOFormer surpasses existing methods, achieving state-of-the-art performance on the lithium battery surface defect dataset, the MT dataset, and the NEU-Seg dataset.
The structure of this paper is arranged as follows: Section 2 introduces the related work; Section 3 describes the proposed MSOFormer model for semantic segmentation of lithium battery surface defect images; Section 4 analyzes the performance of the proposed method through experiments; and Section 5 concludes the paper.

2. Related Work

2.1. General Semantic Segmentation

Semantic segmentation involves classifying each pixel in an image into specific semantic categories, thereby achieving a pixel-level understanding and segmentation of the image content. After the introduction of the FCN [7], many semantic segmentation algorithms adopted its design principles to enable dense pixel-level predictions at arbitrary resolutions, which laid the foundation for subsequent encoder–decoder structures. For instance, UNet [23] proposed a U-shaped architecture based on the FCN, which captures both contextual and positional information, thereby achieving higher performance and robustness. Meanwhile, PSPNet [24] addressed the limitations of the FCN in scene analysis, such as insufficient contextual information and a small receptive field, by introducing a Pyramid Pooling Module to enhance the detection accuracy of objects with similar colors and shapes. The DeepLab series [25,26,27,28] employed atrous convolution to effectively expand the receptive field without increasing the number of parameters, while the Atrous Spatial Pyramid Pooling (ASPP) structure further strengthened the network's robustness in multi-scale, multi-class segmentation tasks. In addition, models like PSANet [29], CCNet [30], SegFormer [31], and SegNeXt [32] utilized various attention mechanisms to further exploit global spatial modeling capabilities and capture long-range dependencies, achieving outstanding semantic segmentation performance.

2.2. Semantic Segmentation of a Surface Defect Image

Deep learning methods have made significant progress in the semantic segmentation of industrial surface defect images. For instance, Yang et al. [33], Liu et al. [34], and Du et al. [35] applied semantic segmentation techniques to surface defect detection with promising results. Considering that surface defect images contain richer semantic information than natural scene images, various studies have attempted to design deep networks specifically suited to surface defect images. In this case, Yu et al. [36] designed an adaptive network depth selection mechanism to distinguish similar defects by aggregating features at different depths. Building on this, Li et al. [37] combined strip pooling and a spatial pyramid pooling module (SPPM) to enhance the model’s ability to perceive defect location and shape information. In the meantime, Zhou et al. [38] developed a hybrid backbone network based on CNN inverse-residual blocks and Swin Transformer blocks, introducing an efficient channel attention-enhanced feature fusion module to merge the features output by these two modules. In addition, Yao et al. [39] developed a vision transformer with dual attention mechanisms to meet the needs of surface defect detection under more complex semantic conditions. On the other hand, Wang et al. [13] combined the advantages of convolutional attention blocks and multi-pool self-attention to achieve better local and global information perception. Zhang et al. [40] used a wavelet transform to guide the down-sampling layers in a Swin Transformer, improving the network’s sensitivity to surface defect textures and its ability to detect high-frequency information. Although current methods have achieved significant advances by enhancing remote spatial representation, their performance has encountered a bottleneck. 
Based on the characteristics of lithium battery surface defects, this paper focuses on foreground–background separation in surface defect images and proposes a more effective mask space optimization method to overcome the performance limitations in defect image semantic segmentation.

2.3. Mask Classification

Mask classification has been extensively utilized in image segmentation methodologies. For example, Mask R-CNN [41] filters and classifies regions of interest produced by the Region Proposal Network, subsequently generating masks for these classified areas to achieve pixel-level segmentation. More recently, MaskFormer [42] has sought to redefine the segmentation task as predicting a series of binary masks in conjunction with their respective class predictions to compute the final output, where each mask is linked with a class prediction. Building on this concept, Mask2Former [22] introduced a mask attention mechanism that effectively distinguishes between foreground and background regions. Its multi-scale deformable attention mechanism further refines mask features at various scales. However, the computational complexity of the multi-scale deformable attention mechanisms makes Mask2Former unable to meet the efficiency requirements of industrial application scenarios. In contrast, PEM [43] redesigned the mask attention mechanism through a prototype selection strategy, thereby reducing the model’s computational load. Our method employs Mask2Former as a baseline model and introduces a mask space optimization approach for the semantic segmentation of lithium battery surface defects.

3. Methods

MSOFormer follows the standard structure of mask classification models. First, four hierarchical features ($F_1$ to $F_4$) are extracted from the input image by the backbone network. The pixel decoder, EPD, based on the FPN architecture, further refines these features: the three lower-resolution features are fed cyclically to the consecutive transformer decoder layers, while the highest-resolution feature is used as the mask feature $Mask_F$. The decoder takes the multi-scale features and trainable queries $Query$ and learns the query embeddings $Query_{embed}$. In the semantic segmentation process, the query embeddings $Query_{embed}$ generate the class prediction $Mask_{class}$ and the corresponding mask embeddings $Mask_{embed}$ through an MLP. The binary mask prediction $Mask_{pred}$ is obtained via the Einstein summation convention between the mask features $Mask_F$ and the mask embeddings $Mask_{embed}$. Finally, combining the mask predictions $Mask_{pred}$ with the class predictions $Mask_{class}$, again via the Einstein summation convention, produces the final semantic segmentation results. The specific network structure of MSOFormer is shown in Figure 1.
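This mask classification pipeline can be sketched numerically as follows. This is an illustrative NumPy sketch with made-up shapes (100 queries, 256 embedding channels, 7 classes, 64 × 64 masks), not the authors' implementation:

```python
import numpy as np

# Illustrative shapes (assumptions, not the paper's exact configuration):
# N queries, C embedding channels, K classes, H x W mask resolution.
N, C, K, H, W = 100, 256, 7, 64, 64
rng = np.random.default_rng(0)

mask_embed = rng.standard_normal((N, C))    # Mask_embed from the transformer decoder
mask_feat = rng.standard_normal((C, H, W))  # Mask_F from the pixel decoder
class_logits = rng.standard_normal((N, K))  # per-query class predictions

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Binary mask prediction via Einstein summation of Mask_embed and Mask_F.
mask_pred = np.einsum('nc,chw->nhw', mask_embed, mask_feat)

# Combine per-query class scores with per-query masks into a semantic map,
# again via Einstein summation.
sem_logits = np.einsum('nk,nhw->khw', softmax(class_logits), sigmoid(mask_pred))
sem_map = sem_logits.argmax(axis=0)  # final per-pixel class labels

print(sem_logits.shape, sem_map.shape)  # (7, 64, 64) (64, 64)
```

The two `einsum` calls correspond to the two Einstein-summation steps described above: one builds the binary masks, the other marginalizes queries into per-class pixel scores.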

3.1. Architecture of an Efficient Multi-Scale Pixel Decoder

The pixel decoder plays a fundamental role in extracting multi-scale features for precise object segmentation. Mask2Former implements this by employing an FPN enhanced with multi-scale deformable attention. Compared to the pixel decoder of MaskFormer, it offers several advantages: dynamic weight calculation based on the input, feature refinement through multi-scale information, and enhanced local information representation using deformable receptive fields. Although performance is improved, given the small amount of defect data produced by actual industrial production lines, applying deformable attention within the FPN incurs additional computational overhead and resource consumption, resulting in an inefficient pixel decoder unsuitable for deployment on actual production lines. To balance performance and computational efficiency, this paper employs a fully convolutional FPN architecture for the pixel decoder. This approach retains the benefits of a multi-scale deformable attention FPN while using a cost-effective and simple convolutional design. First, to reintroduce dynamic weights and multi-scale information, this study draws on the CoordAtt [44] design and proposes a Self-Adaptive Module (SAM) based on spatial-channel weights and context. This module globally rebalances the importance of the scene representation at each scale, suppressing channels and spatial positions carrying less information and enhancing those carrying more. Additionally, to achieve deformable receptive fields, deformable convolutions are used to dynamically adjust the receptive field toward relevant regions of the image, thereby improving the model's representation of irregular defects. This method maintains computational efficiency while delivering competitive performance.
Specifically, given the input features of the $i$-th stage $F_i \in \mathbb{R}^{H_i \times W_i \times C_i}$, with $i \in \{1, 2, 3, 4\}$, they are first projected into a low-dimensional space $C$ to obtain $F_i^C \in \mathbb{R}^{H_i \times W_i \times C}$. Then, global pooling is applied along the horizontal and vertical directions at each scale, resulting in the global pooled visual feature projection $F_i^G \in \mathbb{R}^{(H_i + W_i) \times C}$ for subsequent contextual information fusion:

$$F_i^G = \mathrm{Mlp}\left(\mathrm{Concat}\left(\frac{1}{W_i}\sum_{0 < j \le W_i} F_i^C(H_i, j) + \max_{0 \le j \le W_i} F_i^C(H_i, j),\; \frac{1}{H_i}\sum_{0 < j \le H_i} F_i^C(j, W_i) + \max_{0 \le j \le H_i} F_i^C(j, W_i)\right)\right), \tag{1}$$

where $\mathrm{Mlp}$ refers to a two-layer network composed of $1 \times 1$ convolutions. Next, a pair of feature maps with directional perception and channel information, namely $F_i^H \in \mathbb{R}^{H_i \times C}$ and $F_i^W \in \mathbb{R}^{W_i \times C}$, are separated along the horizontal and vertical directions of $F_i^G$. These feature maps are processed through the sigmoid function $\sigma$ to obtain the correlation for each spatial direction and channel. Finally, the output of the SAM module for each scale, $F_i^{SAM}$, is obtained as follows:

$$F_i^{SAM} = F_i^C + F_i^C \cdot \sigma\left(\mathrm{Proj}\left(F_i^H\right)\right) \cdot \sigma\left(\mathrm{Proj}\left(F_i^W\right)\right), \tag{2}$$

where $\mathrm{Proj}$ represents a linear projection. After obtaining the contextual features for each scale, they are aggregated to construct a feature pyramid network. This work follows previous efficient segmentation works [45,46] and utilizes deformable convolutions [47] to integrate features across different scales as follows:

$$EPD_i = \begin{cases} \mathrm{DefConv}\left(F_i^{SAM}\right), & i = 4 \\ \mathrm{DefConv}\left(F_i^{SAM}\right) + \mathrm{Up}\left(EPD_{i+1}\right), & i = 1, 2, 3, \end{cases} \tag{3}$$

where $\mathrm{DefConv}$ refers to deformable convolution and $\mathrm{Up}$ represents the upsampling operation. As shown in Figure 1, the three low-resolution features $EPD_i$, $i \in \{2, 3, 4\}$, output by the pixel decoder are used as multi-scale visual features in the transformer decoder to further enhance the quality of the defect area masks. The high-resolution feature $EPD_1$ is processed through a depth-wise separable convolution layer to obtain the refined mask features $Mask_F \in \mathbb{R}^{C \times H \times W}$, as described by the following:

$$Mask_F = \mathrm{DwConv}(EPD_1). \tag{4}$$
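The SAM re-weighting of Equations (1) and (2) can be sketched as follows. This is a simplified NumPy illustration: the Mlp/Concat of Equation (1) and the two Proj maps are collapsed into single illustrative weight matrices, and the deformable convolutions of Equation (3) are omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sam_block(F_c, W_h, W_w):
    """Self-Adaptive Module sketch for one scale (Eqs. (1)-(2), simplified).

    F_c : (H, W, C) features already projected to the shared channel dim C.
    W_h, W_w : (C, C) illustrative projection weights standing in for Mlp/Proj.
    """
    # Directional global pooling: average + max along each spatial axis.
    f_h = F_c.mean(axis=1) + F_c.max(axis=1)  # (H, C) vertical descriptor
    f_w = F_c.mean(axis=0) + F_c.max(axis=0)  # (W, C) horizontal descriptor

    # Direction-wise attention weights after linear projection and sigmoid.
    a_h = sigmoid(f_h @ W_h)                  # (H, C)
    a_w = sigmoid(f_w @ W_w)                  # (W, C)

    # Residual re-weighting: F + F * sigma(Proj(F_H)) * sigma(Proj(F_W)).
    gate = a_h[:, None, :] * a_w[None, :, :]  # (H, W, C) broadcast gate
    return F_c + F_c * gate

rng = np.random.default_rng(1)
H, W, C = 32, 48, 64
F_c = rng.standard_normal((H, W, C))
out = sam_block(F_c, 0.1 * rng.standard_normal((C, C)), 0.1 * rng.standard_normal((C, C)))
print(out.shape)  # (32, 48, 64)
```

The gate suppresses or amplifies each (row, column, channel) position globally, which is the spatial-channel rebalancing role described for the SAM.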

3.2. Architecture of a Dynamic Spatial Query Module

Natural images are typically captured by photographers with a specific purpose, usually placing the subject in the visual center of the image. In contrast, on automated industrial production lines, the positions of industrial cameras relative to industrial components are generally fixed, while the locations of defects on the components are random. In industrial surface defect detection, small-scale surface defects, such as dirt and foreign objects, are randomly distributed across the surface area, and lithium batteries are no exception: metal scrap randomly appears in the Tab area, while nonmetal impurities randomly occur in the Sealant area. Compared to natural images, these distinct positional differences of defect targets on lithium battery surfaces increase the difficulty of extracting foreground objects in semantic segmentation. To address this, we developed a Dynamic Spatial Query (DSQ) module to enhance MSOFormer's sensitivity to the spatial positions of small-scale defect targets in lithium battery defect detection. This is achieved by incorporating spatial location information of the image into the learnable queries, as illustrated in Figure 2. The DSQ module enables information interaction between the second, third, and fourth layers of the image features output by the pixel decoder and the query block in the transformer decoder. This design facilitates the detection of targets of varying sizes.
The computation of the standard mask attention function with residual connections is given by Equations (5) and (6):
$$Query_{cross} = W_A \cdot f_V(Feature_i) + Query_{l-1}, \quad l = 0, 1, 2, \dots, t-1, \quad i = 4 - (l \,\%\, 3), \tag{5}$$

$$W_A = \mathrm{softmax}\left(M_{l-1} + f_Q\left(Query_{l-1} + POS_{Query}\right) \cdot f_K\left(Feature_i + POS_{Feature_i}\right)^{T}\right), \tag{6}$$

where $l$ denotes the index of the layer in the transformer decoder; $f_Q$, $f_K$, and $f_V$ are linear transformations of the given query $Query \in \mathbb{R}^{N \times C}$ and the image features $Feature_i \in \mathbb{R}^{C \times H_i \times W_i}$ output by the pixel decoder, with $i = 2, 3, 4$; $POS_{Query}$ represents the learnable positional embedding with the same dimension as $Query$; $POS_{Feature_i}$ denotes the cosine positional encoding with the same dimension as $Feature_i$; and $M_{l-1}$ is the attention mask obtained from the previous layer of the transformer decoder.
To further incorporate additional image positional features into the learnable query, we first compute the mixed intermediate features that include the image features. This is achieved through the dot product of the query features and key image features, as shown in Equation (6). The mixed intermediate features are represented as $W_A \in \mathbb{R}^{N \times (H_i \cdot W_i)}$, where $N$ is the number of queries and $H_i$ and $W_i$ are the height and width of the key image features, respectively. $W_A$ records the patch information of interest for each query. To inject cosine positional encoding into the learnable query, we multiply the positional encoding of the input image features by the attention weights $W_A$, obtaining the positional encoding of the patches of interest for each query, which matches the shape of the queries. Therefore, the final query can be represented as in Equation (7):

$$Query_{DSQ} = W_A \cdot POS_{Feature_i} + Query_l. \tag{7}$$
The final query obtained by the Dynamic Spatial Query (DSQ) module is $Query_{DSQ}$. As shown in Equation (7), spatial location information is explicitly embedded in the learning process. This enhances the segmentation quality in defect detection scenarios, as MSOFormer's learned queries become more sensitive to mask features at different spatial locations. Furthermore, $W_A$ is already part of Equation (5), and $POS_{Feature_i}$ is an existing variable; therefore, the DSQ module introduces no additional parameters and does not affect the convergence speed of the mask attention.
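A minimal NumPy sketch of the DSQ computation in Equations (6) and (7) follows. The linear maps $f_Q$ and $f_K$ are omitted for brevity, and all shapes are illustrative assumptions rather than the paper's configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative shapes: N queries, C channels, H x W key feature map.
N, C, H, W = 100, 256, 8, 16
rng = np.random.default_rng(2)

query = rng.standard_normal((N, C))       # Query_{l-1}
pos_query = rng.standard_normal((N, C))   # learnable query positional embedding
feat = rng.standard_normal((H * W, C))    # flattened Feature_i (keys/values)
pos_feat = rng.standard_normal((H * W, C))  # cosine positional encoding of Feature_i
attn_mask = np.zeros((N, H * W))          # M_{l-1}; 0 = attend (additive mask)

# Eq. (6): attention weights over image patches (f_Q, f_K identity here).
W_A = softmax(attn_mask + (query + pos_query) @ (feat + pos_feat).T, axis=-1)

# Eq. (7): inject the positional encoding of the attended patches into the query.
query_dsq = W_A @ pos_feat + query

print(W_A.shape, query_dsq.shape)  # (100, 128) (100, 256)
```

Because `W_A` and `pos_feat` already exist in the standard mask attention computation, this injection adds no parameters, matching the efficiency claim above.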

3.3. Architecture of a Mask Boundary Loss Module

Due to the small pixel proportion of defect regions, there is an imbalance between foreground and background in lithium battery defect images, which leads to a performance bottleneck in semantic segmentation of defect images. In the mask attention mechanism of Mask2Former, the foreground and background are separated by recording the mask regions of interest from the previous layer decoder, enhancing the representation of local target regions. This approach not only effectively avoids unnecessary computations in the background regions and reduces the training cycle of the model, but also gradually refines the foreground regions through multi-layer attention decoders. This allows the model to focus more on the foreground areas, thereby improving accuracy [22]. Therefore, the mask regions of interest passed between decoders determine the computational areas of the mask attention mechanism, and the accuracy of these areas is crucial for improving model performance. Considering this and the binary nature of mask regions, this paper attempts to use edge loss to supervise the computational areas, forcing the model to generate more accurate computational regions and more accurately distinguish between foreground and background areas.
We found that the performance upper limit of Mask2Former is positively correlated with the overlap between the sum of the foreground regions of interest for the N mask queries in the mask attention mechanism and the expanded GT mask area. The proof of this correlation is detailed in Section 5.1.
According to Equation (6), the mask attention calculation area of a given layer in the transformer decoder is derived from the mask $M_{l-1}$ of the previous layer. The content at the feature position $(x, y)$ in each layer mask $M_l$ is calculated using Equation (8):

$$M_l(x, y) = \begin{cases} 0, & \text{if } Mask_{attn}(x, y) = 1 \\ -\infty, & \text{otherwise}, \end{cases} \tag{8}$$

where the content at the feature position $(x, y)$ of the mask calculation area $Mask_{attn} \in \mathbb{R}^{N \times H \times W}$ is obtained from the predicted mask $Mask_{pred}$ after a sigmoid activation and a fractional threshold of 0.5. The predicted mask $Mask_{pred}$ is obtained from the mask embedding $Mask_{embed} \in \mathbb{R}^{N \times C}$ output by the transformer decoder and the mask feature $Mask_F \in \mathbb{R}^{C \times H \times W}$ output by the pixel decoder via the Einstein summation convention. The specific formulas are shown in Equations (9) and (10):

$$Mask_{pred} = \mathrm{einsum}\left(Mask_{embed}, Mask_F\right), \tag{9}$$

$$Mask_{attn}(x, y) = \begin{cases} 1, & \text{if } \sigma\left(Mask_{pred}(x, y)\right) > 0.5 \\ 0, & \text{otherwise}. \end{cases} \tag{10}$$
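The thresholding of Equations (9)–(10) and the conversion to an additive attention mask can be illustrated as follows. Assigning $-\infty$ to non-attended positions follows Mask2Former's mask attention convention; shapes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
N, H, W = 4, 8, 8
mask_pred = rng.standard_normal((N, H, W))  # Mask_pred logits (Eq. (9) output)

# Eq. (10): binarize the sigmoid-activated prediction at threshold 0.5.
mask_attn = (sigmoid(mask_pred) > 0.5).astype(np.float32)

# Eq. (8)-style additive mask: 0 where attended, -inf elsewhere, so that
# softmax in the next decoder layer ignores the non-attended positions.
M_l = np.where(mask_attn == 1, 0.0, -np.inf)

print(mask_attn.shape, np.isneginf(M_l).any())  # (4, 8, 8) True
```

Adding `M_l` inside the softmax of Equation (6) zeroes the attention weight of every position outside the predicted foreground region, which is exactly how the mechanism restricts computation to the regions of interest.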
Additionally, when Mask2Former performs surface defect segmentation tasks, the $N$ mask queries in $Mask_{attn}$ can be roughly divided into two categories. The first category, $Mask_1 \in \mathbb{R}^{N_1 \times H \times W}$, primarily covers the defect areas of the surface images. These mask queries are the majority and are used to separate defect and normal regions, forcing the model to focus on learning information related to the defect areas. The second category, $Mask_2 \in \mathbb{R}^{N_2 \times H \times W}$, mainly covers the normal areas of the images, extracting more potentially valuable information from the foreground regions. Figure 3 illustrates the two types of regions of interest.
These two types of mask information satisfy Equation (11):
$$Mask_{attn} = \mathrm{Concat}\left(Mask_1, Mask_2\right). \tag{11}$$
Based on this, the first category of regions of interest is selected. Within this category, the regions of interest differ across mask queries due to varying attention scores, so each mask query covers only part of the foreground region. The overall foreground calculation region of the mask queries, which is also the calculation region $CR$ of the mask attention mechanism, is obtained by Equation (12):

$$CR = \bigcup_{i=0}^{N_1 - 1} Mask_1^i. \tag{12}$$
Since the foreground mask calculation regions are generally block-shaped, the accuracy of the edge determines the accuracy of the overall calculation region. Therefore, calculating the overlap of the edge regions between C R and the ground truth mask area is superior to calculating the Intersection over Union of the overall mask regions. The edge loss of the mask region is given by Equation (13):
$$l_{edge} = -GE(gt, \alpha) \cdot \log \sigma\left(GE(CR, \alpha)\right) - \left(1 - GE(gt, \alpha)\right) \cdot \log\left(1 - \sigma\left(GE(CR, \alpha)\right)\right), \tag{13}$$
where G E is the boundary generation function [48], and α is the dilation coefficient for boundary generation, providing a more accurate reference value for the calculation region of the mask attention mechanism. Therefore, the target loss function of the MSOFormer model is given by Equation (14):
$$l = l_{cls} + l_{mask} + l_{edge} = \lambda_{cls} l_{cls} + \lambda_{ce} l_{ce} + \lambda_{dice} l_{dice} + \lambda_{edge} l_{edge}, \tag{14}$$
where the hyperparameters λ c l s , λ c e , λ d i c e , and λ e d g e are used to balance the weights of the classification loss, mask loss, and edge loss.
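A sketch of the edge loss in Equation (13) follows. The boundary generation function $GE$ of [48] is not specified in this text, so a hypothetical stand-in is used (morphological dilation minus erosion over $\alpha$ iterations, with wrap-around at image borders for brevity); only the binary cross-entropy form of Equation (13) is taken from the paper:

```python
import numpy as np

def boundary(mask, alpha=1):
    """Hypothetical stand-in for GE: a boundary band of width ~alpha,
    computed as (alpha-step dilation) AND NOT (alpha-step erosion)."""
    m = np.asarray(mask).astype(bool)
    dil, ero = m.copy(), m.copy()
    for _ in range(alpha):
        d = np.zeros_like(dil)
        e = np.ones_like(ero)
        # 4-neighborhood plus center; np.roll wraps at borders (sketch only).
        for dy, dx in [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]:
            d |= np.roll(dil, (dy, dx), axis=(0, 1))
            e &= np.roll(ero, (dy, dx), axis=(0, 1))
        dil, ero = d, e
    return (dil & ~ero).astype(np.float32)

def edge_loss(cr, gt, alpha=1, eps=1e-7):
    """BCE between the dilated GT boundary and the CR boundary (Eq. (13))."""
    g = boundary(gt, alpha)
    s = 1.0 / (1.0 + np.exp(-boundary(cr, alpha)))  # sigma(GE(CR, alpha))
    s = np.clip(s, eps, 1.0 - eps)
    return float(np.mean(-g * np.log(s) - (1.0 - g) * np.log(1.0 - s)))

# Toy example: a ground-truth square and a slightly shifted predicted region.
gt = np.zeros((16, 16)); gt[4:10, 4:10] = 1
cr = np.zeros((16, 16)); cr[5:11, 5:11] = 1
loss = edge_loss(cr, gt, alpha=1)
print(loss >= 0.0)  # True
```

A production implementation would use a proper boundary operator (e.g., distance-transform-based) rather than this wrap-around sketch; the point here is only the supervision signal: the loss penalizes disagreement between the edge bands of the predicted calculation region and the ground truth.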

4. Experiments

4.1. Dataset Description

This section introduces the construction of the lithium battery surface defect dataset, as well as the open-source MT and NEU-Seg datasets, which share similar defect characteristics. In the subsequent sections, comprehensive experiments on these three datasets demonstrate the effectiveness and generalizability of MSOFormer.

4.1.1. The Lithium Battery Surface Defect Dataset

This dataset was collected from the production line at CATL's headquarters from October to December 2022, using a Hikvision MV-CA032-10GC camera with an effective pixel count of 3.2 million. The dataset consists of images from 325 lithium battery sets from different production batches. Each battery set was photographed from four angles (positive and negative views of the anode and cathode), resulting in a total of 1300 images. After selection by professional technicians, 1272 images were retained for the dataset. The images have a resolution of 2048 × 1536 pixels, and the defect regions vary in area and shape. The area proportion of individual defect types ranges from approximately 0.05% to 5.12%, while the foreground overall occupies only 1.24% of the image area. The dataset includes six main types of surface defects in lithium batteries: tab damage, nonmetal impurity, metal scrap, electrode fold, electrode damage, and electrode weld crack. Each type of defect is manually labeled by professionals. Figure 4 illustrates the regions on the lithium battery surface, and Table 1 provides a brief introduction to each type of defect.
As shown in Table 1, there is a significant imbalance in the data volume across defect categories. Since models tend to learn features from data with higher proportions, this imbalance can lead to learning biases; data balancing optimization is discussed in the dataset preprocessing section. The entire dataset is divided into training and test sets in a 4:1 ratio. The simulated production line test data in Section 4.5 consist of images from 63 lithium batteries sourced from a different production line, and that section analyzes the model's performance in real-time production line settings.

4.1.2. The MT Dataset

The MT dataset [49] encompasses five distinct types of surface defects on magnetic tiles: unevenness, wear, cracks, pores, and fracture defects, in addition to a normal class. The proportion of foreground pixels is 2.54%. All defect shapes are irregular, and their locations are random. Each image in the dataset includes a pixel-level annotation mask. The MT dataset comprises 1344 images in total. Since MT does not provide an official split protocol, the dataset was randomly divided into training and testing sets at a ratio of 4:1.

4.1.3. The NEU-Seg Dataset

The NEU-Seg dataset [11] encompasses three prevalent surface defects in hot-rolled strip steel: inclusions, patches, and scratches. The dataset comprises 11.76% foreground pixels, with all defects exhibiting irregular shapes and random generation areas. Each defect category is represented by 300 images, and the dataset is partitioned into training and testing sets at a 4:1 ratio.

4.2. Experimental Settings

4.2.1. Implementation Details

In the CentOS 9 environment, all models involved in the experiments were implemented using the PyTorch 1.13.1 [50] and MM-Segmentation [51] frameworks and executed on an NVIDIA RTX 4090 GPU (NVIDIA, Santa Clara, CA, USA) with 24 GB of memory. During the training phase, MSOFormer followed the same training strategy as Mask2Former. All models applied random scale jittering (between 0.5 and 2.0), random horizontal flipping, random cropping, and random color jittering for data augmentation within the pipeline, and comparison experiments were conducted on the same training and testing datasets. MSOFormer is compatible with any backbone architecture. For fair comparison, the experiments used a standard convolutional-based 50-layer ResNet [52] and a Transformer-based Swin-Transformer (Swin-T) [53] as the visual backbones. Unless otherwise specified, all backbone networks were pre-trained on ImageNet-1K [54]. Additionally, a batch size of eight was used in the experiments, and all models were trained for 80,000 iterations. MSOFormer employed the AdamW [55] optimizer with a Poly learning rate schedule, an initial learning rate of 1 × 10⁻⁴, and a weight decay of 0.05.
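The poly learning rate schedule described above can be sketched as follows. This is a minimal illustration, not the authors' code; the decay power of 0.9 is the common MMSegmentation default and is an assumption here, since the paper does not state it.

```python
# Poly learning-rate schedule sketch: lr decays from the initial value
# toward zero over the full training run of 80,000 iterations.
BASE_LR = 1e-4      # initial learning rate from the paper
MAX_ITERS = 80_000  # total training iterations from the paper
POWER = 0.9         # assumed decay power (MMSegmentation default)

def poly_lr(cur_iter: int) -> float:
    """lr = base_lr * (1 - cur_iter / max_iters) ** power"""
    return BASE_LR * (1.0 - cur_iter / MAX_ITERS) ** POWER

# The learning rate decreases monotonically; at iteration 0 it equals
# BASE_LR, and at MAX_ITERS it reaches exactly 0.
```

In practice this function would be applied to each AdamW parameter group's `lr` at every iteration.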

4.2.2. Evaluation Metrics

During the testing phase, four evaluation metrics are used to assess the performance of the proposed model across three datasets: Intersection over Union (IoU), Precision, Recall, and F1 score. IoU measures the model’s ability to differentiate between foreground and background. Precision and Recall evaluate the model’s accuracy in predicting specific categories, with Precision focusing on the false positive rate and Recall on the false negative rate. The F1 score provides a combined measure of Precision and Recall. The definitions of these four metrics are provided in Equations (15)–(18):
$$\mathrm{mIoU} = \frac{1}{n_{cls}} \sum_{i} \frac{n_{ii}}{\sum_{j} n_{ij} + \sum_{j} n_{ji} - n_{ii}},$$
$$\mathrm{mPrecision} = \frac{1}{n_{cls}} \sum_{i} \frac{n_{ii}}{\sum_{j} n_{ji}},$$
$$\mathrm{mRecall} = \frac{1}{n_{cls}} \sum_{i} \frac{n_{ii}}{\sum_{j} n_{ij}},$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
where $n_{cls}$ denotes the number of classes and $n_{ij}$ denotes the number of pixels of class $i$ predicted as class $j$.
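A minimal sketch of these per-class metrics, computed from a confusion matrix `n` where `n[i][j]` counts pixels of class `i` predicted as class `j` (the toy matrix below is for illustration only, not data from the paper; the F1 here is computed from the mean Precision and Recall):

```python
def seg_metrics(n):
    """Compute mIoU, mPrecision, mRecall, and F1 from a confusion matrix."""
    n_cls = len(n)
    ious, precs, recs = [], [], []
    for i in range(n_cls):
        tp = n[i][i]
        row = sum(n[i])                            # ground-truth pixels of class i
        col = sum(n[j][i] for j in range(n_cls))   # pixels predicted as class i
        ious.append(tp / (row + col - tp))         # IoU for class i
        precs.append(tp / col)                     # Precision for class i
        recs.append(tp / row)                      # Recall for class i
    miou = sum(ious) / n_cls
    mprec = sum(precs) / n_cls
    mrec = sum(recs) / n_cls
    f1 = 2 * mprec * mrec / (mprec + mrec)
    return miou, mprec, mrec, f1

# Toy 2-class confusion matrix: 80 of 100 class-0 pixels correct, etc.
n = [[80, 20],
     [10, 90]]
miou, mprec, mrec, f1 = seg_metrics(n)
```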
In the simulation production line testing phase, since the test data lack GT labels for pixel-level evaluation metrics, defect recognition accuracy (Accuracy), false positive rate (False), and false negative rate (Leak) are used to assess the model’s production line performance. The recognition accuracy for a specific defect category is the percentage of correctly detected instances of that category out of the total number of detections for that category; the false positive rate is the percentage of instances of that category incorrectly detected as other categories, relative to the actual number of instances of that category; and the false negative rate is the percentage of instances of other categories incorrectly detected as that category, relative to the actual number of instances of other categories. The specific formulas are given in Equations (19)–(21):
$$\mathrm{Accuracy} = \frac{TP}{TP + FP + FN},$$
$$\mathrm{False} = \frac{FN}{TP + FN},$$
$$\mathrm{Leak} = \frac{FP}{TN + FP},$$
where TP is the number of samples correctly classified as the target category, TN is the number of samples correctly classified as other categories, FP is the number of samples incorrectly classified as the target category, and FN is the number of samples incorrectly classified as other categories.
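These production-line metrics can be sketched directly from the per-category counts defined above (the counts below are illustrative toy numbers, not results from the paper):

```python
def line_metrics(tp, tn, fp, fn):
    """Production-line metrics from Equations (19)-(21)."""
    accuracy = tp / (tp + fp + fn)   # Eq. (19): correct detections over all detections
    false_rate = fn / (tp + fn)      # Eq. (20): "False" rate for the category
    leak_rate = fp / (tn + fp)       # Eq. (21): "Leak" rate for the category
    return accuracy, false_rate, leak_rate

# Toy example: 45 correct detections, 10 missed, 5 spurious, 940 true negatives.
acc, false_rate, leak = line_metrics(tp=45, tn=940, fp=5, fn=10)
```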

4.3. Dataset Preprocessing

The issues of insufficient and imbalanced training samples in industrial surface defect detection have long been critical problems that need to be addressed in this field [56]. There is a significant disparity in the number of labeled samples for each category. Traditional data augmentation methods involve randomly applying several pre-defined augmentation techniques to each image, and then randomly selecting a fixed number of samples from all samples of each category as training samples. Although this approach increases the sample size, the imbalance in the number of training samples significantly restricts the average and overall accuracy of the model. To enable the model to learn more discriminative features and enhance classification robustness, this paper employs an adaptive data augmentation-based balancing strategy.
Currently, most models consider only the positive samples in the images during training, ignoring unlabeled negative samples. This can lead to some negative samples being misclassified as positive, affecting the accuracy of the final predictions. In the proposed strategy, negative samples are uniformly labeled as class 0. Before data augmentation, the number of samples in each class is counted and the per-class proportions are recorded. Since each lithium battery image may contain one or more types of defects, the per-category sample proportions cannot be fully balanced in a small dataset by merely adding samples; an approximate balance is pursued instead. Experiments show that the augmentation methods listed in Table 2 yield better-balanced sample proportions, demonstrating that the adaptive data augmentation-based balancing strategy effectively addresses the issues of insufficient and imbalanced training samples.
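The counting step of the balancing strategy can be sketched as below. This is a simplified hypothetical implementation of the idea (under-represented classes receive proportionally more augmented copies); the paper's actual augmentation selection per class is given in Table 2, and the counts here are invented for illustration.

```python
def augmentation_counts(class_counts):
    """Return how many augmented samples to generate per class so that
    every class approximately matches the size of the largest class."""
    target = max(class_counts.values())
    return {cls: target - n for cls, n in class_counts.items()}

# Hypothetical per-class sample counts (not the real dataset statistics).
counts = {"tab_damage": 120, "metal_scrap": 40, "electrode_fold": 80}
extra = augmentation_counts(counts)
# extra == {"tab_damage": 0, "metal_scrap": 80, "electrode_fold": 40}
```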

4.4. Comparison with Other Methods

To validate the effectiveness of the proposed method, this paper compares MSOFormer with other existing methods on a lithium battery surface defect segmentation dataset and two standard industrial surface defect segmentation datasets. The comparison methods include classical semantic segmentation approaches such as FCN [7], DeepLabv3+ [28], UperNet [57], KNet [58], SegFormer [31], and SegNeXt [32], as well as advanced methods in the industrial defect segmentation field such as McNet [59] and NadiNet [60]. Additionally, several mask-based semantic segmentation methods are considered, including MaskFormer [42], Mask2Former [22], and PEM [43].

4.4.1. Segmentation Results on the Lithium Battery Surface Defect Dataset

The quantitative comparison with other methods on the lithium battery defect test set is shown in Table 3. The following observations can be made:
  • The MSOFormer model demonstrates superior performance compared to other advanced models, regardless of whether the ResNet or Swin backbone network is used.
  • In the incremental comparison of MSOFormer relative to other methods, the increase in mIoU is the most significant among the three evaluation metrics. For instance, compared to the KNet model, which performs best among the pixel classification methods, MSOFormer achieved increases of 3.7%, 2.26%, and 2.84% in mIoU, mPrecision, and mRecall, respectively, with consistent gains observed over the other comparison methods. IoU reflects the model’s ability to differentiate between foreground and background, indicating that the mask classification mechanism of MSOFormer effectively addresses the issue of foreground–background imbalance in lithium battery defect images.
  • For defect mPrecision and mRecall, MSOFormer shows more substantial improvements over Mask2Former (baseline) and achieves the best Intersection over Union (IoU) performance in the categories of nonmetal impurity, metal scrap, electrode fold, and electrode damage. In categories prone to false defects, such as nonmetal impurity and metal scrap, MSOFormer and its variants also demonstrate significant progress compared to other methods, highlighting the considerable advantage of mask classification in learning imbalanced image information. Additionally, under the same backbone network conditions, the MSOFormer model achieves higher performance with fewer parameters and computational resources, thanks to the design of the EPD network. For the detection of small target categories such as nonmetal impurity and metal scrap, the MSOFormer model, based on a multi-scale learning strategy, incorporates spatial location awareness information according to defect characteristics, enhancing the representation of small-size foreground targets and further alleviating the issue of random generation of small target positions.
Qualitative comparisons of the lithium battery surface defect image test set are shown in Figure 5. This paper presents six representative local visualization examples to demonstrate the superiority of the proposed method, with some key areas highlighted using dashed boxes. In the results of the fifth row, a subtle and shallow crack on the left side of the electrode tab is nearly imperceptible to the naked eye; only MSOFormer can extract inter-class differences and correctly classify it, demonstrating the effectiveness of the method. In the last row of results, the final solder point on the right side of the electrode tab clearly does not have a weld crack defect, and its texture and color significantly differ from the surrounding area, showing substantial intra-class variability. Pixel-based classification networks, such as FCN, KNet, and NadiNet, incorrectly identified this area as a solder crack defect, whereas MSOFormer made an accurate prediction. This indicates that the MSOFormer model effectively learns the interaction between foreground and background, highlighting the importance of foreground–background balance in semantic segmentation of lithium battery defect images. Additionally, results from other rows show that MSOFormer more accurately identifies surface defects on lithium batteries. Overall, the proposed MSOFormer model produces more accurate segmentation maps, particularly for complex and irregular defect patterns, offering finer boundary details and superior performance compared to baseline models.

4.4.2. Segmentation Results on the MT Dataset and the NEU-Seg Dataset

Table 4 and Table 5 report the quantitative comparisons on the MT and NEU-Seg test sets, respectively. Both MT and NEU-Seg are open-source industrial surface defect datasets, characterized by imbalances between foreground and background and irregular defect shapes. This study evaluates the generalization ability of MSOFormer using these two open-source datasets. The following conclusions can be drawn:
  • MSOFormer consistently outperforms other methods, showing significant advantages in mIoU, mPrecision, and mRecall, which demonstrates the superiority and generalization capability of the proposed MSOFormer model.
  • A further analysis of the IoU scores for each category reveals that MSOFormer achieves the best performance in categories such as Blowhole, Break, Crack, and Uneven in the MT dataset, as well as Inclusion and Patches in the NEU-Seg dataset. In some categories, its performance is only slightly behind other mask classification methods. This indicates that the proposed method is more capable of modeling complex and irregular targets and highlights the importance of mask classification approaches in the semantic segmentation of industrial surface defect images.
  • Compared to the baseline, with the same backbone, MSOFormer shows improvements in mIoU and mPrecision over the baseline model Mask2Former by 0.56% and 1.97% on the MT dataset, and 0.59% and 0.78% on the NEU-Seg dataset. These results preliminarily demonstrate the advantages of the DSQ and MBL modules, providing a more accurate and robust segmentation model and exploring a new paradigm for foreground–background separation in industrial surface defect image semantic segmentation. More details of the ablation experiments are analyzed in Section 5.3.

4.5. Simulation Production Line Test

To evaluate the operational efficiency of MSOFormer in a production pipeline and its detection rate for lithium battery defects, professionals utilized an additional 63 sets of lithium battery packs to simulate the production line and test the model. The model was run on an RTX3060 GPU using the TensorRT acceleration framework.
Due to the absence of standard ground truth masks on the production line, it was impossible to calculate pixel-level evaluation metrics such as IoU. Therefore, detection rate, false detection rate, and missed detection rate were used to assess the models’ performances. The models’ operational efficiencies were measured using an input resolution of 512 × 512 pixels, and the speed was evaluated in frames per second (fps).
As shown in Table 6, MSOFormer demonstrated superior performance on the TensorRT acceleration framework, achieving higher detection rates, lower false detection rates, and lower missed detection rates, while also achieving higher FPS compared to the baseline model.

5. Discussion

5.1. Certification of MBL Module

This section analyzes the mask attention mechanism of the Mask2Former model, focusing on the correlation γ between the foreground regions of interest for N mask queries and the true mask regions, and its correlation with the model performance limit δ . Here, γ represents the IoU between the regions of interest and the true regions, while δ denotes the IoU between the model’s predictions and the true masks. The Pearson correlation coefficient between γ and δ was calculated and followed by a quadratic polynomial fitting. The Pearson correlation coefficient measures the linear relationship between two variables and is calculated based on the covariance matrix of the data. The formula for the Pearson correlation coefficient between γ and δ is as follows:
$$\mathrm{Corr}(\gamma, \delta) = \frac{\mathrm{cov}(\gamma, \delta)}{\sqrt{\mathrm{var}(\gamma) \times \mathrm{var}(\delta)}},$$
where $\mathrm{cov}(\gamma, \delta)$ is the covariance of $\gamma$ and $\delta$, and $\mathrm{var}(\gamma)$ and $\mathrm{var}(\delta)$ are their respective variances. The computed Pearson correlation coefficient of 0.7205 indicates a significant positive correlation between $\gamma$ and $\delta$, with a strong linear relationship. The p-value of 1.161 × 10⁻¹⁷ is far below 0.05, confirming the statistical significance of this correlation.
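The correlation computation above can be sketched in a few lines. The (γ, δ) pairs below are invented toy values chosen only to illustrate a strong positive correlation; the paper's measured coefficient of 0.7205 comes from its own data, which are not reproduced here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: cov(x, y) / sqrt(var(x) * var(y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)

# Toy gamma (region-of-interest IoU) and delta (prediction IoU) samples.
gamma = [0.20, 0.35, 0.50, 0.60, 0.75, 0.90]
delta = [0.30, 0.42, 0.55, 0.63, 0.78, 0.80]
r = pearson(gamma, delta)  # strongly positive, close to 1
```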
The γ–δ data were fitted with both linear and quadratic polynomials, yielding RMSEs of 5.852 and 5.751, respectively. Since the quadratic fit achieves the lower RMSE, the relationship between γ and δ can be more intuitively observed through the quadratic polynomial fit, as shown in the curve graph in Figure 6.
The fitted curve indicates that as γ increases, δ generally shows an upward trend, further confirming the strong positive correlation between γ and δ . However, in higher ranges of γ , the growth rate of δ slows down and even begins to decline. This phenomenon suggests that the region of interest should include not only the GT mask area but also its surrounding area, meaning that a higher overlap between the region of interest and the expanded GT mask area results in better model performance. Therefore, the expansion coefficient α of the GT mask area becomes a critical factor in determining the effectiveness of the MBL guidance module. An ablation study of related parameters will be presented in Section 5.3.

5.2. Study of Hyperparameters

In the MSOFormer model, four hyperparameters directly influence network performance: the number of mask queries N q , the number of mask query decoders N t d , the boundary generation expansion coefficient α , and the MBL loss weight β . The following is an ablation study of these hyperparameters on the lithium battery defect dataset.
The number of mask queries N q and the number of mask query decoders N t d both affect the quality of the mask query generation. In the experiments, N t d was set to 3, 6, 9, and 12 (corresponding to 1, 2, 3, and 4 decoder groups), while N q was set to 20, 50, 100, and 200. As shown in the 3D curves in Figure 7, as N q increases from 20 to 100 and beyond, the mIoU initially increases and then stabilizes, while the model performance stabilizes when N t d reaches 9 layers or more. Considering the impact of N q and N t d on model running speed, the final hyperparameters are set to 50 and 9, respectively.
The MBL loss weight β and the boundary generation expansion coefficient α both affect the precision of the MBL module in generating guidance for the region of interest. Therefore, these two hyperparameters were evaluated separately, with one parameter fixed and the other varied. First, β was set to 20 to analyze the boundary generation expansion coefficient α , which ranged from 0 to 4. The performance curve is shown in Figure 8a. The results indicate that mIoU increases and then decreases, with the highest mIoU when α is 1. Similarly, in the analysis of β , α was set to 1 and β was varied from 0 to 40. Figure 8b shows that the model’s mIoU performs best when β is 20. Thus, the hyperparameters α and β are finally set to 1 and 20, respectively.

5.3. Effectiveness Analysis of the Modules

In the proposed MSOFormer model, the EPD focuses on addressing the irregular shapes of foreground objects, DSQ mitigates the random generation of small foreground objects in defect images, and the MBL mechanism aims to address the issue of foreground–background imbalance. To better evaluate the contribution of each module, we conducted ablation studies using eight different settings to analyze the effectiveness of EPD, DSQ, and MBL. Among these settings, the first corresponds to the baseline model Mask2Former, while the eighth represents the complete MSOFormer model. The comprehensive ablation study results on the lithium battery defect dataset are shown in Table 7, documenting how the model continuously improves segmentation performance, with the improvements relative to the baseline shown in parentheses. The following conclusions can be drawn:
(1)
Adding EPD and DSQ to the baseline module resulted in improvements across all metrics. Compared to the baseline model Mask2Former, DSQ can more effectively integrate the semantic and visual representations of randomly dispersed objects through positional information.
(2)
The MBL module provides a more significant performance boost compared to the traditional mask attention mechanism, with improvements of 0.71% and 0.34% in mIoU and m-F1, respectively.
(3)
The EPD, DSQ, and MBL modules are complementary and work synergistically to enhance the final model’s performance. Thus, the proposed MSOFormer can achieve state-of-the-art performance on the lithium battery defect dataset by learning a balanced interdependence between foreground and background.

6. Conclusions

The segmentation of surface defects in lithium batteries is crucial for various industrial applications, particularly in the manufacturing process, as it significantly impacts product quality and reliability. To address issues of false positives and negatives caused by the severe foreground–background imbalance in defect images, this paper proposes a defect segmentation approach based on the mask classification model MSOFormer. By introducing the MBL module, the boundaries of the mask computation area in the mask attention mechanism are refined to achieve a more complete foreground region. Additionally, challenges such as the randomness of defect location generation and irregular shapes present further difficulties in defect segmentation. The DSQ module leverages the spatial attributes of defect images, assigning the spatial location data to each mask query. The EPD, through a fully convolutional FPN architecture design with relatively low computational complexity, captures the deformed receptive field of irregular defects, thereby enhancing the quality of the defect segmentation.
Comprehensive experiments conducted on the lithium battery surface defect dataset and two public datasets with similar defect characteristics demonstrate that the proposed MSOFormer achieves higher accuracy compared to other state-of-the-art defect detection methods. A series of ablation studies also confirm the effectiveness of each module in MSOFormer. Furthermore, the visualization results on the lithium battery surface defect test set indicate that MSOFormer and other mask classification methods achieve lower false positive and false negative rates, further validating the reliability of the model’s final recognition results.
Given the high computational complexity of the mask attention mechanism in the MSOFormer model, which demands significant computational power for real-time production lines, future research will focus on optimizing algorithms for the mask attention mechanism to reduce the computational requirements for real-time production lines.

Author Contributions

Conceptualization, D.S. and X.X.; formal analysis, D.S. and H.Z.; methodology, D.S.; writing–review and editing, D.S. and X.X.; project administration, J.C.; validation, J.C.; writing–original draft preparation, J.C.; data curation, P.W.; software, P.W. and Z.D.; visualization, P.W.; resources, Y.P. and X.X.; investigation, H.Z.; supervision, Z.D. and X.X.; funding acquisition, X.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially funded by the Guangdong Province Key Area Research and Development Plan (2023B0202090001), Enterprise Commissioned Horizontal Projects (HXKJHT20242017, HXKJHT20240499, H20220780), and the 2023 Guangdong Province Science and Technology Innovation Strategic Special Fund (College Students’ Science and Technology Innovation Cultivation) project (pdjh2023b0081).

Data Availability Statement

The data presented in this study are available from the corresponding author upon request, as the dataset used in this paper is the company’s proprietary data.

Acknowledgments

The authors would like to thank Yao Gangdong and Chen Yanming from Guangzhou Heyisihui Electronic Information Co., Ltd. for providing hardware and financial support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jha, S.B.; Babiceanu, R.F. Deep CNN-Based Visual Defect Detection: Survey of Current Literature. Comput. Ind. 2023, 148, 103911. [Google Scholar] [CrossRef]
  2. Rong, D.; Rao, X.; Ying, Y. Computer Vision Detection of Surface Defect on Oranges by Means of a Sliding Comparison Window Local Segmentation Algorithm. Comput. Electron. Agric. 2017, 137, 59–68. [Google Scholar] [CrossRef]
  3. Kim, H.; Lee, S.; Han, S. Railroad Surface Defect Segmentation Using a Modified Fully Convolutional Network. KSII Trans. Internet Inf. Syst. TIIS 2020, 14, 4763–4775. [Google Scholar] [CrossRef]
  4. Guo, F.; Chen, Z.; Hu, J.; Zuo, L.; Xiahou, T.; Liu, Y. An End-to-End Bilateral Network for Multidefect Detection of Solid Propellants. IEEE Trans. Ind. Inform. 2024, 20, 8347–8357. [Google Scholar] [CrossRef]
  5. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  6. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, Future Directions. J. Big Data 2021, 8, 53. [Google Scholar] [CrossRef]
  7. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  8. Cao, J.; Yang, G.; Yang, X. A Pixel-Level Segmentation Convolutional Neural Network Based on Deep Feature Fusion for Surface Defect Detection. IEEE Trans. Instrum. Meas. 2020, 70, 5003712. [Google Scholar] [CrossRef]
  9. Liang, Z.; Zhang, H.; Liu, L.; He, Z.; Zheng, K. Defect Detection of Rail Surface with Deep Convolutional Neural Networks. In Proceedings of the 2018 13th World Congress on Intelligent Control and Automation (WCICA), Changsha, China, 4–8 July 2018; pp. 1317–1322. [Google Scholar]
  10. Zhang, J.; Ding, R.; Ban, M.; Guo, T. FDSNeT: An Accurate Real-Time Surface Defect Segmentation Network. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 3803–3807. [Google Scholar]
  11. Dong, H.; Song, K.; He, Y.; Xu, J.; Yan, Y.; Meng, Q. PGA-Net: Pyramid Feature Fusion and Global Context Attention Network for Automated Surface Defect Detection. IEEE Trans. Ind. Inform. 2020, 16, 7448–7458. [Google Scholar] [CrossRef]
  12. Schmid, S.; Reinhardt, J.; Grosse, C.U. Spatial and Temporal Deep Learning for Defect Detection with Lock-in Thermography. NDT E Int. 2024, 143, 103063. [Google Scholar] [CrossRef]
  13. Wang, J.; Xu, G.; Yan, F.; Wang, J.; Wang, Z. Defect Transformer: An Efficient Hybrid Transformer Architecture for Surface Defect Detection. arXiv 2022, arXiv:2207.08319. [Google Scholar] [CrossRef]
  14. Cheng, Z.; Sun, H.; Cao, Y.; Cao, W.; Wang, J.; Yuan, G.; Zheng, J. Pyramid Cross Attention Network for Pixel-Wise Surface Defect Detection. NDT E Int. 2024, 143, 103053. [Google Scholar] [CrossRef]
  15. Liu, T.; Zheng, P.; Liu, X. A Multiple Scale Spaces Empowered Approach for Welding Radiographic Image Defect Segmentation. NDT E Int. 2023, 139, 102934. [Google Scholar] [CrossRef]
  16. Xiao, M.; Yang, B.; Wang, S.; Mo, F.; He, Y.; Gao, Y. GRA-Net: Global Receptive Attention Network for Surface Defect Detection. Knowl. Based Syst. 2023, 280, 111066. [Google Scholar] [CrossRef]
  17. Zhang, C.; Cui, J.; Wu, J.; Zhang, X. Attention Mechanism and Texture Contextual Information for Steel Plate Defects Detection. J. Intell. Manuf. 2024, 35, 2193–2214. [Google Scholar] [CrossRef]
  18. Zhang, Y.; Wu, J.; Li, Q.; Zhao, X.; Tan, M. Beyond Crack: Fine-Grained Pavement Defect Segmentation Using Three-Stream Neural Networks. IEEE Trans. Intell. Transp. Syst. 2022, 23, 14820–14832. [Google Scholar] [CrossRef]
  19. Wang, C.; Chen, H.; Zhao, S. RERN: Rich Edge Features Refinement Detection Network for Polycrystalline Solar Cell Defect Segmentation. IEEE Trans. Ind. Inform. 2024, 20, 1408–1419. [Google Scholar] [CrossRef]
  20. Lin, Q.; Zhou, J.; Ma, Q.; Ma, Y.; Kang, L.; Wang, J. EMRA-Net: A Pixel-Wise Network Fusing Local and Global Features for Tiny and Low-Contrast Surface Defect Detection. IEEE Trans. Instrum. Meas. 2022, 71, 2504314. [Google Scholar] [CrossRef]
  21. Niu, S.; Li, B.; Wang, X.; Peng, Y. Region- and Strength-Controllable GAN for Defect Generation and Segmentation in Industrial Images. IEEE Trans. Ind. Inform. 2022, 18, 4531–4541. [Google Scholar] [CrossRef]
  22. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-Attention Mask Transformer for Universal Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  23. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention, Proceedings of the MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18; Springer International Publishing: Berlin/Heidelberg, Germany, 2015. [Google Scholar]
  24. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  25. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  26. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  27. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  28. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Computer Vision—ECCV 2018, Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018, Proceedings, Part XV; Springer: Berlin/Heidelberg, Germany, 2018. [Google Scholar]
  29. Zhao, H.; Zhang, Y.; Liu, S.; Shi, J.; Loy, C.C.; Lin, D.; Jia, J. PSANet: Point-Wise Spatial Attention Network for Scene Parsing. In Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part IX. pp. 270–286, ISBN 978-3-030-01239-7. [Google Scholar]
  30. Huang, Z.; Wang, X.; Wei, Y.; Huang, L.; Shi, H.; Liu, W.; Huang, T.S. CCNet: Criss-Cross Attention for Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  31. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  32. Guo, M.-H.; Lu, C.-Z.; Hou, Q.; Liu, Z.; Cheng, M.-M.; Hu, S.-M. SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 1140–1156. [Google Scholar]
  33. Yang, L.; Fan, J.; Huo, B.; Li, E.; Liu, Y. A Nondestructive Automatic Defect Detection Method with Pixelwise Segmentation. Knowl. Based Syst. 2022, 242, 108338. [Google Scholar] [CrossRef]
  34. Liu, T.; He, Z.; Lin, Z.; Cao, G.-Z.; Su, W.; Xie, S. An Adaptive Image Segmentation Network for Surface Defect Detection. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 8510–8523. [Google Scholar] [CrossRef]
  35. Du, W.; Shen, H.; Fu, J. Automatic Defect Segmentation in X-Ray Images Based on Deep Learning. IEEE Trans. Ind. Electron. 2021, 68, 12912–12920. [Google Scholar] [CrossRef]
  36. Yu, H.; Li, X.; Song, K.; Shang, E.; Liu, H.; Yan, Y. Adaptive Depth and Receptive Field Selection Network for Defect Semantic Segmentation on Castings X-Rays. NDT E Int. 2020, 116, 102345. [Google Scholar] [CrossRef]
  37. Li, W.; Li, B.; Niu, S.; Wang, Z.; Wang, M.; Niu, T. LSA-Net: Location and Shape Attention Network for Automatic Surface Defect Segmentation. J. Manuf. Process. 2023, 99, 65–77. [Google Scholar] [CrossRef]
  38. Zhou, Z.; Zhang, J.; Gong, C. Hybrid Semantic Segmentation for Tunnel Lining Cracks Based on Swin Transformer and Convolutional Neural Network. Comput. Aided Civ. Infrastruct. Eng. 2023, 38, 2491–2510. [Google Scholar] [CrossRef]
  39. Yao, H.; Luo, W.; Yu, W.; Zhang, X.; Qiang, Z.; Luo, D.; Shi, H. Dual-Attention Transformer and Discriminative Flow for Industrial Visual Anomaly Detection. IEEE Trans. Autom. Sci. Eng. 2023, 21, 6126–6140. [Google Scholar] [CrossRef]
  40. Zhang, Q.; Lai, J.; Zhu, J.; Xie, X. Wavelet-Guided Promotion-Suppression Transformer for Surface-Defect Detection. IEEE Trans. Image Process. 2023, 32, 4517–4528. [Google Scholar] [CrossRef] [PubMed]
  41. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  42. Cheng, B.; Schwing, A.G.; Kirillov, A. Per-Pixel Classification Is Not All You Need for Semantic Segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 17864–17875. [Google Scholar]
  43. Cavagnero, N.; Rosi, G.; Cuttano, C.; Pistilli, F.; Ciccone, M.; Averta, G.; Cermelli, F. PEM: Prototype-Based Efficient MaskFormer for Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024. [Google Scholar]
  44. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. Available online: https://arxiv.org/abs/2103.02907v1 (accessed on 20 June 2024).
  45. Zhang, D.; Hao, X.; Wang, D.; Qin, C.; Zhao, B.; Liang, L.; Liu, W. An Efficient Lightweight Convolutional Neural Network for Industrial Surface Defect Detection. Artif. Intell. Rev. 2023, 56, 10651–10677. [Google Scholar] [CrossRef]
  46. Min, X.; Zhou, W.; Hu, R.; Wu, Y.; Pang, Y.; Yi, J. LWUAVDet: A Lightweight UAV Object Detection Network on Edge Devices. IEEE Internet Things J. 2024, 11, 24013–24023. [Google Scholar] [CrossRef]
47. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More Deformable, Better Results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  48. Xu, J.; Xiong, Z.; Bhattacharyya, S.P. PIDNet: A Real-Time Semantic Segmentation Network Inspired by PID Controllers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  49. Huang, Y.; Qiu, C.; Yuan, K. Surface Defect Saliency of Magnetic Tile. Vis. Comput. 2020, 36, 85–96. [Google Scholar] [CrossRef]
  50. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv 2019, arXiv:1912.01703. [Google Scholar]
  51. Open-Mmlab/Mmsegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark. Available online: https://github.com/open-mmlab/mmsegmentation (accessed on 24 June 2024).
  52. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
53. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  54. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  55. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  56. Ling, Z.; Zhang, A.; Ma, D.; Shi, Y.; Wen, H. Deep Siamese Semantic Segmentation Network for PCB Welding Defect Detection. IEEE Trans. Instrum. Meas. 2022, 71, 5006511. [Google Scholar] [CrossRef]
  57. Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified Perceptual Parsing for Scene Understanding. In Computer Vision—ECCV 2018, Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018, Proceedings, Part XV; Springer: Berlin/Heidelberg, Germany, 2018. [Google Scholar]
  58. Zhang, W.; Pang, J.; Chen, K.; Loy, C.C. K-Net: Towards Unified Image Segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 10326–10338. [Google Scholar]
  59. Zhang, D.; Song, K.; Xu, J.; He, Y.; Niu, M.; Yan, Y. MCnet: Multiple Context Information Segmentation Network of No-Service Rail Surface Defects. IEEE Trans. Instrum. Meas. 2021, 70, 5004309. [Google Scholar] [CrossRef]
  60. Li, G.; Han, C.; Liu, Z. No-Service Rail Surface Defect Segmentation via Normalized Attention and Dual-Scale Interaction. IEEE Trans. Instrum. Meas. 2023, 72, 5020310. [Google Scholar] [CrossRef]
Figure 1. Framework of the proposed MSOFormer for industrial surface defect image semantic segmentation.
Figure 2. Block structure of the DSQ module.
Figure 3. Illustration of Mask_attn. (a) Mask_1. (b) Mask_2. Red represents the region of interest; black represents the region of non-interest.
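To make the role of the masks in Figure 3 concrete, the following is a minimal, framework-free sketch of Mask2Former-style masked attention (an illustration of the general mechanism, not the authors' implementation): positions outside a query's region of interest receive a score of negative infinity, so the softmax assigns them exactly zero weight, and a query whose predicted mask is empty falls back to ordinary global attention.

```python
import math

def masked_attention_weights(scores, roi_mask):
    """Softmax attention weights for one query over all pixel positions.

    scores: raw query-key dot-product logits, one per position.
    roi_mask: booleans, True where the predicted mask marks a region of interest.
    """
    if not any(roi_mask):
        # Empty mask: fall back to unrestricted (global) attention.
        masked = list(scores)
    else:
        masked = [s if keep else float("-inf") for s, keep in zip(scores, roi_mask)]
    m = max(masked)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Positions 0 and 2 are foreground; position 1 gets exactly zero weight.
w = masked_attention_weights([1.0, 2.0, 3.0], [True, False, True])
```

Because `exp(-inf)` is 0, masked positions contribute nothing to the weighted sum of values, which is what lets the decoder focus each query on its own foreground region.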
Figure 4. (a) Original image. (b) Area diagram. The red area represents the battery electrode. The green area represents the battery tab. The blue area represents the sealant.
Figure 5. Visualizations of the prediction results of different methods on the lithium battery surface defect test set. The dotted boxes highlight key areas. (a) Original image, (b) ground truth, (c) FCN, (d) KNet, (e) NadiNet, (f) Mask2Former, (g) PEM, and (h) MSOFormer.
Figure 6. Quadratic graph of γ and δ .
Figure 7. Ablation study of N_q and N_td. The FPS of each model variant is annotated beside its data point.
Figure 8. Ablation study about (a) α and (b) β .
Table 1. Introduction to the categories of the lithium battery surface defect dataset.

| Defect Category | Introduction | Example |
| --- | --- | --- |
| Free | No defect | Mathematics 12 03627 i001 |
| Tab damage | The welding of the tab area is crooked or damaged | Mathematics 12 03627 i002 |
| Nonmetal impurity | Foreign objects such as carbon powder, hair, fiber flocs, or free flakes are present on the sealant | Mathematics 12 03627 i003 |
| Metal scrap | Often appears in the area around the tab as small debris with the color characteristics of the tab | Mathematics 12 03627 i004 |
| Electrode fold | The outer electrode is folded onto the battery body or its left and right edges, or is not welded to the tabs | Mathematics 12 03627 i005 |
| Electrode damage | The outer cathode electrode is damaged, generally in the welding area; the damage runs transversely and its width exceeds half of the electrode width | Mathematics 12 03627 i006 |
| Electrode weld crack | A distinct black mark on the electrode, or a pattern clearly different in color from the solder joint and the electrode, extending horizontally from the solder joint | Mathematics 12 03627 i007 |
Table 2. Introduction to the quantity of the categories of the lithium battery defect dataset.

| Category | Baseline Quantity | Baseline Proportion | Augmentation Quantity | Augmentation Proportion | Adaptive Quantity | Adaptive Proportion |
| --- | --- | --- | --- | --- | --- | --- |
| Free | 45 | 3.54% | 450 | 3.54% | 576 | 14.74% |
| Tab damage | 39 | 3.07% | 390 | 3.07% | 429 | 10.98% |
| Nonmetal impurity | 1175 | 92.37% | 11,750 | 92.37% | 3347 | 85.67% |
| Metal scrap | 139 | 10.93% | 1390 | 10.93% | 832 | 21.3% |
| Electrode fold | 403 | 31.68% | 4030 | 31.68% | 759 | 19.43% |
| Electrode damage | 48 | 3.77% | 480 | 3.77% | 547 | 14% |
| Electrode weld crack | 114 | 8.96% | 1140 | 8.96% | 674 | 17.25% |
| Total num | 1272 | 1 | 12,720 | 1 | 3907 | 1 |
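The proportions in Table 2 are each category's image count divided by the total for that dataset version (an image may contain several defect categories, so the per-category proportions need not sum to 100%). A quick sketch of that arithmetic for the Adaptive version, using the counts from the table:

```python
# Per-category image counts of the Adaptive dataset version (from Table 2).
counts = {
    "Free": 576,
    "Tab damage": 429,
    "Nonmetal impurity": 3347,
    "Metal scrap": 832,
    "Electrode fold": 759,
    "Electrode damage": 547,
    "Electrode weld crack": 674,
}
total = 3907  # total number of images in the Adaptive version

# Proportion (%) = quantity / total, matching the table's percentage columns.
proportions = {name: round(100 * n / total, 2) for name, n in counts.items()}
# e.g. 3347 / 3907 -> 85.67% for "Nonmetal impurity"
```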
| Method | mIoU (Baseline) | m-F1 (Baseline) | mIoU (Augmentation) | m-F1 (Augmentation) | mIoU (Adaptive) | m-F1 (Adaptive) |
| --- | --- | --- | --- | --- | --- | --- |
| Deeplabv3plus | 73.54 | 79.56 | 74.23 | 80.78 | 76.53 | 85.21 |
| SegFormer | 74.77 | 81.02 | 75.34 | 81.54 | 79.04 | 87.04 |
| Mask2Former | 78.5 | 85.7 | 79.57 | 86.24 | 83.01 | 90.02 |
Table 3. Quantitative comparison with state-of-the-art methods on the lithium battery surface defect test set (%). (a) Background; (b) tab damage; (c) nonmetal impurity; (d) metal scrap; (e) electrode fold; (f) electrode damage; and (g) electrode weld crack.

| Method | Backbone | (a) | (b) | (c) | (d) | (e) | (f) | (g) | mIoU | m-Pre | m-Re | Flops (GB) | Param (M) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FCN | Resnet50 | 99.84 | 96.53 | 32.22 | 63.55 | 85.09 | 63.45 | 64.17 | 72.12 | 85.91 | 79.60 | 57.99 | 47.13 |
| Deeplabv3plus | Resnet50 | 99.84 | 97.03 | 38.92 | 71.04 | 82.8 | 65.35 | 80.74 | 76.53 | 88.83 | 83.22 | 59.79 | 41.22 |
| UperNet | swint | 99.85 | 97.12 | 44.06 | 65.06 | 85.3 | 70.17 | 79.91 | 77.35 | 86.77 | 85.72 | 36 | 58.95 |
| KNet | swint | 99.86 | 97.43 | 47.83 | 76.91 | 85.18 | 72.63 | 83.51 | 80.48 | 89.89 | 86.72 | 249 | 72.16 |
| SegFormer | - | 99.87 | 96.75 | 42.25 | 72.3 | 87.74 | 72.47 | 81.73 | 79.04 | 90.66 | 84.61 | 41.94 | 44.6 |
| SegNeXt | - | 99.89 | 98.56 | 42.34 | 72.88 | 89.44 | 72.1 | 79.97 | 79.31 | 89.03 | 86.26 | 32.49 | 27.56 |
| McNet | Resnet50 | 99.87 | 97.88 | 37.47 | 76.92 | 88.45 | 68.7 | 72.26 | 77.36 | 89.7 | 83.23 | 54.35 | 46.77 |
| NadiNet | - | 99.87 | 97.6 | 43.75 | 75.53 | 87.25 | 71.46 | 84.24 | 79.96 | 90.92 | 85.35 | 61.92 | 41.27 |
| MaskFormer | swint | 99.87 | 97.68 | 49.74 | 82.61 | 87.41 | 73.13 | 85.64 | 82.3 | 90.25 | 88.66 | 54.89 | 46.46 |
| Mask2Former | swint | 99.88 | 98.54 | 52.16 | 84.16 | 87.21 | 72.6 | 86.56 | 83.01 | 91.09 | 88.98 | 69.44 | 47.4 |
| PEM | swint | 99.88 | 98.59 | 50.78 | 86.42 | 88.13 | 76.7 | 86.12 | 83.8 | 91.2 | 89.69 | 40.65 | 47.95 |
| MSOFormer | Resnet50 | 99.87 | 98.47 | 51.09 | 86.3 | 85.32 | 74.4 | 87.34 | 83.26 | 91.48 | 88.71 | 33.74 | 46.4 |
| MSOFormer | swint | 99.89 | 98.56 | 52.86 | 86.75 | 88.28 | 76.79 | 86.12 | 84.18 | 92.15 | 89.56 | 34.69 | 46.83 |
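The per-class and mean scores reported in Tables 3–5 follow the standard confusion-matrix definitions (IoU = TP/(TP+FP+FN), precision = TP/(TP+FP), recall = TP/(TP+FN), each averaged over classes). The following is a minimal, framework-free sketch of those formulas — an illustration of the standard metrics, not the evaluation code used in the paper — assuming every class occurs at least once:

```python
def segmentation_metrics(conf):
    """mIoU, mean precision, and mean recall from a pixel confusion matrix.

    conf[i][j] = number of pixels with ground-truth class i predicted as class j.
    """
    n = len(conf)
    ious, precisions, recalls = [], [], []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                         # row sum minus diagonal
        fp = sum(conf[r][c] for r in range(n)) - tp    # column sum minus diagonal
        ious.append(tp / (tp + fp + fn))
        precisions.append(tp / (tp + fp))
        recalls.append(tp / (tp + fn))
    mean = lambda xs: sum(xs) / len(xs)
    return mean(ious), mean(precisions), mean(recalls)

# Toy 2-class example: 8 + 9 correctly classified pixels, 3 confusions.
miou, m_pre, m_re = segmentation_metrics([[8, 2], [1, 9]])
```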
Table 4. Quantitative comparison with state-of-the-art methods on the MT test set (%). (a) Background; (b) Blowhole; (c) Break; (d) Crack; (e) Fray; and (f) Uneven.

| Method | Backbone | (a) | (b) | (c) | (d) | (e) | (f) | mIoU | m-Pre | m-Re |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FCN | Resnet50 | 99.54 | 51.77 | 81.05 | 60.97 | 89.68 | 73.73 | 76.12 | 87.06 | 83.98 |
| Deeplabv3plus | Resnet50 | 99.29 | 61.25 | 65.19 | 71.99 | 42.63 | 77.63 | 69.66 | 90.00 | 76.02 |
| UperNet | swint | 99.68 | 62.74 | 87.22 | 67.55 | 93.41 | 81.57 | 82.03 | 89.51 | 89.73 |
| KNet | swint | 99.68 | 63.99 | 88.32 | 71.03 | 93.14 | 81.33 | 82.91 | 89.26 | 91.22 |
| SegFormer | - | 99.67 | 55.95 | 87.6 | 69.7 | 92.46 | 81.01 | 81.06 | 88.52 | 89.12 |
| SegNeXt | - | 99.66 | 66.29 | 89.5 | 75.55 | 94.65 | 78.38 | 84.0 | 91.62 | 90.23 |
| McNet | Resnet50 | 99.54 | 64.91 | 84.09 | 72.59 | 92.9 | 71.51 | 80.92 | 90.49 | 87.54 |
| NadiNet | - | 99.59 | 64.53 | 89.14 | 71.27 | 92.42 | 74.51 | 81.91 | 92.44 | 86.93 |
| MaskFormer | swint | 99.56 | 70.98 | 87.59 | 74.84 | 93.66 | 72.71 | 83.22 | 90.21 | 90.76 |
| Mask2Former | swint | 99.69 | 71.83 | 88.98 | 73.58 | 93.76 | 81.99 | 84.97 | 89.55 | 93.85 |
| PEM | swint | 99.68 | 73.25 | 89.53 | 75.71 | 93.32 | 80.85 | 85.39 | 91.92 | 91.79 |
| MSOFormer | Resnet50 | 99.69 | 65.15 | 88.8 | 75.46 | 94.93 | 80.71 | 84.12 | 90.1 | 92.06 |
| MSOFormer | swint | 99.7 | 74.14 | 90.06 | 73.95 | 93.3 | 82.04 | 85.53 | 91.52 | 92.4 |
Table 5. Quantitative comparison with state-of-the-art methods on the NEU-Seg test set (%). (a) Background; (b) Inclusion; (c) Patches; and (d) Scratches.

| Method | Backbone | (a) | (b) | (c) | (d) | mIoU | m-Pre | m-Re |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FCN | Resnet50 | 97.5 | 73.39 | 85.44 | 79.88 | 84.05 | 90.66 | 91.56 |
| Deeplabv3plus | Resnet50 | 97.56 | 72.86 | 85.7 | 81.92 | 84.51 | 91.12 | 91.62 |
| UperNet | swint | 97.84 | 77.25 | 86.7 | 83.64 | 86.36 | 92.10 | 92.94 |
| KNet | swint | 97.79 | 77.48 | 86.58 | 83.15 | 86.25 | 91.6 | 93.33 |
| SegFormer | - | 97.81 | 76.82 | 86.63 | 83.1 | 86.09 | 92.15 | 92.57 |
| SegNeXt | - | 97.84 | 77.91 | 86.65 | 83.29 | 86.42 | 91.84 | 93.29 |
| McNet | Resnet50 | 97.65 | 76.11 | 85.71 | 81.29 | 85.19 | 91.39 | 92.23 |
| NadiNet | - | 97.62 | 75.0 | 86.48 | 82.14 | 85.31 | 89.71 | 94.29 |
| MaskFormer | swint | 97.83 | 78.54 | 86.31 | 83.62 | 86.57 | 91.91 | 93.42 |
| Mask2Former | swint | 97.87 | 77.83 | 86.68 | 83.45 | 86.46 | 92.43 | 92.75 |
| PEM | swint | 97.85 | 78.45 | 86.44 | 84.16 | 86.73 | 91.89 | 93.62 |
| MSOFormer | Resnet50 | 97.79 | 77.06 | 86.3 | 83.64 | 86.2 | 91.93 | 92.92 |
| MSOFormer | swint | 97.97 | 79.08 | 87.27 | 83.86 | 87.05 | 93.21 | 92.71 |
Table 6. Ablation study of the effectiveness of each module on the lithium battery surface defect dataset. (a) Free; (b) tab damage; (c) nonmetal impurity; (d) metal scrap; (e) electrode fold; (f) electrode damage; (g) electrode weld crack.

| Method | (a) | (b) | (c) | (d) | (e) | (f) | (g) | Macc | Mfal | Mleak | Fps |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FCN | 45.97 | 88.89 | 38.03 | 45.83 | 95.12 | 85.71 | 100.0 | 71.37 | 0.52 | 20.50 | 222.92 |
| Deeplabv3plus | 63.53 | 80.0 | 74.58 | 38.89 | 86.81 | 66.67 | 89.47 | 71.42 | 3.11 | 0.88 | 88.88 |
| UperNet | 74.29 | 100.0 | 87.03 | 60.71 | 89.89 | 75.0 | 100.0 | 83.85 | 3.42 | 6.01 | 59.55 |
| KNet | 79.69 | 100.0 | 91.74 | 88.0 | 91.67 | 100.0 | 100.0 | 93.01 | 3.71 | 2.24 | 26.84 |
| SegFormer | 77.46 | 88.89 | 83.54 | 68.0 | 96.39 | 75.0 | 100.0 | 84.18 | 1.79 | 6.68 | 97.22 |
| SegNeXt | 73.42 | 100.0 | 80.43 | 58.62 | 97.56 | 85.71 | 100.0 | 85.11 | 0.68 | 7.69 | 86.9 |
| McNet | 47.27 | 100.0 | 46.03 | 45.83 | 96.34 | 100.0 | 100.0 | 76.5 | 2.68 | 18.22 | 194.12 |
| NadiNet | 73.61 | 100.0 | 83.75 | 74.07 | 96.39 | 100.0 | 100.0 | 89.69 | 2.85 | 4.69 | 75.9 |
| MaskFormer | 76.67 | 100.0 | 92.37 | 95.83 | 95.18 | 85.71 | 100.0 | 92.25 | 6.31 | 0.54 | 70.95 |
| Mask2Former | 90.16 | 100.0 | 94.92 | 95.83 | 93.98 | 85.71 | 100.0 | 94.37 | 1.44 | 1.14 | 32.13 |
| PEM | 86.89 | 100.0 | 94.54 | 88.46 | 95.18 | 100 | 100 | 95.01 | 2.39 | 0.90 | 55.55 |
| MSOFormer | 93.33 | 100.0 | 96.2 | 100 | 95.24 | 100 | 100 | 97.82 | 1.36 | 0.48 | 54.21 |
Table 7. Ablation study of the effectiveness of each of the modules on the lithium battery surface defect dataset.

| Ablation Setting | EPD | DSQ | MBL | mIoU (%) | mF1 (%) | Flops (GB) | Params (M) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | | | | 83.01 | 90.02 | 69.44 | 47.4 |
| 2 | ✓ | | | 83.63 (+0.62) | 90.29 (+0.27) | 34.48 | 46.83 |
| 3 | | ✓ | | 83.76 (+0.75) | 90.36 (+0.34) | 69.86 | 47.4 |
| 4 | | | ✓ | 83.72 (+0.71) | 90.37 (+0.35) | 69.44 | 47.4 |
| 5 | ✓ | ✓ | | 83.94 (+0.93) | 90.47 (+0.45) | 34.69 | 46.83 |
| 6 | ✓ | | ✓ | 84.05 (+1.04) | 90.55 (+0.53) | 34.48 | 46.83 |
| 7 | | ✓ | ✓ | 83.99 (+0.98) | 90.54 (+0.52) | 69.86 | 47.4 |
| 8 | ✓ | ✓ | ✓ | 84.18 (+1.17) | 90.78 (+0.76) | 34.69 | 46.83 |

Share and Cite

Sun, D.; Chen, J.; Wu, P.; Pan, Y.; Zhong, H.; Deng, Z.; Xue, X. Mask-Space Optimized Transformer for Semantic Segmentation of Lithium Battery Surface Defect Images. Mathematics 2024, 12, 3627. https://doi.org/10.3390/math12223627