1. Introduction
Automation in welding and defect detection technology are two essential pillars driving the modernization of the manufacturing industry [1,2,3]. These technologies not only enable the standardization, automation, and intelligence of the welding process but also improve production efficiency and quality. As the most effective and economical technology for permanently joining metal parts, welding is widely used in key industries such as the automotive, aviation, petrochemical, manufacturing, and construction industries and national defense [4]. Ongoing advancements in the automotive industry have led to heightened safety standards, particularly concerning the quality of weld seams in brake joints, which directly affect vehicle safety and performance [5,6]. Accurate detection of defects in these weld seams is therefore crucial. Common detection methods include non-destructive testing (NDT) and image processing techniques [7]. While NDT offers high accuracy, it entails high costs, complex operations, and strict environmental requirements. Conversely, image processing methods use machine learning to analyze surface images of weld seams, providing cost-effective and flexible solutions, though they are sensitive to parameter selection and may struggle with noise and varying lighting conditions [8,9,10]. This paper explores defect detection technologies for brake joint weld seams, emphasizing their significance in automotive manufacturing and proposing optimized strategies to enhance brake system safety and reliability. Through this study, we aim to offer the automotive industry more reliable and efficient quality monitoring methods, ensuring the safety of drivers and passengers.
In recent years, weld defect detection technologies have undergone substantial advancements, driven by researchers integrating advanced techniques to overcome the limitations of traditional methods. For instance, Tsun-Yen Wu et al. [11] employed laser-generated ultrasound and electromagnetic ultrasonic transducers to evaluate weld seams, using various mother wavelets in discrete wavelet transform statistical methods to extract relevant defect features. Liguo Zhang et al. [12] designed a wall-climbing robot with a cross-structured optical sensor for weld seam detection, enabling the acquisition of three-dimensional information about the welds. Addressing the complexities and subtleties of defects recorded in X-ray images, Mengmeng Li et al. [13] combined thermal infrared imaging sensors with industrial robots and developed an image detection algorithm that transforms input RGB thermal images into single-channel grayscale images, followed by adaptive thresholding for binarization, effectively revealing the shape and location of defects. Congyi Wang [14] proposed the application of eddy current testing for the detection of micro-gap weld joints, establishing a magnetic dipole model for defects that was validated against the grayscale values of pit magneto-optical images and demonstrating the superiority of finite element analysis over the magnetic dipole model. Zhifen Zhang [15] focused on typical surface welding defects in the pulsed GTAW of aluminum alloys, proposing an algorithm that computes local gray probabilities in regions of interest for monitoring welding defects, thus enhancing the real-time and intelligent capabilities of robotic welding. Ahmad Bazzi [16] proposed a compressive sensing-based full matrix capture (FMC) data compression method for the phased array ultrasonic technique (PAUT) of nozzle welds. Qian Xu [17] likewise applied compressive sensing to FMC data compression to address the excessive signal acquisition, storage, and transmission volumes in nozzle weld defect monitoring. Han Ye [18] proposed a compressive sensing-based weld defect detection method aimed at addressing defects in submerged arc welds. Through continuous advancements, welding defect detection technology is progressing towards higher precision, automation, intelligence, and real-time monitoring, significantly enhancing the efficiency and reliability of welding quality assessments across various industries. Traditional methods have relied on manually designed feature extraction and machine vision classifiers; however, these approaches are subject to human intervention, leading to subjective biases that may result in missed or redundant detections [19]. With advancements in computer technology, deep learning methods have increasingly been applied to weld defect detection. Unlike traditional algorithms, convolutional neural networks (CNNs) automatically extract features, facilitating feature selection and classification while avoiding the pitfalls of manual methods [20].
The YOLO (You Only Look Once) algorithm, introduced by Joseph Redmon et al. in 2015 in the paper "You Only Look Once: Unified, Real-Time Object Detection," has seen multiple updates from version v1 to v10 since 2016, significantly improving detection accuracy while maintaining the speed of single-stage algorithms, and it is widely used in engineering inspection applications [21,22]. For instance, Melakhsou [23] proposed a comprehensive control system based on YOLOv3 for detecting weld defects in hot water tank connection pipes, utilizing a 13-layer Darknet-13 feature extractor that generates predictions at two scales and achieving high accuracy in identifying and localizing welding anomalies. Yang Xianbiao [24] introduced an improved YOLO-based method for detecting weld regions, addressing low recognition rates in welding sections by employing image inversion, k-nearest median filtering, CLAHE image enhancement, and gamma correction to enhance image contrast and improve detection accuracy. Ang Gao et al. [25] developed an enhanced YOLOv5 detection network by incorporating a RepVGG module and a Normalized Attention Module (NAM), optimizing the network structure to improve detection speed and the network's sensitivity to feature points. Lu HuaiXu [26] proposed the YOLOv5-IMPROVEMENT model, which integrates a CA attention mechanism, SIOU loss function, and FReLU activation function to enhance detection capabilities for small targets and capture low-sensitivity spatial information. Jiayi Tsang et al. [27] introduced a BOT module to extract global information from road damage images, accommodating the large span characteristics of crack targets, incorporated a large separable kernel attention (LSKA) mechanism to improve detection accuracy, and constructed a C2f Ghost block in the neck network to reduce computational load while enhancing feature extraction for complex road damage. While these advancements have significantly improved detection accuracy and speed, as of 2025, YOLO still faces challenges in specialized tasks such as weld seam defect detection. These include limited adaptability to variations in size, shape, and texture, difficulty capturing long-range dependencies for detecting small or dispersed defects, and sensitivity to noise and low resolution, leading to inaccurate defect identification under diverse conditions.
To solve these problems, in this paper, we design a more effective weld defect detection algorithm by employing a multi-level and multi-scale attention mechanism, enhancing the model’s ability to capture fine-grained details and distinguish subtle defect features from complex backgrounds, thereby improving detection robustness and accuracy in challenging welding scenarios.
3. Model and Optimization
3.1. YOLOv8 Model
YOLOv8, introduced by Ultralytics in 2023, is a recent evolution of the YOLO (You Only Look Once) series. Building upon the foundation of YOLOv5, it incorporates significant architectural and methodological innovations, with a primary focus on enhancing accuracy and usability for real-time object detection tasks. This version optimizes both detection precision and computational efficiency, addressing the growing demands of modern applications that require fast and reliable performance in dynamic environments. The advancements in YOLOv8 represent an important step forward in the progression of deep learning models for computer vision, balancing the trade-offs between speed, accuracy, and resource utilization in real-world scenarios.
As shown in Figure 4, YOLOv8 is an efficient object detection model with three main components: the backbone, the neck, and the detection head. The backbone uses convolutional neural networks (CNNs) to extract both low-level features (edges and textures) and high-level features (object shapes and patterns). The neck network enhances these features by merging multi-scale representations, combining shallow (detailed) and deep (abstract) features to improve the detection of objects of varying sizes. The detection head then outputs bounding boxes, class labels, and confidence scores for object recognition [30].
YOLOv8 is available in five variants (YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x) to suit different computational needs. The YOLOv8n (Nano) version, optimized for edge devices like the NVIDIA Jetson Nano, balances speed and accuracy in resource-constrained environments. YOLOv8n achieves 161 frames per second (FPS) with a batch size of 1 and 2.8 ms of training time with a batch size of 32, making it ideal for real-time and large-scale applications. These design improvements enhance both detection accuracy and computational efficiency across various platforms.
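For readers who want a quick feel for these variants, the sketch below loads each stock checkpoint through the Ultralytics Python API and reports the measured per-image speed. The image path is a placeholder, and the numbers obtained depend on the local hardware rather than matching the figures quoted above.

```python
from ultralytics import YOLO

# Hypothetical quick comparison of the five stock YOLOv8 variants; the image
# path is a placeholder and the reported speeds depend on local hardware.
for name in ("yolov8n.pt", "yolov8s.pt", "yolov8m.pt", "yolov8l.pt", "yolov8x.pt"):
    model = YOLO(name)
    results = model.predict("weld_sample.jpg", imgsz=640, conf=0.25)
    # results[0].speed holds preprocess/inference/postprocess times in ms per image
    print(name, results[0].speed)
```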
3.2. Backbone Improvement
Weld seam defects typically vary in size, shape, and texture. EfficientViT, through cascading grouped attention and parameter reallocation, can more effectively capture and integrate features from different scales, enabling the model to better represent and identify subtle variations in defects [31].
EfficientViT’s architecture features an efficient sandwich layout, cascaded group attention modules, and parameter redistribution, enhancing memory use, computational efficiency, and parameter allocation.
The sandwich layout is shown in Figure 5; input features pass through Feed-Forward Networks (FFNs), followed by cascaded group attention, and are then processed by additional FFN layers to produce the output features. This design aims to improve the model's memory efficiency while enhancing computational and parameter efficiency. The calculation formula is as follows:

$$X_{i+1} = \prod^{N} \Phi_i^{F}\Big(\Phi_i^{A}\Big(\prod^{N} \Phi_i^{F}(X_i)\Big)\Big)$$

where $X_i$ represents the full input features of the $i$-th block. The block uses $N$ FFNs $\Phi_i^{F}$ before and after a single self-attention layer $\Phi_i^{A}$ to transform $X_i$ into $X_{i+1}$; $\Phi_i^{A}(\cdot)$ denotes the result of processing by the single self-attention layer.
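To make the sandwich layout concrete, the following is a minimal PyTorch sketch of one such block, with N residual FFNs before and after a single self-attention layer. The layer sizes and the use of nn.MultiheadAttention as the attention stand-in are illustrative assumptions, not the exact EfficientViT implementation.

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """Residual token-wise feed-forward network."""
    def __init__(self, dim, hidden_ratio=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * hidden_ratio),
            nn.ReLU(inplace=True),
            nn.Linear(dim * hidden_ratio, dim),
        )

    def forward(self, x):
        return x + self.net(x)

class SandwichBlock(nn.Module):
    """N FFNs, one self-attention layer, then N more FFNs."""
    def __init__(self, dim, num_heads=4, n_ffn=2):
        super().__init__()
        self.pre_ffns = nn.ModuleList(FFN(dim) for _ in range(n_ffn))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.post_ffns = nn.ModuleList(FFN(dim) for _ in range(n_ffn))

    def forward(self, x):                 # x: (batch, tokens, dim)
        for ffn in self.pre_ffns:         # N FFNs before attention
            x = ffn(x)
        attn_out, _ = self.attn(x, x, x)  # single self-attention layer
        x = x + attn_out
        for ffn in self.post_ffns:        # N FFNs after attention
            x = ffn(x)
        return x

x = torch.randn(1, 196, 64)
print(SandwichBlock(64)(x).shape)         # torch.Size([1, 196, 64])
```

The shape check at the bottom confirms the block preserves the token and channel dimensions, as required for stacking.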
The cascaded group attention mechanism, the middle module in Figure 5, improves computational efficiency and increases the diversity of attention by decomposing the full features into multiple subspaces and projecting and accumulating these subspaces. The calculation formulas are as follows:

$$\widetilde{X}_{ij} = \mathrm{Attn}\big(X_{ij} W_{ij}^{Q},\; X_{ij} W_{ij}^{K},\; X_{ij} W_{ij}^{V}\big)$$
$$\widetilde{X}_{i+1} = \mathrm{Concat}\big[\widetilde{X}_{ij}\big]_{j=1:h}\, W_i^{P}$$

where, in the $j$-th attention head, self-attention is computed on $X_{ij}$, the $j$-th partition of the input feature $X_i$, with $X_i = [X_{i1}, X_{i2}, \ldots, X_{ih}]$ and $1 \le j \le h$. The projection layers $W_{ij}^{Q}$, $W_{ij}^{K}$, and $W_{ij}^{V}$ map each partition into different subspaces, while $W_i^{P}$ projects the concatenated outputs back to the original input dimension. In the cascade, $X'_{ij} = X_{ij} + \widetilde{X}_{i(j-1)}$, formed by adding $X_{ij}$ to the output of the $(j-1)$-th head, replaces $X_{ij}$ as the input for the $j$-th head's self-attention.
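The cascaded behavior can be sketched as follows: the feature is split into h channel partitions, each head attends over its partition plus the previous head's output, and the concatenated heads are projected back to the input dimension. Projection shapes and the scaling factor are simplified assumptions for illustration.

```python
import torch
import torch.nn as nn

class CascadedGroupAttention(nn.Module):
    """Each head attends over its channel partition plus the previous head's output."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.h = num_heads
        d = dim // num_heads
        self.qkv = nn.ModuleList(nn.Linear(d, 3 * d) for _ in range(num_heads))
        self.proj = nn.Linear(dim, dim)     # W_P: project concat back to input dim
        self.scale = d ** -0.5

    def forward(self, x):                            # x: (batch, tokens, dim)
        splits = x.chunk(self.h, dim=-1)             # X_i1 ... X_ih
        outs, prev = [], 0
        for j in range(self.h):
            xj = splits[j] + prev                    # add previous head's output
            q, k, v = self.qkv[j](xj).chunk(3, dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.scale
            out = attn.softmax(dim=-1) @ v
            outs.append(out)
            prev = out                               # cascade to the next head
        return self.proj(torch.cat(outs, dim=-1))

x = torch.randn(2, 49, 64)
print(CascadedGroupAttention(64)(x).shape)           # torch.Size([2, 49, 64])
```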
Parameter reallocation optimizes model efficiency by expanding key network component channels and reducing less critical ones. This redistribution minimizes redundancy, refining summation and projection at each stage based on importance analysis. The strategy improves parameter utilization and enhances overall model performance.
3.3. Attention Mechanism Improvement
To address issues such as feature color blur, lighting interference, and background noise in weld seam photographs, the CAFM attention mechanism is employed. It effectively captures both global and local features, enabling the model to focus more accurately on key information in the image, which in turn enhances detection accuracy and efficiency. As shown in Figure 6, the CAFM architecture comprises a local branch and a global branch, each addressing different feature scales. The global branch incorporates a self-attention mechanism to capture long-range dependencies, while the local branch utilizes channel mixing to enhance model complexity, improving representation and reducing overfitting risks. This combination optimizes feature extraction and generalization in complex datasets [32].
The local branch in Figure 6 improves detail capture and noise suppression by first applying a 1 × 1 convolution to adjust the channel dimensions, followed by channel shuffling to enhance inter-channel information integration. The input tensor is divided into groups, with depthwise separable convolutions applied within each group, and the outputs are concatenated to form a new tensor. A 3 × 3 × 3 convolution is then used for feature extraction. The global branch employs attention mechanisms to model long-range dependencies, integrating global context with local details for enhanced detection accuracy and efficiency. The formula for the local branch can be expressed as:

$$F_{\mathrm{conv}} = W_{3\times3\times3}\big(\mathrm{CS}\big(W_{1\times1}(Y)\big)\big)$$

where $F_{\mathrm{conv}}$ denotes the output of the local branch, $W_{1\times1}$ represents the $1\times1$ convolution, $W_{3\times3\times3}$ denotes the $3\times3\times3$ convolution, $\mathrm{CS}(\cdot)$ indicates the channel shuffling operation, and $Y$ refers to the input features.
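A compact PyTorch sketch of the local branch, following the formula above (1 × 1 convolution, channel shuffle, then a 3 × 3 × 3 convolution over the channel-height-width volume). The group count is an assumed value, and the grouped depthwise step is folded into the shuffle for brevity.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """CS(.): interleave channels across groups."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2)
    return x.reshape(b, c, h, w)

class CAFMLocalBranch(nn.Module):
    def __init__(self, channels, groups=4):
        super().__init__()
        self.groups = groups
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)   # W_1x1
        self.conv3d = nn.Conv3d(1, 1, kernel_size=3, padding=1)  # W_3x3x3

    def forward(self, y):                     # y: (B, C, H, W)
        x = self.pw(y)                        # adjust channel dimensions
        x = channel_shuffle(x, self.groups)   # inter-channel information mixing
        x = self.conv3d(x.unsqueeze(1))       # treat channels as a third spatial axis
        return x.squeeze(1)                   # F_conv, same shape as the input

y = torch.randn(1, 32, 64, 64)
print(CAFMLocalBranch(32)(y).shape)           # torch.Size([1, 32, 64, 64])
```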
In the global branch, as shown in Figure 6, an attention mechanism is introduced to enhance long-range information interaction. Initially, $1\times1$ convolutions and $3\times3$ depthwise convolutions generate the $Q$ (Query), $K$ (Key), and $V$ (Value) tensors. These tensors are then reshaped to compute the attention map, focusing on relationships between image regions. The main formulas for the global branch are as follows:

$$\mathrm{Attention}(\hat{Q}, \hat{K}, \hat{V}) = \hat{V} \cdot \mathrm{softmax}\big(\hat{K}\hat{Q}/\alpha\big)$$
$$F_{\mathrm{att}} = W_{1\times1}\,\mathrm{Attention}(\hat{Q}, \hat{K}, \hat{V}) + Y$$

where the $1\times1$ convolution and $3\times3$ depthwise convolution generate the $Q$ (Query), $K$ (Key), and $V$ (Value) tensors, each of shape $\hat{H}\times\hat{W}\times\hat{C}$. Next, $Q$ is reshaped to $\hat{H}\hat{W}\times\hat{C}$ and $K$ is reshaped to $\hat{C}\times\hat{H}\hat{W}$. The attention map is then computed through the interaction between the reshaped $K$ and $Q$, instead of calculating a large regular attention map of size $\hat{H}\hat{W}\times\hat{H}\hat{W}$, which reduces the computational burden. $\alpha$ is a learnable scaling parameter used to control the magnitude of the matrix multiplication between $K$ and $Q$ before applying the softmax function.

Finally, the output of the CAFM module is computed as:

$$F_{\mathrm{out}} = F_{\mathrm{att}} + F_{\mathrm{conv}}$$
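The global branch and the final fusion can be sketched as follows. The channel-wise (C × C) attention map, the learnable α, and the residual connection follow the description above, while the exact layer widths and the wiring to the local branch are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAFMGlobalBranch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 conv followed by a depthwise 3x3 conv produces Q, K, V together
        self.qkv = nn.Sequential(
            nn.Conv2d(channels, channels * 3, 1),
            nn.Conv2d(channels * 3, channels * 3, 3, padding=1, groups=channels * 3),
        )
        self.out = nn.Conv2d(channels, channels, 1)
        self.alpha = nn.Parameter(torch.ones(1))   # learnable scaling parameter

    def forward(self, y):                          # y: (B, C, H, W)
        b, c, h, w = y.shape
        q, k, v = self.qkv(y).chunk(3, dim=1)
        q, k, v = q.flatten(2), k.flatten(2), v.flatten(2)          # (B, C, HW)
        attn = F.softmax((q @ k.transpose(-2, -1)) / self.alpha, dim=-1)  # (B, C, C)
        f_att = (attn @ v).view(b, c, h, w)        # channel-wise attention output
        return self.out(f_att) + y                 # residual connection

# Hypothetical fusion with the local branch sketched earlier:
# f_out = CAFMGlobalBranch(32)(y) + CAFMLocalBranch(32)(y)
```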
3.4. C2f Improvement
Traditional convolution operations use fixed kernels, limiting their ability to adapt to diverse features, particularly in complex tasks like weld defect detection. While increasing the depth, width, or resolution of convolutional neural networks (CNNs) can enhance performance, these approaches may still fail to capture intricate features. To address this, we introduce Dynamic Convolution (DyConv) into the C2f module, improving model performance without increasing network size [33].
DyConv dynamically adjusts its convolutional parameters to suit task-specific requirements, enhancing the model’s ability to detect diverse objects. This is particularly useful in defect detection, where weld defect dimensions and shapes vary. DyConv uses dynamically generated kernels to improve flexibility and efficiency, adapting based on the input image.
By integrating a coefficient generation module, such as a multilayer perceptron (MLP), DyConv adjusts kernel weights to focus on critical defect regions, minimizing background noise and improving precision, especially in noisy welding environments. This dynamic adjustment enhances both defect recognition and detection accuracy for variable targets, making DyConv effective for complex detection tasks that traditional CNNs struggle with.
In summary, DyConv’s adaptive mechanism enhances robustness and precision, making it particularly suitable for defect detection in challenging industrial environments. It outperforms conventional CNNs by improving focus on key areas, boosting detection accuracy, and reducing noise sensitivity.
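A minimal sketch of the dynamic convolution idea described above: several parallel kernels are blended per input sample using softmax attention coefficients produced by a small pooling-plus-MLP module. The kernel count and reduction ratio are illustrative assumptions, not the configuration used in YOLOv8-WD.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Attention-weighted combination of several convolution kernels."""
    def __init__(self, in_ch, out_ch, kernel_size=3, num_kernels=4, reduction=4):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(num_kernels, out_ch, in_ch, kernel_size, kernel_size) * 0.02
        )
        self.attn = nn.Sequential(            # coefficient-generation MLP
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(in_ch, in_ch // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(in_ch // reduction, num_kernels),
        )
        self.padding = kernel_size // 2

    def forward(self, x):                     # x: (B, C_in, H, W)
        b = x.size(0)
        coeff = F.softmax(self.attn(x), dim=1)                 # (B, K) per-sample weights
        w = torch.einsum("bk,koihw->boihw", coeff, self.weight)  # blend the K kernels
        w = w.reshape(-1, *self.weight.shape[2:])              # (B*out, in, k, k)
        out = F.conv2d(x.reshape(1, -1, *x.shape[2:]), w,
                       padding=self.padding, groups=b)         # grouped conv per sample
        return out.view(b, -1, *out.shape[2:])

x = torch.randn(2, 16, 32, 32)
print(DynamicConv2d(16, 32)(x).shape)          # torch.Size([2, 32, 32, 32])
```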
Grad-CAM (Gradient-weighted Class Activation Mapping) is a gradient-based technique that generates class activation maps by weighting gradients from intermediate layers of a convolutional neural network (CNN). It visualizes the regions of an input image most influential to the model's prediction, where red areas indicate high contribution, yellow represents secondary attention, and blue suggests minimal relevance. As shown in Figure 7, the original YOLOv8 model disperses attention across the image, whereas with DyConv incorporated, the model focuses more effectively on key features, demonstrating its ability to enhance feature localization in YOLOv8.
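For reference, a bare-bones Grad-CAM sketch of the visualization procedure described above; model, target_layer, and score_fn are placeholders for whichever network, intermediate layer, and scalar prediction score are being inspected.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer, score_fn):
    """Return a (B, h, w) heat map for the target_layer activations."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = score_fn(model(x))              # scalar score for the class/box of interest
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)   # pool gradients per channel
    cam = F.relu((weights * feats["a"]).sum(dim=1))       # weighted activation sum
    return cam / (cam.max() + 1e-8)                       # normalize to [0, 1]
```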
4. Experimental Results and Analysis
4.1. Evaluation Metrics
This study uses precision, recall, mAP@50, and F1 score to evaluate the YOLOv8-WD model. F1 score and mAP@50 are the main metrics, with precision and recall as supplementary measures, to assess the model's practical performance.
Precision measures the proportion of true positives among all positive classifications and is calculated as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

where $TP$ is the number of true positives and $FP$ is the number of false positives. High precision alone does not guarantee model performance if recall is low. Recall is the ratio of true positives to the total number of actual positives and is given by:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

where $FN$ represents false negatives. Recall measures the model's ability to identify all relevant instances.
Mean Average Precision (mAP) is the average of the average precision (AP) values across all classes:

$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

Precision and recall are typically inversely related, with improvements in one often leading to a decline in the other. To address this trade-off, the F1 score is used as a comprehensive metric balancing the two; a higher F1 score indicates better model performance, reflecting an optimal trade-off between precision and recall:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
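A small worked example of these metrics, using hypothetical counts for a single defect class and hypothetical per-class AP values; the numbers are illustrative only and do not come from the experiments reported below.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical counts for one defect class
tp, fp, fn = 86, 14, 12
p, r = precision(tp, fp), recall(tp, fn)
print(f"P={p:.3f}, R={r:.3f}, F1={f1_score(p, r):.3f}")

# mAP@50 averages the per-class AP (area under the P-R curve at IoU >= 0.5)
ap_per_class = {"pit": 0.90, "scratch": 0.88, "bubble": 0.80, "slag": 0.85}  # toy values
map50 = sum(ap_per_class.values()) / len(ap_per_class)
print(f"mAP@50 = {map50:.3f}")
```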
4.2. Experimental Environment
In this experiment, the PyTorch 2.2.0 framework was utilized for training, with the system environment based on Windows 10, Python 3.8.10 as the interpreter, and PyCharm 2023.1 as the integrated development environment (IDE). Training was executed using an NVIDIA GeForce GTX 1650 GPU (Santa Clara, CA, USA), equipped with 4 GB of VRAM, and accelerated via CUDA 11.8. Model training was configured for 150 epochs. The input image dimensions were standardized to 640 × 640 pixels, with a batch size of 32. The initial learning rate was set to 0.01, and the momentum parameter was defined as 0.937. A cosine annealing learning rate adjustment algorithm was applied, with the minimum cosine annealing learning rate set to 10.
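As a point of reference, the training configuration above maps onto the Ultralytics API roughly as follows; the dataset YAML path is a placeholder, and the modified YOLOv8-WD modules (EfficientViT, CAFM, DynamicConv) are not part of the stock package.

```python
from ultralytics import YOLO

# Minimal sketch of the training setup described above, using the stock
# YOLOv8n checkpoint; "weld_defects.yaml" is a hypothetical dataset config.
model = YOLO("yolov8n.pt")
model.train(
    data="weld_defects.yaml",   # hypothetical dataset configuration
    epochs=150,
    imgsz=640,
    batch=32,
    lr0=0.01,                   # initial learning rate
    momentum=0.937,
    cos_lr=True,                # cosine annealing learning rate schedule
    device=0,                   # single GPU
)
```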
4.3. Ablation Experiment in the Study
To assess the performance of the proposed enhanced algorithm, we evaluated and compared the training outcomes of the standard YOLOv8 model against those of models integrated with different enhancements. Specifically, the comparison focuses on the average precision (mAP@50) and F1 score. In the results table, a cross (×) represents the absence of an improvement, while a check mark (√) denotes its inclusion. The ablation study results are summarized in Table 1.
In the experiment, we tested eight different model configurations by sequentially adding or removing three major improvements: EfficientViT, CAFM, and DynamicConv. The comparison between Model 1 (no improvements) and Model 8 (all improvements) reveals the contribution of each modification to the model’s performance.
Model 8 demonstrated the best overall performance, achieving a mAP@50 of 90.5% and an F1 score of 86.1%. This indicates that the combined application of EfficientViT, CAFM, and DynamicConv significantly enhances the model’s detection capabilities, thereby confirming the effectiveness of the integrated improvement strategy.
When introduced individually, EfficientViT (Model 2), CAFM (Model 3), and DynamicConv (Model 4) did not show substantial performance gains. The mAP@50 values were 88.9%, 87.1%, and 87.6%, respectively, and the F1 scores also declined. These results suggest the limitations of each improvement when applied in isolation, highlighting that although these modifications have potential, their impact is limited without the support of complementary improvements.
Notably, Model 5 exhibited improved mAP@50 and F1 scores compared to Models 2 and 3, suggesting some degree of complementarity between the improvements. However, Model 6 showed a 0.9% decrease in F1 score compared to Model 4, and Model 7 experienced a 0.2% drop in mAP@50 compared to Model 2, indicating that in certain cases, the combination of improvements may result in performance degradation.
In summary, this ablation study demonstrates the effectiveness of the proposed improvements by comparing different model configurations. Furthermore, the results indicate that an appropriate combination of enhancements can significantly improve the performance of the YOLOv8 model, while inappropriate combinations may lead to performance deterioration.
4.4. Results Comparison Experiment
To further validate the effectiveness of the proposed algorithm, experiments were conducted using the same dataset described previously. The study compared the improved algorithm with current state-of-the-art approaches, specifically YOLOv5, YOLOv8, and YOLOv10. Additionally, variants of the YOLOv8 architecture were built by incorporating modules such as BiFPN, iRMB, SCConv, and SWC, and these variants were then compared against the proposed YOLOv8-WD. The comparison primarily evaluated the performance of these object detection models using mAP@50 and F1 score, two key metrics for performance assessment.
Based on Table 2, among the baseline models, YOLOv8 demonstrated an improvement over YOLOv5 in precision but a slight drop in recall. Despite this, YOLOv8's F1 score (84.3%) is higher than YOLOv5's (81.8%), indicating that YOLOv8 offers a better balance between precision and recall. The mAP@50 of YOLOv5 was slightly higher than that of YOLOv8, but the difference was only about 0.3%, meaning the two performed similarly in overall detection accuracy. YOLOv10 had the highest precision (87.1%) of all models, but its recall declined substantially (60.0%), significantly lower than that of YOLOv5 and YOLOv8. This low recall severely affected its mAP@50 (65.9%) and F1 score (70.8%), indicating that, although YOLOv10 performs well in precision, it fails to detect a large number of relevant targets and is therefore less reliable in applications that require high recall. Overall, YOLOv10 shows an imbalance between precision and recall that harms its overall effectiveness.
Among the enhanced models, YOLOv8-WD showed the best overall performance, with the highest mAP@50 (90.5%) and the highest F1 score (86.1%) of all models. The precision and F1 score of YOLOv8+BiFPN improved, but its recall was 2.0% lower than that of the base YOLOv8 model. YOLOv8+iRMB maintained a balance similar to YOLOv8, with a slight 0.9% decrease in precision but a higher recall than base YOLOv8, yielding a good final F1 score (83.6%) and mAP@50 (87.8%). YOLOv8+SCConv performed nearly as well as YOLOv8+iRMB, with the same precision and mAP@50 but a slightly lower recall and F1 score. YOLOv8+SWC provided a robust balance, with a high recall (83.7%) and an F1 score (83.9%) superior to some of the other YOLOv8 variants.
In conclusion, while YOLOv10 achieves the highest precision, it is clear that YOLOv8-WD offers the most robust and balanced performance, making it the ideal choice in scenarios requiring both high precision and recall. Variants of YOLOv8 show specific improvements depending on architectural changes, but none outperform YOLOv8-WD in overall effectiveness.
As illustrated in Figure 8, YOLOv8-WD initially exhibits lower accuracy than YOLOv8 but surpasses it in later stages of training. However, the accuracy curve for YOLOv8-WD does not stabilize smoothly towards the end, indicating that while the model's performance improves over time, it does not achieve a steady increase in accuracy.
Figure 9 demonstrates that YOLOv8-WD's recall rate consistently surpasses that of YOLOv8, reflecting a superior ability to capture positive samples. According to Figure 10 and Table 2, YOLOv8-WD shows a 3% improvement in mAP@50 over YOLOv8 and exhibits stable convergence of the curve. This stability indicates that YOLOv8-WD possesses robust performance, good generalization capabilities, and an effective training strategy. Consequently, YOLOv8-WD proves to be more valuable in object detection tasks, particularly in scenarios requiring high reliability and consistency.
Figure 11 shows that YOLOv8-WD’s box, dfl, and classification loss curves are similar to YOLOv8 on the training set, indicating comparable learning and stability. On the validation set, YOLOv8-WD exhibits smoother Box and DFL loss curves, suggesting greater robustness. However, its slightly higher loss values indicate that the added complexity in YOLOv8-WD results in increased loss due to more features being processed. This does not necessarily reflect worse performance, and additional metrics are needed for a full assessment.
Figure 12 compares the defect detection results between YOLOv8 and YOLOv8-WD by defect category. YOLOv8 performed well on dents and scratches, with mAP@50 scores of 96.30% and 91.40%, respectively, but less effectively on bubbles and slag, with scores of 75.30% each. YOLOv8-WD showed similar performance on dents and scratches but improved detection of bubbles and slag, with mAP@50 increasing by 1.7% and 13.6%, respectively. Thus, YOLOv8-WD effectively mitigates dataset class imbalance, reducing average precision gaps and achieving more balanced performance.
Figure 13 presents a test set sample containing four types of welding defects: pits, slag, scratches, and bubbles. For pits and slag, the original algorithm demonstrates satisfactory detection performance, with the improved algorithm yielding a slight increase in mAP@50. However, the original algorithm exhibits suboptimal detection performance for scratches and bubbles. As observed in Figure 13c, the original algorithm suffers from missed detections of scratches. The improved algorithm addresses this issue, significantly enhancing detection accuracy. Furthermore, the improved algorithm also shows a notable improvement in bubble detection accuracy, with mAP@50 increasing by approximately 10%, representing a substantial advancement.
5. Conclusions and Outlook
As welding technologies are increasingly applied across different industries, accurate weld defect detection has become critical for quality control. This study develops an image acquisition platform to capture and annotate high-quality weld defect images. Building on the YOLOv8 architecture, the improved model, YOLOv8-WD, integrates EfficientViT as the backbone, a Cross-Attention Feature Mechanism (CAFM), and Dynamic Convolution (DyConv) in the C2f module. Experimental results show that YOLOv8-WD outperforms YOLOv8 in precision (86.4%), recall (85.9%), mAP@50 (90.5%), and F1 score (86.1%) while also demonstrating better stability and robustness. YOLOv8-WD offers potential for integration into industrial workflows, enhancing defect detection accuracy and efficiency.
Welding equipment manufacturers, quality control service providers, and companies engaged in automated inspection systems could integrate YOLOv8-WD into their workflows to improve the accuracy and efficiency of defect detection. A potential commercialization approach involves licensing the algorithm as part of an industrial inspection solution, where end-users deploy the model in real-time applications through hardware interfaces such as robots or drones equipped with cameras for automated weld inspection.
This research helps to improve the accuracy and reliability of automated welding defect detection and shows strong potential for broader industrial applications. Future research needs to further verify robustness under different lighting conditions and complex backgrounds and to adapt the method to a variety of operating environments, ensuring that it meets the needs of actual industrial applications.