1. Introduction
Steel is a critical material in manufacturing, infrastructure, and many other fields. The quality of steel directly affects both the quality of finished products and the reliability of infrastructure construction. Strict quality control for steel is therefore exceptionally important, as it is the first guarantee that products will meet specification.
During the production and processing of steel products, defects such as pitting, scratches, and patches often occur on the steel surface. These defects degrade the quality and performance of steel products, thereby reducing their reliability and service life. Measures must therefore be taken during steel production and use to detect and remove non-compliant steel quickly and accurately. Methods for detecting surface defects in steel can be divided into manual inspection and machine inspection. Manual inspection is prone to randomness, and its accuracy depends heavily on the experience and attentiveness of the inspectors; moreover, small defects may easily escape the notice of human workers [1]. Machine inspection, on the other hand, offers low cost, high efficiency, and good stability.
Object detection algorithms are the core of machine-based inspection. However, mainstream object detection algorithms still face two challenges in recognizing surface defects on steel. First, different types of defects on the steel surface can appear highly similar, while defects of the same type can vary considerably in appearance [2]. Second, the large number of defect types on the steel surface leads to imprecise classification results [3]. These two challenges reduce the precision and slow the detection speed of object detection algorithms, and mainstream detectors can no longer meet the strict defect detection requirements of factories [4]. There is therefore an urgent need for more advanced algorithms with improved performance to meet factories' production demands.
As a solution to this problem, the YOLOv5 model architecture was introduced. However, in the YOLOv5 architecture, the propagation of convolution calculations leads to the loss of feature information. It is therefore crucial to focus on the rational distribution of structures and on module replacements that are better suited to machine computation. Zhang et al. [5] proposed an improved algorithm based on YOLOv5 that incorporates deformable modules to adaptively adjust the receptive field scale; they also introduced the ECA-Net attention mechanism to enhance feature extraction. The improved algorithm achieved a 7.85% increase in mean average precision (mAP) compared to the original algorithm. Li et al. [6] put forward a modified YOLOv5 algorithm that integrates Efficient Channel Attention for Deep Convolutional Neural Networks (ECA-Net), an attention mechanism that emphasizes feature extraction in defective regions, and replaced the PANet module with the Bidirectional Feature Pyramid Network (BiFPN) module to fuse feature maps of different sizes. Compared to the original YOLOv5 model, mAP increased by 1% while the computation time decreased by 10.3%. Guizhong Fu et al. [7] proposed a compact Convolutional Neural Network (CNN) model that focuses on training low-level features to achieve accurate and fast classification of steel surface defects; it achieved high precision with a small training dataset, even under interference such as non-uniform illumination, motion blur, and camera noise. Yu He et al. [3] proposed an approach that fuses multi-level feature maps, enabling the detection of multiple defects in a single image; a Region Proposal Network (RPN) generates regions of interest (ROI), and the final results are produced by the detector. This approach achieved an accuracy of 82.3% on the NEU-DET dataset.
Real-time object detection is also an important requirement for the industrialization of steel defect detection. Qinglang et al. [8], focusing on the elongated nature of road cracks, proposed an improved algorithm based on YOLOv3 that fuses high-level and low-level feature maps to enhance feature representation and achieve real-time detection of road surfaces. To achieve faster object detection, Jiang et al. [9] introduced an improved YOLOv4-tiny model that replaces two CSPBlock modules with two ResnetBlock-D modules to improve computation speed; residual network blocks are further used to extract more feature information from images, improving detection accuracy. The results showed that the improved algorithm achieved a faster detection rate without sacrificing accuracy.
Deep learning-based object detection methods fall into two main categories: one-stage and two-stage. The one-stage approach directly uses convolutional neural networks to extract image features and perform object localization and classification; classic algorithms in this category include the YOLO series [10,11,12,13] and the SSD series [14,15]. In contrast, the two-stage approach first generates candidate regions before performing localization and classification; popular algorithms in this category include the R-CNN series [16,17], SPPNet [18], and R-FCN [19]. Considering its practicality in factories, the one-stage YOLOv5 model is commonly used: it offers fast detection but comparatively low accuracy. To address this problem, the authors improved certain structures in YOLOv5 specifically for steel training datasets. These modifications make the algorithm's structure more suitable for machine feature extraction, resulting in improved detection speed and increased average precision.
This article begins by introducing the basic architectural features of the YOLOv5-7.0 algorithm. The authors then address several issues affecting detection accuracy in the original algorithm and propose three improved modules to replace the problematic ones, emphasizing the structure and distinguishing characteristics of these replacement modules. Subsequently, details regarding the experimental setup, dataset, evaluation metrics, and other relevant information are provided.
The article then presents comparative experimental results, including comparisons of six different module replacements for the C3 module, three different forms of CAM for replacing the SPPF module, and eight different forms of CARAFE for replacing the nearest module. Additionally, a comparative experiment is conducted using the three selected optimal modules in combination. Furthermore, the improved algorithm is compared with mainstream detection algorithms.
Finally, the article concludes by presenting comparative visual results of the detection performance between the improved algorithm and the original algorithm.
2. Materials and Methods
2.1. YOLOv5-7.0 Algorithm
The YOLOv5 algorithm is one of the typical one-stage algorithms that utilizes a series of convolutional computations to extract hierarchical features from images within the backbone network. By fusing high-level semantic information with low-level details, it facilitates effective classification and localization, ultimately performing object detection in the “detect” stage.
Version 7.0 of YOLOv5 is the latest release in the YOLOv5 series and brings significant improvements to instance segmentation performance. YOLOv5-7.0 offers five different models, namely YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, arranged in increasing order of model size. These models exhibit varying speeds and accuracy levels.
The network structure of YOLOv5-7.0 consists of three components: the backbone, neck, and head. The backbone is responsible for extracting high-level semantic features from input images. The default backbone network in YOLOv5-7.0 is CSPDarknet, which employs stacked convolutional and pooling layers to reduce the image resolution while capturing essential features. After processing by the backbone, the neck module performs feature fusion, combining features from different levels to extract more comprehensive information; a Feature Pyramid Network (FPN) is commonly used as the fusion method, enabling scale information to be extracted from multiple feature maps. The processed information from the neck is then fed into the head module, which produces predictions at three scales and applies non-maximum suppression to obtain the final object locations and categories.
The algorithm proposed in this paper aims to enhance the YOLOv5s architecture of version 7.0, with the goal of improving detection efficiency and reducing error rates in practical applications. This enhancement seeks to achieve both rapid detection and increased accuracy, catering to the requirements of factories.
2.2. YOLOv5-7.0 Improvement
2.2.1. C3F Operator
C3 is derived from the Cross Stage Partial Networks (CSPNet) architecture. C3 has two variations: one in the backbone of YOLOv5-7.0, as shown in Figure 1a, and another in the head of YOLOv5-7.0, as shown in Figure 1b. The difference between BottleNeck1 and BottleNeck2 lies in how they process their input: in BottleNeck1, the output of two successive Conv operations is added to the initial input through a residual shortcut, whereas BottleNeck2 outputs the convolution result directly without the shortcut.
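For reference, the following is a minimal PyTorch-style sketch of such a bottleneck; the class and parameter names are illustrative and the channel handling is simplified relative to the actual YOLOv5 implementation.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Sketch of a YOLOv5-style bottleneck: with shortcut=True it behaves like
    BottleNeck1 (residual add); with shortcut=False it behaves like BottleNeck2."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = nn.Sequential(nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c), nn.SiLU())
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y   # residual add only in the BottleNeck1 variant
```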
Many studies have shown that the feature maps of different channels of the same image differ only slightly [20,21]. While most algorithms aim to reduce computational complexity and improve accuracy, they have not effectively addressed the redundant computation of similar features across channels. The C3 structure, which processes the feature map of every channel in the traditional way, inevitably performs redundant computations between similar feature maps.
An improved version of the C3 module in YOLOv5s-7.0, known as C3-Faster (C3F), is introduced to address the aforementioned issues. Its design concept is derived from the PConv module used by Jierun Chen [22] in FasterNet. In C3F, the channels left unprocessed by PConv are concatenated with the PConv output for further computation. This approach significantly reduces the computational workload while enhancing accuracy. The structure of the PConv module is shown in Figure 2a, and the structure of C3F in Figure 2b.
C3F is a fundamental operator that can be embedded into various neural networks to address the redundant convolutions that often occur in neural network computations. By reducing memory access, C3F performs the conventional Conv convolution on only a portion of the input channels, typically treating either the first or the last group of channels as representative of the entire feature map. For an h × w feature map with a k × k kernel, the floating point operations (FLOPs) of a conventional Conv are

h × w × k² × c²

The corresponding amount of memory access is

h × w × 2c + k² × c² ≈ h × w × 2c

In contrast, after replacing the C3 module with a C3F module, the FLOPs are reduced to

h × w × k² × cₚ²

The corresponding amount of memory access is

h × w × 2cₚ + k² × cₚ² ≈ h × w × 2cₚ

where c is the total number of channels and cₚ is the number of channels on which the convolution is actually performed.
This means that the FLOPs for Conv are at least four times greater than the FLOPs for C3F, and the memory access for Conv is more than double that of C3F.
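For illustration, here is a short numerical check of the two FLOPs formulas above; the feature-map size, channel count, and the cₚ = c/2 split are assumed values, not figures from the paper.

```python
# Worked example of the Conv vs. C3F FLOPs comparison (illustrative numbers only)
h, w, k, c = 80, 80, 3, 128   # feature-map height/width, kernel size, channels
c_p = c // 2                  # channels actually convolved by C3F (assumed split)

conv_flops = h * w * k**2 * c**2      # 943,718,400
c3f_flops  = h * w * k**2 * c_p**2    # 235,929,600
print(conv_flops / c3f_flops)         # 4.0 -> Conv needs about four times the FLOPs
```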
In order to fully utilize the channels (c − cₚ) left untouched by the above operation, a pointwise convolution (Conv1 × 1) is appended and attached to the center position of the PConv module, yielding a convolutional layer with efficient computational capability. Batch normalization (BN) is then applied to improve the convergence speed. To avoid gradient vanishing or exploding during computation, a Rectified Linear Unit (ReLU) is used as the activation function to enhance the non-linear fitting ability between the upper- and lower-layer function values. Subsequently, pointwise convolutions, global average pooling, and fully connected layers are employed to merge and output the final results.
This approach of processing digital images allows for a reduction in computational workload, thereby enhancing the speed of computation without sacrificing accuracy. By intelligently combining convolutional layers (attaching Conv1 × 1 layers to the center position of the PConv module, forming a T-shaped structure), greater attention is given to this central position, which has the highest Frobenius norm. This arrangement aligns with the pattern of feature extraction in images and can even reduce the computational workload while improving the precision.
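To make this structure concrete, the following is a minimal PyTorch-style sketch of the partial convolution plus pointwise convolution combination described above; the class name, the n_div split ratio, and the channel sizes are assumptions made for illustration rather than the exact C3F implementation.

```python
import torch
import torch.nn as nn

class PConvBlock(nn.Module):
    """Sketch of the PConv idea behind C3F: only c_p of the c channels pass through
    the 3x3 convolution; the remaining channels are left untouched and re-mixed by
    the following pointwise (1x1) convolution, BN, and ReLU."""
    def __init__(self, c, n_div=2):
        super().__init__()
        self.c_p = c // n_div                                  # channels actually convolved
        self.pconv = nn.Conv2d(self.c_p, self.c_p, 3, padding=1, bias=False)
        self.pwconv = nn.Sequential(                           # mixes all c channels back together
            nn.Conv2d(c, c, 1, bias=False),
            nn.BatchNorm2d(c),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c_p, x.size(1) - self.c_p], dim=1)
        x1 = self.pconv(x1)                                    # convolve only the first c_p channels
        return self.pwconv(torch.cat((x1, x2), dim=1))         # untouched channels pass through
```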
2.2.2. Information Feature Enhancement Module—CAM
To address the issue that the candidate boxes produced by the first stage of R-CNN have varying sizes, He et al. [23] proposed the spatial pyramid pooling (SPP) structure to produce fixed-size outputs. The SPP structure applies MaxPool2d operations with 5 × 5, 9 × 9, and 13 × 13 kernels in parallel, as shown in Figure 3a. The SPP structure was subsequently refined into the spatial pyramid pooling-fast (SPPF) structure, in which the parallel MaxPool2d operations of SPP are replaced by three consecutive 5 × 5 modules, as depicted in Figure 3b. This modification yields outputs of the same size while reducing the computational workload and improving detection speed. However, the SPPF structure inherently loses part of the information during pooling; if the convolutional operations prior to pooling fail to learn sufficient features, the detection results can be significantly affected.
The context augmentation module (CAM) has demonstrated remarkable effectiveness in handling low-resolution targets, while also providing a robust solution to the aforementioned issues. The CAM architecture was originally conceived to compensate for imbalanced training data and the limitations of the network. In this study, the CAM module is incorporated into YOLOv5s-7.0 in place of the SPPF structure to further enhance computational precision.
The CAM module processes the feature map with convolutions of varying dilation rates to enrich its contextual information. By combining the feature information obtained with different dilation rates, the expression of the features becomes more evident. The structure of the CAM module is illustrated in Figure 4.
In Figure 4, 3 × 3 convolution kernels are applied with dilation rates of 1, 3, and 5. This approach draws inspiration from the way humans recognize objects: using a rate of 1 is akin to observing details up close, such as noticing a panda's creamy white torso, sharp black claws, and black ears, details that alone may not be sufficient to determine the object's category. By contrast, performing convolution with rates of 3, 5, or even larger is akin to viewing the object in its entirety and comparing it with the surrounding environment. Applying this visual strategy to machine learning has produced comparable results: by simulating this way of human observation, the network adjusts the dilation rate to obtain different receptive fields and then fuses them for improved accuracy. This technique works particularly well for smaller targets at various resolutions.
The CAM module supports three fusion forms: Weight, Adaptive, and Concatenation, as illustrated in Figure 5. In the Weight mode, the three branches are each processed by a Conv1 × 1 and then summed. In the Adaptive mode, the weights are matched adaptively; the [bs, 3, h, w] tensor in the diagram represents spatially adaptive weight values. In the Concatenation mode, the three Conv1 × 1 outputs are concatenated along the channel dimension and then fused. None of the three modes is universally better than the others: the performance of the CAM module is influenced by factors such as the image dimensions of different datasets, their characteristic features, and the connections between modules.
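A minimal PyTorch-style sketch of such a module is given below, assuming 3 × 3 dilated convolutions with rates 1, 3, and 5 and showing all three fusion modes; the class name, channel sizes, and the use of a softmax gate in the Adaptive mode are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class CAM(nn.Module):
    """Sketch of a context augmentation module: three parallel 3x3 convolutions with
    dilation rates 1, 3, and 5 (padding = rate keeps the spatial size), fused by one
    of three modes: 'weight', 'adaptive', or 'concat'."""
    def __init__(self, c_in, c_out, fusion="weight"):
        super().__init__()
        self.fusion = fusion
        self.branches = nn.ModuleList(
            [nn.Conv2d(c_in, c_out, 3, padding=r, dilation=r) for r in (1, 3, 5)]
        )
        if fusion == "weight":
            self.mix = nn.ModuleList([nn.Conv2d(c_out, c_out, 1) for _ in range(3)])
        elif fusion == "adaptive":
            self.gate = nn.Conv2d(3 * c_out, 3, 1)     # predicts a [bs, 3, h, w] weight map
        else:  # "concat"
            self.reduce = nn.Conv2d(3 * c_out, c_out, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        if self.fusion == "weight":
            return sum(m(f) for m, f in zip(self.mix, feats))          # Conv1x1 then add
        if self.fusion == "adaptive":
            w = torch.softmax(self.gate(torch.cat(feats, 1)), dim=1)   # spatial weights
            return sum(w[:, i:i + 1] * feats[i] for i in range(3))
        return self.reduce(torch.cat(feats, 1))                        # concatenate then fuse
```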
2.2.3. Variable Receptive Field Module—CARAFE
Upsampling is widely used in deep neural networks to enlarge high-level feature maps. In version 7.0 of YOLOv5, the nearest module, i.e., nearest-neighbor interpolation, is employed for upsampling. With this module, each point in the output image is mapped back to the input image and takes the value of the nearest of the four adjacent input points, denoted a, b, c, and d. The nearest module adds no computational complexity, as it only requires copying values and processing data blocks. However, it also has drawbacks that significantly affect the results of neural network computations: its data transformation weakens the gradual correlation between adjacent pixels in the original image, and its 1 × 1 receptive field is very small, so this uniform and simplistic sampling method does not effectively exploit the semantic information of the feature map.
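The behavior is easy to verify; the following small PyTorch example (with made-up values) shows how nearest-neighbor upsampling simply copies each input pixel into a block of the output:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]])                    # a 1x1x2x2 feature map
y = F.interpolate(x, scale_factor=2, mode="nearest")
# Each output pixel copies its nearest input neighbor:
# [[1, 1, 2, 2],
#  [1, 1, 2, 2],
#  [3, 3, 4, 4],
#  [3, 3, 4, 4]]
```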
The Content-Aware ReAssembly of FEatures (CARAFE) module, initially proposed by Chen et al. [24], can address the limitations of nearest-neighbor interpolation. CARAFE provides a larger receptive field by generating upsampling kernels from the semantic information of the feature map itself. This allows the content surrounding each position to be taken into account, improving detection accuracy without significantly increasing the computational requirements. The execution of the CARAFE module consists of two steps: creation of the upsampling kernel and convolution with the new kernel. The creation of the upsampling kernel is illustrated in Figure 6, while the convolution with the new kernel is depicted in Figure 7.
Figure 6 depicts the creation of the upsampling kernel. To reduce computational complexity, the input feature map of size h × w × l is first processed by a 1 × 1 convolution that compresses the channels to l₁. The content is then re-encoded by a k₂ × k₂ convolution that maps the l₁ channels into m² groups of k₁² channels, where m is the upsampling ratio. After unfolding the spatial dimensions, these channels are rearranged into an mh × mw × k₁² structure. Finally, this structure is normalized so that the weights of each newly created upsampling kernel sum to 1.
Each position in the upsampled output feature map is mapped back to the input feature map of size h × w × l. For example, in the rectangle shown in Figure 7, the yellow region corresponds to a 6 × 6 area in the input feature map. The upsampling kernel mh × mw × k₁² is then rearranged into a k₁ × k₁ structure within the red region, and the dot product between this rearranged kernel and the corresponding input area yields one output value. The yellow region of the rectangle determines the positional coordinates of the red region in the upsampling kernel, and a single upsampling kernel is shared among all channels at that position.
The sizes k₁ × k₁ and k₂ × k₂ represent the reorganized upsampling kernel and the encoder receptive field, respectively. A larger receptive field means a larger range of content perception and requires a correspondingly larger reorganized upsampling kernel. Chen et al. examined the combinations of (k₁, k₂) in the CARAFE module, namely [1, 3], [1, 5], [3, 3], [3, 5], [3, 7], [5, 5], [5, 7], and [7, 7], and noted that the parameter size increases as the values of k increase.
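To summarize the two steps in code, the following is a minimal PyTorch-style sketch of a CARAFE-like upsampler, under the assumption that the compressed channel count c_mid, the kernel sizes k_up (k₁) and k_enc (k₂), and the scale factor m are free parameters; it illustrates the mechanism rather than reproducing the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFEUpsample(nn.Module):
    """Sketch of content-aware reassembly: predict a k_up x k_up kernel for every
    output position from the feature content, then use it to reassemble a local
    neighborhood of the input into each upsampled output pixel."""
    def __init__(self, c, c_mid=64, m=2, k_up=5, k_enc=3):
        super().__init__()
        self.m, self.k_up = m, k_up
        self.compress = nn.Conv2d(c, c_mid, 1)                       # channel compression
        self.encode = nn.Conv2d(c_mid, (m * k_up) ** 2, k_enc,
                                padding=k_enc // 2)                  # content encoding

    def forward(self, x):
        b, c, h, w = x.shape
        m, k = self.m, self.k_up
        # Step 1: create the upsampling kernels (softmax makes each kernel sum to 1)
        kernels = self.encode(self.compress(x))                      # [b, m^2*k^2, h, w]
        kernels = F.pixel_shuffle(kernels, m)                        # [b, k^2, m*h, m*w]
        kernels = F.softmax(kernels, dim=1)
        # Step 2: reassemble each k x k input neighborhood with the predicted kernel
        patches = F.unfold(x, k, padding=k // 2).view(b, c, k * k, h, w)
        patches = patches.repeat_interleave(m, dim=3).repeat_interleave(m, dim=4)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)           # [b, c, m*h, m*w]
```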
2.2.4. The Architecture of the Improved Model Based on YOLOv5s-7.0
The proposed improved algorithm is based on YOLOv5s-7.0, and its purpose is to improve the detection of small targets, such as those with low pixel clarity. The structure of the improved algorithm is depicted in Figure 8. In this improvement, all seven C3 modules in the original backbone and four C3 modules in the neck are replaced with C3F modules. The C3F modules reduce redundant computations, and their T-shaped pattern allows the receptive field to focus more on the center position, which has the maximum Frobenius norm.
Additionally, the SPPF in the original algorithm’s backbone is replaced with a CAM. The CAM obtains three different receptive fields with rates of 1, 3, and 5 when processing an image. This allows the algorithm to focus more on contextual information, reducing the impact of low pixel clarity features. Moreover, three fusion methods (weight, adaptive, and concatenation) are proposed for combining the obtained receptive fields.
Furthermore, the two nearest-neighbor upsampling operators, the "nearest" modules in the original algorithm's neck, are replaced with CARAFE, a feature recombination module that focuses on semantic information. Based on the values of (k₁, k₂) ([1, 3], [1, 5], [3, 3], [3, 5], [3, 7], [5, 5], [5, 7], and [7, 7]), a total of 16 combinations can be obtained from the two "nearest" modules. This way of enlarging the receptive field differs from directly expanding it with CAM, as CARAFE reconstructs the upsampling kernel from the feature information to enhance the receptive field.
By combining these three improvement methods without significantly increasing the computational complexity, the algorithm achieves a significant improvement in its accuracy. A comparative analysis will be presented in the following sections.