1. Introduction
China is a major tobacco-producing country. Although tobacco poses certain hazards to human health, the tobacco industry contributes significantly to China’s tax revenue, raises farmers’ incomes, and remains crucial for maintaining the country’s economic stability. Cigarettes are the most important products of the tobacco industry. On large-scale high-speed assembly lines, various appearance defects may occur for multiple reasons; these defects include anomalies that the literature labels “Dotted”, “Folds”, “Untooth”, and “Unfilter” [
1]. These appearance defects seriously affect the quality of cigarettes and damage the brand. Therefore, tobacco companies must strictly control quality and avoid selling cigarettes with appearance defects.
Early quality control in cigarette factories relied on manual inspection. However, on a high-speed production line, cigarettes are produced at a rate of 150–200 per second, with each line producing about 15 million cigarettes per day [
2]. With such a high production rate, manual inspection is impractical, creating an urgent need for automated cigarette appearance inspection technology to replace manual methods.
With the development of machine vision, automatic inspection has gradually replaced manual inspection of product quality, significantly improving efficiency in industrial inspection tasks such as printed circuit board defect detection, packaging seal defect detection, welding defect detection, photovoltaic component defect detection, and steel surface defect detection. Product surface defect detection falls under the category of object detection. At present, the most widely used object detection algorithms are Faster R-CNN [
3,
4,
5], SSD [
6], YOLO [
7,
8,
9,
10], etc. Practitioners in various industries have improved these methods to adapt them to their own object detection tasks.
Domestic and international research on cigarette appearance defect detection is relatively limited. However, significant advancements have been made in other appearance defect and defective product detection techniques, which can be leveraged for cigarette appearance detection tasks. Zhang et al. [
11] used YOLOv3 to detect surface defects on aluminum profiles. Cheng et al. [
12] proposed an algorithm to improve YOLOv3 for metal surface defect detection by generating a large-scale feature layer to extract more small target features. Wei et al. [
13] combined SSD with the MobileNet algorithm and then designed an SSD-MobileNet model for surface defect detection on manufactured boards. Peng et al. [
14] used the ResNet101 backbone network with Faster R-CNN to detect surface defects on particleboard. Luo et al. [
15] introduced the ResNet18 structure into YOLOv3, significantly improving the accuracy and detection speed of the model for UAV power grid inspection. Hu et al. [
16] put forward a detection method to detect pipeline weld surface defects based on multi-feature extraction and binary tree support vector machine (BT-SVM) classification. Lu et al. [
17] incorporated a multi-scale detection network based on FPN and GA-RPN structure into Faster R-CNN, achieving a mAP of 92.5% in solar cell defect detection. Hu et al. [
18] added a pyramid pooling module and a spatial and coordinate attention module to the YOLOv5 network to improve the detection of small-size defects, achieving an average accuracy of 97.6% in detecting surface defects on rebar. Wang et al. [
19] designed a defect detection method based on a watershed algorithm and two-pathway convolutional neural network and applied it to corn seed defect detection. The average accuracy of the experiment reached 95.63%. Wan et al. [
20] improved the detection ability of small-size defects by adding a small object detection layer in the YOLOv5s network, effectively solving problems caused by small-size defects and insufficient feature information in tile surface defect detection. Zhang et al. [
21] embedded the SENet module into the Unet network to detect appearance defects of parts, achieving a detection accuracy of 92.98%. Qi et al. [
22] combined the YOLOv7-tiny network with a weighted Bidirectional Feature Pyramid Network (BiFPN) for steel surface defects, effectively improving the efficiency of small object detection with an average accuracy of 94.6%. Jing et al. [
23] proposed a Mobile-Unet network to realize end-to-end fabric defect detection and introduced depthwise separable convolution, reducing model complexity and achieving an accuracy rate of 92%. Li et al. [
24] designed a lightweight convolutional neural network, WearNet, to realize automatic surface scratch detection, achieving an excellent classification accuracy of 94.16% with smaller model size and faster detection speed.
In recent years, some scholars have researched cigarette appearance defect detection. Qu et al. [
25] introduced a method for detecting cigarette appearance defects utilizing an improved SSD model, using the ResNet50 network for feature extraction and adding pyramid convolution to achieve a mAP of 94.54%. However, the ResNet50 model increases network depth simply by stacking residual blocks, resulting in a relatively simple network structure. This simplicity leads to insufficient feature extraction capability and generates many redundant features; as a result, the model’s recall is only 69.83%. The deeper network structure also reduces the inference speed, with the detection speed being only 66 FPS. Yuan et al. [
26] proposed a classification method for cigarette appearance defects based on ResNeSt, utilizing a transfer learning method to train ResNeSt to address the problem of an insufficient number of samples. However, the ResNeSt network uses the Split-Attention Block to slice the feature channels, reducing the correlation, complementarity, and information interaction ability between the channels, thus weakening the feature fusion ability of the model. Yuan et al. [
27] proposed a defect detection method for cigarette appearance based on improved YOLOv4; by introducing the channel attention mechanism and replacing the spatial pyramid pooling structure (SPP) with atrous spatial pyramid pooling structure (ASPP), a mAP of 91.77% was achieved. However, introducing these modules increased the computational complexity of the model, and the detection speed was lower than that of the original YOLOv4 model, which was only 53 FPS. Liu et al. [
1] proposed a defect detection method for cigarette appearance based on improved YOLOv5s. The original model was improved through data augmentation, the introduction of a channel attention mechanism, and optimization of the activation and loss functions. However, since YOLOv5s is a lightweight model in the YOLOv5 series, its feature extraction capability is limited, and the improvements focus on detection speed rather than on the algorithm’s capacity to learn features. As a result, its recall is only 86.8%. Liu et al. [
2] proposed a cigarette defect detection method based on C-CenterNet, which improves the algorithm’s ability to detect small target defects by introducing the CBAM attention mechanism, replacing ordinary convolution with deformable convolution, and adopting a feature pyramid network (FPN). However, since the FPN fuses features in a top-down manner, it cannot fully utilize low-level feature information; therefore, the recall is only 85.96%, and the detection speed is reduced to 112 FPS, which cannot meet the real-time requirements.
The above methods are significant for research on cigarette appearance defect detection. However, most of them still need improvement in some areas, such as low detection accuracy caused by insufficient feature extraction and poor feature fusion capabilities, and slow detection speed caused by increased model complexity or the introduction of complex calculations. In addition, the available datasets still suffer from small sample sizes and sample imbalance. Because YOLOv7 [
28] has shown high accuracy and fast detection speed in many object detection tasks, this paper uses YOLOv7 as the primary network to meet the requirements of high-speed cigarette production lines and proposes a new cigarette appearance defect detection model, called SCS-YOLO, by introducing space-to-depth convolution, a convolutional block attention module, and a self-calibrated convolution module. The main contributions of this paper are as follows:
- (1)
The space-to-depth convolution is introduced into the backbone network of the model, which achieves feature dimension reduction while reducing the loss of small target features. Its placement is analyzed and tested in detail. We found that although space-to-depth convolution retains the information of small targets to the greatest extent, it damages the integrity of information about large targets. Therefore, a pooling layer and two space-to-depth convolutional layers are used in the backbone network to increase the fine-grained information in the network while preserving the integrity of large targets’ information.
- (2)
To enhance the model’s ability to pay attention to target information, convolutional block attention modules are introduced after the two C5 modules in the backbone network, which improves the model’s ability to extract small targets and distinguish similar defects.
- (3)
The dual self-calibrated convolutional module (D-SCConv) is proposed and applied to the neck network; it enhances the receptive field of each region in the feature map and calibrates the high-level semantic information, thereby helping the model to generate higher-quality features.
- (4)
The EIoU [
29] loss function is used to replace the CIoU [
30] loss function in the original model, which accelerates the convergence speed of the model, improves the positioning ability of the model, and reduces the impact of sample imbalance.
The experimental results show that the SCS-YOLO model has improved the detection of various cigarette appearance defects and meets the requirements of industrial production lines.
3. Proposed Method
In July 2022, the original team that developed YOLOv4 proposed YOLOv7. YOLOv7 improves both detection speed and detection accuracy, enabling it to be applied to real-time object detection tasks. The YOLOv7 network consists of four components: input, backbone, neck, and head. The input module preprocesses the input image (resizing, among other operations) to ensure it meets the network’s input requirements. Next, the processed image is fed into the backbone network, which extracts features from the image. The neck module then fuses the extracted features at three different scales: large, medium, and small. Finally, the fused features are sent to the detection head, where the detection results are obtained.
The YOLOv7 series includes three basic versions: YOLOv7-tiny, YOLOv7, and YOLOv7-W6. These versions vary in the number of parameters, activation functions, network depth, and network width to meet diverse performance requirements. YOLOv7-tiny has the fewest parameters and the smallest network depth and width among all YOLOv7 versions, which makes it slightly less accurate, but it has the fastest detection speed. The detection speed of YOLOv7-tiny-SiLU (YOLOv7-tiny with the SiLU activation function) can even achieve 286 FPS. Given the production rate of cigarette production lines, which is 150–200 per second, the detection model must have a high detection speed. Therefore, based on YOLOv7-tiny and the application requirements, this paper proposes an SCS-YOLO model that is more suitable for detecting cigarette appearance defects.
3.1. SCS-YOLO Model
As shown in
Figure 1, the appearance defects of different cigarettes vary significantly: the Dotted defects in
Figure 1b are extremely small and easily confused with the pattern or logo on the filter paper; they are the typical extremely small targets in this dataset. The Folds defect in
Figure 1c and the Untooth defect in
Figure 1e are both issues with the bonding of the filter paper; the difference between the two is not apparent, and most Folds and Untooth defects are large targets. The Unfilter defect in
Figure 1d has simple features but is a large target, with a defect area much larger than that of a Dotted defect.
Considering the characteristics of these four types of cigarette appearance defects and the need for fast detection speed and high detection precision, this study improves the original YOLOv7-tiny. Firstly, because Dotted defects are extremely small targets, this paper replaces the last two MP modules in the YOLOv7-tiny backbone network with the SPD-Conv [
31] module to enhance the network’s detection capacity for extremely small targets; secondly, to address the problem that the difference between Folds defects and Untooth defects is not apparent, an attention module is introduced after each of the two C5 modules in the feature extraction network, which can better capture similarities and differences in the images; then, because Unfilter, Folds, and Untooth defects are characteristically large targets, the improved self-calibrated convolutional module (D-SCConv) is added to the network’s neck to increase the model’s receptive field and enhance its detection performance for large targets; finally, the loss function of the model is improved by using the EIoU loss function instead of the CIoU loss function to reduce the impact of sample imbalance. The improved SCS-YOLO model framework is shown in
Figure 2.
This paper describes each improved module in detail later.
Section 3.2 introduces SPD-Conv, suitable for low-resolution images and extremely small objects;
Section 3.3 introduces CBAM [
32], which better captures image similarities and differences;
Section 3.4 discusses the improved self-calibrated convolutional module; and
Section 3.5 introduces the EIoU loss function.
3.2. Space-to-Depth Convolution (SPD-Conv)
When the image resolution is very low or the object to be detected is small, traditional convolutional neural networks’ performance will deteriorate rapidly. In reference [
31], the authors point out that this is mainly because the existing CNN architecture uses pooling layers or strided convolutions, resulting in the loss of fine-grained information and less efficient feature representation. Therefore, considering that Dotted defects in cigarette appearance defects are extremely small targets and often fail to be detected by existing models, this paper introduces the SPD-Conv module into the backbone network of the YOLOv7-tiny model to replace the pooling layer (MP module) in the original network.
The SPD-Conv module consists of a space-to-depth (SPD) layer and a non-strided convolution (Conv) layer connected in series. Specifically, the input feature map is first transformed from space to depth by the SPD layer, and the result is then processed by the Conv layer [
31]. This combination reduces the size of the feature map without losing information, retaining all of the information in the channel dimension, which greatly enhances the model’s ability to detect low-resolution images and small targets [
31].
Figure 3 shows the structure of SPD-Conv.
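The space-to-depth rearrangement itself can be illustrated independently of any framework. The following sketch (plain NumPy, not the authors’ implementation) shows how a scale-2 SPD layer folds each 2 × 2 spatial block onto the channel axis, halving the resolution while keeping every value:

```python
import numpy as np

def space_to_depth(x, scale=2):
    """Rearrange a (C, H, W) feature map into (C * scale^2, H/scale, W/scale).

    Each scale x scale spatial block is moved onto the channel axis, so the
    map is downsampled without discarding any values (unlike max pooling).
    """
    c, h, w = x.shape
    assert h % scale == 0 and w % scale == 0
    # Collect the scale^2 interleaved sub-maps and stack them on the channel axis.
    subs = [x[:, i::scale, j::scale] for i in range(scale) for j in range(scale)]
    return np.concatenate(subs, axis=0)

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
y = space_to_depth(x, scale=2)
print(y.shape)            # (8, 2, 2): 4x the channels, half the resolution
print(x.size == y.size)   # True: no values are discarded
```

In the full SPD-Conv module, a non-strided convolution then mixes and compresses the quadrupled channels, which is where the learned feature reduction happens.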
In the backbone network of the YOLOv7-tiny model, there are three max pooling layers (MP modules) with a stride of 2. Although these three pooling layers achieve feature dimension reduction and decrease the algorithm’s computational complexity, they also cause the loss of much fine-grained information. Therefore, this paper considers replacing MP modules with SPD-Conv modules. The number of MP modules to replace was determined by ablation experiments, with the results displayed in the experiments section. They indicate that replacing the last two MP modules in the backbone network with SPD-Conv modules, rather than all three, is optimal. According to the analysis, the cigarette appearance defect dataset contains very small targets (Dotted) and large targets (Unfilter). Although the SPD-Conv module is effective for small targets, it splits large targets and compromises their integrity. Thus, not all three MP modules in the backbone network can be replaced with SPD-Conv modules. The experimental results in the experiments section support this view.
3.3. Attention Mechanism (CBAM)
The attention mechanism can enhance the algorithm’s focus on useful information while filtering out irrelevant information, thus improving detection accuracy and efficiency. Due to the extensive number of layers in the YOLOv7-tiny model’s network structure, some invalid feature information is easily introduced during feature extraction and fusion, reducing the model’s ability to discriminate target features. Moreover, replacing the MP module with the SPD-Conv module in
Section 3.2 quadruples the number of channels in the SPD layer. This channel increase forces the model to attend to more information, which diminishes its ability to focus on target information. This paper therefore introduces CBAM after the two C5 modules of the original YOLOv7-tiny model to enhance the model’s perception of target features and improve detection performance.
CBAM is a lightweight attention module composed of two submodules: the CAM (Channel Attention Module) submodule and the SAM (Spatial Attention Module) submodule, which perform channel and spatial attention operations, respectively, on input features. The function of the CAM submodule is to make the model pay attention to the channel information; the model can learn the importance of each channel information through training. The function of the SAM submodule is to make the network pay more attention to the spatial location information; the network can learn the importance of each location region through training. By integrating channel and spatial attention mechanisms, CBAM helps the network focus more on target objects while ignoring irrelevant background information, thereby enhancing network interpretability. The network structure of CBAM is shown in
Figure 4.
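A minimal numerical sketch of the two submodules illustrates how CAM re-weights channels and SAM re-weights spatial positions. This is plain NumPy with random weights standing in for the learned shared MLP, and a fixed averaging standing in for the learned 7 × 7 convolution of the real SAM, so it shows the mechanism rather than trained behavior:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """CAM: weight each channel by pooled statistics passed through a shared MLP.

    x: (C, H, W); w1: (C//r, C) and w2: (C, C//r) are the shared MLP weights.
    """
    avg = x.mean(axis=(1, 2))                      # global average pool -> (C,)
    mx = x.max(axis=(1, 2))                        # global max pool -> (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)   # FC -> ReLU -> FC
    scale = sigmoid(mlp(avg) + mlp(mx))            # per-channel weights in (0, 1)
    return x * scale[:, None, None]

def spatial_attention(x):
    """SAM: weight each spatial position by channel-wise mean/max statistics.

    The real SAM applies a learned 7x7 convolution to the two stacked maps;
    here a fixed average of the maps stands in for that convolution.
    """
    avg = x.mean(axis=0)                           # (H, W)
    mx = x.max(axis=0)                             # (H, W)
    scale = sigmoid((avg + mx) / 2.0)              # per-position weights in (0, 1)
    return x * scale[None, :, :]

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8)) * 0.1             # reduction ratio r = 4
w2 = rng.standard_normal((8, 2)) * 0.1
y = spatial_attention(channel_attention(x, w1, w2))  # CAM then SAM, as in CBAM
print(y.shape)  # (8, 4, 4): same shape, features re-weighted
```

Because both attention maps lie in (0, 1), the module only rescales features; the shape of the feature map is unchanged, which is what makes CBAM easy to drop in after the C5 modules.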
3.4. Dual Self-Calibrated Convolutions (D-SCConv)
Conventional convolution has some limitations. Firstly, all convolution kernels in conventional convolution learn in a similar pattern; secondly, in the conventional convolution operation, the receptive field size of each spatial location is determined mainly by the predefined convolution kernel size, which can result in a lack of a large receptive field, preventing the network from capturing sufficient high-level semantic information. These two shortcomings may lead to low recognizability of the features extracted by conventional convolution, subsequently affecting the accuracy of the model. To solve these issues, this paper introduces self-calibrated convolutions (SCConv) [
33] into the neck of the original YOLOv7-tiny model, to improve the model’s receptive field for target features, thereby enhancing the detection accuracy. Additionally, SCConv can easily enhance the performance of standard convolutional layers, without introducing additional parameters and complexity, and its design is simple and plug-and-play [
33]. The structure of SCConv is shown in
Figure 5.
As shown in
Figure 5, SCConv first divides the input X equably into two parts, X1 and X2, and then feeds the two parts into separate paths that extract different types of contextual information. X1 undergoes a self-calibration operation to obtain Y1; the down-sampling rate r in the self-calibration operation allows the receptive field of each spatial position to be adjusted adaptively, thereby capturing high-level semantic information across long-distance spatial locations and channels and enriching the diversity of the output features. X2 undergoes a simple convolution operation to obtain Y2, which preserves the contextual relations of the original space. Finally, the two intermediate outputs Y1 and Y2 are spliced together to obtain the output Y.
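The split-calibrate-splice flow of SCConv can be made concrete with a small sketch. The learned convolutions of SCConv are replaced here by fixed 3 × 3 box-smoothing filters, so the code illustrates only the data flow (channel split, down-sampled calibration gate, concatenation), not trained behavior:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smooth3x3(x):
    """Stand-in for a learned 3x3 convolution: per-channel box smoothing."""
    p = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode="edge")
    return sum(p[:, i:i + x.shape[1], j:j + x.shape[2]]
               for i in range(3) for j in range(3)) / 9.0

def avg_pool(x, r):
    """Average pooling with an r x r window and stride r."""
    c, h, w = x.shape
    return x.reshape(c, h // r, r, w // r, r).mean(axis=(2, 4))

def upsample(x, r):
    """Nearest-neighbor upsampling by factor r."""
    return x.repeat(r, axis=1).repeat(r, axis=2)

def scconv(x, r=2):
    """Self-calibrated convolution, data flow only (convs are stand-ins)."""
    c = x.shape[0] // 2
    x1, x2 = x[:c], x[c:]                      # split the channels equably
    # Calibration branch: down-sample, "convolve", up-sample, then gate.
    cal = upsample(smooth3x3(avg_pool(x1, r)), r)
    y1 = smooth3x3(smooth3x3(x1) * sigmoid(x1 + cal))
    # Plain branch: preserves the original spatial context.
    y2 = smooth3x3(x2)
    return np.concatenate([y1, y2], axis=0)    # splice the two outputs

x = np.random.default_rng(1).standard_normal((4, 8, 8))
y = scconv(x)
print(y.shape)  # (4, 8, 8): same shape as the input
```

The down-sampled branch lets each output position be gated by statistics from a wider neighborhood than the kernel itself covers, which is the source of the enlarged receptive field.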
Some categories in the cigarette appearance defect dataset have small inter-class differences, and some appearance defects have unclear textures. This paper therefore improves SCConv, with the expectation of further enhancing the module’s ability. The improved SCConv can adaptively construct correlations between long-distance spatial domains and channels and further expand the receptive field of each region in the network, thereby improving the feature expression ability and making the features more recognizable. The specific improvements are as follows: the path that retains the original spatial context is replaced by a self-calibration convolution block (Self-Calibration-B), and convolutions with three different kernel sizes {K4, K5, K6} are used to perform the self-calibration operation on X2 to obtain Y2, after which Y1 and Y2 are spliced to obtain the final output Y. The improved dual self-calibrated convolutions (D-SCConv) module is shown in
Figure 6.
This paper conducts ablation experiments to test the improved D-SCConv module, with the results displayed in the experiments section. The experiments demonstrate that using D-SCConv indeed improves the network’s accuracy. Upon analysis, although D-SCConv discards the part of SCConv that retains the original spatial context relationship, in the whole network the output Y of D-SCConv is spliced with the outputs obtained from the C5, CBL, and upsample modules. This splicing retains the spatial context information, while the discarded part reduces feature redundancy. Thus, the representation learning ability of the network is further enhanced.
3.5. Improved Loss Function
The CIoU loss function is used to calculate the loss in the YOLOv7-tiny algorithm. CIoU considers various factors, including the overlap area between the predicted box and the ground truth box, the distance between their center points, and the aspect ratio, and introduces a correction factor. However, CIoU has some unreasonable aspects: the design of the aspect-ratio term is flawed, in that if the two boxes have the same aspect ratio, the aspect-ratio loss is always 0; and the correction factor significantly affects the performance of the loss function, necessitating repeated tuning of this parameter to achieve optimal performance. To this end, EIoU is introduced for loss calculation in this paper.
EIoU (Efficient IoU) is a further improvement on CIoU. It separates the aspect-ratio factor of the predicted box and the ground truth box, calculating the width difference and the height difference between the two boxes directly to replace the aspect-ratio-based width and height loss in CIoU. The calculation method for EIoU is given in Equation (1):

L_{EIoU} = L_{IoU} + L_{dis} + L_{asp}.  (1)

By expanding Equation (1), it is possible to obtain

L_{EIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{C_w^2} + \frac{\rho^2(h, h^{gt})}{C_h^2}.  (2)

In Equation (2), b and b^{gt} represent the predicted box and the ground truth box, respectively; \rho(\cdot) represents the Euclidean distance between the two center points of the predicted box and the ground truth box; w, h and w^{gt}, h^{gt} are the widths and heights of the two boxes; c is the diagonal length of the minimum bounding rectangle covering both boxes; and C_w and C_h are the width and height of that minimum bounding rectangle.
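As a sanity check, the EIoU loss can be computed directly for axis-aligned boxes given as (x1, y1, x2, y2) corners. This is a straightforward transcription of the definition, not the authors’ training code:

```python
def eiou_loss(pred, gt):
    """EIoU loss for axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    # IoU term.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter)
    # Minimum enclosing box: its width, height, and squared diagonal.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2
    # Squared distance between the two box centers.
    d2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + \
         ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    # Separate width and height terms (this is what EIoU changes vs. CIoU).
    dw2 = ((px2 - px1) - (gx2 - gx1)) ** 2
    dh2 = ((py2 - py1) - (gy2 - gy1)) ** 2
    return 1.0 - iou + d2 / c2 + dw2 / cw ** 2 + dh2 / ch ** 2

print(eiou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0 for identical boxes
print(eiou_loss((0, 0, 2, 2), (0, 0, 4, 4)))  # 1.3125: penalized although aspect ratios match
```

Note the second case: the two boxes share the same aspect ratio, so CIoU’s aspect-ratio term would be zero, while EIoU’s separate width and height terms still penalize the size mismatch.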
5. Discussion and Prospects
This paper proposes a real-time and high-precision cigarette appearance defect detection model, SCS-YOLO. The primary focus is on improving the model’s performance to meet the task’s high accuracy and speed requirements.
Firstly, this paper uses various data augmentation techniques on the original cigarette images to address the issue of sample imbalance. Then, considering the hardware and detection speed requirements of the industrial setting, the lightweight YOLOv7-tiny network is selected as the base model. On this basis, this paper uses SPD-Conv to replace the MP modules, adds attention mechanisms, expands the receptive field of the network with the improved D-SCConv, replaces the CIoU loss with the EIoU loss, and names the resulting architecture the SCS-YOLO model. Experimental results demonstrate that the SCS-YOLO model is effective for cigarette appearance detection tasks, exhibiting good accuracy and recall. Furthermore, its speed satisfies the detection requirements of cigarette production lines.
Since SCS-YOLO primarily addresses three issues, namely, very small targets, significant inter-class size differences, and indistinct inter-class feature differences, it is well suited for defect detection tasks with these issues. Such tasks include aluminum profile surface defect detection, injection-molded part defect detection, steel surface defect detection, wood product defect detection, ceramic tile defect detection, etc. Additionally, SCS-YOLO is a lightweight model offering high detection accuracy and fast detection speed, making it easily adaptable to other tasks.
Although the SCS-YOLO model has significantly improved all indexes, there are still areas for further optimization. For instance, the recall is still slightly low, with a miss detection rate of 7.5%. In the future, efforts will focus on further optimizing the model. Transfer learning will be explored to apply the model to other defect detection tasks, and integration with other quality control systems will be investigated to enhance the model’s applicability.