Article

BSMD-YOLOv8: Enhancing YOLOv8 for Book Signature Marks Detection

Long Guo, Lubin Wang, Qiang Yu and Xiaolan Xie
1 College of Computer Science and Engineering, Guilin University of Technology, Guilin 541006, China
2 Guangxi Key Laboratory of Embedded Technology and Intelligent Systems, Guilin University of Technology, Guilin 541006, China
3 College of Information Engineering, Guilin Institute of Information Technology, Guilin 541004, China
4 National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2024, 14(23), 10829; https://doi.org/10.3390/app142310829
Submission received: 24 September 2024 / Revised: 15 November 2024 / Accepted: 16 November 2024 / Published: 22 November 2024

Abstract

In the field of bookbinding, accurately and efficiently detecting signature sequences during the binding process is crucial for enhancing quality, improving production efficiency, and advancing industrial automation. Despite significant advancements in object detection technology, verifying the correctness of signature sequences remains challenging due to the small size, dense distribution, and abundance of low-quality signature marks. To tackle these challenges, we introduce the Book Signature Marks Detection (BSMD-YOLOv8) model, specifically designed for scenarios involving small, closely spaced objects such as signature marks. Our proposed backbone, the Lightweight Multi-scale Residual Network (LMRNet), achieves a lightweight design while enhancing the accuracy of small object detection. To address the insufficient fusion of local and global feature information in PANet, we design the Low-stage gather-and-distribute (Low-GD) module and the High-stage gather-and-distribute (High-GD) module to enhance the model's multi-scale feature fusion capabilities, thereby refining the integration of local and global features of signature marks. Furthermore, we introduce Wise-IoU (WIoU) as a replacement for CIoU, prioritizing anchor boxes of moderate quality and mitigating harmful gradients from low-quality examples. Experimental results demonstrate that, compared with YOLOv8n, BSMD-YOLOv8 reduces the number of parameters by 65%, increases the frame rate by 7 FPS, and improves accuracy, recall, and mAP50 by 2.2%, 8.6%, and 3.9%, respectively, achieving rapid and accurate detection of signature marks.

1. Introduction

In the field of bookbinding, the production of book signatures is a crucial step in the binding process, playing a vital role in ensuring the sequential order and completeness of book content. The signature mark, a special mark printed on the outer spine fold of each signature, serves as a visual inspection tool to verify the correctness of signature placement. When signatures are arranged in the correct order, these marks form a trapezoidal pattern, which becomes a key indicator for validating the accuracy of signature arrangement [1]. To prevent issues such as missing or extra signatures during the binding process, precise detection of signature mark positions is particularly important. This detection step enables the automated identification of binding errors, significantly enhancing production efficiency and quality control levels. Therefore, developing an accurate and efficient detection algorithm for signature marks is of paramount importance in improving the overall quality of bookbinding, further increasing production efficiency, and advancing the level of industrial automation.
Traditional manual inspection methods are time-consuming and labor-intensive, and prolonged visual tasks easily lead to fatigue, increasing the likelihood of detection errors. Although detection methods based on traditional machine vision technology [2,3,4,5,6] have alleviated this issue to some extent, they rely on tedious manual feature engineering, which limits their efficiency and makes them sensitive to target occlusion, dense distribution, illumination changes, and scale variations. In contrast, deep learning-based object detection algorithms can automatically learn image features and exhibit stronger generalization capabilities, making them an effective solution for signature mark detection [7]. However, applying mainstream deep learning-based object detection algorithms directly to small, densely packed targets such as book signature marks remains challenging and often results in missed detections and false positives, as illustrated in Figure 1. Research on signature mark detection has introduced diverse methodologies, including an approach based on an enhanced version of YOLOv5 [8], which substantially boosts the detection accuracy of signature marks by incorporating a small object detection layer. Additionally, reference [9] draws inspiration from two-stage object detection algorithms to propose a dual-kernel convolutional neural network, achieving notable improvements in detection precision. While these methods have achieved some improvements, they still face trade-offs between accuracy, speed, and model complexity. Object detection algorithms are mainly divided into two categories: two-stage detection algorithms such as RCNN [10] and Faster RCNN [11], and single-stage detection algorithms such as the YOLO series [12] and SSD [13]. While two-stage algorithms offer higher recognition accuracy, they are slower in detection speed; single-stage algorithms, with their significant speed advantage, are better suited to real-time detection requirements. Among these, YOLOv8 stands out in terms of the speed-accuracy trade-off [14].
YOLOv8 comprises three primary elements: the backbone, the neck, and the detection head. It also offers five scaling factors (n, s, m, l, x) to tailor the model's size to diverse application needs. YOLOv8's backbone is based on CSPDarknet53 [15], which employs five downsampling stages to extract features at different scales. However, because signature marks in images are small and densely distributed, the structural design of CSPDarknet53 causes details from the shallower layers to be lost through extensive downsampling; it also introduces a large amount of redundant and unnecessary information into the feature maps. Both factors significantly hinder the accuracy of signature mark detection. To address this problem, we propose a new backbone, the Lightweight Multi-scale Residual Network (LMRNet), which achieves a lightweight network design while enhancing the accuracy of signature mark detection.
The neck component in YOLOv8 fuses features at multiple scales. To achieve this, it integrates elements from both the Feature Pyramid Network (FPN) [16] and the Path Aggregation Network (PANet) [17]. However, when integrating information across layers, this fusion scheme struggles to transmit information without loss, leaving insufficient local features of signature marks in the deep network. To address this issue, we design the Low-GD and High-GD modules to enhance the neck's multi-scale feature fusion capabilities, thereby improving the fusion of local and global features of signature marks.
In YOLOv8, the head component employs a decoupled architecture, where distinct branches are dedicated to object classification and bounding box regression tasks. The bounding box loss functions are crafted to precisely position objects by imposing penalties for any discrepancies observed between the predicted bounding boxes and the actual ground truth boxes. However, CIoU, which is used in YOLOv8, still exhibits certain limitations when dealing with low-quality samples in the dataset. To address this limitation, we introduce WIoU as a replacement for CIoU. WIoU prioritizes anchor boxes with moderate quality and mitigates the impact of harmful gradients arising from low-quality signature marks, ultimately leading to an improvement in the model’s overall performance.
In summary, we propose BSMD-YOLOv8, a novel and efficient approach for accurate book signature mark detection. Our method introduces a new backbone, the Lightweight Multi-scale Residual Network (LMRNet), which enhances the accuracy of small object detection while maintaining a lightweight network design. Additionally, we design Low-GD and High-GD modules to facilitate the integration of local and global features of signature marks. To reduce the impact of harmful gradients from low-quality examples, we replace the CIoU loss function with WIoU loss. Experimental results demonstrate that, compared to YOLOv8n, BSMD-YOLOv8 reduces the number of parameters by 65%, increases the frame rate by 7 FPS, and achieves improvements in accuracy, recall, and mAP50 by 2.2%, 8.6%, and 3.9% respectively, highlighting its capability for rapid and accurate detection of signature marks.
The rest of this paper is organized as follows. Section 2 offers an introduction to the dataset used in our study, accompanied by a statistical analysis. It also introduces the overall architecture of the BSMD-YOLOv8 model, details the proposed enhancements to YOLOv8, describes the model training environment, and outlines the evaluation metrics utilized in our research. In Section 3, we report the experimental results and perform a thorough evaluation, including comparisons with other leading one-stage models and ablation studies to evaluate the enhancements proposed in this paper. Lastly, Section 4 summarizes the key findings of the study.

2. Materials and Methods

2.1. Dataset

The dataset used in the experiment is a self-built one, collected from the production line of a printing factory using a CCD camera, with all images having a uniform resolution of 3072 × 2048. Because most of the book images collected from the production line come from the same batch of books, the similarity between the collected images is extremely high, which is not conducive to model training. To address this issue, a large number of highly similar samples were first removed. The labelImg annotation tool was then used to accurately annotate the signature marks. Finally, the annotated data were augmented by randomly combining eight image transformations: horizontal or vertical flipping, mirror symmetry, affine transformation, rotation, Gaussian noise addition, contrast adjustment, scale transformation, and translation. The final dataset contained 2739 images covering 36,078 signature marks and was divided into training, test, and validation sets in a ratio of 7:2:1, as shown in Table 1. Sample images before and after augmentation are shown in Figure 2.
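For readers who want to reproduce a comparable augmentation step, the sketch below uses the Albumentations library to randomly combine the eight transformations listed above; the probabilities, parameter ranges, and the augment_image helper are illustrative assumptions, not the exact settings used to build the dataset.

```python
import albumentations as A

# Illustrative augmentation pipeline covering the eight transforms described above.
# Probabilities and parameter ranges are assumptions, not the dataset's exact settings.
augmenter = A.Compose(
    [
        A.HorizontalFlip(p=0.5),                       # horizontal flip / mirror symmetry
        A.VerticalFlip(p=0.5),                         # vertical flip
        A.Affine(shear=(-10, 10), p=0.3),              # affine transformation
        A.Rotate(limit=15, p=0.3),                     # rotation
        A.GaussNoise(p=0.3),                           # Gaussian noise
        A.RandomBrightnessContrast(p=0.3),             # contrast adjustment
        A.RandomScale(scale_limit=0.2, p=0.3),         # scale transformation
        A.Affine(translate_percent=(0.0, 0.1), p=0.3), # translation
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

def augment_image(image, bboxes, class_labels):
    """Apply a random combination of the transforms to one annotated image."""
    out = augmenter(image=image, bboxes=bboxes, class_labels=class_labels)
    return out["image"], out["bboxes"], out["class_labels"]
```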
Analysis of the dataset shows that the signature marks in the images are small and densely distributed, as depicted in Figure 3a. Specifically, the width and height of each signature mark in the training set account for no more than 0.2 of the original image dimensions, with most below 0.04. Additionally, the dataset contains a significant number of low-quality signature marks, such as those obscured by binding threads, exhibiting dull colors, or possessing narrow shapes, as exemplified in Figure 3b.

2.2. Methodology

The architecture of the BSMD-YOLOv8 model is illustrated in Figure 4. Our research introduces three significant improvements aimed at enhancing signature mark detection.
Firstly, we replace the backbone network with LMRNet, which consists of multiple stacked Multi-scale Residual Convolution (MRConv) modules. Initially, the input image passes through a CBS module to fully extract its feature information. Subsequently, these image features are processed through a series of MRConv modules, ultimately generating a total of four feature maps of different scales, denoted as B1 (320 × 320), B2 (160 × 160), B3 (80 × 80), and B4 (40 × 40), respectively.
Secondly, we improve the PANet structure of the original YOLOv8 network by incorporating the Low-GD and High-GD modules. The Low-GD module aggregates and fuses the B1, B2, and B4 features output from the backbone network, while the High-GD module aggregates and fuses the B2 and B3 features.
Lastly, we adopt WIoU as an enhanced bounding box regression metric in place of CIoU. The WIoU loss incorporates a strategic gradient allocation mechanism. It reduces the penalty for geometric metrics when the predicted box aligns well with the target box, thereby mitigating the harmful gradients generated by low-quality signature marks and enabling the model to possess better generalization capabilities.

2.2.1. Lightweight Multi-Scale Residual Network

During image feature extraction in the backbone, essential spatial details are primarily captured in the shallower layers of the network. YOLOv8's backbone employs the CBS module to perform successive downsampling, from 2× to 32×, producing five feature maps of different resolutions, denoted as B1 (320 × 320), B2 (160 × 160), B3 (80 × 80), B4 (40 × 40), and B5 (20 × 20), as depicted in Figure 5a. However, excessive downsampling causes significant information loss for small object detection. Therefore, in the design of LMRNet, we remove the final downsampling operation that generates B5, as illustrated in Figure 5b.
Multi-scale Residual Convolution. In CSP-Darknet53, the extensive use of the CBS module effectively extracts image features, but it also generates a significant amount of redundant and irrelevant feature information in the feature maps, potentially obscuring the key signature mark features. To address this issue, we have designed the Multi-scale Residual Convolution (MRConv) module. As shown in Figure 6, the input feature map is equally divided along the channel dimension into two parts, denoted as $x_1$ and $x_2$, respectively. Subsequently, $x_1$ is processed through 3 × 3 and 5 × 5 convolutional modules to extract signature mark features under different receptive fields, capturing a diverse range of local feature information for signature marks. Meanwhile, $x_2$ is passed through a Depth-wise Convolution (DWConv) layer to minimize the generation of redundant and irrelevant feature information in the feature maps. Following this, the outputs from $x_1$ and $x_2$ are concatenated along the channel dimension. The concatenated result is then passed through a Point-wise Convolution (PWConv) layer to fully integrate the feature information across different channels. Depending on the convolution stride parameter, this module exhibits different behavior: when set to 2, it performs downsampling, reducing the feature map size by half; when set to 1, it employs a residual connection, adding the output result to the initial feature map, as illustrated in Equation (1).
$$\mathrm{MRConv}(x) =
\begin{cases}
\mathrm{PWConv}\left(\mathrm{Concat}\left(\mathrm{Conv}_{3\times3}(x_1),\ \mathrm{Conv}_{5\times5}(x_1),\ \mathrm{DWConv}(x_2)\right)\right), & s = 2 \\
\mathrm{PWConv}\left(\mathrm{Concat}\left(\mathrm{Conv}_{3\times3}(x_1),\ \mathrm{Conv}_{5\times5}(x_1),\ \mathrm{DWConv}(x_2)\right)\right) + x, & s = 1
\end{cases} \quad (1)$$
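For illustration, a minimal PyTorch sketch of the MRConv module described by Equation (1) is given below; the normalization and activation inside each convolution block (BatchNorm + SiLU, mirroring the CBS block) and the output channel width are our assumptions where the text does not specify them.

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """Convolution + BatchNorm + SiLU, analogous to YOLOv8's CBS block (assumed)."""
    def __init__(self, c_in, c_out, k=3, s=1, groups=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class MRConv(nn.Module):
    """Multi-scale Residual Convolution (Equation (1)): split channels, apply 3x3 and
    5x5 convs to one half, a depth-wise conv to the other, concatenate, then fuse
    with a point-wise conv. Residual connection only when stride = 1."""
    def __init__(self, channels, stride=1):
        super().__init__()
        half = channels // 2
        self.stride = stride
        self.conv3 = ConvBNAct(half, half, k=3, s=stride)
        self.conv5 = ConvBNAct(half, half, k=5, s=stride)
        self.dwconv = ConvBNAct(half, half, k=3, s=stride, groups=half)  # depth-wise conv
        self.pwconv = ConvBNAct(3 * half, channels, k=1, s=1)            # point-wise fusion

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)  # split along the channel dimension
        y = torch.cat([self.conv3(x1), self.conv5(x1), self.dwconv(x2)], dim=1)
        y = self.pwconv(y)
        return y + x if self.stride == 1 else y
```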

2.2.2. Improved PANet

In image processing, the fusion of global and local features is a widely adopted and effective strategy [18,19,20]. Local features primarily focus on specific key points or regions within the image, such as the corners or edges of signature marks, and these fine details play a pivotal role in accurate signature mark recognition. On the other hand, global features capture the overall properties of the image, including the color distribution and shape contour of the book spine area. Given that signature marks are exclusively present in the book spine area, accurately pinpointing their positions heavily relies on these global features as well.
Features at different levels carry positional information about objects of various sizes. Larger features encapsulate low-dimensional texture details and positions of smaller objects, while smaller features contain high-dimensional information and positions of larger objects. This suggests that the positional information of signature marks is present in larger features, whereas the positional information of book spines is present in smaller features. As illustrated in Figure 7a, despite its numerous paths and indirect interaction methods, PANet is only capable of fully integrating feature information from adjacent layers and cannot ensure lossless propagation of cross-layer feature information. This limitation restricts the amount of local feature information of signature marks that can be fused with the global feature information of the book spine area in the deep network, hindering the distinction between book signature marks and the background. Based on the Gather-and-Distribute Mechanism [21], this paper introduces Low-GD and High-GD modules aimed at enhancing the fusion of local and global features by incorporating multi-scale features extracted from the backbone into the deep network. The structure of the improved PANet is shown in Figure 7b.
Low-stage gather-and-distribute module. The Low-GD module retains more low-level information by aggregating and fusing the B1, B2, and B4 features output from the backbone network, ultimately producing high-resolution features that preserve small-target information. As depicted in Figure 8, the input multi-scale features are first aligned by downsampling the larger-scale B1 and B2 features through average pooling and upsampling the B4 feature through bilinear interpolation, adjusting them to the same scale as the N3 feature. The aligned features are then concatenated along the channel dimension and fused through PWConv. Following this, a Context Augmentation Module (CAM) [22] is utilized to capture spatial features of different receptive field sizes and model the differences between objects and background in greater detail. This module consists of 3 × 3 depth-wise separable dilated convolutions (DilateDWConv) [23] with dilation rates of 1, 3, and 5. Finally, another PWConv is applied for fusion and output.
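The following PyTorch sketch illustrates the Low-GD flow described above: align B1 and B2 by average pooling, upsample B4 by bilinear interpolation, concatenate, fuse with PWConv, apply 3 × 3 depth-wise dilated convolutions with rates 1, 3, and 5, and fuse again. Channel widths, the use of adaptive pooling for alignment, and the summation of the dilated branches are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowGD(nn.Module):
    """Low-stage gather-and-distribute: align B1, B2, B4 to the N3 scale, fuse them,
    then enlarge the receptive field with dilated depth-wise convolutions."""
    def __init__(self, c_b1, c_b2, c_b4, c_out):
        super().__init__()
        self.fuse_in = nn.Conv2d(c_b1 + c_b2 + c_b4, c_out, kernel_size=1)  # PWConv fusion
        # Context augmentation: 3x3 depth-wise dilated convs with rates 1, 3, 5.
        self.dilated = nn.ModuleList([
            nn.Conv2d(c_out, c_out, 3, padding=r, dilation=r, groups=c_out)
            for r in (1, 3, 5)
        ])
        self.fuse_out = nn.Conv2d(c_out, c_out, kernel_size=1)              # output PWConv

    def forward(self, b1, b2, b4, target_size):
        # Align every input to the target (N3) resolution.
        b1 = F.adaptive_avg_pool2d(b1, target_size)     # downsample larger feature maps
        b2 = F.adaptive_avg_pool2d(b2, target_size)
        b4 = F.interpolate(b4, size=target_size, mode="bilinear", align_corners=False)
        x = self.fuse_in(torch.cat([b1, b2, b4], dim=1))
        x = sum(conv(x) for conv in self.dilated)       # multi-receptive-field context
        return self.fuse_out(x)
```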
High-stage gather-and-distribute module. To fuse global feature information more efficiently, we employ attention operations. As illustrated in Figure 9, this module aggregates and integrates the B2 and B3 features output from the backbone network. The B3 and B4 features are first downsampled through average pooling to obtain feature maps of the same scale as N4. Subsequently, the features are concatenated and merged using PWConv. Following this, an attention mechanism is employed to enhance the output by adaptively correlating global feature information; this attention module consists of channel attention [24] and spatial attention [25]. Finally, another PWConv is applied for the final output.
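A rough PyTorch sketch of the High-GD flow is shown below; the particular channel-attention and spatial-attention formulations (an SE-style channel squeeze followed by a CBAM-style 7 × 7 spatial map) are assumptions standing in for the cited attention modules [24,25].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighGD(nn.Module):
    """High-stage gather-and-distribute: pool the inputs to the N4 scale, fuse with
    PWConv, then re-weight with channel and spatial attention before the output PWConv."""
    def __init__(self, c_in_total, c_out):
        super().__init__()
        self.fuse_in = nn.Conv2d(c_in_total, c_out, kernel_size=1)
        # Channel attention (SE-style squeeze over spatial dimensions, assumed).
        self.channel_fc = nn.Sequential(
            nn.Linear(c_out, c_out // 4), nn.ReLU(),
            nn.Linear(c_out // 4, c_out), nn.Sigmoid(),
        )
        # Spatial attention (CBAM-style 7x7 conv over pooled channel maps, assumed).
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.fuse_out = nn.Conv2d(c_out, c_out, kernel_size=1)

    def forward(self, feats, target_size):
        feats = [F.adaptive_avg_pool2d(f, target_size) for f in feats]  # align scales
        x = self.fuse_in(torch.cat(feats, dim=1))
        w = self.channel_fc(x.mean(dim=(2, 3)))                          # channel weights (B, C)
        x = x * w[:, :, None, None]
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        x = x * torch.sigmoid(self.spatial_conv(s))                      # spatial re-weighting
        return self.fuse_out(x)
```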

2.2.3. Improvement of the Loss Function

The loss function for bounding box regression plays a crucial role in object detection. In YOLOv8, the loss adopted for bounding box regression is the CIoU loss [26]. The calculation formula for CIoU is as follows:
$$L_{\mathrm{CIoU}} = 1 - IoU + \frac{\rho^2(b_{gt}, b_{pred})}{C^2} + \alpha\nu \quad (2)$$
$$IoU = \frac{\left|b_{gt} \cap b_{pred}\right|}{\left|b_{gt} \cup b_{pred}\right|} \quad (3)$$
$$\nu = \frac{4}{\pi^2}\left(\arctan\frac{w_{gt}}{h_{gt}} - \arctan\frac{w_{pred}}{h_{pred}}\right)^2 \quad (4)$$
$$\alpha = \frac{\nu}{1 - IoU + \nu} \quad (5)$$
Here, $b_{gt}$ denotes the ground truth bounding box and $b_{pred}$ the predicted bounding box. $\rho^2(b_{gt}, b_{pred})$ is the squared Euclidean distance between the centers of the predicted and ground truth bounding boxes. $C$ is the diagonal length of the smallest enclosing rectangle that contains both boxes. $\nu$ measures the consistency of the aspect ratios of the two boxes, and $\alpha$ is a balancing parameter.
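As a reference, the CIoU loss of Equations (2) to (5) can be computed as in the following sketch, assuming boxes in (x1, y1, x2, y2) corner format; this is an illustrative implementation, not the exact code used inside YOLOv8.

```python
import math
import torch

def bbox_ciou_loss(pred, target, eps=1e-7):
    """CIoU loss per Equations (2)-(5). pred and target are (N, 4) tensors in corner format."""
    # Intersection and union (Equation (3)).
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared center distance (rho^2) and enclosing-box diagonal (C^2).
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term (Equations (4) and (5)).
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v  # Equation (2)
```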
CIoU comprehensively considers the aspect ratio, center point distance, and overlap area between the predicted bounding box and the ground truth bounding box. However, training data inevitably contain low-quality annotations of signature marks, and geometric factors such as distance and aspect ratio can aggravate the penalty for these low-quality annotations, potentially degrading the generalization performance of the model. To mitigate this issue, Tong et al. [27] designed WIoU, which weakens the penalty of geometric factors when the anchor box coincides well with the target box. This reduced intervention during training allows the model to achieve better generalization ability. The formula for WIoU is as follows:
$$L_{\mathrm{WIoU}} = R_{\mathrm{WIoU}} \, L_{IoU} \quad (6)$$
$$R_{\mathrm{WIoU}} = \exp\!\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{W_g^2 + H_g^2}\right) \quad (7)$$
Here, $L_{IoU} = 1 - IoU$, which lies in the interval [0, 1]. When the anchor box aligns well with the target box, this significantly reduces $R_{WIoU}$ for the high-quality anchor box and diminishes the emphasis on the distance between center points. Conversely, $R_{WIoU} \in [1, e)$, which notably amplifies $L_{IoU}$ for ordinary-quality anchor boxes.
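A corresponding sketch of the WIoU loss in Equations (6) and (7) is given below, again assuming corner-format boxes; detaching the enclosing-box terms follows the Wise-IoU formulation [27].

```python
import torch

def bbox_wiou_loss(pred, target, eps=1e-7):
    """WIoU loss per Equations (6)-(7). pred and target are (N, 4) tensors in corner format."""
    # Plain IoU loss, L_IoU = 1 - IoU.
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    l_iou = 1 - inter / (area_p + area_t - inter + eps)

    # R_WIoU: center distance normalised by the enclosing-box size (Equation (7)).
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    wg = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    hg = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    # W_g and H_g are detached from the graph, as in the Wise-IoU paper.
    r_wiou = torch.exp(((cxp - cxt) ** 2 + (cyp - cyt) ** 2) /
                       (wg.detach() ** 2 + hg.detach() ** 2 + eps))

    return r_wiou * l_iou  # Equation (6)
```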

2.3. Model Training Environment

The experiment was performed on an Ubuntu 18.04 operating system with a CPU configuration of 12 vCPUs, specifically an Intel(R) Xeon(R) Silver 4214R CPU running at 2.40 GHz (Intel, Santa Clara, CA, USA). We used a single RTX 3080 Ti GPU with 12 GB of memory. The deep learning framework was PyTorch 1.8.1, along with CUDA version 11.1. For the training process, the image size was set to 640 × 640, the number of training epochs was 200, the batch size was 16, and the initial learning rate of the network was 0.01. The detailed model training configuration parameters are provided in Table 2.
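For context, training with these hyperparameters through the Ultralytics YOLOv8 interface would look roughly like the following sketch; the model and dataset YAML names are placeholders, since neither file is released with this paper.

```python
from ultralytics import YOLO

# Hypothetical file names; the BSMD-YOLOv8 model definition and the dataset YAML
# are not publicly released, so these are placeholders.
model = YOLO("bsmd-yolov8.yaml")

model.train(
    data="signature_marks.yaml",  # dataset split definition (train/val/test)
    imgsz=640,                    # input image size used in the experiments
    epochs=200,                   # number of training epochs
    batch=16,                     # batch size
    lr0=0.01,                     # initial learning rate
)
```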

2.4. Evaluation Metrics

To comprehensively assess the performance of the improved model, the evaluation metrics used in this paper include precision (P), recall (R), mean average precision (mAP50), number of parameters, giga floating-point operations (GFLOPs), and frames per second (FPS).
Precision (P). Precision is used to measure the model's accuracy in detecting signature marks, with higher precision corresponding to a lower false detection rate. The calculation formula is as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (8)$$
Recall (R). Recall is used to evaluate the comprehensiveness of the model, with higher recall corresponding to a lower missed detection rate, as defined by Equation (9).
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (9)$$
where True Positive (TP) represents the number of signature marks correctly identified, False Positive (FP) represents the number of background regions falsely detected as signature marks, and False Negative (FN) represents the number of signature marks that were not detected.
Average Precision (AP). AP represents the area under the precision-recall curve, calculated using Equation (10).
$$AP = \int_0^1 \mathrm{Precision}(\mathrm{Recall}) \, d(\mathrm{Recall}) \quad (10)$$
Mean Average Precision (mAP). mAP represents the average AP value across all categories, indicating the model’s overall detection performance across the entire dataset, as defined by Equation (11).
$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \quad (11)$$
mAP50. mAP50 is the average precision calculated at an IoU threshold of 0.5.
FPS. FPS was calculated by running validation on the test set, during which we measured the average times for preprocessing, inference, and postprocessing and then used these values to determine the final FPS. The formula for this computation is given below:
$$FPS = \frac{1}{\mathit{preprocess} + \mathit{inference} + \mathit{postprocess}} \quad (12)$$
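A compact helper that evaluates precision, recall, and FPS from detection counts and per-image timings, following Equations (8), (9), and (12); the function and variable names are ours.

```python
def detection_metrics(tp, fp, fn, t_pre, t_inf, t_post):
    """Compute precision, recall (Equations (8)-(9)) and FPS (Equation (12)).

    tp, fp, fn           -- counts of true positives, false positives, false negatives
    t_pre, t_inf, t_post -- average per-image preprocessing, inference, and
                            postprocessing times, in seconds
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fps = 1.0 / (t_pre + t_inf + t_post)
    return precision, recall, fps

# Example: 950 correct detections, 20 false alarms, 50 misses, 8.5 ms total per image.
print(detection_metrics(950, 20, 50, 0.001, 0.007, 0.0005))
```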

3. Experimental Results

In this section, we perform a thorough assessment of the BSMD-YOLOv8 model through a series of targeted experiments. Firstly, we compare LMRNet with other lightweight backbone networks to verify its superiority. Secondly, we compare BSMD-YOLOv8 with other mainstream algorithms and present a detailed comparison of their actual test results. Finally, we conduct ablation experiments on the enhanced methods of BSMD-YOLOv8 to validate the contributions of each improvement.

3.1. Detection Model Comparison

To demonstrate the advantages of the proposed BSMD-YOLOv8 algorithm model, we conducted comparative experiments with YOLOv8n and other popular one-stage models using the same dataset and experimental environment.
Table 3 showcases the distinctions between the proposed architecture and the standard YOLOv8n architecture, along with a performance comparison. Despite an increase in giga floating-point operations compared to YOLOv8n, BSMD-YOLOv8 demonstrates significant improvements in other metrics.
According to the results presented in Table 4, BSMD-YOLOv8 outperforms the other models, achieving the highest accuracy (P), recall (R), and mAP50. Specifically, compared to Improved-YOLOv5s, BSMD-YOLOv8 exhibits an increase in accuracy and mAP50 of 0.4% and 0.1%, respectively, a reduction in parameters of 6.01 M, and an improvement in FPS of 23 frames per second. Figure 10 provides a comparison of actual test results, revealing that the YOLOv8n model suffers from a higher number of missed detections and false detections, while Improved-YOLOv5s also exhibits some false detections. In contrast, the BSMD-YOLOv8 model demonstrates the best actual detection performance.

3.2. Ablation Experiments

To evaluate the superiority of LMRNet, we conducted experiments in which the backbone of YOLOv8n was replaced with different lightweight backbone networks under consistent training conditions. As shown in Table 5, the detection accuracy of the other lightweight backbone networks decreased, while LMRNet achieved the best detection performance.
To assess the effectiveness of each enhancement technique proposed in this study, ablation experiments were conducted under the same experimental settings; the results are shown in Table 6.
YOLOv8n_L demonstrates that incorporating LMRNet significantly enhances the model's accuracy, recall, and mAP50 while also notably reducing the number of model parameters. In YOLOv8n_I, however, accuracy decreased by 0.8%, recall by 1.4%, and mAP50 by 0.7%. These results indicate that using the improved PANet alone does not improve performance. This is because the features extracted by the YOLOv8n backbone contain a significant amount of noise, which leads the Low-GD and High-GD modules to fuse redundant and invalid features, ultimately affecting the model's performance. Comparing YOLOv8n_L_I with YOLOv8n_L, using the improved PANet on top of LMRNet increases accuracy, recall, and mAP50 by 0.8%, 0.7%, and 0.3%, respectively. These findings verify the effectiveness of the improved PANet. Furthermore, in YOLOv8n_W and BSMD-YOLOv8, the introduction of the WIoU loss function led to further improvements in accuracy, recall, and mAP50.
Figure 11 displays the evaluation metrics of the ablation models over 200 training epochs, with a particular focus on mAP50 and recall. In the initial stages, all models exhibit a steep rise in both mAP50 and recall, reflecting the large gains made early in training. As the number of epochs increases, the metrics gradually plateau, indicating convergence. Overall, the models that use LMRNet as the backbone (YOLOv8n_L, YOLOv8n_L_I, and BSMD-YOLOv8) improve markedly more than those that do not (YOLOv8n, YOLOv8n_I, and YOLOv8n_W).
Figure 12 compares the detection results of YOLOv8n_L_I and BSMD-YOLOv8. It is evident that incorporating WIoU notably improves the model's generalization capability, allowing it to effectively detect low-quality signature marks.

4. Conclusions

To address the challenges of detecting small and densely distributed book signature marks, which often result in high rates of false and missed detections, we have introduced BSMD-YOLOv8, a specialized object detection model designed specifically for this particular task. By incorporating the Lightweight Multi-scale Residual Network (LMRNet) as an alternative backbone to YOLOv8n, we have achieved a more streamlined network architecture while significantly enhancing the accuracy of signature mark detection. Additionally, the introduction of the Low-GD and High-GD modules has improved the multi-scale feature fusion capabilities of the neck, leading to better integration of local and global features and further boosting recognition accuracy. The adoption of the WIoU loss function in place of the CIoU loss function has contributed to the model’s enhanced generalization performance, particularly in detecting low-quality book signature marks.
Experimental results have demonstrated the exceptional performance of the BSMD-YOLOv8 algorithm model on our self-built dataset. Compared to the original YOLOv8n model, it has achieved a substantial 65% reduction in network parameters, a notable seven frames per second increase in FPS, and significant improvements in accuracy, recall, and mAP50 by 2.2%, 8.6%, and 3.9% respectively. Furthermore, it has surpassed the performance of other currently available mainstream algorithm models.
Although the BSMD-YOLOv8 model effectively addresses false and missed detections in book signature mark detection, and its compact network size and rapid detection speed make it well suited for practical applications, there is still room for improvement, particularly in reducing the model's floating-point operations and further refining its accuracy on even smaller and more densely packed targets. Future research could optimize the model's architecture to raise its performance, broaden its applicability to more intricate scenarios, improve its real-time performance and efficiency, and integrate additional technologies or methodologies to advance signature sequence detection within bookbinding processes.

Author Contributions

Conceptualization, Q.Y., L.G. and L.W.; methodology, L.G. and L.W.; software, L.G. and L.W.; validation, L.G., Q.Y., X.X. and L.W.; formal analysis, Q.Y. and X.X.; investigation, L.G. and L.W.; resources, Q.Y. and X.X.; data curation, L.G., L.W. and Q.Y.; writing—original draft preparation, L.G. and L.W.; writing—review and editing, L.G., Q.Y. and L.W.; visualization, L.G. and L.W.; supervision, Q.Y. and X.X.; project administration, Q.Y.; funding acquisition, Q.Y. and X.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guilin Major Special Project (20220103-1), the Guangxi Science and Technology Base and Talent Special Project (Gui Ke AD24010012), the Guangxi Key Research and Development Plan (Gui Ke AB23026105), and the Science and Technology Innovation Base Construction Class (Gui Ke ZY21195030).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the authors; the data underlying the results are not publicly available at this time.

Acknowledgments

We extend our heartfelt gratitude to the organizations that provided financial support for this research, namely the Guilin Major Special Project (20220103-1), the Guangxi Science and Technology Base and Talent Special Project (Gui Ke AD24010012), the Guangxi Key Research and Development Plan (Gui Ke AB23026105), and the Science and Technology Innovation Base Construction Class (Gui Ke ZY21195030).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, L. Research on Book Association Detection Based on Signature Marks. Master’s Thesis, Beijing Jiaotong University, Beijing, China, 2021. [Google Scholar]
  2. Wang, M.; Peng, X. Exploitation of the Online Detection System of Bookbinding Signature Mark. Packag. Eng. 2016, 37, 171–174. [Google Scholar]
  3. Sheng, G.; Shu, X. An Adaptive Signature Mark Detection Method Based on Phase Correlation for Bookbinding. Packag. Eng. 2018, 39, 4. [Google Scholar]
  4. Yan, F. Research and Design of Signature Detecting System Based on Robot Vision. Master’s Thesis, Xi’an University of Technology, Xi’an, China, 2011. [Google Scholar]
  5. Hu, X. Overall Design of Production Line and the Design of Assembling Machine detection System for Children’s Hardcover. Master’s Thesis, Southwest Jiaotong University, Chengdu, China, 2008. [Google Scholar]
  6. He, S.; Yu, Q. Design and implementation of automatic detection system for book production. Manuf. Autom. 2023, 45, 17–20. [Google Scholar]
  7. Chen, Y.; Ding, Y.; Zhao, F.; Zhang, E.; Wu, Z.; Shao, L. Surface Defect Detection Methods for Industrial Products: A Review. Appl. Sci. 2021, 11, 7657. [Google Scholar] [CrossRef]
  8. Yang, X.; Wang, H.; Dong, M. Improved YOLOv5’s book Ladder label detection algorithm. J. Guilin Univ. Technol. 2022. Available online: https://kns.cnki.net/kcms/detail/45.1375.N.20221013.1439.002.html (accessed on 10 November 2023).
  9. Wang, L.; Xie, X.; Huang, P.; Yu, Q. DYNet: A Printed Book Detection Model Using Dual Kernel Neural Networks. Sensors 2023, 23, 9880. [Google Scholar] [CrossRef] [PubMed]
  10. Zhang, J.; Ma, P.; Jiang, T.; Zhao, X.; Tan, W.; Zhang, J.; Zou, S.; Huang, X.; Grzegorzek, M.; Li, C. SEM-RCNN: A Squeeze-and-Excitation-Based Mask Region Convolutional Neural Network for Multi-Class Environmental Microorganism Detection. Appl. Sci. 2022, 12, 9902. [Google Scholar] [CrossRef]
  11. Wang, H.; Xiao, N. Underwater Object Detection Method Based on Improved Faster RCNN. Appl. Sci. 2023, 13, 2746. [Google Scholar] [CrossRef]
  12. Hussain, M. YOLO-v1 to YOLO-v8, the Rise of YOLO and Its Complementary Nature toward Digital Manufacturing and Industrial Defect Detection. Machines 2023, 11, 677. [Google Scholar] [CrossRef]
  13. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector; Springer: Berlin/Heidelberg, Germany, 2016; Volume 9905, pp. 21–37. [Google Scholar] [CrossRef]
  14. Terven, J.; Cordova-Esparza, D. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Make 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  15. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  16. Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Honolulu, HI, USA, 2017; pp. 936–944. [Google Scholar]
  17. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  18. Lai, Z.-H.; Zhang, T.-H.; Liu, Q.; Qian, X.; Wei, L.-F.; Chen, S.-L.; Chen, F.; Yin, X.-C. InterFormer: Interactive Local and Global Features Fusion for Automatic Speech Recognition. arXiv 2023, arXiv:2305.16342. [Google Scholar]
  19. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar]
  20. Zhao, B.; Xiong, Q.; Zhang, X.; Guo, J.; Liu, Q.; Xing, X.; Xu, X. PointCore: Efficient Unsupervised Point Cloud Anomaly Detector Using Local-Global Features. arXiv 2024, arXiv:2403.01804. [Google Scholar]
  21. Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Han, K.; Wang, Y. Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism. arXiv 2023, arXiv:2309.11331. [Google Scholar]
  22. Xiao, J.; Zhao, T.; Yao, Y.; Yu, Q.; Chen, Y. Context Augmentation and Feature Refinement Network for Tiny Object Detection. 2022. Available online: https://paperswithcode.com/paper/context-augmentation-and-feature-refinement (accessed on 23 September 2024).
  23. Xu, S.; Zheng, S.; Xu, W.; Xu, R.; Wang, C.; Zhang, J.; Teng, X.; Li, A.; Guo, L. HCF-Net: Hierarchical Context Fusion Network for Infrared Small Object Detection. arXiv 2024, arXiv:2403.10778. [Google Scholar]
  24. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Seattle, WA, USA, 2020; pp. 11531–11539. [Google Scholar]
  25. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 3–19. [Google Scholar]
  26. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. IEEE Trans. Cybern. 2022, 52, 8574–8586. [Google Scholar] [CrossRef] [PubMed]
  27. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  28. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  29. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  30. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; IEEE: Chennai, India, 2024; pp. 1–6.
  31. Tan, M.; Le, Q. EfficientNetV2: Smaller Models and Faster Training. In Proceedings of the 38th International Conference on Machine Learning, PMLR, 1 July 2021; pp. 10096–10106. [Google Scholar]
  32. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features From Cheap Operations. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1577–1586. [Google Scholar]
  33. Vasu, P.K.A.; Gabriel, J.; Zhu, J.; Tuzel, O.; Ranjan, A. MobileOne: An Improved One Millisecond Mobile Backbone. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7907–7917. [Google Scholar]
Figure 1. Examples of mismatched detections and accurate detections.
Figure 2. Examples of the dataset: (a) before data augmentation; (b) after data augmentation.
Figure 3. Analysis results of the dataset: (a) Information regarding the manual annotation process for objects in the dataset; (b) Low-quality signature mark examples (The areas circled in red).
Figure 4. BSMD-YOLOv8 network structure diagram.
Figure 5. Different backbone network designs: (a) CSP-Darknet53; (b) LMRNet.
Figure 6. Multi-scale residual convolution module structure diagram.
Figure 7. PANet and Improved PANet structure diagram.
Figure 8. Low-stage gather-and-distribute module structure diagram.
Figure 9. High-stage gather-and-distribute module structure diagram.
Figure 10. Comparison of actual test results: (a) original image; (b) inference results for YOLOv8n; (c) inference results for Improved-YOLOv5s; (d) inference results for BSMD-YOLOv8.
Figure 11. Training progress plot comparing ablation experiments based on mAP50 and Recall: (a) mAP50 vs. Epochs; (b) Recall vs. Epochs.
Figure 12. Comparison of detection results between YOLOv8n_L_I and BSMD-YOLOv8: (a) YOLOv8n_L_I (using CIoU); (b) BSMD-YOLOv8 (using WIoU).
Table 1. Partitioning of data sets.
Split | Number of Images | Number of Signature Marks
train | 1917 | 25,338
validation | 274 | 3603
test | 548 | 7137
total | 2739 | 36,078
Table 2. Model training configuration parameters.
Configuration Parameter | Value
operating system | Ubuntu 18.04
CPU | 12 vCPU Intel(R) Xeon(R) Silver 4214R CPU @ 2.40 GHz
GPU | RTX 3080 Ti (12 GB) × 1
RAM | 90 GB
deep learning framework | PyTorch 1.8.1
programming language | Python 3.8
CUDA version | CUDA 11.1
training epochs | 200
batch size | 16
learning rate | 0.01
Table 3. Comparison with YOLOv8n.
Model | Backbone | Neck | Loss Function | P/% | R/% | mAP50/% | Params/M | GFLOPs | FPS
YOLOv8n | CSP-Darknet-53 | PANet | CIoU | 96.0 | 88.2 | 95.2 | 3.01 | 8.2 | 111
BSMD-YOLOv8 | LMRNet | Improved_PANet | WIoU | 97.2 | 94.5 | 97.8 | 0.93 | 10.4 | 118
Table 4. Comparison of different one-stage detection models.
Models | Backbone | P/% | R/% | mAP50/% | Params/M | GFLOPs | FPS
SSD [13] | MobileNetV2 | 61.7 | 38.5 | 64.7 | 91.60 | 116.3 | 52
YOLOv3-tiny [15] | Darknet-53 | 76.0 | 54.7 | 70.2 | 12.12 | 18.9 | 88
YOLOv5n | CSP-Darknet-53 | 95.4 | 82.6 | 93.1 | 2.50 | 7.2 | 105
YOLOv5s | CSP-Darknet-53 | 96.8 | 82.8 | 93.7 | 9.12 | 24.0 | 92
YOLOv6n [28] | CSPStackRep | 94.1 | 81.8 | 91.9 | 4.23 | 11.9 | 103
YOLOv7-tiny [29] | ELAN | 95.8 | 83.3 | 94.5 | 6.0 | 13.2 | 97
Gold-YOLO [21] | CSP-Darknet-53 | 96.1 | 80.3 | 92.5 | 6.02 | 12.0 | 98
YOLOv8s [30] | CSP-Darknet-53 | 96.8 | 89.6 | 95.9 | 11.13 | 28.6 | 90
YOLOv9t | - | 96.4 | 83.9 | 93.3 | 2.00 | 7.8 | 125
YOLOv10n | - | 92.7 | 79.9 | 91.5 | 2.70 | 8.4 | 115
Improved-YOLOv5s [8] | CSP-Darknet-53 | 97.8 | 96.8 | 99.0 | 7.06 | 25.1 | 95
BSMD-YOLOv8 | LMRNet | 98.2 | 96.8 | 99.1 | 1.05 | 11.2 | 118
Table 5. Comparison of different backbone networks.
Backbone | P/% | R/% | mAP50/% | Params/M | GFLOPs | FPS
CSP-Darknet-53 (YOLOv8n) | 96.0 | 88.2 | 95.2 | 3.01 | 8.2 | 111
EfficientNetV2 [31] | 67.7 | 57.0 | 68.9 | 2.12 | 2.6 | 118
GhostNet [32] | 95.5 | 84.6 | 93.5 | 2.82 | 7.8 | 115
MobileOne [33] | 93.1 | 80.1 | 91.0 | 2.86 | 8.4 | 113
LMRNet | 97.2 | 94.5 | 97.8 | 0.93 | 10.4 | 120
Table 6. Results of ablation experiments.
Models | LMRNet | Improved_PANet | WIoU | P/% | R/% | mAP50/% | Params/M
YOLOv8n | × | × | × | 96.0 | 88.2 | 95.2 | 3.01
YOLOv8n_L | ✓ | × | × | 96.8 | 94.3 | 98.4 | 0.93
YOLOv8n_I | × | ✓ | × | 95.2 | 86.8 | 94.5 | 3.50
YOLOv8n_W | × | × | ✓ | 96.6 | 88.5 | 95.5 | 3.01
YOLOv8n_L_I | ✓ | ✓ | × | 97.6 | 95.0 | 98.7 | 1.05
BSMD-YOLOv8 | ✓ | ✓ | ✓ | 98.2 | 96.8 | 99.1 | 1.05
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
