2.1. YOLO Framework
YOLO represents an advanced object detection algorithm that not only inherits the efficiency and speed of its predecessors but also further enhances performance, especially in intricate and complex scenes [29]. The algorithm’s network design and techniques improve detection precision while maintaining high processing speeds, a combination that allows it to meet the diverse and demanding requirements of real-world applications. The network structure is depicted in the accompanying diagram. Firstly, the algorithm can employ a lightweight network structure, such as MobileNetV2, which enables the model to operate effectively on mobile and edge computing devices while maintaining high performance. The lightweight design not only reduces the model’s computational complexity but also decreases its storage requirements, thereby facilitating deployment and usage in practical applications. Secondly, a novel backbone network design is introduced, which enhances feature extraction and effectively improves the model’s detection performance. By increasing the resolution of the features and incorporating more contextual information, the model gains a more nuanced understanding of the objects within the image, and this backbone design is more resilient when processing objects of disparate sizes and shapes. Furthermore, YOLO employs a cascading and pyramid approach, empowering the algorithm to process objects of varying sizes. Regarding the network structure, the object detection task is divided into two distinct sub-tasks, classification and localization; because each sub-task has its own dedicated network path, the algorithm handles objects of varying sizes better, and this design boosts the efficiency of object detection in visually cluttered scenes. Regarding the loss function, YOLO has also been optimized: the loss considers both positional and class information and includes additional regularization terms to improve model stability and generalization, allowing YOLO to converge better during training and to exhibit higher performance on the test set.
The YOLO object detection network enhances accuracy and speed through innovative techniques such as lightweight structures, new backbones, cascading and pyramid methods, and optimized loss functions [30,31]. This makes YOLO widely applicable in various real-world scenarios, particularly in fields such as autonomous driving, security surveillance, and medical image analysis, where it holds significant potential [32,33,34].
2.1.1. Backbone Network in YOLO
The backbone network in YOLO serves as the foundation for feature extraction, playing a pivotal role in capturing hierarchical representations of input images. It generally consists of a sequence of convolutional layers arranged in a hierarchical structure, with each layer extracting increasingly abstract features from the input image, as illustrated in Figure 2.
Key Components. (1) The backbone network is composed of several layers of convolutional filters, which scan across the input image to extract pertinent features. These layers operate on the raw pixel values of the input image and progressively learn to capture more complex patterns and structures. (2) Downsampling layers such as max-pooling or strided convolutions are interspersed throughout the backbone network to reduce the spatial dimensions of the feature maps while increasing the receptive field. Downsampling helps in capturing features at different scales and resolutions, enabling the network to detect objects of varying sizes. (3) Some backbone architectures, such as ResNet, incorporate skip connections or residual connections between layers. These connections facilitate the flow of gradients during training and mitigate the vanishing gradient problem, allowing for deeper networks to be trained more effectively. (4) In certain backbone architectures, feature fusion techniques may be employed to combine features from multiple layers or branches of the network. Feature fusion enhances the network’s ability to capture both low-level details and high-level semantics, leading to richer feature representations.
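For illustration only, the following minimal PyTorch sketch combines these ingredients — a strided convolution for downsampling and a residual refinement branch. The module name, channel sizes, and layer counts are our own assumptions, not part of any specific YOLO backbone.

```python
import torch
import torch.nn as nn

class ResidualDownBlock(nn.Module):
    """Toy backbone stage: strided conv for downsampling + residual refinement."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Strided convolution halves the spatial resolution (downsampling layer).
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )
        # Residual branch refines features at the new resolution.
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(),
        )

    def forward(self, x):
        x = self.down(x)
        return x + self.conv(x)  # skip connection eases gradient flow

x = torch.randn(1, 3, 640, 640)
stage = ResidualDownBlock(3, 32)
print(stage(x).shape)  # torch.Size([1, 32, 320, 320])
```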
Common Architectures. (1) Darknet is a lightweight neural network architecture specifically designed for YOLO models. It consists of a series of convolutional layers followed by downsampling operations, with skip connections between selected layers to aid in feature propagation. (2) ResNet is a popular backbone architecture, known for its effectiveness in training very deep networks. It introduces residual connections that enable the network to learn residual mappings, facilitating the training of deeper architectures without suffering degradation in performance. (3) EfficientNet is a family of models that achieve state-of-the-art performance while maintaining computational efficiency. It leverages compound scaling to simultaneously scale the depth, width, and resolution of the network, resulting in models that are highly efficient and effective for various tasks, including object detection.
Our experimental goal was to improve and optimize the object detection algorithm rather than merely to use the latest version. We therefore chose YOLOv8 as the starting model, prioritizing its performance and potential for structural optimization over the novelty of the version; it remains a practical and efficient choice, particularly since there is no pressing need for an upgrade at this stage.
In summary, the backbone network in YOLO plays a critical role in feature extraction, capturing hierarchical representations of the input images that are subsequently used for object detection. By leveraging advanced architectures and techniques, the backbone network enhances the model’s ability to detect objects accurately and efficiently.
2.1.2. Head Network in YOLO
The detection head in YOLO is a crucial component, responsible for generating predictions based on the features extracted from the backbone network. The system utilizes feature maps derived from various tiers of the feature pyramid as input, executing object detection through the prediction of bounding boxes, class probabilities, and object scores.
Key Components. (1) The detection head typically consists of a series of convolutional layers that operate on the feature maps obtained from the backbone network. These layers apply learned filters to extract high-level features relevant to object detection tasks. (2) Pooling layers are often included in the detection head to reduce the spatial dimensions of the feature maps while preserving important features. This downsampling process helps in capturing semantic information and improving the computational efficiency. (3) At the end of the detection head, there are output prediction layers, responsible for predicting the bounding boxes, class probabilities, and object scores. These predictions are made for a predefined set of anchor boxes spanning various aspect ratios and scales. (4) The output prediction layers predict offsets for each anchor box to adjust its position and size relative to the corresponding grid cell. These offsets, along with the coordinates of the anchor box, are used to compute the final bounding box predictions. (5) In addition to bounding box regression, the detection head predicts the probability distribution over different object classes for each bounding box. This is usually accomplished using Softmax activation on the class scores. (6) YOLO predicts an object score for each anchor box, indicating the likelihood of the box containing an object of interest. This score helps in filtering out irrelevant detections and improving the precision of the model.
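As a hedged illustration of components (3)–(6), the toy prediction layer below outputs per-anchor box offsets, an objectness score, and class probabilities. The class and anchor counts and the module name are arbitrary assumptions for the sketch (note that YOLOv8 itself is anchor-free; this is only the generic anchor-based form described above).

```python
import torch
import torch.nn as nn

class ToyDetectionHead(nn.Module):
    """Toy anchor-based prediction layer: per-anchor box offsets,
    objectness score, and class probabilities (illustrative only)."""
    def __init__(self, in_ch, num_anchors=3, num_classes=80):
        super().__init__()
        self.na, self.nc = num_anchors, num_classes
        # One 1x1 conv predicts (4 box offsets + 1 objectness + C classes) per anchor.
        self.pred = nn.Conv2d(in_ch, num_anchors * (5 + num_classes), 1)

    def forward(self, feat):
        bs, _, h, w = feat.shape
        p = self.pred(feat).view(bs, self.na, 5 + self.nc, h, w)
        box_offsets = p[:, :, 0:4]                 # dx, dy, dw, dh relative to the anchor/grid cell
        objectness = p[:, :, 4:5].sigmoid()        # likelihood that the anchor contains an object
        class_probs = p[:, :, 5:].softmax(dim=2)   # class distribution per anchor
        return box_offsets, objectness, class_probs

head = ToyDetectionHead(256)
offsets, obj, cls = head(torch.randn(1, 256, 20, 20))
print(offsets.shape, obj.shape, cls.shape)
```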
2.2. YOLO-HDCS Network
This paper introduces a novel head detection algorithm, YOLO-HDCS, based on the YOLO framework. We propose a novel backbone network, designated Backbone-CENet. The network is principally based on two key modules, CESA and Sim-CBS, both of which are proposed in this paper, as illustrated in Figure 3. When facing complex and crowded scenes, optimizing the detection model becomes particularly important, as these scenes often require models with higher accuracy and robustness. We have developed innovative strategies with distinct benefits, significantly improving the detection model’s performance and effectiveness through several technical approaches.
The first improvement is the CESA module. In particular, we conducted in-depth research and practice on the challenging issue of small object detection in complex and crowded scenes. We innovatively fuse convolutional kernels of various sizes, a strategy aimed at fully utilizing the differences in the ability of convolutional kernels of different sizes to capture features. Small convolutional kernels excel in capturing fine details, which is crucial for the detection of small objects, while larger convolutional kernels provide a broader contextual view, improving the overall detection accuracy. Our design draws inspiration from CAM and CSPStage, but with a key difference: the features obtained from the convolution of different kernel sizes cannot be directly used without further processing. By effectively integrating the strengths of both small and large convolutional kernels, our system significantly enhances small object detection, even in dense or complex environments. To fully realize this capability, we incorporate an adaptive processing stage to refine and optimize the features for accurate identification and localization.
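The exact CESA layout is given later (Figure 4); the rough sketch below only illustrates the core idea stated here — running small and large convolution kernels in parallel and refining the concatenated result before use. The module and parameter names are ours, not the paper’s.

```python
import torch
import torch.nn as nn

class MultiKernelFusion(nn.Module):
    """Toy multi-kernel fusion: parallel 3x3 and 5x5 branches whose outputs
    are concatenated and refined before being used further."""
    def __init__(self, channels):
        super().__init__()
        self.small = nn.Conv2d(channels, channels, 3, padding=1)  # fine details (small objects)
        self.large = nn.Conv2d(channels, channels, 5, padding=2)  # broader contextual view
        # Refinement step: the raw branch outputs are not used directly.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, x):
        return self.fuse(torch.cat([self.small(x), self.large(x)], dim=1))

y = MultiKernelFusion(64)(torch.randn(1, 64, 80, 80))
print(y.shape)  # torch.Size([1, 64, 80, 80])
```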
The second improvement is the Sim-CBS module. In this paper, we draw inspiration from SimAM and standard convolution to design a lightweight, parameter-free convolution module for CNNs. The objective is to enhance the performance of the model in image processing tasks without introducing an additional computational burden or model complexity. Specifically, we introduce Sim-CBS to replace the standard convolutional module, CBS. This new module, Sim-CBS, consists of five components: standard convolution, feature extraction, local self-similarity computation, attention weight generation, and feature map weighting.
The third improvement concerns NMS, the method used to filter out duplicate bounding boxes. The output of the detection network typically includes multiple overlapping boxes for the same object, and NMS is applied to eliminate redundant boxes, keeping only the most relevant one. In traditional NMS, the box with the highest score is kept, while lower-scoring overlapping boxes are suppressed. The SIoU-based soft suppression adopted here works differently: instead of removing overlapping boxes, it adjusts the scores of the other boxes based on their overlap with the highest-scoring box. When two boxes overlap strongly, the score of the second box is lowered, making it "weaker", rather than removing it entirely. In summary, this approach improves upon traditional NMS by modifying the scores of overlapping boxes instead of directly suppressing them, which allows more relevant boxes to be retained.
2.2.1. CESA
CESA is a feature fusion module for small object detection; an illustration is provided in Figure 4. The integration of multi-scale convolutional features, from the highest to the lowest levels, is embedded within the feature pyramid network to enrich the contextual information. Channel and spatial feature refinement mechanisms are integrated to suppress conflicts during multi-scale feature fusion, thus protecting small objects from being swamped by contradictory information. To further improve training, a data augmentation method called copy–reduce–paste is introduced, which increases the contribution of tiny objects to the loss during training and thereby ensures more balanced training.
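The implementation details of copy–reduce–paste are not given here, so the following NumPy sketch is only one plausible interpretation: labeled objects are copied, shrunk, and pasted back into the image so that small instances (and their boxes) appear more often during training. The function name, scale factor, and (x1, y1, x2, y2) box format are our assumptions.

```python
import random
import numpy as np

def copy_reduce_paste(image, boxes, scale=0.5, max_objects=2):
    """Toy copy-reduce-paste augmentation (illustrative interpretation only).
    `image` is an HxWx3 uint8 array; `boxes` is a list of (x1, y1, x2, y2)."""
    h, w = image.shape[:2]
    out, new_boxes = image.copy(), list(boxes)
    for x1, y1, x2, y2 in random.sample(boxes, min(max_objects, len(boxes))):
        patch = image[y1:y2, x1:x2]
        ph = max(1, int(patch.shape[0] * scale))
        pw = max(1, int(patch.shape[1] * scale))
        # Nearest-neighbour shrink keeps the sketch dependency-free.
        ys = (np.arange(ph) / scale).astype(int).clip(0, patch.shape[0] - 1)
        xs = (np.arange(pw) / scale).astype(int).clip(0, patch.shape[1] - 1)
        small = patch[ys][:, xs]
        px, py = random.randint(0, w - pw), random.randint(0, h - ph)
        out[py:py + ph, px:px + pw] = small       # paste the reduced copy
        new_boxes.append((px, py, px + pw, py + ph))  # add its label
    return out, new_boxes
```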
The newly developed CESA feature network is designed with three key objectives:
First, it aims to significantly improve the efficiency with which features are utilized throughout the network;
Second, it enhances the network’s ability to capture fine details in the input data, enabling the more accurate detection and recognition of subtle patterns;
Third, it accelerates the model’s convergence during training, allowing for faster learning and the optimization of the network parameters.
To strengthen the multi-scale detection capabilities, the fusion of features across different scales is essential. In previous detection networks, features at different scales often showed significant depth discrepancies. For example, high-resolution features used to detect small objects tend to have relatively shallow depths, which can negatively impact the small object detection performance. Directly fusing features of varying depths typically leads to unsatisfactory results, so it is important to retain more original information during the initial processing. To address this, dilated convolutions with varying dilation rates are used to gather contextual information at different receptive field sizes. Specifically, 3 × 3 convolution kernels with dilation rates of 1, 3, and 5 are applied to the input, providing richer contextual information within the FPN.
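A minimal sketch of this step follows, using the stated 3 × 3 kernels with dilation rates 1, 3, and 5; fusing the branches by concatenation and a 1 × 1 convolution is our assumption rather than the paper’s exact design.

```python
import torch
import torch.nn as nn

class DilatedContext(nn.Module):
    """Toy context block: parallel 3x3 convolutions with dilation rates 1, 3, 5
    gather context at different receptive-field sizes; a 1x1 conv fuses them."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in (1, 3, 5)
        )
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

y = DilatedContext(128)(torch.randn(1, 128, 40, 40))
print(y.shape)  # torch.Size([1, 128, 40, 40])
```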
The CESA module includes the following steps.
In CESA, a global feature pyramid is introduced with the objective of extracting richer semantic information through multi-scale feature fusion. The module is designed for computer vision tasks and aims to enhance the deep learning model’s ability to understand objects or scenes in an image by incorporating contextual information. The following section outlines the operational procedure of CESA. The input feature maps are received from the preceding layer of the convolutional or feature extraction network. The feature maps in question contain representations of features extracted from the original image at varying degrees of abstraction.
Context awareness: At the core of this lies the context-aware mechanism, which enhances the representation of the feature maps by incorporating additional contextual information. Such additional information may be global contextual features, local region-related information, or other forms of contextual information.
Feature Fusion: In the CAM module, the contextual information is integrated with the original feature map, typically using convolutional operations or attention mechanisms so that information from disparate sources can be combined efficiently. The objective of feature fusion is to augment the expressive capacity of the feature map, enabling the model to better comprehend the semantic and structural nuances present in the image. The fused feature maps are then processed further to enhance the representation of the object or scene of interest; this may entail increasing the resolution of the feature map, enhancing the semantic relevance of the features, or strengthening the responses of specific regions. Subsequently, the enhanced feature map is generated as the output.
The CESA module receives input features from the preceding layer, typically the output of the backbone network. These features comprise a variety of semantic and spatial information about the image. Cross-stage connectivity is a technique that enables the transfer of information between layers in a neural network: in this phase, the CESA module divides the input features into two sections, one of which passes through one or more convolutional layers and becomes the backbone path, while the other bypasses the backbone path and serves as the skip path. Processing of the backbone path: the convolutional layers on the backbone path extract high-level semantic information from the input features; they typically possess a substantial receptive field, enabling the capture of global information and abstract features within the image. Branch processing: features on the skip path are passed on directly, bypassing the backbone path, and thus retain low-level spatial information about the input features, which helps to preserve the details and positional information of the object. Feature fusion: the features of the backbone and skip paths are, to some extent, complementary, because the former emphasize semantic information while the latter retain spatial information; CESA combines the two through a feature fusion operation to obtain a richer and more balanced representation. Output features: after fusion, the features generated by CESA are passed to the next network stage for further processing, carrying better semantic and spatial information for subsequent object detection.
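As a toy sketch of the cross-stage split, backbone-path processing, and fusion just described (the split ratio, layer counts, and names are our assumptions, not the exact CESA design):

```python
import torch
import torch.nn as nn

class CrossStageBlock(nn.Module):
    """Toy cross-stage block: the input is split into a backbone path (convolutions,
    high-level semantics) and a skip path (unchanged, low-level spatial detail),
    and the two are fused by concatenation followed by a 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.backbone_path = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1), nn.SiLU(),
            nn.Conv2d(half, half, 3, padding=1), nn.SiLU(),
        )
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        a, b = x.chunk(2, dim=1)  # split channels into two sections
        # Skip path b is passed through unchanged; backbone path a is processed.
        return self.fuse(torch.cat([self.backbone_path(a), b], dim=1))

y = CrossStageBlock(64)(torch.randn(1, 64, 40, 40))
print(y.shape)  # torch.Size([1, 64, 40, 40])
```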
The REP structure usually consists of the following steps. Feature fusion: the REP structure first receives features from the backbone and branch paths in CESA. These features have already been combined by the cross-stage connection and feature fusion steps in CESA, and the task of the REP structure is to further enhance the feature representation. The REP structure uses residual connections to sum the features of the backbone and branch paths; this residual connection helps to retain more feature information and to prevent information loss and gradient vanishing. Next, the REP structure performs an enhancement operation on the summed features, which can take various forms, such as convolution or attention mechanisms, in order to further improve the discriminability and representational capability of the features. Finally, the REP structure typically applies pooling operations, such as average pooling or max pooling, to the enhanced features to reduce their dimensions and computational demands while improving their stability and invariance.
CESA was introduced into YOLOv8 and compared with the C2f structure, and, as expected, it achieves higher accuracy. Because it can fully exchange high-level semantic information and low-level spatial information, CESA is a powerful component that effectively improves detection accuracy and performance by combining global and local features.
The adaptive method refers to an adaptive feature fusion strategy. In computer vision and deep learning, such strategies aim to enhance model performance by dynamically selecting and integrating feature information from different levels or scales, thereby adapting to different data and task requirements. This approach significantly improves model performance, optimizes the efficiency of feature utilization, and accelerates experimental iteration. The depth and complexity of a network have a direct impact on the model’s performance: enhanced fitting capability often leads to faster convergence of the loss function, as the network can extract information from more layers, thereby accelerating the optimization process. Specifically, given an input of size (bs, C, H, W), spatial adaptive weights of size (bs, 3, H, W) are obtained through convolution, concatenation, and Softmax operations. The three weight channels correspond one-to-one with the three inputs, and contextual information is aggregated into the output by computing the weighted sum, although this method does require more features to be processed. The advantages and characteristics of the adaptive method are as follows: (1) by fusing multi-layer features, the model obtains more comprehensive and richer feature information, thereby improving the accuracy of predictions; (2) different feature layers contain information at different scales and semantic levels, so adaptive feature fusion gives the model stronger robustness to objects of different scales and shapes; (3) it allows the model to adaptively select useful information from the feature layers based on the actual situation of the region of interest, enhancing the flexibility and generalization ability of the model.
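A minimal sketch of this adaptive fusion step follows, using the shapes stated above ((bs, C, H, W) inputs, (bs, 3, H, W) weights); producing the weights with a single 1 × 1 convolution is our assumption.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Toy adaptive fusion of three same-sized feature maps: a 1x1 convolution on
    their concatenation produces (bs, 3, H, W) spatial weights, Softmax-normalised
    across the three inputs, and the output is the per-pixel weighted sum."""
    def __init__(self, channels):
        super().__init__()
        self.weight_conv = nn.Conv2d(3 * channels, 3, 1)

    def forward(self, f1, f2, f3):
        w = torch.softmax(self.weight_conv(torch.cat([f1, f2, f3], dim=1)), dim=1)
        # Each weight channel corresponds one-to-one with one input feature map.
        return w[:, 0:1] * f1 + w[:, 1:2] * f2 + w[:, 2:3] * f3

f = [torch.randn(2, 64, 40, 40) for _ in range(3)]
print(AdaptiveFusion(64)(*f).shape)  # torch.Size([2, 64, 40, 40])
```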
Below are the three functions of the CESA module.
Introducing contextual information can enhance the model’s understanding of the relationships between the targets and backgrounds in images, thereby improving the accuracy of object detection. Context enhancement techniques improve the target localization and classification accuracy by capturing both global and local information. The aim of this technology is to improve the model’s performance in processing complex scenes, allowing it to analyze targets or events in images more accurately. By incorporating global information, the model can better understand the semantics and context of the image, reducing false positives and false negatives. Analyzing the enhanced contextual information, the model can precisely locate the target position, as the contextual information provides the spatial relationship between the target and surrounding objects. By introducing diverse contextual information during training, the model can learn a wider range of visual features and semantic concepts to deal with various scenes. The generated heatmaps show the distribution of the model’s attention, enhancing the model’s visualization effect and credibility. Context enhancement takes into account multiple factors for a more comprehensive understanding of the image content and more accurate predictions.
The adaptive receptive field allows neural networks to automatically adjust the size of the receptive field based on the scale and position of the target, enhancing the ability to process targets of different scales and improving the generalization and adaptability. The receptive field is the range of responses of neurons to the input image or feature map. When the target is large, the receptive field expands to capture the overall features; when the target is small, the receptive field contracts to preserve the details. Through dynamic adjustment, the module takes into account the position of the target and the background, improving the detection accuracy. During training, the module learns to adapt to inputs of different scales and complexities, reducing the dependence on manual parameters and enhancing its universality and flexibility. Compared to fixed receptive fields, adaptive receptive fields adjust their size as needed, effectively utilizing the computational resources and reducing the computational complexity and storage requirements. This feature is compatible with modern deep learning frameworks, supporting end-to-end training and deployment and simplifying the model design and optimization process.
End-to-end training can simplify the model construction process and optimize the capture and utilization of contextual information. When combined with modern object detection frameworks, end-to-end training optimizes the performance while ensuring efficient and stable deployment. End-to-end training is a comprehensive training method from input to output, integrating all components of the system into a single model and using backpropagation algorithms for global optimization. This method eliminates the interface challenges of traditional multi-stage processing, such as feature extraction and candidate box generation in object detection, directly optimizing the overall performance through a single model. It avoids information loss in staged optimization, enhances the synergy of components through global loss function optimization, reduces manual intervention, and enhances the model’s automation and versatility. Through global information transfer and feedback, it reduces overfitting and improves the generalization ability. In addition, end-to-end training simplifies model design and increases the directness and efficiency of system implementation.
2.2.2. Sim-CBS
Despite various efforts to optimize CNNs, such as reducing the convolutional kernel sizes, even the regular 1 × 1 convolutional module still requires a significant amount of computation as the network becomes deeper and more complex. As is well known, an attention mechanism can focus on the most relevant detection areas, thereby performing dense detection in these areas and improving the object detection capability of the network. However, in complex scenes, we not only need to focus on the detection areas but also need to minimize the computational complexity of the network as much as possible. Therefore, we incorporate a lightweight and simple attention mechanism. In this paper, we use a lightweight, parameter-free attention mechanism specifically designed for CNNs, with the aim of enhancing the performance of the model in image processing tasks without adding computational burden or model complexity. As shown in Figure 5, the proposed module replaces the standard convolutional block (CBS). It comprises five components: standard convolution, feature extraction, local self-similarity computation, attention weight generation, and feature map weighting.
SimAM aims to fully explore the self-similarity information inside the feature map and then dynamically adjust the weights of different regions in the feature map, thereby enhancing the model’s ability to capture key information in the image. This mechanism not only enhances the model’s performance in visual tasks but also significantly optimizes the model’s computational efficiency and resource consumption, providing a new perspective for the application of deep learning in the field of image processing. Traditional attention modules require the introduction of learnable parameters to model feature dependencies, which increases the complexity and may lead to overfitting. SimAM, on the other hand, generates attention weights using feature map information, without the need for additional parameters, simplifying the model structure and reducing the computational burden. The attention weights dynamically change, adaptively adjusting to different image content, which improves the model’s generalization ability. SimAM enhances the feature expression of highly self-similar regions, suppresses redundant noise, focuses on key information, and improves the accuracy of the model in tasks such as image classification, object detection, and segmentation. SimAM, as an attention mechanism based on self-similarity perception, represents a significant advancement in the application of deep learning in the field of image processing. In addition to streamlining the model structure and reducing the computational cost, the dynamic weight adjustment mechanism enables SimAM to focus on key information, thereby enhancing the overall performance.
The operational methodology of Sim-CBS can be delineated as follows. First, the image features are processed by standard convolution, and then the CNN extracts the feature maps of the input image. These feature maps contain high-level semantic information of the image. For each pixel in the feature map, SimAM calculates its similarity with the neighboring pixels. This step is based on the principle of the local self-similarity of images, which posits that neighboring pixels usually have strong similarity. Based on the calculation of the local self-similarity, SimAM generates an attention weight for each pixel. This weight reflects the importance of the pixel in the feature map. Finally, the generated attention weights are multiplied with the original feature maps to obtain weighted feature maps. These weighted feature maps highlight the key areas in the image more prominently, which helps to improve the performance of subsequent tasks.
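A minimal sketch of Sim-CBS is given below, under the assumption that the parameter-free weighting follows the standard published SimAM energy formulation applied after a Conv–BN–SiLU (CBS) block; the exact wiring used in this paper is shown in Figure 5, and the names below are ours.

```python
import torch
import torch.nn as nn

def simam(x, e_lambda=1e-4):
    """Parameter-free SimAM-style weighting (standard published formulation):
    per-pixel importance derived from the deviation from the channel mean
    produces attention weights that re-weight the feature map."""
    n = x.shape[2] * x.shape[3] - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)  # squared deviation from channel mean
    v = d.sum(dim=(2, 3), keepdim=True) / n            # channel variance estimate
    e_inv = d / (4 * (v + e_lambda)) + 0.5             # inverse energy (importance)
    return x * torch.sigmoid(e_inv)                    # weighted feature map

class SimCBS(nn.Module):
    """Toy Sim-CBS block: standard Conv-BN-SiLU (CBS) followed by the
    parameter-free attention weighting above."""
    def __init__(self, in_ch, out_ch, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return simam(self.act(self.bn(self.conv(x))))

print(SimCBS(64, 128)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```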
The attention mechanism is parameter-free: Sim-CBS introduces no additional learnable parameters beyond the standard convolution, thus avoiding any increase in model complexity or computational cost. It is also computationally efficient: because the attention step requires no extra convolutions or fully connected operations, Sim-CBS is suitable for embedding into various CNN models. Finally, it provides significant performance improvements: experimental results demonstrate that Sim-CBS can markedly enhance the performance of CNNs in tasks such as image classification and object detection.
2.2.3. Loss Function and Non-Maximum Suppression Algorithm
The loss function for object detection consists of three parts: the predicted class loss $Loss_{class}$, the confidence loss $Loss_{conf}$, and the bounding box loss $Loss_{bbox}$ for each predicted bounding box. Therefore, when constructing the loss function, it is necessary to evaluate these components; calculate the class error, confidence error, and bounding box error of the prediction results separately; and obtain the overall loss function through weighting. The bounding box loss for each predicted box is defined by an IoU-based metric, as discussed below.
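The displayed equation is not reproduced in this extract; as a sketch of the weighted-sum form described above (the coefficients $\lambda_i$ are our notation, not necessarily the authors’):

```latex
\mathrm{Loss} \;=\; \lambda_{1}\,\mathrm{Loss}_{class} \;+\; \lambda_{2}\,\mathrm{Loss}_{conf} \;+\; \lambda_{3}\,\mathrm{Loss}_{bbox}
```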
The bounding box loss, including calculation metrics like the GIoU, CIoU, DIoU, and EIoU, evaluates how well the predicted bounding box matches the ground truth. These loss functions typically assess factors such as the distance between the center points, the overlap area, or the aspect ratio between the predicted and ground truth boxes. However, none of them consider the direction between the ground truth bounding box and the predicted bounding box. This deficiency can lead to slower convergence speeds and lower efficiency of the model. In view of this, the SIoU loss function improves upon the traditional loss functions by considering the vector angle between the ground truth bounding box and the predicted bounding box during regression and redefining the penalty indicators. By adding an angle constraint to the predicted bounding box, the model can better align the predicted box with the ground truth, ensuring that it is either vertically or horizontally oriented, thus reducing the wobbling effect. Thus, during the training process of the model, it accelerates the convergence of the model and ultimately helps the model to train to achieve better results. Soft-NMS smooths the scores of overlapping boxes, preventing the issue of potential mis-suppression that can occur with traditional hard-NMS methods. It helps the model to retain more appropriate bounding boxes, reduces false positives, and optimizes box selection, thereby accelerating the convergence of the loss function and improving both the training efficiency and accuracy.
Non-maximum suppression is an essential post-processing step in object detection algorithms, the purpose of which is to remove duplicate bounding boxes, i.e., to reduce false detections. In practical detection, the types of targets in complex scenes are diverse and the background is very cluttered. Moreover, pedestrians’ heads are relatively small and, when many people are present, may partially overlap, so the predicted boxes become even more complex and a good method is needed to eliminate this effect. If traditional NMS is used to process the bounding boxes, the scores of boxes whose overlap with the selected box exceeds the threshold are set to zero, which can easily lead to the incorrect suppression of occluded objects and small objects, reducing the performance of the model. Therefore, we need to take the overlapping area into account. The confidence threshold may be set to 0.6, with prediction boxes exhibiting a probability greater than 0.6 retained. Subsequently, Soft-NMS is performed to retain the prediction box that is most closely aligned with the ground truth bounding box, as illustrated in Figure 6. The Soft-NMS approach considers both the degree of overlap and the score when performing NMS in order to achieve an optimal outcome. When two objects of the same category exhibit substantial overlap, even if both prediction boxes possess high scores, only one can be retained under traditional NMS because of the considerable overlap between the two prediction boxes, as illustrated in Figure 7.
In contrast to the conventional approach of setting the original score to zero, the soft-NMS algorithm replaces it with a slightly lower score. Furthermore, soft-NMS can be readily integrated into the YOLO algorithm without necessitating the retraining of the original model. Accordingly, this paper employs the soft-NMS algorithm for non-maximal suppression.
The conventional non-maximal suppression algorithm initially generates a sequence of detection frames, designated as B, along with their respective scores, denoted as S. The detection frame M, exhibiting the highest score, is then removed from set B and incorporated into the final detection result set D. Concurrently, any detection frame within set B that exhibits an overlap with detection frame M exceeding the predefined overlap threshold Nt is also removed. The primary issue with the NMS algorithm is that it necessitates the assignment of a score of zero to all neighboring detection frames. In such instances, the presence of a genuine object within the overlapping region will result in the failure of its detection, thereby reducing the algorithm’s average precision (AP).
An alternative approach would be to reduce the scores of neighboring detection frames based on a correlation with the degree of overlap of M, rather than rejecting them completely. Despite the reduction in the score, the neighboring detection frames remain within the sequence of object detection.
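As a minimal NumPy sketch of this score-decay idea (the Gaussian decay with σ = 0.5 is a common default for Soft-NMS, not necessarily the setting used in the paper; a linear decay based on an IoU threshold would work similarly):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: instead of zeroing the neighbours of the current best box,
    their scores are decayed according to the degree of overlap."""
    scores = scores.copy()
    keep = []
    idx = np.arange(len(scores))
    while len(idx) > 0:
        best = idx[np.argmax(scores[idx])]
        keep.append(int(best))
        idx = idx[idx != best]
        if len(idx) == 0:
            break
        ov = iou(boxes[best], boxes[idx])
        scores[idx] *= np.exp(-(ov ** 2) / sigma)   # decay instead of suppress
        idx = idx[scores[idx] > score_thresh]       # drop boxes whose score fell too low
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(soft_nms(boxes, scores))  # [0, 2, 1] -- the overlapping box is kept, with a lowered score
```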