1. Introduction
Unmanned Aerial Vehicle (UAV) intelligent systems have widely adopted target detection algorithms, such as pedestrian detection and vehicle detection [1]. Currently, deep learning-based object detection algorithms are mainly divided into two-stage algorithms and one-stage algorithms. Two-stage networks (such as R-CNN [2], Fast R-CNN [3], and Faster R-CNN [4]) first generate candidate regions, then classify and regress these candidate regions. One-stage networks, represented by the YOLO [5] series and SSD [6], use only convolutional neural networks to extract features and directly classify and regress objects. Compared to two-stage algorithms, one-stage algorithms have a simpler structure, resulting in lower computational requirements and higher real-time performance, making them more widely used in UAV intelligent systems. While these applications have achieved desirable results for large object detection, small object detection, where object resolution is less than 32 × 32 pixels, remains problematic.
The main challenges in small object detection arise from the small scale and low resolution of the objects, insufficient contextual information, and object clustering, which together make small object detection extremely difficult and lead to significant false positives and missed detections. The primary reasons for poor performance in small object detection are as follows: (1) High localization accuracy is required: because small objects cover a small area in the image, their bounding boxes are harder to localize than those of large or medium-sized objects. (2) Small objects yield limited feature representations: after multiple downsampling stages, their features become weaker and may even be lost in the background. (3) Small objects have a higher probability of clustering, leading to missed detections. To address these issues, existing small object detection methods improve on general deep learning detection methods to enhance detection performance.
Based on the aforementioned issues, this paper proposes an improved small object detection algorithm, HSP-YOLOv8, to enhance small object detection performance. The algorithm adds an extra tiny prediction head and a Space-to-Depth Convolution (SPD-Conv) module [7] to YOLOv8 and employs a post-processing algorithm better suited to small object recognition, soft non-maximum suppression (Soft-NMS) [8]. Experiments on the VisDrone2019 dataset, which contains a large proportion of small objects, demonstrate that HSP-YOLOv8 improves on YOLOv8s by 11% in mAP0.5 and 9.8% in mAP0.5:0.95.
The main contributions of this paper can be summarized as follows:
We designed an additional tiny prediction head to address the problem of low localization accuracy of small objects in low-resolution feature maps. Higher-resolution feature maps facilitate better feature extraction for small objects, thereby enhancing the detection capability for these targets.
We implemented the SPD-Conv module in the backbone to solve the problem of feature reduction and loss for small objects after multiple downsampling stages. This allows the preservation of fine-grained feature information for small objects, thereby improving the network’s learning and representation capabilities, and enhancing the accuracy of small object recognition.
We replaced the original non-maximum suppression (NMS) algorithm in YOLOv8 with Soft-NMS, which mitigates the issue of missed detections caused by the clustering of small objects. This modification enables the effective detection of densely overlapping objects, thereby improving the detection accuracy of small objects.
The remainder of this paper is organized as follows: Section 2 reviews related work, beginning with an overview of research progress in the YOLO series of detection algorithms, followed by a discussion of previous efforts in small object detection. Section 3 details the specific improvements in the HSP-YOLOv8 model, starting with its overall architecture and then elaborating on the three enhanced modules individually. Section 4 presents the experimental work, including a description of the datasets and relevant parameters, followed by comparative experiments, ablation experiments, and visualization experiments conducted on the VisDrone2019 dataset. Finally, Section 5 concludes the paper with a summary of the research findings and a discussion of future work directions.
3. Methods
For small object recognition tasks, since small objects occupy very few pixels in the image and tend to cluster, YOLOv8's ability to extract features of small objects is insufficient, and its localization accuracy is not adequate. This leads to low accuracy in detecting small objects, making it unsuitable for practical applications. To address the aforementioned issues, this paper designs the HSP-YOLOv8 model based on the YOLOv8 model. The structure of the model is shown in Figure 3. The following three improvements are proposed: (1) An additional tiny prediction head is designed to enhance the model's detection capability for small objects. (2) The SPD-Conv module is used in the downsampling stage of the backbone to effectively eliminate information loss. (3) The Soft-NMS algorithm is adopted to effectively alleviate the issue of missed detections caused by the clustering of small objects.
3.1. Tiny Prediction Head
The three prediction heads in YOLOv8 operate at different resolutions (80 × 80, 40 × 40, 20 × 20), which contributes significantly to detection capability across application scenarios but poses difficulties for small object detection. YOLOv8 performs poorly on small objects because the features of objects occupying very few pixels are easily overlooked. To solve this problem, as shown in Figure 4, this paper adds an additional tiny prediction head (Tiny-Head), raising the resolution of the detection feature map to 160 × 160 so that tiny objects larger than 4 × 4 pixels can be detected. The higher-resolution feature map preserves more features of small objects, thereby improving the model's performance in small object detection.
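To make the scale relationship concrete, the short Python sketch below relates each prediction head's stride to its grid resolution and to the input region a single grid cell covers. It is our illustration, not code from the paper, and it assumes the 640 × 640 network input implied by the 80 × 80, 40 × 40, and 20 × 20 grids quoted above:

```python
# Illustrative arithmetic only: head stride vs. grid resolution,
# assuming a 640 x 640 network input (our assumption).
INPUT = 640
heads = {"Tiny-Head (added)": 4, "P3 head": 8, "P4 head": 16, "P5 head": 32}
for name, stride in heads.items():
    grid = INPUT // stride
    print(f"{name}: stride {stride} -> {grid} x {grid} feature map; "
          f"one cell spans {stride} x {stride} input pixels")
# Tiny-Head (added): stride 4 -> 160 x 160 feature map; one cell spans 4 x 4 input pixels
# P3 head: stride 8 -> 80 x 80 feature map; one cell spans 8 x 8 input pixels
# ...
```

The stride-4 grid is what allows the Tiny-Head to assign at least one cell to any object larger than 4 × 4 pixels.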
3.2. SPD-Conv Module
When the image resolution is high and the targets are medium to large, the accuracy of the YOLOv8 model is generally high. However, for small objects in UAV aerial images, the model's accuracy drops rapidly. One reason is that the Conv modules YOLOv8 uses for feature extraction are strided convolution modules, which lose fine-grained information and produce inefficient feature representations; in tasks with low image resolution or small detection objects, detection performance therefore degrades quickly. To address this, this paper introduces a new convolutional neural network building block, SPD-Conv, in the backbone. SPD-Conv combines a space-to-depth layer with a non-strided convolution layer. The space-to-depth layer downsamples the feature maps while retaining all information in the channel dimension, effectively eliminating information loss and ensuring that information about small objects is preserved during downsampling.
As shown in Figure 5, the SPD-Conv module processes the feature maps in two steps. First, the space-to-depth operation slices the input feature map of size S × S × C1 into four sub-maps of size S/2 × S/2 × C1 (taking every second pixel along each spatial dimension) and concatenates them along the channel dimension, yielding a feature map of size S/2 × S/2 × 4C1. Then, a non-strided (stride-1) convolution is applied to this feature map, so the spatial resolution is halved without discarding any of the original information.
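A minimal PyTorch sketch of this building block follows. It is our illustration of the mechanism just described; the module name, the 3 × 3 kernel, and the BatchNorm/SiLU choices are our assumptions rather than the exact configuration used in HSP-YOLOv8:

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth slicing followed by a non-strided convolution (sketch)."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # The stride-1 conv consumes the 4x channels produced by space-to-depth.
        self.conv = nn.Conv2d(4 * in_channels, out_channels,
                              kernel_size=3, stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Slice the S x S map into four S/2 x S/2 sub-maps and concatenate
        # them on the channel axis: (B, C1, S, S) -> (B, 4*C1, S/2, S/2).
        # Every input pixel is kept, so the downsampling itself is lossless.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.act(self.bn(self.conv(x)))
```

For example, `SPDConv(64, 128)` maps a (B, 64, 160, 160) tensor to (B, 128, 80, 80): the resolution is halved, but unlike a stride-2 convolution, no pixel is skipped before the features are mixed.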
3.3. Soft-NMS
When detecting objects, the YOLOv8 model typically generates multiple high-confidence bounding boxes around the actual target. NMS effectively deletes redundant bounding boxes, ensuring that only one bounding box is retained for each actual target. However, in real scenarios, small objects often cluster, and the overlap between small objects leads to many overlapping bounding boxes during detection by the YOLOv8 model. If the IoU (Intersection over Union) of the resulting bounding boxes exceeds the set threshold, the NMS algorithm may remove the bounding boxes with lower confidence, risking the failure to recognize overlapping small objects and thus reducing the detection performance for small objects. While increasing the IoU threshold can mitigate the issue of missed detections, it also increases the possibility of redundant detections. To address this phenomenon, Soft-NMS provides a solution by improving NMS without adding extra complexity.
The specific steps of Soft-NMS are as follows: (1) Classify all boxes and discard those belonging to the background class. (2) For each target category, sort the predicted boxes in descending order of classification confidence. (3) Within the given category, select and retain the predicted box with the highest confidence. (4) Calculate the IoU between this highest-confidence box and each remaining box, and use a weighting function to decay the confidence of the boxes that overlap with it; any box whose decayed confidence falls below a set threshold is deleted. (5) Repeat steps 3 and 4 until all boxes of the category have been processed. (6) Repeat steps 2 to 5 until all target categories have been processed. (7) Output the finally selected predicted boxes.
Soft-NMS reduces the confidence of adjacent boxes that intersect with the current highest-confidence predicted box instead of directly deleting these adjacent boxes. This method alleviates the issue of missed detections in dense scenes of small objects to some extent and proves effective in recognizing targets with a lot of overlap, ultimately improving the detection accuracy of small objects.
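The sketch below illustrates steps (3) and (4) for a single category, using the Gaussian weighting function exp(-IoU²/σ) proposed in the Soft-NMS paper. It is a minimal single-class illustration with our own function and parameter names, not the exact routine integrated into HSP-YOLOv8:

```python
import numpy as np

def soft_nms(boxes: np.ndarray, scores: np.ndarray,
             sigma: float = 0.5, score_thresh: float = 0.001) -> list:
    """Gaussian Soft-NMS for one class. boxes: (N, 4) as [x1, y1, x2, y2]."""
    scores = scores.astype(np.float64).copy()
    idxs = np.arange(len(scores))
    keep = []
    while len(idxs) > 0:
        # Step 3: retain the remaining box with the highest confidence.
        top = int(np.argmax(scores[idxs]))
        best = idxs[top]
        keep.append(best)
        idxs = np.delete(idxs, top)
        if len(idxs) == 0:
            break
        # Step 4: IoU between the retained box and every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[idxs, 0])
        y1 = np.maximum(boxes[best, 1], boxes[idxs, 1])
        x2 = np.minimum(boxes[best, 2], boxes[idxs, 2])
        y2 = np.minimum(boxes[best, 3], boxes[idxs, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = ((boxes[best, 2] - boxes[best, 0]) *
                     (boxes[best, 3] - boxes[best, 1]))
        areas = ((boxes[idxs, 2] - boxes[idxs, 0]) *
                 (boxes[idxs, 3] - boxes[idxs, 1]))
        iou = inter / (area_best + areas - inter)
        # Decay overlapping neighbours instead of deleting them outright...
        scores[idxs] *= np.exp(-(iou ** 2) / sigma)
        # ...and drop only those whose decayed confidence is now negligible.
        idxs = idxs[scores[idxs] >= score_thresh]
    return keep
```

Because overlapping neighbours are down-weighted rather than removed, two genuinely distinct small objects with heavily overlapping boxes can both survive, which is precisely the failure mode of hard NMS in clustered scenes.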
4. Experimental Results and Analysis
4.1. Dataset
The dataset used in this experiment is VisDrone2019, collected and prepared by the AISKYEYE team from Tianjin University, China. This dataset was captured using various drone platforms, including the DJI Phantom 4 Pro, across 14 different cities in China, over a period of four months from April to August 2018. VisDrone2019 is a drone aerial imagery dataset with image resolutions up to 2000 × 1500 pixels. The training set comprises 6471 images with a total of 343,205 annotations; each image contains an average of 53 instances, indicating a high object density in which most objects are very small (less than 32 × 32 pixels). The validation and test sets consist of 548 and 1610 images, respectively. The dataset includes 10 categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning tricycle, bus, and motor [28].
4.2. Experimental Equipment and Training Strategies
The object detection experiments in this section require a deep learning server with sufficient computational power and storage for training, testing, and evaluation. The models in this study are based on the PyTorch version of YOLOv8, so the server must be configured with the corresponding environment, including the operating system, GPU drivers, and deep learning frameworks.
Table 1 details the versions and configuration information of all required environments to ensure the reproducibility and accuracy of the experiments.
YOLOv8 offers fast detection speed and high accuracy, meeting the requirements for real-time detection and recognition of small objects. To satisfy different application needs, the YOLOv8 network scales to five model sizes: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. Balancing the speed and accuracy of small object detection, this paper selects YOLOv8s as the baseline network model. Key parameter settings used during model training are shown in Table 2.
4.3. Evaluation Indicators
To evaluate the detection performance of the improved model, this paper uses precision, recall, mAP0.5, and mAP0.5:0.95 as evaluation metrics. Detailed definitions are listed below.
Precision represents the proportion of true positive samples among all samples predicted as positive. It can be expressed as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall represents the proportion of true positive samples identified by the model relative to the actual number of positive samples. It can be expressed as:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.

Average precision (AP) is the area under the precision-recall curve of a single category, computed from the ranked detection results. It is calculated as follows:

$$AP = \int_{0}^{1} P(R)\,dR$$

Mean average precision (mAP) is the mean of AP across all categories and serves as a metric to evaluate the overall performance of a detection model across multiple categories. It takes into account both the recall and precision of the model at different thresholds, providing a comprehensive assessment of the model's performance in object detection tasks. It is calculated as follows:

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

where $i$ represents the class index and $N$ represents the number of categories in the training dataset. mAP0.5 refers to the mean average precision for all classes at an IoU threshold of 0.5, and mAP0.5:0.95 to the mean average precision averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
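As a concrete illustration of these definitions, the snippet below computes AP as the area under a precision-recall curve and averages per-class APs into mAP. All numbers are hypothetical, and the code is our sketch, not the evaluation pipeline or the results of the paper:

```python
import numpy as np

def ap_from_pr(recalls: np.ndarray, precisions: np.ndarray) -> float:
    # AP is the area under the precision-recall curve: the integral of P(R) dR.
    return float(np.trapz(precisions, recalls))

recalls = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])    # hypothetical P-R curve
precisions = np.array([1.0, 0.9, 0.8, 0.7, 0.5, 0.3])
ap = ap_from_pr(recalls, precisions)

# mAP averages AP over the N classes; mAP0.5:0.95 additionally averages
# over the ten IoU thresholds 0.50, 0.55, ..., 0.95.
ap_per_class = {"pedestrian": ap, "car": 0.55}         # hypothetical APs
map_50 = sum(ap_per_class.values()) / len(ap_per_class)
print(f"AP = {ap:.3f}, mAP0.5 = {map_50:.3f}")
```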
4.4. Comparison Experiment
To verify the improvement in model detection performance, three sets of comparative experiments were conducted on the VisDrone2019 dataset: the proposed model was compared with the YOLO series algorithms, with other recent models, and with the YOLOv8s model.
Table 3 shows the AP values, mAP0.5 values, number of parameters, and computational cost of the improved model and the YOLO series algorithms. YOLOv4 [29] combines the CSPDarknet53 backbone with PAN feature fusion, achieving a good balance between speed and accuracy. YOLOv5 uses the Focus structure to enlarge the receptive field and employs Mosaic data augmentation, resulting in better network robustness. YOLOv6 [30] adopts a new self-distillation strategy and simplifies the SPPF module of YOLOv5/v8. YOLOv7 uses a model re-parameterization strategy and proposes a planned re-parameterized model. Based on the experimental results, the improved model outperforms all models in the YOLO series. Notably, although YOLOv8l has significantly more parameters and higher computational complexity than the improved model, its mAP0.5 and mAP0.5:0.95 values are lower.
From Table 4, the proposed model has the best detection performance compared to other state-of-the-art models. CenterNet introduces an anchor-free detection mechanism to address the sample imbalance between small and large/medium-scale objects caused by anchor-based mechanisms. Faster R-CNN introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. CNXResNet [31] proposes a new backbone that uses the PP-YOLOE backbone architecture with the ConvNeXt V2 block as its core module and further introduces the idea of style translation into PP-YOLOE to improve the detection and recognition of small objects. MC-YOLOv5 [32] improves YOLOv5 in the feature extraction stage and introduces a new shallow network optimization strategy to reduce missed detections in dense small object scenarios. EDGS-YOLOv8, based on the YOLOv8n model, demonstrates superior performance in terms of model lightweighting. Although our model sacrifices some GFLOPs and model size compared to EDGS-YOLOv8, it achieves more than a 10% improvement in both mAP0.5 and mAP0.5:0.95. In recent years, many improvements to the YOLO series for small object detection have been proposed, but their detection results remain inferior to the proposed method.
Table 5 shows the AP values for each object class and the mAP0.5 values over all classes on the VisDrone2019 dataset for both the improved model and the YOLOv8s model. From the comparison results, the mAP0.5 value of the improved model increased by 11%. The AP values improved for every target class, with the most significant increase observed for the 'people' category, which improved by 17.8%. The AP values for several other categories, including pedestrian, bicycle, and motor, also improved by more than 10%.
In UAV aerial images, targets such as pedestrians, people, and motors are smaller and more densely distributed than other objects; the minimum resolution of a person target can be as low as 4 × 10 pixels, and such targets are prone to aggregation. The significant improvement on these categories indicates that HSP-YOLOv8 can effectively handle small object detection scenarios.
4.5. Ablation Experiment
To validate the effectiveness of the proposed improvements and to assess the impact of each added or modified module on overall model performance, two sets of ablation experiments were conducted on the baseline model YOLOv8s using the VisDrone2019 dataset. The first set focused on the ablation of the prediction heads, comparing various combinations of heads. The second set examined the ablation of the improvement methods, comparing the effects of the different enhancements.
To determine the most suitable detection head configuration, we defined the following models: Baseline Model 1 (YOLOv8s); Improved Model 2 (added 160 × 160 prediction head); Improved Model 3 (added 160 × 160 prediction head and removed the 20 × 20 prediction head); Improved Model 4 (added 320 × 320 prediction head); and Improved Model 5 (added 320 × 320 prediction head and removed the 20 × 20 prediction head). The experimental results, shown in Table 6, indicate that adding a 160 × 160 prediction head yields the highest detection accuracy. In Model 3, where the 20 × 20 prediction head was removed in addition to adding the 160 × 160 head, the GFLOPs improved, but precision, recall, mAP0.5, and mAP0.5:0.95 were all lower than for Model 2. When a 320 × 320 detection head was added, the performance metrics were inferior to those of Model 2 regardless of whether the 20 × 20 head was retained. Based on these considerations, this paper adds only a 160 × 160 prediction head. While this choice sacrifices some GFLOPs and inference time, it preserves the model's ability to detect large objects while maximizing precision for small object detection.
From the experimental results in Table 7, adding the tiny prediction head increases the mAP0.5 value by 5.5% and the recall by 4.4%. This indicates that a larger-scale detection head retains more abundant small object feature information on small object datasets, improves the matching between targets and prior boxes, and helps the loss function converge better. Replacing the original Conv modules in the backbone with the SPD-Conv module increases the mAP0.5 value by 2.3%, with improvements in both precision and recall, suggesting that this change better preserves the features of small objects and reduces the probability of missing them. With the introduction of Soft-NMS, the mAP0.5 value increases by 6.1% and precision improves by 8.3%, indicating that the model effectively overcomes missed detections when objects overlap, thus improving detection accuracy.
In summary, adding a tiny prediction head effectively improves the localization and recognition ability for small objects. The SPD-Conv module reduces the loss of target information during feature extraction. Introducing Soft-NMS alleviates the problem of missed detections in densely clustered small object scenarios, enhancing the detection performance of aerial imagery. HSP-YOLOv8 significantly outperforms YOLOv8 in detection accuracy.
4.6. Visualization Analysis
To visually demonstrate the detection effect of the improved algorithm, visual experiments were conducted using YOLOv5s, YOLOv8s, and HSP-YOLOv8, as shown in Figure 6. Four representative scenes were selected as experimental data: low-light environments, public facilities, urban main roads, and intersections. In low-light environments, HSP-YOLOv8 shows significant improvement, successfully detecting targets with a minimum resolution of 12 × 25 pixels. In crowded public facilities, HSP-YOLOv8 also improves markedly, detecting targets with a minimum resolution of 4 × 10 pixels. In urban main road scenes, where many low-resolution (less than 10 × 10 pixels) vehicle targets appear at the far end of the road, HSP-YOLOv8 performs well.
5. Conclusions
In UAV aerial target detection tasks, issues such as small target size, low resolution, insufficient context information, and target clustering are prevalent. This paper proposes an improved YOLOv8 algorithm, HSP-YOLOv8, for small target detection scenarios. First, a tiny prediction head is added to directly exploit high-resolution feature maps for small object detection. Second, the SPD-Conv module is introduced in the backbone to focus on local context information, enhancing the spatial information of small targets and improving detection accuracy. Additionally, Soft-NMS effectively mitigates the missed detections caused by small object clustering, increasing precision. The experimental results on the VisDrone2019 dataset indicate that the proposed algorithm makes significant progress in small target detection and shows clear advantages over other models in both the accuracy and efficiency of detecting small targets. In future work, we will explore how to reduce computational complexity while preserving small object detection accuracy, and how to accelerate inference while maintaining model accuracy, to suit UAV aerial applications with limited computational resources.