1. Introduction
Unmanned Aerial Vehicle (UAV) intelligent systems have widely adopted target detection algorithms, such as pedestrian detection and vehicle detection [1]. Currently, deep learning-based object detection algorithms are mainly divided into two-stage algorithms and one-stage algorithms. Two-stage networks (such as R-CNN [2], Fast R-CNN [3], and Faster R-CNN [4]) first generate candidate regions, then classify and regress these candidate regions. One-stage networks, represented by the YOLO [5] series and SSD [6], use only convolutional neural networks to extract features and directly classify and regress objects. Compared to two-stage algorithms, one-stage algorithms have a simpler structure, resulting in lower computational requirements and higher real-time performance, making them more widely used in UAV intelligent systems. While these applications have achieved desirable results for large object detection, small object detection, where object resolution is less than 32 × 32 pixels, remains problematic.
The main challenges in small object detection arise from the small scale and low resolution of the objects, insufficient contextual information, and object clustering, which together make small object detection extremely difficult and lead to significant false positives and missed detections. The primary reasons for poor performance in small object detection are as follows: (1) High localization accuracy is required: because small objects cover a small area in the image, their bounding boxes are harder to localize than those of large or medium-sized objects. (2) Small objects yield limited feature representations: after multiple downsampling stages, their features become weaker and may even be lost in the background. (3) Small objects have a higher probability of clustering, leading to missed detections. To address these issues, existing small object detection methods improve on general deep learning detection methods to enhance detection performance.
Based on the aforementioned issues, this paper proposes an improved small object detection algorithm, HSP-YOLOv8, to enhance small object detection performance. The algorithm adds an extra tiny prediction head and a Space-to-Depth Convolution (SPD-Conv) module [7] to YOLOv8 and employs a post-processing algorithm better suited to small object recognition, soft non-maximum suppression (Soft-NMS) [8]. Experiments on the VisDrone2019 dataset, which contains a large proportion of small objects, demonstrate that HSP-YOLOv8 improves on YOLOv8s by 11% in mAP0.5 and 9.8% in mAP0.5:0.95.
The main contributions of this paper can be summarized as follows:
We designed an additional tiny prediction head to address the problem of low localization accuracy of small objects in low-resolution feature maps. Higher-resolution feature maps facilitate better feature extraction for small objects, thereby enhancing the detection capability for these targets.
We implemented the SPD-Conv module in the backbone to solve the problem of feature reduction and loss for small objects after multiple downsampling stages. This allows the preservation of fine-grained feature information for small objects, thereby improving the network’s learning and representation capabilities, and enhancing the accuracy of small object recognition.
We replaced the original non-maximum suppression (NMS) algorithm in YOLOv8 with Soft-NMS, which mitigates the issue of missed detections caused by the clustering of small objects. This modification enables the effective detection of densely overlapping objects, thereby improving the detection accuracy of small objects.
The remainder of this paper is organized as follows: Section 2 reviews related work, beginning with an overview of research progress in the YOLO series of detection algorithms, followed by a discussion of previous efforts in small object detection. Section 3 details the specific improvements in the HSP-YOLOv8 model, starting with its overall architecture and then elaborating on the three enhanced modules individually. Section 4 presents the experimental work, including a description of the datasets and relevant parameters, followed by comparative experiments, ablation experiments, and visualization experiments conducted on the VisDrone2019 dataset. Finally, Section 5 concludes the paper with a summary of the research findings and a discussion of future work directions.
3. Methods
For small object recognition tasks, since small objects occupy very few pixels in the image and tend to cluster, YOLOv8's ability to extract features of small objects is insufficient, and its localization accuracy is not adequate. This leads to low accuracy in detecting small objects, making it unsuitable for practical applications. To address the aforementioned issues, this paper designs the HSP-YOLOv8 model based on the YOLOv8 model. The structure of the model is shown in Figure 3. The following three improvements are proposed: (1) An additional tiny prediction head is designed to enhance the model's detection capability for small objects. (2) The SPD-Conv module is used in the downsampling stage of the backbone to effectively eliminate information loss. (3) The Soft-NMS algorithm is adopted to effectively alleviate the issue of missed detections caused by the clustering of small objects.
3.1. Tiny Prediction Head
The three prediction heads in YOLOv8 operate at different resolutions (80 × 80, 40 × 40, 20 × 20), which contributes significantly to detection capability across application scenarios but poses difficulties for small object detection. YOLOv8 performs poorly on small objects because the features of objects occupying very few pixels are easily overlooked. To solve this problem, as shown in Figure 4, this paper adds an additional tiny prediction head (Tiny-Head), raising the resolution of the detection feature map to 160 × 160 so that tiny objects larger than 4 × 4 pixels can be detected. The higher-resolution feature map preserves more features of small objects, thereby improving the model's performance in small object detection.
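To make the scale relationship concrete, the short Python sketch below relates each prediction head's stride to its grid resolution and to the input region a single grid cell covers. It is our illustration, not code from the paper, and it assumes the 640 × 640 network input implied by the 80 × 80, 40 × 40, and 20 × 20 grids quoted above:

```python
# Illustrative arithmetic only: head stride vs. grid resolution,
# assuming a 640 x 640 network input (our assumption).
INPUT = 640
heads = {"Tiny-Head (added)": 4, "P3 head": 8, "P4 head": 16, "P5 head": 32}
for name, stride in heads.items():
    grid = INPUT // stride
    print(f"{name}: stride {stride} -> {grid} x {grid} feature map; "
          f"one cell spans {stride} x {stride} input pixels")
# Tiny-Head (added): stride 4 -> 160 x 160 feature map; one cell spans 4 x 4 input pixels
# P3 head: stride 8 -> 80 x 80 feature map; one cell spans 8 x 8 input pixels
# ...
```

The stride-4 grid is what allows the Tiny-Head to assign at least one cell to any object larger than 4 × 4 pixels.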
3.2. SPD-Conv Module
When the image resolution is high and the targets are medium to large, the accuracy of the YOLOv8 model is generally high. However, for small objects in UAV aerial images, the model's accuracy drops rapidly. One reason is that the Conv modules YOLOv8 uses for feature extraction are strided convolution modules, which lose fine-grained information and produce inefficient feature representations; in tasks with low image resolution or small detection objects, detection performance therefore degrades quickly. To address this, this paper introduces a new convolutional neural network building block, SPD-Conv, in the backbone. SPD-Conv combines a space-to-depth layer with a non-strided convolution layer. The space-to-depth layer downsamples the feature maps while retaining all information in the channel dimension, effectively eliminating information loss and ensuring that information about small objects is preserved during downsampling.
As shown in Figure 5, the SPD-Conv module processes the feature maps in two steps. First, the space-to-depth operation slices the input feature map of size S × S × C1 into four sub-maps of size S/2 × S/2 × C1 (taking every second pixel along each spatial dimension) and concatenates them along the channel dimension, yielding a feature map of size S/2 × S/2 × 4C1. Then, a non-strided (stride-1) convolution is applied to this feature map, so the spatial resolution is halved without discarding any of the original information.
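A minimal PyTorch sketch of this building block follows. It is our illustration of the mechanism just described; the module name, the 3 × 3 kernel, and the BatchNorm/SiLU choices are our assumptions rather than the exact configuration used in HSP-YOLOv8:

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth slicing followed by a non-strided convolution (sketch)."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # The stride-1 conv consumes the 4x channels produced by space-to-depth.
        self.conv = nn.Conv2d(4 * in_channels, out_channels,
                              kernel_size=3, stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Slice the S x S map into four S/2 x S/2 sub-maps and concatenate
        # them on the channel axis: (B, C1, S, S) -> (B, 4*C1, S/2, S/2).
        # Every input pixel is kept, so the downsampling itself is lossless.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.act(self.bn(self.conv(x)))
```

For example, `SPDConv(64, 128)` maps a (B, 64, 160, 160) tensor to (B, 128, 80, 80): the resolution is halved, but unlike a stride-2 convolution, no pixel is skipped before the features are mixed.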
3.3. Soft-NMS
When detecting objects, the YOLOv8 model typically generates multiple high-confidence bounding boxes around the actual target. NMS effectively deletes redundant bounding boxes, ensuring that only one bounding box is retained for each actual target. However, in real scenarios, small objects often cluster, and the overlap between small objects leads to many overlapping bounding boxes during detection by the YOLOv8 model. If the IoU (Intersection over Union) of the resulting bounding boxes exceeds the set threshold, the NMS algorithm may remove the bounding boxes with lower confidence, risking the failure to recognize overlapping small objects and thus reducing the detection performance for small objects. While increasing the IoU threshold can mitigate the issue of missed detections, it also increases the possibility of redundant detections. To address this phenomenon, Soft-NMS provides a solution by improving NMS without adding extra complexity.
The specific steps of Soft-NMS are as follows: (1) Classify all boxes and discard those belonging to the background class. (2) For each target category, sort the predicted boxes in descending order of classification confidence. (3) Within the given category, select and retain the predicted box with the highest confidence. (4) Calculate the IoU between this highest-confidence box and each remaining box, and use a weighting function to decay the confidence of the boxes that overlap with it; any box whose decayed confidence falls below a set threshold is deleted. (5) Repeat steps 3 and 4 until all boxes of the category have been processed. (6) Repeat steps 2 to 5 until all target categories have been processed. (7) Output the finally selected predicted boxes.
Soft-NMS reduces the confidence of adjacent boxes that intersect with the current highest-confidence predicted box instead of directly deleting these adjacent boxes. This method alleviates the issue of missed detections in dense scenes of small objects to some extent and proves effective in recognizing targets with a lot of overlap, ultimately improving the detection accuracy of small objects.
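The sketch below illustrates steps (3) and (4) for a single category, using the Gaussian weighting function exp(-IoU²/σ) proposed in the Soft-NMS paper. It is a minimal single-class illustration with our own function and parameter names, not the exact routine integrated into HSP-YOLOv8:

```python
import numpy as np

def soft_nms(boxes: np.ndarray, scores: np.ndarray,
             sigma: float = 0.5, score_thresh: float = 0.001) -> list:
    """Gaussian Soft-NMS for one class. boxes: (N, 4) as [x1, y1, x2, y2]."""
    scores = scores.astype(np.float64).copy()
    idxs = np.arange(len(scores))
    keep = []
    while len(idxs) > 0:
        # Step 3: retain the remaining box with the highest confidence.
        top = int(np.argmax(scores[idxs]))
        best = idxs[top]
        keep.append(best)
        idxs = np.delete(idxs, top)
        if len(idxs) == 0:
            break
        # Step 4: IoU between the retained box and every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[idxs, 0])
        y1 = np.maximum(boxes[best, 1], boxes[idxs, 1])
        x2 = np.minimum(boxes[best, 2], boxes[idxs, 2])
        y2 = np.minimum(boxes[best, 3], boxes[idxs, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = ((boxes[best, 2] - boxes[best, 0]) *
                     (boxes[best, 3] - boxes[best, 1]))
        areas = ((boxes[idxs, 2] - boxes[idxs, 0]) *
                 (boxes[idxs, 3] - boxes[idxs, 1]))
        iou = inter / (area_best + areas - inter)
        # Decay overlapping neighbours instead of deleting them outright...
        scores[idxs] *= np.exp(-(iou ** 2) / sigma)
        # ...and drop only those whose decayed confidence is now negligible.
        idxs = idxs[scores[idxs] >= score_thresh]
    return keep
```

Because overlapping neighbours are down-weighted rather than removed, two genuinely distinct small objects with heavily overlapping boxes can both survive, which is precisely the failure mode of hard NMS in clustered scenes.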
4. Experimental Results and Analysis
4.1. Dataset
The dataset used in this experiment is VisDrone2019, collected and prepared by the AISKYEYE team from Tianjin University, China. This dataset was captured using various drone platforms, including the DJI Phantom 4 Pro, across 14 different cities in China, over a period of four months from April to August 2018. VisDrone2019 is a drone aerial imagery dataset with image resolutions up to 2000 × 1500 pixels. The training set comprises 6471 images with a total of 343,205 annotations; each image contains an average of 53 instances, indicating a high object density in which most objects are very small (less than 32 × 32 pixels). The validation and test sets consist of 548 and 1610 images, respectively. The dataset includes 10 categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning tricycle, bus, and motor [28].
4.2. Experimental Equipment and Training Strategies
The object detection experiments in this section require a deep learning server with sufficient computational power and storage for training, testing, and evaluation. The models in this study are based on the PyTorch version of YOLOv8, so the server must be configured with the corresponding environment, including the operating system, GPU drivers, and deep learning frameworks.
Table 1 details the versions and configuration information of all required environments to ensure the reproducibility and accuracy of the experiments.
YOLOv8 offers fast detection speed and high accuracy, meeting the requirements for real-time detection and recognition of small objects. To satisfy different application needs, the YOLOv8 network scales to five model sizes: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. Balancing the speed and accuracy of small object detection, this paper selects YOLOv8s as the baseline network model. Key parameter settings used during model training are shown in Table 2.
4.3. Evaluation Indicators
To evaluate the detection performance of the improved model, this paper uses precision, recall, mAP0.5, and mAP0.5:0.95 as evaluation metrics. Detailed definitions are listed below.
Precision represents the proportion of true positive samples among all samples predicted as positive. It can be expressed as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall represents the proportion of true positive samples identified by the model relative to the actual number of positive samples. It can be expressed as:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.

Average precision (AP) is the area under the precision-recall curve of a single category, computed from the ranked detection results. It is calculated as follows:

$$AP = \int_{0}^{1} P(R)\,dR$$

Mean average precision (mAP) is the mean of AP across all categories and serves as a metric to evaluate the overall performance of a detection model across multiple categories. It takes into account both the recall and precision of the model at different thresholds, providing a comprehensive assessment of the model's performance in object detection tasks. It is calculated as follows:

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

where $i$ represents the class index and $N$ represents the number of categories in the training dataset. mAP0.5 refers to the mean average precision for all classes at an IoU threshold of 0.5, and mAP0.5:0.95 to the mean average precision averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
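As a concrete illustration of these definitions, the snippet below computes AP as the area under a precision-recall curve and averages per-class APs into mAP. All numbers are hypothetical, and the code is our sketch, not the evaluation pipeline or the results of the paper:

```python
import numpy as np

def ap_from_pr(recalls: np.ndarray, precisions: np.ndarray) -> float:
    # AP is the area under the precision-recall curve: the integral of P(R) dR.
    return float(np.trapz(precisions, recalls))

recalls = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])    # hypothetical P-R curve
precisions = np.array([1.0, 0.9, 0.8, 0.7, 0.5, 0.3])
ap = ap_from_pr(recalls, precisions)

# mAP averages AP over the N classes; mAP0.5:0.95 additionally averages
# over the ten IoU thresholds 0.50, 0.55, ..., 0.95.
ap_per_class = {"pedestrian": ap, "car": 0.55}         # hypothetical APs
map_50 = sum(ap_per_class.values()) / len(ap_per_class)
print(f"AP = {ap:.3f}, mAP0.5 = {map_50:.3f}")
```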
4.4. Comparison Experiment
To verify the improvement in model detection performance, three sets of comparative experiments were conducted on the VisDrone2019 dataset: the proposed model was compared with the YOLO series algorithms, with other recent models, and with the YOLOv8s model.
Table 3 shows the AP values, mAP0.5 values, number of parameters, and computational cost of the improved model and the YOLO series algorithms. YOLOv4 [29] combines the CSPDarknet53 backbone with PAN feature fusion, achieving a good balance between speed and accuracy. YOLOv5 uses the Focus structure to enlarge the receptive field and employs Mosaic data augmentation, resulting in better network robustness. YOLOv6 [30] adopts a new self-distillation strategy and simplifies the SPPF module of YOLOv5/v8. YOLOv7 uses a model re-parameterization strategy and proposes a planned re-parameterized model. Based on the experimental results, the improved model outperforms all models in the YOLO series. Notably, although YOLOv8l has significantly more parameters and higher computational complexity than the improved model, its mAP0.5 and mAP0.5:0.95 values are lower.
From Table 4, the proposed model has the best detection performance compared to other state-of-the-art models. CenterNet introduces an anchor-free detection mechanism to address the sample imbalance between small and large/medium-scale objects caused by anchor-based mechanisms. Faster R-CNN introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. CNXResNet [31] proposes a new backbone that uses the PP-YOLOE backbone architecture with the ConvNeXt V2 block as its core module and further introduces the idea of style translation into PP-YOLOE to improve the detection and recognition of small objects. MC-YOLOv5 [32] improves YOLOv5 in the feature extraction stage and introduces a new shallow network optimization strategy to reduce missed detections in dense small object scenarios. EDGS-YOLOv8, based on the YOLOv8n model, demonstrates superior performance in terms of model lightweighting. Although our model sacrifices some GFLOPs and model size compared to EDGS-YOLOv8, it achieves more than a 10% improvement in both mAP0.5 and mAP0.5:0.95. In recent years, many improvements to the YOLO series for small object detection have been proposed, but their detection results remain inferior to the proposed method.
Table 5 shows the AP values for each object class and the mAP0.5 values over all classes on the VisDrone2019 dataset for both the improved model and the YOLOv8s model. From the comparison results, the mAP0.5 value of the improved model increased by 11%. The AP values improved for every target class, with the most significant increase observed for the 'people' category, which improved by 17.8%. The AP values for several other categories, including pedestrian, bicycle, and motor, also improved by more than 10%.
In UAV aerial images, targets such as pedestrians, people, and motors are smaller and more densely distributed than other objects; the minimum resolution of a person target can be as low as 4 × 10 pixels, and such targets are prone to aggregation. The significant improvement on these categories indicates that HSP-YOLOv8 can effectively handle small object detection scenarios.
4.5. Ablation Experiment
To validate the effectiveness of the proposed improvements and to assess the impact of each added or modified module on overall model performance, two sets of ablation experiments were conducted on the baseline model YOLOv8s using the VisDrone2019 dataset. The first set focused on the ablation of the prediction heads, comparing various combinations of heads. The second set examined the ablation of the improvement methods, comparing the effects of the different enhancements.
To determine the most suitable detection head configuration, we defined the following models: Baseline Model 1 (YOLOv8s); Improved Model 2 (added 160 × 160 prediction head); Improved Model 3 (added 160 × 160 prediction head and removed the 20 × 20 prediction head); Improved Model 4 (added 320 × 320 prediction head); and Improved Model 5 (added 320 × 320 prediction head and removed the 20 × 20 prediction head). The experimental results, shown in Table 6, indicate that adding a 160 × 160 prediction head yields the highest detection accuracy. In Model 3, where the 20 × 20 prediction head was removed in addition to adding the 160 × 160 head, the GFLOPs improved, but precision, recall, mAP0.5, and mAP0.5:0.95 were all lower than for Model 2. When a 320 × 320 detection head was added, the performance metrics were inferior to those of Model 2 regardless of whether the 20 × 20 head was retained. Based on these considerations, this paper adds only a 160 × 160 prediction head. While this choice sacrifices some GFLOPs and inference time, it preserves the model's ability to detect large objects while maximizing precision for small object detection.
From the experimental results in Table 7, adding the tiny prediction head increases the mAP0.5 value by 5.5% and the recall by 4.4%. This indicates that a larger-scale detection head retains more abundant small object feature information on small object datasets, improves the matching between targets and prior boxes, and helps the loss function converge better. Replacing the original Conv modules in the backbone with the SPD-Conv module increases the mAP0.5 value by 2.3%, with improvements in both precision and recall, suggesting that this change better preserves the features of small objects and reduces the probability of missing them. With the introduction of Soft-NMS, the mAP0.5 value increases by 6.1% and precision improves by 8.3%, indicating that the model effectively overcomes missed detections when objects overlap, thus improving detection accuracy.
In summary, adding a tiny prediction head effectively improves the localization and recognition ability for small objects. The SPD-Conv module reduces the loss of target information during feature extraction. Introducing Soft-NMS alleviates the problem of missed detections in densely clustered small object scenarios, enhancing the detection performance of aerial imagery. HSP-YOLOv8 significantly outperforms YOLOv8 in detection accuracy.
4.6. Visualization Analysis
To visually demonstrate the detection effect of the improved algorithm, visual experiments were conducted using YOLOv5s, YOLOv8s, and HSP-YOLOv8, as shown in Figure 6. Four representative scenes were selected as experimental data: low-light environments, public facilities, urban main roads, and intersections. In low-light environments, HSP-YOLOv8 shows significant improvement, successfully detecting targets with a minimum resolution of 12 × 25 pixels. In crowded public facilities, HSP-YOLOv8 also improves markedly, detecting targets with a minimum resolution of 4 × 10 pixels. In urban main road scenes, where many low-resolution (less than 10 × 10 pixels) vehicle targets appear at the far end of the road, HSP-YOLOv8 performs well.
5. Conclusions
In UAV aerial target detection tasks, issues such as small target size, low resolution, insufficient context information, and target clustering are prevalent. This paper proposes an improved YOLOv8 algorithm, HSP-YOLOv8, for small target detection scenarios. First, a tiny prediction head is added to directly exploit high-resolution feature maps for small object detection. Second, the SPD-Conv module is introduced in the backbone to focus on local context information, enhancing the spatial information of small targets and improving detection accuracy. Additionally, Soft-NMS effectively mitigates the missed detections caused by small object clustering, increasing precision. The experimental results on the VisDrone2019 dataset indicate that the proposed algorithm makes significant progress in small target detection and shows clear advantages over other models in both the accuracy and efficiency of detecting small targets. In future work, we will explore how to reduce computational complexity while preserving small object detection accuracy, and how to accelerate inference while maintaining model accuracy, to suit UAV aerial applications with limited computational resources.