1. Introduction
With their compact size, ease of operation, high flexibility, and affordability, drones have become indispensable. These devices are widely used across sectors ranging from military operations to civilian tasks such as disaster relief and traffic monitoring [1,2,3]. A critical technology supporting these diverse applications is object detection in drone-captured imagery.
Deep neural networks have brought significant advancements to object detection, markedly improving performance on notable benchmarks such as MS COCO [4] and PASCAL VOC [5]. However, the majority of these deep CNNs were designed for natural scene images [6,7,8]. Because drone-captured imagery differs substantially from natural scene imagery, these networks often underperform when applied to drone-captured content.
Designing object detectors for low-cost mini rotary-wing drone platforms is challenging, and the difficulties differ from those posed by conventional datasets. As shown in Figure 1, several challenges arise when adapting drone aerial images for object detection [9]:
Resource Limitations: Low-cost mini rotary-wing drones have inherently limited computational capabilities. Specifically, these drones are constrained in terms of data processing and memory capacity. To equip these drones with real-time, high-precision object detection capabilities, there is an urgent need for a solution that ensures high accuracy and low latency while minimizing computational overhead.
Prevalence of Small Objects: Drone images often feature small, densely populated objects.
Limited Foreground Proportion: The actual subjects of interest, or foreground objects, typically constitute a minor portion of the entire image.
These challenges highlight the need for advanced, low-latency detection systems for drone imagery.
Most research favors complex models to improve small object detection in aerial images. These models often rely on high-resolution inputs and consume significant computational resources, an approach that is misaligned with the inherent computational limitations of drone platforms and that underscores the need for efficient, lightweight models. Complex object detection models offer high precision but are often unsuitable for edge-device deployment due to their computational demands; in contrast, lightweight detectors might not maintain the same level of accuracy [10,11,12]. In an attempt to address this trade-off, numerous studies have focused on optimizing the primary network using methods such as network pruning [13,14] and structural redesign [15,16]. Though these methods prove effective for conventional images, their direct applicability to drone imagery remains questionable due to the distinct differences between conventional images and drone-captured scenes. Traditional detection strategies for aerial imagery often adopt a coarse-to-fine approach [17,18,19,20]: coarse detectors first discern larger instances and regions densely populated with smaller instances, and fine detectors are then applied to these highlighted regions to accurately identify the small instances. While precise, the computational demands of these strategies make them less suitable for real-time drone applications.
Recognizing the limitations of existing object detection models in handling drone-captured images, we propose the Efficient YOLOv7-Drone, specifically designed to enhance object detection efficiency and accuracy in drone aerial imagery. Considering the dominance of smaller objects in aerial images and model efficiency, we removed the underperforming P5 detection head and introduced the P2 detection head, specifically to improve the detection of tiny objects. To ensure efficient feature relay from the Backbone to the Neck, we conducted channel optimization for the CBS module. Additionally, we made adjustments to other network components to boost the model’s performance.
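To make these structural changes concrete, the minimal PyTorch sketch below lays out prediction heads for the P2, P3, and P4 levels with no P5 head. The channel widths, anchor count, and class count (10, as in VisDrone) are illustrative assumptions rather than the exact configuration of our network.

```python
import torch
import torch.nn as nn

class DroneDetectionHeads(nn.Module):
    """Illustrative head layout: P2/P3/P4 heads, no P5 head.

    Channel widths and the per-level 1x1 prediction convolutions are
    placeholders, not the exact Efficient YOLOv7-Drone configuration.
    """
    def __init__(self, in_channels=(64, 128, 256), num_anchors=3, num_classes=10):
        super().__init__()
        out_ch = num_anchors * (num_classes + 5)  # YOLO-style: box (4) + objectness (1) + classes
        self.heads = nn.ModuleList([nn.Conv2d(c, out_ch, kernel_size=1) for c in in_channels])

    def forward(self, feats):
        # feats: [P2, P3, P4] feature maps at strides 4, 8, 16
        return [head(f) for head, f in zip(self.heads, feats)]

# Example with dummy features for a 640 x 640 input
p2 = torch.randn(1, 64, 160, 160)   # stride 4
p3 = torch.randn(1, 128, 80, 80)    # stride 8
p4 = torch.randn(1, 256, 40, 40)    # stride 16
outs = DroneDetectionHeads()([p2, p3, p4])
print([o.shape for o in outs])
```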
As the foreground often occupies only a small fraction of aerial images, the CEASC [21] module utilizes sparse convolution techniques [22,23], which narrow the network's attention and diminish superfluous computations on background elements. It achieves this by generating a learnable mask, ensuring that convolutions are performed only on select sparse sampling areas and optimizing computational efficiency. However, the performance of sparse convolution heavily depends on the quality of the generated mask. Traditional methods often employ fixed mask ratios to guide mask generation, which presents its own set of challenges. A mask ratio that is too small might lead to overly extensive sparse sampling regions, incurring unnecessary computations on the background and potentially compromising both efficiency and accuracy. Conversely, an excessively large ratio could shrink the sparse sampling areas too much, risking the omission of crucial foreground and contextual information and in turn hampering detection performance. To address this, [21] introduced the adaptive multi-layer masking (AMM) scheme, which optimizes mask ratios across different feature pyramid network (FPN) levels using a custom loss function, thus balancing detection accuracy with efficiency. However, relying on the mask ratio to control mask generation can introduce significant uncertainties. Therefore, we propose a novel module, TGM-CESC. Central to this approach is the target-guided mask, which leverages object labels to create foreground and background binary maps. By computing a dedicated loss between these maps and the masks used for sparse convolution, we achieve pixel-level guidance for generating sparse convolution masks. Subsequently, we integrated the TGM-CESC module into the Efficient YOLOv7-Drone, replacing the original re-parameterized convolution (RepConv) module.
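The following is a minimal sketch of how such a target-guided, pixel-level constraint could look in PyTorch: ground-truth boxes are rasterized into a binary foreground map at the stride of one FPN level, and a loss ties the predicted sparse-convolution mask to that map. The function names, box format, and use of binary cross-entropy are illustrative assumptions, not the exact TGM-CESC formulation.

```python
import torch
import torch.nn.functional as F

def target_guided_binary_map(gt_boxes, feat_h, feat_w, stride):
    """Rasterize ground-truth boxes (x1, y1, x2, y2 in image pixels) into a
    binary foreground map at the resolution of one FPN level."""
    fg = torch.zeros(feat_h, feat_w)
    for x1, y1, x2, y2 in gt_boxes:
        xs, ys = int(x1 // stride), int(y1 // stride)
        xe = min(int(x2 // stride) + 1, feat_w)
        ye = min(int(y2 // stride) + 1, feat_h)
        fg[ys:ye, xs:xe] = 1.0
    return fg

def mask_guidance_loss(mask_logits, gt_boxes, stride):
    """Pixel-level constraint: push the predicted sparse-conv mask towards the
    box-derived foreground map (binary cross-entropy used here as an assumption)."""
    _, _, h, w = mask_logits.shape
    target = target_guided_binary_map(gt_boxes, h, w, stride).to(mask_logits.device)
    return F.binary_cross_entropy_with_logits(mask_logits, target.expand_as(mask_logits))

# Example: one 80 x 80 mask at the P3 level (stride 8) and two boxes
logits = torch.randn(1, 1, 80, 80, requires_grad=True)
boxes = [(32, 48, 96, 120), (400, 200, 460, 260)]
loss = mask_guidance_loss(logits, boxes, stride=8)
loss.backward()
```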
To capture rich semantic information, we introduced the head context-enhanced method (HCEM). This method capitalizes on fusing feature maps from adjacent layers, compensating for the potential information loss at lower resolutions due to the mask quality in sparse convolution.
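As a rough illustration of this idea, the sketch below merges a head's feature map with upsampled context from the adjacent, coarser level; the concatenation-plus-1x1-convolution fusion operator and the channel sizes are placeholders rather than the exact HCEM design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadContextEnhance(nn.Module):
    """Sketch of adjacent-level fusion: context from the neighbouring (coarser)
    level is upsampled and merged into the current head's feature map.
    The concat + 1x1 conv fusion operator is an assumption."""
    def __init__(self, cur_channels, coarse_channels):
        super().__init__()
        self.fuse = nn.Conv2d(cur_channels + coarse_channels, cur_channels, kernel_size=1)

    def forward(self, cur_feat, coarse_feat):
        up = F.interpolate(coarse_feat, size=cur_feat.shape[-2:], mode="nearest")
        return self.fuse(torch.cat([cur_feat, up], dim=1))

# Example: enrich a P3 map (stride 8) with context from P4 (stride 16)
p3, p4 = torch.randn(1, 128, 80, 80), torch.randn(1, 256, 40, 40)
enhanced_p3 = HeadContextEnhance(128, 256)(p3, p4)
print(enhanced_p3.shape)  # torch.Size([1, 128, 80, 80])
```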
The main contributions of our work are as follows:
(1) The Efficient YOLOv7-Drone. In drone-captured images, we often observe objects that are small, densely packed, and of varied scales. To address these challenges, we present the Efficient YOLOv7-Drone. By omitting the P5 detection head, incorporating the P2 detection head, and fine-tuning the CBS module’s channels, our model skillfully narrows the divide between performance and computational efficiency. This results in significant performance enhancements in object detection for drone-captured images.
(2) Target-Guided Mask Strategy. Recognizing the inherent sparsity of foreground elements in aerial images, we propose the context-enhanced sparse convolution with target-guided masking (TGM-CESC) module. Central to this module is our target-guided mask strategy. By producing ground-truth binary maps that correspond to the masks, we establish pixel-level constraints for generating sparse convolution masks. This offers an accurate and efficient solution for detecting sparsely scattered objects in aerial imagery, further sharpening detection precision amidst vast backgrounds.
(3) Head Context-Enhanced Method. To compensate for potential information loss induced by sparse convolution, we introduced the head context-enhanced method (HCEM). This strategy exploits the synergistic effect between feature map layers, merging features from adjacent levels, effectively countering the information loss due to the quality of masks in sparse convolution.
(4) We conducted comprehensive experiments on two popular datasets, VisDrone and UAVDT. The results decisively demonstrate that the methodologies we introduced for drone platforms can achieve real-time detection while maintaining a high level of accuracy.
4. Experiments
4.1. Datasets and Evaluation Measures
To demonstrate the effectiveness of our proposed method, we conducted extensive experiments on two primary drone aerial object detection benchmark datasets, namely VisDrone [48] and UAVDT [49]. These datasets were selected due to their diverse data representation, encompassing aerial images from various weather conditions, terrains, and objects spanning different traffic and daily life scenarios. They provide meticulous annotations for each image and feature challenging scenarios, including object occlusions, small targets, target overlaps, and intricate backgrounds. Furthermore, their widespread use in the aerial image detection domain ensures a credible and pertinent evaluation of our method.
VisDrone dataset. This dataset comprises 288 video clips and 10,209 high-resolution static images. Of these, 6471 are allocated for training, 548 for validation, and 3190 for testing. The image resolutions range from 960 × 540 to 2000 × 1500. Captured across 14 different cities, the dataset encompasses a myriad of shooting scenarios, covering 10 distinct target categories, namely pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor. Due to its pronounced class and size imbalances, it serves as an ideal benchmark for studying small object detection challenges. To ensure consistency with prior research, all test results are based on the validation set.
UAVDT dataset. Compared to the VisDrone dataset, UAVDT offers an even more extensive collection of drone-captured imagery. It contains 23,258 images for training and 15,069 images for testing, all with a resolution of 1024 × 540. This dataset focuses on three categories: buses, trucks, and cars.
Evaluation measures. In our study, we adopted mean average precision (mAP), average precision (AP), and average recall (AR) for accuracy evaluation. For efficiency, we considered GFLOPs, FPS, and the total parameter count.
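For reference, the helper sketch below shows one common way to obtain the parameter count and a rough FPS figure in PyTorch; the exact measurement protocol behind the reported numbers may differ.

```python
import time
import torch

def count_parameters(model):
    """Total trainable parameter count."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def measure_fps(model, input_size=640, warmup=10, iters=100, device="cuda"):
    """Rough single-image FPS measurement on a GPU; batch size and timing
    details are assumptions, not the paper's exact protocol."""
    model = model.eval().to(device)
    x = torch.randn(1, 3, input_size, input_size, device=device)
    for _ in range(warmup):
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return iters / (time.time() - start)
```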
4.2. Implementation Details
We implemented our model using PyTorch. All of our experiments were conducted on a single NVIDIA RTX 3090 GPU for both training and testing. During the training phase, we leveraged partial weights from a pre-trained YOLOv7 model, considerably reducing the training time. To ensure consistency and fairness in our experiments, we trained our model for 200 epochs and fixed the batch size at 8. For efficiency, we standardized the input width and height to 640. We utilized the SGD optimizer, with all other parameters set to the default configurations of YOLOv7.
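A compact summary of this setup is given below; values not stated above follow the official YOLOv7 defaults, and the pre-trained weight filename is an assumption.

```python
# Training configuration as described in the text.
train_config = {
    "epochs": 200,
    "batch_size": 8,
    "img_size": 640,            # input width and height
    "optimizer": "SGD",
    "pretrained": "yolov7.pt",  # partial weights from a pre-trained YOLOv7 (filename assumed)
    "device": "cuda:0",         # single NVIDIA RTX 3090
    # learning rate, momentum, weight decay, augmentation, etc.: YOLOv7 defaults
}
```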
4.3. Comparison with the State of the Art
We conducted experiments on the VisDrone and UAVDT datasets, comparing our method to state-of-the-art object detectors. To fairly reflect our method's detection performance and efficiency, we refrained from using additional tricks during inference. Our model employs a backbone named “Modified ELANNet”.
Table 1 shows that the Modified ELANNet has only 51% of ResNet-50’s parameters, and its GFLOPs are also lower than those of ResNet-50. Most of the comparative methods rely on high-resolution images and are multi-stage detectors. To maintain fairness in the experiment, the backbones of the selected methods were of similar or even greater complexity than ours.
On the VisDrone dataset, we compared our method against ten recent popular methods. Specifically, RetinaNet [7], ClusDet [20], DMNet [19], GLSAN [50], QueryDet [45], CascadeNet [51], and CascadeNet+MF [51] utilize ResNet-50 as their Backbone. GFL V1 [37], which incorporates the CEASC [21] structure, employs ResNet-18, while HRDNet [39] uses both ResNet-18 and ResNet-101. DFPN [52] chose Modified CSP v5-M as its Backbone. As shown in Table 2, even though our approach uses a lower-resolution image as input, it achieved the best results across all three main evaluation metrics. This outcome convincingly demonstrates our technique’s ability to balance detection accuracy with enhanced efficiency.
To underscore the outstanding performance and resilience of our model, we also evaluated it at an additional input resolution. Table 3 reports the scores it attains on the three key evaluation metrics; this performance substantially surpasses that of other state-of-the-art methods, even when they utilize more complex Backbones and superior resolutions.
For the UAVDT dataset, we benchmarked our method against ClusDet [20], DMNet [19], DFPN [52], and ARMNet [53]. As demonstrated in Table 4, even when working with a lower image resolution, our approach excels across all three primary evaluation metrics. This reaffirms our method’s prowess in seamlessly integrating detection accuracy with heightened efficiency.
4.4. Ablation Study
To further validate the effectiveness of our proposed model, we conducted an extensive ablation study on the VisDrone dataset. For brevity and clarity, we labeled our modifications as follows: introduction of the P2 detection head as ‘A’, removal of the P5 detection head as ‘B’, CBS module channel optimization as ‘C’, context-enhanced sparse convolution with target-guided masking (TGM-CESC) as ‘D’, and the head context-enhanced method (HCEM) as ‘E’.
4.4.1. Comparison with the Baseline Model
We carried out a comprehensive evaluation on the VisDrone validation set to accurately assess the performance of each introduced component. Our evaluation took into account both detection accuracy and efficiency, utilizing a diverse set of metrics: mAP, AP50, AP75, GPU memory consumption, giga floating-point operations (GFLOPs), and FPS.
As detailed in
Table 5 and compared to the baseline model:
The introduction of the P2 detection head (A) led to improved detection metrics. Specifically, mAP, AP50, and AP75 increased by 0.7%, 0.8%, and 1.2%, respectively. However, this also introduced additional memory and computation overhead.
By removing the P5 detection head (B), we noted improvements in mAP, AP50, AP75, and AR, which reached 30.0%, 51.5%, 30.1%, and 52.8%, respectively. Additionally, this change reduced the GPU memory overhead by 56.32% and decreased GFLOPs from 118.2 to 115.2, resulting in an FPS boost of 29.68%.
The CBS module channel optimization (C) slightly increased memory and GFLOPs but boosted the mAP, AP50, and AP75 to 31.0%, 52.2%, and 31.6%, respectively.
With the integration of the TGM-CESC module (D), the model focused on image foreground areas, achieving mAP, AP50, and AP75 scores of 31.5%, 52.8%, and 32.1%. By utilizing sparse convolution, GFLOPs were reduced to 133.0. However, substituting the well-performing RepConv module during the inference phase led to a slight decrease in FPS.
Finally, with the introduction of HCEM (E), we achieved significant performance improvements while only incurring a slight increase in memory and computational overhead. Compared to the baseline, there was an uplift in mAP, AP50, AP75, and AR by 3.5%, 4.1%, 4.1%, and 3.3%, respectively.
For small performance improvements, it is essential to ensure the reliability and consistency of the incremental gains. To rigorously ascertain the validity of these marginal enhancements and to mitigate potential overfitting or random variations, we utilized k-fold cross-validation. This method offers a comprehensive assessment of the model’s resilience across diverse data subsets. For this validation, we combined the VisDrone training and validation datasets, yielding a total of 7019 images; in each split, 5615 images were used for training and 1404 for validation. Specifically, we applied k-fold cross-validation to models incorporating the TGM-CESC and HCEM modules, and for comparative analysis, we similarly conducted k-fold cross-validation on models without these modules. Table 6, Table 7, and Table 8 display the results, which indicate that the TGM-CESC and HCEM modules substantially improved the detection accuracy of the model.
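For reference, a fold split of the combined image list can be produced as sketched below. The number of folds is not restated here, so k = 5 is used purely as an illustrative assumption (it is consistent with the reported 5615/1404 per-split counts).

```python
from sklearn.model_selection import KFold

def make_folds(image_paths, k=5, seed=0):
    """Split a combined image list (VisDrone train + val, 7019 images in our case)
    into k train/validation folds. k = 5 is assumed for illustration."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    folds = []
    for train_idx, val_idx in kf.split(image_paths):
        folds.append((
            [image_paths[i] for i in train_idx],
            [image_paths[i] for i in val_idx],
        ))
    return folds
```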
4.4.2. Visualization between Baseline and the Efficient YOLOv7-Drone
From the experimental data, it is clear that the Efficient YOLOv7-Drone surpasses the baseline on the VisDrone dataset. To provide a more vivid and direct comparison between the two models, we visualized their prediction results on the VisDrone dataset in Figure 4. The first column displays the ground truths of the images, the second offers predictions from the baseline, and the third highlights the predictions from the Efficient YOLOv7-Drone. To emphasize the distinctions, we magnified the regions with significant prediction differences. Observing the first row, it is evident that our approach markedly reduces false detections compared to the baseline. In the second row, when dealing with scenes filled with small, densely packed objects, our model clearly outperforms the baseline, showcasing superior detection accuracy.
4.4.3. Details of the Efficient YOLOv7-Drone Design
Introduction of the P2 Detection Head. Aerial imagery often contains numerous small objects. When low-resolution images are used as input, these objects can become extremely minute, commonly referred to as “tiny objects”, which pose detection challenges. To address this and enhance the detection of tiny objects, we incorporated features from the Stage 2 output of the Backbone, leveraging their rich and comprehensive representation of small objects. As demonstrated in Table 9, the introduction of the P2 detection head resulted in a 0.9% improvement in detection accuracy for small objects, highlighting its crucial role in our model.
P5 Detection Head Omission: Enhancing Efficiency and Accuracy. As shown in
Table 10, following the introduction of the P2 detection head, we evaluated specific metrics for various detection heads. Notably, the P2 detection head exclusively focuses on detecting small objects, which aligns seamlessly with our initial purpose for its inclusion. Compared to the P4 detection head, the P5 detection head shows only a slight advantage specifically in the detection of larger objects. This difference arises from the initial allocation of anchors. In essence, if the P5 detection head were to be removed, the P4 detection head would be poised to assume its role. Consequently, to balance accuracy and computational efficiency, we opted to exclude the P5 detection head. As demonstrated in
Table 5, this exclusion led to improvements in our model’s mAP, AP50, and AP75 metrics by 0.7%, 1.2%, and 0.5%, respectively. Concurrently, the model achieved a significant 56.29% reduction in the number of parameters, enhancing computational efficiency and improving FPS performance.
Table 11 presents a comparison of various metrics for the P4 detection head before and after the removal of the P5 detection head. As the table illustrates, after removing the P5 detection head, all metrics associated with the P4 detection head display notable improvements; in particular, the large-object metric increases by 10.2%. Although this metric remains marginally lower than that of the P5 detection head before its removal, the enhancement in the other metrics sufficiently compensates for the difference. These experimental results confirm that, without the P5 detection head, the P4 detection head can effectively assume its responsibilities and even surpass its performance.
CBS Module Channel Optimization: To enhance the transfer efficiency of features from the Backbone to the Neck, we employed the CBS module channel optimization method. As shown in Table 12, this optimization led to increments of 1.0% in mAP, 0.7% in AP50, and 1.5% in AP75, along with further gains of 0.9%, 1.0%, and a significant 1.9% in the table’s remaining metrics. These empirical results underscore the effectiveness of the CBS channel optimization in better preserving feature information.
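To illustrate what this optimization amounts to, the sketch below contrasts a CBS (Conv-BatchNorm-SiLU) transition that quarters the channel count with one that only halves it, in line with the channel choices discussed in Section 5. The 512-channel input is an arbitrary example, not the actual Backbone width.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the basic YOLOv7-style block."""
    def __init__(self, in_ch, out_ch, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Backbone-to-Neck transition for a feature map with (for example) 512 channels
x = torch.randn(1, 512, 40, 40)
original = CBS(512, 512 // 4)   # original: channels reduced by a factor of four
optimized = CBS(512, 512 // 2)  # channel-optimized: reduced by half, retaining more information
print(original(x).shape, optimized(x).shape)
```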
4.4.4. Details of the TGM-CESC Analysis
To ensure our model emphasizes the image’s foreground, we integrated the TGM-CESC module.
Table 13 presents results from two configurations: one adding TGM-CESC after RepConv and another replacing RepConv entirely with TGM-CESC. The latter configuration showed slightly better accuracy but experienced a 1% decrease in the AR score. Nonetheless, considering overall detection efficiency, we chose to substitute RepConv with TGM-CESC.
To underscore the significance of pixel-level constraints in TGM for sparse convolution mask generation, we evaluated its efficacy against the AMM module and another method with a mask ratio set to zero. As
Table 14 indicates, the TGM method surpasses the other two techniques. Notably, in the AR metric, TGM achieves a 3.5% improvement over AMM. This improvement results from the pixel-level constraint of TGM, which promotes accurate mask generation while minimizing the chance of masking crucial details. Both AMM and TGM markedly outperform the fixed mask ratio approach, emphasizing the significance of focusing on the image foreground.
To clearly demonstrate our method’s superiority, we visualized the three methods. As depicted in
Figure 5, across different feature layers, the TGM method effectively directs the model’s focus toward the image’s foreground, outperforming the other two methods.
4.4.5. Details of the HCEM
Table 5 reveals that the introduction of the head context-enhanced method (HCEM) resulted in noticeable improvements in the mAP, AP50, AP75, and AR metrics, which increased by 0.6%, 0.8%, 0.4%, and 0.3%, respectively. To offer deeper insight into the HCEM’s impact on our model, we visualized its effects. As shown in Figure 6, the feature information at each detection layer was substantially enriched with the integration of HCEM. This underscores HCEM’s potential to counteract the detail loss that comes with the use of sparse convolution.
5. Discussion and Conclusions
In this study, we addressed three critical challenges in aerial image object detection. Targeting platforms powered by small drones with constrained computational capabilities, we presented the Efficient YOLOv7-Drone, an object detection algorithm offering high precision and real-time performance. Unlike previous studies that primarily relied on computationally intensive high-resolution images, our approach capitalizes on low-resolution inputs, on the premise that an algorithm performing effectively on low-resolution images is well positioned to perform at least as well on high-resolution ones. As a result, our proposed algorithm is marked by low resolution, low latency, and high precision, and it is tailored for drone platforms.
In our analysis, we observed that downscaling high-resolution images often reduces small objects to “tiny” objects; aerial images, which already contain a large number of small targets, therefore end up with an abundance of these tiny objects. To address this, we integrated the P2 detection head, leveraging the detailed information in low-level, high-resolution feature maps to enhance the detection accuracy of tiny objects, although this head does introduce computational overhead. Furthermore, high-level, low-resolution feature maps tend to lose crucial details of tiny objects, a phenomenon that is particularly pronounced in aerial images, and this loss can result in erroneous detections that affect overall accuracy. These insights rendered the P5 detection head superfluous, prompting its removal along with the associated Stage 5 of the Backbone. Subsequently, we discerned that the CBS module, which bridges the Backbone and the Neck, decreases the feature channel count from the Backbone by a factor of four. To safeguard against the potential loss of intricate details and semantic richness, we implemented a CBS channel optimization technique, after which the channel count is reduced by half instead. Additionally, we refined the architecture of the Neck segment to ensure proper channel alignment.
Given the notably low foreground proportion in drone aerial imagery and the potential detrimental effects of excessive background details on detection precision, we integrated the TGM-CESC module. Utilizing the advantages of sparse convolution, this module directs the model’s attention predominantly towards the image’s foreground. As part of this approach, the TGM method was developed to refine the generation of sparse convolution masks.
To address the potential masking of foreground information by sparse convolution, we devised the HCEM module. This module combines detailed, low-level, high-resolution feature maps with high-level, low-resolution maps, enhancing semantic comprehension. This integration helps restore any foreground details potentially masked by sparse convolution.
We validated the efficacy of our methodology through rigorous experiments using two popular drone aerial object detection benchmarks: VisDrone and UAVDT. Compared to other state-of-the-art methods evaluated on these datasets, our approach demonstrated superior performance.
Our approach is distinguished by its attention to the low foreground proportion in aerial images and its proficiency in balancing detection precision with efficiency. Notably, even with lower-resolution images, our method consistently delivers superior detection results. Additionally, while our enhancements were developed with YOLOv7 in mind, their applications are broader. Detectors addressing dense environments and numerous small targets could benefit from integrating our P2 detection head, foregoing the P5 detection head, and adopting the CBS channel optimization to align detection precision with efficiency. Likewise, detectors employing sparse convolution with an emphasis on foreground focus might find our TGM-CESC module advantageous.