Article

Lightweight UAV Small Target Detection and Perception Based on Improved YOLOv8-E

1 School of Mechanical and Electrical Engineering, North University of China, Taiyuan 030051, China
2 Institute of Intelligent Weapons, North University of China, Taiyuan 030051, China
3 School of Mechanical Engineering, North University of China, Taiyuan 030051, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(11), 681; https://doi.org/10.3390/drones8110681
Submission received: 14 October 2024 / Revised: 8 November 2024 / Accepted: 15 November 2024 / Published: 19 November 2024

Abstract

Traditional unmanned aerial vehicle (UAV) detection methods struggle with multi-scale variations during flight, complex backgrounds, and low accuracy, whereas existing deep learning detection methods achieve high accuracy but depend heavily on equipment, making it difficult to detect small UAV targets efficiently. To address these challenges, this paper proposes an improved lightweight high-precision model, YOLOv8-E (Enhanced YOLOv8), for the fast and accurate detection and identification of small UAVs in complex environments. First, a Sobel filter is introduced to enhance the C2f module, forming the C2f-ESCFFM (Edge-Sensitive Cross-Stage Feature Fusion Module), which achieves higher computational efficiency and feature representation capacity while preserving detection accuracy as much as possible by fusing a SobelConv branch for edge extraction with a convolution branch for spatial information. Second, the neck network is built on the HSFPN (High-level Screening-feature Pyramid Network) architecture, and the CAA (Context Anchor Attention) mechanism is introduced to enhance the semantic parsing of low-level features, forming the new CAHS-FPN (Context-Augmented Hierarchical Scale Feature Pyramid Network), which fuses deep and shallow features. This improves the feature representation capability of the model, allowing it to detect targets of different sizes efficiently. Finally, the optimized detail-enhanced convolution (DEConv) technique is introduced into the head network, forming the LSCOD (Lightweight Shared Convolutional Object Detector Head) module, which enhances the generalization ability of the model by integrating a priori information and adopting a shared-convolution strategy. This ensures that the model improves its localization and classification performance without increasing parameters or computational costs, thus effectively improving the detection of small UAV targets. The experimental results show that, compared with the baseline model, the YOLOv8-E model improved mAP@0.5 (mean average precision at IoU = 0.5) by 6.3%, reaching 98.4%, while the model parameter scale was reduced by more than 50%. Overall, YOLOv8-E significantly reduces the demand for computational resources while ensuring high-precision detection.

1. Introduction

In recent years, unmanned aerial vehicles (UAVs) have seen widespread use in both the military and civilian fields owing to their small size, lightweight design, and ease of maneuverability [1,2,3]. In civilian applications, UAVs are helpful in agricultural plant protection, geographic mapping and exploration, electric power inspection, and logistics and transportation [4,5]. In the military field, UAV missions have become increasingly diverse, and the frequency of their use in local conflicts has increased dramatically [6,7]. Owing to their characteristics, such as small radar reflective surfaces, small size, and low heat dissipation, it is difficult to accurately identify and locate small UAVs in complex environments [8,9,10,11,12,13], making it essential to detect and identify low-altitude UAVs effectively.
Existing image-based small target detection techniques can be divided into two main categories: classical methods based on traditional features and advanced methods driven by deep learning. Classical techniques include SIFT, CANNY, DPM [14,15,16], etc. These methods generally first extract features from UAV images using techniques such as directional gradient histograms and scale-invariant feature transforms. Subsequently, algorithms such as support vector machines are used to classify and localize the extracted features. For example, Wang et al. [17] combined the geometric constraints of the scene and the numerical statistical properties of the SIFT features and proposed a geometric transformation-based fast SIFT feature matching algorithm, which solved the problem of low reconstruction accuracy of SfM global reconstruction, poor robustness to external points, and time-consuming incremental reconstruction. The method outperforms traditional tree and hash-based matching methods in terms of time and accuracy but increases the complexity of the algorithm and is not conducive to engineering implementation. Dhal et al. [18] improved the optimization of different HE variants based on nature-inspired optimization algorithms (NIOAs), solving the problem of significant changes in image brightness caused by histogram equalization (HE) techniques, which helps to balance the brightness distribution of images and modify the uniform pixel distribution of grayscale images, making image details clearer and increasing image contrast. Tang et al. [19] extracted local shape features using the Dynamic Threshold Oriented fast and Rotated Brief (DTORB) algorithm combined with the Directed Gradient Histogram (IHOG) algorithm to extract the global texture feature vectors and fused the feature vectors serially to obtain the improved HOG features. Experiments show that the proposed method can describe image details more accurately, and the average accuracy of the feature extraction classifier can reach 93.7%. Zhang et al. [20] introduced an enhanced histogram of oriented gradients (HOG) method to extract vehicle features, addressing the lack of recognizable vehicle silhouettes under low-light conditions. By applying the non-maximum suppression (NMS) method to eliminate the overlapping region and combining the fusion of vertical histograms of oriented gradient symmetric features (V-HOG), the vehicle recognition accuracy in nighttime scenes is significantly improved. Hu [21] proposed a morphological feature-based high-pass filtering and SUSAN fast detection algorithm to address the limited target information acquisition in infrared imaging under complex environments. The experiment demonstrates that this method can effectively detect targets and overcome the influence of complex interference, even with limited sample data. However, its adaptability to complex and changing detection backgrounds requires further improvement.
Deep-learning-based methods provide new solutions for UAV target detection, mainly one-stage algorithms (SSD, YOLO, etc. [22,23,24]) and two-stage algorithms (R-CNN, Faster R-CNN, etc. [25,26]), which automatically learn the feature representations of the target and detection models. Although the detection accuracy of the one-stage algorithms is slightly lower than that of the two-stage algorithms, they have faster detection speeds and are suitable for rapid deployment in various tasks. For example, Zhai et al. [27] proposed a detection method based on an improved YOLOv3 network, which enhances the detection of small targets by adding multi-scale prediction while keeping the basic framework of the original YOLOv3 model unchanged and uses a two-axis gimbal camera combined with a PID algorithm to control to keep the target in the center of the field of view. The improved YOLOv3 network achieved better detection accuracy for low-altitude high-speed UAV detection, better performance, and more accurate detection. Cheng et al. [28] used a lightweight MobileViT network as the backbone combined with coordinate attention-based PANet (CA-PANet) as the feature fusion network to solve the problem of low accuracy and efficiency in multi-scale UAV detection. Liu et al. [29] addressed the issue of the airport environment in the detection of multi-scale UAVs. HollowBox, a free anchor-based UAV detection method, was proposed. By resetting the detection feature layer and redefining the proportion of positive and negative samples, it achieves an average precision (AP) of 90.1%, a false detection rate of 6%, and an inference speed of 17.2 FPS on the test set, meeting the demand of real-time UAV detection in airports. Shi et al. [30] proposed a UAV detection method based on the YOLOv4 model for the problem of low-altitude UAV detection and constructed a sample set of UAV flight attitude images consisting of captured, network downloaded and extended existing data. The results show better performance than other YOLOv4 algorithms in terms of average accuracy and real-time detection speed. Zamri et al. [31] proposed the P2-YOLOv8n-ResCBAM model for small drone detection, which incorporated various attention mechanisms into the YOLOv8n architecture and added a high-resolution detection head, achieving an mAP of 92.6% and outperforming other YOLO models while also effectively detecting drones and distinguishing them from birds under long-range conditions.
Meanwhile, to address the problems of small UAVs occupying fewer pixels in the image, high cost of traditional detection methods, and large influence of equipment, Liu et al. [32] reduced the convolution channels and shortcut layers by pruning YOLOv4 and realized that the pruned YOLOv4 model improved the processing speed by 60.4% while maintaining 90.5% mAP, which effectively realizes the real-time detection of fast-moving small UAVs. Liu et al. [33] replaced the backbone network of YOLOv5 with EfficientLite, introduced adaptive spatial feature fusion technology, and also added an angle constraint in the loss function to improve the convergence speed of the network. The improved YOLOv5 model outperforms the original YOLOv5 model in detection performance on the UAV_data dataset, achieving a significant improvement in detection accuracy with a small increase in parameter count. Zhai et al. [34] proposed an optimized YOLO-Drone network to address challenges in small UAV target detection, such as small target size, use of stealthy materials, low-altitude flight, and complex aerial environments. The network introduces a high-resolution detection head and prunes redundant layers. It also employs SPD-Conv and the GAM attention mechanism, resulting in 11.9%, 15.2%, and 9% improvements in precision, recall, and mAP, respectively, while reducing model size. The YOLO-Drone network also demonstrates strong generalization on a self-built dataset.
Although the effectiveness of the above methods in improving target detection performance has been demonstrated, they still have limitations, such as poor detection performance for small targets, high model complexity, and large memory requirements. Additionally, existing models lack sufficient feature extraction capabilities for small targets, resulting in false or missed detections caused by environmental factors and variable UAV target sizes. To address the low accuracy of small target detection in resource-constrained environments, this study proposes the YOLOv8-E model, which aims to enhance the extraction capabilities of UAV small target detection. Focusing on feature extraction efficiency, small target detection, and the effective use of computational resources, a relationship network between distant pixels was constructed to enhance the spatial information reinforcement and edge perception mechanism. This approach minimizes feature loss information, reduces the probability of leakage detection, and prevents repeated detection errors, ensuring high-precision detection and identification of UAV targets in resource-constrained environments.
The main contributions of this paper are as follows:
(1)
First, the C2f-ESCFFM module is proposed. By integrating the SobelConv branch, it strengthens the spatial information reinforcement capability and edge-aware mechanism of the model, suppresses the loss of feature information from the perspective of edge sensitivity, and significantly reduces the probability of both missed and repeated detections.
(2)
Second, this study applies the CAA (Context Anchor Attention) mechanism to the model and improves the HSFPN (High-level Screening-feature Pyramid Network) accordingly to obtain the CAHS-FPN (Context-Augmented Hierarchical Scale Feature Pyramid Network), which strengthens feature extraction while keeping the model lightweight, enhances the central feature representation by constructing a relationship network between distant pixels, improves computational efficiency, and reduces computational cost.
(3)
Finally, the lightweight small target detection head LSCOD (Lightweight Shared Convolutional Object Detector Head) is introduced, enhancing the multi-scale target detection generalization by integrating a priori information into the standard convolution operation, thereby solving the problem of insufficiently sensitive detail feature capture in UAV small target detection and enhancing the detection ability of small targets.

2. Frameworks

2.1. YOLOv8-E Model

The YOLOv8-E target detection model proposed in this study is based on the smallest model of the YOLOv8 series, YOLOv8n, and improves three aspects of the original model: the backbone, neck, and head networks. For the backbone network, the lightweight Edge-Sensitive Cross-Stage Feature Fusion Module (C2f-ESCFFM) is employed, which fuses the SobelConv branch to extract edge information and the convolutional branch to extract spatial information and can learn richer image features. Compared with the original backbone network C2f module, the C2f-ESCFFM module has a higher computational efficiency and feature representation capability. In the neck network, the High-level Screening-feature Pyramid Network (HSFPN) [35] is used as the base, with the Context Anchor Attention (CAA) [36] introduced to form the new Context-Augmented Hierarchical Scale Feature Pyramid Network (CAHS-FPN) network. This network comprises a multi-scale feature hierarchy, realizes the fusion of deep and shallow features, improves the feature representation ability of the model, and can effectively detect targets of different sizes. For the head network, this paper introduces the Lightweight Shared Convolutional Object Detector Head (LSCOD), which significantly reduces the number of parameters using the shared convolution technique, making the model more lightweight and efficient. In summary, the YOLOv8-E model maintains high detection accuracy while significantly improving computational efficiency through innovative network design, providing strong support for practical application scenarios. In this section, the improved YOLOv8 network model is introduced in three parts, with the model framework shown in Figure 1.

2.2. C2f-ESCFFM Module

In target detection tasks, the C2f (cross-stage-partial) module of YOLOv8 is a critical component of the backbone network owing to its efficiency and lightweight design. The C2f module fuses features from different levels using cross-stage-partial connectivity to enhance the ability of the network to detect multi-scale targets while reducing the computational burden. However, in specific scenarios, such as UAV detection, the generic design of C2f modules struggles to handle the challenges of small target detection, motion blur, complex backgrounds, and variable lighting conditions, limiting their detection accuracy and robustness. To solve the above problems, this study designed a lightweight and efficient Edge-Sensitive Cross-Stage Feature Fusion Module (C2f-ESCFFM), aiming to improve further the detection performance, particularly in challenging application scenarios of UAV detection. The C2f-ESCFFM module, as shown in Figure 2, retains the advantages of the original C2f module but also introduces targeted improvement strategies to meet the specific demands of UAV detection.
The core of the C2f-ESCFFM module lies in its edge sensing mechanism and spatial information enhancement capability. The module integrates a SobelConv branch, which utilizes the classical Sobel filter [37,38,39] to explicitly extract the edge features of the image. The Sobel filter is adept at capturing the drastic changes in pixel intensity within the image and thus can accurately identify edge information, which is crucial for detecting small or indistinct UAV targets. The Sobel filter formula is as follows:
$$Sobel_{out} = \sqrt{(filter_x \ast img)^2 + (filter_y \ast img)^2} \tag{1}$$
$$filter_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \quad filter_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} \tag{2}$$
Additionally, the module also contains an additional convolutional branch (conv_branch) for extracting spatial information from the original image. Unlike the SobelConv branch, which focuses on edge details, the conv_branch preserves the rich texture and structural information of the picture, which is essential for understanding its background and context. The C2f-ESCFFM module fuses the features extracted from the SobelConv branch and the conv_branch. By splicing (concatenating) the outputs of these two branches, the module can construct a comprehensive feature representation containing both critical edge information and detailed spatial features. This fusion strategy significantly enhances the ability of the network to understand complex scenes, particularly in handling small targets, complex backgrounds, and variable lighting conditions, thereby providing a more comprehensive and detailed feature description.
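To make the edge-sensitive fusion concrete, the following PyTorch sketch shows one way the two branches could be combined: a frozen depth-wise Sobel branch computing the gradient magnitude of Eq. (1) and a plain convolutional branch, with their outputs concatenated and projected. The class names, channel handling, and 1 × 1 projection are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class SobelConv(nn.Module):
    """Fixed-weight depth-wise Sobel branch: per-channel gradient magnitude (Eq. (1))."""

    def __init__(self, channels: int):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        gy = gx.t()
        # two frozen kernels (filter_x, filter_y) per input channel
        kernel = torch.stack([gx, gy]).unsqueeze(1).repeat(channels, 1, 1, 1)
        self.filter = nn.Conv2d(channels, 2 * channels, 3, padding=1,
                                groups=channels, bias=False)
        self.filter.weight.data.copy_(kernel)
        self.filter.weight.requires_grad_(False)

    def forward(self, x):
        g = self.filter(x)
        gx_out, gy_out = g[:, 0::2], g[:, 1::2]
        return torch.sqrt(gx_out ** 2 + gy_out ** 2 + 1e-6)  # edge-magnitude features


class ESCFFM(nn.Module):
    """Edge-sensitive fusion: concatenate the Sobel branch with a plain conv branch."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.sobel_branch = SobelConv(in_ch)
        self.conv_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.SiLU())
        self.proj = nn.Conv2d(2 * in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        edge = self.sobel_branch(x)      # edge information
        spatial = self.conv_branch(x)    # texture / structural information
        return self.proj(torch.cat([edge, spatial], dim=1))


# quick shape check
y = ESCFFM(32, 64)(torch.randn(1, 32, 80, 80))
print(y.shape)  # torch.Size([1, 64, 80, 80])
```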

2.3. CAHS-FPN Module

Although lightweight models feature simplified architecture, they often suffer from a loss of accuracy. To mitigate the loss of accuracy, the neck network adopts the HSFPN (High-level Screening-feature Pyramid Network) as the basis. As shown in Figure 3, integrating the CAA (Context Anchor Attention) into a multilevel feature fusion pyramid (HSFPN) architecture to form the CAHS-FPN structure can significantly strengthen the ability of the target detection model to deal with multi-scale targets and enrich feature representation levels.
The HS-FPN is an architecture proposed in the literature [35] that consists of two key components: the feature selection module and the feature fusion module. The channel attention (CA) and Dimension Matching (DM) in the feature selection module match the feature maps of different scales. The HS-FPN uses high-level features as weights, coupled with a channel attention mechanism, to refine the low-level features, achieving effective cross-level information fusion, with particular optimization for the diversity of target sizes. The fusion process of its feature selection is as follows:
$$f_{att} = BL(TConv(f_{high})) \tag{3}$$
$$f_{out} = f_{low} \times CA(f_{att}) + f_{att} \tag{4}$$
Given an input high-level feature stream $f_{high} \in \mathbb{R}^{C \times H \times W}$ and an input low-level feature stream $f_{low} \in \mathbb{R}^{C \times H_1 \times W_1}$, the high-level stream is first expanded by a transposed convolution with a stride of 2 and a 3 × 3 kernel, yielding $\hat{f}_{high} \in \mathbb{R}^{C \times 2H \times 2W}$. Then, to unify the dimensions of the high-level and low-level features, the high-level features are upsampled or downsampled with bilinear interpolation ($BL$) to obtain $f_{att} \in \mathbb{R}^{C \times H_1 \times W_1}$. Subsequently, the CA module transforms the high-level features into corresponding attention weights with which the low-level features are filtered into dimensionally consistent features. Finally, the filtered low-level features are fused with the high-level features to obtain $f_{out} \in \mathbb{R}^{C \times H_1 \times W_1}$, enhancing the feature representation of the model.
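A minimal sketch of the feature selection fusion of Eqs. (3)-(4) is given below, assuming a simple squeeze-style channel attention (CA) block; the exact CA design and other HS-FPN details may differ from this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """CA: squeeze high-level features into per-channel weights (assumed form)."""

    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid())

    def forward(self, x):
        return self.fc(x)


class FeatureSelectionFusion(nn.Module):
    """HSFPN-style fusion of a high-level and a low-level feature map (Eqs. (3)-(4))."""

    def __init__(self, channels: int):
        super().__init__()
        # stride-2 transposed 3x3 conv expands f_high from HxW to 2Hx2W
        self.t_conv = nn.ConvTranspose2d(channels, channels, 3, stride=2,
                                         padding=1, output_padding=1)
        self.ca = ChannelAttention(channels)

    def forward(self, f_high, f_low):
        f_att = self.t_conv(f_high)
        # bilinear resize (BL) so dimensions match the low-level map
        f_att = F.interpolate(f_att, size=f_low.shape[-2:],
                              mode="bilinear", align_corners=False)
        # filter low-level features with CA weights, then fuse with f_att
        return f_low * self.ca(f_att) + f_att
```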
The CAA [36] mechanism focuses on building a network of relationships between distant pixels, strengthening the central features while maintaining computational performance with the efficient design of strip-shaped depth-wise convolutions, ensuring that the model is lightweight and has a high performance. The formula for CAA is as follows.
First, obtain the local region features through average pooling and 1 × 1 convolution as follows:
$$F_{l-1,n}^{pool} = Conv_{1\times1}\left(P_{avg}\left(X_{l-1,n}^{(2)}\right)\right), \quad n = 0, \ldots, N_{l-1} \tag{5}$$
where $P_{avg}$ represents the average pooling operation. When $n = 0$, $X_{l-1,n}^{(2)} = X_{l-1}^{(2)}$.
Then, apply two depth-wise separable strip convolutions as an approximation of standard large-kernel depth-wise separable convolutions as follows:
$$F_{l-1,n}^{w} = DWConv_{1 \times k_b}\left(F_{l-1,n}^{pool}\right) \tag{6}$$
$$F_{l-1,n}^{h} = DWConv_{k_b \times 1}\left(F_{l-1,n}^{w}\right) \tag{7}$$
where $k_b = 11 + 2 \times l$, i.e., the convolution kernel size increases with the depth of the block.
Finally, the CAA module generates an attention weight $A_{l-1,n} \in \mathbb{R}^{\frac{1}{2}C_l \times H_l \times W_l}$, which is used to enhance the output of the block module as follows:
$$A_{l-1,n} = Sigmoid\left(Conv_{1\times1}\left(F_{l-1,n}^{h}\right)\right) \tag{8}$$
$$F_{l-1,n}^{attn} = \left(A_{l-1,n} \odot P_{l-1,n}\right) \oplus P_{l-1,n} \tag{9}$$
where ⊙ represents element-wise multiplication and ⊕ represents element-wise addition.
The integration of the CAA module can significantly enhance the semantic parsing ability of low-level features in the HSFPN architecture, especially in the accurate detection of small targets under complex background conditions. CAA compensates for the subtle information overlooked by the HSFPN during feature fusion by refining local features and accurately modeling long-range dependencies. The lightweight feature of CAA can enhance the overall performance of the model while effectively avoiding the unnecessary waste of computational resources, ensuring high feasibility and efficiency of the model in practical application scenarios.
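The CAA computation of Eqs. (5)-(9) can be sketched as follows; the pooling window, channel handling, and residual form are assumptions made for illustration rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn


class CAA(nn.Module):
    """Context Anchor Attention sketch (Eqs. (5)-(9)); k_b grows with block depth l."""

    def __init__(self, channels: int, block_depth: int = 0):
        super().__init__()
        k = 11 + 2 * block_depth                     # k_b = 11 + 2*l
        self.pool = nn.AvgPool2d(7, stride=1, padding=3)
        self.conv_in = nn.Conv2d(channels, channels, 1)
        # two strip depth-wise convs approximate a large k x k depth-wise conv
        self.dw_h = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2),
                              groups=channels)
        self.dw_v = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0),
                              groups=channels)
        self.conv_out = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        a = self.conv_in(self.pool(x))               # Eq. (5): local-region features
        a = self.dw_v(self.dw_h(a))                  # Eqs. (6)-(7): strip convolutions
        attn = torch.sigmoid(self.conv_out(a))       # Eq. (8): attention weights
        return attn * x + x                          # Eq. (9): (A ⊙ P) ⊕ P
```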

2.4. LSCOD Module

In the original YOLOv8 design, the detection head module is responsible for predicting object locations and categories from the feature map. It transforms the features extracted from the backbone network into outputs that can be used for object detection through a series of convolutional operations. However, when faced with specific targets, such as UAVs, especially in small sizes and complex backgrounds, the original head module has some shortcomings, mainly due to the lack of sensitivity in capturing detailed features and the limited generalization ability in multi-scale target detection. These problems limit detection accuracy, especially in real-time applications where the accurate identification of small targets becomes particularly critical. To solve the above issues, this study proposes the LSCOD (Lightweight Shared Convolutional Object Detector Head) module, which introduces a depth-optimized detail-enhanced convolution (DEConv) [40] that can significantly improve the ability of the model to perceive detailed features, thus enhancing the localization and classification performance of the model without adding additional parameters and computational costs. The structure is depicted in Figure 4.
Ordinary convolution tends to capture the low-frequency information of an image (e.g., smooth regions) while ignoring high-frequency information (e.g., edges and texture). Difference convolution (DC) enhances the representation of high-frequency information by explicitly encoding the gradient priors of the Sobel operator, a traditional local descriptor. In images, detailed features such as edges and contours are closely associated with gradient changes; capturing this gradient information by computing differences between pixel pairs enables the model to perceive detailed changes more acutely. By using difference convolution and ordinary convolution in parallel, DEConv accounts for both low-frequency and high-frequency information, better capturing the overall and detailed features of the image, so that the model does not overlook fine details and its perception of detailed features improves. Combined with reparameterization, rich feature representations are learned through multiple parallel convolutional layers during training; the kernel weights of these parallel layers are updated separately in backpropagation and, exploiting the additivity of convolution, are summed into an equivalent kernel for forward propagation, simplifying the model structure. In the testing phase, these parallel convolutional layers are equivalently converted into a single ordinary 3 × 3 convolutional layer, which preserves efficiency and ensures a fast inference response, thus improving detection accuracy without sacrificing performance. The DEConv structure is shown in Figure 5, and its formula is given in Equation (10).
$$F_{out} = DEConv(F_{in}) = \sum_{i=1}^{5} F_{in} \ast K_i = F_{in} \ast \left(\sum_{i=1}^{5} K_i\right) = F_{in} \ast K_{cvt} \tag{10}$$
where $DEConv(\cdot)$ represents the DEConv operation, $K_i$ ($i = 1, \ldots, 5$) represents the kernels of VC, CDC, ADC, HDC, and VDC, $\ast$ represents the convolution operation, and $K_{cvt}$ represents the converted kernel that combines the parallel convolution groups.
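The additivity exploited by the reparameterization in Eq. (10) can be verified with a short numerical check; random kernels stand in here for the five VC/CDC/ADC/HDC/VDC kernels, whose actual construction encodes gradient priors and is omitted.

```python
import torch
import torch.nn.functional as F

# Five parallel 3x3 kernels (stand-ins for VC, CDC, ADC, HDC, VDC).
c_out, c_in = 16, 16
kernels = [torch.randn(c_out, c_in, 3, 3) for _ in range(5)]
x = torch.randn(1, c_in, 64, 64)

# Training-time view: five parallel convolutions, outputs summed.
y_parallel = sum(F.conv2d(x, k, padding=1) for k in kernels)

# Inference-time view: kernels summed once into K_cvt (Eq. 10), single convolution.
k_cvt = torch.stack(kernels).sum(dim=0)
y_reparam = F.conv2d(x, k_cvt, padding=1)

print(torch.allclose(y_parallel, y_reparam, atol=1e-4))  # True: outputs match
```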
To further reduce model complexity, LSCOD adopts the strategy of shared convolution; that is, multiple detection heads share the same set of convolution kernels, drastically reducing the number of parameters and making the model more compact. However, considering the scale differences faced by different detection heads, this study introduces a scale layer into the design to dynamically adjust the size of the feature maps to ensure the flexibility and accuracy of the model in multi-scale target detection. To address the statistical bias problem that may occur in the batch normalization (BN) layer under the shared-parameter framework, this study adopts an approach similar to the NASFPN [41]; that is, after sharing the convolutional layer, the BN is computed independently for each branch to effectively avoid statistical errors and ensure the stability and detection accuracy of the model.
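A simplified sketch of the shared-convolution idea with per-level BN and a scale layer is shown below; the layer widths, activation, and prediction layout are illustrative assumptions rather than the exact LSCOD structure.

```python
import torch
import torch.nn as nn


class Scale(nn.Module):
    """Learnable per-level scaling used to compensate for shared weights."""

    def __init__(self, init: float = 1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init))

    def forward(self, x):
        return x * self.scale


class SharedConvHead(nn.Module):
    """One conv stack shared across feature levels; BN kept separate per level (NAS-FPN style)."""

    def __init__(self, channels: int, num_levels: int = 3, num_outputs: int = 5):
        super().__init__()
        self.shared_conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.norms = nn.ModuleList(nn.BatchNorm2d(channels) for _ in range(num_levels))
        self.scales = nn.ModuleList(Scale() for _ in range(num_levels))
        self.pred = nn.Conv2d(channels, num_outputs, 1)   # shared prediction layer

    def forward(self, feats):                              # feats: list of level maps
        outputs = []
        for i, f in enumerate(feats):
            f = torch.relu(self.norms[i](self.shared_conv(f)))  # shared weights, per-level BN
            outputs.append(self.pred(self.scales[i](f)))
        return outputs
```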

3. Experiments

3.1. Experimental Basis

The Real World Object Detection Dataset for Quadcopter UAV Detection [42] is a publicly available real-world object detection dataset focused on the detection of quadcopter unmanned aerial vehicles (UAVs), containing 52,676 UAV instances. The image resolutions range from 640 × 480 to 4K. Most of the images contain UAVs, with a small number of negative samples (without UAVs). To avoid overfitting, which would reduce generalization across different scenes, the number of duplicate scene frames in the dataset was reduced, representative and high-quality images were selected, and a dataset of 20,000 UAV images of different types, sizes, locations, and environments was constructed from self-captured images and online collections. The size distribution of the UAV targets is 40% small targets, 35% medium targets, and 25% large targets. We randomly split the dataset into 60% for the training set, 20% for the validation set, and 20% for the test set. Some sample data are shown in Figure 6.
To rigorously evaluate the performance of the target detection models and ensure the reliability and reproducibility of the experimental results, the experimental platform in this work is based on a Windows 10 Professional 64-bit operating system (Microsoft, Beijing, China) using an Intel(R) Core(TM) i5-10600KF processor (Intel, Santa Clara, CA, USA), an NVIDIA GeForce GTX 1050 Ti graphics processing unit (GPU) (Colorful, Shenzhen, China) with 4 GB GDDR5 video memory, and CUDA 11.8 to realize deep learning model training and real-time inference. The experiments were run in a Python 3.9.19 environment, using the PyTorch 2.0.0 deep learning framework to construct and train the models. The relevant experimental parameters are shown in Table 1 below.
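For reference, a training run of this kind could be launched with the Ultralytics API roughly as follows; the configuration file names and hyperparameter values are placeholders, not the settings reported in Table 1.

```python
# Hypothetical training launch; "yolov8-e.yaml" and "uav_dataset.yaml" are assumed names.
from ultralytics import YOLO

model = YOLO("yolov8-e.yaml")          # custom model config with the modified modules
model.train(
    data="uav_dataset.yaml",           # 60/20/20 train/val/test split of the UAV images
    epochs=300,                        # placeholder values, not those of Table 1
    imgsz=640,
    batch=16,
    device=0,                          # single GTX 1050 Ti GPU
)
metrics = model.val()                  # reports precision, recall, mAP@0.5, mAP@0.5:0.95
```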

3.2. Evaluation Metrics

To demonstrate the rigor of this study and make the data comparison more reliable and valuable, the evaluation of YOLOv8-E used comprehensive evaluation metrics, including precision (P), recall (R), mean average precision (mAP), and frames per second (FPS). The mAP can be calculated using the following formulas:
$$Recall = \frac{TP}{TP + FN} \tag{11}$$
$$Precision = \frac{TP}{TP + FP} \tag{12}$$
$$AP = \int_{0}^{1} P(R)\, dR \tag{13}$$
$$mAP = \frac{1}{N} \sum_{n=1}^{N} AP(n) \tag{14}$$
where TP, FP, and FN represent true positives, false positives, and false negatives, respectively. Recall is the ratio of detected objects (TP) to all objects that should have been detected (TP + FN), and precision is the ratio of correctly detected objects (TP) to all detected objects (TP + FP). P(R) is the precision–recall curve, with each IoU threshold corresponding to a P(R) curve. N is the number of classes, and mAP is the average of the AP values over all classes. These metrics have been used similarly in studies such as PASCAL VOC [43] and MS COCO [44], reflecting the effectiveness and generality of the selected evaluation metrics. In addition, to better reflect the lightweight nature of the model, this paper also evaluates the network's performance in the UAV application scenario using the number of model parameters (Params) and the total floating-point operations (GFLOPs) as metrics.
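As a worked illustration of Eqs. (11)-(13), the snippet below computes precision and recall from raw counts and AP as the area under an all-point-interpolated precision–recall curve; the toy numbers are arbitrary and not taken from the experiments.

```python
import numpy as np


def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP as the area under the precision-recall curve (Eq. (13)), all-point interpolation."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # make precision monotonically non-increasing before integrating
    p = np.maximum.accumulate(p[::-1])[::-1]
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))


# Toy counts at one confidence threshold (illustrative only).
tp, fp, fn = 90, 5, 10
precision = tp / (tp + fp)   # Eq. (12)
recall = tp / (tp + fn)      # Eq. (11)
print(f"P={precision:.3f}, R={recall:.3f}")

# Toy precision-recall samples, ordered by decreasing confidence.
rec = np.array([0.2, 0.4, 0.6, 0.8, 0.9])
prec = np.array([1.0, 0.95, 0.9, 0.85, 0.8])
print(f"AP = {average_precision(rec, prec):.3f}")
```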
In addition to the quantitative metrics, this paper also uses the TIDE toolbox [45] for detailed error analysis. TIDE defines six types of detection errors: Classification Error (Cls), Localization Error (Loc), Classification + Localization Error (Cls + Loc), Duplicate detection error (duplicate), Background Error (Bkgd), and Missed Error (Missed). These error types can be individually calculated to determine their impact on mAP. TIDE observes the changes in mAP after correcting each error type, allowing for a quantitative analysis of the model’s strengths and weaknesses. This helps to deeply understand the performance differences between the baseline model and the improved model, providing guidance for future model design.

4. Experimental Results and Analysis

4.1. Ablation and Parallel Testing

To quantify the enhancement effect of the improved modules on the detection performance of the model, a series of ablation experiments was conducted to evaluate the contribution of each module to overall performance systematically. Using YOLOv8n as the base reference model, the impact of each modification was meticulously examined by individually integrating the optimized modules. The C2f-ESCFFM module was first applied, and then the CAHS-FPN module was added. Finally, the LSCOD lightweight detection head was added to complete the final improvement. The effects of different modules on the detection were evaluated under the same experimental conditions, and the results of the relevant ablation experiments are shown in Figure 7 and Table 2.
As shown in Table 2, each improvement improves the detection performance of the network to varying degrees.
Model A indicates that the C2f-ESCFFM module was introduced into the original model. This improvement not only significantly enhanced the computational efficiency of the model but also optimized its performance in complex target detection scenarios. Specifically, after applying the C2f-ESCFFM, the computation of the model was reduced by 0.9 GFLOPs, and the number of parameters was streamlined by 9.4%. In terms of detection performance, the recall (R) was reduced by 4.1%; this trade-off significantly improved target localization accuracy under high IoU thresholds. The mAP@0.5:0.95 index increased by 4.1%, which highlights that the detail awareness of the model when dealing with targets of high shape and location complexity was significantly enhanced. Notably, mAP@0.5 showed only a slight decrease of 1.3%, indicating that the model maintains robust detection capability even under more relaxed IoU thresholds. These results verify that the C2f-ESCFFM module not only further optimizes network performance by streamlining redundant computation while maintaining the original feature extraction effectiveness but also demonstrates significant superiority in target detection under high IoU thresholds.
Model B indicates that the CAHS-FPN module was added to the original model. The introduction of this module effectively reduced the computational burden of the model, with computational cuts of up to 1.2 GFLOPs and a reduction of more than 1.057 × 10^6 parameters, which confirms that the CAA mechanism significantly reduces the computational cost while keeping model complexity controllable for deployment in resource-constrained environments. Although the recall (R) decreased slightly by 2.8%, the CAHS-FPN module improved the mAP@0.5 and mAP@0.5:0.95 metrics by 0.7% and 5%, respectively. This highlights the superior ability of the CAA mechanism to enhance contextual associations among distant pixels and optimize the expression of the central features, exhibiting excellent localization accuracy especially when detecting small targets in complex environments. In summary, the innovative design of the CAHS-FPN module not only optimizes computational efficiency but also significantly improves the performance of the model for small target detection in complex scenes.
Model C indicates that the original detection head was replaced with the LSCOD lightweight detection head. The introduced LSCOD module, combined with the detail-enhanced convolution (DEConv) technique, results in a substantial improvement in the target detection model. The experimental results show that although mAP@0.5 decreased slightly by 0.9%, mAP@0.5:0.95 increased significantly by 4.6%, which highlights the superior ability of the DEConv technique to optimize target localization and classification, especially under high IoU threshold conditions. Additionally, the LSCOD module performed well in resource management, achieving a 24.9% reduction in model parameters and a 2.1 GFLOP reduction in computational load. Despite a sacrifice in detection frame rate, the model remains competitive for high-precision detection tasks. Overall, the LSCOD module proposed in this study effectively manages computational resources while ensuring improved detection accuracy, representing a significant step forward in optimizing target detection models.
The final ablation results strongly validate the effectiveness of the YOLOv8-E design. By reducing redundant computations and optimizing memory usage, this study achieves two key benefits: a significant reduction in model parameters and computational requirements and a significant enhancement in feature extraction. This optimization leads to a substantial increase in detection accuracy on the UAV dataset: the YOLOv8-E model achieves a 6.3% improvement in mAP@0.5, reaching 98.4%, and a remarkable 13.8% improvement in mAP@0.5:0.95. These metrics not only reflect the excellent performance of the model in detecting objects at different scales and orientations but also highlight its robustness in handling complex UAV scenarios. Additionally, compared with the YOLOv8 model, YOLOv8-E has a reduced FPS but still meets real-time detection requirements, while its parameter scale is reduced by more than 50% and its computational load by 2.8 GFLOPs. Taken together, the YOLOv8-E model is capable of real-time and efficient detection of UAV targets in resource-constrained environments.
In parallel experiments, YOLOv7x outperformed the proposed YOLOv8-E algorithm by 0.7%, 2.6%, and 0.9% in terms of precision, recall, and mean average precision, respectively. However, YOLOv7x has 45 times more parameters and 32 times higher GFLOPs compared to YOLOv8-E. The research team believes that although there is a trade-off in performance when deploying real-time object detection tasks in resource-constrained environments, the substantial advantages of YOLOv8-E in model complexity and computational efficiency make this trade-off worthwhile.
In summary, the design strategy of YOLOv8-E significantly improves the detection accuracy of the UAV dataset through its careful reduction in redundancy and enhancement of feature extraction capabilities. The resulting model not only possesses a highly optimized architecture but also demonstrates excellent performance in practical applications, thus highlighting the potential of the method presented in this paper to advance the frontiers of object detection and recognition.
Meanwhile, to evaluate model performance more comprehensively, this paper calculates the average precision values under a series of IoU (Intersection over Union) thresholds, as shown in Table 3, AP@(50–95). AP (average precision) is a commonly used metric for evaluating target detection models; it reflects the model's ability to maintain precision at different recall levels. Lower IoU thresholds may allow the model to detect UAVs whose predicted boxes deviate somewhat from the ground-truth boxes while still being counted as correct detections, whereas higher IoU thresholds require the model to localize UAVs more precisely. By considering multiple IoU thresholds, it is possible to understand the performance of the model in scenarios with different accuracy requirements.
From the data in the table, the final improved model, ours (YOLOv8-E), exhibits excellent performance throughout the entire range of IoU thresholds and, in particular, maintains high average precision (AP) at high IoU thresholds. This indicates that the model balances precision well in both easy-to-detect (high recall) and harder-to-detect (low recall) cases, reduces false positives (non-UAVs misclassified as UAVs) and false negatives (UAVs not detected), and possesses high accuracy and stability in the target detection task. In contrast, the YOLOv8 model, while performing well at low IoU thresholds (e.g., 50%), degrades faster as the IoU threshold increases, showing limitations under high-precision detection requirements. Models "A", "B", and "F" maintain high AP values throughout the IoU range with similar performance, while models "C", "D", and "E" are slightly worse. It is especially worth noting that at higher IoU thresholds (e.g., 80% to 95%), the model is required to locate the UAV more accurately and the overlap between the predicted and ground-truth boxes must be higher, which is more difficult; the AP values of most models drop significantly, but the "Ours" model still maintains a higher accuracy, demonstrating its advantage in accurate detection. Therefore, the improved model, "Ours" (YOLOv8-E), performs better under different IoU thresholds, and the comparison also reveals the shortcomings of the existing YOLOv8 model in high-precision bounding box detection tasks.
Additionally, this study further analyzed the contribution of each error type for each improvement module using the TIDE toolbox. Since only UAVs were detected in this study, the figure shows no classification error. The largest error contributions in the original model come from EBkg and EMiss: the original model has weak detail perception when dealing with targets of high shape and position complexity, localizes targets poorly, and is easily confused by similar objects in the surroundings, resulting in misdetections and missed detections. The final improved model (YOLOv8-E) shows a substantial reduction in EBkg and EMiss errors, although they remain present, suggesting that the algorithm still has room for improvement with respect to missed detections. Overall, EBkg and EMiss were the leading causes of errors in these models, with the original model being more susceptible to them. The heatmap of the error-type contributions is shown in Figure 8. Guided by the TIDE analysis, the model parameters will be further adjusted in the future to improve target localization and enhance the model's detection of UAVs.

4.2. Experiments on Publicly Available Datasets

To evaluate the performance of YOLOv8-E on additional publicly available UAV datasets, the Drone Dataset (UAV) [50], TIB-Net [51], Drone Dataset [52], USC Drone Dataset [53], DUT-Anti-UAV [54], and Drone-vs-bird [55] were selected. These datasets cover a variety of UAV types and environments from multiple perspectives. Table 4 presents the comparative experimental results for the six datasets. The DUT-Anti-UAV dataset contains richer and more complex backgrounds, such as sky, clouds, sea, jungle, buildings, farmland, and city scenes, which makes both the original YOLOv8 and YOLOv8-E perform relatively poorly on this dataset as a whole; however, the precision of the YOLOv8-E model is 1.2% higher than that of YOLOv8, its recall is 9.3% higher, and its mean average precision (mAP) is 5.8% higher, demonstrating the superiority of the improved model. This paper focuses on the overall performance of the model for UAV detection in complex scenarios, and in terms of the mAP metric, the improved YOLOv8-E model shows an advantage over the original model across the above datasets. Figure 9 illustrates the performance of the original and improved models on the six datasets.

4.3. Results Visualization

Finally, in order to verify the effectiveness of the enhancement module, the visualized heatmaps of the original model, the improved model of the backbone network, the improved model of the neck network, and the final improved model are generated according to the Grad-CAM [56] method, as shown in Figure 10.
In the figure, the first-column heatmap (b) corresponds to the original YOLOv8 model, where the darker regions are concentrated only in specific areas of the UAV, while the other regions show lower activation, suggesting that YOLOv8 is easily interfered with by similar objects in the surroundings during feature extraction, resulting in poor detection performance. The fourth-column heatmap (e) corresponds to the YOLOv8-E model proposed in this paper, in which the dark regions are concentrated on the body and supports of the UAV and almost cover the whole UAV, effectively enhancing target feature extraction, suppressing the complex background regions, and making the network focus more on the UAV target; the high level of attention paid by the improved model to the overall structure of the UAV is also visible, indicating greater confidence in detecting UAV targets. From the heatmaps in (c) and (d), the corresponding models still suffer from interference in bright-light environments during UAV detection and identification. From the analysis of the visualization results, it can be concluded that the YOLOv8-E proposed in this paper better reduces interference and improves target detection accuracy.

5. Conclusions

Given the complex challenges posed by the broad application of UAVs in military and civilian fields, this paper proposes a YOLOv8n-based UAV target detection model, YOLOv8-E, which aims to overcome the negative impacts of factors such as shooting-angle changes and complex backgrounds by introducing the C2f-ESCFFM module, the CAHS-FPN mechanism, and the lightweight small target detection head, LSCOD. The C2f-ESCFFM module enhances the spatial information reinforcement capability and edge perception mechanism of the model by integrating the SobelConv branch, which explicitly extracts edge features using the Sobel filter. The CAHS-FPN module further enhances spatial information and edge perception through the fine refinement of local features and accurate modeling of long-range dependencies; by maintaining computational efficiency with a strip-shaped depth-wise convolution design, it ensures a balance between lightweight design and high performance. The LSCOD module enriches the feature representation by integrating a priori information into standard convolutional operations, enhancing the generalization ability of the model while keeping it lightweight. Additionally, a shared convolution strategy was used to reduce the number of parameters, and a scale layer was introduced to dynamically adjust the feature map size, ensuring the flexibility and accuracy of the model in multi-scale target detection. By computing the BN layer independently for each branch, statistical errors are effectively avoided, further improving detection accuracy and resource management efficiency. In this study, a real-world UAV dataset was used for experimental validation, and the feasibility and effectiveness of the proposed method were illustrated from multiple perspectives. The results show that the YOLOv8-E model proposed in this study performs excellently on the detection and recognition tasks: it achieved a 6.3% improvement in mAP@0.5, reaching 98.4%, and a remarkable 13.8% improvement in mAP@0.5:0.95, while the parameter scale of the model was reduced by more than 50% and its computational load by 2.8 GFLOPs. Compared with the other six algorithms, the proposed method achieved the best accuracy, and it also achieved the best detection accuracy across six publicly available datasets. Overall, these results validate the effectiveness of the design strategy and demonstrate that the YOLOv8-E model is capable of efficient deployment and operation in resource-constrained environments.
However, a limitation was encountered during the experiments. The experiments on public datasets show that the model performs relatively poorly on the DUT-Anti-UAV and TIB-Net datasets, so its generalization needs to be improved. In future research, the generalization ability of the model will be evaluated using cross-validation techniques during training to address this problem, and the generalization and compatibility of the algorithm across different datasets will be gradually improved. Specifically, future research should (1) optimize the network structure and explore deeper, more complex neural network architectures to improve the accuracy of feature extraction and classification; (2) strengthen data augmentation strategies, adopting techniques such as rotation, scaling, and cropping to further enrich the training samples; and (3) improve the loss function design with a loss function that dynamically adjusts the weights of small targets, which helps improve the accuracy of UAV target detection and avoids excessive focus on bounding box regression. In addition, model pruning will be further explored to facilitate deployment on mobile device platforms and to further improve the performance and application scope of target detection.

Author Contributions

Conceptualization, validation, and methodology, Y.Z.; methodology and writing—original draft preparation, L.W.; writing—original draft preparation, G.L.; writing—review and editing, C.G. and Q.M.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shanxi Provincial Fundamental Research Program under Grant TZLH20230818005 and by the Shanxi Science and Technology Innovation Leading Talent Team for Special Unmanned Systems and Intelligent Equipment under Grant 202204051002001.

Data Availability Statement

The data used in this analysis are publicly available, and access is provided in the text.

Acknowledgments

The authors would like to thank all coordinators and supervisors involved and the anonymous reviewers for their detailed comments that helped to improve the quality of this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hu, P.; Zhang, R.; Yang, J.; Chen, L. Development status and key technologies of plant protection UAVs in China: A review. Drones 2022, 6, 354. [Google Scholar] [CrossRef]
  2. Daud, S.M.S.M.; Yusof, M.Y.P.M.; Heo, C.C.; Khoo, L.S.; Singh, M.K.C.; Mahmood, M.S.; Nawawi, H. Applications of drone in disaster management: A scoping review. Sci. Justice 2022, 62, 30–42. [Google Scholar] [CrossRef] [PubMed]
  3. Mohsan, S.A.H.; Khan, M.A.; Noor, F.; Ullah, I.; Alsharif, M.H. Towards the unmanned aerial vehicles (UAVs): A comprehensive review. Drones 2022, 6, 147. [Google Scholar] [CrossRef]
  4. Moshref-Javadi, M.; Winkenbach, M. Applications and Research avenues for drone-based models in logistics: A classification and review. Expert Syst. Appl. 2021, 177, 114854. [Google Scholar] [CrossRef]
  5. Guo, Q.; Wu, F.; Hu, T.; Chen, L.; Liu, J.; Zhao, X.; Gao, S.; Pang, S. Perspectives and prospects of unmanned aerial vehicle in remote sensing monitoring of biodiversity. Biodivers. Sci. 2016, 24, 1267. [Google Scholar] [CrossRef]
  6. Chamola, V.; Kotesh, P.; Agarwal, A.; Gupta, N.; Guizani, M. A comprehensive review of unmanned aerial vehicle attacks and neutralization techniques. Ad Hoc Netw. 2021, 111, 102324. [Google Scholar] [CrossRef]
  7. Yaacoub, J.P.; Noura, H.; Salman, O.; Chehab, A. Security analysis of drones systems: Attacks, limitations, and recommendations. Internet Things 2020, 11, 100218. [Google Scholar] [CrossRef]
  8. Wang, C.; Tian, J.; Cao, J.; Wang, X. Deep learning-based UAV detection in pulse-Doppler radar. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–12. [Google Scholar] [CrossRef]
  9. Yang, T.; De Maio, A.; Zheng, J.; Su, T.; Carotenuto, V.; Aubry, A. An adaptive radar signal processor for UAVs detection with super-resolution capabilities. IEEE Sens. J. 2021, 21, 20778–20787. [Google Scholar] [CrossRef]
  10. Kumar, S.; Jain, A.; Rodrigues, C.A.; Dsouza, G.S.; Pooja, N. Gesture control of UAV using radio frequency. AIP Conf. Proc. 2020, 2311, 060003. [Google Scholar]
  11. Arjmandi, Z.; Kang, J.; Park, K.; Sohn, G. Benchmark dataset of ultra-wideband radio based UAV positioning. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 20–23 September 2020; pp. 1–8. [Google Scholar]
  12. Svanström, F.; Englund, C.; Alonso-Fernandez, F. Real-time drone detection and tracking with visible, thermal and acoustic sensors. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 7265–7272. [Google Scholar]
  13. Kang, J.; Park, K.; Arjmandi, Z.; Sohn, G.; Shahbazi, M.; Ménard, P. Ultra-wideband aided UAV positioning using incremental smoothing with ranges and multilateration. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 4529–4536. [Google Scholar]
  14. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  15. Kumar, S.S.; Amutha, R. Edge detection of angiogram images using the classical image processing techniques. In Proceedings of the IEEE-International Conference on Advances in Engineering, Science and Management (ICAESM-2012), Nagapattinam, India, 30–31 March 2012; pp. 55–60. [Google Scholar]
  16. Gangadharan, K.; Kumari, G.R.N.; Dhanasekaran, D.; Malathi, K. Automatic detection of plant disease and insect attack using effta algorithm. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 160–169. [Google Scholar] [CrossRef]
  17. Wang, Y.; Yuan, Y.; Lei, Z. Fast SIFT feature matching algorithm based on geometric transformation. IEEE Access 2020, 8, 88133–88140. [Google Scholar] [CrossRef]
  18. Dhal, K.G.; Das, A.; Ray, S.; Gálvez, J.; Das, S. Histogram equalization variants as optimization problems: A review. Arch. Comput. Methods Eng. 2021, 28, 1471–1496. [Google Scholar] [CrossRef]
  19. Tang, M.; Liang, K.; Qiu, J. Small insulator target detection based on multi-feature fusion. IET Image Process. 2023, 17, 1520–1533. [Google Scholar] [CrossRef]
  20. Zhang, L.; Xu, W.; Shen, C.; Huang, Y. Vision-based on-road nighttime vehicle detection and tracking using improved HOG features. Sensors 2024, 24, 1590. [Google Scholar] [CrossRef]
  21. Hu, Z. A fast target detection method in the sky background. In Proceedings of the Third International Conference on Electronic Information Engineering, Big Data, and Computer Technology (EIBDCT 2024), Beijing, China, 26–28 January 2024; Volume 13181, pp. 490–496. [Google Scholar]
  22. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  23. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  24. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  25. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  26. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  27. Zhai, H.; Zhang, Y. Target Detection of Low-Altitude UAV Based on Improved YOLOv3 Network. J. Robot. 2022, 2022, 4065734. [Google Scholar] [CrossRef]
  28. Cheng, Q.; Li, X.; Zhu, B.; Shi, Y.; Xie, B. Drone detection method based on MobileViT and CA-PANet. Electronics 2023, 12, 223. [Google Scholar] [CrossRef]
  29. Liu, S.; Qu, J.; Wu, R. HollowBox: An anchor-free UAV detection method. IET Image Process. 2022, 16, 2922–2936. [Google Scholar] [CrossRef]
  30. Shi, Q.; Li, J. Objects detection of UAV for anti-UAV based on YOLOv4. In Proceedings of the 2020 IEEE 2nd International Conference on Civil Aviation Safety and Information Technology (ICCASIT), Weihai, China, 14–16 October 2020; pp. 1048–1052. [Google Scholar]
  31. Zamri, F.N.M.; Gunawan, T.S.; Yusoff, S.H.; Alzahrani, A.A.; Bramantoro, A.; Kartiwi, M. Enhanced Small Drone Detection using Optimized YOLOv8 with Attention Mechanisms. IEEE Access 2024, 12, 90629–90643. [Google Scholar] [CrossRef]
  32. Liu, H.; Fan, K.; Ouyang, Q.; Li, N. Real-time small drones detection based on pruned yolov4. Sensors 2021, 21, 3374. [Google Scholar] [CrossRef] [PubMed]
  33. Liu, B.; Luo, H. An improved Yolov5 for multi-rotor UAV detection. Electronics 2022, 11, 2330. [Google Scholar] [CrossRef]
  34. Zhai, X.; Huang, Z.; Li, T.; Liu, H.; Wang, S. YOLO-Drone: An optimized YOLOv8 network for tiny UAV object detection. Electronics 2023, 12, 3664. [Google Scholar] [CrossRef]
  35. Chen, Y.; Zhang, C.; Chen, B.; Huang, Y.; Sun, Y.; Wang, C.; Fu, X.; Dai, Y.; Qin, F.; Peng, Y.; et al. Accurate leukocyte detection based on deformable-DETR and multi-level feature fusion for aiding diagnosis of blood diseases. Comput. Biol. Med. 2024, 170, 107917. [Google Scholar] [CrossRef]
  36. Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly kernel inception network for remote sensing detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 27706–27716. [Google Scholar]
  37. Wang, L.; You, Z.H.; Lu, W.; Chen, S.B.; Tang, J.; Luo, B. Attention-aware Sobel Graph Convolutional Network for Remote Sensing Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4409912. [Google Scholar] [CrossRef]
  38. Hu, K.; Yuan, X.; Chen, S. Real-time CNN-based keypoint detector with Sobel filter and descriptor trained with keypoint candidates. In Proceedings of the Fifteenth International Conference on Machine Vision (ICMV 2022), Rome, Italy, 18–20 November 2022; pp. 231–238. [Google Scholar]
  39. Chang, Q.; Li, X.; Li, Y.; Miyazaki, J. Multi-directional Sobel operator kernel on GPUs. J. Parallel Distrib. Comput. 2023, 177, 160–170. [Google Scholar] [CrossRef]
  40. Chen, Z.; He, Z.; Lu, Z.M. DEA-Net: Single image dehazing based on detail-enhanced convolution and content-guided attention. IEEE Trans. Image Process. 2024, 33, 1002–1015. [Google Scholar] [CrossRef]
  41. Ghiasi, G.; Lin, T.Y.; Le, Q.V. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045. [Google Scholar]
  42. Pawełczyk, M.; Wojtyra, M. Real world object detection dataset for quadcopter unmanned aerial vehicle detection. IEEE Access 2020, 8, 174394–174409. [Google Scholar] [CrossRef]
  43. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  44. Lin, T.; Maire, M.; Belongie, S.J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  45. Bolya, D.; Foley, S.; Hays, J.; Hoffman, J. TIDE: A general toolbox for identifying object detection errors. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 558–573. [Google Scholar]
  46. Han, J.; Ren, Y.F.; Brighente, A.; Conti, M. RANGO: A Novel Deep Learning Approach to Detect Drones Disguising from Video Surveillance Systems. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–21. [Google Scholar] [CrossRef]
  47. Wu, C. A drone detector with modified backbone and multiple pyramid featuremaps enhancement structure (MDDPE). arXiv 2024, arXiv:2405.02882. [Google Scholar]
  48. Zhao, Y.; Ju, Z.; Sun, T.; Dong, F.; Li, J.; Yang, R.; Fu, Q.; Lian, C.; Shan, P. TGC-YOLOv5: An Enhanced YOLOv5 Drone Detection Model Based on Transformer, GAM & CA Attention Mechanism. Drones 2023, 7, 446. [Google Scholar] [CrossRef]
  49. Yasmine, G.; Maha, G.; Hicham, M. Anti-drone systems: An attention based improved YOLOv7 model for a real-time detection and identification of multi-airborne target. Intell. Syst. Appl. 2023, 20, 200296. [Google Scholar] [CrossRef]
  50. Ozel, M. Drone Dataset (UAV). Available online: https://www.kaggle.com/dasmehdixtr/drone-dataset-uav (accessed on 25 December 2021).
  51. Sun, H.; Yang, J.; Shen, J.; Liang, D.; Ning-Zhong, L.; Zhou, H. TIB-Net: Drone detection network with tiny iterative backbone. IEEE Access 2020, 8, 130697–130707. [Google Scholar] [CrossRef]
  52. Aksoy, M.C.; Orak, A.S.; Özkan, H.M.; Selimoglu, B. Drone Dataset: Amateur Unmanned Air Vehicle Detection. Mendeley Data 2019. [Google Scholar] [CrossRef]
  53. Chen, Y.; Aggarwal, P.; Choi, J.; Kuo, C.-C.J. A deep learning approach to drone monitoring. In Proceedings of the 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia, 12–15 December 2017; pp. 686–691. [Google Scholar]
  54. Zhao, J.; Zhang, J.; Li, D.; Wang, D. Vision-based anti-UAV detection and tracking. IEEE Trans. Intell. Transp. Syst. 2022, 23, 25323–25334. [Google Scholar] [CrossRef]
  55. Coluccia, A.; Fascista, A.; Schumann, A.; Sommer, L.; Dimou, A.; Zarpalas, D.; Akyon, F.C.; Eryuksel, O.; Ozfuttu, K.A.; Altinuc, S.O.; et al. Drone-vs-bird detection challenge at IEEE AVSS2021. In Proceedings of the 2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Washington, DC, USA, 16–19 November 2021; pp. 1–8. [Google Scholar]
  56. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Figure 1. YOLOv8-E network architecture. ⊗ indicates that the weights generated by the CAA attention are multiplied with the feature map of the corresponding scale to produce a filtered feature map.
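As a concrete illustration of the weighting step described in the Figure 1 caption, the following minimal PyTorch sketch multiplies attention weights with a feature map of one scale. The module name and the pooling-based attention branch are placeholders for illustration only; the actual CAA module used in CAHS-FPN is more elaborate.

```python
import torch
import torch.nn as nn


class CAAWeighting(nn.Module):
    """Minimal sketch of the weighting step in Figure 1: attention weights
    generated from a feature map are multiplied (the ⊗ in the figure) with
    the feature map of the corresponding scale. The attention branch here is
    a simple placeholder, not the full CAA module."""

    def __init__(self, channels: int):
        super().__init__()
        # Placeholder attention branch: global pooling + 1x1 conv + sigmoid.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.attn(x)   # per-channel weights in (0, 1)
        return x * w       # filtered feature map


if __name__ == "__main__":
    feat = torch.randn(1, 64, 80, 80)        # one feature scale
    print(CAAWeighting(64)(feat).shape)      # torch.Size([1, 64, 80, 80])
```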
Figure 2. C2f-ESCFFM structure.
Figure 3. CAHS-FPN structure.
Figure 4. LSCOD structure.
Figure 5. DEConv structure. VC, ADC, CDC, VDC, and HDC denote the five convolution layers deployed in parallel within the DEConv operation: the standard (vanilla) convolution, angular difference convolution, central difference convolution, vertical difference convolution, and horizontal difference convolution, respectively.
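To make the parallel layout in Figure 5 easier to follow, the sketch below sums the outputs of five convolution branches, mirroring the VC/ADC/CDC/VDC/HDC arrangement. It is only a structural sketch: each branch here is a plain 3 × 3 convolution, whereas the actual DEConv in DEA-Net [40] constrains the kernels of the four difference branches and can re-parameterize all five branches into a single convolution at inference.

```python
import torch
import torch.nn as nn


class DEConvSketch(nn.Module):
    """Illustrative sketch of the DEConv layout in Figure 5: five convolution
    branches (VC, ADC, CDC, VDC, HDC) run in parallel and their outputs are
    summed. Every branch is a plain 3x3 convolution in this sketch; the real
    DEConv applies difference constraints to four of the kernels."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
            for _ in range(5)  # stands in for VC, ADC, CDC, VDC, HDC
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Parallel branches, element-wise sum of their outputs.
        return sum(branch(x) for branch in self.branches)


if __name__ == "__main__":
    x = torch.randn(1, 32, 40, 40)
    print(DEConvSketch(32, 32)(x).shape)  # torch.Size([1, 32, 40, 40])
```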
Figure 6. Sample images from the dataset, which contains UAVs of different sizes against various backgrounds (sky, buildings, trees, occlusions, strong lighting, etc.).
Figure 7. Experimental results. (a) Ablation experiment results; (b) parallel experiment results.
Figure 8. Heatmap of error type contribution. The figure shows the error contributions of the different algorithms analyzed with the TIDE toolbox, where a, b, c, d, e, and f denote YOLOv8, YOLOv8 + C2f-ESCFFM, YOLOv8 + CAHS-FPN, YOLOv8 + LSCOD, YOLOv8 + C2f-ESCFFM + CAHS-FPN, and YOLOv8 + C2f-ESCFFM + CAHS-FPN + LSCOD (ours), respectively. ECls: correctly localized but incorrectly categorized. ELoc: correctly categorized but incorrectly localized. EBoth: incorrectly categorized and incorrectly localized. EDupe: duplicate detection error. EBkg: background detected as foreground. EMiss: missed ground-truth (GT) error.
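For readers who want to reproduce a Figure 8-style error breakdown, the snippet below shows a typical use of the open-source tidecv implementation of TIDE [45]. The annotation and prediction file paths are placeholders, and the exact evaluation setup used by the authors is not specified here.

```python
# Minimal sketch of a TIDE error breakdown (as visualized in Figure 8),
# assuming the open-source `tidecv` package and COCO-format files.
from tidecv import TIDE, datasets

tide = TIDE()
gt = datasets.COCO("annotations/instances_val.json")          # placeholder ground-truth file
preds = datasets.COCOResult("runs/yolov8e_predictions.json")  # placeholder detection results
tide.evaluate(gt, preds, mode=TIDE.BOX)

tide.summarize()  # prints the dAP contribution of Cls, Loc, Both, Dupe, Bkg, Miss errors
tide.plot()       # saves the summary plots
```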
Figure 9. Gain plots. The figure shows the gains of the proposed YOLOv8-E model over the YOLOv8 baseline on six datasets, where A, B, C, D, E, and F denote the DUT-Anti-UAV, TIB-Net, Drone Dataset, USC Drone Dataset, Drone Dataset (UAV), and Drone-vs-Bird datasets, respectively.
Figure 10. Visualized heatmaps of different networks: (a) original image; (b) YOLOv8; (c) YOLOv8 + C2f-ESCFFM; (d) YOLOv8 + C2f-ESCFFM + CAHS-FPN; (e) YOLOv8 + C2f-ESCFFM + CAHS-FPN + LSCOD.
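The heatmaps in Figure 10 are Grad-CAM-style visualizations [56]. The sketch below shows the general recipe with the third-party pytorch-grad-cam package; a torchvision classifier stands in for the detector, and the chosen target layer and input tensor are placeholders rather than the authors' actual setup.

```python
# Hedged sketch of producing Figure 10-style heatmaps with Grad-CAM [56],
# assuming the third-party `pytorch-grad-cam` package. The stand-in model,
# target layer, and random input tensor are placeholders.
import torch
from torchvision.models import resnet18
from pytorch_grad_cam import GradCAM

model = resnet18(weights=None).eval()        # stand-in network; the paper visualizes its detector
target_layers = [model.layer4[-1]]           # a late convolutional stage

cam = GradCAM(model=model, target_layers=target_layers)
input_tensor = torch.randn(1, 3, 224, 224)   # a preprocessed image tensor goes here
grayscale_cam = cam(input_tensor=input_tensor)[0]  # H x W activation map in [0, 1]
print(grayscale_cam.shape)                   # (224, 224); overlay on the image for the heatmap
```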
Table 1. Table of experimental parameters.

Parameter               Numerical Value
Epochs                  200
Workers                 8
Initial learning rate   0.01
Optimizer               SGD
Input image size        640 × 640
Random seed             1
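For orientation, the Table 1 settings map directly onto standard Ultralytics training arguments. The sketch below is an assumed configuration for illustration only: the dataset YAML and the starting weights are placeholders, not the authors' actual files or model definition.

```python
# Hedged sketch of reproducing the Table 1 training settings with the
# Ultralytics API; paths and the baseline weights are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")       # baseline weights; YOLOv8-E would use its own model YAML
model.train(
    data="uav_dataset.yaml",     # placeholder dataset description file
    epochs=200,                  # Table 1: Epochs
    workers=8,                   # Table 1: Workers
    lr0=0.01,                    # Table 1: Initial learning rate
    optimizer="SGD",             # Table 1: Optimizer
    imgsz=640,                   # Table 1: Input image size 640 x 640
    seed=1,                      # Table 1: Random seed
)
```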
Table 2. Ablation experiments and parallel experiments of different models on the Real World Dataset.

Model             Backbone  Neck  Head  P      R      mAP@0.5  mAP@0.5:0.95  Params (M)  GFLOPs  FPS
RANGO [46]        -         -     -     0.795  0.782  0.800    -             17.064      26.8    64
SSD512 [47]       -         -     -     0.755  0.723  0.761    -             27.8        35.5    -
MDDPE [47]        -         -     -     0.800  0.791  0.801    -             3.2         1.5     -
TGC-YOLOv5 [48]   -         -     -     0.959  0.936  0.975    -             19.7        13.4    -
YOLOv5s [48]      -         -     -     0.957  0.919  0.966    -             7.04        16.3    -
YOLOv7x [49]      -         -     -     0.992  0.984  0.993    -             70.840      189.0   -
YOLOv8            -         -     -     0.918  0.874  0.921    0.583         3.152       8.7     78.5
A                                       0.937  0.833  0.908    0.624         2.856       7.8     61.4
B                                       0.966  0.846  0.928    0.633         2.095       7.5     69.0
C                                       0.951  0.845  0.912    0.629         2.367       6.6     51.0
D                                       0.951  0.822  0.909    0.609         1.940       7.1     55.2
E                                       0.947  0.788  0.896    0.614         2.212       6.3     66.7
F                                       0.949  0.869  0.927    0.625         1.729       6.3     73.0
Ours                                    0.985  0.958  0.984    0.721         1.574       5.9     57.4
Table 3. Comparison of the average precision (AP, %) of different models under different IoU thresholds.

Model    AP50   AP55   AP60   AP65   AP70   AP75   AP80   AP85   AP90   AP95
YOLOv8   86.23  85.71  84.39  80.99  75.24  65.83  47.34  28.45  6.80   1.21
A        92.12  91.37  90.92  88.34  83.45  74.18  59.77  39.49  8.55   0.60
B        91.48  90.82  89.56  87.30  81.65  71.51  55.85  29.27  9.06   0.75
C        89.53  88.03  87.63  85.15  79.79  72.40  56.92  36.01  11.85  1.63
D        90.14  89.50  88.47  85.57  82.31  70.71  51.37  30.72  7.71   0.78
E        88.87  87.98  86.76  84.77  80.80  72.56  56.12  31.44  10.93  0.40
F        91.88  91.23  90.33  86.46  82.96  72.57  54.66  30.55  10.63  0.69
Ours     97.61  97.12  95.69  93.39  89.18  81.05  61.97  50.22  22.95  2.22
Table 4. Comparative experiments on different publicly available datasets.

Dataset               Model    P      R      mAP@0.5
DUT-Anti-UAV          YOLOv8   0.777  0.628  0.732
                      Ours     0.789  0.721  0.79
TIB-Net               YOLOv8   0.849  0.774  0.824
                      Ours     0.851  0.807  0.842
Drone Dataset         YOLOv8   0.901  0.864  0.908
                      Ours     0.996  0.923  0.948
USC Drone Dataset     YOLOv8   0.973  0.926  0.971
                      Ours     0.986  0.963  0.992
Drone Dataset (UAV)   YOLOv8   0.843  0.769  0.85
                      Ours     0.815  0.904  0.91
Drone-vs-Bird         YOLOv8   0.951  0.929  0.965
                      Ours     0.98   0.912  0.973