1. Introduction
In recent years, the rapid development of UAV technology has brought about changes in many industries, ranging from agricultural monitoring and disaster assessment to city management, and UAVs have been used increasingly widely [1,2,3,4,5]. As a flexible and efficient aerial platform, UAVs can quickly cover large areas and provide high-resolution images and video data. Due to their small size and flexibility, UAVs play an important role in a variety of fields. Therefore, most studies focus on the detection and processing of images acquired by UAVs but neglect the detection and regulation of UAVs themselves. Due to their small scale, complex backgrounds, target occlusion, and motion blur [6], it is difficult to identify and localize UAVs in images or videos.
Previous UAV detection methods mainly used audio, radar, radio frequency (RF) signals, and computer vision techniques [7]. Hauzenberger et al. applied linear predictive coding (LPC) to detect UAVs through their unique sounds [8], though this approach is easily disturbed by noise. Mohajerin et al. used radar trajectories for detection [9], but radar signals are limited in bad weather. Al-Emadi et al. employed a CNN to analyze RF signals between UAVs and controllers for detection [10].
In recent years, the widespread application of neural networks in the field of computer vision has provided new ideas for UAV detection. Target detection algorithms are mainly divided into traditional manual feature-based detection methods and deep neural network-based detection techniques. Traditional target detection methods rely on manual feature extraction and rule-based classifiers, which have limited performance when dealing with complex scenes and varying environments. The accuracy and robustness of target detection have significantly improved with the advent of deep learning, especially with the development of convolutional neural networks (CNNs).
General object detectors are primarily categorized into two types: single-stage detectors and two-stage detectors. Two-stage detectors typically involve two phases: region proposal for generating candidate boxes and bounding box regression, such as R-CNN [11], Fast R-CNN [12], Faster R-CNN [13], VFNet [14], and CenterNet2 [15]. These detectors tend to perform well in terms of accuracy but often suffer from lower real-time performance. Although many improvements have been made to the R-CNN family, they still fail to completely overcome the speed limitations of two-stage detectors.
In response, researchers combined the two phases into a single step, leading to the development of single-stage detectors such as SSD [16], RetinaNet [17], and YOLO [18]. However, these single-stage detectors often compromise detection accuracy, and directly applying them to UAV detection may lead to suboptimal results. Moreover, most previous research aimed at improving detection accuracy has often overlooked the limited computational power of practical deployment devices.
Our motivation stems from the fact that current general-purpose object detectors cannot be directly applied to drone detection scenarios, particularly due to challenges such as small scale, occlusion, and complex backgrounds, which these detectors often struggle to address effectively. Therefore, we have improved the existing advanced detector YOLOv8 and propose the RTSOD-YOLO model, aiming to enhance detection performance and real-time capability. A new downsampling module was introduced to retain more global and detailed information while reducing the number of parameters.
Additionally, to fully utilize the relationships between pyramid model features, we incorporated a dual TFE module to achieve comprehensive fusion of features at different scales in the pyramid model. We also introduced a scale-sequence feature fusion (SSFF) module to further extract and merge the results of dual encodings into a new small-object detection layer, enriching it with both global and detailed information.
Given that the widespread redundancy in current mainstream CNNs offers a comprehensive understanding of features but generates significant resource overhead, we designed a new efficient feature redundancy generation module. This module maintains redundant information while reducing the model’s parameters. We also introduced reparameterization techniques, further improving inference speed. Finally, to address the issue of small-scale objects being easily occluded, we incorporated an occlusion-aware attention mechanism, which enhances the model’s ability to detect occlusions.
To detect UAV targets accurately and efficiently, this paper proposes the RTSOD-YOLO model, which offers the following main improvements compared with YOLOv8s:
We add an adaptive spatial attention mechanism to the Adown module. After downsampling, the feature maps undergo spatial enhancement, enabling the model to focus on more important regions and reduce attention to irrelevant features.
We introduce a new detection layer that draws inspiration from the TFE and SSFF modules. Given that the feature maps from the P2 layer contain critical information for small object detection, we perform scale-sequence feature fusion based on the P2 feature maps and the outputs from the two TFE modules, resulting in a detection layer specifically focused on small objects.
We design an efficient redundant feature generation module, replacing the original convolution module. This module ensures a comprehensive understanding of features while reducing the model’s parameter count and computational cost. By incorporating reparameterization techniques, the model’s inference speed is further enhanced.
The remainder of this paper is organized as follows: In Section 2, we provide an overview of related work on drone detection techniques, focusing on traditional image processing methods and deep learning approaches. Section 3 details the proposed method, including the enhancements to the TFE and SSFF modules. In Section 4, we present the experimental setup and results, evaluating the performance of our model against state-of-the-art methods. Section 5 concludes the paper and outlines future research directions.
2. Related Work
Unlike counter-UAV detection tasks, current research primarily focuses on target detection and tracking from the perspective of the UAV itself. UAV-based detection and tracking often involve a top-down, bird’s-eye view, which provides a wide field of view but also introduces new challenges such as high-density targets, small objects, and complex backgrounds.
In recent years, significant research has been conducted on UAV detection. The main UAV detection techniques include radar sensors [19,20,21,22,23,24,25], RF sensors [10,26,27,28,29,30], audio sensors [31,32,33,34,35,36], and vision-based detection technologies [37,38,39,40,41,42,43,44,45,46,47,48,49]. A comparison of the characteristics of these detection technologies is presented in Table 1.
(1) Radar is an electromagnetic technology that uses radio waves to detect and locate nearby objects. It can calculate important target features such as distance, speed, azimuth, and elevation [19]. Some moving parts of UAVs produce unique radar echoes, and a method was proposed based on the analysis of these echo signals [23], focusing on developing patterns and features to identify UAVs based on rotor blade types. In [24], the authors analyzed the micro-Doppler characteristics of different UAVs and bird species, confirming that K-band (24 GHz) or millimeter-wave radar systems are effective in detecting UAVs.
(2) RF detection relies on the presence of electronic components on UAVs, such as radio transmitters and GPS receivers, which emit energy detectable by RF sensors. Detection systems based on this technology capture the communication signals between the controller and the UAV, and analyze these signals to detect the UAV [26,27,28]. Nemer et al. proposed a machine-learning-based UAV detection and recognition system, introducing a new layered learning approach that recognizes different types of UAV RF signals through a four-layer classifier [26]. In [27], the authors compared various machine learning techniques for RF detection and verified the good classification performance of the XGBoost algorithm. Some researchers have applied deep learning to this problem; for example, in [10], a simple CNN achieved better detection accuracy. In [29], VGG-16 was trained on compressed RF signals for UAV detection and recognition.
(3) Audio detection is based on sound information. During flight, UAVs produce distinct acoustic features due to their engines, propellers, and aerodynamic properties, which can be used to detect UAV targets. The most common acoustic feature is the noise generated by the propeller blades, which typically has a larger amplitude. By analyzing features such as frequency, amplitude, modulation, and duration, UAVs can be identified. This research area includes both machine learning and deep learning approaches. In [32], the authors developed a simple and effective UAV detection system using machine learning, classifying targets with balanced random forests and multilayer perceptrons. Tejera-Berengue et al. studied the distance dependency of various machine-learning-based UAV detection systems, demonstrating the superior detection performance of random forests and the strong performance of linear classifiers across distances.
(4) Vision-based detection is mainly divided into traditional handcrafted feature extraction and deep learning methods. In recent years, deep learning has demonstrated superior performance in various fields, and most recent studies have adopted deep learning techniques [37,38,39,40,41,42,43,44,45,46,47,48,49]. These methods are generally classified into two types: two-stage detectors and single-stage detectors. In [37], a two-stage detector was used, employing a Mask R-CNN with two backbones, and it was shown that a ResNet-50 backbone outperforms MobileNet. However, while this method provides good detection accuracy, its detection speed is relatively low. Single-stage detectors, on the other hand, offer better speed performance.
In [38], the authors proposed the AD-YOLOv5s algorithm to detect low-altitude UAVs. First, the detection of small targets is addressed through feature enhancement. Second, the Ghost module and depthwise separable convolution (DSConv) are used to remove a large number of redundant parameters from YOLOv5s, which enables the network to run on embedded devices and perform inference at high speed. In contrast, some researchers have set aside the real-time requirement and pursued maximum detection accuracy. The authors of [39] compared the performance of YOLOv4, YOLOv5, and DETR in terms of accuracy and speed. In [40], the authors created a low-visibility UAV flight dataset and proposed an improved YOLOv5 algorithm. It obtains more texture and contour information for small targets by adding a new scale to the model and performs multiscale feature fusion to reduce the loss of information. Regarding the balance between detection speed and accuracy, Selvis S.S. et al. verified the higher real-time performance of YOLOv5 compared to YOLOv4 and addressed the challenge of achieving high-precision detection [41].
With the continuous development of deep learning technologies, multiple variants based on the YOLO architecture have emerged to address issues such as detection accuracy, speed, and complexity across different application scenarios. PP-YOLOE [42], proposed by PaddlePaddle, is one such variant that significantly enhances detection efficiency through multiprecision training and mixed-precision inference. It offers various configurations from small to large, catering to different computational resource environments. PP-YOLOE maintains high detection accuracy while achieving rapid inference speeds, making it highly suitable for real-world deployment. On the other hand, DAMO-YOLO [43], developed by Alibaba's DAMO Academy, employs neural architecture search (NAS) technology and innovative modules like RepGFPN and AlignedOTA to boost detection efficiency and precision. This model, available in different configurations such as tiny, small, and medium, reduces parameter count and inference time, making it ideal for edge devices and resource-constrained environments. YOLO-MS [44] focuses on addressing multiscale object detection challenges, particularly optimizing small object detection by fusing information from multiple feature layers, thus improving performance when detecting small targets. These model variants leverage innovative network designs, feature fusion strategies, and optimization techniques to enhance the overall performance of YOLO, further extending its applicability across various use cases.
There are also many studies that have investigated different datasets for UAV detection [45,46]. In [47], the authors collected a total of 2395 images of drones and birds from publicly available online resources for training, and Zhao et al. proposed a visible-light dataset called the Dalian University of Technology Anti-UAV dataset (DUT Anti-UAV). It contains a detection dataset with a total of 10,000 images and a tracking dataset with 20 videos that include short-term and long-term sequences [48].
3. Methodology
YOLOv8 [50] is one of the most advanced real-time detectors; it consists of three parts: backbone, neck, and head. This paper proposes RTSOD-YOLO, an improved network based on YOLOv8, for UAV detection. We have made several improvements to address issues in the network related to downsampling, feature fusion, and model speed. First, we replaced the original convolution module with an adaptive Adown module for downsampling, allowing the model to retain rich information from higher layers. Next, we designed an efficient redundant feature generation module to preserve the comprehensive understanding of input features while reducing computational resources. Additionally, we employed a dual TFE module in a serial–parallel structure to perform two encodings, merging global information from deeper features and texture information from shallower features. This is followed by a new SSFF module that fuses these with the P2 features, forming a new detection layer specifically designed for small object detection, thereby improving the model's capability in detecting small targets. Finally, we introduced an occlusion-aware attention module in each detection head, enhancing the network's ability to detect objects under occlusion. The architecture of the network is shown in Figure 1.
3.1. Adaptive Adown Downsampling
YOLOv8 extracts five levels of feature maps with different scales, P1–P5, corresponding to 1/2, 1/4, 1/8, 1/16, and 1/32 of the input resolution. However, P3, P4, and P5 are mainly used for feature fusion, which is followed by detection. These feature maps of different scales have different receptive fields and, therefore, contain different semantic and location information. Larger feature maps contain sufficient location information but not enough semantic information. In YOLOv8, downsampling is achieved solely through convolution to extract deeper features, but this loses much important feature information, and the loss grows as downsampling is repeated. Therefore, this paper introduces the Adown [51] downsampling module, whose structure is shown in Figure 2.
The approach first employs average pooling to reduce the size of the feature map. Subsequently, it applies convolution and max pooling to the two halves of the split channels. Finally, it concatenates the two components to form the final downsampled feature map. Max pooling retains the most prominent local activations, such as strong edges, textures, and key features that are important for fine-grained recognition, while average pooling provides a more generalized and smooth representation by capturing broader patterns and reducing the influence of noise. Thus, we use convolutions to integrate the sharp, high-activation features from max pooling with the smooth, global context captured by average pooling.
To avoid potential inconsistencies between the outputs of the two pooling methods, the Adown module employs a learnable feature weighting mechanism. This allows the network to dynamically adjust the contributions of max pooling and average pooling based on the specific task or input, ensuring that the most relevant information is retained in the downsampled feature maps.
We represent the features obtained by the Adown module as $F \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels and $H$ and $W$ are the height and width of the feature map. We then apply both global average pooling and global max pooling along the channel dimension to capture different aspects of the feature map:

$$F_{avg} = \mathrm{AvgPool}(F), \qquad F_{max} = \mathrm{MaxPool}(F)$$

The results of average pooling and max pooling are concatenated along the channel dimension to form a feature map:

$$F_{cat} = \mathrm{Concat}(F_{avg}, F_{max})$$

We apply a convolution to the concatenated feature map to generate a spatial attention weight map:

$$M_{s} = \sigma(\mathrm{Conv}(F_{cat}))$$

where $\sigma$ is the sigmoid activation function, which normalizes the attention map to the range $[0, 1]$. The original input feature map $F$ is multiplied element-wise with the spatial attention map $M_{s}$:

$$F' = F \odot M_{s}$$

where $\odot$ denotes element-wise multiplication.
By leveraging the method above, the Adaptive Adown module ensures that downsampling not only reduces the spatial dimensions but also retains more information focused on key areas. This leads to more robust feature maps that maintain a balance between local details and global context.
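For clarity, the sketch below illustrates one possible PyTorch implementation of the Adaptive Adown block described above. The branch layout (average pooling followed by a channel split into a strided-convolution branch and a max-pooling branch) follows the description of Figure 2, while the spatial-attention kernel size (7 × 7 here) and the exact channel arithmetic are assumptions for illustration rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel-wise avg/max pooling -> conv -> sigmoid, applied as a spatial mask."""
    def __init__(self, kernel_size: int = 7):  # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg_map = torch.mean(x, dim=1, keepdim=True)        # average over channels
        max_map, _ = torch.max(x, dim=1, keepdim=True)      # max over channels
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                                     # element-wise re-weighting

class AdaptiveADown(nn.Module):
    """Adown-style downsampling followed by the adaptive spatial attention."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        half = c_out // 2
        self.avg = nn.AvgPool2d(2, 1)                                     # light smoothing first
        self.conv1 = nn.Conv2d(c_in // 2, half, 3, stride=2, padding=1)   # strided-conv branch
        self.maxpool = nn.MaxPool2d(3, stride=2, padding=1)               # max-pooling branch
        self.conv2 = nn.Conv2d(c_in // 2, half, 1)
        self.attn = SpatialAttention()

    def forward(self, x):
        x = self.avg(x)
        x1, x2 = x.chunk(2, dim=1)                           # split channels into two halves
        y = torch.cat([self.conv1(x1), self.conv2(self.maxpool(x2))], dim=1)
        return self.attn(y)                                  # spatial enhancement after downsampling
```

In this sketch, `AdaptiveADown(c_in, c_out)` would take the place of a strided convolution in the backbone; the attention step re-weights the downsampled map so that salient regions are emphasized.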
3.2. Reparameter Redundant Feature Generation Module
In a GhostNet study [52], it is noted that CNNs often compute intermediate feature maps that exhibit significant redundancy, i.e., the presence of highly similar feature maps. An ample amount of redundant information in the feature maps of well-trained deep neural networks often ensures a thorough understanding of the input data, and the effective utilization of feature map redundancy plays a crucial role in the performance of CNNs. However, because such redundancy is widespread in the intermediate feature maps of mainstream CNNs, we aim to reduce the resources required to produce it, such as the convolution filters used to generate these maps. The Ghost Module is designed based on this concept, aiming to generate a portion of the redundant feature maps using computationally “cheap” convolution operations. By doing so, the Ghost Module reduces computational overhead and the number of model parameters.
From this perspective, we propose the Reparameterization Feature Redundancy Generation Block (RFR-Block), shown in Figure 3b, as a replacement for the C2f module in the network. This block generates redundant features using more cost-efficient operations, thereby reducing both the number of parameters and the FLOPs of the model.
To compensate for the accuracy loss caused by discarding the original BottleNeck module, we employed the reparameterization technique RepConv. RepConv offers a way to convert the multibranch structure used during training into a more efficient single-branch structure during inference, which allows for faster inference speed and helps to improve the real-time performance of our model. The structure of RepConv is illustrated in Figure 4, and its key concept involves merging the convolution and normalization layers into a single convolution. The normalization process can be expressed by Equation (6):

$$\mathrm{BN}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta \tag{6}$$

Think of this as being of the form $\mathrm{BN}(x) = W_{bn} \cdot x + b_{bn}$, where $W_{bn} = \frac{\gamma}{\sqrt{\sigma^{2} + \epsilon}}$ and $b_{bn} = \beta - \frac{\gamma \mu}{\sqrt{\sigma^{2} + \epsilon}}$, so the process of convolution and normalization can be represented as Equation (7):

$$\mathrm{BN}(\mathrm{Conv}(x)) = W_{bn} \cdot (W_{conv} * x + b_{conv}) + b_{bn} = (W_{bn} \cdot W_{conv}) * x + (W_{bn} \cdot b_{conv} + b_{bn}) \tag{7}$$

The conversion of the 1 × 1 convolution and the two unprocessed branches into a 3 × 3 convolution for fusion can be expressed as follows:

$$W_{fuse} = \sum_{k} \mathrm{Pad}_{3 \times 3}\big(W_{bn}^{(k)} \cdot W_{conv}^{(k)}\big), \qquad b_{fuse} = \sum_{k} \big(W_{bn}^{(k)} \cdot b_{conv}^{(k)} + b_{bn}^{(k)}\big)$$

where fuse denotes the convolutional parameters obtained after fusing the convolution and normalization of every branch, and $\mathrm{Pad}_{3 \times 3}(\cdot)$ zero-pads each branch kernel to 3 × 3.
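As a concrete illustration of Equations (6) and (7), the following sketch folds a batch-normalization layer into the weights and bias of the preceding convolution. This is the standard conv–BN fusion on which RepConv-style reparameterization is built; it assumes a plain `Conv2d` followed by `BatchNorm2d` and is not the authors' exact fusion code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a single Conv2d whose output equals bn(conv(x)) for any input x."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding,
                      conv.dilation, conv.groups, bias=True)
    # W_bn = gamma / sqrt(var + eps), applied per output channel (Equation (7)).
    w_bn = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * w_bn.reshape(-1, 1, 1, 1))
    # b_fuse = W_bn * (b_conv - mu) + beta
    b_conv = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_(w_bn * (b_conv - bn.running_mean) + bn.bias)
    return fused

# Quick numerical check (eval mode so BN uses its running statistics).
conv, bn = nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16)
bn.eval()
x = torch.randn(1, 8, 32, 32)
assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)
```

The multibranch-to-single-branch conversion then amounts to fusing each branch this way, zero-padding the 1 × 1 (and identity) kernels to 3 × 3, and summing the resulting weights and biases.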
In the subsequent steps, a cost-effective convolution operation is employed to generate redundant feature maps. Following this, the feature maps undergo downsampling using a convolution. Multiple feature maps are then concatenated to form the new RFR-Block module.
We constructed this efficient redundant feature generation module using cost-effective operations and reparameterization techniques. This module retains the comprehensive understanding provided by redundant features while reducing the computational resources required by the model. Additionally, the reparameterization technique not only compensates for some of the lost accuracy but also allows the model to achieve faster inference speed.
The proposed RFR-Block module consists of the following components:
A convolution is used to adjust the number of channels (where the adjustment is influenced by different scale coefficients according to the size of the model), and the result is then divided into two parts along the channel dimension.
One part remains unprocessed, while the other part undergoes feature extraction through RepConv to compensate for the loss of accuracy; in addition, cost-effective convolutions are used to generate the necessary redundant features.
All intermediate feature maps are concatenated and passed through a final convolution layer, which maps the channels to the output channel count, yielding the final output.
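The following PyTorch sketch shows one way the RFR-Block described above could be assembled. The split ratio, the kernel size of the “cheap” convolution, and the use of a depthwise convolution for redundant-feature generation are assumptions for illustration; the RepConv branch is shown in its training-time (multibranch) form and would be fused for inference as described earlier.

```python
import torch
import torch.nn as nn

class RepConvTrain(nn.Module):
    """Training-time RepConv: parallel 3x3 and 1x1 conv-BN branches (fused at inference)."""
    def __init__(self, c: int):
        super().__init__()
        self.conv3 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c))
        self.conv1 = nn.Sequential(nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c))
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.conv3(x) + self.conv1(x))

class RFRBlock(nn.Module):
    """Sketch of the RFR-Block: split -> identity / RepConv + cheap redundant features -> concat."""
    def __init__(self, c_in: int, c_out: int, hidden: int = None):
        super().__init__()
        hidden = hidden or c_out // 2                 # width assumed to follow the model scale
        self.reduce = nn.Conv2d(c_in, 2 * hidden, 1)  # channel adjustment before the split
        self.rep = RepConvTrain(hidden)
        # "Cheap" operation: a depthwise 3x3 conv generating redundant maps (assumed choice)
        self.cheap = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU())
        self.fuse = nn.Conv2d(3 * hidden, c_out, 1)   # final mapping to the output channels

    def forward(self, x):
        skip, main = self.reduce(x).chunk(2, dim=1)   # one part untouched, one part processed
        main = self.rep(main)
        redundant = self.cheap(main)                  # cheaply generated redundant features
        return self.fuse(torch.cat([skip, main, redundant], dim=1))
```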
3.3. Small Target Detection with SSFF and TFE
In YOLOv8, the classical feature pyramid model is employed for feature fusion. However, the network uses simple concatenation and summation to fuse the pyramid features, without fully exploiting the connections between them. To address this limitation, we introduce the SSFF module and the TFE module [53]. The TFE module aims to improve feature fusion by segmenting features from three different scales: large, medium, and small. It includes the addition of large-scale feature maps and feature zooming to refine the detailed feature information. Figure 5 illustrates the structure of the TFE module.
Prior to feature encoding, the number of feature channels at each scale is adjusted to match the main feature map by the convolution module. Subsequently, the small-scale feature map undergoes upsampling using nearest-neighbor interpolation, which preserves the local strong semantic feature information from the low-resolution image. Conversely, the large-scale feature maps are downsampled through a combination of maximum pooling and average pooling. This downsampling approach aims to retain the global location information from the high-resolution image, along with the diverse feature information of the target image. Finally, the three scales of features are concatenated in the channel dimension.
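A minimal PyTorch sketch of the TFE idea follows: the small-scale map is upsampled with nearest-neighbor interpolation, the large-scale map is downsampled with a mix of max and average pooling, and the three aligned scales are concatenated along the channel dimension. The channel widths and the equal-weight pooling blend are illustrative assumptions, not the authors' exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TFE(nn.Module):
    """Triple feature encoding: align large/medium/small scales and concatenate them."""
    def __init__(self, c_large: int, c_mid: int, c_small: int, c: int):
        super().__init__()
        # 1x1 convs bring every scale to the channel width of the main (medium) map
        self.align_l = nn.Conv2d(c_large, c, 1)
        self.align_m = nn.Conv2d(c_mid, c, 1)
        self.align_s = nn.Conv2d(c_small, c, 1)

    def forward(self, f_large, f_mid, f_small):
        h, w = f_mid.shape[2:]
        l = self.align_l(f_large)
        # downsample the large-scale map with a blend of max and average pooling
        l = 0.5 * F.adaptive_max_pool2d(l, (h, w)) + 0.5 * F.adaptive_avg_pool2d(l, (h, w))
        m = self.align_m(f_mid)
        # upsample the small-scale map with nearest-neighbor interpolation
        s = F.interpolate(self.align_s(f_small), size=(h, w), mode="nearest")
        return torch.cat([l, m, s], dim=1)  # concatenate along the channel dimension
```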
In the SSFF module, which is illustrated in Figure 6, a 1 × 1 convolution is first applied to the P4 and P5 feature levels to reduce their channels to 256. Then, nearest-neighbor interpolation adjusts their spatial dimensions to match P3. Next, the unsqueeze operation expands the feature maps along a new axis, changing them from 3D (height, width, channels) to 4D (depth, height, width, channels). After that, the 4D feature maps are concatenated along the depth dimension to form a combined scale-sequence volume. Finally, 3D convolutions, followed by 3D batch normalization and SiLU activation, are applied to extract and refine the scale-sequence features. This approach effectively combines the high-dimensional information from deep feature maps with the detailed information from shallow feature maps.
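The sketch below mirrors that description in PyTorch: P4 and P5 are reduced to 256 channels, resized to P3's resolution, stacked with P3 along a new depth axis, and processed with a 3D convolution, 3D batch normalization, and SiLU. The 3D kernel size, the assumption that P3 already carries 256 channels, and the final collapse of the scale axis are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSFF(nn.Module):
    """Scale-sequence feature fusion across three pyramid levels (sketch)."""
    def __init__(self, c_p3: int = 256, c_p4: int = 512, c_p5: int = 1024, c: int = 256):
        super().__init__()
        self.reduce_p4 = nn.Conv2d(c_p4, c, 1)
        self.reduce_p5 = nn.Conv2d(c_p5, c, 1)
        self.proj_p3 = nn.Conv2d(c_p3, c, 1)
        self.conv3d = nn.Sequential(
            nn.Conv3d(c, c, kernel_size=3, padding=1),   # 3D kernel size assumed
            nn.BatchNorm3d(c),
            nn.SiLU())

    def forward(self, p3, p4, p5):
        h, w = p3.shape[2:]
        f3 = self.proj_p3(p3)
        f4 = F.interpolate(self.reduce_p4(p4), size=(h, w), mode="nearest")
        f5 = F.interpolate(self.reduce_p5(p5), size=(h, w), mode="nearest")
        # stack the three scales along a new depth axis: (B, C, D=3, H, W)
        seq = torch.stack([f3, f4, f5], dim=2)
        seq = self.conv3d(seq)
        return seq.mean(dim=2)   # collapse the scale axis back to a 2D map (assumed)
```

In this paper's design, the same fusion idea is shifted to the P2 level and fed by the two TFE outputs, producing the small-object detection layer described next.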
The SSFF module is primarily designed based on the P3 feature layer, but since our network is aimed at detecting UAVs, which are typically small-scale targets, we applied scale-sequence fusion to the P2 feature layer. By integrating the output from the P2 feature layer with two TFE modules, we created a scale space that fuses shallower features—critical for small object detection. This fusion output serves as a new detection layer, enhancing the model’s capability to detect small-scale targets.
3.4. Separated and Enhancement Attention Module
To address issues caused by occlusion between targets, this paper introduces a novel occlusion-aware attention mechanism called the Separated and Enhancement Attention Module (SEAM) [54]. The SEAM, depicted in Figure 7, seeks to mitigate problems such as alignment errors and feature loss resulting from occlusion.
The SEAM incorporates a multistep approach to tackle the challenges posed by occlusion between targets. Firstly, it applies depthwise separable convolution with residuals to process the inputs. Depthwise convolutions apply a single filter to each input channel independently, which significantly reduces the number of parameters and computations compared to traditional convolutions. This parameter efficiency makes the model lighter and faster, which is particularly beneficial for real-time applications. Furthermore, by processing each channel separately, depthwise convolutions enable the model to learn specific features tailored to individual channels, enhancing the effectiveness of feature extraction without introducing excessive computational complexity.
However, depthwise convolution neglects the relationships between channels. To compensate for this loss, the outputs of the depthwise separable convolutions are subsequently combined through pointwise (1 × 1) convolution. Pointwise convolutions combine the outputs of depthwise convolutions across different channels, allowing for effective mixing of information. This capability is crucial for capturing interactions between various feature maps, leading to richer and more complex feature representations. Additionally, pointwise convolutions can adjust the number of channels, providing flexibility in the architecture. This characteristic is particularly useful for adapting the model to different input sizes or for reducing dimensionality after depthwise operations, ultimately contributing to improved model performance.
Then, a two-layer fully connected network is used to fuse the information from each channel, enabling the network to enhance the connections between all channels. The outputs derived from the fully connected layer are then subjected to an exponential function, mapping them from the range [0, 1] to [1, e]. This mapping facilitates improved tolerance towards positional errors. Finally, the module’s output is multiplied with the original input features, enabling the model to effectively handle target occlusion.
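A compact PyTorch sketch of the SEAM pipeline described above is given below: a depthwise separable convolution with a residual connection, a two-layer fully connected channel-mixing stage, an exponential mapping of the attention weights from [0, 1] into [1, e], and multiplication with the original input. Hidden sizes, activations, and normalization placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEAM(nn.Module):
    """Separated and Enhancement Attention Module (sketch)."""
    def __init__(self, c: int, reduction: int = 4):
        super().__init__()
        # depthwise separable convolution with a residual connection
        self.dw = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),
            nn.BatchNorm2d(c), nn.GELU(),
            nn.Conv2d(c, c, 1, bias=False),          # pointwise (1x1) channel mixing
            nn.BatchNorm2d(c), nn.GELU())
        # two-layer fully connected network over channel descriptors
        self.fc = nn.Sequential(
            nn.Linear(c, c // reduction), nn.ReLU(inplace=True),
            nn.Linear(c // reduction, c), nn.Sigmoid())
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        y = self.dw(x) + x                           # residual depthwise-separable branch
        w = self.fc(self.pool(y).flatten(1))         # per-channel weights in [0, 1]
        w = torch.exp(w).view(x.size(0), -1, 1, 1)   # map [0, 1] -> [1, e] for error tolerance
        return x * w                                 # re-weight the original input features
```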
4. Experiments
Based on the improvements described above, we proposed the RTSOD-YOLO model and validated its performance through several experiments, including ablation studies and comparative experiments. In this section, we present and discuss our experimental setup, datasets, evaluation metrics, and experimental results.
4.1. Evaluation Metrics
In deep learning, particularly in tasks like classification, object detection, and segmentation, evaluation metrics play a crucial role in measuring the performance of models. Below are some commonly used evaluation metrics:
(1) Accuracy is the ratio of correctly predicted instances to the total instances, used to evaluate the overall performance of classification models:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where true positive (TP) is the number of instances correctly predicted as positive, true negative (TN) is the number of instances correctly predicted as negative, false positive (FP) is the number of instances incorrectly predicted as positive (Type I error), and false negative (FN) is the number of instances incorrectly predicted as negative (Type II error).
(2) Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.
(3) Recall is the ratio of correctly predicted positive observations to all actual positives.
(4) Average precision (AP): In object detection, AP is the area under the precision–recall (PR) curve, measuring performance at different confidence thresholds.
(5) Mean average precision (mAP) is the mean of the average precision scores across all object classes in a multiclass detection task. mAP is computed by averaging the AP of each object class to obtain the final mAP score:

$$\mathrm{mAP} = \frac{1}{C} \sum_{i=1}^{C} AP_{i}$$

where $C$ is the number of classes and $AP_{i}$ is the average precision for class $i$.
(6) Intersection over union (IoU) measures the overlap between the predicted bounding box and the ground truth bounding box.
(7) Frames per second (FPS) measures how many frames the model can process per second, indicating its real-time performance. FPS is computed by dividing the number of frames processed by the algorithm by the overall time taken by the algorithm.
(8) Confusion matrix is a table that summarizes the performance of a classification algorithm by comparing the predicted labels with the actual labels. It consists of four components: TP, TN, FP, and FN, as defined above.
(9) Number of parameters is the total number of learnable parameters in a model, indicating its complexity.
(10) Floating point operations (FLOPs) measure the number of floating-point operations required to perform one forward pass of the model.
Accuracy, precision, recall, and F1-score are basic metrics for classification models, while AP and mAP are typically used in object detection tasks. IoU evaluates how well bounding boxes match in object detection. FPS measures the real-time performance of a model, and the confusion matrix visually displays true and false positives/negatives. The number of parameters and FLOPs are used to evaluate model complexity and computational efficiency.
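To make the box-level metrics concrete, the short sketch below computes IoU for axis-aligned boxes in (x1, y1, x2, y2) format and derives precision and recall from TP/FP/FN counts; it is an illustrative helper, not the evaluation code used in the experiments.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)

# example: a prediction is typically counted as TP when IoU >= 0.5 with a ground-truth box
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # ~0.143
```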
4.2. Dataset
To ensure the model’s inference and detection feasibility in real-world scenarios, this paper constructs a dataset using UAV images captured under natural conditions with complex backgrounds. The data collection process involves gathering images from both network sources and actual scene photography. Public dataset websites such as Kaggle and PaddlePaddle are utilized to obtain a diverse range of image data. From the collected images, a total of 14,600 samples are carefully selected. These samples encompass color images, infrared images, targets of varying scales, and diverse complex detection scenarios, which are shown in Figure 8.
To facilitate comprehensive evaluation, the dataset is divided into training, validation, and test sets in an 8:1:1 ratio. This partitioning ensures a balanced representation of the data while enabling robust training, validation, and assessment of the model’s performance.
4.3. Experimental Setup
In this paper, the YOLOv8 model is adopted as the baseline network and implemented using the PyTorch framework (version 2.0.1). The experiments are conducted on a system equipped with an Intel Core™ i5-13490 CPU with 32 GB of RAM, running the Windows 11 operating system. To leverage parallel acceleration for the network, an NVIDIA RTX4060Ti GPU with 16 GB of video memory is utilized, along with CUDA version 11.8. The hyperparameter settings are shown in Table 2.
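For reproducibility, a training run of the YOLOv8 baseline on this setup could be launched with the Ultralytics API roughly as follows; the hyperparameter values (epochs, batch size, image size) are placeholders standing in for the settings listed in Table 2, and the dataset configuration file is hypothetical.

```python
from ultralytics import YOLO

# Load the YOLOv8s baseline (a custom RTSOD-YOLO model YAML would be passed here instead).
model = YOLO("yolov8s.pt")

# Placeholder hyperparameters -- substitute the values from Table 2.
model.train(
    data="uav_dataset.yaml",   # hypothetical dataset config (train/val/test paths, class names)
    epochs=300,
    imgsz=640,
    batch=16,
    device=0,                  # single NVIDIA RTX 4060 Ti GPU
)
```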
4.4. Comparison Experiments
To track the evaluation metrics, we plotted their values during the iterations, which can be found in Appendix A. Figure A1 presents an overall training summary of the model. The loss curves exhibit a downward trend, indicating that both training and validation losses are minimized during the training process. The metric curves show an upward trend, suggesting that the model’s performance improves throughout the iterations of training.
Table 3 presents a comparison of the proposed RTSOD-YOLO model with other classical advanced real-time detection methods on our drone dataset, including YOLOv5, AD-YOLOv5s, YOLOv6 [55], DAMO-YOLO, Gold-YOLO [56], YOLO-MS, YOLOv8, YOLOv9 (where GELAN is a simplified version that does not include the PGI module), and YOLOv10 [57]. Our model has only 5.2 million parameters, yet it achieves the highest accuracy of 97.3% while maintaining a relatively high inference speed of 241.2 frames per second.
We also analyzed the confusion matrices for each model across different scenarios, as illustrated in Figure A2 in Appendix B. This analysis allows us to compare the actual performance of the models under various conditions. The evaluation focuses on three key scenarios for drone object detection: occlusion, intense light, and dim light conditions. Each scenario consists of 200 instances, corresponding to the performance of YOLOv5, YOLOv6, YOLOv8, YOLOv9, YOLOv10, and our proposed model. This comprehensive assessment provides valuable insights into how each model responds to challenging environmental factors, enabling a clearer understanding of their strengths and weaknesses in practical applications. For further details, please refer to Appendix B.
Additionally, in Figure 9, we provide visual comparisons of each model across different scenarios. It can be observed that in complex conditions such as occlusion and intense lighting, our model maintains excellent detection performance, while the other detectors exhibit varying degrees of false positives and false negatives.
To further validate the performance of our proposed model, we conducted comparative experiments on several datasets. These include commonly used UAV detection datasets such as Anti-UAV300 [58], Drone Detection [59], and Drone vs. Bird [60], which represent various real-world conditions, including different weather conditions, occlusions, and lighting variations.
Among the datasets we used, Anti-UAV includes both infrared and RGB images of UAVs, covering six types of UAVs (such as DJI and Parrot) captured under two different lighting conditions (daytime and nighttime). The images span various backgrounds, such as buildings, clouds, and trees, providing diverse scenarios for detection tasks. The Drone Detection dataset, on the other hand, focuses on UAV images in various environments with significant scale variations, making it ideal for evaluating how well models perform under changing target sizes. Lastly, the Drone vs. Bird dataset presents a challenging task of differentiating between UAVs and birds, both of which are small-scale aerial targets with similar backgrounds. The close similarity in appearance between drones and birds, combined with the complex background, makes detection and recognition particularly difficult in this dataset.
We conducted further experiments on the datasets mentioned above, and the results are shown in Table 4. Our model ranked first on the Anti-UAV dataset with an accuracy of 98.4%/65.1%. Additionally, our model also achieved the best performance on the Drone vs. Bird dataset. Although its accuracy on the Drone Detection dataset was not the highest, it was second only to YOLOv9. This validates that our model's generalization performance across multiple datasets surpasses that of currently advanced real-time detectors.
4.5. Ablation Study
We also conducted ablation experiments on each module to validate the effectiveness of our improvements and proposed modules. First, we validated the contributions of each module to the network’s performance. Next, we compared the proposed RFR-Block with the C3, C2f, and ELAN modules. Finally, we assessed the performance of the SEAM in enhancing occlusion awareness.
4.5.1. Effect of the Improved Methods
Table 5 illustrates the contribution of each of our improved modules to enhancing detection performance. By integrating the small object detection layer that fuses spatial and scale features, as well as our occlusion-aware attention mechanism, we improved the network's detection accuracy, with gains of 2.0% and 1.7% in the two accuracy metrics, respectively. Additionally, the proposed RFR-Block maintains detection accuracy while reducing the computational resources required, with FLOPs decreased by 8.1 G and parameters reduced by 3.24 M.
4.5.2. Effect of the RFR-Block
Although the C2f module is powerful in feature extraction, it requires numerous parameters and FLOPs. To further reduce the complexity of our network model, we replaced the original C2f module with the newly designed RFR-Block module. We compared this module with the C3, ELAN, and C2f modules used in YOLOv5, YOLOv7, and YOLOv8, respectively, and the results are presented in Table 6. Our RFR-Block achieves the lowest number of parameters and the least computation, with the only trade-off being a slightly lower FPS compared to the C3 module in YOLOv5.
4.5.3. Effect of the SEAM
We compared occlusion-handling capabilities by simulating occluded targets using random erasing (RE) in the original test set, as shown in Figure 10. RE generates a black occlusion in the target region, thus simulating a situation where the target is partially occluded; the occluded portion accounts for 10% to 50% of the target size.
We evaluated the performance of the baseline model against the model with the introduced occlusion-aware attention mechanism. The results in Table 7 indicate that the improved model showed an enhancement in detection accuracy on the occluded test set, with gains of 3.2% and 2.9% in the two accuracy metrics, thereby validating the effectiveness of the modifications.
Additionally, we compared the performance of the two models on both the occluded and nonoccluded test sets. Due to the impact of occlusion, both models exhibited a decrease in accuracy on the occluded test set; however, the network with the attention mechanism experienced a significantly smaller drop in performance compared to the baseline model. This indicates that the SEAM possesses a certain degree of resilience against occlusion interference.
5. Conclusions
This paper presents an accurate and efficient real-time drone detection model, RTSOD-YOLO, which reintegrates pyramid features and optimizes the generation of redundant features, allowing the model to focus more on small object detection. We implemented various improvements within the YOLOv8 framework. The small object detection layer, which integrates spatial and scale features, enhances the model’s capability to detect small targets. The RFR-Block module reduces the computational resources required by the model, while the SEAM attention mechanism improves the network’s ability to perceive occlusions. Extensive experimental results demonstrate that this model can handle drone detection tasks under various conditions, significantly improving accuracy while maintaining inference speed. Our approach clearly outperforms currently popular real-time object detectors. Compared to the baseline network, our model achieved an accuracy improvement of 3.0%/3.5%, while reducing the number of parameters and computational cost by 25.7% and 53.1%, respectively. Also, there was a slight enhancement in inference speed. Additionally, in the ablation studies, we further validated the effectiveness of each module, providing a research basis for future improvements. However, due to the limited variety of drones in the dataset used in this study, the model’s generalization ability to different drone shapes needs further enhancement. In the future, we will collect more drone data to enrich and improve the dataset.