1. Introduction
In recent years, the rapid development of UAV technology has brought about changes in many industries, ranging from agricultural monitoring and disaster assessment to city management, and UAVs have been used increasingly widely [1,2,3,4,5]. As a flexible and efficient aerial platform, UAVs can quickly cover large areas and provide high-resolution images and video data. Due to their small size and flexibility, UAVs play an important role in a variety of fields. Therefore, most studies focus on the detection and processing of images acquired by UAVs but neglect the detection and regulation of UAVs themselves. Due to their small scale, complex backgrounds, target occlusion, and motion blur [6], it is difficult to identify and localize UAVs in images or videos.
Previous UAV detection methods mainly used audio, radar, radio frequency (RF) signals, and computer vision techniques [7]. Hauzenberger et al. applied linear predictive coding (LPC) to detect UAVs through their unique sounds [8], though this approach is easily disturbed by noise. Mohajerin et al. used radar trajectories for detection [9], but radar signals are limited in bad weather. Al-Emadi et al. employed a CNN to analyze RF signals between UAVs and controllers for detection [10].
In recent years, the widespread application of neural networks in the field of computer vision has provided new ideas for UAV detection. Target detection algorithms are mainly divided into traditional manual feature-based detection methods and deep neural network-based detection techniques. Traditional target detection methods rely on manual feature extraction and rule-based classifiers, which have limited performance when dealing with complex scenes and varying environments. The accuracy and robustness of target detection have significantly improved with the advent of deep learning, especially with the development of convolutional neural networks (CNNs).
General object detectors are primarily categorized into two types: single-stage detectors and two-stage detectors. Two-stage detectors typically involve two phases: region proposal for generating candidate boxes and bounding box regression, such as R-CNN [11], Fast R-CNN [12], Faster R-CNN [13], VFNet [14], and CenterNet2 [15]. These detectors tend to perform well in terms of accuracy but often suffer from lower real-time performance. Although many improvements have been made to the R-CNN family, they still fail to completely overcome the speed limitations of two-stage detectors.
In response, researchers combined the two phases into a single step, leading to the development of single-stage detectors such as SSD [16], RetinaNet [17], and YOLO [18]. However, these single-stage detectors often compromise detection accuracy, and directly applying them to UAV detection may lead to suboptimal results. Moreover, most previous research aimed at improving detection accuracy has often overlooked the limited computational power of practical deployment devices.
Our motivation stems from the fact that current general-purpose object detectors cannot be directly applied to drone detection scenarios, particularly due to challenges such as small scale, occlusion, and complex backgrounds, which these detectors often struggle to address effectively. Therefore, we have improved the existing advanced detector YOLOv8 and propose the RTSOD-YOLO model, aiming to enhance detection performance and real-time capability. A new downsampling module was introduced to retain more global and detailed information while reducing the number of parameters.
Additionally, to fully utilize the relationships between pyramid model features, we incorporated a dual TFE module to achieve comprehensive fusion of features at different scales in the pyramid model. We also introduced a scale-sequence feature fusion (SSFF) module to further extract and merge the results of dual encodings into a new small-object detection layer, enriching it with both global and detailed information.
Given that the widespread redundancy in current mainstream CNNs offers a comprehensive understanding of features but generates significant resource overhead, we designed a new efficient feature redundancy generation module. This module maintains redundant information while reducing the model’s parameters. We also introduced reparameterization techniques, further improving inference speed. Finally, to address the issue of small-scale objects being easily occluded, we incorporated an occlusion-aware attention mechanism, which enhances the model’s ability to detect occlusions.
To detect UAV targets accurately and efficiently, this paper proposes the RTSOD-YOLO model, which offers the following main improvements compared with YOLOv8s:
We add an adaptive spatial attention mechanism to the Adown module. After downsampling, the feature maps undergo spatial enhancement, enabling the model to focus on more important regions and reduce attention to irrelevant features.
We introduce a new detection layer that draws inspiration from the TFE and SSFF modules. Given that the feature maps from the P2 layer contain critical information for small object detection, we perform scale-sequence feature fusion based on the P2 feature maps and the outputs from the two TFE modules, resulting in a detection layer specifically focused on small objects.
We design an efficient redundant feature generation module, replacing the original convolution module. This module ensures a comprehensive understanding of features while reducing the model’s parameter count and computational cost. By incorporating reparameterization techniques, the model’s inference speed is further enhanced.
The remainder of this paper is organized as follows: In Section 2, we provide an overview of related work on drone detection techniques, focusing on traditional image processing methods and deep learning approaches. Section 3 details the proposed method, including the enhancements to the TFE and SSFF modules. In Section 4, we present the experimental setup and results, evaluating the performance of our model against state-of-the-art methods. Section 5 concludes the paper and outlines future research directions.
2. Related Work
Unlike counter-UAV detection tasks, current research primarily focuses on target detection and tracking from the perspective of the UAV itself. UAV-based detection and tracking often involve a top-down, bird’s-eye view, which provides a wide field of view but also introduces new challenges such as high-density targets, small objects, and complex backgrounds.
In recent years, significant research has been conducted on UAV detection. The main UAV detection techniques include radar sensors [19,20,21,22,23,24,25], RF sensors [10,26,27,28,29,30], audio sensors [31,32,33,34,35,36], and vision-based detection technologies [37,38,39,40,41,42,43,44,45,46,47,48,49]. A comparison of the characteristics of these detection technologies is presented in Table 1.
(1) Radar is an electromagnetic technology that uses radio waves to detect and locate nearby objects. It can calculate important target features such as distance, speed, azimuth, and elevation [19]. Some moving parts of UAVs produce unique radar echoes, and a method was proposed based on the analysis of these echo signals [23], focusing on developing patterns and features to identify UAVs based on rotor blade types. In [24], the authors analyzed the micro-Doppler characteristics of different UAVs and bird species, confirming that K-band (24 GHz) or millimeter-wave radar systems are effective in detecting UAVs.
(2) RF detection relies on the presence of electronic components on UAVs, such as radio transmitters and GPS receivers, which emit energy detectable by RF sensors. Detection systems based on this technology capture the communication signals between the controller and the UAV, and analyze these signals to detect the UAV [26,27,28]. Nemer et al. proposed a machine-learning-based UAV detection and recognition system, introducing a new layered learning approach that recognizes different types of UAV RF signals through a four-layer classifier [26]. In [27], the authors compared various machine learning techniques for RF detection and verified the good classification performance of the XGBoost algorithm. Some researchers have applied deep learning to this problem; for example, in [10], a simple CNN achieved better detection accuracy. In [29], VGG-16 was trained on compressed RF signals for UAV detection and recognition.
(3) Audio detection is based on sound information. During flight, UAVs produce distinct acoustic features due to their engines, propellers, and aerodynamic properties, which can be used to detect UAV targets. The most common acoustic feature is the noise generated by the propeller blades, which typically has a larger amplitude. By analyzing features such as frequency, amplitude, modulation, and duration, UAVs can be identified. This research area includes both machine learning and deep learning approaches. In [32], the authors developed a simple and effective UAV detection system using machine learning, classifying targets with balanced random forests and multilayer perceptrons. Tejera-Berengue et al. studied the distance dependency of various machine-learning-based UAV detection systems, demonstrating the superior detection performance of random forests and the strong performance of linear classifiers across distances.
(4) Vision-based detection is mainly divided into traditional handcrafted feature extraction and deep learning methods. In recent years, deep learning has demonstrated superior performance in various fields, and most recent studies have adopted deep learning techniques [37,38,39,40,41,42,43,44,45,46,47,48,49]. These methods are generally classified into two types: two-stage detectors and single-stage detectors. In [37], a two-stage detector was used, employing a Mask R-CNN with two backbones, and it was shown that a ResNet-50 backbone outperforms MobileNet. However, while this method provides good detection accuracy, its detection speed is relatively low. Single-stage detectors, on the other hand, offer better speed performance.
In [38], the authors proposed the AD-YOLOv5s algorithm to detect low-altitude UAVs. First, the detection of small targets is addressed through feature enhancement. Second, the Ghost module and depthwise separable convolution (DSConv) are used to remove a large number of redundant parameters from YOLOv5s, which enables the network to run on embedded devices and perform inference at high speed. In contrast, some researchers have set aside the real-time requirement and pursued maximum detection accuracy. The authors of [39] compared the performance of YOLOv4, YOLOv5, and DETR in terms of accuracy and speed. In [40], the authors created a low-visibility UAV flight dataset and proposed an improved YOLOv5 algorithm. It obtains more texture and contour information for small targets by adding a new scale to the model and performs multiscale feature fusion to reduce the loss of information. Regarding the balance between detection speed and accuracy, Selvis S.S. et al. verified the higher real-time performance of YOLOv5 compared to YOLOv4 and addressed the challenge of achieving high-precision detection [41].
With the continuous development of deep learning technologies, multiple variants based on the YOLO architecture have emerged to address issues such as detection accuracy, speed, and complexity across different application scenarios. PP-YOLOE [42], proposed by PaddlePaddle, is one such variant that significantly enhances detection efficiency through multiprecision training and mixed-precision inference. It offers various configurations from small to large, catering to different computational resource environments. PP-YOLOE maintains high detection accuracy while achieving rapid inference speeds, making it highly suitable for real-world deployment. On the other hand, DAMO-YOLO [43], developed by Alibaba's DAMO Academy, employs neural architecture search (NAS) technology and innovative modules like RepGFPN and AlignedOTA to boost detection efficiency and precision. This model, available in different configurations such as tiny, small, and medium, reduces parameter count and inference time, making it ideal for edge devices and resource-constrained environments. YOLO-MS [44] focuses on addressing multiscale object detection challenges, particularly optimizing small object detection by fusing information from multiple feature layers, thus improving performance when detecting small targets. These model variants leverage innovative network designs, feature fusion strategies, and optimization techniques to enhance the overall performance of YOLO, further extending its applicability across various use cases.
There are also many studies that have investigated different datasets for UAV detection [45,46]. In [47], the authors collected a total of 2395 images of drones and birds from publicly available online resources for training, and Zhao et al. proposed a visible-light dataset called the Dalian University of Technology Anti-UAV dataset (DUT Anti-UAV). It contains a detection dataset with a total of 10,000 images and a tracking dataset with 20 videos that include short-term and long-term sequences [48].
3. Methodology
YOLOv8 [50] is one of the most advanced real-time detectors; it consists of three parts: backbone, neck, and head. This paper proposes RTSOD-YOLO, an improved network based on YOLOv8, for UAV detection. We have made several improvements to address issues in the network related to downsampling, feature fusion, and model speed. First, we replaced the original convolution module with an adaptive Adown module for downsampling, allowing the model to retain rich information from higher layers. Next, we designed an efficient redundant feature generation module to preserve the comprehensive understanding of input features while reducing computational resources. Additionally, we employed a dual TFE module in a serial–parallel structure to perform two encodings, merging global information from deeper features and texture information from shallower features. This is followed by a new SSFF module that fuses these with the P2 features, forming a new detection layer specifically designed for small object detection, thereby improving the model's capability in detecting small targets. Finally, we introduced an occlusion-aware attention module in each detection head, enhancing the network's ability to detect objects under occlusion. The architecture of the network is shown in Figure 1.
3.1. Adaptive Adown Downsampling
YOLOv8 extracts five levels of feature maps with different scales, P1–P5, corresponding to 1/2, 1/4, 1/8, 1/16, and 1/32 of the input resolution. However, P3, P4, and P5 are mainly used for feature fusion, which is followed by detection. These feature maps of different scales have different receptive fields and, therefore, contain different semantic and location information. Larger feature maps contain sufficient location information but not enough semantic information. In YOLOv8, downsampling is achieved solely through convolution to extract deeper features, but this loses much important feature information, and the loss grows as downsampling is repeated. Therefore, this paper introduces the Adown [51] downsampling module, whose structure is shown in Figure 2.
The approach first employs average pooling to reduce the size of the feature map. Subsequently, it applies convolution and max pooling to the two halves of the split channels. Finally, it concatenates the two components to form the final downsampled feature map. Max pooling retains the most prominent local activations, such as strong edges, textures, and key features that are important for fine-grained recognition, while average pooling provides a more generalized and smooth representation by capturing broader patterns and reducing the influence of noise. Thus, we use convolutions to integrate the sharp, high-activation features from max pooling with the smooth, global context captured by average pooling.
To avoid potential inconsistencies between the outputs of the two pooling methods, the Adown module employs a learnable feature weighting mechanism. This allows the network to dynamically adjust the contributions of max pooling and average pooling based on the specific task or input, ensuring that the most relevant information is retained in the downsampled feature maps.
We represent the features obtained by the Adown module as $F \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels and $H$ and $W$ are the height and width of the feature map. We then apply both global average pooling and global max pooling along the channel dimension to capture different aspects of the feature map:

$$F_{avg} = \mathrm{AvgPool}(F), \qquad F_{max} = \mathrm{MaxPool}(F)$$

The results of average pooling and max pooling are concatenated along the channel dimension to form a feature map:

$$F_{cat} = \mathrm{Concat}(F_{avg}, F_{max})$$

We apply a convolution to the concatenated feature map to generate a spatial attention weight map:

$$M_{s} = \sigma(\mathrm{Conv}(F_{cat}))$$

where $\sigma$ is the sigmoid activation function, which normalizes the attention map to the range $[0, 1]$. The original input feature map $F$ is multiplied element-wise with the spatial attention map $M_{s}$:

$$F' = F \odot M_{s}$$

where $\odot$ denotes element-wise multiplication.
By leveraging the method above, the Adaptive Adown module ensures that downsampling not only reduces the spatial dimensions but also retains more information focused on key areas. This leads to more robust feature maps that maintain a balance between local details and global context.
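For clarity, the sketch below illustrates one possible PyTorch implementation of the Adaptive Adown block described above. The branch layout (average pooling followed by a channel split into a strided-convolution branch and a max-pooling branch) follows the description of Figure 2, while the spatial-attention kernel size (7 × 7 here) and the exact channel arithmetic are assumptions for illustration rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel-wise avg/max pooling -> conv -> sigmoid, applied as a spatial mask."""
    def __init__(self, kernel_size: int = 7):  # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg_map = torch.mean(x, dim=1, keepdim=True)        # average over channels
        max_map, _ = torch.max(x, dim=1, keepdim=True)      # max over channels
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                                     # element-wise re-weighting

class AdaptiveADown(nn.Module):
    """Adown-style downsampling followed by the adaptive spatial attention."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        half = c_out // 2
        self.avg = nn.AvgPool2d(2, 1)                                     # light smoothing first
        self.conv1 = nn.Conv2d(c_in // 2, half, 3, stride=2, padding=1)   # strided-conv branch
        self.maxpool = nn.MaxPool2d(3, stride=2, padding=1)               # max-pooling branch
        self.conv2 = nn.Conv2d(c_in // 2, half, 1)
        self.attn = SpatialAttention()

    def forward(self, x):
        x = self.avg(x)
        x1, x2 = x.chunk(2, dim=1)                           # split channels into two halves
        y = torch.cat([self.conv1(x1), self.conv2(self.maxpool(x2))], dim=1)
        return self.attn(y)                                  # spatial enhancement after downsampling
```

In this sketch, `AdaptiveADown(c_in, c_out)` would take the place of a strided convolution in the backbone; the attention step re-weights the downsampled map so that salient regions are emphasized.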
3.2. Reparameter Redundant Feature Generation Module
In a GhostNet study [52], it is noted that CNNs often compute intermediate feature maps that exhibit significant redundancy, i.e., the presence of highly similar feature maps. An ample amount of redundant information in the feature maps of well-trained deep neural networks often ensures a thorough understanding of the input data, and the effective utilization of feature map redundancy plays a crucial role in the performance of CNNs. However, because such redundancy is widespread in the intermediate feature maps of mainstream CNNs, we aim to reduce the resources required to produce it, such as the convolution filters used to generate these maps. The Ghost Module is designed based on this concept, aiming to generate a portion of the redundant feature maps using computationally “cheap” convolution operations. By doing so, the Ghost Module reduces computational overhead and the number of model parameters.
From this perspective, we propose the Reparameterization Feature Redundancy Generation Block (RFR-Block), shown in Figure 3b, as a replacement for the C2f module in the network. This block generates redundant features using more cost-efficient operations, thereby reducing both the number of parameters and the FLOPs of the model.
To compensate for the accuracy loss caused by discarding the original BottleNeck module, we employed the reparameterization technique RepConv. RepConv offers a way to convert the multibranch structure used during training into a more efficient single-branch structure during inference, which allows for faster inference speed and helps to improve the real-time performance of our model. The structure of RepConv is illustrated in Figure 4, and its key concept involves merging the convolution and normalization layers into a single convolution. The normalization process can be expressed by Equation (6):

$$\mathrm{BN}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta \tag{6}$$

Think of this as being of the form $\mathrm{BN}(x) = W_{bn} \cdot x + b_{bn}$, where $W_{bn} = \frac{\gamma}{\sqrt{\sigma^{2} + \epsilon}}$ and $b_{bn} = \beta - \frac{\gamma \mu}{\sqrt{\sigma^{2} + \epsilon}}$, so the process of convolution and normalization can be represented as Equation (7):

$$\mathrm{BN}(\mathrm{Conv}(x)) = W_{bn} \cdot (W_{conv} * x + b_{conv}) + b_{bn} = (W_{bn} \cdot W_{conv}) * x + (W_{bn} \cdot b_{conv} + b_{bn}) \tag{7}$$

The conversion of the 1 × 1 convolution and the two unprocessed branches into a 3 × 3 convolution for fusion can be expressed as follows:

$$W_{fuse} = \sum_{k} \mathrm{Pad}_{3 \times 3}\big(W_{bn}^{(k)} \cdot W_{conv}^{(k)}\big), \qquad b_{fuse} = \sum_{k} \big(W_{bn}^{(k)} \cdot b_{conv}^{(k)} + b_{bn}^{(k)}\big)$$

where fuse denotes the convolutional parameters obtained after fusing the convolution and normalization of every branch, and $\mathrm{Pad}_{3 \times 3}(\cdot)$ zero-pads each branch kernel to 3 × 3.
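As a concrete illustration of Equations (6) and (7), the following sketch folds a batch-normalization layer into the weights and bias of the preceding convolution. This is the standard conv–BN fusion on which RepConv-style reparameterization is built; it assumes a plain `Conv2d` followed by `BatchNorm2d` and is not the authors' exact fusion code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a single Conv2d whose output equals bn(conv(x)) for any input x."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding,
                      conv.dilation, conv.groups, bias=True)
    # W_bn = gamma / sqrt(var + eps), applied per output channel (Equation (7)).
    w_bn = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * w_bn.reshape(-1, 1, 1, 1))
    # b_fuse = W_bn * (b_conv - mu) + beta
    b_conv = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_(w_bn * (b_conv - bn.running_mean) + bn.bias)
    return fused

# Quick numerical check (eval mode so BN uses its running statistics).
conv, bn = nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16)
bn.eval()
x = torch.randn(1, 8, 32, 32)
assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)
```

The multibranch-to-single-branch conversion then amounts to fusing each branch this way, zero-padding the 1 × 1 (and identity) kernels to 3 × 3, and summing the resulting weights and biases.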
In the subsequent steps, a cost-effective convolution operation is employed to generate redundant feature maps. Following this, the feature maps undergo downsampling using a convolution. Multiple feature maps are then concatenated to form the new RFR-Block module.
We constructed this efficient redundant feature generation module using cost-effective operations and reparameterization techniques. This module retains the comprehensive understanding provided by redundant features while reducing the computational resources required by the model. Additionally, the reparameterization technique not only compensates for some of the lost accuracy but also allows the model to achieve faster inference speed.
The proposed RFR-Block module consists of the following components:
A convolution is used to adjust the number of channels (where the adjustment is influenced by different scale coefficients according to the size of the model), and the result is then divided into two parts along the channel dimension.
One part remains unprocessed, while the other part undergoes feature extraction through RepConv to compensate for the loss of accuracy; in addition, cost-effective convolutions are used to generate the necessary redundant features.
All intermediate feature maps are concatenated and passed through a final convolution layer, which maps the channels to the output channel count, yielding the final output.
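The following PyTorch sketch shows one way the RFR-Block described above could be assembled. The split ratio, the kernel size of the “cheap” convolution, and the use of a depthwise convolution for redundant-feature generation are assumptions for illustration; the RepConv branch is shown in its training-time (multibranch) form and would be fused for inference as described earlier.

```python
import torch
import torch.nn as nn

class RepConvTrain(nn.Module):
    """Training-time RepConv: parallel 3x3 and 1x1 conv-BN branches (fused at inference)."""
    def __init__(self, c: int):
        super().__init__()
        self.conv3 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c))
        self.conv1 = nn.Sequential(nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c))
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.conv3(x) + self.conv1(x))

class RFRBlock(nn.Module):
    """Sketch of the RFR-Block: split -> identity / RepConv + cheap redundant features -> concat."""
    def __init__(self, c_in: int, c_out: int, hidden: int = None):
        super().__init__()
        hidden = hidden or c_out // 2                 # width assumed to follow the model scale
        self.reduce = nn.Conv2d(c_in, 2 * hidden, 1)  # channel adjustment before the split
        self.rep = RepConvTrain(hidden)
        # "Cheap" operation: a depthwise 3x3 conv generating redundant maps (assumed choice)
        self.cheap = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU())
        self.fuse = nn.Conv2d(3 * hidden, c_out, 1)   # final mapping to the output channels

    def forward(self, x):
        skip, main = self.reduce(x).chunk(2, dim=1)   # one part untouched, one part processed
        main = self.rep(main)
        redundant = self.cheap(main)                  # cheaply generated redundant features
        return self.fuse(torch.cat([skip, main, redundant], dim=1))
```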
3.3. Small Target Detection with SSFF and TFE
In YOLOv8, the classical feature pyramid model is employed for feature fusion. However, the network uses simple concatenation and summation to fuse the pyramid features, without fully exploiting the connections between them. To address this limitation, we introduce the SSFF module and the TFE module [53]. The TFE module aims to improve feature fusion by segmenting features from three different scales: large, medium, and small. It includes the addition of large-scale feature maps and feature zooming to refine the detailed feature information. Figure 5 illustrates the structure of the TFE module.
Prior to feature encoding, the number of feature channels at each scale is adjusted to match the main feature map by the convolution module. Subsequently, the small-scale feature map undergoes upsampling using nearest-neighbor interpolation, which preserves the local strong semantic feature information from the low-resolution image. Conversely, the large-scale feature maps are downsampled through a combination of maximum pooling and average pooling. This downsampling approach aims to retain the global location information from the high-resolution image, along with the diverse feature information of the target image. Finally, the three scales of features are concatenated in the channel dimension.
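A minimal PyTorch sketch of the TFE idea follows: the small-scale map is upsampled with nearest-neighbor interpolation, the large-scale map is downsampled with a mix of max and average pooling, and the three aligned scales are concatenated along the channel dimension. The channel widths and the equal-weight pooling blend are illustrative assumptions, not the authors' exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TFE(nn.Module):
    """Triple feature encoding: align large/medium/small scales and concatenate them."""
    def __init__(self, c_large: int, c_mid: int, c_small: int, c: int):
        super().__init__()
        # 1x1 convs bring every scale to the channel width of the main (medium) map
        self.align_l = nn.Conv2d(c_large, c, 1)
        self.align_m = nn.Conv2d(c_mid, c, 1)
        self.align_s = nn.Conv2d(c_small, c, 1)

    def forward(self, f_large, f_mid, f_small):
        h, w = f_mid.shape[2:]
        l = self.align_l(f_large)
        # downsample the large-scale map with a blend of max and average pooling
        l = 0.5 * F.adaptive_max_pool2d(l, (h, w)) + 0.5 * F.adaptive_avg_pool2d(l, (h, w))
        m = self.align_m(f_mid)
        # upsample the small-scale map with nearest-neighbor interpolation
        s = F.interpolate(self.align_s(f_small), size=(h, w), mode="nearest")
        return torch.cat([l, m, s], dim=1)  # concatenate along the channel dimension
```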
In the SSFF module, which is illustrated in Figure 6, a 1 × 1 convolution is first applied to the P4 and P5 feature levels to reduce their channels to 256. Then, nearest-neighbor interpolation adjusts their spatial dimensions to match P3. Next, the unsqueeze operation expands the feature maps along a new axis, changing them from 3D (height, width, channels) to 4D (depth, height, width, channels). After that, the 4D feature maps are concatenated along the depth dimension to form a combined scale-sequence volume. Finally, 3D convolutions, followed by 3D batch normalization and SiLU activation, are applied to extract and refine the scale-sequence features. This approach effectively combines the high-dimensional information from deep feature maps with the detailed information from shallow feature maps.
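The sketch below mirrors that description in PyTorch: P4 and P5 are reduced to 256 channels, resized to P3's resolution, stacked with P3 along a new depth axis, and processed with a 3D convolution, 3D batch normalization, and SiLU. The 3D kernel size, the assumption that P3 already carries 256 channels, and the final collapse of the scale axis are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSFF(nn.Module):
    """Scale-sequence feature fusion across three pyramid levels (sketch)."""
    def __init__(self, c_p3: int = 256, c_p4: int = 512, c_p5: int = 1024, c: int = 256):
        super().__init__()
        self.reduce_p4 = nn.Conv2d(c_p4, c, 1)
        self.reduce_p5 = nn.Conv2d(c_p5, c, 1)
        self.proj_p3 = nn.Conv2d(c_p3, c, 1)
        self.conv3d = nn.Sequential(
            nn.Conv3d(c, c, kernel_size=3, padding=1),   # 3D kernel size assumed
            nn.BatchNorm3d(c),
            nn.SiLU())

    def forward(self, p3, p4, p5):
        h, w = p3.shape[2:]
        f3 = self.proj_p3(p3)
        f4 = F.interpolate(self.reduce_p4(p4), size=(h, w), mode="nearest")
        f5 = F.interpolate(self.reduce_p5(p5), size=(h, w), mode="nearest")
        # stack the three scales along a new depth axis: (B, C, D=3, H, W)
        seq = torch.stack([f3, f4, f5], dim=2)
        seq = self.conv3d(seq)
        return seq.mean(dim=2)   # collapse the scale axis back to a 2D map (assumed)
```

In this paper's design, the same fusion idea is shifted to the P2 level and fed by the two TFE outputs, producing the small-object detection layer described next.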
The SSFF module is primarily designed based on the P3 feature layer, but since our network is aimed at detecting UAVs, which are typically small-scale targets, we applied scale-sequence fusion to the P2 feature layer. By integrating the output from the P2 feature layer with two TFE modules, we created a scale space that fuses shallower features—critical for small object detection. This fusion output serves as a new detection layer, enhancing the model’s capability to detect small-scale targets.
3.4. Separated and Enhancement Attention Module
To address issues caused by occlusion between targets, this paper introduces a novel occlusion-aware attention mechanism called the Separated and Enhancement Attention Module (SEAM) [54]. The SEAM, depicted in Figure 7, seeks to mitigate problems such as alignment errors and feature loss resulting from occlusion.
The SEAM incorporates a multistep approach to tackle the challenges posed by occlusion between targets. Firstly, it applies depthwise separable convolution with residuals to process the inputs. Depthwise convolutions apply a single filter to each input channel independently, which significantly reduces the number of parameters and computations compared to traditional convolutions. This parameter efficiency makes the model lighter and faster, which is particularly beneficial for real-time applications. Furthermore, by processing each channel separately, depthwise convolutions enable the model to learn specific features tailored to individual channels, enhancing the effectiveness of feature extraction without introducing excessive computational complexity.
However, depthwise convolution neglects the relationships between channels. To compensate for this loss, the outputs of the depthwise separable convolutions are subsequently combined through pointwise (1 × 1) convolution. Pointwise convolutions combine the outputs of depthwise convolutions across different channels, allowing for effective mixing of information. This capability is crucial for capturing interactions between various feature maps, leading to richer and more complex feature representations. Additionally, pointwise convolutions can adjust the number of channels, providing flexibility in the architecture. This characteristic is particularly useful for adapting the model to different input sizes or for reducing dimensionality after depthwise operations, ultimately contributing to improved model performance.
Then, a two-layer fully connected network is used to fuse the information from each channel, enabling the network to enhance the connections between all channels. The outputs derived from the fully connected layer are then subjected to an exponential function, mapping them from the range [0, 1] to [1, e]. This mapping facilitates improved tolerance towards positional errors. Finally, the module’s output is multiplied with the original input features, enabling the model to effectively handle target occlusion.
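A compact PyTorch sketch of the SEAM pipeline described above is given below: a depthwise separable convolution with a residual connection, a two-layer fully connected channel-mixing stage, an exponential mapping of the attention weights from [0, 1] into [1, e], and multiplication with the original input. Hidden sizes, activations, and normalization placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEAM(nn.Module):
    """Separated and Enhancement Attention Module (sketch)."""
    def __init__(self, c: int, reduction: int = 4):
        super().__init__()
        # depthwise separable convolution with a residual connection
        self.dw = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),
            nn.BatchNorm2d(c), nn.GELU(),
            nn.Conv2d(c, c, 1, bias=False),          # pointwise (1x1) channel mixing
            nn.BatchNorm2d(c), nn.GELU())
        # two-layer fully connected network over channel descriptors
        self.fc = nn.Sequential(
            nn.Linear(c, c // reduction), nn.ReLU(inplace=True),
            nn.Linear(c // reduction, c), nn.Sigmoid())
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        y = self.dw(x) + x                           # residual depthwise-separable branch
        w = self.fc(self.pool(y).flatten(1))         # per-channel weights in [0, 1]
        w = torch.exp(w).view(x.size(0), -1, 1, 1)   # map [0, 1] -> [1, e] for error tolerance
        return x * w                                 # re-weight the original input features
```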
4. Experiments
Based on the improvements described above, we proposed the RTSOD-YOLO model and validated its performance through several experiments, including ablation studies and comparative experiments. In this section, we present and discuss our experimental setup, datasets, evaluation metrics, and experimental results.
4.1. Evaluation Metrics
In deep learning, particularly in tasks like classification, object detection, and segmentation, evaluation metrics play a crucial role in measuring the performance of models. Below are some commonly used evaluation metrics:
(1) Accuracy is the ratio of correctly predicted instances to the total instances, used to evaluate the overall performance of classification models:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where true positive (TP) is the number of instances correctly predicted as positive, true negative (TN) is the number of instances correctly predicted as negative, false positive (FP) is the number of instances incorrectly predicted as positive (Type I error), and false negative (FN) is the number of instances incorrectly predicted as negative (Type II error).
(2) Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.
(3) Recall is the ratio of correctly predicted positive observations to all actual positives.
(4) Average precision (AP): In object detection, AP is the area under the precision–recall (PR) curve, measuring performance at different confidence thresholds.
(5) Mean average precision (mAP) is the mean of the average precision scores across all object classes in a multiclass detection task. mAP is computed by averaging the AP of each object class to obtain the final mAP score:

$$\mathrm{mAP} = \frac{1}{C} \sum_{i=1}^{C} AP_{i}$$

where $C$ is the number of classes and $AP_{i}$ is the average precision for class $i$.
(6) Intersection over union (IoU) measures the overlap between the predicted bounding box and the ground truth bounding box.
(7) Frames per second (FPS) measures how many frames the model can process per second, indicating its real-time performance. FPS is computed by dividing the number of frames processed by the algorithm by the overall time taken by the algorithm.
(8) Confusion matrix is a table that summarizes the performance of a classification algorithm by comparing the predicted labels with the actual labels. It consists of four components: TP, TN, FP, and FN, as defined above.
(9) Number of parameters is the total number of learnable parameters in a model, indicating its complexity.
(10) Floating point operations (FLOPs) measure the number of floating-point operations required to perform one forward pass of the model.
Accuracy, precision, recall, and F1-score are basic metrics for classification models, while AP and mAP are typically used in object detection tasks. IoU evaluates how well bounding boxes match in object detection. FPS measures the real-time performance of a model, and the confusion matrix visually displays true and false positives/negatives. The number of parameters and FLOPs are used to evaluate model complexity and computational efficiency.
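To make the box-level metrics concrete, the short sketch below computes IoU for axis-aligned boxes in (x1, y1, x2, y2) format and derives precision and recall from TP/FP/FN counts; it is an illustrative helper, not the evaluation code used in the experiments.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)

# example: a prediction is typically counted as TP when IoU >= 0.5 with a ground-truth box
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # ~0.143
```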
4.2. Dataset
To ensure the model’s inference and detection feasibility in real-world scenarios, this paper constructs a dataset using UAV images captured under natural conditions with complex backgrounds. The data collection process involves gathering images from both network sources and actual scene photography. Public dataset websites such as Kaggle and PaddlePaddle are utilized to obtain a diverse range of image data. From the collected images, a total of 14,600 samples are carefully selected. These samples encompass color images, infrared images, targets of varying scales, and diverse complex detection scenarios, which are shown in Figure 8.
To facilitate comprehensive evaluation, the dataset is divided into training, validation, and test sets in an 8:1:1 ratio. This partitioning ensures a balanced representation of the data while enabling robust training, validation, and assessment of the model’s performance.
4.3. Experimental Setup
In this paper, the YOLOv8 model is adopted as the baseline network and implemented using the PyTorch framework (version 2.0.1). The experiments are conducted on a system equipped with an Intel Core™ i5-13490 CPU with 32 GB of RAM, running the Windows 11 operating system. To leverage parallel acceleration for the network, an NVIDIA RTX4060Ti GPU with 16 GB of video memory is utilized, along with CUDA version 11.8. The hyperparameter settings are shown in Table 2.
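For reproducibility, a training run of the YOLOv8 baseline on this setup could be launched with the Ultralytics API roughly as follows; the hyperparameter values (epochs, batch size, image size) are placeholders standing in for the settings listed in Table 2, and the dataset configuration file is hypothetical.

```python
from ultralytics import YOLO

# Load the YOLOv8s baseline (a custom RTSOD-YOLO model YAML would be passed here instead).
model = YOLO("yolov8s.pt")

# Placeholder hyperparameters -- substitute the values from Table 2.
model.train(
    data="uav_dataset.yaml",   # hypothetical dataset config (train/val/test paths, class names)
    epochs=300,
    imgsz=640,
    batch=16,
    device=0,                  # single NVIDIA RTX 4060 Ti GPU
)
```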
4.4. Comparison Experiments
To track the evaluation metrics, we plotted their values during the iterations, which can be found in Appendix A. Figure A1 presents an overall training summary of the model. The loss curves exhibit a downward trend, indicating that both training and validation losses are minimized during the training process. The metric curves show an upward trend, suggesting that the model’s performance improves throughout the iterations of training.
Table 3 presents a comparison of the proposed RTSOD-YOLO model with other classical advanced real-time detection methods on our drone dataset, including YOLOv5, AD-YOLOv5s, YOLOv6 [55], DAMO-YOLO, Gold-YOLO [56], YOLO-MS, YOLOv8, YOLOv9 (where GELAN is a simplified version that does not include the PGI module), and YOLOv10 [57]. Our model has only 5.2 million parameters, yet it achieves the highest accuracy of 97.3% while maintaining a relatively high inference speed of 241.2 frames per second.
We also analyzed the confusion matrices for each model across different scenarios, as illustrated in Figure A2 in Appendix B. This analysis allows us to compare the actual performance of the models under various conditions. The evaluation focuses on three key scenarios for drone object detection: occlusion, intense light, and dim light conditions. Each scenario consists of 200 instances, corresponding to the performance of YOLOv5, YOLOv6, YOLOv8, YOLOv9, YOLOv10, and our proposed model. This comprehensive assessment provides valuable insights into how each model responds to challenging environmental factors, enabling a clearer understanding of their strengths and weaknesses in practical applications. For further details, please refer to Appendix B.
Additionally, in Figure 9, we provide visual comparisons of each model across different scenarios. It can be observed that in complex conditions such as occlusion and intense lighting, our model maintains excellent detection performance, while the other detectors exhibit varying degrees of false positives and false negatives.
To further validate the performance of our proposed model, we conducted comparative experiments on several datasets. These include commonly used UAV detection datasets such as Anti-UAV300 [58], Drone Detection [59], and Drone vs. Bird [60], which represent various real-world conditions, including different weather conditions, occlusions, and lighting variations.
Among the datasets we used, Anti-UAV includes both infrared and RGB images of UAVs, covering six types of UAVs (such as DJI and Parrot) captured under two different lighting conditions (daytime and nighttime). The images span various backgrounds, such as buildings, clouds, and trees, providing diverse scenarios for detection tasks. The Drone Detection dataset, on the other hand, focuses on UAV images in various environments with significant scale variations, making it ideal for evaluating how well models perform under changing target sizes. Lastly, the Drone vs. Bird dataset presents a challenging task of differentiating between UAVs and birds, both of which are small-scale aerial targets with similar backgrounds. The close similarity in appearance between drones and birds, combined with the complex background, makes detection and recognition particularly difficult in this dataset.
We conducted further experiments on the datasets mentioned above, and the results are shown in Table 4. Our model ranked first on the Anti-UAV dataset with an accuracy of 98.4%/65.1%. Additionally, our model also achieved the best performance on the Drone vs. Bird dataset. Although its accuracy on the Drone Detection dataset was not the highest, it was second only to YOLOv9. This validates that our model's generalization performance across multiple datasets surpasses that of currently advanced real-time detectors.
4.5. Ablation Study
We also conducted ablation experiments on each module to validate the effectiveness of our improvements and proposed modules. First, we validated the contributions of each module to the network’s performance. Next, we compared the proposed RFR-Block with the C3, C2f, and ELAN modules. Finally, we assessed the performance of the SEAM in enhancing occlusion awareness.
4.5.1. Effect of the Improved Methods
Table 5 illustrates the contribution of each of our improved modules to enhancing detection performance. By integrating the small object detection layer that fuses spatial and scale features, as well as our occlusion-aware attention mechanism, we improved the network's detection accuracy, with gains of 2.0% and 1.7% in the two accuracy metrics, respectively. Additionally, the proposed RFR-Block maintains detection accuracy while reducing the computational resources required, with FLOPs decreased by 8.1 G and parameters reduced by 3.24 M.
4.5.2. Effect of the RFR-Block
Although the C2f module is powerful in feature extraction, it requires numerous parameters and FLOPs. To further reduce the complexity of our network model, we replaced the original C2f module with the newly designed RFR-Block module. We compared this module with the C3, ELAN, and C2f modules used in YOLOv5, YOLOv7, and YOLOv8, respectively, and the results are presented in Table 6. Our RFR-Block achieves the lowest number of parameters and the least computation, with the only trade-off being a slightly lower FPS compared to the C3 module in YOLOv5.
4.5.3. Effect of the SEAM
We compared occlusion-handling capabilities by simulating occluded targets using random erasing (RE) in the original test set, as shown in Figure 10. RE generates a black occlusion in the target region, thus simulating a situation where the target is partially occluded; the occluded portion accounts for 10% to 50% of the target size.
We evaluated the performance of the baseline model against the model with the introduced occlusion-aware attention mechanism. The results in Table 7 indicate that the improved model showed an enhancement in detection accuracy on the occluded test set, with gains of 3.2% and 2.9% in the two accuracy metrics, thereby validating the effectiveness of the modifications.
Additionally, we compared the performance of the two models on both the occluded and nonoccluded test sets. Due to the impact of occlusion, both models exhibited a decrease in accuracy on the occluded test set; however, the network with the attention mechanism experienced a significantly smaller drop in performance compared to the baseline model. This indicates that the SEAM possesses a certain degree of resilience against occlusion interference.
5. Conclusions
This paper presents an accurate and efficient real-time drone detection model, RTSOD-YOLO, which reintegrates pyramid features and optimizes the generation of redundant features, allowing the model to focus more on small object detection. We implemented various improvements within the YOLOv8 framework. The small object detection layer, which integrates spatial and scale features, enhances the model’s capability to detect small targets. The RFR-Block module reduces the computational resources required by the model, while the SEAM attention mechanism improves the network’s ability to perceive occlusions. Extensive experimental results demonstrate that this model can handle drone detection tasks under various conditions, significantly improving accuracy while maintaining inference speed. Our approach clearly outperforms currently popular real-time object detectors. Compared to the baseline network, our model achieved an accuracy improvement of 3.0%/3.5%, while reducing the number of parameters and computational cost by 25.7% and 53.1%, respectively. Also, there was a slight enhancement in inference speed. Additionally, in the ablation studies, we further validated the effectiveness of each module, providing a research basis for future improvements. However, due to the limited variety of drones in the dataset used in this study, the model’s generalization ability to different drone shapes needs further enhancement. In the future, we will collect more drone data to enrich and improve the dataset.