Although the YOLOv7-tiny object-detection algorithm can detect UAVs, its performance degrades under complex backgrounds: accuracy is low, and UAVs are easily confused with birds, which results in missed detections and false alarms. In light of these shortcomings, this paper proposes YOLOv7-GS, a target detection model for anti-UAV systems built by improving the YOLOv7-tiny algorithm.
Figure 1 displays the model network structure. The improvements in the YOLOv7-GS algorithm cover several aspects. First, the anchors are reclustered via the k-means method. Second, the SPPCSPC module is refined into the proposed SPPFCSPC-SR module. Third, the Inject-LAF module within the GD mechanism is refined to create the low-Inject-ILAF and high-Inject-ILAF modules, and the Get-and-Send module is constructed on this basis. Finally, the InceptionNeXt module is embedded at the end of the neck section.
3.1. Improvements to Anchors
When the size and shape of the anchors do not match the target object, the model has difficulty accurately predicting the target bounding box, which lowers detection accuracy and recall. Mismatched anchors can cause missed detections or false alarms and increase the cost of model optimization during training.
Figure 2a presents visualizations of the ground truth boxes in the dataset labels. The ground truth boxes are predominantly rectangular. However, the anchor sizes provided by YOLOv7-tiny, which are derived from clustering on the COCO dataset, are ill-suited to our dataset. We therefore redesigned the anchors for our dataset using the k-means clustering algorithm. The anchors before and after the improvement are compared in Figure 2b and Figure 2c, respectively. The anchors obtained through k-means clustering closely match the ground truth boxes. Leveraging the improved anchors substantially speeds up network optimization and effectively increases the algorithm’s recognition efficiency and localization accuracy.
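The paper does not show its clustering code; the following is a minimal Python sketch of k-means anchor clustering on ground-truth (width, height) pairs, assuming the 1 − IoU distance commonly used for YOLO anchor design since YOLOv2 (the authors may instead have used a plain Euclidean distance).

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (w, h) pairs, assuming all boxes share one corner."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
        + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs into k anchors with 1 - IoU distance."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Nearest centroid = highest IoU (equivalently, smallest 1 - IoU)
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        new = np.array([np.median(boxes[assign == j], axis=0)
                        if np.any(assign == j) else anchors[j]
                        for j in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]  # sort by area
```

Running kmeans_anchors with k = 9 on the label widths and heights yields three anchors per detection scale, matching the three detection scales of YOLOv7-tiny.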
3.2. SPPFCSPC-SR Module
Small target objects account for a small proportion of the image and have few effective features, so their appearance information is easily lost during feature extraction. Traditional models pay little attention to small-target areas and are prone to missed detections. To address the challenge of accurately localizing small targets in complex scenes and to reduce the missed detection rate, we propose the SPPFCSPC-SR module as an improvement over the existing SPPCSPC module.
The SPPCSPC module contributes to YOLOv7 by processing input feature maps through multiscale spatial pyramid pooling, which effectively expands the model’s receptive field and improves its feature representation capabilities. This module consists of two critical submodules: the SPP and CSPC modules. The SPP module generates multiple feature maps of varying scales through multiscale spatial pyramid pooling, capturing targets of various sizes along with scene information and enlarging the model’s perceptual range. The CSPC module then performs convolutional operations on the feature maps produced by the SPP module, which further boosts the model’s feature representation capabilities.
In this enhanced module, the SPP module is replaced by the SPPF module. The SPPF module achieves the same functionality as the 5 × 5, 9 × 9, and 13 × 13 pooling kernels of the SPP module by passing the input sequentially through three 5 × 5 MaxPool layers and concatenating their outputs. Additionally, we reduce the pooling kernel size from 5 to 3. The parameter-free Simple Attention Module (SimAM) [30] is embedded before the pooling layers; it separates target pixels from other pixels and infers three-dimensional attention weights for the feature map, increasing the attention to small-target areas, reducing the loss of effective features, and suppressing confusion. These changes match the expanded receptive field of the smaller pooling kernels to the scale of small targets, which improves feature extraction and detection accuracy for small targets.
Figure 3 presents the structure of the SPPFCSPC-SR module.
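As a concrete illustration of the two ingredients described above, the following PyTorch sketch implements the parameter-free SimAM attention and the SPPF-style chained pooling with the reduced 3 × 3 kernel. The surrounding CSP branch convolutions are condensed into a single 1 × 1 fusion layer, so this is a sketch of the idea rather than the authors’ exact layer layout.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention: per-pixel 3D weights from an energy function."""
    def __init__(self, e_lambda=1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x):
        _, _, h, w = x.shape
        n = h * w - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        v = d.sum(dim=(2, 3), keepdim=True) / n
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        return x * torch.sigmoid(e_inv)

class SPPFBlock(nn.Module):
    """SimAM followed by three chained 3x3 MaxPools; concatenating the
    intermediate outputs emulates parallel 3x3, 5x5, and 7x7 pooling kernels."""
    def __init__(self, c, k=3):
        super().__init__()
        self.attn = SimAM()
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.fuse = nn.Conv2d(4 * c, c, kernel_size=1)  # condensed CSP fusion

    def forward(self, x):
        x = self.attn(x)          # attend to small-target regions first
        p1 = self.pool(x)
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return self.fuse(torch.cat((x, p1, p2, p3), dim=1))
```

With k = 5 the chained pools reproduce the 5/9/13 receptive fields of the original SPP; the reduced k = 3 yields 3/5/7, better matched to small targets.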
3.3. InceptionNeXt Module
In the detection of small targets, resolution and visual information are limited, which makes extracting discriminative features challenging, and small targets are easily disturbed by environmental factors. Small targets also occupy only a small area in images and therefore carry relatively little contextual information. This is especially problematic in YOLOv7-tiny, which uses small convolution kernels and expands the receptive field through multilayer stacking, so its capability to express semantic information is relatively weak. Large-kernel depthwise convolutions enlarge the model’s receptive field and retain the contextual semantic information of small-target objects more effectively [31,32,33,34], which is particularly valuable in small-object detection. ConvNeXt [35] incorporates large-kernel depthwise convolutions, which considerably boost the detection performance for small objects. However, the high memory access cost of ConvNeXt adversely affects computational efficiency. We introduce the InceptionNeXt module to address this issue and further enhance the detection performance for small UAV targets while enlarging the model’s receptive field.
Figure 4 shows the structural representation of this module.
The InceptionNeXt module innovatively decomposes large-kernel depthwise convolution into four parallel branches along the channel dimension: a small square kernel, two orthogonal band kernels, and an identity mapping. The input is first divided into four groups along the channel dimension, as presented in Equation (1):

$X_{hw}, X_{w}, X_{h}, X_{id} = \mathrm{Split}(X)$ (1)

Then, these four features are processed using four different operators, and the output results are concatenated. The calculation process is shown in Equations (2)–(6):

$X'_{hw} = \mathrm{DWConv}_{k_s \times k_s}(X_{hw})$ (2)
$X'_{w} = \mathrm{DWConv}_{1 \times k_b}(X_{w})$ (3)
$X'_{h} = \mathrm{DWConv}_{k_b \times 1}(X_{h})$ (4)
$X'_{id} = X_{id}$ (5)
$X' = \mathrm{Concat}(X'_{hw}, X'_{w}, X'_{h}, X'_{id})$ (6)

The default value of $k_s$ is 3, and that of $k_b$ is 11.
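For concreteness, a minimal PyTorch sketch of this decomposition follows, mirroring the reference InceptionNeXt token mixer; the branch ratio of 1/8 per convolutional branch is the default from the InceptionNeXt paper, not a value stated here.

```python
import torch
import torch.nn as nn

class InceptionDWConv2d(nn.Module):
    """Inception-style depthwise mixer: square kernel, two band kernels,
    and an identity branch, split along the channel dimension."""
    def __init__(self, dim, square_kernel=3, band_kernel=11, branch_ratio=0.125):
        super().__init__()
        gc = int(dim * branch_ratio)  # channels per convolutional branch
        self.dwconv_hw = nn.Conv2d(gc, gc, square_kernel,
                                   padding=square_kernel // 2, groups=gc)
        self.dwconv_w = nn.Conv2d(gc, gc, (1, band_kernel),
                                  padding=(0, band_kernel // 2), groups=gc)
        self.dwconv_h = nn.Conv2d(gc, gc, (band_kernel, 1),
                                  padding=(band_kernel // 2, 0), groups=gc)
        self.split_sizes = (dim - 3 * gc, gc, gc, gc)

    def forward(self, x):
        # Equation (1): split the input along the channel dimension
        x_id, x_hw, x_w, x_h = torch.split(x, self.split_sizes, dim=1)
        # Equations (2)-(6): branch-wise operators, then concatenation
        return torch.cat(
            (x_id, self.dwconv_hw(x_hw), self.dwconv_w(x_w), self.dwconv_h(x_h)),
            dim=1,
        )
```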
By utilizing this clever decomposition strategy, we can reduce the number of parameters and the computational burden while retaining the advantages of large-kernel depthwise convolutions. This approach promotes further enlargement of the receptive field and improves model performance.
To fully leverage the advantages of the InceptionNeXt module, this paper integrates it into the last layer of the network’s neck structure. At the end of the neck, the InceptionNeXt module facilitates in-depth integration and comprehension of the features learned in preceding layers. With this design, the network can capture more of the images’ global semantic information, thereby improving the model’s detection performance for small targets. Moreover, this structural design improves model performance in both theory and practice while keeping computational cost in check.
3.4. Get-and-Send Module
In classical YOLO series models, the feature pyramid network (FPN) [36] is commonly used to handle multiscale information during object detection. The original FPN structure fully integrates information from adjacent layers through a progressive multiscale feature fusion pattern, whereas information from non-adjacent layers can only be integrated indirectly through intermediary layers. This design often loses information about small targets during cross-layer fusion, which weakens the capability of YOLO models to detect them. Traditional remedies add shortcuts to create additional pathways that enhance information flow. Liu et al. [37] proposed the PANet architecture, which augments the top-down pathway with an additional bottom-up pathway and lateral connections to effectively capture semantic information and contextual relationships. Liu et al. [38] proposed the adaptive spatial feature fusion structure to integrate features of various scales more effectively and improve model performance. Jin et al. [39] introduced adaptive feature fusion and self-enhancement modules. Chen et al. [40] proposed a parallel FPN structure for bidirectional-fusion object detection. However, traditional FPN-based fusion structures still suffer from slow cross-layer information exchange and information loss owing to the numerous pathways and indirect interactions in the network.
We introduce the GD mechanism to address potential information loss during feature fusion in the FPN structure of the YOLO series. This mechanism aggregates and fuses features from all levels in a global view and then distributes the fused result back to each level. The information fusion capability of the neck section is thus notably boosted without introducing excessive latency, making information interaction and fusion more comprehensive and efficient. The GD mechanism comprises two branches: low-GD and high-GD. Building upon the Inject module of the GD mechanism, we propose the improved low-Inject-ILAF and high-Inject-ILAF modules and construct the Get-and-Send module on this basis.
Figure 5 illustrates the working principle of the Get-and-Send module.
The “Get” process consists of two stages: first, the FAM gathers and aligns features from various layers; then, the IFM fuses the aligned features to extract global information. After the global information is obtained, the information injection module (Inject) “Sends” it to each level, enhancing the detection capability of each branch through a simple attention mechanism.
To improve the model’s capability to detect objects of different sizes, we introduce two branches: low-GS and high-GS. In FAM_4in, average pooling is used to downsample the input features to a unified size, with the resolution of $B_4$, $R_{B4}$, selected as the target size. IFM_4in consists of multiple re-parameterized convolutional blocks (RepBlock) and a split operation. The RepBlocks take $F_{align}$ (whose channel number is the sum of the input channel numbers) as input to obtain $F_{fuse}$ (channel number $C_{B4} + C_{B5}$), which is then split along the channel dimension into $F_{inj\_P3}$ and $F_{inj\_P4}$. The specific equations are presented in Equations (7)–(9):

$F_{align} = \mathrm{FAM\_4in}([B_2, B_3, B_4, B_5])$ (7)
$F_{fuse} = \mathrm{RepBlock}(F_{align})$ (8)
$F_{inj\_P3}, F_{inj\_P4} = \mathrm{Split}(F_{fuse})$ (9)
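As an illustration of Equation (7), the sketch below aligns the four inputs to $B_4$’s resolution and concatenates them along channels. The use of bilinear upsampling for the deeper $B_5$ level is an assumption, since the text only specifies average pooling for downsampling.

```python
import torch
import torch.nn.functional as F

def fam_4in(b2, b3, b4, b5):
    """Equation (7): align B2-B5 to B4's spatial size, then concatenate."""
    target = b4.shape[2:]
    aligned = [
        F.adaptive_avg_pool2d(b2, target),  # downsample shallower levels
        F.adaptive_avg_pool2d(b3, target),
        b4,                                 # already at the target size
        F.interpolate(b5, size=target, mode="bilinear", align_corners=False),
    ]
    return torch.cat(aligned, dim=1)        # F_align, channels = sum of inputs
```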
FAM_3in operates similarly to FAM_4in, using global average pooling to downsample the inputs for size alignment, with the resolution of $P_5$, $R_{P5}$, as the target size. IFM_3in consists of multiple transformer blocks and a split operation. The output of FAM_3in, $F_{align}$, is processed through the transformer blocks to obtain $F_{fuse}$. $F_{fuse}$ is then channel-reduced via a 1 × 1 convolution to channel number $C_{P4} + C_{P5}$ and split along the channel dimension into $F_{inj\_N4}$ and $F_{inj\_N5}$. The specific equations are presented in Equations (10)–(12):

$F_{align} = \mathrm{FAM\_3in}([P_3, P_4, P_5])$ (10)
$F_{fuse} = \mathrm{Transformer}(F_{align})$ (11)
$F_{inj\_N4}, F_{inj\_N5} = \mathrm{Split}(\mathrm{Conv}_{1 \times 1}(F_{fuse}))$ (12)
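By analogy with the low-GS sketch above, a condensed version of Equations (11) and (12) might look as follows; the single-layer transformer encoder stands in for the paper’s transformer blocks and is an illustrative choice, not the authors’ exact configuration.

```python
import torch
import torch.nn as nn

class IFM3in(nn.Module):
    """Equations (11)-(12): transformer fusion, 1x1 channel reduction, split."""
    def __init__(self, c_in, c_n4, c_n5, heads=4):
        super().__init__()
        self.mixer = nn.TransformerEncoderLayer(
            d_model=c_in, nhead=heads, batch_first=True)
        self.reduce = nn.Conv2d(c_in, c_n4 + c_n5, kernel_size=1)
        self.c_n4 = c_n4

    def forward(self, f_align):                      # (B, C, H, W) from FAM_3in
        b, c, h, w = f_align.shape
        tokens = f_align.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        f_fuse = self.mixer(tokens).transpose(1, 2).reshape(b, c, h, w)
        f = self.reduce(f_fuse)                      # 1x1 conv channel reduction
        return f[:, :self.c_n4], f[:, self.c_n4:]    # F_inj_N4, F_inj_N5
```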
The Get-and-Send module is crafted to boost the model’s accuracy and robustness in the detection of small targets. Its primary objective is to enable a more thorough flow and fusion of information across different levels. To achieve effective cross-level fusion of feature information and reduce information loss during fusion, we employ the Get-and-Send module to replace the FPN structure and improve the model’s performance in small-target detection.
The LAF module is a lightweight intralayer fusion module: it combines the input local features ($B_i$ or $M_i$) with neighboring-layer features, and the Inject module then further enriches the local feature maps with multilevel information. Schematic diagrams of the shallow and deep structures of the LAF module are shown in Figure 6a,b. As part of our efforts to improve the model’s performance, we optimized the Inject-LAF module (Figure 7a). Specifically, local information (from the current layer) and global information (generated by the IFM) are input simultaneously, denoted as $x_{local}$ and $x_{global}$, respectively. $x_{global}$ is processed through two different convolution layers to obtain $F_{global\_embed}$ and $F_{global\_act}$, and $x_{local}$ is processed through a convolution layer to obtain $F_{local\_embed}$. The fused feature $F_{att\_fuse}$ is then obtained through an attention calculation. Here, $x_{local}$ is equal to $B_i$, as detailed in Equations (13)–(16):

$F_{global\_act} = \mathrm{resize}(\mathrm{Sigmoid}(\mathrm{Conv}_{act}(x_{global})))$ (13)
$F_{global\_embed} = \mathrm{resize}(\mathrm{Conv}_{global\_embed}(x_{global}))$ (14)
$F_{att\_fuse} = \mathrm{Conv}_{local\_embed}(x_{local}) \cdot F_{global\_act} + F_{global\_embed}$ (15)
$P_i = \mathrm{RepBlock}(F_{att\_fuse})$ (16)
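A minimal sketch of this injection step follows, based on the Gold-YOLO formulation from which the GD mechanism originates; the layer names and the use of bilinear resizing are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Inject(nn.Module):
    """Fuse local features with global information through a simple attention
    gate: local_embed * sigmoid(global_act) + global_embed (Eqs. (13)-(15))."""
    def __init__(self, c_local, c_global, c_out):
        super().__init__()
        self.local_embed = nn.Conv2d(c_local, c_out, 1)
        self.global_embed = nn.Conv2d(c_global, c_out, 1)
        self.global_act = nn.Conv2d(c_global, c_out, 1)

    def forward(self, x_local, x_global):
        size = x_local.shape[2:]
        act = torch.sigmoid(F.interpolate(
            self.global_act(x_global), size=size,
            mode="bilinear", align_corners=False))       # Eq. (13)
        embed = F.interpolate(
            self.global_embed(x_global), size=size,
            mode="bilinear", align_corners=False)        # Eq. (14)
        # Eq. (15); a RepBlock (Eq. (16)) would follow in the full pipeline
        return self.local_embed(x_local) * act + embed
```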
The high-GS branch has an Inject structure identical to that in the low-GS branch (Equations (17)–(20)).
The improved design replaces the original convolution layers of the second and third layers with InceptionNeXt large-kernel convolution layers. For each layer feature $x_{local}$, the feature map fused by the LAF module and the feature map $x_{global}$ produced by the IFM module then undergo deeper processing (Figure 7b). This improvement ensures that the Inject-ILAF module effectively preserves information on small-target objects, strengthening the focus on small targets and the model’s capability to detect them.