Although the YOLOv7-tiny object-detection algorithm can detect UAVs, its performance degrades under complex backgrounds: accuracy is low, and UAVs are easily confused with birds, which results in missed detections and false alarms. In light of these shortcomings, this paper proposes YOLOv7-GS, a target detection model for anti-UAV systems built by improving the YOLOv7-tiny algorithm.
Figure 1 displays the model network structure. The improvements in the YOLOv7-GS algorithm cover several aspects. First, the anchors are reclustered via the k-means method. Second, the SPPCSPC module is refined into the proposed SPPFCSPC-SR module. Third, the Inject-LAF module within the GD mechanism is refined to create the low-Inject-ILAF and high-Inject-ILAF modules, and the Get-and-Send module is constructed on this basis. Finally, the InceptionNeXt module is embedded at the end of the neck section.
3.1. Improvements to Anchors
When the size and shape of the anchors do not match the target object, the model has difficulty accurately predicting the target bounding box, which lowers detection accuracy and recall. Mismatched anchors can cause missed detections or false alarms and increase the cost of model optimization during training.
Figure 2a presents visualizations of the ground truth boxes in the dataset labels. The ground truth boxes are predominantly rectangular. However, the anchor sizes provided by YOLOv7-tiny, which are derived from clustering on the COCO dataset, are ill-suited to our dataset. We therefore redesigned the anchors for our dataset using the k-means clustering algorithm. The anchors before and after the improvement are compared in Figure 2b and Figure 2c, respectively. The anchors obtained through k-means clustering closely match the ground truth boxes. Leveraging the improved anchors substantially speeds up network optimization and effectively increases the algorithm’s recognition efficiency and localization accuracy.
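The paper does not show its clustering code; the following is a minimal Python sketch of k-means anchor clustering on ground-truth (width, height) pairs, assuming the 1 − IoU distance commonly used for YOLO anchor design since YOLOv2 (the authors may instead have used a plain Euclidean distance).

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (w, h) pairs, assuming all boxes share one corner."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
        + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs into k anchors with 1 - IoU distance."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Nearest centroid = highest IoU (equivalently, smallest 1 - IoU)
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        new = np.array([np.median(boxes[assign == j], axis=0)
                        if np.any(assign == j) else anchors[j]
                        for j in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]  # sort by area
```

Running kmeans_anchors with k = 9 on the label widths and heights yields three anchors per detection scale, matching the three detection scales of YOLOv7-tiny.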
3.2. SPPFCSPC-SR Module
Small target objects account for a small proportion of the image and have few effective features, so their appearance information is easily lost during feature extraction. Traditional models pay little attention to small-target areas and are prone to missed detections. To address the challenge of accurately localizing small targets in complex scenes and to reduce the missed detection rate, we propose the SPPFCSPC-SR module as an improvement over the existing SPPCSPC module.
The SPPCSPC module contributes to YOLOv7 by processing input feature maps through multiscale spatial pyramid pooling, which effectively expands the model’s receptive field and improves its feature representation capabilities. This module consists of two critical submodules: the SPP and CSPC modules. The SPP module generates multiple feature maps of varying scales through multiscale spatial pyramid pooling, capturing targets of various sizes along with scene information and enlarging the model’s perceptual range. The CSPC module then performs convolutional operations on the feature maps produced by the SPP module, which further boosts the model’s feature representation capabilities.
In this enhanced module, the SPP module is replaced by the SPPF module. The SPPF module achieves the same functionality as the 5 × 5, 9 × 9, and 13 × 13 pooling kernels of the SPP module by passing the input sequentially through three 5 × 5 MaxPool layers and concatenating their outputs. Additionally, we reduce the pooling kernel size from 5 to 3. The parameter-free Simple Attention Module (SimAM) [30] is embedded before the pooling layers; it separates target pixels from other pixels and infers three-dimensional attention weights for the feature map, increasing the attention to small-target areas, reducing the loss of effective features, and suppressing confusion. These changes match the expanded receptive field of the smaller pooling kernels to the scale of small targets, which improves feature extraction and detection accuracy for small targets.
Figure 3 presents the structure of the SPPFCSPC-SR module.
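As a concrete illustration of the two ingredients described above, the following PyTorch sketch implements the parameter-free SimAM attention and the SPPF-style chained pooling with the reduced 3 × 3 kernel. The surrounding CSP branch convolutions are condensed into a single 1 × 1 fusion layer, so this is a sketch of the idea rather than the authors’ exact layer layout.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention: per-pixel 3D weights from an energy function."""
    def __init__(self, e_lambda=1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x):
        _, _, h, w = x.shape
        n = h * w - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        v = d.sum(dim=(2, 3), keepdim=True) / n
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        return x * torch.sigmoid(e_inv)

class SPPFBlock(nn.Module):
    """SimAM followed by three chained 3x3 MaxPools; concatenating the
    intermediate outputs emulates parallel 3x3, 5x5, and 7x7 pooling kernels."""
    def __init__(self, c, k=3):
        super().__init__()
        self.attn = SimAM()
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.fuse = nn.Conv2d(4 * c, c, kernel_size=1)  # condensed CSP fusion

    def forward(self, x):
        x = self.attn(x)          # attend to small-target regions first
        p1 = self.pool(x)
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return self.fuse(torch.cat((x, p1, p2, p3), dim=1))
```

With k = 5 the chained pools reproduce the 5/9/13 receptive fields of the original SPP; the reduced k = 3 yields 3/5/7, better matched to small targets.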
3.3. InceptionNeXt Module
In the detection of small targets, resolution and visual information are limited, which makes extracting discriminative features challenging, and small targets are easily disturbed by environmental factors. Small targets also occupy only a small area in images and therefore carry relatively little contextual information. This is especially problematic in YOLOv7-tiny, which uses small convolution kernels and expands the receptive field through multilayer stacking, so its capability to express semantic information is relatively weak. Large-kernel depthwise convolutions enlarge the model’s receptive field and retain the contextual semantic information of small-target objects more effectively [31,32,33,34], which is particularly valuable in small-object detection. ConvNeXt [35] incorporates large-kernel depthwise convolutions, which considerably boost the detection performance for small objects. However, the high memory access cost of ConvNeXt adversely affects computational efficiency. We introduce the InceptionNeXt module to address this issue and further enhance the detection performance for small UAV targets while enlarging the model’s receptive field.
Figure 4 shows the structural representation of this module.
The InceptionNeXt module innovatively decomposes large-kernel depthwise convolution into four parallel branches along the channel dimension: a small square kernel, two orthogonal band kernels, and an identity mapping. The input is first divided into four groups along the channel dimension, as presented in Equation (1):

$X_{hw}, X_{w}, X_{h}, X_{id} = \mathrm{Split}(X)$ (1)

Then, these four features are processed using four different operators, and the output results are concatenated. The calculation process is shown in Equations (2)–(6):

$X'_{hw} = \mathrm{DWConv}_{k_s \times k_s}(X_{hw})$ (2)
$X'_{w} = \mathrm{DWConv}_{1 \times k_b}(X_{w})$ (3)
$X'_{h} = \mathrm{DWConv}_{k_b \times 1}(X_{h})$ (4)
$X'_{id} = X_{id}$ (5)
$X' = \mathrm{Concat}(X'_{hw}, X'_{w}, X'_{h}, X'_{id})$ (6)

The default value of $k_s$ is 3, and that of $k_b$ is 11.
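For concreteness, a minimal PyTorch sketch of this decomposition follows, mirroring the reference InceptionNeXt token mixer; the branch ratio of 1/8 per convolutional branch is the default from the InceptionNeXt paper, not a value stated here.

```python
import torch
import torch.nn as nn

class InceptionDWConv2d(nn.Module):
    """Inception-style depthwise mixer: square kernel, two band kernels,
    and an identity branch, split along the channel dimension."""
    def __init__(self, dim, square_kernel=3, band_kernel=11, branch_ratio=0.125):
        super().__init__()
        gc = int(dim * branch_ratio)  # channels per convolutional branch
        self.dwconv_hw = nn.Conv2d(gc, gc, square_kernel,
                                   padding=square_kernel // 2, groups=gc)
        self.dwconv_w = nn.Conv2d(gc, gc, (1, band_kernel),
                                  padding=(0, band_kernel // 2), groups=gc)
        self.dwconv_h = nn.Conv2d(gc, gc, (band_kernel, 1),
                                  padding=(band_kernel // 2, 0), groups=gc)
        self.split_sizes = (dim - 3 * gc, gc, gc, gc)

    def forward(self, x):
        # Equation (1): split the input along the channel dimension
        x_id, x_hw, x_w, x_h = torch.split(x, self.split_sizes, dim=1)
        # Equations (2)-(6): branch-wise operators, then concatenation
        return torch.cat(
            (x_id, self.dwconv_hw(x_hw), self.dwconv_w(x_w), self.dwconv_h(x_h)),
            dim=1,
        )
```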
By utilizing this clever decomposition strategy, we can reduce the number of parameters and the computational burden while retaining the advantages of large-kernel depthwise convolutions. This approach promotes further enlargement of the receptive field and improves model performance.
To fully leverage the advantages of the InceptionNeXt module, this paper integrates it into the last layer of the network’s neck structure. At the end of the neck, the InceptionNeXt module facilitates in-depth integration and comprehension of the features learned in preceding layers. With this design, the network can capture more of the images’ global semantic information, thereby improving the model’s detection performance for small targets. Moreover, this structural design improves model performance in both theory and practice while keeping computational cost in check.
3.4. Get-and-Send Module
In classical YOLO series models, the feature pyramid network (FPN) [36] is commonly used to handle multiscale information during object detection. The original FPN structure fully integrates information from adjacent layers through a progressive multiscale feature fusion pattern, whereas information from non-adjacent layers can only be integrated indirectly through intermediary layers. This design often loses information about small targets during cross-layer fusion, which weakens the capability of YOLO models to detect them. Traditional remedies add shortcuts to create additional pathways that enhance information flow. Liu et al. [37] proposed the PANet architecture, which augments the top-down pathway with an additional bottom-up pathway and lateral connections to effectively capture semantic information and contextual relationships. Liu et al. [38] proposed the adaptive spatial feature fusion structure to integrate features of various scales more effectively and improve model performance. Jin et al. [39] introduced adaptive feature fusion and self-enhancement modules. Chen et al. [40] proposed a parallel FPN structure for bidirectional-fusion object detection. However, traditional FPN-based fusion structures still suffer from slow cross-layer information exchange and information loss owing to the numerous pathways and indirect interactions in the network.
We introduce the GD mechanism to address potential information loss during feature fusion in the FPN structure of the YOLO series. This mechanism aggregates and fuses features from all levels in a global view and then distributes the fused result back to each level. The information fusion capability of the neck section is thus notably boosted without introducing excessive latency, making information interaction and fusion more comprehensive and efficient. The GD mechanism comprises two branches: low-GD and high-GD. Building upon the Inject module of the GD mechanism, we propose the improved low-Inject-ILAF and high-Inject-ILAF modules and construct the Get-and-Send module on this basis.
Figure 5 illustrates the working principle of the Get-and-Send module.
The “Get” process consists of two stages: first, the FAM gathers and aligns features from various layers; then, the IFM fuses the aligned features to extract global information. After the global information is obtained, the information injection module (Inject) “Sends” it to each level, enhancing the detection capability of each branch through a simple attention mechanism.
To improve the model’s capability to detect objects of different sizes, we introduce two branches: low-GS and high-GS. In FAM_4in, average pooling is used to downsample the input features to a unified size, with the resolution of $B_4$, $R_{B4}$, selected as the target size. IFM_4in consists of multiple re-parameterized convolutional blocks (RepBlock) and a split operation. The RepBlocks take $F_{align}$ (whose channel number is the sum of the input channel numbers) as input to obtain $F_{fuse}$ (channel number $C_{B4} + C_{B5}$), which is then split along the channel dimension into $F_{inj\_P3}$ and $F_{inj\_P4}$. The specific equations are presented in Equations (7)–(9):

$F_{align} = \mathrm{FAM\_4in}([B_2, B_3, B_4, B_5])$ (7)
$F_{fuse} = \mathrm{RepBlock}(F_{align})$ (8)
$F_{inj\_P3}, F_{inj\_P4} = \mathrm{Split}(F_{fuse})$ (9)
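As an illustration of Equation (7), the sketch below aligns the four inputs to $B_4$’s resolution and concatenates them along channels. The use of bilinear upsampling for the deeper $B_5$ level is an assumption, since the text only specifies average pooling for downsampling.

```python
import torch
import torch.nn.functional as F

def fam_4in(b2, b3, b4, b5):
    """Equation (7): align B2-B5 to B4's spatial size, then concatenate."""
    target = b4.shape[2:]
    aligned = [
        F.adaptive_avg_pool2d(b2, target),  # downsample shallower levels
        F.adaptive_avg_pool2d(b3, target),
        b4,                                 # already at the target size
        F.interpolate(b5, size=target, mode="bilinear", align_corners=False),
    ]
    return torch.cat(aligned, dim=1)        # F_align, channels = sum of inputs
```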
FAM_3in operates similarly to FAM_4in, using global average pooling to downsample the inputs for size alignment, with the resolution of $P_5$, $R_{P5}$, as the target size. IFM_3in consists of multiple transformer blocks and a split operation. The output of FAM_3in, $F_{align}$, is processed through the transformer blocks to obtain $F_{fuse}$. $F_{fuse}$ is then channel-reduced via a 1 × 1 convolution to channel number $C_{P4} + C_{P5}$ and split along the channel dimension into $F_{inj\_N4}$ and $F_{inj\_N5}$. The specific equations are presented in Equations (10)–(12):

$F_{align} = \mathrm{FAM\_3in}([P_3, P_4, P_5])$ (10)
$F_{fuse} = \mathrm{Transformer}(F_{align})$ (11)
$F_{inj\_N4}, F_{inj\_N5} = \mathrm{Split}(\mathrm{Conv}_{1 \times 1}(F_{fuse}))$ (12)
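By analogy with the low-GS sketch above, a condensed version of Equations (11) and (12) might look as follows; the single-layer transformer encoder stands in for the paper’s transformer blocks and is an illustrative choice, not the authors’ exact configuration.

```python
import torch
import torch.nn as nn

class IFM3in(nn.Module):
    """Equations (11)-(12): transformer fusion, 1x1 channel reduction, split."""
    def __init__(self, c_in, c_n4, c_n5, heads=4):
        super().__init__()
        self.mixer = nn.TransformerEncoderLayer(
            d_model=c_in, nhead=heads, batch_first=True)
        self.reduce = nn.Conv2d(c_in, c_n4 + c_n5, kernel_size=1)
        self.c_n4 = c_n4

    def forward(self, f_align):                      # (B, C, H, W) from FAM_3in
        b, c, h, w = f_align.shape
        tokens = f_align.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        f_fuse = self.mixer(tokens).transpose(1, 2).reshape(b, c, h, w)
        f = self.reduce(f_fuse)                      # 1x1 conv channel reduction
        return f[:, :self.c_n4], f[:, self.c_n4:]    # F_inj_N4, F_inj_N5
```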
The Get-and-Send module is crafted to boost the model’s accuracy and robustness in the detection of small targets. Its primary objective is to enable a more thorough flow and fusion of information across different levels. To achieve effective cross-level fusion of feature information and reduce information loss during fusion, we employ the Get-and-Send module to replace the FPN structure and improve the model’s performance in small-target detection.
The LAF module is a lightweight intralayer fusion module: it combines the input local features ($B_i$ or $M_i$) with neighboring-layer features, and the Inject module then further enriches the local feature maps with multilevel information. Schematic diagrams of the shallow and deep structures of the LAF module are shown in Figure 6a,b. As part of our efforts to improve the model’s performance, we optimized the Inject-LAF module (Figure 7a). Specifically, local information (from the current layer) and global information (generated by the IFM) are input simultaneously, denoted as $x_{local}$ and $x_{global}$, respectively. $x_{global}$ is processed through two different convolution layers to obtain $F_{global\_embed}$ and $F_{global\_act}$, and $x_{local}$ is processed through a convolution layer to obtain $F_{local\_embed}$. The fused feature $F_{att\_fuse}$ is then obtained through an attention calculation. Here, $x_{local}$ is equal to $B_i$, as detailed in Equations (13)–(16):

$F_{global\_act} = \mathrm{resize}(\mathrm{Sigmoid}(\mathrm{Conv}_{act}(x_{global})))$ (13)
$F_{global\_embed} = \mathrm{resize}(\mathrm{Conv}_{global\_embed}(x_{global}))$ (14)
$F_{att\_fuse} = \mathrm{Conv}_{local\_embed}(x_{local}) \cdot F_{global\_act} + F_{global\_embed}$ (15)
$P_i = \mathrm{RepBlock}(F_{att\_fuse})$ (16)
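A minimal sketch of this injection step follows, based on the Gold-YOLO formulation from which the GD mechanism originates; the layer names and the use of bilinear resizing are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Inject(nn.Module):
    """Fuse local features with global information through a simple attention
    gate: local_embed * sigmoid(global_act) + global_embed (Eqs. (13)-(15))."""
    def __init__(self, c_local, c_global, c_out):
        super().__init__()
        self.local_embed = nn.Conv2d(c_local, c_out, 1)
        self.global_embed = nn.Conv2d(c_global, c_out, 1)
        self.global_act = nn.Conv2d(c_global, c_out, 1)

    def forward(self, x_local, x_global):
        size = x_local.shape[2:]
        act = torch.sigmoid(F.interpolate(
            self.global_act(x_global), size=size,
            mode="bilinear", align_corners=False))       # Eq. (13)
        embed = F.interpolate(
            self.global_embed(x_global), size=size,
            mode="bilinear", align_corners=False)        # Eq. (14)
        # Eq. (15); a RepBlock (Eq. (16)) would follow in the full pipeline
        return self.local_embed(x_local) * act + embed
```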
The high-GS branch has an Inject structure identical to that in the low-GS branch (Equations (17)–(20)).
The improved design replaces the original convolution layers of the second and third layers with InceptionNeXt large-kernel convolution layers. For each layer feature $x_{local}$, the feature map fused by the LAF module and the feature map $x_{global}$ produced by the IFM module then undergo deeper processing (Figure 7b). This improvement ensures that the Inject-ILAF module effectively preserves information on small-target objects, strengthening the focus on small targets and the model’s capability to detect them.