1. Introduction
UAV aerial image detection aims to locate and identify targets of interest in an image and largely falls under small target detection. A small object can be defined in two ways: (1) by relative scale, as an object whose width and height are each less than 1/10 of the image's, or whose bounding-box area is below a fixed fraction of the image area (usually 0.03); (2) by absolute scale, as an object occupying fewer than 32 × 32 pixels. Much meaningful progress has been made in object detection algorithms, which has injected confidence into small target detection and driven the development of UAV technology. The increasing prevalence of UAV aerial images across diverse applications has provided fertile ground for integrating target detection with aerial imagery. This integration has proven particularly advantageous in fields such as urban transportation [1], urban planning [2], and environmental monitoring [3].
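As an illustration of the two criteria above, the following minimal Python helper (a hypothetical function, not part of any detection framework) applies both definitions to a bounding box:

```python
def is_small_object(box_w, box_h, img_w, img_h,
                    rel_ratio=0.1, area_ratio=0.03, abs_pixels=32 * 32):
    """Classify a target as 'small' under the two definitions above.

    box_w, box_h: bounding-box width/height in pixels.
    img_w, img_h: image width/height in pixels.
    """
    # Relative scale: width and height each under 1/10 of the image,
    # or box-to-image area ratio under 0.03.
    relative = (box_w < rel_ratio * img_w and box_h < rel_ratio * img_h) \
        or (box_w * box_h) / (img_w * img_h) < area_ratio
    # Absolute scale: fewer than 32 x 32 pixels.
    absolute = box_w * box_h < abs_pixels
    return relative or absolute
```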
Consequently, the study of small target detection assumes paramount importance, particularly for aerial imagery, where the high altitude of aerial photography means images often contain small targets [4]. Variable viewing angles and environmental conditions further complicate detection by exacerbating target–background mixing [5]. While mainstream object detection algorithms are widely used across various domains, their effectiveness on small targets is often limited, primarily because of their reliance on mechanisms such as anchor frames [6]. Anchor-based small object detection places anchor frames of different sizes on the feature map and predicts, from the features inside each frame, whether it contains a target object. Applied directly to small target detection tasks, this mechanism can yield suboptimal results because it exacerbates the challenge of localizing small targets. Furthermore, noise filtering during convolution operations reduces image resolution, causing the loss of critical features essential for learning from small targets [7]. Hence, addressing the challenge of small object detection, characterized by limited pixel proportions and complex feature extraction, is essential for advancing object detection in aerial imagery analysis [8,9].
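For concreteness, the sketch below shows how such anchor frames are laid out on a feature map; the sizes, aspect ratios, and stride are illustrative assumptions rather than any particular detector's settings.

```python
import itertools

def build_anchors(feat_h, feat_w, stride, sizes=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Place anchor boxes of several sizes/aspect ratios at every
    feature-map cell, in (cx, cy, w, h) image coordinates."""
    anchors = []
    for y, x in itertools.product(range(feat_h), range(feat_w)):
        # Cell center mapped back to image coordinates.
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
        for size, ratio in itertools.product(sizes, ratios):
            # ratio = w / h, while w * h = size^2 is held fixed.
            w, h = size * ratio ** 0.5, size / ratio ** 0.5
            anchors.append((cx, cy, w, h))
    return anchors
```

A detector then scores each such box for objectness from the features inside it and regresses offsets to fit the ground truth; with few anchors small enough to match tiny targets, small objects are easily missed.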
The urgent demand for small target detection spans various fields, including environmental monitoring, urban development, and transportation. As aerial image detection becomes an integral part of daily life, research breakthroughs in this area grow increasingly imperative; yet numerous obstacles persist, making small target detection a formidable task.
To tackle these challenges, this paper introduces PCSG, which optimizes computational performance while enhancing detection capability. The enhancement entails modifications to both the backbone and the detection head. For the detection head, we propose an innovative framework and refine it further through pruning and upsampling optimization. For the backbone, we improve feature extraction by leveraging the Ghost framework [10] and SPD-Conv [11]. The specific contributions are outlined below:
By enhancing shallow feature reuse, the model retains richer position information and bolsters feature extraction for small targets while also optimizing network structures and introducing additional prediction branches.
Leveraging the lightweight and versatile CARAFE structure for upsampling mitigates the issue of local information loss associated with traditional nearest-neighbor upsampling methods.
The adoption of the SPD-Conv (space-to-depth) module in place of strided convolution and pooling maximizes the retention of discriminative feature information.
The integration of Ghost convolution as the backbone reduces parameters and computational complexity compared to the original CSPNet backbone while also enhancing real-time performance.
The structure of our article is as follows:
Section 2 provides an overview of related works, focusing on anchor-free object detection methods, data augmentation techniques, and multi-scale learning approaches.
Section 3 elaborates on our proposed method, which comprises the PCHead model and the PCSG model. In Section 4, we present the experimental results, applying our method to aerial images and comparing its performance with that of YOLOv5. Finally, we summarize our findings and outline directions for future work.
2. Related Works
This paper first discusses the importance of resolution for small target detection. Then, we review small object detection algorithms along three main directions: anchor-free mechanisms, data augmentation techniques, and multi-scale learning.
In object detection, image resolution plays a crucial role. High-resolution images provide more abundant information, such as outlines, textures, and feature points, which helps improve network accuracy. Low-resolution images contain less detail; they are blurrier and noisier, which interferes with detection. However, higher resolution also increases computational complexity and slows down recognition. For small object detection, higher resolution enlarges the target in the image and eases detection; yet when the resolution is too high, the network's receptive field covers a smaller proportion of the image, leaving the network unable to predict objects at all scales.
The anchor box mechanism remains the prevailing approach in object detection: different anchor frames are applied to the feature map, each frame is scored for whether it contains a target object based on the features inside it, and the frame is regressed to the ground truth box to obtain the target's location. However, small objects pose unique challenges due to their limited pixel occupancy within images. Predicting the bounding box offset for small objects often incurs larger errors, compounded by the smaller number of anchor boxes available for their detection. Various innovative solutions have emerged to mitigate these challenges. One solution [12] performs per-pixel prediction, while another line of work [13,14] replaces anchor boxes with key points, enhanced by the incorporation of a central point. Other methods [15] leverage global context between detected instances and images to eliminate the reliance on anchor boxes and non-maximum suppression (NMS). Moreover, attention mechanisms that focus on the surroundings of detected instances [16,17] have also yielded promising results in small object detection.
Utilizing data augmentation techniques to enhance the quantity and quality of images in a dataset can significantly improve the generalization ability of models. Small targets often suffer from low resolution, difficult feature extraction, and limited sample size. The scarcity of small targets in an image can be addressed by duplicating them through an oversampling strategy [18]: a small target is cropped out by its mask, pasted anywhere in the image with a newly generated ground truth, and optionally transformed at random (scaled, flipped, rotated, etc.), increasing the number of small targets per image. However, simplistic replication may introduce scale and background mismatches; addressing these requires considering contextual information during replication and applying adaptive resampling augmentation [19]. For the problems of feature extraction and limited sample size, image processing operations can be used, such as contrast enhancement, histogram equalization, spatial filtering, and spatial scale transformation. These operations strengthen certain features in the image, and the sample count can be increased by training on the transformed images together with the originals as one dataset.
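The following is a minimal sketch of this copy-paste oversampling strategy; it assumes axis-aligned box crops and omits the mask-based cropping, random transforms, and overlap checks a full implementation would add.

```python
import random

def copy_paste_small_targets(image, boxes, n_copies=2):
    """Oversample small targets by copying each box region and pasting it
    at random locations, returning the augmented image and boxes.

    image: HxWxC array (e.g., numpy); boxes: list of (x1, y1, x2, y2).
    """
    h, w = image.shape[:2]
    out_boxes = list(boxes)
    for (x1, y1, x2, y2) in boxes:
        patch = image[y1:y2, x1:x2].copy()
        ph, pw = patch.shape[:2]
        for _ in range(n_copies):
            # Pick a random paste location fully inside the image.
            nx = random.randint(0, w - pw)
            ny = random.randint(0, h - ph)
            image[ny:ny + ph, nx:nx + pw] = patch
            out_boxes.append((nx, ny, nx + pw, ny + ph))  # new ground truth
    return image, out_boxes
```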
Multi-scale learning enhances a model's feature learning capability by fusing deep abstract information with shallow detail information [20]. During convolution, the features of small targets are easily adulterated with background information. Shallow networks extract pixel-level information, including color, edges, textures, and corners, while deep networks capture semantic information. Integrating features from different levels of the feature pyramid [21,22] and employing adaptive multi-scale networks [23] are effective strategies for this issue. This paper extends the application of multi-scale information to even shallower feature maps, prunes redundant network structures, and employs the CARAFE module to minimize feature information loss during upsampling, along with the SPD-Conv module to preserve fine-grained feature map information. Furthermore, processing the context region of targets instead of simple pixel-by-pixel processing during training [24] yields an efficient multi-scale training approach. Another study [25] demonstrated improved detection performance by exploiting relevant information across different feature maps.
3. Improvement on Detection Head and Backbone Networks
In object detection, optimizing both the detection head and the backbone network is crucial for achieving superior performance. The detection head relies on the features extracted by the backbone network to generate detection boxes and class confidences; the backbone network, in turn, must extract image features that support accurate detection. By strategically optimizing these components, we can exploit their complementary strengths to enhance detection accuracy, robustness, and efficiency. This study explores novel optimization strategies for the detection head and backbone network to achieve state-of-the-art performance, aiming to provide insights into the fundamental principles of effective object detection systems.
3.1. PCHead Model
In this paper, we address the challenge that small targets occupy only a small proportion of the image, which makes it difficult to extract useful information and leaves parts of the network structure redundant. To tackle this issue, we propose several enhancements. First, we deepen the feature pyramid and the path aggregation network. Second, we introduce a new detection layer specifically designed for smaller targets. Additionally, we trim the parts of the original network devoted to features of larger targets, which become redundant. Finally, we utilize the lightweight and versatile upsampling operator CARAFE to construct the PCHead model.
3.1.1. A Novel Detection Head with Enhanced Feature Pyramid
YOLOv5, a one-stage target detection algorithm, directly predicts the object's category probability and positional coordinates. It is faster than two-stage detectors and more efficient at detecting small targets, making it particularly well suited to real-time detection in high-quality aerial imagery.
To leverage these properties of YOLOv5 and improve its performance, this paper enhances the FPN + PAN structure by transforming the original two-layer feature pyramid into a three-layer structure. The extra pyramid layer enables feature fusion with the feature map obtained through 4× downsampling. These modifications capitalize on the combination of the FPN and PAN structures while enhancing the capability of the network.
In addition, the original YOLOv5 network uses a three-layer output, corresponding to 8×, 16×, and 32× downsampling, respectively. The mapping from the target's original resolution to its resolution on a feature layer is given in Formula (1), where $P_L$ represents the size (in pixels) of the target in the $L$-th layer feature map, $P_0$ represents the original resolution of the target, and $S_L$ represents the downsampling stride of the $L$-th layer feature map:

$$P_L = \frac{P_0}{S_L} \quad (1)$$
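As a worked example under Formula (1), a target with an original extent of $P_0 = 32$ pixels shrinks to

$$\frac{32}{8} = 4, \qquad \frac{32}{16} = 2, \qquad \frac{32}{32} = 1$$

pixels on the 8×, 16×, and 32× branches, respectively, whereas a 4× branch would retain $32/4 = 8$ pixels of the same target, which motivates the detection head introduced below.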
On the basis of the above modifications, this paper adds a detection head that downsamples by a factor of 4, used to detect even smaller targets. By deepening the feature pyramid, the new detection head can draw on shallower feature maps to obtain richer feature and position information. The optimized network is shown in Figure 1: the left side shows the feature maps at each downsampling factor produced by the feature extraction structure from the original input; the middle shows the feature maps obtained through fusion in the FPN structure; and the right side shows the corresponding PAN structure, which finally yields detection heads for targets of different sizes.
3.1.2. Network Structure Pruning for Complexity Reduction
Although the newly introduced detection head enhances accuracy, it also increases the complexity of the network structure. The added feature pyramid network (FPN) layers, path aggregation network (PAN) layers, and 4× downsampling detection head significantly augment the intricacy of the network. Meanwhile, large objects are scarce in aerial image datasets, so the modules associated with 32× downsampling contribute little to the primary objective of detecting small objects effectively. Consequently, modifying the existing network structure becomes imperative to address this redundancy.
Moreover, the feature maps obtained through 32× downsampling possess a large receptive field, making them insensitive to small objects with low pixel ratios and of minimal use for detecting them. Therefore, after thorough deliberation, this paper prunes the modules associated with 32× downsampling from the network architecture: the 32× downsampling feature extraction module is removed from the backbone, the 32× downsampling feature fusion module from the neck, and the 32× downsampling detection head from the prediction branch, as illustrated in Figure 2. Following these pruning operations, the FPN and PAN structures must be adjusted because the size of the feature maps fed into the feature fusion network has changed; the affected FPN layers and their corresponding PAN layers are pruned accordingly. Ultimately, the detection head designed for large objects is eliminated.
The increased complexity of the network structure poses significant challenges, particularly in terms of computational efficiency, memory consumption, and training time. Addressing these challenges is crucial for ensuring the scalability and practical applicability of the proposed model.
The pruned network retains the advantages of using shallow feature maps while alleviating the complexity they introduce. Removing the 32× downsampling feature extraction module from the feature extraction network makes subsequent feature fusion more convenient, with no new upsampling structures required. This also eliminates unnecessary computation, decreases the parameter count, and increases detection speed.
3.1.3. CARAFE-Based Feature Reassemble
The upsampling algorithm used in YOLOv5 is nearest-neighbor interpolation, which fills in missing pixels by directly copying the closest existing value. Copying neighboring pixel values can create obvious aliasing artifacts, and overemphasizing individual samples in this way invites overfitting. In addition, the algorithm considers only the distance between positions and ignores the feature content of instances, which further limits detection performance.
In this paper, a lightweight and general-purpose upsampling algorithm called CARAFE (Content-Aware ReAssembly of FEatures) [26] is used to supersede nearest-neighbor upsampling. CARAFE is a content-aware, feature-reassembling upsampling method with a large receptive field during the reassembly process.
Compared to traditional upsampling algorithms, CARAFE uses convolutional layers to transform the input feature channels, effectively avoiding the checkerboard effect. It enhances detail information during upsampling, which aids feature learning, while remaining lightweight and efficient to run. CARAFE contains two key components: a kernel prediction module and a content-aware reassembly module. The former generates the reassembly kernel weights, which adapt to the underlying image content; the latter applies these kernels to reassemble local regions of the feature map into the final upsampled feature map.
CARAFE operates in two steps: it first predicts a reassembly kernel for each target location based on its content, as shown in Formula (2), and then uses the predicted kernel to guide the feature reassembly process, as shown in Formula (3). For a feature map $\mathcal{X}$ of size $C \times H \times W$ and an upsampling rate $\sigma$ (assuming $\sigma$ is an integer), CARAFE generates a new feature map $\mathcal{X}'$ of size $C \times \sigma H \times \sigma W$. Any position $l' = (i', j')$ in the new feature map $\mathcal{X}'$ is associated with a corresponding source position $l = (i, j)$ in the input feature map $\mathcal{X}$, where $i = \lfloor i'/\sigma \rfloor$ and $j = \lfloor j'/\sigma \rfloor$. Let $N(\mathcal{X}_l, k)$ denote the $k \times k$ subregion of $\mathcal{X}$ centered at position $l$, i.e., the neighborhood of $\mathcal{X}_l$. The kernel prediction module $\psi$ predicts a location-wise kernel $\mathcal{W}_{l'}$ for each position $l'$ based on the $k_{encoder} \times k_{encoder}$ subregion of $\mathcal{X}_l$, while the content-aware reassembly module $\phi$ reassembles the $k_{up} \times k_{up}$ subregion of $\mathcal{X}_l$ with the kernel $\mathcal{W}_{l'}$:

$$\mathcal{W}_{l'} = \psi\left(N(\mathcal{X}_l, k_{encoder})\right) \quad (2)$$

$$\mathcal{X}'_{l'} = \phi\left(N(\mathcal{X}_l, k_{up}), \mathcal{W}_{l'}\right) \quad (3)$$
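To make the two steps concrete, the following is a minimal PyTorch sketch of CARAFE; the hyper-parameters ($k_{encoder} = 3$, $k_{up} = 5$, a compressed channel width of 64) follow the original paper's defaults and are assumptions here. This is an illustration of Formulas (2) and (3), not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Sketch of CARAFE: a kernel prediction module (psi) predicts a
    k_up x k_up reassembly kernel per output location (Formula (2)); a
    content-aware reassembly module (phi) applies it to the corresponding
    input neighborhood (Formula (3))."""

    def __init__(self, channels, sigma=2, k_encoder=3, k_up=5, c_mid=64):
        super().__init__()
        self.sigma, self.k_up = sigma, k_up
        # Kernel prediction module: channel compression + content encoding.
        self.compress = nn.Conv2d(channels, c_mid, kernel_size=1)
        self.encode = nn.Conv2d(c_mid, sigma ** 2 * k_up ** 2,
                                kernel_size=k_encoder, padding=k_encoder // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        s, k = self.sigma, self.k_up
        # --- Formula (2): predict and normalize reassembly kernels. ---
        kernels = self.encode(self.compress(x))      # (b, s^2*k^2, h, w)
        kernels = F.pixel_shuffle(kernels, s)        # (b, k^2, s*h, s*w)
        kernels = F.softmax(kernels, dim=1)          # normalize each kernel
        # --- Formula (3): reassemble k x k neighborhoods of the input. ---
        patches = F.unfold(x, kernel_size=k, padding=k // 2)  # (b, c*k^2, h*w)
        patches = patches.view(b, c, k * k, h, w)
        # Each output pixel l' reuses the neighborhood of its source pixel
        # l = floor(l' / sigma), replicated here by nearest upsampling.
        patches = F.interpolate(
            patches.view(b, c * k * k, h, w), scale_factor=s,
            mode='nearest').view(b, c, k * k, s * h, s * w)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)  # (b, c, s*h, s*w)
```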
The CARAFE algorithm replaces nearest-neighbor upsampling in the network as modified by the preceding work; the final network structure, which we refer to as the PCHead model, is displayed in Figure 3.
3.2. PCSG Model
In this section, we delve into the detailed modifications made to the backbone network. With the aim of improving both fine-grained feature extraction capability and computational performance, this paper incorporates a novel CNN module, SPD-Conv, while also substituting the CSPDarknet53 network structure with Ghost convolution. These alterations culminate in the development of the PCSG model.
3.2.1. Fine-Grained Information Network Structure
The underlying reason mainstream target detection algorithms perform well on regular-sized targets but poorly on small targets lies in a flaw of existing network architectures, particularly their strided convolution and pooling layers. These layers discard fine-grained information and learn features inefficiently. When the input image has low resolution and the target is small, existing CNN architectures are therefore ill suited to small object detection.
SPD-Conv [11] efficiently addresses this problem. As a new type of CNN module, SPD-Conv consists of an SPD (space-to-depth) layer and a non-strided convolution (Conv) layer. Replacing each strided convolution and pooling layer with SPD-Conv preserves fine-grained information and enables effective feature learning. The SPD layer generalizes an image transformation technique to the downsampling of feature maps inside the CNN.
For an intermediate feature map $X$ of size $S \times S \times C_1$, a series of sub-feature maps can be sliced out as shown in Formula (4). For a scaling factor $scale$, each sub-map $f_{x,y}$ is composed of the entries $X(i, j)$ whose indices $i + x$ and $j + y$ are evenly divisible by $scale$:

$$f_{x,y} = X[x :: scale,\; y :: scale], \qquad x, y \in \{0, 1, \ldots, scale - 1\} \quad (4)$$

Therefore, each sub-map downsamples the original feature map by the scaling factor. The sub-maps are then concatenated along the channel dimension to obtain a new feature map $X'$, whose spatial dimension is reduced by the scaling factor and whose channel dimension is increased by its square relative to the original feature map $X$. The dimension of the feature map thus changes from the original $S \times S \times C_1$ to $\frac{S}{scale} \times \frac{S}{scale} \times scale^2 C_1$.

Regarding the non-strided convolutional layer: after the SPD layer completes this dimension transformation, a non-strided convolutional layer with $C_2$ filters is added, where $C_2 < scale^2 C_1$. The dimension of the feature map is thereby further transformed from $\frac{S}{scale} \times \frac{S}{scale} \times scale^2 C_1$ to $\frac{S}{scale} \times \frac{S}{scale} \times C_2$. By using non-strided convolution, the network preserves as much discriminative feature information as possible.
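The following is a minimal PyTorch sketch of an SPD-Conv block as described above; the 3 × 3 kernel of the non-strided convolution and the channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Sketch of SPD-Conv: a space-to-depth slicing (Formula (4))
    followed by a non-strided convolution, replacing a strided
    convolution or pooling layer."""

    def __init__(self, c1, c2, scale=2):
        super().__init__()
        self.scale = scale
        # Non-strided convolution with C2 < scale^2 * C1 filters, so
        # channel reduction is learned instead of pixels being discarded.
        self.conv = nn.Conv2d(scale ** 2 * c1, c2, kernel_size=3,
                              stride=1, padding=1, bias=False)

    def forward(self, x):
        s = self.scale
        # SPD layer: slice X into s^2 sub-maps f_{x,y} = X[x::s, y::s]
        # and concatenate them along the channel dimension.
        subs = [x[..., i::s, j::s] for i in range(s) for j in range(s)]
        x = torch.cat(subs, dim=1)   # (B, s^2*C1, S/s, S/s)
        return self.conv(x)          # (B, C2,      S/s, S/s)
```

For $scale = 2$, a 640 × 640 × 3 input becomes 320 × 320 × 12 after slicing, retaining every original pixel in the channel dimension before the convolution reduces it to $C_2$ channels.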
This paper modifies the original YOLOv5 network structure by embedding SPD-Conv modules at the strided convolutional layers in the backbone and neck. Specifically, an SPD-Conv is inserted at each strided convolutional layer and its subsequent connection layer, i.e., between the Conv module and the C3 module. In total, there are six replacements, four in the backbone and two in the neck, since the backbone contains four strided convolutions and the neck contains two.
3.2.2. Lightweight Network Structure
Although the preceding modifications enhance accuracy, they also make the network structure more complex, with more parameters and computation. This lowers the efficiency of small object detection and fails to meet the real-time needs of aerial image detection. In this paper, we use the Ghost convolution module [10] to lighten the network.
The Ghost module comprises three steps: a conventional convolution, ghost feature generation, and feature map concatenation. In this paper, the Conv and C3 modules in the original network structure are replaced with their Ghost counterparts, greatly reducing the network's parameters and improving detection speed.
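The sketch below illustrates these three steps in PyTorch; the kernel sizes and the half-and-half split between intrinsic and ghost channels are common choices assumed here, not necessarily those of our final network.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Sketch of the Ghost module: a conventional convolution produces a
    few intrinsic feature maps, a cheap depthwise operation generates
    'ghost' maps from them, and the two sets are concatenated."""

    def __init__(self, c1, c2, k=3, cheap_k=5):
        super().__init__()
        c_ = c2 // 2  # intrinsic channels; ghosts supply the rest (c2 even)
        # Step 1: conventional convolution.
        self.primary = nn.Sequential(
            nn.Conv2d(c1, c_, k, stride=1, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())
        # Step 2: cheap depthwise operation generating ghost maps.
        self.cheap = nn.Sequential(
            nn.Conv2d(c_, c_, cheap_k, stride=1, padding=cheap_k // 2,
                      groups=c_, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        # Step 3: concatenate intrinsic and ghost feature maps.
        return torch.cat([y, self.cheap(y)], dim=1)
```

Because the depthwise step costs far less than a full convolution over all output channels, the module produces the same number of feature maps with a fraction of the parameters and FLOPs.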
Combining the preceding work, this paper enhances the original network's feature pyramid structure: shallow feature maps strengthen feature extraction for small objects; the modules related to large objects in the feature extraction network, feature fusion network, and prediction branch are pruned to improve accuracy and speed; the upsampling algorithm is replaced; the design flaws of the CNN modules are remedied to preserve fine-grained information; and the entire model is made lightweight to meet practical requirements. The final network, called the PCSG model, is displayed in Figure 4. It can be described as follows: first, the original input image passes through the Focus structure and the GhostConv, SPD, and GhostC3 modules to obtain the corresponding downsampled feature maps. Then, after the SPP module, the features are upsampled by the CARAFE module and fused, via the Concat module, with the feature maps of corresponding scale obtained earlier. Afterward, the PAN structure works bottom-up on the feature maps produced by the FPN structure. Finally, detection heads are obtained for targets of different sizes.
5. Conclusions
While many object detection algorithms are effective at detecting objects of regular size, small object detection remains a significant challenge, chiefly because of the limited pixel proportion of small objects and the complexity of extracting their features. Moreover, current mainstream detection algorithms tend to be overly complex, carrying structural redundancy for small objects. In this study, we focus on aerial imagery, where these difficulties are especially pronounced. To address them, we modify the feature pyramid structure, fusing shallow feature maps and enhancing their reusability. To counter the redundancy of existing network structures for small object detection, we prune the related structures and introduce a new detection head. Furthermore, by adopting the lightweight and versatile upsampling operator CARAFE, we address the loss of local feature information through content-aware feature reassembly during upsampling. To retain as much discriminative feature information as possible, the strided convolutions and pooling layers of the existing network are replaced with the SPD-Conv module, and Ghost convolution is leveraged to reduce model complexity and enhance real-time performance. The resultant PCSG model achieves an impressive mAP of 97.8% on the RSOD dataset, demonstrating superior detection capability. This work targets small object detection in aerial images, and the methods adopted offer a useful reference for the field. Moving forward, we will focus on enhancing the model's generalizability, exploring further improvements from diverse perspectives, and incorporating YOLOv9 or other recent versions into the comparison.