1. Introduction
In the modern military field, air objects such as stealth fighters and unmanned aerial vehicles are combat weapons that can decide the success or failure of high-tech warfare. As a means of undertaking aerial reconnaissance and threatening the defender’s “visual center”, these weapons can exploit surprise and high precision to break through the defender’s three-dimensional defense network and deliver “surgical” precision strikes against crucial facilities, personnel, and equipment, paralyzing the overall combat system and weakening the defender’s combat capability. However, as high-value military targets, air objects have obvious non-cooperative and few-shot characteristics, and the large amounts of labeled data required by conventional detectors cannot be obtained, making accurate detection difficult. Therefore, research on air object detection methods under few-shot conditions, with the aim of realizing the accurate detection of air targets, is vital to protecting key components, enhancing early warning capabilities on the battlefield, and improving the combat defense system. In addition, such research provides a reference for the accurate detection of other important non-cooperative military targets.
Few-shot object detection can be roughly divided into two types: one based on meta-learning and the other based on fine-tuning [1]. For the meta-learning type, the class-independent parameters of the model are trained on a per-task basis to obtain a specialized meta-model. A small number of training samples are mapped to the new-class detection model to achieve the new-class detection task. For example, Yan et al. [2] introduced meta-learning ideas on the basis of mask regional convolutional neural networks (R-CNNs). They used support branches to obtain category attention vectors and fused them with query-image-extracted features to obtain new predictive features for object classification and localization. Kang et al. [3] used You Only Look Once version 2 (YOLOv2) as the basic framework and embedded meta-feature learners and feature-reweighting modules to obtain meta-features that could be generalized to new categories, enabling the detector to adapt to new categories quickly.
In contrast, fine-tuning-based methods solve few-shot problems through pre-training and fine-tuning. They apply a general supervised training approach, which minimizes the regularization losses of the pre-trained model during the fine-tuning stage using model gradient optimization methods to adapt to the detection of new categories. Gao et al. [4] proposed a multi-domain adversarial variational Bayesian inference method that minimized the inter-domain difference between the conditional distributions of the features of the base class and those of the new class. In addition, Cao et al. [5] constructed a compact new-class feature space based on fine-tuning to improve the new class’s detection performance.
Meta-learning methods play a major role in few-shot object detection research. Generally considered promising in object detection, they can perform well with a small number of sample inputs for specific tasks [6,7]. For example, Chen et al. [8] solved the uncertainty representation problem based on meta-learning using a dual-awareness attention mechanism. Similarly, Perez-Rua et al. [9] addressed the open accommodation problem for new categories by obtaining new-class category feature vectors through meta-learning. However, such techniques still show shortcomings in two aspects: first, the complexity of the meta-learning model increases the risk of overfitting the model’s parameters to the training base class [1], and second, meta-learning may fail to converge during the training iterations [10].
Therefore, from this perspective, fine-tuning-based learning has more advantages in terms of its universality and simplicity compared with meta-learning. Several fine-tuning-based studies have recently reported competitive results. For instance, Fan et al. [11] conducted an analysis based on the fine-tuning method and proposed a “Bias-Balanced RPN” and a secondary detector to eliminate the bias brought about by the base class during pre-training while ensuring that the class-independent knowledge is not forgotten. In other research, Kaul et al. [12] introduced a pseudo-labeling method based on the fine-tuning method, which increases the number of training samples for the new class by obtaining high-quality pseudo-labeled data, reducing the problem of too few samples in the new class and improving the model’s detection ability. Compared with the uniqueness and unfamiliarity of the meta-model, the fine-tuning-based method has stronger plasticity in terms of optimization technology, loss function, data enhancement, and architecture. Although the meta-learning method has stronger adaptability, in particular in the field of few-shot object detection, the vanishing gradient problem and its overall complexity limit its calculation steps, while the fine-tuning-based method encounters no such difficulties [1].
However, existing fine-tuning-based few-shot object detection methods focus on general-purpose objects in natural scenes and still exhibit deficiencies in the identification and localization of air objects. Compared with natural scenes, the few-shot characteristics of air objects introduce more prominent problems. Unlike natural object detection, which only needs to assign an object to a broad class, the detection of air objects must accurately determine an object’s specific model. Moreover, the distinguishing features of air objects are weak, making the specific model harder to identify, so a stronger feature extraction ability is needed to obtain more information about the object from a small number of samples and detect the air object effectively. In addition, because air objects fly in all weather conditions and samples are difficult to obtain, the shooting angle, time, and location of the samples cannot be controlled, resulting in a wide distribution of sample scales. Furthermore, the number of air objects at each scale is small, and existing few-shot detection networks may struggle to adapt: neither the discriminative power of object features at different scales nor the capability to process multi-scale features can be guaranteed, and it is challenging to design a detection model, in the manner of universal object detection in natural scenes, that accurately detects air objects at all scales.
To address the above issues, we propose a few-shot air object detection network (FADNet). Starting from the few-shot and multi-scale perspectives, the network structure enhances the model’s capability for detecting air objects. First, a multi-scale attention mechanism (MAM) is introduced after the backbone network to extract object features from both the spatial and channel aspects. This further aggregates the local contextual features and global features to improve the object information extraction capability of the network. Second, the feature pyramid of the neck network is improved by adding jump connections based on the path aggregation network (PANet) [13], and sparsely connected convolution is added to the multi-scale outputs. The number of corresponding convolution groups is set for the outputs of different scales to improve the discriminative power of the features at each scale. Lastly, we integrate a multi-scale regional proposal network (MRPN) designed around the multi-scale characteristics of the features, changing the previous mode of multiple inputs and a single output. The MRPN is built on the multi-scale outputs, and adaptive convolution is introduced at its front end to process features at different scales effectively and enhance the object recognition ability of the detection network.
The main contributions of this article are as follows:
FADNet is proposed to solve the problem of the low precision rate of air object detection under the influence of few-shot characteristics and multi-scale properties, improving the detection capability of the network;
A MAM was designed to realize the deep fusion of object features from both the spatial and channel dimensions, and more effective information about object features was extracted;
Based on multi-scale characteristics, first, the feature pyramid structure was improved, a jump connection was added to PANet [13], and sparsely connected convolution was introduced to the outputs of each scale, which improved the discriminative properties of the features of each scale. Second, a multi-scale regional candidate network was constructed to adaptively extract feature information for different scale outputs, and a multi-input and multi-output model was established to utilize the multi-scale features effectively;
The designed algorithm was experimentally validated on the general PASCAL VOC dataset and our self-developed few-shot military air object dataset, achieving good results.
2. Related Work
This section reviews the existing deep learning object detection algorithms, few-shot learning algorithms, and few-shot object detection algorithms related to this article.
First, we discuss object detection based on deep learning. Currently, object detection algorithms are divided into two main categories: two-stage and single-stage. During the detection process, two-stage algorithms first create region proposal boxes, distinguishing between the background and foreground, and then perform classification and localization regression operations on each proposal box. In 2014, Girshick et al. [14] proposed the two-stage regional convolutional neural network (R-CNN) model for the first time. However, because of its cumbersome steps and slow computation, researchers proposed the fast R-CNN [15] and the faster R-CNN [16] to improve precision and reduce computation time. At present, two-stage algorithms are widely applied in fields such as unmanned driving, military detection [17,18], facial recognition, and industrial detection, yielding good results.
Compared with two-stage algorithms, single-stage algorithms directly predict, locate, and classify from the feature maps. Redmon et al. [19] first proposed the You Only Look Once (YOLO) algorithm in 2016, which has since undergone version updates in the YOLO series [20,21,22], gradually becoming an important framework for single-stage object detection. The single-shot detector (SSD) algorithm [23] draws on the advantages of the two-stage approach and integrates the design concept of the faster R-CNN into the single-stage algorithm. The deconvolutional single-shot detector (DSSD) algorithm [24] adopted ResNet-101 [25] as the feature extraction network on this basis, exhibiting improved detection performance. Compared with two-stage algorithms, single-stage algorithms are simpler to implement and faster to train, but because they lack RPNs, their overall precision is inferior to that of two-stage algorithms.
The next area of interest relates to few-shot learning. Few-shot learning aims to use a small number of samples to acquire new knowledge. The central premise of this method is to accurately transfer knowledge from a base-class training model to a new class. The existing few-shot learning methods can be roughly divided into three categories. The first constitutes optimization-based methods, such as model-agnostic meta-learning (MAML) [26], which learn through well-initialized rules within a relatively short period of time. Using MAML as a basis, Jamal et al. [27] developed and proposed task-agnostic meta-learning (TAML) to solve the problem of meta-learner bias. The second method is based on metric learning, which obtains the generalized metric space of a category through learning to perform subsequent similarity measurement operations. Karlinsky et al. [28] introduced multimodal distribution into metric learning to achieve end-to-end training of the backbone network parameters and the embedded spatial distribution. Wang et al. [29] utilized global vectors for word representation encoding to embed label information into feature maps, achieving feature enhancement of the data. The third method is based on parameter generation [30]. Unlike the other methods, this method obtains a superior network model by pre-training and then fine-tuning the class-related parameters in the second stage to achieve better adaptation to new tasks. Sun et al. [31] and Liu et al. [32] integrated the MAML method into model fine-tuning, achieving algorithm improvements and enhancing the generalization performance of the algorithm.
The literature also covers few-shot object detection. Similar to few-shot object classification, most few-shot object detection methods currently use two-stage training, namely, a pre-training stage and a fine-tuning stage. However, this task differs from few-shot learning in that it must not only recognize the object in the sample but also locate its specific position against the background, which is more difficult to achieve. In order to improve the detection accuracy of few-shot objects, meta-R-CNN [2] introduces meta-learning into an R-CNN; it does not extract feature map information from a holistic perspective but instead focuses on the features of each region of interest (ROI). Fan et al. [33] designed an aggregation model called Attention RPN based on the meta-learning network model, which measures the similarity between the support set features and the query set features from three perspectives (global, local, and cross-correlation), helping the detector to better distinguish different categories.
Li et al. [34] proposed a category marginal reconstruction method to transform the few-shot object detection problem into a few-shot object classification problem. This was carried out by introducing a fully connected layer at the end of the detector to decouple the conflicting classification and regression features. Meanwhile, a category boundary loss is added to the feature learning to create marginal space between the new class and the base class and improve the new-class detection level. Yin et al. [35] improved the meta-learning method to study few-shot object detection in an incremental learning setting. They proposed a hyper-network-based few-shot object detection method, which solves the difficulties encountered in incremental learning: new categories can be learned sequentially and incrementally without additional training, which improves the effectiveness of the model’s detection. Similarly, Zhang et al. [6] introduced a novel inter-class correlation meta-learning strategy to achieve the robust and efficient detection of few-shot objects by extracting and utilizing the correlations of different categories. This inter-class correlation method can focus on multiple support categories simultaneously, reducing the misclassification of similar samples and enhancing the generalization ability of new-class samples. Qiao et al. [36] improved the fine-tuning method, analyzing it from a multi-task and multi-stage perspective. They proposed a fast decoupling method, which decouples the feed-forward network and gradient updating through the introduction of a gradient decoupling layer, redefining the forward and backward operations. Meanwhile, an offline classification module was added to the detection back end, which realized classification correction through extra scores and improved the ability of category judgment.
The authors of [37] proposed a multi-scale positive sample refinement (MPSR) model for few-shot object detection. This generates multi-scale samples through data augmentation and establishes a fast R-CNN branch to alleviate the problem of insufficient samples. Furthermore, Khandelwal et al. [38] improved the generalization ability and detection performance of few-shot object detection by calculating the semantic similarities between the new and base classes and transferring the regression and classification weights to the new class. Sun et al. [39] mixed the new and base classes to form a fine-tuning dataset to reduce the differences in the features between the classes. Zhang et al. [40] proposed the Cooperative Region Proposal Network (CoRPN) to solve the problem of foreground-background imbalances exacerbated by insufficient sample data, increasing the number of foreground classifiers and avoiding the loss of more pre-selection boxes.
In the above studies, by building new network structures and improving the fine-tuning-based or meta-learning methods, few-shot object detection alleviates the problem of having insufficient samples of new classes during detection, solves the multi-task, multi-stage coupling contradiction in detection, and reshapes the feature space of the new and base classes. This improves the accuracy of detecting new classes and the network model’s overall detection capability. However, the above methods focus on the detection of natural objects and lack relevance for the detection of few-shot military targets such as air objects. Particularly significant is the lack of attention these methods pay to the multi-scale characteristics of air objects, even though the scale problem is considered the core of object detection [41]; this seriously restricts the detection capability for air objects under few-sample conditions. In contrast, our proposed network, which is based on the characteristics of air objects, such as their few-shot and multi-scale nature, makes targeted modifications to the backbone network, neck network, detector, etc., achieving efficient multi-scale feature extraction and feature processing and improving the detection performance of few-shot military targets such as air objects.
3. Methods
This paper proposes FADNet to build a network model for few-shot and multi-scale situations. By improving the ability to extract object features, the discriminative power of features at each scale, and the utilization of multi-scale features, air object detection performance is enhanced. The network structure is shown in Figure 1; it comprises a backbone network, a neck network, and a detector. The backbone network is a transformer [40] that takes the air object image as input, and its feature extraction network is composed of four main parts. First, the input image is divided into blocks to form 56 × 56 × 48 feature vectors, which are then passed through four stages that sequentially output feature vectors at different levels to achieve multi-scale feature extraction of the object. After being processed by the MAM, the output multi-scale features are input into the neck network, the IPANet, which carries out the deep extraction and fusion of object features at each scale. The output results of each scale are then input into the detector, which comprises ROI Align, HEAD, and the MRPN, consisting of multiple RPN branches. The features of each scale are input into the corresponding region candidate network and ROI Align to form the candidate boxes, and the final classification and location results are output by HEAD.
The design premise and specific implementations of the MAM, IPANet, and MRPN are discussed in detail below.
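For orientation, the following is a minimal, hypothetical sketch of the data flow described above in PyTorch-style code. All module names (backbone, mam, ipanet, mrpn_branches, roi_align, head) are placeholders for the components detailed in Sections 3.1–3.3, and the exact wiring (e.g., whether the MAM is applied to every stage output) follows our reading of Figure 1 rather than a released implementation.

```python
# Hypothetical end-to-end flow of FADNet (names are placeholders, not a released API).
def fadnet_forward(image, backbone, mam, ipanet, mrpn_branches, roi_align, head):
    feats = backbone(image)                 # four-stage multi-scale backbone features
    feats = [mam(f) for f in feats]         # multi-scale attention (Section 3.1)
    p_feats = ipanet(feats)                 # improved feature pyramid, P2-P5 (Section 3.2)
    # one RPN branch per scale (Section 3.3); each returns candidate boxes for its level
    proposals = [rpn(p) for rpn, p in zip(mrpn_branches, p_feats)]
    rois = roi_align(p_feats, proposals)    # pooled features for every candidate box
    return head(rois)                       # classification scores and box locations
```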
3.1. MAM
In order to extract more air object feature information and enhance the detection performance of air object models, we designed the MAM after the backbone network. It deeply fuses the spatial H- and W-direction information and the channel local and global information with the original features to realize multi-scale attention on the input features in the spatial and channel dimensions and enhance the object feature extraction capability. The MAM structure is divided into left and right parts, as shown in Figure 2. The left side constitutes the spatial attention module, which performs average pooling operations along the width $W$ and the height $H$ of the input feature $X \in \mathbb{R}^{C \times H \times W}$, obtaining the $C \times H \times 1$ dimensional feature $X_h$ and the $C \times 1 \times W$ dimensional feature $X_w$. The formulas are as follows:

$$X_h = P_W(X), \qquad X_w = P_H(X)$$

where $P_W$ and $P_H$ are the average pooling operations along $W$ and $H$, respectively.

$X_h$ and $X_w$ are then concatenated, and the concatenated feature is passed through a $1\times1$ convolution for channel mixing, which reduces the number of channels to $C/r$ and allows for the effective interaction of the feature information of each channel. The output fused feature $Y$ of dimension $(C/r)\times(H+W)\times1$ is separated into the $(C/r)\times H\times1$ dimensional feature $Y_h$ and the $(C/r)\times1\times W$ dimensional feature $Y_w$ after BN-layer and ReLU processing. The formulas are as follows:

$$Y = \mathrm{ReLU}\big(\mathrm{BN}\big(f^{1\times1}(\mathrm{Concat}(X_h, X_w))\big)\big), \qquad [Y_h, Y_w] = \mathrm{Split}(Y)$$

where $\mathrm{Concat}$ represents the feature concatenation operation and $f^{1\times1}$ represents a convolution with a kernel size of 1.

$Y_h$ and $Y_w$ are each processed by a $1\times1$ convolution, with the number of channels restored to $C$, yielding the spatial attention features $A_h$ and $A_w$ for the dimensions $C\times H\times1$ and $C\times1\times W$, respectively, through the nonlinear operation of the sigmoid function. The formulas are as follows:

$$A_h = \sigma\big(f^{1\times1}(Y_h)\big), \qquad A_w = \sigma\big(f^{1\times1}(Y_w)\big)$$

On the right is the channel attention module, which performs global average pooling over the spatial dimensions of the input feature $X$ to obtain the $C\times1\times1$ dimensional feature $X_g$. $X_g$ is fed to a $1\times1$ convolution, and the number of channels is compressed to $C/r$. The output $(C/r)\times1\times1$ dimensional features are processed by the BN layer and the ReLU function, and a further $1\times1$ convolution then restores the number of channels to $C$, resulting in the global attention feature $A_g$. The formulas are as follows:

$$A_g = f^{1\times1}\big(\mathrm{ReLU}\big(\mathrm{BN}\big(f^{1\times1}(\mathrm{GAP}(X))\big)\big)\big)$$

where $\mathrm{GAP}$ represents the global channel average pooling operation.

The other branch directly performs a $1\times1$ convolution operation on the input features. The output $(C/r)\times H\times W$ dimensional features are processed by the BN layer and the ReLU function, and a $1\times1$ convolution is then performed again to output the local attention feature $A_l$. The outputs $A_g$ and $A_l$ are normalized by the BN layer, added in the channel dimension, and passed through the sigmoid nonlinear function to obtain the $C\times H\times W$ dimensional channel attention feature $A_c$. The formulas are as follows:

$$A_l = f^{1\times1}\big(\mathrm{ReLU}\big(\mathrm{BN}\big(f^{1\times1}(X)\big)\big)\big)$$
$$A_c = \sigma\big(\mathrm{BN}(A_g) \oplus \mathrm{BN}(A_l)\big)$$

where $\oplus$ represents the feature addition operation for the channel dimension.

Finally, the spatial attention features $A_h$ and $A_w$ output by the spatial attention module and the channel attention feature $A_c$ output by the channel attention module are multiplied by the input feature $X$ to obtain the multi-scale attention module output feature $X_{\mathrm{out}}$. The formula is as follows:

$$X_{\mathrm{out}} = X \otimes A_h \otimes A_w \otimes A_c$$
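As a concrete illustration, the following PyTorch module is a minimal sketch of the MAM as described above. The reduction ratio $r$, the shared $1\times1$ convolution for the pooled features, and the coordinate-attention-style concatenation along the spatial axis are assumptions made for the sketch rather than details confirmed in this section.

```python
import torch
import torch.nn as nn


class MAM(nn.Module):
    """Minimal sketch of the multi-scale attention mechanism (MAM).

    Left branch: H/W average pooling -> shared 1x1 conv (channels reduced to
    C/r) -> split -> per-direction 1x1 convs + sigmoid -> spatial weights.
    Right branch: global (pooled) and local (per-pixel) 1x1 bottlenecks,
    summed and passed through a sigmoid -> channel weights.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(channels // reduction, 8)
        # shared transform for the concatenated H- and W-pooled features
        self.conv_hw = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)
        # channel branch: global (pooled) and local (per-pixel) bottlenecks
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )
        self.local_branch = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # spatial attention: average over W (keep H) and over H (keep W)
        x_h = x.mean(dim=3, keepdim=True)                      # B x C x H x 1
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # B x C x W x 1
        y = self.conv_hw(torch.cat([x_h, x_w], dim=2))         # B x mid x (H+W) x 1
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                            # B x C x H x 1
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))        # B x C x 1 x W
        # channel attention: sum of global and local responses
        a_c = torch.sigmoid(self.global_branch(x) + self.local_branch(x))
        return x * a_h * a_w * a_c


if __name__ == "__main__":
    out = MAM(channels=96)(torch.randn(2, 96, 56, 56))
    print(out.shape)  # torch.Size([2, 96, 56, 56])
```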
3.2. IPANet
PANet [13], proposed by Liu et al., includes multiple improvements to the mask R-CNN. It adds a bottom-up path to the back end of the original feature pyramid, and it uses adaptive feature pooling to incorporate full-fusion operations to address the information loss along the long path of the feature pyramid. However, PANet [13] has not completely solved this problem, especially for air objects; their few-shot characteristics and cross-scale problems mean that the discriminative information PANet [13] extracts for object features at each scale is not sufficiently strong. As a result, we propose an improved feature pyramid based on PANet [13], as shown in Figure 3.
In Figure 3, B2, B3, B4, and B5 represent the multi-scale input features output by the four stages of the backbone network. The steps of the feature pyramid can be divided into three stages. In the first stage, the input features undergo $1\times1$ convolutions and top-down up-sampling operations, strengthening the transmission of high-level semantic information to the low-level features. The feature channel dimension is unified to 256, and the output features are C2, C3, C4, and C5. In the second stage, the output features undergo $3\times3$ convolutions and bottom-up down-sampling operations, further extracting the multi-scale information. At the same time, this strengthens the upward transmission of the low-level, strongly localized features into the multi-scale information, outputting the features M2, M3, M4, and M5. In the third stage, M2, M3, M4, and M5 are convolutionally processed into the output features P2, P3, P4, and P5, respectively. The formulas for the three stages are as follows:

$$C_i = f^{1\times1}(B_i) + \mathrm{Up}(C_{i+1}), \quad i = 2, 3, 4, \qquad C_5 = f^{1\times1}(B_5)$$
$$M_i = f^{3\times3}\big(C_i + \mathrm{Down}(M_{i-1})\big), \quad i = 3, 4, 5, \qquad M_2 = C_2$$
$$P_i = f^{3\times3}(M_i), \quad i = 2, 3, 4, 5$$

where $\mathrm{Up}$ and $\mathrm{Down}$ denote the up-sampling and down-sampling operations, respectively.
On this basis, we propose two points of improvement. First, in the transmission of information at the various scales, jump connection paths are added so that the output layer not only effectively fuses high-level and low-level feature information through the up and down paths but also retains the unfused information of the original nodes. This reduces the loss of information during the transmission process. Taking the second stage as an example, the formula after adding a skip connection is as follows:

$$M_i = f^{3\times3}\big(C_i + \mathrm{Down}(M_{i-1}) + f^{1\times1}(B_i)\big), \quad i = 3, 4, 5$$
Second, we add sparsely connected convolutions to the back ends of features P2, P3, P4, and P5 to further extract and fuse the multi-scale information. The specific structure of the sparsely connected convolution is shown on the right side of Figure 3; it comprises depthwise convolutions and grouped $1\times1$ convolutions. The depthwise convolutions are used to extract information at each scale, while the grouped $1\times1$ convolutions are used to enhance the information fusion between the channels. For the features P2, P3, P4, and P5 at the various scales, because the feature level and the information interaction among the channels gradually decrease, the number of groups is increased gradually when performing the grouped $1\times1$ convolution operations to promote the fusion of the low-level feature information. The formulas are as follows:

$$P_i' = G_{g_i}^{1\times1}\big(\mathrm{DWConv}(P_i)\big), \quad i = 2, 3, 4, 5$$

where $G_{g_i}^{1\times1}$ is a grouped $1\times1$ convolution, $g_i$ is the number of groups used at scale $i$, and $\mathrm{DWConv}$ is the depthwise convolution.
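A minimal PyTorch sketch of this sparsely connected convolution is given below; the $3\times3$ depthwise kernel size and the per-level group counts (1, 2, 4, 8 for P2–P5) are illustrative assumptions, since the exact values are not stated in this excerpt.

```python
import torch
import torch.nn as nn


class SparseConv(nn.Module):
    """Sketch of the sparsely connected convolution applied to a pyramid output:
    a depthwise convolution for per-scale spatial extraction followed by a
    grouped 1x1 convolution for channel fusion."""

    def __init__(self, channels: int = 256, groups: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, groups=groups)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


# One module per pyramid level; the group counts below are example values only.
sparse_convs = nn.ModuleList(SparseConv(256, groups=g) for g in (1, 2, 4, 8))
p_outputs = [conv(torch.randn(1, 256, s, s))
             for conv, s in zip(sparse_convs, (64, 32, 16, 8))]
```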
3.3. MRPN
We contend that, for air objects, the existing structure of regional candidate networks is too simple to effectively handle the relevant information at various scales, especially in few-shot situations where the requirements for object feature selection vary from scale to scale. A single regional candidate network can constrain the information at various scales and have adverse effects on the final object recognition and positioning.
To address this issue, we improved the existing regional candidate network by building a multi-scale regional candidate network. The structure diagram is shown in Figure 4. We split the single regional candidate network into multiple networks to adapt to the feature requirements of the different scales of information and to avoid conflicts caused by mixing information from the different scales. Each sub-region candidate network consists of a front-end feature extraction section and a back-end classification and localization section. The front-end processing part is composed of a standard convolution branch and a self-adaption (SA) convolution branch in parallel, which are then connected in series with a further convolution.
The SA convolution performs multi-scale feature extraction operations, adaptively extracting object features based on the object information. First, the SA convolution extracts the input object feature information using two parallel convolutions with different kernel sizes, concatenates the extracted information in the channel dimension, and performs global maximum pooling. After passing through two fully connected layers, a one-dimensional channel feature vector is obtained. Second, the obtained vector is input into the softmax function, which outputs the feature information weights of the two convolutions. The proportion of each convolution within the adaptive convolution of the candidate network at this scale is thereby determined and multiplied by the feature information extracted by the corresponding convolution. The results are concatenated in the channel dimension to output the adaptive convolution result. The SA convolution formulas are as follows:

$$[w_1, w_2] = \mathrm{Softmax}\Big(\mathrm{FC}\big(\mathrm{GMP}\big(\mathrm{Concat}(f_1(X), f_2(X))\big)\big)\Big)$$
$$X_{\mathrm{SA}} = \mathrm{Concat}\big(w_1 \otimes f_1(X),\; w_2 \otimes f_2(X)\big)$$

where $f_1$ and $f_2$ represent the two parallel convolution operations, respectively, $\mathrm{GMP}$ represents the global maximum pooling operation, and $\mathrm{FC}$ represents the two fully connected layer operations.
After the SA convolution processing, its output is multiplied by the output of the parallel convolution branch, and the fused object features are extracted by the serial convolution and input into the back-end classification and positioning part. After the classification and positioning convolution processing, the preliminary position and foreground information of the object are obtained and fused to form the object candidate boxes, which are input into the detection head network. Thus, we obtain the recognition results and precise positions of the air object.
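The sketch below illustrates one way the SA convolution and an MRPN branch front end could be assembled in PyTorch, in the spirit of selective-kernel convolution. The $3\times3$/$5\times5$ branch kernels, the per-channel softmax weighting, the fully connected layer width, and the channel widths chosen to make the element-wise multiplication work are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SAConv(nn.Module):
    """Sketch of the self-adaption (SA) convolution: two parallel convolutions
    whose outputs are re-weighted by softmax scores predicted from globally
    max-pooled features, then concatenated along the channel dimension."""

    def __init__(self, channels: int = 256, reduction: int = 4):
        super().__init__()
        self.channels = channels
        self.branch1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch2 = nn.Conv2d(channels, channels, 5, padding=2)
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 2 * channels),
        )

    def forward(self, x):
        f1, f2 = self.branch1(x), self.branch2(x)
        pooled = F.adaptive_max_pool2d(torch.cat([f1, f2], dim=1), 1).flatten(1)
        w = torch.softmax(self.fc(pooled).view(-1, 2, self.channels), dim=1)
        w1 = w[:, 0].unsqueeze(-1).unsqueeze(-1)   # per-channel weight, branch 1
        w2 = w[:, 1].unsqueeze(-1).unsqueeze(-1)   # per-channel weight, branch 2
        return torch.cat([w1 * f1, w2 * f2], dim=1)  # 2C output channels


class MRPNBranchFrontEnd(nn.Module):
    """Front end of one MRPN branch: a standard convolution in parallel with the
    SA convolution, their outputs multiplied element-wise and reduced by a 1x1
    convolution before the usual RPN classification/regression heads (omitted)."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # widened to 2C so it matches the SA convolution output for multiplication
        self.std_conv = nn.Conv2d(channels, 2 * channels, 3, padding=1)
        self.sa_conv = SAConv(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        return self.fuse(self.std_conv(x) * self.sa_conv(x))


if __name__ == "__main__":
    feat = torch.randn(1, 256, 32, 32)
    print(MRPNBranchFrontEnd()(feat).shape)  # torch.Size([1, 256, 32, 32])
```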
4. Experimental Results and Analysis
4.1. Experimental Setup
The hardware platform configuration for the experimental training phase is shown in Table 1. This study used the PyTorch deep learning development framework for the experiments.
4.2. Few-Shot Dataset of Military Air Objects
Our few-shot dataset included five types of military air attack objects: F35, Su57, MQ9, RQ4, and B2. The dataset was divided into training, validation, and testing sets. According to the requirements of the few-shot tasks, the training set included one, two, three, five, or ten photos of each of the five object types, corresponding to the 1-, 2-, 3-, 5-, and 10-shot model training settings. At the same time, ten photos of each of the five object types formed a validation set to assist in the model training. In addition, we provided five images from each of the five object categories, totaling twenty-five images, to form a test set for evaluating the performance of the model. Finally, the image labeling software LabelImg 1.6.0 was used to label the sample data in the training, validation, and test sets using a dataset label format similar to the PASCAL VOC data label format.
4.3. Evaluating Indicator
The definitions of detection precision and object recall in deep learning are shown in Formulas (23) and (24), respectively, as follows:

$$Precision = \frac{TP}{TP + FP} \tag{23}$$
$$Recall = \frac{TP}{TP + FN} \tag{24}$$

where $TP$ is the number of real objects detected by the algorithm, $FP$ is the number of false objects detected by the algorithm, and $TP + FN$ is the number of real objects that actually exist in the image. The average precision ($AP$) is a combination of detection precision and object recall. According to the calculation method in [2], the confidence threshold was set to 0.5 to evaluate the detection performance of the detection model for a single category. The mean of the $AP$ values over the detected categories ($mAP$) was used to evaluate the overall performance of the detection model. The expression is shown in Formula (25):

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \tag{25}$$

where $N$ represents the total number of categories (here, $N = 5$) and $AP_i$ is the average precision of the $i$-th category. The higher the values of $AP$ and $mAP$, the better the detection performance of the model, and vice versa.
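For reference, a small Python sketch of Formulas (23)–(25), computing the metrics from raw detection counts and per-class AP values (the helper names and the example numbers are ours, not from the paper):

```python
def precision(tp: int, fp: int) -> float:
    """Formula (23): fraction of detections that correspond to real objects."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Formula (24): fraction of real objects in the image that are detected."""
    return tp / (tp + fn)


def mean_average_precision(ap_per_class: list[float]) -> float:
    """Formula (25): mean of the per-class AP values over N categories."""
    return sum(ap_per_class) / len(ap_per_class)


# Example with hypothetical numbers: 40 correct detections, 10 false alarms,
# 8 missed objects, and per-class AP values for the five air object classes.
print(precision(40, 10), recall(40, 8))
print(mean_average_precision([0.62, 0.55, 0.48, 0.71, 0.66]))
```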
4.4. Implementation Details
We used FADNet as our network model for the network implementation. In the base-class training phase, we used the 15 classes of objects in the PASCAL VOC dataset other than birds, buses, cows, motorbikes, and sofas as the base-class dataset; these five excluded classes served as the new classes. The training process applied the SGD optimizer with 15,000 iterations, a learning rate of 0.02, a batch size of 16, a momentum of 0.9, and a weight decay of 0.0001. In the fine-tuning phase, the learning rate was 0.001; the iterations of the 1-, 2-, 3-, 5-, and 10-shot tasks were 3000, 6000, 9000, and 15,000, respectively; and the batch size, momentum, and weight decay were unchanged.
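As a minimal sketch of the optimizer settings reported above (the `model` stand-in and the absence of a learning-rate schedule are our own simplifications, not the released training code):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, 3)  # stand-in for the FADNet model

# Base-class training settings reported above; during fine-tuning only the
# learning rate changes (0.02 -> 0.001) while the other values stay the same.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.02,
    momentum=0.9,
    weight_decay=0.0001,
)
```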
4.5. Analysis of the Results of the Air Object Comparison Experiments
We used TFA/fc [7], TFA/cos [7], Attention RPN [35], TIP [42], DCNet [43], MPSR [36], LVC [12], and our algorithm to detect the military air objects in the few-shot dataset constructed in this study. A comparison of the detection results is shown in Table 2. Clearly, the algorithm proposed in this paper exhibited the strongest detection ability, especially in the three-shot task, showing a significant improvement in detection performance compared with the other algorithms. Compared with the suboptimal algorithm MPSR, the overall performance increased by an average of 1.1%, effectively improving the detection precision for air objects in various shot tasks.
Figure 5 shows the AP@0.5 results of the algorithm we designed alongside those of TFA/fc [7], TFA/cos [7], Attention RPN [35], TIP [42], DCNet [43], MPSR [36], and LVC [12]. A comparison of the visual output results of the network models discussed above is also shown in Figure 5.
4.6. Air Object Ablation Experiment
The algorithm designed in this study, FADNet, is proposed based on the multi-scale problem of air objects and constructed from the MAM, IPANet, and MRPN. To evaluate the degree to which the different module combinations and improvements optimize the algorithm’s performance, we designed ablation experiments. Table 3 shows the results of the ablation experiments, which were conducted on the few-shot dataset of air objects under the same experimental conditions.
The experimental results indicated that the different combinations had positive impacts on the overall performance of the model, with the baseline model scoring 25.6, 29.4, 27.4, 37.1, and 44.2 for the 1-, 2-, 3-, 5-, and 10-shot tasks, respectively. After replacing the ResNet-101 backbone network with the transformer network, performance on the shot tasks increased by an average of 5.1 percentage points compared with the baseline model, significantly improving the ability to process multi-scale information. After adding the MAM, effective feature fusion was achieved for the feature information in the channel and spatial dimensions, and effective aggregation of the local and global features was achieved in the channel dimension. The performance of each shot task increased to varying degrees, especially at 3 and 10 shots, which increased by 4.3 and 4.6 percentage points, respectively. In response to the characteristics of the air object, we improved the original PANet network after the MAM, further integrating the multi-scale features of the air object. With this improvement, the detection performance increased most significantly for 10 shots, by 2.7 percentage points. To further improve the processing of the multi-scale features and solve the multi-scale problem of air objects, we built the MRPN, which further refined each shot task, especially the 5- and 10-shot tasks, which achieved increases of 3.8 and 3.7 percentage points, respectively, indicating stronger detection precision of the model.
4.7. Analysis of the Detection Results for the PASCAL VOC Dataset
We used TFA/fc [7], TFA/cos [7], Attention RPN [35], TIP [42], DCNet [43], MPSR [36], LVC [12], FORD+BL [44], and our algorithm for few-shot object detection on the PASCAL VOC dataset. A total of 5 types of objects (birds, buses, cows, motorbikes, and sofas) were identified as new-class objects, while the remaining 15 out of 20 were identified as the base classes. The detection results are shown in Table 4; the optimal results in the 1-, 2-, 3-, 5-, and 10-shot tasks are given in bold. The results of our proposed method in the 1- and 2-shot tasks are lower than those of the FORD+BL algorithm and, thus, suboptimal, but our method performed better than the other algorithms. Meanwhile, in the 3-, 5-, and 10-shot tasks, the proposed method yielded optimal results, performing the best of all algorithms, with the highest performance achieved for the 3-shot task, 3.3 percentage points greater than that of the next best performer. The smallest performance improvement occurred with the 10-shot task, with an increase of 1.4 percentage points.
Compared with the results for the air object dataset, the results of our proposed method for the PASCAL VOC dataset show a smaller advantage. This is because our proposed method is designed around the characteristics of air objects, and the air object dataset is more specific to them. Unlike air objects, which exhibit less individual variability, the natural objects that dominate the PASCAL VOC dataset span a larger number of categories (20 in total) and exhibit more variability within each class. As a result, the effectiveness of our method decreases in the 1- and 2-shot tasks, when the number of samples is small, and although our method outperforms the others as the number of samples rises (i.e., in the 3-, 5-, and 10-shot tasks), its resultant superiority for the PASCAL VOC dataset is still lower than that for the air object dataset. However, the overall dominance of our method for the PASCAL VOC dataset, especially when the shot number is large, demonstrates the effectiveness and generalization ability of our method. A comparison of the visual output results of the network models discussed above is shown in Figure 6.