1. Introduction
Thanks to the development of large-scale computing devices, deep learning has made rapid progress. As a branch of deep learning research, object detection is widely used in production and daily life owing to its stability, high accuracy, and detection speed. It localizes and classifies objects and marks them in images with text labels and bounding boxes. However, deep-learning-based object detection algorithms usually need to learn object feature representations from large-scale labeled data before they can classify and locate objects, which consumes considerable human and material resources [1,2,3]. Moreover, in some application scenarios, such as rare species detection and industrial defect detection, it is challenging to obtain a large amount of training data. Inspired by the human cognitive ability to recognize a new thing from only a few samples, researchers argue that since neural networks imitate the reasoning process of human neurons, they should possess similar learning capabilities [4]. Therefore, FSOD came into being, which is dedicated to realizing detection with only a few training samples. Current mainstream FSOD methods can be divided into metric-learning-based, data-augmentation-based, and meta-learning-based methods.
Metric-learning-based methods [5,6,7,8] usually utilize the feature distribution of objects for classification. Li et al. [5] propose a boundary balance condition for the target distribution in the feature space, which reduces both the uncertainty of novel class representations caused by excessively large feature distances and the difficulty of representing novel class features caused by excessively small ones. Sun et al. and Karlinsky et al. [6,7] classify features by comparing the distribution distance between query and support features. However, metric learning relies heavily on the sampling strategy: if the strategy is too simple, the model learns only simple distributions and cannot be applied to complex scenarios; if it is too hard, the model converges with difficulty or not at all. Augmentation-based methods [9,10] enrich data diversity within limited data through various techniques. Li et al. [10] increase data diversity by adding noise and occlusion to the images, which improves the model's consistent representation of the same object under different conditions. Zhang et al. [9] employ hallucination networks to generate more object proposals, enriching the training data indirectly. However, these methods still cannot achieve good results in FSOD with very little training data. Meta-learning-based methods [5,11,12,13,14,15,16] avoid the above problems. The model is usually built on a Siamese network structure [17] and learns to discern the objects in the query images by relying on the information provided by the support images. This mode of continuously adapting to each specific task equips the model with an abstract learning ability that generalizes easily from a few training samples.
In existing meta-learning-based FSOD methods, most models still take ResNet [18] and VGG [19] as the network backbones of the dual-branch structure. This leaves the models insufficient for small object detection [20,21], reaching only 1–3% mAP (mean average precision) on the MS COCO dataset [22]. At the same time, many hard samples exist in the support branch (as shown in Figure 1). In these samples, a large proportion of the image regions are background or objects of other categories, and only a tiny part belongs to the useful support objects. As a result, the model fails to obtain support features that accurately represent the category, which degrades its recognition of objects belonging to the same class in the two branches. In addition, during training, most models need to filter out the support features of the same class according to the query labels to enhance the query features, highlighting the object features belonging to the same category in the two branches. This seems reasonable, but it exploits only the support feature information of the same category, and the model never acquires the ability to actively distinguish objects of that category from the support images. During testing, where no query labels exist, support images of the same class have to be prepared manually, making the process time-consuming and laborious.
To overcome the above deficiencies, this paper proposes MSFFAL based on meta-learning. First, we adopt a multi-scale feature fusion strategy and design the backbone as ResNet + feature pyramid networks (FPNs) [21] to improve the model's recognition of small objects. Then, we optimize the model's representation of hard support samples by introducing the channel attention structure SENet [23] in the support branch to weight the features of foreground objects. Finally, we design an attention loss that lets query features perform attention calculations with all support features; the computed attention scores constrain the model's representation of the query features. Through the attention loss, the model learns to actively focus on objects of the same category in the two branches and no longer depends on query labels. Experiments on the benchmark datasets Pascal VOC [24,25] and MS COCO [22] prove the effectiveness of our method.
To summarize, the main contributions of this paper are as follows:
- (1)
We propose the MSFFAL framework for few-shot object detection. The backbone of our model mainly contains multi-scale feature fusion and a channel attention mechanism. The former is introduced to improve the model's detection accuracy on small objects. The latter is adopted to strengthen the model's representation of hard samples in the support branch and enhance its attention to foreground object features.
- (2)
We design an attention loss to endow the model with an active recognition ability, realize consistent representations of objects belonging to the same category in the two branches, and improve the model's generalization to novel classes. On this basis, the model no longer relies on label-guided feature selection, which avoids difficulties during testing.
- (3)
We conduct extensive experiments on the benchmark datasets Pascal VOC and MS COCO to verify the effectiveness of our method. The experimental results show that our model leads the SOTAs by 0.7–7.8% on Pascal VOC. We also achieve a substantial lead over the baseline model in MS COCO's small object detection.
This paper consists of five sections. Section 1 introduces the research background of FSOD and the motivation for our work. Section 2 presents related work on FSOD and describes the problems and optimization opportunities of previous methods. Section 3 introduces our algorithm in detail. Section 4 first introduces the datasets selected in this paper and the relevant experimental settings and then presents sufficient experimental results to prove the reliability of our work. Section 5 summarizes and concludes the whole work.
3. Method
Our method further innovates and optimizes based on meta R-CNN [14]. We first improve the model's recognition of small objects by introducing a multi-scale mechanism in the feature extraction backbone. Then, we add a channel attention mechanism on top of the FPN to optimize the model's representation of hard samples in the support branch and improve detection precision. Finally, we design an attention loss that lets the model learn consistent representations of same-category objects in the two branches. The model learns to actively identify objects from support samples, leading to an overall improvement in detection performance. In this section, we first give a problem definition for FSOD. Then, we introduce the overall architecture of MSFFAL and describe its modules and structures in detail.
3.1. Problem Definition
We follow the dataset setup, training strategy, and evaluation methods in [11,14]. We divide the dataset into $D_{base}$ and $D_{novel}$, where $D_{base}$ is the base class data with thousands of annotations per class and $D_{novel}$ is the novel class data with only one to dozens of annotations per class. The base class and the novel class data do not contain the same object categories, that is, $C_{base} \cap C_{novel} = \emptyset$. We first train the model on the base classes $D_{base}$ and then fine-tune it on a balanced set of $D_{base}$ and $D_{novel}$ with only K annotations per class. K is set to different values according to the evaluation indicators of different datasets. For a given N-way K-shot learning task, in each iteration the model samples a query image and NK support images, covering N categories with K objects per category, from the prepared dataset as input. Then, the model outputs the detection results for the objects in the query image. Finally, we evaluate the model's performance by the mAP on the novel classes in the test set.
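As an illustration, the episodic sampling described above can be sketched as follows (a minimal sketch; `sample_episode` and the toy category-to-images layout are our own illustrative constructs, not part of the paper's codebase):

```python
import random

def sample_episode(dataset, n_way=3, k_shot=2, seed=0):
    """Sample one N-way K-shot episode: a query image plus N*K support images.

    `dataset` maps category name -> list of image ids (hypothetical layout).
    """
    rng = random.Random(seed)
    categories = rng.sample(sorted(dataset), n_way)
    # K support images per sampled category.
    support = {c: rng.sample(dataset[c], k_shot) for c in categories}
    # The query image is drawn from one of the sampled categories.
    query_cat = rng.choice(categories)
    query = rng.choice(dataset[query_cat])
    return query, support

# Toy dataset: category -> image ids
toy = {"cat": [1, 2, 3], "dog": [4, 5, 6], "bus": [7, 8, 9], "car": [10, 11, 12]}
query, support = sample_episode(toy, n_way=3, k_shot=2)
```

Each iteration thus feeds the model one query image and N·K support images, matching the N-way K-shot setting above.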
3.2. Model Architecture
We choose meta R-CNN, whose backbone is faster R-CNN, as our baseline. The model architecture, shown in Figure 2, is a Siamese network structure. The upper side of the network is the query branch, which takes the query image to be detected as input, while the lower side is the support branch, which takes the support image–mask pairs as input for auxiliary detection. We remove the meta learner module of meta R-CNN and realize the information interaction between the two branches through our attention loss. Compared with the baseline, we optimize the backbones of the query and support branches into ResNet + FPN and ResNet + FPN + SENet structures, respectively. The two backbones share weight parameters during the training stage. The query features are passed through the RPN and ROIAlign to obtain positive and negative proposal feature vectors. The support features are directly average pooled to obtain support feature vectors representing each support object category. These vectors are then used to construct the meta-classification loss $\mathcal{L}_{meta}$ for classifying support objects and to form the attention loss with the query positive proposal vectors. The model is trained with three losses, namely:

$$\mathcal{L} = \mathcal{L}_{det} + \mathcal{L}_{meta} + \lambda \mathcal{L}_{att}$$

where $\mathcal{L}_{det}$ is the detection loss of faster R-CNN, $\mathcal{L}_{meta}$ is the meta-classification loss in the support branch, $\mathcal{L}_{att}$ is our attention loss, and $\lambda$ is the weight parameter of the loss.
3.3. FPN and SENet
To improve the detection precision of the FSOD model for small objects and the representation effect for hard support samples, we design the feature extraction backbone as an FPN+SENet structure.
As shown in Figure 3, the FPN mainly includes a bottom–up pathway (blue box), a top–down pathway (green box), and lateral connections (1 × 1 conv, 256 channels). The bottom–up pathway is the forward process of the ResNet network: each layer down-samples the feature maps' length and width and increases the number of channels. Supposing the input image size is 224 × 224 × 3, Layer0–Layer3 output feature maps of sizes 56 × 56 × 256, 28 × 28 × 512, 14 × 14 × 1024, and 7 × 7 × 2048, respectively. The top–down pathway up-samples the width and height of the feature maps by a factor of two. The FPN combines high-level and low-level features through the lateral connections to obtain the M2–M5 features with 256 channels. Finally, a 3 × 3 convolution kernel is applied to the fused features to eliminate the aliasing effect of up-sampling, yielding the P2–P5 features. P6 is the feature map obtained from P5 by max-pooling with stride = 2. The features of each level output by the FPN are fed to the RPN module for region proposal. Among them, the low-level features contribute more proposals for small objects, improving the detection of small objects.
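The shape bookkeeping above can be checked with a short sketch (illustrative only; `fpn_shapes` is our own helper, and the channel counts assume a ResNet-50-style backbone as in the text):

```python
def fpn_shapes(h, w):
    """Return (bottom_up, fpn) dicts of {level: (h, w, channels)}."""
    bottom_up = {}
    channels = [256, 512, 1024, 2048]       # ResNet Layer0-Layer3 output channels
    for i, c in enumerate(channels):
        stride = 4 * 2 ** i                  # C2 has stride 4, ..., C5 has stride 32
        bottom_up[f"C{i + 2}"] = (h // stride, w // stride, c)
    # Lateral 1x1 convs map every level to 256 channels; spatial size is kept.
    fpn = {f"P{i + 2}": (hh, ww, 256)
           for i, (hh, ww, _) in enumerate(bottom_up.values())}
    h5, w5, _ = fpn["P5"]
    fpn["P6"] = (h5 // 2, w5 // 2, 256)      # stride-2 max-pool on P5
    return bottom_up, fpn

bottom_up, fpn = fpn_shapes(224, 224)
# For a 224x224x3 input this reproduces the sizes listed in the text,
# e.g. C2 = 56x56x256 and P5 = 7x7x256.
```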
We add the SENet structure on top of the FPN. Through this structure, the model achieves channel-level self-attention enhancement and, during training, continuously learns to improve its representation of hard support samples. The design of SENet is shown in Figure 4. This module adds a skip connection to the output feature layer of the ResNet forward network. In the connection, the feature maps are first average pooled; then, the channel attention scores are obtained through the channel attention module; finally, the original feature maps are weighted at the channel level by these scores. The internal structure of the channel attention module is shown on the right side of Figure 4. The input feature vector $z \in \mathbb{R}^{C}$ is first dimensionally reduced through the first fully connected (FC) layer $W_1$ with a reduction rate of 4 to obtain $z_1 \in \mathbb{R}^{C/4}$. Then, $z_1$ is passed through the first activation function, Tanh, to obtain $z_2$. Next, the dimension of $z_2$ is increased through the second FC layer $W_2$ to obtain $z_3 \in \mathbb{R}^{C}$. Finally, $z_3$ is passed through the second activation function, sigmoid, to obtain the weight score vector $s$. The whole process can be summarized as:

$$s = \sigma\left(W_2 \tanh\left(W_1 z\right)\right)$$
Two different activation functions are used to increase the network’s nonlinearity and enrich the network’s expressive ability. SENet allows the support branch to output high-quality support feature vectors for meta-classification and the construction of attention loss, improving the model detection performance.
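A minimal numpy sketch of this channel attention computation (the names `channel_attention`, `w1`, and `w2` are our own; random weights stand in for learned FC layers, and the reduction rate of 4 follows the description above):

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_attention(feat, w1, w2):
    """Weight `feat` (C, H, W) channel-wise via GAP -> FC -> tanh -> FC -> sigmoid."""
    z = feat.mean(axis=(1, 2))            # global average pooling -> (C,)
    z1 = w1 @ z                           # first FC: reduce to C/4
    z2 = np.tanh(z1)                      # first activation
    z3 = w2 @ z2                          # second FC: restore to C
    s = 1.0 / (1.0 + np.exp(-z3))         # sigmoid -> per-channel scores in (0, 1)
    return feat * s[:, None, None], s     # broadcast scores over H, W

C = 8
feat = rng.standard_normal((C, 7, 7))
w1 = rng.standard_normal((C // 4, C))     # reduction rate 4
w2 = rng.standard_normal((C, C // 4))
out, s = channel_attention(feat, w1, w2)
```

During training, the learned scores `s` up-weight channels that respond to foreground support objects and suppress background-dominated ones.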
3.4. Attention Loss
The meta learner is the core module in meta R-CNN; it uses support features of the same category to weight the query features. This weighting scheme deprives the model of the ability to actively identify objects of the same category, and the dependence on query labels for selecting support features makes model testing difficult. To remedy these issues, we design an attention loss to replace the meta learner module in the baseline model meta R-CNN.
The essence of the attention loss lies in utilizing the support features to establish a mapping between query-positive proposal features and their corresponding categories. Through training, a strong response is generated between objects of the same category in the two branches: the model learns to recognize objects of the same category across the branches while also discriminating objects of different categories. As shown in Figure 5, we extract all query-positive proposal feature vectors $q_i$ according to the intersection over union (IoU) between the predicted bounding boxes generated by the RPN and the ground truth. We then perform a matrix multiplication between the positive proposal feature vectors and the transpose of the support feature vectors $S$ and put the results through a softmax to obtain attention vectors $a_i \in \mathbb{R}^{N}$, where $N$ denotes the number of support feature vectors, one per support category, in each iteration. Each element in $a_i$ thus represents a support category. Suppose the category of the positive proposal is consistent with that of a support vector. In that case, we expect the value at the element position corresponding to this category to be close to 1, and otherwise close to 0. To achieve this goal, we concatenate all the attention vectors $a_i$ to obtain the score matrix $A \in \mathbb{R}^{n \times N}$, where $n$ is the number of positive proposals, and use the ground-truth category label $y_i$ of each positive proposal to constrain the trend of $A$, that is, the proposed attention loss:

$$\mathcal{L}_{att} = -\frac{1}{n}\sum_{i=1}^{n} \log A_{i, y_i}$$

where $A$ is the concatenation of the $a_i$ and $y_i$ is the category label of the $i$-th positive proposal.
Through the attention loss, on the one hand, the model can learn a consistent representation of objects belonging to the same category in the two branches during the training process. On the other hand, the model learns an abstract and easily transferable meta-knowledge in this way. Thus, it can also show an excellent generalization performance when facing unseen novel class objects.
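The attention-loss computation can be sketched in numpy as follows (a minimal illustration with random features; `attention_loss` is our own helper, and the cross-entropy form matches the goal above of pushing the correct category's attention score toward 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_loss(Q, S, labels):
    """Cross-entropy between softmaxed proposal-support scores and labels.

    Q: (n, d) positive proposal vectors; S: (N, d) per-category support vectors;
    labels: (n,) ground-truth category index of each proposal.
    """
    scores = Q @ S.T                               # (n, N) similarity matrix
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)              # row-wise softmax
    n = len(labels)
    return -np.log(A[np.arange(n), labels]).mean()

n, N, d = 6, 3, 16
S = rng.standard_normal((N, d))
labels = rng.integers(0, N, size=n)
# Proposals aligned with their own category's support vector should score a
# lower loss than random proposals, which is what the loss encourages.
aligned = S[labels] + 0.1 * rng.standard_normal((n, d))
random_q = rng.standard_normal((n, d))
```

Minimizing this loss drives each positive proposal's attention vector toward a one-hot distribution over its own category, which is exactly the consistent cross-branch representation described above.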