1. Introduction
At present, it is challenging for object detectors to detect and locate multiple objects at different scales. Because each layer of a convolutional neural network (CNN) [1] has a fixed receptive field, region-based CNNs (R-CNNs) encounter specific issues [2]: there tend to be discrepancies between the fixed receptive fields and the scales of objects in natural images. Many current object detectors use pyramid feature representations to alleviate these problems [3]. As shown in Figure 1a, a top-down architecture was used in this study to produce semantically stronger feature maps at all scales [4]. Specifically, it integrates low-resolution, semantically strong features with high-resolution, semantically weak features through lateral connections. In recent years, many studies have sought to improve the performance of feature pyramid networks (FPNs). To increase the representation of low-level information in deep layers, Path Aggregation Networks (PANs) [5], built on FPNs, add bottom-up pathways. Along with pathway augmentation, the Neural Architecture Search FPN (NAS-FPN) [6] was proposed for a more effective fusion of all cross-scale connections. Additionally, the augmented FPN (AugFPN) [7] utilizes residual feature augmentation to obtain ratio-invariant contextual information. However, the aforementioned methods consider each object independently and therefore do not account for the relationships between objects or between objects and their surroundings; as a result, their detection accuracy is limited. This study argues that these approaches neglect the information provided at other scales, and that uniformly scaled feature maps representing non-local interactions are insufficient to capture contextual information.
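The top-down fusion with lateral connections described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the learned 1x1 lateral convolutions are stood in for by random channel projections, upsampling is nearest-neighbor, and all function and variable names are ours.

```python
import numpy as np

def lateral_1x1(feat, out_channels, rng):
    # Stand-in for a learned 1x1 lateral conv: project (C, H, W) -> (out_channels, H, W)
    w = rng.standard_normal((out_channels, feat.shape[0])) / np.sqrt(feat.shape[0])
    return np.einsum('oc,chw->ohw', w, feat)

def upsample2x(feat):
    # Nearest-neighbor 2x upsampling of a (C, H, W) map
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def fpn_topdown(c3, c4, c5, out_channels=256, seed=0):
    """Top-down FPN fusion: project each backbone level to a common channel
    width, then add the upsampled coarser (semantically strong) level to the
    finer (high-resolution) level via lateral connections."""
    rng = np.random.default_rng(seed)
    p5 = lateral_1x1(c5, out_channels, rng)
    p4 = lateral_1x1(c4, out_channels, rng) + upsample2x(p5)
    p3 = lateral_1x1(c3, out_channels, rng) + upsample2x(p4)
    return p3, p4, p5

# Backbone maps with the ResNet channel widths quoted later in this section
c3 = np.zeros((512, 32, 32))
c4 = np.zeros((1024, 16, 16))
c5 = np.zeros((2048, 8, 8))
p3, p4, p5 = fpn_topdown(c3, c4, c5)
print(p3.shape, p4.shape, p5.shape)  # (256, 32, 32) (256, 16, 16) (256, 8, 8)
```

Note that every level is compressed to 256 channels before fusion; this is the channel-dimensionality reduction whose information loss motivates the channel attention module discussed later.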
Figure 2 illustrates the ignored features that could potentially yield crucial prediction information. In some cases, it is challenging even for humans to recognize objects or their locations, as depicted in Figure 2a. Nevertheless, as illustrated in Figure 2b, the presence of a cup and a sofa in a home setting aids in identifying the target object as a table. Coexisting objects provide strong cues for detecting specific objects, as exemplified in Figure 2c, where the points surrounding the cup and sofa tend to identify the table. In Figure 2d, once the surrounding environment is given, the table can be recognized easily. Additionally, global scene cues can prevent objects from being wrongly detected in unsuitable surroundings. For instance, a cup is more likely to be located on a table than on a road, and a desk is more likely to be situated in front of a sofa than in front of a car. Therefore, this study posits that contextual information for object detection consists of multiple levels.
Contextual information has been demonstrated to play a vital role in semantic segmentation and object detection [8]. To extract context across multiple scales, DeepLab-v2-style pyramid pooling [9] has been implemented in pyramid scene parsing networks to obtain a hierarchical global context, significantly enhancing semantic segmentation quality. Incorporating contextual information can also enhance the final detection and classification results by facilitating the localization of region proposals. Additionally, several recent studies have introduced contextual information into salient object detection (SOD). For example, the cross-level attention and supervision network (CLASS) [10] proposed a novel cross-level attention mechanism for SOD by modeling channel-wise and position-wise dependencies between features at different levels.
How to effectively integrate the exchange of contextual information across different scales using a transformer is worth studying. The transformer [11] is an architecture that uses no convolutional operators and relies solely on attention mechanisms. Vision transformers, which learn attentive interactions between distinct patch tokens, have recently received considerable interest in many vision tasks. The Vision Transformer (ViT) [12] and Data-efficient Image Transformer (DeiT) [13] partition images into sequences of patch embeddings, which are then fed into conventional transformers for image classification. Recently introduced methods make targeted adjustments to ViT that effectively enhance classification performance. For instance, the Cross-attention Multi-scale Vision Transformer (CrossViT) [14] employs a dual-branch transformer to process image patches of varying sizes, while Twins [15] blends local and global attention techniques to improve feature representation. These studies have shown that transformer-based models outperform other types of networks. In this study, a transformer module was introduced to model multi-scale global scene contexts. As illustrated in Figure 1b, compared to methods based on convolutional neural networks, the proposed transformer can capture long-range dependencies between pixels and model global contexts. As depicted in Figure 3a, we randomly sampled cups (Patch A, in yellow) to analyze patch interactions with the table (Patch B, in blue) and the sofa (Patch C, in red). We then analyzed the similarity of attention scores across different layers (Figure 3b); after adding the proposed HA-FPN, the attention-score similarities among the table, cup, and sofa at different levels improved significantly.
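The long-range dependencies mentioned above come from the standard scaled dot-product attention of [11], in which every patch token attends to every other token regardless of spatial distance. The sketch below shows that generic mechanism only, not the paper's specific transformer module; the token count and dimensions are arbitrary.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Scaled dot-product attention over patch tokens [11].
    q, k, v: (num_tokens, d) arrays. Each output row is a weighted mix of
    ALL value rows, so dependencies are not limited to a local receptive
    field as they are in a convolution."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (n, n) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 16))   # e.g. 6 patch embeddings of width 16
out, attn = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape, attn.shape)            # (6, 16) (6, 6)
```

Each row of `attn` is a probability distribution over all patches, which is exactly the kind of attention-score map compared across layers in Figure 3b.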
To improve computational efficiency, an FPN reduces channel dimensionality, which results in a significant loss of channel information, as shown in Figure 1a: the channel dimensions are reduced from 2048, 1024, and 512 to 256. An attention mechanism invests more resources in the most important feature maps by weighing the importance of each one. In squeeze-and-excitation networks (SE-Nets) [16], each channel is assigned a weight to help the network learn important features. Efficient channel attention networks (ECA-Nets) [17] improve the SE-Net block by obtaining more accurate attention information via one-dimensional convolution layers that consolidate cross-channel information. Selective kernel networks (SK-Nets) [18] achieve adaptive receptive field sizes for neurons through the nonlinear integration of information from multiple kernels. The convolutional block attention module (CBAM) [19] collects spatial and channel attention information through two submodules and then integrates them, yielding more comprehensive and reliable attention information. Inspired by the above methods, this study introduced a channel attention module that effectively exploits the channels containing rich information.
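For concreteness, the squeeze-and-excitation mechanism [16] that inspired this line of work can be sketched as below. This reproduces the cited SE-Net idea, not the channel attention module proposed in this paper; the weight matrices here are random stand-ins for learned fully connected layers, and the reduction ratio `r` is an assumed hyperparameter.

```python
import numpy as np

def se_channel_attention(feat, w1, w2):
    """SE-style channel attention [16]: global average pooling 'squeezes'
    each channel to one scalar, a small bottleneck MLP 'excites' it into a
    per-channel gate in (0, 1), and the gates rescale the feature map so
    that important channels dominate.
    feat: (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    squeeze = feat.mean(axis=(1, 2))                 # (C,) channel descriptors
    hidden = np.maximum(w1 @ squeeze, 0.0)           # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # sigmoid gates, (C,)
    return feat * gates[:, None, None]

rng = np.random.default_rng(0)
C, r = 256, 16                                       # assumed reduction ratio
feat = rng.standard_normal((C, 8, 8))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
out = se_channel_attention(feat, w1, w2)
print(out.shape)  # (256, 8, 8)
```

The output keeps the input shape; only the per-channel scale changes, which is what lets the network spend its capacity on the most informative channels.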
This study proposed a method named HA-FPN. First, a transformer-based feature map fusion method was proposed to combine feature maps of different scales from various layers, enabling the model to learn global contextual semantic information. Second, a simple yet effective channel attention module was presented for selecting the key channels, which effectively utilized the channels with rich information and alleviated the problem of massive channel information loss.
By replacing the FPN with the HA-FPN, the proposed model achieved performance improvements of 1.6 and 1.0 AP when using ResNet-50 and ResNet-101 as backbones, respectively. Additionally, using ResNet-50 as the initial network, the proposed HA-FPN improved Faster R-CNN [20] by 1.5 AP. Beyond two-stage detectors, with minor modifications, the HA-FPN was also successfully applied to one-stage detectors: replacing the FPN with the HA-FPN improved RetinaNet [21] by 1.2 AP. Therefore, the proposed HA-FPN generalizes across object detection tasks.
The main contributions of this study can be summarized as follows:
The proposed TFPN fully utilizes the multilevel features in the FPN to capture global scene information;
The proposed CAM invests more of the network's computing resources in the most important channels, alleviating the problem of massive channel information loss;
The proposed HA-FPN, built on the two contributions above, is a simple and generic algorithm that boosts detection performance while remaining computationally efficient across multiple object detectors.