Article

Multibranch Attention Mechanism Based on Channel and Spatial Attention Fusion

Guojun Mao, Guanyi Liao, Hengliang Zhu and Bo Sun
1 Fujian Provincial Key Laboratory of Big Data Mining and Applications, Fujian University of Technology, Fuzhou 350118, China
2 School of Computer and Mathematics, Fujian University of Technology, Fuzhou 350118, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(21), 4150; https://doi.org/10.3390/math10214150
Submission received: 17 September 2022 / Revised: 24 October 2022 / Accepted: 2 November 2022 / Published: 6 November 2022

Abstract

Recently, it has been demonstrated that the performance of an object detection network can be improved by embedding an attention module into it. In this work, we propose a lightweight and effective attention mechanism named multibranch attention (M3Att). For the input feature map, M3Att first uses a grouped convolutional layer with a pyramid structure for feature extraction, then calculates channel attention and spatial attention simultaneously and fuses them to obtain more complementary features. It is a “plug and play” module that can be easily added to an object detection network and significantly improves its performance with only a small increase in parameters. We demonstrate the effectiveness of M3Att on several challenging object detection tasks, including PASCAL VOC2007, PASCAL VOC2012, KITTI, and the Zhanjiang Underwater Robot Competition dataset. The experimental results show that the method markedly improves object detection; on PASCAL VOC2007, embedding M3Att in the YOLOv4 (You Only Look Once v4) network increased the mAP of the original network by 4.93%.

1. Introduction

Object detection is a fundamental task in the field of computer vision, where the main task is to locate all objects of interest in an image and determine their type and location [1]. An important challenge in this phase is to improve the detection accuracy of high-noise videos and images [2]. The noise environment is complicated, containing issues such as bad weather and blurred video obtained underwater; low-quality images obtained as a result of the image’s size, shape, and position, as well as the lighting and shooting conditions; and interference factors such as occlusion and background. However, these noises are unavoidable and are the major cause of missed and false detections, so improving object detection performance in the presence of noise is an urgent problem.
Using attention mechanisms to improve the performance of object detection networks has been widely recognized [3]. The intuitive interpretation of the attention mechanism is that it efficiently allocates limited computational resources to the analysis of salient regions of an object and therefore improves the accuracy of object detection. This is consistent with the human visual system, which tends to focus on the informative parts of an image and ignore the irrelevant ones. In contrast, the overall framework of existing machine vision still emphasizes holistic analysis of the image; the detection accuracy for large objects is generally reasonable, but the accuracy for small and medium objects is poor. Introducing an attention mechanism can compensate for this disadvantage to a certain extent.
Attention mechanisms can be broadly classified into channel attention and spatial attention [4]. A channel acts as a feature detector, and channel attention is a mechanism for mining a set of representative features in a given image. The typical channel attention is the squeeze-and-excitation (SE) module [5], whose central idea is to learn the weights of different channels through squeeze and excitation operations and thereby highlight the significant features. However, the disadvantage of SE is also apparent: it ignores the importance of spatial information [6]. The bottleneck attention module (BAM) [7] and the convolutional block attention module (CBAM) [8] therefore combine channel attention and spatial attention to enrich the feature maps. The above work on attention mechanisms is practical; however, two fundamental problems remain. First, how can the rich information in feature maps at different scales be mined and utilized? Second, channel or spatial attention alone can only establish short-range channel dependence and cannot develop long-range channel dependence. Aiming at these two problems, researchers have proposed Res2Net [9] from the multiscale perspective and nonlocal neural networks [10] from the long-range dependence perspective. Although these two methods address the problems to a certain extent, they place a heavy computational burden on the network. We therefore believe it is necessary to develop a low-cost attention mechanism that combines multiscale feature extraction with long-range channel dependence. In this paper, we propose such an effective and low-cost attention mechanism, named the multibranch attention mechanism (M3Att). M3Att processes the input tensor at different scales and combines three parts: first, we use grouped convolutions with different kernel sizes to build a pyramid structure and then enrich the grouped-convolution feature maps through a channel shuffling mechanism; second, the feature maps produced by the grouped convolutional pyramid are fed into channel and spatial attention, respectively; finally, we use the softmax function to recalibrate the attention weights, thereby establishing long-range channel dependence. At the same time, a skip connection is introduced to compensate for the information loss caused by multiple convolutions.
In this paper, a multibranch attention (M3Att) mechanism that merges channel attention and spatial attention is proposed to address the two problems mentioned above, and the main contributions of this paper are summarized as follows.
(1) We propose a new multibranch attention mechanism (M3Att), which can be flexibly incorporated into existing object detection networks and improves the performance of the object detection network without a significant increase in the parameters of the network. Similarly, our M3Att can also be extended to other computer vision tasks.
(2) We propose a practical multiscale feature extraction module (MFE), which can learn richer multiscale feature representation. Most importantly, the output of each MFE is propagated to the next layer to generate the channel-wise attention vector by using hybrid attention.
(3) The attention mechanism in this paper is inserted into the object detection network of YoloV4 [11] and completes experiments on the detection accuracy. The experimental results show that the M3Att can significantly improve detection accuracy. M3Att achieved a 4.93% improvement in mAP over the YoloV4 for the PASCAL VOC2007 [12].
This paper is organized as follows: Section 2 reviews related work on attention mechanisms. Section 3 presents the multibranch attention model proposed in this paper. Section 4 compares the proposed method with mainstream methods on standard datasets and reports the experimental results. Section 5 concludes the paper.

2. Related Work

The attention model was first applied to machine translation [13] and has become a central concept in convolutional neural networks. The attention model has two leading roles: first, to tell the computer which parts of the content to focus on; second, to allocate the limited computational resources to the important parts of the image. The attention mechanism is proven to be one of the most significant ways to improve the effectiveness and efficiency of neural network learning. Currently, the mainstream attention methods can be divided into three major categories, namely, channel attention, spatial attention, and hybrid attention [14].
In deep neural networks, different channels in different feature maps usually represent different objects [15]. Channel attention adaptively recalibrates each channel’s weight and generates feature masks, a process of selecting objects that determines what to pay attention to. The squeeze-and-excitation (SE) module learns an importance score for each feature channel and uses it to assign a weight to each channel. GSoP [16] builds on SENet and proposes a second-order pooling layer to aggregate richer features; however, these attention modules rely on fully connected layers, which introduce considerable computational redundancy. ECANet [17] therefore replaces the fully connected layer with a one-dimensional convolutional layer, significantly reducing the model’s number of parameters. FcaNet [18] reconsiders the influence of the pooling layer on the attention mechanism from the frequency-domain perspective and proposes multispectral channel attention. Distinct from channel attention, which focuses on what features are essential, some researchers have investigated where it is necessary to concentrate. Spatial attention, studied empirically by Zhu et al. [19], converts the information in the image’s spatial domain into a corresponding space to extract the critical data. GENet [20] uses feature context to aggregate features and then distributes them locally to tell the model which regions are essential.
Hybrid attention combines the advantages of the channel and spatial attention described above and has attracted researchers’ interest in recent years. In 2017, Wang et al. [21] pioneered the concept of hybrid attention with the residual attention network. CBAM concatenates channel and spatial attention so that the generated feature vector carries the advantages of both. The dual attention network (DANet) [22] sums the outputs of two different attention branches and adaptively combines local and global features. Coordinate attention [23] embeds location information into channel attention so that the network can focus its computation on the large important regions. Unlike DANet and coordinate attention, relation-aware global attention (RGA) [24] is a hybrid attention that emphasizes the global structural information provided by pairwise relationships and uses it to generate attention maps. Some of the above attention mechanisms focus on designing more complex network structures and therefore incur substantial computational costs; others concentrate on lightweight structures, but the improvement in accuracy is limited. Thus, to further improve detection accuracy while keeping model complexity low, a novel attention mechanism, M3Att, is proposed; it aims to significantly improve the performance of the object detection network at low complexity by efficiently combining channel attention and spatial attention to generate complementary features.
In addition, some researchers have applied attention mechanisms to object detection networks in practical applications. Yang et al. [25] integrated the CBAM attention mechanism into the YoloV4 object detection network, which significantly improved the accuracy of wheat-ear detection. Kim et al. [26] proposed ECAP-Yolo, which modifies the feature extraction backbone with a channel attention module and greatly improves the detection performance for small objects. These efforts provide helpful background and scenarios for this and similar studies.
This paper proposes an attention mechanism with a multibranch structure and successfully incorporates it into the YoloV4 object detection network. The multibranching attention mechanism first extracts rich multiscale features through a multilevel feature extraction module and then suppresses the interference of spatial and channel dimensions, respectively.

3. Materials and Methods

The multibranch attention model proposed in this paper mainly consists of a multiscale feature extraction module, a channel attention module, a spatial attention module, and a skip connection.

3.1. Multiscale Feature Extraction Module

As shown in Figure 1, the multiscale feature extraction module (MFE) implements feature extraction in M3Att. The input feature map $X \in \mathbb{R}^{C \times H \times W}$ is divided into $N$ parts, denoted by $[X_0, X_1, \dots, X_{N-1}]$. Each part has $C/N$ channels, so the $i$th feature map can be written as $X_i \in \mathbb{R}^{\frac{C}{N} \times H \times W}$, $i = 0, 1, \dots, N-1$. The module processes the split tensors in parallel with convolutional kernels of different sizes, which extract different spatial and depth information and yield feature maps at multiple scales. However, as the size of the convolution kernel increases, the number of parameters grows significantly. To address this, grouped convolution is introduced so that the inputs of the different convolutional kernels can be processed without increasing the computational cost. The relationship between the convolution kernel size and the group size can be expressed as:
$G = 2^{K-3}$ (1)
where $K$ is the size of the convolution kernel and $G$ is the group size. The maximum kernel size used is 9 × 9; when $K$ equals 9, the default group size $G$ is 32. Equation (1) is validated later in the paper through ablation experiments. The use of grouped convolution can significantly reduce the number of model parameters and increase the computation speed; however, communication between channels is weakened, which reduces the feature extraction capability [27].
Thus, this paper introduces channel shuffling to address the lack of feature communication between groups. The essence of channel shuffling is to take the channel groups produced by the grouped convolution, redistribute them into new groups, and then stitch them together to obtain the final feature map. The feature map after channel shuffling contains information from different channels, which enriches the feature information and avoids the loss caused by grouped convolution. Thus, the generation of the multiscale feature maps can be expressed as:
$F_i = \mathrm{Shuffle}\big(\mathrm{Conv}(k_i \times k_i, G_i)(X_i)\big), \quad i = 0, 1, 2, \dots, N-1$ (2)
where the size of the $i$th convolution kernel is $k_i = 2 \times (i+1) + 1$ and the corresponding group size is $G_i = 2^{k_i - 3}$. To obtain the final output feature map, the branch outputs are concatenated:
$F = \mathrm{Cat}([F_0, F_1, \dots, F_{N-1}])$ (3)
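To make this construction concrete, the following PyTorch sketch shows one way to implement the MFE module under the assumptions stated in this section (four splits, kernel sizes (3, 5, 7, 9), group sizes (1, 4, 16, 32)). It is an illustrative reimplementation, not the authors' code; the names `MFE` and `channel_shuffle` are ours, and the channel count must split so that each part supports the largest group size.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Interleave channels so information mixes across the groups of a grouped convolution.
    b, c, h, w = x.size()
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class MFE(nn.Module):
    """Multiscale feature extraction: split the input into N channel groups, apply grouped
    convolutions with pyramid kernel sizes, shuffle channels, and concatenate (Equations (2)-(3))."""
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7, 9), group_sizes=(1, 4, 16, 32)):
        super().__init__()
        assert channels % len(kernel_sizes) == 0, "channels must split evenly into N parts"
        self.split = channels // len(kernel_sizes)
        self.group_sizes = group_sizes
        self.convs = nn.ModuleList(
            nn.Conv2d(self.split, self.split, k, padding=k // 2, groups=g, bias=False)
            for k, g in zip(kernel_sizes, group_sizes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        parts = torch.split(x, self.split, dim=1)                  # [X_0, ..., X_{N-1}]
        feats = [channel_shuffle(conv(p), g)
                 for conv, p, g in zip(self.convs, parts, self.group_sizes)]
        return torch.cat(feats, dim=1)                             # Equation (3)
```

For example, `MFE(512)` maps a `(B, 512, H, W)` tensor to a tensor of the same shape; each 128-channel split is processed by one branch of the pyramid.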

3.2. Channel Attention Module

Figure 2 compares the channel attention in this paper with current mainstream channel attention mechanisms. The channel attention here consists of two parallel branches, which establish cross-dimensional interactions between the channel dimension and each of the two spatial dimensions, respectively. Notably, instead of using a fully connected layer for dimensionality reduction, M3Att adopts a 7 × 7 convolutional kernel, which reduces the computational overhead and achieves greater efficiency during forward propagation [28].
For a given input feature map $F \in \mathbb{R}^{C \times H \times W}$, the map is fed into the two branches of the channel attention. In the first branch, the interaction between the channel information and the height information is constructed. First, the input feature map is rotated 90° counterclockwise along the $H$ axis via the permute operation, which changes its shape to $F_H \in \mathbb{R}^{W \times H \times C}$. Second, by arranging global average pooling (GAP) and global max pooling (GMP) in parallel, the feature map is reduced to $F_H^{*} \in \mathbb{R}^{2 \times H \times C}$; the pooled tensor retains only two slices along the first dimension, which preserves the richness of the feature map while reducing the computational overhead. The feature map information is then extracted by a standard convolutional layer with a kernel size of 7 × 7, and the final attention weights are generated by a BN layer and a sigmoid activation function. The generated attention weights are multiplied point by point with the rotated input feature map to obtain a feature map with cross-dimensional interaction information, which is then rotated 90° clockwise along the $H$ axis to restore the original shape for further operations.
Similarly, in the second branch, the interaction between the channel information and the width information is constructed. First, the input feature map is rotated 90° counterclockwise along the $W$ axis via the permute operation, transforming its shape to $F_W \in \mathbb{R}^{H \times C \times W}$. The feature map is then reduced to $F_W^{*} \in \mathbb{R}^{2 \times C \times W}$ by the parallel GAP and GMP layout, and the feature information is extracted by a standard convolutional layer with a kernel size of 7 × 7. The final attention weights are generated by the BN layer and the sigmoid activation function, multiplied point by point with the rotated input feature map to obtain a feature map with cross-dimensional interaction information, and then rotated 90° clockwise along the $W$ axis to restore the original shape.
Ultimately, the outputs of the two branches are summed element by element and averaged to obtain the final output $Q(F) \in \mathbb{R}^{C \times H \times W}$. In conclusion, for a feature map $F \in \mathbb{R}^{C \times H \times W}$, the channel attention can be expressed mathematically as:
$Q(F) = \frac{1}{2}\left(F_H \otimes \sigma\!\left(f^{7 \times 7}(F_H^{*})\right) + F_W \otimes \sigma\!\left(f^{7 \times 7}(F_W^{*})\right)\right)$ (4)
where $f^{7 \times 7}$ and $\sigma(\cdot)$ denote the 7 × 7 convolution operation and the sigmoid activation function, respectively, and $\otimes$ denotes element-wise multiplication followed by rotation back to the original orientation.
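A minimal PyTorch sketch of the two rotated branches is given below. It follows our reading of Equation (4) (rotate, pool with parallel GMP/GAP, apply a 7 × 7 convolution with BN and sigmoid, reweight, rotate back, and average the two branches); the class names and the batch-first tensor layout are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    # Parallel global max and mean pooling along dim 1, stacked into two summary channels.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([x.max(dim=1, keepdim=True)[0], x.mean(dim=1, keepdim=True)], dim=1)

class BranchGate(nn.Module):
    # 7x7 convolution + BN + sigmoid applied to the pooled 2-channel map, as in Equation (4).
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.bn(self.conv(self.pool(x))))

class ChannelAttention(nn.Module):
    """Two rotated branches (channel-height and channel-width interaction), averaged."""
    def __init__(self):
        super().__init__()
        self.gate_h = BranchGate()
        self.gate_w = BranchGate()

    def forward(self, x: torch.Tensor) -> torch.Tensor:              # x: (B, C, H, W)
        # Branch 1: rotate along the H axis -> (B, W, H, C), gate, rotate back.
        x_h = x.permute(0, 3, 2, 1).contiguous()
        out_h = (x_h * self.gate_h(x_h)).permute(0, 3, 2, 1).contiguous()
        # Branch 2: rotate along the W axis -> (B, H, C, W), gate, rotate back.
        x_w = x.permute(0, 2, 1, 3).contiguous()
        out_w = (x_w * self.gate_w(x_w)).permute(0, 2, 1, 3).contiguous()
        return 0.5 * (out_h + out_w)                                  # Equation (4)
```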

3.3. Spatial Attention Module

The spatial attention structure of this paper is shown in Figure 3. The spatial attention mechanism in this paper draws on the idea of the spatial attention mechanism from SAM and improves upon it.
Two feature maps are first obtained from the input feature map $F \in \mathbb{R}^{C \times H \times W}$ by max pooling and average pooling along the channel dimension, respectively, and the two maps are concatenated to form the input for the next stage.
Next, the concatenated feature map of shape $F^{*} \in \mathbb{R}^{2 \times H \times W}$ is fed into a standard convolutional layer with a kernel size of 7 × 7 to generate a spatial attention map. In this way, the spatial features are further extracted, and the feature map is reduced to a single channel, completing the channel-matching process. A sigmoid function then maps the spatial attention weights into (0, 1), and the weights are applied to the original feature map to obtain the final output feature map with spatial attention. Therefore, the spatial attention weights are computed as follows.
$M(F) = \sigma\!\left(f^{7 \times 7}\big(\mathrm{Cat}(\mathrm{Maxpool}(F), \mathrm{Avgpool}(F))\big)\right)$ (5)
where $\mathrm{Maxpool}(\cdot)$, $\mathrm{Avgpool}(\cdot)$, $f^{7 \times 7}$, and $\mathrm{Cat}(\cdot)$ denote the max pooling operation, the average pooling operation, the 7 × 7 convolution operation, and the concatenation operation, respectively.
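Equation (5) corresponds to a compact module; the sketch below is one plausible PyTorch rendering (channel-wise max and average pooling, concatenation, a 7 × 7 convolution, and a sigmoid gate applied back to the input). It is illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention as in Equation (5): pool over channels, concatenate,
    convolve with a 7x7 kernel, and gate the input with the sigmoid map."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, C, H, W)
        max_map = x.max(dim=1, keepdim=True)[0]             # Maxpool(F): (B, 1, H, W)
        avg_map = x.mean(dim=1, keepdim=True)               # Avgpool(F): (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))
        return x * attn                                      # reweighted feature map
```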

3.4. M3Att Attention Module

The M3Att module in this paper consists of the following four steps, as shown in Figure 4. First, the feature map is divided by the MFE module to obtain feature maps with different channel groups and rich multiscale information. Second, these feature maps are fed into the channel attention module and the spatial attention module, respectively, to extract attention information at different scales. Third, the obtained channel and spatial attention weights are concatenated; in this way, channel attention and spatial attention are fused without destroying either, and more complementary attention weights are obtained. The entire multiscale hybrid attention vector can therefore be written as follows:
$V_{att} = \mathrm{Cat}\big(Q_0(F), M_0(F), \dots, Q_{N-1}(F), M_{N-1}(F)\big)$ (6)
Fourthly, the multiscale hybrid attention vector is fed into the softmax function for recalibration to enable better information interaction between channel attention and spatial attention, which can be represented as follows:
$Z = \mathrm{softmax}(V_{att}) = \frac{\exp(V_{att})}{\sum_{i=0}^{N-1} \exp(V_{att,i})}$ (7)
The recalibrated attention vector is multiplied element-wise with the split feature maps, and the skip connection reintroduces the original feature map; the result, a feature map with multiscale information, is used as the output. Thus, the whole attention process can be summarized as:
$Y = Z \odot \mathrm{MFE}(X) \oplus L(X)$ (8)
where $\mathrm{MFE}(\cdot)$ denotes the multiscale feature extraction module, $L(\cdot)$ denotes the skip connection, $\odot$ denotes element-wise multiplication, and $\oplus$ denotes the fusion with the skip-connected features.
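The overall block can then be sketched as below, reusing the `MFE`, `ChannelAttention`, and `SpatialAttention` sketches above. The paper does not spell out the exact shapes used in Equations (6)-(8), so the fusion of the two attention outputs (averaging) and the softmax recalibration over channel descriptors of the N splits are our interpretation; treat this as an illustration of the data flow, not the authors' implementation.

```python
import torch
import torch.nn as nn

class M3Att(nn.Module):
    """M3Att block: MFE split -> per-split channel and spatial attention ->
    softmax recalibration across splits (Equation (7)) -> reweighting plus skip connection.
    Assumes the MFE, ChannelAttention, and SpatialAttention sketches above are in scope."""
    def __init__(self, channels: int, n_splits: int = 4):
        super().__init__()
        self.split = channels // n_splits
        self.mfe = MFE(channels)
        self.channel_att = nn.ModuleList(ChannelAttention() for _ in range(n_splits))
        self.spatial_att = nn.ModuleList(SpatialAttention() for _ in range(n_splits))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.mfe(x)                                                # MFE(X)
        parts = torch.split(feats, self.split, dim=1)
        # Per-split hybrid attention: average of channel- and spatial-attended maps.
        attended = [0.5 * (self.channel_att[i](p) + self.spatial_att[i](p))
                    for i, p in enumerate(parts)]
        # Softmax recalibration across the N splits, applied to channel descriptors.
        desc = torch.stack([a.mean(dim=(2, 3)) for a in attended], dim=1)  # (B, N, C/N)
        z = torch.softmax(desc, dim=1).unsqueeze(-1).unsqueeze(-1)         # (B, N, C/N, 1, 1)
        out = torch.cat([z[:, i] * a for i, a in enumerate(attended)], dim=1)
        return out + x                                                     # skip connection L(X)
```

Because the block preserves the input shape, it can be dropped between existing layers of a detector's backbone or neck, which is what makes it "plug and play".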

4. Experimental Results and Analysis

The effectiveness of M3Att was verified by comparing it with eight current mainstream attention mechanisms on three public datasets, VOC2007, VOC2012, and KITTI [29], and on one real-world underwater organism dataset collected from the 2020 Zhanjiang Underwater Robot Competition. The eight attention mechanisms were SENet, coordinate attention, CBAM, ECANet, DANet, EPSANet [30], SPANet [31], and triplet attention [32]. In addition, to evaluate the effectiveness of our final model, ablation experiments were conducted to validate each component, which is the main focus of our study.
Six representative object detection networks were then selected to validate the generalizability of our M3Att, demonstrating its “plug-and-play” nature and its ability to improve object detection accuracy. The six networks are YoloV3 [33], YoloV4, YoloV5, YoloX [34], SSD [35], and Faster R-CNN [36]. The integration of M3Att into YoloV4 is shown schematically in Figure 5.

4.1. Dataset

The public datasets used in this paper are shown in Table 1, and the main parameters of these three datasets are listed. Figure 6 shows some images of the dataset used in this paper.
Furthermore, to validate the model’s performance in real-world scenarios, the marine life dataset from the 2020 National Underwater Robotics Competition (Zhanjiang) was included for empirical validation. Since the original dataset consisted of only 3824 images, it was expanded using data augmentation, resulting in 7648 images after preprocessing. The dataset covers four classes (Echinus, Starfish, Holothurian, and Scallop) and was divided 9:1 into training and test sets, with 6883 training images and 765 test images.

4.2. Experimental Environment and Parameter Settings

The experimental equipment used in this paper was configured with an Intel i7-10700 CPU, an NVIDIA GeForce RTX 3090 GPU with 24 GB of video memory, and the Windows 10 operating system. The training model was built with the PyTorch deep learning framework, using Python 3.7 and CUDA 11.2. The main parameters of the experiments are listed in Table 2.
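For reference, a hypothetical PyTorch setup mirroring Table 2 would look like the following; the backbone here is only a stand-in placeholder, since the YoloV4 + M3Att network itself is not reproduced in this text.

```python
import torch
import torch.nn as nn

# Hyperparameters taken from Table 2.
config = {"image_size": 416, "learning_rate": 1e-3, "batch_size": 16, "epochs": 100}

backbone = nn.Conv2d(3, 32, kernel_size=3, padding=1)   # placeholder for YoloV4 + M3Att
optimizer = torch.optim.Adam(backbone.parameters(), lr=config["learning_rate"])
```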

4.3. Evaluation of the Model Performance

To evaluate the model’s performance more precisely, two metrics were chosen: AP50, the average precision of each object class at an intersection-over-union threshold of 0.5, and mAP, the mean average precision over all classes. mAP is defined as shown in Equation (9).
$\mathrm{mAP} = \frac{1}{K}\sum_{k=1}^{K} \mathrm{AP}(P, R, k)$ (9)
where $P$ is precision, $R$ is recall, and $K$ is the number of classes: $K = 20$ for the VOC datasets, $K = 3$ for the KITTI dataset, and $K = 4$ for the marine biology dataset. For a more intuitive evaluation of model performance, the eight selected attention mechanisms were also compared using visual detection plots and heat maps.
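As a concrete illustration of Equation (9), the sketch below computes a VOC-style per-class AP (all-point interpolation over the precision-recall curve) and averages it over classes. It is a generic reference computation with hypothetical example values, not the evaluation code used in this paper.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """VOC-style AP: area under the precision-recall curve with the precision
    envelope made monotonically decreasing (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):        # precision envelope
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]         # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(per_class_ap: dict) -> float:
    """mAP as in Equation (9): the mean of the per-class AP values over K classes."""
    return sum(per_class_ap.values()) / len(per_class_ap)

# Example with hypothetical per-class AP values at IoU 0.5 (KITTI-style classes):
print(mean_average_precision({"car": 0.93, "cyclist": 0.84, "pedestrian": 0.80}))
```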

4.4. Analysis of the Generalizability of the Model

The PASCAL VOC 2007 dataset and the PASCAL VOC 07+12 dataset were selected for comparison in different networks to verify the generalizability of the model, and the results are shown in Table 3.
As shown in Table 3, after adding M3Att, the performance of both the one-stage and the two-stage object detection networks improved over the original networks. Specifically, the Yolo-series detectors (YoloV3, YoloV4, YoloV5, and YoloX) improved by 1.41%, 4.93%, 2.05%, and 2.36%, respectively, on VOC2007. This shows that the proposed M3Att is effective and can significantly improve object detection accuracy. To further test the model’s generalization capability on a larger dataset, the VOC2007 dataset was fused with the VOC2012 dataset. With the M3Att module added, the mAP of the networks improved further; YoloV3, for example, improved by 2.06%.

4.5. Experiment Comparing Different Attention Mechanisms

The experiments were conducted on the PASCAL VOC2007 dataset with eight popular attention mechanisms: SENet, coordinate attention (CA), CBAM, ECANet, SPANet, DANet, EPSANet, and triplet attention (abbreviated as triplet). Table 4 and Table 5 show the accuracy and the model complexity of the YoloV4-based object detection algorithms. With the addition of the M3Att module, the proposed M3Att-YoloV4 achieved 4.93% higher detection accuracy than the baseline YoloV4 network, while the number of parameters increased by only about 0.5 M. In addition, compared with DANet, a representative of recent hybrid attention mechanisms, M3Att reduced the number of parameters and the computational cost by 28.2% and 11.3%, respectively, with a slight improvement in detection accuracy. In summary, the M3Att module achieved superior detection accuracy with a favorable parameter count and a lower computational cost.
For a more intuitive comparison of this method with the other attention mechanisms, a visual comparison is introduced, shown in Figure 7. Figure 7a contains only one object class, but the image includes occlusions and small objects, and the pose of each object varies, so whether all objects in the image can be detected is a good test of the model’s performance under complex conditions.
As can be seen from the figure, only M3Att detects all the objects completely, without any false or missed detections. Figure 7b also contains a single object class, boats, but detection is more difficult because of the dark background and the severe occlusion between objects. Only M3Att detects the occluded objects, as shown in Figure 7.
To explain quantitatively and intuitively how M3Att makes full use of the object’s salient features to enhance the network’s performance, this paper uses the Grad-CAM technique [37] to compare the eight attention mechanisms with the M3Att attention maps. The visualized class heat map identifies the regions that matter for the detection task: the darker the color, the greater the impact of that region on the result, that is, the more salient the feature. A comparison is presented in Figure 8. The visualization demonstrates the inherent advantage of M3Att, which focuses on salient features and covers salient regions better than the other attention mechanisms. In other words, M3Att learns to use the information within the object region and aggregate features from it very well, which can significantly improve the performance of object detection networks.
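For readers who want to reproduce this kind of visualization, the sketch below implements a minimal, generic Grad-CAM with PyTorch hooks. The `target_layer` and `score_fn` arguments are placeholders supplied by the caller (for example, a late convolutional layer and a function returning one class confidence from the detector's output); this is not the visualization code used by the authors.

```python
import torch
import torch.nn.functional as F

def grad_cam(model: torch.nn.Module, target_layer: torch.nn.Module,
             image: torch.Tensor, score_fn) -> torch.Tensor:
    """Minimal Grad-CAM: weight the target layer's activations by the spatial
    average of the gradients of a scalar score, then ReLU and normalize."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        score = score_fn(model(image))          # scalar, e.g. one class confidence
        model.zero_grad()
        score.backward()
        weights = grads["g"].mean(dim=(2, 3), keepdim=True)       # GAP of gradients
        cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
        return cam / (cam.max() + 1e-8)         # normalize to [0, 1] for visualization
    finally:
        h1.remove()
        h2.remove()
```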

4.6. KITTI Dataset Experiment

To test the performance of the algorithm in complex scenarios, the KITTI dataset was selected for experiments. The dataset contains 7482 images with eight object categories, including 6733 images in the training set and 749 in the test set. For statistical and analysis purposes, the eight categories were merged into three: Car, Cyclist, and Pedestrian. The images were captured from real street scenes and therefore involve many small and medium objects in complex environments, so experiments on KITTI can further demonstrate the model’s performance. Table 6 shows the comparison of the proposed M3Att module with the other eight attention mechanisms on the KITTI dataset. M3Att achieved the best mAP, improving over YoloV4 by 5.38%, the largest gain among all methods. For the eight classical attention mechanisms, simple global average pooling ignores local information within the channels, so those methods do not perform as well on datasets with large numbers of small and medium objects.
The detection results of the different attention mechanisms and the P-R curves of the three object types on the KITTI dataset are shown in Figure 9. A P-R curve is the curve enclosed by the precision axis (vertical) and the recall axis (horizontal). For good results, both P and R should be as high as possible, but precision and recall are in tension: a higher precision tends to come with a lower recall. Plotting the P-R curve therefore gives a fuller picture of model performance. As Figure 9 shows, the P-R curves of the proposed algorithm are clearly better than those of the other algorithms. The proposed M3Att module successfully captures object information and improves object detection accuracy significantly compared with YoloV4 and the SENet, CA, CBAM, ECANet, SPANet, DANet, EPSANet, and triplet attention mechanisms. The final mAP reached 86.94%, a good detection result.

4.7. Ablation Studies

As indicated in Table 7, this set of experiments verified the effectiveness of M3Att on the PASCAL VOC2007 dataset by adjusting the group size of the grouped convolution. Because the parallel multibranch computation significantly increases the number of model parameters, grouped convolution was introduced to keep the parameter count under control.
As the results in the table show, the group size directly affects the performance and complexity of the model. Hence, this paper fixed the convolutional kernel sizes and adjusted the group sizes to balance model performance and complexity. In the end, this paper used kernel sizes of (3, 5, 7, 9) with group sizes of (1, 4, 16, 32).
Note: in Table 7, 2 × 3 and 3 × 3 denote two and three cascaded 3 × 3 convolutional kernels, respectively.
To test the impact of the three proposed modules on object detection results, we performed ablation experiments on the PASCAL VOC2007 dataset; the results are shown in Table 8. The experimental results show that adding the MFE module allows the network to extract more detailed features, raising mAP from 82.22% to 85.41% compared with the original YoloV4 detection network. On this basis, adding the channel attention mechanism increased mAP from 85.41% to 85.99%, letting the model focus on each key characteristic. Introducing the spatial attention mechanism then allowed the model to focus on key regions, and the mAP rose to 86.97%. Lastly, a skip connection was introduced to transfer the low-level features from the shallow layers to the deep layers, compensating for the large amount of detail lost in the deep layers. The final model reached an mAP of 87.15%, a relatively good result.

4.8. Practical Scenario Experiments

To test the robustness of the algorithm in real-world environments, we experimented with the 2020 Zhanjiang Underwater Robotics Competition dataset; the visual comparison is shown in Figure 10. The proposed M3Att performed better than the current mainstream attention mechanisms on object detection in a real-world environment. In particular, M3Att performed well on small objects, detecting all of them with high confidence. In some cases, M3Att even detected unlabeled objects, as shown in Figure 10a, demonstrating its robustness. To verify the model’s performance in complex underwater environments, Figure 10b shows performance under blurred and occluded conditions; only M3Att detected all the annotated objects and, with high confidence, the unannotated scallop in the upper right corner of the image. This is because M3Att uses a multibranch structure that combines the advantages of channel attention and spatial attention to obtain complementary features, effectively highlighting object features and suppressing irrelevant information, which increases the confidence of each detection and further validates the model.

5. Conclusions

In this paper, an effective and lightweight attention mechanism named M3Att was proposed. It is plug-and-play and can be easily added to an object detection network to improve its performance. The multiscale feature extraction module (MFE) fully extracts multiscale information and enriches the features through channel shuffling. At the same time, a hybrid form of channel and spatial attention is used to construct long-range channel dependence, which helps to obtain complementary features. Finally, a skip connection connects the shallow information with the deep information, avoiding the information loss caused by multiple convolutions. Experimental results demonstrate that M3Att achieves the best object detection performance among the compared attention mechanisms. In future work, we will continue to explore the role of M3Att in other computer vision tasks.

Author Contributions

Methodology, G.M.; writing—original draft preparation, G.L.; data curation, G.L.; review and editing, H.Z. and B.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (61773415), the National Key Research and Development Program of China (2019YFD0900805) and Science and Technology Project of Fujian University of Technology (GY-Z220205).

Informed Consent Statement

Not applicable.

Data Availability Statement

PASCAL VOC2007 dataset (http://host.robots.ox.ac.uk:8080/pascal/VOC/voc2007/, accessed on 1 September 2022.), PASCAL VOC2012 dataset (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/, accessed on 1 September 2022.), KITTI dataset (https://www.cvlibs.net/datasets/kitti/, accessed on 1 September 2022.), 2020 Zhanjiang Underwater Robotics Competition dataset. Restrictions apply to the availability of these data. The data were obtained from the Fujian University of Technology and are available from the authors with the permission of Institute for Machine Learning and Intelligence Sciences.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wu, X.; Sahoo, D.; Hoi, S.C. Recent advances in deep learning for object detection. Neurocomputing 2020, 396, 39–64. [Google Scholar] [CrossRef] [Green Version]
  2. Zou, Z.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. arXiv 2019, arXiv:1905.05055. [Google Scholar]
  3. Huang, Z.; Li, W.; Xia, X.G.; Wu, X.; Cai, Z.; Tao, R. A novel nonlocal-aware pyramid and multiscale multitask refinement detector for object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–20. [Google Scholar] [CrossRef]
  4. Guo, M.; Xu, T.; Liu, J.; Liu, Z.; Jiang, P.; Mu, T.; Zhang, S.; Martin, R.; Cheng, M.; Hu, S. Attention mechanisms in computer vision: A survey. arXiv 2021, arXiv:2111.07624. [Google Scholar] [CrossRef]
  5. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  6. Chaudhari, S.; Mithal, V.; Polatkan, G.; Ramanath, R. An attentive survey of attention models. ACM Trans. Intell. Syst. Technol. (TIST) 2021, 12, 1–32. [Google Scholar] [CrossRef]
  7. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. Bam: Bottleneck attention module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  8. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  9. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  10. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
  11. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  12. Everingham, M.; Eslami, S.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
  13. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  14. Wang, F.; Tax, D.M. Survey on the attention based RNN model and its applications in computer vision. arXiv 2016, arXiv:1601.06823. [Google Scholar]
  15. Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5659–5667. [Google Scholar]
  16. Gao, Z.; Xie, J.; Wang, Q.; Li, P. Global second-order pooling convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–21 June 2019; pp. 3024–3033. [Google Scholar]
  17. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar]
  18. Qin, Z.; Zhang, P.; Wu, F.; Li, X. Fcanet: Frequency channel attention networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 783–792. [Google Scholar]
  19. Zhu, X.; Cheng, D.; Zhang, Z.; Lin, S.; Dai, J. An empirical study of spatial attention mechanisms in deep networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 6688–6697. [Google Scholar]
  20. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Vedaldi, A. Gather-excite: Exploiting feature context in convolutional neural networks. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
  21. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164. [Google Scholar]
  22. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–21 June 2019; pp. 3146–3154. [Google Scholar]
  23. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13713–13722. [Google Scholar]
  24. Zhang, Z.; Lan, C.; Zeng, W.; Jin, X.; Chen, Z. Relation-aware global attention for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3186–3195. [Google Scholar]
  25. Yang, B.; Gao, Z.; Gao, Y.; Zhu, Y. Rapid detection and counting of wheat ears in the field using YOLOv4 with attention module. Agronomy 2021, 11, 1202. [Google Scholar] [CrossRef]
  26. Kim, M.; Jeong, J.; Kim, S. ECAP-YOLO: Efficient Channel Attention Pyramid YOLO for Small Object Detection in Aerial Image. Remote Sens. 2021, 13, 4851. [Google Scholar] [CrossRef]
  27. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  28. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  29. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef] [Green Version]
  30. Zhang, H.; Zu, K.; Lu, J.; Zou, Y.; Meng, D. Epsanet: An efficient pyramid split attention block on convolutional neural network. arXiv 2021, arXiv:2105.14447. [Google Scholar]
  31. Guo, J.; Ma, X.; Sansom, A.; McGuire, M.; Kalaani, A.; Chen, Q.; Tang, S.; Yang, Q.; Fu, S. Spanet: Spatial pyramid attention network for enhanced image recognition. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK, 6–10 July 2019; pp. 1–6. [Google Scholar]
  32. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision(WACV), Waikoloa Beach, HI, USA, 5–9 January 2021; pp. 3139–3148. [Google Scholar]
  33. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  34. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  35. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
  36. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  37. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Figure 1. Multiscale feature extraction module.
Figure 2. Comparison with different attention mechanisms.
Figure 3. Diagram of our spatial attention.
Figure 4. The structure of M3Att.
Figure 5. M3Att integrated in YoloV4.
Figure 6. Examples of some images from the datasets used in this paper.
Figure 7. Visual comparison of VOC datasets.
Figure 8. Visualization of Grad-CAM.
Figure 9. P-R graph for objects in the KITTI dataset.
Figure 10. Comparison of the actual detection effect of different attention mechanisms.
Table 1. Key parameters of three public datasets.

Name | Train Images | Test Images | Classes
VOC2007 | 5011 | 4952 | 20
VOC2007 + VOC2012 | 22,136 | 4952 | 20
KITTI | 6883 | 765 | 4
Table 2. Key parameters of the settings.

Experimental Parameter | Parameter Value
Image size | 416 × 416
Learning rate | 0.001
Batch size | 16
Epochs | 100
Optimizer | Adam
Table 3. Experiments with M3Att embedded in different object detection networks.

Model | Input Size | Dataset | mAP
Yolov3 [33] | 416 × 416 | 07 / 07+12 | 79.77 / 81.76
Yolov3 [33] + M3Att | 416 × 416 | 07 / 07+12 | 81.18 / 83.82
Yolov4 [11] | 416 × 416 | 07 / 07+12 | 82.22 / 87.79
Yolov4 [11] + M3Att | 416 × 416 | 07 / 07+12 | 87.15 / 87.41
Yolov5 | 416 × 416 | 07 / 07+12 | 85.01 / 87.62
Yolov5 + M3Att | 416 × 416 | 07 / 07+12 | 87.06 / 88.48
YoloX [34] | 416 × 416 | 07 / 07+12 | 82.33 / 85.91
YoloX [34] + M3Att | 416 × 416 | 07 / 07+12 | 88.09 / 88.52
SSD [35] | 300 × 300 | 07 / 07+12 | 68.0 / 74.3
SSD [35] + M3Att | 300 × 300 | 07 / 07+12 | 76.41 / 77.50
Faster R-CNN [36] | 600 × 600 | 07 / 07+12 | 73.28 / 76.86
Faster R-CNN [36] + M3Att | 600 × 600 | 07 / 07+12 | 77.47 / 78.65
Table 4. Experimental results of different attention mechanisms (based on YoloV4).

Method | mAP | Aero | Bike | Bird | Boat | Bottle | Bus | Car | Cat | Chair | Cow | Table | Dog | Horse | Mbike | Person | Plant | Sheep | Sofa | Train | TV
YoloV4 [11] | 82.22 | 91.78 | 89.98 | 86.21 | 73.60 | 74.48 | 90.65 | 93.33 | 91.35 | 68.92 | 92.03 | 76.22 | 89.48 | 93.61 | 92.96 | 90.34 | 54.78 | 86.67 | 91.41 | 91.99 | 86.49
YoloV4 [11] + SE [5] | 85.54 | 93.72 | 92.33 | 84.48 | 78.70 | 77.82 | 91.50 | 93.62 | 89.74 | 72.93 | 89.15 | 80.21 | 86.44 | 91.32 | 93.31 | 89.93 | 61.42 | 85.55 | 79.61 | 91.70 | 87.37
YoloV4 [11] + CA [23] | 84.31 | 90.15 | 91.90 | 83.78 | 75.05 | 74.32 | 89.86 | 93.36 | 90.57 | 68.90 | 91.39 | 78.50 | 85.14 | 91.73 | 90.60 | 89.41 | 60.21 | 86.36 | 76.22 | 92.08 | 86.71
YoloV4 [11] + CBAM [8] | 85.43 | 93.26 | 92.59 | 84.45 | 75.67 | 76.32 | 89.66 | 93.30 | 89.82 | 72.48 | 92.15 | 80.52 | 87.76 | 92.29 | 91.77 | 89.21 | 62.12 | 87.73 | 80.59 | 90.31 | 86.57
YoloV4 [11] + ECA [17] | 85.99 | 83.66 | 91.16 | 86.21 | 75.02 | 78.43 | 91.10 | 93.61 | 90.79 | 73.15 | 93.13 | 81.21 | 87.81 | 92.59 | 92.76 | 90.63 | 63.44 | 88.71 | 80.37 | 90.59 | 85.47
YoloV4 [11] + SPANet [31] | 82.37 | 88.69 | 87.79 | 83.80 | 71.26 | 70.98 | 89.16 | 91.67 | 89.44 | 66.56 | 90.30 | 74.78 | 86.71 | 91.46 | 88.06 | 88.46 | 54.84 | 82.82 | 76.30 | 90.71 | 83.55
YoloV4 [11] + DANet [22] | 87.09 | 93.24 | 90.63 | 87.62 | 79.90 | 80.03 | 92.03 | 91.34 | 90.82 | 74.23 | 93.35 | 82.96 | 91.01 | 92.84 | 92.98 | 90.94 | 63.41 | 91.84 | 79.77 | 95.06 | 87.92
YoloV4 [11] + EPSANet [30] | 84.32 | 91.41 | 88.85 | 85.73 | 73.90 | 76.15 | 90.58 | 92.75 | 88.56 | 69.90 | 90.57 | 81.06 | 84.62 | 90.82 | 91.94 | 89.53 | 60.43 | 87.09 | 74.52 | 92.39 | 85.64
YoloV4 [11] + Triplet [32] | 82.41 | 90.68 | 87.65 | 82.25 | 70.38 | 73.49 | 89.10 | 91.46 | 89.64 | 66.55 | 88.96 | 75.20 | 85.79 | 91.15 | 88.17 | 87.56 | 55.79 | 83.99 | 76.08 | 91.50 | 82.91
YoloV4 [11] + Residual [21] | 85.81 | 93.93 | 91.78 | 84.96 | 78.95 | 77.61 | 89.25 | 92.84 | 90.37 | 71.57 | 92.14 | 81.91 | 88.09 | 93.14 | 92.43 | 89.69 | 60.39 | 88.02 | 80.72 | 92.18 | 86.29
YoloV4 [11] + M3Att | 87.15 | 93.28 | 91.08 | 87.09 | 79.35 | 79.30 | 93.00 | 94.21 | 92.29 | 73.05 | 92.88 | 83.66 | 91.36 | 92.03 | 93.02 | 91.09 | 64.12 | 90.87 | 81.28 | 94.00 | 85.34
Table 5. Comparison of mechanism parameters.

Model | Input Size | mAP (%) | Parameters (M) | GFlops (G)
Yolov4 [11] | 416 × 416 | 82.22 | 64.040 | 29.948
Yolov4 [11] + SE [5] | 416 × 416 | 85.54 | 64.083 | 59.898
Yolov4 [11] + CA [23] | 416 × 416 | 84.31 | 64.075 | 59.659
Yolov4 [11] + CBAM [8] | 416 × 416 | 85.43 | 64.731 | 59.900
Yolov4 [11] + ECA [17] | 416 × 416 | 85.99 | 64.040 | 59.898
Yolov4 [11] + SPANet [31] | 416 × 416 | 85.54 | 64.513 | 59.901
YoloV4 [11] + DANet [22] | 416 × 416 | 87.09 | 89.806 | 68.606
YoloV4 [11] + EPSANet [30] | 416 × 416 | 84.32 | 65.755 | 60.482
YoloV4 [11] + Triplet [32] | 416 × 416 | 82.41 | 64.041 | 59.902
YoloV4 [11] + Residual [21] | 416 × 416 | 85.81 | 64.999 | 61.204
YoloV4 [11] + M3Att | 416 × 416 | 87.15 | 64.467 | 60.822
Table 6. Experimental comparison results on the KITTI dataset.

Model | Input Size | mAP
Yolov4 [11] | 416 × 416 | 81.56
Yolov4 [11] + SE [5] | 416 × 416 | 84.55
Yolov4 [11] + CA [23] | 416 × 416 | 85.11
Yolov4 [11] + CBAM [8] | 416 × 416 | 83.39
Yolov4 [11] + ECANet [17] | 416 × 416 | 85.92
Yolov4 [11] + SPANet [31] | 416 × 416 | 85.21
Yolov4 [11] + DANet [22] | 416 × 416 | 86.09
Yolov4 [11] + EPSANet [30] | 416 × 416 | 82.72
Yolov4 [11] + Triplet [32] | 416 × 416 | 82.99
YoloV4 [11] + Residual [21] | 416 × 416 | 83.35
Yolov4 [11] + M3Att | 416 × 416 | 86.94
Table 7. Results of group size ablation experiments.

Kernel Size | Group Size | Input Size | mAP (%) | Parameters (M)
(3,5,7,9) | (4,8,16,16) | 416 × 416 | 86.28 | 64.409
(3,5,7,9) | (4,4,4,4) | 416 × 416 | 87.29 | 64.972
(3,5,7,9) | (16,16,16,16) | 416 × 416 | 87.08 | 64.343
(3,5,7,9) | (1,4,8,16) | 416 × 416 | 87.01 | 64.674
(3,5,7,9) | (1,4,16,32) | 416 × 416 | 87.15 | 64.467
(3,5,7,9) | (1,8,16,32) | 416 × 416 | 86.97 | 64.496
(3,5,7,9) | (8,8,8,8) | 416 × 416 | 87.15 | 64.553
(3,3,3,3) | / | 416 × 416 | 86.55 | 64.870
(3, 2×3, 3×3, 3×3) | (1,4,16,32) | 416 × 416 | 86.22 | 65.008
Table 8. Ablation experiments on the components of M3Att (PASCAL VOC2007 dataset).

YoloV4 | MFE | Channel Attention | Spatial Attention | Skip Connection | mAP
✓ | | | | | 82.22
✓ | ✓ | | | | 85.41
✓ | ✓ | ✓ | | | 85.99
✓ | ✓ | ✓ | ✓ | | 86.97
✓ | ✓ | ✓ | ✓ | ✓ | 87.15
Back to TopTop