Hybrid Cross-Feature Interaction Attention Module for Object Detection in Intelligent Mobile Scenes
Abstract
1. Introduction
- (1) Building on the structure of SPP, we introduce the MKSPP module and, leveraging the MKSPP structure, improve channel attention to obtain the HCCI module with stronger cross-channel interaction (a minimal sketch of this channel branch follows the list).
- (2) We further enhance spatial attention by incorporating dilated convolutions, obtaining the CSI module with stronger cross-spatial interaction.
- (3) Building on the improved channel and spatial attention modules, we propose the HCFI attention module. For detectors whose output feature layers are extremely small, we additionally propose a scheme that combines HCCI with HCFI. Experiments on several detectors and datasets validate the effectiveness of the proposed method for object detection.
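As a concrete illustration of contribution (1), the following is a minimal, hypothetical PyTorch sketch of the HCCI channel branch (see also Algorithm 1 in Section 3.4). The class name HCCISketch, the choice of ECA-style 1D convolutions with kernel sizes 3–15 for the interaction ranges C3–C15, and the kernel size 3 of the integrating Conv1d are assumptions made for illustration, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class HCCISketch(nn.Module):
    """Sketch of an HCCI-style channel branch (steps 1-3 of Algorithm 1).

    Assumption: each interaction range C3..C15 is an ECA-style 1D convolution
    (kernel sizes 3..15) over the globally average-pooled channel descriptor.
    """
    def __init__(self, kernel_sizes=(3, 5, 7, 9, 11, 13, 15)):
        super().__init__()
        # One 1D convolution per cross-channel interaction range (step 1).
        self.branches = nn.ModuleList(
            nn.Conv1d(1, 1, k, padding=k // 2, bias=False) for k in kernel_sizes
        )
        # Integrates the summed interaction features (step 2); kernel size assumed.
        self.fuse = nn.Conv1d(1, 1, 3, padding=1, bias=False)

    def forward(self, x):                                   # x: [N, C, H, W]
        y = x.mean(dim=(2, 3)).unsqueeze(1)                 # channel descriptor: [N, 1, C]
        f = sum(branch(y) for branch in self.branches)      # step 1: C3 + ... + C15
        f = self.fuse(f)                                    # step 2: integrated features F
        w = torch.sigmoid(f).transpose(1, 2).unsqueeze(-1)  # per-channel weights: [N, C, 1, 1]
        return x * w                                        # step 3: reweight the input
```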
2. Related Work
2.1. Object Detection
2.2. Attention Mechanism
3. Method
3.1. Multiple Kernel SPP Block
3.2. Hybrid Cross-Channel Interaction Attention Module
3.3. Cross-Space Interaction Attention Module
3.4. Hybrid Cross-Feature Interaction Attention Module
Algorithm 1: HCFI Module
Input: input features with shape [N1, C1, H1, W1]
Output: output features with shape [N3, C3, H3, W3]
1. Calculate the cross-channel interaction features for the different interaction ranges: C3, C5, C7, C9, C11, C13, C15.
2. Calculate the integrated channel features: F = Conv1d(C3 + C5 + C7 + C9 + C11 + C13 + C15).
3. Calculate the output features of the HCCI module: [N2, C2, H2, W2] = [N1, C1, H1, W1] × Sigmoid(F).
4. Calculate the integrated spatial features: FS_avg, FS_max.
5. Calculate the cross-space interaction features with a dilated convolution: M = f_2^{7×7}([FS_avg; FS_max]).
6. Calculate the output features of the CSI module: [N3, C3, H3, W3] = [N2, C2, H2, W2] × Sigmoid(M).
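Continuing the sketch given after the contribution list, the following hypothetical PyTorch code renders steps 4–6 (the CSI spatial branch) and composes it with HCCISketch into an HCFI-style module that mirrors Algorithm 1. The dilation rate of 2 for the 7×7 convolution and the class names are assumptions; Algorithm 1 only states that a dilated convolution is applied to the concatenated average- and max-pooled maps.

```python
import torch
import torch.nn as nn

class CSISketch(nn.Module):
    """Sketch of a CSI-style spatial branch (steps 4-6 of Algorithm 1).

    Assumption: the dilated 7x7 convolution uses dilation rate 2; padding is
    chosen so that the spatial size of the input is preserved.
    """
    def __init__(self, dilation=2):
        super().__init__()
        pad = 3 * dilation                                # keeps H and W for a 7x7 kernel
        self.conv = nn.Conv2d(2, 1, 7, padding=pad, dilation=dilation, bias=False)

    def forward(self, x):                                 # x: [N, C, H, W]
        f_avg = x.mean(dim=1, keepdim=True)               # step 4: FS_avg
        f_max = x.max(dim=1, keepdim=True).values         # step 4: FS_max
        m = self.conv(torch.cat([f_avg, f_max], dim=1))   # step 5: dilated 7x7 convolution
        return x * torch.sigmoid(m)                       # step 6: reweight the input


class HCFISketch(nn.Module):
    """HCFI sketch: channel branch (HCCI) followed by spatial branch (CSI)."""
    def __init__(self):
        super().__init__()
        self.hcci = HCCISketch()   # defined in the sketch after the contribution list
        self.csi = CSISketch()

    def forward(self, x):
        return self.csi(self.hcci(x))
```

For example, `HCFISketch()(torch.randn(1, 256, 40, 40))` returns a tensor of the same shape, so a module of this kind can sit between a backbone feature map and a detection head without changing tensor sizes.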
4. Experiment and Discussion
4.1. Dataset and Detection Algorithm
4.2. Experiments on YOLOX
4.3. Experiments on YOLOv5
4.4. Experiments on SSD
4.5. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.E.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
- Zou, Z.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2019, 111, 257–276. [Google Scholar] [CrossRef]
- Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep Learning for Generic Object Detection: A Survey. Int. J. Comput. Vis. 2018, 128, 261–318. [Google Scholar] [CrossRef]
- Zhao, Z.; Zheng, P.; Xu, S.; Wu, X. Object Detection with Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 3212–3232. [Google Scholar] [CrossRef]
- Tong, K.; Wu, Y.; Zhou, F. Recent advances in small object detection based on deep learning: A review. Image Vis. Comput. 2020, 97, 103910. [Google Scholar] [CrossRef]
- Qin, L.; Shi, Y.; He, Y.; Zhang, J.; Zhang, X.; Li, Y.; Deng, T.; Yan, H. ID-YOLO: Real-Time Salient Object Detection Based on the Driver’s Fixation Region. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15898–15908. [Google Scholar] [CrossRef]
- Tian, D.; Han, Y.; Wang, B.; Guan, T.; Wei, W. A Review of Intelligent Driving Pedestrian Detection Based on Deep Learning. Comput. Intell. Neurosci. 2021, 2021, 5410049. [Google Scholar] [CrossRef]
- Liang, S.; Wu, H.; Zhen, L.; Hua, Q.; Garg, S.; Kaddoum, G.; Hassan, M.; Yu, K. Edge YOLO: Real-Time Intelligent Object Detection System Based on Edge-Cloud Cooperation in Autonomous Vehicles. IEEE Trans. Intell. Transp. Syst. 2022, 23, 25345–25360. [Google Scholar] [CrossRef]
- Wang, X.; Ban, Y.; Guo, H.; Hong, L. Deep Learning Model for Target Detection in Remote Sensing Images Fusing Multilevel Features. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July 2019–2 August 2019; pp. 250–253. [Google Scholar]
- Han, X.; Zhong, Y.; Zhang, L. An Efficient and Robust Integrated Geospatial Object Detection Framework for High Spatial Resolution Remote Sensing Imagery. Remote Sens. 2017, 9, 666. [Google Scholar] [CrossRef]
- Guo, W.; Yang, W.; Zhang, H.; Hua, G. Geospatial Object Detection in High Resolution Satellite Images Based on Multi-Scale Convolutional Neural Network. Remote Sens. 2018, 10, 131. [Google Scholar] [CrossRef]
- Yang, R.; Yu, Y. Artificial Convolutional Neural Network in Object Detection and Semantic Segmentation for Medical Imaging Analysis. Front. Oncol. 2021, 11, 638182. [Google Scholar] [CrossRef]
- Rezaei, M.; Yang, H.; Meinel, C. Instance Tumor Segmentation using Multitask Convolutional Neural Network. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–8. [Google Scholar]
- Ito, S.; Ando, K.; Kobayashi, K.; Nakashima, H.; Oda, M.; Machino, M.; Kanbara, S.; Inoue, T.; Yamaguchi, H.; Koshimizu, H.; et al. Automated Detection of Spinal Schwannomas Utilizing Deep Learning Based on Object Detection from MRI. Spine 2020, 46, 95–100. [Google Scholar] [CrossRef]
- Sande, K.E.; Uijlings, J.R.; Gevers, T.; Smeulders, A. Segmentation as selective search for object recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1879–1886. [Google Scholar]
- Jiang, X.; Pang, Y.; Pan, J.; Li, X. Flexible sliding windows with adaptive pixel strides. Signal Process. 2015, 110, 37–45. [Google Scholar] [CrossRef]
- Guo, M.; Cai, J.; Liu, Z.; Mu, T.; Martin, R.; Hu, S. PCT: Point cloud transformer. Comput. Vis. Media 2020, 7, 187–199. [Google Scholar] [CrossRef]
- Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Tay, F.; Feng, J.; Yan, S. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 538–547. [Google Scholar]
- Wu, H.; Xiao, B.; Codella, N.C.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. CvT: Introducing Convolutions to Vision Transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 22–31. [Google Scholar]
- Wang, Q.; Wu, B.; Zhu, P.F.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar]
- Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 42, 2011–2023. [Google Scholar] [CrossRef]
- Woo, S.; Park, J.; Lee, J.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2018. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 1904–1916. [Google Scholar] [CrossRef]
- Viola, P.A.; Jones, M.J. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, Kauai, HI, USA, 8–14 December 2001; p. 1. [Google Scholar]
- Patle, A.; Chouhan, D.S. SVM kernel functions for classification. In Proceedings of the 2013 International Conference on Advances in Technology and Engineering (ICATE), Mumbai, India, 23–25 January 2013; pp. 1–9. [Google Scholar]
- Girshick, R.B.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
- Girshick, R.B. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149. [Google Scholar] [CrossRef]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.E.; Fu, C.; Berg, A. SSD: Single Shot MultiBox Detector. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016. [Google Scholar]
- Redmon, J.; Divvala, S.K.; Girshick, R.B.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Bochkovskiy, A.; Wang, C.; Liao, H.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
- Jocher, G. YOLOv5. 2023. Available online: https://github.com/ultralytics/yolov5 (accessed on 5 June 2023).
- Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. Int. J. Comput. Vis. 2020, 128, 642–656. [Google Scholar] [CrossRef]
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
- Potapova, E.; Zillich, M.; Vincze, M. Survey of recent advances in 3D visual attention for robotics. Int. J. Robot. Res. 2017, 36, 1159–1176. [Google Scholar] [CrossRef]
- Nguyen, T.V.; Zhao, Q.; Yan, S. Attentive Systems: A Survey. Int. J. Comput. Vis. 2018, 126, 86–110. [Google Scholar] [CrossRef]
- Han, D.; Zhou, S.; Li, K.; Mello, R. Cross-modality Co-attention Networks for Visual Question Answering. Soft Comput. 2021, 25, 5411–5421. [Google Scholar] [CrossRef]
- Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
- Wang, X.; Girshick, R.B.; Gupta, A.K.; He, K. Non-local Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
- Gao, Z.; Xie, J.; Wang, Q.; Li, P. Global Second-Order Pooling Convolutional Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3019–3028. [Google Scholar]
- Fu, J.; Liu, J.; Tian, H.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3141–3149. [Google Scholar]
- Everingham, M.; Gool, L.V.; Williams, C.K.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
- Lin, T.; Maire, M.; Belongie, S.J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014. [Google Scholar]
- Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2633–2642. [Google Scholar]
Method | HCCI | CSI | HCFI | mAP (%) |
---|---|---|---|---|
YOLOX | × | × | × | 59.81 |
YOLOX | √ | × | × | 60.28 |
YOLOX | × | √ (+CBAM channel) | × | 60.37 |
YOLOX | × | × | √ | 60.52 |
Method/Evaluation | mAP50 (%) | mAP75 (%) | mAP50:95 (%) | FPS |
---|---|---|---|---|
YOLOX (base) | 59.81 | 44.03 | 40.42 | 121.33 |
base + ECA | 60.11 | 44.19 | 40.80 | 115.64 |
Relative improv. | 0.50% | 0.36% | 0.94% | |
base + CBAM | 60.28 | 43.98 | 40.63 | 108.28 |
Relative improv. | 0.79% | −0.11% | 0.52% | |
base + HCFI | 60.52 | 45.51 | 41.04 | 107.46 |
Relative improv. | 1.19% | 3.36% | 1.53% | |
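The relative improvement rows in this and the following tables are computed against the baseline value in the same column; for example, the mAP50 gain of HCFI over the YOLOX baseline is (60.52 − 59.81) / 59.81 × 100% ≈ 1.19%.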
Classes | Bicycle | Train | Pedestrian | Truck | Traffic Sign | Car | Bus | Motorcycle | Rider | Traffic Light |
---|---|---|---|---|---|---|---|---|---|---|
Number | 494 | 8 | 5681 | 1907 | 15,710 | 46,499 | 728 | 189 | 286 | 11,886 |
Method/Evaluation | mAP50 (%) | mAP75 (%) | mAP50:95 (%) | mAPsmall (%) | mAPmedium (%) | mAPlarge (%) | FPS |
---|---|---|---|---|---|---|---|
YOLOX (base) | 35.93 | 18.48 | 19.82 | 3.70 | 25.45 | 52.35 | 125.83 |
base + ECA | 36.08 | 19.05 | 20.05 | 4.54 | 25.42 | 52.03 | 119.62 |
Relative improv. | 0.42% | 3.08% | 1.16% | 22.70% | −0.12% | −0.61% | |
base + CBAM | 36.15 | 18.84 | 19.99 | 3.86 | 25.53 | 52.21 | 111.35 |
Relative improv. | 0.61% | 1.95% | 0.86% | 4.32% | 0.31% | −0.27% | |
base + HCFI | 36.60 | 19.26 | 20.39 | 4.70 | 26.60 | 51.00 | 110.94 |
Relative improv. | 1.86% | 4.22% | 2.88% | 27.03% | 4.52% | −2.58% | |
Method | HCCI | CSI | HCFI | mAP (%) |
---|---|---|---|---|
YOLOv5 | × | × | × | 45.39 |
YOLOv5 | √ | × | × | 45.85 |
YOLOv5 | × | √ (+CBAM channel) | × | 45.81 |
YOLOv5 | × | × | √ | 45.98 |
Method/Evaluation | mAP50 (%) | mAP75 (%) | mAP50:95 (%) | FPS |
---|---|---|---|---|
YOLOv5 (base) | 45.39 | 24.55 | 25.83 | 97.71 |
base + ECA | 45.73 | 25.26 | 25.91 | 95.07 |
Relative improv. | 0.75% | 2.89% | 0.31% | |
base + CBAM | 45.75 | 24.61 | 26.10 | 92.45 |
Relative improv. | 0.79% | 0.24% | 1.05% | |
base + HCFI | 45.98 | 25.26 | 26.36 | 89.73 |
Relative improv. | 1.30% | 2.89% | 2.05% | |
Method/Evaluation | mAP50 (%) | mAP75 (%) | mAP50:95 (%) | mAPsmall (%) | mAPmedium (%) | mAPlarge (%) | FPS |
---|---|---|---|---|---|---|---|
YOLOv5 (base) | 63.51 | 47.10 | 43.00 | 22.56 | 43.70 | 55.59 | 96.27 |
base + ECA | 63.66 | 48.33 | 44.42 | 24.07 | 45.43 | 56.89 | 93.81 |
Relative improv. | 0.24% | 2.61% | 3.30% | 6.69% | 3.96% | 2.34% | |
base + CBAM | 63.64 | 48.33 | 44.44 | 23.26 | 46.26 | 56.95 | 88.28 |
Relative improv. | 0.20% | 2.61% | 3.35% | 3.10% | 5.86% | 2.45% | |
base + HCFI | 64.14 | 48.49 | 44.91 | 23.69 | 46.68 | 57.39 | 86.37 |
Relative improv. | 0.99% | 2.95% | 4.44% | 5.01% | 6.82% | 3.24% | |
Method/Evaluation | mAP50 (%) | mAP75 (%) | mAP50:95 (%) |
---|---|---|---|
SSD (base) | 86.13 | 61.70 | 55.56 |
base + ECA | 86.20 | 61.64 | 55.64 |
Relative improv. | 0.08% | −0.10% | 0.14% |
base + CBAM | 86.20 | 61.38 | 55.31 |
Relative improv. | 0.08% | −0.52% | −0.45% |
base + HCFI | 86.67 | 62.54 | 55.97 |
Relative improv. | 0.63% | 1.36% | 0.74% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).