HAFNet: Hierarchical Attentive Fusion Network for Multispectral Pedestrian Detection
Abstract
1. Introduction
- A novel Hierarchical Attentive Fusion Network (HAFNet) is proposed, which progressively calibrates features from the visible and thermal modalities to produce an improved fused representation.
- A novel Hierarchical Content-dependent Attentive Fusion (HCAF) module is presented, which derives hierarchical reference features from the top-level features of both modalities and uses them to guide the pixel-wise fusion of multi-modality features at each stage, improving feature alignment and integration (a minimal illustrative sketch follows this list).
- A novel Multi-modality Feature Alignment (MFA) block is proposed, which can be readily integrated into any pre-trained multi-branch backbone to enhance the learned feature representations.
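To make the reference-guided fusion concrete, the following is a minimal PyTorch-style sketch of pixel-wise fusion steered by an upsampled top-level reference feature, in the spirit of HCAF. The module name (`HRGFusion`), the layer sizes, and the single-channel blending design are illustrative assumptions on our part, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HRGFusion(nn.Module):
    """Illustrative reference-guided pixel-wise fusion (names are ours).

    A top-level reference feature is upsampled to the current stage's
    resolution and used to predict per-pixel weights that blend the
    visible and thermal feature maps.
    """

    def __init__(self, in_channels: int, ref_channels: int):
        super().__init__()
        # Predict a single-channel blending weight from the concatenated
        # modality features and the upsampled reference feature.
        self.weight_head = nn.Sequential(
            nn.Conv2d(2 * in_channels + ref_channels, in_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, fv, ft, ref):
        # fv, ft: (B, C, H, W) visible/thermal features at this stage;
        # ref: (B, Cr, h, w) top-level reference feature with h <= H, w <= W.
        ref_up = F.interpolate(ref, size=fv.shape[-2:], mode="bilinear", align_corners=False)
        w = self.weight_head(torch.cat([fv, ft, ref_up], dim=1))  # (B, 1, H, W)
        return w * fv + (1.0 - w) * ft  # pixel-wise convex blend

# Example: fuse stage-3 features under a stage-5 reference.
fv, ft = torch.randn(2, 256, 64, 80), torch.randn(2, 256, 64, 80)
ref = torch.randn(2, 512, 16, 20)
print(HRGFusion(256, 512)(fv, ft, ref).shape)  # torch.Size([2, 256, 64, 80])
```

In the full network, the fused map from each stage would feed the detection head; the sketch shows only a single stage's blend.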
2. Related Work
2.1. Multispectral Pedestrian Detection
2.2. Attention Mechanism
3. Method
3.1. Hierarchical Content-Dependent Attentive Fusion
3.1.1. Formulation
3.1.2. Implementation Design
3.2. Multi-Modality Feature Alignment
3.2.1. Formulation Details
3.2.2. Block Design
3.2.3. Integration into Backbone CNNs
3.3. Optimization
4. Experiments
4.1. Implementation Details
4.2. Results on KAIST Dataset
4.3. Results on CVC-14 Dataset
4.4. Ablation Study
4.4.1. Ablations on Network Components
4.4.2. Discussion on HCAF
4.4.3. Discussion on Stages of MFA
4.4.4. Discussion on Modalities Input
5. Conclusions
6. Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Kuras, A.; Brell, M.; Liland, K.H.; Burud, I. Multitemporal Feature-Level Fusion on Hyperspectral and LiDAR Data in the Urban Environment. Remote Sens. 2023, 15, 632.
2. You, Y.; Cao, J.; Zhou, W. A survey of change detection methods based on remote sensing images for multi-source and multi-objective scenarios. Remote Sens. 2020, 12, 2460.
3. Wu, B.; Iandola, F.; Jin, P.H.; Keutzer, K. SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 129–137.
4. Luo, Y.; Yin, D.; Wang, A.; Wu, W. Pedestrian tracking in surveillance video based on modified CNN. Multimed. Tools Appl. 2018, 77, 24041–24058.
5. Li, X.; Li, L.; Flohr, F.; Wang, J.; Xiong, H.; Bernhard, M.; Pan, S.; Gavrila, D.M.; Li, K. A unified framework for concurrent pedestrian and cyclist detection. IEEE Trans. Intell. Transp. Syst. 2016, 18, 269–281.
6. Li, C.; Song, D.; Tong, R.; Tang, M. Illumination-aware faster R-CNN for robust multispectral pedestrian detection. Pattern Recognit. 2019, 85, 161–171.
7. Zhang, H.; Fromont, E.; Lefevre, S.; Avignon, B. Multispectral fusion for object detection with cyclic fuse-and-refine blocks. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 276–280.
8. Zhang, L.; Liu, Z.; Zhang, S.; Yang, X.; Qiao, H.; Huang, K.; Hussain, A. Cross-modality interactive attention network for multispectral pedestrian detection. Inf. Fusion 2019, 50, 20–29.
9. Kim, J.U.; Park, S.; Ro, Y.M. Uncertainty-guided cross-modal learning for robust multispectral pedestrian detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1510–1523.
10. Dollár, P.; Appel, R.; Belongie, S.; Perona, P. Fast feature pyramids for object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1532–1545.
11. Zhou, K.; Chen, L.; Cao, X. Improving multispectral pedestrian detection by addressing modality imbalance problems. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 787–803.
12. Liu, J.; Zhang, S.; Wang, S.; Metaxas, D.N. Multispectral deep neural networks for pedestrian detection. arXiv 2016, arXiv:1611.02644.
13. Qingyun, F.; Dapeng, H.; Zhaokui, W. Cross-modality fusion transformer for multispectral object detection. arXiv 2021, arXiv:2111.00273.
14. Hwang, S.; Park, J.; Kim, N.; Choi, Y.; So Kweon, I. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1037–1045.
15. González, A.; Fang, Z.; Socarras, Y.; Serrat, J.; Vázquez, D.; Xu, J.; López, A.M. Pedestrian detection at day/night time with visible and FIR cameras: A comparison. Sensors 2016, 16, 820.
16. Wagner, J.; Fischer, V.; Herman, M.; Behnke, S. Multispectral pedestrian detection using deep fusion convolutional neural networks. In Proceedings of the ESANN, Bruges, Belgium, 27–29 April 2016; Volume 587, pp. 509–514.
17. Konig, D.; Adam, M.; Jarvers, C.; Layher, G.; Neumann, H.; Teutsch, M. Fully convolutional region proposal networks for multispectral person detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 49–56.
18. Li, C.; Song, D.; Tong, R.; Tang, M. Multispectral pedestrian detection via simultaneous detection and segmentation. arXiv 2018, arXiv:1808.04818.
19. Guan, D.; Cao, Y.; Yang, J.; Cao, Y.; Yang, M.Y. Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection. Inf. Fusion 2019, 50, 148–157.
20. Zhang, L.; Liu, Z.; Zhu, X.; Song, Z.; Yang, X.; Lei, Z.; Qiao, H. Weakly aligned feature fusion for multimodal object detection. arXiv 2021.
21. Kim, J.; Kim, H.; Kim, T.; Kim, N.; Choi, Y. MLPD: Multi-label pedestrian detector in multispectral domain. IEEE Robot. Autom. Lett. 2021, 6, 7846–7853.
22. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
23. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
24. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803.
25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
26. Liu, Y.; Chen, X.; Ward, R.K.; Wang, Z.J. Image fusion with convolutional sparse representation. IEEE Signal Process. Lett. 2016, 23, 1882–1886.
27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
28. Li, X.; You, A.; Zhu, Z.; Zhao, H.; Yang, M.; Yang, K.; Tan, S.; Tong, Y. Semantic flow for fast and accurate scene parsing. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 775–793.
29. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
30. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; pp. 249–256.
31. Zhang, L.; Zhu, X.; Chen, X.; Yang, X.; Lei, Z.; Liu, Z. Weakly aligned cross-modal learning for multispectral pedestrian detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5127–5137.
32. Yang, X.; Qiang, Y.; Zhu, H.; Wang, C.; Yang, M. BAANet: Learning bi-directional adaptive attention gates for multispectral pedestrian detection. arXiv 2021, arXiv:2112.02277.
33. Wang, Q.; Chi, Y.; Shen, T.; Song, J.; Zhang, Z.; Zhu, Y. Improving RGB-infrared object detection by reducing cross-modality redundancy. Remote Sens. 2022, 14, 2020.
34. Park, K.; Kim, S.; Sohn, K. Unified multi-spectral pedestrian detection based on probabilistic fusion networks. Pattern Recognit. 2018, 80, 143–155.
35. Choi, H.; Kim, S.; Park, K.; Sohn, K. Multi-spectral pedestrian detection based on accumulated object proposal with fully convolutional networks. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 621–626.
Miss rate (%) on the KAIST test set: overall performance (All/Day/Night), scale subsets (Near/Medium/Far), and occlusion subsets (None/Partial/Heavy); lower is better.

| Method | All | Day | Night | Near | Medium | Far | None | Partial | Heavy |
|---|---|---|---|---|---|---|---|---|---|
| ACF [14] | 47.32 | 42.57 | 56.17 | 28.74 | 53.67 | 88.20 | 62.94 | 81.40 | 88.08 |
| Halfway Fusion [12] | 25.75 | 24.88 | 26.59 | 8.13 | 30.34 | 75.70 | 43.13 | 65.21 | 74.36 |
| Fusion RPN + BF [17] | 18.29 | 19.57 | 16.27 | 0.04 | 30.87 | 88.86 | 47.45 | 56.10 | 72.20 |
| MSDS-RCNN [18] | 11.63 | 10.60 | 13.73 | 1.29 | 16.19 | 63.73 | 29.86 | 38.71 | 63.37 |
| IAF-RCNN [6] | 15.73 | 14.55 | 18.26 | 0.96 | 25.54 | 77.84 | 40.17 | 48.40 | 69.76 |
| IATDNN + IAMSS [19] | 14.96 | 14.67 | 15.72 | 0.04 | 28.55 | 83.42 | 45.43 | 46.25 | 64.57 |
| CIAN [8] | 14.12 | 14.77 | 11.13 | 3.71 | 19.04 | 55.82 | 30.31 | 41.57 | 62.48 |
| AR-CNN [31] | 9.34 | 9.94 | 8.38 | 0.00 | 16.08 | 69.00 | 31.40 | 38.63 | 55.73 |
| MBNet [11] | 8.13 | 8.28 | 7.86 | 0.00 | 16.07 | 55.99 | 27.74 | 35.43 | 59.14 |
| MLPD [21] | 7.58 | 7.95 | 6.95 | - | - | - | - | - | - |
| BAANet [32] | 7.92 | 8.37 | 6.98 | 0.00 | 13.72 | 51.25 | 25.15 | 34.07 | 57.92 |
| RISNet [33] | 7.89 | 7.61 | 7.08 | 0.00 | 14.01 | 52.67 | 25.23 | 34.25 | 56.14 |
| HAFNet (Ours) | 6.93 | 7.68 | 5.66 | 0.00 | 13.68 | 53.94 | 26.31 | 30.10 | 55.16 |
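As a reading aid for the result tables: the miss rate reported on KAIST is, by common convention, the log-average miss rate over false positives per image (FPPI) sampled in [10^-2, 10^0], often written MR^-2. Below is a minimal NumPy sketch of that aggregation, assuming a detector's miss-rate/FPPI curve has already been computed; the function name and the toy curve are ours.

```python
import numpy as np

def log_average_miss_rate(fppi: np.ndarray, miss_rate: np.ndarray) -> float:
    """Log-average miss rate (MR^-2), the usual KAIST/Caltech summary metric.

    Averages the miss rate at nine FPPI reference points evenly spaced in
    log-space over [1e-2, 1e0], interpolating the curve where needed.
    """
    refs = np.logspace(-2.0, 0.0, num=9)
    order = np.argsort(fppi)
    fppi, miss_rate = fppi[order], miss_rate[order]
    mrs = np.interp(refs, fppi, miss_rate)  # clamps beyond the curve's ends
    # Geometric mean; guard against a zero miss rate before taking logs.
    return float(np.exp(np.mean(np.log(np.maximum(mrs, 1e-10)))))

# Toy curve: miss rate falls as the allowed FPPI grows.
fppi = np.array([0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 2.0])
mr = np.array([0.60, 0.45, 0.30, 0.20, 0.10, 0.07, 0.05])
print(f"MR^-2 = {100 * log_average_miss_rate(fppi, mr):.2f}%")
```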
| Method | MR-All (%) | Platform | Speed (s) |
|---|---|---|---|
| ACF [14] | 47.32 | MATLAB | 2.730 |
| Fusion RPN + BF [17] | 18.29 | MATLAB | 0.800 |
| CIAN [8] | 14.12 | GTX 1080Ti | 0.070 |
| AR-CNN [31] | 9.34 | GTX 1080Ti | 0.120 |
| MBNet [11] | 8.13 | GTX 1080Ti | 0.070 |
| MLPD [21] | 7.58 | GTX 1080Ti | 0.012 |
| BAANet [32] | 7.92 | GTX 1080Ti | 0.070 |
| HAFNet (Ours) | 6.93 | GTX 1080Ti | 0.017 |
Miss rate (%) on the CVC-14 dataset.

| Modality Input | Method | Day | Night | All |
|---|---|---|---|---|
| Visible only | SVM [15] | 37.6 | 76.9 | - |
| | DPM [15] | 25.2 | 76.4 | - |
| | Random Forest [15] | 26.6 | 81.2 | - |
| | ACF [34] | 65.0 | 83.2 | 71.3 |
| | Faster R-CNN [34] | 43.2 | 71.4 | 51.9 |
| Visible + Thermal | MACF [34] | 61.3 | 48.2 | 60.1 |
| | Choi et al. [35] | 49.3 | 43.8 | 47.3 |
| | Halfway Fusion [34] | 38.1 | 34.4 | 37.0 |
| | Park et al. [34] | 31.8 | 30.8 | 31.4 |
| | AR-CNN [31] | 24.7 | 18.1 | 22.1 |
| | MLPD [21] | 24.18 | 17.97 | 21.33 |
| | MBNet [11] | 24.7 | 13.5 | 21.1 |
| | HAFNet (Ours) | 23.9 | 14.3 | 20.7 |
Ablation of network components on KAIST (miss rate, %); CMA and HRG are the two sub-components of HCAF, and ✓ marks an enabled component.

| CMA | HRG | MFA | All | Day | Night |
|---|---|---|---|---|---|
| | | | 12.68 | 13.70 | 10.76 |
| ✓ | ✓ | | 8.31 | 9.03 | 6.67 |
| | | ✓ | 10.33 | 12.18 | 7.53 |
| ✓ | | ✓ | 7.93 | 9.87 | 8.36 |
| | ✓ | ✓ | 8.70 | 9.53 | 7.64 |
| ✓ | ✓ | ✓ | 6.93 | 7.68 | 5.66 |
Comparison of HCAF designs on KAIST (miss rate, %); SA(n)/CMA(n) denotes self-attention/cross-modality attention applied at n stages.

| Model | HCAF | All | Day | Night | Speed (s) |
|---|---|---|---|---|---|
| HAFNet | SA(4) + HRG | 8.70 | 9.53 | 7.64 | 0.045 |
| | SA(3) + CMA(1) + HRG | 8.28 | 9.35 | 6.01 | 0.035 |
| | SA(2) + CMA(2) + HRG | 7.73 | 8.35 | 6.50 | 0.027 |
| | SA(1) + CMA(3) + HRG | 7.28 | 7.92 | 5.91 | 0.021 |
| | CMA(4) + HRG | 6.93 | 7.68 | 5.66 | 0.017 |
| HAFNet * | CMA(5) + HRG | 7.11 | 7.68 | 6.40 | 0.053 |
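The SA(n)/CMA(n) comparison above contrasts per-modality self-attention with cross-modality attention at each backbone stage. As a rough illustration of the cross-modality case, the non-local-style sketch below lets one modality's features query the other's; the class name, reduction factor, and residual design are our assumptions and need not match the paper's exact CMA block.

```python
import torch
import torch.nn as nn

class CrossModalityAttention(nn.Module):
    """Illustrative cross-modality attention (non-local style; names ours).

    Queries come from one modality and keys/values from the other, so each
    visible location can attend over all thermal locations (and vice versa
    when the block is applied symmetrically).
    """

    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        inner = channels // reduction
        self.q = nn.Conv2d(channels, inner, 1)
        self.k = nn.Conv2d(channels, inner, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, fq, fkv):
        b, c, h, w = fq.shape
        q = self.q(fq).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.k(fkv).flatten(2)                   # (B, C', HW)
        v = self.v(fkv).flatten(2).transpose(1, 2)   # (B, HW, C)
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)  # (B, HW, HW)
        ctx = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return fq + self.out(ctx)  # residual connection

# Visible features attend to thermal features.
fv, ft = torch.randn(1, 256, 32, 40), torch.randn(1, 256, 32, 40)
print(CrossModalityAttention(256)(fv, ft).shape)  # torch.Size([1, 256, 32, 40])
```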
Discussion of HRG on KAIST (miss rate, %).

| Model | HRG | All | Day | Night | Speed (s) |
|---|---|---|---|---|---|
| HAFNet | | 8.27 | 8.88 | 7.07 | 0.011 |
| HAFNet * | | 8.36 | 9.88 | 5.68 | 0.012 |
| HAFNet | | 7.70 | 8.80 | 5.69 | 0.021 |
| HAFNet * | | 8.07 | 9.10 | 5.82 | 0.035 |
| HAFNet | | 6.93 | 7.68 | 5.66 | 0.017 |
| HAFNet * | | 7.38 | 7.74 | 6.45 | 0.028 |
Effect of the reference stage used by HRG (miss rate, %).

| Model | HRG | Reference Stage | All | Day | Night | Speed (s) |
|---|---|---|---|---|---|---|
| HAFNet | spatial attention | 2 | 7.71 | 8.64 | 5.98 | 0.026 |
| | | 3 | 7.53 | 8.21 | 6.08 | 0.021 |
| | | 4 | 7.04 | 7.94 | 5.82 | 0.019 |
| | | 5 | 6.93 | 7.68 | 5.66 | 0.017 |
Effect of the number of stages equipped with MFA (miss rate, %).

| Model | MFA Stages | All | Day | Night |
|---|---|---|---|---|
| HAFNet | 1 | 8.17 | 8.84 | 6.97 |
| | 2 | 7.69 | 8.62 | 5.86 |
| | 3 | 7.16 | 7.56 | 6.23 |
| | 4 | 6.93 | 7.68 | 5.66 |
| HAFNet * | 5 | 7.01 | 7.71 | 5.54 |
Miss rate (%) of HAFNet with different input modalities on KAIST.

| Model | Modality Input | All | Day | Night |
|---|---|---|---|---|
| HAFNet | Visible only | 26.49 | 17.76 | 44.61 |
| | Thermal only | 19.72 | 23.30 | 12.53 |
| | Visible + thermal | 6.93 | 7.68 | 5.66 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).