Local Adaptive Illumination-Driven Input-Level Fusion for Infrared and Visible Object Detection
Abstract
1. Introduction
- (1) A local illumination perception module, supervised by pixel-level statistics, is proposed to fully perceive the illumination differences across regions of the image, providing a more suitable reference for fusion (see the first sketch after this list).
- (2) An offset estimation module built from bottleneck blocks is designed to predict the positional offset of objects between modalities, enabling fast alignment of image pairs (see the second sketch after this list).
- (3) The proposed local adaptive illumination-driven input-level fusion module (LAIIFusion) converts a single-modality object detection model into a multimodal one with minimal changes, improving detection performance while preserving real-time inference (see the third sketch after this list).
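Contribution (1) can be made concrete with a minimal sketch (an assumed formulation with illustrative names such as `illumination_labels`, not the authors' exact implementation): the visible image is split into a grid of regions, the mean grayscale value of each region is computed, and that mean is quantized into one of N intervals; the resulting index is the pixel-statistics label that supervises the local illumination perception module.

```python
# Hedged sketch: region-level illumination labels from pixel statistics.
import torch
import torch.nn.functional as F

def illumination_labels(rgb: torch.Tensor, grid=(8, 8), n_bins: int = 32) -> torch.Tensor:
    """rgb: (B, 3, H, W) visible image in [0, 1]; returns (B, grid_h, grid_w) interval indices."""
    # Luminance of the visible image (ITU-R BT.601 weights).
    gray = (0.299 * rgb[:, 0] + 0.587 * rgb[:, 1] + 0.114 * rgb[:, 2]).unsqueeze(1)
    # Mean brightness of every local region via adaptive average pooling.
    region_mean = F.adaptive_avg_pool2d(gray, grid).squeeze(1)
    # Quantize each region's mean brightness into one of n_bins illumination intervals.
    return (region_mean * n_bins).long().clamp_(max=n_bins - 1)
```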
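For contribution (2), the sketch below is a hypothetical layout rather than the paper's exact architecture: an offset estimation head stacks residual bottleneck blocks [36] on the concatenated modalities, regresses a global (dx, dy) shift, and resamples the infrared image with an affine grid in the spirit of a spatial transformer [35].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """Residual bottleneck block: 1x1 reduce -> 3x3 -> 1x1 expand."""
    def __init__(self, channels: int):
        super().__init__()
        mid = channels // 2
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return F.relu(x + self.block(x))

class OffsetEstimator(nn.Module):
    """Regresses a normalized (dx, dy) shift between the stacked RGB and IR inputs."""
    def __init__(self, channels: int = 16):
        super().__init__()
        self.stem = nn.Conv2d(4, channels, 3, stride=2, padding=1)   # 3-channel RGB + 1-channel IR
        self.blocks = nn.Sequential(Bottleneck(channels), Bottleneck(channels))
        self.head = nn.Linear(channels, 2)

    def forward(self, rgb, ir):
        x = self.stem(torch.cat([rgb, ir], dim=1))
        x = self.blocks(x).mean(dim=(2, 3))           # global average pooling
        return torch.tanh(self.head(x))               # offset in normalized [-1, 1] coordinates

def align_infrared(ir: torch.Tensor, offset: torch.Tensor) -> torch.Tensor:
    """Translate the IR image by the predicted offset via an affine grid."""
    theta = torch.zeros(ir.size(0), 2, 3, device=ir.device)
    theta[:, 0, 0] = theta[:, 1, 1] = 1.0             # identity scaling
    theta[:, :, 2] = offset                           # translation component
    grid = F.affine_grid(theta, list(ir.shape), align_corners=False)
    return F.grid_sample(ir, grid, align_corners=False)
```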
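For contribution (3), the following is a hedged sketch of the input-level fusion idea (an assumed weighted-blend form, not necessarily the paper's exact fusion rule): the region-level illumination weights are upsampled to full resolution and used to blend the visible image with the aligned infrared image, so an unmodified single-modality detector such as YOLOv5 [10] consumes the fused image directly.

```python
import torch
import torch.nn.functional as F

def input_level_fusion(rgb: torch.Tensor, ir: torch.Tensor, region_weight: torch.Tensor) -> torch.Tensor:
    """rgb: (B, 3, H, W); ir: (B, 1, H, W) aligned infrared; region_weight: (B, 1, gh, gw) in [0, 1]."""
    # Upsample region-level illumination weights to full image resolution.
    w = F.interpolate(region_weight, size=rgb.shape[-2:], mode="bilinear", align_corners=False)
    # Well-lit regions rely more on the visible image, dark regions more on the infrared one.
    return w * rgb + (1.0 - w) * ir.expand_as(rgb)

# Usage (hypothetical names): fused = input_level_fusion(rgb, aligned_ir, weights)
# detections = detector(fused)   # e.g., an off-the-shelf YOLOv5 model, unchanged
```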
2. Methods
2.1. Overall Network Architecture
2.2. Local Illumination Perception Module
2.2.1. Local Illumination Perception
2.2.2. Illumination Label Design
2.3. Offset Estimation Module
2.4. Loss Function
3. Experiment and Analysis
3.1. Dataset Introduction
3.2. Implementation Details
3.3. Evaluation Metrics
3.4. Analysis of Results
3.4.1. Experiments on DroneVehicle
3.4.2. Experiments on the KAIST Dataset
3.4.3. Parameter Analysis of RGB Value Interval Division
3.4.4. Generality of Proposed Method on KAIST
3.4.5. Ablation Study
4. Discussion
4.1. Lightweight Technology for Infrared and Visible Object Detection
4.2. Adaptability in Different Around-the-Clock Scenarios
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Han, W.; Chen, J.; Wang, L.; Feng, R.; Li, F.; Wu, L.; Tian, T.; Yan, J. Methods for Small, Weak Object Detection in Optical High-Resolution Remote Sensing Images: A survey of advances and challenges. IEEE Geosci. Remote Sens. Mag. 2021, 9, 8–34. [Google Scholar] [CrossRef]
- Ding, J.; Xue, N.; Long, Y.; Xia, G.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858. [Google Scholar]
- Wang, Q.; Chi, Y.; Shen, T.; Song, J.; Zhang, Z.; Zhu, Y. Improving RGB-Infrared Object Detection by Reducing Cross-Modality Redundancy. Remote Sens. 2022, 14, 2020. [Google Scholar] [CrossRef]
- Yu, D.; Ji, S. A New Spatial-Oriented Object Detection Framework for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4407416. [Google Scholar] [CrossRef]
- Zhang, Z.; Liu, Q.; Wang, Y. Road extraction by deep residual U-Net. IEEE Geosci. Remote Sens. Lett. 2018, 15, 749–753. [Google Scholar] [CrossRef] [Green Version]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
- Liu, S.; Huang, D.; Wang, Y. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 404–419. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- ultralytics. yolov5. Available online: https://github.com/ultralytics/yolov5 (accessed on 18 May 2020).
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Online, 23–28 August 2020; pp. 213–229. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Cambridge, MA, USA, 7–12 December 2015; Volume 28, pp. 91–99. [Google Scholar]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397. [Google Scholar] [CrossRef]
- Qiao, S.; Chen, L.; Yuille, A. DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10208–10219. [Google Scholar]
- Everingham, M.; Eslami, S.; Van Gool, L.; Williams, C.; Winn, J.; Zisserman, A. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
- Lin, T.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
- Zhu, P.; Wen, L.; Bian, X.; Ling, H.; Hu, Q. Vision meets drones: A challenge. arXiv 2018, arXiv:1804.07437. [Google Scholar]
- Zhang, P.; Zhong, Y.; Li, X. SlimYOLOv3: Narrower, Faster and Better for Real-Time UAV Applications. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Seoul, Korea, 27–28 October 2019; pp. 37–45. [Google Scholar]
- Liu, Z.; Gao, G.; Sun, L.; Fang, Z. HRDNet: High-Resolution Detection Network for Small Objects. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
- Li, Z.; Liu, X.; Zhao, Y.; Liu, B.; Huang, Z.; Hong, R. A lightweight multi-scale aggregated model for detecting aerial images captured by uavs. J. Vis. Commun. Image Represent. 2021, 77, 103058. [Google Scholar] [CrossRef]
- Yu, W.; Yang, T.; Chen, C. Towards resolving the challenge of longtail distribution in uav images for object detection. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 3257–3266. [Google Scholar]
- Li, C.; Song, D.; Tong, R.; Tang, M. Multispectral Pedestrian Detection via Simultaneous Detection and Segmentation. In Proceedings of the British Machine Vision Conference (BMVC), Newcastle, UK, 3–6 September 2018; pp. 225.1–225.12. [Google Scholar]
- Wagner, J.; Fischer, V.; Herman, M.; Behnke, S. Multispectral Pedestrian Detection using Deep Fusion Convolutional Neural Networks. ESANN 2016, 587, 509–514. [Google Scholar]
- Li, C.; Song, D.; Tong, R.; Tang, M. Illumination-aware faster R-CNN for robust multispectral pedestrian detection. Pattern Recognit. 2019, 85, 161–171. [Google Scholar] [CrossRef] [Green Version]
- Zhang, L.; Liu, Z.; Zhang, S.; Yang, X.; Qiao, H.; Huang, K.; Hussain, A. Cross-modality interactive attention network for multispectral pedestrian detection. Inf. Fusion 2019, 50, 20–29. [Google Scholar] [CrossRef]
- Guan, D.; Cao, Y.; Yang, J.; Cao, Y.; Yang, M. Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection. Inf. Fusion 2019, 50, 148–157. [Google Scholar] [CrossRef]
- Zhou, K.; Chen, L.; Cao, X. Improving multispectral pedestrian detection by addressing modality imbalance problems. In Proceedings of the European Conference on Computer Vision (ECCV), Online, 23–28 August 2020; pp. 787–803. [Google Scholar]
- Zhang, L.; Zhu, X.; Chen, X.; Yang, X.; Lei, Z.; Liu, Z. Weakly aligned cross-modal learning for multispectral pedestrian detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 5127–5137. [Google Scholar]
- Liu, T.; Lam, K.; Zhao, R.; Qiu, G. Deep Cross-Modal Representation Learning and Distillation for Illumination-Invariant Pedestrian Detection. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 315–329. [Google Scholar] [CrossRef]
- Li, H.; Wu, X. DenseFuse: A Fusion Approach to Infrared and Visible Images. IEEE Trans. Image Process. 2019, 28, 2614–2623. [Google Scholar] [CrossRef] [Green Version]
- Zhao, Z.; Xu, S.; Zhang, C.; Liu, J.; Zhang, J.; Li, P. DIDFuse: Deep Image Decomposition for Infrared and Visible Image Fusion. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI), Yokohama, Japan, 11–17 July 2020; pp. 970–976. [Google Scholar]
- Ma, J.; Wei, Y.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26. [Google Scholar] [CrossRef]
- Li, H.; Wu, X.; Durrani, T. NestFuse: An Infrared and Visible Image Fusion Architecture Based on Nest Connection and Spatial/Channel Attention Models. IEEE Trans. Instrum. Meas. 2020, 69, 9645–9656. [Google Scholar] [CrossRef]
- Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS), Cambridge, MA, USA, 7–12 December 2015; Volume 28. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-based RGB-Infrared Cross-Modality Vehicle Detection via Uncertainty-Aware Learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
- Hwang, S.; Park, J.; Kim, N.; Choi, Y.; Kweon, I.S. Multispectral Pedestrian Detection: Benchmark Dataset and Baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1037–1045. [Google Scholar]
- Dollar, P.; Wojek, C.; Schiele, B.; Perona, P. Pedestrian Detection: An Evaluation of the State of the Art. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 743–761. [Google Scholar] [CrossRef]
- He, Y.; Zhang, X.; Sun, J. Channel Pruning for Accelerating Very Deep Neural Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1398–1406. [Google Scholar]
- Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning Efficient Convolutional Networks through Network Slimming. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2755–2763. [Google Scholar]
- Wu, J.; Leng, C.; Wang, Y.; Hu, Q.; Cheng, J. Quantized Convolutional Neural Networks for Mobile Devices. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4820–4828. [Google Scholar]
- Courbariaux, M.; Bengio, Y. BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv 2016, arXiv:1602.02830. [Google Scholar]
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
- Yim, J.; Joo, D.; Bae, J.; Kim, J. A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7130–7138. [Google Scholar]
Fusion Stage | Method | Modality | mAP (%)
---|---|---|---
Multi-Stage Fusion | CMDet [37] | visible + infrared | 62.58
Multi-Stage Fusion | UA-CMDet [37] | visible + infrared | 64.01
Multi-Stage Fusion | MBNet [28] | visible + infrared | 62.83
No Fusion | YOLOv5L [10] | visible | 57.02
No Fusion | YOLOv5L | infrared | 62.93
Input-Level Fusion | YOLOv5L + Addition | visible + infrared | 64.89
Input-Level Fusion | YOLOv5L + LAIIFusion (ours) | visible + infrared | 66.23
Method | Car | Truck | Bus | Van | Freight Car
---|---|---|---|---|---
UA-CMDet [37] | 87.51 | 60.70 | 87.08 | 37.95 | 19.04
YOLOv5L (visible) [10] | 86.83 | 55.50 | 83.14 | 34.42 | 25.34
YOLOv5L (infrared) | 95.26 | 51.03 | 89.39 | 27.15 | 51.83
YOLOv5L + Addition | 95.19 | 53.66 | 88.56 | 29.74 | 57.29
YOLOv5L + LAIIFusion (ours) | 94.45 | 54.38 | 90.46 | 33.89 | 57.91
Miss rate (%) on KAIST (lower is better), with detection platform and inference speed:

Fusion Stage | Method | All | Day | Night | Near | Medium | Far | Platform | FPS
---|---|---|---|---|---|---|---|---|---
Multi-Stage Fusion | IAF R-CNN [25] | 15.73 | 14.55 | 18.26 | 0.96 | 25.54 | 77.84 | - | -
Multi-Stage Fusion | CIAN [26] | 14.12 | 14.77 | 11.13 | 3.71 | 19.04 | 55.82 | GTX 1080 Ti | 14
Multi-Stage Fusion | AR-CNN [29] | 9.34 | 9.94 | 8.38 | 0 | 16.08 | 69.00 | GTX 1080 Ti | 8
Multi-Stage Fusion | MBNet [28] | 8.13 | 8.28 | 7.86 | 0 | 16.07 | 55.99 | RTX 2060 | 10
No Fusion | YOLOv5L (visible) [10] | 15.14 | 11.24 | 22.97 | 1.82 | 23.05 | 54.70 | RTX 2060 | 32
No Fusion | YOLOv5L (infrared) | 15.94 | 20.52 | 7.02 | 3.11 | 19.49 | 37.34 | RTX 2060 | 32
Input-Level Fusion | YOLOv5L + Addition | 13.72 | 11.90 | 17.13 | 0 | 22.58 | 63.36 | RTX 2060 | 32
Input-Level Fusion | YOLOv5L + DenseFuse [31] | 16.09 | 14.41 | 19.45 | 0 | 24.42 | 64.36 | RTX 2060 | 18
Input-Level Fusion | YOLOv5L + LAIIFusion (ours) | 10.44 | 12.22 | 6.96 | 1.66 | 16.09 | 40.95 | RTX 2060 | 31
N | mAP (%) | mAP@0.5:0.95 (%)
---|---|---
8 | 64.74 | 46.83 |
16 | 64.55 | 46.55 |
32 | 65.05 | 48.11 |
64 | 64.75 | 47.44 |
128 | 64.60 | 47.47 |
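To illustrate what N controls, the sketch below is a hedged illustration under the assumption that the pixel statistics take the form of an N-bin histogram over equal grayscale intervals (`interval_statistics` is an illustrative name, not the authors' code): the grayscale range is divided into N equal intervals and the fraction of a region's pixels falling in each interval forms the statistic. In the table above, N = 32 yields the best result.

```python
import torch

def interval_statistics(gray_region: torch.Tensor, n_intervals: int = 32) -> torch.Tensor:
    """gray_region: float tensor in [0, 1]; returns the fraction of pixels in each of N intervals."""
    # Split [0, 1] into n_intervals equal bins and count pixels per bin.
    hist = torch.histc(gray_region, bins=n_intervals, min=0.0, max=1.0)
    return hist / gray_region.numel()
```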
Miss rate (%) on KAIST (lower is better) when attaching LAIIFusion to different detectors:

Method | All | Day | Night | Near | Medium | Far
---|---|---|---|---|---|---
YOLOv3 (visible) [9] | 21.77 | 17.09 | 29.40 | 9.32 | 26.36 | 56.59
YOLOv3 + Addition | 18.89 | 21.85 | 13.21 | 0 | 26.92 | 62.55
YOLOv3 + LAIIFusion | 12.97 | 15.50 | 7.77 | 0 | 19.66 | 47.23
SSD (visible) [6] | 37.77 | 30.74 | 52.90 | 11.82 | 41.34 | 77.32
SSD + Addition | 35.51 | 31.10 | 44.15 | 4.74 | 46.53 | 82.80
SSD + LAIIFusion | 32.95 | 37.86 | 21.75 | 8.51 | 38.99 | 73.58
YOLOXm (visible) [11] | 21.31 | 14.88 | 33.97 | 1.55 | 28.72 | 64.46
YOLOXm + Addition | 22.95 | 20.96 | 27.01 | 0.03 | 34.82 | 72.72
YOLOXm + LAIIFusion | 17.57 | 21.91 | 9.12 | 1.97 | 26.67 | 58.46
Ablation results (DroneVehicle: mAP, %; KAIST columns: miss rate, %; LIP = local illumination perception module, OE = offset estimation module):

Method | DroneVehicle mAP (%) | All | Day | Night | Near | Medium | Far
---|---|---|---|---|---|---|---
YOLOv5L [10] (visible) | 57.02 | 15.14 | 11.24 | 22.97 | 1.82 | 23.05 | 54.70
YOLOv5L (infrared) | 62.93 | 15.94 | 20.52 | 7.02 | 3.11 | 19.49 | 37.34
YOLOv5L + AF | 64.72 | 12.23 | 14.41 | 7.75 | 0.03 | 19.48 | 53.88
YOLOv5L + AF + LIP | 63.71 | 11.75 | 14.14 | 7.33 | 0.02 | 17.51 | 50.66
YOLOv5L + AF + LIP + OE | 66.23 | 10.44 | 12.22 | 6.96 | 1.66 | 16.09 | 40.95
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).