DeployFusion: A Deployable Monocular 3D Object Detection with Multi-Sensor Information Fusion in BEV for Edge Devices
Abstract
1. Introduction
- We propose a lightweight and efficient image feature extraction network, EdgeNeXt_DCN, which integrates residual branches to prevent degradation in deep networks. By employing deformable convolutions, it expands the receptive field while reducing computational load, achieving feature learning capability comparable to the Swin-Transformer (a code sketch follows this list).
- A two-stage fusion network aligns image features with point cloud features, supplementing the point cloud with environmental context from the images. This improves the fusion and alignment of features across modalities and yields more accurate BEV features.
- Compared with the baseline network, the nuScenes detection score (NDS) and mean average precision (mAP) improve by 4.5% and 5.5%, respectively. Deployed on a mobile edge device, the model reaches an inference latency of approximately 138 ms.
2. Related Work
2.1. 3D Object Detection via Multi-Sensor Information Fusion
2.2. Model Deployment
3. Methods
3.1. Design of the Overall Network
3.2. Improvement of Backbone Network
3.3. Deformable Convolution
3.4. Improved Multi-Scale Feature Fusion Network
3.5. Improved Feature Fusion Network
3.6. Improved Detection Head Network
4. Experiment and Result Analysis
4.1. Experiments Settings
4.2. Comparative Experiment of Feature Extraction Network
4.3. Comparative Experiment of Feature Fusion Network
4.4. Comparison Experiment of Method Performance
5. Model Deployment Application
5.1. Configuration of Mobile Computing Device
5.2. TensorRT Optimization
5.3. Model Testing
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Chen, W.; Li, Y.; Tian, Z.; Zhang, F. 2D and 3D object detection methods from images: A Survey. Array 2023, 19, 100305.
- Wang, Z.; Huang, Z.; Gao, Y.; Wang, N.; Liu, S. MV2DFusion: Leveraging Modality-Specific Object Semantics for Multi-Sensor 3D Detection. arXiv 2024, arXiv:2408.05945.
- Chambon, L.; Zablocki, E.; Chen, M.; Bartoccioni, F.; Pérez, P.; Cord, M. PointBeV: A Sparse Approach for BeV Predictions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15195–15204.
- Wang, Z.; Wu, Y.; Niu, Q. Multi-sensor fusion in automated driving: A survey. IEEE Access 2019, 8, 2847–2868.
- Xie, L.; Xiang, C.; Yu, Z.; Xu, G.; Yang, Z.; Cai, D.; He, X. PI-RCNN: An efficient multi-sensor 3D object detector with point-based attentive cont-conv fusion module. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12460–12467.
- Meyer, G.P.; Charland, J.; Hegde, D.; Laddha, A.; Vallespi-Gonzalez, C. Sensor fusion for joint 3d object detection and semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 15–20 June 2019.
- Wen, L.H.; Jo, K.H. Fast and accurate 3D object detection for lidar-camera-based autonomous vehicles using one shared voxel-based backbone. IEEE Access 2021, 9, 22080–22089.
- Wang, J.; Zhu, M.; Wang, B.; Sun, D.; Wei, H.; Liu, C.; Nie, H. Kda3d: Key-point densification and multi-attention guidance for 3d object detection. Remote Sens. 2020, 12, 1895.
- Pang, S.; Morris, D.; Radha, H. CLOCs: Camera-LiDAR object candidates fusion for 3D object detection. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, Las Vegas, NV, USA, 25–29 October 2020; pp. 10386–10393.
- Gu, S.; Zhang, Y.; Tang, J.; Yang, J.; Alvarez, J.M.; Kong, H. Integrating dense lidar-camera road detection maps by a multi-sensor CRF model. IEEE Trans. Veh. Technol. 2019, 68, 11635–11645.
- Gu, S.; Zhang, Y.; Tang, J.; Yang, J.; Kong, H. Road detection through CRF based lidar-camera fusion. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), IEEE, Montreal, QC, Canada, 20–24 May 2019; pp. 3832–3838.
- Braun, M.; Rao, Q.; Wang, Y.; Flohr, F. Pose-RCNN: Joint object detection and pose estimation using 3d object proposals. In Proceedings of the IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), IEEE, Rio de Janeiro, Brazil, 1–4 November 2016; pp. 1546–1551.
- Pandey, G. An Information Theoretic Framework for Camera and Lidar Sensor Data Fusion and its Applications in Autonomous Navigation of Vehicles. Ph.D. Thesis, University of Michigan, Ann Arbor, MI, USA, 2014.
- Farsiu, S. A Fast and Robust Framework for Image Fusion and Enhancement; University of California: Santa Cruz, CA, USA, 2005.
- Huang, J.; Huang, G.; Zhu, Z.; Ye, Y.; Du, D. BEVDet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv 2021, arXiv:2112.11790.
- Huang, J.; Huang, G. BEVDet4D: Exploit temporal cues in multi-camera 3d object detection. arXiv 2022, arXiv:2203.17054.
- Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.; Han, S. BEVFusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), IEEE, London, UK, 29 May–2 June 2023; pp. 2774–2781.
- Cai, H.; Zhang, Z.; Zhou, Z.; Li, Z.; Ding, W.; Zhao, J. BEVFusion4D: Learning LiDAR-Camera Fusion Under Bird’s-Eye-View via Cross-Modality Guidance and Temporal Aggregation. arXiv 2023, arXiv:2303.17099.
- Xu, H.; Guo, M.; Nedjah, N.; Zhang, J.; Li, P. Vehicle and pedestrian detection method based on lightweight YOLOv3-promote and semi-precision acceleration. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19760–19771.
- Dai, B.; Li, C.; Lin, T.; Wang, Y.; Gong, D.; Ji, X.; Zhu, B. Field robot environment sensing technology based on TensorRT. In Proceedings of the Intelligent Robotics and Applications: 14th International Conference, ICIRA 2021, Yantai, China, 22–25 October 2021; Proceedings, Part I 14. Springer International Publishing: New York, NY, USA, 2021; pp. 370–377.
- Tang, Y.; Qian, Y. High-speed railway track components inspection framework based on YOLOv8 with high-performance model deployment. High-Speed Railw. 2024, 2, 42–50.
- Hang, J.; Song, Q.; Yu, J. A Transformer Based Complex-YOLOv4-Trans for 3D Point Cloud Object Detection on Embedded Device. J. Phys. Conf. Ser. 2022, 2404, 012026.
- Maaz, M.; Shaker, A.; Cholakkal, H.; Khan, S.; Zamir, S.W.; Anwer, R.M.; Khan, F.S. EdgeNeXt: Efficiently amalgamated cnn-transformer architecture for mobile vision applications. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature Switzerland: Cham, Switzerland, 2022; pp. 3–20.
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
- Luo, Y.; Cao, X.; Zhang, J.; Cao, X.; Guo, J.; Shen, H.; Wang, T.; Feng, Q. CE-FPN: Enhancing channel information for object detection. Multimed. Tools Appl. 2022, 81, 30685–30704.
| Feature Extraction Network | Fusion Method | mAP↑ | ATE↓ | ASE↓ | AOE↓ | AVE↓ | AAE↓ | NDS↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-Transformer | Conv | 0.592 | 0.320 | 0.277 | 0.564 | 0.412 | 0.194 | 0.619 |
| Swin-Transformer | Add | 0.624 | 0.309 | 0.262 | 0.424 | 0.347 | 0.192 | 0.659 |
| Swin-Transformer | Cross-Attention | 0.581 | 0.309 | 0.263 | 0.434 | 0.356 | 0.193 | 0.635 |
| Swin-Transformer | Feature Fusion Network | 0.640 | 0.293 | 0.258 | 0.417 | 0.294 | 0.194 | 0.674 |
| EdgeNeXt_DCN | Conv | 0.620 | 0.307 | 0.264 | 0.426 | 0.344 | 0.189 | 0.657 |
| EdgeNeXt_DCN | Add | 0.625 | 0.297 | 0.261 | 0.413 | 0.357 | 0.201 | 0.660 |
| EdgeNeXt_DCN | Cross-Attention | 0.584 | 0.311 | 0.261 | 0.427 | 0.351 | 0.194 | 0.638 |
| EdgeNeXt_DCN | Feature Fusion Network | 0.637 | 0.293 | 0.256 | 0.418 | 0.289 | 0.191 | 0.674 |
| ResNet50 | Conv | 0.576 | 0.349 | 0.296 | 0.634 | 0.976 | 0.245 | 0.538 |
| ResNet50 | Feature Fusion Network | 0.600 | 0.315 | 0.268 | 0.468 | 0.438 | 0.188 | 0.632 |
| Parameter | Value | Unit |
| --- | --- | --- |
| GPU Architecture | Ampere | |
| AI Performance | 100 | TOPS |
| GPU Maximum Frequency | 918 | MHz |
| CUDA Cores | 1024 | |
| Tensor Cores | 32 | |
| Memory | 16 | GB |
| Memory Bandwidth | 102.4 | GB/s |
| CPU | Arm® Cortex-A78AE | |
| CPU Cores | 8 | |
| CPU Maximum Frequency | 2 | GHz |
| DL Accelerators | 2 | |
| DLA Maximum Frequency | 614 | MHz |
Citation: Huang, F.; Liu, S.; Zhang, G.; Hao, B.; Xiang, Y.; Yuan, K. DeployFusion: A Deployable Monocular 3D Object Detection with Multi-Sensor Information Fusion in BEV for Edge Devices. Sensors 2024, 24, 7007. https://doi.org/10.3390/s24217007