PLC-Fusion: Perspective-Based Hierarchical and Deep LiDAR Camera Fusion for 3D Object Detection in Autonomous Vehicles
Abstract
1. Introduction
- We propose PLC-Fusion, an efficient perspective-aware, hierarchical, end-to-end fusion framework built on multimodal vision transformers. It improves robustness and adaptability, yielding superior detection performance, particularly for pedestrians and cyclists in complex traffic environments.
- The Object Perspective Sampling (OPS) module improves feature sampling by aligning LiDAR and image data with ground-truth targets. It incorporates a lightweight perspective detector that combines 2D and monocular 3D sub-networks to generate refined object perspective proposals, markedly improving object recognition in complex scenarios (a sampling sketch is given after this list).
- Our multimodal cross-fusion approach uses CamViT and LidViT to independently learn embedding representations from camera images and LiDAR point clouds, respectively. Their outputs are combined by the Cross-Fusion module for hierarchical and deep representation learning, improving both detection performance and computational efficiency (see the fusion sketch after this list).
- Experiments on the KITTI dataset, covering diverse urban traffic settings, validate the robustness and effectiveness of our method.
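The following is a minimal, hypothetical sketch of how perspective-based object sampling of this kind can be implemented; it is not the authors' released code. 2D detections from the lightweight perspective detector are treated as camera-space frustums, and only LiDAR points whose projections fall inside an (enlarged) detection box are retained for the downstream branches. Names such as `sample_points_in_frustums` and the `margin` parameter are illustrative assumptions.

```python
import numpy as np

def project_lidar_to_image(points_lidar, lidar_to_cam, cam_intrinsics):
    """Project N x 3 LiDAR points to pixel coordinates (u, v) and camera-frame depth z."""
    pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])  # N x 4 homogeneous
    pts_cam = (lidar_to_cam @ pts_h.T).T[:, :3]                             # N x 3 in camera frame
    z = pts_cam[:, 2]
    uv = (cam_intrinsics @ pts_cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)                        # perspective division
    return uv, z

def sample_points_in_frustums(points_lidar, boxes_2d, lidar_to_cam, cam_intrinsics,
                              margin=0.1, min_depth=0.5, max_depth=80.0):
    """Return a boolean mask keeping LiDAR points that project into any 2D detection box.

    boxes_2d: M x 4 array of [x1, y1, x2, y2] boxes from the perspective detector.
    """
    uv, z = project_lidar_to_image(points_lidar, lidar_to_cam, cam_intrinsics)
    keep = np.zeros(points_lidar.shape[0], dtype=bool)
    in_front = (z > min_depth) & (z < max_depth)
    for x1, y1, x2, y2 in boxes_2d:
        w, h = x2 - x1, y2 - y1
        x1e, x2e = x1 - margin * w, x2 + margin * w      # enlarge the box to tolerate
        y1e, y2e = y1 - margin * h, y2 + margin * h      # calibration and detection error
        inside = (uv[:, 0] >= x1e) & (uv[:, 0] <= x2e) & (uv[:, 1] >= y1e) & (uv[:, 1] <= y2e)
        keep |= inside & in_front
    return keep
```

In a full pipeline, the retained points and the corresponding image crops would feed the LiDAR and camera ViT branches, respectively, so that both modalities concentrate on the same object-centric regions.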
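Below is a hedged PyTorch sketch of a dual-branch ViT with cross-attention fusion in the spirit described above. It is an illustrative reconstruction, not the paper's exact model: the class names CamViT/LidViT/CrossFusion follow the paper's terminology, but layer counts, embedding sizes, and the merge strategy are generic assumptions.

```python
import torch
import torch.nn as nn

class ViTBranch(nn.Module):
    """Generic transformer encoder over a sequence of modality tokens (used for CamViT / LidViT)."""
    def __init__(self, in_dim, embed_dim=256, depth=4, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)                   # per-token embedding
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):                                     # tokens: (B, N, in_dim)
        return self.encoder(self.proj(tokens))                     # (B, N, embed_dim)

class CrossFusion(nn.Module):
    """Cross-attention fusion: each modality attends to the other, then the streams are merged."""
    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        self.lid_from_cam = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cam_from_lid = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm_l = nn.LayerNorm(embed_dim)
        self.norm_c = nn.LayerNorm(embed_dim)
        self.merge = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, lid_feats, cam_feats):                       # (B, Nl, D), (B, Nc, D)
        l, _ = self.lid_from_cam(lid_feats, cam_feats, cam_feats)  # LiDAR tokens query image tokens
        c, _ = self.cam_from_lid(cam_feats, lid_feats, lid_feats)  # image tokens query LiDAR tokens
        l = self.norm_l(lid_feats + l)                             # residual + norm
        c = self.norm_c(cam_feats + c)
        # pool the camera stream and broadcast it onto every LiDAR token before merging
        cam_ctx = c.mean(dim=1, keepdim=True).expand_as(l)
        return self.merge(torch.cat([l, cam_ctx], dim=-1))         # (B, Nl, D) fused features

# Hypothetical usage with dummy token sequences
cam_vit, lid_vit = ViTBranch(in_dim=64), ViTBranch(in_dim=32)
fusion = CrossFusion()
cam_tokens = torch.randn(2, 196, 64)   # e.g., image patch features from the camera backbone
lid_tokens = torch.randn(2, 512, 32)   # e.g., sampled point/voxel features from the LiDAR backbone
fused = fusion(lid_vit(lid_tokens), cam_vit(cam_tokens))           # (2, 512, 256)
```

A detection head operating on the fused LiDAR-centric tokens would then regress 3D boxes and classes; keeping the two branches independent until this stage is what allows hierarchical, modality-specific representation learning before deep fusion.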
2. Related Work
2.1. Camera-Based Object Detection
2.2. LiDAR-Based Object Detection
2.3. Point-Based Fusion Methods
2.4. ViT-Based Methods
3. Methodology
3.1. Overview
3.2. Multimodal Feature Extraction
3.3. Object Perspective Sampling
3.4. Camera ViT Branch
3.5. LiDAR ViT Branch
3.6. ViT-Based Cross-Fusion
3.7. Detection Head
3.8. Loss Function
4. Experiment and Results
4.1. Dataset and Evaluation Metrics
4.2. Training Details
4.3. Qualitative Results
4.4. PLC-Fusion Efficiency
4.5. Ablation Studies
4.5.1. Effect of the Feature Fusion Approach
4.5.2. Effect of Image and LiDAR Backbones
4.5.3. Effect of PLC-Fusion Components on Runtime
4.5.4. Distance Analysis
4.6. Analysis and Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Yu, Z.; Wan, W.; Ren, M.; Zheng, X.; Fang, Z. Sparsefusion3d: Sparse sensor fusion for 3d object detection by radar and camera in environmental perception. IEEE Trans. Intell. Veh. 2023, 9, 1524–1536. [Google Scholar] [CrossRef]
- Zhu, H.; Deng, J.; Zhang, Y.; Ji, J.; Mao, Q.; Li, H.; Zhang, Y. Vpfnet: Improving 3d object detection with virtual point based lidar and stereo data fusion. IEEE Trans. Multimed. 2022, 25, 5291–5304. [Google Scholar] [CrossRef]
- Uzair, M.; Dong, J.; Shi, R.; Mushtaq, H.; Ullah, I. Channel-wise and spatially-guided Multimodal feature fusion network for 3D Object Detection in Autonomous Vehicles. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5707515. [Google Scholar] [CrossRef]
- Nie, C.; Ju, Z.; Sun, Z.; Zhang, H. 3D object detection and tracking based on lidar-camera fusion and IMM-UKF algorithm towards highway driving. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 7, 1242–1252. [Google Scholar] [CrossRef]
- Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef]
- Chen, Q.; Li, P.; Xu, M.; Qi, X. Sparse Activation Maps for Interpreting 3D Object Detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 19–25 June 2021; pp. 76–84. [Google Scholar] [CrossRef]
- Chen, Y.; Li, Y.; Zhang, X.; Sun, J.; Jia, J. Focal Sparse Convolutional Networks for 3D Object Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5418–5427. [Google Scholar] [CrossRef]
- Shi, S.; Wang, X.; Li, H. PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; Volume 2019. [Google Scholar] [CrossRef]
- Mushtaq, H.; Deng, X.; Ullah, I.; Ali, M.; Malik, B.H. O2SAT: Object-Oriented-Segmentation-Guided Spatial-Attention Network for 3D Object Detection in Autonomous Vehicles. Information 2024, 15, 376. [Google Scholar] [CrossRef]
- Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. Pointpainting: Sequential fusion for 3D object detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
- Xie, L.; Xiang, C.; Yu, Z.; Xu, G.; Yang, Z.; Cai, D.; He, X. PI-RCNN: An efficient multi-sensor 3D object detector with point-based attentive cont-conv fusion module. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12460–12467. [Google Scholar]
- Wang, H.; Tang, H.; Shi, S.; Li, A.; Li, Z.; Schiele, B.; Wang, L. Unitr: A unified and efficient multi-modal transformer for bird’s-eye-view representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6792–6802. [Google Scholar]
- Yan, J.; Liu, Y.; Sun, J.; Jia, F.; Li, S.; Wang, T.; Zhang, X. Cross modal transformer: Towards fast and robust 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 18268–18278. [Google Scholar]
- Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.L. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1090–1099. [Google Scholar]
- Chitta, K.; Prakash, A.; Jaeger, B.; Yu, Z.; Renz, K.; Geiger, A. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 12878–12895. [Google Scholar] [CrossRef]
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D Proposal Generation and Object Detection from View Aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018. [Google Scholar] [CrossRef]
- Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; Volume 2017. [Google Scholar] [CrossRef]
- Li, Y.; Yu, A.W.; Meng, T.; Caine, B.; Ngiam, J.; Peng, D.; Shen, J.; Lu, Y.; Zhou, D.; Le, Q.V.; et al. DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; Volume 2022. [Google Scholar] [CrossRef]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? the KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012. [Google Scholar] [CrossRef]
- Wang, Y.; Chao, W.L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8445–8453. [Google Scholar]
- Weng, X.; Kitani, K. Monocular 3d object detection with pseudo-lidar point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
- You, Y.; Wang, Y.; Chao, W.L.; Garg, D.; Pleiss, G.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving. arXiv 2019, arXiv:1906.06310. [Google Scholar]
- Rukhovich, D.; Vorontsova, A.; Konushin, A. ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022. [Google Scholar] [CrossRef]
- Ding, M.; Huo, Y.; Yi, H.; Wang, Z.; Shi, J.; Lu, Z.; Luo, P. Learning depth-guided convolutions for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 1000–1001. [Google Scholar]
- Park, D.; Ambruş, R.; Guizilini, V.; Li, J.; Gaidon, A. Is Pseudo-Lidar needed for Monocular 3D Object detection? In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar] [CrossRef]
- Landrieu, L.; Simonovsky, M. Large-Scale Point Cloud Semantic Segmentation with Superpoint Graphs. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
- Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
- Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; Volume 2019. [Google Scholar] [CrossRef]
- Pan, X.; Xia, Z.; Song, S.; Li, L.E.; Huang, G. 3D Object Detection with Pointformer. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar] [CrossRef]
- He, Q.; Wang, Z.; Zeng, H.; Zeng, Y.; Liu, Y. Svga-net: Sparse voxel-graph attention network for 3d object detection from point clouds. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 870–878. [Google Scholar]
- An, P.; Liang, J.; Yu, K.; Fang, B.; Ma, J. Deep structural information fusion for 3D object detection on LiDAR–camera system. Comput. Vis. Image Underst. 2022, 214, 103295. [Google Scholar] [CrossRef]
- Yoo, J.H.; Kim, Y.; Kim, J.; Choi, J.W. 3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 720–736. [Google Scholar]
- Huang, T.; Liu, Z.; Chen, X.; Bai, X. Epnet: Enhancing point features with image semantics for 3d object detection. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XV 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 35–52. [Google Scholar]
- Chen, M.; Liu, P.; Zhao, H. LiDAR-camera fusion: Dual transformer enhancement for 3D object detection. Eng. Appl. Artif. Intell. 2023, 120, 105815. [Google Scholar] [CrossRef]
- Hu, C.; Zheng, H.; Li, K.; Xu, J.; Mao, W.; Luo, M.; Wang, L.; Chen, M.; Liu, K.; Zhao, Y.; et al. FusionFormer: A multi-sensory fusion in bird’s-eye-view and temporal consistent transformer for 3D object detection. arXiv 2023, arXiv:2309.05257. [Google Scholar]
- Huang, J.; Ye, Y.; Liang, Z.; Shan, Y.; Du, D. Detecting as labeling: Rethinking LiDAR-camera fusion in 3D object detection. arXiv 2023, arXiv:2311.07152. [Google Scholar]
- Cai, H.; Zhang, Z.; Zhou, Z.; Li, Z.; Ding, W.; Zhao, J. BEVFusion4D: Learning LiDAR-camera fusion under bird’s-eye-view via cross-modality guidance and temporal aggregation. arXiv 2023, arXiv:2303.17099. [Google Scholar]
- Khamsehashari, R.; Schill, K. Improving deep multi-modal 3D object detection for autonomous driving. In Proceedings of the 2021 7th International Conference on Automation, Robotics and Applications (ICARA), Auckland, New Zealand, 9–11 February 2021; pp. 263–267. [Google Scholar]
- Chen, Z.; Li, Z.; Zhang, S.; Fang, L.; Jiang, Q.; Zhao, F. Deformable feature aggregation for dynamic multi-modal 3D object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 628–644. [Google Scholar]
- Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. BEVFusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 2774–2781. [Google Scholar]
- Liu, X.; Zhang, B.; Liu, N. The Graph Neural Network Detector Based on Neighbor Feature Alignment Mechanism in LIDAR Point Clouds. Machines 2023, 11, 116. [Google Scholar] [CrossRef]
- Sindagi, V.A.; Zhou, Y.; Tuzel, O. MVX-net: Multimodal VoxelNet for 3D object detection. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; Volume 2019. [Google Scholar] [CrossRef]
- Chen, W.; Li, P.; Zhao, H. MSL3D: 3D object detection from monocular, stereo and point cloud for autonomous driving. Neurocomputing 2022, 494, 23–32. [Google Scholar] [CrossRef]
- Zhu, M.; Ma, C.; Ji, P.; Yang, X. Cross-modality 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Online, 5–9 January 2021; pp. 3772–3781. [Google Scholar]
- Wei, Z.; Zhang, F.; Chang, S.; Liu, Y.; Wu, H.; Feng, Z. MmWave Radar and Vision Fusion for Object Detection in Autonomous Driving: A Review. Sensors 2022, 22, 2542. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 2017. [Google Scholar]
- Xiang, P.; Wen, X.; Liu, Y.S.; Cao, Y.P.; Wan, P.; Zheng, W.; Han, Z. SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar] [CrossRef]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929v2. [Google Scholar]
- Hua, B.S.; Tran, M.K.; Yeung, S.K. Pointwise Convolutional Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
- Mushtaq, H.; Deng, X.; Ali, M.; Hayat, B.; Raza Sherazi, H.H. DFA-SAT: Dynamic Feature Abstraction with Self-Attention-Based 3D Object Detection for Autonomous Driving. Sustainability 2023, 15, 3667. [Google Scholar] [CrossRef]
- She, R.; Kang, Q.; Wang, S.; Tay, W.P.; Zhao, K.; Song, Y.; Geng, T.; Xu, Y.; Navarro, D.N.; Hartmannsgruber, A. PointDifformer: Robust Point Cloud Registration With Neural Diffusion and Transformer. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
- Lu, D.; Gao, K.; Xie, Q.; Xu, L.; Li, J. 3DGTN: 3-D Dual-Attention GLocal Transformer Network for Point Cloud Classification and Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
- Fei, J.; Chen, W.; Heidenreich, P.; Wirges, S.; Stiller, C. SemanticVoxels: Sequential Fusion for 3D Pedestrian Detection using LiDAR Point Cloud and Semantic Segmentation. In Proceedings of the 2020 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), Karlsruhe, Germany, 14–16 September 2020; Volume 2020. [Google Scholar] [CrossRef]
- Mahmoud, A.; Waslander, S.L. Sequential Fusion via Bounding Box and Motion PointPainting for 3D Object Detection. In Proceedings of the 2021 18th Conference on Robots and Vision (CRV), Burnaby, BC, Canada, 26–28 May 2021. [Google Scholar] [CrossRef]
- Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
- OpenPCDet Development Team. OpenPCDet: An Open-Source Toolbox for 3D Object Detection from Point Clouds. 2020. Available online: https://github.com/open-mmlab/OpenPCDet (accessed on 1 October 2024).
- Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
- Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 641–656. [Google Scholar]
Method | Mod. | Time (s) | 3D Easy | 3D Mod. | 3D Hard | 3D mAP | BEV Easy | BEV Mod. | BEV Hard | BEV mAP
---|---|---|---|---|---|---|---|---|---|---
SECOND [5] | L | 0.05 | 83.13 | 73.66 | 66.20 | 74.33 | 79.37 | 77.95 | 79.37 | 78.90 |
VoxelNet [27] | L | 0.23 | 77.47 | 65.11 | 57.73 | 66.77 | 89.35 | 79.26 | 77.39 | 82.00 |
PointPillar [28] | L | 0.02 | 82.58 | 74.31 | 68.99 | 75.29 | 86.56 | 82.81 | 86.56 | 85.31 |
Pointformer [29] | L | - | 87.13 | 77.06 | 69.25 | 81.48 | - | - | - | - |
SVGA-Net [30] | L | 0.03 | 87.33 | 80.47 | 75.91 | 81.24 | 92.07 | 89.88 | 85.59 | 89.18 |
PointRCNN [8] | L | 0.10 | 86.96 | 75.64 | 70.70 | 77.77 | 87.39 | 82.72 | 87.39 | 85.83 |
MV3D [17] | C + L | 0.36 | 74.97 | 63.63 | 54.00 | 64.20 | 86.62 | 78.93 | 69.80 | 78.45 |
AVOD-FPN [16] | L + C | 0.10 | 83.07 | 71.76 | 65.73 | 73.52 | 90.99 | 84.82 | 79.62 | 85.14 |
F-PointNet [59] | L + C | 0.17 | 82.19 | 69.79 | 60.59 | 70.86 | 91.17 | 84.67 | 74.77 | 83.54 |
3D-CVF [32] | L + C | 0.16 | 89.20 | 80.05 | 73.11 | 80.79 | 93.52 | 89.56 | 82.45 | 88.51 |
ContFuse [60] | C + L | 0.16 | 83.68 | 68.78 | 61.67 | 71.38 | 94.07 | 85.35 | 75.88 | 85.10 |
CM3D [44] | C + L | - | 87.22 | 77.28 | 72.04 | 78.85 | - | - | - | - |
MSL3D [43] | C + L | 0.24 | 87.27 | 81.15 | 76.56 | 81.66 | - | - | - | - |
EPNet [33] | C + L | 0.10 | 89.81 | 79.28 | 74.59 | 81.23 | 94.22 | 88.47 | 83.69 | 88.79 |
PointPainting [10] | C + L | 0.40 | 82.11 | 71.70 | 67.08 | 73.63 | 92.45 | 88.11 | 83.36 | 87.97 |
PI-RCNN [11] | C + L | 0.10 | 84.37 | 74.82 | 70.03 | 76.41 | 91.44 | 85.81 | 81.00 | 86.08 |
DFIM [34] | C + L | 0.19 | 88.36 | 81.37 | 76.71 | 82.15 | 92.61 | 88.69 | 85.77 | 89.02 |
StructuralIF [31] | C + L | 0.12 | 87.15 | 80.69 | 76.26 | 81.37 | 91.78 | 88.38 | 85.67 | 88.61 |
PLC-Fusion | C + L | 0.18 | 89.69 | 82.73 | 77.82 | 83.52 | 93.75 | 89.87 | 86.42 | 90.34 |
Method | Mod. | 3D Easy | 3D Mod. | 3D Hard | 3D mAP | BEV Easy | BEV Mod. | BEV Hard | BEV mAP
---|---|---|---|---|---|---|---|---|---
SECOND [5] | L | 88.61 | 78.62 | 77.22 | 81.48 | 89.96 | 87.07 | 79.66 | 85.56 |
VoxelNet [27] | L | 81.97 | 65.46 | 62.85 | 70.09 | 89.60 | 84.81 | 78.57 | 84.33 |
PointRCNN [8] | L | 88.72 | 78.61 | 77.82 | 81.72 | – | – | – | – |
PointPillar [28] | L | 86.46 | 77.28 | 74.65 | 79.46 | – | – | – | – |
SVGA-Net [30] | L | 90.59 | 80.23 | 79.15 | 83.32 | 90.27 | 89.16 | 88.11 | 89.18 |
F-PointNet [59] | C + L | 83.76 | 70.92 | 63.65 | 72.78 | 88.16 | 84.02 | 76.44 | 82.87 |
MV3D [17] | C + L | 71.29 | 62.68 | 56.56 | 63.51 | 86.55 | 78.10 | 76.67 | 80.44 |
3D-CVF [32] | C + L | 89.67 | 79.88 | 78.47 | 82.67 | – | – | – | – |
AVOD-FPN [16] | C + L | 84.41 | 74.44 | 68.65 | 75.83 | – | – | – | – |
EPNet [33] | C + L | 92.28 | 82.59 | 80.14 | 85.00 | 95.51 | 88.76 | 88.36 | 90.88 |
CM3D [44] | C + L | 91.08 | 83.19 | 77.12 | 83.80 | – | – | – | – |
StructuralIF [31] | C + L | 92.78 | 85.38 | 83.45 | 87.20 | – | – | – | – |
MSL3D [43] | C + L | 91.78 | 84.71 | 82.20 | 86.23 | 94.35 | 90.68 | 88.40 | 91.13 |
DFIM [34] | C + L | 92.15 | 86.04 | 84.26 | 87.48 | 95.09 | 90.72 | 88.85 | 91.55 |
PLC-Fusion | C + L | 93.08 | 87.05 | 85.10 | 88.17 | 96.06 | 91.27 | 89.58 | 92.44 |
Method | Mod. | Ped. 3D Easy | Ped. 3D Mod. | Ped. 3D Hard | Cyc. 3D Easy | Cyc. 3D Mod. | Cyc. 3D Hard
---|---|---|---|---|---|---|---
AVOD-FPN [16] | C + L | 50.73 | 42.54 | 39.31 | 64.03 | 50.82 | 45.20 |
StructuralIF | C + L | 50.80 | 42.42 | 38.35 | 72.54 | 56.39 | 49.28 |
PLC-Fusion | C + L | 51.19 | 43.13 | 39.73 | 77.25 | 60.50 | 54.06 |
Method | Mem. | Paral. | Speed⊥ | Speed‖ | Input Scale |
---|---|---|---|---|---|
PointRCNN | 324 MB | 29 | 47 | 59 | 2∼9 k |
PointPillars | 561 MB | 19 | 11 | 15 | 16,384 |
F-PointNet | 1223 MB | 18 | 38 | 10 | 11∼17 k |
SVGA-Net | 802 MB | 19 | 11 | 28 | 16,384 |
PLC-Fusion | 751 MB | 21 | 23 | 48 | 16,348 |
Method | 3D Easy | 3D Mod. | 3D Hard | BEV Easy | BEV Mod. | BEV Hard
---|---|---|---|---|---|---
Baseline | 90.18 | 81.44 | 81.02 | 91.35 | 84.37 | 83.46 |
PLC (Early fusion) | 91.40 | 83.19 | 82.43 | 93.48 | 86.35 | 85.24 |
PLC (Concatenation) | 92.14 | 85.36 | 83.53 | 94.92 | 89.18 | 88.47 |
PLC (Summation) | 92.29 | 85.47 | 83.68 | 95.09 | 89.34 | 88.65 |
PLC (Multiplication) | 93.08 | 87.05 | 85.10 | 96.06 | 91.27 | 89.58 |
Modality | Image Backbone | LiDAR Backbone | CamViT | LidViT | Fusion | mAP 3D | mAP BEV
---|---|---|---|---|---|---|---
C + L | V2-99 | VoxelNet | | | | 86.35 | 90.29
C + L | V2-99 | VoxelNet | ✓ | | ✓ | 87.09 | 90.73
C + L | V2-99 | VoxelNet | | ✓ | ✓ | 87.42 | 91.57
C + L | V2-99 | VoxelNet | ✓ | ✓ | ✓ | 88.17 | 92.44
OPS | CamViT | LidViT | Fusion | Time | Mem. | RT | 3D (%) | BEV (%)
---|---|---|---|---|---|---|---|---
✓ | | | | 28.0 | 19,500 | 9.5 | 86.95 | 90.82
✓ | ✓ | | | 23.5 | 12,550 | 11.0 | 87.29 | 91.07
✓ | | ✓ | | 25.0 | 12,700 | 10.6 | 87.51 | 91.48
✓ | ✓ | ✓ | | 16.0 | 11,900 | 12.0 | 87.96 | 91.89
✓ | ✓ | ✓ | ✓ | 13.5 | 4500 | 20.8 | 88.17 | 92.44