MS3D: A Multi-Scale Feature Fusion 3D Object Detection Method for Autonomous Driving Applications
Abstract
1. Introduction
2. Related Works
3. Proposed Method
3.1. Network Architecture
3.2. Voxelization
3.3. SecondFPN
3.4. Loss Function
4. Experiments and Analysis
4.1. Datasets
4.2. Evaluation Metrics
4.3. Experimental Setup
4.4. Experiment Analysis
4.4.1. Accuracy Analysis
4.4.2. Representative Cases
- Easy Category: Minor Occlusion of Small Objects
- Moderate Category: Overlapping and Occluded Multi-Object
- Difficult Category: Complex Backgrounds and Intricate Trajectories
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Zhou, Y.; Tuzel, O. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499.
2. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337.
3. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12697–12705.
4. He, C.; Zeng, H.; Huang, J.; Hua, X.-S.; Zhang, L. Structure aware single-stage 3D object detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11873–11882.
5. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538.
6. Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel R-CNN: Towards high performance voxel-based 3D object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; pp. 1201–1209.
7. Zhou, Y.; Sun, P.; Zhang, Y.; Anguelov, D.; Gao, J.; Ouyang, T.; Guo, J.; Ngiam, J.; Vasudevan, V. End-to-end multi-view fusion for 3D object detection in LiDAR point clouds. In Proceedings of the Conference on Robot Learning, PMLR, Virtual, 16–18 November 2020; pp. 923–932.
8. Liu, Z.; Zhao, X.; Huang, T.; Hu, R.; Zhou, Y.; Bai, X. TANet: Robust 3D object detection from point clouds with triple attention. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11677–11684.
9. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
10. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631.
11. Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2446–2454.
12. Dong, Y.; Kang, C.; Zhang, J.; Zhu, Z.; Wang, Y.; Yang, X.; Su, H.; Wei, X.; Zhu, J. Benchmarking robustness of 3D object detection to common corruptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1022–1032.
13. Zheng, W.; Tang, W.; Chen, S.; Jiang, L.; Fu, C.-W. CIA-SSD: Confident IoU-aware single-stage object detector from point cloud. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; pp. 3555–3562.
14. Schuldt, C.; Laptev, I.; Caputo, B. Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), Cambridge, UK, 23–26 August 2004; IEEE: Piscataway, NJ, USA, 2004; pp. 32–36.
15. Paul, A.; Mukherjee, D.P.; Das, P.; Gangopadhyay, A.; Chintha, A.R.; Kundu, S. Improved random forest for classification. IEEE Trans. Image Process. 2018, 27, 4012–4024.
16. Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3DSSD: Point-based 3D single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11040–11048.
17. Zhang, Y.; Hu, Q.; Xu, G.; Ma, Y.; Wan, J.; Guo, Y. Not all points are equal: Learning highly efficient point-based detectors for 3D LiDAR point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18953–18962.
18. Wang, Y.; Han, X.; Wei, X.; Luo, J. Instance Segmentation Frustum–PointPillars: A Lightweight Fusion Algorithm for Camera–LiDAR Perception in Autonomous Driving. Mathematics 2024, 12, 153.
19. Dong, P.; Li, L.; Wei, Z. DisWOT: Student architecture search for distillation without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 11898–11908.
20. Chen, Y.; Liu, J.; Zhang, X.; Qi, X.; Jia, J. LargeKernel3D: Scaling up kernels in 3D sparse CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13488–13498.
21. Zhou, C.; Zhang, Y.; Chen, J.; Huang, D. OcTr: Octree-based transformer for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5166–5175.
22. Chen, Y.; Liu, J.; Zhang, X.; Qi, X.; Jia, J. VoxelNeXt: Fully sparse VoxelNet for 3D object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21674–21683.
23. Wu, H.; Deng, J.; Wen, C.; Li, X.; Wang, C.; Li, J. CasA: A cascade attention network for 3-D object detection from LiDAR point clouds. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5704511.
24. Xu, Q.; Zhong, Y.; Neumann, U. Behind the curtain: Learning occluded shapes for 3D object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 22 February–1 March 2022; pp. 2893–2901.
25. He, C.; Li, R.; Zhang, Y.; Li, S.; Zhang, L. MSF: Motion-guided sequential fusion for efficient 3D object detection from point cloud sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5196–5205.
26. Zhu, B.; Wang, Z.; Shi, S.; Xu, H.; Hong, L.; Li, H. ConQueR: Query contrast voxel-DETR for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9296–9305.
27. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915.
28. Huang, T.; Liu, Z.; Chen, X.; Bai, X. EPNet: Enhancing point features with image semantics for 3D object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XV 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 35–52.
29. Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. PointPainting: Sequential fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4604–4612.
30. Yin, T.; Zhou, X.; Krähenbühl, P. Multimodal virtual point 3D detection. In Proceedings of the 35th International Conference on Neural Information Processing Systems, Online, 6–14 December 2021; pp. 16494–16507.
31. Li, X.; Ma, T.; Hou, Y.; Shi, B.; Yang, Y.; Liu, Y.; Wu, X.; Chen, Q.; Li, Y.; Qiao, Y. LoGoNet: Towards accurate 3D object detection with local-to-global cross-modal fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17524–17534.
32. Wu, X.; Peng, L.; Yang, H.; Xie, L.; Huang, C.; Deng, C.; Liu, H.; Cai, D. Sparse fuse dense: Towards high quality 3D detection with depth completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5418–5427.
33. Wu, H.; Wen, C.; Shi, S.; Li, X.; Wang, C. Virtual sparse convolution for multimodal 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21653–21662.
34. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
35. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5105–5114.
36. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
37. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
38. Jiang, H.W.; Chen, M.Y.; Yuan, X.C. An algorithm for visual simultaneous localization and mapping with integrated hybrid attention instance segmentation. Laser Optoelectron. Prog. 2023, 60, 404–413.
39. Wang, T.; Hu, X.; Liu, Z.; Fu, C.-W. Sparse2Dense: Learning to densify 3D features for 3D object detection. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 38533–38545.
40. Wu, X.; Tian, Z.; Wen, X.; Peng, B.; Liu, X.; Yu, K.; Zhao, H. Towards large-scale 3D representation learning with multi-dataset point prompt training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19551–19562.
41. Liu, Y.; Kong, L.; Wu, X.; Chen, R.; Li, X.; Pan, L.; Liu, Z.; Ma, Y. Multi-Space Alignments Towards Universal LiDAR Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 14648–14661.
42. Jiao, Y.; Jie, Z.; Chen, S.; Chen, J.; Ma, L.; Jiang, Y.-G. MSMDFusion: Fusing LiDAR and camera at multiple scales with multi-depth seeds for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21643–21652.
43. Cui, C.; Ma, Y.; Cao, X.; Ye, W.; Zhou, Y.; Liang, K.; Chen, J.; Lu, J.; Yang, Z.; Liao, K.-D. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 958–979.
44. Sun, T.; Zhang, Z.; Tan, X.; Peng, Y.; Qu, Y.; Xie, Y. Uni-to-Multi Modal Knowledge Distillation for Bidirectional LiDAR-Camera Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 11059–11072.
45. Wang, Z.; Rao, Y.; Yu, X.; Zhou, J.; Lu, J. Point-to-pixel prompting for point cloud analysis with pre-trained image models. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 4381–4397.
Abbreviation | Full Name |
---|---|
3D-SSD [16] | Point-Based 3D Single-Stage Object Detector |
IA-SSD [17] | Instance-Aware Single-Stage Detector |
VoxelNet [1] | VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection |
SECOND [2] | Sparsely Embedded Convolutional Detection |
PointPillars [18] | PointPillars: Fast Encoders for Object Detection from Point Clouds |
CIA-SSD [13] | Confident IoU-Aware Single-Stage Object Detector |
SA-SSD [4] | Structure-Aware Single-Stage 3D Object Detector |
DisWOT [19] | Student Architecture Search for Distillation Without Training |
LargeKernel3D [20] | LargeKernel3D: Scaling Up Kernels in 3D Sparse CNNs |
OcTr [21] | Octree-Based Transformer for 3D Object Detection |
VoxelNeXt [22] | VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking |
CasA [23] | Cascade Attention Network for 3D Object Detection from LiDAR Point Clouds |
BtcDet [24] | Behind the Curtain: Learning Occluded Shapes for 3D Object Detection |
MSF [25] | Motion-Guided Sequential Fusion for Efficient 3D Object Detection from Point Cloud Sequences |
ConQueR [26] | Query Contrast Voxel-DETR for 3D Object Detection |
MV3D [27] | Multi-View 3D Object Detection Network for Autonomous Driving |
EPNet [28] | EPNet: Enhancing Point Features with Image Semantics for 3D Object Detection |
PointPainting [29] | PointPainting: Sequential Fusion for 3D Object Detection |
MVP [30] | Multimodal Virtual Point 3D Detection |
LoGoNet [31] | LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global Cross-Modal Fusion |
SFD [32] | Sparse Fuse Dense: Towards High Quality 3D Detection with Depth Completion |
Virtual Conv [33] | Virtual Sparse Convolution for Multimodal 3D Object Detection |
Methods | Frame Rate (Hz) | AP (BBOX) | AP (BEV) | AP (3D) | F1 Score (BBOX) | F1 Score (BEV) | F1 Score (3D) |
---|---|---|---|---|---|---|---|
VoxelNet | 4.4 | 0.69 | 0.82 | 0.81 | 0.47 | 0.49 | 0.50 |
SECOND | 20 | 0.78 | 0.89 | 0.87 | 0.51 | 0.53 | 0.55 |
PointPillars | 62 | 0.80 | 0.90 | 0.90 | 0.53 | 0.55 | 0.56 |
MS3D | 60 | 0.84 | 0.94 | 0.93 | 0.54 | 0.56 | 0.58 |

Methods | Frame Rate (Hz) | AP (BBOX) | AP (BEV) | AP (3D) | F1 Score (BBOX) | F1 Score (BEV) | F1 Score (3D) |
---|---|---|---|---|---|---|---|
VoxelNet | 4.4 | 0.67 | 0.76 | 0.75 | 0.46 | 0.48 | 0.51 |
SECOND | 20 | 0.76 | 0.85 | 0.84 | 0.50 | 0.53 | 0.54 |
PointPillars | 62 | 0.78 | 0.864 | 0.863 | 0.53 | 0.55 | 0.55 |
MS3D | 60 | 0.82 | 0.92 | 0.91 | 0.54 | 0.56 | 0.56 |

Methods | Frame Rate (Hz) | AP (BBOX) | AP (BEV) | AP (3D) | F1 Score (BBOX) | F1 Score (BEV) | F1 Score (3D) |
---|---|---|---|---|---|---|---|
VoxelNet | 4.4 | 0.63 | 0.74 | 0.73 | 0.45 | 0.47 | 0.49 |
SECOND | 20 | 0.74 | 0.78 | 0.80 | 0.49 | 0.52 | 0.53 |
PointPillars | 62 | 0.75 | 0.84 | 0.83 | 0.52 | 0.54 | 0.53 |
MS3D | 60 | 0.77 | 0.89 | 0.88 | 0.53 | 0.55 | 0.55 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Li, Y.; Zhuang, W.; Yang, G. MS3D: A Multi-Scale Feature Fusion 3D Object Detection Method for Autonomous Driving Applications. Appl. Sci. 2024, 14, 10667. https://doi.org/10.3390/app142210667