3D Object Detection under Urban Road Traffic Scenarios Based on Dual-Layer Voxel Features Fusion Augmentation
Abstract
1. Introduction
- (1) During point cloud voxelization, fixed grid settings can cause the loss of local fine-grained features. We employ the Mahalanobis distance to link boundary-point information to each voxel, yielding voxel feature mappings that better align with local object information.
- (2) We construct neighborhood-voxel and global-voxel modules (N-GV) on top of the voxelization network, and improve the attention Gaussian deviation matrix (GDM) to compute relative position encodings for the corresponding voxel features.
- (3) In the fusion stage of point cloud and image features, we design a new set of learnable weight parameters (LWP) that expand and enhance the feature information of key points in the attention fusion module.
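The Mahalanobis distance used in contribution (1) to associate boundary points with a voxel's interior point distribution can be sketched as follows. This is a generic illustration of the distance itself; the array shapes and the linking threshold are assumptions, not the authors' implementation:

```python
import numpy as np

def mahalanobis_distance(points, voxel_points):
    """Mahalanobis distance of candidate boundary points to the
    distribution of points already inside a voxel.
    points: (N, 3) candidate boundary points
    voxel_points: (M, 3) points inside the voxel
    """
    mu = voxel_points.mean(axis=0)
    cov = np.cov(voxel_points, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse guards against singular covariance
    diff = points - mu
    # d_M(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu)), computed row-wise
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

# Hypothetical usage: link boundary points lying within a chosen
# Mahalanobis radius of the voxel's point distribution.
rng = np.random.default_rng(0)
voxel_pts = rng.normal(size=(50, 3))
boundary_pts = rng.normal(size=(10, 3))
d = mahalanobis_distance(boundary_pts, voxel_pts)
linked = boundary_pts[d < 2.0]  # threshold is illustrative
```

Unlike a Euclidean radius, this distance adapts to the anisotropic spread of points in the voxel, which is what makes it suitable for deciding whether a boundary point belongs to the local object surface.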
2. Related Works
2.1. LiDAR-Based 3D Object Detection
2.2. Multi-Modal-Based 3D Object Detection
3. Methods
3.1. Voxel Feature Fusion Module
3.2. LiDAR and Image Feature Fusion Module
3.2.1. Calculating the Correlation Values
3.2.2. Calculating the Attention Scores
3.2.3. Calculating the Final Output Total Feature
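The three sub-steps above (correlation values, attention scores, final output) follow the shape of standard scaled dot-product cross-attention. A minimal sketch, assuming LiDAR keypoint features act as queries and image features as keys/values, with scalar weights `w_p` and `w_i` standing in for the learnable weight parameters; all names and shapes are illustrative, not the paper's exact module:

```python
import numpy as np

def attention_fusion(point_feats, image_feats, w_p, w_i):
    """Generic scaled dot-product cross-attention fusion.
    point_feats: (N, d) LiDAR keypoint features (queries)
    image_feats: (M, d) image features (keys and values)
    w_p, w_i: scalar fusion weights (learnable in a real model)
    """
    d = point_feats.shape[-1]
    # Step 1: correlation values via scaled query-key dot products
    corr = point_feats @ image_feats.T / np.sqrt(d)
    # Step 2: attention scores via a numerically stable softmax over image features
    scores = np.exp(corr - corr.max(axis=-1, keepdims=True))
    scores /= scores.sum(axis=-1, keepdims=True)
    # Step 3: final output, the attended image features fused with the
    # original point features under the learnable weights
    attended = scores @ image_feats
    return w_p * point_feats + w_i * attended

rng = np.random.default_rng(1)
P = rng.normal(size=(5, 16))   # 5 keypoints, 16-dim features
I = rng.normal(size=(8, 16))   # 8 image features, same dimension
fused = attention_fusion(P, I, w_p=0.7, w_i=0.3)
```

In a trained detector the weights would be learned parameters rather than fixed scalars, letting the network decide per-channel how much image context to inject into each keypoint.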
4. Experiments
4.1. Implementation Details
4.1.1. Dataset Setup and Environment Configuration
4.1.2. Detector Detail Settings
4.2. Experiment Results
4.3. Ablation Studies
4.3.1. Quantitative Analysis
4.3.2. Qualitative Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Ma, S.; Jiang, Z.; Jiang, H.; Han, M.; Li, C. Parking space and obstacle detection based on a vision sensor and checkerboard grid laser. Appl. Sci. 2020, 10, 2582. [Google Scholar] [CrossRef]
- Liu, Z.; Cai, Y.; Wang, H.; Chen, L.; Gao, H.; Jia, Y.; Li, Y. Robust target recognition and tracking of self-driving cars with radar and camera information fusion under severe weather conditions. IEEE Trans. Intell. Transp. Syst. 2021, 23, 6640–6653. [Google Scholar] [CrossRef]
- Jiang, H.; Chen, Y.; Shen, Q.; Yin, C.; Cai, J. Semantic closed-loop based visual mapping algorithm for automated valet parking. Proc. Inst. Mech. Eng. Part D J. Automob. Eng. 2023, 09544070231167639. [Google Scholar] [CrossRef]
- Liu, F.; Liu, X. Voxel-based 3D detection and reconstruction of multiple objects from a single image. Adv. Neural Inf. Process. Syst. 2021, 34, 2413–2426. [Google Scholar]
- Wang, J.; Song, Z.; Zhang, Z.; Chen, Y.; Xu, N. Delineating Sight Occlusions of Head-On Traffic Signboards under Varying Available Sight Distances Using LiDAR Point Clouds. Transp. Res. Rec. 2024, 03611981231217741. [Google Scholar] [CrossRef]
- Liu, Y.; Zhou, X.; Zhong, W. Multi-modality image fusion and object detection based on semantic information. Entropy 2023, 25, 718. [Google Scholar] [CrossRef] [PubMed]
- Luo, X.; Zhou, F.; Tao, C.; Yang, A.; Zhang, P.; Chen, Y. Dynamic multitarget detection algorithm of voxel point cloud fusion based on pointrcnn. IEEE Trans. Intell. Transp. Syst. 2022, 23, 20707–20720. [Google Scholar] [CrossRef]
- Anisha, A.M.; Abdel-Aty, M.; Abdelraouf, A.; Islam, Z.; Zheng, O. Automated vehicle to vehicle conflict analysis at signalized intersections by camera and LiDAR sensor fusion. Transp. Res. Rec. 2023, 2677, 117–132. [Google Scholar] [CrossRef]
- Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915. [Google Scholar]
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3d proposal generation and object detection from view aggregation. In Proceedings of the International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
- Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. Pointpainting: Sequential fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4604–4612. [Google Scholar]
- Yin, T.; Zhou, X.; Krähenbühl, P. Multimodal virtual point 3d detection. Adv. Neural Inf. Process. Syst. 2021, 34, 16494–16507. [Google Scholar]
- Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11784–11793. [Google Scholar]
- Sindagi, V.A.; Zhou, Y.; Tuzel, O. Mvx-net: Multimodal voxelnet for 3d object detection. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 7276–7282. [Google Scholar]
- Kuang, H.; Wang, B.; An, J.; Zhang, M.; Zhang, Z. Voxel-FPN: Multi-scale voxel feature aggregation for 3D object detection from LIDAR point clouds. Sensors 2020, 20, 704. [Google Scholar] [CrossRef] [PubMed]
- Hu, H.; Hou, Y.; Ding, Y.; Pan, G.; Chen, M.; Ge, X. V2PNet: Voxel-to-Point Feature Propagation and Fusion That Improves Feature Representation for Point Cloud Registration. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 5077–5088. [Google Scholar] [CrossRef]
- Deng, Y.; Lan, L.; You, L.; Chen, K.; Peng, L.; Zhao, W.; Song, B.; Wang, Y.; Zhou, X. Automated CT pancreas segmentation for acute pancreatitis patients by combining a novel object detection approach and U-Net. Biomed. Signal Process. Control. 2023, 81, 104430. [Google Scholar] [CrossRef] [PubMed]
- Koonce, B. ResNet 50. In Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization; Springer: Berlin/Heidelberg, Germany, 2021; pp. 63–72. [Google Scholar]
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
- Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30, 1–10. [Google Scholar]
- Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
- Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef]
- Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection from Point Clouds. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
- Hou, R.; Chen, G.; Han, Y.; Tang, Z.; Ru, Q. Multi-modal feature fusion for 3D object detection in the production workshop. Appl. Soft Comput. 2022, 115, 108245. [Google Scholar] [CrossRef]
- Song, Z.; Wei, H.; Jia, C.; Xia, Y.; Li, X.; Zhang, C. VP-net: Voxels as points for 3D object detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1. [Google Scholar] [CrossRef]
- Fan, B.; Zhang, K.; Tian, J. Hcpvf: Hierarchical cascaded point-voxel fusion for 3D object detection. IEEE Trans. Circuits Syst. Video Technol. 2023. [Google Scholar] [CrossRef]
- Ren, X.; Li, S. An anchor-free 3D object detection approach based on hierarchical pillars. Wirel. Commun. Mob. Comput. 2022, 2022, 3481517–3481526. [Google Scholar] [CrossRef]
- Liu, M.; Ma, J.; Zheng, Q.; Liu, Y.; Shi, G. 3D object detection based on attention and multi-scale feature fusion. Sensors 2022, 22, 3935. [Google Scholar] [CrossRef]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Shao, Z.; Dou, W.; Pan, Y. Dual-level Deep Evidential Fusion: Integrating multimodal information for enhanced reliable decision-making in deep learning. Inf. Fusion 2024, 103, 102113. [Google Scholar] [CrossRef]
- Aung, S.; Park, H.; Jung, H.; Cho, J. Enhancing multi-view pedestrian detection through generalized 3d feature pulling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 1196–1205. [Google Scholar]
- Li, Y.; Chen, Y.; Qi, X.; Li, Z.; Sun, J.; Jia, J. Unifying voxel-based representation with transformer for 3d object detection. Adv. Neural Inf. Process. Syst. 2022, 35, 18442–18455. [Google Scholar]
- Li, Y. Optimized voxel transformer for 3D detection with spatial-semantic feature aggregation. Comput. Electr. Eng. 2023, 112, 109023. [Google Scholar] [CrossRef]
- Zhou, J.; Lin, T.; Gong, Z.; Huang, X. SIANet: 3D object detection with structural information augment network. IET Comput. Vis. 2024, 1–14. [Google Scholar] [CrossRef]
- Xu, D.; Anguelov, D.; Jain, A. Pointfusion: Deep sensor fusion for 3D bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 244–253. [Google Scholar]
- Xu, S.; Zhou, D.; Fang, J.; Yin, J.; Bin, Z.; Zhang, L. Fusionpainting: Multimodal fusion with adaptive attention for 3D object detection. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 3047–3054. [Google Scholar]
- McLachlan, G.J. Mahalanobis distance. Resonance 1999, 4, 20–26. [Google Scholar] [CrossRef]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012. [Google Scholar]
- Wu, X.; Peng, L.; Yang, H.; Xie, L.; Huang, C.; Deng, C.; Liu, H.; Cai, D. Sparse fuse dense: Towards high quality 3D detection with depth completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5418–5427. [Google Scholar]
- Cheng, S.; Ning, Z.; Hu, J.; Liu, J.; Yang, W.; Wang, L.; Yu, H.; Liu, W. G-Fusion: LiDAR and Camera Feature Fusion on the Ground Voxel Space. IEEE Access 2024, 12, 4127–4138. [Google Scholar] [CrossRef]
| Method | 3D AP Easy (%) | 3D AP Mod. (%) | 3D AP Hard (%) | BEV AP Easy (%) | BEV AP Mod. (%) | BEV AP Hard (%) |
|---|---|---|---|---|---|---|
| Second | 84.91 | 72.65 | 43.96 | 85.38 | 73.87 | 54.53 |
| Voxel-Net | 81.62 | 65.23 | 50.77 | 82.26 | 66.73 | 52.29 |
| AVOD | 74.96 | 63.55 | 53.70 | 76.42 | 64.81 | 54.92 |
| PointPainting | 89.53 | 73.89 | 55.74 | 91.36 | 75.23 | 58.37 |
| MVX-Net | 85.52 | 72.26 | 56.83 | 88.63 | 73.61 | 59.41 |
| SFD | 87.56 | 77.61 | 60.95 | 88.12 | 80.06 | 62.24 |
| V2PNet | 84.33 | 76.35 | 60.28 | 85.69 | 77.83 | 62.31 |
| G-Fusion | 86.37 | 76.43 | 60.54 | 87.49 | 77.52 | 61.09 |
| Ours | 87.79 | 79.92 | 61.48 | 88.95 | 82.30 | 64.28 |
| Second | N-GV | GDM | LWP | mAP (%) | Inference Time (s) |
|---|---|---|---|---|---|
| ✓ | | | | 66.46 | 0.88 |
| ✓ | ✓ | | | 67.83 (+1.37) | 0.93 (+0.05) |
| ✓ | | ✓ | | 68.20 (+1.74) | 0.91 (+0.03) |
| ✓ | | | ✓ | 67.92 (+1.46) | 0.84 (−0.04) |
| ✓ | ✓ | ✓ | | 68.95 (+2.49) | 0.95 (+0.07) |
| ✓ | ✓ | | ✓ | 69.23 (+2.77) | 0.96 (+0.08) |
| ✓ | | ✓ | ✓ | 68.65 (+2.19) | 0.93 (+0.05) |
| ✓ | ✓ | ✓ | ✓ | 70.39 (+3.93) | 1.07 (+0.19) |
| Method | mAP (%) IoU = 0.5 | mAP (%) IoU = 0.7 | FPS |
|---|---|---|---|
| MVX-Net | 76.27 | 70.71 | 10.2 |
| Ours | 79.48 | 72.25 | 11.0 |
| Method | With DL-VFFA | 3D AP Easy (%) | 3D AP Mod. (%) | 3D AP Hard (%) | BEV AP Easy (%) | BEV AP Mod. (%) | BEV AP Hard (%) |
|---|---|---|---|---|---|---|---|
| SFD | No | 87.56 | 77.61 | 60.95 | 88.12 | 80.06 | 62.24 |
| | Yes | 89.24 | 81.67 | 66.05 | 90.35 | 82.30 | 66.43 |
| | Improvement | +1.68 | +4.06 | +5.10 | +2.23 | +2.24 | +4.19 |
| V2PNet | No | 84.33 | 76.35 | 60.28 | 85.69 | 77.83 | 62.31 |
| | Yes | 86.82 | 77.94 | 62.36 | 87.56 | 79.20 | 64.58 |
| | Improvement | +2.49 | +1.59 | +2.08 | +1.87 | +1.37 | +2.27 |
| G-Fusion | No | 86.37 | 76.43 | 60.54 | 87.49 | 77.52 | 61.09 |
| | Yes | 88.21 | 79.06 | 64.37 | 89.78 | 80.63 | 65.87 |
| | Improvement | +1.84 | +2.63 | +3.83 | +2.29 | +3.11 | +4.78 |
Share and Cite
Jiang, H.; Ren, J.; Li, A. 3D Object Detection under Urban Road Traffic Scenarios Based on Dual-Layer Voxel Features Fusion Augmentation. Sensors 2024, 24, 3267. https://doi.org/10.3390/s24113267