PSANet: Pyramid Splitting and Aggregation Network for 3D Object Detection in Point Cloud
Abstract
1. Introduction
- We propose a new method that performs cross-layer fusion of multi-scale feature maps, using the pyramid splitting and aggregation (PSA) module to integrate information from different levels.
- We propose a novel backbone network that extracts robust features from the bird’s eye view, combining the advantages of cross-layer fused features and the original multi-scale features.
- Our proposed PSANet achieves competitive performance on both 3D and BEV detection tasks, with an inference speed of 11 FPS on a single GTX 1080Ti GPU.
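The cross-layer fusion idea behind the PSA module can be illustrated with a minimal sketch: feature maps from several pyramid levels are brought to a common resolution and aggregated. The code below is a hypothetical NumPy illustration of this split-upsample-aggregate pattern, not the authors' implementation; the function names and the choice of nearest-neighbour upsampling with summation are assumptions for illustration.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse_multiscale(feats):
    # feats: list of (C, H_i, W_i) maps, finest first; each coarser map
    # is upsampled to the finest resolution, then all maps are summed.
    target = feats[0].shape[1]
    fused = np.zeros_like(feats[0])
    for f in feats:
        while f.shape[1] < target:
            f = upsample2x(f)
        fused = fused + f
    return fused

# Three pyramid levels with channel depth 4: 16x16, 8x8, 4x4.
pyramid = [np.random.rand(4, 16 >> i, 16 >> i) for i in range(3)]
fused = fuse_multiscale(pyramid)
print(fused.shape)  # (4, 16, 16)
```

In a real detector the upsampling would typically be a learned transposed convolution and the aggregation could be concatenation rather than summation; the sketch only shows the data flow.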
2. Related Work
2.1. Monocular Image-Based Methods
2.2. Multi-Sensor Fusion-Based Methods
2.3. Point Cloud-Based Methods
3. PSANet Detector
3.1. Motivation
3.2. Network Architecture
3.3. Data Preprocessing
3.4. Voxel-Wise Feature Extractor
3.5. 3D Sparse Convolutional Middle Extractor
3.6. Reshaping To BEV
3.7. Cross-Layer Fusion of Multi-Scale BEV Features (PFH-PSA)
3.8. Loss Function
4. Experiments
4.1. Dataset
4.2. Implementation Details
4.2.1. Network Details
4.2.2. Training Details
4.3. Comparisons on the KITTI Validation Set
4.4. Ablation Studies
4.4.1. Different Backbone Networks
4.4.2. Different Fusion Methods
4.5. Analysis of the Detection Results
4.5.1. Detection Results on the KITTI Validation Set
4.5.2. Comparison with Some State-of-the-Art Voxel-Based Methods
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Meaning
---|---
BEV | Bird’s Eye View
PFH | Pyramidal Feature Hierarchy
PSA | Pyramid Splitting and Aggregation
RPN | Region Proposal Network
VFE | Voxel Feature Encoding
References
1. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1440–1448.
2. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37.
3. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788.
4. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
5. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv 2014, arXiv:1412.7062.
6. Lin, G.; Milan, A.; Shen, C.; Reid, I. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1925–1934.
7. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 184–199.
8. Kim, J.; Kwon Lee, J.; Mu Lee, K. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1646–1654.
9. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144.
10. Janai, J.; Güney, F.; Behl, A.; Geiger, A. Computer vision for autonomous vehicles: Problems, datasets and state of the art. Found. Trends Comput. Graph. Vis. 2020, 12, 1–308.
11. Zhou, Y.; Tuzel, O. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499.
12. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337.
13. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12697–12705.
14. Liu, Z.; Zhao, X.; Huang, T.; Hu, R.; Zhou, Y.; Bai, X. TANet: Robust 3D object detection from point clouds with triple attention. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11677–11684.
15. Zhang, D.; Zhang, H.; Tang, J.; Wang, M.; Hua, X.; Sun, Q. Feature Pyramid Transformer. arXiv 2020, arXiv:2007.09451.
16. Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3D object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2147–2156.
17. Li, B.; Ouyang, W.; Sheng, L.; Zeng, X.; Wang, X. GS3D: An efficient 3D object detection framework for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1019–1028.
18. Ma, X.; Wang, Z.; Li, H.; Zhang, P.; Ouyang, W.; Fan, X. Accurate monocular 3D object detection via color-embedded 3D reconstruction for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6851–6860.
19. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
20. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915.
21. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D proposal generation and object detection from view aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8.
22. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum PointNets for 3D object detection from RGB-D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 918–927.
23. Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep continuous fusion for multi-sensor 3D object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 641–656.
24. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5099–5108.
25. Shi, S.; Wang, X.; Li, H. PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 770–779.
26. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
27. Yang, B.; Luo, W.; Urtasun, R. PIXOR: Real-time 3D object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7652–7660.
28. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
29. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
30. Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. IPOD: Intensive point-based object detector for point cloud. arXiv 2018, arXiv:1812.05276.
31. Li, X.; Guivant, J.; Kwok, N.; Xu, Y.; Li, R.; Wu, H. Three-dimensional backbone network for 3D object detection in traffic scenes. arXiv 2019, arXiv:1901.08373.
32. Kuang, H.; Wang, B.; An, J.; Zhang, M.; Zhang, Z. Voxel-FPN: Multi-scale voxel feature aggregation for 3D object detection from LiDAR point clouds. Sensors 2020, 20, 704.
33. Choi, J.; Chun, D.; Kim, H.; Lee, H.J. Gaussian YOLOv3: An accurate and fast object detector using localization uncertainty for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 502–511.
34. Shi, S.; Wang, Z.; Shi, J.; Wang, X.; Li, H. From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network. IEEE Trans. Pattern Anal. Mach. Intell. 2020.
35. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538.
Type | Method | Modality | 3D Easy | 3D Moderate | 3D Hard | BEV Easy | BEV Moderate | BEV Hard | FPS
---|---|---|---|---|---|---|---|---|---
2-stage | MV3D [20] | RGB + LiDAR | 71.29 | 62.68 | 56.56 | 86.55 | 78.10 | 76.67 | 3
2-stage | AVOD-FPN [21] | RGB + LiDAR | 84.41 | 74.44 | 68.65 | N/A | N/A | N/A | 10
2-stage | F-PointNet [22] | RGB + LiDAR | 83.76 | 70.92 | 63.65 | 88.16 | 84.02 | 76.44 | 6
2-stage | IPOD [30] | RGB + LiDAR | 84.10 | 76.40 | 75.30 | 88.30 | 86.40 | 84.60 | N/A
2-stage | PointRCNN [25] | LiDAR | 88.88 | 78.63 | 77.38 | N/A | N/A | N/A | 10
1-stage | VoxelNet [11] | LiDAR | 81.97 | 65.46 | 62.85 | 89.60 | 84.81 | 78.57 | 4
1-stage | SECOND [12] | LiDAR | 87.43 | 76.48 | 69.10 | 89.96 | 87.07 | 79.66 | 25
1-stage | PointPillars [13] | LiDAR | 86.13 | 77.03 | 72.43 | 89.93 | 86.92 | 84.97 | 62
1-stage | 3DBN [31] | LiDAR | 87.98 | 77.89 | 76.35 | N/A | N/A | N/A | 8
1-stage | TANet [14] | LiDAR | 88.21 | 77.85 | 75.62 | N/A | N/A | N/A | 29
1-stage | Voxel-FPN [32] | LiDAR | 88.27 | 77.86 | 75.84 | 90.20 | 87.92 | 86.27 | 50
1-stage | PSANet (Ours) | LiDAR | 89.02 | 78.70 | 77.57 | 90.20 | 87.88 | 86.20 | 11

All 3D and BEV detection results use IoU = 0.7.
Method | 3D Easy | 3D Moderate | 3D Hard | BEV Easy | BEV Moderate | BEV Hard
---|---|---|---|---|---|---
Baseline | 88.46 | 78.15 | 76.68 | 90.01 | 87.45 | 85.08
Baseline + Coarse Branch | 88.73 | 78.24 | 76.50 | 90.07 | 87.51 | 85.56
Baseline + Fine Branch | 88.52 | 77.98 | 76.62 | 90.15 | 87.63 | 86.14
Baseline + PFH-PSA | 89.02 | 78.70 | 77.57 | 90.20 | 87.88 | 86.20
Improvement | +0.56 | +0.55 | +0.89 | +0.19 | +0.43 | +1.12

All 3D and BEV detection results use IoU = 0.7.
Method | 3D Easy | 3D Moderate | 3D Hard | BEV Easy | BEV Moderate | BEV Hard
---|---|---|---|---|---|---
Baseline | 88.46 | 78.15 | 76.68 | 90.01 | 87.45 | 85.08
Early Concat Fusion | 88.44 | 78.27 | 77.09 | 90.10 | 87.83 | 86.28
Early Sum Fusion | 89.10 | 78.68 | 77.35 | 90.21 | 87.82 | 85.99
Late Concat Fusion | 88.66 | 78.55 | 77.25 | 90.12 | 88.05 | 86.80
Late Sum Fusion | 89.02 | 78.70 | 77.57 | 90.20 | 87.88 | 86.20

All 3D and BEV detection results use IoU = 0.7.
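The fusion ablation compares two aggregation operators (concatenation vs. summation) applied either early or late in the backbone. A minimal sketch of the two operators, assuming NumPy arrays in (C, H, W) layout (an illustrative reconstruction, not the authors' code):

```python
import numpy as np

def concat_fuse(a, b):
    # Channel-wise concatenation: output has Ca + Cb channels,
    # so a following convolution must handle the wider tensor.
    return np.concatenate([a, b], axis=0)

def sum_fuse(a, b):
    # Element-wise summation: channel counts must already match,
    # keeping the downstream channel budget unchanged.
    return a + b

# Two BEV feature maps of equal spatial size (shapes chosen for illustration).
a = np.random.rand(8, 32, 32)
b = np.random.rand(8, 32, 32)

print(concat_fuse(a, b).shape)  # (16, 32, 32)
print(sum_fuse(a, b).shape)     # (8, 32, 32)
```

"Early" vs. "late" then refers only to where in the network this operator is applied; the operators themselves are identical in both settings.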
Share and Cite
Li, F.; Jin, W.; Fan, C.; Zou, L.; Chen, Q.; Li, X.; Jiang, H.; Liu, Y. PSANet: Pyramid Splitting and Aggregation Network for 3D Object Detection in Point Cloud. Sensors 2021, 21, 136. https://doi.org/10.3390/s21010136