Refined Prior Guided Category-Level 6D Pose Estimation and Its Application on Robotic Grasping
Abstract
1. Introduction
- We propose a refined prior-guided category-level 6D pose estimation framework. Taking RGB and depth images as input, the framework handles occlusion and lighting variations and can estimate the poses of unseen objects within a known category.
- We refine high-dimensional features through a multi-stage structure and progressively refine the deformed prior point cloud. Explicitly incorporating the point-cloud transformation gives the learned features stronger semantic information.
- We introduce a novel attention mechanism that lets the network focus on the differences between the observed and prior point clouds, thereby addressing the intra-class variation problem (see the illustrative sketch after this list).
- Extensive experiments on the CAMERA25 and REAL275 datasets demonstrate that our method outperforms existing approaches and can estimate 6D poses under partial occlusion and varying lighting conditions. We further validate the algorithm in vision-based robotic arm grasping experiments.
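The outline does not detail the Feature Fusion Attention Network (Section 3.2.5), so as a rough illustration of attention-based fusion between observed and prior point-cloud features, here is a minimal PyTorch sketch. The module name, dimensions, and residual layout are our assumptions, not the authors' implementation.

```python
# Hypothetical sketch: cross-attention fusion of observed point features
# (queries) with category-prior features (keys/values), so the attention
# weights emphasize where the instance deviates from the prior.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 2), nn.ReLU(), nn.Linear(dim * 2, dim)
        )

    def forward(self, inst_feat: torch.Tensor, prior_feat: torch.Tensor):
        # inst_feat:  (B, N, C) per-point features of the observed cloud
        # prior_feat: (B, M, C) per-point features of the category prior
        attended, _ = self.attn(inst_feat, prior_feat, prior_feat)
        x = self.norm1(inst_feat + attended)    # residual + norm
        return self.norm2(x + self.ffn(x))      # position-wise refinement

# Example: 1024 observed points vs. a 1024-point prior with 128-D features.
fusion = CrossAttentionFusion(dim=128, num_heads=4)
out = fusion(torch.randn(2, 1024, 128), torch.randn(2, 1024, 128))
print(out.shape)  # torch.Size([2, 1024, 128])
```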
2. Related Works
2.1. Manipulator Grasping
2.2. Instance-Level 6D Pose Estimation
2.3. Category-Level 6D Pose Estimation
3. Materials and Methods
3.1. Category Prior
3.2. Network Structure
3.2.1. Overview
3.2.2. Prior Guided Observation Network
3.2.3. Deformation Refine Module
3.2.4. Correspondence Refine Module
3.2.5. Feature Fusion Attention Network
3.3. Loss Function
3.3.1. Reconstruction Loss
3.3.2. Correspondence Loss
3.3.3. Regularization Loss
3.3.4. Overall Loss
3.4. 6D Pose Parameter Calculation
4. Results
4.1. Preprocessing and Implementation Detail
4.2. Dataset
4.3. Evaluation Metrics
4.4. Comparison with State-of-the-Art Methods
4.5. Ablation Studies
4.6. Simulation of Robotic Grasping
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Kumra, S.; Joshi, S.; Sahin, F. Antipodal robotic grasping using generative residual convolutional neural network. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 9626–9633. [Google Scholar]
- Morrison, D.; Corke, P.; Leitner, J. Learning robust, real-time, reactive robotic grasping. Int. J. Robot. Res. 2020, 39, 183–201. [Google Scholar] [CrossRef]
- Sahin, C.; Garcia-Hernando, G.; Sock, J.; Kim, T.K. Instance- and category-level 6D object pose estimation. In RGB-D Image Analysis and Processing; Springer: Berlin/Heidelberg, Germany, 2019; pp. 243–265. [Google Scholar]
- Wang, C.; Xu, D.; Zhu, Y.; Martín-Martín, R.; Lu, C.; Fei-Fei, L.; Savarese, S. DenseFusion: 6D object pose estimation by iterative dense fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3343–3352. [Google Scholar]
- Tremblay, J.; To, T.; Sundaralingam, B.; Xiang, Y.; Fox, D.; Birchfield, S. Deep object pose estimation for semantic robotic grasping of household objects. arXiv 2018, arXiv:1809.10790. [Google Scholar]
- Fang, H.S.; Wang, C.; Gou, M.; Lu, C. GraspNet-1Billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11444–11453. [Google Scholar]
- Wang, H.; Sridhar, S.; Huang, J.; Valentin, J.; Song, S.; Guibas, L.J. Normalized object coordinate space for category-level 6D object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2642–2651. [Google Scholar]
- Tian, M.; Ang, M.H.; Lee, G.H. Shape prior deformation for categorical 6D object pose and size estimation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XXI; Springer: Berlin/Heidelberg, Germany, 2020; pp. 530–546. [Google Scholar]
- Park, K.; Mousavian, A.; Xiang, Y.; Fox, D. LatentFusion: End-to-end differentiable reconstruction and rendering for unseen object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10710–10719. [Google Scholar]
- Wang, G.; Manhardt, F.; Tombari, F.; Ji, X. GDR-Net: Geometry-guided direct regression network for monocular 6D object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16611–16621. [Google Scholar]
- Nie, T.; Ma, J.; Zhao, Y.; Fan, Z.; Wen, J.; Sun, M. Category-level 6D pose estimation using geometry-guided instance-aware prior and multi-stage reconstruction. IEEE Robot. Autom. Lett. 2023, 8, 2381–2388. [Google Scholar] [CrossRef]
- Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
- Umeyama, S. Least-squares estimation of transformation parameters between two point patterns. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 376–380. [Google Scholar] [CrossRef]
- Marullo, G.; Tanzi, L.; Piazzolla, P.; Vezzetti, E. 6D object position estimation from 2D images: A literature review. Multimed. Tools Appl. 2023, 82, 24605–24643. [Google Scholar] [CrossRef]
- Muñoz, E.; Konishi, Y.; Beltran, C.; Murino, V.; Del Bue, A. Fast 6D pose from a single RGB image using Cascaded Forests Templates. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea, 9–14 October 2016; pp. 4062–4069. [Google Scholar]
- Pavlakos, G.; Zhou, X.; Chan, A.; Derpanis, K.G.; Daniilidis, K. 6-DoF object pose from semantic keypoints. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 2011–2018. [Google Scholar]
- Peng, S.; Liu, Y.; Huang, Q.; Zhou, X.; Bao, H. PVNet: Pixel-wise voting network for 6DoF pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4561–4570. [Google Scholar]
- Zhao, W.; Zhang, S.; Guan, Z.; Zhao, W.; Peng, J.; Fan, J. Learning deep network for detecting 3D object keypoints and 6D poses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14134–14142. [Google Scholar]
- Lepetit, V.; Moreno-Noguer, F.; Fua, P. EPnP: An accurate O(n) solution to the PnP problem. Int. J. Comput. Vis. 2009, 81, 155–166. [Google Scholar] [CrossRef]
- Payet, N.; Todorovic, S. From contours to 3D object detection and pose estimation. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 983–990. [Google Scholar]
- Sundermeyer, M.; Marton, Z.C.; Durner, M.; Triebel, R. Augmented autoencoders: Implicit 3D orientation learning for 6D object detection. Int. J. Comput. Vis. 2020, 128, 714–729. [Google Scholar] [CrossRef]
- Liu, F.; Fang, P.; Yao, Z.; Fan, R.; Pan, Z.; Sheng, W.; Yang, H. Recovering 6D object pose from RGB indoor image based on two-stage detection network with multi-task loss. Neurocomputing 2019, 337, 15–23. [Google Scholar] [CrossRef]
- He, Y.; Huang, H.; Fan, H.; Chen, Q.; Sun, J. FFB6D: A full flow bidirectional fusion network for 6D pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3003–3013. [Google Scholar]
- Chen, W.; Jia, X.; Chang, H.J.; Duan, J.; Shen, L.; Leonardis, A. FS-Net: Fast shape-based network for category-level 6D object pose estimation with decoupled rotation mechanism. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1581–1590. [Google Scholar]
- Di, Y.; Zhang, R.; Lou, Z.; Manhardt, F.; Ji, X.; Navab, N.; Tombari, F. GPV-Pose: Category-level object pose estimation via geometry-guided point-wise voting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6781–6791. [Google Scholar]
- Lin, X.; Yang, W.; Gao, Y.; Zhang, T. Instance-adaptive and geometric-aware keypoint learning for category-level 6D object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 21040–21049. [Google Scholar]
- Chen, D.; Li, J.; Wang, Z.; Xu, K. Learning canonical shape space for category-level 6D object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11973–11982. [Google Scholar]
- Chen, K.; Dou, Q. SGPA: Structure-guided prior adaptation for category-level 6D object pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2773–2782. [Google Scholar]
- Zhang, R.; Di, Y.; Lou, Z.; Manhardt, F.; Tombari, F.; Ji, X. RBP-Pose: Residual bounding box projection for category-level pose estimation. In Computer Vision—ECCV 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 655–672. [Google Scholar]
- Wang, J.; Chen, K.; Dou, Q. Category-level 6D object pose estimation via cascaded relation and recurrent reconstruction networks. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 4807–4814. [Google Scholar]
- Lin, H.; Liu, Z.; Cheang, C.; Fu, Y.; Guo, G.; Xue, X. SAR-Net: Shape alignment and recovery network for category-level 6D object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6707–6717. [Google Scholar]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
- Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
Comparison with state-of-the-art methods on the CAMERA25 dataset (mAP, %):

| Method | 3D50 | 3D75 | 5°2 cm | 5°5 cm | 10°2 cm | 10°5 cm |
|---|---|---|---|---|---|---|
| NOCS [7] | 83.9 | 69.5 | 32.3 | 40.9 | 48.2 | 64.6 |
| SAR-Net [31] | 86.8 | 79.0 | 66.7 | 70.9 | 75.3 | 80.3 |
| SPD [8] | 93.2 | 83.1 | 54.3 | 59.0 | 73.3 | 81.5 |
| CR-Net [30] | 93.8 | 88.0 | 72.0 | 76.4 | 81.0 | 87.7 |
| RBP-Pose [29] | 93.1 | 89.0 | 73.5 | 79.6 | 82.1 | 89.5 |
| SGPA [28] | 93.2 | 88.1 | 70.7 | 74.5 | 82.7 | 88.4 |
| Ours | 93.4 | 89.2 | 70.2 | 72.6 | 86.7 | 90.4 |
Comparison with state-of-the-art methods on the REAL275 dataset (mAP, %):

| Method | 3D50 | 3D75 | 5°2 cm | 5°5 cm | 10°2 cm | 10°5 cm |
|---|---|---|---|---|---|---|
| NOCS [7] | 78.0 | 30.1 | 7.2 | 10.0 | 13.8 | 25.2 |
| SAR-Net [31] | 79.3 | 62.4 | 31.6 | 42.3 | 50.3 | 68.3 |
| SPD [8] | 77.3 | 53.2 | 19.3 | 21.4 | 43.2 | 54.1 |
| CR-Net [30] | 79.3 | 55.9 | 27.8 | 34.3 | 47.2 | 60.8 |
| RBP-Pose [29] | - | 67.8 | 38.2 | 48.2 | 63.1 | 79.2 |
| SGPA [28] | 80.1 | 61.9 | 35.9 | 39.6 | 61.3 | 70.7 |
| Ours | 82.1 | 66.1 | 36.7 | 40.3 | 64.7 | 79.7 |
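For reference, the metrics in the tables above follow the standard NOCS-style protocol: 3D50 and 3D75 are mAP at 3D IoU thresholds of 50% and 75%, while "n° m cm" is mAP for predictions within n degrees of rotation error and m cm of translation error. The sketch below shows the per-pose test; it is a generic illustration with hypothetical helper names, not the authors' evaluation code, and it omits the symmetry handling the full protocol applies to round categories such as bottles, bowls, and cans.

```python
# Generic per-pose accuracy check behind the "n° m cm" columns.
import numpy as np

def rotation_error_deg(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    # Geodesic distance between two 3x3 rotation matrices, in degrees.
    cos_angle = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def translation_error_cm(t_pred: np.ndarray, t_gt: np.ndarray) -> float:
    # Euclidean distance, assuming translations are given in metres.
    return 100.0 * np.linalg.norm(t_pred - t_gt)

def pose_correct(R_pred, t_pred, R_gt, t_gt, max_deg, max_cm) -> bool:
    return (rotation_error_deg(R_pred, R_gt) <= max_deg
            and translation_error_cm(t_pred, t_gt) <= max_cm)

# Example: a pose off by 3 degrees about z and 1 cm passes the 5°2 cm test.
a = np.radians(3.0)
R = np.array([[np.cos(a), -np.sin(a), 0.0],
              [np.sin(a),  np.cos(a), 0.0],
              [0.0,        0.0,       1.0]])
print(pose_correct(R, np.array([0.01, 0.0, 0.0]),
                   np.eye(3), np.zeros(3), max_deg=5, max_cm=2))  # True
```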
Ablation study on CAMERA25 (mAP, %):

| # | Feature Fusion Module | Refine Module | 3D50 | 3D75 | 5°2 cm | 5°5 cm | 10°2 cm | 10°5 cm |
|---|---|---|---|---|---|---|---|---|
| 1 | - | - | 92.1 | 86.5 | 66.0 | 66.2 | 84.9 | 87.7 |
| 2 | ✓ | - | 92.7 | 86.8 | 65.9 | 68.5 | 84.4 | 87.8 |
| 3 | ✓ | ✓ | 93.4 | 89.2 | 70.2 | 72.6 | 86.7 | 90.4 |
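Section 3.4 derives the final 6D pose and scale from the predicted correspondences; the citations of RANSAC [12] and Umeyama [13] point to the usual robust similarity-transform fit between the recovered canonical coordinates and the observed points. Below is a minimal NumPy sketch of the Umeyama step alone, written under that assumption (the RANSAC outlier-rejection loop is omitted); it is a generic implementation, not the authors' code.

```python
# Least-squares similarity transform (Umeyama, 1991): finds scale s,
# rotation R, and translation t such that dst ≈ s * R @ src + t.
import numpy as np

def umeyama(src: np.ndarray, dst: np.ndarray):
    # src, dst: (N, 3) corresponding point sets.
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                 # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # avoid reflections
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)         # variance of the source set
    s = np.trace(np.diag(D) @ S) / var_s       # optimal isotropic scale
    t = mu_d - s * R @ mu_s
    return s, R, t

# Sanity check: recover a known transform from noise-free correspondences.
rng = np.random.default_rng(0)
P = rng.normal(size=(100, 3))
R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
Q = 0.5 * P @ R_true.T + np.array([0.1, -0.2, 0.3])
s, R, t = umeyama(P, Q)
print(np.isclose(s, 0.5), np.allclose(R, R_true))  # True True
```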