A Novel Deep Learning-Based Pose Estimation Method for Robotic Grasping of Axisymmetric Bodies in Industrial Stacked Scenarios
Abstract
1. Introduction
- Changeability. Because the axisymmetric bodies contact and support one another, moving one body usually causes others to move, changing the entire stacking scenario.
- Model information. The dimensions, geometric shapes, and other parametric information of the axisymmetric bodies are known.
- Ungraspability. Some bodies cannot be grasped because the part to be grasped is occluded by other bodies.
- Large product quantity. Different categories of industrial products are not mixed and stacked, so there is only one product category. Products generally appear in an environment with a single working background, such as an assembly line, so there is only one background.
- object detection based on deep learning is used to locate the bodies;
- instance segmentation based on deep learning is used to select unoccluded bodies that can be grasped;
- conventional 6D object pose estimation methods suffer from low efficiency or from the high cost of point cloud annotation, making them unsuitable for stacked scenarios with many objects. We therefore propose a novel method that obtains the 6D pose of the axisymmetric bodies to be grasped by detecting pre-defined keypoints on the body surface (see the pose-recovery sketch after this list); this approach relies on deep-learning-based multiobject 2D keypoint detection and avoids the disadvantages of conventional methods;
- the changeability of stacked scenarios requires (1), (2), and (3) to be performed frequently to update the detection results. Therefore, to ensure efficient grasping, we integrated the three technologies involved in (1), (2), and (3) into one convolutional neural network (CNN) and imposed real-time requirements. Fast detection also extends our method to multiarm collaboration, cameras mounted on the arm (eye-in-hand), and other scenarios. More importantly, the small number of model parameters makes the method friendly to embedded devices with limited computing power;
- a large number of bodies usually results in more than one graspable body being detected. Therefore, the grasping system requires a decision-making subsystem that applies a grasping strategy to rank candidate actions. In this study, a small CNN scores the quality of the predictions of (2) and (3), namely the intersection over union (IoU) of masks and the object keypoint similarity (OKS) of keypoints, and these scores serve as the ranking criteria (see the ranking sketch after this list).
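To make the keypoint-based pose recovery described above concrete, here is a minimal, hypothetical sketch (not the authors' exact pipeline): given the known 3D coordinates of the pre-defined keypoints in the body's model frame and their 2D detections in the image, a standard PnP solver returns the body's rotation and translation relative to the calibrated camera. The keypoint layout, image coordinates, and intrinsics below are illustrative assumptions; note that for an axisymmetric body, the rotation about the symmetry axis is inherently ambiguous, which is harmless for grasping.

```python
# Hedged sketch: 6D pose from pre-defined surface keypoints via PnP.
# All numeric values below are assumed for illustration.
import cv2
import numpy as np

# Pre-defined keypoints in the object (model) frame, in meters (assumed layout).
object_points = np.array([
    [0.00,  0.00,  0.05],   # top end of the symmetry axis
    [0.00,  0.00, -0.05],   # bottom end of the symmetry axis
    [0.03,  0.00,  0.00],   # points on the rim
    [-0.03, 0.00,  0.00],
    [0.00,  0.03,  0.00],
], dtype=np.float64)

# Their 2D pixel locations as predicted by the keypoint-detection branch.
image_points = np.array([
    [320.0, 180.0],
    [330.0, 300.0],
    [372.0, 242.0],
    [288.0, 238.0],
    [330.0, 236.0],
], dtype=np.float64)

# Assumed pinhole intrinsics of the calibrated camera.
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])
dist = np.zeros(5)  # assume lens distortion is already corrected

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist,
                              flags=cv2.SOLVEPNP_EPNP)
if ok:
    R, _ = cv2.Rodrigues(rvec)  # rotation of the object in the camera frame
    print("R =\n", R, "\nt =", tvec.ravel())
```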
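Similarly, the ranking criteria in the last item can be illustrated as follows. This is an assumption-level sketch, not the paper's code: mask IoU and a COCO-style OKS are the two quality measures the scoring CNN is trained to predict, and graspable candidates are sorted by a combined score. The uniform falloff constant `k` and the equal weighting of the two scores are assumptions.

```python
# Hedged sketch of the two quality measures and the resulting action ranking.
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum()) / float(union)

def oks(pred_kpts: np.ndarray, gt_kpts: np.ndarray, area: float,
        k: float = 0.05) -> float:
    """Simplified COCO-style OKS: pred_kpts/gt_kpts are (N, 2) pixel arrays,
    area is the object's area in pixels (s^2), and k is a falloff constant
    (assumed uniform here; COCO uses per-keypoint constants)."""
    d2 = np.sum((pred_kpts - gt_kpts) ** 2, axis=1)
    return float(np.mean(np.exp(-d2 / (2.0 * area * k ** 2))))

def rank_candidates(candidates):
    """Sort graspable candidates by the mean of their predicted mask-IoU and
    OKS scores (the scoring CNN's outputs at inference time)."""
    return sorted(candidates,
                  key=lambda c: 0.5 * (c["iou"] + c["oks"]),
                  reverse=True)

# Example: choose which of two detected graspable bodies to pick first.
best = rank_candidates([{"id": 0, "iou": 0.82, "oks": 0.77},
                        {"id": 1, "iou": 0.91, "oks": 0.85}])[0]
print("grasp body", best["id"])
```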
- a novel method is proposed to obtain the 6D pose of axisymmetric bodies to be grasped;
- a real-time multitask CNN, named Key-Yolact, is designed;
- a small scoring CNN is designed to score the quality of the Key-Yolact prediction.
2. Related Work
2.1. 6D Object Pose Estimation
2.2. Instance Segmentation
2.3. Multiobject 2D Keypoint Detection
3. Approaches
3.1. 6D Object Pose Estimation
3.2. Multitask CNN Key-Yolact
3.2.1. Architecture
3.2.2. Object Detection and Instance Segmentation
3.2.3. Multiobject Keypoint Detection
3.2.4. Improvements
3.2.5. Loss Function
3.3. Decision-Making Subsystem for Multiobject Grasp
4. Experiments
4.1. Self-Built Dataset
4.2. Experiments on Key-Yolact
4.2.1. Training and Loss
4.2.2. Evaluation Metric
4.2.3. Analysis of Results
4.3. Scoring Network-Related Experiments
4.4. Robotic Grasping Experiment
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Appendix B
References
Branch | Loss Function | Weight λ
---|---|---
Class | Cross-Entropy Loss | 1
Box | Smooth L1 Loss | 1.5
Mask | BCE Loss | 6.125
Keypoint | Focal Loss | 1
Offset | L1 Loss | 1
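Read as a weighted sum over the five branches, the table corresponds to a total training loss of the form below; the λ symbols are our notation, with values taken directly from the table.

$$
L = \lambda_{\text{cls}} L_{\text{cls}} + \lambda_{\text{box}} L_{\text{box}} + \lambda_{\text{mask}} L_{\text{mask}} + \lambda_{\text{kp}} L_{\text{kp}} + \lambda_{\text{off}} L_{\text{off}}, \qquad (\lambda_{\text{cls}}, \lambda_{\text{box}}, \lambda_{\text{mask}}, \lambda_{\text{kp}}, \lambda_{\text{off}}) = (1,\ 1.5,\ 6.125,\ 1,\ 1).
$$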
Model | Precision | Run Time/ms | FPS | Parameters/M
---|---|---|---|---
Keypoint R-CNN | 79.68% | 88.59 | 11.3 | 58.8
Key-Yolact | 72.61% | 47.26 | 21.29 | 38.8
Key-Yolact with TensorRT | 72.61% | 35.39 | 28.1 | /
Key-Yolact without DCN | 70.36% | 44.89 | 22.3 | 35.9
Key-Yolact without offset branch | 51.54% | 44.92 | 22.3 | 35.9
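The "Key-Yolact with TensorRT" row implies an ONNX export and TensorRT engine-building step. A minimal, hypothetical sketch follows; the stand-in module, the 550×550 input size (Yolact's default), and the opset are assumptions, and in practice the trained Key-Yolact network replaces the stand-in.

```python
# Hedged sketch: serialize a model to ONNX, then build a TensorRT engine.
import torch
import torch.nn as nn

# Stand-in module; in practice this would be the trained Key-Yolact network.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).eval()
dummy = torch.randn(1, 3, 550, 550)  # assumed 550x550 RGB input
torch.onnx.export(model, dummy, "key_yolact.onnx", opset_version=11,
                  input_names=["image"], output_names=["features"])

# Build and benchmark a TensorRT engine from the ONNX file, e.g.:
#   trtexec --onnx=key_yolact.onnx --saveEngine=key_yolact.trt --fp16
```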