Category Level Object Pose Estimation via Global High-Order Pooling
Abstract
1. Introduction
- We propose HoPENet, a network for category-level 6D object pose estimation. HoPENet incorporates a global high-order enhancement module into each stage of the network and exploits global high-order information throughout, modeling complex feature distributions and thereby learning a more discriminative feature representation.
- The global high-order enhancement (GHoE) module injects high-order statistics into the attention mechanism: a global high-order pooling operation captures correlations among features, aggregates global information, and uses the result to enhance the features. By modeling high-order statistics of the entire feature tensor, the module captures long-range statistical correlations and fully exploits contextual information (a minimal sketch of such a block appears after this list).
- We conduct comprehensive ablation studies to validate the effectiveness of HoPENet for 6D object pose estimation. Experimental results on the REAL275 and CAMERA25 datasets show that the proposed method surpasses the baseline model.
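Since this excerpt does not reproduce the GHoE block's internals, the following PyTorch sketch shows one plausible form of covariance-based global enhancement in the spirit of global second-order pooling [28]: features are projected to a smaller width, a covariance matrix is computed over all points, and a row-wise grouped convolution turns that matrix into channel attention weights. The class name, reduction width, and layer choices are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class GlobalCovAttention(nn.Module):
    """Illustrative covariance-based global enhancement block (GSoP-style sketch)."""

    def __init__(self, channels: int, reduced: int = 128):
        super().__init__()
        self.reduce = nn.Conv1d(channels, reduced, kernel_size=1)  # d -> d' keeps the covariance small
        # One grouped filter per covariance row, mirroring GSoP's row-wise convolution.
        self.row_conv = nn.Conv1d(reduced, 4 * reduced, kernel_size=reduced, groups=reduced)
        self.expand = nn.Conv1d(4 * reduced, channels, kernel_size=1)  # back to d channel weights
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, d, N) features for N points (or flattened spatial positions)
        b, d, n = x.shape
        z = self.reduce(x)                               # (B, d', N)
        z = z - z.mean(dim=2, keepdim=True)              # center over the N samples
        cov = torch.bmm(z, z.transpose(1, 2)) / (n - 1)  # (B, d', d') second-order statistics
        w = self.row_conv(cov)                           # (B, 4*d', 1): one descriptor per covariance row
        w = self.sigmoid(self.expand(w))                 # (B, d, 1) attention weights in (0, 1)
        return x * w                                     # globally enhanced features
```

In HoPENet such a block sits at one or more stages of the feature extractor; Sections 4.2.3 and 4.2.4 ablate where and how many.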
2. Related Works
2.1. Category-Level 6D Object Pose Estimation
2.2. High-Order Pooling
3. Method
3.1. High-Order Pose Estimation Network
3.2. Global High-Order Enhancement Block
4. Experiments
4.1. Implementation Details
4.1.1. Datasets
4.1.2. Implementation Details
4.2. Ablation Studies
4.2.1. The Impact of Covariance Size
4.2.2. Fusion of the Number-Wise and Position-Wise GHoE Blocks
4.2.3. The Impact of the Position of the Global High-Order Enhancement Module
4.2.4. The Impact of the Number of Global High-Order Enhancement Modules
4.3. Comparisons with Existing Methods
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Collet, A.; Martinez, M.; Srinivasa, S.S. The MOPED framework: Object recognition and pose estimation for manipulation. Int. J. Robot. Res. 2011, 30, 1284–1306.
2. Tremblay, J.; To, T.; Sundaralingam, B.; Xiang, Y.; Fox, D.; Birchfield, S. Deep object pose estimation for semantic robotic grasping of household objects. arXiv 2018, arXiv:1809.10790.
3. Marchand, E.; Uchiyama, H.; Spindler, F. Pose estimation for augmented reality: A hands-on survey. IEEE Trans. Vis. Comput. Graph. 2015, 22, 2633–2651.
4. Xu, D.; Anguelov, D.; Jain, A. PointFusion: Deep sensor fusion for 3D bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 244–253.
5. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915.
6. Peng, S.; Liu, Y.; Huang, Q.; Zhou, X.; Bao, H. PVNet: Pixel-wise voting network for 6DoF pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 4561–4570.
7. Lin, Y.; Tremblay, J.; Tyree, S.; Vela, P.A.; Birchfield, S. Single-stage keypoint-based category-level object pose estimation from an RGB image. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 1547–1553.
8. Tian, M.; Ang, M.H.; Lee, G.H. Shape prior deformation for categorical 6D object pose and size estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXI; Springer: Berlin/Heidelberg, Germany, 2020; pp. 530–546.
9. Wang, H.; Sridhar, S.; Huang, J.; Valentin, J.; Song, S.; Guibas, L.J. Normalized object coordinate space for category-level 6D object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 2642–2651.
10. Zhang, J.; Wu, M.; Dong, H. Generative category-level object pose estimation via diffusion models. Adv. Neural Inf. Process. Syst. 2024, 36.
11. Tatemichi, H.; Kawanishi, Y.; Deguchi, D.; Ide, I.; Murase, H. Category-level object pose estimation in heavily cluttered scenes by generalized two-stage shape reconstructor. IEEE Access 2024, 12, 33440–33448.
12. Chen, W.; Jia, X.; Chang, H.J.; Duan, J.; Shen, L.; Leonardis, A. FS-Net: Fast shape-based network for category-level 6D object pose estimation with decoupled rotation mechanism. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1581–1590.
13. Lin, J.; Wei, Z.; Li, Z.; Xu, S.; Jia, K.; Li, Y. DualPoseNet: Category-level 6D object pose and size estimation using dual pose network with refined learning of pose consistency. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3560–3569.
14. Wang, C.; Martín-Martín, R.; Xu, D.; Lv, J.; Lu, C.; Fei-Fei, L.; Savarese, S.; Zhu, Y. 6-PACK: Category-level 6D pose tracker with anchor-based keypoints. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 10059–10066.
15. Chen, K.; Dou, Q. SGPA: Structure-guided prior adaptation for category-level 6D object pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2773–2782.
16. Chen, X.; Dong, Z.; Song, J.; Geiger, A.; Hilliges, O. Category level object pose estimation via neural analysis-by-synthesis. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVI; Springer: Berlin/Heidelberg, Germany, 2020; pp. 139–156.
17. Ze, Y.; Wang, X. Category-level 6D object pose estimation in the wild: A semi-supervised learning approach and a new dataset. Adv. Neural Inf. Process. Syst. 2022, 35, 27469–27483.
18. Lee, T.; Lee, B.U.; Shin, I.; Choe, J.; Shin, U.; Kweon, I.S.; Yoon, K.J. UDA-COPE: Unsupervised domain adaptation for category-level object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14891–14900.
19. Lin, J.; Wei, Z.; Ding, C.; Jia, K. Category-level 6D object pose and size estimation using self-supervised deep prior deformation networks. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 19–34.
20. Wang, R.; Wang, X.; Li, T.; Yang, R.; Wan, M.; Liu, W. Query6DoF: Learning sparse queries as implicit shape prior for category-level 6DoF pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 14055–14064.
21. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
22. Ionescu, C.; Vantzos, O.; Sminchisescu, C. Matrix backpropagation for deep networks with structured layers. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2965–2973.
23. Li, P.; Xie, J.; Wang, Q.; Zuo, W. Is second-order information helpful for large-scale visual recognition? In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2070–2078.
24. Lin, T.Y.; RoyChowdhury, A.; Maji, S. Bilinear CNN models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1449–1457.
25. Yu, T.; Cai, Y.; Li, P. Toward faster and simpler matrix normalization via rank-1 update. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIX; Springer: Berlin/Heidelberg, Germany, 2020; pp. 203–219.
26. Li, P.; Xie, J.; Wang, Q.; Gao, Z. Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 947–955.
27. Wang, Q.; Xie, J.; Zuo, W.; Zhang, L.; Li, P. Deep CNNs meet global covariance pooling: Better representation and generalization. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2582–2597.
28. Gao, Z.; Xie, J.; Wang, Q.; Li, P. Global second-order pooling convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 3024–3033.
29. Wang, Q.; Li, P.; Zhang, L. G2DeNet: Global Gaussian distribution embedding network and its application to visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2730–2739.
30. Di, Y.; Zhang, R.; Lou, Z.; Manhardt, F.; Ji, X.; Navab, N.; Tombari, F. GPV-Pose: Category-level object pose estimation via geometry-guided point-wise voting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6781–6791.
31. Lin, H.; Liu, Z.; Cheang, C.; Fu, Y.; Guo, G.; Xue, X. SAR-Net: Shape alignment and recovery network for category-level 6D object pose and size estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6707–6717.
32. Zhang, R.; Di, Y.; Lou, Z.; Manhardt, F.; Tombari, F.; Ji, X. RBP-Pose: Residual bounding box projection for category-level pose estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 655–672.
Table 1. The impact of covariance size (mAP, %, on REAL275).

| Configuration | Size | IoU25 | IoU50 | IoU75 | 5°2 cm | 5°5 cm | 10°2 cm | 10°5 cm |
|---|---|---|---|---|---|---|---|---|
| baseline | - | - | 82.9 | 76.0 | 46.8 | 54.7 | 67.9 | 81.6 |
| number-wise cov size N | 64 × 64 | 84.3 | 81.9 | 75.2 | 48.0 | 55.1 | 69.1 | 81.6 |
| number-wise cov size N | 128 × 128 | 84.3 | 82.7 | 76.2 | 50.5 | 57.9 | 70.6 | 82.5 |
| number-wise cov size N | 256 × 256 | 84.3 | 82.4 | 75.4 | 47.8 | 56.5 | 67.9 | 81.1 |
| position-wise cov size d | 128 × 128 | 84.2 | 81.6 | 75.4 | 47.8 | 55.0 | 69.7 | 81.6 |
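To make the two covariance flavors in Table 1 concrete: reading the row labels under the assumption that the backbone emits point-wise features of shape (B, d, N), a "number-wise" covariance correlates the N points and is N × N, while a "position-wise" covariance correlates the d feature channels and is d × d. A minimal sketch (shapes illustrative):

```python
import torch

def covariance(x: torch.Tensor) -> torch.Tensor:
    """Covariance of the row vectors of x: (B, M, K) -> (B, M, M)."""
    x = x - x.mean(dim=2, keepdim=True)  # center along the K samples
    return torch.bmm(x, x.transpose(1, 2)) / (x.shape[2] - 1)

# Illustrative point features: batch of 4, d = 128 channels, N = 256 points.
feats = torch.randn(4, 128, 256)

# "number-wise": correlations between points -> (B, N, N), here 256 x 256.
cov_number = covariance(feats.transpose(1, 2))

# "position-wise": correlations between channels -> (B, d, d), here 128 x 128.
cov_position = covariance(feats)
```

Under this reading, varying the number-wise covariance size (64/128/256) amounts to computing the matrix over a different number of (sub)sampled points, and 128 × 128 gives the best trade-off in Table 1.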
Table 2. Fusion of the number-wise and position-wise GHoE blocks (mAP, %). The left block of columns reports CAMERA25, the right block REAL275.

| Configuration | IoU25 | IoU50 | IoU75 | 5°2 cm | 5°5 cm | 10°2 cm | 10°5 cm | IoU25 | IoU50 | IoU75 | 5°2 cm | 5°5 cm | 10°2 cm | 10°5 cm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| position-wise | 94.5 | 92.2 | 88.4 | 78.4 | 84.0 | 83.9 | 90.4 | 84.2 | 81.6 | 75.4 | 47.8 | 55.0 | 69.7 | 81.6 |
| number-wise | 94.5 | 92.2 | 88.4 | 78.4 | 84.0 | 83.9 | 90.4 | 84.3 | 82.7 | 76.2 | 50.5 | 57.9 | 70.6 | 82.5 |
| average | 94.5 | 92.6 | 89.3 | 79.8 | 85.2 | 84.8 | 91.3 | 84.3 | 82.3 | 76.2 | 49.0 | 56.7 | 71.1 | 81.7 |
| maximum | 94.5 | 92.5 | 88.7 | 79.2 | 84.4 | 84.6 | 90.8 | 84.3 | 81.6 | 75.0 | 48.7 | 58.9 | 68.5 | 81.9 |
| concatenation | 94.4 | 92.3 | 89.1 | 78.7 | 84.1 | 84.4 | 91.1 | 84.1 | 82.1 | 71.2 | 34.6 | 50.4 | 57.4 | 81.5 |
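The last three rows of Table 2 fuse the outputs of the two GHoE branches. A minimal sketch of the three rules, assuming both branches produce tensors of shape (B, d, N) (variable names are ours, not the paper's):

```python
import torch
import torch.nn as nn

# Outputs of the two GHoE branches (shapes illustrative): (B, d, N).
num_out = torch.randn(4, 128, 256)  # number-wise branch
pos_out = torch.randn(4, 128, 256)  # position-wise branch

fused_avg = 0.5 * (num_out + pos_out)        # "average" row
fused_max = torch.maximum(num_out, pos_out)  # "maximum" row

# "concatenation" doubles the channels, so a 1x1 conv maps back to d.
proj = nn.Conv1d(256, 128, kernel_size=1)
fused_cat = proj(torch.cat([num_out, pos_out], dim=1))
```

In the table, averaging is the strongest overall, while concatenation notably degrades the REAL275 pose metrics.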
Table 3. The impact of the position of the global high-order enhancement module (mAP, %, on REAL275).

| Configuration | IoU25 | IoU50 | IoU75 | 5°2 cm | 5°5 cm | 10°2 cm | 10°5 cm |
|---|---|---|---|---|---|---|---|
| Position1 | 84.3 | 82.7 | 76.0 | 48.0 | 56.5 | 68.7 | 81.7 |
| Position2 | 84.3 | 82.4 | 74.4 | 48.0 | 55.7 | 68.0 | 80.9 |
| Position3 | 84.3 | 82.0 | 75.1 | 48.7 | 56.9 | 70.1 | 80.2 |
| Position4 | 84.3 | 82.1 | 75.2 | 48.5 | 57.0 | 70.4 | 82.6 |
Table 4. The impact of the number of global high-order enhancement modules (mAP, %, on REAL275).

| Configuration | IoU25 | IoU50 | IoU75 | 5°2 cm | 5°5 cm | 10°2 cm | 10°5 cm |
|---|---|---|---|---|---|---|---|
| Position1+2 | 84.2 | 82.8 | 75.6 | 45.2 | 53.7 | 69.4 | 82.6 |
| Position1+3 | 84.3 | 82.8 | 76.4 | 49.4 | 58.9 | 68.9 | 81.5 |
| Position1+4 | 84.3 | 83.0 | 74.9 | 46.9 | 55.9 | 68.7 | 82.6 |
| Position2+3 | 84.3 | 82.6 | 75.3 | 47.1 | 56.0 | 69.8 | 82.5 |
| Position2+4 | 84.3 | 82.2 | 76.0 | 47.7 | 55.5 | 71.0 | 82.0 |
| Position3+4 | 84.3 | 82.7 | 76.6 | 48.8 | 57.3 | 70.6 | 83.7 |
| Position1+2+3 | 84.3 | 81.9 | 75.2 | 47.6 | 55.5 | 69.5 | 81.9 |
| Position1+2+4 | 84.3 | 81.4 | 74.2 | 45.7 | 53.9 | 67.9 | 80.7 |
| Position2+3+4 | 84.3 | 82.4 | 76.1 | 47.4 | 54.8 | 70.0 | 82.2 |
| Position1+2+3+4 | 84.3 | 82.7 | 76.2 | 50.5 | 57.9 | 70.6 | 82.5 |
Table 5. Comparison with existing methods (mAP, %). The left block of columns reports CAMERA25, the right block REAL275. HoPENet uses the number-wise GHoE block alone; HoPENet* fuses the number-wise and position-wise blocks by averaging (cf. Section 4.2.2).

| Method | IoU50 | IoU75 | 5°2 cm | 5°5 cm | 10°2 cm | 10°5 cm | IoU50 | IoU75 | 5°2 cm | 5°5 cm | 10°2 cm | 10°5 cm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NOCS [9] | 83.9 | 69.5 | 32.3 | 40.9 | 48.2 | 64.6 | 78.0 | 30.1 | 7.2 | 10.0 | 13.8 | 25.2 |
| DualPoseNet [13] | 92.4 | 86.4 | 64.7 | 70.7 | 77.2 | 84.7 | 79.8 | 62.2 | 29.3 | 35.9 | 50.0 | 66.8 |
| GPV-Pose [30] | 93.4 | 88.3 | 72.1 | 79.1 | - | 89.0 | 83.0 | 64.4 | 32.0 | 42.9 | - | 73.3 |
| SPD [8] | 93.2 | 83.1 | 54.3 | 59.0 | 73.3 | 81.5 | 77.3 | 53.2 | 19.3 | 21.4 | 43.2 | 54.1 |
| SAR-Net [31] | 86.8 | 79.0 | 66.7 | 70.9 | 75.6 | 80.3 | 79.3 | 62.4 | 31.6 | 42.3 | 50.3 | 68.3 |
| SGPA [15] | 93.2 | 88.1 | 70.7 | 74.5 | 82.7 | 88.4 | 80.1 | 61.9 | 35.9 | 39.6 | 61.3 | 70.7 |
| RBP-Pose [32] | 93.1 | 89.0 | 73.5 | 79.6 | 82.1 | 89.5 | - | 67.8 | 38.2 | 48.1 | 63.1 | 79.2 |
| Query6DoF [20] | 92.3 | 88.6 | 78.4 | 83.9 | 84.0 | 90.5 | 82.9 | 76.0 | 46.8 | 54.7 | 67.9 | 81.6 |
| HoPENet | 92.2 | 88.4 | 78.4 | 84.0 | 83.9 | 90.4 | 82.7 | 76.2 | 50.5 | 57.9 | 70.6 | 82.5 |
| HoPENet* | 92.6 | 89.3 | 79.8 | 85.2 | 84.8 | 91.3 | 82.3 | 76.2 | 49.0 | 56.7 | 71.1 | 81.7 |