FusionNetV2: Explicit Enhancement of Edge Features for 6D Object Pose Estimation
Abstract
1. Introduction
- We propose FusionNetV2, which improves FusionNet's ability to learn the edge features that are closely related to 6D object pose estimation.
- We propose an attention block, called the edge boosting block (EBB), that is simple to implement and specialized for enhancing edge features (a rough illustrative sketch of the idea is given after this list).
- The performance of FusionNetV2 is validated on a benchmark dataset from various aspects. The experiments show that FusionNetV2 outperforms FusionNet in 6D object pose estimation.
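The EBB itself is detailed in Section 5. As a rough, hedged illustration of the general idea of explicitly boosting edge responses inside a CNN stage, the PyTorch-style sketch below applies fixed Sobel filters to a feature map and uses the resulting edge magnitude as a multiplicative gate. The class name, internals, and residual gating here are illustrative assumptions, not the exact EBB implementation.

```python
# Illustrative sketch only (assumption): a Sobel-based edge-boosting attention block.
# The actual EBB in FusionNetV2 may differ in structure and placement.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EdgeBoostingBlockSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.channels = channels
        # Fixed 3x3 Sobel kernels, applied depthwise to every channel.
        gx = torch.tensor([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
        gy = gx.t()
        kernel = torch.stack([gx, gy]).unsqueeze(1).repeat(channels, 1, 1, 1)  # (2C, 1, 3, 3)
        self.register_buffer("sobel", kernel)
        # Lightweight 1x1 conv + BN + sigmoid turns edge magnitude into a per-pixel gate.
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Depthwise Sobel responses in x and y for each feature channel.
        g = F.conv2d(x, self.sobel, padding=1, groups=self.channels)
        edge = torch.sqrt(g[:, 0::2] ** 2 + g[:, 1::2] ** 2 + 1e-6)  # edge magnitude, (N, C, H, W)
        # Residual gating: emphasize feature responses near object edges.
        return x + x * self.gate(edge)


# Example: boosting edge features in a 64-channel stage output.
if __name__ == "__main__":
    feat = torch.randn(2, 64, 56, 56)
    out = EdgeBoostingBlockSketch(64)(feat)  # same shape as feat
```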
2. Problem Definition
3. Related Work
3.1. CNN-Based Method
3.2. Transformer-Based Method
3.3. Hybrid Method
4. Revisiting FusionNet
5. FusionNetV2
6. Experimental Results and Discussion
6.1. Experimental Setup
6.2. Ablation Study
- Without pre-training, the accuracy of EPro-PnP and FusionNet was not very high: their ADD-0.1d scores remained at 73.78 and 83.07, respectively, although FusionNet significantly increased the ADD-0.1d score over EPro-PnP by employing the GDE and AtBs. (The ADD metric used throughout is restated in the sketch after this list.)
- The accuracy of FusionNetV2 depended on the stage in which the EBB was placed. When the EBB was placed in a single stage only, Stage 2 was the most effective, with the ADD-0.1d score reaching 87.09, approximately 4 points higher than FusionNet even with only one EBB. However, placing one EBB in Stage 1 or Stage 3 did not reach ADD scores equivalent to FusionNet's.
- When EBBs were placed over multiple stages, additionally placing EBBs in Stage 3 or Stage 4 did not improve accuracy and could even impair it, indicating that applying EBBs, which have a small receptive field, to deep features associated with semantic information can cause the loss of important semantic information. Placing EBBs in Stage 1 and Stage 2, the stages responsible for extracting shallow features, achieved the highest accuracy, with the ADD-0.1d score reaching 90.18; this configuration achieved mean ADD scores approximately 18.8 and 9.7 points higher than EPro-PnP and FusionNet, respectively.
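For reference, the ADD(-S) scores reported in this section follow the standard definition on LINEMOD-style data: the 3D model points are transformed by the ground-truth and estimated poses, and a pose counts as correct when the mean point-to-point distance is below k times the object diameter (k = 0.02, 0.05, 0.1 for ADD-0.02d/0.05d/0.1d); symmetric objects such as the egg box and glue use the closest-point variant (ADD-S). The NumPy sketch below restates this standard definition; the function names are illustrative.

```python
# Standard ADD / ADD-S accuracy, restated as a sketch (function names are illustrative).
import numpy as np


def add_error(model_pts, R_gt, t_gt, R_est, t_est):
    """Mean distance between model points under the ground-truth and estimated poses."""
    pts_gt = model_pts @ R_gt.T + t_gt     # (N, 3) points under the ground-truth pose
    pts_est = model_pts @ R_est.T + t_est  # (N, 3) points under the estimated pose
    return np.linalg.norm(pts_gt - pts_est, axis=1).mean()


def adds_error(model_pts, R_gt, t_gt, R_est, t_est):
    """ADD-S for symmetric objects: closest-point distance instead of index-wise distance."""
    pts_gt = model_pts @ R_gt.T + t_gt
    pts_est = model_pts @ R_est.T + t_est
    # For each ground-truth point, distance to the nearest estimated point.
    d = np.linalg.norm(pts_gt[:, None, :] - pts_est[None, :, :], axis=-1)
    return d.min(axis=1).mean()


def add_accuracy(errors, diameters, k=0.1):
    """Percentage of samples whose ADD(-S) error is below k * object diameter (e.g., ADD-0.1d)."""
    errors, diameters = np.asarray(errors), np.asarray(diameters)
    return 100.0 * float(np.mean(errors < k * diameters))
```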
6.3. Training Stability and Generalization Ability
6.4. Using Edge Features for 6D Object Pose Estimation
6.5. Inference Time
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Full term
---|---
PnP | perspective-n-point
CNN | convolutional neural network
NLP | natural language processing
ViT | Vision Transformer
AtB | attention block
EBB | edge boosting block
GDE | global dependency encoder
BN | batch normalization
2DRE | 2D reprojection error
LT | linear Transformer
FFF | fast feedforward network
References
- Lepetit, V.; Moreno-Noguer, F.; Fua, P. EPnP: An accurate O(n) solution to the PnP problem. Int. J. Comput. Vis. 2009, 81, 155–166. [Google Scholar] [CrossRef]
- Hosang, J.; Benenson, R.; Schiele, B. Learning non-maximum suppression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4507–4515. [Google Scholar]
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
- Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; Wei, Y. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3588–3597. [Google Scholar]
- Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7151–7160. [Google Scholar]
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
- Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 1971–1980. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
- Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv 2018, arXiv:1804.07461. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Hatamizadeh, A.; Yin, H.; Heinrich, G.; Kautz, J.; Molchanov, P. Global context vision transformers. In Proceedings of the International Conference on Machine Learning (ICML’23), Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
- Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. CvT: Introducing Convolutions to Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual Conference, 11–17 October 2021; pp. 22–31. [Google Scholar] [CrossRef]
- Ye, Y.; Park, H. FusionNet: An End-to-End Hybrid Model for 6D Object Pose Estimation. Electronics 2023, 12, 4162. [Google Scholar] [CrossRef]
- Chen, H.; Wang, P.; Wang, F.; Tian, W.; Xiong, L.; Li, H. EPro-PnP: Generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2781–2790. [Google Scholar]
- Wang, Y.; Jiang, X.; Fujita, H.; Fang, Z.; Qiu, X.; Chen, J. EFN6D: An efficient RGB-D fusion network for 6D pose estimation. J. Ambient. Intell. Humaniz. Comput. 2022, 15, 75–88. [Google Scholar] [CrossRef]
- Dam, T.; Dharavath, S.B.; Alam, S.; Lilith, N.; Chakraborty, S.; Feroskhan, M. AYDIV: Adaptable Yielding 3D Object Detection via Integrated Contextual Vision Transformer. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 10657–10664. [Google Scholar]
- Rad, M.; Lepetit, V. BB8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3828–3836. [Google Scholar]
- Peng, S.; Liu, Y.; Huang, Q.; Zhou, X.; Bao, H. PVNet: Pixel-wise voting network for 6DoF pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4561–4570. [Google Scholar]
- Pavlakos, G.; Zhou, X.; Chan, A.; Derpanis, K.G.; Daniilidis, K. 6-DoF object pose from semantic keypoints. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 2011–2018. [Google Scholar]
- Zhao, Z.; Peng, G.; Wang, H.; Fang, H.; Li, C.; Lu, C. Estimating 6D pose from localizing designated surface keypoints. arXiv 2018, arXiv:1812.01387. [Google Scholar]
- Ullah, F.; Wei, W.; Daradkeh, Y.I.; Javed, M.; Rabbi, I.; Al Juaid, H. A Robust Convolutional Neural Network for 6D Object Pose Estimation from RGB Image with Distance Regularization Voting Loss. Sci. Program. 2022, 2022, 2037141. [Google Scholar] [CrossRef]
- Oberweger, M.; Rad, M.; Lepetit, V. Making deep heatmaps robust to partial occlusions for 3D object pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 119–134. [Google Scholar]
- Haugaard, R.L.; Buch, A.G. SurfEmb: Dense and Continuous Correspondence Distributions for Object Pose Estimation with Learnt Surface Embeddings. arXiv 2021, arXiv:2111.13489. [Google Scholar]
- Hai, Y.; Song, R.; Li, J.; Ferstl, D.; Hu, Y. Pseudo Flow Consistency for Self-Supervised 6D Object Pose Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 14075–14085. [Google Scholar]
- Yang, X.; Li, K.; Wang, J.; Fan, X. ER-Pose: Learning edge representation for 6D pose estimation of texture-less objects. Neurocomputing 2023, 515, 13–25. [Google Scholar] [CrossRef]
- Li, F.; Vutukur, S.R.; Yu, H.; Shugurov, I.; Busam, B.; Yang, S.; Ilic, S. NeRF-Pose: A first-reconstruct-then-regress approach for weakly-supervised 6D object pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 2123–2133. [Google Scholar]
- Wu, Y.; Greenspan, M. Learning Better Keypoints for Multi-Object 6DoF Pose Estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–6 January 2024; pp. 564–574. [Google Scholar]
- Jantos, T.G.; Hamdad, M.A.; Granig, W.; Weiss, S.; Steinbrener, J. PoET: Pose Estimation Transformer for Single-View, Multi-Object 6D Pose Estimation. In Proceedings of the Conference on Robot Learning. PMLR, Atlanta, GA, USA, 6–9 November 2023; pp. 1060–1070. [Google Scholar]
- Periyasamy, A.S.; Amini, A.; Tsaturyan, V.; Behnke, S. YOLOPose V2: Understanding and improving transformer-based 6D pose estimation. Robot. Auton. Syst. 2023, 168, 104490. [Google Scholar] [CrossRef]
- Zhang, Z.; Chen, W.; Zheng, L.; Leonardis, A.; Chang, H.J. Trans6D: Transformer-Based 6D Object Pose Estimation and Refinement. In Computer Vision—ECCV 2022 Workshops; Karlinsky, L., Michaeli, T., Nishino, K., Eds.; Springer: Cham, Switzerland, 2023; pp. 112–128. [Google Scholar]
- Castro, P.; Kim, T.K. CRT-6D: Fast 6D object pose estimation with cascaded refinement transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 5746–5755. [Google Scholar]
- Wen, B.; Yang, W.; Kautz, J.; Birchfield, S. FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects. arXiv 2023, arXiv:2312.08344. [Google Scholar]
- Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Gool, L.V.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Virtual Conference, 11–17 October 2021; pp. 1833–1844. [Google Scholar] [CrossRef]
- Li, X.; Xiang, Y.; Li, S. Combining convolutional and vision transformer structures for sheep face recognition. Comput. Electron. Agric. 2023, 205, 107651. [Google Scholar] [CrossRef]
- He, L.; He, L.; Peng, L. CFormerFaceNet: Efficient Lightweight Network Merging a CNN and Transformer for Face Recognition. Appl. Sci. 2023, 13, 6506. [Google Scholar] [CrossRef]
- Mogan, J.N.; Lee, C.P.; Lim, K.M.; Ali, M.; Alqahtani, A. Gait-CNN-ViT: Multi-Model Gait Recognition with Convolutional Neural Networks and Vision Transformer. Sensors 2023, 23, 3809. [Google Scholar] [CrossRef] [PubMed]
- Lin, Y.; Zhang, D.; Fang, X.; Chen, Y.; Cheng, K.T.; Chen, H. Rethinking Boundary Detection in Deep Learning Models for Medical Image Segmentation. In International Conference on Information Processing in Medical Imaging; Springer: Cham, Switzerland, 2023; pp. 730–742. [Google Scholar]
- Kanopoulos, N.; Vasanthavada, N.; Baker, R.L. Design of an image edge detection filter using the Sobel operator. IEEE J. Solid-State Circuits 1988, 23, 358–367. [Google Scholar] [CrossRef]
- Hinterstoisser, S.; Cagniart, C.; Ilic, S.; Sturm, P.; Navab, N.; Fua, P.; Lepetit, V. Gradient response maps for real-time detection of textureless objects. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 876–888. [Google Scholar] [CrossRef] [PubMed]
- Brachmann, E.; Michel, F.; Krull, A.; Yang, M.Y.; Gumhold, S. Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26–27 June 2016; pp. 3364–3372. [Google Scholar]
- Li, Z.; Wang, G.; Ji, X. CDPN: Coordinates-based disentangled pose network for real-time RGB-based 6-DoF object pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7678–7687. [Google Scholar]
- Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual Event, 13–18 July 2020; pp. 5156–5165. [Google Scholar]
- Belcak, P.; Wattenhofer, R. Fast feedforward networks. arXiv 2023, arXiv:2308.14711. [Google Scholar]
Method | ADD(-S) 0.02d | ADD(-S) 0.05d | ADD(-S) 0.1d | Mean
---|---|---|---|---
EPro-PnP | 12.05 | 43.79 | 73.78 | 43.37
FusionNet | 19.32 (+6.78) | 55.15 (+11.36) | 83.07 (+9.29) | 52.51 (+9.14)
FusionNetV2 (s-1) | 15.3 (+3.25) | 49.18 (+5.39) | 77.62 (+3.84) | 47.37 (+4.00)
FusionNetV2 (s-2) | 22.62 (+10.57) | 61.11 (+17.32) | 87.09 (+13.31) | 57.21 (+13.84)
FusionNetV2 (s-3) | 17.52 (+5.47) | 52.52 (+8.73) | 80.03 (+6.25) | 50.02 (+6.65)
FusionNetV2 (s-1,2) | 28.54 (+16.49) | 67.79 (+23.9) | 90.18 (+16.4) | 62.17 (+18.8)
FusionNetV2 (s-1,2,3) | 18.16 (+6.11) | 53.81 (+10.32) | 81.68 (+7.9) | 51 (+7.63)
FusionNetV2 (s-1,2,3,4) | 16.8 (+4.75) | 53.59 (+9.8) | 81.33 (+7.55) | 50.57 (+7.2)

Values in parentheses are gains over EPro-PnP.
Per-object ADD-0.1d scores (%):

Object | EPro-PnP | FusionNet | FusionNetV2 (s-1) | FusionNetV2 (s-2) | FusionNetV2 (s-3) | FusionNetV2 (s-1,2) | FusionNetV2 (s-1,2,3) | FusionNetV2 (s-1,2,3,4)
---|---|---|---|---|---|---|---|---
Ape | 53.14 | 64.29 | 52.76 | 60.48 | 54.67 | 76.19 | 59.90 | 62.19
Bench vise | 88.17 | 90.20 | 88.94 | 93.50 | 90.59 | 95.93 | 90.69 | 89.72
Camera | 65.49 | 79.02 | 72.74 | 85.78 | 77.06 | 90.00 | 77.25 | 77.65
Can | 75.10 | 84.25 | 80.22 | 95.47 | 82.97 | 92.22 | 84.45 | 89.57
Cat | 58.38 | 74.65 | 65.07 | 80.64 | 70.56 | 85.13 | 71.56 | 78.14
Driller | 78.39 | 86.92 | 82.85 | 92.17 | 87.12 | 93.46 | 86.32 | 87.71
Duck | 60.28 | 66.67 | 58.50 | 70.99 | 54.65 | 77.37 | 64.13 | 38.69
Egg box | 97.56 | 99.25 | 99.34 | 99.34 | 99.15 | 99.34 | 99.44 | 98.78
Glue | 80.12 | 89.67 | 85.81 | 94.59 | 88.80 | 95.37 | 90.15 | 93.34
Hole puncher | 62.32 | 78.40 | 66.79 | 82.21 | 71.36 | 87.63 | 70.50 | 75.74
Iron | 87.54 | 92.44 | 88.66 | 93.16 | 90.50 | 94.18 | 90.19 | 92.65
Lamp | 88.87 | 96.74 | 93.76 | 98.94 | 95.97 | 98.27 | 96.55 | 94.63
Phone | 63.83 | 77.43 | 73.47 | 84.89 | 76.96 | 87.25 | 80.74 | 78.47
Mean | 73.78 | 83.07 | 77.62 | 87.09 | 80.03 | 90.18 | 81.68 | 81.33
Stdev. | 14.13 | 10.84 | 14.09 | 11.40 | 14.28 | 7.31 | 12.32 | 16.28
Object | EPro-PnP | FusionNet | FusionNetV2 (s-1) | FusionNetV2 (s-2) | FusionNetV2 (s-3) | FusionNetV2 (s-1,2) | FusionNetV2 (s-1,2,3) | FusionNetV2 (s-1,2,3,4)
---|---|---|---|---|---|---|---|---
Ape | 97.52 | 98.00 | 97.81 | 98.86 | 97.81 | 98.48 | 98.00 | 98.38
Bench vise | 96.70 | 95.64 | 95.64 | 96.99 | 96.31 | 98.55 | 96.51 | 97.67
Camera | 95.10 | 98.43 | 97.75 | 99.12 | 98.04 | 99.12 | 97.75 | 97.94
Can | 93.11 | 96.75 | 94.39 | 98.62 | 96.75 | 99.02 | 96.36 | 97.93
Cat | 98.20 | 99.20 | 99.10 | 99.30 | 99.20 | 99.40 | 99.30 | 98.80
Driller | 90.39 | 94.65 | 91.77 | 95.34 | 93.16 | 96.23 | 93.46 | 89.40
Duck | 98.40 | 98.22 | 98.22 | 98.22 | 98.50 | 98.69 | 98.59 | 98.50
Egg box | 98.59 | 99.06 | 98.87 | 99.15 | 98.69 | 99.15 | 98.69 | 98.97
Glue | 93.63 | 97.88 | 95.95 | 98.36 | 96.53 | 98.17 | 95.56 | 97.68
Hole puncher | 98.86 | 98.95 | 99.05 | 99.81 | 98.95 | 99.71 | 99.33 | 99.52
Iron | 91.52 | 94.59 | 92.75 | 95.81 | 93.77 | 95.20 | 92.85 | 94.79
Lamp | 90.88 | 94.82 | 91.36 | 97.60 | 93.57 | 97.12 | 94.72 | 94.72
Phone | 92.26 | 96.32 | 93.77 | 97.17 | 95.56 | 98.39 | 96.69 | 95.94
Mean | 95.01 | 97.12 | 95.88 | 98.03 | 96.68 | 98.25 | 96.75 | 96.94
Stdev. | 3.19 | 1.74 | 2.83 | 1.37 | 2.12 | 1.31 | 2.13 | 2.73
Model | FFF | LT | Time 1 (s) | ADD-0.1d
---|---|---|---|---
FusionNet | - | - | 0.004 | 83.07
FusionNet | √ | - | 0.009 | 81.8
FusionNet | - | √ | 0.003 | 81.05
FusionNetV2 | - | - | 0.004 | 90.18
FusionNetV2 | √ | - | 0.009 | 82.58
FusionNetV2 | - | √ | 0.003 | 81.69
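The FFF and LT rows above correspond to replacing components of the model with a fast feedforward network (Belcak and Wattenhofer) and a linear Transformer (Katharopoulos et al.), respectively, presumably swapped into the network's Transformer-style components such as the GDE. As a hedged illustration of why linear attention changes inference time, the sketch below contrasts standard softmax attention, which is quadratic in the number of tokens, with the kernelized formulation using the elu(x)+1 feature map, which is linear; this is not the authors' exact code.

```python
# Hedged illustration: softmax attention (quadratic in token count) vs. the linear
# attention of Katharopoulos et al., using phi(x) = elu(x) + 1. Not the authors' GDE code.
import torch
import torch.nn.functional as F


def softmax_attention(q, k, v):
    # q, k, v: (B, N, D). Builds an explicit (N, N) attention matrix -> O(N^2).
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    return attn @ v


def linear_attention(q, k, v, eps=1e-6):
    # Kernelized attention: phi(Q) @ (phi(K)^T V), never forming the (N, N) matrix -> O(N).
    q = F.elu(q) + 1
    k = F.elu(k) + 1
    kv = k.transpose(-2, -1) @ v                                          # (B, D, D)
    z = 1.0 / (q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps)   # (B, N, 1) normalizer
    return (q @ kv) * z


# Example: for long token sequences the linear variant avoids the N x N attention matrix.
if __name__ == "__main__":
    q = torch.randn(1, 4096, 64)
    k = torch.randn(1, 4096, 64)
    v = torch.randn(1, 4096, 64)
    out = linear_attention(q, k, v)  # (1, 4096, 64)
```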
Share and Cite
Ye, Y.; Park, H. FusionNetV2: Explicit Enhancement of Edge Features for 6D Object Pose Estimation. Electronics 2024, 13, 3736. https://doi.org/10.3390/electronics13183736