A Transformer-Based Capsule Network for 3D Part–Whole Relationship Learning
Abstract
1. Introduction
- 3D shape Transformer. We propose a novel self-attention computation based on local shape representation. It mirrors standard 1D self-attention: each local shape on the mesh model's surface is treated as a token, and a matching similarity measure is designed for it. In this way, the well-known 1D Transformer used in NLP can be adapted to 3D mesh tasks.
- Multi-head shape attention layer. We propose a multi-head shape attention mechanism that forms multiple subspaces, letting the model attend to different aspects of the information. This broadens the possible combinations of underlying local shapes and makes the local combination information learned by the model more accurate.
- Vector representation. Based on 3D mesh data, we propose a new primary-capsule construction method that improves the performance of the capsule network.
- 3D vector-type network. We construct a novel vector-type mesh trans-capsule neural network and apply it to the recognition of three-dimensional deformable models. Experiments show that, compared with other methods, our network better respects the geometric characteristics of the model itself and achieves stronger classification performance and learning ability.
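The multi-head shape attention described above follows the general multi-head self-attention pattern. The sketch below is a minimal NumPy illustration under stated assumptions: the token dimension, head count, and random projections are illustrative placeholders, and the paper's mesh-specific similarity measure is replaced here by the ordinary scaled dot product.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_shape_attention(tokens, Wq, Wk, Wv, num_heads):
    """Multi-head scaled dot-product attention over a sequence of
    local-shape tokens (one token per mesh surface patch).

    tokens: (n_tokens, d_model) array of local shape descriptors.
    Wq/Wk/Wv: (d_model, d_model) projection matrices.
    """
    n, d_model = tokens.shape
    d_head = d_model // num_heads

    def split(x):
        # Split the projection into heads: (num_heads, n, d_head).
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(tokens @ Wq), split(tokens @ Wk), split(tokens @ Wv)
    # Per-head attention weights over all local shapes.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    out = attn @ v                          # (num_heads, n, d_head)
    # Concatenate heads back to (n, d_model).
    return out.transpose(1, 0, 2).reshape(n, d_model)

# Toy example: 8 local-shape tokens of dimension 16, 4 heads.
rng = np.random.default_rng(0)
d_model, n_tokens, heads = 16, 8, 4
tokens = rng.normal(size=(n_tokens, d_model))
W = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3)]
out = multi_head_shape_attention(tokens, *W, num_heads=heads)
```

Each head attends in its own subspace, which is what allows the layer to capture several distinct ways of combining the underlying local shapes.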
2. Related Work
2.1. Transformer for 2D Vision
2.2. Transformer for 3D Vision
2.3. Vector Networks for 3D Vision
3. Methods
3.1. 3D Mesh Trans-Capsule Networks Architecture
3.2. 3D Shape Transformer
3.2.1. The Definition of Local Shape Token
3.2.2. The Definition of Typical Local Shape Token
3.2.3. 3D Shape Transformer
3.3. ShapesToShapes and ShapesToObject
3.3.1. Multi-Head Shape Attention Layer
3.3.2. Routing-by-Agreement
4. Results
4.1. SHREC10, SHREC15 and Manifold40
4.2. Shape Classification
4.3. Analysis of Parameters
4.4. Robustness on Different Resolutions
4.5. Robustness on Different Ratios of Training Set/Test Set
4.6. Ablation Experiment of Each Part in the Network Structure
4.7. Time and Space Complexity
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Hinton, G. Some demonstrations of the effects of structural descriptions in mental imagery. Cogn. Sci. 1979, 3, 231–250. [Google Scholar] [CrossRef]
- Hanocka, R.; Hertz, A.; Fish, N.; Giryes, R.; Fleishman, S.; Cohen-Or, D. MeshCNN: A Network with an Edge. ACM Trans. Graph. 2019, 38, 90.1–90.12. [Google Scholar] [CrossRef] [Green Version]
- Baker, N.; Lu, H.; Erlikhman, G.; Kellman, P.J.; Einhauser, W. Deep convolutional networks do not classify based on global object shape. PLoS Comput. Biol. 2018, 14, e1006613. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Kucker, S.C.; Samuelson, L.K.; Perry, L.K.; Yoshida, H.; Smith, L.B. Reproducibility and a unifying explanation: Lessons from the shape bias. Infant Behav. Dev. 2018, 54, 156–165. [Google Scholar] [CrossRef] [PubMed]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Kosiorek, A.; Sabour, S.; Teh, Y.W.; Hinton, G.E. Stacked Capsule Autoencoders. In Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 15512–15522. [Google Scholar]
- Zhao, Y.; Birdal, T.; Deng, H.; Tombari, F. 3D Point Capsule Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1009–1018. [Google Scholar] [CrossRef]
- Sabour, S.; Frosst, N.; Hinton, G. Dynamic Routing between Capsules. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef] [Green Version]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [Green Version]
- He, J.; Chen, J.; Liu, S.; Kortylewski, A. TransFG: A Transformer Architecture for Fine-grained Recognition. arXiv 2021, arXiv:2103.07976. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. arXiv 2020, arXiv:2005.12872. [Google Scholar]
- Hermosilla, P.; Ritschel, T.; Vázquez, P.P.; Vinacua, A.; Ropinski, T. Monte Carlo Convolution for Learning on Non-Uniformly Sampled Point Clouds. ACM Trans. Graph. 2018, 37, 1–12. [Google Scholar] [CrossRef] [Green Version]
- Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. PointCNN: Convolution On X-Transformed Points. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31. [Google Scholar]
- Feng, Y.; Feng, Y.; You, H.; Zhao, X.; Gao, Y. MeshNet: Mesh Neural Network for 3D Shape Representation. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 8279–8286. [Google Scholar]
- Biasotti, S.; Cerri, A.; Aono, M.; Hamza, A.B. Retrieval and classification methods for textured 3D models: A comparative study. Vis. Comput. 2016, 32, 217–241. [Google Scholar] [CrossRef]
- Rodolà, E.; Cosmo, L.; Litany, O.; Bronstein, M.M.; Bronstein, A.M.; Audebert, N.; Hamza, A.B.; Boulch, A.; Castellani, U.; Do, M.N.; et al. Deformable Shape Retrieval with Missing Parts: SHREC’17. In Workshop on 3D Object Retrieval; Eurographics Association: Goslar, Germany, 2017; pp. 85–94. [Google Scholar] [CrossRef]
- Guo, M.; Cai, J.; Liu, Z.; Mu, T.; Martin, R.R.; Hu, S. PCT: Point Cloud Transformer. arXiv 2020, arXiv:2012.09688. [Google Scholar] [CrossRef]
- Zhao, H.; Jiang, L.; Jia, J.; Torr, P.; Koltun, V. Point Transformer. arXiv 2020, arXiv:2012.09164. [Google Scholar]
- Lin, K.; Wang, L.; Liu, Z. Mesh Graphormer. arXiv 2021, arXiv:2104.00272. [Google Scholar]
- Marcos, D.; Volpi, M.; Komodakis, N.; Tuia, D. Rotation Equivariant Vector Field Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Hinton, G.E.; Krizhevsky, A.; Wang, S.D. Transforming Auto-Encoders. In Proceedings of ICANN 2011: 21st International Conference on Artificial Neural Networks—Volume Part I, Espoo, Finland, 14–17 June 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 44–51. [Google Scholar]
- Srivastava, N.; Goh, H.; Salakhutdinov, R. Geometric Capsule Autoencoders for 3D Point Clouds. arXiv 2019, arXiv:1912.03310. [Google Scholar]
- Lenssen, J.E.; Fey, M.; Libuschewski, P. Group Equivariant Capsule Networks. In Advances in Neural Information Processing Systems; Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31, pp. 8844–8853. [Google Scholar]
- Wang, D.; Liu, Q. An Optimization View on Dynamic Routing between Capsules. In Proceedings of the ICLR 2018 Workshop, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Hinton, G.E.; Sabour, S.; Frosst, N. Matrix capsules with EM routing. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Jaiswal, A.; AbdAlmageed, W.; Wu, Y.; Natarajan, P. CapsuleGAN: Generative Adversarial Capsule Network. In Proceedings of the Computer Vision—ECCV 2018 Workshops; Leal-Taixé, L., Roth, S., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 526–535. [Google Scholar]
- Zhao, Y.; Birdal, T.; Lenssen, J.E.; Menegatti, E.; Guibas, L.; Tombari, F. Quaternion Equivariant Capsule Networks for 3D Point Clouds. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 1–19. [Google Scholar]
- Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; Xiao, J. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE Computer Society: Los Alamitos, CA, USA, 2015; pp. 1912–1920. [Google Scholar] [CrossRef] [Green Version]
- Hu, S.; Liu, Z.; Guo, M.; Cai, J.; Huang, J.; Mu, T.; Martin, R.R. Subdivision-Based Mesh Convolution Networks. arXiv 2021, arXiv:2106.02285. [Google Scholar] [CrossRef]
- Reuter, M.; Wolter, F.; Peinecke, N. Laplace-Beltrami spectra as ’Shape-DNA’ of surfaces and solids. Comput.-Aided Des. 2006, 38, 342–366. [Google Scholar] [CrossRef]
- Gao, Z.; Yu, Z.; Pang, X. A compact shape descriptor for triangular surface meshes. Comput.-Aided Des. 2014, 53, 62–69. [Google Scholar] [CrossRef] [Green Version]
- Rustamov, R.M. Laplace-Beltrami eigenfunctions for deformation invariant shape representation. In Proceedings of the Fifth Eurographics Symposium on Geometry Processing, Barcelona, Spain, 4–6 July 2007; pp. 225–233. [Google Scholar]
- Han, Z.; Liu, Z.; Vong, C.M.; Liu, Y.; Bu, S.; Han, J.; Chen, C.L.P. BoSCC: Bag of Spatial Context Correlations for Spatially Enhanced 3D Shape Representation. IEEE Trans. Image Process. 2017, 26, 3707–3720. [Google Scholar] [CrossRef] [PubMed]
- Chen, Y.; Zhao, J.; Shi, C.; Yuan, D. Mesh Convolution: A Novel Feature Extraction Method for 3D Nonrigid Object Classification. IEEE Trans. Multimed. 2020, 23, 3098–3111. [Google Scholar] [CrossRef]
- Bronstein, M.M.; Kokkinos, I. Scale-invariant heat kernel signatures for non-rigid shape recognition. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 1704–1711. [Google Scholar]
- Aubry, M.; Schlickewei, U.; Cremers, D. The wave kernel signature: A quantum mechanical approach to shape analysis. In Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6–13 November 2011; pp. 1626–1633. [Google Scholar]
- Charles, R.Q.; Su, H.; Kaichun, M.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 77–85. [Google Scholar]
- Charles, R.Q.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5099–5108. [Google Scholar]
- Xu, Y.; Fan, T.; Xu, M.; Zeng, L.; Qiao, Y. SpiderCNN: Deep Learning on Point Sets with Parameterized Convolutional Filters. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
- Haim, N.; Segol, N.; Ben-Hamu, H.; Maron, H.; Lipman, Y. Surface Networks via General Covers. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 632–641. [Google Scholar] [CrossRef] [Green Version]
- Lahav, A.; Tal, A. MeshWalker: Deep Mesh Understanding by Random Walks. ACM Trans. Graph. 2020, 39, 1–13. [Google Scholar] [CrossRef]
- Garland, M.; Heckbert, P.S. Surface simplification using quadric error metrics. In Proceedings of the Siggraph, Los Angeles, CA, USA, 3–8 August 1997; pp. 209–216. [Google Scholar]
Classification accuracy on SHREC10:

Methods | Input | Accuracy (%)
---|---|---
Shape-DNA [33] | point | 82.67
cShape-DNA [34] | point | 78.50
GPS-embedding [35] | point | 87.17
BoW [36] | feature | 65.94
BoSCCs [36] | feature | 85.99
3D MeshConv [37] | mesh | 94.37
Our method | mesh | 99.95
Classification accuracy on SHREC15:

Methods | Input | Accuracy (%)
---|---|---
SVM+HKS [38] | feature | 56.9
SVM+WKS [39] | feature | 87.5
Shape-DNA [33] | point | 64.55
cShape-DNA [34] | point | 76.21
GPS-embedding [35] | point | 75.13
PointNet [40] | point | 69.4
PointNet++ [41] | point | 60.2
SpiderCNN [42] | mesh | 95.8
3D MeshConv [37] | mesh | 97.3
Our method | mesh | 99.90
Classification accuracy (%) on ModelNet40 and Manifold40:

Methods | ModelNet40 | Manifold40
---|---|---
PointNet++ [41] | 91.7 | 87.9
PCT [20] | 93.2 | 92.4
SNGC [43] | 91.6 | -
MeshNet [17] | 91.9 | 88.4
MeshWalker [44] | 92.3 | 90.5
Our method | - | 93.7
Classification accuracy (%) at different mesh resolutions:

Dataset | 10,000 points | 2500 points | 700 points
---|---|---|---
SHREC10 | 99.97 | 99.95 | 99.95
SHREC15 | 99.94 | 99.93 | 99.90
Ablation results (accuracy, %) for each part of the network structure:

3D ShapeTransformer (input representation):

Input | SHREC10 | SHREC15
---|---|---
3D points (3 dimensions) | 58.7 | 90.2
Surface parameters (10 dimensions) | 89.7 | 93.8
ShapeTransformer value (N dimensions) | 99.95 | 99.90

ShapesToShapes:

Variant | SHREC10 | SHREC15
---|---|---
ShapeTransformer value as primary capsule (without multi-head shape attention layer) | 53.2 | 23.0
Multi-head shape attention layer | 99.95 | 99.90

ShapesToObject:

Variant | SHREC10 | SHREC15
---|---|---
PointNet | 88.9 | 93.2
Routing-by-agreement | 99.95 | 99.90
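The routing-by-agreement entries in the ablation refer to the dynamic routing procedure of Sabour et al. [44]. Below is a minimal NumPy sketch of that procedure; the capsule counts and dimensions are illustrative placeholders, not the paper's actual configuration.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Capsule nonlinearity: preserves direction, maps the norm into (0, 1).
    sq = (s ** 2).sum(axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, n_iters=3):
    """Routing-by-agreement (Sabour et al., 2017).

    u_hat: (n_in, n_out, d_out) prediction vectors from lower-level capsules.
    Returns the higher-level capsule vectors, shape (n_out, d_out).
    """
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                    # routing logits
    for _ in range(n_iters):
        c = softmax(b, axis=1)                     # coupling coefficients
        s = (c[:, :, None] * u_hat).sum(axis=0)    # weighted sum over inputs
        v = squash(s)                              # (n_out, d_out) outputs
        b = b + (u_hat * v[None]).sum(axis=-1)     # agreement update
    return v

# Toy example: 32 primary capsules routed to 10 class capsules of dim 16.
rng = np.random.default_rng(1)
u_hat = rng.normal(size=(32, 10, 16))
v = dynamic_routing(u_hat)
```

Lower capsules whose predictions agree with an output capsule's direction receive larger coupling coefficients on the next iteration, which is how part-whole assignments are refined without extra learned parameters.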
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Chen, Y.; Zhao, J.; Qiu, Q. A Transformer-Based Capsule Network for 3D Part–Whole Relationship Learning. Entropy 2022, 24, 678. https://doi.org/10.3390/e24050678