Rethinking Attention Mechanisms in Vision Transformers with Graph Structures
Abstract
1. Introduction
Contribution of This Work
- Unlike other graph-based transformers [17,18,19], which apply graphs and attention in parallel and combine the outputs, this study is the first attempt to apply a graph inside the transformer head and to replace multi-head attention (MHA) with a small number of graph head attention (GHA) mechanisms. Moreover, no class token is needed in the patch embedding, which reduces the number of operations.
- Links between nodes with low attention scores are pruned using graph pooling, and the node features are updated through GHA boosting so that they reflect the connectivity of neighboring nodes. This process preserves feature locality and secures attention diversity. GHA-ViT not only creates tokens with local characteristics but also learns the relationships between tokens through the graph structure, which ultimately strengthens token locality.
- Because the graph structure is constructed from the attention matrix and the node features are extracted from the value matrix, no additional learnable parameters are required for graph construction (see the sketch after this list).
- GHA-ViT achieves promising classification performance when trained from scratch on small- and medium-sized datasets, without pre-training on large-scale datasets.
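The following is a minimal PyTorch sketch of the mechanism outlined in the points above: attention scores are reused as graph edge weights, low-score edges are pruned as a stand-in for the graph pooling step, and the value features of the retained neighbors are aggregated. The class name `GraphHeadAttention`, the `keep_ratio` parameter, and the top-k pruning rule are illustrative assumptions rather than the authors' exact formulation.

```python
import torch
import torch.nn as nn


class GraphHeadAttention(nn.Module):
    """Illustrative sketch of a graph head attention (GHA) block.

    Assumptions (not taken from the paper's code): the adjacency matrix is
    derived from the scaled dot-product attention scores, low-score edges are
    pruned by keeping the top-k neighbors per node, and node features taken
    from the value projection are aggregated over the retained neighbors only.
    """

    def __init__(self, dim: int = 192, num_heads: int = 3, keep_ratio: float = 0.5):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.keep_ratio = keep_ratio
        self.qkv = nn.Linear(dim, dim * 3, bias=False)  # standard QKV projection
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each: (B, heads, N, head_dim)

        # Attention scores double as graph edge weights.
        scores = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N)

        # Graph pooling stand-in: keep only the strongest edges of each node.
        k_keep = max(1, int(N * self.keep_ratio))
        threshold = scores.topk(k_keep, dim=-1).values[..., -1:]
        adjacency = scores >= threshold                  # boolean (B, heads, N, N)

        # Aggregate value features over the retained neighbors only.
        attn = scores.masked_fill(~adjacency, float("-inf")).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    tokens = torch.randn(2, 49, 192)          # 7 x 7 patch tokens, embedding dim 192
    print(GraphHeadAttention()(tokens).shape)  # torch.Size([2, 49, 192])
```

In this sketch, the adjacency is recomputed from the query-key scores at every forward pass, so the graph construction adds no learnable parameters beyond the standard QKV and output projections, mirroring the claim in the list above.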
2. Related Studies
Graph Vision Transformer Models
3. Preliminaries
3.1. Vision Transformers
3.2. Patch Generation
3.3. Graph Attention Networks
4. A Graph-Head-Attention-Based ViT
4.1. Graph Head Attention
4.2. Graph Structure Generation
4.3. Graph Head Attention Boosting
5. The Dataset and Experimental Results
5.1. The Experimental Setup
5.2. The Experiment Environment and Parameter Settings
5.3. Comparing the Performance with State-of-the-Art Models
5.4. Ablation Studies
5.5. Graph Visualization
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 1–11.
2. Pham, N.-Q.; Nguyen, T.S.; Niehues, J.; Müller, M.; Stüker, S.; Waibel, A.H. Very deep self-attention networks for end-to-end speech recognition. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 66–70.
3. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021; pp. 1–21.
4. Yan, X.; Tang, H.; Sun, S.; Ma, H.; Kong, D.; Xie, X. After-unet: Axial fusion transformer unet for medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2022; pp. 3971–3981.
5. Zeng, F.; Dong, B.; Zhang, Y.; Wang, T.; Zhang, X.; Wei, Y. Motr: End-to-end multiple-object tracking with transformer. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 659–675.
6. Ranftl, R.; Bochkovskiy, A.; Koltun, V. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 12179–12188.
7. Chen, J.; Ho, C.M. Mm-vit: Multi-modal video transformer for compressed video action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2022; pp. 1910–1921.
8. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 10347–10357.
9. Liu, Y.; Sangineto, E.; Bi, W.; Sebe, N.; Lepri, B.; Nadai, M. Efficient training of visual transformers with small datasets. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021; pp. 23818–23830.
10. Liu, H.; Dai, Z.; So, D.; Le, Q.V. Pay attention to MLPs. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021; pp. 9204–9215.
11. Lee-Thorp, J.; Ainslie, J.; Eckstein, I.; Ontanon, S. FNet: Mixing tokens with Fourier transforms. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Seattle, WA, USA, 10–15 July 2022; pp. 4296–4313.
12. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 568–578.
13. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 10012–10022.
14. Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.-H.; Tay, F.E.; Feng, J.; Yan, S. Tokens-to-token vit: Training vision transformers from scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 558–567.
15. Yuan, K.; Guo, S.; Liu, Z.; Zhou, A.; Yu, F.; Wu, W. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 579–588.
16. Li, Y.; Zhang, K.; Cao, J.; Timofte, R.; Van Gool, L. Localvit: Bringing locality to vision transformers. arXiv 2021, arXiv:2104.05707.
17. Shen, F.; Xie, Y.; Zhu, J.; Zhu, X.; Zeng, H. Git: Graph interactive transformer for vehicle re-identification. IEEE Trans. Image Process. 2023, 32, 1037–1051.
18. Zheng, Y.; Gindra, R.H.; Green, E.J.; Burks, E.J.; Betke, M.; Beane, J.E.; Kolachalama, V.B. A graph-transformer for whole slide image classification. IEEE Trans. Med. Imaging 2022, 41, 3003–3015.
19. Lin, K.; Wang, L.; Liu, Z. Mesh graphormer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 12939–12948.
20. Yun, S.; Jeong, M.; Kim, R.; Kang, J.; Kim, H.J. Graph transformer networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 1–11.
21. Ying, C.; Cai, T.; Luo, S.; Zheng, S.; Ke, G.; He, D.; Shen, Y.; Liu, T.-Y. Do transformers really perform badly for graph representation? In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021; pp. 28877–28888.
22. Yun, S.; Jeong, M.; Yoo, S.; Lee, S.; Sean, S.Y.; Kim, R.; Kang, J.; Kim, H.J. Graph transformer networks: Learning meta-path graphs to improve GNNs. Neural Netw. 2022, 153, 104–119.
23. Zhao, W.; Wang, W.; Tian, Y. Graformer: Graph-oriented transformer for 3d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 20438–20447.
24. Han, K.; Wang, Y.; Guo, J.; Tang, Y.; Wu, E. Vision gnn: An image is worth graph of nodes. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; pp. 1–13.
25. Yin, M.; Uzkent, B.; Shen, Y.; Jin, H.; Yuan, B. Gohsp: A unified framework of graph and optimization-based heterogeneous structured pruning for vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, USA, 7–14 February 2023; pp. 1–9.
26. Hassani, A.; Walton, S.; Shah, N.; Abuduweili, A.; Li, J.; Shi, H. Escaping the big data paradigm with compact transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Virtual, 19–25 June 2021; pp. 1–18.
27. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph attention networks. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–12.
28. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017; pp. 1–14.
29. Lee, J.; Jeong, M.; Ko, B.C. Graph convolution neural network-based data association for online multi-object tracking. IEEE Access 2021, 9, 114535–114546.
30. Michel, P.; Levy, O.; Neubig, G. Are sixteen heads really better than one? In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 1–11.
31. Gao, H.; Ji, S. Graph u-nets. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 10–15 June 2019; pp. 2083–2092.
32. Lee, J.; Lee, I.; Kang, J. Self-attention graph pooling. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 10–15 June 2019; pp. 3734–3743.
33. Kim, J.-Y.; Cho, S.-B. Electric energy consumption prediction by deep learning with state explainable autoencoder. Energies 2019, 12, 739.
34. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. In Proceedings of the Advances in Neural Information Processing Systems Workshops (NeurIPSW), Montréal, QC, Canada, 8–13 December 2014; pp. 1–9.
35. Wei, L.; Xiao, A.; Xie, L.; Zhang, X.; Chen, X.; Tian, Q. Circumventing outliers of autoaugment with knowledge distillation. In Proceedings of the European Conference on Computer Vision (ECCV), Virtual, 23–28 August 2020; pp. 608–625.
36. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; pp. 1–14.
37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
38. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
39. Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 10819–10829.
40. Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. Mlp-mixer: An all-mlp architecture for vision. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021; pp. 24261–24272.
41. Touvron, H.; Bojanowski, P.; Caron, M.; Cord, M.; El-Nouby, A.; Grave, E.; Izacard, G.; Joulin, A.; Synnaeve, G.; Verbeek, J.; et al. Resmlp: Feedforward networks for image classification with data-efficient training. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5314–5321.
42. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009; pp. 1–58.
43. Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 2012, 29, 141–142.
44. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747.
45. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255.
46. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. Autoaugment: Learning augmentation policies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 113–123.
47. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Virtual, 14–19 June 2020; pp. 702–703.
48. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 13001–13008.
49. Zhang, Y.; Chen, D.; Kundu, S.; Li, C.; Beerel, P.A. SAL-ViT: Towards latency efficient private inference on ViT using selective attention search with a learnable softmax approximation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 4–6 October 2023; pp. 5116–5125.
50. Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How powerful are graph neural networks? In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019; pp. 1–17.
Model | Patch Size | Head | Layer | dim d | MLP Ratio |
---|---|---|---|---|---|
GHA-S-7/3 | 3 × 3 | 3 | 7 | 64 | 2 |
GHA-B-7/3 | 3 × 3 | 6 | 7 | 64 | 2 |
GHA-S-14/7 | 7 × 7 | 3 | 14 | 64 | 4 |
GHA-B-14/7 | 7 × 7 | 6 | 14 | 64 | 4 |
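For reference, the variants in the table above can be captured in a small configuration registry. This is a sketch under the assumption that the unlabeled columns correspond to the patch size and the number of encoder layers implied by the model names; the dictionary and field names are hypothetical.

```python
# Hypothetical configuration registry mirroring the table above; key and
# field names are illustrative and not taken from any released code.
GHA_VIT_CONFIGS = {
    "GHA-S-7/3":  {"patch_size": 3, "heads": 3, "layers": 7,  "dim": 64, "mlp_ratio": 2},
    "GHA-B-7/3":  {"patch_size": 3, "heads": 6, "layers": 7,  "dim": 64, "mlp_ratio": 2},
    "GHA-S-14/7": {"patch_size": 7, "heads": 3, "layers": 14, "dim": 64, "mlp_ratio": 4},
    "GHA-B-14/7": {"patch_size": 7, "heads": 6, "layers": 14, "dim": 64, "mlp_ratio": 4},
}
```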
Model | Params (M) ↓ | MACs (M) ↓ | CIFAR-10 Top-1 (%) ↑ | CIFAR-100 Top-1 (%) ↑ | MNIST Top-1 (%) ↑ | MNIST-F Top-1 (%) ↑ |
---|---|---|---|---|---|---|
VGG-16 [36] | 20 | 155 | 90.1 | 70.7 | 99.7 | 94.6 |
ResNet-18 [37] | 11 | 40 | 90.2 | 66.4 | 99.8 | 94.7 |
ResNet-34 [37] | 21 | 80 | 90.5 | 66.8 | 99.7 | 94.7 |
ResNet-56 [37] | 24 | 130 | 93.9 | 71.5 | 99.7 | 94.8 |
ResNet-110 [37] | 43 | 260 | 94.1 | 72.6 | 99.7 | 95.1 |
MobileNetV2/0.5 [38] | 1 | 10 | 84.7 | 56.3 | 99.7 | 93.9 |
MobileNetV2/2.0 [38] | 8 | 20 | 91.0 | 67.4 | 99.7 | 95.2 |
ViT-12/4 † [3] | 85 | 5520 | 94.8 | 74.1 | 99.6 | 95.4 |
ViT-Lite-7/16 †† [26] | 3 | 20 | 78.4 | 52.8 | 99.6 | 93.2 |
ViT-Lite-7/8 †† [26] | 3 | 60 | 89.1 | 67.2 | 99.6 | 94.4 |
ViT-Lite-7/4 †† [26] | 3 | 260 | 93.5 | 73.9 | 99.7 | 95.1 |
CVT-7/8 [26] | 3 | 60 | 89.7 | 70.1 | 99.7 | 94.5 |
CVT-7/4 [26] | 3 | 250 | 94.0 | 76.4 | 99.7 | 95.3 |
CVT-7/3 × 2 [26] | 3 | 1290 | 95.0 | 77.7 | 99.7 | 95.1 |
ViG-Ti [24] | 6 | 1230 | 93.0 | 74.1 | 99.4 | 95.0 |
GOHSP ††† [25] | 10 | 2020 | 97.4 | - | - | - |
SAL-ViT [49] | 5 | 1601 | 95.9 | 77.6 | - | - |
GHA-S-7/3 | 5 | 950 | 95.2 | 78.4 | 99.5 | 95.2 |
GHA-B-7/3 | 10 | 2130 | 96.8 | 80.1 | 99.7 | 95.3 |
Model | Params (M) ↓ | MACs (G) ↓ | Top-1 (%) ↑ | Top-5 (%) ↑ |
---|---|---|---|---|
ResNet-50 [37] | 26 | 4.3 | 76.2 | 95.0 |
ResNet-101 [37] | 45 | 7.9 | 77.4 | 95.4 |
ResNet-152 [37] | 60 | 11.6 | 78.3 | 95.9 |
ViT-S-16 [3] | 47 | 10.1 | 78.1 | - |
DeiT-S [8] | 22 | 4.6 | 79.8 | 95.0 |
CCT-14/7 × 2 [26] | 22 | 18.6 | 80.6 | - |
T2T-ViT-14 [14] | 22 | 4.8 | 81.5 | - |
PoolFormer-S12 [39] | 12 | 1.8 | 77.2 | - |
PoolFormer-S24 [39] | 30 | 3.0 | 80.3 | - |
Mixer-B/16 [40] | 59 | 12.7 | 76.4 | - |
ResMLP-12 [41] | 15 | 3.0 | 76.6 | - |
gMLP-Ti [10] | 6 | 1.4 | 72.3 | - |
gMLP-S [10] | 20 | 4.5 | 79.6 | - |
ViG-Ti [24] | 7 | 1.3 | 73.9 | 92.0 |
ViG-S [24] | 23 | 4.5 | 80.4 | 95.2 |
GOHSP [25] | 11 | 2.8 | 79.9 | - |
GHA-S-14/7 | 10 | 1.8 | 77.4 | 93.5 |
GHA-B-14/7 | 29 | 5.9 | 81.7 | 95.8 |
Dataset | ||||
---|---|---|---|---|
CIFAR-10 | 93.2 | 93.1 | 92.8 | 95.2 |
CIFAR-100 | 73.8 | 73.7 | 73.0 | 78.4 |
MNIST | 99.4 | 99.3 | 99.4 | 99.5 |
MNIST-F | 94.8 | 94.7 | 94.8 | 95.2 |
Dataset | Mean | Max | Seq | Multi |
---|---|---|---|---|
CIFAR-10 | 93.9 | 92.5 | 91.1 | 95.2 |
CIFAR-100 | 73.8 | 72.1 | 75.1 | 78.4 |
MNIST | 99.2 | 99.4 | 99.2 | 99.5 |
MNIST-F | 94.8 | 93.4 | 94.9 | 95.2 |
Dataset | GCN | GIN | GAT |
---|---|---|---|
CIFAR-10 | 94.5 | 94.9 | 95.2 |
CIFAR-100 | 76.7 | 77.1 | 78.4 |
MNIST | 99.4 | 99.4 | 99.5 |
MNIST-F | 94.8 | 94.9 | 95.2 |