Towards Understanding Neural Machine Translation with Attention Heads’ Importance
Abstract
1. Introduction
Contributions
- We offer a comprehensive examination of the three types of attention in the Transformer model: encoder self-attention, decoder self-attention, and encoder–decoder attention. Analyzing all three together gives a fuller picture of the model's attention mechanisms than studies restricted to a single type;
- We investigate the distribution and importance of attention heads. This helps researchers retain important attention heads and prune unimportant ones, thereby reducing model parameters and improving inference speed (a minimal head-masking sketch follows this list);
- Our findings reveal a link between attention heads and the understanding of specific POS features, offering theoretical insights for future studies. Specifically, we suggest that focusing on the acquisition of noun and verb knowledge in Chinese–English machine translation could improve model performance;
- We analyze the decision-making processes of attention heads in terms of POS, dependency relations, and syntax trees, which may inform and inspire model design. Our results indicate that certain linguistic elements, such as nouns, adjectives, and adjectival modifiers, are pivotal, whereas others, such as determiners, are less critical. This suggests that in Chinese–English machine translation the model concentrates on particular kinds of information, which enhances the model's interpretability.
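To make the head-masking idea behind these contributions concrete, the sketch below shows one standard way to gate individual attention heads in a Transformer layer. This is our own minimal PyTorch illustration under assumed names and shapes, not the authors' implementation; in experiments such as those reported in Section 3, a head's importance is then typically estimated from how much translation quality (e.g., BLEU) drops when that head is zeroed.

```python
import torch
import torch.nn as nn

class MaskableMultiHeadAttention(nn.Module):
    """Multi-head self-attention whose individual heads can be zeroed out."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, head_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); head_mask: (n_heads,) of 0.0 / 1.0 gates.
        b, t, _ = x.shape

        def split(proj):  # (b, t, d_model) -> (b, n_heads, t, d_head)
            return proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.q_proj), split(self.k_proj), split(self.v_proj)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                             # (b, n_heads, t, d_head)
        heads = heads * head_mask.view(1, -1, 1, 1)  # zero out masked heads
        return self.out_proj(heads.transpose(1, 2).reshape(b, t, -1))

# Toy check: masking head 3 of 8 changes the layer's output.
torch.manual_seed(0)
mha = MaskableMultiHeadAttention(d_model=512, n_heads=8)
x = torch.randn(2, 10, 512)
full = mha(x, head_mask=torch.ones(8))
mask = torch.ones(8)
mask[3] = 0.0
masked = mha(x, head_mask=mask)
print((full - masked).abs().mean())  # nonzero: head 3 contributed
```

Gating each head's output with a scalar, rather than editing the attention weights themselves, is the formulation commonly used in head-pruning work and keeps the rest of the layer untouched.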
2. Background
2.1. Interpreting Attention
2.2. Linguistic Knowledge
3. Attention Heads’ Importance
3.1. Experimental Setup
3.2. Multi-Headed Attention
3.3. Attention Head Analysis
3.4. Correlation Analysis
4. Linguistic Knowledge
4.1. Part-of-Speech
4.2. Dependency
4.3. Syntax Tree
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. Adv. Neural Inf. Process. Syst. 2014, 27, 3104–3112. [Google Scholar]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
- Hessel, J.; Marasović, A.; Hwang, J.D.; Lee, L.; Da, J.; Zellers, R.; Mankoff, R.; Choi, Y. Do Androids Laugh at Electric Sheep? Humor “Understanding” Benchmarks from the New Yorker Caption Contest. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023. [Google Scholar]
- Ding, Y.; Liu, Y.; Luan, H.; Sun, M. Visualizing and understanding neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1150–1159. [Google Scholar]
- Belinkov, Y.; Durrani, N.; Dalvi, F.; Sajjad, H.; Glass, J. What do neural machine translation models learn about morphology? arXiv 2017, arXiv:1704.03471. [Google Scholar]
- Jing, L.; Yong, Z. An Algorithm for Finding Optimal k-Core in Attribute Networks. Appl. Sci. 2024, 14, 1256. [Google Scholar] [CrossRef]
- Michel, P.; Levy, O.; Neubig, G. Are sixteen heads really better than one? Adv. Neural Inf. Process. Syst. 2019, 32, 14037–14047. [Google Scholar]
- Moghe, N.; Sherborne, T.; Steedman, M.; Birch, A. Extrinsic Evaluation of Machine Translation Metrics. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023. [Google Scholar]
- Lipton, Z. The Mythos of Model Interpretability. Commun. ACM 2016, 61, 36–43. [Google Scholar] [CrossRef]
- Wu, W.; Jiang, C.; Jiang, Y.; Xie, P.; Tu, K. Do PLMs Know and Understand Ontological Knowledge? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023. [Google Scholar]
- Wang, W.; Tu, Z. Rethinking the value of transformer components. arXiv 2020, arXiv:2011.03803. [Google Scholar]
- Serrano, S.; Smith, N.A. Is attention interpretable? arXiv 2019, arXiv:1906.03731. [Google Scholar]
- Li, X.; Li, G.; Liu, L.; Meng, M.; Shi, S. On the word alignment from neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 1293–1303. [Google Scholar]
- Kobayashi, G.; Kuribayashi, T.; Yokoi, S.; Inui, K. Attention module is not only a weight: Analyzing transformers with vector norms. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online, 16–20 November 2020. [Google Scholar]
- Jain, S.; Wallace, B.C. Attention is not explanation. arXiv 2019, arXiv:1902.10186. [Google Scholar]
- Wiegreffe, S.; Pinter, Y. Attention is not not explanation. arXiv 2019, arXiv:1908.04626. [Google Scholar]
- Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv 2019, arXiv:1905.09418. [Google Scholar]
- Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, K.-R.; Samek, W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 2015, 10, e0130140. [Google Scholar] [CrossRef] [PubMed]
- Ma, W.; Zhang, K.; Lou, R.; Wang, L.; Vosoughi, S. Contributions of transformer attention heads in multi-and cross-lingual tasks. arXiv 2021, arXiv:2108.08375. [Google Scholar]
- Ghader, H.; Monz, C. What does attention in neural machine translation pay attention to? arXiv 2017, arXiv:1710.03348. [Google Scholar]
- Chen, Z.; Jiang, C.; Tu, K. Using Interpretation Methods for Model Enhancement. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023. [Google Scholar]
- Yin, K.; Neubig, G. Interpreting language models with contrastive explanations. arXiv 2022, arXiv:2202.10419. [Google Scholar]
- Belinkov, Y.; Màrquez, L.; Sajjad, H.; Durrani, N.; Dalvi, F.; Glass, J. Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. arXiv 2018, arXiv:1801.07772. [Google Scholar]
- Akyürek, E.; Schuurmans, D.; Andreas, J.; Ma, T.; Zhou, D. What learning algorithm is in-context learning? Investigations with linear models. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- He, S.; Tu, Z.; Wang, X. Towards Understanding Neural Machine Translation with Word Importance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019. [Google Scholar] [CrossRef]
- Qiang, J.; Liu, K.; Li, Y.; Zhu, Y.; Yuan, Y.H.; Hu, X.; Ouyang, X. Chinese Lexical Substitution: Dataset and Method. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023. [Google Scholar]
- Sennrich, R.; Haddow, B.; Birch, A. Neural machine translation of rare words with subword units. arXiv 2015, arXiv:1508.07909. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002. [Google Scholar] [CrossRef]
- Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional sequence to sequence learning. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1243–1252. [Google Scholar]
- Tan, S.; Shen, Y.; Chen, Z.; Courville, A.; Gan, C. Sparse Universal Transformer. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023. [Google Scholar]
- Müller, M.; Jiang, Z.; Moryossef, A.; Rios, A.; Ebling, S. Considerations for meaningful sign language machine translation based on glosses. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023. [Google Scholar]
- Marcus, M.; Santorini, B.; Marcinkiewicz, M.A. Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist. 1993, 19, 313–330. [Google Scholar]
- Kai, V.; Frank, K. Cluster-Centered Visualization Techniques for Fuzzy Clustering Results to Judge Single Clusters. Appl. Sci. 2024, 14, 1102. [Google Scholar] [CrossRef]
- Lee, W.; Lee, J. Tree-Based Modeling for Large-Scale Management in Agriculture: Explaining Organic Matter Content in Soil. Appl. Sci. 2024, 14, 1811. [Google Scholar] [CrossRef]
Categories | Strengths | Limitations | References | Contributions |
---|---|---|---|---|
Visualization | 1. Intuitive representation of textual data and processes; 2. Provides visually intuitive charts; 3. Accessible to non-experts. | 1. Difficulty in explaining complex grammatical structures; 2. May overlook subtle nuances in text; 3. Limited in-depth linguistic analysis. | Visualizing and understanding neural machine translation [5] | Understanding the internal workings of NMT models through attention-based visualization techniques. |
 | | | Rethinking the value of transformer components [12] | Understanding the importance of each component of the Transformer model through visualization. |
 | | | Towards Understanding Neural Machine Translation with Word Importance [26] | Understanding the importance of words in generating sentences through visualization. |
Linguistic interpretation | 1. Provides in-depth analysis and interpretation based on linguistic theories; 2. Explains reasons and patterns behind language phenomena; 3. Aids understanding of language structures and semantics. | 1. Requires linguistic expertise for interpretation; 2. Relies heavily on corpora and linguistic resources; 3. May not cover all language phenomena, especially non-structural and abstract features. | Interpreting language models with contrastive explanations [23] | Explaining the model's behavior by contrasting alternatives produced through word replacement. |
 | | | What do neural machine translation models learn about morphology? [6] | Explaining the internal processing of the model through morphology. |
 | | | Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks [24] | Explaining the internal processing of the model through POS and semantic tagging. |
Language Pair | Corpus | Size | Total |
---|---|---|---|
ZH-EN | casia2015 corpus | 1 million | 9 million |
 | casict2011 corpus | 2 million | |
 | casict2015 corpus | 2 million | |
 | datum2015 corpus | 1 million | |
 | datum2017 corpus | 1 million | |
 | neu2017 corpus | 2 million | |
POS | Pearson | | |
---|---|---|---|
Noun | 0.85586148 | 0.00288859 | 0.01695689 |
Verb | 0.81776615 | 0.00448712 | 0.01009835 |
Adjective | 0.66348280 | 0.12858195 | 0.05381574 |
Adverb | 0.59999701 | 0.12642623 | 0.07041694 |
Preposition | 0.61585583 | 0.10229620 | 0.12294885 |
Determiner | 0.36217017 | 0.11767329 | 0.34239231 |
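The labels of the three numeric columns are not recoverable from this extract, but the correlation analysis itself is straightforward to reproduce. Below is a small sketch, with hypothetical head-importance scores for eight heads, using scipy.stats.pearsonr, which returns both the coefficient and a two-sided p-value:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical head-importance scores for the same 8 heads, estimated on the
# full test set and on a subset (e.g., sentences rich in a given POS).
importance_full = np.array([0.91, 0.42, 0.13, 0.77, 0.25, 0.58, 0.33, 0.69])
importance_subset = np.array([0.88, 0.45, 0.10, 0.80, 0.20, 0.61, 0.30, 0.66])

r, p_value = pearsonr(importance_full, importance_subset)
print(f"Pearson r = {r:.4f}, p = {p_value:.4f}")
```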
Category | Type | Number |
---|---|---|
Part-of-speech | Noun | 1,209,801 |
 | Verb | 639,937 |
 | Adjective | 325,112 |
 | Adverb | 168,872 |
 | Preposition | 521,772 |
 | Determiner | 421,079 |
Dependency | root | 253,109 |
 | nsubj | 300,879 |
 | obj | 200,308 |
 | advmod | 183,645 |
 | det | 401,812 |
 | amod | 285,112 |
 | nmod | 229,550 |
 | compound | 242,266 |
Syntax tree | NP-NP-NN | 185,474 |
 | NP-PP-IN | 139,905 |
 | PP-NP-NN | 148,864 |
 | NP-NP-DT | 126,033 |
 | VP-PP-IN | 131,974 |
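Counts like these can be gathered with an off-the-shelf tagger and dependency parser. The paper does not state which toolkit it used, so the following is a minimal sketch with spaCy on made-up sentences; constituency patterns such as NP-NP-NN would additionally require a constituency parser (e.g., benepar), which is omitted here.

```python
from collections import Counter
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

pos_counts = Counter()
dep_counts = Counter()

sentences = [
    "The analysis method provides a theoretical foundation.",
    "This method checks the strength of rotary welded joints.",
]
for doc in nlp.pipe(sentences):
    for token in doc:
        pos_counts[token.pos_] += 1  # coarse POS tag (NOUN, VERB, ...)
        dep_counts[token.dep_] += 1  # dependency relation (nsubj, obj, ...)

print(pos_counts.most_common(5))
print(dep_counts.most_common(5))
```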
System | Output |
---|---|
Reference | The analysis method provides some theoretical foundation for the safe assessment on welded joint strength in a rotation blade. |
Transformer | This method provides a theoretical basis for the strength-checking of rotary welded joints. |
Mask important attention heads | The analysis method is used to verify the welding strength of the joints. |
Mask unimportant attention heads | This method provides a theoretical basis for the strength-checking of the rotary joint. |
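Outputs like those above can be scored against the reference with the sacrebleu package. This is our tooling choice for illustration, and sentence-level BLEU on a single example is only indicative, not the paper's corpus-level evaluation:

```python
import sacrebleu

reference = ("The analysis method provides some theoretical foundation for the "
             "safe assessment on welded joint strength in a rotation blade.")
outputs = {
    "Transformer": "This method provides a theoretical basis for the "
                   "strength-checking of rotary welded joints.",
    "Mask important heads": "The analysis method is used to verify the "
                            "welding strength of the joints.",
    "Mask unimportant heads": "This method provides a theoretical basis for "
                              "the strength-checking of the rotary joint.",
}
for system, hypothesis in outputs.items():
    # sentence_bleu takes one hypothesis and a list of reference strings.
    score = sacrebleu.sentence_bleu(hypothesis, [reference]).score
    print(f"{system}: BLEU = {score:.2f}")
```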
POS | Complete Model | Pruned (5) | Pruned (10) | Pruned (15) | Pruned (20) | Pruned (25) | Pruned (30) |
---|---|---|---|---|---|---|---|
N | 55.07 | 12.54 | 13.19 | 20.61 | 27.74 | 32.79 | 34.46 |
V | 40.05 | 6.39 | 7.22 | 12.84 | 16.56 | 19.01 | 21.01 |
ADJ | 53.85 | 14.27 | 15.15 | 35.83 | 38.63 | 39.77 | 40.23 |
ADV | 43.79 | 18.69 | 20.07 | 30.03 | 30.59 | 31.01 | 31.92 |
IN | 58.87 | 18.61 | 20.33 | 21.08 | 22.38 | 25.30 | 28.41 |
DT | 65.01 | 13.55 | 15.74 | 11.69 | 10.15 | 11.93 | 18.26 |
Dependency | Complete Model | Pruned (5) | Pruned (10) | Pruned (15) | Pruned (20) | Pruned (25) | Pruned (30) |
---|---|---|---|---|---|---|---|
root | 32.92 | 2.36 | 3.61 | 9.47 | 14.26 | 17.69 | 20.49 |
nsubj | 22.57 | 3.62 | 4.79 | 9.98 | 12.91 | 14.58 | 16.72 |
obj | 19.17 | 3.50 | 4.54 | 8.62 | 11.72 | 13.45 | 14.76 |
advmod | 19.02 | 7.95 | 9.08 | 14.12 | 15.02 | 15.37 | 15.83 |
det | 36.72 | 9.04 | 11.19 | 12.40 | 15.58 | 19.18 | 22.35 |
amod | 31.31 | 7.80 | 9.96 | 22.08 | 24.86 | 26.09 | 26.55 |
nmod | 23.03 | 7.64 | 9.84 | 12.06 | 16.10 | 18.00 | 19.03 |
compound | 33.93 | 4.60 | 7.22 | 14.58 | 19.99 | 22.22 | 24.12 |
Syntax-Tree Pattern | Complete Model | Pruned (5) | Pruned (10) | Pruned (15) | Pruned (20) | Pruned (25) | Pruned (30) |
---|---|---|---|---|---|---|---|
NP-NP-NN | 34.20 | 7.51 | 9.82 | 11.11 | 15.01 | 18.84 | 20.69 |
NP-PP-IN | 59.12 | 14.00 | 18.39 | 8.00 | 5.83 | 7.68 | 15.01 |
PP-NP-NN | 37.83 | 11.84 | 13.13 | 13.51 | 16.18 | 19.91 | 20.49 |
NP-NP-DT | 56.55 | 13.54 | 17.69 | 9.64 | 7.83 | 9.26 | 19.48 |
VP-PP-IN | 41.43 | 14.31 | 15.67 | 17.53 | 19.25 | 22.05 | 24.11 |