Diversifying Multi-Head Attention in the Transformer Model
Abstract
1. Introduction
2. Motivation and Proposed Methodology
2.1. Generalized Hebbian Algorithm
2.2. The DEACON Algorithm
3. The Proposed Architectures
3.1. Direct Architecture
3.2. Average Architecture
- Dimensionality Reduction through Averaging: Before the PCA layer, the average architecture performs an averaging operation across the dimensions of each attention head for each word in the sequence. This step condenses the information represented by each head into a single scalar value per word, which significantly reduces the dimensionality of the data fed into the PCA layer. For example, with 8 attention heads, each of dimensionality 32, the baseline Transformer carries 256 values representing the attention heads for each word; after averaging, the average architecture reduces this to only 8 values per word. This dimensionality reduction has a direct impact on the number of parameters in subsequent layers.
- Head Pruning through PCA: The PCA layer itself contributes to parameter reduction through its ability to identify and retain only the most informative features, effectively pruning the less important ones. In the average architecture, the PCA layer projects the averaged attention head values onto a lower-dimensional space defined by the principal components. By selecting only the outputs corresponding to the largest principal components (m outputs), the architecture discards the less important heads, which are likely to capture redundant information. For instance, starting from 8 attention heads and retaining only the top 3 principal components effectively prunes 5 heads. This reduction in the number of heads translates into fewer parameters in the final linear transformation that projects the outputs back to the input dimension; a minimal sketch of the averaging and projection steps is given after this list.
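To make these two steps concrete, the following NumPy sketch averages each head's output into one scalar per word and then projects onto the top-m principal components. It is an illustration only, not the paper's implementation: the function name and tensor shapes are assumptions, and the principal components are obtained here by a one-shot eigendecomposition of a single sequence, whereas the proposed architectures learn the PCA projection online with the Generalized Hebbian Algorithm during training.

```python
import numpy as np

def average_then_project(head_outputs, m):
    """Illustrative 'average architecture' reduction: average each head's
    dimensions to a scalar per word, then keep the top-m principal components."""
    seq_len, n_heads, d_head = head_outputs.shape

    # 1) Average across each head's dimensions: e.g. 8 heads x 32 dims per word
    #    (256 values) collapse to 8 values per word.
    head_means = head_outputs.mean(axis=-1)            # (seq_len, n_heads)

    # 2) PCA over the averaged head values (batch eigendecomposition here;
    #    the paper trains this projection with the Generalized Hebbian Algorithm).
    centered = head_means - head_means.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(seq_len - 1, 1)  # (n_heads, n_heads)
    _, eigvecs = np.linalg.eigh(cov)                   # eigenvalues in ascending order
    top_m = eigvecs[:, -m:][:, ::-1]                   # (n_heads, m), largest first

    # 3) Project onto the m retained components; the components that are dropped
    #    correspond to effectively pruned heads. A final linear layer (not shown)
    #    maps the m values per word back to the model dimension.
    return centered @ top_m                            # (seq_len, m)

# Example matching the figures in the text: 8 heads of dimension 32, keep 3.
x = np.random.randn(10, 8, 32)                         # a 10-word sequence
print(average_then_project(x, m=3).shape)              # (10, 3)
```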
3.3. Non-Linear Architecture
4. Datasets
5. Experimental Results
5.1. WMT-16 Dataset (Machine Translation)
5.2. XSum Dataset (Text Summarization)
5.3. SQuAD v1.1 Dataset (Question Answering)
5.4. SlimPajama Dataset (Language Modeling)
6. Conclusions
- Theoretical foundations—developing a formal framework to characterize the relationship between attention head diversity and model performance across different NLP tasks;
- Efficiency optimization—investigating techniques to reduce the computational overhead of the PCA layer while maintaining its diversification benefits;
- Attention interpretability—analyzing how diversified attention heads capture different linguistic phenomena to better understand their complementary roles.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
| Architecture | Description and Key Features |
|---|---|
| Direct | |
| Average | |
| Non-linear | |
Composition of the SlimPajama data used for language modeling:

| Data Source | Sampling (%) | Train Tokens | Validation Tokens | Train (% of Total) |
|---|---|---|---|---|
| CommonCrawl | 67.0 | 200.82B | 214.72M | 66.99 |
| C4 | 15.0 | 44.98B | 48.06M | 15.00 |
| GitHub | 4.5 | 13.51B | 14.42M | 4.51 |
| Books | 4.5 | 13.49B | 14.39M | 4.50 |
| Wikipedia | 4.5 | 13.50B | 14.41M | 4.50 |
| ArXiv | 2.5 | 7.49B | 8.01M | 2.50 |
| StackExchange | 2.0 | 5.99B | 6.41M | 2.00 |
| Total | 100.0 | 299.78B | 320.42M | 100.00 |
Results on the WMT-16 dataset (machine translation):

| Architecture | PCA Heads | Number of Parameters | Training Loss | Validation Accuracy (%) | BLEU Score | Params (% of Base) |
|---|---|---|---|---|---|---|
| Baseline Transformer | 8 | 5,370,112 | 0.8211 | 68.63 | 28.4 | 100.00 |
| Direct Architecture | 8 | 5,373,616 | 0.8031 | 69.75 | 28.9 | 100.07 |
| Direct Architecture | 3 | 5,127,586 | 0.9887 | 68.30 | 28.3 | 95.48 |
| Average Architecture | 8 | 4,992,688 | 0.8255 | 67.78 | 28.1 | 92.97 |
| Average Architecture | 3 | 4,984,738 | 1.1382 | 63.70 | 26.8 | 92.82 |
| Non-Linear Architecture | 8 | 5,372,800 | 0.9916 | 68.02 | 28.2 | 100.05 |
| Non-Linear Architecture | 3 | 5,125,690 | 0.8689 | 67.68 | 28.1 | 95.45 |
Results on the XSum dataset (text summarization):

| Architecture | PCA Heads | Number of Parameters | Training Loss | ROUGE-1 (%) | ROUGE-2 (%) | ROUGE-L (%) |
|---|---|---|---|---|---|---|
| Baseline Transformer | 8 | 5,370,112 | 3.245 | 32.45 | 11.78 | 25.67 |
| Direct Architecture | 8 | 5,373,616 | 3.187 | 33.12 | 12.05 | 26.21 |
| Direct Architecture | 3 | 5,127,586 | 3.356 | 32.38 | 11.72 | 25.59 |
| Average Architecture | 8 | 4,992,688 | 3.278 | 32.01 | 11.56 | 25.33 |
| Average Architecture | 3 | 4,984,738 | 3.512 | 30.87 | 11.02 | 24.45 |
| Non-Linear Architecture | 8 | 5,372,800 | 3.298 | 32.23 | 11.69 | 25.48 |
| Non-Linear Architecture | 3 | 5,125,690 | 3.321 | 32.09 | 11.63 | 25.37 |
Results on the SQuAD v1.1 dataset (question answering):

| Architecture | PCA Heads | Number of Parameters | Training Loss | Exact Match (%) | F1 Score (%) | Params (% of Base) |
|---|---|---|---|---|---|---|
| Baseline Transformer | 8 | 5,370,112 | 2.876 | 67.89 | 77.45 | 100.00 |
| Direct Architecture | 8 | 5,373,616 | 2.812 | 68.73 | 78.21 | 100.07 |
| Direct Architecture | 3 | 5,127,586 | 2.934 | 67.65 | 77.18 | 95.48 |
| Average Architecture | 8 | 4,992,688 | 2.901 | 67.42 | 76.95 | 92.97 |
| Average Architecture | 3 | 4,984,738 | 3.087 | 65.78 | 75.32 | 92.82 |
| Non-Linear Architecture | 8 | 5,372,800 | 2.889 | 67.98 | 77.56 | 100.05 |
| Non-Linear Architecture | 3 | 5,125,690 | 2.923 | 67.54 | 77.09 | 95.45 |
Results on the SlimPajama dataset (language modeling):

| Architecture | PCA Heads | Number of Parameters | Validation Perplexity | Training Time (days) | Inference Time (ms) | Convergence Epoch |
|---|---|---|---|---|---|---|
| Baseline Transformer | 16 | 354,818,048 | 9.86 | 8.5 | 41 | 2.7 |
| Direct Architecture | 16 | 354,850,816 | 9.41 | 8.8 | 42 | 2.3 |
| Direct Architecture | 8 | 354,023,424 | 9.57 | 8.0 | 37 | 2.5 |
| Average Architecture | 16 | 354,588,672 | 9.63 | 8.6 | 40 | 2.6 |
| Average Architecture | 8 | 353,761,280 | 9.98 | 7.8 | 36 | 2.9 |
| Non-Linear Architecture | 16 | 354,899,968 | 9.52 | 9.0 | 44 | 2.4 |
| Non-Linear Architecture | 8 | 354,072,576 | 9.61 | 8.2 | 39 | 2.6 |