EFA-Trans: An Efficient and Flexible Acceleration Architecture for Transformers
Abstract
1. Introduction
- We propose EFA-Trans, a high-efficiency architecture for accelerating transformers that offers excellent configurability and achieves high utilization for matrix multiplications of different sizes.
- We implement and optimize two complicated nonlinear functions to obtain better performance. Furthermore, we propose a customized on-chip memory design that realizes on-the-fly matrix transposition.
- EFA-Trans is compatible with dense and sparse computing paradigms and can switch between them dynamically at runtime, significantly reducing storage requirements and overall latency. Experiments demonstrate the effectiveness and low computational cost of this scheme.
- We devise a performance analytical model to evaluate how latency depends on the sparsity ratio and the architecture parameter set. A comprehensive analysis of this model lets us balance acceleration performance against resource consumption in EFA-Trans; a toy sketch of this kind of latency model is shown after this list.
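The exact analytical model is defined in Section 4; as a rough illustration of the kind of relationship it captures, the minimal Python sketch below estimates matrix-computation latency from the operation count, the sparsity ratio, and two architecture parameters. The formula and all names (`num_macs`, `freq_mhz`, the ideal-utilization assumption) are illustrative assumptions, not the paper's model.

```python
# Illustrative sketch only: a toy latency estimate relating sparsity and
# architecture parameters. The formula and parameter names are assumptions,
# not the analytical model of Section 4.
def estimate_latency_ms(dense_macs, sparsity, num_macs=1024, freq_mhz=200,
                        nonlinear_overhead_ms=0.0):
    """Estimate the latency of one layer's matrix computations.

    dense_macs: multiply-accumulate count of the dense workload
    sparsity:   fraction of weights pruned away (0.0 = fully dense)
    num_macs:   MAC units working in parallel (e.g., the MCA size)
    freq_mhz:   operating frequency in MHz
    """
    kept_macs = dense_macs * (1.0 - sparsity)   # work surviving pruning
    cycles = kept_macs / num_macs               # ideal, fully utilized array
    return cycles / (freq_mhz * 1e3) + nonlinear_overhead_ms

# Example: 0.2 G dense MACs at 50% sparsity on a 1024-MAC array at 200 MHz.
print(f"{estimate_latency_ms(0.2e9, 0.5):.3f} ms")   # ~0.488 ms
```

The actual model in Section 4 accounts for the architecture parameter set and the nonlinear cores in more detail than this idealized sketch.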
2. Background and Motivation
2.1. Attention-Based Transformer Models
2.2. Model Pruning
3. Accelerator Architecture
3.1. Architecture Overview
3.2. Matrix Computation Array (MCA) and Workload Mapping
3.3. Nonlinear Module Design and Optimization Strategies
3.3.1. Softmax Core (SMC) and Optimization
3.3.2. LayerNorm Core (LNC) and Optimization
3.4. Improvements to the On-Chip Memories
3.4.1. Features of the On-Chip Memories
3.4.2. On-the-Fly Matrix Transposition
3.5. Supporting Sparse Matrix Computation
Algorithm 1: Bank-balanced weight pruning algorithm
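The listing of Algorithm 1 is not reproduced here; as a minimal NumPy sketch of the underlying idea of bank-balanced pruning (cf. Cao et al.), each weight row is split into equal-sized banks and, within every bank, the same number of largest-magnitude weights is kept, so the non-zero count per bank is identical. Function and parameter names are illustrative and not taken from Algorithm 1.

```python
import numpy as np

def bank_balanced_prune(weights, bank_size, sparsity):
    """Prune each row so that every bank keeps the same number of weights.

    weights:   2-D weight matrix; each row is split into banks of bank_size
    bank_size: elements per bank (row length must be divisible by it)
    sparsity:  fraction of weights removed inside every bank
    """
    rows, cols = weights.shape
    assert cols % bank_size == 0, "row length must be a multiple of bank_size"
    keep = bank_size - int(round(bank_size * sparsity))  # survivors per bank
    pruned = np.zeros_like(weights)
    for r in range(rows):
        for b in range(0, cols, bank_size):
            bank = weights[r, b:b + bank_size]
            # indices of the `keep` largest-magnitude entries in this bank
            top = np.argsort(np.abs(bank))[bank_size - keep:]
            pruned[r, b + top] = bank[top]
    return pruned

# Example: 50% bank-balanced sparsity with banks of 4 elements.
w = np.random.randn(2, 8).astype(np.float32)
print(bank_balanced_prune(w, bank_size=4, sparsity=0.5))
```

Because every bank retains the same number of non-zeros, the pruned matrix maps evenly onto parallel compute lanes and banked memories, which is what makes this pattern hardware-friendly.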
3.6. Fine-Grained Scheduling
4. Performance Analytical Model
5. Evaluation and Experiments
5.1. Experimental Setup
5.2. Characteristics of Accelerators
5.3. Resource Utilization
5.4. Performance Comparison
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar] [CrossRef]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
- Li, B.; Pandey, S.; Fang, H.; Lyv, Y.; Li, J.; Chen, J.; Xie, M.; Wan, L.; Liu, H.; Ding, C. FTRANS: Energy-efficient acceleration of transformers using FPGA. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, Boston, MA, USA, 10–12 August 2020; pp. 175–180. [Google Scholar] [CrossRef]
- Ham, T.J.; Jung, S.J.; Kim, S.; Oh, Y.H.; Park, Y.; Song, Y.; Park, J.-H.; Lee, S.; Park, K.; Lee, J.W.; et al. A³: Accelerating Attention Mechanisms in Neural Networks with Approximation. In Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), San Diego, CA, USA, 16 April 2020; pp. 328–341. [Google Scholar] [CrossRef]
- Ham, T.J.; Lee, Y.; Seo, S.H.; Kim, S.; Choi, H.; Jung, S.J.; Lee, J.W. ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 4 August 2021; pp. 692–705. [Google Scholar] [CrossRef]
- Zhang, X.; Wu, Y.; Zhou, P.; Tang, X.; Hu, J. Algorithm-hardware Co-design of Attention Mechanism on FPGA Devices. ACM Trans. Embed. Comput. Syst. 2021, 20, 1–24. [Google Scholar] [CrossRef]
- Lu, S.; Wang, M.; Liang, S.; Lin, J.; Wang, Z. Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer. In Proceedings of the 2020 IEEE 33rd International System-on-Chip Conference (SOCC), Las Vegas, NV, USA, 6 September 2021; pp. 84–89. [Google Scholar] [CrossRef]
- Cao, S.; Zhang, C.; Yao, Z.; Xiao, W.; Nie, L.; Zhan, D.; Liu, Y.; Wu, M.; Zhang, L. Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, 24–26 February 2019; pp. 63–72. [Google Scholar] [CrossRef]
- Wang, H.; Zhang, Z.; Han, S. SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. arXiv 2020, arXiv:2012.09852. [Google Scholar] [CrossRef]
- Qi, P.; Song, Y.; Peng, H.; Huang, S.; Zhuge, Q.; Sha, E.H.-M. Accommodating Transformer onto FPGA: Coupling the Balanced Model Compression and FPGA-Implementation Optimization. In Proceedings of the 2021 on Great Lakes Symposium on VLSI, Virtual Event, USA, 22–25 June 2021; pp. 163–168. [Google Scholar] [CrossRef]
- Peng, H.; Huang, S.; Geng, T.; Li, A.; Jiang, W.; Liu, H.; Wang, S.; Ding, C. Accelerating Transformer-based Deep Learning Models on FPGAs using Column Balanced Block Pruning. In Proceedings of the 2021 22nd International Symposium on Quality Electronic Design (ISQED), Santa Clara, CA, USA, 7–9 April 2021; pp. 142–148. [Google Scholar] [CrossRef]
- Khan, H.; Khan, A.; Khan, Z.; Bin Huang, L.; Wang, K.; He, L. NPE: An FPGA-based Overlay Processor for Natural Language Processing. In Proceedings of the 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Virtual Event, USA, 28 February–2 March 2021. [Google Scholar] [CrossRef]
- Li, B.; Kong, Z.; Zhang, T.; Li, J.; Li, Z.; Liu, H.; Ding, C. Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning. arXiv 2020, arXiv:2009.08065. [Google Scholar]
- Narang, S.; Undersander, E.; Diamos, G. Block-sparse recurrent neural networks. arXiv 2017, arXiv:1711.02782. [Google Scholar]
- Jean, S.; Firat, O.; Cho, K.; Memisevic, R.; Bengio, Y. Montreal Neural Machine Translation Systems for WMT’15. In Proceedings of the 10th Workshop on Statistical Machine Translation, Lisbon, Portugal, 17–18 September 2015; pp. 134–140. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
- Liu, Z.; Li, G.; Cheng, J. Hardware Acceleration of Fully Quantized BERT for Efficient Natural Language Processing. In Proceedings of the 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 1–5 February 2021. [Google Scholar]
| Stage | Computation Steps |
|---|---|
| Embedding and Positional Encoding | EM/PE |
| Multi-Head Self-Attention | COM1, COM2, COM3, COM4 |
| Residual Addition and Layer Normalization A | LN-A |
| Position-Wise Feed-Forward | FFN1, FFN2 |
| Residual Addition and Layer Normalization B | LN-B |
| | [5] | [8] | [9] | [12] | This Work |
|---|---|---|---|---|---|
| Configurability | ✘ | ✔ | ✔ | ✘ | ✔ |
| Complete layer | ✔ | ✘ | ✘ | ✔ | ✔ |
| Compatibility ¹ | ✘ | ✘ | ✘ | ✘ | ✔ |
| Module | Matrix Dimensions | OPs | DSP | LUT | FF | BRAM |
|---|---|---|---|---|---|---|
| MCA | X ¹—[64, 512]; MHA—[512, 512]; FFN1—[512, 2048]; FFN2—[2048, 512] | 0.41 G | 1024 | 40,508 | 6114 | / |
| SMC | Softmax—[64, 64] | 0.13 M | 0 | 4665 | 5774 | / |
| LNC | LN—[64, 512] | 0.46 M | 0 | 2787 | 838 | / |
| Memory Groups | / | / | / | 16,685 | 17,936 | 539 |
| Transmission | / | / | / | 200 | 107 | / |
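As a sanity check on the MCA operation count, one plausible accounting that reproduces the 0.41 G figure assumes a sequence length of 64 (consistent with the [64, 512] input and the [64, 64] softmax above), four [512, 512] projections plus the attention score and context products inside MHA, and counts a multiply and an add as two separate operations:

$$
\underbrace{4\times 64\times 512\times 512}_{\text{Q, K, V, O projections}}
+\underbrace{2\times 64\times 64\times 512}_{QK^{T}\ \text{and}\ AV}
+\underbrace{64\times 512\times 2048 + 64\times 2048\times 512}_{\text{FFN1, FFN2}}
\approx 2.06\times 10^{8}\ \text{MACs}\approx 0.41\ \text{G OPs}.
$$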
| | DSP | LUT | FF | BRAM | Latency (ms) |
|---|---|---|---|---|---|
| Available | 2520 | 274,080 | 548,160 | 912 | - |
| Dense | 1024 | 65,385 | 31,739 | 539 | 1.47 |
| Dense–Sparse | 512 | 132,433 | 52,332 | 439 | 0.87 |
| | CPU (i5-4460) | GPU (RTX 3060) | This Work (ZCU102) |
|---|---|---|---|
| Latency (ms) | 4.66 | 0.71 | 1.47 |
| Throughput (GOPS) | 88.2 | 579.3 | 279.8 |
| Power (W) | 41 | 86 | 5.48 |
| Energy Efficiency (GOPS/W) | 2.15 | 6.74 | 51.06 |
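The energy-efficiency row follows directly from dividing throughput by power; for the ZCU102 implementation, for example:

$$
\text{Energy Efficiency}=\frac{\text{Throughput}}{\text{Power}}=\frac{279.8\ \text{GOPS}}{5.48\ \text{W}}\approx 51.06\ \text{GOPS/W}.
$$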
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).