Resource-Efficient Optimization for FPGA-Based Convolution Accelerator
Abstract
1. Introduction
- Two resource-efficient optimizations of the single-cycle, accurate radix-4 Booth multiplier are proposed. They exploit the six-input look-up tables (LUTs) and associated carry chains of the FPGA, and they also simplify the addition step of the multiply–accumulate operation.
- A multiply–accumulate structure built on radix-4 Booth multipliers is proposed that accumulates partial products directly, without computing the intermediate product and sum for each input datum. This reduces hardware utilization and can be extended to other multiply–accumulate operations.
- The proposed convolution accelerator, built on the optimized multiply–accumulate structure, achieves performance and resource utilization comparable to accelerators based on approximate multipliers, without any loss of accuracy.
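
As a rough software illustration of the ideas in the list above, the following Python sketch recodes one operand into radix-4 Booth digits and then sums the partial products of several multiplications into a single accumulator, without forming any individual product. It is a behavioural model only, not the LUT and carry-chain mapping proposed in this work; the function names, the 8-bit default width, and the choice of which operand is recoded are assumptions made for the example.

```python
def booth_radix4_digits(y, bits=8):
    """Recode a signed two's-complement value into radix-4 Booth digits.

    Digit i lies in {-2, -1, 0, 1, 2} and carries weight 4**i.
    """
    if bits % 2:
        raise ValueError("bits must be even for radix-4 recoding")
    if not -(1 << (bits - 1)) <= y < (1 << (bits - 1)):
        raise ValueError("operand does not fit in the given width")
    pattern = y & ((1 << bits) - 1)   # two's-complement bit pattern of y
    padded = pattern << 1             # append the implicit bit y[-1] = 0
    digits = []
    for i in range(bits // 2):
        triplet = (padded >> (2 * i)) & 0b111      # y[2i+1], y[2i], y[2i-1]
        b_hi = (triplet >> 2) & 1
        b_mid = (triplet >> 1) & 1
        b_lo = triplet & 1
        digits.append(-2 * b_hi + b_mid + b_lo)
    return digits


def partial_products(x, y, bits=8):
    """Signed partial products of x * y, one per Booth digit of y."""
    return [(d * x) << (2 * i)
            for i, d in enumerate(booth_radix4_digits(y, bits))]


def booth_multiply(x, y, bits=8):
    """Single product, obtained by summing its own partial products."""
    return sum(partial_products(x, y, bits))


def mac_by_partial_products(xs, ws, bits=8):
    """Multiply-accumulate that never forms an individual product x * w.

    Every partial product of every (x, w) pair goes straight into one
    accumulator (assumption: the weight w is the recoded operand).
    """
    acc = 0
    for x, w in zip(xs, ws):
        for pp in partial_products(x, w, bits):
            acc += pp
    return acc


if __name__ == "__main__":
    print(booth_radix4_digits(-3))                            # [1, -1, 0, 0]
    print(mac_by_partial_products([1, -2, 3], [4, 5, -6]))    # -24
```

In hardware, the inner additions would be carried out by a compressor or adder structure rather than a sequential loop; the sketch only shows that the accumulated result is unchanged when the per-product sums are skipped.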
2. Problem Formulation
3. Proposed Designs
3.1. Radix-4 Booth Multiplier and Its Sign Bit Extension
3.2. LUT-Based Optimization of Radix-4 Booth Multiplier
3.3. Carry-Chain-Based Optimization of Radix-4 Booth Multiplier
3.4. Partial Product Accumulation Based Optimization of Convolutional Process Unit
4. Discussion
4.1. Experimental Setup
4.2. Implementation Results of Optimized Multiplier
4.3. Implementation Results of Multiply–Accumulate Structure
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Mittal, S. A survey of FPGA-based accelerators for convolutional neural networks. Neural Comput. Appl. 2020, 32, 1109–1139.
2. Wang, D.; Xu, K.; Guo, J.; Ghiasi, S. DSP-efficient hardware acceleration of convolutional neural network inference on FPGAs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 4867–4880.
3. Ullah, S.; Sripadra, S.; Murthy, J.; Kumar, A. SMApproxLib: Library of FPGA-based approximate multipliers. In Proceedings of the IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 24–28 June 2018; pp. 1–6.
4. Xilinx LogiCORE IP v12.0. Available online: https://www.xilinx.com/support/documentation/ip_documentation/mult_gen/v12_0/pg108-mult-gen.pdf (accessed on 21 July 2023).
5. Lentaris, G. Combining arithmetic approximation techniques for improved CNN circuit design. In Proceedings of the IEEE International Conference on Electronics, Circuits and Systems (ICECS), Glasgow, UK, 23–25 November 2020; p. 9294869.
6. Ebrahimi, Z.; Ullah, S.; Kumar, A. LeAp: Leading-one detection-based softcore approximate multipliers with tunable accuracy. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), Beijing, China, 13–16 January 2020; pp. 605–610.
7. Csordás, G.; Fehér, B.; Kovácsházy, T. Application of bit-serial arithmetic units for FPGA implementation of convolutional neural networks. In Proceedings of the International Carpathian Control Conference (ICCC), Szilvasvarad, Hungary, 28–31 May 2018; pp. 322–327.
8. Zhang, H.; Xiao, H.; Qu, H.; Ko, S. FPGA-based approximate multiplier for efficient neural computation. In Proceedings of the IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), Gangwon, Republic of Korea, 1–3 November 2021; pp. 1–4.
9. Lammie, C.; Azghadi, M. Stochastic computing for low-power and high-speed deep learning on FPGA. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Sapporo, Japan, 26–29 May 2019; pp. 1–5.
10. Thamizharasan, V.; Kasthuri, N. High-Speed Hybrid Multiplier Design Using a Hybrid Adder with FPGA Implementation. IETE J. Res. 2021, 69, 2301–2309.
11. Balasubramanian, P.; Nayar, R.; Maskell, D.L. Digital Image Blending Using Inaccurate Addition. Electronics 2022, 11, 3095.
12. Kumar, S.R.; Balasubramanian, P.; Reddy, R. Optimized Fault-Tolerant Adder Design Using Error Analysis. J. Circuits Syst. Comput. 2023, 32, 6.
13. Sarwar, S.S.; Venkataramani, S.; Raghunathan, A.; Roy, K. Multiplier-less artificial neurons exploiting error resiliency for energy-efficient neural computing. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE), Dresden, Germany, 14–18 March 2016; pp. 145–150.
14. Kala, S.; Jose, B.; Mathew, J.; Nalesh, S. High-performance CNN accelerator on FPGA using unified Winograd-GEMM architecture. IEEE Trans. Very Large Scale Integr. VLSI Syst. 2019, 27, 2816–2828.
15. Toan, N.V.; Lee, J.G. FPGA-based multi-Level approximate multipliers for high-performance error-resilient applications. IEEE Access 2020, 8, 25481–25497.
16. Wang, X.; Wang, C.; Cao, J.; Gong, L. WinoNN: Optimizing FPGA-Based Convolutional Neural Network Accelerators Using Sparse Winograd Algorithm. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 4290–4302.
17. Ullah, S.; Rehman, S.; Shafique, M.; Kumar, A. High-performance accurate and approximate multipliers for FPGA-based hardware accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 211–224.
18. Farrukh, F. Power efficient tiny Yolo CNN using reduced hardware resources based on Booth multiplier and Wallace tree adders. IEEE Open J. Circuits Syst. 2020, 1, 76–87.
19. Rooban, S. Implementation of 128-bit radix-4 booth multiplier. In Proceedings of the International Conference of Computer Communication and Informatics (ICCCI), Coimbatore, India, 27–29 January 2021; pp. 1–7.
20. Chang, Y.; Cheng, Y.; Liao, S.; Hsiao, C. A low power radix-4 booth multiplier with pre-encoded mechanism. IEEE Access 2020, 8, 114842–114853.
21. Kumm, M.; Kappauf, J. Advanced compressor tree synthesis for FPGAs. IEEE Trans. Comput. 2018, 67, 1078–1091.
22. Ullah, S.; Nguyen, T.; Kumar, A. Energy-efficient low-latency signed multiplier for FPGA-based hardware accelerators. IEEE Embed. Syst. Lett. 2021, 13, 41–44.
23. Ullah, S. Area-optimized low-latency approximate multipliers for FPGA-based hardware accelerators. In Proceedings of the IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 24–28 June 2018; pp. 1–6.
24. Kumm, M.; Abbas, S.; Zipf, P. An efficient softcore multiplier architecture for Xilinx FPGAs. In Proceedings of the Symposium on Computer Arithmetic (ARITH), Lyon, France, 22–24 June 2015; pp. 18–25.
25. Waris, H.; Wang, C.; Liu, W.; Lombardi, F. AxBMs: Approximate radix-8 booth multipliers for high-performance FPGA-based accelerators. IEEE Trans. Circuits Syst. II Express Briefs 2021, 68, 1566–1570.
26. Yan, S. An FPGA-based MobileNet accelerator considering network structure characteristics. In Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), Virtual, Dresden, Germany, 30 August 2021; pp. 17–23.
27. Howard, A.; Sandler, M.; Chu, G. Searching for MobileNetV3. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October 2019; pp. 1314–1324.
Designs | LUT |
---|---|
Unoptimized | 92 |
Sign bit extension | 77 |
Proposed (LUT) | 63 |
Proposed (carry chain) | 41 |
Design | 8 × 8 LUT | 8 × 8 Carry Chain | 8 × 8 Delay (ns) | 8 × 8 Power (mW) | 8 × 8 PDP (pJ) | 16 × 16 LUT | 16 × 16 Carry Chain | 16 × 16 Delay (ns) | 16 × 16 Power (mW) | 16 × 16 PDP (pJ) |
---|---|---|---|---|---|---|---|---|---|---|
Sign bit extension | 77 | 0 | 3.011 | 15.8 | 47.5 | 279 | 0 | 3.583 | 56.8 | 203.5 |
Proposed (LUT) | 63 | 0 | 2.774 | 16.3 | 45.1 | 256 | 0 | 3.296 | 55.1 | 181.6 |
Proposed (carry chain) | 41 | 5 | 2.530 | 7.2 | 18.1 | 167 | 18 | 2.782 | 29.2 | 81.3 |
Vivado IP area | 85 | 7 | 1.967 | 10.8 | 21.2 | 324 | 26 | 2.177 | 47.0 | 102.3 |
Vivado IP speed | 73 | 14 | 1.602 | 8.3 | 13.3 | 281 | 45 | 1.922 | 31.7 | 60.83 |
T. Nguyen [22] | 69 | 2 | 2.927 | 10.1 | 29.5 | 263 | 4 | 3.323 | 41.8 | 139 |
S. Ullah [23] | 81 | 0 | 2.954 | 13.8 | 40.7 | 296 | 0 | 3.738 | 48.3 | 180.7 |
S. Abbas [24] | 73 | 2 | 3.037 | 15.1 | 45.9 | 270 | 4 | 3.642 | 67.6 | 246.2 |
S. Rehman [17] | 56 | 2 | 1.980 | 8.9 | 17.7 | 240 | 4 | 2.380 | 42.4 | 101 |
H. Waris [25] | 60 | 2 | 2.140 | 6.6 | 14.1 | 194 | 4 | 2.413 | 28.6 | 69.1 |
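
The PDP column in the table above is consistent with the usual definition of the power-delay product, PDP = power × delay, where milliwatts times nanoseconds gives picojoules directly. The short Python check below recomputes it for three 8 × 8 rows; the definition is assumed rather than quoted from the text, the helper name `pdp_pj` is hypothetical, and small differences from the tabulated values are expected because the table entries are rounded.

```python
def pdp_pj(power_mw, delay_ns):
    """Power-delay product in pJ (1 mW * 1 ns = 1e-3 W * 1e-9 s = 1e-12 J)."""
    return power_mw * delay_ns


# (design, power in mW, delay in ns, tabulated PDP in pJ) for the 8 x 8 case
ROWS_8X8 = [
    ("Sign bit extension", 15.8, 3.011, 47.5),
    ("Proposed (LUT)", 16.3, 2.774, 45.1),
    ("Proposed (carry chain)", 7.2, 2.530, 18.1),
]

if __name__ == "__main__":
    for design, power_mw, delay_ns, tabulated in ROWS_8X8:
        computed = pdp_pj(power_mw, delay_ns)
        print(f"{design}: computed {computed:.2f} pJ, tabulated {tabulated} pJ")
```
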
Designs | LUT | DSP | Freq. (MHz) | Power (mW) |
---|---|---|---|---|
Proposed | 362 | 0 | 395 | 71 |
Vivado’s default synthesis | 766 | 0 | 407 | 80 |
DSP blocks | 68 | 9 | 400 | 81 |
Rehman [17] | 570 | 0 | 397 | 89 |
H. Waris [25] | 604 | 0 | 383 | 69 |
Designs | LUT | DSP | Freq. (MHz) | Power (mW) |
---|---|---|---|---|
Proposed | 2792 | 0 | 342 | 154 |
R. Cai [26] | 5342 | 0 | 330 | 293 |
DSP blocks | 158 | 16 | 177 | 226 |
F. Farrukh [18] | 3612 | 0 | 533 | 176 |
Design | LUT | DSP | Freq. (MHz) | Power (W) | PSNR | SSIM |
---|---|---|---|---|---|---|
Proposed | 9688 | 0 | 379 | 1.144 | ∞ | 1 |
DSP blocks | 608 | 64 | 383 | 1.181 | ∞ | 1 |
Rehman [17] | 11,025 | 0 | 393 | 1.296 | 56.17 | 0.990 |
H. Waris [25] | 10,870 | 0 | 387 | 1.004 | 52.58 | 0.974 |
Layer | Input | Filter | Size | Stride | Output |
---|---|---|---|---|---|
Conv1 | 32 × 32 × 3 | 16 | 3 × 3 | 1 | 30 × 30 × 16 |
Conv2 | 30 × 30 × 16 | 16 | 3 × 3 | 1 | 28 × 28 × 16 |
Max pooling1 | 28 × 28 × 16 | N/A | 2 × 2 | 2 | 14 × 14 × 16 |
Conv3 | 14 × 14 × 16 | 32 | 3 × 3 | 1 | 12 × 12 × 32 |
Conv4 | 12 × 12 × 32 | 32 | 3 × 3 | 1 | 10 × 10 × 32 |
Max pooling2 | 10 × 10 × 32 | N/A | 2 × 2 | 2 | 5 × 5 × 32 |
Full connect1 | 800 | N/A | N/A | N/A | 120 |
Full connect2 | 120 | N/A | N/A | N/A | 84 |
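
The feature-map sizes in the table above follow from the standard output-size rule for an unpadded window, out = (in − kernel) / stride + 1. The Python sketch below traces the shapes layer by layer; the absence of padding is an assumption inferred from the listed dimensions, and the names `out_size`, `LAYERS`, and `trace_shapes` are hypothetical.

```python
def out_size(size, kernel, stride):
    """Output width/height of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


# (name, kernel, stride, output channels); None keeps the channel count (pooling)
LAYERS = [
    ("Conv1", 3, 1, 16),
    ("Conv2", 3, 1, 16),
    ("Max pooling1", 2, 2, None),
    ("Conv3", 3, 1, 32),
    ("Conv4", 3, 1, 32),
    ("Max pooling2", 2, 2, None),
]


def trace_shapes(size=32, channels=3):
    """Return (name, height, width, channels) after each layer."""
    shapes = []
    for name, kernel, stride, out_ch in LAYERS:
        size = out_size(size, kernel, stride)
        channels = channels if out_ch is None else out_ch
        shapes.append((name, size, size, channels))
    return shapes


if __name__ == "__main__":
    for name, h, w, c in trace_shapes():
        print(f"{name}: {h} x {w} x {c}")
    _, h, w, c = trace_shapes()[-1]
    print("Flattened input to the first fully connected layer:", h * w * c)
```

The flattened size of the last pooling output, 5 × 5 × 32 = 800, matches the input width listed for the first fully connected layer.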
Design | LUT Utilization | FF Utilization | DSP Utilization | Freq. (MHz) |
---|---|---|---|---|
Proposed | 23,887 (34%) | 6172 (4.37%) | 0 (0%) | 175 |
DSP blocks | 15,695 (22%) | 6172 (4.37%) | 211 (58%) | 173 |
Design | LUT Utilization | FF Utilization | DSP Utilization | Freq. (MHz) |
---|---|---|---|---|
Proposed | 51,798 (73%) | 34,408 (24%) | 308 (85%) | 116 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ma, Y.; Xu, Q.; Song, Z. Resource-Efficient Optimization for FPGA-Based Convolution Accelerator. Electronics 2023, 12, 4333. https://doi.org/10.3390/electronics12204333