FPGA-Based Convolutional Neural Network Accelerator with Resource-Optimized Approximate Multiply-Accumulate Unit
Abstract
1. Introduction
- An approximate MAC operator, built from bit-level modifications and functions provided by the HLS tool, is proposed and implemented for a CNN accelerator on an FPGA (a hedged sketch of the idea follows this list).
- An additional data-size optimization is applied to the CNN by removing unused bits after the output activation function.
- Experiments were performed with various data bit widths on the HLS implementation of the CNN accelerator, and the performance results are analyzed.
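The outline does not reproduce the operator's HDL, so the following is only a minimal sketch of the general idea, emulating the 18-bit Q6.12 format from Section 3.2 with plain integers; the rounding constant, function names, and the carry-injection variant are illustrative guesses, not the authors' exact design.

```cpp
#include <cstdint>
#include <cstdio>

// Sketch only (not the paper's RTL): emulate 18-bit Q6.12 fixed point
// in plain integers, and reduce the 36-bit exact product back to Q6.12
// inside the MAC instead of carrying full precision forward.
using fixed_t = int32_t;           // holds an 18-bit Q6.12 value
constexpr int FRAC = 12;           // fraction bits

// "Rounded" MAC guess: add half an LSB before truncating the product.
fixed_t mac_rounded(fixed_t acc, fixed_t a, fixed_t b) {
    int64_t p = (int64_t)a * b + ((int64_t)1 << (FRAC - 1));
    return acc + (fixed_t)(p >> FRAC);
}

// "Carry" MAC guess: truncate the product and fold the rounding bias
// into the accumulation as a carry-in, saving the rounding adder.
fixed_t mac_carry(fixed_t acc, fixed_t a, fixed_t b) {
    int64_t p = (int64_t)a * b;
    return acc + (fixed_t)(p >> FRAC) + 1;   // +1 models the carry-in
}

int main() {
    fixed_t a = 3 << (FRAC - 1);   // 1.5 in Q6.12
    fixed_t b = 9 << (FRAC - 2);   // 2.25 in Q6.12
    // 1.5 * 2.25 = 3.375 -> 13824 in Q6.12
    printf("%d %d\n", mac_rounded(0, a, b), mac_carry(0, a, b));
    return 0;
}
```

The sketch only shows why such a variant can trade a small accuracy loss (9614 vs. 9821 correct in the accuracy table below) for cheaper logic; where exactly the carry is injected is not specified in this outline.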
2. Background
2.1. CNN
2.2. LeNet-5
2.3. FPGA
3. Proposed Accelerator Design
3.1. Loop Parallelization
3.2. Fixed-Point Data Optimization
3.3. Approximate MAC Operations
4. Experimental Results
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
1. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
2. Li, H.; Lin, Z.; Shen, X.; Brandt, J.; Hua, G. A convolutional neural network cascade for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5325–5334.
3. Wu, J.; Leng, C.; Wang, Y.; Hu, Q.; Cheng, J. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4820–4828.
4. Yamashita, R.; Nishio, M.; Do, R.K.G.; Togashi, K. Convolutional neural networks: An overview and application in radiology. Insights Imaging 2018, 9, 611–629.
5. Xiao, T.; Xu, Y.; Yang, K.; Zhang, J.; Peng, Y.; Zhang, Z. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
6. Vinayakumar, R.; Soman, K.P.; Poornachandran, P. Applying convolutional neural network for network intrusion detection. In Proceedings of the 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Udupi, India, 13–16 September 2017; pp. 1222–1228.
7. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 1–74.
8. Guo, Z.; Huang, Y.; Hu, X.; Wei, H.; Zhao, B. A survey on deep learning based approaches for scene understanding in autonomous driving. Electronics 2021, 10, 471.
9. Anwar, S.; Hwang, K.; Sung, W. Fixed point optimization of deep convolutional neural networks for object recognition. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 1131–1135.
10. Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; Zou, Y. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv 2016, arXiv:1606.06160.
11. Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. J. Mach. Learn. Res. 2017, 18, 6869–6898.
12. Courbariaux, M.; Bengio, Y.; David, J.P. Training deep neural networks with low precision multiplications. arXiv 2014, arXiv:1412.7024.
13. Gysel, P.; Motamedi, M.; Ghiasi, S. Hardware-oriented approximation of convolutional neural networks. arXiv 2016, arXiv:1604.03168.
14. Zhuang, B.; Shen, C.; Tan, M.; Liu, L.; Reid, I. Towards effective low-bitwidth convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7920–7928.
15. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; pp. 161–170.
16. Qiu, J.; Wang, J.; Yao, S.; Guo, K.; Li, B.; Zhou, E.; Yu, J.; Tang, T.; Xu, N.; Song, S.; et al. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 21–23 February 2016; pp. 26–35.
17. Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, J.S. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 45–54.
18. Zhou, Y.; Jiang, J. An FPGA-based accelerator implementation for deep convolutional neural networks. In Proceedings of the 2015 4th International Conference on Computer Science and Network Technology (ICCSNT), Harbin, China, 19–20 December 2015; Volume 1, pp. 829–832.
19. Ghaffari, S.; Sharifian, S. FPGA-based convolutional neural network accelerator design using high level synthesize. In Proceedings of the 2nd International Conference of Signal Processing and Intelligent Systems (ICSPIS), Tehran, Iran, 14–15 December 2016; pp. 1–6.
20. Gschwend, D. ZynqNet: An FPGA-accelerated embedded convolutional neural network. arXiv 2020, arXiv:2005.06892.
21. Abdelouahab, K.; Bourrasset, C.; Pelcat, M.; Berry, F.; Quinton, J.C.; Serot, J. A holistic approach for optimizing DSP block utilization of a CNN implementation on FPGA. In Proceedings of the 10th International Conference on Distributed Smart Camera, Paris, France, 12–15 September 2016; pp. 69–75.
22. Lee, S.; Kim, D.; Nguyen, D.; Lee, J. Double MAC on a DSP: Boosting the performance of convolutional neural networks on FPGAs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2019, 38, 888–897.
23. Wang, D.; Xu, K.; Guo, J.; Ghiasi, S. DSP-efficient hardware acceleration of convolutional neural network inference on FPGAs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 4867–4880.
24. Chen, W.; Wu, H.; Wei, S.; He, A.; Chen, H. An asynchronous energy-efficient CNN accelerator with reconfigurable architecture. In Proceedings of the IEEE Asian Solid-State Circuits Conference (A-SSCC), Tainan, Taiwan, 5–7 November 2018; pp. 51–54.
25. Giardino, D.; Matta, M.; Silvestri, F.; Spanò, S.; Trobiani, V. FPGA implementation of hand-written number recognition based on CNN. Int. J. Adv. Sci. Eng. Inf. Technol. 2019, 9, 167–171.
26. Rongshi, D.; Yongming, T. Accelerator implementation of LeNet-5 convolution neural network based on FPGA with HLS. In Proceedings of the 2019 3rd International Conference on Circuits, System and Simulation (ICCSS), Nanjing, China, 13–15 June 2019; pp. 64–67.
27. Shi, Y.; Gan, T.; Jiang, S. Design of parallel acceleration method of convolutional neural network based on FPGA. In Proceedings of the 2020 IEEE 5th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA), Chengdu, China, 10–13 April 2020; pp. 133–137.
28. Shan, D.; Cong, G.; Lu, W. A CNN accelerator on FPGA with a flexible structure. In Proceedings of the 2020 5th International Conference on Computational Intelligence and Applications (ICCIA), Beijing, China, 19–21 June 2020; pp. 211–216.
29. Xiao, T.; Tao, M. Research on FPGA based convolutional neural network acceleration method. In Proceedings of the 2021 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), Dalian, China, 28–30 June 2021; pp. 289–292.
30. UltraScale Architecture DSP Slice User Guide. 2020. Available online: https://www.xilinx.com/support/documentation/user_guides/ug579-ultrascale-dsp.pdf (accessed on 28 July 2021).
31. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
32. The MNIST Database of Handwritten Digits. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 29 July 2021).
33. Zeiler, M.D.; Fergus, R. Stochastic pooling for regularization of deep convolutional neural networks. arXiv 2013, arXiv:1301.3557.
34. Yu, D.; Wang, H.; Chen, P.; Wei, Z. Mixed pooling for convolutional neural networks. In Proceedings of the 9th International Conference on Rough Sets and Knowledge Technology (RSKT 2014), Shanghai, China, 24–26 October 2014; pp. 364–375.
35. Sun, M.; Song, Z.; Jiang, X.; Pan, J.; Pang, Y. Learning pooling for convolutional neural network. Neurocomputing 2017, 224, 96–104.
36. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for activation functions. arXiv 2017, arXiv:1710.05941.
37. Karlik, B.; Olgac, A.V. Performance analysis of various activation functions in generalized MLP architectures of neural networks. Int. J. Artif. Intell. Expert Syst. 2011, 1, 111–122.
38. Reduce Power and Cost by Converting from Floating Point to Fixed Point. 2017. Available online: https://www.xilinx.com/support/documentation/white_papers/wp491-floating-to-fixed-point.pdf (accessed on 28 July 2021).
LeNet-5 layer configuration:

Layer | Input | Weight | Bias | Kernel | Stride | Output |
---|---|---|---|---|---|---|
Conv1 | 1 × 32 × 32 | 6 × 1 × 5 × 5 | 6 | 5 | 1 | 6 × 28 × 28 |
Pool1 | 6 × 28 × 28 | 6 | 6 | 2 | 2 | 6 × 14 × 14 |
Conv2 | 6 × 14 × 14 | 16 × 6 × 5 × 5 | 16 | 5 | 1 | 16 × 10 × 10 |
Pool2 | 16 × 10 × 10 | 16 | 16 | 2 | 2 | 16 × 5 × 5 |
Conv3 | 16 × 5 × 5 | 120 × 16 × 5 × 5 | 120 | 5 | 1 | 1 × 120 |
FC1 | 1 × 120 | 120 × 84 | 84 | - | - | 1 × 84 |
FC2 | 1 × 84 | 84 × 10 | 10 | - | - | 1 × 10 |
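As a quick cross-check of the table, the output size of each convolution and pooling layer follows from out = (in − kernel)/stride + 1; the snippet below (a hypothetical helper, not code from the paper) reproduces the Output column.

```cpp
#include <cstdio>

// Output spatial size of a valid (no-padding) convolution or pooling
// window: out = (in - kernel) / stride + 1.
int out_dim(int in, int kernel, int stride) {
    return (in - kernel) / stride + 1;
}

int main() {
    printf("Conv1: %d\n", out_dim(32, 5, 1)); // 28
    printf("Pool1: %d\n", out_dim(28, 2, 2)); // 14
    printf("Conv2: %d\n", out_dim(14, 5, 1)); // 10
    printf("Pool2: %d\n", out_dim(10, 2, 2)); // 5
    printf("Conv3: %d\n", out_dim(5, 5, 1));  // 1 -> 120 outputs of 1 x 1
    return 0;
}
```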
Classification accuracy by data type (number of correct predictions out of 10,000 MNIST test images):

Data Type <Integer, Fraction> | Accuracy (Correct of 10,000) |
---|---|
Floating (32-bit) | 9863 |
Fixed <6, 12> | 9859 |
Fixed <6, 11> | 9858 |
Fixed <6, 10> | 9847 |
Fixed <6, 9> | 9834 |
Fixed <6, 8> | 9758 |
Fixed <6, 7> | 8977 |
Fixed <6, 6> | 3408 |
Rounded MAC | 9821 |
Carry MAC | 9614 |
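In Vivado HLS, formats like Fixed <6, 12> are typically expressed with the ap_fixed<W, I> template, where W is the total width and I the integer width (sign bit included); the mapping below is a sketch under that assumption, with illustrative type names rather than the paper's source.

```cpp
// Requires the Xilinx HLS arbitrary-precision headers; type names are
// illustrative, not taken from the paper's code.
#include "ap_fixed.h"

typedef ap_fixed<18, 6> data18_t;  // Fixed <6, 12>: 6 integer + 12 fraction bits
typedef ap_fixed<17, 6> data17_t;  // Fixed <6, 11>
typedef ap_fixed<12, 6> data12_t;  // Fixed <6, 6>: accuracy collapses to 3408
```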
Accuracy with uniform and mixed data widths (correct predictions out of 10,000 MNIST test images):

Data Type | Accuracy (Correct of 10,000) |
---|---|
18-bit | 9859 |
18/12-bit | 9840 |
12-bit | 3408 |
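The 18/12-bit row matches the data-size optimization listed in the Introduction: MACs run at 18 bits, but bits that carry no useful information after the output activation are dropped before storage. One plausible reading of that scheme, sketched with plain integers (the exact bit selection is an assumption):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical illustration of the 18/12-bit scheme: accumulate in
// 18-bit Q6.12, apply ReLU, then keep only 12 bits (here Q6.6) for
// storage between layers. The table's pure 12-bit row instead runs
// the MACs themselves at 12 bits, which is what destroys accuracy.
int16_t relu_to_12bit(int32_t acc_q6_12) {
    int32_t r = std::max<int32_t>(acc_q6_12, 0); // ReLU output is non-negative
    return (int16_t)(r >> 6);                    // drop 6 fraction bits -> Q6.6
}
```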
Timing results of the HLS implementations:

Data Type <Integer, Fraction> | Clock Period (ns) | Maximum Frequency (MHz) | Clock Cycles | Latency (ms) | Normalized Latency |
---|---|---|---|---|---|
Floating (32-bit) | 9.428 | 106.07 | 1,076,296 | 10.147 | 100% |
Fixed <6, 12> | 7.342 | 136.20 | 701,300 | 5.149 | 51% |
Fixed <6, 11> | 7.160 | 139.66 | 701,300 | 5.021 | 49% |
Fixed <6, 10> | 7.534 | 132.73 | 701,300 | 5.284 | 52% |
Fixed <6, 9> | 7.455 | 134.14 | 701,300 | 5.228 | 52% |
Fixed <6, 8> | 7.466 | 133.94 | 701,290 | 5.236 | 52% |
Fixed <6, 7> | 8.436 | 118.54 | 701,300 | 5.916 | 58% |
Fixed <6, 6> | 7.549 | 132.47 | 701,300 | 5.294 | 52%
Rounded MAC | 8.734 | 114.50 | 587,004 | 5.127 | 51% |
Carry MAC | 7.890 | 126.74 | 587,004 | 4.631 | 46% |
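The Latency column is simply the product of the two measured columns to its left, and Normalized Latency divides each latency by the floating-point baseline; for example:

$$ 9.428\,\text{ns} \times 1{,}076{,}296 \text{ cycles} \approx 10.147\,\text{ms}, \qquad \frac{5.149}{10.147} \approx 51\% \text{ for Fixed} \langle 6, 12 \rangle. $$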
FPGA resource utilization:

Data Type <Integer, Fraction> | CLB | LUT | FF | DSP | BRAM | SRL | Latch |
---|---|---|---|---|---|---|---|
Floating (32-bit) | 12,405 | 52,406 | 46,114 | 199 | 303 | 589 | 0 |
Fixed <6, 12> | 9653 | 43,929 | 31,100 | 549 | 159 | 581 | 32 |
Fixed <6, 11> | 9256 | 42,807 | 29,618 | 569 | 165 | 581 | 32 |
Fixed <6, 10> | 9236 | 41,841 | 28,720 | 569 | 163 | 581 | 32 |
Fixed <6, 9> | 9245 | 40,717 | 28,016 | 559 | 150 | 581 | 32 |
Fixed <6, 8> | 8932 | 39,360 | 27,531 | 539 | 144 | 581 | 32 |
Fixed <6, 7> | 8661 | 38,448 | 26,372 | 549 | 128 | 581 | 32 |
Fixed <6, 6> | 8030 | 37,598 | 25,196 | 559 | 124 | 581 | 32 |
Rounded MAC | 11,273 | 61,713 | 27,863 | 123 | 102 | 545 | 32 |
Carry MAC | 10,991 | 57,657 | 28,311 | 123 | 102 | 581 | 32 |
Comparison with previous FPGA-based CNN accelerators:

Model | [24] | [27] | [28] | This Work (Rounded MAC) | This Work (Carry MAC) |
---|---|---|---|---|---|
Year | 2018 | 2020 | 2020 | 2021 | 2021 |
FPGA | Virtex-7 XC7VX485T | Zynq XCZU9EG | Artix XC7A20 | Zynq XCZU9EG | Zynq XCZU9EG |
Clock (MHz) | - | 150 | 50 | 100 | 100 |
Precision (bit) | 16-bit fixed | 16-bit floating | 8-bit fixed | 12/8-bit fixed | 12/8-bit fixed |
Power (W) | 0.676 | - | 14.13 | 1.673 | 1.598 |
GOPs | 20.3 | 28.8 | 164.1 | 0.141 | 0.141 |
DSP | 406 | 204 | 571 | 123 | 123 |
LUT | 75,221 | 25,276 | 88,756 | 61,713 | 57,657 |
FF | 38,577 | 66,569 | 42,038 | 27,863 | 28,311 |
BRAM | 101 | 55 | 218 | 102 | 102 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Cite as: Cho, M.; Kim, Y. FPGA-Based Convolutional Neural Network Accelerator with Resource-Optimized Approximate Multiply-Accumulate Unit. Electronics 2021, 10, 2859. https://doi.org/10.3390/electronics10222859