HBCA: A Toolchain for High-Accuracy Branch-Fused CNN Accelerator on FPGA with Dual-Decimal-Fused Technique
Abstract
1. Introduction
2. Toolchain
- Inception-based branch-fuse
- 8-bit integer quantization
- Hardware emulator
- Fast algorithm
- HDL code generator
- Function simulation
- Parallelism exploration
- CNN updater
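Taken together, these stages form a single flow from a trained CNN to a deployable accelerator. The sketch below illustrates that ordering only: every stage is an identity placeholder, and all names are hypothetical, not the toolchain's actual API.

```python
"""Minimal sketch of the HBCA toolchain flow enumerated above.

Each stage is an identity placeholder standing in for the real tool; the
stage names follow the list, but the API is hypothetical."""

def make_stage(name):
    def stage(model):
        print(f"running stage: {name}")
        return model  # a real stage would transform the model here
    return stage

PIPELINE = [
    make_stage("Inception-based branch-fuse"),
    make_stage("8-bit integer quantization"),
    make_stage("hardware emulator"),
    make_stage("fast algorithm (Winograd)"),
    make_stage("HDL code generator"),
    make_stage("function simulation"),
    make_stage("parallelism exploration"),
    make_stage("CNN updater"),
]

def run_toolchain(model):
    for stage in PIPELINE:
        model = stage(model)
    return model

run_toolchain({"name": "RepInception"})
```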
2.1. Inception-Based Branch-Fuse
2.2. The 8-Bit Integer Quantization
3. Accelerator
3.1. Dual-Decimal-Fuse Technique
3.2. Winograd Decomposed-Part Reuse Technique
3.3. The Architecture of the Accelerator
3.3.1. Ping-Pong BRAM with Multi-Mode BRAM
3.3.2. Data Reuse and Padding of IFM Buffer
3.3.3. OFM Generator with Multi-Mode DSP
4. Experimental Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
2. Elhassouny, A.; Smarandache, F. Trends in deep convolutional neural networks architectures: A review. In Proceedings of the 2019 International Conference of Computer Science and Renewable Energies (ICCSRE), Agadir, Morocco, 22–24 July 2019; pp. 1–8.
3. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
4. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
5. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets Great Again. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021.
6. Liu, C.; Zoph, B.; Neumann, M.; Shlens, J.; Hua, W.; Li, L.; Fei-Fei, L.; Yuille, A.; Huang, J.; Murphy, K. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 19–34.
7. Radosavovic, I.; Kosaraju, R.P.; Girshick, R.; He, K.; Dollár, P. Designing network design spaces. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10428–10436.
8. Freund, K. Machine Learning Application Landscape. 2017. Available online: https://www.xilinx.com/support/documentation/backgrounders/Machine-Learning-Application-Landscape.pdf (accessed on 28 February 2020).
9. Véstias, M.P.; Duarte, R.P.; De Sousa, J.T.; Neto, H.C. Moving Deep Learning to the Edge. Algorithms 2020, 13, 125.
10. Zhang, Y.; Wei, X.-S.; Zhou, B.; Wu, J. Bag of Tricks for Long-Tailed Visual Recognition with Deep Convolutional Neural Networks. Proc. AAAI Conf. Artif. Intell. 2021, 35, 3447–3455.
11. Grigorescu, S.; Trasnea, B.; Cocias, T.; Macesanu, G. A survey of deep learning techniques for autonomous driving. J. Field Robot. 2020, 37, 362–386.
12. Wang, X.; Han, Y.; Leung, V.C.M.; Niyato, D.; Yan, X.; Chen, X. Convergence of Edge Computing and Deep Learning: A Comprehensive Survey. IEEE Commun. Surv. Tutor. 2020, 22, 869–904.
13. Véstias, M. A Survey of Convolutional Neural Networks on Edge with Reconfigurable Computing. Algorithms 2019, 12, 154.
14. Zhang, C.; Sun, G.; Fang, Z.; Zhou, P.; Pan, P.; Cong, J. Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2019, 38, 2072–2085.
15. Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, J.-S. Automatic Compilation of Diverse CNNs Onto High-Performance FPGA Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 424–437.
16. Yu, Y.; Wu, C.; Zhao, T.; Wang, K.; He, L. OPU: An FPGA-Based Overlay Processor for Convolutional Neural Networks. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 35–47.
17. Yu, Y.; Zhao, T.; Wang, K.; He, L. Light-OPU: An FPGA-based Overlay Processor for Lightweight Convolutional Neural Networks. In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, 23–25 February 2020; pp. 122–132.
18. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360.
19. David, R.; Duke, J.; Jain, A.; Reddi, V.J.; Jeffries, N.; Li, J.; Kreeger, N.; Nappier, I.; Natraj, M.; Wang, T.; et al. TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems. Proc. Mach. Learn. Syst. 2021, 3, 800–811.
20. Lin, J.; Chen, W.-M.; Lin, Y.; Gan, C.; Han, S. MCUNet: Tiny Deep Learning on IoT Devices. Adv. Neural Inf. Process. Syst. 2020, 33, 11711–11722.
21. Li, Z.; Gao, J.; Lai, J. HBDCA: A Toolchain for High-Accuracy BRAM-Defined CNN Accelerator on FPGA with Flexible Structure. IEICE Trans. Inf. Syst. 2021, E104.D, 1724–1733.
22. Sze, V.; Chen, Y.-H.; Yang, T.-J.; Emer, J.S. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE 2017, 105, 2295–2329.
23. Qiu, J.; Wang, J.; Yao, S.; Guo, K.; Li, B.; Zhou, E.; Yu, J.; Tang, T.; Xu, N.; Song, S.; et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 21–23 February 2016; pp. 26–35.
24. Guo, K.; Sui, L.; Qiu, J.; Yu, J.; Wang, J.; Yao, S.; Han, S.; Wang, Y.; Yang, H. Angel-Eye: A Complete Design Flow for Mapping CNN Onto Embedded FPGA. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2018, 37, 35–47.
25. Krishnamoorthi, R. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv 2018, arXiv:1806.08342.
26. Wu, H.; Judd, P.; Zhang, X.; Isaev, M.; Micikevicius, P. Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation. arXiv 2020, arXiv:2004.09602.
27. Shaydyuk, N.K.; John, E.B. Semi-Streaming Architecture: A New Design Paradigm for CNN Implementation on FPGAs. arXiv 2020, arXiv:2006.08759.
28. Suda, N.; Chandra, V.; Dasika, G.; Mohanty, A.; Ma, Y.; Vrudhula, S.; Seo, J.-S.; Cao, Y. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 21–23 February 2016; pp. 16–25.
29. Lavin, A.; Gray, S. Fast algorithms for convolutional neural networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4013–4021.
30. Yepez, J.; Ko, S.-B. Stride 2 1-D, 2-D, and 3-D Winograd for Convolutional Neural Networks. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 853–863.
31. Liang, Y.; Lu, L.; Xiao, Q.; Yan, S. Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2020, 39, 857–870.
32. Shen, J.; Huang, Y.; Wang, Z.; Qiao, Y.; Wen, M.; Zhang, C. Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 25–27 February 2018; pp. 97–106.
33. Ahmad, A.; Pasha, M.A. FFConv: An FPGA-based Accelerator for Fast Convolution Layers in Convolutional Neural Networks. ACM Trans. Embed. Comput. Syst. 2020, 19, 1–24.
34. Huang, D.; Zhang, X.; Zhang, R.; Zhi, T.; He, D.; Guo, J.; Liu, C.; Guo, Q.; Du, Z.; Liu, S.; et al. DWM: A Decomposable Winograd Method for Convolution Acceleration. Proc. AAAI Conf. Artif. Intell. 2020, 34, 4174–4181.
35. Huang, C.; Dong, X.; Li, Z.; Song, T.; Liu, Z.; Dong, L. Efficient Stride 2 Winograd Convolution Method Using Unified Transformation Matrices on FPGA. In Proceedings of the 2021 International Conference on Field-Programmable Technology (ICFPT), Auckland, New Zealand, 6–10 December 2021; pp. 1–9.
36. Yu, J.; Hu, Y.; Ning, X.; Qiu, J.; Guo, K.; Wang, Y.; Yang, H. Instruction driven cross-layer CNN accelerator with Winograd transformation on FPGA. In Proceedings of the 2017 International Conference on Field Programmable Technology (ICFPT), Melbourne, VIC, Australia, 11–13 December 2017; pp. 227–230.
37. Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, J.-S. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 45–54.
Winograd transformations required for each decomposed kernel part: original DWM [34] versus the proposed WDPR.

| Kernel Size (K × K) | Stride (S) | Part | DWM Transformation | WDPR Transformation |
|---|---|---|---|---|
| 3 × 3 | 1 | 1 | F(4 × 4, 3 × 3) | F(4 × 4, 3 × 3) |
| 3 × 3 | 2 | 1 | F(4 × 4, 2 × 2) | F(4 × 4, 3 × 3) |
| | | 2 | F(4 × 4, 2 × 1) | F(4 × 4, 3 × 3) |
| | | 3 | F(4 × 4, 1 × 2) | F(4 × 4, 3 × 3) |
| | | 4 | F(4 × 4, 1 × 1) | F(4 × 4, 3 × 3) |
| 5 × 5 | 1 | 1 | F(4 × 4, 3 × 3) | F(4 × 4, 3 × 3) |
| | | 2 | F(4 × 4, 3 × 2) | F(4 × 4, 3 × 3) |
| | | 3 | F(4 × 4, 2 × 3) | F(4 × 4, 3 × 3) |
| | | 4 | F(4 × 4, 2 × 2) | F(4 × 4, 3 × 3) |
| 5 × 5 | 2 | 1 | F(4 × 4, 3 × 3) | F(4 × 4, 3 × 3) |
| | | 2 | F(4 × 4, 3 × 2) | F(4 × 4, 3 × 3) |
| | | 3 | F(4 × 4, 2 × 3) | F(4 × 4, 3 × 3) |
| | | 4 | F(4 × 4, 2 × 2) | F(4 × 4, 3 × 3) |
| Total distinct transformations | | | 7 | 1 |
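The point of the table is that WDPR maps every decomposed part onto the single transformation F(4 × 4, 3 × 3), presumably by padding smaller parts up to a 3 × 3 kernel, so one set of transform matrices serves all supported kernel sizes and strides, whereas the original DWM needs seven distinct transformations. Below is a plain floating-point NumPy sketch of that one transformation, using the standard F(4, 3) matrices from Lavin and Gray [29]; it is a reference model only, not the paper's fixed-point hardware mapping.

```python
"""F(4 x 4, 3 x 3) Winograd convolution: a 6 x 6 input tile yields a 4 x 4
output tile using 36 element-wise multiplies instead of 144."""
import numpy as np

# Standard F(4, 3) transform matrices (interpolation points 0, +/-1, +/-2).
AT = np.array([[1, 1,  1, 1,  1, 0],
               [0, 1, -1, 2, -2, 0],
               [0, 1,  1, 4,  4, 0],
               [0, 1, -1, 8, -8, 1]], dtype=float)
G = np.array([[ 1/4,     0,    0],
              [-1/6,  -1/6, -1/6],
              [-1/6,   1/6, -1/6],
              [1/24,  1/12,  1/6],
              [1/24, -1/12,  1/6],
              [   0,     0,    1]], dtype=float)
BT = np.array([[4,  0, -5,  0, 1, 0],
               [0, -4, -4,  1, 1, 0],
               [0,  4, -4, -1, 1, 0],
               [0, -2, -1,  2, 1, 0],
               [0,  2, -1, -2, 1, 0],
               [0,  4,  0, -5, 0, 1]], dtype=float)

def winograd_f4x4_3x3(tile, kernel):
    """Y = AT [(G g G^T) * (BT d B)] A for a 6x6 tile d and 3x3 kernel g."""
    U = G @ kernel @ G.T        # transformed kernel, 6 x 6
    V = BT @ tile @ BT.T        # transformed input tile, 6 x 6
    return AT @ (U * V) @ AT.T  # 4 x 4 output tile

def direct_conv(tile, kernel):
    """Reference: valid cross-correlation of a 6x6 tile with a 3x3 kernel."""
    out = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            out[i, j] = np.sum(tile[i:i+3, j:j+3] * kernel)
    return out

rng = np.random.default_rng(0)
d, g = rng.standard_normal((6, 6)), rng.standard_normal((3, 3))
assert np.allclose(winograd_f4x4_3x3(d, g), direct_conv(d, g))
```

The final assertion checks the transform against direct convolution; reusing exactly this routine for every decomposed part is what lets WDPR keep a single datapath.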
Benchmark CNNs evaluated in this work and the corresponding networks reported by TCAD 2020 [15] and TCAD 2020 [31].

| No. | Name | Kernel (Size, Stride) and Branches | TCAD 2020 [15] | TCAD 2020 [31] |
|---|---|---|---|---|
| 1 | VGG | K = 3, S = 1; no branches | VGG | VGG |
| 2 | VGG-S2 | K = 3, S = 1 and K = 3, S = 2; no branches | – | – |
| 3 | AlexNet | K = 5, S = 1 and K = 3, S = 1; no branches | – | AlexNet |
| 4 | RepResNet | K = 3, S = 1; with branches | ResNet | ResNet |
| 5 | RepVGG | K = 3, S = 1 and K = 3, S = 2; with branches | – | – |
| 6 | RepInception | K = 5, S = 1 and K = 3, S = 1; with branches | Inception | – |
| 7 | RepK5S2 | K = 5, S = 1; K = 5, S = 2; K = 3, S = 1; and K = 3, S = 2; with branches | – | – |
Accuracy (%) of each benchmark CNN after training, after 8-bit quantization, and on the hardware accelerator; the loss columns are differences between adjacent stages.

| CNN | After Training | After Quantization | Loss (Quantization − Training) | Hardware Accelerator | Loss (Accelerator − Quantization) | Total Loss |
|---|---|---|---|---|---|---|
| VGG | 93.82 | 93.75 | −0.07 | 93.73 | −0.02 | −0.09 |
| VGG-S2 | 92.60 | 92.59 | −0.01 | 92.60 | +0.01 | 0.00 |
| AlexNet | 90.72 | 90.00 | −0.72 | 89.98 | −0.02 | −0.74 |
| RepResNet | 93.06 | 92.88 | −0.18 | 92.93 | +0.05 | −0.13 |
| RepVGG | 92.99 | 92.93 | −0.06 | 92.87 | −0.06 | −0.12 |
| RepInception | 93.56 | 93.36 | −0.20 | 93.45 | +0.09 | −0.11 |
| RepK5S2 | 93.29 | 93.14 | −0.15 | 93.06 | −0.08 | −0.23 |
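For illustration, here is a minimal sketch of a standard symmetric per-tensor int8 post-training quantizer in the style of [25,26], together with the loss bookkeeping for the RepInception row; the helper names are hypothetical and the paper's exact 8-bit calibration scheme may differ.

```python
"""Symmetric per-tensor int8 quantization sketch (not the toolchain's API)."""
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0  # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).standard_normal(1000).astype(np.float32)
q, s = quantize_int8(w)
print("max abs quantization error:", np.abs(dequantize(q, s) - w).max())

# The loss columns are plain accuracy differences, e.g. the RepInception row:
acc_train, acc_quant, acc_hw = 93.56, 93.36, 93.45
print(acc_quant - acc_train)  # ~ -0.20 (training -> quantization)
print(acc_hw - acc_quant)     # ~ +0.09 (quantization -> accelerator)
print(acc_hw - acc_train)     # ~ -0.11 (total loss)
```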
Comparison with prior toolchain-based FPGA CNN accelerators. Where a work reports several CNNs, the semicolon-separated values correspond, in order, to the networks in the CNN row.

| Items | TCAD 2019 [14] | VLSI 2020 [16] | TCAD 2020 [15] | This Paper |
|---|---|---|---|---|
| Platform | VC709 (28 nm) | XC7K325T | Arria-10 (20 nm) | VC709 (28 nm) |
| Toolchain | Yes | Yes | Yes | Yes |
| Frequency (MHz) | 150 | 200 | 240 | 100 |
| Precision | 16-bit fixed-point | 8-bit fixed-point; 8/16-bit fixed-point | 16-bit fixed-point | 8/14/18-bit integer |
| Winograd | No | No | No | F(4 × 4, 3 × 3) |
| Power (W) | 26 | 16.5 | – | 4.4 |
| Supported Kernels | K = 3, S = 1 | K = 5, S = 1; K = 3, S = 1; K = 1, S = 1 | K = 5, S = 1; K = 3, S = 1; K = 3, S = 2; K = 1, S = 1 | K = 1, S = 1; K = 1, S = 2; K = 3, S = 1; K = 3, S = 2; K = 5, S = 1; K = 5, S = 2 |
| CNN | VGG | VGG16; InceptionV1 | ResNet-50; VGG; Inception | RepResNet; VGG; RepInception |
| Logic Cells | 300 K (81%) | 94,763 (46.5%) | 286 K (67%); 228 K (49%); 277 K (65%) | 255.6 K (59%) |
| BRAM (blocks × Kb) | 1248 (42%) | 165 (37.08%) | 2356 × 20 (87%); 2319 × 20 (85%); 1849 × 20 (68%) | 1036 × 36 (70%) |
| DSP | 2833 (78%) | 516 (61.43%) | 3036 (100%) | 2816 (78%) |
| Throughput (GOPS) | 354 | 354; 54.4 | 758; 968.03; 524.98 | 827.8; 869.9; 997.2 |
| Power Efficiency (GOPS/W) | 13.6 | 21.5; 3.3 | – | 188.1; 197.7; 226.6 |
Comparison with prior Winograd-based FPGA CNN accelerators. Semicolon-separated values correspond, in order, to the platforms or CNNs listed in the respective rows.

| Items | FPGA 2018 [32] | VLSI 2020 [30] | TCAD 2020 [31] | This Paper |
|---|---|---|---|---|
| Platform | VC709 (28 nm) | Arria-10 (20 nm) | ZCU102 (16 nm); ZC706 (28 nm) | VC709 (28 nm) |
| Toolchain | No | No | Yes | Yes |
| Frequency (MHz) | 150 | 250 | 200 (ZCU102); 166 (ZC706) | 100 |
| Precision | 16-bit fixed-point | 16-bit fixed-point | 16-bit fixed-point | 8/14/18-bit integer |
| Winograd | F(2 × 2, 3 × 3) | F(2 × 2, 3 × 3) DWM | F(4 × 4, 3 × 3) | F(4 × 4, 3 × 3) DWM |
| Power (W) | 25 | 18 | – | 4.4 |
| Supported Kernels | K = 3, S = 1 | K = 3, S = 1; K = 3, S = 2 | K = 5, S = 1; K = 3, S = 1; K = 3, S = 2; K = 1, S = 1 | K = 1, S = 1; K = 1, S = 2; K = 3, S = 1; K = 3, S = 2; K = 5, S = 1; K = 5, S = 2 |
| CNN | VGG | VGG; VGG-S2 | VGG; AlexNet; ResNet | VGG; VGG-S2; AlexNet; RepResNet |
| Logic Cells | 175 K (40%) | 181 K (15.7%); 180 K (15.7%) | 95%; 67%; 67% | 255.6 K (59%) |
| BRAM (blocks × Kb) | 1232 (42%) | 1310 (61.5%) | 95%; 67%; 67% | 1036 × 36 (70%) |
| DSP | 1376 (38%) | 1344 (88.5%) | 95%; 67%; 67% | 2816 (78%) |
| Throughput (GOPS) | 570 | 1642; 1788 | 2479.6; 854.6; 201.6 | 869.9; 432.4; 727.4; 827.8 |
| Power Efficiency (GOPS/W) | 22.80 | 91.2; 99.3 | 105.4; 36.2; 13.8 | 197.7; 98.3; 165.3; 188.1 |
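In both comparison tables, the Power Efficiency rows are throughput divided by board power wherever both are reported; for example, 869.9 GOPS / 4.4 W = 197.7 GOPS/W for this paper's VGG. A quick check of this paper's columns:

```python
# Power efficiency (GOPS/W) = throughput (GOPS) / power (W).
# This paper's designs all use the 4.4 W figure reported above.
for gops in (869.9, 432.4, 727.4, 827.8, 997.2):
    print(round(gops / 4.4, 1))  # -> 197.7, 98.3, 165.3, 188.1, 226.6
```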
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, Z.; Hou, L.; Tao, X.; Wang, J.; Lai, J. HBCA: A Toolchain for High-Accuracy Branch-Fused CNN Accelerator on FPGA with Dual-Decimal-Fused Technique. Electronics 2023, 12, 192. https://doi.org/10.3390/electronics12010192