A High Performance Reconfigurable Hardware Architecture for Lightweight Convolutional Neural Network
Abstract
1. Introduction
- An FPGA-based computational kernel and a reconfigurable register array are proposed to improve the utilization of computing resources for depthwise convolution.
- A vector unit is designed to implement the nonlinear activation functions and the scale operation in the squeeze-and-excitation (SE) module.
- An exchangeable-sequence dual-computational-kernel architecture and its memory architecture are designed to improve performance and resource utilization.
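The contributions above center on accelerating depthwise separable convolution, the core operation of the MBCONV blocks in lightweight CNNs such as MobileNet and EfficientNet. As a point of reference, a minimal software model of the operation can be sketched as follows; the sizes and function names are illustrative, not the paper's hardware parameters:

```python
def depthwise_conv(x, w):
    """Depthwise convolution: x is [C][H][W], w is [C][K][K].
    Each channel is filtered by its own KxK kernel (valid padding),
    so no accumulation happens across channels."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K = len(w[0])
    out = [[[0.0] * (W - K + 1) for _ in range(H - K + 1)] for _ in range(C)]
    for c in range(C):
        for i in range(H - K + 1):
            for j in range(W - K + 1):
                out[c][i][j] = sum(
                    x[c][i + ki][j + kj] * w[c][ki][kj]
                    for ki in range(K) for kj in range(K)
                )
    return out

def pointwise_conv(x, w):
    """Pointwise (1x1) convolution: x is [C][H][W], w is [M][C].
    This is where channels are mixed, producing M output channels."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[sum(w[m][c] * x[c][i][j] for c in range(C))
              for j in range(W)] for i in range(H)] for m in range(len(w))]
```

Because the depthwise stage reuses no weights across channels, its arithmetic intensity is much lower than that of a standard convolution, which is why a dedicated kernel and register array are needed to keep the compute units busy.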
2. Background and Related Work
2.1. Lightweight Convolutional Neural Network
2.2. Existing Hardware Architectures for Lightweight CNNs
3. Design Space Analysis
3.1. Parallel Strategies Analysis
3.2. Architecture Analysis
4. Hardware Architecture Design
4.1. Architecture Overview
4.2. Computational Kernel and Register Array
4.3. Vector Unit
4.4. Exchangeable-Sequence Dual-Kernel Architecture for MBCONV Module
4.5. On-Chip Input Buffer
5. Experimental Results
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856.
- Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet v2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 122–138.
- Howard, A.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
- Howard, A.; Sandler, M.; Chu, G.; Chen, L.; Chen, B.; Tan, M. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
- Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2815–2823.
- Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 6105–6114.
- Tan, M.; Le, Q. EfficientNetV2: Smaller models and faster training. In Proceedings of the 38th International Conference on Machine Learning, Vienna, Austria, 18–24 July 2021; pp. 10096–10106.
- Huang, Y.; Cheng, Y.; Bapna, A.; Firat, O.; Chen, D.; Chen, M.; Lee, H.; Ngiam, J.; Le, Q.V.; Wu, Y.; et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 103–112.
- NVIDIA Data Center Deep Learning Product Performance. 2023. Available online: https://developer.nvidia.com/deep-learning-performance-training-inference (accessed on 29 May 2023).
- Lammie, C.; Xiang, W.; Azghadi, M.R. Training progressively binarizing deep networks using FPGAs. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Seville, Spain, 10–21 October 2020; pp. 1–5.
- Groom, T.; George, K. Real time FPGA-based CNN training and recognition of signals. In Proceedings of the 2022 IEEE World AI IoT Congress (AIIoT), Seattle, WA, USA, 6–9 June 2022; pp. 22–26.
- Li, H.; Fan, X.; Jiao, L.; Cao, W.; Zhou, X.; Wang, L. A high performance FPGA-based accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 26th International Conference on Field Programmable Logic and Applications (FPL), Lausanne, Switzerland, 29 August–2 September 2016; pp. 1–9.
- Xie, L.; Fan, X.; Cao, W.; Wang, L. High throughput CNN accelerator design based on FPGA. In Proceedings of the 2018 International Conference on Field-Programmable Technology (FPT), Naha, Japan, 10–14 December 2018; pp. 274–277.
- Jiao, M.; Li, Y.; Dang, P.; Cao, W.; Wang, L. A high performance FPGA-based accelerator design for end-to-end speaker recognition system. In Proceedings of the 2019 International Conference on Field-Programmable Technology (ICFPT), Tianjin, China, 9–13 December 2019; pp. 215–223.
- Wu, D.; Zhang, Y.; Jia, X.; Lu, T.; Li, T.; Sui, L.; Xie, D.; Shan, Y. A high-performance CNN processor based on FPGA for MobileNets. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 8–12 September 2019; pp. 136–143.
- Bai, L.; Zhao, Y.; Huang, X. A CNN accelerator on FPGA using depthwise separable convolution. IEEE Trans. Circuits Syst. II Express Briefs 2018, 65, 1415–1419.
- Knapheide, J.; Stabernack, B.; Kuhnke, M. A high throughput MobileNetV2 FPGA implementation based on a flexible architecture for depthwise separable convolution. In Proceedings of the 2020 30th International Conference on Field-Programmable Logic and Applications (FPL), Gothenburg, Sweden, 31 August–4 September 2020; pp. 277–283.
- Li, B.; Wang, H.; Zhang, X.; Ren, J.; Liu, L.; Sun, H.; Zheng, N. Dynamic dataflow scheduling and computation mapping techniques for efficient depthwise separable convolution acceleration. IEEE Trans. Circuits Syst. I Regul. Pap. 2021, 68, 3279–3292.
- Xu, Y.; Wang, S.; Li, N.; Xiao, H. Design and implementation of an efficient CNN accelerator for low-cost FPGAs. IEICE Electron. Express 2022, 19, 20220370.
- Tang, Y.; Ren, H.; Zhang, Z. A reconfigurable convolutional neural networks accelerator based on FPGA. In Communications and Networking. ChinaCom 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering; Gao, F., Wu, J., Li, Y., Gao, H., Eds.; Springer: Cham, Switzerland, 2022; Volume 500, pp. 259–269.
- Xu, R.; Ma, S.; Wang, Y.; Li, D.; Qiao, Y. Heterogeneous systolic array architecture for compact CNNs hardware accelerators. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 2860–2871.
- Shivapakash, S.; Jain, H.; Hellwich, O.; Gerfers, F. Power efficiency enhancements of a multi-bit accelerator for memory prohibitive deep neural networks. IEEE Open J. Circuits Syst. 2021, 2, 156–169.
- Nguyen, D.T.; Je, H.; Nguyen, T.N.; Ryu, S.; Lee, K.; Lee, H.-J. ShortcutFusion: From TensorFlow to FPGA-based accelerator with a reuse-aware memory allocation for shortcut data. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 2477–2489.
- Gao, J.; Qian, Y.; Hu, Y.; Fan, X.; Luk, W.; Cao, W.; Wang, L. LETA: A lightweight exchangeable-track accelerator for EfficientNet based on FPGA. In Proceedings of the International Conference on Field Programmable Technology (ICFPT), Auckland, New Zealand, 6–10 December 2021; pp. 1–9.
- Sifre, L. Rigid-Motion Scattering for Image Classification. Ph.D. Thesis, École Polytechnique, Paris, France, 2014.
- Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807.
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
- Fu, Y.; Wu, E.; Sirasao, A.; Attia, S.; Khan, K.; Wittig, R. Deep Learning with INT8 Optimization on Xilinx Devices. Available online: https://www.xilinx.com/support/documentation/whitepapers/wp486-deep-learning-int8.pdf (accessed on 24 April 2017).
- Williams, S.; Waterman, A.; Patterson, D. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 2009, 52, 65–76.
- Feng, X.; Li, Y.; Qian, Y.; Gao, J.; Cao, W.; Wang, L. A high-precision flexible symmetry-aware architecture for element-wise activation functions. In Proceedings of the 2021 International Conference on Field-Programmable Technology (ICFPT), Auckland, New Zealand, 6–10 December 2021; pp. 1–4.
| Resource | Utilization | Available | Utilization (%) |
|---|---|---|---|
| LUT | 138,519 | 1,303,680 | 10.6 |
| FF | 197,799 | 2,607,360 | 7.6 |
| BRAM | 220.5 * | 2016 | 10.9 |
| URAM | 40 | 960 | 4.2 |
| DSP | 494 | 9024 | 5.5 |
| | Paper [23] | Paper [21] | Paper [24] | Paper [25] | Our Work |
|---|---|---|---|---|---|
| Platform | Xilinx XCVU440 | Xilinx ZC706 | Xilinx KCU1500 | Xilinx XCVU37P | Xilinx XCVU37P |
| Process | 20 nm | 28 nm | 20 nm | 16 nm | 16 nm |
| Frequency | 180 MHz | 100 MHz | 200 MHz | 300 MHz | 300 MHz |
| DSP | 1008 * | N/A | 2240 | 534 | 494 |
| Model | EfficientNet-B0 | EfficientNet-lite0 | EfficientNet-B1 | EfficientNet-B3 | EfficientNet-B3 |
| Precision | INT16 | INT8 | INT8 | INT8 | INT8 |
| Input Size | 224 × 224 × 3 | 224 × 224 × 3 | 256 × 256 × 3 | 300 × 300 × 3 | 300 × 300 × 3 |
| MAC Operations (B) | 0.78 | 0.77 | 1.38 | 3.67 | 3.67 |
| Latency (ms) | N/A | 5.1 | 4.69 | 18.4 | 14.39 |
| Frame Rate (FPS) | 231.2 | 196.1 | 213.2 | 54.3 | 69.50 |
| Frame Rate/DSP (FPS/DSP) | 0.229 | N/A | 0.0951 | 0.102 | 0.141 |
| Throughput (GOPS) | 180.3 | 150.6 | 317.1 | 199.6 | 255.22 |
| Throughput/DSP (GOPS/DSP) | 0.179 | N/A | 0.142 | 0.374 | 0.517 |
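The derived rows of the comparison table follow directly from the measured latency, the per-frame operation count, and the DSP count. A quick arithmetic check of the "Our Work" column (assuming the throughput figure counts the table's "MAC Operations" entry once per frame; small deviations from the reported values come from rounding):

```python
# Reproduce the derived metrics for the "Our Work" column from its raw inputs.
latency_ms = 14.39   # measured end-to-end latency per frame
ops_g = 3.67         # billions of operations per frame (table's "MAC Operations")
dsp = 494            # DSP blocks used

fps = 1000.0 / latency_ms        # ≈ 69.5 frames/s
gops = ops_g * fps               # ≈ 255 GOPS
fps_per_dsp = fps / dsp          # ≈ 0.141 FPS/DSP
gops_per_dsp = gops / dsp        # ≈ 0.517 GOPS/DSP
```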
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
An, F.; Wang, L.; Zhou, X. A High Performance Reconfigurable Hardware Architecture for Lightweight Convolutional Neural Network. Electronics 2023, 12, 2847. https://doi.org/10.3390/electronics12132847
An F, Wang L, Zhou X. A High Performance Reconfigurable Hardware Architecture for Lightweight Convolutional Neural Network. Electronics. 2023; 12(13):2847. https://doi.org/10.3390/electronics12132847
Chicago/Turabian Style: An, Fubang, Lingli Wang, and Xuegong Zhou. 2023. "A High Performance Reconfigurable Hardware Architecture for Lightweight Convolutional Neural Network" Electronics 12, no. 13: 2847. https://doi.org/10.3390/electronics12132847
APA Style: An, F., Wang, L., & Zhou, X. (2023). A High Performance Reconfigurable Hardware Architecture for Lightweight Convolutional Neural Network. Electronics, 12(13), 2847. https://doi.org/10.3390/electronics12132847