Research on the Lightweight Deployment Method of Integration of Training and Inference in Artificial Intelligence
Abstract
1. Introduction
- Exploiting the characteristics of the heterogeneous architecture, we propose a new lightweight deployment scheme for neural network models that enables AI applications in different scenarios to strike a better balance among flexibility, performance, cost, power consumption, and anti-interference capability. The scheme realizes the integrated deployment of neural network training and forward-inference acceleration with flexible weight adjustment.
- Using the Programmable Logic (PL) of the MPSoC, we balanced pipelining and parallelism in each neural network layer under limited resources and optimized the performance of forward inference. After optimizing and packaging each layer, we constructed a flexible, customizable hardware-accelerated IP library.
- We deployed a neural network training framework in the Processing System (PS) of the MPSoC to support on-chip training, and wrote an automatic weight-migration script that automates the data handling required by the integrated lightweight deployment.
2. Related Work
3. Method
3.1. Integrated Architecture of Artificial Intelligence
- (1) First, the multi-core processor on the PS side of the SoC was used to train the CNN, feeding a large amount of image data into the network. After several training iterations, the CNN model that performed best on the test set was obtained. The model weights were then exported by a script and passed to the CNN accelerator IP core.
- (2) The basic network layers of the CNN were implemented on the PL side of the SoC and connected according to the network topology to form the complete neural network structure.
- (3) The PS and PL were linked through the Advanced eXtensible Interface (AXI) specification to form a complete hardware structure for training and inference [38]. In this structure, the CNN accelerator IP core was encapsulated into API functions that can be called directly from the Linux operating system, constituting a complete implementation for training and acceleration, as sketched below.
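A minimal sketch of such a Linux-side wrapper is shown below. It assumes the accelerator exposes a standard HLS-style AXI-Lite control interface; the base address, register offsets, and function names are illustrative placeholders, not the actual API of the paper's design.

```c
/* Minimal user-space wrapper sketch for the CNN accelerator IP core.
 * ACC_BASE_ADDR and the register offsets are hypothetical; the real
 * values come from the Vivado address editor of the actual design. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define ACC_BASE_ADDR 0xA0000000UL /* assumed AXI-Lite base address */
#define ACC_MAP_SIZE  0x1000UL
#define REG_CTRL      0x00         /* bit 0 = ap_start, bit 1 = ap_done */
#define REG_IN_ADDR   0x10         /* physical address of input buffer */
#define REG_OUT_ADDR  0x18         /* physical address of output buffer */

static volatile uint32_t *acc_regs;

/* Map the accelerator's control registers into user space. */
int acc_open(void) {
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return -1;
    acc_regs = mmap(NULL, ACC_MAP_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, ACC_BASE_ADDR);
    close(fd);
    return acc_regs == MAP_FAILED ? -1 : 0;
}

/* Start one inference and busy-wait until the IP core reports done. */
void acc_infer(uint32_t in_phys, uint32_t out_phys) {
    acc_regs[REG_IN_ADDR / 4]  = in_phys;
    acc_regs[REG_OUT_ADDR / 4] = out_phys;
    acc_regs[REG_CTRL / 4]     = 0x1;            /* ap_start */
    while (!(acc_regs[REG_CTRL / 4] & 0x2))      /* wait for ap_done */
        ;
}
```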
3.2. Training Methods of Neural Networks
3.3. Hardware Accelerator Structure Design of Convolution Layer
Algorithm 1: Convolution

OUT: output feature maps; IN: input feature maps; W: weights; R: output feature-map rows; C: output feature-map columns; K: kernel size; CHin: number of input channels; CHout: number of output channels.

```c
for (kr = 0; kr < K; kr++)
  for (kc = 0; kc < K; kc++)
    for (r = 0; r < R; r++)
      for (c = 0; c < C; c++)
        for (chi = 0; chi < CHin; chi++)
          for (cho = 0; cho < CHout; cho++)
            OUT[cho][r][c] += W[cho][chi][kr][kc] * IN[chi][r + kr][c + kc];
```
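To illustrate how such a loop nest maps onto the PL, the sketch below applies Vitis-HLS-style pipeline and unroll pragmas to the channel loops. The layer dimensions (a LeNet-5-style second convolution layer), the integer data type, and the pragma placement are assumptions for illustration, not the exact configuration of the paper's IP core.

```c
/* Pipelined convolution-layer sketch in HLS C. The dimensions are
 * illustrative (LeNet-5-style conv2: 6 -> 16 channels, 5x5 kernels). */
#define K     5
#define R     10
#define C     10
#define CHIN  6
#define CHOUT 16

void conv_layer(int IN[CHIN][R + K - 1][C + K - 1],
                int W[CHOUT][CHIN][K][K],
                int OUT[CHOUT][R][C]) {
/* Partition along the output-channel dimension so the unrolled
 * inner loop can access all channels in parallel. */
#pragma HLS ARRAY_PARTITION variable=W   dim=1 complete
#pragma HLS ARRAY_PARTITION variable=OUT dim=1 complete
  for (int kr = 0; kr < K; kr++)
    for (int kc = 0; kc < K; kc++)
      for (int r = 0; r < R; r++)
        for (int c = 0; c < C; c++)
          for (int chi = 0; chi < CHIN; chi++) {
#pragma HLS PIPELINE II=1
            for (int cho = 0; cho < CHOUT; cho++) {
#pragma HLS UNROLL
              /* Accumulate; the caller must zero OUT beforehand. */
              OUT[cho][r][c] += W[cho][chi][kr][kc] * IN[chi][r + kr][c + kc];
            }
          }
}
```

Unrolling the output-channel loop replicates the multiply-accumulate hardware, which is reflected in the sharp increase in DSP usage for the pipelined convolution layer reported in the utilization tables below.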
Algorithm 2: For-Loop Optimization

OUT: output; IN: input; KL: inner-loop bound; KH: outer-loop bound; K = KL × KH: original loop count.

```c
for (h = 0; h < KH; h++)
  for (l = 0; l < KL; l++)
    OUT[h * KL + l] = f(IN[h * KL + l]);
```
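In HLS terms, the value of this split is that the two loop levels can be optimized differently: the outer loop is pipelined while the short inner loop is fully unrolled into KL parallel datapaths. A minimal sketch follows, with an assumed split and a placeholder element-wise function f:

```c
#define KH 16
#define KL 4  /* illustrative split of K = KH * KL = 64 iterations */

/* Placeholder element-wise operation, e.g., a ReLU activation. */
static inline int f(int x) { return x > 0 ? x : 0; }

void loop_opt(const int IN[KH * KL], int OUT[KH * KL]) {
/* Cyclic partitioning gives the KL parallel datapaths their own ports. */
#pragma HLS ARRAY_PARTITION variable=IN  cyclic factor=4
#pragma HLS ARRAY_PARTITION variable=OUT cyclic factor=4
  for (int h = 0; h < KH; h++) {
#pragma HLS PIPELINE II=1  /* one outer iteration per clock cycle */
    for (int l = 0; l < KL; l++) {
#pragma HLS UNROLL         /* KL copies of f() operate in parallel */
      OUT[h * KL + l] = f(IN[h * KL + l]);
    }
  }
}
```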
3.4. Hardware Accelerator Structure Design of the Fully Connected Layer
Algorithm 3: Fully Connected Layer

OUT: output; IN: input; W: weights; CHin: number of input channels; CHout: number of output channels.

```c
for (chi = 0; chi < CHin; chi++)
  for (cho = 0; cho < CHout; cho++)
    OUT[cho] += W[cho][chi] * IN[chi];
```
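A corresponding HLS-style sketch of the fully connected layer is given below; the layer sizes are illustrative LeNet-5-style values, and pipelining the multiply-accumulate loop with a local accumulator is an assumption for illustration, not necessarily the paper's exact implementation.

```c
/* Fully-connected-layer sketch in HLS C. CHIN/CHOUT are illustrative
 * (a LeNet-5-style 120 -> 84 layer). */
#define CHIN  120
#define CHOUT 84

void fc_layer(const int IN[CHIN], const int W[CHOUT][CHIN], int OUT[CHOUT]) {
  for (int cho = 0; cho < CHOUT; cho++) {
    int acc = 0; /* local accumulator avoids re-reading OUT each cycle */
    for (int chi = 0; chi < CHIN; chi++) {
#pragma HLS PIPELINE II=1
      acc += W[cho][chi] * IN[chi];
    }
    OUT[cho] = acc;
  }
}
```

Because only the MAC loop is pipelined rather than replicated, the fully connected layer needs only a few extra DSP blocks, in line with the utilization tables below.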
4. Experiments
4.1. Creating the Validation Platform
- (1) LeNet-5 module. This module implements the image-recognition hardware accelerator based on the modified LeNet-5 network structure (a host-side test sketch follows this list).
- (2) Zynq UltraScale+ MPSoC module. This module is an abstraction of the PS in the MPSoC. It contains the quad-core Cortex-A53 APU and the dual-core Cortex-R5F RPU.
- (3) AXI Interconnect module. In compliance with the AXI specification, this module connects multiple AXI memory-mapped masters to multiple memory-mapped slaves through a switch structure; in this design it serves as the bridge to the S_AXI peripherals.
- (4) AXI SmartConnect module. Similar to AXI Interconnect, it connects AXI peripherals to the PS; in this structure it primarily bridges the M_AXI interfaces.
- (5) Processor System Reset module. This module generates the reset signals for the PS and the three other modules.
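The following host-side harness sketches how this platform can be exercised from Linux. It reuses the hypothetical acc_open()/acc_infer() wrappers sketched in Section 3.1 and assumes a reserved, physically contiguous DDR region for the image and result buffers; all names, addresses, and file formats are illustrative, not the paper's actual validation code.

```c
/* Validation-harness sketch: classify one raw 28x28 grayscale image.
 * acc_open()/acc_infer() are the hypothetical wrappers from Section 3.1;
 * acc_map_buffer() is a hypothetical helper that mmap()s a physical
 * buffer into user space. Buffer addresses are illustrative. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define IMG_PIXELS   (28 * 28)
#define NUM_CLASSES  10
#define IN_BUF_PHYS  0x70000000U /* assumed reserved DDR region */
#define OUT_BUF_PHYS 0x70100000U

int acc_open(void);
void acc_infer(uint32_t in_phys, uint32_t out_phys);
void *acc_map_buffer(uint32_t phys, size_t len);

int main(void) {
    if (acc_open() != 0) {
        fprintf(stderr, "failed to map accelerator registers\n");
        return 1;
    }
    uint8_t *in  = acc_map_buffer(IN_BUF_PHYS, IMG_PIXELS);
    int32_t *out = acc_map_buffer(OUT_BUF_PHYS, NUM_CLASSES * sizeof(int32_t));

    /* Load one raw test image (file name illustrative). */
    FILE *fp = fopen("digit.raw", "rb");
    if (!fp || fread(in, 1, IMG_PIXELS, fp) != IMG_PIXELS)
        return 1;
    fclose(fp);

    acc_infer(IN_BUF_PHYS, OUT_BUF_PHYS);

    /* Report the class with the highest score. */
    int best = 0;
    for (int i = 1; i < NUM_CLASSES; i++)
        if (out[i] > out[best])
            best = i;
    printf("predicted class: %d\n", best);
    return 0;
}
```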
4.2. Design of Validation Method
5. Results
6. Discussion
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Conflicts of Interest
References
- Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53. [Google Scholar] [CrossRef] [PubMed]
- Pak, M.; Kim, S. A review of deep learning in image recognition. In Proceedings of the 2017 4th International Conference on Computer Applications and Information Processing Technology (CAIPT), Kuta Bali, Indonesia, 8–10 August 2017; pp. 1–3. [Google Scholar]
- Hu, Y.; Liu, Y.; Liu, Z. A Survey on Convolutional Neural Network Accelerators: GPU, FPGA and ASIC. In Proceedings of the 2022 14th International Conference on Computer Research and Development (ICCRD), Shenzhen, China, 7–9 January 2022; pp. 100–107. [Google Scholar]
- Zaman, K.S.; Reaz, M.B.I.; Ali, S.H.M.; Bakar, A.A.A.; Chowdhury, M.E.H. Custom Hardware Architectures for Deep Learning on Portable Devices: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 1–21. [Google Scholar] [CrossRef] [PubMed]
- Vipin, K. ZyNet: Automating Deep Neural Network Implementation on Low-Cost Reconfigurable Edge Computing Platforms. In Proceedings of the 2019 International Conference on Field-Programmable Technology (ICFPT), Tianjin, China, 9–13 December 2019; pp. 323–326. [Google Scholar]
- Colbert, I.; Daly, J.; Kreutz-Delgado, K.; Das, S. A competitive edge: Can FPGAs beat GPUs at DCNN inference acceleration in resource-limited edge computing applications? arXiv 2021, arXiv:2102.00294. [Google Scholar]
- Nurvitadhi, E.; Sheffield, D.; Sim, J.; Mishra, A.; Venkatesh, G.; Marr, D. Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC. In Proceedings of the 2016 International Conference on Field-Programmable Technology (FPT), Xi’an, China, 7–9 December 2016; pp. 77–84. [Google Scholar]
- Lacey, G.; Taylor, G.W.; Areibi, S. Deep learning on fpgas: Past, present, and future. arXiv 2016, arXiv:1602.04283. [Google Scholar]
- Dias, M.A.; Ferreira, D.A.P. Deep Learning in Reconfigurable Hardware: A Survey. In Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Rio de Janeiro, Brazil, 20–24 May 2019; pp. 95–98. [Google Scholar]
- Seng, K.P.; Lee, P.J.; Ang, L.M. Embedded Intelligence on FPGA: Survey, Applications and Challenges. Electronics 2021, 10, 895. [Google Scholar] [CrossRef]
- Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef] [Green Version]
- Deng, L. The MNIST Database of Handwritten Digit Images for Machine Learning Research. IEEE Signal Process. Mag. 2012, 29, 141–142. [Google Scholar] [CrossRef]
- CIFAR-10 and CIFAR-100 datasets. Available online: https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 19 May 2022).
- Nurvitadhi, E.; Sim, J.; Sheffield, D.; Mishra, A.; Krishnan, S.; Marr, D. Accelerating recurrent neural networks in analytics servers: Comparison of FPGA, CPU, GPU, and ASIC. In Proceedings of the 2016 26th International Conference on Field Programmable Logic and Applications (FPL), Lausanne, Switzerland, 29 August–2 September 2016; pp. 1–4. [Google Scholar]
- Sateesan, A.; Sinha, S.; KG, S.; Vinod, A.P. A Survey of Algorithmic and Hardware Optimization Techniques for Vision Convolutional Neural Networks on FPGAs. Neural Process. Lett. 2021, 53, 2331–2377. [Google Scholar] [CrossRef]
- Hamdan, M.K.; Rover, D.T. VHDL generator for a high performance convolutional neural network FPGA-based accelerator. In Proceedings of the 2017 International Conference on ReConFigurable Computing and FPGAs (ReConFig), Cancun, Mexico, 4–6 December 2017; pp. 1–6. [Google Scholar]
- Liu, Z.; Dou, Y.; Jiang, J.; Xu, J. Automatic code generation of convolutional neural networks in FPGA implementation. In Proceedings of the 2016 International Conference on Field-Programmable Technology (FPT), Xi’an, China, 7–9 December 2016; pp. 61–68. [Google Scholar]
- Ahmed, H.O.; Ghoneima, M.; Dessouky, M. Concurrent MAC unit design using VHDL for deep learning networks on FPGA. In Proceedings of the 2018 IEEE Symposium on Computer Applications & Industrial Electronics (ISCAIE), Penang, Malaysia, 28–29 April 2018; pp. 31–36. [Google Scholar]
- Venieris, S.I.; Bouganis, C. fpgaConvNet: A framework for mapping convolutional neural networks on FPGAs. In Proceedings of the 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Washington, DC, USA, 1–3 May 2016; pp. 40–47. [Google Scholar]
- DiCecco, R.; Lacey, G.; Vasiljevic, J.; Chow, P.; Taylor, G.; Areibi, S. Caffeinated FPGAs: FPGA framework for convolutional neural networks. In Proceedings of the 2016 International Conference on Field-Programmable Technology (FPT), Xi’an, China, 7–9 December 2016; pp. 265–268. [Google Scholar]
- Hua, S. Design and optimization of a light-weight handwritten digital system based on FPGA. Electron. Manuf. 2020, 16, 6–7, 37. [Google Scholar]
- Mujawar, S.; Kiran, D.; Ramasangu, H. An Efficient CNN Architecture for Image Classification on FPGA Accelerator. In Proceedings of the 2018 Second International Conference on Advances in Electronics, Computers and Communications (ICAECC), Bangalore, India, 9–10 February 2018; pp. 1–4. [Google Scholar] [CrossRef]
- Huang, W.; Wu, H.; Chen, Q.; Luo, C.; Zeng, S.; Li, T.; Huang, Y. FPGA-Based High-Throughput CNN Hardware Accelerator With High Computing Resource Utilization Ratio. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 1–15. [Google Scholar] [CrossRef] [PubMed]
- Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing fpga-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; pp. 161–170. [Google Scholar]
- Bachtiar, Y.A.; Adiono, T. Convolutional Neural Network and Maxpooling Architecture on Zynq SoC FPGA. In Proceedings of the 2019 International Symposium on Electronics and Smart Devices (ISESD), Badung, Indonesia, 25 November 2019; pp. 1–5. [Google Scholar]
- Ghaffari, S.; Sharifian, S. FPGA-based convolutional neural network accelerator design using high level synthesize. In Proceedings of the 2016 2nd International Conference of Signal Processing and Intelligent Systems (ICSPIS), Tehran, Iran, 14–15 December 2016; pp. 1–6. [Google Scholar]
- Huang, W. Design of Deep learning Image Classification and Recognition System Based on Zynq. Master’s Thesis, Guangdong University of Technology, Guangzhou, China, 2018. [Google Scholar]
- Liu, B.; Zou, D.; Feng, L.; Feng, S.; Fu, P.; Li, J. An FPGA-Based CNN Accelerator Integrating Depthwise Separable Convolution. Electronics 2019, 8, 281. [Google Scholar] [CrossRef] [Green Version]
- Zhang, S.; Cao, J.; Zhang, Q.; Zhang, Q.; Zhang, Y.; Wang, Y. An FPGA-Based Reconfigurable CNN Accelerator for YOLO. In Proceedings of the 2020 IEEE 3rd International Conference on Electronics Technology (ICET), Chengdu, China, 8–12 May 2020; pp. 74–78. [Google Scholar]
- Xie, W.; Zhang, C.; Zhang, Y.; Hu, C.; Jiang, H.; Wang, Z. An Energy-Efficient FPGA-Based Embedded System for CNN Application. In Proceedings of the 2018 IEEE International Conference on Electron Devices and Solid State Circuits (EDSSC), Shenzhen, China, 6–8 June 2018; pp. 1–2. [Google Scholar]
- Meloni, P.; Deriu, G.; Conti, F.; Loi, I.; Raffo, L.; Benini, L. A high-efficiency runtime reconfigurable IP for CNN acceleration on a mid-range all-programmable SoC. In Proceedings of the 2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig), Cancun, Mexico, 30 November–2 December 2016; pp. 1–8. [Google Scholar]
- Feng, G.; Hu, Z.; Chen, S.; Wu, F. Energy-efficient and high-throughput FPGA-based accelerator for Convolutional Neural Networks. In Proceedings of the 2016 13th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT), Hangzhou, China, 25–28 October 2016; pp. 624–626. [Google Scholar]
- Hailesellasie, M.; Hasan, S.R.; Khalid, F.; Wad, F.A.; Shafique, M. Fpga-based convolutional neural network architecture with reduced parameter requirements. In Proceedings of the 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 27–30 May 2018; pp. 1–5. [Google Scholar]
- Dong, R.; Tang, Y. Accelerator Implementation of Lenet-5 Convolution Neural Network Based on FPGA with HLS. In Proceedings of the 2019 3rd International Conference on Circuits, System and Simulation (ICCSS), Nanjing, China, 13–15 June 2019; pp. 64–67. [Google Scholar]
- Shi, Y.; Gan, T.; Jiang, S. Design of Parallel Acceleration Method of Convolutional Neural Network Based on FPGA. In Proceedings of the 2020 IEEE 5th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA), Chengdu, China, 10–13 April 2020; pp. 133–137. [Google Scholar]
- Maraoui, A.; Messaoud, S.; Bouaafia, S.; Ammari, A.C.; Khriji, L.; Machhout, M. PYNQ FPGA Hardware implementation of LeNet-5-Based Traffic Sign Recognition Application. In Proceedings of the 2021 18th International Multi-Conference on Systems, Signals & Devices (SSD), Monastir, Tunisia, 22–25 March 2021; pp. 1004–1009. [Google Scholar]
- Liu, J.; Feng, J. Design of embedded digital image processing system based on ZYNQ. Microprocess. Microsyst. 2021, 83, 104005. [Google Scholar] [CrossRef]
- AXI Reference Guide. Available online: https://docs.xilinx.com/v/u/en-US/ug761_axi_reference_guide (accessed on 19 May 2022).
- Zynq UltraScale+ MPSoC Data Sheet: Overview. Available online: https://docs.xilinx.com/v/u/en-US/ds891-zynq-ultrascale-plus-overview (accessed on 19 May 2022).
- He, B. Xilinx FPGA Design Guide: Based on Vivado 2018 Integrated Development Environment, 1st ed.; Publishing House of Electronics Industry: Beijing, China, 2018; pp. 274–422. [Google Scholar]
- Vivado Design Suite Tutorial: Design Flows Overview. Available online: https://docs.xilinx.com/v/u/2019.1-English/ug888-vivado-design-flows-overview-tutorial (accessed on 19 May 2022).
- PYNQ: Python Productivity for Zynq. Available online: http://www.pynq.io/ (accessed on 19 May 2022).
|  | GPU | Zynq UltraScale+ MPSoC |
| --- | --- | --- |
| Device | GTX 1050 | quad-core Arm Cortex-A53 |
| Power consumption | 75 W | less than 2 W |
| Training duration per epoch | 22.6 s | 286 s |
|  |  | No Optimization | Pipelined |
| --- | --- | --- | --- |
| Latency & interval (clock cycles) |  | 7,709,158 | 353,263 |
| Utilization estimates | BRAM_18K | 79 | 83 |
|  | DSP48E | 9 | 244 |
|  | FF | 29,061 | 53,049 |
|  | LUT | 24,651 | 41,842 |
|  |  | No Optimization | Pipelined |
| --- | --- | --- | --- |
| FC layers' latency & interval (clock cycles) |  | 128,856 | 45,696 |
| Utilization estimates | BRAM_18K | 79 | 88 |
|  | DSP48E | 9 | 14 |
|  | FF | 29,061 | 29,259 |
|  | LUT | 24,651 | 28,196 |
| Resource | Used | Utilization (%) |
| --- | --- | --- |
| BRAMs | 41.5 | 19.2 |
| DSP | 248 | 68.89 |
| FF | 51,434 | 36.45 |
| LUT | 48,966 | 69.40 |
| LUTRAM | 3288 | 11.42 |
|  | CPU | Zynq UltraScale+ MPSoC |
| --- | --- | --- |
| Device | Intel i5-4300U | XCZU3EG-SBVA484 |
| Frequency | 2.5 GHz | 100 MHz |
| Power | 44 W | 3.22 W |
| Inference time (single frame) | 8.1 ms | 2.2 ms |