A Sliding Window for Data Reuse in Deep Convolution Operations to Reduce Bandwidth Requirements and Resource Utilization
Abstract
1. Introduction
- Convolution computation architecture: A new addressing method is proposed based on a 4 × 4 convolutional traversal scheme, from which a convolutional computing architecture optimized for large-scale convolution is developed. The architecture is implemented on an FPGA (XC7Z035) and is suitable for deployment across various neural networks.
- New data feeder design: A novel 3D data feeder is introduced, the first extension of the traditional 2D row buffer (see the sketch after this list). It reduces on-chip bandwidth requirements by 75% and significantly shortens the address-computation latency of pointer addressing.
- Performance: The design achieves 121.36 GOPS with minimal resources at a 200 MHz clock frequency, a 39.10× speedup over CPU computation.
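For intuition, the following is a minimal software model, not the authors' RTL, of how a 4 × 4 sliding window fed from a plane (channel-depth) buffer reuses input data: each pixel vector is fetched from external memory exactly once and then served to every overlapping window from on-chip storage. The feature-map layout, a stride of 1, and the absence of padding are assumptions made only for illustration.

```python
import numpy as np

def plane_buffer_patches(fmap, K=4):
    """Software model (not the authors' RTL) of a 4 x 4 sliding window fed by a
    plane buffer: every pixel vector (all D channels of one (y, x) position) is
    read from external memory exactly once and reused for each window that
    overlaps it. Assumes stride 1 and no padding.
    """
    H, W, D = fmap.shape
    # Plane buffer: the K most recent rows, each holding W pixel vectors of
    # depth D -- a 3D extension of the classic 2D row/line buffer.
    buf = np.zeros((K, W, D), dtype=fmap.dtype)
    for y in range(H):
        buf[y % K] = fmap[y]                       # the only external read of row y
        if y < K - 1:
            continue                               # buffer not yet full
        rows = [(y - K + 1 + r) % K for r in range(K)]
        for x in range(W - K + 1):
            patch = buf[rows, x:x + K, :]          # K*K*D values served from the on-chip buffer
            yield y - K + 1, x, patch

# Sanity check against direct slicing of the feature map.
fmap = np.random.rand(8, 8, 3).astype(np.float32)
for top, left, patch in plane_buffer_patches(fmap):
    assert np.array_equal(patch, fmap[top:top + 4, left:left + 4, :])
```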
2. Related Work
2.1. Sliding Window
2.2. GEMM Acceleration and im2col Method
2.3. Row Buffer Strategy for 2D Convolution
3. Design
3.1. The 4 × 4 Sliding Window
3.2. Plane Buffer Strategy
3.3. Addressing and Storage Strategy
- Loop 1: Increment along the channel dimension within the same pixel, up to the D-th channel (a loop-nest sketch of this traversal order follows the list).
- Loop 2: Traverse a plane in a left-to-right, top-to-bottom order for .
- Loop 3: Traverse a block for .
- Loop 4: Traverse a block for
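To make the order concrete, here is a loop-nest sketch of the traversal described above, with the channel index innermost. The 4 × 4 block size and the mapping of Loops 3 and 4 to block coordinates along the width and height are assumptions for illustration; the exact loop bounds are not reproduced here.

```python
def traversal_order(H, W, D, Tw=4, Th=4):
    """Loop-nest sketch of the four-loop traversal above (not the authors' exact
    formulation). Assumes 4 x 4 blocks tiling an H x W plane with D channels;
    mapping Loops 3 and 4 to block columns/rows is an assumption.
    Yields (channel, row, col) addresses in consumption order.
    """
    for by in range(0, H, Th):                         # Loop 4: between blocks, top to bottom
        for bx in range(0, W, Tw):                     # Loop 3: between blocks, left to right
            for y in range(by, min(by + Th, H)):       # Loop 2: rows inside the block ...
                for x in range(bx, min(bx + Tw, W)):   # ... then columns, left to right
                    for d in range(D):                 # Loop 1: all D channels of one pixel
                        yield d, y, x

# Example: first few addresses for an 8 x 8 x 2 input.
addresses = list(traversal_order(8, 8, 2))
print(addresses[:5])   # [(0, 0, 0), (1, 0, 0), (0, 0, 1), (1, 0, 1), (0, 0, 2)]
```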
3.4. Buffer Structure Design
3.4.1. DDR Weight Buffer
3.4.2. Concatenation Transfer Buffer
3.5. Overall Structural Framework
4. Evaluation and Experiments
4.1. Evaluation
4.2. Performance and Resource Utilization
4.3. Comparison with CPU and GPU Platforms
4.4. Comparison with Other Accelerators
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
| Resource | Estimation | Available | Utilization % |
|---|---|---|---|
| LUT | 9653 | 171,900 | 5.62 |
| LUTRAM | 1024 | 70,400 | 1.45 |
| FF | 15,823 | 343,800 | 4.60 |
| BRAM | 214.50 | 500 | 42.90 |
| DSP | 576 | 900 | 64.00 |
| BUFG | 3 | 32 | 9.38 |
| MMCM | 1 | 8 | 12.50 |
| Device | Intel Core i5 10s | Nvidia Tesla P100 | This Work |
|---|---|---|---|
| Platform type | CPU | GPU | FPGA |
| Precision (bit) | double (64) | double (64) | uint16 (16) |
| Clock frequency | 2.0/3.8 GHz | 1328 MHz | 200 MHz |
| Latency | 67.80 ms | 0.45 ms | 1.73 ms |
| Total clock cycles | 135.6 M | 597.6 K | 346.8 K |
| GOPS | 3.10 GOPS | 467.64 GOPS | 121.36 GOPS |
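As a quick back-of-the-envelope consistency check on the table above (no new measurements, only the reported figures): the FPGA latency follows from the cycle count and clock frequency, the 39.10× CPU speedup reported earlier follows from the latency ratio, and the per-platform GOPS figures all imply the same workload size.

```python
# Back-of-the-envelope check on the comparison table; only reported figures are used.
fpga_cycles = 346.8e3                  # total clock cycles
fpga_clk = 200e6                       # clock frequency, Hz
fpga_latency = fpga_cycles / fpga_clk  # 1.734e-3 s, matching the reported 1.73 ms

cpu_latency = 67.80e-3                 # s
print(cpu_latency / fpga_latency)      # ~39.1, matching the reported 39.10x speedup

# Workload size implied by each platform's GOPS and latency (should agree):
print(121.36e9 * fpga_latency)         # ~2.1e8 operations (FPGA)
print(3.10e9 * cpu_latency)            # ~2.1e8 operations (CPU)
print(467.64e9 * 0.45e-3)              # ~2.1e8 operations (GPU)
```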