An FPGA-Based CNN Accelerator Integrating Depthwise Separable Convolution
Abstract
1. Introduction
- A configurable system architecture is proposed based on the ZYNQ heterogeneous platform. Under this architecture, the accelerator design is optimized using the Roofline model, and the accelerator is scalable.
- Based on the single-computation-engine model, the CNN hardware accelerator we designed efficiently integrates both standard convolution and depthwise separable convolution.
- A ping-pong on-chip buffer maximizes bandwidth utilization, and the CNN accelerator we designed is fully pipelined.
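As background for the depthwise separable convolution the accelerator integrates: it factors a standard convolution into a per-channel (depthwise) K×K convolution followed by a 1×1 pointwise convolution that mixes channels. A minimal NumPy sketch of the arithmetic (illustrative only; function and parameter names are our own, and the accelerator implements this in hardware, not software):

```python
import numpy as np

def depthwise_separable_conv(x, dw_k, pw_k):
    """Depthwise separable convolution: per-channel KxK conv, then 1x1 pointwise.
    x: (H, W, C_in), dw_k: (K, K, C_in), pw_k: (C_in, C_out).
    Stride 1, no padding, for clarity."""
    H, W, C_in = x.shape
    K = dw_k.shape[0]
    Ho, Wo = H - K + 1, W - K + 1
    # Depthwise stage: each input channel is filtered by its own KxK kernel.
    dw_out = np.zeros((Ho, Wo, C_in))
    for c in range(C_in):
        for i in range(Ho):
            for j in range(Wo):
                dw_out[i, j, c] = np.sum(x[i:i+K, j:j+K, c] * dw_k[:, :, c])
    # Pointwise stage: a 1x1 convolution mixes channels.
    return dw_out @ pw_k  # shape (Ho, Wo, C_out)
```

Compared with a standard convolution costing roughly Ho·Wo·C_in·C_out·K² multiplications, the factored form costs Ho·Wo·C_in·(K² + C_out), a reduction factor of about 1/C_out + 1/K², which is why MobileNet-style networks favor it.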
2. Background
2.1. Convolutional Neural Network
2.1.1. Convolution Layer
2.1.2. Pooling Layer
2.1.3. Fully-Connected Layer
2.2. Depthwise Separable Convolution
3. Architecture and Accelerator Design
3.1. Design Overview
3.2. Accelerator Design under the Roofline Model
3.2.1. Accelerator Overview
3.2.2. The Roofline Model of ZYNQ 7100
3.2.3. Data Partition and Exploring the Design Space
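The Roofline model named in Section 3.2 bounds attainable performance by the lower of two ceilings: the platform's peak compute rate, and the product of a design's computing-to-communication (CTC) ratio and the external memory bandwidth. A hedged sketch of how a design point can be scored (the numeric peak and bandwidth values in the test are placeholders, not the ZYNQ 7100 figures):

```python
def attainable_gflops(ctc_ratio, peak_gflops, bandwidth_gbs):
    """Roofline model: a design is either compute-bound or memory-bound,
    whichever ceiling is lower.
    ctc_ratio: operations performed per byte moved to/from external memory."""
    return min(peak_gflops, ctc_ratio * bandwidth_gbs)
```

Design points whose CTC ratio falls left of the ridge point (peak / bandwidth) are memory-bound; tiling choices that raise data reuse push a design toward the compute-bound region.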
4. Experimental Evaluation and Results
4.1. Resource Utilization
4.2. Comparisons of Pipelined and Non-Pipelined Designs
4.3. Comparisons with CPU Implementation
4.4. Comparisons with Others
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
CNN | Convolutional Neural Network |
FPGA | Field-Programmable Gate Array |
IoT | Internet of Things |
GPU | Graphics Processing Unit |
ASIC | Application-Specific Integrated Circuit |
OpenCL | Open Computing Language |
DDR | Double Data Rate |
AXI | Advanced eXtensible Interface |
DRAM | Dynamic Random-Access Memory |
PS | Processing System |
PL | Programmable Logic |
DMA | Direct Memory Access |
HP | High Performance |
GFLOPS | Giga Floating-Point Operations Per Second |
CTC | Computing-to-Communication |
BW | Bandwidth |
DSP | Digital Signal Processing |
FF | Flip-Flop |
LUT | Look-Up Table |
MACC | Multiply-Accumulate |
RTL | Register Transfer Level |
References
Parameter | Description |
---|---|
width | The width of the input feature map |
height | The height of the input feature map |
channels_in | Number of channels of the input feature map |
channels_out | Number of channels of the output feature map |
Tr | Block factor of the width of the output feature map |
Tc | Block factor of the height of the output feature map |
Tri | Block factor of the width of the input feature map |
Tci | Block factor of the height of the input feature map |
Tni | Block factor of the channels of the input feature map |
kernel | The size of convolution kernels |
stride | The stride of convolution |
pad | Whether to pad or not |
depthwise | Whether it is a depthwise convolution or not |
relu | Whether to apply ReLU activation or not |
split | Whether it is a split layer (detection layer) or not |
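The block factors in the table correspond to loop tiling of the convolution nest: Tr×Tc output tiles and Tni-deep input-channel tiles are sized so each working set fits the on-chip buffers. An illustrative software rendering of that tiling (our own simplified indexing; stride 1, no padding, and no double buffering):

```python
import numpy as np

def tiled_conv(inp, weights, Tr, Tc, Tni):
    """Standard convolution computed tile by tile.
    inp: (Ni, H, W), weights: (M, Ni, K, K).
    Tr, Tc: output-tile height/width; Tni: input-channel tile depth."""
    Ni, H, W = inp.shape
    M, _, K, _ = weights.shape
    R, C = H - K + 1, W - K + 1
    out = np.zeros((M, R, C))
    for r0 in range(0, R, Tr):            # tile over output rows
        for c0 in range(0, C, Tc):        # tile over output columns
            for n0 in range(0, Ni, Tni):  # tile over input channels
                # In hardware, this tile of inp/weights would now be
                # loaded from DRAM into on-chip (ping-pong) buffers.
                for m in range(M):
                    for r in range(r0, min(r0 + Tr, R)):
                        for c in range(c0, min(c0 + Tc, C)):
                            out[m, r, c] += np.sum(
                                inp[n0:n0 + Tni, r:r + K, c:c + K]
                                * weights[m, n0:n0 + Tni])
    return out
```

Any valid choice of (Tr, Tc, Tni) produces the same result; what changes is the CTC ratio (data reuse per DRAM transfer), which is exactly the axis the Roofline-driven design-space exploration optimizes.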
Name | BRAM_18K | DSP | FF | LUT | Power (pipelined) | Power (unpipelined) |
---|---|---|---|---|---|---|
Total | 708 | 1926 | 187,146 | 142,291 | 4.083 W | 3.993 W |
Available | 1510 | 2020 | 554,800 | 277,400 | - | - |
Utilization (%) | 46 | 95 | 38 | 51 | - | - |
Name | First Layer | First Layer (Block) | Second Layer | Second Layer (Block) |
---|---|---|---|---|
Output_fm row (R) | 150 | 150 | 150 | 150 |
Output_fm col (C) | 150 | 150 | 150 | 150 |
Output_fm channel (M) | 32 | 64 | 32 | 64 |
Input_fm channel (Ni) | 3 | 6 | 32 | 64 |
Kernel channel (N) | 3 | 6 | 1 | 6 |
Kernel (K) | 3 | 3 | 3 | 3 |
Stride (S) | 2 | 2 | 1 | 1 |
Name | Non-Pipelined | Fully Pipelined |
---|---|---|
Latency (clock cycles) | 1,921,492 | 934,612 |
Clock frequency (MHz) | 100 | 100 |
Time (ms) | 19.21 | 9.35 |
Computation workload (GFLOPs) | 0.16 | 0.16 |
Performance (GFLOPS) | 8.32 | 17.11 |
Name | Non-Pipelined | Fully Pipelined |
---|---|---|
Latency (clock cycles) | 2,816,722 | 1,904,020 |
Clock frequency (MHz) | 100 | 100 |
Time (ms) | 28.2 | 19 |
Computation workload (GFLOPs) | 0.16 | 0.16 |
Performance (GFLOPS) | 5.67 | 8.42 |
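The performance rows in the two tables above follow directly from the latency, clock frequency, and workload rows: throughput = workload / (latency / clock). A one-line check against the reported numbers:

```python
def gflops(workload_gflop, latency_cycles, clock_mhz):
    """Achieved throughput implied by a cycle count and clock frequency."""
    time_s = latency_cycles / (clock_mhz * 1e6)
    return workload_gflop / time_s
```

For example, 0.16 GFLOP in 934,612 cycles at 100 MHz gives roughly 17.1 GFLOPS, matching the fully pipelined figure in the first table.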
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Liu, B.; Zou, D.; Feng, L.; Feng, S.; Fu, P.; Li, J. An FPGA-Based CNN Accelerator Integrating Depthwise Separable Convolution. Electronics 2019, 8, 281. https://doi.org/10.3390/electronics8030281