TORRES: A Resource-Efficient Inference Processor for Binary Convolutional Neural Networks Based on Locality-Aware Operation Skipping
Abstract
1. Introduction
- TORRES efficiently skips some operations within the pooling windows to achieve high inference speed by exploiting the spatial locality inherent in feature maps. Furthermore, the training process is regularized by modifying the loss function so that more operations can be skipped.
- The microarchitecture is designed to skip operations without considerable resource overhead. In addition, address generation is carried out efficiently by carefully ordering the elements in the memories.
- A prototype inference system has been implemented based on TORRES in a 28 nm field-programmable gate array (FPGA), on which the functionality has been verified thoroughly for practical inference tasks. The resource efficiency of TORRES is as high as 126.06 MOP/s/LUT at an inference speed of 291.2 GOP/s with a resource usage of 2.31 K LUTs. The inference accuracy is 87.47% for the CIFAR10 classification task.
2. Background
- Threshold-based operation skipping [8,10]: The partial sum, calculated by accumulating the non-negative results of the XPOP (XNOR-popcount) operations, increases monotonically. Therefore, the resulting bit can be determined as soon as the partial sum exceeds the threshold, and the remaining operations can be skipped (see the sketch after this list).
- Pooling-based operation skipping [8,10]: Max pooling can be performed with logical OR operations on the binarized elements within a window [19]; hence, the bit resulting from the window can be determined as soon as any element within the window is determined to be 1, and the operations involved in computing the other elements can be skipped.
- Boundary operation skipping [8]: The operations involving the zeros padded around the boundaries of the input feature map (to maintain the feature-map size across the convolution) can be skipped by trimming the receptive fields.
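To make the first two schemes concrete, the following is a minimal Python sketch of an XNOR-popcount loop over one pooling window; the bit-packed data layout, the function names, and the word width are illustrative assumptions rather than the TORRES implementation:

```python
def xpop(x_word: int, w_word: int, width: int = 32) -> int:
    """XNOR-popcount: count the matching bits of two bit-packed words."""
    mask = (1 << width) - 1
    return bin(~(x_word ^ w_word) & mask).count("1")

def pooled_output_bit(window_elems, weights, threshold, width=32):
    """One pooled output bit with threshold- and pooling-based skipping.

    window_elems: for each output element in the pooling window, the list
                  of bit-packed input words of its receptive field.
    weights:      bit-packed weight words, shared by all elements.
    """
    for words in window_elems:
        psum = 0
        for x, w in zip(words, weights):
            psum += xpop(x, w, width)
            if psum >= threshold:
                # Threshold-based skipping: psum never decreases, so this
                # element is already 1. Pooling-based skipping: a single 1
                # decides the OR over the window, so the remaining elements
                # (and all their operations) are skipped as well.
                return 1
    return 0  # no element reached the threshold: the pooled bit is 0
```

Because each XPOP result is non-negative, the partial sum can never decrease, which is what makes the early exit safe.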
3. Proposed Processor: TORRES
3.1. Processing Flow
Algorithm 1: Block processing flow in TORRES, where $\mathbb{N}$ and $\mathbb{Z}$ are the natural and integer number sets, respectively.
- To detect the spatial locality, the proposed scheme considers only one adjacent window, located immediately to the left of the current window. According to the processing flow in Algorithm 1, this window corresponds to the previous window, processed just before the current one for the feature map in a channel. Considering more windows neighboring the current window could capture the locality more thoroughly; however, doing so would substantially increase the complexity of the locality detection, which may lead to a prohibitive resource overhead in the implementation.
- In the compute-ordering technique, the ordering complexity increases with the window size, so the resource overhead of supporting larger windows may be considerable. However, in practical BCNN models, such as those presented in [3,14,18,19], the pooling window is usually kept small, since a large pooling window may cause a drastic loss of features. The ordering for a window as small as that illustrated in Figure 3 involves low complexity with only a few cases.
- The zero-prediction technique has two parameters. Configured for each block processing, these parameters control the effect on the inference results as well as the number of operations skipped: configuring them to smaller values skips more operations at the cost of a greater deviation of the inference results from those obtained without the technique (see the sketch after this list).
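As an illustration of the zero-prediction idea, the following sketch (reusing `xpop` from the sketch in Section 2) skips part of the current window's work whenever the window immediately to the left evaluated to 0. The decision rule and the parameter name `skip_count` are assumptions made for illustration, not the exact TORRES rule:

```python
def window_bit_with_prediction(window_elems, weights, threshold,
                               prev_bit, skip_count, width=32):
    """Pooled bit for the current window, predicted from its left neighbour.

    prev_bit:   pooled result of the window immediately to the left.
    skip_count: hypothetical parameter; how many elements to skip when the
                left neighbour was 0 (exploiting spatial locality).
    """
    elems = window_elems
    if prev_bit == 0 and skip_count > 0:
        # Predict that the current window is also 0 and evaluate only a
        # subset of its elements; the skipped ones are assumed to be 0.
        elems = window_elems[:max(0, len(window_elems) - skip_count)]
    for words in elems:
        psum = 0
        for x, w in zip(words, weights):
            psum += xpop(x, w, width)
            if psum >= threshold:
                return 1
    return 0  # computed (or predicted) as 0
```

Under this reading, more aggressive parameter settings skip more elements and therefore deviate more from the unskipped result, matching the trade-off described above.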
3.2. Microarchitecture
- The other previous operation-skipping schemes are implemented as presented in our previous work [8]. The threshold-based and pooling-based operation-skipping schemes are implemented by controlling the counters so as to terminate the loop once the related conditions are satisfied. The boundary operation-skipping scheme is implemented by biasing the partial sum and changing the loop range, as expressed by Lines 11 and 12 in Algorithm 1, respectively; dedicated components (the receptive-field range calculator and the partial-sum bias calculator) are incorporated for this purpose. The implementation details of these components are omitted here for brevity; interested readers are referred to our previous work [8].
- The partial sum is initialized by negating the sum of the threshold and the bias corresponding to the receptive-field size. The comparison of the partial sum with the threshold is thus implemented efficiently by checking the sign of the partial sum, with no explicit comparison (see the sketch after this list).
- Since the first block in the BCNN inference process usually takes an input feature map with multi-bit elements, it has to be processed in the manner of a binary-weight network [2], unlike the other blocks. To support this kind of processing, the microarchitecture is designed to carry out the summation of either the single-bit or the multi-bit elements resulting from the bitwise XNOR operations, as shown in Figure 5.
- The last block in the BCNN inference process produces the soft results, which are used for post-processing (e.g., calculating the class probabilities in classification tasks). The summation result stored in the register corresponds to each of the soft results and can be stored directly to the memory, as shown in Figure 5.
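The sign-based comparison in the second bullet above can be sketched as follows; how the bias is derived from the receptive-field size is kept abstract here, as an assumption:

```python
def output_bit_sign_trick(words, weights, threshold, bias, width=32):
    """Early-terminating XPOP loop using a negated (threshold + bias) init.

    Initializing psum to -(threshold + bias) turns the comparison
    "popcount sum >= threshold + bias" into a sign check on psum, which
    in hardware amounts to inspecting the accumulator's sign bit.
    """
    psum = -(threshold + bias)       # negated threshold-plus-bias init
    for x, w in zip(words, weights):
        psum += xpop(x, w, width)
        if psum >= 0:                # sign check, no explicit comparator
            return 1
    return 0
```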
3.3. Prototype Inference System
4. Results and Evaluation
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized neural networks. In Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 5–10 December 2016; pp. 4107–4115.
2. Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 525–542.
3. Ferrarini, B.; Milford, M.J.; McDonald-Maier, K.D.; Ehsan, S. Binary neural networks for memory-efficient and effective visual place recognition in changing environments. IEEE Trans. Robot. 2022, 38, 2617–2631.
4. Cerutti, G.; Cavigelli, L.; Andri, R.; Magno, M.; Farella, E.; Benini, L. Sub-mW keyword spotting on an MCU: Analog binary feature extraction and binary neural networks. IEEE Trans. Circuits Syst. I 2022, 69, 2002–2012.
5. Conti, F.; Schiavone, P.D.; Benini, L. XNOR neural engine: A hardware accelerator IP for 21.6-fJ/OP binary neural network inference. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2018, 37, 2940–2951.
6. Moons, B.; Bankman, D.; Yang, L.; Murmann, B.; Verhelst, M. BinarEye: An always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28 nm CMOS. In Proceedings of the Custom Integrated Circuits Conference, San Diego, CA, USA, 8–11 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–4.
7. Guo, P.; Ma, H.; Chen, R.; Li, P.; Xie, S.; Wang, D. FBNA: A fully binarized neural network accelerator. In Proceedings of the 28th International Conference on Field Programmable Logic and Applications (FPL), Dublin, Ireland, 27–31 August 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 51–54.
8. Kim, T.H.; Shin, J. A resource-efficient inference accelerator for binary convolutional neural networks. IEEE Trans. Circuits Syst. II 2021, 68, 451–455.
9. Kim, T.; Shin, J.; Choi, K. IOTA: A 1.7-TOP/J inference processor for binary convolutional neural networks with 4.7 K LUTs in a tiny FPGA. IET Electron. Lett. 2020, 56, 1041–1044.
10. Geng, T.; Li, A.; Wang, T.; Wu, C.; Li, Y.; Shi, R.; Wu, W.; Herbordt, M. O3BNN-R: An out-of-order architecture for high-performance and regularized BNN inference. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 199–213.
11. Rasoulinezhad, S.; Fox, S.; Zhou, H.; Wang, L.; Boland, D.; Leong, P.H. MajorityNets: BNNs utilising approximate popcount for improved efficiency. In Proceedings of the International Conference on Field-Programmable Technology (ICFPT), Tianjin, China, 9–13 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 339–342.
12. Scherer, M.; Rutishauser, G.; Cavigelli, L.; Benini, L. CUTIE: Beyond PetaOp/s/W ternary DNN inference acceleration with better-than-binary energy efficiency. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 1020–1033.
13. Liu, Q.; Lai, J.; Gao, J. An efficient channel-aware sparse binarized neural networks inference accelerator. IEEE Trans. Circuits Syst. II 2022, 69, 1637–1641.
14. Nakahara, H.; Fujii, T.; Sato, S. A fully connected layer elimination for a binarized convolutional neural network on an FPGA. In Proceedings of the 27th International Conference on Field Programmable Logic and Applications (FPL), Gent, Belgium, 4–6 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–4.
15. Liu, Z.; Shen, Z.; Savvides, M.; Cheng, K.T. ReActNet: Towards precise binary neural network with generalized activation functions. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 143–159.
16. Lin, X.; Zhao, C.; Pan, W. Towards accurate binary convolutional neural network. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 345–353.
17. Zhao, R.; Song, W.; Zhang, W.; Xing, T.; Lin, J.H.; Srivastava, M.; Gupta, R.; Zhang, Z. Accelerating binarized convolutional neural networks with software-programmable FPGAs. In Proceedings of the International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; ACM: New York, NY, USA, 2017; pp. 15–24.
18. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–14.
19. Umuroglu, Y.; Fraser, N.J.; Gambardella, G.; Blott, M.; Leong, P.; Jahre, M.; Vissers, K. FINN: A framework for fast, scalable binarized neural network inference. In Proceedings of the International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; ACM: New York, NY, USA, 2017; pp. 65–74.
20. Sze, V.; Chen, Y.H.; Yang, T.J.; Emer, J.S. Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE 2017, 105, 2295–2329.
21. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 1–12.
22. Weng, J.; Jain, A.; Wang, J.; Wang, L.; Wang, Y.; Nowatzki, T. UNIT: Unifying tensorized instruction compilation. In Proceedings of the International Symposium on Code Generation and Optimization, Seoul, Korea, 27 February–3 March 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 77–89.
23. Zhou, K.; Tan, G.; Zhang, X.; Wang, C.; Sun, N. A performance analysis framework for exploiting GPU microarchitectural capability. In Proceedings of the International Conference on Supercomputing, Chicago, IL, USA, 14–16 June 2017; pp. 1–10.
24. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, Savannah, GA, USA, 2–4 November 2016; pp. 265–283.
25. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; Citeseer: University Park, PA, USA, 2009.
26. Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. Reading digits in natural images with unsupervised feature learning. In Proceedings of the NIPS Workshop on Deep Learning & Unsupervised Feature Learning, Granada, Spain, 12–17 December 2011; pp. 1–9.
| | Traditional [21,22,23] | Traditional [8,9,24] | TORRES |
|---|---|---|---|
| Element ordering a | | | |
| Address layout b | | | |
| Computational complexity c | Three MACs | Two MACs | One MAC |
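The computational-complexity row can be read as the number of multiply-accumulate (MAC) steps needed to form each memory address. A hedged sketch of the contrast follows; the layouts are illustrative assumptions, not the actual memory maps:

```python
# With a generic (channel, row, column) layout, each access recomputes a
# multi-term address polynomial, i.e., several chained MACs:
def addr_generic(c, y, x, H, W):
    return (c * H + y) * W + x

# If the elements are stored in the exact order the processing flow visits
# them within a group, a single MAC per access suffices:
def addr_ordered(group_index, offset_in_group, group_size):
    return group_index * group_size + offset_in_group
```

Carefully ordering the elements in the memories to match the processing flow is what lets TORRES reach the one-MAC column.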
| Target Task | Block a | Model Structure | Parameter Set A | Parameter Set B |
|---|---|---|---|---|
| CIFAR10 classification c | CV1 | (32, 32, 3, 128, 1) | - | - |
| | CV2 | (32, 32, 3, 128, 2) | (1, 4) b | (1, 3) |
| | CV3 | (16, 16, 3, 256, 1) | - | - |
| | CV4 | (16, 16, 3, 256, 2) | (1, 3) | (1, 3) |
| | CV5 | (8, 8, 3, 512, 1) | - | - |
| | CV6 | (8, 8, 3, 512, 2) | (1, 3) | (1, 2) |
| | FC1 | (1, 1, 1, 1024, 1) | - | - |
| | FC2 | (1, 1, 1, 1024, 1) | - | - |
| | FC3 | (1, 1, 1, 10, 1) | - | - |
| SVHN classification | CV1 | (32, 32, 3, 128, 1) | - | - |
| | CV2 | (32, 32, 3, 128, 2) | (2, 3) | (1, 2) |
| | CV3 | (16, 16, 3, 128, 1) | - | - |
| | CV4 | (16, 16, 3, 256, 2) | (2, 3) | (1, 2) |
| | CV5 | (8, 8, 3, 256, 2) | (2, 3) | (2, 1) |
| | FC1 | (1, 1, 1, 128, 1) | - | - |
| | FC2 | (1, 1, 1, 10, 1) | - | - |
BCNN Inference Processor | TORRES a (with Param. Set A) | TORRES a (with Param. Set B) | [7] | [8] | [13] | [14] | [17] |
---|---|---|---|---|---|---|---|
FPGA device b (Part number) | Zynq®-7000 (XC7Z020) | Zynq®-7000 (XC7Z020) | Zynq®-7000 (XC7Z020) | Cyclone®V (5CSXFC6D6) | Zynq®-7000 (XC7Z045) | Zynq®-7000 (XC7Z020) | Zynq®-7000 (XC7Z020) |
Inference speed (GOP/s) c | 255.2 | 291.2 | 722.0 | 83.0 | 13,389.0 | 329.0 | 208.0 |
Resource usage (KLUT) d | 2.31 | 2.31 | 29.60 | 2.00 | 153.86 | 14.5 | 46.9 |
Energy eff. (TOP/J) | 2.69 | 3.07 | 0.22 | 0.94 | 1.23 | 0.14 | 0.04 |
Resource eff. (MOP/s/LUT) | 110.49 | 126.06 | 24.39 | 41.45 | 87.02 | 22.72 | 4.43 |
Classification acc. (%) e | 88.04 | 87.47 | 88.61 | 88.88 | 88.70 | 81.80 | 88.18 |
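As a consistency check on the headline metric, the resource efficiency is simply the inference speed divided by the LUT usage; for TORRES with parameter set B:

$$
\frac{291.2\ \text{GOP/s}}{2.31\ \text{KLUT}} = \frac{291{,}200\ \text{MOP/s}}{2310\ \text{LUT}} \approx 126.06\ \text{MOP/s/LUT}
$$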