LDF-BNN: A Real-Time and High-Accuracy Binary Neural Network Accelerator Based on the Improved BNext
Abstract
1. Introduction
- (1) The LDF-BNN model based on BNext is proposed. The high-accuracy BNext model was designed without regard for the data-streaming requirements of actual hardware deployment, so it is bandwidth-constrained and reaches an overall computational efficiency of only 38.18%. We therefore introduce a layered data fusion (LDF) mechanism: by fusing data from different layers, LDF reduces bandwidth requirements by 53.67%, with almost negligible loss of accuracy.
- (2) An efficient hardware architecture for LDF-BNN is devised. Because LDF-BNN requires a relatively complex hardware architecture and no general-purpose accelerator is available for its deployment, we design a dedicated one. The architecture pipelines the LDF-BNN modules, improving overall computational efficiency by 31.52% and achieving a 1.83× speedup.
- (3) An innovative multi-storage parallelism (MSP) design is introduced. Even after the model-structure optimization, the rate at which convolution reads feature-map data remains a bottleneck, leaving the efficiency of 1×1 convolution at only 31.68%. The multilevel buffers of the aforementioned architecture require more multi-dimensional parallelism to meet the demanded read rates, so MSP is proposed; it further improves overall computational efficiency by 14.75% and achieves a 2.21× speedup.
- (4) The design is fully implemented on the Xilinx ZCU102 platform. Experimental results show that the proposed LDF-BNN accelerator attains a high accuracy of 72.23%, an image processing rate of 72.6 FPS, and 1826 GOPs on the ImageNet dataset at a 200 MHz system clock. It also maintains a high accuracy of 98.70% on the Mixed WM-38 dataset [12], demonstrating the potential of LDF-BNN in the field of defect detection.
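As background for contributions (1)–(3): the arithmetic that any BNN accelerator ultimately exploits is the XNOR–popcount dot product over {−1, +1} vectors. The sketch below shows that standard identity only; it is not the paper's exact datapath.

```python
import numpy as np

def binary_dot(a_bits: np.ndarray, w_bits: np.ndarray) -> int:
    """Dot product of two {-1,+1} vectors encoded as {0,1} bits.

    With n = len(a_bits), the {-1,+1} dot product equals
    n - 2 * popcount(a XOR w): matching bit positions contribute +1,
    mismatching positions contribute -1.
    """
    n = a_bits.size
    mismatches = int(np.count_nonzero(a_bits ^ w_bits))  # popcount of XOR
    return n - 2 * mismatches

# Self-check against the full-precision {-1,+1} dot product.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=64)
w = rng.integers(0, 2, size=64)
assert binary_dot(a, w) == int(np.dot(2 * a - 1, 2 * w - 1))
```

On hardware, the XOR and popcount map to LUTs rather than DSP multipliers, which is why the accelerator's 1×1/3×3 binary convolution engines are LUT-dominated.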
2. Proposed BNN Architecture
2.1. Bottlenecks in Previous Design
2.2. Proposed LDF-BNN Architecture
2.3. Model Quantization
3. The Framework of Hardware
3.1. Overall Hardware Architecture
3.2. Multi-Mode Convolution Computation Engine
3.3. The Design of On-Chip Data Buffers
3.4. SE Module Engine
Algorithm 1: PWC0 in the SEE.
Input: the output results of the GAP, the GAP channel number, the PWC0 output channel number, the weights of PWC0, and the input/output parallelism.
Output: the PWC0 results.
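The SEE implements the squeeze-and-excitation computation in hardware. For reference, the sketch below is the standard SE block that PWC0/PWC1 realize functionally; the weight names `w0`/`w1`, the ReLU between the two point-wise convolutions, and the omission of the hardware parallelism parameters are assumptions of this sketch, not the paper's exact design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_module(fmap, w0, w1):
    """Squeeze-and-excitation over a (C, H, W) feature map.

    fmap : (C, H, W) input feature map
    w0   : (C_mid, C) weights of the first point-wise conv (PWC0)
    w1   : (C, C_mid) weights of the second point-wise conv (PWC1)

    GAP squeezes each channel to a scalar; on that squeezed vector the
    two 1x1 convolutions are plain matrix-vector products, producing
    per-channel gates that rescale the input.
    """
    squeezed = fmap.mean(axis=(1, 2))          # GAP: (C,)
    hidden = np.maximum(w0 @ squeezed, 0.0)    # PWC0 + ReLU: (C_mid,)
    gates = sigmoid(w1 @ hidden)               # PWC1 + sigmoid: (C,)
    return fmap * gates[:, None, None]         # per-channel rescale
```

Because the squeezed vector has only C elements, PWC0 degenerates to a small matrix-vector product, which is why the SEE can process it with a simple input/output-parallel loop.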
3.5. Post-Processing Engine
4. Experimental Results and Analysis
4.1. Software Performance Analysis
4.1.1. Comparison with Different BNNs
4.1.2. Ablation Experiments
4.1.3. Performance on Mixed WM-38 Dataset
4.2. Hardware Performance Analysis
4.2.1. Comparison under Different Methods
4.2.2. Performance Comparison with Existing Designs
4.2.3. Performance Comparison with CPU and GPU
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Nomenclature
Term | Definition |
---|---|
Conv2d | 2D convolution |
BConv | Binary convolution |
BN | Batch normalization |
PReLU | Parametric rectified linear unit |
AvgPool | Average pooling |
PWC | Point-wise convolution |
GAP | Global average pooling |
Move | Learnable parameter added in the per-channel direction in BNext |
SE | Squeeze-and-excitation |
B0 | Block random access memory with id 0 |
MEM | Dynamic random access memory |
CL3×3 | Complex Layer with kernel size of 3 × 3 |
SL1×1 | Simple Layer with kernel size of 1 × 1 |
2SL1×1 | Two SL1×1 modules followed by concatenation |
Conv3×3 | Convolution with kernel size of 3 × 3 |
Conv1×1 | Convolution with kernel size of 1 × 1 |
LDF-BNN-T | LDF-BNN with tiny model size |
LDF-BNN-S | LDF-BNN with small model size |
Accum | Accumulator |
Reorder | Data reordering operation |
i | Superscript denoting the i-th layer |
 | Subscript denoting the input of BConv |
 | Subscript denoting the output of BConv |
W | Width of the feature map |
H | Height of the feature map |
C | Number of channels in the feature map |
S | Total size of the feature map |
Appendix A
Methods | CIFAR10 Accuracy | CIFAR100 Accuracy | kLUT |
---|---|---|---|
LDF-BNN(ResNet-18+ReLU) | 92.83% | 69.56% | 173 |
LDF-BNN(ResNet-18+Sigmoid) | 94.39% (↑1.56%) | 73.07% (↑3.51%) | 174 (↑0.58%) |
References
- Cao, J.; Yang, G.; Yang, X. A pixel-level segmentation convolutional neural network based on deep feature fusion for surface defect detection. IEEE Trans. Instrum. Meas. 2020, 70, 1–12. [Google Scholar] [CrossRef]
- Deng, G.; Wang, H. Efficient Mixed-Type Wafer Defect Pattern Recognition Based on Light-Weight Neural Network. Micromachines 2024, 15, 836. [Google Scholar] [CrossRef] [PubMed]
- Jing, J.F.; Ma, H.; Zhang, H.H. Automatic fabric defect detection using a deep convolutional neural network. Color. Technol. 2019, 135, 213–223. [Google Scholar] [CrossRef]
- Blott, M.; Preußer, T.B.; Fraser, N.J.; Gambardella, G.; O’brien, K.; Umuroglu, Y.; Leeser, M.; Vissers, K. FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks. ACM Trans. Reconfigurable Technol. Syst. (TRETS) 2018, 11, 1–23. [Google Scholar] [CrossRef]
- Nakahara, H.; Que, Z.; Luk, W. High-throughput convolutional neural network on an FPGA by customized JPEG compression. In Proceedings of the 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Fayetteville, AR, USA, 3–6 May 2020; pp. 1–9. [Google Scholar] [CrossRef]
- Zhang, Y.; Pan, J.; Liu, X.; Chen, H.; Chen, D.; Zhang, Z. FracBNN: Accurate and FPGA-efficient binary neural networks with fractional activations. In Proceedings of the 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Virtual Event, USA, 28 February–2 March 2021; pp. 171–182. [Google Scholar] [CrossRef]
- Guo, N.; Bethge, J.; Meinel, C.; Yang, H. Join the high accuracy club on ImageNet with a binary neural network ticket. arXiv 2022, arXiv:2211.12933. [Google Scholar] [CrossRef]
- Liu, Z.; Luo, W.; Wu, B.; Yang, X.; Liu, W.; Cheng, K.T. Bi-real net: Binarizing deep network towards real-network performance. Int. J. Comput. Vis. 2020, 128, 202–219. [Google Scholar] [CrossRef]
- Liu, Z.; Shen, Z.; Savvides, M.; Cheng, K.T. ReActNet: Towards Precise Binary Neural Network with Generalized Activation Functions. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 143–159. [Google Scholar] [CrossRef]
- Song, M.; Asim, F.; Lee, J. Extending Neural Processing Unit and Compiler for Advanced Binarized Neural Networks. In Proceedings of the 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC), Incheon, Republic of Korea, 22–25 January 2024; pp. 115–120. [Google Scholar] [CrossRef]
- Ma, R.; Qiao, G.; Liu, Y.; Meng, L.; Ning, N.; Liu, Y.; Hu, S. A&B BNN: Add&Bit-Operation-Only Hardware-Friendly Binary Neural Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 5704–5713. [Google Scholar]
- Wang, J.; Xu, C.; Yang, Z.; Zhang, J.; Li, X. Deformable convolutional networks for efficient mixed-type wafer defect pattern recognition. IEEE Trans. Semicond. Manuf. 2020, 33, 587–596. [Google Scholar] [CrossRef]
- Esser, S.K.; McKinstry, J.L.; Bablani, D.; Appuswamy, R.; Modha, D.S. Learned step size quantization. arXiv 2019, arXiv:1902.08153. [Google Scholar] [CrossRef]
- Zhang, D.; Wang, A.; Mo, R.; Wang, D. End-to-end acceleration of the YOLO object detection framework on FPGA-only devices. Neural Comput. Appl. 2024, 36, 1067–1089. [Google Scholar] [CrossRef]
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar] [CrossRef]
- Liu, L.; Jiang, H.; He, P.; Chen, W.; Liu, X.; Gao, J.; Han, J. On the variance of the adaptive learning rate and beyond. arXiv 2019, arXiv:1908.03265. [Google Scholar] [CrossRef]
- Chen, T.; Zhang, Z.; Ouyang, X.; Liu, Z.; Shen, Z.; Wang, Z. "BNN-BN=?": Training Binary Neural Networks Without Batch Normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Montreal, QC, Canada, 10–17 October 2021; pp. 4619–4629. [Google Scholar]
- Sun, M.; Li, Z.; Lu, A.; Li, Y.; Chang, S.E.; Ma, X.; Lin, X.; Fang, Z. FILM-QNN: Efficient FPGA acceleration of deep neural networks with intra-layer, mixed-precision quantization. In Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Virtual Event, USA, 27 February–1 March 2022; pp. 134–145. [Google Scholar] [CrossRef]
- Yang, S.; Ding, C.; Huang, M.; Li, K.; Li, C.; Wei, Z.; Huang, S.; Dong, J.; Zhang, L.; Yu, H. LAMPS: A Layer-wised Mixed-Precision-and-Sparsity Accelerator for NAS-Optimized CNNs on FPGA. In Proceedings of the 2024 IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Orlando, FL, USA, 5–8 May 2024; pp. 90–96. [Google Scholar] [CrossRef]
- Lu, L.; Xie, J.; Huang, R.; Zhang, J.; Lin, W.; Liang, Y. An efficient hardware accelerator for sparse convolutional neural networks on FPGAs. In Proceedings of the 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), San Diego, CA, USA, 28 April–1 May 2019; pp. 17–25. [Google Scholar] [CrossRef]
- Dong, P.; Sun, M.; Lu, A.; Xie, Y.; Liu, K.; Kong, Z.; Meng, X.; Li, Z.; Lin, X.; Fang, Z.; et al. Heatvit: Hardware-efficient adaptive token pruning for vision transformers. In Proceedings of the 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Montreal, QC, Canada, 25 February–1 March 2023; pp. 442–455. [Google Scholar] [CrossRef]
Models | N | M | Channel Number (C) |
---|---|---|---|
LDF-BNN-T | 1 | 3 | 32 |
LDF-BNN-S | 1 | 3 | 48 |
Models | Weight | I/OFMs | Total |
---|---|---|---|
BNext-S | 12.34 MB | 58.85 MB | 71.19 MB |
LDF-BNN-S | 12.34 MB | 20.64 MB | 32.98 MB |
Reduction | 0.00% | 64.93% | 53.67% |
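The reduction row can be sanity-checked from the first two rows of the table:

```python
# Reproducing the reduction percentages in the table above (values in MB).
bnext_weight, bnext_fm = 12.34, 58.85   # BNext-S
ldf_weight, ldf_fm = 12.34, 20.64       # LDF-BNN-S

bnext_total = bnext_weight + bnext_fm   # 71.19 MB
ldf_total = ldf_weight + ldf_fm         # 32.98 MB

fm_reduction = 100 * (bnext_fm - ldf_fm) / bnext_fm
total_reduction = 100 * (bnext_total - ldf_total) / bnext_total
print(f"{fm_reduction:.2f}% {total_reduction:.2f}%")  # 64.93% 53.67%
```

Since the weights are untouched (0.00% reduction), all of the 53.67% total saving comes from the input/output feature maps that LDF keeps on chip.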
Models | Year | BOPs (×10⁹) | Top-1 Accuracy (%) |
---|---|---|---|
Bi-RealNet-18 [8] | 2020 | 1.68 | 56.4 |
Bi-RealNet-34 [8] | 2020 | 3.53 | 62.2 |
ReActNet-A [9] | 2020 | 4.82 | 69.4 |
ReActNet-Adam [16] | 2019 | 4.82 | 70.5 |
BNext-T [7] | 2022 | 4.82 | 72.4 |
BNext-S [7] | 2022 | 10.84 | 76.1 |
LDF-BNN-T | 2024 | 4.82 | 72.2 |
LDF-BNN-S | 2024 | 10.84 | 75.8 |
Models | Year | CIFAR10 Accuracy (%) | CIFAR100 Accuracy (%) |
---|---|---|---|
ReActNet [17] | 2021 | 92.1 | 68.3 |
AdaBNN [17] | 2022 | 93.1 | - |
BNext [7] | 2022 | 93.6 | 72.2 |
LDF-BNN | 2024 | 94.4 | 73.1 |
Methods | Epochs | Top-1 Accuracy (%) |
---|---|---|
Remove LDF | 512 | 75.27 |
BNext-S | 512 | 76.04 (+0.77) |
LDF-BNN-S | 512 | 75.78 (+0.48) |
+quantize | 128 | 72.44 |
+modules merging | 128 | 71.85 (−0.59) |
+GAP factor | 128 | 72.23 (−0.21) |
Models | Parameters (MB) | OPs (M) | Accuracy (%) |
---|---|---|---|
MobileNetV2 [2] | 8.92 | 326.22 | 97.56 |
ResNet-50 [2] | 94.08 | 4131.71 | 96.92 |
LDF-BNN-S | 11.75 | 298.98 | 98.78 |
+quantize, modules merging, GAP factor | 11.40 | 187.13 | 98.70 |
Methods | Conv1×1 Efficiency | Conv3×3 Efficiency | Overall Efficiency | Latency (ms) | Speedup | kLUTs | BRAM | CARRY8 | DSPs |
---|---|---|---|---|---|---|---|---|---|
Baseline | 17.41% | 46.70% | 38.18% | 30.3 | 1× | 189 | 357.5 | 7979 | 863 |
optimized | 31.68% (↑14.27%) | 80.79% (↑34.09%) | 69.70% (↑31.52%) | 16.6 | 1.83× | 171 (↓9.52%) | 431 (↑20.56%) | 6998 (↓12.29%) | 527 (↓38.93%) |
optimized+ MSP | 42.46% (↑25.05%) | 93.73% (↑47.03%) | 84.45% (↑46.27%) | 13.7 | 2.21× | 174 (↓7.94%) | 447 (↑25.03%) | 7265 (↓8.95%) | 527 (↓38.93%) |
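Assuming the speedup column is simply the baseline latency divided by each design's latency, the table's figures reproduce exactly:

```python
# Speedup = baseline latency / design latency (latencies from the table, in ms).
baseline_ms, optimized_ms, msp_ms = 30.3, 16.6, 13.7
speedup_opt = baseline_ms / optimized_ms
speedup_msp = baseline_ms / msp_ms
print(round(speedup_opt, 2), round(speedup_msp, 2))  # 1.83 2.21
```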
Accelerator | [20] | FILM-QNN [18] | HeatViT [21] | VTA [10] | LAMPS [19] | Ours | ||
---|---|---|---|---|---|---|---|---|
Year | 2019 | 2022 | 2023 | 2024 | 2024 | 2024 | ||
Model | ResNet- 50 | ResNet- 18 | ResNet- 50 | MobileNet- V2 | DeiT-B | BiRealNet- 18 | NAS VGG -16 | LDF-BNN- S |
Top-1 (%) | 76.5 | 70.47 | 77.25 | 65.67 | 81.80 | 56.40 | 70.10 | 72.23 |
Bits (W/A) | 16/16 | 4/5 | 16/16 | 1/1 | Mixed | 1/1 | ||
(MHz) | 200 | 150 | 150 | 333 | 214 | 200 | ||
Power † | 23.6/- | -/12.9 | -/11.0 | -/- | -/- | 18.6/9.3 | ||
kLUT | 132 | 180 | 145 | 53 | 206 | 174 | ||
DSP | 1144 | 2092 | 1786 | 59 | 1037 | 527 | ||
BRAM (36k) | 912 | 440.5 | 664.5 | 139 | 481.5 | 447 | ||
FPS | - | 214.8 | 109.1 | 537.9 | 11.2 | 24.3 | 40.6 | 72.6 |
Throughput (GOPs) | 291 | 779 | 891 | 320 | 394 | 230 | 757 | 1826 |
GOPs/kLUT | 2.20 | 4.32 | 4.95 | 1.78 | 2.72 | 4.34 | 3.67 | 10.49 |
GOPs/DSP | 0.25 | 0.37 | 0.43 | 0.15 | 0.22 | 3.89 | 0.73 | 3.46 |
Energy efficiency † (GOPs/W) | 12.33/- | -/60.4 | -/69.1 | -/24.8 | -/35.2 | -/- | -/- | 98.2/196.3 |
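The resource-normalized metrics for the "Ours" column follow directly from the throughput and resource rows:

```python
# Throughput per unit of LUT and DSP resources ("Ours" column).
gops, klut, dsp = 1826, 174, 527
print(round(gops / klut, 2))  # 10.49 GOPs/kLUT
print(round(gops / dsp, 2))   # 3.46 GOPs/DSP
```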
Platform | Frequency | Power (W) | FPS | Energy Efficiency (FPS/W) |
---|---|---|---|---|
CPU | 3.6 GHz | 95 W | 2.3 | 0.02 |
GPU | 2.2 GHz | 302 W | 357.4 | 1.18 |
FPGA | 0.2 GHz | 9.3 W | 72.6 | 7.81 |
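The energy-efficiency column is simply FPS divided by power:

```python
# FPS per watt for each platform, from the table above.
eff = {platform: round(fps / power, 2)
       for platform, fps, power in [("CPU", 2.3, 95.0),
                                    ("GPU", 357.4, 302.0),
                                    ("FPGA", 72.6, 9.3)]}
print(eff)  # {'CPU': 0.02, 'GPU': 1.18, 'FPGA': 7.81}
```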
Share and Cite
Wan, R.; Cen, R.; Zhang, D.; Wang, D. LDF-BNN: A Real-Time and High-Accuracy Binary Neural Network Accelerator Based on the Improved BNext. Micromachines 2024, 15, 1265. https://doi.org/10.3390/mi15101265