Automatic Compression of Neural Network with Deep Reinforcement Learning Based on Proximal Gradient Method
Abstract
1. Introduction
- (1)
- One is the pruning method driven by regularization; for example, ref. [25] uses the group sparse lasso penalty to prune network neurons. However, the compression rate depends on hyperparameters that must be analyzed and tuned empirically. Moreover, the layers of a network are correlated, so the compression rate chosen for one layer affects the other layers. Thus, setting the compression rate by adjusting hyperparameters alone ignores the relationships among the layers of the network.
- (2)
- The second is a pruning method driven by auxiliary parameters, such as AutoPrune [28], which introduces a set of auxiliary parameters and trains them, instead of the original weights, to prune the network, reducing the dependence on the network and the time spent on hyperparameter trials. However, pruning the network through auxiliary parameters still requires tuning the corresponding hyperparameters, so manual, empirical adjustment is still needed.
- (3)
- The third is a deep reinforcement learning (DRL)-based pruning method, such as AMC [30], which prunes a network by letting DRL decide the compression rate of each layer without complex hyperparameter tuning. However, redundant weights in the DRL networks themselves can make the pruning process inefficient.
- (1)
- To improve the predictive performance of the critic network, we impose an L-norm regularizer on the weights of the critic network so that the activation outputs yield more distinct feature representations, thus improving the prediction accuracy of the critic network.
- (2)
- To improve the decision performance of the actor network, we impose an L-norm regularizer on the weights of the actor network so that insignificant weights converge to 0, which reduces redundant weights in the actor network and improves its decision accuracy.
- (3)
- To improve the training efficiency and automatic compression performance of DRL, the proximal gradient method is employed to optimize the objective function of DRL by updating the weight parameters of the critic network and the actor network. The resulting DRL-based automatic compression algorithm achieves automatic compression of the network.
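The proximal gradient update used throughout the contributions above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the regularizer's norm subscript is not specified here, so the L1 proximal map (soft-thresholding) is assumed, and `grad` stands in for the gradient of the smooth part of the DRL objective.

```python
import numpy as np

def soft_threshold(w, lam, lr):
    """Proximal operator of the L1 norm: shrinks each weight toward zero
    by lr * lam and zeroes out weights smaller than that threshold.
    (Assumed L1 penalty; the paper's exact norm may differ.)"""
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

def proximal_gradient_step(w, grad, lr=0.01, lam=0.001):
    """One proximal gradient update on a weight vector: a gradient step on
    the smooth loss, followed by the proximal map of the regularizer."""
    return soft_threshold(w - lr * grad, lam, lr)
```

Applied to the actor and critic weights, this update drives insignificant weights exactly to zero while the gradient step minimizes the DRL loss, which is what allows the regularizer to sparsify the networks during training rather than after it.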
2. Related Work
3. Automatic Compression of Neural Network with Deep Reinforcement Learning Based on Proximal Gradient Method
3.1. Markov Decision Problem Based Pruning Technology
3.1.1. State
3.1.2. Action
3.1.3. Reward
3.2. Pruning Based on DRL with the Proximal Gradient
3.3. Proposed Algorithm
Algorithm 1 Neural network pruning algorithm by deep reinforcement learning.
Initialization: t: current layer; s_t: current state; a_t: pruning rate created by the actor; s_{t+1}: next state; r_t: reward of pruning one layer; N: desired size of the data to be collected; D: experience pool; |D|: experience pool size.
Pruning:
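The layer-by-layer loop of Algorithm 1 can be sketched as follows, assuming a DDPG-style setup with an experience pool. Here `actor`, `critic`, and `env` are hypothetical stand-ins for the actor/critic networks and the pruning environment; in the paper, both networks are updated with the proximal gradient method described in Section 3.2.

```python
import random
from collections import deque

def prune_with_drl(num_layers, actor, critic, env, pool_size=2000, batch=64):
    """Sketch of the per-layer pruning loop: the actor proposes a pruning
    rate for each layer, the environment applies it and returns a reward,
    and transitions are stored in an experience pool for training."""
    pool = deque(maxlen=pool_size)            # experience pool D
    state = env.reset()                       # embedding of the first layer
    for t in range(num_layers):
        action = actor(state)                 # pruning rate for layer t
        next_state, reward = env.prune(t, action)
        pool.append((state, action, reward, next_state))
        if len(pool) >= batch:                # enough transitions collected
            minibatch = random.sample(list(pool), batch)
            critic.update(minibatch)          # regularized critic step
            actor.update(minibatch, critic)   # regularized actor step
        state = next_state
    return env.compressed_model()
```

The experience pool decorrelates consecutive layer transitions before each minibatch update, which is the standard motivation for replay in DDPG-style training.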
4. Experiments
4.1. Experimental Setup
4.1.1. MNIST Dataset
4.1.2. Fashion MNIST Dataset
4.1.3. CIFAR-10 Dataset
4.2. Experimental Comparison and Evaluation Index
4.3. Convergence of Deep Reinforcement Learning in Pruning
4.4. Balance Factor of the Reward
4.5. Convergence of Reward
4.6. Results and Analysis
4.6.1. Pruning Performance on MNIST
4.6.2. Pruning Performance on Fashion MNIST
4.6.3. Pruning Performance on CIFAR-10
4.7. Ablation Study
4.8. Weight Distribution
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Brock, A.; De, S.; Smith, S.L.; Simonyan, K. High-performance large-scale image recognition without normalization. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 1059–1071. [Google Scholar]
- Ding, X.; Xia, C.; Zhang, X.; Chu, X.; Han, J.; Ding, G. Repmlp: Re-parameterizing convolutions into fully-connected layers for image recognition. arXiv 2021, arXiv:2105.0. [Google Scholar]
- Liang, W.; Long, J.; Li, K.C.; Xu, J.; Ma, N.; Lei, X. A fast defogging image recognition algorithm based on bilateral hybrid filtering. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2021, 17, 1–16. [Google Scholar] [CrossRef]
- Zhang, W.; Wang, J.; Lan, F. Dynamic hand gesture recognition based on short-term sampling neural networks. IEEE/CAA J. Autom. Sin. 2021, 8, 110–120. [Google Scholar] [CrossRef]
- Kim, S.; Hwang, S.; Hong, S.H. Identifying shoplifting behaviors and inferring behavior intention based on human action detection and sequence analysis. Adv. Eng. Inf. 2021, 50, 101399. [Google Scholar] [CrossRef]
- Yang, W.; Zhang, T.; Mao, Z.; Zhang, Y.; Tian, Q.; Wu, F. Multi-scale structure-aware network for weakly supervised temporal action detection. IEEE Trans. Image Process. 2021, 30, 5848–5861. [Google Scholar] [CrossRef] [PubMed]
- Erhan, D.; Szegedy, C.; Toshev, A.; Anguelov, D. Scalable object detection using deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2147–2154. [Google Scholar]
- Cai, Z.; Fan, Q.; Feris, R.S.; Vasconcelos, N. A Unified Multi-Scale Deep Convolutional Neural Network for Fast Object Detection; Springer: Berlin/Heidelberg, Germany, 2016; pp. 354–370. [Google Scholar]
- Muzahid, A.A.M.; Wan, W.; Sohel, F.; Wu, L.; Hou, L. CurveNet: Curvature-Based Multitask Learning Deep Networks for 3D Object Recognition. IEEE/CAA J. Autom. Sin. 2021, 8, 1177–1187. [Google Scholar] [CrossRef]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Arbeláez, P.; Hariharan, B.; Gu, C.; Gupta, S.; Bourdev, L.; Malik, J. Semantic segmentation using regions and parts. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3378–3385. [Google Scholar]
- Peng, C.; Zhang, X.; Yu, G.; Luo, G.; Sun, J. Large kernel matters–improve semantic segmentation by global convolutional network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 4353–4361. [Google Scholar]
- Chen, W.; Wilson, J.; Tyree, S.; Weinberger, K.; Chen, Y. Compressing neural networks with the hashing trick. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 2285–2294. [Google Scholar]
- Wang, K.; Liu, Z.; Lin, Y.; Lin, J.; Han, S. Haq: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8612–8620. [Google Scholar]
- Kim, H.; Khan, M.U.K.; Kyung, C.M. Efficient neural network compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12569–12577. [Google Scholar]
- Denton, E.L.; Zaremba, W.; Bruna, J.; LeCun, Y.; Fergus, R. Exploiting linear structure within convolutional networks for efficient evaluation. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 1269–1277. [Google Scholar]
- Mirzadeh, S.I.; Farajtabar, M.; Li, A.; Levine, N.; Matsukawa, A.; Ghasemzadeh, H. Improved knowledge distillation via teacher assistant. AAAI Conf. Artif. Intell. 2020, 34, 5191–5198. [Google Scholar] [CrossRef]
- Wang, D.; Li, Y.; Wang, L.; Gong, B. Neural networks are more productive teachers than human raters: Active mixup for data-efficient knowledge distillation from a blackbox model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1498–1507. [Google Scholar]
- Tavakolian, M.; Tavakoli, H.R.; Hadid, A. Awsd: Adaptive weighted spatiotemporal distillation for video representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8020–8029. [Google Scholar]
- Zhao, H.; Wu, J.; Li, Z.; Chen, W.; Zheng, Z. Double Sparse Deep Reinforcement Learning via Multilayer Sparse Coding and Nonconvex Regularized Pruning. IEEE Trans. Cybern. 2022, 1–14. [Google Scholar] [CrossRef] [PubMed]
- Li, Z.; Su, W.; Xu, M.; Yu, R.; Niyato, D.; Xie, S. Compact Learning Model for Dynamic Off-Chain Routing in Blockchain-Based IoT. IEEE J. Sel. Areas Commun. 2022, 40, 3615–3630. [Google Scholar] [CrossRef]
- Chen, C.; Tung, F.; Vedula, N.; Mori, G. Constraint-aware deep neural network compression. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 400–415. [Google Scholar]
- Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288. [Google Scholar] [CrossRef]
- Vettam, S.; John, M. Regularized deep learning with a non-convex penalty. arXiv 2019, arXiv:1909.05142. [Google Scholar]
- Scardapane, S.; Comminiello, D.; Hussain, A.; Uncini, A. Group sparse regularization for deep neural networks. Neurocomputing 2017, 241, 81–89. [Google Scholar] [CrossRef] [Green Version]
- Yoon, J.; Hwang, S.J. Combined group and exclusive sparsity for deep neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 3958–3966. [Google Scholar]
- Louizos, C.; Welling, M.; Kingma, D.P. Learning sparse neural networks through L0 regularization. arXiv 2017, arXiv:1712.01312. [Google Scholar]
- Xiao, X.; Wang, Z. Autoprune: Automatic network pruning by regularizing auxiliary parameters. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
- Liu, Z.; Sun, M.; Zhou, T.; Huang, G.; Darrell, T. Rethinking the value of network pruning. arXiv 2018, arXiv:1810.05270. [Google Scholar]
- He, Y.; Lin, J.; Liu, Z.; Wang, H.; Li, L.J.; Han, S. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 784–800. [Google Scholar]
- Ng, A.Y.; Harada, D.; Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the International Conference on Machine Learning (ICML), 1999; Volume 99, pp. 278–287. [Google Scholar]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
- Gupta, M.; Aravindan, S.; Kalisz, A.; Chandrasekhar, V.; Jie, L. Learning to Prune Deep Neural Networks via Reinforcement Learning. arXiv 2020, arXiv:2007.04756. [Google Scholar]
- Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
- Hasselt, H. Double Q-learning. Adv. Neural Inf. Process. Syst. 2010, 23, 2613–2621. [Google Scholar]
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
- Pizzocaro, F.; Torreggiani, D.; Gilardi, G. Inhibition of apple polyphenoloxidase (PPO) by ascorbic acid, citric acid and sodium chloride. J. Food Process. Preserv. 1993, 17, 21–30. [Google Scholar] [CrossRef]
- Qiu, S.; Yang, Z.; Ye, J.; Wang, Z. On finite-time convergence of actor-critic algorithm. IEEE J. Sel. Areas Inf. Theory 2021, 2, 652–664. [Google Scholar] [CrossRef]
- Wen, J.; Kumar, S.; Gummadi, R.; Schuurmans, D. Characterizing the gap between actor-critic and policy gradient. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 11101–11111. [Google Scholar]
- Li, L.; Li, D.; Song, T.; Xu, X. Actor–Critic Learning Control With Regularization and Feature Selection in Policy Gradient Estimation. IEEE Trans. Neural Networks Learn. Syst. 2020, 32, 1217–1227. [Google Scholar] [CrossRef]
Methods | Error (%) | Layers | NCR | FLOPS (%) |
---|---|---|---|---|
CGES [26] | 0.36 | 221-97-94 | 2.87 | 35 |
L0 norm [27] | 0.23 | 266-88-33 | 3.06 | 33 |
AutoPrune [28] | 0.22 | 244-85-37 | 3.23 | 31 |
AMC [30] | 0.30 | 225-88-45 | 3.30 | 30 |
ACNN | 0.20 | 219-81-43 | 3.46 | 28 |
Methods | Error (%) | NCR | Time (s) | DRL SPA (%) |
---|---|---|---|---|
AMC [30] | 0.30 ± 0.12 | 3.30 ± 0.20 | 1835.00 ± 15.72 | 0.00 ± 0.00 |
ACNN | 0.20 ± 0.16 | 3.46 ± 0.14 | 455.00 ± 2.88 | 96.00 ± 0.74 |
Methods | Error (%) | Layers | NCR | FLOPS (%) |
---|---|---|---|---|
CGES [26] | 13.50 | 702-15-7 | 1.64 | 60 |
L0 norm [27] | 6.25 | 295-80-25 | 2.96 | 34 |
AutoPrune [28] | 4.33 | 285-85-19 | 3.04 | 32 |
AMC [30] | 2.40 | 141-81-42 | 4.48 | 23 |
ACNN | 2.22 | 185-35-20 | 4.92 | 20 |
Methods | Error (%) | NCR | Time (s) | DRL SPA (%) |
---|---|---|---|---|
AMC [30] | 2.40 ± 0.10 | 4.48 ± 0.05 | 4210.00 ± 11.40 | 0.00 ± 0.00 |
ACNN | 2.22 ± 0.08 | 4.92 ± 0.04 | 544.00 ± 4.15 | 97.00 ± 0.50 |
Methods | Error (%) | Layers | NCR | FLOPS (%) |
---|---|---|---|---|
CGES [26] | 28.70 | 3035-142-3 | 1.61 | 62 |
L0 norm [27] | 26.04 | 1036-260-290 | 3.23 | 31 |
AutoPrune [28] | 24.04 | 900-240-282 | 3.60 | 28 |
AMC [30] | 7.63 | 890-317-235 | 3.55 | 27 |
ACNN | 7.10 | 857-256-245 | 3.77 | 25 |
Methods | Error (%) | NCR | Time (s) | DRL SPA (%) |
---|---|---|---|---|
AMC [30] | 7.63 ± 0.65 | 3.55 ± 0.25 | 4157.00 ± 11.30 | 0.00 ± 0.00 |
ACNN | 7.10 ± 0.26 | 3.77 ± 0.21 | 756.00 ± 5.94 | 97.00 ± 0.60 |
Dataset | Method | Error (%) | NCR |
---|---|---|---|
MNIST | ACNN without proximal gradient | 0.21 ± 0.08 | 3.25 ± 0.07 |
MNIST | ACNN | 0.20 ± 0.16 | 3.46 ± 0.14 |
Fashion MNIST | ACNN without proximal gradient | 2.75 ± 0.62 | 3.62 ± 0.22 |
Fashion MNIST | ACNN | 2.22 ± 0.08 | 4.92 ± 0.04 |
CIFAR-10 | ACNN without proximal gradient | 7.71 ± 0.35 | 3.60 ± 0.06 |
CIFAR-10 | ACNN | 7.10 ± 0.26 | 3.77 ± 0.21 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wang, M.; Tang, J.; Zhao, H.; Li, Z.; Xie, S. Automatic Compression of Neural Network with Deep Reinforcement Learning Based on Proximal Gradient Method. Mathematics 2023, 11, 338. https://doi.org/10.3390/math11020338