Stochastic Control for Bayesian Neural Network Training
Abstract
1. Introduction
- We derive, from first principles, the stochastic differential equation that governs the evolution of the parameters of variational distributions trained with variational inference, and we decompose the uncertainty of the gradients into its aleatoric and epistemic components (a schematic form is sketched after this list).
- We derive a stochastic optimal control algorithm that incorporates the gradient uncertainty to optimally control the learning rate of each variational parameter.
- The evolution of the control exhibits distinct dynamical behaviour, with different fluctuation and dissipation regimes for the variational mean and uncertainty parameters.
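As a rough pointer to the objects involved (the paper's own derivation appears in Section 2 and Appendices C and D), the display below pairs the standard diffusion approximation of SGD with a law-of-total-variance split of the stochastic ELBO gradient. The notation (learning rate $\eta$, gradient-noise covariance $\Sigma$, minibatch $B$, reparameterization noise $\varepsilon$) and the identification of the two terms with the aleatoric and epistemic components are shorthand introduced here and need not match the paper's exact definitions.

$$
d\theta_t \;=\; -\,\nabla L(\theta_t)\,dt \;+\; \sqrt{\eta\,\Sigma(\theta_t)}\,dW_t ,
\qquad
\operatorname{Var}\big[\hat g\big]
\;=\;
\underbrace{\mathbb{E}_{\varepsilon}\!\left[\operatorname{Var}_{B}\!\big(\hat g \mid \varepsilon\big)\right]}_{\text{minibatch noise}}
\;+\;
\underbrace{\operatorname{Var}_{\varepsilon}\!\left[\mathbb{E}_{B}\!\big(\hat g \mid \varepsilon\big)\right]}_{\text{posterior-sampling noise}} .
$$

Here $\hat g$ is a single stochastic estimate of the ELBO gradient, $B$ the random minibatch, and $\varepsilon$ the reparameterization noise used to sample weights from the variational posterior; the first term is naturally read as the aleatoric (data-subsampling) contribution and the second as the epistemic (weight-uncertainty) contribution.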
2. Variational Inference for Bayesian Neural Networks
2.1. Stochastic Differential Equations for Frequentist Models
2.2. Stochastic Differential Equations for Bayesian Models
3. Stochastic Control for Learning Rates
3.1. Simplifying the Loss
3.2. Our Control Problem
Algorithm 1: StochControlSGD
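Algorithm 1 itself is not reproduced in this outline. Purely as a hypothetical sketch of the general idea of noise-aware per-parameter learning rates, and explicitly not as the authors' StochControlSGD, the snippet below damps each coordinate's step size by an estimated signal-to-noise ratio of its stochastic gradient; the class name, the EMA statistics, and the damping rule are all assumptions made for illustration.

```python
import numpy as np


class ControlledLRSGD:
    """Hypothetical per-parameter learning-rate controller.

    NOTE: this is NOT the paper's Algorithm 1; it only illustrates damping
    each coordinate's step size by an estimate of its gradient noise.
    """

    def __init__(self, eta_max=1e-2, beta=0.9, eps=1e-12):
        self.eta_max = eta_max  # upper bound on the per-coordinate learning rate
        self.beta = beta        # EMA factor for the gradient statistics
        self.eps = eps          # numerical safeguard
        self.m = None           # running mean of the gradient
        self.v = None           # running mean of the squared gradient

    def step(self, params, grad):
        if self.m is None:
            self.m = np.zeros_like(grad)
            self.v = np.zeros_like(grad)
        self.m = self.beta * self.m + (1.0 - self.beta) * grad
        self.v = self.beta * self.v + (1.0 - self.beta) * grad ** 2
        noise = np.maximum(self.v - self.m ** 2, 0.0)         # gradient-variance estimate
        snr = self.m ** 2 / (self.m ** 2 + noise + self.eps)  # signal-to-noise ratio in [0, 1]
        eta = self.eta_max * snr                              # noisy coordinates take smaller steps
        return params - eta * grad
```

In use, one such controller would be kept per variational parameter tensor (for instance one for the means and one for the log-scales), with `step` called on the gradient after every minibatch.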
4. Experiments
Behaviour of Control Parameter
5. Related Work
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Evidence Lower Bound
Appendix B. Ito’s Lemma
Appendix C. Bayesian Stochastic Differential Equation of a Variational Distribution
Appendix D. Gradient Derivations
Appendix E. Stochastic Control
Appendix F. Estimation of Local Quadratic Approximation
Appendix G. Experimental Setup
References
Test accuracy for each model and optimizer, grouped by dataset; "/" marks model-optimizer combinations that were not evaluated.

| Model | MNIST | | | | | FMNIST | | | | | CIFAR10 | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | SGD | ADAM | cSGD | scSGD | LRSGD | SGD | ADAM | cSGD | scSGD | LRSGD | SGD | ADAM | cSGD | scSGD | LRSGD |
| NN | 0.959 | 0.987 | 0.961 | / | 0.985 | 0.818 | 0.890 | 0.851 | / | 0.878 | 0.461 | 0.512 | 0.432 | / | 0.499 |
| CNN | 0.989 | 0.993 | 0.981 | / | 0.990 | 0.904 | 0.918 | 0.912 | / | 0.907 | 0.853 | 0.865 | 0.857 | / | 0.855 |
| BNN (Normal) | 0.956 | 0.963 | 0.970 | 0.971 | 0.069 | 0.865 | 0.870 | 0.876 | 0.900 | 0.900 | 0.441 | 0.442 | 0.451 | 0.471 | 0.462 |
| CBNN (Normal) | 0.982 | 0.988 | 0.982 | 0.990 | 0.989 | 0.869 | 0.914 | 0.903 | 0.921 | 0.915 | 0.615 | 0.854 | 0.836 | 0.853 | 0.801 |
| BNN (Laplace) | 0.976 | 0.978 | 0.974 | 0.977 | 0.975 | 0.890 | 0.875 | 0.903 | 0.901 | 0.900 | 0.501 | 0.452 | 0.461 | 0.479 | 0.500 |
| CBNN (Laplace) | 0.989 | 0.987 | 0.985 | 0.991 | 0.989 | 0.899 | 0.916 | 0.907 | 0.918 | 0.912 | 0.627 | 0.857 | 0.829 | 0.857 | 0.853 |
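For context on the model rows: "BNN (Normal)" and "BNN (Laplace)" presumably denote Bayesian networks with mean-field Normal and Laplace variational posteriors, with "C" marking the convolutional variants. The sketch below shows a minimal mean-field Normal linear layer using the reparameterization trick; it illustrates the layer type only and is not the architecture or initialization used in the experiments (the softplus parameterization and the initial value of `rho` are assumptions), and a Laplace variant would draw `eps` from a Laplace distribution instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MeanFieldLinear(nn.Module):
    """Minimal mean-field Normal linear layer (illustration only)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        # Variational parameters: mean and (pre-softplus) scale of each weight.
        self.mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.rho = nn.Parameter(torch.full((out_features, in_features), -5.0))

    def forward(self, x):
        sigma = F.softplus(self.rho)   # positive standard deviation
        eps = torch.randn_like(sigma)  # reparameterization noise
        w = self.mu + sigma * eps      # one weight sample per forward pass
        return F.linear(x, w)
```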