Stagewise Accelerated Stochastic Gradient Methods for Nonconvex Optimization
Abstract
1. Introduction
1.1. Related Works
1.2. Contributions
- Incorporating the SSTS into the iterative framework of RSAG, we propose the stagewise RSAG, abbreviated as S-RSAG, and bound its iteration complexity for finding an ϵ-stationary point of a nonconvex objective and an ϵ-minimizer of a convex objective. These complexities are significantly reduced relative to those of its non-stagewise counterpart RSAG; here μ denotes the PL modulus and L the Lipschitz constant appearing in the bounds. Compared with the existing stagewise algorithm S-SGD, the complexities of S-RSAG are superior under the convex condition and at the same level under the nonconvex condition. (A generic sketch of the stagewise stepsize tuning strategy is given after this list.)
- With the same methodology, we propose the stagewise SVRG, abbreviated as S-SVRG, and bound its iteration complexity for finding an ϵ-stationary point of a nonconvex objective and an ϵ-minimizer of a convex objective, where m is an arbitrary constant denoting the number of inner iterations used for variance reduction (VR). It is worth mentioning that the iteration complexities of S-SVRG are significantly superior to those of both its non-stagewise counterpart SVRG and the existing stagewise algorithm S-SGD, under both convex and nonconvex conditions.
- We also numerically evaluate our algorithms through experiments on a number of benchmark datasets; the obtained results are consistent with our theoretical findings.
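To make the stagewise stepsize tuning strategy concrete, the following is a minimal Python sketch of a generic stagewise wrapper, not the paper's exact Algorithm 1 or 2: the stepsize is held constant within a stage, decreased geometrically between stages, and each stage warm-starts the inner stochastic solver from the previous stage's output. All names and default values (`inner_solver`, `eta0`, `decay`, `iters_per_stage`) are illustrative assumptions.

```python
def stagewise_training(inner_solver, x0, num_stages=10, eta0=0.1,
                       decay=0.5, iters_per_stage=1000):
    """Generic stagewise wrapper (illustrative sketch, not the paper's
    exact algorithm): run an inner stochastic solver with a fixed
    stepsize for one stage, then shrink the stepsize and warm-start
    the next stage from the stage output."""
    x, eta = x0, eta0
    for stage in range(num_stages):
        # The inner solver (e.g., an SGD-, RSAG-, or SVRG-type routine)
        # performs `iters_per_stage` iterations with constant stepsize `eta`.
        x = inner_solver(x, stepsize=eta, num_iters=iters_per_stage)
        eta *= decay  # geometric stepsize decrease between stages
    return x
```

Keeping the stepsize constant within each stage and shrinking it only between stages is what distinguishes this schedule from the polynomially decaying per-iteration stepsizes commonly used with plain SGD.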
2. Notions and Preliminaries
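The formal definitions did not survive extraction in this excerpt; as a reference point, the following LaTeX block records the standard forms of the notions the analysis relies on. The exact constants and expectation conventions used by the authors may differ.

```latex
% Standard notions (stated in their usual form; the paper's exact conventions may differ).
\begin{itemize}
  \item $F$ is $L$-Lipschitz continuously differentiable (l.c.d.) if
        $\|\nabla F(x) - \nabla F(y)\| \le L\|x - y\|$ for all $x, y$.
  \item $F$ satisfies the Polyak--\L{}ojasiewicz (PL) condition with modulus $\mu > 0$ if
        $\tfrac{1}{2}\|\nabla F(x)\|^2 \ge \mu\,(F(x) - F^\ast)$ for all $x$,
        where $F^\ast = \min_x F(x)$.
  \item A (random) point $\bar{x}$ is an $\epsilon$-stationary point if
        $\mathbb{E}\|\nabla F(\bar{x})\|^2 \le \epsilon$.
  \item A (random) point $\bar{x}$ is an $\epsilon$-minimizer if
        $\mathbb{E}[F(\bar{x}) - F^\ast] \le \epsilon$.
\end{itemize}
```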
3. Stagewise Accelerated Algorithms Development
3.1. Stagewise RSAG Development
Algorithm 1 Stagewise RSAG (S-RSAG).
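The body of Algorithm 1 is not reproduced in this excerpt. As a rough illustration of the RSAG-type inner update that S-RSAG builds on (following the structure of Ghadimi and Lan's accelerated scheme [19], not necessarily the authors' exact pseudocode), one stage could look like the Python sketch below; `stoch_grad` and the parameter sequences `alphas`, `lams`, `betas` are placeholders.

```python
def rsag_stage(x0, stoch_grad, alphas, lams, betas):
    """One stage of an RSAG-type accelerated update (illustrative sketch).

    x0         : initial point (e.g., a NumPy array)
    stoch_grad : callable returning an unbiased stochastic gradient at a point
    alphas, lams, betas : per-iteration parameter sequences of equal length
    """
    x = x_ag = x0
    for alpha, lam, beta in zip(alphas, lams, betas):
        x_md = (1.0 - alpha) * x_ag + alpha * x   # momentum (middle) point
        g = stoch_grad(x_md)                      # stochastic gradient at x_md
        x = x - lam * g                           # update of the base sequence
        x_ag = x_md - beta * g                    # update of the accelerated sequence
    return x, x_ag
```

In a stagewise variant, the stage output is fed back as the next stage's starting point while the stepsize-type parameters are decreased between stages.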
3.2. Theoretical Aspects of S-RSAG
1. If the parameters and the probability mass function of the randomly returned iterate are chosen as specified, we can find an ϵ-stationary point (a point at which the gradient of F is sufficiently small in expectation) by performing Algorithm 1 at most the claimed number of times.
2. If we further assume that F is convex and the parameters and the probability mass function of the returned iterate are chosen as specified, we can find an ϵ-minimizer (a point whose objective value is within ϵ of the minimum in expectation) within the claimed iteration bound of Algorithm 1.
3.3. Stagewise SVRG Development
Algorithm 2 Stagewise SVRG (S-SVRG).
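The body of Algorithm 2 is likewise not reproduced here. For orientation, the sketch below shows a standard SVRG-style epoch built around the variance-reduced gradient estimator [20,21]; it is illustrative rather than the authors' exact Algorithm 2, and `grad_i` and all parameter names are assumptions.

```python
import numpy as np

def svrg_epoch(x_ref, grad_i, n, stepsize, num_inner):
    """One SVRG-style epoch (illustrative sketch).

    x_ref     : snapshot (reference) point, a NumPy array
    grad_i    : grad_i(x, i) returns the gradient of the i-th component at x
    n         : number of component functions
    stepsize  : constant stepsize used within the epoch
    num_inner : number of inner iterations m
    """
    # Full gradient at the snapshot point, computed once per epoch.
    full_grad = sum(grad_i(x_ref, i) for i in range(n)) / n
    x = x_ref.copy()
    for _ in range(num_inner):
        i = np.random.randint(n)
        # Variance-reduced gradient estimator: unbiased, and its variance
        # shrinks as x approaches the snapshot x_ref.
        v = grad_i(x, i) - grad_i(x_ref, i) + full_grad
        x = x - stepsize * v
    return x
```

A stagewise version would wrap such epochs in outer stages, refreshing the snapshot and reducing the stepsize between stages.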
3.4. Theoretical Aspects of S-SVRG
1. If the parameters and the probability mass function of the randomly returned iterate are chosen as specified, we can find an ϵ-stationary point of F by performing Algorithm 2 at most the claimed number of times.
2. If we further assume that F is convex and the parameters and probability mass functions are chosen as specified, we can find an ϵ-minimizer of F within the claimed iteration bound of Algorithm 2.
4. Numerical Experiments
4.1. Learning DNNs
- MNIST: This dataset contains grayscale images from ten digit classes. To improve learning efficiency, we load 10 samples per batch. We use 60,000 (6000 × 10) images for training, and the remaining 10,000 for testing. We adopt a 4-layer MLP network (an illustrative sketch of such an MLP follows this list).
- CIFAR-10: This dataset contains color images from ten object classes. Similarly, we load 4 samples per batch. We use 50,000 (12,500 × 4) images for training, and the remaining 10,000 for testing. We adopt a VGG-like architecture.
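The exact layer widths of the MLP are not given in this excerpt; the following PyTorch sketch is one plausible instantiation of a 4-layer ReLU MLP with a 10-class softmax output for 28×28 MNIST inputs (the hidden width of 256 is an assumption).

```python
import torch.nn as nn

class FourLayerMLP(nn.Module):
    """A 4-layer ReLU MLP with a 10-class output, in the spirit of the
    FC/SF layers listed in the Abbreviations; hidden widths are illustrative."""
    def __init__(self, in_dim=28 * 28, hidden=256, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),  # softmax is applied inside the cross-entropy loss
        )

    def forward(self, x):
        return self.net(x)
```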
4.1.1. S-RSAG and S-SVRG vs. Their Non-Stagewise Counterparts
4.1.2. S-RSAG and S-SVRG vs. Other Methods
4.2. Logistic Regression
- REAL-SIM: This dataset contains 72,309 data points with 20,958 features from two classes. We divide it into two sets: one for training and the other for testing.
- RCV1: This dataset contains 20,242 data points with 47,236 features from two classes. Similarly, we split it evenly into two sets: one for training and the other for testing. (A standard form of the logistic regression objective is recalled after this list.)
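For reference, a standard ℓ2-regularized logistic regression objective over n samples (a_i, b_i) with labels b_i ∈ {−1, +1} takes the form below; the exact regularizer used in the paper may differ (for instance, a nonconvex regularizer would place the problem in the nonconvex setting).

```latex
% Standard l2-regularized logistic regression; the paper's exact regularizer may differ.
\min_{x \in \mathbb{R}^d} \; F(x)
  = \frac{1}{n} \sum_{i=1}^{n} \log\!\bigl(1 + \exp(-b_i\, a_i^{\top} x)\bigr)
  + \frac{\lambda}{2}\,\|x\|_2^2
```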
4.2.1. S-RSAG and S-SVRG vs. Their Non-Stagewise Counterparts
4.2.2. S-RSAG and S-SVRG vs. S-SGD
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
GD | gradient descent |
SGD | stochastic gradient descent |
SSTS | stagewise stepsize tuning strategy |
DNN | deep neural network |
VR | variance reduction |
RSAG | randomized stochastic accelerated gradient |
SVRG | stochastic variance reduced gradient |
S-SGD | stagewise stochastic gradient descent |
S-RSAG | stagewise randomized stochastic accelerated gradient |
S-SVRG | stagewise stochastic variance reduced gradient |
l.c.d. | Lipschitz continuously differentiable |
PL | Polyak–Łojasiewicz |
MLP | multilayer perceptron |
VGG | Visual Geometry Group |
FC | ReLU fully connected layer |
SF | softmax output layer |
Appendix A
Appendix A.1. Proof of Theorem 1
Appendix A.2. Proof of Theorem 2
References
- Neter, J.; Kutner, M.H.; Nachtsheim, C.J.; Wasserman, W. Applied Linear Statistical Models; Irwin: Chicago, IL, USA, 1996; Volume 4.
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436.
- Kushner, H.J.; Yin, G.G. Stochastic Approximation and Recursive Algorithms and Applications; Springer Science & Business Media: New York, NY, USA, 2003; Volume 35.
- Bottou, L. Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade, Reloaded; Springer: Berlin/Heidelberg, Germany, 2012; pp. 421–436.
- He, W.; Liu, Y. To regularize or not: Revisiting SGD with simple algorithms and experimental studies. Expert Syst. Appl. 2018, 112, 1–14.
- He, W.; Kwok, J.T.; Zhu, J.; Liu, Y. A Note on the Unification of Adaptive Online Learning. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 1178–1191.
- Bottou, L.; Bousquet, O. The Tradeoffs of Large Scale Learning. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 8–10 December 2008; pp. 161–168.
- Bottou, L.; Curtis, F.E.; Nocedal, J. Optimization Methods for Large-Scale Machine Learning. SIAM Rev. 2018, 60, 223–311.
- Polyak, B.T. Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 1964, 4, 1–17.
- Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/k²). Sov. Math. Dokl. 1983, 27, 372–376.
- Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course; Kluwer: Boston, MA, USA, 2004.
- Nesterov, Y. Smooth minimization of non-smooth functions. Math. Program. 2005, 103, 127–152.
- Auslender, A.; Teboulle, M. Interior Gradient and Proximal Methods for Convex and Conic Optimization. SIAM J. Optim. 2006, 16, 697–725.
- Nesterov, Y. Primal-dual subgradient methods for convex problems. Math. Program. 2009, 120, 221–259.
- Lan, G.; Lu, Z.; Monteiro, R.D.C. Primal-dual first-order methods with O(1/ϵ) iteration-complexity for cone programming. Math. Program. 2011, 126, 1–29.
- Ghadimi, S.; Lan, G. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming. SIAM J. Optim. 2013, 23, 2341–2368.
- Sutskever, I.; Martens, J.; Dahl, G.E.; Hinton, G.E. On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1139–1147.
- Ochs, P.; Chen, Y.; Brox, T.; Pock, T. iPiano: Inertial Proximal Algorithm for Nonconvex Optimization. SIAM J. Imaging Sci. 2014, 7, 1388–1419.
- Ghadimi, S.; Lan, G. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 2016, 156, 59–99.
- Johnson, R.; Zhang, T. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 315–323.
- Reddi, S.J.; Hefny, A.; Sra, S.; Poczos, B.; Smola, A. Stochastic Variance Reduction for Nonconvex Optimization. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 314–323.
- Shang, F.; Zhou, K.; Liu, H.; Cheng, J.; Tsang, I.W.; Zhang, L.; Tao, D.; Jiao, L. VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning. IEEE Trans. Knowl. Data Eng. 2020, 32, 188–202.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1106–1114.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Cui, Z.X.; Fan, Q. A “Nonconvex + Nonconvex” approach for image restoration with impulse noise removal. Appl. Math. Model. 2018, 62, 254–271.
- Fan, Q.; Jia, C.; Liu, J.; Luo, Y. Robust recovery in 1-bit compressive sensing via Lq-constrained least squares. Signal Process. 2021, 179, 107822.
- Xu, Y.; Lin, Q.; Yang, T. Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence. In Proceedings of the International Conference on Machine Learning, Ningbo, China, 9–12 July 2017; pp. 3821–3830.
- Hardt, M.; Ma, T. Identity matters in deep learning. arXiv 2016, arXiv:1611.04231.
- Xie, B.; Liang, Y.; Song, L. Diversity leads to generalization in neural networks. arXiv 2016, arXiv:1611.03131v2.
- Li, Y.; Yuan, Y. Convergence analysis of two-layer neural networks with ReLU activation. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 597–607.
- Zhou, Y.; Liang, Y. Characterization of gradient dominance and regularity conditions for neural networks. arXiv 2017, arXiv:1710.06910.
- Charles, Z.; Papailiopoulos, D.S. Stability and Generalization of Learning Algorithms that Converge to Global Optima. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 744–753.
- Arjevani, Y.; Carmon, Y.; Duchi, J.C.; Foster, D.J.; Srebro, N.; Woodworth, B. Lower bounds for non-convex stochastic optimization. Math. Program. 2023, 199, 165–214.
- Horváth, S.; Lei, L.; Richtárik, P.; Jordan, M.I. Adaptivity of stochastic gradient methods for nonconvex optimization. SIAM J. Math. Data Sci. 2022, 4, 634–648.
- Wang, Z.; Zhang, J.; Chang, T.H.; Li, J.; Luo, Z.Q. Distributed stochastic consensus optimization with momentum for nonconvex nonsmooth problems. IEEE Trans. Signal Process. 2021, 69, 4486–4501.
- Yuan, K.; Ying, B.; Zhao, X.; Sayed, A.H. Exact Diffusion for Distributed Optimization and Learning—Part I: Algorithm Development. IEEE Trans. Signal Process. 2019, 67, 708–723.
- Yuan, Z.; Yan, Y.; Jin, R.; Yang, T. Stagewise training accelerates convergence of testing error over SGD. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 2604–2614.
- Bolte, J.; Nguyen, T.P.; Peypouquet, J.; Suter, B.W. From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. 2017, 165, 471–507.
- Karimi, H.; Nutini, J.; Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Riva del Garda, Italy, 19–23 September 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 795–811.
- Allen-Zhu, Z. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods. J. Mach. Learn. Res. 2018, 18, 1–51.
Methods | PL Condition | Generally Convex | Nonconvex |
---|---|---|---|
SGD [16] | | | |
SGD [39] | ✓ | | |
RSAG [19] | | | |
SVRG [21] | | | |
S-SGD [37] | ✓ | | |
S-RSAG | ✓ | | |
S-SVRG | ✓ | | |
Methods | SGD | RSAG | SVRG | S-SGD | Katyusha | S-RSAG | S-SVRG |
---|---|---|---|---|---|---|---|
MLP on MNIST | 247.57 | 685.94 | 292.72 | 293.61 | 435.22 | 293.15 | 326.78 |