Article

When Will a Sequence of Points in a Riemannian Submanifold Converge?

by
Tuyen Trung Truong
Department of Mathematics, University of Oslo, Blindern, 0851 Oslo, Norway
Mathematics 2020, 8(11), 1934; https://doi.org/10.3390/math8111934
Submission received: 26 August 2020 / Revised: 16 October 2020 / Accepted: 19 October 2020 / Published: 3 November 2020
(This article belongs to the Special Issue Riemannian Geometry of Submanifolds)

Abstract

Let X be a Riemannian manifold and $\{x_n\}$ a sequence of points in X. Assume that we know a priori some properties of the set A of cluster points of $\{x_n\}$. The question is under what conditions $\{x_n\}$ will converge. An answer to this question serves to understand the convergence behaviour of iterative algorithms for (constrained) optimisation problems, with many applications such as in Deep Learning. We will explore this question, and show by some examples that having X be a submanifold (more generally, a metric subspace) of a good Riemannian manifold (even in infinite dimensions) can greatly help.

In this paper, we will explain how the geometry of submanifolds of $\mathbb{R}^k$ is useful for optimisation problems in Deep Learning, and we explore similar properties for other manifolds.

1. Motivation

Optimisation is important in many aspects of engineering and computer science. For a modern example, one can mention deep neural networks, which can effectively solve several tasks (image/video classification, natural language processing and so on) that posed enormous challenges for the old paradigm of rule-based learning. For deep neural networks to work, one has to solve large-scale and non-convex optimisation problems. For example, modern state-of-the-art deep neural network architectures give rise to optimisation problems (finding minima) in hundreds of millions of variables.
For an explicit example, we consider the task of recognising handwritten digits, useful for example when scanning postal packages. A well-known dataset is MNIST [1]. A sample is shown in Figure 1; several of the digits can be challenging even for human beings. This task is extremely difficult for the old paradigm of rule-based learning, but is considered a simple task for deep neural networks. For this task, one can use a “simple” deep neural network, which gives rise to an optimisation problem in about 12,000 variables.
Because the associated optimisation problems are large-scale and non-convex, one must rely on iterative numerical methods. One would like the method to guarantee convergence to local minima. This can be divided into two steps. First, show that the method converges, and then show that the limit point is a local minimum. Since saddle points are dominant for functions in higher dimensions [2,3], for the second step it is important to guarantee that the limit point is not a saddle point.
It could be considered a lucky fact that Gradient Descent (GD) methods have been used in deep neural networks from the beginning. At first, this could be because GD is easy to implement and is not costly to run in large-scale optimisation. Then, even though it dates back more than 170 years [4], only gradually (with some results announced only very recently) has it been shown that GD has good properties: it can avoid saddle points [5,6]. While its standard version does not guarantee convergence to critical points, its Backtracking version [7] does [8,9,10] (the latter paper consists of the more experimental part of arXiv:1808.05160, in combination with arXiv:2001.02005 and arXiv:2007.03618), and can be implemented in deep neural networks with very good performance on the CIFAR10 and CIFAR100 image datasets [10,11]. Some further modifications of Backtracking GD can avoid saddle points as well [12,13].

2. Convergence Results

We will, for the remainder of this section, discuss convergence to critical points for GD methods, a discussion which makes use of special geometrical properties. First, we recall in detail the update rule in GD. We consider a function $f:\mathbb{R}^k\to\mathbb{R}$, which is assumed to be $C^1$. We want to find minima of f. One starts from a random initial point $x_0$, and constructs a sequence $x_{n+1}=x_n-\delta_n\nabla f(x_n)$, where $\delta_n>0$ is an appropriate number. There are many ways to choose $\delta_n$. In the Standard GD scheme, one chooses $\delta_n$ to be a constant $\delta_0$. A disadvantage of Standard GD is that it does not guarantee convergence; to have good behaviour one must assume that f is in $C^{1,1}_L$, that is, $\nabla f$ is globally Lipschitz continuous with Lipschitz constant L, and further assume that $\delta_0$ is of the order of $1/L$. There are many popular modifications trying to overcome this, such as Adam, Adadelta, Nesterov Accelerated Gradient, Momentum and so on (see [14] for a review); none of these are guaranteed to converge in general either. To date, only Backtracking GD is guaranteed to converge: see Chapter 12 in [9], in particular Proposition 12.6.1 there, for the case where $f\in C^{1,1}_L$ has compact sublevels and at most countably many critical points; see [8] when f is real analytic (or, more generally, satisfies the so-called Łojasiewicz gradient inequality); and see [10] for the general case of f being only $C^1$ with at most countably many critical points. Note that the assumption in the last paper is not too restrictive: indeed, it is known from transversality results that such an assumption is satisfied by a generic $C^1$ function (for example, by Morse functions, which are a well-known class of functions in geometry and analysis). Since the real analyticity assumption in [8] is quite special, we will not discuss it below, trying to provide only the most general ideas.
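For concreteness, the following is a minimal Python/NumPy sketch of the Backtracking GD update with Armijo's condition [7] (used in the argument below); the function name, the parameter names (delta0, alpha, beta) and the default values are illustrative choices of ours, not prescriptions from the cited papers.

```python
import numpy as np

def backtracking_gd(f, grad_f, x0, delta0=1.0, alpha=0.5, beta=0.5,
                    n_iter=10000, tol=1e-10):
    """Sketch of Backtracking GD: at each step, shrink the learning rate delta
    until Armijo's condition
        f(x - delta * grad) <= f(x) - alpha * delta * ||grad||^2
    holds, then update x := x - delta * grad."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:          # (approximately) a critical point
            break
        delta = delta0
        while f(x - delta * g) > f(x) - alpha * delta * np.dot(g, g):
            delta *= beta                    # backtrack: shrink the step size
        x = x - delta * g
    return x
```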
Both [9,10] start from the following property: if $\{x_n\}$ is constructed as above, and $\{x_{n_j}\}$ is a convergent subsequence, then $\lim_{j\to\infty}\nabla f(x_{n_j})=0$. This is classically known (see [15]); the main idea is as follows: if $\lim_{j\to\infty}x_{n_j}=x$ and $\nabla f(x)\neq 0$, then $\liminf_{j\to\infty}\delta_{n_j}>0$. The latter leads to a contradiction when Armijo's condition is taken into account.
Now, one needs also the following property:
Property 1.
Either $\lim_{n\to\infty} f(x_n)=-\infty$, or $\lim_{n\to\infty}\|x_{n+1}-x_n\|=0$.
In the case where f has compact sublevels, this is easily proven [9]. For the general case, see [10] for a proof.
Now, one also needs a special property of compact metric spaces [16]. We recall that, given a sequence $\{x_n\}$, its set of cluster points consists of the points x for which there is a subsequence $\{x_{n_j}\}$ with $\lim_{j\to\infty}x_{n_j}=x$.
Theorem 1.
Let $(X,d)$ be a compact metric space. If $\{x_n\}\subset X$ is a sequence such that $\lim_{n\to\infty}d(x_{n+1},x_n)=0$, then the set of cluster points of $\{x_n\}$ is connected.
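A simple illustration of Theorem 1 (our own example, not taken from [16]): the sequence $x_n=(\cos\sqrt{n},\sin\sqrt{n})$ on the unit circle satisfies $d(x_{n+1},x_n)\to 0$ but does not converge, and its set of cluster points is the whole circle, which is indeed connected. A quick numerical check:

```python
import numpy as np

# x_n = (cos(sqrt(n)), sin(sqrt(n))) lies on the compact unit circle.
# Consecutive distances tend to 0, yet the sequence does not converge:
# its cluster set is the whole (connected) circle, as Theorem 1 predicts.
n = np.arange(1, 200001)
x = np.column_stack((np.cos(np.sqrt(n)), np.sin(np.sqrt(n))))

steps = np.linalg.norm(np.diff(x, axis=0), axis=1)
print(steps[-1])                              # ~ 1/(2*sqrt(n)), tends to 0

angles = np.sqrt(n) % (2 * np.pi)
hist, _ = np.histogram(angles, bins=360, range=(0, 2 * np.pi))
print(np.all(hist > 0))                       # every direction is visited: True
```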
In the setting of [9], one can finish the proof of convergence as follows. Let $X=\{x\in\mathbb{R}^k:\,f(x)\le f(x_0)\}$. Since f is assumed to have compact sublevels, it follows that X is compact. Let d be the restriction to X of the usual metric on $\mathbb{R}^k$; then $(X,d)$ is a compact metric space. Since we know that $\lim_{n\to\infty}d(x_n,x_{n+1})=0$ for the constructed sequence, we can apply Theorem 1 to the sequence $\{x_n\}$, and obtain that the set D of cluster points of $\{x_n\}$ is connected. Since we also know that D must be contained in the set of critical points $C=\{x:\nabla f(x)=0\}$, and by assumption C is countable, it follows that D is also countable. Since a countable and connected set must be either empty or a single point, it follows that either $\lim_{n\to\infty}\|x_n\|=\infty$ (the first case) or $\{x_n\}$ converges. Note that in this setting, since $\{x_n\}$ is bounded, only the second case can happen, that is, $\{x_n\}$ converges.
In the general case, the above proof does not go through, since the set X above may not be bounded. In [10], a way around this is as follows. We let $(\mathbb{P}^k,d)$ be the real projective space with its canonical metric (the spherical metric). We let $(\mathbb{R}^k,\|\cdot\|)$ be $\mathbb{R}^k$ with the usual Euclidean metric, and let $(\mathbb{R}^k,d)$ be $\mathbb{R}^k$ with the metric restricted from d via the standard inclusion $\mathbb{R}^k\subset\mathbb{P}^k$. Then, topologically, the two spaces $(\mathbb{R}^k,\|\cdot\|)$ and $(\mathbb{R}^k,d)$ are homeomorphic. In particular, convergence properties of a sequence $\{x_n\}$ in $(\mathbb{R}^k,\|\cdot\|)$ can be translated to those of the same sequence considered in $(\mathbb{R}^k,d)$. Even though the two spaces are not isometric, one can check that if $\lim_{n\to\infty}\|x_{n+1}-x_n\|=0$, then also $\lim_{n\to\infty}d(x_n,x_{n+1})=0$. In addition, $(\mathbb{P}^k,d)$ is a compact metric space. Hence, one can apply Theorem 1 and obtain that, for the constructed sequence $\{x_n\}$ considered in $(\mathbb{P}^k,d)$, the set of cluster points $\overline{D}$ is connected. Since $\overline{D}\cap\mathbb{R}^k=D\subset C$ and C is countable, it follows that if D is non-empty then it is a single point. Hence, we obtain the same conclusions as before. Note that here the case $\lim_{n\to\infty}\|x_n\|=\infty$ can happen, for example for the function $f(x)=x^3$.
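The non-expansion used in this step can also be checked numerically. The sketch below is our own; it assumes that the spherical metric on $\mathbb{P}^k$ is the angle between the lines spanned by the lifted vectors $(x,1)$ and $(y,1)$, which may differ from the normalisation in [10] by a constant factor. It verifies that the projective distance never exceeds the Euclidean one, so small Euclidean steps give small projective steps:

```python
import numpy as np

def proj_dist(x, y):
    """Angle in [0, pi/2] between the lines R*(x,1) and R*(y,1) in R^{k+1},
    used here as the (assumed) spherical metric on P^k restricted to R^k."""
    u = np.append(x, 1.0); u /= np.linalg.norm(u)
    v = np.append(y, 1.0); v /= np.linalg.norm(v)
    return np.arccos(np.clip(abs(np.dot(u, v)), 0.0, 1.0))

rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=3) * 10.0
    y = x + rng.normal(size=3) * 0.1                  # a nearby point
    print(proj_dist(x, y) <= np.linalg.norm(x - y))   # expected: True
```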

3. Riemannian Manifold Optimisation

Now we discuss the general setting of optimisation on manifolds. Let X be a real manifold and $f:X\to\mathbb{R}$. We want to find (local) minima of f. Here one could, as before, try to use Standard GD. To prove good properties for Standard GD, one could, as before, restrict the discussion to $C^{1,1}_L$ functions. However, it is very cumbersome to define a global $C^{1,1}_L$ notion in the manifold setting. It is better to switch to Backtracking GD, which is local in nature, and hence has a natural extension to the manifold setting. We leave it to the readers to state a specific version of Backtracking GD in the manifold setting, by working on small coordinate charts; one possible sketch for the unit sphere is given below.
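For instance, on the unit sphere $S^{k-1}\subset\mathbb{R}^k$ one such variant could look as follows. This is only an illustrative sketch of ours, not the algorithm stated in the cited papers: the Euclidean gradient is projected onto the tangent space, the step is retracted back to the sphere by normalisation, and Armijo's condition is checked along the retracted step.

```python
import numpy as np

def sphere_backtracking_gd(f, grad_f, x0, delta0=1.0, alpha=0.5, beta=0.5,
                           n_iter=500, tol=1e-10):
    """Illustrative Backtracking GD on the unit sphere S^{k-1} in R^k.
    grad_f is the Euclidean gradient of f; its tangential part at x plays
    the role of the Riemannian gradient."""
    x = np.asarray(x0, dtype=float)
    x = x / np.linalg.norm(x)
    retract = lambda y: y / np.linalg.norm(y)     # pull a point back to the sphere
    for _ in range(n_iter):
        g = grad_f(x)
        rg = g - np.dot(g, x) * x                 # Riemannian (tangential) gradient
        if np.linalg.norm(rg) < tol:
            break
        delta = delta0
        while f(retract(x - delta * rg)) > f(x) - alpha * delta * np.dot(rg, rg):
            delta *= beta                         # backtrack until Armijo holds
        x = retract(x - delta * rg)
    return x
```

As a usage example, applied to $f(x)=x^\top A x$ for a symmetric matrix A and a random starting point, this sketch typically converges to a unit eigenvector corresponding to the smallest eigenvalue of A, a classical test problem for optimisation on manifolds.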
Taking the ideas from the proof in [10], as presented above, we obtain the following general result. We say that a sequence $\{x_n\}$ in a metric space X diverges to infinity if $\{x_n\}$ eventually leaves every bounded subset of X.
Theorem 2.
Let X be a Riemannian manifold, with the induced metric d. Assume that there is a compact metric space $(Z,d_Z)$, together with a homeomorphism $h:X\to h(X)\subset Z$ such that $d_Z(h(x_1),h(x_2))\le d(x_1,x_2)$ for all $x_1,x_2\in X$. Let $f:X\to\mathbb{R}$ be a $C^1$ function, and $\{x_n\}$ a sequence constructed by Backtracking GD. Assume that f has at most countably many critical points. Then, either $\{x_n\}$ converges to a critical point of f, or $\{x_n\}$ diverges to infinity.
For clarity of the next discussion, we define a class of Riemannian manifolds.
Definition 1.
A Riemannian manifold $(X,d)$ is called non-expandingly homeomorphically compactible if there is a compact metric space $(Z,d_Z)$, together with a homeomorphism $h:X\to h(X)\subset Z$ such that $d_Z(h(x_1),h(x_2))\le d(x_1,x_2)$ for all $x_1,x_2\in X$.
To be able to apply Theorem 2, it remains to find what manifolds are non-expandingly homeomorphically compactible. We give first some examples.
Example 1.
X is a bounded subset of $\mathbb{R}^l$. This is the case in [9].
Example 2.
X is a subset of $\mathbb{R}^l$. This is the case in [10] and is more general than Example 1 when X is unbounded. On the other hand, it is interesting to note that an X from Example 2 is homeomorphic to one in Example 1. Indeed, by Theorem 3.4.4 in [17], the real projective space $\mathbb{P}^l$ is affine, that is, isometric to a compact subset of $\mathbb{R}^{(l+1)^2}$.
Remark 1.
In a very recent paper [18], the author has been able to extend the previously mentioned results, including Theorem 2 and a modification of Newton's method, to general Riemannian manifolds. There, the reader can find more general algorithms on Riemannian manifolds. In particular, by using Nash's embedding theorem [19,20,21], every Riemannian manifold is non-expandingly homeomorphically compactible, and hence Theorem 2 holds in general.
From the above discussion, we are motivated to state the following question. Even partial answers to it would be greatly helpful for optimisation on manifolds.
Question 1. In the statement of Theorem 2, if one does not assume that f has at most countably many critical points, will the conclusions of Theorem 2 still hold?
Currently, we do not know the answer to Question 1. We have just a couple of comments. First, by the transversality theorem, the set of functions satisfying the assumptions of Theorem 2 is dense, and hence for practical purposes the Theorem is applicable in realistic settings. Second, we have run many examples on various types of benchmark functions, and found that the sequence constructed by Backtracking GD always satisfies the conclusions of Theorem 2. Hence, we conjecture that the answer to Question 1 is affirmative, at least when the initial point $x_0$ is randomly chosen.
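As a further illustration (our own toy experiment, not one of the benchmark tests mentioned above), one can feed the Backtracking GD sketch from Section 2 a function whose critical set is uncountable, such as $f(x,y)=(x^2+y^2-1)^2$, whose minima form the whole unit circle. Since the gradient is radial, the iterates stay on the ray through $x_0$ and converge to a single point on the circle, in agreement with the conjectured answer.

```python
import numpy as np

# A function with an uncountable critical set: the whole unit circle of
# global minima, plus the origin (a local maximum).
f = lambda x: (x[0]**2 + x[1]**2 - 1.0)**2
grad_f = lambda x: 4.0 * (x[0]**2 + x[1]**2 - 1.0) * x

x_star = backtracking_gd(f, grad_f, x0=np.array([3.0, 4.0]))
print(x_star, np.linalg.norm(x_star))   # a single limit point on the unit circle
```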

Funding

This work is supported by the Young Research Talents grant number 300814 from the Research Council of Norway.

Acknowledgments

The author thanks the editors of the Special Issue “Riemannian Geometry of Submanifolds” in the journal Mathematics (MDPI) for the invitation and help. The author would also like to thank the referees for their helpful feedback.

Conflicts of Interest

The author declares no conflict of interest.

References

1. LeCun, Y.; Cortes, C.; Burges, C. MNIST Handwritten Digit Database. 2010. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 1 April 2020).
2. Bray, A.J.; Dean, D.S. Statistics of critical points of Gaussian fields on large-dimensional spaces. Phys. Rev. Lett. 2007, 98, 150201.
3. Dauphin, Y.N.; Pascanu, R.; Gulcehre, C.; Cho, K.; Ganguli, S.; Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; Volume 2, pp. 2933–2941.
4. Cauchy, A. Méthode générale pour la résolution des systèmes d'équations simultanées. C. R. Math. 1847, 25, 536.
5. Lee, J.D.; Simchowitz, M.; Jordan, M.I.; Recht, B. Gradient descent only converges to minimizers. JMLR 2016, 49, 1–12.
6. Panageas, I.; Piliouras, G. Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. In Proceedings of the 8th Innovations in Theoretical Computer Science Conference (ITCS 2017), Berkeley, CA, USA, 9–11 January 2017; Volume 2, pp. 1–12.
7. Armijo, L. Minimization of functions having Lipschitz continuous first partial derivatives. Pac. J. Math. 1966, 16, 1–3.
8. Absil, P.-A.; Mahony, R.; Andrews, B. Convergence of the iterates of descent methods for analytic cost functions. SIAM J. Optim. 2005, 16, 531–547.
9. Lange, K. Optimization, 2nd ed.; Springer Texts in Statistics; Springer: New York, NY, USA, 2013.
10. Truong, T.T.; Nguyen, T.H. Backtracking Gradient Descent method and some applications to Large scale optimisation. Part 2: Algorithms and experiments. Appl. Math. Optim. 2020.
11. Vaswani, S.; Mishkin, A.; Laradji, I.; Schmidt, M.; Gidel, G.; Lacoste-Julien, S. Painless Stochastic Gradient: Interpolation, line-search and convergence rates. In Proceedings of NeurIPS 2019 (paper #8630); arXiv 2019, arXiv:1905.09997.
12. Truong, T.T. Some convergent results for Backtracking Gradient Descent method on Banach spaces. arXiv 2020, arXiv:2001.05768.
13. Truong, T.T. Convergence to minima for the continuous version of Backtracking Gradient Descent. arXiv 2019, arXiv:1911.04221.
14. Ruder, S. An overview of gradient descent optimisation algorithms. arXiv 2017, arXiv:1609.04747.
15. Bertsekas, D.P. Nonlinear Programming, 2nd ed.; Athena Scientific: Belmont, MA, USA, 1999.
16. Asic, M.D.; Adamovic, D.D. Limit points of sequences in metric spaces. Am. Math. Mon. 1970, 77, 613–616.
17. Bochnak, J.; Coste, M.; Roy, M.-F. Real Algebraic Geometry; A Series of Modern Surveys in Mathematics; Springer: Berlin/Heidelberg, Germany, 1998; Volume 36.
18. Truong, T.T. Unconstrained optimisation on Riemannian manifolds. arXiv 2020, arXiv:2008.11091.
19. Kuiper, N.H. On C1-isometric imbeddings, I and II. Indag. Math. Proc. 1955, 58, 545–556, 683–689.
20. Nash, J. The imbedding problem for Riemannian manifolds. Ann. Math. 1956, 63, 20–63.
21. Nash, J. C1-isometric imbeddings. Ann. Math. 1955, 60, 383–396.
Figure 1. Some samples from the MNIST hand-written digit dataset. (Source: [1]).