1. Motivation
Optimisation is important in many areas of engineering and computer science. A modern example is deep neural networks, which can effectively solve several tasks (image/video classification, natural language processing and so on) that posed enormous challenges for the old paradigm of rule-based learning. For deep neural networks to work, one has to solve large-scale, non-convex optimisation problems. For example, modern state-of-the-art deep neural network architectures give rise to optimisation problems (finding minima) in hundreds of millions of variables.
For an explicit example, we consider the task of recognising handwritten digits, useful for example when scanning postal packages. A well-known dataset is MNIST [1]. A sample is shown in Figure 1; several of the digits can be challenging even for human beings. This task is extremely difficult for the old paradigm of rule-based learning, but is considered a simple task for deep neural networks. For this task, one can use a “simple” deep neural network, which gives rise to an optimisation problem in about 12,000 variables.
Because the associated optimisation problems are large scale and non-convex, one must resort to iterative numerical methods. One would like the method to guarantee convergence to local minima. This can be divided into two steps: first, show that the method converges, and then show that the limit point is a local minimum. Since saddle points are dominant for functions in higher dimensions [2,3], for the second step it is important to guarantee that the limit point is not a saddle point.
It could be considered a lucky fact that Gradient Descent methods (GD) have been used in deep neural networks from the beginning. At first, the reason may have been that GD is easy to implement and not costly to run in large-scale optimisation. Then, even though it dates back more than 170 years [4], only gradually (with some results announced only very recently) has it been shown that GD has good properties: it can avoid saddle points [5,6]. While its standard version does not guarantee convergence to critical points, its Backtracking version [7] does [8,9,10] (the latter paper consists of the more experimental part of arXiv:1808.05160, in combination with arXiv:2001.02005 and arXiv:2007.03618) and can be implemented in deep neural networks with very good performance on the CIFAR10 and CIFAR100 image datasets [10,11]. Some further modifications of Backtracking GD can avoid saddle points as well [12,13].
2. Convergence Results
We will, for the remainder of this section, discuss convergence to critical points for GD methods; the discussion uses special geometric properties. First, we recall in detail the update rule in GD. We consider a function $f:\mathbb{R}^k\rightarrow\mathbb{R}$, which is assumed to be $C^1$. We want to find minima of $f$. One starts from a random initial point $z_0\in\mathbb{R}^k$, and constructs a sequence $z_{n+1}=z_n-\delta_n\nabla f(z_n)$, where $\delta_n>0$ is an appropriate number. There are many ways to choose $\delta_n$. In the Standard GD scheme, one chooses $\delta_n$ to be a constant $\delta_0$. A disadvantage of Standard GD is that it does not guarantee convergence: to have good behaviour one must assume that $f$ is in $C^{1,1}_L$, that is, $\nabla f$ is globally Lipschitz continuous with Lipschitz constant $L$, and further assume that $\delta_0$ is of the order of $1/L$. There are many popular modifications trying to overcome this, such as Adam, Adadelta, Nesterov Accelerated Gradient, Momentum and so on (see [14] for a review); none of these are guaranteed to converge in general either. To date, only Backtracking GD is guaranteed to converge: see Chapter 12 in [9], in particular Proposition 12.6.1 there, for the case where $f$ is $C^1$, has compact sublevel sets and has at most countably many critical points; see [8] when $f$ is real analytic (or, more generally, satisfies the so-called Łojasiewicz gradient inequality); and see [10] for the general case where $f$ is $C^1$ only and has at most countably many critical points. Note that the assumption in the last paper is not too restrictive: indeed, it is known from transversality results that such an assumption is satisfied by a generic smooth function (for example, by Morse functions, which are a well-known class of functions in geometry and analysis). Since the real analyticity assumption in [8] is quite special, we will not discuss it below, trying to provide only the most general ideas.
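To make the update rule and Armijo's backtracking rule concrete, here is a minimal Python sketch (our own illustration; the parameters $\delta_0$, $\alpha$, $\beta$ and the test function are arbitrary choices, not the exact implementation of the cited papers):

import numpy as np

def backtracking_gd(f, grad, z0, delta0=1.0, alpha=0.5, beta=0.5, iters=1000):
    """Gradient Descent with Armijo backtracking line search (illustrative sketch).

    Update rule: z_{n+1} = z_n - delta_n * grad f(z_n), where delta_n is the largest
    number of the form delta0 * beta**m satisfying Armijo's condition
        f(z_n - delta * grad f(z_n)) <= f(z_n) - alpha * delta * ||grad f(z_n)||**2.
    """
    z = np.asarray(z0, dtype=float)
    for _ in range(iters):
        g = grad(z)
        delta = delta0
        while f(z - delta * g) > f(z) - alpha * delta * g @ g:
            delta *= beta  # backtrack until Armijo's condition holds
        z = z - delta * g
    return z

# Quick check on a smooth function with a unique minimum at (1, -2).
f = lambda z: (z[0] - 1.0) ** 2 + 10.0 * (z[1] + 2.0) ** 2
grad = lambda z: np.array([2.0 * (z[0] - 1.0), 20.0 * (z[1] + 2.0)])
print(backtracking_gd(f, grad, z0=[5.0, 5.0]))  # approaches [1, -2]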
Both [9,10] start from the following property: if $\{z_n\}$ is constructed as above, and $\{z_{n_j}\}$ is a convergent subsequence with limit $z_\infty$, then $\nabla f(z_\infty)=0$. This is classically known (see [15]); the main idea is as follows: if $\lim_{j\rightarrow\infty}z_{n_j}=z_\infty$ and $\nabla f(z_\infty)\neq 0$, then $\liminf_{j\rightarrow\infty}\delta_{n_j}\|\nabla f(z_{n_j})\|^2>0$. The latter is a contradiction when Armijo's condition is taken into account.
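For completeness, here is one way to spell out that contradiction (a sketch in the standard Armijo setup; the parameter $\alpha\in(0,1)$ below is the Armijo parameter, which is not fixed in the text above). Armijo's condition for each step reads
\[
f(z_{n+1})\leq f(z_n)-\alpha\,\delta_n\,\|\nabla f(z_n)\|^2 .
\]
Since $f(z_n)$ is non-increasing and $f(z_{n_j})\rightarrow f(z_\infty)$, summing over $n$ gives
\[
\alpha\sum_{n\geq 0}\delta_n\|\nabla f(z_n)\|^2\leq f(z_0)-f(z_\infty)<\infty ,
\]
so $\lim_{n\rightarrow\infty}\delta_n\|\nabla f(z_n)\|^2=0$, which is incompatible with $\liminf_{j\rightarrow\infty}\delta_{n_j}\|\nabla f(z_{n_j})\|^2>0$.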
Now, one also needs the following property:
Property 1. Either $\lim_{n\rightarrow\infty}\|z_n\|=\infty$, or $\lim_{n\rightarrow\infty}\|z_{n+1}-z_n\|=0$.
In case $f$ has compact sublevel sets, this is easily proven [9]. For the general case, see [10] for a proof.
Now, one needs a special property of compact metric spaces [16]. We recall that given a sequence $\{x_n\}$ in a metric space, its set of cluster points consists of the points $x$ so that there is a subsequence $\{x_{n_j}\}$ for which $\lim_{j\rightarrow\infty}x_{n_j}=x$.
Theorem 1. Let $(X,d)$ be a compact metric space. If $\{x_n\}\subset X$ is a sequence so that $\lim_{n\rightarrow\infty}d(x_n,x_{n+1})=0$, then the set of cluster points of $\{x_n\}$ is connected.
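As a small numerical illustration of Theorem 1 (our own example, not taken from [16]), consider the points $x_n=(\cos\theta_n,\sin\theta_n)$ on the unit circle with $\theta_n=1+1/2+\dots+1/n$. The consecutive distances tend to 0, yet the sequence does not converge: its cluster set is the whole circle, which is indeed connected.

import numpy as np

# Points on the unit circle whose angles are partial sums of the harmonic series.
N = 200_000
angles = np.cumsum(1.0 / np.arange(1, N + 1))
points = np.column_stack([np.cos(angles), np.sin(angles)])

steps = np.linalg.norm(np.diff(points, axis=0), axis=1)
print("last consecutive distance:", steps[-1])  # tends to 0 as n grows

# The angles modulo 2*pi visit every 1-degree bin: the cluster set fills the circle.
hist, _ = np.histogram(angles % (2 * np.pi), bins=360, range=(0, 2 * np.pi))
print("empty bins out of 360:", int((hist == 0).sum()))  # 0 in this run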
In the setting of [9], one can finish the proof of convergence as follows. Let $X=\{z\in\mathbb{R}^k:\ f(z)\leq f(z_0)\}$. Since $f$ is assumed to have compact sublevel sets, it follows that $X$ is compact. Let $d$ be the restriction to $X$ of the usual metric on $\mathbb{R}^k$; then $(X,d)$ is a compact metric space. Since we know that $\lim_{n\rightarrow\infty}\|z_{n+1}-z_n\|=0$ for the constructed sequence, we can then apply Theorem 1 to the sequence $\{z_n\}$, and obtain that its set of cluster points $C$ is connected. Since we also know that $C$ must be contained in the set of critical points of $f$, and by assumption this set is countable, it follows that $C$ is also countable. Since a countable and connected set must be either empty or a single point, it follows that either $\lim_{n\rightarrow\infty}\|z_n\|=\infty$ (the first case) or $\{z_n\}$ converges. Note that in this case, since $\{z_n\}\subset X$ is bounded, only the second case can happen, that is, $\{z_n\}$ converges.
In the general case, the above proof does not go through, since the set $X$ above may not be bounded. In [10], a way to get around this is as follows. We let $\mathbb{P}^k$ be the real projective space with its canonical metric $d'$ (the spherical metric). We let $d$ be the usual Euclidean metric on $\mathbb{R}^k$, and let $d''$ be the restriction of $d'$ to $\mathbb{R}^k\subset\mathbb{P}^k$. Then, topologically, the two spaces $(\mathbb{R}^k,d)$ and $(\mathbb{R}^k,d'')$ are homeomorphic. In particular, convergence properties of a sequence $\{z_n\}$ in $(\mathbb{R}^k,d)$ can be translated to those of the same sequence considered in $(\mathbb{R}^k,d'')$. Even though the two spaces are not isometric, one can check that if $\lim_{n\rightarrow\infty}\|z_{n+1}-z_n\|=0$, then also $\lim_{n\rightarrow\infty}d''(z_{n+1},z_n)=0$. In addition, $(\mathbb{P}^k,d')$ is a compact metric space. Hence, one can apply Theorem 1, and obtain that, for the constructed sequence $\{z_n\}$ considered in $\mathbb{P}^k$, the set of cluster points $C$ is connected. Since $C\cap\mathbb{R}^k$ is contained in the set of critical points of $f$, and this set is countable by assumption, it follows that if $C\cap\mathbb{R}^k$ is non-empty then it is a single point. Hence, we obtain the same conclusions as before. Note that here the case $\lim_{n\rightarrow\infty}\|z_n\|=\infty$ can happen, for example for a function with no critical points at all, such as $f(x)=e^{-x}$ on $\mathbb{R}$.
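As a quick numerical sanity check of the comparison between $d''$ and the Euclidean metric (our own illustration; we compute the spherical distance in the affine chart as the angle between the lifted unit vectors $(1,x)/\sqrt{1+\|x\|^2}$ and $(1,y)/\sqrt{1+\|y\|^2}$), one can verify on random pairs that $d''(x,y)\leq\|x-y\|$, so small Euclidean steps remain small in $d''$:

import numpy as np

rng = np.random.default_rng(0)
k, m = 5, 20_000
x = 100 * rng.standard_normal((m, k))   # include points far from the origin
y = x + rng.standard_normal((m, k))

# Lift to unit vectors (1, x)/sqrt(1 + |x|^2) and take the angle between the lines.
u = np.hstack([np.ones((m, 1)), x]); u /= np.linalg.norm(u, axis=1, keepdims=True)
v = np.hstack([np.ones((m, 1)), y]); v /= np.linalg.norm(v, axis=1, keepdims=True)
d_proj = np.arccos(np.clip(np.abs(np.sum(u * v, axis=1)), -1.0, 1.0))
d_eucl = np.linalg.norm(x - y, axis=1)

print("max observed d''(x, y) / ||x - y||:", (d_proj / d_eucl).max())  # stays <= 1 in our runs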
3. Riemannian Manifold Optimisation
Now we discuss the general setting of optimisation on manifolds. Let $X$ be a real manifold and $f:X\rightarrow\mathbb{R}$ a $C^1$ function. We want to find (local) minima of $f$. Here one could, as before, try to use Standard GD. To prove good properties for Standard GD, one could, as before, restrict the discussion to $C^{1,1}_L$ functions. However, here it is very cumbersome to define a global $C^{1,1}_L$ notion in the manifold setting. It is better to switch to Backtracking GD, which is local in nature, and hence has a natural extension to the manifold setting. We leave it to the readers to state a specific version of Backtracking GD in the manifold setting, by working on small coordinate charts.
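For concreteness, here is one possible sketch of such a scheme on the unit sphere, minimising the Rayleigh quotient $f(x)=x^{\top}Ax$; the choice of Riemannian gradient (tangent-space projection of the Euclidean gradient), the retraction (renormalisation) and all parameter values are our own illustrative assumptions, not the construction of the cited papers.

import numpy as np

def sphere_backtracking_gd(A, x0, delta0=1.0, alpha=0.5, beta=0.5, iters=500):
    """Illustrative Backtracking GD on the unit sphere for f(x) = x^T A x."""
    f = lambda x: x @ A @ x
    x = np.asarray(x0, dtype=float)
    x /= np.linalg.norm(x)
    for _ in range(iters):
        egrad = 2 * A @ x                 # Euclidean gradient of f
        rgrad = egrad - (x @ egrad) * x   # project onto the tangent space at x
        if np.linalg.norm(rgrad) < 1e-12:
            break
        retract = lambda t: (x - t * rgrad) / np.linalg.norm(x - t * rgrad)
        delta = delta0
        # Armijo's condition, with the Riemannian gradient norm and the retraction.
        while f(retract(delta)) > f(x) - alpha * delta * (rgrad @ rgrad):
            delta *= beta
        x = retract(delta)
    return x

# Minimising the Rayleigh quotient on the sphere recovers an eigenvector of the
# smallest eigenvalue of A.
rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6)); A = B + B.T
x_min = sphere_backtracking_gd(A, rng.standard_normal(6))
print(x_min @ A @ x_min, np.linalg.eigvalsh(A).min())  # the two values should be close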
Taking the ideas from the proof in [10], as presented above, we obtain the following general result. We say that a sequence $\{z_n\}$ in a metric space $X$ diverges to infinity if $\{z_n\}$ eventually leaves every bounded subset of $X$.
Theorem 2. Let $X$ be a Riemannian manifold, with the induced metric $d$. Assume that there is a compact metric space $(X',d')$, together with a homeomorphism $\varphi:X\rightarrow\varphi(X)\subset X'$ such that $d'(\varphi(x),\varphi(y))\leq d(x,y)$ for all $x,y\in X$. Let $f:X\rightarrow\mathbb{R}$ be a $C^1$ function, and $\{z_n\}$ a sequence constructed by Backtracking GD. Assume that $f$ has at most countably many critical points. Then, either $\{z_n\}$ converges to a critical point of $f$, or $\{z_n\}$ diverges to infinity.
For clarity of the next discussion, we define a class of Riemannian manifolds.
Definition 1. A Riemannian manifold $(X,d)$ is called non-expandingly homeomorphically compactible if there is a compact metric space $(X',d')$, together with a homeomorphism $\varphi:X\rightarrow\varphi(X)\subset X'$ such that $d'(\varphi(x),\varphi(y))\leq d(x,y)$ for all $x,y\in X$.
To be able to apply Theorem 2, it remains to find which manifolds are non-expandingly homeomorphically compactible. We first give some examples.
Example 1. $X$ is a bounded subset of $\mathbb{R}^k$. This is the case in [9].

Example 2. $X$ is a subset of $\mathbb{R}^k$. This is the case in [10] and is more general than Example 1 when $X$ is unbounded. On the other hand, it is interesting to note that an $X$ from Example 2 is homeomorphic to one in Example 1. Indeed, by Theorem 3.4.4 in [17], the real projective space $\mathbb{P}^k$ is affine, that is, isomorphic to a compact subset of some $\mathbb{R}^N$.

Remark 1. In a very recent paper [18], the author has been able to extend the previously mentioned results, including Theorem 2 and a modification of Newton's method, to general Riemannian manifolds. There, the readers can consult more general algorithms on Riemannian manifolds. In particular, by using Nash's embedding theorem [19,20,21], every Riemannian manifold is non-expandingly homeomorphically compactible, and hence Theorem 2 holds in general.

From the above discussion, we are motivated to state the following question. Even partial answers to it would be greatly helpful for optimisation on manifolds.
Question 1. In the statement of Theorem 2, if one does not assume that $f$ has at most countably many critical points, will the conclusions of Theorem 2 still hold?
Currently, we do not know the answer to Question 1. We have just a couple of comments. First, by the transversality theorem, the set of functions satisfying the assumptions of Theorem 2 is dense, and hence for practical purposes the theorem is applicable in realistic settings. Second, we have run experiments on various types of benchmark functions, and found that the sequence constructed by Backtracking GD always satisfies the conclusions of Theorem 2. Hence, we conjecture that the answer to Question 1 is affirmative, at least when the initial point is randomly chosen.
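To give an idea of the kind of experiment meant here (a minimal sketch of our own, not the experiments of the cited papers; the benchmark function, the parameters and the number of iterations are arbitrary choices), one can run Backtracking GD on the Rosenbrock function from several random initial points and check that the conclusions of Theorem 2 hold, i.e. the iterates approach the unique critical point $(1,1)$:

import numpy as np

def rosenbrock(z):
    x, y = z
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def rosenbrock_grad(z):
    x, y = z
    return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2), 200 * (y - x ** 2)])

def backtracking_gd(f, grad, z0, delta0=1.0, alpha=0.5, beta=0.5, iters=20_000):
    # Same scheme as the Backtracking GD sketch in Section 2.
    z = np.asarray(z0, dtype=float)
    for _ in range(iters):
        g = grad(z)
        delta = delta0
        while f(z - delta * g) > f(z) - alpha * delta * g @ g:
            delta *= beta
        z = z - delta * g
    return z

rng = np.random.default_rng(2)
for trial in range(5):
    z_final = backtracking_gd(rosenbrock, rosenbrock_grad, rng.uniform(-2, 2, size=2))
    print(trial, z_final, np.linalg.norm(rosenbrock_grad(z_final)))
# In our runs, every trial ends close to the unique critical point (1, 1).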