1. Introduction
Deep neural networks are used for many tasks, such as natural language processing, computer vision, and text and image classification (see also [1,2,3] for applications of neural networks), and a number of algorithms have been presented to tune the model parameters of such networks. The appropriate parameters are found by solving nonconvex stochastic optimization problems. In particular, the algorithms solve these problems in order to adapt the learning rates of the model parameters. Accordingly, they are called adaptive learning rate optimization algorithms [4] (Subchapter 8.5), and they include AdaGrad [5], RMSProp [4] (Algorithm 8.5), Adam [6], and AMSGrad [7].
Recently, reference [8] performed convergence analyses of adaptive learning rate optimization algorithms with constant learning rates and with diminishing learning rates. The analyses indicated that the algorithms with sufficiently small constant learning rates approximate stationary points of the problems [8] (Theorem 3.1). This implies that useful algorithms, such as Adam and AMSGrad, can use constant learning rates to solve the nonconvex stochastic optimization problems in deep learning, in contrast to the results in [6,7], which presented only analyses assuming convexity of the objective functions and diminishing learning rates. The analyses also indicated that the algorithms with diminishing learning rates converge to stationary points of the problems and achieve a certain convergence rate [8] (Theorem 3.2). Numerical comparisons showed that the algorithms with constant learning rates perform better than the ones with diminishing learning rates.
Meanwhile, conjugate gradient methods are useful for unconstrained nonconvex deterministic optimization (see [9] for details on conjugate gradient methods). These methods use the conjugate gradient direction (see also (2) for the definition of the conjugate gradient direction with the Fletcher–Reeves formula), and they accelerate the steepest descent method. Conjugate gradient methods converge globally and generate descent directions. In particular, the Hager–Zhang, Polak–Ribière–Polyak, and Hestenes–Stiefel methods have efficient numerical performance [9]. It seems that conjugate gradient methods could be applied to constrained optimization, because they might accelerate the existing methods for constrained optimization. However, conjugate gradient methods may fail to converge to solutions of constrained optimization problems [10] (Proposition 3.2), which means that we cannot apply them directly. Indeed, the numerical results in [10] showed that conjugate gradient methods with conventional formulas, such as the Fletcher–Reeves, Polak–Ribière–Polyak, and Hestenes–Stiefel formulas, do not always converge to solutions of constrained optimization problems.
The conjugate gradient direction has been modified so that it can be applied to constrained optimization. The modified direction is called the conjugate gradient-like direction [10,11,12,13,14], and it is obtained by replacing the formula used for finding the conventional conjugate gradient direction with a positive real sequence depending on the number of iterations (see (1) for the definition of the conjugate gradient-like direction). The conjugate gradient-like method based on this direction can be applied to constrained convex deterministic optimization. In particular, the conjugate gradient-like method converges to solutions of constrained convex deterministic optimization problems when the step sizes (which are called learning rates) are diminishing [10] (Theorem 3.1). Moreover, the numerical results in [10] showed that it converges faster than the existing steepest descent method.
Roughly speaking, the existing adaptive learning rate optimization algorithms [4] (Subchapter 8.5) are first-order methods using the steepest descent direction of an observed function at each iteration. Accordingly, using the conjugate gradient-like method would be useful for accelerating these algorithms. Hence, in this article, we propose an iterative method combining the existing adaptive learning rate optimization algorithms [4] (Subchapter 8.5) with the conjugate gradient-like method [10,11,12,13,14].
This article provides two convergence analyses. The first analysis shows that, with a small constant learning rate, the proposed algorithm approximates a stationary point of a nonconvex optimization problem in deep learning (Theorem 1). The second analysis shows that, with diminishing learning rates, it converges to a stationary point of the nonconvex optimization problem (Theorem 2). The convergence and performance of the proposed algorithm are examined through numerical comparisons with the existing adaptive learning rate optimization algorithms for image and text classification. The numerical results show that the proposed algorithm with a constant learning rate is superior for training neural networks, whereas the proposed algorithm with diminishing learning rates performs poorly.
This article is organized as follows. Section 2 gives the mathematical preliminaries and states the main problem. Section 3 presents the proposed algorithm for solving the main problem and analyzes its convergence. Section 4 numerically compares the behaviors of the proposed learning algorithms with those of the existing ones. Section 5 discusses the relationship between the previously reported results and the results in Section 3 and Section 4. Section 6 concludes the paper with a brief summary.
2. Mathematical Preliminaries
2.1. Notation and Definitions
$\mathbb{N}$ denotes the set of all positive integers together with zero. $\mathbb{R}^d$ denotes a d-dimensional Euclidean space with inner product $\langle x, y \rangle$, which induces the norm $\|x\|$. $\mathbb{S}^d$ denotes the set of $d \times d$ symmetric matrices, $\mathbb{S}^d_{++}$ denotes the set of $d \times d$ symmetric positive-definite matrices, and $\mathbb{D}^d$ denotes the set of $d \times d$ diagonal matrices. $A \odot B$ denotes the Hadamard product of matrices A and B.
Given $H \in \mathbb{S}^d_{++}$, the H-inner product of $\mathbb{R}^d$ and the H-norm are defined for all $x, y \in \mathbb{R}^d$ by $\langle x, y \rangle_H := \langle x, Hy \rangle$ and $\|x\|_H^2 := \langle x, Hx \rangle$.
The metric projection [15] (Subchapter 4.2, Chapter 28) onto a nonempty, closed convex set $X \subset \mathbb{R}^d$, denoted by $P_X$, is defined for all $x \in \mathbb{R}^d$ by $P_X(x) \in X$ and $\|x - P_X(x)\| = \min_{y \in X} \|x - y\|$. $P_X$ satisfies the nonexpansivity condition, i.e., $\|P_X(x) - P_X(y)\| \leq \|x - y\|$ ($x, y \in \mathbb{R}^d$), together with the property in [15] (Proposition 4.8, (4.8)). The metric projection onto X under the H-norm is denoted by $P_{X,H}$. When X is an affine subspace, a half-space, or a hyperslab, the projection onto X can be computed within a finite number of arithmetic operations [15] (Chapter 28).
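As an illustration of the last point, the following is a minimal Python sketch (not taken from the source) of the closed-form metric projection onto a half-space; the function name and the set $\{y : \langle a, y \rangle \leq b\}$ are assumptions used only for this example.

```python
import numpy as np

def project_halfspace(x, a, b):
    """Metric projection of x onto the half-space {y : <a, y> <= b}.
    If x already lies in the set, it is its own projection; otherwise x is
    shifted along a by exactly the amount needed to satisfy the constraint."""
    violation = np.dot(a, x) - b
    if violation <= 0.0:
        return x
    return x - (violation / np.dot(a, a)) * a

# Example: project the point (2, 2) onto {y : y_1 + y_2 <= 1}; the result is (0.5, 0.5).
p = project_halfspace(np.array([2.0, 2.0]), np.array([1.0, 1.0]), 1.0)
```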
$\mathbb{E}[X]$ denotes the expectation of a random variable X. The history of the process up to time n is denoted by $\mathcal{F}_n$, and $\mathbb{E}[X \mid \mathcal{F}_n]$ denotes the conditional expectation of X given $\mathcal{F}_n$. Unless stated otherwise, all relations between random variables hold almost surely.
2.2. Stationary Point Problem Associated with Nonconvex Optimization Problem
Let us consider the following problem [8] (see, e.g., Subchapter 1.3.1 in [16] for details on stationary point problems):
Problem 1. Assume that
- (A1) $X \subset \mathbb{R}^d$ is a nonempty, closed convex set onto which the projection can be easily computed;
- (A2) $f\colon \mathbb{R}^d \to \mathbb{R}$, which is defined for all $x \in \mathbb{R}^d$ by $f(x) := \mathbb{E}[F(x, \xi)]$, is well defined, where $F(\cdot, \xi)$ is continuously differentiable for almost every $\xi$, and $\xi$ is a random vector whose probability distribution P is supported on a set $\Xi$.
Then, we would like to find a stationary point of the problem of minimizing f over X, i.e., a point in
$$X^\star := \{ x^\star \in X \colon \langle x - x^\star, \nabla f(x^\star) \rangle \geq 0 \text{ for all } x \in X \},$$
where $\nabla f$ denotes the gradient of f. We can see that, if $X = \mathbb{R}^d$, then $X^\star$ coincides with the set of points satisfying $\nabla f(x^\star) = 0$ and that, if f is convex, then $x^\star \in X^\star$ is a global minimizer of f over X [16] (Subchapter 1.3.1).
Problem 1 is examined under the following conditions [8] (a minimal sketch of a stochastic gradient oracle satisfying these conditions is given after the list):
- (C1) There is an independent and identically distributed sample $\xi_0, \xi_1, \ldots$ of realizations of the random vector $\xi$;
- (C2) There is an oracle which, for a given input point $x \in \mathbb{R}^d$, returns a stochastic gradient $\mathsf{G}(x, \xi)$ such that $\mathbb{E}[\mathsf{G}(x, \xi)] = \nabla f(x)$;
- (C3) There exists a positive number M such that, for all $x \in X$, $\mathbb{E}[\|\mathsf{G}(x, \xi)\|^2] \leq M^2$.
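The following Python sketch illustrates how a mini-batch oracle of the kind assumed in (C1)–(C3) could look; the function names and the toy objective are illustrative assumptions, not part of the source.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_gradient(x, batch, grad_F):
    """Mini-batch oracle for (C2): average the per-sample gradients grad_F(x, xi)
    over an i.i.d. batch (C1), which gives an unbiased estimate of the full gradient."""
    return np.mean([grad_F(x, xi) for xi in batch], axis=0)

# Toy example: F(x, xi) = 0.5 * ||x - xi||^2, so grad_F(x, xi) = x - xi and
# f(x) = E[F(x, xi)] is minimized at the mean of xi.
grad_F = lambda x, xi: x - xi
x = np.zeros(3)
batch = rng.normal(size=(32, 3))   # 32 i.i.d. realizations of a 3-dimensional xi
g = stochastic_gradient(x, batch, grad_F)
```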
3. Conjugate Gradient-Like Method
Algorithm 1 is a method for solving Problem 1 under (C1)–(C3).
First, we would like to emphasize that Algorithm 1 uses a conjugate gradient-like direction [10,11,12,13] (see step 3 in Algorithm 1) defined by (1).
The direction (1) differs from a conventional conjugate gradient direction using, for example, the Fletcher–Reeves formula (2), i.e., $\beta_n^{\mathrm{FR}} := \|\nabla f(x_n)\|^2 / \|\nabla f(x_{n-1})\|^2$. Although conventional conjugate gradient methods are powerful tools for solving unconstrained smooth nonconvex optimization problems (see, e.g., [9] for details on conjugate gradient methods), iterative methods with conjugate gradient-like directions are useful for solving constrained smooth optimization problems [10,11,12,13] (see also Section 1 for details). Since Problem 1 is a constrained optimization problem, we will focus on using conjugate gradient-like directions.
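To make the distinction concrete, here is a minimal Python sketch of the two directions; it assumes the common form $d_n = -g_n + \beta_n d_{n-1}$ for (1), with $\beta_n$ a prescribed positive scalar depending only on the iteration count, as described above (the exact sequence used in the paper is not reproduced here).

```python
import numpy as np

def fletcher_reeves_direction(grad, grad_prev, d_prev):
    """Conventional conjugate gradient direction with the Fletcher-Reeves coefficient."""
    beta_fr = np.dot(grad, grad) / np.dot(grad_prev, grad_prev)
    return -grad + beta_fr * d_prev

def cg_like_direction(grad, d_prev, beta_n):
    """Conjugate gradient-like direction: the formula-based coefficient is replaced
    by a prescribed positive real number beta_n depending on the iteration count."""
    return -grad + beta_n * d_prev

# Example of a prescribed positive sequence (illustrative, not the paper's choice): beta_n = 0.5 ** n.
```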
Algorithm 1 Conjugate gradient-like method for solving Problem 1 (the listing forms a loop over the iteration counter; step 3 computes the conjugate gradient-like direction (1), and step 7 solves the subproblem stated there).
We can see that Algorithm 1 with the coefficient in (1) set to zero for all n coincides with the existing algorithm in [8] defined by (3). We can also show that algorithm (3) (i.e., Algorithm 1 with a zero coefficient in (1)) includes AMSGrad [7] and Adam [6] by referring to [8] (Section 3). For example, when the parameters of algorithm (3) are defined by (4), algorithm (3) with (4) is the AMSGrad algorithm. When they are defined by (5), algorithm (3) with (5) resembles the Adam algorithm. (The original Adam does not always converge [7] (Theorems 1–3); the definition in (5) is modified to guarantee convergence (see Theorems 1 and 2 for the convergence of Algorithm 1).)
For example, let us consider Algorithm 1 with (4), i.e., algorithm (6). From the above discussion, algorithm (6) with the coefficient in (1) set to zero coincides with AMSGrad. We can see that algorithm (6) uses a conjugate gradient-like direction, while AMSGrad (algorithm (3) with (4)) uses a gradient direction.
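The following Python sketch illustrates this relationship under the assumption that (4) corresponds to the standard AMSGrad moment updates; the update rule, variable names, and projection step are illustrative assumptions rather than the paper's exact algorithm (6).

```python
import numpy as np

def amsgrad_cg_step(x, g, state, alpha=1e-3, beta1=0.9, beta2=0.999,
                    cg_coeff=0.0, eps=1e-8, project=lambda y: y):
    """One illustrative step combining AMSGrad-style moments with a conjugate
    gradient-like search direction; cg_coeff = 0 reduces the direction to the
    plain stochastic gradient direction, i.e., an AMSGrad-like update."""
    m, v, v_hat, d_prev = state
    d = -g + cg_coeff * d_prev              # conjugate gradient-like direction (cf. (1))
    m = beta1 * m + (1 - beta1) * d         # first moment of the search direction
    v = beta2 * v + (1 - beta2) * g * g     # second moment of the stochastic gradient
    v_hat = np.maximum(v_hat, v)            # AMSGrad's max operation (cf. (4))
    x_new = project(x + alpha * m / (np.sqrt(v_hat) + eps))
    return x_new, (m, v, v_hat, d)

# state = (m, v, v_hat, d_prev), each initialized to np.zeros(d) for a d-dimensional iterate.
```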
The convergence analyses of Algorithm 1 assume the following conditions.
Assumption 1. The sequence , denoted by , in Algorithm 1 satisfies the following conditions:
- (A3) almost surely for all and all ;
- (A4) For all , a positive number exists such that .
Moreover,
- (A5) .
Assumption (A5) holds under the boundedness condition on X, which is assumed in [17] (p. 1574) and [7] (p. 2). In [8] (Section 3), it is shown that the parameters defined by (4) or (5) satisfy (A3) and (A4).
3.1. Constant Learning Rate Rule
The following is the convergence analysis of Algorithm 1 with a constant learning rate. Theorem 1 can be inferred by referring to the proof of Theorem 3.1 in [8]. The proof of Theorem 1 is given in Appendix A.
Theorem 1. Suppose that (A1)–(A5) and (C1)–(C3) hold and is the sequence generated by Algorithm 1 with , , and (). Then, for all , where , , M is defined as in (C3), , , D is defined as in (A5), and .
Theorem 1 shows that using a small constant learning rate approximates a solution to Problem 1. The result for coincides with Theorem 3.1 in [8].
We have the following proposition for convex stochastic optimization.
Proposition 1. Suppose that (A1)–(A5) and (C1)–(C3) hold, is convex for almost every , and is the sequence generated by Algorithm 1 with , , and (). Then, where denotes the optimal value of the problem of minimizing f over X, and , , M, , , D, and are defined as in Theorem 1.
The previously reported results in [7] showed that AMSGrad, which is an example of Algorithm 1 (see algorithm (3) with (4)), ensures that there exists a positive real number B such that (7), where T is the number of training examples and is convex for almost every . Inequality (7) indicates that the value generated by AMSGrad has an upper bound; however, it is not guaranteed that AMSGrad solves Problem 1. Meanwhile, Proposition 1 shows that Algorithm 1, which includes Adam and AMSGrad, can approximate a global minimizer of f by using a small constant learning rate.
3.2. Diminishing Learning Rate Rule
The following is the convergence analysis of Algorithm 1 with diminishing learning rates. Theorem 2 can be proven by referring to the proof of Theorem 3.2 in [8]. The proof of Theorem 2 is given in Appendix A.
Theorem 2. Suppose that (A1)–(A5) and (C1)–(C3) hold and is the sequence generated by Algorithm 1 with , , and () satisfying , , , and . Then, for all , Moreover, suppose that , , or , where , , and . Then, Algorithm 1 achieves the following convergence rate:
Inequality (8) implies that there exists a subsequence of such that converges to and, for all , which implies that satisfies (); i.e., is a solution to Problem 1.
Theorem 2 leads to the following proposition, which indicates that Algorithm 1 converges to a global minimizer of f when is convex for almost every .
Proposition 2. Suppose that (A1)–(A5) and (C1)–(C3) hold, is convex for almost every , and is the sequence generated by Algorithm 1 with , , and satisfying , , , and . Then, where denotes the optimal value of the problem of minimizing f over X. Moreover, suppose that , , or , where , , and . Then, any accumulation point of defined by almost surely belongs to the solution set , and Algorithm 1 achieves the following convergence rate:
4. Numerical Experiments
The experiments used a fast scalar computation server (https://www.meiji.ac.jp/isys/hpc/ia.html) at Meiji University. The environment has two Intel(R) Xeon(R) Gold 6148 (2.4 GHz, 20 cores) CPUs, an NVIDIA Tesla V100 (16 GB, 900 GB/s) GPU, and a Red Hat Enterprise Linux 7.6 operating system. The experimental code was written in Python 3.8.2, and we used the NumPy 1.19.1 and PyTorch 1.5.0 packages.
We compared the existing algorithms, such as the momentum method [18] (9), [19] (Section 2), AdaGrad [5], RMSProp [4] (Algorithm 8.5), Adam [6], and AMSGrad [7], in torch.optim (https://pytorch.org/docs/stable/optim.html) using the default values and learning rate , with Algorithm 1 defined as follows (a minimal PyTorch sketch of such a comparison is given after the list of settings below):
Algorithm 1 with a constant learning rate (Algorithm 1 with , such as Momentum-Ci, Adam-Ci, and AMSGrad-Ci (), is Algorithm 1 in [8]):
Momentum-C1: Algorithm 1 with , , , and .
Momentum-C2: Algorithm 1 with , , , and .
Momentum-C3: Algorithm 1 with , , , and .
MomentumCG-C1: Algorithm 1 with , , and .
MomentumCG-C2: Algorithm 1 with , , and .
MomentumCG-C3: Algorithm 1 with , , and .
Adam-C1: Algorithm 1 with , , defined by (5), , and .
Adam-C2: Algorithm 1 with , , defined by (5), , and .
Adam-C3: Algorithm 1 with , , defined by (5), , and .
AdamCG-C1: Algorithm 1 with , , defined by (5), and .
AdamCG-C2: Algorithm 1 with , , defined by (5), and .
AdamCG-C3: Algorithm 1 with , , defined by (5), and .
AMSGrad-C1: Algorithm 1 with , , defined by (4), , and .
AMSGrad-C2: Algorithm 1 with , , defined by (4), , and .
AMSGrad-C3: Algorithm 1 with , , defined by (4), , and .
AMSGradCG-C1: Algorithm 1 with , , defined by (4), and .
AMSGradCG-C2: Algorithm 1 with , , defined by (4), and .
AMSGradCG-C3: Algorithm 1 with , , defined by (4), and .
Algorithm 1 with diminishing learning rates and based on [7] (Theorem 4 and Corollary 1) (Algorithm 1 with , such as Momentum-D1, Adam-D1, and AMSGrad-D1, is Algorithm 1 in [8]):
Momentum-D1: Algorithm 1 with , , and .
MomentumCG-D1: Algorithm 1 with , , and .
MomentumCG-D2: Algorithm 1 with , , and .
Adam-D1: Algorithm 1 with , , defined by (5), and .
AdamCG-D1: Algorithm 1 with , , defined by (5), and .
AdamCG-D2: Algorithm 1 with , , defined by (5), and .
AMSGrad-D1: Algorithm 1 with , , defined by (4), and .
AMSGradCG-D1: Algorithm 1 with , , defined by (4), and .
AMSGradCG-D2: Algorithm 1 with , , defined by (4), and .
4.1. Image Classification
This experiment used the CIFAR10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html), a benchmark for image classification. The dataset consists of 60,000 color images (32 × 32 pixels) in 10 classes, with 6000 images per class. There are 50,000 training images and 10,000 test images. The test batch contains exactly 1000 randomly selected images from each class. We trained a 44-layer ResNet (ResNet-44) [20] organized into 43 convolutional layers with 3 × 3 filters and a 10-way fully connected layer with a softmax function. We used the cross entropy as the loss function for fitting ResNet, in accordance with the commonly used strategy in image classification.
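A minimal sketch of the data and loss setup for this experiment is given below; loading CIFAR10 through torchvision and the batch size are assumptions for illustration, since the source only names the dataset and the cross-entropy loss.

```python
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

transform = transforms.ToTensor()
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

# Cross entropy used as the loss for the 10-class classifier (cf. the softmax output described above);
# nn.CrossEntropyLoss combines log-softmax and negative log-likelihood internally.
loss_fn = nn.CrossEntropyLoss()
```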
Figure 1, Figure 2 and Figure 3 compare the behaviors of the proposed algorithm with a constant learning rate with those of Momentum, AdaGrad, RMSProp, Adam, and AMSGrad using the default values in torch.optim (i.e., ). Figure 1 shows that Momentum-C1, MomentumCG-C1, and AMSGrad-C2 minimized the training loss function faster than the existing algorithms, and Figure 2 shows that they also decreased the training error rate faster. Moreover, AdamCG-Ci (resp. AMSGradCG-Ci) () outperformed AdamCG-C1 (resp. AMSGradCG-C1); this implies that AdamCG and AMSGradCG require fewer iterations at smaller learning rates. Figure 3 shows that Adam-C2, AdamCG-C2, AMSGrad-C2, and AMSGradCG-C2 decreased the test error rate faster than the other algorithms. A similar trend was observed in the numerical results in [21].
Figure 4, Figure 5 and Figure 6 plot the behaviors of the proposed algorithms with diminishing learning rates. These algorithms did not work, and thus, it is clear that using diminishing learning rates is not good for training neural networks (see Section 5 for the details). A similar problem was observed in the numerical results in [8].
Table 1 shows the mean and variance of the elapsed time per epoch for the existing algorithms and for Algorithm 1 with a constant learning rate. This table indicates that the elapsed time of Momentum was almost the same as those of the corresponding proposed algorithms, e.g., Momentum-Ci and MomentumCG-Ci (). Adam and AMSGrad showed the same trend.
Table 2 compares the training error rates of the existing algorithms with those of Algorithm 1 by using the scipy.stats.ttest_ind function in Python. The p-value is the probability associated with a t-test, and the significance level is set at 5%. If the p-value is less than 0.05, then there is a significant difference between the existing algorithm and the proposed algorithms.
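For reference, this is how the test could be carried out with scipy.stats.ttest_ind; the error-rate arrays below are illustrative placeholders, not values from the paper.

```python
from scipy import stats

# Placeholder training error rates over repeated runs of one existing and one proposed algorithm.
existing = [0.085, 0.081, 0.079, 0.083, 0.080]
proposed = [0.071, 0.074, 0.069, 0.073, 0.070]

t_stat, p_value = stats.ttest_ind(existing, proposed)
significant = p_value < 0.05   # 5% significance level, as in the paper
```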
Table 2 and Figure 2 indicate that Momentum-C1 and MomentumCG-C1 outperformed Momentum and that the performance of the existing algorithm (Momentum) was significantly different from the performances of the proposed algorithms (Momentum-C1 and MomentumCG-C1). Adam-Ci and AdamCG-Ci () had almost the same performance as Adam, while the performance of AMSGrad was not significantly different from that of AMSGrad-Ci and AMSGradCG-Ci ().
4.2. Text Classification
This experiment used the IMDb dataset (https://datasets.imdbws.com/) for text classification tasks. The dataset contains 50,000 movie reviews along with their associated binary sentiment polarity labels, and it is split into 25,000 training and 25,000 test samples. We used an embedding layer that generated 50-dimensional embedding vectors, two bidirectional long short-term memory (LSTM) layers, and an affine output layer with a sigmoid activation function. To train this model, we used the binary cross entropy (BCE) as the loss function minimized by the existing and proposed algorithms.
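A minimal PyTorch sketch of a model matching this description is given below; the vocabulary size and hidden dimension are placeholder values, not the paper's settings.

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    """Illustrative model: a 50-dimensional embedding, two bidirectional LSTM
    layers, and an affine output layer with a sigmoid activation."""
    def __init__(self, vocab_size=25000, embed_dim=50, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden_dim, 1)   # affine layer; 2x for the bidirectional outputs

    def forward(self, tokens):
        out, _ = self.lstm(self.embedding(tokens))
        return torch.sigmoid(self.fc(out[:, -1, :])).squeeze(1)

loss_fn = nn.BCELoss()   # binary cross entropy used as the loss function
```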
Figure 7, Figure 8 and Figure 9 compare the behaviors of the proposed algorithm with a constant learning rate with those of Momentum, AdaGrad, RMSProp, Adam, and AMSGrad, using the default values in torch.optim (i.e., ). These figures show that Adam-C3, AdamCG-C3, AMSGrad-C3, RMSProp, Adam, and AMSGrad all performed well. In particular, Figure 8 shows that AdamCG-C3 (resp. AMSGradCG-C3) performed better than Adam-C3 (resp. AMSGrad-C3), which implies that using conjugate gradient-like directions would be good for training neural networks.
Table 3 indicates that the elapsed time for the existing algorithms was almost the same as that for the proposed algorithms, as seen in Table 1.
Table 4 and Figure 8 show that, although Momentum, Momentum-Ci, and MomentumCG-Ci did not perform better than existing algorithms such as Adam and AMSGrad, the performance of Momentum was significantly different from that of almost all of the proposed algorithms. It can be seen that Adam, Adam-C3, and AdamCG-C3 performed well and that, although AMSGrad, AMSGrad-C3, and AMSGradCG-C3 did not perform better than Adam, AMSGrad-C3 and AMSGradCG-C3 had almost the same performance as AMSGrad.
5. Discussion
Let us first discuss the relationship between the momentum method [18] (9), [19] (Section 2) and the MomentumCG variants used in Section 4. The momentum method [18] (9), [19] (Section 2) is defined by (9), where is the learning rate and is the momentum coefficient. We can see that the direction defined by (9) is a conjugate gradient-like direction. Meanwhile, MomentumCG used in Section 4 is defined by (10). Algorithm (10) uses the conjugate gradient-like direction of . For simplicity, algorithm (10) with reduces to (11), which implies that algorithm (11) is the momentum method with a learning rate and momentum coefficient .
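The identification can be checked with a short Python sketch; the update below uses the textbook form of the momentum method, and the conjugate gradient-like rewriting with the coefficient equal to the momentum coefficient is shown as an illustrative equivalence.

```python
import numpy as np

def momentum_step(x, velocity, grad, lr=0.01, momentum=0.9):
    """Classical momentum method: v <- momentum * v - lr * grad, x <- x + v."""
    velocity = momentum * velocity - lr * grad
    return x + velocity, velocity

def momentum_as_cg_like_step(x, d_prev, grad, lr=0.01, momentum=0.9):
    """The same update written through a conjugate gradient-like direction
    d = -grad + momentum * d_prev; with v = lr * d, the two recursions coincide."""
    d = -grad + momentum * d_prev
    return x + lr * d, d
```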
The numerical comparisons in Section 4 show that Algorithm 1 with a constant learning rate performed better than Algorithm 1 with diminishing learning rates. For example, let us consider the text classification task in Section 4.2 and compare AdamCG-C3, defined by (12), with AdamCG-D1, defined by (13). AdamCG-C3 (algorithm (12)) works well for all n, since it uses a constant learning rate. Meanwhile, there is a possibility that AdamCG-D1 (algorithm (13)) does not work for a large number of iterations, because it uses diminishing learning rates. In fact, AdamCG-D1 (algorithm (13)) for a large n reduces to (14), which implies that algorithm (14) makes almost no progress, since the diminishing learning rate causes the update step to vanish. As can be seen in Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12, Algorithm 1 with diminishing learning rates would not be good for training neural networks.
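A small numerical illustration of this point, assuming a constant rate versus an α/√n-type diminishing schedule (the specific schedule is an assumption based on the settings referenced from [7] in Section 4):

```python
import numpy as np

n = np.arange(1, 10001)
alpha_constant = np.full(n.shape, 1e-3)    # constant learning rate
alpha_diminishing = 1e-3 / np.sqrt(n)      # illustrative alpha / sqrt(n) schedule

# After 10,000 iterations the diminishing rate has shrunk by a factor of 100,
# so each update barely moves the parameters, matching the stalled behavior
# observed for the diminishing-learning-rate variants.
print(alpha_constant[-1], alpha_diminishing[-1])   # 0.001 vs 1e-05
```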
Finally, let us compare the existing algorithm with Algorithm 1; in particular, AMSGrad in torch.optim using , , and with AMSGrad-C3 using , , and . The difference between AMSGrad and AMSGrad-C3 is the setting of . According to Figure 7, Figure 8 and Figure 9, AMSGrad-C3 performs comparably to AMSGrad, which is a useful algorithm. These results are guaranteed by Theorem 1, which indicates that Algorithm 1 with a small constant learning rate approximates a stationary point of the minimization problem in deep neural networks; more specifically, the sequence generated by AMSGrad-C3 (Algorithm 1 with ) satisfying approximates .