1. Introduction
The nature of convexity of many machine learning models has not been addressed properly in the literature, particularly when it comes to training the parameters of neural networks. First-order methods such as stochastic gradient descent (SGD) and its variants are the preferred techniques for optimizing neural networks and many other machine learning algorithms. However, these methods do not consider the learning activity of the parameters in the different layers of a neural network. Therefore, there is a need to calculate learning rates mathematically for the individual parameters in a deep neural network and to better understand the learning hierarchy of the different layers of the network. In machine learning applications, including deep learning, a number of different convexity definitions have been presented in the literature (e.g., see a recent review by [1]). In addition, based on the definition of convexity, SGD for empirical risk minimization has been analyzed for convergence to a global optimum for both convex and non-convex loss functions [2].
The nature of strong convexity requires the objective function $f$ to be twice differentiable (i.e., $f \in C^2$) for all of its variables, and the classical Hessian matrix given in Equation (1) can be used for determining convexity results for the function:

$$\nabla^2 f(x) = \left[ \frac{\partial^2 f}{\partial x_i \, \partial x_j} \right]_{i,j=1}^{n}. \quad (1)$$
The iterative SGD method applies the gradient operator to the function $f$ (i.e., $\nabla f$) with the stochastic variables $w_t$ and $\xi_t$ and calculates:

$$w_{t+1} = w_t - \eta_t \, \nabla f(w_t; \xi_t),$$

where $\eta_t$ is the learning rate.
At each evaluation, SGD selects a random training sample from the training dataset; the network output is then computed to evaluate the sub-gradient of the loss function over the selected sample, and the algorithm adjusts the network parameters accordingly [3]. Therefore, determining an efficient learning rate $\eta_t$ is crucial for successfully solving machine learning problems as part of the corresponding SGD algorithm. Strong convexity and $w$-convexity, presented recently by [1], are two of the convexity definitions introduced in the literature for solving stochastic optimization problems. In this work, we use the SGD method to solve the well-known stochastic optimization problem:
$$\min_{w \in \mathbb{R}^d} \; F(w) = \mathbb{E}_{\xi}\left[ f(w; \xi) \right],$$

where $\xi$ is a random variable of a stochastic distribution [4]. One can define $f(w; \xi) = f(w; x_{\xi}, y_{\xi})$ for a given training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $\xi$ is a random variable defined by a single random sample $(x_{\xi}, y_{\xi})$ pulled uniformly from the training set. The empirical risk minimization then reduces to

$$\min_{w \in \mathbb{R}^d} \; F(w) = \frac{1}{n} \sum_{i=1}^{n} f(w; x_i, y_i).$$

The existence of an unbiased gradient estimator (i.e., $\mathbb{E}_{\xi}\left[\nabla f(w; \xi)\right] = \nabla F(w)$) is required for any fixed $w$ to apply SGD in its iterative form above.
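To make this setting concrete, the following minimal Python sketch runs the SGD iteration above on the empirical risk of a least-squares loss; the quadratic loss, the synthetic data, and the fixed learning rate are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def sgd(grad_f, w0, data, lr=0.01, epochs=10, seed=0):
    """Minimal SGD: one uniformly drawn sample per update, fixed learning rate."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    x, y = data
    n = len(y)
    for _ in range(epochs * n):
        i = rng.integers(n)                 # xi_t: a sample pulled uniformly from the training set
        w -= lr * grad_f(w, x[i], y[i])     # w_{t+1} = w_t - eta_t * grad f(w_t; xi_t)
    return w

# Illustrative least-squares example: f(w; x, y) = 0.5 * (w.x - y)^2
grad_f = lambda w, xi, yi: (w @ xi - yi) * xi
x = np.random.default_rng(1).normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = x @ w_true
w = sgd(grad_f, np.zeros(3), (x, y))   # w approaches w_true
```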
We present a brief overview of gradient descent optimization algorithms that mainly contributed to the development of learning-rate schedulers. Duchi et al. introduced the "AdaGrad" method, an adaptive learning rate based on previous knowledge gained from observing the accumulated sum of squared gradients in earlier iterations [5]. The proposed subgradient-based learning improved the robustness of the SGD algorithm by controlling the gradient steps of the algorithm [6]. "AdaDelta" is an enhanced version of "AdaGrad" that restricts the accumulation of past gradients to a fixed window size [7]. This window is implemented as an exponentially decaying average of the squared gradients, which ensures that a separate dynamic learning rate is computed on a per-dimension basis. The adaptive-moment estimation method, "Adam" [8], is designed to combine the heuristic of an exponentially decaying average of past gradients with the exponentially decaying average of past squared gradients of root mean square propagation, "RMSprop" [9]. Adam is observed to be robust and particularly well-suited for non-convex optimization problems. A new variant of the Adam method is "AMSGrad" [10], which relies on the long-term memory of past gradients. AMSGrad was proposed to develop a principled exponential moving average, because it has been shown that reliance on only the past few gradients to update the learning rate can result in poor convergence. A successful contribution to SGD with diminishing learning rates for convex objective functions was made by [1]. The defined framework is characterized by a core property, called curvature. Based on the curvature, a new inequality is derived to find an optimal sequence of learning rates by solving a differential equation.
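As a reference point for these schedulers, the sketch below implements the AdaGrad and Adam per-parameter update rules following [5,8]; the function names, the state-passing style, and the default constants are illustrative choices for the sketch.

```python
import numpy as np

def adagrad_step(w, g, G, lr=0.01, eps=1e-8):
    """AdaGrad: divide by the root of the accumulated sum of squared gradients."""
    G = G + g * g                    # accumulate squared gradients over all iterations
    return w - lr * g / (np.sqrt(G) + eps), G

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: exponentially decaying averages of past gradients and squared gradients."""
    m = b1 * m + (1 - b1) * g        # first moment (decaying average of gradients)
    v = b2 * v + (1 - b2) * g * g    # second moment (decaying average of squared gradients)
    m_hat = m / (1 - b1 ** t)        # bias correction for the zero initialization
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```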
2. CDC and Optimization
CDC is introduced for nonlinear, real-valued, extensible, closed-form functions by [11]. For the sake of completeness, we summarize the CDC results relevant to SGD optimization.
The first difference of an integer-variable function $f : \mathbb{Z}^n \to \mathbb{R}$ is defined by:

$$\Delta_j f(x) = f(x + e_j) - f(x),$$

where $e_j$ represents the integer unit vector at the $j$th position, pointing in the direction of the $j$th dimension of the function $f$. The difference of the first difference, namely the second difference of $f$, is defined by Equation (7) below:

$$\Delta_{ij}^{2} f(x) = \Delta_i \left( \Delta_j f(x) \right) = f(x + e_i + e_j) - f(x + e_i) - f(x + e_j) + f(x). \quad (7)$$
The $n \times n$ discrete Hessian matrix corresponding to the function $f$ contains the second differences, and this matrix is presented in Equation (8) as follows:

$$\mathcal{H}f(x) = \left[ \Delta_{ij}^{2} f(x) \right]_{i,j=1}^{n}. \quad (8)$$
This Hessian matrix is introduced in local settings, and convexity results are obtained for condense-discrete convex functions similar to the convexity results obtained in real convex analysis. The discrete Hessian matrix $\mathcal{H}f$ is shown to satisfy the properties of the Hessian matrix corresponding to real convex functions: it is linear with respect to the condense-discrete functions, symmetric, and vanishes when $f$ is discrete affine. It is also shown that a function is condense-discrete convex if and only if the corresponding discrete Hessian matrix is positive definite in the domain $D$.
To obtain minimization results for a given condense-discrete convex function, we require the function to be strictly condense-discrete convex. Assuming $f$ is a strict condense-discrete convex function, the set of local minima of $f$ forms a set of global minima and vice versa. The importance of applying CDC to stochastic gradient-descent calculations is the elimination of the second differential operator for determining convexity: given a function without knowing its convexity structure, CDC determines the convexity within the domain without calculating derivatives of the function.
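To illustrate the derivative-free check, the sketch below forms the discrete Hessian of Equation (8) from function evaluations alone and tests positive definiteness; the quadratic test function and the small integer grid are illustrative assumptions.

```python
import numpy as np

def discrete_hessian(f, x):
    """Discrete Hessian, Equation (8): H[i, j] = f(x+e_i+e_j) - f(x+e_i) - f(x+e_j) + f(x)."""
    n = len(x)
    e = np.eye(n, dtype=int)
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = f(x + e[i] + e[j]) - f(x + e[i]) - f(x + e[j]) + f(x)
    return H

def is_cdc_at(f, x, tol=1e-12):
    """Positive definiteness of the (symmetric) discrete Hessian at an integer point x."""
    return np.all(np.linalg.eigvalsh(discrete_hessian(f, x)) > tol)

# Illustrative check on a convex quadratic over a small integer grid
f = lambda x: float(x @ x + x[0] * x[1])
points = [np.array(p) for p in np.ndindex(3, 3)]
print(all(is_cdc_at(f, x) for x in points))   # True: f is condense-discrete convex here
```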
2.1. SGD and CDC Functions
Given a function $f$ that satisfies the extensibility assumption, the condense-discrete convexity of $f$ can be checked by showing that the corresponding discrete Hessian matrix $\mathcal{H}f$ is positive definite; therefore, CDC allows for convexity calculations using simple mathematical operations. The existence of $\mathcal{H}f$ depends on this assumption.
In this section, we use the definition of condense-discrete convexity as part of the SGD algorithm and its application. Using the iterative procedure

$$w_{t+1} = w_t - \eta_t \nabla f(w_t),$$

where $\eta_t > 0$ for $t = 0, 1, 2, \ldots$, and using the definition of the first difference of $f$, we introduce:

$$\Delta_j f(w_t) = f(w_t + e_j) - f(w_t).$$

Noting that the iterative procedure follows the directional method, we use the $j$th entry $\partial f / \partial w_j$ of the gradient vector $\nabla f$; therefore, we attain:

$$w_{t+1} = w_t - \eta_t \nabla f(w_t)$$

in the vector form and

$$w_{t+1, j} = w_{t, j} - \eta_{t, j} \, \frac{\partial f}{\partial w_j}(w_t)$$

in the scalar form towards the $j$th direction satisfying the first difference $\Delta_j f$. By using the function differentiation definition, the differential of $f$ can be approximated by choosing a sufficiently small $h_j$ such that

$$\frac{\partial f}{\partial w_j}(w_t) \approx \frac{f(w_t + h_j e_j) - f(w_t)}{h_j},$$

indicating

$$w_{t+1, j} = w_{t, j} - \gamma \, \frac{f(w_t + h_j e_j) - f(w_t)}{h_j}, \quad (14)$$

where $h_j$ is the step size of the iterative procedure in the $j$th dimension of the directional derivative and $\gamma$ is a non-negative scalable parameter. The parameter $\gamma$ is a key tuning component that is essential for defining the step size and for adjusting it based on the algorithmic solution.
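A minimal sketch of the update in Equation (14) as reconstructed above, assuming a per-parameter step-size vector `h` and a scalar `gamma`; the first difference replaces the partial derivative, so no derivative code is needed.

```python
import numpy as np

def cdc_sgd_step(f, w, h, gamma):
    """Equation (14): per-parameter update from first differences instead of derivatives."""
    w_next = w.copy()
    fw = f(w)
    for j in range(len(w)):
        e_j = np.zeros_like(w)
        e_j[j] = 1.0
        first_diff = f(w + h[j] * e_j) - fw            # Delta_j f(w_t) with step h_j
        w_next[j] = w[j] - gamma * first_diff / h[j]   # effective learning rate gamma / h_j
    return w_next

# Illustrative run on a convex quadratic
f = lambda w: float(w @ w)
w = np.array([2.0, -3.0])
for _ in range(50):
    w = cdc_sgd_step(f, w, h=np.array([1e-3, 1e-3]), gamma=0.1)
print(w)   # approaches the minimizer [0, 0]
```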
2.2. CDC Examples
This section presents the condense-discrete convexity of the logistic regression examples that are shown to be convex, $w$-convex, and strongly convex by [1]. These examples are used for attaining the experimental results in Section 3. They are given in Equations (15)-(18), where Equation (15) is a convex function, Equations (16) and (17) are $w$-convex functions, and Equation (18) is strongly convex. The following calculations prove that Equation (15) is CDC:
For simplicity, we introduce a substitution and compute the second difference of Equation (15) directly. The resulting second difference is non-negative under conditions that naturally hold in a dataset with non-negative input $x$ and output $y$; therefore, Equation (15) is condense-discrete convex.
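As a simple illustration of the mechanics of such a calculation (a generic example, not one of the loss functions from [1]), the second difference of the one-dimensional convex function $f(x) = x^2$ is constant and positive:

$$\Delta^2 f(x) = (x+2)^2 - 2(x+1)^2 + x^2 = 2 > 0 \quad \text{for all } x \in \mathbb{Z},$$

so $f(x) = x^2$ is condense-discrete convex on $\mathbb{Z}$.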
Now, we explain the condense-discrete convexity of the function in Equation (16). It is shown by [12] that the first term of Equation (16) is a condense-discrete convex function; noting that the second term is also CDC, the summation of the two functions, Equation (16), is also CDC. Next, similar second-difference calculations show that Equations (17) and (18) are also CDC functions. Therefore, the convex, $w$-convex, and strongly convex examples we examined are condense-discrete convex functions.
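The second-difference conditions above can also be checked numerically. The sketch below tests a regularized logistic loss on a grid of integer weight points; the specific loss form $\ln(1 + e^{-y\, w^{\top} x}) + \frac{\lambda}{2}\|w\|^2$ and the sample data are assumptions for illustration, not necessarily the exact Equations (15)-(18).

```python
import numpy as np

def logistic_reg_loss(w, x, y, lam=0.1):
    """Assumed regularized logistic loss: ln(1 + exp(-y * w.x)) + (lam/2) * ||w||^2."""
    return float(np.log1p(np.exp(-y * (w @ x))) + 0.5 * lam * (w @ w))

def second_diff(f, w, i, j):
    """Entry (i, j) of the discrete Hessian, Equation (8)."""
    e = np.eye(len(w))
    return f(w + e[i] + e[j]) - f(w + e[i]) - f(w + e[j]) + f(w)

x, y = np.array([1.0, 2.0]), 1.0           # one non-negative sample, as in the proof above
f = lambda w: logistic_reg_loss(w, x, y)
for p in np.ndindex(3, 3):                 # small grid of integer weight points
    w = np.array(p, dtype=float)
    H = np.array([[second_diff(f, w, i, j) for j in range(2)] for i in range(2)])
    assert np.all(np.linalg.eigvalsh(H) > 0), f"not positive definite at {w}"
print("discrete Hessian positive definite on the grid")
```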
2.3. Learning-Rate Estimation
In this work, we utilize the Hassan–Nelder–Mead (HNM) algorithm to tune the hyperparameters of Equation (14) and to help estimate a set of optimal learning rates for the different weights and biases of the loss functions [13,14,15]. The HNM algorithm is a variant of the well-known Nelder–Mead algorithm [16] that allows the $n$-dimensional simplex to break down into a set of trigonometric simplex designs that work sequentially to locate a minimum of a nonlinear function. In addition, the HNM algorithm has delivered higher accuracy than the well-known MATLAB function "fminsearch" for handling unconstrained optimization problems. To create the trigonometric simplex designs of the HNM algorithm, we need to generate five vertices that represent five different initialization points in $n$-dimensional space; these are the points $w_1, w_2, \ldots, w_5$. In this particular case, the vertex parameters are the parameters of the neural network, including weights and biases. After creating the vertices of the HNM algorithm, we arrange them in ascending order according to the values of the objective function.
In the above design of the learning-rate scheduler, we have noticed from Equation (14) that, if the SGD algorithm is to proceed successfully to the next iteration, a good vector of step sizes has to be extracted by exploring the solution space of any of the convex objective functions defined in the previous section. So, if we assume the starting vector $w_t$ of the randomly initialized parameters of a deep network is the initial vertex of the HNM algorithm, then we can generate the other vertices using Pfeffer's method [17] and run the HNM algorithm to explore the neighborhood around $w_t$. For example, suppose that the objective function is convex and defined as in Equation (15). For a given training set, we allow the HNM simplex optimization to explore the solution space and extract different features of non-isometric reflections for the next vector $w_{t+1}$, which has a lower function evaluation than $w_t$. After the vector $w_{t+1}$ is determined, the optimal values of $h_j$ and the constant value of $\gamma$ can be found to adjust the learning-rate scheduler for the next iteration. Hence, the values of the learning rates are calculated for each iteration from the tuned values of $h_j$ and $\gamma$.
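A minimal sketch of this per-iteration tuning loop, using SciPy's standard Nelder–Mead implementation as a stand-in for the HNM algorithm (whose trigonometric simplex designs are not reproduced here); the illustrative convex loss, the fixed scalar $\gamma$, and the search over log step sizes are assumptions for the sketch, not the authors' exact procedure.

```python
import numpy as np
from scipy.optimize import minimize

def cdc_update(f, w, h, gamma):
    """Per-parameter update of Equation (14) with step sizes h and scale gamma."""
    fw, e = f(w), np.eye(len(w))
    diffs = np.array([(f(w + h[j] * e[j]) - fw) / h[j] for j in range(len(w))])
    return w - gamma * diffs

def tune_step_sizes(f, w, gamma=0.1):
    """Search per-parameter step sizes h_j (in log space) so the next iterate is lowest."""
    objective = lambda log_h: f(cdc_update(f, w, np.exp(log_h), gamma))
    res = minimize(objective, x0=np.full(len(w), np.log(1e-2)),
                   method="Nelder-Mead")        # stand-in for the HNM simplex designs
    return np.exp(res.x)

f = lambda w: float(w @ w + np.log1p(np.exp(-w[0])))   # illustrative convex loss
w = np.array([1.0, -2.0])
for _ in range(20):
    h = tune_step_sizes(f, w)                # learning rates re-tuned at each iteration
    w = cdc_update(f, w, h, gamma=0.1)
print(w, f(w))
```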
If we train the network to learn the characteristics of an objective function for a particular case or dataset by forcing the network to update its parameters with a single learning rate, then the network may converge to a non-stationary point or fall into a local minimum. In contrast, our solution examines the solution space for the optimal step sizes that the individual parameters of the network can use to reach an optimal point. Some parameters can push the learning process faster than others and therefore need larger learning rates, while others need to slow down and thus need smaller learning rates. The advantage of the HNM algorithm is that it generates independent trigonometric simplex designs that extract distinct non-isometric reflections to sequentially estimate different, adaptive learning rates for the parameters of the network.
3. Discussion
This section is devoted to experimental results, testing the CDC results and the learning-rate estimation introduced in the previous section using the HNM algorithm. In addition, the test is designed to examine the performance of the proposed learning-rate scheduler on a logistic regression dataset, "mushroom", introduced by [18]. The proposed framework includes modules for data cleanup, preprocessing, and normalization. To provide a comprehensive evaluation of the performance of the proposed learning-rate scheduler, we conduct four experiments on the "mushroom" dataset from the UCI machine learning repository, which is a binary classification problem. We test Equation (14) on the convex logistic regression examples given in the previous section that were shown to be CDC. The additional tests compare the proposed learning-rate scheduler for the adaptive SGD algorithm to state-of-the-art models such as [1].
The results given in Figure 1 indicate that the proposed learning-rate scheduler has helped the hidden layers of the network adapt efficiently to an optimal solution. It is also observed that the network shows fast convergence rates as the different weight and bias parameters of the network are characterized by different learning rates. Our solution confirms known results in the literature, such as the previous study introduced by [1], which indicates that the optimal performance of training a neural network is obtained by a diminishing step-size scheduler as the network progresses in terms of evaluations. In [1], a new definition of the curvature of convex objective functions is presented, and the value of the curvature property determines the optimal learning rates for deep networks; the best step size for Equation (15) is determined in [1]. In this work, however, the trigonometric simplex designs explore the solution space of the loss function around the neighborhood of the network parameter values and determine the optimal sequence of the learning-rate scheduler based on the CDC definition.
The experimental results on Equations (16) and (17) are shown in Figure 2 and Figure 3, reflecting the computational results on the regularized and unregularized logistic regression examples. These test results demonstrate the successful contribution of the CDC definition to estimating a vector of optimal learning rates for the different weights and biases, and they have resulted in an efficient deep learning network architecture. When comparing our results to those of [1], the pattern of the learning rates is similar when exploring the $w$-convex loss functions (Equations (16) and (17)); however, the scale of the diminishing learning rates presented in this study is more efficient than that of [1] in stabilizing training and accelerating convergence, owing to the use of the simplex optimization method to explore the properties of the objective functions. Thus, the adaptive step sizes provide different learning activities for the parameters of the network, compared to the use of a single diminishing step size to update all network parameters. In particular, [1] proposes specific optimal step sizes for Equations (16) and (17), respectively.
Figure 4 displays the experimental results of our theoretical framework on the function given in Equation (18). The empirical results show that the proposed learning-rate scheduler achieves remarkable success in obtaining a faster convergence rate for the SGD method. In addition, the idea of adapting network parameters to various levels of learning enhances the effectiveness of the neural network for analyzing convex optimization problems. CDC-based updates of the learning rates proved to perform better than a single-rate method for adjusting the network parameters. The main difficulty of adjusting network parameters based on a single adaptive rate comes from the fact that the objective function of a multilayer network is not an explicit function of the weights and biases in the hidden layers. The effectiveness of our framework is characterized by allowing the parameters of the objective function to be tuned with respect to the architecture and performance of the deep neural network.
4. Conclusions
In this work, we introduced a new convexity definition to calculate the learning-rate scheduler for the SGD method. The convexity method and the learning rate are directly related to each other, and, to the best of our knowledge, this work is the first to introduce such a relationship between convexity and the learning rate: the learning-rate calculations follow from the first difference of a given function, with a modification using a tuning parameter, while the condense-discrete convexity determination follows from the second difference of the given function. The developed theory incorporates CDC with a sequence of trigonometric simplex designs to explore various characteristics of convex, $w$-convex, and strongly convex loss functions and to determine an optimal vector of learning rates for SGD to adjust the network parameters. In fact, the proposed learning-rate scheduler can be used for other convex optimization applications pertaining to deep learning and pattern analysis. The four functions used by [1] for computational results are also used for attaining the numerical results in this work, after showing that they are CDC. The computational results showed that the different parameters of the network can increase their adaptation at various levels of the learning hierarchy when they are characterized by different step sizes. Finally, the proposed optimization solution has an advantage over the solution attained by [1]: it works on a given problem without knowing its curvature conditions, although it requires statistical estimation tests based on trigonometric simplex evaluations.