Article

A New Machine Learning Algorithm Based on Optimization Method for Regression and Classification Problems

Data Science Research Center, Department of Mathematics, Faculty of Science, Chiang Mai University, Chiang Mai 50200, Thailand
* Author to whom correspondence should be addressed.
Mathematics 2020, 8(6), 1007; https://doi.org/10.3390/math8061007
Submission received: 18 May 2020 / Revised: 11 June 2020 / Accepted: 14 June 2020 / Published: 19 June 2020

Abstract

A convex minimization problem in the form of the sum of two proper lower-semicontinuous convex functions has received much attention from the optimization community due to its broad applications in many disciplines, such as machine learning, regression and classification problems, image and signal processing, compressed sensing and optimal control. Many methods have been proposed to solve such problems, but most of them rely on a Lipschitz continuity assumption on the gradient of one of the two functions. In this work, we introduce a new accelerated algorithm for solving the mentioned convex minimization problem by using a linesearch technique together with a viscosity inertial forward–backward algorithm (VIFBA). A strong convergence result of the proposed method is obtained under some control conditions. As applications, we apply our proposed method to solve regression and classification problems by using an extreme learning machine model. Moreover, we show that our proposed algorithm has more efficiency and better convergence behavior than some algorithms mentioned in the literature.

1. Introduction

In this work, we are dealing with a convex minimization problem, which can be formulated as
$\min_{x \in H} \{ f(x) + g(x) \}$,   (1)
where $f, g : H \to \mathbb{R} \cup \{+\infty\}$ are proper lower-semicontinuous convex functions and $H$ is a Hilbert space. Many real world problems, such as signal processing, image reconstruction and compressed sensing, can be described using this model [1,2,3,4]. Moreover, data classification can also be formulated as (1); for more information about the importance and development of data classification and its methods, see [5,6,7,8]. Therefore, a convex minimization problem has a wide range of applications, some of which will be studied in this research.
If $f$ is differentiable, then it is well known that an element $x \in H$ is a solution of (1) if and only if
$x = \mathrm{prox}_{\alpha g}(I - \alpha \nabla f)(x)$,   (2)
where $\alpha > 0$, $\mathrm{prox}_{\alpha g}(x) = J_{\alpha}^{\partial g}(x) = (I + \alpha \partial g)^{-1}(x)$, $I$ is the identity mapping, and $\partial g$ is the subdifferential of $g$. In addition, if $\nabla f$ is $L$-Lipschitz continuous, then the classical forward–backward algorithm [9] can be used to solve (1). It is defined as follows:
$x_{n+1} = \mathrm{prox}_{\alpha_n g}(I - \alpha_n \nabla f)(x_n)$,   (3)
where $\alpha_n$ is a suitable stepsize. This method has been used extensively due to its simplicity; as a result, it has been improved in many works, as seen in [2,10,11,12]. One well-known method that has significantly improved the convergence rate of (3) is the fast iterative shrinkage-thresholding algorithm (FISTA), proposed by Beck and Teboulle [13] and stated as Algorithm 1.
Algorithm 1. FISTA.
 1: Input $x_1 = y_0 \in H$, $t_1 = 1$, $L > 0$, $k$ = number of iterations.
 2: for $n = 1$ to $k$ do
 3:     $y_n = \mathrm{prox}_{\frac{1}{L} g}\big(x_n - \frac{1}{L}\nabla f(x_n)\big)$,
 4:     $t_{n+1} = \frac{1 + \sqrt{1 + 4 t_n^2}}{2}$,
 5:     $\theta_n = \frac{t_n - 1}{t_{n+1}}$,
 6:     $x_{n+1} = y_n + \theta_n (y_n - y_{n-1})$.
 7: end for
 8: return $x_{n+1}$
They proved that FISTA has a better convergence rate than (3); however, a convergence theorem for this method was not given. Recently, Liang and Schönlieb [14] modified FISTA by setting $t_{n+1} = \frac{p + \sqrt{q + r t_n^2}}{2}$, where $p, q > 0$ and $0 < r \leq 4$, and proved a weak convergence theorem for it.
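To make the update steps of Algorithm 1 concrete, the following sketch (ours, not part of the original presentation) applies FISTA to the LASSO objective $\|Ax - b\|_2^2 + \lambda\|x\|_1$ used later in Section 4, for which $\mathrm{prox}_{\alpha g}$ is soft-thresholding and $L = 2\|A\|_2^2$ is a Lipschitz constant of $\nabla f$; all function and variable names are illustrative.

```python
import numpy as np

def soft_threshold(x, tau):
    # Proximal operator of tau * ||.||_1 (soft-thresholding).
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def fista_lasso(A, b, lam, iters=1000):
    # Algorithm 1 (FISTA) for f(x) = ||Ax - b||^2 and g(x) = lam * ||x||_1.
    L = 2 * np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])                   # x_1
    y_prev = x.copy()                          # y_0
    t = 1.0                                    # t_1
    for _ in range(iters):
        grad = 2 * A.T @ (A @ x - b)                   # grad f(x_n)
        y = soft_threshold(x - grad / L, lam / L)      # y_n = prox_{(1/L)g}(x_n - grad/L)
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2      # t_{n+1}
        x = y + ((t - 1) / t_next) * (y - y_prev)      # x_{n+1} = y_n + theta_n (y_n - y_{n-1})
        y_prev, t = y, t_next
    return x
```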
In the case that $H$ is an infinite-dimensional Hilbert space, weak convergence results may not be enough; consequently, modifications of some algorithms are needed to obtain strong convergence results. There are several ways to modify the methods for this purpose; for more information see [15,16,17,18]. One method that caught our attention is the viscosity-based inertial forward–backward algorithm (VIFBA) proposed by Verma et al. [19], stated as Algorithm 2.
Algorithm 2. VIFBA.
 1: Input $x_0, x_1 \in H$, $L > 0$, $\beta_n \geq 0$, $\alpha_n \in (0, 1)$, $\gamma_n \in (0, \frac{2}{L})$, $c$-contractive mapping $F$,
      $k$ = number of iterations.
 2: for $n = 1$ to $k$ do
 3:     $y_n = x_n + \beta_n (x_n - x_{n-1})$,
 4:     $z_n = \alpha_n F(y_n) + (1 - \alpha_n) y_n$,
 5:     $x_{n+1} = \mathrm{prox}_{\gamma_n g}(z_n - \gamma_n \nabla f(z_n))$.
 6: end for
 7: return $x_{n+1}$
They proved strong convergence of this algorithm if the following conditions are satisfied for all $n \in \mathbb{N}$:
A1. $\lim_{n \to \infty} \alpha_n = 0$ and $\sum_{n=1}^{\infty} \alpha_n = \infty$,
A2. $\gamma_n \in (0, \frac{2}{L})$ and $\sum_{n=1}^{\infty} |\gamma_n - \gamma_{n+1}| < \infty$,
A3. $\sum_{n=1}^{\infty} \|x_{n+1} - x_n\|^2 < \infty$ and $\lim_{n \to \infty} \frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\| = 0$.
Note that all the methods mentioned above require $\nabla f$ to be $L$-Lipschitz continuous, a constant that can be difficult to find in general. Therefore, some improvements are still desirable.
Very recently, Cruz and Nghia [20] proposed a linesearch technique which can be used to eliminate the $L$-Lipschitz continuity assumption on $\nabla f$ and replace it with weaker assumptions. In their work, the following conditions are needed instead:
B1. $f, g$ are proper lower semicontinuous convex functions with $\mathrm{dom}\, g \subseteq \mathrm{dom}\, f$,
B2. $f$ is differentiable on an open set containing $\mathrm{dom}\, g$, and $\nabla f$ is uniformly continuous on any bounded subset of $\mathrm{dom}\, g$ and maps any bounded subset of $\mathrm{dom}\, g$ to a bounded set in $H$.
The linesearch step is defined in Algorithm 3 as follows.
Algorithm 3. Linesearch 1.
 1: Input $x \in \mathrm{dom}\, g$, $\sigma > 0$, $\theta \in (0, 1)$, and $\delta \in (0, \frac{1}{2})$.
 2: Set $\alpha = \sigma$.
 3: while $\alpha \|\nabla f(\mathrm{prox}_{\alpha g}(x - \alpha \nabla f(x))) - \nabla f(x)\| > \delta \|\mathrm{prox}_{\alpha g}(x - \alpha \nabla f(x)) - x\|$ do
 4:     $\alpha = \theta \alpha$.
 5: end while
 6: return $\alpha$
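A minimal sketch of Linesearch 1 in code (ours, for illustration only), assuming callables grad_f and prox_g(v, alpha) for $\nabla f$ and $\mathrm{prox}_{\alpha g}$ are available; it backtracks $\alpha = \sigma, \theta\sigma, \theta^2\sigma, \ldots$ until the inequality of line 3 fails to hold.

```python
import numpy as np

def linesearch1(x, sigma, theta, delta, grad_f, prox_g):
    # Linesearch 1: shrink alpha until
    #   alpha * ||grad_f(J(x, alpha)) - grad_f(x)|| <= delta * ||J(x, alpha) - x||,
    # where J(x, alpha) = prox_{alpha g}(x - alpha * grad_f(x)).
    alpha = sigma
    gx = grad_f(x)
    while True:
        J = prox_g(x - alpha * gx, alpha)
        if alpha * np.linalg.norm(grad_f(J) - gx) <= delta * np.linalg.norm(J - x):
            return alpha
        alpha *= theta
```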
They also proved that Linesearch 1 stops after finitely many steps, and proposed Algorithm 4 as follows.
Algorithm 4.
 1: Input $x_1 \in \mathrm{dom}\, g$, $\sigma > 0$, $\theta \in (0, 1)$, $\delta \in (0, \frac{1}{2})$, $k$ = number of iterations.
 2: for $n = 1$ to $k$ do
 3:     $\gamma_n = \text{Linesearch 1}(x_n, \sigma, \theta, \delta)$,
 4:     $x_{n+1} = \mathrm{prox}_{\gamma_n g}(I - \gamma_n \nabla f)(x_n)$.
 5: end for
 6: return $x_{n+1}$
They also proved a weak convergence theorem for it. Again, weak convergence may not be enough in the context of an infinite-dimensional space.
As we know, most of the work related to convex minimization problems assumes the $L$-Lipschitz continuity of $\nabla f$. This restriction can be relaxed using a linesearch technique. We are therefore motivated to establish a novel accelerated algorithm for solving the convex minimization problem (1), which employs the linesearch technique introduced by Cruz and Nghia [20] together with VIFBA [19]. The novelty of our proposed method is a suitable combination of the two methods that yields a fast and efficient method for solving (1). We improve Algorithm 4 by adding an inertial step, which enhances the performance of the algorithm. We also prove a strong convergence theorem under weaker assumptions on the control conditions than those of VIFBA. More precisely, we eliminate assumption A2 and replace A3 with a weaker assumption. As applications, we apply our main result to solve a data classification problem and the regression of a sine function. Then we compare the performance of our algorithm with FISTA, VIFBA, and Algorithm 4.
This work is organized as follows: In Section 2, we recall some useful concepts related to the topic. In Section 3, we provide a new algorithm and prove its strong convergence to a solution of (1). In Section 4, we conduct some numerical experiments with a data classification problem and a regression of a sine function and compare the performance of each algorithm (FISTA, VIFBA, Algorithms 4 and 5). Finally, the conclusion of this work is in Section 5.

2. Preliminaries

In this section, we review some important tools which will be used in the later sections. Throughout this paper, we write $x_n \to x$ and $x_n \rightharpoonup x$ for strong and weak convergence of $\{x_n\}$ to $x$, respectively.
A mapping $T : H \to H$ is said to be $L$-Lipschitz continuous if there exists $L > 0$ such that
$\|Tx - Ty\| \leq L\|x - y\|$, for all $x, y \in H$.
For $x \in H$, the subdifferential of $h$ at $x$ is defined as follows:
$\partial h(x) := \{ u \in H : \langle u, y - x \rangle + h(x) \leq h(y), \ \forall y \in H \}$.
It is known from [21] that the subdifferential $\partial h$ is maximal monotone. Moreover, the graph of $\partial h$, $\mathrm{Gph}(\partial h) := \{ (u, v) \in H \times H : v \in \partial h(u) \}$, is demiclosed, i.e., for any sequence $(u_n, v_n) \subseteq \mathrm{Gph}(\partial h)$ such that $\{u_n\}$ converges weakly to $u$ and $\{v_n\}$ converges strongly to $v$, we have $(u, v) \in \mathrm{Gph}(\partial h)$.
The proximal operator $\mathrm{prox}_{g} : H \to \mathrm{dom}\, g$ with $\mathrm{prox}_{g}(x) = (I + \partial g)^{-1}(x)$ is single-valued with full domain. Moreover, the following is satisfied:
$\frac{x - \mathrm{prox}_{\alpha g}(x)}{\alpha} \in \partial g(\mathrm{prox}_{\alpha g}(x))$, for all $x \in H$ and $\alpha > 0$.   (4)
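For example, when $g(x) = \lambda\|x\|_1$ (the regularizer used in Section 4), the proximal operator has the closed form $(\mathrm{prox}_{\alpha g}(x))_i = \mathrm{sign}(x_i)\max(|x_i| - \alpha\lambda, 0)$; the small sketch below (ours, for illustration) evaluates it and shows how the inclusion (4) can be read off componentwise.

```python
import numpy as np

def prox_l1(x, alpha, lam):
    # Closed-form prox of g = lam * ||.||_1 with stepsize alpha (soft-thresholding).
    return np.sign(x) * np.maximum(np.abs(x) - alpha * lam, 0.0)

x = np.array([1.5, -0.2, 0.05])
p = prox_l1(x, alpha=1.0, lam=0.1)     # -> [ 1.4, -0.1,  0.  ]
u = (x - p) / 1.0                      # -> [ 0.1, -0.1,  0.05]
# Inclusion (4): u lies in lam * subdifferential of ||.||_1 at p, since each
# component equals lam * sign(p_i) when p_i != 0 and lies in [-lam, lam] when p_i = 0.
```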
The following lemmas are crucial for the main results.
Lemma 1
([22]). Let $f, g : H \to \mathbb{R}$ be two proper lower semicontinuous convex functions with $\mathrm{dom}\, g \subseteq \mathrm{dom}\, f$ and $J(x, \alpha) = \mathrm{prox}_{\alpha g}(x - \alpha \nabla f(x))$. Then, for any $x \in \mathrm{dom}\, g$ and $\alpha_2 \geq \alpha_1 > 0$, we have
$\frac{\alpha_2}{\alpha_1}\|x - J(x, \alpha_1)\| \geq \|x - J(x, \alpha_2)\| \geq \|x - J(x, \alpha_1)\|$.
Lemma 2
([23]). Let $H$ be a real Hilbert space. Then the following hold, for all $x, y \in H$ and $\alpha \in [0, 1]$:
(i) $\|x \pm y\|^2 = \|x\|^2 \pm 2\langle x, y \rangle + \|y\|^2$,
(ii) $\|\alpha x + (1 - \alpha) y\|^2 = \alpha\|x\|^2 + (1 - \alpha)\|y\|^2 - \alpha(1 - \alpha)\|x - y\|^2$,
(iii) $\|x + y\|^2 \leq \|x\|^2 + 2\langle y, x + y \rangle$.
Lemma 3
([24]). Let $\{a_n\}$ be a sequence of real numbers such that there exists a subsequence $\{a_{m_j}\}$ of $\{a_n\}$ with $a_{m_j} < a_{m_j + 1}$ for all $j \in \mathbb{N}$. Then there exists a nondecreasing sequence $\{n_k\}$ of $\mathbb{N}$ such that $\lim_{k \to \infty} n_k = \infty$ and, for all sufficiently large $k \in \mathbb{N}$, the following hold:
$a_{n_k} \leq a_{n_k + 1}$ and $a_k \leq a_{n_k + 1}$.
Lemma 4
([25]). Let $\{a_n\}$ be a sequence of nonnegative real numbers, $\{\alpha_n\}$ a sequence in $(0, 1)$ with $\sum_{n=1}^{\infty} \alpha_n = \infty$, $\{b_n\}$ a sequence of nonnegative real numbers with $\sum_{n=1}^{\infty} b_n < \infty$, and $\{\zeta_n\}$ a sequence of real numbers with $\limsup_{n \to \infty} \zeta_n \leq 0$. Suppose that
$a_{n+1} \leq (1 - \alpha_n) a_n + \alpha_n \zeta_n + b_n$
for all $n \in \mathbb{N}$; then $\lim_{n \to \infty} a_n = 0$.

3. Main Results

In this section, we assume the existence of a solution of (1) and denote by $S^*$ the set of all such solutions. It is known that $S^*$ is closed and convex. We propose a new algorithm, obtained by combining a linesearch technique (Linesearch 1) with VIFBA, as seen in Algorithm 5. A diagram of this algorithm can be seen in Figure 1.
Algorithm 5.
 1: Input $x_0, x_1 \in H$, $\sigma > 0$, $\theta \in (0, 1)$, $\delta \in (0, \frac{1}{2})$, $\beta_n \geq 0$, $\alpha_n \in (0, 1)$, $c$-contractive mapping
      $F$, $k$ = number of iterations.
 2: for $n = 1$ to $k$ do
 3:     $y_n = x_n + \beta_n (x_n - x_{n-1})$,
 4:     $z_n = \alpha_n F(y_n) + (1 - \alpha_n) y_n$,
 5:     $\gamma_n = \text{Linesearch 1}(z_n, \sigma, \theta, \delta)$,
 6:     $x_{n+1} = \mathrm{prox}_{\gamma_n g}(z_n - \gamma_n \nabla f(z_n))$.
 7: end for
 8: return $x_{n+1}$
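A minimal sketch of Algorithm 5 (ours, for illustration) in the LASSO setting of Section 4, with $f(x) = \|Hx - T\|_2^2$, $g(x) = \lambda\|x\|_1$, the contraction $F(x) = cx$, and the parameter choices $\alpha_n = 1/n$ and $\beta_n = \beta_n^4$ used in the experiments; all names are illustrative and no claim is made that this reproduces the authors' implementation.

```python
import numpy as np

def prox_l1(v, alpha, lam):
    # prox of lam * ||.||_1 with stepsize alpha (soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)

def linesearch1(z, sigma, theta, delta, grad_f, prox):
    # Linesearch 1 from Algorithm 3.
    alpha, gz = sigma, grad_f(z)
    while True:
        J = prox(z - alpha * gz, alpha)
        if alpha * np.linalg.norm(grad_f(J) - gz) <= delta * np.linalg.norm(J - z):
            return alpha
        alpha *= theta

def algorithm5(H, T, lam, iters, sigma=0.49, theta=0.1, delta=0.1, c=0.999):
    grad_f = lambda x: 2 * H.T @ (H @ x - T)      # gradient of f(x) = ||Hx - T||^2
    prox = lambda v, a: prox_l1(v, a, lam)        # prox of g(x) = lam * ||x||_1
    x_prev = np.zeros(H.shape[1])                 # x_0
    x = np.zeros(H.shape[1])                      # x_1
    for n in range(1, iters + 1):
        alpha_n = 1.0 / n                                       # satisfies C1
        d = np.linalg.norm(x - x_prev)
        beta_n = 1e10 / (d ** 3 + n ** 3 + 1e10)                # beta_n^4 of Section 4
        y = x + beta_n * (x - x_prev)                           # inertial step
        z = alpha_n * (c * y) + (1 - alpha_n) * y               # viscosity step, F(y) = c*y
        gamma_n = linesearch1(z, sigma, theta, delta, grad_f, prox)
        x_prev, x = x, prox(z - gamma_n * grad_f(z), gamma_n)   # forward-backward step
    return x
```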
We prove a strong convergence result of Algorithm 5 in Theorem 1 as follows.
Theorem 1.
Let $H$ be a Hilbert space, $g : H \to \mathbb{R} \cup \{+\infty\}$ proper lower-semicontinuous convex, and $f : H \to \mathbb{R} \cup \{+\infty\}$ proper convex and differentiable with $\nabla f$ uniformly continuous on any bounded subset of $H$. Suppose the following hold:
C1. $\lim_{n \to \infty} \alpha_n = 0$ and $\sum_{n=1}^{\infty} \alpha_n = \infty$,
C2. $\lim_{n \to \infty} \frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\| = 0$.
Then the sequence $\{x_n\}$ generated by Algorithm 5 converges strongly to $x^* = P_{S^*} F(x^*)$.
Proof. 
Since $S^*$ is closed and convex, the mapping $P_{S^*} F$ has a fixed point. Let $x^* = P_{S^*} F(x^*)$. By the definition of $x_n$, $y_n$ and $z_n$, we obtain the following, for all $n \in \mathbb{N}$:
$\|x_n - y_n\| = \beta_n \|x_n - x_{n-1}\| = \alpha_n \frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\| \to 0$ as $n \to \infty$,   (5)
$\|z_n - y_n\| = \alpha_n \|F(y_n) - y_n\|$,   (6)
and
$\|y_n - x^*\| \leq \|x_n - x^*\| + \alpha_n \frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\| \leq \|x_n - x^*\| + \alpha_n M_1$,   (7)
for some $M_1 \geq \frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\|$. The following also holds, for all $n \in \mathbb{N}$:
$\|y_n - x^*\|^2 \leq \|x_n - x^*\|^2 + 2\beta_n \|x_n - x^*\|\|x_n - x_{n-1}\| + \beta_n^2 \|x_n - x_{n-1}\|^2 = \|x_n - x^*\|^2 + \frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\|\big(2\alpha_n \|x_n - x^*\| + \alpha_n \beta_n \|x_n - x_{n-1}\|\big)$.   (8)
Next, we prove the following:
$\|z_n - x^*\|^2 - \|x_{n+1} - x^*\|^2 \geq (1 - 2\delta)\|x_{n+1} - z_n\|^2$.   (9)
Indeed, from (4) and the definition of $\partial g$, we obtain
$\frac{z_n - x_{n+1}}{\gamma_n} - \nabla f(z_n) \in \partial g(x_{n+1})$, therefore
$g(x^*) - g(x_{n+1}) \geq \big\langle \frac{z_n - x_{n+1}}{\gamma_n} - \nabla f(z_n), x^* - x_{n+1} \big\rangle$, also
$f(x^*) - f(z_n) \geq \langle \nabla f(z_n), x^* - z_n \rangle$, and
$f(z_n) - f(x_{n+1}) \geq \langle \nabla f(x_{n+1}), z_n - x_{n+1} \rangle$.
Hence, by using the above inequalities and the definition of $\gamma_n$, we have
$f(x^*) - f(z_n) + g(x^*) - g(x_{n+1}) \geq \frac{1}{\gamma_n}\langle z_n - x_{n+1}, x^* - x_{n+1} \rangle + \langle \nabla f(z_n), x_{n+1} - z_n \rangle$
$= \frac{1}{\gamma_n}\langle z_n - x_{n+1}, x^* - x_{n+1} \rangle + \langle \nabla f(z_n) - \nabla f(x_{n+1}), x_{n+1} - z_n \rangle + \langle \nabla f(x_{n+1}), x_{n+1} - z_n \rangle$
$\geq \frac{1}{\gamma_n}\langle z_n - x_{n+1}, x^* - x_{n+1} \rangle - \|\nabla f(z_n) - \nabla f(x_{n+1})\|\,\|x_{n+1} - z_n\| + \langle \nabla f(x_{n+1}), x_{n+1} - z_n \rangle$
$\geq \frac{1}{\gamma_n}\langle z_n - x_{n+1}, x^* - x_{n+1} \rangle - \frac{\delta}{\gamma_n}\|x_{n+1} - z_n\|^2 + f(x_{n+1}) - f(z_n)$.
Consequently,
$\frac{1}{\gamma_n}\langle z_n - x_{n+1}, x_{n+1} - x^* \rangle \geq (f + g)(x_{n+1}) - (f + g)(x^*) - \frac{\delta}{\gamma_n}\|x_{n+1} - z_n\|^2$.
Since $\langle z_n - x_{n+1}, x_{n+1} - x^* \rangle = \frac{1}{2}\big(\|z_n - x^*\|^2 - \|z_n - x_{n+1}\|^2 - \|x_{n+1} - x^*\|^2\big)$, it follows that
$\frac{1}{2\gamma_n}\big(\|z_n - x^*\|^2 - \|z_n - x_{n+1}\|^2 - \|x_{n+1} - x^*\|^2\big) \geq (f + g)(x_{n+1}) - (f + g)(x^*) - \frac{\delta}{\gamma_n}\|x_{n+1} - z_n\|^2$.
Furthermore, it follows from $x^* \in S^*$ that
$\|z_n - x^*\|^2 - \|x_{n+1} - x^*\|^2 \geq 2\gamma_n\big[(f + g)(x_{n+1}) - (f + g)(x^*)\big] + (1 - 2\delta)\|x_{n+1} - z_n\|^2 \geq (1 - 2\delta)\|x_{n+1} - z_n\|^2$.
Next, we show that $\{x_n\}$ is bounded. Indeed, from (7) and (9), we obtain
$\|x_{n+1} - x^*\| \leq \|z_n - x^*\| = \|\alpha_n F(y_n) + (1 - \alpha_n) y_n - x^*\|$
$\leq \alpha_n \|F(y_n) - F(x^*)\| + \alpha_n \|F(x^*) - x^*\| + (1 - \alpha_n)\|y_n - x^*\|$
$\leq (1 - (1 - c)\alpha_n)\|y_n - x^*\| + \alpha_n \|F(x^*) - x^*\|$
$\leq (1 - (1 - c)\alpha_n)\|x_n - x^*\| + (1 - (1 - c)\alpha_n)\alpha_n M_1 + \alpha_n \|F(x^*) - x^*\|$
$\leq (1 - (1 - c)\alpha_n)\|x_n - x^*\| + \alpha_n (1 - c)\frac{M_1 + \|F(x^*) - x^*\|}{1 - c}$.
Inductively, we have $\|x_{n+1} - x^*\| \leq \max\big\{\|x_0 - x^*\|, \frac{M_1 + \|F(x^*) - x^*\|}{1 - c}\big\}$, and hence $\{x_n\}$ is bounded. Furthermore, by (5) and (6), $\{y_n\}$ and $\{z_n\}$ are also bounded. To show the convergence of $\{x_n\}$, we divide the proof into two cases.
Case 1. There exists $N_0 \in \mathbb{N}$ such that $\|x_{n+1} - x^*\| \leq \|x_n - x^*\|$ for all $n \geq N_0$. Then $\lim_{n \to \infty}\|x_n - x^*\| = a$ for some $a \in \mathbb{R}$. From (5) and (6), and the fact that
$\|z_n - x^*\| \leq \|z_n - y_n\| + \|y_n - x_n\| + \|x_n - x^*\|$,
we have $\lim_{n \to \infty}\|z_n - x^*\| = a$. Using (9), we have $\lim_{n \to \infty}\|x_{n+1} - z_n\| = 0$. Since $\{z_n\}$ is bounded, there exists a subsequence $\{z_{n_k}\}$ of $\{z_n\}$ such that $z_{n_k} \rightharpoonup w$, for some $w \in H$, and the following holds:
$\limsup_{n \to \infty}\langle F(x^*) - x^*, z_n - x^* \rangle = \lim_{k \to \infty}\langle F(x^*) - x^*, z_{n_k} - x^* \rangle = \langle F(x^*) - x^*, w - x^* \rangle$.
We claim that $w \in S^*$. In order to prove this, we consider two cases for $\{z_{n_k}\}$. In the first case, $\gamma_{n_k} < \sigma$ for only finitely many $k$; then, without loss of generality, we can assume that $\gamma_{n_k} = \sigma$ for all $k \in \mathbb{N}$. From the definition of $\gamma_{n_k}$, we have
$\|\nabla f(x_{n_k + 1}) - \nabla f(z_{n_k})\| \leq \frac{\delta}{\sigma}\|x_{n_k + 1} - z_{n_k}\|$.
The uniform continuity of $\nabla f$ implies that $\lim_{k \to \infty}\|\nabla f(x_{n_k + 1}) - \nabla f(z_{n_k})\| = 0$. We know that
$\frac{z_{n_k} - x_{n_k + 1}}{\gamma_{n_k}} - \nabla f(z_{n_k}) + \nabla f(x_{n_k + 1}) \in \partial g(x_{n_k + 1}) + \nabla f(x_{n_k + 1}) = \partial (f + g)(x_{n_k + 1})$.
Since $\mathrm{Gph}(\partial(f + g))$ is demiclosed, we obtain $0 \in \partial(f + g)(w)$, and hence $w \in S^*$.
In the second case, there exists a subsequence $\{z_{n_{k_j}}\}$ of $\{z_{n_k}\}$ such that $\gamma_{n_{k_j}} \leq \sigma\theta$ for all $j \in \mathbb{N}$. Let $\hat{\gamma}_{n_{k_j}} = \frac{\gamma_{n_{k_j}}}{\theta}$ and $\hat{x}_{n_{k_j}} = \mathrm{prox}_{\hat{\gamma}_{n_{k_j}} g}\big(z_{n_{k_j}} - \hat{\gamma}_{n_{k_j}} \nabla f(z_{n_{k_j}})\big)$. From the definition of $\gamma_{n_{k_j}}$, we have
$\|\nabla f(\hat{x}_{n_{k_j}}) - \nabla f(z_{n_{k_j}})\| > \frac{\delta}{\hat{\gamma}_{n_{k_j}}}\|\hat{x}_{n_{k_j}} - z_{n_{k_j}}\|$.   (10)
Moreover, from Lemma 1 we have
$\frac{1}{\theta}\|z_{n_{k_j}} - x_{n_{k_j} + 1}\| \geq \|z_{n_{k_j}} - \hat{x}_{n_{k_j}}\|$, therefore $\|z_{n_{k_j}} - \hat{x}_{n_{k_j}}\| \to 0$ as $j \to \infty$,
which implies that $\hat{x}_{n_{k_j}} \rightharpoonup w$. Since $\nabla f$ is uniformly continuous, we also have $\|\nabla f(\hat{x}_{n_{k_j}}) - \nabla f(z_{n_{k_j}})\| \to 0$ as $j \to \infty$. Combining this with (10), we obtain $\frac{\|\hat{x}_{n_{k_j}} - z_{n_{k_j}}\|}{\hat{\gamma}_{n_{k_j}}} \to 0$ as $j \to \infty$. Again, we know that
$\frac{z_{n_{k_j}} - \hat{x}_{n_{k_j}}}{\hat{\gamma}_{n_{k_j}}} - \nabla f(z_{n_{k_j}}) + \nabla f(\hat{x}_{n_{k_j}}) \in \partial g(\hat{x}_{n_{k_j}}) + \nabla f(\hat{x}_{n_{k_j}}) = \partial(f + g)(\hat{x}_{n_{k_j}})$.
The demiclosedness of $\mathrm{Gph}(\partial(f + g))$ implies that $0 \in \partial(f + g)(w)$, and hence $w \in S^*$. Therefore,
$\limsup_{n \to \infty}\langle F(x^*) - x^*, z_n - x^* \rangle = \langle F(x^*) - x^*, w - x^* \rangle \leq 0$.
Using (8), (9), and Lemma 2, we have
$\|z_n - x^*\|^2 = \|\alpha_n F(y_n) + (1 - \alpha_n) y_n - x^*\|^2$
$= \|\alpha_n (F(y_n) - F(x^*)) + \alpha_n (F(x^*) - x^*) + (1 - \alpha_n)(y_n - x^*)\|^2$
$\leq \|\alpha_n (F(y_n) - F(x^*)) + (1 - \alpha_n)(y_n - x^*)\|^2 + 2\alpha_n \langle F(x^*) - x^*, z_n - x^* \rangle$
$\leq (1 - (1 - c)\alpha_n)\|y_n - x^*\|^2 + 2\alpha_n \langle F(x^*) - x^*, z_n - x^* \rangle$
$\leq (1 - (1 - c)\alpha_n)\|x_n - x^*\|^2 + \frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\|\big(2\alpha_n\|x_n - x^*\| + \alpha_n\beta_n\|x_n - x_{n-1}\|\big) + 2\alpha_n\langle F(x^*) - x^*, z_n - x^* \rangle$
$\leq (1 - (1 - c)\alpha_n)\|x_n - x^*\|^2 + (1 - c)\alpha_n\Big(\frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\|\frac{M_2}{1 - c} + \frac{2}{1 - c}\langle F(x^*) - x^*, z_n - x^* \rangle\Big)$,   (11)
for some $M_2 \geq 2\|x_n - x^*\| + \beta_n\|x_n - x_{n-1}\|$.
Hence,
$\|x_{n+1} - x^*\|^2 \leq (1 - (1 - c)\alpha_n)\|x_n - x^*\|^2 + (1 - c)\alpha_n\Big(\frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\|\frac{M_2}{1 - c} + \frac{2}{1 - c}\langle F(x^*) - x^*, z_n - x^* \rangle\Big)$.
We set $a_n = \|x_{n+1} - x^*\|^2$ and $\zeta_n = \frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\|\frac{M_2}{1 - c} + \frac{2}{1 - c}\langle F(x^*) - x^*, z_n - x^* \rangle$ in Lemma 4. Since $\frac{\beta_n}{\alpha_n}\|x_n - x_{n-1}\|\frac{M_2}{1 - c} \to 0$ and $\limsup_{n \to \infty}\langle F(x^*) - x^*, z_n - x^* \rangle \leq 0$, we have $\limsup_{n \to \infty}\zeta_n \leq 0$. Consequently, Lemma 4 is applicable, and hence $\|x_n - x^*\|^2 \to 0$; that is, $\{x_n\}$ converges strongly to $x^*$.
Case 2. There exists a subsequence $\{x_{m_j}\}$ of $\{x_n\}$ such that $\|x_{m_j} - x^*\| < \|x_{m_j + 1} - x^*\|$ for all $j \in \mathbb{N}$. By Lemma 3, there exists a nondecreasing sequence $\{n_k\}$ of $\mathbb{N}$ such that $\lim_{k \to \infty} n_k = \infty$ and the following hold, for all sufficiently large $k \in \mathbb{N}$:
$\|x_{n_k} - x^*\| \leq \|x_{n_k + 1} - x^*\|$ and $\|x_k - x^*\| \leq \|x_{n_k + 1} - x^*\|$.
From the definition of $z_{n_k}$ and (8) we have, for all $k \in \mathbb{N}$,
$\|z_{n_k} - x^*\|^2 \leq \alpha_{n_k}\|F(y_{n_k}) - x^*\|^2 + \|y_{n_k} - x^*\|^2$
$\leq \|x_{n_k} - x^*\|^2 + \alpha_{n_k}\|F(y_{n_k}) - x^*\|^2 + \frac{\beta_{n_k}}{\alpha_{n_k}}\|x_{n_k} - x_{n_k - 1}\|\big(2\alpha_{n_k}\|x_{n_k} - x^*\| + \alpha_{n_k}\beta_{n_k}\|x_{n_k} - x_{n_k - 1}\|\big)$
$\leq \|x_{n_k} - x^*\|^2 + \alpha_{n_k}\|F(y_{n_k}) - x^*\|^2 + \alpha_{n_k}\frac{\beta_{n_k}}{\alpha_{n_k}}\|x_{n_k} - x_{n_k - 1}\| M_3$
$\leq \|x_{n_k + 1} - x^*\|^2 + \alpha_{n_k}\|F(y_{n_k}) - x^*\|^2 + \alpha_{n_k}\frac{\beta_{n_k}}{\alpha_{n_k}}\|x_{n_k} - x_{n_k - 1}\| M_3$,
for some $M_3 \geq 2\|x_{n_k} - x^*\| + \beta_{n_k}\|x_{n_k} - x_{n_k - 1}\|$. Combining this with (9), we obtain
$\alpha_{n_k}\|F(y_{n_k}) - x^*\|^2 + \alpha_{n_k}\frac{\beta_{n_k}}{\alpha_{n_k}}\|x_{n_k} - x_{n_k - 1}\| M_3 \geq (1 - 2\delta)\|x_{n_k + 1} - z_{n_k}\|^2$, for all $k \in \mathbb{N}$.
So $\|x_{n_k + 1} - z_{n_k}\| \to 0$ as $k \to \infty$. Since $\{z_{n_k}\}$ is bounded, there exists a subsequence $\{z_{n_{k_j}}\}$ such that $z_{n_{k_j}} \rightharpoonup w$, for some $w \in H$, and
$\limsup_{k \to \infty}\langle F(x^*) - x^*, z_{n_k} - x^* \rangle = \lim_{j \to \infty}\langle F(x^*) - x^*, z_{n_{k_j}} - x^* \rangle = \langle F(x^*) - x^*, w - x^* \rangle$.
Using the same argument as in Case 1, we obtain $w \in S^*$ and
$\limsup_{k \to \infty}\langle F(x^*) - x^*, z_{n_k} - x^* \rangle = \langle F(x^*) - x^*, w - x^* \rangle \leq 0$.
Moreover, it follows from (11) that
$\|x_{n_k + 1} - x^*\|^2 \leq (1 - (1 - c)\alpha_{n_k})\|x_{n_k} - x^*\|^2 + (1 - c)\alpha_{n_k}\Big(\frac{\beta_{n_k}}{\alpha_{n_k}}\|x_{n_k} - x_{n_k - 1}\|\frac{M_2}{1 - c} + \frac{2}{1 - c}\langle F(x^*) - x^*, z_{n_k} - x^* \rangle\Big)$
$\leq (1 - (1 - c)\alpha_{n_k})\|x_{n_k + 1} - x^*\|^2 + (1 - c)\alpha_{n_k}\Big(\frac{\beta_{n_k}}{\alpha_{n_k}}\|x_{n_k} - x_{n_k - 1}\|\frac{M_2}{1 - c} + \frac{2}{1 - c}\langle F(x^*) - x^*, z_{n_k} - x^* \rangle\Big)$.
Consequently, $\|x_{n_k + 1} - x^*\|^2 \leq \frac{\beta_{n_k}}{\alpha_{n_k}}\|x_{n_k} - x_{n_k - 1}\|\frac{M_2}{1 - c} + \frac{2}{1 - c}\langle F(x^*) - x^*, z_{n_k} - x^* \rangle$. Hence,
$0 \leq \limsup_{k \to \infty}\|x_k - x^*\|^2 \leq \limsup_{k \to \infty}\|x_{n_k + 1} - x^*\|^2 \leq 0$.
Thus, we can conclude that $\{x_n\}$ converges strongly to $x^*$, and the proof is complete. □
Remark 1.
We observe that we can prove our main result, Theorem 1, without condition A2 and with the weaker condition C2 in place of A3, whereas VIFBA requires all of these conditions.

4. Applications to Data Classification and Regression Problems

As mentioned in the literature, many real world problems can be formulated as convex minimization problems. In this section, we illustrate how some machine learning problems, namely classification and regression problems, can be reformulated as the convex minimization problem (1), and we apply our proposed algorithm to solve them. We also show that our proposed method is more efficient than some methods mentioned in the literature.
First, we give a brief overview of the extreme learning machine for data classification and regression problems; then we apply our main result to these two problems by conducting numerical experiments. We also compare the performance of FISTA, VIFBA, Algorithm 4 and Algorithm 5.
Extreme learning machine (ELM). Let $S := \{(x_k, t_k) : x_k \in \mathbb{R}^n, t_k \in \mathbb{R}^m, k = 1, 2, \ldots, N\}$ be a training set of $N$ distinct samples, where $x_k$ is the input data and $t_k$ is the target. For a single-hidden-layer feedforward network (SLFN) trained by ELM, the output of the $i$-th hidden node is
$h_i(x) = G(a_i, b_i, x)$,
where $G$ is an activation function and $a_i$ and $b_i$ are the parameters of the $i$-th hidden node. The output function of ELM for an SLFN with $M$ hidden nodes is
$O_k = \sum_{i=1}^{M} \beta_i h_i(x_k)$,
where $\beta_i$ is the output weight of the $i$-th hidden node. The hidden layer output matrix $H$ is defined as follows:
$H = \begin{bmatrix} G(a_1, b_1, x_1) & \cdots & G(a_M, b_M, x_1) \\ \vdots & \ddots & \vdots \\ G(a_1, b_1, x_N) & \cdots & G(a_M, b_M, x_N) \end{bmatrix}$.
The main goal of ELM is to find $\beta = [\beta_1^T, \ldots, \beta_M^T]^T$ such that $H\beta = T$, where $T = [t_1^T, \ldots, t_N^T]^T$ is the target matrix. In some cases, finding $\beta = H^{\dagger} T$, where $H^{\dagger}$ is the Moore–Penrose generalized inverse of $H$, may be a difficult task when $H^{\dagger}$ does not exist. Thus, finding such a solution $\beta$ by means of convex minimization can overcome this difficulty.
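The following sketch (ours, for illustration) shows how the hidden layer output matrix H might be generated with randomly chosen node parameters $(a_i, b_i)$ and a sigmoid activation, as used in the experiments below; the sampling range of the random parameters is an assumption, not stated in the paper.

```python
import numpy as np

def elm_hidden_matrix(X, M, rng=None):
    # X: N x d input matrix, M: number of hidden nodes.
    # Hidden node i uses random parameters (a_i, b_i) with G(a, b, x) = sigmoid(a.x + b).
    rng = np.random.default_rng(0) if rng is None else rng
    d = X.shape[1]
    a = rng.uniform(-1.0, 1.0, size=(d, M))    # input weights a_i (assumed range)
    b = rng.uniform(-1.0, 1.0, size=M)         # biases b_i (assumed range)
    H = 1.0 / (1.0 + np.exp(-(X @ a + b)))     # N x M hidden layer output matrix
    return H, a, b
```

Given a target matrix T, the output weight β could then be computed from the Moore–Penrose inverse (np.linalg.pinv(H) @ T) when it is available, or, as in this paper, by solving the regularized problem described next.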
In this section, we conduct experiments on regression and classification problems formulated as the following convex minimization problem:
Minimize: $\|H\beta - T\|_2^2 + \lambda\|\beta\|_1$,
where $\lambda$ is a regularization parameter. This problem is called the least absolute shrinkage and selection operator (LASSO) [26]. In this case, $f(x) = \|Hx - T\|_2^2$ and $g(x) = \lambda\|x\|_1$. We note that, in our experiments, FISTA and VIFBA can be used to solve the problems, since the $L$-Lipschitz constants exist. However, FISTA and VIFBA fail to solve problems in which $L$-Lipschitz constants do not exist, while Algorithms 4 and 5 still succeed.
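The pieces of (1) for this problem can be written down directly: the sketch below (ours, for illustration) builds the smooth part f, its gradient, the nonsmooth part g, and the proximal operator of g, which is exactly what the algorithm sketches in the earlier sections consume.

```python
import numpy as np

def make_lasso(H, T, lam):
    # f(beta) = ||H beta - T||^2 (smooth), g(beta) = lam * ||beta||_1 (nonsmooth).
    f      = lambda beta: np.sum((H @ beta - T) ** 2)
    grad_f = lambda beta: 2 * H.T @ (H @ beta - T)        # gradient of the smooth part
    g      = lambda beta: lam * np.sum(np.abs(beta))
    prox_g = lambda v, alpha: np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)
    return f, grad_f, g, prox_g
```

Here $\nabla f$ is Lipschitz with constant $2\|H\|_2^2$, which is why FISTA and VIFBA are applicable to this particular problem, while Algorithms 4 and 5 only require the linesearch.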

4.1. Regression of a Sine Function

Throughout Section 4.1 and Section 4.2, all parameters are chosen to satisfy the hypotheses of Theorem 1. All experiments were performed on an Intel Core i5-7500 CPU with 16 GB RAM and a GeForce GTX 1060 6GB GPU.
As described in Table 1, we randomly generate 10 distinct points $x_1, x_2, \ldots, x_{10}$ in $[-4, 4]$, create the training set $S := \{\sin x_n : n = 1, \ldots, 10\}$, and use the graph of the sine function on $[-4, 4]$ as the target. The activation function is sigmoid, the number of hidden nodes is $M = 100$, and the regularization parameter is $\lambda = 1 \times 10^{-5}$. We use FISTA, VIFBA, Algorithm 4 and Algorithm 5 to predict the sine function from these 10 training points.
The first experiment compares the performance of Algorithm 5 with different $c$-contractive mappings $F$, so that we can observe whether $F$ affects the performance of Algorithm 5. We use the mean squared error (MSE) as a measure, defined as follows:
$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\bar{y}_i - y_i)^2$.
By setting $\sigma = 0.49$, $\delta = 0.1$, $\theta = 0.1$, the inertial parameter $\beta_n = \frac{1}{\|x_n - x_{n-1}\|^3 + n^3}$, and $\mathrm{MSE} = 5 \times 10^{-3}$ as the stopping criterion, we obtain the results seen in Table 2.
We observe that Algorithm 5 performs better when c is closer to 1.
In the second experiment, we compare the performance of Algorithm 5 with different inertial parameters $\beta_n$ in Theorem 1, namely
$\beta_n^1 = 0$, $\beta_n^2 = \frac{1}{n^2\|x_n - x_{n-1}\|}$, $\beta_n^3 = \frac{1}{\|x_n - x_{n-1}\|^3 + n^3}$, $\beta_n^4 = \frac{10^{10}}{\|x_n - x_{n-1}\|^3 + n^3 + 10^{10}}$.
It can be shown that $\beta_n^1$, $\beta_n^2$, $\beta_n^3$ and $\beta_n^4$ satisfy C2. By setting $\sigma = 0.49$, $\delta = 0.1$, $\theta = 0.1$, $F(x) = 0.999x$, and $\mathrm{MSE} = 5 \times 10^{-3}$ as the stopping criterion, we obtain the results seen in Table 3.
We can clearly see that $\beta_n^4$ significantly improves the performance of Algorithm 5. Although $\beta_n^4$ converges to 0 as $n \to \infty$, its behavior differs from that of $\beta_n^2$ and $\beta_n^3$ in the first few iterations: $\beta_n^4$ is extremely close to 1, while $\beta_n^2$ and $\beta_n^3$ are far from 1. Based on this experiment, we choose $\beta_n^4$ as our default inertial parameter for the later experiments; an illustrative computation is sketched below.
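For concreteness, the four inertial parameters compared above could be computed as in the following sketch (ours, for illustration); the convention of returning 0 for $\beta_n^2$ when $x_n = x_{n-1}$ is our assumption.

```python
import numpy as np

def inertial_betas(n, x_n, x_prev):
    # Returns (beta_n^1, beta_n^2, beta_n^3, beta_n^4) for iterates x_n, x_{n-1}.
    d = np.linalg.norm(x_n - x_prev)
    beta1 = 0.0
    beta2 = 0.0 if d == 0 else 1.0 / (n ** 2 * d)       # assumed convention when d = 0
    beta3 = 1.0 / (d ** 3 + n ** 3)
    beta4 = 1e10 / (d ** 3 + n ** 3 + 1e10)             # close to 1 for small n, still -> 0
    return beta1, beta2, beta3, beta4
```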
In the third experiment, we compare the performance of FISTA, VIFBA, Algorithm 4 and Algorithm 5, with the parameters of each algorithm chosen as in Table 4.
By setting $\mathrm{MSE} = 5 \times 10^{-3}$ as the stopping criterion, we obtain the results seen in Table 5.
We observe that Algorithm 5 takes only 129 iterations, while FISTA, VIFBA and Algorithm 4 take a higher number of iterations, and that Algorithm 5 uses less training time than Algorithm 4.
Next, we compare each algorithm at the 3000th iteration using two further measures, namely the mean absolute error (MAE) and the root mean squared error (RMSE), defined as follows:
$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|\bar{y}_i - y_i|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\bar{y}_i - y_i)^2}$.
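These error measures can be computed as in the short sketch below (ours, for illustration).

```python
import numpy as np

def regression_metrics(y_pred, y_true):
    # MSE, MAE and RMSE between predicted and target values.
    err = np.asarray(y_pred) - np.asarray(y_true)
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(mse)
    return mse, mae, rmse
```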
The results can be seen in Table 6.
We observe from Table 6 that Algorithm 5 has the lowest MAE and RMSE, but takes the longest training time. In Figure 2, we observe that Algorithm 5 outperforms the other algorithms in the regression of the graph of a sine function when the number of iterations is small. In Figure 3, it is shown that Algorithm 5, FISTA and VIFBA perform better than Algorithm 4 in the regression of the graph when the number of iterations is higher.

4.2. Data Classification

In this experiment, we classify the types of Iris plants in the Iris dataset created by Fisher [27]. As shown in Table 7, this dataset contains 3 classes of 50 instances each, and each sample has four attributes.
We also would like to thank https://archive.ics.uci.edu for providing the dataset.
With this dataset, we use sigmoid as the activation function, $M = 100$ hidden nodes, and regularization parameter $\lambda = 1 \times 10^{-5}$. We use FISTA, VIFBA, Algorithm 4 and Algorithm 5 as training algorithms to estimate the optimal weight $\beta$. The output data $O$ of the training and testing data are obtained by $O = H\beta$; see Table 8 for more detail.
In the first experiment, we use the first 35 instances of each class as training data and the last 15 of each class as testing data; see Table 9 for details.
The accuracy of the output data is calculated by
$\text{accuracy} = \frac{\text{correctly predicted data}}{\text{all data}} \times 100\%$.
To compare the performance of FISTA, VIFBA, Algorithm 4 and Algorithm 5, we choose the same parameters for each algorithm as in Table 4.
We first compare the accuracy of each method at the 700th iteration and obtain the results seen in Table 10.
As we see from Table 10, Algorithm 5 obtains the highest accuracy at the 700th iteration. We use acc.train and acc.test for the accuracy on the training and testing data sets, respectively.
Next, we compare each method with the stopping criteria acc.train > 90 and acc.test > 90; the results can be seen in Table 11.
We observe from Table 11 that Algorithm 5 performs better than Algorithm 4.
In the next experiment, we use 10-fold stratified cross-validation to set up the training and testing data; see Table 12 for details.
We use the Average ACC and ERR% to evaluate the performance of each algorithm:
$\text{Average ACC} = \frac{1}{N}\sum_{i=1}^{N}\frac{x_i}{y_i} \times 100\%$,
where $N$ is the number of folds considered during cross-validation ($N = 10$), $x_i$ is the number of correctly predicted data at fold $i$, and $y_i$ is the number of all data at fold $i$.
Let $\mathrm{errL}_{sum}$ be the sum of errors over all 10 training sets, $\mathrm{errT}_{sum}$ the sum of errors over all 10 testing sets, $L_{sum}$ the total number of data in the 10 training sets, and $T_{sum}$ the total number of data in the 10 testing sets. Define
$\mathrm{ERR\%} = (\mathrm{errL\%} + \mathrm{errT\%})/2$,
where $\mathrm{errL\%} = \frac{\mathrm{errL}_{sum}}{L_{sum}} \times 100\%$ and $\mathrm{errT\%} = \frac{\mathrm{errT}_{sum}}{T_{sum}} \times 100\%$. These quantities are computed as in the sketch below.
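The two summary measures can be computed from per-fold counts as in the following sketch (ours, for illustration; the argument names are not from the paper).

```python
def cv_summary(correct, total, err_train, size_train, err_test, size_test):
    # correct[i], total[i]: correctly predicted / all data at fold i (for Average ACC).
    # err_train[i], size_train[i]: errors and size of the i-th training set.
    # err_test[i], size_test[i]: errors and size of the i-th testing set.
    N = len(correct)
    average_acc = sum(c / t * 100.0 for c, t in zip(correct, total)) / N
    err_L = sum(err_train) / sum(size_train) * 100.0   # errL%
    err_T = sum(err_test) / sum(size_test) * 100.0     # errT%
    return average_acc, (err_L + err_T) / 2.0          # (Average ACC, ERR%)
```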
We choose the same parameters as in Table 4 and compare the accuracy at the 1000th iteration on each fold; the results are shown in Table 13.
We observe from Table 13 that Algorithm 5 has higher average accuracy than Algorithm 4.

5. Conclusions

In this work, algorithms for solving the convex minimization problem (1) are studied. Many effective algorithms have been proposed for solving this problem, but most of them require a Lipschitz continuity assumption on $\nabla f$. By combining the linesearch technique introduced by Cruz and Nghia [20] with the iterative method VIFBA of Verma et al. [19], we establish a new algorithm that does not require the Lipschitz continuity of $\nabla f$. As a result, it can be applied to problems for which Lipschitz constants do not exist, while VIFBA and FISTA cannot. Moreover, by using viscosity approximation together with the inertial technique, our proposed algorithm has better convergence behavior than Algorithm 4. Strong convergence of our proposed method is also proven under control conditions weaker than those of VIFBA.
Our algorithm can be used to solve many real world problems, such as image and signal processing and machine learning, especially regression and classification problems. To compare the performance of FISTA, VIFBA, Algorithm 4 and our proposed algorithm (Algorithm 5), we conducted numerical experiments on the latter problems. We observe from these experiments that Algorithms 4 and 5 take longer computational time than FISTA and VIFBA at the same number of iterations, because the linesearch step (Linesearch 1) takes a long time to compute. In the experiments with stopping criteria (Table 5 and Table 11), Algorithm 5 converges to a solution in a lower number of iterations than Algorithm 4 and hence performs better in terms of speed. We can also observe that Algorithm 5 performs decently in terms of accuracy, especially when compared with Algorithm 4.
For our future research, since FISTA performs better than Algorithm 5 in terms of speed, we aim to find a new linesearch technique that takes less computational time than Linesearch 1 and hence decreases the computational time of Algorithm 5.

Author Contributions

Writing—review and editing, W.I.; supervision, S.S.; writing—original draft preparation, P.S.; software, D.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by Chiang Mai University, Chiang Mai, Thailand.

Acknowledgments

This work was supported by Chiang Mai University, Chiang Mai, Thailand.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Byrne, C. Iterative oblique projection onto convex subsets and the split feasibility problem. Inverse Probl. 2002, 18, 441–453.
  2. Combettes, P.L.; Wajs, V. Signal recovery by proximal forward–backward splitting. Multiscale Model. Simul. 2005, 4, 1168–1200.
  3. Combettes, P.L.; Pesquet, J.C. Proximal Splitting Methods in Signal Processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering; Bauschke, H., Burachik, R., Combettes, P., Elser, V., Luke, D., Wolkowicz, H., Eds.; Springer: New York, NY, USA, 2011; pp. 185–212.
  4. Cholamjiak, P.; Shehu, Y. Inertial forward-backward splitting method in Banach spaces with application to compressed sensing. Appl. Math. 2019, 64, 409–435.
  5. Szaleniec, M.; Tadeusiewicz, R.; Witkoa, M. How to select an optimal neural model of chemical reactivity? Neurocomputing 2008, 72, 241–256.
  6. Szaleniec, J.; Wiatr, M.; Szaleniec, M.; Składzień, J.; Tomik, J.; Oleś, K.; Tadeusiewicz, R. Artificial neural network modelling of the results of tympanoplasty in chronic suppurative otitis media patients. Comput. Biol. Med. 2013, 43, 16–22.
  7. Pławiak, P.; Abdar, M.; Acharya, U.R. Application of new deep genetic cascade ensemble of SVM classifiers to predict the Australian credit scoring. Appl. Soft Comput. 2019, 84, 105740.
  8. Pławiak, P.; Abdar, M.; Pławiak, J.; Makarenkov, V.; Acharya, U.R. DGHNL: A new deep genetic hierarchical network of learners for prediction of credit scoring. Inf. Sci. 2020, 516, 401–418.
  9. Lions, P.L.; Mercier, B. Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 1979, 16, 964–979.
  10. Bussaban, L.; Suantai, S.; Kaewkhao, A. A parallel inertial S-iteration forward-backward algorithm for regression and classification problems. Carpathian J. Math. 2020, 36, 21–30.
  11. Moudafi, A.; Oliny, M. Convergence of a splitting inertial proximal method for monotone operators. J. Comput. Appl. Math. 2003, 155, 447–454.
  12. Verma, M.; Shukla, K.K. A new accelerated proximal gradient technique for regularized multitask learning framework. Pattern Recogn. Lett. 2017, 95, 98–103.
  13. Beck, A.; Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009, 2, 183–202.
  14. Liang, J.; Schonlieb, C.B. Improving FISTA: Faster, smarter and greedier. arXiv 2018, arXiv:1811.01430.
  15. Moudafi, A. Viscosity approximation method for fixed-points problems. J. Math. Anal. Appl. 2000, 241, 46–55.
  16. Nakajo, K.; Takahashi, W. Strong convergence theorems for nonexpansive mappings and nonexpansive semigroups. J. Math. Anal. Appl. 2003, 279, 372–379.
  17. Takahashi, W.; Zembayashi, K. Strong Convergence Theorem by a New Hybrid Method for Equilibrium Problems and Relatively Nonexpansive Mappings. Fixed Point Theory Appl. 2008, 2008, 528476.
  18. Halpern, B. Fixed points of nonexpansive maps. Bull. Am. Math. Soc. 1967, 73, 957–961.
  19. Verma, M.; Sahu, D.R.; Shukla, K.K. VAGA: A novel viscosity-based accelerated gradient algorithm. Appl. Intell. 2018, 48, 2613–2627.
  20. Bello Cruz, J.Y.; Nghia, T.T. On the convergence of the forward-backward splitting method with linesearches. Optim. Methods Softw. 2016, 31, 1209–1238.
  21. Burachik, R.S.; Iusem, A.N. Set-Valued Mappings and Enlargements of Monotone Operators; Springer: Berlin, Germany, 2008.
  22. Huang, Y.; Dong, Y. New properties of forward-backward splitting and a practical proximal-descent algorithm. Appl. Math. Comput. 2014, 237, 60–68.
  23. Takahashi, W. Introduction to Nonlinear and Convex Analysis; Yokohama Publishers: Yokohama, Japan, 2009.
  24. Mainge, P.E. Strong convergence of projected subgradient methods for nonsmooth and nonstrictly convex minimization. Set-Valued Anal. 2008, 16, 899–912.
  25. Xu, H.K. Another control condition in an iterative method for nonexpansive mappings. Bull. Austral. Math. Soc. 2002, 65, 109–113.
  26. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B Methodol. 1996, 58, 267–288.
  27. Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188.
Figure 1. Diagram of Algorithm 5.
Figure 2. A regression of a sine function at the 130th iteration.
Figure 3. A regression of a sine function at the 3000th iteration.
Table 1. Details of the regression of a sine function experiment.
Training set: Create a training matrix $R = [x_1\ x_2\ \cdots\ x_{10}]^T$, where $x_1, x_2, \ldots, x_{10} \in [-4, 4]$ are randomly generated. Create the training target matrix $S = [\sin x_1\ \sin x_2\ \cdots\ \sin x_{10}]^T$.
Testing set: Create the testing matrix $V = [-4\ {-3.99}\ {-3.98}\ \cdots\ 4]^T$.
Target set: Create $T = [\sin(-4)\ \sin(-3.99)\ \cdots\ \sin(4)]^T$ as the target matrix.
Learning process: Generate the hidden layer output matrix $H_1$ of the training matrix $R$ with 100 hidden nodes, using sigmoid as the activation function. Pick a regularization parameter $\lambda$ and formulate the convex minimization problem: Minimize $\|H_1\beta - S\|_2^2 + \lambda\|\beta\|_1$. Find the optimal weight $\beta^*$ of this problem using Algorithm 5 with a certain number of iterations.
Testing process: Generate the hidden layer output matrix $H_2$ of the testing matrix $V$ with 100 hidden nodes, using sigmoid as the activation function. Calculate the output $O = H_2\beta^*$. Calculate the MSE, MAE and RMSE of the output $O$ and the target matrix $T$.
Table 2. Numerical results of the c-contractive mapping F(x) = cx.
            Iteration No.   Training Time   MSE
c = 0.5     25143           4.8376          5 × 10^-3
c = 0.8     14761           2.7909          5 × 10^-3
c = 0.9     11956           2.3115          5 × 10^-3
c = 0.999   9649            1.7992          5 × 10^-3
Table 3. Numerical results of each inertial parameter.
            Iteration No.   Training Time   MSE
β_n^1       9635            1.7477          5 × 10^-3
β_n^2       9671            1.8107          5 × 10^-3
β_n^3       9649            1.7992          5 × 10^-3
β_n^4       129             0.0277          4.2 × 10^-3
Table 4. Chosen parameters of each algorithm.
FISTA:         $t_1 = 1$, $L = 2\|H_1\|^2$
VIFBA:         $\gamma_n = \frac{1}{2\|H_1\|^2}$, $\alpha_n = \frac{1}{n}$, $\beta_n = \frac{10^{10}}{\|x_n - x_{n-1}\|^3 + n^3 + 10^{10}}$
Algorithm 4:   $\sigma = 0.49$, $\delta = 0.1$, $\theta = 0.1$
Algorithm 5:   $\sigma = 0.49$, $\delta = 0.1$, $\theta = 0.1$, $\alpha_n = \frac{1}{n}$, $\beta_n = \frac{10^{10}}{\|x_n - x_{n-1}\|^3 + n^3 + 10^{10}}$
Table 5. Numerical results of the regression of a sine function with the stopping criterion.
              Iteration No.   Training Time   MSE
FISTA         305             0.0055          4.8 × 10^-3
VIFBA         419             0.008           5 × 10^-3
Algorithm 4   9592            1.9681          5 × 10^-3
Algorithm 5   129             0.0277          4.2 × 10^-3
Table 6. Numerical results of the regression of a sine function at the 3000th iteration.
              Iteration No.   Training Time   MAE      RMSE
FISTA         3000            0.1248          0.0169   0.0326
VIFBA         3000            0.1660          0.0236   0.0364
Algorithm 4   3000            0.6474          0.2496   0.3224
Algorithm 5   3000            0.6589          0.0169   0.0310
Table 7. Iris dataset.
Class (50 samples each): Iris setosa, Iris versicolor, Iris virginica.
Attributes (in centimeters): sepal length, sepal width, petal length, petal width.
Table 8. Details of the classification of the Iris dataset experiment.
Training set: Create the N × 1 training matrix S of the numbers 1, 2 and 3 according to the training set, where N is the number of training samples. Create the N × 4 training attribute matrix R according to S and the attribute data.
Testing set: Create the M × 1 testing matrix T of the numbers 1, 2 and 3 according to the testing set, where M is the number of testing samples. Create the M × 4 testing attribute matrix V according to T and the attribute data.
Learning process: Generate the hidden layer output matrix H_1 of the training attribute matrix R with 100 hidden nodes, using sigmoid as the activation function. Choose λ = 1 × 10^-5 and formulate the convex minimization problem: Minimize $\|H_1\beta - S\|_2^2 + \lambda\|\beta\|_1$. Find the optimal weight β* of this problem using Algorithm 5 as the learning algorithm with a certain number of iterations.
Testing process: Calculate the output O_1 = H_1β*. Calculate the number of correctly predicted samples between the output O_1 and the training matrix S. Generate the hidden layer output matrix H_2 of the testing attribute matrix V with 100 hidden nodes, using sigmoid as the activation function. Calculate the output O_2 = H_2β*. Calculate the number of correctly predicted samples between the output O_2 and the testing matrix T. Calculate acc.train, acc.test, Average ACC and ERR%.
The numbers 1, 2 and 3 represent Iris setosa, Iris versicolor and Iris virginica, respectively.
Table 9. Training and testing sets of the Iris dataset.
                   Training Data   Testing Data
Iris setosa        35              15
Iris versicolor    35              15
Iris virginica     35              15
Sum                105             45
Table 10. The performance of each algorithm at the 700th iteration.
              Iteration No.   Training Time   acc.train   acc.test
FISTA         700             0.0288          98.10       100
VIFBA         700             0.0416          98.10       100
Algorithm 4   700             0.3867          95.24       97.78
Algorithm 5   700             0.4172          99.05       100
Table 11. The performance of each algorithm with the stopping criteria.
              Iteration No.   Training Time   acc.train   acc.test
FISTA         61              0.0038          91.43       91.11
VIFBA         32              0.0021          91.43       91.11
Algorithm 4   368             0.0816          90.48       91.11
Algorithm 5   65              0.0584          93.33       93.33
Table 12. Training and testing sets for 10-fold stratified cross-validation (Groups 1–10).
                   Training Set   Testing Set
Iris setosa        45             5
Iris versicolor    45             5
Iris virginica     45             5
Sum                135            15
Table 13. The performance of each algorithm at the 1000th iteration with 10-fold stratified cross-validation.
                        FISTA                 VIFBA                 Algorithm 4           Algorithm 5
                        acc.train  acc.test   acc.train  acc.test   acc.train  acc.test   acc.train  acc.test
Fold 1                  97.78      100        97.78      100        97.78      100        97.78      100
Fold 2                  99.26      93.33      99.26      93.33      99.26      93.33      98.52      93.33
Fold 3                  97.78      100        97.78      100        95.56      100        98.52      100
Fold 4                  99.26      93.33      99.26      93.33      97.04      93.33      97.78      86.67
Fold 5                  98.52      100        98.52      100        97.78      100        98.52      100
Fold 6                  98.52      100        98.52      100        98.52      100        98.52      100
Fold 7                  98.52      86.67      98.52      86.67      98.52      73.33      98.52      93.33
Fold 8                  98.52      100        98.52      100        98.52      100        98.52      100
Fold 9                  98.52      100        97.78      100        97.04      100        97.78      100
Fold 10                 98.52      100        98.52      100        97.78      100        97.78      100
Average ACC             98.52      97.333     98.446     97.333     97.778     96         98.222     97.333
ERR %                   2.074                 2.112                 3.111                 2.223
Training time (sec.)    0.0413                0.0625                0.5673                0.7051

