
Optimality of the Approximation and Learning by the Rescaled Pure Super Greedy Algorithms

1 School of Mathematics and LPMC, Nankai University, Tianjin 300071, China
2 School of Science, China University of Geosciences, Beijing 100083, China
* Author to whom correspondence should be addressed.
Axioms 2022, 11(9), 437; https://doi.org/10.3390/axioms11090437
Submission received: 17 June 2022 / Revised: 14 August 2022 / Accepted: 24 August 2022 / Published: 28 August 2022

Abstract:
We propose the Weak Rescaled Pure Super Greedy Algorithm (WRPSGA) for approximation with respect to a dictionary $\mathcal{D}$ in a Hilbert space. The WRPSGA is simpler than some popular greedy algorithms. We show that the convergence rate of the RPSGA on the closure of the convex hull of a $\mu$-coherent dictionary $\mathcal{D}$ is optimal. Then, we design the Rescaled Pure Super Greedy Learning Algorithm (RPSGLA) for kernel-based supervised learning. We prove that the convergence rate of the RPSGLA can be arbitrarily close to the best rate $O(m^{-1})$ under some mild assumptions.
MSC:
41A46; 41A25; 46N30; 68T05; 62J02

1. Introduction

Non-linear sparse approximation with respect to dictionaries has been widely used in many application areas, such as machine learning, signal processing, and numerical computation, see [1,2,3,4,5,6,7,8]. Greedy algorithms have been used extensively for generating such approximations, see [9,10,11,12]. Among others, super greedy algorithms are very efficient for signal processing and machine learning, see [13,14,15]. In this paper, we propose a new super greedy algorithm—the Weak Rescaled Pure Super Greedy Algorithm (WRPSGA), which is simpler than some well-known greedy algorithms. We estimate the error of the WRPSGA and show that its convergence rate on the convex hull of the dictionary is optimal.
Then, we consider the application of the RPSGA to supervised learning. Since greedy algorithms were shown in [16] to combine attractive generalization capacity with a low computational burden, they have been widely used in supervised learning, see Refs. [13,16,17,18,19,20,21,22]. In this paper, we design the Rescaled Pure Super Greedy Learning Algorithm (RPSGLA) and derive an almost optimal convergence rate.
We first recall some basic notions of greedy approximation from [10]. Let $H$ be a Hilbert space with an inner product $\langle \cdot, \cdot \rangle$ and the norm $\|x\| = \langle x, x \rangle^{1/2}$. A set of elements $\mathcal{D} \subset H$ is called a dictionary if $\|\varphi\| = 1$ for every $\varphi \in \mathcal{D}$ and $\overline{\operatorname{span}}(\mathcal{D}) = H$. We consider symmetric dictionaries, i.e., $\varphi \in \mathcal{D}$ implies $-\varphi \in \mathcal{D}$.
Let $f_m$ be the output of a greedy-type algorithm with respect to a dictionary $\mathcal{D}$ after $m$ iterations. We want to estimate the decay of $\|f - f_m\|$ as $m \to \infty$. To solve this problem, we need the following classes of sparse elements.
For a dictionary $\mathcal{D}$, we define the class of elements
$$\mathcal{A}_1^0(\mathcal{D}, N) := \left\{ f : f = \sum_{k \in \Lambda} c_k \varphi_k, \; \varphi_k \in \mathcal{D}, \; \#(\Lambda) < \infty, \; \sum_{k \in \Lambda} |c_k| \le N \right\},$$
and denote by $\mathcal{A}_1(\mathcal{D}, N)$ its closure in $H$. Let $\mathcal{A}_1(\mathcal{D})$ be the union of the classes $\mathcal{A}_1(\mathcal{D}, N)$ over all $N > 0$. For $f \in \mathcal{A}_1(\mathcal{D})$, we define the norm of $f$ as
$$\|f\|_{\mathcal{A}_1(\mathcal{D})} := \inf\left\{ N : f \in \mathcal{A}_1(\mathcal{D}, N) \right\}.$$
The most natural greedy algorithm with respect to $\mathcal{D}$ is the Pure Greedy Algorithm (PGA). We recall its definition from [10].
PGA($H$, $\mathcal{D}$):
  • Step 0: Define $f_0 = 0$.
  • Step $m$:
    - If $f = f_{m-1}$, stop the algorithm and define $f_k = f_{m-1} = f$ for $k \ge m$.
    - If $f \ne f_{m-1}$, choose an element $\varphi_m \in \mathcal{D}$ such that
      $$|\langle f - f_{m-1}, \varphi_m \rangle| = \sup_{\varphi \in \mathcal{D}} |\langle f - f_{m-1}, \varphi \rangle|.$$
      Define the next approximant to be
      $$f_m = \sum_{i=1}^{m} \langle f - f_{i-1}, \varphi_i \rangle \varphi_i = f_{m-1} + \langle f - f_{m-1}, \varphi_m \rangle \varphi_m,$$
      and proceed to Step $m+1$.
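As an illustration (not from the paper), the PGA selection and update steps can be sketched in a few lines in the finite-dimensional case, where the dictionary is stored as the unit-norm rows of a matrix and the inner product is Euclidean:

```python
import numpy as np

def pga(f, D, n_iter):
    """Sketch of the Pure Greedy Algorithm in R^n.

    D: (num_atoms, n) array whose rows are unit-norm atoms;
    returns the approximant f_m after at most n_iter iterations."""
    fm = np.zeros_like(f)
    for _ in range(n_iter):
        r = f - fm                       # residual f - f_{m-1}
        if np.allclose(r, 0.0):
            break                        # f = f_{m-1}: stop
        coeffs = D @ r                   # <f - f_{m-1}, phi> for every atom
        j = int(np.argmax(np.abs(coeffs)))   # greedy selection of phi_m
        fm = fm + coeffs[j] * D[j]       # f_m = f_{m-1} + <r, phi_m> phi_m
    return fm
```

Taking the absolute value in the selection step plays the role of the symmetric dictionary $\{\pm\varphi\}$.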
We recall from [10] the following upper estimate for the convergence rate of the PGA.
Theorem 1.
Let $\mathcal{D}$ be an arbitrary dictionary in $H$. Then, for each $f \in \mathcal{A}_1(\mathcal{D}, N)$, the PGA has the following convergence rate:
$$\|f - f_m\| \le N m^{-1/6}, \quad m = 1, 2, \ldots.$$
Since the rate of convergence of the PGA is unsatisfactory, modified greedy algorithms, such as the Orthogonal Greedy Algorithm (OGA) and the Rescaled Pure Greedy Algorithm (RPGA), were proposed. We first recall the definition of the OGA from [10].
OGA($H$, $\mathcal{D}$):
  • Step 0: Define $f_0 = 0$.
  • Step $m$:
    - If $f = f_{m-1}$, stop the algorithm and define $f_k = f_{m-1} = f$ for $k \ge m$.
    - If $f \ne f_{m-1}$, choose an element $\varphi_m \in \mathcal{D}$ such that
      $$|\langle f - f_{m-1}, \varphi_m \rangle| = \sup_{\varphi \in \mathcal{D}} |\langle f - f_{m-1}, \varphi \rangle|.$$
      Define the next approximant to be
      $$f_m = P_m(f),$$
      where $P_m$ is the orthogonal projection operator onto $\operatorname{span}\{\varphi_1, \varphi_2, \ldots, \varphi_m\}$, and proceed to Step $m+1$.
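For comparison, a finite-dimensional sketch of the OGA (again our own illustration, not from the paper) replaces the coefficient update with a least-squares projection onto all atoms selected so far:

```python
import numpy as np

def oga(f, D, n_iter):
    """Sketch of the Orthogonal Greedy Algorithm in R^n.

    D: (num_atoms, n) array whose rows are unit-norm atoms."""
    selected = []
    fm = np.zeros_like(f)
    for _ in range(n_iter):
        r = f - fm                            # residual f - f_{m-1}
        if np.allclose(r, 0.0):
            break
        j = int(np.argmax(np.abs(D @ r)))     # greedy selection of phi_m
        if j not in selected:
            selected.append(j)
        A = D[selected].T                     # columns span {phi_1, ..., phi_m}
        coef, *_ = np.linalg.lstsq(A, f, rcond=None)
        fm = A @ coef                         # f_m = P_m(f)
    return fm
```

The least-squares solve is what makes each OGA step more expensive than a PGA step.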
In [10], the authors derived the following convergence rate for the OGA.
Theorem 2.
Let $\mathcal{D}$ be an arbitrary dictionary in $H$. Then, for each $f \in \mathcal{A}_1(\mathcal{D}, N)$, the OGA has the following convergence rate:
$$\|f - f_m\| \le N m^{-1/2}, \quad m = 1, 2, \ldots.$$
It is clear that when $\mathcal{D}$ is an orthonormal basis, the rate $O(m^{-1/2})$ is sharp, see [6]. Thus, this convergence rate serves as a benchmark for the approximation ability of greedy-type algorithms.
To study the error for general $f \in H$, the authors of [16] defined the K-functional for the pair $(H, \mathcal{A}_1(\mathcal{D}))$ as follows:
$$K(f, \zeta; H, \mathcal{A}_1(\mathcal{D})) := \inf_{h \in \mathcal{A}_1(\mathcal{D})} \left\{ \|f - h\|_H + \zeta \|h\|_{\mathcal{A}_1(\mathcal{D})} \right\}, \quad f \in H, \; \zeta > 0.$$
They proved that for any $f \in H$, the output $\{f_m\}_{m \ge 1}$ of the OGA satisfies
$$\|f - f_m\| \le 2 K(f, m^{-1/2}; H, \mathcal{A}_1(\mathcal{D})).$$
In [11], Petrova proposed the RPGA as follows:
RPGA($H$, $\mathcal{D}$):
  • Step 0: Define $f_0 = 0$.
  • Step $m$:
    - If $f = f_{m-1}$, stop the algorithm and define $f_k = f_{m-1} = f$ for $k \ge m$.
    - If $f \ne f_{m-1}$, choose an element $\varphi_m \in \mathcal{D}$ such that
      $$|\langle f - f_{m-1}, \varphi_m \rangle| = \sup_{\varphi \in \mathcal{D}} |\langle f - f_{m-1}, \varphi \rangle|.$$
      With
      $$\lambda_m = \langle f - f_{m-1}, \varphi_m \rangle, \quad \hat{f}_m := f_{m-1} + \lambda_m \varphi_m, \quad s_m = \frac{\langle f, \hat{f}_m \rangle}{\|\hat{f}_m\|^2},$$
      define the next approximant to be $f_m = s_m \hat{f}_m$, and proceed to Step $m+1$.
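A finite-dimensional sketch of the RPGA (our own illustration, under the same setup as the sketches above) shows that each iteration needs only one inner-product ratio instead of a full least-squares solve:

```python
import numpy as np

def rpga(f, D, n_iter):
    """Sketch of the Rescaled Pure Greedy Algorithm in R^n.

    D: (num_atoms, n) array whose rows are unit-norm atoms."""
    fm = np.zeros_like(f)
    for _ in range(n_iter):
        r = f - fm
        if np.allclose(r, 0.0):
            break
        j = int(np.argmax(np.abs(D @ r)))    # greedy selection of phi_m
        lam = D[j] @ r                       # lambda_m = <f - f_{m-1}, phi_m>
        f_hat = fm + lam * D[j]              # greedy update f_hat_m
        s = (f @ f_hat) / (f_hat @ f_hat)    # rescaling s_m = <f, f_hat>/||f_hat||^2
        fm = s * f_hat
    return fm
```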
From the definition, we find that the RPGA is simpler than the OGA from the viewpoint of computational complexity. In [11], the author derived the following convergence rate for the RPGA.
Theorem 3.
Let $\mathcal{D}$ be an arbitrary dictionary in $H$. If $f \in \mathcal{A}_1(\mathcal{D})$, then the RPGA has the following convergence rate:
$$\|f - f_m\| \le \|f\|_{\mathcal{A}_1(\mathcal{D})} (m+1)^{-1/2}, \quad m = 0, 1, 2, \ldots.$$
It was also proved in [11] that for any $f \in H$, the output $\{f_m\}_{m \ge 1}$ of the RPGA satisfies
$$\|f - f_m\| \le 2 K(f, (m+1)^{-1/2}; H, \mathcal{A}_1(\mathcal{D})).$$
Now we recall super greedy algorithms. In [14], Liu and Temlyakov proposed the Weak Orthogonal Super Greedy Algorithm (WOSGA). The WOSGA with parameter $s$ and a weakness sequence $\tau := \{t_m\}_{m=1}^{\infty}$, where $t_m \in (0, 1]$, $m = 1, 2, \ldots$, is defined as follows:
WOSGA($s$, $\tau$):
Initially, we define $f_0 := 0$. Then, for a natural number $s \ge 1$ and each $m \ge 1$, we inductively define:
(1): Denote $I_m := [(m-1)s + 1, ms]$. Let $\varphi_{(m-1)s+1}, \ldots, \varphi_{ms} \in \mathcal{D}$ satisfy the inequality
$$\min_{i \in I_m} |\langle f - f_{m-1}, \varphi_i \rangle| \ge t_m \sup_{\varphi \in \mathcal{D}, \, \varphi \ne \varphi_i, \, i \in I_m} |\langle f - f_{m-1}, \varphi \rangle|.$$
(2): Let $H_m := H_m(f) := \operatorname{span}(\varphi_1, \ldots, \varphi_{ms})$ and let $P_{H_m}$ denote the orthogonal projection operator onto $H_m$. Define
$$f_m(f) := f_{ms, \tau}(f, \mathcal{D}) := P_{H_m}(f).$$
The WOSGA selects more than one element from the dictionary at each iteration and hence reduces the computational burden of the conventional OGA. Thus, compared with the OGA, the WOSGA can construct the approximant more quickly. We now recall some results on the error estimates of the WOSGA. When $t_1 = \cdots = t_m = t$, we use the notation OSGA($s$, $t$) instead of WOSGA($s$, $\tau$). We denote by
$$\mu(\mathcal{D}) := \sup_{\varphi_1 \ne \varphi_2 \in \mathcal{D}} |\langle \varphi_1, \varphi_2 \rangle|$$
the coherence of a dictionary $\mathcal{D}$. It is clear that the smaller $\mu(\mathcal{D})$ is, the more $\mathcal{D}$ resembles an orthonormal basis. Throughout this paper, we consider dictionaries with small values of the coherence $\mu(\mathcal{D}) > 0$, and call them $\mu$-coherent dictionaries.
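For a finite dictionary stored as the rows of a matrix, the coherence is simply the largest off-diagonal entry of the Gram matrix in absolute value; a minimal sketch:

```python
import numpy as np

def coherence(D):
    """Coherence mu(D) = max |<phi_i, phi_j>| over distinct unit-norm atoms.

    D: (num_atoms, n) array whose rows are the atoms."""
    G = D @ D.T                   # Gram matrix of pairwise inner products
    np.fill_diagonal(G, 0.0)      # ignore <phi, phi> = 1 on the diagonal
    return np.abs(G).max()
```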
In [14], the authors derived the following convergence rate for the OSGA(s, t).
Theorem 4.
Let $\mathcal{D}$ be a dictionary with coherence $\mu$. Then, for $f \in \mathcal{A}_1(\mathcal{D}, 1)$ and $s \le (2\mu)^{-1}$, the OSGA($s$, $t$) provides, after $m$ iterations, an approximation of $f$ with the following convergence rate:
$$\|f - f_m\| \le \frac{81}{8} \cdot \frac{1+t}{t^2} \, (sm)^{-1/2}, \quad m = 1, 2, \ldots.$$
In [13], the authors established the following error bound for the OSGA(s,t).
Theorem 5.
Let $\mathcal{D}$ be a dictionary with coherence $\mu$. Then, for $f \in H$ and $s \le (2\mu)^{-1} + 1$, the OSGA($s$, $t$) provides, after $m$ iterations, an approximation of $f$ with the error bound
$$\|f - f_m\| \le C_1(s, \mu, t) K(f, (sm)^{-1/2}; H, \mathcal{A}_1(\mathcal{D})), \quad m = 1, 2, \ldots.$$
Motivated by the above results, in Section 2 we propose the Weak Rescaled Pure Super Greedy Algorithm (WRPSGA) and estimate its error by means of the K-functional. We give an error estimate of the form (1) for the WRPSGA. This estimate implies that the convergence rate of the WRPSGA on $\mathcal{A}_1(\mathcal{D})$ is $O(m^{-1/2})$, which is optimal. In Section 3, we design the Rescaled Pure Super Greedy Learning Algorithm (RPSGLA) for solving the regression problem, which is fundamental in statistical learning. We derive a learning rate that can be arbitrarily close to the best rate $O(m^{-1})$. In Section 4, we prove the main result stated in Section 3. In Section 5, we test the performance of the RPSGLA by numerical experiments; the simulation results show that the RPSGLA is very efficient for regression. In Section 6, we compare the RPSGA with other greedy algorithms and show that the RPSGA is the most efficient. In Section 7, we give the conclusions of our study and make some suggestions for further research.

2. The Weak Rescaled Pure Super Greedy Algorithms

In this section, we present the definition of the Weak Rescaled Pure Super Greedy Algorithm (WRPSGA) and study its approximation ability.
The WRPSGA with parameter $s$ and a weakness sequence $\tau := \{t_m\}_{m=1}^{\infty}$, where $t_m \in (0, 1]$, $m = 1, 2, \ldots$, is defined as follows:
WRPSGA($s$, $\tau$):
Initially, we define $f_0 = 0$. Then, for a natural number $s \ge 1$ and each $m \ge 1$, we inductively define:
(1): Denote $I_m := [(m-1)s + 1, ms]$. Let $\varphi_{(m-1)s+1}, \ldots, \varphi_{ms} \in \mathcal{D}$ satisfy the inequality
$$\min_{i \in I_m} |\langle f - f_{m-1}, \varphi_i \rangle| \ge t_m \sup_{\varphi \in \mathcal{D}, \, \varphi \ne \varphi_i, \, i \in I_m} |\langle f - f_{m-1}, \varphi \rangle|.$$
(2): Let $F_m := \operatorname{span}(\varphi_{(m-1)s+1}, \ldots, \varphi_{ms})$ and let $P_{F_m}$ denote the orthogonal projection operator onto $F_m$. With
$$\hat{f}_m = f_{m-1} + P_{F_m}(f - f_{m-1}),$$
define the next approximant to be
$$f_m := f_{ms, \tau}(f, \mathcal{D}) := s_m \hat{f}_m, \quad \text{with } s_m = \frac{\langle f, \hat{f}_m \rangle}{\langle \hat{f}_m, \hat{f}_m \rangle}.$$
When $t_1 = \cdots = t_m = t$, we write RPSGA($s$, $t$) for the WRPSGA($s$, $\tau$).
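Combining the super greedy selection with the rescaled projection step, the RPSGA($s$, 1) can be sketched in the finite-dimensional setting as follows (an illustration of the definition, not the paper's implementation):

```python
import numpy as np

def rpsga(f, D, s, n_iter):
    """Sketch of RPSGA(s, 1) in R^n: select s atoms per iteration,
    project the residual onto their span, then rescale the update.

    D: (num_atoms, n) array whose rows are unit-norm atoms."""
    fm = np.zeros_like(f)
    for _ in range(n_iter):
        r = f - fm
        if np.allclose(r, 0.0):
            break
        idx = np.argsort(-np.abs(D @ r))[:s]   # s largest |<r, phi>| (t = 1)
        A = D[idx].T
        coef, *_ = np.linalg.lstsq(A, r, rcond=None)
        f_hat = fm + A @ coef                  # f_hat_m = f_{m-1} + P_{F_m}(r)
        sm = (f @ f_hat) / (f_hat @ f_hat)     # rescaling factor s_m
        fm = sm * f_hat
    return fm
```

Note that only an $s$-dimensional projection of the residual is computed per step, never a projection onto all $ms$ atoms chosen so far.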
We now state the main result about the error estimate for the RPSGA(s, t).
Theorem 6.
Let $\mathcal{D}$ be a dictionary with coherence parameter $\mu := \mu(\mathcal{D})$, and let $0 < t \le 1$. Then, for $f \in H$ and $s \le \frac{1}{2\mu} + \frac{3}{2}$, the error of the RPSGA($s$, $t$) satisfies
$$\|f - f_m\| \le C(s, \mu, t) K\left(f, (s(1+m))^{-1/2}; H, \mathcal{A}_1(\mathcal{D})\right), \quad m = 1, 2, \ldots,$$
where $C(s, \mu, t) = (3 + 5t)(1 + \mu(s-1)) / \left( t^2 (1 - \mu(s-1))^{1/2} \right)$.
Theorem 6 implies the following corollary.
Corollary 1.
Under the assumptions of Theorem 6, if $f \in \mathcal{A}_1(\mathcal{D})$, then the error of the RPSGA($s$, $t$) satisfies
$$\|f - f_m\| \le C(s, \mu, t) \|f\|_{\mathcal{A}_1(\mathcal{D})} (s(1+m))^{-1/2}, \quad m = 1, 2, \ldots.$$
Remark 1.
We remark that the RPSGA($s$, $t$) adds $s$ new elements at each iteration, while the RPGA adds only one, so the RPSGA($s$, $t$) reduces the computational cost of the RPGA. Theorem 4 and Corollary 1 show that the performance of the RPSGA($s$, $t$) is as good as that of the OSGA($s$, $t$) in the sense of the rate of convergence. However, from a computational point of view, the OSGA($s$, $t$) is more expensive to implement, since at each step it requires the evaluation of the orthogonal projection of $f$ onto $\operatorname{span}\{\varphi_1, \varphi_2, \ldots, \varphi_{ms}\}$, whereas the output of the RPSGA($s$, $t$) is the orthogonal projection of $f$ onto the one-dimensional space spanned by $\hat{f}_m$.
Remark 2.
Although the convergence rates of the OSGA($s$, $t$) and the RPSGA($s$, $t$) on $\mathcal{A}_1(\mathcal{D})$ are almost the same, see Theorem 4 and Corollary 1, the corresponding constants are different. The constant $C(s, \mu, t)$ in Corollary 1 is not as good as that in Theorem 4. This is because $C(s, \mu, t)$ is inherited from Theorem 6, which holds for any $f \in H$, not just for $f \in \mathcal{A}_1(\mathcal{D})$; there is a trade-off between universality and accuracy. Nevertheless, our results still have some advantages: the range of $s$ in Corollary 1 is wider than that in Theorem 4, and the factor $(1+m)^{-1/2}$ is slightly better than the $m^{-1/2}$ in Theorem 4. It would be very interesting to derive the error of the RPSGA($s$, $t$) for $f \in \mathcal{A}_1(\mathcal{D})$ directly; this would show whether the constant $C(s, \mu, t)$ can be improved.
In order to prove Theorem 6, we need the following two lemmas.
Lemma 1
([13]). Assume a dictionary $\mathcal{D}$ has coherence $\mu$. Let $\{g_i\}_{i=1}^{s} \subset \mathcal{D}$ and $H(s) = \operatorname{span}(g_1, \ldots, g_s)$. Then, we have
$$\frac{1}{1 + \mu(s-1)} \sum_{i=1}^{s} \langle f, g_i \rangle^2 \le \|P_{H(s)}(f)\|^2 \le \frac{1}{1 - \mu(s-1)} \sum_{i=1}^{s} \langle f, g_i \rangle^2.$$
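Lemma 1 can be checked numerically on a small example; the following sketch (with hypothetical random atoms) verifies the two-sided bound for $s = 2$, where the Gram matrix of the pair has eigenvalues $1 \pm \mu$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: s = 2 random unit-norm atoms g_1, g_2 in R^3.
g = rng.normal(size=(2, 3))
g /= np.linalg.norm(g, axis=1, keepdims=True)
mu = abs(g[0] @ g[1])                  # coherence of the pair {g_1, g_2}
f = rng.normal(size=3)

# ||P_{H(s)} f||^2 via least squares onto span(g_1, g_2)
coef, *_ = np.linalg.lstsq(g.T, f, rcond=None)
proj_sq = float(np.linalg.norm(g.T @ coef) ** 2)
sums = float((f @ g[0]) ** 2 + (f @ g[1]) ** 2)   # sum of <f, g_i>^2

# Two-sided bound of Lemma 1 with s = 2, i.e. mu(s-1) = mu
assert sums / (1 + mu) <= proj_sq + 1e-9
assert proj_sq <= sums / (1 - mu) + 1e-9
```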
Lemma 2
([23]). Let $\ell > 0$, $r > 0$, $B > 0$, and $J \in \mathbb{N}$ be fixed, and let $\{a_m\}_{m=1}^{\infty}$ and $\{b_m\}_{m=2}^{\infty}$ be sequences of non-negative numbers satisfying the inequalities
$$a_J \le B, \qquad a_m \le a_{m-1}\left(1 - \frac{b_m}{r} a_{m-1}^{\ell}\right), \quad m = J+1, J+2, \ldots.$$
Then, we have
$$a_m \le \max\{1, \ell^{-1/\ell}\} \, r^{1/\ell} \left( r B^{-\ell} + \sum_{k=J+1}^{m} b_k \right)^{-1/\ell}, \quad m = J+1, J+2, \ldots.$$
Proof of Theorem 6: 
First, we show that for any $f \in H$ and any $h \in \mathcal{A}_1(\mathcal{D})$, the inequality
$$\|f - f_m\|^2 \le \|f - h\|^2 + C(s, \mu, t)^2 (s(1+m))^{-1} \|h\|_{\mathcal{A}_1(\mathcal{D})}^2 \quad (2)$$
holds for $m = 1, 2, \ldots$.
Since $\mathcal{A}_1^0(\mathcal{D}, N)$ is dense in $\mathcal{A}_1(\mathcal{D}, N)$, it is enough to prove (2) for elements $h$ that are finite sums $h = \sum_j c_j g_j$ with $\sum_j |c_j| \le N$ and $g_j \in \mathcal{D}$. Let us fix $\epsilon > 0$ and choose a representation $h = \sum_{j=1}^{\infty} c_j g_j$ such that
$$\sum_{j=1}^{\infty} |c_j| < N + \epsilon, \qquad |c_1| \ge |c_2| \ge |c_3| \ge \cdots.$$
Let $P_{\mathcal{A}_1(\mathcal{D})}(f)$ be the projection of $f$ onto $\mathcal{A}_1(\mathcal{D})$. Note that $P_{\mathcal{A}_1(\mathcal{D})}(f)$ can be approximated arbitrarily well by elements of the above form in $\mathcal{A}_1(\mathcal{D})$.
Suppose $q$ is such that $|c_q| \ge \frac{a}{s}(N + \epsilon) \ge |c_{q+1}|$ with $a = \frac{3+t}{2t}$. Then, the above assumption on the sequence $\{c_j\}$ implies that $q \le \frac{s}{a}$ and $|c_{s+1}| < \frac{N + \epsilon}{s}$.
We claim that the elements $g_1, \ldots, g_q$ will be chosen among $\varphi_1, \ldots, \varphi_s$ at the first iteration.
Indeed, for $i \in [1, q]$, we have
$$\langle P_{\mathcal{A}_1(\mathcal{D})}(f), g_i \rangle = \left\langle \sum_{j=1}^{\infty} c_j g_j, g_i \right\rangle \ge c_i - \left| \sum_{j \ne i} c_j \langle g_j, g_i \rangle \right| \ge c_i - \mu(N + \epsilon - |c_i|) \ge \frac{a}{s}(N + \epsilon)(1 + \mu) - \mu(N + \epsilon).$$
For all $g \in \mathcal{D}$ distinct from $g_1, \ldots, g_s$, we have
$$\left| \langle P_{\mathcal{A}_1(\mathcal{D})}(f), g \rangle \right| = \left| \left\langle \sum_{j=1}^{\infty} c_j g_j, g \right\rangle \right| \le |c_g| + \left| \sum_{g_j \ne g} c_j \langle g_j, g \rangle \right| \le \frac{N + \epsilon}{s} + \mu(N + \epsilon) = (N + \epsilon)\left(\mu + \frac{1}{s}\right).$$
Our assumption $s \le \frac{1}{2\mu} + \frac{3}{2}$ implies $\frac{a}{s} t (N + \epsilon)(1 + \mu) - t \mu (N + \epsilon) \ge (N + \epsilon)\left(\mu + \frac{1}{s}\right)$. Then, we obtain $|\langle P_{\mathcal{A}_1(\mathcal{D})}(f), g_i \rangle| \ge |\langle P_{\mathcal{A}_1(\mathcal{D})}(f), g \rangle|$, which implies that $|\langle f, g_i \rangle| \ge |\langle f, g \rangle|$. Thus, we do not pick any $g \in \mathcal{D}$ distinct from $g_1, \ldots, g_s$ until we have chosen all of $g_1, \ldots, g_q$.
We denote $r_m = f - f_m$ and $a_m = \|r_m\|^2 - \|f - h\|^2$, $m = 0, 1, 2, \ldots$. Since $f_m$ is the orthogonal projection of $f$ onto $\operatorname{span}\{\hat{f}_m\}$, we have
$$\langle r_m, f_m \rangle = 0, \quad (3)$$
and
$$\|r_m\|^2 = \|f - s_m \hat{f}_m\|^2 \le \|f - \hat{f}_m\|^2. \quad (4)$$
According to the definition of $\hat{f}_m$ and the choice of $\varphi_i$, $i \in I_m$, we obtain
$$\|f - \hat{f}_m\|^2 = \langle r_{m-1} - P_{F_m}(r_{m-1}), \, r_{m-1} - P_{F_m}(r_{m-1}) \rangle = \|r_{m-1}\|^2 - 2\langle r_{m-1}, P_{F_m}(r_{m-1}) \rangle + \|P_{F_m}(r_{m-1})\|^2.$$
We deduce
$$\langle r_{m-1}, P_{F_m}(r_{m-1}) \rangle = \langle P_{F_m}(r_{m-1}) + r_{m-1} - P_{F_m}(r_{m-1}), \, P_{F_m}(r_{m-1}) \rangle = \|P_{F_m}(r_{m-1})\|^2.$$
Denote $p_m = P_{F_m}(r_{m-1})$. We have
$$\|f - \hat{f}_m\|^2 = \|r_{m-1}\|^2 - \|p_m\|^2. \quad (5)$$
Combining (4) with (5), we have
$$\|r_m\|^2 \le \|r_{m-1}\|^2 - \|p_m\|^2. \quad (6)$$
It follows from (6) that $\{\|r_m\|\}_{m=0}^{\infty}$ is a decreasing sequence, and hence $\{a_m\}_{m=0}^{\infty}$ is also a decreasing sequence.
Now the proof is divided into two cases, according to whether $a_0 \le 0$ or $a_0 > 0$.
Case 1: $a_0 := \|f\|^2 - \|f - h\|^2 \le 0$. Then $a_m \le 0$ for all $m \ge 1$; hence, inequality (2) holds for $m = 1, 2, \ldots$.
Case 2: $a_0 > 0$. With (3), we notice that
$$\|r_{m-1}\|^2 = \|f - f_{m-1}\|^2 = \langle f - f_{m-1}, f - f_{m-1} \rangle = \langle f - f_{m-1}, f \rangle = \langle f - f_{m-1}, h + f - h \rangle \le \frac{1}{2}\left( \|r_{m-1}\|^2 + \|f - h\|^2 \right) + \langle r_{m-1}, h \rangle. \quad (7)$$
We proceed with a lower estimate for $\|p_m\|$. We consider the following quantity for $r_{m-1}$:
$$p_s := \sup_{h_i \in \mathcal{D}, \, i \in [1, s]} \|P_{H(s)}(r_{m-1})\|,$$
where $H(s) := \operatorname{span}(h_1, \ldots, h_s)$. Applying Lemma 1, we have
$$p_s^2 \le \sup_{h_i \in \mathcal{D}, \, i \in [1, s]} (1 - \mu(s-1))^{-1} \sum_{i=1}^{s} \langle r_{m-1}, h_i \rangle^2. \quad (8)$$
In order to obtain the relation between $p_s$ and $\|p_m\|$, we consider an arbitrary set $\{h_i, \, i \in [1, s]\}$ of distinct elements of $\mathcal{D}$. Let $W = \{ i \in [1, s] \; | \; h_i = \varphi_{k(i)}, \, k(i) \in I_m \}$ and $W' = \{ k(i), \, i \in W \}$. Then,
$$\sum_{i=1}^{s} \langle r_{m-1}, h_i \rangle^2 = \sum_{i \in W} \langle r_{m-1}, h_i \rangle^2 + \sum_{i \in [1, s] \setminus W} \langle r_{m-1}, h_i \rangle^2.$$
Using the choice of $\{\varphi_k\}_{k \in I_m}$, we have
$$\min_{k \in I_m \setminus W'} |\langle r_{m-1}, \varphi_k \rangle| \ge t \sup_{i \in [1, s] \setminus W} |\langle r_{m-1}, h_i \rangle|,$$
and hence
$$\sum_{i=1}^{s} \langle r_{m-1}, h_i \rangle^2 \le \sum_{k \in W'} \langle r_{m-1}, \varphi_k \rangle^2 + t^{-2} \sum_{k \in I_m \setminus W'} \langle r_{m-1}, \varphi_k \rangle^2 \le t^{-2} \sum_{k \in I_m} \langle r_{m-1}, \varphi_k \rangle^2.$$
Therefore, by (8) and Lemma 1, we obtain
$$p_s^2 \le \frac{1}{(1 - \mu(s-1)) t^2} \sum_{k \in I_m} \langle r_{m-1}, \varphi_k \rangle^2 \le \frac{1 + \mu(s-1)}{(1 - \mu(s-1)) t^2} \|p_m\|^2. \quad (9)$$
For any $h \in \mathcal{A}_1(\mathcal{D})$, we turn to bounding $\langle r_{m-1}, h \rangle$. For $m \ge 2$, we write $\langle r_{m-1}, h \rangle$ as
$$\langle r_{m-1}, h \rangle = \left\langle r_{m-1}, \sum_{j=1}^{q} c_j g_j \right\rangle + \left\langle r_{m-1}, \sum_{j=q+1}^{\infty} c_j g_j \right\rangle.$$
Setting $J_l := [(l-1)s + q + 1, \, ls + q]$, we first bound $\langle r_{m-1}, \sum_{j=q+1}^{\infty} c_j g_j \rangle$ as follows:
$$\left\langle r_{m-1}, \sum_{j=q+1}^{\infty} c_j g_j \right\rangle \le \sum_{l=1}^{\infty} \left( \sum_{j \in J_l} c_j^2 \right)^{1/2} \left( \sum_{j \in J_l} \langle r_{m-1}, g_j \rangle^2 \right)^{1/2} \le \sum_{l=1}^{\infty} \left( \sum_{j \in J_l} c_j^2 \right)^{1/2} (1 + \mu(s-1))^{1/2} \|P_{H(J_l)}(r_{m-1})\| \le \sum_{l=1}^{\infty} \left( \sum_{j \in J_l} c_j^2 \right)^{1/2} (1 + \mu(s-1))^{1/2} p_s. \quad (10)$$
Since the sequence $\{c_j\}$ has the properties
$$|c_{q+1}| \ge |c_{q+2}| \ge |c_{q+3}| \ge \cdots, \qquad \sum_{j=q+1}^{\infty} |c_j| \le N + \epsilon, \qquad |c_{q+1}| \le \frac{a}{s}(N + \epsilon),$$
we may use the simple inequality
$$\left( \sum_{j \in J_l} c_j^2 \right)^{1/2} \le s^{1/2} |c_{(l-1)s + q + 1}|$$
to derive
$$\sum_{l=1}^{\infty} \left( \sum_{j \in J_l} c_j^2 \right)^{1/2} \le \sum_{l=1}^{\infty} s^{1/2} |c_{(l-1)s + q + 1}| \le s^{1/2} \left( \frac{a}{s}(N + \epsilon) + s^{-1} \sum_{l=2}^{\infty} \sum_{j \in J_{l-1}} |c_j| \right) \le s^{-1/2} (N + \epsilon)(a + 1). \quad (11)$$
Next, we bound $\langle r_{m-1}, \sum_{j=1}^{q} c_j g_j \rangle$ as follows:
$$\left\langle r_{m-1}, \sum_{j=1}^{q} c_j g_j \right\rangle \le s^{1/2} \, \frac{N + \epsilon}{s} \, (1 + \mu(s-1))^{1/2} p_s. \quad (12)$$
Furthermore, by (7) and (10)–(12), we have
$$p_s \ge \frac{s^{1/2} \left( \|r_{m-1}\|^2 - \|f - h\|^2 \right)}{2(a + 2)(N + \epsilon)(1 + \mu(s-1))^{1/2}}. \quad (13)$$
By (6), (9), and (13), we get
$$\|r_m\|^2 \le \|r_{m-1}\|^2 - \frac{(1 - \mu(s-1)) t^2}{1 + \mu(s-1)} \, p_s^2 \le \|r_{m-1}\|^2 - \frac{s \, t^2 (1 - \mu(s-1)) \left( \|r_{m-1}\|^2 - \|f - h\|^2 \right)^2}{4 (1 + \mu(s-1))^2 (a + 2)^2 (N + \epsilon)^2}. \quad (14)$$
Letting $\epsilon \to 0$ and subtracting $\|f - h\|^2$ from both sides of (14), we obtain
$$a_m \le a_{m-1} \left( 1 - \frac{s \, a_{m-1}}{N^2 C(s, \mu, t)^2} \right), \quad (15)$$
where the constant $C(s, \mu, t)$ is defined in Theorem 6.
Case 2.1: $0 < a_0 < N^2 C(s, \mu, t)^2 s^{-1}$. By the decreasing property of the sequence $\{a_m\}_{m=0}^{\infty}$, either $\{a_m\}_{m \ge 0} \subset (0, N^2 C(s, \mu, t)^2 s^{-1})$, or $a_{m^*} \le 0$ for some $m^* \ge 1$, and for $m \ge m^*$ the arguments are as in Case 1. Applying Lemma 2 to the positive terms of the sequence $\{a_m\}_{m=0}^{\infty}$ with $\ell = 1$, $b_m = 1$, $B = N^2 C(s, \mu, t)^2 s^{-1}$, $J = 0$, and $r = N^2 C(s, \mu, t)^2 s^{-1}$, we obtain
$$a_m \le N^2 C(s, \mu, t)^2 (s(1+m))^{-1}.$$
Case 2.2: $a_0 \ge N^2 C(s, \mu, t)^2 s^{-1}$. Taking $m = 1$ in inequality (15), we get
$$a_1 \le a_0 \left( 1 - \frac{s \, a_0}{N^2 C(s, \mu, t)^2} \right).$$
Therefore, $a_1 \le 0$, that is, $\|r_1\| \le \|f - h\|$, which gives (2) because of monotonicity.
Now inequality (2) has been proved completely. Then, we have
$$\|f - f_m\| \le K\left( f, C(s, \mu, t)(s(1+m))^{-1/2}; H, \mathcal{A}_1(\mathcal{D}) \right) \le C(s, \mu, t) K\left( f, (s(1+m))^{-1/2}; H, \mathcal{A}_1(\mathcal{D}) \right).$$
   □
Proof of Corollary 1: 
If $f \in \mathcal{A}_1(\mathcal{D})$, then, taking $h = f$ in the definition of the K-functional,
$$K\left( f, (s(1+m))^{-1/2}; H, \mathcal{A}_1(\mathcal{D}) \right) \le \|f\|_{\mathcal{A}_1(\mathcal{D})} (s(1+m))^{-1/2}.$$
Thus we get the desired result from Theorem 6.    □

3. The Rescaled Pure Super Greedy Learning Algorithms

In this section, we consider the application of the RPSGA($s$, 1) to supervised learning. In this context, the RPSGA($s$, 1) takes a new form, which we call the Rescaled Pure Super Greedy Learning Algorithm (RPSGLA). The precise definition of the RPSGLA is given below.
We first formulate the problem of supervised learning. Let the input space $X \subset \mathbb{R}^d$ be a compact subset and the output space $Y \subset \mathbb{R}$. Let $\rho$ be a Borel probability measure on $Z = X \times Y$. The generalization error of a function $f : X \to Y$ is defined as
$$\mathcal{E}(f) := \int_Z (f(x) - y)^2 \, d\rho.$$
For each input x X and output y Y , ( f ( x ) y ) 2 is the error suffered from the use of f as a model for the process producing y from x. By integrating over X × Y with respect to ρ , we average out the error over all pairs ( x , y ) .
Denote by $L^2_{\rho_X}$ the Hilbert space of square-integrable functions defined on $X$ with respect to the measure $\rho_X$, where $\rho_X$ is the marginal measure of $\rho$ on $X$, with the norm
$$\|f\|_{L^2_{\rho_X}} = \left( \int_X |f(x)|^2 \, d\rho_X \right)^{1/2}.$$
The regression function $f_\rho$, which minimizes $\mathcal{E}(f)$ over all $f \in L^2_{\rho_X}$, is given by
$$f_\rho(x) := \int_Y y \, d\rho(y|x),$$
where $\rho(\cdot|x)$ is the conditional distribution induced by $\rho$ at $x \in X$. In the framework of supervised learning, $\rho$ is unknown; what we have in hand is a set of random samples $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{m} \in Z^m$ drawn identically and independently from $\rho$. Without loss of generality, we assume that $|y_i| \le M$ for a fixed $M > 0$. The task is to find a good approximation $f_{\mathbf{z}}$ of the regression function, derived from some learning algorithm. To measure the approximation ability of $f_{\mathbf{z}}$, we estimate the excess generalization error
$$\mathcal{E}(f_{\mathbf{z}}) - \mathcal{E}(f_\rho) = \|f_{\mathbf{z}} - f_\rho\|^2_{L^2_{\rho_X}}.$$
In the design of the learning algorithm, we replace the generalization error $\mathcal{E}(f)$ with the empirical error
$$\mathcal{E}_{\mathbf{z}}(f) := \frac{1}{m} \sum_{i=1}^{m} (f(x_i) - y_i)^2.$$
We expect to find a good approximation of f ρ by minimizing E z ( f ) in a suitable way.
Given the training data $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{m}$, the empirical inner product and the empirical norm are defined as follows:
$$\langle f, g \rangle_m = \frac{1}{m} \sum_{i=1}^{m} f(x_i) g(x_i),$$
$$\|f\|_m^2 := \frac{1}{m} \sum_{i=1}^{m} |f(x_i)|^2.$$
With these definitions, the empirical error can be represented as
$$\mathcal{E}_{\mathbf{z}}(f) = \|f - y\|_m^2.$$
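Given the function values at the sample points, the empirical inner product and norm are plain averages; a minimal sketch:

```python
import numpy as np

def empirical_inner(f_vals, g_vals):
    """Empirical inner product <f, g>_m from values at the sample points."""
    return np.mean(f_vals * g_vals)

def empirical_norm_sq(f_vals):
    """Empirical norm ||f||_m^2 = <f, f>_m."""
    return np.mean(f_vals ** 2)
```

With `f_vals = f(x_1..x_m)` and `g_vals = y`, `empirical_norm_sq(f_vals - g_vals)` is exactly the empirical error $\mathcal{E}_{\mathbf{z}}(f)$.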
We consider kernel-based greedy learning algorithms, which have been studied extensively in machine learning, see [13,16,18,20,22,24]. We will use continuous kernels, which are more general than the usual Mercer kernels, see [25,26]. Hypothesis spaces based on continuous kernels have been widely used in many fields of machine learning, see, for instance, Refs. [18,27,28]; the wider selection of kernels offers more flexibility. We say that a function $K$ defined on $X \times X$ is a continuous kernel if $K$ is continuous at every $(u, v) \in X \times X$. Let $K$ be a continuous and symmetric kernel, and assume, in addition, that $K$ is positive definite. Then, the space $H = \operatorname{span}\{ K_x := K(x, \cdot) : x \in X \}$ is a pre-Hilbert space equipped with the inner product $\langle K_{x_1}, K_{x_2} \rangle_H = K(x_1, x_2)$. Let $\overline{H}$ be the closure of $H$ with respect to the norm induced by this inner product. Then, $\overline{H}$ is the reproducing kernel Hilbert space (RKHS) associated with the kernel $K$. The reproducing property is given by $\langle K_x, f \rangle_H = f(x)$, $x \in X$, $f \in \overline{H}$. For more details, one can see [25].
We define a data-independent hypothesis space as follows.
Definition 1.
Define a data-independent hypothesis space
$$\mathcal{L}_1 = \left\{ f : f = \sum_{j=0}^{\infty} \beta_j \tilde{K}_{v_j}, \; \sum_{j=0}^{\infty} |\beta_j| < \infty, \; \{v_j\} \subset X, \; \tilde{K}_{v_j} = \frac{K_{v_j}}{\|K_{v_j}\|_{L^2_{\rho_X}}} \right\},$$
with the norm
$$\|f\|_{\mathcal{L}_1} = \inf \left\{ \sum_{j=0}^{\infty} |\beta_j| \right\},$$
where the infimum is taken over all such representations of $f$.
Note that the space $\mathcal{L}_1$ is a special kind of $\mathcal{A}_1$ space, introduced in [10]; see Section 1.
It follows from the definition of $f_\rho$ that $|f_\rho| \le M$, so it is natural to restrict the approximating functions to $[-M, M]$. The following truncation operator has been used in the error analysis of learning algorithms to improve the learning rate, see [16,18,22].
Definition 2.
The truncation operator π M is defined on the space of measurable functions f : X R as
$$\pi_M(f)(x) = \begin{cases} M, & \text{if } f(x) > M; \\ -M, & \text{if } f(x) < -M; \\ f(x), & \text{otherwise}. \end{cases}$$
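On function values, the truncation operator $\pi_M$ is exactly a clip to $[-M, M]$:

```python
import numpy as np

def truncate(f_vals, M):
    """Truncation operator pi_M on function values: clip to [-M, M]."""
    return np.clip(f_vals, -M, M)
```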
Given the training data $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{m} \subset X \times Y$, we define the data-dependent hypothesis space as
$$\mathcal{L}_{\mathbf{z}} := \left\{ f : f = \sum_{i=1}^{m} \alpha_i \tilde{K}_{x_i}, \; \sum_{i=1}^{m} |\alpha_i| < \infty \right\},$$
and define the norm on $\mathcal{L}_{\mathbf{z}}$ as
$$\|f\|_{\mathcal{L}_{\mathbf{z}}} = \inf \left\{ \sum_{i=1}^{m} |\alpha_i| \right\}, \quad f \in \mathcal{L}_{\mathbf{z}},$$
where the infimum is taken over all such representations. Thus, $\mathcal{L}_{\mathbf{z}}$ is a subspace of $\mathcal{L}_1$.
We will choose an approximant of $f_\rho$ from $\mathcal{L}_{\mathbf{z}}$ by the RPSGLA. We now present the RPSGLA as Algorithm 1:
Algorithm 1 The RPSGLA
Input: $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{m}$, the kernel $K$, and $T > 0$; set $\mathcal{D}_m = \{K_{x_i}, \, i = 1, 2, \ldots, m\}$.
Step 1. Normalization: $\hat{K}_{x_i} = K_{x_i} / \|K_{x_i}\|_m$ and dictionary $\mathcal{D}_m^* = \{\pm \hat{K}_{x_i}\}$, $i = 1, 2, \ldots, m$.
Step 2. Computation: Let $f_0 = 0$ and $r_k = y - f_k$.
1: Denote $I_k := [(k-1)s + 1, ks]$. Let $\varphi_{(k-1)s+1}, \ldots, \varphi_{ks} \in \mathcal{D}_m^*$ satisfy the inequality
$$\min_{i \in I_k} |\langle r_{k-1}, \varphi_i \rangle_m| \ge \sup_{\varphi \in \mathcal{D}_m^*, \, \varphi \ne \varphi_i, \, i \in I_k} |\langle r_{k-1}, \varphi \rangle_m|.$$
2: Let $F_k := \operatorname{span}(\varphi_{(k-1)s+1}, \ldots, \varphi_{ks})$ and let $P_{F_k}$ denote the orthogonal projection operator onto $F_k$ with respect to $\langle \cdot, \cdot \rangle_m$. With
$$\hat{f}_k = f_{k-1} + P_{F_k}(r_{k-1}),$$
define the next approximant to be
$$f_k := s_k \hat{f}_k, \quad \text{with } s_k = \frac{\langle y, \hat{f}_k \rangle_m}{\langle \hat{f}_k, \hat{f}_k \rangle_m}.$$
3: if $\|y - f_k\|_m^2 + \|f_k\|_{\mathcal{L}_{\mathbf{z}}} \ge \|y\|_m^2$ or $k \ge T$, then stop.
Output: $\pi_M(f_k)$
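A compact numerical sketch of the RPSGLA (our own illustration for the $t = 1$ case; the normalization, selection, projection, rescaling, and truncation steps mirror Algorithm 1, while the $\|f_k\|_{\mathcal{L}_{\mathbf{z}}}$-based stopping test is omitted for brevity):

```python
import numpy as np

def rpsgla(X, y, kernel, s, T, M):
    """Sketch of the RPSGLA: super greedy kernel regression.

    X: (m, d) sample inputs; y: (m,) outputs; kernel(u, v) -> float.
    Works with function values at the sample points, where the empirical
    inner product is <f, g>_m = mean(f * g)."""
    m = len(y)
    G = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    D = G / np.sqrt(np.mean(G**2, axis=1, keepdims=True))  # rows: K_{x_i}/||K_{x_i}||_m
    fk = np.zeros(m)                       # f_k evaluated at the sample points
    for _ in range(T):
        r = y - fk
        idx = np.argsort(-np.abs(D @ r))[:s]   # s atoms maximizing |<r, phi>_m|
        A = D[idx].T
        coef, *_ = np.linalg.lstsq(A, r, rcond=None)
        f_hat = fk + A @ coef                  # f_hat_k = f_{k-1} + P_{F_k}(r)
        denom = np.mean(f_hat**2)
        if denom == 0.0:
            break
        sk = np.mean(y * f_hat) / denom        # rescaling s_k = <y, f_hat>_m/<f_hat, f_hat>_m
        fk = sk * f_hat
        if np.allclose(y, fk):
            break
    return np.clip(fk, -M, M)                  # truncation pi_M
```

Selecting by the absolute value of the correlation plays the role of the signed dictionary $\{\pm \hat{K}_{x_i}\}$.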
In the case of s = 1 , this algorithm coincides with the Rescaled Pure Greedy Learning Algorithm (RPGLA), which was studied in [24]. Here we concentrate on s > 1 .
We recall the following condition which has been widely used for error analysis, see, for instance, Refs. [18,20,26,27,29].
Definition 3.
We say that a kernel $K$ is a $C^\gamma$ kernel with $\gamma > 0$ if there exists some constant $C_\gamma > 0$ such that
$$\left| K^{([\gamma])}(u, x) - K^{([\gamma])}(u, x') \right| \le C_\gamma |x - x'|^{\gamma - [\gamma]}, \quad u, x, x' \in X,$$
where $[\gamma]$ denotes the largest integer not exceeding $\gamma$ and $K^{([\gamma])}(u, x)$ denotes the $[\gamma]$-th partial derivative of $K(u, x)$ with respect to the variable $x$. We set $K^{(0)}(u, x) = K(u, x)$.
We recall two types of kernels that are widely used in practice; both are $C^\gamma$ kernels for every $\gamma > 0$.
The first is the Gaussian kernel: $K^\sigma(u, x) = \exp(-\sigma^2 \|u - x\|^2)$, $u, x \in \mathbb{R}^d$, $\sigma > 0$.
The second is the polynomial kernel: $K^n(u, x) = (1 + u \cdot x)^n$, $u, x \in \mathbb{R}^d$, $n \in \mathbb{N}$.
Here, $u \cdot x$ is the Euclidean inner product of $u$ and $x$ in $\mathbb{R}^d$.
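Both kernels are one-liners; a sketch:

```python
import numpy as np

def gaussian_kernel(u, x, sigma=1.0):
    """Gaussian kernel K(u, x) = exp(-sigma^2 * ||u - x||^2)."""
    return np.exp(-sigma**2 * np.sum((np.asarray(u) - np.asarray(x)) ** 2))

def polynomial_kernel(u, x, n=2):
    """Polynomial kernel K(u, x) = (1 + u . x)^n."""
    return (1.0 + np.dot(u, x)) ** n
```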
Let C ( X ) denote the space of continuous functions defined on X.
Definition 4.
For $0 < r \le \frac{1}{2}$, we define the space $\mathcal{L}_1^r$ to be the set of all functions $f$ such that, for all $m$, there exists $h \in \operatorname{span}\{\mathcal{D}_m\}$ such that
$$\|h\|_{\mathcal{L}_1} \le B, \qquad \|f - h\|_\infty \le B m^{-r},$$
where $\|\cdot\|_\infty$ denotes the uniform norm on $C(X)$; the smallest constant $B$ for which this holds defines a norm on $\mathcal{L}_1^r$.
Under the assumption $f_\rho \in \mathcal{L}_1^r$, we obtain the following convergence rate for the RPSGLA.
Theorem 7.
Assume that $f_\rho \in \mathcal{L}_1^r$ and $\delta \in (0, 1)$. Let $K$ be a $C^\gamma$ kernel with $0 < k_1 \le K(u, v) \le k_2$ for any $u, v \in X$. Choose $T \ge m$. For $s \le \frac{1}{2\mu} + \frac{3}{2}$, the output of the RPSGLA satisfies the following inequality with confidence $1 - \delta$:
$$\mathcal{E}(\pi_M(f_m)) - \mathcal{E}(f_\rho) \le c B^2 \log\frac{3}{\delta} \left( (sm)^{-1} + m^{-2r} + m^{-\frac{2}{p+2}} \right), \quad (17)$$
where
$$p = \begin{cases} 2d/(d + 2\gamma), & 0 < \gamma \le 1, \\ 2d/(d + 2), & 1 < \gamma \le 1 + d/2, \\ d/\gamma, & \gamma > 1 + d/2, \end{cases}$$
and the constant $c$ depends at most on $C_\gamma$, $\delta$, $K$, and $M$.
Remark 3.
If we take $K$ to be an infinitely smooth kernel, such as a Gaussian kernel, then the result of Theorem 7 is valid for any $\gamma > 0$, and the convergence rate of the RPSGLA can be arbitrarily close to $O(m^{-1})$, which is the best learning rate obtained so far, see [18,30]. To see this, let $\gamma \to \infty$ and $r \to 1/2$; then, the right-hand side of (17) tends to $c \cdot m^{-1}$. Note that if we take a finitely smooth kernel, then we cannot obtain the above convergence rate, since in that case inequality (17) holds only for some fixed $\gamma$, not for all $\gamma$.
Remark 4.
We now show that the RPSGLA is more efficient than some existing greedy learning algorithms. The convergence rate of the RPSGLA is faster than that of the Orthogonal Super Greedy Learning Algorithm (OSGLA) in [13], where the rate $O((m/\log m)^{-1/2})$ was derived. Our convergence rate is also faster than $O((m/\log m)^{-1/2})$, the convergence rate of the Orthogonal Greedy Learning Algorithm (OGLA) and the Relaxed Greedy Learning Algorithm (RGLA) in [16], and $O(m^{-(2\gamma + d)/(2\gamma + 2d)})$, the convergence rate of the RGLA in [20]. Additionally, our convergence rate is almost the same as those of the OGLA in [18] and the RPGLA in [24]. However, the complexity of the RPSGLA is smaller than that of the OGLA and the RPGLA; we illustrate this in Section 5 and Section 6. On the other hand, since greedy learning is a large field, there are other greedy learning procedures that are quite different from ours, see [17,19,21].
Remark 5.
Kernel-based greedy algorithms can be used to solve a variety of problems; we mention only some typical works here, focusing on approximation problems in an RKHS. In [31], the authors proposed kernel-based greedy algorithms to approximate non-linear vectorial functions and derived the rate $O(m^{-1/2})$. In [7,32], kernel-based greedy algorithms were used to approximate a linear functional defined on an RKHS; the authors proved that, for square-integrable functions, the convergence rate can attain $O(m^{-1/2})$. Roughly speaking, the convergence rate of greedy algorithms for functional approximation is similar to that for function approximation, while for the regression problem one can obtain a faster convergence rate.

4. Error Analysis of the RPSGLA

In this section, we prove Theorem 7. The proof is divided into five parts: the error decomposition strategy, the estimate of the sample error, the estimate of the hypothesis error, the estimate of the approximation error, and, finally, the synthesis of these estimates into a proof of Theorem 7.

4.1. Error Decomposition Strategy

Before we present the error decomposition strategy, we construct a stepping-stone function $f_k^* \in \operatorname{span}(\mathcal{D}_m)$ as follows. Since $f_\rho \in \mathcal{L}_1^r$, there exists $h_\rho := \sum_{i=1}^{m} a_i K_{x_i} \in \operatorname{span}\{\mathcal{D}_m\}$ such that
$$\|h_\rho\|_{\mathcal{L}_1} \le B, \qquad \|f_\rho - h_\rho\|_\infty \le B m^{-r}.$$
Define $f_0^* = 0$ and
$$f_k^* = \left(1 - \frac{1}{k}\right) f_{k-1}^* + \frac{\|h_\rho\|_{\mathcal{L}_1}}{k} \varphi_k^*, \quad (19)$$
where
$$\varphi_k^* := \arg\max_{\varphi \in \mathcal{D}_m'} \left\langle h_\rho - \left(1 - \frac{1}{k}\right) f_{k-1}^*, \varphi \right\rangle_{L^2_{\rho_X}},$$
and
$$\mathcal{D}_m' = \left\{ K_{x_i} / \|K_{x_i}\|_{L^2_{\rho_X}} \right\}_{i=1}^{m} \cup \left\{ -K_{x_i} / \|K_{x_i}\|_{L^2_{\rho_X}} \right\}_{i=1}^{m}.$$
Lemma 3.
Let $f_k^*$ be defined by (19). Then, we have $\|f_k^*\|_{\mathcal{L}_1} \le B$.
Proof. 
By the definition of $f_k^*$, we have
$$f_k^* = \left(1 - \frac{1}{k}\right)\left(1 - \frac{1}{k-1}\right) \cdots \left(1 - \frac{1}{2}\right) \|h_\rho\|_{\mathcal{L}_1} \varphi_1^* + \left(1 - \frac{1}{k}\right) \cdots \left(1 - \frac{1}{3}\right) \frac{\|h_\rho\|_{\mathcal{L}_1}}{2} \varphi_2^* + \cdots + \left(1 - \frac{1}{k}\right) \frac{\|h_\rho\|_{\mathcal{L}_1}}{k-1} \varphi_{k-1}^* + \frac{\|h_\rho\|_{\mathcal{L}_1}}{k} \varphi_k^*,$$
so the coefficient of each $\varphi_j^*$ equals $\|h_\rho\|_{\mathcal{L}_1}/k$, and the conclusion of Lemma 3 follows from the facts that $\|h_\rho\|_{\mathcal{L}_1} \le B$ and $\|\varphi_i^*\|_{L^2_{\rho_X}} = 1$, $i = 1, 2, \ldots, k$.    □
With $f_k^*$ at hand, we can give an upper bound for $\mathcal{E}(\pi_M(f_k)) - \mathcal{E}(f_\rho)$ as follows.
Proposition 1.
Let $f_k$ be defined in Algorithm 1. Then, for the $f_k^*$ in (19), we have the following error decomposition:
$$\mathcal{E}(\pi_M(f_k)) - \mathcal{E}(f_\rho) \le \mathcal{S}(\mathbf{z}, k) + \mathcal{D}(k) + \mathcal{P}(\mathbf{z}, k),$$
where
$$\mathcal{S}(\mathbf{z}, k) = \mathcal{E}(\pi_M(f_k)) - \mathcal{E}_{\mathbf{z}}(\pi_M(f_k)) + \mathcal{E}_{\mathbf{z}}(f_k^*) - \mathcal{E}(f_k^*),$$
$$\mathcal{D}(k) = \mathcal{E}(f_k^*) - \mathcal{E}(f_\rho),$$
$$\mathcal{P}(\mathbf{z}, k) = \mathcal{E}_{\mathbf{z}}(\pi_M(f_k)) - \mathcal{E}_{\mathbf{z}}(f_k^*),$$
are known, respectively, as the sample error, the approximation error, and the hypothesis error in learning theory.
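The decomposition in Proposition 1 is in fact a telescoping identity; the following sketch checks it with hypothetical numeric placeholders for the error terms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical surrogate values for the generalization/empirical errors of
# pi_M(f_k), f_k^*, and f_rho, drawn at random just to check the identity.
E_pi, E_star, E_rho = rng.random(3)      # E(pi_M(f_k)), E(f_k^*), E(f_rho)
Ez_pi, Ez_star = rng.random(2)           # E_z(pi_M(f_k)), E_z(f_k^*)

S = (E_pi - Ez_pi) + (Ez_star - E_star)  # sample error S(z, k)
D = E_star - E_rho                       # approximation error D(k)
P = Ez_pi - Ez_star                      # hypothesis error P(z, k)

# The three terms telescope exactly to the excess generalization error.
assert abs((S + D + P) - (E_pi - E_rho)) < 1e-12
```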

4.2. Estimate of Sample Error

In this subsection, we bound the sample error $\mathcal{S}(\mathbf{z}, k)$. We set
$$\mathcal{S}_1(\mathbf{z}, k) = \mathcal{E}_{\mathbf{z}}(f_k^*) - \mathcal{E}_{\mathbf{z}}(f_\rho) - \left\{ \mathcal{E}(f_k^*) - \mathcal{E}(f_\rho) \right\}$$
and
$$\mathcal{S}_2(\mathbf{z}, k) = \left\{ \mathcal{E}(\pi_M(f_k)) - \mathcal{E}(f_\rho) \right\} - \left\{ \mathcal{E}_{\mathbf{z}}(\pi_M(f_k)) - \mathcal{E}_{\mathbf{z}}(f_\rho) \right\}.$$
Then, the sample error can be written as $\mathcal{S}(\mathbf{z}, k) = \mathcal{S}_1(\mathbf{z}, k) + \mathcal{S}_2(\mathbf{z}, k)$.
The bound for $\mathcal{S}_1(\mathbf{z}, k)$ was proved in [20] using the one-sided Bernstein inequality and the inequality $\|f_k^*\|_{\mathcal{L}_1} \le B$.
Proposition 2.
For any $0 < \delta < 1$, with confidence at least $1 - \frac{\delta}{3}$, we have
$$\mathcal{S}_1(\mathbf{z}, k) \le \frac{7(3M + B)^2 \log(3/\delta)}{3m} + \frac{1}{2} \mathcal{D}(k).$$
Since the function $f_k$ changes with the sample $\mathbf{z}$, we need a uniform upper bound for $\mathcal{S}_2(\mathbf{z}, k)$.
To this end, we consider the data-dependent space
$$\mathcal{B}_\varrho = \left\{ f \in \mathcal{L}_1 : \|f\|_{\mathcal{L}_1} \le \varrho \right\}.$$
To estimate the capacity of $\mathcal{B}_\varrho$, we need the concept of the empirical covering number.
Definition 5.
Let E be a metric space with metric d and F be a subset of E. For any ϵ > 0 , the covering number N ( F , ϵ , d ) of F with respect to ϵ and d is defined as the minimal number of balls of radius ϵ whose union covers F, that is,
N ( F , ϵ , d ) : = min { l ∈ N : F ⊆ ⋃ j = 1 l B ( t j , ϵ ) for some { t j } j = 1 l ⊆ E } ,
where B ( t j , ϵ ) = { t E : d ( t j , t ) ϵ } .
Definition 6.
( l 2 empirical covering number) Let F be a set of functions on X, x = ( x i ) i = 1 m X m and let F | x = { ( f ( x i ) ) i = 1 m , f F } R m . Set N 2 , x ( F , ϵ ) = N ( F | x , ϵ , d ) , for arbitrary ϵ > 0 . The l 2 empirical covering number of F is defined by
N 2 ( F , ϵ ) = sup m ∈ N sup x ∈ X m N 2 , x ( F , ϵ ) ,
where the l 2 metric
d 2 ( a , b ) = ( 1 m ∑ i = 1 m | a i − b i | 2 ) 1 / 2 , a = ( a i ) i = 1 m ∈ R m , b = ( b i ) i = 1 m ∈ R m .
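To make Definition 6 concrete, the following sketch (the helper name `d2` is ours, not from the paper) computes the empirical ℓ2 metric between two functions sampled at the same m points:

```python
import math

def d2(a, b):
    """Empirical l2 metric: d2(a, b) = sqrt((1/m) * sum_i |a_i - b_i|^2)."""
    m = len(a)
    assert len(b) == m
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / m)

# Two functions evaluated at the same m = 4 sample points
a = [0.0, 1.0, 2.0, 3.0]
b = [1.0, 1.0, 1.0, 1.0]
print(d2(a, b))  # sqrt((1 + 0 + 1 + 4) / 4) = sqrt(1.5)
```

Covering numbers in this metric are taken over the sampled values F | x , which is what makes them data-dependent.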
We will use the following result on the capacity of the unit ball B 1 .
Lemma 4
([30]). Let X be a compact subset of R d and let K be a C γ kernel with some γ > 0 . Then, there exist an exponent p with 0 < p < 2 and a constant c p such that, for arbitrary ϵ > 0 ,
log N 2 ( B 1 , ϵ ) ≤ c p ϵ − p ,
where
p = 2 d / ( d + 2 γ ) if 0 < γ ≤ 1 ; p = 2 d / ( d + 2 ) if 1 < γ ≤ 1 + d / 2 ; p = d / γ if γ > 1 + d / 2 .
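The piecewise exponent above is easy to tabulate. The sketch below (function names are ours) evaluates p together with the exponent 2 / ( p + 2 ) that drives the m − 2 / ( p + 2 ) term in the later error bounds, illustrating that p → 0 and the rate approaches O ( m − 1 ) as the kernel smoothness γ grows:

```python
def exponent_p(d, gamma):
    """Covering-number exponent p from Lemma 4, piecewise in the smoothness gamma."""
    if 0 < gamma <= 1:
        return 2 * d / (d + 2 * gamma)
    if 1 < gamma <= 1 + d / 2:
        return 2 * d / (d + 2)
    return d / gamma  # gamma > 1 + d/2

def rate_exponent(d, gamma):
    """Exponent of the m^(-2/(p+2)) term appearing in the error bounds."""
    return 2 / (exponent_p(d, gamma) + 2)

print(exponent_p(1, 0.5))      # 2*1/(1+1) = 1.0
print(rate_exponent(1, 1000))  # close to 1: nearly the best rate O(1/m)
```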
Now we recall the concentration inequality from [29].
Lemma 5.
Assume that there are constants B , c > 0 and α ∈ [ 0 , 1 ] such that ‖ f ‖ ∞ ≤ B and E f 2 ≤ c ( E f ) α for every f ∈ F . If, for arbitrary ϵ > 0 ,
log N 2 ( F , ϵ ) ≤ a ϵ − p
holds for some a > 0 and p ∈ ( 0 , 2 ) , then there exists a constant c p depending only on p such that, for any t > 0 , with probability at least 1 − e − t ,
E f − 1 m ∑ i = 1 m f ( z i ) ≤ 1 2 η 1 − α ( E f ) α + c p η + 2 ( c t / m ) 1 / ( 2 − α ) + 18 B t / m , ∀ f ∈ F ,
where
η : = max { c ( 2 − p ) / ( 4 − 2 α + p α ) ( a / m ) 2 / ( 4 − 2 α + p α ) , B ( 2 − p ) / ( 2 + p ) ( a / m ) 2 / ( 2 + p ) } .
Proposition 3.
If K is a C γ kernel, then for any 0 < δ < 1 , with confidence at least 1 − δ / 3 , we have
S 2 ( z , k ) ≤ 1 2 { E ( π M ( f k ) ) − E ( f ρ ) } + C 2 log ( 3 / δ ) m − 2 / ( p + 2 ) .
Proof. 
Denote
F ϱ = { g ( z ) = ( y − π M ( f ) ( x ) ) 2 − ( y − f ρ ( x ) ) 2 : f ∈ B ϱ } .
Hence
E g = E ( π M ( f ) ) − E ( f ρ ) = ‖ π M ( f ) − f ρ ‖ L ρ X 2 2 ,
and
1 m ∑ i = 1 m g ( z i ) = E z ( π M ( f ) ) − E z ( f ρ ) .
Observe that
g ( z ) = ( π M ( f ) ( x ) − f ρ ( x ) ) ( ( π M ( f ) ( x ) − y ) + ( f ρ ( x ) − y ) ) .
From the inequalities | π M ( f ) | ≤ M and | f ρ | ≤ M , we get
| g ( z ) | ≤ 8 M 2
and
E g 2 = ∫ Z ( π M ( f ) ( x ) − f ρ ( x ) ) 2 ( ( π M ( f ) ( x ) − y ) + ( f ρ ( x ) − y ) ) 2 d ρ ≤ 16 M 2 E g .
For g 1 , g 2 ∈ F ϱ , we have
| g 1 ( z ) − g 2 ( z ) | = | ( y − π M ( f 1 ) ( x ) ) 2 − ( y − π M ( f 2 ) ( x ) ) 2 | ≤ 4 M | π M ( f 1 ) ( x ) − π M ( f 2 ) ( x ) | ≤ 4 M | f 1 ( x ) − f 2 ( x ) | .
For any ϵ > 0 , it follows that
N 2 , z ( F ϱ , ϵ ) ≤ N 2 , x ( B ϱ , ϵ / ( 4 M ) ) = N 2 , x ( B 1 , ϵ / ( 4 M ϱ ) ) .
By Lemma 4, we have
log N 2 ( F ϱ , ϵ ) ≤ c p ( 4 M ϱ ) p ϵ − p , ∀ ϵ > 0 ,
where c p is a constant independent of ϵ .
Applying Lemma 5 with B = 16 M 2 , α = 1 , c = 16 M 2 , and a = c p ( 4 M ϱ ) p , we obtain that, for any 0 < δ < 1 and any g ∈ F ϱ , the inequality
E g − 1 m ∑ i = 1 m g ( z i ) ≤ 1 2 E g + C 1 η + C 2 M 2 log ( 3 / δ ) / m
holds with confidence 1 − δ / 3 , where C 1 and C 2 are constants depending at most on d, X, and the kernel K, and
η = B ( 2 − p ) / ( p + 2 ) ( ( 4 M ϱ ) p / m ) 2 / ( 2 + p ) .
It follows from the definition of f k in Algorithm 1 that ‖ f k ‖ L 1 ≤ k 2 M 2 . Then, there exists a constant C depending at most on d, X, M, and the kernel K, such that
S 2 ( z , k ) ≤ 1 2 { E ( π M ( f k ) ) − E ( f ρ ) } + C log ( 3 / δ ) m − 2 / ( p + 2 ) .
   □

4.3. Estimate of Hypothesis Error

In this subsection, we give an error estimate for P ( z , k ) . For this purpose, we need the following two lemmas.
The first one is the Hoeffding inequality, established by Wassily Hoeffding in 1963.
Lemma 6
([33]). Let X 1 , , X n be independent random variables bounded by the interval [ 0 , 1 ] : 0 X i 1 . We define the empirical mean of these variables by
X ¯ = 1 n ( X 1 + + X n ) .
Then, the following inequality holds for any given t > 0 :
Prob ( X ¯ − E ( X ¯ ) ≥ t ) ≤ exp ( − 2 t 2 n ) .
When the X i are bounded by the intervals [ a i , b i ] , i = 1 , … , n , the following generalization holds for any given t > 0 :
Prob ( X ¯ − E ( X ¯ ) ≥ t ) ≤ exp ( − 2 t 2 n 2 / ∑ i = 1 n ( b i − a i ) 2 ) .
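As a quick numerical sanity check (our own illustration, not part of the proof), the Hoeffding bound can be compared against a Monte Carlo estimate of the tail probability for fair Bernoulli variables:

```python
import math
import random

def hoeffding_bound(t, n):
    """Hoeffding tail bound exp(-2 t^2 n) for the mean of [0, 1]-valued variables."""
    return math.exp(-2 * t * t * n)

def empirical_tail(t, n, trials=20000, seed=0):
    """Monte Carlo estimate of P(X_bar - E[X_bar] >= t) for fair coin flips."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.random() < 0.5 for _ in range(n)) / n
        hits += xbar - 0.5 >= t
    return hits / trials

n, t = 50, 0.1
print(empirical_tail(t, n), "<=", hoeffding_bound(t, n))  # the bound holds
```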
The second one is an immediate consequence of Theorem 6.
Equip H = { f = ∑ i = 1 ∞ α i K v i : { α i } i = 1 ∞ ∈ ℓ 1 , { v i } ⊆ X } with the empirical inner product. Then H is a Hilbert space. Given training data z = ( x i , y i ) i = 1 m ∈ ( X × Y ) m , the set D m = { K ^ x i : = K x i / ‖ K x i ‖ m } is a dictionary of H. We denote its coherence by μ .
Lemma 7.
Let K be a C γ kernel with 0 < k 1 ≤ K ( u , v ) ≤ k 2 . For any g ∈ s p a n ( D m ) , there exists { ξ i } i = 1 m such that g = ∑ i = 1 m ξ i K x i / ‖ K x i ‖ L ρ X 2 with ∑ i = 1 m | ξ i | < ∞ . We define
‖ g ‖ L 1 m = inf ∑ i = 1 m | ξ i m | , ξ i m = ξ i ‖ K x i ‖ m / ‖ K x i ‖ L ρ X 2 .
Then, for any f ∈ H , any g ∈ s p a n ( D m ) , and s ≤ 1 / ( 2 μ ) + 3 / 2 , the error of the RPSGLA satisfies
‖ f − f k ‖ m 2 ≤ ‖ f − g ‖ m 2 + C ( s , K , 1 ) 2 ‖ g ‖ L 1 m 2 ( s k ) − 1 , k = 1 , 2 , … .
Proof. 
According to (2), for any f ∈ H and any g ∈ s p a n ( D m ) ∩ A 1 ( D m ) , the inequality
‖ f − f k ‖ m 2 ≤ ‖ f − g ‖ m 2 + C ( s , K , 1 ) 2 ( s ( 1 + k ) ) − 1 ‖ g ‖ A 1 ( D m ) 2
holds for k = 1 , 2 , … .
Given a training data z = ( x i , y i ) i = 1 m X × Y , since
‖ g ‖ A 1 ( D m ) = ‖ ∑ i = 1 m ξ i K x i / ‖ K x i ‖ L ρ X 2 ‖ A 1 ( D m ) = ‖ ∑ i = 1 m ξ i m K ^ x i ‖ A 1 ( D m ) ≤ inf ∑ i = 1 m | ξ i m | = ‖ g ‖ L 1 m ,
then, combining (20) with (21), we get the desired result.    □
Proposition 4.
For any 0 < δ < 1 , with confidence at least 1 − δ / 3 , we have
E z ( π M ( f k ) ) − E z ( f k * ) ≤ 4 min { k 2 2 / k 1 2 , ( 1 + ( k 2 / k 1 ) ( log ( 3 / δ ) / ( 2 m ) ) 1 / 2 ) 2 } B 2 ( s k ) − 1 .
Proof. 
Applying Lemma 7 with g = f k * , we have
‖ f − f k ‖ m 2 ≤ ‖ f − f k * ‖ m 2 + C ( s , K , 1 ) 2 ‖ f k * ‖ L 1 m 2 ( s k ) − 1 .
Since ‖ f k * ‖ L 1 m changes with the training data z , we should find its relation with ‖ f k * ‖ L 1 . We know that ‖ f k * ‖ L 1 m ≤ ( k 2 / k 1 ) ‖ f k * ‖ L 1 . We also have that
E | K ˜ v i | 2 = ∫ Z | K ˜ v i | 2 d ρ = ‖ K ˜ v i ‖ L ρ X 2 2 = 1
and
‖ K ˜ v i ‖ m = ( 1 m ∑ j = 1 m | K ˜ v i ( x j ) | 2 ) 1 / 2 .
Based on Lemma 6, for any t > 0 and any i, we have
Prob ( 1 m ∑ j = 1 m | K ˜ v i ( x j ) | 2 − E | K ˜ v i | 2 < t ) ≥ 1 − exp ( − 2 k 1 2 t 2 m / k 2 2 ) .
By setting δ = 3 exp ( − 2 k 1 2 t 2 m / k 2 2 ) , we have t = ( k 2 / k 1 ) ( log ( 3 / δ ) / ( 2 m ) ) 1 / 2 . From (23) and (24), with confidence 1 − δ / 3 , we have
‖ K ˜ v i ‖ m = ( 1 m ∑ j = 1 m | K ˜ v i ( x j ) | 2 ) 1 / 2 ≤ ( E | K ˜ v i | 2 + ( k 2 / k 1 ) ( log ( 3 / δ ) / ( 2 m ) ) 1 / 2 ) 1 / 2 ≤ 1 + ( k 2 / k 1 ) ( log ( 3 / δ ) / ( 2 m ) ) 1 / 2 .
By the definition of f k * , we observe that the inequality
‖ f k * ‖ L 1 m ≤ ( 1 + ( k 2 / k 1 ) ( log ( 3 / δ ) / ( 2 m ) ) 1 / 2 ) ‖ f k * ‖ L 1
holds with confidence 1 − δ / 3 .
Combining (22) and (25) with Lemma 3, with confidence at least 1 − δ / 3 , we have
E z ( π M ( f k ) ) − E z ( f k * ) ≤ C ( s , K , 1 ) 2 min { k 2 2 / k 1 2 , ( 1 + ( k 2 / k 1 ) ( log ( 3 / δ ) / ( 2 m ) ) 1 / 2 ) 2 } B 2 ( s k ) − 1 .
   □

4.4. Estimate of Approximation Error

Finally, we estimate the approximation error.
Proposition 5.
If f ρ ∈ L 1 r , then D ( k ) ≤ B 2 ( k − 1 / 2 + m − r ) 2 .
Proof. 
From the definition of D ( k ) and (16), we have
D ( k ) = E ( f k * ) − E ( f ρ ) = ‖ f k * − f ρ ‖ L ρ X 2 2 .
For h ρ satisfying (18) with ‖ h ρ ‖ L ρ X 2 ≤ ‖ h ρ ‖ , from Theorem 2.2 in [16], we obtain
D ( k ) ≤ ( ‖ h ρ − f k * ‖ L ρ X 2 + ‖ h ρ − f ρ ‖ L ρ X 2 ) 2 ≤ ( ‖ h ρ − f k * ‖ + ‖ h ρ − f ρ ‖ ) 2 ≤ B 2 ( k − 1 / 2 + m − r ) 2 .
   □

4.5. Proof of Theorem 7

Now we prove Theorem 7.
Proof of Theorem 7 
Assembling the results of Propositions 1–5, we obtain that the inequality
E ( π M ( f k ) ) − E ( f ρ ) ≤ D ( k ) + S ( z , k ) + P ( z , k ) ≤ ( 3 / 2 ) B 2 ( k − 1 / 2 + m − r ) 2 + 7 ( 3 M + B ) 2 log ( 3 / δ ) / ( 3 m ) + 1 2 { E ( π M ( f k ) ) − E ( f ρ ) } + C log ( 3 / δ ) m − 2 / ( p + 2 ) + C ( s , K , 1 ) 2 min { k 2 2 / k 1 2 , ( 1 + ( k 2 / k 1 ) ( log ( 3 / δ ) / ( 2 m ) ) 1 / 2 ) 2 } B 2 ( s k ) − 1
holds with confidence at least 1 − δ .
Therefore, the bound
E ( π M ( f k ) ) − E ( f ρ ) ≤ c B 2 log ( 3 / δ ) ( ( s k ) − 1 + m − 2 r + m − 2 / ( p + 2 ) )
holds with confidence at least 1 − δ , where
p = 2 d / ( d + 2 γ ) if 0 < γ ≤ 1 ; p = 2 d / ( d + 2 ) if 1 < γ ≤ 1 + d / 2 ; p = d / γ if γ > 1 + d / 2 .
This completes the proof of Theorem 7.
   □

5. Simulation Results

In this section, we test the performance of the RPSGLA and present the results of our experiments. We first describe the details relevant to our experiments in Section 5.1. We demonstrate the effectiveness of the RPSGLA in Section 5.2. In Section 5.3, we compare the RPSGLA with existing greedy learning algorithms to show the efficiency and robustness of our algorithm. We also study the effects of the parameter s in Section 5.4.

5.1. Implementation

For all experiments, we uniformly sample 500 data points from an objective function
x ↦ f ( x ) , x ∈ [ − 1 , 1 ] .
Let x 0 = 1 and let { x i } be the 200 equally spaced points in [ − 1 , 1 ] , i.e., x i = i 1 200 , 0 ≤ i < 200 . We use { e − ( x − x i ) 2 : i = 0 , … , 199 } as the dictionary for the learning algorithms. We also add Gaussian noise from N ( 0 , δ 2 ) with δ 2 = 0.2 to test the performance. When the matrix is singular while computing the orthogonal projection, we take the Moore–Penrose pseudoinverse instead of the inverse matrix.
To demonstrate the performance of the learning algorithms, we use the mean square error (MSE) to measure the error between the objective function f and its approximant f ρ on the unlabeled samples x = { x ˜ j } j = 1 500 , which is defined as follows:
MSE = 1 500 ∑ j = 1 500 ( f ( x ˜ j ) − f ρ ( x ˜ j ) ) 2 .
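A minimal sketch of this setup in code; the uniform sampling, the Gaussian dictionary, and the exact center spacing are our assumptions based on the description above:

```python
import math
import random

def mse(f, f_hat, samples):
    """Mean square error between the target f and an estimator f_hat on unlabeled samples."""
    return sum((f(x) - f_hat(x)) ** 2 for x in samples) / len(samples)

rng = random.Random(0)
samples = [rng.uniform(-1.0, 1.0) for _ in range(500)]   # 500 uniform samples on [-1, 1]
centers = [-1.0 + 0.01 * i for i in range(200)]          # 200 equally spaced centers (our guess)
dictionary = [lambda x, c=c: math.exp(-(x - c) ** 2) for c in centers]

f = lambda x: x * math.sin(math.pi * x)
print(mse(f, lambda x: 0.0, samples))  # error of the trivial zero estimator
```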

5.2. Effectiveness of the RPSGLA

In the first numerical experiment, we use the uniformly sampled data from the function
x ↦ x sin ( π x ) , x ∈ [ − 1 , 1 ] ,
and set s = 5 .
Figure 1 shows the performance of the RPSGLA ( s = 5 ) after 100 iterations with noiseless data for the target function f ( x ) = x sin π x , where the blue points represent noiseless samples and the red line represents the approximation. By repeating the test 100 times, we calculate the mean square error: MSE = 3.4159 × 10 4 .
Figure 2 shows the performance of the RPSGLA ( s = 5 ) after 100 iterations with Gaussian noise from N ( 0 , 0.2 ) for the target function f ( x ) = x sin π x , where the blue points represent noisy samples and the red line represents the approximation. By repeating the test 100 times, we calculate the mean square error: MSE = 2.4 × 10 3 .
The above results show that the RPSGLA is very effective for regression. Although noise degrades the prediction performance, our method remains stable and effective.

5.3. Comparison with Existing Algorithms

In order to demonstrate the efficiency and robustness of the RPSGLA, we implement the RPGLA (the RPSGLA with s = 1 ) and the Orthogonal Super Greedy Learning Algorithm (OSGLA) based on the WOSGA, with the same objective function and parameters as in Section 5.2, for comparison. We first recall the definition of the OSGLA in [13] (Algorithm 2):
Algorithm 2 The OSGLA
Input: z = ( x i , y i ) i = 1 m , the kernel K, and T > 0 . Set D m = { K x i , i = 1 , 2 , … , m } .
step 1. Normalization: K ^ x i = K x i / ‖ K x i ‖ m and dictionary D m * = { ± K ^ x i } , i = 1 , 2 , … , m .
step 2. Computation: Let f 0 = 0 , r k = y − f k .
1: Denote I k : = [ ( k − 1 ) s + 1 , k s ] . Let φ ( k − 1 ) s + 1 , … , φ k s ∈ D m * satisfy the inequality
min i ∈ I k | ⟨ r k − 1 , φ i ⟩ m | ≥ sup φ ∈ D m * , φ ≠ φ i , i ∈ I k | ⟨ r k − 1 , φ ⟩ m | .
2: Let H k = span { φ 1 , … , φ k s } and let P H k denote the orthogonal projection operator onto H k . Define the next approximant to be
f k = P H k ( y ) .
3: if ‖ y − f k ‖ m 2 + ‖ f k ‖ L z ≥ ‖ y ‖ m 2 and k ≤ T , then break.
end
Output: π M ( f k )
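The distinctive part of Algorithm 2 is the super selection in step 1: s atoms with the largest empirical inner products against the current residual are picked at once. A minimal sketch of that step (helper names are ours; normalization, tie-breaking, and the least squares update are omitted):

```python
def empirical_inner(u, v):
    """Empirical inner product <u, v>_m = (1/m) * sum_i u_i * v_i."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

def select_super(residual, atoms, s):
    """Return indices of the s atoms with largest |<residual, atom>_m|."""
    order = sorted(range(len(atoms)),
                   key=lambda j: -abs(empirical_inner(residual, atoms[j])))
    return order[:s]

# Three atoms evaluated on m = 4 sample points
atoms = [[1, 0, 0, 0], [0, 1, 0, 0], [0.5, 0.5, 0.5, 0.5]]
residual = [1.0, 0.2, 0.0, 0.0]
print(select_super(residual, atoms, 2))  # [0, 2]: the two most correlated atoms
```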
By repeating the test 100 times, the mean square errors of the RPGLA and OSGLA ( s = 5 ) after 100 iterations, with noiseless data and with Gaussian noise from N ( 0 , 0.2 ) , are shown in Table 1. Additionally, the running times of the RPGLA, OSGLA, and RPSGLA on different types of data are reported in Table 2.
According to the mean square error, our algorithm is clearly more effective than the RPGLA on both the noiseless data and the noisy data (Gaussian). On noiseless data, the OSGLA outperforms the RPSGLA by a large margin. This is expected because the approximant of the OSGLA is updated by solving an s k -dimensional least squares problem at the k-th iteration, while that of the RPSGLA is updated by solving an s-dimensional least squares problem. That is also why the RPSGLA runs much more quickly than the OSGLA, as reported in Table 2. However, on the data with Gaussian noise from N ( 0 , 0.2 ) , the RPSGLA outperforms the OSGLA, showing the robustness and effectiveness of our proposed algorithm.
In Table 2, we compare the running times of the RPGLA, OSGLA, and RPSGLA. The RPSGLA runs more quickly than both the RPGLA and the existing super greedy algorithm, the OSGLA, demonstrating the efficiency of our algorithm.

5.4. Effects of the Parameter s

In this subsection, we study the effects of the parameter s ∈ { 1 , 2 , … , 10 } . We take the same dictionary and Gaussian noise as in Section 5.2. By repeatedly implementing the RPSGLA 100 times on the uniformly sampled data from the function
x ↦ x 2 , x ∈ [ − 1 , 1 ] ,
we calculate the mean square error respectively with s = 1 , 2 , , 10 , which are reported in Figure 3 and Figure 4.
Figure 3 and Figure 4, respectively, show the MSE with noiseless data and with Gaussian noise from N ( 0 , 0.2 ) for the target function f ( x ) = x 2 as s increases from 1 to 10. For the data with Gaussian noise, the accuracy of the RPSGLA roughly improves as s increases from 1 to 10. For the noiseless data, the accuracy first improves and then deteriorates, reaching its best at s = 6 .
As shown in Figure 5 and Figure 6, as the parameter s increases, the running time of the RPSGLA first decreases and then increases, reaching its lowest at s = 2 for both noiseless data and data with Gaussian noise; the algorithm attains its best computational cost in this case. Although the running time changes with the setting of the parameter s, the RPSGLA is fairly efficient overall. In particular, the RPSGLA is generally more efficient than the RPGLA.

6. Discussion

The experiments in Section 5 directly demonstrate the superiority of the RPSGA(s, t). In this section, we explain why these results are achieved. By combining the super selection step with one-dimensional optimization for the first time, we obtain the simplest good greedy algorithm to date. We now provide more details. First, we recall three other greedy algorithms: the Relaxed Greedy Algorithm (RGA), the Greedy Algorithm with Free Relaxation (GAFR), and the Rescaled Relaxed Greedy Algorithm (RRGA).
In [10], the RGA was defined as the following steps.
RGA(H, D ):
  • Step 0: Define f 0 = 0 .
  • Step m:
-
If f = f m − 1 , stop the algorithm and define f k = f m − 1 = f for k ≥ m .
-
If f ≠ f m − 1 , choose an element φ m ∈ D such that
| ⟨ f − f m − 1 , φ m ⟩ | = sup φ ∈ D | ⟨ f − f m − 1 , φ ⟩ | .
For m = 1 , define
f 1 = ⟨ f , φ 1 ⟩ φ 1 .
For m ≥ 2 , define the next approximant to be
f m = ( 1 − 1 m ) f m − 1 + 1 m φ m ,
and proceed to Step m + 1 .
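The RGA update above needs no optimization at all: the newly selected atom is simply mixed in with weight 1 / m. A compact sketch on a toy dictionary (the dictionary and target below are our example, not from the paper):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rga(f, dictionary, steps):
    """Relaxed Greedy Algorithm: pick the atom most correlated with the
    residual, then set f_m = (1 - 1/m) f_{m-1} + (1/m) phi_m."""
    fm = [0.0] * len(f)
    for m in range(1, steps + 1):
        resid = [a - b for a, b in zip(f, fm)]
        phi = max(dictionary, key=lambda p: abs(dot(resid, p)))
        if m == 1:
            c = dot(f, phi)
            fm = [c * p for p in phi]
        else:
            fm = [(1 - 1 / m) * a + (1 / m) * p for a, p in zip(fm, phi)]
    return fm

# f lies in A_1(D) for the signed coordinate dictionary in R^2
dictionary = [[1, 0], [0, 1], [-1, 0], [0, -1]]
f = [0.5, 0.5]
approx = rga(f, dictionary, 200)
diff = [a - b for a, b in zip(f, approx)]
err = dot(diff, diff) ** 0.5
print(err)  # small: the iterates converge toward f
```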
In [10], the authors proved that the RGA converges only for target elements from A 1 ( D ) and derived the following convergence rate for the RGA.
Theorem 8.
Let D be an arbitrary dictionary in H. For each f ∈ A 1 ( D ) , taking the approximant as f m = ( 1 − 1 m ) f m − 1 + ( ‖ f ‖ A 1 ( D ) / m ) φ m for m ≥ 2 , the RGA has the following convergence rate:
‖ f − f m ‖ ≤ 2 ‖ f ‖ A 1 ( D ) m − 1 / 2 , m = 1 , 2 , … .
In [9], the GAFR was defined as follows:
GAFR(H, D ) :
  • Step 0: Define f 0 = 0 .
  • Step m:
-
If f = f m − 1 , stop the algorithm and define f k = f m − 1 = f for k ≥ m .
-
If f ≠ f m − 1 , choose an element φ m ∈ D such that
| ⟨ f − f m − 1 , φ m ⟩ | = sup φ ∈ D | ⟨ f − f m − 1 , φ ⟩ | .
With ( ω m , θ m ) = arg min ( ω , θ ) ∈ R 2 ‖ f − ω f m − 1 − θ φ m ‖ , define the next approximant to be
f m = ω m f m − 1 + θ m φ m ,
and proceed to Step m + 1 .
In [9], the RRGA was defined as follows:
RRGA(H, D ) :
  • Step 0: Define f 0 = 0 .
  • Step m:
  • -
    If f = f m − 1 , stop the algorithm and define f k = f m − 1 = f for k ≥ m .
    -
    If f ≠ f m − 1 , choose an element φ m ∈ D such that
    | ⟨ f − f m − 1 , φ m ⟩ | = sup φ ∈ D | ⟨ f − f m − 1 , φ ⟩ | .
    With
    θ m = arg min θ ∈ R ‖ f − f m − 1 − θ φ m ‖ , ω m = arg min ω ∈ R ‖ f − ω ( f m − 1 + θ m φ m ) ‖ ,
    define the next approximant to be f m = ω m ( f m − 1 + θ m φ m ) , and proceed to Step m + 1 .
    In [9], the authors derived the following convergence rate for the GAFR and RRGA.
    Theorem 9.
    Let D be an arbitrary dictionary in H. Then, for each f ∈ A 1 ( D ) , the GAFR and RRGA have the following convergence rate:
    ‖ f − f m ‖ ≤ C ‖ f ‖ A 1 ( D ) m − 1 / 2 , m = 1 , 2 , … .
    From Theorems 2–4, Corollary 1, and Theorems 8 and 9, for each f ∈ A 1 ( D ) , the OGA, RPGA, RGA, GAFR, RRGA, OSGA(s, t), and RPSGA(s, t) all have the convergence rate
    ‖ f − f m ‖ ≤ C ‖ f ‖ A 1 ( D ) m − 1 / 2 .
    This rate is optimal, and these algorithms therefore have almost identical error performance. We call greedy algorithms of this type Good Greedy Algorithms (GGA). Hence, we only compare the complexity and execution time of the GGA.
    In terms of scope of application, the RGA converges only for the target elements from A 1 ( D ) . Obviously, the RPSGA(s, t) is better than the RGA.
    From the viewpoint of complexity, the OGA has to solve an m-dimensional optimization problem
    f m = P { φ 1 , … , φ m } f = ∑ k = 1 m θ k m φ k ,
    where ( θ 1 m , … , θ m m ) = arg min ( θ 1 , … , θ m ) ∈ R m ‖ f − ∑ k = 1 m θ k φ k ‖ . The GAFR has to solve a two-dimensional optimization problem
    f m = ω m f m − 1 + θ m φ m ,
    where ( ω m , θ m ) = arg min ( ω , θ ) ∈ R 2 ‖ f − ( ω f m − 1 + θ φ m ) ‖ . The RRGA employs two one-dimensional optimization problems
    f m = ω m ( f m − 1 + θ m φ m ) ,
    where θ m = arg min θ ∈ R ‖ f − ( f m − 1 + θ φ m ) ‖ , ω m = arg min ω ∈ R ‖ f − ω ( f m − 1 + θ m φ m ) ‖ . However, the RPGA only needs to solve a one-dimensional optimization problem
    f m = s m f ^ m , where s m = arg min s ∈ R ‖ f − s f ^ m ‖ .
    Thus, the RPGA is simpler than the OGA, RGA, GAFR, and RRGA, so it saves much execution time. By definition, the RPSGA(s, t) selects more than one element from the dictionary in each iteration step and hence reduces the computational burden of the RPGA (obtained by taking s = 1 in the RPSGA(s, t)), especially on noisy data; see Figure 6. Combining this with the empirical results in Table 1 and Table 2 in Section 5, it is easy to see that, in terms of both error performance and execution time, the RPSGA(s, t) is more effective than the RPGA. As far as we know, the only commonly used super greedy algorithm is the OSGA(s, t), which needs to solve an s m -dimensional optimization problem at step m, while the RPSGA(s, t) only needs to solve an s-dimensional optimization problem and a one-dimensional optimization problem. Table 2 also shows the robustness and effectiveness of the RPSGA(s, t) compared with the OSGA(s, t). Therefore, the RPSGA(s, t) is the most efficient among the GGA.
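The one-dimensional rescaling that keeps the RPGA and the RPSGA(s, t) cheap has a closed form: the minimizer of ‖ f − s f ^ ‖ over s ∈ R is s = ⟨ f , f ^ ⟩ / ⟨ f ^ , f ^ ⟩ . A small sketch (function names are ours) verifies this on a toy example:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rescale(f, f_hat):
    """Closed-form solution of min_s ||f - s * f_hat||: s = <f, f_hat>/<f_hat, f_hat>."""
    return dot(f, f_hat) / dot(f_hat, f_hat)

f = [3.0, 4.0]
f_hat = [1.0, 0.0]
s = rescale(f, f_hat)
print(s)  # 3.0: the best multiple of f_hat approximating f

def sq_err(t):
    d = [a - t * b for a, b in zip(f, f_hat)]
    return dot(d, d)

# No other multiple does better than the closed-form s
assert all(sq_err(s) <= sq_err(t) for t in [-1.0, 0.0, 2.9, 3.1, 10.0])
```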

    7. Conclusions and Further Studies

    We propose a new type of super greedy algorithm, the WRPSGA. The RPSGA(s, t) is simpler than the OGA, RGA, RRGA, RPGA, and OSGA(s, t) from the viewpoint of computational complexity. The convergence rate of the RPSGA(s, t) on A 1 ( D ) is optimal. Based on this result, we design the RPSGLA for solving kernel-based regression problems in supervised learning. When the kernel is infinitely smooth, we derive a significantly faster learning rate that can be arbitrarily close to the best rate O ( m − 1 ) under some mild assumptions on the regression function. The RPSGLA is more efficient than some existing greedy learning algorithms: its convergence rate is faster than those of the OSGLA in [13], the RGLA in [20], and the OGLA and RGLA in [16]. Additionally, our convergence rate is almost the same as that of the OGLA in [18], while the complexity of the RPSGLA is smaller. We test the performance of the RPSGLA by numerical experiments; the simulation results show that the RPSGLA is very efficient for regression.
    In addition to applications in machine learning, greedy algorithms can also be used to solve convex optimization problems, which are quite different from approximation problems; see [23,34,35,36]. We formulate the convex optimization problem as follows.
    Let H be a Hilbert space with an inner product ⟨ · , · ⟩ , and let E be a convex function on H, called the objective function. We assume that E has a Fréchet derivative E ′ ( x ) ∈ H at each point x ∈ H , that is,
    lim ‖ h ‖ → 0 | E ( x + h ) − E ( x ) − ⟨ E ′ ( x ) , h ⟩ | / ‖ h ‖ = 0 ,
    where · is the norm induced by · , · . We want to find an approximate solution to the problem
    inf x ∈ Ω E ( x ) ,
    where Ω is a bounded convex subset of H.
    For a dictionary D of H, we denote Σ m ( D ) as the set of all at most m-term linear combinations with respect to D ,
    Σ m : = Σ m ( D ) = { g : g = ∑ i ∈ Λ c i g i , g i ∈ D , Λ ⊂ N , # ( Λ ) ≤ m } .
    Let x ¯ = arg inf x Ω E ( x ) . We will develop a new greedy algorithm to produce an approximation x m Σ m ( D ) of x ¯ . This algorithm is a modification of the RPSGA. We denote it as the RPSGA(co). We present the definition of the RPSGA(co) as follows:
    RPSGA(co):
    • Step 0: Define x 0 = 0 . If E ′ ( x 0 ) = 0 , stop the algorithm and define x k : = x 0 , k ≥ 1 .
    • Step m: Assume that x m − 1 has been defined and E ′ ( x m − 1 ) ≠ 0 . Denote I m : =
    [ ( m − 1 ) s + 1 , m s ] . Let φ ( m − 1 ) s + 1 , … , φ m s ∈ D satisfy the inequality
    min i ∈ I m | ⟨ E ′ ( x m − 1 ) , φ i ⟩ | ≥ sup φ ∈ D , φ ≠ φ i , i ∈ I m | ⟨ E ′ ( x m − 1 ) , φ ⟩ | .
    Let F m : = span ( φ ( m − 1 ) s + 1 , … , φ m s ) . With
    x ^ m : = x m − 1 + arg min x ∈ F m E ( x ) ,
    define the next approximant to be x m = s m x ^ m , where s m = ⟨ x , x ^ m ⟩ / ⟨ x ^ m , x ^ m ⟩ .
    If E ′ ( x m ) = 0 , then stop the algorithm and define x k = x m , k > m ; otherwise go to Step m + 1 .
    We will impose the following assumptions on the objective function E.
    Condition 0: E has a Fréchet derivative E ′ ( x ) ∈ H at each point x ∈ Ω and
    ‖ E ′ ( x ) ‖ ≤ M 0 , x ∈ Ω .
    Uniform Smoothness: There are constants β > 0 , 1 < q ≤ 2 , and U > 0 such that, for all x , x ′ ∈ Ω with ‖ x − x ′ ‖ ≤ U ,
    E ( x ′ ) − E ( x ) − ⟨ E ′ ( x ) , x ′ − x ⟩ ≤ β ‖ x ′ − x ‖ q .
    It is known from [23] that the OGA(co) can be used effectively to solve convex optimization problems. Since the RPSGA(s, t) is more efficient than the OGA, the RPSGA(co) can solve these problems more efficiently. For the RPSGA(co), by estimating the decay of E ( x m ) − E ( x ¯ ) as m → ∞ , we will obtain the convergence rate O ( m 1 − q ) , which is independent of the dimension of the underlying space H. Thus, the curse of dimensionality for problem (26) can be overcome by using the RPSGA(co). This work will be of great interest in practical applications. It will also be important to compare the efficiency of the RPSGA(co) with other greedy optimization algorithms in future work.

    Author Contributions

    Conceptualization, all authors; methodology, all authors; software, all authors; validation, all authors; formal analysis, all authors; investigation, all authors; resources, all authors; data curation, all authors; writing—original draft preparation, all authors; writing—review and editing, all authors; visualization, all authors; supervision, all authors; project administration, all authors; funding acquisition, all authors. All authors have read and agreed to the published version of the manuscript.

    Funding

    This work was supported by National Natural Science Foundation of China under Grant 11671213.

    Acknowledgments

    The authors would like to thank the referees, the editors, Shi Lei, and Guo Qin for their very useful suggestions, which significantly improved this paper.

    Conflicts of Interest

    The authors declare no conflict of interest.

    References

    1. DeVore, R.A. Nonlinear approximation. Acta Numer. 1998, 7, 51–150.
    2. Donahue, M.J.; Darken, C.; Gurvits, L.; Sontag, E. Rates of convex approximation in non-Hilbert spaces. Constr. Approx. 1997, 13, 187–220.
    3. Jiang, B.; Ye, P.X. Efficiency of the weak rescaled pure greedy algorithm. Int. J. Wavelets Multiresolut. Inf. Process. 2021, 19, 2150001.
    4. Jiang, B.; Ye, P.X.; Zhang, W.H. Unified error estimate for weak biorthogonal greedy algorithms. Int. J. Wavelets Multiresolut. Inf. Process. 2022, 20, 2250010.
    5. Shen, Y.; Li, S. Sparse signals recovery from noisy measurements by Orthogonal Matching Pursuit. Inverse Probl. Imaging 2015, 9, 231–238.
    6. Wei, X.J.; Ye, P.X. Efficiency of orthogonal super greedy algorithm under the restricted isometry property. J. Inequal. Appl. 2019, 124, 21.
    7. Temlyakov, V.N. Numerical integration and discrepancy under smoothness assumption and without it. Constr. Approx. 2022, 55, 743–771.
    8. Wei, W.B.; Xu, Y.S.; Ye, P.X. Adaptive algorithms of nonlinear approximation with finite terms. Acta Math. Sin. 2007, 23, 1663–1672.
    9. Dereventsov, A.V.; Temlyakov, V.N. A unified way of analyzing some greedy algorithms. J. Funct. Anal. 2019, 277, 1–30.
    10. DeVore, R.A.; Temlyakov, V.N. Some remarks on greedy algorithms. Adv. Comput. Math. 1996, 5, 173–187.
    11. Petrova, G. Rescaled pure greedy algorithm for Hilbert and Banach spaces. Appl. Comput. Harmon. Anal. 2016, 41, 852–866.
    12. Shao, C.F.; Ye, P.X. Almost optimality of orthogonal super greedy algorithms for incoherent dictionaries. Int. J. Wavelets Multiresolut. Inf. Process. 2017, 15, 1750029.
    13. Fang, J.; Lin, S.B.; Xu, Z.B. Learning and approximation capability of orthogonal super greedy algorithm. Knowl.-Based Syst. 2016, 95, 86–98.
    14. Liu, E.T.; Temlyakov, V.N. The orthogonal super greedy algorithm and applications in compressed sensing. IEEE Trans. Inform. Theory 2012, 58, 2040–2047.
    15. Shao, C.F.; Chang, J.C.; Ye, P.X.; Zhang, W.H.; Xing, S. Almost optimality of the orthogonal super greedy algorithm for μ-coherent dictionaries. Axioms 2022, 11, 186.
    16. Barron, A.R.; Cohen, A.; Dahmen, W.; DeVore, R.A. Approximation and learning by greedy algorithms. Ann. Stat. 2008, 36, 64–94.
    17. Alcin, O.F.; Sengur, A.; Ghofrani, S.; Ince, M.C. GA-SELM: Greedy algorithms for sparse extreme learning machine. Measurement 2014, 55, 126–132.
    18. Chen, H.; Zhou, Y.C.; Tang, Y.Y.; Li, L.Q.; Pan, Z.B. Convergence rate of the semi-supervised greedy algorithm. Neural Netw. 2013, 44, 44–50.
    19. Herrero, H.; Solares, C. A Greedy Algorithm for observability analysis. IEEE Trans. Power Syst. 2020, 35, 1638–1641.
    20. Lin, S.B.; Rong, Y.H.; Sun, X.P.; Xu, Z.B. Learning capability of the relaxed greedy algorithms. IEEE Trans. Neural Netw. Learn. Syst. 2013, 24, 1598–1608.
    21. Xu, X.L.; Wang, M.Z.; Wang, Y.X.; Ma, D.C. Two-stage routing with optimized guided search and greedy algorithm on proximity graph. Knowl.-Based Syst. 2021, 229, 107305.
    22. Chen, H.; Li, L.Q.; Pan, Z.B. Learning rates of multi-kernel regression by orthogonal greedy algorithm. J. Stat. Plan. Infer. 2013, 143, 276–282.
    23. Nguyen, H.; Petrova, G. Greedy strategies for convex optimization. Calcolo 2017, 54, 207–224.
    24. Zhang, W.H.; Ye, P.X.; Xing, S. Optimality of the rescaled pure greedy learning algorithms, unpublished manuscript.
    25. Aronszajn, N. Theory of reproducing kernels. Trans. Amer. Math. Soc. 1950, 68, 337–404.
    26. Cucker, F.; Zhou, D.X. Learning Theory: An Approximation Theory Viewpoint; Cambridge University Press: Cambridge, UK, 2007.
    27. Shi, L.; Feng, Y.L.; Zhou, D.X. Concentration estimates for learning with l1 regularizer and data dependent hypothesis spaces. Appl. Comput. Harmon. Anal. 2011, 31, 286–302.
    28. Xiao, Q.W.; Zhou, D.X. Learning by nonsymmetric kernel with data dependent spaces and l1-regularizer. Taiwan. J. Math. 2010, 14, 1821–1836.
    29. Wu, Q.; Ying, Y.M.; Zhou, D. Multi-kernel regularized classifiers. J. Complexity 2007, 23, 108–134.
    30. Shi, L. Learning theory estimates for coefficient-based regularized regression. Appl. Comput. Harmon. Anal. 2013, 34, 252–265.
    31. Wirtz, D.; Haasdonk, B. A vectorial kernel orthogonal greedy algorithm. Dolomit. Res. Notes Approx. 2013, 6, 83–100.
    32. Santin, G.; Karvonen, T.; Haasdonk, B. Sampling based approximation of linear functionals in reproducing kernel Hilbert spaces. BIT Numer. Math. 2022, 62, 279–310.
    33. Hoeffding, W. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 1963, 58, 13–30.
    34. DeVore, R.A.; Temlyakov, V.N. Convex optimization on Banach spaces. Found. Comput. Math. 2016, 16, 369–394.
    35. Temlyakov, V.N. Greedy expansions in convex optimization. Proc. Steklov Inst. Math. 2014, 284, 252–270.
    36. Temlyakov, V.N. Greedy approximation in convex optimization. Constr. Approx. 2015, 41, 269–296.
    Figure 1. RPSGLA ( s = 5 ) with noiseless data after 100 iterations.
    Figure 2. RPSGLA ( s = 5 ) with noisy data after 100 iterations.
    Figure 3. MSE with noiseless data after 100 iterations.
    Figure 4. MSE with noisy data after 100 iterations.
    Figure 5. Running time with noiseless data after 100 iterations.
    Figure 6. Running time with noisy data after 100 iterations.
    Table 1. The average MSE (repeating 100 times) of the RPGLA, RPSGLA, and OSGLA obtained on noiseless data and noisy data (Gaussian).

    Algorithm | Noiseless Data | Noisy Data (Gaussian)
    RPGLA | 0.0291 | 0.0301
    OSGLA | 1.1246 × 10 − 7 | 3.3 × 10 − 3
    RPSGLA | 3.4159 × 10 − 4 | 2.4 × 10 − 3

    Table 2. The average running time (repeating 100 times) of the RPGLA, RPSGLA, and OSGLA obtained on noiseless data and noisy data (Gaussian).

    Algorithm | Noiseless Data | Noisy Data (Gaussian)
    RPGLA | 0.1151 s | 0.1173 s
    OSGLA | 2.2322 s | 1.5415 s
    RPSGLA | 0.1105 s | 0.1093 s

Zhang, W.; Ye, P.; Xing, S.; Xu, X. Optimality of the Approximation and Learning by the Rescaled Pure Super Greedy Algorithms. Axioms 2022, 11, 437. https://doi.org/10.3390/axioms11090437