1. Introduction
Non-linear sparse approximation with respect to dictionaries has been widely used in many application areas, such as machine learning, signal processing, and numerical computation, see [1,2,3,4,5,6,7,8]. Greedy algorithms have been used extensively for generating such approximations, see [9,10,11,12]. Among others, super greedy algorithms are very efficient for signal processing and machine learning, see [13,14,15]. In this paper, we propose a new super greedy algorithm, the Weak Rescaled Pure Super Greedy Algorithm (WRPSGA), which is simpler than some well-known greedy algorithms. We estimate the error of the WRPSGA and show that its convergence rate on the convex hull of the dictionary is optimal.
Then, we consider the application of the RPSGA to supervised learning. Since greedy algorithms have been proven to possess appealing generalization capacity with a low computational burden in [16], they have been used in supervised learning, see Refs. [13,16,17,18,19,20,21,22]. In this paper, we design the Rescaled Pure Super Greedy Learning Algorithm (RPSGLA) and derive an almost optimal convergence rate.
We first recall some basic notions of greedy approximation from [10]. Let $H$ be a Hilbert space with an inner product $\langle\cdot,\cdot\rangle$ and the norm $\|f\|:=\langle f,f\rangle^{1/2}$. A set of elements $\mathcal{D}\subset H$ is called a dictionary if $\|g\|=1$ for every $g\in\mathcal{D}$, and $\overline{\mathrm{span}}\,\mathcal{D}$ is $H$. We consider symmetric dictionaries, i.e., $g\in\mathcal{D}$ implies $-g\in\mathcal{D}$.
Let $G_m(f,\mathcal{D})$ be the output of a greedy-type algorithm with respect to a dictionary $\mathcal{D}$ after $m$ iterations. We want to estimate the decay of $\|f-G_m(f,\mathcal{D})\|$ as $m\to\infty$. To solve this problem, we need the following classes of sparse elements.
For a dictionary $\mathcal{D}$, we define the class of the elements
$$\mathcal{A}_1^{o}(\mathcal{D},B):=\Big\{f\in H:\ f=\sum_{k\in\Lambda}c_k g_k,\ g_k\in\mathcal{D},\ \#\Lambda<\infty,\ \sum_{k\in\Lambda}|c_k|\le B\Big\},$$
and by $\mathcal{A}_1(\mathcal{D},B)$ its closure in $H$. Let $\mathcal{A}_1(\mathcal{D})$ be the union of the classes $\mathcal{A}_1(\mathcal{D},B)$ over all $B>0$. For $f\in\mathcal{A}_1(\mathcal{D})$, we define the norm of $f$ as
$$|f|_{\mathcal{A}_1(\mathcal{D})}:=\inf\{B>0:\ f\in\mathcal{A}_1(\mathcal{D},B)\}.$$
The most natural greedy algorithm with respect to $\mathcal{D}$ is the Pure Greedy Algorithm (PGA). We recall its definition from [10].
PGA($H$, $\mathcal{D}$):
Step 0: Define $G_0:=0$.
Step $m$ ($m\ge 1$):
- If $f=G_{m-1}$, stop the algorithm and define $G_k=G_{m-1}$ for $k\ge m$.
- If $f\ne G_{m-1}$, choose an element $g_m\in\mathcal{D}$ such that
$$|\langle f-G_{m-1},g_m\rangle|=\sup_{g\in\mathcal{D}}|\langle f-G_{m-1},g\rangle|.$$
Define the next approximant to be
$$G_m:=G_{m-1}+\langle f-G_{m-1},g_m\rangle g_m,$$
and proceed to Step $m+1$.
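The PGA update above can be sketched numerically. The following is a minimal illustration in $\mathbb{R}^d$ with the Euclidean inner product; the random dictionary, target, and iteration counts are illustrative choices, not taken from the paper.

```python
import numpy as np

def pga(f, D, m):
    """Pure Greedy Algorithm sketch. D is a (d, n) array whose
    unit-norm columns play the role of the dictionary."""
    G = np.zeros_like(f)                     # G_0 := 0
    for _ in range(m):
        r = f - G                            # residual f - G_{m-1}
        if np.allclose(r, 0.0):
            break
        c = D.T @ r                          # inner products <r, g> for all g
        j = int(np.argmax(np.abs(c)))        # greedy selection step
        G = G + c[j] * D[:, j]               # G_m = G_{m-1} + <r, g_m> g_m
    return G

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)               # normalize so that ||g|| = 1
f = D @ rng.standard_normal(50)
errs = [np.linalg.norm(f - pga(f, D, m)) for m in (1, 10, 100)]
print(errs)                                  # the residual norm is non-increasing in m
```

The residual norm never increases from one iteration to the next, which is visible in the printed errors; the slow decay for generic dictionaries is exactly what motivates the modified algorithms below.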
We recall the following upper estimate of the convergence rate of the PGA from [10].

Theorem 1. Let $\mathcal{D}$ be an arbitrary dictionary in $H$. Then, for each $f\in\mathcal{A}_1(\mathcal{D},B)$ the PGA has the following convergence rate:
$$\|f-G_m(f,\mathcal{D})\|\le B\,m^{-1/6}.$$

Since the rate of convergence of the PGA was unsatisfying, some modified greedy algorithms, such as the Orthogonal Greedy Algorithm (OGA) and the Rescaled Pure Greedy Algorithm (RPGA), were proposed. We first recall the definition of the OGA from [10].
OGA($H$, $\mathcal{D}$):
Step 0: Define $G_0:=0$.
Step $m$ ($m\ge 1$):
- If $f=G_{m-1}$, stop the algorithm and define $G_k=G_{m-1}$ for $k\ge m$.
- If $f\ne G_{m-1}$, choose an element $g_m\in\mathcal{D}$ such that
$$|\langle f-G_{m-1},g_m\rangle|=\sup_{g\in\mathcal{D}}|\langle f-G_{m-1},g\rangle|.$$
Define the next approximant to be
$$G_m:=P_m(f),$$
where $P_m$ is the orthogonal projection operator onto $\mathrm{span}\{g_1,\ldots,g_m\}$, and proceed to Step $m+1$.
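The OGA differs from the PGA only in the update step: after the greedy selection, all coefficients are re-fitted by projecting $f$ onto the span of every selected element. A minimal sketch in the same finite-dimensional setup, where a least-squares solve plays the role of the orthogonal projection $P_m$ (the setup is illustrative, not the paper's experiment):

```python
import numpy as np

def oga(f, D, m):
    """Orthogonal Greedy Algorithm sketch. D is a (d, n) array
    with unit-norm columns."""
    idx = []
    G = np.zeros_like(f)                     # G_0 := 0
    for _ in range(m):
        r = f - G
        if np.allclose(r, 0.0):
            break
        j = int(np.argmax(np.abs(D.T @ r)))  # greedy selection step
        if j not in idx:
            idx.append(j)
        A = D[:, idx]
        coef, *_ = np.linalg.lstsq(A, f, rcond=None)  # projection onto span
        G = A @ coef                         # G_m := P_m(f)
    return G

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
f = D @ rng.standard_normal(50)
e1 = np.linalg.norm(f - oga(f, D, 1))
e30 = np.linalg.norm(f - oga(f, D, 30))
print(e1, e30)
```

The growing least-squares solve is the $m$-dimensional optimization problem that Section 6 identifies as the main computational cost of the OGA.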
In [10], the authors derived the following convergence rate for the OGA.

Theorem 2. Let $\mathcal{D}$ be an arbitrary dictionary in $H$. Then, for each $f\in\mathcal{A}_1(\mathcal{D},B)$ the OGA has the following convergence rate:
$$\|f-G_m(f,\mathcal{D})\|\le B\,m^{-1/2}.$$

It is clear that when $\mathcal{D}$ is an orthonormal basis, the rate $m^{-1/2}$ is sharp, see [6]. Thus, this convergence rate serves as a benchmark for the approximation ability of greedy-type algorithms.
To study the error for general $f\in H$, in [16], the authors defined the $K$-functional for the pair $(H,\mathcal{A}_1(\mathcal{D}))$ as follows:
$$K(f,t):=\inf_{h\in\mathcal{A}_1(\mathcal{D})}\big\{\|f-h\|+t\,|h|_{\mathcal{A}_1(\mathcal{D})}\big\},\qquad t>0.$$
They proved that for any $f\in H$, the output $G_m(f,\mathcal{D})$ of the OGA satisfies
$$\|f-G_m(f,\mathcal{D})\|\le C\,K\big(f,m^{-1/2}\big).$$
In [11], Petrova proposed the RPGA as follows:
RPGA($H$, $\mathcal{D}$):
Step 0: Define $f_0:=0$.
Step $m$ ($m\ge 1$):
- If $f=f_{m-1}$, stop the algorithm and define $f_k=f_{m-1}$ for $k\ge m$.
- If $f\ne f_{m-1}$, choose an element $g_m\in\mathcal{D}$ such that
$$|\langle f-f_{m-1},g_m\rangle|=\sup_{g\in\mathcal{D}}|\langle f-f_{m-1},g\rangle|.$$
With
$$\lambda_m:=\langle f-f_{m-1},g_m\rangle,\qquad \hat f_m:=f_{m-1}+\lambda_m g_m,\qquad s_m:=\frac{\langle f,\hat f_m\rangle}{\|\hat f_m\|^2},$$
define the next approximant to be
$$f_m:=s_m\hat f_m,$$
and proceed to Step $m+1$.
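The RPGA update is a PGA-style step followed by a single rescaling of the whole approximant, i.e., the projection of $f$ onto the one-dimensional space spanned by $\hat f_m$. A minimal sketch in the same toy setup (the dictionary and target are illustrative assumptions):

```python
import numpy as np

def rpga(f, D, m):
    """Rescaled Pure Greedy Algorithm sketch (one 1-D optimization per step)."""
    fm = np.zeros_like(f)                    # f_0 := 0
    for _ in range(m):
        r = f - fm
        if np.allclose(r, 0.0):
            break
        c = D.T @ r
        j = int(np.argmax(np.abs(c)))        # greedy selection step
        f_hat = fm + c[j] * D[:, j]          # PGA-style intermediate step
        s = (f @ f_hat) / (f_hat @ f_hat)    # s_m = <f, f_hat> / ||f_hat||^2
        fm = s * f_hat                       # f_m := s_m * f_hat
    return fm

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
f = D @ rng.standard_normal(50)
e1 = np.linalg.norm(f - rpga(f, D, 1))
e200 = np.linalg.norm(f - rpga(f, D, 200))
print(e1, e200)
```

The rescaling costs only one inner product and one norm evaluation per iteration, which is the source of the computational advantage over the OGA's full projection.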
From the definition, we find that the RPGA is simpler than the OGA from the viewpoint of computational complexity. In [11], the author derived the following convergence rate for the RPGA.

Theorem 3. Let $\mathcal{D}$ be an arbitrary dictionary in $H$. If $f\in\mathcal{A}_1(\mathcal{D},B)$, then the RPGA has the following convergence rate:
$$\|f-f_m\|\le B\,m^{-1/2}.$$

She also proved that for any $f\in H$, the output $f_m$ of the RPGA satisfies
$$\|f-f_m\|\le C\,K\big(f,m^{-1/2}\big).$$
Now we recall super greedy algorithms. In [14], Liu and Temlyakov proposed the Weak Orthogonal Super Greedy Algorithm (WOSGA). The WOSGA with parameter $s$ and a weakness sequence $\tau=\{t_k\}_{k=1}^{\infty}$, where $t_k\in(0,1]$, was defined as follows:
WOSGA($s$, $\tau$):
Initially, we define $f_0:=f$ and $G_0:=0$. Then for a natural number $s$ and each $m\ge 1$, we inductively define:
- (1): Denote $r:=f_{m-1}$. Let $g_{(m-1)s+1},\ldots,g_{ms}\in\mathcal{D}$ satisfy the inequality
$$\min_{(m-1)s+1\le i\le ms}|\langle r,g_i\rangle|\ \ge\ t_m\sup_{g\in\mathcal{D}\setminus\{g_1,\ldots,g_{ms}\}}|\langle r,g\rangle|.$$
- (2): Let $H_m:=\mathrm{span}\{g_1,\ldots,g_{ms}\}$ and $P_{H_m}$ denote the orthogonal projection operator onto $H_m$. Define
$$G_m:=P_{H_m}(f),\qquad f_m:=f-G_m.$$
The WOSGA selects more than one element from a dictionary in each iteration step and hence reduces the computational burden of the conventional OGA. Thus, compared with the OGA, the WOSGA can construct the approximant more quickly. We recall some results on the error estimates of the WOSGA. When $t_k=t$ for all $k$, we use the notation OSGA($s$, $t$) instead of WOSGA($s$, $\tau$). We denote by
$$\mu:=\mu(\mathcal{D}):=\sup_{g\ne h;\ g,h\in\mathcal{D}}|\langle g,h\rangle|$$
the coherence of a dictionary $\mathcal{D}$. It is clear that the smaller the $\mu$, the more the dictionary $\mathcal{D}$ resembles an orthonormal basis. Throughout this paper, we consider dictionaries with small values of coherence $\mu$ and call them $\mu$-coherent dictionaries.
In [14], the authors derived the following convergence rate for the OSGA($s$, $t$).

Theorem 4. Let $\mathcal{D}$ be a dictionary with coherence $\mu$. Then, for $f\in\mathcal{A}_1(\mathcal{D},B)$ and suitable $s$ and $t$ (with $s$ restricted in terms of $\mu$), the OSGA($s$, $t$) provides, after $m$ iterations, an approximation of $f$ with the following convergence rate:
$$\|f-G_m\|\le C(t,\mu s)\,B\,(sm)^{-1/2}.$$

In [13], the authors established the following error bound for the OSGA($s$, $t$).

Theorem 5. Let $\mathcal{D}$ be a dictionary with coherence $\mu$. Then, for any $f\in H$ and suitable $s$ and $t$, the OSGA($s$, $t$) provides, after $m$ iterations, an approximation of $f$ with the error bound
$$\|f-G_m\|\le C\,K\big(f,(sm)^{-1/2}\big).$$

Motivated by the above results, in Section 2, we propose the Weak Rescaled Pure Super Greedy Algorithm (WRPSGA) and estimate the error of this algorithm by means of the $K$-functional. We give an error estimate of the form (1) for the WRPSGA. This estimate implies that the convergence rate of the WRPSGA for $f\in\mathcal{A}_1(\mathcal{D})$ is of order $m^{-1/2}$, which is optimal. In Section 3, we design the Rescaled Pure Super Greedy Learning Algorithm (RPSGLA) for solving the regression problem, which is fundamental in statistical learning. We derive a learning rate that can be arbitrarily close to the best rate $m^{-1}$. In Section 4, we prove the main result stated in Section 3. In Section 5, we test the performance of the RPSGLA by numerical experiments. The simulation results show that the RPSGLA is very efficient for regression. In Section 6, we compare the RPSGA with other greedy algorithms to show that the efficiency of the RPSGA is the best. In Section 7, we give the conclusions of our study and make some suggestions for further studies.
2. The Weak Rescaled Pure Super Greedy Algorithms
In this section, we present the definition of the Weak Rescaled Pure Super Greedy Algorithm (WRPSGA) and study its approximation ability.
The WRPSGA with parameter $s$ and a weakness sequence $\tau=\{t_k\}_{k=1}^{\infty}$, where $t_k\in(0,1]$, is defined as follows:
WRPSGA($s$, $\tau$):
Initially, we define $f_0:=0$. Then for a natural number $s$ and each $m\ge 1$, we inductively define:
- (1): Denote $r_{m-1}:=f-f_{m-1}$. Let $g_{(m-1)s+1},\ldots,g_{ms}\in\mathcal{D}$ satisfy the inequality
$$\min_{(m-1)s+1\le i\le ms}|\langle r_{m-1},g_i\rangle|\ \ge\ t_m\sup_{g\in\mathcal{D}}|\langle r_{m-1},g\rangle|.$$
- (2): Let $H_m:=\mathrm{span}\{g_{(m-1)s+1},\ldots,g_{ms}\}$ and $P_{H_m}$ denote the orthogonal projection operator onto $H_m$. With
$$\hat f_m:=f_{m-1}+P_{H_m}(r_{m-1}),\qquad s_m:=\frac{\langle f,\hat f_m\rangle}{\|\hat f_m\|^2},$$
define the next approximant to be
$$f_m:=s_m\hat f_m.$$
When $t_k=t$ for all $k$, we write RPSGA($s$, $t$) for the WRPSGA($s$, $\tau$).
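The scheme above, the super selection of $s$ elements, the projection of the residual onto their $s$-dimensional span, and the one-dimensional rescaling, can be sketched as follows. This is an interpretation of the definition in $\mathbb{R}^d$ for $t=1$, not the authors' code; the dictionary and target are illustrative.

```python
import numpy as np

def rpsga(f, D, m, s):
    """RPSGA(s, 1) sketch. D is a (d, n) array with unit-norm columns."""
    fm = np.zeros_like(f)                        # f_0 := 0
    for _ in range(m):
        r = f - fm
        if np.allclose(r, 0.0):
            break
        c = D.T @ r
        idx = np.argsort(-np.abs(c))[:s]         # super selection: s largest |<r, g>|
        A = D[:, idx]
        coef, *_ = np.linalg.lstsq(A, r, rcond=None)  # projection of r onto H_m
        f_hat = fm + A @ coef                    # f_hat = f_{m-1} + P_{H_m}(r_{m-1})
        s_m = (f @ f_hat) / (f_hat @ f_hat)      # one-dimensional rescaling
        fm = s_m * f_hat
    return fm

rng = np.random.default_rng(0)
D = rng.standard_normal((30, 100))
D /= np.linalg.norm(D, axis=0)
f = D @ rng.standard_normal(100)
e1 = np.linalg.norm(f - rpsga(f, D, 1, s=5))
e20 = np.linalg.norm(f - rpsga(f, D, 20, s=5))
print(e1, e20)
```

Each iteration solves one $s$-dimensional least-squares problem and one scalar rescaling, which is the complexity advantage over the OSGA($s$, $t$) discussed in Section 6.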
We now state the main result about the error estimate for the RPSGA(s, t).
Theorem 6. Let $\mathcal{D}$ be a dictionary with coherence parameter $\mu$, and let $t\in(0,1]$. Then, for any $f\in H$ and suitable $s$, the error of the RPSGA($s$, $t$) satisfies
$$\|f-f_m\|\le C_1\,K\big(f,(sm)^{-1/2}\big),$$
where $C_1$ is a constant depending at most on $t$ and $\mu s$. Theorem 6 implies the following corollary.
Corollary 1. Under the assumption of Theorem 6, if $f\in\mathcal{A}_1(\mathcal{D},B)$, then the error of the RPSGA($s$, $t$) satisfies
$$\|f-f_m\|\le C_1\,B\,(sm)^{-1/2}.$$

Remark 1. We remark that the RPSGA($s$, $t$) adds $s$ new elements at each iteration, while the RPGA adds only one element at each iteration, so the RPSGA($s$, $t$) can reduce the computational cost of the RPGA. Theorem 4 and Corollary 1 show that the performance of the RPSGA($s$, $t$) is as good as that of the OSGA($s$, $t$) in the sense of the rate of convergence. However, from a computational point of view, the OSGA($s$, $t$) is more expensive to implement since at each step it requires the evaluation of the orthogonal projection of $f$ onto the span of all selected elements. In contrast, the output of the RPSGA($s$, $t$) is the orthogonal projection of $f$ onto the one-dimensional space spanned by $\hat f_m$.
Remark 2. Although the convergence rates of the OSGA($s$, $t$) and RPSGA($s$, $t$) on $\mathcal{A}_1(\mathcal{D})$ are almost the same, see Theorem 4 and Corollary 1, the corresponding constants are different. The constant in Corollary 1 is not as good as that in Theorem 4. This is because Corollary 1 is derived from Theorem 6, which holds for any $f\in H$ and not just for $f\in\mathcal{A}_1(\mathcal{D})$. There is a trade-off between universality and accuracy; therefore, the constant is not so good. Nevertheless, our results still have some advantages. The range of $s$ in Corollary 1 is wider than that in Theorem 4, and the factor in Corollary 1 is slightly better than that in Theorem 4. It would be very interesting to derive the error of the RPSGA($s$, $t$) for $f\in\mathcal{A}_1(\mathcal{D})$ directly. This would help us to see if the constant can be improved.
In order to prove Theorem 6, we need the following two lemmas.
Lemma 1 ([13]). Assume a dictionary $\mathcal{D}$ has coherence $\mu$. Let $g_1,\ldots,g_s$ be distinct elements of $\mathcal{D}$ and $H_s:=\mathrm{span}\{g_1,\ldots,g_s\}$. Then, we have
$$\|P_{H_s}(r)\|^2\ \ge\ \frac{1}{1+(s-1)\mu}\sum_{i=1}^{s}|\langle r,g_i\rangle|^2.$$

Lemma 2 ([23]). Let , , , be fixed, and and be sequences of non-negative numbers satisfying the inequalities. Then, we have

Proof of Theorem 6: First, we show that for any
and any
, the inequality
holds for
Since
is dense in
, it is enough to prove (
2) for elements that are finite sums
with
and
. Let us fix
, and choose a representation for
, such that
Let be the projection of f onto . Note that can be approximated arbitrarily well by elements of the above form in .
Suppose q is such that with . Then, the above assumption on the sequence implies that and
We claim that the elements will be chosen among at the first iteration.
Indeed, for
, we have
For all
g distinct from
, we have
Our assumption implies Then, we obtain This implies that Thus, we do not pick any distinct from until we have chosen all
We denote
and
Since
is the orthogonal projection of
f onto span
, we have
and
According to the definition of
and the choice of
, we obtain
Denote
. We have
Combining (
4) with (
5), we have
It follows from (
6) that
is a decreasing sequence, and hence,
is also a decreasing sequence.
Now the proof is divided into two cases, according to whether or .
Case 1:
. Therefore
for all
, hence inequality (
2) holds for
Case 2:
. With (
3), we notice that
We proceed with a lower estimate for
. We consider the following quantity for
where
Applying Lemma 1, we have
In order to get the relation between
and
, we consider an arbitrary set
of distinct elements of
. Let
and
. Then
Using the choice of
, we have
and hence
Therefore, by (
8) and Lemma 1, we obtain
For any
, we turn to bound
. For
, we write
as
By setting
, we first bound
as follows
Since the sequence
has the property
we may use the simple inequality
to derive
Next, we bound
as follows
Furthermore, by (
7) and (
10)–(
12), we have
By (
6), (
9), and (
13), we get
Let
. Subtracting
from both sides in (
14), we obtain
where the constant
is defined in Theorem 6.
Case 2.1:
. In terms of the decreasing property of the sequence
, either
, or for some
we have that
. Then for
the arguments are as in Case 1. Applying Lemma 2 to the positive numbers in the sequence
with
,
,
,
and
, we obtain
Case 2.2:
. Taking
in inequality (
15), we get
Therefore
, that is,
, which gives (
2) because of monotonicity.
Now inequality (
2) has been proved completely. Then, we have
□
Proof of Corollary 1: For $f\in\mathcal{A}_1(\mathcal{D},B)$, taking $h=f$ in the definition of the $K$-functional gives $K(f,(sm)^{-1/2})\le B\,(sm)^{-1/2}$. Thus, we get the desired result from Theorem 6. □
3. The Rescaled Pure Super Greedy Learning Algorithms
In this section, we consider the application of the RPSGA(s, 1) to supervised learning. In this context, the RPSGA(s, 1) has its new form. We call it the Rescaled Pure Super Greedy Learning Algorithm (RPSGLA). The precise definition of the RPSGLA will be given later.
We first formulate the problem of supervised learning. Let the input space $X\subset\mathbb{R}^d$ be a compact subset and the output space $Y\subseteq\mathbb{R}$. Let $\rho$ be a Borel probability measure on $Z:=X\times Y$. The generalization error for a function $f:X\to Y$ is defined as
$$\mathcal{E}(f):=\int_{Z}(f(x)-y)^2\,d\rho.$$
For each input $x$ and output $y$, $(f(x)-y)^2$ is the error suffered from the use of $f$ as a model for the process producing $y$ from $x$. By integrating over $Z$ with respect to $\rho$, we average out the error over all pairs $(x,y)$.
Denote by $L^2_{\rho_X}$ the Hilbert space of the square integrable functions defined on $X$ with respect to the measure $\rho_X$, where $\rho_X$ is the marginal measure of $\rho$ on $X$, and
$$\|f\|_{\rho}:=\Big(\int_X|f(x)|^2\,d\rho_X\Big)^{1/2}.$$
The regression function $f_\rho$, which minimizes $\mathcal{E}(f)$ over all $f\in L^2_{\rho_X}$, is given by
$$f_\rho(x):=\int_Y y\,d\rho(y\,|\,x),$$
where $\rho(y\,|\,x)$ is the conditional distribution induced by $\rho$ at $x$. In the framework of supervised learning, $\rho$ is unknown and what we have in hand is a set of random samples $\mathbf{z}=\{(x_i,y_i)\}_{i=1}^{m}$; without loss of generality, we assume that $|y_i|\le M$ for a fixed $M>0$, drawn from the measure $\rho$ identically and independently. The task is to find a good approximation $f_{\mathbf{z}}$ of the regression function, which is derived from some learning algorithm. To measure the approximation ability of $f_{\mathbf{z}}$, we estimate the excess generalization error
$$\mathcal{E}(f_{\mathbf{z}})-\mathcal{E}(f_\rho)=\|f_{\mathbf{z}}-f_\rho\|_{\rho}^{2}.$$
In the design of the learning algorithm, we replace the generalization error $\mathcal{E}(f)$ with the empirical error
$$\mathcal{E}_{\mathbf{z}}(f):=\frac{1}{m}\sum_{i=1}^{m}(f(x_i)-y_i)^2.$$
We expect to find a good approximation of $f_\rho$ by minimizing $\mathcal{E}_{\mathbf{z}}(f)$ in a suitable way.
Given training data $\mathbf{z}$, the empirical inner product and empirical norm are defined as follows:
$$\langle f,g\rangle_m:=\frac{1}{m}\sum_{i=1}^{m}f(x_i)g(x_i),\qquad \|f\|_m:=\langle f,f\rangle_m^{1/2}.$$
With the definition of the empirical inner product and empirical norm, the empirical error can be represented as follows:
$$\mathcal{E}_{\mathbf{z}}(f)=\|f-y\|_m^{2},$$
where $y$ is regarded as the function taking the value $y_i$ at $x_i$.
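The identity between the empirical error and the squared empirical norm of $f-y$ can be checked directly on a toy sample (the sample itself and the candidate model are illustrative assumptions):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 5)                 # sample inputs x_1, ..., x_m
y = np.sin(2 * np.pi * x)                    # sample outputs y_1, ..., y_m

def emp_inner(u, v):
    """<u, v>_m = (1/m) sum u(x_i) v(x_i); u, v given by their values."""
    return np.mean(u * v)

def emp_norm(u):
    return np.sqrt(emp_inner(u, u))

f_vals = 0.5 * np.ones_like(x)               # a candidate model f, tabulated on x
emp_error = np.mean((f_vals - y) ** 2)       # empirical error E_z(f)
print(emp_error, emp_norm(f_vals - y) ** 2)  # the two quantities agree
```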
We consider the kernel-based greedy learning algorithms, which have been studied extensively in machine learning, see [13,16,18,20,22,24]. We will use continuous kernels, which are more general than the usual Mercer kernels, see [25,26]. The hypothesis spaces based on continuous kernels have been used widely in many fields of machine learning, see, for instance, Refs. [18,27,28]. In this way, a wider selection of the kernel offers more flexibility. We say a function $K$ defined on $X\times X$ is a continuous kernel if $K$ is continuous at every $(u,x)\in X\times X$. Let $K$ be a continuous and symmetric kernel. Assume that, in addition, $K$ is positive definite. Then, the space
$$H:=\mathrm{span}\{K(u,\cdot):\ u\in X\}$$
is a pre-Hilbert space equipped with the inner product
$$\Big\langle\sum_i a_iK(u_i,\cdot),\ \sum_j b_jK(v_j,\cdot)\Big\rangle_K:=\sum_{i,j}a_ib_jK(u_i,v_j).$$
Let $\mathcal{H}_K$ be the closure of $H$ with respect to the norm induced by the above inner product. Then, $\mathcal{H}_K$ is a reproducing kernel Hilbert space (RKHS) associated with the kernel $K$. The reproducing property is given by $\langle f,K(u,\cdot)\rangle_K=f(u)$ for $f\in\mathcal{H}_K$ and $u\in X$. For more details, one can see [25].
We define a data-independent hypothesis space as follows.
Definition 1. Define a data-independent hypothesis space with the associated norm. Note that the space is a special kind of $\mathcal{A}_1(\mathcal{D})$ space, which was introduced in [10], see Section 1.
It follows from the definition of $f_\rho$ that $\|f_\rho\|_{\infty}\le M$, so it is natural to restrict the approximation functions to $[-M,M]$. The following truncation operator has been used in the error analysis of learning algorithms for improving the learning rate, see [16,18,22].
Definition 2. The truncation operator $\pi_M$ is defined on the space of measurable functions as
$$\pi_M(f)(x):=\begin{cases} M, & f(x)>M,\\ f(x), & |f(x)|\le M,\\ -M, & f(x)<-M.\end{cases}$$
Given training data $\mathbf{z}$, we define the data-dependent hypothesis space as the span of the kernel sections $\{K(x_i,\cdot)\}_{i=1}^{m}$, and define the norm on it analogously. So, this space is a subspace of the data-independent one. We will choose an approximant for $f_\rho$ from the data-dependent space by the RPSGLA.
We now present the RPSGLA as Algorithm 1:

Algorithm 1 The RPSGLA
Input: training data $\mathbf{z}=\{(x_i,y_i)\}_{i=1}^{m}$, a kernel $K$, a parameter $s$, and a stopping criterion.
Step 1. Normalization: form the dictionary of kernel sections $\{K(x_i,\cdot)\}_{i=1}^{m}$, normalized in the empirical norm.
Step 2. Computation: let $f_0:=0$ and, for $k\ge 1$:
1: Denote $r_{k-1}:=y-f_{k-1}$. Let $g_{(k-1)s+1},\ldots,g_{ks}$ satisfy the inequality of the super selection step, with all inner products taken to be empirical.
2: Let $H_k:=\mathrm{span}\{g_{(k-1)s+1},\ldots,g_{ks}\}$ and $P_{H_k}$ denote the orthogonal projection operator onto $H_k$ with respect to the empirical inner product. With $\hat f_k:=f_{k-1}+P_{H_k}(r_{k-1})$ and $s_k:=\langle y,\hat f_k\rangle_m/\|\hat f_k\|_m^{2}$, define the next approximant to be $f_k:=s_k\hat f_k$.
3: If the stopping criterion is met, then break.
Output: the truncated estimator $\pi_M(f_k)$.
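On a toy regression problem, the RPSGLA can be sketched as follows: the dictionary consists of the kernel sections $K(x_i,\cdot)$ tabulated on the sample, all inner products are empirical, and the final output is truncated at $M$. The kernel width, sample size, $s$, and iteration count are illustrative assumptions, not values from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200
x = np.sort(rng.uniform(-1.0, 1.0, m))
y = np.cos(np.pi * x) + 0.1 * rng.standard_normal(m)
M = np.max(np.abs(y))                           # truncation level

Kmat = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)  # Gaussian kernel sections
Kmat /= np.sqrt(np.mean(Kmat ** 2, axis=0))     # normalize in the empirical norm

def rpsgla(y, Kmat, iters, s):
    f = np.zeros_like(y)
    for _ in range(iters):
        r = y - f
        c = Kmat.T @ r / len(y)                 # empirical inner products <r, g>_m
        idx = np.argsort(-np.abs(c))[:s]        # super selection step
        A = Kmat[:, idx]
        coef, *_ = np.linalg.lstsq(A, r, rcond=None)  # empirical projection of r
        f_hat = f + A @ coef
        s_k = np.mean(y * f_hat) / np.mean(f_hat ** 2)  # rescaling step
        f = s_k * f_hat
    return np.clip(f, -M, M)                    # truncation operator pi_M

fit = rpsgla(y, Kmat, iters=10, s=5)
print(np.mean((fit - y) ** 2))                  # empirical error after 10 iterations
```

Minimizing the Euclidean norm in the least-squares solve is equivalent to minimizing the empirical norm, since the two differ by the constant factor $1/m$.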
In the case of $s=1$, this algorithm coincides with the Rescaled Pure Greedy Learning Algorithm (RPGLA), which was studied in [24]. Here we concentrate on $s\ge 2$.
We recall the following condition, which has been widely used for error analysis, see, for instance, Refs. [18,20,26,27,29].
Definition 3. We say that a kernel K is a kernel with smoothness γ if there exists some constant such that the corresponding Hölder-type condition holds, where ⌊γ⌋ denotes the largest integer not exceeding γ and the ⌊γ⌋-th partial derivative of K is taken with respect to the variable x. We recall two types of kernels that are used widely in practice. They are kernels with smoothness γ for any γ > 0.
The first one is the Gaussian kernels:
$$K(u,x)=\exp\Big(-\frac{\|u-x\|^{2}}{\sigma^{2}}\Big),\qquad \sigma>0.$$
The second one is the polynomial kernels:
$$K(u,x)=(1+u\cdot x)^{n},\qquad n\in\mathbb{N}.$$
Here $u\cdot x$ is the Euclidean inner product of $u$ and $x$ in $\mathbb{R}^d$.
Let $C(X)$ denote the space of continuous functions defined on $X$.
Definition 4. For , we define the space to be the set of all functions f such that, for all m, there exist such that where denotes the uniform norm on , and the smallest constant for which this holds defines a norm for . Under this assumption on the regression function, we obtain the following convergence rate of the RPSGLA.
Theorem 7. Assume that and . Let K be a kernel with smoothness γ for any γ > 0. Choose . For , the output of the RPSGLA satisfies the following inequality with confidence : where and the constant c depends at most on , δ, K, and M.

Remark 3. If we take K as an infinitely smooth kernel, such as a Gaussian kernel, then when r is sufficiently close to 1/2, the convergence rate of the RPSGLA can be arbitrarily close to , which is the best learning rate one can obtain so far, see [18,30]. Since in this case the result of Theorem 7 is valid for any γ, the convergence rate of the RPSGLA can be arbitrarily close to when γ is sufficiently large. To see this, let and ; then the right-hand side of (17) tends to . Note that if we take a finitely smooth kernel, then we cannot obtain the above convergence rate. In this case, inequality (17) holds only for some fixed γ, but not for all γ.

Remark 4. We show that the efficiency of the RPSGLA is better than that of some existing greedy learning algorithms. The convergence rate of the RPSGLA is faster than that of the Orthogonal Super Greedy Learning Algorithm (OSGLA) in [13], where the rate was derived. Our convergence rate is also faster than the convergence rates of the Orthogonal Greedy Learning Algorithm (OGLA) and the Relaxed Greedy Learning Algorithm (RGLA) in [16] and the convergence rate of the RGLA in [20]. Additionally, our convergence rate is almost the same as that of the OGLA in [18] and the RPGLA in [24]. However, the complexity of the RPSGLA is smaller than that of the OGLA and the RPGLA. We will illustrate this in Section 5 and Section 6. On the other hand, since greedy learning is a large field, there are other greedy learning procedures that are quite different from ours, see [17,19,21].

Remark 5. Kernel-based greedy algorithms can be used to solve different problems. We only mention some typical works here; moreover, we focus on the approximation problems on RKHS.
In [31], the authors proposed kernel-based greedy algorithms to approximate non-linear vectorial functions and derived the rate . In [7,32], kernel-based greedy algorithms were used to approximate a linear functional defined on an RKHS. The authors proved that for the square-integrable functions, the convergence rate can attain . Roughly speaking, the convergence rate of greedy algorithms for functional approximation is similar to that of function approximation, while for the regression problem, one can obtain a faster convergence rate.

4. Error Analysis of the RPSGLA
In this section, we prove Theorem 7. The proof is divided into five parts: the error decomposition strategy, the estimate of the sample error, the estimate of the hypothesis error, the estimate of the approximation error, and the final synthesis. Theorem 7 is proved by assembling the results of the error estimates.
4.1. Error Decomposition Strategy
Before we show the error decomposition strategy of the error analysis, we construct a stepping-stone function
as follows. As
, there exists a
such that
Define
,
where
and
Lemma 3. Let be defined in (19). Then we have Proof. By the definition of
, we have
then the conclusion of Lemma 3 follows from the facts
and
□
With at hand, we can give an upper bound of as follows.
Proposition 1. Let be defined in Algorithm 1. Then, for the in (19), we have the following error decomposition, where the three terms are known as the sample error, the approximation error, and the hypothesis error, respectively, in learning theory.

4.2. Estimate of Sample Error
In this subsection, we will bound the sample error
. We set
and
Then, the sample error can be written as .
The bound of has been proved in [20] by using the one-sided Bernstein inequality and the inequality .
Proposition 2. For any , with confidence at least , we have

Since the function changes with the sample , we need to obtain a uniform upper bound of .
We should consider the data-dependent space
To estimate the capacity of , we need the concept of the empirical covering number.
Definition 5. Let E be a metric space with metric d and F be a subset of E. For any , the covering number of F with respect to ϵ and d is defined as the minimal number of balls of radius ϵ whose union covers F, that is,where Definition 6. ( empirical covering number) Let be a set of functions on X, and let Set for arbitrary . The empirical covering number of is defined bywhere the metric We will use the following result on the capacity of the unit ball .
Lemma 4 ([
30]).
Let X be a compact subset of and K be a kernel with some . Then, there exist an exponent p, , and a constant , for arbitrary , such that, where

Now we recall the concentration inequality from [
29].
Lemma 5. Assume that there are constants and such that and , for every . If, for arbitrary , the inequality holds for some and , then there exists a constant depending only on p such that for any , with probability at least , there holds, where

Proposition 3. If K is a kernel with smoothness γ, then for any , with confidence at least , we have

Proof. From the obvious inequalities
and
, we get the inequalities
and
For
we have
For any
, it follows that
By Lemma 4, we have
where
is a constant independent of
.
Applying Lemma 5 with
,
and
, then for any
and any
, the inequality
holds with confidence
, where
and
are constants depending at most on
d,
X, and the kernel
K and
It follows from the definition of
in Algorithm 1 that
. Then, there exists a constant
C depending at most on
d,
X,
M, and kernel
K, such that
□
4.3. Estimate of Hypothesis Error
In this subsection, we give an error estimate for . For this purpose, we need the following two lemmas.
The first one is the Hoeffding inequality, which was established by Wassily Hoeffding in 1963.
Lemma 6 ([33]). Let $X_1,\ldots,X_n$ be independent random variables bounded by the interval $[0,1]$. We define the empirical mean of these variables by
$$\overline{X}:=\frac{1}{n}(X_1+\cdots+X_n).$$
Then, the following inequality holds for any given $t>0$:
$$\mathbb{P}\big(\overline{X}-\mathbb{E}\,\overline{X}\ge t\big)\le\exp\big(-2nt^{2}\big).$$
When $X_1,\ldots,X_n$ are strictly bounded by the intervals $[a_i,b_i]$, the generalization of the above inequality holds for any given $t>0$:
$$\mathbb{P}\big(\overline{X}-\mathbb{E}\,\overline{X}\ge t\big)\le\exp\Big(-\frac{2n^{2}t^{2}}{\sum_{i=1}^{n}(b_i-a_i)^{2}}\Big).$$
The second one is an immediate consequence of Theorem 6.
Equip with the empirical inner product. Then H is a Hilbert space. Given training data , is a dictionary of H. We denote its coherence by
Lemma 7. Let K be a kernel with . For any there exists such that , . We define Then, for any and any and , the error of the RPSGLA satisfies Proof. According to (
2), for any
and any
, the inequality
holds for
Given training data
, since
then, combining (
20) with (
21), we get the desired result. □
Proposition 4. For any , with confidence at least , we have Proof. Applying Lemma 7 with
, we have
Since
changes with training data
, we should find its relation with
We know that
We also have that
and
Based on Lemma 6, for
and any
i, we have
By setting
, we have
. From (
23) and (
24), with the confidence
, we have
By the definition of
, we observe that the following inequality
holds with the confidence
Combining (
22), (
25) with Lemma 3, with the confidence at least
, we have
□
4.4. Estimate of Approximation Error
Finally, we estimate the approximation error.
Proposition 5. If , then
Proof. From the definition of
and (
16), there holds:
For
satisfying (
18), and
, from Theorem 2.2 in [
16], we obtain
□
4.5. Proof of Theorem 7
Now we prove Theorem 7.
Proof of Theorem 7: Assembling the results in Propositions 1–5, we have that the inequality
holds with confidence at least
.
Therefore,
holds with confidence at least
, where
This completes the proof of Theorem 7.
□
6. Discussion
The experiments in Section 5 directly show the superiority of the RPSGA(s, t). In this section, we explain why these results are achieved. By using the super selection step and one-dimensional optimization together for the first time, we obtain the simplest good greedy algorithm so far. We provide more details below. First, we recall three other greedy algorithms: the Relaxed Greedy Algorithm (RGA), the Greedy Algorithm with Free Relaxation (GAFR), and the Rescaled Relaxed Greedy Algorithm (RRGA).
In [10], the RGA was defined by the following steps.
RGA($H$, $\mathcal{D}$):
Step 0: Define $G_0:=0$.
Step $m$ ($m\ge 1$):
- If $f=G_{m-1}$, stop the algorithm and define $G_k=G_{m-1}$ for $k\ge m$.
- If $f\ne G_{m-1}$, choose an element $g_m\in\mathcal{D}$ such that
$$\langle f-G_{m-1},g_m\rangle=\sup_{g\in\mathcal{D}}\langle f-G_{m-1},g\rangle.$$
For $m\ge 1$, define the next approximant to be
$$G_m:=\Big(1-\frac{1}{m}\Big)G_{m-1}+\frac{1}{m}\,g_m,$$
and proceed to Step $m+1$.
In [10], the authors proved that the RGA converges only for target elements from $\mathcal{A}_1(\mathcal{D})$ and derived the following convergence rate for the RGA.

Theorem 8. Let $\mathcal{D}$ be an arbitrary dictionary in $H$. Then, for each $f\in\mathcal{A}_1(\mathcal{D},1)$, taking the approximant as in the RGA, we have the following convergence rate:
$$\|f-G_m\|\le C\,m^{-1/2}.$$
9], the GAFR was defined as follows:
GAFR(H, :
Step 0: Define .
Stepm:
- -
If , stop the algorithm and define for .
- -
If
, choose an element
such that
With
define the next approximant to be
and proceed to Step
.
In [9], the RRGA was defined as follows:
RRGA($H$, $\mathcal{D}$):
Step 0: Define $G_0:=0$.
Step $m$ ($m\ge 1$):
- If $f=G_{m-1}$, stop the algorithm and define $G_k=G_{m-1}$ for $k\ge m$.
- If $f\ne G_{m-1}$, choose an element $g_m\in\mathcal{D}$ such that
$$|\langle f-G_{m-1},g_m\rangle|=\sup_{g\in\mathcal{D}}|\langle f-G_{m-1},g\rangle|.$$
With
$$\lambda_m:=\operatorname*{arg\,min}_{\lambda\in\mathbb{R}}\|f-(G_{m-1}+\lambda g_m)\|,\qquad s_m:=\operatorname*{arg\,min}_{s\in\mathbb{R}}\|f-s(G_{m-1}+\lambda_m g_m)\|,$$
define the next approximant to be
$$G_m:=s_m(G_{m-1}+\lambda_m g_m),$$
and proceed to Step $m+1$.
In [
9], the authors derived the following convergence rate for the GAFR and RRGA.
Theorem 9. Let $\mathcal{D}$ be an arbitrary dictionary in $H$. Then, for each $f\in\mathcal{A}_1(\mathcal{D},B)$ the GAFR and RRGA have the following convergence rate:
$$\|f-G_m\|\le C\,B\,m^{-1/2}.$$

From Theorems 2–4, Corollary 1, and Theorems 8 and 9, for each $f\in\mathcal{A}_1(\mathcal{D},B)$, the OGA, RPGA, RGA, GAFR, RRGA, OSGA($s$, $t$), and RPSGA($s$, $t$) all have a convergence rate of order $m^{-1/2}$.
The rate $m^{-1/2}$ is optimal, and the results show that these algorithms have almost identical error performance. We call these types of greedy algorithms Good Greedy Algorithms (GGA). Therefore, we just compare the complexity and execution time of the GGA.
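The nearly identical error decay of the GGA can be observed on a small self-contained comparison of two of them, the OGA and the RPGA, on the same random dictionary. The setup is an illustrative assumption, not the experiment of Section 5.

```python
import numpy as np

def oga(f, D, m):
    idx, G = [], np.zeros_like(f)
    for _ in range(m):
        r = f - G
        j = int(np.argmax(np.abs(D.T @ r)))
        if j not in idx:
            idx.append(j)
        A = D[:, idx]
        coef, *_ = np.linalg.lstsq(A, f, rcond=None)  # full m-dimensional re-fit
        G = A @ coef
    return G

def rpga(f, D, m):
    fm = np.zeros_like(f)
    for _ in range(m):
        r = f - fm
        c = D.T @ r
        j = int(np.argmax(np.abs(c)))
        f_hat = fm + c[j] * D[:, j]
        fm = (f @ f_hat) / (f_hat @ f_hat) * f_hat    # one 1-D optimization
    return fm

rng = np.random.default_rng(2)
D = rng.standard_normal((40, 200))
D /= np.linalg.norm(D, axis=0)
f = D[:, :30] @ rng.standard_normal(30)
res = {m: (np.linalg.norm(f - oga(f, D, m)),
           np.linalg.norm(f - rpga(f, D, m))) for m in (5, 20, 80)}
for m, (e_oga, e_rpga) in res.items():
    print(m, e_oga, e_rpga)
```

Both errors decay with the iteration number, while the RPGA avoids the growing least-squares solve of the OGA, which is the complexity gap quantified below.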
In terms of scope of application, the RGA converges only for the target elements from . Obviously, the RPSGA(s, t) is better than the RGA.
From the viewpoint of complexity, the OGA has to solve an $m$-dimensional optimization problem
$$\min_{c_1,\ldots,c_m\in\mathbb{R}}\Big\|f-\sum_{i=1}^{m}c_ig_i\Big\|,$$
where $g_1,\ldots,g_m$ are the selected elements. The GAFR has to solve a two-dimensional optimization problem
$$\min_{w,\lambda\in\mathbb{R}}\|f-(wG_{m-1}+\lambda g_m)\|.$$
The RRGA employs two one-dimensional optimization problems,
$$\min_{\lambda\in\mathbb{R}}\|f-(G_{m-1}+\lambda g_m)\|\quad\text{and}\quad\min_{s\in\mathbb{R}}\|f-s(G_{m-1}+\lambda_m g_m)\|,$$
where $\lambda_m$ is the solution of the first problem. However, the RPGA only needs to solve a one-dimensional optimization problem
$$\min_{s\in\mathbb{R}}\|f-s\hat f_m\|.$$
Then, the RPGA is simpler than the OGA, RGA, GAFR, and RRGA. So the RPGA can save much execution time. According to the definition, the RPSGA(
s,
t) selects more than one element from a dictionary in each iteration step and hence reduces the computational burden of the RPGA (taking
s=1 in the RPSGA(
s,
t)), especially when the RPGA and RPSGA(
s,
t) run with noisy data, see
Figure 6. Combining with the empirical test results in
Table 1 and
Table 2 in
Section 5, it can be easily found that from the viewpoints of error performance and execution time, the RPSGA(
s,
t) is more effective than the RPGA. As far as we know, the only commonly used super greedy algorithm is the OSGA(s, t), and the OSGA(s, t) always needs to solve an sm-dimensional optimization problem, while the RPSGA(s, t) only needs to solve an s-dimensional optimization problem and a one-dimensional optimization problem.
Table 2 also shows the robustness and effectiveness of the RPSGA(
s,
t) compared with the OSGA(
s,
t). Therefore, the efficiency of the RPSGA(
s,
t) is the best among the GGA.
7. Conclusions and Further Studies
We propose a new type of super greedy algorithm—the WRPSGA. The RPSGA(
s,
t) is simpler than the OGA, RGA, RRGA, RPGA, and OSGA(
s,
t) from the viewpoint of computational complexity. The convergence rate of RPSGA(
s,
t) on
is optimal. Based on this result, we design the RPSGLA for solving kernel-based regression problems in supervised learning. When the kernel is infinitely smooth, we derive a significantly faster learning rate that can be arbitrarily close to the best rate
under some mild assumptions of the regression function. The efficiency of the RPSGLA is better than some existing greedy learning algorithms. For instance, the convergence rate of the RPSGLA is faster than the OSGLA in [
13], RGLA in [
20], and the OGLA and RGLA in [
16]. Additionally, our convergence rate is almost the same as that of the OGLA in [
18]. However, the complexity of the RPSGLA is smaller than the OGLA. We test the performance of the RPSGLA by numerical experiments. Our simulation results show that the RPSGLA is very efficient for regression.
In addition to the applications in machine learning, greedy algorithms can also be used to solve convex optimization problems, which are quite different from the approximation problems, see Refs. [23,34,35,36]. We formulate the problem of convex optimization as follows.
Let $H$ be a Hilbert space with an inner product $\langle\cdot,\cdot\rangle$, and let $E$ be a convex function on $H$ which is called the objective function. We assume that $E$ has a Fréchet derivative $E'(x)$ at each point $x\in H$, that is,
$$\lim_{\|h\|\to 0}\frac{|E(x+h)-E(x)-\langle E'(x),h\rangle|}{\|h\|}=0,$$
where $\|\cdot\|$ is the norm induced by $\langle\cdot,\cdot\rangle$. We want to find an approximate solution to the problem
$$\inf_{x\in\Omega}E(x),\qquad(26)$$
where $\Omega$ is a bounded convex subset of $H$.
For a dictionary $\mathcal{D}$ of $H$, we denote by $\Sigma_m(\mathcal{D})$ the set of all at most $m$-term linear combinations with respect to $\mathcal{D}$. We will develop a new greedy algorithm to produce an approximation of the minimizer of (26). This algorithm is a modification of the RPSGA; we denote it as the RPSGA(co). We present the definition of the RPSGA(co) as follows:
RPSGA(co):
Step 0: Define . If , stop the algorithm and define .
Step m: Assume that has been defined and . Denote
. Let
satisfy the inequality
Let
. With
define the next approximant to be
where
If , then stop the algorithm and define , otherwise go to Step .
We will impose the following assumptions on the objective function E.
Condition 0: $E$ has a Fréchet derivative $E'(x)$ at each point $x$ in $\Omega$, and $\sup_{x\in\Omega}\|E'(x)\|<\infty$.
Uniform Smoothness: There are constants $C_q>0$ and $1<q\le 2$ such that, for all $x$ and $y$ in $\Omega$,
$$E(y)-E(x)-\langle E'(x),y-x\rangle\ \le\ C_q\|y-x\|^{q}.$$
It is known from [23] that the OGA(co) can be used effectively to solve convex optimization problems. Since the RPSGA($s$, $t$) is more efficient than the OGA, the RPSGA(co) can solve these problems more efficiently. For the RPSGA(co), by estimating the decay of the optimization error as $m\to\infty$, we will obtain a convergence rate that is independent of the dimension of the underlying space $H$. So, the curse of dimensionality for problem (26) can be overcome by using the RPSGA(co). This work will be of great interest in practical applications. It is also important to compare the efficiency of the RPSGA(co) with other greedy optimization algorithms in future work.