1. Introduction
Non-linear sparse approximation with respect to dictionaries has been widely used in many application areas, such as machine learning, signal processing, and numerical computation, see [1,2,3,4,5,6,7,8]. Greedy algorithms have been used extensively for generating such approximations, see [9,10,11,12]. Among others, super greedy algorithms are very efficient for signal processing and machine learning, see [13,14,15]. In this paper, we propose a new super greedy algorithm, the Weak Rescaled Pure Super Greedy Algorithm (WRPSGA), which is simpler than some well-known greedy algorithms. We estimate the error of the WRPSGA and show that its convergence rate on the convex hull of the dictionary is optimal.
Then, we consider the application of the RPSGA to supervised learning. Since greedy algorithms have been proven to possess appealing generalization capacity with a low computational burden in [16], they have been used in supervised learning, see Refs. [13,16,17,18,19,20,21,22]. In this paper, we design the Rescaled Pure Super Greedy Learning Algorithm (RPSGLA) and derive an almost optimal convergence rate.
We first recall some basic notions of greedy approximation from [10]. Let $H$ be a Hilbert space with an inner product $\langle\cdot,\cdot\rangle$ and the norm $\|f\|:=\langle f,f\rangle^{1/2}$. A set of elements $\mathcal{D}\subset H$ is called a dictionary if $\|g\|=1$ for every $g\in\mathcal{D}$, and $\overline{\mathrm{span}}\,\mathcal{D}$ is $H$. We consider symmetric dictionaries, i.e., $g\in\mathcal{D}$ implies $-g\in\mathcal{D}$.
Let $G_m(f,\mathcal{D})$ be the output of a greedy-type algorithm with respect to a dictionary $\mathcal{D}$ after $m$ iterations. We want to estimate the decay of $\|f-G_m(f,\mathcal{D})\|$ as $m\to\infty$. To solve this problem, we need the following classes of sparse elements.
For a dictionary $\mathcal{D}$, we define the class of the elements
$$\mathcal{A}_1^{o}(\mathcal{D},B):=\Big\{f\in H:\ f=\sum_{k\in\Lambda}c_k g_k,\ g_k\in\mathcal{D},\ \#\Lambda<\infty,\ \sum_{k\in\Lambda}|c_k|\le B\Big\},$$
and by $\mathcal{A}_1(\mathcal{D},B)$ its closure in $H$. Let $\mathcal{A}_1(\mathcal{D})$ be the union of the classes $\mathcal{A}_1(\mathcal{D},B)$ over all $B>0$. For $f\in\mathcal{A}_1(\mathcal{D})$, we define the norm of $f$ as
$$|f|_{\mathcal{A}_1(\mathcal{D})}:=\inf\{B>0:\ f\in\mathcal{A}_1(\mathcal{D},B)\}.$$
The most natural greedy algorithm with respect to $\mathcal{D}$ is the Pure Greedy Algorithm (PGA). We recall its definition from [10].
PGA($H$, $\mathcal{D}$):
Step 0: Define $G_0:=0$.
Step $m$ ($m\ge 1$):
- If $f=G_{m-1}$, stop the algorithm and define $G_k=G_{m-1}$ for $k\ge m$.
- If $f\ne G_{m-1}$, choose an element $g_m\in\mathcal{D}$ such that
$$|\langle f-G_{m-1},g_m\rangle|=\sup_{g\in\mathcal{D}}|\langle f-G_{m-1},g\rangle|.$$
Define the next approximant to be
$$G_m:=G_{m-1}+\langle f-G_{m-1},g_m\rangle g_m,$$
and proceed to Step $m+1$.
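The PGA update above can be sketched numerically. The following is a minimal illustration in $\mathbb{R}^d$ with the Euclidean inner product; the random dictionary, target, and iteration counts are illustrative choices, not taken from the paper.

```python
import numpy as np

def pga(f, D, m):
    """Pure Greedy Algorithm sketch. D is a (d, n) array whose
    unit-norm columns play the role of the dictionary."""
    G = np.zeros_like(f)                     # G_0 := 0
    for _ in range(m):
        r = f - G                            # residual f - G_{m-1}
        if np.allclose(r, 0.0):
            break
        c = D.T @ r                          # inner products <r, g> for all g
        j = int(np.argmax(np.abs(c)))        # greedy selection step
        G = G + c[j] * D[:, j]               # G_m = G_{m-1} + <r, g_m> g_m
    return G

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)               # normalize so that ||g|| = 1
f = D @ rng.standard_normal(50)
errs = [np.linalg.norm(f - pga(f, D, m)) for m in (1, 10, 100)]
print(errs)                                  # the residual norm is non-increasing in m
```

The residual norm never increases from one iteration to the next, which is visible in the printed errors; the slow decay for generic dictionaries is exactly what motivates the modified algorithms below.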
We recall the following upper estimate of the convergence rate of the PGA from [10].

Theorem 1. Let $\mathcal{D}$ be an arbitrary dictionary in $H$. Then, for each $f\in\mathcal{A}_1(\mathcal{D},B)$ the PGA has the following convergence rate:
$$\|f-G_m(f,\mathcal{D})\|\le B\,m^{-1/6}.$$

Since the rate of convergence of the PGA was unsatisfying, some modified greedy algorithms, such as the Orthogonal Greedy Algorithm (OGA) and the Rescaled Pure Greedy Algorithm (RPGA), were proposed. We first recall the definition of the OGA from [10].
OGA($H$, $\mathcal{D}$):
Step 0: Define $G_0:=0$.
Step $m$ ($m\ge 1$):
- If $f=G_{m-1}$, stop the algorithm and define $G_k=G_{m-1}$ for $k\ge m$.
- If $f\ne G_{m-1}$, choose an element $g_m\in\mathcal{D}$ such that
$$|\langle f-G_{m-1},g_m\rangle|=\sup_{g\in\mathcal{D}}|\langle f-G_{m-1},g\rangle|.$$
Define the next approximant to be
$$G_m:=P_m(f),$$
where $P_m$ is the orthogonal projection operator onto $\mathrm{span}\{g_1,\ldots,g_m\}$, and proceed to Step $m+1$.
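The OGA differs from the PGA only in the update step: after the greedy selection, all coefficients are re-fitted by projecting $f$ onto the span of every selected element. A minimal sketch in the same finite-dimensional setup, where a least-squares solve plays the role of the orthogonal projection $P_m$ (the setup is illustrative, not the paper's experiment):

```python
import numpy as np

def oga(f, D, m):
    """Orthogonal Greedy Algorithm sketch. D is a (d, n) array
    with unit-norm columns."""
    idx = []
    G = np.zeros_like(f)                     # G_0 := 0
    for _ in range(m):
        r = f - G
        if np.allclose(r, 0.0):
            break
        j = int(np.argmax(np.abs(D.T @ r)))  # greedy selection step
        if j not in idx:
            idx.append(j)
        A = D[:, idx]
        coef, *_ = np.linalg.lstsq(A, f, rcond=None)  # projection onto span
        G = A @ coef                         # G_m := P_m(f)
    return G

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
f = D @ rng.standard_normal(50)
e1 = np.linalg.norm(f - oga(f, D, 1))
e30 = np.linalg.norm(f - oga(f, D, 30))
print(e1, e30)
```

The growing least-squares solve is the $m$-dimensional optimization problem that Section 6 identifies as the main computational cost of the OGA.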
In [10], the authors derived the following convergence rate for the OGA.

Theorem 2. Let $\mathcal{D}$ be an arbitrary dictionary in $H$. Then, for each $f\in\mathcal{A}_1(\mathcal{D},B)$ the OGA has the following convergence rate:
$$\|f-G_m(f,\mathcal{D})\|\le B\,m^{-1/2}.$$

It is clear that when $\mathcal{D}$ is an orthonormal basis, the rate $m^{-1/2}$ is sharp, see [6]. Thus, this convergence rate serves as a benchmark for the approximation ability of greedy-type algorithms.
To study the error for general $f\in H$, in [16], the authors defined the $K$-functional for the pair $(H,\mathcal{A}_1(\mathcal{D}))$ as follows:
$$K(f,t):=\inf_{h\in\mathcal{A}_1(\mathcal{D})}\big\{\|f-h\|+t\,|h|_{\mathcal{A}_1(\mathcal{D})}\big\},\qquad t>0.$$
They proved that for any $f\in H$, the output $G_m(f,\mathcal{D})$ of the OGA satisfies
$$\|f-G_m(f,\mathcal{D})\|\le C\,K\big(f,m^{-1/2}\big).$$
In [11], Petrova proposed the RPGA as follows:
RPGA($H$, $\mathcal{D}$):
Step 0: Define $f_0:=0$.
Step $m$ ($m\ge 1$):
- If $f=f_{m-1}$, stop the algorithm and define $f_k=f_{m-1}$ for $k\ge m$.
- If $f\ne f_{m-1}$, choose an element $g_m\in\mathcal{D}$ such that
$$|\langle f-f_{m-1},g_m\rangle|=\sup_{g\in\mathcal{D}}|\langle f-f_{m-1},g\rangle|.$$
With
$$\lambda_m:=\langle f-f_{m-1},g_m\rangle,\qquad \hat f_m:=f_{m-1}+\lambda_m g_m,\qquad s_m:=\frac{\langle f,\hat f_m\rangle}{\|\hat f_m\|^2},$$
define the next approximant to be
$$f_m:=s_m\hat f_m,$$
and proceed to Step $m+1$.
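The RPGA update is a PGA-style step followed by a single rescaling of the whole approximant, i.e., the projection of $f$ onto the one-dimensional space spanned by $\hat f_m$. A minimal sketch in the same toy setup (the dictionary and target are illustrative assumptions):

```python
import numpy as np

def rpga(f, D, m):
    """Rescaled Pure Greedy Algorithm sketch (one 1-D optimization per step)."""
    fm = np.zeros_like(f)                    # f_0 := 0
    for _ in range(m):
        r = f - fm
        if np.allclose(r, 0.0):
            break
        c = D.T @ r
        j = int(np.argmax(np.abs(c)))        # greedy selection step
        f_hat = fm + c[j] * D[:, j]          # PGA-style intermediate step
        s = (f @ f_hat) / (f_hat @ f_hat)    # s_m = <f, f_hat> / ||f_hat||^2
        fm = s * f_hat                       # f_m := s_m * f_hat
    return fm

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
f = D @ rng.standard_normal(50)
e1 = np.linalg.norm(f - rpga(f, D, 1))
e200 = np.linalg.norm(f - rpga(f, D, 200))
print(e1, e200)
```

The rescaling costs only one inner product and one norm evaluation per iteration, which is the source of the computational advantage over the OGA's full projection.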
From the definition, we find that the RPGA is simpler than the OGA from the viewpoint of computational complexity. In [11], the author derived the following convergence rate for the RPGA.

Theorem 3. Let $\mathcal{D}$ be an arbitrary dictionary in $H$. If $f\in\mathcal{A}_1(\mathcal{D},B)$, then the RPGA has the following convergence rate:
$$\|f-f_m\|\le B\,m^{-1/2}.$$

She also proved that for any $f\in H$, the output $f_m$ of the RPGA satisfies
$$\|f-f_m\|\le C\,K\big(f,m^{-1/2}\big).$$
Now we recall super greedy algorithms. In [14], Liu and Temlyakov proposed the Weak Orthogonal Super Greedy Algorithm (WOSGA). The WOSGA with parameter $s$ and a weakness sequence $\tau=\{t_k\}_{k=1}^{\infty}$, where $t_k\in(0,1]$, was defined as follows:
WOSGA($s$, $\tau$):
Initially, we define $f_0:=f$ and $G_0:=0$. Then for a natural number $s$ and each $m\ge 1$, we inductively define:
- (1): Denote $r:=f_{m-1}$. Let $g_{(m-1)s+1},\ldots,g_{ms}\in\mathcal{D}$ satisfy the inequality
$$\min_{(m-1)s+1\le i\le ms}|\langle r,g_i\rangle|\ \ge\ t_m\sup_{g\in\mathcal{D}\setminus\{g_1,\ldots,g_{ms}\}}|\langle r,g\rangle|.$$
- (2): Let $H_m:=\mathrm{span}\{g_1,\ldots,g_{ms}\}$ and $P_{H_m}$ denote the orthogonal projection operator onto $H_m$. Define
$$G_m:=P_{H_m}(f),\qquad f_m:=f-G_m.$$
The WOSGA selects more than one element from a dictionary in each iteration step and hence reduces the computational burden of the conventional OGA. Thus, compared with the OGA, the WOSGA can construct the approximant more quickly. We recall some results on the error estimates of the WOSGA. When $t_k=t$ for all $k$, we use the notation OSGA($s$, $t$) instead of WOSGA($s$, $\tau$). We denote by
$$\mu:=\mu(\mathcal{D}):=\sup_{g\ne h;\ g,h\in\mathcal{D}}|\langle g,h\rangle|$$
the coherence of a dictionary $\mathcal{D}$. It is clear that the smaller the $\mu$, the more the dictionary $\mathcal{D}$ resembles an orthonormal basis. Throughout this paper, we consider dictionaries with small values of coherence $\mu$ and call them $\mu$-coherent dictionaries.
In [14], the authors derived the following convergence rate for the OSGA($s$, $t$).

Theorem 4. Let $\mathcal{D}$ be a dictionary with coherence $\mu$. Then, for $f\in\mathcal{A}_1(\mathcal{D},B)$ and suitable $s$ and $t$ (with $s$ restricted in terms of $\mu$), the OSGA($s$, $t$) provides, after $m$ iterations, an approximation of $f$ with the following convergence rate:
$$\|f-G_m\|\le C(t,\mu s)\,B\,(sm)^{-1/2}.$$

In [13], the authors established the following error bound for the OSGA($s$, $t$).

Theorem 5. Let $\mathcal{D}$ be a dictionary with coherence $\mu$. Then, for any $f\in H$ and suitable $s$ and $t$, the OSGA($s$, $t$) provides, after $m$ iterations, an approximation of $f$ with the error bound
$$\|f-G_m\|\le C\,K\big(f,(sm)^{-1/2}\big).$$

Motivated by the above results, in Section 2, we propose the Weak Rescaled Pure Super Greedy Algorithm (WRPSGA) and estimate the error of this algorithm by means of the $K$-functional. We give an error estimate of the form (1) for the WRPSGA. This estimate implies that the convergence rate of the WRPSGA for $f\in\mathcal{A}_1(\mathcal{D})$ is of order $m^{-1/2}$, which is optimal. In Section 3, we design the Rescaled Pure Super Greedy Learning Algorithm (RPSGLA) for solving the regression problem, which is fundamental in statistical learning. We derive a learning rate that can be arbitrarily close to the best rate $m^{-1}$. In Section 4, we prove the main result stated in Section 3. In Section 5, we test the performance of the RPSGLA by numerical experiments. The simulation results show that the RPSGLA is very efficient for regression. In Section 6, we compare the RPSGA with other greedy algorithms to show that the efficiency of the RPSGA is the best. In Section 7, we give the conclusions of our study and make some suggestions for further studies.
2. The Weak Rescaled Pure Super Greedy Algorithms
In this section, we present the definition of the Weak Rescaled Pure Super Greedy Algorithm (WRPSGA) and study its approximation ability.
The WRPSGA with parameter $s$ and a weakness sequence $\tau=\{t_k\}_{k=1}^{\infty}$, where $t_k\in(0,1]$, is defined as follows:
WRPSGA($s$, $\tau$):
Initially, we define $f_0:=0$. Then for a natural number $s$ and each $m\ge 1$, we inductively define:
- (1): Denote $r_{m-1}:=f-f_{m-1}$. Let $g_{(m-1)s+1},\ldots,g_{ms}\in\mathcal{D}$ satisfy the inequality
$$\min_{(m-1)s+1\le i\le ms}|\langle r_{m-1},g_i\rangle|\ \ge\ t_m\sup_{g\in\mathcal{D}}|\langle r_{m-1},g\rangle|.$$
- (2): Let $H_m:=\mathrm{span}\{g_{(m-1)s+1},\ldots,g_{ms}\}$ and $P_{H_m}$ denote the orthogonal projection operator onto $H_m$. With
$$\hat f_m:=f_{m-1}+P_{H_m}(r_{m-1}),\qquad s_m:=\frac{\langle f,\hat f_m\rangle}{\|\hat f_m\|^2},$$
define the next approximant to be
$$f_m:=s_m\hat f_m.$$
When $t_k=t$ for all $k$, we write RPSGA($s$, $t$) for the WRPSGA($s$, $\tau$).
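The scheme above, the super selection of $s$ elements, the projection of the residual onto their $s$-dimensional span, and the one-dimensional rescaling, can be sketched as follows. This is an interpretation of the definition in $\mathbb{R}^d$ for $t=1$, not the authors' code; the dictionary and target are illustrative.

```python
import numpy as np

def rpsga(f, D, m, s):
    """RPSGA(s, 1) sketch. D is a (d, n) array with unit-norm columns."""
    fm = np.zeros_like(f)                        # f_0 := 0
    for _ in range(m):
        r = f - fm
        if np.allclose(r, 0.0):
            break
        c = D.T @ r
        idx = np.argsort(-np.abs(c))[:s]         # super selection: s largest |<r, g>|
        A = D[:, idx]
        coef, *_ = np.linalg.lstsq(A, r, rcond=None)  # projection of r onto H_m
        f_hat = fm + A @ coef                    # f_hat = f_{m-1} + P_{H_m}(r_{m-1})
        s_m = (f @ f_hat) / (f_hat @ f_hat)      # one-dimensional rescaling
        fm = s_m * f_hat
    return fm

rng = np.random.default_rng(0)
D = rng.standard_normal((30, 100))
D /= np.linalg.norm(D, axis=0)
f = D @ rng.standard_normal(100)
e1 = np.linalg.norm(f - rpsga(f, D, 1, s=5))
e20 = np.linalg.norm(f - rpsga(f, D, 20, s=5))
print(e1, e20)
```

Each iteration solves one $s$-dimensional least-squares problem and one scalar rescaling, which is the complexity advantage over the OSGA($s$, $t$) discussed in Section 6.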
We now state the main result about the error estimate for the RPSGA(s, t).
Theorem 6. Let $\mathcal{D}$ be a dictionary with coherence parameter $\mu$, and let $t\in(0,1]$. Then, for any $f\in H$ and suitable $s$, the error of the RPSGA($s$, $t$) satisfies
$$\|f-f_m\|\le C_1\,K\big(f,(sm)^{-1/2}\big),$$
where $C_1$ is a constant depending at most on $t$ and $\mu s$. Theorem 6 implies the following corollary.
Corollary 1. Under the assumption of Theorem 6, if $f\in\mathcal{A}_1(\mathcal{D},B)$, then the error of the RPSGA($s$, $t$) satisfies
$$\|f-f_m\|\le C_1\,B\,(sm)^{-1/2}.$$

Remark 1. We remark that the RPSGA($s$, $t$) adds $s$ new elements at each iteration, while the RPGA adds only one element at each iteration, so the RPSGA($s$, $t$) can reduce the computational cost of the RPGA. Theorem 4 and Corollary 1 show that the performance of the RPSGA($s$, $t$) is as good as that of the OSGA($s$, $t$) in the sense of the rate of convergence. However, from a computational point of view, the OSGA($s$, $t$) is more expensive to implement since at each step it requires the evaluation of the orthogonal projection of $f$ onto the span of all selected elements. In contrast, the output of the RPSGA($s$, $t$) is the orthogonal projection of $f$ onto the one-dimensional space spanned by $\hat f_m$.
Remark 2. Although the convergence rates of the OSGA($s$, $t$) and RPSGA($s$, $t$) on $\mathcal{A}_1(\mathcal{D})$ are almost the same, see Theorem 4 and Corollary 1, the corresponding constants are different. The constant in Corollary 1 is not as good as that in Theorem 4. This is because Corollary 1 is derived from Theorem 6, which holds for any $f\in H$ and not just for $f\in\mathcal{A}_1(\mathcal{D})$. There is a trade-off between universality and accuracy; therefore, the constant is not so good. Nevertheless, our results still have some advantages. The range of $s$ in Corollary 1 is wider than that in Theorem 4, and the factor in Corollary 1 is slightly better than that in Theorem 4. It would be very interesting to derive the error of the RPSGA($s$, $t$) for $f\in\mathcal{A}_1(\mathcal{D})$ directly. This would help us to see if the constant can be improved.
In order to prove Theorem 6, we need the following two lemmas.
Lemma 1 ([13]). Assume a dictionary $\mathcal{D}$ has coherence $\mu$. Let $g_1,\ldots,g_s$ be distinct elements of $\mathcal{D}$ and $H_s:=\mathrm{span}\{g_1,\ldots,g_s\}$. Then, we have
$$\|P_{H_s}(r)\|^2\ \ge\ \frac{1}{1+(s-1)\mu}\sum_{i=1}^{s}|\langle r,g_i\rangle|^2.$$

Lemma 2 ([23]). Let , , , be fixed, and and be sequences of non-negative numbers satisfying the inequalities. Then, we have

Proof of Theorem 6: First, we show that for any
and any
, the inequality
holds for
Since
is dense in
, it is enough to prove (
2) for elements that are finite sums
with
and
. Let us fix
, and choose a representation for
, such that
Let be the projection of f onto . Note that can be approximated arbitrarily well by elements of the above form in .
Suppose q is such that with . Then, the above assumption on the sequence implies that and
We claim that the elements will be chosen among at the first iteration.
Indeed, for
, we have
For all
g distinct from
, we have
Our assumption implies Then, we obtain This implies that Thus, we do not pick any distinct from until we have chosen all
We denote
and
Since
is the orthogonal projection of
f onto span
, we have
and
According to the definition of
and the choice of
, we obtain
Denote
. We have
Combining (
4) with (
5), we have
It follows from (
6) that
is a decreasing sequence, and hence,
is also a decreasing sequence.
Now the proof is divided into two cases, according to whether or .
Case 1:
. Therefore
for all
, hence inequality (
2) holds for
Case 2:
. With (
3), we notice that
We proceed with a lower estimate for
. We consider the following quantity for
where
Applying Lemma 1, we have
In order to get the relation between
and
, we consider an arbitrary set
of distinct elements of
. Let
and
. Then
Using the choice of
, we have
and hence
Therefore, by (
8) and Lemma 1, we obtain
For any
, we turn to bound
. For
, we write
as
By setting
, we first bound
as follows
Since the sequence
has the property
we may use the simple inequality
to derive
Next, we bound
as follows
Furthermore, by (
7) and (
10)–(
12), we have
By (
6), (
9), and (
13), we get
Let
. Subtracting
from both sides in (
14), we obtain
where the constant
is defined in Theorem 6.
Case 2.1:
. In terms of the decreasing property of the sequence
, either
, or for some
we have that
. Then for
the arguments are as in Case 1. Applying Lemma 2 to the positive numbers in the sequence
with
,
,
,
and
, we obtain
Case 2.2:
. Taking
in inequality (
15), we get
Therefore
, that is,
, which gives (
2) because of monotonicity.
Now inequality (
2) has been proved completely. Then, we have
□
Proof of Corollary 1: For $f\in\mathcal{A}_1(\mathcal{D},B)$, taking $h=f$ in the definition of the $K$-functional gives $K(f,(sm)^{-1/2})\le B\,(sm)^{-1/2}$. Thus, we get the desired result from Theorem 6. □
3. The Rescaled Pure Super Greedy Learning Algorithms
In this section, we consider the application of the RPSGA(s, 1) to supervised learning. In this context, the RPSGA(s, 1) has its new form. We call it the Rescaled Pure Super Greedy Learning Algorithm (RPSGLA). The precise definition of the RPSGLA will be given later.
We first formulate the problem of supervised learning. Let the input space $X\subset\mathbb{R}^d$ be a compact subset and the output space $Y\subseteq\mathbb{R}$. Let $\rho$ be a Borel probability measure on $Z:=X\times Y$. The generalization error for a function $f:X\to Y$ is defined as
$$\mathcal{E}(f):=\int_{Z}(f(x)-y)^2\,d\rho.$$
For each input $x$ and output $y$, $(f(x)-y)^2$ is the error suffered from the use of $f$ as a model for the process producing $y$ from $x$. By integrating over $Z$ with respect to $\rho$, we average out the error over all pairs $(x,y)$.
Denote by $L^2_{\rho_X}$ the Hilbert space of the square integrable functions defined on $X$ with respect to the measure $\rho_X$, where $\rho_X$ is the marginal measure of $\rho$ on $X$, and
$$\|f\|_{\rho}:=\Big(\int_X|f(x)|^2\,d\rho_X\Big)^{1/2}.$$
The regression function $f_\rho$, which minimizes $\mathcal{E}(f)$ over all $f\in L^2_{\rho_X}$, is given by
$$f_\rho(x):=\int_Y y\,d\rho(y\,|\,x),$$
where $\rho(y\,|\,x)$ is the conditional distribution induced by $\rho$ at $x$. In the framework of supervised learning, $\rho$ is unknown and what we have in hand is a set of random samples $\mathbf{z}=\{(x_i,y_i)\}_{i=1}^{m}$; without loss of generality, we assume that $|y_i|\le M$ for a fixed $M>0$, drawn from the measure $\rho$ identically and independently. The task is to find a good approximation $f_{\mathbf{z}}$ of the regression function, which is derived from some learning algorithm. To measure the approximation ability of $f_{\mathbf{z}}$, we estimate the excess generalization error
$$\mathcal{E}(f_{\mathbf{z}})-\mathcal{E}(f_\rho)=\|f_{\mathbf{z}}-f_\rho\|_{\rho}^{2}.$$
In the design of the learning algorithm, we replace the generalization error $\mathcal{E}(f)$ with the empirical error
$$\mathcal{E}_{\mathbf{z}}(f):=\frac{1}{m}\sum_{i=1}^{m}(f(x_i)-y_i)^2.$$
We expect to find a good approximation of $f_\rho$ by minimizing $\mathcal{E}_{\mathbf{z}}(f)$ in a suitable way.
Given training data $\mathbf{z}$, the empirical inner product and empirical norm are defined as follows:
$$\langle f,g\rangle_m:=\frac{1}{m}\sum_{i=1}^{m}f(x_i)g(x_i),\qquad \|f\|_m:=\langle f,f\rangle_m^{1/2}.$$
With the definition of the empirical inner product and empirical norm, the empirical error can be represented as follows:
$$\mathcal{E}_{\mathbf{z}}(f)=\|f-y\|_m^{2},$$
where $y$ is regarded as the function taking the value $y_i$ at $x_i$.
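The identity between the empirical error and the squared empirical norm of $f-y$ can be checked directly on a toy sample (the sample itself and the candidate model are illustrative assumptions):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 5)                 # sample inputs x_1, ..., x_m
y = np.sin(2 * np.pi * x)                    # sample outputs y_1, ..., y_m

def emp_inner(u, v):
    """<u, v>_m = (1/m) sum u(x_i) v(x_i); u, v given by their values."""
    return np.mean(u * v)

def emp_norm(u):
    return np.sqrt(emp_inner(u, u))

f_vals = 0.5 * np.ones_like(x)               # a candidate model f, tabulated on x
emp_error = np.mean((f_vals - y) ** 2)       # empirical error E_z(f)
print(emp_error, emp_norm(f_vals - y) ** 2)  # the two quantities agree
```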
We consider the kernel-based greedy learning algorithms, which have been studied extensively in machine learning, see [13,16,18,20,22,24]. We will use continuous kernels, which are more general than the usual Mercer kernels, see [25,26]. The hypothesis spaces based on continuous kernels have been used widely in many fields of machine learning, see, for instance, Refs. [18,27,28]. In this way, a wider selection of the kernel offers more flexibility. We say a function $K$ defined on $X\times X$ is a continuous kernel if $K$ is continuous at every $(u,x)\in X\times X$. Let $K$ be a continuous and symmetric kernel. Assume that, in addition, $K$ is positive definite. Then, the space
$$H:=\mathrm{span}\{K(u,\cdot):\ u\in X\}$$
is a pre-Hilbert space equipped with the inner product
$$\Big\langle\sum_i a_iK(u_i,\cdot),\ \sum_j b_jK(v_j,\cdot)\Big\rangle_K:=\sum_{i,j}a_ib_jK(u_i,v_j).$$
Let $\mathcal{H}_K$ be the closure of $H$ with respect to the norm induced by the above inner product. Then, $\mathcal{H}_K$ is a reproducing kernel Hilbert space (RKHS) associated with the kernel $K$. The reproducing property is given by $\langle f,K(u,\cdot)\rangle_K=f(u)$ for $f\in\mathcal{H}_K$ and $u\in X$. For more details, one can see [25].
We define a data-independent hypothesis space as follows.
Definition 1. Define a data-independent hypothesis space with the associated norm. Note that the space is a special kind of $\mathcal{A}_1(\mathcal{D})$ space, which was introduced in [10], see Section 1.
It follows from the definition of $f_\rho$ that $\|f_\rho\|_{\infty}\le M$, so it is natural to restrict the approximation functions to $[-M,M]$. The following truncation operator has been used in the error analysis of learning algorithms for improving the learning rate, see [16,18,22].
Definition 2. The truncation operator $\pi_M$ is defined on the space of measurable functions as
$$\pi_M(f)(x):=\begin{cases} M, & f(x)>M,\\ f(x), & |f(x)|\le M,\\ -M, & f(x)<-M.\end{cases}$$
Given training data $\mathbf{z}$, we define the data-dependent hypothesis space as the span of the kernel sections $\{K(x_i,\cdot)\}_{i=1}^{m}$, and define the norm on it analogously. So, this space is a subspace of the data-independent one. We will choose an approximant for $f_\rho$ from the data-dependent space by the RPSGLA.
We now present the RPSGLA as Algorithm 1:

Algorithm 1 The RPSGLA
Input: training data $\mathbf{z}=\{(x_i,y_i)\}_{i=1}^{m}$, a kernel $K$, a parameter $s$, and a stopping criterion.
Step 1. Normalization: form the dictionary of kernel sections $\{K(x_i,\cdot)\}_{i=1}^{m}$, normalized in the empirical norm.
Step 2. Computation: let $f_0:=0$ and, for $k\ge 1$:
1: Denote $r_{k-1}:=y-f_{k-1}$. Let $g_{(k-1)s+1},\ldots,g_{ks}$ satisfy the inequality of the super selection step, with all inner products taken to be empirical.
2: Let $H_k:=\mathrm{span}\{g_{(k-1)s+1},\ldots,g_{ks}\}$ and $P_{H_k}$ denote the orthogonal projection operator onto $H_k$ with respect to the empirical inner product. With $\hat f_k:=f_{k-1}+P_{H_k}(r_{k-1})$ and $s_k:=\langle y,\hat f_k\rangle_m/\|\hat f_k\|_m^{2}$, define the next approximant to be $f_k:=s_k\hat f_k$.
3: If the stopping criterion is met, then break.
Output: the truncated estimator $\pi_M(f_k)$.
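On a toy regression problem, the RPSGLA can be sketched as follows: the dictionary consists of the kernel sections $K(x_i,\cdot)$ tabulated on the sample, all inner products are empirical, and the final output is truncated at $M$. The kernel width, sample size, $s$, and iteration count are illustrative assumptions, not values from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200
x = np.sort(rng.uniform(-1.0, 1.0, m))
y = np.cos(np.pi * x) + 0.1 * rng.standard_normal(m)
M = np.max(np.abs(y))                           # truncation level

Kmat = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)  # Gaussian kernel sections
Kmat /= np.sqrt(np.mean(Kmat ** 2, axis=0))     # normalize in the empirical norm

def rpsgla(y, Kmat, iters, s):
    f = np.zeros_like(y)
    for _ in range(iters):
        r = y - f
        c = Kmat.T @ r / len(y)                 # empirical inner products <r, g>_m
        idx = np.argsort(-np.abs(c))[:s]        # super selection step
        A = Kmat[:, idx]
        coef, *_ = np.linalg.lstsq(A, r, rcond=None)  # empirical projection of r
        f_hat = f + A @ coef
        s_k = np.mean(y * f_hat) / np.mean(f_hat ** 2)  # rescaling step
        f = s_k * f_hat
    return np.clip(f, -M, M)                    # truncation operator pi_M

fit = rpsgla(y, Kmat, iters=10, s=5)
print(np.mean((fit - y) ** 2))                  # empirical error after 10 iterations
```

Minimizing the Euclidean norm in the least-squares solve is equivalent to minimizing the empirical norm, since the two differ by the constant factor $1/m$.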
In the case of $s=1$, this algorithm coincides with the Rescaled Pure Greedy Learning Algorithm (RPGLA), which was studied in [24]. Here we concentrate on $s\ge 2$.
We recall the following condition, which has been widely used for error analysis, see, for instance, Refs. [18,20,26,27,29].
Definition 3. We say that a kernel K is a kernel with smoothness γ if there exists some constant such that the corresponding Hölder-type condition holds, where ⌊γ⌋ denotes the largest integer not exceeding γ and the ⌊γ⌋-th partial derivative of K is taken with respect to the variable x. We recall two types of kernels that are used widely in practice. They are kernels with smoothness γ for any γ > 0.
The first one is the Gaussian kernels:
$$K(u,x)=\exp\Big(-\frac{\|u-x\|^{2}}{\sigma^{2}}\Big),\qquad \sigma>0.$$
The second one is the polynomial kernels:
$$K(u,x)=(1+u\cdot x)^{n},\qquad n\in\mathbb{N}.$$
Here $u\cdot x$ is the Euclidean inner product of $u$ and $x$ in $\mathbb{R}^d$.
Let $C(X)$ denote the space of continuous functions defined on $X$.
Definition 4. For , we define the space to be the set of all functions f such that, for all m, there exist such that where denotes the uniform norm on , and the smallest constant for which this holds defines a norm for . Under this assumption on the regression function, we obtain the following convergence rate of the RPSGLA.
Theorem 7. Assume that and . Let K be a kernel with smoothness γ for any γ > 0. Choose . For , the output of the RPSGLA satisfies the following inequality with confidence : where and the constant c depends at most on , δ, K, and M.

Remark 3. If we take K as an infinitely smooth kernel, such as a Gaussian kernel, then when r is sufficiently close to 1/2, the convergence rate of the RPSGLA can be arbitrarily close to , which is the best learning rate one can obtain so far, see [18,30]. Since in this case the result of Theorem 7 is valid for any γ, the convergence rate of the RPSGLA can be arbitrarily close to when γ is sufficiently large. To see this, let and ; then the right-hand side of (17) tends to . Note that if we take a finitely smooth kernel, then we cannot obtain the above convergence rate. In this case, inequality (17) holds only for some fixed γ, but not for all γ.

Remark 4. We show that the efficiency of the RPSGLA is better than that of some existing greedy learning algorithms. The convergence rate of the RPSGLA is faster than that of the Orthogonal Super Greedy Learning Algorithm (OSGLA) in [13], where the rate was derived. Our convergence rate is also faster than the convergence rates of the Orthogonal Greedy Learning Algorithm (OGLA) and the Relaxed Greedy Learning Algorithm (RGLA) in [16] and the convergence rate of the RGLA in [20]. Additionally, our convergence rate is almost the same as that of the OGLA in [18] and the RPGLA in [24]. However, the complexity of the RPSGLA is smaller than that of the OGLA and the RPGLA. We will illustrate this in Section 5 and Section 6. On the other hand, since greedy learning is a large field, there are other greedy learning procedures that are quite different from ours, see [17,19,21].

Remark 5. Kernel-based greedy algorithms can be used to solve different problems. We only mention some typical works here; moreover, we focus on the approximation problems on RKHS.
In [31], the authors proposed kernel-based greedy algorithms to approximate non-linear vectorial functions and derived the rate . In [7,32], kernel-based greedy algorithms were used to approximate a linear functional defined on an RKHS. The authors proved that for the square-integrable functions, the convergence rate can attain . Roughly speaking, the convergence rate of greedy algorithms for functional approximation is similar to that of function approximation, while for the regression problem, one can obtain a faster convergence rate.

4. Error Analysis of the RPSGLA
In this section, we prove Theorem 7. The proof is divided into five parts: the error decomposition strategy, the estimate of the sample error, the estimate of the hypothesis error, the estimate of the approximation error, and the final synthesis. Theorem 7 is proved by assembling the results of the error estimates.
4.1. Error Decomposition Strategy
Before we show the error decomposition strategy of the error analysis, we construct a stepping-stone function
as follows. As
, there exists a
such that
Define
,
where
and
Lemma 3. Let be defined in (19). Then we have Proof. By the definition of
, we have
then the conclusion of Lemma 3 follows from the facts
and
□
With at hand, we can give an upper bound of as follows.
Proposition 1. Let be defined in Algorithm 1. Then, for the in (19), we have the following error decomposition, where the three terms are known as the sample error, the approximation error, and the hypothesis error, respectively, in learning theory.

4.2. Estimate of Sample Error
In this subsection, we will bound the sample error
. We set
and
Then, the sample error can be written as .
The bound of has been proved in [20] by using the one-sided Bernstein inequality and the inequality .
Proposition 2. For any , with confidence at least , we have

Since the function changes with the sample , we need to obtain a uniform upper bound of .
We should consider the data-dependent space
To estimate the capacity of , we need the concept of the empirical covering number.
Definition 5. Let E be a metric space with metric d and F be a subset of E. For any , the covering number of F with respect to ϵ and d is defined as the minimal number of balls of radius ϵ whose union covers F, that is,where Definition 6. ( empirical covering number) Let be a set of functions on X, and let Set for arbitrary . The empirical covering number of is defined bywhere the metric We will use the following result on the capacity of the unit ball .
Lemma 4 ([
30]).
Let X be a compact subset of and K be a kernel with some . Then, there exist an exponent p, , and a constant , for arbitrary , such that, where

Now we recall the concentration inequality from [
29].
Lemma 5. Assume that there are constants and such that and , for every . If, for arbitrary , the inequality holds for some and , then there exists a constant depending only on p such that for any , with probability at least , there holds, where

Proposition 3. If K is a kernel with smoothness γ, then for any , with confidence at least , we have

Proof. From the obvious inequalities
and
, we get the inequalities
and
For
we have
For any
, it follows that
By Lemma 4, we have
where
is a constant independent of
.
Applying Lemma 5 with
,
and
, then for any
and any
, the inequality
holds with confidence
, where
and
are constants depending at most on
d,
X, and the kernel
K and
It follows from the definition of
in Algorithm 1 that
. Then, there exists a constant
C depending at most on
d,
X,
M, and kernel
K, such that
□
4.3. Estimate of Hypothesis Error
In this subsection, we give an error estimate for . For this purpose, we need the following two lemmas.
The first one is the Hoeffding inequality, which was established by Wassily Hoeffding in 1963.
Lemma 6 ([33]). Let $X_1,\ldots,X_n$ be independent random variables bounded by the interval $[0,1]$. We define the empirical mean of these variables by
$$\overline{X}:=\frac{1}{n}(X_1+\cdots+X_n).$$
Then, the following inequality holds for any given $t>0$:
$$\mathbb{P}\big(\overline{X}-\mathbb{E}\,\overline{X}\ge t\big)\le\exp\big(-2nt^{2}\big).$$
When $X_1,\ldots,X_n$ are strictly bounded by the intervals $[a_i,b_i]$, the generalization of the above inequality holds for any given $t>0$:
$$\mathbb{P}\big(\overline{X}-\mathbb{E}\,\overline{X}\ge t\big)\le\exp\Big(-\frac{2n^{2}t^{2}}{\sum_{i=1}^{n}(b_i-a_i)^{2}}\Big).$$
The second one is an immediate consequence of Theorem 6.
Equip with the empirical inner product. Then H is a Hilbert space. Given training data , is a dictionary of H. We denote its coherence by
Lemma 7. Let K be a kernel with . For any there exists such that , . We define Then, for any and any and , the error of the RPSGLA satisfies Proof. According to (
2), for any
and any
, the inequality
holds for
Given training data
, since
then, combining (
20) with (
21), we get the desired result. □
Proposition 4. For any , with confidence at least , we have Proof. Applying Lemma 7 with
, we have
Since
changes with training data
, we should find its relation with
We know that
We also have that
and
Based on Lemma 6, for
and any
i, we have
By setting
, we have
. From (
23) and (
24), with the confidence
, we have
By the definition of
, we observe that the following inequality
holds with the confidence
Combining (
22), (
25) with Lemma 3, with the confidence at least
, we have
□
4.4. Estimate of Approximation Error
Finally, we estimate the approximation error.
Proposition 5. If , then
Proof. From the definition of
and (
16), there holds:
For
satisfying (
18), and
, from Theorem 2.2 in [
16], we obtain
□
4.5. Proof of Theorem 7
Now we prove Theorem 7.
Proof of Theorem 7: Assembling the results in Propositions 1–5, we have that the inequality
holds with confidence at least
.
Therefore,
holds with confidence at least
, where
This completes the proof of Theorem 7.
□
6. Discussion
The experiments in Section 5 directly show the superiority of the RPSGA(s, t). In this section, we explain why these results are achieved. By using the super selection step and one-dimensional optimization together for the first time, we obtain the simplest good greedy algorithm so far. We provide more details below. First, we recall three other greedy algorithms: the Relaxed Greedy Algorithm (RGA), the Greedy Algorithm with Free Relaxation (GAFR), and the Rescaled Relaxed Greedy Algorithm (RRGA).
In [10], the RGA was defined by the following steps.
RGA($H$, $\mathcal{D}$):
Step 0: Define $G_0:=0$.
Step $m$ ($m\ge 1$):
- If $f=G_{m-1}$, stop the algorithm and define $G_k=G_{m-1}$ for $k\ge m$.
- If $f\ne G_{m-1}$, choose an element $g_m\in\mathcal{D}$ such that
$$\langle f-G_{m-1},g_m\rangle=\sup_{g\in\mathcal{D}}\langle f-G_{m-1},g\rangle.$$
For $m\ge 1$, define the next approximant to be
$$G_m:=\Big(1-\frac{1}{m}\Big)G_{m-1}+\frac{1}{m}\,g_m,$$
and proceed to Step $m+1$.
In [10], the authors proved that the RGA converges only for target elements from $\mathcal{A}_1(\mathcal{D})$ and derived the following convergence rate for the RGA.

Theorem 8. Let $\mathcal{D}$ be an arbitrary dictionary in $H$. Then, for each $f\in\mathcal{A}_1(\mathcal{D},1)$, taking the approximant as in the RGA, we have the following convergence rate:
$$\|f-G_m\|\le C\,m^{-1/2}.$$
9], the GAFR was defined as follows:
GAFR(H, :
Step 0: Define .
Stepm:
- -
If , stop the algorithm and define for .
- -
If
, choose an element
such that
With
define the next approximant to be
and proceed to Step
.
In [9], the RRGA was defined as follows:
RRGA($H$, $\mathcal{D}$):
Step 0: Define $G_0:=0$.
Step $m$ ($m\ge 1$):
- If $f=G_{m-1}$, stop the algorithm and define $G_k=G_{m-1}$ for $k\ge m$.
- If $f\ne G_{m-1}$, choose an element $g_m\in\mathcal{D}$ such that
$$|\langle f-G_{m-1},g_m\rangle|=\sup_{g\in\mathcal{D}}|\langle f-G_{m-1},g\rangle|.$$
With
$$\lambda_m:=\operatorname*{arg\,min}_{\lambda\in\mathbb{R}}\|f-(G_{m-1}+\lambda g_m)\|,\qquad s_m:=\operatorname*{arg\,min}_{s\in\mathbb{R}}\|f-s(G_{m-1}+\lambda_m g_m)\|,$$
define the next approximant to be
$$G_m:=s_m(G_{m-1}+\lambda_m g_m),$$
and proceed to Step $m+1$.
In [
9], the authors derived the following convergence rate for the GAFR and RRGA.
Theorem 9. Let $\mathcal{D}$ be an arbitrary dictionary in $H$. Then, for each $f\in\mathcal{A}_1(\mathcal{D},B)$ the GAFR and RRGA have the following convergence rate:
$$\|f-G_m\|\le C\,B\,m^{-1/2}.$$

From Theorems 2–4, Corollary 1, and Theorems 8 and 9, for each $f\in\mathcal{A}_1(\mathcal{D},B)$, the OGA, RPGA, RGA, GAFR, RRGA, OSGA($s$, $t$), and RPSGA($s$, $t$) all have a convergence rate of order $m^{-1/2}$.
The rate $m^{-1/2}$ is optimal, and the results show that these algorithms have almost identical error performance. We call these types of greedy algorithms Good Greedy Algorithms (GGA). Therefore, we just compare the complexity and execution time of the GGA.
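The nearly identical error decay of the GGA can be observed on a small self-contained comparison of two of them, the OGA and the RPGA, on the same random dictionary. The setup is an illustrative assumption, not the experiment of Section 5.

```python
import numpy as np

def oga(f, D, m):
    idx, G = [], np.zeros_like(f)
    for _ in range(m):
        r = f - G
        j = int(np.argmax(np.abs(D.T @ r)))
        if j not in idx:
            idx.append(j)
        A = D[:, idx]
        coef, *_ = np.linalg.lstsq(A, f, rcond=None)  # full m-dimensional re-fit
        G = A @ coef
    return G

def rpga(f, D, m):
    fm = np.zeros_like(f)
    for _ in range(m):
        r = f - fm
        c = D.T @ r
        j = int(np.argmax(np.abs(c)))
        f_hat = fm + c[j] * D[:, j]
        fm = (f @ f_hat) / (f_hat @ f_hat) * f_hat    # one 1-D optimization
    return fm

rng = np.random.default_rng(2)
D = rng.standard_normal((40, 200))
D /= np.linalg.norm(D, axis=0)
f = D[:, :30] @ rng.standard_normal(30)
res = {m: (np.linalg.norm(f - oga(f, D, m)),
           np.linalg.norm(f - rpga(f, D, m))) for m in (5, 20, 80)}
for m, (e_oga, e_rpga) in res.items():
    print(m, e_oga, e_rpga)
```

Both errors decay with the iteration number, while the RPGA avoids the growing least-squares solve of the OGA, which is the complexity gap quantified below.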
In terms of scope of application, the RGA converges only for the target elements from . Obviously, the RPSGA(s, t) is better than the RGA.
From the viewpoint of complexity, the OGA has to solve an $m$-dimensional optimization problem
$$\min_{c_1,\ldots,c_m\in\mathbb{R}}\Big\|f-\sum_{i=1}^{m}c_ig_i\Big\|,$$
where $g_1,\ldots,g_m$ are the selected elements. The GAFR has to solve a two-dimensional optimization problem
$$\min_{w,\lambda\in\mathbb{R}}\|f-(wG_{m-1}+\lambda g_m)\|.$$
The RRGA employs two one-dimensional optimization problems,
$$\min_{\lambda\in\mathbb{R}}\|f-(G_{m-1}+\lambda g_m)\|\quad\text{and}\quad\min_{s\in\mathbb{R}}\|f-s(G_{m-1}+\lambda_m g_m)\|,$$
where $\lambda_m$ is the solution of the first problem. However, the RPGA only needs to solve a one-dimensional optimization problem
$$\min_{s\in\mathbb{R}}\|f-s\hat f_m\|.$$
Then, the RPGA is simpler than the OGA, RGA, GAFR, and RRGA. So the RPGA can save much execution time. According to the definition, the RPSGA(
s,
t) selects more than one element from a dictionary in each iteration step and hence reduces the computational burden of the RPGA (taking
s=1 in the RPSGA(
s,
t)), especially when the RPGA and RPSGA(
s,
t) run with noisy data, see
Figure 6. Combining with the empirical test results in
Table 1 and
Table 2 in
Section 5, it can be easily found that from the viewpoints of error performance and execution time, the RPSGA(
s,
t) is more effective than the RPGA. As far as we know, the only commonly used super greedy algorithm is the OSGA(s, t), and the OSGA(s, t) always needs to solve an sm-dimensional optimization problem, while the RPSGA(s, t) only needs to solve an s-dimensional optimization problem and a one-dimensional optimization problem.
Table 2 also shows the robustness and effectiveness of the RPSGA(
s,
t) compared with the OSGA(
s,
t). Therefore, the efficiency of the RPSGA(
s,
t) is the best among the GGA.
7. Conclusions and Further Studies
We propose a new type of super greedy algorithm—the WRPSGA. The RPSGA(
s,
t) is simpler than the OGA, RGA, RRGA, RPGA, and OSGA(
s,
t) from the viewpoint of computational complexity. The convergence rate of RPSGA(
s,
t) on
is optimal. Based on this result, we design the RPSGLA for solving kernel-based regression problems in supervised learning. When the kernel is infinitely smooth, we derive a significantly faster learning rate that can be arbitrarily close to the best rate
under some mild assumptions of the regression function. The efficiency of the RPSGLA is better than some existing greedy learning algorithms. For instance, the convergence rate of the RPSGLA is faster than the OSGLA in [
13], RGLA in [
20], and the OGLA and RGLA in [
16]. Additionally, our convergence rate is almost the same as that of the OGLA in [
18]. However, the complexity of the RPSGLA is smaller than the OGLA. We test the performance of the RPSGLA by numerical experiments. Our simulation results show that the RPSGLA is very efficient for regression.
In addition to the applications in machine learning, greedy algorithms can also be used to solve convex optimization problems, which are quite different from the approximation problems, see Refs. [23,34,35,36]. We formulate the problem of convex optimization as follows.
Let $H$ be a Hilbert space with an inner product $\langle\cdot,\cdot\rangle$, and let $E$ be a convex function on $H$ which is called the objective function. We assume that $E$ has a Fréchet derivative $E'(x)$ at each point $x\in H$, that is,
$$\lim_{\|h\|\to 0}\frac{|E(x+h)-E(x)-\langle E'(x),h\rangle|}{\|h\|}=0,$$
where $\|\cdot\|$ is the norm induced by $\langle\cdot,\cdot\rangle$. We want to find an approximate solution to the problem
$$\inf_{x\in\Omega}E(x),\qquad(26)$$
where $\Omega$ is a bounded convex subset of $H$.
For a dictionary $\mathcal{D}$ of $H$, we denote by $\Sigma_m(\mathcal{D})$ the set of all at most $m$-term linear combinations with respect to $\mathcal{D}$. We will develop a new greedy algorithm to produce an approximation of the minimizer of (26). This algorithm is a modification of the RPSGA; we denote it as the RPSGA(co). We present the definition of the RPSGA(co) as follows:
RPSGA(co):
Step 0: Define . If , stop the algorithm and define .
Step m: Assume that has been defined and . Denote
. Let
satisfy the inequality
Let
. With
define the next approximant to be
where
If , then stop the algorithm and define , otherwise go to Step .
We will impose the following assumptions on the objective function E.
Condition 0: $E$ has a Fréchet derivative $E'(x)$ at each point $x$ in $\Omega$, and $\sup_{x\in\Omega}\|E'(x)\|<\infty$.
Uniform Smoothness: There are constants $C_q>0$ and $1<q\le 2$ such that, for all $x$ and $y$ in $\Omega$,
$$E(y)-E(x)-\langle E'(x),y-x\rangle\ \le\ C_q\|y-x\|^{q}.$$
It is known from [23] that the OGA(co) can be used effectively to solve convex optimization problems. Since the RPSGA($s$, $t$) is more efficient than the OGA, the RPSGA(co) can solve these problems more efficiently. For the RPSGA(co), by estimating the decay of the optimization error as $m\to\infty$, we will obtain a convergence rate that is independent of the dimension of the underlying space $H$. So, the curse of dimensionality for problem (26) can be overcome by using the RPSGA(co). This work will be of great interest in practical applications. It is also important to compare the efficiency of the RPSGA(co) with other greedy optimization algorithms in future work.