ReLU Network with Bounded Width Is a Universal Approximator in View of an Approximate Identity
Abstract
1. Introduction
Notations
2. Main Result
3. Proof of Theorem 1
3.1. One-Dimensional Input
3.2. General-Dimensional Input
4. Conclusions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Moon, S. ReLU Network with Bounded Width Is a Universal Approximator in View of an Approximate Identity. Appl. Sci. 2021, 11, 427. https://doi.org/10.3390/app11010427