Article

ReLU Network with Bounded Width Is a Universal Approximator in View of an Approximate Identity

Department of Mathematics, Kyungpook National University, Daegu 41566, Korea
Appl. Sci. 2021, 11(1), 427; https://doi.org/10.3390/app11010427
Submission received: 19 November 2020 / Revised: 29 December 2020 / Accepted: 31 December 2020 / Published: 4 January 2021
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Deep neural networks have shown very successful performance in a wide range of tasks, but a theory of why they work so well is still in its early stages. Recently, the expressive power of neural networks, which is important for understanding deep learning, has received considerable attention. Classic results, provided by Cybenko, Barron, and others, state that a network with a single hidden layer and a suitable activation function is a universal approximator. A few years ago, researchers began to study how width affects the expressiveness of neural networks, i.e., universal approximation theorems for deep neural networks with the Rectified Linear Unit (ReLU) activation function and bounded width. Here, we show how any continuous function on a compact subset of $\mathbb{R}^{n_{in}}$, $n_{in} \in \mathbb{N}$, can be approximated by a ReLU network whose hidden layers have at most $n_{in}+5$ nodes, in view of an approximate identity.

1. Introduction

Over the past several years, deep neural networks have achieved state-of-the-art performance in a wide range of tasks such as image recognition/segmentation and machine translation (see the review article [1] and recent book [2] for more background). In spite of their outstanding successes, many aspects of why they work so well are still not well understood. Some works to date have focused on the problem of explaining and quantifying the expressivity of a deep neural network [3,4,5].
This line of research on the expressive power of neural networks, i.e., their ability to approximate functions, dates back at least to the work of Cybenko [6], which showed that a fully connected neural network with a sigmoid activation function and a single hidden layer can approximate any continuous function on a bounded domain with arbitrarily small error. Barron [7], Hornik et al. [8], and Funahashi [9] generalized the sigmoid to a large class of activation functions while retaining universal approximation. However, these results do not control the number of nodes in the hidden layer; in fact, the required number of nodes can grow exponentially in the input dimension.
Motivated by the recent outstanding performance of deep neural networks, a substantial body of literature has studied their expressive power theoretically. In 2011, Delalleau and Bengio [10] gave a family of functions that can be represented much more efficiently by a deep network than by a neural network with one hidden layer. Mhaskar and Poggio [4] provided conditions under which deep convolutional neural networks perform much better in function approximation than a neural network with one hidden layer. Eldan and Shamir [11] constructed a three-layer network that cannot be realized by any two-layer network unless the number of nodes in its hidden layer exceeds an exponential bound.
In practice, the Rectified Linear Unit (ReLU) activation function is the most popular choice [12]. For this reason, most recent results on universal approximation theory concern the ReLU network [5,13,14,15,16,17,18,19,20]. Cohen et al. [13] provided a deep convolutional neural network with the ReLU activation function that cannot be realized by a shallow network unless the number of nodes in its hidden layer exceeds an exponential bound.
In 2017, Lu et al. [14] presented a universal approximation theorem for deep neural networks with ReLU activation functions and hidden layers of bounded width; since then, the expressive power of depth in bounded-width ReLU networks has received a lot of attention. Hanin proved a universal approximation theorem using convex function theory in [15,16], and Lin and Jegelka showed an analogous theorem for residual networks with one-neuron hidden layers in [19].
Ohn and Kim [18] showed that the ReLU activation function can be represented by a linear combination of piecewise linear activation functions and, using this fact, extended many results of [5,17,20] to any continuous piecewise linear activation function. However, their results do not concern bounded width.
Here, we study a universal approximation theorem for ReLU networks in view of an approximate identity by constructing a network having $n_{in}k^{n_{in}}+1$ hidden layers with width at most $n_{in}+5$. As a measure of approximation error, we adopt the supremum norm and the $L^p$ distance.
The remainder of the paper is organized as follows: in the next subsection, we introduce some notations. Section 2 and Section 3 are devoted to our theorem statement and its proof, respectively. Finally, Section 4 offers a conclusion.

Notations

Let $\mathbb{Z}_k = \{1, 2, \ldots, k\}$, and let $\mathbf{x} = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$ be the input. A network architecture $(l, \mathbf{n})$ consists of the number of hidden layers $l \in \mathbb{N}$ and a width vector $\mathbf{n} = (n_0, n_1, \ldots, n_{l+1}) \in \mathbb{N}^{l+2}$ with $n_0 = n_{in}$ [20]. A neural network with network architecture $(l, \mathbf{n})$ is then any function of the form
$$ f_{(l,\mathbf{n})} : \mathbb{R}^{n_0} \to \mathbb{R}^{n_{l+1}}, \qquad f_{(l,\mathbf{n})}(\mathbf{x}) = L_{l+1} \circ \mathrm{ReLU} \circ L_l \circ \mathrm{ReLU} \circ \cdots \circ L_2 \circ \mathrm{ReLU} \circ L_1(\mathbf{x}), $$
where the
$$ L_i : \mathbb{R}^{n_{i-1}} \to \mathbb{R}^{n_i}, \quad i = 1, \ldots, l+1, $$
are affine maps, $\mathrm{ReLU}(x) = \max\{x, 0\}$, and
$$ \mathrm{ReLU}(x_1, x_2, \ldots, x_m) = (\mathrm{ReLU}(x_1), \mathrm{ReLU}(x_2), \ldots, \mathrm{ReLU}(x_m)). $$
Here, we set $n_{l+1} = 1$. We then define the space of network functions with a given network architecture as
$$ \mathcal{F}(l, \mathbf{n}) := \{ f_{(l,\mathbf{n})} : f_{(l,\mathbf{n})}(\mathbf{x}) = L_{l+1} \circ \mathrm{ReLU} \circ \cdots \circ L_1(\mathbf{x}), \ (l, \mathbf{n}) \in \mathbb{N} \times \mathbb{N}^{l+2} \}. $$
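For concreteness, the following short sketch (a minimal illustration in NumPy, not code from this paper; the weight matrices and biases are random placeholders) evaluates a network $f_{(l,\mathbf{n})}$ exactly as defined above, alternating affine maps with the componentwise ReLU.

```python
import numpy as np

def relu(v):
    # Componentwise ReLU(v) = max(v, 0).
    return np.maximum(v, 0.0)

def network(x, affine_maps):
    """Evaluate f_(l,n)(x) = L_{l+1} o ReLU o L_l o ... o ReLU o L_1(x).

    affine_maps is a list of (W, b) pairs; the i-th pair realizes
    L_i(v) = W @ v + b, with shapes matching the width vector n.
    """
    v = np.asarray(x, dtype=float)
    for W, b in affine_maps[:-1]:
        v = relu(W @ v + b)          # hidden layers: affine map followed by ReLU
    W, b = affine_maps[-1]
    return W @ v + b                 # output layer L_{l+1} has no ReLU

# Example architecture (l, n) = (2, (3, 5, 5, 1)): two hidden layers of width 5.
rng = np.random.default_rng(0)
maps = [(rng.standard_normal((5, 3)), rng.standard_normal(5)),
        (rng.standard_normal((5, 5)), rng.standard_normal(5)),
        (rng.standard_normal((1, 5)), rng.standard_normal(1))]
print(network(np.array([0.1, -0.2, 0.3]), maps))
```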
Our network of interest is a feedforward neural network, i.e., an artificial neural network in which the connections between nodes do not form a cycle. That is, in this network, the information moves in a forward direction, from the input nodes to the output nodes through the hidden nodes (if any).
For later discussion, we introduce the positive part $f_+ = \max(f, 0)$ and the negative part $f_- = \max(-f, 0)$ of a function $f$. Then, $f_+ - f_-$ is equal to $f$.

2. Main Result

Here is our universal approximation theorem for a ReLU network with a bounded width:
Theorem 1.
For any $M > 0$ and $n_{in} \in \mathbb{N}$, let $f : [-M, M]^{n_{in}} \to \mathbb{R}$ be a continuous function. For any $\epsilon > 0$, there exists $k \in \mathbb{N}$ such that some $f_{(l,\mathbf{n})} \in \mathcal{F}\big(n_{in}k^{n_{in}}+1, (n_{in}, n_{in}+5, \ldots, n_{in}+5, 2, 1)\big)$ satisfies
$$ \sup_{\mathbf{x} \in [-M, M]^{n_{in}}} \big| f(\mathbf{x}) - f_{(l,\mathbf{n})}(\mathbf{x}) \big| < \epsilon. $$
To prove this theorem, we explicitly construct a ReLU network with bounded width in Section 3.2. In our construction, each of the first $n_{in}+3$ nodes in each hidden layer is connected to only one node of the previous layer, and each of the remaining two nodes (the $(n_{in}+4)$-th and $(n_{in}+5)$-th) in each hidden layer is connected to fewer than four nodes of the previous layer (for more details, see Section 3.2). This means that a ReLU network that is not fully connected can still do the job.
We can also show that there is a feedforward neural network that approximates any $f \in L^p(\mathbb{R}^{n_{in}})$:
Corollary 1.
Let $n_{in} \in \mathbb{N}$ and $f \in L^p(\mathbb{R}^{n_{in}})$, $1 \le p < \infty$. For any $\epsilon > 0$, there exists $k \in \mathbb{N}$ such that some $f_{(l,\mathbf{n})} \in \mathcal{F}\big(n_{in}k^{n_{in}}+1, (n_{in}, n_{in}+5, \ldots, n_{in}+5, 2, 1)\big)$ satisfies
$$ \| f - f_{(l,\mathbf{n})} \|_p < \epsilon. $$
Proof. 
For any $\epsilon > 0$, there is a continuous function $f_c$ on $\mathbb{R}^{n_{in}}$ with compact support in $[-M, M]^{n_{in}}$ such that
$$ \int_{\mathbb{R}^{n_{in}}} | f(\mathbf{x}) - f_c(\mathbf{x}) |^p \, d\mathbf{x} < \frac{\epsilon^p}{2^p}. $$
Using Theorem 1, we can find $k \in \mathbb{N}$ such that some ReLU network architecture $(l, \mathbf{n})$ with represented function $f_{(l,\mathbf{n})}$ satisfies
$$ \sup_{\mathbf{x} \in [-M, M]^{n_{in}}} | f_c(\mathbf{x}) - f_{(l,\mathbf{n})}(\mathbf{x}) | < \frac{\epsilon}{2^{n_{in}+1} M^{n_{in}}}. $$
Therefore, we have
$$ \| f - f_{(l,\mathbf{n})} \|_p \le \| f - f_c \|_p + \| f_c - f_{(l,\mathbf{n})} \|_p < \epsilon. \qquad \square $$

3. Proof of Theorem 1

To convey our idea more clearly, we first construct a ReLU network with $n_{in} = 1$ and $k+1$ hidden layers with width at most 6. Then, we construct a ReLU network for general $n_{in}$.

3.1. One-Dimensional Input

Let $L_{1,temp} : \mathbb{R} \to \mathbb{R}^3$ and $L_{2,temp} : \mathbb{R}^3 \to \mathbb{R}$ with $L_{1,temp}(x) = (x-1, x, x+1)$ and $L_{2,temp}(x_1, x_2, x_3) = x_1 - 2x_2 + x_3$. Then, the function $f_{(1,(1,3,1))}$ represented by $L_{2,temp} \circ \mathrm{ReLU} \circ L_{1,temp}$ is
$$ f_{(1,(1,3,1))}(x) = L_{2,temp} \circ \mathrm{ReLU} \circ L_{1,temp}(x) = \mathrm{ReLU}(x-1) - 2\,\mathrm{ReLU}(x) + \mathrm{ReLU}(x+1) = \begin{cases} 0 & \text{if } |x| \ge 1, \\ 1 - |x| & \text{if } -1 < x < 1. \end{cases} $$
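As a quick sanity check (a minimal sketch, not part of the paper), the following NumPy snippet evaluates $\mathrm{ReLU}(x-1) - 2\,\mathrm{ReLU}(x) + \mathrm{ReLU}(x+1)$ on a grid and confirms that it coincides with the hat function $\max(0, 1-|x|)$.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def tent(x):
    # f_(1,(1,3,1))(x) = ReLU(x - 1) - 2 ReLU(x) + ReLU(x + 1)
    return relu(x - 1.0) - 2.0 * relu(x) + relu(x + 1.0)

x = np.linspace(-3, 3, 2001)
assert np.allclose(tent(x), np.maximum(0.0, 1.0 - np.abs(x)))  # the B-spline / hat shape
```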
We note that, for the (B-spline) function $g(x) = L_{2,temp} \circ \mathrm{ReLU} \circ L_{1,temp}(x)$ and $\lambda > 0$, the family $g_\lambda(x) = \lambda^{-1} g(\lambda^{-1} x)$ is an approximate identity: for any continuous function $f$ on $[-M, M]$, $f * g_\lambda$ converges to $f$ uniformly. Hence, for any $\epsilon > 0$, there is $\lambda > 0$ such that $\sup_{-M \le x \le M} | f * g_\lambda(x) - f(x) | < \epsilon/2$. Approximating the convolution by a Riemann sum, there is $k \in \mathbb{N}$ such that
$$ \sup_{-M \le x \le M} \left| f * g_\lambda(x) - \sum_{j=1}^{k} \frac{2M}{k}\, f\!\left(-M + \frac{2Mj}{k}\right) g_\lambda\!\left(x - \left(-M + \frac{2Mj}{k}\right)\right) \right| < \epsilon/2, $$
and thus
$$ \sup_{-M \le x \le M} \left| f(x) - \sum_{j=1}^{k} \frac{2M}{k}\, f\!\left(-M + \frac{2Mj}{k}\right) g_\lambda\!\left(x - \left(-M + \frac{2Mj}{k}\right)\right) \right| < \epsilon. $$
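These two displays can also be checked numerically. The sketch below is purely illustrative (the choices $f(x) = \sin(\pi x/2)$, $M = 2$, and the coupling $\lambda = 10M/k$ are assumptions made here, not taken from the paper); it forms the Riemann sum and reports the resulting sup error on $[-M, M]$.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def g(x):                                   # hat function: ReLU(x-1) - 2 ReLU(x) + ReLU(x+1)
    return relu(x - 1.0) - 2.0 * relu(x) + relu(x + 1.0)

def g_lam(x, lam):                          # approximate identity g_lambda(x) = g(x/lambda)/lambda
    return g(x / lam) / lam

def riemann_approx(f, x, M, k, lam):
    # sum_{j=1}^k (2M/k) * f(-M + 2Mj/k) * g_lambda(x - (-M + 2Mj/k))
    y = -M + 2.0 * M * np.arange(1, k + 1) / k
    return (2.0 * M / k) * np.sum(f(y)[:, None] * g_lam(x[None, :] - y[:, None], lam), axis=0)

M = 2.0
f = lambda t: np.sin(np.pi * t / 2.0)       # illustrative f, continuous on [-M, M]
x = np.linspace(-M, M, 4001)
for k in (20, 100, 500):
    lam = 10.0 * M / k                      # kernel width shrinks as the grid is refined
    err = np.max(np.abs(f(x) - riemann_approx(f, x, M, k, lam)))
    print(f"k = {k:3d}, lambda = {lam:5.2f}, sup error on [-M, M] ~ {err:.4f}")
```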
Now, for a fixed $k \in \mathbb{N}$, we can construct a network architecture $(l, \mathbf{n}) = (k+1, (1, 6, 6, \ldots, 6, 2, 1))$ with ReLU activations, input and output dimension 1, and $k+1$ hidden layers of width at most 6, such that
$$ \sup_{-M \le x \le M} \big| f(x) - f_{(l,\mathbf{n})}(x) \big| < \epsilon. $$
Let $L_{1,1} : \mathbb{R} \to \mathbb{R}^6$, $L_{1,j} : \mathbb{R}^6 \to \mathbb{R}^6$ for $j = 2, \ldots, k$, $L_{k+1} : \mathbb{R}^6 \to \mathbb{R}^2$, and $L_{k+2} : \mathbb{R}^2 \to \mathbb{R}$ with
$$ L_{1,1}(x) = \begin{pmatrix} x + M \\ \lambda^{-1}\big(x - (-M + 2M/k)\big) - 1 \\ \lambda^{-1}\big(x - (-M + 2M/k)\big) \\ \lambda^{-1}\big(x - (-M + 2M/k)\big) + 1 \\ 0 \\ 0 \end{pmatrix}, \qquad L_{1,j}(\mathbf{x}) = \begin{pmatrix} x_1 \\ \lambda^{-1}\big(x_1 - M - (-M + \tfrac{2Mj}{k})\big) - 1 \\ \lambda^{-1}\big(x_1 - M - (-M + \tfrac{2Mj}{k})\big) \\ \lambda^{-1}\big(x_1 - M - (-M + \tfrac{2Mj}{k})\big) + 1 \\ x_5 + \tfrac{2M}{k\lambda} f_+\big(-M + \tfrac{2M(j-1)}{k}\big)(x_2 - 2x_3 + x_4) \\ x_6 + \tfrac{2M}{k\lambda} f_-\big(-M + \tfrac{2M(j-1)}{k}\big)(x_2 - 2x_3 + x_4) \end{pmatrix} \ \text{for } j = 2, \ldots, k, $$
$$ L_{k+1}(\mathbf{x}) = \begin{pmatrix} x_5 + \tfrac{2M}{k\lambda} f_+(M)(x_2 - 2x_3 + x_4) \\ x_6 + \tfrac{2M}{k\lambda} f_-(M)(x_2 - 2x_3 + x_4) \end{pmatrix}, \quad \text{and} \quad L_{k+2}(\mathbf{x}) = x_1 - x_2. $$
(The index 1 in the subscript $(1, j)$ is unnecessary, but it is included because it suggests how to generalize to input of arbitrary dimension.) We note that
$$ L_{1,2} \circ \mathrm{ReLU} \circ L_{1,1}(x) = \begin{pmatrix} x + M \\ \lambda^{-1}\big(x - (-M + \tfrac{4M}{k})\big) - 1 \\ \lambda^{-1}\big(x - (-M + \tfrac{4M}{k})\big) \\ \lambda^{-1}\big(x - (-M + \tfrac{4M}{k})\big) + 1 \\ \tfrac{2M}{k\lambda} f_+\big(-M + \tfrac{2M}{k}\big)\, g\big(\lambda^{-1} x - \lambda^{-1}(-M + \tfrac{2M}{k})\big) \\ \tfrac{2M}{k\lambda} f_-\big(-M + \tfrac{2M}{k}\big)\, g\big(\lambda^{-1} x - \lambda^{-1}(-M + \tfrac{2M}{k})\big) \end{pmatrix}, $$
and that the fifth and sixth components of $L_{1,j}$ are non-negative. Since $\mathrm{ReLU}(h) = h$ for a non-negative function $h : \mathbb{R} \to \mathbb{R}$, we have, for $\alpha \ge 0$,
$$ \mathrm{ReLU}\!\left( h(x) + \alpha\, g_\lambda\!\left(x - \left(-M + \frac{2Mj}{k}\right)\right) \right) = h(x) + \alpha\, g_\lambda\!\left(x - \left(-M + \frac{2Mj}{k}\right)\right), $$
and thus, inductively, we have
$$ L_{k+2} \circ \mathrm{ReLU} \circ \cdots \circ \mathrm{ReLU} \circ L_{1,1}(x) = \sum_{j=1}^{k} \frac{2M}{k} f_+\!\left(-M + \frac{2Mj}{k}\right) g_\lambda\!\left(x - \left(-M + \frac{2Mj}{k}\right)\right) - \sum_{j=1}^{k} \frac{2M}{k} f_-\!\left(-M + \frac{2Mj}{k}\right) g_\lambda\!\left(x - \left(-M + \frac{2Mj}{k}\right)\right) = \sum_{j=1}^{k} \frac{2M}{k} f\!\left(-M + \frac{2Mj}{k}\right) g_\lambda\!\left(x - \left(-M + \frac{2Mj}{k}\right)\right). $$
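To make the construction concrete, here is a small NumPy sketch (an illustration under particular choices of $f$, $M$, $k$, and $\lambda$; it is not the author's code) that evaluates the width-6 network layer by layer and checks that its output equals the Riemann sum it was built to reproduce.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def bounded_width_net(x, f, M, k, lam):
    """Forward pass of the width-6 construction above for a single input x in [-M, M].

    Layers: L_{1,1}, L_{1,j} (j = 2..k), L_{k+1}, L_{k+2}; f_+ and f_- are the
    positive and negative parts of f, and y_j = -M + 2Mj/k are the grid points.
    """
    fp = lambda t: max(f(t), 0.0)            # f_+
    fm = lambda t: max(-f(t), 0.0)           # f_-
    y = lambda j: -M + 2.0 * M * j / k
    c = 2.0 * M / (k * lam)

    # L_{1,1} followed by ReLU
    t = (x - y(1)) / lam
    v = relu(np.array([x + M, t - 1.0, t, t + 1.0, 0.0, 0.0]))
    # L_{1,j} followed by ReLU, j = 2, ..., k
    for j in range(2, k + 1):
        t = (v[0] - M - y(j)) / lam
        hat = v[1] - 2.0 * v[2] + v[3]       # = g((x - y_{j-1}) / lam), non-negative
        v = relu(np.array([v[0], t - 1.0, t, t + 1.0,
                           v[4] + c * fp(y(j - 1)) * hat,
                           v[5] + c * fm(y(j - 1)) * hat]))
    # L_{k+1} followed by ReLU, then the output layer L_{k+2}
    hat = v[1] - 2.0 * v[2] + v[3]
    v = relu(np.array([v[4] + c * fp(y(k)) * hat, v[5] + c * fm(y(k)) * hat]))
    return v[0] - v[1]

# Check against the Riemann sum the network is designed to reproduce.
M, k, lam = 2.0, 40, 0.5
f = lambda t: np.sin(np.pi * t / 2.0)
g = lambda s: relu(s - 1.0) - 2.0 * relu(s) + relu(s + 1.0)
for x in np.linspace(-M, M, 9):
    yj = -M + 2.0 * M * np.arange(1, k + 1) / k
    target = np.sum((2.0 * M / k) * f(yj) * g((x - yj) / lam) / lam)
    assert np.isclose(bounded_width_net(x, f, M, k, lam), target)
```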

3.2. General-Dimensional Input

To generalize our idea from input dimension 1 to a general input dimension $n_{in}$, the main difficulty is how to make a ReLU network realize an approximate identity, for example, a kind of B-spline function on $\mathbb{R}^2$. We emphasize that we only need a function with the shape of a B-spline function, not an exact B-spline function on $\mathbb{R}^2$ (see Figure 1). For $n = 2$, we have
$$ \mathrm{ReLU}\big(x_2 + (\mathrm{ReLU}(x_1+1) - 2\,\mathrm{ReLU}(x_1) + \mathrm{ReLU}(x_1-1))\big) - 2\,\mathrm{ReLU}(x_2) + \mathrm{ReLU}\big(x_2 - (\mathrm{ReLU}(x_1+1) - 2\,\mathrm{ReLU}(x_1) + \mathrm{ReLU}(x_1-1))\big) = \begin{cases} 0 & \text{if } |x_1| + |x_2| \ge 1, \\ 1 - |x_1| - |x_2| & \text{if } |x_1| + |x_2| < 1. \end{cases} $$
In general,
Lemma 1.
For $b \in \mathbb{R}$, let $h(x_1, b) = \mathrm{ReLU}(x_1 + b) - 2\,\mathrm{ReLU}(x_1) + \mathrm{ReLU}(x_1 - b)$. Then, we have
$$ h(x_n, h(x_{n-1}, h(x_{n-2}, \ldots, h(x_1, 1) \cdots ))) = \begin{cases} 0 & \text{if } \sum_{j=1}^{n} |x_j| \ge 1, \\ 1 - \sum_{j=1}^{n} |x_j| & \text{if } \sum_{j=1}^{n} |x_j| < 1. \end{cases} \qquad (1) $$
In addition, we have
$$ \int_{\mathbb{R}^n} h(x_n, h(x_{n-1}, h(x_{n-2}, \ldots, h(x_1, 1) \cdots ))) \, d\mathbf{x} = \frac{2^n}{(n+1)!}. $$
We note that, for
$$ g(\mathbf{x}) = h(x_n, h(x_{n-1}, h(x_{n-2}, \ldots, h(x_1, 1) \cdots ))), \quad \mathbf{x} \in \mathbb{R}^n $$
(see Figure 1), the family $\frac{(n+1)!}{2^n} g_\lambda(\mathbf{x}) = \frac{(n+1)!}{2^n} \lambda^{-n} g(\lambda^{-1}\mathbf{x})$, $\lambda > 0$, is an approximate identity: for a continuous function $f$ on $[-M, M]^n$, $\frac{(n+1)!}{2^n} f * g_\lambda$ converges to $f$ uniformly. Hence, for any $\epsilon > 0$, there is $\lambda > 0$ such that
$$ \sup_{\mathbf{x} \in [-M, M]^{n_{in}}} \left| \frac{(n+1)!}{2^n} f * g_\lambda(\mathbf{x}) - f(\mathbf{x}) \right| < \frac{\epsilon}{2}. \qquad (2) $$
Proof. 
We use mathematical induction on $n$. We have already shown the case $n = 1$. Suppose that
$$ h(x_n, h(x_{n-1}, \ldots, h(x_1, 1) \cdots )) = \begin{cases} 0 & \text{if } \sum_{j=1}^{n} |x_j| \ge 1, \\ 1 - \sum_{j=1}^{n} |x_j| & \text{if } \sum_{j=1}^{n} |x_j| < 1. \end{cases} $$
For simplicity, let $g(\mathbf{x}) = h(x_n, h(x_{n-1}, \ldots, h(x_1, 1) \cdots ))$, $\mathbf{x} \in \mathbb{R}^n$. If $\sum_{j=1}^{n+1} |x_j| \ge 1$, then $|x_{n+1}| \ge 1 - \sum_{j=1}^{n} |x_j|$ and thus $|x_{n+1}| \ge g(\mathbf{x})$. For $x_{n+1} \ge 0$,
$$ h(x_{n+1}, g(\mathbf{x})) = \mathrm{ReLU}(x_{n+1} + g(\mathbf{x})) - 2\,\mathrm{ReLU}(x_{n+1}) + \mathrm{ReLU}(x_{n+1} - g(\mathbf{x})) = (x_{n+1} + g(\mathbf{x})) - 2x_{n+1} + (x_{n+1} - g(\mathbf{x})) $$
is equal to zero. Similarly, for $x_{n+1} < 0$, one checks that $h(x_{n+1}, g(\mathbf{x}))$ is also zero. Now, consider $\sum_{j=1}^{n+1} |x_j| < 1$. Then, by the induction hypothesis, $g(\mathbf{x}) = 1 - \sum_{j=1}^{n} |x_j|$. For $x_{n+1} \ge 0$, we have
$$ h(x_{n+1}, g(\mathbf{x})) = \mathrm{ReLU}(x_{n+1} + g(\mathbf{x})) - 2\,\mathrm{ReLU}(x_{n+1}) + \mathrm{ReLU}(x_{n+1} - g(\mathbf{x})) = x_{n+1} + 1 - \sum_{j=1}^{n} |x_j| - 2x_{n+1} = 1 - \sum_{j=1}^{n+1} |x_j|, $$
since $x_{n+1} - 1 + \sum_{j=1}^{n} |x_j| < 0$. Similarly, for $x_{n+1} < 0$, we have
$$ h(x_{n+1}, g(\mathbf{x})) = x_{n+1} + 1 - \sum_{j=1}^{n} |x_j| = 1 - \sum_{j=1}^{n+1} |x_j|. $$
The integral formula follows from (1) by a direct calculation. □
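Lemma 1 is easy to test numerically. The following sketch (illustrative only; the dimension $n = 4$ and the sampling box are arbitrary choices) builds $g$ by the recursion $h(x_n, h(x_{n-1}, \ldots, h(x_1, 1)\cdots))$, compares it with $\max\big(0, 1 - \sum_j |x_j|\big)$, and roughly checks the integral $2^n/(n+1)!$ by Monte Carlo.

```python
import math
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def h(x, b):
    # h(x, b) = ReLU(x + b) - 2 ReLU(x) + ReLU(x - b)
    return relu(x + b) - 2.0 * relu(x) + relu(x - b)

def pyramid(pts):
    # g(x) = h(x_n, h(x_{n-1}, ..., h(x_1, 1)...)), evaluated for each row of pts
    val = np.ones(pts.shape[0])
    for i in range(pts.shape[1]):
        val = h(pts[:, i], val)
    return val

n = 4
rng = np.random.default_rng(1)
pts = rng.uniform(-1.5, 1.5, size=(200_000, n))
expected = np.maximum(0.0, 1.0 - np.abs(pts).sum(axis=1))      # 1 - sum_j |x_j|, clipped at 0
assert np.allclose(pyramid(pts), expected)

# Rough Monte Carlo check of the integral 2^n / (n+1)!  (here 16/120 ~ 0.1333).
print((3.0 ** n) * expected.mean(), 2.0 ** n / math.factorial(n + 1))
```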
Again, as in the one-dimensional case, approximating the convolution by a Riemann sum gives $k \in \mathbb{N}$ such that
$$ \sup_{\mathbf{x} \in [-M, M]^{n_{in}}} \left| \frac{(n_{in}+1)!}{2^{n_{in}}} f * g_\lambda(\mathbf{x}) - \frac{(n_{in}+1)! \, M^{n_{in}}}{k^{n_{in}}} \sum_{\mathbf{j} \in \mathbb{Z}_k^{n_{in}}} f\!\left((-M, \ldots, -M) + \frac{2M\mathbf{j}}{k}\right) g_\lambda\!\left(\mathbf{x} - (-M, \ldots, -M) - \frac{2M\mathbf{j}}{k}\right) \right| < \frac{\epsilon}{2}, $$
and thus, with (2), we have, for some $\lambda > 0$ and $k \in \mathbb{N}$,
$$ \sup_{\mathbf{x} \in [-M, M]^{n_{in}}} \left| f(\mathbf{x}) - \frac{(n_{in}+1)! \, M^{n_{in}}}{k^{n_{in}}} \sum_{\mathbf{j} \in \mathbb{Z}_k^{n_{in}}} f\!\left((-M, \ldots, -M) + \frac{2M\mathbf{j}}{k}\right) g_\lambda\!\left(\mathbf{x} - (-M, \ldots, -M) - \frac{2M\mathbf{j}}{k}\right) \right| < \epsilon. $$
Now, our goal is to construct a ReLU network satisfying
$$ f_{(l,\mathbf{n})}(\mathbf{x}) = \frac{(n_{in}+1)! \, M^{n_{in}}}{k^{n_{in}}} \sum_{\mathbf{j} \in \mathbb{Z}_k^{n_{in}}} f\!\left((-M, \ldots, -M) + \frac{2M\mathbf{j}}{k}\right) g_\lambda\!\left(\mathbf{x} - (-M, \ldots, -M) - \frac{2M\mathbf{j}}{k}\right). $$
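As a crude numerical illustration of this goal function (the weighted sum itself, not the ReLU network realizing it), the sketch below evaluates the right-hand side directly for $n_{in} = 2$; the choices $f(\mathbf{y}) = \cos(\pi y_1/2)\cos(\pi y_2/2)$, $M = 1$, $k = 30$, $\lambda = 10M/k$, and the interior test grid are assumptions made here.

```python
import itertools
import math
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def h(x, b):
    return relu(x + b) - 2.0 * relu(x) + relu(x - b)

def g(z):                        # pyramid g(z) = h(z_n, h(z_{n-1}, ..., h(z_1, 1)...)), z in R^n
    val = 1.0
    for coord in z:
        val = h(coord, val)
    return val

def target_sum(f, x, M, k, lam):
    # ((n+1)! M^n / k^n) * sum_{j in Z_k^n} f(y_j) g_lambda(x - y_j), y_j = (-M,...,-M) + 2Mj/k
    n = len(x)
    coef = math.factorial(n + 1) * (M / k) ** n
    total = 0.0
    for j in itertools.product(range(1, k + 1), repeat=n):
        y = -M + 2.0 * M * np.array(j) / k
        total += coef * f(y) * g((x - y) / lam) / lam ** n    # g_lambda(z) = g(z/lam)/lam^n
    return total

M, n, k = 1.0, 2, 30
lam = 10.0 * M / k                                            # kernel width tied to the grid spacing
f = lambda y: np.cos(np.pi * y[0] / 2.0) * np.cos(np.pi * y[1] / 2.0)
errs = [abs(f(x) - target_sum(f, x, M, k, lam))
        for x in (np.array(p) for p in itertools.product(np.linspace(-0.5 * M, 0.5 * M, 7), repeat=2))]
print(f"max error on the interior test grid ~ {max(errs):.3f}")
```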
Since we already have the approximate identity $\frac{(n_{in}+1)!}{2^{n_{in}}} g$, what remains is to realize this sum by a ReLU network. Here, the indexing is not difficult, but it is slightly cumbersome. Similarly to the case $n_{in} = 1$, for
$$ (i, \mathbf{j}) = (i, j_1, j_2, \ldots, j_{n_{in}}) \in \{1, \ldots, n_{in}\} \times \mathbb{Z}_k^{n_{in}} \setminus \{(1, 1, \ldots, 1)\} $$
(the index $i$ runs over dimensions and $\mathbf{j}$ over the grid points of the Riemann sum), let
$$ L_{1,1,\ldots,1} : \mathbb{R}^{n_{in}} \to \mathbb{R}^{n_{in}+5}, \quad L_{i,\mathbf{j}} : \mathbb{R}^{n_{in}+5} \to \mathbb{R}^{n_{in}+5}, \quad L_{n_{in}k^{n_{in}}+1} : \mathbb{R}^{n_{in}+5} \to \mathbb{R}^2, \quad \text{and} \quad L_{n_{in}k^{n_{in}}+2} : \mathbb{R}^2 \to \mathbb{R} $$
with
$$ L_{1,1,\ldots,1}(\mathbf{x}) = \begin{pmatrix} x_1 + M \\ x_2 + M \\ \vdots \\ x_{n_{in}} + M \\ \lambda^{-1}\big(x_1 - (-M + 2M/k)\big) - 1 \\ \lambda^{-1}\big(x_1 - (-M + 2M/k)\big) \\ \lambda^{-1}\big(x_1 - (-M + 2M/k)\big) + 1 \\ 0 \\ 0 \end{pmatrix}, $$
$$ L_{i,\mathbf{j}}(\mathbf{x}) = \begin{pmatrix} x_1 \\ \vdots \\ x_{n_{in}} \\ \lambda^{-1}\big(x_i - M - (-M + \tfrac{2Mj_i}{k})\big) - (x_{n_{in}+1} - 2x_{n_{in}+2} + x_{n_{in}+3}) \\ \lambda^{-1}\big(x_i - M - (-M + \tfrac{2Mj_i}{k})\big) \\ \lambda^{-1}\big(x_i - M - (-M + \tfrac{2Mj_i}{k})\big) + (x_{n_{in}+1} - 2x_{n_{in}+2} + x_{n_{in}+3}) \\ x_{n_{in}+4} \\ x_{n_{in}+5} \end{pmatrix} \quad \text{for } i = 2, \ldots, n_{in}, $$
$$ L_{1,\mathbf{j}}(\mathbf{x}) = \begin{pmatrix} x_1 \\ \vdots \\ x_{n_{in}} \\ \lambda^{-1}\big(x_1 - M - (-M + \tfrac{2Mj_1}{k})\big) - (x_{n_{in}+1} - 2x_{n_{in}+2} + x_{n_{in}+3}) \\ \lambda^{-1}\big(x_1 - M - (-M + \tfrac{2Mj_1}{k})\big) \\ \lambda^{-1}\big(x_1 - M - (-M + \tfrac{2Mj_1}{k})\big) + (x_{n_{in}+1} - 2x_{n_{in}+2} + x_{n_{in}+3}) \\ x_{n_{in}+4} + \tfrac{(n_{in}+1)! \, M^{n_{in}}}{k^{n_{in}} \lambda^{n_{in}}} f_+\!\big((-M, \ldots, -M) + \tfrac{2M(\mathbf{j} - \mathbf{1_j})}{k}\big)(x_{n_{in}+1} - 2x_{n_{in}+2} + x_{n_{in}+3}) \\ x_{n_{in}+5} + \tfrac{(n_{in}+1)! \, M^{n_{in}}}{k^{n_{in}} \lambda^{n_{in}}} f_-\!\big((-M, \ldots, -M) + \tfrac{2M(\mathbf{j} - \mathbf{1_j})}{k}\big)(x_{n_{in}+1} - 2x_{n_{in}+2} + x_{n_{in}+3}) \end{pmatrix} \quad \big(\mathbf{j} \neq (1, 1, \ldots, 1)\big), $$
$$ L_{n_{in}k^{n_{in}}+1}(\mathbf{x}) = \begin{pmatrix} x_{n_{in}+4} + \tfrac{(n_{in}+1)! \, M^{n_{in}}}{k^{n_{in}} \lambda^{n_{in}}} f_+(M, \ldots, M)(x_{n_{in}+1} - 2x_{n_{in}+2} + x_{n_{in}+3}) \\ x_{n_{in}+5} + \tfrac{(n_{in}+1)! \, M^{n_{in}}}{k^{n_{in}} \lambda^{n_{in}}} f_-(M, \ldots, M)(x_{n_{in}+1} - 2x_{n_{in}+2} + x_{n_{in}+3}) \end{pmatrix}, \quad \text{and} \quad L_{n_{in}k^{n_{in}}+2}(\mathbf{x}) = x_1 - x_2. $$
Here,
$$ \mathbf{1_j} = \begin{cases} (1, 0, \ldots, 0) & \text{if } j_1 \neq 1, \\ (-(k-1), 1, 0, \ldots, 0) & \text{if } j_1 = 1 \text{ and } j_2 \neq 1, \\ (-(k-1), \ldots, -(k-1), \underbrace{1}_{(l+1)\text{-th}}, 0, \ldots, 0) & \text{if } j_1 = j_2 = \cdots = j_l = 1 \text{ and } j_{l+1} \neq 1, \\ (-(k-1), -(k-1), \ldots, -(k-1), 1) & \text{if } j_1 = \cdots = j_{n_{in}-1} = 1 \text{ and } j_{n_{in}} \neq 1. \end{cases} $$
We stack the layers so that the index $i$ increases first, then the index $j_1$, then $j_2$, and so on:
$$ \begin{array}{l} (1,1,1,\ldots,1),\ (2,1,1,\ldots,1),\ (3,1,1,\ldots,1),\ \ldots,\ (n_{in},1,1,\ldots,1), \\ (1,2,1,\ldots,1),\ (2,2,1,\ldots,1),\ \ldots,\ (n_{in},2,1,\ldots,1),\ \ldots,\ (1,k,1,\ldots,1),\ \ldots,\ (n_{in},k,1,\ldots,1), \\ (1,1,2,1,\ldots,1),\ \ldots,\ (n_{in},1,2,1,\ldots,1),\ \ldots,\ (1,k,2,1,\ldots,1),\ \ldots,\ (n_{in},k,2,1,\ldots,1), \\ (1,1,3,1,\ldots,1),\ \ldots,\ (n_{in},1,3,1,\ldots,1),\ \ldots,\ (1,k,k,1,\ldots,1),\ \ldots,\ (n_{in},k,k,1,\ldots,1), \\ (1,1,1,2,1,\ldots,1),\ \ldots,\ (n_{in},1,1,2,1,\ldots,1),\ \ldots, \\ (1,k-1,k,\ldots,k,k),\ \ldots,\ (n_{in},k-1,k,\ldots,k,k),\ (1,k,k,\ldots,k,k),\ \ldots,\ (n_{in},k,k,\ldots,k,k) \end{array} $$
(that is, the index $i$ is the one that increases first). If we omit the index $i$, the grid indices $\mathbf{j}$ are ordered as
$$ \begin{array}{l} (1,1,1,\ldots,1),\ (2,1,1,\ldots,1),\ (3,1,1,\ldots,1),\ \ldots,\ (k,1,1,\ldots,1), \\ (1,2,1,\ldots,1),\ (2,2,1,\ldots,1),\ \ldots,\ (k,2,1,\ldots,1),\ \ldots, \\ (1,k,1,\ldots,1),\ (2,k,1,\ldots,1),\ \ldots,\ (k-1,k,1,\ldots,1),\ (k,k,1,\ldots,1), \\ (1,1,2,\ldots,1),\ (2,1,2,\ldots,1),\ \ldots,\ (k-1,1,2,\ldots,1),\ (k,1,2,\ldots,1),\ \ldots, \\ (1,k,\ldots,k,k-1,k),\ \ldots,\ (k-1,k,\ldots,k-1,k),\ (k,\ldots,k-1,k), \\ (1,k,k,\ldots,k,k),\ \ldots,\ (k-1,k,k,\ldots,k,k),\ (k,k,\ldots,k,k). \end{array} \qquad (3) $$
In the second terms of the $(n_{in}+4)$-th and $(n_{in}+5)$-th components of $L_{1,\mathbf{j}}$, $\mathbf{j} \neq (1, 1, \ldots, 1)$, namely
$$ f_\pm\!\left((-M, \ldots, -M) + \frac{2M(\mathbf{j} - \mathbf{1_j})}{k}\right)(x_{n_{in}+1} - 2x_{n_{in}+2} + x_{n_{in}+3}), $$
we want the value at the previous grid point, and thus $\mathbf{j} - \mathbf{1_j}$ is exactly the index immediately preceding $\mathbf{j}$.
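The bookkeeping can be made explicit in a few lines of code. The sketch below (with hypothetical helper names) computes $\mathbf{1_j}$ from the case definition above and checks that $\mathbf{j} - \mathbf{1_j}$ is indeed the immediately preceding multi-index in the ordering (3), with $j_1$ varying fastest.

```python
import itertools

def one_j(j, k):
    # 1_j as defined above: j - 1_j is the predecessor of j in the ordering (j_1 varies fastest).
    j = list(j)
    out = [0] * len(j)
    for idx, val in enumerate(j):
        if val != 1:
            out[idx] = 1
            break
        out[idx] = -(k - 1)      # this coordinate wraps from 1 back to k
    return tuple(out)

k, n = 3, 3
# Enumerate Z_k^n with j_1 increasing fastest, as in the ordering (3).
order = [tuple(reversed(t)) for t in itertools.product(range(1, k + 1), repeat=n)]
for prev, cur in zip(order, order[1:]):
    step = one_j(cur, k)
    assert tuple(c - s for c, s in zip(cur, step)) == prev
print("j - 1_j is the predecessor of j for every j != (1, ..., 1)")
```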
We note that the $(n_{in}+1)$-th, $(n_{in}+2)$-th, and $(n_{in}+3)$-th components of the function $f_{(1,(n_{in}, n_{in}+5, n_{in}+5))}(\mathbf{x})$ represented by $L_{2,1,1,\ldots,1} \circ \mathrm{ReLU} \circ L_{1,1,1,\ldots,1}(\mathbf{x})$ are
$$ \begin{pmatrix} \lambda^{-1}\big(x_2 - M - (-M + \tfrac{2M}{k})\big) - h\big(\lambda^{-1}(x_1 - (-M + \tfrac{2M}{k})), 1\big) \\ \lambda^{-1}\big(x_2 - M - (-M + \tfrac{2M}{k})\big) \\ \lambda^{-1}\big(x_2 - M - (-M + \tfrac{2M}{k})\big) + h\big(\lambda^{-1}(x_1 - (-M + \tfrac{2M}{k})), 1\big) \end{pmatrix}, $$
and thus the $(n_{in}+1)$-th, $(n_{in}+2)$-th, and $(n_{in}+3)$-th components of the function $f_{(2,(n_{in}, n_{in}+5, n_{in}+5, n_{in}+5))}(\mathbf{x})$ become
$$ \begin{pmatrix} \lambda^{-1}\big(x_3 - M - (-M + \tfrac{2M}{k})\big) - \bar{g}\big(\lambda^{-1}\big((x_1, x_2) - ((-M, -M) + \tfrac{2M(1,1)}{k})\big)\big) \\ \lambda^{-1}\big(x_3 - M - (-M + \tfrac{2M}{k})\big) \\ \lambda^{-1}\big(x_3 - M - (-M + \tfrac{2M}{k})\big) + \bar{g}\big(\lambda^{-1}\big((x_1, x_2) - ((-M, -M) + \tfrac{2M(1,1)}{k})\big)\big) \end{pmatrix}, $$
where $\bar{g}(x_1, x_2) = h(x_2, h(x_1, 1))$. Thus, the function $f_{(n_{in},(n_{in}, n_{in}+5, \ldots, n_{in}+5))}(\mathbf{x})$ represented by $L_{1,2,1,\ldots,1} \circ \mathrm{ReLU} \circ \cdots \circ L_{1,1,\ldots,1}(\mathbf{x})$ is equal to
$$ \begin{pmatrix} x_1 + M \\ \vdots \\ x_{n_{in}} + M \\ \lambda^{-1}\big(x_1 - M - (-M + \tfrac{4M}{k})\big) - (x_{n_{in}+1} - 2x_{n_{in}+2} + x_{n_{in}+3}) \\ \lambda^{-1}\big(x_1 - M - (-M + \tfrac{4M}{k})\big) \\ \lambda^{-1}\big(x_1 - M - (-M + \tfrac{4M}{k})\big) + (x_{n_{in}+1} - 2x_{n_{in}+2} + x_{n_{in}+3}) \\ \tfrac{(n_{in}+1)! \, M^{n_{in}}}{k^{n_{in}} \lambda^{n_{in}}} f_+\!\big((-M, \ldots, -M) + \tfrac{2M(1, \ldots, 1)}{k}\big)\, g\big(\lambda^{-1}\mathbf{x} - \lambda^{-1}((-M, \ldots, -M) + \tfrac{2M(1, \ldots, 1)}{k})\big) \\ \tfrac{(n_{in}+1)! \, M^{n_{in}}}{k^{n_{in}} \lambda^{n_{in}}} f_-\!\big((-M, \ldots, -M) + \tfrac{2M(1, \ldots, 1)}{k}\big)\, g\big(\lambda^{-1}\mathbf{x} - \lambda^{-1}((-M, \ldots, -M) + \tfrac{2M(1, \ldots, 1)}{k})\big) \end{pmatrix}, $$
since, for a non-negative function $h$, $\mathrm{ReLU}(h(\mathbf{x})) = h(\mathbf{x})$. Then, the $(n_{in}+4)$-th and $(n_{in}+5)$-th components of the function $f_{(l,\mathbf{n})}(\mathbf{x})$ represented by $L_{1,\bar{\mathbf{j}}} \circ \mathrm{ReLU} \circ \cdots \circ L_{1,1,1,\ldots,1}(\mathbf{x})$ are
$$ \sum_{\mathbf{j} \prec \bar{\mathbf{j}}} \frac{(n_{in}+1)! \, M^{n_{in}}}{k^{n_{in}} \lambda^{n_{in}}} f_+\!\left((-M, \ldots, -M) + \frac{2M\mathbf{j}}{k}\right) g\!\left(\lambda^{-1}\mathbf{x} - \lambda^{-1}\!\left((-M, \ldots, -M) + \frac{2M\mathbf{j}}{k}\right)\right) $$
and
$$ \sum_{\mathbf{j} \prec \bar{\mathbf{j}}} \frac{(n_{in}+1)! \, M^{n_{in}}}{k^{n_{in}} \lambda^{n_{in}}} f_-\!\left((-M, \ldots, -M) + \frac{2M\mathbf{j}}{k}\right) g\!\left(\lambda^{-1}\mathbf{x} - \lambda^{-1}\!\left((-M, \ldots, -M) + \frac{2M\mathbf{j}}{k}\right)\right), $$
where $\mathbf{j} \prec \bar{\mathbf{j}}$ refers to the well-ordering of the indices $\mathbf{j}$ given in (3) (for example, $(3, 1, 1, \ldots, 1) \prec (1, 2, 1, \ldots, 1)$).
Therefore, we have
$$ \begin{aligned} L_{n_{in}k^{n_{in}}+2} \circ \mathrm{ReLU} \circ L_{n_{in}k^{n_{in}}+1} \circ \mathrm{ReLU} \circ L_{n_{in},(k,\ldots,k)} \circ \cdots \circ L_{2,1,1,\ldots,1} \circ \mathrm{ReLU} \circ L_{1,1,\ldots,1}(\mathbf{x}) &= \frac{(n_{in}+1)! \, M^{n_{in}}}{k^{n_{in}}} \sum_{\mathbf{j} \in \mathbb{Z}_k^{n_{in}}} f_+\!\left((-M, \ldots, -M) + \frac{2M\mathbf{j}}{k}\right) g_\lambda\!\left(\mathbf{x} - (-M, \ldots, -M) - \frac{2M\mathbf{j}}{k}\right) \\ &\quad - \frac{(n_{in}+1)! \, M^{n_{in}}}{k^{n_{in}}} \sum_{\mathbf{j} \in \mathbb{Z}_k^{n_{in}}} f_-\!\left((-M, \ldots, -M) + \frac{2M\mathbf{j}}{k}\right) g_\lambda\!\left(\mathbf{x} - (-M, \ldots, -M) - \frac{2M\mathbf{j}}{k}\right) \\ &= \frac{(n_{in}+1)! \, M^{n_{in}}}{k^{n_{in}}} \sum_{\mathbf{j} \in \mathbb{Z}_k^{n_{in}}} f\!\left((-M, \ldots, -M) + \frac{2M\mathbf{j}}{k}\right) g_\lambda\!\left(\mathbf{x} - (-M, \ldots, -M) - \frac{2M\mathbf{j}}{k}\right). \qquad \square \end{aligned} $$

4. Conclusions

The universal approximation theorem is a central piece of the mathematical theory of artificial neural networks; the classic version states that a feed-forward network with one hidden layer and a suitable activation function can approximate continuous functions on compact subsets. Here, we showed that, for a given continuous function on a compact subset of $\mathbb{R}^{n_{in}}$, a ReLU network with $n_{in}k^{n_{in}}+1$ hidden layers of at most $n_{in}+5$ nodes and proper weights can approximate the function, using an approximate identity. To show this, we explicitly gave the connections and weights. Our construction is partially rather than fully connected; in particular, each of the first $n_{in}+3$ nodes in each hidden layer is connected to only one node of the previous layer. This means that a fully connected network is not actually needed.
We are particularly interested in the relation between the parameter $k$ and the error bound $\epsilon$ because, if this relation were known, we could estimate how many hidden layers are needed to achieve a prescribed error. Thus, an error estimate depending on the parameter $k$ is our next line of research.

Funding

This work was supported by the National Research Foundation of Korea grant funded by the Korea government (MSIP) (2018R1D1A3B07041149).

Data Availability Statement

Not applicable.

Acknowledgments

The author thanks K Kim, J Kang, and J Jeong for fruitful discussions.

Conflicts of Interest

The author declares no conflict of interest.

References

1. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
2. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
3. Arora, R.; Basu, A.; Mianjy, P.; Mukherjee, A. Understanding deep neural networks with rectified linear units. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
4. Mhaskar, H.N.; Poggio, T. Deep vs. shallow networks: An approximation theory perspective. Anal. Appl. 2016, 14, 829–848.
5. Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Netw. 2017, 94, 103–114.
6. Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control. Signals Syst. 1989, 2, 303–314.
7. Barron, A.R. Approximation and estimation bounds for artificial neural networks. Mach. Learn. 1994, 14, 115–133.
8. Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989, 2, 359–366.
9. Funahashi, K.-I. On the approximate realization of continuous mappings by neural networks. Neural Netw. 1989, 2, 183–192.
10. Delalleau, O.; Bengio, Y. Shallow vs. Deep Sum-Product Networks. In Advances in Neural Information Processing Systems 24; Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Granada, Spain, 12–17 December 2011; pp. 666–674.
11. Eldan, R.; Shamir, O. The power of depth for feedforward neural networks. In Proceedings of the Conference on Learning Theory, New York, NY, USA, 23–26 June 2016; pp. 907–940.
12. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323.
13. Cohen, N.; Sharir, O.; Shashua, A. On the expressive power of deep learning: A tensor analysis. In Proceedings of the Conference on Learning Theory, New York, NY, USA, 23–26 June 2016; pp. 698–728.
14. Lu, Z.; Pu, H.; Wang, F.; Hu, Z.; Wang, L. The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Long Beach, CA, USA, 4–9 December 2017; pp. 6231–6239.
15. Hanin, B. Universal function approximation by deep neural nets with bounded width and ReLU activations. Mathematics 2019, 7, 992.
16. Hanin, B.; Sellke, M. Approximating continuous functions by ReLU nets of minimal width. arXiv 2017, arXiv:1710.11278.
17. Suzuki, T. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: Optimal rate and curse of dimensionality. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
18. Ohn, I.; Kim, Y. Smooth function approximation by deep neural networks with general activation functions. Entropy 2019, 21, 627.
19. Lin, H.; Jegelka, S. ResNet with one-neuron hidden layers is a universal approximator. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 2–8 December 2018; Curran Associates, Inc.: Red Hook, NY, USA, 2018; pp. 6172–6181.
20. Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function. Ann. Stat. 2020, 48, 1875–1897.
Figure 1. The function $g(\mathbf{x})$. Left: one dimension; center: two dimensions; right: three dimensions, with the range indicated.

