1. Introduction
Classic approaches to recurrent neural networks (RNNs), such as back-propagation through time [1], have been considered difficult to handle. In particular, learning in the recurrent layer is slow and problematic due to potential instabilities. About 15 years ago, reservoir computing [2] was suggested as an alternative approach to RNNs. Here, it is not necessary to train the connectivity in the recurrent layer. Instead, constant, usually random, connectivity weights are used in the recurrent layer. Supervised learning can be done by training the output layer using linear regression. Two types of reservoir computing are well established in the literature. The first is called liquid state machines (LSMs, [3]), which are usually based on a network of spiking neurons. The second type is called an echo state network (ESN, [4]), which uses real-valued neurons that initially use a sigmoid as a transfer function. Although a random recurrent connectivity pattern can be used, heuristically it has been found that the performance of the network typically depends strongly on the statistical features of this random connectivity (cf. for example [5] for ESNs).
Thus, what is a good reservoir with regard to a particular stationary input statistics? This has been a fundamental question for research in this field since the invention of reservoir computing. One fundamental idea is that a reservoir can only infer the training output from that window of the input history whose traces can still be found inside the reservoir dynamics. However, if the input that is necessary to infer the training output lies far in the past, it may happen that no traces of this input remain inside the reservoir. So, the answer seems to be that a good reservoir is one from whose states an input history can be reconstructed over a time span that is as long as possible. More precisely, the history should be reconstructed in a way that is sufficiently accurate to predict the training output. In other words, a good reservoir is a reservoir that has a good memory of the input history.
There have been efforts to quantify the quality of the memory of the reservoir. Most common is the "memory capacity" (MC) according to Jaeger's definition [4]. However, MC has several drawbacks. For example, it is not directly compatible with a Shannon information based measure. Still, it illustrates that ESNs are rather tightly restricted, in the sense that the upper limit of the MC is equal to the number of hidden layer neurons. So the capabilities of the network increase with the number of neurons.
One more important limiting factor with regard to the reservoir memory is the strength of the recurrent connectivity. According to the echo state condition, the nature of the reservoir requires that the maximum of its eigenvalues in modulus is smaller than 1, which is called the echo state property (ESP). This seems always to result in an exponential forgetting of previous states. Thus, forgetting is independent of the input statistics; instead it is pre-determined by the design of the reservoir dynamics.
In order to proceed, there are several important aspects. First, it is necessary to get rid of the intrinsic time scale of forgetting that is induced by a largest absolute eigenvalue |λ_max| < 1. More precisely, the remaining activity of inputs to the reservoir that date back more than τ time steps is a fraction smaller than |λ_max|^τ. Networks where the largest eigenvalue is larger than 1 cannot be used as reservoirs anymore, a point which is detailed below. One can try |λ_max| = 1 and see if this improves the network performance and how this impacts the memory of the reservoir on earlier events. Steps toward this direction have been made by going near the "edge of chaos" [5], or even further, where the network may not be an echo state network for all possible input sequences but instead just around some permissible inputs [6]. Presumably, these approaches still all forget exponentially fast.
Strictly, networks with |λ_max| = 1 are not covered by Jaeger's initial proof of the echo state condition. One important purpose of this paper is to close this gap and to complete Jaeger's proof in this sense. The other purpose is to motivate the principles of [7] in examples that are as simple as possible and thus to increase the resulting insight.
The intentions of the following sections of the paper are to motivate the concept of critical neural networks and to explain how they are related to memory compression. These intentions comprise a relatively large part of the paper because it seems important to argue for the value in principle of critical ESNs.
Section 2 introduces the concept of reservoir computing and also defines important variables for the following sections. An important feature is that Lyapunov exponents are reviewed in order to suggest a clear definition of critical reservoirs that can be applied analytically to candidate reservoirs.
Section 3 describes how critical one-neuron reservoirs can be designed and also outlines how the concept can be extended to larger networks.
Section 4 explains why critical ESNs are not covered by Jaeger's proof. The actual proof of the echo state condition can be found in Section 5. Certain aspects of the proof have been transferred to the appendix.
2. Motivation
The simplest way to train with data in a supervised learning paradigm is to interpolate the data (cf. for example [8]). Thus, for a time series of input data u_t that forms an input sequence ū and corresponding output data y_t, one can choose a vector of non-linear, linearly independent functions f(·) and a transfer matrix W^out. Then, one can define
y_t = W^out f(u_t). (1)
The matrix W^out can be calculated by linear regression, i.e.,
W^out = B A^T (A A^T)^{-1}, (2)
where the rectangular matrices A and B are composed from the data of the training set and A^T is the transpose of A. Further, one can use a single transcendental function θ such that
f(u_t) = θ(M u_t),
where M is a matrix in which each line consists of a unique vector and θ(·) is defined in the Matlab fashion, so the function is applied to each entry of the vector separately. Linear independence of the components of f can then be guaranteed if the column vectors of M are linearly independent. Practically, linear independence can be assumed if the entries of M are chosen randomly from a continuous set and M has at least as many rows as columns.
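As a rough numerical sketch of this interpolation scheme (an illustration only; the concrete feature map f(u) = tanh(M u), the target series, the variable names and the least-squares call are assumptions rather than the paper's exact notation), one could write:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, T = 3, 50, 500                 # input dimension, number of features, training samples
U = rng.uniform(-1, 1, (T, d))       # input data u_t
Y = np.sin(U.sum(axis=1))[:, None]   # some target output y_t (purely illustrative)

M = rng.normal(size=(k, d))          # random matrix; each row is a unique random vector

def f(u):
    # non-linear feature vector: transcendental function applied entry-wise ("Matlab fashion")
    return np.tanh(M @ u)

A = np.stack([f(u) for u in U])                 # T x k matrix of feature vectors
W_out, *_ = np.linalg.lstsq(A, Y, rcond=None)   # linear regression for the readout

print("training MSE:", np.mean((A @ W_out - Y) ** 2))
```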
The disadvantage of the pure interpolation method with regard to time series is that the input history, that is u_{t-1}, u_{t-2}, …, has no impact on training the current output y_t. Thus, if a relation between previous inputs and current outputs exists, that relation cannot be learned.
Different from Equation (1), a reservoir in the sense of reservoir computing [2,9,10] can be defined through a recursive update of an internal state x_t that depends on the previous state and on the current input,
x_t = F(x_{t-1}, u_t). (3)
The recursive update function adds several new aspects to the interpolation that is outlined in Equation (1).
One important question is whether the regression step in the previous section, and thus the interpolation, works at all for the recursive definition in Equation (3). Jaeger showed ([4], p. 43; [11]) that the regression is applicable, i.e., the echo state property (ESP) is fulfilled, if and only if the network is uniformly state contracting. Uniform state contraction is defined in the following.
Assume an infinite stimulus sequence ū = u_1, u_2, … and two random initial internal states of the system, x_0 and x'_0. To both initial states x_0 and x'_0, the sequences x_0, x_1, x_2, … and x'_0, x'_1, x'_2, … can be respectively assigned through
x_t = F(x_{t-1}, u_t) and x'_t = F(x'_{t-1}, u_t),
where the second sequence is another series of internal states and d(x_t, x'_t) = ||x_t − x'_t|| shall be a distance measure using the square norm. Then the system F is uniformly state contracting if, independently of ū, for any pair of initial states (x_0, x'_0) and all real values ε > 0 there exists a finite τ for which
d(x_t, x'_t) < ε
for all t ≥ τ.
Another way to look at the echo state condition is that the network F behaves in a time-invariant manner, in the sense that some finite subsequence of an input time series will always result in roughly the same outcome. In other words,
x_t ≈ x'_t,
independent of the initial states x_0 and x'_0 and of inputs that date back further than τ time steps, provided that τ is sufficiently large.
Lyapunov analysis is a method to analyze predictability versus instability of a dynamical system (see [12]). More precisely, it measures exponential stability.
In the context of non-autonomous systems, one may define the Lyapunov exponent as
λ = lim_{t→∞} (1/t) ln( d(x_t, x'_t) / d(x_0, x'_0) ). (7)
Thus, if d(x_t, x'_t) ∝ exp(b t), then λ approximates b and thus measures the exponent of exponential decay. For power law decays, the Lyapunov exponent is always zero (for example, one may try d(x_t, x'_t) ∝ t^{−a}).
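A small numerical illustration of this distinction (a sketch, not from the paper; the decay constants are arbitrary): the finite-time estimate (1/t) ln(d_t/d_0) approaches the decay exponent for an exponentially shrinking distance, while it tends to zero for a power-law decay.

```python
import numpy as np

t = np.arange(1, 100001)

d_exp = np.exp(-0.001 * t)   # exponential decay, d(t) ~ exp(b t) with b = -0.001
d_pow = t ** -2.0            # power-law decay,  d(t) ~ t^(-a) with a = 2

def lyapunov_estimate(d, t, d0=1.0):
    # finite-time estimate of the Lyapunov exponent: (1/t) * ln(d_t / d_0)
    return np.log(d / d0) / t

print("exponential decay:", lyapunov_estimate(d_exp, t)[-1])   # approx. -0.001
print("power-law decay:  ", lyapunov_estimate(d_pow, t)[-1])   # approx. 0
```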
In order to define criticality, we use the following definition.
Definition 1. A reservoir that is uniformly state contracting shall be called critical if for at least one input sequence there is at least one Lyapunov exponent that is zero.
The echo state network (ESN) is an implementation of reservoir dynamics as outlined in Equation (3). Like other reservoir computing approaches, the system is intended to resemble the dynamics of a biologically inspired recurrent neural network. The dynamics can be described for discrete time-steps t with the following equations:
x_t = θ(W x_{t-1} + w^in u_t),
y_t = W^out x_t.
With regard to the transfer function θ, it shall be assumed that it is continuous, differentiable, transcendental and monotonically increasing, with a slope that nowhere exceeds 1, which is compatible with the requirement that θ fulfills Lipschitz continuity with Lipschitz constant 1. Jaeger's approach uses random matrices for W and w^in; learning is restricted to the output layer W^out. The learning (i.e., training W^out) can be performed by linear regression (cf. Equation (1)).
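A minimal sketch of such an ESN, assuming the standard update x_t = tanh(W x_{t-1} + w^in u_t) (the network size and the spectral-radius scaling of 0.9 are arbitrary illustrative choices), is given below; it also checks the state-contracting behaviour by driving two different initial states with the same input sequence.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100                                      # number of reservoir neurons

W = rng.normal(size=(n, n))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))    # scale the spectral radius to 0.9 (< 1)
w_in = rng.normal(size=n)

def step(x, u):
    # assumed ESN update: x_t = tanh(W x_{t-1} + w_in u_t)
    return np.tanh(W @ x + w_in * u)

x1, x2 = rng.normal(size=n), rng.normal(size=n)   # two different initial states
for _ in range(200):
    u = rng.uniform(-1, 1)                        # the same input for both copies
    x1, x2 = step(x1, u), step(x2, u)

# If the network has echo states, the influence of the initial state dies out and the
# two trajectories (driven by identical input) typically end up very close together.
print("distance after 200 steps:", np.linalg.norm(x1 - x2))
```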
The ESN fulfills the echo state condition (i.e., it is uniformly state contracting) if certain restrictions on the connectivity of the recurrent layer apply, for which one can name a necessary condition and a sufficient condition:
C1 A network has echo states only if
|λ_max(W)| < 1, (9)
i.e., the absolute value of the biggest eigenvalue of W is below 1. The condition means that a network is not an ESN if |λ_max(W)| > 1.
C2 Jaeger initially named here the condition
s_max(W) < 1, (10)
where s_max is the largest entry of the vector s of singular values of the matrix W. However, a closer sufficient condition has been found in [13]. Thus, it is already sufficient to find a full rank matrix D for which
s_max(D W D^{-1}) < 1. (11)
The authors of [14] found another formulation of the same constraint: the network with internal weight matrix W satisfies the echo state property for any input if W is diagonally Schur stable, i.e., if there exists a positive definite diagonal matrix P such that W^T P W − P is negative definite.
Apart from the requirement that a reservoir has to be uniformly state contracting, the learning process itself is not of interest in the scope of this paper.
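As a quick numerical illustration (a sketch; the random matrix and its scaling are arbitrary), one can evaluate the necessary condition C1 and Jaeger's original sufficient condition for a given weight matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
W = rng.normal(size=(n, n)) / np.sqrt(n)      # a random recurrent weight matrix

spectral_radius = max(abs(np.linalg.eigvals(W)))           # |lambda_max|, condition C1
largest_singular = np.linalg.svd(W, compute_uv=False)[0]   # s_max, condition C2

print("C1 (necessary):  |lambda_max| =", spectral_radius, "< 1:", spectral_radius < 1)
print("C2 (sufficient): s_max        =", largest_singular, "< 1:", largest_singular < 1)
# Typically s_max > |lambda_max|, so the sufficient condition is stricter than the
# necessary one; the gap between the two is exactly what the text discusses.
```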
3. Critical Reservoirs with Regard to the Input Statistics
Various ideas on what types of reservoirs work better than others have been brought up. One can try to keep the memories of the input history in the network for as long as possible. The principal idea is to tune the network's recurrent connectivity to a level where the convergence for a subset of input sequences ū with regard to Equation (6) is a power law,
d(x_t, x'_t) ∝ t^{−a}, (12)
rather than exponential,
d(x_t, x'_t) ∝ exp(−c t), (13)
where a and c are system specific values, i.e., they depend on the system.
A network according to Equation (12) is still an ESN since it fulfils the ESP. Still, forgetting of initial states is not bound to a certain time scale. Remnants of information can, under certain circumstances, remain for virtually infinitely long times within the network, given that not too much unpredictable new input enters the network. Lyapunov analysis of a time series according to Equation (12) would result in zero, while Lyapunov analysis of Equation (13) yields a nonzero result.
In ESNs, forgetting according to the power law of Equation (12) for an input time series ū is achievable if the following constraints are fulfilled:
The recurrent connectivity W, the input connectivity w^in of the ESN and the transfer function θ have to be arranged in a way that, if the ESN is fed with the expected input sequence ū, the slope of the transfer function at the operating points approximates 1. Thus, the aim of the training is
θ'(W x_{t-1} + w^in u_t) = 1 for all t. (14)
Since the ESN has to fulfil Lipschitz continuity with Lipschitz constant 1, the points where θ' = 1 have to be inflection points of the transfer function. In the following, these inflection points shall be called epi-critical points (ECPs).
The recurrent connectivity of the network is to be designed in a way that the largest absolute eigenvalue and the largest singular value of W are both equal to one. This can be done by using normal matrices for W (see Section 3.5).
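One convenient way to obtain such a matrix numerically (a sketch that uses an orthogonal matrix, which is one particular kind of normal matrix) is via a QR decomposition of a random matrix; for an orthogonal W, all singular values equal 1 and all eigenvalues lie on the unit circle, so the largest absolute eigenvalue equals the largest singular value.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50

# The QR decomposition of a random matrix yields an orthogonal (hence normal) matrix Q.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))

print("normal (commutes with its transpose):", np.allclose(Q @ Q.T, Q.T @ Q))
print("largest |eigenvalue|:  ", max(abs(np.linalg.eigvals(Q))))          # 1.0
print("largest singular value:", np.linalg.svd(Q, compute_uv=False)[0])   # 1.0
```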
3.1. Transfer Functions
The standard transfer function that is widely used in ESNs and other types of neural networks is the sigmoid function, here in the form of the hyperbolic tangent, i.e.,
θ(x) = tanh(x). (15)
This sigmoid transfer function has exactly one ECP, at x = 0.
As a second possible transfer function, one may consider the function defined in Equation (16). Here, one has an infinite set of ECPs, indexed by an integer n. It is important to have more than one ECP in the transfer function because, for a network with discrete values, it appears necessary that each neuron has at least 2 possible states in order to allow any inference from the past to the future. In the case of the hyperbolic tangent of Equation (15), the only solution of Equation (14) is a vanishing argument of the transfer function, i.e., W x_{t-1} + w^in u_t = 0. That type of trained network cannot infer information from the past to the present for the expected input, which significantly restricts its capabilities. One can see this from
x_t = tanh(W x_{t-1} + w^in u_t) = tanh(0) = 0.
The next iteration yields
x_{t+1} = tanh(W x_t + w^in u_{t+1}) = tanh(w^in u_{t+1}),
which again, after training, would require w^in u_{t+1} = 0. Thus, the network can only anticipate the same input at all time steps.
In the case of Equation (16), the maximal derivative θ' = 1 is attained at arguments that are indexed by an integer number n (cf. Figure 1). Here, the main advantage is that there exists an infinite set of epi-critical points. However, all these points are positioned along a linear function. This setting still significantly restricts the training of the network. Here one can consider the polynomial with the lowest possible degree (cf. the green line in Figure 1, left side) that interpolates between the epi-critical points (in the following called the epi-critical transfer function). In the case of Equation (16), the epi-critical transfer function is a linear function. Thus, the effective dynamics of the trained reservoir on the expected input time series is, if this is possible, the dynamics of a linear network. This results in a very restricted set of trainable time series.
As an alternative, one could also consider a transcendental function for the interpolation between the points, such as the one depicted in Figure 2. The true transfer function (blue line in Figure 2) can be constructed in the following way: around a set of defined epi-critical points, θ is defined piecewise by one of two expressions. This is one conceptual suggestion for further investigations. The result is a transfer function with the defined epi-critical points, and the epi-critical transfer function in this case is itself a transcendental function.
3.2. Examples Using a Single Neuron as a Reservoir
In this and the following sections, practical examples are brought up where a single neuron represents a reservoir. Single neuron reservoirs have been studied in other works [15,16]. Here the intention is to illustrate the principal benefits and other features of critical ESNs.
First, one can consider a neuron with tanh as a transfer function along with a single input unit,
x_{t+1} = tanh(b x_t + u_{t+1}), (17)
where, in order to achieve a critical network, the recurrent weight b has to be equal to 1, i.e., the network exactly fulfills the boundary condition. From the previous considerations one knows that the trained network has the dynamics that results from u_t = 0 as an input. In this case, the only fixed-point of the dynamics is x = 0, which is also the epi-critical point.
Power law forgetting: Starting from two different initial values x_0 and x'_0, one can see that the two networks converge to zero in a power law manner. On the other hand, for a linear network with the same recurrent weight b = 1 and the same initial conditions, the dynamics of the two networks never converge (independently of the input). Instead, the difference between x_t and x'_t stays the same forever. Thus, the network behavior in the case of b = 1 depends on the nature of the transfer function. For all other values of b, both transfer functions result qualitatively in the same behavior in dependence on b: either they diverge or they converge exponentially. Since b = 1 is also the border between convergence and divergence, and thus the border between uniformly state contracting networks and networks that are not uniformly state contracting, the case of b = 1 is a critical point of the dynamical system, in a similar manner as a critical point at the transition from ordered dynamics to instability. In the following, the intention is to work out, for different transfer functions, whether the critical point still results in a uniformly state contracting network or not.
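A short side calculation (not taken from the paper, but a standard continuum approximation of the tanh map near its fixed point) indicates why the convergence at b = 1 is a power law rather than exponential: for small x one has tanh(x) ≈ x − x³/3, so that
x_{t+1} − x_t ≈ −x_t³/3, i.e., dx/dt ≈ −x³/3, and hence x_t ≈ sqrt(3/(2t)) for large t.
The state, and likewise the distance between two nearby trajectories, therefore decays only like t^(−1/2), a power law that corresponds to a Lyapunov exponent of zero.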
As a final preliminary remark, it has to be emphasized that a network being uniformly state contracting means that the states are contracting for any kind of input ū. It does not mean that for any kind of input the contraction follows a power law. In fact, for all input settings u_t ≠ 0, the contraction is exponential for the neuron of Equation (17), even in the critical case (b = 1).
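The following short simulation sketches this behaviour, under the assumption that the one-neuron update of Equation (17) has the form x_{t+1} = tanh(b x_t + u_t) (the exact notation of the equation is an assumption here): at b = 1 with zero input, the distance between two trajectories shrinks only polynomially, while for b < 1 it collapses exponentially fast.

```python
import numpy as np

def run(b, x0, u, steps):
    # assumed form of the one-neuron reservoir of Equation (17): x_{t+1} = tanh(b*x_t + u_t)
    xs = [x0]
    for t in range(steps):
        xs.append(np.tanh(b * xs[-1] + u[t]))
    return np.array(xs)

steps = 10000
u = np.zeros(steps)          # the expected (here: zero) input

for b in (1.0, 0.9):
    d = np.abs(run(b, 0.9, u, steps) - run(b, -0.5, u, steps))
    print(f"b={b}: |x-x'| at t=10: {d[10]:.2e}, t=100: {d[100]:.2e}, t=10000: {d[10000]:.2e}")
# For b = 1 the distance shrinks only polynomially and is still visible after 10^4 steps;
# for b = 0.9 it collapses exponentially fast.
```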
3.3. Single Neuron Network Example with Alternating Input and Power Law Forgetting
As outlined above, for practical purposes the sigmoid function, i.e., the hyperbolic tangent of Equation (15), is not useful for critical networks because the only critical state occurs when the total activity of such a network is null. In that case it is not possible to transfer information about the input history. The reason is illustrated in the following example, where instead of a sigmoid function other types of transfer functions are used.
So, consider a one-neuron network with a constant recurrent weight b, the transfer function of Equation (16) and an expected alternating input. This network has an attractor state in which the activity x_t alternates as well, independently of b. Thus, the alternating sequence shall be interpreted as the expected input, which directs the activity of the network exactly to points where θ' = 1 and thus induces critical dynamics.
It is now interesting to investigate the convergence behavior for different values of b, considering differing starting values for the internal states x_0 and x'_0.
Figure 3 depicts the resulting Lyapunov exponents for one-neuron networks with different values of b. One can see that, not surprisingly, the Lyapunov exponent for b = 1 is zero. This is the critical point that marks the transition from order to instability in the system and, at the same time, the transition from ESP to non-ESP.
In the following, the same network at b = 1 is investigated. Some unexpected input, i.e., an input that deviates from the expected alternating sequence, lets the network jump out of the attractor. If the input afterwards continues as expected, the network slowly returns to the attractor state in a power law fashion. Thus, the difference between the disturbed and the undisturbed time series decays according to a power law t^{−a}, where a is a constant value. Note that the state of the single neuron contains all information of the network history and that the network was simulated using IEEE 754-2008 double precision variables, which have a memory size of 64 bits on Intel architecture computers. Although floating point variables are organized in a complicated way of three parts, the sign, the exponent and the mantissa, it is clear that the total reservoir capacity cannot exceed those 64 bits, which means that in the limit a one-neuron reservoir cannot remember more than 64 binary i.i.d. random numbers.
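For reference, a side note that is not part of the paper's argument: the bit layout of an IEEE 754 double can be inspected directly, which is where the bound of 64 bits per state variable comes from.

```python
import numpy as np

info = np.finfo(np.float64)
print("bits per state variable:", np.dtype(np.float64).itemsize * 8)   # 64
print("mantissa bits:", info.nmant, " exponent bits:", 64 - 1 - info.nmant)
# A single float64 state can hold at most 64 bits of information, no matter how
# the reservoir dynamics encode the input history.
```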
Thus, about 64 iterations after an unexpected input, the difference between the disturbed and the undisturbed network should be annihilated. In other words, if both networks receive the same sequence of unexpected inputs of the same magnitude, the difference between their states should reach virtually 0 within 64 iterations. This consideration can be tested by setting the input to a sequence whose signs form an i.i.d. random list of +1s and −1s of the same magnitude as the expected input.
Figure 4 and Figure 5 depict results from simulations where, after one iteration in which the two networks receive different inputs, both networks again receive the same input. Depicted is the development of the difference between both networks versus the number of iterations. The graphs appear in a double logarithmic fashion, so power law decays appear as straight lines. One can see that networks that receive the alternating, i.e., the expected, input retain the difference for very long time spans (that exceed 64 iterations). On the other hand, if the network input for both networks is identical but i.i.d. random and of the same order of magnitude as the expected input, the difference vanishes within 64 iterations.
The network distinguishes regular input from irregular input. Memories of irregular events persist for a long time in the network, provided that the following input is regular again. How a reservoir can be trained to anticipate certain input statistics has been discussed in [7]. Additional solutions to this problem are subject to further investigations.
3.4. Relation to “Near Edge of Chaos” Approaches
It is a common experience that, in spite of the given theoretical limits for the ESP, the recurrent connectivity can be chosen significantly stronger than these limits for many practical input statistics. Such over-tuned ESNs in many cases show a much better performance than those that actually obey Jaeger's initial ESP limit. So, recently researchers came up with theoretical insights with regard to ESPs that are subject to a network and a particular input statistic [6]. In the scope of this work, instead of defining the ESP for a network alone, the ESP is always defined in relation to a network and an input statistic. Similar efforts have been undertaken in the field of so-called liquid state machines [17,18].
One may assume that those approaches show properties similar to the one presented here. However, for a good reason those approaches are all called 'near edge of chaos' approaches. In order to illustrate the problems that arise from them, one may consider what happens if such over-tuned ESNs are set exactly to the critical point. Here, just for general understanding, one may again consider a one-neuron network with tanh as a transfer function. Note that the ESP limit outlined above requires that the recurrent connectivity should be smaller than 1. As an input time series one can use the alternating input from the previous section. Slightly tedious but basically simple calculus results in a critical value of the input amplitude for a recurrent connectivity that is larger than 1. In this situation one can test for the convergence of two slightly different initial conditions and obtain a power law decay of the difference. However, setting the amplitude of the input just a tiny bit higher is going to result in two diverging time series. If the conditions of the ESN are chosen to be exactly at the critical point, it is possible that an untrained input sequence very close to the trained input sequence turns the ESN into a state where the ESP is not fulfilled anymore (for a related and more detailed discussion with a numerical analysis, confer [19]).
For this reason, all those networks have to be chosen with a significant margin away from the edge of instability. That is very different from the approach in the previous section: there, although the expected input sequence for the network is exactly at the critical point, all other input sequences result in a stable ESN, in most cases with exponential forgetting.
3.5. How the One-Neuron Example Can Be Extended to Multi-Neuron Networks: Normal Matrices
A normal matrix W commutes with its own transpose, i.e., W W^T = W^T W. For a normal matrix W, it can easily be shown that the singular values are the absolute values of the eigenvalues. The spectral theorem applies to these matrices; the largest absolute eigenvalue is the same as the largest singular value, which makes a clear theoretical separation between networks that are uniformly state contracting and those that are not compatible with the echo state condition. Still, for normal matrices, all previously known echo state conditions do not determine to which of those two groups the critical point itself belongs.
Summarizing, all previous works result in theorems for an open set of conditions that are defined by the strict inequalities of Equations (9) and (11). In the closest case, the case of normal matrices, when considering the singular condition
s_max(W) = 1, (21)
there is no statement of the above mentioned theorems as to whether the network is uniformly state contracting.
Some simple, preliminary numerical tests reveal that, in the case of networks that satisfy Equation (21), the further development of the network strictly depends on the exact shape of the transfer function.
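A quick numerical sketch of this dependence (illustrative assumptions: an orthogonal W with largest singular value exactly 1, zero input, and two transfer functions to compare): with tanh the two trajectories still approach each other, whereas with a purely linear transfer function the distance never shrinks, since an orthogonal W is an isometry.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
W, _ = np.linalg.qr(rng.normal(size=(n, n)))   # normal matrix with s_max = |lambda_max| = 1

def evolve(theta, x, steps=5000):
    for _ in range(steps):
        x = theta(W @ x)                       # zero input, for simplicity
    return x

for name, theta in [("tanh  ", np.tanh), ("linear", lambda z: z)]:
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    d0 = np.linalg.norm(x1 - x2)
    d = np.linalg.norm(evolve(theta, x1) - evolve(theta, x2))
    print(f"{name}: initial distance {d0:.3f} -> after 5000 steps {d:.3f}")
# With tanh the distance still shrinks (slowly) at the critical point; with a linear
# transfer function the orthogonal W preserves the distance exactly.
```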
4. Echo State Condition Limit with Weak Contraction
In the previous section, it has been shown how power law forgetting may occur in an ESN type neural network. These networks are all tuned to the point where the largest singular value, and for normal matrices also the largest absolute eigenvalue, is equal to 1. For this tuning it is still undetermined whether the ESP is fulfilled or not, even if normal matrices or one-neuron RNNs are used. The current section is dedicated to determining under which conditions Jaeger's ESP can be extended to this boundary condition.
Jaeger's sufficient echo state condition (see [4], App. C, p. 41) has strictly been proven only for non-critical systems (largest singular value smaller than 1) and with tanh as a transfer function. The original proof is based on the fact that tanh in combination with a largest singular value smaller than 1 yields a contraction. In that case Jaeger shows an exponential convergence.
The core of all considerations of a sufficient condition is to give a close upper estimate of the distance between the next iterations of two different states x_t and x'_t. The estimate is of the form
d(x_{t+1}, x'_{t+1}) ≤ c · d(x_t, x'_t),
where the parameter c is basically quantified by the nature of the transfer function and the strength of the connectivity matrix. The estimate has to be good enough that its iterative application results in a convergence to 0:
lim_{t→∞} d(x_t, x'_t) = 0.
This is equivalent to investigating a series d_t with d_0 = d(x_0, x'_0) and d_{t+1} = c · d_t, which can prove that the requirement of uniform state contraction (cf. Equation (6)) is fulfilled. For example, consider the case of a reservoir with one neuron as described in Equation (17). Here the challenge is to find an estimator for c such that the distance estimate holds, where the chosen c still respects the limit in Equation (22). For b < 1, convergence can be proven easily: since |tanh(a) − tanh(a')| ≤ |a − a'|, one obtains
d(x_{t+1}, x'_{t+1}) = |tanh(b x_t + u_{t+1}) − tanh(b x'_t + u_{t+1})| ≤ b · d(x_t, x'_t).
Thus, for b < 1, one can easily define c = b, so that
d(x_t, x'_t) ≤ b^t · d(x_0, x'_0),
and Equation (23) is fulfilled. The convergence is exponential. The arguments so far are analogous to the core of Jaeger's proof for the sufficient echo state condition C2, restricted to one dimension.
For b = 1, this argument does not work anymore. Obviously, Jaeger's proof is not valid under these circumstances. However, the initial theorem can be extended. As a prerequisite, one can replace the constant c with a function that depends on the current distance as an argument.
So one can try an ansatz
c = c(d(x_t, x'_t)),
where the parameters of the function have to be defined appropriately. This works for small values of the distance. However, it is also necessary to name a limit for large distances. This leads to the definition of the cover function in Equation (24).
Three things have to be done to check this cover function:
First of all, one needs to find out if indeed the cover function fulfills the required estimate. In order to keep the proof compatible with the proof for multiple neurons in this work, one has to choose a slightly different application of the cover function, which serves the same purpose and is much more convenient for multiple neurons. In Appendix C one can find a recipe for this check.
Second, one has to look for the convergence of the resulting series of distances once the distances have become small. The analysis is done in Appendix A.
Third, one needs to check the convergence as long as the distance is still large. Since in this regime the factor is positive, smaller than one and constant, the convergence process is obviously exponential.
Note that the next section's usage of the cover function differs slightly even though it has the same form as Equation (24).
5. Sufficient Condition for a Critical ESN
The content of this section is a replacement of the condition C2 in which the validity of the ESP is inferred for a largest singular value equal to 1.
Theorem 1. If the hyperbolic tangent or the function of Equation (16) is used as the transfer function, the echo state condition (see Equation (6)) is fulfilled even if the largest singular value of W is equal to 1.
Summary of the Proof. As an important precondition, the proof requires that both transfer functions fulfill an estimate of the cover-function type introduced above, defined for a network with k hidden neurons. Here, the constant parameters of the cover function are determined by the transfer function and the metric norm. ☐
In Appendix C it is shown that indeed both transfer functions fulfil that requirement. It then remains to prove that in the slowest case we have a convergence process with 2 stages. In the first stage, as long as the distance between the states is still large, there is a convergence that is faster than or equal to an exponential decay. The second stage is a convergence process that is faster than or equal to a power law decay.
Proof. Note the remarks above with regard to the test function. In analogy to Jaeger, one can now check the contraction between the time steps t and t + 1.
One can rewrite the distance between the two successor states in terms of the difference of the arguments of the transfer function. Next, one can consider that the recurrent matrix can be decomposed by using a singular value decomposition (SVD), which yields W = U S V^T. Note that both U and V are orthogonal matrices and that S is diagonal with positive values s_i on the main diagonal. We consider
W (x_t − x'_t) = U S V^T (x_t − x'_t).
Because U is an orthogonal matrix, the left side of the equation above is a rotation of S V^T (x_t − x'_t), and the length ||W (x_t − x'_t)|| is the same as ||S V^T (x_t − x'_t)||. One can write
||S V^T (x_t − x'_t)||² = Σ_i s_i² δ_i²,
where the δ_i are the entries of the vector V^T (x_t − x'_t). Since Σ_i δ_i² = ||x_t − x'_t||² and V^T is again a rotation matrix, one obtains an estimate in terms of the singular values, where s_i is the i-th component of the diagonal matrix S, i.e., the i-th singular value. In the following we define s_max as the largest singular value and calculate the contraction estimate of Equation (29). Merging Equations (27) and (29) results in the inequality of Equation (30).
First, assuming s_max < 1, and since we know that the transfer function is Lipschitz continuous with constant 1, we get an exponential decay
d(x_{t+1}, x'_{t+1}) ≤ s_max · d(x_t, x'_t).
This case is handled by Jaeger's initial proof. With regard to an upper limit of the contraction speed (cf. Equation (6)), one finds an exponential bound with decay factor s_max.
If the largest singular value s_max is larger than 1, then for some types of connectivities (i.e., normal matrices) the largest absolute eigenvalue is also larger than 1 due to the spectral theorem. In this case, the echo state condition is not always fulfilled, which has been shown also by Jaeger.
What remains is to check the critical case s_max = 1. Here again one can discuss two different situations (or rather two separate phases of the convergence process) separately:
If the distance is still large, we can write the update inequality of Equation (30) as a contraction with a constant factor that is smaller than one. Thus, for all iterations of this first phase, the slowest decay process is covered by an exponential decay.
If the distance has become small, Equation (30) turns into an update rule in which the contraction factor approaches one as the distance shrinks. One can substitute variables and again consider the resulting sequence, which is discussed in Appendix A. The result there is that the sequence converges at least as fast as an explicit power law.
Note that, although the Lyapunov exponent (cf. Equation (7)) of such a sequence is zero, the sequence still converges in a power law fashion. Thus, d(x_t, x'_t) converges to zero, and thus the ESP has been proven. ☐
Moreover, one can calculate the upper time limit τ that is needed to reach a given accuracy ε from the bounds above.
6. Summary, Discussion and Outlook
The background of this paper is the investigation of the limit of recurrent connectivity in ESNs. The preliminary hypothesis towards the main work can be summarized with Figure 6. Initially, it is hard to quantify the transition point between uniformly state contracting and non-state contracting ESNs exactly. However, for normal matrices and one-neuron networks, the gap between the sufficient condition and the necessary condition collapses in a way that there are two neighboring open sets. The first open set is known to have the ESP, and the other open set evidently does not have the ESP. What remains is the boundary set. The boundary set is interesting to analyze because it can easily be shown that power law forgetting can occur there.
The proof of Section 5 shows that a network is an ESN even if the largest eigenvalue of the recurrent connectivity matrix is equal to 1 and if the transfer function is either that of Equation (15) or that of Equation (16). The proof is also extensible to other transfer functions. On the other hand, it is obvious that some transfer functions result in networks that are not ESNs. For example, a network with a linear transfer function (θ(x) = x) is not state contracting.
Even if the network is state contracting, it is not necessarily exponentially uniformly state contracting. Its rate of convergence might follow a power law in the slowest case. Several examples of power law forgetting have been shown in the present work. More examples of preliminary learning have been outlined in [7]. One important target of the present research is to allow for a kind of memory compression in the reservoir by letting only the unpredicted input enter the reservoir.
One ultimate target of the present work is to find a way to organize reservoirs as recurrent filters with a memory compression feature. In order to bring concepts of data compression into the field of reservoir computing, and in order to project as much as possible of the input history onto a reservoir of limited size, principles of memory compression have to be transferred into reservoir computing. However, reservoir computing techniques that are analogous to classic memory compression have not been identified so far.
Another topic that needs further investigation is the entropy of time series. Power law forgetting is only possible if the time series that relates to the criticality is either of finite entropy, i.e., from a certain point in time all following entries of the time series can be predicted from the previous entries, or if the network simply ignores certain aspects of the incoming time series.
There are also potential analogies in biology. Several measurements of memory decay in humans exist which reveal that forgetting follows a power law, at least for a large fraction of the investigated examples [20,21].