Integral Neuron: A New Concept for Nonlinear Neuron Modeling Using Weight Functions. Creation of XOR Neurons

Yotov, Kostadin; Hadzhikolev, Emil; Hadzhikoleva, Stanka

doi:10.3390/math12243982


by Kostadin Yotov, Emil Hadzhikolev and Stanka Hadzhikoleva *

Faculty of Mathematics and Informatics, University of Plovdiv Paisii Hilendarski, 236 Bulgaria Blvd., 4027 Plovdiv, Bulgaria

* Author to whom correspondence should be addressed.

Mathematics 2024, 12(24), 3982; https://doi.org/10.3390/math12243982

Submission received: 7 November 2024 / Revised: 8 December 2024 / Accepted: 16 December 2024 / Published: 18 December 2024

(This article belongs to the Section E: Applied Mathematics)


Abstract

In the present study, an extension of the idea of dynamic neurons is proposed by replacing the weights with a weight function that is applied simultaneously to all neuron inputs. A new type of artificial neuron called an integral neuron is modeled, in which the total signal is obtained as the integral of the weight function. The integral neuron extends traditional neurons by allowing the signal shape to be either linear or nonlinear. The training of the integral neuron involves finding the parameters of the weight function, whose functional values directly influence the total signal in the neuron’s body. This article presents theoretical and experimental evidence for the applicability and convergence of standard training methods such as gradient descent, Gauss–Newton, and Levenberg–Marquardt in searching for the optimal weight function of an integral neuron. The experimental part of the study demonstrates that a single integral neuron can be trained on the logical XOR function, something that is impossible for single classical neurons due to the linear nature of the summation in their bodies.

Keywords: integral neuron; XOR neuron; weight function; integral neural network; dynamic artificial neural networks

MSC: 68T01

1. Introduction

The key characteristics of dynamic neural networks (DNNs) are their changing structure, temporal dependency, greater expressive power, and better adaptability than static artificial neural networks (ANNs). Regarding their changing structure, it should be noted that the number of neurons, the connections between them, and even the entire architecture of a DNN can change during training or operation [1,2,3]. Their temporal dependency is linked to the fact that these DNNs can account for previous states and predict future events, making them particularly useful for tasks in meteorology [4,5,6,7], financial market analysis [8,9,10,11], speech recognition [12,13,14,15], and even robot control [16,17,18,19]. The greater expressive power lies in the fact that DNNs can represent more complex functions and patterns than static ANNs, while their better adaptability allows them to adjust to changing conditions and learn from new data more effectively. However, the strengths of DNNs also contribute to some challenges in their use. For example, designing and training DNNs is more complex compared to static ANNs, and both the training and subsequent operation often require more computational resources.

This article presents the stages of modeling a new type of neuron, the integral neuron, which uses a weight function instead of fixed weights, as well as the implementation of an integral XOR neuron that can be realized with different weight and transfer functions. The main sections of the article, following the introduction, are as follows:

In the second part of the study, various DNNs are examined. The idea is proposed that a new type of dynamic behavior can be created by replacing the static weights determined during the training of ANNs with weight functions, allowing for the creation of a nonlinear signal within the neuron.
In the third part, definitions are proposed for a weight function, which is used instead of static weights within the neuron, and an integral neuron, featuring an integral function in its body that uses the weight function. Evidence is provided for the applicability of training methods such as gradient descent, Gauss–Newton, and Levenberg–Marquardt in the search for the optimal weight function.
The fourth part—Discussion—examines the similarities and differences between classical neurons and the newly proposed integral neurons and provides directions for potential developments in the topic of artificial integral structures. Definitions are provided for integral and integral–classical neural networks, among others.
In the experimental fifth part, the possibility of creating a single integral neuron that solves the XOR logical function, which is unachievable by standard artificial neurons, is explored. Theoretical and experimental evidence for the creation of integral XOR neurons, trained with several different weight and transfer functions, is presented. The MATLAB v2018a scripts used for creating the integral XOR neurons are provided in Appendix A.

2. Dynamic Artificial Neural Networks

The main types of DNNs are Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, Gated Recurrent Units (GRUs), Time-Delay Neural Networks (TDNNs), Echo State Networks (ESNs), and Temporal Convolutional Networks (TCNs) [20,21].

2.1. Recurrent Neural Networks

RNNs contain feedback loops in their architecture which allow information from previous time steps to be retained and used for current computations. These loops cyclically pass information through time, as RNNs maintain an internal state $h_{t}$, which is updated at each time step $t$. This state is used to store information from previous inputs.

In standard feedforward ANNs, neurons in one layer transmit signals only to the neurons in the next layer. The weight matrices $W$ determine the connections between the layers, and the states of the network (the outputs from the neurons) are

h = g (W x + b),

(1)

where

$x$ is the input vector;
$b$ is the bias vector (thresholds);
$g$ is the transfer function.

In contrast, RNNs have an additional aspect: feedback over time. At each time step, RNNs take into account not only the current input $x_{t}$ but also the previous state $h_{t - 1}$ [22,23]. This allows the network to have memory and retain information across time steps. At each time step $t$, the RNN performs the following computation:

h_{t} = g (W_{h} h_{t - 1} + W_{x} x_{t} + b),

(2)

where

$h_{t}$ is the hidden state of the network at time step $t$ ;
$h_{t - 1}$ is the hidden state of the network at the previous time step;
$x_{t}$ is the current input;
$W_{h}$ and $W_{x}$ are the weight matrices.

The weight matrix $W_{h}$ processes the previous state $h_{t - 1}$ and provides the information accumulated from prior time steps. This is the key element that gives RNNs the ability to remember past information. The weight matrix $W_{x}$ processes the current input $x_{t}$, which is analogous to what happens in standard ANNs. It handles the processing of new information that enters at each time step. This structure allows RNNs to combine new information with the accumulated memory from previous time steps, making it possible to recognize temporal dependencies. It should be noted that deep RNNs include multiple recurrent layers, with each layer receiving input from the previous one. This enables the network to capture more complex dependencies in the data. If there are two recurrent layers, the computations are performed with the following equations, which characterize the states (outputs) of the layers:

\left\{\begin{array}{l} h_{t}^{(1)} = g (W_{h}^{(1)} h_{t - 1}^{(1)} + W_{x}^{(1)} x_{t} + b^{(1)}) \\ h_{t}^{(2)} = g (W_{h}^{(2)} h_{t - 1}^{(2)} + W_{x}^{(2)} h_{t}^{(1)} + b^{(2)}) \end{array}\right. ,

(3)

where

$h_{t}^{(1)}$ and $h_{t}^{(2)}$ are the hidden states of the first and second layers, respectively;
$W_{h}^{(1)}$ , $W_{x}^{(1)}$ , $W_{h}^{(2)}$ , and $W_{x}^{(2)}$ are the weight matrices for the respective layers;
$b^{(1)}$ and $b^{(2)}$ are the bias vectors (containing thresholds) for the respective layers;
$g$ is the transfer function.

Examples of the dynamic capabilities of RNNs can be found in [24,25,26,27].
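The recurrence in Equation (2) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the helper names `matvec` and `rnn_step`, the weight values, and the choice of the hyperbolic tangent for $g$ are assumptions made for the example and do not come from the cited works.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(W_h, W_x, b, h_prev, x_t, g=math.tanh):
    """One update h_t = g(W_h h_{t-1} + W_x x_t + b), as in Equation (2)."""
    rec = matvec(W_h, h_prev)   # contribution of the previous state (memory)
    inp = matvec(W_x, x_t)      # contribution of the current input
    return [g(r + i + bias) for r, i, bias in zip(rec, inp, b)]

# Arbitrary illustrative weights: 2 hidden units, 1 input feature.
W_h = [[0.5, -0.3], [0.2, 0.1]]
W_x = [[1.0], [-1.0]]
b = [0.0, 0.0]

h = [0.0, 0.0]
states = []
for x_t in ([1.0], [0.0], [0.0]):   # a single pulse followed by zero input
    h = rnn_step(W_h, W_x, b, h, x_t)
    states.append(h)
```

Although the input is zero after the first step, the later states remain nonzero because $W_{h}$ feeds the previous state back into the update, which is the memory effect described above.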

2.2. Long Short-Term Memory Networks

LSTM networks are a type of RNN specifically designed to address the vanishing gradient problem in RNNs by introducing memory cells and gates that control the flow of information. The main components of LSTM are the forget gate, input gate, memory cell, and output gate, with dynamic behavior achieved through more complex equations that involve memory cells and gates controlling the flow of information [28,29]:

\left\{\begin{array}{l} f_{t} = g (W_{f} x_{t} + U_{f} h_{t - 1} + b_{f}) \\ i_{t} = g (W_{i} x_{t} + U_{i} h_{t - 1} + b_{i}) \\ o_{t} = g (W_{o} x_{t} + U_{o} h_{t - 1} + b_{o}) \\ \tilde{c_{t}} = g (W_{c} x_{t} + U_{c} h_{t - 1} + b_{c}) \\ c_{t} = f_{t} \odot c_{t - 1} + i_{t} \odot \tilde{c_{t}} \\ h_{t} = o_{t} \odot g (c_{t}) \end{array}\right. .

(4)

Here,

$f_{t}$ is the state of the forget gate;
$i_{t}$ and $o_{t}$ are the input and output gates, respectively;
$\tilde{c_{t}}$ is the proposed new value for the cell, i.e., the cell gate activation vector;
$c_{t}$ is the current cell state;
$h_{t}$ is the current hidden state.

In all the presented equations, the subscript $t$ indicates the time step, $g$ is a sigmoid-like transfer function (most commonly the hyperbolic tangent), and $\odot$ represents the Hadamard product, a binary operation that takes two matrices of the same dimensions and returns the matrix of their element-wise products. These equations demonstrate how LSTM networks can retain and manipulate information over long time periods by using complex memory management mechanisms. In their work, Zhumei Wang et al. use the dynamic characteristics of LSTM networks for accurate traffic forecasting [30], while Ruijie Huang et al. base their model on LSTM dynamics to create a system fed with historical data from 4000 previous days, predicting the production efficiency of a carbonate reservoir well for the next 500 days [31]. The use of dynamics in LSTM models is explored in many other studies [32,33,34,35].
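A minimal sketch of one step of Equation (4) for a cell with a single unit is given below. It assumes, as is common practice but not stated in the text, that the three gates use the logistic function while the candidate value and the output transformation use the hyperbolic tangent; the function name `lstm_step`, the parameter dictionary, and all numeric values are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function, used here for the gates (an assumption)."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(p, x_t, h_prev, c_prev):
    """One scalar LSTM step following the structure of Equation (4).

    p maps each name 'f', 'i', 'o', 'c' to a tuple (W, U, b).
    """
    def pre(name):
        W, U, b = p[name]
        return W * x_t + U * h_prev + b

    f_t = sigmoid(pre('f'))              # forget gate
    i_t = sigmoid(pre('i'))              # input gate
    o_t = sigmoid(pre('o'))              # output gate
    c_tilde = math.tanh(pre('c'))        # proposed new cell value
    c_t = f_t * c_prev + i_t * c_tilde   # products are element-wise (scalars here)
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

# Arbitrary illustrative parameters, identical for all four blocks.
params = {name: (0.5, 0.5, 0.0) for name in 'fioc'}
h, c = 0.0, 0.0
for x in (1.0, -1.0, 0.5):
    h, c = lstm_step(params, x, h, c)
```

In the vector case the scalar products become Hadamard products and `W`, `U` become matrices; the control flow stays the same.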

2.3. Gated Recurrent Units

GRUs are improved versions of earlier LSTM models. They have a simplified structure compared to LSTM, as they use fewer gates and parameters. This makes them easier to train and often faster than LSTM, especially when working with smaller datasets. The GRU cell includes two main gates: the update gate and the reset gate. The update gate helps the cell decide how much of the previous information will be retained in the new state. It combines the functions of the input gate and the forget gate in LSTM, thus controlling the extent to which information will be carried over to future time steps. The reset gate determines how much of the previous information should be forgotten. This gate allows the model to discard irrelevant information, which can be especially useful when modeling time series, where earlier data points may not always be important for understanding later points.

There are different variations of GRUs. For instance, simplified versions such as the Minimal Gated Unit reduce the gating mechanisms to a single gate that controls both the input and the forgetting of information [36,37]. Another variation is the Coupled GRU, where there is a dependency between the reset and update gates, rather than using them as completely independent elements [38]. GRU-D is a modification with decay mechanisms designed to handle time series with missing data [39], among others. The system of Equation (5) presents the operation and key components of the mathematical model of the fully gated version of GRUs [40]:

\left\{\begin{array}{l} z_{t} = \sigma (W_{z} x_{t} + U_{z} h_{t - 1} + b_{z}) \\ r_{t} = \sigma (W_{r} x_{t} + U_{r} h_{t - 1} + b_{r}) \\ \tilde{h_{t}} = \phi (W_{h} x_{t} + U_{h} (r_{t} \odot h_{t - 1}) + b_{h}) \\ h_{t} = (1 - z_{t}) \odot h_{t - 1} + z_{t} \odot \tilde{h_{t}} \end{array}\right. .

(5)

Here,

$x_{t}$ and $h_{t}$ are the input and output vectors, respectively;
$\tilde{h_{t}}$ is the candidate activation vector;
$z_{t}$ and $r_{t}$ are the states of the update and reset gates, respectively.

The transfer functions in this case, $\sigma$ and $\phi$, are the logistic function and the hyperbolic tangent, respectively. The dynamic nature of the architecture and functionality of GRUs makes them attractive tools for various tasks. Examples of their use can be found in numerous studies [41,42,43,44].
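The fully gated update of Equation (5) can be sketched for a single-unit cell as follows. The function name `gru_step`, the parameter layout, and the numeric values are hypothetical and serve only to make the order of operations explicit.

```python
import math

def sigmoid(z):
    """Logistic function (sigma in Equation (5))."""
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(p, x_t, h_prev):
    """One scalar GRU step following Equation (5).

    p maps 'z', 'r', 'h' to tuples (W, U, b).
    """
    W_z, U_z, b_z = p['z']
    W_r, U_r, b_r = p['r']
    W_h, U_h, b_h = p['h']
    z_t = sigmoid(W_z * x_t + U_z * h_prev + b_z)                 # update gate
    r_t = sigmoid(W_r * x_t + U_r * h_prev + b_r)                 # reset gate
    h_tilde = math.tanh(W_h * x_t + U_h * (r_t * h_prev) + b_h)   # candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# With all parameters set to zero, the update gate equals 0.5 and the candidate is 0.
params = {'z': (0.0, 0.0, 0.0), 'r': (0.0, 0.0, 0.0), 'h': (0.0, 0.0, 0.0)}
h_next = gru_step(params, x_t=1.0, h_prev=0.8)
```

With these all-zero parameters the new state is simply half of the previous one, which makes the interpolation role of the update gate $z_{t}$ easy to see.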

2.4. Time-Delay Neural Networks

The main idea behind TDNNs is the introduction of a time delay for the input data, allowing the network to access historical (previous) values for a certain number of time steps. This specific feature makes them particularly suitable for tasks involving time sequences or time-dependent data, such as speech recognition, audio signal processing, and time series analysis. The input data to a TDNN are fed in as consecutive time slices, so at a given time $t$, the neuron receives inputs not only for the current moment but also for the previous moments $(t - 1), (t - 2)$, etc. The output $y (t)$ of a neuron in a TDNN at time $t$ is obtained by applying the transfer function $g$ to the sum of the inputs $x (t - i)$ at the current and the $d$ previous time steps, multiplied by the corresponding weights $w_{i}$, plus the threshold $b$:

y (t) = g (\sum_{i = 0}^{d} w_{i} x (t - i) + b) .

(6)

If multiple layers are considered, the output of a neuron from the $l$-th layer is given by

y_{j}^{(l)} (t) = g (\sum_{i = 0}^{d} w_{i j}^{(l)} y_{j}^{(l - 1)} (t - i) + b_{j}^{(l)}),

(7)

where $y_{j}^{(l - 1)} (t - i)$ are the outputs of the neurons of the previous layer, $b_{j}^{(l)}$ are the corresponding biases, and $g$ is the transfer function. Examples of the use of TDNNs in solving problems related to speech recognition, capturing dependencies between sounds and their temporal structures, as well as audio signal processing, can be found in [45,46,47,48,49,50].
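Equation (6) for a single neuron can be sketched as below. The function name `tdnn_output` and the sample values are hypothetical, and the sketch assumes that the input sequence is stored in a list indexed by the time step and that enough history is available ($t \geq d$).

```python
import math

def tdnn_output(x, t, w, b, g=math.tanh):
    """y(t) = g(sum_{i=0}^{d} w_i * x(t - i) + b), with d = len(w) - 1 (Equation (6))."""
    d = len(w) - 1
    if t < d:
        raise ValueError("not enough past samples for the requested delay")
    return g(sum(w[i] * x[t - i] for i in range(d + 1)) + b)

x = [1.0, 2.0, 3.0, 4.0]   # x(0), x(1), x(2), x(3)
w = [1.0, 0.5]             # weights for x(t) and x(t - 1), i.e., d = 1
y_linear = tdnn_output(x, t=3, w=w, b=0.0, g=lambda s: s)   # identity transfer
y_tanh = tdnn_output(x, t=3, w=w, b=0.0)                    # hyperbolic tangent
```

With the identity transfer function the output at $t = 3$ is $1.0 \cdot 4.0 + 0.5 \cdot 3.0 = 5.5$, which shows directly how the delayed sample enters the sum.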

2.5. Echo State Networks

ESNs are a specific type of DNN primarily used for tasks involving time sequences. They belong to the broader class of reservoir computing and represent an alternative approach to modeling temporal dependencies in data without the heavy optimization of internal weights, as observed in classical RNN models. In practice, ESNs rely on a reservoir of neurons, which is a large, randomly connected recurrent network, where input data are transformed into a more complex, high-dimensional space using nonlinear connections. The key aspect of this type of DNN is that the internal weights in the reservoir are not trained through standard backpropagation algorithms but are randomly initialized and remain fixed during training. Unlike standard RNNs, in ESNs, only the weights of the output layer are trained. These weights connect the reservoir to the output neurons, a feature that greatly simplifies the training process, as only the optimization of linear connections is required, which can be achieved with simple linear regression or other straightforward methods.

The mathematical model of ESNs defines the state of the reservoir $x (t)$ as being updated at each time step $t$ by the following equation:

x (t) = g (W_{res} x (t - 1) + W_{in} u (t) + b),

(8)

where $W_{res}$ is the weight matrix of the reservoir, $W_{in}$ is the weight matrix connecting the inputs to the reservoir, and $u (t)$ and $b$ are the matrices of input data and biases, respectively. The outputs of the ESNs are expressed as a linear combination of the current state of the reservoir:

y (t) = W_{out} x (t),

(9)

where $W_{out}$ represents the output weights, which are optimized during training.

In [51], the authors use ESNs for a new approach to time series data prediction, and Huanling Hu et al. propose a hybrid model incorporating this type of network for forecasting wind speed at the Sotavento Wind Farm in Galicia, northwest Spain [52]. In this model, ESNs are used for prediction in each individual sub-series. Other interesting applications of ESNs can be explored in [53,54,55].
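The division of labor in Equations (8) and (9) can be sketched as follows: the reservoir weights are drawn at random once and then left untouched, and only the linear readout would be fitted. The function names, the reservoir size, the weight ranges, and the placeholder readout weights are all assumptions for illustration; no fitting of $W_{out}$ is performed here.

```python
import math
import random

def esn_states(u_seq, W_res, W_in, b, g=math.tanh):
    """Run x(t) = g(W_res x(t-1) + W_in u(t) + b) over a scalar input sequence (Equation (8))."""
    n = len(W_res)
    x = [0.0] * n
    states = []
    for u in u_seq:
        x = [g(sum(W_res[i][j] * x[j] for j in range(n)) + W_in[i] * u + b[i])
             for i in range(n)]
        states.append(x)
    return states

def readout(W_out, x):
    """Linear output y(t) = W_out x(t) for a single output unit (Equation (9))."""
    return sum(w * xi for w, xi in zip(W_out, x))

rng = random.Random(0)   # fixed seed: the reservoir is random but never trained
n = 5
W_res = [[rng.uniform(-0.3, 0.3) for _ in range(n)] for _ in range(n)]
W_in = [rng.uniform(-1.0, 1.0) for _ in range(n)]
b = [0.0] * n

states = esn_states([0.5, -0.2, 0.1], W_res, W_in, b)
y = readout([0.1] * n, states[-1])   # placeholder W_out; normally fitted by linear regression
```

In an actual application the collected `states` would form the regression matrix and `W_out` would be obtained by least squares against the target outputs.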

2.6. Temporal Convolutional Networks

TCNs are designed for tasks related to processing temporal sequences, based on the principles of convolutional neural networks (CNNs). The use of convolutional layers allows for greater parallelization and faster training, as at any given moment, a TCN examines multiple previous time steps simultaneously. This quality makes them more efficient than traditional recurrent networks, where processing each step depends on the previous one. Additionally, TCNs use dilated convolution, which extends the receptive field of the convolutional filters without necessarily increasing the number of parameters. This allows the TCN to “see” longer time dependencies while remaining efficient. The convolution is causal: it is structured so that the output $y (t)$ depends only on the current and previous inputs $x (t), x (t - 1), \dots, x (t - k + 1)$, and not on future values:

y (t) = g (\sum_{i = 0}^{k - 1} w_{i} x (t - i) + b),

(10)

where $w_{i}$ and $b$ are the filter weights and biases, respectively, and $g$ is the transfer function. TCNs are highly effective for forecasting values from time series, such as financial data predictions [56,57] or climate conditions [58,59]. Additionally, they can be used for tasks related to text processing [60,61] and natural language processing [62,63], as well as for problems involving speech, audio, and video processing [64,65].
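A causal convolution in the sense of Equation (10), extended with an optional dilation factor, can be sketched as follows. The function name `causal_conv` is hypothetical, and treating inputs before the start of the sequence as zero is an assumption of this sketch (other padding schemes are possible).

```python
def causal_conv(x, w, b=0.0, dilation=1, g=lambda s: s):
    """y(t) = g(sum_i w_i * x(t - i * dilation) + b); samples before t = 0 count as zero."""
    y = []
    for t in range(len(x)):
        s = b
        for i, w_i in enumerate(w):
            idx = t - i * dilation
            if idx >= 0:            # never look at future or missing samples
                s += w_i * x[idx]
        y.append(g(s))
    return y

x = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 1.0]                                # filter of length k = 2

y_plain = causal_conv(x, w)                   # dilation 1, matches Equation (10)
y_dilated = causal_conv(x, w, dilation=2)     # same two weights, wider receptive field
```

With dilation 2 the same two-weight filter combines $x (t)$ with $x (t - 2)$ instead of $x (t - 1)$, so the receptive field grows without adding parameters.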

2.7. Conclusions Regarding Dynamic Artificial Neural Networks

The dynamic nature of the DNNs discussed is expressed in their ability to process and model temporal dependencies and sequences in the data. Unlike static ANNs, which accept a fixed set of input data and produce outputs without considering the order or temporal dependencies between input values, DNNs have internal mechanisms that allow for the processing of data that changes over time. Thus, during the training process, DNNs update their parameters (weights and biases) based not only on the current input data but also on the errors accumulated over time. This means that each iteration of training takes into account the temporal sequence of the data.

However, is it possible to add even more to the dynamic aspect? Is it possible, during training, for networks to be equipped with linear or nonlinear weight functions instead of a weight vector, thereby allowing the signal within the neuron’s body and the neuron’s output to change the way they are formed? We provide a positive answer to these questions in the following sections of this article.

3. Integral Neurons with Weight Functions: Training Algorithms for Integral Neurons

In this section, we explore the possibility of creating a new type of dynamic neuron with dynamically changing weights through the definition of weight functions. We propose definitions for the concepts of weight function and integral neuron. The training methods of gradient descent, Gauss–Newton, and Levenberg–Marquardt are examined, with algorithms and theoretical evidence presented for their applicability in finding the optimal weight function for an integral neuron.

3.1. Problem Definition

Let us consider a single artificial neuron influenced by $n$ stimuli $x_{1}, x_{2}, \dots, x_{n}$, where $x_{i} \in R$ (Figure 1).

In its body, a total cumulative signal is formed from all inputs:

S = \sum_{i = 0}^{n} w_{i} x_{i}, \quad w_{i} \in R, \; \forall i = 0, 1, \dots, n,

(11)

where the standard procedure of including the neuron’s threshold $b$ in the total sum is implemented:

x_{0} = 1, w_{0} = b .

(12)

Let a task be set in which the neuron must be trained to approximate the target function $H$ using a set of $m$ training samples:

D = \{[x_{j 0}, x_{j 1}, \dots, x_{j n}; h_{j}]\}_{j = 1}^{m}, \quad x_{j 0} = 1, \; \forall j = 1, 2, \dots, m .

(13)

In dataset (13), $(x_{j 0}, x_{j 1}, \dots, x_{j n})$ is the input set, and $h_{j}$ is the corresponding value of $H$, i.e.,

h_{j} = H (x_{j 0}, x_{j 1}, \dots, x_{j n}), \quad j = 1, 2, \dots, m .

(14)

Classical procedures related to training the neuron are aimed at finding the weight vector

\vec{W} = (w_{0}, w_{1}, \dots, w_{n}), \quad w_{i} = \text{const} \in R, \; \forall i = 0, 1, \dots, n,

(15)

for which the error function

E (w_{0}, w_{1}, \dots, w_{n}) = \frac{1}{m} \sum_{j = 1}^{m} {(h_{j} - g (x_{j 0}, x_{j 1}, \dots, x_{j n}, w_{0}, w_{1}, \dots, w_{n}))}^{2}

(16)

is at its minimum.

Once the desired performance of the neuron is achieved, the specific weight set (15) that achieves this remains unchanged ($\forall w_{i} \in R$), unless the neuron is retrained. Thus, the artificial neuron will respond to any subsequent stimuli on its inputs with the constant weight vector (15) obtained after training.

The question we pose is the following: Is it possible to conduct training that seeks not a weight vector of constants but a function,

w = w (x_{i}), i = 1, 2, \dots, n,

(17)

where the weight for each input will vary each time, depending on the value of the corresponding stimulus (Figure 2)?

Replacing static weights with a weight function would create a new type of artificial neuron and neural structure, where the signal shape within the neuron’s body could now be nonlinear. This enables the development of new, more effective solutions for various tasks. For instance, the fundamental unsolved problem of creating a single XOR neuron with traditional techniques is resolved with these new methods and is presented later in this article.

If we reconsider the standard neuron shown in Figure 2, we will notice that all terms in the sum

S = \sum_{i = 0}^{n} w_{i} x_{i} = x_{0} w_{0} + x_{1} w_{1} + \dots + x_{n} w_{n}

(18)

can be viewed as the areas of rectangles (Figure 3).

Let the variational series obtained from the input values of the neuron be

x_{0}^{'} = x_{\min}, x_{1}^{'}, x_{2}^{'}, \dots, x_{n}^{'} = x_{\max} .

(19)

Then, the total signal (18) within the neuron’s body can be represented as the sum of the areas of adjacent rectangles (Figure 4).

Given that

\sum_{i = 0}^{n} x_{i}^{'} = \sum_{i = 0}^{n} x_{i},

(20)

and if $\Delta x_{i} = x_{i} - x_{i - 1}, i = 1, 2, \dots, n$ and $\lambda = \max \{\Delta x_{1}, \Delta x_{2}, \dots, \Delta x_{n}\} \to 0$, then

S = \int_{0}^{\sum_{i = 0}^{n} x_{i}} w (t) d t .

(21)

As a result, along the axon of a standard neuron with transfer function $g$, the signal has a value infinitely close to

\mathrm{Out} = g (\int_{0}^{\sum_{i = 0}^{n} x_{i}} w (t) d t) .

(22)
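The limiting process behind Equations (21) and (22) can be checked numerically: as the rectangles become narrower, their total area approaches the integral of the weight function up to the sum of the inputs. In the sketch below the weight function $w (t) = 1 + 2 t$, the input values, and the use of equal-width rectangles are assumptions chosen only because the exact integral is easy to verify by hand.

```python
def rectangle_sum(w, X, n_rect):
    """Total area of n_rect adjacent rectangles of equal width under w on [0, X]."""
    dt = X / n_rect
    return sum(w(k * dt) * dt for k in range(n_rect))

def w(t):
    # Hypothetical weight function chosen only for illustration.
    return 1.0 + 2.0 * t

inputs = [1.0, 0.5, 1.5]    # x_0 = 1 plus two stimuli, so the upper limit is 3
X = sum(inputs)
exact = X + X ** 2          # closed-form integral of 1 + 2t from 0 to X

coarse = rectangle_sum(w, X, 10)      # wide rectangles
fine = rectangle_sum(w, X, 1000)      # narrow rectangles, closer to the integral
```

Here the exact value is 12, the ten-rectangle sum is 11.1, and the thousand-rectangle sum differs from 12 by less than 0.01, which mirrors the passage to the limit $\lambda \to 0$.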

Given the training set (13), $D = \{[x_{j 0} = 1, x_{j 1}, \dots, x_{j n}; h_{j}]\}_{j = 1}^{m}$, the system of errors for each of the samples is

\left\{\begin{array}{l} e_{1} = h_{1} - g (\int_{0}^{\sum_{i = 0}^{n} x_{1 i}} w (t) d t) \\ e_{2} = h_{2} - g (\int_{0}^{\sum_{i = 0}^{n} x_{2 i}} w (t) d t) \\ \dots \\ e_{m} = h_{m} - g (\int_{0}^{\sum_{i = 0}^{n} x_{m i}} w (t) d t) \end{array}\right. .

(23)

The mean squared error (MSE) of the network is given by

\mathrm{MSE} = \frac{1}{m} \sum_{j = 1}^{m} e_{j}^{2},

(24)

i.e.,

\mathrm{MSE} = \frac{1}{m} \sum_{j = 1}^{m} {(h_{j} - g (\int_{0}^{\sum_{i = 0}^{n} x_{j i}} w (t) d t))}^{2} .

(25)

The last expression shows that the task of training algorithms for the ANN in this case should not be focused on finding a specific set of weights, as is typical in standard approaches. Instead, the focus should shift towards finding the function $w (t)$ that minimizes the error function (25). Because the constant factor $\frac{1}{m}$ does not change the location of the minimum, we are, in fact, searching for the minimum of the function

E = \sum_{j = 1}^{m} {(h_{j} - g (\int_{0}^{\sum_{i = 0}^{n} x_{j i}} w (t) d t))}^{2} .

(26)

Many training algorithms utilize derivatives of the error function, which in turn imposes the condition of differentiability on the integral function involved in (26). Thus, to ensure the correct calculation of derivatives in training algorithms, it is necessary for the integrand function $w (t)$ to possess a certain degree of smoothness. Specifically, the differentiability of $w (t)$ enables the application of methods such as gradient descent, Gauss–Newton, and Levenberg–Marquardt, which we will demonstrate in this study. For this purpose, we consider it necessary to introduce concepts for a continuously differentiable weight function and an integral neuron.

Let $C^{k}$ be the class of $k$-times continuously differentiable functions, and let $N_{0}$ be the set of natural numbers with zero included, i.e., $N_{0} = N \cup \{0\}$.

Definition 1 (Weight function of a neuron). A continuously differentiable real function $w (t) \in C^{k}, k \in N_{0}$, defining the cumulative signal in the body of an artificial neuron by the equation

S = \int_{0}^{\sum_{i = 0}^{n} x_{i}} w (t) d t,

where $x_{0}, x_{1}, x_{2}, \dots, x_{n}$ are the input stimuli, will be called the weight function for the respective neuron.

Definition 2 (Integral neuron). An artificial neuron that uses a weight function $w (t) \in C^{k}, k \in N_{0}$ to form the signal within its body will be called an integral neuron of class $(k + 1)$.

Let us note that, due to the nature of integration, the class of the integral function $S$, which describes the signal within the neuron’s body, is one degree higher than that of the weight function. In other words, the integration process provides an additional level of smoothness. For example, if a twice-differentiable weight function is chosen, the signal within the neuron’s body will be three times differentiable.
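As a small worked illustration of this gain in smoothness (an example added for clarity, not part of the original derivation), write $X$ for the upper limit $\sum_{i = 0}^{n} x_{i}$ and take the continuous but non-differentiable weight function $w (t) = |t|$, which belongs to $C^{0}$:

```latex
S(X) = \int_{0}^{X} |t| \, dt = \frac{X |X|}{2}, \qquad \frac{dS}{dX} = |X| .
```

The derivative $|X|$ is continuous, so $S \in C^{1}$ even though $w$ has a corner at $t = 0$. In the terminology of Definition 2, a neuron with this weight function would be an integral neuron of class 1.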

3.2. Algorithm for Finding the Optimal Weight Function Using Gradient Descent of the Error

The unusual aspect of implementing our proposed idea is that the training algorithm seeks not a vector of numerical weight values that minimizes the error but rather the type and parameters of a weight function. In the context of modeling the weight function, the coefficients of the weight function are also referred to as parameters.

When using the error minimization algorithm (Figure 5) with gradient descent, the following stages are followed:

Initialization: At this stage, an initial function $w^{(0)} (t)$ is selected: this can be any continuous and parametric linear or nonlinear function. An initial value is set for the parameters of $w (t)$ (for example, for a polynomial or exponential function).
Error calculation: For each $j$ -th sample in the training set (13) ${D = \{[x_{j 0} = 1, x_{j 1}, \dots, x_{j n}; h_{j}]\}}_{j = 1}^{m}$ , the neuron’s output is calculated as

$y_{j} = g (\int_{0}^{\sum_{i = 0}^{n} x_{j i}} w (t) d t),$

(27)

the errors are computed as

$e_{j} = h_{j} - y_{j}, \forall j = 1, 2, \dots, m,$

(28)

and the MSE is formed as

$E = \frac{1}{m} \sum_{j = 1}^{m} e_{j}^{2} .$

(29)
Gradient calculation and parameter update of the weight function: Unlike the traditional approach, at this stage, the derivative of the error is calculated not with respect to weights and biases, but with respect to the parameters of the weight function $w (t)$ . For example, in the case of a polynomial form for $w (t)$ , derivatives are calculated with respect to the polynomial’s parameters. This means we are actually searching for the optimal parameters of the weight function that minimize the error (29). At each step $k$ , the parameter update of $w (t)$ is performed using the calculated gradient:

$w^{(k + 1)} (t) = w^{(k)} (t) - η \frac{\partial E}{\partial w (t)},$

(30)

where $η > 0 \in R$ is the descent step—a training hyperparameter, also known as the learning rate. The descent step determines how quickly the algorithm will move in the direction of the steepest slope of the error function. The larger its value, the faster the parameters will be updated, but the higher the risk of missing the minimum. On the other hand, a very small step can make the training process very slow, as a small $η$ value will require more iterations. The value of $η$ can be set by the user implementing the algorithm or can be built into software solutions, with an option for adaptive adjustment during training.
Iterations: Steps 2 and 3 are repeated until one of the following conditions is met:
- A local or global minimum of the error function is found;
- One of the termination conditions set in the software implementation for optimization is met (e.g., maximum number of epochs reached, specified number of objective function calls exceeded, error plateau identified for a specified number of iterations, etc.).

Upon completion of the algorithm, the desired weight function is $w (t) = w^{(l)} (t)$, where $l$ is the number of the last iteration performed.
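The four stages above can be sketched for an arbitrary parametric weight function when no closed-form integral or gradient is available. The sketch below is one possible realization under explicit assumptions: the integral in (27) is evaluated with the composite Simpson rule, the gradient in (30) is approximated by central finite differences with respect to the parameters, and the example data, the linear weight function, and the identity transfer function are hypothetical choices that keep the error surface quadratic.

```python
def simpson(f, a, b, n=20):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0

def neuron_output(w, theta, inputs, g):
    """Equation (27): g applied to the integral of w(t; theta) from 0 to sum(inputs)."""
    return g(simpson(lambda t: w(t, theta), 0.0, sum(inputs)))

def mse(w, theta, samples, g):
    """Equations (28)-(29): mean of the squared errors over the training set."""
    return sum((h - neuron_output(w, theta, x, g)) ** 2 for x, h in samples) / len(samples)

def numerical_gradient(w, theta, samples, g, eps=1e-5):
    """Central finite differences of the error with respect to each parameter."""
    grad = []
    for p in range(len(theta)):
        up, down = list(theta), list(theta)
        up[p] += eps
        down[p] -= eps
        grad.append((mse(w, up, samples, g) - mse(w, down, samples, g)) / (2 * eps))
    return grad

def train(w, theta, samples, g, eta, max_epochs, tol=1e-10):
    """Stages 2-4: repeat error evaluation and parameter update until a stop condition."""
    theta = list(theta)
    for _ in range(max_epochs):
        if mse(w, theta, samples, g) < tol:
            break
        grad = numerical_gradient(w, theta, samples, g)
        theta = [t_p - eta * g_p for t_p, g_p in zip(theta, grad)]
    return theta

# Hypothetical setup: w(t) = theta_0 + theta_1 * t, identity transfer function,
# and targets generated by theta = (0.5, 1.0); each input tuple starts with x_0 = 1.
linear_w = lambda t, theta: theta[0] + theta[1] * t
identity = lambda s: s
samples = [((1, 0), 1.0), ((1, 1), 3.0), ((1, 2), 6.0)]
theta0 = [0.0, 0.0]
theta = train(linear_w, theta0, samples, identity, eta=0.02, max_epochs=1000)
```

Any other parametric family, for example an exponential $w (t) = \theta_{0} e^{\theta_{1} t}$, can be passed in place of `linear_w`, although the step $η$ would then need to be chosen again and convergence is not guaranteed by this sketch.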

3.2.1. Example Application of the Algorithm

To clarify the proposed algorithm, let us consider a polynomial weight function $w (t)$ for a neuron whose transfer function is the hyperbolic tangent. The choice of a polynomial weight function in this case is justified, as it is differentiable and easy to integrate. Additionally, according to the Weierstrass Approximation Theorem, any continuous function on a closed bounded interval can be uniformly approximated by a polynomial to any desired accuracy [66].

Let a polynomial weight function of degree $q$ be given:

w (t) = a_{0} + a_{1} t + a_{2} t^{2} + \dots + a_{q} t^{q}, \quad a_{p} \in R, \; \forall p = 0, 1, 2, \dots, q .

(31)

Following the described algorithm, the following stages should be completed:

1. Initialization:

Initial values for the parameters $a_{0}, a_{1}, a_{2}, \dots, a_{q}$ are selected.

2. Error calculation:

For each sample from the training set, the output of the neuron is

y_{j} = \tanh (\int_{0}^{\sum_{i = 0}^{n} x_{j i}} w (t) d t) .

(32)

After substituting the integrand with the polynomial (31) and evaluating the integral, we reach the representation

y_{j} = \tanh (\sum_{p = 0}^{q} a_{p} \frac{{(\sum_{i = 0}^{n} x_{j i})}^{p + 1}}{p + 1}),

(33)

which we use to calculate the errors (28) and (29), which become

e_{j} = h_{j} - \tanh (\sum_{p = 0}^{q} a_{p} \frac{{(\sum_{i = 0}^{n} x_{j i})}^{p + 1}}{p + 1}), \forall j = 1, 2, \dots, m

(34)

and

E = \frac{1}{m} \sum_{j = 1}^{m} {(h_{j} - \tanh (\sum_{p = 0}^{q} a_{p} \frac{{(\sum_{i = 0}^{n} x_{j i})}^{p + 1}}{p + 1}))}^{2} .

(35)

3. Gradient calculation and parameter update of the weight function:

At this stage, the derivatives of the error (35) are calculated with respect to each of the parameters $a_{p}$ of the function $w (t)$. By the chain rule, using $\frac{d}{d s} \tanh (s) = 1 - \tanh^{2} (s)$, we obtain

\frac{\partial E}{\partial a_{p}} = - \frac{2}{m} \sum_{j = 1}^{m} e_{j} (1 - y_{j}^{2}) \frac{{(\sum_{i = 0}^{n} x_{j i})}^{p + 1}}{p + 1} .

(36)

The coefficients of the weight function are updated as follows:

a_{p} \leftarrow a_{p} - η \frac{\partial E}{\partial a_{p}}, \forall p = 0, 1, \dots, q,

(37)

where $η > 0$ is the descent step.

4. Iterations:

Steps 2 and 3 are repeated until a set of parameters $a_{0}, a_{1}, a_{2}, \dots, a_{q}$ of the weight function is reached, for which the error meets the acceptable requirements.
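The polynomial example can be turned into a short program that uses the closed-form signal (33), the error (35), the analytic gradient (36), and the update (37). This is a minimal sketch and not the MATLAB implementation referred to in Appendix A: the function names, the quadratic degree, the zero initialization, the step $η = 0.005$, the number of epochs, and the encoding of XOR with inputs in $\{0, 1\}$, targets in $\{0, 1\}$, and $x_{0} = 1$ are all assumptions made for illustration.

```python
import math

def body_signal(a, X):
    """Closed-form integral of the polynomial w(t) from 0 to X, as in Equation (33)."""
    return sum(a_p * X ** (p + 1) / (p + 1) for p, a_p in enumerate(a))

def mse_and_grad(a, samples):
    """Return the error (35) and its gradient (36) for samples [(inputs, target), ...]."""
    m = len(samples)
    E = 0.0
    grad = [0.0] * len(a)
    for inputs, h in samples:
        X = sum(inputs)                       # upper integration limit, x_0 = 1 included
        y = math.tanh(body_signal(a, X))
        e = h - y
        E += e * e / m
        for p in range(len(a)):
            grad[p] += -2.0 / m * e * (1.0 - y * y) * X ** (p + 1) / (p + 1)
    return E, grad

def train(a, samples, eta, epochs):
    """Repeat steps 2 and 3; history[k] is the error before the k-th update."""
    a = list(a)
    history = []
    for _ in range(epochs):
        E, grad = mse_and_grad(a, samples)
        history.append(E)
        a = [a_p - eta * g_p for a_p, g_p in zip(a, grad)]   # update (37)
    return a, history

# Assumed XOR encoding: each input tuple is (x_0 = 1, x_1, x_2), followed by the target.
xor_samples = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]
a_final, history = train([0.0, 0.0, 0.0], xor_samples, eta=0.005, epochs=200)
```

The list `history` records the error at each epoch, so it can be inspected or plotted to judge whether the chosen step and number of epochs are adequate; the sketch makes no claim about how many epochs are needed to reach a given accuracy.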

3.2.2. Convergence of the Algorithm

Let us denote the parameters of the weight function and the gradient of the error function as column vectors:

a = \left[\begin{matrix} a_{0} \\ a_{1} \\ a_{2} \\ \vdots \\ a_{q} \end{matrix}\right]_{(q + 1) \times 1},

(38)

and

\frac{\partial E}{\partial w (t)} = \left[\begin{matrix} \frac{\partial E}{\partial a_{0}} \\ \frac{\partial E}{\partial a_{1}} \\ \frac{\partial E}{\partial a_{2}} \\ \vdots \\ \frac{\partial E}{\partial a_{q}} \end{matrix}\right]_{(q + 1) \times 1} .

(39)

The possibility of finding the optimal function $w (t)$, for which the error function $E (w (t))$ is minimal, is demonstrated by the following convergence theorem for the proposed algorithm.

Theorem 1

(Convergence of the Gradient Descent Method for Optimizing the Weight Function). Let an integral neuron of at least first class with a smooth transfer function

g

be given. Let

a

be the vector containing the parameters of the weight function

w (t)

, and

\frac{\partial E}{\partial w (t)}

be the gradient of the error function.

Then, if the parameter update for

w (t)

is performed through an optimization algorithm based on gradient descent of the mean squared error function

E (w (t))

with step

η > 0

,

a^{(k + 1)} = a^{(k)} - η \frac{\partial E}{\partial w (t)},

then the algorithm is convergent and brings the parameters of the function

w (t)

closer to values that minimize

E (w (t))

.

Proof of Theorem 1.

In the first stage, we will demonstrate the differentiability of the error function and the applicability of the gradient descent method.

Let an integral neuron (Figure 6) of at least first class with a smooth transfer function

g

be given, which is trained with examples from the set

{D = \{[x_{j 0} = 1, x_{j 1}, \dots, x_{j n}; h_{j}]\}}_{j = 1}^{m}

.

Since the mean squared error function is represented as

E (w (t)) = \sum_{j = 1}^{m} {(h_{j} - g (\int_{0}^{\sum_{i = 0}^{n} x_{j i}} w (t) d t))}^{2}

(40)

it is also differentiable, as it is a composition of differentiable functions that comprise it. Therefore, the derivative

\frac{\partial E}{\partial w (t)}

exists and is continuous for every value of

t

. Then, given the bounded nature of

E (w (t))

(the function is bounded below:

E (w (t)) \geq 0

), it follows that an optimization algorithm using gradient descent on this function can be applied.

In the second stage, we will demonstrate the reduction in the error. At each iteration, the algorithm updates the parameters of the weight function

w (t)

according to

a^{(k + 1)} = a^{(k)} - η \frac{\partial E}{\partial w (t)} .

(41)

Here,

η > 0

is the descent step. Let

w^{(k)} (t)

be the value of the weight function involved in the error (40) at the end of the

k

-th iteration. Then, its next value will be

w^{(k + 1)} (t) = w^{(k)} (t) - η \frac{\partial E}{\partial w (t)} |_{w = w^{(k)} (t)} .

(42)

We will show that

E (w^{(k + 1)} (t)) < E (w^{(k)} (t)) .

(43)

Let us consider the first-order Taylor series expansion of the error function, which is accurate for a sufficiently small step η:

E (w^{(k + 1)} (t)) \approx E (w^{(k)} (t)) + (w^{(k + 1)} (t) - w^{(k)} (t)) \frac{\partial E}{\partial w (t)} |_{w = w^{(k)} (t)} .

(44)

If we substitute

w^{(k + 1)} (t)

from the parameter update in Equation (42) into (44), we obtain

E (w^{(k + 1)} (t)) \approx E (w^{(k)} (t)) - η {(\frac{\partial E}{\partial w (t)} |_{w = w^{(k)} (t)})}^{2} .

(45)

Since

η > 0

and

{(\frac{\partial E}{\partial w (t)} |_{w = w^{(k)} (t)})}^{2} > 0

whenever the gradient is non-zero, it follows exactly what we intended to prove:

E (w^{(k + 1)} (t)) < E (w^{(k)} (t)) .

(46)

It is important to note the following: If the function

E (w (t))

is convex, the algorithm will approach the global minimum. Otherwise, the convergence may be towards a local minimum; however, the algorithm will still be convergent and will find the necessary parameters of the weight function

w (t)

at which this minimum is achieved. □
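The decrease in (43) rests on a first-order expansion, so in practice it depends on the step being small enough. One common safeguard, which is an assumption of this sketch and not part of the article's algorithm, is to halve η until the error actually decreases. The sketch uses the sum-of-squares error (40) with g = tanh and a first-degree weight function w(t) = a_0 + a_1 t; the data and starting values are hypothetical.

```python
import math

def error(a, data):
    """Sum-of-squares error (40) for g = tanh and w(t) = a0 + a1*t."""
    return sum((h - math.tanh(a[0] * X + a[1] * X * X / 2)) ** 2 for X, h in data)

def grad(a, data):
    """Gradient of the error with respect to (a0, a1)."""
    g0 = g1 = 0.0
    for X, h in data:
        y = math.tanh(a[0] * X + a[1] * X * X / 2)
        c = -2.0 * (h - y) * (1.0 - y * y)
        g0 += c * X
        g1 += c * X * X / 2
    return [g0, g1]

def descent_step(a, data, eta):
    """One gradient step; eta is halved until the error strictly decreases."""
    g = grad(a, data)
    e_old = error(a, data)
    while eta > 1e-12:
        trial = [a[0] - eta * g[0], a[1] - eta * g[1]]
        if error(trial, data) < e_old:
            return trial, eta
        eta /= 2
    return list(a), 0.0  # no decrease found: the gradient is numerically zero

# Hypothetical samples (X_j, h_j) and a deliberately large initial step
data = [(0.5, 0.2), (1.0, -0.1), (2.0, 0.4)]
a = [2.0, -1.0]
for _ in range(10):
    a, used_eta = descent_step(a, data, eta=5.0)
    print(used_eta, error(a, data))
```

With this safeguard the sequence of error values is non-increasing by construction, which mirrors inequality (43) without relying on a fixed step.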

3.3. Algorithm for Finding the Optimal Weight Function Using the Gauss–Newton Method

Applying the Gauss–Newton method within our framework of using a weight function instead of static weights still requires going through the previously described steps 1–4. The first three steps remain the same, and Equations (13) and (27)–(29) are once again satisfied (Figure 7). The new and key aspect is the update of the weight function parameters

a = {[\begin{matrix} a_{0} \\ a_{1} \\ a_{2} \\ ⋮ \\ a_{q} \end{matrix}]}_{(q + 1) \times 1},

(47)

which is now expressed through the transition

a \leftarrow a - {(J^{T} J)}^{- 1} J^{T} e .

(48)

In the transition in (48),

J

is the Jacobian matrix, containing the partial derivatives of the errors

e_{j}

with respect to each parameter of

w (t)

:

J = {[\begin{matrix} \frac{\partial e_{1}}{\partial a_{0}} & \frac{\partial e_{1}}{\partial a_{1}} & \dots & \frac{\partial e_{1}}{\partial a_{q}} \\ \frac{\partial e_{2}}{\partial a_{0}} & \frac{\partial e_{2}}{\partial a_{1}} & \dots & \frac{\partial e_{2}}{\partial a_{q}} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ \frac{\partial e_{m}}{\partial a_{0}} & \frac{\partial e_{m}}{\partial a_{1}} & \dots & \frac{\partial e_{m}}{\partial a_{q}} \end{matrix}]}_{m \times (q + 1)},

(49)

and

e = {[\begin{matrix} e_{1} \\ e_{2} \\ e_{3} \\ ⋮ \\ e_{m} \end{matrix}]}_{m \times 1}

(50)

is the vector containing all errors as each training sample passes through the neuron at each iteration. Here,

m

is the number of training samples, and

q + 1

is the number of parameters of the weight function.
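The transition (48) can be sketched for the two-parameter case. This is a minimal sketch, assuming g = tanh and a first-degree weight function w(t) = a_0 + a_1 t, so that J has two columns and the normal equations reduce to a 2×2 system solved by Cramer's rule. The function names and the synthetic data are assumptions for illustration.

```python
import math

def errors_and_jacobian(a, data):
    """Errors e_j and Jacobian rows [de_j/da0, de_j/da1] for w(t) = a0 + a1*t, g = tanh."""
    e, J = [], []
    for X, h in data:
        y = math.tanh(a[0] * X + a[1] * X * X / 2)
        e.append(h - y)
        d = -(1.0 - y * y)              # derivative of e_j with respect to the signal
        J.append([d * X, d * X * X / 2])
    return e, J

def gauss_newton_step(a, data):
    """One update a <- a - (J^T J)^(-1) J^T e, as in (48)."""
    e, J = errors_and_jacobian(a, data)
    A00 = sum(r[0] * r[0] for r in J)
    A01 = sum(r[0] * r[1] for r in J)
    A11 = sum(r[1] * r[1] for r in J)
    b0 = sum(r[0] * ej for r, ej in zip(J, e))
    b1 = sum(r[1] * ej for r, ej in zip(J, e))
    det = A00 * A11 - A01 * A01
    if abs(det) < 1e-14:
        raise ValueError("J^T J is singular: the Jacobian does not have full rank")
    d0 = (A11 * b0 - A01 * b1) / det
    d1 = (A00 * b1 - A01 * b0) / det
    return [a[0] - d0, a[1] - d1]

# Synthetic targets generated from hypothetical coefficients (0.3, -0.2)
true_a = [0.3, -0.2]
data = [(X, math.tanh(true_a[0] * X + true_a[1] * X * X / 2)) for X in (0.5, 1.0, 1.5, 2.0)]
a = [0.25, -0.15]                       # start close to the generating coefficients
for _ in range(10):
    a = gauss_newton_step(a, data)
print(a)
```

The explicit singularity check corresponds to the full-rank requirement on J; a general (q + 1)-parameter version would replace Cramer's rule with a linear solver.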

Theorem 2

(Convergence of the Gauss–Newton Method for Optimizing the Weight Function). Let an integral neuron of at least second class with a smooth transfer function

g

be given. Let

a

be the vector containing the parameters of the weight function,

J

be the Jacobian matrix of the error function

E (w (t))

, and

e

be the vector of errors:

e_{j} = h_{j} - g (\int_{0}^{\sum_{i = 0}^{n} x_{j i}} w (t) d t), j = 1, 2, \dots, m .

Then, if the rank of

J

is full and the parameter update for

w (t)

is performed using the Gauss–Newton optimization method,

a^{(k + 1)} = a^{(k)} - {(J^{T} J)}^{- 1} J^{T} e,

the algorithm is convergent and brings the parameters of the function

w (t)

closer to values that minimize

E (w (t))

.

Proof of Theorem 2.

The Gauss–Newton method approximates the error function with a quadratic model and updates the parameter vector

a

according to the equation

a^{(k + 1)} = a^{(k)} - {(J^{T} J)}^{- 1} J^{T} e,

(51)

where

J

is the Jacobian matrix containing the partial derivatives of the errors

(e_{j})

with respect to parameter

a

, and

e

is the vector of errors for all samples

(j)

.

The error function of the neuron is represented as

E (a) = \sum_{j = 1}^{m} {(h_{j} - g (\int_{0}^{\sum_{i = 0}^{n} x_{j i}} w (t) d t))}^{2},

(52)

where

w (t)

is the weight function, which depends on parameter

a

. Given that the neuron is at least of the second class (

w (t) \in C^{1}

), it follows that the composition

E (w (t)) \in C^{2}

, and it can be quadratically approximated in a small neighborhood of the current solution

a^{(k)}

by

E (a) \approx E (a^{(k)}) + \nabla {E (a^{(k)})}^{T} ∆ a + \frac{1}{2} {∆ a}^{T} H ∆ a,

(53)

where

\nabla {E (a^{(k)})}^{T}

is the transposed gradient of the error function, and

H \approx J^{T} J

is the Hessian matrix:

H = {[\begin{matrix} \frac{\partial^{2} E}{\partial a_{0}^{2}} & \frac{\partial^{2} E}{\partial a_{0} \partial a_{1}} & \dots & \frac{\partial^{2} E}{\partial a_{0} \partial a_{q}} \\ \frac{\partial^{2} E}{\partial a_{1} \partial a_{0}} & \frac{\partial^{2} E}{\partial a_{1}^{2}} & \dots & \frac{\partial^{2} E}{\partial a_{1} \partial a_{q}} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ \frac{\partial^{2} E}{\partial a_{q} \partial a_{0}} & \frac{\partial^{2} E}{\partial a_{q} \partial a_{1}} & \dots & \frac{\partial^{2} E}{\partial a_{q}^{2}} \end{matrix}]}_{(q + 1) \times (q + 1)} .

(54)

Let

∆ a = a^{(k + 1)} - a^{(k)} .

(55)

From (51) and (55), it follows that

∆ a = - {(J^{T} J)}^{- 1} J^{T} e .

(56)

The change in the error function with this update

∆ E = E (a^{(k + 1)}) - E (a^{(k)})

(57)

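is obtained by substituting (56) into (53), with the gradient written as $\nabla E (a^{(k)}) = J^{T} e$ (constant factors omitted):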

∆ E = - \nabla {E (a^{(k)})}^{T} {(J^{T} J)}^{- 1} \nabla E (a^{(k)}) + \frac{1}{2} \nabla {E (a^{(k)})}^{T} {(J^{T} J)}^{- 1} H {(J^{T} J)}^{- 1} \nabla E (a^{(k)}) .

(58)

The first term in the change in the error is

- \nabla {E (a^{(k)})}^{T} {(J^{T} J)}^{- 1} \nabla E (a^{(k)}) .

(59)

This term represents the main contribution to the change in

E (a)

and is linearly dependent on

∆ a

. Since linear dependence is stronger than quadratic dependence for small values of

∆ a

(where the new parameter values

a^{(k + 1)}

remain close to the old

a^{(k)}

), the first term dominates in the expression for

∆ E

.

If the function

E (a)

is exactly quadratic, then the second term

\frac{1}{2} \nabla {E (a^{(k)})}^{T} {(J^{T} J)}^{- 1} H {(J^{T} J)}^{- 1} \nabla E (a^{(k)})

(60)

represents the exact contribution of the second derivative to the change

∆ E

. In this case, the quadratic approximation is exact, and the approximation error is zero. If the error function is nearly quadratic, this means that the higher derivatives of

E (a)

(third-order and above) are very small. This results in a minimal contribution from the second term to the overall change in

E (a)

. Ultimately, we reach what we set out to prove:

∆ E < 0 .

(61)

Before we proceed, let us note something important. In the statement of Theorem 2, suitable initial values for the parameters of the weight function were deliberately left undefined. The Gauss–Newton and Levenberg–Marquardt methods (the latter based on the former) use a quadratic approximation of the error function, which performs well only in a small neighborhood around the optimal solution. Therefore, the initial parameters need to be close to the global minimum for the algorithm to converge towards it. However, if the initial parameters are far from the global minimum, the algorithm will tend to find the nearest local minimum. It is important to emphasize that, in all cases, the algorithm will converge to some minimum, which may be local but is not necessarily the global minimum. □

3.4. Algorithm for Finding the Optimal Weight Function Using the Levenberg–Marquardt Method

The Levenberg–Marquardt method combines the gradient descent and Gauss–Newton methods for nonlinear optimization, making it particularly effective for minimization tasks of nonlinear functions, such as the error function in ANNs. In this method, the parameter update for the weight function is represented by the transition (Figure 8)

a \leftarrow a - {(J^{T} J + λ I)}^{- 1} J^{T} e,

(62)

where

$a$ is the vector containing the parameters of the weight function $w (t)$ ;
$J$ is the Jacobian matrix with the partial derivatives of the errors $e_{j}$ with respect to each parameter of $w (t)$ ;
$e$ is the vector containing all errors as each training sample passes through the neuron at each iteration;
$λ$ is the parameter that enables the transition between the gradient descent method and the Gauss–Newton method;
$I$ is the identity matrix.

Figure 8. Algorithm for finding the optimal weight function using the Levenberg–Marquardt method.

Theorem 3

(Convergence of the Levenberg–Marquardt Method for Optimizing the Weight Function). Let an integral neuron of at least second class with a smooth transfer function

g

be given. Let

a

be the vector containing the parameters of the weight function,

J

be the Jacobian matrix of the error function

E (w (t))

, and

e

be the vector of errors

e_{j} = h_{j} - g (\int_{0}^{\sum_{i = 0}^{n} x_{j i}} w (t) d t), j = 1, 2, \dots, m .

Let

λ > 0

be the parameter for transitioning between gradient descent

(λ \to \infty)

and the Gauss–Newton method (

λ \to 0

), and let

I

be the identity matrix of the same dimension as

J^{T} J

.

Then, if the rank of

J

is full and the parameter update for

w (t)

is performed using the Levenberg–Marquardt optimization method,

a^{(k + 1)} = a^{(k)} - {(J^{T} J + λ I)}^{- 1} J^{T} e,

the algorithm is convergent and brings the parameters of the function

w (t)

closer to values that minimize

E (w (t))

.

Proof of Theorem 3.

The proof of this theorem is similar to that of Theorem 2, concerning the Gauss–Newton method. We again consider a neuron of at least second class with a smooth transfer function

g

, which is trained with examples from the set

{D = \{[x_{j 0} = 1, x_{j 1}, \dots, x_{j n}; h_{j}]\}}_{j = 1}^{m}

.

Since the MSE function is represented as

E (w (t)) = \sum_{j = 1}^{m} {(h_{j} - g (\int_{0}^{\sum_{i = 0}^{n} x_{j i}} w (t) d t))}^{2},

(63)

and given that the neuron is at least of the second class, it follows that the composition

E (w (t)) \in C^{2}

. To show that the algorithm

a \leftarrow a - {(J^{T} J + λ I)}^{- 1} J^{T} e

(64)

is convergent, we need to show that

E (a^{(k + 1)}) < E (a^{(k)}) .

(65)

Formally, the parameter update

a

at the

(k + 1)

-th iteration is expressed by the equation

a^{(k + 1)} = a^{(k)} - {(J^{T} J + λ I)}^{- 1} J^{T} e .

(66)

Let

∆ a

denote the vector

(a^{(k + 1)} - a^{(k)})

. It is evident that

∆ a = - {(J^{T} J + λ I)}^{- 1} J^{T} e .

(67)

The error function

E (a)

can be simplified in a sufficiently small neighborhood of the current value

a^{(k)}

through a quadratic approximation:

E (a) \approx E (a^{(k)}) + \nabla {E (a^{(k)})}^{T} ∆ a + \frac{1}{2} {∆ a}^{T} H ∆ a,

(68)

where

$\nabla {E (a^{(k)})}^{T}$ is the transposed gradient of the error function;
$H \approx J^{T} J$ is the Hessian matrix.

Then, if we substitute the change (67) in the parameter update for the weight function

w (t)

into expression (68) and denote

∆ E = E (a^{(k + 1)}) - E (a^{(k)})

(69)

we obtain

∆ E = - \nabla {E (a^{(k)})}^{T} {(J^{T} J + λ I)}^{- 1} \nabla E (a^{(k)}) + \frac{1}{2} \nabla {E (a^{(k)})}^{T} {(J^{T} J + λ I)}^{- 1} H {(J^{T} J + λ I)}^{- 1} \nabla E (a^{(k)}) .

(70)

Since

λ > 0

and

(J^{T} J + λ I)

is positive definite, we can maintain the Levenberg–Marquardt optimization idea of dynamically adjusting

λ

during training:

If $∆ E < 0$ , then the update is successful, and the error has decreased. In this case, the algorithm decreases $λ$ , allowing for larger steps in the next iteration.
If $∆ E \geq 0$ , then the update is unsuccessful, and the error has increased. In this case, $λ$ is increased, which limits the step size and makes the behavior closer to gradient descent.

Thus, ultimately, in each iteration, the Levenberg–Marquardt algorithm will adjust the parameters of the weight function to reduce the error function

E

. Since the error function is at least twice differentiable and the algorithm dynamically adapts the damping parameter

λ

, convergence to a minimum is guaranteed, even if we start with arbitrary initial parameter values. □
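The accept/reject rule for λ described in the proof can be sketched as follows. This is a minimal sketch, assuming g = tanh, a first-degree weight function w(t) = a_0 + a_1 t, an initial λ = 0.01, and a factor of 10 for adjusting λ; these values, the function names, and the synthetic data are assumptions and are not taken from the article.

```python
import math

def errors_and_jacobian(a, data):
    """Errors e_j and Jacobian rows [de_j/da0, de_j/da1] for w(t) = a0 + a1*t, g = tanh."""
    e, J = [], []
    for X, h in data:
        y = math.tanh(a[0] * X + a[1] * X * X / 2)
        e.append(h - y)
        d = -(1.0 - y * y)
        J.append([d * X, d * X * X / 2])
    return e, J

def sse(e):
    """Sum of squared errors."""
    return sum(v * v for v in e)

def levenberg_marquardt(a, data, lam=1e-2, iters=50):
    """Repeated update (62) with lambda decreased after success and increased after failure."""
    a = list(a)
    e, J = errors_and_jacobian(a, data)
    for _ in range(iters):
        A00 = sum(r[0] * r[0] for r in J) + lam
        A01 = sum(r[0] * r[1] for r in J)
        A11 = sum(r[1] * r[1] for r in J) + lam
        b0 = sum(r[0] * ej for r, ej in zip(J, e))
        b1 = sum(r[1] * ej for r, ej in zip(J, e))
        det = A00 * A11 - A01 * A01      # positive, since J^T J + lam*I is positive definite
        trial = [a[0] - (A11 * b0 - A01 * b1) / det,
                 a[1] - (A00 * b1 - A01 * b0) / det]
        e_new, J_new = errors_and_jacobian(trial, data)
        if sse(e_new) < sse(e):          # successful step: accept it and relax the damping
            a, e, J = trial, e_new, J_new
            lam = max(lam / 10, 1e-12)
        else:                            # unsuccessful step: keep a and increase the damping
            lam = min(lam * 10, 1e12)
    return a, sse(e)

# Synthetic targets generated from hypothetical coefficients (0.3, -0.2)
data = [(X, math.tanh(0.3 * X - 0.2 * X * X / 2)) for X in (0.5, 1.0, 1.5, 2.0)]
print(levenberg_marquardt([0.0, 0.0], data))
```

Because a trial step is kept only when it lowers the error, the accepted error values never increase; how quickly they approach a minimum still depends on the starting point and on the data.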

3.5. Selection of Weight Function and Algorithm Complexity

The determination of an appropriate weight function

w (t)

depends on the specifics of the problem itself. Polynomial, exponential, and trigonometric functions are often a good choice due to their flexibility and ease of parameterization. For problems requiring interpolation or subsequent extrapolation, polynomial weight functions are an excellent choice. However, if the goal is to use the integral neuron to describe decaying dependencies, exponential weight functions would be more suitable. Gaussian weight functions are appropriate for processing local dependencies, and so on. Moreover, considering the fact that the weight function essentially determines the signal within the neuron’s body, which is then passed as an argument to the transfer function, several details should be noted that might prove critical when selecting the weight function:

1.

Interval and values of input data:

The weight function must be compatible with the distribution of input values. If the inputs are defined within a specific range,

w (t)

can be as follows:

-: A polynomial function for uniform distribution;
-: A Gaussian function for local dependencies, where $w (t)$ can be adjusted to a specific sub-range through parameters related to the mean and standard deviation (variance).

2.

Transfer function

g (S) :

The selection of

w (t)

should also take into account the characteristics of the transfer function:

-: For transfer functions such as $\tanh (S)$ or the sigmoid transfer function $σ (S)$ , $w (t)$ should ensure that the signal $S$ falls within regions where the transfer function is most sensitive (e.g., around $S = 0$ ). For instance, polynomial and exponential weight functions can be scaled to direct the signal $S$ to optimal ranges.
-: For transfer functions with boundaries or whose domain excludes certain points, the weight function must constrain the signal $S$ to a range compatible with the required restrictions. For example, for the reverse sigmoid,

$g (S) = \frac{1}{1 + e^{- \frac{1}{S}}},$

or for the modified $\tanh (S)$ ,

$g (S) = \tanh (S) + \frac{1}{S^{2}}$

the weight function $w (t)$ , which forms the signal

$S = \int_{0}^{\sum_{i = 0}^{n} x_{i}} w (t) d t,$

must be chosen to prevent situations where $S = 0$ , since both of these transfer functions are undefined at that point.

3.

Smoothness and computational complexity:

The weight function should be at least

C^{k}

(smooth up to the

k

-th order), to facilitate the following:

-: The calculation of gradients and of the Jacobian and Hessian matrices during training;
-: The correct behavior of the integral.

From the perspective of computational complexity, different choices of weight functions can be considered. Polynomial functions are easy to compute and differentiate, while more complex weight functions (e.g.,

w (t) = \sin (t)

or

w (t) = e^{- t^{2}}

) offer additional flexibility but are computationally more expensive.

4.

Nature of the problem:

The type of task (regression, classification, modeling dependencies) influences the choice:

-: Regression: Polynomial or exponential weight functions provide smooth approximation.
-: Classification: Locally sensitive weight functions (such as Gaussian) can aid in class separation.
-: Temporal tasks: Exponential weight functions can model a decaying effect over time.

Ultimately, the problem of selecting the most appropriate weight function for the integral neuron is the same as the problem of choosing a transfer function for standard neurons. In the theory of artificial neural networks, there is still no objective rule for this. Of course, it is advisable for the weight function to be continuous, monotonic, and differentiable, but which specific function would be the most useful remains a matter of additional judgment. Based on their experience, researchers develop professional intuition that helps them determine what type of transfer function is needed under specific conditions defined by the task and the other parameters of the neurons being used. The exact, specific type is determined as a result of experimentation and comparison of results. For this purpose, algorithms are even being developed for automated search for suitable neurons. Similarly, the same principles apply to the optimal selection of the weight function.
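To compare candidate weight functions, the signal in the neuron body can be computed numerically for any smooth w(t). The sketch below is illustrative only: it uses a composite Simpson rule, and the three candidates (a first-degree polynomial with placeholder coefficients, sin t, and e^{-t^2}) are paired with their known antiderivatives purely as a check. The names and values are assumptions, not choices made in the article.

```python
import math

def body_signal(w, X, n=200):
    """Composite Simpson approximation of the integral of w(t) from 0 to X (n must be even)."""
    h = X / n
    total = w(0.0) + w(X)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * w(k * h)
    return total * h / 3

# Each entry: (weight function, closed-form value of its integral from 0 to X)
candidates = {
    "polynomial": (lambda t: 0.5 - 0.3 * t, lambda X: 0.5 * X - 0.15 * X * X),
    "sine": (math.sin, lambda X: 1.0 - math.cos(X)),
    "gaussian": (lambda t: math.exp(-t * t), lambda X: math.sqrt(math.pi) / 2 * math.erf(X)),
}

X = 1.3  # hypothetical sum of the inputs of one sample
for name, (w, exact) in candidates.items():
    S = body_signal(w, X)
    print(name, S, exact(X), math.tanh(S))
```

The polynomial case needs no quadrature at all, which reflects the remark above that polynomial weight functions are the cheapest to evaluate and differentiate.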

4. Discussion: Classical and Integral Neurons and Neural Networks

It turns out that the optimization algorithms—gradient descent, Gauss–Newton, and their generalization, the Levenberg–Marquardt method—are convergent and, starting from random parameters for the weight function, can optimize it in a way that indeed minimizes the error. Here, it is essential to emphasize something critical. The condition for the proximity of input data that we imposed from the start,

λ = \max \{∆ x_{1}, ∆ x_{2}, \dots, ∆ x_{n}\} \to 0

for

∆ x_{i} = x_{i} - x_{i - 1}, i = 1, 2, \dots, n

, ensured the possibility of representing

S = \sum_{i = 0}^{n} w_{i} x_{i} = \int_{0}^{\sum_{i = 0}^{n} x_{i}} w (t) d t .

(71)

This condition was solely driven by our desire to establish equivalence between classical neurons and those with weight functions. However, the initial conditions of the approximation problem are often not like this. In the general case, the input data

{\{x_{j 0}, x_{j 1}, \dots, x_{j n}\}}_{j = 1}^{m}

provided for each sample can have values spaced far apart (

λ

has a large value), which could introduce error in the representation (71). In these cases, a commonly used approach is data normalization. If the input–output data are transformed via appropriate normalization to fit within a small interval, (71) will hold with negligible error, giving us a weight function that returns different weights along the dendrites depending on the dataset. This effectively endows the neuron with dynamic characteristics.

In the opposite case, if the requirement

λ = \max \{∆ x_{1}, ∆ x_{2}, \dots, ∆ x_{n}\} \to 0

is disregarded, entirely new neural structures emerge that have no standard equivalent. However, the cumulative signal within their bodies is still obtained from the value of the integral in (71) with some arbitrary continuous and smooth function

w (t)

:

\tilde{S} = \int_{0}^{\sum_{i = 0}^{n} x_{i}} w (t) d t .

(72)

In this case, Theorems 1–3 will still hold, which is very important. By abandoning the requirement

λ = \max \{∆ x_{1}, ∆ x_{2}, \dots, ∆ x_{n}\} \to 0

, we see the possibility of constructing and training neural structures characterized by the absence of a true dendritic tree, where each input dataset is provided as a set through only one dendritic segment with a weight of 1 (Figure 9).

Since, for these integral neurons, it is not required that

λ = \max \{∆ x_{1}, ∆ x_{2}, \dots, ∆ x_{n}\} \to 0

, it follows that we cannot assign them a classical equivalent with weights

w_{i}

for which

\sum_{i = 0}^{n} w_{i} x_{i} = \int_{0}^{\sum_{i = 0}^{n} x_{i}} w (t) d t .

(73)

This makes integral neurons an independent class, distinct from the classical structures we are accustomed to working with. Considering that

\sum_{i = 0}^{n} x_{i} = \sum_{i = 0}^{n} w_{i} x_{i}, \forall w_{i} = 1,

(74)

it follows that we can represent the integral neuron using familiar schemes where the dendritic branches have the same transmissive strength, equal to 1 (Figure 10).

The conditions in Theorems 1–3 do not require the proximity of input data,

λ = \max \{∆ x_{1}, ∆ x_{2}, \dots, ∆ x_{n}\} \to 0

, making them applicable to integral neurons as well. With these theorems, we have theoretically demonstrated that, just like classical neural structures, integral neural structures can be trained with samples and ultimately approximate a given problem with minimal error.

In classical neurons, training algorithms focus on modifying the weights, which influence the signal within the neuron body and, subsequently, the output. In integral neural structures, the weights of input data are fixed and constant with a value of 1, and the training algorithms are directed specifically at the parameters of the weight function—that is, the optimization of the type and shape of the signal within the neurons (Figure 11).

Another avenue for discussion is the possibility of creating neural structures that contain both integral and classical neurons simultaneously. Although this study examines only individual integral neurons, definitions can be proposed for the new integral structures created with their assistance.

Definition 3

(Integral Neural Network). An artificial neural network consisting entirely of integral neurons of class

(k + 1)

will be called a fully integral neural network of class

(k + 1), k \in N_{0}

.

Definition 4

(Network with Integral–Classical Architecture). A neural network containing

i

layers of integral neurons of class

(k + 1)

and

t

layers of classical neurons with a linear cumulative signal in the body

(i, t \in N)

will be called a neural network with integral–classical architecture of class

(k + 1)

and signature

(i, t)

.

A neural network with an integral–classical architecture combines layers of integral and classical neurons organized in a sequential or parallel structure, enabling both the processing of input stimuli through integral weight functions and calculations via linear combinations of stimuli, depending on the specific task and training requirements. This type of hybrid architecture contains integral layers with neurons whose bodies carry the signal

\tilde{S} = \int_{0}^{\sum_{i \in I} x_{i}} w (t) d t,

(75)

as well as classical structures with a linear signal in the bodies

S = \sum_{j \in J} w_{j} x_{j},

(76)

where

I

and

J

are the sets of indices of neurons from previous layers that provide output signals to the current layer. This hybrid architecture combines the advantages of both neuron types, allowing flexible (integral and nonlinear) representation of dependencies among inputs and standard processing through classical linear sums. It is evident that a fully integral neural network with

i

hidden layers can be considered and exhibits all the characteristics of an integral–classical network with signature

(i, 0)

, and any classical neural network with

j

hidden layers can be regarded as an integral–classical architecture with signature

(0, j)

, where

i, j \in N

.

Definition 5

(Architectural Balance). Let

Net

be a neural network with an integral–classical architecture of signature

(i, t)

. The ratio

p_{Net} = \frac{\sum_{k = 1}^{i} N_{k}}{\sum_{m = 1}^{t} N_{m}}

(77)

where

N_{k}

and

N_{m}

are the numbers of integral and classical neurons in the respective layers will be called the architectural balance coefficient. We will say that the neural network

Net

has an ideal architectural balance if and only if

i = t

and

p_{Net} = 1

. Such a neural network will be referred to as integrally and classically balanced.

By adjusting the signature (

i, t

), thereby altering the architectural balance of the neural structure, we can optimize the network for specific tasks. For tasks with higher demands for nonlinear representation, the number of integral layers and neurons can be increased, while for tasks with simpler dependencies, a greater emphasis can be placed on classical layers and neurons. Naturally, it should be acknowledged that, although an integral–classical architecture combines the strengths of both approaches—the efficiency of linear computations and the power of integral dependencies—training this type of structure presents a true scientific and practical challenge. One of the team’s goals is to develop multitask training where integral and classical layers are trained with partially independent algorithms, but with coordinated optimization steps across iterations.
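Definition 5 can be illustrated with a short computation. In this sketch a network is described only by two lists of layer sizes, one for the integral layers and one for the classical layers; this representation and the function names are assumptions made for illustration.

```python
from fractions import Fraction

def balance_coefficient(integral_layers, classical_layers):
    """Coefficient (77): total number of integral neurons over total number of classical neurons."""
    classical_total = sum(classical_layers)
    if classical_total == 0:
        raise ValueError("the coefficient is undefined without classical neurons")
    return Fraction(sum(integral_layers), classical_total)

def is_ideally_balanced(integral_layers, classical_layers):
    """Ideal balance: equal numbers of layers (i = t) and a coefficient equal to 1."""
    return (len(integral_layers) == len(classical_layers)
            and balance_coefficient(integral_layers, classical_layers) == 1)

# Hypothetical networks, given as neurons per layer
print(balance_coefficient([4, 2], [3, 3]), is_ideally_balanced([4, 2], [3, 3]))
print(balance_coefficient([8], [2, 2]), is_ideally_balanced([8], [2, 2]))
```

Using exact fractions avoids rounding issues when testing whether the coefficient equals 1.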

5. Integral XOR Neuron

A particularly important topic in the theory of artificial neural networks relates to the XOR logical function and the problem that arises from the inability to train a single neuron to solve XOR.

For the logical functions “AND” and “OR”, it is straightforward to construct a linear classifier and, therefore, a single neuron that recognizes objects from the two classes of points, returning “1” or “0”, respectively (Figure 12).

The recognition of different classes is due to the fact that a signal is formed within the body of a standard neuron:

S = w_{1} x_{1} + w_{2} x_{2} + b,

(78)

which, for

S = 0

, is equivalent to the line

w_{1} x_{1} + w_{2} x_{2} + b = 0

(79)

that separates the classes. For points on one side of the line,

S > 0

, and for those on the other side,

S < 0

. The division of points into two classes is achieved using an appropriate (typically step) transfer function.

On the other hand, “exclusive OR” (XOR) is a logical operation between two Boolean variables that returns true (“1”) if and only if the inputs are different. The XOR truth table is given in Table 1.

The key problem with XOR and a single neuron lies in the fact that XOR cannot be solved through linear classification, as there is no single line that can separate the input values (0,0) and (1,1), which return “0,” from (0,1) and (1,0), which return “1”, in a two-dimensional space (Figure 13). This means that a perceptron cannot find weights that successfully separate the ordered input pairs

(X_{1}, X_{2})

based on their XOR output values.
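The contrast between AND, OR, and XOR can be illustrated with a small search over the weights of a classical neuron with the signal (78) and a step transfer function. The grid of candidate weights below is an arbitrary assumption, and a finite search is only an illustration, not a proof; the impossibility for XOR follows from the linear separability argument above.

```python
from itertools import product

AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def step_neuron(w1, w2, b, x1, x2):
    """Classical neuron: linear signal (78) followed by a step transfer function."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def find_separator(table, grid):
    """Return the first (w1, w2, b) on the grid that reproduces the truth table, or None."""
    for w1, w2, b in product(grid, repeat=3):
        if all(step_neuron(w1, w2, b, x1, x2) == out for (x1, x2), out in table.items()):
            return (w1, w2, b)
    return None

grid = [k / 2 for k in range(-8, 9)]   # candidate values from -4.0 to 4.0 in steps of 0.5
print("AND:", find_separator(AND, grid))
print("OR: ", find_separator(OR, grid))
print("XOR:", find_separator(XOR, grid))
```

For AND and OR the search returns a valid triple, while for XOR it returns None on any grid, however fine.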

This problem was highlighted as early as 1969 in the book by Marvin Minsky and Seymour Papert [67], which led to a temporary decline in interest in neural network research. The authors demonstrated that single-layer perceptrons are limited and cannot be trained on certain types of logical functions, such as XOR.

To overcome the limitation of a single perceptron, multilayer perceptrons (MLPs) were introduced, which include one or more hidden layers. These layers enable the ANN to learn and represent nonlinear relationships, like XOR, by using combinations of linear classifications connected through nonlinear transfer functions, such as sigmoid or ReLU. Thus, MLPs can create complex, nonlinear separating surfaces that successfully model and solve XOR and similar functions. However, the problem of representing XOR with a single neuron remains unresolved in the theory of ANNs. One reason for this is that, since the advent of artificial neurons, the signal formed within their body has been conceptualized in the same way.

What would happen if the sum in the neuron’s body lost its linear, discrete nature,

S = \sum_{i = 0}^{n} w_{i} x_{i},

(80)

and was instead replaced by

\tilde{S} = \int_{0}^{\sum_{i = 0}^{n} w_{i} x_{i}} w (t) d t, \forall w_{i} = 1

(81)

where

w (t)

is the neuron’s weight function? In the case of the XOR logical function, we consider a neuron with two inputs (Figure 14), which we will refer to as an integral XOR neuron.

The set of training samples for the neuron is represented as

{D = \{[(x_{j 1}, x_{j 2}); h_{j}]\}}_{j = 1}^{4},

(82)

where the input–output data values are as specified in Table 2.

Before presenting the results of our theoretical considerations and their experimental confirmation, it is important to note that the Levenberg–Marquardt algorithm is sensitive to errors and to the training stage. With a higher amplitude transfer function, neuron output errors are detected and corrected more effectively, leading to a faster convergence to the error function minimum and improved overall convergence. Many transfer functions, including the hyperbolic tangent (

\tanh

), have relatively smooth gradients around zero, which can slow down training.

On the other hand, scaling transfer functions by coefficients does not alter their shape but intensifies their effect, increasing gradients and amplitudes, which makes training more efficient. Therefore, we have chosen the transfer function

g (\tilde{S}) = 3 \tanh (\tilde{S})

. We deliberately multiply by three to increase the neuron’s output amplitude, meaning that the output signal becomes stronger for the same input values. This allows the neuron to cover a broader range of values and respond more strongly to changes in its inputs, helping the training process capture and correct errors more effectively.

Subsequent experiments with linear and sinusoidal transfer functions also demonstrated their viability in building an integral XOR neuron. In all three experiments, a first-degree polynomial

w (t) = a_{0} + a_{1} t

was used as the weight function.

In artificial neural network theory, there is no strict rule or objective criterion for selecting an appropriate transfer function. Researchers typically choose a transfer function based on prior experience and professional intuition, considering the specific task or problem requirements. For instance, continuous transfer functions with sigmoid characteristics are known to solve a wide range of problems, but the specific choice of function involves experimentation combined with past experience.

Similarly, the choice of weight function follows the same principle. While certain criteria—such as continuity and smoothness—are essential, the exact form of the weight function depends on the particular task. Considering the XOR problem, with only four training samples (Table 1), it is appropriate to select a polynomial weight function with a maximum degree of

k = 3

, possessing four parameters. However, due to the characteristics of the transfer functions used, experimentation revealed that the optimal weight function is based on a first-degree polynomial

w (t) = a_{0} + a_{1} t

, characterized by two parameters (

a_{0}, a_{1}

), whose optimal values are determined by the training algorithms.

The MATLAB v2018a scripts for creating, training, and operating integral XOR neurons are provided in Appendix A of this article.

5.1. Solving XOR with an Integral Neuron Using a Hyperbolic Tangent Transfer Function

5.1.1. Theoretical Solution

For the weight function, we choose a first-degree polynomial:

w (t) = a_{0} + a_{1} t,

(83)

where

a = (a_{0}, a_{1})

are the parameters of

w (t)

that need to be optimized.

The error function is represented as

E (a) = \sum_{j = 1}^{4} {(h_{j} - 3 \tanh (\tilde{S}_{j}))}^{2},

(84)

where $\tilde{S}_{j}$ is the value, for the $j$-th training sample, of the signal

\tilde{S} = \int_{0}^{\sum_{i = 1}^{2} w_{i} x_{i}} w (t) d t = a_{0} (w_{1} x_{1} + w_{2} x_{2}) + \frac{1}{2} a_{1} {(w_{1} x_{1} + w_{2} x_{2})}^{2}, \quad w_{1} = w_{2} = 1 .

(85)
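As a quick check of the closed form in (85), the following Python sketch compares it with a midpoint-rule approximation of the integral. The parameter values `a0 = 0.7` and `a1 = -0.4` are arbitrary illustrative choices, and the function names are assumptions made for this sketch only.

```python
def weight(t, a0, a1):
    """First-degree weight function w(t) = a0 + a1*t, Eq. (83)."""
    return a0 + a1 * t

def signal_closed_form(x1, x2, a0, a1):
    """Closed form of Eq. (85) with w1 = w2 = 1."""
    s = x1 + x2
    return a0 * s + 0.5 * a1 * s ** 2

def signal_midpoint(x1, x2, a0, a1, n=1000):
    """Midpoint-rule approximation of the integral of w(t) from 0 to x1 + x2."""
    upper = x1 + x2
    h = upper / n
    return sum(weight((k + 0.5) * h, a0, a1) for k in range(n)) * h

a0, a1 = 0.7, -0.4  # arbitrary illustrative parameters, not from the article
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), signal_closed_form(x1, x2, a0, a1), signal_midpoint(x1, x2, a0, a1))
```

Because the signal depends on the inputs only through $x_{1} + x_{2}$, the samples (0, 1) and (1, 0) always produce the same value, and the quadratic term is what lets the signal at $x_{1} + x_{2} = 2$ differ from twice the signal at $x_{1} + x_{2} = 1$.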

We will prove that parameters

a = (a_{0}, a_{1})

for the weight function

w (t)

exist, such that the neuron can solve the XOR logical function. In practice, we need to show that ∃

a = (a_{0}, a_{1})

such that

\left\{\begin{matrix} \mathrm{Out} (0, 0) = 3 \tanh (\tilde{S} (0, 0)) = 0 \\ \mathrm{Out} (0, 1) = 3 \tanh (\tilde{S} (0, 1)) = 1 \\ \mathrm{Out} (1, 0) = 3 \tanh (\tilde{S} (1, 0)) = 1 \\ \mathrm{Out} (1, 1) = 3 \tanh (\tilde{S} (1, 1)) = 0 \end{matrix}\right. .

(86)

The first equality

\mathrm{Out} (0, 0) = 3 \tanh (\tilde{S} (0, 0)) = 3 \tanh (a_{0} (0 + 0) + \frac{1}{2} a_{1} {(0 + 0)}^{2}) = 3 \tanh (0) = 0

(87)

holds for any

a = (a_{0}, a_{1}) \in R^{2}

.

The second and third equalities in (86), due to the nature of

\tilde{S}

, provide the same information,

3 \tanh (a_{0} + \frac{1}{2} a_{1}) = 1,

(88)

which implies

a_{0} + \frac{1}{2} a_{1} = \frac{1}{2} \ln (2) .

(89)
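The step from (88) to (89) relies on the closed form of the inverse hyperbolic tangent. Written out with $s = a_{0} + \frac{1}{2} a_{1}$ (a shorthand introduced here only for this derivation):

```latex
3 \tanh (s) = 1
\;\Longrightarrow\; \tanh (s) = \tfrac{1}{3}
\;\Longrightarrow\; s = \operatorname{artanh} \left( \tfrac{1}{3} \right)
  = \frac{1}{2} \ln \frac{1 + \tfrac{1}{3}}{1 - \tfrac{1}{3}}
  = \frac{1}{2} \ln (2) .
```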

Finally, from the last equality in (86), we have

3 \tanh (2 (a_{0} + a_{1})) = 0,

(90)

i.e.,

a_{1} = - a_{0} .

(91)

Then, from (89) and (91), we can form the system

\left\{\begin{array}{l} a_{0} + \frac{1}{2} a_{1} = \frac{1}{2} \ln (2) \\ a_{1} = - a_{0} \end{array}\right. .

(92)

From this, it follows that

(a_{0}, a_{1}) = (\ln (2), - \ln (2))

.

Thus, in the body of the XOR neuron, we have the signal

\tilde{S} = \ln (2) (w_{1} x_{1} + w_{2} x_{2}) - \frac{1}{2} \ln (2) {(w_{1} x_{1} + w_{2} x_{2})}^{2},

(93)

and the neuron itself appears as shown in Figure 15.

This integral neuron should be able to solve the XOR logical function.
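The theoretical parameters can be checked numerically. The sketch below evaluates the neuron defined by (93) on the four XOR inputs; the function name `xor_neuron_tanh` is an assumption made for this illustration.

```python
import math

LN2 = math.log(2.0)

def xor_neuron_tanh(x1, x2, a0=LN2, a1=-LN2):
    """Integral neuron with w(t) = a0 + a1*t and transfer function 3*tanh."""
    s = x1 + x2                              # upper integration limit (w1 = w2 = 1)
    signal = a0 * s + 0.5 * a1 * s ** 2      # Eq. (93)
    return 3.0 * math.tanh(signal)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(xor_neuron_tanh(x1, x2), 12))
```

Up to floating-point rounding, the outputs reproduce the XOR targets 0, 1, 1, 0.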

5.1.2. Experimental Confirmation

For the purposes of the experiments, a MATLAB v2018a script was created (see Appendix A) that implements our proposed Levenberg–Marquardt algorithm for training an integral neuron. When using this script, arbitrary initial values for the vector

a = (a_{0}, a_{1})

, containing the parameters of the weight function, are set and then updated during training using the transition

a \leftarrow a - {(J^{T} J + λ I)}^{- 1} J^{T} e,

(94)

where

J

is the Jacobian matrix containing the derivatives of the per-sample errors that make up

E (a)

with respect to each parameter of

w (t)

,

e

is the vector containing all errors for each sample in the training set that passes through the neuron at each iteration,

λ

is the parameter that controls the transition between the gradient descent method and the Gauss–Newton algorithm, and

I

is the identity matrix.

During the iterations, the following parameters are tracked:

Iteration: This is the number of the current iteration of the algorithm. The algorithm goes through multiple iterations, updating parameters each time in an attempt to find the optimal solution. Iterations continue until the maximum number of iterations is reached or convergence criteria are met (e.g., when changes in error or step size become very small).
Func-count: This is the number of calls to the objective function, which calculates the error (residuals). Monitoring the number of calls is essential because each call requires computations that can be costly for complex tasks. The algorithm may call the function more than once per iteration to calculate gradients or other necessary quantities.
First-Order Optimality: This measures how close the current solution is to the optimum, calculated using the gradient of the error function. In optimization, first-order optimality measures how far the gradient is from zero. A gradient close to zero suggests proximity to the optimal solution. If the first-order optimality value is low, it is an indication that the algorithm may be near a local (or global) minimum.
Lambda ( $λ$ ): This is the Levenberg–Marquardt parameter controlling the transition between the gradient descent method and the Gauss–Newton method. At high values of $λ$ , the algorithm behaves like gradient descent, taking smaller, more conservative steps. When $λ$ is low, the algorithm approaches the Gauss–Newton method, making more aggressive steps that can lead to faster convergence. The algorithm dynamically adjusts $λ$ based on the success of previous iterations.
Norm of Step: The step norm indicates the magnitude of the parameter change in the current iteration. Large step norm values usually mean the algorithm is making significant adjustments to parameters, while small values indicate the algorithm is fine-tuning and making more minor adjustments. When the step norm becomes very small, it usually signifies that the algorithm is close to convergence.
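The update (94) and the tracked quantities can be sketched in plain Python for the two-parameter case. This is a minimal illustration, not the authors' MATLAB script: the function names, the initial damping `lam=1.0`, and the rule of dividing or multiplying $λ$ by 10 after an accepted or rejected step are all assumptions, so the iteration history is not expected to match Table 3 line by line. Func-count and first-order optimality are omitted for brevity.

```python
import math

SAMPLES = [((0, 0), 0.0), ((0, 1), 1.0), ((1, 0), 1.0), ((1, 1), 0.0)]  # XOR

def residuals_and_jacobian(a):
    """Residuals e_j = h_j - 3*tanh(S_j) and their derivatives w.r.t. (a0, a1)."""
    e, J = [], []
    for (x1, x2), h in SAMPLES:
        s = x1 + x2
        signal = a[0] * s + 0.5 * a[1] * s ** 2
        t = math.tanh(signal)
        slope = -3.0 * (1.0 - t * t)          # d e_j / d S_j
        e.append(h - 3.0 * t)
        J.append((slope * s, slope * 0.5 * s ** 2))
    return e, J

def sum_sq(e):
    """Total squared error E(a)."""
    return sum(r * r for r in e)

def lm_step(a, e, J, lam):
    """Solve (J^T J + lam*I) d = J^T e for the 2x2 case; return a - d and |d|, Eq. (94)."""
    h00 = sum(r[0] * r[0] for r in J) + lam
    h01 = sum(r[0] * r[1] for r in J)
    h11 = sum(r[1] * r[1] for r in J) + lam
    g0 = sum(r[0] * ej for r, ej in zip(J, e))
    g1 = sum(r[1] * ej for r, ej in zip(J, e))
    det = h00 * h11 - h01 * h01
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    return (a[0] - d0, a[1] - d1), math.hypot(d0, d1)

def train(a, lam=1.0, max_iter=100, tol=1e-14):
    """Levenberg-Marquardt loop with an assumed x10 / :10 damping schedule."""
    e, J = residuals_and_jacobian(a)
    err = sum_sq(e)
    history = [(0, err, lam, 0.0)]
    for it in range(1, max_iter + 1):
        trial, step = lm_step(a, e, J, lam)
        e_new, J_new = residuals_and_jacobian(trial)
        err_new = sum_sq(e_new)
        if err_new < err:                      # accept: move toward Gauss-Newton
            a, e, J, err = trial, e_new, J_new, err_new
            lam /= 10.0
        else:                                  # reject: move toward gradient descent
            lam *= 10.0
        history.append((it, err, lam, step))
        if step < tol or err < 1e-28:
            break
    return a, err, history

a_init = (0.629447372786358, 0.811583874151238)   # initial values quoted in the text
a_opt, final_err, history = train(a_init)
for it, err, lam, step in history:
    print(f"iter={it:2d}  error={err:.6e}  lambda={lam:.1e}  step norm={step:.3e}")
print("trained parameters:", a_opt)
```

Each printed row corresponds to the Iteration, error, Lambda, and Norm of Step quantities described above. A rejected step leaves the parameters unchanged and only increases $λ$, which is the mechanism that shifts the method toward the more conservative gradient-descent behavior.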

To start the algorithm, the vector

a

is randomly initialized with values in the range [−1,1]. For the results presented in Table 3, the algorithm randomly selected initial values for the weight function parameters as follows:

a_{0} = 0.629447372786358, a_{1} = 0.811583874151238

.

As is typical in artificial intelligence, initially, with the randomly selected parameters of its weight function

w (t)

, the integral neuron is not adapted to the problem it needs to solve, resulting in an error of 12.4149. However, as the iterations proceed, the parameters of

w (t)

gradually change, leading to increasing effectiveness until reaching zero error. In this case, the algorithm identifies the optimal parameters for the weight function as

(a_{0} = 0.693147180559945, a_{1} = - 0.693147180559945)

, making the final form of the weight function

w (t) = 0.693147180559945 - 0.693147180559945 t .

(95)

It is noteworthy that

0.693147180559945 = \ln (2)

, as predicted by theory, indicating that the Levenberg–Marquardt algorithm has accurately reached the global minimum of the error function.

The results returned by the integral neuron with this weight function are presented in Table 4.

5.2. Solving XOR with an Integral Neuron Using the Transfer Function $f (x) = x$

5.2.1. Theoretical Solution

By analogy with the theoretical considerations we made for an integral neuron with the transfer function

3 \tanh

and with the same choice of weight function, we conclude that the optimal parameters satisfying the equations

\left\{\begin{matrix} \mathrm{Out} (0, 0) = [\tilde{S} (0, 0)] = 0 \\ \mathrm{Out} (0, 1) = [\tilde{S} (0, 1)] = 1 \\ \mathrm{Out} (1, 0) = [\tilde{S} (1, 0)] = 1 \\ \mathrm{Out} (1, 1) = [\tilde{S} (1, 1)] = 0 \end{matrix}\right.

(96)

are solutions to the system

\left\{\begin{array}{l} a_{0} + \frac{1}{2} a_{1} = 1 \\ a_{1} = - a_{0} \end{array}\right. .

(97)

The solution to (97) leads to optimal parameters for the weight function of

(a_{0} = 2, a_{1} = - 2)

, which means that the integral neuron capable of handling XOR would appear as shown in Figure 16.

5.2.2. Experimental Confirmation

Applying the Levenberg–Marquardt method to the parameters of the weight function, with an initial random selection of

a_{0} = - 0.746026367412988, a_{1} = 0.826751712278039

, the algorithm, through its iterations, reached the pair

(a_{0} = 2, a_{1} = - 2)

, as predicted by theory, defining the global minimum of the error (Table 5).
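With the identity transfer function, the output is linear in $(a_{0}, a_{1})$, so the least-squares problem can be solved in closed form. The sketch below illustrates this with a single undamped step (the limit $λ = 0$ of the Levenberg–Marquardt update, i.e., a Gauss–Newton step). It is an illustration under that assumption, not a reproduction of the damped iterations reported in Table 5, and the function name is hypothetical.

```python
SAMPLES = [((0, 0), 0.0), ((0, 1), 1.0), ((1, 0), 1.0), ((1, 1), 0.0)]  # XOR

def gauss_newton_step(a):
    """One Gauss-Newton step for Out = S with w(t) = a0 + a1*t (lambda = 0)."""
    h00 = h01 = h11 = g0 = g1 = 0.0
    for (x1, x2), h in SAMPLES:
        s = x1 + x2
        e = h - (a[0] * s + 0.5 * a[1] * s ** 2)   # residual h_j - Out_j
        j0, j1 = -s, -0.5 * s ** 2                 # d e / d a0, d e / d a1 (constant in a)
        h00 += j0 * j0
        h01 += j0 * j1
        h11 += j1 * j1
        g0 += j0 * e
        g1 += j1 * e
    det = h00 * h11 - h01 * h01                    # determinant of J^T J
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    return (a[0] - d0, a[1] - d1)

a_init = (-0.746026367412988, 0.826751712278039)   # initial values quoted in the text
a_new = gauss_newton_step(a_init)
print(a_new)
```

Because the Jacobian does not depend on the parameters in this case, the single step lands on the pair (2, −2) up to rounding, regardless of the starting point.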

5.3. Solving XOR with an Integral Neuron Using the Transfer Function $f (x) = 2 \sin (x)$

5.3.1. Theoretical Solution

Let us consider a neuron with the transfer function

g (x) = 2 \sin (x)

, where the weight function is again a first-degree polynomial:

w (t) = a_{0} + a_{1} t .

(98)

The signal in the body of the neuron is

\tilde{S} = \int_{0}^{\sum_{i = 1}^{2} w_{i} x_{i}} w (t) d t = a_{0} (w_{1} x_{1} + w_{2} x_{2}) + \frac{1}{2} a_{1} {(w_{1} x_{1} + w_{2} x_{2})}^{2}, \quad w_{1} = w_{2} = 1 .

(99)

Given that

w_{1} = w_{2} = 1

, it is clear that the neuron’s output is

\mathrm{Out} = 2 \sin (a_{0} (x_{1} + x_{2}) + \frac{1}{2} a_{1} {(x_{1} + x_{2})}^{2}) .

(100)

We will prove that parameters

a = (a_{0}, a_{1})

exist for the weight function

w (t)

that enable the neuron to solve the XOR logical function. In practice, we need to show that ∃

a = (a_{0}, a_{1})

such that

\left\{\begin{matrix} \mathrm{Out} (0, 0) = 2 \sin (\tilde{S} (0, 0)) = 0 \\ \mathrm{Out} (0, 1) = 2 \sin (\tilde{S} (0, 1)) = 1 \\ \mathrm{Out} (1, 0) = 2 \sin (\tilde{S} (1, 0)) = 1 \\ \mathrm{Out} (1, 1) = 2 \sin (\tilde{S} (1, 1)) = 0 \end{matrix}\right. .

(101)

The first equality

\mathrm{Out} (0, 0) = 2 \sin (\tilde{S} (0, 0)) = 2 \sin (a_{0} (0 + 0) + \frac{1}{2} a_{1} {(0 + 0)}^{2}) = 2 \sin (0) = 0

(102)

is satisfied for all

a = (a_{0}, a_{1}) \in R^{2}

.

The second and third equalities in (101), due to the nature of

\tilde{S}

, provide the same information:

2 \sin (a_{0} + \frac{1}{2} a_{1}) = 1,

(103)

which implies

a_{0} + \frac{1}{2} a_{1} = \frac{π}{6} + 2 k π, \quad k \in Z

(104)

or

a_{0} + \frac{1}{2} a_{1} = \frac{5 π}{6} + 2 k π, \quad k \in Z .

(105)

Finally, from the last equality in (101), we have

2 \sin (2 (a_{0} + a_{1})) = 0,

(106)

which is satisfied, in particular, when

a_{1} = - a_{0} .

(107)

Let us consider one of the possible systems:

\left\{\begin{array}{l} a_{0} + \frac{1}{2} a_{1} = \frac{π}{6} + 2 k π \\ a_{1} = - a_{0} \end{array}\right. , \quad k \in Z .

(108)

The system has infinitely many solutions, one for each integer $k$. Taking

k = 0

gives

(a_{0}, a_{1}) = (\frac{π}{3}, - \frac{π}{3})

. With these parameters, the neuron has a weight function

w (t) = \frac{π}{3} (1 - t),

(109)

and in its body, we obtain the cumulative signal

\tilde{S} = \int_{0}^{\sum_{i = 1}^{2} x_{i}} w (t) d t = \frac{π}{3} (x_{1} + x_{2}) - \frac{π}{6} {(x_{1} + x_{2})}^{2} .

(110)

Along the neuron’s axon (Figure 17), the output signal propagates as

\mathrm{Out} = 2 \sin (\frac{π}{3} (x_{1} + x_{2}) - \frac{π}{6} {(x_{1} + x_{2})}^{2}) .

(111)

It can easily be seen that, theoretically, this neuron can indeed solve the XOR logical function problem (Table 6).
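The same verification can be run numerically, and it also illustrates that the solution is not unique. The sketch below evaluates (100) for the pair from the text and for a second pair, $(\frac{5 π}{3}, - \frac{5 π}{3})$, which follows in the same way from branch (105) with $k = 0$ together with (107). The function name is an assumption made for this illustration.

```python
import math

def xor_neuron_sin(x1, x2, a0, a1):
    """Integral neuron with w(t) = a0 + a1*t and transfer function 2*sin, Eq. (100)."""
    s = x1 + x2
    return 2.0 * math.sin(a0 * s + 0.5 * a1 * s ** 2)

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

# (pi/3, -pi/3): solution of system (108) with k = 0, as in the text.
# (5pi/3, -5pi/3): obtained the same way from branch (105) with k = 0.
for a0 in (math.pi / 3.0, 5.0 * math.pi / 3.0):
    outputs = [xor_neuron_sin(x1, x2, a0, -a0) for x1, x2 in INPUTS]
    print(f"a0 = {a0:.6f}, a1 = {-a0:.6f}:", [round(o, 12) for o in outputs])
```

Both parameter pairs reproduce the XOR targets up to rounding, which is consistent with the observation that training runs started from different random points may end at different optimal parameters.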

5.3.2. Experimental Confirmation

The empirical confirmation of the existence and trainability of this type of neuron was again carried out in the MATLAB v2018a environment. The transfer function

g (x) = 2 \sin (x)

, like those previously used, is continuous and smooth, fulfilling the conditions for the applicability of the Levenberg–Marquardt method according to Theorem 3. The initial parameters of the weight function are once again generated as random numbers in the interval [−1, 1]. Given this random initialization, different runs can be expected to end with different optimal parameters of the weight function. In the worst observed case, the neuron successfully handles the XOR function with an error on the order of

10^{- 29}

. In many instances, the algorithm reaches the global minimum, discovering the weight function for a neuron that operates with zero error. Table 7 presents an example where the iterative algorithm completes in just six epochs, resulting in a final zero total error.

With initial, random parameters of the weight function

(a_{0} = - 0.447949846002843, a_{1} = 0.359405353707350)

, before training, the integral neuron produced an error of

4.80645

when solving XOR. After the final iteration of the training algorithm, it adjusted the parameters to

(a_{0} = 1.047197551196598, a_{1} = - 1.047197551196598)

, yielding an absolutely accurate result for XOR (Table 8), corresponding to the theoretically determined model

(a_{0}, a_{1}) = (\frac{π}{3}, - \frac{π}{3})

.
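The reported error values can be recomputed directly from the quoted parameters. The sketch below evaluates the total squared error for the initial and the trained weight function; the helper name `total_error` is an assumption, and the error is taken as the plain sum of squares over the four XOR samples.

```python
import math

SAMPLES = [((0, 0), 0.0), ((0, 1), 1.0), ((1, 0), 1.0), ((1, 1), 0.0)]  # XOR

def total_error(a0, a1):
    """Sum of squared errors over the XOR samples for Out = 2*sin(S)."""
    err = 0.0
    for (x1, x2), h in SAMPLES:
        s = x1 + x2
        out = 2.0 * math.sin(a0 * s + 0.5 * a1 * s ** 2)
        err += (h - out) ** 2
    return err

before = total_error(-0.447949846002843, 0.359405353707350)   # initial parameters from the text
after = total_error(1.047197551196598, -1.047197551196598)    # trained parameters from the text
print(f"error before training: {before:.5f}")
print(f"error after training:  {after:.3e}")
print("a0 - pi/3 =", 1.047197551196598 - math.pi / 3.0)
```

The value before training agrees with the reported 4.80645, and the trained parameters coincide with $\frac{π}{3}$ to within double-precision rounding.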

6. Conclusions

From the very inception of the classical neuron, the idea of forming a total summed signal in its body has been adopted:

S = \sum_{i = 0}^{n} w_{i} x_{i},

(112)

where

x_{i}

and

w_{i}

represent the inputs to the neuron and their respective weights

(i = 0, 1, \dots, n)

. The neuron’s output then takes the value

\mathrm{Out} = g (\sum_{i = 0}^{n} w_{i} x_{i}),

(113)

where

g

is its transfer function.

The linear combination of input signals and weights is the simplest and most intuitive way to model the interaction between inputs and the neuron. It is straightforward for mathematical representation and processing and allows for the use of established methods in linear algebra and optimization, facilitating the training of neurons and neural networks, especially in their early forms. The first stage where nonlinearity can emerge in neuron operation is with the application of the transfer function, which transforms the linear form (112) into the nonlinear signal (113). However, the linear sum alone, without an appropriate transfer function, cannot capture complex, nonlinear dependencies among the input signals. Additionally, it is sensitive to the scale of the input signals. If the inputs vary in scale, the sum can be dominated by larger values, potentially making the neuron less effective at processing information. This often necessitates pre-normalization of the inputs.

In this study, we set out to explore the possibility of changing the form of the signal (112). It turns out that if we normalize the input–output samples

D = {\{[x_{j 0} = 1, x_{j 1}, \dots, x_{j n}; h_{j}]\}}_{j = 1}^{m}

within a sufficiently small interval, where

λ = \max \{∆ x_{1}, ∆ x_{2}, \dots, ∆ x_{n}\} \to 0

for

∆ x_{i} = x_{i} - x_{i - 1}, i = 1, 2, \dots, n

, we can consider a dynamic neural structure with a signal in its body as follows:

S = \sum_{i = 0}^{n} w_{i} x_{i} = \int_{0}^{\sum_{i = 0}^{n} x_{i}} w (t) d t

(114)

where the function

w (t)

can define each weight of the neuron as

w_{i} = w (x_{i})

.

An even more interesting case emerged when the requirement for data proximity

λ = \max \{∆ x_{1}, ∆ x_{2}, \dots, ∆ x_{n}\} \to 0

was not met. Under such conditions, we can consider new and unconventional artificial intelligence cell structures. Here, assuming that inputs are applied with constant weights

w_{i} = 1

, an integral signal is formed:

\tilde{S} = \int_{0}^{\sum_{i = 0}^{n} x_{i}} w (t) d t,

(115)

where

w (t)

is a continuous and smooth function (the neuron’s weight function). As a result, the axon of this type of neuron yields the following values:

\mathrm{Out} = g (\int_{0}^{\sum_{i = 0}^{n} x_{i}} w (t) d t) .

(116)

Thus, given the training set

D = {\{[x_{j 0}, x_{j 1}, \dots, x_{j n}; h_{j}]\}}_{j = 1}^{m}, \quad x_{j 0} = 1, \forall j = 1, 2, \dots, m

(117)

the mean squared error is

E = \frac{1}{m} \sum_{j = 1}^{m} {(h_{j} - g (\int_{0}^{\sum_{i = 0}^{n} x_{j i}} w (t) d t))}^{2} .

(118)
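For a polynomial weight function, the error (118) can be evaluated without numerical integration. The following sketch is one possible implementation; the function names and the sample parameters `a = [0.5, -0.25]` are illustrative assumptions. Note that, following (117), the constant input $x_{j 0} = 1$ is included in the upper integration limit here, whereas the XOR experiments in Section 5 sum only over $i = 1, 2$.

```python
import math

def integral_signal(x, a):
    """Integral of the polynomial w(t) = sum_p a[p]*t**p from 0 to sum(x)."""
    upper = sum(x)
    return sum(a_p * upper ** (p + 1) / (p + 1) for p, a_p in enumerate(a))

def mean_squared_error(data, a, g):
    """Eq. (118): data is a list of (inputs, target) pairs; inputs include x_j0 = 1."""
    return sum((h - g(integral_signal(x, a))) ** 2 for x, h in data) / len(data)

# XOR samples with the constant input x_j0 = 1 prepended, as in the training set (117).
data = [((1, 0, 0), 0.0), ((1, 0, 1), 1.0), ((1, 1, 0), 1.0), ((1, 1, 1), 0.0)]
a = [0.5, -0.25]                      # arbitrary illustrative parameters
print(mean_squared_error(data, a, math.tanh))
```

Passing the transfer function `g` as an argument keeps the sketch independent of a particular choice such as $\tanh$ or the identity.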

In the structure of integral neurons, traditional weights are absent, but in their place, we have the weight function

w (t)

, which not only replaces weights but also imparts a nonlinear character to the signal. With the introduction of the weight function, the optimization algorithms for the neuron are no longer aimed at finding a set of static weights but at discovering an appropriate form of

w (t)

(its parameters) that minimizes the error. It turns out that gradient descent, Gauss–Newton, and the combined Levenberg–Marquardt optimization algorithms, when applied in this context, are convergent. During neuron training, starting from random parameters of the weight function, they are able to optimize it in a way that indeed leads to error minimization.

Are integral neurons (Figure 18) part of artificial intelligence? Absolutely. Like all artificial intelligence systems, integral neurons are initially unrefined and relatively ineffective in relation to the specific problem they are directed toward. However, during the training process, through the use of training samples, the parameters of their weight functions are adjusted in a way that drives the error toward its local or global minimum.

In the experimental part of this study, this type of integral neuron was used to solve the XOR logical function. It turns out that integral neural structures can be trained to solve this function—something that is impossible for standard neurons due to the linear nature of the sum in their bodies, which serves as the equation for the separating line. Theoretical solutions and experimental confirmations are shown for solving XOR using integral neurons with three different transfer functions. In each of the experiments, the weight functions are initialized with random parameters. Initially, the neurons have a high error rate, but after training with XOR samples, their weight functions are adjusted in a way that produces solutions with zero errors and parameters predicted by theory.
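The obstruction for a classical neuron can be made explicit. The short argument below assumes a strictly monotone transfer function $g$ (such as $\tanh$ or the identity) and writes $S_{x_{1} x_{2}}$ for the summed signal (112) with bias weight $w_{0}$; this notation is introduced here only for the illustration.

```latex
\begin{aligned}
& S_{00} = w_{0}, \quad S_{01} = w_{0} + w_{2}, \quad S_{10} = w_{0} + w_{1}, \quad S_{11} = w_{0} + w_{1} + w_{2}
  \;\Longrightarrow\; S_{00} + S_{11} = S_{01} + S_{10}, \\
& g (S_{00}) = g (S_{11}) = 0, \quad g (S_{01}) = g (S_{10}) = 1
  \;\Longrightarrow\; S_{00} = S_{11} = c, \quad S_{01} = S_{10} = d, \quad c \neq d, \\
& \text{hence } 2 c = 2 d, \text{ a contradiction.}
\end{aligned}
```

For the integral neuron with a first-degree weight function, the signal $\tilde{S} = a_{0} s + \frac{1}{2} a_{1} s^{2}$ with $s = x_{1} + x_{2}$ is quadratic in $s$, so the requirement $\tilde{S} (0) = \tilde{S} (2) \neq \tilde{S} (1)$ can be met, for example with $a_{1} = - a_{0} \neq 0$.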

Integral neurons can also be used in more complex real-world situations, but the following limitations should be taken into account:

Training Complexity: Training an integral neuron, especially with higher smoothness classes of the weight function, requires precise calculation of the gradient, Jacobian, and, potentially, Hessian matrix. This can potentially lead to significant computational challenges in real-world applications.
Algorithm Convergence: As noted in the article, algorithms such as Gauss–Newton or Levenberg–Marquardt require appropriate initial parameters to ensure convergence to a global minimum. Otherwise, there is a risk of falling into a local minimum.
Normalization Issues: To use the integral equivalent of a standard neuron, the following condition must be satisfied for the input data:

$∆ x_{i} = x_{i} - x_{i - 1}, i = 1, 2, \dots, n$ and $λ = \max \{∆ x_{1}, ∆ x_{2}, \dots, ∆ x_{n}\} \to 0$

This means that the input data must be carefully normalized to ensure the correctness of the integral representation and the execution of the algorithms. Of course, this is only necessary if we aim to replicate the behavior of standard neurons. In cases where we abandon requirement (1), we are dealing with an entirely new artificial neural structure for which such normalization is not necessary.

Implementation: The practical implementation of the integral neuron requires complex computational systems for signal processing, especially when the integrand function has a non-trivial form.

A subject of future research is the comparison of the integral neuron with other types of neurons, e.g., rule-based neural networks where each neuron is a fuzzy inference system [68], as well as their capabilities to solve various regression and classification tasks.

Finally, it should be noted that in this study, integral neurons were equipped with polynomial weight functions. However, these could be functions of various types, as long as they meet the requirements for continuity and differentiability imposed by the convergence theorems of the training algorithms. Future goals in this area include exploring different types of weight functions and investigating the possibilities for building and using integral neural networks, as well as hybrid neural networks that incorporate both classical and integral neurons.

Author Contributions

Conceptualization, K.Y. and E.H.; methodology, K.Y. and E.H.; software, K.Y.; validation, K.Y., E.H. and S.H.; formal analysis, K.Y.; investigation, K.Y., E.H. and S.H.; resources, K.Y.; data curation, K.Y.; writing—original draft preparation, K.Y. and E.H.; writing—review and editing, E.H. and S.H.; visualization, K.Y.; supervision, E.H.; project administration, E.H.; funding acquisition, E.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the MUPD23-FMI-021 and SP23-FMI-008 projects of the Research Fund of the University of Plovdiv “Paisii Hilendarski”.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. MATLAB v2018a Scripts for Creating and Using an Integral XOR Neuron

The integral neuron described in the scripts has the following characteristics:

-: Weight function w(t) = a0 + a1*t;
-: Integral function a(1)*sum_x + (a(2)/2)*sum_x², corresponding to the weight function;
-: Transfer function 3*tanh(integral_value), which uses the calculated values of the integral function;
-: Trained using the Levenberg–Marquardt method.

To modify these characteristics, the scripts need to be appropriately adjusted.

Figure A1. MATLAB v2018a file compute_error.m—function for error calculation.

Figure A2. MATLAB v2018a file Levenberg_Markvard_XOR_neuron.m—script for training the integral XOR neuron, calculating the coefficients of the weight function.

Figure A3. MATLAB v2018a file Test.m—usage of the integral XOR neuron.

Appendix B. Calculation of the Error Gradient for an Arbitrary Weight Function—Example

Calculation of the Error Gradient for an Arbitrary Weight Function

In the case of the integral neuron, the error function is defined as

E (a) = \sum_{j = 1}^{m} {(h_{j} - g (\int_{0}^{\sum_{i = 0}^{n} x_{j i}} w (t) d t))}^{2},

(A1)

where

-: $h_{j}$ is the desired output value (the corresponding value of the target function);
-: $g$ is the transfer function of the neuron;
-: $w (t)$ is a differentiable weight function parameterized by the vector

$a = {[\begin{matrix} a_{0} \\ a_{1} \\ a_{2} \\ ⋮ \\ a_{q} \end{matrix}]}_{(q + 1) \times 1}$

The derivative of

E

with respect to the parameter

a_{p} \; (\text{for } p = 0, 1, \dots, q)

is derived in the following steps:

\frac{\partial E}{\partial a_{p}} = 2 \sum_{j = 1}^{m} ((h_{j} - g (S_{j})) (- \frac{\partial g (S_{j})}{\partial S_{j}}) (\frac{\partial S_{j}}{\partial a_{p}})),

(A2)

where

S_{j}

are the signals within the neuron body for each training sample, i.e.,

S_{j} = \int_{0}^{\sum_{i = 0}^{n} x_{j i}} w (t) d t, \forall j = 1, 2, \dots m .

(A3)

The derivative of the signal (A3), involved in the sum (A2), is represented as

\frac{\partial S_{j}}{\partial a_{p}} = \int_{0}^{\sum_{i = 0}^{n} x_{j i}} \frac{\partial w (t)}{\partial a_{p}} d t .

(A4)

Thus, for an arbitrary weight function, we have

\frac{\partial E}{\partial a_{p}} = 2 \sum_{j = 1}^{m} ((h_{j} - g (S_{j})) (- \frac{\partial g (S_{j})}{\partial S_{j}}) \int_{0}^{\sum_{i = 0}^{n} x_{j i}} \frac{\partial w (t)}{\partial a_{p}} d t) .

(A5)

We have already discussed the fact that, according to the Weierstrass Approximation Theorem, any continuous function on a closed interval can be approximated by a polynomial. In this case, if

w (t) = a_{0} + a_{1} t + a_{2} t^{2} + \dots + a_{q} t^{q}, \quad \forall a_{p} \in R, \; p = 0, 1, 2, \dots, q, \quad \text{then} \quad \frac{\partial w (t)}{\partial a_{p}} = t^{p},

(A6)

for the gradient of the error function, we have

\frac{\partial E}{\partial a_{p}} = 2 \sum_{j = 1}^{m} ((h_{j} - g (S_{j})) (- \frac{\partial g (S_{j})}{\partial S_{j}}) \int_{0}^{\sum_{i = 0}^{n} x_{j i}} t^{p} d t)

(A7)

or equivalently

\frac{\partial E}{\partial a_{p}} = 2 \sum_{j = 1}^{m} ((h_{j} - g (S_{j})) (- \frac{\partial g (S_{j})}{\partial S_{j}}) \frac{{(\sum_{i = 0}^{n} x_{j i})}^{p + 1}}{p + 1}), \quad \forall p = 0, 1, \dots, q .

(A8)

Equations (A5) for the general case and (A8) for a polynomial weight function provide a clear understanding of how all components of the error gradient are computed:

\frac{\partial E}{\partial w (t)} = {[\begin{matrix} \frac{\partial E}{\partial a_{0}} \\ \frac{\partial E}{\partial a_{1}} \\ \frac{\partial E}{\partial a_{2}} \\ ⋮ \\ \frac{\partial E}{\partial a_{q}} \end{matrix}]}_{(q + 1) \times 1} .

(A9)

2. Example

Let an integral neuron be given with a hyperbolic tangent

(\tanh)

transfer function, and the weight function is a second-degree polynomial:

w (t) = a_{0} + a_{1} t + a_{2} t^{2}

(A10)

The signal within the neuron body is

S_{j} = \int_{0}^{\sum_{i = 0}^{n} x_{j i}} w (t) d t = \sum_{p = 0}^{2} a_{p} \frac{{(\sum_{i = 0}^{n} x_{j i})}^{p + 1}}{p + 1}, \quad \forall j = 1, 2, \dots, m,

(A11)

and its derivatives are

\{\begin{matrix} \frac{\partial S_{j}}{\partial a_{0}} = \int_{0}^{\sum_{i = 0}^{n} x_{j i}} 1 d t = \sum_{i = 0}^{n} x_{j i}, \\ \frac{\partial S_{j}}{\partial a_{1}} = \int_{0}^{\sum_{i = 0}^{n} x_{j i}} t d t = \frac{{(\sum_{i = 0}^{n} x_{j i})}^{2}}{2}, \\ \frac{\partial S_{j}}{\partial a_{2}} = \int_{0}^{\sum_{i = 0}^{n} x_{j i}} t^{2} d t = \frac{{(\sum_{i = 0}^{n} x_{j i})}^{3}}{3} . \end{matrix}

(A12)

Given that the transfer function is the hyperbolic tangent, we have

\frac{\partial g (S_{j})}{\partial S_{j}} = 1 - \tanh^{2} (S_{j}) = 1 - \tanh^{2} (\sum_{p = 0}^{2} a_{p} \frac{{(\sum_{i = 0}^{n} x_{j i})}^{p + 1}}{p + 1}), \quad \forall j = 1, 2, \dots, m .

(A13)

Thus, for the individual components of the gradient

\frac{\partial E}{\partial w (t)} = [\begin{matrix} \frac{\partial E}{\partial a_{0}} \\ \frac{\partial E}{\partial a_{1}} \\ \frac{\partial E}{\partial a_{2}} \end{matrix}]

(A14)

according to (A8), we have

\frac{\partial E}{\partial a_{0}} = - 2 \sum_{j = 1}^{m} ((h_{j} - \tanh (\sum_{p = 0}^{2} a_{p} \frac{{(\sum_{i = 0}^{n} x_{j i})}^{p + 1}}{p + 1})) (1 - \tanh^{2} (\sum_{p = 0}^{2} a_{p} \frac{{(\sum_{i = 0}^{n} x_{j i})}^{p + 1}}{p + 1})) \sum_{i = 0}^{n} x_{j i}),

(A15)

\frac{\partial E}{\partial a_{1}} = - 2 \sum_{j = 1}^{m} ((h_{j} - \tanh (\sum_{p = 0}^{2} a_{p} \frac{{(\sum_{i = 0}^{n} x_{j i})}^{p + 1}}{p + 1})) (1 - \tanh^{2} (\sum_{p = 0}^{2} a_{p} \frac{{(\sum_{i = 0}^{n} x_{j i})}^{p + 1}}{p + 1})) \frac{{(\sum_{i = 0}^{n} x_{j i})}^{2}}{2}),

(A16)

\frac{\partial E}{\partial a_{2}} = - 2 \sum_{j = 1}^{m} ((h_{j} - \tanh (\sum_{p = 0}^{2} a_{p} \frac{{(\sum_{i = 0}^{n} x_{j i})}^{p + 1}}{p + 1})) (1 - \tanh^{2} (\sum_{p = 0}^{2} a_{p} \frac{{(\sum_{i = 0}^{n} x_{j i})}^{p + 1}}{p + 1})) \frac{{(\sum_{i = 0}^{n} x_{j i})}^{3}}{3}) .

(A17)
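The gradient formulas can be cross-checked numerically. The sketch below implements (A8) with $g = \tanh$ for a polynomial weight function of any degree and compares it with central finite differences of the error (A1). The training set (XOR with the constant input $x_{j 0} = 1$), the parameter vector `a`, and all function names are illustrative assumptions.

```python
import math

def signal(x, a):
    """S_j for a polynomial weight function, Eq. (A11) written for any degree."""
    upper = sum(x)
    return sum(a_p * upper ** (p + 1) / (p + 1) for p, a_p in enumerate(a))

def error(data, a):
    """Sum-of-squares error (A1) with g = tanh."""
    return sum((h - math.tanh(signal(x, a))) ** 2 for x, h in data)

def gradient(data, a):
    """Analytic gradient from Eq. (A8) with g = tanh, so g'(S) = 1 - tanh(S)**2."""
    grad = [0.0] * len(a)
    for x, h in data:
        upper = sum(x)
        t = math.tanh(signal(x, a))
        common = -2.0 * (h - t) * (1.0 - t * t)
        for p in range(len(a)):
            grad[p] += common * upper ** (p + 1) / (p + 1)
    return grad

def numeric_gradient(data, a, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for p in range(len(a)):
        plus = a[:p] + [a[p] + eps] + a[p + 1:]
        minus = a[:p] + [a[p] - eps] + a[p + 1:]
        grad.append((error(data, plus) - error(data, minus)) / (2.0 * eps))
    return grad

data = [((1, 0, 0), 0.0), ((1, 0, 1), 1.0), ((1, 1, 0), 1.0), ((1, 1, 1), 0.0)]
a = [0.3, -0.2, 0.1]                  # illustrative second-degree weight function
for p, (ga, gn) in enumerate(zip(gradient(data, a), numeric_gradient(data, a))):
    print(f"dE/da_{p}: analytic={ga:.8f}  numeric={gn:.8f}")
```

Agreement between the two columns is a practical way to validate a hand-derived gradient before using it inside a training algorithm.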

References

Gupta, M.; Jin, L.; Homma, N. Static and Dynamic Neural Networks: From Fundamentals to Advanced Theory; John Wiley & Sons: New York, NY, USA, 2004.
Amari, S.I. Mathematical theories of neural networks. In Handbook of Neural Computation; CRC Press: Boca Raton, FL, USA, 2020; p. H1-1.
Guo, J.; Chen, C.P.; Liu, Z.; Yang, X. Dynamic neural network structure: A review for its theories and applications. IEEE Trans. Neural Netw. Learn. Syst. 2024, 1–21.
Kang, J.; Wang, H.; Yuan, F.; Wang, Z.; Huang, J.; Qiu, T. Prediction of precipitation based on recurrent neural networks in Jingdezhen, Jiangxi Province, China. Atmosphere 2020, 11, 246.
Khaniani, A.S.; Motieyan, H.; Mohammadi, A. Rainfall forecast based on GPS PWV together with meteorological parameters using neural network models. J. Atmos. Sol. Terr. Phys. 2021, 214, 105533.
Rahman, M.M.; Shakeri, M.; Khatun, F.; Tiong, S.K.; Alkahtani, A.A.; Samsudin, N.A.; Amin, N.; Pasupuleti, J.; Hasan, M.K. A comprehensive study and performance analysis of deep neural network-based approaches in wind time-series forecasting. J. Reliab. Intell. Environ. 2023, 9, 183–200.
Casallas, A.; Ferro, C.; Celis, N.; Guevara-Luna, M.A.; Mogollón-Sotelo, C.; Guevara-Luna, F.A.; Merchán, M. Long short-term memory artificial neural network approach to forecast meteorology and PM 2.5 local variables in Bogotá, Colombia. Model. Earth Syst. Environ. 2022, 8, 2951–2964.
Hou, X.; Wang, K.; Zhong, C.; Wei, Z. St-trader: A spatial-temporal deep neural network for modeling stock market movement. IEEE/CAA J. Autom. Sin. 2021, 8, 1015–1024.
Fjellström, C. Long short-term memory neural network for financial time series. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; IEEE: New York, NY, USA, 2022; pp. 3496–3504.
Zhou, D.; Uddin, A.; Tao, X.; Shang, Z.; Yu, D. Temporal Bipartite Graph Neural Networks for Bond Prediction. In Proceedings of the ICAIF ’22: Third ACM International Conference on AI in Finance, New York, NY, USA, 2–4 November 2022; pp. 308–316.
Yılmaz, Ü.; Orbak, Â.Y. Prediction of Turkish mutual funds’ net asset value using the fund portfolio distribution. Neural Comput. Appl. 2023, 35, 18873–18890.
Zhang, L.; Shi, Z.; Han, J.; Shi, A.; Ma, D. FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks. In Proceedings of the MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, Republic of Korea, 5–8 January 2020; Proceedings, Part I 26. Springer International Publishing: Berlin/Heidelberg, Germany; 2020; pp. 653–665.
Wu, Z.; Zhao, D.; Liang, Q.; Yu, J.; Gulati, A.; Pang, R. Dynamic sparsity neural networks for automatic speech recognition. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 6014–6018.
Wang, Q.; Zhang, T.; Han, M.; Wang, Y.; Zhang, D.; Xu, B. Complex dynamic neurons improved spiking transformer network for efficient automatic speech recognition. Proc. AAAI Conf. Artif. Intell. 2023, 37, 102–109.
Lin, Z.; Hu, Z.; Zhu, K. Speech emotion recognition based on dynamic convolutional neural network. J. Comput. Electron. Inf. Manag. 2023, 10, 72–77.
Su, H.; Qi, W.; Yang, C.; Sandoval, J.; Ferrigno, G.; De Momi, E. Deep neural network approach in robot tool dynamics identification for bilateral teleoperation. IEEE Robot. Autom. Lett. 2020, 5, 2943–2949.
Zhang, J.; Liu, H.; Chang, Q.; Wang, L.; Gao, R.X. Recurrent neural network for motion trajectory prediction in human-robot collaborative assembly. CIRP Ann. 2020, 69, 9–12.
Liu, C.; Wen, G.; Zhao, Z.; Sedaghati, R. Neural-network-based sliding-mode control of an uncertain robot using dynamic model approximated switching gain. IEEE Trans. Cybern. 2020, 51, 2339–2346.
Xu, Z.; Li, S.; Zhou, X.; Zhou, S.; Cheng, T.; Guan, Y. Dynamic neural networks for motion-force control of redundant manipulators: An optimization perspective. IEEE Trans. Ind. Electron. 2020, 68, 1525–1536.
Han, Y.; Huang, G.; Song, S.; Yang, L.; Wang, H.; Wang, Y. Dynamic Neural Networks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7436–7456.
Verma, P.; Singh, N.; Pantola, D.; Cheng, X. Neural network developments: A detailed survey from static to dynamic models. Comput. Electr. Eng. 2024, 120, 109710.
Sherstinsky, A. Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) network. Phys. D Nonlinear Phenom. 2020, 404, 132306.
Rezk, N.M.; Purnaprajna, M.; Nordström, T.; Ul-Abdin, Z. Recurrent Neural Networks: An Embedded Computing Perspective. IEEE Access 2020, 8, 57967–57996.
Nair, R.S.; Supriya, P. Robotic path planning using recurrent neural networks. In Proceedings of the 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020; IEEE: New York, NY, USA, 2020; pp. 1–5.
Anagnostis, A.; Benos, L.; Tsaopoulos, D.; Tagarakis, A.; Tsolakis, N.; Bochtis, D. Human activity recognition through recurrent neural networks for human–robot interaction in agriculture. Appl. Sci. 2021, 11, 2188.
Qi, W.; Ovur, S.E.; Li, Z.; Marzullo, A.; Song, R. Multi-sensor guided hand gesture recognition for a teleoperated robot using a recurrent neural network. IEEE Robot. Autom. Lett. 2021, 6, 6039–6045.
Zhu, J.; Jiang, Q.; Shen, Y.; Qian, C.; Xu, F.; Zhu, Q. Application of recurrent neural network to mechanical fault diagnosis: A review. J. Mech. Sci. Technol. 2022, 36, 527–542.
Hochreiter, S. Long Short-term Memory. Neural Comput. 1997, 9, 1735–1780.
Gers, F.A.; Schmidhuber, J.; Cummins, F. Learning to forget: Continual prediction with LSTM. Neural Comput. 2000, 12, 2451–2471.
Wang, Z.; Su, X.; Ding, Z. Long-term traffic prediction based on lstm encoder-decoder architecture. IEEE Trans. Intell. Transp. Syst. 2020, 22, 6561–6571.
Huang, R.; Wei, C.; Wang, B.; Yang, J.; Xu, X.; Wu, S.; Huang, S. Well performance prediction based on Long Short-Term Memory (LSTM) neural network. J. Pet. Sci. Eng. 2022, 208, 109686.
Hossain, M.S.; Mahmood, H. Short-term photovoltaic power forecasting using an LSTM neural network and synthetic weather forecast. IEEE Access 2020, 8, 172524–172533.
Fang, Z.; Wang, Y.; Peng, L.; Hong, H. Predicting flood susceptibility using LSTM neural networks. J. Hydrol. 2021, 594, 125734.
Seng, D.; Zhang, Q.; Zhang, X.; Chen, G.; Chen, X. Spatiotemporal prediction of air quality based on LSTM neural network. Alex. Eng. J. 2021, 60, 2021–2032.
Xu, Y.; Hu, C.; Wu, Q.; Jian, S.; Li, Z.; Chen, Y.; Zhang, G.; Zhang, Z.; Wang, S. Research on particle swarm optimization in LSTM neural networks for rainfall-runoff simulation. J. Hydrol. 2022, 608, 127553.
Salem, F.M.; Salem, F.M. Gated RNN: The minimal gated unit (MGU) RNN. In Recurrent Neural Networks; Springer: Cham, Switzerland, 2022; pp. 101–113.
Kumar, C.; Abuzar, M.; Kumar, M. Mgu-gnn: Minimal gated unit based graph neural network for session-based recommendation. Appl. Intell. 2023, 53, 23147–23165. [Google Scholar] [CrossRef]
Shailesh, S.; Judy, M.V. Understanding dance semantics using spatio-temporal features coupled GRU networks. Entertain. Comput. 2022, 42, 100484. [Google Scholar] [CrossRef]
Ruan, X.; Fu, S.; Storlie, C.B.; Mathis, K.L.; Larson, D.W.; Liu, H. Real-time risk prediction of colorectal surgery-related post-surgical complications using GRU-D model. J. Biomed. Inform. 2022, 135, 104202. [Google Scholar] [CrossRef]
Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar] [CrossRef]
Yang, S.; Yu, X.; Zhou, Y. Lstm and gru neural network performance comparison study: Taking yelp review dataset as an example. In Proceedings of the 2020 International Workshop on Electronic Communication and Artificial Intelligence (IWECAI), Shanghai, China, 12–14 June 2020; IEEE: New York, NY, USA, 2020; pp. 98–101. [Google Scholar] [CrossRef]
Nosouhian, S.; Nosouhian, F.; Khoshouei, A.K. A review of recurrent neural network architecture for sequence learning: Comparison between LSTM and GRU. Preprints 2021, 2021070252. [Google Scholar] [CrossRef]
Mahjoub, S.; Chrifi-Alaoui, L.; Marhic, B.; Delahoche, L. Predicting energy consumption using LSTM, multi-layer GRU and drop-GRU neural networks. Sensors 2022, 22, 4062. [Google Scholar] [CrossRef] [PubMed]
Zhang, S.; Luo, J.; Wang, S.; Liu, F. Oil price forecasting: A hybrid GRU neural network based on decomposition–reconstruction methods. Expert Syst. Appl. 2023, 218, 119617. [Google Scholar] [CrossRef]
Yu, Y.Q.; Li, W.J. Densely Connected Time Delay Neural Network for Speaker Verification. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 921–925. [Google Scholar] [CrossRef]
Liu, W.; Zhu, L.; Feng, F.; Zhang, W.; Zhang, Q.J.; Lin, Q.; Liu, G. A time delay neural network based technique for nonlinear microwave device modeling. Micromachines 2020, 11, 831. [Google Scholar] [CrossRef] [PubMed]
Wan, Z.K.; Ren, Q.H.; Qin, Y.C.; Mao, Q.R. Statistical pyramid dense time delay neural network for speaker verification. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 7532–7536. [Google Scholar] [CrossRef]
Jiang, C.; Li, H.; Qiao, W.; Yang, G.; Liu, Q.; Wang, G.; Liu, F. Block-oriented time-delay neural network behavioral model for digital predistortion of RF power amplifiers. IEEE Trans. Microw. Theory Tech. 2021, 70, 1461–1473. [Google Scholar] [CrossRef]
Sheikh, S.A.; Sahidullah, M.; Hirsch, F.; Ouni, S. Stutternet: Stuttering detection using time delay neural network. In Proceedings of the 2021 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021; IEEE: New York, NY, USA, 2021; pp. 426–430. [Google Scholar] [CrossRef]
Liao, C.; Huang, J.; Yuan, H.; Yao, P.; Tan, J.; Zhang, D.; Deng, F.; Wang, X.; Song, C. Dynamic TF-TDNN: Dynamic time delay neural network based on temporal-frequency attention for dialect recognition. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar] [CrossRef]
Kim, T.; King, B.R. Time series prediction using deep echo state networks. Neural Comput. Appl. 2020, 32, 17769–17787. [Google Scholar] [CrossRef]
Hu, H.; Wang, L.; Tao, R. Wind speed forecasting based on variational mode decomposition and improved echo state network. Renew. Energy 2021, 164, 729–751. [Google Scholar] [CrossRef]
Gao, R.; Du, L.; Duru, O.; Yuen, K.F. Time series forecasting based on echo state network and empirical wavelet transformation. Appl. Soft Comput. 2021, 102, 107111. [Google Scholar] [CrossRef]
Gao, R.; Li, R.; Hu, M.; Suganthan, P.N.; Yuen, K.F. Dynamic ensemble deep echo state network for significant wave height forecasting. Appl. Energy 2023, 329, 120261. [Google Scholar] [CrossRef]
Jordanou, J.P.; Antonelo, E.A.; Camponogara, E. Echo state networks for practical nonlinear model predictive control of unknown dynamic systems. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 2615–2629. [Google Scholar] [CrossRef] [PubMed]
Dai, W.; An, Y.; Long, W. Price change prediction of ultra high frequency financial data based on temporal convolutional network. Procedia Comput. Sci. 2022, 199, 1177–1183. [Google Scholar] [CrossRef]
Yao, Y.; Zhang, Z.Y.; Zhao, Y. Stock index forecasting based on multivariate empirical mode decomposition and temporal convolutional networks. Appl. Soft Comput. 2023, 142, 110356. [Google Scholar] [CrossRef]
Hewage, P.; Behera, A.; Trovati, M.; Pereira, E.; Ghahremani, M.; Palmieri, F.; Liu, Y. Temporal convolutional neural (TCN) network for an effective weather forecasting using time-series data from the local weather station. Soft Comput. 2020, 24, 16453–16482. [Google Scholar] [CrossRef]
Villia, M.M.; Tsagkatakis, G.; Moghaddam, M.; Tsakalides, P. Embedded Temporal Convolutional Networks for Essential Climate Variables Forecasting. Sensors 2022, 22, 1851. [Google Scholar] [CrossRef] [PubMed]
Dholvan, M.; Bhuvanagiri, A.K.; Bathina, S.M.; Bussa, R. Offensive text detection using temporal convolutional networks. Int. J. Adv. Sci. Technol. 2020, 29, 5177–5185. [Google Scholar]
Liya, B.S.; Indumathy, P.; Hemlathadhevi, A.; Dharaniya, R. Cascaded Adaptive Dilated Temporal Convolution Network-Based Efficient Sentiment Analysis Model from Social Media Posts. Int. J. Image Graph. 2024, 2650015. [Google Scholar] [CrossRef]
Sun, J.; Luo, X.; Gao, H.; Wang, W.; Gao, Y.; Yang, X. Categorizing malware via A Word2Vec-based temporal convolutional network scheme. J. Cloud Comput. 2020, 9, 53. [Google Scholar] [CrossRef]
Shi, Y.; Xiao, Y.; Quan, P.; Lei, M.; Niu, L. Document-level relation extraction via graph transformer networks and temporal convolutional networks. Pattern Recognit. Lett. 2021, 149, 150–156. [Google Scholar] [CrossRef]
Wu, H.; Sangaiah, A.K. Oral English Speech Recognition Based on Enhanced Temporal Convolutional Network. Intell. Autom. Soft Comput. 2021, 28, 1. [Google Scholar] [CrossRef]
Savchenko, A.V.; Sidorova, A.P. EmotiEffNet and Temporal Convolutional Networks in Video-based Facial Expression Recognition and Action Unit Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 17–18 June 2024; pp. 4849–4859. [Google Scholar]
Weierstrass, K. Über die analytische Darstellbarkeit sogenannter willkürlicher Functionen einer reellen Veränderlichen. Sitzungsberichte Der Königlich Preußischen Akad. Der Wiss. Zu Berl. 1885, 2, 633–639. [Google Scholar]
Minsky, M.; Papert, S. Perceptrons: An Introduction to Computational Geometry; MIT Press: Cambridge, MA, USA, 1969. [Google Scholar]
Dombi, J.; Hussain, A. Robust Rule Based Neural Network Using Arithmetic Fuzzy Inference System. In Intelligent Systems and Applications; IntelliSys 2022, Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2023; Volume 542. [Google Scholar] [CrossRef]

Figure 1. Standalone artificial neuron.

Figure 2. Neuron with dynamically changing weights for each input set.

Figure 3. Rectangles with areas representing the products of inputs and their weights in a neuron.

Figure 4. Example of a signal within a neuron’s body represented as the sum of areas of adjacent rectangles.

Figure 5. Algorithm for finding the optimal weight function using gradient descent.

Figure 6. Integral neuron equipped with smooth weight and transfer functions, $w(t)$ and $g$, respectively.

Figure 7. Algorithm for finding the optimal weight function using the Gauss–Newton method.

Figure 9. Structure of an integral neuron cell.

Figure 10. Integral neuron represented as a classical neuron.

Figure 11. Training process for (a) classical neurons; (b) integral neurons.

Figure 12. Linear separability in (a) logical “AND”; (b) logical “OR”.

Figure 13. Inability to classify XOR with a single divider.

Figure 14. Integral XOR neuron.

Figure 15. Integral XOR neuron with transfer function $3 \tanh (\tilde{S})$.

Figure 16. Integral XOR neuron with transfer function $f (x) = x$.

Figure 17. Integral XOR neuron with transfer function $g (x) = 2 \sin (x)$.

Figure 18. Structure of an integral neuron with a weight function.

Table 1. The XOR truth table.

$X_{1}$	$X_{2}$	$X_{1} \operatorname{XOR} X_{2}$
0	0	0
0	1	1
1	0	1
1	1	0
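
The truth table makes the obstacle for a single classical neuron easy to verify by hand. As a short worked argument (the threshold form of the neuron, with weights $w_{1}, w_{2}$ and bias $b$, is an assumption made here for illustration), suppose the neuron outputs 1 exactly when $w_{1} X_{1} + w_{2} X_{2} + b > 0$. The four rows of Table 1 then require:

$$
\begin{aligned}
(0,0):&\quad b \le 0, \\
(0,1):&\quad w_{2} + b > 0, \\
(1,0):&\quad w_{1} + b > 0, \\
(1,1):&\quad w_{1} + w_{2} + b \le 0.
\end{aligned}
$$

Adding the second and third inequalities gives $w_{1} + w_{2} + 2b > 0$, hence $w_{1} + w_{2} + b > -b \ge 0$, which contradicts the fourth. No choice of $w_{1}, w_{2}, b$ satisfies all four rows, so a single linear threshold cannot reproduce XOR.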

Table 2. Input-output data for training the XOR neuron.

j	$X_{j 1}$	$X_{j 2}$	$h_{j} = X_{j 1} \operatorname{XOR} X_{j 2}$
1	0	0	0
2	0	1	1
3	1	0	1
4	1	1	0

Table 3. Results of the Levenberg–Marquardt method for finding the optimal weight function of the XOR neuron.

Iteration	Func-Count	Residual	First-Order Optimality	Lambda	Norm of Step
0	3	12.4149	3.39	0.01
1	7	7.73309	15	0.1	1.41445
2	10	1.03079	5.34	0.01	0.423156
3	13	0.007648	0.536	0.001	0.445144
4	16	1.88 × 10⁻⁹	0.000306	0.0001	0.030293
5	19	9.98 × 10⁻²¹	3.08 × 10⁻¹⁰	1.00 × 10⁻⁵	4.85 × 10⁻⁶
6	22	5.42 × 10⁻³¹	3.40 × 10⁻¹⁵	1.00 × 10⁻⁶	7.84 × 10⁻¹¹
7	25	0	0	1.00 × 10⁻⁷	4.78 × 10⁻¹⁶

Table 4. Results of the integral neuron with initial parameters of the weight function and hyperbolic tangent transfer function.

Inputs $(x_{1}, x_{2})$	Neuron Body $\tilde{S} = \int_{0}^{x_{1} + x_{2}} w (t) \, d t$	Neuron Output $Out = 3 \tanh (\tilde{S})$
(0,0)	0	0
(0,1)	0.346573590279973	1
(1,0)	0.346573590279973	1
(1,1)	0	0
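
The values in Table 4 can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' MATLAB script from Appendix A. It assumes a linear weight function $w(t) = a_{0} - a_{1} t$ with $a_{0} = a_{1} = \ln 2$, values chosen here only because they yield $\tilde{S} = \frac{1}{2} \ln 2 \approx 0.3466$ when $x_{1} + x_{2} = 1$ and $\tilde{S} = 0$ when $x_{1} + x_{2} = 2$. The function names are illustrative.

```python
import math

def neuron_body(x1, x2, a0, a1):
    """Integral of the assumed weight function w(t) = a0 - a1*t over [0, x1 + x2]."""
    s = x1 + x2
    return a0 * s - 0.5 * a1 * s ** 2

def integral_neuron(x1, x2, a0=math.log(2), a1=math.log(2)):
    """Neuron output with the transfer function 3*tanh(S)."""
    return 3.0 * math.tanh(neuron_body(x1, x2, a0, a1))

# Evaluate the neuron on the four XOR input pairs.
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    body = neuron_body(x1, x2, math.log(2), math.log(2))
    print((x1, x2), body, integral_neuron(x1, x2))
```

The output is 1 for the mixed inputs because $\tanh(\frac{1}{2} \ln 2) = \frac{1}{3}$, so the factor 3 in the transfer function scales it to exactly the XOR target.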

Table 5. Results of the Levenberg–Marquardt method for finding the optimal weight function of the XOR neuron with linear transfer function $f (x) = x$.

Iteration	Func-Count	Residual	First-Order Optimality	Lambda	Norm of Step
0	3	3.57798	2.34	0.01
1	6	0.007213	0.0268	0.001	3.74853
2	9	1.90 × 10⁻⁷	0.000145	0.0001	0.19179
3	12	5.03 × 10⁻¹⁴	7.48 × 10⁻⁸	1.00 × 10⁻⁵	0.000988
4	15	1.33 × 10⁻²²	3.85 × 10⁻¹²	1.00 × 10⁻⁶	5.09 × 10⁻⁷
5	18	0	0	1.00 × 10⁻⁷	2.62 × 10⁻¹¹
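
The kind of iteration summarized in Table 5 can be sketched with a hand-written Levenberg–Marquardt-style loop. This is an illustration under stated assumptions, not the MATLAB solver that produced the table: the weight function is assumed to be $w(t) = a_{0} - a_{1} t$, the starting point $(0.5, 0.5)$ is hypothetical, and the function name is invented here. The factor-of-10 damping update mirrors the Lambda column of the table. Because the starting point differs, the residuals will not match the table row by row.

```python
def fit_linear_xor_neuron(a0=0.5, a1=0.5, lam=0.01, max_iter=50, tol=1e-24):
    """Levenberg-Marquardt-style fit of w(t) = a0 - a1*t with transfer f(x) = x."""
    # (s, h): s = x1 + x2 for each row of Table 2, h = target XOR value.
    data = [(0, 0), (1, 1), (1, 1), (2, 0)]

    def residuals(p0, p1):
        # Neuron body S = p0*s - p1*s^2/2; with f(x) = x the output equals S.
        return [p0 * s - 0.5 * p1 * s ** 2 - h for s, h in data]

    r = residuals(a0, a1)
    sse = sum(v * v for v in r)
    # Jacobian of the residuals; constant because the model is linear in (a0, a1).
    jac = [(s, -0.5 * s ** 2) for s, _ in data]
    for _ in range(max_iter):
        if sse < tol:
            break
        # Damped normal equations: (J^T J + lam*I) d = -J^T r, solved for 2 unknowns.
        m00 = sum(j0 * j0 for j0, _ in jac) + lam
        m01 = sum(j0 * j1 for j0, j1 in jac)
        m11 = sum(j1 * j1 for _, j1 in jac) + lam
        g0 = sum(j[0] * v for j, v in zip(jac, r))
        g1 = sum(j[1] * v for j, v in zip(jac, r))
        det = m00 * m11 - m01 * m01
        d0 = (-g0 * m11 + g1 * m01) / det
        d1 = (-g1 * m00 + g0 * m01) / det
        trial = residuals(a0 + d0, a1 + d1)
        trial_sse = sum(v * v for v in trial)
        if trial_sse < sse:
            # Accept the step and relax the damping.
            a0, a1, r, sse = a0 + d0, a1 + d1, trial, trial_sse
            lam /= 10.0
        else:
            # Reject the step and increase the damping.
            lam *= 10.0
    return a0, a1, sse

a0, a1, sse = fit_linear_xor_neuron()
print(a0, a1, sse)
```

Under these assumptions the zero-residual solution is $a_{0} = a_{1} = 2$, since the body must equal 1 at $x_{1} + x_{2} = 1$ and 0 at $x_{1} + x_{2} = 2$.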

Table 6. Theoretical solution of the XOR logical function with an integral neuron using the transfer function $g (x) = 2 \sin (x)$.

Inputs $(x_{1}, x_{2})$	Neuron Body $\tilde{S} = \frac{\pi}{3} (x_{1} + x_{2}) - \frac{\pi}{6} {(x_{1} + x_{2})}^{2}$	Neuron Output $Out = 2 \sin (\tilde{S})$
(0,0)	0	0
(0,1)	$π / 6$	1
(1,0)	$π / 6$	1
(1,1)	0	0
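
The body values in Table 6 follow from a one-line integration. Differentiating the body expression in the table header with respect to $s = x_{1} + x_{2}$ (a shorthand introduced here for readability) gives the weight function $w (t) = \frac{\pi}{3} - \frac{\pi}{3} t$, and integrating it back confirms each row:

$$
\tilde{S} (s) = \int_{0}^{s} \left( \frac{\pi}{3} - \frac{\pi}{3} t \right) d t = \frac{\pi}{3} s - \frac{\pi}{6} s^{2}, \qquad \tilde{S} (0) = 0, \quad \tilde{S} (1) = \frac{\pi}{6}, \quad \tilde{S} (2) = 0.
$$

Applying the transfer function named in the caption, $g (x) = 2 \sin (x)$, gives $g (0) = 0$ and $g (\pi / 6) = 2 \cdot \frac{1}{2} = 1$, which are the XOR targets for the four input pairs.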

Table 7. Results from the application of the Levenberg–Marquardt method for finding the optimal weight function of the XOR neuron equipped with the transfer function $g (x) = 2 \sin (x)$.

Iteration	Func-Count	Residual	First-Order Optimality	Lambda	Norm of Step
0	3	4.80645	7.29	0.01
1	6	0.000443	0.0404	0.001	2.02773
2	9	3.22 × 10⁻⁹	8.10 × 10⁻⁵	0.0001	0.025227
3	12	8.72 × 10⁻¹⁷	5.72 × 10⁻⁹	1.00 × 10⁻⁵	7.08 × 10⁻⁵
4	15	2.28 × 10⁻²⁶	8.81 × 10⁻¹⁴	1.00 × 10⁻⁶	1.19 × 10⁻⁸
5	18	2.47 × 10⁻³²	3.85 × 10⁻¹⁶	1.00 × 10⁻⁷	1.92 × 10⁻¹³
6	21	0	0	1.00 × 10⁻⁸	1.81 × 10⁻¹⁶

Table 8. Experimental results of the neuron’s performance with final parameters of the weight function and transfer function $g (x) = 2 \sin (x)$.

Inputs $(x_{1}, x_{2})$	Neuron Body $\tilde{S} = a_{0} (x_{1} + x_{2}) - \frac{1}{2} a_{1} {(x_{1} + x_{2})}^{2}$	Neuron Output $Out = 2 \sin (\tilde{S})$
(0,0)	0	0
(0,1)	0.523598775598299	1
(1,0)	0.523598775598299	1
(1,1)	0	0


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Yotov, K.; Hadzhikolev, E.; Hadzhikoleva, S. Integral Neuron: A New Concept for Nonlinear Neuron Modeling Using Weight Functions. Creation of XOR Neurons. Mathematics 2024, 12, 3982. https://doi.org/10.3390/math12243982

