1. Introduction
The key characteristics of dynamic neural networks (DNNs) are their changing structure, temporal dependency, greater expressive power, and better adaptability than static artificial neural networks (ANNs). Regarding their changing structure, it should be noted that the number of neurons, the connections between them, and even the entire architecture of a DNN can change during training or operation [1,2,3]. Their temporal dependency is linked to the fact that these DNNs can account for previous states and predict future events, making them particularly useful for tasks in meteorology [4,5,6,7], financial market analysis [8,9,10,11], speech recognition [12,13,14,15], and even robot control [16,17,18,19]. The greater expressive power lies in the fact that DNNs can represent more complex functions and patterns than static ANNs, while their better adaptability allows them to adjust to changing conditions and learn from new data more effectively. However, the strengths of DNNs also contribute to some challenges in their use. For example, designing and training DNNs is more complex compared to static ANNs, and both the training and subsequent operation often require more computational resources.
This article presents the stages of modeling a new type of neuron, the integral neuron, which is trained to find a weight function instead of fixed weights, as well as the construction of an integral XOR neuron that can be realized with different weight and transfer functions. The main sections of the article, following the introduction, are as follows:
In the second part of the study, various DNNs are examined. The idea is proposed that a new type of dynamic behavior can be created by replacing the static weights determined during the training of ANNs with weight functions, allowing for the creation of a nonlinear signal within the neuron.
In the third part, definitions are proposed for a weight function, which is used instead of static weights within the neuron, and an integral neuron, featuring an integral function in its body that uses the weight function. Evidence is provided for the applicability of training methods such as gradient descent, Gauss–Newton, and Levenberg–Marquardt in the search for the optimal weight function.
The fourth part—Discussion—examines the similarities and differences between classical neurons and the newly proposed integral neurons and provides directions for potential developments in the topic of artificial integral structures. Definitions are provided for integral and integral–classical neural networks, among others.
In the experimental fifth part, the possibility of creating a single integral neuron that solves the XOR logical function, which is unachievable by standard artificial neurons, is explored. Theoretical and experimental evidence for the creation of integral XOR neurons, trained with several different weight and transfer functions, is presented. The MATLAB v2018a scripts used for creating the integral XOR neurons are provided in Appendix A.
2. Dynamic Artificial Neural Networks
The main types of DNNs are Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, Gated Recurrent Units (GRUs), Time-Delay Neural Networks (TDNNs), Echo State Networks (ESNs), and Temporal Convolutional Networks (TCNs) [20,21].
2.1. Recurrent Neural Networks
RNNs contain feedback loops in their architecture, which allow information from previous time steps to be retained and used for current computations. These loops cyclically pass information through time, as RNNs maintain an internal state $h_t$, which is updated at each time step $t$. This state is used to store information from previous inputs.
In standard feedforward ANNs, neurons in one layer transmit signals only to the neurons in the next layer. The weight matrices $W^{(l)}$ determine the connections between the layers, and the states of the network (the outputs from the neurons) are
$$h^{(l)} = f\left(W^{(l)} h^{(l-1)}\right), \tag{1}$$
where $l$ is the layer index, $h^{(0)} = x$ is the network input, and $f$ is the transfer function.
In contrast, RNNs have an additional aspect: the feedback over time. At each time step, RNNs take into account not only the current input $x_t$ but also the previous state $h_{t-1}$ [22,23]. This allows the network to have memory and retain information across time steps. At each time step $t$, the RNN performs the following computation:
$$h_t = f\left(W_h h_{t-1} + W_x x_t\right), \tag{2}$$
where
$h_t$ is the hidden state of the network at time step $t$;
$h_{t-1}$ is the hidden state of the network at the previous time step;
$x_t$ is the current input;
$W_h$ and $W_x$ are the weight matrices.
The weight matrix $W_h$ processes the previous state $h_{t-1}$ and provides the information accumulated from prior time steps. This is the key element that gives RNNs the ability to remember past information. The weight matrix $W_x$ processes the current input $x_t$, which is analogous to what happens in standard ANNs. It handles the processing of new information that enters at each time step. This structure allows RNNs to combine new information with the accumulated memory from previous time steps, making it possible to recognize temporal dependencies. It should be noted that deep RNNs include multiple recurrent layers, with each layer receiving input from the previous one. This enables the network to capture more complex dependencies in the data. If there are two recurrent layers, the computations are performed with the following equations, characterizing the states (outputs) of the layers:
$$h_t^{(1)} = f\left(W_h^{(1)} h_{t-1}^{(1)} + W_x^{(1)} x_t + b^{(1)}\right), \qquad h_t^{(2)} = f\left(W_h^{(2)} h_{t-1}^{(2)} + W_x^{(2)} h_t^{(1)} + b^{(2)}\right), \tag{3}$$
where
$h_t^{(1)}$ and $h_t^{(2)}$ are the hidden states of the first and second layers, respectively;
$W_h^{(1)}$, $W_x^{(1)}$, $W_h^{(2)}$, and $W_x^{(2)}$ are the weight matrices for the respective layers;
$b^{(1)}$ and $b^{(2)}$ are the bias vectors (containing thresholds) for the respective layers;
$f$ is the transfer function.
Examples of the dynamic capabilities of RNNs can be found in [24,25,26,27].
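The recurrence described above can be sketched in a few lines. This is a minimal illustration, not code from the article: it assumes the hyperbolic tangent as the transfer function, omits biases, and uses the hypothetical helper names `matvec`, `rnn_step`, and `run_rnn` together with arbitrary small weight matrices.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(W_h, W_x, h_prev, x_t):
    """One recurrent update: h_t = tanh(W_h h_{t-1} + W_x x_t)."""
    recurrent = matvec(W_h, h_prev)   # contribution of the stored state
    current = matvec(W_x, x_t)        # contribution of the new input
    return [math.tanh(r + c) for r, c in zip(recurrent, current)]

def run_rnn(W_h, W_x, inputs, h0):
    """Apply rnn_step over a sequence and return all hidden states."""
    h, states = h0, []
    for x_t in inputs:
        h = rnn_step(W_h, W_x, h, x_t)
        states.append(h)
    return states

# Illustrative (assumed) weights: 2 hidden units, 1 input.
W_h = [[0.5, -0.2], [0.1, 0.3]]
W_x = [[1.0], [-1.0]]
states = run_rnn(W_h, W_x, [[1.0], [0.0], [0.0]], [0.0, 0.0])
```

After the first step the input is zero, yet the hidden state stays non-zero: the term with `W_h` carries the earlier input forward, which is the memory effect the section describes.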
2.2. Long Short-Term Memory Networks
LSTM networks are a type of RNN specifically designed to address the vanishing gradient problem in RNNs by introducing memory cells and gates that control the flow of information. The main components of LSTM are the forget gate, input gate, memory cell, and output gate, and the dynamic behavior is described by the following equations [28,29]:
$$\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right),\\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right),\\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right),\\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,\\
h_t &= o_t \odot \tanh(c_t).
\end{aligned} \tag{4}$$
Here,
$f_t$ is the state of the forget gate;
$i_t$ and $o_t$ are the input and output gates, respectively;
$\tilde{c}_t$ is the proposed new value for the cell, i.e., the cell gate activation vector;
$c_t$ is the current cell state;
$h_t$ is the current hidden state.
In all the presented equations, the subscript $t$ indicates the time step, $\sigma$ is the logistic sigmoid transfer function used by the gates, $\tanh$ is the hyperbolic tangent, and $\odot$ represents the Hadamard product, a binary operation that takes two matrices of the same dimensions and returns the matrix of their element-wise products. These equations demonstrate how LSTM networks can retain and manipulate information over long time periods by using complex memory management mechanisms. In their work, Zhumei Wang et al. use the dynamic characteristics of LSTM networks for accurate traffic forecasting [30], while Ruijie Huang et al. base their model on LSTM dynamics to create a system fed with historical data from 4000 previous days, predicting the production efficiency of a carbonate reservoir well for the next 500 days [31]. The use of dynamics in LSTM models is explored in many other studies [32,33,34,35].
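To make the role of the gates concrete, the following sketch performs one LSTM update for scalar inputs and states. It is an illustration in the standard textbook form, not the article's implementation; the parameter dictionary keys (`W_f`, `U_f`, `b_f`, and so on) and the numeric values are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    """Logistic function used by the gates."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM update; p maps parameter names to floats."""
    f_t = sigmoid(p["W_f"] * x_t + p["U_f"] * h_prev + p["b_f"])  # forget gate
    i_t = sigmoid(p["W_i"] * x_t + p["U_i"] * h_prev + p["b_i"])  # input gate
    o_t = sigmoid(p["W_o"] * x_t + p["U_o"] * h_prev + p["b_o"])  # output gate
    c_hat = math.tanh(p["W_c"] * x_t + p["U_c"] * h_prev + p["b_c"])  # candidate
    c_t = f_t * c_prev + i_t * c_hat   # keep part of the old cell, add new content
    h_t = o_t * math.tanh(c_t)         # expose a gated view of the cell
    return h_t, c_t

# Illustrative (assumed) parameters and a short input sequence.
names = ["W_f", "U_f", "b_f", "W_i", "U_i", "b_i",
         "W_o", "U_o", "b_o", "W_c", "U_c", "b_c"]
params = {name: 0.1 for name in names}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.3]:
    h, c = lstm_step(x, h, c, params)
```

With all parameters set to zero every gate equals 0.5 and the candidate is 0, so the cell state is simply halved at each step; this is a convenient sanity check of the update rule.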
2.3. Gated Recurrent Units
GRUs are a later, simplified variant of LSTM models. They have a simpler structure than LSTM, as they use fewer gates and parameters. This makes them easier to train and often faster than LSTM, especially when working with smaller datasets. The GRU cell includes two main gates: the update gate and the reset gate. The update gate helps the cell decide how much of the previous information will be retained in the new state. It combines the functions of the input gate and the forget gate in LSTM, thus controlling the extent to which information will be carried over to future time steps. The reset gate determines how much of the previous information should be forgotten. This gate allows the model to discard irrelevant information, which can be especially useful when modeling time series, where earlier data points may not always be important for understanding later points.
There are different variations of GRUs. For instance, simplified versions of GRUs like the Minimal Gated Unit can be used, where the gating mechanisms are reduced to a single main mechanism for controlling input and forget information [36,37]. Another variation is the Coupled GRU, where there is a dependency between the reset and update gates, rather than using them as completely independent elements [38]. GRU-D is a modification that incorporates decay mechanisms in order to handle time series with missing data [39], among others. In the system of Equation (5), the operation and key components of the mathematical model of the fully gated version of GRUs are presented [40]:
$$\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right),\\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right),\\
\hat{h}_t &= \phi\left(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\right),\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t.
\end{aligned} \tag{5}$$
Here,
$x_t$ and $h_t$ are the input and output vectors, respectively;
$\hat{h}_t$ is the candidate activation vector;
$z_t$ and $r_t$ are the states of the update and reset gates, respectively.
The transfer functions in this case, $\sigma$ and $\phi$, are the logistic function and hyperbolic tangent, respectively. The dynamic nature of the architecture and functionality of GRUs makes them attractive tools for various tasks. Examples of their use can be found in numerous studies [41,42,43,44].
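A scalar sketch of the fully gated update may help to see how the two gates interact. It follows the standard formulation with a logistic function for the gates and a hyperbolic tangent for the candidate; the function name `gru_step`, the parameter keys, and the numeric values are assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic function used by the update and reset gates."""
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x_t, h_prev, p):
    """One scalar GRU update; p maps parameter names to floats."""
    z_t = sigmoid(p["W_z"] * x_t + p["U_z"] * h_prev + p["b_z"])  # update gate
    r_t = sigmoid(p["W_r"] * x_t + p["U_r"] * h_prev + p["b_r"])  # reset gate
    # The reset gate scales how much of the old state enters the candidate.
    h_hat = math.tanh(p["W_h"] * x_t + p["U_h"] * (r_t * h_prev) + p["b_h"])
    # The update gate blends the old state with the candidate.
    return (1.0 - z_t) * h_prev + z_t * h_hat

# Illustrative (assumed) parameters and a short input sequence.
keys = ["W_z", "U_z", "b_z", "W_r", "U_r", "b_r", "W_h", "U_h", "b_h"]
params = {key: 0.2 for key in keys}
h = 0.0
for x in [1.0, -1.0, 0.5]:
    h = gru_step(x, h, params)
```

Because the new state is a convex combination of the previous state and a value in (-1, 1), the state stays bounded whenever it starts inside that interval.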
2.4. Time-Delay Neural Networks
The main idea behind TDNNs is the introduction of a time delay for the input data, allowing the network to access historical (previous) values for a certain number of time steps. This specific feature makes them particularly suitable for tasks involving time sequences or time-dependent data, such as speech recognition, audio signal processing, and time series analysis. The input data to a TDNN are fed in as consecutive time slices, so at a given time $t$, the neuron receives inputs not only for the current moment but also for the previous moments $t-1$, $t-2$, etc. The output $y(t)$ of a neuron in a TDNN at time $t$ can be expressed as a sum of the input data $x(t-i)$ at the current and previous time steps, multiplied by the corresponding weights $w_i$, plus the threshold $b$:
$$y(t) = \sum_{i=0}^{N} w_i\, x(t-i) + b, \tag{6}$$
where $N$ is the number of delayed time steps. If multiple layers are considered, the output of a neuron from the $l$-th layer is given by
$$y^{(l)}(t) = f\left(\sum_{i=0}^{N} w_i^{(l)}\, y^{(l-1)}(t-i) + b^{(l)}\right), \tag{7}$$
where $y^{(l-1)}$ denotes the outputs of the neurons of the previous layer, $b^{(l)}$ denotes the corresponding biases, and $f$ is the transfer function. Examples of the use of TDNNs in solving problems related to speech recognition, capturing dependencies between sounds and their temporal structures, as well as audio signal processing, can be found in [45,46,47,48,49,50].
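The time-delayed sum can be illustrated directly. In this sketch the function name `tdnn_output` is hypothetical, and samples that would lie before the start of the sequence are treated as zero, which is an assumption of the example rather than something stated in the text.

```python
def tdnn_output(x, weights, bias, t):
    """Weighted sum of x[t], x[t-1], ..., x[t-N] plus a bias.

    weights[i] multiplies the sample delayed by i steps; samples before
    the start of the sequence are treated as zero (an assumption).
    """
    total = bias
    for delay, w in enumerate(weights):
        if t - delay >= 0:
            total += w * x[t - delay]
    return total

# Illustrative signal and delay weights (current sample plus two delays).
signal = [1.0, 2.0, 3.0, 4.0]
weights = [0.5, 0.25, 0.25]
outputs = [tdnn_output(signal, weights, 0.0, t) for t in range(len(signal))]
```

A multi-layer version would apply the same function to the outputs of the previous layer and pass the sum through a transfer function.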
2.5. Echo State Networks
ESNs are a specific type of DNN primarily used for tasks involving time sequences. They belong to the broader class of reservoir computing and represent an alternative approach to modeling temporal dependencies in data without the heavy optimization of internal weights, as observed in classical RNN models. In practice, ESNs rely on a reservoir of neurons, which is a large, randomly connected recurrent network, where input data are transformed into a more complex, high-dimensional space using nonlinear connections. The key aspect of this type of DNN is that the internal weights in the reservoir are not trained through standard backpropagation algorithms but are randomly initialized and remain fixed during training. Unlike standard RNNs, in ESNs, only the weights of the output layer are trained. These weights connect the reservoir to the output neurons, a feature that greatly simplifies the training process, as only the optimization of linear connections is required, which can be achieved with simple linear regression or other straightforward methods.
The mathematical model of ESNs defines the state of the reservoir $x(t)$ as being updated at each time step $t$ by the following equation:
$$x(t) = f\left(W x(t-1) + W_{in} u(t) + b\right), \tag{8}$$
where $W$ is the weight matrix of the reservoir, $W_{in}$ is the weight matrix connecting the inputs to the reservoir, $u(t)$ and $b$ are the input data and biases, respectively, and $f$ is the transfer function of the reservoir neurons (commonly the hyperbolic tangent). The outputs of the ESNs are expressed as a linear combination of the current state of the reservoir:
$$y(t) = W_{out}\, x(t), \tag{9}$$
where $W_{out}$ represents the output weights, which are optimized during training.
In [51], the authors use ESNs for a new approach to time series data prediction, and Huanling Hu et al. propose a hybrid model incorporating this type of network for forecasting wind speed at the Sotavento Wind Farm in Galicia, northwest Spain [52]. In this model, ESNs are used for prediction in each individual sub-series. Other interesting applications of ESNs can be explored in [53,54,55].
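The division of labor in an ESN, a fixed random reservoir and a trained linear readout, can be sketched as follows. This is only an illustration under stated assumptions: the reservoir has 10 neurons with small random weights, biases are omitted, the task is one-step-ahead prediction of a sine wave, and the readout is fitted with a simple least-mean-squares loop instead of closed-form linear regression. All function names are hypothetical.

```python
import math
import random

def make_reservoir(size, scale=0.1, seed=0):
    """Random, fixed reservoir weights W and input weights W_in."""
    rng = random.Random(seed)
    W = [[rng.uniform(-scale, scale) for _ in range(size)] for _ in range(size)]
    W_in = [rng.uniform(-1.0, 1.0) for _ in range(size)]
    return W, W_in

def reservoir_step(W, W_in, state, u):
    """x(t) = tanh(W x(t-1) + W_in u(t)); the weights are never trained."""
    return [math.tanh(sum(w * s for w, s in zip(row, state)) + w_in * u)
            for row, w_in in zip(W, W_in)]

def collect_states(W, W_in, inputs):
    state, states = [0.0] * len(W), []
    for u in inputs:
        state = reservoir_step(W, W_in, state, u)
        states.append(state)
    return states

def readout(W_out, state):
    """y(t) = W_out x(t), a linear combination of reservoir states."""
    return sum(w * s for w, s in zip(W_out, state))

def train_readout(states, targets, lr=0.02, epochs=50):
    """Fit only the output weights with a least-mean-squares loop."""
    W_out = [0.0] * len(states[0])
    for _ in range(epochs):
        for state, y in zip(states, targets):
            err = readout(W_out, state) - y
            W_out = [w - lr * err * s for w, s in zip(W_out, state)]
    return W_out

def mse(W_out, states, targets):
    return sum((readout(W_out, s) - y) ** 2
               for s, y in zip(states, targets)) / len(targets)

series = [math.sin(0.2 * k) for k in range(201)]
inputs, targets = series[:-1], series[1:]
W, W_in = make_reservoir(size=10)
states = collect_states(W, W_in, inputs)
W_out = train_readout(states, targets)
```

Only `W_out` changes during training; `W` and `W_in` stay exactly as they were initialized, which is what keeps the training problem linear.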
2.6. Temporal Convolutional Networks
TCNs are designed for tasks related to processing temporal sequences, based on the principles of convolutional neural networks (CNNs). Their convolution is causal: the output at time $t$ depends only on current and previous values over time, but not on future values. The use of convolutional layers allows for greater parallelization and faster training, as at any given moment, a TCN examines multiple previous time steps simultaneously. This quality makes them more efficient than traditional recurrent networks, where processing each step depends on the previous one. Additionally, TCNs use dilated convolution, which helps extend the receptive field of the convolutional filters without necessarily increasing the number of parameters. This allows the TCN to “see” longer time dependencies while remaining efficient. Causality is enforced by structuring the convolution so that the output $y_t$ depends only on the current and previous inputs $x_t, x_{t-1}, \ldots, x_{t-k+1}$:
$$y_t = f\left(\sum_{i=0}^{k-1} w_i\, x_{t-i} + b\right), \tag{10}$$
where $k$ is the filter length, $w_i$ and $b$ are the filter weights and biases, respectively, and $f$ is the transfer function. TCNs are highly effective for forecasting values from time series, such as financial data predictions [56,57] or climate conditions [58,59]. Additionally, they can be used for tasks related to text processing [60,61] and natural language processing [62,63], as well as for problems involving speech, audio, and video processing [64,65].
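A causal, dilated one-dimensional convolution can be written as a short loop. The sketch below is illustrative only: the name `causal_dilated_conv` is hypothetical, the transfer function is taken to be the identity, and inputs before the start of the sequence are treated as zero (implicit left padding).

```python
def causal_dilated_conv(x, weights, bias=0.0, dilation=1):
    """y[t] = bias + sum_i weights[i] * x[t - dilation * i].

    Only indices <= t are read, so no future value influences y[t].
    """
    y = []
    for t in range(len(x)):
        acc = bias
        for i, w in enumerate(weights):
            idx = t - dilation * i
            if idx >= 0:          # implicit zero padding on the left
                acc += w * x[idx]
        y.append(acc)
    return y

sequence = [1.0, 2.0, 3.0, 4.0, 5.0]
out = causal_dilated_conv(sequence, [1.0, 1.0], dilation=2)
# out == [1.0, 2.0, 4.0, 6.0, 8.0]
```

With two weights and a dilation of 2, each output looks two steps back instead of one, so the receptive field grows without adding parameters.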
2.7. Conclusions Regarding Dynamic Artificial Neural Networks
The dynamic nature of the DNNs discussed is expressed in their ability to process and model temporal dependencies and sequences in the data. Unlike static ANNs, which accept a fixed set of input data and produce outputs without considering the order or temporal dependencies between input values, DNNs have internal mechanisms that allow for the processing of data that changes over time. Thus, during the training process, DNNs update their parameters (weights and biases) based not only on the current input data but also on the errors accumulated over time. This means that each iteration of training takes into account the temporal sequence of the data.
However, is it possible to add even more to the dynamic aspect? Is it possible, during training, for networks to be equipped with linear or nonlinear weight functions instead of a weight vector, thereby allowing the signal within the neuron’s body and the neuron’s output to change the way they are formed? We provide a positive answer to these questions in the following sections of this article.
3. Integral Neurons with Weight Functions: Training Algorithms for Integral Neurons
In this section, we explore the possibility of creating a new type of dynamic neuron with dynamically changing weights through the definition of weight functions. We propose definitions for the concepts of weight function and integral neuron. The training methods of gradient descent, Gauss–Newton, and Levenberg–Marquardt are examined, with algorithms and theoretical evidence presented for their applicability in finding the optimal weight function for an integral neuron.
3.1. Problem Definition
Let us consider a single artificial neuron influenced by $n$ stimuli $x_i$, where $i = 1, 2, \ldots, n$ (Figure 1).
In its body, a total cumulative signal is formed from all inputs:
$$s = \sum_{i=0}^{n} w_i x_i, \tag{11}$$
where the standard and common procedure of including the neuron’s threshold $b$ in the total sum is implemented:
$$x_0 = 1, \quad w_0 = b. \tag{12}$$
Let a task be set in which the neuron must be trained to approximate the target function $F$ using a set of $m$ training samples:
$$\{(X_j, y_j)\}, \quad j = 1, 2, \ldots, m. \tag{13}$$
In dataset (13), $X_j$ is the input set, and $y_j$ is its corresponding functional value of $F$, i.e.,
$$y_j = F(X_j). \tag{14}$$
Classical procedures related to training the neuron are aimed at finding the weight vector
$$W = (w_0, w_1, \ldots, w_n), \tag{15}$$
for which the error function $E(W)$ is at its minimum.
Once the desired performance of the neuron is achieved, the specific weight set (15) that achieves this remains unchanged unless the neuron is retrained. Thus, the artificial neuron will respond to any subsequent stimuli on its inputs with the constant weight vector (15) obtained after training.
The question we pose is the following: Is it possible to conduct training that seeks not a weight vector of constants but a function $w(x)$, where the weight for each input varies with the value of the corresponding stimulus (Figure 2)?
Replacing static weights with a weight function would create a new type of artificial neuron and neural structure, where the signal shape within the neuron’s body could now be nonlinear. This enables the development of new, more effective solutions for various tasks. For instance, a single neuron with static weights cannot realize the XOR function, whereas a single integral XOR neuron can be constructed with the new approach, as presented later in this article.
If we reconsider the standard neuron shown in
Figure 2, we will notice that all terms in the sum
can be viewed as the areas of rectangles (
Figure 3).
Let the variational series obtained from the input values of the neuron be
Then, the total signal (18) within the neuron’s body can be represented as the sum of the areas of adjacent rectangles (
Figure 4).
Given that
and if
and
, then
As a result, along the axon of a standard neuron with transfer function
, the signal has a value infinitely close to
Given the training set (13)
, the system of errors for each of the samples is
The mean squared error (MSE) of the network is given by
i.e.,
The last expression shows that the task of training algorithms for the ANN in this case should not be focused on finding a specific set of weights, as is typical in standard approaches. Instead, the focus should shift towards finding the function
that minimizes the error function (25). Given the representation of the MSE, it becomes clear that we are, in fact, searching for the minimum of the function
Many training algorithms utilize derivatives of the error function, which in turn imposes the condition of differentiability on the integral function involved in (26). Thus, to ensure the correct calculation of derivatives in training algorithms, it is necessary for the integrand function to possess a certain degree of smoothness. Specifically, the differentiability of the weight function $w(x)$ enables the application of methods such as gradient descent, Gauss–Newton, and Levenberg–Marquardt, which we will demonstrate in this study. For this purpose, we consider it necessary to introduce concepts for a continuously differentiable weight function and an integral neuron.
Let $C^k$ be the class of $k$-times differentiable functions, and let $\mathbb{N}_0$ be the set of natural numbers with zero included, i.e., $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$.
Definition 1 (Weight function of a neuron). A continuously differentiable real function , defining the cumulative signal in the body of an artificial neuron by the equationwhere are the input stimuli, will be called the weight function for the respective neuron. Definition 2 (Integral neuron). An artificial neuron that uses a weight function to form the signal within its body will be called an integral neuron of class .
Let us note that, due to the nature of integration, the class of the integral function , which describes the signal within the neuron’s body, is one degree higher than that of the weight function. This increase draws our attention to the fact that the integration process provides an additional level of smoothness. For example, if a twice-differentiable weight function is chosen, the signal within the neuron’s body will be three times differentiable.
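The idea of forming the neuron's signal by integrating a weight function can be illustrated numerically. The sketch below is hypothetical in one important respect: the integration limits are not recoverable from this text, so it assumes that the weight function is integrated from the smallest to the largest input value. The names `integral_signal` and `integral_neuron` and the trapezoidal rule are likewise choices made only for the example.

```python
import math

def integral_signal(weight_fn, inputs, steps=1000):
    """Signal in the neuron body: integral of weight_fn over [min, max] of inputs.

    ASSUMPTION: the limits are the smallest and largest input values.
    The integral is approximated with the composite trapezoidal rule.
    """
    lo, hi = min(inputs), max(inputs)
    if hi == lo:
        return 0.0
    h = (hi - lo) / steps
    total = 0.5 * (weight_fn(lo) + weight_fn(hi))
    for k in range(1, steps):
        total += weight_fn(lo + k * h)
    return total * h

def integral_neuron(weight_fn, inputs, transfer=math.tanh):
    """Neuron output: transfer function applied to the integral signal."""
    return transfer(integral_signal(weight_fn, inputs))

# Illustrative linear weight function w(x) = 2x.
s = integral_signal(lambda x: 2.0 * x, [0.0, 1.0])      # integral of 2x on [0, 1]
y = integral_neuron(lambda x: 2.0 * x, [0.0, 1.0])      # tanh of that signal
```

Under this assumption the signal is a smooth function of the weight function's parameters, which is what the derivative-based training methods in the following subsections rely on.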
3.2. Algorithm for Finding the Optimal Weight Function Using Gradient Descent of the Error
The unusual aspect of implementing our proposed idea is that the training algorithm seeks not a vector of numerical weight values that minimizes the error but rather the type and parameters of a weight function. In the context of modeling the weight function, the coefficients of the weight function are also referred to as parameters.
When using the error minimization algorithm (Figure 5) with gradient descent, the following stages are followed:
Initialization: At this stage, an initial function $w(x)$ is selected: this can be any continuous and parametric linear or nonlinear function. Initial values are set for the parameters of $w(x)$ (for example, the coefficients of a polynomial or exponential function).
Error calculation: For each
-th sample in the training set (13)
, the neuron’s output is calculated as
the errors are computed as
and the MSE is formed as
Gradient calculation and parameter update of the weight function: Unlike the traditional approach, at this stage, the derivative of the error is calculated not with respect to weights and biases, but with respect to the parameters of the weight function $w(x)$. For example, in the case of a polynomial form for $w(x)$, derivatives are calculated with respect to the polynomial’s parameters. This means we are actually searching for the optimal parameters of the weight function that minimize the error (29). At each step $k$, the parameter update of $w(x)$ is performed using the calculated gradient:
$$\theta_{k+1} = \theta_k - \eta \nabla E(\theta_k), \tag{30}$$
where $\theta_k$ is the vector of parameters of $w(x)$ at step $k$ and $\eta$ is the descent step, a training hyperparameter also known as the learning rate. The descent step determines how quickly the algorithm will move in the direction of the steepest slope of the error function. The larger its value, the faster the parameters will be updated, but the higher the risk of missing the minimum. On the other hand, a very small step can make the training process very slow, as a small $\eta$ value will require more iterations. The value of $\eta$ can be set by the user implementing the algorithm or can be built into software solutions, with an option for adaptive adjustment during training.
Iterations: Steps 2 and 3 are repeated until one of the following conditions is met:
A local or global minimum of the error function is found;
One of the termination conditions set in the software implementation for optimization is met (e.g., maximum number of epochs reached, specified number of objective function calls exceeded, error plateau identified for a specified number of iterations, etc.).
Upon completion of the algorithm, the desired weight function is the one defined by the parameters $\theta_K$, where $K$ is the number of the last iteration performed.
3.2.1. Example Application of the Algorithm
To clarify the proposed algorithm, let us consider a polynomial weight function $w(x)$ for a neuron whose transfer is achieved through a hyperbolic tangent. The choice of a polynomial weight function in this case is justified, as it is differentiable and easy to integrate. Additionally, according to the Weierstrass Approximation Theorem, any continuous function on a closed interval can be uniformly approximated to arbitrary accuracy by a polynomial [66].
Given a polynomial weight function of degree $d$,
$$w(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_d x^d. \tag{31}$$
Following the described algorithm, the following stages should be completed:
1. Initialization: initial values for the parameters of the polynomial are selected.
2. Error calculation:
For each sample from the training set, the output of the neuron is
After substituting the integrand with the polynomial (31) and solving the integral, we can easily reach the representation
which we use to calculate the errors (28) and (29), which become
and
3. Gradient calculation and parameter update of the weight function:
At this stage, the derivatives $\partial E / \partial a_i$ of the error (35) are calculated with respect to each of the parameters $a_i$ of the function $w(x)$.
The coefficients of the weight function are updated as follows:
$$a_i \leftarrow a_i - \eta \frac{\partial E}{\partial a_i},$$
where $\eta$ is the descent step.
4. Iterations:
Steps 2 and 3 are repeated until a set of parameters of the weight function is reached, for which the error meets the acceptable requirements.
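The four stages can be traced in a small, self-contained sketch. It is not the authors' MATLAB implementation and rests on labeled assumptions: the weight function is a polynomial, the transfer function is the hyperbolic tangent, the integral is taken from the smallest to the largest input (the limits are not recoverable from this text), and the toy training set is the XOR truth table. The function names, the step 0.5, and the 200 iterations are illustrative choices.

```python
import math

def signal(coeffs, inputs):
    """Closed-form integral of the polynomial sum(a_k x^k) over [min, max].

    ASSUMPTION: integration limits are the smallest and largest inputs.
    """
    lo, hi = min(inputs), max(inputs)
    return sum(a * (hi ** (k + 1) - lo ** (k + 1)) / (k + 1)
               for k, a in enumerate(coeffs))

def mse(coeffs, samples):
    """Mean squared error of the tanh neuron over all samples."""
    return sum((y - math.tanh(signal(coeffs, x))) ** 2
               for x, y in samples) / len(samples)

def gradient(coeffs, samples):
    """Analytic dE/da_k via the chain rule through tanh and the integral."""
    grad = [0.0] * len(coeffs)
    for x, y in samples:
        lo, hi = min(x), max(x)
        out = math.tanh(signal(coeffs, x))
        common = -2.0 * (y - out) * (1.0 - out * out) / len(samples)
        for k in range(len(coeffs)):
            grad[k] += common * (hi ** (k + 1) - lo ** (k + 1)) / (k + 1)
    return grad

def train(coeffs, samples, eta=0.5, iterations=200):
    """Stages 2-4: repeat error/gradient computation and the update."""
    for _ in range(iterations):
        g = gradient(coeffs, samples)
        coeffs = [a - eta * gk for a, gk in zip(coeffs, g)]
    return coeffs

# Stage 1: initialization (degree-1 polynomial, all coefficients zero).
xor_samples = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
               ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
initial = [0.0, 0.0]
trained = train(initial, xor_samples)
```

Comparing `mse(initial, xor_samples)` with `mse(trained, xor_samples)` shows the error decreasing under these assumptions; a different choice of integration limits would change the closed-form signal and its gradient accordingly.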
3.2.2. Convergence of the Algorithm
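Let us denote the parameters of the weight function and the gradient of the error function as column vectors:
$$\theta = (\theta_1, \theta_2, \ldots, \theta_p)^T$$
and
$$\nabla E(\theta) = \left(\frac{\partial E}{\partial \theta_1}, \frac{\partial E}{\partial \theta_2}, \ldots, \frac{\partial E}{\partial \theta_p}\right)^T.$$
The possibility of finding the optimal function $w(x)$, for which the error function is minimal, is demonstrated by the following convergence theorem for the proposed algorithm.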
Theorem 1 (Convergence of the Gradient Descent Method for Optimizing the Weight Function). Let an integral neuron of at least first class with a smooth transfer function $f$ be given. Let $\theta$ be the vector containing the parameters of the weight function $w(x)$, and $\nabla E(\theta)$ be the gradient of the error function.
Then, if the parameter update for $\theta$ is performed through an optimization algorithm based on gradient descent of the mean squared error function $E$ with step $\eta > 0$,
$$\theta_{k+1} = \theta_k - \eta \nabla E(\theta_k),$$
then the algorithm is convergent and brings the parameters of the function $w(x)$ closer to values that minimize $E$.
Proof of Theorem 1. In the first stage, we will demonstrate the differentiability of the error function and the applicability of the gradient descent method.
Let an integral neuron (
Figure 6) of at least first class with a smooth transfer function
be given, which is trained with examples from the set
.
Since the mean squared error function is represented as
it is also differentiable, as it is a composition of differentiable functions that comprise it. Therefore, the derivative
exists and is continuous for every value of
. Then, given the bounded nature of
(the function is bounded below:
), it follows that an optimization algorithm using gradient descent on this function can be applied.
In the second stage, we will demonstrate the reduction in the error.
At each iteration, the algorithm updates the weight function .
Here,
is the descent step. Let
be the value of the weight function involved in the error (40) at the end of the
-th iteration. Then, its next value will be
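Let us consider the first-order Taylor series expansion of the error function:
$$E(\theta_{k+1}) \approx E(\theta_k) + \nabla E(\theta_k)^T (\theta_{k+1} - \theta_k). \tag{44}$$
If we substitute $\theta_{k+1} - \theta_k = -\eta \nabla E(\theta_k)$ from the parameter update in Equation (42) into (44), we obtain
$$E(\theta_{k+1}) \approx E(\theta_k) - \eta \left\|\nabla E(\theta_k)\right\|^2.$$
Since $\eta > 0$ and $\left\|\nabla E(\theta_k)\right\|^2 \ge 0$, it follows exactly what we intended to prove:
$$E(\theta_{k+1}) \le E(\theta_k).$$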
It is important to note the following: If the error function $E$ is convex, the algorithm will approach the global minimum. Otherwise, the convergence may be towards a local minimum; however, the algorithm will still be convergent and will find the necessary parameters of the weight function at which this minimum is achieved. □
3.3. Algorithm for Finding the Optimal Weight Function Using the Gauss–Newton Method
Applying the Gauss–Newton method within our framework of using a weight function instead of static weights still requires going through the previously described steps 1–4. The first three steps remain the same, and Equations (13) and (27)–(29) are once again satisfied (Figure 7). The new and key aspect is the update of the weight function parameters
$$\theta = (\theta_1, \theta_2, \ldots, \theta_p)^T,$$
which is now expressed through the transition
$$\theta_{k+1} = \theta_k - \left(J^T J\right)^{-1} J^T e. \tag{48}$$
In the transition in (48), $J$ is the Jacobian matrix, containing the partial derivatives of the errors $e_j$ with respect to each parameter of $w(x)$:
$$J = \left[\frac{\partial e_j}{\partial \theta_i}\right], \quad j = 1, \ldots, m, \quad i = 1, \ldots, p,$$
and $e = (e_1, e_2, \ldots, e_m)^T$ is the vector containing all errors as each training sample passes through the neuron at each iteration. Here, $m$ is the number of training samples, and $p$ is the number of parameters of the weight function.
Theorem 2 (Convergence of the Gauss–Newton Method for Optimizing the Weight Function). Let an integral neuron of at least second class with a smooth transfer function $f$ be given. Let $\theta$ be the vector containing the parameters of the weight function, $J$ be the Jacobian matrix of the error function $E$, and $e = (e_1, e_2, \ldots, e_m)^T$ be the vector of errors.
Then, if the rank of $J$ is full and the parameter update for $\theta$ is performed using the Gauss–Newton optimization method,
$$\theta_{k+1} = \theta_k - \left(J^T J\right)^{-1} J^T e,$$
the algorithm is convergent and brings the parameters of the function $w(x)$ closer to values that minimize $E$.
Proof of Theorem 2. Let the Gauss–Newton method approximate the error function using a quadratic approximation and update the parameter vector $\theta$ according to the equation
$$\theta_{k+1} = \theta_k - \left(J^T J\right)^{-1} J^T e, \tag{51}$$
where $J$ is the Jacobian matrix containing the partial derivatives of the errors $e_j$ with respect to the parameters $\theta$, and $e$ is the vector of errors for all samples $j = 1, 2, \ldots, m$.
The error function of the neuron is represented as
where
is the weight function, which depends on parameter
. Given that the neuron is at least of the second class (
), it follows that the composition
, and it can be quadratically approximated in a small neighborhood of the current solution
by
where
is the transposed gradient of the error function, and
is the Hessian matrix:
From (51) and (55), it follows that
The change in the error function with this update
is represented as
The first term in the change in the error is
This term represents the main contribution to the change in $E$ and is linearly dependent on $\Delta\theta$. Since a linear term outweighs a quadratic term for small values of $\Delta\theta$ (where the new parameter values $\theta_{k+1}$ remain close to the old $\theta_k$), the first term dominates in the expression for $\Delta E$.
If the function
is exactly quadratic, then the second term
represents the exact contribution of the second derivative to the change
. In this case, the quadratic approximation is exact, and the approximation error is zero. If the error function is nearly quadratic, this means that the higher derivatives of
(third-order and above) are very small. This results in a minimal contribution from the second term to the overall change in
. Ultimately, we reach what we set out to prove:
Before we proceed, let us note something important. In the statement of Theorem 2, suitable initial values for the parameters of the weight function were deliberately left undefined. The Gauss–Newton and Levenberg–Marquardt methods (the latter based on the former) use a quadratic approximation of the error function, which performs well only in a small neighborhood around the optimal solution. Therefore, the initial parameters need to be close to the global minimum for the algorithm to converge towards it. However, if the initial parameters are far from the global minimum, the algorithm will tend to find the nearest local minimum. It is important to emphasize that, in all cases, the algorithm will converge to some minimum, which may be local but is not necessarily the global minimum. □
3.4. Algorithm for Finding the Optimal Weight Function Using the Levenberg–Marquardt Method
The Levenberg–Marquardt method combines the gradient descent and Gauss–Newton methods for nonlinear optimization, making it particularly effective for minimization tasks of nonlinear functions, such as the error function in ANNs. In this method, the parameter update for the weight function is represented by the transition (Figure 8)
$$\theta_{k+1} = \theta_k - \left(J^T J + \lambda I\right)^{-1} J^T e,$$
where
$\theta$ is the vector containing the parameters of the weight function $w(x)$;
$J$ is the Jacobian matrix with the partial derivatives of the error with respect to each parameter of $w(x)$;
$e$ is the vector containing all errors as each training sample passes through the neuron at each iteration;
$\lambda$ is the parameter that enables the transition between the gradient descent method and the Gauss–Newton method;
$I$ is the identity matrix.
Figure 8.
Algorithm for finding the optimal weight function using the Levenberg–Marquardt method.
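A compact sketch of the damped update with adaptive $\lambda$ is given below. It is illustrative rather than the authors' implementation and carries several assumptions: a two-parameter linear weight function $w(x) = \theta_0 + \theta_1 x$, a hyperbolic tangent transfer function, integration from the smallest to the largest input (the limits are not recoverable from this text), residuals defined as output minus target, and synthetic targets generated from hypothetical "true" parameters. Setting `lam` to zero would reduce `lm_step` to a Gauss–Newton step.

```python
import math

def basis(x, n_params):
    """d(signal)/d(theta_k) for a polynomial weight function.

    ASSUMPTION: the signal is the integral of w over [min(x), max(x)].
    """
    lo, hi = min(x), max(x)
    return [(hi ** (k + 1) - lo ** (k + 1)) / (k + 1) for k in range(n_params)]

def residuals_and_jacobian(theta, samples):
    """Errors e_j = output - target and the Jacobian rows de_j/dtheta."""
    e, J = [], []
    for x, y in samples:
        g = basis(x, len(theta))
        out = math.tanh(sum(t * gk for t, gk in zip(theta, g)))
        e.append(out - y)
        J.append([(1.0 - out * out) * gk for gk in g])
    return e, J

def mse(theta, samples):
    e, _ = residuals_and_jacobian(theta, samples)
    return sum(r * r for r in e) / len(e)

def lm_step(theta, samples, lam):
    """theta - (J^T J + lam I)^{-1} J^T e for two parameters (2x2 solve)."""
    e, J = residuals_and_jacobian(theta, samples)
    a = sum(r[0] * r[0] for r in J) + lam
    b = sum(r[0] * r[1] for r in J)
    d = sum(r[1] * r[1] for r in J) + lam
    g0 = sum(r[0] * ej for r, ej in zip(J, e))
    g1 = sum(r[1] * ej for r, ej in zip(J, e))
    det = a * d - b * b
    return [theta[0] - (d * g0 - b * g1) / det,
            theta[1] - (a * g1 - b * g0) / det]

def train_lm(theta, samples, lam=1e-2, iterations=50):
    """Accept a step only if the error drops; adapt lam accordingly."""
    err = mse(theta, samples)
    for _ in range(iterations):
        candidate = lm_step(theta, samples, lam)
        candidate_err = mse(candidate, samples)
        if candidate_err < err:      # success: move toward Gauss-Newton
            theta, err, lam = candidate, candidate_err, lam / 10.0
        else:                        # failure: move toward gradient descent
            lam *= 10.0
    return theta, err

# Synthetic data from hypothetical "true" parameters.
true_theta = [0.3, -0.2]
xs = [(0.0, 1.0), (0.0, 2.0), (1.0, 2.0), (-1.0, 1.0)]
data = [(x, math.tanh(sum(t * g for t, g in zip(true_theta, basis(x, 2)))))
        for x in xs]
start = [0.0, 0.0]
fitted, final_err = train_lm(start, data)
```

Because a candidate step is kept only when it lowers the error, the error sequence produced by `train_lm` never increases, which mirrors the accept or reject logic discussed in the proof that follows.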
Theorem 3 (Convergence of the Levenberg–Marquardt Method for Optimizing the Weight Function). Let an integral neuron of at least second class with a smooth transfer function $f$ be given. Let $\theta$ be the vector containing the parameters of the weight function, $J$ be the Jacobian matrix of the error function $E$, and $e = (e_1, e_2, \ldots, e_m)^T$ be the vector of errors. Let $\lambda$ be the parameter for transitioning between gradient descent and the Gauss–Newton method ($\lambda > 0$), and let $I$ be the identity matrix of the same dimension as $J^T J$.
Then, if the rank of $J$ is full and the parameter update for $\theta$ is performed using the Levenberg–Marquardt optimization method,
$$\theta_{k+1} = \theta_k - \left(J^T J + \lambda I\right)^{-1} J^T e,$$
the algorithm is convergent and brings the parameters of the function $w(x)$ closer to values that minimize $E$.
Proof of Theorem 3. The proof of this theorem is similar to that of Theorem 2, concerning the Gauss–Newton method. We again consider a neuron of at least second class with a smooth transfer function $f$, which is trained with examples from the training set (13).
Since the MSE function is represented as
and given that the neuron is at least of the second class, it follows that the composition
. To show that the algorithm
is convergent, we need to show that
Formally, the parameter update
at the
-th iteration is expressed by the equation
Let
denote the matrix
. It is evident that
The error function
can be simplified in an infinitesimally small neighborhood of the current value
through a quadratic approximation:
where
Then, if we substitute the change (67) in the parameter update for the weight function
into expression (68) and denote
we obtain
Since and is positive definite, we can maintain the Levenberg–Marquardt optimization idea of dynamically adjusting $\lambda$ during training:
If the error after the update is lower than before it, the update is successful. In this case, the algorithm decreases $\lambda$, allowing for larger steps in the next iteration.
If the error after the update is higher than before it, the update is unsuccessful. In this case, $\lambda$ is increased, which limits the step size and makes the behavior closer to gradient descent.
Thus, ultimately, in each iteration, the Levenberg–Marquardt algorithm will adjust the parameters of the weight function to reduce the error function. Since the error function is at least twice differentiable and the algorithm dynamically adapts the step size through $\lambda$, convergence to a minimum (possibly a local one) is guaranteed, even if we start with arbitrary initial parameter values. □
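The reason a sufficiently damped step cannot increase the error to first order can be made explicit with a short calculation. The notation below ($\theta$, $J$, $e$, $g$, $\Delta\theta$) is assumed for this sketch, and the error is taken as $E = \tfrac{1}{2}\lVert e \rVert^2$, which differs from the MSE only by a positive constant factor.

```latex
\begin{aligned}
g &= \nabla_{\theta} E = J^{\top} e, \qquad
\Delta\theta = -\left(J^{\top} J + \lambda I\right)^{-1} g, \\[4pt]
g^{\top} \Delta\theta &= -\, g^{\top} \left(J^{\top} J + \lambda I\right)^{-1} g \;<\; 0
\quad \text{for } g \neq 0,\ \lambda > 0,
\end{aligned}
```

because $J^{\top} J + \lambda I$ is symmetric positive definite whenever $\lambda > 0$, and so is its inverse. The step is therefore a descent direction, and increasing $\lambda$ shortens it until the first-order decrease dominates the higher-order terms.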
3.5. Selection of Weight Function and Algorithm Complexity
The determination of an appropriate weight function depends on the specifics of the problem itself. Polynomial, exponential, and trigonometric functions are often a good choice due to their flexibility and ease of parameterization. For problems requiring interpolation or subsequent extrapolation, polynomial weight functions are an excellent choice. However, if the goal is to use the integral neuron to describe decaying dependencies, exponential weight functions would be more suitable. Gaussian weight functions are appropriate for processing local dependencies, and so on. Moreover, considering the fact that the weight function essentially determines the signal within the neuron’s body, which is then passed as an argument to the transfer function, several details should be noted that might prove critical when selecting the weight function:
1. Interval and values of input data:
The weight function must be compatible with the distribution of input values. If the inputs are defined within a specific range, the weight function can be as follows:
- A polynomial function for uniform distribution;
- A Gaussian function for local dependencies, where the weight function can be adjusted to a specific sub-range through parameters related to the mean and standard deviation.
2. Transfer function:
The selection of the weight function should also take into account the characteristics of the transfer function:
- For saturating transfer functions such as the sigmoid, the weight function should ensure that the signal falls within regions where the transfer function is most sensitive (e.g., around zero). For instance, polynomial and exponential weight functions can be scaled to direct the signal to optimal ranges.
- For transfer functions with boundaries or whose domain excludes certain points, the weight function must constrain the signal to a range compatible with the required restrictions. For example, for the reverse sigmoid and similar modified transfer functions, the weight function that forms the signal must be chosen so that the signal never takes a value outside the domain, as this would render the use of such transfer functions impossible.
3. Smoothness and computational complexity:
The weight function should be sufficiently smooth (at least twice differentiable, so that the Hessian exists) to facilitate the following:
- The calculation of gradients and the Jacobian and Hessian matrices during training;
- The correct behavior of the integral.
From the perspective of computational complexity, different choices of weight functions can be considered. Polynomial functions are easy to compute and differentiate, while more complex weight functions offer additional flexibility but are computationally more expensive.
4. Nature of the problem:
The type of task (regression, classification, modeling dependencies) influences the choice:
- Regression: Polynomial or exponential weight functions provide smooth approximation.
- Classification: Locally sensitive weight functions (such as Gaussian) can aid in class separation.
- Temporal tasks: Exponential weight functions can model a decaying effect over time.
Ultimately, the problem of selecting the most appropriate weight function for the integral neuron is the same as the problem of choosing a transfer function for standard neurons. In the theory of artificial neural networks, there is still no objective rule for this. Of course, it is advisable for the weight function to be continuous, monotonic, and differentiable, but which specific function would be the most useful remains a matter of additional judgment. Based on their experience, researchers develop professional intuition that helps them determine what type of transfer function is needed under specific conditions defined by the task and the other parameters of the neurons being used. The exact, specific type is determined as a result of experimentation and comparison of results. For this purpose, algorithms are even being developed for automated search for suitable neurons. Similarly, the same principles apply to the optimal selection of the weight function.
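To make the comparison between weight function families concrete, the following minimal sketch evaluates the body signal for polynomial, exponential, and Gaussian candidates. It assumes that the signal is the integral of the weight function from 0 to the sum of the inputs (each input entering with weight 1); the function names and parameter values are hypothetical and chosen only for illustration.

```python
import math

def poly_weight(coeffs):
    """Polynomial weight function; coeffs[k] multiplies t**k."""
    return lambda t: sum(c * t**k for k, c in enumerate(coeffs))

def exp_weight(amplitude, rate):
    """Decaying exponential weight function."""
    return lambda t: amplitude * math.exp(-rate * t)

def gauss_weight(amplitude, mean, std):
    """Gaussian weight function, sensitive near `mean`."""
    return lambda t: amplitude * math.exp(-((t - mean) ** 2) / (2 * std**2))

def body_signal(weight, inputs, steps=200):
    """Assumed signal: integral of `weight` from 0 to sum(inputs),
    approximated with the composite Simpson rule (steps must be even)."""
    upper = sum(inputs)
    if upper == 0:
        return 0.0
    h = upper / steps
    total = weight(0.0) + weight(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * weight(i * h)
    return total * h / 3

inputs = [0.2, 0.5, 0.3]  # hypothetical sample; the inputs sum to 1
candidates = {
    "polynomial": poly_weight([2.0, -2.0]),   # w(t) = 2 - 2t
    "exponential": exp_weight(1.0, 1.0),      # w(t) = exp(-t)
    "gaussian": gauss_weight(1.0, 0.5, 0.2),  # bump centred at t = 0.5
}
signals = {name: body_signal(w, inputs) for name, w in candidates.items()}
for name, value in signals.items():
    print(f"{name:12s} signal = {value:.6f}")
```

The same inputs produce different signals for each family, which is what the subsequent transfer function sees; numerical integration is used here only so that any candidate can be tried without deriving its antiderivative by hand.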
4. Discussion: Classical and Integral Neurons and Neural Networks
It turns out that the optimization algorithms—gradient descent, Gauss–Newton, and their generalization, the Levenberg–Marquardt method—are convergent and, starting from random parameters for the weight function, can optimize it in a way that indeed minimizes the error. Here, it is essential to emphasize something critical. The condition for the proximity of input data that we imposed from the start,
for
, ensured the possibility of representing
This condition was solely driven by our desire to establish equivalence between classical neurons and those with weight functions. However, the initial conditions of the approximation problem are often not like this. In the general case, the input data provided for each sample can have values spaced far apart ( has a large value), which could introduce error in the representation (71). In these cases, a commonly used approach is data normalization. If the input–output data are transformed via appropriate normalization to fit within a small interval, (71) will hold with negligible error, giving us a weight function that returns different weights along the dendrites depending on the dataset. This effectively endows the neuron with dynamic characteristics.
In the opposite case, if the proximity requirement is disregarded, entirely new neural structures emerge that have no standard equivalent. However, the cumulative signal within their bodies is still obtained from the value of the integral in (71) with some arbitrary continuous and smooth weight function:
In this case, Theorems 1–3 will still hold, which is very important. By abandoning the proximity requirement, we see the possibility of the existence and training of neural structures characterized by the absence of a true dendritic tree, where each input dataset is provided as a set through only one dendritic segment with a weight of 1 (Figure 9).
Since, for these integral neurons, it is not required that
, it follows that we cannot assign them a classical equivalent with weights
for which
This makes integral neurons an independent class, distinct from the classical structures we are accustomed to working with. Considering that
it follows that we can represent the integral neuron using familiar schemes where the dendritic branches have the same transmissive strength, equal to 1 (Figure 10).
The conditions in Theorems 1–3 do not require the proximity of input data, making them applicable to integral neurons as well. With these theorems, we have theoretically demonstrated that, just like classical neural structures, integral neural structures can be trained with samples and ultimately approximate a given problem with minimal error.
In classical neurons, training algorithms focus on modifying the weights, which influence the signal within the neuron body and, subsequently, the output. In integral neural structures, the weights of input data are fixed and constant with a value of 1, and the training algorithms are directed specifically at the parameters of the weight function, that is, at optimizing the type and shape of the signal within the neurons (Figure 11).
Another avenue for discussion is the possibility of creating neural structures that contain both integral and classical neurons simultaneously. Although this study examines only individual integral neurons, definitions can be proposed for the new integral structures created with their assistance.
Definition 3 (Integral Neural Network). An artificial neural network consisting entirely of integral neurons of class will be called a fully integral neural network of class .
Definition 4 (Network with Integral–Classical Architecture). A neural network containing layers of integral neurons of class and layers of classical neurons with a linear cumulative signal in the body will be called a neural network with integral–classical architecture of class and signature .
A neural network with an integral–classical architecture combines layers of integral and classical neurons organized in a sequential or parallel structure, enabling both the processing of input stimuli through integral weight functions and calculations via linear combinations of stimuli, depending on the specific task and training requirements. In this type of hybrid architecture, integral layers can be seen with neurons, in the bodies of which we have a signal,
as well as classical structures with a linear signal in the bodies
where
and
are the sets of indices of neurons from previous layers that provide output signals to the current layer. This hybrid architecture combines the advantages of both neuron types, allowing flexible (integral and nonlinear) representation of dependencies among inputs and standard processing through classical linear sums. It is evident that a fully integral neural network with
hidden layers can be considered and exhibits all the characteristics of an integral–classical network with signature
, and any classical neural network with
hidden layers can be regarded as an integral–classical architecture with signature
, where
.
Definition 5 (Architectural Balance). Let be a neural network with an integral–classical architecture of signature . The ratio
where and are the numbers of integral and classical neurons in the respective layers, will be called the architectural balance coefficient. We will say that the neural network has an ideal architectural balance if and only if and . Such a neural network will be referred to as integrally and classically balanced.
By adjusting the signature (), thereby altering the architectural balance of the neural structure, we can optimize the network for specific tasks. For tasks with higher demands for nonlinear representation, the number of integral layers and neurons can be increased, while for tasks with simpler dependencies, a greater emphasis can be placed on classical layers and neurons. Naturally, it should be acknowledged that, although an integral–classical architecture combines the strengths of both approaches (the efficiency of linear computations and the power of integral dependencies), training this type of structure presents a true scientific and practical challenge. One of the team’s goals is to develop multitask training where integral and classical layers are trained with partially independent algorithms, but with coordinated optimization steps across iterations.
5. Integral XOR Neuron
A particularly important topic in the theory of artificial neural networks relates to the XOR logical function and the problem that arises from the inability to train a single neuron to solve XOR.
For the logical functions “AND” and “OR”, it is straightforward to construct a linear classifier and, therefore, a single neuron that recognizes objects from the two classes of points, returning “1” or “0”, respectively (Figure 12).
The recognition of different classes is due to the fact that a signal is formed within the body of a standard neuron:
which, for
, is equivalent to the line
that separates the classes. For points on one side of the line,
, and for those on the other side,
. The division of points into two classes is achieved using an appropriate (typically step) transfer function.
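The linear separation described above can be illustrated with a minimal sketch of a classical neuron with a step transfer function. The weights and biases below are hypothetical values chosen by hand, not taken from the article; any line that separates the two classes would work equally well.

```python
def step(s):
    """Step transfer function: 1 on or above the separating line, else 0."""
    return 1 if s >= 0 else 0

def perceptron(weights, bias):
    """Classical neuron with the linear body signal w1*x1 + w2*x2 + bias."""
    return lambda x1, x2: step(weights[0] * x1 + weights[1] * x2 + bias)

# Hypothetical hand-picked parameters for the two separable functions.
and_neuron = perceptron((1.0, 1.0), -1.5)  # line x1 + x2 = 1.5
or_neuron = perceptron((1.0, 1.0), -0.5)   # line x1 + x2 = 0.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [and_neuron(x1, x2) for x1, x2 in inputs]
or_outputs = [or_neuron(x1, x2) for x1, x2 in inputs]
print("AND:", and_outputs)
print("OR: ", or_outputs)
```

Only the bias differs between the two neurons: shifting the same line is enough to move from AND to OR.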
On the other hand, “exclusive OR” (XOR) is a logical operation between two Boolean variables that returns true (“1”) if and only if the inputs are different. The XOR truth table (Table 1) is as follows:
x1 | x2 | x1 XOR x2
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0
The key problem with XOR and a single neuron lies in the fact that XOR cannot be solved through linear classification, as there is no single line that can separate the input values (0,0) and (1,1), which return “0”, from (0,1) and (1,0), which return “1”, in a two-dimensional space (Figure 13). This means that a perceptron cannot find weights that successfully separate the ordered input pairs based on their XOR output values.
This problem was highlighted as early as 1969 in the book by Marvin Minsky and Seymour Papert [67], which led to a temporary decline in interest in neural network research. The authors demonstrated that single-layer perceptrons are limited and cannot be trained on certain types of logical functions, such as XOR.
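The impossibility of a linear separation for XOR follows from a short contradiction. The notation below ($w_1$, $w_2$, $b$) is assumed for this sketch, with the neuron returning 1 exactly when $w_1 x_1 + w_2 x_2 + b > 0$.

```latex
\begin{aligned}
(0,0) \mapsto 0 &:\quad b \le 0, \\
(0,1) \mapsto 1 &:\quad w_2 + b > 0, \\
(1,0) \mapsto 1 &:\quad w_1 + b > 0, \\
(1,1) \mapsto 0 &:\quad w_1 + w_2 + b \le 0. \\[4pt]
\text{Adding the two middle lines:}&\quad w_1 + w_2 + 2b > 0
\;\Rightarrow\; w_1 + w_2 + b > -b \ge 0,
\end{aligned}
```

which contradicts the last condition, so no choice of $w_1$, $w_2$, $b$ satisfies all four requirements.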
To overcome the limitation of a single perceptron, multilayer perceptrons (MLPs) were introduced, which include one or more hidden layers. These layers enable the ANN to learn and represent nonlinear relationships, like XOR, by using combinations of linear classifications connected through nonlinear transfer functions, such as sigmoid or ReLU. Thus, MLPs can create complex, nonlinear separating surfaces that successfully model and solve XOR and similar functions. However, the problem of representing XOR with a single neuron remains unresolved in the theory of ANNs. One reason for this is that, since the advent of artificial neurons, the signal formed within their body has been conceptualized in the same way.
What would happen if the sum in the neuron’s body lost its linear, discrete nature,
and was instead replaced by
where
is the neuron’s weight function? In the case of the XOR logical function, we consider a neuron with two inputs (Figure 14), which we will refer to as an integral XOR neuron.
The set of training samples for the neuron is represented as
where the input–output data values are as specified in Table 2.
Before presenting the results of our theoretical considerations and their experimental confirmation, it is important to note that the Levenberg–Marquardt algorithm is sensitive to errors and to the training stage. With a higher amplitude transfer function, neuron output errors are detected and corrected more effectively, leading to a faster convergence to the error function minimum and improved overall convergence. Many transfer functions, including the hyperbolic tangent, have relatively smooth gradients around zero, which can slow down training.
On the other hand, scaling transfer functions by coefficients does not alter their shape but intensifies their effect, increasing gradients and amplitudes, which makes training more efficient. Therefore, we have chosen the transfer function $3\tanh(x)$. We deliberately multiply by three to increase the neuron’s output amplitude, meaning that the output signal becomes stronger for the same input values. This allows the neuron to cover a broader range of values and respond more strongly to changes in its inputs, helping the training process capture and correct errors more effectively.
Subsequent experiments with linear and sinusoidal transfer functions also demonstrated their viability in building an integral XOR neuron. In all three experiments, a first-degree polynomial was used as the weight function.
In artificial neural network theory, there is no strict rule or objective criterion for selecting an appropriate transfer function. Researchers typically choose a transfer function based on prior experience and professional intuition, considering the specific task or problem requirements. For instance, continuous transfer functions with sigmoid characteristics are known to solve a wide range of problems, but the specific choice of function involves experimentation combined with past experience.
Similarly, the choice of weight function follows the same principle. While certain criteria, such as continuity and smoothness, are essential, the exact form of the weight function depends on the particular task. Considering the XOR problem, with only four training samples (Table 1), it is appropriate to select a polynomial weight function with a maximum degree of three, possessing four parameters. However, due to the characteristics of the transfer functions used, experimentation revealed that the optimal weight function is based on a first-degree polynomial, characterized by two parameters, whose optimal values are determined by the training algorithms.
The MATLAB v2018a scripts for creating, training, and operating integral XOR neurons are provided in Appendix A of this article.
5.1. Solving XOR with an Integral Neuron Using a Hyperbolic Tangent Transfer Function
5.1.1. Theoretical Solution
For the weight function, we choose a first-degree polynomial:
where
are the parameters of
that need to be optimized.
The error function is represented as
where
We will prove that parameters
for the weight function
exist, such that the neuron can solve the XOR logical function. In practice, we need to show that ∃
such that
The first equality
holds for any
.
The second and third equalities in (86), due to the nature of
, provide the same information,
which implies
Finally, from the last equality in (86), we have
i.e.,
Then, from (89) and (91), we can form the system
From this, it follows that .
Thus, in the body of the XOR neuron, we have the signal
and the neuron itself appears as shown in Figure 15.
This integral neuron should be able to solve the XOR logical function.
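The derivation above can be checked numerically with a minimal sketch. It rests on two assumptions that are not confirmed by the text as extracted here: the body signal is the integral of the first-degree weight function $at + b$ from 0 to $x_1 + x_2$, and the transfer function is $3\tanh$. Under these assumptions the two conditions reduce to $a/2 + b = \operatorname{artanh}(1/3)$ and $2a + 2b = 0$, which gives $a = -\ln 2$ and $b = \ln 2$; the article's stated values may differ if its definitions differ.

```python
import math

def body_signal(a, b, x1, x2):
    """Assumed signal: integral of (a*t + b) from 0 to x1 + x2."""
    u = x1 + x2
    return a * u**2 / 2 + b * u

def xor_neuron(a, b, x1, x2):
    """Integral XOR neuron with the assumed transfer function 3*tanh."""
    return 3 * math.tanh(body_signal(a, b, x1, x2))

# Parameters derived under the stated assumptions (not quoted from the article).
a = -math.log(2)
b = math.log(2)

outputs = {
    (x1, x2): xor_neuron(a, b, x1, x2)
    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]
}
for pair, value in outputs.items():
    print(pair, round(value, 12))
```

The inputs (0,1) and (1,0) share the same upper limit of integration, which is why they impose a single condition on the two parameters.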
5.1.2. Experimental Confirmation
For the purposes of the experiments, a MATLAB v2018a script was created (see Appendix B) that implements our proposed Levenberg–Marquardt algorithm for training an integral neuron. When using this script, arbitrary initial values are set for the vector containing the parameters of the weight function and then updated during training using the transition
where
is the Jacobian matrix containing the derivatives of the error
with respect to each parameter of
,
is the vector containing all errors for each sample in the training set that passes through the neuron at each iteration,
is the parameter that controls the transition between the gradient descent method and the Gauss–Newton algorithm, and
is the identity matrix.
During the iterations, the following parameters are tracked:
Iteration: This is the number of the current iteration of the algorithm. The algorithm goes through multiple iterations, updating parameters each time in an attempt to find the optimal solution. Iterations continue until the maximum number of iterations is reached or convergence criteria are met (e.g., when changes in error or step size become very small).
Func-count: This is the number of calls to the objective function, which calculates the error (residuals). Monitoring the number of calls is essential because each call requires computations that can be costly for complex tasks. The algorithm may call the function more than once per iteration to calculate gradients or other necessary quantities.
First-Order Optimality: This measures how close the current solution is to the optimum, calculated using the gradient of the error function. In optimization, first-order optimality measures how far the gradient is from zero. A gradient close to zero suggests proximity to the optimal solution. If the first-order optimality value is low, it is an indication that the algorithm may be near a local (or global) minimum.
Lambda ($\lambda$): This is the Levenberg–Marquardt parameter controlling the transition between the gradient descent method and the Gauss–Newton method. At high values of $\lambda$, the algorithm behaves like gradient descent, taking smaller, more conservative steps. When $\lambda$ is low, the algorithm approaches the Gauss–Newton method, making more aggressive steps that can lead to faster convergence. The algorithm dynamically adjusts $\lambda$ based on the success of previous iterations.
Norm of Step: The step norm indicates the magnitude of the parameter change in the current iteration. Large step norm values usually mean the algorithm is making significant adjustments to parameters, while small values indicate the algorithm is fine-tuning and making more minor adjustments. When the step norm becomes very small, it usually signifies that the algorithm is close to convergence.
To start the algorithm, the matrix
is randomly initialized with values in the range [−1,1]. For the results presented in Table 3, the algorithm randomly selected initial values for the weight function parameters as follows:
.
As is typical in artificial intelligence, initially, with the randomly selected parameters of its weight function
, the integral neuron is not adapted to the problem it needs to solve, resulting in an error of 12.4149. However, as the iterations proceed, the parameters of
gradually change, leading to increasing effectiveness until reaching zero error. In this case, the algorithm identifies the optimal parameters for the weight function as
, making the final form of the weight function
It is noteworthy that , as predicted by theory, indicating that the Levenberg–Marquardt algorithm has accurately reached the global minimum of the error function.
The results returned by the integral neuron with this weight function are presented in Table 4.
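The training procedure can be sketched in a few lines of Python. This is not the MATLAB script from Appendix B; it is a minimal illustration that assumes the body signal $a u^2/2 + b u$ with $u = x_1 + x_2$, the transfer function $3\tanh$, a sum-of-squares error, and a simple rule that divides or multiplies $\lambda$ by 10. All names and constants are hypothetical.

```python
import math

SAMPLES = [((0, 0), 0.0), ((0, 1), 1.0), ((1, 0), 1.0), ((1, 1), 0.0)]

def residuals_and_jacobian(a, b):
    """Residuals (output - target) and their derivatives w.r.t. (a, b)."""
    res, jac = [], []
    for (x1, x2), target in SAMPLES:
        u = x1 + x2
        t = math.tanh(a * u**2 / 2 + b * u)
        slope = 3 * (1 - t * t)          # derivative of 3*tanh(s) w.r.t. s
        res.append(3 * t - target)
        jac.append((slope * u**2 / 2, slope * u))
    return res, jac

def sum_squared_error(a, b):
    res, _ = residuals_and_jacobian(a, b)
    return sum(r * r for r in res)

def train(a, b, lam=0.01, max_iter=200, tol=1e-12):
    """Levenberg-Marquardt loop for two parameters (2x2 system solved directly)."""
    err = sum_squared_error(a, b)
    for _ in range(max_iter):
        res, jac = residuals_and_jacobian(a, b)
        g0 = sum(j[0] * r for j, r in zip(jac, res))   # J^T e
        g1 = sum(j[1] * r for j, r in zip(jac, res))
        a00 = sum(j[0] * j[0] for j in jac) + lam      # J^T J + lam*I
        a11 = sum(j[1] * j[1] for j in jac) + lam
        a01 = sum(j[0] * j[1] for j in jac)
        det = a00 * a11 - a01 * a01
        if det <= 0:                                   # numerical safeguard
            lam = min(lam * 10, 1e12)
            continue
        da = -(a11 * g0 - a01 * g1) / det
        db = -(a00 * g1 - a01 * g0) / det
        new_err = sum_squared_error(a + da, b + db)
        if new_err < err:                              # accept, trust the model more
            a, b, err = a + da, b + db, new_err
            lam = max(lam / 10, 1e-9)
        else:                                          # reject, damp more strongly
            lam = min(lam * 10, 1e12)
        if err < tol:
            break
    return a, b, err

a0, b0 = 0.5, -0.3                                     # hypothetical starting point
a_fit, b_fit, final_err = train(a0, b0)
print("start error:", sum_squared_error(a0, b0))
print("fitted a, b:", a_fit, b_fit, "final error:", final_err)
```

Because a step is accepted only when it lowers the error, the error never increases from one iteration to the next. Whether a particular starting point reaches zero error depends on the initialization, in line with the remarks on local minima above.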
5.2. Solving XOR with an Integral Neuron Using the Transfer Function
5.2.1. Theoretical Solution
By analogy with the theoretical considerations we made for an integral neuron with the transfer function
and with the same choice of weight function, we conclude that the optimal parameters satisfying the equations
are solutions to the system
The solution to (97) leads to optimal parameters for the weight function of
, which means that the integral neuron capable of handling XOR would appear as shown in Figure 16.
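Under stated assumptions, the system for this case can be worked out by hand. The sketch below assumes the identity transfer function $f(s) = s$, the first-degree weight function $\omega(t) = at + b$, and the body signal $s = \int_0^{x_1 + x_2} \omega(t)\,dt$; these symbols are notation assumed here, and the article's linear transfer function may carry a different scale.

```latex
\begin{aligned}
s(x_1, x_2) &= \frac{a}{2}\,(x_1 + x_2)^2 + b\,(x_1 + x_2), \\[4pt]
(0,1),\,(1,0) \mapsto 1 &:\quad \frac{a}{2} + b = 1, \\
(1,1) \mapsto 0 &:\quad 2a + 2b = 0 \;\Rightarrow\; b = -a, \\[4pt]
\frac{a}{2} - a = 1 &\;\Rightarrow\; a = -2,\quad b = 2,\quad \omega(t) = 2 - 2t .
\end{aligned}
```

As a check, $s = -1 + 2 = 1$ when exactly one input is 1, and $s = -4 + 4 = 0$ when both are 1.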
5.2.2. Experimental Confirmation
Applying the Levenberg–Marquardt method to the parameters of the weight function, with an initial random selection of
, the algorithm, through its iterations, reached the pair
, as predicted by theory, defining the global minimum of the error (Table 5).
5.3. Solving XOR with an Integral Neuron Using the Transfer Function
5.3.1. Theoretical Solution
Let us consider a neuron with the transfer function
, where the weight function is again a first-degree polynomial:
The signal in the body of the neuron is
Given that
, it is clear that the neuron’s output is
We will prove that parameters
exist for the weight function
that enable the neuron to solve the XOR logical function. In practice, we need to show that ∃
such that
The first equality
is satisfied for all
.
The second and third equalities in (101), due to the nature of
, provide the same information:
which implies
or
Finally, from the last equality in (101), we have
i.e.,
Let us consider one of the possible systems:
The system has infinitely many solutions, and one solution for
,
. With these parameters, the neuron has a weight function
and in its body, we obtain the cumulative signal
Along the neuron’s axon (Figure 17), the output signal propagates as
It can easily be seen that, theoretically, this neuron can indeed solve the XOR logical function problem (Table 6).
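The existence of infinitely many solutions can be illustrated with a minimal sketch. It assumes the plain sine as transfer function and the body signal $a u^2/2 + b u$ with $u = x_1 + x_2$; under these assumptions the conditions become $a/2 + b = \pi/2 + 2m\pi$ and $2a + 2b = n\pi$ for integers $m$ and $n$. The helper `family` and its parameterization are hypothetical and derived here, not quoted from the article.

```python
import math

def sine_xor_neuron(a, b, x1, x2):
    """Integral neuron with an assumed sine transfer function."""
    u = x1 + x2
    return math.sin(a * u**2 / 2 + b * u)

def family(n, m=0):
    """One (a, b) pair per integer pair (n, m), solving
    a/2 + b = pi/2 + 2*m*pi and 2a + 2b = n*pi."""
    a = (n - 1 - 4 * m) * math.pi
    b = math.pi / 2 + 2 * m * math.pi - a / 2
    return a, b

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def max_abs_error(a, b):
    return max(abs(sine_xor_neuron(a, b, x1, x2) - y) for (x1, x2), y in XOR.items())

errors = {n: max_abs_error(*family(n)) for n in range(-2, 3)}
for n, err in errors.items():
    a, b = family(n)
    print(f"n={n:2d}  a={a: .4f}  b={b: .4f}  max error={err:.2e}")
```

Each value of `n` yields a different weight function that reproduces the XOR targets up to rounding, which is consistent with the expectation that random initializations lead to different final parameters.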
5.3.2. Experimental Confirmation
The empirical confirmation of the existence and trainability of this type of neuron was again carried out in the MATLAB v2018a environment. The transfer function
, like those previously used, is continuous and smooth, fulfilling the conditions for the applicability of the Levenberg–Marquardt method according to Theorem 3. The initial conditions (initial parameters of the weight function) are once again generated as random numbers within the segment [−1,1]. Given this random initialization, it is expected to obtain various final results for the optimal parameters of the weight function. In the worst observed case, the neuron successfully handles the XOR function with an error on the order of
. In many instances, the algorithm reaches the global minimum, discovering the weight function for a neuron that operates with zero error.
Table 7 presents an example where the iterative algorithm completes in just six epochs, resulting in a final zero total error.
With initial, random parameters of the weight function
, before training, the integral neuron produced an error of
when solving XOR. After the final iteration of the training algorithm, it adjusted the parameters to
, yielding an exact result for XOR (Table 8), corresponding to the theoretically determined model
.
6. Conclusions
From the very inception of the classical neuron, the idea of forming a total summed signal in its body has been adopted:
where
and
represent the inputs to the neuron and their respective weights
. The neuron’s output then takes the value
where
is its transfer function.
The linear combination of input signals and weights is the simplest and most intuitive way to model the interaction between inputs and the neuron. It is straightforward for mathematical representation and processing and allows for the use of established methods in linear algebra and optimization, facilitating the training of neurons and neural networks, especially in their early forms. The first stage where nonlinearity can emerge in neuron operation is with the application of the transfer function, which transforms the linear form (112) into the nonlinear signal (113). However, the linear sum alone, without an appropriate transfer function, cannot capture complex, nonlinear dependencies among the input signals. Additionally, it is sensitive to the scale of the input signals. If the inputs vary in scale, the sum can be dominated by larger values, potentially making the neuron less effective at processing information. This often necessitates pre-normalization of the inputs.
In this study, we set out to explore the possibility of changing the form of the signal (112). It turns out that if we normalize the input–output samples
within a sufficiently small interval, where
for
, we can consider a dynamic neural structure with a signal in its body as follows:
where the function
can define each weight of the neuron as
.
An even more interesting case emerged when the requirement for data proximity
was not met. Under such conditions, we can consider new and unconventional artificial intelligence cell structures. Here, assuming that inputs are applied with constant weights
, an integral signal is formed:
where
is a continuous and smooth function (the neuron’s weight function). As a result, the axon of this type of neuron yields the following values:
Thus, given the training set
the mean squared error is
In the structure of integral neurons, traditional weights are absent, but in their place, we have the weight function , which not only replaces weights but also imparts a nonlinear character to the signal. With the introduction of the weight function, the optimization algorithms for the neuron are no longer aimed at finding a set of static weights but at discovering an appropriate form of (its parameters) that minimizes the error. It turns out that gradient descent, Gauss–Newton, and the combined Levenberg–Marquardt optimization algorithms, when applied in this context, are convergent. During neuron training, starting from random parameters of the weight function, they are able to optimize it in a way that indeed leads to error minimization.
Are integral neurons (Figure 18) part of artificial intelligence? Absolutely. Like all artificial intelligence systems, integral neurons are initially unrefined and relatively ineffective in relation to the specific problem they are directed toward. However, during the training process, through the use of training samples, the parameters of their weight functions are adjusted in a way that drives the error toward its local or global minimum.
In the experimental part of this study, this type of integral neuron was used to solve the XOR logical function. It turns out that integral neural structures can be trained to solve this function—something that is impossible for standard neurons due to the linear nature of the sum in their bodies, which serves as the equation for the separating line. Theoretical solutions and experimental confirmations are shown for solving XOR using integral neurons with three different transfer functions. In each of the experiments, the weight functions are initialized with random parameters. Initially, the neurons have a high error rate, but after training with XOR samples, their weight functions are adjusted in a way that produces solutions with zero errors and parameters predicted by theory.
Integral neurons can also be used in more complex real-world situations, but the following limitations should be taken into account:
Training Complexity: Training an integral neuron, especially with higher smoothness classes of the weight function, requires precise calculation of the gradient, the Jacobian, and, in some cases, the Hessian matrix. This can lead to significant computational challenges in real-world applications.
Algorithm Convergence: As noted in the article, algorithms such as Gauss–Newton or Levenberg–Marquardt require appropriate initial parameters to ensure convergence to a global minimum. Otherwise, there is a risk of falling into a local minimum.
Normalization Issues: To use the integral equivalent of a standard neuron, the following condition must be satisfied for the input data:
This means that the input data must be carefully normalized to ensure the correctness of the integral representation and the execution of the algorithms. Of course, this is only necessary if we aim to replicate the behavior of standard neurons. In cases where we abandon requirement (1), we are dealing with an entirely new artificial neural structure for which such normalization is not necessary.
A subject of future research is the comparison of the integral neuron with other types of neurons, e.g., rule-based neural networks where each neuron is a fuzzy inference system [68], as well as their capabilities to solve various regression and classification tasks.
Finally, it should be noted that in this study, integral neurons were equipped with polynomial weight functions. However, these could be functions of various types, as long as they meet the requirements for continuity and differentiability imposed by the convergence theorems of the training algorithms. Future goals in this area include exploring different types of weight functions and investigating the possibilities for building and using integral neural networks, as well as hybrid neural networks that incorporate both classical and integral neurons.