1. Introduction
In the past few decades, gradient-based learning methods, such as the backpropagation (BP) algorithm, which uses backward propagation of the error to adjust the network weights, have been widely used to train neural networks. However, with an improperly chosen learning step size, convergence is very slow and the algorithm easily falls into local minima, so many iterations are often needed to reach satisfactory accuracy. These problems have become the main bottleneck restricting the application of such methods. Therefore, improving the learning ability and generalization performance of neural network models is a challenging task. One solution is to find an appropriate architecture for the neural network model. Artificial neural networks have two important hyper-parameters that control the scale of the network: the number of layers and the number of nodes in each hidden layer. The values of these parameters must be determined when configuring the network, yet there is no general rule for determining the scale of the network. In regression prediction applications, most researchers set up a series of models of different scales during the training process and then select the best one according to the test results [
1,
2,
3]. However, this kind of method increases the training cost and time. Thus, how to determine the ideal number of hidden layer nodes before network training is an urgent problem to be solved [
4,
5]. In order to solve this problem, a newly developed randomized learner model, termed stochastic configuration networks (SCNs), was proposed by Wang et al. [
6]. As a single-layer feedforward neural network, the SCN belongs to the class of random neural networks. Although the parameters of the SCN are also randomly generated, it differs from existing randomized learning algorithms for single-layer feed-forward neural networks (SLFNNs): the SCN randomly assigns the input weights and biases of the hidden nodes under a supervisory mechanism, and the output weights are evaluated analytically in either a constructive or selective manner. This random learning model also differs from the classical random vector functional link (RVFL) network: the SCN restricts the assignment of the input weights and biases by introducing inequality constraints. Under the supervisory mechanism, the SCN has the universal approximation property as the number of hidden nodes increases [
7,
8,
9,
10]. Instead of training a model with a fixed architecture, the construction process of the SCN starts with a small-sized network and adds hidden nodes incrementally until an acceptable tolerance is achieved, solving a global least squares problem with the current learner model to find the output weights. Its advantages are the minimization of a convex cost that avoids local minima, good generalization performance, and a notable representation ability [
6]. Compared with deep neural networks, the SCN has lower training complexity and faster learning speed.
Nowadays, random neural networks have shown impressive performance in the fields of deep learning and cognitive science. Compared with implementations on general-purpose computer platforms, implementing random neural networks on reconfigurable digital platforms, such as field-programmable gate arrays (FPGAs), offers unique advantages. First, in the neural domain, where parallelism and distributed computing are inherently involved, FPGAs provide very high computing power and speed [
11]. Second, with the miniaturization of component manufacturing technology [
12], neural networks are becoming more and more widespread in embedded applications. Third, compared to computers, hardware systems can reduce costs by decreasing power requirements and lowering the number of components [
13]. Fourth, parallel and distributed architectures have a high fault tolerance rate for system components [
14], and provide support for applications that require security. Today, a large number of mobile devices are connected to the Internet, and cloud computing data centers are under excessive load. Implementing neural networks in software requires substantial computing resources, so, to reduce the Internet load, the collaborative use of edge and cloud computing is particularly important. FPGA-based random neural networks fit the edge-computing perspective, moving part of the computation to the data collection sources and thereby reducing the network load [
15]. With the surge in data volume and the constant demand for computing power, the original computing framework consisting solely of CPUs can no longer meet the real-time requirements of edge-computing systems. One of the greatest values of FPGAs is that they are reconfigurable, so designs can be updated at any time, even after the hardware has been deployed in the field. By virtue of this advantage and their high efficiency, FPGAs are widely used in many edge-computing scenarios [
16]. At the same time, with the development of the Internet of Things (IoT), hubs must support a large number of ultra-low-power network protocols and various application workloads, and be responsible for authentication, encryption, and security. Such a changing and uncertain environment is difficult for an ASIC or SoC to handle, but is easy to accommodate on an FPGA, with its very high running speed and computational efficiency. Therefore, implementing neural networks on FPGAs has very good development prospects and application value.
Researchers have done a great deal of work on the hardware implementation of random neural networks and have made many achievements. In 2012, Decherchi et al. implemented the classification prediction model of the extreme learning machine (ELM) [
17,
18,
19] on the FPGA and achieved high-precision prediction results [
20]. In 2018, Ragusa et al. improved the hardware implementation model of the ELM classifier for resource-constrained devices, effectively balancing accuracy and network complexity and reducing resource utilization [
21]. In 2019, Safaei et al. proposed a specialized system on chip (SoC) hardware implementation and design approach for embedded online sequential ELM (OS-ELM) classification, which has been optimized for efficiency in real-time applications [
22].
Inspired by the above papers, this paper designs and implements the SCN regression prediction model on the FPGA. The main contributions of this paper are listed below. (1) The SCN model, which exhibits good performance in learning and generalization, is investigated for regression prediction; this is the first time the SCN model has been implemented on an FPGA. (2) A new nonlinear activation function is proposed to optimize the FPGA implementation of the SCN model; unlike previous ones, this new activation function accounts for both prediction accuracy and hardware resource utilization. (3) Experimental results on simulation and real data sets indicate that the proposed FPGA framework successfully implements the SCN regression prediction model. (4) The prediction performance of the proposed FPGA implementation of the SCN model is significantly improved compared with similar case studies of other implementations in the literature [
20].
The rest of this paper is organized as follows.
Section 2 describes the specifics of the SCN.
Section 3 proposes the hardware architecture of the SCN.
Section 4 proposes methods for improving and optimizing the performance of FPGA models.
Section 5 verifies the designed SCN hardware prediction model on the simulation data set and the real industrial data set. Finally, the conclusion is drawn in
Section 6.
2. Stochastic Configuration Networks
For a target function $f : \mathbb{R}^d \to \mathbb{R}^m$, suppose that an SCN model has already been built with $L-1$ hidden nodes, i.e., $f_{L-1}(x) = \sum_{j=1}^{L-1} \beta_j g_j(w_j^{\mathrm{T}} x + b_j)$ ($L = 1, 2, \ldots$, $f_0 = 0$), where $\beta_j = [\beta_{j,1}, \beta_{j,2}, \ldots, \beta_{j,m}]^{\mathrm{T}}$, and $g_j$ is an activation function of the $j$th hidden node with random input weights $w_j$ and bias $b_j$. $e_{L-1} = f - f_{L-1} = [e_{L-1,1}, e_{L-1,2}, \ldots, e_{L-1,m}]$ denotes the residual error, where $e_{L-1,q} = f_q - f_{L-1,q}$, $q = 1, 2, \ldots, m$.
Given a training data set with $N$ sample pairs $\{(x_i, t_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ and $t_i \in \mathbb{R}^m$, let $X = [x_1, x_2, \ldots, x_N]$ and $T = [t_1, t_2, \ldots, t_N]$ represent the input and output data matrices; let $E_{L-1} = [e_{L-1,1}, e_{L-1,2}, \ldots, e_{L-1,m}]$ be the residual error matrix, where each column $e_{L-1,q} \in \mathbb{R}^N$, $q = 1, 2, \ldots, m$. Denote the output vector of the $L$th hidden node $g_L$ for the input $X$ by

$$h_L = [g_L(w_L^{\mathrm{T}} x_1 + b_L), g_L(w_L^{\mathrm{T}} x_2 + b_L), \ldots, g_L(w_L^{\mathrm{T}} x_N + b_L)]^{\mathrm{T}}.$$

Thus, the hidden layer output matrix of $f_L$ can be expressed as $H_L = [h_1, h_2, \ldots, h_L]$. Denote by

$$\xi_{L,q} = \frac{\langle e_{L-1,q}, h_L \rangle^2}{h_L^{\mathrm{T}} h_L} - (1 - r - \mu_L)\, \|e_{L-1,q}\|^2, \quad q = 1, 2, \ldots, m,$$

where $0 < r < 1$ and $\{\mu_L\}$ is a nonnegative real number sequence with $\lim_{L \to \infty} \mu_L = 0$ subject to $\mu_L \le 1 - r$. The SCN algorithm first generates a large pool of candidate nodes, namely pairs $(w_L, b_L)$, in varying intervals. It then keeps those candidate nodes for which the minimal value of the set $\{\xi_{L,1}, \xi_{L,2}, \ldots, \xi_{L,m}\}$ is positive, and the candidate node with the largest value of $\xi_L = \sum_{q=1}^{m} \xi_{L,q}$ is assigned as the $L$th hidden node for $f_L$. Thus, the output weight matrix of the SCN model, $\beta = [\beta_1, \beta_2, \ldots, \beta_L]^{\mathrm{T}}$, can be computed by the standard least squares method, that is,

$$\beta = \underset{\beta}{\arg\min} \; \|H_L \beta - T\|_F^2 = H_L^{\dagger} T,$$

where $H_L^{\dagger}$ is the Moore-Penrose generalized inverse of the matrix $H_L$, and $\|\cdot\|_F$ represents the Frobenius norm [23,24,25].
The construction process of the SCN starts with a small-sized network and incrementally adds hidden nodes, computing the output weights after each addition. This process continues until the model meets some termination criterion. The supervisory mechanism of the SCN guarantees the universal approximation property.
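As a concrete illustration, the incremental construction described in this section can be sketched in NumPy. The helper name `train_scn`, the candidate-pool size, the weight sampling range, and the particular choice of the sequence $\mu_L$ are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def train_scn(X, T, L_max=50, tol=1e-2, pool_size=100, r=0.99, scale=5.0, seed=0):
    """Minimal SCN construction sketch. X: (N, d) inputs, T: (N, m) targets.
    Returns the hidden-layer output matrix H_L and output weights beta."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    H = np.empty((N, 0))                 # hidden-layer output matrix H_L
    beta = np.zeros((0, T.shape[1]))
    E = T.copy()                         # residual error matrix, E_0 = T
    for L in range(1, L_max + 1):
        mu = (1 - r) / (L + 1)           # nonnegative mu_L -> 0, mu_L <= 1 - r
        best_h, best_xi = None, -np.inf
        # generate a pool of candidate nodes (w, b) and test the inequality
        for _ in range(pool_size):
            w = rng.uniform(-scale, scale, size=d)
            b = rng.uniform(-scale, scale)
            h = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # sigmoid output h_L
            xi = (E.T @ h) ** 2 / (h @ h) - (1 - r - mu) * np.sum(E ** 2, axis=0)
            if xi.min() > 0 and xi.sum() > best_xi:      # supervisory mechanism
                best_h, best_xi = h, xi.sum()
        if best_h is None:
            break                        # no admissible candidate found
        H = np.column_stack([H, best_h])
        beta = np.linalg.pinv(H) @ T     # global least squares: beta = H^† T
        E = T - H @ beta
        if np.sqrt(np.mean(E ** 2)) < tol:
            break
    return H, beta
```

The global least squares step recomputes all output weights each time a node is added, matching the constructive variant described above.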
3. FPGA-Based Implementation of the SCN
The implementation of the SCN on the FPGA needs to balance accuracy, speed, and resource utilization. The proposed architecture should make full use of the advantages of FPGA parallel processing. Because the SCN gradually adds hidden nodes to find the optimal solution, the flexibility of the model must be fully considered in the design, so that the architecture can be adapted to different problems.
Figure 1 shows the overall architecture of the SCN inference prediction model based on the FPGA. The whole architecture includes three parts: the first part uses a parallel processing structure, while the second and third parts adopt a pipeline structure. The first part, Input Part, stores the feature vector $x$, the weights $w$ connecting the input layer to the hidden nodes, and the bias term $b$. The feature vector $x$ adopts a signed, two's-complement fixed-point representation. The binary word length of each feature value is $n_I + n_F + 1$, where $n_I$ is the number of bits representing the integer part, $n_F$ is the number of bits representing the fractional part, and the remaining bit represents the sign. Negative numbers are coded by inverting their absolute value and adding 1. To facilitate the calculation on the FPGA, the weight value $w$ is specially processed and can be expressed as

$$w = s \cdot 2^{-p},$$

where $s \in \{-1, +1\}$ and $p$ are random quantities (in the program, $p$ can take the following values: 1, 2, 3). If $x$ is extended to $\tilde{x} = [x^{\mathrm{T}}, 1]^{\mathrm{T}}$ and $w$ to $\tilde{w} = [w^{\mathrm{T}}, b]^{\mathrm{T}}$, the bias term $b$ can be embedded in the weight vector. The second part, Neuron Part, receives the results of the parallel processing from the first part and calculates the output of the activation function. The third part, Output Part, receives the output of the second part and calculates the output neurons through serial processing. A finite-state machine controls the entire process, ensuring that the calculations of the third part are always one clock cycle ahead of the calculations of the first and second parts.
The specific design of each part is as follows:
- (1)
Input Part: The Input module stores all extended feature vectors $\tilde{x}$, and the Shifter modules store the absolute values of the extended random weights $\tilde{w}$. Owing to the special form of $w$, the FPGA can compute the products through parallel shift operations and feed the results into the Mux module. The Mux module outputs the calculation results in turn according to the finite-state machine, outputting $d+1$ items each time.
- (2)
Neuron Part: First, the Inverting module receives the output of the first part and, according to the signs of the random weights, applies a bitwise NOT to each result item whose corresponding random weight is negative. Then, the output result and the corresponding item from the Ones module are input to the Sum module for summation, where the Ones module compensates for the difference of "1" between the calculated result and the true value caused by the bitwise NOT. Finally, the result of the Sum module is activated by the sigmoid function of the activation module to obtain the output $h$ of the hidden layer. The activation module is the hardware implementation of the activation function, which is an extremely critical step. The implementation and optimization of the activation function on the FPGA are specifically introduced in
Section 4.1.
- (3)
Output Part: The Mac module multiplies the output of the hidden layer by the weights from the hidden layer to the output layer under the control of the state machine. The calculation results are summed by an accumulator to obtain the output of the neural network.
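The shift-and-invert datapath of the Input and Neuron Parts can be mimicked bit-exactly in software. The 16-bit word width and the helper name `shift_weight_mul` are assumptions for illustration:

```python
WIDTH = 16                     # datapath word width (an assumption)
MASK = (1 << WIDTH) - 1

def shift_weight_mul(x_code, p, sign):
    """Multiply a two's-complement code by w = sign * 2**(-p) the way the
    Shifter/Inverting path does: an arithmetic right shift by p, then a
    bitwise NOT plus the '1' supplied by the Ones module when the weight
    is negative, so the result is again a valid two's-complement code."""
    if x_code >= 1 << (WIDTH - 1):                     # negative input code
        shifted = ((x_code - (1 << WIDTH)) >> p) & MASK
    else:
        shifted = (x_code >> p) & MASK
    if sign < 0:
        shifted = (((~shifted) & MASK) + 1) & MASK     # NOT, then +1
    return shifted
```

For example, with 12 fractional bits the code 2048 represents 0.5; shifting by p = 1 with a negative sign yields the two's-complement code for −0.25.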
5. Experimental Results
The regression prediction model based on the FPGA is tested on a simulation data set and a real industrial data set. The simulation data set consists of 1200 patterns (in a one-dimensional feature space), of which the training set contains 600 patterns and the test set contains 600 patterns. The real hot-rolled strip crown data set was collected from a 1780 mm hot strip production line of a company in Hebei Province, China. In the hot-rolling process, crown is defined as the difference in thickness between the center of the strip and a point 40 mm from its edge. For strip products, a smaller crown is required, which saves material and reduces costs; therefore, control of strip crown is a high priority in the hot-rolling production process. The crown of the strip is determined by the three-dimensional deformation in the finishing mill and can be regarded as a reflection of the cross-sectional shape of the roll gap at the outlet of the finishing mill. Thus, all factors that affect the cross-sectional shape of the roll gap at the outlet of the deformation zone affect the crown value of the strip. In this paper, nine important attributes of hot-rolled strip production are selected as input variables: cooling water flow of the rolling mill (%), entrance temperature (°C), exit temperature (°C), strip width (m), entrance thickness (mm), exit thickness (mm), bending force (kN), rolling force (kN), and entry profile (μm). The real hot-rolled strip crown data set contains 474 patterns (in a nine-dimensional feature space), of which the training set contains 380 patterns and the test set contains 94 patterns. All data are normalized within the range of [−1, 1]. The experimental test realizes the regression prediction of the SCN based on the FPGA. In [
20], the ELM model implemented on the FPGA is applied to classification problems. We modified the FPGA-based ELM model in [
20] and adapted it to the regression problem. We give the experimental results of the FPGA-based ELM model on the simulation data set and the real data set and compare them with those of the proposed FPGA-based SCN model to demonstrate the superiority of the SCN. The FPGA used in the experiments is Xilinx's XC7Z020CLG400-2.
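The [−1, 1] scaling applied to both data sets is ordinary per-column min-max normalization; the helper below is a hypothetical sketch of such preprocessing, not the authors' code:

```python
import numpy as np

def normalize(data, lo=None, hi=None):
    """Per-column min-max scaling into [-1, 1]. lo/hi can be passed
    explicitly so that the test set reuses the training-set ranges
    (a common convention; the exact procedure is an assumption)."""
    lo = data.min(axis=0) if lo is None else lo
    hi = data.max(axis=0) if hi is None else hi
    return 2.0 * (data - lo) / (hi - lo) - 1.0, lo, hi
```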
Figure 3 shows the functional simulation results of the input and output signals of the modules in the FPGA architecture, including: clock signal (net_clk), reset signal (net_reset), pattern selection signal (input_select), an output signal of the
Input module (Input_out0—an eigenvalue), an output of the
Shifter module (Shifter1_out0), an output of the
Mux module (Mux_out0), an output of the
Inverting module (Inverting_out0), an output of the
Sum module (Sum_out0), an output of the
activation module (activation_out0), output enable signal (out0_ready) of the
Mac module, which indicates that the result has been calculated, and the final output of the system (out0) from the
Mac module. Taking the signals in
Figure 3 as an example, they represent the calculation process of one pattern's data in the FPGA, where the value "10'h002" of the Signal
input_select represents the calculation of the second pattern currently being performed. The calculation process in the FPGA given in
Figure 3 has 18 clock cycles of the Signal
net_clk, corresponding to the 18 hidden nodes in the SCN model. The Signal
out_ready in the figure is the flag signal for calculating the data of each pattern. The Signal
out_ready is set low at the rising edge of the Signal
net_reset, indicating that the calculation of this pattern's data has started; the low state is maintained throughout the calculation. Then, after the calculation is completed (that is, after the 18 clock cycles of the Signal
net_clk), the Signal
out_ready is set high, which indicates that the pattern data calculation process ends. After the rising edge of the Signal
out_ready appears, the value "16'h066B" of the Signal
out0 is the prediction result of the second pattern. The specific implementation process of the SCN on the FPGA is as follows: After being triggered by the reset signal (
net_reset), the
Shifter module acquires an input vector $\tilde{x}$ from the
Input module. At the rising edge of the next clock, the
Mux module receives the shift results from the
Shifter module in parallel. The
Inverting module processes the data and outputs it according to the signs of the input weights. After another rising edge of the clock, the
Sum module adds up the data and activates the data through the
activation module. The
Mac module accumulates the activated data after another rising edge of the clock. When all data accumulation is completed, the Signal out0_ready generates a rising edge, and then the final output of the system can be read out from the signal out0.
Figure 4 and
Figure 5 show the results of the FPGA regression prediction models on the simulation data set and the real data set. On the simulation data set, the numbers of hidden layer nodes of the SCN and the ELM were 18 and 35, respectively, while on the real data set they were 42 and 55, respectively. As can be seen from
Figure 4, for the simulation data set, the SCN model based on Equation (7) and the ELM model can only predict the general trend of the real value, while the SCN model based on Equation (6) cannot predict the trend of the real value, and only the SCN model based on Equation (9) can predict the real value well. As can also be seen from
Figure 5, for the real data set, the SCN model based on Equation (9) has the best prediction result, with the predicted value almost completely coinciding with the real value, while the prediction results of the other models are relatively poor. Therefore, both the simulation data set and the real data set prove that the FPGA implementation of the SCN model based on Equation (9) proposed in this paper has the best prediction performance.
In order to quantitatively analyze the implementation effect of the optimized SCN on the FPGA,
Table 4 shows the average values of 30 groups of root-mean-square errors (RMSE) of the implementation results of the FPGA-based SCNs and ELM and the computer-based SCN and ELM on the two data sets. Considering that the FPGA is limited by hardware resources and calculation accuracy, and that its computing capability is relatively weak compared with that of a computer,
Table 4 takes the prediction results of computer as benchmark, and compares the implementation accuracy of the proposed model on FPGA with that of a computer. As can be seen from
Table 4, on the simulation data set, the implementation result of the SCN on the computer is worse than that of the ELM, but on the real data set, the implementation result of the SCN on the computer is better than that of the ELM; therefore, the advantages of the computer-based SCN are not obvious. Compared with the implementation of the SCN on the computer, the prediction accuracy of the SCN implemented on the FPGA with Equations (6) and (7) is greatly reduced. Using the sigmoid function proposed in Equation (9), the prediction accuracy of the SCN implemented on the FPGA is clearly the best, completely surpassing the ELM, especially on the real data set, and is almost the same as the implementation accuracy of the SCN on the computer. These results fully demonstrate the strong generalization ability and high prediction accuracy of the FPGA-based SCN proposed in this paper.
Table 5 and
Table 6 show the resource utilization and power consumption of SCNs and ELM implemented on the FPGA. The analysis of
Table 5 and
Table 6 shows that, on both the simulation data set and the real data set, the resource utilization and power consumption of the SCN implemented on the FPGA are lower than those of the ELM. However, the resource utilization and power consumption of SCNs based on different equations are basically the same. In order to analyze the running speed of the FPGA-based SCN model,
Table 7 shows the actual clock frequencies of the SCN and the ELM implemented on the two data sets for comparison. As can be seen from
Table 7, the experiments on both data sets prove that the SCN implemented on the FPGA runs faster than the ELM. The fundamental reason is that the SCN needs fewer hidden layer nodes to reach the optimal solution; therefore, the FPGA-based implementation of the SCN has lower resource utilization, lower power consumption, and a faster operation speed than the ELM.