1. Introduction
In field-programmable gate array (FPGA) applications, reduction circuits are used to reduce vectors to scalars. In recent years, advances in integrated circuit technology have brought significant improvements to FPGA performance. Coupled with their inherently high hardware parallelism, FPGAs are used as hardware accelerators in a growing number of fields, such as signal processing [1,2], scientific computing [3,4,5], machine learning [6,7,8], and data centers [9,10,11]. In these applications, many algorithms involve a large number of vector and matrix operations, which belong to the category of reduction problems. Therefore, a high-performance reduction circuit is key to matrix operations.
The combination of a high-performance reduction circuit and a small binary tree can effectively resolve the trade-off between area and performance. The computing task in Figure 1a contains m data sets, each of which contains a number of elements, and it can be executed by a variety of hardware solutions. Figure 1b shows a reduction circuit as the solution: the data are input serially, one element per cycle. Its advantage is its small area, since usually only one reduction operator is needed. Figure 1c shows the binary-tree solution: the data are input in parallel, and all elements of a data set are input at once. Compared with the reduction circuit, the binary tree offers higher performance; however, it requires more operators and therefore consumes much more area. A compromise is to combine the reduction circuit with a small binary tree, as shown in Figure 1d. This not only keeps the area within a reasonable range but also achieves higher performance. Therefore, a high-performance, small-area reduction circuit is required.
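As a concrete illustration of the parallel front-end in Figure 1d, the following is a minimal Verilog sketch of a small combinational adder tree. It assumes a 4-wide input slice and plain integer addition; the designs evaluated later in this paper use a pipelined double-precision floating-point adder IP instead, so this sketch is illustrative only.

```verilog
// Minimal sketch of a 4-input binary adder tree (Figure 1d front-end).
// Each cycle it collapses one 4-element slice into a single partial sum,
// which a downstream serial reduction circuit can then accumulate.
// Integer addition is used here only for illustration.
module adder_tree4 #(parameter W = 32) (
    input  wire [W-1:0] x0, x1, x2, x3,
    output wire [W-1:0] sum
);
    wire [W-1:0] s01 = x0 + x1;   // tree level 1
    wire [W-1:0] s23 = x2 + x3;   // tree level 1
    assign sum = s01 + s23;       // tree level 2
endmodule
```

Feeding one tree output per cycle into the serial reduction circuit keeps the operator count small while raising the input bandwidth.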
The design of the reduction circuit faces several challenges. The first is the number of reduction operators. Reduction operators are the primary source of area in a reduction circuit, so their number should be reduced as much as possible. The second is data hazards. When a multi-stage pipelined reduction operator is used, some partial results cannot be consumed immediately; they must be stored in a buffer while waiting for the other operand to arrive. To reduce the size of the buffer, the partial results of different data sets are stored in the same buffer. New data can be written into a buffer unit only after its current contents have been consumed; otherwise, write-after-read hazards occur. Therefore, the data in the buffer must be properly managed. Third, the circuit needs to be able to process any number of data sets, each containing any number of elements; a reduction circuit that can only handle a fixed amount of data is greatly limited in its applications. Finally, the reduction method must satisfy requirements on both performance and complexity.
In this paper, a novel reduction circuit based on binary tree path partition (BTPP) is proposed. It can handle multiple data sets, each with any number of elements. The proposed method is implemented on FPGAs and achieves the smallest area-time product among the compared designs.
The remainder of this paper is organized as follows. Section 2 describes the background of the reduction circuit, and Section 3 reviews previous work. Section 4 details the principle and working mechanism of BTPP. After that, the hardware design of BTPP is discussed in Section 5, and the evaluation and comparison results are given in Section 6. Finally, Section 7 concludes the paper.
2. Background
The function of the reduction circuit is to reduce a vector to a scalar using an associative and commutative reduction operator [12]. For a data set $X = \{x_0, x_1, \ldots, x_{n-1}\}$, the reduction circuit performs
$$y = x_0 ⊛ x_1 ⊛ \cdots ⊛ x_{n-1},$$
where ⊛ is the reduction operator. The reduction operator is a binary operator, such as an adder, a multiplier, or a max/min function. If the reduction operator is an adder, then $y = \sum_{i=0}^{n-1} x_i$; if it is a multiplier, the output of the reduction circuit is $y = \prod_{i=0}^{n-1} x_i$. The reduction circuit depends only on the number of pipeline stages and the bit width of the reduction operator, not on its internal hardware structure. This means that an existing IP core can be used as the reduction operator, which greatly reduces the design difficulty and shortens the design cycle. It also makes it easy to change the function of the reduction circuit. For example, suppose the reduction operator of a reduction circuit is an adder; replacing the adder with a multiplier that has the same number of pipeline stages and the same bit width allows the circuit to work correctly without changing the control circuit. In this paper, an adder is used as the reduction operator when analyzing the reduction circuit.
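To make this interface assumption concrete, the following is a hedged Verilog sketch of a generic p-stage pipelined reduction operator. It is a hypothetical stand-in for a vendor IP core, using integer arithmetic for illustration; the point is that the surrounding reduction circuit only sees the latency P and width W, so an adder and a multiplier with matching P and W are interchangeable.

```verilog
// Hypothetical stand-in for an operator IP: a P-stage pipelined binary
// operator of width W. The reduction circuit relies only on P and W, so
// swapping the addition for a multiplication (USE_MUL = 1) leaves the
// control logic untouched. Integer arithmetic is used purely as an example.
module reduce_op #(parameter W = 32, parameter P = 4, parameter USE_MUL = 0) (
    input  wire         clk,
    input  wire [W-1:0] a, b,
    output wire [W-1:0] y
);
    reg [W-1:0] stage [0:P-1];
    integer i;
    always @(posedge clk) begin
        stage[0] <= USE_MUL ? (a * b) : (a + b);  // first pipeline stage
        for (i = 1; i < P; i = i + 1)
            stage[i] <= stage[i-1];               // remaining stages
    end
    assign y = stage[P-1];
endmodule
```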
The simplest reduction circuit can be realized by feeding the output of the reduction operator back to its input, as shown in Figure 2a. It requires only one reduction operator and can handle any amount of data. However, increasing the frequency of the reduction operator does not increase the throughput. The reduction operator in Figure 2b is combinational logic, and reducing a data set of $n$ elements requires $n-1$ clock cycles. Figure 2c uses a reduction operator with a $p$-stage pipeline ($p > 1$); because each addition in the feedback loop must wait $p$ cycles for the previous result, reducing the same data set requires $(n-1)p$ clock cycles.
There is a feedback loop in Figure 2a. Assume that the reduction operator is combinational logic and that its critical path delay is D. Since there is one register in the loop, its frequency is $f_1 = 1/T_1$, where the cycle time $T_1$ must satisfy $T_1 \ge D$. For a $p$-stage pipelined adder there are $p$ registers in the loop, so the frequency becomes $f_p = 1/T_p$ with $T_p \ge D/p$, where $T_p$ has the same properties as $T_1$. The frequency of the latter is therefore about $p$ times higher. However, the calculation time required in Figure 2b is $(n-1)T_1 \approx (n-1)D$, and the calculation time required in Figure 2c is $(n-1)p \cdot T_p \approx (n-1)D$. This means that, when handling the same amount of data, the time required by these two solutions is the same.
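The feedback structure of Figure 2a is simple enough to sketch directly. The following Verilog is a minimal model of it, with illustrative signal names (clr, din, acc) and integer addition rather than the paper's floating-point operator. With a deeply pipelined adder in this loop, a new accumulation could start only every p cycles, which is exactly the throughput problem analyzed above.

```verilog
// Minimal sketch of the feedback reduction circuit of Figure 2a:
// the registered output of a combinational adder feeds back to one
// input, so one element is consumed per clock cycle.
module feedback_reduce #(parameter W = 32) (
    input  wire         clk,
    input  wire         clr,     // asserted with the first element of a data set
    input  wire [W-1:0] din,     // one element per cycle
    output reg  [W-1:0] acc      // running partial result
);
    always @(posedge clk) begin
        if (clr) acc <= din;          // load the first element
        else     acc <= acc + din;    // accumulate the remaining elements
    end
endmodule
```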
Figure 2d shows an example of performing one operation per clock cycle. It can only reduce the input data to p partial results, so the reduction calculation cannot be completed. Therefore, a reduction circuit is needed that completes the calculation correctly while fully exploiting the advantages of the pipelined reduction operator.
There are some differences between the reduction circuit and the accumulator [13,14,15,16], which are listed as follows:
The accumulator has a single function, while the reduction circuit can choose different types of reduction operators to achieve different functions, such as accumulation and max function.
The accumulator needs to implement the addition function inside the circuit, so its circuit structure is complicated, whereas the reduction circuit uses an existing adder or multiplier as the reduction operator and then builds the peripheral circuit around it. The reduction circuit focuses more on data control and management.
The accumulator is sensitive to the data type. When the data type changes, the accumulator may need to modify some of its circuits. For example, if the data type is changed from 16-bit fixed-point to 32-bit single-precision floating-point, the accumulator needs additional circuits for processing the exponent and fraction of the floating-point number. The reduction circuit, by contrast, only needs the reduction operator to be replaced with one that supports single-precision floating-point numbers, and the data width of the control circuit and buffers to be changed to 32 bits.
5. Hardware Design
The hardware structure of BTPP is shown in Figure 7. It is mainly composed of the controller, the path table pipeline, and the data-path. The controller includes five modules, and different signals are marked with lines of different colors. Data are input sequentially through the data input port, and a data-valid signal indicates whether the current input datum is valid. If the last-element signal is valid, the current input datum is the last element of its data set. One controller module generates time information, which indicates that the i-th operation of the k-th iteration is being executed. Another module generates the path table based on the input signals, and the path table is fed into the path table pipeline, which has p stages, the same as the reduction operator. Two further modules generate the buffer read and write signals and the MUX control signals; these signals are sent to the data-path to control data reading, data writing, and the computations. When all the paths in the path table have been executed and the result does not need to be added to the result of the next iteration, the output-valid signal generated by the last module becomes valid, indicating that the current output is the final result of the data set.
The data-path contains a p-stage pipelined adder. The input data are first stored in buffer 1, which delays them by p clock cycles; therefore, the size of buffer 1 is p. During these p clock cycles, a path table is generated. If the input datum is the last element of its data set and is not consumed at the current moment, it is stored in buffer 4; the data in buffer 4 are later sent to the reduction operator as operand 2. If the adder output is not the final result of its set, it is either sent back to the reduction operator or stored in a buffer. Buffer 2 and buffer 3 store operand 1 and operand 2, respectively. Storing the two operands in two different buffers simplifies the control logic and improves the speed of reading and writing. The path tables of the data in buffer 2 are stored in buffer 5, so the two buffers use the same control signals. The path tables of the data in buffer 3 and buffer 4 are stored in buffer 6 and buffer 7, respectively.
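Buffer 1 is essentially a p-cycle delay line for the input data, holding each element while its path table is generated. The following is a minimal Verilog sketch of such a delay buffer under that assumption; the paper's buffers are implemented with dual-port RAM, so this shift-register model is illustrative only.

```verilog
// Minimal model of buffer 1: a P-deep delay line. Each input element
// emerges P cycles later, by which time the controller has generated
// the corresponding path table.
module delay_buffer #(parameter W = 64, parameter P = 4) (
    input  wire         clk,
    input  wire [W-1:0] din,
    output wire [W-1:0] dout
);
    reg [W-1:0] line [0:P-1];
    integer i;
    always @(posedge clk) begin
        line[0] <= din;                   // newest element enters
        for (i = 1; i < P; i = i + 1)
            line[i] <= line[i-1];         // shift toward the output
    end
    assign dout = line[P-1];              // element delayed by P cycles
endmodule
```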
When the reduction operator is not pipelined, BTPP executes serially, and there is no need to use a buffer. When it is pipelined, buffers 2–7 all have the same size, which is determined by p, as is the maximum number of iterations executed simultaneously. As shown in Figure 5c, four iterations are in flight at the same time; after iteration 0 is completed, iteration 4 is executed. Within an iteration, only some of the nodes need to be allocated storage units. Taking Figure 5e as an example, for two of the nodes there are only two possible execution cases, and in both cases there is no need to read operands from the buffer; only the operands of the remaining nodes are stored in the buffer, so only those nodes need to be allocated storage units. If p is an odd number, the number of required units is rounded up accordingly. These observations determine the buffer sizes for a pipelined operator.
BTPP has a simple buffer management mechanism. As described above, when the operator is pipelined there is a bounded number of iterations in flight at any time, and only some of the nodes in each iteration need to be allocated storage units. For these nodes, operand 1 is stored in buffer 4, and operand 2 is stored in buffer 2 or buffer 3. The buffer sizes are chosen so that each such node has a dedicated storage unit. When the execution of a node is completed, its storage unit is released. For example, in Figure 5c, after iteration 0 is executed, the storage units it occupied are released and reused to store the data of iteration 4. Therefore, when a node is executed, the read-control module generates the address of the node's dedicated storage unit. The most significant bit of each word in buffers 5–7 is a valid bit; according to these valid bits, the circuit can determine whether the corresponding data in buffers 2–4 are valid. After reading the data, the circuit writes "0" to the valid bit in buffers 5–7, marking the data as invalid. When writing data, the write-control module generates a write enable signal and an address to write the data into the node's dedicated storage unit.
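The valid-bit scheme can be sketched as a small register-file model. The module below is a hedged illustration: port names and sizes are assumptions, the storage is modeled as a register array rather than the dual-port RAM actually used, and the valid flag is kept in the MSB as described above, set on write and cleared on read.

```verilog
// Illustrative model of a buffer with a per-unit valid bit (as in
// buffers 5-7). A write stores the payload and sets the valid bit; a
// read returns the payload plus its valid bit and then releases the
// unit by clearing the bit. A simultaneous read and write to the same
// unit is not expected in BTPP, since each node owns a dedicated unit.
module valid_buffer #(parameter W = 16, parameter DEPTH = 8, parameter A = 3) (
    input  wire         clk,
    input  wire         we,
    input  wire [A-1:0] waddr,
    input  wire [W-1:0] wdata,
    input  wire         re,
    input  wire [A-1:0] raddr,
    output reg  [W-1:0] rdata,
    output reg          rvalid
);
    reg [W:0] mem [0:DEPTH-1];                       // {valid, payload}
    always @(posedge clk) begin
        if (we) mem[waddr] <= {1'b1, wdata};         // write sets the valid bit
        if (re) begin
            {rvalid, rdata} <= mem[raddr];           // registered read
            mem[raddr] <= {1'b0, mem[raddr][W-1:0]}; // clear valid bit after read
        end
    end
endmodule
```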
Path partition is the core of BTPP. A complicated hardware circuit would introduce a large delay and degrade performance, while storing the configuration codes for every possible path partition in memory would consume a large amount of memory resources. Therefore, path partition must be achieved with simple hardware. In the BTPP hardware, the path table generation module performs the path partition and produces the path tables. It is mainly implemented with the initial path table, the path generation matrix (PGM), and the path table flag (PTF).
Figure 8 shows an example of obtaining the path generation matrix. We rewrite the initial path table of iteration k in Figure 5e into the form shown in Figure 8. An operation can be confirmed only when its second operand arrives, and we use this feature to determine the priority of the nodes in an iteration. When the first datum is input, by judging whether it needs to be added to the result of the previous iteration, it can be determined whether the corresponding node must be passed. In the same way, as the subsequent data are input, it can be determined whether the remaining nodes must be passed. This essentially reorders the nodes in the initial path table. The element "1" indicates that the path passes through the node corresponding to that element, and the element "0" indicates that it does not, which gives the PGM. The first column of the PGM stores the address of the initial path table. For example, the data in the second row of the initial path table are stored in the 0th row of the PGM together with their location address, and the nodes they pass through are marked with "1". It can be seen from the PGM that the nodes passed by each datum have not changed. Since the execution times of nodes differ between iterations, this method reduces the amount of data that must be stored, and the PGM can be reused in every iteration.
Figure 9 gives an example of the path partition of iteration 2 in Figure 6. According to the information in Figure 5e, the data of iteration 2 are input over four consecutive cycles, some belonging to set A and the rest to set B. Path partition is obtained through repeated OR operations on elements of the PGM and the PTF: within an iteration, the elements of the PGM and PTF that belong to the same set are ORed together. All elements of the PTF are zero initially, and the column to update is selected by the current cycle modulo p. As shown in Figure 9a, when the cycle is 8, the selector is 0; only the corresponding columns of the PGM and the initial PTF are ORed, giving PTF(0). As shown in Figure 9b–d, when the cycle is 9–11, the selector is 1, 2, and 3, respectively, and the corresponding columns of the PGM and PTF are ORed to obtain PTF(1), PTF(2), and PTF(3). PTF(3) contains the information after path partition.
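The PTF update is a narrow bitwise operation that maps directly to hardware. The sketch below is an assumed register-level model of it: the PGM is passed in column-major order, a one-hot signal selects which columns are touched in the current cycle, and the selected PGM columns are ORed into the matching PTF columns. Dimensions and port names are illustrative, not taken from the paper.

```verilog
// Hedged sketch of the PTF update used for path partition: each cycle,
// the PGM columns selected by col_en are ORed into the corresponding
// PTF columns. After the last update, the PTF holds the partition result.
module ptf_update #(parameter NODES = 3, parameter COLS = 4) (
    input  wire                  clk,
    input  wire                  init,     // start of an iteration: clear the PTF
    input  wire [COLS-1:0]       col_en,   // one-hot selection of columns to update
    input  wire [NODES*COLS-1:0] pgm,      // path generation matrix, column-major
    output reg  [NODES*COLS-1:0] ptf       // path table flag, column-major
);
    integer c;
    always @(posedge clk) begin
        if (init)
            ptf <= {NODES*COLS{1'b0}};
        else
            for (c = 0; c < COLS; c = c + 1)
                if (col_en[c])
                    ptf[c*NODES +: NODES] <= ptf[c*NODES +: NODES] | pgm[c*NODES +: NODES];
    end
endmodule
```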
The PTF combined with the initial path table yields the partitioned paths. As shown in Figure 9e, the execution times of the nodes in iteration 2 are 9, 11, 16, and 22, from which the initial path table is obtained. In PTF(3), the element "0" means that the path does not pass through the node corresponding to that element, and the element "1" means that it does. For example, the default path of the first datum in the initial path table is {9, 16, 22}. However, after path partition, the first column of PTF(3) is {0, 0, 1}, which means that this datum does not pass through nodes 9 and 16 but only through node 22; therefore, its path is {22}. The paths of the remaining data are obtained in the same way.
A working example of the BTPP hardware is given below. To better understand how the hardware works, we first give the overall situation of the calculation, as shown in Figure 10. Different from Figure 6, buffer 1 delays the input data by p clock cycles, so the execution times of all nodes are delayed by p. In this example, the reduction operator is an adder with a 4-stage pipeline (p = 4), and the tasks to be solved are the reductions of data sets A and B.
Figure 11 shows the working details of the hardware. A path table is generated for each datum leaving buffer 1 and updated for each partial result output by the adder. At cycle 0, the first element of set A is input through the data port, and the data-valid signal is 1, indicating valid data. The remaining elements are then input in sequence. At cycle 4, the last-element signal becomes valid, indicating that the datum input at the current moment is the last element of data set A. After being delayed four clock cycles by buffer 1, the first element appears at the buffer output, and its path table has been generated. In the path table, the "5" and the first "1" indicate that this element is sent to the reduction operator as the first operand at cycle 5; the other entries have similar meanings. The value of TNI is 1, which means that the calculation result of cycle 18 will be sent to the next iteration for calculation. The first element is not consumed at the current moment, but since it will be sent to the reduction operator as the first operand, it does not need to be stored in a buffer. At cycle 5, the element leaving buffer 1 is, according to its path table, sent to the reduction operator as the second operand, and the first operand available at that moment meets its demand. The situation is similar for cycles 6–7. At cycle 8, the element leaving buffer 1 is not consumed at the current moment; it is the second operand of node 22, so it is stored in buffer 4. At cycle 9, the element leaving buffer 1 is sent to the first operand port according to its path table, but the signal on the other operand port does not meet its requirements at this time, so the corresponding valid signal is 0. Meanwhile, the adder outputs its first partial result; from its path table and those of the first two elements, it is easy to see that this partial result is the sum of the first two elements. Since it needs to be consumed as the first operand at cycle 12, it is stored in buffer 2. The situation at cycle 11 is similar, except that the partial result is stored in buffer 3 because it will be consumed as the second operand. At cycle 20, the path table of the adder output is empty except that TNI is 1; from this and the iteration information, the datum is the result of iteration 1, so it must be sent to the last node of iteration 2, which means it will be consumed at cycle 26. The situation is similar for cycle 22. At cycle 26, the path table of the adder output is empty, which means it is the final result of data set A, so the output-valid signal becomes valid.
6. Evaluations and Comparison
We implemented BTPP on the Xilinx XC5VLX110T platform using Verilog HDL because most of the existing work is implemented on the same platform. Our circuit is designed and validated manually. To ensure that the reduction operator used by BTPP is the same as in other works, a double-precision floating-point adder with a 14-stage pipeline, synthesized by Xilinx ISE 10.1, is chosen as the reduction operator. The post-implementation results are shown in Table 1. MFPA, AeMFPA, and A²eMFPA are designs with multiple reduction operators, so they consume more slices than designs with a single reduction operator; the two designs proposed by [21] are exceptions. Although they use a single reduction operator, their frequency and slice usage are related to the number of data sets they can process simultaneously, so as the number of data sets grows, the slice usage increases considerably, which limits their application. The frequency and slice usage of BTPP and SBM are stable: they do not depend on the number of data sets or the number of elements in each data set, and they are the most balanced among all the designs. However, the circuit of SBM is more complicated, mainly in the buffer management unit, which must identify which buffer units hold valid data and also detect elements in the buffer that belong to the same data set. Compared with SBM, the hardware structure of BTPP is simple, without particularly complex logic, while achieving a better clock frequency and lower slice consumption. On the Virtex-5, each slice includes four 6-input LUTs and four registers. BTPP uses 1554 slice registers; 1495 slice LUTs are used as logic, and only three slice LUTs are used as route-thrus. The buffers in BTPP are all implemented with dual-port RAM, so 10 BRAMs are used. In addition, 3 DSP48Es are used in BTPP.
We compared the execution time and area-time product of the designs in Table 1; the results are shown in Figure 12. The number of pipeline stages of the reduction operator is 14. The parameter m is the number of data sets, and n is the number of elements in each data set. The two designs of [21] in Figure 12 are the circuits that support a maximum of 512 data sets. Since the clock frequency of BTPP is lower than those of High Speed, MFPA, and AeMFPA, BTPP does not have an advantage in execution time. However, it consumes fewer slices, so it has the smallest area-time product of all the designs, which makes BTPP highly competitive.
To show that BTPP can be used with different reduction operators, we give pre-synthesis simulation examples of BTPP using an adder and a multiplier, respectively, in Figure 13. Since accumulation and continued multiplication are the most common problems solved by reduction circuits, we chose an adder and a multiplier as the reduction operators. The adder and multiplier are IP cores generated by Xilinx Vivado 2019.1; their data are double-precision floating-point numbers, and the number of pipeline stages is 14. The circuit is implemented in Verilog HDL and simulated with ModelSim called from Vivado. Four data sets are used as the test. It can be seen from the figure that the input signals of Figure 13a,b are the same, and the time at which the output-valid signal becomes valid is also the same. At 256 ns, both circuits output the calculation result of the last data set: the output of Figure 13a is 21 and the output of Figure 13b is 720, the sum and the product of that data set, respectively.
We implemented the two designs of Figure 13 on a Xilinx Artix-7 XC7A100T, and the post-implementation results are shown in Table 2. Since the multiplier uses more DSPs, its frequency is much higher than that of the adder, and it uses fewer slice LUTs and slice registers. The BTPP that uses the adder as the reduction operator and the BTPP that uses the multiplier as the reduction operator are listed in the last two columns of Table 2. The comparison shows that the control circuit of BTPP uses no DSPs but 9.5 BRAMs, and the resources used by the two BTPP control circuits are approximately equal.
We implemented two circuits for matrix-vector multiplication on the Xilinx Artix-7 XC7A100T. The first is composed of a multiplier and an adder, with the accumulation realized by feeding the adder output back to its input. The second is composed of a multiplier and a BTPP that uses an adder as the reduction operator. The adder and multiplier are the same as those in Table 2, with the data type and number of pipeline stages unchanged. The circuits are implemented in Verilog HDL and synthesized, placed, and routed with Xilinx Vivado 2019.1. The post-implementation results are shown in Table 3. The circuit that does not use BTPP consumes fewer slice LUTs and slice registers than the design using BTPP.
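For reference, the first (non-BTPP) circuit can be sketched as a multiplier feeding a feedback accumulator, one matrix row at a time. The Verilog below is an illustrative integer model with assumed port names; with the paper's 14-stage floating-point adder in the feedback loop, each accumulation must wait for the previous result, which is why this version is slow.

```verilog
// Illustrative sketch of the non-BTPP matrix-vector circuit: one product
// per cycle from the multiplier, accumulated by a feedback adder into the
// dot product of the current matrix row. Unpipelined integer operators
// are used here; the real design uses 14-stage floating-point IPs.
module mvm_row_naive #(parameter W = 32) (
    input  wire         clk,
    input  wire         row_start,   // asserted with the first element of a row
    input  wire [W-1:0] a,           // matrix element
    input  wire [W-1:0] x,           // vector element
    output reg  [W-1:0] dot          // running dot product of the row
);
    wire [W-1:0] prod = a * x;
    always @(posedge clk) begin
        if (row_start) dot <= prod;       // start a new row
        else           dot <= dot + prod; // accumulate along the row
    end
endmodule
```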
The two circuits in Table 3 were applied to matrix-vector multiplication, and the results are shown in Figure 14. The matrix in Figure 14 has 15 rows, and the number of columns is between 15 and 30; in other words, the reduction circuit needs to handle 15 data sets, each containing between 15 and 30 elements. The design without BTPP requires far more execution time, because the deep pipelining of the adder does not translate into higher throughput. BTPP takes full advantage of the deep pipelining of the adder and greatly reduces the execution time, achieving a considerable speedup over the design without BTPP. BTPP trades a small amount of additional area for a large improvement in performance, which fully demonstrates its advantages.