1. Introduction
At present, CNNs are widely used in image classification [1], target recognition [2,3], and semantic segmentation [4,5]. A CNN is essentially composed of convolution layers, ReLU (Rectified Linear Unit) layers, pooling layers, and fully connected layers. In a CNN model, the convolution layers account for more than 85% of the total computation [6], which imposes a huge workload. In addition, as CNNs grow deeper, the time and space complexity of CNNs realized in software keeps increasing, so software-based schemes cannot meet current high-speed application requirements. To solve this problem, accelerator designs based on various platforms have been proposed, using graphics processing units (GPUs) [7], customized application-specific integrated circuits (ASICs) [8,9,10,11], field-programmable gate arrays (FPGAs), and other hardware to accelerate CNNs. However, due to problems of power consumption, development cost, and development cycles, the research and development of GPU- and ASIC-based accelerators are largely limited.
Recently, many CNN accelerators on FPGA have been proposed [12,13,14]. These designs focus only on improving performance and ignore the overhead of energy and storage. However, the power consumption directly restricts their use on embedded platforms or mobile devices, and the storage overhead increases the hardware requirements and thereby the cost. These two aspects bring great challenges to the adoption of such accelerators.
In this paper, we propose a CNN accelerator design using an MTCA [15] decomposition algorithm. The convolution is divided into blocks and expanded by the algorithm, so as to realize highly parallel computing on each matrix multiplication array (MMA) in the accelerator. We implement it on an FPGA platform with a systolic array architecture [16]. Generally speaking, the main advantages of this architecture are as follows.
Firstly, it optimizes the convolution expansion and storage mode by utilizing a block parallel computing algorithm based on MTCA. Not only does this improve operational efficiency, it also reduces the storage cost of the intermediate matrix produced by convolution expansion. Moreover, the MTCA-based block computing algorithm ensures highly parallel operation of the calculation modules and achieves optimal algorithmic efficiency.
Secondly, a matrix multiplication array (MMA) based on a systolic parallel architecture is adopted to ensure high parallelism and independence in both the MMA and the internal PEs (Processing Elements), which gives the accelerator design good scalability. This approach avoids the transmission of partial sums between PEs, optimizes the internal pipeline, and achieves efficient processing. With the support of the specific dataflow mode, kernel-level matrix operations are realized via row calculation, whose data are fetched from the row vectors of the matrix. This avoids accessing column vector data and transposing the matrix. As a result, our design reduces both control difficulty and hardware cost.
The organizational structure of the paper is as follows. Section 2 summarizes the related work. Section 3 introduces and explains the MTCA algorithm. Section 4 describes the architecture of the accelerator in detail. Section 5 provides the implementation results and comparison. Section 6 concludes the paper.
2. Motivation and Related Work
In this section, we introduce the im2col, MEC, and MTCA algorithms, as well as the related work and challenges.
2.1. Im2col, MEC, MTCA Algorithm
Figure 1 shows the operational process of conventional convolution. The inner product is executed between the gray part of the input feature map and the filter (see Figure 1); the result is one element of the output feature map, and sliding the window over the input yields the final output feature map.
For the im2col process described in Figure 2, both the input feature map and the filter are spread by rows. The final output feature map is obtained by multiplying the two intermediate process matrices. Although the intermediate matrix increases the memory cost, it allows the convolution to be accelerated with a matrix-vector multiplication library, which ultimately improves computational performance.
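To make the lowering concrete, the following minimal NumPy sketch (our own illustration with hypothetical names; stride 1, single channel) performs the im2col expansion and the subsequent matrix-vector product:

```python
import numpy as np

def im2col_conv(x, f):
    """im2col sketch: x is an H x W input, f is a K x K filter, stride 1."""
    H, W = x.shape
    K = f.shape[0]
    out_h, out_w = H - K + 1, W - K + 1
    # Every output position contributes one row of K*K unfolded input elements.
    cols = np.empty((out_h * out_w, K * K))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[i:i + K, j:j + K].ravel()
    # The filter is spread by rows into a K*K vector; one product gives the output.
    return (cols @ f.ravel()).reshape(out_h, out_w)

# Example: a 7 x 7 input and a 3 x 3 filter.
x = np.arange(49, dtype=float).reshape(7, 7)
f = np.ones((3, 3))
print(im2col_conv(x, f).shape)  # (5, 5)
```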
For MEC, moreover (see Figure 3), the input feature map is divided longitudinally into sections A–E of size H × K, after which each section is expanded row by row to obtain the intermediate matrix (the gray part of Figure 3). The filter is expanded by row, the same as in im2col. In this case, we obtain two intermediate matrices of size (5, H × K) and (K^2, 1). A (5, K^2) window of the first intermediate matrix is taken out and multiplied with the intermediate matrix of the filter; the result is a row of output feature map elements (the red or green parts of Figure 3), and sliding the window with a step size of K yields the final output feature map.
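Following the description above (stride 1, single channel), a rough NumPy sketch of this MEC-style lowering, with our own variable names, is:

```python
import numpy as np

def mec_conv(x, f):
    """MEC-style sketch: lower x (H x W) into (W-K+1, H*K) strips, then slide
    a (W-K+1, K*K) window over it with step K to produce the output row by row."""
    H, W = x.shape
    K = f.shape[0]
    out_h, out_w = H - K + 1, W - K + 1
    # Sections A-E: vertical H x K strips, each flattened row by row.
    L = np.stack([x[:, j:j + K].ravel() for j in range(out_w)])  # (out_w, H*K)
    w_vec = f.ravel()                                            # (K*K,)
    out = np.empty((out_h, out_w))
    for h in range(out_h):
        # A (out_w, K*K) window of L, advanced by K elements per output row.
        out[h] = L[:, h * K:h * K + K * K] @ w_vec
    return out
```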
The operational process of the MTCA algorithm is illustrated in Figure 4. Different from MEC, the input feature map is horizontally divided into sections A–E of size K × W and then expanded to obtain a matrix of size (H × K, H-K+1). The data expansion method of each of the sections A–E is shown by the arrows in the figure. Based on the im2col expansion, the filter is expanded with zeros into a matrix of size (H-K+1, H × K); subsequently, the final output feature map is obtained by multiplying the two intermediate matrices.
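Read literally, the expansion above can be sketched as follows (our own NumPy formulation under the same assumptions of stride 1 and a square input; the element ordering indicated by the arrows in Figure 4 may differ in detail):

```python
import numpy as np

def mtca_conv(x, f):
    """MTCA-style sketch: horizontal K x W strips form the columns of one
    intermediate matrix; the filter is zero-padded into shifted rows of the other."""
    H, W = x.shape
    K = f.shape[0]
    out_h, out_w = H - K + 1, W - K + 1
    # Sections A-E: horizontal K x W strips, each flattened into one column.
    M_in = np.stack([x[i:i + K, :].ravel() for i in range(out_h)], axis=1)  # (K*W, out_h)
    # Filter expanded with zeros: row j holds f at column offset j (im2col-like).
    M_f = np.zeros((out_w, K * W))
    for j in range(out_w):
        for r in range(K):
            M_f[j, r * W + j:r * W + j + K] = f[r]
    # Multiplying the two intermediate matrices gives the (transposed) output.
    return (M_f @ M_in).T   # (out_h, out_w)
```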
2.2. Related Work
Previous studies have proposed a number of designs for CNN accelerator architectures. The DianNao [17,18] series achieved high performance by improving the parallel computing method and optimizing the dataflow model. Eyeriss [19] reduced data movement by maximizing local data reuse and implementing data gating. However, their development cycles are too long to keep up with the rapid development of CNNs, and they did not study large-scale convolution computation.
You Huang [20] realized high performance for large-scale convolution by optimizing the matrix-multiplication computing architecture. Junyang Zhang [21] proposed a large-scale convolution parallel computing method that realized high-performance CNN acceleration on the FT2000 architecture. However, neither of them optimizes the storage efficiency of convolution expansion. Jing Shen [22] proposed a design that balances storage bandwidth and computational speed: the convolutional layer is subdivided into fine-grained tasks that are executed so as to mask the time of accessing external storage, which outperforms traditional accelerators. Chaoyang Zhu [23] proposed a sparse-wise dataflow that skips the cycles of processing multiply-and-accumulates (MACs) with zero weights and exploits data statistics to minimize energy through zero gating, avoiding unnecessary computations and providing high speedup and energy efficiency. Maurizio Capra [24] introduced what a hardware accelerator is and what its main components are, followed by the latest techniques in the fields of dataflow, reconfigurability, variable bit-width, and sparsity.
In addition, existing studies have proposed a number of algorithm-level designs for improving storage efficiency. M. Cho [25] proposed a memory-efficient convolution (MEC) algorithm to reduce the storage overhead. Yulin Fang [15] proposed a convolution optimization algorithm based on the compute unified device architecture (CUDA) acceleration library. However, the control logic of piecewise convolution is complex and the addresses of the segmented matrices are not aligned, which is not conducive to hardware implementation.
2.3. Motivation
In summary, most existing designs rarely address the following three questions: How can the storage overhead of convolution expansion be reduced? How can the difficulty of memory read-write control be reduced? How can large-scale convolution be accelerated? We propose a block algorithm based on MTCA to achieve large-scale convolution computation. Compared with im2col, MEC, and MTCA, we improve operational performance, save hardware resources, and improve computational efficiency by realizing optimal blocking. In addition, we reduce the difficulty of read and write control, which is especially important for large-scale convolution.
3. Algorithm Analysis and Mapping
3.1. A Design of Block Parallel Computing Algorithm Based on MTCA
Although the MEC algorithm also introduces an intermediate matrix, it saves 51.2% of the memory space compared with im2col. However, MEC also has some shortcomings: the logic of the matrix partition is complex and the address space of the small matrices is not aligned, which makes the algorithm harder to implement in both hardware and software.
Compared with MEC, MTCA has a larger intermediate transformation matrix and simpler control logic, and it is easier to implement in parallel computing on hardware to achieve better performance. However, there will be a large number of zero-valued elements in the intermediate process matrix expanded from the filter, while the proportion of non-zero elements is small. During the actual computing process, this not only increases the operational load but also consumes a large amount of hardware storage resources. At the same time, given the increasing size of input feature maps and filters, this setup is inconvenient for parallel computation in hardware. Accordingly, in order to solve the above problems, we propose a block parallel computing algorithm based on MTCA.
As shown in Figure 5, the input feature map is longitudinally divided into m blocks, and each block is calculated via MTCA. During this process, whether each block needs to be padded is determined with reference to the number of blocks m. If the block size of the input feature map is less than the capacity size n, each block matrix needs to be padded according to Equation (1) below, where S denotes the window sliding step size.
Apart from the first block, whose padding content comes from the back part of the input feature map, the padding contents of the other blocks come from the front part of the input feature map (marked in gray in Figure 5). According to the sliding size and the number of blocks, the filter is zero-padded to the corresponding size following the MTCA expansion. The intermediate process matrix of each block is assigned to an MMA in order to perform matrix multiplication with the intermediate process matrix of the filter, and the final output feature map is obtained by combining the results.
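As a simplified illustration of the blocking flow only (it does not reproduce the exact padding rule of Equation (1)), the sketch below assumes stride 1 and gives each column block K-1 extra columns of overlap from its neighbour so that the per-block outputs tile the full output; conv_block stands for any per-block convolution routine, e.g. an MTCA-style one:

```python
import numpy as np

def blocked_conv(x, f, m, conv_block):
    """Block-parallel sketch (our assumptions, not Equation (1)): split the
    input into m column blocks with K-1 overlapping columns, convolve each
    block independently, and concatenate the per-block outputs."""
    H, W = x.shape
    K = f.shape[0]
    out_w = W - K + 1
    cols_per_block = int(np.ceil(out_w / m))   # output columns per block
    pieces = []
    for blk in range(m):
        c0 = blk * cols_per_block
        c1 = min(c0 + cols_per_block, out_w)
        if c0 >= c1:
            break
        # Each block needs its own columns plus K-1 columns of halo.
        sub = x[:, c0:c1 + K - 1]
        pieces.append(conv_block(sub, f))      # one MMA handles one block
    return np.concatenate(pieces, axis=1)      # combine into the final output
```

For instance, blocked_conv(x, f, m=4, conv_block=mtca_conv) with the sketch from Section 2.1 reproduces a plain stride-1 convolution; in the accelerator, the intermediate matrices of the blocks are dispatched to separate MMAs rather than processed sequentially.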
3.2. Computation Method of PE Based on Vector Single Instruction Multiple Data (SIMD) Technology
The computing performance of the accelerator is primarily determined by the computing power of each MMA. To reduce the idle time and pauses caused by memory accesses between the MMAs and their internal PEs, this paper uses a calculation method based on GEMM (General Matrix Multiplication) to realize the block calculation process by row. For the intermediate process matrix multiplication O = IW, the element-wise calculation in Equation (2) is usually used:

$O(i,j) = \sum_{b} I(i,b)\,W(b,j)$,  (2)

where b runs over the columns of I (equivalently, the rows of W). The filter's intermediate process matrix W and the output matrix O are divided into blocks by row; their row vectors are $w_b$ and $o_i$, respectively. Equation (2) can then be transformed into Equation (3), which builds each output row by accumulating rows of W scaled by the elements of the corresponding row of I:

$o_i = \sum_{b} I(i,b)\,w_b$.  (3)
As shown in Figure 6, in a departure from the conventional method, the vector calculation process [26] expands the b-th element of the row taken from the input intermediate process matrix I into a vector with the same value. As illustrated in Figure 7, the element $I(i,b)$ is loaded into the local register of each PE via broadcasting. The k elements of the vector in row b of matrix W are also loaded into the PEs to facilitate the vector multiply-accumulate calculation. The multiply-accumulate result on each PE is then saved in the on-chip register for use in the next iteration.
Compared with the traditional calculation method, the programming method based on vector SIMD technology is less intuitive. However, computing matrices I and W row by row has high memory-access efficiency; in particular, it avoids the traditional column-vector access to matrix W, and it requires neither transposing matrix W nor summing partial results across PEs. This reduces the extra overhead of pipelining the inner loop into the outer loop and makes efficient software pipelining easier to achieve.
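A minimal sketch of the row-wise computation of Equation (3), in which the scalar-times-row update stands in for one PE's SIMD multiply-accumulate, is:

```python
import numpy as np

def gemm_by_rows(I, W):
    """Row-wise GEMM sketch: O[i, :] is built by accumulating rows of W scaled
    by the broadcast elements I[i, b], so only row vectors are ever fetched."""
    m, k = I.shape
    _, n = W.shape
    O = np.zeros((m, n))
    for i in range(m):
        for b in range(k):
            # I[i, b] is broadcast to all lanes; W[b, :] is a row vector,
            # so no column access or transposition of W is needed.
            O[i, :] += I[i, b] * W[b, :]
    return O

I = np.random.rand(4, 6); W = np.random.rand(6, 5)
assert np.allclose(gemm_by_rows(I, W), I @ W)
```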
4. Accelerator Architecture Design
4.1. Accelerator Architecture Overview
The structure of the CNN accelerator is illustrated in Figure 8. It mainly comprises the processor, DRAM (Dynamic Random Access Memory), the matrix-multiplication arrays (MMAs), and the system interconnection bus.
The main modules and their functions are as follows:
- (1)
CPU (Central Processing Unit): Runs the program that controls task execution, communicates with the accelerator controller through the general interface, and interacts with the accelerator by writing the task descriptor into the memory-mapped register;
- (2)
Controller: After receiving the task information sent by the CPU, loads the task onto the accelerator and begins the calculation. It distributes the tasks to each MMA execution module according to the proposed algorithm, and starts the DMA (Direct Memory Access) module through the interconnect configuration to move data to the BUFFER or write it to DDR (Double Data Rate) memory until the calculation is completed. The results are transferred to the external DRAM, after which the completion information is sent back to the CPU;
- (3)
MMA: Calculates the block matrix based on vector SIMD technology.
4.2. The Structure of the MMA
The MMA unit is responsible for the calculation of the block intermediate process matrix. Its overall structure, shown in Figure 9, consists of a × b processing elements (PEs). The structure combines a systolic array structure with a bus-broadcast-based parallel structure, and is thus called a systolic-based extended parallel architecture. The PEs do not require any data exchange or communication in the x-dimension; therefore, the bus broadcast mode is used to load data into each PE, which effectively eliminates data transmission delay between adjacent processing units and reduces the difficulty of designing the PE control module in the x-dimension. Moreover, owing to data reuse in the y-dimension, the systolic array structure is adopted to reduce the communication overhead. The PEs are connected by short wires, which effectively ensures high independence and parallelism of data processing between PEs. This approach not only reduces the MMA's bandwidth requirements, but also overcomes the insufficient scalability of a simple bus broadcast structure.
In order to clearly and simply explain the working process of the MMA, consider an example MMA structure processing the intermediate process matrix multiplication O = IW; the number of PE units and the scales of matrix I and matrix W are given in Equation (4).
The MMA flow chart is shown in Figure 10. Each PE unit synchronously executes the following process when performing the multiplication calculation: first, the data of the filter intermediate matrix W is transmitted to the next PE unit. The adder then adds the multiplication result to the previous partial result; the intermediate result is stored in the on-chip REG of the PE and written back to the on-chip memory until the end of the cycle. The final result is written back directly to the external memory by the DMA module, in the reverse direction of the matrix W input of the systolic array structure.
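The dataflow can be mimicked by the following behavioral sketch (our own functional model, not the RTL; it assumes that PE row i accumulates output row i, that rows of W enter the array and hop down one PE row per cycle over the short wires, and that the matching element of I is broadcast along each PE row):

```python
import numpy as np

def mma_simulate(I, W):
    """Cycle-by-cycle sketch of the systolic/broadcast MMA dataflow."""
    m, k = I.shape
    _, n = W.shape
    acc = np.zeros((m, n))          # per-PE on-chip REG (partial sums stay local)
    w_regs = [None] * m             # the W row currently held by each PE row
    for cycle in range(k + m - 1):
        # Systolic step: W rows move down to the next PE row over short wires.
        for i in range(m - 1, 0, -1):
            w_regs[i] = w_regs[i - 1]
        w_regs[0] = W[cycle] if cycle < k else None
        # Broadcast step: each PE row multiplies its I element with its W row.
        for i in range(m):
            b = cycle - i           # which row of W this PE row sees this cycle
            if 0 <= b < k:
                acc[i, :] += I[i, b] * w_regs[i]
    return acc

I = np.random.rand(3, 4); W = np.random.rand(4, 5)
assert np.allclose(mma_simulate(I, W), I @ W)
```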
4.3. The Structure of PE
A PE is the most basic operation unit of the accelerator. The structure of a PE is presented in Figure 11.
The PE mainly consists of a multiply-add unit, on-chip memory, row and column communication interfaces, an I/O bus, and so on. The PE receives the data of matrix I from the broadcast bus and receives the data of matrix W from the buffer or from the previous PE. Only data that meet the IEEE-754 standard can be sent to the multiplier. The multiplication result is then fed into the adder. The adder reads the intermediate value of the result matrix O from the on-chip register through the MUX1 multiplexer (for the initial calculation, the data of matrix O is zero). After the calculation is complete, the MUX2 multiplexer determines whether the result is stored in on-chip RAM or written to external DRAM. Because the intermediate values are stored in the PE's internal memory, the calculation is independent, and there is no need to transfer intermediate results off the PE.
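A single PE's per-cycle behavior can be summarized by the following sketch (the MUX1/MUX2 roles follow Figure 11; the scalar step interface is our simplification):

```python
class PE:
    """Behavioral sketch of one processing element (scalar datapath only)."""
    def __init__(self):
        self.reg = 0.0          # on-chip register holding the partial sum of O

    def step(self, i_elem, w_elem, first=False, last=False):
        prod = i_elem * w_elem              # floating-point multiplier
        base = 0.0 if first else self.reg   # MUX1: start from zero on the first pass
        self.reg = base + prod              # adder result kept in the register
        # MUX2: keep accumulating on chip, or emit the finished result
        return self.reg if last else None   # 'last' -> value written out via DMA
```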
4.4. The Double Buffer Data Movement Based on DMA
Reducing the cost of moving data between different levels of storage is one of the key means of improving the efficiency of matrix multiplication. The data storage of the accelerator involves a multi-level storage structure comprising the register file, Buffer, L1D SRAM, and DDR. In L1D SRAM, two buffers are set up and used in a ping-pong fashion so that core computing overlaps with DMA data movement; this reduces the data moving time and improves the calculation efficiency of the CNN accelerator. Taking an MMA as an example, the specific data migration strategy is shown in Figure 12.
Step 1: Prepare the data of the first buffer in L1D SRAM through DMA before the MMA starts the calculation.
Step 2: The buffer pools corresponding to matrix I and matrix W in the MMA read their data from L1D SRAM. The buffer pools corresponding to matrix I read the data of the corresponding rows and transmit them to the appropriate PEs via bus broadcast. The data of matrix W is read row by row and distributed to each buffer pool corresponding to matrix W, after which the data are transferred to the next PE.
Step 3: While the PEs are computing, DMA moves the next data block of matrix I and matrix W from DDR to the second buffer of L1D SRAM. The matrix W buffer pool reads the remaining data in the first buffer, then uses the MMA's systolic array structure to transmit the data of matrix W to the next PE via the short wires between PEs. After receiving the data, the PEs start the calculation for the next layer of data. After the PEs have computed the result of the first block matrix operation, the buffer pools corresponding to matrix I and matrix W continue by reading the data of the second buffer in L1D SRAM.
The above process is repeated until every channel of the input feature map has been divided into blocks and computed with the corresponding channel data of the filter; the calculation results are then moved to DDR through DMA using the systolic structure.
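The ping-pong schedule of Steps 1-3 can be paraphrased as the sketch below, where dma_load and compute are placeholders for the DMA transfer and the MMA computation; in hardware the two overlap in time, which the sequential Python loop only hints at:

```python
def pingpong_run(blocks, dma_load, compute):
    """Double-buffer sketch: buffer 0 is pre-loaded, then each iteration
    prefetches the next block into the idle buffer while the other is consumed."""
    bufs = [None, None]
    bufs[0] = dma_load(blocks[0])                  # Step 1: fill the first buffer
    results = []
    for idx, _ in enumerate(blocks):
        cur, nxt = idx % 2, (idx + 1) % 2
        if idx + 1 < len(blocks):                  # Step 3: DMA fetches the next
            bufs[nxt] = dma_load(blocks[idx + 1])  # block while the MMA computes
        results.append(compute(bufs[cur]))         # Step 2: consume current buffer
    return results                                 # results are drained to DDR
```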
6. Conclusions
In this paper, we propose a CNN hardware accelerator design based on the MTCA block parallel computing algorithm. With row-based matrix-multiplication computing, the MTCA block parallel computing algorithm provides high-performance and high-efficiency acceleration for various convolutions while reducing bandwidth and maximizing data reuse. In our design, we realize the convolution expansion and the partition of the intermediate matrix, which is conducive to highly parallel computation on hardware. Compared with an accelerator design based on the im2col algorithm, our design saves more than 60% of the hardware storage space; in the case of large-scale convolution in particular, it saves nearly 82% of the storage space. It also shows more prominent advantages than an accelerator design based on MEC. Experiments show that, with the support of the specific dataflow mode, the performance per unit power reaches 13 GFLOPS/W, which is better than that of popular processors.