Whilst earlier versions of FPGAs just comprised multiple Lookup Tables (LUTs) connected to registers and accelerated by fast adders, FPGAs now include more coarse-grained functions such as dedicated, full-custom, low-power DSP slices. For example, the Xilinx DSP48E1 slice comprises a 25-bit pre-adder, a 25 × 18-bit multiplier and a 48-bit adder/subtracter/logic unit (Figure 1); FPGAs also offer multiple distributed RAM blocks which provide high-bandwidth capability, and a plethora of registers which support high levels of pipelining.
Whilst FPGAs have been successfully applied in embedded systems and communications, they have struggled as a mainstream computational platform. Addressing the following considerations would make FPGAs a major rival platform for “data-intensive” applications:
High-Level Programming Environment
The proposed methodology employs a reprogrammable model comprising multi-core processors supporting SIMD operation and an associated inter-processor communication methodology. A dataflow design methodology has been chosen as the high-level programming approach as it offers concurrency, scalability, modularity and data-driven properties, all of which match the design requirements associated with image processing systems. A dataflow model allows algorithms to be realized as actors with specific firing rules that are mapped into directed graphs, where the nodes represent computations and the arcs represent the movement of data. The term data-driven expresses that execution in a dataflow model is controlled by the availability of the data itself. In this context, an actor is a standalone entity which defines an execution procedure and can be implemented in the IPPro processor. Actors communicate with other actors by passing data tokens, and execution proceeds as tokens pass through First-In-First-Out (FIFO) units. The combination of a set of actors with a set of connections between actors constructs a network, which maps well to the system-level architecture of the IPPro processors. An earlier version of the programming environment is detailed in [11], allowing the user to explore parallel implementation and providing the necessary back-end compilation support.
In our flow, every processor can be thought of as an actor and data is fired through the FIFO structures, but the approach needs to be sensitive to FPGA-based limitations such as restricted memory.
Cal Actor Language (CAL) [22] is a dataflow programming language that has been targeted at image processing and FPGAs, and it offers the necessary constructs for expressing parallel or sequential coding, bitwise types, a consistent memory model, and communication between parallel tasks through queues. RVC-CAL is supported by an open-source dataflow development environment and compiler framework, Orcc, which allows the trans-compilation of actors and generates equivalent code depending on the chosen back-ends [23]. An RVC-CAL-based design is composed of a dataflow network file (.xdf file) that supports task- and data-level parallelism.
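As a purely illustrative sketch (the actor name, port names and threshold parameter are invented for the example and not taken from the benchmark designs), a minimal RVC-CAL actor that thresholds a stream of pixel tokens could be written as follows; it fires whenever a token is available on its input queue and produces one token on its output queue:

```
// Hypothetical example actor: consumes one 8-bit pixel token per firing
// and produces one thresholded token.
actor Threshold (uint(size=8) th) uint(size=8) In ==> uint(size=8) Out :

	action In:[ p ] ==> Out:[ if p > th then 255 else 0 end ]
	end

end
```

In the proposed flow, such an actor would map to a single IPPro core, with the In and Out queues realised as FIFOs between cores.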
Figure 2 illustrates the possible pipelined decomposition of dataflow actors. These dataflow actors need to be balanced, as the worst-case execution time of an actor determines the overall achievable performance. Data-level parallelism is achieved by making multiple instances of an actor and requires SIMD operations that must be supported by the underlying processor architecture. In addition, it requires a software-configurable system-level infrastructure that manages control and data distribution/collection tasks. This involves the initialisation of the soft core processors (programming the decomposed dataflow actor description), receiving data from the host processor, distributing it to the first-level actors, gathering processed data from the final-level actors and sending it back to the host processor.
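As an illustration of such a decomposition (the actor and port names below are invented for the example), a single actor computing a multiply-then-add on each pixel could be split into two smaller, roughly balanced actors connected through a FIFO, forming two pipeline stages:

```
// Stage 1 of the hypothetical pipeline: multiply each pixel by a coefficient.
actor MulStage (int coeff) int In ==> int Mid :
	action In:[ x ] ==> Mid:[ x * coeff ]
	end
end

// Stage 2: add an offset to the partial result received from Stage 1.
actor AddStage (int offset) int Mid ==> int Out :
	action Mid:[ y ] ==> Out:[ y + offset ]
	end
end
```

Each stage can then be mapped to its own IPPro core, so the worst-case stage time, rather than the original actor's full execution time, sets the pipeline throughput.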
Data-level parallelism directly impacts the system performance; the major limiting factor is the number of resources available on the FPGA. An example pipeline structure, with an algorithm composed of four actors each having different execution times and multiple instances of the algorithm realised in SIMD fashion, is shown in
Figure 2. The performance metric, frames-per-second (fps), can be approximated as
$$\text{fps} \approx \frac{f}{(P_{frame}/P_{actor}) \times T_{actor}},$$
where $P_{frame}$ is the number of pixels in a frame, $P_{actor}$ is the number of pixels consumed by an actor in each iteration, $T_{actor}$ is the worst-case execution time of an actor in clock cycles, and $f$ is the operating frequency of the processor.
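For illustration only, with assumed (not measured) values of a $512 \times 512$ frame ($P_{frame} = 262{,}144$), an actor consuming one pixel per iteration ($P_{actor} = 1$) with a worst-case execution time of $T_{actor} = 10$ cycles, and $f = 100$ MHz, the approximation gives $\text{fps} \approx 100 \times 10^{6} / (262{,}144 \times 10) \approx 38$; halving $T_{actor}$ or spreading the frame across two cores would each roughly double this figure.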
To improve the fps, the following options are possible:
Efficient FPGA-based processor design that operates at a higher operating frequency $f$.
Reducing the actor’s execution time by decomposing it into multiple pipelined stages, thus reducing $T_{actor}$ to improve the fps. Shorter actors can be merged sequentially to minimise the data transfer overhead by localising data into FIFOs between processing stages.
Vertical scaling to exploit data parallelism by mapping an actor onto multiple processor cores, thus reducing the effective ($P_{frame}/P_{actor}$) handled by each core, at the cost of additional system-level data distribution, control, and collection mechanisms.
The developed tool flow (
Figure 3) starts with a user-defined RVC-CAL description composed of the actors selected to execute on FPGA-based soft cores, with the rest run on the host CPU. By analyzing the behaviour, software/hardware partitioning is decided by two main factors: the actors with the worst execution time (determined exactly by the number of instructions and the average waiting time to receive the input tokens and send the produced tokens), and the overhead incurred in transferring the image data to/from the accelerator. The behavioural description of an algorithm can be coded in different formats:
No explicitly balanced actors or actions are provided by the user.
The actors include actions which are balanced and do not depend on each other, e.g., no global variable in an actor is updated by one action and then used by another (see the sketch after this list); otherwise, these would need to be decomposed into separate actors.
The actors are explicitly balanced and only require hardware/software partitioning.
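As a sketch of the second case (the actor, action and variable names are invented), the two actions below are coupled through the actor state variable acc: one updates it and the other reads it, so they could not simply be moved into separate balanced actors without first removing that shared state:

```
actor SumThenScale () int In ==> int Out :

	// Actor-level (global) state variable shared by both actions.
	int acc := 0;

	// Consumes a token and updates the shared state.
	add: action In:[ x ] ==>
	do
		acc := acc + x;
	end

	// Reads the shared state updated by the action above,
	// creating a dependency between the two actions.
	scale: action In:[ x ] ==> Out:[ x * acc ]
	end

end
```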
There are two types of decomposition, “row-” and “column-wise”. Newly generated data-independent actors can be placed row-wise at the same pipeline stage (see the sketch below); otherwise they can be placed column-wise as consecutive pipeline stages. Row-wise is preferred, as the overhead incurred in token transmission can be a limiting factor, but typically a combination is employed.
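As an illustrative sketch (names invented), the two actors below are data-independent, since each consumes its own copy of the pixel stream, and so they could be placed row-wise at the same pipeline stage; if Invert instead consumed the output of Brighten, the two would have to be placed column-wise as consecutive stages:

```
// Data-independent actors: neither consumes tokens produced by the other.
actor Brighten (int gain) int In ==> int Out :
	action In:[ p ] ==> Out:[ p + gain ]
	end
end

actor Invert () int In ==> int Out :
	action In:[ p ] ==> Out:[ 255 - p ]
	end
end
```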
If the actors or actions are not balanced, then they need to be decomposed. This is done by detecting sequences of instructions without branches (unless a branch occurs at the end) and breaking the program into basic blocks. The “balance points” at which the actor should be divided into multiple sets of basic blocks, each set placed in a new actor, then need to be found; the sets are chosen so that the overhead of transferring tokens among them does not create a bottleneck, and the partitioning with the lowest overhead is selected (see Ref. [11]). Once the graph is partitioned, the original xdf file no longer represents the network topology, so each set of actors must be redesigned separately, their input/output ports fixed, and a new set of xdf dataflow network description files generated. The actors to run on the host CPU are compiled from RVC-CAL to C using the C backend of the Orcc development environment, whereas the FPGA-based functionality is created using the proposed compiler framework.
The degree of SIMD applied will affect the controller interface settings. For a target board, the design will have a fixed number of IPPro cores realized and interconnected with each other and with the controllers, determined by the FPGA resources and fan-out delay; for the Zedboard considered here, 32 cores are selected. The compilation infrastructure is composed of three distinct steps:
Examination of the xdf dataflow network file and assignment and recording of the actor mapping to the processors on the network.
Compilation of each actor’s RVC-CAL code to IPPro assembly code.
Generation of control register values, mainly for the AXI Lite registers, and parameters required by the developed C-APIs running on the host CPU.
While FPGA-targeted actor interaction is handled by the compiler, the processes for receiving the image data and storing the output in the edge actors need to be developed. Multiple controllers (programmable by the host CPU) are designed to provide the interface to transfer the data to the accelerators, gather the results and transfer them back to the host. With the host CPU running part of the design and setting the control registers, the IPPro binary codes of the other actors loaded onto the appropriate cores on the accelerator, and the interface between the software/hardware sections configured accordingly, the system implementation is in place and ready to run.