1. Introduction
According to the Moore’s law, the number of transistors in an integrated circuit is doubling approximately every two years. In the field programmable gate array (FPGA) [
1,
2] design flow, the routing of nets (which are a collection of two or more interconnected components) is one of the most time-consuming steps. Hence, there is a need to develop fast routing algorithms that tackle the problem of the increasing numbers of transistors per chip, and subsequently, the increased runtime of FPGA CAD (computer-aided design) tools. This can be achieved in two ways. First, by parallelizing the routing algorithms for hardware having multiple cores. However, the pathfinder algorithm [
3], which is one of the most commonly used FPGA routing algorithm is intrinsically sequential. Hence, this approach seems inappropriate for parallelizing all types of FPGA routing algorithms.
Second, instead of compiling the entire design together, the users can partition their design, compile partitions progressively, and then assemble all the partitions to form the entire design. Some existing works have proposed this approach [
4,
5]. However, the routing resources required by one partition may be held by another partition, i.e., there is no guarantee to have balanced partitions. In other words, in this approach, there is a need to tackle the difficulties arising in sharing of routing resources.
The authors in ParaLaR [
6] overcome the limitations of existing approaches by formulating the FPGA routing problem as an optimization problem [
7]. Here, the objective function is linear and the decision variables can only have binary values. Hence, the FPGA routing problem is converted to a binary integer linear programming (BILP) minimization problem [
8,
9,
10] (generally termed as linear programming (LP)). In this LP, the dependencies (or constraints) that prevent the nets from being routed in parallel are examined and relaxed by using Lagrange relaxation multipliers [
11]. The relaxed LP is solved in a parallel manner by the sub-gradient method and the Steiner tree algorithm. The complete algorithm is called ParaLaR.
This parallelization gives significant speed-ups. However, in this approach, the sub-gradient method is used in a standard way that does not always give a feasible solution (i.e., some constraints are violated). This directly impacts the standard metric of minimum channel width, which could also be improved.
There are many variants of the sub-gradient method and a problem-specific method gives better results. In this paper, we use the same framework as for ParaLaR, but use an adapted sub-gradient method. We use a primal–dual variant of the sub-gradient method with an adapted step size in this iterative method. Our approach substantially solves the above two problems.
By experiments on standard benchmarks, we compared our algorithm (termed as ParaLarPD) both with ParaLaR and the commonly used non-parallelizable routing algorithm VPR (Versatile Place and Route) [
12]. We performed 100 independent experiments/runs (for all benchmarks) for both ParaLarPD and ParaLaR, and obtained the aggregate results. We performed these experiments for two configurations. For our best configuration, we got results as below. The number of infeasible solutions and the minimum channel width requirement on average reduced by
as compared to ParaLaR (as above, the two metrics are related). Our minimum channel width wa on average
better than that obtained by VPR (there are no constraints that could be violated in VPR). When looking at the total wire length, we obtained the same value as by ParaLaR, which was on an average
better than as obtained by VPR. This is the original metric to be minimized, for which ParaLaR was proposed.
In achieving the above improvements, on an average, the critical path delay in ParaLarPD was worse than that in ParaLaR and better than that in VPR. When running a parallel version of ParaLarPD with eight threads, we obtained maximum relative speed-ups of up to seven times over ParaLarPD’s sequential implementation. These speed-ups are similar to those as obtained when comparing parallel ParaLaR and sequential ParaLaR.
There have been some other attempts to improve the performance of VPR in the past few years. Two of the resulting algorithms are RVPack [
13] and GGAPack/GGAPack2 [
13]. Although these algorithms do not attempt to parallelize the routing process, for the completeness we compare our algorithm to these as well. The rest of this paper is organized as follows:
Section 2 describes the formulation of the FPGA routing as an optimization problem.
Section 3 explains the implementation of our proposed approach.
Section 4 presents experimental results. Finally,
Section 5 gives conclusions and discusses future work.
2. Formulation of the Optimization Problem
The routing problem in FPGA or electronic circuit design is a standard problem that is formulated as a weighted grid graph of certain set of vertices V and edges E, where a cost is associated with each edge. In this grid graph, there are three types of vertices; the net vertices, the Steiner vertices, and the other vertices. A net is represented as a set consisting of all net vertices. A Steiner vertex is not part of the net vertices but it is used to construct the net tree, which is the route of a net (i.e., a sub-tree T of the graph G). A net tree is also called a Steiner tree.
Figure 1 shows an example of a
grid graph, where the black color circles represent the net vertices; the gray color circles represent the Steiner vertices; and the white color circles are the other vertices. The horizontal and the vertical lines represent the edges (as above, these edges have a cost associated with them but that is not marked here). Two net trees are shown by dotted edges.
The number of nets and the set of vertices belonging to each net is given. The objective here is to find a route for each net such that the union of all the routes will minimize the total wire length of the graph
G (which is directly proportional to the total path cost). The goal here is to also minimize the channel width requirement of each edge. Both these objectives are explained in detail below, after (
1).
To achieve the above two objectives, the problem of routing of nets is formulated as an LP problem given as follows [
6] (ParaLaR paper):
This optimization problem minimizes the total wire length of FPGA routing, where is the number of nets; E is the set of edges with e denoting one such edge; is the cost/time delay associated with the edge e; is the decision variable that indicates whether an edge e (routing channel) is utilized by the net i (value 1) or not (value 0); W is a constant; is the vector of all for net i that represents the ith net’s route tree; is the node-arch incidence matrix (which represents a graph as a constraint matrix of the minimum cost flow problem); and is the demand/supply vector (which signifies the amount of cost flow to each node).
The inequality constraints are the channel width constraints that restrict the number of nets utilizing an edge to W (which is iteratively reduced; discussed later). The equality constraints guarantee that a valid route tree is formed for each net (these are implicitly satisfied by our solution approach; again discussed later). To find a feasible route for each net efficiently, the above LP should be parallelized. There are two main challenges here, which are discussed in the two sub-sections below.
We use the same elements as used in ParaLaR [
6], since we improve that existing algorithm. These elements are discussed in the subsequent paragraphs. There is a slight change of notations, which we have done to make the exposition of our new algorithm more easier. This change of notations is summarized below.
Number of nets: N in ParaLaR and in ParaLarPD.
Cost/time delay associated to edge e: in ParaLaR and in ParaLarPD.
Node–arch incidence matrix: in ParaLaR and in ParaLarPD.
2.1. The Channel Width Constraints
The first challenge to parallelize the LP given in (
1)–(2c) is created by the channel width constraints. These constraints introduce dependency in the parallelizing process, and therefore, should be eliminated or relaxed. The Lagrange relaxation [
11] is a technique well suited for problems where the constraints can be relaxed by incorporating them into an objective function. If this is not possible, then a Lagrange heuristic can be developed [
14,
15,
16].
In our ParaLarPD, similar to ParaLaR, we introduce Lagrange relaxation multipliers
for each channel width constraint (2a), and cumulatively these multipliers represent the cost that prevents overuse of the routing channel. The relaxation of constraints is implemented by adding
times the corresponding constraint into the objective function. That is, rewriting (
1)–(2c) as in ParaLaR.
After rearranging the objective function above, the modified LP is given as [
6]
In the above LP, is the new cost associated with the edge e. This LP can be easily solved in parallel manner.
2.2. The Choice of Decision Variable
The second challenge to solve the LP given by (
5)–(6c) is created by the decision variables
. As earlier, if an edge
e is utilized by the net
i, then
else
. Thus, as mentioned earlier this is a BILP that is non-differentiable, and hence, cannot be solved by conventional methods such as the simplex method [
8,
17], the interior point method [
18], etc. Some methods to solve non-differentiable optimization problems include the sub-gradient-based methods [
19], the approximation method [
20], etc.
The sub-gradient based methods are commonly used to minimize non-differentiable convex functions
. These are iterative in nature that update the variable
x as
, where
and
are the step size and a sub-gradient of the objective function, respectively, at the
kth iteration. In ParaLaR, the LP given in (
5)–(6c) is not solved directly by a sub-gradient based method but only the Lagrange relaxation multipliers are obtained by it. After this, the minimum Steiner tree algorithm is used in a parallel manner for FPGA routing. This two-step procedure is followed because just using a sub-gradient method will not always give binary solutions, which we need (recall
can be 0 or 1). Moreover, using a Steiner tree algorithm helps us in achieving feasible routing (equality constraints are implicitly satisfied). In our ParaLarPD, we follow this same approach.
3. Proposed Approach
There are many variants of the sub-gradient based methods such as the projected sub-gradient method [
19], the primal–dual sub-gradient method [
21], the conditional sub-gradient method [
22], the deflected sub-gradient method [
22], etc. In ParaLaR [
6], authors use the projected sub-gradient method, where the Lagrange relaxation multipliers are calculated as
Here,
and
are the Lagrange relaxation multipliers at the
kth and the
th iteration, respectively; and
, i.e., a sub-gradient of the objective function given in (
5) at the
kth iteration. Also, as above,
denotes the size of the step taken in the direction of the sub-gradient at the
kth iterative step, and is updated as
This approach satisfactorily parallelizes FPGA routing and gives better results over VPR [
12], but there are many inequality constraints that are violated for some cases. This directly affects the minimum channel width requirement, which can be improved further. The channel width is defined as
.
The LP given by (
5)–(6c) is the dual of LP given by (
1)–(2c) (see [
9,
22]). Hence, a sub-gradient method that is specific to a dual problem would give better results compared to the standard one, that is, the primal–dual sub-gradient method. Thus, we use this method with ParaLaR leading to our ParaLarPD. This achieves our main goal of reducing violation of constraints with an added benefit of minimization of channel width.
Next, we describe our ParaLarPD where we compare our usage of primal–dual sub-gradient method with the projected sub-gradient as used in ParaLaR, and also briefly summarized above. As earlier, we expand upon two aspects; iterative update of the Lagrange multipliers and the corresponding step sizes.
Our Algorithm
We update the Lagrange relaxation multipliers in (
5)–(6c) as follows [
21]
where
is a sub-gradient of the objective function at the
kth iteration—the partial derivative of the objective function in (
5) (the
h in the projected sub-gradient method from (
7) is the same). Also, we take
, which is the most general initial guess [
11].
Let us now compare (
9) with (
7). For both the methods, if the inequality constraints are violated at the
kth iteration (see (2a);
for some
), then the particular Lagrange relaxation multiplier at the
th iteration is incremented by
times the sub-gradient of the objective function at the
kth iteration. For the primal–dual sub-gradient method, this is obvious from (
9). For the projected sub-gradient method, this is true because
is non-negative, and
and
h both are positive in (
7).
Further, if the inequality constraints are not violated at the
kth iteration (again see (2a);
for some
), then for the primal–dual sub-gradient method, the value of the particular Lagrange relaxation multiplier at the
th iteration is the same as the
kth iteration, while for the projected sub-gradient method, it may change (again see (
9) and (
7), respectively). In general, this works better because logically a Lagrange relaxation multiplier at the
th iteration should be equal to the value of this multiplier at the iteration when the particular (or corresponding) constraint is not violated.
Next, we discuss the choice of the step size. If the step size is too small, then the algorithm would get stuck at the current point, and if it is too large, the algorithm may oscillate between any two non-optimal solutions. Hence, it is very important to select the step size appropriately. The choice of step size can be either constant in all the iterations or can be reduced in each successive iteration. In our proposed scheme, the computation of step size involves a combination of the iteration number as well as the norm of the Karush–Kuhn–Tucker (KKT) operator of the objective function at that particular iteration [
22] (instead of using the iteration number only, as used in ParaLaR; see (
8)). This ensures that the problem characteristic is used in the computation of the step size. That is,
where
k is the iteration number,
is the
KKT operator for the objective function (
5), and
is the 2-norm of
.
The sub-gradient based methods are iterative algorithms, and hence, we need to check when to stop. There is no ideal stopping criterion for sub-gradient based methods. However, some possible measures that can be used are discussed below (including our choice) [
19].
If at an iteration
k, the constraints violation (2a) (i.e.,
) is satisfied and
, then we obtain the optimal point. Therefore, we stop here because there is no constraint violation (a necessary condition of Lagrange relaxation) [
11]. However, this stopping criterion is achieved only if strong duality holds but, in case of our problem, there is weak duality. Details of strong and weak duality can be found in [
9,
22].
Let at iteration k, and be the optimal function value and the available function value at the kth iteration, respectively, then the sub-gradient iterations can be stopped when (where is a very small positive number). In this criterion, the optimal value of the objective function is required in advance, which we do not have.
Another approach is to stop when the step size becomes too small. This is, because the sub-gradient method would get stuck at the current iteration.
As none of the above stopping criteria fit us, we stop our algorithm after a sufficient and a fixed number of iterations, as used in ParaLaR. As earlier, we term our proposed algorithm as ParaLarPD because we use primal–dual sub-gradient algorithm with ParaLaR.
Rest of steps of our ParaLarPD are the same as in ParaLar. We start with a constant value of
W and solve the optimization problem (
5)–(6c) by combination of the sub-gradient method and the Steiner tree algorithm. This gives us the total wire length and the constraint violation (or channel width).
Next, we reduce the value of W and again follow the above steps to obtain better local minima both for the total wire length and the constraint violation (or channel width).
The pseudocode of our ParaLarPD is given in Algorithm 1, where the new addition to ParaLaR are captured by lines 6 and 16.
Algorithm 1 ParaLarPD |
Input: Architecture description file and benchmark file. Output: Route edges.
- 1:
Run VPR with the input architecture and benchmark circuit. - 2:
▹ Initialize Steiner points - 3:
InitGridGraph() ▹ Initialize grid graph - 4:
▹ Initialize Lagrangian relaxation multipliers - 5:
fortodo - 6:
Calculate the step size using the Equation ( 10). ▹ It is used to update - 7:
- 8:
parallel_for to do ▹ number of nets - 9:
- 10:
if then - 11:
MST(, ) ▹ MST is call to a function that executes a Minimum Steiner Tree. - 12:
end if - 13:
MST(, ) - 14:
end parallel_for - 15:
while e ∈ E do - 16:
Update Lagrangian relaxation multipliers using the Equation ( 9). - 17:
Update the edge weight of the on . New edge weights are - 18:
end while - 19:
end for
|
In the above algorithm, initially we pack and place the benchmark circuit using VPR (as described in the architecture file). Next, set of Steiner points are initialized as empty, the grid graph is initialized to model an FPGA that is large enough to realize the input netlist (
is the number of total nets), and the Lagrangian relaxation multipliers (i.e.,
) are initialized to zero. Next, the routing algorithm runs for
(which is set to 50) number of iterations. In each iteration, first, we calculate the step size. Thereafter, we initialize
as empty set. Then, the
for loop at line 8 routes the nets in parallel by running the minimum Steiner tree (MST) algorithm to obtain the route edges. Finally, edge weights of the
are updated in the while loop from line 15 to line 18. In
Appendix A, we have listed a code snippet, which is the actual code of important functions.
4. Experimental Results
We performed experiments on a machine with single Intel(R) Xeon(R) CPU E5-1620 v3 running at 3.5 GHz and 64 GB of RAM. The operating system was Ubuntu 14.04 LTS, and the kernel version is 3.13.0–100. Our code was written in C++11 and compiled using GCC version 4.8.4 with O3 optimization flag. The resulting compiled code was run using a different number of threads.
We compared our proposed ParaLarPD with ParaLaR [
6] and VPR [
12]. For comparison purposes, ParaLaR and VPR 7.0 from the verilog-to-routing (VTR) package were compiled using the same GCC version and optimization flag.
As discussed earlier, besides ParaLaR and VPR, other commonly used routing algorithms are random VPack (RVPack) [
13] and grouping genetic algorithm pack (GGAPack/GGAPack2) [
13]. We discuss the benefits of ParaLarPD over these two algorithms as well.
The architecture parameters used for our experiments are given in
Table 1, which are most commonly used [
12,
13,
23].
In
Table 1, the values of
N and
K specify that the CLBs in the architecture contained ten fracturable logic elements (FLEs) and each FLE had six inputs, respectively. The values of
and
specify that every input and output pin was driven by 15% and 10%, of the tracks in a channel, respectively. We also performed experiments with
and
, the results of which are not reported in this paper. However, our algorithm still gave better results than ParaLaR and VPR. In FPGA terminology, the value of
specifies the number of wire segments that can be connected to each wire segment where horizontal and vertical channels intersect. This value can only be a multiple of 3. Here, we performed experiments with
. The value of length specifies the number of logic blocks spanned by each segment. We took this as 4, although our proposed method can be used for architectures with varying lengths, e.g., length
or a mix of length
and length
. All these parameters (i.e., N, K,
,
,
, and length) in ParaLaR and VPR were modified according to the values given in
Table 1, to run them identical to our model.
We tested on MCNC benchmark circuits [
24], which ranged from small sized to large sized logic blocks. Initially, the circuits were packed and placed using VPR. After that, routing was performed by all three methods (i.e., our proposed ParaLarPD, ParaLaR, and VPR). For parallelization, we used Intel threading building blocks (TBB) libraries. To examine the behavior of ParaLarPD and ParaLaR, we performed 100 independent runs of each of ParaLarPD and ParaLaR (for all benchmarks) and then collected the aggregate results of these 100 runs.
There is no general rule of choosing the initial value of the channel width for experimental purposes. However, a value of
to
more than the minimum channel width obtained from VPR is commonly used [
6,
23]. For our experiments, both ParaLarPD and ParaLaR are initialized with initial channel width (
W) as
, where
is the minimum channel width obtained from VPR. We also do experiments with initial
W as
, which does not change the results. We use an upper limit of 50 for the number of iterations for all the three methods because this is the limit chosen in the experiments of ParaLaR from [
6]. The best results out of all these iterations are reported.
Also, in our proposed model, routing of individual nets is independent and we update the cost of utilizing the edges at the end of each routing iteration. Thus, there is no race condition leading to no randomness. Hence, our executions are deterministic.
In
Table 2 and
Table 3, we compare the channel width (also called the minimum channel width), the total wire length, and the critical path delay as obtained by our proposed ParaLarPD with ParaLaR and routing-driven VPR. These metrics are independent of the number of threads used, therefore, here we give results for a single thread only. The results of ParaLarPD and ParaLaR are the average values of their 100 independent runs. In the tables, we report the geometric mean (Geo. Mean) of all the values obtained for different benchmark circuits. It indicates the central tendency of a set of numbers and is commonly used [
6].
It is important to emphasize that ParaLaR is sensitive to the system configuration of the machine used for experiments (in fact, ParaLarPD is equally sensitive). Hence, the ParaLaR data in
Table 2 and
Table 3 is slightly different than that reported in the original ParaLaR paper.
In
Table 2 (i.e., for
), if we look at the channel width, then ParaLarPD gives on an average
improvement over ParaLaR. As discussed earlier, constraints violation (which is a problem in ParaLaR) is directly related to the minimum channel width. Hence, ParaLarPD proportionally improves the constraints violation of ParaLaR. Further, ParaLarPD gives on an average
improvement in the channel width when compared with VPR (in VPR, there is no concept of constraints violation). If we look at the total wire length, then ParaLarPD achieves the same value as obtained by ParaLaR, which is on an average
better than the one obtained by VPR.
While the focus of this work is minimizing the maximum channel width, we did measure the impact of the algorithm on the critical path delay. The average critical path delay of ParaLarPD is worse as compared to ParaLaR and better as compared to VPR. Considering the significant improvements in minimum channel width, i.e., 20.16%, and almost the same speed-up (when compared to ParaLaR), we believe this trade-off is still reasonable since the critical path delay is not deteriorated significantly.
Similarly, in
Table 3 (i.e., for
), if we look at the channel width, then ParaLarPD gives on an average
improvement over ParaLaR (as above, this gives the same improvement in the constraints violation). The improvement in comparison to VPR is on average 22.01%. If we look at the total wire length, then ParaLarPD achieves the same value as obtained by ParaLaR, which is on an average
better than the one obtained by VPR. Similar to
Table 2, the critical path delay obtained by using ParaLarPD is only slightly worse than ParaLaR and VPR; about
worse than ParaLaR and
worse than VPR.
To better demonstrate the improvements of ParaLarPD over ParaLaR, we compare them on an average and in percentage terms separately in
Table 4. In this table, we give percentage improvement of ParaLarPD over ParaLaR, where all the results (i.e., the channel width, wire length, and critical path delay) are obtained as the geometric mean of the maximum, minimum, average, and standard values of 100 independent runs for all benchmarks. As above, and also evident from this table, ParaLarPD performs substantially better on channel width and almost the same on total wire length and critical path delay. We see from
Table 4 that for
, the average standard deviation value of critical path delay of ParaLarPD is 86.9% lower as compared to ParaLaR, while for
, ParaLarPD has 111.11% higher critical path delay as compared to ParaLar. However, the standard deviation is just a measure of the amount of variation. It does not imply that ParaLarPD has this much percentage of lower or higher delay.
In addition to above tables, we also represent the maximum, minimum, average, and standard deviation values of the channel width (for all the benchmarks, and when obtaining them for 100 independent runs) of ParaLarPD and ParaLaR [
6] (for the configuration
,
,
and
). This result is shown by the box plot graph, shown in
Figure 2. In this figure, all the values are normalized to VPR (given in
Table 2). From this figure, we can see that ParaLarPD is always better than VPR, since all values are below 1, even in the worst cases observed in 100 runs. Hence, from the above discussion and this figure, we can see that ParaLarPD performs better than ParaLaR and VPR. We have not included the figures for the wire length and the critical path delay since the values of these metrics are almost the same for both ParaLarPD and ParaLaR.
Recall, the underlying goal of ParaLaR and this work is to efficiently parallelize the routing process. Hence, next we report results when using different number of threads in
Table 5. The benchmark dataset used (first column in
Table 5) is the same as discussed in the earlier paragraphs.
First, we discuss the speed-ups obtained when using different number of threads for ParaLarPD. The absolute execution time (in seconds) for different threads is given in columns two through five and is represented as 1X, 2X, 4X, and 8X. The relative speed-ups are given in columns six through eight and are calculated as below.
It can be observed from this data that when we used ParaLarPD with 2 threads, on an average, speed-up of up to times was observed (over the single thread execution). Similarly, when using 4 threads and 8 threads, on an average, speed-up of up to times and times, respectively, was observed (over the single thread execution).
To show the dependency between the size of the benchmark circuits and their speed-ups, we plot a bar graph (
Figure 3). In this figure, on the x-axis, we have the benchmark circuits in the increasing order of their execution time when running them with one thread, which is directly proportional to the size of the benchmark circuits [
6]. On the y-axis, we have the speed-ups of these benchmark circuits when running them with different threads. We use three different colored bars to represent speed-ups with two threads, four threads, and eight threads. From this figure, it can be observed that for large benchmark circuits, the speed-ups of ParaLarPD nearly match the ideal speed-ups (proportional to the number of threads used). For example, the last circuit or “spla” has speed-ups of
,
, and
, which are very close to their corresponding ideal speed-ups of 2, 4, and 8, respectively.
To compare the speed-up of ParaLarPD and ParaLaR, we perform experiments with the same number of threads for both of them. We obtain almost the same execution times for both, and hence, we do not report this data.
Next, we compare ParaLarPD’s execution time with VPR’s execution time. This data is given in columns nine and ten of
Table 5. We can observe from these columns that on an average ParaLarPD (when executed using a single thread) is
times faster than VPR. Since in VPR, there is no concept of using multiple threads, we could not compare ParaLarPD’s data for a higher number of threads with it. Next, we compare the performance of ParaLarPD with RVPack [
13] and GGAPack2 [
13] algorithms. We do not perform a direct comparison between our ParaLarPD and these two algorithms. Rather, we use VPR as an intermediate algorithm for comparison. Thus, in
Table 6, we give the percentage improvement of ParaLarPD over VPR, and improvement of RVPack and GGAPack2 over VPR. We follow this strategy for two reasons. First, the architecture parameters as used in these two algorithms are different from ours, and using our parameters for executing them and their parameters for executing ParaLarPD would lead to an unfair comparison. Second, these two algorithms have randomness associated with them, and hence, the results are difficult to replicate. As evident from
Table 6, ParaLarPD outperforms these two algorithms in-terms of all metrics.