1. Introduction
The key concept of coordinate rotation digital computer (CORDIC) arithmetic is that it can effectively compute trigonometric functions, vector rotation, multiplication, and division through an iterative formulation of shift and add operations. Ever since the CORDIC algorithm was first described in 1959 by Jack E. Volder [
1,
2], a wide variety of explorations of CORDIC applications, algorithms, and architectures have been investigated to find low-cost, high-performance hardware solutions [
3,
4,
5,
6,
7,
8,
9,
10,
11,
12].
Throughput and power consumption have become the major issues in CORDIC algorithm design, especially for implementations in embedded systems with limited resources. As CORDIC acts as a basic arithmetic operator in addition to adders, subtractors, and multipliers in hardware implementation, their performance greatly affects the performance of the entire system, especially in CORDIC applications requiring high performance, such as fast Fourier transform (FFT) [
13] or embedded FPGA-based synthesizers, including chaotic Pseudo-random number generators [
14]. In particular, the latency of CORDIC computation is a major issue due to the large number of iterations required to preserve sufficient precision of the output, despite its linear-rate convergence [
15]. Therefore, the speed of CORDIC operations is limited either by the required precision (number of iterations) or the clock period. Angle recoding (AR) methods [
5,
6,
8,
9,
10] can be used to reduce the number of CORDIC iterations by encoding the rotation angle as a linear combination of a set of selected elementary angles. However, selective implementation of micro-rotation carries significant scaling overhead.
Parallel processing and pipelining are two popular techniques used to enhance the performance of any computing system [
16]. Area is traded for speed in parallel processing while latency and register complexity are traded for speed in pipeline processing [
17,
18]. The scope of parallel processing and pipelining for improving the performance of CORDIC has been explored in the literature [
9,
19,
20,
21]. Due to the inherent sequential nature of CORDIC, it is not possible to have a full parallel CORDIC, and any parallel implementation of CORDIC carries significant overhead. In [
9], it was further shown that the rotation direction can be determined once the input angle is known in order to enable parallel realization of micro-rotations. Despite the parallel execution of CORDIC rotation according to [
9], the decomposition method of each positional binary weight produces extra micro-rotation stages, especially as the word length of the input angle increases.
An alternative approach to speed up CORDIC is to reduce the clock period and increase the throughput rate of the CORDIC output. As the CORDIC iterations are identical and involve nearly the same complexity, mapping them onto pipelined architectures is very hardware-friendly. The main objective of pipelined implementation is to reduce the critical path. The pipelined architecture of CORDIC was suggested by Deprettere et al. in 1984 [
21]. Pipelined CORDIC circuits were used thereafter for high-throughput sinusoidal wave generation, FFT, adaptive filters, and other signal processing applications [
22,
23,
24]. Given each pipeline stage performs one predetermined micro-rotation, the number of shifts required for shift-add or shift-sub operations in each pipeline stage is known a priori. Therefore, shift operations can be hardwired, and barrel-shifters can be completely eliminated in the non-recursive cascaded implementation of CORDIC. Then, the critical path of pipelined CORDIC amounts to the time required to execute add or subtract operations in each stage.
Fully-pipelined-cascaded CORDIC (FPCC) has potential for a very low clock period and very high throughput. However, such high throughput rate is usually not required in real applications. For example, when CORDIC is used to implement the complex butterfly operation for a fast Fourier transform (FFT), the throughput rate provided by CORDIC must match the throughput requirement of the application and the type of FFT implementation, such as fully-pipelined implementation, folded-pipeline implementation, and the number of butterfly circuits. On the other hand, we find that removing a pair of pipeline registers to merge two pipeline stages does not substantially increase the critical path. The propagation delay of a pair of CORDIC stages is generally assumed to be the sum of delays of two adders or two subtractors. However, such an assumption is currently invalid in ASIC and FPGA implementation, wherein adders or subtractors in each stage are not considered as discrete components. We also find that the rear stages in the cascaded CORDIC have longer propagation delays than the preceding stages, thus inserting pipeline registers for each is not a good pipelining strategy for a given number of stages. However, there is no systematic analysis regarding the appropriate pipeline decision in a cascaded CORDIC. In this paper, we discuss the fine-grained estimation of propagation delays in cascaded CORDIC. Based on that, we derive a formulation that could be used by a designer to determine the number of pipeline stages to be considered and where to place the pipeline registers in the cascaded design.
Cascaded CORDIC involves one CORDIC unit for each micro-rotation, while at the other extreme, recursive CORDIC uses only one CORDIC unit for all micro-rotations. Instead, we can have a hybrid of non-recursive cascaded CORDIC and recursive CORDIC, which we refer to as pipelined-recursive-cascaded CORDIC (PRCC). PRCC consist of a cascade of a few CORDIC units, where each such CORDIC unit performs a certain number of micro-rotations recursively. In this paper, we investigate the design and implementation of such recursive cascade as another design option for the cascaded CORDIC.
The rest of the paper is organized as follows. A critical path analysis for a cascaded CORDIC with unknown rotation angles is presented in
Section 2. The fine-grained critical path analysis of a cascaded CORDIC for known rotation angles is presented in
Section 3. An algorithm for determining the number of pipeline registers and their locations for a given timing constraint is also proposed in
Section 3. A hybrid recursive cascaded CORDIC is presented in
Section 4. The performance of the proposed designs in terms of area, throughput, latency, and power consumption is discussed in
Section 5. Conclusions are given in
Section 6.
2. Critical Path Analysis of Cascaded CORDIC for Unknown Rotation Angles
The rotation mode of the CORDIC algorithm performs vector rotation iteratively, as follows [
1,
2]:
for
, where
N is the number of micro-rotations, and
if
and
otherwise. The initial value of
is a two-dimensional vector to be rotated, and
is set to the rotation angle
.
Figure 1a shows the structure of the conventional CORDIC circuit for recursive implementation of all micro-rotations whereas
Figure 1b shows the structure of the cascaded CORDIC where possible locations of pipeline registers are marked with a thick solid lines. The barrel-shifters in
Figure 1a can be completely eliminated in the cascaded CORDIC because the shift operations can be hardwired.
If the rotation angle is not known in advance, i.e.,
for
is not given, a circuit for angle update according to Equation (1c) is required. The propagation delay of the fully-pipelined-cascaded CORDIC (FPCC), where all the thick solid lines between the CORDIC sections in
Figure 1b are replaced with pipeline registers, can be represented as
where
and
are propagation delays of an inverter and
L-bit adder or subtractor, respectively. Sign extension to avoid overflow is not considered because the word-length is generally retained during CORDIC operations. Assuming that a ripple carry adder (RCA) is used for the adder or subtractor, we have [
25]
where
is a delay of a 2-input XOR gate.
,
, and
are delays of the 1-bit full adder (FA) from port
input-A to port
carry-out, from port
carry-in to port
carry-out, and from port
carry-in to port
sum, respectively. Note that the propagation delay of an adder or subtractor increases by
as the bit-width of the input increases by 1.
If all pipeline registers marked by thick solid lines in
Figure 1b are removed, we can obtain a non-pipelined-cascaded CORDIC (NPCC), and the black-dotted line in
Figure 1b becomes the critical path. Then, the propagation delay of an
N-stage,
L-bit NPCC
can be generalized as follows:
where
is taken to obtain
,
, …,
, and another
is the propagation delay of the inverter and adder or subtractor in the last stage.
4. A Recursive Cascaded CORDIC
In this Section, we propose a recursive cascaded CORDIC design, namely a pipelined-recursive- cascaded CORDIC (PRCC).
Figure 4 shows the structure of a 2-stage PRCC that performs
N successive micro-rotations using two CORDIC units. When
N is even, the first
micro-rotations out of
N micro-rotations are performed during the first stage, whereas the remaining
micro-rotations are performed during the second stage. A pair of 2:1 MUXes in the first stage selects the initial input values
and
during the first clock cycle of the period of the first
clock cycles, and it selects outputs from two adders or subtractors in the first stage during the subsequent
clock cycles. The first stage is responsible only for the first
cycles of micro-rotations and passes the intermediate results
and
to the second stage. Another pair of inputs are taken and fed in parallel to the MUXes in every
cycles. A pair of 2:1 MUXes in the second stage selects
and
from the first stage every
cycles, and the second stage performs the remaining
micro-rotations. PRCC can also be implemented with more than two CORDIC units. If
N is represented as a multiple of
K, where
, we can have a
K-stage PRCC, where each stage is responsible for
micro-rotations, and each stage receives a pair of inputs from the previous stage every
cycles.
The hardware-complexity of the barrel-shifter in a PRCC can be effectively reduced with a simple hardwired pre-shifting scheme [
7], as shown in the second stage of
Figure 4. Only the
most significant bits of the input words from the registers can be loaded to the barrel-shifters, because
is the minimum number of shifts in the second stage and the
less significant bits would become truncated during shifting. Therefore, the barrel-shifter must implement a maximum of
shifts. The output from the barrel-shifters are loaded into the
LSBs of the adder or subtractor, and the
MSBs of the corresponding operand are sign-extended. The barrel-shifter in the second stage can be implemented using
number of MUXes, whereas
MUXes are required without using hardwired pre-shifting. Therefore, the area complexity of a PRCC can be reduced by implementing hardwired pre-shifting.
The propagation delay of the barrel-shifter could also be decreased by hardwired pre-shifting. As the maximum number of shifts in the first and second stages are equal, as shown in
Figure 4, each stage has the same propagation delay
, where
is the propagation delay of the 2:1 MUX. Therefore, the critical path of a PRCC is determined by any of its stages. However, the critical path of a PRCC is less than that of a conventional CORDIC because the maximum number of shifts with the barrel-shifters in a PRCC is less than that in a conventional CORDIC. The propagation delay of the
K-stage,
L-bit PRCC with
N micro-rotations is
Note that is the delay of a barrel-shifter and another is the delay of a 2:1 MUX used for input selection.
5. Design Examples and Results
The hardware and time complexities of the proposed designs and conventional designs for known rotation angles are listed in
Table 3. The proposed
K-stage PPCC is compared with an NPCC and FPCC [
21], which are two opposing CORDIC versions in terms of pipeline strategy in a cascaded CORDIC. The proposed
K-stage PRCC is also compared with a conventional CORDIC [
1] computed through recursive computation of one stage and opposing non-recursive cascaded CORDIC computed through
N stages. The propagation delay in a conventional CORDIC is the sum of delays of a 2:1 MUX, barrel-shifter, and adder or subtractor. FPCC has the shortest critical path, which involves an inverter and an adder or subtractor. The FPCC, NPCC, and PPCC produce an output sample during every cycle when they are used in the pipelined CORDIC applications, whereas the conventional CORDIC and
K-stage PRCC produce the output sample every
N and
cycles, respectively. The
K-stage PRCC has
K CORDIC units, thus it involves a factor
K more adders or subtractors, barrel-shifters, registers, and MUXes than a conventional CORDIC. However, the area of a
K-stage PRCC is less than a factor
K greater than that of a conventional CORDIC due to its hardwired pre-shifting. The propagation delay of a PRCC can be reduced to be less than that of a conventional CORDIC.
We coded Designs-1, 2, 3, 4, and 6 in the design examples in
Table 2 in VHDL and synthesized those using the Synopsys Design Compiler with the TSMC 90 nm general purpose standard CMOS library [
26]. A conventional CORDIC [
1], 2, 4, and 8-stage PRCCs, and an FPCC [
21] were also synthesized and compared for known rotation angles. The values of
L and
N were 16. The maximum propagation delay, throughput per second (TPS), latency required to obtain the first output sample, area, power consumption at 50 MHz operating frequency, energy consumed per sample (EPS) required to produce a 50 MHz clock output, and area-delay product (ADP) are listed in
Table 4.
The conventional CORDIC has the smallest area of
m
because it involves only one CORDIC unit with throughput rate of
Mega samples per second (MSPS). It is shown that PRCCs have lower delays than a conventional CORDIC due to the hardwired pre-shifting. Specifically, a 2-stage PRCC occupies a factor
greater area than a conventional CORDIC, but it offers nearly double the throughput in the pipeline CORDIC application,
less EPS, and
less ADP. Similarly, the 4-stage and 8-stage PRCCs provide factors
and
larger throughput,
and
less EPS, and
and
less ADP compared to the conventional design, respectively. FPCC can produce an output during every cycle at 549 MHz maximum operating frequency. However, in applications that do not require such high throughput, NPCC or PPCC can be alternative architectural options by removing unnecessary pipeline registers. If the throughput rate of less than 60 MSPS is acceptable, NPCC becomes the one of the best options as it requires
less area and uses
less EPS compared to the fully-pipelined design. 2, 3, 4, and 8-stage PPCCs produce 109 MSPS, 155 MSPS, 202 MSPS, and 310 MSPS, but require
,
,
, and
less area with
,
,
, and
less EPS than FPCC, respectively. The 4 and 8-stage PPCCs offer
to
savings in EPS and
and
savings in ADP over a conventional CORDIC. One should note that the propagation delays of PPCCs obtained by synthesis closely match the estimated propagation delays shown in
Table 2.
We designed NPCC, PPCC, and FPCC for unknown rotation angles. The proposed designs for different number of pipeline stages with different timing constraints were synthesized, and
Table 5 lists the synthesis results. Note that 3, 4, and 6 stages are sufficient to produce greater than 100 MSPS, 150 MSPS, and 200 MSPS TPS, which requires
,
, and
less area and
,
, and
less EPS than an FPCC, respectively. From
Table 5, we know that remarkable savings in terms of area and EPS can also be obtained over fully-pipelined designs, even for unknown rotation angles.