1. Introduction
Residue Number Systems (RNSs) have been used in different applications such as digital signal processing (DSP) [
1] and deep learning systems [
2] to provide low-power, high-speed and fault-tolerant computations [
3]. The main feature of an RNS is fast and parallel implementation of addition and multiplication based on separate modular arithmetic circuits. However, detection of multiplication overflow is one of the difficult RNS problems, since the multiplication of any two operands larger than half of the dynamic range results in overflow. Therefore, the high probability of overflow occurrence in multiplication has motivated researchers to develop overflow prevention mechanisms for an RNS. Scaling (i.e., division of the RNS number by a constant number) is one of the ways to reduce the size of the operands to prevent overflow in RNS operations. However, scaling is a difficult process, since the division operation in an RNS cannot be performed in parallel modular channels like multiplication and addition [
4]. Therefore, usually one of the modulo of the moduli set is selected as the scaling factor to reduce the complexity [
5].
The scaling for general moduli sets is usually realized using look-up tables (LUTs) [
6], while the adder-based implementations can be achieved based on special moduli sets with higher performance. Due to this, there is a variety of works focusing on designing scalers for the well-known RNS three-moduli set {2
n − 1, 2
n, 2
n + 1} [
7,
8,
9]. The authors of [
7,
8] considered the modulo 2
n as the scaling factor. Using 2
n as the scaling factor resulted in simplified scalers with high-precision output. However, using only one modulo as the scaling factor is mostly applicable for addition operations, since it cannot drastically reduce the size of the numbers to prevent multiplication overflow. Due to this, the authors of [
9] proposed two-moduli scaling based on 2
n (2
n + 1) as the scaling factor, which led to a low-precision output. Although this scaling factor can significantly reduce the size of the operands, the limited 3n-bit dynamic range of the three-moduli set {2
n − 1, 2
n, 2
n + 1} is not suitable for two-moduli scaling factors because in this three-moduli RNS system, the values of most numbers are less than the scaling factor (i.e., 2
n (2
n + 1)), which results in a zero output for the scaler, consequently making the next operation faulty. This is a significant problem which indicates the importance of a zero-aware scaling mechanism, which is not covered by previous research.
Therefore, two-moduli scaling factors together with the large dynamic range four- or five-moduli sets, such as {2
k, 2
n − 1, 2
n + 1, 2
n + 1 − 1} [
10], {2
n − 1, 2
n, 2
n + 1, 2
2n + 1 − 1} [
11] and {2
n − 1, 2
n + 1, 2
2n, 2
2n + 1} [
12], {2
n − 1, 2
n + 1, 2
2n, 2
2n + 1 − 1} [
13] and {2
2n + p, 2
n − 1, 2
n + 1, 2
n − 2
(n + 1)/2 +1, 2
n + 2
(n + 1)/2 +1} [
14], should be used. However, there is a limited number of works that consider the scaling for four-moduli sets. The authors of [
15] designed a scaler based on a two-level architecture with the single-modulo scaling factor 2
n + k. The first level of this scaler performs scaling based on the three-moduli set {2
n − 1, 2
n + x, 2
n + 1}, where 0 ≤
x ≤
n, and then the second level computes the final four-moduli scaling using the composite set {2
n + k (2
2n − 1),
m4} [
15]. This two-level architecture requires high hardware requirements due to the multiple uses of modular adders. Furthermore, scaling by the 2
k modulo is not sufficient for large dynamic range four-moduli sets to avoid overflow. In other words, the regular modulo 2
n scaling of the numbers based on the three-moduli set {2
n − 1, 2
n, 2
n + 1} is not equivalent to modulo 2
n scaling in the four-moduli set {2
n − 1, 2
n, 2
n + 1, 2
n + 1 − 1}, since the dynamic ranges of these moduli sets are 3
n- and (4
n + 1)-bit, respectively. Therefore, two-moduli scaling must be used to prevent multiplication overflow for large dynamic range RNS systems.
On the other hand, to have a zero-aware scaler, the operands should be compared with the scaling constant before operation to prevent scaling with zero output. However, magnitude comparison is a difficult RNS operation, and its realization increases hardware complexity [
4]. Here, we address this problem without using an RNS magnitude comparator based on a method for deriving two different scaling outputs from the same circuit.
In the proposed work, first, a low-precision scaler based on two moduli is proposed for RNS four-moduli sets. Then, we reuse the hardware architecture of a low-precision scaler for producing high-precision scaled output that can be used when the low-precision scaler generates zero for non-zero operands. It is shown how new Chinese remainder theorem 2 (New CRT-II) can be used to achieve simplified two-moduli scaling for four-moduli sets. The proposed approach (i.e., a double-output scaler with both low- and high-precision outputs) has two main advantages. First, the high-precision output can be applied to addition operands, while the low-precision output can be used for multiplication operations to prevent overflow by considerably reducing the operands’ size. Second, in the case of using a low-precision output for multiplication, if the low-precision output becomes zero, the high-precision output can be used to prevent overflow, resulting in a zero-aware scaling approach. Moreover, derivation of the proposed general approach for two special large dynamic range four-moduli sets {2n − 1, 2n + 1, 22n, 22n + 1} and {2n − 1, 2n + 1, 22n, 22n + 1 − 1} is presented, and its performance is compared with the conventional method.
In the rest of the paper, the mathematical formulation and proof of the proposed scaling approach for both general and special four-moduli sets are described in
Section 2. Next,
Section 3 presents the fully adder-based hardware design of the proposed scalers. Moreover, a performance comparison is presented in
Section 4. Finally,
Section 5 concludes the paper.
2. Low-Precision Scaling with Two-Moduli Scaling Factor: Mathematical Formulation
This section presents the proposed approach to design scalers for RNS using the New CRT-II [
11]. In the rest of this section, a brief introduction about the scaling concept and new CRT-II is first described. Then, mathematical formulations of the proposed approach in general form and for two sample special forms will be presented.
2.1. Scaling Concept and CRT-II
The scaling of the weighted number
X by the constant factor
K according to the scaling operator defined in [
16] and is as follows:
This formula shows that any weighted number can be formed as a summation of its remainder with scaling factor
K and the multiplication of the scaling result (i.e., S) and
K. In other words,
S is the integer quotient of dividing
X by
K, and it can be expressed as follows [
5,
7]:
Note that Equations (1) and (2) are based on weighted numbers. However, they should be implemented inside RNS using residues. Therefore, consider the following residue representations for
X and
S based on the four-moduli set {
m1,
m2,
m3,
m4}:
where the scale factor
m1 is one of the moduli. Aside from that, also consider
Second, consider the RNS number (
x1,
x2,
x3,
x4), which can be converted into its corresponding weighted number
X using the New CRT-II conversion formulas for the generic four-moduli set {
m1,
m2,
m3,
m4} as follows [
11]:
where the required multiplicative inverses can be achieved by considering the following relations:
Equations (6)–(8) can be rewritten as follows:
where
2.2. General Formulations
Now, we choose the scaling factor as the product of the first and second modulo (i.e.,
m1m2). Therefore, scaling of
X by
m1m2 can be performed by considering
k =
m1m2 and substituting Equation (6) into Equation (4) as follows:
where
x1 is a residue in modulo
m1 and the maximum value of
H in Equation (16) is
m2 − 1. Therefore, the maximum value of
Z in Equation (13) can be computed as follows:
It is clear that the floor of the division of
ZMax by
m1m2 is zero. Therefore, by considering this point and taking into account that
T is an integer number, Equation (18) can be simplified as follows:
Now, according to Equation (5), the residues of
T based on the moduli should be computed to achieve the residues of the scaled number as follows:
Therefore, scaling of Z by m1m2 is reduced to T, and the full reverse conversion (i.e., full computing of Equation (6)) is not needed.
Now, we are going to achieve a single-modulo scaler for the same moduli set, (i.e., {
m1,
m2,
m3,
m4}) but with the aim of reusing the two-moduli scaler formulas to reduce the overhead. Hence, considering
k =
m1 and the main CRT-II formula of Equation (6) in Equation (4) results in
Insertion of Equation (13) into Equation (22) leads to
Therefore, since
x1 is less than
m1, and
H and
T are integer numbers, Equation (23) can be simplified as follows:
Now, according to Equation (5), we have
According to the residue arithmetic properties [
6], Equation (25) can be rewritten as
However, from Equation (21), we know the remainders of
T in moduli
mi are the two-moduli scaling residues. Therefore, we have
Therefore, by using Equation (21), the single-modulo scaling residues can be achieved from the previously computed two-moduli scaling residues with the minimum overhead. It should be mentioned that more simplifications of Equation (21) can be performed using the exact value of the moduli as shown in the next subsections.
2.3. Case Study: Moduli Set {22n + 1, 22n,2n + 1, 2n − 1}
The moduli set has a 6
n-bit dynamic range, and its reverse converters are all designed based on the New CRT-I [
11]. However, in contrast to the reverse converters of this moduli set, here, we use the New CRT-II to derive efficient two-moduli scaling formulas. First, consider the moduli set {
m1,
m2,
m3,
m4} = {2
2n + 1, 2
2n, 2
n + 1, 2
n − 1}. According to Equation (20), we must compute
T in Equation (15), and then its residues are the low-precision scaling residues. First, the following lemma computes the required multiplicative inverses.
Lemma 1. The multiplicative inverses required in Equations (15)–(17) are k1 = 22n − 1, k2 = 1 and k3 = 2n − 1.
Proof of Lemma 1. Verification can be performed by substituting the values of the multiplicative increases and moduli in Equations (9)–(11) as follows:
Now, inserting the values of the moduli and multiplicative inverses in Equations (13)–(17) leads to
Equation (31) can be further simplified by substituting Equations (31) and (32) into it as follows:
The following well-known residue arithmetic properties can be used to further simplify Equations (33)–(36).
Property 1. is equal to the P-bit circular left shifting of vi if vi is represented as a k-bit binary number [11]. Property 2. is equal to one’s complement of vi (i.e.,) if vi is represented as a k-bit binary number [11]. Property 3. is equal to if vi is represented as a k-bit binary number [17]. Property 4. is equal to if vi is represented as a k-bit binary number [17]. First, according to the moduli set {2
2n + 1, 2
2n, 2
n + 1, 2
n − 1},
x1 and
x2 are (2
n + 1)- and 2
n-bit numbers, respectively. Therefore, Equation (33) can be simplified using
Property 3 as follows:
where
xi,j means the
j-th bit of the residue
xi and
x4 and
x3 are (
n + 1)- and
n-bit numbers, respectively. Therefore, Equation (34) can be rewritten as
where
x3 is a residue in modulo 2
n + 1. Therefore, when
x3,n is equal to one, the other bits will be surely be zero, and if the
n low significant bits (LSBs) of
x3 are not equal to zero, then the most significant bit (MSB) of
x3 (i.e.,
x3,n) should be zero [
12]. Therefore, by considering this point and Properties 1 and 2, Equation (38) can be simplified as follows:
where
Finally, Equation (35) can be simplified using Properties 1 and 2 as follows:
where
Note that
P is an
n-bit number, and due to this computation of Equation (44), it became a simple concatenation. Aside from that, the constant coefficient of
H in Equation (47) was substituted with −1 since
Next, after calculation of
T using Equation (42), we must compute the residues of
T according to Equation (21) to achieve the two-moduli scaled residues. However, the largest value of
T is 2
2n − 2, and therefore, it is always less than the first and second moduli. Hence, we have
The third and fourth two-moduli scaled residues can be achieved as follows:
Now, based on Equation (27), we can also achieve single-modulo scaling formulas from the two-moduli scaling residues as follows:
We can simplify Equation (27) by substituting Equation (49) into it and considering that the maximum value of
H in (37) is 2
2n − 1. Therefore, we have
Similarly, for other residues, we have
Therefore, we can compute H and T from Equations (37) and (42) just one time and then using them several times to compute both the single- and two-moduli scaling.
2.4. Case Study: Moduli Set {2n − 1, 2n + 1, 22n, 22n + 1 − 1}
This moduli set has the same moduli as {2n − 1, 2n + 1, 22n, 22n + 1} except for 22n + 1 which is substituted with 22n + 1 − 1. Due to this, it can lead to a faster RNS arithmetic unit. However, its reverse converter will be more complex. The overall process of designing the scaler for this moduli set is relatively the same as for the moduli set {2n − 1, 2n + 1, 22n, 22n + 1} described in the previous subsection.
First, consider the moduli order {
m1,
m2,
m3,
m4} = {2
2n + 1 − 1, 2
2n, 2
n + 1, 2
n − 1}. Then, according to Equations (15)–(17), the multiplicative inverses can be computed as
k1 =
k2 = 1, and
k3 = 2
n − 1 (the proof is straightforward and similar to Lemma 1). Therefore, Equation (15) is a key formula in the scaling that can be calculated as follows:
where
The two-moduli scaled residue formulas for the moduli set {2
2n + 1 − 1, 2
2n, 2
n + 1, 2
n − 1} are the same as those for the moduli set {2
2n + 1, 2
2n, 2
n + 1, 2
n − 1} (i.e., Equations (49)–(52)), since all of them are based on
T. That aside, the single-modulo scaled residues are the same as in Equations (55)–(57) except for the first scaled residue, which is as follows:
3. Low-Precision Scaling with Two-Moduli Scaling Factor: Hardware Design
This section presents the full adder-based and memory-free hardware design of the proposed RNS scaler. The overview of the proposed approach for a generic four-moduli set is depicted in
Figure 1. First,
P,
H and
T are computed using Equations (15)–(17), and then, the two-moduli scaled residues are computed using Equation (21). Afterward, the single-modulo scaled residues are obtained using the precomputed two-moduli scaled residues based on Equation (27). The important part of the scaler is the calculation of
T that is shared between both kinds of scaling, resulting in a significant hardware reduction.
The scaler of
Figure 1 is designed based on a general value of the moduli. However, for special RNS moduli sets with power-of-two moduli such as {2
2n + 1, 2
2n, 2
n + 1, 2
n − 1}, the design can be considerably simplified, as presented in
Figure 2. First, the
H in Equation (39) is implemented using a 2
n-bit regular carry-propagate adder (CPA) where its carry-in is connected to one. Aside from that,
P also requires a modulo 2
n − 1 CPA, which can be implemented using an
n-bit CPA with EAC [
18] based on Equation (40).
The operand preparation unit performs the required inversions, shifting and multiplexing needed in Equation (41). Then, the important variable
T in Equation (42) can be realized using three carry-save adders (CSAs) with EAC followed by a modulo 2
2n − 1 CPA [
18]. Then, according to Equations (49)–(52), the first and second two-moduli scaled residues are equal to
T, and the third and fourth are only the reduction of
T in moduli 2
n − 1 and 2
n + 1, which can be realized using an
n-bit CPA with EAC and
n-bit CPA with complement EAC (CEAC), respectively. Note that CPA-CEAC is a representation of the modulo 2
n + 1 adder which can be realized using different methods [
19]. Finally, the single-modulo scaled residues can be achieved using Equations (54)–(57). The CSAs are used to compress the three operands into two, and then a modulo adder produces the scaled residue. It can be seen that in the customized version of the scaler for the moduli set {2
2n + 1, 2
2n, 2
n + 1, 2
n − 1}, the units for
m1 and
m2 reduction in the two-moduli scaling part are removed, since the scaled residues are equal to
T. Aside from that, the second single-modulo scaled residue is
H, and hence, the required
m2 reduction unit is removed.
Finally, Algorithm 1 shows how the proposed hardware architecture can be used to provide zero-aware RNS scaling. If the low-precision scaled residues become zero, then the high-precision scaled residue should be used as the output, except in the case that they also become zero. In this case (i.e., both scaler outputs become zero), the number is very small and is less than both of the scaling constants. In this case, its original value can be used in the computations. Note that here we do not use any magnitude comparator which is a complex unit in RNS, and only by checking the scaled residues against zero we could evaluate the relative magnitude of the number (less or greater than the scaling coefficients).
Algorithm 1: Zero-Aware RNS Scaling. |
Input:Non-Zero RNS Number (x1, x2, x3, x4) |
Output:Non-Zero Scaled RNS Number (s1, s2, s3, s4) |
1: Calculate the low-precision scaled residues (sl1, sl2, sl3, sl4) |
2: If (sl1, sl2, sl3, sl4) ≠ (0, 0, 0, 0) Then return (sl1, sl2, sl3, sl4) |
3: Calculate the high-precision scaled residues (sh1, sh2, sh3, sh4) |
4: If (sh1, sh2, sh3, sh4) ≠ (0, 0, 0, 0) Then return (sh1, sh2, sh3, sh4) |
5: Return original residues (x1, x2, x3, x4) |
4. Performance Evaluation
The majority of the available state-of-the-art RNS scalers are dedicatedly designed for three-moduli sets, and only [
15] presents the first RNS scaler design for four-moduli sets. The RNS scaler for the moduli set {2
2n + 1 − 1, 2
2n, 2
n + 1, 2
n − 1} is fully designed in [
15] based on the scaling factor 2
2n as shown in
Figure 3. To perform a technology-independent performance comparison, the unit-gate (U-G) model is used according to [
15] for comparative assessment of the works. All the assumptions considered in [
15] for estimation of the area and delay of modular adders are also considered here for a fair comparison, as shown in
Table 1.
Note that in the U-G model, each XOR or XNOR gate counts as two unit gates in the area and delay, and an AND or OR gate is considered one unit gate for both the area and delay. Therefore, the combinatorial circuits such as FAs, half adders (HAs) and one-bit 2×1 multiplexers count as 7, 3 and 3 unit gates in area and 4, 2 and 2 gates in the delay, respectively. Aside from that, the U-G area and delay estimations for each component of the proposed scaler is described in
Table 2. Note that the gray lines in
Table 2 are not on the critical delay path.
Finally, the overall area and delay estimations for scalers are presented in
Table 3 for a general value of
n. It can be seen that while the proposed low-precision scaler is based on a two-moduli scaling factor, the hardware requirement is less than the single-modulo scaler for the same moduli set, while the delay is almost the same. That aside, the proposed scaler outperforms the design of [
15] in terms of hardware requirements. Furthermore, as is expected, the high-precision single-modulo version requires a higher area and delay since it is computed based on the output of the two-moduli scaler.