Next Article in Journal
A Bibliometric Analysis Exploring the Acceptance of Virtual Reality among Older Adults: A Review
Previous Article in Journal
access@tour by Action: A Platform for Improving Accessible Tourism Conditions
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Area–Time-Efficient High-Radix Modular Inversion Algorithm and Hardware Implementation for ECC over Prime Fields

Computer Architecture Laboratory, Department of Computer Science, Faculty of Computer and Information Sciences, Hosei University, Tokyo 184-8584, Japan
Computers 2024, 13(10), 265; https://doi.org/10.3390/computers13100265
Submission received: 10 September 2024 / Revised: 8 October 2024 / Accepted: 10 October 2024 / Published: 12 October 2024
(This article belongs to the Special Issue Using New Technologies in Cyber Security Solutions (2nd Edition))

Abstract

:
Elliptic curve cryptography (ECC) is widely used for secure communications, because it can provide the same level of security as RSA with a much smaller key size. In constrained environments, it is important to consider efficiency, in terms of execution time and hardware costs. Modular inversion is a key time-consuming calculation used in ECC. Its hardware implementation requires extensive hardware resources, such as lookup tables and registers. We investigate the state-of-the-art modular inversion algorithms, and evaluate the performance and cost of the algorithms and their hardware implementations. We then propose a high-radix modular inversion algorithm aimed at reducing the execution time and hardware costs. We present a detailed radix-8 hardware implementation based on 256-bit primes in Verilog HDL and compare its cost performance to other implementations. Our implementation on the Altera Cyclone V FPGA chip used 1227 ALMs (adaptive logic modules) and 1037 registers. The modular inversion calculation took 3.67 ms. The AT (area–time) factor was 8.30, outperforming the other implementations. We also present an implementation of ECC using the proposed radix-8 modular inversion algorithm. The implementation results also showed that our modular inversion algorithm was more efficient in area–time than the other algorithms.

1. Introduction

Nowadays, Internet of Things (IoT) applications use hardware security modules to ensure secure communications. In such a constrained environment, execution time and hardware costs are critical to efficient system design. Elliptic curve cryptography (ECC) is one of the most advanced public key cryptographic techniques. It requires a smaller key than other methods, to achieve roughly the same level of security.
ECC can be used to provide secure key agreement between two parties over an insecure network. It can also be used for digital signatures, to verify the authenticity and integrity of digital messages. Modular inversion is a critical operation in ECC. ECC calculates points on an elliptic curve over a finite field (such as a field of prime numbers) based on point addition (PA) and point doubling (PD) computations. In affine coordinates, PA and PD must calculate the slope of a line. Such calculations involve costly modular inversions. In projective or Jacobian coordinates, PA and PD do not require such calculations, but a modular inversion is still required to transform the points into affine coordinates to obtain the same key for the two parties.
Given a prime number m, the inverse r of a number a with a < m is defined as r = a 1 mod m . There are two main popular methods for calculating modular inversion:
  • Extended Euclidean algorithm (EEA) without using divisions.
  • Using Fermat’s little theorem a m 1 = 1 mod m  [1]: r = a m 2 mod m = a 1 mod m .
We will see that the method using Fermat’s little theorem takes more time and requires more registers than the EEA. Therefore, we focus our design on the use of the EEA.
The EEA inherently needs divisions. It calculates the integer quotient and the remainder based on the quotient. The divisions can be replaced by addition, subtraction, and shift operations. For simplicity, we will also refer to the EEA which does not use divisions as an EEA.
To calculate r = a 1 mod m , the EEA first initializes the variables u , v , x , y with inputs a , m , 1 , 0 , respectively. Then, the EEA repeats calculations containing only addition, subtraction, and shift operations on u , v , x , y until u = 1 or v = 1 . Finally, the modular inversion result is available by adjusting x or y, corresponding to u = 1 or v = 1 . A modular inversion algorithm is said to be fast if u or v reaches 1 quickly.
The most widely used modular inversion algorithm is Algorithm 2.22, proposed by Hankerson, Menezes, and Vanstone [2]. It repeatedly shifts u or v to the right when u or v is even. Correspondingly, x is also shifted to the right with the shift of u; y is also shifted to the right with the shift of v. Note that, when x or y is odd, m will be added before the shift. This guarantees that the value to be right-shifted is even, since the prime m is odd. Next, if u v , u and x are replaced by u v and x y , respectively. Otherwise, v and y are replaced by v u and y x , respectively. Finally, the result is x mod m if u = 1 , and y mod m otherwise. Hossain and Kong [3] revised Algorithm 2.22 by adding m to x or y if it is negative. This ensures that x and y are non-negative. Daly, Marnane, Kerins, and Popovici [4] revised Algorithm 2.22 by dividing u v or v u by two because the subtraction result is even (both u and v are odd before the subtraction). Correspondingly, x y or y x also needs to be divided by two: If x y or y x is odd, m is added before the division. Division by two is performed by shifting one bit to the right. Mrabet, El-Mrabet, Bouallegue, Mesnager, and Machhout proposed a modular inversion algorithm [5] with u + v . Instead of u v or v u , as Algorithm 2.22 uses, they perform u + v for new u or v. This operation slows down the speed at which u or v reaches 1, increasing the execution time. Chen and Qin proposed a modular inversion algorithm [6] that only uses adders. Subtractions are performed by addition with inversion and addition by 1. Choi, Lee, Kong, and Kim proposed a modular inversion algorithm [7] that replaces the repeated shift of u or v and the corresponding shift of x or y in Algorithm 2.22 by a selection of u , x or 0 , 0 , or a selection of v , y or 0 , 0 , based on the even/odd of v or u. This simplifies the circuit by replacing adders with multiplexers, reducing the circuit delay. In addition, they use v and y , instead of v and y, during the calculation. This merges u v and v u into u + v and merges x y and y x into x + y , reducing the circuit cost. Mixed radix-4 modular inversion algorithms were investigated in [7,8,9,10]. If u or v is divisible by four, u or v is shifted to the right by two bits. Otherwise, if u or v is even (divisible by two), u or v is shifted to the right by one bit. Otherwise (both u and v are odd), u v or v u is shifted to the right by one bit and assigned to u or v. Correspondingly, x or y is adjusted by adding m , m, or 2 m and shifted to the right by two bits or one bit. [8] proposed a radix-4 modular inversion algorithm that uses sequential condition checking for the calculation of u, v, x, and y. [9] implemented the SM2 ECC protocol. The iterations of the modular inversion are controlled by the bit counter ρ , resulting in unnecessary iterations. Using u and v to control the iterations will finish the calculation quickly. [10] presented a radix-4 version of Algorithm 2.22. Dong, Zhang, and Gao proposed a mixed radix-8 modular inversion algorithm [11] that uses extensive hardware resources. Hao et al. presented a lightweight architecture for elliptic curve scalar multiplication over prime fields [12]. They revised Algorithm 2.22 by using only adders and forced x y and y x to be in the range 0 to m. Guo et al. proposed a modular inversion algorithm [13] that makes v always be odd. If u is even, it is shifted one bit to the right. Correspondingly, x is also shifted one bit to the right (if x is odd, m is added before the shift). Otherwise (u is odd), u v (if u v ) or v u (if u < v ) is shifted one bit to the right, and the shifted value is assigned to u. Meanwhile, x y or y x is shifted one bit to the right, and the shifted value is assigned to x. Then, v is updated with u or v; y is updated with x or y. The above calculations are repeated until u becomes 1. Then, the result of the modular inversion is x. In lines 14 and 15 of their algorithm, x 1 and x 2 are compared and x 1 is guaranteed to be non-negative. However, due to the division by two, the dividend must be adjusted so that it is even. If it is odd, m (p in their algorithm) must be added to it before the division. These codes were not presented in their algorithm.
The AT (area–time) factor is often used for comparisons between implementations. It is defined as the execution time in milliseconds multiplied by the required hardware resources consisting of registers and lookup tables or ALMs (adaptive logic modules).
In this paper, we implement and evaluate all the algorithms mentioned above. We then propose a mixed radix-8 modular inversion algorithm aimed at reducing the execution time and hardware costs. We give a detailed hardware implementation in Verilog HDL based on 256-bit prime numbers. This had lower hardware costs for ALMs and registers and had better performance than the other algorithms. The implementation on the Altera Cyclone V FPGA chip used 1227 ALMs and 1037 registers and took 3.67 ms for the modular inversion computation. It achieved an AT factor of 8.30, lower than all other implementations. We show that the proposed algorithm is also efficient for 192-bit and 521-bit prime numbers. We implemented ECC using different modular inversion algorithms and compared their cost performance. The ECC implementation results showed that our modular inversion algorithm was more efficient in area–time than the other algorithms. We also present an efficient implementation of the Montgomery ladder scalar point multiplication algorithm, a constant execution time algorithm that is resistant to side-channel attacks. The short execution time and low hardware cost of our algorithm and implementation are significant advantages, especially in constrained environments where computing and battery power are limited.
The rest of this paper is organized as follows: Section 2 introduces ECC and modular inversion algorithms. In Section 3, we propose a mixed radix-8 modular inversion algorithm, give its hardware implementation in Verilog HDL, and compare its cost performance with other algorithms. In Section 4, we provide an ECC implementation using the proposed radix-8 modular inversion algorithm and compare its cost performance with ECC implementations using other modular inversion algorithms. This section also presents an efficient implementation of the Montgomery ladder scalar point multiplication algorithm. In Section 5, we discuss some issues related to the algorithm and hardware design. We conclude the paper and suggest some future research topics in Section 6.

2. ECC and Modular Inversion Algorithms

This section introduces ECC and modular inversion algorithms based on the EEA.

2.1. Elliptic Curve Cryptography

ECC [14,15] relies on the fact that scalar point multiplication Q = d P can be computed, but it is almost impossible to compute d given only the original point P and the point of the product Q. An ECC over the finite field of an n-bit prime number m can use the following equation:
y 2 = x 3 + a x + b mod   m
For example, the Secp256k1 [16] elliptic curve used in Ethereum Blockchain uses a 256-bit m = 2 256 2 32 2 9 2 8 2 7 2 6 2 4 1 . Secp256k1 defines y 2 = x 3 + a x + b = x 3 + 7 and gives a point P = [ x , y ] on the elliptic curve, as follows:
  a = 0x0000000000000000000000000000000000000000000000000000000000000000
  b = 0x0000000000000000000000000000000000000000000000000000000000000007
  m = 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f
  x = 0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798
  y = 0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8
The elliptic curve Diffie–Hellman (ECDH) key exchange protocol can be used by two parties, Alice and Bob for example, to establish a shared secret key over an insecure network [16,17]. The ECDH protocol is shown in Table 1.
Because Q a b = d a Q b = d a d b P , Q b a = d b Q a = d b d a P , and d a d b = d b d a , we have Q b a = Q a b . Below is an ECDH key exchange example using Secp256k1. We can see that the two parties, Alice and Bob, have the same shared secret key (Qabx = Qbax).
  Alice generates and keeps d a secret and exposes Q a = d a P :
da   = 0x650aa7095daeaa37ab9051541f0ce304f8969a6d88bb3bebb4fe680fca9a2595
Qax  = 0x167d2537aa6bbd8d978b58be0f9466520b7b184e205ff96a9ff567b35b32c7b7
Qay  = 0xde3961553d36551f92726fee0e332133960edddccd2784b98b2af730d2fc6e14
Bob generates and keeps d b secret and exposes Q b = d b P :
db   = 0xedc68f194c4e30d6ef90467df822b00e5ef122dea48c9d1c54817080d1a341f4
Qbx  = 0x839da64a414c2243a5526230603109be9c615613a9e98c3d650bb0488580bbda
Qby  = 0x96e88e99304a5afcdd77c4f3b3327a28162627ebe08194baa0c78dfb67a11042
Alice obtains Q b and calculates Q a b = d a Q b :
Qabx = 0x1f254c7da15899275cdcab9d992f58251a4ab630fe9864d20cf317ab57749947
Qaby = 0xd6cb400b3c49d33d3df28f9d34fa09f8b6c8edf117a378c5a45d0a51e6c0debc
Bob obtains Q a and calculates Q b a = d b Q a :
Qbax = 0x1f254c7da15899275cdcab9d992f58251a4ab630fe9864d20cf317ab57749947
Qbay = 0xd6cb400b3c49d33d3df28f9d34fa09f8b6c8edf117a378c5a45d0a51e6c0debc
Now, Alice and Bob have the same secret key (Qabx = Qbax). They can use symmetric-key cryptography for subsequent communications. A third party, Eve for example, knows y 2 = x 3 + a x + b mod   m , P, Q a , and Q b , but cannot calculate the same secret key.

2.2. Point Addition and Point Doubling

Scalar point multiplication Q = d P uses point addition (PA) and point doubling (PD).

2.2.1. Point Addition

Given P = [ x p , y p ] and Q = [ x q , y q ] , the formulas for point addition R = [ x r , y r ] = P + Q on elliptic curve y 2 = x 3 + a x + b mod   m are shown as follows, where λ is the slope of the line through points P and Q. The derivation of the formulas can be found in [18].
λ = y q y p x q x p mod   m x r = ( λ 2 x p x q ) mod   m y r = ( λ ( x p x r ) y p ) mod   m
The point at infinity, denoted O , is included in the group of elliptic curves and is defined as P + ( P ) = O for Q = P . By this definition, P + O = P . In our implementation, O is represented as [ 1 , 1 ] . In the case of P = O , R = P + Q = O + Q = Q . In the case of Q = O , R = P + Q = P + O = P . We give the point addition R = P + Q algorithm over the finite field of F m in Algorithm 1. In the case of Q = P , R = P + Q = P + ( P ) = O (line 5 in the algorithm). In the case of Q = P , R = P + Q = P + P = 2 P , we perform the point doubling R = 2 P (line 6 in the algorithm).
Algorithm 1 PA (P, Q, m, a) (Point Addition in Affine Coordinates).
inputs: Points P = [ P x , P y ] and Q = [ Q x , Q y ] ; m and a in y 2 = x 3 + a x + b mod   m
output:  R = P + Q = [ R x , R y ] = [ x r , y r ]
begin
1          x p = P x , y p = P y , x q = Q x , y q = Q y , O = [ 1 , 1 ]
2         if  P = O return Q                                                               /* O + Q = Q */
3         if  Q = O return P                                                               /* P + O = P */
4         if  x p = x q
5                  if  ( y p + y q ) mod   m = 0 return O                              /* P + ( P ) = O */
6                  else return PD (P, m, a)                                          /* P + P = 2 P */
7          λ = ( ( y q y p ) / ( x q x p ) ) mod   m
8          x r = ( λ 2 x p x q ) mod   m
9          y r = ( λ ( x p x r ) y p ) mod   m
10     return  [ x r , y r ]                                                                         /* R = P + Q */
end
An example of point addition R = P + Q on the Secp256k1 curve is shown below, where [ P x , P y ] = P , [ Q x , Q y ] = Q , and [ R x , R y ] = R in affine coordinates.
  Px = 0xfc7dafb820a20da1a73c36465f2fe37bfd98ce4ef3a10a5df110abda03b20a3d
  Py = 0xa442a2d1b8bde4a09e45725add5daae89e726b56f0e8fe6609dacaf5279b2564
  Qx = 0xe106c069450b2663febb83e29b67fa93c4c48a45d5fbe7ce4ddb8ceb601fcc1d
  Qy = 0xc9da9bd440909c8862c06a44d432d2dd45284636b7049b9bf4695f9e4018d2f2
  Rx = 0xfd52a0334e16f8cf45a6b0820887a9e8b1b180516a76c8adfef95df98aeef376
  Ry = 0xb0fe3f04cc4c64fd66a133b8c97b4905771238f8ba89631efb85a8059e969a49

2.2.2. Point Doubling

Given P = [ x p , y p ] , the formulas for point doubling R = [ x r , y r ] = 2 P on elliptic curve y 2 = x 3 + a x + b mod   m are shown as follows, where λ is the slope of the tangent line of the elliptic curve at point P. The derivation of the formulas can be found in [18].
λ = 3 x p 2 + a 2 y p mod   m x r = ( λ 2 2 x p ) mod   m y r = ( λ ( x p x r ) y p ) mod   m
We give the point doubling R = 2 P algorithm over the finite field of F m in Algorithm 2. In the case of P y = 0 (vertical tangent line), R = 2 P = O (line 2 in the algorithm).
An example of point doubling R = 2 P on the Secp256k1 curve is shown below, where [ P x , P y ] = P and [ R x , R y ] = R in affine coordinates.
  Px = 0x6034b56424fb31ea6ec5483b52ae5d07d6f3ef80264d769ae2714abb83fb279a
  Py = 0xfe4cde1ff7546a87f906f50ab1002fda7811828ea6fc467a44d1c6c11aa65a37
  Rx = 0x5491ee8b73a4ed9713ed32e467de5100b80861babf8ffd09fd595ab457d042c9
  Ry = 0xf91e6a4e132a1bdf4f5c846559431ec7373de8872b719f188b5902932f0a2b30
The computation of λ in PA and PD requires modular division, which can be realized using a modular inversion algorithm based on the EEA.
Algorithm 2 PD (P, m, a) (Point Doubling in Affine Coordinates).
inputs: Point P = [ P x , P y ] ; m and a in y 2 = x 3 + a x + b mod   m
output:  R = 2 P = [ R x , R y ] = [ x r , y r ]
begin
1          x p = P x , y p = P y , O = [ 1 , 1 ]
2         if  y p = 0 return O                                 /* vertical tangent */
3          λ = ( ( 3 x p 2 + a ) / ( 2 y p ) ) mod   m
4          x r = ( λ 2 2 x p ) mod   m
5          y r = ( λ ( x p x r ) y p ) mod   m
6         return  [ x r , y r ]                                                     /* R = 2 P */
end

2.3. Modular Inversion Algorithms

Given a prime number m, the inverse r of a number a with a < m is defined as
r = a 1 mod     m
Algorithm 3 (modinv_fermat) implements the modular inversion calculation using Fermat’s little theorem. If m is prime and a 0 ( mod   m ) , Fermat’s little theorem says that a m 1 = 1 mod   m . Multiplying both sides by a 1 gives us a m 2 mod   m = a 1 mod   m . Then, we can calculate the modular inversion with r = a m 2 mod   m . This modular exponentiation can be performed using the multiply-squaring method, as shown in Algorithm 3 (modinv_fermat). This calculation consists of costly modular multiply and modular squaring, very similar to RSA exponentiation [19].
Algorithm 3 modinv_fermat (a, m) (Modinv using Fermat’s little theorem).
inputs: Prime m and a with a < m
output:  a 1 mod   m
begin
1          k = m 2 ; x = 1 ; y = a
2         while  k 0
3                  if k is odd
4                            x = x y mod   m                            /* modular multiply */
5                   y = y 2 mod   m                                     /* modular squaring */
6                   k = k 1
7         return x
end
The Python code below implements Algorithm 3 (modinv_fermat). When executed, it outputs 4 4 4. The first value is calculated by the code, and the rest are for checking.
Computers 13 00265 i001
The EEA can be used for the modular inversion calculation. Algorithm 4 (modinv1) gives the fundamental EEA for the modular inversion calculation. Line 4 calculates the integer quotient q of u divided by v. Lines 5 and 6 calculate the remainders v = u q v and y = x q y based on the quotient q and store the original v and y in u and x, respectively. These calculations are repeated until v = 0 .
Algorithm 4 modinv1 (b, a, m) (Modular Inversion Algorithm 1).
inputs: Prime m, a, and b with a , b < m
output:  b a 1 mod   m
begin
1          u , v = a , m                                     /* u = a and v = m */
2          x , y = b , 0                                       /* x = b and y = 0 */
3         while  v 0
4                   q = u / v                            /* q: integer quotient */
5                   u , v = v , u q v
6                   x , y = y , x q y
7         return  x mod   m
end
Considering b = 1 . u and x are initialized with a and 1, respectively. At each iteration, u and x are modified with similar calculations. Therefore, when u reaches 1 from a, x reaches the reciprocal of a from 1:
u : a ( / a ) 1 ; x : 1 ( / a ) a 1 ; x : b ( / a ) b a 1
If m is a prime number, the greatest common divisor of a and m is guaranteed to be 1, and we can always obtain the inverse result of a. With the initialization of x with b, the algorithm performs the modular division r = b a 1   mod   m .
An execution example of Algorithm 4 (modinv1) with b = 1 , a = 3 , and m = 11 is shown in Table 2. The calculation finishes when v = 0 . The result r = a 1   mod   m = x   mod   m = 4   mod   11 = 4 . We can check the correctness as follows: r a   mod   m = 4 × 3   mod   11 = 12   mod   11 = 1   mod   11 .
The algorithm requires division, which is expensive. As shown in Algorithm 5 (modinv2), we can eliminate the division by setting the quotient to 0 or 1.
Algorithm 5 modinv2 (b, a, m) (Modular Inversion Algorithm 2).
inputs: Prime m, a, and b with a , b < m
output:  b a 1 mod   m
begin
1          u , v = a , m
2          x , y = b , 0
3         while  v 0
4                   q = 0 if u < v  else 1
5                   u , v = v , u q v
6                   x , y = y , x q y
7         return  x mod   m
end
The algorithm yields a quotient of 0 or 1 based on the comparison of u and v. If the quotient is a 1, subtractions u v and x y are performed (lines 5 and 6). Otherwise no calculations are performed (simply swapping u with v and swapping x with y), which make the computation slower. The execution of Algorithm 5 (modinv2) with b = 1 , a = 3 , and m = 11 requires nine iterations, as shown in Table 3.
The algorithm can be modified to remove the no calculations, as shown in Algorithm 6 (modinv3). The code in lines 4 and 5 ensures that u and v are non-negative. Note that u and x are one pair, and v and y are another pair. Algorithm 6 (modinv3) reduces the number of iterations by about half, as shown in Table 4.
Algorithm 6 modinv3 (b, a, m) (Modular Inversion Algorithm 3).
inputs: Prime m, a, and b with a , b < m
output:  b a 1 mod   m
begin
1          u , v = a , m
2          x , y = b , 0
3         while  u 1 and v 1
4                  if  u < v : v , y = v u , y x
5                  else          u , x = u v , x y
6         if  u = 1 return x mod   m
7         else       return  y mod   m
end
In Table 4, because v > u for i = 0 to 2, we update v with v u . Correspondingly, we update y with y x . For i = 3 , because u > v , we update u with u v . Correspondingly, we update x with x y . For i = 4 , because u = 1 , the result is x mod   m = 4 .
We can check u first, before the subtractions. If it is even, we shift it to the right by one bit (the least significant bit 0 is shifted out). Correspondingly, x must also be shifted. To ensure that the value to be right-shifted is even, m will be added before the shift if x is odd. Note that m is odd because it is a prime number. These shifts of u and x can be performed repeatedly until u becomes an odd number.
Similarly, if v is even, we shift it to the right by one bit. Correspondingly, y must also be shifted. If y is odd, m will be added before the shift. These shifts of v and y can be performed repeatedly until v becomes an odd number.
Then, we obtain the algorithm shown in Algorithm 7 (modinv4). The idea behind it is that division makes u and v reach 1 faster than subtraction. In fact, this is Algorithm 2.22 provided in [2] and implemented in Verilog HDL in [18].
Algorithm 7 modinv4 (b, a, m) (Modular Inversion Algorithm 4).
inputs: Prime m, a, and b with a , b < m
output:  b a 1 mod   m
begin
1          u , v = a , m
2          x , y = b , 0
3         while  u 1 and v 1
4                  while u is even
5                            u = u / 2
6                           if x is even: x = x / 2
7                           else              x = ( x + m ) / 2
8                  while v is even
9                            v = v / 2
10                         if y is even: y = y / 2
11                         else              y = ( y + m ) / 2
12                if  u < v : v , y = v u , y x
13                else           u , x = u v , x y
14       if  u = 1 return x mod   m
15       else       return  y mod   m
end
Algorithm 7 (modinv4) has been widely adopted in ECC implementations. When the two inner while loops (lines 4 to 11) finish, u and v are both odd numbers. Therefore, u v or v u is even. Then, we can shift it to the right by one bit. Correspondingly, x y or y x must also be shifted. If x y or y x is odd, m must be added before the shift so that the bit being shifted out is 0. This algorithm is shown in Algorithm 8 (modinv5). The shifts are performed by the code in lines 12 to 19.
Algorithm 8 modinv5 (b, a, m) (Modular Inversion Algorithm 5).
inputs: Prime m, a, and b with a , b < m
output:  b a 1 mod   m
begin
1          u , v = a , m
2          x , y = b , 0
3         while  u 1 and v 1
4                  while u is even
5                            u = u / 2
6                           if x is even: x = x / 2
7                           else              x = ( x + m ) / 2
8                  while v is even
9                            v = v / 2
10                         if y is even: y = y / 2
11                         else              y = ( y + m ) / 2
12                if  u < v
13                          v , y = ( v u ) / 2 , y x
14                         if y is even: y = y / 2
15                         else              y = ( y + m ) / 2
16                else
17                          u , x = ( u v ) / 2 , x y
18                         if x is even: x = x / 2
19                         else              x = ( x + m ) / 2
20       if  u = 1 return x mod   m
21       else       return  y mod   m
end
The two inner while loops in Algorithm 8 (modinv5) can be replaced by assigning u, x, v, y, or 0 to the temporary variables t u , t x , t v , t y , so that t u t v is an even number. Next, t u t v is shifted one bit to the right. Correspondingly, t x t y is also shifted one bit to the right. Note that if t x t y is odd, m is added before the shift. This algorithm is shown in Algorithm 9 (modinv6). Note that only t x t y is executed; it may be negative. The code in lines 12 and 13 ensures that u and v are non-negative.
Algorithm 9 modinv6 (b, a, m) (Modular Inversion Algorithm 6).
inputs: Prime m, a, and b with a , b < m
output:  b a 1 mod   m
begin
1          u , v = a , m
2          x , y = b , 0
3         while  u 1 and v 1
4                  if u is odd: t v , t y = v , y
5                  else             t v , t y = 0 , 0
6                  if v is odd: t u , t x = u , x
7                  else             t u , t x = 0 , 0
8                   t u v , t x y = t u t v , t x t y
9                   u v = t u v / 2
10                if  t x y is even: x y = t x y / 2
11                else                 x y = ( t x y + m ) / 2
12                if  u v < 0 : v , y = u v , x y
13                else             u , x = u v , x y
14       if  u = 1 return x mod   m
15       else       return  y mod   m
end
Algorithm 9 (modinv6) requires calculations of t u t v , t v t u , t x t y , and t y t x . We can unify these calculations with negative assignments v and y to v and y, respectively, so that only t u + t v and t x + t y are sufficient for the calculations. That is, with negative assignments v and y to v and y, u = u v = u + ( v ) becomes u = u + v , and x = x y = x + ( y ) becomes x = x + y . Similarly, v = ( u v ) becomes v = u + v , and y = ( x y ) becomes y = x + y with negative assignments v and y to v and y. Therefore, u + v and x + y are sufficient for the calculations, saving hardware costs.
The algorithm is given in Algorithm 10 (modinv7). Because of the negative assignments to v and y, v is initialized with m and y is initialized with 0 = 0 . Note that x is never greater than or equal to m. Therefore, for the final result, no adjustment of x = x m or x = x mod   m is required. All we need is x = x + m for x < 0 .
Algorithm 10 modinv7 (b, a, m) (Modular Inversion Algorithm 7).
inputs: Prime m, a, and b with a , b < m
output:  b a 1 mod   m
begin
1          u , v = a , m
2          x , y = b , 0
3         while  u 1
4                  if u is odd: t v , t y = v , y
5                  else             t v , t y = 0 , 0
6                  else if v is odd: t u , t x = u , x
7                  else else             t u , t x = 0 , 0
8                  else  t u v , t x y = t u + t v , t x + t y
9                  else  u v = t u v / 2
10                if  t x y is even: x y = t x y / 2
11                else
12                       if  t x y < 0 : x y = ( t x y + m ) / 2
13                       else              x y = ( t x y m ) / 2
14                if  u v < 0 : v , y = u v , x y
15                else             u , x = u v , x y
16       if  x < 0 : x = x + m
17       return x
end
The Python codes for Algorithms 4–10 (modinv1 to modinv7) are given in Appendix A. We generate random numbers b and a that are smaller than a fixed 256-bit m. For the same b, a, m inputs, all modular inversion algorithms have the same output.
A modular inversion algorithm is said to be good when u reaches 1 quickly (high performance) and the algorithm uses a small number of adders and subtractors (low cost).

3. Proposed Radix-8 Modular Inversion Algorithm and Its Performance

The proposed mixed radix-8 modular inversion algorithm is given in Algorithm 11 (modinv_radix8). To calculate r = b a 1 mod m , we initialize u = a , v = m , x = b , and y = 0 with the negative assignment to v and y. The temporary variable t u is assigned with u or 0 and the temporary variable t v is assigned with v or 0, so that t u v = t u + t v is even. Correspondingly, the temporary variable t x is also assigned with x or 0 and the temporary variable t y is assigned with y or 0. If the least significant three bits of t u v are 000, it is shifted to the right by three bits (radix-8). Otherwise, if the least significant two bits of t u v are 00, it is shifted to the right by two bits (radix-4). Otherwise, it is shifted to the right by one bit (radix-2), because t u v is even. This is also called a hybrid radix algorithm. It is difficult to develop a complete radix-8 algorithm without using radix-4 or radix-2 arithmetic. We need to handle all cases where the least significant three bits of t u v are not 000 (there are seven cases) and perform the corresponding radix-8 arithmetic.
Algorithm 11 modinv_radix8 (b, a, m) (Radix-8 Modular Inversion Algorithm).
inputs: Prime m, a, and b with a , b < m
output:  b a 1 mod   m
begin
1          u , v = a , m
2          x , y = b , 0
3         while  u 1
4                  if u is odd: t v , t y = v , y
5                  else  t v , t y = 0 , 0
6                  if v is odd: t u , t x = u , x
7                  else  t u , t x = 0 , 0
8                   t u v , t x y = t u + t v , t x + t y                                            /* t u v is even */
9                  if  t u v & 6 = 0                                                                  /* radix 8 */
10                        u v = t u v / 8
11                       if  t x y & 1 = 0
12                               if  t x y & 2 = 0
13                                       if  t x y & 4 = 0 : x y = t x y / 8
14                                       else   x y = ( t x y + 4 m ) / 8
15                               else
16                                       if  t x y & 4 = ( 2 m & 4 ) : x y = ( t x y 2 m ) / 8
17                                       else   x y = ( t x y + 2 m ) / 8
18                       else
19                               if  t x y & 6 = m & 6 : x y = ( t x y m ) / 8
20                               else
21                                       if  t x y & 2 = m & 2 : x y = ( t x y + 3 m ) / 8
22                                       else
23                                              if  t x y & 4 m & 4 : x y = ( t x y + m ) / 8
24                                              else  x y = ( t x y 3 m ) / 8
25                else
26                       if  t u v & 2 = 0                                                             /* radix 4 */
27                                u v = t u v / 4
28                               if  t x y & 1 = 0
29                                       if  t x y & 2 = 0 : x y = t x y / 4
30                                       else   x y = ( t x y + 2 m ) / 4
31                               else
32                                       if  t x y & 3 = m & 3 : x y = ( t x y m ) / 4
33                                       else  x y = ( t x y + m ) / 4
34                       else                                                                          /* radix 2 */
35                                u v = t u v / 2
36                               if  t x y & 1 = 0 : x y = t x y / 2
37                               else
38                                       if  t x y < 0 : x y = ( t x y + m ) / 2
39                                       else   x y = ( t x y m ) / 2
40                if  u v < 0 : v , y = u v , x y
41                else  u , x = u v , x y
42       if  x < 0 : x = x + m
43       return x
end
Correspondingly, t x and t y are arranged and t x y = t x + t y is also shifted to the right by three bits, two bits, or one bit. The bits being shifted out must be 0. Therefore, we need to adjust t x y using the prime number m before the shift. Table 5 lists such adjustments based on the least significant three bits of t x y and the least significant three bits of m for the radix-8 operations, where x represents a don’t-care term. The least significant three bits of the adjusted value are 000, as shown in the Comment column of the table.
Similarly, Table 6 lists the adjustments based on the least significant two bits of t x y and the least significant two bits of m for the radix-4 operations. The least significant two bits of the adjusted value are 00, as shown in the Comment column of the table.
The Python code for the proposed algorithm is also given in Appendix A. The modinv_radix8 algorithm takes 206 iterations to reach u = 1 and v = 1 . In contrast, the modinv_radix4 and modinv_radix2 algorithms require 243 and 356 iterations, respectively.
To reduce the number of adders, we use a multiplexer to select an appropriate value and assign it to the temporary variable t z . Then, we perform t x y = t x + t y + t z . Based on the least significant three bits of t u v , we assign t x y 1 , t x y 2 , or t x y 3 to x y .
Figure 1 shows the block diagram of the proposed radix-8 modular inversion circuit. To perform the addition t x y = t x + t y + t z , 2 m and 3 m are replaced by + 6 m and + 5 m , respectively. This is because, for example, for an integer i, ( i 3 m ) mod m = ( i 3 m + 8 m ) mod m = ( i + 5 m ) mod m . Furthermore, for the addition, we prepare m that can be obtained by inverting all the bits of m and setting the right-most bit to 1 because m is odd. The Verilog HDL implementation uses continuous assignment to compute u v and x y and writes them to the corresponding registers on the rising edge of the clock signal. Note that we have to use adders for generating 3 m , 5 m , and 6 m , which are not shown in the figure.
Below we give the hardware implementation code in Verilog HDL for the proposed radix-8 modular inversion algorithm (modinv_r8.v). The signals start and ready indicate the start of the modular inversion calculation and the availability of the calculation result, respectively. Because we use the Secp256k1 elliptic curve, the input and output signals b, a, m, and c are 256 bits. We use 260 bits for the internal signals. Instead of the 260-bit t x y = t x + t y in the Python code, we achieve it with 3 bits (t3). This reduces the execution time and hardware costs. And we perform ( t x < 0 ) | ( t y < 0 ) for t x y < 0 . This guarantees ( t x + t y + m ) / 2 < m .
Computers 13 00265 i002Computers 13 00265 i003
Below is the testbench Verilog HDL code used to simulate modinv_r8.v.
Computers 13 00265 i004
Figure 2 shows the functional simulation waveform, generated with ModelSim. The result c was available in 416 ns. That is, the calculation took 208 clock cycles. Note that the value of the result c is the same as the output of the Python code in Appendix A.
We implemented the modular inversion algorithms on the Altera Cyclone V 5CGXFC9 E7F35C8 FPGA chip. The EDA tool we used is Quartus Prime Version 20.1.1 Build 720 11/11/2020 SJ Lite Edition. This is the latest edition that integrates with ModelSim for simulation. All algorithms were evaluated in the same environment.
Table 7 lists the cost performance of the modular inversion algorithms. The column Cycles shows the required number of clock cycles when executing the modular inversion algorithm. The column Freq.(MHz) shows the clock frequency in MHz at which the circuit can work. The column Latency(μs) shows the execution time in microseconds calculated by dividing the clock cycles by the clock frequency. The column ALMs shows the required number of adaptive logic modules. The column Registers shows the required number of flip-flops. The flip-flops are mainly used to store u, v, x, and y. Their contents are updated in every clock cycle. The last column shows the AT factor, which is the product of the Latency in milliseconds and the sum of the ALMs and Registers:
AT = Latency × ( ALMs + Registers )
The row [1] in the table shows the performance and cost of modular inversion using Fermat’s little theorem r = a m 2 mod m . It consists of costly modular multiply and modular squaring. Its AT factor is much higher than the others. The remaining rows show the performance and cost of the EEA-based modular inversion algorithms. The numbers of registers used by [2,3,8,10] were larger than the others. This is because extra registers are used to adjust the value of x or y, so that the modular inversion result is within the range of 0 and m. Our algorithm implementation achieved an execution time of 3.67 μs and an AT factor of 8.30, outperforming all other implementations. Figure 3 shows an intuitive view of the latency and AT histograms.
The proposed radix-8 algorithm was demonstrated to be efficient for 256-bit primes. Table 8 compares the cost performance on the Secp192k1 192-bit prime field curve and Secp521r1 521-bit prime field curve [16]. Here, we only provide a comparison with Algorithm 2.22 [2]. For ease of comparison, we also show the case when using the Secp256k1 256-bit prime field curve in the table. Our algorithm’s AT outperformed [2] by 17.57 / 4.38 = 4.01 times, 38.31 / 8.30 = 4.62 times, and 253.31 / 55.44 = 4.57 times for the curves of Secp192k1, Secp256k1, and Secp521r1, respectively. This shows that our algorithm also demonstrated scalability to other prime sizes and adaptability to other cryptographic curves.

4. ECC Implementation with Proposed Modular Inversion Algorithm

ECC relies on scalar point multiplication. Suppose P = [ x p , y p ] is a point on the curve, the scalar point multiplication Q = d P obtains the Q = [ x q , y q ] that is also on the curve, where d = d n 1 d 1 d 0 is an n-bit scalar. Scalar point multiplication can be conducted with PA and PD, as shown in Algorithm 12.
Algorithm 12 ScaMul (d, P, m, a) (Scalar Point Multiplication).
inputs: d = d n 1 d 1 d 0 and point P = [ P x , P y ] ; m and a in y 2 = x 3 + a x + b mod m
output:  Q = d P
begin
1          Q = O , R = P , k = d                                                          /* Q = O and R = P */
2         while  k 0  do
3                  if  k 0 = 1
4                            Q = PA (Q, R, m, a)                          /* Q = Q + R (Algorithm 1) */
5                   R = PD (R, m, a)                                             /* R = 2 R (Algorithm 2) */
6                   k = k 1
7         endwhile
8         return Q                                                                                         /* Q = d P */
end
The algorithm calls point addition PA (P, Q, m, a) and point doubling PD (P, m, a). Table 9 gives an example to show the calculation steps of the scalar point multiplication. For a 5-bit d = 10101 2 = 21 , we calculate Q = d P in five steps to obtain Q = 21 P . We can see that the algorithm is similar to RSA exponentiation [19].
We used the ECDH key exchange protocol, described in Table 1, to establish a shared secret key for two parties. An ECDH key exchange algorithm in Python code is given in Appendix B. The algorithm invokes the scalar point multiplications that use two computations — point addition and point doubling. Four primitive modular calculations (addition, subtraction, multiplication, and inversion) are used for these two computations, as shown in Figure 4. The Python function names in Appendix B are also shown in the figure.
See Appendix B, the Python code is hardware-oriented. Essentially, a Python function defined using the def keyword was implemented in a Verilog HDL module. For integrity, we listed the modinv_radix8 code again, but now with + 6 m and + 5 m instead of 2 m and 3 m , respectively, and the function name has been changed to modinv.
Based on the Python code, we implemented ECC using our radix-8 modular inversion algorithm for calculating λ in PA and PD. Figure 5 shows the functional simulation waveform of scalar point multiplication Q = d P with P = [ x , y ] and Q = [ q x , q y ] . The result was available in 635,362 ns. That is, the calculation took 317,681 clock cycles. Outputs q x and q y are the same as the outputs Qax and Qay, respectively, of the Python code in Appendix B.
An ECC cost performance comparison is given in Table 10 when implementing on the Altera Cyclone V 5CGXFC9E7F35C8 FPGA chip. We also used Quartus Prime Version 20.1.1 Lite Edition for the implementation. All ECC implementations used the same circuit, except for the modular inversion part.
Figure 6 shows the latency and AT histogram. The ECC latency using our proposed radix-8 modular inversion algorithm was 0.01970 s and its AT factor was 393,546.29. From the table and histogram, we can see that our ECC implementation achieved lower latency and lower AT factor than all other implementations.
Algorithm 12, the traditional scalar point multiplication, suffers from side-channel attacks, because its execution time depends on the input scalar d. Side-channel attacks attempt to reveal the secret key from leaked information, such as timing, power consumption, or electromagnetic radiation. We can use the Montgomery ladder algorithm [20] to perform the scalar point multiplication, as shown in Algorithm 13. Its execution takes the same constant time regardless of the input scalar d, making it resistant to side-channel attacks.
Algorithm 13 ScaMulMont (d, P, m, a) (Montgomery Ladder Scalar Point Multiplication).
inputs: d = 1 d n 2 d 1 d 0 and point P = [ P x , P y ] ; m and a in y 2 = x 3 + a x + b mod m
output:  Q = d P
begin
1          Q = P , R = PD (P, m, a)                                                        /* Q = P and R = 2 P */
2         for  i = n 2  downto 0 do
3                if  d i = 1
4                        Q = PA (Q, R, m, a)                                     /* Q = Q + R (Algorithm 1) */
5                        R = PD (R, m, a)                                               /* R = 2 R (Algorithm 2) */
6                else
7                        R = PA (Q, R, m, a)                                     /* R = Q + R (Algorithm 1) */
8                        Q = PD (Q, m, a)                                              /* Q = 2 Q (Algorithm 2) */
9         endfor
10      return Q                                                                                                /* Q = d P */
end
We give the Montgomery ladder scalar point multiplication algorithm’s Python code as follows, and give the block diagram of the Montgomery ladder circuit in Figure 7b.
Computers 13 00265 i005
Figure 7. Block diagrams of scalar point multiplication circuits. (a) Traditional scalar point multiplication circuit; (b) Montgomery ladder scalar point multiplication circuit.
Figure 7. Block diagrams of scalar point multiplication circuits. (a) Traditional scalar point multiplication circuit; (b) Montgomery ladder scalar point multiplication circuit.
Computers 13 00265 g007
Since multiple PA or PD operations are not performed simultaneously, one PA module and one PD module are sufficient (PA and PD operate in parallel). Note that the iterative control part based on the scalar d is omitted in the figure. In contrast, the block diagram of the traditional scalar point multiplication circuit is shown in Figure 7a.
Algorithm 13 is resistant to side-channel attacks, while requiring approximately the same execution time and the same hardware resources compared to Algorithm 12, as shown in Table 11. Because the Montgomery ladder scalar point multiplication circuit (Figure 7b) uses more multiplexers, the number of ALMs is larger than Algorithm 12. Algorithm 12 uses a 256-bit register to shift the scalar d, and Algorithm 13 uses a 9-bit counter to control the iterations. Therefore, Algorithm 12 uses more registers than Algorithm 13.

5. Discussion of Design Issues

In the previous sections, we demonstrated through simulations using ModelSim and implementations with the Quartus II EDA tool that the proposed high-radix modular inversion algorithm works correctly and is efficient in terms of execution time and hardware costs. We also showed that the ECC implementation using the proposed algorithm outperformed implementations using other investigated algorithms.
The modular inversion algorithm repeatedly performs addition, subtraction, and shift operations on the variables u, v, x, and y. These variables are typically implemented using registers that are updated on the rising edge of a clock signal. The clock frequency of the circuit is determined by the operation delay between two successive clock rising edges (a clock cycle). Increasing the clock frequency requires decreasing the delay of operations within a clock cycle.
We can use a finite state machine to divide sequential computations into multiple steps and store the results of the steps in registers. This reduces the latency within one clock cycle and increases the clock frequency. However, implementing a finite state machine requires more clock cycles and more registers.
Multiplexers have much lower latency and lower hardware cost compared to carry propagate adders. To reduce hardware costs, if some additions have the same input, instead of using adders and then a multiplexer, we can use a multiplexer before an adder. For example, instead of s = mux ( a + b , a + c , a + d , a + e ) , we can have s = a + mux ( b , c , d , e ) , which reduces the number of adders. The circuit for calculating t x y in Figure 1 is designed in this way to generate t z using a multiplexer before the adder. In addition, to reduce hardware costs, we only use addition for both addition and subtraction calculations. For example, in Figure 1 we only perform the addition t x y = t x + t y + t z . For radix-8 modular operations, the subtractions 2 m and 3 m are replaced by + 6 m and + 5 m , respectively.
The carry-select adder (CSLA) can be used to reduce the latency of the carry propagate adder. For an n bits carry propagate adder, we split the n bits into two n / 2 bits, the upper n / 2 bits and the lower n / 2 bits. The addition of the upper n / 2 bits is performed simultaneously by two adders, assuming that the carry-in of one adder is 0 and the carry-in of the other adder is 1. Three n / 2 -bit adders (including one for the lower n / 2 bits) operate in parallel. When the carry-out of the lower n / 2 -bit adder is available, a multiplexer is used to select the correct result from the upper two adders. Using a CSLA by dividing n bits into two n / 2 bits can reduce the latency by approximately half. On the other hand, the hardware cost increases by more than 50% (one extra n / 2 -bit adder and one n / 2 -bit multiplexer are required). In general, splitting n bits into m ( n / m )-bits reduces the latency by about a factor of m, but increases the hardware cost exponentially.
The use of a carry-save adder (CSA) can significantly reduce latency. There is no ripple carry between bits. The result is represented as a carry set and a sum set. A single carry-save adder is equivalent to a 1-bit full adder, which has a low latency. Because of the representation of the two sets, it is not possible to know the final addition result and its sign (negative or positive) without performing an additional addition, with a carry look-ahead adder for example. Therefore, a CSA is commonly used for intermediate calculations. It takes three sets of inputs and produces two sets of outputs.
As shown in Algorithm 11, our proposed radix-8 modular inversion algorithm allows for arbitrary bit primes, because we considered all combinations of the least significant three bits of prime numbers, as shown in Table 5. If we use a fixed prime m, defined by Secp256k1 [16] for instance, the circuit can be simplified by removing the parts of prime number m whose least significant three bits are not 111. Furthermore, 3 m , 5 m , and 6 m can be calculated and stored in a constant table in advance. Then, the hardware can use them directly, without any calculations. This speeds up the radix-8 modular inversion calculations.
Our radix-8 modular inversion algorithm allows any bit prime and has no special requirements on the prime, so we can easily use different elliptic curves with the same or different prime sizes, as shown in Table 8. In addition, the hardware implementation of the algorithm is provided in Verilog HDL, which does not rely on special circuit libraries, allowing the algorithm to be implemented on a variety of different platforms.
This paper mainly focused on the modular inversion algorithm and the hardware implementation. We presented the performance and hardware cost of simple low-cost ECC implementations using different modular inversion algorithms, to demonstrate the benefits of the proposed algorithm. To make a fair comparison between implementations, all ECC hardware circuits were identical, except for the modular inversion circuit. From Table 10, we can see that the frequencies of all ECC implementations were quite low. To increase the clock frequency, we can also use pipeline techniques to divide computations into multiple stages and use pipeline registers to store intermediate results. However, the hardware costs will increase due to the use of pipeline registers.

6. Concluding Remarks

In this paper, we proposed a mixed radix-8 modular inversion algorithm and hardware implementation based on 256-bit primes in Verilog HDL and compared its cost performance with other implementations on the Altera Cyclone V FPGA chip. The algorithm and its hardware implementation were area–time efficient with an AT factor of 8.30, which outperformed the other algorithms and implementations. We showed that our algorithm also demonstrates scalability to other prime sizes and adaptability to other cryptographic curves. We also presented the cost performance of an ECC implementation using the proposed modular inversion algorithm. Implementation results also showed that our algorithm reduces the execution time and requires fewer hardware resources than the other investigated algorithms. We presented an efficient implementation of the Montgomery ladder scalar point multiplication algorithm that is resistant to side-channel attacks.
Future work could include shortening the critical path and using carry-select adders and carry-save adders to speed up the addition of large operands. In addition, using a fixed prime m, Secp256k1 for example, we could simplify the circuit by considering only the case where the least significant three bits are equal to 111 and using precomputations of 3 m , 5 m , and 6 m to speed up the radix-8 modular inversion calculations.
Another important future work is to minimize the latency of the ECC implementation by using longer pipelines. The pipeline stages could be split, so that a modular addition or subtraction can be completed within a single pipeline stage.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. EEA-Based Modular Inversion Algorithms in Python

In the previous sections, we described Algorithms 4–10 (modinv1 to modinv7) and the radix-8 modular inversion algorithm in pseudocode. This appendix provides their implementation codes in Python. We simply summarize the codes as below: modinv1 is the fundamental EEA code; modinv2 removes the division; modinv3 reduces the number of iterations; modinv4 repeatedly divides u and v by two; modinv5 divides u v or v u by two; modinv6 assigns t u and t v , so that t u t v is even; modinv7 uses negative assignments to v and y; and modinv_radix8 gives the code of the proposed algorithm.
Computers 13 00265 i006Computers 13 00265 i007Computers 13 00265 i008
The last assert statement verifies the correctness of the calculated modular inversion result c. Below is an example of the output when the code is executed. We can see that for the same inputs b, a, m, the eight functions have the same output. These outputs of the Python code are used to check the correctness of the circuit’s outputs. For example, the value of the signal c in Figure 2 is the same as the output c of the Python code.
$ python3 modinv12345678.py
b  = 0x9cfa1c993911914be0f15bd74a878abe0079c6254b961b82e1abda76387d1d85
a  = 0xd5076ae274e874c2eb0f7778717c39460236549ddd9fc651e68a0c0e787b4ce8
m  = 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f
c1 = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e
c2 = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e
c3 = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e
c4 = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e
c5 = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e
c6 = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e
c7 = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e
c  = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e

Appendix B. Elliptic Curve Diffie–Hellman (ECDH) Key Exchange Algorithm in Python

The ECDH algorithm is used to reach key agreement between two parties over an insecure network, as shown in Table 1. The following Python code demonstrates that two parties, Alice and Bob for example, obtain the same secure key by calling scalar point multiplications. We skip the communication process and focus on how to implement scalar point multiplication by calling point addition and point doubling.
Computers 13 00265 i009Computers 13 00265 i010Computers 13 00265 i011
Python functions for modular addition (modadd), modular subtraction (modsub), modular multiplication (modmul), and modular inversion (modinv) are provided. These functions are used by point addition (point_addition) and point doubling (point_doubling). All the codes are hardware-oriented for circuit design.
The last assert statement is used to check whether the two parties obtained the same shared secure key. Below is an example of the output when the code is executed. We can see that Qbax is equal to Qabx (shared secure key). These outputs of the Python code are used to check the correctness of the circuit’s outputs. For example, the values of signals q x and q y in Figure 5 are the same as the outputs Qax and Qay, respectively, of the Python code.
$ python3 ecdh.py
da   = 0x650aa7095daeaa37ab9051541f0ce304f8969a6d88bb3bebb4fe680fca9a2595
db   = 0xedc68f194c4e30d6ef90467df822b00e5ef122dea48c9d1c54817080d1a341f4
Qax  = 0x167d2537aa6bbd8d978b58be0f9466520b7b184e205ff96a9ff567b35b32c7b7
Qay  = 0xde3961553d36551f92726fee0e332133960edddccd2784b98b2af730d2fc6e14
Qbx  = 0x839da64a414c2243a5526230603109be9c615613a9e98c3d650bb0488580bbda
Qby  = 0x96e88e99304a5afcdd77c4f3b3327a28162627ebe08194baa0c78dfb67a11042
Qabx = 0x1f254c7da15899275cdcab9d992f58251a4ab630fe9864d20cf317ab57749947
Qaby = 0xd6cb400b3c49d33d3df28f9d34fa09f8b6c8edf117a378c5a45d0a51e6c0debc
Qbax = 0x1f254c7da15899275cdcab9d992f58251a4ab630fe9864d20cf317ab57749947
Qbay = 0xd6cb400b3c49d33d3df28f9d34fa09f8b6c8edf117a378c5a45d0a51e6c0debc

References

  1. Burton, D. The History of Mathematics/An Introduction, 7th ed.; McGraw-Hill: New York, NY, USA, 2011. [Google Scholar] [CrossRef]
  2. Hankerson, D.; Menezes, A.; Vanstone, S. Guide to Elliptic Curve Cryptography; Springer: New York, NY, USA, 2004. [Google Scholar] [CrossRef]
  3. Hossain, M.S.; Kong, Y. High-Performance FPGA Implementation of Modular Inversion over F_256 for Elliptic Curve Cryptography. In Proceedings of the 2015 IEEE International Conference on Data Science and Data Intensive Systems, Sydney, Australia, 11–13 December 2015; pp. 169–174. [Google Scholar] [CrossRef]
  4. Daly, A.; Marnane, W.; Kerins, T.; Popovici, E. Division in GF(p) for Application in Elliptic Curve Cryptosystems on Field Programmable Logic. In New Algorithms, Architectures and Applications for Reconfigurable Computing; Springer: Boston, MA, USA, 2005; pp. 219–229. [Google Scholar] [CrossRef]
  5. Mrabet, A.; El-Mrabet, N.; Bouallegue, B.; Mesnager, S.; Machhout, M. An efficient and scalable modular inversion/division for public key cryptosystems. In Proceedings of the 2017 International Conference on Engineering & MIS (ICEMIS), Monastir, Tunisia, 8–10 May 2017; pp. 1–6. [Google Scholar] [CrossRef]
  6. Chen, C.; Qin, Z. Fast Algorithm and Hardware Architecture for Modular Inversion in GF(p). In Proceedings of the 2009 Second International Conference on Intelligent Networks and Intelligent Systems, Tianjin, China, 1–3 November 2009; pp. 43–45. [Google Scholar] [CrossRef]
  7. Choi, P.; Lee, M.K.; Kong, J.T.; Kim, D.K. Efficient Design and Performance Analysis of a Hardware Right-shift Binary Modular Inversion Algorithm in GF(p). J. Semicond. Technol. Sci. 2017, 17, 425–437. [Google Scholar] [CrossRef]
  8. Wang, D.; Lin, Y.; Hu, J.; Zhang, C.; Zhong, Q. FPGA Implementation for Elliptic Curve Cryptography Algorithm and Circuit with High Efficiency and Low Delay for IoT Applications. Micromachines 2023, 14, 1037. [Google Scholar] [CrossRef] [PubMed]
  9. Yang, D.; Dai, Z.; Li, W.; Chen, T. An Efficient ASIC Implementation of Public Key Cryptography Algorithm SM2 Based on Module Arithmetic Logic Unit. In Proceedings of the 2019 IEEE 13th International Conference on ASIC (ASICON), Chongqing, China, 29 October–1 November 2019; pp. 1–4. [Google Scholar] [CrossRef]
  10. Yan, X.; Li, S. Modified modular inversion algorithm for VLSI implementation. In Proceedings of the 2007 7th International Conference on ASIC, Guilin, China, 22–25 October 2007; pp. 90–93. [Google Scholar] [CrossRef]
  11. Dong, X.; Zhang, L.; Gao, X. An Efficient FPGA Implementation of ECC Modular Inversion over F256. In Proceedings of the 2nd International Conference on Cryptography, Security and Privacy, Guiyang, China, 16–18 March 2018; pp. 29–33. [Google Scholar] [CrossRef]
  12. Hao, Y.; Zhong, S.; Ma, M.; Jiang, R.; Huang, S.; Zhang, J.; Wang, W. Lightweight Architecture for Elliptic Curve Scalar Multiplication over Prime Field. Electronics 2022, 11, 2234. [Google Scholar] [CrossRef]
  13. Guo, K.Y.; Fang, W.C.; Fahier, N. An Efficient Hardware Design of Prime Field Modular Inversion/Division for Public Key Cryptography. In Proceedings of the 2023 IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA, 21–25 May 2023; pp. 1–5. [Google Scholar] [CrossRef]
  14. Koblitz, N. Elliptic curve cryptosystems. Math. Comput. 1987, 48, 203–209. Available online: https://www.ams.org/journals/mcom/1987-48-177/S0025-5718-1987-0866109-5/S0025-5718-1987-0866109-5.pdf (accessed on 10 October 2024). [CrossRef]
  15. Miller, V.S. Use of Elliptic Curves in Cryptography. In Proceedings of the Advances in Cryptology—CRYPTO ’85 Proceedings; Springer: Berlin/Heidelberg, Germany, 1986; pp. 417–431. Available online: https://link.springer.com/content/pdf/10.1007/3-540-39799-X_31.pdf?pdf=inline%20link (accessed on 10 October 2024).
  16. Certicom Corp. Standards for Efficient Cryptography. SEC 2: Recommended Elliptic Curve Domain Parameters. 2010. Available online: http://www.secg.org/sec2-v2.pdf (accessed on 10 October 2024).
  17. Barker, E.; Chen, L.; Roginsky, A.; Vassilev, A.; Davis, R. SP 800-56A Rev. 3, Recommendation for Pair-Wise Key-Establishment Schemes Using Discrete Logarithm Cryptography; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2018. Available online: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-56Ar3.pdf (accessed on 10 October 2024).
  18. Li, Y. Hardware Implementations of Elliptic Curve Cryptography Using Shift-Sub Based Modular Multiplication Algorithms. Cryptography 2023, 7, 1–29. [Google Scholar] [CrossRef]
  19. Li, Y.; Chu, W. Shift-Sub Modular Multiplication Algorithm and Hardware Implementation for RSA Cryptography. In Proceedings of the 17th International Conference on Information Assurance and Security, Lecture Notes in Networks and Systems; Springer: Cham, Switzerland; 2021; pp. 541–552. [Google Scholar] [CrossRef]
  20. Oliveira, T.; López, J.; Rodríguez-Henríquez, F. The Montgomery ladder on binary elliptic curves. J. Cryptogr. Eng. 2018, 8, 241–258. [Google Scholar] [CrossRef]
Figure 1. Block diagram of the proposed mixed radix-8 modular inversion circuit.
Figure 1. Block diagram of the proposed mixed radix-8 modular inversion circuit.
Computers 13 00265 g001
Figure 2. Waveform of the proposed radix-8 modular inversion algorithm.
Figure 2. Waveform of the proposed radix-8 modular inversion algorithm.
Computers 13 00265 g002
Figure 3. Latency and AT comparison of modular inversion algorithms. Details (year and first author’s name) of the numbers [n] (algorithm) on the horizontal axis are in Table 7.
Figure 3. Latency and AT comparison of modular inversion algorithms. Details (year and first author’s name) of the numbers [n] (algorithm) on the horizontal axis are in Table 7.
Computers 13 00265 g003
Figure 4. The ECDH key exchange algorithm uses scalar point multiplication, which uses two operations, point addition and point doubling, which use the four basic modular operations (addition, subtraction, multiplication, and inversion).
Figure 4. The ECDH key exchange algorithm uses scalar point multiplication, which uses two operations, point addition and point doubling, which use the four basic modular operations (addition, subtraction, multiplication, and inversion).
Computers 13 00265 g004
Figure 5. Waveform of the scalar point multiplication Q = d P with P = [ x , y ] and Q = [ q x , q y ] using proposed the radix-8 modular inversion algorithm.
Figure 5. Waveform of the scalar point multiplication Q = d P with P = [ x , y ] and Q = [ q x , q y ] using proposed the radix-8 modular inversion algorithm.
Computers 13 00265 g005
Figure 6. ECC latency and AT comparison of modular inversion algorithms. Details (year and first author’s name) of the numbers [n] (algorithm) on the horizontal axis are in Table 10.
Figure 6. ECC latency and AT comparison of modular inversion algorithms. Details (year and first author’s name) of the numbers [n] (algorithm) on the horizontal axis are in Table 10.
Computers 13 00265 g006
Table 1. ECDH key exchange.
Table 1. ECDH key exchange.
Expose an Elliptic Curve y 2 = x 3 + ax + b mod m and a Point P
on the Elliptic Curve to the World
AliceBob
Generate a secret d a Generate a secret d b
Calculate Q a = d a P Calculate Q b = d b P
Expose Q a Expose Q b
Get Q b from BobGet Q a from Alice
Calculate Q a b = d a Q b Calculate Q b a = d b Q a
Use x of Q a b as the keyUse x of Q b a as the key
Table 2. Execution example of Algorithm 4 (modinv1) with b = 1 , a = 3 , and m = 11 . It calculates r = 3 1 mod   11 . The result is x mod   m = 4 mod   11 = 4 .
Table 2. Execution example of Algorithm 4 (modinv1) with b = 1 , a = 3 , and m = 11 . It calculates r = 3 1 mod   11 . The result is x mod   m = 4 mod   11 = 4 .
iuvxyq
03=a    11=m1=b0q= u / v
0u=vv= u q v x=yy= x q y
1    0= 3 / 11    
1   11=v   3= 3 0 11    0=y1= 1 0 0
2 3= 11 / 3
23=v2= 11 3 3 1=y 3 = 0 3 1
3 1= 3 / 2
32=v1= 3 1 2     3 =y   4= 1 1 ( 3 )
4 2= 2 / 1
41=v0= 2 2 1 4=y    11 = ( 3 ) 2 4    
   End   u=1v=0x=4
Table 3. Execution example of Algorithm 5 (modinv2) with b = 1 , a = 3 , and m = 11 . It calculates r = 3 1 mod   11 . The result is x mod   m = 7 mod   11 = ( 11 7 ) mod   11 = 4 .
Table 3. Execution example of Algorithm 5 (modinv2) with b = 1 , a = 3 , and m = 11 . It calculates r = 3 1 mod   11 . The result is x mod   m = 7 mod   11 = ( 11 7 ) mod   11 = 4 .
iuvxyq
0311100
1113011
2381 1 0
383 1 11
4351 2 0
553 2 11
6321 3 1
721 3 41
8114 7 1
910 7 11
Table 4. Execution example of Algorithm 6 (modinv3) with b = 1 , a = 3 , and m = 11 . It calculates r = 3 1 mod   11 . The result is x mod   m = 4 mod   11 = 4 , because u = 1 .
Table 4. Execution example of Algorithm 6 (modinv3) with b = 1 , a = 3 , and m = 11 . It calculates r = 3 1 mod   11 . The result is x mod   m = 4 mod   11 = 4 , because u = 1 .
iuvxy
031110
1381 1
2351 2
3321 3
4124 3
Table 5. XY adjustment for shift right by three bits in the proposed modular inversion algorithm.
Table 5. XY adjustment for shift right by three bits in the proposed modular inversion algorithm.
txym 2 m 3 m 4 m ( t xy Comment
000xx1 ( t x y + 0 m ) / 8 0 + 0 = 0
100xx1 100 ( t x y + 4 m ) / 8 4 + 4 = 8
010x01010 ( t x y 2 m ) / 8 2 2 = 0
110x11110 ( t x y 2 m ) / 8 6 6 = 0
010x11110 ( t x y + 2 m ) / 8 2 + 6 = 8
110x01010 ( t x y + 2 m ) / 8 6 + 2 = 8
001001 ( t x y 1 m ) / 8 1 1 = 0
011011 ( t x y 1 m ) / 8 3 3 = 0
101101 ( t x y 1 m ) / 8 5 5 = 0
111111 ( t x y 1 m ) / 8 7 7 = 0
001101010111 ( t x y + 3 m ) / 8 1 + 7 = 8
011111110101 ( t x y + 3 m ) / 8 3 + 5 = 8
101001010011 ( t x y + 3 m ) / 8 5 + 3 = 8
111011110001 ( t x y + 3 m ) / 8 7 + 1 = 8
001111 ( t x y + 1 m ) / 8 1 + 7 = 8
011101 ( t x y + 1 m ) / 8 3 + 5 = 8
101011 ( t x y + 1 m ) / 8 5 + 3 = 8
111001 ( t x y + 1 m ) / 8 7 + 1 = 8
001011110001 ( t x y 3 m ) / 8 1 1 = 0
011001010011 ( t x y 3 m ) / 8 3 3 = 0
101111110101 ( t x y 3 m ) / 8 5 5 = 0
111101010111 ( t x y 3 m ) / 8 7 7 = 0
Table 6. XY adjustment for shift right by two bits in the proposed modular inversion algorithm.
Table 6. XY adjustment for shift right by two bits in the proposed modular inversion algorithm.
txym 2 m ( t xy Comment
00x1 ( t x y + 0 m ) / 4 0 + 0 = 0
10x110 ( t x y + 2 m ) / 4 2 + 2 = 4
0101 ( t x y 1 m ) / 4 1 1 = 0
1111 ( t x y 1 m ) / 4 3 3 = 0
0111 ( t x y + 1 m ) / 4 1 + 3 = 4
1101 ( t x y + 1 m ) / 4 3 + 1 = 4
Table 7. Comparison of modular inversion algorithms (on Altera Cyclone V FPGA chip).
Table 7. Comparison of modular inversion algorithms (on Altera Cyclone V FPGA chip).
AlgorithmCyclesFreq. (MHz)Latency (μs)  ALMs  RegistersAT
[1]  2011, Burton66,26457.541151.63200427755503.66
[2]  2004, Hankerson53454.669.772619130238.31
[3]  2015, Hossain53554.529.813735130349.42
[4]  2005, Daly35839.739.012474103831.64
[5]  2017, Mrabet120564.5518.671596104349.26
[6]  2009, Chen72372.2110.011968104230.13
[7]  2017, Choi35863.605.63  959103711.24
[8]  2023, Wang42359.567.103475130333.92
[9]  2019, Yang35660.435.893950105729.50
[10]  2007, Yan42354.997.693644130338.05
[11]  2018, Dong33456.935.875276105737.15
[12]  2022, Hao53454.669.772619130238.31
[13]  2023, Guo35633.9510.491653103928.23
Ours20856.713.67122710378.30
Table 8. Comparison on Secp192k1 192-bit and Secp521r1 521-bit prime field curves.
Table 8. Comparison on Secp192k1 192-bit and Secp521r1 521-bit prime field curves.
CurveAlgorithmCyclesFreq. (MHz)Latency (μs)  ALMs  RegistersAT
Secp192k1[2]40467.785.96196598217.57
Ours15159.382.549407814.38
Secp256k1[2]53454.669.772619130238.31
Ours20856.713.67122710378.30
Secp521r1[2]110936.3630.5056782627253.31
Ours42933.6012.772245209755.44
Table 9. Execution example of Q = d P with d = 10101 2 = 21 .
Table 9. Execution example of Q = d P with d = 10101 2 = 21 .
WeightPoint AdditionPoint Doubling
Initial Q = O R = P
d 0 = 1 1 Q = Q + R = O + P = P R = 2 R = 2 P
d 1 = 0 2 R = 2 R = 4 P
d 2 = 1 4 Q = Q + R = P + 4 P = 5 P R = 2 R = 8 P
d 3 = 0 8 R = 2 R = 16 P
d 4 = 1 16 Q = Q + R = 5 P + 16 P = 21 P R = 2 R = 32 P
Table 10. ECC comparison using modular inversion algorithms (on Altera Cyclone V FPGA chip).
Table 10. ECC comparison using modular inversion algorithms (on Altera Cyclone V FPGA chip).
AlgorithmCyclesFreq. (MHz)Latency (ms)ALMsRegistersAT
[2]  2004, Hankerson402,14515.9425.2315,0438355590,300.42
[3]  2015, Hossain402,40015.5825.8317,5858355669,977.92
[4]  2005, Daly357,26216.0622.2514,9757821507,107.38
[5]  2017, Mrabet570,14216.0735.4813,2927834749,522.08
[6]  2009, Chen455,42516.1528.2014,2117831621,577.58
[7]  2017, Choi356,87816.0122.2912,1147820444,347.67
[8]  2023, Wang372,12715.9823.2917,2928353597,196.30
[9]  2019, Yang352,76115.7222.4417,8417860576,737.31
[10]  2007, Yan372,12716.0823.1418,5488355622,595.32
[11]  2018, Dong346,19415.8821.8020,7167859622,952.99
[12]  2022, Hao402,14515.8925.3115,0418354592,081.96
[13]  2023, Guo356,49615.6422.7913,5717824487,674.68
Ours317,68116.1319.7012,1607822393,546.29
Table 11. Comparison of scalar point multiplication circuits (on Altera Cyclone V FPGA chip).
Table 11. Comparison of scalar point multiplication circuits (on Altera Cyclone V FPGA chip).
Algorithm Cycles Freq. (MHz)Latency (ms) ALMs RegistersAT
Traditional (Algorithm 12)317,68116.1319.7012,1607822393,546.29
Mont.ladder (Algorithm 13)318,83115.8620.1012,7237577408,087.60
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, Y. Area–Time-Efficient High-Radix Modular Inversion Algorithm and Hardware Implementation for ECC over Prime Fields. Computers 2024, 13, 265. https://doi.org/10.3390/computers13100265

AMA Style

Li Y. Area–Time-Efficient High-Radix Modular Inversion Algorithm and Hardware Implementation for ECC over Prime Fields. Computers. 2024; 13(10):265. https://doi.org/10.3390/computers13100265

Chicago/Turabian Style

Li, Yamin. 2024. "Area–Time-Efficient High-Radix Modular Inversion Algorithm and Hardware Implementation for ECC over Prime Fields" Computers 13, no. 10: 265. https://doi.org/10.3390/computers13100265

APA Style

Li, Y. (2024). Area–Time-Efficient High-Radix Modular Inversion Algorithm and Hardware Implementation for ECC over Prime Fields. Computers, 13(10), 265. https://doi.org/10.3390/computers13100265

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop