Area–Time-Efficient High-Radix Modular Inversion Algorithm and Hardware Implementation for ECC over Prime Fields

Li, Yamin

doi:10.3390/computers13100265

Open AccessArticle

Area–Time-Efficient High-Radix Modular Inversion Algorithm and Hardware Implementation for ECC over Prime Fields

by

Yamin Li

Computer Architecture Laboratory, Department of Computer Science, Faculty of Computer and Information Sciences, Hosei University, Tokyo 184-8584, Japan

Computers 2024, 13(10), 265; https://doi.org/10.3390/computers13100265

Submission received: 10 September 2024 / Revised: 8 October 2024 / Accepted: 10 October 2024 / Published: 12 October 2024

(This article belongs to the Special Issue Using New Technologies in Cyber Security Solutions (2nd Edition))

Download

Browse Figures

Versions Notes

Abstract

:

Elliptic curve cryptography (ECC) is widely used for secure communications, because it can provide the same level of security as RSA with a much smaller key size. In constrained environments, it is important to consider efficiency, in terms of execution time and hardware costs. Modular inversion is a key time-consuming calculation used in ECC. Its hardware implementation requires extensive hardware resources, such as lookup tables and registers. We investigate the state-of-the-art modular inversion algorithms, and evaluate the performance and cost of the algorithms and their hardware implementations. We then propose a high-radix modular inversion algorithm aimed at reducing the execution time and hardware costs. We present a detailed radix-8 hardware implementation based on 256-bit primes in Verilog HDL and compare its cost performance to other implementations. Our implementation on the Altera Cyclone V FPGA chip used 1227 ALMs (adaptive logic modules) and 1037 registers. The modular inversion calculation took 3.67 ms. The AT (area–time) factor was 8.30, outperforming the other implementations. We also present an implementation of ECC using the proposed radix-8 modular inversion algorithm. The implementation results also showed that our modular inversion algorithm was more efficient in area–time than the other algorithms.

Keywords:

computer security; elliptic curve cryptography; modular inversion; hardware; Verilog HDL; FPGA; cost performance evaluation

1. Introduction

Nowadays, Internet of Things (IoT) applications use hardware security modules to ensure secure communications. In such a constrained environment, execution time and hardware costs are critical to efficient system design. Elliptic curve cryptography (ECC) is one of the most advanced public key cryptographic techniques. It requires a smaller key than other methods, to achieve roughly the same level of security.

ECC can be used to provide secure key agreement between two parties over an insecure network. It can also be used for digital signatures, to verify the authenticity and integrity of digital messages. Modular inversion is a critical operation in ECC. ECC calculates points on an elliptic curve over a finite field (such as a field of prime numbers) based on point addition (PA) and point doubling (PD) computations. In affine coordinates, PA and PD must calculate the slope of a line. Such calculations involve costly modular inversions. In projective or Jacobian coordinates, PA and PD do not require such calculations, but a modular inversion is still required to transform the points into affine coordinates to obtain the same key for the two parties.

Given a prime number m, the inverse r of a number a with

a < m

is defined as

r = a^{- 1} \mod m

. There are two main popular methods for calculating modular inversion:

Extended Euclidean algorithm (EEA) without using divisions.
Using Fermat’s little theorem $a^{m - 1} = 1 \mod m$ [1]: $r = a^{m - 2} \mod m = a^{- 1} \mod m$ .

We will see that the method using Fermat’s little theorem takes more time and requires more registers than the EEA. Therefore, we focus our design on the use of the EEA.

The EEA inherently needs divisions. It calculates the integer quotient and the remainder based on the quotient. The divisions can be replaced by addition, subtraction, and shift operations. For simplicity, we will also refer to the EEA which does not use divisions as an EEA.

To calculate

r = a^{- 1} \mod m

, the EEA first initializes the variables

u, v, x, y

with inputs

a, m, 1, 0

, respectively. Then, the EEA repeats calculations containing only addition, subtraction, and shift operations on

u, v, x, y

until

u = 1

or

v = 1

. Finally, the modular inversion result is available by adjusting x or y, corresponding to

u = 1

or

v = 1

. A modular inversion algorithm is said to be fast if u or v reaches 1 quickly.

The most widely used modular inversion algorithm is Algorithm 2.22, proposed by Hankerson, Menezes, and Vanstone [2]. It repeatedly shifts u or v to the right when u or v is even. Correspondingly, x is also shifted to the right with the shift of u; y is also shifted to the right with the shift of v. Note that, when x or y is odd, m will be added before the shift. This guarantees that the value to be right-shifted is even, since the prime m is odd. Next, if

u \geq v

, u and x are replaced by

u - v

and

x - y

, respectively. Otherwise, v and y are replaced by

v - u

and

y - x

, respectively. Finally, the result is

x \mod m

if

u = 1

, and

y \mod m

otherwise. Hossain and Kong [3] revised Algorithm 2.22 by adding m to x or y if it is negative. This ensures that x and y are non-negative. Daly, Marnane, Kerins, and Popovici [4] revised Algorithm 2.22 by dividing

u - v

or

v - u

by two because the subtraction result is even (both u and v are odd before the subtraction). Correspondingly,

x - y

or

y - x

also needs to be divided by two: If

x - y

or

y - x

is odd, m is added before the division. Division by two is performed by shifting one bit to the right. Mrabet, El-Mrabet, Bouallegue, Mesnager, and Machhout proposed a modular inversion algorithm [5] with

u + v

. Instead of

u - v

or

v - u

, as Algorithm 2.22 uses, they perform

u + v

for new u or v. This operation slows down the speed at which u or v reaches 1, increasing the execution time. Chen and Qin proposed a modular inversion algorithm [6] that only uses adders. Subtractions are performed by addition with inversion and addition by 1. Choi, Lee, Kong, and Kim proposed a modular inversion algorithm [7] that replaces the repeated shift of u or v and the corresponding shift of x or y in Algorithm 2.22 by a selection of

u, x

or

0, 0

, or a selection of

v, y

or

0, 0

, based on the even/odd of v or u. This simplifies the circuit by replacing adders with multiplexers, reducing the circuit delay. In addition, they use

- v

and

- y

, instead of v and y, during the calculation. This merges

u - v

and

v - u

into

u + v

and merges

x - y

and

y - x

into

x + y

, reducing the circuit cost. Mixed radix-4 modular inversion algorithms were investigated in [7,8,9,10]. If u or v is divisible by four, u or v is shifted to the right by two bits. Otherwise, if u or v is even (divisible by two), u or v is shifted to the right by one bit. Otherwise (both u and v are odd),

u - v

or

v - u

is shifted to the right by one bit and assigned to u or v. Correspondingly, x or y is adjusted by adding

- m

, m, or

2 m

and shifted to the right by two bits or one bit. [8] proposed a radix-4 modular inversion algorithm that uses sequential condition checking for the calculation of u, v, x, and y. [9] implemented the SM2 ECC protocol. The iterations of the modular inversion are controlled by the bit counter

ρ

, resulting in unnecessary iterations. Using u and v to control the iterations will finish the calculation quickly. [10] presented a radix-4 version of Algorithm 2.22. Dong, Zhang, and Gao proposed a mixed radix-8 modular inversion algorithm [11] that uses extensive hardware resources. Hao et al. presented a lightweight architecture for elliptic curve scalar multiplication over prime fields [12]. They revised Algorithm 2.22 by using only adders and forced

x - y

and

y - x

to be in the range 0 to m. Guo et al. proposed a modular inversion algorithm [13] that makes v always be odd. If u is even, it is shifted one bit to the right. Correspondingly, x is also shifted one bit to the right (if x is odd, m is added before the shift). Otherwise (u is odd),

u - v

(if

u \geq v

) or

v - u

(if

u < v

) is shifted one bit to the right, and the shifted value is assigned to u. Meanwhile,

x - y

or

y - x

is shifted one bit to the right, and the shifted value is assigned to x. Then, v is updated with u or v; y is updated with x or y. The above calculations are repeated until u becomes 1. Then, the result of the modular inversion is x. In lines 14 and 15 of their algorithm,

x_{1}^{'}

and

x_{2}^{'}

are compared and

x_{1}

is guaranteed to be non-negative. However, due to the division by two, the dividend must be adjusted so that it is even. If it is odd, m (p in their algorithm) must be added to it before the division. These codes were not presented in their algorithm.

The AT (area–time) factor is often used for comparisons between implementations. It is defined as the execution time in milliseconds multiplied by the required hardware resources consisting of registers and lookup tables or ALMs (adaptive logic modules).

In this paper, we implement and evaluate all the algorithms mentioned above. We then propose a mixed radix-8 modular inversion algorithm aimed at reducing the execution time and hardware costs. We give a detailed hardware implementation in Verilog HDL based on 256-bit prime numbers. This had lower hardware costs for ALMs and registers and had better performance than the other algorithms. The implementation on the Altera Cyclone V FPGA chip used 1227 ALMs and 1037 registers and took 3.67 ms for the modular inversion computation. It achieved an AT factor of 8.30, lower than all other implementations. We show that the proposed algorithm is also efficient for 192-bit and 521-bit prime numbers. We implemented ECC using different modular inversion algorithms and compared their cost performance. The ECC implementation results showed that our modular inversion algorithm was more efficient in area–time than the other algorithms. We also present an efficient implementation of the Montgomery ladder scalar point multiplication algorithm, a constant execution time algorithm that is resistant to side-channel attacks. The short execution time and low hardware cost of our algorithm and implementation are significant advantages, especially in constrained environments where computing and battery power are limited.

The rest of this paper is organized as follows: Section 2 introduces ECC and modular inversion algorithms. In Section 3, we propose a mixed radix-8 modular inversion algorithm, give its hardware implementation in Verilog HDL, and compare its cost performance with other algorithms. In Section 4, we provide an ECC implementation using the proposed radix-8 modular inversion algorithm and compare its cost performance with ECC implementations using other modular inversion algorithms. This section also presents an efficient implementation of the Montgomery ladder scalar point multiplication algorithm. In Section 5, we discuss some issues related to the algorithm and hardware design. We conclude the paper and suggest some future research topics in Section 6.

2. ECC and Modular Inversion Algorithms

This section introduces ECC and modular inversion algorithms based on the EEA.

2.1. Elliptic Curve Cryptography

ECC [14,15] relies on the fact that scalar point multiplication

Q = d P

can be computed, but it is almost impossible to compute d given only the original point P and the point of the product Q. An ECC over the finite field of an n-bit prime number m can use the following equation:

y^{2} = x^{3} + a x + b \mod m

(1)

For example, the Secp256k1 [16] elliptic curve used in Ethereum Blockchain uses a 256-bit

m = 2^{256} - 2^{32} - 2^{9} - 2^{8} - 2^{7} - 2^{6} - 2^{4} - 1

. Secp256k1 defines

y^{2} = x^{3} + a x + b = x^{3} + 7

and gives a point

P = [x, y]

on the elliptic curve, as follows:

a = 0x0000000000000000000000000000000000000000000000000000000000000000

b = 0x0000000000000000000000000000000000000000000000000000000000000007

m = 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f

x = 0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798

y = 0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8

The elliptic curve Diffie–Hellman (ECDH) key exchange protocol can be used by two parties, Alice and Bob for example, to establish a shared secret key over an insecure network [16,17]. The ECDH protocol is shown in Table 1.

Because

Q_{a b} = d_{a} Q_{b} = d_{a} d_{b} P

,

Q_{b a} = d_{b} Q_{a} = d_{b} d_{a} P

, and

d_{a} d_{b} = d_{b} d_{a}

, we have

Q_{b a} = Q_{a b}

. Below is an ECDH key exchange example using Secp256k1. We can see that the two parties, Alice and Bob, have the same shared secret key (Qabx = Qbax).

Alice generates and keeps

d_{a}

secret and exposes

Q_{a} = d_{a} P

:

da = 0x650aa7095daeaa37ab9051541f0ce304f8969a6d88bb3bebb4fe680fca9a2595

Qax = 0x167d2537aa6bbd8d978b58be0f9466520b7b184e205ff96a9ff567b35b32c7b7

Qay = 0xde3961553d36551f92726fee0e332133960edddccd2784b98b2af730d2fc6e14

Bob generates and keeps

d_{b}

secret and exposes

Q_{b} = d_{b} P

:

db = 0xedc68f194c4e30d6ef90467df822b00e5ef122dea48c9d1c54817080d1a341f4

Qbx = 0x839da64a414c2243a5526230603109be9c615613a9e98c3d650bb0488580bbda

Qby = 0x96e88e99304a5afcdd77c4f3b3327a28162627ebe08194baa0c78dfb67a11042

Alice obtains

Q_{b}

and calculates

Q_{a b} = d_{a} Q_{b}

:

Qabx = 0x1f254c7da15899275cdcab9d992f58251a4ab630fe9864d20cf317ab57749947

Qaby = 0xd6cb400b3c49d33d3df28f9d34fa09f8b6c8edf117a378c5a45d0a51e6c0debc

Bob obtains

Q_{a}

and calculates

Q_{b a} = d_{b} Q_{a}

:

Qbax = 0x1f254c7da15899275cdcab9d992f58251a4ab630fe9864d20cf317ab57749947

Qbay = 0xd6cb400b3c49d33d3df28f9d34fa09f8b6c8edf117a378c5a45d0a51e6c0debc

Now, Alice and Bob have the same secret key (Qabx = Qbax). They can use symmetric-key cryptography for subsequent communications. A third party, Eve for example, knows

y^{2} = x^{3} + a x + b \mod m

, P,

Q_{a}

, and

Q_{b}

, but cannot calculate the same secret key.

2.2. Point Addition and Point Doubling

Scalar point multiplication

Q = d P

uses point addition (PA) and point doubling (PD).

2.2.1. Point Addition

Given

P = [x_{p}, y_{p}]

and

Q = [x_{q}, y_{q}]

, the formulas for point addition

R = [x_{r}, y_{r}] = P + Q

on elliptic curve

y^{2} = x^{3} + a x + b \mod m

are shown as follows, where

λ

is the slope of the line through points P and Q. The derivation of the formulas can be found in [18].

\begin{matrix} \{\begin{matrix} λ & = \frac{y_{q} - y_{p}}{x_{q} - x_{p}} \mod m \\ x_{r} & = (λ^{2} - x_{p} - x_{q}) \mod m \\ y_{r} & = (λ (x_{p} - x_{r}) - y_{p}) \mod m \end{matrix} \end{matrix}

(2)

The point at infinity, denoted

O

, is included in the group of elliptic curves and is defined as

P + (- P) = O

for

Q = - P

. By this definition,

P + O = P

. In our implementation,

O

is represented as

[- 1, - 1]

. In the case of

P = O

,

R = P + Q = O + Q = Q

. In the case of

Q = O

,

R = P + Q = P + O = P

. We give the point addition

R = P + Q

algorithm over the finite field of

F_{m}

in Algorithm 1. In the case of

Q = - P

,

R = P + Q = P + (- P) = O

(line 5 in the algorithm). In the case of

Q = P

,

R = P + Q = P + P = 2 P

, we perform the point doubling

R = 2 P

(line 6 in the algorithm).

Algorithm 1 PA (P, Q, m, a) (Point Addition in Affine Coordinates).

inputs: Points

P = [P_{x}, P_{y}]

and

Q = [Q_{x}, Q_{y}]

; m and a in

y^{2} = x^{3} + a x + b \mod m

output:

R = P + Q = [R_{x}, R_{y}] = [x_{r}, y_{r}]

begin

1

x_{p} = P_{x}

,

y_{p} = P_{y}

,

x_{q} = Q_{x}

,

y_{q} = Q_{y}

,

O = [- 1, - 1]

2 if

P = O

return Q /*

O + Q = Q

*/

3 if

Q = O

return P /*

P + O = P

*/

4 if

x_{p} = x_{q}

5 if

(y_{p} + y_{q}) \mod m = 0

return

O

/*

P + (- P) = O

*/

6 else return PD (P, m, a) /*

P + P = 2 P

*/

7

λ = ((y_{q} - y_{p}) / (x_{q} - x_{p})) \mod m

8

x_{r} = (λ^{2} - x_{p} - x_{q}) \mod m

9

y_{r} = (λ (x_{p} - x_{r}) - y_{p}) \mod m

10 return

[x_{r}, y_{r}]

/*

R = P + Q

*/

end

An example of point addition

R = P + Q

on the Secp256k1 curve is shown below, where

[P_{x}, P_{y}] = P

,

[Q_{x}, Q_{y}] = Q

, and

[R_{x}, R_{y}] = R

in affine coordinates.

Px = 0xfc7dafb820a20da1a73c36465f2fe37bfd98ce4ef3a10a5df110abda03b20a3d

Py = 0xa442a2d1b8bde4a09e45725add5daae89e726b56f0e8fe6609dacaf5279b2564

Qx = 0xe106c069450b2663febb83e29b67fa93c4c48a45d5fbe7ce4ddb8ceb601fcc1d

Qy = 0xc9da9bd440909c8862c06a44d432d2dd45284636b7049b9bf4695f9e4018d2f2

Rx = 0xfd52a0334e16f8cf45a6b0820887a9e8b1b180516a76c8adfef95df98aeef376

Ry = 0xb0fe3f04cc4c64fd66a133b8c97b4905771238f8ba89631efb85a8059e969a49

2.2.2. Point Doubling

Given

P = [x_{p}, y_{p}]

, the formulas for point doubling

R = [x_{r}, y_{r}] = 2 P

on elliptic curve

y^{2} = x^{3} + a x + b \mod m

are shown as follows, where

λ

is the slope of the tangent line of the elliptic curve at point P. The derivation of the formulas can be found in [18].

\begin{matrix} \{\begin{matrix} λ & = \frac{3 x_{p}^{2} + a}{2 y_{p}} \mod m \\ x_{r} & = (λ^{2} - 2 x_{p}) \mod m \\ y_{r} & = (λ (x_{p} - x_{r}) - y_{p}) \mod m \end{matrix} \end{matrix}

(3)

We give the point doubling

R = 2 P

algorithm over the finite field of

F_{m}

in Algorithm 2. In the case of

P_{y} = 0

(vertical tangent line),

R = 2 P = O

(line 2 in the algorithm).

An example of point doubling

R = 2 P

on the Secp256k1 curve is shown below, where

[P_{x}, P_{y}] = P

and

[R_{x}, R_{y}] = R

in affine coordinates.

Px = 0x6034b56424fb31ea6ec5483b52ae5d07d6f3ef80264d769ae2714abb83fb279a

Py = 0xfe4cde1ff7546a87f906f50ab1002fda7811828ea6fc467a44d1c6c11aa65a37

Rx = 0x5491ee8b73a4ed9713ed32e467de5100b80861babf8ffd09fd595ab457d042c9

Ry = 0xf91e6a4e132a1bdf4f5c846559431ec7373de8872b719f188b5902932f0a2b30

The computation of

λ

in PA and PD requires modular division, which can be realized using a modular inversion algorithm based on the EEA.

Algorithm 2 PD (P, m, a) (Point Doubling in Affine Coordinates).

inputs: Point

P = [P_{x}, P_{y}]

; m and a in

y^{2} = x^{3} + a x + b \mod m

output:

R = 2 P = [R_{x}, R_{y}] = [x_{r}, y_{r}]

begin

1

x_{p} = P_{x}

,

y_{p} = P_{y}

,

O = [- 1, - 1]

2 if

y_{p} = 0

return

O

/* vertical tangent */

3

λ = ((3 x_{p}^{2} + a) / (2 y_{p})) \mod m

4

x_{r} = (λ^{2} - 2 x_{p}) \mod m

5

y_{r} = (λ (x_{p} - x_{r}) - y_{p}) \mod m

6 return

[x_{r}, y_{r}]

/*

R = 2 P

*/

end

2.3. Modular Inversion Algorithms

Given a prime number m, the inverse r of a number a with

a < m

is defined as

r = a^{- 1} \mod m

(4)

Algorithm 3 (modinv_fermat) implements the modular inversion calculation using Fermat’s little theorem. If m is prime and

a \neq 0 (\mod m)

, Fermat’s little theorem says that

a^{m - 1} = 1 \mod m

. Multiplying both sides by

a^{- 1}

gives us

a^{m - 2} \mod m = a^{- 1} \mod m

. Then, we can calculate the modular inversion with

r = a^{m - 2} \mod m

. This modular exponentiation can be performed using the multiply-squaring method, as shown in Algorithm 3 (modinv_fermat). This calculation consists of costly modular multiply and modular squaring, very similar to RSA exponentiation [19].

Algorithm 3 modinv_fermat (a, m) (Modinv using Fermat’s little theorem).

inputs: Prime m and a with

a < m

output:

a^{- 1} \mod m

begin

1

k = m - 2

;

x = 1

;

y = a

2 while

k \neq 0

3 if k is odd

4

x = x y \mod m

/* modular multiply */

5

y = y^{2} \mod m

/* modular squaring */

6

k = k ≫ 1

7 return x

end

The Python code below implements Algorithm 3 (modinv_fermat). When executed, it outputs 4 4 4. The first value is calculated by the code, and the rest are for checking.

The EEA can be used for the modular inversion calculation. Algorithm 4 (modinv1) gives the fundamental EEA for the modular inversion calculation. Line 4 calculates the integer quotient q of u divided by v. Lines 5 and 6 calculate the remainders

v = u - q v

and

y = x - q y

based on the quotient q and store the original v and y in u and x, respectively. These calculations are repeated until

v = 0

.

Algorithm 4 modinv1 (b, a, m) (Modular Inversion Algorithm 1).

inputs: Prime m, a, and b with

a, b < m

output:

b a^{- 1} \mod m

begin

1

u, v = a, m

/*

u = a

and

v = m

*/

2

x, y = b, 0

/*

x = b

and

y = 0

*/

3 while

v \neq 0

4

q = ⌊ u / v ⌋

/* q: integer quotient */

5

u, v = v, u - q v

6

x, y = y, x - q y

7 return

x \mod m

end

Considering

b = 1

. u and x are initialized with a and 1, respectively. At each iteration, u and x are modified with similar calculations. Therefore, when u reaches 1 from a, x reaches the reciprocal of a from 1:

u : a \to (/ a) \to 1; x : 1 \to (/ a) \to a^{- 1}; x : b \to (/ a) \to b a^{- 1}

If m is a prime number, the greatest common divisor of a and m is guaranteed to be 1, and we can always obtain the inverse result of a. With the initialization of x with b, the algorithm performs the modular division

r = b a^{- 1} \mod m

.

An execution example of Algorithm 4 (modinv1) with

b = 1

,

a = 3

, and

m = 11

is shown in Table 2. The calculation finishes when

v = 0

. The result

r = a^{- 1} \mod m = x

\mod m = 4 \mod 11 = 4

. We can check the correctness as follows:

r a \mod m = 4 \times 3

\mod 11 = 12 \mod 11 = 1 \mod 11

.

The algorithm requires division, which is expensive. As shown in Algorithm 5 (modinv2), we can eliminate the division by setting the quotient to 0 or 1.

Algorithm 5 modinv2 (b, a, m) (Modular Inversion Algorithm 2).

inputs: Prime m, a, and b with

a, b < m

output:

b a^{- 1} \mod m

begin

1

u, v = a, m

2

x, y = b, 0

3 while

v \neq 0

4

q = 0

if

u < v

else 1

5

u, v = v, u - q v

6

x, y = y, x - q y

7 return

x \mod m

end

The algorithm yields a quotient of 0 or 1 based on the comparison of u and v. If the quotient is a 1, subtractions

u - v

and

x - y

are performed (lines 5 and 6). Otherwise no calculations are performed (simply swapping u with v and swapping x with y), which make the computation slower. The execution of Algorithm 5 (modinv2) with

b = 1

,

a = 3

, and

m = 11

requires nine iterations, as shown in Table 3.

The algorithm can be modified to remove the no calculations, as shown in Algorithm 6 (modinv3). The code in lines 4 and 5 ensures that u and v are non-negative. Note that u and x are one pair, and v and y are another pair. Algorithm 6 (modinv3) reduces the number of iterations by about half, as shown in Table 4.

Algorithm 6 modinv3 (b, a, m) (Modular Inversion Algorithm 3).

inputs: Prime m, a, and b with

a, b < m

output:

b a^{- 1} \mod m

begin

1

u, v = a, m

2

x, y = b, 0

3 while

u \neq 1

and

v \neq 1

4 if

u < v : v, y = v - u, y - x

5 else

u, x = u - v, x - y

6 if

u = 1

return

x \mod m

7 else return

y \mod m

end

In Table 4, because

v > u

for

i = 0

to 2, we update v with

v - u

. Correspondingly, we update y with

y - x

. For

i = 3

, because

u > v

, we update u with

u - v

. Correspondingly, we update x with

x - y

. For

i = 4

, because

u = 1

, the result is

x \mod m = 4

.

We can check u first, before the subtractions. If it is even, we shift it to the right by one bit (the least significant bit 0 is shifted out). Correspondingly, x must also be shifted. To ensure that the value to be right-shifted is even, m will be added before the shift if x is odd. Note that m is odd because it is a prime number. These shifts of u and x can be performed repeatedly until u becomes an odd number.

Similarly, if v is even, we shift it to the right by one bit. Correspondingly, y must also be shifted. If y is odd, m will be added before the shift. These shifts of v and y can be performed repeatedly until v becomes an odd number.

Then, we obtain the algorithm shown in Algorithm 7 (modinv4). The idea behind it is that division makes u and v reach 1 faster than subtraction. In fact, this is Algorithm 2.22 provided in [2] and implemented in Verilog HDL in [18].

Algorithm 7 modinv4 (b, a, m) (Modular Inversion Algorithm 4).

inputs: Prime m, a, and b with

a, b < m

output:

b a^{- 1} \mod m

begin

1

u, v = a, m

2

x, y = b, 0

3 while

u \neq 1

and

v \neq 1

4 while u is even

5

u = u / 2

6 if x is even:

x = x / 2

7 else

x = (x + m) / 2

8 while v is even

9

v = v / 2

10 if y is even:

y = y / 2

11 else

y = (y + m) / 2

12 if

u < v : v, y = v - u, y - x

13 else

u, x = u - v, x - y

14 if

u = 1

return

x \mod m

15 else return

y \mod m

end

Algorithm 7 (modinv4) has been widely adopted in ECC implementations. When the two inner while loops (lines 4 to 11) finish, u and v are both odd numbers. Therefore,

u - v

or

v - u

is even. Then, we can shift it to the right by one bit. Correspondingly,

x - y

or

y - x

must also be shifted. If

x - y

or

y - x

is odd, m must be added before the shift so that the bit being shifted out is 0. This algorithm is shown in Algorithm 8 (modinv5). The shifts are performed by the code in lines 12 to 19.

Algorithm 8 modinv5 (b, a, m) (Modular Inversion Algorithm 5).

inputs: Prime m, a, and b with

a, b < m

output:

b a^{- 1} \mod m

begin

1

u, v = a, m

2

x, y = b, 0

3 while

u \neq 1

and

v \neq 1

4 while u is even

5

u = u / 2

6 if x is even:

x = x / 2

7 else

x = (x + m) / 2

8 while v is even

9

v = v / 2

10 if y is even:

y = y / 2

11 else

y = (y + m) / 2

12 if

u < v

13

v, y = (v - u) / 2, y - x

14 if y is even:

y = y / 2

15 else

y = (y + m) / 2

16 else

17

u, x = (u - v) / 2, x - y

18 if x is even:

x = x / 2

19 else

x = (x + m) / 2

20 if

u = 1

return

x \mod m

21 else return

y \mod m

end

The two inner while loops in Algorithm 8 (modinv5) can be replaced by assigning u, x, v, y, or 0 to the temporary variables

t u

,

t x

,

t v

,

t y

, so that

t u - t v

is an even number. Next,

t u - t v

is shifted one bit to the right. Correspondingly,

t x - t y

is also shifted one bit to the right. Note that if

t x - t y

is odd, m is added before the shift. This algorithm is shown in Algorithm 9 (modinv6). Note that only

t x - t y

is executed; it may be negative. The code in lines 12 and 13 ensures that u and v are non-negative.

Algorithm 9 modinv6 (b, a, m) (Modular Inversion Algorithm 6).

inputs: Prime m, a, and b with

a, b < m

output:

b a^{- 1} \mod m

begin

1

u, v = a, m

2

x, y = b, 0

3 while

u \neq 1

and

v \neq 1

4 if u is odd:

t v, t y = v, y

5 else

t v, t y = 0, 0

6 if v is odd:

t u, t x = u, x

7 else

t u, t x = 0, 0

8

t u v, t x y = t u - t v, t x - t y

9

u v = t u v / 2

10 if

t x y

is even:

x y = t x y / 2

11 else

x y = (t x y + m) / 2

12 if

u v < 0 : v, y = - u v, - x y

13 else

u, x = u v, x y

14 if

u = 1

return

x \mod m

15 else return

y \mod m

end

Algorithm 9 (modinv6) requires calculations of

t u - t v

,

t v - t u

,

t x - t y

, and

t y - t x

. We can unify these calculations with negative assignments

- v

and

- y

to v and y, respectively, so that only

t u + t v

and

t x + t y

are sufficient for the calculations. That is, with negative assignments

- v

and

- y

to v and y,

u = u - v = u + (- v)

becomes

u = u + v

, and

x = x - y = x + (- y)

becomes

x = x + y

. Similarly,

v = - (u - v)

becomes

v = u + v

, and

y = - (x - y)

becomes

y = x + y

with negative assignments

- v

and

- y

to v and y. Therefore,

u + v

and

x + y

are sufficient for the calculations, saving hardware costs.

The algorithm is given in Algorithm 10 (modinv7). Because of the negative assignments to v and y, v is initialized with

- m

and y is initialized with

- 0 = 0

. Note that x is never greater than or equal to m. Therefore, for the final result, no adjustment of

x = x - m

or

x = x \mod m

is required. All we need is

x = x + m

for

x < 0

.

Algorithm 10 modinv7 (b, a, m) (Modular Inversion Algorithm 7).

inputs: Prime m, a, and b with

a, b < m

output:

b a^{- 1} \mod m

begin

1

u, v = a, - m

2

x, y = b, - 0

3 while

u \neq 1

4 if u is odd:

t v, t y = v, y

5 else

t v, t y = 0, 0

6 else if v is odd:

t u, t x = u, x

7 else else

t u, t x = 0, 0

8 else

t u v, t x y = t u + t v, t x + t y

9 else

u v = t u v / 2

10 if

t x y

is even:

x y = t x y / 2

11 else

12 if

t x y < 0 : x y = (t x y + m) / 2

13 else

x y = (t x y - m) / 2

14 if

u v < 0 : v, y = u v, x y

15 else

u, x = u v, x y

16 if

x < 0 : x = x + m

17 return x

end

The Python codes for Algorithms 4–10 (modinv1 to modinv7) are given in Appendix A. We generate random numbers b and a that are smaller than a fixed 256-bit m. For the same b, a, m inputs, all modular inversion algorithms have the same output.

A modular inversion algorithm is said to be good when u reaches 1 quickly (high performance) and the algorithm uses a small number of adders and subtractors (low cost).

3. Proposed Radix-8 Modular Inversion Algorithm and Its Performance

The proposed mixed radix-8 modular inversion algorithm is given in Algorithm 11 (modinv_radix8). To calculate

r = b a^{- 1} \mod m

, we initialize

u = a

,

v = - m

,

x = b

, and

y = - 0

with the negative assignment to v and y. The temporary variable

t u

is assigned with u or 0 and the temporary variable

t v

is assigned with v or 0, so that

t u v = t u + t v

is even. Correspondingly, the temporary variable

t x

is also assigned with x or 0 and the temporary variable

t y

is assigned with y or 0. If the least significant three bits of

t u v

are 000, it is shifted to the right by three bits (radix-8). Otherwise, if the least significant two bits of

t u v

are 00, it is shifted to the right by two bits (radix-4). Otherwise, it is shifted to the right by one bit (radix-2), because

t u v

is even. This is also called a hybrid radix algorithm. It is difficult to develop a complete radix-8 algorithm without using radix-4 or radix-2 arithmetic. We need to handle all cases where the least significant three bits of

t u v

are not 000 (there are seven cases) and perform the corresponding radix-8 arithmetic.

Algorithm 11 modinv_radix8 (b, a, m) (Radix-8 Modular Inversion Algorithm).

inputs: Prime m, a, and b with

a, b < m

output:

b a^{- 1} \mod m

begin

1

u, v = a, - m

2

x, y = b, - 0

3 while

u \neq 1

4 if u is odd:

t v, t y = v, y

5 else

t v, t y = 0, 0

6 if v is odd:

t u, t x = u, x

7 else

t u, t x = 0, 0

8

t u v, t x y = t u + t v, t x + t y

/*

t u v

is even */

9 if

t u v & 6 = 0

/* radix 8 */

10

u v = t u v / 8

11 if

t x y & 1 = 0

12 if

t x y & 2 = 0

13 if

t x y & 4 = 0 : x y = t x y / 8

14 else

x y = (t x y + 4 m) / 8

15 else

16 if

t x y & 4 = (2 m & 4) : x y = (t x y - 2 m) / 8

17 else

x y = (t x y + 2 m) / 8

18 else

19 if

t x y & 6 = m & 6 : x y = (t x y - m) / 8

20 else

21 if

t x y & 2 = m & 2 : x y = (t x y + 3 m) / 8

22 else

23 if

t x y & 4 \neq m & 4 : x y = (t x y + m) / 8

24 else

x y = (t x y - 3 m) / 8

25 else

26 if

t u v & 2 = 0

/* radix 4 */

27

u v = t u v / 4

28 if

t x y & 1 = 0

29 if

t x y & 2 = 0 : x y = t x y / 4

30 else

x y = (t x y + 2 m) / 4

31 else

32 if

t x y & 3 = m & 3 : x y = (t x y - m) / 4

33 else

x y = (t x y + m) / 4

34 else /* radix 2 */

35

u v = t u v / 2

36 if

t x y & 1 = 0 : x y = t x y / 2

37 else

38 if

t x y < 0 : x y = (t x y + m) / 2

39 else

x y = (t x y - m) / 2

40 if

u v < 0 : v, y = u v, x y

41 else

u, x = u v, x y

42 if

x < 0 : x = x + m

43 return x

end

Correspondingly,

t x

and

t y

are arranged and

t x y = t x + t y

is also shifted to the right by three bits, two bits, or one bit. The bits being shifted out must be 0. Therefore, we need to adjust

t x y

using the prime number m before the shift. Table 5 lists such adjustments based on the least significant three bits of

t x y

and the least significant three bits of m for the radix-8 operations, where x represents a don’t-care term. The least significant three bits of the adjusted value are 000, as shown in the Comment column of the table.

Similarly, Table 6 lists the adjustments based on the least significant two bits of

t x y

and the least significant two bits of m for the radix-4 operations. The least significant two bits of the adjusted value are 00, as shown in the Comment column of the table.

The Python code for the proposed algorithm is also given in Appendix A. The modinv_radix8 algorithm takes 206 iterations to reach

u = 1

and

v = - 1

. In contrast, the modinv_radix4 and modinv_radix2 algorithms require 243 and 356 iterations, respectively.

To reduce the number of adders, we use a multiplexer to select an appropriate value and assign it to the temporary variable

t z

. Then, we perform

t x y = t x + t y + t z

. Based on the least significant three bits of

t u v

, we assign

t x y ≫ 1

,

t x y ≫ 2

, or

t x y ≫ 3

to

x y

.

Figure 1 shows the block diagram of the proposed radix-8 modular inversion circuit. To perform the addition

t x y = t x + t y + t z

,

- 2 m

and

- 3 m

are replaced by

+ 6 m

and

+ 5 m

, respectively. This is because, for example, for an integer i,

(i - 3 m) \mod m = (i - 3 m + 8 m)

\mod m =

(i + 5 m) \mod m

. Furthermore, for the addition, we prepare

- m

that can be obtained by inverting all the bits of m and setting the right-most bit to 1 because m is odd. The Verilog HDL implementation uses continuous assignment to compute

u v

and

x y

and writes them to the corresponding registers on the rising edge of the clock signal. Note that we have to use adders for generating

3 m

,

5 m

, and

6 m

, which are not shown in the figure.

Below we give the hardware implementation code in Verilog HDL for the proposed radix-8 modular inversion algorithm (modinv_r8.v). The signals start and ready indicate the start of the modular inversion calculation and the availability of the calculation result, respectively. Because we use the Secp256k1 elliptic curve, the input and output signals b, a, m, and c are 256 bits. We use 260 bits for the internal signals. Instead of the 260-bit

t x y = t x + t y

in the Python code, we achieve it with 3 bits (t3). This reduces the execution time and hardware costs. And we perform

(t x < 0) | (t y < 0)

for

t x y < 0

. This guarantees

(t x + t y + m) / 2 < m

.

Below is the testbench Verilog HDL code used to simulate modinv_r8.v.

Figure 2 shows the functional simulation waveform, generated with ModelSim. The result c was available in 416 ns. That is, the calculation took 208 clock cycles. Note that the value of the result c is the same as the output of the Python code in Appendix A.

We implemented the modular inversion algorithms on the Altera Cyclone V 5CGXFC9 E7F35C8 FPGA chip. The EDA tool we used is Quartus Prime Version 20.1.1 Build 720 11/11/2020 SJ Lite Edition. This is the latest edition that integrates with ModelSim for simulation. All algorithms were evaluated in the same environment.

Table 7 lists the cost performance of the modular inversion algorithms. The column Cycles shows the required number of clock cycles when executing the modular inversion algorithm. The column Freq.(MHz) shows the clock frequency in MHz at which the circuit can work. The column Latency(μs) shows the execution time in microseconds calculated by dividing the clock cycles by the clock frequency. The column ALMs shows the required number of adaptive logic modules. The column Registers shows the required number of flip-flops. The flip-flops are mainly used to store u, v, x, and y. Their contents are updated in every clock cycle. The last column shows the AT factor, which is the product of the Latency in milliseconds and the sum of the ALMs and Registers:

AT = Latency \times (ALMs + Registers)

(5)

The row [1] in the table shows the performance and cost of modular inversion using Fermat’s little theorem

r = a^{m - 2} \mod m

. It consists of costly modular multiply and modular squaring. Its AT factor is much higher than the others. The remaining rows show the performance and cost of the EEA-based modular inversion algorithms. The numbers of registers used by [2,3,8,10] were larger than the others. This is because extra registers are used to adjust the value of x or y, so that the modular inversion result is within the range of 0 and m. Our algorithm implementation achieved an execution time of 3.67 μs and an AT factor of 8.30, outperforming all other implementations. Figure 3 shows an intuitive view of the latency and AT histograms.

The proposed radix-8 algorithm was demonstrated to be efficient for 256-bit primes. Table 8 compares the cost performance on the Secp192k1 192-bit prime field curve and Secp521r1 521-bit prime field curve [16]. Here, we only provide a comparison with Algorithm 2.22 [2]. For ease of comparison, we also show the case when using the Secp256k1 256-bit prime field curve in the table. Our algorithm’s AT outperformed [2] by

17.57 / 4.38 = 4.01

times,

38.31 / 8.30 = 4.62

times, and

253.31 / 55.44 = 4.57

times for the curves of Secp192k1, Secp256k1, and Secp521r1, respectively. This shows that our algorithm also demonstrated scalability to other prime sizes and adaptability to other cryptographic curves.

4. ECC Implementation with Proposed Modular Inversion Algorithm

ECC relies on scalar point multiplication. Suppose

P = [x_{p}, y_{p}]

is a point on the curve, the scalar point multiplication

Q = d P

obtains the

Q = [x_{q}, y_{q}]

that is also on the curve, where

d = 〈 d_{n - 1} \dots d_{1} d_{0} 〉

is an n-bit scalar. Scalar point multiplication can be conducted with PA and PD, as shown in Algorithm 12.

Algorithm 12 ScaMul (d, P, m, a) (Scalar Point Multiplication).

inputs:

d = 〈 d_{n - 1} \dots d_{1} d_{0} 〉

and point

P = [P_{x}, P_{y}]

; m and a in

y^{2} = x^{3} + a x + b \mod m

output:

Q = d P

begin

1

Q = O

,

R = P

,

k = d

/*

Q = O

and

R = P

*/

2 while

k \neq 0

do

3 if

k_{0} = 1

4

Q =

PA (Q, R, m, a) /*

Q = Q + R

(Algorithm 1) */

5

R =

PD (R, m, a) /*

R = 2 R

(Algorithm 2) */

6

k = k ≫ 1

7 endwhile

8 return Q /*

Q = d P

*/

end

The algorithm calls point addition PA (P, Q, m, a) and point doubling PD (P, m, a). Table 9 gives an example to show the calculation steps of the scalar point multiplication. For a 5-bit

d = 10101_{2} = 21

, we calculate

Q = d P

in five steps to obtain

Q = 21 P

. We can see that the algorithm is similar to RSA exponentiation [19].

We used the ECDH key exchange protocol, described in Table 1, to establish a shared secret key for two parties. An ECDH key exchange algorithm in Python code is given in Appendix B. The algorithm invokes the scalar point multiplications that use two computations — point addition and point doubling. Four primitive modular calculations (addition, subtraction, multiplication, and inversion) are used for these two computations, as shown in Figure 4. The Python function names in Appendix B are also shown in the figure.

See Appendix B, the Python code is hardware-oriented. Essentially, a Python function defined using the def keyword was implemented in a Verilog HDL module. For integrity, we listed the modinv_radix8 code again, but now with

+ 6 m

and

+ 5 m

instead of

- 2 m

and

- 3 m

, respectively, and the function name has been changed to modinv.

Based on the Python code, we implemented ECC using our radix-8 modular inversion algorithm for calculating

λ

in PA and PD. Figure 5 shows the functional simulation waveform of scalar point multiplication

Q = d P

with

P = [x, y]

and

Q = [q x, q y]

. The result was available in 635,362 ns. That is, the calculation took 317,681 clock cycles. Outputs

q x

and

q y

are the same as the outputs Qax and Qay, respectively, of the Python code in Appendix B.

An ECC cost performance comparison is given in Table 10 when implementing on the Altera Cyclone V 5CGXFC9E7F35C8 FPGA chip. We also used Quartus Prime Version 20.1.1 Lite Edition for the implementation. All ECC implementations used the same circuit, except for the modular inversion part.

Figure 6 shows the latency and AT histogram. The ECC latency using our proposed radix-8 modular inversion algorithm was 0.01970 s and its AT factor was 393,546.29. From the table and histogram, we can see that our ECC implementation achieved lower latency and lower AT factor than all other implementations.

Algorithm 12, the traditional scalar point multiplication, suffers from side-channel attacks, because its execution time depends on the input scalar d. Side-channel attacks attempt to reveal the secret key from leaked information, such as timing, power consumption, or electromagnetic radiation. We can use the Montgomery ladder algorithm [20] to perform the scalar point multiplication, as shown in Algorithm 13. Its execution takes the same constant time regardless of the input scalar d, making it resistant to side-channel attacks.

Algorithm 13 ScaMulMont (d, P, m, a) (Montgomery Ladder Scalar Point Multiplication).

inputs:

d = 〈 1 d_{n - 2} \dots d_{1} d_{0} 〉

and point

P = [P_{x}, P_{y}]

; m and a in

y^{2} = x^{3} + a x + b \mod m

output:

Q = d P

begin

1

Q = P

,

R =

PD (P, m, a) /*

Q = P

and

R = 2 P

*/

2 for

i = n - 2

downto 0 do

3 if

d_{i} = 1

4

Q =

PA (Q, R, m, a) /*

Q = Q + R

(Algorithm 1) */

5

R =

PD (R, m, a) /*

R = 2 R

(Algorithm 2) */

6 else

7

R =

PA (Q, R, m, a) /*

R = Q + R

(Algorithm 1) */

8

Q =

PD (Q, m, a) /*

Q = 2 Q

(Algorithm 2) */

9 endfor

10 return Q /*

Q = d P

*/

end

We give the Montgomery ladder scalar point multiplication algorithm’s Python code as follows, and give the block diagram of the Montgomery ladder circuit in Figure 7b.

Figure 7. Block diagrams of scalar point multiplication circuits. (a) Traditional scalar point multiplication circuit; (b) Montgomery ladder scalar point multiplication circuit.

Since multiple PA or PD operations are not performed simultaneously, one PA module and one PD module are sufficient (PA and PD operate in parallel). Note that the iterative control part based on the scalar d is omitted in the figure. In contrast, the block diagram of the traditional scalar point multiplication circuit is shown in Figure 7a.

Algorithm 13 is resistant to side-channel attacks, while requiring approximately the same execution time and the same hardware resources compared to Algorithm 12, as shown in Table 11. Because the Montgomery ladder scalar point multiplication circuit (Figure 7b) uses more multiplexers, the number of ALMs is larger than Algorithm 12. Algorithm 12 uses a 256-bit register to shift the scalar d, and Algorithm 13 uses a 9-bit counter to control the iterations. Therefore, Algorithm 12 uses more registers than Algorithm 13.

5. Discussion of Design Issues

In the previous sections, we demonstrated through simulations using ModelSim and implementations with the Quartus II EDA tool that the proposed high-radix modular inversion algorithm works correctly and is efficient in terms of execution time and hardware costs. We also showed that the ECC implementation using the proposed algorithm outperformed implementations using other investigated algorithms.

The modular inversion algorithm repeatedly performs addition, subtraction, and shift operations on the variables u, v, x, and y. These variables are typically implemented using registers that are updated on the rising edge of a clock signal. The clock frequency of the circuit is determined by the operation delay between two successive clock rising edges (a clock cycle). Increasing the clock frequency requires decreasing the delay of operations within a clock cycle.

We can use a finite state machine to divide sequential computations into multiple steps and store the results of the steps in registers. This reduces the latency within one clock cycle and increases the clock frequency. However, implementing a finite state machine requires more clock cycles and more registers.

Multiplexers have much lower latency and lower hardware cost compared to carry propagate adders. To reduce hardware costs, if some additions have the same input, instead of using adders and then a multiplexer, we can use a multiplexer before an adder. For example, instead of

s = mux (a + b, a + c, a + d, a + e)

, we can have

s = a + mux (b, c, d, e)

, which reduces the number of adders. The circuit for calculating

t x y

in Figure 1 is designed in this way to generate

t z

using a multiplexer before the adder. In addition, to reduce hardware costs, we only use addition for both addition and subtraction calculations. For example, in Figure 1 we only perform the addition

t x y = t x + t y + t z

. For radix-8 modular operations, the subtractions

- 2 m

and

- 3 m

are replaced by

+ 6 m

and

+ 5 m

, respectively.

The carry-select adder (CSLA) can be used to reduce the latency of the carry propagate adder. For an n bits carry propagate adder, we split the n bits into two

n / 2

bits, the upper

n / 2

bits and the lower

n / 2

bits. The addition of the upper

n / 2

bits is performed simultaneously by two adders, assuming that the carry-in of one adder is 0 and the carry-in of the other adder is 1. Three

n / 2

-bit adders (including one for the lower

n / 2

bits) operate in parallel. When the carry-out of the lower

n / 2

-bit adder is available, a multiplexer is used to select the correct result from the upper two adders. Using a CSLA by dividing n bits into two

n / 2

bits can reduce the latency by approximately half. On the other hand, the hardware cost increases by more than 50% (one extra

n / 2

-bit adder and one

n / 2

-bit multiplexer are required). In general, splitting n bits into m (

n / m

)-bits reduces the latency by about a factor of m, but increases the hardware cost exponentially.

The use of a carry-save adder (CSA) can significantly reduce latency. There is no ripple carry between bits. The result is represented as a carry set and a sum set. A single carry-save adder is equivalent to a 1-bit full adder, which has a low latency. Because of the representation of the two sets, it is not possible to know the final addition result and its sign (negative or positive) without performing an additional addition, with a carry look-ahead adder for example. Therefore, a CSA is commonly used for intermediate calculations. It takes three sets of inputs and produces two sets of outputs.

As shown in Algorithm 11, our proposed radix-8 modular inversion algorithm allows for arbitrary bit primes, because we considered all combinations of the least significant three bits of prime numbers, as shown in Table 5. If we use a fixed prime m, defined by Secp256k1 [16] for instance, the circuit can be simplified by removing the parts of prime number m whose least significant three bits are not 111. Furthermore,

3 m

,

5 m

, and

6 m

can be calculated and stored in a constant table in advance. Then, the hardware can use them directly, without any calculations. This speeds up the radix-8 modular inversion calculations.

Our radix-8 modular inversion algorithm allows any bit prime and has no special requirements on the prime, so we can easily use different elliptic curves with the same or different prime sizes, as shown in Table 8. In addition, the hardware implementation of the algorithm is provided in Verilog HDL, which does not rely on special circuit libraries, allowing the algorithm to be implemented on a variety of different platforms.

This paper mainly focused on the modular inversion algorithm and the hardware implementation. We presented the performance and hardware cost of simple low-cost ECC implementations using different modular inversion algorithms, to demonstrate the benefits of the proposed algorithm. To make a fair comparison between implementations, all ECC hardware circuits were identical, except for the modular inversion circuit. From Table 10, we can see that the frequencies of all ECC implementations were quite low. To increase the clock frequency, we can also use pipeline techniques to divide computations into multiple stages and use pipeline registers to store intermediate results. However, the hardware costs will increase due to the use of pipeline registers.

6. Concluding Remarks

In this paper, we proposed a mixed radix-8 modular inversion algorithm and hardware implementation based on 256-bit primes in Verilog HDL and compared its cost performance with other implementations on the Altera Cyclone V FPGA chip. The algorithm and its hardware implementation were area–time efficient with an AT factor of 8.30, which outperformed the other algorithms and implementations. We showed that our algorithm also demonstrates scalability to other prime sizes and adaptability to other cryptographic curves. We also presented the cost performance of an ECC implementation using the proposed modular inversion algorithm. Implementation results also showed that our algorithm reduces the execution time and requires fewer hardware resources than the other investigated algorithms. We presented an efficient implementation of the Montgomery ladder scalar point multiplication algorithm that is resistant to side-channel attacks.

Future work could include shortening the critical path and using carry-select adders and carry-save adders to speed up the addition of large operands. In addition, using a fixed prime m, Secp256k1 for example, we could simplify the circuit by considering only the case where the least significant three bits are equal to 111 and using precomputations of

3 m

,

5 m

, and

6 m

to speed up the radix-8 modular inversion calculations.

Another important future work is to minimize the latency of the ECC implementation by using longer pipelines. The pipeline stages could be split, so that a modular addition or subtraction can be completed within a single pipeline stage.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. EEA-Based Modular Inversion Algorithms in Python

In the previous sections, we described Algorithms 4–10 (modinv1 to modinv7) and the radix-8 modular inversion algorithm in pseudocode. This appendix provides their implementation codes in Python. We simply summarize the codes as below: modinv1 is the fundamental EEA code; modinv2 removes the division; modinv3 reduces the number of iterations; modinv4 repeatedly divides u and v by two; modinv5 divides

u - v

or

v - u

by two; modinv6 assigns

t u

and

t v

, so that

t u - t v

is even; modinv7 uses negative assignments to v and y; and modinv_radix8 gives the code of the proposed algorithm.

The last assert statement verifies the correctness of the calculated modular inversion result c. Below is an example of the output when the code is executed. We can see that for the same inputs b, a, m, the eight functions have the same output. These outputs of the Python code are used to check the correctness of the circuit’s outputs. For example, the value of the signal c in Figure 2 is the same as the output c of the Python code.

$ python3 modinv12345678.py

b = 0x9cfa1c993911914be0f15bd74a878abe0079c6254b961b82e1abda76387d1d85

a = 0xd5076ae274e874c2eb0f7778717c39460236549ddd9fc651e68a0c0e787b4ce8

m = 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f

c1 = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e

c2 = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e

c3 = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e

c4 = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e

c5 = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e

c6 = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e

c7 = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e

c = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e

Appendix B. Elliptic Curve Diffie–Hellman (ECDH) Key Exchange Algorithm in Python

The ECDH algorithm is used to reach key agreement between two parties over an insecure network, as shown in Table 1. The following Python code demonstrates that two parties, Alice and Bob for example, obtain the same secure key by calling scalar point multiplications. We skip the communication process and focus on how to implement scalar point multiplication by calling point addition and point doubling.

Python functions for modular addition (modadd), modular subtraction (modsub), modular multiplication (modmul), and modular inversion (modinv) are provided. These functions are used by point addition (point_addition) and point doubling (point_doubling). All the codes are hardware-oriented for circuit design.

The last assert statement is used to check whether the two parties obtained the same shared secure key. Below is an example of the output when the code is executed. We can see that Qbax is equal to Qabx (shared secure key). These outputs of the Python code are used to check the correctness of the circuit’s outputs. For example, the values of signals

q x

and

q y

in Figure 5 are the same as the outputs Qax and Qay, respectively, of the Python code.

$ python3 ecdh.py