1. Introduction
Recent advancements in cloud-based Artificial Intelligence (AI) have significantly increased the importance of data privacy and security. As organizations and individuals increasingly rely on cloud services for data storage and processing, protecting sensitive information from unauthorized access has become paramount. Effective encryption algorithms are crucial in ensuring that large volumes of information can be securely stored and processed without compromising data integrity or confidentiality. Among various encryption algorithms, Fully Homomorphic Encryption (FHE) has gained considerable attention for its unique ability to perform computations on encrypted data without needing decryption. This property enables secure data processing in untrusted environments, such as cloud computing platforms, where data privacy concerns are paramount [1,2]. Unlike traditional encryption methods, which require decryption before any operation can be performed on the data, FHE maintains the confidentiality of the data throughout the computation process, thus significantly enhancing security. However, FHE inherently requires significantly greater hardware complexity compared to classical encryption algorithms. This is because FHE schemes involve performing complex mathematical operations directly on encrypted data, which imposes substantial computational overhead. Consequently, the hardware complexity of FHE is recognized as one of the key challenges that must be addressed for its successful application [3,4].
The concept of FHE was first introduced by Craig Gentry in 2009. Gentry’s groundbreaking work provided the first construction of an FHE scheme, based on ideal lattices [5]. Gentry’s scheme not only demonstrated the feasibility of FHE but also introduced a blueprint for constructing general FHE schemes capable of performing arbitrary computations on encrypted data. Since Gentry’s pioneering work, numerous FHE schemes have been proposed, each aiming to improve the efficiency and practicality of FHE; many of these build on the Learning With Errors (LWE) and Ring Learning With Errors (RLWE) problems. Notable examples include the Brakerski–Gentry–Vaikuntanathan (BGV) scheme [6], the Brakerski–Fan–Vercauteren (BFV) scheme [7], and the Cheon–Kim–Kim–Song (CKKS) scheme [8]. These schemes have made significant strides in reducing the computational overhead and improving the performance of FHE, making it more viable for practical applications.
A critical aspect of most FHE schemes is the reliance on polynomial operations over finite fields. Polynomial multiplication, in particular, is one of the most computationally intensive and time-consuming operations in FHE [9]. As such, it often becomes a performance bottleneck [10]. To mitigate this issue, the Number Theoretic Transform (NTT) is commonly employed to optimize polynomial multiplication [11,12,13,14,15]. The NTT reduces the time complexity of polynomial multiplication from $O(N^2)$ to $O(N \log N)$, where $N$ is the degree of the polynomial. The NTT is a specialized form of the Fast Fourier Transform (FFT) adapted for finite fields [16]. While the FFT operates on complex numbers, the NTT operates on integers within finite fields, making it particularly suitable for lattice-based cryptographic schemes like FHE [9,17]. The NTT transforms polynomial coefficients into the frequency domain, enabling efficient pointwise multiplication. After the multiplication, an inverse NTT (INTT) is used to convert the results back to the time domain. Despite its advantages, implementing the NTT requires substantial hardware resources, particularly for storing the Twiddle Factors (TFs) used during the transformation. This memory requirement poses a significant challenge, especially in resource-constrained environments. Therefore, efficient Twiddle Factor Generators (TFGs) are essential to reduce memory usage and enhance the overall performance of the NTT in FHE applications [18,19].
In this paper, we propose three innovative Twiddle Factor Generators (TFGs) aimed at improving hardware efficiency: the Half-Memory TFG, the On-the-Fly Serial TFG, and the On-the-Fly Parallel TFG. The main contributions of this paper are summarized as follows:
We propose a Half-Memory TFG that reduces the memory size required for TF storage by half, while doubling the throughput.
We introduce two on-the-fly TFGs that do not utilize memory for storing TFs, offering significant enhancements in memory utilization for FHE schemes requiring various sets of TFs for NTT implementation.
We evaluate the proposed TFGs using a Figure of Merit (FoM) that relates throughput to hardware area, enabling the designs to be compared on an equivalent basis.
The rest of this paper is organized as follows:
Section 2 provides the theoretical background of the Number Theoretic Transform and Twiddle Factor calculation.
Section 3 discusses the proposed TFGs.
Section 4 presents the experimental results and analysis. Finally,
Section 5 concludes the paper.
2. Background
This section delves into the core operations of Fully Homomorphic Encryption (FHE), focusing on the Number Theoretic Transform (NTT) and the Twiddle Factor Generators (TFGs). These components are pivotal in optimizing the performance of FHE schemes, enabling efficient polynomial multiplication on encrypted data while preserving data privacy and security.
2.1. Number Theoretic Transform
Most Fully Homomorphic Encryption (FHE) schemes operate on polynomial rings over finite fields, typically represented in the form $R_q = \mathbb{Z}_q[x]/(\phi(x))$ [20]. In this notation, $R_q$ denotes the ring of polynomials modulo $\phi(x)$, where $\phi(x)$ is usually chosen as $x^N + 1$. Here, $\mathbb{Z}_q$ signifies the finite field with elements represented as integers modulo $q$ [9,19]. This setup ensures that polynomial coefficients are constrained within a finite set of integers, which is critical for maintaining arithmetic operations within a manageable range and ensuring security in cryptographic applications.
A general polynomial $a(x)$ within this ring can be expressed as
$$a(x) = \sum_{i=0}^{N-1} a_i x^i. \quad (1)$$
In this expression, $a_i$ represents the coefficients of the polynomial, and $x$ is the indeterminate. The polynomial ring essentially denotes that we are working within the constraints of modular arithmetic defined by the polynomial $\phi(x)$ and the finite field $\mathbb{Z}_q$. Polynomial multiplication is a fundamental operation in cryptographic computations, particularly in FHE schemes. However, this operation is computationally intensive, with a time complexity of $O(N^2)$. The multiplication of two polynomials, $a(x)$ and $b(x)$, involves computing the product polynomial $c(x) = a(x) \cdot b(x) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} a_i b_j x^{i+j}$. This double summation results in a complexity that grows quadratically with the number of terms, making polynomial multiplication a significant bottleneck in cryptographic processes.
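To make the quadratic cost concrete, the following sketch (a minimal Python illustration; the small parameters $N = 8$, $q = 17$ are chosen for readability, not cryptographic security) multiplies two polynomials in $\mathbb{Z}_q[x]/(x^N + 1)$ by the schoolbook method:

```python
# Schoolbook multiplication in Z_q[x]/(x^N + 1): O(N^2) coefficient products.
def polymul_naive(a, b, N, q):
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = (i + j) % N
            # Reduction by x^N + 1: x^N == -1, so wrapped-around terms are negated.
            sign = 1 if i + j < N else -1
            c[k] = (c[k] + sign * a[i] * b[j]) % q
    return c

if __name__ == "__main__":
    N, q = 8, 17  # toy parameters for illustration only
    a = [1, 2, 3, 4, 0, 0, 0, 0]
    b = [5, 6, 0, 0, 0, 0, 0, 0]
    print(polymul_naive(a, b, N, q))  # a(x) * b(x) mod (x^8 + 1, 17)
```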
To overcome this inefficiency, the Number Theoretic Transform (NTT) is employed, which significantly accelerates polynomial multiplication. The NTT reduces the time complexity from $O(N^2)$ to $O(N \log N)$ by transforming polynomial coefficients into the frequency domain [21]. This transformation involves multiplying the coefficients by constant factors known as Twiddle Factors (TFs), represented as $\omega^{ij} \bmod q$, where $\omega$ is a primitive $N$-th root of unity modulo $q$. In the frequency domain, polynomial multiplication is reduced to pointwise multiplication, which is computationally less demanding.
After the pointwise multiplication is completed in the frequency domain, an inverse NTT (INTT) operation is applied to convert the result back into the time domain. This reconstruction is necessary to obtain the final result of the polynomial multiplication in the original polynomial form. The NTT and INTT operations can be expressed mathematically as
$$A_j = \sum_{i=0}^{N-1} a_i \, \omega^{ij} \bmod q, \quad j = 0, 1, \ldots, N-1, \quad (2)$$
where $A_j$ represents the coefficients in the frequency domain, $\omega^{ij}$ are the Twiddle Factors, and $q$ is the modulus used to keep the coefficients within the finite field, and
$$a_i = N^{-1} \sum_{j=0}^{N-1} A_j \, \omega^{-ij} \bmod q, \quad i = 0, 1, \ldots, N-1, \quad (3)$$
where $N^{-1}$ denotes the modular inverse of $N$, ensuring that the result is scaled correctly to match the original polynomial coefficients. This process effectively reconstructs the polynomial from its frequency domain representation back into the time domain, completing the polynomial multiplication operation. These mathematical transformations are fundamental for optimizing polynomial operations in FHE schemes and are essential for the efficient implementation of cryptographic protocols that rely on polynomial arithmetic.
2.2. Twiddle Factor Calculation
Twiddle Factors (TFs) are a crucial component in the NTT used for efficient polynomial multiplication in Fully Homomorphic Encryption (FHE) schemes. Twiddle Factors, denoted as $\omega^{ij}$, are elements of a finite field used in the context of the NTT. In the case of the NTT, these factors are defined as powers of a primitive root of unity. A primitive $N$-th root of unity in a finite field is an element $\omega$ that satisfies $\omega^N \equiv 1 \pmod{q}$ while no smaller positive power of $\omega$ equals 1, where $N$ is the size of the polynomial or the length of the transform; when $N = q - 1$, such an element generates all the non-zero elements of the field when raised to successive powers modulo $q$.
Figure 1 illustrates an example of primitive roots of unity when $N$ is 16 and $q$ is 17. The primitive 16th root of unity modulo 17, denoted as $\omega = 3$, implies that every non-zero integer mod 17 can be represented as a power of 3. As shown in Figure 1, every non-zero number from 1 to 16 can be expressed as a power of 3 modulo 17.
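This property is easy to verify in a few lines of Python (an illustrative check of the Figure 1 example, not part of the hardware design):

```python
# Check the Figure 1 example: omega = 3 is a primitive 16th root of unity
# modulo q = 17, and its powers enumerate every non-zero residue.
q, omega, N = 17, 3, 16
powers = [pow(omega, k, q) for k in range(N)]
print(powers)                                # [1, 3, 9, 10, 13, 5, 15, 11, 16, ...]
assert pow(omega, N, q) == 1                 # omega^16 = 1 (mod 17)
assert sorted(powers) == list(range(1, 17))  # all 16 non-zero residues appear
```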
The primary purpose of Twiddle Factors is to enable efficient polynomial multiplication by transforming polynomial coefficients into a frequency domain where pointwise multiplication is more straightforward, according to (2). In this transformed domain, polynomial multiplication becomes a simple element-wise multiplication of these transformed coefficients. After performing the polynomial multiplication in the frequency domain, it is necessary to revert to the time domain to obtain the result in its original polynomial form. This is achieved through the INTT, which involves the inverse of the Twiddle Factors, according to (3). Overall, Twiddle Factors are indispensable in the realm of FHE and polynomial arithmetic, facilitating the efficient and secure handling of complex polynomial operations within cryptographic systems.
2.3. Implementation of NTT
When implementing NTT, the Cooley–Tukey (CT) algorithm [22] is commonly used, due to its efficiency and flexibility, similar to FFT. The CT algorithm optimizes performance through a recursive divide-and-conquer approach, breaking down the problem into smaller, more manageable components. The CT algorithm is summarized in Algorithm 1. In practice, the CT algorithm works by processing two polynomial coefficients at a time. Initially, one coefficient is multiplied by a Twiddle Factor (TF), and modular reduction is applied. The results are then combined to produce two new coefficients. This process is repeated at each stage of the algorithm, progressively simplifying the polynomial multiplication.
Figure 2 illustrates the block diagram of the CT algorithm, which serves as a processing element (PE) in the NTT architecture. As represented in Algorithm 1, the PE operates on two polynomial coefficients, denoted as $a[j]$ and $a[j+t]$. Initially, $a[j+t]$ is multiplied by the TF, and then modular reduction is performed. Subsequently, the output $a[j]$ is computed by adding $a[j]$ to the operated $a[j+t]$ and then performing modular reduction. Similarly, the output $a[j+t]$ is obtained by subtracting the operated $a[j+t]$ from $a[j]$ and performing modular reduction. According to Algorithm 1 and Figure 2, this method provides a systematic approach to reduce the time complexity of polynomial multiplication from $O(N^2)$ to $O(N \log N)$.
Algorithm 1: NTT Based on the Cooley–Tukey (CT) Algorithm
Input: Vector of polynomial coefficients $a = (a_0, a_1, \ldots, a_{N-1})$; Twiddle Factors $\omega$
Output: $a \leftarrow \mathrm{NTT}(a)$
 $t \leftarrow N$
 for ($m = 1$; $m < N$; $m = 2m$) do  // NTT
  $t \leftarrow t/2$
  for ($i = 0$; $i < m$; $i = i + 1$) do
   for ($j = 2it$; $j < 2it + t$; $j = j + 1$) do  // Processing Element
    $U \leftarrow a[j]$;
    $V \leftarrow a[j+t] \cdot \omega[m+i] \bmod q$;
    $a[j] \leftarrow (U + V) \bmod q$;
    $a[j+t] \leftarrow (U - V) \bmod q$;
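For reference, the following Python sketch mirrors Algorithm 1 (a minimal software model; it assumes the Twiddle Factors are supplied in a table `tf` indexed as $\omega[m+i]$, as in the Cooley–Tukey formulation above):

```python
# In-place Cooley-Tukey NTT: log2(N) stages, each applying the butterfly
# (the Processing Element) to N/2 coefficient pairs -> O(N log N) overall.
def ntt_ct(a, tf, q):
    """a: N coefficients; tf: Twiddle Factor table indexed as in Algorithm 1."""
    N = len(a)
    t = N
    m = 1
    while m < N:
        t //= 2
        for i in range(m):
            S = tf[m + i]                 # Twiddle Factor for this butterfly group
            for j in range(2 * i * t, 2 * i * t + t):
                # Processing Element (butterfly) of Figure 2:
                U = a[j]
                V = a[j + t] * S % q      # multiply by TF, modular reduction
                a[j] = (U + V) % q        # upper output
                a[j + t] = (U - V) % q    # lower output
        m *= 2
    return a
```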
The entire process of NTT, which can be implemented in various architectures based on the PE, is described in Algorithm 1 [23]. However, in this paper, NTT is implemented in a single-path delay feedback (SDF) architecture, as shown in Figure 3. This architecture sequentially receives the $N$ polynomial coefficients and operates along a single path over $\log_2 N$ stages. Since one coefficient is received at a time, specific input coefficients need to be stored in the feedback register to be processed by the PE. These stored coefficients are operated on when they come out of the feedback register. Moreover, the feedback register stores one of the PE results, as a single datum is transferred to the next stage. Consequently, this architecture utilizes $\log_2 N$ multipliers, $\log_2 N$ adders, $\log_2 N$ subtractors, and the corresponding modular reduction circuits for the PEs. Additionally, $(N-1)$ feedback registers are used for implementing the serial data flow.
2.4. Conventional Memory-Based TFG
In the traditional approach to implementing TFGs for NTT, a conventional memory-based TFG is employed. This approach relies on pre-calculating and storing the TFs in a fixed memory structure, such as Read-Only Memory (ROM). The main advantage of this method is the rapid access to precomputed TF values, which can significantly speed up the computation process during NTT. The TFs are essential for transforming polynomial coefficients into the frequency domain, where pointwise multiplication is performed. The TFs are derived from the primitive $N$-th roots of unity, and their computation typically involves raising these roots to specific powers and taking the result modulo a prime number $q$.
In a memory-based TFG, the TFs are computed beforehand and stored in ROM. The size of the ROM required depends on the number of TFs and their bit width. For example, if the polynomial degree is $N$ and the bit width of the modular constant $q$ is 64 bits, the total memory size required for storing TFs can be significant. Specifically, it scales with $N \times 64$ bits. To illustrate this with an example, suppose $N$ is $2^{16}$ and the bit width of $q$ is 64 bits. The total memory required for TF storage would be $2^{16} \times 64$ bits, which amounts to 524 KB of memory. This storage requirement can be substantial, especially for larger polynomial degrees or when dealing with multiple sets of TFs for different modular constants.
Figure 4 illustrates the conventional memory-based TFG integrated with the SDF structure from
Figure 3. This setup operates according to Equations (2) and (3) to perform both NTT and INTT operations. Notably, the TFG structure is designed to provide pre-stored TFs as requested, at each stage of the process. While the conventional memory-based approach offers the advantage of quick and straightforward access to TFs, it has a significant drawback: the need for a substantial amount of ROM to store the TFs and the time required to initialize this memory.
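Behaviorally, the conventional memory-based TFG amounts to a precomputed lookup table; the following Python sketch (an illustration of the concept, not the hardware implementation) also reproduces the memory-size arithmetic above:

```python
# Behavioral model of a memory-based TFG: precompute all N Twiddle Factors
# once and serve them by index, at the cost of N words of ROM.
def build_tf_rom(omega, N, q):
    rom, tf = [], 1
    for _ in range(N):
        rom.append(tf)
        tf = tf * omega % q
    return rom  # rom[i] = omega^i mod q

N, bit_width = 2**16, 64
rom_bits = N * bit_width
print(rom_bits // 8 // 1000, "kB")  # 2^16 x 64 bits ~= 524 kB of ROM
```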
4. Experimental Results and Discussion
This section presents the experimental results for the proposed Twiddle Factor Generators (TFGs), showcasing their performance compared to conventional memory-based approaches. The goal is to evaluate the efficiency improvements and practical benefits of each TFG in real-world applications of Fully Homomorphic Encryption (FHE). To assess the effectiveness of the proposed TFGs, we conducted a series of experiments focusing on memory usage, computation time, and overall efficiency. We compared the proposed Half-Memory TFG, the On-the-Fly Serial TFG, and the On-the-Fly Parallel TFG against the conventional memory-based TFG.
Firstly, Table 1 presents a qualitative comparison of various TFGs in the context of $N$-point NTT. The conventional Memory-based TFG, as depicted in Figure 4, is the most straightforward and intuitive approach. It computes $N$ TFs in advance and stores them in a memory with $N$ elements for subsequent use. While this method is simple and easy to understand, it suffers from significant hardware complexity due to its large memory size requirements. In contrast, the proposed Half-Memory TFG in Figure 6 reduces the memory usage required by the conventional approach. By leveraging the symmetric properties of TFs, the Half-Memory TFG reduces the memory requirement from $N$ to $N/2$. It generates the remaining TFs using a subtractor, and since it can create two TFs simultaneously, it achieves twice the throughput compared to the conventional Memory-based TFG. Additionally, the on-the-fly generation methods, namely the On-the-Fly Serial (O-Serial) TFG and the On-the-Fly Parallel (O-Parallel) TFG, offer alternative approaches. These methods eliminate the need for memory by generating TFs dynamically using D Flip-Flops (D-FFs), multipliers, and modular reduction circuits. According to Figure 7, the O-Serial TFG generates one TF at a time, requiring a single set of D-FFs, multipliers, and modular reduction circuits. This design has the lowest hardware complexity among the proposed structures. To further enhance throughput, the O-Parallel TFG has been proposed. As shown in Figure 8, this structure can achieve $p$ times the throughput of the O-Serial TFG by generating multiple TFs simultaneously. However, this comes at the cost of increased hardware complexity, which is $p$ times greater than that of the O-Serial TFG.
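The following Python sketch models the two generation ideas behaviorally (an illustration only, not the proposed RTL; the toy parameters $q = 17$, $\omega = 3$, $N = 16$ follow the Figure 1 example):

```python
# Behavioral sketches of the proposed TF generation ideas.
def tfg_otf_serial(omega, N, q):
    """On-the-fly serial: one register and one multiplier; 1 TF per step."""
    tf = 1
    for _ in range(N):
        yield tf
        tf = tf * omega % q           # next TF derived from the previous one

def tfg_half_memory(omega, N, q):
    """Half-memory: store omega^0 .. omega^{N/2-1}; derive the rest with a
    subtractor via the symmetry omega^(N/2+i) = q - omega^i (omega^(N/2) = -1)."""
    half = [pow(omega, i, q) for i in range(N // 2)]   # the stored half
    for i in range(N // 2):
        yield half[i], (q - half[i]) % q               # two TFs per step

q, omega, N = 17, 3, 16
assert list(tfg_otf_serial(omega, N, q)) == [pow(omega, i, q) for i in range(N)]
pairs = list(tfg_half_memory(omega, N, q))
assert [p[1] for p in pairs] == [pow(omega, N // 2 + i, q) for i in range(N // 2)]
```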
Next, Table 2 provides a quantitative comparison of various TFGs. For the experiment, we implemented the TFGs using Verilog HDL in the Vivado Design Suite 2021.2. All TFGs were synthesized for the KCU105 FPGA board with an operating frequency of 150 MHz. To reflect a real FHE workload, we adopted the HEAAN library, which is prominent in the field of FHE [8]. The HEAAN library implements the Homomorphic Encryption for Arithmetic of Approximate Numbers (HEAAN) scheme, designed for efficient arithmetic operations on approximate numbers. This makes it particularly suitable for applications that require processing encrypted data, such as secure data analysis and privacy-preserving machine learning. The parameters used are $N = 2^{15}$, $p$ = 64′d57646075230827315, and $\omega$ = 64′d327717960149216710.
As shown in Table 1, creating TFGs requires memory, D-FFs, subtractors, multipliers, and modular reduction circuits. When implemented on an FPGA chip, these components are allocated as Look-Up Tables (LUTs), Flip-Flops (FFs), and Digital Signal Processing (DSP) blocks, with each TFG requiring a different number of these components, as detailed in Table 2. Hardware complexity is expressed by the number of LUTs, FFs, and DSPs. To provide a comprehensive comparison, we converted the necessary components for hardware implementation into FPGA internal SLICEs, labeled as Total Area. Note that each DSP is equivalent to 10 SLICEs in terms of device resources on the KCU105 board. In terms of hardware complexity, the O-Serial structure is the most efficient, followed by the Half-Memory, O-Parallel, and Conventional structures. Hardware performance is measured by throughput. Based on an operating frequency of 150 MHz, the throughput for generating $N = 2^{15}$ TFs with a bit width of 64 is reported in Gbps. The Conventional and O-Serial structures generate one TF per clock cycle, the Half-Memory structure generates two TFs per clock cycle, and the O-Parallel structure, adopting a parallelism of 4, generates four TFs simultaneously.
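These throughput figures follow directly from the stated clock rate, bit width, and TFs generated per cycle, as the following arithmetic illustrates:

```python
# Throughput = TFs/cycle x 64 bits x 150 MHz, expressed in Gbps.
f_clk, bit_width = 150e6, 64
for name, tf_per_cycle in [("Conventional", 1), ("Half-Memory", 2),
                           ("O-Serial", 1), ("O-Parallel (p=4)", 4)]:
    gbps = tf_per_cycle * bit_width * f_clk / 1e9
    print(f"{name}: {gbps:.1f} Gbps")   # 9.6, 19.2, 9.6, 38.4
```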
Finally, considering both hardware complexity and performance, the Figure of Merit (FoM) was obtained by dividing the throughput by the area, demonstrating the performance improvement of the proposed TFGs. According to the normalized FoM, the proposed Half-Memory TFG achieves a performance improvement of 3.89 times by reducing the memory size by nearly half and doubling the throughput. Next, the proposed O-Serial structure significantly reduces hardware complexity by completely eliminating memory, resulting in the lowest-area hardware configuration among the proposed structures; it achieves a 3.86-times enhancement in FoM compared to the Conventional structure. Lastly, the O-Parallel structure, the parallel version of the O-Serial structure, shows higher hardware complexity than the proposed Half-Memory TFG and O-Serial structure but achieves a p-fold increase in throughput. Consequently, the O-Parallel structure also demonstrates a notable improvement in FoM, amounting to a 3.83-times enhancement.
In summary, the three proposed structures achieve similar FoM levels. Depending on the design requirements, the most suitable TFG can be chosen for implementation: the O-Serial structure is most suitable for low-area implementations, the O-Parallel structure is ideal for high throughput, and the Half-Memory structure achieves the highest FoM. Consequently, this paper offers a solution that allows selecting the most suitable TFG design based on implementation requirements.