1. Introduction
In the short term, the growing availability of digital devices, along with their inherent need for interconnectivity, will raise several challenges in security, power resources, hardware, and software saturation, among others. In modern contexts of emergent technologies, such as cyber-physical systems, vehicular communication networks, and Industry 4.0, assuring data integrity, confidentiality, and authentication for message-passing communications is a critical requirement, and cryptographic solutions are among the most important alternatives.
Security is a constant and real concern, even more so if the application environments bring related issues along, such as failures, transfer errors, viruses, malware, compromised hardware, or physical damage to devices. These issues are also related to attackers, who may modify, replace, or damage data, causing accidents, financial losses, or other kinds of problems. Besides, faults are inherent in some hardware architectures, or they can be generated by their environment, and are classified as transient or permanent. On the one hand, transient faults are considered single-event upsets (SEUs) when there is a change of logic state, while multiple transient faults are triggered by several SEUs. On the other hand, permanent faults cause defects in system behavior and produce halts or wrong results [1].
Nowadays, security services for integrity, confidentiality, and authentication must consider faults, which are even more critical for cryptographic algorithms, given their iterative behavior and their processing of all message data. In this case, the type of cryptographic algorithms known as hash functions (because they generate a hash from a message constituted by bits, bytes, words, images, videos, text pages, etc.) exhibits high diffusion of all the bits during processing. The desirable characteristics of these algorithms can be greatly affected by faults, since changing a single bit will produce a totally different hash.
Moreover, providing an autonomous-decision mechanism to determine if the data integrity calculation has been appropriately carried out, to assure system consistency, is a key design element of digital communication systems [2].
In this sense, faults and errors caused by hardware noise or data transmission alterations, generated either by the channel quality or by external attacks, should be detected quickly and efficiently for a fault-detection module to be considered suitable for real-time applications. As an important remark, when the input and output data differ, an error has occurred. For instance, in message-passing communications, noise can affect the channel during the transmission of bits, which can be contaminated by errors. Besides, errors can appear from a miscalculation in the microprocessors or from adverse scenarios such as attackers or physical damage. An example of this is when a bit with value 0 is changed to 1, or vice versa, in the payload.
Secure Hash Algorithm (SHA) functions offer integrity, although they can also be employed to provide authentication. However, the algorithm does not include verification mechanisms for its internal operations or for eventual failures at the system level.
Noise and possible failures in the system are critical within the design of hardware implementations [3,4,5]. For instance, several works have demonstrated that noise and interference affect the operation of FPGAs, as happens in the presence of magnetic noise in the white noise region [6]. When the voltage supply is unstable, it can degrade the immunity of the I/O buffer by about 16.8 dB, according to Toan et al. [7]. Therefore, we considered developing hardware modules capable of assuring integrity, security, and data quality with a system-level fault-detection mechanism.
Traditional electronic systems designed for fault tolerance are based on redundancy [8,9]. Applications where failure is critical use redundant pieces of hardware and software. In such systems, time redundancy, information redundancy, and hardware redundancy can also be considered.
Best practices in data integrity consider validation to prevent the entrance of invalid data, error detection in transmissions, and security measures against data modification or loss [8]. SHA-2 is a broad family of solutions to assure data integrity, and the standard [9] defines several configurations for these cryptographic hash functions, generating hashes of 256, 384, and 512 bits. The SHA-256 algorithm is widely used to balance security and fast implementation in contexts that require a real-time, hardware-based response. Current hardware implementations of SHA-2 are focused on satisfying those characteristics, but there are open issues regarding the real-time balance among processing, efficiency, and cost, so further research and analysis should be pursued.
Cryptographic algorithms can be used for different security-related tasks such as confidentiality, authentication, integrity, and non-repudiation. Among the most important algorithms are hash functions, used in blockchains, digital signatures, digital certificates, and other applications. SHA-2, designed by the National Security Agency (NSA), is a set of cryptographic hash functions that includes important changes with respect to its predecessor (SHA-1). It consists of six hash functions with digests of 224, 256, 384, or 512 bits. While SHA-256 is a hash function with 32-bit words, SHA-512 processes 64-bit words instead. The two variants have identical structures and differ only in the shift amounts, additive constants, and number of rounds. On the other hand, SHA-224 and SHA-384 can be considered truncated versions of SHA-256 and SHA-512, respectively, calculated using different initial values. Other truncated versions of SHA-512, also with different initial values, are SHA-512/224 and SHA-512/256.
In this work, we identify an open problem: how to ensure a fast and reliable fault-detection architecture, even in adverse situations, that can be integrated into several environments without overloading the system? To answer this question, we must first solve two main issues: (i) how to efficiently determine whether an error occurs when data integrity is calculated; and (ii) how to prevent the diffusion of an error caused by a miscalculation of data integrity.
Traditionally, state-of-the-art techniques intended to answer these questions base their approach on redundancy, including only one of the following main elements: hardware, time, or information [10,11]. In hardware redundancy, the physical ciphering modules (processors, memories, multipliers, etc.) are replicated, generating several hashes at the same time. Time-redundancy architectures execute the same operation to provide multiple copies of the hashes at different times. Finally, information redundancy adds check data to verify hash correctness before using it and, in some cases, even allows the correction of some erroneous hashes [12,13,14]. Each architecture implementing some kind of redundancy presents advantages and disadvantages concerning power consumption, amount of hardware resources, and performance (throughput and efficiency). Hence, to provide a fault-detection architecture capable of being used in several environments, it is essential to balance system overload in terms of energy consumption, processing delays, and performance, among other parameters.
In this paper, we propose an innovative hybrid-redundancy hardware architecture for the cryptographic algorithm SHA-2. This architecture is designed to improve error and fault detection. In the following, we list our main contributions:
To the best of our knowledge, pipelining has not been previously used to perform different hash calculations for redundancy purposes. Pipelining is traditionally used to process several instructions or operations at once, and its use for redundancy in cryptography is a relevant part of the innovation of this work. Therefore, we present a hybrid architecture with a 3-stage pipeline structure, implementing hardware and time redundancy. We analyzed hardware resources and performance with respect to the distribution of the processed elements within each stage and the balance of the critical path. Besides, we increased performance at a given clock speed, defining a transformation of the data flow into several sequential phases where the input of each stage is the output of the previous one;
From a pipeline design perspective, data transmission is improved by eliminating related hazards and conflicts, using registers in each stage for processing the hash functions, which perform multiple passes over the data to obtain the final hash. By adding an error-verification module, based on a voting scheme, we were able to determine whether an error occurred or the output data were correct. Designing an architecture with a pipeline scheme and redundancy processes is not straightforward, since several design problems, related to data dependencies, must be solved along the way;
Another contribution is the creation, analysis, and demonstration of both fault-tolerant and non-tolerant architectures, which lays the ground for future research towards understanding the advantages and disadvantages they present, also allowing comparison with other works. The selected architecture, along with its design and implementation, was evaluated and analyzed using various FPGA technologies (Spartan-7, Artix-7, Kintex-7, and Virtex-7). The Virtex-7 implementation achieved the best results of all the alternatives, with a throughput of 441.72 Mbps and 2255 LUTs, presenting an efficiency of 195.8 Kbps/LUT. It is true that the examined technologies are produced by the same company; however, it must be emphasized that each of them is different from the rest, as mentioned in [15]. Hence, regardless of whether they are from the same family or not, their logic blocks can be quite different in their structure.
The content of this paper is divided as follows. Section 2 presents background on the SHA-2 hash function, fault-detection schemes, and pipelining. In Section 3, several related works are presented and explained. Next, Section 4 presents the proposed hardware architecture for fault detection. We present the results and their comparison with other techniques in Section 5. The discussion of the results is presented in Section 6. Finally, the main conclusions drawn from the results of this paper are presented in Section 7.
3. Related Work
Several works have been proposed to prevent and detect transient errors or faults for applications at the hardware level. In this sense, hash-function implementations have been used to provide integrity and reliability over the data transmitted and generated in communications. Nevertheless, given the way SHA algorithms work (a high-diffusion iterative structure), the presence of small errors in the hash calculation will result in many erroneous bits in the final hash [1]. In the following, some outstanding works that tackle this problem are presented.
Ahmad and Das proposed, in [1], an error-detection approach using parity codes and hardware redundancy. Their performance metrics indicated that their limitations were hardware overhead and a short delay overhead. They did not introduce a pipeline method to optimize the architecture. Besides, their scheme did not carry out the correction of the detected error.
Bahramali et al. proposed, in [17], a fault-diagnosis method for SHA-1 and SHA-512, focused on detecting transient and permanent faults. The proposal used time redundancy and a pipelining method. Since the critical path of the SHA hash functions includes the addition operation, they used subtraction as the inverse function. This process was used for redundancy, detecting both permanent and transient faults. They observed that traditional time-redundancy approaches do not have this capability because they only detect transient faults.
A Totally Self-Checking (TSC) design was introduced for the SHA-256 hash function by Michail et al. in [18], suitable for harsh environments and focused on detecting erroneous bits (in odd and even numbers), provided they are appropriately propagated. They observed that ASIC technologies are more adequate than reconfigurable devices, such as FPGAs, for TSC. Besides, in [19], they presented a comparison between two hash functions, SHA-1 and SHA-256. In this case, they analytically described the development process for each component of TSC as compared to the Duplicated-with-Checking (DWC) alternative. Their conclusion was that, in comparison with DWC architectures, TSC cores have more benefits in terms of area, throughput/area, and power consumption.
Kahri et al. proposed, in [20], an FPGA fault-detection scheme to protect the SHA algorithm against attacks, based on hybrid (hardware and time) redundancy. The trade-off between security and implementation cost is an advantage of this approach.
Algredo-Badillo et al. [21] proposed a pipeline scheme for the AES symmetric cryptographic algorithm, unlike the architecture proposed in this paper, which is focused on the SHA-2 cryptographic hash function. Additionally, while the AES architecture uses a structure with 5 pipeline stages, our architecture uses 3 pipeline stages.
4. Proposed Hardware Architecture
As mentioned before, SHA-256 takes a message and generates a digest, where the message is padded and divided into 512-bit message blocks, each of them going through 64 processing rounds. In this section, two hardware architectures are proposed: (a) a simple hardware architecture, which has an iterative structure and does not detect errors; and (b) a hybrid-redundancy hardware architecture, which is fault-tolerant and based on the simple hardware architecture.
The first proposed architecture is a simple one, which processes one 512-bit message block at a time during 64 clock cycles, given its iterative structure, and is not designed to detect faults; the final message digest for a given message block is generated after 64 clock cycles (see Figure 1). The gray modules in the simple architecture correspond to the data storage for the input message and the intermediate hash.
The second proposed architecture is the hybrid-redundancy one, which is based on computing the SHA-2 algorithm three times on the same 512-bit message block. The block diagram of this architecture (see Figure 2) has two main modules: (1) the control module and (2) the processing module. The algorithm defines 64 rounds; consequently, the proposed architecture performs different operations and requires 196 clock cycles for processing the first 512-bit message block, and 195 clock cycles (latency) for each of the remaining 512-bit message blocks that constitute the complete message. This latency arises because the 3-stage pipeline structure requires 3 clock cycles for computing one round defined by the SHA-256 algorithm, where the key idea is to compute the message digest three times. Additional clock cycles are necessary for initialization, data storing, and output generation.
On the one hand, the processing module takes 32-bit words from the given 512-bit message block and executes several computations at the same time throughout the pipeline structure. At the end of the 196 or 195 clock cycles (depending on the case), this module has three registers with the hashes of the three pipelines, which are compared for determining whether an error occurred or not. The proposed hardware architecture takes into account three cases: (a) if the three hashes are different, then an error is produced, the error signal is set, the message digest is incorrect, and it must be ignored; (b) if two hashes are equal, then a fault is detected, but the architecture corrects and generates the exact output, the error signal is reset (indicating there is no error) and the message digest is correct; and (c) if the three hashes are equal, then there is no error, the error signal is reset, and the message digest is correct. In summary, if an error is detected, this can be solved or not; if this is not solved, then the 256-bit output hash of this module must be ignored; otherwise, the output hash is correct. On the other hand, the Control Module enables the dataflow of the first module. In the next subsections, more details are provided for both modules.
4.1. Processing Module
The Processing Module has three stages (see Figure 3), which try to balance the critical path. This module represents the pipeline architecture and shows the three stages, which are separated by registers (gray blocks) that are populated with data in sequence, that is, the three replications of the same message block.
The general process for a message block can be explained as follows (see Figure 4):
In the first clock cycle, the block is fed to the processing module and the first stage, Functions&Adders1, is computed (six functions and the sums);
In the second clock cycle, the output of the Functions&Adders1 stage is input to the Adders2 stage and the first stage is fed again with the same block (replicated block);
In the third clock cycle, the output of the Adders2 stage is input to the Adders3&VM stage, the output of the Functions&Adders1 stage goes to the Adders2 stage, and the first stage is fed finally with the replicated block;
From the fourth cycle to the 194th cycle, the three repetitions of the same message block keep moving from the first to the second stage, from the second to the third stage, and from the third back to the first stage. This is focused on calculating the 64 rounds of the algorithm for each replication of the message block;
Finally, in the 195th clock cycle, the three hashes are obtained and the outputs can be generated by comparing, evaluating, and selecting the registers.
The previous process is executed for each message block whose hash value must be obtained. The replications of a second block (or of the remaining blocks) of the message can be entered, keeping the pipeline busy and improving performance and efficiency with a latency of only 192 clock cycles, which represents the best case for the metrics and results.
Next, each stage is described to explain the inputs, outputs, and detailed behavior of the architecture, where each stage executes different tasks (see [9]):
4.1.1. First Stage
It performs message scheduling, also executing the following four functions defined by the standard: Ch, Maj, Σ0, and Σ1.
In this stage, for message scheduling (at the top left of Figure 3), a register is used for storing 512 bits (sixteen 32-bit words) during sixteen clock cycles, for the first sixteen rounds. Later, for the remaining forty-eight rounds, it is necessary to compute new 32-bit words, so the functions σ0 and σ1 and adders are implemented.
In this same stage, several temporary values are computed (the outputs of Ch, Maj, Σ0, and Σ1), which are used in the next stage for computing the new state buffer (the working variables a to h). The input for these functions is the initial hash or the intermediate hash from the previous round.
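For reference, the four functions of this stage and the two message-schedule functions are fixed by the FIPS 180-4 standard, so they can be sketched directly. The following VHDL package is a minimal sketch of these functions; the package and function names are ours, not taken from the architecture's source code:

```vhdl
-- Minimal sketch of the six SHA-256 logical functions (FIPS 180-4).
-- Names (sha256_functions, bsig0, ssig0, ...) are illustrative only.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package sha256_functions is
  subtype word32 is unsigned(31 downto 0);
  function ch   (x, y, z : word32) return word32;
  function maj  (x, y, z : word32) return word32;
  function bsig0(x : word32) return word32;  -- Sigma0, used in the round
  function bsig1(x : word32) return word32;  -- Sigma1, used in the round
  function ssig0(x : word32) return word32;  -- sigma0, message schedule
  function ssig1(x : word32) return word32;  -- sigma1, message schedule
end package;

package body sha256_functions is
  function ch(x, y, z : word32) return word32 is
  begin
    return (x and y) xor ((not x) and z);  -- "choose" function
  end;
  function maj(x, y, z : word32) return word32 is
  begin
    return (x and y) xor (x and z) xor (y and z);  -- "majority" function
  end;
  function bsig0(x : word32) return word32 is
  begin
    return rotate_right(x, 2) xor rotate_right(x, 13) xor rotate_right(x, 22);
  end;
  function bsig1(x : word32) return word32 is
  begin
    return rotate_right(x, 6) xor rotate_right(x, 11) xor rotate_right(x, 25);
  end;
  function ssig0(x : word32) return word32 is
  begin
    return rotate_right(x, 7) xor rotate_right(x, 18) xor shift_right(x, 3);
  end;
  function ssig1(x : word32) return word32 is
  begin
    return rotate_right(x, 17) xor rotate_right(x, 19) xor shift_right(x, 10);
  end;
end package body;
```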
4.1.2. Second Stage
It computes the new state variables (the working variables a to h), along with the internal control logic for registering three dataflows of the intermediate message digest. For this computation, the round constant, the round word, and the four outputs of the first-stage functions are used.
The second stage takes a large number of the variables generated in the previous stage for computing the new state buffer. These variables are: (1) one 32-bit round word (Wt); (2) four temporary values (the outputs of Ch, Maj, Σ0, and Σ1); (3) one 256-bit state buffer (a to h); and (4) one address (the round index t) for obtaining the round constant (Kt). Additionally, the second stage must store the new state buffer in one of three registers, because it carries out the computation of the three different processes (through the pipeline structure) of the SHA-256 algorithm.
The selection of the register for storing the state buffer is executed by the Internal Control Logic (see Figure 5). This logic has two signals provided by the Control Module, which allow the selection of the register according to the data flow of the three pipeline processes.
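As an illustration of the computation shared by the first two stages, the following sketch shows one SHA-256 round as a single combinational block, following FIPS 180-4. In the actual architecture the adders are distributed across the pipeline stages, so this single-entity form is only illustrative; it reuses the sketch package above and requires VHDL-2008 for process(all):

```vhdl
-- Illustrative combinational view of one SHA-256 round (FIPS 180-4).
-- In the paper's architecture these adders are split across the stages.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.sha256_functions.all;

entity sha256_round is
  port (
    a_i, b_i, c_i, d_i, e_i, f_i, g_i, h_i : in  word32; -- state buffer in
    kt : in word32;  -- round constant Kt
    wt : in word32;  -- round word Wt from the message schedule
    a_o, b_o, c_o, d_o, e_o, f_o, g_o, h_o : out word32  -- state buffer out
  );
end entity;

architecture comb of sha256_round is
begin
  process(all)
    variable t1, t2 : word32;
  begin
    -- Additions are modulo 2^32, which numeric_std "+" provides for free.
    t1 := h_i + bsig1(e_i) + ch(e_i, f_i, g_i) + kt + wt;
    t2 := bsig0(a_i) + maj(a_i, b_i, c_i);
    h_o <= g_i;  g_o <= f_i;  f_o <= e_i;
    e_o <= d_i + t1;
    d_o <= c_i;  c_o <= b_i;  b_o <= a_i;
    a_o <= t1 + t2;
  end process;
end architecture;
```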
4.1.3. Third Stage
It calculates the final, intermediate, and replicated message digests. In this architecture, padding is not considered in pre-processing; for that reason, the input message must have a length of less than 2^64 bits and be a multiple of 512 bits. This message is divided into 512-bit blocks, and the architecture can process one or more 512-bit blocks (N 512-bit blocks) to obtain the final digest of the same message. Each 512-bit block is sequentially processed during several clock cycles and divided into sixteen 32-bit words. Additionally, each block is processed three times by the proposed hybrid architecture, where the Voting Module selects an output from the three final hashes; finally, it decides whether there is an error or not.
The third stage considers several situations: (1) selection of the initial state buffer (constant values when the first 512-block is processed, or message digest from the previous 512-bit block when the message is constituted by several 512-bit blocks); (2) update of the intermediate state buffer (this occurs when it is processed in the internal rounds, from the 2nd to the 63rd, for the three computing lines); (3) computation of the intermediate message digest (from all the 512-bit blocks, except for the last one), and computation of the final message digest (from the last 512-bit block of the message).
In the third stage, the Voting Module stores the three final digests of the message, after computing the 64 rounds defined by the SHA-256 algorithm (see Figure 6). Moreover, this module compares these three values to generate the final message digest. Finally, it also generates the Error signal, a flag that indicates whether the final message digest is validated or not.
The Voting Module generates: (1) the bus for the intermediate or initial message digest; (2) the final Message Digest bus for the digest of the complete message; and (3) the Error signal, the flag informing whether the final message digest is correct or not. This flag is generated by comparing pairs of message digests from each stage (see Table 1): the first comparator signal compares 1S-MD and 2S-MD (the message digests from the first stage and the second stage, respectively), and it is high if they are equal and low otherwise; the second comparator compares 1S-MD against 3S-MD, and the third compares 2S-MD against 3S-MD.
The truth table for generating the Error and selMUx signals is shown in Table 2, where the first signal is low when two or three stage message digests are equal, and the second one selects 1S-MD or 2S-MD as the output bus for the Message Digest of the proposed hybrid architecture. 3S-MD is ignored because the possible outputs of the message digest indicate that selecting 1S-MD or 2S-MD is sufficient for generating the correct output bus. This analysis shows that the logic is simplified due to the selection of only two buses (the multiplexer and control logic are simpler).
The architecture proposed in this work provides some level of inner fault tolerance. This means that if two hashes are equal and the other one is different, then a fault has occurred; in this case, the architecture is able to catch this fault and correct it; hence, the Error output signal is reset and the Message Digest carries one validated hash. This situation happens in three cases, “001”, “010”, and “100”, when exactly one comparator is high. If the three hashes are different, case “000” occurs: a fault is discovered, the architecture cannot recover, the Error output signal is set, and the Message Digest must be ignored. Moreover, there are three states that cannot be reached, cases “011”, “101”, and “110”, because they indicate contradictions; for example, case “011” means that 1S-MD and 2S-MD are not equal, but the pair 1S-MD and 3S-MD is equal, as is the pair 2S-MD and 3S-MD, which cannot be true. In these situations, the Error signal is set, and the output Message Digest is a don't care (X). Finally, in case the three hashes are equal, case “111”, no fault is detected, the Error signal is reset, and the Message Digest is correctly computed.
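A minimal VHDL sketch of the voting logic described by Tables 1 and 2 follows; the comparator and port names (c12, c13, c23, error_o) are ours, assumed for illustration:

```vhdl
-- Sketch of the Voting Module: three pairwise comparators, an Error flag,
-- and a multiplexer that only ever selects 1S-MD or 2S-MD.
library ieee;
use ieee.std_logic_1164.all;

entity voting_module is
  port (
    md1, md2, md3 : in  std_logic_vector(255 downto 0); -- 1S-MD, 2S-MD, 3S-MD
    md_out        : out std_logic_vector(255 downto 0); -- validated digest
    error_o       : out std_logic                       -- '1' = unrecoverable
  );
end entity;

architecture rtl of voting_module is
  signal c12, c13, c23 : std_logic; -- pairwise equality comparators
begin
  c12 <= '1' when md1 = md2 else '0';
  c13 <= '1' when md1 = md3 else '0';
  c23 <= '1' when md2 = md3 else '0';

  -- Error only when all three digests differ (case "000"); the mixed
  -- cases "011", "101", and "110" are unreachable contradictions.
  error_o <= '1' when (c12 = '0' and c13 = '0' and c23 = '0') else '0';

  -- If md1 agrees with md2 or md3, it belongs to the majority; otherwise
  -- md2 must agree with md3, so selecting md2 suffices (3S-MD is never
  -- needed as an output, which simplifies the multiplexer).
  md_out <= md1 when (c12 = '1' or c13 = '1') else md2;
end architecture;
```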
4.2. Control Module
The Control Module is in charge of the data flow, enabling the dataflow in the Processing Module (see Figure 7). This control module is a deterministic finite automaton (DFA), based on a Moore machine, where the outputs of each state are listed below it.
The logic of the automaton begins with the preloading of the data (configuration values) and the loading of the input message (512-bit blocks), tasks executed by the initial states. Then, the first round is executed, followed by rounds 2 to 16, rounds 17 to 63, and the final round, each phase covered by its own group of states. Next, two situations can happen: (a) there are more 512-bit blocks from the same message, so the process restarts using an intermediate hash; or (b) the complete message has been processed and there are no more 512-bit blocks, so the automaton is initialized and returns to the initial state.
The automaton manages: (a) the memory that stores the round constants, through their addresses; (b) the multiplexer that selects the initial hash or the intermediate hash; (c) the multiplexers and registers for selecting and saving the three computed hashes; (d) the counter for the 64 rounds defined by the algorithm; and (e) the multiplexer for obtaining external or internal 32-bit round words.
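As a sketch of this control flow, the following Moore-machine skeleton reproduces only the phase structure described above (load, first round, rounds 2 to 16, rounds 17 to 63, final round, next block); the actual DFA has many more states and output signals, and all names here are assumptions:

```vhdl
-- Skeleton of a Moore-machine control module; state and signal names are
-- illustrative and do not reproduce the paper's DFA state encoding.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity control_module is
  port (
    clk, rst   : in  std_logic;
    last_block : in  std_logic;            -- no more 512-bit blocks
    round_cnt  : out unsigned(5 downto 0); -- 0..63 round counter
    sel_init   : out std_logic;            -- initial vs. intermediate hash
    sel_word   : out std_logic             -- external vs. internal round words
  );
end entity;

architecture moore of control_module is
  type state_t is (LOAD, FIRST_ROUND, ROUNDS_2_16, ROUNDS_17_63,
                   FINAL_ROUND, NEXT_BLOCK, DONE);
  signal state : state_t;
  signal cnt   : unsigned(5 downto 0);
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        state <= LOAD; cnt <= (others => '0');
      else
        case state is
          when LOAD         => state <= FIRST_ROUND; cnt <= (others => '0');
          when FIRST_ROUND  => state <= ROUNDS_2_16; cnt <= cnt + 1;
          when ROUNDS_2_16  =>
            cnt <= cnt + 1;
            if cnt = 15 then state <= ROUNDS_17_63; end if;
          when ROUNDS_17_63 =>
            cnt <= cnt + 1;
            if cnt = 62 then state <= FINAL_ROUND; end if;
          when FINAL_ROUND  =>
            if last_block = '1' then state <= DONE;
            else state <= NEXT_BLOCK; end if;
          when NEXT_BLOCK   => state <= FIRST_ROUND; cnt <= (others => '0');
          when DONE         => state <= LOAD;
        end case;
      end if;
    end if;
  end process;

  -- Moore outputs: they depend only on the current state.
  round_cnt <= cnt;
  sel_init  <= '1' when state = LOAD else '0';
  sel_word  <= '1' when state = ROUNDS_17_63 else '0';
end architecture;
```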
5. Results and Comparisons
This section describes the following results: (a) the evaluation of the hybrid architecture with fault detection in the presence of test vectors and (b) the evaluation of its implementation on FPGAs, along with a comparison with architectures presented in related works.
The methodology used in this research is based on obtaining an optimized architecture with a minimum number of specialized modules, providing an iterative design, where a control machine manages the data flow in all modules operating concurrently and in parallel; this machine is also in charge of handling the algorithm process. The methodology is shown in Figure 8.
This design is implemented in VHDL and analyzed on FPGA devices using Xilinx tools. The results of this iterative design are also used for comparison against the hybrid architecture with error correction. Once the iterative architecture presents optimal values (which occurs when there is a trade-off between a minimum number of clock cycles/hardware resources and high throughput), the critical and long paths are analyzed to decide where to place the registers of the pipeline stages. This process is not obvious or straightforward, since the algorithm has an iterative process at the data level (data are operated on in different branches). At the block level, each block is processed during 64 rounds, as defined by the standard. At the message level, the full message is split into blocks, where each block result is chained to generate a fixed-length hash, regardless of the size of the original message. Additionally, the architecture is designed and modified to implement two redundancy schemes, time and hardware (hence the term hybrid architecture), using an error-correction scheme. Then, several designs are required to find the optimum, based on the same type of trade-off analysis as the iterative design.
Different architectures were designed to achieve the optimal trade-off, which could only be decided after each of them had been implemented in VHDL and evaluated on FPGA devices, as shown in Figure 8.
5.1. Analysis on Pipeline Architecture
This section describes the main advantages of the hybrid pipelined structure, measuring and comparing it against iterative and parallel structures. The proposal consists of combining attractive features of hardware and time redundancies. Specifically, we use the same hardware resource at different times (as defined in time redundancy) and simultaneously process several blocks (as defined in hardware redundancy). We integrate both redundancies using a pipeline structure.
In the following, we present a formal analysis of the hybrid pipelined structure, focusing on three cases: (a) hardware redundancy; (b) time redundancy; and (c) computation of the proposal in the hardware architecture (hybrid redundancy).
Let B be a SHA-256 simple iterative architecture (see Figure 1) requiring a power p and an energy E, which can be represented as follows:

E = p · t.    (1)

At this point, power consumption is constant at any instant and over time, while energy consumption varies with the elapsed time t.
B requires a latency l to obtain a correct hash value, which corresponds to a message block. The replicas of this architecture B are labeled as B′, B″, etc. Additionally, the SHA-256 hybrid-redundancy hardware architecture (see Figure 3) is labeled as B_H and its replicas are labeled as B_H′, B_H″, …, etc. The latency required by this architecture is represented by l_H. In this sense, three different types of analysis are carried out in this work:
Case (A) Architecture for hardware redundancy. In this case, hardware redundancy is implemented by means of N blocks (the modules B, B′, …, replicated N times). Using this scheme, parallel and concurrent processing is successfully achieved, as shown in Figure 9. This scheme yields high throughput, although all the devices operate at the same time and each device requires the latency l to process a data block. This proposal, as can be seen, sacrifices energy and power consumption because all N modules are present; hence, power and energy consumption are high and depend on N, as can be verified in Equation (2):

P_A = N · p,  E_A = N · p · l.    (2)
Case (B) Architecture for time redundancy. As opposed to the previous case, this proposal processes a single data block using a single module B during N time segments (the time-redundancy factor N); that is, module B is required from iteration 1 to iteration N, where each time segment represents the latency needed to process the data block, as can be verified in Figure 10. This proposal requires fewer hardware resources; however, throughput decreases as a consequence. The advantage is that a single module or instance does not consume a high power p, although the required energy grows as time goes by, since more usage time is needed, depending on the latency and the number of redundant computations; this is modeled in Equation (3):

E_B = p · (N · l).    (3)

This situation represents a drawback, because throughput is decreased.
Case (C) Architecture for hybrid redundancy. The proposed hybrid architecture defines an instance B_H that operates several times with the same data block (for example, N times), reporting a greater latency l_H when compared against l. Unlike case A, where N blocks are used, this architecture requires only one instance (B_H) to be used during a given time; and as opposed to case B, where a block is used several times, block B_H is occupied just once (with its corresponding latency). The corresponding latency is proportional to N, which can be verified in Figure 11. In this case, there is a power consumption p required by the device, and energy consumption can be inferred from the time a data block takes to be processed:

E_H = p · l_H.    (4)
Many factors affect energy and power consumption, such as resources for placing and routing, temperature, designs, devices, etc. If those factors do not interfere, consumption is very similar to that of case B. Nonetheless, this alternative presents an advantage: because the internal processes are shorter, the data travel shorter paths (it is easy to verify this behavior by observing the critical path and the minimum clock period). To summarize, this architecture, B_H, has an energy consumption similar to case B but with shorter critical paths. On the other hand, case A generally shows minimum energy consumption and poor performance; however, passive and active hardware resources require the same energy, regardless of whether the instances or blocks have a standby process. It must be noted that, if they are put in a standby state, then there is no need to have several instances of the block; hence, there would be a noticeable performance gain for most applications. It is important to highlight that an unrolled architecture could improve the results of this proposal, although this must be further analyzed.
5.2. Evaluation Metrics on Implementation
This section presents the implementation results for the proposed hardware architecture, for its validation and prototyping. There are different FPGA manufacturers, and each one features several technologies, architectures, configurations, families, and devices for the development of software and hardware implementations. Each FPGA family is designed for a particular target market, offering at least three options for developers: low, medium, and high range. Testing with a vast spectrum of manufacturers and technologies was not the central objective of this work and is impractical. However, we compared one manufacturer, considering four different FPGA families and the three options offered to developers [22]. In this sense, the objective of the implementations is to show an analysis of the scope of the proposed hybrid-redundancy architecture, where we can observe the requirements of this fault-tolerant architecture concerning design and implementation results, in addition to an analysis comparing the fault-tolerant and non-tolerant architectures. We also understand that the architecture of each manufacturer is very different and must be evaluated, but our results show that there is no dependency on specialized resources, and only common hardware resources are required for the fault-tolerant encryption process. In this way, the proposed architecture is synthesized, mapped, placed, and routed for different FPGA technologies (Spartan-7, Artix-7, Kintex-7, and Virtex-7 devices) with the Vivado 2019.2 design tool. The implementation considers real-time operation conditions (the design conformance test data are used for this purpose) for simulation and verification.
These architectures are evaluated using several metrics: some are provided by the Vivado tool (number of slices, LUTs, flip-flops (FFs), BRAMs, DSPs, and IOBs, as well as the minimum clock period), while the others (throughput and efficiency) are obtained through calculations, as explained next. All these metrics are used to analyze different trade-offs. The processed bits per second (bps) are used to compute throughput, given the clock frequency obtained when the architecture is implemented on a specific FPGA device, as shown in Equation (5):

Throughput = (Data size × Clock frequency) / Latency.    (5)
Implementation efficiency is the other metric. For a given implementation, it is defined as the ratio between the achieved throughput and the area or amount of hardware resources used for the implementation; the throughput per LUT (bps/LUT) and per FF (bps/FF) are two examples. This is reflected in Equation (6):

Efficiency = Throughput / Area.    (6)
Next, two types of analysis of results are presented: (1) the comparison between the simple SHA-256 architecture and the hybrid SHA-256 architecture with fault detection and (2) the comparison with related works.
The iterative structure in the simple SHA-256 architecture (see Figure 1) processes a 512-bit message block during 64 clock cycles. This architecture does not detect faults and is a basic element for the design and development of the fault-detection SHA-256 hybrid-redundancy structure, which has a 3-stage pipeline architecture.
For the comparison between the two SHA-256 architectures, i.e., the simple one and the hybrid one with fault detection, implementation results appear in Table 3 for the former and Table 4 for the latter. These tables show the hardware resources required for the corresponding implementation (LUT, LUTRAM, FF, IOB, and BUFG), along with the clock period (all provided by the tool); the data size used and the latency, both characteristics of the design, are also shown. Finally, the performance measures, throughput and efficiency, defined in Equations (5) and (6), respectively, are presented as well. The critical path time corresponds to the clock period, i.e., the minimum required clock period, used to determine the maximum clock frequency, which is, by definition, the inverse of the minimum clock period; this establishes the throughput and efficiency values of the various implementations.
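As a worked check of Equations (5) and (6), the reported Virtex-7 figures (441.72 Mbps of throughput, 2255 LUTs, 195-cycle latency, 1536-bit data size) can be plugged in; note that the clock frequency below is derived from these reported numbers, not reported directly:

```latex
% Worked check of Equations (5) and (6) for the Virtex-7 implementation;
% f_clk is implied by the reported throughput, latency, and data size.
\begin{align*}
  \text{Throughput} &= \frac{\text{Data size} \times f_{clk}}{\text{Latency}}
    = \frac{1536~\text{bits} \times f_{clk}}{195~\text{cycles}}
    = 441.72~\text{Mbps}
    \;\Rightarrow\; f_{clk} \approx 56.07~\text{MHz},\\
  \text{Efficiency} &= \frac{\text{Throughput}}{\#\text{LUTs}}
    = \frac{441.72~\text{Mbps}}{2255~\text{LUTs}}
    \approx 195.9~\text{Kbps/LUT}.
\end{align*}
```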
There are two points worth mentioning regarding design and implementation. On the one hand, the proposed designs reflect three constant design parameters: (1) the latency is 195 clock cycles for the tolerant architecture and 64 clock cycles for the non-tolerant architecture; (2) the data size is 512 bits for both architectures; and (3) the tolerant architecture has 293 input/output ports (I/Os), while the non-tolerant architecture has 292 I/Os. On the other hand, the implementations of those architectures show three constant parameters: (1) the number of FFs is 2516 for the tolerant architecture and 753 for the non-tolerant architecture; (2) 64 LUTRAMs are used in both architectures; and (3) the tolerant architecture requires 2255 LUTs, while the non-tolerant architecture needs between 1338 and 1350. The implementation results depend on the manufacturer's FPGA architecture and on the tool (and its algorithms) for synthesis, place, and route.
The hybrid architecture occupies approximately 67% more LUTs than the iterative architecture. It also uses 2.34 times more FFs (they are required for the registers of each pipeline stage and for the extra modules needed for error correction). Both architectures use the same amount of LUTRAM because they store the 64 round constants, and the hybrid architecture requires one additional IOB (for the Error signal) and one more BUFG.
The design of the hybrid architecture tries to balance the paths of each pipeline stage, and several modules are added to the simple architecture, making the critical path time shorter except for the Spartan-7 implementation. Even though both architectures process 512-bit message blocks, they differ in latency, since the simple architecture requires 65 clock cycles, while the hybrid one uses 195.
Table 4 presents an unreal throughput, computed based on a 1536-bit input data size (three 512-bit message blocks), although the 512-bit bus is the real size. This throughput would apply if the hybrid architecture were used as a 3-stage pipeline with three communication lines (three 512-bit message blocks, i.e., an input data size of 1536 bits); however, in the proposed situation, the hybrid proposal is utilized for detecting and correcting faults, processing a single 512-bit message block through a 3-stage pipeline architecture. The iterative architecture performance is very similar to the unreal throughput and efficiency of the hybrid proposal, which is achieved by the design of the simple architecture (see Figure 1), requiring fewer modules and presenting a balanced critical path. Considering a real situation, the throughput of the simple architecture is almost three times higher than that of the hybrid architecture, and the same holds for both efficiencies (Mbps/LUT and Mbps/FF).
The results of the comparison with related works are shown in Table 5. We are conscious that, since different FPGA technologies are used, this comparison cannot be considered entirely fair. These results, however, provide useful insights into the design techniques as well as the implementation results. All the works reported architectures on Virtex technology, except [1], which uses Altera technology. The presented implementations are SHA-1, SHA-256, and SHA-512 functions. Most works present two architectures (with and without fault detection). Disregarding [20], the architectures with fault detection implement hybrid redundancy (each work uses two types of redundancy), where hardware redundancy is a common technique. In our case, these architectures are named simple (without fault detection) and hybrid (with fault detection).
The architectures reported in [1] comprised two SHA-512 hardware implementations, one simple and the other based on parity coding and hardware redundancy, the latter focused on single, permanent, and transient faults. In comparison with our proposal, they used more than twice the hardware resources, which can be explained by the authors implementing the SHA-512 algorithm, although they used parity coding. In [17], a hardware architecture based on time and hardware redundancy was presented, considering permanent and transient faults. The authors implemented rounds for the SHA-1 and SHA-512 algorithms, where the architectures with error detection reported decreased throughput, and the SHA-512 architectures were faster than the SHA-1 architectures. Although using different technologies, the required amount of hardware resources was very similar to that of our proposal, even though they only implemented round computations.
In Kahri et al. [20], the authors presented a hardware architecture based on time redundancy, designed against fault-injection attacks. Using a Virtex-II Pro, the authors implemented the SHA-512 algorithm and compared architectures with and without fault detection. The amount of resources used and the clock frequency were very similar to those of our proposal. Finally, in Michail et al. [18], the authors reported a hardware architecture using information redundancy and hardware redundancy, implementing the SHA-256 algorithm as in our proposal, although they used Virtex-5 technology. They required more than 4 times the hardware resources and their implementation operated at a lower frequency, although they reported the highest throughput. While a similar throughput would be expected, this was not the case, and details about latency were not provided; hence, no comparison could be made on this point.
5.3. Analysis for Fault Detection
The proposed architecture was evaluated using a text file containing 5000 test vectors. A VHDL test bench was used to apply these vectors at simulation time (using the Vivado tool). The test vectors are made up of: (1) the message, organized in 512-bit blocks; (2) the control buses required by the control module for managing the data flow; (3) the correct hash result; and (4) a 32-bit sub-vector, required for modifying certain internal computations and emulating external or internal faults.
The 32-bit sub-vector, mentioned in point (4), is key for this evaluation, since it is required for modifying computations at different hardware modules. For testing purposes, the hybrid architecture was modified three times. These modified versions of the hybrid architecture were not implemented in an FPGA, because they were only used for evaluating fault detection. The three versions were altered by executing an XOR operation between the 32-bit sub-vector of point (4) and a 32-bit output of: (a) a module of the first stage; (b) a bus of the second stage; and (c) the lower bits of the multiplexer of the third stage. The modified blocks were labeled PEB; their input was altered by an XOR operation between the original input and the injected error, which could change the value of the data bus according to the truth table (see the modules in Figure 12).
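A minimal sketch of such an injection block follows; the entity name and ports are ours, and the wrapper simply XORs the 32-bit error sub-vector into an internal bus so that any '1' bit flips the corresponding data bit:

```vhdl
-- Sketch of a PEB-style fault injector (names are illustrative): XORing
-- the error sub-vector into an internal bus flips exactly the bits that
-- are '1' in err_vec, emulating a transient or permanent fault.
library ieee;
use ieee.std_logic_1164.all;

entity peb is
  port (
    data_in  : in  std_logic_vector(31 downto 0); -- original internal bus
    err_vec  : in  std_logic_vector(31 downto 0); -- 32-bit error sub-vector
    data_out : out std_logic_vector(31 downto 0)  -- possibly corrupted bus
  );
end entity;

architecture rtl of peb is
begin
  -- err_vec = x"00000000" leaves the data intact (fault-free run).
  data_out <= data_in xor err_vec;
end architecture;
```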
Using the proposed hardware architecture without modifications and the three modified architectures, the vectors were used to execute four different runs, i.e., 20,000 tests were evaluated. The hybrid architecture without modifications (the one implemented in an FPGA) was tested in the first run; for this purpose, one complete epoch of the dataset was applied and analyzed by the test bench (in this case, the 32-bit sub-vector was not utilized). The three remaining runs were applied to the three variations of the hybrid architecture; in these cases, 15,000 communication faults were simulated using the 32-bit sub-vectors. These results showed that all faults (i.e., 100% of them) were detected by the hybrid architecture.
The four runs were essential for analyzing error injection, since cryptographic algorithms have high diffusion. This means that a slight modification (from 0 to 1 or vice versa) of a single bit can generate a different value on the output bus (most bits change their value). Therefore, two cases of error injection were analyzed: the stationary and permanent cases.
For the stationary case, the results of this analysis showed that, if one or two registers were altered, it was possible for the architecture to recover 100% of the time (the majority prevailed). However, if four registers were altered, the architecture was not able to recover (this happened because most of the registers held different contents). Hence, the output signaled “ready but with an error”, which meant precisely that the system did not recover; in such a case, the result must be recalculated or simply cannot be trusted.
For the permanent case, the same logical operators were used for generating permanent errors, which were maintained over time. Here, the test vectors were changed as in real contexts. In this case, there were situations where the output signals emphasized that it was not possible for the architecture to recover from the error and that, although the output was ready, it was not correct, which also meant it had to be recalculated. In three runs, 500 test vectors were used, and the architecture recovered whenever three or more registers at the end of the pipeline matched (majority). Persistent failures occurred while the error-injecting signal was present (this changed once the architecture was able to generate the required majority, since this was the only way it could recover from the error).
The results of these evaluations showed that both stationary and permanent errors were detected in 100% of the cases thanks to the hybrid architecture, although the architecture may or may not recover, depending on the similarity of the registers. Because the modifications derived from the injection depended on the high cryptographic diffusion as well as on which bit was corrupted (which, in turn, depended on whether it was affected by the logical operation or not), an injection did not always result in a modification of the output.
In the temporary-injection experiments, the test vectors with errors were applied during one or two clock cycles, whereas in the permanent injections they were applied during three or more clock cycles. As a consequence, in the former case, these vectors altered no more than two stage registers in the pipeline structure; this change depended on the high diffusion, on whether the logical operation modified the bits, and on whether the injection affected the stored hashes at the end of the pipeline (see Table 6).
Fault detection reached 100%, but recovery was about 34.2%. The correction ratio increased when the fault was injected during just three clock cycles in different modules of each pipeline stage; conversely, as the injection time increased, the correction ratio decreased. The novel approach proposed in this work allowed us to obtain these results, also providing a more reliable strategy for detecting and correcting faults.
6. Discussion
We emphasize that the proposed architecture maintains the complete calculation of the SHA-2 algorithm, without modification, but the hardware resources of the architecture are used iteratively with feedback, maintaining the standardized security levels established by different state-of-the-art analyses. The architecture proposed in this manuscript is not meant to be lightweight, since it consists of a completely unrolled round, which is used 64 times in the case of the non-tolerant architecture (64-cycle latency) and 192 times in the case of the tolerant architecture (included in the 195-cycle latency). The first architecture is iterative and is used to obtain the hash in the usual way, while the latter computes three hashes using the same modules and checks whether they are correct or an error occurred; in the latter case, it checks whether it is possible to recover from the error (by means of a majority vote); otherwise, the Error output signal is set.
Figure 13 shows the critical paths in each stage, which implies it is not possible to reach large clock frequency values because of the existing long paths. Red lines indicate a path where hardware resources appear for both the place (hardware for the algorithm's logical and arithmetic calculations) and route (routing hardware) processes. Long paths can be reduced if more stages are used, which would need more pipeline registers and proportionally more registers to store the final hashes, depending on the number of stages; the voting module would also increase in complexity. For example, [21] describes a tolerant architecture for the symmetric AES-128 algorithm, showing how complex the voting module becomes when a 5-stage pipeline is used, and also showing that there are options to reduce this combinatorial complexity. Increasing the number of pipeline stages is not the intention of this work and can be part of future work, where exploring the relationship between the number of stages and the performance results can be an important part of the analysis; the intention of this paper is to explore, lay the ground for, and innovate the pipeline technique in a redundancy process for the SHA-256 algorithm, which, to the best of our knowledge, has not been used in hash functions. Most pipeline systems achieve higher frequencies because the dataflow allows division into a large number of stages, improving throughput and frequency; however, when prediction problems, jump/no-jump situations, halting, control hazards, data hazards, etc. appear, much of the power of pipelining is lost. In cryptographic cases, such as that of SHA-256, there is a high data dependency [23], which removes the advantages of pipelining; in this situation, and in our case, it is not possible to apply pipelining to increase throughput, because the high data dependency returns the system to the worst-case scenario of pipelining due to data stalls (i.e., the processing of a block must be finished before starting the next one; in other words, the 64 rounds must be completely applied to one block before starting the 64 rounds of the next one).
The maximum frequency for some Virtex-7 devices is 700 MHz; however, a design will hardly reach that frequency, since designs for different applications must use different configurations and numbers of gates with intermediate registers. In truth, FPGA designs have different complexities, the algorithms that can be implemented are very diverse, and architecture implementations reduce the maximum frequency range of the device [24]. Cryptographic solutions commonly represent a bottleneck; hence the need for research such as that presented here, where the analysis shows the scope of our proposed architecture, which can operate according to its maximum frequency, performance, hardware requirements, and efficiency.
Traditionally, a pipeline structure has many stages, which means that high frequencies and high processing capabilities are achieved. Pipelining can hardly be used in cryptographic implementations, since the operating modes of the cryptographic algorithms require finishing the encryption of a message block before starting to encrypt another block of the same message [25]. Pipelining is therefore not practical for cryptographic solutions on a single communication line. Instead, we use a pipeline for a single communication line, considering hardware resources efficiently and following the philosophy of redundancy. For example, suppose audio is sent on a communication line. This audio is separated into blocks, which are hashed with SHA or encrypted with AES in CBC/OFB/CFB/CTR mode. We remark that we must finish processing one block before starting another, so cryptographic architectures with many pipeline stages are useless here; the worst case occurs, as if all the pipeline stages were a single stage. Using E stages in a pipeline architecture would work only if E independent communication lines were used. AES in ECB and CTR modes allows ciphering several blocks at once in a parallel fashion, whereas AES in CBC, OFB, and CFB modes does not allow parallel ciphering, because it needs to finish ciphering one block before starting the next one. This completely affects a pipeline architecture, since the halting problem appears, stopping all the pipeline stages until the ciphering of a block is finished; i.e., there is no longer any advantage from pipelining in working with several instructions or operations at once, while more hardware resources are used, which is a disadvantage. This happens with any cryptographic hash function, such as SHA-256. We propose using redundancy and pipeline structures in any mode of operation for some symmetric cryptographic algorithm and any cryptographic hash function, showing the involved operations.