Article

Adaptive-Hybrid Redundancy with Error Injection

1
Department of Electrical and Computer Engineering, Air Force Institute of Technology, Wright-Patterson AFB, OH 45322, USA
2
Department of Engineering Physics, Air Force Institute of Technology, Wright-Patterson AFB, OH 45322, USA
*
Author to whom correspondence should be addressed.
Electronics 2019, 8(11), 1266; https://doi.org/10.3390/electronics8111266
Submission received: 30 September 2019 / Revised: 25 October 2019 / Accepted: 29 October 2019 / Published: 1 November 2019

Abstract

Adaptive-Hybrid Redundancy (AHR) shows promise as a method to allow flexibility when selecting between processing speed and energy efficiency while maintaining a level of error mitigation in space radiation environments. Whereas previous work demonstrated AHR’s feasibility in an error free environment, this work analyzes AHR performance in the presence of errors. Errors are deliberately injected into AHR at specific times in the processing chain to demonstrate best and worst case performance impacts. This analysis demonstrates that AHR provides flexibility in processing speed and energy efficiency in the presence of errors.

1. Introduction

Adaptive-Hybrid Redundancy (AHR) was developed to enable flexibility in radiation hardening redundancy methods for space vehicles. AHR incorporates Triple Modular Redundancy (TMR) and Temporal Software Redundancy (TSR) such that AHR can switch between TMR and TSR modes as needed [1]. This previous work demonstrated that AHR functions as designed, switches from TMR to TSR, and uses less energy to complete programs than TMR while completing those programs in less time than TSR in an error free environment [1]. The objective of this paper is to further illustrate the advantages and flexibility of AHR when compared to TMR or TSR alone both in error free and error prone simulated environments. This paper does not seek to prescribe how much time AHR should spend in TMR or TSR operating modes, but rather to provide a new redundancy framework to space vehicle designers, mission planners, and operators so they can decide how much time AHR should spend in each mode based upon radiation environment, processing speed requirements, energy consumption requirements, and mission requirements.
The remainder of this section discusses previous redundancy mitigation research and the architecture of AHR. Section 2 discusses the methods used to evaluate the performance of AHR. Section 3 discusses the results of the AHR performance evaluation. Section 4 discusses the impact of the results and concludes the paper.

1.1. Background

Previous Single-Event Upset (SEU) mitigation research has focused on hardware, software, hybrid, or adaptive redundancy techniques. This section will briefly review some of the research leading up to AHR.

1.1.1. Hardware Redundancy

Dual-Modular Redundancy (DMR) describes an SEU mitigation method that operates two processors in parallel by providing those processors simultaneous identical inputs and operating those processors on a common clock. A hardware comparator module compares the outputs of the two processors to ensure that they are identical. If the comparator detects a difference, an error has occurred, and both processors’ internal state is restored to a previous known state that was stored to radiation hardened/immune memory. In DMR, the processors are periodically interrupted to save their internal state to memory; this is called a Save/Restore Point [2,3,4,5,6]. To create the Save/Restore Point, the comparator issues commands to both processors to write each register to memory. The comparator checks that the register values received from the two processors are equal before writing the result to memory. If an error is detected at this stage, the comparator enters error recovery mode and returns to the last Save/Restore Point. After all registers are written to memory, the Program Counter value is written to memory as well. During error recovery, the comparator resets both processors, then issues a series of load commands to both processors to load each register value in the Save/Restore Point into registers. The comparator finishes the error recovery process by issuing a branch command to return the processors to the Program Counter value stored in the Save/Restore Point. The comparator performs Save/Restore Point creation and error recovery by traversing a series of states in its internal finite state machine, which is implemented in hardware.
The timing of Save/Restore Point creation is application specific and depends on factors such as radiation environment and operational needs. The Save/Restore Point time period is specified as a set number of instructions to complete between Save/Restore Points. Save/Restore Point creation takes time away from processors performing intended tasks and slows execution time so there is some pressure to maximize the amount of time between Save/Restore Point creation events. However, in the event an error occurs, the amount of time between Save/Restore Points dictates the amount of time that must be spent to recover from an error to the point at which the error initially occurred. This represents a pressure to reduce the amount of time between Save/Restore Point creation.
Triple-Modular Redundancy (TMR) describes an SEU mitigation method that operates three processors in a similar manner to DMR with some notable differences. First, the hardware comparator module is replaced with a hardware majority voter. Second, when the majority voter detects that two of three processors agree while one disagrees, the disagreeing processor is reset and the internal state of the two agreeing processors is copied to the disagreeing processor so that all three processors are in agreement. This greatly reduces the recovery time when compared to DMR. In the unlikely event that two or more processors encounter an error such that all three processors disagree, TMR restores all three processors to a previous Save/Restore Point in the same way as DMR: the voter contains a hardware implemented finite state machine to create the Save/Restore Point and recover from errors [7,8,9,10,11,12,13]. The voter is typically assumed to be immune to errors or is hardened in some way. This research assumes that the voter is immune to errors as well. While the voter in this research is not implemented in such a way as to be error immune or radiation hardened, radiation hardening could be achieved through radiation shielding, hardware redundancy, or hardening by design.
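The voting rule described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the hardware voter used in this research; the function name `tmr_vote` and its return convention are assumptions made for the example.

```python
from collections import Counter


def tmr_vote(outputs):
    """Majority-vote three processor outputs.

    Returns (voted_value, faulty_index):
      - all three agree: (value, None)
      - two agree:       (value, index of the dissenting processor)
      - all disagree:    (None, None), i.e. the caller must roll back
                         to the last Save/Restore Point
    """
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, None
    if count == 2:
        # Identify the single processor whose output differs.
        faulty = next(i for i, v in enumerate(outputs) if v != value)
        return value, faulty
    return None, None
```

In the two-agree case the returned index identifies the processor that would be reset and resynchronized from the other two; the all-disagree case corresponds to the rollback path shared with DMR.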
N-Modular Redundancy (NMR) is a majority voting redundancy method that is similar to TMR, but uses N processors instead of three processors. It is considered more robust than TMR because a permanent single processor failure will result in N-1-Modular Redundancy. So long as N-1 is greater than or equal to three, the majority voting redundancy method still functions. However, there is an energy penalty to be paid because NMR uses at least N times as much energy to complete a program as a single processor with no redundancy [14,15,16].

1.1.2. Software Redundancy

The first software redundancy method this paper discusses, Error Detection by Duplicated Instructions (EDDI), is very similar to DMR, but is implemented in software instead of hardware. This software redundancy method runs on a single processor. Dual redundancy is implemented by duplicating all instructions that do not store data to memory. Each duplicated instruction stores its results to a different, physically separated register from the original instruction to achieve spatial separation of the original and duplicate results. This greatly reduces the likelihood that a single, or multiple, radiation event(s) will cause the exact same error in both the original and duplicated results. The DMR comparison function is implemented in software by adding a comparison instruction immediately prior to any store instruction. If the original and duplicate registers are identical, the original is stored to memory. If the original and duplicate are not identical, an error has occurred and program execution jumps to code that performs error recovery. This error recovery code restores the state of the processor to a previously saved state called a Save/Restore Point in a similar manner to DMR, but uses software instead of a hardware finite state machine. Similarly to DMR, EDDI periodically interrupts normal program execution to create Save/Restore Points by executing code designed to create the Save/Restore Points [17,18,19]. EDDI was proposed by Oh et al. [17,18] while Tokponnon et al. discuss a very similar software redundancy method [19]. Table 1 shows an example of what EDDI code looks like in the “redundant set” when compared to a non-redundant instruction set called the “original set”. In this example, LUI is the MIPS load upper immediate instruction, ADD is the MIPS addition instruction, SW is the MIPS store word instruction, and BNE is the MIPS branch if not equal instruction.
ERR is the value of the distance which the BNE should jump in code execution if R3 and R17 are not equal. OFFSET is the value that should be added to the memory location specified by R0 where R3 should be stored. Any value indicated by R# is a register number.
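The duplicate-then-compare-before-store pattern can be mimicked with a small Python model of a register file. This is a hedged sketch of the idea only: it is not the contents of Table 1 and not the compiler output used in this research. R3, R17, R0, and OFFSET follow the names used in the text; the shadow registers R15 and R16, the value of OFFSET, the bit chosen for the injected fault, and the `EddiError` exception (standing in for the jump to ERR) are assumptions for illustration.

```python
class EddiError(Exception):
    """Raised where the real code would branch to the ERR recovery routine."""


def run_eddi(a, b, inject_fault=False):
    """Add a and b with duplicated instructions, then store the checked result."""
    regs = [0] * 32          # R0 stays 0, as on MIPS
    memory = {}
    OFFSET = 4               # hypothetical store offset

    # Original loads and their duplicates into separate (shadow) registers.
    regs[1], regs[15] = a, a
    regs[2], regs[16] = b, b

    # ADD R3, R1, R2 and its duplicate ADD R17, R15, R16.
    regs[3] = (regs[1] + regs[2]) & 0xFFFFFFFF
    regs[17] = (regs[15] + regs[16]) & 0xFFFFFFFF

    if inject_fault:
        regs[3] ^= 1 << 5    # emulate an upset in the original result only

    # BNE R3, R17, ERR: compare immediately before the store.
    if regs[3] != regs[17]:
        raise EddiError("R3 and R17 differ; jump to error recovery")

    # SW R3, OFFSET(R0): only the original result is written to memory.
    memory[regs[0] + OFFSET] = regs[3]
    return memory
```

Calling `run_eddi(2, 3)` stores the sum at address OFFSET, while `run_eddi(2, 3, inject_fault=True)` raises `EddiError` before anything is written, which is the behavior the comparison instruction is meant to guarantee.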
Oh et al. [20] and Reis et al. [21,22] improved EDDI by adding signature detection in order to determine when Program Counter errors have occurred as a result of a missed branch or illegal branch. Signature detection methods break a program into segments, compute segment signatures at compile time, and insert a segment signature computation and a segment comparison instruction into each segment. The signatures are unique and are dependent upon the preceding segment and the current segment. At runtime, the signature is recomputed and compared to the compile time signature. Any discrepancy between the two is interpreted as a Program Counter error as a result of a missed branch or illegal branch between segments [20,21,22].

1.1.3. Hybrid Redundancy

Hybrid redundancy can take many forms and can consist of any number of combinations of hardware, software and error correcting codes.
The first hybrid redundancy method this paper discusses is only applicable to Field Programmable Gate Arrays (FPGAs) which have a configuration memory that is vulnerable to Single Event Upsets (SEUs). Many FPGA configuration memories are composed of Static Random Access Memory (SRAM) cells which are highly susceptible to SEUs. These configuration memories specify constants, logic functions, and signal routing on the FPGA. An error in any of these can have a catastrophic effect on the intended function of the hardware designed into the FPGA. Those who wish to implement a processor on an FPGA typically combine a TMR-like method with a method of correcting configuration memory called internal scrubbing. The primary concern with FPGAs is the configuration memory rather than the registers used in the processor. Internal scrubbing detects and corrects errors in the configuration memory, but can only do so at the memory refresh rate. TMR is used as a stop-gap measure to ensure outputs are correct in spite of configuration errors until internal scrubbing can correct those errors. The TMR method only does majority voting and does not correct faulty processors [23,24,25,26,27,28].
Another method of hybrid redundancy duplicates instructions in software, similar to EDDI, but uses a hardware comparator similar to the DMR comparator [29]. A method that reverses the roles of the software and hardware portions has also been implemented, using hardware for redundancy and software for comparisons [30].
A few methods of hybrid redundancy combine hardware or software redundancy with error correcting codes to protect processor registers and/or memory [31,32,33].

1.1.4. Adaptive Redundancy

Only two adaptive redundancy approaches were found in the literature survey. The first is a very simple approach that uses a radiation sensor to detect the ambient radiation environment and determines when to implement TMR and when to operate using a single processor without any mitigation [34]. This is also the only example discovered in literature that applies adaptive redundancy to a processor.
The second uses three different software redundancy methods to protect memory. Each method differs in the degree of error protection, memory access speed, and energy consumption. When there are very few SEUs occurring, the method that provides the least error protection, operates the fastest, and uses the least amount of energy is utilized. As the SEU rate increases, the method that provides an intermediate level of error protection, intermediate memory access speed, and intermediate energy consumption is used. As the SEU rate becomes too great for the intermediate level of protection to handle, the method that provides the greatest level of error protection is used at the expense of the slowest memory access and greatest energy consumption [35].

1.2. Adaptive-Hybrid Redundancy

AHR, as implemented in this work and the previous work [1], consists of TMR and TSR. The TMR implementation is just as described in Section 1.1.1 and the TSR implementation is the EDDI method described in Section 1.1.2. For this research, the time between creation of Save/Restore Points was arbitrarily selected to be 10,000 instructions for TMR and AHR operating in TMR mode and 250 main program loops for TSR and AHR operating in TSR mode (many real-world programs typically have a main loop which is repeated numerous times until program completion). These values were chosen to ensure that every program would create at least one Save/Restore Point during its execution. These are tunable parameters that a space vehicle designer, mission planner, or operator could change as needed based upon the program running on the processor, performance requirements, radiation environment, and mission needs.
A simple illustration of the TMR architecture is shown in Figure 1. The previous work demonstrated that a program running in an error free environment in TMR takes 65% longer to run than a program running on a single processor with no redundancy because the voter adds delay to all processor inputs and outputs. The TMR architecture also uses three times the instantaneous power of a single processor because there are three processors instead of one. A program running TMR takes approximately 430% more energy to complete due to the number of processors and the added time taken to run the program [1].
A simple illustration of the TSR architecture is shown in Figure 2. The EDDI TSR method uses a special compiler to take a normal program and make it into a TSR program. The previous work demonstrated that EDDI TSR programs take 113% longer to complete than non-redundant programs and the TSR architecture uses the same amount of instantaneous power as a single processor because the TSR architecture only uses a single processor. However, TSR uses 113% more energy to complete a program than a non-redundant program because it takes 113% longer to run that program [1].
The AHR architecture adds a module called the AHR Controller to the TMR architecture as shown in Figure 3. The AHR Controller is assumed to be immune from errors for this research just as the TMR voter is assumed to be immune from errors. When AHR operates in TMR mode, the TMR Voter and three processors operate normally and the signals between the TMR Voter and memory and from the TMR Voter to the three processors are passed through combinational logic in the AHR Controller with minimal delay. Figure 4 shows how signals flow when AHR is operating in TMR mode by illustrating connected signals and modules that are operational in black and those that are not in red. When operating in TSR mode, the AHR Controller turns off the TMR Voter and two of the three processors. The remaining single processor communicates directly with memory by passing signals through combinational logic in the AHR Controller with minimal delay. Figure 5 shows how signals flow when AHR is operating in TSR mode by illustrating connected signals and modules that are operational in black and those that are not in red.
AHR begins in TMR mode and switches to TSR mode after a predetermined number of TMR instructions are completed without encountering an error. AHR remains in TSR mode so long as two consecutive errors do not occur. Two errors are considered consecutive if a second error occurs after TSR recovers from a first error and the second error occurs before TSR can create a new save/restore point. If consecutive errors occur, AHR transitions to TMR when the second error is detected. If TSR creates a new save/restore point before encountering a second error, the second error is not consecutive and AHR continues in TSR mode. This approach gives TSR mode an opportunity to recover from errors so long as the error rate is sufficiently low.
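The switching rules in the preceding paragraph can be summarized as a small state machine. The sketch below is illustrative and is not the AHR Controller hardware; the class and method names are hypothetical, and resetting the error-free instruction count when an error occurs in TMR mode is an assumption read from the phrase "completed without encountering an error".

```python
class AhrModeController:
    """Toy model of AHR mode switching between TMR and TSR."""

    def __init__(self, switch_point=15_000):
        self.switch_point = switch_point      # error-free TMR instructions before TSR
        self.mode = "TMR"                     # AHR begins in TMR mode
        self.clean_tmr_instructions = 0
        self.error_since_save = False         # TSR error with no new save point yet

    def on_instruction(self):
        """Call once per completed instruction."""
        if self.mode == "TMR":
            self.clean_tmr_instructions += 1
            if self.clean_tmr_instructions >= self.switch_point:
                self.mode = "TSR"
                self.error_since_save = False

    def on_save_point(self):
        """A new save/restore point means a later error is not consecutive."""
        self.error_since_save = False

    def on_error(self):
        """Call when the voter (TMR) or comparison instruction (TSR) detects an error."""
        if self.mode == "TMR":
            self.clean_tmr_instructions = 0   # assumption: restart the error-free count
        elif self.error_since_save:
            # Second error before a new save/restore point: fall back to TMR.
            self.mode = "TMR"
            self.clean_tmr_instructions = 0
            self.error_since_save = False
        else:
            self.error_since_save = True      # first error: TSR recovers on its own
```

With a small `switch_point`, a few calls to `on_instruction()` move the model into TSR mode; a single `on_error()` leaves it there, and a second `on_error()` without an intervening `on_save_point()` returns it to TMR mode.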
The processor upon which AHR, TMR, and TSR were based in past work, and are based in this work, is the Basic MIPS processor, a simplified MIPS32™ processor that supports only 33 of the 168 MIPS32™ instructions [36]. The details of the Basic MIPS, TMR MIPS, and AHR MIPS architectures are available in Air Force Institute of Technology technical reports [36,37,38].
AHR was shown to bridge the gap between TMR and TSR by switching between the two modes so that it runs faster than TSR and uses less energy than TMR, at the expense of running slower than TMR and using more energy than TSR [1]. In past work, AHR started in TMR mode and switched to TSR mode after TMR successfully completed 15,000 instructions without an error in an error free simulation. This work will examine how AHR performs when the TMR to TSR switch point is varied as well as how it performs when errors are injected into the simulation.
An appropriate error rate to be used for analysis was determined to be approximately one Single Event Upset (SEU) per hour by conducting radiation experiments on an Intel Cyclone V Field Programmable Gate Array (FPGA) at Sandia National Laboratories and performing post-experiment analysis. This is the expected average rate for a space vehicle using the Cyclone V FPGA over the life of the mission. However, the SEU rate for real missions fluctuates as a result of orbital position with reference to Earth’s magnetic field lines (e.g., the SEU rate increases over the South Atlantic Anomaly) and as a result of solar weather (e.g., changes in solar activity impact the SEU rate). This FPGA was chosen as a representative inexpensive commercial-off-the-shelf technology for past research and this research. Based on the experiment and the knowledge that nearly every program used in the previous work and the current work has a runtime of approximately 50 ms or less, it is reasonable to expect no more than one error per program run.
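A rough calculation shows why a single error per run is a conservative upper bound at these figures. The sketch assumes SEUs arrive as a Poisson process at the constant average rate, which is a modeling assumption made here and not a claim from the radiation experiments; the function name is hypothetical.

```python
import math


def prob_at_least_one_seu(rate_per_hour, runtime_s):
    """Probability of one or more SEUs during a run, under a Poisson model."""
    expected_events = rate_per_hour * runtime_s / 3600.0
    # -expm1(-x) equals 1 - exp(-x) but keeps precision for very small x.
    return -math.expm1(-expected_events)


# One SEU per hour on average and a 50 ms program run.
p = prob_at_least_one_seu(rate_per_hour=1.0, runtime_s=0.050)
```

Under this assumption the expected number of upsets in a 50 ms run is 0.050/3600, roughly 1.4 × 10⁻⁵, so even one error in a run is rare and two or more are rarer still.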

2. Materials and Methods

The only materials required for this work were a computer with a network connection and MATLAB installed. Previous work implemented Basic, TMR, TSR, and AHR MIPS architectures in Very High Speed Integrated Circuit (VHSIC) Hardware Description Language (VHDL). Previous work also made use of Mentor Graphics Questa Sim to simulate the VHDL architectures in order to determine important timing parameters for use in MATLAB analyses to compute program runtimes on the various architectures. These timing parameters include the time to complete individual instructions, program start, program close, save/restore point creation, error recovery, and TMR to TSR transition. The results of those simulations are used in the simulations and analyses in the current work. Previous work also made use of the Intel PowerPlay Early Power Estimator for Cyclone IV and Cyclone V tool to estimate the instantaneous power used by Basic, TMR, TSR, and AHR MIPS [1].
Section 2.1 will discuss the mechanism used to inject errors into Basic MIPS. Section 2.2 will discuss when and where errors will be injected and provide the analysis tools necessary to calculate program runtime when an error is injected.

2.1. Error Injection Mechanism

In previous works, fault injection was performed using software manipulation or direct electrical injection for Hardware-in-the-Loop (HITL) simulations [16,17,23,24,27,28,35]. This research modifies the Basic MIPS Datapath to directly inject errors into a specific general purpose register at a specific point during program execution. This is software error injection because Basic MIPS is simulated in software for this research. The General Purpose Registers (GPRs) were selected as the injection points because general purpose registers account for 992 bits that are susceptible to SEUs in the Basic MIPS architecture whereas the remaining susceptible registers account for only 68 bits. These remaining registers are the program counter, instruction register, Finite State Machine (FSM) register, and additional Datapath registers. If an SEU were to occur, it has a 94% chance of occurring in a GPR as opposed to a 6% chance of occurring elsewhere. Additionally, EDDI is unable to detect and correct all errors occurring in these other registers, so it was determined that it would be better to inject errors into GPRs so that TSR and AHR operating in TSR could detect and correct the injected errors for a more accurate performance comparison between TMR, TSR, and AHR. A detailed schematic showing how the Basic MIPS Datapath was modified to incorporate GPR error injection is provided in Appendix A.
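The 94% figure follows directly from the bit counts given above, and a register-level injection can be pictured as corrupting one stored value. The sketch below is illustrative: the bit counts come from the text, while the single-bit-flip fault model and the function names are assumptions (the actual Datapath modification is the one described in Appendix A).

```python
GPR_BITS = 992     # SEU-susceptible bits in the general purpose registers
OTHER_BITS = 68    # program counter, instruction register, FSM and other Datapath registers


def gpr_share():
    """Fraction of susceptible register bits that belong to GPRs."""
    return GPR_BITS / (GPR_BITS + OTHER_BITS)


def flip_bit(value, bit):
    """Return a 32-bit register value with one bit inverted (assumed SEU model)."""
    if not 0 <= bit < 32:
        raise ValueError("bit index must be in 0..31")
    return (value ^ (1 << bit)) & 0xFFFFFFFF


def inject_seu(gprs, reg, bit):
    """Return a copy of the register file with one bit of one GPR flipped."""
    corrupted = list(gprs)
    corrupted[reg] = flip_bit(corrupted[reg], bit)
    return corrupted
```

Here `gpr_share()` evaluates to 992/1060, which rounds to the 94% quoted in the text, and applying `flip_bit` twice with the same bit index restores the original value.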
If an error occurred in the program counter of TSR or AHR operating in TSR mode, the error would cause program execution to jump backwards and repeat instructions or forwards and skip instructions. If a backwards jump were performed, the impact would be increased runtime and energy used to complete the program. Additionally, an incorrect result may be written to memory. If a forward jump were performed, instructions would be skipped. In some instances, this might be detected if one of two redundant instructions were skipped. In other instances, the illegal branch might go undetected and several results that should have been written to memory might not be written to memory. Additionally, program runtime and energy to complete the program would be reduced.
If an error occurred in the instruction register of TSR or AHR operating in TSR mode, the error would simply be corrected by the Basic MIPS internal mechanism if the error resulted in an invalid instruction. If the error caused the instruction to change to another valid instruction, it might be detected if the error affected the result of a duplicated instruction. This same error might result in changing a store word instruction to something else and a result would fail to be written to memory resulting in an undetected error. A store word instruction error might also cause a result to be written to a wrong location in memory and this error would also go undetected. An instruction register error might also create a branch instruction where none existed or change the address of a branch instruction.
An FSM register error could cause TSR or AHR operating in TSR mode to incorrectly jump to another state which would most likely cause the processor to remain trapped in the incorrect state. Processing would cease and the program would never complete. At present TSR and AHR have no protection against such an error.
The other Datapath registers are used in determining whether to execute a program jump or not when evaluating a branch instruction. An error in one of these registers would cause the incorrect branch path to be taken. This could result in a longer program execution time with greater energy usage or a shorter program execution time with reduced energy usage.

2.2. Error Injection Timing

Errors are injected into GPRs that are going to be stored to memory so that the TMR Voter and the TSR comparison instruction will detect them and initiate error recovery operations. This is done immediately prior to the TMR instruction to store a GPR to memory and immediately prior to a TSR comparison instruction before storing a GPR to memory. Injecting an error for every store word instruction for all 1000 programs used in the current work is not feasible. Instead, errors are selectively injected to probe the minimum and maximum time and energy performance of the TMR, TSR, and AHR architectures. In all three architectures, the best-case errors will minimize the amount of time and energy expended on error detection and recovery and the worst-case errors will maximize the amount of time and energy expended on error detection and recovery.
Errors intended to maximize the amount of time and energy needed to recover from an error are injected immediately before Save/Restore Point creation. A TMR processor or AHR processor in TMR mode has to perform error recovery, then repeat 10,000 instructions to return to the point at which the error initially occurred. A TSR processor or AHR processor in TSR mode has to perform error recovery, then repeat 250 main program loops to return to the point at which the error initially occurred. Errors intended to minimize the amount of time and energy needed to recover from an error are injected immediately after Save/Restore Point creation. For all processors, this minimizes the number of instructions that must be repeated after error recovery to return to the point at which the error initially occurred. For a more detailed discussion concerning when errors are injected and the calculations performed to determine the program runtime for each type of error, please refer to Appendix B.
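The best-case and worst-case reasoning for rollback-based recovery can be written as a simple cost model: the time lost is the recovery time plus the time to repeat whatever work was completed since the last Save/Restore Point. The sketch below is a simplification for illustration and is not the calculation in Appendix B; the function names and all timing values are placeholders, and a "unit" stands for one instruction in TMR mode or one main program loop in TSR mode.

```python
def rollback_cost_ns(units_since_save, t_unit_ns, t_recover_ns):
    """Time lost to one rollback: recovery plus re-executing the lost work."""
    if units_since_save < 0:
        raise ValueError("units_since_save must be non-negative")
    return t_recover_ns + units_since_save * t_unit_ns


def best_and_worst_case_ns(save_interval, t_unit_ns, t_recover_ns):
    """Bounds for an error just after and just before Save/Restore Point creation."""
    best = rollback_cost_ns(0, t_unit_ns, t_recover_ns)
    worst = rollback_cost_ns(save_interval, t_unit_ns, t_recover_ns)
    return best, worst


# Placeholder timings: 100 ns per instruction, 5000 ns to recover,
# and the 10,000-instruction TMR save interval used in this research.
tmr_bounds = best_and_worst_case_ns(10_000, t_unit_ns=100, t_recover_ns=5_000)
```

The gap between the two bounds grows linearly with the save interval, which is the pressure to shorten the interval noted in Section 1.1.1.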
The 1000 programs used in this work were randomly generated and exercise the full range of Basic MIPS instructions and GPRs. Individual programs may not exercise the full range of Basic MIPS instructions or utilize all GPRs; however, some programs utilize most, if not all of the Basic MIPS instructions and some programs utilize all the GPRs.

2.3. Energy Used When Errors Are Injected

The energy computations are much simpler than the timing computations. The energy to complete a TMR or TSR MIPS program is the time to complete the TMR or TSR MIPS program multiplied by the TMR or TSR MIPS instantaneous power, respectively. The energy to complete AHR programs is the time spent in TMR mode multiplied by the TMR MIPS instantaneous power plus the time spent in TSR mode multiplied by the TSR MIPS instantaneous power. For a more detailed discussion concerning the calculations performed to determine the energy used to complete a program for each type of error, please refer to Appendix C.
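Expressed as code, the two energy relations are one line each. The function names and the numeric inputs below are placeholders chosen for illustration; they are not the measured runtimes or the power estimates from the previous work.

```python
def energy_single_mode(runtime_s, power_w):
    """Energy (J) for a program run entirely on TMR MIPS or TSR MIPS."""
    return runtime_s * power_w


def energy_ahr(t_tmr_s, t_tsr_s, p_tmr_w, p_tsr_w):
    """Energy (J) for an AHR run: each mode's time weighted by that mode's power."""
    return t_tmr_s * p_tmr_w + t_tsr_s * p_tsr_w


# Placeholder example: 10 ms in TMR mode at 3.0 W, then 30 ms in TSR mode at 1.0 W.
e_ahr = energy_ahr(t_tmr_s=0.010, t_tsr_s=0.030, p_tmr_w=3.0, p_tsr_w=1.0)
```

Because an injected error changes how long a program spends in each mode, its effect on energy enters only through the two time terms.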

3. Results

This section examines the results of software simulations and computational analysis when errors are injected as described in Section 2.2 and Section 2.3.
Figure 6 shows the average time and energy to complete 1000 programs for each error type in each architecture (including no error injection). This figure illustrates how AHR MIPS bridges the gap between TMR MIPS and TSR MIPS performance. The AHR MIPS TMR Type A and Type B-Best errors appear to fall on a line between the TSR MIPS Best-case error and TMR MIPS Type A and Type B-Best errors. A similar pattern appears for AHR MIPS TMR Type B-Worst and TSR Worst-case errors which appear to nearly fall on a line between the TMR Type B-Worst and TSR Worst-case errors. However, this figure does not tell the entire story as the best-case, worst-case, early, and late errors define maximum and minimum bounds for AHR MIPS performance in the presence of errors.
Figure 7 shows the performance bounds as bounding boxes when the TMR to TSR transition point occurs at 15,000 instructions. Note that the points plotted in this figure are the same as those plotted in Figure 6 and represent the average program completion times; however, the TMR Type B-Best, TSR Best-Case, AHR TMR Type A Early, AHR TMR Type B-Best Early, and AHR TSR Best-Case errors have been omitted from this plot. The TMR Type B-Best case error result was nearly identical to the TMR MIPS with no error result. The TSR MIPS Best-Case error result was nearly identical to the TSR MIPS with no error result. The AHR TMR Type A Early, AHR TMR Type B-Best Early, and AHR TSR Best-Case error results were nearly identical to the AHR MIPS with no error result. The bounding boxes indicate that the average program completion time should fall somewhere within the bounding box when errors are present.
The bounding box encompassing the TMR MIPS Type B-Best and Type B-Worst errors is shown as a dotted blue line. This box indicates that average program completion time and energy usage for a TMR MIPS program encountering a Type B error will end up within this box, and it nearly overlaps the second box indicated by a solid blue line. The second box is used to outline the average performance of TMR MIPS when errors may or may not be present. It includes the no error, TMR Type A, and TMR Type B errors.
The bounding box encompassing the TSR MIPS Best- and Worst-case errors is shown as a red dashed line with the red square and the red diamond at opposing corners to indicate the average Best- and Worst-case error program runtime and energy usage. A second box with a red solid line is used to outline the average performance of TSR MIPS when errors may or may not be present. It includes the no error, TSR Best-case, and TSR Worst-case errors. Once again, these boxes almost overlap one another.
The bounding box encompassing the AHR MIPS TMR Type A Early and Late errors is shown as a gray dashed line to indicate how a program will perform in terms of time and energy usage on average when a TMR Type A error occurs. Similarly, the purple dashed line indicates the average performance of a program experiencing a TMR Type B-Best case error. The orange dashed line indicates the average performance of a program experiencing a TMR Type B-Worst error; however, no such box is visible in this figure as the TMR Type B-Worst Early and Late errors are identical in this figure. A dark blue dashed line bounding box extends from the left most plus sign (+) to the right most “X” and from the AHR MIPS no error (green circle) to the top most “X” to show the average bounds of an AHR MIPS program that encounters any TMR Type B error. The dashed teal line indicates the bounds of an AHR MIPS TSR error. Finally, the solid green line indicates the average bounds for an AHR MIPS program experiencing any TMR error, TSR error, or no error.
Note that the portion of the bounding box extending to the left of the average AHR MIPS no error runtime and energy usage does not necessarily indicate that an error could occur such that the runtime would decrease without a change in energy usage. It should be expected that a decrease in runtime would correspond to a greater number of instructions being performed in TMR mode and a resulting energy increase; however, there is insufficient analysis at this time to determine a more precise boundary region and the creation of such a region is left for future work.
Now, because the bounding boxes indicate that the average time and energy used to complete a program in the presence of errors should fall within the boxes, they should not be treated as program specific bounding boxes. It would be trivial to create program specific bounding boxes, but these are not shown here for brevity. However, the next figures will begin to show the versatility of the AHR MIPS approach as the TMR to TSR transition point is varied.
Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14 and Figure 15 show what happens to the bounding boxes as the TMR to TSR transition point increases from 11,000, to 20,000, 30,000, 40,000, 50,000, 60,000, 70,000, and 80,000 instructions. These figures are shown to illustrate the flexibility of AHR MIPS as the TMR to TSR transition point is varied. A space vehicle designer, mission planner, or operator can change the TMR to TSR transition point depending on the program being run on the processor, radiation environment, processing speed requirements, power requirements, and other mission needs. For example, it might be desirable to operate in TMR in a low radiation environment for the purpose of maximizing processing speed when energy usage is not a concern. This is achieved by setting the TMR to TSR transition point to such a large value that the TMR to TSR transition never occurs. Alternatively, it might be desirable to operate in a low power mode regardless of the external radiation environment and its potential impact on registers other than GPRs. This is achieved by setting the TMR to TSR transition point to zero so that AHR switches from TMR back to TSR as quickly as possible after AHR suffers multiple errors in TSR mode.
As the transition point increases, the overall bounding box begins to grow, then shrinks down to match the TMR MIPS bounding box. Note that in some of these figures, the AHR MIPS TMR Type B-Worst Early and Late errors do not coincide. Note also how the sizes and shapes of the smaller bounding boxes change. As the TMR to TSR transition point increases, the AHR MIPS TMR Type B-Best bounding box increases in size, then decreases in size until it becomes nonexistent. The AHR MIPS TMR Type B-Worst bounding box increases in size from nonexistence, then decreases in size until becoming nonexistent again. The AHR MIPS TMR Type A box also increases then decreases in size until becoming nearly nonexistent. The AHR MIPS TSR box decreases in size until becoming nearly nonexistent.
While these figures represent the average performance for 1000 programs, they have greater utility when created for a specific program to show how the expected program runtime and energy usage change as the TMR to TSR transition point is changed. A satellite designer, mission planner, or operator could use these to determine the best transition point based on the needs of the system. For example, the TMR to TSR transition point could be selected in order to meet certain performance criteria such as staying under maximum runtime or energy constraints.
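The selection described above can be sketched as a simple constrained search. This is a minimal illustration under stated assumptions, not the authors' tooling: the function name `pick_transition_point`, the tuple layout, and every number in `example` are hypothetical and are not taken from the simulation results. Given runtime and energy estimates for one program at each candidate transition point, the sketch returns the candidate with the lowest energy that stays within both limits.

```python
from typing import Optional, Sequence, Tuple

# Each candidate is (transition_point_in_instructions, runtime, energy).
# Units and values are placeholders; real numbers would come from simulation.
Candidate = Tuple[int, float, float]


def pick_transition_point(candidates: Sequence[Candidate],
                          max_runtime: float,
                          max_energy: float) -> Optional[int]:
    """Return the transition point with the lowest energy among the
    candidates that satisfy both limits, or None if none qualifies."""
    feasible = [c for c in candidates
                if c[1] <= max_runtime and c[2] <= max_energy]
    if not feasible:
        return None
    best = min(feasible, key=lambda c: c[2])
    return best[0]


# Hypothetical worst-case estimates for one program (not measured data).
example = [
    (11_000, 9.0, 2.0),
    (40_000, 7.0, 3.5),
    (80_000, 5.0, 6.0),
]
choice = pick_transition_point(example, max_runtime=8.0, max_energy=5.0)
```

Feeding the search worst-case error estimates rather than no error estimates would keep the chosen point within limits even when errors occur; which error scenario to plan for remains a mission decision.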
Figure 16 shows the same information as Figure 6, but allows the TMR to TSR transition point to vary from 11,000 to 80,000 instructions in increments of 1000. This figure also provides a slightly different view of the bounding boxes in the previous figures. It is most useful in visualizing how the average program runtime and energy usage for each error scenario change as the TMR to TSR transition point changes. Curves for all AHR MIPS error scenarios, and the no error scenario, become evident. When only 11,000 instructions are completed in TMR before the TMR to TSR transition point, AHR MIPS behaves much more like TSR MIPS. As the transition point moves towards 80,000 instructions, the AHR MIPS results move up and to the left until they coincide with the TMR MIPS results. Note that the AHR MIPS TSR Best- and Worst-case scenarios collapse to the no error solution for TMR MIPS when very little, if any, time is spent in TSR MIPS because the TMR to TSR transition point is no longer reached during most programs. Similarly, the AHR MIPS TMR Type B-Worst scenarios converge to the TMR Type B-Worst error scenario.
Another interesting comparison is the average percent difference in runtime and energy usage for each error scenario and the no error scenario relative to Basic MIPS with no errors. The average percent differences for the no error scenarios were given in the previous work [1]. The average percent differences for the programs experiencing errors are given in Equations (1)–(26).
P D T i m e T M R E r r A v B a s i c = · · · n = 1 N p r o g r a m s T T M R E r r A ( n ) - T B a s i c M I P S ( n ) T B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D T i m e T M R E r r B B e s t v B a s i c = · · · n = 1 N p r o g r a m s T T M R E r r B B e s t ( n ) - T B a s i c M I P S ( n ) T B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D T i m e T M R E r r B W o r s t v B a s i c = · · · n = 1 N p r o g r a m s T T M R E r r B W o r s t ( n ) - T B a s i c M I P S ( n ) T B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D T i m e T S R B e s t v B a s i c = · · · n = 1 N p r o g r a m s T T S R B e s t ( n ) - T B a s i c M I P S ( n ) T B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D T i m e T S R W o r s t v B a s i c = · · · n = 1 N p r o g r a m s T T S R W o r s t ( n ) - T B a s i c M I P S ( n ) T B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D T i m e C T M R A E a r l y v B a s i c = · · · n = 1 N p r o g r a m s T C T M R A E a r l y ( n ) - T B a s i c M I P S ( n ) T B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D T i m e C T M R A L a t e v B a s i c = · · · n = 1 N p r o g r a m s T C T M R A L a t e ( n ) - T B a s i c M I P S ( n ) T B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D T i m e C T M R B B e s t E a r l y v B a s i c = · · · n = 1 N p r o g r a m s T C T M R B B e s t E a r l y ( n ) - T B a s i c M I P S ( n ) T B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D T i m e C T M R B B e s t L a t e v B a s i c = · · · n = 1 N p r o g r a m s T C T M R B B e s t L a t e ( n ) - T B a s i c M I P S ( n ) T B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D T i m e C T M R B W o r s t E a r l y v B a s i c = · · · n = 1 N p r o g r a m s T C T M R B W o r s t E a r l y ( n ) - T B a s i c M I P S ( n ) T B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D T i m e C T M R B W o r s t L a t e v B a s i c = · · · n = 1 N p r o g r a m s T C T M R B W o r s t L a t e ( n ) - T B a s i c M I P S ( n ) T B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D T i m e C T S R B e s t v B a s i c = · · · n = 1 N p r o g r a m s T C T S R B e s t ( n ) - T B a s i c M I P S ( n ) T B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D T i m e C T S R W o r s t v B a s i c = · · · n = 1 N p r o g r a m s T C T M R W o r s t ( n ) - T B a s i c M I P S ( n ) T B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D E n e r g y T M R E r r A v B a s i c = · · · n = 1 N p r o g r a m s E T M R E r r A ( n ) - E B a s i c M I P S ( n ) E B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D E n e r g y T M R E r r B B e s t v B a s i c = · · · n = 1 N p r o g r a m s E T M R E r r B B e s t ( n ) - E B a s i c M I P S ( n ) E B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D E n e r g y T M R E r r B W o r s t v B a s i c = · · · n = 1 N p r o g r a m s E T M R E r r B W o r s t ( n ) - E B a s i c M I P S ( n ) E B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D E n e r g y T S R B e s t v B a s i c = · · · n = 1 N p r o g r a m s E T S R B e s t ( n ) - E B a s i c M I P S ( n ) E B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D E n e r g y T S R W o r s t v B a s i c = · · · n = 1 N p r o g r a m s E T S R W o r s t ( n ) - E B a s i c M I P S ( n ) E B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D E n e r g y C T M R A E a r l y v B a s i c = · · · n = 1 N p r o g r a m s E C T M R A E a r l y ( n ) - E B a s i c M I P S ( n ) E B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D E n e r g y C T M R A L a t e v B a s i c = · · · n = 1 N p r o g r a m s E C T M R A L a t e ( n ) - E B a s i c M I P S ( n ) E B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D E n e r g y C T M R B B e s t E a r l y v B a s i c = · · · n = 1 N p r o g r a m s E C T M R B B e s t E a r l y ( n ) - E B a s i c M I P S ( n ) E B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D E n e r g y C T M R B B e s t L a t e v B a s i c = · · · n = 1 N p r o g r a m s E C T M R B B e s t L a t e ( n ) - E B a s i c M I P S ( n ) E B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D E n e r g y C T M R B W o r s t E a r l y v B a s i c = · · · n = 1 N p r o g r a m s E C T M R B W o r s t E a r l y ( n ) - E B a s i c M I P S ( n ) E B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D E n e r g y C T M R B W o r s t L a t e v B a s i c = · · · n = 1 N p r o g r a m s E C T M R B W o r s t L a t e ( n ) - E B a s i c M I P S ( n ) E B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D E n e r g y C T S R B e s t v B a s i c = · · · n = 1 N p r o g r a m s E C T S R B e s t ( n ) - E B a s i c M I P S ( n ) E B a s i c M I P S ( n ) × 100 % N p r o g r a m s
P D E n e r g y C T S R W o r s t v B a s i c = · · · n = 1 N p r o g r a m s E C T S R W o r s t ( n ) - E B a s i c M I P S ( n ) E B a s i c M I P S ( n ) × 100 % N p r o g r a m s
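All twenty-six equations share one form: the per-program percent difference from Basic MIPS, averaged over the programs. A minimal sketch of that computation follows. The function name `average_percent_difference` and the sample values are hypothetical and are not results from this work; the same function would apply to either runtime or energy inputs.

```python
from typing import Sequence


def average_percent_difference(scenario: Sequence[float],
                               basic: Sequence[float]) -> float:
    """Average over programs of (scenario - basic) / basic * 100.

    scenario[n] and basic[n] are the runtime (or energy) of program n
    under the error scenario and under Basic MIPS with no errors.
    """
    if len(scenario) != len(basic) or len(basic) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((s - b) / b * 100.0 for s, b in zip(scenario, basic))
    return total / len(basic)


# Hypothetical runtimes for two programs (placeholder values only).
t_basic = [100.0, 200.0]
t_scenario = [150.0, 220.0]
pd_time = average_percent_difference(t_scenario, t_basic)
```

Note that this averages the per-program percentages rather than taking the percent difference of the average runtimes, which matches the form of the equations: short programs and long programs carry equal weight.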
The average percent difference equations were used to calculate the average percent difference for all programs experiencing errors and no errors as the TMR to TSR transition point was varied from 11,000 to 80,000 instructions in increments of 1000. The results for the runtime calculations are shown in Figure 17 and the results for the energy usage calculations are shown in Figure 18. These figures highlight how AHR MIPS runtime and energy performance change relative to Basic MIPS as the TMR to TSR transition point changes. As in some of the previous figures, the TMR Type A and TMR Type B-Best error results are omitted because they are nearly identical to the TMR no error results. The same is true for the TSR Best error results because they are identical to the TSR no error results. Additionally, the AHR TMR Type A Early, AHR TMR Type B-Best Early, and AHR TSR Best error results have been omitted because they are nearly identical to the AHR no error results. The first thing to note from these figures is how the AHR MIPS no error, AHR TMR Type A, AHR TMR Type B-Best, AHR TSR Best, and AHR TSR Worst average percent differences approach the TMR average percent difference as the number of instructions before the TMR to TSR transition increases. This is consistent with prior results because AHR MIPS performance is nearly identical to TMR MIPS performance as the number of instructions that AHR MIPS processes in TSR mode approaches zero and nearly all instructions are processed in TMR mode. Additionally, the AHR MIPS TMR Type B-Worst average percent differences approach the TMR Type B-Worst average percent difference, which is also expected for the same reasons.
There are a few other things to note from the percent difference figures. The first is that programs experiencing an AHR TMR Type A Late error complete faster than programs experiencing an AHR TMR Type B-Best Late error. Both of these complete faster than AHR MIPS programs experiencing no error, AHR TMR Type A Early, AHR TMR Type B-Best Early, and AHR TSR Best-case errors. The no error, AHR TMR Type A Early, AHR TMR Type B-Best Early, and AHR TSR Best-case error scenarios all take less time to complete than programs experiencing AHR TMR Type B-Worst Early, AHR TMR Type B-Worst Late, and AHR TSR Worst-case errors. Programs experiencing AHR TMR Type B-Worst Late errors always complete faster than those experiencing AHR TMR Type B-Worst Early errors. AHR MIPS programs experiencing TSR Worst-case errors have the worst runtime when the TMR to TSR transition point is under about 30,000 instructions, but run faster than programs experiencing TMR Type B-Worst Early errors when the transition point is greater than 31,000 instructions and faster than programs experiencing TMR Type B-Worst Late errors when the transition point is greater than about 37,000 instructions.
AHR MIPS programs experiencing no error, TMR Type A Early, TMR Type B-Best Early, and TSR Best-case all take about the same amount of energy to complete and use less energy than an AHR MIPS program experiencing any other type of error. AHR MIPS programs experiencing TMR Type B-Worst Late errors use the most energy followed by programs experiencing TMR Type B-Worst Early errors, then TMR Type A Late errors, then TSR Worst-case errors.
One final thing to note is the set of jump discontinuities in the AHR MIPS TMR Type B-Best Late and AHR MIPS TMR Type B-Worst Late timing and energy percent differences. These are a direct result of the TMR to TSR transition point moving past one of the TMR save/restore point creation times, which occur every 10,000 instructions. Note that these discontinuities occur as the TMR to TSR transition point passes 20,000, 30,000, and 40,000 instructions. This is because these late errors go from having a minimal impact when the TMR to TSR transition point occurs immediately before a save/restore point creation to a maximum impact when the TMR to TSR transition point occurs immediately after a save/restore point creation.
Figure 19 and Figure 20 are essentially derivative plots of Figure 17 and Figure 18 except that they use the average time and energy results rather than the percent differences. Each point on these graphs represents the difference in average time and average energy to complete 1000 different programs with the given error type (or no error at all) from one AHR transition point value to the previous AHR transition point value, where these transition points started at 11,000 instructions, ended at 80,000 instructions, and had step sizes of 1000 instructions. Plots like these may help a mission planner determine the optimal point, in terms of processing speed and energy usage, at which to transition AHR from TMR mode to TSR mode.
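The point-to-point differences behind such plots amount to a first difference over the sweep of transition points. The sketch below is illustrative only: `step_differences` is a hypothetical helper name and `avg_time` holds placeholder numbers, not the averages plotted in the figures.

```python
from typing import List, Sequence


def step_differences(values: Sequence[float]) -> List[float]:
    """Difference between each value and the one before it.

    values[k] is the average time (or energy) at the k-th transition
    point; the result has one fewer entry than the input.
    """
    return [curr - prev for prev, curr in zip(values, values[1:])]


# Transition points swept in the text: 11,000 to 80,000 in steps of 1000.
transition_points = list(range(11_000, 81_000, 1_000))

# Placeholder averages for the first three transition points only.
avg_time = [100, 95, 92]
diffs = step_differences(avg_time)
```

Each entry of `diffs` pairs with the later of the two transition points it was computed from, so the first plotted point would correspond to 12,000 instructions.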

4. Discussion

AHR uses less energy than TMR and takes less time than TSR to complete programs when errors are injected. Additionally, changing the TMR to TSR transition point allows space vehicle designers, mission planners, and operators the flexibility to select operating points that meet mission processing speed and energy usage requirements not only under optimal error free conditions, but also in the worst-case error scenarios. This was demonstrated through simulation results shown in Section 3 where the time to complete programs varied between the TMR time to complete a program and the TSR time to complete a program as the TMR to TSR transition point was varied. This was also shown as the energy used to complete programs varied between the TMR energy used to complete a program and the TSR energy used to complete a program. The figures illustrated how a space vehicle designer, mission planner, or operator could choose a TMR to TSR transition point that meets the specific needs of their mission. As previously noted, if a mission needed to maximize processing speed at the expense of increased energy usage regardless of the external radiation environment, the TMR to TSR transition point could be set to such an arbitrarily large value that AHR always remains in TMR mode. In contrast, if a mission needs to minimize energy usage at the expense of slower processing speeds regardless of the radiation environment, the TMR to TSR transition point could be set to zero to ensure that AHR remains in TSR mode. For mission needs in between these two extremes, the TMR to TSR transition point could be set to a value that meets certain timing and energy performance criteria while accounting for the radiation environment and its impact on processing speed and energy usage. Additionally, the transition point can be program specific for a processor entrusted with running many different programs so that the transition point is optimized for each program. 
Furthermore, the transition point can be varied at any time over the course of the mission. It could even be changed during a single orbit to ensure an optimal value at all times when radiation levels and mission needs are taken into account.
Future work will implement TMR, TSR, and AHR on a Cyclone V FPGA to determine how they perform under error free and error injection conditions in terms of time and energy performance. This will be done in an effort to verify that this method works in application and not just in the realm of simulation and analysis.
Another area for future work is expanding AHR to include more redundancy methods such as dual modular redundancy, N-modular redundancy, and advanced TSR methods that can detect and correct program counter errors, which EDDI is unable to detect.
The views expressed in this paper are those of the authors, and do not reflect the official policy or position of the United States Air Force, Department of Defense, or the U.S. Government. This document has been approved for public release; distribution unlimited, Case #88ABW-2019-4400.

Author Contributions

Conceptualization, N.H.; methodology, N.H.; software, N.H.; validation, N.H., S.G., T.C. and A.B.; formal analysis, N.H.; investigation, N.H.; resources, N.H.; data curation, N.H. and J.P.; writing, original draft preparation, N.H.; writing, review and editing, N.H. and S.G.; visualization, N.H.; supervision, S.G.; project administration, S.G.; funding acquisition, T.C.

Funding

No sponsor funding was provided for this research. The authors are United States Government employees and are compensated by the government.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
AHR: Adaptive-Hybrid Redundancy
EDDI: Error Detection by Duplicated Instructions
FPGA: Field Programmable Gate Array
FSM: Finite State Machine
GPR: General Purpose Register
TMR: Triple Modular Redundancy
TSR: Temporal Software Redundancy
SET: Single Event Transient
SEU: Single Event Upset

Appendix A. Basic MIPS Datapath Error Injection Schematics

Errors are injected into Basic MIPS GPRs by adding hardware to the Basic MIPS Datapath. The Basic MIPS Datapath is shown in Figure A1. The modified Basic MIPS Datapath that enables error injection is shown in Figure A2 and changes are denoted in red. The Error_Inject module overrides the inputs to the GPR_Bank at a predetermined program counter value and loop count value when Basic MIPS is in state zero. The Error_Inject module injects an error at this predetermined time by inverting the value of a single predetermined bit of a predetermined register in the GPR_Bank.
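The behavior of the Error_Inject module can be modeled in software as a single-bit XOR that fires once the trigger conditions match. The sketch below is a behavioral illustration only, not the hardware description used in this work: the names `InjectionPlan` and `maybe_inject`, and the assumption of 32-bit registers, are introduced here for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InjectionPlan:
    """Predetermined trigger and target for one injected error."""
    pc: int          # program counter value at which to inject
    loop_count: int  # loop count value at which to inject
    reg_index: int   # which general purpose register to corrupt
    bit: int         # which bit of that register to invert


def maybe_inject(gprs: List[int], plan: InjectionPlan, pc: int,
                 loop_count: int, state: int, width: int = 32) -> bool:
    """Invert one bit of one register when the trigger matches.

    Injection happens only in state zero, mirroring the text.
    Returns True if an error was injected. The 32-bit width is an
    assumption of this sketch.
    """
    if state != 0 or pc != plan.pc or loop_count != plan.loop_count:
        return False
    if not 0 <= plan.bit < width:
        raise ValueError("bit position outside register width")
    mask = (1 << width) - 1
    gprs[plan.reg_index] = (gprs[plan.reg_index] ^ (1 << plan.bit)) & mask
    return True


registers = [0] * 32
plan = InjectionPlan(pc=8, loop_count=2, reg_index=5, bit=3)
injected = maybe_inject(registers, plan, pc=8, loop_count=2, state=0)
```

Because XOR is its own inverse, applying the same plan a second time restores the original register value, which can be convenient when checking a software model against the hardware.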
Figure A1. Basic MIPS Datapath Schematic.
Figure A2. Basic MIPS Datapath With Error Injection Schematic.

Appendix B. Error Injection Timing and Calculations

TMR errors are divided into single and multiple processor errors. A single processor error, called a TMR Type A error, recovers to the same instruction at which the initial error occurred as shown in Figure A3. A multiple processor error, called a TMR Type B error, occurs when all three processors disagree and all three processors are reset and restored to a previously saved state called a save/restore point. This is shown in Figure A4. In this figure, the acronym SRP denotes points in the TMR program execution at which a save/restore point is created. The TMR Type A error is one example of a best-case TMR error. A TMR Type B error that occurs immediately after creation of a save/restore point is another example of a best-case TMR error because it minimizes the number of instructions that must be recomputed to fully recover from the error. This is called a TMR Type B-Best error. The worst-case TMR error occurs when a Type B error occurs during save/restore point creation and maximizes the number of instructions to be recomputed to fully recover from the error. This is called a TMR Type B-Worst error. The Type B-Best and Type B-Worst errors are shown in Figure A5.
Figure A3. TMR MIPS Type A error timing diagram.
Figure A4. TMR MIPS Type B error timing diagram.
The runtime for a TMR MIPS Type A error is given in Equation (A1) where $T_{TMRMIPS}$ is the time for TMR MIPS to complete a program in the absence of an error from the previous work [1], $T_{TMRttdA}$ is the error detection time, $T_{TMRrecA}$ is the Type A recovery time, $T_{TMRretA}$ is the time to return to the instruction at which the error occurred, and $T_{TMRrepA}$ is the time required to repeat the instruction at which the error occurred. The last four of these values are determined from simulation.
$$T_{TMRErrA} = T_{TMRMIPS} + T_{TMRttdA} + T_{TMRrecA} + T_{TMRretA} + T_{TMRrepA} \tag{A1}$$
The runtime for a TMR MIPS Type B Best-case error is given in Equation (A2) where $T_{TMRttdB}$ is the error detection time, $T_{TMRrecB}$ is the Type B recovery time, and $T_{TMRretBBest}$ is the time to re-accomplish the instructions between the last completed save/restore point and the instruction at which the error occurred. The times to detect and recover from the error are determined from simulation, but the time to return to the point at which the error occurred is determined by analysis.
$$T_{TMRErrBBest} = T_{TMRMIPS} + T_{TMRttdB} + T_{TMRrecB} + T_{TMRretBBest} \tag{A2}$$
While there are many locations in a program where a TMR Type B-Best error may occur, the absolute Best-case error is the one that minimizes the number of instructions between the return from save/restore point creation and the store word instruction following it. In order to determine which pairing of save/restore point creation and store word instructions has the shortest distance between them, the loop count and instruction index of every save/restore point creation and store word instruction must be determined. The store word instruction indices are simply located by examining the program. Equation (A3) shows how to calculate where save/restore point creation occurs where $SI_{TMR}$ is a vector containing the instruction index in the TMR program where save/restore points are created, $SL_{TMR}$ is a vector containing the program loop count values where the save/restore points are created, and $ST_{TMR}$ is a vector containing the amount of time from the beginning of the program to the points at which save/restore points are created. $ST_{TMR}$ is not used now in calculating the TMR Type B-Best program completion time, but will be used shortly.
$$
\begin{aligned}
&\textbf{for } m = 0 \textbf{ to } n_{SRP} - 1 \\
&\quad \textbf{if } m = 0 \\
&\qquad SI_{TMR}(m+1) = 1 \\
&\qquad SL_{TMR}(m+1) = 0 \\
&\qquad ST_{TMR}(m+1) = 0 \\
&\quad \textbf{else} \\
&\qquad SI_{TMR}(m+1) = \mathrm{mod}(m \cdot n_{save} - n_{TMRinit},\ N_{TMR}) + n_{TMRinit} + 1 \\
&\qquad SL_{TMR}(m+1) = \left\lfloor \frac{m \cdot n_{save} - n_{TMRinit}}{n_{TMR}} \right\rfloor \\
&\qquad T_{add} = \sum_{n = n_{TMRinit} + 1}^{SI_{TMR}(m+1) - 1} t_{I_{TMR_n}} \\
&\qquad ST_{TMR}(m+1) = T_{TMRinit} + T_{TMRloop} \cdot SL_{TMR}(m+1) + T_{add} + (m - 1) \cdot T_{TMRSRP} \\
&\quad \textbf{end} \\
&\textbf{end}
\end{aligned}
\tag{A3}
$$
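Equation (A3) translates fairly directly into a loop. The sketch below is one possible reading, with several assumptions stated plainly: the loop count is taken as an integer (floor) division, the divisor in the loop count is assumed to be the same loop length used in the modulo, instruction times are passed as a 1-based list with an unused slot at index 0, and all parameter names and example values are hypothetical.

```python
from typing import List, Sequence, Tuple


def save_restore_points(n_srp: int, n_save: int, n_init: int,
                        loop_len: int, t_instr: Sequence[float],
                        t_init: float, t_loop: float, t_srp: float
                        ) -> Tuple[List[int], List[int], List[float]]:
    """Return (SI, SL, ST): instruction index, loop count, and elapsed
    time at each save/restore point, following the form of Equation (A3).

    t_instr[k] is the time of instruction k (1-based); t_instr[0] is
    unused padding so indices match the equation.
    """
    si: List[int] = []
    sl: List[int] = []
    st: List[float] = []
    for m in range(n_srp):
        if m == 0:
            si.append(1)
            sl.append(0)
            st.append(0.0)
            continue
        in_loop = m * n_save - n_init          # instructions run inside the loop
        idx = in_loop % loop_len + n_init + 1  # instruction index of this point
        loops = in_loop // loop_len            # full loops completed (assumed floor)
        # Time from the start of the current loop up to, not including, idx.
        t_add = sum(t_instr[n] for n in range(n_init + 1, idx))
        si.append(idx)
        sl.append(loops)
        st.append(t_init + t_loop * loops + t_add + (m - 1) * t_srp)
    return si, sl, st


# Toy program: 2 init instructions, a 4-instruction loop, unit instruction
# times, a save/restore point every 10 instructions (placeholder values).
si, sl, st = save_restore_points(
    n_srp=3, n_save=10, n_init=2, loop_len=4,
    t_instr=[0.0] + [1.0] * 6, t_init=2.0, t_loop=4.0, t_srp=5.0)
```

In the toy case the second save/restore point falls after exactly two full loops, so its elapsed time is the initialization time plus two loop times with no partial-loop term.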
Figure A5. TMR MIPS Type B Best- and Worst-Case error timing diagram.
The next step is to compute all possible differences ($SD_1$) between save/restore point indices and store word indices as shown in Equation (A4) where $SI_{TMR}^{T}$ is the transpose of $SI_{TMR}$. This formula states that $SD_1$ is a matrix of row vectors such that the $n$th row subtracts each value of $SI_{TMR}$ from the $n$th $SW_{TMR}$ value. Note that $SW_{TMR}$ is a vector containing the indices of every store word instruction in a program.
$$
\begin{aligned}
&\textbf{for } n = 1 \textbf{ to } \mathrm{length}(SW_{TMR}) \\
&\quad SD_1(n, :) = SW_{TMR}(n) - SI_{TMR}^{T} \\
&\textbf{end}
\end{aligned}
\tag{A4}
$$
Some of the values in $SD_1$ may be negative because a save/restore point may occur at the end of one loop while the next store word occurs at the beginning of the next loop. $SD_1$ is therefore modified so that all values are positive as shown in Equation (A5). In this equation, the “<” and “>” operators are logical operators that populate a matrix with ones or zeros depending on whether the individual matrix entries are less than or greater than the argument to the right of the operator. The term $SD_1 \,.\!\cdot\, (SD_1 > 0)$ creates a matrix with all the positive values of $SD_1$ and where all the negative values of $SD_1$ are set to zero (the “$.\!\cdot$” operator denotes element-wise multiplication). The term $SD_1 \,.\!\cdot\, (SD_1 < 0)$ creates a matrix with all the negative values of $SD_1$ and where all the positive values of $SD_1$ are set to zero. The term $N_{TMR} \cdot (SD_1 < 0)$ creates a matrix where all the negative values of $SD_1$ are replaced by $N_{TMR}$ and all the positive values of $SD_1$ are set to zero. The term $SD_1 \,.\!\cdot\, (SD_1 < 0) + N_{TMR} \cdot (SD_1 < 0)$ creates a matrix where all the negative values of $SD_1$ are replaced by the positive number of instructions from the save/restore point at the end of a loop to the store word instruction at the beginning of the next loop, which accounts for the fact that code execution jumped from the end of the loop back to the start of the loop. Finally, $SD_2$ contains all the positive instruction distances between save/restore points and the store words following them.
$$SD_2 = SD_1 \,.\!\cdot\, (SD_1 > 0) + \left( SD_1 \,.\!\cdot\, (SD_1 < 0) + N_{TMR} \cdot (SD_1 < 0) \right) \tag{A5}$$
The next step is to determine the minimum distance between a save/restore point creation and a store word instruction. Equation (A6) is used to calculate the minimum distance where $\min$ is a function that returns the minimum value of each column vector of a matrix in the row vector $a_1$ and returns the index of each minimum value in each column vector in the row vector $b_1$. For a vector, $\min$ returns the minimum value in $c_1$ and the index of the minimum value in $d_1$. The value $d_1$ is an index into the columns of $SD_2$ and tells which column contains the minimum distance between a save/restore point and a store word. The value $b_1(d_1)$ is an index into the rows of $SD_2$ and tells which row contains the minimum distance between a save/restore point and a store word. Because the columns of $SD_2$ correspond to the save/restore point indices $SI_{TMR}$ and the rows correspond to the store word indices $SW_{TMR}$, $SI_{TMR}(d_1)$ is the address of the instruction at which the save/restore point is created closest to the store word instruction at the address specified by $SW_{TMR}(b_1(d_1))$. In other words, this is the absolute shortest distance between the creation of a save/restore point and when an error could occur at a store word and constitutes the best-case multiple processor error for TMR MIPS.
$$
\begin{aligned}
[a_1, b_1] &= \min(SD_2) \\
[c_1, d_1] &= \min(a_1)
\end{aligned}
\tag{A6}
$$
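The difference matrix, the wrap-around correction, and the two-stage minimum in Equations (A4)–(A6) can be collapsed into a single search over all pairs. The sketch below is an equivalent formulation under stated assumptions rather than a transcription of the matrix operations: it uses plain nested loops, 0-based positions into the input sequences, and a hypothetical function name, and its tie-breaking order may differ from that of a matrix `min`.

```python
from typing import Sequence, Tuple


def closest_store_after_srp(sw: Sequence[int], si: Sequence[int],
                            loop_len: int) -> Tuple[int, int, int]:
    """Find the store word that most closely follows a save/restore point.

    sw holds store word instruction indices, si holds save/restore point
    instruction indices, loop_len is the number of instructions in the loop.
    Returns (distance, sw_pos, si_pos), where the positions are 0-based
    offsets into sw and si.
    """
    best = None
    for row, store_idx in enumerate(sw):
        for col, srp_idx in enumerate(si):
            d = store_idx - srp_idx   # one entry of SD1
            if d < 0:
                d += loop_len         # wrap to the next loop pass, as in SD2
            if best is None or d < best[0]:
                best = (d, row, col)
    if best is None:
        raise ValueError("sw and si must both be non-empty")
    return best


# Placeholder indices: stores at 5 and 12, save/restore points at 1 and 10,
# in a 20-instruction loop.
distance, sw_pos, si_pos = closest_store_after_srp([5, 12], [1, 10], loop_len=20)
```

Because the inputs are ordinary sequences, a program with a single store word or a single save/restore point is just a length-one sequence and needs no special handling in this formulation.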
The formula for determining $T_{TMRretBBest}$ is now presented in Equation (A7). The if-else statement is needed because the program is in a loop and $SW_{TMR}(b_1(d_1))$ could be less than $SI_{TMR}(d_1)$.
$$
\begin{aligned}
&\textbf{if } SW_{TMR}(b_1(d_1)) \geq SI_{TMR}(d_1) \\
&\quad T_{TMRretBBest} = \sum_{n = SI_{TMR}(d_1)}^{SW_{TMR}(b_1(d_1))} t_{I_{TMR_n}} \\
&\textbf{else} \\
&\quad T_{TMRretBBest} = \sum_{n = SI_{TMR}(d_1)}^{n_{TMRinit} + N_{TMR}} t_{I_{TMR_n}} + \sum_{n = n_{TMRinit} + 1}^{SW_{TMR}(b_1(d_1))} t_{I_{TMR_n}} \\
&\textbf{end}
\end{aligned}
\tag{A7}
$$
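The two branches of Equation (A7) can be sketched as a function that sums instruction times either directly or around the loop boundary. The comparison is assumed here to be "greater than or equal", since the surrounding text only says the else branch handles a store word index that is less than the save/restore point index. The function name, the 1-based padded list, and the example values are hypothetical.

```python
from typing import Sequence


def return_time_best(t_instr: Sequence[float], si_idx: int, sw_idx: int,
                     n_init: int, loop_len: int) -> float:
    """Time to re-execute from a save/restore point to a store word.

    t_instr[k] is the time of instruction k (1-based; index 0 unused).
    The loop body spans instructions n_init + 1 .. n_init + loop_len.
    """
    if sw_idx >= si_idx:  # assumed comparison; store follows in the same pass
        return sum(t_instr[n] for n in range(si_idx, sw_idx + 1))
    # Store word is earlier in the loop: finish this pass, then wrap around.
    to_loop_end = sum(t_instr[n] for n in range(si_idx, n_init + loop_len + 1))
    from_loop_start = sum(t_instr[n] for n in range(n_init + 1, sw_idx + 1))
    return to_loop_end + from_loop_start


# Toy program: 1 init instruction, 8 loop instructions, unit times.
times = [0.0] + [1.0] * 9
same_pass = return_time_best(times, si_idx=3, sw_idx=5, n_init=1, loop_len=8)
wrapped = return_time_best(times, si_idx=8, sw_idx=2, n_init=1, loop_len=8)
```

The same function covers the scalar cases: the caller simply passes the single store word index or the single save/restore point index directly.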
The definition of the $\min$ function in Equation (A6) presents an interesting situation when $SW_{TMR}$ or $SI_{TMR}$ is a scalar rather than a vector. In this situation, $SD_2$ will be a vector instead of a matrix and performing the operations in Equation (A6) will not provide usable results for proper indexing into $SW_{TMR}$ and $SI_{TMR}$ in Equation (A7). If $SW_{TMR}$ is a scalar, $b_1$ is used as the indexing variable into $SI_{TMR}$ and no index variable is used for $SW_{TMR}$ because it is a scalar. These adjustments are made to Equation (A7) as shown in Equation (A8). If $SI_{TMR}$ is a scalar, $b_1$ is used as the indexing variable into $SW_{TMR}$ and no index variable is used for $SI_{TMR}$ because it is a scalar. These adjustments are made to Equation (A7) as shown in Equation (A9).
$$
\begin{aligned}
&\textbf{if } SW_{TMR} \geq SI_{TMR}(b_1) \\
&\quad T_{TMRretBBest} = \sum_{n = SI_{TMR}(b_1)}^{SW_{TMR}} t_{I_{TMR_n}} \\
&\textbf{else} \\
&\quad T_{TMRretBBest} = \sum_{n = SI_{TMR}(b_1)}^{n_{TMRinit} + N_{TMR}} t_{I_{TMR_n}} + \sum_{n = n_{TMRinit} + 1}^{SW_{TMR}} t_{I_{TMR_n}} \\
&\textbf{end}
\end{aligned}
\tag{A8}
$$
$$
\begin{aligned}
&\textbf{if } SW_{TMR}(b_1) \geq SI_{TMR} \\
&\quad T_{TMRretBBest} = \sum_{n = SI_{TMR}}^{SW_{TMR}(b_1)} t_{I_{TMR_n}} \\
&\textbf{else} \\
&\quad T_{TMRretBBest} = \sum_{n = SI_{TMR}}^{n_{TMRinit} + N_{TMR}} t_{I_{TMR_n}} + \sum_{n = n_{TMRinit} + 1}^{SW_{TMR}(b_1)} t_{I_{TMR_n}} \\
&\textbf{end}
\end{aligned}
\tag{A9}
$$
The Type B-Worst error occurs at the end of creating a save/restore point so that the error is detected before successful save/restore point creation. The multiple bit error is injected when attempting to write the loop counter when creating the save/restore point such that a multiple processor error is detected and triggers recovery operations. In this scenario, 10,000 instructions and save/restore point creation must be repeated to return to the point in the program at which the error occurred. The Type B Worst-case error is shown in Figure A5.
The runtime for a TMR MIPS Type B Worst-case error is given in Equation (A10) where $T_{TMRSRPErr}$ is the time it takes TMR MIPS to encounter an error during creation of a save/restore point when the error occurs in the loop counter of multiple processors when attempting to save the loop counter to memory, $T_{TMRrecB}$ is the time to recover from a multiple processor error, and $T_{TMRretBWorst}$ is the time to return to the instruction at which the error occurred. The time $T_{TMRSRPErr}$ is determined from the simulation, but $T_{TMRretBWorst}$ is determined by analysis.
$$T_{TMRErrBWorst} = T_{TMRMIPS} + T_{TMRSRPErr} + T_{TMRrecB} + T_{TMRretBWorst} \tag{A10}$$
To compute the worst-case scenario time to return to the instruction at which the error occurred, the worst-case time between save points must first be determined according to Equation (A11) where $ST_{TMR}$ was previously defined in Equation (A3), $SL_{TMR}(m)$ is the number of full loops completed by the time the $m$th save/restore point creation is reached, $T_{add}$ is the time from the start of the loop in which the save/restore point is created to the instruction in that loop at which the save/restore point is created, $ST_{TMR}(m)$ is the time from the beginning of the program to the time at which the $m$th save/restore point creation begins, $SDT_{TMR}$ is the save time difference between consecutive save points, and $WSI_{TMR}$ is the index of the worst-case $SDT_{TMR}$. The value of $SDT_{TMR}$ is obtained by subtracting the first value of $ST_{TMR}$ from the second value, the second value from the third, and so on until the $(n_{SRP} - 1)$th value is subtracted from the $n_{SRP}$th value. The maximum value of $SDT_{TMR}$ is $T_{TMRretBWorst}$.
$$
\begin{aligned}
SDT_{TMR} &= ST_{TMR} - [0,\ ST_{TMR}(1 \text{ to } \mathrm{length}(ST_{TMR}) - 1)] \\
[T_{TMRretBWorst},\ WSI_{TMR}] &= \max(SDT_{TMR})
\end{aligned}
\tag{A11}
$$
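Equation (A11) is a shifted subtraction followed by a maximum. A minimal sketch follows; the function name and the example times are hypothetical, and the returned index is 0-based, so it is one less than the index a 1-based environment would report.

```python
from typing import List, Sequence, Tuple


def worst_case_return_time(st: Sequence[float]) -> Tuple[float, int]:
    """Largest gap between consecutive save/restore point times.

    st[k] is the elapsed time at which the k-th save/restore point is
    created. Returns (largest gap, 0-based index of the later point).
    """
    if len(st) == 0:
        raise ValueError("st must be non-empty")
    shifted: List[float] = [0.0] + list(st[:-1])   # [0, ST(1 .. end-1)]
    sdt = [cur - prev for cur, prev in zip(st, shifted)]
    worst = max(range(len(sdt)), key=sdt.__getitem__)
    return sdt[worst], worst


# Placeholder save/restore point times (not simulation output).
gap, where = worst_case_return_time([0.0, 100.0, 250.0, 330.0])
```

In the example the gaps are 0, 100, 150, and 80, so the worst case is the interval ending at the third save/restore point.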
TSR best-case and worst-case errors are similar to TMR Type B-Best and TMR Type B-Worst errors in that they occur immediately after save/restore point creation and during save/restore point creation, respectively. These errors are shown in Figure A6.
Figure A6. TSR MIPS Best- and Worst-Case error timing diagram.
The best-case scenario minimizes the number of instructions which must be executed after error recovery to return to the point in the program at which the error occurred. Therefore, the best-case error occurs immediately after creation of a save/restore point. The error is injected immediately prior to the branch comparison instruction before the first store word instruction after creating a save/restore point. The error is injected into one of the registers to be compared.
The TSR MIPS runtime with a Best-case error is computed using Equation (A12), where $T_{TSRMIPS}$ is the time to complete a program in the absence of an error from the previous work [1], $T_{TSRRec}$ is the time to perform error recovery operations and is determined from simulation results, and $T_{TSRret}$ is the time needed to return from the most recent save/restore point to the instruction at which the error occurred.
$$
T_{TSRBest} = T_{TSRMIPS} + T_{TSRRec} + T_{TSRret}
\tag{A12}
$$
The time to return to the instruction at which the error occurred is determined using Equation (A13), where $n_{TSRinit}$ is the number of instructions needed to initialize a TSR program (4 instructions) and $SW_{TSR}$ is a vector containing the instruction indices of all store word instructions in a TSR MIPS program.
$$
T_{TSRret} = \sum_{n=N_{TSR}-3}^{N_{TSR}} t_{I_{TSR},n} + \sum_{n=n_{TSRinit}+1}^{SW_{TSR}(1)} t_{I_{TSR},n}
\tag{A13}
$$
The TSR MIPS Worst-case scenario maximizes the number of instructions which must be executed after error recovery to return to the point in the program at which the error occurred. Therefore, the worst-case error occurs at the end of creating a save/restore point. This error would specifically target the loop counter, which is the last register to be written to the save/restore point during save/restore point creation. This error would force TSR MIPS to restore itself from the previous save/restore point and then proceed past the end of the next save/restore point creation, which means completing 250 program loops all over again. Additionally, the worst-case error will occur when creating the save/restore point in the second segment of save/restore point memory rather than the first segment because the second segment takes longer to create.
The TSR MIPS runtime with a Worst-case error is computed using Equation (A14), where $T_{TSRloop}$ is the time to complete a single TSR program loop defined in previous work [1] and $T_{TSRSRP1Err}$ is the time from the start of save/restore point creation to the time at which an error is detected in the difference between the loop counter and duplicate loop counter. The value of $T_{TSRSRP1Err}$ is determined from the simulation. The term $\sum_{n=N_{TSR}-3}^{N_{TSR}} t_{I_{TSR},n}$ is the time to complete the loop after performing error recovery and the term $250 \cdot T_{TSRloop}$ is the time to re-complete the 250 loops between save/restore points.
$$
T_{TSRWorst} = T_{TSRMIPS} + T_{TSRRec} + \sum_{n=N_{TSR}-3}^{N_{TSR}} t_{I_{TSR},n} + 250 \cdot T_{TSRloop} + T_{TSRSRP1Err}
\tag{A14}
$$
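As a rough illustration of Equations (A12) to (A14), the following Python sketch evaluates the return time and the best- and worst-case runtimes. It assumes that the per-instruction times are stored in program-listing order and indexed from 1 as in the equations; the function names and all numeric inputs are hypothetical and are not taken from the simulation results.

```python
def tsr_return_time(instr_times, n_tsr, n_init, first_sw):
    """Sketch of Equation (A13).

    instr_times[n - 1] is the assumed execution time of instruction n.
    Returns (time to finish the loop, time to reach the first store word).
    """
    t = lambda n: instr_times[n - 1]
    finish_loop = sum(t(n) for n in range(n_tsr - 3, n_tsr + 1))
    reach_store_word = sum(t(n) for n in range(n_init + 1, first_sw + 1))
    return finish_loop, reach_store_word


def tsr_best(t_mips, t_rec, t_ret):
    """Equation (A12): error-free time plus recovery plus return."""
    return t_mips + t_rec + t_ret


def tsr_worst(t_mips, t_rec, finish_loop, t_loop, t_srp1_err, loops_between_saves=250):
    """Equation (A14): recovery, loop completion, and a full save interval redone."""
    return t_mips + t_rec + finish_loop + loops_between_saves * t_loop + t_srp1_err


# Hypothetical numbers: 20 instructions of 1 time unit each, 4 init instructions
finish, reach = tsr_return_time([1] * 20, n_tsr=20, n_init=4, first_sw=9)
best = tsr_best(t_mips=1000, t_rec=50, t_ret=finish + reach)
worst = tsr_worst(t_mips=1000, t_rec=50, finish_loop=finish, t_loop=2, t_srp1_err=7)
print(best, worst)
```

The gap between the two results is dominated by the `loops_between_saves * t_loop` term, which is why the worst-case error is the one that forces a whole save interval to be repeated.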
AHR may experience a TMR or TSR error depending on whether AHR is operating in TMR or TSR mode. TMR errors are further subdivided into early and late errors depending on whether they occur near the beginning of the program or near the TMR to TSR transition point, respectively. Early errors have less impact on total program runtime and energy usage than late errors because early errors do not significantly affect the location of the TMR to TSR transition point. This is because the TMR to TSR transition point depends upon completing a predetermined number of instructions without error before transitioning AHR from TMR to TSR mode. Late errors have more impact on total program runtime and energy usage because they cause the TMR to TSR transition point to move towards the end of the program so that more of the program is executed in TMR mode. The result is that the program runs faster, but uses more energy, than if an early error or no error occurred. Errors encountered when AHR MIPS is operating in TSR mode are virtually identical to the errors encountered by TSR MIPS, but depend upon the point at which the TMR to TSR transition occurs within a program.
When AHR MIPS encounters a TMR Type A error, it handles the error the same way that TMR MIPS would. If the error occurs early in the program, such as at the first store word instruction in the program, the TMR to TSR transition point is only moved by a few instructions as shown in Figure A7. This is referred to as a Type A Early error and it has a minimal impact on the program runtime.
Figure A7. AHR MIPS TMR Type A Early error timing diagram.
Equation (A15) shows how to calculate the AHR MIPS Type A Early error timing, where $P_{loopsTMRAEarly}$ is the new transition point loop count determined according to Equation (A16) and $n_{CSRPAEarly}$ is the number of save/restore points to create prior to the transition, determined by Equation (A17). The TMR to TSR transition point determines how many save/restore points are created in TMR and TSR mode. The number of TMR mode save/restore points is given by $n_{CSRPAEarly}$, but the number of TSR mode save/restore points depends on where the transition occurs relative to the creation points for the TSR mode save/restore points, which only occur at 250, 500, and 750 loops. This is the rationale for the if-else statements in these equations. There is also a possibility that the Type A error may push the TMR to TSR transition point out past the end of the program, in which case AHR MIPS never enters TSR mode. Note also that $t_{CTMRAETMR}$ and $t_{CTMRAETSR}$ are the times AHR MIPS spends in TMR and TSR mode, respectively, when encountering a TMR Type A Early error. The times spent in TMR and TSR modes are kept separate to make the energy calculations simpler.
$$
\begin{array}{l}
t_{nomAE} = t_{TMRinit} + P_{loopsTMRAEarly} \cdot T_{TMRloop} + T_{TMRSRP} \cdot n_{CSRPAEarly} \\
t_{errAE} = T_{TMRttdA} + T_{TMRrecA} + T_{TMRretA} + T_{TMRrepA} \\
\mathrm{if}\ P_{loopsTMRAEarly} < 250 \\
\quad t_{CTMRAETMR} = t_{nomAE} + t_{errAE} + t_{TMRTSR} \\
\quad t_{CTMRAETSR} = (n_{loops} - P_{loopsTMRAEarly}) \cdot t_{TSRloop} + T_{TSRSRP0} + 2 \cdot T_{TSRSRP1} + T_{TSRconc} - T_{TSRskip} \\
\mathrm{elseif}\ 250 \le P_{loopsTMRAEarly} < 500 \\
\quad t_{CTMRAETMR} = t_{nomAE} + t_{errAE} + t_{TMRTSR} \\
\quad t_{CTMRAETSR} = (n_{loops} - P_{loopsTMRAEarly}) \cdot t_{TSRloop} + T_{TSRSRP0} + T_{TSRSRP1} + T_{TSRconc} - T_{TSRskip} \\
\mathrm{elseif}\ 500 \le P_{loopsTMRAEarly} < 750 \\
\quad t_{CTMRAETMR} = t_{nomAE} + t_{errAE} + t_{TMRTSR} \\
\quad t_{CTMRAETSR} = (n_{loops} - P_{loopsTMRAEarly}) \cdot t_{TSRloop} + T_{TSRSRP1} + T_{TSRconc} - \tfrac{2}{3}\, T_{TSRskip} \\
\mathrm{elseif}\ 750 \le P_{loopsTMRAEarly} < n_{loops} \\
\quad t_{CTMRAETMR} = t_{nomAE} + t_{errAE} + t_{TMRTSR} \\
\quad t_{CTMRAETSR} = (n_{loops} - P_{loopsTMRAEarly}) \cdot t_{TSRloop} + T_{TSRconc} \\
\mathrm{elseif}\ P_{loopsTMRAEarly} \ge n_{loops} \\
\quad t_{CTMRAETMR} = t_{TMRinit} + n_{loops} \cdot T_{TMRloop} + T_{TMRSRP} \cdot (n_{SRP} - 1) + t_{errAE} \\
\quad t_{CTMRAETSR} = 0 \\
\mathrm{end} \\
T_{CTMRAEarly} = t_{CTMRAETMR} + t_{CTMRAETSR}
\end{array}
\tag{A15}
$$
$$
P_{loopsTMRAEarly} = \frac{SW_{TMR}(1) + n_{transition} - n_{TMR\_init}}{N_{TMR}}
\tag{A16}
$$
$$
n_{CSRPAEarly} = \frac{P_{loopsTMRAEarly} \cdot N_{TMR} + n_{TMR\_init}}{n_{save}}
\tag{A17}
$$
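The branch structure of Equation (A15), together with the ratios in Equations (A16) and (A17), might be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `TsrTimes` container and the function names are hypothetical, the thresholds 250, 500, and 750 are hard-coded as in the text, and the ratios are returned as plain quotients because any rounding applied in the original formulas is not specified here.

```python
from dataclasses import dataclass


@dataclass
class TsrTimes:
    loop: float  # t_TSRloop
    srp0: float  # T_TSRSRP0
    srp1: float  # T_TSRSRP1
    conc: float  # T_TSRconc
    skip: float  # T_TSRskip


def tsr_mode_time(p_loops, n_loops, t):
    """TSR-mode runtime after a TMR to TSR transition at loop p_loops."""
    if p_loops >= n_loops:
        return 0.0  # transition pushed past the end: TSR mode is never entered
    remaining = (n_loops - p_loops) * t.loop
    if p_loops < 250:
        return remaining + t.srp0 + 2 * t.srp1 + t.conc - t.skip
    if p_loops < 500:
        return remaining + t.srp0 + t.srp1 + t.conc - t.skip
    if p_loops < 750:
        return remaining + t.srp1 + t.conc - (2 / 3) * t.skip
    return remaining + t.conc


def transition_loop(sw_index, n_transition, n_init, n_per_loop):
    """Quotient of Equation (A16); rounding is left to the caller."""
    return (sw_index + n_transition - n_init) / n_per_loop


def save_points_before_transition(p_loops, n_per_loop, n_init, n_save):
    """Quotient of Equation (A17); rounding is left to the caller."""
    return (p_loops * n_per_loop + n_init) / n_save


# Hypothetical timing values, for illustration only
times = TsrTimes(loop=2.0, srp0=10.0, srp1=12.0, conc=5.0, skip=3.0)
for p in (100, 300, 600, 800, 1000):
    print(p, tsr_mode_time(p, 1000, times))
```

Keeping the TSR-mode time in its own function mirrors the separation of TMR and TSR times in the equations, which is what makes the later energy calculation simpler.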
If the TMR Type A error occurs late in the program, such as at the last store word instruction before the TMR to TSR transition, the TMR to TSR transition is moved by nearly 15,000 instructions past the point at which it would have occurred if there were no error. This is shown in Figure A8. This is referred to as a Type A Late error and it causes the program to execute more instructions in TMR MIPS and fewer instructions in TSR MIPS than if no error had occurred. The expected effect is a significantly shorter runtime and increased energy usage.
Figure A8. AHR MIPS TMR Type A Late error timing diagram.
Equation (A18) shows how to calculate the AHR MIPS Type A Late error timing, where $P_{loopsTMRALate}$ is the new transition point loop count determined according to Equation (A19) and $n_{CSRPALate}$ is the number of save/restore points to create prior to the transition, determined by Equation (A20). Note also that $t_{CTMRALTMR}$ and $t_{CTMRALTSR}$ are the times AHR MIPS spends in TMR and TSR mode, respectively, when encountering a TMR Type A Late error.
$$
\begin{array}{l}
t_{nomAL} = t_{TMRinit} + P_{loopsTMRALate} \cdot T_{TMRloop} + T_{TMRSRP} \cdot n_{CSRPALate} \\
t_{errAL} = T_{TMRttdA} + T_{TMRrecA} + T_{TMRretA} + T_{TMRrepA} \\
\mathrm{if}\ P_{loopsTMRALate} < 250 \\
\quad t_{CTMRALTMR} = t_{nomAL} + t_{errAL} + t_{TMRTSR} \\
\quad t_{CTMRALTSR} = (n_{loops} - P_{loopsTMRALate}) \cdot t_{TSRloop} + T_{TSRSRP0} + 2 \cdot T_{TSRSRP1} + T_{TSRconc} - T_{TSRskip} \\
\mathrm{elseif}\ 250 \le P_{loopsTMRALate} < 500 \\
\quad t_{CTMRALTMR} = t_{nomAL} + t_{errAL} + t_{TMRTSR} \\
\quad t_{CTMRALTSR} = (n_{loops} - P_{loopsTMRALate}) \cdot t_{TSRloop} + T_{TSRSRP0} + T_{TSRSRP1} + T_{TSRconc} - T_{TSRskip} \\
\mathrm{elseif}\ 500 \le P_{loopsTMRALate} < 750 \\
\quad t_{CTMRALTMR} = t_{nomAL} + t_{errAL} + t_{TMRTSR} \\
\quad t_{CTMRALTSR} = (n_{loops} - P_{loopsTMRALate}) \cdot t_{TSRloop} + T_{TSRSRP1} + T_{TSRconc} - \tfrac{2}{3}\, T_{TSRskip} \\
\mathrm{elseif}\ 750 \le P_{loopsTMRALate} < n_{loops} \\
\quad t_{CTMRALTMR} = t_{nomAL} + t_{errAL} + t_{TMRTSR} \\
\quad t_{CTMRALTSR} = (n_{loops} - P_{loopsTMRALate}) \cdot t_{TSRloop} + T_{TSRconc} \\
\mathrm{elseif}\ P_{loopsTMRALate} \ge n_{loops} \\
\quad t_{CTMRALTMR} = t_{TMRinit} + n_{loops} \cdot T_{TMRloop} + T_{TMRSRP} \cdot (n_{SRP} - 1) + t_{errAL} \\
\quad t_{CTMRALTSR} = 0 \\
\mathrm{end} \\
T_{CTMRALate} = t_{CTMRALTMR} + t_{CTMRALTSR}
\end{array}
\tag{A18}
$$
$$
P_{loopsTMRALate} = \frac{SW_{TMR}(\mathrm{length}(SW_{TMR})) + n_{transition} - n_{TMR\_init}}{N_{TMR}}
\tag{A19}
$$
$$
n_{CSRPALate} = \frac{P_{loopsTMRALate} \cdot N_{TMR} + n_{TMR\_init}}{n_{save}}
\tag{A20}
$$
AHR MIPS may also encounter TMR MIPS Type B-Best errors early or late and these are referred to as TMR Type B-Best Early and TMR Type B-Best Late errors. As with the TMR Type A Early error, the TMR Type B-Best Early error has a minimal impact on runtime. As with the TMR Type A Late error, the TMR Type B-Best Late error is expected to significantly decrease runtime and increase energy usage. The TMR Type B-Best Early error is shown in Figure A9 while the TMR Type B-Best Late error is shown in Figure A10.
Figure A9. AHR MIPS TMR Type B Best-Case Early error timing diagram.
Figure A10. AHR MIPS TMR Type B Best-Case Late error timing diagram.
In order to determine the AHR MIPS runtime for programs with Type B-Best Early and Late errors, some of the variables used in computing TMR MIPS runtime for programs with Type B-Best errors need to be modified. The variable $SI_{TMR}$ needs to be modified so it only contains instruction indices of save/restore points that occur before the original TMR to TSR transition was expected to take place. The values of $SL_{TMR}$ and $ST_{TMR}$ must also be updated. These updates are illustrated in Equation (A21), where the expression $SL_{TMR}(SL_{TMR} < P_{loops})$ returns the vector of $SL_{TMR}$ values that are less than the TMR to TSR transition point; all other values of the original $SL_{TMR}$ vector are excluded.
$$
\begin{array}{l}
SL_{CTMR} = SL_{TMR}(SL_{TMR} < P_{loops}) \\
SI_{CTMR} = SI_{TMR}(1\ \mathrm{to}\ \mathrm{length}(SL_{TMR})) \\
ST_{CTMR} = ST_{TMR}(1\ \mathrm{to}\ \mathrm{length}(SL_{TMR}))
\end{array}
\tag{A21}
$$
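A small Python sketch of the restriction described for Equation (A21) is given below. It assumes that the loop counts are stored in increasing order, so the entries below the transition point form a prefix, and it truncates the index and time vectors to the length of the filtered loop-count vector, which is how the surrounding prose describes the update. The function name and sample vectors are hypothetical.

```python
def restrict_to_tmr_phase(sl_tmr, si_tmr, st_tmr, p_loops):
    """Keep only save/restore points reached before the TMR to TSR transition.

    Assumes sl_tmr is sorted in increasing order so the kept entries
    are the leading entries of all three vectors.
    """
    sl_c = [loops for loops in sl_tmr if loops < p_loops]
    keep = len(sl_c)
    return sl_c, si_tmr[:keep], st_tmr[:keep]


# Hypothetical vectors: loop counts, instruction indices, and start times
sl_c, si_c, st_c = restrict_to_tmr_phase(
    sl_tmr=[0, 150, 300, 450],
    si_tmr=[1, 7, 7, 7],
    st_tmr=[0.0, 30.0, 60.0, 90.0],
    p_loops=320,
)
print(sl_c, si_c, st_c)
```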
Next, all possible differences between save/restore point indices and store word indices are calculated according to Equation (A22). Note that this differs from Equation (A4) because this formula must account for the fact that an error cannot be allowed to occur after the TMR to TSR transition, or it would be a TSR error rather than a TMR Type B error.
$$
\begin{array}{l}
\mathrm{for}\ n = 1\ \mathrm{to}\ \mathrm{length}(SW_{TMR}) \\
\quad SD_3(n, :) = SW_{TMR}(n) - SI_{TMR}^{T} \\
\quad \mathrm{if}\ SL_{CTMR}(\mathrm{length}(SL_{CTMR})) = P_{loops} \\
\quad\quad \mathrm{if}\ SD_3(n, \mathrm{length}(SL_{CTMR})) < 0 \\
\quad\quad\quad SD_3(n, \mathrm{length}(SL_{CTMR})) = 10^{6} \\
\quad\quad \mathrm{end} \\
\quad \mathrm{end} \\
\mathrm{end}
\end{array}
\tag{A22}
$$
Then, just as Equation (A5) made all values of $SD_1$ positive, Equation (A23) makes all values of $SD_3$ positive as well.
$$
SD_4 = SD_3 \mathbin{.\!\cdot} (SD_3 > 0) + \left( SD_3 \mathbin{.\!\cdot} (SD_3 < 0) + N_{TMR} \cdot (SD_3 < 0) \right)
\tag{A23}
$$
The next step is to determine which store word indices minimize the difference between each store word and save index. This is computed in Equation (A24). Note that this differs from Equation (A6) because there is no second step to determine the absolute minimum distance, since both the early and late scenarios for a Type B-Best error are needed. The absolute minimum of $SD_4$ might not minimize or maximize the number of instructions computed in TMR mode. Instead, each possible combination of minimum distance from a save index to a store word index is evaluated for total program completion time. The completion times of all scenarios are then compared to determine which is slowest (Early) and which is fastest (Late).
$$
[\,a_2,\ b_2\,] = \min(SD_4)
\tag{A24}
$$
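The difference matrix, the wrap to positive distances, and the column-wise minimum of Equations (A22) to (A24) can be sketched in Python as follows. This is an illustrative sketch only: the function names and index values are hypothetical, the $10^{6}$ sentinel of Equation (A22) is omitted for brevity, and the column-wise minimum with a 1-based row index is written out by hand to mimic the two-output `min` used in the equations.

```python
def difference_matrix(store_idx, save_idx):
    """Rows follow store word indices, columns follow save indices."""
    return [[sw - si for si in save_idx] for sw in store_idx]


def wrap_positive(sd3, n_per_loop):
    """Equation (A23): negative distances wrap around by one loop length."""
    return [[d if d >= 0 else d + n_per_loop for d in row] for row in sd3]


def column_minima(matrix):
    """Return (minimum of each column, 1-based row of the first minimum)."""
    values, rows = [], []
    for col in zip(*matrix):
        best = min(col)
        values.append(best)
        rows.append(col.index(best) + 1)
    return values, rows


# Hypothetical indices: two store words, three save points, 10 instructions per loop
sd3 = difference_matrix(store_idx=[6, 12], save_idx=[4, 9, 14])
sd4 = wrap_positive(sd3, n_per_loop=10)
a2, b2 = column_minima(sd4)
print(sd4, a2, b2)
```

Each entry of `b2` names the store word that follows the corresponding save index most closely, which is the pairing that the subsequent completion-time calculation iterates over.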
Equations (A25)–(A27) show how to compute the time to complete each program for each possible combination of minimum distance from a save index to a store word index. (Equation (A27) is a continuation of Equation (A26) because the entire equation could not fit on one page.) Equation (A26) (and Equation (A27)) also shows that the Type B-Best Early solution is the maximum of these times and the Type B-Best Late solution is the minimum of these times. The $Flag$ variable is used to keep track of whether a particular combination of save index and store word index is allowed. The flag is 1 if the combination is not allowed because the store word following the save index would occur after the TMR to TSR transition. The variable $P_{loopsTMRBBest}(n)$ is the new TMR to TSR transition point based on the error location for the $n$th save index. The variable $n_{CSRPBBest}(n)$ is the new number of save/restore points to create for the $n$th save index. The variable $T_{add}(n)$ is the amount of time required to return from the save index to the store word index at which the error occurred for the $n$th save index. The time to complete the TMR portion of the program for the $n$th save index is $t_{CTMRBBTMR}(n)$. The time to complete the TSR portion of the program for the $n$th save index is $t_{CTMRBBTSR}(n)$. The value $NaN$ is assigned to $t_{CTMRBBTMR}(n)$ and $t_{CTMRBBTSR}(n)$ when $Flag = 1$ because the $\max$ and $\min$ functions ignore $NaN$ values and return only numerical values. Finally, $t_{CTMRBBETMR}$ and $t_{CTMRBBETSR}$ are the times AHR MIPS spends in TMR and TSR mode when a TMR Type B-Best Early error is encountered. Similarly, $t_{CTMRBBLTMR}$ and $t_{CTMRBBLTSR}$ are the times AHR MIPS spends in TMR and TSR mode when a TMR Type B-Best Late error is encountered.
$$
\begin{array}{l}
\mathrm{for}\ n = 1\ \mathrm{to}\ \mathrm{length}(b_2) \\
\quad Flag = 0 \\
\quad \mathrm{if}\ SI_{CTMR} = 1 \\
\quad\quad P_{loopsTMRBBest}(n) = P_{loops} \\
\quad \mathrm{else} \\
\quad\quad P_{loopsTMRBBest}(n) = \dfrac{SI_{CTMR}(n) + SL_{CTMR} \cdot N_{TMR} + n_{transition} - n_{TMR\_init}}{N_{TMR}} \\
\quad \mathrm{end} \\
\quad n_{CSRPBBest}(n) = \dfrac{P_{loopsTMRBBest}(n) \cdot N_{TMR} + n_{TMR\_init}}{n_{save}} \\
\quad \mathrm{if}\ SI_{CTMR}(n) \le SW_{CTMR}(b_2(n)) - 1 \\
\quad\quad T_{add}(n) = \sum_{m=SI_{CTMR}(n)}^{SW_{CTMR}(b_2(n)) - 1} t_{I_{TMR},m} \\
\quad \mathrm{elseif}\ SL_{CTMR}(n) < P_{loopsTMRBBest}(n) \\
\quad\quad T_{add}(n) = \sum_{m=SI_{CTMR}(n)}^{n_{TMR\_init} + N_{TMR}} t_{I_{TMR},m} + \sum_{m=n_{TMR\_init}+1}^{SW_{CTMR}(b_2(n)) - 1} t_{I_{TMR},m} \\
\quad \mathrm{else} \\
\quad\quad T_{add}(n) = 0 \\
\quad\quad Flag = 1 \\
\quad \mathrm{end} \\
\mathrm{end}
\end{array}
\tag{A25}
$$
$$
\begin{array}{l}
\mathrm{for}\ n = 1\ \mathrm{to}\ \mathrm{length}(b_2) \\
\quad \mathrm{if}\ Flag = 1 \\
\quad\quad t_{CTMRBBTMR}(n) = NaN \\
\quad\quad t_{CTMRBBTSR}(n) = NaN \\
\quad \mathrm{else} \\
\quad\quad t_{nomBB} = t_{TMRinit} + P_{loopsTMRBBest}(n) \cdot T_{TMRloop} + T_{TMRSRP} \cdot n_{CSRPBBest}(n) \\
\quad\quad t_{errBB} = T_{TMRttdB} + T_{TMRrecB} + T_{add}(n) \\
\quad\quad \mathrm{if}\ P_{loopsTMRBBest}(n) < 250 \\
\quad\quad\quad t_{CTMRBBTMR}(n) = t_{nomBB} + t_{errBB} + t_{TMRTSR} \\
\quad\quad\quad t_{CTMRBBTSR}(n) = (n_{loops} - P_{loopsTMRBBest}(n)) \cdot t_{TSRloop} + T_{TSRSRP0} + 2 \cdot T_{TSRSRP1} + T_{TSRconc} - T_{TSRskip} \\
\quad\quad \mathrm{elseif}\ 250 \le P_{loopsTMRBBest}(n) < 500 \\
\quad\quad\quad t_{CTMRBBTMR}(n) = t_{nomBB} + t_{errBB} + t_{TMRTSR} \\
\quad\quad\quad t_{CTMRBBTSR}(n) = (n_{loops} - P_{loopsTMRBBest}(n)) \cdot t_{TSRloop} + T_{TSRSRP0} + T_{TSRSRP1} + T_{TSRconc} - T_{TSRskip} \\
\quad\quad \mathrm{elseif}\ 500 \le P_{loopsTMRBBest}(n) < 750 \\
\quad\quad\quad t_{CTMRBBTMR}(n) = t_{nomBB} + t_{errBB} + t_{TMRTSR} \\
\quad\quad\quad t_{CTMRBBTSR}(n) = (n_{loops} - P_{loopsTMRBBest}(n)) \cdot t_{TSRloop} + T_{TSRSRP1} + T_{TSRconc} - \tfrac{2}{3}\, T_{TSRskip}
\end{array}
\tag{A26}
$$
$$
\begin{array}{l}
\quad\quad \mathrm{elseif}\ 750 \le P_{loopsTMRBBest}(n) < n_{loops} \\
\quad\quad\quad t_{CTMRBBTMR}(n) = t_{nomBB} + t_{errBB} + t_{TMRTSR} \\
\quad\quad\quad t_{CTMRBBTSR}(n) = (n_{loops} - P_{loopsTMRBBest}(n)) \cdot t_{TSRloop} + T_{TSRconc} \\
\quad\quad \mathrm{elseif}\ P_{loopsTMRBBest}(n) \ge n_{loops} \\
\quad\quad\quad t_{CTMRBBTMR}(n) = t_{TMRinit} + n_{loops} \cdot T_{TMRloop} + T_{TMRSRP} \cdot (n_{SRP} - 1) + t_{errBB} \\
\quad\quad\quad t_{CTMRBBTSR}(n) = 0 \\
\quad\quad \mathrm{end} \\
\quad \mathrm{end} \\
\mathrm{end} \\
[\,T_{CTMRBBestEarly},\ b_3\,] = \max(t_{CTMRBBTMR} + t_{CTMRBBTSR}) \\
t_{CTMRBBETMR} = t_{CTMRBBTMR}(b_3) \\
t_{CTMRBBETSR} = t_{CTMRBBTSR}(b_3) \\
[\,T_{CTMRBBestLate},\ b_4\,] = \min(t_{CTMRBBTMR} + t_{CTMRBBTSR}) \\
t_{CTMRBBLTMR} = t_{CTMRBBTMR}(b_4) \\
t_{CTMRBBLTSR} = t_{CTMRBBTSR}(b_4)
\end{array}
\tag{A27}
$$
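The final selection step of Equations (A26) and (A27), in which disallowed combinations are marked with NaN and then skipped when the slowest (Early) and fastest (Late) scenarios are chosen, could be sketched in Python as shown below. Python's built-in `max` and `min` do not skip NaN values, so this sketch filters them explicitly; the function name and the sample times are hypothetical.

```python
import math


def pick_early_late(t_tmr, t_tsr):
    """Select the slowest (Early) and fastest (Late) allowed scenarios.

    t_tmr[i] and t_tsr[i] are the TMR-mode and TSR-mode times of
    scenario i, with NaN marking a combination that is not allowed.
    Each result is (total time, TMR-mode time, TSR-mode time).
    """
    totals = [a + b for a, b in zip(t_tmr, t_tsr)]
    valid = [i for i, total in enumerate(totals) if not math.isnan(total)]
    if not valid:
        raise ValueError("no allowed save index and store word combination")
    early = max(valid, key=lambda i: totals[i])
    late = min(valid, key=lambda i: totals[i])
    return (
        (totals[early], t_tmr[early], t_tsr[early]),
        (totals[late], t_tmr[late], t_tsr[late]),
    )


# Hypothetical scenario times; the second combination is disallowed
nan = float("nan")
early_result, late_result = pick_early_late(
    t_tmr=[100.0, nan, 140.0],
    t_tsr=[900.0, nan, 700.0],
)
print(early_result, late_result)
```

Returning the TMR-mode and TSR-mode times alongside the total keeps the two contributions available for a separate energy calculation, as the equations do.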
Remembering that the $\min$ function used in Equation (A24) is defined and used in the same manner as in Equation (A6), the same problem with $SD_2$ possibly being a vector arises for $SD_4$ as well. This affects the indices used in Equations (A25) and (A26). If $SW_{TMR}$ is a scalar, Equation (A25) is rewritten in Equation (A28). If $SI_{CTMR}$ is a scalar, these equations are rewritten in Equations (A29)–(A31), where Equation (A31) is a continuation of Equation (A30).
$$
\begin{array}{l}
\mathrm{for}\ n = 1\ \mathrm{to}\ \mathrm{length}(b_2) \\
\quad Flag = 0 \\
\quad \mathrm{if}\ SI_{CTMR} = 1 \\
\quad\quad P_{loopsTMRBBest}(n) = P_{loops} \\
\quad \mathrm{else} \\
\quad\quad P_{loopsTMRBBest}(n) = \dfrac{SI_{CTMR}(n) + SL_{CTMR} \cdot N_{TMR} + n_{transition} - n_{TMR\_init}}{N_{TMR}} \\
\quad \mathrm{end} \\
\quad n_{CSRPBBest}(n) = \dfrac{P_{loopsTMRBBest}(n) \cdot N_{TMR} + n_{TMR\_init}}{n_{save}} \\
\quad \mathrm{if}\ SI_{CTMR}(n) \le SW_{CTMR} - 1 \\
\quad\quad T_{add}(n) = \sum_{m=SI_{CTMR}(n)}^{SW_{CTMR} - 1} t_{I_{TMR},m} \\
\quad \mathrm{elseif}\ SL_{CTMR}(n) < P_{loopsTMRBBest}(n) \\
\quad\quad T_{add}(n) = \sum_{m=SI_{CTMR}(n)}^{n_{TMR\_init} + N_{TMR}} t_{I_{TMR},m} + \sum_{m=n_{TMR\_init}+1}^{SW_{CTMR} - 1} t_{I_{TMR},m} \\
\quad \mathrm{else} \\
\quad\quad T_{add}(n) = 0 \\
\quad\quad Flag = 1 \\
\quad \mathrm{end} \\
\mathrm{end}
\end{array}
\tag{A28}
$$
$$
\begin{array}{l}
Flag = 0 \\
\mathrm{if}\ SI_{CTMR} = 1 \\
\quad P_{loopsTMRBBest} = P_{loops} \\
\mathrm{else} \\
\quad P_{loopsTMRBBest} = \dfrac{SI_{CTMR} + SL_{CTMR} \cdot N_{TMR} + n_{transition} - n_{TMR\_init}}{N_{TMR}} \\
\mathrm{end} \\
n_{CSRPBBest} = \dfrac{P_{loopsTMRBBest} \cdot N_{TMR} + n_{TMR\_init}}{n_{save}} \\
\mathrm{if}\ SI_{CTMR} \le SW_{CTMR}(b_2) - 1 \\
\quad T_{add} = \sum_{m=SI_{CTMR}}^{SW_{CTMR}(b_2) - 1} t_{I_{TMR},m} \\
\mathrm{elseif}\ SL_{CTMR} < P_{loopsTMRBBest} \\
\quad T_{add} = \sum_{m=SI_{CTMR}}^{n_{TMR\_init} + N_{TMR}} t_{I_{TMR},m} + \sum_{m=n_{TMR\_init}+1}^{SW_{CTMR}(b_2) - 1} t_{I_{TMR},m} \\
\mathrm{else} \\
\quad T_{add} = 0 \\
\quad Flag = 1 \\
\mathrm{end}
\end{array}
\tag{A29}
$$
$$
\begin{array}{l}
\mathrm{if}\ Flag = 1 \\
\quad t_{CTMRBBTMR} = NaN \\
\quad t_{CTMRBBTSR} = NaN \\
\mathrm{else} \\
\quad t_{nomBB} = t_{TMRinit} + P_{loopsTMRBBest} \cdot T_{TMRloop} + T_{TMRSRP} \cdot n_{CSRPBBest} \\
\quad t_{errBB} = T_{TMRttdB} + T_{TMRrecB} + T_{add} \\
\quad \mathrm{if}\ P_{loopsTMRBBest} < 250 \\
\quad\quad t_{CTMRBBTMR} = t_{nomBB} + t_{errBB} + t_{TMRTSR} \\
\quad\quad t_{CTMRBBTSR} = (n_{loops} - P_{loopsTMRBBest}) \cdot t_{TSRloop} + T_{TSRSRP0} + 2 \cdot T_{TSRSRP1} + T_{TSRconc} - T_{TSRskip} \\
\quad \mathrm{elseif}\ 250 \le P_{loopsTMRBBest} < 500 \\
\quad\quad t_{CTMRBBTMR} = t_{nomBB} + t_{errBB} + t_{TMRTSR} \\
\quad\quad t_{CTMRBBTSR} = (n_{loops} - P_{loopsTMRBBest}) \cdot t_{TSRloop} + T_{TSRSRP0} + T_{TSRSRP1} + T_{TSRconc} - T_{TSRskip} \\
\quad \mathrm{elseif}\ 500 \le P_{loopsTMRBBest} < 750 \\
\quad\quad t_{CTMRBBTMR} = t_{nomBB} + t_{errBB} + t_{TMRTSR} \\
\quad\quad t_{CTMRBBTSR} = (n_{loops} - P_{loopsTMRBBest}) \cdot t_{TSRloop} + T_{TSRSRP1} + T_{TSRconc} - \tfrac{2}{3}\, T_{TSRskip} \\
\quad \mathrm{elseif}\ 750 \le P_{loopsTMRBBest} < n_{loops} \\
\quad\quad t_{CTMRBBTMR} = t_{nomBB} + t_{errBB} + t_{TMRTSR} \\
\quad\quad t_{CTMRBBTSR} = (n_{loops} - P_{loopsTMRBBest}) \cdot t_{TSRloop} + T_{TSRconc}
\end{array}
\tag{A30}
$$
$$
\begin{array}{l}
\quad \mathrm{elseif}\ P_{loopsTMRBBest} \ge n_{loops} \\
\quad\quad t_{CTMRBBTMR} = t_{TMRinit} + n_{loops} \cdot T_{TMRloop} + T_{TMRSRP} \cdot (n_{SRP} - 1) + t_{errBB} \\
\quad\quad t_{CTMRBBTSR} = 0 \\
\quad \mathrm{end} \\
\mathrm{end} \\
T_{CTMRBBestEarly} = t_{CTMRBBTMR} + t_{CTMRBBTSR} \\
t_{CTMRBBETMR} = t_{CTMRBBTMR} \\
t_{CTMRBBETSR} = t_{CTMRBBTSR} \\
T_{CTMRBBestLate} = t_{CTMRBBTMR} + t_{CTMRBBTSR} \\
t_{CTMRBBLTMR} = t_{CTMRBBTMR} \\
t_{CTMRBBLTSR} = t_{CTMRBBTSR}
\end{array}
\tag{A31}
$$
AHR MIPS may also encounter TMR MIPS Type B-Worst errors early or late and these are referred to as TMR Type B-Worst Early and TMR Type B-Worst Late errors. As with the TMR Type A Early error, the TMR Type B-Worst Early error has a minimal impact on runtime. As with the TMR Type A Late error, the TMR Type B-Worst Late error is expected to significantly decrease runtime and increase energy usage. The TMR Type B-Worst Early error is shown in Figure A11 while the TMR Type B-Worst Late error is shown in Figure A12.
Figure A11. AHR MIPS TMR Type B Worst-Case Early error timing diagram.
Figure A12. AHR MIPS TMR Type B Worst-Case Late error timing diagram.
Equation (A32) shows how to compute the time to complete an AHR MIPS program with a TMR Type B-Worst Early error, where $P_{loopsTMRBWorstEarly}$ is the number of loops at which the transition point occurs when accounting for the error, $n_{CSRPBWorstEarly}$ is the number of save/restore points to create in TMR MIPS when accounting for the error, and $T_{CTMRretBWorstEarly}$ is the time needed to return to the point at which the error occurred after recovering from the error. Note that $t_{CTMRBWETMR}$ and $t_{CTMRBWETSR}$ are the times AHR MIPS spends in TMR and TSR mode, respectively, when encountering a TMR Type B-Worst Early error.
$$
\begin{array}{l}
t_{nomBWE} = t_{TMRinit} + P_{loopsTMRBWorstEarly} \cdot T_{TMRloop} + T_{TMRSRP} \cdot n_{CSRPBWorstEarly} \\
t_{errBWE} = T_{TMRSRPErr} + T_{TMRrecB} + T_{CTMRretBWorstEarly} \\
\mathrm{if}\ P_{loopsTMRBWorstEarly} < 250 \\
\quad t_{CTMRBWETMR} = t_{nomBWE} + t_{errBWE} + t_{TMRTSR} \\
\quad t_{CTMRBWETSR} = (n_{loops} - P_{loopsTMRBWorstEarly}) \cdot t_{TSRloop} + T_{TSRSRP0} + 2 \cdot T_{TSRSRP1} + T_{TSRconc} - T_{TSRskip} \\
\mathrm{elseif}\ 250 \le P_{loopsTMRBWorstEarly} < 500 \\
\quad t_{CTMRBWETMR} = t_{nomBWE} + t_{errBWE} + t_{TMRTSR} \\
\quad t_{CTMRBWETSR} = (n_{loops} - P_{loopsTMRBWorstEarly}) \cdot t_{TSRloop} + T_{TSRSRP0} + T_{TSRSRP1} + T_{TSRconc} - T_{TSRskip} \\
\mathrm{elseif}\ 500 \le P_{loopsTMRBWorstEarly} < 750 \\
\quad t_{CTMRBWETMR} = t_{nomBWE} + t_{errBWE} + t_{TMRTSR} \\
\quad t_{CTMRBWETSR} = (n_{loops} - P_{loopsTMRBWorstEarly}) \cdot t_{TSRloop} + T_{TSRSRP1} + T_{TSRconc} - \tfrac{2}{3}\, T_{TSRskip} \\
\mathrm{elseif}\ 750 \le P_{loopsTMRBWorstEarly} < n_{loops} \\
\quad t_{CTMRBWETMR} = t_{nomBWE} + t_{errBWE} + t_{TMRTSR} \\
\quad t_{CTMRBWETSR} = (n_{loops} - P_{loopsTMRBWorstEarly}) \cdot t_{TSRloop} + T_{TSRconc} \\
\mathrm{elseif}\ P_{loopsTMRBWorstEarly} \ge n_{loops} \\
\quad t_{CTMRBWETMR} = t_{TMRinit} + n_{loops} \cdot T_{TMRloop} + T_{TMRSRP} \cdot (n_{SRP} - 1) + t_{errBWE} \\
\quad t_{CTMRBWETSR} = 0 \\
\mathrm{end} \\
T_{CTMRBWorstEarly} = t_{CTMRBWETMR} + t_{CTMRBWETSR}
\end{array}
\tag{A32}
$$
The time $T_{CTMRretBWorstEarly}$ is computed according to Equation (A33), where $SDT_{CTMR}$ is the save time difference between consecutive save points and $WSI_{CTMR}$ is the index of the worst-case $SDT_{CTMR}$. This is nearly identical to Equation (A11).
$$
\begin{array}{l}
SDT_{CTMR} = ST_{CTMR} - [\,0,\ ST_{CTMR}(1\ \mathrm{to}\ \mathrm{length}(ST_{CTMR}) - 1)\,] \\
T_{CTMRretBWorstEarly} = SDT_{CTMR}(2)
\end{array}
\tag{A33}
$$
Next, the loop count at which the TMR to TSR transition will occur after encountering an error is determined using Equation (A34).
$$
\begin{array}{l}
\mathrm{if}\ SL_{CTMR}(1) = 0 \\
\quad P_{loopsTMRBWorstEarly} = P_{loops} \\
\mathrm{else} \\
\quad P_{loopsTMRBWorstEarly} = \dfrac{SI_{CTMR}(1) + SL_{CTMR}(1) \cdot N_{TMR} + n_{transition} - n_{TMR\_init}}{N_{TMR}} \\
\mathrm{end}
\end{array}
\tag{A34}
$$
Finally, $n_{CSRPBWorstEarly}$ is determined according to Equation (A35).
$$
n_{CSRPBWorstEarly} = \frac{P_{loopsTMRBWorstEarly} \cdot N_{TMR} + n_{TMR\_init}}{n_{save}}
\tag{A35}
$$
Equation (A36) shows how to compute the time to complete an AHR MIPS program with a TMR Type B-Worst Late error, where $P_{loopsTMRBWorstLate}$ is the number of loops at which the transition point occurs when accounting for the error, $n_{CSRPBWorstLate}$ is the number of save/restore points to create in TMR MIPS when accounting for the error, and $T_{CTMRretBWorstLate}$ is the time needed to return to the point at which the error occurred after recovering from the error. Note that $t_{CTMRBWLTMR}$ and $t_{CTMRBWLTSR}$ are the times AHR MIPS spends in TMR and TSR mode, respectively, when encountering a TMR Type B-Worst Late error.
$$
\begin{array}{l}
t_{nomBWL} = t_{TMRinit} + P_{loopsTMRBWorstLate} \cdot T_{TMRloop} + T_{TMRSRP} \cdot n_{CSRPBWorstLate} \\
t_{errBWL} = T_{TMRSRPErr} + T_{TMRrecB} + T_{CTMRretBWorstLate} \\
\mathrm{if}\ P_{loopsTMRBWorstLate} < 250 \\
\quad t_{CTMRBWLTMR} = t_{nomBWL} + t_{errBWL} + t_{TMRTSR} \\
\quad t_{CTMRBWLTSR} = (n_{loops} - P_{loopsTMRBWorstLate}) \cdot t_{TSRloop} + T_{TSRSRP0} + 2 \cdot T_{TSRSRP1} + T_{TSRconc} - T_{TSRskip} \\
\mathrm{elseif}\ 250 \le P_{loopsTMRBWorstLate} < 500 \\
\quad t_{CTMRBWLTMR} = t_{nomBWL} + t_{errBWL} + t_{TMRTSR} \\
\quad t_{CTMRBWLTSR} = (n_{loops} - P_{loopsTMRBWorstLate}) \cdot t_{TSRloop} + T_{TSRSRP0} + T_{TSRSRP1} + T_{TSRconc} - T_{TSRskip} \\
\mathrm{elseif}\ 500 \le P_{loopsTMRBWorstLate} < 750 \\
\quad t_{CTMRBWLTMR} = t_{nomBWL} + t_{errBWL} + t_{TMRTSR} \\
\quad t_{CTMRBWLTSR} = (n_{loops} - P_{loopsTMRBWorstLate}) \cdot t_{TSRloop} + T_{TSRSRP1} + T_{TSRconc} - \tfrac{2}{3}\, T_{TSRskip} \\
\mathrm{elseif}\ 750 \le P_{loopsTMRBWorstLate} < n_{loops} \\
\quad t_{CTMRBWLTMR} = t_{nomBWL} + t_{errBWL} + t_{TMRTSR} \\
\quad t_{CTMRBWLTSR} = (n_{loops} - P_{loopsTMRBWorstLate}) \cdot t_{TSRloop} + T_{TSRconc} \\
\mathrm{elseif}\ P_{loopsTMRBWorstLate} \ge n_{loops} \\
\quad t_{CTMRBWLTMR} = t_{TMRinit} + n_{loops} \cdot T_{TMRloop} + T_{TMRSRP} \cdot (n_{SRP} - 1) + t_{errBWL} \\
\quad t_{CTMRBWLTSR} = 0 \\
\mathrm{end} \\
T_{CTMRBWorstLate} = t_{CTMRBWLTMR} + t_{CTMRBWLTSR}
\end{array}
\tag{A36}
$$
The time $T_{CTMRretBWorstLate}$ is computed according to Equation (A37). This is nearly identical to Equation (A11).
$$
T_{CTMRretBWorstLate} = SDT_{CTMR}(\mathrm{length}(SDT_{CTMR}))
\tag{A37}
$$
Next, the loop count at which the TMR to TSR transition will occur after encountering an error is determined using Equation (A38).
$$
\begin{array}{l}
\mathrm{if}\ SL_{CTMR}(\mathrm{length}(SDT_{CTMR}) - 1) = 0 \\
\quad P_{loopsTMRBWorstLate} = P_{loops} \\
\mathrm{else} \\
\quad P_{loopsTMRBWorstLate} = \left( SI_{CTMR}(\mathrm{length}(SDT_{CTMR}) - 1) + SL_{CTMR}(\mathrm{length}(SDT_{CTMR}) - 1) \cdot N_{TMR} + n_{transition} - n_{TMR\_init} \right) / N_{TMR} \\
\mathrm{end}
\end{array}
\tag{A38}
$$
Finally, $n_{CSRPBWorstLate}$ is determined according to Equation (A39).
$$
n_{CSRPBWorstLate} = \frac{P_{loopsTMRBWorstLate} \cdot N_{TMR} + n_{TMR\_init}}{n_{save}}
\tag{A39}
$$
In contrast to the TMR errors which can affect the TMR to TSR transition point, TSR errors do not affect the transition point; however, TSR worst-case errors may be affected by the transition point. The best-case errors are unaffected by the transition point.
When AHR MIPS encounters a TSR Best-case error, it encounters it immediately after the creation of a save/restore point. This could be the save/restore point created by the transition from TMR to TSR, or any of the save/restore points created by TSR MIPS after AHR MIPS enters TSR mode. Regardless of which save/restore point the TSR Best-case error follows, the recovery time is always the same. This is because the TSR MIPS Best-case error was defined to be injected immediately prior to the branch comparison instruction before the first store word instruction after creating a save/restore point. A few examples of AHR MIPS TSR Best-case errors are shown in Figures A13–A16, where the transition occurs before the first, second, or third TSR save/restore creation point or after the third TSR save/restore creation point, respectively.
Figure A13. AHR MIPS TSR Best-Case Early Error Timing Diagram 1.
Figure A14. AHR MIPS TSR Best-Case Early Error Timing Diagram 2.
Figure A15. AHR MIPS TSR Best-Case Early Error Timing Diagram 3.
Figure A16. AHR MIPS TSR Best-Case Early Error Timing Diagram 4.
The time needed to complete an AHR MIPS program experiencing a TSR Best-case error is given in Equation (A40), where $T_{TSRRec}$ and $T_{TSRret}$ were previously defined in Equation (A12).
$$
\begin{array}{l}
T_{CTSRBest} = T_{AHRMIPS} + T_{TSRRec} + T_{TSRret} \\
T_{CTSRBest} = t_{AHRTMR} + t_{AHRTSR} + T_{TSRRec} + T_{TSRret} \\
t_{CTSRBTMR} = t_{AHRTMR} \\
t_{CTSRBTSR} = t_{AHRTSR} + T_{TSRRec} + T_{TSRret} \\
T_{CTSRBest} = t_{CTSRBTMR} + t_{CTSRBTSR}
\end{array}
\tag{A40}
$$
TSR Worst-case errors in AHR MIPS require special attention. While TSR Worst-case errors in TSR MIPS take place at the end of creating a save/restore point in the second save/restore point memory segment, that may not be possible in AHR MIPS depending on when the TMR to TSR transition takes place. If that transition occurs before the first TSR MIPS save/restore point is created, then the TSR worst-case error is still encountered at the end of creating a save/restore point in the second segment; in this case this would be the save/restore point created when the loop counter is at 250. This scenario is shown in Figure A17.
Figure A17. AHR MIPS TSR Worst-Case Early Error Timing Diagram 1.
When the TMR to TSR transition occurs after what would have been the first TSR MIPS save/restore point creation and before the second, there are two possibilities for a worst-case error, as shown in Figure A18. Note that the first save/restore point created after the transition is always written to the second save/restore point memory segment. This means that an error at the end of this save/restore point creation may not be the worst-case error; the worst-case error may instead occur at the end of the next save/restore point creation, which saves to the first save/restore point memory segment. The time to recover from the error and return to the point at which the error was encountered is calculated for both scenarios, and the one that takes longer is the worst-case error.
If the TMR to TSR transition occurs after the second TSR MIPS save/restore point creation and before the third, the worst-case error is not known in advance. According to the original definition of a TSR MIPS Worst-case error, it is the error that maximizes the number of instructions that TSR MIPS must re-execute. Therefore, the error may occur at the end of creating the third TSR MIPS save/restore point or at the last branch comparison at the end of the program. The time to return to the point at which the error occurred is calculated for both scenarios, and the one that takes longer is the worst-case scenario. This is illustrated in Figure A19.
Figure A18. AHR MIPS TSR Worst-Case Early Error Timing Diagram 2.
Figure A19. AHR MIPS TSR Worst-Case Early Error Timing Diagram 3.
Finally, if the TMR to TSR transition occurs after the last TSR MIPS save/restore point creation, the worst-case error occurs at the last branch comparison at the end of the program, as shown in Figure A20.
No errors are injected into Basic MIPS because it has no way of detecting or correcting them. An error injected into a register that is to be stored to memory would not change the runtime or energy usage of Basic MIPS; it would only make the resulting computations incorrect.
Equations (A41) and (A42) show how to compute the time to complete an AHR MIPS program experiencing a TSR Worst-case error, where Equation (A42) is a continuation of Equation (A41). If the transition point occurs before the completion of the first 250 loops, the AHR MIPS TSR worst-case error is identical to the TSR MIPS worst-case error in that the added time to complete the program is the same as in Equation (A14).
If the transition point occurs between the completion of 250 loops and 500 loops, there are two possibilities for the worst-case error. The first is that the error occurs at the end of creating the save/restore point upon completion of 500 loops, in which case all loops after the TMR to TSR transition must be re-completed and the save/restore point must be completed without error as well ($ctsrw_1$). The second is that the error occurs at the end of creating the save/restore point upon completion of 750 loops, in which case all loops after the previous save/restore point creation must be re-completed and the save/restore point at loop number 750 must be completed without error as well ($ctsrw_2$).
Figure A20. AHR MIPS TSR Worst-Case Early Error Timing Diagram 4.
If the transition point occurs between the completion of 500 loops and 750 loops, there are two possibilities for the worst-case error. The first is that the error occurs at the end of creating the save/restore point upon completion of 750 loops, in which case all loops after the TMR to TSR transition must be re-completed and the save/restore point must be completed without error as well ($ctsrw_3$). The second is that the error occurs at the last store word instruction in the program, in which case the nearly 250 complete loops since the creation of the save/restore point at loop 750 must be re-completed ($ctsrw_4$). The only way to know which takes longer to complete is to calculate the values for both, compare the results, and select the larger of the two. If the transition point occurs after the completion of 750 loops, the worst-case error occurs at the last store word at the end of the program and all loops from the TMR to TSR transition to the end of the program must be re-completed.
Note that $t_{CTSRWTMR}$ and $t_{CTSRWTSR}$ are the times AHR MIPS spends in TMR and TSR mode, respectively, when encountering a TSR Worst-case error.
$$
\begin{aligned}
&\textbf{if } P_{loops} < 250 \\
&\qquad t_{CTSRWTMR} = t_{AHRTMR} \\
&\qquad t_{CTSRWTSR} = t_{AHRTSR} + T_{TSRRec} + \sum_{n=N_{TSR}-3}^{N_{TSR}} t_{ITSR_n} + 250 \cdot T_{TSRloop} + T_{TSRSRP1Err} \\
&\textbf{else if } 250 \le P_{loops} < 500 \\
&\qquad ctsrw_1 = T_{TSRRec} + \sum_{n=N_{TSR}-3}^{N_{TSR}} t_{ITSR_n} + (500 - P_{loops}) \cdot T_{TSRloop} + T_{TSRSRP1Err} \\
&\qquad ctsrw_2 = T_{TSRRec} + \sum_{n=N_{TSR}-3}^{N_{TSR}} t_{ITSR_n} + 250 \cdot T_{TSRloop} + T_{TSRSRP0Err} \\
&\qquad t_{CTSRWTMR} = t_{AHRTMR} \\
&\qquad \textbf{if } ctsrw_1 > ctsrw_2 \\
&\qquad\qquad t_{CTSRWTSR} = t_{AHRTSR} + ctsrw_1 \\
&\qquad \textbf{else} \\
&\qquad\qquad t_{CTSRWTSR} = t_{AHRTSR} + ctsrw_2 \\
&\qquad \textbf{end}
\end{aligned}
\tag{A41}
$$

$$
\begin{aligned}
&\textbf{else if } 500 \le P_{loops} < 750 \\
&\qquad ctsrw_3 = T_{TSRRec} + \sum_{n=N_{TSR}-3}^{N_{TSR}} t_{ITSR_n} + (750 - P_{loops}) \cdot T_{TSRloop} + T_{TSRSRP1Err} \\
&\qquad ctsrw_4 = T_{TSRRec} + \sum_{n=N_{TSR}-3}^{N_{TSR}} t_{ITSR_n} + 249 \cdot T_{TSRloop} + \sum_{n_{TSRinit}+1}^{SW_{TSR}(\mathrm{length}(SW_{TSR})-1)} t_{ITSR_n} \\
&\qquad t_{CTSRWTMR} = t_{AHRTMR} \\
&\qquad \textbf{if } ctsrw_3 > ctsrw_4 \\
&\qquad\qquad t_{CTSRWTSR} = t_{AHRTSR} + ctsrw_3 \\
&\qquad \textbf{else} \\
&\qquad\qquad t_{CTSRWTSR} = t_{AHRTSR} + ctsrw_4 \\
&\qquad \textbf{end} \\
&\textbf{else if } P_{loops} \ge 750 \\
&\qquad t_{CTSRWTMR} = t_{AHRTMR} \\
&\qquad t_{CTSRWTSR} = t_{AHRTSR} + T_{TSRRec} + \sum_{n=N_{TSR}-3}^{N_{TSR}} t_{ITSR_n} + (n_{loops} - P_{loops} - 1) \cdot T_{TSRloop} + \sum_{n_{TSRinit}+1}^{SW_{TSR}(\mathrm{length}(SW_{TSR})-1)} t_{ITSR_n} \\
&\textbf{end} \\
&T_{CTSRWorst} = t_{CTSRWTMR} + t_{CTSRWTSR}
\end{aligned}
\tag{A42}
$$
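The case selection in Equations (A41) and (A42) can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the function and parameter names are hypothetical, the two instruction-time sums are assumed to be precomputed and passed in as `t_tail` (the sum over the last four TSR instructions) and `t_final` (the sum up to the last store word), and the usage values are placeholders, including the assumed total of 1000 loops.

```python
def ahr_tsr_worst_case(p_loops, n_loops, t_ahr_tmr, t_ahr_tsr, t_rec, t_tail,
                       t_loop, t_srp0_err, t_srp1_err, t_final):
    """Return (time in TMR mode, time in TSR mode) for a TSR Worst-case error.

    p_loops is the number of loops completed at the TMR to TSR transition.
    """
    base = t_rec + t_tail  # terms common to every case
    if p_loops < 250:
        added = base + 250 * t_loop + t_srp1_err
    elif p_loops < 500:
        ctsrw1 = base + (500 - p_loops) * t_loop + t_srp1_err
        ctsrw2 = base + 250 * t_loop + t_srp0_err
        added = max(ctsrw1, ctsrw2)  # the longer recovery is the worst case
    elif p_loops < 750:
        ctsrw3 = base + (750 - p_loops) * t_loop + t_srp1_err
        ctsrw4 = base + 249 * t_loop + t_final
        added = max(ctsrw3, ctsrw4)
    else:
        added = base + (n_loops - p_loops - 1) * t_loop + t_final
    # The error is encountered in TSR mode, so the TMR time is unchanged.
    return t_ahr_tmr, t_ahr_tsr + added


# Placeholder timing values for illustration only (not taken from the paper).
params = dict(n_loops=1000, t_ahr_tmr=100, t_ahr_tsr=1000, t_rec=10, t_tail=4,
              t_loop=2, t_srp0_err=30, t_srp1_err=30, t_final=6)
for p in (100, 300, 600, 900):
    t_tmr, t_tsr = ahr_tsr_worst_case(p_loops=p, **params)
    print(p, t_tmr, t_tsr, t_tmr + t_tsr)
```

Taking the maximum of the two candidates is equivalent to the explicit if/else comparison in the equations.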

Appendix C. Error Injection Energy Calculations

The energy computations are straightforward for TMR MIPS and TSR MIPS programs even when errors are injected. The time to complete these programs is multiplied by the dynamic power used by the appropriate architecture. The TMR Type A, Type B-Best, and Type B-Worst error energy calculations are shown in Equations (A43)–(A45) respectively. The TSR MIPS Best-case and Worst-case error energy calculations are shown in Equations (A46) and (A47) respectively.
$$E_{TMRErrA} = P_{TMRMIPS} \cdot T_{TMRErrA} \tag{A43}$$
$$E_{TMRErrBBest} = P_{TMRMIPS} \cdot T_{TMRErrBBest} \tag{A44}$$
$$E_{TMRErrBWorst} = P_{TMRMIPS} \cdot T_{TMRErrBWorst} \tag{A45}$$
$$E_{TSRBest} = P_{TSRMIPS} \cdot T_{TSRBest} \tag{A46}$$
$$E_{TSRWorst} = P_{TSRMIPS} \cdot T_{TSRWorst} \tag{A47}$$
Calculating the energy used by TMR MIPS and TSR MIPS programs experiencing errors is trivial, but the calculation is more involved for programs running in AHR MIPS because the time is divided between TMR and TSR modes of operation. Fortunately, the times to complete the TMR and TSR portions were recorded separately to simplify these calculations.
Equations (A48) and (A49) show how to calculate the energy used by AHR MIPS when encountering TMR Type A Early and Late errors respectively.
$$E_{CTMRAEarly} = P_{CTMRMIPS} \cdot t_{CTMRAETMR} + P_{CTSRMIPS} \cdot t_{CTMRAETSR} \tag{A48}$$
$$E_{CTMRALate} = P_{CTMRMIPS} \cdot t_{CTMRALTMR} + P_{CTSRMIPS} \cdot t_{CTMRALTSR} \tag{A49}$$
Equations (A50) and (A51) show how to calculate the energy used by AHR MIPS when encountering a TMR Type B-Best Early and Late error respectively.
$$E_{CTMRBBestEarly} = P_{CTMRMIPS} \cdot t_{CTMRBBETMR} + P_{CTSRMIPS} \cdot t_{CTMRBBETSR} \tag{A50}$$
$$E_{CTMRBBestLate} = P_{CTMRMIPS} \cdot t_{CTMRBBLTMR} + P_{CTSRMIPS} \cdot t_{CTMRBBLTSR} \tag{A51}$$
Equations (A52) and (A53) show how to calculate the energy used by AHR MIPS when encountering a TMR Type B-Worst Early and Late error respectively.
$$E_{CTMRBWorstEarly} = P_{CTMRMIPS} \cdot t_{CTMRBWETMR} + P_{CTSRMIPS} \cdot t_{CTMRBWETSR} \tag{A52}$$
$$E_{CTMRBWorstLate} = P_{CTMRMIPS} \cdot t_{CTMRBWLTMR} + P_{CTSRMIPS} \cdot t_{CTMRBWLTSR} \tag{A53}$$
Equations (A54) and (A55) show how to calculate the energy used by AHR MIPS when encountering a TSR Best-case and Worst-case error respectively.
$$E_{CTSRBest} = P_{CTMRMIPS} \cdot t_{CTSRBTMR} + P_{CTSRMIPS} \cdot t_{CTSRBTSR} \tag{A54}$$
$$E_{CTSRWorst} = P_{CTMRMIPS} \cdot t_{CTSRWTMR} + P_{CTSRMIPS} \cdot t_{CTSRWTSR} \tag{A55}$$
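All of the AHR MIPS energy equations in this appendix share one form: the power drawn in each mode multiplied by the time spent in that mode, summed over the two modes. A minimal Python sketch of that pattern is given below. The function names are hypothetical and the power and time values are placeholders, not measurements from this work.

```python
def single_mode_energy(power: float, time: float) -> float:
    """Energy for TMR MIPS or TSR MIPS alone: dynamic power times completion time."""
    return power * time


def ahr_energy(p_tmr_mode: float, p_tsr_mode: float,
               t_tmr: float, t_tsr: float) -> float:
    """Energy for AHR MIPS: each mode's time is weighted by that mode's power."""
    return (single_mode_energy(p_tmr_mode, t_tmr)
            + single_mode_energy(p_tsr_mode, t_tsr))


# Placeholder values for illustration only: power in watts, time in seconds,
# so the result is in joules.
print(ahr_energy(p_tmr_mode=3.0, p_tsr_mode=1.0, t_tmr=2.0, t_tsr=5.0))
```

Because the error scenarios differ only in the two mode times, the same function covers every AHR MIPS error case once the appropriate pair of times is supplied.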

Appendix D. VHDL Code to Reproduce Basic MIPS, TMR MIPS, TSR MIPS, and AHR MIPS

The VHDL code used to implement Basic MIPS, TMR MIPS, TSR MIPS, AHR MIPS and perform error injection for simulations is available on GitHub at: https://github.com/nicolas-hamilton/Adaptive-Hybrid-Redundancy-VHDL.

References

  1. Hamilton, N.S. Adaptive-Hybrid Redundancy for Radiation Hardening. Ph.D. Thesis, Air Force Institute of Technology, Wright-Patterson AFB, OH, USA, 2019.
  2. Espinosa, D.C.; Geist, A.; Petrick, D.J.; Flatley, T.P.; Hosler, J.C.; Crum, G.A.; Buenfil, M. Radiation-Hardened Processing System. Provisional Patent 2011/0107158 A1, 5 May 2011.
  3. Flatley, T.P. Radiation-Hardened Hybrid Processor. Provisional Patent 2011/0078498 A1, 31 March 2011.
  4. Geist, A.; Flatley, T.P.; Lin, M.R.; Petrick, D.J. Radiation-Hardened Hybrid Processor. Provisional Patent 2011/0099421 A1, 28 April 2011.
  5. Tamir, Y. Fault Tolerance for VLSI Multicomputers. Ph.D. Thesis, University of California, Berkeley, CA, USA, 1985.
  6. Gomaa, M.A.; Scarbrough, C.; Vijaykumar, T.N.; Pomeranz, I. Transient-Fault Recovery for Chip Multiprocessors. IEEE Micro 2003, 23, 76–83.
  7. Bickel, R.E. Fault Tolerant Processing Architecture. Provisional Patent 2003/0061535 A1, 27 March 2003.
  8. Bickel, R.E. Fault Tolerant Processing Architecture. U.S. Patent 6,938,183 B2, 30 August 2005.
  9. Breuer, M.A.; Carlan, A.J. State-of-the-Art Assessment of Testing and Testability of Custom LSI-VLSI Circuits. Volume VI. Redundancy, Testing Circuits, and Codes; Technical Report; Aerospace Corporation: El Segundo, CA, USA, 1982.
  10. Grecki, M. SEUs Tolerance in FPGAs Based Digital LLRF System for XFEL. In Proceedings of the 2012 18th IEEE-NPSS Real Time Conference, Berkeley, CA, USA, 9–15 June 2012; pp. 1–3.
  11. Iturbe, X.; Venu, B.; Özer, E.; Das, S. A Triple Core Lock-Step (TCLS) ARM Cortex-R5 Processor for Safety-Critical and Ultra-Reliable Applications. In Proceedings of the 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W), Toulouse, France, 28 June–1 July 2016; pp. 246–249.
  12. Liddell, D.C.; Williams, E.J. Method and Apparatus for Reducing the Effects of Hardware Faults in a Computer System Employing Multiple Central Processing Modules. U.S. Patent US5627965A, 6 May 1997.
  13. Sterpone, L.; Du, B. Analysis and Mitigation of Single Event effects on Flash-Based FPGAs. In Proceedings of the 2014 19th IEEE European Test Symposium (ETS), Paderborn, Germany, 26–30 May 2014; pp. 1–6.
  14. Espinosa, D.C.; Geist, A.; Petrick, D.J.; Flatley, T.P.; Hosler, J.C.; Crum, G.A.; Buenfil, M. Radiation-Hardened Processing System. U.S. Patent 8,484,509 B2, 9 July 2013.
  15. Singh, A.D.; Gray, F.G. Periodically Self Restoring Redundant Systems for VLSI Based Highly Reliable Design; Technical Report; University of Massachusetts and Virginia Tech: Boston, MA, USA, 1986.
  16. Tabero, J.; Regadío, A.; Pérez, C.; Pazos, J.; Reviriego, P.; Sánchez-Macian, A.; Maestro, J.A. Modular Fault Tolerant Processor Architecture on a SoC for Space. Microelectron. Reliab. 2018, 83, 84–90.
  17. Oh, N.; Shirvani, P.P.; McCluskey, E.J. Error Detection by Duplicated Instructions in Super-Scalar Processors. IEEE Trans. Reliab. 2002, 51, 63–75.
  18. Oh, N.; McCluskey, E.J. Low Energy Error Detection Technique Using Procedure Call Duplication. In Proceedings of the 2001 International Symposium on Dependable Systems and Networks, Goteborg, Sweden, 1–4 July 2001.
  19. Tokponnon, M.P.; Lobelle, M.; Ezin, E.C. Entirely Protecting Operating Systems Against Transient Errors in Space Environment. arXiv 2017, arXiv:1708.06450.
  20. Oh, N.; Shirvani, P.P.; McCluskey, E.J. Control-Flow Checking by Software Signatures. IEEE Trans. Reliab. 2002, 51, 111–122.
  21. Reis, G.A.; Chang, J.; Vachharajani, N.; Rangan, R.; August, D.I. SWIFT: Software Implemented Fault Tolerance. In Proceedings of the International Symposium on Code Generation and Optimization, New York, NY, USA, 20–23 March 2005.
  22. Reis, G.A.; Chang, J.; August, D.I. Automatic Instruction-Level Software-Only Recovery. IEEE Micro 2007, 27, 36–47.
  23. Frenkel, C.; Legat, J.D.; Bol, D. A Partial Reconfiguration-Based Scheme to Mitigate Multiple-Bit Upsets for FPGAs in Low-Cost Space Applications. In Proceedings of the 2015 10th International Symposium on Reconfigurable Communication-Centric Systems-on-Chip (ReCoSoC), Bremen, Germany, 29 June–1 July 2015; pp. 1–7.
  24. Lima, F.; Carmichaell, C.; Fabula, J.; Padovanil, R.; Reis, R. A Fault Injection Analysis of Virtex FPGA TMR Design Methodology. In Proceedings of the 2001 6th European Conference on Radiation and Its Effects on Components and Systems, Grenoble, France, 10–14 September 2001; pp. 275–282.
  25. Mahmoud, D.G.; Alkady, G.I.; Amer, H.H.; Daoud, R.M.; Adly, I.; Essam, Y.; Ismail, H.A.; Sorour, K.N. Fault Secure FPGA-Based TMR Voter. In Proceedings of the 2018 7th Mediterranean Conference on Embedded Computing, Budva, Montenegro, 10–14 June 2018; pp. 1–4.
  26. Nidhin, T.S.; Battacharyya, A.; Behera, R.P.; Jayanthi, T.; Velusamy, K. Understanding Radiation Effects in SRAM-Based Field Programmable Gate Arrays for Implementing Instrumentation and Control Systems of Nuclear Power Plants. Nucl. Eng. Technol. 2017, 49, 1589–1599.
  27. Ostler, P.S.; Caffrey, M.P.; Gibelyou, D.S.; Graham, P.S.; Morgan, K.S.; Pratt, B.H.; Quinn, H.M.; Wirthlin, M.J. SRAM FPGA Reliability Analysis for Harsh Radiation Environments. IEEE Trans. Nucl. Sci. 2009, 56, 3519–3526.
  28. Straka, M.; Kastil, J.; Kotasek, Z. Fault Tolerant Structure for SRAM-Based FPGA via Partial Dynamic Reconfiguration. In Proceedings of the 2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, Lille, France, 1–3 September 2010; pp. 365–372.
  29. Czajkowski, D.R. SEU and SEFI Fault Tolerant Computer. U.S. Patent 7,260,742, 21 August 2007.
  30. Mariani, R.; Kuschel, T.; Shigehara, H. A Flexible Microcontroller Architecture for Fail-Safe and Fail-Operational Systems. In Proceedings of the HiPEAC Workshop on Design for Reliability (HiPEAC), Pisa, Italy, 25–27 January 2010.
  31. Kontoleon, J. Soft Error Recovery in Simplex and Triplex Memory Systems. Microelectron. Reliab. 2009, 49, 410–423.
  32. Ray, J.; Hoe, J.C.; Falsafi, B. Dual Use of Superscalar Datapath for Transient-Fault Detection and Recovery. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, Austin, TX, USA, 1–5 December 2001; pp. 214–224.
  33. Shirvani, P.P.; Saxena, N.R.; McCluskey, E.J. Software Implemented EDAC Protection Against SEUs. IEEE Trans. Reliab. 2000, 49, 273–284.
  34. LaMares, B.J.; Gauer, C. A Power-Efficient Design Approach to Radiation Hardened Digital Circuitry using Dynamically Selectable Triple Modulo Redundancy. In Proceedings of the 2008 Military & Aerospace Programmable Logic Devices (MAPLD) Conference, Annapolis, MD, USA, 15–18 September 2008.
  35. Wang, S.; Hu, J.; Ziavras, S.G. Self-Adaptive Data Caches for Soft-Error Reliability. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2008, 27, 1503–1507.
  36. Hamilton, N.S. Basic MIPS Architecture Version 1.4; Technical Report; Air Force Institute of Technology: Wright-Patterson AFB, OH, USA, 2019.
  37. Hamilton, N.S. Triple Modular Redundancy MIPS Architecture Version 1.4; Technical Report; Air Force Institute of Technology: Wright-Patterson AFB, OH, USA, 2019.
  38. Hamilton, N.S. Adaptive-Hybrid Redundancy MIPS Architecture Version 2.2; Technical Report; Air Force Institute of Technology: Wright-Patterson AFB, OH, USA, 2019.
Figure 1. Triple Modular Redundancy (TMR) MIPS simplified block diagram.
Figure 2. Temporal Software Redundancy (TSR) MIPS simplified block diagram.
Figure 3. AHR MIPS simplified block diagram.
Figure 4. AHR MIPS in TMR mode simplified block diagram with disabled portions in red.
Figure 5. AHR MIPS in TSR mode simplified block diagram with disabled portions in red.
Figure 6. Averaged results of software simulation of all errors: energy vs. time to complete.
Figure 7. Average performance bounds for AHR MIPS with a TMR to TSR point at 15,000 instructions.
Figure 8. Average performance bounds for AHR MIPS with a TMR to TSR point at 11,000 instructions.
Figure 9. Average performance bounds for AHR MIPS with a TMR to TSR point at 20,000 instructions.
Figure 10. Average performance bounds for AHR MIPS with a TMR to TSR point at 30,000 instructions.
Figure 11. Average performance bounds for AHR MIPS with a TMR to TSR point at 40,000 instructions.
Figure 12. Average performance bounds for AHR MIPS with a TMR to TSR point at 50,000 instructions.
Figure 13. Average performance bounds for AHR MIPS with a TMR to TSR point at 60,000 instructions.
Figure 14. Average performance bounds for AHR MIPS with a TMR to TSR point at 70,000 instructions.
Figure 15. Average performance bounds for AHR MIPS with a TMR to TSR point at 80,000 instructions.
Figure 16. TMR to TSR transition varying from 11,000 to 80,000 instructions: energy vs. time to complete.
Figure 17. AHR MIPS TMR to TSR transition varying from 11,000 to 80,000 instructions: energy vs. time to complete.
Figure 18. AHR MIPS TMR to TSR transition varying from 11,000 to 80,000 instructions: energy vs. time to complete.
Figure 19. Time Difference Between Successive Steps of TMR to TSR Transition Point When Varying from 11,000 to 80,000 in Steps of 1000.
Figure 20. Energy difference between successive steps of TMR to TSR transition point when varying from 11,000 to 80,000 in steps of 1000.
Table 1. Simple software redundancy example.
| Instruction Number | Original Set | Redundant Set |
|---|---|---|
| 1 | LUI R1 1 | LUI R1 1 |
| 2 | LUI R2 2 | LUI R15 1 |
| 3 | ADD R3 R1 R2 | LUI R2 2 |
| 4 | SW R3 R0 OFFSET | LUI R16 2 |
| 5 | | ADD R3 R1 R2 |
| 6 | | ADD R17 R15 R16 |
| 7 | | BNE R3 R17 ERR |
| 8 | | SW R3 R0 OFFSET |
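To illustrate how the redundant instruction set in Table 1 detects an error before the store, the following Python sketch emulates its eight instructions with plain integers. It is an illustration only: the function name and the `flip_bit` injection parameter are hypothetical, the upset is assumed to strike R3 just before the comparison, and LUI is assumed to follow the standard MIPS behavior of placing the immediate in the upper 16 bits.

```python
from typing import Optional


def run_redundant_set(flip_bit: Optional[int] = None) -> Optional[int]:
    """Emulate the Redundant Set column of Table 1.

    Returns the value that would be stored, or None if the comparison
    detects a mismatch and the store is skipped.
    """
    r1 = 1 << 16     # 1: LUI R1 1
    r15 = 1 << 16    # 2: LUI R15 1 (duplicate of R1)
    r2 = 2 << 16     # 3: LUI R2 2
    r16 = 2 << 16    # 4: LUI R16 2 (duplicate of R2)
    r3 = r1 + r2     # 5: ADD R3 R1 R2
    r17 = r15 + r16  # 6: ADD R17 R15 R16 (duplicate of R3)

    if flip_bit is not None:
        r3 ^= 1 << flip_bit  # assumed single-bit upset in R3 before the compare

    if r3 != r17:    # 7: BNE R3 R17 ERR
        return None  # branch to the error handler; the store is not executed
    return r3        # 8: SW R3 R0 OFFSET


print(run_redundant_set())            # error-free run stores R3
print(run_redundant_set(flip_bit=0))  # injected upset is caught by the BNE
```

The sketch shows only detection; recovery through save/restore points is handled separately and is not modeled here.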

Share and Cite

MDPI and ACS Style

Hamilton, N.; Graham, S.; Carbino, T.; Petrosky, J.; Betances, A. Adaptive-Hybrid Redundancy with Error Injection. Electronics 2019, 8, 1266. https://doi.org/10.3390/electronics8111266

AMA Style

Hamilton N, Graham S, Carbino T, Petrosky J, Betances A. Adaptive-Hybrid Redundancy with Error Injection. Electronics. 2019; 8(11):1266. https://doi.org/10.3390/electronics8111266

Chicago/Turabian Style

Hamilton, Nicolas, Scott Graham, Timothy Carbino, James Petrosky, and Addison Betances. 2019. "Adaptive-Hybrid Redundancy with Error Injection" Electronics 8, no. 11: 1266. https://doi.org/10.3390/electronics8111266

APA Style

Hamilton, N., Graham, S., Carbino, T., Petrosky, J., & Betances, A. (2019). Adaptive-Hybrid Redundancy with Error Injection. Electronics, 8(11), 1266. https://doi.org/10.3390/electronics8111266

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
