Appendix B. Error Injection Timing and Calculations
TMR errors are divided into single and multiple processor errors. A single processor error, called a TMR Type A error, recovers to the same instruction at which the initial error occurred as shown in
Figure A3. A multiple processor error, called a TMR Type B error, occurs when all three processors disagree and all three processors are reset and restored to a previously saved state called a save/restore point. This is shown in
Figure A4. In this figure, the acronym SRP denotes points in the TMR program execution at which a save/restore point is created. The TMR Type A error is one example of a best-case TMR error. A TMR Type B error that occurs immediately after creation of a save/restore point is another example of a best-case TMR error because it minimizes the number of instructions that must be recomputed to fully recover from the error. This is called a TMR Type B-Best error. The worst-case TMR error occurs when a Type B error occurs during save/restore point creation and maximizes the number of instructions to be recomputed to fully recover from the error. This is called a TMR Type B-Worst error. The Type B-Best and Type B-Worst errors are shown in
Figure A5.
Figure A3.
TMR MIPS Type A error timing diagram.
Figure A4.
TMR MIPS Type B error timing diagram.
The runtime for a TMR MIPS Type A error is given in Equation (
A1) where
is the time for TMR MIPS to complete a program in the absence of an error from the previous work [
1],
is error detection time,
is the Type A recovery time,
is the time to return to the instruction at which the error occurred, and
is the time required to repeat the instruction at which the error occurred. The last four of these values are determined from simulation.
The runtime for TMR MIPS Type B Best-case error is given in Equation (
A2) where
is the error detection time,
is the Type B recovery time, and
is the time to re-execute the instructions between the last completed save/restore point and the instruction at which the error occurred. The times to detect and recover from the error are determined from simulation, but the time to return to the point at which the error occurred is determined by analysis.
While there are many locations in a program where a TMR Type B-Best error may occur, the absolute Best-case error is the one that minimizes the number of instructions between the return from save/restore point creation and the store word instruction following it. In order to determine which pairing of save/restore point creation and store word instructions has the shortest distance between them, the loop count and instruction index of every save/restore point creation and store word instruction must be determined. The store word instruction indices are simply located by examining the program. Equation (
A3) shows how to calculate where save/restore point creation occurs where
is a vector containing the instruction index in the TMR program where save/restore points are created,
is a vector containing the program loop count values where the save/restore points are created, and
is a vector containing the amount of time from the beginning of the program to the points at which save/restore points are created.
is not used now in calculating the TMR Type B-Best program completion time, but will be used shortly.
Figure A5.
TMR MIPS Type B Best- and Worst-Case error timing diagram.
The next step is to compute all possible differences (
) between save/restore point indices and store word indices as shown in Equation (
A4) where
is the transpose of
. This formula states
is a matrix of row vectors such that the
nth row subtracts each value of
from the
nth
value. Note that
is a vector containing the indices of every store word instruction in a program.
Next, some of the values in
may be negative because a save/restore point may occur at the end of one loop and the next store word may occur at the beginning of the next loop. Therefore,
is modified so that all values are positive as shown in Equation (
A5). In this equation, the “<” and “>” operators are logical operators that populate a matrix with ones or zeros depending on whether the individual matrix entries are less than or greater than the argument to the right of the operator. The term
creates a matrix with all the positive values of
and where all the negative values of
are set to zero (the “
” operator denotes element-wise multiplication). The term
creates a matrix with all the negative values of
and where all the positive values of
are set to zero. The term
creates a matrix where all the negative values of
are replaced by
and all the positive values of
are set to zero. The term
creates a matrix where all the negative values of
are replaced by the positive number of instructions from the save/restore point at the end of a loop to the store word instruction at the beginning of the next loop and accounts for the fact that code execution jumped from the end of the loop back to the start of the loop. Finally,
contains all the positive instruction distances between save/restore points and the store words following them.
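The construction of the difference matrix and the wrap-around correction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the argument names, and the `loop_len` parameter are assumptions introduced for the example. In particular, `loop_len` is assumed to be the number of instructions in one pass of the program loop, which is the quantity added to each negative entry. Following the strict "<" and ">" operators in the text, an entry that is exactly zero stays zero.

```python
def signed_distances(store_idx, save_idx):
    """Row n holds store_idx[n] - s for every save/restore index s."""
    return [[sw - s for s in save_idx] for sw in store_idx]


def wrap_negative(diff, loop_len):
    """Keep positive entries; shift negative entries by one loop length.

    A negative entry means the next store word is reached only after the
    program jumps back to the start of the loop, so loop_len (assumed to be
    the loop length in instructions) is added to make the distance positive.
    """
    wrapped = []
    for row in diff:
        new_row = []
        for d in row:
            if d > 0:
                new_row.append(d)
            elif d < 0:
                new_row.append(d + loop_len)
            else:
                new_row.append(0)
        wrapped.append(new_row)
    return wrapped


# Hypothetical indices: two store words, two save/restore points, 40-instruction loop.
store_words = [5, 20]
save_points = [12, 30]
raw = signed_distances(store_words, save_points)
distances = wrap_negative(raw, loop_len=40)
```

In this made-up example, the save/restore point at index 30 is followed by the store word at index 5 of the next loop iteration, so the raw difference of -25 becomes a distance of 15 instructions.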
The next step is to determine the minimum distance between a save/restore point creation and a store word instruction. Equation (
A6) is used to calculate the minimum distance where
is a function that returns the minimum value of each column vector of a matrix in the row vector
and returns the index of each minimum value in each column vector in the row vector
. For a vector,
returns the minimum value in
and the index of the minimum value in
. The value
is an index into the columns of
and tells which column contains the minimum distance between a save/restore point and a store word. The value
is an index into the rows of
and tells which row contains the minimum distance between a save/restore point and a store word. Because the columns of
correspond to the save/restore point indices
and the rows correspond to the store word indices
,
is the address of the instruction at which the save/restore point is created closest to the store word instruction at the address specified to
. In other words, this is the absolute shortest distance between the creation of a save/restore point and when an error could occur at a store word and constitutes the best-case multiple processor error for TMR MIPS.
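The two-step minimum search can be sketched as follows, under the assumption that the distance matrix is stored as a list of rows with one row per store word and one column per save/restore point. The helper names `column_min` and `closest_pair` are hypothetical and introduced only for illustration. Indices here are zero-based, whereas a MATLAB-style formulation would be one-based.

```python
def column_min(matrix):
    """Return (minima, rows): each column's minimum and the row where it occurs."""
    minima, rows = [], []
    for c in range(len(matrix[0])):
        column = [row[c] for row in matrix]
        smallest = min(column)
        minima.append(smallest)
        rows.append(column.index(smallest))
    return minima, rows


def closest_pair(matrix):
    """Return (distance, row, col) of the smallest entry in the matrix.

    Step 1 takes the minimum of every column; step 2 takes the minimum of
    those minima. The column selects the save/restore point and the row
    selects the store word that follows it most closely.
    """
    minima, rows = column_min(matrix)
    best = min(minima)
    col = minima.index(best)
    return best, rows[col], col


# Hypothetical positive-distance matrix (rows: store words, columns: save points).
example = [[33, 15], [8, 30]]
best_distance, store_row, save_col = closest_pair(example)
```

Keeping the matrix two-dimensional even when one of the index vectors has a single element means the same two steps apply without a separate scalar case in this sketch.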
The formula for determining
is now presented in Equation (
A7). The if-else statement is needed because the program is in a loop and
could be less than
.
The definition of the
function in Equation (
A6) presents an interesting situation when
or
is a scalar rather than a vector. In this situation,
will be a vector instead of a matrix and performing the operations in Equation (
A6) will not provide usable results for proper indexing into
and
in Equation (
A7). If
is a scalar,
is used as the indexing variable into
and no index variable is used for
because it is a scalar. These adjustments are made to Equation (
A7) as shown in Equation (
A8). If
is a scalar,
is used as the indexing variable into
and no index variable is used for
because it is a scalar. These adjustments are made to Equation (
A7) as shown in Equation (
A9).
The Type B-Worst error occurs at the end of creating a save/restore point so that the error is detected before successful save/restore point creation. The multiple bit error is injected when attempting to write the loop counter when creating the save/restore point such that a multiple processor error is detected and triggers recovery operations. In this scenario, 10,000 instructions and save/restore point creation must be repeated to return to the point in the program at which the error occurred. The Type B Worst-case error is shown in
Figure A5.
The runtime for TMR MIPS Type B Worst-case error is given in Equation (
A10) where
is the time it takes TMR MIPS to encounter an error during creation of a save/restore point when the error occurs in the loop counter of multiple processors when attempting to save the loop counter to memory,
is the time to recover from a multiple processor error, and
is the time to return to the instruction at which the error occurred. The time
is determined from the simulation, but
is determined by analysis.
To compute the worst-case scenario time to return to the instruction at which the error occurred, the worst-case time between save points must first be determined according to Equation (
A11) where
was previously defined in Equation (
A3),
is the number of full loops completed by the time the
mth save/restore point creation is reached,
is the time from the start of the loop in which the save/restore point is created to the instruction in that loop at which the save/restore point is created,
is the time from the beginning of the program to the time at which the
mth save/restore point creation begins,
is the save time difference between consecutive save points, and
is the index of the worst-case
. The value of
is obtained by subtracting the 1st value of
from the second value, the second value from the third, and so on until the
value is subtracted from the
. The maximum value of
is
.
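The consecutive-difference step is easy to check with a short sketch. The function name `worst_save_gap` and the sample times below are hypothetical; the only input assumed is the vector of times from the start of the program to each save/restore point creation.

```python
def worst_save_gap(save_times):
    """Return (gaps, worst_gap, worst_index) for consecutive save/restore times.

    gaps[k] is save_times[k + 1] - save_times[k]. worst_index is the
    zero-based position of the largest gap.
    """
    if len(save_times) < 2:
        raise ValueError("at least two save/restore points are needed")
    gaps = [later - earlier for earlier, later in zip(save_times, save_times[1:])]
    worst_gap = max(gaps)
    return gaps, worst_gap, gaps.index(worst_gap)


# Hypothetical save/restore creation times in arbitrary time units.
gaps, worst_gap, worst_index = worst_save_gap([100, 350, 700, 900])
```

With these made-up times the largest gap is the second one, so an error at the end of the third save/restore point creation would force the most recomputation.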
TSR best-case and worst-case errors are similar to TMR Type B-Best and TMR Type B-Worst errors in that they occur immediately after save/restore point creation and during save/restore point creation, respectively. These errors are shown in
Figure A6.
Figure A6.
TSR MIPS Best- and Worst-Case error timing diagram.
The best-case scenario minimizes the number of instructions which must be executed after error recovery to return to the point in the program at which the error occurred. Therefore, the best-case error occurs immediately after creation of a save/restore point. The error is injected immediately prior to the branch comparison instruction before the first store word instruction after creating a save/restore point. The error is injected into one of the registers to be compared.
The TSR MIPS Best-case error is computed using Equation (
A12) where
is the time to complete a program in the absence of an error from the previous work [
1],
is the time to perform error recovery operations and is determined from simulation results, and
is the time needed to return from the most recent save/restore point to the instruction at which the error occurred.
The time to return to the instruction at which the error occurred is determined using Equation (
A13) where
is the number of instructions needed to initialize a TSR program (4 instructions) and
is a vector containing the instruction indices of all store word instructions in a TSR MIPS program.
The TSR MIPS Worst-case scenario maximizes the number of instructions which must be executed after error recovery to return to the point in the program at which the error occurred. Therefore, the worst-case error occurs at the end of creating a save/restore point. This error would specifically target the loop counter, which is the last register to be written to the save/restore point during save/restore point creation. This error would force TSR MIPS to restore itself from the previous save/restore point and then proceed past the end of the next save/restore point creation, which means completing 250 program loops all over again. Additionally, the worst-case error will occur when creating the save/restore point in the second segment of save/restore point memory rather than the first segment because the second segment takes longer to create.
The TSR MIPS Worst-case error is computed using Equation (
A14) where
is the time to complete a single TSR program loop defined in previous work [
1] and
is the time from the start of save/restore point creation to the time at which an error is detected in the difference between the loop counter and duplicate loop counter. The value of
is determined from the simulation. The term
is the time to complete the loop after performing error recovery and the term
is the time to re-complete the 250 loops between save/restore points.
AHR may experience a TMR or TSR error depending on whether it is operating in TMR or TSR mode. TMR errors are further subdivided into early and late errors depending on whether they occur near the beginning of the program or near the TMR to TSR transition point, respectively. Early errors have less impact on total program runtime and energy usage than late errors because early errors do not significantly affect the location of the TMR to TSR transition point: the transition depends upon completing a predetermined number of instructions without error before AHR switches from TMR to TSR mode. Late errors have more impact on total program runtime and energy usage because they move the TMR to TSR transition point towards the end of the program so that more of the program is executed in TMR mode. The result is that the program runs faster but uses more energy than if an early error or no error had occurred. Errors encountered when AHR MIPS is operating in TSR mode are virtually identical to the errors encountered by TSR MIPS, but depend upon the point at which the TMR to TSR transition occurs within a program.
When AHR MIPS encounters a TMR Type A error, it handles the error the same way that TMR MIPS would. If the error occurs early in the program, such as at the first store word instruction in the program, the TMR to TSR transition point is only moved by a few instructions as shown in
Figure A7. This is referred to as a Type A Early error and it has a minimal impact on the program runtime.
Figure A7.
AHR MIPS TMR Type A Early error timing diagram.
Equation (
A15) shows how to calculate the AHR MIPS Type A Early error timing where
is the new transition point loop count determined according to Equation (
A16) and
is the number of save/restore points to create prior to the transition determined by Equation (
A17). The TMR to TSR transition point determines how many save/restore points are created in TMR and TSR mode. The TMR mode save/restore points are determined by
, but the number of TSR mode save/restore points depends on where the transition occurs relative to the creation point for the TSR mode save/restore points which only occur at 250, 500, and 750 loops. This is the rationale for the if-else statements in these equations. There is also a possibility that the Type A error may push the TMR to TSR transition point out past the end of the program, in which case, AHR MIPS never enters TSR mode. Note also that
and
are the time AHR MIPS spends in TMR and TSR mode respectively when encountering a TMR Type A Early error. The time spent in TMR and TSR are separated to make the energy calculations simpler.
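The dependence of the TSR-mode save/restore point count on the transition point can be illustrated with a small sketch. It is not a reconstruction of the equations referenced above. The function name, the `total_loops` parameter, and the treatment of a transition that lands exactly on a save/restore loop count (counted here as having missed that save/restore point) are all assumptions made for the example.

```python
TSR_SAVE_LOOPS = (250, 500, 750)  # loop counts at which TSR mode creates save/restore points


def tsr_save_points_remaining(transition_loop, total_loops):
    """Count the TSR-mode save/restore points still ahead of the transition.

    Returns 0 when the transition is pushed to or past the end of the
    program, in which case TSR mode is never entered.
    """
    if transition_loop >= total_loops:
        return 0
    return sum(1 for p in TSR_SAVE_LOOPS if transition_loop < p < total_loops)


# Hypothetical program length of 1000 loops.
early = tsr_save_points_remaining(100, 1000)
middle = tsr_save_points_remaining(300, 1000)
never = tsr_save_points_remaining(1200, 1000)
```

The branch on `transition_loop >= total_loops` mirrors the case in which the error pushes the transition past the end of the program.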
If the TMR Type A error occurs late in the program, such as at the last store word instruction before the TMR to TSR transition, the TMR to TSR transition is moved by nearly 15,000 instructions past the point at which it would have occurred if there were no error. This is shown in
Figure A8. This is referred to as a Type A Late error and it causes the program to execute more instructions in TMR MIPS and fewer instructions in TSR MIPS than if no error had occurred. The expected effect is a significantly shorter runtime and increased energy usage.
Figure A8.
AHR MIPS TMR Type A Late error timing diagram.
Equation (
A18) shows how to calculate the AHR MIPS Type A Late error timing where
is the new transition point loop count determined according to Equation (
A19) and
is the number of save/restore points to create prior to the transition determined by Equation (
A20). Note also that
and
are the time AHR MIPS spends in TMR and TSR mode respectively when encountering a TMR Type A Late error.
AHR MIPS may also encounter TMR MIPS Type B-Best errors early or late and these are referred to as TMR Type B-Best Early and TMR Type B-Best Late errors. As with the TMR Type A Early error, the TMR Type B-Best Early error has a minimal impact on runtime. As with the TMR Type A Late error, the TMR Type B-Best Late error is expected to significantly decrease runtime and increase energy usage. The TMR Type B-Best Early error is shown in
Figure A9 while the TMR Type B-Best Late error is shown in
Figure A10.
Figure A9.
AHR MIPS TMR Type B Best-Case Early error timing diagram.
Figure A10.
AHR MIPS TMR Type B Best-Case Late error timing diagram.
In order to determine the AHR MIPS runtime for programs with Type B-Best Early and Late errors, some of the variables used in computing TMR MIPS runtime for programs with Type B-Best errors need to be modified. The variable
needs to be modified so it only contains instruction indices of save/restore points that occur before the original TMR to TSR transition was expected to take place. The values of
and
must also be updated. These updates are illustrated in Equation (
A21) where the
returns the vector of
where the values of
are less than the TMR to TSR transition point and all other values of the original
vector are excluded.
Next, all possible differences between save/restore point indices and store word indices are calculated according to Equation (
A22). Note that this is different when compared with Equation (
A4) because this formula must account for the fact that an error cannot be allowed to occur after the TMR to TSR transition or it would be a TSR error rather than a TMR Type B Error.
Then, just as Equation (
A5) made all values of
positive, Equation (
A23) makes all values of
positive as well.
The next step is to determine which store word indices minimize the difference between each store word and save index. This is computed in Equation (
A24). Note that this differs from Equation (
A6) because there is no second step to determine the absolute minimum distance. This is because it is desirable to determine the early and late scenarios for a Type B-Best error. The absolute minimum of
might not minimize or maximize the number of instructions computed in TMR mode. Instead, each possible combination of minimum distance from a save index to a store word index is evaluated for total program completion time. The total program completion time for each scenario is then evaluated against the completion times to determine which is slowest (Early) and which is fastest (Late).
Equations (
A25)–(
A27) show how to compute the time to complete each program for each possible combination of minimum distance from a save index to a store word index. (Equation (
A27) is a continuation of Equation (
A26) because the entire equation could not fit on one page.) Equation (
A26) (and Equation (
A27)) also shows that the Type B-Best Early solution is the maximum of these times and the Type B-Best Late solution is the minimum of these times. The
variable is used to keep track of whether a particular combination of save index and store word index is allowed. The flag is 1 if the combination is not allowed because the store word following the save index would occur after the TMR to TSR transition. The variable
is the new TMR to TSR transition point based on the error location for the
nth save index. The variable
is the new number of save/restore points to create for the
nth save index. The variable
is the amount of time required to return from the save index to the store word index at which the error occurred for the
nth save index. The time to complete the TMR portion of the program for the
nth save index is
. The time to complete the TSR portion of the program for the
nth save index is
. The value
is assigned to
and
when
because the
and
functions ignore
values and return only numerical values. Finally,
,
are the time AHR MIPS spends in TMR and TSR mode when a TMR Type B-Best Early error is encountered. Similarly,
,
are the time AHR MIPS spends in TMR and TSR mode when a TMR Type B-Best Late error is encountered.
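The selection of the Early and Late scenarios from the per-save-index completion times can be sketched as below. The sketch assumes that disallowed combinations are marked with NaN so that they are skipped by the maximum and minimum, in the spirit of MATLAB-style `max` and `min`. Plain Python `max` and `min` do not skip NaN, so the filter is written out explicitly. The function name and the sample times are hypothetical.

```python
import math


def early_and_late(total_times):
    """Return ((early_time, n), (late_time, n)) over the allowed save indices.

    Entries equal to NaN mark combinations that are not allowed and are
    skipped. Early is the slowest completion time (maximum) and Late is the
    fastest (minimum), following the definitions in the text.
    """
    valid = [(t, n) for n, t in enumerate(total_times) if not math.isnan(t)]
    if not valid:
        raise ValueError("no allowed save index / store word combination")
    return max(valid), min(valid)


# Hypothetical completion times; the second combination is disallowed.
times = [5.0, math.nan, 4.2, 6.1]
early, late = early_and_late(times)
```

The returned index identifies which save index produced each scenario, so the corresponding TMR and TSR portions of the runtime can be looked up separately for the energy calculation.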
Remembering that the
function used in Equation (
A24) is defined and used in the same manner as in Equation (
A6), the same problem with
possibly being a vector arises for
as well. This affects the indices used in Equations (
A25) and (
A26). If
is a scalar, Equation (
A25) is rewritten in Equation (
A28). If
is a scalar, these equations are rewritten in Equations (
A29)–(
A31) where Equation (
A31) is a continuation of Equation (
A30).
AHR MIPS may also encounter TMR MIPS Type B-Worst errors early or late and these are referred to as TMR Type B-Worst Early and TMR Type B-Worst Late errors. As with the TMR Type A Early error, the TMR Type B-Worst Early error has a minimal impact on runtime. As with the TMR Type A Late error, the TMR Type B-Worst Late error is expected to significantly decrease runtime and increase energy usage. The TMR Type B-Worst Early error is shown in
Figure A11 while the TMR Type B-Worst Late error is shown in
Figure A12.
Figure A11.
AHR MIPS TMR Type B Worst-Case Early error timing diagram.
Figure A12.
AHR MIPS TMR Type B Worst-Case Late error timing diagram.
Equation (
A32) shows how to compute the time to complete an AHR MIPS program with a TMR Type B-Worst Early error where
is the number of loops at which the transition point occurs when accounting for the error,
is the number of save/restore points to create in TMR MIPS when accounting for the error, and
is the time needed to return to the point at which the error occurred after recovering from the error. Note that
and
are the time AHR MIPS spends in TMR and TSR mode respectively when encountering a TMR Type B-Worst Early error.
The time
is computed according to Equation (
A33) where
is the save time difference between consecutive save points and
is the index of the worst-case
. This is nearly identical to Equation (
A11).
Next, the loop count at which the TMR to TSR transition will occur after encountering an error is determined using Equation (
A34).
Finally,
is determined according to Equation (
A35).
Equation (
A36) shows how to compute the time to complete an AHR MIPS program with a TMR Type B-Worst Late error where
is the number of loops at which the transition point occurs when accounting for the error,
is the number of save/restore points to create in TMR MIPS when accounting for the error, and
is the time needed to return to the point at which the error occurred after recovering from the error. Note that
and
are the time AHR MIPS spends in TMR and TSR mode respectively when encountering a TMR Type B-Worst Late error.
The time
is computed according to Equation (
A37). This is nearly identical to Equation (
A11).
Next, the loop count at which the TMR to TSR transition will occur after encountering an error is determined using Equation (
A38).
Finally,
is determined according to Equation (
A39).
In contrast to the TMR errors which can affect the TMR to TSR transition point, TSR errors do not affect the transition point; however, TSR worst-case errors may be affected by the transition point. The best-case errors are unaffected by the transition point.
When AHR MIPS encounters a TSR Best-case error, it encounters it immediately after the creation of a save/restore point. This could be the save/restore point created by the transition from TMR to TSR, or any of the save/restore points created by TSR MIPS after AHR MIPS enters TSR mode. Regardless of which save/restore point the TSR Best-case error follows, the recovery time is always the same. This is because the TSR MIPS Best-case error was defined to be injected immediately prior to the branch comparison instruction before the first store word instruction after creating a save/restore point. A few examples of AHR MIPS TSR Best-case errors are shown in
Figure A13,
Figure A14,
Figure A15 and
Figure A16 where the transition occurs before the first, second, or third TSR save/restore creation point or after the third TSR save/restore creation point, respectively.
Figure A13.
AHR MIPS TSR Best-Case Early Error Timing Diagram 1.
Figure A14.
AHR MIPS TSR Best-Case Early Error Timing Diagram 2.
Figure A15.
AHR MIPS TSR Best-Case Early Error Timing Diagram 3.
Figure A16.
AHR MIPS TSR Best-Case Early Error Timing Diagram 4.
The time needed to complete an AHR MIPS program experiencing a TSR Best-case error is given in Equation (
A40) where
and
were previously defined in Equation (
A12).
TSR Worst-case errors in AHR MIPS require special attention. While TSR Worst-case errors in TSR MIPS take place at the end of creating a save/restore point in the second save/restore point memory segment, that may not be possible in AHR MIPS depending on when the TMR to TSR transition takes place. If that transition occurs before the first TSR MIPS save/restore point is created, then the TSR worst-case error is still encountered at the end of creating a save/restore point in the second segment; in this case this would be the save/restore point created when the loop counter is at 250. This scenario is shown in
Figure A17.
Figure A17.
AHR MIPS TSR Worst-Case Early Error Timing Diagram 1.
When the TMR to TSR transition occurs after what would have been the first TSR MIPS save/restore point creation and before the second TSR MIPS save/restore point creation, there are two possibilities for a worst-case error. These possibilities are shown in
Figure A18. Note that the first save/restore point created after the transition is always to the second save/restore point memory segment. This means that an error at the end of this save/restore point creation may not be the worst-case error. The worst-case error may be the one that occurs at the end of the next save/restore point creation which saves to the first save/restore point memory segment. The time to recover from the error and return to the point at which the error was encountered is calculated for both of these scenarios and the one that takes longer is the worst-case error.
If the TMR to TSR transition occurs after the second TSR MIPS save/restore point creation and before the third, it is not obvious in advance which error is the worst case. According to the original definition of a TSR MIPS Worst-case error, it is an error that maximizes the number of instructions that TSR MIPS must re-execute. Therefore, the error may occur at the end of creating the third TSR MIPS save/restore point or at the last branch comparison at the end of the program. The amount of time to return to the point at which the error occurred is calculated for both scenarios, and the one that takes longer is the worst-case scenario. This is illustrated graphically in
Figure A19.
Figure A18.
AHR MIPS TSR Worst-Case Early Error Timing Diagram 2.
Figure A19.
AHR MIPS TSR Worst-Case Early Error Timing Diagram 3.
Finally, if the TMR to TSR transition occurs after the last TSR MIPS save/restore point creation, the worst-case error occurs at the last branch comparison at the end of the program as shown in
Figure A20.
No errors are injected into Basic MIPS because it has no way of detecting or correcting them. Any errors injected into a register to be stored to memory would not impact the runtime or energy usage of Basic MIPS. The only manifestation would be that the resulting computations would be incorrect.
Equations (
A41) and (
A42) show how to compute the time to complete an AHR MIPS program experiencing a TSR Worst-case error where Equation (
A42) is a continuation of Equation (
A41). If the transition point occurs before the completion of the first 250 loops, the AHR MIPS TSR worst-case error is identical to the TSR MIPS worst-case error in that the added time to complete the program is the same as in Equation (
A14).
If the transition point occurs between the completion of 250 loops and 500 loops, there are two possibilities for the worst-case error. The first is that the error occurs at the end of creating the save/restore point upon completion of 500 loops, in which case all loops after the TMR to TSR transition must be re-completed and the save/restore point must be completed without error as well (). The second is that the error occurs at the end of creating the save/restore point upon completion of 750 loops, in which case all loops after previous save/restore point creation must be re-completed and the save/restore point at loop number 750 must be completed without error as well ().
Figure A20.
AHR MIPS TSR Worst-Case Early Error Timing Diagram 4.
If the transition point occurs between the completion of 500 loops and 750 loops, there are two possibilities for the worst-case error. The first is that the error occurs at the end of creating the save/restore point upon completion of 750 loops, in which case all loops after the TMR to TSR transition must be re-completed and the save/restore point must be completed without error as well (). The second is that the error occurs at the last store word instruction in the program and the nearly 250 complete loops since the creation of the save/restore point at loop 750 must be re-completed (). The only way to know which takes longer to complete is to calculate the values for both, compare the results, and select the larger of the two. If the transition point occurs after the completion of 750 loops, the worst-case error occurs at the last store word at the end of the program and all loops from the TMR to TSR transition to the end of the program must be re-completed.
Note that and are the time AHR MIPS spends in TMR and TSR mode respectively when encountering a TSR Worst-case error.
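The case analysis for the AHR MIPS TSR worst-case error can be summarized in a short sketch that lists the candidate error locations for a given transition point. It counts only the loops that must be re-completed. The equations referenced above compare times, which also include save/restore point creation, so the loop counts here are only a rough proxy. The function name, the `total_loops` parameter, and the handling of a transition that falls exactly on loop 250, 500, or 750 are assumptions made for illustration.

```python
def worst_case_candidates(transition_loop, total_loops):
    """List (error location, loops to re-complete) candidates for one transition point."""
    if transition_loop < 250:
        # Same situation as TSR MIPS: a full 250-loop interval is repeated.
        return [("save/restore point creation, as in TSR MIPS", 250)]
    if transition_loop < 500:
        return [
            ("save/restore point at loop 500", 500 - transition_loop),
            ("save/restore point at loop 750", 250),
        ]
    if transition_loop < 750:
        return [
            ("save/restore point at loop 750", 750 - transition_loop),
            ("last store word", total_loops - 750),
        ]
    return [("last store word", total_loops - transition_loop)]


def worst_case(transition_loop, total_loops):
    """Pick the candidate with the most loops to re-complete."""
    return max(worst_case_candidates(transition_loop, total_loops), key=lambda c: c[1])


# Hypothetical transition at loop 300 in a 1000-loop program.
candidates = worst_case_candidates(300, 1000)
chosen = worst_case(300, 1000)
```

When two candidates exist, both are evaluated and the larger is selected, which matches the compare-and-select procedure described in the text.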