1. Introduction
As parallel algorithms perform many operations on a cluster of independent computing nodes, even a single node crash can cause the execution of an algorithm to halt [
1]. This undesirable property may make large-scale distributed systems more vulnerable to failures [
2]. For this reason, effective fault-tolerance techniques are essentially needed in such systems. Rollback recovery is one such technique that enables the current erroneous state of a distributed system to be restored to a previous failure-free state from the stable storage [
3,
4]. To achieve this goal, the system recovery information must be occasionally saved to the storage for normal operations [
5].
Among the techniques used for rollback recovery, checkpoint-based recovery depends solely upon local states of processes maintained in the stable storage—called checkpoints—to support fault tolerance [
5]. In the case of a failure, the system state is recovered by using the most recent consistent global checkpoint kept in the storage. According to when and how consistent sets of checkpoints are formed, checkpoint-based recovery protocols are categorized into coordinated, independent, and communication-induced checkpointing [
5]. In order to balance trade-offs between independent and coordinated checkpointing in an effective manner, communication-induced checkpointing (CIC) is used to preclude any local checkpoint that have already been taken from becoming useless by performing forced checkpointing while attempting to increase the degree of checkpointing independence as much as possible [
6,
7,
8,
9,
10,
11,
12,
13,
14,
15]. The CIC protocols include HMNR [
8], which uses this feature to enable the number of extra checkpoints to be much lower by effectively using the control information contained in each sent message. Next, an improved HMNR protocol [
12], LazyHMNR, attempts to use a lazy indexing strategy [
13] to alleviate the problem of high frequencies of forced checkpointing that may occur in some particular cases. Next, two protocols, FINE [
9] and LazyFINE [
10], were designed to try to generate fewer forced checkpoints than HMNR and LazyHMNR with the same numbers of variables that they hold. However, they cannot ensure the property of never having useless checkpoints [
7]. Then, an adaptive CIC protocol [
11] was developed in an attempt to delay taking forced checkpointing actions as long as possible by evaluating some safety predicates. One of the most recent CIC protocols [
14] utilizes the effectiveness of the one-to-many transmission of broadcasting links, which is widely used to lower the number of extra checkpoints. However, this may greatly degrade the applicability of the CIC. Lastly, several recent works [
6,
15] exploited one or more protocols mentioned above to raise the dependability of the systems to a higher level in the fields of distributed database management systems and web services.
Generally, checkpointing-based recovery, including CIC, is not subject to the piecewise deterministic (PWD) model; thus, it is less restrictive and more realistic for application in distributed systems than in message-logging-based recovery [
16,
17,
18]. However, the former may not ensure the recovery of the system to its pre-failure state, which generally makes the rollback distance of each process much longer than that in the latter [
5]. Thus, hybrid protocols that combine the two techniques were proposed to compensate for the drawbacks of the first [
5]. However, these protocols cannot be freed from the PWD assumption. For this reason, none of the existing CIC protocols—including a family of HMNR protocols [
6,
8,
9,
10,
11,
12,
13,
14,
15]—developed so far can utilize the benefits of message logging on real-world distributed systems without assuming the PWD model. Whenever each process receives a message, the protocols cause the process to perform forced checkpointing if they decide that one or more checkpoints that have already been taken may become useless. Due to this inherent shortcoming, even the HMNR protocols force each process to take extra checkpoints—more than twice the number of basic checkpoints [
7]. However, the property of never having useless checkpoints is ensured without performing the forced checkpointing action if the process can know precisely that the current state of the sender of the message can be deterministically restored by replaying logged messages from the stable storage in the case of a failure of the sender. We found that this observation may be an important way to drive a large reduction in the number of forced extra checkpoints that all of the previous CIC protocols can incur. This paper introduces a CIC protocol, S-CIC, with message logging that is not subject to the PWD model in order to address the aforementioned observation. The protocol can achieve this goal by only carrying a one-bit variable and an n-size vector into every message transmitted. In the checkpointing and communication patterns where the existing CIC protocols—including the family of HMNR protocols—force the receiver of a message to take extra checkpoints, if the information of the message sender piggybacked on it indicates that the sender’s state right before sending is recoverable, the proposed protocol allows the receiver not to perform forced checkpointing.
3. The Proposed Protocol
S-CIC was devised to maintain the following three behavioral properties.
Similarly to HMNR, each process attaches the state information related to other processes, as well as to itself, to every outgoing message so that the number of extra checkpoints decreases as much as possible.
Even if either of the two cases where a process should perform a forced checkpointing before delivery of a message in HMNR occurs, S-CIC does not have the process perform the task if it knows that the same message can be replayed in spite of any future failures.
Although pessimistic message logging is used to satisfy the second requirement, S-CIC is not subject to the PWD model.
Initially, each process deterministically performs its computation in a certain interval and, if a non-deterministic (ND) event occurs in this interval, the process begins its ND execution interval. In this research field, ND events can be classified into two types of events. The first type includes loggable ND events, of which there is sufficient support for forcing the replay at the same point in case of failure. Message receipt is one type of loggable ND event that most message-logging protocols detect and save in the stable storage for recovery. Aside from this, there are other types of loggable ND events, such as software interrupts or signals, which some other works [
2] attempted to detect in order to make it possible to replay them in case of failure. This effort may raise the rate of the deterministic (DM) execution intervals. The second type comprises unloggable ND events, for which there is no support for taking action to enable a form of repeatable execution in case of failure.
To hold all of them in S-CIC, each process p should always have the following additional state information:
: A vector that saves an element composed of two variables, and , for each process q. [q]. keeps the value of the ssn of the latest message m that q has transmitted and is known by p. [q]. is a boolean value that indicates whether at least one internal unloggable ND event q has been executed before m since q’s latest checkpoint. The two variables are initialized to (0,false). As a message m is transmitted from p, [p]. increments by one and the vector is attached to m. If p takes a local checkpoint, [p]. comes back to false. If a message m is transmitted to p, [q] is updated to m.[q] if m.[q]. > [q]..
-: A boolean variable for detecting whether it there is at least one internal unloggable ND event that any process including p has executed since its latest checkpoint. It is initialized to false and, when a message m is transmitted, it is attached to m. As p performs the receipt of a message or takes a checkpoint, - comes back to false if - = true and ∀q(1≤q≤n): [q]. = false.
: A non-negative integer variable that has the same sequence number as that of the most recent message that p has received.
First, we aimed to understand how to identify the recoverable state of each process with checkpointing and message logging without assuming the use of the PWD model. For this purpose, an execution mode detection method was introduced in order to consider three typical cases that can occur in CIC, as shown in
Figure 2. As shown in
Figure 2a, the first case is that a process performs its computation without any communication with others. In this example, a process
p first executes with its internal DM or loggable ND events in a certain interval from its checkpointed state Ck
(
[
p].
= false), which is called the DM mode (
-
= false). Then, if any first internal unloggable ND event occurs (
[
p].
= true),
p’s ND interval begins and changes its execution mode to non-deterministic (
-
= true). When taking its next checkpoint Ck
,
p’s state becomes recoverable (
[
p].
= false); thus, its execution mode returns to being deterministic (
-
= false). Then, it performs its computation in a similar way. Therefore, in this case, if
p fails at a certain execution point after Ck
, it can restart from the latest checkpoint and recover to the state right before any first unloggable ND event after Ck
without considering any dependency relation with others.
As shown in
Figure 2b, the second case is that an execution of a process
q that is affected by messages transmitted from another single process. In this case,
q first executes in its DM mode from its checkpointed state Ck
(
[
q].
= false,
-
= false). When receiving a message
from
p, whose current mode is ND (
.
[
p].
= true,
-
= true),
q’s execution mode also changes to ND (
-
= true). Then,
q can execute with its internal unloggable ND events, although this is not shown in this figure (
= true). However, even if
q takes its local checkpoint Ck
(
= false), its execution mode still remains ND because
q does not know that
p’s current state is recoverable (
[
p].
= false,
-
= false). When receiving
from
p,
q can recognize that
p’s current mode is DM due to the information piggybacked on
(
.
[
p].
= false,
.
-
= false). Then,
q’s mode becomes DM(
-
= false).
As shown in
Figure 2c, the third case is that a process
q fulfills its computation depending on messages received from more than one process,
p and
r, which execute with internal unloggable ND events. In this case,
q first executes in its DM mode from its checkpointed state Ck
, and then in its ND mode with an unloggable ND event (
[
q].
= true,
-
= true). Next, it receives two messages,
and
, from
p and
r, which are currently both in ND mode (
.
[
r].
= true,
.
-
= true). When taking its next checkpoint Ck
(
[
q].
= false), it still executes in ND mode even if the current modes of the two processes are both DM after their latest checkpoints. This unawareness can be resolved after
q has received
and
(
.
[
r].
= false,
-
= false).
Let us examine how, by using the mode detection method, S-CIC can have its number of extra checkpoints lowered in the two types of NC Z-path patterns in comparison with HMNR, as shown in
Figure 3. An example of the first pattern of NC paths is illustrated in
Figure 3a, which violates the theorem in HMNR. Let the ssns of
p,
q, and
r be
,
, and
for Ck
, Ck
, and Ck
, respectively. When
q sends
to
r, its mode is DM(
-
= false) and its ssn is (
+ 1)(
[
q] = (
+ 1,false)). The two pieces of information are piggybacked on
. On receiving
,
r is in the DM mode (
[
r] = (
,false),
-
= false). Then, it increments its rsn,
, and saves a log element whose form is
e (sender’s identifier, receiver’s identifier, ssn, rsn, data) in the stable storage—for example,
e (
q,
r, (
+ 1),
,
.
) for
. Afterwards, it updates its mode-detection-related information as follows:
-
←
.
-
∨
-
= false,
= {(0,false),(
+ 1,false), (
,false)}. After receiving
, whose ssn is (
+ 1) from
r, and logging it,
p’s mode remains DM, as
r and
p are both in DM mode (
-
←
.
-
∨
-
= false), and its ssn vector
is updated to {(
,false), (
+ 1,false),(
+ 1,false)}. When
q receives
from
p and it is logged, it is in ND mode (
[
q] = (
+ 1,true),
-
= true), and
becomes true, as
[
r],
m.
[
r], and
m.
>
are all true. However,
q does not need to take any forced checkpoints (
.
-
∧
= false) because it knows
p’s state, including the fact that sending
is recoverable (
.
-
= false); thus,
can always be regenerated even if
p crashes. When taking its next checkpoint Ck
,
q’s mode changes to DM and its vector
is updated to {(
+ 1,false),(
+ 1,false),(
+ 1,false)}. Therefore, in this example,
p’s recoverable state until sending
after Ck
, Ck
, and Ck
comprises a globally consistent state.
Figure 3b illustrates an example of the second pattern of NC paths that violate the theorem in HMNR. After
r receives
from
q, whose mode is DM (
.
-
= false,
.
= {(0,false),(
+ 1, false),(0,false)}), and logging it, its mode is ND(
-
∨
-
= true) because of its internal unloggable ND event (
] =(
,true),
-
= true). However, when taking a checkpoint Ck
(
[
r]= (
,false)),
r’s mode changes to DM; thus,
can always be replayed (
.
-
= false). On
’s receipt and logging,
p keeps its DM mode and updates its variables as follows:
-
←
.
-
∨
-
= false, and
= {(
,false),(
+ 1,false),(
+ 1, false)}. When
q receives
from
p and logs it,
becomes true, as
.
[
q]=
[
q] and
.
[
q] are all true. However,
q does not need to take any forced checkpoints (
.
-
∧
= false) because
r and
p can deterministically reproduce and send
and
in order, respectively, even in case of
p’s failure (
.
-
= false). Therefore, Ck
, which is taken after delivering
(
-
= false,
[
q] = (
+ 1,false)), is consistent with the states of the others right after sending
and
.
Figure 4 presents the concrete algorithmic description of S-CIC.
Definition 1. is an arbitrary Z-path from to .
Definition 2. -Z-() is a variable-length non-causal sub-Z-path of .
Lemma 1. If, in -Z-(), .≮, but p’s state immediately before sending m to another process q() is recoverable, no forced checkpointing is required to ensure that no useless local checkpoints exist.
Proof. The correctness of the theorem is proved through contradiction. Assume that when m arrives at q, q has to perform a forced checkpoint action to enforce the safety condition, as q recognizes .≮. Afterwards, if p crashes, it can recover to hold up to the state right after sending m. The state is always consistent with q’s state immediately after receiving m. Thus, the forced checkpoint is not needed to preclude from being useless.
Therefore, if .≮, but p’s state immediately before sending m is recoverable, the property of not having any useless checkpoints can be kept without q’s forced checkpointing before delivering m. This contradicts the hypothesis. □
Theorem 2. S-CIC ensures that no checkpoint is useless.
Proof. S-CIC includes a checkpoint-timestamping mechanism that uses Lamport’s logical clock, as in HMNR. This feature ensures that any variable-length causal sub-Z-path in includes no Z-cycles. Therefore, we only have to prove that the protocol prevents -Z-() from containing any Z-cycle formations. The proof goes on by induction on its length, denoted by (-Z-()).
[Base case] In this case, is an NC Z-path with two messages, and , where is the latest checkpoint before transmitting from p and is the most recent checkpoint after is received by r. Suppose that q takes a checkpoint , sends to r, and then receives from p before its next checkpoint. At this point, two cases must be checked.
Case 1: .- = false.
q knows p’s state, including that sending is recoverable. Based on Lemma 1, even if q takes no forced checkpoints, is always useful.
Case 2: .- = true
If ≤ , . ≤ ≤ < . Thus, includes no Z-cycles. Otherwise, two sub-cases must be checked.
Case 2.1: No causal sub-paths are generated from r to p or q between after r receives and before p sends or q receives it.
In this case, before p sends to q, [r] is true because p knows that > . When q receives , it has to perform a forced checkpointing action before delivery of because is satisfied.
Case 2.2: A causal sub-path, u, from r to p or q exists between after r received and before p sends or q receives it.
At this point, two sub-cases must be checked.
Case 2.2.1: u occurs at r before .
If ≤ u., . < . and never becomes a useless checkpoint, regardless of whether u’s destination is p or q. If p receives u, it updates [r](=[r]) as false. If u goes to q, is updated with and is greater than or equal to . In both cases, S-CIC can recognize that is not satisfied, and q does not perform a forced checkpointing action.
Otherwise, . ≮ and may become useless in both cases. If p receives u, this condition causes [r](=[r]) to remain unchanged (true). In addition, when is transmitted to q, > , as u. ≥ . If u goes to q, [r] is true and > . In both cases, S-CIC can recognize that is satisfied and causes q to perform a forced checkpointing action before delivering m.
Case 2.2.2: u occurs at r after .
On receiving , r can keep the value of the latest checkpoint index of q in [q]. As it takes , [q] is set to true. When r sends u, [q] and [q] are eventually brought to q by a directed path—either <u> or <u, >. In both cases, S-CIC can recognize is satisfied and causes q to perform a forced checkpointing action before the delivery of the message received by q, as in HMNR.
[Induction hypothesis] It is assumed that the theorem is true for if (-Z-()) = l.
[Induction step] By the induction hypothesis, every checkpoint in -Z-() incurs no Z-cycle formations. Therefore, if a new non-causal path forms with -Z-(), (which is considered as one message in the induction hypothesis), and the (l + 1)-th message together, the theorem is true for if (-Z-()) = l + 1. The following case is similar to the base case that was mentioned earlier.
Through induction, our protocol ensures that no checkpoint becomes useless. □
4. Performance Evaluation
Let us examine our extensive simulations to make a comparison of the performance of the two protocols, LazyHMNR and S-CIC, with a discrete-event simulation language, PARSEC [
23]. LazyHMNR is one of the most recently developed versions of HMNR, which is intended to decrease the high frequency of forced checkpointing [
7]. S-CIC is our improved version of LazyHMNR with the advantageous features mentioned in the previous sections.
In this comparison, we precisely examine one important performance index,
. The index indicates the total number of forced checkpoints taken in each protocol. The simulated system is a cluster of
N computers on a broadcast network. All processes running on each computer begin and finish their individual execution together. As a process transmits an application message, the message is destined to a single recipient at all times. The link capacity and propagation delay of the simulated network are 100 Mbps and 1 ms, respectively. Every process performs a basic local checkpointing task in a certain checkpoint interval according to an exponential distribution with a mean
= 5 min. In addition, among
N processes, one is selected at random and transfers a message in every timed interval according to an exponential distribution with a mean of
= 3 s. Furthermore, to measure the communication pattern sensitivity of the two protocols, more complex experiments were conducted by splitting applications into four groups: serial, circular, hierarchical, and irregular [
24].
Figure 5,
Figure 6,
Figure 7 and
Figure 8 show the
for both LazyHMNR and S-CIC with changes in the numbers of processes—denoted by
—scaling from 6 to 12 when the percentage of internal unloggable ND events in each process (
) was 20%, 40%, 60%, and 80% for the four different communication patterns, respectively. In these figures,
never changed the
of LazyHMNR because, unlike S-CIC, LazyHMNR has no method for skipping forced checkpointing actions if the state of the sender of each message right before the receipt of the message is recoverable. As
increased, the ratio of the
of LazyHMNR to that of S-CIC increased for all four patterns according to the change in
, which ranged from 1.3 to 6.5. The main reason is that there was an increased possibility of forming the two kinds of NC Z-path patterns and inducing forced checkpointing in LazyHMNR when
or
was satisfied. In addition, the occurrence of fewer unloggable ND events per process (i.e., decreasing
) led to a significant decrease in the forced checkpointing overhead of LazyHMNR due to the advantageous features of S-CIC. These results indicate that S-CIC frequently skips actions to take forced checkpoints for each process by checking the recoverability of the dependent states of other processes, unlike LazyHMNR. In addition, the features have the effect of a large reduction in the number of forced checkpoints, regardless of the communication patterns. However, the degree of their effectiveness may fluctuate in the irregular pattern because its irregularity can cause the formation of Z-cycles and can cause the first type of NC Z-path to differ in every run. From the results, we can see that, with these features, S-CIC can alleviate the shortcomings of the family of HMNR protocols, including LazyHMNR.