Article

Availability Analysis of Software Systems with Rejuvenation and Checkpointing

1
Department of Information Science and Engineering, Ritsumeikan University, 1-1-1 Nojihigashi, Kusatsu 5258577, Japan
2
Graduate School of Advanced Science and Engineering, Hiroshima University, 1-4-1 Kagamiyama, Higashihiroshima 7398527, Japan
*
Author to whom correspondence should be addressed.
Mathematics 2021, 9(8), 846; https://doi.org/10.3390/math9080846
Submission received: 15 March 2021 / Revised: 5 April 2021 / Accepted: 9 April 2021 / Published: 13 April 2021
(This article belongs to the Special Issue Mathematics in Software Reliability and Quality Assurance)

Abstract

In software reliability engineering, software-rejuvenation and -checkpointing techniques are widely used for enhancing system reliability and strengthening data protection. In this paper, a stochastic framework composed of a composite stochastic Petri reward net and its resulting non-Markovian availability model is presented to capture the dynamic behavior of an operational software system in which time-based software rejuvenation and checkpointing are both aperiodically conducted. In particular, apart from the software-aging problem that may cause the system to fail, human-error factors (i.e., a system operator’s misoperations) during checkpointing are also considered. To solve the stationary solution of the non-Markovian availability model, which is derived on the basis of the reachability graph of stochastic Petri reward nets and is actually not one of the trivial stochastic models such as the semi-Markov process and the Markov regenerative process, the phase-expansion approach is considered. In numerical experiments, we illustrate steady-state system availability and find optimal software-rejuvenation policies that maximize steady-state system availability. The effects of human-error factors on both steady-state system availability and the optimal software-rejuvenation trigger timing are also evaluated. Numerical results showed that human errors during checkpointing both decreased system availability and brought a significant effect on the optimal rejuvenation-trigger timing, so that it should not be overlooked during system modeling.

1. Introduction

In software reliability engineering, various software fault-tolerance techniques such as software rejuvenation and checkpointing are widely used for enhancing system reliability and strengthening data protection. Software rejuvenation is a countermeasure against software aging, which refers to the phenomenon that the performance or dependability of software systems degrades with time, caused by aging-related bugs [1,2], eventually resulting in system failures. In 1995, Huang et al. [3] first reported the aging phenomenon in real telecommunication billing applications where the application experienced a crash or a hang failure over time. The software-aging phenomenon exists in the real world and is inevitable, but can nevertheless be controlled or even reversed [1,2,4]. Software rejuvenation plays a central role in counteracting aging issues by refreshing the system’s internal states. However, as pointed out by Alonso et al. [5], the software rejuvenation can address aging issues well, but typically involves an overhead since the system becomes unavailable during rejuvenation. That is to say, it is necessary and important to determine an optimal rejuvenation schedule for achieving the best trade-off between target performance or dependability and the associated overhead. To date, there are a number of works devoted to solving such optimization problems [6,7,8,9,10]. For example, Vaidyanathan and Trivedi [6] presented a semi-Markov reward model for a UNIX operating system, and used this model to derive optimal software-rejuvenation schedules in terms of system availability or downtime cost. Dohi et al. [9] considered two basic software-rejuvenation models described by Markov regenerative processes (MRGPs), and provided transient solutions using Laplace–Stieltjes transform (LST) and their numerical inversion. In [9], an optimal software-rejuvenation policy that maximized interval system reliability was numerically determined. 
Wang and Liu [10] recently offered a real-time decision method for optimal software-rejuvenation timing through simulating and modeling the state-transition process of software aging and constructing the rejuvenation decision function using an analytic hierarchy process.
In the context of data protection, a typical technique is checkpointing, which is an efficient method for saving re-execution time in the presence of faults [11] by saving current data in the main memory to secondary storage. Checkpointing is easy to conduct and has been widely studied for decades [12,13,14,15,16]. For example, Fukumoto et al. [12] and Dohi et al. [13] introduced different checkpointing schemes for database systems, and Ranganathan and Upadhyaya [14] considered the temporal behavior related to database system states from a macroscopic viewpoint. Some of the literature also considered software rejuvenation and checkpointing together [17,18,19,20]. Okamura and Dohi [17] focused on two kinds of maintenance policies for a software system, and adopted a dynamic programming approach to comprehensively evaluate aperiodic checkpointing and rejuvenation schemes in the system. In [19], the authors introduced a stochastic reward Petri net (SRN) [21] to model a software system whose state moves to the execution process immediately after a rollback recovery. In particular, according to SRN analysis, a non-Markovian state-transition diagram was derived. More recently, a system similar to, but somewhat different from, that in [19] was considered in [20], in which the system executes checkpointing immediately after a rollback recovery in order to update the starting point of the recovery operation from the past to the current time. In these previous works, the systems underwent both aperiodic checkpointing and software rejuvenation, and their transition diagrams are not among the trivial stochastic models such as the semi-Markov process (SMP) and the MRGP. This means that common approaches such as the LST and embedded Markov chain techniques cannot be directly applied. 
To solve these complex non-Markovian transition diagrams, the phase (PH) expansion approach [22,23], which is an approximation technique using the phase-type (PH) distribution, was utilized and worked well in different contexts. Moreover, in [19,20], it was assumed that system failures are caused only by aging problems, but in fact, human error is inescapable [24], and the system operator’s misoperations during checkpointing cannot be ignored [25].
In this paper, we consider the different software systems from [19,20], where both aperiodic checkpointing and software rejuvenation were executed, and system failure occurred due to both software aging and human errors in checkpointing. A stochastic framework composed of a composite SRN and its resulting non-Markovian availability model is presented to capture the dynamics of the system from a macroscopic point of view. More specifically, the non-Markovian availability model was derived from the reachability graph of the composite SRN model. On the basis of the non-Markovian availability model, which is also a nontrivial model including multiple competitive events as in [19,20], we formulated the steady-state availability of the system by means of PH expansion, and then determined the optimal software-rejuvenation schedule that maximized steady-state system availability. The effects of human-error factors on both steady-state system availability and optimal software-rejuvenation schedule are investigated. The main differences between this work and previous ones [19,20] are that we (i) consider both aging-related and human-error-related system failures, of which the latter was overlooked in previous works; and (ii) investigate the effect of human-error factors on system availability and software rejuvenation. For brevity, the main contributions of this paper are summarized as twofold:
  • stochastic modeling of software systems that undergo both software rejuvenation and checkpointing, and may fail due to both the aging problem and human errors in checkpointing;
  • investigation of the effects of human-error factors on both steady-state system availability and optimal software-rejuvenation trigger timing by the comparison of cases where human-error-related system failures are considered or not.
The remainder of this paper is organized as follows. In Section 2, a stochastic framework composed of a composite SRN and its corresponding non-Markovian state-transition diagram for an operational software system with software rejuvenation and checkpointing are introduced. In particular, a reachability graph was generated from the composite SRN, and on its basis, a non-Markovian state-transition diagram was obtained. Section 3 first defines continuous PH distribution and presents an approach to formulate the steady-state system availability of the non-Markovian model by using the underlying approximate CTMC of the non-Markovian model, which was derived by replacing all general distributions with their corresponding PH distributions. In Section 4, we describe conducted numerical experiments that evaluated system availability, determined the optimal software-rejuvenation trigger timing, and quantified the effects of human-error factors. Lastly, in Section 5, we conclude this paper with some remarks.

2. Macroscopic System Model

In this section, we first introduce the system assumptions and then present a stochastic framework consisting of a composite SRN and its resulting non-Markovian transition diagram to model operational software systems from a macroscopic point of view. More specifically, the non-Markovian transition diagram was derived on the basis of a reachability graph, which was generated from analysis of the composite SRN.

2.1. System Assumptions

Consider an operational software system that aperiodically executes checkpointing for saving current data in the main memory in secondary storage. Without loss of generality, it was assumed that the system suffers from software aging, so that it may fail due to aging-related bugs, such as a memory leak and the accumulation of round-off errors. On the other hand, system failure might also be caused by incorrect operation by the operator during the execution of checkpointing. Once system failure occurred, a series of recovery operations that include checkpointed data loading and rollback recovery were conducted to recover the system. In addition, software rejuvenation was adopted to counteract the aging problem. A few other assumptions:
  • the checkpointing operation just saves the current data and does not refresh system aging;
  • the clock of the rejuvenation trigger is not reset and continuously accumulates even when the system executes the checkpointing;
  • when a rejuvenation point is reached while the system is under checkpointing, the rejuvenation waits until the checkpointing is completed;
  • the system is regarded as good as new after either rollback recovery or rejuvenation.

2.2. Stochastic Reward Nets

On the basis of the above assumptions, the dynamics of the system are described by a composite SRN as in Figure 1 and Figure 2. Concretely, the composite SRN contains three submodels: clock model for system aging (Figure 1a), clock model for software rejuvenation (Figure 1b), and SRN model for system behavior (Figure 2). In these SRNs, transitions are divided into three types: (i) immediate (IMM) transition (represented by a thin black bar), which means the zero firing time transition; (ii) exponential (EXP) transition (represented by a white rectangle), which refers to the exponentially distributed firing time transition; and (iii) general (GEN) transition (represented by a thick black bar), which is generally distributed firing time transition. The places are defined as follows:
  • $P_{fclock}$: software aging accumulates as time passes.
  • $P_{fsignal}$: it is time for an aging-related system failure to occur.
  • $P_{rclock}$: time is accumulated to trigger a rejuvenation.
  • $P_{rsignal}$: a rejuvenation point was reached.
  • $P_{normal}$: the system waits for checkpointing and rejuvenation in the normal execution process.
  • $P_{checkpointing}$: the system is under checkpointing.
  • $P_{rejuvenation}$: the system is under rejuvenation.
  • $P_{failure}$: the system fails due to either aging-related bugs or human-error factors, and checkpointed data are loaded for rollback recovery.
  • $P_{recovery}$: rollback recovery is executed to recover the failed system.
  • $P_{completed}$: the system becomes as good as new after the completion of either rejuvenation or rollback recovery.
On the other hand, transitions $T_{cint}$, $T_{trigger}$, and $T_{fail1}$ correspond to the trigger intervals of checkpointing and rejuvenation, and the system lifetime, respectively. Transitions $T_{checkpointing}$, $T_{rejuvenation}$, $T_{load}$, and $T_{recovery}$ represent the operations of checkpointing, rejuvenation, loading of checkpointed data, and rollback recovery, respectively. Transitions $T_{fail2}$ and $T_{fail3}$ are both EXP transitions, representing failures caused by incorrect operations by the operators. Once IMM transition $t_{rej}$ fires with its guard function $G_{rej}$ satisfied, the system is immediately rejuvenated. If a token appears in place $P_{fsignal}$, either transition $t_{fail1}$ or transition $t_{fail2}$ fires due to the exhausted lifetime. Transitions $t_{freset}$ and $t_{rreset}$ represent the resets of the clocks, and $t_{normal}$ means that the system becomes normal again at the same time as the clocks are reset. The details of the guard functions are shown in Table 1.

2.3. Reachability Graph

A Petri net’s reachability graph is a directed graph composed of nodes and edges, which represent reachable markings and transitions between two reachable markings, respectively. From the analysis of the composite SRN described in Section 2.2, a reachability graph starting with the initial marking $\{P_{normal}: 1, P_{fclock}: 1, P_{rclock}: 1\}$ (places holding no token are omitted for brevity) is generated and depicted in Figure 3. The descriptions of the nodes in the graph are summarized in Table 2. For example, node GEN ($T_{cint}$ → enable, $T_{fail1}$ → enable, $T_{trigger}$ → enable) is the initial marking and represents the normal execution state of the system in which transitions $T_{cint}$, $T_{fail1}$, and $T_{trigger}$ are all enabled. Both nodes GEN ($T_{checkpointing}$ → enable, $T_{fail1}$ → enable, $T_{trigger}$ → enable) and GEN ($T_{checkpointing}$ → enable, $T_{fail1}$ → enable) correspond to the checkpointing execution states, and the difference between them is whether a rejuvenation point was reached. Node GEN ($T_{load}$ → enable) means that the system failed, and the loading of checkpointed data is being executed. The graph shows that there exist two edges from either node GEN ($T_{checkpointing}$ → enable, $T_{fail1}$ → enable, $T_{trigger}$ → enable) or node GEN ($T_{checkpointing}$ → enable, $T_{fail1}$ → enable) to node GEN ($T_{load}$ → enable). This is explained by the fact that, during checkpointing, the system may fail due to aging-related bugs or human-error factors; that is, of the two edges, one represents the GEN transition $T_{fail1}$ and the other corresponds to the EXP transition $T_{fail3}$.

2.4. Non-Markovian State-Transition Diagram

From the reachability graph in Section 2.3, a non-Markovian state-transition diagram was derived as shown in Figure 4. This model consists of seven states: $Normal$, $Checkpointing$, $Checkpointing'$, $Rejuvenation$, $Failure1$, $Recovery$, and $Failure2$. State $Normal$ is the initial state and represents that the system is in the normal execution process in the main memory and waits for checkpointing and rejuvenation. Once a checkpoint is reached prior to the rejuvenation point, the system state becomes $Checkpointing$, in which data in the main memory are saved to secondary storage. Since the checkpointing operation does not reset the clock of the rejuvenation trigger, a rejuvenation point may be reached during checkpointing. In such a case, the system enters state $Checkpointing'$, which represents the checkpoint execution with enabled rejuvenation. After the completion of checkpointing, the system transitions from state $Checkpointing'$ to state $Rejuvenation$. If a rejuvenation point is reached prior to the checkpoint, the system immediately executes rejuvenation and enters state $Rejuvenation$ from state $Normal$. As mentioned in Section 2.1, system failure may occur due to aging-related bugs and human-error factors. Thus, two failure states, $Failure1$ and $Failure2$, were defined to distinguish the two kinds of system failures. When the system fails, a series of recovery operations, including checkpointed data loading and rollback recovery, are conducted to recover the system from failure. Lastly, the system becomes $Normal$ again from state $Recovery$. Of course, the system may fail before both checkpointing and rejuvenation. The details of the state notation are given in Table 3.
Table 4 summarizes the cumulative distribution functions (CDFs) of the corresponding transitions in the state-transition diagram. In this table, GEN represents general distribution, and EXP means exponential distribution. The reasons for making such assumptions of probability distributions can be found in [20]. The checkpoint interval was assumed to follow general distribution $G_{intv}(t)$, and the CDF of the time needed for checkpointing is given by $G_{cp}(t)$. The time for an aging-related failure to occur follows a general distribution $G_{fail}(t)$ with increasing failure rate (IFR), while the time distributions for failures occurring during both rollback recovery and checkpointing due to incorrect operations by operators are given by $F_{fail1}(t)$ and $F_{fail2}(t)$ with constant failure rates (CFRs) $\lambda_{fail1}$ and $\lambda_{fail2}$, respectively. Similarly, the rejuvenation-trigger interval distribution is described by $G_{trig}(t)$, and its relevant overhead distribution is represented by $G_{rej}(t)$. The probability distributions of the loading time of checkpointed data and the time needed for rollback recovery are given by $G_{load}(t)$ and $G_{rc}(t)$, respectively.
Figure 4 shows states $Normal$ and $Checkpointing$ highlighted by a dashed rectangle labeled with $G_{fail}(t)$ and $G_{trig}(t)$, indicating that the GEN transitions associated with $G_{fail}(t)$ and $G_{trig}(t)$ are enabled and could fire in either the $Normal$ or the $Checkpointing$ state. In the same way, the dashed rectangle around $Checkpointing$ and $Checkpointing'$ indicates the possible firings of the GEN and EXP transitions associated with $G_{fail}(t)$, $G_{cp}(t)$, and $F_{fail2}(t)$. Because these transitions remain enabled across state changes, the non-Markovian state-transition diagram under consideration is neither an SMP nor an MRGP, which makes numerical analysis difficult. To cope with this issue, in this paper we consider the PH expansion approach [22], which has proved to be efficient for solving this kind of non-Markovian state-transition model [19,20,26].

3. System Availability Analysis

This section first introduces the well-known continuous PH distribution [22] and then derives the underlying approximate CTMC for the non-Markovian state-transition diagram in Figure 4 via PH expansion approach, of which the essential idea is to replace general distribution with its corresponding PH distribution at a high accuracy level. Lastly, the stationary solution for the model in Figure 4 through CTMC analysis is presented. The measure of interest is steady-state system availability, which is defined as the probability that the system is operational in the steady state.

3.1. Continuous PH Distribution

Continuous PH distribution is defined as the probability distribution of absorbing time in a finite CTMC with absorbing states, and it is widely applied in various fields, such as reliability assessment [26], queueing systems [27], and random telegraph noise analysis [28]. Without loss of generality, we define Q as an infinitesimal generator matrix of a CTMC that has m transient states and one absorbing state, and then partition Q into four parts as below:
$$Q = \begin{pmatrix} T & \xi \\ \mathbf{0} & 0 \end{pmatrix}.$$
In the above, T and ξ represent transition rates among transient states and exit rates from transient states to the absorbing state, respectively. Defining α as an initial probability vector over the transient states, we have the CDF and probability density function (PDF) for the continuous PH distribution:
$$F_{PH}(t) = 1 - \alpha \exp(T t)\,\mathbf{1}, \qquad f_{PH}(t) = \alpha \exp(T t)\,\xi,$$
where $\mathbf{1}$ is a column vector of ones. Exit vector $\xi$ is given by $\xi = -T\mathbf{1}$. Transient states are generally called phases.
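As a minimal sketch of how the CDF and PDF above can be evaluated numerically, the following pure-Python code computes $\exp(Tt)$ by scaling and squaring with a truncated Taylor series. The helper names (`mat_exp`, `ph_cdf`, `ph_pdf`, `exit_vector`) and the two-phase Erlang example with rate 2 are illustrative assumptions, not part of the original study.

```python
import math

def mat_mul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_exp(a, terms=30):
    """exp(a) via scaling and squaring with a truncated Taylor series."""
    n = len(a)
    norm = max(sum(abs(x) for x in row) for row in a)
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scale = 2.0 ** squarings
    scaled = [[x / scale for x in row] for row in a]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term holds scaled^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def exit_vector(T):
    """xi = -T 1: exit rate from each phase to the absorbing state."""
    return [-sum(row) for row in T]

def ph_cdf(alpha, T, t):
    """F_PH(t) = 1 - alpha exp(Tt) 1."""
    e = mat_exp([[x * t for x in row] for row in T])
    return 1.0 - sum(alpha[i] * sum(e[i]) for i in range(len(T)))

def ph_pdf(alpha, T, t):
    """f_PH(t) = alpha exp(Tt) xi."""
    e = mat_exp([[x * t for x in row] for row in T])
    xi = exit_vector(T)
    n = len(T)
    return sum(alpha[i] * e[i][j] * xi[j] for i in range(n) for j in range(n))

# Hypothetical example: Erlang distribution with 2 phases and rate 2.
erlang_alpha = [1.0, 0.0]
erlang_T = [[-2.0, 2.0], [0.0, -2.0]]
```

For the Erlang example the closed forms $1 - e^{-2t}(1 + 2t)$ and $4t e^{-2t}$ are available, so `ph_cdf` and `ph_pdf` can be checked against them at any $t$.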
Continuous PH distribution can be categorized into several subclasses according to the structure of T [29]. When phase transition is acyclic, the corresponding PH distribution is called acyclic PH distribution (APH). The APH is the widest class among mathematically tractable PH distributions, and it can be converted into the canonical form (CF), which is the minimal representation of APH with the smallest number of free parameters [30]. The APH and its CF are important from the viewpoint of practical applications because it covers some well-known probability distributions, such as exponential distribution, Erlang distribution, and their mixtures. In particular, canonical form 1 (CF1) is usually considered and defined by
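$$\alpha = \begin{pmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_m \end{pmatrix},$$
$$T = \begin{pmatrix} -\beta_1 & \beta_1 & & & O \\ & -\beta_2 & \beta_2 & & \\ & & \ddots & \ddots & \\ & & & -\beta_{m-1} & \beta_{m-1} \\ O & & & & -\beta_m \end{pmatrix},$$
$$\xi = \begin{pmatrix} 0 & 0 & \cdots & 0 & \beta_m \end{pmatrix}^{\top},$$
where $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$, and $0 < \beta_1 \le \cdots \le \beta_m$ for $m$ phases.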
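To make the CF1 structure concrete, the sketch below builds $T$ and $\xi$ from given $\alpha$ and $\beta$ and computes the mean of the resulting distribution. The function names and the numerical values are hypothetical; the mean formula uses the fact that a sample starting in phase $i$ passes through phases $i, \dots, m$ in sequence, each with mean sojourn time $1/\beta_j$.

```python
def cf1(alpha, beta):
    """Build (T, xi) of canonical form 1 from alpha and the rates beta."""
    m = len(beta)
    if len(alpha) != m:
        raise ValueError("alpha and beta must have the same length")
    if any(a < 0 for a in alpha) or abs(sum(alpha) - 1.0) > 1e-12:
        raise ValueError("alpha must be a probability vector")
    if beta[0] <= 0 or any(beta[i] > beta[i + 1] for i in range(m - 1)):
        raise ValueError("rates must satisfy 0 < beta_1 <= ... <= beta_m")
    T = [[0.0] * m for _ in range(m)]
    for i in range(m):
        T[i][i] = -beta[i]          # leave phase i at rate beta_i
        if i + 1 < m:
            T[i][i + 1] = beta[i]   # and always move on to phase i + 1
    xi = [0.0] * (m - 1) + [beta[-1]]  # only the last phase is absorbed
    return T, xi

def cf1_mean(alpha, beta):
    """Mean absorption time: sum_i alpha_i * sum_{j >= i} 1 / beta_j."""
    m = len(beta)
    return sum(alpha[i] * sum(1.0 / beta[j] for j in range(i, m))
               for i in range(m))

# Hypothetical two-phase example.
example_alpha = [0.5, 0.5]
example_beta = [1.0, 2.0]
example_T, example_xi = cf1(example_alpha, example_beta)
```

In this representation every row of $T$ plus the matching entry of $\xi$ sums to zero, which is a quick sanity check that $(T, \xi)$ forms a valid generator.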
In this paper, continuous PH distribution was applied to approximate all general distributions in the non-Markovian state-transition diagram, that is, to determine a PH distribution with parameters $(\alpha, T, \xi)$ that fits the target distribution well by means of the maximum likelihood estimation (MLE) approach [22].

3.2. PH-Expanded CTMC

According to the definition of PH distribution in Section 3.1, we define the general distributions in Table 4 by PH distributions with appropriate phases as follows:
$$F_{intv}^{PH}(t) = 1 - \alpha_{intv} \exp(T_{intv} t)\,\mathbf{1}_{intv}, \qquad f_{intv}^{PH}(t) = \alpha_{intv} \exp(T_{intv} t)\,\xi_{intv},$$
$$F_{fail}^{PH}(t) = 1 - \alpha_{fail} \exp(T_{fail} t)\,\mathbf{1}_{fail}, \qquad f_{fail}^{PH}(t) = \alpha_{fail} \exp(T_{fail} t)\,\xi_{fail},$$
$$F_{cp}^{PH}(t) = 1 - \alpha_{cp} \exp(T_{cp} t)\,\mathbf{1}_{cp}, \qquad f_{cp}^{PH}(t) = \alpha_{cp} \exp(T_{cp} t)\,\xi_{cp},$$
$$F_{load}^{PH}(t) = 1 - \alpha_{load} \exp(T_{load} t)\,\mathbf{1}_{load}, \qquad f_{load}^{PH}(t) = \alpha_{load} \exp(T_{load} t)\,\xi_{load},$$
$$F_{rc}^{PH}(t) = 1 - \alpha_{rc} \exp(T_{rc} t)\,\mathbf{1}_{rc}, \qquad f_{rc}^{PH}(t) = \alpha_{rc} \exp(T_{rc} t)\,\xi_{rc},$$
$$F_{trig}^{PH}(t) = 1 - \alpha_{trig} \exp(T_{trig} t)\,\mathbf{1}_{trig}, \qquad f_{trig}^{PH}(t) = \alpha_{trig} \exp(T_{trig} t)\,\xi_{trig},$$
$$F_{rej}^{PH}(t) = 1 - \alpha_{rej} \exp(T_{rej} t)\,\mathbf{1}_{rej}, \qquad f_{rej}^{PH}(t) = \alpha_{rej} \exp(T_{rej} t)\,\xi_{rej}.$$
Here, PH parameters $(\alpha_x, T_x, \xi_x)$, $x \in \{intv, fail, cp, load, rc, trig, rej\}$, were estimated on the basis of MLE using an expectation–maximization (EM) algorithm [22,31]. Using the above-estimated PH distributions to replace general distributions, the non-Markovian transition diagram was expanded into an approximate CTMC, alternatively called PH-expanded CTMC, of which the infinitesimal generator matrix is given by
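$$Q = \begin{pmatrix}
T_{intv} \oplus T_{fail} \oplus T_{trig} & (\xi_{intv}\alpha_{cp}) \otimes I \otimes I & O & (\mathbf{1}_{intv} \otimes \mathbf{1}_{fail} \otimes \xi_{trig})\,\alpha_{rej} & (\mathbf{1}_{intv} \otimes \xi_{fail} \otimes \mathbf{1}_{trig})\,\alpha_{load} & O & O \\
(\xi_{cp}\alpha_{intv}) \otimes I \otimes I & T_{cp} \oplus T_{fail} \oplus T_{trig} \oplus (-\lambda_{fail2}) & I \otimes I \otimes \xi_{trig} & O & (\mathbf{1}_{cp} \otimes \mathbf{1}_{trig} \otimes \xi_{fail})\,\alpha_{load} & O & (\mathbf{1}_{cp} \otimes \mathbf{1}_{trig} \otimes \mathbf{1}_{fail}\,\lambda_{fail2})\,\alpha_{load} \\
O & O & T_{fail} \oplus T_{cp} \oplus (-\lambda_{fail2}) & (\mathbf{1}_{fail} \otimes \xi_{cp})\,\alpha_{rej} & (\xi_{fail} \otimes \mathbf{1}_{cp})\,\alpha_{load} & O & (\mathbf{1}_{fail} \otimes \mathbf{1}_{cp}\,\lambda_{fail2})\,\alpha_{load} \\
\xi_{rej}\,(\alpha_{intv} \otimes \alpha_{fail} \otimes \alpha_{trig}) & O & O & T_{rej} & O & O & O \\
O & O & O & O & T_{load} & \xi_{load}\alpha_{rc} & O \\
\xi_{rc}\,(\alpha_{intv} \otimes \alpha_{fail} \otimes \alpha_{trig}) & O & O & O & (\lambda_{fail1}\mathbf{1}_{rc})\,\alpha_{load} & (-\lambda_{fail1}) \oplus T_{rc} & O \\
O & O & O & O & O & \xi_{load}\alpha_{rc} & T_{load}
\end{pmatrix}. \tag{13}$$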
The infinitesimal generator matrix is derived on the basis of the Kronecker representation [23], and the order of states is {Normal, Checkpointing, Checkpointing’, Rejuvenation, Failure1, Recovery, Failure2}. In Equation (13), $\otimes$ and $\oplus$ are the Kronecker product and sum [32], respectively, $I$ is an identity matrix of appropriate dimension, and $1/\lambda_{fail1}$ and $1/\lambda_{fail2}$ are the mean values of EXP distributions $F_{fail1}(t)$ and $F_{fail2}(t)$, i.e., the mean times to failure during rollback recovery and checkpointing, respectively.
Entry $(\xi_{intv}\alpha_{cp}) \otimes I \otimes I$ shows that the clock of the rejuvenation trigger is not reset and continuously accumulates, even when the system executes checkpointing. Since the checkpointing operation just saves the current data and does not refresh system aging, entry $(\xi_{cp}\alpha_{intv}) \otimes I \otimes I$ indicates that only the clock of the checkpointing trigger is reset. When a rejuvenation point is reached while the system is under checkpointing, rejuvenation waits until checkpointing is completed; in such a case, the system transitions from $Checkpointing$ to $Checkpointing'$ with entry $I \otimes I \otimes \xi_{trig}$. Entries $(\mathbf{1}_{intv} \otimes \xi_{fail} \otimes \mathbf{1}_{trig})\,\alpha_{load}$, $(\mathbf{1}_{cp} \otimes \mathbf{1}_{trig} \otimes \xi_{fail})\,\alpha_{load}$, and $(\xi_{fail} \otimes \mathbf{1}_{cp})\,\alpha_{load}$ indicate aging-related failures in both normal and checkpointing states, while entries $(\mathbf{1}_{cp} \otimes \mathbf{1}_{trig} \otimes \mathbf{1}_{fail}\,\lambda_{fail2})\,\alpha_{load}$ and $(\mathbf{1}_{fail} \otimes \mathbf{1}_{cp}\,\lambda_{fail2})\,\alpha_{load}$ represent human-error-related failures during checkpointing. In addition, the system is regarded to be as good as new after either rollback recovery or rejuvenation, so the corresponding transitions are represented by entries $\xi_{rej}\,(\alpha_{intv} \otimes \alpha_{fail} \otimes \alpha_{trig})$ and $\xi_{rc}\,(\alpha_{intv} \otimes \alpha_{fail} \otimes \alpha_{trig})$, where $(\alpha_{intv} \otimes \alpha_{fail} \otimes \alpha_{trig})$ implies that the clocks of the checkpointing trigger, system aging, and the rejuvenation trigger are refreshed at the same time.
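The Kronecker operations used in the generator can be illustrated with a small pure-Python sketch. The helper names (`kron`, `kron_sum`) and the toy matrices are assumptions for illustration only. The sketch uses the standard definitions, with $A \oplus B = A \otimes I_B + I_A \otimes B$, so that adding a scalar rate such as $(-\lambda_{fail2})$ by Kronecker sum simply subtracts that rate from every diagonal element.

```python
def identity(n):
    """n x n identity matrix as a list of lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def kron(a, b):
    """Kronecker product: block (i, j) of the result is a[i][j] * b."""
    rows_b, cols_b = len(b), len(b[0])
    return [[a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
             for j in range(len(a[0]) * cols_b)]
            for i in range(len(a) * rows_b)]

def kron_sum(a, b):
    """Kronecker sum of square matrices: a (x) I_b + I_a (x) b."""
    left = kron(a, identity(len(b)))
    right = kron(identity(len(a)), b)
    return [[left[i][j] + right[i][j] for j in range(len(left))]
            for i in range(len(left))]

# Hypothetical 2-phase clock running concurrently with an EXP event of rate 3.
clock_T = [[-1.0, 1.0], [0.0, -1.0]]
exp_rate = [[-3.0]]
joint_T = kron_sum(clock_T, exp_rate)   # [[-4.0, 1.0], [0.0, -4.0]]

# Two concurrent clocks with 2 and 3 phases yield a 6-phase joint block.
three_phase_T = [[-2.0, 2.0, 0.0], [0.0, -2.0, 2.0], [0.0, 0.0, -2.0]]
joint_dim = len(kron_sum(clock_T, three_phase_T))
```

The Kronecker sum models clocks that run concurrently and independently, which is why the dimension of a diagonal block is the product of the phase counts of the transitions enabled in that state.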

3.3. Steady-State System Availability

Steady-state system availability gives the probability that the system is operational in the steady state, so that it provides significant insight into the long-term performance of a repairable system. Let $A_{ss}$ denote the steady-state system availability. Then, we can obtain it by
$$A_{ss} = \pi_{ss}\, r,$$
where $\pi_{ss}$ is the steady-state probability vector of the PH-expanded CTMC with generator $Q$, and can be computed by solving the following linear equations [33]:
$$\pi_{ss} Q = \mathbf{0}, \qquad \pi_{ss} \mathbf{1} = 1,$$
and $r$ is the reward (column) vector of the PH-expanded CTMC and given by
$$r = \begin{pmatrix} 1 \cdot (\mathbf{1}_{intv} \otimes \mathbf{1}_{fail} \otimes \mathbf{1}_{trig}) \\ 0 \cdot (\mathbf{1}_{cp} \otimes \mathbf{1}_{fail} \otimes \mathbf{1}_{trig}) \\ 0 \cdot (\mathbf{1}_{fail} \otimes \mathbf{1}_{cp}) \\ 0 \cdot \mathbf{1}_{rej} \\ 0 \cdot \mathbf{1}_{load} \\ 0 \cdot \mathbf{1}_{rc} \\ 0 \cdot \mathbf{1}_{load} \end{pmatrix}.$$
It is clear that the system is only available in the normal execution process state. In this paper, one problem of interest is to determine optimal software-rejuvenation timing that maximizes steady-state system availability.
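The computation of $A_{ss}$ can be sketched on a deliberately tiny CTMC. The three states (up, rejuvenation, failure), the purely exponential rates, and the function name `steady_state` below are all hypothetical simplifications; the actual PH-expanded generator is far larger and would call for a sparse solver. The sketch solves the balance equations together with the normalization condition by Gaussian elimination and then applies a 0/1 reward vector.

```python
def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 by Gaussian elimination."""
    n = len(Q)
    # Transpose Q so that unknowns form a column vector, then replace the
    # last (redundant) balance equation with the normalization condition.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    pi = [0.0] * n
    for i in reversed(range(n)):
        pi[i] = (b[i] - sum(A[i][c] * pi[c] for c in range(i + 1, n))) / A[i][i]
    return pi

# Hypothetical rates per hour: rejuvenation trigger, failure, and the two repairs.
rho, lam, mu_r, mu_f = 0.1, 0.01, 2.0, 0.5
Q = [[-(rho + lam), rho, lam],
     [mu_r, -mu_r, 0.0],
     [mu_f, 0.0, -mu_f]]
reward = [1.0, 0.0, 0.0]          # the system is available only in the up state

pi = steady_state(Q)
availability = sum(p * r for p, r in zip(pi, reward))
closed_form = 1.0 / (1.0 + rho / mu_r + lam / mu_f)
```

For this toy chain the result agrees with the closed form $1 / (1 + \rho/\mu_r + \lambda/\mu_f)$, which gives a simple check of the solver before applying the same steps to a larger generator.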

4. Numerical Illustration

This section is devoted to the numerical illustration of the presented model in Figure 4 by means of phase expansion. Model parameters are summarized in Table 5, where all values are given according to the related literature [13,20,34]. All general distributions were accurately approximated by PH distributions with appropriate phases, that is, 100 phases for $G_{intv}(t)$, $G_{cp}(t)$, $G_{load}(t)$, $G_{rc}(t)$, $G_{trig}(t)$, and $G_{rej}(t)$, and 10 phases for $G_{fail}(t)$ (see [20] for more details); eventually, we obtained a large approximate CTMC consisting of 201,400 PH-expanded states. Similar to [20], in order to evaluate the effects of the checkpoint interval and the rejuvenation-trigger interval on system availability, the mean checkpoint interval (MCI) was varied from 1 to 10 h, and the mean rejuvenation-trigger interval (MRTI) was varied from 5 to 35 h. In addition, the model was evaluated both with and without human-error-related system failures in order to quantify the effects of human-error factors on both system availability and optimal software-rejuvenation timing.
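The reported size of 201,400 states can be reproduced as a quick consistency check, assuming each state's block has dimension equal to the product of the phase counts of the general distributions in its diagonal block of the generator (EXP transitions add no phases). The dictionary layout and names below are illustrative.

```python
import math

# Number of phases used for each general distribution, as stated in the text.
phases = {"intv": 100, "cp": 100, "load": 100, "rc": 100,
          "trig": 100, "rej": 100, "fail": 10}

# PH clocks appearing in the diagonal block of each state of the generator.
active_clocks = {
    "Normal": ("intv", "fail", "trig"),
    "Checkpointing": ("cp", "fail", "trig"),
    "Checkpointing'": ("fail", "cp"),
    "Rejuvenation": ("rej",),
    "Failure1": ("load",),
    "Recovery": ("rc",),
    "Failure2": ("load",),
}

block_sizes = {state: math.prod(phases[c] for c in clocks)
               for state, clocks in active_clocks.items()}
total_states = sum(block_sizes.values())

for state, size in block_sizes.items():
    print(f"{state:15s} {size:>7d}")
print(f"{'total':15s} {total_states:>7d}")
```

The two states that carry three concurrent clocks account for 200,000 of the states, which shows that the Kronecker structure, rather than the number of macroscopic states, drives the size of the expanded CTMC.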

4.1. Steady-State System Availability

Here, we show the steady-state availabilities of a system that may fail due to human error in checkpointing under different cases of MRTI and MCI. The corresponding results are given in Table 6, which shows that steady-state system availability increased as the value of MCI increased under each MRTI case. This means that too-frequent checkpointing decreases system availability because the system becomes unavailable during checkpointing. The effect of MRTI on system availability is now examined. For each MCI, steady-state system availability increases at the beginning and subsequently decreases with increasing MRTI, implying that an optimal MRTI might exist for maximizing steady-state system availability.
Moreover, by comparing results in Table 6 and Table 7, the latter of which gives the steady-state system availability without considering human-error-related system failures, it is reasonable to say that human-error factors significantly decreased system availability, especially in the case where the value of MCI was small. In other words, although frequent checkpointing can save data in a timely manner, it also brings a higher risk of system failure, caused by incorrect operations. Therefore, it is crucial to determine a suitable frequency of executing checkpointing to satisfy target system availability. For example, given a target steady-state system availability of 0.9 and an MRTI of 10 h, an MCI equal to or larger than 5 h is a good choice.

4.2. Optimal Rejuvenation-Trigger Timing

This subsection discusses the optimal software-rejuvenation timing that maximizes steady-state system availability. Figure 5 illustrates the sensitivity of steady-state system availability with respect to the mean rejuvenation-trigger interval in the cases of MCI = 2, 4, 6, 8, and 10. The figure plots unimodal curves of the steady-state system availabilities, which reveals the existence of an optimal rejuvenation-trigger timing maximizing steady-state system availability in each case. Specifically, the overhead incurred by frequent rejuvenation (i.e., short MRTI) largely affects system availability. Conversely, downtime due to system failures caused by a less frequent execution of rejuvenation smoothly decreases system availability.
Optimal rejuvenation-trigger timings and their corresponding maximal steady-state system availabilities in all cases are presented in Table 8, both with and without human-error-related system failures. Optimal MRTIs for all cases of MCI were very similar, which means that the optimal rejuvenation-trigger timing is not very sensitive to the checkpoint interval. Optimal MRTIs in the case where human-error-related system failures were not considered were slightly smaller than those in the case with human-error-related failures when the value of MCI was small, and vice versa when the MCI had a large value, for example, MCI = 9, 10.

5. Conclusions

In this paper, we presented a composite stochastic Petri reward net and its resulting non-Markovian availability model for operational software systems in which both checkpointing and software rejuvenation are adopted to protect data and to enhance system availability, and in which the system may fail due to both the aging problem and human errors during checkpointing. More specifically, the non-Markovian availability model was derived on the basis of a reachability graph that was generated from the original SRNs. The PH expansion approach was applied to obtain the stationary solution of the non-Markovian availability model because the model is not one of the trivial stochastic models such as SMP and MRGP, so common approaches such as LST and embedded Markov chain techniques do not work. Numerical results showed that human-error factors both decreased steady-state system availability and had a significant effect on the optimal rejuvenation-trigger timing, which means that human-error factors should not be overlooked during system modeling.
The model presented in this paper was based on a macroscopic view, providing a fundamental idea of how to model a software system that undergoes both checkpointing and software rejuvenation and whose behavior is driven by multiple competing events. The actual behavior of such a system is very complex, and more possible events need to be considered, for example, software environment upgrades and time-scope limitations of the library versions in use. Although such an extension may greatly increase the difficulty of numerical analysis, a microscopic look at system behavior is important and will be one of our future directions. This paper considered only aperiodic checkpointing and time-based software rejuvenation, but to the best of our knowledge, there exist various kinds of checkpointing [35] and rejuvenation techniques [8]. In the future, we aim to extend this work to more complicated software systems with different rejuvenation and checkpointing schemes.
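The idea behind phase expansion, replacing a general distribution by a PH distribution so that the model becomes a CTMC, can be illustrated with a deliberately simple stand-in. The sketch below matches the mean and CV of a general transition with an Erlang-k distribution (a special acyclic PH); this moment-matching rule is an assumption chosen for brevity and is not the fitting procedure used in this paper. All names are hypothetical.

```python
def erlang_phase_expansion(mean, cv):
    """Match mean and CV with an Erlang-k distribution (k phases, equal rates)."""
    k = max(1, round(1.0 / cv ** 2))  # Erlang-k has CV = 1 / sqrt(k)
    rate = k / mean                   # each phase has mean duration mean / k
    # Sub-generator T of the PH representation: phase i is left at `rate`
    # towards phase i + 1; the last phase exits to absorption.
    T = [[0.0] * k for _ in range(k)]
    for i in range(k):
        T[i][i] = -rate
        if i + 1 < k:
            T[i][i + 1] = rate
    alpha = [1.0] + [0.0] * (k - 1)   # always start in the first phase
    return alpha, T


# Example input: mean 0.5 h and CV 0.2, as listed for G_rej(t) in Table 5.
alpha, T = erlang_phase_expansion(mean=0.5, cv=0.2)
print(len(alpha), T[0][0])
```

A small CV requires many phases (25 here), which is one reason the expanded state space can grow quickly.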

Author Contributions

Conceptualization, J.Z., H.O. and T.D.; methodology, J.Z., H.O. and T.D.; formal analysis, J.Z.; investigation, J.Z.; writing—original draft preparation, J.Z.; writing—review and editing, H.O. and T.D.; supervision, H.O. and T.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MRGP: Markov regenerative process
LST: Laplace–Stieltjes transform
SRN: Stochastic (Petri) reward net
PH: Phase or phase-type
CTMC: Continuous-time Markov chain
IMM: Immediate
EXP: Exponential
GEN: General
APH: Acyclic PH distribution
CF: Canonical form
MLE: Maximum-likelihood estimation
MCI: Mean checkpoint interval
MRTI: Mean rejuvenation-trigger interval

References

  1. Grottke, M.; Trivedi, K.S. Fighting bugs: Remove, retry, replicate, and rejuvenate. IEEE Comput. 2007, 40, 107–109.
  2. Dohi, T.; Trivedi, K.S.; Avritzer, A. Handbook of Software Aging and Rejuvenation: Fundamentals, Methods, Applications, and Future Directions; World Scientific: Singapore, 2020.
  3. Huang, Y.; Kintala, C.; Kolettis, N.; Fulton, N.D. Software rejuvenation: Analysis, module and applications. In Proceedings of the 25th IEEE International Symposium on Fault Tolerant Computing (FTC’95), Pasadena, CA, USA, 27–30 June 1995; pp. 381–390.
  4. Trivedi, K.S.; Vaidyanathan, K. Software aging and rejuvenation. In Wiley Encyclopedia of Computer Science and Engineering; John Wiley and Sons: Hoboken, NJ, USA, 2007; pp. 1–8.
  5. Alonso, J.; Matias, R.; Vicente, E.; Maria, A.; Trivedi, K.S. A comparative experimental study of software rejuvenation overhead. Perform. Eval. 2013, 70, 231–250.
  6. Vaidyanathan, K.; Trivedi, K.S. A comprehensive model for software rejuvenation. IEEE Trans. Depend. Secur. Comput. 2005, 2, 124–137.
  7. Ning, G.; Zhao, J.; Lou, Y.; Alonso, J.; Matias, R.; Trivedi, K.S.; Yin, B.B.; Cai, K.Y. Optimization of two-granularity software rejuvenation policy based on the Markov regenerative process. IEEE Trans. Reliab. 2016, 65, 1630–1646.
  8. Zheng, J.; Okamura, H.; Li, L.; Dohi, T. A comprehensive evaluation of software rejuvenation policies for transaction systems with Markovian arrivals. IEEE Trans. Reliab. 2017, 66, 1157–1177.
  9. Dohi, T.; Zheng, J.; Okamura, H.; Trivedi, K.S. Optimal periodic software rejuvenation policies based on interval reliability criteria. Reliab. Eng. Syst. Saf. 2018, 180, 463–475.
  10. Wang, S.; Liu, J. HARRD: Real-time software rejuvenation decision based on hierarchical analysis under weibull distribution. In Proceedings of the 20th IEEE International Conference on Software Quality, Reliability and Security (QRS’20), Macau, China, 11–14 December 2020; pp. 83–90.
  11. Zhang, Y.; Chakrabarty, K. Fault recovery based on checkpointing for hard real-time embedded systems. In Proceedings of the 18th IEEE Symposium on Defect and Fault Tolerance in VLSI Systems (DFT’03), Boston, MA, USA, 5 November 2003; pp. 320–327.
  12. Fukumoto, S.; Kaio, N.; Osaki, S. Optimal checkpointing policies using the checkpointing density. J. Inf. Process. 1992, 15, 87–92.
  13. Dohi, T.; Osajima, S.; Kaio, N.; Osaki, S. On the effects of checkpoint institution methods for a macroscopic database model. Electron. Commun. Jpn. Part III Fundam. Electron. Sci. 2000, 83, 23–33.
  14. Ranganathan, A.; Upadhyaya, S.J. Performance evaluation of rollback-recovery techniques in computer programs. IEEE Trans. Reliab. 1993, 42, 220–226.
  15. Bajunaid, N.; Menascé, D.A. Efficient modeling and optimizing of checkpointing in concurrent component-based software systems. J. Syst. Softw. 2018, 139, 1–13.
  16. Sigdel, P.; Tzeng, N.F. Coalescing and deduplicating incremental checkpoint files for restore-express multi-level checkpointing. IEEE Trans. Parallel Distrib. Syst. 2018, 29, 2713–2727.
  17. Okamura, H.; Dohi, T. Comprehensive evaluation of aperiodic checkpointing and rejuvenation schemes in operational software system. J. Syst. Softw. 2010, 83, 1591–1604.
  18. Levitin, G.; Xing, L.; Luo, L. Joint optimal checkpointing and rejuvenation policy for real-time computing tasks. Reliab. Eng. Syst. Saf. 2019, 182, 63–72.
  19. Zheng, J.; Okamura, H.; Dohi, T. A phase expansion for non-Markovian availability models with time-based aperiodic rejuvenation and checkpointing. Commun. Stat-Theory Methods 2020, 49, 3712–3729.
  20. Zheng, J.; Okamura, H.; Dohi, T. Optimal rejuvenation policies for non-Markovian availability models with aperiodic checkpointing. IEICE Trans. Inf. Syst. 2020, E103-D, 2133–2142.
  21. Bolch, G.; Greiner, S.; De Meer, H.; Trivedi, K.S. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications, 2nd ed.; John Wiley and Sons: New York, NY, USA, 2006.
  22. Okamura, H.; Dohi, T. Fitting phase-type distributions and Markovian arrival processes: Algorithms and tools. In Principles of Performance and Reliability Modeling and Evaluation; Fiondella, L., Puliafito, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2016; pp. 49–75.
  23. Trivedi, K.S.; Bobbio, A. Reliability and Availability Engineering: Modeling, Analysis, and Applications; Cambridge University Press: Cambridge, UK, 2017.
  24. Brown, A. An Overview of Human Error. CS294-4 ROC Semin. 1990, 54. Available online: http://roc.cs.berkeley.edu/294fall01/slides/human-error.pdf (accessed on 10 December 2020).
  25. Yanagihara, M.; Odagiri, M.; Osaki, S.; Kaio, N. Optimal checkpointing procedures taking into account system failure caused by checkpointing. Electron. Commun. Jpn. Part III Fundam. Electron. Sci. 1995, 78, 69–79.
  26. Zheng, J.; Okamura, H.; Dohi, T. A transient interval reliability analysis for software rejuvenation models with phase expansion. Softw. Qual. J. 2020, 28, 173–194.
  27. Yang, X.; Alfa, A.S. A class of multi-server queueing system with server failures. Comput. Ind. Eng. 2009, 56, 33–43.
  28. Ruiz-Castro, J.E.; Acal, C.; Aguilera, A.M.; Roldán, J.B. A complex model via phase-type distributions to study random telegraph noise in resistive memories. Mathematics 2021, 9, 390.
  29. Kemper, P.; Müller, D.; Thümmler, A. Combining response surface methodology with numerical methods for optimization of Markovian models. IEEE Trans. Depend. Secur. Comput. 2006, 3, 259–269.
  30. Cumani, A. On the canonical representation of homogeneous Markov processes modelling failure-time distributions. Microelectron. Reliab. 1982, 22, 583–602.
  31. Okamura, H.; Dohi, T.; Trivedi, K.S. Improvement of EM algorithm for phase-type distributions with grouped and truncated data. Appl. Stoch. Model. Bus. Ind. 2013, 29, 141–156.
  32. Dayar, T. Analyzing Markov Chains Using Kronecker Products: Theory and Applications; Springer Science and Business Media: New York, NY, USA, 2012.
  33. Trivedi, K.S. Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 2nd ed.; John Wiley and Sons: Hoboken, NJ, USA, 2001.
  34. Leung, C.H.C.; Currie, E. The effect of failures on the performance of long-duration database transactions. Comput. J. 1995, 38, 471–478.
  35. Tantawi, A.N.; Ruschitzka, M. Performance analysis of checkpointing strategies. ACM Trans. Comput. Syst. 1984, 2, 123–144.
Figure 1. Clock models for (a) system aging and (b) software rejuvenation.
Figure 2. Stochastic (Petri) reward net (SRN) model for system behavior.
Figure 3. Reachability graph.
Figure 4. Non-Markovian state-transition diagram.
Figure 5. Sensitivity of steady-state system availability with respect to mean rejuvenation-trigger timing.
Table 1. Guard functions.

| Guard | Guard Function |
|---|---|
| $G_{normal}$ | $\#(P_{fclock}) = 1$ && $\#(P_{rclock}) = 1$ |
| $G_{fail}$ | $\#(P_{fsignal}) = 1$ |
| $G_{rej}$ | $\#(P_{rsignal}) = 1$ && $\#(P_{fsignal}) = 0$ |
| $G_{trigger}$ | $\#(P_{normal}) = 1$ && $\#(P_{checkpointing}) = 1$ |
| $G_{reset}$ | $\#(P_{completed}) = 1$ |

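
The guard functions in Table 1 are Boolean predicates over the token counts of a marking. As a rough sketch, they can be written as follows; the place names are reconstructed from the table, whereas the dictionary-based marking and the helper `tokens` are assumptions made purely for illustration.

```python
def tokens(marking, place):
    """Number of tokens in a place; places missing from the marking hold none."""
    return marking.get(place, 0)


# Guard functions of Table 1, each mapping a marking to True/False.
GUARDS = {
    "G_normal": lambda m: tokens(m, "P_fclock") == 1 and tokens(m, "P_rclock") == 1,
    "G_fail": lambda m: tokens(m, "P_fsignal") == 1,
    "G_rej": lambda m: tokens(m, "P_rsignal") == 1 and tokens(m, "P_fsignal") == 0,
    "G_trigger": lambda m: tokens(m, "P_normal") == 1 and tokens(m, "P_checkpointing") == 1,
    "G_reset": lambda m: tokens(m, "P_completed") == 1,
}

# Hypothetical marking: both clocks hold one token and the system is normal.
marking = {"P_fclock": 1, "P_rclock": 1, "P_normal": 1}
enabled = sorted(name for name, guard in GUARDS.items() if guard(marking))
print(enabled)
```

Note that $G_{rej}$ is false whenever a failure signal is present, so a pending failure takes priority over a pending rejuvenation.
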
Table 2. Nodes in reachability graph.

| Node | Description |
|---|---|
| GEN ($T_{cint}$ → enable, $T_{fail1}$ → enable, $T_{trigger}$ → enable) | Initial marking representing the normal execution state |
| GEN ($T_{checkpointing}$ → enable, $T_{fail1}$ → enable, $T_{trigger}$ → enable) | Marking representing checkpointing-execution state with disabled rejuvenation |
| GEN ($T_{checkpointing}$ → enable, $T_{fail1}$ → enable) | Marking representing checkpointing-execution state with enabled rejuvenation |
| GEN ($T_{load}$ → enable) | Marking representing system-failure state |
| GEN ($T_{recovery}$ → enable) | Marking representing rollback-recovery state |
| GEN ($T_{rejuvenation}$ → enable) | Marking representing rejuvenation-execution state |

Table 3. State notation in non-Markovian state-transition diagram.

| State | Description |
|---|---|
| Normal | Normal execution process in the main memory |
| Checkpointing | Checkpointing execution with a disabled rejuvenation |
| Checkpointing’ | Checkpointing execution with an enabled rejuvenation |
| Failure1 | Aging-related system failure |
| Failure2 | Human-error-related system failure |
| Recovery | Rollback recovery to recover from system failure |
| Rejuvenation | Software-rejuvenation execution to refresh system’s internal states |

Table 4. Cumulative distribution functions (CDFs) of transitions in state-transition diagram.

| CDF | Description | Type |
|---|---|---|
| $G_{intv}(t)$ | CDF of checkpoint interval. | GEN |
| $G_{fail}(t)$ | CDF of time for an aging-related failure to occur. | GEN |
| $G_{cp}(t)$ | CDF of time needed for checkpointing. | GEN |
| $G_{load}(t)$ | CDF of loading time of checkpointed data. | GEN |
| $G_{rc}(t)$ | CDF of time needed for rollback recovery. | GEN |
| $G_{trig}(t)$ | CDF of time required to trigger a rejuvenation. | GEN |
| $G_{rej}(t)$ | CDF of rejuvenation overhead. | GEN |
| $F_{fail1}(t)$ | CDF of time for failure to occur during rollback recovery. | EXP |
| $F_{fail2}(t)$ | CDF of time for a human-error-related failure to occur during checkpointing execution. | EXP |

Table 5. Model parameters.

| CDF | Distribution | Mean (h) | CV |
|---|---|---|---|
| $G_{intv}(t)$ | Lognormal | 1–10 | 0.2 |
| $G_{fail}(t)$ | Weibull | 10 | 0.5 |
| $G_{cp}(t)$ | Lognormal | 0.05 | 0.2 |
| $G_{load}(t)$ | Lognormal | 0.5 | 0.2 |
| $G_{rc}(t)$ | Lognormal | 0.5 | 0.2 |
| $G_{trig}(t)$ | Lognormal | 5–35 | 0.1 |
| $G_{rej}(t)$ | Lognormal | 0.5 | 0.2 |
| $F_{fail1}(t)$ | Exponential | 16.67 | 1 |
| $F_{fail2}(t)$ | Exponential | 1.5 | 1 |

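
Table 5 specifies each general distribution by its mean and coefficient of variation (CV), whereas numerical work usually needs the native distribution parameters. The sketch below shows one way to perform this conversion with standard moment formulas; the function names are hypothetical, and the bisection bounds for the Weibull shape are assumptions that comfortably bracket the CV values in the table.

```python
import math


def lognormal_params(mean, cv):
    """Return (mu, sigma) of a lognormal distribution with the given mean and CV."""
    sigma2 = math.log(1.0 + cv ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)


def weibull_cv(shape):
    """CV of a Weibull distribution; it depends on the shape parameter only."""
    g1 = math.gamma(1.0 + 1.0 / shape)
    g2 = math.gamma(1.0 + 2.0 / shape)
    return math.sqrt(g2 / g1 ** 2 - 1.0)


def weibull_params(mean, cv, lo=0.1, hi=50.0, iters=200):
    """Return (shape, scale); bisection works because CV decreases with shape."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if weibull_cv(mid) > cv:
            lo = mid   # CV still too large: the shape must increase
        else:
            hi = mid
    shape = (lo + hi) / 2.0
    scale = mean / math.gamma(1.0 + 1.0 / shape)
    return shape, scale


mu, sigma = lognormal_params(mean=0.5, cv=0.2)    # e.g. G_rej(t)
shape, scale = weibull_params(mean=10.0, cv=0.5)  # G_fail(t)
print(mu, sigma, shape, scale)
```

For the exponential distributions, no conversion is needed beyond taking the rate as the reciprocal of the mean.
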
Table 6. Steady-state system availability (with human-error-related system failures). Note: MCI, mean checkpoint interval; MRTI, mean rejuvenation-trigger interval.

| MCI (h) | MRTI = 5 h | MRTI = 7 h | MRTI = 10 h | MRTI = 13 h | MRTI = 15 h |
|---|---|---|---|---|---|
| 1 | 0.83333 | 0.84600 | 0.85168 | 0.85226 | 0.85194 |
| 2 | 0.86380 | 0.87684 | 0.88245 | 0.88259 | 0.88192 |
| 3 | 0.87494 | 0.88747 | 0.89309 | 0.89305 | 0.89227 |
| 4 | 0.87897 | 0.89335 | 0.89846 | 0.89836 | 0.89752 |
| 5 | 0.88327 | 0.89598 | 0.90182 | 0.90155 | 0.90069 |
| 6 | 0.88679 | 0.89801 | 0.90404 | 0.90369 | 0.90278 |
| 7 | 0.88849 | 0.90022 | 0.90531 | 0.90529 | 0.90430 |
| 8 | 0.88908 | 0.90204 | 0.90635 | 0.90637 | 0.90546 |
| 9 | 0.88925 | 0.90318 | 0.90740 | 0.90714 | 0.90630 |
| 10 | 0.88929 | 0.90377 | 0.90838 | 0.90779 | 0.90694 |

Table 7. Steady-state system availability (without human-error-related system failures).

| MCI (h) | MRTI = 5 h | MRTI = 7 h | MRTI = 10 h | MRTI = 13 h | MRTI = 15 h |
|---|---|---|---|---|---|
| 1 | 0.84850 | 0.86206 | 0.86796 | 0.86821 | 0.86758 |
| 2 | 0.87067 | 0.88438 | 0.89024 | 0.89025 | 0.88942 |
| 3 | 0.87876 | 0.89200 | 0.89788 | 0.89779 | 0.89692 |
| 4 | 0.88154 | 0.89626 | 0.90174 | 0.90162 | 0.90073 |
| 5 | 0.88469 | 0.89810 | 0.90415 | 0.90393 | 0.90303 |
| 6 | 0.88735 | 0.89954 | 0.90576 | 0.90548 | 0.90456 |
| 7 | 0.88867 | 0.90117 | 0.90666 | 0.90665 | 0.90567 |
| 8 | 0.88913 | 0.90254 | 0.90741 | 0.90744 | 0.90652 |
| 9 | 0.88926 | 0.90341 | 0.90818 | 0.90800 | 0.90714 |
| 10 | 0.88929 | 0.90387 | 0.90892 | 0.90849 | 0.90761 |

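Table 8. Optimal rejuvenation-trigger timings.

| MCI (h) | MRTI (h), with human-error-related failures | $A_{ss}$, with human-error-related failures | MRTI (h), without human-error-related failures | $A_{ss}$, without human-error-related failures |
|---|---|---|---|---|
| 1 | 12.3 | 0.85230 | 11.6 | 0.86841 |
| 2 | 11.5 | 0.88283 | 11.3 | 0.89059 |
| 3 | 11.3 | 0.89339 | 11.2 | 0.89819 |
| 4 | 11.2 | 0.89878 | 11.2 | 0.90206 |
| 5 | 11.0 | 0.90196 | 11.1 | 0.90435 |
| 6 | 10.9 | 0.90428 | 11.0 | 0.90603 |
| 7 | 11.3 | 0.90572 | 11.3 | 0.90708 |
| 8 | 11.4 | 0.90668 | 11.4 | 0.90777 |
| 9 | 11.0 | 0.90753 | 11.1 | 0.90838 |
| 10 | 10.5 | 0.90842 | 10.7 | 0.90902 |
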
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Zheng, J.; Okamura, H.; Dohi, T. Availability Analysis of Software Systems with Rejuvenation and Checkpointing. Mathematics 2021, 9, 846. https://doi.org/10.3390/math9080846
