1. Introduction
Many practical systems are required to perform specific tasks within a given duration. Task success probability (TSP), the probability of completing a required task with or without deadlines, is a core metric for evaluating performance during task execution [1,2,3,4,5,6,7]. Existing models of task-based systems are mainly devoted to the evaluation and maximization of TSP [8,9,10]. However, there are often situations where the survival of the system is more crucial than the successful completion of the required tasks, since the failure of these systems causes huge economic losses and major environmental hazards [11,12,13,14]. For example, when external influences (such as extreme natural conditions) raise the risk of failure to a high level, drones are designed to abandon the surveillance task and immediately start rescue procedures [15]. As system failures in critical engineering applications often result in severe damage and casualties, it is pivotal to implement a detailed task abort policy that balances the trade-off between TSP and system survival probability (SSP), thereby minimizing the expected total cost of task failures and system failures.
Time redundancy is commonly incorporated into task execution to improve the TSP of task-critical systems. It allows a system to attempt a task multiple times within a constrained time. For instance, satellites can perform the task of transmitting information between the spacecraft and the ground observation station multiple times within a given time window [16]. There are two types of time redundancy, distinguished by the criterion for task success: in type I time redundancy (ITR), the system must run continuously for a period longer than a specified value; in type II time redundancy (IITR), task success requires that the cumulative operating time exceed the required value [8].
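The difference between the two success criteria can be made concrete with a minimal sketch. The function names and the sample run lengths below are hypothetical and serve only as an illustration; the comparison uses `>=`, which is immaterial for continuous operating times.

```python
from typing import Sequence


def succeeds_itr(run_lengths: Sequence[float], required: float) -> bool:
    """Type I (ITR): one uninterrupted run must reach the required duration."""
    return any(run >= required for run in run_lengths)


def succeeds_iitr(run_lengths: Sequence[float], required: float) -> bool:
    """Type II (IITR): the operating time summed over all runs must reach it."""
    return sum(run_lengths) >= required


# Hypothetical operating times (in hours) of three attempts, each cut short.
runs = [6.0, 5.0, 7.0]
print(succeeds_itr(runs, 15.0))   # False: no single run lasts 15 h
print(succeeds_iitr(runs, 15.0))  # True: 6 + 5 + 7 = 18 h in total
```

The same sequence of interrupted attempts therefore fails under ITR but succeeds under IITR, which is why the two types are analyzed separately.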
In addition to task termination and time redundancy, preventive maintenance is another critical factor influencing TSP and SSP. Task abort policy and time redundancy are designed during task execution while preventive maintenance is taken before or after task execution. Preventive maintenance is crucial to ensure the highly reliable performance during task execution. To improve the TSP and SSP of safety-critical systems, an effective preventive maintenance scheme considering task abort and time redundancy should be developed.
Despite the significant theoretical advancement in the optimization of abort policies, the effect of time redundancy on the TSP and SSP under different abort and maintenance policies has not been explored. In real-world applications, time redundancy can enhance system performance significantly and consequently influence the decision-making process. Taking time redundancy into account when abort and maintenance decisions are made leads to more effective task abort policies. To further advance the state of the art of evaluating and enhancing system performance during task execution, this paper models dynamic condition-based task termination and maintenance strategies considering two types of time redundancy. The results indicate that by introducing time redundancy into the abort and maintenance decision making, the SSP and TSP can be significantly improved. Our contributions to existing theoretical and practical research on risk control are summarized as follows:
Dynamic preventive maintenance and task abort policies are designed that vary with the number of task attempts;
TSP and SSP are derived under the proposed preventive maintenance and task abort policies considering ITR and IITR;
The optimal preventive maintenance and abort thresholds minimizing the expected cost of preventive maintenance, task failure and system failure are studied.
The rest of this paper is organized as follows. Section 2 reviews the literature on task abort, time redundancy and maintenance planning. Section 3 characterizes the monotone degradation behavior and develops maintenance and task abort policies considering ITR and IITR. The TSP and SSP are evaluated under ITR and IITR in Section 4 and Section 5, respectively. Section 6 studies the optimal maintenance and abort thresholds. The obtained results are illustrated with a case study in Section 7. Section 8 concludes the paper and discusses limitations and future research directions.
2. Literature Review
Intensive efforts have been dedicated to modeling task abort policies with the aim of balancing the TSP and SSP under different types of system and mission characteristics. The optimal task abort strategy for unmanned aerial vehicles (UAVs) modeled as single-component systems under external shocks was studied in [15,17,18,19,20], where the task abort was determined by a threshold on the number of external shocks. In [21], multi-state shock models were considered, in which each shock may degrade the system to a worse performance level and eventually cause system failure. The system state was used in the decision optimization problem, and the trade-off between TSP and SSP was studied. In addition to shock-based abort decision making, other criteria can also be used to guide the termination strategy of safety-critical systems. For systems with a defective state, the duration of the defective state [22] and a warning signal [23] can be used to guide abort decision making. For systems subject to minor and catastrophic failures, the number of minimal repairs can be chosen as the decision variable of the termination strategy [24,25]. Abort modeling for systems with continuous degradation is drawing increasing attention thanks to the wide application of sensor technologies in safety-critical systems [26,27,28,29,30].
As an extension of abort policies for single-component systems, the pioneering work on the optimal task termination strategy for multi-component systems was conducted by Myers [12], who considered a hot standby system and developed a task termination strategy that takes the number of failed components as the decision variable. The rescue procedure in multi-component systems is commonly triggered by a certain number of failed components or external shocks to avoid costly consequences. Levitin et al. [31] generalized the model in [12] to the case of different components and proposed an adaptive termination strategy. Filene and Daly [32] characterized the effect of the task termination strategy on TSP and SSP in distributed computer systems. Peng [33] designed the termination strategy of multiple cooperative UAVs subject to external shocks. The optimal routing, aborting, and hitting strategies of UAVs were investigated in [34,35]. Levitin [36] calculated TSP and SSP considering the fault propagation effect. Levitin [21] considered several subtasks performed by different groups of units and studied the optimal subtask assignment and the optimal termination strategy between units. The joint optimization of mission abort and component switching policies for multi-state warm standby systems was studied in [19].
The above-mentioned references have considered termination policies allowing only one attempt to complete a task. However, for systems performing critical tasks, time redundancy is commonly taken into consideration to improve their TSP. Reliability modeling of time redundancy has received considerable research attention in the past several years. In [10,37], the TSP and optimal maintenance policy under ITR were modeled and analyzed. Abort modeling incorporating time redundancy is a rather new topic. In [38], the optimal termination strategy was studied for single-component systems with a fixed number of attempts to complete the task. In [21], optimal abort rules for multi-component systems with multiple allowed attempts were considered. The optimal mission abort policies under ITR and IITR were investigated in [16]. In contrast to the existing literature with constant abort thresholds, the abort threshold in this paper varies with the task attempts due to time redundancy.
In addition to task termination and time redundancy, preventive maintenance is another critical factor influencing TSP and SSP. Planning maintenance actions optimally can not only enhance system reliability but also reduce the system operation cost. According to the effect of maintenance actions, existing maintenance policies can be classified into perfect maintenance and imperfect maintenance [39,40,41,42,43]. Perfect maintenance restores the system to an “as good as new” state, whereas imperfect maintenance restores it to a state worse than new. Under an age-based preventive maintenance policy, a product is maintained at a certain age or upon failure, whichever occurs first. The rapid development of sensing technology makes it much easier to monitor the condition of the system, which facilitates modeling the system degradation paths through random processes such as the Wiener process and the gamma process. For systems with measurable degradation, condition-based maintenance is more effective than age-based maintenance in reducing the risk of failure [44,45,46,47,48,49]. In existing models, the joint effects of time redundancy, preventive maintenance and task abort on TSP and SSP have not been considered.
3. Problem Formulation
We consider safety critical systems whose degradation is stochastically increasing. Due to the monotonicity of the degradation path, the degradation process is modeled by homogeneous Gamma processes with shape function and scale parameter . That is, possesses the following properties:
System failure emerges once the degradation level reaches the threshold D. To this end, the random system lifetime can be defined as the first hitting time of the degradation process with respect to the failure threshold D. For the considered system, SSP is measured through the probability that no catastrophic failure occurs during task execution. To enhance the SSP under continuous degradation, a task can be terminated if the degradation level exceeds a specified level and starts a rescue procedure immediately. The duration for the rescue procedure started at time t is increasing in t, which is denoted as . Let be the time after which the task success takes less time than the rescue procedure. Namely, Thus, for , the task will not be aborted.
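As an illustration of the degradation model, the following sketch simulates a homogeneous Gamma process on a time grid and reads off the first grid time at which a threshold is reached. It assumes the standard definition (zero initial level, independent increments, and increments over a step of length dt that are Gamma distributed with shape proportional to dt). The parameter names `shape_rate` and `scale` and the numerical values are hypothetical; only the failure threshold of 20 is taken from the case study.

```python
import random
from typing import List, Optional


def simulate_gamma_path(shape_rate: float, scale: float, horizon: float,
                        dt: float, rng: random.Random) -> List[float]:
    """Degradation levels at times 0, dt, 2*dt, ... up to the horizon."""
    n_steps = int(round(horizon / dt))
    levels = [0.0]
    for _ in range(n_steps):
        # Increment over one step ~ Gamma(shape = shape_rate * dt, scale).
        levels.append(levels[-1] + rng.gammavariate(shape_rate * dt, scale))
    return levels


def first_hitting_time(levels: List[float], dt: float,
                       threshold: float) -> Optional[float]:
    """First grid time at which the level reaches the threshold, else None."""
    for i, level in enumerate(levels):
        if level >= threshold:
            return i * dt
    return None


rng = random.Random(0)
path = simulate_gamma_path(shape_rate=1.0, scale=1.0, horizon=30.0, dt=0.5, rng=rng)
print(first_hitting_time(path, dt=0.5, threshold=20.0))
```

Because the path is nondecreasing, the first hitting time of the failure threshold plays the role of the system lifetime, and the first hitting time of a lower abort threshold gives the instant at which a rescue would start. The grid only approximates these times to within one step.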
For systems executing critical tasks, time redundancy is another commonly adopted method to improve TSP and SSP. With time redundancy, the task can be executed multiple times by the required deadline . Let K be the maximum number of attempts by time . The mission succeeds if in any attempt , the system completes the task within the time deadline . The following two common types of time redundancy are considered in the established models:
A crucial problem in designing the task termination policy is balancing TSP and SSP. Considering the time redundancy property, to achieve more accurate risk evaluation and control, a dynamic task termination policy is proposed where the control limit for task termination varies with attempts. Specifically, in the
ith attempt, the task is terminated, and a rescue procedure is started if the degradation level is larger than the threshold
. Let
be the duration from the beginning of the
ith attempt to the task termination instant if abort threshold
is adopted, which is defined as the first hitting time of
with respect to the termination threshold
. By Equation (
2), the cumulative distribution function of
,
, is given as
In addition to task abort and time redundancy, a preventive maintenance policy is incorporated to enhance the TSP and SSP. To be specific, upon the completion of a rescue procedure, imperfect maintenance is carried out whose effect is characterized by the maintenance degree . Given the degradation level at the completion of a rescue y, the degradation after preventive maintenance is reduced to with . The case of implies that no maintenance action is performed and corresponds to replacement. The preventive maintenance cost associated with maintenance degree and degradation y is denoted by , which is increasing in y and decreasing in . When , no maintenance is performed, and thus, .
Figure 1 illustrates the multiple attempts under ITR and IITR. It can be seen from Figure 1a that under IITR, the task is terminated in attempt 1, and the degradation after rescue completion is below the failure threshold. Thus, the system survives the rescue process. After the first attempt, imperfect maintenance is carried out to reduce the degradation level. The operating time in the first attempt is accumulated, and the task succeeds in attempt 2 before the task abort time
.
Figure 1b shows a sample path under ITR where the task is aborted in attempt 1 and attempt 2 and the continuous operating time reaches
in the third attempt.
4. TSP and SSP under ITR
In this section, TSP and SSP are derived under ITR considering the proposed dynamic abort and maintenance policies. Since multiple attempts can be executed until task success, an event transition-based numerical algorithm is adopted to evaluate the TSP and SSP.
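As a cross-check on any numerical evaluation, TSP and SSP under ITR can also be estimated by plain Monte Carlo simulation. The sketch below is not the event transition-based algorithm of this section; it is a swapped-in simulation under stated assumptions: maintenance multiplies the degradation level by the maintenance degree, the rescue duration is a hypothetical linear function of the abort time, maintenance takes no time, and all parameter values except the deadline (30 h), task duration (15 h) and failure threshold (20) are invented for illustration.

```python
import random
from typing import Callable, Sequence, Tuple


def simulate_mission_itr(deadline: float, task_time: float, fail_level: float,
                         abort_levels: Sequence[float],
                         maint_degrees: Sequence[float],
                         rescue_time: Callable[[float], float],
                         shape_rate: float, scale: float, dt: float,
                         rng: random.Random) -> Tuple[bool, bool]:
    """Simulate one mission; return (task_succeeded, system_survived)."""
    level, remaining = 0.0, deadline
    n_steps = int(round(task_time / dt))
    for abort_level, degree in zip(abort_levels, maint_degrees):
        if remaining < task_time:
            break  # too little time left for an uninterrupted run
        abort_at = None
        for step in range(1, n_steps + 1):
            t = step * dt
            level += rng.gammavariate(shape_rate * dt, scale)
            if level >= fail_level:
                return False, False  # failure during the task
            # Abort only while finishing the task would take longer than rescue.
            if level >= abort_level and task_time - t > rescue_time(t):
                abort_at = t
                break
        if abort_at is None:
            return True, True  # ran for task_time without failure or abort
        rescue = rescue_time(abort_at)
        if rescue > 0.0:
            level += rng.gammavariate(shape_rate * rescue, scale)
        if level >= fail_level:
            return False, False  # failure during the rescue
        remaining -= abort_at + rescue
        level *= degree  # imperfect maintenance (assumed multiplicative)
    return False, True  # survived, but the task was not completed


def estimate_tsp_ssp(n_runs: int, seed: int = 0, **params) -> Tuple[float, float]:
    """Monte Carlo estimates of (TSP, SSP) from n_runs simulated missions."""
    rng = random.Random(seed)
    successes = survivals = 0
    for _ in range(n_runs):
        ok, alive = simulate_mission_itr(rng=rng, **params)
        successes += ok
        survivals += alive
    return successes / n_runs, survivals / n_runs


# Hypothetical policy: two attempts with attempt-dependent abort thresholds.
params = dict(deadline=30.0, task_time=15.0, fail_level=20.0,
              abort_levels=[12.0, 10.0], maint_degrees=[0.5, 0.5],
              rescue_time=lambda t: 0.5 * t,
              shape_rate=1.0, scale=1.0, dt=0.25)
tsp, ssp = estimate_tsp_ssp(2000, **params)
print(f"TSP ~ {tsp:.3f}, SSP ~ {ssp:.3f}")
```

Since every successful task is also a survival, the estimated SSP is never below the estimated TSP. The time grid introduces a discretization error that shrinks with `dt`.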
4.1. TSP Evaluation under ITR
Under ITR, the continuous operating time should exceed a threshold
. Let
and
be random variables representing the remaining time for task execution and degradation level before the
kth attempt. Let
be the joint pdf of
and
. A new system starts operation with the remaining task execution time
at time 0. Therefore, by definition of
, the corresponding probability mass function of
and
can be given as follows
Since both the task abort and maintenance policies are dependent on the number of attempts, in the
th attempt, if abort threshold
and maintenance degree
are adopted, then the rescue initiated time and rescue completion time are
and
, respectively. Let
denote the abort and preventive maintenance policies during the
th attempt. The degradation level at the beginning of the
kth attempt can be recursively determined by
where
denotes the indicator function of event
A and
is the degradation increment during the
th attempt if abort threshold
is taken. Given the elapsed time of the
th attempt
, the remaining time for task execution at the beginning of the
kth task is given as
Let
be the joint probability density function of the operating time and degradation increment of the
th attempt. By Equations (
5) and (
6), given the degradation increment
and operating time
of the
th attempt, then the degradation and remaining task execution time before the
th attempt are
and
, respectively. Thus, one can obtain the overall unconditional probability density function
recursively as
In what follows, we focus on deriving the expression for
. Note that the Gamma process
is a jump process and has an infinite number of jumps in finite intervals, and the degradation level at time
is not exactly
but attains it with a non-degenerative random overshoot. By Bertoin [
50], the joint probability density function of
and
can be given as
where
is the Levy measure of Gamma process with parameters
and
given by
Since the task is aborted at the
th attempt, if the elapsed time of the
th attempt is
, then
. By Equation (
7), the expression for
can be obtained using the property of independent increment of Gamma process
Under ITR, if the task succeeds at the
kth attempt by time
, then the remaining time for task execution before the
kth attempt should be larger than the task duration
, and no system failure occurs at the
kth attempt (the rescue initiated time in the
kth attempt,
, is larger than
, and the system lifetime
is greater than the task duration
, i.e.,
and
). Then, the probability that the task is completed at the
kth attempt under ITR is given as
Due to the property of stationary increments of Gamma process, the degradation increment in time interval
,
follows Gamma distribution with a shape parameter
and scale parameter
. Based on the distribution function of the degradation increment
in Equation (2) and the joint probability density function of
and
in Equation (7), we have
Based on Equation (
10), the probability that the task is completed at the
kth attempt under ITR is given as
Since the number of attempts until task success is mutually exclusive, by the law of total probability, the TSP under ITR can be obtained as
4.2. SSP Evaluation under ITR
The system survives the task under the condition that it completes either the task or the rescue process. Consequently, SSP is the sum of TSP and rescue success probability. If the system survives after making
k attempts before time
, then we have that the task is aborted at the
kth attempt and the rescue procedure succeeds, i.e.,
and
, and the remaining time after the
kth rescue should be smaller than
such that no further attempt is made, i.e.,
. Thus, the probability that the system survives after
k attempts is given by
In accordance with the property of independent and stationary increments of Gamma process, given the degradation and remaining task execution time at the beginning of the
kth attempt, the probability that the system survives after the
kth attempt is given by
where
satisfies
and
According to Equations (
11) and (
12), the probability that the system survives after
k attempts under ITR in Equation (
11) is given by
Since the number of attempts until rescue success is mutually exclusive, the SSP under ITR can be obtained by the law of total probability as
5. TSP and SSP under IITR
In this section, TSP and SSP are derived under IITR considering the dynamic abort and maintenance policies. Similar to the derivation of TSP and SSP under ITR, the recursive method is adopted to evaluate the TSP and SSP.
5.1. TSP Evaluation under IITR
Under IITR, the cumulative operating time should exceed a threshold
. Let
,
, and
be random variables representing the remaining time for task execution, degradation level, and cumulative operating time before the
kth attempt. Let
be the joint pdf of
,
, and
. A new system starts operation with the remaining task execution time
and cumulative operating time 0 at the beginning of the first attempt. Therefore, by definition of
, the corresponding probability mass function of
,
and
can be given as
Let
be the joint probability density function of the operating time, degradation increment and time in task of the
th attempt. Given the degradation increment
y, operating time
s and time in task
m of the
th attempt, then the degradation, remaining task execution time and cumulative time in task before the
th attempt are
,
and
, respectively. Thus, one can obtain the overall unconditional probability density function
recursively as
Under IITR, if
k attempts are made until task success before time
, then the remaining time for task execution before the
kth attempt should be greater than the time required to finish the remaining task, and the cumulative operating time is larger than
after the
kth attempt. There exist two possible scenarios for task success. In scenario 1, the cumulative operating time exceeds
before the abort time
. In scenario 2, the cumulative operating time is smaller than
before the abort time
but is larger than
before failure occurrence. Thus, the probability that the task is completed at the
kth attempt under IITR is given as
According to the distribution function of the degradation increment in Equation (2) and the joint probability density function of
and
in Equation (
7), the probability that the mission is completed before reaching the abort threshold
is
and the probability that the mission is completed after reaching the abort threshold
is
Based on the expression in Equations (
14) and (
15), the probability that the task is completed at the
kth attempt under IITR can be derived as
Since the number of attempts until task success is mutually exclusive, by the law of total probability, the TSP under IITR can be obtained as
5.2. SSP Evaluation under IITR
According to the time redundancy property, if the system survives after making
k attempts before time
, then the task is aborted at the
kth attempt, i.e.,
and
, and the remaining time for task execution after the completion of the
kth rescue procedure should be smaller than the remaining task time, i.e.,
. Based on the probability density function of
, the probability that the system survives after
k attempts under IITR is given by
By the distribution of the degradation increment in Equation (2) and the joint probability density function of
and
in Equation (
7), it follows that
where
satisfies
. Using Equation (
17), the probability that the system survives after
k attempts under IITR can be given as
In a similar manner, the SSP under IITR can be obtained by the law of total probability as
6. Optimal Abort and Maintenance Policies
The TSP increases as the abort limits increase while the SSP is decreasing in the abort thresholds due to increased failure risk caused by larger task execution duration. Thus, it is of practical value to find the optimal task abort thresholds to balance the trade-off between TSP and SSP. A commonly used criterion characterizing such optimization problem is economic loss. The cost during the task execution includes the maintenance cost before each attempt, task failure cost and system failure cost. Denote the random maintenance cost over task execution by
W. Let
and
be the task failure cost and system failure cost, respectively. Based on the expressions for TSP and SSP, the expected total cost under ITR during task execution can be given as
and the expected total cost under IITR during task execution can be given as
In what follows, we focus on deriving the expected maintenance cost during task execution under two types of time redundancy. Let
and
be the maintenance cost at the
jth attempt under ITR and IITR, respectively. By the law of conditional expectation, the expected maintenance cost under ITR is
In a similar manner, the expected maintenance cost under IITR is
Note that the maintenance cost is related to both the degradation before maintenance and the maintenance degree. Given the degradation level
y at the beginning of the
jth attempt, then the degradation after the rescue of the
th attempt is
. The expected maintenance cost at the
jth attempt under ITR is
By (
20) and (
21), the expected maintenance cost under ITR is
The expected maintenance cost under IITR is given in a similar manner as
Since the calculation of TSP and SSP involves reserve function, a numerical method is designed to obtain the value of TSP and SSP, and then, the optimal abort thresholds and maintenance degrees can be solved by efficient heuristic algorithms. The following section evaluates the performance of the proposed abort and maintenance policies numerically.
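The structure of the optimization can be sketched as follows. The cost expression used here is an assumption about the form of the expected total cost, namely expected maintenance cost plus the task failure cost weighted by 1 - TSP plus the system failure cost weighted by 1 - SSP. The exhaustive grid search is a simple stand-in for the heuristic algorithms mentioned above, and `toy_evaluate` is a purely artificial placeholder for a real TSP/SSP evaluator.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple

Policy = Tuple[Tuple[float, ...], Tuple[float, ...]]
Evaluator = Callable[[Tuple[float, ...], Tuple[float, ...]],
                     Tuple[float, float, float]]


def expected_total_cost(tsp: float, ssp: float, maint_cost: float,
                        task_fail_cost: float, system_fail_cost: float) -> float:
    """Assumed cost form: maintenance + task failure + system failure."""
    return (maint_cost + task_fail_cost * (1.0 - tsp)
            + system_fail_cost * (1.0 - ssp))


def grid_search(evaluate: Evaluator, abort_grid: Sequence[float],
                degree_grid: Sequence[float], n_attempts: int,
                task_fail_cost: float,
                system_fail_cost: float) -> Tuple[float, Optional[Policy]]:
    """Try every per-attempt combination and keep the cheapest policy."""
    best_cost, best_policy = float("inf"), None
    for aborts in product(abort_grid, repeat=n_attempts):
        for degrees in product(degree_grid, repeat=n_attempts):
            tsp, ssp, maint = evaluate(aborts, degrees)
            cost = expected_total_cost(tsp, ssp, maint,
                                       task_fail_cost, system_fail_cost)
            if cost < best_cost:
                best_cost, best_policy = cost, (aborts, degrees)
    return best_cost, best_policy


def toy_evaluate(aborts: Tuple[float, ...],
                 degrees: Tuple[float, ...]) -> Tuple[float, float, float]:
    """Artificial stand-in: higher thresholds raise TSP and lower SSP."""
    a = sum(aborts) / (20.0 * len(aborts))
    tsp = 0.6 * a
    ssp = 1.0 - 0.3 * a * a
    maint = sum(10.0 * (1.0 - d) for d in degrees)
    return tsp, ssp, maint


cost, policy = grid_search(toy_evaluate, [0.0, 5.0, 10.0, 15.0, 20.0],
                           [0.0, 0.5, 1.0], n_attempts=2,
                           task_fail_cost=500.0, system_fail_cost=1700.0)
print(cost, policy)
```

The search space grows exponentially with the number of attempts, which is one reason heuristic search is preferable to exhaustive enumeration for finer grids.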
7. Case Study
7.1. Background
This section applies the developed abort and maintenance strategies to the cooling system in chemical reactors studied by Cha et al. [
24] to illustrate the proposed risk assessment method and the performance of the optimal policies. Cooling systems are widely used in chemical reactors, which are required to keep temperatures at required levels, and their failures will lead to the dramatic fluctuation in temperature and ultimate reactor damage, resulting in huge economic loss and serious environmental and personal harm. Thus, the performed cooling task can be terminated to avoid the serious failure consequences. Additionally, routine preventive maintenance is critical to improve the TSP and SSP of the cooling system. The development of optimal maintenance and task termination strategies for cooling systems is of crucial importance in engineering practice. Crack degradation caused by corrosion is the most common internal failure mode of a typical cooling system. To characterize the monotone degradation behavior, the degradation process is modeled by a homogeneous Gamma process with
D = 20 mm,
.
Assume that the allowable time for performing the cooling task is 30 h and that a single cooling task takes 15 h. When the system degradation in an attempt exceeds a threshold, the cooling task is suspended and a rescue is carried out, whose duration at time t is . The maximum abort time in each attempt is then calculated as 10 h: once the time in task exceeds 10 h, completing the task takes less time than completing the rescue, so the task is not suspended. After each rescue, imperfect maintenance is carried out with cost function . In this section, the TSP and SSP of the cooling system are evaluated by a numerical integration method. Then, the optimal maintenance and abort thresholds under the dynamic policy are studied.
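The maximum abort time is the instant at which the remaining task time equals the rescue duration. The sketch below finds it by bisection for an increasing rescue duration. The linear rescue function used in the example is hypothetical; it is chosen only because it reproduces the 10 h value for a 15 h task and is not the function used in the case study.

```python
from typing import Callable


def max_abort_time(task_time: float, rescue_time: Callable[[float], float],
                   tol: float = 1e-9) -> float:
    """Solve task_time - t = rescue_time(t) for t by bisection.

    Assumes rescue_time is increasing and rescue_time(0) < task_time,
    so the remaining task time crosses the rescue duration exactly once.
    """
    lo, hi = 0.0, task_time
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if task_time - mid > rescue_time(mid):
            lo = mid  # finishing still takes longer than rescue: abort allowed
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical rescue duration of half the elapsed time in task.
print(round(max_abort_time(15.0, lambda t: 0.5 * t), 6))  # 10.0
```

With this assumed function, 15 - t = 0.5 t gives t = 10 h, so an abort after 10 h would take longer than simply finishing the task.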
7.2. Evaluation of TSP and SSP
This section uses a forward numerical algorithm to evaluate TSP and SSP based on the backward equations. First, we define the discretized time interval and degradation level. Under the current parameter setting, the Matlab implementation runs in approximately 1500 s on a Pentium 3.2 GHz PC.
Figure 2 shows the TSP and SSP under a single task attempt. The solid line represents the SSP, and the dotted line denotes the TSP. It can be observed that the TSP goes up as the degradation-based abort threshold increases. Such variation is due to the fact that under a larger value of task abort threshold, the task can continue for a relatively longer duration, and the corresponding TSP is larger, but the probability of occurrence of system failure will increase for a longer task duration. Hence, SSP decreases with task abort threshold.
Figure 2 shows that the TSP remains low when the abort threshold is less than 8 and then increases significantly, since the probability of completing the mission is very small when the abort threshold is low. When the task abort threshold is 0, the task is aborted immediately at the start of execution, and the corresponding SSP and TSP are 1 and 0, respectively. When the task abort threshold is 20, which equals the failure threshold D, the task is never aborted, and the SSP equals the TSP.
7.3. Optimal Task Abort and Maintenance Policies
We further consider optimal task abort and maintenance policies under different types of time redundancy. This section investigates the variation of the optimal solution with respect to allowable time and task duration. The cost of a task failure and system failure are assumed to be 500 and 1700, respectively.
Table 1 shows how the optimal abort and maintenance decisions under ITR vary with the time deadline and task duration. Given a fixed task duration and number of attempts, the abort threshold is nondecreasing in the time deadline. One possible explanation is that when the time deadline is small, the abort should be conducted earlier during the first several attempts to save time for the rescue procedure and subsequent attempts. With a longer time deadline, it is optimal to delay mission abort, since the additional time can accommodate the rescue procedure and the following task execution. Similarly, we can observe from Table 1 that given a fixed number of attempts, the abort threshold decreases when the task duration increases. This is because when the mission duration is small, the abort should be conducted at a later stage to improve TSP, whereas as the task duration increases, it is optimal to advance task abort to improve SSP. For a fixed time deadline and task duration, the abort threshold decreases with the number of task attempts. One explanation is that the task should be terminated earlier in the last few attempts to save time for rescue and improve SSP.
We can observe that given fixed amounts of allowable time, task duration and allowed attempts, the optimal maintenance degree decreases with the number of task attempts to reduce the total cost. To be specific, in the first few task attempts, it is optimal to perform imperfect maintenance to reduce maintenance cost, while in later attempts, it is optimal to perform perfect maintenance to improve TSP and SSP. Given a fixed task duration and number of attempts, the optimal maintenance degree increases when the allowable time increases due to the increased TSP.
Table 2 shows how the optimal abort and maintenance decisions under IITR vary with the time deadline and task duration. Comparing Table 1 and Table 2, it can be found that the optimal maintenance degree increases under IITR, since the completed work can be accumulated. Thus, it is optimal to conduct imperfect maintenance to save the maintenance cost. The abort threshold under ITR is larger than that under IITR. Under IITR, the work completed in different attempts can be accumulated, resulting in a higher TSP. Consequently, the task under IITR can be aborted earlier.
8. Conclusions, Limitations, and Future Research
This paper advances the state of the art of task termination by studying the joint optimal task abort and maintenance policies for task-based systems with a stochastic degradation process. The tasks can be attempted multiple times before a deadline. TSP and SSP are evaluated considering ITR and IITR with different task success criteria. Based on the system degradation process and time redundancy properties, dynamic task abort and preventive maintenance policies are designed by considering the degradation level, the remaining time for task execution and the cumulative operating time. TSP and SSP are evaluated by an event transition-based numerical algorithm. Based on the proposed framework, cost models are constructed to characterize the total cost due to maintenance, task failure and system failure. The optimal thresholds and performance of the policies under ITR and IITR are investigated. The results indicate that both TSP and SSP under IITR are higher than those under ITR.
The limitations of the current study and a number of corresponding future research directions are summarized as follows. Firstly, we assume that both the maintenance and abort thresholds depend on the number of task attempts. Future studies can consider maintenance and abort thresholds that depend on the remaining task to be completed. Secondly, maintenance time is assumed to be negligible in this study. The case of time-consuming maintenance activities, which is more realistic in engineering practice, is worth investigating. Last but not least, the current research can be extended to the case where a certain amount of work is required to complete the task.