1. Introduction
In recent years, increasing attention has been paid to a class of optimal control problems in which the dynamic systems are governed by switching diffusions, where the switching is modeled by a continuous-time Markov chain with unobservable (hidden) states; these are also known as partially observed optimal control problems. In these problems, one assumes an observable process y whose outcomes are “influenced” by the outcomes of the chain in a known way. Since the chain cannot be observed directly, the goal is to learn about it by observing y. In this spirit, this article concerns an optimal control problem with discounted and ergodic payoffs in which the dynamic system evolves according to a Markovian regime-switching diffusion with given continuous coefficient functions. The reward rate is allowed to be unbounded from above and from below. In this paper, the Wonham filter is used to estimate the states of the Markov chain from the observable evolution of a given process y. As a result, the original system is converted into a completely observable one.
Our main results extend the dynamic programming technique to this family of stochastic optimal control problems with an unbounded reward (or cost) rate per unit of time and Markovian regime-switching diffusions, where the regime switching is modeled by a continuous-time Markov chain with unobservable states. Early works include research on an optimal control problem with an ergodic payoff in which the dynamic system evolves according to Markovian switching diffusions; however, that diffusion does not depend on a hidden Markov chain [1]. The dynamic programming principle has also been derived for a partially observed optimal control problem in which the dynamic system is governed by a discrete-time Markov control process taking values in a finite-dimensional space [2]. Finally, one paper studied optimal control with completely observable Markovian switching and an unbounded reward rate [3]. As an application of our results, we study the discounted linear quadratic regulator (LQR) problem, the ergodic LQR problem for a quarter-car suspension model, the average (ergodic) LQR problem for a quarter-car suspension model with damping, and an explicit application to optimal pollution control. Other applications with bounded payoffs, different from those studied in this work, are found in [4,5,6].
The objective of the theory of controlled regime-switching diffusions is to model controlled diffusion systems whose dynamics are affected by discrete phenomena. In these systems, the discrete phenomena are modeled by a continuous-time Markov chain whose states represent the discrete phenomenon involved. There is an extensive list of references dealing with the case of completely observable stochastic optimal control in which a switching diffusion governs the stochastic system. A literature review includes the textbooks [7,8] and the papers [9,10,11,12,13,14], with several applications, including portfolio optimization, wireless communication systems, and wind turbines, among others.
Generally, to solve partially observed optimal control problems, where the dynamic systems are governed by a hidden Markovian switching diffusion, it is necessary to transform them into completely observed ones, which in our case is done using a Wonham filter. This Wonham filter estimates the hidden state of the Markov chain from the observable evolution of the process y. When these estimates replace the hidden states in the original system, it becomes a completely observable system [15,16] and ([17], Section 22.3). Numerical results for Wonham’s filter are given in [18].
The paper is organized as follows: in Section 1, an introduction is given. In Section 2, the main assumptions are given, and the partially observable system is converted into an observable system. The conditions that ensure the existence of optimal solutions for the optimal control problem with discounted payoff are given in Section 3. In Section 4, the conditions that ensure the existence of optimal solutions for the optimal control problem with average payoff are deduced. To illustrate our results, four applications are developed: an application to a linear quadratic regulator (LQR) with discounted payoff (Section 5); the development of a quarter-car suspension LQR model with an average payoff (Section 6); the study of the optimal control of a vehicle active suspension system with damping (Section 7); and an explicit application to optimal pollution control (Section 8).
2. Formulation of the Problem
This work focuses on controlled hybrid stochastic differential equations (HSDEs) under partial observation. To explain this, we first consider stochastic differential equations of the form:
where  and  in (1) depend on a finite-state, continuous-time, irreducible and aperiodic Markov chain  taking values in . For all  the transition probabilities are given by:
where the constants  are the transition rates from i to j and satisfy , and the transition matrix is denoted by . The control component is  with  a compact subset of , and W is a d-dimensional standard Brownian motion independent of . Throughout the work, it is considered that both the Markov chain  and the Brownian motion W are defined on a complete filtered probability space  that satisfies the usual conditions.
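Since the transition rates fully determine the law of the chain, sample paths of the Markov chain can be generated directly from its generator matrix. The following Python sketch is only an illustration under simplifying assumptions: the two-state generator Q, the initial state, and the horizon are hypothetical placeholders, not data from this paper.

```python
import numpy as np

def simulate_ctmc(Q, i0, T, rng=None):
    """Simulate a continuous-time Markov chain with generator Q on [0, T].

    Q[i, j] (i != j) is the jump rate from state i to state j and
    Q[i, i] = -sum_{j != i} Q[i, j].  Returns jump times and visited states.
    """
    rng = np.random.default_rng() if rng is None else rng
    times, states = [0.0], [i0]
    t, i = 0.0, i0
    while True:
        rate = -Q[i, i]                      # total exit rate of state i
        t += rng.exponential(1.0 / rate)     # holding time ~ Exp(rate)
        if t >= T:
            break
        probs = Q[i].copy()
        probs[i] = 0.0
        probs /= rate                        # jump distribution over other states
        i = rng.choice(len(probs), p=probs)
        times.append(t)
        states.append(i)
    return np.array(times), np.array(states)

# Hypothetical two-state generator, only for illustration.
Q = np.array([[-0.5, 0.5],
              [0.8, -0.8]])
jump_times, visited = simulate_ctmc(Q, i0=0, T=10.0)
```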
Until now, the switching diffusion (1) seems to be formulated as a classical switching diffusion, as in [11,12,13,14,19], among others. However, we propose that the process  is a hidden Markov chain, i.e., at any given instant of time, the exact state of the Markov chain  cannot be observed directly. Instead, we can only observe the process y given by:
whose dynamics depends on the value of . In Equation (2),  is a bounded function, whereas B is a one-dimensional Brownian motion independent of W and , and  is a positive constant.
Under partial observation, the best way to work is through nonlinear filtering. This technique studies the conditional distribution of  given the observed data accumulated up to time t, namely:
where  is the -algebra generated by the process  and
Taking into account the following notation:
and using the Wonham filtering techniques, we know that the process  in (3) satisfies the following equation (see, for instance, [15] or ([17], Section 22.3)):
where  is the  identity matrix. If we introduce the process:
then Equation (4) can be rewritten as:
Remark 1. Note that the unique solution of (5) exists up to an explosion time τ (see, for instance, [20]). However,  a.s. since  for all  and . At this point, we have defined the controlled HSDE with partial observation. To fulfill the objective of this work, that is, to solve an optimal control problem with discounted and average payoffs under partial observation, we will transform this problem into one with complete observation (see, for instance, [5,6,16]). First, we will establish the following notational convention.
For the coefficients  and  we have their filtered estimates:
and with equalities (6)–(7), we establish the new coefficients:
With the use of the above functions and Equation (1), we introduce the components of a new diffusion process as:
and therefore, we obtain from (5) and (8) the following controlled system with complete observation:
where  with:
Throughout this work, we will use the following Assumption 1.
Assumption 1. - (a)
The control set  is compact.
- (b)
 is a continuous function that satisfies the Lipschitz continuity property in x uniformly in , that is, there exists a constant  such that: - (c)
There exist constants  such that  satisfies: for all  and for all .
- (d)
There exists  with: for  and .
Under Assumption 1 and taking into account Remark 1, we know that the system (9) has a unique solution.
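In practice, the filter and the completely observed system (9) are integrated numerically, as in [18]. The sketch below is a minimal Euler–Maruyama illustration, assuming the Wonham filter takes its standard innovations form for a two-state chain; the generator Q, the observation drift g, and the noise intensity are hypothetical placeholders, and the sketch is not the scheme of [18].

```python
import numpy as np

def wonham_step(p, dy, dt, Q, g, sigma):
    """One Euler step of the Wonham filter for P(chain = i | observations).

    p     : current filtered probability vector
    dy    : observation increment over dt
    g     : observation drift g(i) for each state i
    sigma : observation noise intensity
    """
    g_bar = g @ p                                  # filtered estimate of g(psi_t)
    dp = (Q.T @ p) * dt + p * (g - g_bar) * (dy - g_bar * dt) / sigma**2
    p = np.clip(p + dp, 1e-12, None)
    return p / p.sum()                             # renormalize to a probability vector

# Hypothetical data for a two-state chain (illustration only).
rng = np.random.default_rng(0)
Q = np.array([[-0.5, 0.5], [0.8, -0.8]])
g = np.array([1.0, -1.0])
sigma, dt, T = 0.3, 1e-3, 5.0
n = int(T / dt)

psi = 0                                            # hidden state (simulated)
p = np.array([0.5, 0.5])                           # filter initialization
for _ in range(n):
    # simulate one step of the hidden two-state chain
    if rng.random() < -Q[psi, psi] * dt:
        psi = 1 - psi
    dy = g[psi] * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    p = wonham_step(p, dy, dt, Q, g, sigma)
```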
For , we denote by  and  the gradient and the Hessian matrix with respect to x, respectively, and  the scalar product. For a sufficiently smooth real-valued function , let:
with  the operator associated with Equation (9). In order to carry out the aim of this work, we define the control policies.
Definition 1. A function of the form  for some measurable function  is called a Markov policy, whereas  for some measurable function  is said to be a stationary Markov policy. The set of stationary Markov policies is denoted by .
The following assumption represents a Lyapunov-like condition.
Assumption 2. There exist a function  and constants  such that:
- (i)
 and
- (ii)
 for each  and .
It is important to point out that since the  is irreducible and aperiodic, we can ensure the existence of a unique invariant measure for the Markov–Feller process  (see [21,22]). Moreover, Assumption 2 allows us to conclude that the Markov process , where , is positive recurrent and there exists a unique invariant probability measure  for which the following is satisfied:
Note that for every , the measure belongs to the space defined as follows.
Definition 2. The w-norm is defined as:
where ν is a real-valued measurable function on  and w is the Lyapunov function given in Assumption 2. The normed linear space of real-valued measurable functions ν with finite w-norm is denoted by . Moreover, the normed linear space of finite signed measures μ on  such that:
where  is the total variation of μ, is denoted by .
Remark 2. For each  and , we get:
that is, the integral  is finite. The next result will be useful later.
Lemma 1. The condition  in Assumption 2 implies that:
- (a)
;
- (b)
 for all , , and ;
- (c)
 for all .
Proof. (a) After applying Dynkin’s formula to the function , we use case  of Assumption 2 to get:
Finally, if we multiply the inequality (12) by , we obtain the result. To prove , it is enough to take the limit in the inequality (12). Integrating both sides of (12) with respect to the invariant probability , we obtain , i.e., ; thus, the result follows. □
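For the reader’s convenience, a sketch of the computation behind (12) is as follows, written under the assumption that Assumption 2(ii) has the usual Lyapunov drift form with two placeholder constants c, d > 0 (standing in for the constants of Assumption 2):

```latex
% Sketch: Dynkin's formula applied to e^{ct} w(x(t)) under the assumed drift
% condition  L^u w(x,p) <= -c w(x) + d,  with placeholder constants c, d > 0.
\begin{aligned}
E_x\left[e^{ct}\, w(x(t))\right] - w(x)
  &= E_x\left[\int_0^t e^{cs}\left(c\, w(x(s)) + \mathcal{L}^{u} w(x(s),p(s))\right) ds\right] \\
  &\le E_x\left[\int_0^t e^{cs}\, d \, ds\right]
   = \frac{d}{c}\left(e^{ct}-1\right).
\end{aligned}
```

Multiplying by $e^{-ct}$ gives $E_x[w(x(t))] \le e^{-ct} w(x) + \frac{d}{c}\,(1 - e^{-ct})$, and letting $t \to \infty$ or integrating against the invariant measure yields the remaining parts of the lemma.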
In this work, the reward rate is a measurable function that satisfies the following conditions:
Assumption 3. - (a)
The function  is continuous on ; moreover, for each , there exists a constant  such that: i.e., r is locally Lipschitz in x uniformly with respect to  and .
- (b)
is in the normed linear space of real-valued functions , uniformly in u; that is, there exists  such that for all :
Notation. The reward rate in vector form is given by:
and its estimate is:
Henceforth, for each stationary Markov policy , we write:
4. Average Optimality Criteria
As in (10), let  for every .
Assumption 4. Let  be the solution of the hidden Markovian-switching diffusion (1)–(4). Then, we suppose that there exist positive constants C and δ such that:
for all , , and . That is, we assume that the process  is uniformly w-exponentially ergodic. Next, we define the long-run average optimality criterion.
Definition 5. For each , , and , let: The long-run expected average reward given the initial state  is: The function: is referred to as the optimal gain or the optimal average reward. If there is a policy  for which  for all , then  is called average optimal.
Remark 4. In some optimal control problems, the limit of  as  might not exist. To avoid this difficulty, the average payoff is defined as a liminf, as in (21), which can be interpreted as the worst average payoff, which is then maximized.
For each , let:
with  as in (10). Now, observe that  defined in (20) can be expressed as:
therefore, multiplying (23) by  and letting , we obtain, by (19):
Moreover, by the definition (22) of , Assumption 3b, and (10):
Therefore, by Lemma 1c:
thus, the reward  is uniformly bounded on . From (24) and (25), we obtain that the following:
has a finite value.
Thus, under Assumptions 1, 2, and 4, it follows from (19) (w-exponential ergodicity) and (22) that the long-run expected average reward (21) coincides with the constant  for every . Indeed, note that  defined in (20) can be expressed as:
Definition 6. (a) A pair  consisting of a constant  and a function  is said to be a solution of the average reward HJB equation if: (b) If a stationary policy  attains the maximum in (27), that is: then f is called a canonical policy.
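As a sketch only, and assuming the standard form of the ergodic dynamic programming equation for a controlled diffusion with generator as in (10) and reward rate r, Equation (27) can be read as the pair (J, h) satisfying (the exact statement in (27) is authoritative):

```latex
% Usual form of the average-reward HJB equation (sketch; see Equation (27)):
J \;=\; \max_{u \in U} \Bigl\{\, r(x,p,u) \;+\; \mathcal{L}^{u} h(x,p) \,\Bigr\}
\quad \text{for all admissible states } (x,p),
```

with the canonical policy attaining the maximum pointwise, as in (28).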
The following theorem shows that if a policy satisfies the average reward HJB-equation, then it is an optimal average policy.
Theorem 1. If Assumptions 1, 2, and 3 hold, then:
- (i)
The average reward HJB Equation (27) admits a unique solution , with  satisfying  for some  fixed. - (ii)
There exists a canonical policy.
- (iii)
The constant J in (27) equals  in (26). - (iv)
There exists a stationary average optimal policy.
Proof. The steps of the proof of this part are essentially the same as those given in the proof of Theorem 6.4 in [24]; thus, we omit them.
Since  and  are continuous functions on the compact set , we obtain that  is a continuous function on ; thus, the existence of a canonical policy  follows from standard measurable selection theorems; see [25] (Theorem 12.2).
Observe that, by (27):
Therefore, for any , using Dynkin’s formula and (29), we obtain:
Thus, multiplying by  in (30), we have:
Consequently, letting  in (31) and using Lemma 1b and (24), we obtain:
To obtain the reverse inequality, similar arguments show that if:
then  for all . This last inequality, together with (29), yields that if  is a canonical policy satisfying (28), then , and by (26):
Similar arguments to those given in  allow us to conclude that if  is a canonical policy, then it is average optimal. □
Theorem 1 indicates that if a policy satisfies the HJB Equation (27), then this policy is optimal for the control problem associated with the HJB equation. The difficulty with this approach lies in how to obtain a solution  of the HJB equation. The most common way to solve the HJB equation is based on variants of the vanishing discount approach (see [11,24,26] for details).
Remark 5 ([1]). In the optimality criteria known as bias optimality, overtaking optimality, sensitive discount optimality, and Blackwell optimality, the early returns and the asymptotic returns are both relevant; thus, to study them, we first need to analyze the discounted and average optimality criteria. These optimality criteria will be studied in future work.
Remark 6. On Assumption 1 (see [7], Theorems 3.17 and 3.18). The uniform Lipschitz and linear growth conditions on b and σ ensure the existence and uniqueness of the global solution of the SDE with Markovian switching (1). The uniform Lipschitz condition implies that the rates of change of the functions  and  are less than or equal to the rate of change of a linear function of x. This gives, in particular, the continuity of b and σ in x for all . Thus, the uniform Lipschitz condition excludes functions b and σ that are discontinuous with respect to x. It is important to note that even if a function is continuous, this does not guarantee that it satisfies the uniform Lipschitz condition; for example, the continuous function  does not satisfy this condition. The uniform Lipschitz condition can be replaced by the local Lipschitz condition. In fact, the local Lipschitz condition allows us to include a great variety of functions, such as  functions. However, the linear growth condition (Assumption 1 (d)) also excludes some important functions, such as . Assumption 1 (d) is quite standard but may be restrictive for some applications. As far as the results of this paper are concerned, the uniform Lipschitz condition may be replaced by the weaker condition:
where K is a positive constant. This last condition allows us to include many functions as the coefficients b and . For example:
with  such that  and for some given continuous function . It is possible to check that a diffusion process with the parameters given above satisfies the local Lipschitz condition, but the linear growth condition is not satisfied. On the other hand, note that:
with  and a compact control set U. That is, condition (33) is fulfilled. Thus, ([7], Theorem 3.18) guarantees that the SDE with Markovian switching with these coefficients has a unique global solution on .
On Assumption 2 (see [7], Theorem 5.2). This assumption guarantees the positive recurrence and the existence of an invariant measure  for the Markov–Feller process . Moreover, if this assumption holds together with the inequality  for positive numbers , then the diffusion process (1) satisfies:
that is,  is asymptotically bounded in the th moment. Some Lyapunov functions are, for example:
considering that the coefficients b and  in (1) satisfy the Lipschitz condition and:
with  and  constants. In fact, using the inequality  and (35), we get:
where . Now, taking the Lyapunov function (34), we define:
Considering that , ,  and , a similar procedure to that given in (37) allows us to obtain that W is also a Lyapunov function. That is:
On Assumption 3. This assumption allows the reward rate (or cost rate) to be unbounded from above and below. For the Lyapunov function , a reward rate of the form:
for some continuous function  satisfies Assumption 3. In fact:
with  and U a compact set.
On Assumption 4. This assumption indicates the asymptotic behavior of  as t goes to infinity. Sufficient conditions for the w-exponential ergodicity of the process  can be found in ([1], Theorem 2.8). In fact, in the proof of that theorem, Assumptions 1 and 2 are required.
Note that, for the optimal control problem with the discounted optimality criterion, the w-exponential ergodicity of the process  is not required. This assumption is only necessary to study the average reward optimality criterion.
Remark 7. In the following sections, our theoretical results are implemented in three applications. The dynamic system in the three applications evolves according to linear stochastic differential equations, so Assumption 1 holds. The number of states of the Markov chain is 2, that is, . The payoff rate is of the form  with  and , . Taking , we get:
with ; thus, Assumption 3 also holds. A few calculations allow us to obtain Assumption 2 with . In fact:
Let , and rewrite  as:
where , where the last inequality is obtained from the fact that the function  is continuous on the compact set U for all  and that the term  is negative. Thus,  and Assumption 2b follows.
5. Application 1: Discounted Linear Quadratic Regulator (LQR)
In this section, we consider the -discounted linear quadratic regulator. To this end, we suppose that the dynamic system evolves according to the linear stochastic differential equations:
with , , ,  an m-dimensional Brownian motion, and  a positive constant. The expected cost is:
where , , , and . The optimality equation, or HJB equation, for the -discounted partially observed LQR optimal control problem is:
where the infinitesimal generator for the process  applied to  is:
where
Note that, by minimizing (40) with respect to u, we find that the optimal control is of the form:
By Proposition 2, if there exist a function  and a policy  such that (14) and (15) hold, then v coincides with the value function  and  is the -discount optimal policy. Thus, we propose that the function  that solves the HJB Equation (40) has the form:
where  is a twice continuously differentiable function, c is a constant, and K is a positive definite matrix. Inserting the derivative of  in (43), we get the optimal control:
where the equality (40) holds if the matrix K satisfies the algebraic Riccati equation:
and  satisfies the partial differential equation:
where  is as in (42),  is the identity matrix of , and  and  are the gradient and the Hessian of , respectively.
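For numerical work, the matrix K can be obtained with a standard continuous-time algebraic Riccati solver. The sketch below is an illustration only: the matrices A, B, Q, R and the discount rate alpha are hypothetical placeholders, and it uses the common device of absorbing the discount into the drift matrix (replacing A by A - (alpha/2)I); whether this matches the exact Riccati equation above depends on the sign and weighting conventions adopted in (40).

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Hypothetical problem data (illustration only).
alpha = 0.1                      # discount rate
A = np.array([[0.0, 1.0],
              [-1.0, -0.5]])     # drift matrix
B = np.array([[0.0],
              [1.0]])            # control matrix
Q = np.eye(2)                    # state cost weight
R = np.array([[1.0]])            # control cost weight

# Discounted LQR: absorb the discount into the drift, A_alpha = A - (alpha/2) I,
# then solve the standard CARE  A_alpha' K + K A_alpha - K B R^{-1} B' K + Q = 0.
A_alpha = A - 0.5 * alpha * np.eye(A.shape[0])
K = solve_continuous_are(A_alpha, B, Q, R)

# Feedback gain of the candidate optimal control u = -R^{-1} B' K x.
gain = np.linalg.solve(R, B.T @ K)
print("K =", K)
print("gain =", gain)
```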
Simulation results. In the following figures, we assume that the Markov chain  has two states, namely, , and the dynamic system . We have computed the Wonham filter, the states of the dynamic system (39) with initial condition , the value function (44), and the optimal control (45) for the following data: , , , , , , , , :
and the transition matrix:
To solve the Wonham filter, we use the numerical method given in ([18], Section 8.4), considering that the Markov chain can only be observed through .
Figure 1 shows the solution of the Wonham filter equation and the states of the hidden Markov chain . As can be noted, in  s, , implying that with higher probability the Markov chain is in state 2 in  (). The evolution of the dynamic system (39) is given in Figure 2 (top); in this figure, we can note that the optimal control (45) moves the initial point  to the point  in  s, indicating the good performance of the optimal control (45). The asymptotic behavior of the optimal control (45) is given in Figure 2 (bottom); this control stabilizes at zero around  s, since  also stabilizes at zero around  s.
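To reproduce figures of this type qualitatively, the hidden chain, the observation, the Wonham filter, and the controlled state can be advanced together with an Euler–Maruyama scheme. The sketch below is an illustration only: the matrices, the chain generator, the observation drift g, and the noise intensities are hypothetical placeholders (not the data listed above), and the regime dependence of the system matrices is omitted for brevity, so only the observation drift depends on the hidden state.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

rng = np.random.default_rng(1)
dt, T = 1e-3, 6.0
n = int(T / dt)

# Hypothetical data (illustration only).
Qrate = np.array([[-0.5, 0.5], [0.8, -0.8]])   # generator of the hidden chain
g = np.array([1.0, -1.0])                      # observation drift per regime
sig_y, sig_x = 0.3, 0.1                        # observation / state noise intensities
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
Qc, Rc, alpha = np.eye(2), np.array([[1.0]]), 0.1

K = solve_continuous_are(A - 0.5 * alpha * np.eye(2), B, Qc, Rc)
gain = np.linalg.solve(Rc, B.T @ K)            # u = -R^{-1} B' K x

psi, p, x = 0, np.array([0.5, 0.5]), np.array([2.0, -1.0])
for _ in range(n):
    if rng.random() < -Qrate[psi, psi] * dt:   # hidden chain jump (two states)
        psi = 1 - psi
    dy = g[psi] * dt + sig_y * np.sqrt(dt) * rng.standard_normal()
    g_bar = g @ p                              # Wonham filter update
    p = p + (Qrate.T @ p) * dt + p * (g - g_bar) * (dy - g_bar * dt) / sig_y**2
    p = np.clip(p, 1e-12, None); p /= p.sum()
    u = -(gain @ x)                            # LQR feedback on the observable state
    x = x + (A @ x + (B @ u).ravel()) * dt + sig_x * np.sqrt(dt) * rng.standard_normal(2)
```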
6. Application 2: Average LQR: Modeling of a Quarter-Car Suspension
In this section, the basic quarter-car suspension model analyzed in [27] is considered; see Figure 3. The parameters are: the sprung mass (), the unsprung mass (), the suspension spring constant (), and the tire spring constant (k). Let , and  be the vertical displacements of the sprung mass, the unsprung mass, and the road profile, respectively. The equations of motion for this model are given by:
Now, defining , , , and , the equations of motion (46) and (47) can be expressed in matrix form as:
where , and, in the time domain, the road profile  can be represented as the output of a linear first-order filter driven by white noise as follows:
where V is the vehicle speed (assumed constant),  is a positive constant, and a is the road roughness coefficient, which depends on the type of road. Here, we assume that a depends on a hidden Markov chain, that is,  with
In our case, we consider that the dynamic system (48) evolves with additional white noise, that is:
The experts introduced the following performance index in order to trade off ride comfort against handling while maintaining the constraint on suspension deflection:
Defining  and , we can rewrite (50) as:
Now, from the equations of motion in (46) and (47), note that  with  and
Thus, replacing this matrix form of y in (51), we can rewrite (50) again as:
where , , .
The optimal control problem (OCP). The OCP in this application consists of finding  that minimizes the performance index (52), considering that the dynamic system evolves according to the stochastic differential Equation (49).
In the dynamic programming technique, we need the infinitesimal generator  of the process  applied to ; in this case, this generator is:
where , whereas the Hamilton–Jacobi–Bellman equation (or dynamic programming equation) associated with this problem is:
see [28] for more details.
Proposition 3. Assume that  evolves according to (49). Then, the control that minimizes the long-run cost (52) is: whereas the corresponding function v that solves the HJB Equation (54) is given by: where K is a positive semi-definite matrix that satisfies the Riccati differential equation  and  satisfies the differential equation: and  satisfies the partial differential equation: where  is as in (41), and  and  denote the gradient and the Hessian of , respectively. The optimal cost is given by:
Proof. The HJB equation for the partially observed LQR optimal control problem, in which  evolves according to (49) with the finite cost (52), is (54), where  is the infinitesimal generator given in (53). We are looking for a candidate solution  to (54) in the form:
,
and
K a positive semi-definite matrix. We assume that
for all
and
is positive definite, so that the function
is convex.
Now, the function
is strictly convex on the compact set
U, and thus, attains its minimum at:
Inserting
and the partial derivatives of
v with respect to
x,
, and
in the HJB-Equation (
54), we obtain:
For equality (61) to hold, it is necessary that the functions g and h satisfy (57) and (58), respectively, and that the matrix K satisfies the Riccati differential Equation (56), whereas the constant . Finally, from Theorem 1, it follows that  is an optimal Markovian control and the value function  is equal to (59). That is:
□
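A numerical illustration of Proposition 3 can be set up by building the state matrices of the quarter-car model and solving the Riccati equation. The sketch below uses hypothetical parameter values and weights (not those of the simulation subsection), a particular choice of state variables that may differ from the one adopted above, and a steady-state algebraic Riccati equation in place of the Riccati differential Equation (56), which is a common simplification for the long-run average problem.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Hypothetical quarter-car parameters (illustration only).
m_s, m_u = 300.0, 40.0          # sprung / unsprung mass [kg]
k_s, k_t = 16000.0, 160000.0    # suspension / tire stiffness [N/m]

# One common state choice: x = [suspension deflection, sprung-mass velocity,
#                               tire deflection, unsprung-mass velocity];
# the road velocity input is omitted in this deterministic sketch.
A = np.array([
    [0.0,         1.0, 0.0,        -1.0],
    [-k_s / m_s,  0.0, 0.0,         0.0],
    [0.0,         0.0, 0.0,         1.0],
    [k_s / m_u,   0.0, -k_t / m_u,  0.0],
])
B = np.array([[0.0], [1.0 / m_s], [0.0], [-1.0 / m_u]])

# Hypothetical quadratic weights trading off comfort and deflections.
Q = np.diag([1e3, 1.0, 1e4, 1.0])
R = np.array([[1e-4]])

K = solve_continuous_are(A, B, Q, R)       # steady-state Riccati solution
gain = np.linalg.solve(R, B.T @ K)         # feedback u = -R^{-1} B' K x
print(gain)
```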
Simulation results. To solve the Wonham filter, we use the numerical method given in ([18], Section 8.4), considering that the Markov chain  has two states that can only be observed through . The following data were used: , , , , , , , , × , , ,  kg,  kg,  N/m,  N/m,  m/s, , , and:
The solution of the Wonham filter equation and the states of the hidden Markov chain  are shown in Figure 4. As can be noted, in  s, , implying that the Markov chain is in state 1 at  with a probability greater than .
The asymptotic behavior of the optimal control (55) is given in Figure 5 (bottom). It is interesting to note that this control minimizes the magnitude of the sprung mass velocity  and the unsprung mass velocity  after  s; see Figure 5 (top). This behavior implies that the magnitudes of the sprung mass acceleration  and the unsprung mass acceleration  are also minimized, considering that the stochastic differential equation that models the road profile depends on a hidden Markov chain. These results agree with those obtained by the authors in [27]. These authors mention that two important objectives of a suspension system are ride comfort and handling performance. Ride comfort requires that the car body be isolated from road disturbances as much as possible to provide a good feeling for passengers. In practice, we seek to minimize the acceleration of the sprung mass.
7. Application 3: Optimal Control of a Vehicle Active Suspension System with Damping
The model analyzed in this section is given in [29]. In this application, a damper  is added to the quarter-car suspension given in Section 6; see Figure 6. The parameters in Figure 6 are: the sprung mass (), the unsprung mass (), the suspension spring constant (), and the tire spring constant (k). Let , and r be the vertical displacements of the sprung mass, the unsprung mass, and the road disturbance, respectively. The equations of motion are given by:
Now, defining , , , and , the equations of motion in (62) and (63) can be expressed in matrix form as:
where , and we assume that the road profile  is represented by a function with hidden Markovian switchings:
where  (road bump height is 10 cm),  (road bump height is 16 cm), and ,  are the random jump times of . In our case, we consider that the dynamic system (64) evolves with additional white noise, that is:
and we wish to minimize the discounted expected cost:
subject to (66) and (65). Considering the infinitesimal generator given in (53) with  and the Hamilton–Jacobi–Bellman equation associated with this problem:
similar arguments to those given in Section 5 and Section 6 allow us to find the optimal control  and the value function  for this setting. In fact:
where  is a twice continuously differentiable function, c is a constant,  is a twice continuously differentiable function, and K is a positive definite matrix. Inserting the derivative of  in (43), we get the optimal control:
where the matrix K satisfies the algebraic Riccati equation:
the function  satisfies the differential equation:
and  satisfies the partial differential equation:
where  is as in (42),  is the identity matrix of , and  and  are the gradient and the Hessian of , respectively.
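The road disturbance in (65) switches between two bump heights driven by the hidden chain, and sample paths of it are easy to generate once the chain is simulated. The following sketch is an illustration only: the jump-rate matrix Q and the time grid are hypothetical, the bump heights 0.10 m and 0.16 m are those stated above, and the exact bump shape used in (65) is not reproduced (the profile is taken piecewise constant at the current bump height).

```python
import numpy as np

rng = np.random.default_rng(2)
dt, T = 1e-3, 10.0
n = int(T / dt)

# Hypothetical generator of the hidden two-state chain (illustration only).
Q = np.array([[-0.4, 0.4], [0.6, -0.6]])
heights = np.array([0.10, 0.16])   # road bump heights [m] per regime

psi = 0
r = np.empty(n)                    # sample path of the switching road profile
for k in range(n):
    if rng.random() < -Q[psi, psi] * dt:
        psi = 1 - psi              # regime switch at a random jump time
    r[k] = heights[psi]
```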
Simulation results. To solve the Wonham filter, we use the numerical method given in ([18], Section 8.4), considering that the Markov chain  has two states and can only be observed through . The following data were used: , , , , , , , , ,  kg,  kg,  N/m,  N/m,  N/m, and:
Figure 7 shows the solution of the Wonham filter equation and the states of the hidden Markov chain . As can be seen, in the time interval , , implying that the Markov chain is in state 1 with a probability greater than .
The asymptotic behavior of the optimal control (67) is given in Figure 8 (bottom). It is interesting to note that this control minimizes the magnitudes of the sprung mass displacement  and the unsprung mass displacement , as well as their velocities  and , after  s; see Figure 8 (top).
8. Application 4: Optimal Pollution Control with Average Payoff
This application studies the pollution accumulation incurred by the consumption of a certain product, such as gas or petroleum; see [30]. The stock of pollution  is governed by the controlled diffusion process:
where  represents the pollution flow generated by an entity due to the consumption of the product,  represents the decay rate of pollution, chosen at each time by nature, and k is a positive constant. We shall assume that  is bounded and that the parameter  represents the consumption/production restriction. Let  be a Markov chain with two states  and a generator Q given by:
The reward rate  in this example represents the social welfare and is defined as:
where  and  are the social utility of the consumption u and the social disutility of the pollution , respectively. We assume that the function F in (69) satisfies:
Clearly, (68) is a linear stochastic differential equation and satisfies Assumption 1.
Now, we define the Banach space  and use , . Hence,  and Assumption 2(i) holds. On the other hand, since the utility function  is continuous on the compact interval , then:
where ; thus, Assumption 3 holds. Note that:
Thus, taking  and , we obtain:
Therefore, Assumption 2(ii) holds. It can be proven that the process (68) satisfies Assumption 2.6 in [1]; thus, by ([1], Theorem 2.8),  is exponentially ergodic (Assumption 4). In this application, we seek a policy u that maximizes the long-run average welfare :
We propose , where  and , as a solution that verifies the HJB Equation (27) associated with this pollution control problem. Simple calculations allow us to conclude that the policy on consumption/pollution takes the form:
where  is the inverse of the derivative , .