1. Introduction
Control of unknown dynamic systems with uncertainties is challenging because exact mathematical models are often required. Since many processes are complicated, nonlinear, and time-varying, a control algorithm that does not depend on a mathematical model and can adapt to time-varying conditions is required. A popular approach is to develop a universal approximator for predicting the output of unknown systems [1]. Control algorithms can then be designed based on the parameters of the approximator. Based on this approach, many control techniques have been proposed using machine learning models such as neural networks and fuzzy logic. For example, Goyal et al. [2] proposed a robust sliding mode controller that can be designed from Chebyshev neural networks. Chadli and Guerra [3] introduced a robust static output feedback controller for Takagi–Sugeno fuzzy models. Ngo and Shin [4] proposed a method to model unstructured uncertainties and a new Takagi–Sugeno fuzzy controller using type-2 fuzzy neural networks.
However, obtaining a good approximator requires a significant amount of training data, especially for a complicated model with a high-dimensional state space or with many inputs and outputs. The data-driven model must also be updated frequently for time-varying systems. In addition, many control design techniques assume that the uncertainties are functions of the system parameters. However, in many cases, the causes of the uncertainties are unknown and unstructured. With the development of data science and machine learning, model-free approaches such as reinforcement learning (RL) have emerged as an effective method to control unknown nonlinear systems [5,6,7,8]. The principle of RL is based on the interaction between a decision-making agent and its environment [9], and the actor–critic method is often used as the RL framework for many control algorithms. In the actor–critic framework, the critic agent uses the current state information of the environment to update the value or action-value function. The actor agent then uses the value or action-value function to calculate the optimal action.
It can be seen that many data-driven algorithms lack a stability analysis of the closed-loop system. Among recent techniques focusing on the robustness of control algorithms, Yang et al. [10] presented an off-policy RL solution to robust control problems for a class of unknown systems with structured uncertainties. In [11], a robust data-driven controller was proposed based on the frequency response of multivariable systems and convex optimization. Based on data-driven tuning, Takabe et al. [12] introduced a detection algorithm suitable for massive overloaded multiple-input multiple-output systems. In more recent works, Na et al. [13] proposed an approach to address output-feedback robust control for continuous-time uncertain systems using online data-driven learning, while Makarem et al. [14] used data-driven techniques for iterative feedback tuning of a proportional-integral-derivative controller's parameters. However, in many cases, stability can only be ensured for specific systems where the uncertainties are structured. In addition, the value function must be estimated accurately, which is difficult to achieve, especially at the beginning of the control process when the agent has just started interacting with the environment. Additionally, in many applications, the state space is either continuous or high-dimensional. In these cases, the value function approximation is often inaccurate, potentially leading to instability. Therefore, new RL approaches for which stability can be guaranteed under uncertain conditions are essential if such algorithms are to be used in critical and safety-demanding systems.
Type 1 diabetes is a disease caused by the lack of insulin secretion. The condition results in an uncontrolled increase of the blood glucose level if patients are not provided with insulin doses. A high blood glucose level can lead to both acute and chronic complications and can eventually result in the failure of various organs. One of the major challenges in controlling blood glucose is that the biochemical and physiological kinetics of insulin and glucose are complicated, nonlinear, and only approximately known [15]. Additionally, the stability of the control system is essential in this case, since an unstable control effort can lead to life-threatening conditions for the patient.
This paper proposes a novel method to capture the uncertainty in estimating the value function in reinforcement learning based on observation data. Using this uncertainty information, the paper also presents a new technique to improve the policy while guaranteeing the stability of the closed-loop system under uncertain conditions for partially unknown dynamical systems. The proposed methodology is applied to a blood glucose model to test its effectiveness in controlling the blood glucose level of patients with Type 1 diabetes.
Structure of Paper
The content of the paper is organized as follows. Section 2 describes the proposed robust RL algorithm. Section 3 shows the simulation results of the methodology. The conclusions are given in Section 4.
2. Materials and Methods
In this section, we present the robust RL method and the simulation setup used to evaluate the algorithm.
2.1. Robust Control Using Reinforcement Learning
In this paper, a class of dynamical systems is considered, which can be described by the following linear state-space equation:

$$\dot{x}(t) = A\,x(t) + B\,u(t),$$

where $x(t) \in \mathbb{R}^{n}$ is the vector of $n$ state variables, $u(t) \in \mathbb{R}^{m}$ is the vector of $m$ control inputs, $A \in \mathbb{R}^{n \times n}$ is the state matrix, and $B \in \mathbb{R}^{n \times m}$ is the input matrix. It is assumed that $A$ is a square $n \times n$ unknown matrix and that the pair $(A, B)$ is stabilizable. Our target is to derive a control algorithm $u(t)$ that can regulate the state variables contained in $x(t)$ based on input and output data without knowing matrix $A$.
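To make the notation concrete, the following minimal sketch (in Python) simulates a hypothetical second-order system of the form above under a fixed linear feedback law and records the state data that the learning agent would observe. The matrices A_true and B and the gain K0 are illustrative placeholders, not values from the paper; only B and the recorded data are assumed to be available to the controller.

```python
import numpy as np

# Hypothetical second-order system x_dot = A x + B u (A is unknown to the controller).
A_true = np.array([[0.0, 1.0],
                   [-2.0, -0.5]])    # placeholder "unknown" state matrix
B = np.array([[0.0],
              [1.0]])                # input matrix, assumed known

K0 = np.array([[-1.0, -1.0]])        # initial stabilizing feedback gain, u = K0 @ x
dt, steps = 0.01, 500                # Euler step and horizon for data collection

x = np.array([1.0, 0.0])             # initial state
trajectory, inputs = [x.copy()], []
for _ in range(steps):
    u = K0 @ x                        # action of the current policy
    x = x + dt * (A_true @ x + B @ u)  # forward-Euler integration of the dynamics
    trajectory.append(x.copy())
    inputs.append(u.copy())

data_x = np.array(trajectory)         # state data available to the RL agent
data_u = np.array(inputs)
print(data_x[-1])                     # the state should decay toward the origin
```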
As an RL framework, the proposed robust control algorithm consists of an agent that takes actions and learns the consequences of its actions in an unknown environment. The environment is defined by a state vector $x(t)$ that describes its state at time $t$. The action at time $t$ is represented by $u(t)$. As a consequence of the action, a cost is incurred and accumulated. The cost function is assumed to be known and predefined as a function of the current state and action. The objective of the learning process is to minimize the total cost accumulated in the future.
At each decision time point, the agent receives information about the state of the environment and chooses an action. The environment reacts to this action and transitions to a new state, which determines whether the agent receives a positive or negative reinforcement. Current RL techniques propose optimal actions by minimizing the predicted cost accumulation. However, uncertainties due to noise in the data or inaccurate estimation of the cost accumulation can lead to suboptimal actions and even unstable responses. Our target is to provide the agent with a robust and safe action that can guarantee a reduction of the future cost accumulation in the presence of uncertainties. The action calculated by the proposed algorithm may not be the optimal action that reduces the cost in the fastest way, but it always guarantees the stability of the system, which is imperative in many critical applications.
2.1.1. Estimation of the Value Function by the Critics
In the RL context, the accumulated cost over time, when starting in the state $x(t)$ and following the policy $u = \mu(x)$, is defined as the value function of the policy $\mu$, i.e.,

$$V^{\mu}(x(t)) = \int_{t}^{\infty} e^{-\gamma(\tau - t)}\, r(x(\tau))\, d\tau,$$

where $\gamma$ is the discount factor. The cost $r(x)$ is assumed to be a quadratic function of the states:

$$r(x) = x^{T} Q\, x,$$

where the weighting matrix $Q$ is symmetric and positive semidefinite (since the cost is assumed to be non-negative) and contains the weighting factors of the variables that are minimized.
In order to facilitate the formulation of the stability condition in the form of linear matrix inequalities (LMI), the value function $V^{\mu}(x)$ is approximated by a quadratic function of the states:

$$V^{\mu}(x) \approx \hat{V}(x) = x^{T} P\, x,$$

where the kernel matrix $P$ is symmetric and positive semidefinite (since matrix $Q$ in the cost function is symmetric and positive semidefinite).
By using the Kronecker operation, the approximated value function can be expressed as a linear combination of the basis functions $\phi(x)$:

$$\hat{V}(x) = w^{T} \phi(x), \qquad \phi(x) = x \otimes x,$$

where $w$ is the parameter vector, $\phi(x)$ is the vector of basis functions, and $\otimes$ is the Kronecker product. The transformation between $w$ and $P$ can be performed as follows:

$$w_{(k-1)n + j} = p_{jk},$$

where $p_{jk}$ is the element of matrix $P$ in the $j$th row and $k$th column. With $T$ as the interval time for data sampling, the integral RL Bellman equation can be used to update the value function [8]:

$$\hat{V}(x(t)) = \int_{t}^{t+T} e^{-\gamma(\tau - t)}\, r(x(\tau))\, d\tau + e^{-\gamma T}\, \hat{V}(x(t+T)).$$

By using the quadratic cost function (Equation (3)) and the approximated value function (Equation (5)), the integral RL Bellman equation can be written as follows:

$$x(t)^{T} P\, x(t) - e^{-\gamma T}\, x(t+T)^{T} P\, x(t+T) = \int_{t}^{t+T} e^{-\gamma(\tau - t)}\, x(\tau)^{T} Q\, x(\tau)\, d\tau,$$

or

$$w^{T}\big[\phi(x(t)) - e^{-\gamma T}\phi(x(t+T))\big] = \int_{t}^{t+T} e^{-\gamma(\tau - t)}\, x(\tau)^{T} Q\, x(\tau)\, d\tau.$$
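As a quick numerical check of the Kronecker-product representation above, the short sketch below verifies that $x^{T} P x$ equals $w^{T}(x \otimes x)$ when $w$ is obtained by stacking the columns of $P$; the matrix $P$ and the vector $x$ are arbitrary illustrative values.

```python
import numpy as np

n = 3
rng = np.random.default_rng(0)

M = rng.standard_normal((n, n))
P = M @ M.T                      # symmetric positive semidefinite kernel matrix
x = rng.standard_normal(n)

w = P.flatten(order="F")         # parameter vector: columns of P stacked (vec operation)
phi = np.kron(x, x)              # basis functions x (Kronecker) x

quadratic_form = x @ P @ x
kronecker_form = w @ phi
print(np.allclose(quadratic_form, kronecker_form))  # True

# Recover the kernel matrix from the parameter vector by reshaping column-wise.
P_recovered = w.reshape((n, n), order="F")
print(np.allclose(P, P_recovered))                  # True
```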
At each iteration, n samples along the state trajectory are collected. The mean value of $w$ can be obtained by using the least-squares technique:

$$\bar{w} = \left(X^{T} X\right)^{-1} X^{T} y,$$

where $X$ is the matrix whose $k$th row is $\big[\phi(x(t_k)) - e^{-\gamma T}\phi(x(t_k + T))\big]^{T}$ and $y$ is the vector whose $k$th element is $\int_{t_k}^{t_k + T} e^{-\gamma(\tau - t_k)}\, x(\tau)^{T} Q\, x(\tau)\, d\tau$, with $k = 1, \dots, n$.
The confidence interval for the coefficient $w_j$ is given by

$$\bar{w}_j - \Delta w_j \le w_j \le \bar{w}_j + \Delta w_j, \qquad \Delta w_j = z_{1-\delta/2}\, \hat{\sigma}\, \sqrt{\big[(X^{T}X)^{-1}\big]_{jj}},$$

where $1 - \delta$ is the confidence level, $z_{1-\delta/2}$ is the quantile function of the standard normal distribution, $\big[(X^{T}X)^{-1}\big]_{jj}$ is the $j$th element on the diagonal of $(X^{T}X)^{-1}$, and $\hat{\sigma}^{2}$ is the estimated variance of the regression residuals $y - X\bar{w}$. From that, the uncertainty $\Delta w_j$ is defined as the deviation interval around the nominal value $\bar{w}_j$, as given above. Matrices $\bar{P}$ and $\Delta P$ can be obtained by placing the elements of $\bar{w}$ and $\Delta w$ into columns.
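A minimal sketch of the critic update is given below, assuming the regressor and confidence-interval expressions reconstructed above (standard least-squares theory with a normal quantile). The function name critic_update and its arguments are illustrative, and the integrated costs are assumed to be computed beforehand from the sampled trajectory.

```python
import numpy as np
from scipy.stats import norm

def critic_update(Phi_t, Phi_tT, costs, gamma, T, delta=0.05):
    """Least-squares estimate of w with elementwise confidence intervals.

    Phi_t, Phi_tT : arrays of basis vectors phi(x(t_k)) and phi(x(t_k + T)),
                    one row per collected sample.
    costs         : integrated (discounted) cost over each interval [t_k, t_k + T].
    Returns the nominal parameters w_bar and their deviations dw.
    """
    X = Phi_t - np.exp(-gamma * T) * Phi_tT      # regressor from the Bellman equation
    y = np.asarray(costs)

    XtX_inv = np.linalg.pinv(X.T @ X)
    w_bar = XtX_inv @ X.T @ y                    # nominal (mean) parameter vector

    residuals = y - X @ w_bar
    dof = max(X.shape[0] - X.shape[1], 1)        # degrees of freedom of the regression
    sigma2 = residuals @ residuals / dof         # residual variance estimate

    z = norm.ppf(1.0 - delta / 2.0)              # standard normal quantile
    dw = z * np.sqrt(sigma2 * np.diag(XtX_inv))  # half-width of the confidence interval
    return w_bar, dw

# The matrices P_bar and Delta_P are then obtained by reshaping w_bar and dw
# column-wise, e.g. P_bar = w_bar.reshape((n, n), order="F").
```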
2.1.2. Policy Improvement by the Actor
Linear feedback controllers have been widely used as a stabilization tool for nonlinear systems whose dynamic behavior is approximately linear around the operating condition [16,17,18]. Hence, in this paper, we use linear functions of the states with gain $K_i$ as the control policy at iteration $i$:

$$u(t) = K_i\, x(t),$$

and the level of uncertainty is assumed to be constant during the control process. The task of the actor is to robustly improve the current policy such that the value function is guaranteed to decrease during the next policy implementation. If the following differential inequality is satisfied:

$$\dot{V}(x(t)) \le -\alpha\, V(x(t))$$

with some positive constant $\alpha$, then, by using the comparison lemma (Lemma 3.4 in [19]), the value function can be bounded by

$$V(x(t)) \le V(x(t_0))\, e^{-\alpha (t - t_0)}.$$

Therefore, maximizing the rate $\alpha$ will ensure a maximal exponential decrease in the value of $V(x)$.
The following part shows the main results of the paper, which describe how the policy gain can be improved during the learning process. Derivations of the results are provided in the stability analysis (Section 2.1.3).
Definition 1. Assume $A$ is a square matrix with dimension $n \times n$ and $x$ is a vector with dimension $n \times 1$. The maximize operation on matrix $A$ and vector $x$ produces a matrix, constructed elementwise from the entries of $A$ and the signs of the elements of $x$, that is used to upper-bound quadratic forms in $x$ (see Lemma 3).

Assuming that the sign of all state variables cannot change between each policy update interval, the improved policy $K_{i+1}$ can be obtained by minimizing the objective function subject to the stability condition in inequality (22) and the gain constraint in inequality (23), where the matrices appearing in (22) and (23) are constructed from the estimated kernel $\bar{P}$, its uncertainty $\Delta P$, and the maximize operation defined in Definition 1. Inequality (22) provides the stability condition, and its derivation is given in Section 2.1.3. Inequality (23) provides an upper bound for the updated gain $K_{i+1}$ through a user-defined parameter: this parameter limits the maximum gain of $K_{i+1}$, since inequality (23) is equivalent to a bound on the norm of $K_{i+1}$.
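Because inequalities (22) and (23) are specific to the paper, the sketch below only illustrates how an update of this type can be posed as a semidefinite program in CVXPY: it maximizes the decay rate $\alpha$ subject to a simplified nominal decrease condition obtained from Lemma 1 and the estimated kernel $\bar{P}$ (the uncertainty term $\Delta P$, the maximize operation, and the discount factor are neglected), together with a norm bound on the gain corresponding to inequality (23). All numerical values are illustrative placeholders.

```python
import cvxpy as cp
import numpy as np

n, m = 2, 1
# Illustrative placeholder values; in the algorithm these come from the critic
# (P_bar) and from the previous iteration (K_i).
P_bar = np.array([[2.0, 0.3], [0.3, 1.0]])
K_i = np.array([[-1.0, -1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(n)
beta = 5.0                       # user-defined bound on the gain magnitude

K_next = cp.Variable((m, n))     # improved policy gain
alpha = cp.Variable(nonneg=True) # decay rate to be maximized
dK = K_next - K_i

# Nominal decrease condition: -Q + dK^T B^T P_bar + P_bar B dK + alpha P_bar <= 0,
# obtained by combining Lemma 1 with V = x^T P_bar x (uncertainty terms omitted).
lyap = -Q + dK.T @ B.T @ P_bar + P_bar @ B @ dK + alpha * P_bar
stability = lyap << 0            # the expression is symmetric by construction

# Gain bound ||K_next|| <= beta expressed as an LMI (cf. inequality (23)).
gain_bound = cp.bmat([[beta * np.eye(m), K_next],
                      [K_next.T, beta * np.eye(n)]]) >> 0

problem = cp.Problem(cp.Maximize(alpha), [stability, gain_bound])
problem.solve(solver=cp.SCS)
print(K_next.value, alpha.value)
```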
2.1.3. Stability Analysis
With the control policy as described in Equation (17), the equation for the closed-loop system can be derived as follows:

$$\dot{x}(t) = \left(A + B K_i\right) x(t).$$
Lemma 1. Assuming that the closed-loop system described by Equation (26) is stable, solving for P in Equation (8) is equivalent to finding the solution of the underlying Lyapunov equation of the closed-loop system (Equation (27)) [8].

Proof of Lemma 1. We start with Equation (27) and prove that matrix $P$ is also the solution of Equation (8). Consider the quadratic function $\hat{V}(x) = x^{T} P x$, where $P$ is the solution of Equation (27), and evaluate its evolution along the trajectories of the closed-loop system (Equation (28)). Since the closed-loop system is stable, the Lyapunov Equation (27) has a unique solution $P$. From (28), this solution satisfies the integral relation over each sampling interval, which is equivalent to Equation (8). Therefore, $P$ is also the solution of Equation (8). □
Lemma 2. Given matrices $E$ and $F$ with appropriate dimensions, an LMI bounding the cross term $E^{T}F + F^{T}E$ can be obtained.

Proof of Lemma 2. The result follows from the properties of the matrix norm. □
Lemma 3. Given $A$ as a square matrix with dimension $n \times n$ and $x$ as a vector with dimension $n \times 1$, an LMI bounding the quadratic form $x^{T} A x$ through the maximize operation of Definition 1 can be obtained.

Proof of Lemma 3. The result follows directly from the definition of the maximize operation (Definition 1). □
Theorem 1. Consider a dynamic system that can be represented by Equation (1) with the state matrix A unknown. Assume that the sign of all state variables cannot change between each policy update interval and that the estimated value function at iteration $i$ is $\hat{V}_i(x) = x^{T}\bar{P}_i\, x$ with uncertainty $\Delta P_i$. If the matrix inequality (22) is satisfied, then the closed-loop system with the control policy $u = K_{i+1}x$ is quadratically stable with convergence rate $\alpha$.
Proof of Theorem 1. Since the current control policy is stable, the estimated parameter matrix $\bar{P}_i$ is positive definite. Hence, $\hat{V}_i(x) = x^{T}\bar{P}_i\, x > 0$ for $x \neq 0$, and $\hat{V}_i$ is used as the Lyapunov function for the updated control policy $u = K_{i+1}x$. For notational convenience, the state vector $x(t)$ and input vector $u(t)$ are denoted as $x$ and $u$, respectively. By using Equation (27) in Lemma 1 and the representation

$$A + BK_{i+1} = (A + BK_i) + B\,(K_{i+1} - K_i),$$

the left side of Equation (18), i.e., the derivative of $\hat{V}_i$ along the trajectories of the closed-loop system, can be calculated without explicit knowledge of the unknown matrix $A$: the terms involving $A + BK_i$ are replaced through the Lyapunov Equation (27), leaving an expression in the known matrices $B$ and $Q$, the gains $K_i$ and $K_{i+1}$, and the estimated kernel.
By using Lemma 3, the terms involving the uncertain kernel can be bounded through the maximize operator defined in Definition 1, and a further bound on the cross terms can be obtained by Lemma 2. Hence, $\dot{\hat{V}}_i$ can be bounded from above and, using Lyapunov theory, the system will be quadratically stable with the convergence rate $\alpha$ if $\dot{\hat{V}}_i \le -\alpha \hat{V}_i$. This condition is satisfied if the derived upper bound is less than or equal to $-\alpha \hat{V}_i$, which can be written in the matrix form shown in Theorem 1. □
By using Theorem 1, it can be seen that, with the proposed improved policy, the closed-loop system will be asymptotically stable. It is also noted that Theorem 1 is applicable to unknown nonlinear systems if they can be approximated by a linear state-space equation (Equation (1)) and if their nonlinearity is within the uncertainty bound calculated from Equation (16).
2.1.4. Robust Reinforcement Learning Algorithm
The robust RL algorithm for controlling partially unknown dynamical systems includes the following steps:
Initialization
Estimation of the Value Function
Control Policy Update
Figure 1 shows the simplified diagram of the above algorithm. It is noted that the estimation of the value function is on-policy learning, since it updates the value-function parameters using the V-value of the next state and the current policy's action.
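The following skeleton outlines one possible implementation of this iteration; collect_trajectory, critic_update, and actor_update are hypothetical helpers corresponding to the data-collection, value-estimation, and policy-update steps sketched earlier, not functions defined in the paper.

```python
import numpy as np

def robust_rl_loop(K0, n_iterations, gamma, T,
                   collect_trajectory, critic_update, actor_update):
    """Skeleton of the robust RL iteration (initialization, critic, actor).

    collect_trajectory(K)  -> (Phi_t, Phi_tT, costs): data gathered under policy u = K x
    critic_update(...)     -> (w_bar, dw)           : least-squares estimate and uncertainty
    actor_update(...)      -> K_next                : gain from the LMI-constrained update
    """
    K = K0                                   # initialization: start from a stabilizing gain
    n = K0.shape[1]
    for _ in range(n_iterations):
        # Run the current policy and sample the state trajectory.
        Phi_t, Phi_tT, costs = collect_trajectory(K)
        # Estimate the value function and its uncertainty (critic).
        w_bar, dw = critic_update(Phi_t, Phi_tT, costs, gamma, T)
        P_bar = w_bar.reshape((n, n), order="F")     # nominal kernel matrix
        Delta_P = dw.reshape((n, n), order="F")      # uncertainty of the kernel matrix
        # Robustly update the control policy (actor).
        K = actor_update(P_bar, Delta_P, K)
    return K
```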
2.1.5. Simulation Setup
A simulation study of the proposed robust RL controller was conducted on a glucose kinetics model adopted from [20,21,22,23].
In this model, parameter and variable descriptions can be found in Table 1 and Table 2, respectively. The values of the parameters are selected based on [20,21]. The process noise enters the model through Equation (43), and the measured blood glucose value is affected by a random measurement noise.
The inputs of the model are the amount of carbohydrate intake $D$ and the insulin concentration $i$. The value of $i$ must be non-negative, i.e., $i \ge 0$.
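For illustration only, the sketch below simulates a Bergman-type minimal model of glucose-insulin kinetics (one common form of the models cited in [20,21]) with additive process and measurement noise and a non-negative insulin input. The parameter values, noise levels, and the simple proportional insulin rule are placeholders and do not reproduce the paper's model or controller.

```python
import numpy as np

# Placeholder parameters of a Bergman-type minimal model (not the paper's values).
p1, p2, p3 = 0.028, 0.025, 1.3e-5    # glucose effectiveness and insulin-action dynamics
Gb, Ib = 110.0, 10.0                 # basal glucose [mg/dL] and basal insulin [mU/L]
dt, steps = 1.0, 600                 # 1-minute steps, 10-hour horizon

rng = np.random.default_rng(1)
G, X, I = 180.0, 0.0, Ib             # initial glucose, remote insulin action, plasma insulin

glucose_measurements = []
for k in range(steps):
    D = 2.0 if 120 <= k < 150 else 0.0      # carbohydrate intake (meal disturbance)
    i_in = max(0.0, 0.05 * (G - Gb))        # placeholder insulin rule, clipped to be non-negative

    dG = -p1 * (G - Gb) - X * G + D         # glucose kinetics
    dX = -p2 * X + p3 * (I - Ib)            # remote insulin action
    dI = -0.09 * (I - Ib) + i_in            # plasma insulin with infusion input

    w = rng.normal(0.0, 0.2)                # process noise on the glucose state
    G = G + dt * (dG + w)
    X = X + dt * dX
    I = I + dt * dI

    v = rng.normal(0.0, 2.0)                # measurement noise
    glucose_measurements.append(G + v)      # measured blood glucose

print(glucose_measurements[-1])
```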