Online Inverse Optimal Control for Time-Varying Cost Weights

Cao, Sheng; Luo, Zhiwei; Quan, Changqin

doi:10.3390/biomimetics9020084


by Sheng Cao *, Zhiwei Luo and Changqin Quan

Graduate School of System Informatics, Kobe University, 1-1 Rokkodai-cho, Nada-ku, Kobe 657-8501, Japan

* Author to whom correspondence should be addressed.

Biomimetics 2024, 9(2), 84; https://doi.org/10.3390/biomimetics9020084

Submission received: 16 December 2023 / Revised: 25 January 2024 / Accepted: 29 January 2024 / Published: 31 January 2024

(This article belongs to the Special Issue Biology for Robotics and Robotics for Biology)


Abstract:

Inverse optimal control is a method for recovering, from expert demonstrations, the cost function underlying an optimal control problem. Most studies on inverse optimal control have focused on building the unknown cost function through the linear combination of given features with unknown cost weights, which are generally considered to be constant. However, in many real-world applications, the cost weights may vary over time. In this study, we propose an adaptive online inverse optimal control approach based on a neural-network approximation to address the challenge of recovering time-varying cost weights. We conduct a well-posedness analysis of the problem and suggest a condition for the adaptive goal, under which the weights of the neural network generated to achieve this adaptive goal are unique to the corresponding inverse optimal control problem. Furthermore, we propose an updating law for the weights of the neural network to ensure the stability of the convergence of the solutions. Finally, simulation results for an example linear system are presented to demonstrate the effectiveness of the proposed strategy. The proposed method is applicable to a wide range of problems requiring real-time inverse optimal control calculations.

Keywords: inverse optimal control; online calculation; time-varying cost weights; robustness to noise

1. Introduction

The integration of biological principles with robotic technology heralds a new era of innovation, with a significant focus on applying optimal control and optimization methods to analyze animal motion. This approach guides the development of robotic movement, as is evident in [1], which explores the intricate control systems in mammalian locomotion. Such research underpins the development of robots that emulate the efficiency and adaptability found in nature.

These advancements in understanding animal locomotion through optimal control methods set the stage for the relevance of inverse optimal control (IOC). IOC offers a retrospective analysis of expert movements—human or animal—to infer underlying cost functions optimized in these motions. This methodology is crucial when direct modeling of optimal strategies is complex or unknown.

The use of IOC to identify suitable cost functions from the observable control input and state trajectories of experts is becoming increasingly important. Several successful applications of IOC in estimating the cost weights of multiple features have been reported. For example, the knowledge and expertise of specialists can be categorized and exploited in several fields, including robot control and autonomous driving. The authors of [2], who employed game theory in tailoring robot–human interactions, proposed a method for estimating the human cost function and selecting the robot’s cost function based on the results, leading to the Nash equilibrium in human–robot interactions. The authors of [3] applied IOC to analyze taxi drivers’ route choices. To investigate the cost combination of human motion, the authors of [4] conducted an experiment using IOC techniques to study human motion during the performance of a goal-achieving task using one arm. Additionally, the authors of [5] represented the learning of biological behavior as an inverse linear quadratic regulator (LQR) problem and proposed adaptive methods for modeling and analyzing human reach-to-grasp behavior. Furthermore, the authors of [6] employed an IOC method to segment human movement.

Linear quadratic regulation is a common optimal control method for linear systems. In the 1960s and 1970s, numerous researchers offered solutions to the inverse LQR problem [7,8,9]. Recently, the theory of linear matrix inequality was employed to solve the inverse LQR problem [5,10,11]. Regarding the application of the IOC method for nonlinear systems, several approaches involving methods such as passivity-based condition monitoring [12] or robust design [13] have been reported.

Recent studies in the field of IOC have demonstrated significant advancements. The authors of [14] provided a comprehensive review of the methodologies and applications in inverse optimization, highlighting its growing importance across various domains. The authors of [15] introduced a novel method for sequential calculation in discrete-time systems, enhancing the IOC model’s efficacy under noisy data conditions. The authors of [16] employed a multi-objective IOC approach to explore motor control objectives in human locomotion, which has implications for predictive simulations in rehabilitation technology. Furthermore, the authors of [17] delve into cost uniqueness in quadratic costs and control-affine systems, shedding light on the non-uniqueness cases in IOC. Moreover, a recent thesis [18] introduces a Collage-Based Approach for solving unique inverse optimal control problems, leveraging the Collage method for ODE inverse problems in conjunction with Pontryagin’s Maximum Principle.

Feature-based IOC methods, which involve modeling the cost function as a linear combination of various feature functions with unknown weights, have gained acclaim in recent years [19,20,21,22]. However, it may be difficult to apply these methods to the analysis of complex, long-term behaviors using simple feature functions, e.g., analyzing human jumping [23]. To address this challenge, the authors of [24] proposed a technique for recovering phase-dependent weights that switch at unknown phase-transition points. This method employs a moving window along the observed trajectory to identify the phase-transition points, with the window length determined by a recovery matrix aimed at minimizing the number of observations required for successful cost-weight recovery. Although this method is effective in estimating phase-dependent cost weights, the complex computational requirements limit its use in real-time applications, such as human–robot collaboration tasks. Additionally, in this method, the cost weights in each phase are assumed to be fixed, which may not be generalizable. For example, the human jump motion in [23] was analyzed using time-varying, continuous cost weights.

Overall, IOC still has several shortcomings that need to be addressed, particularly when it is applied to approximating complex, multi-phase, continuous cost functions in real time. In this paper, we propose a method for recovering the time-varying cost weights in the IOC problem for linear continuous systems using neural networks. Our approach involves constructing an auxiliary estimation system that closely approximates the behavior of the original system, followed by determining the necessary conditions for tuning the weights of the neurons in the neural network to obtain a unique solution for the IOC problem. We demonstrate that the unique solution corresponds to achieving a zero error between the original system state and the auxiliary estimated system state, as well as zero error between the original costate and the integral of the estimated costate. Based on this analysis, we develop two neural-network frameworks: one for approximating the cost-weight function and the other for addressing the error introduced by the auxiliary estimation system. Additionally, we discuss the necessary requirements for the feature functions to ensure the well-posedness of our online IOC method. Finally, we validate the effectiveness of our method through simulations.

This work makes several significant contributions:

We provide a solution for the recovery of time-varying cost weights, essential for analyzing real-world animal or human motion.
Our method operates online, suitable for a broad spectrum of real-time calculation problems. This contrasts with previous online IOC methods that mainly focused on constant cost weights for discrete system control.
We introduce a neural network and state observer-based framework for online verification and refinement of estimated cost weights. This innovation addresses the critical need for solution uniqueness and robustness against data noise in IOC applications.

2. Problem Formulation

2.1. System Description and Problem Statement

Consider an object’s system dynamics formulated as

\dot{x} = A x + B u

(1)

where $A \in R^{n \times n}$ and $B \in R^{n \times m}$ are two time-invariant matrices, $x \in R^{n}$ represents the system states, and $u = {[u_{1}, \dots, u_{m}]}^{T} \in R^{m}$ denotes the control input of the system [25].

The classic optimal control problem is to design the optimal control input $u^{*} (t)$, which generates a sequence of optimal states $x^{*} (t)$, so as to minimize the following cost function subject to the dynamics (1). (The superscript ∗ denotes the optimal condition.)

V (x, t) = \int_{t}^{t_{f}} L_{0} (x, u, τ) d τ

(2)

Here, $L_{0}$ has the following form:

L_{0} = q^{T} F (x) + r^{T} G (u)

(3)

where $q = {[q_{1}, q_{2}, \dots, q_{n_{f}}]}^{T} \in R^{n_{f}}$ and $r = {[r_{1}, r_{2}, \dots, r_{m}]}^{T} \in R^{m}$, with $r_{i} > 0$ for all i, represent the cost weight vectors, $F (x)$ is referred to as the general union feature vector with respect to x, and $G (u)$ indicates the feature vector that is only relevant to the control input u [26]. $n_{f}$ represents the number of features, which is different from the dimension of the system states. For simplicity, we assume that $r^{T} G (u) = u^{T} R u$, where R is an unknown diagonal matrix with $R = \operatorname{diag} (r_{1}, \dots, r_{m})$. Additionally, it is assumed that ($A, B$) is controllable, B is a full column rank matrix, and A and B are bounded such that $\| A \| \leq δ_{A}$ and $\| B \| \leq δ_{B}$.
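The structural assumptions above (controllability of the pair and full column rank of B) can be checked numerically. The following is a minimal sketch; the matrices are hypothetical illustration values that do not come from the article, and the spectral norm is used as one possible choice for the bounds.

```python
import numpy as np

def controllability_matrix(A, B):
    """Stack [B, AB, ..., A^(n-1) B] column-wise."""
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    return np.hstack(blocks)

# Hypothetical 2-state, 2-input system (illustration values only).
A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
B = np.eye(2)

n = A.shape[0]
is_controllable = np.linalg.matrix_rank(controllability_matrix(A, B)) == n
has_full_column_rank = np.linalg.matrix_rank(B) == B.shape[1]

# Spectral norms serve as one valid choice for the bounds delta_A and delta_B.
delta_A = np.linalg.norm(A, 2)
delta_B = np.linalg.norm(B, 2)
```

The rank test is the standard Kalman criterion; for larger or poorly scaled systems a different numerical test may be preferable.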

2.2. Maximum Principle in Forward Optimal Control

To minimize the cost function (2) with $L_{0}$ defined in (3), there exists a costate variable vector $λ$ that satisfies Pontryagin’s maximum principle as follows:

\dot{λ} = - {\bar{F}}_{x}^{T} q - A^{T} λ

(4)

R u + B^{T} λ = 0

(5)

where ${\bar{F}}_{x} = \frac{\partial F (x)}{\partial x}$ and $λ \in R^{n}$ denotes the costate variables. These two equations are derived from Pontryagin’s maximum principle by taking the partial derivatives of the Hamiltonian function defined by $H (x, u, λ) = L_{0} + λ^{T} (A x + B u)$, specifically $\dot{λ} = - \frac{\partial H}{\partial x}$ and $\frac{\partial H}{\partial u} = 0$. The initial value of $λ$ is denoted by $λ_{0}$.

The optimal control input $u^{*}$ of the system expressed by (1) is given as

u^{*} = - R^{- 1} B^{T} λ

(6)

where $λ$ is unknown. Thus, using this optimal control input, we have

\dot{x} = A x - H λ

(7)

where H denotes the matrix $H = B R^{- 1} B^{T}$. Notably, given that B is a full column rank matrix, H is invertible provided that B is square ($m = n$); the invertibility of H is assumed in what follows. In addition, since B is a bounded constant matrix, there exists a positive scalar $δ_{H}$ such that H satisfies $\| H \| \leq δ_{H}$.

Additionally, differentiating (7) with respect to time gives

\ddot{x} = A \dot{x} - H \dot{λ}

(8)
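To make the canonical equations concrete, the sketch below integrates (4) and (7) forward in time with a forward-Euler scheme. The quadratic features, the matrices, the step size, and the initial costate are all assumptions made for illustration; finding the initial costate that satisfies the terminal conditions of the optimal control problem is a separate boundary-value problem that is not shown here.

```python
import numpy as np

def feature_jacobian(x):
    """Jacobian of the assumed features F(x) = [x_1^2, ..., x_n^2]; shape (n_f, n)."""
    return np.diag(2.0 * x)

def canonical_rhs(x, lam, q, A, H):
    """Right-hand sides of the state equation (7) and the costate equation (4)."""
    x_dot = A @ x - H @ lam
    lam_dot = -feature_jacobian(x).T @ q - A.T @ lam
    return x_dot, lam_dot

# Hypothetical system and weights (illustration values only).
A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
B = np.eye(2)
R = np.eye(2)
H = B @ np.linalg.inv(R) @ B.T

x = np.array([1.0, 0.0])
lam = np.array([0.5, 0.2])      # assumed initial costate lambda_0
q = np.array([1.0, 1.0])
dt = 1e-3

for _ in range(1000):
    u = -np.linalg.solve(R, B.T @ lam)            # optimal input (6)
    x_dot, lam_dot = canonical_rhs(x, lam, q, A, H)
    x, lam = x + dt * x_dot, lam + dt * lam_dot   # forward-Euler update
```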

2.3. Analysis of the IOC Problem

We assume that the system states $x [t, t_{f}]$ and the control input $u [t, t_{f}]$, which represent the time series of the system states and control inputs from time point t to $t_{f}$, minimize the cost function (2). In addition, we assume that the optimal system states and control input satisfy the bounds $\| x \| \leq δ_{x}$, $\| u \| \leq δ_{u}$, and $\| \dot{u} \| \leq δ_{\dot{u}}$.

The objective of the IOC problem is to recover the unknown cost weight vector $q (t)$. IOC may be employed, for example, to analyze how different situations affect the relative importance of certain feature functions of human motion. Such applications require a rigorous analysis of whether the derived cost weights can recreate the original data $x [t, t_{f}], u [t, t_{f}]$. To begin, we consider two problems:

What happens when a different feature function is selected?

In previous studies, it was assumed that the cost weight vector q is either a constant value [19] or a step function with multiple phases [24]. These assumptions have been effective in recovering the cost weights used in the analysis of optimal control methods for a robot’s motion control, such as analyzing the motion of a robot controlled by an LQR approach. However, it may be inappropriate to assume that the cost weights are constants or step functions when analyzing the complex behaviors of natural objects, such as human motion. In particular, deciding which feature function to adopt when evaluating the motion of natural objects could pose a challenge.

Proposition 1.

Depending on the selection of the feature functions $F (x)$ for the IOC, the original constant cost weight q may become a time-varying continuous function.

Proof.

From (8), for the object’s original feature function, we have

\begin{matrix} H^{- 1} (- \ddot{x} + A \dot{x} + H A^{T} H^{- 1} B u) = {\bar{F}}_{o x}^{T} q_{o} \end{matrix}

(9)

where $q_{o}$ denotes the original time-invariant cost weight vector, and ${\bar{F}}_{o x}$ denotes the partial derivative with respect to x of the original feature function. When we choose a different feature function $F_{n} (x)$, the above equation becomes

\begin{matrix} H^{- 1} (- \ddot{x} + A \dot{x} + H A^{T} H^{- 1} B u) = {\bar{F}}_{n x}^{T} q_{n} \end{matrix}

(10)

where ${\bar{F}}_{n x}$ denotes the partial derivative with respect to x of the newly selected feature function and $q_{n}$ is the corresponding cost weight vector on ${\bar{F}}_{n x}$. Thus, we have

{\bar{F}}_{o x}^{T} q_{o} = {\bar{F}}_{n x}^{T} q_{n} \quad \forall t_{0} \leq t \leq t_{f}

From this equation, it follows that $q_{n}$ may be a time-varying function when ${\bar{F}}_{o x}$ and ${\bar{F}}_{n x}$ are not equivalent, and as ${\bar{F}}_{o x}$ and ${\bar{F}}_{n x}$ are continuous functions, we can reasonably conclude that $q_{n}$ is also a continuous function. □
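A scalar illustration of Proposition 1 may help. The sketch below assumes an original feature $F_{o} (x) = x^{2}$ with a constant weight, a replacement feature $F_{n} (x) = x^{4}$, and a hypothetical nonzero trajectory $x (t) = e^{- t}$; none of these choices come from the article. Matching both sides of the identity in the proof forces the new weight to vary along the trajectory.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 5)
x = np.exp(-t)            # assumed nonzero scalar trajectory
q_o = 2.0                 # constant weight on the original feature F_o(x) = x^2

dFo = 2.0 * x             # derivative of F_o with respect to x
dFn = 4.0 * x**3          # derivative of the new feature F_n(x) = x^4

# Weight on the new feature that keeps dFo * q_o equal to dFn * q_n at every t.
q_n = dFo * q_o / dFn
```

Here `q_n` equals `q_o / (2 * x**2)`, so it changes continuously with the state even though `q_o` is constant.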

Based on this proposition, it is crucial to expand the definition of cost weights to include time-varying values, as this will facilitate a more accurate analysis of the motion of increasingly complex natural objects. Despite the need for time-varying cost weight recovery in many applications, it has received minimal research attention thus far.

Does the given set $x [t, t_{f}], u [t, t_{f}]$ in the IOC problem have a unique solution $\{q (t), r\}$?

The uniqueness of the solution to the IOC problem when cost weights are constant has been discussed in many studies [15,17,18,22]. In this work, we determine if there is still a unique solution to the IOC problem when q is a time-varying function.

From (10), we can find different continuous functions $q (t)$ such that the equation is satisfied for different values of R (different values of H). This implies that if q is considered as a time-varying function, the set $\{q (t), r\}$ will not have a unique solution.

Therefore, when we consider the unique solution of the IOC problem with the time-varying function $q (t)$, it is necessary to introduce additional conditions to ensure that the IOC problem has a unique solution and that the resulting unique solution is meaningful.

In this study, for simplicity, we assume that $R = I$ [27,28], where I is the identity matrix. In actual optimal control cost functions, when we focus on reducing one of the control inputs $u_{i}$, the convergence of the i-th system state $x_{i}$ related to $u_{i}$ will also be affected. Consequently, the final control result shows that the change in each state of the system is influenced not only by the chosen cost weights $q (t)$ but also by $R (t)$. In the IOC problem, setting $R (t) = I$ allows the effect of different weights on different control inputs in the original system to be reflected in the current estimate of $q (t)$. This enables us to view the estimated weights on the system states as representing the relative importance of each state in the system’s dynamic evolution, without considering the impact of the control input on these weights.

Based on our conclusion that q may be time-varying when different feature functions are chosen and on the corresponding conditions under which a unique solution exists, we can define the IOC problem to be solved in this study as follows:

Problem 1.

Online estimation of the time-varying cost weights $q (t)$.

Given: (1) the measured system state x and control input u; (2) $R = I$.

Goal: estimate the time-varying $q (t)$ online using the given x and u.

3. Adaptive Observer-Based Neural Network Approximation of Time-Varying Cost Weights

In this study, we estimate time-varying cost weight functions online using an observer-based adaptive neural network estimation approach, as opposed to earlier studies that required a large number of time series of x and u to recover fixed cost weights offline.

Construction of the Observer

Following the introduction of $\hat{q} (t) \in R^{n}$ denoting the estimate of $q (t)$, we define the estimate of the associated costate variable $\hat{λ}$ as follows:

\dot{\hat{λ}} = - {\bar{F}}_{\hat{x}}^{T} \hat{q} (t) - A^{T} \hat{λ}

(11)

where ${\bar{F}}_{\hat{x}} = \frac{\partial F (\hat{x})}{\partial \hat{x}}$ denotes the partial derivatives of the feature functions evaluated at the estimated system states $\hat{x}$, which are obtained by inserting $\hat{λ}$ into (7):

\dot{\hat{x}} = A \hat{x} - H \hat{λ}

(12)

where the initial state ${\hat{x}}_{0}$ of this system is selected to be ${\hat{x}}_{0} = x_{0}$.

Thus, compared with that of the original system, the error generated by the new estimation system can be expressed as

\dot{\tilde{x}} = A \tilde{x} - H \tilde{λ}

(13)

\dot{\tilde{λ}} = - {\bar{F}}_{x}^{T} q (t) + {\bar{F}}_{\hat{x}}^{T} \hat{q} (t) - A^{T} \tilde{λ}

(14)

where $\tilde{λ} = λ - \hat{λ}$ and $\tilde{x} = x - \hat{x}$. Here, the feature function is selected such that its partial derivative with respect to x is bounded, and it is assumed that $\| {\bar{F}}_{x} \| \leq δ_{n x}$, $\| {\bar{F}}_{\hat{x}} \| \leq δ_{n \hat{x}}$, and $\| {\bar{F}}_{x} (x) - {\bar{F}}_{\hat{x}} (\hat{x}) \| \leq ζ \| \tilde{x} \|$, where $δ_{n x}$, $δ_{n \hat{x}}$, and $ζ$ denote positive scalars.

Additionally, the time derivative of (13) can be expressed as

\begin{matrix} \ddot{\tilde{x}} = A \dot{\tilde{x}} - H \dot{\tilde{λ}} \end{matrix}

(15)

Thus, combining (14) and (15), the following equation is satisfied:

\begin{matrix} \dot{s} = A_{r} s + T_{x} \tilde{q} + (T_{x} - T_{\hat{x}}) \hat{q} \end{matrix}

(16)

where $s = \begin{bmatrix} \dot{\tilde{x}} \\ \tilde{λ} \end{bmatrix}$, $A_{r} = \begin{bmatrix} A & H A^{T} \\ 0 & - A^{T} \end{bmatrix}$, $T_{x} = \begin{bmatrix} H {\bar{F}}_{x} \\ - {\bar{F}}_{x} \end{bmatrix}$, $T_{\hat{x}} = \begin{bmatrix} H {\bar{F}}_{\hat{x}} \\ - {\bar{F}}_{\hat{x}} \end{bmatrix}$, and $\tilde{q}$ denotes the error of estimating q. Here, $\| {\bar{F}}_{x} (x) - {\bar{F}}_{\hat{x}} (\hat{x}) \| \leq ζ \| \tilde{x} \|$ implies that there exists a positive scalar $ζ^{'}$ such that $\| T_{x} - T_{\hat{x}} \| \leq ζ^{'} \| \tilde{x} \|$ holds. Based on the bounds of ${\bar{F}}_{x} (x)$, ${\bar{F}}_{\hat{x}} (\hat{x})$, and $H$, it follows that there are two positive scalars $δ_{t_{x}}$ and $δ_{t_{\hat{x}}}$ such that the following inequalities hold: $\| T_{x} \| \leq δ_{t_{x}}$ and $\| T_{\hat{x}} \| \leq δ_{t_{\hat{x}}}$.
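The block matrices of the error system can be assembled as in the sketch below. Two points are assumptions of this sketch rather than statements from the article: the feature Jacobian is stored with shape (n_f, n), so its transpose is stacked to make the blocks conformable with a weight vector of length n_f, and the features are taken to be $F (x) = [x_{1}^{2}, x_{2}^{2}]$ with $H = I$. For that particular choice, the Lipschitz-type bound on $T_{x} - T_{\hat{x}}$ holds with the constant shown in the code.

```python
import numpy as np

def build_A_r(A, H):
    """Block matrix [[A, H A^T], [0, -A^T]] acting on s = [d(x_tilde)/dt; lambda_tilde]."""
    n = A.shape[0]
    return np.block([[A, H @ A.T],
                     [np.zeros((n, n)), -A.T]])

def build_T(H, F_jac):
    """Stack [H J^T; -J^T], where J = dF/dx has shape (n_f, n) (assumed layout)."""
    return np.vstack([H @ F_jac.T, -F_jac.T])

def feature_jacobian(x):
    """Jacobian of the assumed features F(x) = [x_1^2, x_2^2]."""
    return np.diag(2.0 * x)

# Hypothetical values for illustration.
A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
H = np.eye(2)
x = np.array([1.0, -0.5])
x_hat = np.array([0.8, -0.4])

A_r = build_A_r(A, H)
T_x = build_T(H, feature_jacobian(x))
T_xh = build_T(H, feature_jacobian(x_hat))

# For these features and H = I, a valid constant is 2 * sqrt(2).
zeta_prime = 2.0 * np.sqrt(2.0)
gap = np.linalg.norm(T_x - T_xh, 2)
```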

Moreover, from (6) and (7), $λ$ can be calculated as follows:

λ = - H^{- 1} B u

(17)
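The costate recovery in (17) can be verified with a round trip: generate the input from a known costate using (6), then recover the costate from the input. The sketch assumes a square, nonsingular B so that H is invertible; the numerical values are hypothetical.

```python
import numpy as np

# Hypothetical square, nonsingular input matrix so that H = B R^{-1} B^T is invertible.
B = np.array([[1.0, 0.5],
              [0.0, 2.0]])
R = np.eye(2)
H = B @ np.linalg.solve(R, B.T)

lam_true = np.array([0.3, -1.2])
u = -np.linalg.solve(R, B.T @ lam_true)   # optimal input (6)
lam_rec = -np.linalg.solve(H, B @ u)      # costate recovered via (17)
```

Using `np.linalg.solve` avoids forming the inverse of H explicitly, which is usually the more accurate option numerically.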

4. Neural Network-Based Approximation of Time-Varying Cost Weights

In this section, a neural network-based cost weight approximation algorithm is proposed. To calculate an approximation of the time-varying vector q, we adopt a neural network whose inputs are $u_{I} = \begin{bmatrix} x_{0} \\ u \end{bmatrix}$, where $x_{0}$ denotes the initial state of the system (1). Based on this, we assume that a time-invariant weight matrix $W \in R^{n_{f} \times l}$ exists that satisfies the following expression:

\begin{matrix} q = W^{T} ϕ (u_{I}) + ϵ_{1} (u_{I}) \end{matrix}

(18)

where $ϕ (u_{I})$ denotes the activation function and $ϵ_{1} (u_{I})$ denotes the structural approximation error of the neural network. In addition, the activation function is selected such that it and its partial derivative satisfy the following bounds: $\| ϕ (u_{I}) \| \leq δ_{p}$ and $\| \frac{\partial ϕ (u_{I})}{\partial u_{I}} \| \leq δ_{p u}$, where $δ_{p}$ and $δ_{p u}$ represent two positive scalars. Additionally, $\| ϵ_{1} (u_{I}) \| \leq ϵ_{n}$, where $ϵ_{n}$ is a positive scalar.

The estimate of vector q is constructed as follows:

\begin{matrix} \hat{q} = {\hat{W}}^{T} ϕ (u_{I}) \end{matrix}

(19)
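A minimal sketch of the estimate (19) follows. The Gaussian radial basis activation, its centers, and the layer width l are assumptions chosen because they are bounded with bounded derivatives; the article does not specify them here. The weight matrix is stored with shape (l, n_f) in this sketch so that its transpose maps the l activations to the n_f cost weights.

```python
import numpy as np

def phi(u_I, centers, width=1.0):
    """Gaussian radial basis activations (an assumed choice); each entry lies in (0, 1]."""
    d2 = np.sum((centers - u_I) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * width**2))

def q_hat(W_hat, u_I, centers):
    """Estimate (19): q_hat = W_hat^T phi(u_I), with W_hat stored as (l, n_f)."""
    return W_hat.T @ phi(u_I, centers)

rng = np.random.default_rng(0)
n, m, l, n_f = 2, 2, 6, 2
centers = rng.uniform(-1.0, 1.0, size=(l, n + m))   # hypothetical RBF centers
W_hat = np.zeros((l, n_f))                          # initial weight estimate

x0 = np.array([1.0, 0.0])          # initial state of the system
u = np.array([0.1, -0.2])          # current control input
u_I = np.concatenate([x0, u])      # network input [x_0; u]
q_est = q_hat(W_hat, u_I, centers)
```

With this activation, the norm of `phi` is at most the square root of l, which gives one explicit value for the bound $δ_{p}$.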

where $\hat{W}$ denotes the estimate of W. In this paper, we combine two estimators ${\hat{W}}_{1}$ and ${\hat{W}}_{2}$ to estimate W, as shown in Section 4.1. Before presenting the details of the estimators, we first discuss the necessary conditions for the estimation.

Based on the setting of estimator $\hat{W}$, the error of estimating q can be expressed as

\begin{matrix} \tilde{q} = q - \hat{q} = {\tilde{W}}^{T} ϕ (u_{I}) + ϵ_{1} (u_{I}) \end{matrix}

(20)

where $\tilde{W} = W - \hat{W}$ denotes the error of estimating W. Substituting $\tilde{q}$ into (16) yields

\begin{matrix} \dot{s} = A_{r} s + T_{x} {\tilde{W}}^{T} ϕ (u_{I}) + (T_{x} - T_{\hat{x}}) \hat{q} + T_{x} ϵ_{1} (u_{I}) \end{matrix}

(21)

To state the conditions required for the convergence of the estimation error $\tilde{W}$, we first define the notion of a uniformly ultimately bounded (UUB) signal.

Definition 1.

A time-varying signal $σ (t)$ is said to be UUB if there exists a compact set $S \subset R^{n}$ such that, for all $σ (t_{0}) \in S$, there exist a bound $μ \geq 0$ and a time T such that $\| σ (t) \| \leq μ$ for all $t \geq t_{0} + T$.

Lemma 1.

If the following conditions are satisfied, $\tilde{W}$ is UUB.

$\int_{t_{0}}^{t_{i}} s d t$ and $s$ become UUB after a time point $t_{1}$ ($\| \int_{t_{0}}^{t_{i}} s d t \| \leq δ_{1}$ and $\| s \| \leq δ_{2}$).
The change in $\hat{W}$ approaches zero.
Matrix C defined below is a full row rank matrix.

\begin{matrix} C = [\begin{matrix} \int_{t_{1} + 1}^{t_{1} + 2} T_{x} {(I \otimes ϕ (u_{I}))}^{T} d t \\ ⋮ \\ \int_{t_{i} - 1}^{t_{i}} T_{x} {(I \otimes ϕ (u_{I}))}^{T} d t \end{matrix}] \end{matrix}

(22)

where $t_{1} \leq t_{i} \leq t_{f}$ and each block of C satisfies the persistent excitation (PE) condition defined below.

\begin{matrix} \| \int_{t_{j}}^{t_{j} + 1} T_{x} {(I \otimes ϕ (u_{I}))}^{T} d t \| \geq β_{j} \quad \forall t_{1} \leq t_{j} \leq t_{i} \end{matrix}

(23)

Here, $β_{j}$ is a positive value.
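The Kronecker-product form used in (22) and (23) rests on the identity $T_{x} {\tilde{W}}^{T} ϕ = T_{x} {(I \otimes ϕ)}^{T} vec (\tilde{W})$. The sketch below checks this identity numerically and then assembles a stacked matrix C from window integrals approximated by Riemann sums. It assumes a weight matrix stored as (l, n_f) with column-stacking vec, and uses random data purely as a stand-in for $T_{x}$ and $ϕ$.

```python
import numpy as np

def C_p(T, phi_vec):
    """T (I ⊗ phi^T): maps vec(W) to T W^T phi, with W stored as (l, n_f)."""
    n_f = T.shape[1]
    return T @ np.kron(np.eye(n_f), phi_vec.reshape(1, -1))

def window_integral(T_samples, phi_samples, dt):
    """Riemann-sum approximation of the integral of T (I ⊗ phi^T) over one window."""
    return sum(C_p(T, p) for T, p in zip(T_samples, phi_samples)) * dt

rng = np.random.default_rng(1)
two_n, n_f, l = 4, 2, 3

# Check the vec identity on random data.
T = rng.standard_normal((two_n, n_f))
phi_vec = rng.standard_normal(l)
W = rng.standard_normal((l, n_f))
lhs = T @ (W.T @ phi_vec)
rhs = C_p(T, phi_vec) @ W.flatten(order="F")   # column-stacking vec(W)

# Stack three unit-length windows, five samples each, into C as in (22).
C = np.vstack([
    window_integral(rng.standard_normal((5, two_n, n_f)),
                    rng.standard_normal((5, l)), dt=0.2)
    for _ in range(3)
])
rank_C = np.linalg.matrix_rank(C)
```

In practice the samples would come from the measured trajectory, and a low rank of C would indicate that the data are not sufficiently exciting.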

Proof.

From (21)

\begin{matrix} s & = A_{r} \int_{t_{0}}^{t_{i}} s d t + \int_{t_{0}}^{t_{i}} T_{x} {\tilde{W}}^{T} ϕ (u_{I}) d t \\ + \int_{t_{0}}^{t_{i}} (T_{x} - T_{\hat{x}}) \hat{q} d t + \int_{t_{0}}^{t_{i}} T_{x} ϵ_{1} (u_{I}) d t \end{matrix}

(24)

Since $\int_{t_{0}}^{t_{i}} s d t$ and $s$ approach zero and reach a steady state, and $A_{r}$ is a constant matrix, we can obtain the following:

\begin{matrix} | | s - A_{r} \int_{t_{0}}^{t_{i}} s d t | | \leq δ_{s i} \end{matrix}

(25)

where $δ_{s i}$ denotes a small positive scalar. Additionally, since both $ϵ_{1} (u_{I})$ and $T_{x}$ are bounded, this leads to

\begin{matrix} | | \int_{t_{0}}^{t_{i}} T_{x} ϵ_{1} (u_{I}) d t | | \leq δ_{T ϵ} \end{matrix}

(26)

where $δ_{T ϵ}$ denotes a small positive scalar. The term $\int_{t_{0}}^{t_{i}} T_{x} ϵ_{1} (u_{I}) d t$ captures the effect of the structural error of the neural network on state s. Since $T_{x}$ is bounded, when the neural network approximates the cost weight function adequately, the value of $ϵ_{1} (u_{I})$ decreases, which in turn reduces the overall integral value. In other words, a well-selected neural network structure with a good approximation of the cost weight function will produce a small structural error and, therefore, a small overall integral value $\int_{t_{0}}^{t_{i}} T_{x} ϵ_{1} (u_{I}) d t$.

Combining (24)–(26) leads to

\begin{matrix} | | \int_{t_{0}}^{t_{i}} T_{x} {\tilde{W}}^{T} ϕ (u_{I}) d t + \int_{t_{0}}^{t_{i}} (T_{x} - T_{\hat{x}}) \hat{q} d t | | \leq δ_{s i} + δ_{T ϵ} \end{matrix}

(27)

Similarly, an analogous relation holds for the interval $[t_{0}, t_{1}]$:

\begin{matrix} | | \int_{t_{0}}^{t_{1}} T_{x} {\tilde{W}}^{T} ϕ (u_{I}) d t + \int_{t_{0}}^{t_{1}} (T_{x} - T_{\hat{x}}) \hat{q} d t | | \leq δ_{s i} + δ_{T ϵ} \end{matrix}

(28)

From (27) and (28), it follows that

\begin{matrix} | | \int_{t_{1} + 1}^{t_{i}} T_{x} {\tilde{W}}^{T} ϕ (u_{I}) d t + \int_{t_{1} + 1}^{t_{i}} (T_{x} - T_{\hat{x}}) \hat{q} d t | | \leq 2 (δ_{s i} + δ_{T ϵ}) \end{matrix}

(29)

Furthermore, considering that $\int_{t_{0}}^{t_{i}} s d t \to 0$ after $t_{1}$, together with the definition of s and the bound $\| T_{x} - T_{\hat{x}} \| \leq ζ^{'} \| \tilde{x} \|$, it follows that

\begin{matrix} | | \int_{t_{1} + 1}^{t_{i}} (T_{x} - T_{\hat{x}}) \hat{q} d t | | & \leq \int_{t_{1} + 1}^{t_{i}} | | (T_{x} - T_{\hat{x}}) | | | | \hat{q} | | d t \\ \leq \int_{t_{1} + 1}^{t_{i}} ζ^{'} δ_{\tilde{x}} δ_{\hat{q}} d t \equiv δ_{ζ (t_{i} - t_{1} - 1)} \end{matrix}

(30)

where $δ_{\tilde{x}}$ and $δ_{\hat{q}}$ represent the bounds of $\tilde{x}$ and $\hat{q}$, respectively. Thus, this leads to the inequality

\begin{matrix} | | \int_{t_{1} + 1}^{t_{i}} T_{x} {\tilde{W}}^{T} ϕ (u_{I}) d t | | \leq 2 (δ_{s i} + δ_{T ϵ}) + δ_{ζ (t_{i} - t_{1} - 1)} \end{matrix}

(31)

In this case, when $\dot{\hat{W}}$ approaches zero, the following relation emerges:

\begin{matrix} | | \int_{t_{1} + 1}^{t_{i}} T_{x} {(I \otimes ϕ (u_{I}))}^{T} v e c (\tilde{W}) d t | | \\ = | | \int_{t_{1} + 1}^{t_{i}} T_{x} {(I \otimes ϕ (u_{I}))}^{T} d t v e c (\tilde{W}) | | \\ \leq 2 (δ_{s i} + δ_{T ϵ}) + δ_{ζ (t_{i} - t_{1} - 1)} \end{matrix}

(32)

Based on this relation, it follows that

\begin{matrix} | | \int_{t_{1} + 1}^{t_{1} + 2} T_{x} {(I \otimes ϕ (u_{I}))}^{T} d t v e c (\tilde{W}) | | \\ \leq 2 (δ_{s i} + δ_{T ϵ}) + δ_{ζ (1)} \end{matrix}

(33)

where $δ_{ζ (1)} = \int_{t_{1} + 1}^{t_{1} + 2} ζ^{'} δ_{\tilde{x}} δ_{\hat{q}} d t = \dots = \int_{t_{i} - 1}^{t_{i}} ζ^{'} δ_{\tilde{x}} δ_{\hat{q}} d t$.

Thus, it follows that

\begin{matrix} | | C v e c (\tilde{W}) | | \leq (t_{i} - t_{1} - 1) (2 (δ_{s i} + δ_{T ϵ}) + δ_{ζ (1)}) \end{matrix}

(34)

where C is defined in (22). Due to C being full row rank, this leads to

\begin{matrix} | | v e c (\tilde{W}) | | & \leq | | C^{+} | | | | C v e c (\tilde{W}) | | \\ \leq | | C^{+} | | (t_{i} - t_{1} - 1) (2 (δ_{s i} + δ_{T ϵ}) + δ_{ζ (1)}) \end{matrix}

(35)

From (23), we have $\| C^{+} \| \leq \frac{1}{\sqrt{(t_{i} - t_{1} - 1) β_{j}^{2}}}$, and substituting this bound into (35) gives

\begin{matrix} \| v e c (\tilde{W}) \| \leq \sqrt{\frac{t_{i} - t_{1} - 1}{β_{j}^{2}}} (2 (δ_{s i} + δ_{T ϵ}) + δ_{ζ (1)}) \end{matrix}

(36)

Thus, $\tilde{W}$ is UUB.

Notably, $β_{j}$ is a lower bound on the norm of $\int_{t_{j}}^{t_{j} + 1} T_{x} {(I \otimes ϕ (u_{I}))}^{T} d t$; it can increase when the data x cause the norm of the integral to deviate significantly from zero. The size of $δ_{ζ (1)}$ and $δ_{s i}$ is related to the minimization of s and $\int_{t_{0}}^{t_{i}} s d t$, and the size of $δ_{T ϵ}$ is related to the approximation ability of the chosen neural network. The bound of $\tilde{W}$ after $t_{1}$ can therefore be reduced by using sufficiently exciting data x, by successfully minimizing s and $\int_{t_{0}}^{t_{i}} s d t$, and by appropriately designing the structure of the neural network. □

4.1. Construction of the Neural Network

As shown in Lemma 1, the convergence of $\int_{t_{0}}^{t} s d τ$ is essential for the convergence of $\tilde{W}$ to 0. Therefore, it is necessary to incorporate this consideration in the approximation design.

First, we divide the estimation of the weights of the neural network into two parts:

\hat{W} = {\hat{W}}_{1} + {\hat{W}}_{2}

(37)

and

\begin{matrix} \hat{q} = {\hat{q}}_{1} + {\hat{q}}_{2} = {({\hat{W}}_{1} + {\hat{W}}_{2})}^{T} ϕ (u_{I}) \end{matrix}

(38)

where ${\hat{q}}_{1} = {\hat{W}}_{1}^{T} ϕ (u_{I})$ and ${\hat{q}}_{2} = {\hat{W}}_{2}^{T} ϕ (u_{I})$.

The need for two distinct estimators, ${\hat{W}}_{1}$ and ${\hat{W}}_{2}$, is rooted in their specialized roles in minimizing the tracking error s. This dual-estimator approach ensures that $\hat{q} (t)$ closely follows the true cost weight $q (t)$. While the adaptive tuning of ${\hat{W}}_{1}$ is primarily aimed at steering s towards zero, the residual errors inherent in this adaptive process necessitate the deployment of ${\hat{W}}_{2}$ for error compensation and enhanced accuracy in tracking the ideal cost weight $q (t)$. To gain a deeper understanding of this system, we begin by examining the error dynamics, which form the basis for the subsequent detailed exploration of the tuning laws for each estimator.

The state equation describing the error dynamics can be obtained as follows:

\dot{s} = A_{r} s + T_{x} {\tilde{q}}_{1} + (T_{x} - T_{\hat{x}}) {\hat{q}}_{1} - T_{\hat{x}} {\hat{q}}_{2}

(39)

where $s = \begin{bmatrix} \dot{\tilde{x}} \\ \tilde{λ} \end{bmatrix}$, $A_{r} = \begin{bmatrix} A & H A^{T} \\ 0 & - A^{T} \end{bmatrix}$, $T_{x} = \begin{bmatrix} H {\bar{F}}_{x} \\ - {\bar{F}}_{x} \end{bmatrix}$, and $T_{\hat{x}} = \begin{bmatrix} H {\bar{F}}_{\hat{x}} \\ - {\bar{F}}_{\hat{x}} \end{bmatrix}$.

Further, to effectively minimize $\int_{t_{0}}^{t} s d τ$, we define vector e as follows:

e = (T_{x} - T_{\hat{x}}) {\hat{q}}_{1} + K s + K_{p} \int_{t_{0}}^{t} s d τ - T_{\hat{x}} {\hat{q}}_{2} + A_{r} s

(40)

where $K = d i a g ([k, \dots, k]) \in R^{2 n \times 2 n}$ and $K_{p} = d i a g ([k_{p}, \dots, k_{p}]) \in R^{2 n \times 2 n}$, and the parameters k and $k_{p}$ are two positive scalars. Thus, (39) can be written as

\dot{s} = - K s - K_{p} \int_{t_{0}}^{t} s d τ + T_{x} {\tilde{q}}_{1} + e

(41)
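The step from (39) to (41) is purely algebraic: substituting the definition of e in (40) into (41) reproduces (39). The sketch below confirms this on random data; all numerical values are placeholders, and `int_s` stands in for the running integral of s.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_f = 2, 2

# Random placeholders for the quantities appearing in (39)-(41).
A_r = rng.standard_normal((2 * n, 2 * n))
T_x = rng.standard_normal((2 * n, n_f))
T_xh = rng.standard_normal((2 * n, n_f))
s = rng.standard_normal(2 * n)
int_s = rng.standard_normal(2 * n)      # stands in for the integral of s
q = rng.standard_normal(n_f)
q1_hat = rng.standard_normal(n_f)
q2_hat = rng.standard_normal(n_f)

k, k_p = 2.0, 1.0
K = k * np.eye(2 * n)
K_p = k_p * np.eye(2 * n)

q1_tilde = q - q1_hat
# Definition (40) of e.
e = (T_x - T_xh) @ q1_hat + K @ s + K_p @ int_s - T_xh @ q2_hat + A_r @ s
# Error dynamics in the form (39) and in the form (41).
s_dot_39 = A_r @ s + T_x @ q1_tilde + (T_x - T_xh) @ q1_hat - T_xh @ q2_hat
s_dot_41 = -K @ s - K_p @ int_s + T_x @ q1_tilde + e
```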

We suppose that an ideal time-invariant weight matrix $W_{2} \in R^{n_{f} \times l}$ exists, which guarantees that

\begin{matrix} (T_{x} - T_{\hat{x}}) {\hat{q}}_{1} + K s + A_{r} s + K_{p} \int_{t_{0}}^{t} s d τ \\ = T_{\hat{x}} q^{'} = T_{\hat{x}} (W_{2}^{T} ϕ (u_{I}) + ϵ_{2} (u_{I})) \end{matrix}

(42)

where $u_{I} = \begin{bmatrix} x_{0} \\ u \end{bmatrix}$.

The estimation error of the neural network can be represented as

\begin{matrix} {\tilde{q}}_{1} & \equiv q - {\hat{q}}_{1} = {\tilde{W}}_{1}^{T} ϕ (u_{I}) + ϵ_{1} (u_{I}) \\ {\tilde{q}}_{2} & \equiv q^{'} - {\hat{q}}_{2} = {\tilde{W}}_{2}^{T} ϕ (u_{I}) + ϵ_{2} (u_{I}) \end{matrix}

(43)

and e can be represented as

e = T_{\hat{x}} ({\tilde{W}}_{2}^{T} ϕ (u_{I}) + ϵ_{2} (u_{I}))

(44)

Therefore, (41) becomes

\begin{matrix} \dot{s} & = - K s - K_{p} \int_{t_{0}}^{t} s d τ \\ + T_{x} ({\tilde{W}}_{1}^{T} ϕ (u_{I}) + ϵ_{1} (u_{I})) + T_{\hat{x}} ({\tilde{W}}_{2}^{T} ϕ (u_{I}) + ϵ_{2} (u_{I})) \end{matrix}

(45)

4.2. Tuning Law of the Neural Network for the Estimation of $q (t)$

An updating law for the neural network that estimates $q (t)$ is given in Theorem 1, based on the error dynamics derived in (45).

Theorem 1.

If we choose the updating laws for the neural network weights

{\hat{W}}_{1}

and

{\hat{W}}_{2}

as shown in (46), respectively, where

Γ_{1}

,

Γ_{2}

, and

k_{e}

are positive scalar constants, then state s,

\int_{t_{0}}^{t} s d τ

and error e will be UUB.

\begin{matrix} {\dot{\hat{W}}}_{1} & = Γ_{1} ϕ (u_{I}) s^{T} T_{x} \\ {\dot{\hat{W}}}_{2} & = Γ_{2} ϕ (u_{I}) {(s + k_{e} e)}^{T} T_{\hat{x}} \end{matrix}

(46)

In addition, if there exist positive constants

t_{δ}

,

β_{1}

,

β_{2}

,

β_{3}

, and

β_{4}

such that the inequalities in (47) are satisfied for all initial times

t_{0}

, then the signals

{\tilde{W}}_{1}

and

{\tilde{W}}_{2}

will also be UUB.

\begin{matrix} β_{2} I & \geq \int_{t_{0}}^{t_{0} + t_{δ}} C_{p 1} {(t)}^{T} C_{p 1} (t) d t \geq β_{1} I \\ β_{4} I & \geq \int_{t_{0}}^{t_{0} + t_{δ}} C_{p 2} {(t)}^{T} C_{p 2} (t) d t \geq β_{3} I \end{matrix}

(47)

Here,

C_{p 1} (t) = T_{x} (I \otimes ϕ {(u_{I})}^{T})

,

C_{p 2} (t) = T_{\hat{x}} (I \otimes ϕ {(u_{I})}^{T})

Proof.

A proof of this theorem can be found in Appendix A. □
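
For an online implementation, the updating laws (46) have to be integrated in discrete time. The sketch below uses a single forward-Euler step, which is an assumption (the integration scheme is not specified here). The weights are stored with shape (l, n_f) so that the product W^T ϕ is defined, and the default value k_e = 1, the step size dt, and the function name `update_weights` are hypothetical choices for illustration.

```python
import numpy as np


def update_weights(W1, W2, phi, s, e, T_x, T_xhat,
                   gamma1=1.0, gamma2=1.0, k_e=1.0, dt=1e-3):
    """One forward-Euler step of the updating laws (46).

    W1, W2      : (l, n_f) weight estimates
    phi         : (l,) activation vector
    s, e        : (2n,) vectors
    T_x, T_xhat : (2n, n_f) matrices
    """
    W1_dot = gamma1 * np.outer(phi, s @ T_x)                  # Gamma_1 phi s^T T_x
    W2_dot = gamma2 * np.outer(phi, (s + k_e * e) @ T_xhat)   # Gamma_2 phi (s + k_e e)^T T_xhat
    return W1 + dt * W1_dot, W2 + dt * W2_dot


# Hypothetical sizes: l = 5 nodes, n = 2 states, n_f = 2 cost features.
rng = np.random.default_rng(0)
l, n, n_f = 5, 2, 2
W1 = np.zeros((l, n_f))
W2 = np.zeros((l, n_f))
phi = rng.random(l)
s = rng.normal(size=2 * n)
e = rng.normal(size=2 * n)
T_x = rng.normal(size=(2 * n, n_f))
T_xhat = rng.normal(size=(2 * n, n_f))
W1_new, W2_new = update_weights(W1, W2, phi, s, e, T_x, T_xhat)
```

Both updates vanish when s and e are zero, which matches the observation that the weight derivatives shrink as s and e decrease.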

Applying (46) results in s,

\int_{t_{0}}^{t} s d τ

, and e being UUB, as shown in Theorem 1. Additionally, (46) shows that when s and e decrease,

{\dot{\hat{W}}}_{1}

and

{\dot{\hat{W}}}_{2}

decrease as well, resulting in a decrease in

\dot{\hat{W}} = {\dot{\hat{W}}}_{1} + {\dot{\hat{W}}}_{2}

. At this point, as stated in Lemma 1, if matrix C (defined in Lemma 1) has full row rank, then

\tilde{W} = {\tilde{W}}_{1} + {\tilde{W}}_{2}

will also be UUB. Thus, the solution to the IOC problem can be derived by applying (38).

5. Simulations

5.1. Basic Simulation Conditions

To verify the effectiveness of our method, we performed the simulations using a sample linear system controlled by the optimal control method with the original cost weights R selected in two cases.

The sample linear system dynamics can be formulated as follows:

\dot{θ} = A θ + B τ

(48)

where

θ = {[θ_{1}, θ_{2}]}^{T} \in R^{2}

represents the system states. We select

A = [\begin{matrix} 30 & 80 \\ 60 & 0 \end{matrix}]

,

B = [\begin{matrix} 2 & 0 \\ 0 & 4 \end{matrix}]

and

τ \in R^{2}

denoting the control input.
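
As a quick sanity check on the sample system (48), the open-loop eigenvalues of A and the rank of the controllability matrix [B, AB] can be computed directly. This is a minimal sketch using only the A and B given above; the variable names are illustrative.

```python
import numpy as np

A = np.array([[30.0, 80.0],
              [60.0, 0.0]])
B = np.array([[2.0, 0.0],
              [0.0, 4.0]])

# Open-loop eigenvalues: roots of lambda^2 - 30 lambda - 4800 = 0.
eigvals = np.sort(np.linalg.eigvals(A).real)

# Controllability matrix [B, AB] for the two-state system.
ctrb = np.hstack([B, A @ B])
rank = np.linalg.matrix_rank(ctrb)

print(eigvals, rank)
```

One eigenvalue is positive, so the uncontrolled system is unstable, while the full-rank controllability matrix indicates that the pair (A, B) is controllable.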

The cost function selected in these simulations is formulated as

\begin{matrix} V_{r} = \frac{1}{2} \int_{0}^{t_{f}} (θ^{T} Q (t) θ + τ^{T} R τ) d t \end{matrix}

(49)

when all the elements of

θ

satisfy

| θ_{i} | \leq θ_{r_{l}}

. Here,

Q (t) = [\begin{matrix} q_{1} & 0 \\ 0 & q_{2} \end{matrix}]

denotes the continuous time-varying cost weights on the system states

θ

.

R = [\begin{matrix} r_{1} & 0 \\ 0 & r_{2} \end{matrix}]

represents the cost weights on the control inputs.

Moreover, in our simulations, we select 0 as the initial value of all the elements of both

{\hat{W}}_{1}

and

{\hat{W}}_{2}

. The activation function

ϕ (u_{I})

was selected as

ϕ (u_{I}) = {[ϕ_{1} (u_{I}), \dots, ϕ_{i} (u_{I}), \dots, ϕ_{l} (u_{I})]}^{T}

with

ϕ_{i} (u_{I})

designed as

ϕ_{i} (u_{I}) = e x p (\frac{- {(u_{I} - ψ_{i})}^{T} (u_{I} - ψ_{i})}{ν})

(50)

where

ν

denotes a positive scalar and

ψ_{i}

denotes the center of the respective activation function. We initialized the activation function centers on a four-dimensional grid to match the dimension of

u_{I}

, ensuring a uniform distribution across the input space and enhancing network adaptability.
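
The Gaussian activation functions in (50) with grid-initialized centers can be sketched as follows. The grid range [-1, 1], the three points per axis (giving 81 centers), and ν = 2 are hypothetical choices for illustration only and do not reproduce the 49-node configuration used in the simulations; the function names are likewise assumptions.

```python
import itertools

import numpy as np


def make_grid_centers(low, high, points_per_axis, dim=4):
    """Centers psi_i placed on a regular grid in the dim-dimensional input space."""
    axis = np.linspace(low, high, points_per_axis)
    return np.array(list(itertools.product(axis, repeat=dim)))


def rbf_features(u_I, centers, nu):
    """phi_i(u_I) = exp(-(u_I - psi_i)^T (u_I - psi_i) / nu), following (50)."""
    diff = centers - u_I                          # shape (l, dim)
    return np.exp(-np.sum(diff * diff, axis=1) / nu)


centers = make_grid_centers(-1.0, 1.0, 3)         # 3**4 = 81 hypothetical centers
phi = rbf_features(np.zeros(4), centers, nu=2.0)  # activations for u_I = 0
print(phi.shape)
```

Each activation lies in (0, 1] and equals 1 only when the input coincides with a center, so ν controls how quickly a node's influence decays away from its center.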

The overall implementation for recovering the time-varying cost weights is shown in Algorithm 1.

Algorithm 1 Online implementation

Input: $\{x_{i}, u_{i}\}$
Output: $\hat{q} (t)$

Initialization:

1: Initialize $\hat{λ}$ , $\hat{x}$ , ${\hat{W}}_{1}$ , ${\hat{W}}_{2}$ , $Γ_{1}$ , $Γ_{2}$ and $R = I$ .

LOOP Process

2: for $i = 0$ to K do
3: Calculate $λ$ using $λ = - H^{- 1} B u$ .
4: Calculate $\dot{\hat{x}}$ and $\dot{\hat{λ}}$ using (11) and (12).
5: Calculate $\dot{\tilde{x}}$ and $\dot{\tilde{λ}}$ using (14) and (13).
6: Calculate $s = [\begin{matrix} \dot{\tilde{x}} \\ \tilde{λ} \end{matrix}]$ .
7: Calculate e following (40).
8: Calculate $ϕ (u_{I})$ and update ${\hat{W}}_{1}$ , ${\hat{W}}_{2}$ using (46).
9: Calculate $\hat{q} (t)$ using (38).
10: end for
11: return $\hat{q} (t)$

Two cases are considered in the simulation:

In the first case, we apply the optimal control of the sample system with the cost weights on $θ$ given by the signals $q_{1} (t) = 1 + \cos (t)$ and $q_{2} (t) = 2 + \sin (t)$ . The proposed IOC method is employed online to estimate the cost weights, with the simultaneous online recovery of the original system trajectory. Parameters $Γ_{1}$ and $Γ_{2}$ in the updating law are set to $Γ_{1} = 1$ and $Γ_{2} = 1$ , respectively. Parameters k and $k_{p}$ are set to $k = 50$ and $k_{p} = 625$ , respectively. The initial values of ${\hat{W}}_{1}$ and ${\hat{W}}_{2}$ are set to matrices with all elements equal to zero. The original $r_{1}$ and $r_{2}$ are set to $r_{1} = 1$ and $r_{2} = 1$ , respectively. The simulation uses 49 nodes in the neural network.
In the second case, we perform the simulation of our IOC method, but with the original $r_{1}$ and $r_{2}$ set to $r_{1} = 3$ and $r_{2} = 4$ , respectively. All other simulation settings are the same as in the first case.
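
The two simulation cases differ only in R, while the state weights follow the same signals. A minimal sketch of these settings and of the integrand of the cost (49) is given below; the function and variable names are assumptions made for illustration.

```python
import numpy as np


def Q_case(t):
    """Time-varying state weights Q(t) = diag(1 + cos t, 2 + sin t)."""
    return np.diag([1.0 + np.cos(t), 2.0 + np.sin(t)])


R_case1 = np.diag([1.0, 1.0])   # first case: r1 = 1, r2 = 1
R_case2 = np.diag([3.0, 4.0])   # second case: r1 = 3, r2 = 4


def running_cost(theta, tau, t, R):
    """Integrand of (49) at time t, without the leading factor 1/2."""
    return theta @ Q_case(t) @ theta + tau @ R @ tau


theta = np.array([1.0, 1.0])
tau = np.array([1.0, 1.0])
print(running_cost(theta, tau, 0.0, R_case1), running_cost(theta, tau, 0.0, R_case2))
```

Note that q_1(t) = 1 + cos(t) reaches zero at t = π, so Q(t) is only positive semidefinite at that instant.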

As in the simulation sections of previous works [6,24], we take the control input directly from the simulation, thereby ignoring the measurement issues and measurement errors that may occur in real-world applications. This allows us to evaluate the performance of our method in solving the IOC problem in isolation. In actual applications, the control input can be calculated by substituting the measured

\dot{θ}

into (48), as described in [24].

5.2. Results

The simulation results are shown in the figures below.

In Figure 1, the blue solid line represents the original variation in the cost weights whereas the gray solid line represents the estimated cost weights. After a brief period of oscillation at the initial time, our method accurately recovers the original cost weights when

R = I

. Notably, similar to the case in other adaptive control methods and adaptive neural network based control methods, the initial oscillation is a result of the adaptive initialization of the weights in (46) due to the large initial errors in

{\tilde{W}}_{1}

and

{\tilde{W}}_{2}

.

Figure 2 demonstrates the impact of selecting

R = I

on the estimation results when the original R value is arbitrary. The solid blue line represents the original time-varying cost weights, whereas the dotted gray line represents the final estimated values. Although the estimated values differ from the original values, the general trend of the changes is preserved. In addition, the gray line represents the mutual weights in the dynamics of the system state, whereas the original weights among the control inputs are reflected in the current estimate of

q (t)

. From the figure, we can observe that the bottom lines in blue and gray colors represent the value of the original and estimated

q_{2}

. Evidently, the blue line for

q_{2}

is larger than that for

q_{1}

from 4.8 s to 5 s. Additionally, in the original settings,

r_{2}

is 4, which confers greater importance to the decrease in

u_{2}

compared with the case when

r_{1} = 3

, leading to the weakening of the convergence of the

θ_{2}

term associated with

u_{2}

. In our estimates, the value of the dashed line for the estimated

q_{2}

, which also considers the impact of the original setting of R, is not greater than the value of the estimated

q_{1}

between 4.8 s and 5 s. This indicates that the convergence of

θ_{2}

is weakened by considering the impact from the cost weights on control input. Our dashed line more accurately reflects the actual situation compared to the blue line.

In Figure 3, Figure 4 and Figure 5, we show the results of error e, states s and

\int_{t_{0}}^{t} s d τ

in two cases. The blue lines show the results of the first case, whereas the gray dotted lines show the results of the second case. From the figures, we can observe that all the values effectively decrease to a low range during the simulation, and most importantly, in the second case, the different selections of R do not affect the convergence of these values. This demonstrates the effectiveness of our method and highlights that even with different values of R, the recovered cost weights are still feasible solutions to the IOC problem, as they can be utilized to regenerate a similar system trajectory and control inputs (

\int_{t_{0}}^{t} s d τ = [\begin{matrix} \tilde{x} \\ \int_{t_{0}}^{t} \tilde{λ} d τ \end{matrix}] \to 0

).

6. Discussion

6.1. Robustness of the Proposed Method to Noisy Data

In (46),

Γ_{1}

and

Γ_{2}

decrease the error by regulating the updating speed of the estimated values. Adjusting these two terms may successfully reduce the impact of data noise to a certain degree. Their roles are similar to that of a low-pass filter’s time constant. For example, in the setting of the first case, when noise exists,

x \sim N (0, 10^{- 1})

and

u \sim N (0, 10^{- 4})

, the simulation results show that different sets of

Γ_{1}

and

Γ_{2}

(e.g.,

Γ_{1} = 10

,

Γ_{2} = 10

;

Γ_{1} = 1

,

Γ_{2} = 1

) can significantly influence the noise reduction performance.

As shown in Figure 6, while relatively small values of

Γ_{1}

and

Γ_{2}

may result in a low convergence rate, they effectively reduce the impact of data noise. Our method demonstrates robustness against noise by allowing for the adjustment of parameters

Γ_{1}

and

Γ_{2}

.
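
The low-pass-filter analogy can be illustrated with a toy scalar adaptive law, \dot{\hat{w}} = Γ (y - \hat{w}), driven by a noisy measurement y. This is a deliberately simplified analogue and not the estimator of (46); the noise level, step size, and true value are hypothetical.

```python
import numpy as np


def adapt(gamma, noise, w_true=1.0, dt=1e-3):
    """Forward-Euler simulation of w_hat' = gamma * (y - w_hat), y = w_true + noise."""
    w_hat = 0.0
    history = np.empty(noise.size)
    for i, n_i in enumerate(noise):
        w_hat += dt * gamma * (w_true + n_i - w_hat)
        history[i] = w_hat
    return history


rng = np.random.default_rng(0)
noise = 0.3 * rng.standard_normal(20_000)   # 20 s of samples at dt = 1 ms

fast = adapt(10.0, noise)                   # large gain: quick but noisy
slow = adapt(1.0, noise)                    # small gain: slow but smooth

steady = slice(10_000, None)                # discard the first 10 s as transient
print(fast[steady].std(), slow[steady].std())
```

In this toy setting 1/Γ acts as the filter time constant: the smaller gain yields a smoother steady-state estimate at the price of slower convergence, which mirrors the trade-off described above.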

6.2. Calculation Complexity and Real-Time Calculation

The proposed algorithm has a low computational complexity, as it only involves matrix-vector products and the summation of vectors. Additionally, it does not require any iterative or optimization calculations. This makes it an efficient solution for real-time calculations. In fact, our simulation shows that a single iteration of the algorithm using case 1 settings takes only approximately 0.23 ms in MATLAB 2016b to complete the SIOC’s calculation, which is fast enough to meet real-time calculation requirements.

6.3. Advantages of Using $R = I$

The simulation results suggest that one of the key advantages of setting R as a constant I is that it effectively consolidates the impact of cost weights on state convergence, which would have been influenced by different settings of R, into the estimated value of

q (t)

. This allows for a comprehensive evaluation of the system state convergence, as it only depends on

q (t)

, without needing to account for additional considerations. Furthermore, by maintaining a consistent value of

R = I

, it is possible to standardize the analysis of the same motion across multiple agents, which is crucial for various applications.

7. Conclusions

In this paper, we proposed a neural network based method for recovering the time-varying cost weights in the IOC problem for linear continuous systems. Our approach involved constructing an auxiliary estimation system that closely approximates the behavior of the original system, followed by determining the necessary conditions for tuning the weights of the neurons in the neural network to obtain a unique solution for the IOC problem. We discussed the necessary requirements for the previous settings to ensure the well-posedness of our online IOC method. We showed that the unique solution corresponds to achieving a nearly zero error between the original system state and the auxiliary estimated system state, as well as nearly zero error between the original costate and the integral of the estimated costate. Based on this analysis, we developed two neural network frameworks: one for approximating the cost weight function and the other for addressing the error introduced by the auxiliary estimation system and terms. Finally, we validated the effectiveness of our method through simulations, highlighting its ability to recover time-varying cost weights and its robustness against different original choices of R. Overall, our method represents a significant advancement in the field of online IOC, and it is applicable to a wide range of problems requiring real-time IOC calculations.

Author Contributions

Conceptualization, S.C.; methodology, S.C.; software, C.Q.; writing—original draft, S.C.; writing—review and editing, S.C.; project administration, Z.L. and C.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Proof of Theorem 1

Proof.

Consider the following Lyapunov function candidate:

\begin{matrix} V & = \frac{1}{2} s^{T} s + \frac{1}{2} {(\int_{t_{0}}^{t} s d τ)}^{T} K_{p} \int_{t_{0}}^{t} s d τ \\ + \frac{1}{2} t r [{\tilde{W}}_{1}^{T} Γ_{1}^{- 1} {\tilde{W}}_{1} + {\tilde{W}}_{2}^{T} Γ_{2}^{- 1} {\tilde{W}}_{2}] \end{matrix}

(A1)

The derivative of V can be expressed as

\dot{V} = s^{T} \dot{s} + s^{T} K_{p} \int_{t_{0}}^{t} s d τ - t r [{\tilde{W}}_{1}^{T} Γ_{1}^{- 1} {\dot{\hat{W}}}_{1} + {\tilde{W}}_{2}^{T} Γ_{2}^{- 1} {\dot{\hat{W}}}_{2}]

(A2)

By introducing (45) and utilizing the proposed updating law of

{\hat{W}}_{1}

and

{\hat{W}}_{2}

in (46),

\dot{V}

becomes

\begin{matrix} \dot{V} & = - s^{T} k s + s^{T} T_{x} {\tilde{q}}_{1} + s^{T} e \\ & \quad - t r [{\tilde{W}}_{1}^{T} Γ_{1}^{- 1} {\dot{\hat{W}}}_{1} + {\tilde{W}}_{2}^{T} Γ_{2}^{- 1} {\dot{\hat{W}}}_{2}] \\ & = - s^{T} k s + s^{T} T_{x} ({\tilde{W}}_{1}^{T} ϕ (u_{I}) + ϵ_{1} (u_{I})) \\ & \quad + s^{T} T_{\hat{x}} ({\tilde{W}}_{2}^{T} ϕ (u_{I}) + ϵ_{2} (u_{I})) \\ & \quad - t r [{\tilde{W}}_{1}^{T} Γ_{1}^{- 1} {\dot{\hat{W}}}_{1} + {\tilde{W}}_{2}^{T} Γ_{2}^{- 1} {\dot{\hat{W}}}_{2}] \\ & = - s^{T} k s + s^{T} T_{x} ϵ_{1} (u_{I}) + s^{T} T_{\hat{x}} ϵ_{2} (u_{I}) \\ & \quad + k_{e} e^{T} T_{\hat{x}} ϵ_{2} (u_{I}) - e^{T} k_{e} e \end{matrix}

(A3)
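
The last equality in (A3) rests on the cyclic property of the trace. As a short worked step, assuming scalar

Γ_{1}

and

Γ_{2}

as in Theorem 1, substituting (46) into the trace terms gives

\begin{matrix} t r [{\tilde{W}}_{1}^{T} Γ_{1}^{- 1} {\dot{\hat{W}}}_{1}] & = t r [{\tilde{W}}_{1}^{T} ϕ (u_{I}) s^{T} T_{x}] = s^{T} T_{x} {\tilde{W}}_{1}^{T} ϕ (u_{I}) \\ t r [{\tilde{W}}_{2}^{T} Γ_{2}^{- 1} {\dot{\hat{W}}}_{2}] & = t r [{\tilde{W}}_{2}^{T} ϕ (u_{I}) {(s + k_{e} e)}^{T} T_{\hat{x}}] = {(s + k_{e} e)}^{T} T_{\hat{x}} {\tilde{W}}_{2}^{T} ϕ (u_{I}) \end{matrix}

The terms containing s together with the weight errors therefore cancel, and the only remaining weight-dependent term is

- k_{e} e^{T} T_{\hat{x}} {\tilde{W}}_{2}^{T} ϕ (u_{I}) = - k_{e} e^{T} (e - T_{\hat{x}} ϵ_{2} (u_{I}))

where the equality follows from (44).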

Here, by introducing a new vector p defined as

p = [\begin{matrix} s \\ \frac{k_{e}}{\sqrt{k}} e \end{matrix}]

and considering (44), (A3) can be rewritten as

\dot{V} = - p^{T} K p + p^{T} p_{ϵ}

(A4)

where

p_{ϵ} = [\begin{matrix} T_{x} ϵ_{1} (u_{I}) + T_{\hat{x}} ϵ_{2} (u_{I}) \\ \sqrt{k} T_{\hat{x}} ϵ_{2} (u_{I}) \end{matrix}]

.

By considering the boundedness condition of

T_{x}

,

T_{\hat{x}}

,

ϵ_{1} (u_{I})

and

ϵ_{2} (u_{I})

, we have

| | p_{ϵ} | | \leq \sqrt{{(δ_{t_{x}} ϵ_{n 1} + δ_{t_{\hat{x}}} ϵ_{n 2})}^{2} + k {(δ_{t_{\hat{x}}} ϵ_{n 2})}^{2}} \equiv δ_{t_{p ϵ}}

(A5)

From this boundedness condition, (A3) becomes

\begin{matrix} \dot{V} & \leq - k {| | p | |}^{2} + | | p | | \, | | p_{ϵ} | | \\ & \leq - k {| | p | |}^{2} + | | p | | δ_{t_{p ϵ}} \\ & = - | | p | | (k | | p | | - δ_{t_{p ϵ}}) \end{matrix}

(A6)

From (A6), the right-hand side of (A6) is non-positive when

| | p | | \geq \frac{δ_{t_{p ϵ}}}{k}

, implying that

\dot{V} \leq 0

and p would maintain convergence when

| | p | | \geq \frac{δ_{t_{p ϵ}}}{k}

. Moreover, due to the vector

p = [\begin{matrix} s \\ \frac{k_{e}}{\sqrt{k}} e \end{matrix}]

, s as well as e would all be bounded satisfying

\begin{matrix} | | s | | & \leq δ_{s} \end{matrix}

(A7)

\begin{matrix} | | e | | & \leq δ_{e} \end{matrix}

(A8)

That is, s, e would all be UUB. Moreover, due to the continuity,

\dot{s}

would also be UUB, satisfying

\begin{matrix} | | \dot{s} | | \leq δ_{\dot{s}} \end{matrix}

(A9)

Notably, with increasing k, the bound

\frac{δ_{t_{p ϵ}}}{k}

of p decreases. Furthermore, since V decreases continuously while

| | p | | \geq \frac{δ_{t_{p ϵ}}}{k}

,

\int_{t_{0}}^{t} s d τ

would also be UUB.

Furthermore, from (41), we have

\begin{matrix} | | T_{x} {\tilde{q}}_{1} | | & = | | \dot{s} + K s + K_{p} \int_{t_{0}}^{t} s d τ - e | | \\ & \leq B_{f h} \end{matrix}

(A10)

where

B_{f h}

denotes a positive scalar. Furthermore, by considering (43), we have

\begin{matrix} | | T_{x} {\tilde{W}}_{1}^{T} ϕ (u_{I}) | | & = | | T_{x} {\tilde{q}}_{1} - T_{x} ϵ_{1} (u_{I}) | | \\ & \leq | | T_{x} {\tilde{q}}_{1} | | + | | T_{x} | | \, | | ϵ_{1} (u_{I}) | | \\ & \leq B_{f h} + δ_{t_{x}} ϵ_{n 1} \end{matrix}

(A11)

Similarly, from the boundedness of e,

ϵ_{2} (u_{I})

and (44), we have

\begin{matrix} | | T_{\hat{x}} {\tilde{W}}_{2}^{T} ϕ (u_{I}) | | & \leq δ_{e} + δ_{t_{\hat{x}}} ϵ_{n 2} \end{matrix}

(A12)

From (46), the dynamics related to

{\tilde{W}}_{1}

and

{\tilde{W}}_{2}

can be respectively given by

\begin{matrix} \{\begin{matrix} {\dot{\tilde{W}}}_{1} = - Γ_{1} ϕ (u_{I}) s^{T} T_{x} \\ y_{1} = T_{x} {\tilde{W}}_{1}^{T} ϕ (u_{I}) \end{matrix} \end{matrix}

(A13)

\begin{matrix} \{\begin{matrix} {\dot{\tilde{W}}}_{2} = - Γ_{2} ϕ (u_{I}) {(s + k_{e} e)}^{T} T_{\hat{x}} \\ y_{2} = T_{\hat{x}} {\tilde{W}}_{2}^{T} ϕ (u_{I}) \end{matrix} \end{matrix}

(A14)

where

y_{1}

and

y_{2}

denote the outputs of two systems and are both bounded following (A11) and (A12).

Thus, the vector dynamics of the two systems can be given as

\begin{matrix} \{\begin{matrix} \frac{d}{d t} v e c ({\tilde{W}}_{1}) = - (I \otimes Γ_{1} ϕ (u_{I})) T_{x}^{T} s = B_{p 1} (t) s \\ y_{1} = T_{x} (I \otimes ϕ {(u_{I})}^{T}) v e c ({\tilde{W}}_{1}) = C_{p 1} (t) v e c ({\tilde{W}}_{1}) \end{matrix} \end{matrix}

(A15)

\begin{matrix} \{\begin{matrix} \frac{d}{d t} v e c ({\tilde{W}}_{2}) = - (I \otimes Γ_{2} ϕ (u_{I})) T_{\hat{x}}^{T} (s + k_{e} e) = B_{p 2} (t) (s + k_{e} e) \\ y_{2} = T_{\hat{x}} (I \otimes ϕ {(u_{I})}^{T}) v e c ({\tilde{W}}_{2}) = C_{p 2} (t) v e c ({\tilde{W}}_{2}) \end{matrix} \end{matrix}

(A16)
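
The vectorized output equations in (A15) and (A16), together with the excitation condition (47), can be checked numerically. The sketch below assumes a column-stacking vec(·), weights of shape (l, n_f), a constant random placeholder for T_x, and randomly drawn activation vectors; the helper names `C_p` and `excitation_bounds` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
l, n, n_f = 4, 2, 2                            # assumed sizes
T_x = rng.normal(size=(2 * n, n_f))            # placeholder, not the paper's T_x
W_tilde = rng.normal(size=(l, n_f))
phi = rng.normal(size=l)


def C_p(T, phi, n_f):
    """C_p(t) = T (I kron phi^T), acting on the column-stacked vec(W)."""
    return T @ np.kron(np.eye(n_f), phi[None, :])


# Output computed directly and through the vectorized form.
vec_W = W_tilde.flatten(order="F")             # column-major vec
y_direct = T_x @ W_tilde.T @ phi
y_vec = C_p(T_x, phi, n_f) @ vec_W


def excitation_bounds(C_samples, dt):
    """Smallest and largest eigenvalue of a Riemann-sum approximation of (47)."""
    gram = sum(C.T @ C for C in C_samples) * dt
    eig = np.linalg.eigvalsh(gram)
    return eig[0], eig[-1]


samples = [C_p(T_x, rng.normal(size=l), n_f) for _ in range(50)]
beta_low, beta_high = excitation_bounds(samples, dt=0.01)
print(np.allclose(y_direct, y_vec), beta_low > 0.0)
```

A strictly positive lower eigenvalue over the window plays the role of β_1 (or β_3) in (47). A single sample cannot provide it, because each C_p(t)^T C_p(t) has rank at most 2n, which is smaller than the number of weights in this example.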

where

B_{p 1} (t) = - (I \otimes Γ_{1} ϕ (u_{I})) T_{x}^{T}

and

B_{p 2} (t) = - (I \otimes Γ_{2} ϕ (u_{I})) T_{\hat{x}}^{T}

are bounded, given the boundedness of

ϕ (u_{I})

and

T_{x}, T_{\hat{x}}

. Thus, from Lemma 4.2.1 in [29], if (47) is satisfied, the boundedness of

y_{1}

and

y_{2}

as well as those of s and

s + k_{e} e

assures the boundedness of

{\tilde{W}}_{1}

,

{\tilde{W}}_{2}

, that is, there exist two positive scalars

δ_{{\tilde{W}}_{1}}

,

δ_{{\tilde{W}}_{2}}

such that

| | {\tilde{W}}_{1} | | \leq δ_{{\tilde{W}}_{1}}

,

| | {\tilde{W}}_{2} | | \leq δ_{{\tilde{W}}_{2}}

. Thus,

{\tilde{W}}_{1}

and

{\tilde{W}}_{2}

would be UUB. □

References

1. Frigon, A.; Akay, T.; Prilutsky, B.I. Control of Mammalian Locomotion by Somatosensory Feedback. Compr. Physiol. 2021, 12, 2877–2947.
2. Li, Y.; Tee, K.P.; Yan, R.; Chan, W.L.; Wu, Y. A framework of human–robot coordination based on game theory and policy iteration. IEEE Trans. Robot. 2016, 32, 1408–1418.
3. Ziebart, B.D.; Maas, A.L.; Bagnell, J.A.; Dey, A.K. Human Behavior Modeling with Maximum Entropy Inverse Optimal Control. In Proceedings of the AAAI Spring Symposium: Human Behavior Modeling, Stanford, CA, USA, 23–25 March 2009; Volume 92.
4. Berret, B.; Chiovetto, E.; Nori, F.; Pozzo, T. Evidence for composite cost functions in arm movement planning: An inverse optimal control approach. PLoS Comput. Biol. 2011, 7, e1002183.
5. El-Hussieny, H.; Abouelsoud, A.; Assal, S.F.; Megahed, S.M. Adaptive learning of human motor behaviors: An evolving inverse optimal control approach. Eng. Appl. Artif. Intell. 2016, 50, 115–124.
6. Jin, W.; Kulić, D.; Mou, S.; Hirche, S. Inverse optimal control from incomplete trajectory observations. Int. J. Robot. Res. 2021, 40, 848–865.
7. Kalman, R.E. When is a linear control system optimal? J. Fluids Eng. 1964, 86, 51–60.
8. Molinari, B. The stable regulator problem and its inverse. IEEE Trans. Autom. Control 1973, 18, 454–459.
9. Obermayer, R.; Muckler, F.A. On the Inverse Optimal Control Problem in Manual Control Systems; NASA: Washington, DC, USA, 1965; Volume 208.
10. Boyd, S.; El Ghaoui, L.; Feron, E.; Balakrishnan, V. Linear Matrix Inequalities in System and Control Theory; SIAM: Philadelphia, PA, USA, 1994.
11. Priess, M.C.; Conway, R.; Choi, J.; Popovich, J.M.; Radcliffe, C. Solutions to the inverse LQR problem with application to biological systems analysis. IEEE Trans. Control Syst. Technol. 2014, 23, 770–777.
12. Rodriguez, A.; Ortega, R. Adaptive stabilization of nonlinear systems: The non-feedback linearizable case. IFAC Proc. Vol. 1990, 23, 303–306.
13. Freeman, R.A.; Kokotovic, P.V. Inverse optimality in robust stabilization. SIAM J. Control Optim. 1996, 34, 1365–1391.
14. Chan, T.C.; Mahmood, R.; Zhu, I.Y. Inverse optimization: Theory and applications. Oper. Res. 2023.
15. Cao, S.; Luo, Z.; Quan, C. Sequential Inverse Optimal Control of Discrete-Time Systems. IEEE/CAA J. Autom. Sin. 2024, 11, 1–14.
16. Tomasi, M.; Artoni, A. Identification of motor control objectives in human locomotion via multi-objective inverse optimal control. J. Comput. Nonlinear Dyn. 2023, 18, 051004.
17. Jean, F.; Maslovskaya, S. Injectivity of the inverse optimal control problem for control-affine systems. In Proceedings of the 2019 IEEE 58th Conference on Decision and Control (CDC), Nice, France, 11–13 December 2019; pp. 511–516.
18. Dewhurst, J. A Collage-Based Approach to Inverse Optimal Control Problems with Unique Solutions. Ph.D. Thesis, University of Guelph, Guelph, ON, Canada, 2021.
19. Johnson, M.; Aghasadeghi, N.; Bretl, T. Inverse optimal control for deterministic continuous-time nonlinear systems. In Proceedings of the 52nd IEEE Conference on Decision and Control, Firenze, Italy, 10–13 December 2013; pp. 2906–2913.
20. Abbeel, P.; Ng, A.Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, Banff, AB, Canada, 4–8 July 2004; p. 1.
21. Ziebart, B.D.; Maas, A.L.; Bagnell, J.A.; Dey, A.K. Maximum entropy inverse reinforcement learning. In Proceedings of the AAAI, Chicago, IL, USA, 13–17 July 2008; Volume 8, pp. 1433–1438.
22. Molloy, T.L.; Ford, J.J.; Perez, T. Online inverse optimal control for control-constrained discrete-time systems on finite and infinite horizons. Automatica 2020, 120, 109109.
23. Gupta, R.; Zhang, Q. Decomposition and Adaptive Sampling for Data-Driven Inverse Linear Optimization. INFORMS J. Comput. 2022, 34, 2720–2735.
24. Jin, W.; Kulić, D.; Lin, J.F.S.; Mou, S.; Hirche, S. Inverse optimal control for multiphase cost functions. IEEE Trans. Robot. 2019, 35, 1387–1398.
25. Athans, M.; Falb, P.L. Optimal Control: An Introduction to the Theory and Its Applications; Courier Corporation: Chelmsford, MA, USA, 2007.
26. Ab Azar, N.; Shahmansoorian, A.; Davoudi, M. From inverse optimal control to inverse reinforcement learning: A historical review. Annu. Rev. Control 2020, 50, 119–138.
27. Li, Y.; Yao, Y.; Hu, X. Continuous-time inverse quadratic optimal control problem. Automatica 2020, 117, 108977.
28. Zhang, H.; Ringh, A. Inverse linear-quadratic discrete-time finite-horizon optimal control for indistinguishable homogeneous agents: A convex optimization approach. Automatica 2023, 148, 110758.
29. Lewis, F.; Jagannathan, S.; Yesildirak, A. Neural Network Control of Robot Manipulators and Non-Linear Systems; CRC Press: Boca Raton, FL, USA, 2020.

Figure 1. Estimated cost weights (

r_{1} = 1, r_{2} = 1

).

Figure 2. Estimated cost weights (

r_{1} = 3, r_{2} = 4

).

Figure 3. Variation of error e (

r_{1} = 1, r_{2} = 1

and

r_{1} = 3, r_{2} = 4

).

Figure 4. Variation of error s (

r_{1} = 1, r_{2} = 1

and

r_{1} = 3, r_{2} = 4

).

Figure 5. Variation of

\int_{t_{0}}^{t} s d τ

(

r_{1} = 1, r_{2} = 1

and

r_{1} = 3, r_{2} = 4

).

Figure 6. Estimated cost weights (Noisy Case): (1)

Γ_{1} = 10

,

Γ_{2} = 10

(2)

Γ_{1} = 1

,

Γ_{2} = 1

.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
