1. Introduction
The integrated development of information technologies has led to the emergence of cyber-physical systems (CPS). A CPS is a modern control system that deeply integrates computation, communication and control (3C) technologies with physical systems, monitoring and controlling physical processes through computing systems in real time. As one of the core technologies of Industry 4.0 at the current stage, CPS has wide application prospects in many fields, including aerospace, civil infrastructure, energy, intelligent manufacturing, intelligent transportation, new materials, etc. [1]. The mass deployment of wireless networks in CPS has facilitated production and daily life, but it has also brought security risks [2,3,4]. Cyber attackers are able to disrupt the physical control process of the system through various attack methods. Frequent CPS security incidents in recent years indicate that it is of great significance to explore the security of CPS under cyber attacks.
Cyber attacks have become one of the main threats to CPS security, and investigating their characteristics helps to better protect CPS. Typical attacks include the denial-of-service (DoS) attack [5,6], the deception attack [7,8] and the replay attack [9,10]. Among them, the DoS attack is the easiest to implement, but its destructive power is considerable. DoS attacks can block data communication between system components through physical-layer congestion or transport-layer flooding, without any information disclosure or channel monitoring. To measure the damage that a DoS attack causes to CPS performance, state estimation is a common technique that reflects the internal control information of the system [11,12]. Thus, the research task of this paper focuses on secure state estimation in CPS under DoS attacks.
Current studies mainly consider the secure state estimation of CPS under DoS attacks from three perspectives: the sensor, the attacker, and the game between the two sides. From the perspective of the sensor, reasonable and effective defense strategies should be studied to minimize the impact of DoS attacks on state estimation. To solve the CPS security problem under continuous random DoS attacks, Ref. [13] designed a dynamic output feedback controller to provide control input data. To achieve stable control of CPS under DoS jamming attacks, controllers with state feedback and output feedback were designed in [14] based on a passive attack-tolerant mechanism. For DoS attacks with different energy levels, a control method combining active and passive attack-tolerance strategies was proposed in [15], which can effectively resist the impact of DoS attacks on the CPS. From the perspective of the attacker, some researchers considered how to design attack strategies that degrade system performance as much as possible. An optimal attack strategy based on local Kalman filtering was proposed in [16,17] for the case of limited DoS attack energy. The secure state estimation problem of a CPS with two subsystems was studied in [18], where an optimal attack schedule was proposed. The works in [19,20] considered the optimal power allocation of a DoS attacker in a multi-agent setting. Furthermore, the goals of the two sides under cyber attacks are opposite, so the confrontation between the system defender and the attacker can be regarded as a game process. The study in [21] considered the remote state estimation problem of CPS under DoS attacks based on the signal-to-interference-plus-noise ratio using a two-player game method. In [22], a static cooperative game was formulated for the collaboration among multiple DoS attackers, while a non-cooperative game described the competition among multiple controllers. On the basis of the above research, this paper addresses the issue from these three perspectives within a unified framework to develop a universal solution.
The above literature mostly adopts methods from control theory or game theory. Real-world CPSs are mostly complex nonlinear systems, for which constructing models and designing control laws is complicated. Reinforcement learning (RL) provides an innovative approach to this problem. With the rapid development of artificial intelligence in recent years, RL algorithms have been widely used in many fields due to their model-free learning mechanism [23,24,25,26]. Extending RL methods to the secure state estimation problem in CPS is reasonable, because the attack-and-defense interaction of CPS under cyber attacks can be described by a Markov decision process (MDP), and relevant studies have proven its practicability [27,28,29,30]. In [27], the action-value function in the RL framework was trained to find the optimal energy management strategy of a hybrid electric system. Q-learning, as a typical off-policy RL algorithm, was modified in [29] to find the Nash equilibrium policies between sensors and attackers. The problem of CPS against DoS attacks over reliable and unreliable channels was addressed in [30], where the RL method was shown to solve the optimal policy selection problem of the attacker and the sensor. In the RL framework, the action selection strategy is an important component. ε-greedy [31] and Ant-TD [32] combine probabilistic and greedy rules to balance exploration and exploitation. EFR-ESO [33] can solve the high-dimensional selection problem that arises when dealing with complex systems. In this paper, ε-greedy is chosen as the main action selection strategy due to its simplicity and versatility.
Furthermore, RL algorithms are divided into on-policy and off-policy algorithms according to their iteration patterns. Normally, on-policy algorithms focus on stability, while off-policy algorithms emphasize efficiency [34]. To investigate the effect of different policy iteration patterns on algorithm performance, this paper applies the frameworks of two classical RL algorithms, Q-learning (off-policy) and SARSA (on-policy), to the security of CPS against DoS attacks. The confrontation game process between the attacker and the defender under cyber attack is modeled as an MDP, and the corresponding RL methods are then proposed.
The research task of this paper is RL-based secure state estimation of CPS under DoS attacks. Firstly, the interactive combat process between the CPS defender and the DoS attacker is established comprehensively, where the energy constraints of the sensor and the attacker are taken into account, and the Kalman filter serves as the evaluation standard for CPS security. Then, the transition of the state estimation error covariance under the DoS attack is characterized as an MDP to lay the foundation for the RL algorithm design. The frameworks of Q-learning and SARSA, which are representatives of off-policy and on-policy algorithms, respectively, are introduced into secure state estimation. The contributions of this paper are as follows:
(i) The Kalman filter is combined with reinforcement learning to estimate the secure state of CPS under cyber attacks. The state estimation error covariance is selected as the state in the reinforcement learning algorithm, and its transition is described as a Markov decision process. The output of this artificial-intelligence method can guide the sensor or the attacker to find the optimal countermeasure;
(ii) Based on the frameworks of Q-learning and SARSA, secure state estimation algorithms are designed with reinforcement learning from three different perspectives: the defender, the attacker, and the game between the two opposing sides;
(iii) Through a comparative analysis of the two algorithms, the differences between Q-learning and SARSA are verified. It is demonstrated that the SARSA method is slightly more conservative than the Q-learning method due to the lower values in its Q-table.
The rest of this paper is organized as follows. Section 2 introduces the relevant theoretical background, including the two types of RL algorithms and the concept of the two-player zero-sum game. In Section 3, the framework of CPS under DoS attack is formed, where the CPS structure and the DoS attack model are established. In Section 4, an MDP is set up to describe the RL problem arising from the interactive game process between the sensor and the attacker, and RL algorithms based on the frameworks of Q-learning and SARSA are designed to find the optimal policy and the Nash equilibrium policy. Section 5 presents the simulation results, proves the effectiveness of the algorithms, and compares and analyzes the different algorithms according to the numerical values of their Q-tables. Section 6 concludes the paper and identifies future research directions.
4. Reinforcement Learning Methods Finding Optimal Policies
An MDP is established to describe the RL problem of the CPS against the DoS attack. The elements of the MDP are as follows:
State: The estimation error covariance of the remote estimator is taken as the state variable. The state at the current time is determined by the previous error covariance together with the actions of the sensor and the attacker. S denotes the set of states in the RL problem.
Action: The sensor selects the channel in “state 0” or “state 1” to send the packet of local estimation data to the remote estimator, and the DoS attacker chooses whether or not to attack the channel. A denotes the action set of the RL problem, composed of the sensor’s and the attacker’s action sets.
State transition rule: The estimation error covariance of the remote estimator depends on its current state and the actions of the sensor and the attacker. The next state of the estimator is obtained by the state transition formula (14).
Reward function: The reward function, as an assessment of the current state and action behaviors, is defined in (15), where Tr(·) denotes the trace of a matrix. The reward function is designed based on the principle that the sensor aims to minimize the estimation error covariance while the attacker aims to maximize it. The cost constraints of the sensor and the attacker are also considered.
Discount factor: The discount factor ρ characterizes the fact that the sensor and the attacker focus more on current rewards than on future ones. Meanwhile, it ensures the convergence of the Q-function.
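To make the MDP elements concrete, the following Python sketch builds a small truncated state set and the two binary action sets together with a plausible transition and reward. The system matrices, the cost values, the packet-loss condition and the cost signs in the reward are illustrative assumptions, not the paper's exact Equations (14) and (15).

```python
import numpy as np

A_sys = np.array([[1.2, 0.3], [0.0, 0.9]])   # assumed system matrix
Q_w   = 0.5 * np.eye(2)                      # assumed process noise covariance
P_bar = np.eye(2)                            # assumed steady-state covariance from the Kalman filter
c_s, c_d = 1.0, 1.5                          # assumed sensor / attacker costs

def h(P):
    """Open-loop update of the error covariance when no packet arrives."""
    return A_sys @ P @ A_sys.T + Q_w

# Truncated state set S = {P_bar, h(P_bar), h^2(P_bar), ...}
S = [P_bar]
for _ in range(4):
    S.append(h(S[-1]))

A_sensor   = [0, 1]   # channel in "state 0" or "state 1"
A_attacker = [0, 1]   # 0: no attack, 1: attack the channel

def transition(i, a_s, a_a):
    """Index of the next state; assumes the packet is lost only when the
    attacker jams the (vulnerable) channel chosen by the sensor."""
    packet_received = not (a_a == 1 and a_s == 0)   # illustrative assumption
    return 0 if packet_received else min(i + 1, len(S) - 1)

def reward(i_next, a_s, a_a):
    """Illustrative zero-sum reward: trace of the next covariance plus action
    costs (the sensor minimises this quantity, the attacker maximises it)."""
    return np.trace(S[i_next]) + c_s * a_s - c_d * a_a
```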
The problems are considered from three perspectives: the sensor, the attacker, and the game between them. The algorithms are designed based on the frameworks of Q-learning and SARSA, respectively. The overall framework of the algorithm is universal and is described as follows:
(i) Given the system parameters, the steady-state estimation error covariance is obtained by running the Kalman filter and is regarded as the initial state s of the RL algorithm;
(ii) Create and initialize an i × j Q-table, where i is the number of states and j is the number of actions;
(iii) Select actions according to the current state and the ε-greedy strategy;
(iv) Obtain the instant reward and the next state according to the reward function and the state transition rule of the MDP;
(v) Update the Q-table and the state.
After a sufficient number of time steps, the Q-table eventually converges. From the converged Q-table, the optimal strategy of the sensor or the attacker and the Nash equilibrium strategy can be obtained.
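Steps (i)–(v) translate into the tabular training loop sketched below, shown from the sensor's perspective (greedy means lowest Q-value). The environment here is a placeholder standing in for the transition rule (14) and the reward (15), and the sizes and hyper-parameters are assumptions.

```python
import numpy as np

n_states, n_actions = 5, 2            # assumed sizes of the state and action sets
alpha, rho, epsilon = 0.1, 0.9, 0.3   # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

def step(s, a):
    """Placeholder environment: replace with the transition rule (14) and reward (15)."""
    s_next = int(rng.integers(n_states))
    r = float(s_next)                  # stand-in for Tr(P) plus the cost terms
    return r, s_next

Q = np.zeros((n_states, n_actions))    # (ii) create and initialise the i x j Q-table
s = 0                                  # (i) initial state: index of the steady-state covariance

for k in range(10_000):
    # (iii) epsilon-greedy selection; the sensor exploits by taking the lowest Q-value
    a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmin(Q[s]))
    # (iv) instant reward and next state from the MDP
    r, s_next = step(s, a)
    # (v) update the Q-table (Q-learning target shown; SARSA would bootstrap on the
    #     action actually chosen for s_next) and then update the state
    Q[s, a] += alpha * (r + rho * np.min(Q[s_next]) - Q[s, a])
    s = s_next
```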
Remark 3. The ε-greedy strategy is described as follows: when an agent makes a decision, it selects an unknown action randomly with a probability of ε, and with the remaining probability 1−ε it follows a greedy strategy (such as choosing the action with the highest value in the Q-table). The ε-greedy strategy achieves a good balance between “exploration” and “exploitation”: “exploration” refers to trying an action that has not yet been attempted, while “exploitation” refers to choosing, from the known actions, the one that may maximize the total payoff in the long run. The balance between them is key to whether the RL system is able to obtain the optimal solution. The general ε-greedy strategy converges slowly and is not suitable for non-stationary distributions. To improve it, this paper sets ε to a large value at the beginning of training and lets it decay gradually over time.
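A minimal sketch of the decaying ε-greedy selection in Remark 3 follows; the exponential decay schedule and its parameters are assumptions, since the paper only states that ε starts large and decays gradually.

```python
import numpy as np

rng = np.random.default_rng(0)
eps_start, eps_min, decay = 1.0, 0.05, 0.999   # assumed schedule parameters

def epsilon_at(k):
    """Large exploration rate at the beginning, decaying gradually with the time step k."""
    return max(eps_min, eps_start * decay ** k)

def select_action(q_row, k, minimise=True):
    """epsilon-greedy over one Q-table row: explore with probability epsilon(k),
    otherwise exploit (the sensor takes the lowest value, the attacker the highest)."""
    if rng.random() < epsilon_at(k):
        return int(rng.integers(len(q_row)))     # exploration: try a random action
    return int(np.argmin(q_row) if minimise else np.argmax(q_row))

print(select_action(np.array([2.0, 1.5]), k=0))  # always explores at k = 0, since epsilon = 1
```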
The action selection strategy and the Q-table update method differ between the three perspectives, which are addressed in detail in the following subsections.
4.1. Optimal Policy for the Sensor
From the perspective of the sensor, the sensor selects actions according to the ε-greedy strategy, while the attacker selects actions with a uniform random strategy. As the purpose of the sensor is to reduce the estimation error covariance as much as possible to maintain system stability, it selects the action with the lowest Q-value with the 1−ε probability of the ε-greedy strategy. Since the design of the reward function includes the influence of the attack behavior on the system estimation error, the attacker's action selection strategy also needs to be specified: the attacker selects its action with a uniform random strategy.
Based on Q-learning, the update rule of the Q-table is as follows:
$$Q(s, a^s) \leftarrow Q(s, a^s) + \alpha \big[ r + \rho \min_{a^{s'}} Q(s', a^{s'}) - Q(s, a^s) \big],$$
where the action used to update the Q-table is selected by the greedy strategy, i.e., the action with the lowest Q-value under the state s'. This action is only used to update the Q-table and will not actually be executed.
Based on SARSA, the update rule of the Q-table is as follows:
$$Q(s, a^s) \leftarrow Q(s, a^s) + \alpha \big[ r + \rho\, Q(s', a^{s'}) - Q(s, a^s) \big],$$
which indicates that the agent's strategy for selecting actions is the same as that for updating the Q-table: the action a^{s'} is the one chosen by the ε-greedy strategy at state s'.
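The two updates for the sensor's Q-table can be sketched as follows; variable names and hyper-parameter values are illustrative, and the min operator reflects the sensor's preference for the lowest Q-values.

```python
import numpy as np

alpha, rho = 0.1, 0.9   # assumed learning rate and discount factor

def q_learning_update(Q, s, a, r, s_next):
    """Off-policy target: bootstrap on the lowest Q-value in s_next; that greedy
    action is only used for the update and is not necessarily executed."""
    Q[s, a] += alpha * (r + rho * np.min(Q[s_next]) - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy target: bootstrap on the action a_next that the epsilon-greedy
    policy actually selects and will execute at the next time step."""
    Q[s, a] += alpha * (r + rho * Q[s_next, a_next] - Q[s, a])
```

The only difference between the two sketches is the bootstrap target: the greedy minimum for Q-learning versus the action actually executed for SARSA.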
The algorithm flow is conducted according to Figure 3 once the action selection strategy and the update rule of the Q-table are determined. The Q-table converges after a sufficient number of time steps. According to the converged Q-table, the sensor's optimal policy is to choose, in each state, the sensor action with the lowest value.
4.2. Optimal Policy for the Attacker
In contrast to the sensor, the attacker's goal is to maximize the estimation error covariance in order to degrade the system performance. Based on the ε-greedy strategy, the attacker selects an action randomly with a probability of ε, and selects the action with the highest Q-value with a probability of 1−ε. Meanwhile, the sensor selects actions according to the uniform random strategy.
Based on Q-learning, the update rule of the Q-table is as follows:
$$Q(s, a^a) \leftarrow Q(s, a^a) + \alpha \big[ r + \rho \max_{a^{a'}} Q(s', a^{a'}) - Q(s, a^a) \big],$$
where the action used to update the Q-table is the action with the highest Q-value under the state s'.
Based on SARSA, the update rule of the Q-table is as follows:
$$Q(s, a^a) \leftarrow Q(s, a^a) + \alpha \big[ r + \rho\, Q(s', a^{a'}) - Q(s, a^a) \big],$$
where the action a^{a'} corresponding to the next state s' used to update the Q-table is the action selected by the ε-greedy strategy, and it will be executed at the next time step.
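For completeness, the attacker-side updates are symmetric to the sensor's, with the highest Q-value replacing the lowest in the bootstrap target; a brief sketch under the same illustrative naming:

```python
import numpy as np

alpha, rho = 0.1, 0.9   # assumed learning rate and discount factor

def q_learning_update_attacker(Q, s, a, r, s_next):
    """Off-policy target: bootstrap on the highest Q-value under s_next."""
    Q[s, a] += alpha * (r + rho * np.max(Q[s_next]) - Q[s, a])

def sarsa_update_attacker(Q, s, a, r, s_next, a_next):
    """On-policy target: a_next is chosen by epsilon-greedy and executed next step."""
    Q[s, a] += alpha * (r + rho * Q[s_next, a_next] - Q[s, a])
```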
For the attacker, the RL algorithm is run according to Figure 3 with the sensor strategy set as uniformly random. After sufficient iterative learning, the Q-table converges and becomes relatively stable. The optimal policy for the attacker is to choose, in each state, the action with the highest value.
4.3. Nash Equilibrium Policy of the Sensor and Attacker
Under the framework of the two-player zero-sum game, the sensor and the attacker select actions according to a modified ε-greedy strategy that combines the uniform random rule with the max-min and min-max criteria, which is addressed in detail as follows.
For the sensor, it selects an unknown action randomly with probability ε. With the remaining probability 1−ε, it first needs to consider the constraint in Remark 1: when that constraint is active, choosing the channel in “state 1” is the only choice for the sensor. In other cases, the sensor selects its action based on the max-min criterion: for each possible sensor action, the attacker is assumed to choose the action with the highest value, which represents the worst situation for the sensor; among these worst cases, the sensor then selects the action with the relatively lowest value, thus achieving the best result from the worst case for the sensor.
From the perspective of the attacker, there are no such constraints. The attacker selects an unknown action with probability ε and selects its action according to the min-max criterion with probability 1−ε: for each possible attacker action, the sensor is assumed to choose the action with the lowest value, which represents the worst situation for the attacker; among these worst cases, the attacker then selects the action with the relatively highest value, thus achieving the best result from the worst case for the attacker.
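The two criteria can be read directly off a per-state matrix M of Q-values, with sensor actions as rows and attacker actions as columns; that layout is an assumption used only for illustration.

```python
import numpy as np

def max_min_sensor_action(M):
    """Sensor's max-min choice: for each sensor action (row), assume the attacker
    picks the column with the highest value, then take the row whose worst case
    is lowest."""
    worst_per_row = M.max(axis=1)
    return int(np.argmin(worst_per_row))

def min_max_attacker_action(M):
    """Attacker's min-max choice: for each attacker action (column), assume the
    sensor picks the row with the lowest value, then take the column whose worst
    case is highest."""
    worst_per_col = M.min(axis=0)
    return int(np.argmax(worst_per_col))

# Example: M[a_s, a_a] holds the Q-values of one state
M = np.array([[3.0, 5.0],
              [4.0, 4.5]])
print(max_min_sensor_action(M), min_max_attacker_action(M))   # prints "1 1" for this matrix
```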
Based on Q-learning, the update rule of the Q-table is as follows:
$$Q(s, a^s, a^a) \leftarrow Q(s, a^s, a^a) + \alpha \big[ r + \rho\, Q(s', \bar{a}^{s}, \bar{a}^{a}) - Q(s, a^s, a^a) \big],$$
where the action pair (ā^s, ā^a) used to update the Q-table is selected according to the max-min criterion under the state s'.
Based on SARSA, the update rule of the Q-table is as follows:
$$Q(s, a^s, a^a) \leftarrow Q(s, a^s, a^a) + \alpha \big[ r + \rho\, Q(s', a^{s'}, a^{a'}) - Q(s, a^s, a^a) \big],$$
where the action pair (a^{s'}, a^{a'}) used to update the Q-table comes from the action selection process and will be executed at the next time step.
Each row of the Q-table can be converted into a strategy matrix in which the action sets of the sensor and the attacker constitute the rows and columns of the matrix, respectively. When the Q-table converges, the Nash equilibrium is obtained if the sensor's action given by the max-min criterion and the attacker's action given by the min-max criterion match. Algorithm 1, displayed after the sketch below, describes the complete procedure for finding the Nash equilibrium policy.
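A sketch of this equilibrium check: one converged Q-table row is reshaped into the strategy matrix and the max-min and min-max choices are compared; the reshape convention and the action-set sizes are assumptions.

```python
import numpy as np

def nash_from_q_row(q_row, n_sensor=2, n_attacker=2):
    """Reshape one converged Q-table row into the strategy matrix (sensor actions
    as rows, attacker actions as columns) and test whether the max-min and
    min-max choices match, i.e. whether a pure-strategy Nash equilibrium exists."""
    M = np.asarray(q_row).reshape(n_sensor, n_attacker)
    a_s = int(np.argmin(M.max(axis=1)))     # sensor: best of the worst cases
    a_a = int(np.argmax(M.min(axis=0)))     # attacker: best of the worst cases
    matched = np.isclose(M.max(axis=1).min(), M.min(axis=0).max())
    return (a_s, a_a) if matched else None

print(nash_from_q_row([3.0, 5.0, 4.0, 4.5]))   # -> (1, 1) for this example row
```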
Algorithm 1 RL method to find the Nash equilibrium policy
Input: CPS parameters A, C; costs c and c_d; learning rate α, discount factor ρ and exploration rate ε.
Output: Converged Q-table, Nash equilibrium policy.
Initialize: Initial state s, Q-table, time step k.
1: while the stopping condition is not satisfied do
2:   if a random number drawn uniformly from [0, 1] is less than ε then
3:     Choose actions randomly;
4:   else
5:     Find the optimal actions according to the max-min and min-max criteria.
6:   end if
7:   Observe the reward by (15).
8:   Observe the next state according to (14).
9:   Update the Q-table according to (20) or (21).
10:  Update the state: s ← s'.
11:  Update the time step: k ← k + 1.
12: end while
13: Return the final Q-table.
14: Observe the Nash equilibrium policy.
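For reference, a compact Python rendering of Algorithm 1 under the same illustrative assumptions used above; the placeholder `step` function and the joint Q-table layout stand in for the paper's Equations (14), (15), (20) and (21) and are not the exact formulas.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 5
alpha, rho, eps, T = 0.1, 0.9, 0.3, 20_000   # assumed hyper-parameters

def step(s, a_s, a_a):
    """Placeholder for the transition rule (14) and reward (15): the packet is
    assumed lost only when the attacker jams the channel chosen by the sensor."""
    s_next = 0 if not (a_a == 1 and a_s == 0) else min(s + 1, n_states - 1)
    r = float(s_next) + 1.0 * a_s - 1.5 * a_a   # stand-in for Tr(P) plus costs
    return r, s_next

Q = np.zeros((n_states, 2, 2))   # joint table Q[s, a_s, a_a]
s, k = 0, 0

while k < T:
    if rng.random() < eps:                         # steps 2-3: explore
        a_s, a_a = int(rng.integers(2)), int(rng.integers(2))
    else:                                          # step 5: max-min / min-max actions
        a_s = int(np.argmin(Q[s].max(axis=1)))
        a_a = int(np.argmax(Q[s].min(axis=0)))
    r, s_next = step(s, a_s, a_a)                  # steps 7-8: reward and next state
    # step 9: Q-learning-style update bootstrapping on the max-min value at s_next;
    # the SARSA variant would bootstrap on the pair actually executed next.
    target = Q[s_next].max(axis=1).min()
    Q[s, a_s, a_a] += alpha * (r + rho * target - Q[s, a_s, a_a])
    s, k = s_next, k + 1                           # steps 10-11: update state and time step

# Step 14: read the equilibrium actions for each state off the converged Q-table.
for i in range(n_states):
    print(i, int(np.argmin(Q[i].max(axis=1))), int(np.argmax(Q[i].min(axis=0))))
```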