1. Introduction
In recent years, high-level AI systems such as AlphaGo, AlphaStar, Pluribus, and Suphx have defeated human players in human–machine confrontation scenarios [1], and significant breakthroughs have been made in the development of intelligent games. On the attack side, automated and intelligent attack tools have kept emerging, such as CyberBattleSim [2], NetworkAttackSimulator [3], CybORG [4], CyGIL [5], and CALDERA [6], which use agents to automatically dispatch attack payloads and plan attack paths, thereby raising the intelligence level of network attack capabilities. These automated attack platforms have greatly lowered the technical threshold of launching a network attack, attack techniques based on artificial intelligence are more concealed, and the game in cyberspace has entered the stage of intelligent confrontation. By adopting automated attack platforms and intelligent means, attackers can carry out high-intensity, highly concealed, and highly destructive attacks at a lower cost. The cybersecurity situation is becoming increasingly severe, and the traditional defense system can no longer meet current security requirements. Furthermore, there are inevitable shortcomings when humans confront automatic and intelligent machines, because humans have physical and psychological limits and are more susceptible to external factors.
Similarity, determinism, and static configuration are inherent security defects of network systems, which cannot withstand long-term observation, analysis, and repeated attacks. To reverse the passive situation of network defense, moving target defense (MTD) [7] has emerged. MTD introduces “noise” into the system and increases the entropy of the system configuration. Thus, MTD can resist repeated attacks of the same type and greatly decreases the success rate of attacks [8]. How to use MTD technology efficiently and enhance the defense effectiveness of target systems has become an important research hotspot. However, MTD technology also has weaknesses. Winterrose et al. [9] proposed adaptive attacker strategies against temporal platform migration defense. Therefore, it is urgent to study an adversarial decision-making strategy in intelligent game scenarios to provide technical support for reversing the cyberspace security situation. Fast, automatic, and intelligent defense will be needed in the network defense field in the future. Intelligent game technology can extend the human brain and counter intelligent attacks, improving defense decision-making ability and adapting to the high-speed, complex, and changeable network attack and defense environment. Virtual defense agents can be used to assist defenders in making decisions.
The antagonistic goals, non-cooperative relationship, and strategy interdependence in the process of network attack and defense conform to the basic characteristics of game theory [10,11]. Confronted with intelligent attacks, network attack and defense can be modeled as a multi-agent game, thus providing auxiliary decision support for defenders. The state transitions of network attack and defense can be characterized by a Markov Decision Process, and a Markov game regards network attack and defense as a multi-stage process with continuous, random, and dynamic interaction.
Multi-agent reinforcement learning (MARL) [12] is an extension of reinforcement learning (RL) to the multi-agent domain. MARL is widely used in multi-agent collaboration, for example MADDPG [13] and MAPPO [14]. In this work, motivated by the confrontation between network attack and defense, we apply MARL to multi-agent games. At present, several algorithms are used in multi-agent games, such as Minimax-Q [15], Nash-Q [16], and Bayesian Strong Stackelberg (BSS) [17]. These algorithms can find the equilibrium solution of the game; however, a shared limitation is that they cannot dynamically adjust their strategies according to the game process. WoLF-PHC [18] can adjust the strategy dynamically, but its performance is not high enough to find the optimal strategy for the defense agent.
In summary, the main challenges of adversarial decision-making for cyber defense are as follows:
Challenge 1: traditional defense methods are not enough to resist automatic and intelligent attacks, so it is necessary to construct a simulation environment of multi-agent games and provide assistant decision support for defenders.
Challenge 2: the defense strategy obtained by mainstream algorithms is fixed, and it cannot be adjusted adaptively and continuously during the dynamic game process.
Challenge 3: the reinforcement learning algorithms used in the game should not only possess both convergence and rationality but also obtain the maximum benefit. However, it is difficult to satisfy the convergence condition of the Nash-Q algorithm, and the URS-Q algorithm does not have good rationality.
In this paper, a multi-stage Markov game based on MARL is constructed, and an optimal decision-making approach to moving target defense called WoLF-BSS-Q is proposed. The main idea of the WoLF-BSS-Q algorithm is that the defense agent selects the optimal defense strategy in the multi-agent game and continuously and dynamically updates it by observing the environment. A series of experiments was performed to validate the efficiency and effectiveness of the proposed method. The main contributions are as follows:
According to the continuity, randomness, and dynamics of cyber attack and defense, the multi-agent Markov game (MAMG) model was constructed and its equilibrium points were solved using reinforcement learning.
The proposed WoLF-BSS-Q algorithm selects the adversarial strategy from two dimensions: the micro-selection level and the macro-adjustment level. Firstly, the BSS algorithm is used to select the MTD strategy; then, the WoLF algorithm is used to dynamically adjust the strategy by observing the game process; finally, the UCB algorithm is used to choose the final strategy with the largest Q value.
The adversarial strategy of the defense agent obtained by the WoLF-BSS-Q algorithm is better than those obtained by Nash-Q and URS-Q. Furthermore, the convergence and rationality of the WoLF-BSS-Q algorithm are both excellent. From the perspective of decision entropy, the selection probability of actions can be gradually optimized using WoLF-BSS-Q.
The proposed WoLF-BSS-Q approach combines the BSS, WoLF, UCB, and constrained Q-learning algorithms. It makes use of the information intentionally or unintentionally exposed by the attacker and is expected to resist automatic and intelligent RL-based attacks in cyberspace, providing assistant decision support for the defender. Meanwhile, the optimal defense strategy generated by the defense agent is allocated to each dynamic defense node, restraining the attack agent from continuing to carry out high-intensity, highly concealed, and highly destructive attacks.
The rest of the paper is organized as follows. In Section 2, we review the related work on Markov games and MARL algorithms. In Section 3, we present the model of multi-agent Markov games and the main idea of the proposed WoLF-BSS-Q algorithm. In Section 4, a series of experiments is carried out to examine the effectiveness of WoLF-BSS-Q, and the experimental results are analyzed. In Section 5, we summarize the paper. All the frequent abbreviations used in this work are listed in Table A1.
2. Related Work
The cyberspace game has developed to the stage of intelligent confrontation. In recent years, cyber attacks have become increasingly automatic and intelligent. Attackers can make full use of artificial intelligence technologies such as reinforcement learning and imitation learning to make attack decisions and deployments, and then launch high-intensity, highly concealed, and highly destructive cyber attacks with automated and intelligent attack tools. Chen et al. [19] introduce expert knowledge to guide the attack agent to make better decisions in RL-based automated attacks. Li et al. [20] construct an improved network graph model and incorporate social engineering into automated penetration testing (PT) platforms. Intelligent attack simulators [2,3,4,5,6] can construct cyber attack models, utilize attack payloads, and plan attack paths automatically, aiming at improving the intelligence level of cyber attack capabilities.
With the increasing pace and complexity of cyber attack and defense, there are inevitable shortcomings when humans confront automated and intelligent machines, mainly because of their physical and psychological limitations; furthermore, humans are more susceptible to external factors. In addition, it takes a great deal of time and cost to train an expert, and an expert's successful experience is not easy to replicate widely. Therefore, it is urgent to study adversarial decision-making approaches for MTD to realize intelligent cyber defense. However, research on proactive defense against intelligent attacks is still in its infancy. Walter et al. [21] incorporate deception into CyberBattleSim for autonomous defense and deploy honeypots to confuse the attack agent. Fabio et al. [22] model PT with RL using capture-the-flag challenges and introduce port hopping on a single host. The RL-based simulation environment CybORG [4] provides modeling methods for attackers, defenders, and arbitrators; it trains in a virtual network environment and tries to run in a real network simulation environment with the same configuration through an API interface. Although the modeling of CybORG is complete, it does not focus on intelligent defense. In a word, the defense strategies against intelligent attacks remain at a static defense level, and the design of the defense algorithms is relatively simple.
Adversarial intelligent game technology is the key to realizing intelligent defense. It mainly depends on the combination of game theory and deep reinforcement learning [23]. Game theory provides equilibrium solutions to describe the learning results of multi-agent systems; however, it has mainly been developed in theory and is rarely applied to practical problems. Reinforcement learning provides convergent learning algorithms for training agents, which can achieve rational and convergent results in sequential decision-making. A defense agent uses adversarial intelligent game technology to plan tasks and make real-time decisions. Markov games can be used to model adversarial intelligent games. Furthermore, a Markov game combines game theory with the Markov Decision Process (MDP) [15], which is suitable for describing the state transition process of a system with multi-state and stochastic characteristics. Liu et al. [24] adopt a recurrent neural network to solve the game equilibrium under partial rationality and incomplete information. Lei et al. [11] proposed an optimal MTD strategy for the Markov game to trade off defensive profit and network service quality.
The application of MARL to game scenarios is a new research hotspot. In a MARL game process, the agent aims to learn the reward function of each state through interaction with the environment and then learn the optimal strategy. In general, it is necessary to approximate the action-value function by using Q-learning. Marc et al. [25] proposed a unified game-theoretic method for MARL; they found that policies learned using independent reinforcement learning overfit the other agents' policies during training and cannot fully generalize during execution. In a MARL game scenario, an agent cannot directly change the observation results of the opponent agent. Gleave et al. [26] argue that the observations of the opponent agent can be influenced by choosing adversarial strategies in multi-agent environments; their experiments show that actions can naturally change the observations of opponents by training attack strategies, thus realizing the effect of black-box attacks. Aravind et al. [27] develop a game framework for model-based RL and find that a near-optimal policy for the environment can be obtained by finding an approximate equilibrium of the game. Zhang et al. [28] analyze the sample complexity of model-based MARL in Markov games. Xuan et al. [29] focus on UAV swarm attack–defense confrontation based on the MADDPG algorithm.
In competitive tasks, the main purpose of MARL is to make agents gain as many benefits as possible in competition with rivals. Littman [15] proposed the Minimax-Q reinforcement learning algorithm for two-player zero-sum Markov games. Minimax-Q solves the minimax strategy of the game in each state to guide the agent's action selection. Zhu et al. [30] proposed a minimax Q network learning algorithm to train the network with observations and combined game theory, dynamic programming, and deep reinforcement learning to solve the Nash equilibrium of Markov games. Nash Q-learning (Nash-Q) [16] can be used in multi-player general-sum games; it solves the Nash equilibrium point by quadratic programming. The convergence condition of Nash-Q is that a global optimal point or saddle point can be found in each stage game. Sailik et al. [17] proposed the Bayesian Stackelberg Markov game model and the Bayesian Strong Stackelberg Q-learning (BSS-Q) algorithm to better model the continuity and incomplete information of cyber attack and defense; their experimental results show that the BSS-Q algorithm applies to multi-agent games and is superior to Nash-Q. A shared problem of Minimax-Q, Nash-Q, and BSS-Q is that the defense agent cannot dynamically update the defense strategy by observing the environment. The Policy Hill-Climbing (PHC) algorithm is a gradient descent algorithm for mixed strategies; however, the PHC strategy does not always converge. The WoLF-PHC algorithm [18] improves on PHC by incorporating the WoLF (Win or Learn Fast) mechanism: two learning rates are required, and the learning rate used for updating the strategy depends on whether the agent is currently winning ($\delta_w$) or losing ($\delta_l$). Yang et al. [31] introduce the WoLF-PHC algorithm into online RL to analyze bounded rational Markov games. Although the WoLF-PHC algorithm can update the strategy dynamically, its adjustment range is small and its performance is not satisfactory.
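To make the WoLF mechanism described above concrete, the following is a minimal, self-contained Python sketch of a WoLF-PHC update for one state. The variable names and default learning rates are illustrative assumptions, not the settings used in any of the cited works; the sketch only shows how the learning rate switches between $\delta_w$ and $\delta_l$ by comparing the expected value of the current policy with that of the average policy.

```python
import numpy as np

def wolf_phc_update(Q, pi, pi_avg, s, a, reward, s_next, counts,
                    alpha=0.1, gamma=0.9, delta_w=0.01, delta_l=0.04):
    """One WoLF-PHC step for the agent's own tables (illustrative sketch)."""
    n_actions = Q.shape[1]

    # Standard Q-learning update for the executed action.
    Q[s, a] += alpha * (reward + gamma * np.max(Q[s_next]) - Q[s, a])

    # Update the running average policy estimate for state s.
    counts[s] += 1
    pi_avg[s] += (pi[s] - pi_avg[s]) / counts[s]

    # "Win or Learn Fast": learn slowly when winning, fast when losing,
    # judged by comparing expected values under current vs. average policy.
    winning = np.dot(pi[s], Q[s]) > np.dot(pi_avg[s], Q[s])
    delta = delta_w if winning else delta_l

    # Hill-climb the mixed strategy toward the greedy action.
    greedy = np.argmax(Q[s])
    for b in range(n_actions):
        pi[s, b] += delta if b == greedy else -delta / (n_actions - 1)
    pi[s] = np.clip(pi[s], 0.0, None)
    pi[s] /= pi[s].sum()  # renormalize to a valid probability distribution
    return Q, pi, pi_avg
```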
In conclusion, research on intelligent games based on MARL is booming. However, it is mainly applied to real-time strategy games, UAV swarm operations [29], and other fields, and research on its application in the field of cyber attack and defense is still in its infancy. At present, cyber defense remains at the level of proactive defense and adaptive defense, and the current defense capability is fragile. Therefore, it is necessary to construct a multi-agent game model of cyber attack and defense. Note that most existing intelligent attack platforms are based on RL algorithms; RL is easily influenced by adversarial disturbance, which stems from the instability of RL training. At present, the algorithms used in MARL games either cannot dynamically adjust the strategy according to the game process, or their adjustment range is too small to obtain the optimal strategy. In addition, the algorithm used in the game needs to have both convergence and rationality. Therefore, to adapt to the highly dynamic and complex cyber attack and defense environment, it is worth designing an adversarial decision-making approach for the defense agent in MARL games.
3. Methodology
In this paper, a multi-agent Markov game in cyberspace is constructed according to the Cyber Kill Chain model. Both the attacker and the defender are agents using reinforcement learning algorithms, and both sides make decisions by observing the environment. Multi-agent games rely on reinforcement learning algorithms to solve game equilibrium points.
Section 3.1 explains the assumptions in the method.
Section 3.2 models the multi-agent Markov games.
Section 3.3 introduces the proposed adversarial decision-making strategy in multi-agent Markov games, called the WoLF-BSS-Q algorithm.
3.1. Assumptions of Multi-Agent Markov Games
A Markov game between attack agents and defense agents in cyberspace is constructed. Furthermore, a relatively rational attacker is considered and the following assumptions are made.
It is assumed that the attack agent is rational and does not cheat or feint.
It is assumed that no third-party factors interfere with the environment.
It is assumed that the attack agent does not know that the defense agent adopts proactive defense technology and therefore does not devise corresponding countermeasures.
3.2. RL Formalization for Markov Game Theory
The goal of an agent $i$ is to find an optimal strategy that maximizes its long-term reward in any state $s$ and at any time step $t$ in a Markov game:

$$V_i^{\pi}(s) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{i,t+k} \,\middle|\, s_t = s\right],$$

where $V_i^{\pi}(s)$ refers to the state-value function of agent $i$, $\mathbb{E}$ represents the expected value, $k$ represents the future time step, $\gamma$ is the discount factor, and $r_{i,t+k}$ is the reward of agent $i$ at time step $t+k$. However, the state and reward functions of the multi-agent environment are affected by the joint actions of all agents. Agent $i$ cannot unilaterally maximize its long-term reward sum and its state-value function $V_i^{\pi}(s)$. Therefore, the strategies of the other agents must also be taken into account. The strategy of agent $i$ is generally represented as $\pi_i$. Let the joint strategy of all agents be $\pi = (\pi_i, \pi_{-i})$; the state-value function $V_i^{\pi}(S)$ then refers to the sum of the long-term rewards of agent $i$ in any state $S$ under the joint strategy. The action-value function $Q_i^{\pi}(S, \vec{a})$ is the long-term reward that agent $i$ can obtain when all agents execute the joint action $\vec{a}$ in state $S$ and follow the joint strategy $\pi$ thereafter.
To select the optimal strategy in Markov games, the concept of equilibrium solutions in game theory has been successfully introduced into MARL, forming many equilibrium-based MARL algorithms. The learning process [12] can be summarized as follows: (1) establish a general game on the current state $S$, taking the action-value function $Q_i(S, \vec{a})$ of each joint action $\vec{a}$ in this state as the reward value of agent $i$ for that joint action; (2) solve the equilibrium of the general game and select the action that each agent needs to perform in state $S$ according to the equilibrium solution; (3) execute the selected actions, obtain the rewards from the environment, and update the value function $V_i(S)$ of each agent $i$ according to the rewards and the equilibrium solution on state $S$. All the joint actions of the agents in state $S$ are represented as $A(S) = A_1(S) \times \cdots \times A_n(S)$, and the tuple $\langle n, A_1(S), \ldots, A_n(S), Q_1(S,\cdot), \ldots, Q_n(S,\cdot) \rangle$ is the general game in state $S$, where $Q_i(S,\cdot)$ corresponds to the reward function of agent $i$. Therefore, RL can be organically combined with the game theory paradigm. The mapping relationship between game theory and MARL is shown in Table 1.
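As a concrete instance of steps (1)–(3), the sketch below shows one equilibrium-based update for the defense agent in the two-player zero-sum setting of Minimax-Q cited above. It is an illustrative simplification that uses the pure-strategy maximin value of the stage game instead of solving a mixed-strategy linear program, and all names are assumptions rather than part of the paper's implementation.

```python
import numpy as np

def minimax_q_step(Q_def, V_def, s, s_next, a_d, a_a, r_def, alpha=0.1, gamma=0.9):
    """One equilibrium-based update (Minimax-Q flavor) for the defense agent (sketch).

    Q_def[s] is an |A_D| x |A_A| payoff matrix, i.e., the stage game in state s;
    V_def[s] stores the equilibrium value of that stage game.
    """
    # (1)+(2) The stage game in the next state is Q_def[s_next]; its security level
    # (maximin over pure strategies, for simplicity) is taken as the equilibrium value.
    V_def[s_next] = np.max(np.min(Q_def[s_next], axis=1))

    # (3) Update the action value of the executed joint action using the reward
    # and the equilibrium value of the next state.
    target = r_def + gamma * V_def[s_next]
    Q_def[s][a_d, a_a] += alpha * (target - Q_def[s][a_d, a_a])
    return Q_def, V_def
```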
Multi-stage Markov games are a widely used game model. They can describe the randomness, dynamics, and continuity of network attack and defense processes. In addition, Markov games can model multi-agent interactions in sequential decision-making problems. In the game process, an agent can reason about the policies of cooperative agents or opponent agents. Furthermore, the agents adopt equilibrium-based policies, from which no agent can benefit by unilaterally deviating. When all participants in the game are agents, it is called a multi-agent Markov game (MAMG). An MDP can be used to represent multi-agent Markov games, as shown in Figure 1.
A multi-agent Markov game can be represented by the tuple $\langle (D, A), S, (A_D, A_A), T, (R_D, R_A), \gamma \rangle$, where:
$D$ is the defense agent and $A$ is the attack agent;
$S = \{s_1, s_2, \ldots, s_n\}$ are the states of the game;
$A_D(s)$ and $A_A(s)$ represent the action sets available to the defense agent and the attack agent in state $s$, respectively;
$T(s, a_D, a_A, s')$ denotes the probability of transferring to state $s'$ from state $s$ when $D$ chooses $a_D$ and $A$ chooses $a_A$;
$R_D$ and $R_A$ represent the rewards of the defense agent and the attack agent, respectively;
$\gamma$ is the discount factor, $\gamma \in [0, 1)$.
The feature of MAMGs is that the attack agent and the defense agent are in the same environment, but each has its own observation space and action space. The reward values of both sides cannot be directly preset; they are influenced by the opponent's strategy and are obtained at the game equilibrium.
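The following is a minimal Python sketch of how this tuple could be represented in a simulation. The class and field names are illustrative assumptions, not the data structures of the original implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = str
Action = str

@dataclass
class MultiAgentMarkovGame:
    """Tuple <(D, A), S, (A_D, A_A), T, (R_D, R_A), gamma> (illustrative sketch)."""
    states: List[State]                                   # S
    defender_actions: Dict[State, List[Action]]           # A_D(s)
    attacker_actions: Dict[State, List[Action]]           # A_A(s)
    # T(s, a_D, a_A) -> probability distribution over next states
    transition: Callable[[State, Action, Action], Dict[State, float]]
    # R_D(s, a_D, a_A) and R_A(s, a_D, a_A)
    defender_reward: Callable[[State, Action, Action], float]
    attacker_reward: Callable[[State, Action, Action], float]
    gamma: float = 0.9                                     # discount factor in [0, 1)
```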
3.3. WoLF-BSS-Q Algorithm in Multi-Agent Markov Games
3.3.1. The Basic Idea of WoLF-BSS-Q
Equilibrium-based multi-agent reinforcement learning (MARL) takes the Markov game as the learning model and introduces the concept of an equilibrium solution from game theory to define the optimal strategy. The defense agent aims to obtain an optimal strategy by interacting with the environment under the condition of incomplete information about the opponent. During the defense process, the agent interacts with the environment and updates its incomplete information about the opponent, resulting in a Bayesian-style update after each interaction [17]. However, RL algorithms usually require a large amount of interaction with real-world opponents. Therefore, we simulate a multi-agent game environment of moving target defense in which an attack agent and a defense agent are both simulated. The defense agent selects the best strategy based on its learned knowledge, adjusts its strategy by observing the game process, and then obtains a reward value when the game equilibrium is reached.
RL algorithms for MAMGs are considered to learn an adversarial decision-making strategy for the defense agent. Our proposed WoLF-BSS-Q algorithm selects the optimal strategy from two dimensions: a micro-selection level and a macro-dynamic-adjustment level. On the one hand, given the leader–follower paradigm, the BSS algorithm obtains the best strategy at the micro level; on the other hand, the WoLF mechanism dynamically adjusts the parameters of the game process by observing the game process at the macro level. Introducing the WoLF mechanism requires two learning rates, $\delta_l$ and $\delta_w$ ($\delta_l > \delta_w$). The choice of learning rate for updating the strategy depends on whether the agent is currently winning ($\delta_w$) or losing ($\delta_l$), which is decided by comparing the expected value with the current Q-value estimates. We introduce the WoLF mechanism to improve the performance of the BSS algorithm; thus, the WoLF-BSS-Q algorithm is proposed for MAMGs. The main idea of the WoLF-BSS-Q algorithm is shown in Figure 2. To prevent the overestimation problem of Q-learning, conservative Q-learning [32] was proposed; based on the idea of that method, we propose constrained Q-learning.
Definition 1. Constrained Q-learning takes the average Q value as the lower limit of the Q-learning update to prevent the overestimation problem of Q-learning.
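A minimal sketch of one possible reading of Definition 1 is given below, assuming a tabular Q-function: the average Q value of the next state is used as a conservative (lower-limit) anchor for the bootstrap target, so the optimistic max operator cannot inflate the estimate. The exact constrained update used in Algorithm 1 may differ; this is only an illustration.

```python
import numpy as np

def constrained_q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """One possible reading of Definition 1: bootstrap from a conservative value
    anchored at the average Q of the next state rather than its maximum."""
    avg_q_next = np.mean(Q[s_next])          # average Q value: conservative lower-limit anchor
    target = reward + gamma * avg_q_next     # constrained target, resists overestimation
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```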
Firstly, the BSS algorithm is used to select the best strategy, and then the WoLF algorithm is used to dynamically adjust the learning rate by observing the reward value of the defense agent. Because the $\epsilon$-greedy selection strategy is too random, some optimal actions may be assigned a very low selection probability due to initialization. To solve this problem, the Upper Confidence Bound (UCB) algorithm is used to achieve a balance between exploration and exploitation by incorporating a confidence level, as shown in Equation (3):

$$A_t = \arg\max_{a}\left[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\right],$$

where $c$ is the control coefficient of exploration intensity and determines the confidence index, $c\sqrt{\ln t / N_t(a)}$ is a measure of the uncertainty about the action, and $N_t(a)$ is the number of times the action has been selected at time $t$. Thus, the UCB algorithm is used to choose the final optimal strategy with the largest Q value.
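The sketch below shows the Equation (3) selection rule in Python over a set of candidate actions; the function and variable names are illustrative assumptions.

```python
import math

def ucb_select(q_values, counts, t, c=2.0):
    """Pick the action maximizing Q_t(a) + c * sqrt(ln t / N_t(a)) (cf. Equation (3))."""
    best_action, best_score = None, float("-inf")
    for action, q in q_values.items():
        n = counts.get(action, 0)
        if n == 0:
            return action                      # try every action at least once
        score = q + c * math.sqrt(math.log(t) / n)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```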
The WoLF-BSS-Q algorithm is shown as Algorithm 1. In line 4, the probability of selecting an action in a given state S is determined. In line 6, the scenario is modeled as a mixed-integer linear program (MILP). Through interaction with the environment, the Q-values of the defense agent and the attack agent in state S are updated in lines 7 to 10, where lines 7 to 9 correspond to the constrained Q-learning update. Lines 11 to 22 are the optimization process of the WoLF algorithm, in which the strategy-update probability of the WoLF algorithm is computed. In line 23, a Markov game solver is used to calculate the bi-matrix game in state S of WoLF-BSS-Q, and the UCB algorithm is then adopted to find the final adversarial decision-making strategy. Finally, the strategy of the defense agent is optimized and updated when a new episode starts at line 4.
3.3.2. The Convergence and Rationality of WoLF-BSS-Q
There are two expected properties [18] for a learning agent when other learning agents are present, namely convergence and rationality.
Convergence: Convergence means that the current agent can learn and converge to a stable strategy when the other agents also use learning algorithms. The BSS-Q algorithm can converge to a strong Stackelberg equilibrium (SSE) [17] of a MAMG. As an improvement of the BSS-Q algorithm that introduces the WoLF mechanism, the WoLF-BSS-Q algorithm retains good convergence.
Rationality: Rationality means that when the opponent uses a constant strategy, the current agent can learn and converge to an optimal strategy relative to the opponent's strategy. The WoLF-BSS-Q algorithm has good rationality: first, the BSS-Q algorithm is used to select the best strategy, and then the WoLF algorithm is used to optimize it. Through this two-step scheme, the final optimal strategy with convergence can be chosen.
Algorithm 1: WoLF-BSS-Q in MAMGs
5. Conclusions and Future Work
An adversarial decision-making approach to moving target defense called WoLF-BSS-Q was proposed in this paper for multi-agent Markov games. The main idea of this method is as follows: firstly, the BSS algorithm is used to select the defense strategy; then, the WoLF algorithm is used to dynamically adjust the learning rate by observing the rewards of the defense agent; finally, the UCB algorithm is used to choose the final adversarial decision-making strategy with the largest Q value. To prevent the overestimation of Q-learning, we adopt constrained Q-learning to calculate the Q value. With this method, the defense agent can obtain the optimal strategy and continuously adjust it. The experiments showed that the defense agent should attach importance to short-term rewards in a real-time game between the attack agent and the defense agent, and that WoLF-BSS-Q obtains the highest rewards for the defense agent compared with the Nash-Q and URS-Q algorithms. From the perspective of decision entropy, the more equal the selection probabilities of the actions are, the greater the decision entropy is, and the less guidance it provides for the defense agent's decision-making. The proposed WoLF-BSS-Q approach adjusts the action selection probabilities by observing the game process, so the optimal action is selected with higher probability and the decision entropy gradually decreases.
In this paper, the attack agent does not cheat to consume defense resources, nor can it identify the dynamic defense strategy of the defense agent; that is, its "intelligence level" is not high enough. In future work, we will study an irrational opponent that can make misleading deceptions or feints. It is also worth studying multi-agent Markov games with incomplete information, which cause deviations in the observation space of the defense agents.