Article

Catch the Cyber Thief: A Multi-Dimensional Asymmetric Network Attack–Defense Game

by Wenhao Wang 1,2,†, Xingguo Chen 3,†, Yuwei Li 2 and Cheng Zhu 2,*
1 College of Electronic Engineering, National University of Defense Technology, Hefei 230031, China
2 Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha 410073, China
3 Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2024, 14(20), 9234; https://doi.org/10.3390/app14209234
Submission received: 13 August 2024 / Revised: 23 September 2024 / Accepted: 8 October 2024 / Published: 11 October 2024

Abstract:
This paper presents a novel multi-dimensional asymmetric game model for network attack–defense decision-making, called “Catch the Cyber Thief”. The model is built upon the concept of partially observable stochastic games (POSG) and is designed to systematically incorporate multi-dimensional asymmetry into network attack–defense problems. The attack agent is a “thief” who wants to control a key host by exploring an unfamiliar network environment, and the defense agent is a “police” who needs to catch the opponent before its goal is accomplished. The results indicate that the asymmetry of network attack and defense is influenced not only by attack and defense strategies but also by spatio-temporal factors such as the attacker’s initial position, network topology, and defense lag time. In addition, we have found that there may exist an “attack rhythm,” which makes “how to maintain a good attack rhythm” and “how to generate a robust defense strategy against different attackers” questions worth exploring. Compared with existing attack–defense game models, our game model can better generate a direct mapping relationship with real elements, enabling us to understand network attack and defense interactions better, recognize security risks, and design defense strategies that can directly serve real-world decision-making.

1. Introduction

Modeling a network attack–defense game that is close to the real world, especially for advanced persistent threats (APTs) and their defense [1,2,3], has always been an open and challenging issue. A good game model maps well to real elements in network attack–defense scenarios, which helps researchers gain a better understanding of complex interactions in cyberspace and enables them to make more timely and informed decisions. Furthermore, it can even facilitate the training of automated intelligent agents to assist security experts in their work.
However, this places very high requirements on the design of such a game model. Unlike real-world games such as chess, cards, soccer, and real-time strategy games, the network attack–defense game is a typical asymmetric game. The tasks, actions, observations, and even the applicable strategies of the attacker and defender are different. Figure 1 shows how the attacker and defender interact in a layered internal network environment.
In Figure 1, the attacker established an initial foothold on host 2 in the Demilitarized Zone (DMZ) service area and found that host 7 in the intranet workspace area could be attacked from host 2. After compromising host 7, the attacker scanned it for connectable hosts, found exploitable vulnerabilities on host 4 and host 10, and launched attacks on these hosts in sequence. During the attack, some of the attacker’s actions triggered alerts from security devices. However, because security devices are not completely reliable, security experts still need to manually investigate potentially compromised hosts, searching logs and other information to confirm the presence of the attack.
Based on the above figure, we can summarize the asymmetry between the attacker and defender into the following factors:
Task objectives. In terms of task objectives, network attack–defense tasks have certain similarities with the hide-and-seek game in the physical world, where the attacker needs to gain unauthorized access to a target host before being detected by the defender. However, in the physical world the environment has the same impact on both parties, whereas in cyberspace this is fundamentally different. For the attacker, the cyberspace environment is an unknown, non-cooperative network; completing an attack therefore requires continuous exploration and a high level of technical capability. For the defender, the environment is a semi-cooperative network. Although the defender understands the network topology and configuration and has installed security devices to capture suspicious traffic, it does not know which hosts have been compromised by the attacker. Therefore, the defender also needs to inspect additional hosts to confirm attacks.
Movement. Movement refers to how an agent moves (takes actions) using known information (observations). Unlike point-to-point movement in the physical world, the attacker’s “movement” is to select one of its controlled hosts as the initiating host of an attack action and then select one of the connectable hosts as the target host to attack. The attacker’s movement is therefore a “diffusion” movement. Since the internal network is semi-cooperative for the defender, it can often directly check the logs of hosts in the protected network through installed security devices, so its movement is a “direct” movement. In addition, the attacker’s observations come from feedback from the environment, while the defender’s observations come from alarms provided by security devices. If we assume that detection by the defender does not change the environment and that the attacker loses once the defender confirms the attack, then the attacker’s observations will not contain any information about the defender, which forces the attacker to rely solely on its own experience to execute its task.
Other spatio-temporal factors. Network attack–defense games obviously do not limit the scale of the network or the time for which both parties participate in the game. Since the attacker and defender cannot see the opponent’s status and actions, there is no guarantee that they act in a particular order. Even if the defender stays online 24 h a day, the time at which it executes its strategy will lag behind the attacker, because detection can only follow an attack that has already occurred.
The aforementioned factors jointly affect the modeling of a network attack–defense game. Game theory, as a mathematical framework for describing the cooperative and competitive relationships between two or more players, has the potential to describe the complex network attack and defense decision-making processes. We hope to leverage game theory to systematically model how the attacker and defender interact in cyberspace, where their opponents are invisible, which presents the following challenges.
The first challenge is how to design the core game mechanism. We can assume the attacker’s objective is to get the privilege of a key host. Once the attacker controls the host that stores the confidential information, it is difficult to prevent the attack even if the defender can obtain the evidence and trace the source afterward. Therefore, the key to the attack–defense game is how the defender discovers the existence of the attack before the attacker completes his task. How to abstract this indirect confrontation is the key issue that the model needs to solve.
The second challenge is how to design the modeling elements. Typical game elements include tasks, observations, actions, rewards, and the strategies that the agent can use. In the scenario of this paper, because the two sides have different control capabilities over the network environment, the biggest difference lies in the movement modes of the attacker and defender. How to take this difference into consideration when modeling the above game elements is the basis of the model design.
The third challenge is how to integrate real spatio-temporal elements. For the network attack–defense game in the real world, the attacker and defender are not at the same starting point. From the spatial dimension, the scale of network topology of both players may affect the game results, and the different movement methods may also affect the effectiveness of strategies. From the time dimension, the defender has a natural lag in the initial state, and the attacker can adjust its decision-making process by intermittently cleaning up traces. The integration of these spatio-temporal elements will make the game model closer to reality.
To overcome the above challenges, we propose a novel two-player zero-sum partially observable stochastic game called “Catch the Cyber Thief”. Specifically, the model has four typical characteristics: asymmetry in observation ability, asymmetry in the time dimension, asymmetry in the spatial dimension, and decision-making under uncertainty. The core of this game model is to highlight how the attacker and defender fight against an invisible opponent in cyberspace, connecting the attack and defense agents through the traces left by the attack agent and the distorted alarms that the defense agent may receive. In the proposed game, the attack agent is a “thief” who wants to get the administrator privilege of a certain key host in an unfamiliar network environment. The defender is the “police” who wants to catch the attacker, by searching the environment, before the attacker completes its task. In the experiments, we evaluate the performance of the attack and defense strategies under various conditions. Furthermore, the experimental results shed light on a previously mentioned but under-explored “phenomenon” that had not received in-depth discussion in prior research. The main contributions are as follows:
(1) The “Catch the Cyber Thief” game is a multi-dimensional asymmetric game model. Its asymmetry is affected not only by the agents’ strategies but also by many spatio-temporal factors, such as the initial position of the attack agent, the scale of the network topology, the number of times the attack agent cleans its traces, and the lag time of the defense agent. Experimental results indicate that a good defense strategy is not necessarily effective in all scenarios; it may even perform completely differently against different opponents in the same scenario or against the same opponent in different scenarios.
(2) The “Catch the Cyber Thief” game’s experimental results show the model’s credibility. The experimental statistics show that the winning rate of some defense strategies stays between 35% and 53%, which is basically consistent with the report that 62% of intrusion attacks were discovered only after completion (equivalent to a 38% defense winning rate). In addition, we have found that the attack agent may have an “attack rhythm,” which is very similar to the “low-and-slow” attack mode [1] in reality. The discovery of the “attack rhythm” also explains why the attack agent needs to remain stealthy during task execution.
(3) The “Catch the Cyber Thief” game reveals new issues in network attack and defense decision-making that are worthy of continued exploration. Experiments reveal that when the attack agent performs different numbers of trace-cleaning actions, its winning rate changes in a zigzag manner. Therefore, the first question is how to maintain a good attack rhythm. Considering that defense strategies perform differently in different experimental scenarios, the second question is how to generate a robust defense strategy against different attackers. These two new problems are both derived from the game model and have practical significance, which makes them worth continued analysis in subsequent research.
(4) The “Catch the Cyber Thief” game generates a direct mapping relationship with real elements. Another aspect of the credibility of this model is its ability to map to real-world elements, such as network topology, routing rules, vulnerabilities, credentials, and alarms. Thus, it can help security defenders better understand potential attack decision-making processes from observed alarm events, assess security risks, and design defense policies. This game model can also be used to conduct low-cost, large-scale experiments to train intelligent agents and feed this decision-making knowledge back into practical applications.
The rest of the paper is organized as follows: Section 2 analyzes the background and motivation in this paper. Section 3 presents the design of the proposed model, and Section 4 analyzes the strategies of the agents. Experimental evaluations are presented in Section 5. Section 6 introduces the related work. Section 7 concludes this work with some future work.

2. Background and Motivation

In previous research [4,5,6,7,8] on network attack–defense problems, researchers have often focused on stealthiness or partial observability, providing valuable references for our work. However, most studies assume that some elements of network attack–defense games are consistent between the two sides, which cannot comprehensively capture the multi-dimensional asymmetry of network attack–defense games. Section 6 provides a more detailed introduction to the related work [9].
To more fully reflect the asymmetry in network attack–defense problems, we select the partially observable stochastic game (POSG) approach to describe it, because the POSG does not limit the order of execution of the attacker and defender and assumes that neither side can observe the true state of the system. Overall, the network attack–defense problem considered in this paper can be characterized by the following features:
The asymmetry in observation ability. The biggest feature of the network intrusion defense problem is that both agents fight with an invisible opponent. The termination condition of the game is that the defense agent discovers the attack before the attack agent completes its task, and the attack agent needs to acquire the privilege of the key host before being discovered. Therefore, the attack agent only gets observations from the environment, while the defense agent receives observations triggered by the attacking actions, although these observations do not fully reflect the real actions of the attack agent.
The asymmetry in the time dimension. Time is another important factor affecting the game. Although the attack agent cannot see the opponent’s status, actions, and other information, it can reduce the risk of being detected by clearing attack traces rather than simply “moving” to the target host. Another important time factor is the defense lag time. Since the premise of defense detection is that the attack has already occurred, the defense agent is inherently lagging. The defense lag time directly affects the quality of the defense strategy.
The asymmetry in the spatial dimension. The biggest difference between the attacker and the defender comes from the difference in “movement mode,” which is due to their different control capabilities over the network environment. The attacker can obtain an initial stronghold in the target network by utilizing phishing emails, exploiting website vulnerabilities, and so on, and may “jump” from a non-fixed initial position to perform tasks, while the defender can directly detect hosts in the target network, which is fundamentally different from point-to-point movement in physical space. In addition, the scale of the network topology can also affect the results of network attack–defense games.
Decision-making under uncertainty. In the context of network intrusion defense, meeting the pre-conditions of an attack does not ensure successful execution due to the instability of attack tools and actions. For defenders, factors such as the missed alarm rate (e.g., network security devices analyzing traffic at specific sampling rates, absence of a comprehensive fingerprint library, etc.) and the false alarm rate (e.g., security detection equipment misidentifying normal behaviors as attack patterns, etc.) make it challenging to definitively ascertain the presence of an attack. Consequently, both the attacker’s and defender’s uncertainties must be considered in the defense strategy.
Based on the above analysis, the attack and defense agents are not in the same space–time environment, and both agents do not have a predetermined order and can make decisions at the same time or one after the other. In addition, both agents need to make decisions based on observations and rewards, but neither can obtain the complete real state of the environment, so a partially observable stochastic game method can be used to model this problem.

3. Game Model

This work proposes a novel two-player zero-sum network attack–defense game model based on the partially observable stochastic game (POSG), called “Catch the Cyber Thief”. In the proposed game, the attacker and defender interact indirectly in an internal network. The attacker is the “thief” who hopes to gain administrator privilege on a certain host, while the defender is the “police” who hopes to catch the attacker by searching the environment. Because the thief’s location is unknown, police in the real world need to track the thief’s footprints to design a route. In contrast, the defender in cyberspace can move directly to any desired host based on alarm information, for example through a remote desktop, and check whether the node is under attack until the task is accomplished or the game ends.
The game is defined by a tuple $\langle N, S, T, Z, AT = \{A^1, O^1, R^1, F\}, DT = \{A^2, O^2, R^2\}, \gamma \rangle$, where the following is true:
  • N: The term N is the number of agents. In the “Catch the Cyber Thief” game, the attack agent is denoted by superscript 1 and the defense agent by superscript 2.
  • S: The term S is the state space of the game, which includes states such as the available vulnerabilities and credentials on each host, the attacker’s privilege on each host, the number of attack actions executed on each host, the number of attacks that the defense agent can observe, and so on.
  • $AT = \{A^1, O^1, R^1, F\}$: The term $AT$ is the attack agent’s task, which is defined as the Cyber Navigation Task (CNT). $A^1$, $O^1$, and $R^1$ represent the actions, observations, and rewards of the attack agent, which will be described in detail in later sections. $F$ is the set of footholds (controlled hosts) maintained while executing the CNT; once a host $i$ is controlled, it is added to $F$. The attack agent then updates the set of reachable hosts and selects one of them as the target host for the next attack. When the size of $F$ reaches its maximum value, the newly controlled host replaces the earliest controlled host in $F$.
  • $DT = \{A^2, O^2, R^2\}$: The term $DT$ is the defense agent’s task, which is defined as the Cyber Search Task (CST). $A^2$, $O^2$, and $R^2$ represent the actions, observations, and rewards of the defense agent, which will be described in detail in later sections.
  • $T: S \times \mathbf{A} \rightarrow \Delta(S)$: The term $T$ is the state transition function. For each time step $t$, $T(s, \mathbf{a}, s')$ gives the probability that the environment transitions from state $s$ to the successor state $s'$ under the joint action $\mathbf{a}$.
  • $Z: S \times \mathbf{A} \times O \rightarrow [0, 1]$: The term $Z$ is the observation function. For each time step $t$, $Z(s', \mathbf{a}, o) = p(o \mid s', \mathbf{a})$ gives the probability of observing $o \in O$ after executing the joint action $\mathbf{a} \in \mathbf{A}$ and transitioning to the new state $s' \in S$.
  • $\gamma \in [0, 1]$: The term $\gamma$ is the discount factor.
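To make the tuple concrete, the sketch below groups its components into a single structure. This is only a minimal illustration under assumed names (the class, method, and field names are ours, not identifiers from the paper), with the transition and observation functions left abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical container mirroring the tuple <N, S, T, Z, AT, DT, gamma>.
# Class and field names are illustrative, not identifiers from the paper.
@dataclass
class CatchTheCyberThiefGame:
    n_agents: int = 2                  # N: attacker (superscript 1), defender (superscript 2)
    gamma: float = 0.95                # discount factor in [0, 1]
    max_footholds: int = 3             # maximum size of the foothold set F
    # S: per-host privilege levels, attack counts, observable alarm counts, etc.
    state: Dict[str, Dict[int, int]] = field(default_factory=dict)
    # A^1 and A^2: the attack (CNT) and defense (CST) action sets.
    attack_actions: List[str] = field(default_factory=list)
    defense_actions: List[str] = field(default_factory=list)

    def transition(self, state, joint_action):
        """T(s, a, s'): sample a successor state under the joint action."""
        raise NotImplementedError

    def observe(self, state, joint_action):
        """Z(s', a, o): sample each agent's (possibly distorted) observation."""
        raise NotImplementedError
```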
An important assumption of the game is that both the attack and defense agents move differently due to their different abilities to control the network environment, which directly affects the definition of their tasks and the applicability of policies.
In the defined CNT task, the attack agent selects a controlled host as the initiating host for an attack action and then selects a reachable host as the target host of that action in order to move. Unlike point-to-point movement in the physical world, the essence of this “jumping” movement is that the attack agent remotely sends attack commands to complete scanning and vulnerability exploitation rather than moving toward a position of its own.
In the defined CST task, the biggest difference is that the defense agent can manage hosts through remote desktop login or security software over the internal network environment, so the defense agent can directly “move” to check whether an attack has occurred. This unrestricted action execution pattern is very similar to the classic Multi-Armed Bandit (MAB) [10] problem. Although the CST task has the same action execution pattern as the MAB problem, the defense decision-making process is different. Because the attack agent may choose different attack paths to complete its task, the rewards the defense agent obtains for finding attacks no longer follow a fixed probability distribution; instead, rewards are determined by the opponent’s behavior. Therefore, the defense agent’s objective is to optimize its strategy using the observation $o_t$ and reward $r_t$ and to detect the attack as quickly as possible.

3.1. State Transition Function

The state transition is determined by the current state $s_t$, the joint action $\mathbf{a}$, and the state transition function $T(s, \mathbf{a}, s')$. At the initial time, the entire system is in state $s_1$, and the attack agent and defense agent respectively receive their own observation $o_1^i$ and reward $r_1^i$, $i = 1, 2$. After both agents update their own strategies $\pi$, they choose actions $a_1^1 \in A^1$ and $a_1^2 \in A^2$ to continue interacting with the system. The system then updates its state to $s_2$ based on the joint action $\mathbf{a}_1$ and the state transition function $T(s, \mathbf{a}, s')$.
As the “Catch the Cyber Thief” game is asymmetric in the time dimension, the first move is usually taken by the attack agent, and the defense agent’s action is executed several time steps later. The state transition process is shown in Figure 2.
When an agent achieves its goal or the game’s end conditions are reached, the system state evolution stops, and the agents can optimize their strategies for the next episode. Furthermore, since different actions have different durations, if one agent’s action is still in the process of being executed, the state transition is determined by the other agent’s action alone. If both agents’ actions are in progress, the system state remains unchanged.
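As a rough illustration of this rule, the sketch below advances one simulation tick under the assumption that each pending action carries a remaining-duration counter. The function and attribute names (step, remaining_steps, transition_fn) are ours, not the paper's.

```python
# Minimal sketch of one simulation tick: the state only evolves according to
# actions that finish on this tick; if both actions are still in progress,
# the state is left unchanged (names are illustrative).
def step(env_state, attack_action, defense_action, transition_fn):
    actions = {"attacker": attack_action, "defender": defense_action}
    finished = {role: a for role, a in actions.items() if a.remaining_steps == 1}
    for a in actions.values():
        a.remaining_steps = max(0, a.remaining_steps - 1)
    if not finished:
        return env_state            # both actions still executing
    return transition_fn(env_state, finished)
```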

3.2. The Observation Function

The observation function is the core of the “Catch the Cyber Thief” game. The most important requirement in designing the observation function is to reflect how each agent fights an invisible opponent in cyberspace. Therefore, the following basic design principles are incorporated into the modeling process.
(1) When the attack agent launches an attack action from one host to another, the defense agent may receive the alarm information with a certain probability. However, when the attack agent executes local operations, no alarm message is generated.
(2) The more times an attack agent attacks a host, the higher the probability that the defense agent will find the attack. Since the defense agent is uncertain whether the host has been attacked before, it may need to check multiple times to confirm.
(3) The attack agent can reduce the possibility of being detected by executing actions to clean up traces. The longer the time (or the more times such cleanup actions are executed), the less likely it is to be detected.
(4) If a host is attacked, even if the attacker executes multiple cleaning actions, there is a small probability that the defense agent can find the attack.
Based on the above principles, we modeled the observation function as the detection success rate (detectRate) by referring to the softmax function. The detectRate is positively correlated with the number of attacks on each host and is defined as
$$\text{detectRate} = \begin{cases} \dfrac{1}{1 + \exp(-\text{count}_i + 2)}, & \text{if } \text{pri}_t(i) \geq 1; \\[2pt] 0, & \text{if } \text{pri}_t(i) = 0, \end{cases} \qquad (1)$$
where $\text{pri}_t(i)$ represents the privilege level of the attack agent on host $i$ at time $t$: $\text{pri}_t(i) = 0$ means the attack agent has no privilege on host $i$, $\text{pri}_t(i) = 1$ means the attack agent has normal user privilege on host $i$, and $\text{pri}_t(i) = 2$ means the attack agent has administrator privilege on host $i$.
The detectRate is calculated based on the number of attacks on each host. The term $\text{count}_i$ represents the number of attack actions executed on host $i$. When $\text{count}_i$ is equal to 2, the detection rate is 0.5. After the number of attacks exceeds 2, the probability of being detected increases rapidly. Figure 3 further describes the interactions between both agents in the “Catch the Cyber Thief” game.
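A direct transcription of Equation (1) in Python reads as follows (the function name detect_rate is ours):

```python
import math

def detect_rate(count_i: int, pri_i: int) -> float:
    """Detection success rate of Equation (1) for host i.

    count_i: number of attack actions recorded on host i.
    pri_i:   attacker privilege on host i (0 = none, 1 = user, 2 = admin).
    """
    if pri_i == 0:
        return 0.0
    # Logistic curve centred at count_i = 2, where the rate equals 0.5.
    return 1.0 / (1.0 + math.exp(-count_i + 2))
```

For example, detect_rate(2, 1) returns 0.5, and the rate approaches 1 quickly as the attack count grows, matching the description above.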
It can be seen that when the attack agent executes an attack action, attack “footprints” are left in the network environment. When the attack agent decides to clean up those “footprints,” it can reduce the probability of being detected. Attack actions can trigger network defense alarms; however, because alarms are not fully reliable and some actions are executed locally by the attack agent, not all attack actions can be observed by the defense agent.
The “Catch the Cyber Thief” game regards the attack and defense interactions as an asymmetric decision-making game model. The defense agent receives cumulative alarms, which can be mapped to real traffic, and this is more in line with the characteristic that the defender must use distorted observations to fight an invisible opponent. The attack agent, in turn, cannot execute too many attack actions, since doing so may cause its task to fail more quickly.

3.3. Actions

Actions are represented by a six-tuple $\langle actionDesc, pre, post, duration, successRate, relRate \rangle$. The first parameter, actionDesc, describes how agents use the action. Following the principle of keeping the model as simple as possible, we design four types of attack actions and one type of defense action, which mainly include the following:
  • lateral(X, Y, A): The attack agent laterally moves from host X to host Y using account A. If the preconditions are met and the check against successRate succeeds, the attack agent gains account A’s privilege on host Y and adds host Y’s connectable hosts to the set of already discovered hosts. The system also increases the number of attacks recorded on host X and host Y by one each. Whether or not the action is executed successfully, the defense agent observes its execution with probability relRate.
  • exploit(X, Y, E): The attack agent remotely exploits host Y from host X using vulnerability E. If the preconditions are met and the check against the attack success rate successRate succeeds, the attack agent gains the privilege granted by vulnerability E and adds host Y’s connectable hosts to the set of already discovered hosts. The system also increases the number of attacks recorded on host X and host Y by one each. Whether or not the action is executed successfully, the defense agent observes its execution with probability relRate.
  • escalate(X, E): The attack agent escalates its privilege locally on host X using vulnerability E. If the preconditions are met and the check against the attack success rate successRate succeeds, the attack agent gains the privilege granted by vulnerability E on host X. The system also increases the number of attacks recorded on host X by one. Since privilege escalation is executed locally, the action will not be discovered by the defense agent.
  • cleanup(X): The attack agent cleans up “attack footprints” on host X. Whenever another attack action is executed successfully, the attack agent randomly executes cleanup actions on controlled hosts according to the configured number. If the preconditions are met, each cleanup action reduces the recorded number of attacks on host X.
  • search(X): The defense agent searches host X to find out if an attack has occurred. The success rate of finding an attack is related to the actual number of attacks on host X, which is calculated by Equation (1).
The second parameter, pre, is the precondition of the action, which is usually composed of multiple predicates connected by AND and OR conjunctions. When the conjunction between two predicates is “AND,” both predicates need to be satisfied at the same time; if the conjunction is “OR,” only one of the predicates needs to be satisfied. The predicates are shown in Table 1.
The third parameter, post, is the result when the action is successfully executed; the agents and the environment update the state and observation according to the action. The fourth parameter, duration, represents how many time steps the agent needs to execute the action. The fifth parameter, successRate, and the sixth parameter, relRate, are respectively the success rate of executing the attack action and the probability of being observed by the defense agent. The actions are described in detail in Table 2.
The duration is set based on the attack intensity and the attack commands likely to be sent. Since durations differ, the numbers of attacks recorded in the system also differ. For example, if the lateralAndScan(X, Y, A) action requires two time steps, the system increases the number of attacks recorded on host X and host Y by two each.
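The six-tuple can be encoded directly as a small record type. The sketch below is illustrative only: the field names follow the paper, but the Action class, the lambda-based pre/post conditions, and the concrete numbers are assumptions of ours.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Illustrative encoding of <actionDesc, pre, post, duration, successRate, relRate>.
@dataclass
class Action:
    action_desc: str                    # e.g. "exploit(2, 7, E1)" or "search(3)"
    pre: Callable[[Dict], bool]         # precondition over the current state
    post: Callable[[Dict], Dict]        # state update applied on success
    duration: int                       # time steps needed to execute
    success_rate: float                 # probability the action succeeds
    rel_rate: float                     # probability the defender observes it

# Example: a two-step remote exploit from host 2 against host 7 (values invented).
exploit_7_from_2 = Action(
    action_desc="exploit(2, 7, E1)",
    pre=lambda s: s["privilege"][2] >= 1 and "E1" in s["vulns"][7],
    post=lambda s: {**s, "privilege": {**s["privilege"], 7: 1}},
    duration=2,
    success_rate=0.8,
    rel_rate=0.6,
)
```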

3.4. Observations

In the “Catch the Cyber Thief” game, there are three observation elements that the attack agent can receive, namely, the privilege on compromised hosts, the available accounts, and the available vulnerabilities on the specified host. The defense agent can observe the number of attacks on each host. It is worth noting that the number of observed attacks does not equal the actual number of attacks, because the defense agent cannot observe the locally executed actions escalate(X, E) and cleanup(X), and the remote actions lateral(X, Y, A) and exploit(X, Y, E) are only observed with a certain probability. Observations are shown in Table 3.
According to the above modeling, it is obvious that the observations are asymmetric. The observation space of the attack agent has size $3^{n} \cdot 2^{\sum_{i=1}^{n} v_i + c}$, where $n$ is the number of hosts, $v_i$ is the number of available vulnerabilities on host $i$, and $c$ is the number of available credentials. In contrast, the size of the observation space of the defense agent equals the number of hosts. Only experienced attack agents can fight against the defense agent during task execution; otherwise, too many useless actions may cause the attack agent to be discovered quickly.
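Assuming the expression above reads as $3^{n} \cdot 2^{\sum_i v_i + c}$ (three privilege levels per host, plus a binary availability flag for each vulnerability and credential), the gap between the two observation spaces can be checked numerically; the numbers below are illustrative, not the paper's topologies.

```python
# Attacker observation space 3^n * 2^(sum_i v_i + c) versus the defender's,
# which is just the number of hosts n (toy numbers for illustration).
n_hosts = 5
vulns_per_host = [1, 2, 1, 0, 2]
n_creds = 3
attacker_space = 3 ** n_hosts * 2 ** (sum(vulns_per_host) + n_creds)
defender_space = n_hosts
print(attacker_space, defender_space)   # 124416 vs 5
```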

3.5. Rewards

Depending on the design of tasks in the game, rewards mainly include episode-based rewards and state-based rewards.
Episode-based rewards are defined according to whether the agent’s task is successfully executed, and they are often used to evaluate the performance of a strategy. If the attack agent gets the administrator privilege of the target host without being detected by the defense agent, the attack agent wins (+1) and the defense agent loses (−1). If the defense agent detects an attack before the attack agent completes its goal, the defense agent wins (+1) and the attack agent loses (−1). If neither the attack agent nor the defense agent wins within the specified number of steps, the episode is a draw.
State-based rewards are usually used to speed up the training process of agents. If the agent is trained with sparse rewards (rewards are only received when the task terminates), then state-based rewards are equal to episode-based rewards. In addition, a reward can also be returned after each action is executed. Taking the attack agent as an example, the state-based reward $r^1 = R^1(s, \mathbf{a}, s')$ is the reward that the environment feeds back to the attack agent when the system state is $s$, both agents take the joint action $\mathbf{a}$, and the successor state is $s'$.
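For instance, the episode-based, zero-sum reward assignment described above could be written as follows (the function name and return convention are ours):

```python
def episode_rewards(attacker_won: bool, defender_won: bool) -> tuple:
    """Return (attack reward, defense reward) at the end of an episode."""
    if attacker_won:
        return +1, -1   # attacker reached the key host undetected
    if defender_won:
        return -1, +1   # defender confirmed the attack in time
    return 0, 0         # draw: neither side succeeded within the step limit
```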

4. Strategies

As a learning paradigm that optimizes policies from experience, reinforcement learning has been widely used in multi-agent scenarios [11,12]. The state-value function $V^{\pi}(s)$ and the state-action value function $Q^{\pi}(s, a)$ (also called the Q function) are used to optimize the agent’s strategy. The Q function specifies how good it is for an agent to perform action $a$ in state $s$ under policy $\pi$, and it is formally defined as
$$Q^{\pi^i, \pi^{-i}}(s, \mathbf{a}) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t R_t^i \,\middle|\, a_t^i \sim \pi^i(\cdot \mid o_t), s_0 \right] \qquad (2)$$
The state-value function and the state-action value function can be converted into each other through the Bellman equation [13]. The objective of an agent is to optimize the policy $\pi^i$ in order to maximize the long-term cumulative reward in Formula (3).
$$\arg\max_{\pi^i} \mathbb{E}\left[\sum_{t \geq 0} \gamma^t R_t^i(s_t, \mathbf{a}_t, s_{t+1}) \,\middle|\, a_t^i \sim \pi^i(\cdot \mid o_t), s_0 \right] \qquad (3)$$

4.1. Strategies of the Attack Agent

In the “Catch the Cyber Thief” game, the attack agent cannot observe the state and actions of the defense agent. Therefore, classic multi-agent reinforcement learning algorithms are no longer applicable. For example, the Minimax-Q algorithm is a multi-agent reinforcement learning algorithm used to solve two-player zero-sum games [14]. The algorithm assumes that the opponent aims to minimize the agent’s return and computes the Q function according to the mini-max principle. The update formula of the Q function is $Q^i \leftarrow Q^i + \alpha \cdot (R^i + \gamma V(s_{t+1}) - Q^i)$, where $V(s_{t+1}) = \max_{\pi^i} \min_{a^{-i}} \sum_{a^i \in A^i} Q^i(s, a^i, a^{-i})\, \pi^i(s, a^i)$, $i = 1, 2$ is the player’s index, and $Q^i$ is a simplified form of $Q^i(s_t, a_t^i, a_t^{-i})$.
The biggest limitation of the Minimax-Q algorithm is that it needs to know the action spaces of both agents. The Nash-Q algorithm [15] is similar; it extends the two-player zero-sum game to a multi-player general-sum game and updates its Q function as $Q^i \leftarrow Q^i + \alpha \cdot (R^i + \gamma\, \mathrm{Nash}Q^i(s') - Q^i)$.
In the Nash-Q algorithm, each state $s$ is regarded as a stage game. The agent needs to observe the actions $a^{-i}$ and rewards $r^{-i}$ of all other agents to find a global optimal point or saddle point in each state $s$. However, because the attack agent can only receive feedback from the environment and cannot see any information about its opponent, multi-agent algorithms such as Minimax-Q and Nash-Q cannot be used to optimize its strategy.
Therefore, an alternative approach is to use single-agent reinforcement learning algorithms to train the attack agent, such as Q-learning [16], SARSA($\lambda$) [17], Double Q-learning [18], and DQN [19]. The attack agent can then adjust some parameters to better respond to the opponent. In this case, the attack agent can be seen as an experienced “thief”: although it cannot see any information about the defense agent, it is able to evade detection by increasing the number of trace-cleaning actions, adjusting the probability of action selection, and other methods.
Assume that the attack agent uses the Q-learning algorithm as its attack strategy. A characteristic of the Q-learning algorithm is that its behavior strategy differs from its learning strategy. The behavior strategy can be selected through the $\varepsilon$-greedy method, which selects actions according to the Q table in order to maintain a balance between exploration and exploitation. Its calculation formula is
$$\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A(F)|}, & \text{if } a = \arg\max_{a'} Q^{\pi}(F, a'); \\[2pt] \dfrac{\varepsilon}{|A(F)|}, & \text{otherwise}. \end{cases} \qquad (4)$$
The learning strategy approximates the Bellman optimality operator $\mathcal{T}$ by sampling $(s, a, R, s')$, i.e., it replaces $\max_{a \in A} \mathbb{E}[R(s, a, s') + \gamma V(s')]$ with $\max_{a \in A} [R(s, a, s') + \gamma V(s')]$ to approach the optimal value function for learning control. However, under the definition of the CNT, the attack agent needs to select an action from the footholds $F$, so the Q function to be updated here is $Q(F_t, a_t)$. The Q-learning procedure of the attack agent under the CNT is shown in Algorithm 1.
SARSA($\lambda$) and DQN can also be used in Algorithm 1 by replacing the Q-value update function.
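For illustration, a minimal Python sketch of the foothold-constrained Q-learning loop summarized in Algorithm 1 below is given here. The environment interface (env.reset, env.step, env.available_actions) and the foothold bookkeeping are assumptions made for this sketch, not the paper's implementation.

```python
import random
from collections import defaultdict

# Minimal sketch of foothold-constrained Q-learning (cf. Algorithm 1).
def q_learning_cnt(env, episodes=500, alpha=0.1, gamma=0.95, eps=0.1, max_footholds=3):
    Q = defaultdict(float)                          # Q[(footholds, action)]
    for _ in range(episodes):
        footholds, obs = env.reset()                # initial foothold list, first observation
        done = False
        while not done:
            key = tuple(footholds)
            actions = env.available_actions(footholds)
            if random.random() < eps:               # epsilon-greedy behavior strategy
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(key, a)])
            obs, reward, done, new_host = env.step(action)
            # Maintain the foothold set F: append, or replace the earliest entry.
            if new_host is not None and new_host not in footholds:
                if len(footholds) < max_footholds:
                    footholds.append(new_host)
                else:
                    footholds[0] = new_host
            next_key = tuple(footholds)
            if done:
                target = reward
            else:
                next_actions = env.available_actions(footholds)
                target = reward + gamma * max(Q[(next_key, a)] for a in next_actions)
            Q[(key, action)] += alpha * (target - Q[(key, action)])
    return Q
```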

4.2. Strategies of the Defense Agent

For the defense agent, the attack information is implicit in the received observations, so its strategies can be divided into four categories.
(1) The first category includes heuristic strategies that do not use observations and rewards. Since the defender can directly access any host in the network, strategies can be designed according to its detection preference. We can divide the most commonly used three-layer intranet into the DMZ layer (outermost layer), the internal office layer (second layer), and the core layer (innermost layer). Three types of strategies can be designed: Priority searching hosts in the Demilitarized Zone (PDM), Priority searching hosts in the Core Layer (PDC), and Random Detection (RD). The probability that these strategies detect each host is given by the formula
$$\mathrm{PDM}(i) = \frac{n - i + 1}{\sum_{j=1}^{n} j}, \qquad \mathrm{PDC}(i) = \frac{i}{\sum_{j=1}^{n} j}, \qquad \mathrm{RD}(i) = \frac{1}{n}, \qquad (5)$$
where $i \in \{1, 2, \ldots, n\}$ is the index of a host; indices are assigned such that host 1 lies in the DMZ layer and host $n$ lies in the core layer.
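Read this way, the three heuristics reduce to sampling a host from a fixed preference distribution. The sketch below normalizes the preference weights explicitly; it reflects our reading of Equation (5) and the host indexing above, not code from the paper.

```python
import random

def choose_host(strategy: str, n: int) -> int:
    """Sample a host to check: host 1 is in the DMZ layer, host n in the core layer."""
    hosts = list(range(1, n + 1))
    if strategy == "RD":
        weights = [1.0] * n                          # uniform random detection
    elif strategy == "PDC":
        weights = [float(i) for i in hosts]          # prefer core-layer (high-index) hosts
    elif strategy == "PDM":
        weights = [float(n - i + 1) for i in hosts]  # prefer DMZ-layer (low-index) hosts
    else:
        raise ValueError(f"unknown strategy {strategy}")
    return random.choices(hosts, weights=weights, k=1)[0]
```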
Algorithm 1 The Q-learning algorithm under the CNT
Require: (1) Set $\gamma$, $\alpha$, and $\epsilon$ to specified values; (2) set the footholds $F$, the rewards $R$, and the termination condition of the game.
Ensure: $\pi(a \mid s)$ of the attack agent
1: Initialize the Q table
2: for each episode do
3:   Set the initial state $s$
4:   for each step in an episode do
5:     Select an action $a_t$ from the Q table according to the $\epsilon$-greedy strategy
6:     The attack agent receives the reward $r_t$ and observes $o_{t+1}$, which contains the privilege of host $i$
7:     if host $i$ is not in the current footholds $F_t$ then
8:       $F_t.\mathrm{append}(node)$ if $0 \leq \mathrm{len}(F_t) < L$; $F_t[0] = node$ if $\mathrm{len}(F_t) = L$
9:     else
10:      Keep $F_t$ unchanged
11:    end if
12:    $Q \leftarrow Q + \alpha \cdot (r_t + \gamma \cdot \max_{a' \in A} Q' - Q)$, where $Q$ is a simplified form of $Q(F_t, a_t)$ and $Q'$ denotes the Q value of the successor footholds $F_{t+1}$
13:    if $s_{t+1} = done$ then
14:      The episode ends
15:    else
16:      $s_t \leftarrow s_{t+1}$
17:    end if
18:  end for
19: end for
(2) The second category includes learning strategies that only use rewards. The defense agent optimizes its strategy depending on whether its task is successful or not. A typical algorithm is the Upper Confidence Bound (UCB1) algorithm [20]. UCB1 is a classic algorithm for the MAB problem that uses the optimism principle to select actions under uncertainty. The agent initially pulls each arm once and then selects the arm $i(t)$ according to the following formula:
$$i(t) = \arg\max_{j} \left[ Q_t(j) + c \sqrt{\frac{\ln t}{N_t(j)}} \right], \qquad (6)$$
where $t$ is the number of episodes experienced so far, $Q_t(j)$ is the average reward obtained from host $j$, $N_t(j)$ is the number of times host $j$ has been selected up to episode $t$, and $c$ is a constant used to control the degree of exploration.
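A compact sketch of this selection rule (assuming every host has already been searched once so that $N_t(j) > 0$ and $t \geq 1$; the function name and the choice c = 1.4 are ours):

```python
import math

def ucb1_select(avg_reward: dict, counts: dict, t: int, c: float = 1.4) -> int:
    """Pick the host maximizing Q_t(j) + c * sqrt(ln t / N_t(j)), as in Equation (6)."""
    return max(avg_reward,
               key=lambda j: avg_reward[j] + c * math.sqrt(math.log(t) / counts[j]))
```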
(3) The third category includes heuristic strategies that only use observations. Two strategies are designed here: the Maximum Observation detection strategy ($\mathrm{MaxObs}(O_t)$) and the Random Maximum Observation detection strategy ($\mathrm{RMaxObs}(O_t)$). $\mathrm{MaxObs}(O_t)$ means that the defense agent chooses the host with the maximum number of alarms. The calculation formula is
$$\mathrm{MaxObs}(O_t) = i, \quad \text{if } O_t(i) \geq O_t(j) \ \text{for all } j, \qquad (7)$$
where i , j { 1 , 2 , , n } is the index of a host, and O t ( i ) is the number of alarms of the host i at time t.
$\mathrm{RMaxObs}(O_t)$ adds randomness to $\mathrm{MaxObs}(O_t)$. Its calculation formula is
$$\mathrm{RMaxObs}(O_t) = \mathrm{randomChoice}(\mathrm{TopK}(O_t)), \qquad (8)$$
where $\mathrm{TopK}(O_t)$ denotes the $K$ hosts with the largest number of alarms, $\mathrm{randomChoice}(X)$ denotes a random selection from $X$, and $K$ is set to three in this paper, which means selecting among the top three hosts with the largest number of alarms. Human experts will also go directly to one or several hosts for detection based on the number of alarms collected by the security software.
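Both observation-based heuristics are straightforward to express over the per-host alarm counts $O_t$; the sketch below uses a dictionary mapping host index to alarm count (the function names are ours):

```python
import random

def max_obs(alarms: dict) -> int:
    """MaxObs: check the host with the largest accumulated alarm count."""
    return max(alarms, key=alarms.get)

def rmax_obs(alarms: dict, k: int = 3) -> int:
    """RMaxObs: pick uniformly among the top-k most-alarmed hosts (k = 3 in the paper)."""
    top_k = sorted(alarms, key=alarms.get, reverse=True)[:k]
    return random.choice(top_k)
```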
(4) The fourth category includes learning strategies that use both observations and rewards. The defense agent optimizes its strategy based on the observations and rewards of each episode. Since the defense agent can only use distorted observations to fight an invisible opponent, which is a special case of partial observability, algorithms in this category should work well under partially observable conditions. For example, a Deep Recurrent Q-Network (DRQN) [21] with a long short-term memory (LSTM) network [22] or a Transformer [23] can be used to encode states. Opponent modeling methods [24,25] can also be tried to solve this kind of problem.

5. Experimental Analysis

In order to quantitatively analyze the interactions between the attack and defense agents, this paper sets up three network topologies with 20, 30, and 40 nodes as experimental environments and designs three experiments to explore the nature of the model within a limited range. Experiment 1 aims to evaluate the interaction results between different attack and defense strategies; Experiment 2 aims to evaluate the asymmetry in the spatial dimension; and Experiment 3 aims to evaluate the asymmetry in the time dimension. In the experiments, each pair of selected attack and defense strategies plays 500 episodes to reduce the uncertainty in the decision-making process; the maximum number of steps in each episode is 1000.
The attack agent uses the Q-learning algorithm with 2, 3, and 4 footholds as the attack strategy set. The defense agent uses six strategies, PDM, PDC, RD, UCB1, MaxObs, and RMaxObs, as the defense strategy set. When the attacker obtains the administrator privilege of the target host, the attack agent wins; when the defense agent finds the attack before the attack agent completes its task, the defense agent wins; when the maximum number of steps in an episode is reached, the episode is a draw.

5.1. Experiment 1: Evaluating between Different Attack and Defense Strategies

Experiment 1 assumes that both agents perform actions simultaneously without lag, and the initial controlled host is generated from hosts 1 to 6. Experiment 1 sets up three modes. (1) Fixed strategy mode: the attack agent first trains its strategy in the environment and then uses the converged policy model to confront the opponent; the attack strategy is not updated at this time. (2) Continuous learning mode: on the basis of mode 1, the attack agent uses the trained model to confront the defense agent and continuously updates its strategy. (3) Learning together mode: the attack agent uses a fully initialized strategy model to fight the defense agent. Table 4 shows the winning rates of the attack agent and the draw rates, where the first percentage in each cell is the winning rate of the attack agent and the second is the draw percentage.
It can be seen that there is no complete imbalance between the attack and defense agent in the fixed strategy mode, and neither player has an absolute advantage. This is because the attack agent is trained to form a strategy model in a single agent environment in advance, so it can be considered that the attack agent has knowledge of the network environment but does not understand the defense agent. This is similar to a “thief” who, based on their own experience, can directly ignore the presence of the defense agent to execute its task.
However, in the continuous learning mode and the learning together mode, the defense agent has an absolute advantage. In the continuous learning mode, the attack agent tries to update the trained policy model based on negative rewards, but the updated Q-function value may not correspond to the most important factor causing its failure. For example, suppose the attack agent repeatedly executes $n$ attack actions under the observation $o_1$, which greatly increases the probability of being detected. The defense agent detects the attack only after several time steps, by which time the attack agent’s observation has changed to $o_2$, and it then receives a negative reward $r^1$. Since the attack agent cannot see the defense agent, $r^1$ updates the Q function $Q(o_2, a)$ instead of $Q(o_1, a)$, even though the actions taken under $o_1$ are what caused the failure. Compared with the real world, a thief who, on the basis of previous experience, keeps trying to infer the police’s movements by observing changes in the environment may, through too much adjustment, be found even faster.
In the learning together mode, since the attack agent does not understand the network environment at the initial time, it is difficult for it to keep exploring the environment and avoid the detection of the defender at the same time. This is similar to the physical world: for a fledgling “thief” who is completely unfamiliar with the terrain and also needs to avoid the police, it is almost impossible to complete the task.
Therefore, only experienced attackers can find breakthroughs in the intranet to confront defenders. Figure 4 compares different defense strategies’ winning rates in the fixed strategy mode.
The results show that the RMaxObs and MaxObs strategies have the highest winning rates, because if the defense agent can immediately respond according to its current observations while the attack agent performs its task, the adversary can be found quickly; the reaction time of the defense agent is thus an important factor in deciding the game.
The winning rate of the PDC strategy is nearly 20% lower than that of all other strategies. This is because the attack agent executes its task step by step from the outermost layer to the innermost layer, while the PDC strategy detects the core layer first, which makes it difficult to discover the attack in time, so the attack agent can complete its task with a greater probability. When the defense agent adopts the PDM and UCB1 strategies, its winning rate is basically maintained between 35% and 53%, which is basically consistent with the literature [26], which reports that 62% of attacks were discovered only after completion (equivalent to a 38% defense win rate); this also supports the credibility of the model.
Except for the PDC strategy, there is no strict dominant relationship between different defense strategies. It means the “Catch the Cyber Thief” game is essentially an asymmetric game model affected by attack and defense strategies.

5.2. Experiment 2: Analyzing the Asymmetry in the Spatial Dimension

In this experiment, we analyze the asymmetry from the spatial dimension, mainly including two aspects: (1) the initial position of the attack agent; and (2) the scale of the network topology. The defense agent still performs actions simultaneously with the attack agent.

5.2.1. Case 1: Different Initial Positions

Different from Experiment 1, where the attack position is assumed to be random, Experiment 2 adds a situation in which the initial position of the attack agent is fixed as host 1. The defense agent mainly uses three strategies: PDM, UCB1, and RMaxObs. Figure 5 shows the winning rates of the defense agent.
Generally speaking, when the attack agent starts from the fixed initial position, the defense winning rate is higher than when it starts from a random initial position. This is because, under a random initial position, the attack agent’s attack path may be shorter, which leaves the defense agent less time to “capture” the attack agent during detection.
Figure 6 shows the probabilities that the UCB1 algorithm selects each host to detect. It is worth noting that the performance of the UCB1 algorithm with the fixed initial position is much better than with the random position, with a winning rate close to 80% to 90%. This is because, when the initial position of the attack agent is fixed, the UCB1 algorithm preferentially detects host $i(t) = \arg\max_{j} [Q_t(j) + c \sqrt{\ln t / N_t(j)}]$ in the $t$-th episode. If the defense agent has received higher rewards from a certain host before, it will still be biased toward detecting this host first. When the attack agent starts from a random initial position, the defense agent can no longer find such a pattern through the UCB1 algorithm, so it needs to make better use of observations for further algorithm optimization.

5.2.2. Case 2: Network Topologies of Different Scales

The second group of experiments analyzes the influence of topology scale on the game, adding network topologies with 30 hosts and 40 hosts to the simulation environment. The detailed environment settings are given in Appendix A. The footholds of the attack agent are set to 3, and the defense strategies include PDM, UCB1, RMaxObs, and MaxObs. Figure 7 shows the winning rates of the defense agent on the different network topologies.
It can be seen that the RMaxObs and MaxObs strategies are still absolutely dominant compared with the other defense strategies. This is because the defense agent in cyberspace can “directly move” to the hosts it wants to detect, which is also the advantage of the defense agent in the spatial dimension.
In contrast, the PDM strategy, which depends on a detection preference, is affected by the network scale. Moreover, due to the combined effect of factors such as host and connectivity settings in the network environment, attack paths under different network scales also differ, so the network scale, as one aspect of the spatial asymmetry, affects the results together with other spatio-temporal elements.

5.3. Experiment 3: Analyzing the Asymmetry in the Time Dimension

Experiment 3 analyzes the asymmetry from the time dimension, mainly including two aspects: (1) the number of times the attack agent executes cleanup actions; and (2) the lag time of the defense agent.

5.3.1. Case 1: The Number of Times the Attack Agent Executes cleanup Actions

The footholds are set to 3, and the defense strategies include PDM, UCB1, RMaxObs, and MaxObs. The attack agent executes 0 to 10 cleanup actions. Figure 8 shows the winning rates of the defense agent under different numbers of cleanup actions, where the horizontal axis is the number of cleanup actions and the vertical axis is the winning rate of the defense strategies.
It can be seen that the winning rates change in a zigzag pattern. Taking the MaxObs strategy in the 20-host environment as an example, the winning rate of the defense agent is 73.6% when the attack agent performs six cleanup actions, while the defense winning rates when the attack agent performs five and seven cleanup actions are 80.4% and 77.0%, respectively. This is because, in the game, if the number of cleanup actions is too small, the attack agent’s risk of being discovered increases, while if it executes too many cleanup actions, the defense agent has more time to detect the hosts.
From the perspective of the performance of different defense strategies, when the network scale is 30 hosts and the attack agent executes 4 cleanup actions, the defense winning rates increase significantly, even exceeding those of the MaxObs and RMaxObs strategies. When the network size is 40 hosts, the performance of PDM, UCB1, and RMaxObs varies with the number of cleanup actions, and no single defense strategy is completely dominant.
This phenomenon indicates that there may exist an “attack rhythm” when the attack agent executes its task, which is very similar to the “princess and monster” game. In that game, although the princess is in a dark room where the monster cannot be observed, she can still avoid the monster’s “hunt” by waiting for a random time before moving. The “police and thief” game is similar: since the thief cannot observe the location of the police, he can only avoid their pursuit by adjusting his own marching rhythm. In our game, the winning rate of the defense agent does not increase monotonically with the number of times the attack agent executes the cleanup(X) action, and the changing trend in different environments also differs.
The discovery of the attack rhythm phenomenon is very close to the “low and slow” [1] decision-making mode in the real intrusion process, and also shows that the “Catch the Cyber Thief” game is also asymmetric in the time dimension. How to maintain a good attack rhythm will be an issue worthy of in-depth discussion.

5.3.2. Case 2: The Lag Time of the Defense Agent

The second set of experiments analyzes the effect of defense lag time on the game. The footholds of the attack agent are set to 3, the number of times the attack agent executes cleanup actions is set to 5, and the defense strategies include PDM, UCB1, RMaxObs, and MaxObs. The lag time is set from 0 to 10. Figure 9 shows the winning rates of the defense agent under different lag times, where the horizontal axis is the lag time of the defense agent and the vertical axis is the winning rate of the defense strategies.
The results show that the longer the defense agent lags, the lower the winning rates of RMaxObs and MaxObs. Although these two strategies can quickly detect the target host based on observations, the attack agent has already completed several actions in advance and may have performed multiple trace cleanups, so the observations seen by the defense agent do not reflect the real state of the system. However, the defense reaction time has no additional impact on the PDM and UCB1 strategies, because these two algorithms do not depend on observations, and the UCB1 algorithm judges whether an attack exists based only on the reward. Therefore, the defense lag time has little effect on them.
In the experiments, a defense lag time of 0 is an idealized situation. Considering that attack actions need two or three time steps, even if the defense agent responds in time or adopts a polling mode to detect hosts in the environment, it cannot be guaranteed that the attack and defense agents start from the same initial state. Therefore, in follow-up studies, it is suggested to set the initial lag time within a reasonable range.

6. Related Work

We divide the related work on network attack–defense game modeling into three aspects: traditional game-theoretic approaches, stochastic games and partially observable stochastic games, and other similar game models.
Traditional game theoretic approaches. Due to the importance of network intrusion defense problems, researchers have proposed different network attack–defense game models based on various game-theoretic approaches, such as differential games, dynamic games, and others. Van et al. [4] proposed the “FlipIt” game based on the differential game, which models the network attack–defense game as a process of “stealthy takeover” of a network asset. Because of the simple design of the attack and defense actions in this model (each player has only one action, controlling the resource by “flipping” it at a chosen time) and its ability to depict the characteristic that both sides are invisible to each other, the model has been investigated and adapted extensively [27,28,29]. Chapman [30] designed a “Cyber Hide-and-Seek Game” as a two-sided search problem, in which one player hides k hosts with a certain strategy in an n-host network topology. Zhu et al. [31] modeled the network attack–defense game as a Colonel Blotto game [32], where multiple attackers and defenders compete for multiple distributed resources. Similar problem descriptions have also appeared in the literature [29,33,34,35]. Since traditional game-theoretic models do not set up elements such as states and observations, they often assume that the tasks and actions of both players are consistent, which makes it difficult to analyze in depth the complex interaction process between the attacker and defender.
Stochastic games and partially observable stochastic games. Stochastic games (SGs), also known as Markov games, describe games in which players compete or cooperate with each other in a sequential decision-making manner. Wang et al. [6] used Stochastic Petri Nets (SPNs) to build a network attack–defense simulation environment based on a stochastic game. Chung et al. [7] proposed a simplified network attack–defense game based on the stochastic game. Elderman et al. [8] also modeled an adversarial attack–defense game, in which the attacker aims to reach the target host by taking actions of different values and the defender needs to prevent the attacker by allocating different defense values; neither player can see the actions and utilities of the opponent. The results show that when the two agents try to adapt to each other, neither can ensure a long-term victory.
Although the stochastic game approach can reflect complex interactions between the attack and defense agents, its assumption that each agent receives the true state of the system is clearly inconsistent with reality. A partially observable stochastic game (POSG) models a problem in which agents have no access to the exact environmental state and receive only an observation of the true state through an observation function [12]. Wu et al. [36] proposed an incomplete-information stochastic game to model the network attack–defense problem, in which the attacker and defender each choose an action from a given behavior set and receive rewards. Schwartz et al. [5] used a POSG to model the network attack–defense game; they used a Bernoulli process to model the defender’s impact on the attacker’s task and distinguished the defender’s behavior into detection and response, where detection can change the network environment and response does not. Horák et al. [37] used a one-sided POSG to address honeypot reallocation against the attacker’s lateral movement: in a graph network, the attacker aims to reach a specific target node by traversing infected nodes, while the defender tries to detect the attacker and increase the cost of reaching the target by deploying honeypots.
Neither stochastic games nor partially observable stochastic games (POSGs) restrict the order in which the agents’ actions are executed, and POSGs additionally capture the limited observation ability of both the attack and defense sides, which is why researchers have increasingly adopted them to describe network attack–defense games.
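To make the partial-observability assumption in these models concrete, the following minimal Python sketch outlines the interaction loop that a POSG implies: both agents act simultaneously, the hidden state evolves stochastically, and each agent receives only its own partial observation and reward. All names, probabilities, and reward values here are illustrative assumptions, not the environment used in this paper:

    import random

    class ToyPOSG:
        """Minimal two-agent POSG sketch: an attacker compromises hosts,
        a defender inspects hosts; each agent sees only its own observation."""

        def __init__(self, num_hosts=5, alarm_prob=0.6):
            self.num_hosts = num_hosts
            self.alarm_prob = alarm_prob       # chance that an attack raises an alarm
            self.compromised = {0}             # hidden true state: initial foothold
            self.alarms = [0] * num_hosts      # all the defender can ever observe

        def step(self, attack_target, inspect_target):
            # Hidden state transition: the attack succeeds with some probability.
            if random.random() < 0.9:
                self.compromised.add(attack_target)
                if random.random() < self.alarm_prob:
                    self.alarms[attack_target] += 1
            caught = inspect_target in self.compromised
            # Asymmetric observations: neither agent sees the full true state.
            attacker_obs = sorted(self.compromised)   # attacker knows its own footholds
            defender_obs = list(self.alarms)          # defender sees only alarm counts
            rewards = {"attacker": -1.0 if caught else 0.1,
                       "defender": 1.0 if caught else -0.1}
            return (attacker_obs, defender_obs), rewards, caught

The key point is that the defender never reads the set of compromised hosts directly; it sees only alarm counts, mirroring the asymmetric observations defined in our game.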
Other similar game models. Game models from other fields also provide useful points of comparison. Pursuit–evasion games (PEGs) in the physical world can help us understand and build attack–defense games. Classic PEG models include hide-and-seek [38], Princess and Monster [39], and Green Security Games [40]. In these models, the pursuer aims to catch the evader as quickly as possible, while the evader tries to avoid capture or to accomplish its own goal before being caught. The settings of tasks and observations are the main differences among these PEG models. A hide-and-seek game assumes the evader must reach a goal while running through a maze. The Princess and Monster game distinguishes the two players’ abilities to observe and move: the princess has a larger observation range but moves more slowly, while the monster has a smaller observation range but moves faster. In Green Security Games, the pursuer must track the evader by analyzing the orientation of footprints, which is similar to the intrusion defense problem, where a defender inspects log information that may record the attacker’s actions.
Chess and board games (CBGs) include two classic extensive-form games, Go and Texas Hold’em poker. In both games, the players take turns making decisions, and the game can be modeled as a game tree over action sequences, with each player choosing the best action according to its information set. Go is a two-player complete-information game (all past and present game information is available to both players at any stage), so each player can see the other’s moves and the full board state. Texas Hold’em poker is a multiplayer incomplete-information game: players cannot see their opponents’ hidden hole cards and can only draw inferences from the opponents’ betting behavior and the public board cards.

7. Conclusions and Future Work

This work presents a novel attack–defense game model called “Catch the Cyber Thief”. The innovation of this model is that it integrates the four characteristics of real attack–defense decision-making into a POSG-based framework. The proposed game systematically defines the actions, observations, rewards, and strategies of the attack and defense agents, and the nature of the game model is analyzed through experiments. The experimental results show that the performance of attack and defense strategies changes with the opponent’s strategy and the experimental conditions. By adjusting the experimental parameters, we found that there may be an “attack rhythm” in attack decision-making. The game model provides a basic mathematical tool for studying network attack–defense interactions, on which researchers can build further work such as analyzing potential risks, designing robust defense strategies, and analyzing patterns of attack decisions.
In the future, we plan to conduct a comparative analysis between our Catch the Cyber Thief model and other existing models, such as the Stackelberg hypergame, partially observable Markov decision processes (POMDPs), and evolutionary games, focusing on the advantages and disadvantages of modeling network attack and defense problems with each formalism.
Furthermore, we aim to explore the implications of solving for the game equilibrium of attack–defense strategies in real-world network security scenarios. Understanding the strategic outcomes of reaching such an equilibrium can provide valuable insights for developing robust defense mechanisms against cyber threats.
Additionally, we are interested in investigating how our proposed model could be implemented in network emulators or simulators such as NS-3. By integrating our model into these simulation environments, we can evaluate its performance, scalability, and practicality in a controlled network setting. This implementation would allow us to assess the real-world applicability and effectiveness of our proposed approach in enhancing network security measures.

Author Contributions

W.W.: conceptualization, writing—review and editing, and supervision. X.C.: conceptualization, methodology, coding, validation, investigation, data curation, writing—original draft, and visualization. Y.L.: methodology, investigation, writing—original draft, visualization, and data curation. C.Z.: writing—review and editing, resources, and supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by the National Natural Science Foundation of China (No. 62276142, 62206133, 62202240).

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Environment Setting

Appendix A.1. The Environment of 20 Hosts

The network topology is shown in Figure A1.
Figure A1. The network topology of 20 hosts.
Each host in the environment is connected according to the connection and access rules between layers. The layers are isolated by firewalls, and some connections are opened to allow internal and external network information to be exchanged. The firewall is configured with two types of rules: deny and permit. Deny[1, 2](Subnet) means that hosts in subnet 1 cannot connect to hosts in subnet 2. Permit rules have a higher priority than deny rules; for example, Permit[2, 5](Host) indicates that, despite any subnet-level deny rule, hosts 2 and 5 can be connected. The access rule settings in the environment of 20 hosts are shown in Table A1, and a small illustrative sketch of this rule evaluation follows the table.
Table A1. Access rule setting in the environment of 20 hosts.
Type | Rule | Description
Permit | Permit[3,8](Host) | Host 3 and 8 can access each other.
Permit | Permit[4,9](Host) | Host 4 and 9 can access each other.
Permit | Permit[6,18](Host) | Host 6 and 18 can access each other.
Deny | Deny[2,4](Subnet) | Subnet 2 and 4 can not be connected.
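The sketch below illustrates the rule evaluation described above. It assumes symmetric rules, a default-allow policy between otherwise connected layers, and an illustrative subnet_of mapping; none of these details are guaranteed to match the exact simulator configuration:

    def is_allowed(src_host, dst_host, subnet_of, permits, denies):
        # permits: host-level pairs such as {(3, 8), (4, 9), (6, 18)} (Table A1)
        # denies:  subnet-level pairs such as {(2, 4)}
        # subnet_of: mapping from host index to subnet index
        # Host-level permit rules take priority over subnet-level deny rules;
        # both rule types are treated as symmetric here.
        if (src_host, dst_host) in permits or (dst_host, src_host) in permits:
            return True
        src_net, dst_net = subnet_of[src_host], subnet_of[dst_host]
        if (src_net, dst_net) in denies or (dst_net, src_net) in denies:
            return False
        return True  # assumption: otherwise traffic between connected layers is allowed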
There are two main types of data on each host: vulnerabilities and credentials. The array in the Vulns column represents the set of vulnerabilities present on the host; [X, Y] denotes that the host has two vulnerabilities, X and Y. Vulnerabilities 1, 4, 5, and 7 grant normal user privileges, while the other vulnerabilities grant administrator privileges. The duration of vulnerabilities 1, 2, and 7 is three time steps, and that of the other vulnerabilities is two time steps. The probability of successfully exploiting each vulnerability is chosen randomly from {0.85, 0.9, 0.95}. The array in the Creds column represents the set of credentials stored on the host. Host settings in the environment of 20 hosts are shown in Table A2, and an illustrative encoding of this host data is sketched after the table.
Table A2. Host settings in the environment of 20 hosts.
Index | Vulns | Creds | Index | Vulns | Creds
1 | [4, 8] | [] | 11 | [6, 8] | []
2 | [9] | cred1 | 12 | [0, 7] | cred2
3 | [3, 9] | [] | 13 | [2, 7] | []
4 | [4, 10] | [] | 14 | [8] | []
5 | [3] | cred2 | 15 | [6] | []
6 | [3, 4] | [] | 16 | [0, 2] | []
7 | [6] | cred1 | 17 | [1] | []
8 | [2, 8] | [] | 18 | [2, 6] | []
9 | [10] | [] | 19 | [0, 9] | cred2
10 | [3, 4] | [] | 20 | [5] | []
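To illustrate how the data in Table A2 can be instantiated, the snippet below encodes the first few hosts together with the per-vulnerability privilege level, duration, and success probability described above. The field names and dictionary layout are our own illustrative choice, not the exact configuration format used in the experiments:

    import random

    # Vulnerabilities 1, 4, 5 and 7 grant normal-user privilege (1); the rest grant
    # administrator privilege (2). Vulnerabilities 1, 2 and 7 take three time steps;
    # the rest take two. Success probability is drawn from {0.85, 0.9, 0.95}.
    USER_LEVEL_VULNS = {1, 4, 5, 7}
    LONG_VULNS = {1, 2, 7}

    def make_vuln(vuln_id):
        return {
            "id": vuln_id,
            "privilege": 1 if vuln_id in USER_LEVEL_VULNS else 2,
            "duration": 3 if vuln_id in LONG_VULNS else 2,
            "p_success": random.choice([0.85, 0.90, 0.95]),
        }

    # The first three hosts of Table A2 (20-host environment), as an example.
    hosts = {
        1: {"vulns": [make_vuln(4), make_vuln(8)], "creds": []},
        2: {"vulns": [make_vuln(9)],               "creds": ["cred1"]},
        3: {"vulns": [make_vuln(3), make_vuln(9)], "creds": []},
    }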

Appendix A.2. The Environment of 30 Hosts

The network topology is shown in Figure A2. The access rule settings in the environment of 30 hosts are shown in Table A3.
Table A3. Access rule setting in the environment of 30 hosts.
Type | Rule | Description
Permit | Permit[4,9](Host) | Host 4 and 9 can access each other.
Permit | Permit[4,13](Host) | Host 4 and 13 can access each other.
Permit | Permit[16,24](Host) | Host 16 and 24 can access each other.
Deny | Deny[2,4](Subnet) | Subnet 2 and 4 can not be connected.
Figure A2. The network topology of 30 hosts.
Host settings in the environment of 30 hosts are shown in Table A4.
Table A4. Host settings in the environment of 30 hosts.
Index | Vulns | Creds | Index | Vulns | Creds
1 | [3, 7] | [] | 16 | [6, 8] | []
2 | [8] | cred1 | 17 | [0, 7] | []
3 | [2, 8] | [] | 18 | [2, 7] | []
4 | [3, 9] | [] | 19 | [8] | cred2
5 | [2] | cred2 | 20 | [6] | []
6 | [2, 3] | [] | 21 | [0, 2] | []
7 | [5] | cred1 | 22 | [1] | []
8 | [1, 7] | [] | 23 | [2, 6] | []
9 | [9] | [] | 24 | [0, 9] | []
10 | [2, 3] | [] | 25 | [5] | []
11 | [8] | [] | 26 | [6, 9] | cred2
12 | [3, 9] | cred2 | 27 | [7] | []
13 | [8] | [] | 28 | [8] | []
14 | [0, 1] | [] | 29 | [9] | []
15 | [2] | [] | 30 | [4, 5] | []

Appendix A.3. The Environment of 40 Hosts

The network topology is shown in Figure A3.
Figure A3. The network topology of 40 hosts.
The access rule settings in the environment of 40 hosts are shown in Table A5.
Table A5. Access rule setting in the environment of 40 hosts.
Type | Rule | Description
Permit | Permit[5,14](Host) | Host 5 and 14 can access each other.
Permit | Permit[7,18](Host) | Host 7 and 18 can access each other.
Permit | Permit[18,34](Host) | Host 18 and 34 can access each other.
Permit | Permit[18,36](Host) | Host 18 and 36 can access each other.
Deny | Deny[3,6](Subnet) | Subnet 3 and 6 can not be connected.
Host settings in the environment of 40 hosts are shown in Table A6.
Table A6. Host settings in the environment of 40 hosts.
Index | Vulns | Creds | Index | Vulns | Creds
1 | [0, 1] | [] | 21 | [4, 5] | []
2 | [9] | cred1 | 22 | [1] | []
3 | [5] | [] | 23 | [4, 9] | []
4 | [3, 7] | [] | 24 | [2] | []
5 | [6, 9] | [1] | 25 | [7] | []
6 | [2] | [] | 26 | [3, 5] | []
7 | [7] | cred1 | 27 | [9] | []
8 | [0, 9] | [] | 28 | [8] | cred1
9 | [8] | [] | 29 | [4, 2] | []
10 | [1] | [] | 30 | [6, 8] | []
11 | [4, 7] | [] | 31 | [7] | []
12 | [2] | cred2 | 32 | [5] | cred1
13 | [5] | [] | 33 | [0, 5] | []
14 | [3, 1] | [] | 34 | [9] | []
15 | [6, 5] | [] | 35 | [2] | cred2
16 | [1] | [] | 36 | [6, 1] | []
17 | [8] | [] | 37 | [9] | cred1
18 | [3, 7] | [] | 38 | [7] | []
19 | [0, 8] | cred2 | 39 | [8] | []
20 | [7] | [] | 40 | [3, 5] | []

References

  1. Alshamrani, A.; Myneni, S.; Chowdhary, A.; Huang, D. A survey on advanced persistent threats: Techniques, solutions, challenges, and research opportunities. IEEE Commun. Surv. Tutorials 2019, 21, 1851–1877. [Google Scholar] [CrossRef]
  2. Daly, M.K. Advanced persistent threat. Usenix 2009, 4, 2013–2016. [Google Scholar]
  3. Tankard, C. Advanced persistent threats and how to monitor and deter them. Netw. Secur. 2011, 2011, 16–19. [Google Scholar] [CrossRef]
  4. Van Dijk, M.; Juels, A.; Oprea, A.; Rivest, R.L. FlipIt: The game of “stealthy takeover”. J. Cryptol. 2013, 26, 655–713. [Google Scholar] [CrossRef]
  5. Schwartz, J.; Kurniawati, H.; El-Mahassni, E. POMDP+ Information-Decay: Incorporating Defender’s Behaviour in Autonomous Penetration Testing. In Proceedings of the International Conference on Automated Planning and Scheduling, Nancy, France, 20–26 October 2020; Volume 30, pp. 235–243. [Google Scholar]
  6. Wang, Y.Z.; Lin, C.; Cheng, X.Q.; Fang, B.X. Analysis for network attack-defense based on stochastic game model. Chin. J. Comput. 2010, 33, 1748–1762. [Google Scholar] [CrossRef]
  7. Chung, K.; Kamhoua, C.A.; Kwiat, K.A.; Kalbarczyk, Z.T.; Iyer, R.K. Game theory with learning for cyber security monitoring. In Proceedings of the 2016 IEEE 17th International Symposium on High Assurance Systems Engineering (HASE), Orlando, FL, USA, 7–9 January 2016; pp. 1–8. [Google Scholar]
  8. Elderman, R.; Pater, L.J.; Thie, A.S.; Drugan, M.M.; Wiering, M.A. Adversarial Reinforcement Learning in a Cyber Security Simulation. In Proceedings of the 9th International Conference on Agents and Artificial Intelligence (ICAART 2017), Porto, Portugal, 24–26 February 2017; pp. 559–566. [Google Scholar]
  9. Wang, W.; Sun, D.; Jiang, F.; Chen, X.; Zhu, C. Research and challenges of reinforcement learning in cyber defense decision-making for intranet security. Algorithms 2022, 15, 134. [Google Scholar] [CrossRef]
  10. Robbins, H. Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. 1952, 58, 527–535. [Google Scholar] [CrossRef]
  11. Tuyls, K.; Weiss, G. Multiagent learning: Basics, challenges, and prospects. AI Mag. 2012, 33, 41–52. [Google Scholar] [CrossRef]
  12. Yang, Y.; Wang, J. An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv 2020, arXiv:2011.00583. [Google Scholar]
  13. Bellman, R. Dynamic programming. Science 1966, 153, 34–37. [Google Scholar] [CrossRef]
  14. Littman, M.L. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994; Elsevier: Amsterdam, The Netherlands, 1994; pp. 157–163. [Google Scholar]
  15. Hu, J.; Wellman, M.P. Nash Q-learning for general-sum stochastic games. J. Mach. Learn. Res. 2003, 4, 1039–1069. [Google Scholar]
  16. Crites, R.H.; Barto, A.G. Elevator group control using multiple reinforcement learning agents. Mach. Learn. 1998, 33, 235–262. [Google Scholar] [CrossRef]
  17. Yang, L.; Zheng, G.; Zhang, Y.; Zheng, Q.; Li, P.; Pan, G. On convergence of gradient expected sarsa (λ). In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 10621–10629. [Google Scholar]
  18. Hasselt, H. Double Q-learning. Adv. Neural Inf. Process. Syst. 2010, 23, 2613–2621. [Google Scholar]
  19. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef]
  20. Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 2002, 47, 235–256. [Google Scholar] [CrossRef]
  21. Hausknecht, M.; Stone, P. Deep recurrent q-learning for partially observable mdps. arXiv 2015, arXiv:1507.06527. [Google Scholar]
  22. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  24. He, H.; Boyd-Graber, J.; Kwok, K.; Daumé, H., III. Opponent modeling in deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1804–1813. [Google Scholar]
  25. Hong, Z.W.; Su, S.Y.; Shann, T.Y.; Chang, Y.H.; Lee, C.Y. A Deep Policy Inference Q-Network for Multi-Agent Systems. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS ’18), Stockholm, Sweden, 10–15 July 2018; pp. 1388–1396. [Google Scholar]
  26. Sharma, A.; Kalbarczyk, Z.; Barlow, J.; Iyer, R. Analysis of security data from a large computing organization. In Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN), Hong Kong, China, 27–30 June 2011; pp. 506–517. [Google Scholar]
  27. Grossklags, J.; Reitter, D. How task familiarity and cognitive predispositions impact behavior in a security game of timing. In Proceedings of the 2014 IEEE 27th Computer Security Foundations Symposium, Vienna, Austria, 19–22 July 2014; pp. 111–122. [Google Scholar]
  28. Feng, X.; Zheng, Z.; Cansever, D.; Swami, A.; Mohapatra, P. Stealthy attacks with insider information: A game theoretic model with asymmetric feedback. In Proceedings of the MILCOM 2016—2016 IEEE Military Communications Conference, Baltimore, MD, USA, 1–3 November 2016; pp. 277–282. [Google Scholar]
  29. Xiao, L.; Xu, D.; Mandayam, N.B.; Poor, H.V. Attacker-centric view of a detection game against advanced persistent threats. IEEE Trans. Mob. Comput. 2018, 17, 2512–2523. [Google Scholar] [CrossRef]
  30. Chapman, M.D. Cyber Hide-and-Seek. Ph.D. Thesis, King’s College London, London, UK, 2016. [Google Scholar]
  31. Zhu, T.; Ye, D.; Cheng, Z.; Zhou, W.; Philip, S.Y. Learning games for defending advanced persistent threats in cyber systems. IEEE Trans. Syst. Man, Cybern. Syst. 2022, 53, 2410–2422. [Google Scholar] [CrossRef]
  32. Roberson, B. The colonel blotto game. Econ. Theory 2006, 29, 1–24. [Google Scholar] [CrossRef]
  33. Zhang, L.; Zhu, T.; Hussain, F.K.; Ye, D.; Zhou, W. A Game-Theoretic Method for Defending Against Advanced Persistent Threats in Cyber Systems. IEEE Trans. Inf. Forensics Secur. 2022, 18, 1349–1364. [Google Scholar] [CrossRef]
  34. Mo, H.; Sansavini, G. Dynamic defense resource allocation for minimizing unsupplied demand in cyber-physical systems against uncertain attacks. IEEE Trans. Reliab. 2017, 66, 1253–1265. [Google Scholar] [CrossRef]
  35. Yang, L.X.; Li, P.; Zhang, Y.; Yang, X.; Xiang, Y.; Zhou, W. Effective repair strategy against advanced persistent threat: A differential game approach. IEEE Trans. Inf. Forensics Secur. 2018, 14, 1713–1728. [Google Scholar] [CrossRef]
  36. Wu, Z.; Tian, L.; Wang, Y.; Xie, J.; Du, Y.; Zhang, Y. Network Security Defense Decision-Making Method Based on Stochastic Game and Deep Reinforcement Learning. Secur. Commun. Netw. 2021, 2021, 2283786. [Google Scholar] [CrossRef]
  37. Horák, K.; Bošanskỳ, B.; Tomášek, P.; Kiekintveld, C.; Kamhoua, C. Optimizing honeypot strategies against dynamic lateral movement using partially observable stochastic games. Comput. Secur. 2019, 87, 101579. [Google Scholar] [CrossRef]
  38. Goldhoorn, A.; Sanfeliu, A.; Alquézar, R. Comparison of momdp and heuristic methods to play hide-and-seek. In Artificial Intelligence Research and Development; IOS Press: Amsterdam, The Netherlands, 2013; pp. 31–40. [Google Scholar]
  39. Alpern, S.; Fokkink, R.; Lindelauf, R.; Olsder, G.J. The “princess and monster” game on an interval. SIAM J. Control Optim. 2008, 47, 1178–1190. [Google Scholar] [CrossRef]
  40. Wang, Y.; Shi, Z.R.; Yu, L.; Wu, Y.; Singh, R.; Joppa, L.; Fang, F. Green security games with real-time information. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 1401–1408. [Google Scholar]
Figure 1. How the attacker and defender interact in a layered internal network environment.
Figure 2. The state transition process in the “Catch the Cyber Thief” game.
Figure 3. The interactions between attack and defense agents.
Figure 4. The comparison of the attack agent’s winning rates under different defense strategies.
Figure 5. Winning rates of the defense agent with different initial positions.
Figure 6. The probability of detecting a host by the UCB1 algorithm.
Figure 7. Winning rates of the defense agent in different network topologies.
Figure 8. Winning rates of the defense agent with different “cleanup” actions.
Figure 9. Winning rates of the defense agent with different defense lag times.
Table 1. The predicates of actions.
Predicates | Return | Description
connected(X, Y) | True/False | Whether host X and host Y are connected.
isOpen(X) | True/False | Whether host X is open.
allow(F, X, Y) | True/False | Whether firewall F allows host X to communicate with host Y.
hasFoothold(X, P) | True/False | Whether the attack agent has privilege P on host X. P can take the values [0, 1, 2], representing no privilege, normal user privilege, and administrator privilege, respectively.
hasAccount(X, A) | True/False | Whether host X has account A.
hasVul(X, E) | True/False | Whether host X has vulnerability E.
recordAttack(X, T) | Integer | Changes the number of recorded attacks on host X according to the value of T. T usually takes values according to the duration of attack actions.
Table 2. The actions of both agents, where A.P and E.P, respectively, represent the privileges that can be obtained by successfully controlling account A and exploiting vulnerability E.
Action | Precondition | Result
lateral(X, Y, A) | connected(X, Y) AND isOpen(Y) AND allow(F, X, Y) AND hasFoothold(X, 2) AND hasAccount(Y, A) | hasFoothold(Y, A.P) AND recordAttack(X, Y, D)
exploit(X, Y, E) | connected(X, Y) AND isOpen(Y) AND allow(F, X, Y) AND hasFoothold(X, 2) AND hasAccount(Y, E) | hasFoothold(Y, E.P) AND recordAttack(X, Y, D)
escalate(X, E) | hasFoothold(X, 1) AND hasVul(X, E) AND P1 < P2 | hasFoothold(Y, E.P) AND recordAttack(X, Y, D)
cleanup(X) | hasFoothold(X, 2) | recordAttack(X, −1)
search(X) | - | checkSuccess(X)
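Tables 1 and 2 together define a precondition–result schema: an action can fire only when all of its precondition predicates hold, after which its result predicates update the state. The sketch below illustrates this pattern for the lateral(X, Y, A) action; the state object and its helper methods are hypothetical stand-ins for the predicates of Table 1, not our actual implementation:

    def can_lateral(state, X, Y, A):
        # Precondition of lateral(X, Y, A), following Table 2: the two hosts are
        # connected, the target is open, the firewall allows the traffic, the
        # attacker has administrator privilege on X, and Y holds account A.
        return (state.connected(X, Y)
                and state.is_open(Y)
                and state.allow(state.firewall, X, Y)
                and state.has_foothold(X, 2)
                and state.has_account(Y, A))

    def apply_lateral(state, X, Y, A, duration):
        # Result of lateral(X, Y, A): obtain account A's privilege on Y
        # (hasFoothold(Y, A.P)) and record the attack for the action's duration.
        if can_lateral(state, X, Y, A):
            state.set_foothold(Y, state.privilege_of_account(A))
            state.record_attack(Y, duration)
        return state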
Table 3. The observations of agents.
Agent | Observations | Value
Attacker | The privilege on host i | 0/1/2
Attacker | Whether account A on host i is available | 0/1
Attacker | Whether vulnerability E on host i is available | 0/1
Defender | The number of alarms on host i | ≥0
Table 4. Winning and draw rates of the attack agent; each cell shows the winning rate/draw rate.
Modes | Footholds | RD | PDC | PDM | UCB1 | RMaxObs | MaxObs
Fixed strategy mode | 2 | 77.2%/0% | 93.0%/0% | 58.4%/0% | 65.4%/0% | 32.6%/0% | 40.4%/0%
Fixed strategy mode | 3 | 64.8%/0% | 89.8%/0% | 52.0%/0% | 46.6%/0% | 28.2%/0% | 19.2%/0%
Fixed strategy mode | 4 | 64.8%/0.2% | 86.8%/0.2% | 53.6%/0.2% | 58.4%/0% | 27.6%/0% | 35.8%/0.2%
Continuous learning mode | 2 | 2.0%/2.8% | 3.6%/11.4% | 1.8%/2.6% | 1.6%/2.6% | 1.8%/3.0% | 2.0%/5.0%
Continuous learning mode | 3 | 2.0%/3.4% | 2.8%/3.6% | 2.0%/1.8% | 1.8%/0.6% | 1.8%/2.4% | 2.0%/4.2%
Continuous learning mode | 4 | 2.4%/6% | 2.4%/41.6% | 2.4%/10.6% | 2.4%/3.6% | 2.0%/10.8% | 2.2%/34.2%
Learning together mode | 2 | 0.2%/0% | 0.2%/0% | 0.4%/0% | 0.4%/0% | 0%/0% | 0.2%/0%
Learning together mode | 3 | 0.4%/0% | 0.2%/0% | 0.2%/0% | 0.2%/0% | 0.2%/0% | 0.2%/0%
Learning together mode | 4 | 0.2%/0% | 0.2%/0% | 0.2%/0% | 0.2%/0% | 0.2%/0% | 0%/0.4%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
