1. Introduction
Modeling a network attack–defense game that is close to the real world, especially for advanced persistent threats (APTs) and their defense [1,2,3], has always been an open and challenging issue. A good game model can map well to real elements in network attack–defense scenarios, which helps researchers gain a better understanding of complex interactions in cyberspace and enables them to make more timely and informed decisions. Furthermore, it can even facilitate the training of automated intelligent agents to assist security experts in their work.
However, this places very high requirements on the design of such a game model. Different from real-world games, such as chess, cards, soccer, and real-time strategy games, the network attack–defense game is a typical asymmetric game. The tasks, actions, observations, and even applicable strategies of the attacker and defender are different.
Figure 1 shows how the attacker and defender interact in a layered internal network environment.
In Figure 1, the attacker established an initial foothold on host 2 in the Demilitarized Zone (DMZ) service area and found that he could attack host 7 in the intranet workspace area from host 2. After completing the attack on host 7, the attacker scanned host 7 for connectable hosts, found exploitable vulnerabilities on host 4 and host 10, and launched attacks on these hosts in sequence. During the attack, some of the attacker's actions triggered alerts from security devices. However, because security devices are not completely reliable, security experts still need to manually investigate potentially compromised hosts, searching for logs and other information to confirm the presence of the attack.
Based on the above figure, we can summarize the asymmetry between the attacker and defender into the following factors:
Task objectives. In terms of task objectives, network attack–defense tasks bear some similarity to the hide-and-seek game in the physical world, where the attacker needs to gain unauthorized access to a target host before being detected by the defender. However, while the environment in the physical world affects both parties equally, this is fundamentally different in cyberspace. For the attacker, the cyberspace environment is an unknown, non-cooperative network; completing an attack therefore requires continuous exploration and a high level of technical capability. For the defender, the environment is a semi-cooperative network: although the defender understands the network topology and configuration and has installed security devices to capture suspicious traffic, it does not know which hosts have been compromised by the attacker. Therefore, the defender also needs to inspect further hosts to confirm attacks.
Movement. Movement refers to how an agent acts using known information (observations). Unlike point-to-point movement in the physical world, the attacker's “movement” consists of selecting one of its controlled hosts as the initiating host of an attack action and then selecting one of the connectable hosts as the target host; it is therefore a “diffusion”-style movement. Since the internal network is semi-cooperative for the defender, the defender can often directly check the logs of hosts in the protected network through installed security devices, so its movement is “direct”. In addition, the attacker's observations come from feedback from the environment, while the defender's observations come from alarms provided by security devices. If we assume that detection by the defender does not change the environment and that the attacker loses once the defender confirms the attack, then the attacker's observations contain no information about the defender, which forces the attacker to rely solely on its own experience to execute its task.
Other spatio-temporal factors. Network attack–defense games obviously do not limit the scale of the network or the time for which both parties participate in the game. Since the attacker and defender cannot see each other's status and actions, there is no guarantee that they act in a particular order. Even if the defender stays online around the clock, its decision-making will lag behind the attacker, because detection can only follow an attack that has already occurred.
The aforementioned factors jointly affect the modeling of a network attack–defense game. Game theory, as a mathematical framework for describing cooperative and competitive relationships between two or more players, has the potential to describe complex network attack and defense decision-making processes. We hope to leverage game theory to systematically model how the attacker and defender interact in cyberspace, where each player's opponent is invisible; this raises the following challenges.
The first challenge is how to design the core game mechanism. We can assume the attacker’s objective is to get the privilege of a key host. Once the attacker controls the host that stores the confidential information, it is difficult to prevent the attack even if the defender can obtain the evidence and trace the source afterward. Therefore, the key to the attack–defense game is how the defender discovers the existence of the attack before the attacker completes his task. How to abstract this indirect confrontation is the key issue that the model needs to solve.
The second challenge is how to design the modeling elements. Typical game elements include tasks, observations, actions, rewards, and the strategies that an agent can use. In the scenario of this paper, owing to the players' different abilities to control the network environment, the biggest difference is the movement mode of the attacker and defender. How to account for this difference when modeling the above game elements is the foundation of the model.
The third challenge is how to integrate real spatio-temporal elements. For the network attack–defense game in the real world, the attacker and defender are not at the same starting point. From the spatial dimension, the scale of the network topology may affect the game results, and the different movement modes of the two players may also affect the effectiveness of their strategies. From the time dimension, the defender has a natural lag in the initial state, and the attacker can adjust its decision-making process by intermittently cleaning up its traces. Integrating these spatio-temporal elements makes the game model closer to reality.
To overcome the above challenges, we propose a novel two-player zero-sum partially observable stochastic game called “Catch the Cyber Thief”. Specifically, the model has four typical characteristics: asymmetry in observation ability, asymmetry in the time dimension, asymmetry in the spatial dimension, and decision-making under uncertainty. The core of this game model is to highlight how the attacker and defender fight against an invisible opponent in cyberspace, connecting the attack and defense agents through the traces left by the attack agent and the distorted alarms that the defense agent may receive. In the proposed game, the attack agent is a “thief” who wants to obtain administrator privilege on a certain key host in an unfamiliar network environment. The defender is the “police” who wants to catch the attacker, by searching the environment, before the attacker completes its task. In the experiments, we evaluate the performance of the attack and defense strategies under various conditions. Furthermore, the experimental results shed light on a previously mentioned but under-explored “phenomenon” that had not received in-depth discussion in prior research. The main contributions are as follows:
(1) The “Catch the Cyber Thief” game is a multi-dimensional asymmetric game model. Its asymmetry is affected not only by the agents' strategies but also by many spatio-temporal factors, such as the initial position of the attack agent, the scale of the network topology, the number of times the attack agent cleans its traces, and the lag time of the defense agent. Experimental results indicate that a good defense strategy is not necessarily effective in all scenarios; it may even perform completely differently against different opponents in the same scenario, or against the same opponent in different scenarios.
(2) The “Catch the Cyber Thief” game's experimental results show the model's credibility. The experimental statistics show that the winning rate of some defense strategies stays between 35% and 53%, which is broadly consistent with the report that 62% of intrusion attacks are discovered only after completion (equivalent to a 38% defense winning rate). In addition, we have found that the attack agent may exhibit an “attack rhythm,” which closely resembles the “low-and-slow” attack mode [1] observed in reality. The discovery of the “attack rhythm” also explains why the attack agent needs to remain stealthy during task execution.
(3) The “Catch the Cyber Thief” game reveals new issues in network attack and defense decision-making that are worth continued exploration. Experiments reveal that when the attack agent executes different numbers of trace-cleaning actions, its winning rate changes in a zigzag manner. The first question is therefore how to maintain a good attack rhythm. Considering that defense strategies perform differently in different experimental scenarios, the second question is how to generate a robust defense strategy against different attackers. Both new problems are derived from the game model and have practical significance, and they are worth continued analysis in subsequent research.
(4) The “Catch the Cyber Thief” game establishes a direct mapping to real elements. Another aspect of the credibility of this model is its ability to map to real-world elements, such as network topology, routing rules, vulnerabilities, credentials, and alarms. Thus, it can help security defenders better understand potential attack decision-making processes from observed alarm events, assess security risks, and design defense policies. The game model can also be used to conduct low-cost, large-scale experiments to train intelligent agents and to feed the resulting decision-making knowledge back into practical applications.
The rest of the paper is organized as follows: Section 2 analyzes the background and motivation of this paper. Section 3 presents the design of the proposed model, and Section 4 analyzes the strategies of the agents. Experimental evaluations are presented in Section 5. Section 6 introduces the related work. Section 7 concludes this work and discusses future work.
2. Background and Motivation
In previous research [4,5,6,7,8] on network attack–defense problems, researchers often focus on stealthiness or partial observability, which provides valuable references for our work. However, most studies assume that some elements of the network attack–defense game are identical for both players, which cannot comprehensively capture the multi-dimensional asymmetry of network attack–defense games. Section 6 provides a more detailed introduction to the related work [9].
To more fully reflect the asymmetry in network attack–defense problems, we adopt the partially observable stochastic game (POSG) framework to describe them. The reason is that a POSG does not restrict the order in which the attacker and defender execute actions and assumes that neither side can observe the true state of the system. Overall, the network attack–defense problem considered in this paper can be characterized by the following features:
The asymmetry in observation ability. The biggest feature of the network intrusion defense problem is that both agents fight against an invisible opponent. The game terminates when either the defense agent discovers the attack before the attack agent completes its task, or the attack agent acquires the privilege of the key host before being discovered. The attack agent only receives observations from the environment, while the defense agent receives observations triggered by attack actions, although these observations do not fully reflect the real actions of the attack agent.
The asymmetry in the time dimension. Time is another important factor affecting the game. Although the attack agent cannot see the opponent’s status, actions, and other information, it can reduce the risk of being detected by clearing attack traces rather than simply “moving” to the target host. Another important time factor is the defense lag time. Since the premise of defense detection is that the attack has already occurred, the defense agent is inherently lagging. The defense lag time directly affects the quality of the defense strategy.
The asymmetry in the spatial dimension. The biggest difference between the attacker and the defender comes from the difference in “movement mode,” which is due to their different abilities to control the network environment. The attacker can obtain an initial stronghold in the target network by utilizing phishing emails, exploiting website vulnerabilities, and so on, and may “jump” from a non-fixed initial position to perform its task, while the defender can directly inspect hosts in the target network, which is fundamentally different from point-to-point movement in physical space. In addition, the scale of the network topology can also affect the results of network attack–defense games.
Decision-making under uncertainty. In the context of network intrusion defense, meeting the preconditions of an attack does not ensure successful execution, due to the instability of attack tools and actions. For defenders, factors such as the missed detection rate (e.g., network security devices analyzing traffic at specific sampling rates, the absence of a comprehensive fingerprint library, etc.) and the false alarm rate (e.g., security detection equipment misidentifying normal behaviors as attack patterns, etc.) make it challenging to definitively ascertain the presence of an attack. Consequently, the uncertainties of both the attacker and the defender must be considered in the defense strategy.
Based on the above analysis, the attack and defense agents are not in the same space–time environment, and both agents do not have a predetermined order and can make decisions at the same time or one after the other. In addition, both agents need to make decisions based on observations and rewards, but neither can obtain the complete real state of the environment, so a partially observable stochastic game method can be used to model this problem.
3. Game Model
This work proposes a novel two-player network attack–defense zero-sum game model based on the partially observable stochastic game (POSG), called “Catch the Cyber Thief”. In the proposed game, the attacker and defender interact indirectly in an internal network. The attacker is the “thief” who hopes to gain administrator privilege on a certain host, while the defender is the “police” who hopes to catch the attacker by searching the environment. Because the thief's location is unknown, police in the real world need to track the thief's footprints to plan a route. In contrast, the defender in cyberspace can move directly to any desired host based on alarm information, for example through remote desktop, and check whether the node is under attack, until its task is accomplished or the game ends.
The game is defined by a tuple $\langle N, S, \mathrm{CNT}, \mathrm{CST}, T, Z, \gamma\rangle$, where the following holds:
N: The term N is the number of agents. In the “Catch the Cyber Thief” game, the attack agent is denoted by the superscript 1 and the defense agent by the superscript 2.
S: The term S is the state space of the game, which includes states such as the available vulnerabilities and credentials on each host, the attacker's privilege on each host, the number of attack actions executed on each host, the number of attacks that the defense agent can observe, and so on.
CNT: The attack agent's task, defined as the Cyber Navigation Task (CNT). The terms $A^1$, $O^1$, and $R^1$ represent the actions, observations, and rewards of the attack agent, which are described in detail in later sections. F is the set of footholds (controlled hosts) maintained while executing the CNT task; once a host i is controlled, it is added to the footholds F. The attack agent then updates the set of reachable next hosts and selects one of them as the target host of the attack. When the size of F reaches its maximum value, the newly controlled host replaces the earliest controlled host in F.
CST: The defense agent's task, defined as the Cyber Search Task (CST). The terms $A^2$, $O^2$, and $R^2$ represent the actions, observations, and rewards of the defense agent, which are described in detail in later sections.
T: The term T is the state transition function. For each time step t, $T(s_{t+1} \mid s_t, \mathbf{a}_t)$ represents the probability that the environment transitions from state $s_t$ to the successor state $s_{t+1}$ under the joint action $\mathbf{a}_t = (a^1_t, a^2_t)$.
Z: The term Z is the observation function. For each time step t, $Z(o_{t+1} \mid \mathbf{a}_t, s_{t+1})$ represents the probability of observing $o_{t+1}$ after the joint action $\mathbf{a}_t$ is executed and the environment transitions to the new state $s_{t+1}$.
$\gamma$: The term $\gamma$ is the discount factor.
An important assumption of the game is that both the attack and defense agents move differently due to their different abilities to control the network environment, which directly affects the definition of their tasks and the applicability of policies.
In the defined CNT task, the attack agent selects a controlled host as the initiating host of an attack action and then selects a reachable host as the target host of that action in order to “move”. Unlike point-to-point movements in the physical world, the essence of this “jumping” movement is that the attack agent remotely sends attack commands to complete scanning and vulnerability exploitation rather than physically relocating itself.
In the defined CST task, the biggest difference is that the defense agent can manage hosts through remote desktop login or security software over the internal network environment, so the defense agent can directly “move” to check whether an attack has occurred. This unrestricted action execution pattern is very similar to the classic Multi-Armed Bandit (MAB) [10] problem. Although the CST task has the same action execution pattern as the MAB problem, the defense decision-making process is different. Because the attack agent may choose different attack paths to complete its task, the rewards the defense agent obtains for finding attacks no longer follow a fixed probability distribution; they are determined by the opponent's behavior. Therefore, the defense agent's objective is to optimize its strategy by using its observations and rewards so as to detect the attack as quickly as possible.
3.1. State Transition Function
The state transition is determined by the current state $s_t$, the joint action $\mathbf{a}_t = (a^1_t, a^2_t)$, and the state transition function $T$. At the initial time, the entire system is in state $s_0$, and the attack agent and the defense agent respectively receive their own observations $o^1_0, o^2_0$ and rewards $r^1_0, r^2_0$. After both agents update their own strategies $\pi^1$ and $\pi^2$, they choose actions $a^1_t$ and $a^2_t$ to continue interacting with the system. The system then updates its state to $s_{t+1}$ based on the joint action $\mathbf{a}_t$ and the state transition function $T$.
As the “Catch the Cyber Thief” game is asymmetric in the time dimension, the first move is usually taken by the attack agent; the defense agent's actions begin several time steps later. The state transition process is shown in Figure 2.
When an agent achieves its goal or the game's termination conditions are reached, the evolution of the system state stops, and the agents can optimize their strategies for the next episode. Furthermore, since different actions have different durations, if one agent's action is still being executed, the next state is determined only by the other agent's action; if both agents' actions are still in progress, the system state remains unchanged.
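As an illustration of how action durations interact with the state transition described above, the following sketch advances the state using only the actions that finish at the current time step; the data layout and helper names are assumptions for illustration, not part of the model's formal definition.

```python
def advance(state, pending, transition, t):
    """Advance the joint state by one time step (illustrative sketch).

    `pending` maps an agent id (1 = attacker, 2 = defender) to a pair
    (action, finish_time); `transition(state, actions)` applies the joint
    effect of the actions that complete at step t. Actions still in
    progress contribute nothing, so if both are in progress the state
    is unchanged, matching the rule described above.
    """
    finished = {agent: act for agent, (act, finish) in pending.items() if finish <= t}
    if not finished:
        return state                      # both actions still executing
    return transition(state, finished)    # apply only the completed actions
```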
3.2. The Observation Function
The observation function is the core of the “Catch the Cyber Thief” game. The most important aspect of designing the observation function is to reflect how both agents fight against an invisible opponent in cyberspace. Therefore, the following basic design principles need to be incorporated into the modeling process.
- (1) When the attack agent launches an attack action from one host to another, the defense agent may receive alarm information with a certain probability. However, when the attack agent executes local operations, no alarm message is generated.
- (2) The more times an attack agent attacks a host, the higher the probability that the defense agent will find the attack. Since the defense agent is uncertain whether a host has been attacked before, it may need to check the host multiple times to confirm.
- (3) The attack agent can reduce the possibility of being detected by executing actions that clean up its traces. The longer the time spent (or the more trace-cleaning actions executed), the less likely it is to be detected.
- (4) If a host has been attacked, even if the attacker executes multiple cleaning actions, there is still a small probability that the defense agent can find the attack.
Based on the above principles, we model the observation function as the detection success rate (detectRate), drawing on the form of the softmax function. The detectRate is positively correlated with the number of attack actions executed on each host and is defined in Equation (1). The definition also involves the privilege that the attack agent holds on host i at time t, which distinguishes three levels: no privilege, normal user privilege, and administrator privilege. When the number of attack actions executed on host i equals 2, the detection rate is 0.5; once the number of attacks exceeds 2, the probability of being detected increases rapidly.
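Equation (1) gives the exact definition; as an illustration of the qualitative behavior described above (a rate of 0.5 at two attacks that then rises quickly), a logistic curve of the following shape reproduces it. The function name and the steepness constant k are assumptions, not the paper's formula.

```python
import math

def detect_rate(num_attacks: int, k: float = 1.0) -> float:
    """Illustrative detection success rate (not the paper's Equation (1)).

    A logistic (two-class softmax) curve centred at 2 attacks:
    0 attacks -> low rate, 2 attacks -> 0.5, >2 attacks -> rises quickly.
    `k` controls how sharply the rate grows and is an assumed constant.
    """
    return 1.0 / (1.0 + math.exp(-k * (num_attacks - 2)))

# Qualitative check of the behaviour described in the text:
# detect_rate(0) ~ 0.12, detect_rate(2) = 0.5, detect_rate(4) ~ 0.88
```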
Figure 3 further describes the interactions between both agents in the “Catch the Cyber Thief” game.
It can be seen that when the attack agent executes an attack action, it leaves attack “footprints” in the network environment. When the attack agent decides to clean up those “footprints,” it can reduce the probability of being detected. Attack actions trigger network defense alarms; however, because detections can be missed and because some actions are executed locally by the attack agent, not all attack actions can be observed by the defense agent.
The “Catch the Cyber Thief” game treats the attack and defense interactions as an asymmetric decision-making game model. The defense agent receives cumulative alarms, which can be mapped to real traffic; this matches the characteristic of a defender using distorted observations to fight against an invisible opponent. The attack agent, in turn, cannot execute too many attack actions, since doing so may cause its task to fail sooner.
3.3. Actions
Actions are represented by a six-tuple (action description, precondition, result, duration, success rate, observation probability). The first parameter, actionDesc, describes how agents use the action. Following the principle of keeping the model as simple as possible, we design four types of attack actions and one type of defense action, which mainly include the following:
lateral(X, Y, A): The attack agent laterally moves from host X to host Y using account A. If the preconditions are met and the check against the attack success rate returns true, the attack agent obtains account A's privilege and adds the connectable hosts of host Y to the set of already discovered connected hosts. The system also increases the number of attacks on host X and host Y by one each. Whether or not the action succeeds, the defense agent has a certain probability of observing that the attack agent executed the action.
exploit(X, Y, E): The attack agent remotely exploits host Y from host X using vulnerability E. If the preconditions are met and the check against the attack success rate returns true, the attack agent obtains the privilege granted by vulnerability E and adds the connectable hosts of host Y to the set of already discovered connected hosts. The system also increases the number of attacks on host X and host Y by one each. Whether or not the action succeeds, the defense agent has a certain probability of observing that the attack agent executed the action.
escalate(X, E): The attack agent escalates privilege locally on host X using vulnerability E. If the preconditions are met and the check against the attack success rate returns true, the attack agent obtains the privilege granted by vulnerability E. The system also increases the number of attacks on host X by one. Since privilege escalation is executed locally, the action cannot be discovered by the defense agent.
cleanup(X): The attack agent cleans up “attack footprints” on host X. Whenever another attack action is executed successfully, the attack agent randomly executes cleanup actions on its controlled hosts according to a preset number. If the preconditions are met, the recorded number of attacks on host X is reduced accordingly.
search(X): The defense agent searches host X to find out whether an attack has occurred. The success rate of finding an attack is related to the actual number of attacks on host X and is calculated by Equation (1).
The second parameter is the precondition of the action, which is usually composed of multiple predicates joined by AND and OR conjunctions. When two predicates are joined by “AND,” both need to be satisfied at the same time; when they are joined by “OR,” only one of them needs to be satisfied. The predicates are shown in Table 1.
The third parameter is the result when the action is successfully executed; agents and the environment update their state and observations accordingly. The fourth parameter represents how many time steps the action takes to execute. The fifth and sixth parameters are, respectively, the success rate of executing the attack action and the probability of being observed by the defense agent. The actions are described in detail in Table 2.
The duration is set based on the attack intensity and the attack commands that may be sent. Since durations differ, the numbers of attacks recorded in the system also differ. For example, if an action launched from host A against host B requires two time steps, the system increases the number of attacks on host A and host B by two each.
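To make the six-tuple concrete, the sketch below encodes an action as a small data structure; the field names and the Python representation are illustrative assumptions rather than the paper's notation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Action:
    """Illustrative encoding of the six-tuple described above (field names assumed)."""
    action_desc: str                         # e.g., "lateral(X, Y, A)"
    preconditions: List[Callable[[], bool]]  # predicate checks (AND/OR-combined in the model)
    effect: Callable[[], None]               # state/observation update applied on success
    duration: int                            # time steps the action occupies
    success_rate: float                      # probability that the action succeeds
    obs_rate: float                          # probability that the defender observes it

def preconditions_met(action: Action) -> bool:
    # Simplified AND-only check; the full model also allows OR conjunctions.
    return all(check() for check in action.preconditions)
```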
3.4. Observations
In the “Catch the Cyber Thief” game, the attack agent can receive three observation elements, namely, its privilege on compromised hosts, the available accounts, and the available vulnerabilities on the specified host. The defense agent can observe the number of attacks on each host. It is worth noting that the observed number of attacks does not equal the actual number of attacks, because the defense agent cannot observe the locally executed actions (escalate and cleanup), and the remote actions (lateral and exploit) are only observed with a certain probability. Observations are shown in Table 3.
According to the above modeling, the observations are clearly asymmetric. The size of the attack agent's observation space grows with n, the number of hosts, with the number of available vulnerabilities on each host i, and with c, the number of available credentials, whereas the size of the defense agent's observation space equals the number of hosts. Only an experienced attack agent can fight against the defense agent during task execution; otherwise, too many useless actions may cause the attack agent to be discovered quickly.
3.5. Rewards
Depending on the design of tasks in the game, rewards mainly include episode-based rewards and state-based rewards.
Episode-based rewards are defined according to whether the agent's task is successfully executed and are often used to evaluate the performance of a strategy. If the attack agent obtains the administrator privilege of the target host without being detected by the defense agent, the attack agent wins and the defense agent loses. If the defense agent detects an attack before the attack agent completes its goal, the defense agent wins and the attack agent loses. If neither agent can win within the specified number of steps, the episode is a draw.
State-based rewards are usually used to speed up the training process of agents. If an agent is trained with sparse rewards (rewards are only received when the task terminates), then the state-based rewards are equal to the episode-based rewards. Alternatively, a reward can be returned after each action is executed. Taking the attack agent as an example, the state-based reward $r^1(s, \mathbf{a}, s')$ denotes the reward that the environment feeds back to the attack agent when the system is in state $s$, both agents take the joint action $\mathbf{a}$, and the successor state is $s'$.
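As a concrete reading of the episode-based rewards, the following sketch assigns zero-sum values at the end of an episode; the specific ±1 magnitudes are an assumption, since the model only requires that the winner's and loser's rewards be opposite.

```python
from typing import Tuple

def episode_rewards(outcome: str) -> Tuple[int, int]:
    """Illustrative episode-based zero-sum rewards (the +1/-1 values are assumed).

    Returns (attacker_reward, defender_reward) for one finished episode.
    """
    if outcome == "attacker_reached_goal":      # admin privilege obtained undetected
        return +1, -1
    if outcome == "defender_detected_attack":   # attack found before the goal was reached
        return -1, +1
    return 0, 0                                 # draw: step limit reached
```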
4. Strategies
As a learning paradigm that optimizes policies from experience, reinforcement learning has been widely used in multi-agent scenarios [11,12]. The state-value function $V^{\pi}(s)$ and the state-action value function $Q^{\pi}(s,a)$ (also called the Q function) are used to optimize an agent's strategy. The Q function specifies how good it is for an agent to perform action $a$ in state $s$ under policy $\pi$, and it is formally defined as
$$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right].$$
The state-value function and the state-action value function can be converted into each other through the Bellman equation [13]. The objective of an agent is to optimize its policy $\pi$ in order to maximize the long-term cumulative reward.
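For reference, the conversion mentioned above can be written in the standard single-agent Bellman form (standard textbook notation, shown for completeness rather than taken from the original text):
$$V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a), \qquad Q^{\pi}(s, a) = \sum_{s'} T(s' \mid s, a)\left[ r(s, a, s') + \gamma\, V^{\pi}(s') \right].$$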
4.1. Strategies of the Attack Agent
In the “Catch the Cyber Thief” game, the attack agent cannot observe the state and actions of the defense agent. Therefore, classic multi-agent reinforcement learning algorithms are no longer applicable. For example, the Minimax-Q algorithm is a multi-agent reinforcement learning algorithm used to solve two-player zero-sum games [14]. The algorithm assumes that the opponent's goal is to minimize the player's return and computes its Q function according to the minimax principle, where the Q values are defined over the joint actions of the player and its opponent and a superscript indexes the player.
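For reference, a standard formulation of the Minimax-Q update for a two-player zero-sum game is shown below; the symbols $a$ and $o$ denote the player's and the opponent's actions, and this generic notation is assumed here rather than taken from the paper's original formula:
$$Q(s, a, o) \leftarrow (1 - \alpha)\, Q(s, a, o) + \alpha \left[ r + \gamma\, V(s') \right], \qquad V(s) = \max_{\pi \in \Pi(A)} \min_{o'} \sum_{a'} \pi(a')\, Q(s, a', o').$$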
The biggest limitation of the Minimax-Q algorithm is that it needs to know the action spaces of both agents. The Nash-Q algorithm [15] is similar: it extends the two-player zero-sum setting to multi-player general-sum games and updates its Q function toward the value of a Nash equilibrium of the stage game.
In the Nash-Q algorithm, each state s is regarded as a stage game, and the agent needs to observe the actions and rewards of all other agents to find a global optimum or saddle point in each state s. However, because the attack agent can only receive feedback from the environment and cannot see any information about its opponent, multi-agent algorithms such as Minimax-Q and Nash-Q cannot be used to optimize its strategy.
Therefore, an alternative approach is to use single-agent reinforcement learning algorithms to train the attack agent, such as Q-learning [16], SARSA($\lambda$) [17], Double Q-learning [18], and DQN [19]. The attack agent can then adjust some parameters to better respond to its opponent. In this case, the attack agent can be seen as an experienced “thief”: although it cannot see any information about the defense agent, it is able to evade detection by increasing the number of trace-cleaning actions, adjusting the probability of action selection, and other methods.
Assume that the attack agent uses the Q-learning algorithm as its attack strategy. A characteristic of Q-learning is that its behavior policy differs from its learning policy. The behavior policy can be selected through the $\epsilon$-greedy method, which selects actions according to the Q table in order to maintain a balance between exploration and exploitation:
$$a_t = \begin{cases} \arg\max_{a} Q(s_t, a), & \text{with probability } 1-\epsilon, \\ \text{a random action}, & \text{with probability } \epsilon. \end{cases}$$
The learning policy approximates the Bellman optimality operator through sampling, i.e., it approaches the optimal value function for learning control. However, under the definition of the CNT, the attack agent must select an action launched from its footholds F, so the Q function to be updated is restricted to actions reachable from the footholds. The Q-learning process of the attack agent under the CNT is shown in Algorithm 1.
SARSA($\lambda$) and DQN can also be used in Algorithm 1 by replacing the Q-value update rule.
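The following is a minimal sketch of the foothold-restricted ε-greedy Q-learning loop summarized in Algorithm 1; the environment interface (reset, reachable_actions, step) and the hyperparameter values are illustrative assumptions.

```python
import random
from collections import defaultdict

def q_learning_cnt(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Sketch of ε-greedy Q-learning restricted to actions launched from footholds.

    `env` is an assumed interface: reset() returns the initial state and the
    initial foothold list; reachable_actions(state, footholds) lists attack
    actions whose initiating host is a current foothold; step(action) returns
    (next_state, reward, done, newly_controlled_host_or_None).
    """
    Q = defaultdict(float)                   # Q[(state, action)] -> value
    for _ in range(episodes):
        state, footholds = env.reset()
        done = False
        while not done:
            actions = env.reachable_actions(state, footholds)
            if random.random() < epsilon:                       # explore
                action = random.choice(actions)
            else:                                               # exploit
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done, new_host = env.step(action)
            if new_host is not None and new_host not in footholds:
                footholds.append(new_host)                      # update footholds F
            next_actions = env.reachable_actions(next_state, footholds)
            best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
            # Standard Q-learning update, restricted to foothold-reachable actions
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```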
4.2. Strategies of the Defense Agent
For the defense agent, the attack information is implicit in the received observations, so its strategies can be divided into four categories.
(1) The first category includes heuristic strategies that use neither observations nor rewards. Since the defender can directly access any host in the network, strategies can be designed according to its detection preference. We divide the most commonly used three-layer intranet into the DMZ layer (outermost), the internal office layer (middle), and the core layer (innermost). Three strategies are designed: Priority searching hosts in the Demilitarized Zone (PDM), Priority searching hosts in the Core layer (PDC), and Random Detection (RD). The probability with which each of these strategies detects a host is given by a layer-based preference over the host index i, where hosts are numbered so that host 1 lies in the DMZ layer and host n lies in the core layer.
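The following sketch illustrates one plausible way to realize the three detection preferences; the rank-weighted form of the PDM/PDC probabilities is an assumption for illustration, not the paper's exact formula.

```python
import random

def choose_host(strategy: str, n: int) -> int:
    """Pick a host index in 1..n to detect, where host 1 is in the DMZ layer
    and host n is in the core layer. The weighting scheme is illustrative only."""
    hosts = list(range(1, n + 1))
    if strategy == "RD":                         # Random Detection
        weights = [1.0] * n
    elif strategy == "PDM":                      # prefer DMZ (outermost) hosts
        weights = [n - i + 1 for i in hosts]     # host 1 gets the largest weight
    elif strategy == "PDC":                      # prefer core (innermost) hosts
        weights = [i for i in hosts]             # host n gets the largest weight
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return random.choices(hosts, weights=weights, k=1)[0]
```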
Algorithm 1 The Q-learning algorithm under the CNT
Require: (1) Set the learning rate α, the discount factor γ, and the exploration rate ε to specified values; (2) set the footholds F, the rewards R, and the termination condition of the game.
Ensure: The learned Q table of the attack agent
1: Initialize the Q table
2: for each episode do
3:  Set the initial state s
4:  for each step in the episode do
5:   Select an action a from the Q table according to the ε-greedy strategy
6:   The attack agent receives the reward r and observes o, where o contains the privilege of host i
7:   if host i is not in the current footholds then
8:    Add host i to the footholds F
9:   else
10:   Keep F unchanged
11:  end if
12:  Update Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)], where Q is a simplified form of the foothold-restricted Q function
13:  if the termination condition is reached then
14:   The episode ends
15:  else
16:   s ← s'
17:  end if
18: end for
19: end for
(2) The second category includes learning strategies that only use rewards. The defense agent optimizes its strategy depending on whether its task succeeds or not. A typical algorithm is the Upper Confidence Bound (UCB1) algorithm [20]. UCB1 is a classic algorithm for the MAB problem that uses the principle of optimism under uncertainty to select actions. The agent initially pulls each arm once and then selects the arm according to
$$j_t = \arg\max_{j} \left( \bar{x}_j + c \sqrt{\frac{2 \ln t}{n_j}} \right),$$
where $t$ is the number of episodes experienced so far, $\bar{x}_j$ is the average reward obtained from host $j$, $n_j$ is the number of times host $j$ has been selected before episode $t$, and $c$ is a constant used to control the degree of exploration.
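A compact sketch of this per-episode selection rule follows, assuming one “arm” per host; the reward bookkeeping convention and the constant c are assumptions for illustration.

```python
import math

def ucb1_select(avg_reward, pulls, t, c=1.0):
    """Return the index of the host to search in episode t (UCB1 rule).

    avg_reward[j]: mean reward so far from searching host j
    pulls[j]:      how many times host j has been searched
    Hosts never searched are tried first (treated as an infinite score).
    """
    scores = []
    for j in range(len(avg_reward)):
        if pulls[j] == 0:
            scores.append(float("inf"))
        else:
            scores.append(avg_reward[j] + c * math.sqrt(2 * math.log(t) / pulls[j]))
    return max(range(len(scores)), key=scores.__getitem__)
```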
(3) The third category includes heuristic strategies that only use observations. Two strategies are designed here: the Maximum Observation detection strategy (MaxObs) and the Random Maximum Observation detection strategy (RMaxObs). MaxObs means that the defense agent chooses the host with the maximum number of alarms at time t, where the alarm count of host i at time t is the accumulated number of observed attacks on that host. RMaxObs adds randomness to MaxObs: it selects a host uniformly at random among the K hosts with the largest numbers of alarms. In this paper, K is set to three, which means selecting among the top three hosts with the largest number of alarms. Human experts likewise go directly to one or several hosts for inspection based on the number of alarms collected by the security software.
(4) The fourth category includes learning strategies that use both observations and rewards. The defense agent optimizes its strategy based on the observations and rewards of each episode. Since the defense agent can only use distorted observations to fight against an invisible opponent, which is a special case of partial observability, the algorithms used should work well under partially observable conditions. For example, the Deep Recurrent Q-Network (DRQN) [21] with a long short-term memory (LSTM) network [22], or a Transformer [23], can be used to encode states. Opponent modeling methods [24,25] can also be applied to this kind of problem.
5. Experimental Analysis
In order to quantitatively analyze the interactions between the attack and defense agents, this paper sets up three network topologies with 20, 30, and 40 nodes as experimental environments and designs three experiments to explore the nature of the model within a limited range. Experiment 1 evaluates the interaction results between different attack and defense strategies; Experiment 2 evaluates the asymmetry in the spatial dimension; and Experiment 3 evaluates the asymmetry in the time dimension. In the experiments, each pairing of attack and defense strategies is played for 500 episodes to reduce the uncertainty in the decision-making process, and the maximum number of steps in each episode is 1000.
The attack agent uses the Q-learning algorithm with 2, 3, and 4 footholds as its attack strategy set. The defense agent uses six strategies as its defense strategy set: PDM, PDC, RD, UCB1, MaxObs, and RMaxObs. When the attacker obtains the administrator privilege of the target host, the attack agent wins; when the defense agent finds the attack before the attack agent completes its task, the defense agent wins; when the maximum number of steps in an episode is reached, the episode is a draw.
5.1. Experiment 1: Evaluating Interactions between Different Attack and Defense Strategies
In Experiment 1, it is assumed that both agents perform actions simultaneously without lag, and the initial controlled host is generated randomly from hosts 1 to 6. Experiment 1 sets up three modes. (1) Fixed strategy mode: the attack agent first trains its strategy in the environment without the defense agent and then uses the converged policy model to confront the opponent; the attack strategy is not updated at this stage. (2) Continuous learning mode: on the basis of mode 1, the attack agent uses the trained model to confront the defense agent and continuously updates its strategy. (3) Learning together mode: the attack agent uses a fully initialized (untrained) strategy model to fight against the defense agent.
Table 4 shows the winning rates of the attack agent and the draw rates, where the first percentage in each cell is the winning rate of the attack agent and the second is the percentage of draws.
It can be seen that there is no complete imbalance between the attack and defense agents in the fixed strategy mode, and neither player has an absolute advantage. This is because the attack agent is trained in advance to form a strategy model in a single-agent environment, so it can be considered to have knowledge of the network environment while not understanding the defense agent. This is similar to a “thief” who, based on its own experience, simply ignores the presence of the defense agent while executing its task.
However, in the continuous learning mode and the learning together mode, the defense agent has an absolute advantage. In the continuous learning mode, the attack agent tries to update the trained policy model based on negative rewards, but the Q value it updates may not correspond to the most important factor causing its failure. For example, the attack agent repeatedly executes attack actions under some observation, which greatly increases the probability of being detected; the defense agent then finds the attack several time steps later, by which time the attack agent's observation has already changed, and it receives a negative reward. Since the attack agent cannot see the defense agent, it updates the Q value associated with the later observation instead of the one associated with the observation under which the fatal actions were taken, which is what actually caused its failure. Compared to the real world, this is like a thief who, on the basis of previous experience, still tries to infer the police's movements by observing changes in the environment; too much adjustment may cause him to be found faster.
In the learning together mode, since the attack agent does not understand the network environment at the initial time, it is difficult for it to keep exploring the environment and evade the defender's detection at the same time. This is similar to the physical world: for a fledgling “thief” who is completely unfamiliar with the terrain and must avoid the police at the same time, it is almost impossible to complete the task.
Therefore, only experienced attackers can find breakthroughs in the intranet to confront defenders.
Figure 4 compares different defense strategies’ winning rates in the fixed strategy mode.
The results show that the RMaxObs and MaxObs strategies have the highest winning rate, because when the attack agent performs its task, if the defense agent can immediately respond according to current observations, then the adversary can be found quickly, and the reaction time of the defense agent will be an important factor to decide the game.
The winning rate of the PDC strategy is nearly 20% lower than that of all other strategies. This is because the attack agent executes its task step by step from the outermost layer to the innermost layer. The PDC strategy searches the core layer first, which makes it difficult to discover the attack in time, so the attack agent can complete its task with a higher probability. When the defense agent adopts the PDM and UCB1 strategies, its winning rate basically stays between 35% and 53%, which is broadly consistent with the report in the literature [26] that 62% of attacks were discovered only after completion (equivalent to a 38% defense winning rate); this also supports the credibility of the model.
Except for the PDC strategy, there is no strict dominance relationship among the different defense strategies, which means that the “Catch the Cyber Thief” game is essentially an asymmetric game model whose outcome is affected by both attack and defense strategies.
5.2. Experiment 2: Analyzing the Asymmetry in the Spatial Dimension
In this experiment, we analyze the asymmetry from the spatial dimension, mainly including two aspects: (1) the initial position of the attack agent; and (2) the scale of the network topology. The defense agent still performs actions simultaneously with the attack agent.
5.2.1. Case 1: Different Initial Positions
Different from Experiment 1, where the initial position of the attack agent is assumed to be random, Experiment 2 adds a situation in which the initial position of the attack agent is fixed as host 1. The defense agent mainly uses three defense strategies. Figure 5 shows the winning rates of the defense agent and the probabilities with which the UCB1 algorithm selects each host for detection.
Generally speaking, when the attack agent starts from a fixed initial position, the defense winning rate is higher than when it starts from a random initial position. This is because, under a random initial position, the attack path of the attack agent may be shorter, so the attack agent can finish its task before the defense agent “captures” it.
Figure 6 shows the probabilities with which the UCB1 algorithm selects each host for detection. It is worth noting that the performance of the UCB1 algorithm with a fixed initial position is much better than with a random position, and its winning rate is close to 80% to 90%. This is because, when the initial position of the attack agent is fixed, the UCB1 algorithm preferentially detects the host with the highest upper confidence bound in the t-th episode: if the defense agent has received higher rewards from a certain host before, it remains biased toward detecting this host first. When the attack agent starts from a random initial position, the defense agent can no longer discover such regularities through the UCB1 algorithm, so it needs to make better use of observations for further algorithm optimization.
5.2.2. Case 2: Network Topologies of Different Scales
The second group of experiments analyzes the influence of topology scale on the game, adding environments with 30 and 40 hosts to the simulation. The detailed environment settings are given in Appendix A. The number of footholds of the attack agent is set to 3, and the defense strategies include PDM, UCB1, RMaxObs, and MaxObs.
Figure 7 shows the winning rates of the defense agent under different network topologies.
It can be seen that the RMaxObs and MaxObs strategies remain absolutely dominant compared with the other defense strategies. This is because the defense agent in cyberspace can “directly move” to the hosts it wants to inspect, which is also the defense agent's advantage in the spatial dimension.
In contrast, the PDM strategy, which depends on a detection preference, is affected by the network scale. Due to the combined effect of factors such as host and connectivity settings in the network environment, attack paths under different network scales also differ, so the network scale, as one aspect of the spatial asymmetry, affects the results together with other spatio-temporal elements.
5.3. Experiment 3: Analyzing the Asymmetry in the Time Dimension
Experiment 3 analyzes the asymmetry from the time dimension, mainly including two aspects: (1) the number of times the attack agent executes cleanup actions; and (2) the lag time of the defense agent.
5.3.1. Case 1: The Number of Times the Attack Agent Executes Cleanup Actions
The number of footholds is set to 3, and the defense strategies include PDM, UCB1, RMaxObs, and MaxObs. The attack agent executes 0 to 10 “cleanup” actions. Figure 8 shows the winning rates of the defense agent under different numbers of “cleanup” actions, where the horizontal axis is the number of cleanup actions and the vertical axis is the winning rate of the defense strategies.
It can be seen that the winning rates change in a zigzag pattern. Taking the MaxObs strategy adopted by the defense agent in the 20-host environment as an example, the winning rate of the defense agent is 73.6% when the attack agent performs six cleanup actions, whereas it is 80.4% and 77.0% when the attack agent performs five and seven cleanup actions, respectively. This is because, if the number of cleanup actions is too small, the risk of being discovered increases, while if the attack agent executes too many cleanup actions, the defense agent has more time to detect the hosts.
From the perspective of the performance of different defense strategies, when the network scale is 30 hosts and the attack agent executes four cleanup actions, some defense winning rates increase significantly, even exceeding those of the MaxObs and RMaxObs strategies. When the network size is 40 hosts, the performance of PDM, UCB1, and RMaxObs varies with the number of cleanup actions, and no single defense strategy is completely dominant.
This phenomenon indicates that there may exist an “attack rhythm” when the attack agent executes its task, which is very similar to the “princess and monster” game. In the “princess and monster” game, although the princess is in a dark room where the monster cannot be observed, she can still avoid the monster's “hunt” by waiting for a random time before moving. The situation is similar in the “police and thief” game: since the thief cannot observe the location of the police, he can only avoid their pursuit by adjusting his own marching rhythm. In our game, the winning rates of the defense agent do not change monotonically with the number of cleanup actions the attack agent executes, and the trend also differs across environments.
The discovery of the attack rhythm phenomenon is very close to the “low and slow” [1] decision-making mode observed in real intrusions, and it also shows that the “Catch the Cyber Thief” game is asymmetric in the time dimension. How to maintain a good attack rhythm will be an issue worthy of in-depth discussion.
5.3.2. Case 2: The Lag Time of the Defense Agent
The second set of experiments analyzes the effect of the defense lag time on the game. The number of footholds of the attack agent is set to 3, the number of cleanup actions the attack agent executes is set to 5, and the defense strategies include PDM, UCB1, RMaxObs, and MaxObs. The lag time is set from 0 to 10. Figure 9 shows the winning rates of the defense agent under different lag times; the horizontal axis is the lag time of the defense agent, and the vertical axis is the winning rate of the defense strategies.
The results show that the longer the defense agent lags, the lower the winning rates of RMaxObs and MaxObs. Although these two strategies can quickly detect the target host based on observations, the attack agent has already completed some actions in advance and may have performed multiple trace cleanups, so the observations seen by the defense agent no longer reflect the real state of the system. However, the defense lag time has no additional impact on the PDM and UCB1 strategies, because these two algorithms do not depend on observations; the UCB1 algorithm only judges whether an attack exists based on rewards. Therefore, the defense lag time has little effect on them.
In the experiments, a defense lag time of 0 is an idealized situation. Considering that attack actions need two or three time steps, even if the defense agent responds in time or adopts a polling mode to detect hosts in the environment, it cannot be guaranteed that the attack and defense agents start from the same initial state. Therefore, in follow-up studies, it is suggested to set the initial lag time within a reasonable range.
6. Related Work
We divide the related work on network attack–defense game modeling into three aspects: traditional game-theoretic approaches, stochastic games and partially observable stochastic games, and other similar game models.
Traditional game theoretic approaches. Due to the importance of network intrusion defense problems, researchers have proposed different network attack–defense game models based on various game-theoretic approaches, such as differential games, dynamic games, and others. Van Dijk et al. [4] proposed the “FlipIt” game based on the differential game, which modeled the network attack–defense game as a process of “stealthy takeover” of a network asset. Because of the simple design of attack and defense actions in this model (each player has only one action, taking control of the resource by “flipping” it at a chosen time) and its ability to depict the characteristic that both sides are invisible to each other, this model has been investigated and adapted extensively [27,28,29]. Chapman [30] designed a “Cyber Hide-and-Seek Game” as a two-sided search problem, in which one player hides k hosts with a certain strategy in an n-host network topology. Zhu et al. [31] modeled the network attack–defense game as a Colonel Blotto game [32], where multiple attackers and defenders compete for multiple distributed resources. Similar problem descriptions have also appeared in the literature [29,33,34,35]. Since traditional game-theoretic models do not define elements such as states and observations, they often assume that the tasks and actions of both players are identical, which makes it difficult to analyze in depth the complex interaction process between the attacker and defender.
Stochastic games and partially observable stochastic games. Stochastic games (SGs), also known as Markov games, describe settings in which players compete or cooperate with each other in a sequential decision-making manner. Wang et al. [6] used Stochastic Petri Nets (SPNs) to build a network attack–defense simulation environment based on a stochastic game. Chung et al. [7] proposed a simplified network attack–defense game based on the stochastic game. Elderman et al. [8] also modeled an adversarial attacker–defender game, in which the attacker aims to reach the target host by taking actions of different values, the defender needs to stop the attacker by allocating different defense values, and neither player can see the actions and utilities of the opponent. Their results show that when two agents try to adapt to each other, neither can ensure a long-term victory.
Although the stochastic game approach can reflect complex interactions between the attack and defense agents, its assumption that the agents receive the real state of the system is clearly inconsistent with reality. A partially observable stochastic game models a problem in which agents have no access to the exact environmental state but only receive an observation of the true state through an observation function [12]. Wu et al. [36] proposed an incomplete-information stochastic game to model a network attack–defense game, in which the attacker and defender choose actions from given behavior sets and receive rewards. Schwartz et al. [5] used the partially observable stochastic game (POSG) to model the network attack–defense game; the authors use a Bernoulli process to model the defender's impact on the attacker's task and distinguish the defender's behavior into detection and response behaviors, where the detection behavior can change the network environment and the response behavior does not. Horák et al. [37] used a one-sided POSG to address the issue of honeypot reallocation against the attacker's lateral movement: in a graph network, the attacker aims to reach a specific target node by traversing infected nodes, while the defender aims to detect the attacker and increase the cost of reaching the target by deploying honeypots.
Neither stochastic games nor partially observable stochastic games (POSGs) restrict the order in which the agent’s actions are executed. In order to characterize the limited observation ability of both attack and defense sides, researchers have started using partially observable stochastic games to describe network attack–defense game models.
Other similar game models. Game models in other fields can also be used for comparison. Pursuit-evasion games (PEGs) in the physical world can help us understand and build attack–defense games (ADGs). Classic PEG models include hide-and-seek [38], Princess and Monster [39], and Green Security Games [40]. In these models, the pursuer's purpose is to catch the evader as fast as possible, and the evader's purpose is to avoid the pursuit or to complete its own goal before being caught. The setting of tasks and observations is the biggest difference among these PEG models. The hide-and-seek game assumes the evader needs to reach a goal while running in a maze. The Princess and Monster game distinguishes the two players' abilities to observe and move: the princess has a larger observation range but moves more slowly, while the monster has a smaller observation range but moves faster. In Green Security Games, the pursuer needs to track the evader by analyzing the orientation of footprints. This is similar to the intrusion defense problem, where a defender needs to examine log information that may record the attacker's actions.
Chess and board games (CBGs) include two classic extensive-form games, i.e., Go and Texas Hold'em poker. In these two games, both players take turns making decisions, and the game can be modeled as a game tree according to the action sequences; players need to choose the best actions according to their information sets. Go is a two-player complete-information game (a complete-information game means that all past and present game information can be known at any stage of the game), and both players can see each other's actions and states. Texas Hold'em poker is a multiplayer, incomplete-information game: players cannot see other players' hidden hole cards and can only make inferences from the opponents' visible behavior, the betting situation, and the public board cards.
7. Conclusions and Future Work
This work presents a novel attack–defense game model called “Catch the Cyber Thief”. The innovation of this model is that it integrates four characteristics of real attack–defense decision-making into a POSG-based formulation. The proposed game systematically defines the actions, observations, rewards, and strategies of the attack and defense agents and analyzes the nature of the game model through experiments. The experimental results show that the performance of the attack and defense strategies changes with the opponent's strategy and the experimental conditions. By adjusting the experimental parameters, we found that there may be an “attack rhythm” in attack decision-making. The game model provides a basic mathematical tool for studying network attack–defense interactions. Researchers can conduct further research on this model, such as analyzing potential risks, designing robust defense strategies, and analyzing patterns of attack decisions.
In the future, we plan to conduct a comparative analysis between our proposed Catch the Cyber Thief model and other existing models such as the Stackelberg hypergame, Partially Observable Markov Decision Processes (POMDPs), and evolutionary games. This analysis will focus on analyzing the advantages and disadvantages of modeling network attack and defense problems with different models.
Furthermore, we aim to explore the implications of solving the game equilibrium of attack–defense strategies in real-world network security scenarios. Understanding the strategic outcomes and implications of achieving equilibrium in such games can provide valuable insights into developing robust defense mechanisms against cyber threats.
Additionally, we are interested in investigating how our proposed model could be implemented in network emulators or simulators such as NS-3. By integrating our model into these simulation environments, we can evaluate its performance, scalability, and practicality in a controlled network setting. This implementation would allow us to assess the real-world applicability and effectiveness of our proposed approach in enhancing network security measures.