In this paper, we propose an intrusion detection method based on adaptive sample distribution dual-experience replay reinforcement learning. The method takes network traffic data and related network activity features as input; these are preprocessed and their features extracted for the reinforcement learning model. During the training phase, the model interacts with the reinforcement learning environment we developed for intrusion detection, with the objective of maximizing the cumulative reward, enabling it to learn attack patterns from network traffic characteristics and to identify novel intrusions. Using the learned attack patterns, the model then classifies the input network traffic, determining whether it corresponds to normal network behavior or potential intrusion activity.
3.2. Reinforcement Learning Environment for Intrusion Detection
To integrate reinforcement learning with intrusion detection, we developed a custom intrusion detection environment suitable for reinforcement learning based on OpenAI Gym [42]. Gym is an open-source toolkit widely used in the field of reinforcement learning. It provides a unified interface and standardized environment definitions, allowing researchers to experiment with and compare reinforcement learning algorithms on a common platform. Specifically, the components of our intrusion detection reinforcement learning environment are as follows:
State space. In the intrusion detection task, we treat network traffic features such as packet size, network protocol, and transmission information as the state, and all states form the state space. Because the features include discrete attributes such as protocol type and port number as well as continuous attributes such as packet size and transmission rate, the overall state space is a hybrid space, and these features therefore need to be preprocessed in the implementation. The state space S is represented as follows:

S = {s_1, s_2, …, s_n}    (1)

where n represents the number of network traffic features and each s_i represents a distinct network traffic feature.
Action space. The action space is responsible for assigning intrusion labels to the incoming network traffic data, i.e., categorizing each sample as either normal or a particular attack type. Typically, the output of a reinforcement learning algorithm is a probability distribution over actions. In the context of intrusion detection, the model therefore generates a probability vector whose length equals the number of intrusion classes, with one entry per class, and the agent selects an action based on this classification probability vector. The action space A is represented as follows:

A = {a_1, a_2, …, a_m}    (2)

where m represents the number of intrusion categories (including the normal type) and each a_i represents an intrusion detection label.
Reward function. The reward function provides feedback on the intrusion detection result, reflecting whether an attack was successfully detected or a false alarm was raised. If the agent's classification action matches the actual label, a positive reward is given; if it does not, as in the case of a false positive or false negative, a negative reward is given. The intrusion detection agent thus optimizes the model according to this reward feedback, striving to obtain as much positive reward as possible and thereby maximize the cumulative reward, which is one of the metrics used in reinforcement learning to measure learning effectiveness. The reward function R is represented as follows:

R(s_t, a_t) = { +1, if a_t = y_t;  −1, if a_t ≠ y_t }    (3)

where a_t is the classification action chosen by the agent at time step t and y_t is the true intrusion type of the sample input at time step t.
State transition. The state transition describes how the environment changes after the agent takes an action. In the intrusion detection task, a state transition means that, once the traffic sample of the current time step has been classified, the environment moves to the traffic sample of the next time step. The state transition model is represented as follows:

s_{t+1} = { the next traffic sample in the sequence, with probability p;  a sample drawn at random from the state set S, with probability 1 − p }    (4)

where s_t represents the state at the current time step and s_{t+1} represents the new state reached after the transition to the next time step. The transition thus covers two scenarios: with probability p the environment moves to the next sample, and with probability 1 − p it randomly selects a sample from the state set S.
Termination condition. The termination condition defines the circumstances under which the task concludes. In our setting, a fixed number of time steps serves as the criterion for task completion: the agent stops interacting with the environment after handling a predetermined number of traffic samples.
In summary, the environment we developed presents a versatile interface for training and evaluating the performance of agents. The environment also supports customized reinforcement learning algorithms, customized reward function design, and configurable parameter options to facilitate model training and optimization.
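As a concrete illustration, a minimal sketch of such an environment is given below. It assumes a preprocessed feature matrix X and an integer label vector y (hypothetical names) and the classic Gym reset/step interface; the ±1 reward values, the transition probability p, and the fixed episode length follow the definitions above.

```python
import numpy as np
import gym
from gym import spaces

class IntrusionDetectionEnv(gym.Env):
    """Sketch of an intrusion detection environment (classic Gym reset/step API)."""

    def __init__(self, X, y, n_classes, p=0.9, episode_len=1000):
        super().__init__()
        self.X, self.y = X, y                       # preprocessed traffic features and true labels
        self.p = p                                  # probability of moving to the next sample in order
        self.episode_len = episode_len              # fixed number of time steps per episode
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf,
                                            shape=(X.shape[1],), dtype=np.float32)
        self.action_space = spaces.Discrete(n_classes)   # normal class + attack categories

    def reset(self):
        self.t = 0
        self.idx = np.random.randint(len(self.X))
        return self.X[self.idx].astype(np.float32)

    def step(self, action):
        # Reward: +1 for a correct classification, -1 for a false positive/negative.
        reward = 1.0 if action == self.y[self.idx] else -1.0
        # State transition: next sample with probability p, otherwise a random sample from S.
        if np.random.rand() < self.p:
            self.idx = (self.idx + 1) % len(self.X)
        else:
            self.idx = np.random.randint(len(self.X))
        self.t += 1
        done = self.t >= self.episode_len           # termination after a fixed number of steps
        return self.X[self.idx].astype(np.float32), reward, done, {}
```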
The interaction process between the agent and the environment is depicted in Figure 2. During both the training and testing phases, the agent, acting as the reinforcement learning model for intrusion detection, interacts with the environment. At each time step, the agent receives the current state provided by the environment, which includes information such as the current network traffic features. Based on the learned policy, the agent selects an action to classify the current network traffic and applies it to the environment. The environment provides the agent with an immediate reward based on the action taken, and also returns the updated network traffic state. The agent updates its knowledge and strategy based on this reward feedback to better adapt to the environment. This process continues as the agent interacts with the environment, ultimately refining its policy to attain a higher cumulative reward. A higher cumulative reward signifies that the agent has acquired valuable knowledge from the characteristics of the network traffic and is capable of accurately classifying network traffic information.
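A condensed view of this interaction loop is sketched below, using the environment sketch above with hypothetical training arrays X_train and y_train and, for brevity, a random policy in place of the learned agent.

```python
# Hypothetical rollout with the IntrusionDetectionEnv sketch above; a trained agent's
# policy would replace the random action selection used here for illustration.
env = IntrusionDetectionEnv(X_train, y_train, n_classes=5, p=0.9, episode_len=1000)
state = env.reset()
total_reward = 0.0

for step in range(1000):
    action = env.action_space.sample()               # placeholder for agent.select_action(state)
    next_state, reward, done, _ = env.step(action)   # immediate reward + next traffic state
    # A learning agent would store (state, action, reward, next_state, done) and update here.
    total_reward += reward
    state = env.reset() if done else next_state

print("cumulative reward:", total_reward)
```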
3.3. Detection Model Based on Adaptive Sample Distribution Dual-Experience Replay RL
The adaptive sample distribution dual-experience replay reinforcement learning algorithm that we propose is an improvement based on the Deep Q-Network (DQN) [43]. DQN is a reinforcement learning algorithm that combines deep neural networks with Q-learning. It uses a neural network to approximate the Q-value function, which evaluates the value of taking a specific action in a given state, enabling learning and decision making in complex environments. The realization of DQN involves two crucial techniques, namely experience replay and the target network. DQN stores the agent's past experiences, i.e., the quadruples of state, action, reward, and next state obtained during each interaction with the environment, in an experience replay buffer and then samples from these experiences at random during training. This breaks the temporal correlation between samples and improves sample efficiency. The target network is used to compute the target Q-value, contributing to the stability of the training process; it has the same structure as the primary network but independent parameters, which are periodically updated from the parameters of the primary network.
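For reference, the two DQN components mentioned here, the experience replay buffer and the periodically synchronized target network, can be sketched as follows in PyTorch; the QNetwork architecture and the dimensions are illustrative assumptions, not taken from the paper.

```python
import random
from collections import deque
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small MLP mapping traffic features to one Q-value per classification label."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, x):
        return self.net(x)

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tuples and samples them at random."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between consecutive samples.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Target network: same architecture as the primary (policy) network, independent parameters,
# periodically synchronized with the primary network to stabilize the target Q-values.
policy_net = QNetwork(state_dim=41, n_actions=5)   # example dimensions, not from the paper
target_net = QNetwork(state_dim=41, n_actions=5)
target_net.load_state_dict(policy_net.state_dict())
```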
However, in intrusion detection scenarios, the highly imbalanced nature of network traffic data often results in models being biased towards the majority class samples. While using resampling techniques may alleviate the problem of imbalanced sample distribution, it could potentially result in the loss of crucial information. Therefore, we require an approach that can address imbalanced sample distribution and accurately reflect the true distribution of data samples.
We propose a reinforcement learning algorithm based on adaptive sample distribution dual-experience replay, in which a second experience replay buffer is added to DQN. The experiences in the second buffer are distributed according to per-class weights, which determine each class's proportion in the buffer. As depicted in Algorithm 1, in addition to initializing the primary experience replay buffer, we also initialize a secondary experience buffer, allocate an initial weight to each class, and set up a switch window for the experience replay buffers. During the experience storage phase, experience quadruples are stored in both buffers; however, unlike the primary buffer, which collects all experiences, the secondary buffer decides whether to admit the current experience based on the weight of its class. During the experience replay phase, the algorithm determines which buffer to sample from according to the switch window. After completing a batch of neural network updates, the algorithm compares the loss of the current batch with that of the previous batch to determine whether the loss has increased or decreased; we refer to this as the loss trend. We hypothesize that an increasing loss trend indicates the presence of difficult-to-classify minority class samples, and we therefore update the weights to adjust the distribution of samples in the secondary experience buffer.
Algorithm 1: Dual-Experience Replay DQN
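A rough Python sketch of this storage and buffer-switching logic is given below, reusing the ReplayBuffer sketch above; the admission rule, the alternating switch rule, and the method names are illustrative assumptions rather than the exact procedure of Algorithm 1.

```python
import numpy as np

class DualReplayDQNAgent:
    """Illustrative sketch of the dual-experience replay bookkeeping (not the paper's exact pseudocode)."""

    def __init__(self, n_classes, capacity, switch_window=10):
        self.primary = ReplayBuffer(capacity)                 # collects every experience
        self.secondary = ReplayBuffer(capacity)               # class-weighted experiences
        self.class_weights = np.ones(n_classes) / n_classes   # initial weight for each class
        self.switch_window = switch_window                    # updates before switching buffers
        self.updates = 0
        self.prev_loss = None

    def store(self, state, action, reward, next_state, done, label):
        self.primary.push(state, action, reward, next_state, done)
        # The secondary buffer admits an experience with probability proportional to its class weight.
        if np.random.rand() < self.class_weights[label] / self.class_weights.max():
            self.secondary.push(state, action, reward, next_state, done)

    def sample(self, batch_size):
        # The switch window decides which buffer the current batch of updates is drawn from.
        use_secondary = (self.updates // self.switch_window) % 2 == 1
        buf = self.secondary if use_secondary and len(self.secondary) >= batch_size else self.primary
        return buf.sample(batch_size)

    def after_update(self, loss):
        # An increasing loss trend is taken as a sign of hard-to-classify minority samples,
        # triggering the adaptive sample distribution update of the class weights (Algorithm 2).
        if self.prev_loss is not None and loss > self.prev_loss:
            self.update_class_weights(loss)
        self.prev_loss = loss
        self.updates += 1

    def update_class_weights(self, loss):
        # Placeholder; one possible weight update is sketched alongside Algorithm 2 below.
        pass
```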
The weights of each class in the secondary experience buffer are subject to an adaptive, dynamically updating process, which we refer to as the adaptive sample distribution; its update is illustrated in Algorithm 2. We begin by computing a classify metric that combines the F1-score of each class with the loss trend, aiming to assess the relative significance of the samples in each class and thereby capture how strongly each class's samples affect the overall classification performance. The calculation of the classify metric is shown in (5), where L_cur represents the loss of the current training batch, L_prev represents the loss of the previous training batch, and the difference between the two indicates the loss trend; F1_i represents the F1-score of class i, and ε is a small factor ensuring that the denominator is not zero.
Algorithm 2: Adaptive Sample Distribution
To comprehensively consider both the classification performance and the distribution of samples across classes, and thus adjust the weight of each class more finely, we propose an analysis metric. Its calculation is shown in (6), where n_i represents the number of samples belonging to class i, S is the total number of samples in each batch, and N represents the number of classes. This metric integrates the classify metric from the previous step with the proportion of each class's samples among all samples.
For each class, we compute its classify and analysis metrics and incorporate them into the weight calculation, introducing a historical factor that adjusts the influence of the current analysis metric and of the previous weights on the new weight, so as to balance current metrics against historical weights. Finally, the weights of all classes are normalized. The update formula for the weight w_i of class i is given in (7), where β denotes the historical factor, k denotes the number of weight iterations, and the historical weights sequence of class i records its weights from earlier iterations. When considering the historical weights sequence, we weight the previous historical weights by gradually decaying them according to a weight factor and then compute their sum. This preserves the influence of historical weights on the new weight, but as the historical factor decays, the impact of earlier historical weights gradually diminishes. This balancing mechanism allows the weight to retain the influence of historical information while focusing more on the most recent situation.
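Since Equations (5)-(7) are not reproduced here, the sketch below gives one plausible reading of the adaptive weight update, filling in the placeholder left in the earlier agent sketch: the classify metric is taken as the loss trend divided by the per-class F1-score (plus ε), the analysis metric scales it so that minority classes receive larger values, and the new weights blend this with exponentially decayed historical weights before normalization. These concrete forms are assumptions, not the paper's exact formulas.

```python
import numpy as np

def update_class_weights(weights_history, f1_scores, class_counts, batch_size,
                         loss_cur, loss_prev, beta=0.9, eps=1e-8):
    """One plausible reading of the adaptive sample distribution update (cf. Algorithm 2)."""
    loss_trend = loss_cur - loss_prev                           # positive when the loss increased

    # Classify metric: loss trend weighted against each class's F1-score (our reading of Eq. (5)).
    classify = loss_trend / (np.asarray(f1_scores, dtype=float) + eps)

    # Analysis metric: combine the classify metric with each class's share of the batch
    # (our reading of Eq. (6)); classes with a small share receive larger values.
    proportions = np.asarray(class_counts, dtype=float) / float(batch_size)
    analysis = classify * (1.0 - proportions)

    # Blend the current analysis metric with exponentially decayed historical weights
    # (our reading of Eq. (7)): more recent weights decay the least.
    history = np.zeros_like(analysis)
    for age, past in enumerate(reversed(weights_history)):
        history += (beta ** (age + 1)) * np.asarray(past, dtype=float)
    new_weights = (1.0 - beta) * analysis + history

    # Normalize so the weights of all classes form a valid sampling distribution.
    new_weights = np.abs(new_weights)
    return new_weights / (new_weights.sum() + eps)
```

In the agent sketch above, such a function would be invoked from after_update whenever the loss trend increases, with weights_history holding the previously computed weight vectors.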
By incorporating adaptive sample distribution dual-experience replay, the overall architecture of our algorithm is depicted in Figure 3. We input the network traffic features as the state into the model and approximate the state-action value function (i.e., the Q-function) with a deep neural network. During training, the agent interacts with the environment, classifying traffic samples and storing the experiences in the two separate experience replay buffers. The agent then selects a buffer according to the switch window and samples experiences from it. The target Q-value is computed using the target network and represents the expected long-term cumulative reward for predicting a certain classification label given the network traffic features. The loss function is then optimized to update the network parameters, continuously learning and refining the Q-values to approximate the optimal classification policy. In addition, the prediction results and the training status of the neural network are used to generate the corresponding classify and analysis metrics which, combined with the historical weights sequence, adaptively adjust the experience distribution in the secondary experience buffer to achieve better intrusion detection performance.
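Putting these pieces together, one training update could be sketched as follows; the tensor conversion and the smooth L1 loss are implementation assumptions, and agent refers to the dual-buffer sketch above.

```python
import numpy as np
import torch
import torch.nn.functional as F

def train_step(agent, policy_net, target_net, optimizer, batch_size, gamma=0.99):
    # 1. Draw a batch from whichever buffer the switch window currently selects.
    batch = agent.sample(batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.as_tensor(np.array(states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # 2. Target Q-value from the target network: the expected long-term cumulative reward
    #    for predicting a classification label given the traffic features.
    with torch.no_grad():
        target = rewards + gamma * target_net(next_states).max(dim=1).values * (1.0 - dones)

    # 3. Q(s, a) from the primary network for the labels actually chosen, and the loss.
    q_sa = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 4. Feed the loss back so the adaptive sample distribution can react to the loss trend.
    agent.after_update(loss.item())
    return loss.item()
```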
We assume there are n traffic data samples and m discrete actions (representing the different classification labels); s represents the feature information of the current network traffic, a represents the classification label predicted by the agent, r is the feedback on the correctness of the agent's prediction, γ is the discount factor, and s' represents the feature information of the next network traffic sample after the agent takes action a. The Q-value update can be summarized by the following equation:

Q(s, a) ← r + γ max_{a'} Q(s', a')    (8)

where s, s' ∈ S and a, a' ∈ A. Additionally, we employ the ε-greedy strategy to strike a balance between exploration and exploitation: when making a decision, the agent chooses a random action with probability ε and selects the currently known optimal action with probability 1 − ε. This balances the exploration of unknown actions against the exploitation of known optimal actions, helping the agent to optimize its strategy.
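In code, the ε-greedy selection described here could look like the following sketch, reusing the policy network from the earlier sketches.

```python
import random
import torch

def select_action(policy_net, state, n_actions, epsilon):
    """Epsilon-greedy selection: explore with probability epsilon, exploit otherwise."""
    if random.random() < epsilon:
        return random.randrange(n_actions)            # explore: random classification label
    with torch.no_grad():
        q_values = policy_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        return int(q_values.argmax(dim=1).item())     # exploit: label with the highest Q-value
```

In practice, ε is commonly annealed over the course of training so that exploration dominates early on and exploitation dominates later.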
The proposed adaptive sample distribution dual-experience replay reinforcement learning algorithm aims to address the issue of sample imbalance in intrusion detection. By introducing a second experience buffer, the algorithm focuses on the scarcer and more challenging minority class samples, effectively preventing them from being overshadowed by the majority class samples. We also implement an adaptive dynamic updating mechanism that flexibly adjusts the distribution of samples of each class in the experience buffer, allowing the model to focus more on samples that are deemed challenging at the current stage and thereby enhancing the learning effectiveness for minority categories. Furthermore, the switch window between the experience buffers allows the model to transition seamlessly between the two, preventing an excessive focus on one buffer and the neglect of the other, and ensuring comprehensive learning across all classes. This design enhances the model's versatility and robustness, enabling it to adapt to evolving intrusion detection scenarios and to more accurately reflect the true distribution of the samples.