Article

Machine Learning-Based Methodologies for Cyber-Attacks and Network Traffic Monitoring: A Review and Insights

1 Invest & Engineering S.r.l., Viale Paolo Borsellino e Giovanni Falcone, 17, 70125 Bari, BA, Italy
2 Digital Innovation S.r.l., Via Edoardo Orabona, 4, 70125 Bari, BA, Italy
3 Kad3 S.r.l., Via Baione, snc, 70043 Monopoli, BA, Italy
4 Department of Computer Science, University of Bari “Aldo Moro”, Piazza Umberto I, 1, 70121 Bari, BA, Italy
* Author to whom correspondence should be addressed.
Information 2024, 15(11), 741; https://doi.org/10.3390/info15110741
Submission received: 29 October 2024 / Accepted: 31 October 2024 / Published: 20 November 2024

Abstract: The number of connected IoT devices is increasing significantly due to their many benefits, including automation, improved efficiency and quality of life, and reduced waste. However, these devices have several vulnerabilities that have led to rapid growth in the number of attacks. Therefore, several machine learning-based intrusion detection system (IDS) tools have been developed to detect intrusions and suspicious activity to and from a host (HIDS, host IDS) or, in general, within the traffic of a network (NIDS, network IDS). The proposed work performs a comparative analysis and an ablative study of recent machine learning-based NIDSs to develop a benchmark of the different proposed strategies. The proposed work compares both shallow learning algorithms, such as decision trees, random forests, Naïve Bayes, logistic regression, XGBoost, and support vector machines, and deep learning algorithms, such as DNNs, CNNs, and LSTM, whose approach is relatively new in the literature. Ensembles are also tested. The algorithms are evaluated on the KDD-99, NSL-KDD, UNSW-NB15, IoT-23, and UNB-CIC IoT 2023 datasets. The results show that NIDS tools based on deep learning approaches achieve better performance in detecting network anomalies than shallow learning approaches, and ensembles outperform all the other models.


1. Introduction

The rapid increase in IoT devices has led to a rapid increase in the traffic generated in IoT networks. As IoT devices still have vulnerabilities, such as inadequate protection for sensitive data, unsecured network services, and a lack of access control [1], the security of these devices must be of the utmost importance; therefore, effective countermeasures and detection tools are required in order to protect the devices and their data from cyber-attacks. In this context, intrusion detection systems (IDSs) play a key role. To effectively model network traffic within a given time window, commercial IDSs primarily use statistical measurements or thresholds derived from feature sets, such as packet length, inter-arrival time, flow size, and other parameters. However, they are affected by a significant percentage of false-positive and false-negative alarms. High false-positive rates indicate that NIDSs may trigger unnecessary alarms when no attack is genuinely underway, while high false-negative rates indicate that NIDSs may frequently fail to detect attacks. This implies that some commercial solutions might not be up to the task and might be improved by the integration of machine learning techniques, with their supervised, semi-supervised, and unsupervised mechanisms, in order to learn the patterns of various normal and malicious activities from large corpora of normal and attack events at the network and host level.
In the scientific and technical literature, machine learning (ML) and deep learning (DL) methods are widely used, as they are more effective and efficient. Examples of ML techniques are support vector machines (SVMs), decision trees (DTs), and random forests (RFs), while DL techniques include convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory (LSTM) networks. It goes without saying that the effectiveness of machine learning-based IDSs mainly depends on the effectiveness of the ML algorithm itself, along with the quality of the network traffic data used to train the models. Intrusion detection in IoT networks can be characterized as a binary classification problem, in which a trained classifier aims to classify network traffic into the normal or attack class with the highest accuracy.
The objective of this paper is to present a comparative analysis and an ablative study among recent machine learning-based NIDSs to develop a benchmark of the different proposed strategies by testing both shallow and deep learning models, as well as ensemble models. Starting with the definition and types of IDSs, the most popular supervised learning algorithms and datasets in this domain are reviewed. The background section illustrates the concepts of IDSs, as well as cyber-attacks and the most popular software. Next, the methods and the datasets are illustrated. In the related work section, some widely used techniques are reviewed, along with their main advantages and disadvantages for IDSs. Then, the results are illustrated, along with a discussion.

2. Background

Computers can be vulnerable to external threats, and, as computers and similar devices are constantly connected to a network and are widely used by a large number of people to communicate and share information, constant monitoring and detection are necessary, especially because not all users have a digital education. Cybersecurity has three pillars, known as the CIA triad: confidentiality, integrity, and availability; any action compromising any of these pillars is considered a threat. However, information security also heavily relies on accountability and authenticity. Attacks against confidentiality involve passive attacks like eavesdropping; attacks against integrity involve active attacks like system scanning; and attacks against availability cause disruptions to network resources, making them unavailable to regular users, as in denial of service (DoS) and distributed denial of service (DDoS) attacks. An attack can thus be described as a sequence of events that may jeopardize a resource’s availability, confidentiality, data integrity, or security policy of any kind. Overall, a variety of cyber-attacks violate the CIA triad; more specifically, the process of tracking and evaluating network or computer traffic for indications of infiltration is known as intrusion detection. Businesses need to safeguard their networks using a range of technologies and detection techniques to counteract the many attack, intrusion, and compromise tactics that cybercriminals employ nowadays.
Cyber-threats typically vary in complexity, with different attacks having different targets, effects, and scope. The literature proposes different sequences of attack stages, but one way to conceive of an attack is as the sequence of the following five stages: reconnaissance, exploitation, reinforcement, consolidation, and pillage. The system should notice an attack within the first three phases; once it reaches the fourth or fifth stage, the system will be fully compromised. In the reconnaissance stage, the malevolent user attempts to learn as much as they can about the target system, such as its operating systems and applications, user accounts, network architecture, and other pertinent details; the objective is to collect as much data as possible in order to develop an efficient attack strategy. During the exploitation phase, a malicious user uses a specific service in an attempt to access the target machine; a service may be abused, subverted, or hacked (using stolen passwords is abuse, while SQL injection is subversion). Following an unauthorized entry into a system, during the reinforcement phase a malicious user installs more tools and services to exploit the newly obtained privileges, trying to obtain total access to the system through the misused user account and taking advantage of the programs that the account may access. During the consolidation stage, a malicious user obtains total control over the system and the degree of privilege required to accomplish their objectives. The last phase, pillage, involves data theft, corruption of crucial systems, and interruption of corporate operations as potential malevolent actions. Since computers and networks are built and developed by people, there is a chance for errors in both hardware and software; vulnerabilities may result from these defects and human mistakes.
These considerations highlight the urgency of improving current security systems, mostly represented by antivirus software, firewalls, and intrusion detection systems (IDSs). Firewalls monitor incoming and outgoing traffic based on rules and policies, acting as a barrier between secure and untrusted networks. Within the protected network, an IDS detects suspicious activity to and from hosts and within the traffic itself, taking proactive measures to log and block attacks. There are two categories of IDSs: network-based intrusion detection systems (NIDSs) and host-based intrusion detection systems (HIDSs). An HIDS monitors and analyzes activities on the system (host) where it is installed, observing parts of the dynamic behavior and state of the system, including incoming and outgoing traffic of the device, as well as changes to local files. An NIDS monitors network traffic, offering sophisticated real-time intrusion detection capabilities, and is strategically placed at various points on the network to monitor incoming and outgoing traffic to and from devices on the network. Based on the mode of detection, current NIDSs can be classified as either misuse-based (also known as signature-based) or anomaly-based. Anomaly-based network intrusion detection is one crucial area of study and advancement in the field. Anomaly detection techniques model system characteristics in order to establish normal, or standard, behavior. When something atypical happens, the IDS recognizes it as suspicious activity and sends out an alert, which can be used to identify zero-day attacks. This model has two key drawbacks: (1) a high rate of false positives and (2) the difficulty of developing a model of normal behavior and of handling the evolution of normal user behavior. With signature-based detection, by contrast, potential intrusions are found by comparing recorded events to recognized threats or attacks. Known attacks are commonly described by signatures, which include ports, source addresses, and destination addresses. These kinds of detection systems typically have excellent accuracy against known threats because the signatures are known. There is a significant disadvantage to this kind of approach, though: in the case of a zero-day attack or a modification of a known attack, the system will not be able to identify the intrusion. A signature-based IDS (SIDS), which relies on predefined signatures or patterns of known attacks, is unable to do this because it only uses regular expressions and string matching. As such, it is more accurate and produces fewer false alarms than the anomaly detection approach, but it is only useful for detecting known attacks and useless for detecting unknown ones. For instance, it could be challenging for a signature-based NIDS to identify a newly discovered intrusion type or vulnerability if it is not yet included in CVE, whereas a shift in the standard data model could prompt an anomaly-based NIDS to take quick action.

3. Methods

The primary technology of NIDSs is still traditional machine learning techniques, which are more interpretable, easier to use, and unrestricted by processing capacity than deep learning techniques. However, a growing number of intrusion detection techniques are based on deep learning, which is developing quickly.
This section is organized as follows: first, machine learning models are illustrated, distinguished by their nature, i.e., shallow or deep learning; next, metrics are described; finally, the deployment challenges are discussed.
Table 1 summarizes the models and their nature.

3.1. Shallow Learning Models

3.1.1. Decision Tree (DT)

A decision tree is a machine learning approach that uses a top-down tree form to illustrate every conceivable decision outcome. The tree is built from three components: decision nodes, branches, and leaf nodes. The decision node represents the if–then condition for categorizing an item; a branch denotes a potential attribute value, and the leaf node represents the class or label. DT is advantageous in both classification and regression settings, more specifically in scenarios where instances are attribute–value pairs with every possible value for each attribute disjoint, where target features are discrete attributes, where there is error in the training data, or where the training data contain missing attribute values. Because the decision tree employs entropy and information gain for feature selection, it often outperforms other stand-alone classifiers. Finding the feature that most effectively separates the instances into the appropriate classes is the key challenge. Furthermore, while the model handles outliers and missing values well, it has a propensity to overfit the data, and any change in the dataset can result in significant changes in the model. However, by restricting the tree’s depth, this overfitting can be avoided or reduced.
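As a minimal sketch (not the paper's actual pipeline), the following scikit-learn snippet trains a depth-limited decision tree with entropy-based splits, i.e., information gain. The synthetic data and the toy labeling rule are illustrative assumptions standing in for real NIDS flow features.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))           # 10 hypothetical flow features
y = (X[:, 0] + X[:, 3] > 0).astype(int)   # 0 = normal, 1 = attack (toy rule)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# criterion="entropy" selects splits by information gain;
# limiting max_depth curbs the overfitting discussed above.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```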

3.1.2. Naïve Bayes (NB)

Based on Bayes’ theorem, Naïve Bayes is a probabilistic classifier. With categorical values and small datasets, this probabilistic machine learning approach performs well in comparison to other classifiers. Gaussian Naïve Bayes, Bernoulli Naïve Bayes, and multinomial Naïve Bayes are the three Naïve Bayes algorithms: Gaussian Naïve Bayes is employed for continuous data values, Bernoulli Naïve Bayes for binary values, and multinomial Naïve Bayes for discrete values. The foundation of all naïve Bayesian classifiers is the idea that each feature’s value is independent of all other features’ values, as shown in (1), where $\hat{y}$ is the predicted class, $K$ is the number of classes, $C_k$ is the k-th class, $n$ is the number of features, $p(C_k)$ is the a priori probability of $C_k$, and $p(x_i \mid C_k)$ is the conditional probability of the feature $x_i$ given the class $C_k$.
$\hat{y} = \underset{k \in \{1, \dots, K\}}{\operatorname{argmax}} \; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$  (1)
To compute the class priors, one must assume a feature distribution (i.e., an event model) constructed from the training set. Under the presumption that features are independent of one another, it is possible that distinct attack types cannot be detected.
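As a minimal sketch under the Gaussian event model, the following scikit-learn snippet applies equation (1); the synthetic data and toy labels are illustrative assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))      # 8 hypothetical continuous flow features
y = (X[:, 0] > 0).astype(int)      # toy labels: 0 = normal, 1 = attack

model = GaussianNB()               # assumes a Gaussian event model per class
model.fit(X, y)

# predict_proba returns p(C_k | x) per class; predict takes the argmax of (1)
print(model.predict_proba(X[:3]))
print(model.predict(X[:3]))
```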

3.1.3. Logistic Regression (LR)

When the value of the target variable is categorical in nature, the classification procedure known as logistic regression is employed. Logistic regression is most frequently utilized when the data in question have a binary output, meaning they belong to one class or another, i.e., are either 0 or 1. In logistic regression, an “S”-shaped logistic function bounded by two asymptotic values (0 and 1) fits the data in place of a regression line; the most popular such function is the sigmoid, which helps determine a label’s likelihood. It is a mathematical tool for mapping any real number to a value between 0 and 1, and thus for turning any predicted score into a probability. The sigmoid function’s equation is displayed in (2), where $z$ is typically a linear combination of the input features $x$ and their corresponding weights $w$ (i.e., $z = w^T x + b$). If $z$ is a large positive number, the function yields a value near 1, and if $z$ is a large negative number, it yields a value extremely close to 0. $\sigma(z)$ represents the final prediction.
$\sigma(z) = \frac{1}{1 + e^{-z}}$  (2)
Therefore, $\sigma(z)$ represents a value between 0 and 1 and may easily be turned into a final binary prediction $y$ by applying a threshold, e.g., 0.5: it is presumed that $y = 1$ if the prediction is greater than 0.5 and $y = 0$ otherwise. While binary classification cases are the ideal fit for logistic regression, it may also be used for multiclass classification tasks, i.e., classification tasks involving three or more classes, if a “one versus all” technique is used, which treats the classes as separate binary classification problems. Therefore, logistic regression primarily refers to binary logistic regression with binary target variables, although it may also be able to predict other target variable types. Based on the number of categories, logistic regression may be categorized into the following types: binomial (binary) and multinomial. In binary classification, the dependent variable will only have two potential types, 1 and 0; these variables might stand for things like win or lose, yes or no, success or failure, etc. In the multinomial scenario, the dependent variable may contain three or more potential unordered categories, or the types may not have any quantitative significance.
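As a minimal sketch, the following snippet evaluates the sigmoid of equation (2) and the 0.5 threshold by hand, then fits the equivalent scikit-learn model; the weights, bias, and synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Equation (2): maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -1.2])          # hypothetical learned weights
b = 0.1                            # hypothetical bias
x = np.array([1.5, 0.3])           # one sample with two features
z = w @ x + b                      # z = w^T x + b
y_hat = int(sigmoid(z) > 0.5)      # 0.5 threshold gives the binary label
print(sigmoid(z), y_hat)

# The same decision rule, fitted from synthetic data:
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (X @ w + b > 0).astype(int)
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))
```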

3.1.4. XGBoost (XGB)

Regularized gradient boosting, or XGBoost [2], is a refined form of the gradient-boosted machine (GBM). A GBM belongs to the group of algorithms that work together to enhance decision tree performance. It combines weak classifiers, such as decision trees, in a sequential manner, similar to other boosting techniques, enabling them to optimize any differentiable loss function and create a powerful prediction model. The prediction errors of each current learner (tree) are improved by utilizing the predictions of prior learners.
The ensemble tree generates its final prediction by adding the scores of all the leaves reached when evaluating a given test sample; the sum of the predictions produced by the trees that minimize the prediction error is the final prediction, h(x), for a given sample S. The hyper-tuned parameters for GBM are 500 estimators, a maximum tree construction depth of 3, a minimum of 100 samples required for splitting, and a learning rate of 0.1. XGB uses the same gradient-boosting theory as GBM; the modeling details are the only significant distinction between them. While GBM solely considers variance, XGB utilizes a more regularized model formalization to reduce overfitting and boost generalization ability. The regularization parameter $\zeta$ is expressed mathematically in (3), where $T_l$ is the number of leaves in the tree, $w_j$ is the score on the j-th leaf, and $\lambda$ is the regularization term that controls the complexity of the model.
$\zeta = \gamma T_l + \frac{1}{2} \lambda \sum_{j=1}^{T_l} w_j^2$  (3)
Gradient boosting is a technique used by XGB to improve the loss function during model training. Generally, the LogLoss function $L$ is represented as in (4), where $N$ is the total number of observations, $p_i$ is the predicted probability that observation $i$ belongs to class $c$, and $y_i$ is the binary indicator of whether class $c$ is the correct classification for observation $i$. Most importantly, $\zeta$ regulates the model’s simplicity, while $L$ affects its predictive power. The usage of sparse matrices (DMatrix) with sparsity-aware algorithms, enhanced data structures, and support for parallelization are the primary implementation improvements of XGB. Hence, XGB makes use of hardware to process data quickly while using less memory (primary memory and cache). The optimal parameter values obtained for XGB are 100 estimators, a maximum tree depth of 8, a minimum child weight of 1, and a minimum loss reduction and sub-sample ratio of 2 and 0.6, respectively.
$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i) \right]$  (4)
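As a minimal sketch, the following snippet instantiates an XGBoost classifier with the parameter values reported above; the mapping of those values to the xgboost package's parameter names is our assumption, and the synthetic data are illustrative.

```python
import numpy as np
from xgboost import XGBClassifier  # assumes the xgboost package is installed

rng = np.random.default_rng(7)
X = rng.normal(size=(2000, 20))
y = (X[:, :5].sum(axis=1) > 0).astype(int)

model = XGBClassifier(
    n_estimators=100,             # 100 estimators
    max_depth=8,                  # maximum tree depth of 8
    min_child_weight=1,           # minimum child weight of 1
    gamma=2,                      # minimum loss reduction of 2
    subsample=0.6,                # sub-sample ratio of 0.6
    objective="binary:logistic",  # optimizes the LogLoss of equation (4)
)
model.fit(X, y)
print(model.predict(X[:5]))
```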

3.1.5. Support Vector Machine (SVM)

Support vector machine (SVM) [3] is a classification technique that employs geometric principles rather than statistical approaches to describe the likelihood of class membership. The so-called support vectors are a small portion of the training dataset; using the input samples, SVM creates lines and hyperplanes to divide the data into classes for prediction. The support vectors are the values in one class that are closest to the other class and to the separating line. The objective of SVMs is to maximize the margin between classes, that is to say, the distance between the support vectors; these are, in essence, the values that are hardest to categorize. The hyperplane thus relies mostly on a small number of observations. In situations where the data are not linearly separable, SVMs use the kernel approach to identify a plane that partitions the data linearly. The kernel method’s concept is to project the original features onto a space of higher dimension than the original space by generating nonlinear combinations of those features; it is then expected that the dataset becomes separable in the higher-dimensional space. The Gaussian kernel, often known as the radial basis function (RBF), is one of the most widely used kernels.
SVM is one of the easiest approaches for handling problem statements involving anomaly detection, and, because one-class SVM can identify uncommon occurrences, it may also be utilized to identify new attacks in IDSs. While SVM is often resistant to noise, overfitting and a lengthy training period are potential drawbacks, along with a large number of parameters. Nonetheless, the one-class SVM classifier’s strong outlier identification capacity makes it suitable for IDSs. A fast and effective IDS is desired, while lengthy training times are not. To address these shortcomings, SVMs may be adjusted in a number of ways, including dividing the data into smaller training sets and increasing the value of the radial kernel’s parameter.
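As a minimal sketch of the one-class variant, the following scikit-learn snippet fits an RBF-kernel one-class SVM on "normal" traffic only and flags points far from that distribution as anomalies; the synthetic data and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 6))   # benign flows only
attacks = rng.normal(loc=4.0, scale=1.0, size=(20, 6))   # shifted outliers

# RBF kernel; nu bounds the fraction of training points treated as outliers
ocsvm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(normal)

print(ocsvm.predict(attacks))      # mostly -1: flagged as anomalous
print(ocsvm.predict(normal[:5]))   # mostly +1: considered normal
```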

3.2. Deep Learning Models

3.2.1. Multilayer Perceptron (MLP)

Artificial neural networks are capable of performing a wide range of tasks, including data categorization and classification, through training to approximate any function. Feed-forward neural networks (FFNs), a type of artificial neural network (ANN), use a directed graph to transfer information from one node to another without creating a loop. An example of an FFN with three or more layers is the multilayer perceptron (MLP) model, which consists of an input layer, one or more hidden layers, and an output layer. Each layer is made up of numerous neurons or units. With all of the neurons in adjacent layers completely linked, information is transferred forward from one layer to the next; weighted connections connect every neuron in one layer to every neuron in the following layer. Each neuron sums the weighted values of all the neurons connected to it and adds a bias value. This result is then subjected to an activation function, which performs a mathematical transformation on the value before forwarding it to the subsequent layer. The node transmits the value on to the next layer if the operation’s value is greater than the anticipated threshold; if not, zero is sent on. Neural networks are designed to be universal function approximators; hence, a nonlinearity component, or activation function, must be introduced. The input values are sent to the output neurons via the network in this manner. The many connections among neurons inside a neural network (NN) therefore take the shape of a weighted oriented graph, where the edges represent the weighted connections among neurons, the nodes represent individual neurons, and the direction of the edges represents the direction of signal propagation. Processing an input means sending a stream of information across that graph, which is progressively transformed by the weights and activation functions of the neurons. The MLP is mathematically defined as a function $O: \mathbb{R}^m \to \mathbb{R}^n$, where $m$ is the size of the input vector $x = (x_1, x_2, \dots, x_{m-1}, x_m)$ and $n$ is the size of the output vector $O(x)$. The calculation of each hidden layer $h_i$ is defined mathematically as $h_i(x) = f(w_i^T x + b_i)$, where $h_i: \mathbb{R}^{d_{i-1}} \to \mathbb{R}^{d_i}$, $f: \mathbb{R} \to \mathbb{R}$, $w_i \in \mathbb{R}^{d_i \times d_{i-1}}$, $b_i \in \mathbb{R}^{d_i}$, $d_i$ denotes the size of layer $i$, and $f$ is the nonlinear activation function, which can be, for instance, the sigmoid function (values in the interval [0, 1]) or the hyperbolic tangent function (values in the interval [−1, 1]). For the multiclass classification problem, the MLP model can use the softmax function as the nonlinear activation function of the output layer; the softmax function outputs the probabilities of each class. There are many possible activation functions; typical ones are presented in (5)–(8).
$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$  (5)
$\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$  (6)
$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$  (7)
$\mathrm{relu}(x) = \max(0, x)$  (8)
The rectified linear unit, or ReLU, function is one of the most often utilized activation functions, particularly in intermediate layers. Its effectiveness and propensity to speed up training can be attributed to its straightforward computation, which flattens negative values to zero and leaves values greater than or equal to zero unchanged. It is considered to be the fastest and least expensive way to train large amounts of data when compared to traditional nonlinear sigmoid and tangent activation functions.
Since the purpose of the network is to minimize the prediction error of the output with respect to the expected value, it is necessary to calculate this error and proceed backward along the oriented graph to calibrate the weights according to how much they contributed to the erroneous output. The principle behind error minimization in ANNs with supervised learning is gradient descent, a technique that aims to minimize the loss function as much as possible. The backpropagation algorithm calculates the gradient of the loss function with respect to the network weights for a single input–output example, doing so one layer at a time, backward from the outputs to the inputs, to derive by how much the weights need to be changed one by one. In general terms, for many hidden layers, the MLP is formulated as follows: $H(x) = H_l(H_{l-1}(H_{l-2}(\dots H_1(x))))$. Networks built by stacking hidden layers in this way are generally called deep neural networks (DNNs).
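As a minimal sketch (assuming TensorFlow/Keras is available), the following builds an MLP of the kind described above: stacked dense layers with ReLU, a softmax output, and gradient-based training via backpropagation. The layer sizes, synthetic data, and toy labels are illustrative assumptions; 41 inputs echo the feature count of KDD-99.

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 41)).astype("float32")  # e.g., 41 KDD-99 features
y = (X[:, 0] > 0).astype("int32")                  # toy binary labels

model = Sequential([
    Input(shape=(41,)),
    Dense(64, activation="relu"),    # hidden layer h1 with ReLU
    Dense(32, activation="relu"),    # hidden layer h2 with ReLU
    Dense(2, activation="softmax"),  # class probabilities via softmax (7)
])
# Adam is a gradient-descent variant; backpropagation computes the gradients
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32, verbose=0)
```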

3.2.2. Convolutional Neural Network (CNN)

A convolutional neural network (CNN) [4] is an advanced form of artificial neural network (ANN) used to handle grid-topology input, such as sequences or images. CNNs are capable of processing one-, two-, and three-dimensional data. The CNN architecture aims to resemble the multiple layers of neurons in the human visual system, each of which is in charge of identifying a distinct feature in the data. Through the application of pertinent filters, a CNN may effectively capture dependencies (spatial and/or temporal) in the data, resulting in a good internal representation of the world. A CNN typically consists of alternating convolutional and pooling layers. The name “convolutional neural network” indicates that the network uses a mathematical operation called convolution, a specialized type of linear operation that consists of the application of a sliding window function (also known as a kernel or filter) to a matrix of pixels representing an image. Applying multiple convolutional layers allows passing from low-level features to high-level features. Following every convolution operation, a ReLU activation function is applied; by teaching the network nonlinear relationships between components in the image, this function strengthens the network’s ability to recognize various patterns. The pooling layer, usually applied after a convolution layer, is a downsampling technique that extracts the most important features from the convolved matrix. This is accomplished by using an aggregation procedure to shrink the size of the convolutional matrix, or feature map, which lowers the amount of memory required for network training; pooling is also important to reduce overfitting. Max pooling, average pooling, and sum pooling are the three most often used aggregating functions. When the pooling function is applied, the size of the feature map decreases. The last pooling layer flattens the feature map so that the fully connected layer can analyze it. The final layers of the convolutional neural network are fully connected layers, whose inputs match the one-dimensional matrix flattened by the final pooling layer. ReLU activation functions are employed here, and a softmax prediction layer is used to calculate probability values for each of the potential output labels; the final predicted label is the one with the greatest probability score. Filter weights, defined at each convolutional level, are established during the training stage by an iterative update procedure: they are first initialized and then adjusted by backpropagation to minimize a cost function.
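As a minimal sketch (assuming TensorFlow/Keras is available), the following 1D CNN treats each flow's feature vector as a sequence and follows the alternating convolution/pooling, flattening, and softmax structure described above; the shapes and synthetic data are illustrative assumptions.

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Conv1D, MaxPooling1D, Flatten, Dense

rng = np.random.default_rng(11)
X = rng.normal(size=(800, 40, 1)).astype("float32")  # 40 features, 1 channel
y = (X[:, 0, 0] > 0).astype("int32")

model = Sequential([
    Input(shape=(40, 1)),
    Conv1D(32, kernel_size=3, activation="relu"),  # convolution + ReLU
    MaxPooling1D(pool_size=2),                     # max-pooling downsampling
    Conv1D(64, kernel_size=3, activation="relu"),
    MaxPooling1D(pool_size=2),
    Flatten(),                                     # flatten the feature map
    Dense(2, activation="softmax"),                # softmax prediction layer
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=2, verbose=0)
```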

3.2.3. Long Short-Term Memory (LSTM) and Gate Recurrent Unit (GRU)

There are two distinct varieties of recurrent neural networks (RNNs): LSTM [5] and GRU [6]. RNNs have more processing options than simple neural networks like multilayer perceptrons (MLPs): they can pass information through several steps and store data momentarily for use at a later time.
A method called backpropagation is used to train the RNN model, which is mathematically constructed as illustrated in (9).
$a_t = b + W h_{t-1} + U x_t$
$h_t = \tanh(a_t)$
$o_t = c + V h_t$
$\bar{y} = \mathrm{softmax}(o_t)$  (9)
Let $x_t$ be the RNN model’s input at time step $t$. $a_t$ is the intermediate activation vector at time step $t$, in which $b$ is the bias vector that is added to the activation, $W$ is the weight matrix associated with the previous hidden state $h_{t-1}$, and $U$ is the weight matrix associated with the current input $x_t$. Next, $h_t$ is the hidden state at time step $t$, where the hyperbolic tangent is applied to the activation $a_t$ to introduce nonlinearity. $o_t$ is the output before the activation function; $c$ is the bias vector for the output, and $V$ is the weight matrix that connects the hidden state $h_t$ to the output. Finally, $\bar{y}$ is the final output of the network at time step $t$, where the softmax function is applied to the output $o_t$ to convert it into probabilities, typically for classification tasks.
Although RNNs are effective in performing various prediction tasks, they nevertheless suffer from the problem of exploding and vanishing gradients. To solve this problem, other types of RNNs, such as long short-term memory (LSTM) and gated recurrent unit (GRU) networks, have been designed. More specifically, LSTM (illustrated in Figure 1a) operates as defined in (10), and GRU (illustrated in Figure 1b) operates as defined in (11).
LSTM:
$g_p = \sigma(W_g \cdot [v_{p-1}, x_p] + b_g)$
$k_p = \sigma(W_k \cdot [v_{p-1}, x_p] + b_k)$
$S'_p = \tanh(W_s \cdot [v_{p-1}, x_p] + b_s)$
$r_p = \sigma(W_r \cdot [v_{p-1}, x_p] + b_r)$
$S_p = g_p \cdot S_{p-1} + k_p \cdot S'_p$
$v_p = r_p \cdot \tanh(S_p)$  (10)
GRU:
$z_p = \sigma(W_z \cdot [S_{p-1}, x_p])$
$k_p = \sigma(W_k \cdot [S_{p-1}, x_p])$
$S'_p = \tanh(W \cdot [k_p \cdot S_{p-1}, x_p])$
$S_p = (1 - z_p) \cdot S_{p-1} + z_p \cdot S'_p$  (11)
$x_p$ is the input at time step $p$, $S_{p-1}$ is the previous cell state, $v_{p-1}$ is the previous output (hidden state), $S_p$ is the updated cell state, and $v_p$ is the updated output (hidden state). The LSTM has multiple gates: the forget gate ($g_p$), input gate ($k_p$), and output gate ($r_p$). The forget gate decides what information from the previous cell state $S_{p-1}$ to keep or discard; its output is obtained by passing a linear combination of $x_p$ and $v_{p-1}$ through a sigmoid function, outputting values between 0 and 1, where a value closer to 1 means keeping more of the past information. The input gate determines what new information from $x_p$ and $v_{p-1}$ should be added to the cell state; it combines $x_p$ and $v_{p-1}$ through a sigmoid function to produce $k_p$, and the candidate values are modulated by a tanh function to ensure that the values added are in the range of −1 to 1. The output gate controls the flow of information from the cell state to the output $v_p$: after combining $x_p$ and $v_{p-1}$ and applying a sigmoid function, the result is multiplied by the tanh of the cell state to produce the output $v_p$. The new cell state $S_p$ is a combination of the previous cell state $S_{p-1}$, modulated by $g_p$, and the candidate cell state $S'_p$, scaled by $k_p$. This update allows the LSTM to retain long-term information (from $S_{p-1}$) while adding new information.
GRUs have two primary gates: the update gate ($z_p$) and the reset gate ($k_p$). The update gate $z_p$ controls the degree to which the previous hidden state $S_{p-1}$ should be carried forward to the next state; its output is obtained by applying a sigmoid function to a linear combination of $x_p$ and $S_{p-1}$ and determines how much of the previous hidden state is retained in the new state. The reset gate determines how much of the previous hidden state $S_{p-1}$ should be forgotten when computing the candidate hidden state; similar to the update gate, it combines $x_p$ and $S_{p-1}$ with a sigmoid function, resulting in a value between 0 and 1, where 1 means retaining more of $S_{p-1}$. The candidate hidden state ($S'_p$) is generated by applying a tanh function to a linear combination of $x_p$ and the reset-modulated previous hidden state $k_p \cdot S_{p-1}$; this intermediate state represents potential updates to the hidden state, influenced by the reset gate.
The hyperbolic tangent and sigmoid formulas are illustrated in (12) and (13).
$\tanh(a) = \frac{1 - e^{-2a}}{1 + e^{-2a}}$  (12)
$\sigma(a) = \frac{1}{1 + e^{-a}}$  (13)
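As a minimal sketch (assuming TensorFlow/Keras is available), the following builds small LSTM and GRU classifiers over windows of flow records; Keras implements the gate computations of (10) and (11) internally, and the window length, feature count, and synthetic data are illustrative assumptions.

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, LSTM, GRU, Dense

rng = np.random.default_rng(13)
X = rng.normal(size=(500, 10, 20)).astype("float32")  # 10 steps, 20 features
y = (X[:, -1, 0] > 0).astype("int32")

lstm_model = Sequential([
    Input(shape=(10, 20)),
    LSTM(32),                        # gates of equation (10), handled internally
    Dense(2, activation="softmax"),
])
gru_model = Sequential([
    Input(shape=(10, 20)),
    GRU(32),                         # update/reset gates of equation (11)
    Dense(2, activation="softmax"),
])
for m in (lstm_model, gru_model):
    m.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    m.fit(X, y, epochs=2, verbose=0)
```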

3.2.4. Convolutional Neural Network–Long Short-Term Memory (CNN–LSTM)

A CNN–LSTM (convolutional neural network–long short-term memory) is a hybrid deep learning model that combines the strengths of both CNNs and LSTMs. Usually, the convolutional neural network is utilized for feature extraction, and the LSTM is used to model the temporal dependencies in the data.
A typical CNN–LSTM architecture comprises a CNN front end followed by one or more LSTM layers, usually placed before the fully connected layer, which is responsible for the final prediction.
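As a minimal sketch of this hybrid (assuming TensorFlow/Keras is available), convolution and pooling extract local features, an LSTM models their temporal dependencies, and a dense layer makes the final prediction; the shapes and synthetic data are illustrative assumptions.

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Conv1D, MaxPooling1D, LSTM, Dense

rng = np.random.default_rng(17)
X = rng.normal(size=(600, 30, 8)).astype("float32")  # 30 steps, 8 features
y = (X[:, 0, 0] > 0).astype("int32")

model = Sequential([
    Input(shape=(30, 8)),
    Conv1D(32, kernel_size=3, activation="relu"),  # CNN: local feature extraction
    MaxPooling1D(pool_size=2),
    LSTM(32),                                      # LSTM: temporal dependencies
    Dense(2, activation="softmax"),                # final prediction
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=2, verbose=0)
```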

3.3. Metrics

The main metric that is employed to evaluate performance is accuracy.

Accuracy

Accuracy is calculated as the ratio of the number of correct predictions to the total number of samples. This provides an overview of the effectiveness of a model in correctly classifying both legitimate traffic instances and intrusions.
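In terms of the standard confusion-matrix counts, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, this ratio can be written as:

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$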
In the case of ensemble models, accuracy can improve through model aggregation, where each classifier contributes a partial prediction. This approach makes it possible to mitigate individual errors and increase the probability of correct classifications. In deep learning, accuracy is optimized through the use of optimization algorithms, such as the gradient descent algorithm, which reduce the loss function to improve the model’s ability to make correct predictions.

3.4. Deployment Challenges

There are several practical difficulties in putting a network intrusion detection system (NIDS) into place, particularly when combining ensemble models with machine learning approaches. These systems are designed to monitor network activity and spot unusual behavior that might point to attempted intrusions. Real-time performance and processing overhead are two major obstacles to incorporating sophisticated data analysis techniques.
The computational overhead is a crucial component of NIDS implementation. Machine learning models demand a significant amount of processing power, especially those that rely on ensemble methods like random forest or gradient boosting. These models combine several methods to increase accuracy and lower the possibility of overfitting. However, this combination of models can carry a high processing cost, particularly when analyzing large amounts of data, and processing and analyzing packets in real time demands substantial hardware resources. As a result, controlling this overhead is essential to guaranteeing that the system runs efficiently without exhausting network resources.
Apart from overhead, another significant obstacle is real-time performance. To detect and address threats before they have a chance to do damage, an NIDS needs to be able to analyze data in real time. Nevertheless, the detection procedure may become slower if machine learning models are used: complex models take time to train and to run inference, which can hinder the system’s capacity to react swiftly to threats. This is especially troublesome in situations where decisions have to be made in a split second. Therefore, the difficulty lies in striking a balance between the model’s accuracy and the requirement for a quick reaction.
Using simplification and model optimization approaches is one way to lessen these issues. For instance, models can be made lighter and faster while retaining a respectable degree of accuracy by using pruning or size reduction strategies. Furthermore, distributing the workload among several nodes through the integration of distributed processing technologies can enhance system performance.
Lastly, the ongoing evolution of cyber threats must be taken into account. Despite their effectiveness, machine learning models need to be updated frequently to handle new kinds of threats. This can further raise the computational cost, since it necessitates a large investment of resources for ongoing model training and validation.

4. Related Work

Recently, extensive research has been conducted on the application of supervised machine learning techniques to automate the process of intrusion detection in network connections. Experiments conducted in [7] compared different machine learning algorithms for intrusion detection systems (IDSs) using the KDD-99 Cup dataset. The supervised learning methods used for the detection task are logistic regression, decision tree, k-nearest neighbor (KNN), support vector machine (SVM), random forest, the AdaBoost algorithm, multilayer perceptron, and Naïve Bayes. The results show that the worst performance is provided by logistic regression, with 79.7% accuracy; the best performance is provided by random forest, with 99% accuracy, followed by KNN, with 94.17% accuracy, and Naïve Bayes, with 92.4% accuracy.
The authors of [8] addressed the problem of network intrusion classification using the NSL-KDD dataset. In the study, a model is proposed that applies the RF algorithm for intrusion detection, classifying the following types of attacks: DoS, Probe, R2L, and U2R. After the data preprocessing and feature selection phase, the proposed approach was evaluated in terms of accuracy and FPR (false positive rate) and compared with the J48 tree algorithm, reporting, in the case of DoS attacks, an accuracy of 99.67%, which is 7% higher than the J48 algorithm.
Using UNSW-NB15 datasets, the authors of [9] examined the effectiveness of top supervised machine learning methods. The J48 decision tree, Naïve Bayes, logistic regression, and SVM models with three distinct kernels (SVM-RBF, SVM-Polynomial, and SVM-Linear) were all compared. With an accuracy of 89.26%, logistic regression had the highest performance, followed by the J48 algorithm’s 88.67% accuracy.
In order to categorize each event as normal or attack, the research of [10] aimed to use Naïve Bayes supervised classification for the NSL-KDD dataset using principal component analysis (PCA). In essence, the classification uses features to determine category labels; however, because the NSL-KDD dataset has a large number of attributes, PCA is utilized as a feature reduction approach. The highest accuracy recorded in the findings was 86.5%.
In order to stop intrusions as soon as possible, the perfect IDS should be able to swiftly identify zero-day attacks with high accuracy and few false positives. The aim of [11] was to develop an intrusion detection model that combines machine learning algorithms with reduction techniques. The goal was to identify intrusions quickly and accurately while also minimizing false positives. The research tested ten widely used machine learning algorithms, BayesNet, Naïve Bayes, logistic, random tree, random forest, bagging, J48 DT, PART, OneR, and ZeroR, using the NSL-KDD dataset. The best performance among these was reported by random forest.
In [12], four different algorithms are used on the UNSW-NB15 dataset for the classification of cyber-attacks. These supervised techniques include J48, ZeroR, random forest, and Naïve Bayes. Correlation-based feature selection (CFS) was used to create an ideal subset of features. In this investigation as well, the random forest method yields the highest accuracy (97.6%), recall, precision, and F-measure (0.976). With a recall, precision, and F-measure of 0.681, along with an accuracy of 68.06%, ZeroR was the worst model.
In [13], different machine learning techniques were studied using the NSL-KDD dataset with different model-building steps. The supervised machine learning algorithms include k-nearest neighbors (KNN), decision tree, random forest, Naïve Bayes, neural network, and AdaBoost. These algorithms were compared at each stage of the data preprocessing steps to analyze which combination of algorithms is best suited for an intrusion detection system. For decision tree, the CART algorithm with the random state set to 0 is used; for KNN, 10 nearest neighbors are used as a parameter; for random forest, a random state of 100 is used as a parameter; for Naïve Bayes, its default parameters are used. For the neural network, a multilayer perceptron trained with the backpropagation algorithm is used, with the random state set to 10. For AdaBoost, the number of estimators is set to 100 and the random state to 0. For data preprocessing, categorical feature data are mapped to binary data by applying one-hot encoding. The authors applied feature scaling and feature reduction techniques, as well as standardization and normalization, individually to analyze the effect of these techniques on the dataset; then, each selected ML algorithm was applied to test the result against an unscaled model. Several feature reduction techniques were compared for this analysis, including the low variance filter, high correlation filter, random forest, and incremental principal component analysis (incremental PCA). Since the NSL-KDD dataset has an imbalance problem, the authors applied the following over-sampling techniques: SMOTE, Borderline-SMOTE, and ADASYN. Among the selected models, “KNN + normalization + correlation filter” with Borderline-SMOTE had the best performance in terms of accuracy (85.3%) but a longer prediction time (62.54 s). On the other hand, the best model in terms of prediction time was “decision tree + standardization + correlation filter”, with a prediction time of 0.01 s, the shortest observed, but it achieved a lower accuracy. Therefore, considering accuracy as the measure of best overall performance, KNN is the best choice for an IDS.
The following three traditional tree-based machine learning algorithms are trained and tested in [14] using the NSL-KDD benchmarking dataset: random forest, decision tree, and XGBoost. In order to improve and optimize the algorithm’s performance for accurate prediction, as well as to enable a smooth training process with the least amount of time and resources, normalization and feature selection approaches are employed in conjunction with data preprocessing. XGB reported an accuracy of 95.5% in the detection rate.
In [15], XGBoost is used with two datasets designed to evaluate NIDS machine learning algorithms, NSL-KDD and UNSW-NB15. The results showed that it is possible to achieve good performance (accuracy of 0.8864 and 0.9334, respectively) using a limited fraction of the complete parameter space. The most relevant model parameters for achieving such performance are ntrees, max_depth, eta, sample_rate, gamma, reg_lambda, and reg_alpha.
In [16], the authors compared various machine learning techniques for classifying network data into threats/non-threats using the UNSW-NB15 dataset. The models tested were ANN, support vector machine (SVM), and the proposed AdaBoost technique based on a decision tree classifier. The ANN achieved an accuracy of 89.54%; for the SVM, with the RBF kernel, an accuracy of 94.7% was obtained. The authors proposed a decision tree-based classification approach using AdaBoost. The parameters used in the AdaBoost model are maximum depth = 2 and algorithm = ‘SAMME.R’, achieving an accuracy of 99.3%.
In the context of the IoT-23 dataset, [17] implemented random forest, Naïve Bayes, support vector machine (SVM), and decision tree algorithms to detect anomalies in network data. The random forest algorithm achieved the best results, with an accuracy of 99.5%, while the worst results were achieved by Naïve Bayes, with 78.84% accuracy.
A machine learning-based intrusion detection system (IDS) with two hidden layers (a first layer with 32 neurons and a second layer with 16 neurons) was published by the authors of [18]. The number of features determines the input layer. For every scenario in the IoT-23 dataset, the output layer is associated to the labeled classes. The feature selection method used by ANN is called sequential forward feature selection (SFS).
In [19], three classifiers were used to classify network traffic data, deep feed-forward neural network, random forest, and gradient boosting tree. Two publicly available datasets, UNSW-NB15 and CICIDS2017, were used to evaluate the proposed method. The results show high accuracy with deep feed-forward neural network for both binary and multiclass classification on the UNSW-NB15 dataset, achieving 99.16% accuracy for binary classification and 97.01% for multiclass classification. On the other hand, gradient boosting tree achieved the highest accuracy for binary classification with the CICIDS2017 dataset at 99.99%, while for multiclass classification, deep feed-forward neural network had the highest accuracy at 99.56%.
In [20], a deep neural network composed of three parts was proposed, input layers, hidden layers, and output layers. The structure includes an input layer with 41 neurons, four hidden layers each with 100 neurons, one fully connected (FC) layer with five neurons, a softmax layer, and an output layer with five neurons. Experiments on the KDD99 dataset demonstrated a maximum accuracy of 99.9%.
Using publicly available network-based intrusion datasets like KDD-99, NSL-KDD, UNSW-NB15, and CICIDS 2017, the authors of [21] focused their research on assessing the effectiveness of several classical machine learning classifiers (logistic regression, Naïve Bayes, k-nearest neighbor, decision tree, random forest, and SVM) applied to NIDSs. They also proposed a deep neural network architecture with an input layer, five hidden layers, and an output layer. To assess how well the alternative models and the suggested model performed on various NIDS datasets, two distinct test scenarios were taken into consideration: (1) the network connection record is classified as either benign or attack; (2) the network connection record is classified as either benign or attack, and the attack is further classified into its respective category. When it comes to accuracy, the suggested model performs better than traditional machine learning algorithms, often much better in multiclass classification on several datasets, with competitive outcomes even in the binary case.
The authors in [22] presented their approach to IDS classification using the RNN model through six activation functions (SoftPlus, ReLU, Tanh, Sigmoid, LeakyReLU, ELU (Exponential Linear Unit)). They calculated accuracy, recall, and precision for the KDD-99 dataset, and results showed that the LeakyReLU function provides the best performance, achieving 97.77%, 87.85%, and 99.38% in accuracy, precision, and recall.
The study presented in [23] introduces an intrusion detection model with a CNN-based classifier trained on the KDD-99 dataset. The proposed approach revised the LeNet-5 model [4], incorporating a gradient descent optimization algorithm (i.e., the adaptive delta algorithm) to fine-tune the model parameters. The prediction accuracy of threat detection is 99.65%, higher than the existing LeNet-5 classifier (about 95%).
In recent years, many studies have used CNNs or RNNs to perform intrusion detection tasks based on spatial and temporal features. The authors of [24] evaluated the performance of two deep learning RNN models (LSTM and GRU) with two (LSTM2 and GRU2), three (LSTM3 and GRU3) and four (LSTM4 and GRU4) hidden layers. The best performance on the NSL-KDD dataset was obtained using RNN LSTM4 and GRU3, with a maximum accuracy of 82.78% and 82.87%, respectively.
The authors of [25] proposed two deep learning models trained on the NSL-KDD dataset, which are LSTM and the combination of convolutional neural network and LSTM (CNN–LSTM) for an intrusion detection system. Normalization, scaling, and conversion to numeric form were performed on the data, as LSTM only accepts numeric inputs. In the experimental phase, LSTM-only and CNN–LSTM achieved approximately 88% and 92% in terms of accuracy.
MINDFUL, a network intrusion detection approach, was introduced by [26]. It builds an intrusion detection model using a convolutional neural network trained on a multichannel representation of network flows. Two autoencoders, one trained on normal flows and the other on attack flows, provide the feature vector representation of the network flows. The work’s goal is to enhance the representation of the original flows by adding class-specific information, producing a fresh interpretation of the original flow. Each sample is thus represented as an extended multichannel sample in three dimensions, including the two vectors reconstructed by the autoencoders in addition to the original raw vector. Thus, MINDFUL combines an unsupervised approach for multichannel feature construction based on two NN autoencoders with a supervised approach that exploits cross-channel feature correlations. The authors assess the efficacy of the intrusion detection algorithm employed in MINDFUL by examining three benchmark datasets: KDD-99, UNSW-NB15, and CIC-IDS2017. The architecture takes X training samples as input. The two autoencoders are applied to each training sample x in order to return the reconstructed features $\hat{x}_n$ and $\hat{x}_a$. These new features are used to create an augmented dataset that is used as input for a 1D CNN. Experimental results reported a binary accuracy of 92.49%, 93.40%, and 97.90% on KDD-99, UNSW-NB15, and CIC-IDS2017, respectively.
Using deep learning-based recurrent models, [27] presented an end-to-end model for the detection and classification of network attacks. Three recurrent model types (RNN, LSTM, and GRU) were exploited, since network traffic flow has temporal and sequence properties. The SDN-IoT, KDD-99, UNSW-NB15, WSN-DS, and CICIDS-2017 datasets were used for the tests. In addition to performing dimensionality reduction, the authors employed a simple feature fusion approach that combined features from the RNN, LSTM, and GRU hidden layers. After that, an ensemble meta-classifier, also known as a stacking classifier, which combines several classification models, receives the fused features of the recurrent hidden layers. The meta-classifier employs RF and SVM for prediction in the first stage, stacks the predictions in the second stage, and then uses logistic regression in the third stage to identify and classify network threats. The findings demonstrate that the suggested approach ranges between 89% and 99% in the attack categorization test and achieves 98–99% accuracy in the attack detection task across all datasets.
The authors of [28] presented an attack detection framework using a deep learning model trained on the IoT-23 dataset. The proposed mechanism uses CNN–LSTM. In the experiments, the authors first tested the CNN alone, which does not perform well in classifying attacks such as CC, File Download, HeartBeat, and PartofHorizontalPortScan, although it performs substantially well in identifying benign traffic, with 94% accuracy. For the hybrid model, the problem was converted to a binary classification problem by aggregating all the classes identifying the several attacks; this model, on the other hand, performs quite well in identifying malicious devices, with 96% accuracy.
A novel method for network intrusion detection utilizing multistage deep learning image recognition is presented in [29]. Four-channel (red, green, blue, and alpha) images are created from network features. The ResNet-50 deep learning model is then trained and tested on these images for classification. Two publicly accessible reference datasets, UNSW-NB15 and BOUNDdos, are used to assess the suggested methodology. In the binary classification phase, the proposed technique achieves a 93.4% accuracy rate in identifying regular traffic in the UNSW-NB15 network intrusion dataset. During the attack-type identification phase, it achieves 99.8% accuracy in identifying generic attacks, 86% accuracy in identifying reconnaissance attacks, and 67.9% accuracy in identifying exploit attacks.
The authors of [30] used a CNN to design and implement an anomaly detection model for Internet of Things networks that can identify and categorize binary and multiclass abnormalities. A multiclass classification model is constructed using CNN 1D, 2D, and 3D models, while the binary classification process is carried out using a transfer learning methodology. The authors initially applied the idea of transfer learning to the binary classification of the IoT-DS-2 dataset by using the CNN1D, CNN2D, and CNN3D multiclass classification models pretrained on that dataset. Four existing datasets (BoT-IoT, IoT Network Intrusion, MQTT-IoT-IDS2020, and IoT-23) combine to form IoT-DS-2. In the subsequent phase, they classified the BoT-IoT, IoT Network Intrusion, MQTT-IoT-IDS2020, IoT-23, and IoT-DS-1 datasets into many classes using the same pretrained learning model.
The multiclass and binary CNN models are validated using accuracy, precision, recall, and F1-score. The CNN structure is good at extracting spatial aspects of the data flow, but it performs poorly at extracting long-distance-dependent information. In comparison, even though the GRU structure has a large number of parameters and requires a lengthy training period, it is more successful at extracting distance-dependent information and can prevent forgetting throughout the learning process. CNN–GRU, which combines a GRU with a convolutional neural network, is proposed by [31]. Using a CNN, spatial features are collected and then combined using average-pooling and max-pooling, with the convolutional block attention module (CBAM) used to give each feature a distinct weight. At the same time, to achieve comprehensive and effective feature learning, long-distance-dependent features are simultaneously extracted using a gated recurrent unit (GRU). The suggested intrusion detection model is assessed on the UNSW-NB15, NSL-KDD, and CIC-IDS2017 datasets. The experimental findings indicate that the classification accuracy achieves 86.25%, 99.69%, and 99.65%, respectively.
In [32], five machine learning and deep learning algorithms were applied in order to distinguish between malware and benign traffic in network traffic. The classifiers utilized include LSTM, random forest, CatBoost, XGBoost, and convolutional neural network models. The study’s primary goal is to use the IoT-23 dataset to investigate the network behavior and traffic traces of Internet of Things devices. After preprocessing the dataset to eliminate redundant or missing data, a feature engineering approach was used to extract the most important features. With 89%, RF has the best detection accuracy in distinguishing between malware and benign traffic; the XGBoost and CatBoost classifiers achieved the same accuracy of 89%. With 10 layers, the suggested CNN model attained an accuracy of 84%. The worst accuracy was achieved by LSTM, with a value of 78%.
The authors of [33] designed and developed LSTM, BiLSTM, and GRU models for anomaly detection in IoT networks. Seven datasets, including NSL-KDD and IoT-23, were used to conduct multiclass and binary classification experiments with the proposed anomaly detection models. Table 2 summarizes the works discussed in this section.

5. Datasets

In a machine learning approach, a classifier is trained with an ML algorithm on a dataset of typical and anomalous traffic patterns; the trained model can then be used to identify suspicious traffic on the fly. While such systems have certain drawbacks, their primary benefits lie in their capacity to learn without explicit programming and their flexibility in responding to changing traffic patterns. The effectiveness of a contemporary A-NIDS employing ML methods depends on the choice of a training dataset.
In this section, the selected datasets are illustrated and described: the KDD CUP 99 (KDD-99), NSL-KDD, UNSW-NB15, IoT-23, and UNB-CIC IoT 2023 datasets. The choice fell on these datasets based on the following observations from the state-of-the-art analysis:
  • On the NSL-KDD dataset, classification performance can be improved whenever feature selection is used. In addition, the RF algorithm is very effective on this dataset and reports high performance; DT also seems to perform well.
  • On the KDD-99 dataset, feature selection did not always yield an improvement in classification performance. However, RF works very well on this dataset.
  • On the UNSW-NB15 dataset, feature selection is effective in improving classification performance. CNNs and recurrent architectures are a successful approach on this dataset; indeed, since it is a large dataset, deep learning seems to be the natural solution.

5.1. KDD CUP 99 (KDD-99) Dataset

KDD-99 has been the most widely used dataset for testing anomaly detection techniques since 1999. It was prepared by [34] and is a variation on a dataset initially developed as part of an IDS initiative at MIT's Lincoln Laboratory, a DARPA-funded program tested in 1998 and again in 1999 that yielded what is often known as the DARPA98 dataset. This dataset was subsequently refined into the KDD CUP 99 dataset for the International Knowledge Discovery and Data Mining Tools Competition. Five million connection records over seven weeks of traffic were used to create a training set, while another two weeks of network traffic generated a test set with two million examples. KDD-99 is a filtered version of these data.
Table 3 shows the class distribution resulting from a preprocessing phase in which redundant data points were discarded (78% of the training set and 89.5% of the test set). The training set originally comprised 4,898,431 data points but was reduced to 1,074,992 unique data points; similarly, the test set was reduced from 2,984,154 to 311,029 data points.
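As a minimal sketch of this deduplication step (assuming the standard headerless CSV distribution of KDD-99; the file name and placeholder column names are illustrative, not the exact setup used here):

```python
import pandas as pd

# Hypothetical loading of the raw KDD-99 training file; the path and the
# placeholder feature names are assumptions for illustration.
columns = [f"feature_{i}" for i in range(41)] + ["label"]
train = pd.read_csv("kddcup.data", header=None, names=columns)

# Discard redundant records, as in the preprocessing described above.
train_unique = train.drop_duplicates()
print(len(train), "->", len(train_unique))  # expected: 4,898,431 -> 1,074,992
```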
The classes are as follows:
  • Normal.
  • Denial of Service Attack (DoS): DoS attacks occur when an attacker prevents authorized users from accessing a system or overloads certain computer or memory resources, making them unable to process valid requests.
  • Probing Attack: These attacks involve scanning the network to identify valid IPs and collecting information about them. Often, this information provides attackers with a list of vulnerabilities that can later be useful in launching attacks on systems and services.
  • Remote to Local Attack (R2L): The term refers to the process by which a malicious user who is able to transmit packets to a computer on a network but does not have an account on that machine uses a vulnerability to obtain local access as that machine’s user.
  • User to Root Attacks (U2R): These are a type of exploit where the attacker gains root access to the system by first getting access to a regular user account (perhaps by password sniffing). From there, they can take advantage of a vulnerability.
Each pattern has 41 features, each assigned to one of three categories: basic, traffic, and content.
  • Basic features: This category encompasses all attributes that can be extracted from a TCP/IP connection.
  • Traffic features: This category includes features computed with respect to a window interval and is divided into two groups: (a) "same host" features, which consider only the connections of the last two seconds that share the same destination host as the current connection and compute statistics on protocol behavior, service, and so on; (b) "same service" features, which consider only the connections of the last two seconds that share the same service as the current connection. These two groups of traffic features are called time-based. However, several slow probe attacks scan hosts (or ports) using a much wider time window than 2 s, for example one probe per minute; as a result, such attacks do not produce intrusion patterns within a 2 s window. To address this, the "same host" and "same service" features are also recalculated over a window of the last 100 connections instead of the 2 s time window; these are called connection-based traffic features (a small worked example follows this list).
  • Content features: R2L and U2R attacks lack the regular sequential intrusion patterns that characterize typical DoS and probing attacks. This is because R2L and U2R attacks are encoded in the data portions of packets and often require only a single connection, whereas DoS and probing attacks entail several connections to certain hosts in a very short amount of time. Detecting these attacks therefore requires features able to capture unusual activity in the data section, such as the number of unsuccessful login attempts; these are called content features.
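As a worked illustration of the time-based traffic features described above (a toy example with hypothetical column names and data, not the official feature extraction code), the sketch below counts, for each connection, how many connections to the same destination host fall within the preceding 2 s; replacing the 2 s time window with a fixed window of the last 100 connections would yield the connection-based variant.

```python
import pandas as pd

# Toy connection log; timestamps and hosts are purely illustrative.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00:00.0", "2024-01-01 00:00:00.5",
        "2024-01-01 00:00:01.0", "2024-01-01 00:00:04.0",
    ]),
    "dst_host": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
}).sort_values("timestamp").set_index("timestamp")

df["one"] = 1  # helper column so rows in the window can be counted
# "same host" time-based feature: connections to the same destination host
# within the previous 2 s (current connection included).
same_host_count = df.groupby("dst_host")["one"].rolling("2s").sum()
print(same_host_count)
```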
The unbalanced nature of the dataset is evident from Table 3; 98.61% of the data belongs to the normal or DoS categories. Moreover, the nonstationary nature of the KDD-99 dataset is visible from the distributions of training and test data in Table 3; 23% of the training set consists of DoS examples, versus 73.9% in the test set; normal is 75.61% in the training set but only 19.48% in the test set.

5.2. NSL-KDD Dataset

Unfortunately, KDD-99 has a number of drawbacks that may deter its application in the current setting, such as its age, the nonstationarity between training and test sets, redundant patterns, and extraneous features; NSL-KDD [35] was therefore introduced with the goal of improving KDD-99. As its authors point out, the dataset remains vulnerable to several issues, such as its inability to adequately depict low-footprint attacks; however, NSL-KDD contains fewer unique data points than KDD-99, so the training phase requires less computing power. NSL-KDD also includes sub-sampling of the normal, DoS, and probe classes, which mitigates some of the problems associated with the asymmetry of KDD-99. NSL-KDD can therefore be regarded as a stationary sampling of KDD CUP 99.

5.3. UNSW-NB15 Dataset

A more recent substitute for KDD-99 is the UNSW-NB15 dataset [36]. Its features and generation process are similar enough to KDD-99 for it to be considered a competitive replacement, and it compensates for a few of the drawbacks that make KDD-99 difficult to employ in modern NIDSs. The dataset was simulated by monitoring traffic over two days, in sessions of 16 h and 15 h, using the IXIA PerfectStorm program at the Australian Centre for Cyber Security (ACCS). Compared to the 11 IP addresses on two networks used for KDD-99, 45 distinct IP addresses over three networks were used for UNSW-NB15. The attacks were selected from an up-to-date CVE site, and communications were collected at the packet level with TCPdump, producing 2,540,044 records in total. A significantly smaller division was chosen to split UNSW-NB15 into training and testing data. The dataset includes ten target classes, whose distribution is displayed in Table 4:
  • Normal;
  • Fuzzer: trying to stop a program or network by feeding it data that are produced at random;
  • Analysis: this includes several attacks of the port scanning, spam, and HTML file penetration kinds;
  • Backdoor: a method for secretly bypassing a system security measure to access a computer or its contents;
  • DoS: an intentional attempt to prevent people from accessing a server or network resource, often by momentarily stopping or disrupting the operations of a host that is connected to the internet;
  • Exploit: the attacker takes advantage of a known vulnerability;
  • Generic: a method that functions against all block ciphers (with a specific block and key size) without taking the block cipher’s structure into account;
  • Reconnaissance: this includes all strikes that simulate attacks aimed at gathering information;
  • Shell code: a small piece of code used as the payload in the exploitation of software vulnerability;
  • Worm: malware that replicates itself in order to infect other systems, frequently spreading over a computer network by taking advantage of security flaws on the target machine.
A total of 49 features were extracted using tools such as Bro-IDS and Argus, and these features were classified into five categories: basic, flow, time, additionally generated, and content. The uniformity of UNSW-NB15 compared to traditional datasets can be appreciated by observing the ratio between the largest and smallest target classes: KDD-99 is the most unbalanced, NSL-KDD attempts to alleviate this problem, and the asymmetry of UNSW-NB15 is significantly lower. Data stationarity is also maintained between the training and test sets of UNSW-NB15, which have similar distributions.

5.4. IoT-23 Dataset

The IoT-23 dataset [37] was collected at the Stratosphere Laboratory at Czech Technical University between 2018 and 2019 and aims to facilitate researchers in their attempts at developing machine learning models.
IoT-23 contains a total of 23 traffic traces (PCAP files), known as scenarios, collected in a controlled IoT network environment with an unrestricted network connection; three of the traces correspond to benign traffic and twenty to malicious activity. Each scenario is associated with a particular malware sample or with benign traffic. The benign scenarios were collected by recording the network traffic of three distinct real IoT devices (a Somfy smart lock, an Amazon Echo smart home personal assistant, and a Philips HUE smart LED light), which allowed the capture of actual typical network data instead of simulated traffic. The malicious traffic was produced by an infected Raspberry Pi. Details such as duration, number of packets, number of Zeek flows, pcap file, and device name are provided for each scenario.
The labels used for detecting malicious network flows are as follows:
  • Attack: This label denotes the existence of an attack originating from the compromised device and directed against a different host. For instance, a command injection into the header of a GET request, a brute force attempt at a telnet login, etc.;
  • Benign: this label indicates that no suspicious or malicious activity was detected in the connections;
  • C-and-C: This label denotes that a Command-and-Control server was linked to the compromised device. Because of the irregular connections to the suspicious server or the sporadic arrival and departure of certain IRC commands, this behavior was discovered during the network malware capture analysis;
  • DDoS: The volume of traffic flowing to the same IP address indicates that these flows are part of a DDoS assault;
  • FileDownload: This label denotes the process of downloading a file to the compromised device. This is identified by screening connections whose response bytes exceed 3 KB or 5 KB; often, this is done in conjunction with a destination IP or port that is known to be a C-and-C server and to be suspicious;
  • HeartBeat: This label denotes that the C-and-C server tracks the infected host using packets transmitted over this connection;
  • Mirai: This label reports that the connections resemble those of a Mirai botnet, created by exploiting IoT device vulnerabilities. The label is appended when flows exhibit patterns like those of the most prevalent known Mirai attacks;
  • Okiru: Connections with this label exhibit the traits of an Okiru botnet, a Mirai variant targeting IoT devices that use ARC (Argonaut RISC Core) processors. The labeling criteria were the same as those used for Mirai, the only distinction being that this botnet family is less widespread;
  • PartOfAHorizontalPortScan: This label denotes the use of connections for a horizontal port scan in order to obtain data for further attacks. These labels are used for patterns in which connections have numerous distinct destination IP addresses, the same port, and a comparable amount of transferred bytes;
  • Torii: This label denotes that the connections exhibit the traits of a botnet associated with Torii. The criteria used for this categorization were the same as those used for Mirai, except that this botnet family is less widespread.
The three most frequent malicious (nonbenign) labels among the 20 malware captures are PartOfAHorizontalPortScan (213,852,924 flows), Okiru (47,381,241 flows), and DDoS (19,538,713 flows), while the three least frequent are C-and-C-Mirai (2 flows), PartOfAHorizontalPortScan-Attack (5 flows), and C-and-C-HeartBeat-FileDownload (11 flows).

5.5. UNB-CIC IoT 2023 Dataset

UNB-CIC IoT 2023 [38] was created to facilitate cyber-security research, with an emphasis on IoT (Internet of Things) device security. The dataset’s primary objective is to help improve threat and intrusion detection algorithms by offering a real-world, lab-generated dataset for examining and analyzing suspicious or unusual behavior in IoT devices.
The dataset comprises the network traffic produced by various IoT devices, such as sensors, security cameras, smart bulbs, and more. Both hostile and benign scenarios were acted out by simulating how each device interacts with the network and with other devices, making it possible to record a broad variety of behaviors, both typical and anomalous. Examples of tracked activities include:
  • Normal communications between IoT devices;
  • Denial-of-service (DoS) or distributed denial-of-service (DDoS) attacks, where attackers try to overload devices or networks;
  • Malware injection into IoT devices, to compromise their operation or steal information;
  • Man-in-the-middle attacks, in which an attacker is interposed between two devices in order to intercept or alter their communication.
Features in the dataset include various fields useful for network traffic analysis:
  • Timestamp: the exact time each packet was captured;
  • IP Address: the source and destination address of each packet;
  • Port Number: the port number associated with the connection;
  • Protocol: the network protocol used (e.g., TCP, UDP, etc.);
  • Packet size: information about the volume of data transmitted;
  • TCP flags: indicators specific to TCP connections, such as SYN, ACK, FIN, that help identify the status of connections.

6. Experiments

6.1. Experimental Setup

An Nvidia RTX 2080Ti GPU and Ubuntu Linux 20.04 were used for the testing, with Jupyter notebooks running Python 3.10 (from the Anaconda distribution) serving as the test environment. A Jupyter notebook is an interactive tool mostly used for documentation and code development: it allows a single document to combine graphics, structured text (including Markdown), executable code, and other output formats. This makes it especially helpful for scientific research, data science, data analysis, and machine learning, where findings and analyses must be documented and communicated clearly and interactively.
In each experiment, inference was performed both on the CPU alone and, when possible, on the GPU alone. For the GPU runs, scikit-learn (used on the CPU for random forest, gradient boosting, and decision trees) was replaced with cuML for the random forest classifier implementation, while the GPU-enabled XGBoost from Nvidia's RAPIDS ecosystem was used for gradient boosting. The neural networks were run on both CPU and GPU, using the appropriate extensions of TensorFlow and Keras.
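A minimal sketch of this substitution is shown below; the synthetic data and hyperparameters are placeholders rather than the tuned values used in the experiments, while the API calls (cuML's random forest and XGBoost's GPU histogram tree method, valid for the 1.x series reported in Table 1) are standard.

```python
import numpy as np
from cuml.ensemble import RandomForestClassifier as cuRFC  # GPU drop-in for sklearn's RF
from xgboost import XGBClassifier

# Synthetic stand-in for a preprocessed traffic dataset.
X_train = np.random.rand(1000, 20).astype(np.float32)
y_train = np.random.randint(0, 2, 1000).astype(np.int32)
X_test = np.random.rand(100, 20).astype(np.float32)

rf_gpu = cuRFC(n_estimators=100)  # trains on the GPU
rf_gpu.fit(X_train, y_train)
rf_pred = rf_gpu.predict(X_test)

xgb_gpu = XGBClassifier(tree_method="gpu_hist", n_estimators=100)
xgb_gpu.fit(X_train, y_train)
xgb_pred = xgb_gpu.predict(X_test)
```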

6.2. Experimental Cases

Experiments were carried out on separate models as well as on ensembles. Specifically, five experiments were performed, organized as follows:
  • Experiment 1: separate models
  • Experiment 2: ensemble 1—Random forest (RF)/XGBoost (XGB)/decision tree (DT)
  • Experiment 3: ensemble 2—Deep neural network (DNN)/CNN–LSTM
  • Experiment 4: ensemble 3—LSTM/CNN–LSTM/GRU
  • Experiment 5: ensemble 4—Random forest (RF)/deep neural network (DNN)
The specific ensembles (Experiments 2–5) are justified by the results observed in Experiment 1; more specifically, the best-performing models were taken into consideration when defining the various combinations.
In all experiments, the metric used for effectiveness is accuracy.

6.2.1. Experiment 1: Separate Models

In this experiment, the models were tested separately.
Several preprocessing steps were applied before conducting the experiments. Given the large number of features in the datasets, dimensionality reduction was carried out for the random forest models, while scaling was also required for distance- and gradient-based models such as neural networks, SVM, and KNN. The scikit-learn "SelectFromModel" class was used to reduce the dimensionality. SelectFromModel automatically chooses features from a dataset in order to reduce the complexity of the data and enhance model performance: it removes irrelevant or less significant features by selecting the most pertinent ones based on the model's own assessment of their importance. This procedure is especially helpful in machine learning scenarios where a high number of variables might increase noise, overload the model, and reduce prediction accuracy. SelectFromModel is built on top of a supervised learning model that gives each feature in the dataset a weight or measure of relevance; a linear regression, a random forest, or any other model exposing coefficients or importance scores can serve this purpose.
Once trained, SelectFromModel can automatically eliminate features whose importance falls below a predetermined threshold, increasing model efficiency and helping to avoid overfitting. Its primary benefit is its capacity to automate the dimensionality reduction procedure without requiring manual variable selection, which is especially important when working with high-dimensional datasets, where manually analyzing each feature may not be feasible or effective. Furthermore, reducing the number of input variables shortens training times and enhances the model's capacity to generalize to fresh data, which increases prediction robustness. The scikit-learn "StandardScaler" class was used to carry out the scaling: standardization brings all features to a common scale by removing the mean and scaling to unit variance. This prevents certain features from dominating others because of their differing scales, which is helpful when employing machine learning methods, like neural networks, SVM, or KNN, that rely on distances between points in space.
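A minimal sketch of this preprocessing pipeline follows; the synthetic data stand in for an encoded traffic dataset, and the random forest driving the selection is illustrative rather than the tuned model used in the experiments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an encoded traffic dataset (41 features, like KDD-99).
X, y = make_classification(n_samples=2000, n_features=41, n_informative=10,
                           random_state=0)

# Feature selection driven by random forest importances; by default, features
# whose importance falls below the mean are dropped.
selector = SelectFromModel(RandomForestClassifier(n_estimators=100,
                                                  random_state=0))
X_reduced = selector.fit_transform(X, y)

# Standardization (zero mean, unit variance) for distance- and gradient-based
# models such as neural networks, SVM, or KNN.
X_scaled = StandardScaler().fit_transform(X_reduced)
print(X.shape, "->", X_scaled.shape)
```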
The EarlyStopping feature of Keras, which stops training a neural network model before it completes the maximum number of epochs, was employed to prevent overfitting of the deep learning models. EarlyStopping monitors a metric on a validation set (such as accuracy or loss) and stops training if no improvement is seen after a predetermined number of consecutive epochs (the patience parameter). Here, the validation accuracy was used as the metric, the patience parameter was set to four epochs, and the minimal delta, i.e., the smallest change in the metric regarded as an improvement, was set to 0.001.
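A minimal sketch of this configuration, with a toy model and synthetic data standing in for the actual architectures and datasets used in the experiments:

```python
import numpy as np
from tensorflow import keras

# Tiny synthetic binary problem standing in for a preprocessed traffic dataset.
X = np.random.rand(1000, 41).astype("float32")
y = np.random.randint(0, 2, 1000)

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(41,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# EarlyStopping with the settings stated above: validation accuracy as the
# monitored metric, patience of four epochs, and a minimal delta of 0.001.
early_stop = keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                           min_delta=0.001, patience=4)
model.fit(X, y, validation_split=0.2, epochs=100,
          callbacks=[early_stop], verbose=0)
```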

6.2.2. Experiment 2: Ensemble 1—RF/XGB/DT

An ensemble of random forest, XGBoost, and decision tree models was used in this experiment, with a soft voting strategy. Soft voting is an ensemble technique used mostly for classification tasks in machine learning: instead of merely selecting the most popular prediction among the models (hard voting), it combines the predictions of multiple models by averaging their class probabilities (with a traditional or weighted average) and selecting the class with the highest average probability.
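A minimal sketch of such a soft-voting ensemble follows, with illustrative hyperparameters and synthetic data in place of the tuned values and datasets used in the experiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Synthetic stand-in for a preprocessed traffic dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Soft voting: class probabilities of the three models are averaged, and the
# class with the highest mean probability wins.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("xgb", XGBClassifier(n_estimators=100)),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```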

6.2.3. Experiment 3: Ensemble 2—DNN/CNN–LSTM

In this experiment, an ensemble of DNN/CNN–LSTM models with a bagging strategy was used; the two neural network models that performed best in the previous experiment (Experiment 1) were therefore taken into consideration. Bagging is an ensemble technique that seeks to lower the variance of a predictive model by applying several instances of the same learning algorithm to different samples of the dataset: it divides the given data into subsamples, trains a model on each of them, and then aggregates the results into a more robust and accurate final prediction.
The fundamental concept is to make use of the bootstrap process, a statistical method that repeatedly draws samples from the original data. A model is constructed for every sample, and the final prediction is determined by a majority vote (for classification models) or an average (for regression models). The bagging process can be divided into the following three main steps (a minimal sketch follows the list):
  • Subsample generation (Bootstrap): The bootstrap approach is used to choose multiple random samples from the original training dataset; as a result, some data may be excluded and some may be repeated in each sample.
  • Model training: An independent model, such as a decision tree, is trained for every sample. Different models that capture distinct characteristics of the original dataset are produced because each model is trained on slightly different data.
  • Aggregation of predictions: Following training, the models’ output is utilized to generate predictions. A majority vote is used in bagging for classification models, where each model casts a vote for a class, and the class with the most votes becomes the final prediction. The average of the predictions made by individual models is employed in regression models.
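The sketch below implements these three steps for a neural ensemble; the model factory, data, and training settings are placeholders, and the aggregation averages class probabilities (the soft counterpart of the majority vote described above).

```python
import numpy as np
from sklearn.datasets import make_classification
from tensorflow import keras

# Synthetic stand-in for a preprocessed traffic dataset.
X_train, y_train = make_classification(n_samples=1000, n_features=20,
                                       random_state=0)
X_train = X_train.astype("float32")
X_test = X_train[:100]  # toy held-out slice, for illustration only

def build_model():
    # Placeholder factory: a small DNN standing in for the DNN/CNN-LSTM models.
    m = keras.Sequential([
        keras.layers.Dense(32, activation="relu", input_shape=(20,)),
        keras.layers.Dense(2, activation="softmax"),
    ])
    m.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return m

rng = np.random.default_rng(0)
probas = []
for _ in range(5):                                      # five bagged members
    idx = rng.integers(0, len(X_train), len(X_train))   # 1. bootstrap sample
    model = build_model()
    model.fit(X_train[idx], y_train[idx], epochs=5, verbose=0)  # 2. train
    probas.append(model.predict(X_test, verbose=0))     # per-class probabilities
avg = np.mean(probas, axis=0)                           # 3. aggregate by averaging
y_pred = np.argmax(avg, axis=1)                         # highest mean probability wins
```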

6.2.4. Experiment 4: Ensemble 3—LSTM/CNN–LSTM/GRU

In this experiment, an ensemble of LSTM/CNN–LSTM/GRU models was used with a bagging strategy; the three neural network models found to be fastest in the previous experiments were thus considered.

6.2.5. Experiment 5: Ensemble 4—RF/DNN

An ensemble of random forest/DNN models with a bagging strategy was used in this experiment, combining the best-performing neural network model with the best-performing nonneural machine learning model.

7. Results

The results illustrated in this section are divided into two major categories, separate models (Experiment 1) and ensemble models (Experiments 2–5), which are followed by an overall discussion and comparison of separate and ensemble models.

7.1. Experiment 1: Separate Models

In reference to Experiment 1, the results obtained are summarized in Table 5, where the best result for each dataset is in bold. As seen from the table, the overall best performance belongs to the deep learning models, which achieved the highest accuracy on three datasets out of five: KDD-99 (DNN, with 99.20% accuracy), NSL-KDD (RNN, with 91.80% accuracy), and IoT-23 (DNN, CNN, LSTM, CNN–LSTM, GRU, and RNN, all with 93.50% accuracy). On the UNSW-NB15 dataset, the highest accuracy was achieved by random forest and XGBoost (99.80%), but even here the deep learning models performed exceptionally well, with 98–99% accuracy; similarly, on the UNB-CIC IoT 2023 dataset, random forest obtained the best result (99.70%), with the deep learning models close behind (99.00–99.16%).
The worst performance belongs to SVM, which appears to be the algorithm most sensitive to the choice of dataset: its accuracy swings from 50.70% (NSL-KDD) to 98.60% (UNB-CIC IoT 2023), and its mean accuracy, 80.52%, is the lowest.
Focusing on the mean accuracy, the deep learning techniques once again achieved the best results; more specifically, the highest mean accuracy belongs to DNN (95.86%), followed by CNN–LSTM (95.07%). This shows the potential effectiveness of deep learning models in the implementation of NIDSs, since the tested models reached high accuracies and did not show significant changes of performance across datasets.
In terms of hardware performance, the results can be seen in Table 6, which reports the execution times on both CPU and GPU (when available). The GPU setup offers faster execution for the majority of models, frequently by orders of magnitude. For instance, among the shallow learning models, random forest (RF) displays a significant performance boost with GPU acceleration (from 0.003 ms to 0.0000015 ms), perhaps as a result of the algorithm's parallelizability. Naïve Bayes (NB) and logistic regression (LR) similarly benefit from GPU execution (from 0.0006 ms to 0.0000001 ms for NB and from 0.00035 ms to 0.0000001 ms for LR). XGBoost shows only a slight improvement on GPU (from 0.0015 ms to 0.001 ms), which is to be expected, given that XGBoost is already heavily optimized. As for deep learning, GPU acceleration provides notable gains for models like CNN, LSTM, GRU, and their combinations (like CNN–LSTM); the best improvement is seen in LSTM, whose execution time drops from 0.0000006 ms to 0.0000001 ms, followed by CNN–LSTM (from 0.0000005 ms to 0.0000001 ms).
Overall, for complex models like CNN, LSTM, and GRU, GPU acceleration provides substantial performance improvements, especially when working with large datasets or deep learning tasks. Simpler models, such as NB, LR, and DT, show minimal benefit from GPU usage, which suggests that CPU-based implementations might suffice in these cases.

7.2. Experiments 2–5: Ensemble Models

In reference to Experiments 2–5, the results obtained are summarized in Table 7, where the best result for each dataset is in bold. As seen from the table, the overall best performance belongs to RF/DNN (Experiment 5), which achieves the highest accuracy on four datasets out of five: UNSW-NB15 (99.79%, also the highest value observed), NSL-KDD (87.40%), IoT-23 (93.53%), and UNB-CIC IoT 2023 (99.73%). The highest performance on the KDD-99 dataset belongs to RF/XGB/DT (Experiment 2), with 99.30%. Moreover, this ensemble comes very close to RF/DNN on the IoT-23 and UNB-CIC IoT 2023 datasets, making RF/XGB/DT the second-best ensemble. The remaining experiments, DNN/CNN–LSTM (Experiment 3) and LSTM/CNN–LSTM/GRU (Experiment 4), achieved slightly lower accuracies.
In reference to the mean accuracy, the best value belongs to RF/DNN (95.91%), which is understandable given the individual accuracies discussed above. RF/DNN is again followed by RF/XGB/DT, at 95.58%; the remaining ensembles achieved 95.44% and 95.46%, a difference of only 0.02%.
These results show the effectiveness of ensemble models in the context of NIDSs, as all the tested models achieved high accuracies, and their performance was remarkably stable across the various datasets.
In terms of hardware performance, the results can be seen in Table 8. The RF/XGB/DT ensemble benefited the most from GPU acceleration, with the execution time dropping from 0.005 ms to 0.001 ms. LSTM/CNN–LSTM/GRU and RF/DNN also showed a reduction in execution time, from 0.0000005 ms to 0.0000002 ms. On the contrary, the DNN/CNN–LSTM combination showed a slight increase in execution time on GPU (from 0.0000006 ms to 0.0000008 ms), suggesting that, while these models benefit from the GPU individually, the ensemble architecture may introduce overhead that offsets the gains.

7.3. Comparison of Separate and Ensemble Models

Ensembles slightly increase accuracy at very little extra computational cost. Indeed, all the ensembles achieved a mean accuracy between 95% and 96%, while the mean accuracies of the separate models ranged broadly, from 80.52% (SVM) to 95.86% (DNN). Random forest (RF) achieved 94.58%, and decision tree (DT) 93.96%. Deep learning models like CNN (94.85%) and LSTM (94.58%) performed competitively, suggesting that these models are well tuned for the task at hand. Among the ensemble models, RF/XGB/DT reached 95.58%, outperforming the individual models, while LSTM/CNN–LSTM/GRU and DNN/CNN–LSTM also performed well, with accuracies of 95.46% and 95.44%, respectively. Moreover, with the highest accuracy of 95.91%, the RF/DNN ensemble demonstrated that combining deep learning and tree-based models can produce better results than either one alone.
Focusing on the hardware performance, the most accurate ensemble (RF/DNN) also has a better execution time than the best separate model (DNN), on both CPU (0.0000005 ms for RF/DNN vs. 0.0000006 ms for DNN) and GPU (0.0000002 ms vs. 0.0000004 ms); the ensemble thus returns better results while also running faster. Although the fastest execution belongs to two separate models, CNN–LSTM and GRU, with 0.0000005 ms (CPU) and 0.0000001 ms (GPU), they show lower accuracies (95.07% and 94.47%, respectively). Considering that their CPU execution time equals that of RF/DNN and their GPU execution time is only 0.0000001 ms lower, RF/DNN overall represents an optimal trade-off between accuracy and execution time.

8. Conclusions

This paper proposes a benchmark and an ablative study among recent machine learning-based NIDSs, covering both shallow learning and deep learning. More particularly, the proposed work compares the performance of classification models commonly found in the literature (decision tree, random forest, Naïve Bayes, logistic regression, XGBoost, support vector machine, and neural network, along with deep learning techniques such as DNN, CNN, and LSTM) on the most popular dataset for intrusion detection, KDD-99, as well as on its alternatives, i.e., NSL-KDD, UNSW-NB15, and the modern IoT-23 and UNB-CIC IoT 2023. In addition to these models, ensemble models were also tested. This study was also encouraged by the novelty of applying deep neural networks to information security problems; the paper therefore also aims to show the potential of such models in the context of intrusion detection and NIDSs. As a matter of fact, the comparison illustrated in the previous section shows how the deep learning techniques generally outperform the shallow learning ones, in terms of both performance on single datasets and mean accuracy. Moreover, ensembles outperform both shallow learning and deep learning separate models. The best-performing separate model appears to be the DNN, which achieved a mean accuracy of 95.86%; the second- and third-best models are also deep learning models, namely CNN–LSTM and RNN, with mean accuracies of 95.07% and 95.06%, respectively. The best ensemble model is RF/DNN, which achieved a mean accuracy of 95.91% and represents the best model overall. Its efficiency can also be seen in its execution time, as it performs well on both CPU and GPU; this suggests that combining a tree-based model with a deep neural network is a good solution. SVM had the worst performance, achieving the lowest mean accuracy (80.52%), and appears to be dataset-dependent, given its significant swings in performance.

Author Contributions

Conceptualization, E.Z., S.B., F.G., G.S. and M.G.; methodology, E.Z. and S.B.; software, F.G. and G.S.; validation, E.Z., F.G. and G.S.; formal analysis, E.Z. and S.B.; investigation, E.Z. and S.B.; resources, S.B. and M.G.; data curation, E.Z. and M.G.; writing—original draft preparation, E.Z. and S.B.; writing—review and editing, E.Z. and S.B.; visualization, E.Z. and S.B.; supervision, D.I.; project administration, F.G., G.S. and M.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Regione Puglia—Progetto ACROSS—AcCounting and payROll Software cloud converSion—POR PUGLIA FESR 2014–2020—Grant number: THA48Y5.

Data Availability Statement

The KDD CUP 99 dataset is openly available from the UCI KDD archive at https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (accessed on 10 November 2022). The NSL-KDD dataset is available at https://ieee-dataport.org/documents/nsl-kdd-0 (accessed on 9 November 2022). The UNSW-NB15 dataset is publicly available at https://research.unsw.edu.au/projects/unsw-nb15-dataset (accessed on 3 November 2022). The IoT-23 dataset is publicly available at https://www.stratosphereips.org/datasets-iot23 (accessed on 9 November 2022). The UNB-CIC IoT 2023 dataset is available at https://www.unb.ca/cic/datasets/iotdataset-2023.html (accessed on 8 May 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Nascita, A.; Cerasuolo, F.; Di Monda, D.; Garcia, J.T.A.; Montieri, A.; Pescape, A. Machine and Deep Learning Approaches for IoT Attack Classification. In Proceedings of the INFOCOM WKSHPS 2022—IEEE Conference on Computer Communications Workshops, New York, NY, USA, 2–5 May 2022. [Google Scholar] [CrossRef]
  2. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
  3. Cortes, C.; Vapnik, V.; Saitta, L. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  4. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2323. [Google Scholar] [CrossRef]
  5. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  6. Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the EMNLP 2014—2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, Doha, Qatar, 25–29 October 2014; pp. 1724–1734. [Google Scholar] [CrossRef]
  7. Ravipati, R.D.; Abualkibash, M. Intrusion Detection System Classification Using Different Machine Learning Algorithms on KDD-99 and NSL-KDD Datasets—A Review Paper. SSRN Electron. J. 2019, 11. [Google Scholar] [CrossRef]
  8. Farnaaz, N.; Jabbar, M.A. Random Forest Modeling for Network Intrusion Detection System. Procedia Comput. Sci. 2016, 89, 213–217. [Google Scholar] [CrossRef]
  9. Bhamare, D.; Salman, T.; Samaka, M.; Erbad, A.; Jain, R. Feasibility of Supervised Machine Learning for Cloud Security. In Proceedings of the ICISS 2016—2016 International Conference on Information Science and Security, Pattaya, Thailand, 19–22 December 2017. [Google Scholar] [CrossRef]
  10. Sharmila, B.S.; Nagapadma, R. Intrusion detection system using naive bayes algorithm. In Proceedings of the 2019 5th IEEE International WIE Conference on Electrical and Computer Engineering, WIECON-ECE 2019—Proceedings, Bengaluru, India, 15–16 November 2019. [Google Scholar] [CrossRef]
  11. Prachi, H.M.; Sharma, P. Intrusion detection using machine learning and feature selection. Int. J. Comput. Netw. Inf. Secur. 2019, 11, 43–52. [Google Scholar]
  12. Hammad, M.; El-Medany, W.; Ismail, Y. Intrusion Detection System using Feature Selection with Clustering and Classification Machine Learning Algorithms on the UNSW-NB15 dataset. In Proceedings of the 2020 International Conference on Innovation and Intelligence for Informatics, Computing and Technologies, 3ICT 2020, Sakheer, Bahrain, 20–21 December 2020. [Google Scholar] [CrossRef]
  13. Latif, S.; Dola, F.F.; Afsar, M.; Esha, I.J.; Nandi, D. Investigation of Machine Learning Algorithms for Network Intrusion Detection. Int. J. Inf. Eng. Electron. Bus. 2022, 14, 1–22. [Google Scholar] [CrossRef]
  14. Alzahrani, A.O.; Alenazi, M.J.F. Designing a Network Intrusion Detection System Based on Machine Learning for Software Defined Networks. Future Internet 2021, 13, 111. [Google Scholar] [CrossRef]
  15. Gouveia, A.; Correia, M. Network intrusion detection with XGBoost. In Recent Advances in Security, Privacy, and Trust for Internet of Things (IoT) and Cyber-Physical Systems (CPS); Chapman and Hall/CRC: Boca Raton, FL, USA, 2020; pp. 137–166. [Google Scholar]
  16. Ahmad, I.; Haq, Q.E.U.; Imran, M.; Alassafi, M.O.; AlGhamdi, R.A. An Efficient Network Intrusion Detection and Classification System. Mathematics 2022, 10, 530. [Google Scholar] [CrossRef]
  17. Thamaraiselvi, R.; Mary, S.A.S. Attack and anomaly detection in iot networks using machine learning. Int. J. Comput. Sci. Mob. Comput. 2020, 9, 95–103. [Google Scholar] [CrossRef]
  18. Kim, Y.G.; Ahmed, K.J.; Lee, M.J.; Tsukamoto, K. A Comprehensive Analysis of Machine Learning-Based Intrusion Detection System for IoT-23 Dataset. In Advances in Intelligent Networking and Collaborative Systems; Lecture Notes in Networks and Systems, LNNS; Springer: Cham, Switzerland, 2022; Volume 527, pp. 475–486. [Google Scholar] [CrossRef]
  19. Faker, O.; Dogdu, E. Intrusion detection using big data and deep learning techniques. In Proceedings of the ACMSE 2019—Proceedings of the 2019 ACM Southeast Conference, Kennesaw, GA, USA, 18–20 April 2019; pp. 86–93. [Google Scholar] [CrossRef]
  20. Jia, Y.; Wang, M.; Wang, Y. Network intrusion detection algorithm based on deep neural network. IET Inf. Secur. 2019, 13, 48–53. [Google Scholar] [CrossRef]
  21. Vinayakumar, R.; Alazab, M.; Soman, K.P.; Poornachandran, P.; Al-Nemrat, A.; Venkatraman, S. Deep Learning Approach for Intelligent Intrusion Detection System. IEEE Access 2019, 7, 41525–41550. [Google Scholar] [CrossRef]
  22. Le, T.-T.-H.; Kim, J.; Kim, H. Analyzing Effective of Activation Functions on Recurrent Network for Intrusion Detection. J. Multimed. Inf. Syst. 2016, 3, 91–96. [Google Scholar] [CrossRef]
  23. Lin, W.-H.; Lin, H.-C.; Wang, P.; Wu, B.-H.; Tsai, J.-Y. Using convolutional neural networks to network intrusion detection for cyber threats. In Proceedings of the 4th IEEE International Conference on Applied System Innovation 2018, ICASI 2018, Chiba, Japan, 13–17 April 2018; pp. 1107–1110. [Google Scholar] [CrossRef]
  24. Li, Z.; Rios, A.L.G.; Xu, G.; Trajkovic, L. Machine learning techniques for classifying network anomalies and intrusions. In Proceedings of the IEEE International Symposium on Circuits and Systems, Monterey, CA, USA, 21–22 May 2019; Volume 2019. [Google Scholar] [CrossRef]
  25. Hsu, C.-M.; Hsieh, Y.; Prakosa, S.; Azhari, M.; Leu, J.-S. Using Long-Short-Term Memory Based Convolutional Neural Networks for Network Intrusion Detection. In Proceedings of the 11th EAI International Conference, WiCON 2018, Taipei, Taiwan, 15–16 October 2018; Proceedings. Springer: Cham, Switzerland, 2019; pp. 86–94. [Google Scholar] [CrossRef]
  26. Andresini, G.; Appice, A.; Di Mauro, N.; Loglisci, C.; Malerba, D. Multi-Channel Deep Feature Learning for Intrusion Detection. IEEE Access 2020, 8, 53346–53359. [Google Scholar] [CrossRef]
  27. Ravi, V.; Chaganti, R.; Alazab, M. Recurrent deep learning-based feature fusion ensemble meta-classifier approach for intelligent network intrusion detection system. Comput. Electr. Eng. 2022, 102, 108156. [Google Scholar] [CrossRef]
  28. Sahu, A.K.; Sharma, S.; Tanveer, M.; Raja, R. Internet of Things attack detection using hybrid Deep Learning Model. Comput. Commun. 2021, 176, 146–154. [Google Scholar] [CrossRef]
  29. Toldinas, J.; Venčkauskas, A.; Damaševičius, R.; Grigaliūnas, Š.; Morkevičius, N.; Baranauskas, E. A Novel Approach for Network Intrusion Detection Using Multistage Deep Learning Image Recognition. Electronics 2021, 10, 1854. [Google Scholar] [CrossRef]
  30. Ullah, I.; Mahmoud, Q.H. Design and Development of a Deep Learning-Based Model for Anomaly Detection in IoT Networks. IEEE Access 2021, 9, 103906–103926. [Google Scholar] [CrossRef]
  31. Cao, B.; Li, C.; Song, Y.; Qin, Y.; Chen, C. Network Intrusion Detection Model Based on CNN and GRU. Appl. Sci. 2022, 12, 4184. [Google Scholar] [CrossRef]
  32. Alhamad, R.N.; Alserhani, F.M. Prediction Models to Effectively Detect Malware Patterns in the IoT Systems. Int. J. Adv. Comput. Sci. Appl. 2022, 13. [Google Scholar] [CrossRef]
  33. Ullah, I.; Mahmoud, Q.H. Design and Development of RNN Anomaly Detection Model for IoT Networks. IEEE Access 2022, 10, 62722–62750. [Google Scholar] [CrossRef]
  34. Stolfo, S.J.; Fan, W.; Lee, W.; Prodromidis, A.; Chan, P.K. Cost-based modeling for fraud and intrusion detection: Results from the JAM project. In Proceedings of the DARPA Information Survivability Conference and Exposition, DISCEX 2000, Hilton Head, SC, USA, 25–27 January 2000; Volume 2, pp. 130–144. [Google Scholar] [CrossRef]
  35. Tavallaee, M.; Bagheri, E.; Lu, W.; Ghorbani, A.A. A detailed analysis of the KDD CUP 99 data set. In Proceedings of the IEEE Symposium on Computational Intelligence for Security and Defense Applications, CISDA 2009, Ottawa, ON, Canada, 8–10 July 2009. [Google Scholar] [CrossRef]
  36. Moustafa, N.; Slay, J. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Proceedings of the 2015 Military Communications and Information Systems Conference, MilCIS 2015—Proceedings, Canberra, Australia, 10–12 November 2015. [Google Scholar] [CrossRef]
  37. Garcia, S.; Parmisano, A.; Erquiaga, M.J. IoT-23: A labeled dataset with malicious and benign IoT network traffic. Zenodo 2021. [Google Scholar] [CrossRef]
  38. Neto, E.C.P.; Dadkhah, S.; Ferreira, R.; Zohourian, A.; Lu, R.; Ghorbani, A.A. CICIoT2023: A Real-Time Dataset and Benchmark for Large-Scale Attacks in IoT Environment. Sensors 2023, 23, 5941. [Google Scholar] [CrossRef] [PubMed]
Figure 1. (a) LSTM and (b) GRU units.
Table 1. Machine learning models and their nature.

Model | Nature (Shallow/Deep)
Decision Tree (DT) | Shallow learning
Naïve Bayes (NB) | Shallow learning
Logistic Regression (LR) | Shallow learning
XGBoost (v.1.7.2) | Shallow learning
Support Vector Machine (SVM) | Shallow learning
Multilayer Perceptron (MLP) | Deep learning
Convolutional Neural Network (CNN) | Deep learning
Long Short-Term Memory (LSTM) | Deep learning
Table 2. Related work.

Paper | Dataset | Best Model | Best Result (Detection/Classification)
Ravipati et al. [7] | KDD-99 | Random Forest | 99% (detection)
Farnaaz [8] | NSL-KDD | Random Forest | 99.67% (classification)
Bhamare et al. [9] | UNSW-NB15 | Logistic Regression | 89.26% (detection)
Sharmila et al. [10] | NSL-KDD | Naïve Bayes | 86.5% (classification)
Prachi et al. [11] | NSL-KDD | Random Forest | 99.91% (classification)
Hammad et al. [12] | UNSW-NB15 | Random Forest | 97.60% (detection)
Latif et al. [13] | NSL-KDD | KNN | 85.3% (classification)
Alzahrani et al. [14] | NSL-KDD | XGBoost | 95.5% (detection)
Gouveia et al. [15] | NSL-KDD | XGBoost | 88.64% (detection)
Gouveia et al. [15] | UNSW-NB15 | XGBoost | 93.34% (detection)
Ahmad et al. [16] | UNSW-NB15 | Adaboost-based DT | 99.3% (detection)
Thamaraiselvi et al. [17] | IoT-23 | RF | 99.5% (detection)
Kim et al. [18] | IoT-23 | ANN | 99.99% (detection)
Faker et al. [19] | UNSW-NB15 | DNN | 99.16% (detection), 97.01% (classification)
Faker et al. [19] | CIC-IDS2017 | GBT, DNN | 99.99% (detection), 99.56% (classification)
Jia et al. [20] | NSL-KDD | DNN | 99.9% (classification)
Vinayakumar et al. [21] | KDD-99 | DT, DNN | 92.9% (detection), 92.5% (classification)
Vinayakumar et al. [21] | NSL-KDD | DT, DNN | 93% (detection), 78.5% (classification)
Vinayakumar et al. [21] | UNSW-NB15 | RF | 90.3% (detection), 75.5% (classification)
Vinayakumar et al. [21] | CIC-IDS2017 | RF, DNN | 94% (detection), 95.6% (classification)
Le et al. [22] | KDD-99 | RNN | 97.77% (classification)
Lin et al. [23] | KDD-99 | CNN (with adaptive delta algorithm) | 99.65% (detection)
Li et al. [24] | NSL-KDD | GRU | 82.87% (classification)
Hsu et al. [25] | NSL-KDD | CNN+LSTM | 94.12% (detection), 88.95% (classification)
Andresini et al. [26] | KDD-99 | AE+CNN | 92.49% (detection)
Andresini et al. [26] | UNSW-NB15 | AE+CNN | 93.40% (detection)
Andresini et al. [26] | CIC-IDS2017 | AE+CNN | 97.90% (detection)
Ravi et al. [27] | KDD-99 | (RNN, LSTM, GRU) + (RF, SVM) + LR | 99% (detection), 89% (classification)
Ravi et al. [27] | UNSW-NB15 | (RNN, LSTM, GRU) + (RF, SVM) + LR | 99% (detection), 99% (classification)
Ravi et al. [27] | CICIDS-2017 | (RNN, LSTM, GRU) + (RF, SVM) + LR | 99% (detection), 98% (classification)
Sahu et al. [28] | IoT-23 | CNN+LSTM | 96% (detection)
Toldinas et al. [29] | UNSW-NB15 | CNN | 99.8% (classification)
Ullah et al. [30] | IoT-23 | CNN 1D | 99.96% (classification)
Ullah et al. [30] | MQTT-IoT-IDS2020 | CNN 1D + transfer learning | 99.98% (detection)
Cao et al. [31] | UNSW-NB15 | CNN+GRU | 99% (detection), 89% (classification)
Cao et al. [31] | NSL-KDD | CNN+GRU | 99% (detection), 99% (classification)
Cao et al. [31] | CIC-IDS2017 | CNN+GRU | 99% (detection), 98% (classification)
Alhamad et al. [32] | IoT-23 | RF, CatBoost, XGBoost | 89% (detection)
Ullah et al. [33] | NSL-KDD | CNN-BiLSTM, BiLSTM | 99.88% (classification), 99.92% (detection)
Ullah et al. [33] | IoT-23 | CNN-BiLSTM, LSTM | 99.87% (classification), 99.80% (detection)
Table 3. Distribution of classes in KDD CUP 99.

Class | Training Set | Percentage | Test Set | Percentage
Normal | 812,814 | 75.611% | 60,593 | 19.481%
DoS | 247,267 | 23.002% | 229,853 | 73.901%
Probing Attack | 13,860 | 1.289% | 4166 | 1.339%
R2L | 999 | 0.093% | 16,189 | 5.205%
U2R | 52 | 0.005% | 228 | 0.073%
Total | 1,074,992 | 100% | 311,029 | 100%
Table 4. Distribution of classes in the UNSW-NB15 dataset.

Class | No. of Samples | Percentage
Normal | 2,218,761 | 87.35%
Fuzzer | 24,246 | 0.95%
Analysis | 2677 | 0.11%
Backdoor | 2329 | 0.09%
DoS | 16,353 | 0.64%
Exploit | 44,525 | 1.75%
Generic | 215,481 | 8.48%
Reconnaissance | 13,987 | 0.55%
Shellcode | 1511 | 0.06%
Worm | 174 | 0.01%
Total | 2,540,044 | 100%
Table 5. Intrusion detection test results in terms of accuracy (%) for Experiment 1.

Dataset | DT | RF | NB | LR | XGB | SVM | MLP | DNN | CNN | LSTM | CNN–LSTM | GRU | RNN
KDD-99 | 93.00 | 93.00 | 88.00 | 92.00 | 93.00 | 78.00 | 93.00 | 99.20 | 93.10 | 93.00 | 92.90 | 93.00 | 92.60
NSL-KDD | 84.00 | 87.00 | 85.30 | 85.80 | 87.20 | 50.70 | 88.70 | 89.10 | 89.50 | 88.80 | 91.00 | 88.30 | 91.80
UNSW-NB15 | 99.70 | 99.80 | 98.00 | 98.30 | 99.80 | 89.00 | 98.00 | 98.40 | 99.00 | 98.60 | 98.90 | 98.50 | 98.40
IoT-23 | 93.50 | 93.40 | 86.10 | 90.00 | 92.60 | 86.30 | 86.70 | 93.50 | 93.50 | 93.50 | 93.50 | 93.50 | 93.50
UNB-CIC IoT 2023 | 99.60 | 99.70 | 61.00 | 98.90 | 99.60 | 98.60 | 99.20 | 99.11 | 99.16 | 99.00 | 99.03 | 99.05 | 99.00
Mean acc. | 93.96 | 94.58 | 83.68 | 93.00 | 94.44 | 80.52 | 93.12 | 95.86 | 94.85 | 94.58 | 95.07 | 94.47 | 95.06
Table 6. Execution time (ms) for Experiment 1.

Setup | DT | RF | NB | LR | XGB | SVM | MLP | DNN | CNN | LSTM | CNN–LSTM | GRU | RNN
CPU | 4 × 10−4 | 3 × 10−3 | 6 × 10−4 | 3.5 × 10−4 | 1.5 × 10−3 | 1 × 10−3 | 3 × 10−3 | 6 × 10−7 | 8 × 10−7 | 6 × 10−7 | 5 × 10−7 | 5 × 10−7 | 8 × 10−7
GPU | - | 1.5 × 10−6 | 1 × 10−7 | 1 × 10−7 | 1 × 10−3 | - | - | 4 × 10−7 | 4 × 10−7 | 1 × 10−7 | 1 × 10−7 | 1 × 10−7 | 4 × 10−7
Table 7. Intrusion detection test results in terms of accuracy (%) for Experiments 2–5.

Dataset | Experiment 2: RF/XGB/DT | Experiment 3: DNN/CNN–LSTM | Experiment 4: LSTM/CNN–LSTM/GRU | Experiment 5: RF/DNN
KDD-99 | 99.30 | 98.70 | 98.71 | 99.09
NSL-KDD | 86.00 | 86.28 | 86.24 | 87.40
UNSW-NB15 | 99.40 | 99.63 | 99.65 | 99.79
IoT-23 | 93.50 | 93.33 | 93.37 | 93.53
UNB-CIC IoT 2023 | 99.70 | 99.27 | 99.33 | 99.73
Mean acc. | 95.58 | 95.44 | 95.46 | 95.91
Table 8. Execution time (ms) for Experiments 2–5.

Setup | Experiment 2: RF/XGB/DT | Experiment 3: DNN/CNN–LSTM | Experiment 4: LSTM/CNN–LSTM/GRU | Experiment 5: RF/DNN
CPU | 5 × 10−3 | 6 × 10−7 | 5 × 10−7 | 5 × 10−7
GPU | 1 × 10−3 | 8 × 10−7 | 2 × 10−7 | 2 × 10−7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
