Whale Optimization Algorithm-Enhanced Long Short-Term Memory Classifier with Novel Wrapped Feature Selection for Intrusion Detection

AL-Husseini, Haider; Hosseini, Mohammad Mehdi; Yousofi, Ahmad; Alazzawi, Murtadha A.

doi:10.3390/jsan13060073

¹ Department of Computer Engineering, Isfahan (Khorasgan) Branch, Islamic Azad University, Isfahan P.O. Box 81595-158, Iran

² Department of Computer Engineering, Shahrood Branch, Islamic Azad University, Shahrood P.O. Box 36155-163, Iran

³ Department of Computer Techniques Engineering, Imam Alkadhum University College, Baghdad 10011, Iraq

* Author to whom correspondence should be addressed.

J. Sens. Actuator Netw. 2024, 13(6), 73; https://doi.org/10.3390/jsan13060073

Submission received: 7 September 2024 / Revised: 19 October 2024 / Accepted: 24 October 2024 / Published: 2 November 2024

(This article belongs to the Section Big Data, Computing and Artificial Intelligence)

Abstract

Intrusion detection in network systems is a critical challenge due to the ever-increasing volume and complexity of cyber-attacks. Traditional methods often struggle with high-dimensional data and the need for real-time detection. This paper proposes a comprehensive intrusion detection method utilizing a novel wrapped feature selection approach combined with a long short-term memory classifier optimized with the whale optimization algorithm to address these challenges effectively. The proposed method introduces a novel feature selection technique using a multi-layer perceptron and a hybrid genetic algorithm-particle swarm optimization algorithm to select salient features from the input dataset, significantly reducing dimensionality while retaining critical information. The selected features are then used to train a long short-term memory network, optimized by the whale optimization algorithm to enhance its classification performance. The effectiveness of the proposed method is demonstrated through extensive simulations of intrusion detection tasks. The feature selection approach effectively reduced the feature set from 78 to 68 features, maintaining diversity and relevance. The proposed method achieved a remarkable accuracy of 99.62% in DDoS attack detection and 99.40% in FTP-Patator/SSH-Patator attack detection using the CICIDS-2017 dataset and an anomaly attack detection accuracy of 99.6% using the NSL-KDD dataset. These results highlight the potential of the proposed method in achieving high detection accuracy with reduced computational complexity, making it a viable solution for real-time intrusion detection.

Keywords:

feature selection; genetic algorithm (GA); intrusion detection; LSTM; MLP; particle swarm optimization (PSO); whale optimization algorithm (WOA)

1. Introduction

Intrusion detection in network systems has become an increasingly critical challenge in the face of growing cyber threats. Modern network environments generate vast amounts of data, characterized by high dimensionality and complexity. This data includes various features such as IP addresses, port numbers, protocol types, and packet sizes, which can contain both relevant and redundant information. The high volume and complexity of this data make it difficult for traditional intrusion detection systems (IDS) to effectively identify malicious activities, leading to issues such as reduced accuracy, increased computational overhead, and higher false positive rates. Consequently, there is a pressing need for advanced methods that can efficiently and accurately detect intrusions while managing the high dimensionality of network traffic data. The primary challenge in developing an effective IDS lies in the ability to distinguish between normal and malicious traffic in real-time. Traditional approaches, such as signature-based and anomaly-based detection, often struggle to cope with the dynamic and evolving nature of cyber-attacks. Signature-based methods rely on predefined patterns and are ineffective against new or unknown threats, while anomaly-based methods can generate a high number of false positives due to their reliance on statistical deviations from normal behavior. Furthermore, the high-dimensional nature of network traffic data exacerbates these challenges, making it imperative to develop sophisticated feature selection and classification techniques that can enhance detection accuracy and efficiency. Addressing these issues requires a comprehensive approach that integrates advanced machine learning algorithms with robust optimization techniques to create a more reliable and scalable intrusion detection system. 
Machine learning algorithms have shown reliable results in diverse fields, such as straw semantic segmentation [1], flood prediction [2], freight truck traffic flow prediction [3], and soil stress prediction [4], and can be adopted for efficient intrusion detection.

Despite significant advancements in machine learning and optimization algorithms, current research in intrusion detection often falls short in two critical areas: effective feature selection from high-dimensional data and optimal tuning of classifier hyperparameters for real-time detection. Many existing approaches either rely on simplistic feature selection methods that do not adequately capture the most relevant features or use classifiers that are not tuned for optimal performance, resulting in lower detection rates and more false positives. This paper aims to bridge these gaps by introducing a novel wrapped feature selection method combined with a long short-term memory classifier optimized by the whale optimization algorithm (LSTM-WOA). The proposed method leverages the strengths of a multi-layer perceptron (MLP) and a hybrid Genetic Algorithm-Particle Swarm Optimization (GA-PSO) algorithm for feature selection and employs the WOA to fine-tune the hyperparameters of an LSTM network, thereby enhancing both detection accuracy and computational efficiency.

Our methodology is structured in two main phases. The first phase involves a wrapped feature selection approach where different subsets of features are evaluated using an MLP classifier, guided by a hybrid GA-PSO algorithm. This approach ensures that the most informative features are selected, reducing dimensionality and computational load while maintaining high classification performance. In the second phase, we utilize the selected features to train an LSTM network, optimized by the WOA to fine-tune critical hyperparameters such as the number of hidden units, learning rate, learning rate drop factor, and batch size. WOA was selected for its simplicity, lower computational cost, and effectiveness in continuous optimization, making it more efficient for hyperparameter tuning compared to traditional algorithms like GA, PSO, or ant colony optimization (ACO). This two-phase approach aims to provide a robust and scalable solution for real-time intrusion detection.

The remainder of this paper is organized as follows: A brief review of other intrusion detection approaches is provided in Section 2. Section 3 elaborates on the proposed method, detailing the wrapped feature selection process and the optimization of the LSTM network. Section 4 describes the dataset used for evaluation and outlines the necessary preprocessing steps. Section 5 introduces the evaluation metrics employed to assess the proposed method. Section 6 discusses the time and space complexity of the proposed method. Section 7 presents the simulation results, including feature selection outcomes, LSTM hyperparameter tuning results, and classification performance. Section 8 offers a comparative study with related work in the field of intrusion detection. Finally, Section 9 concludes the paper, summarizing our findings, discussing the implications of our research, and suggesting potential directions for future work.

2. Literature Review

In this section, a brief review of other intrusion detection methods in the literature is presented. Maseer et al. [5] benchmarked 10 machine learning algorithms—seven supervised (k-NN, SVM, DT (both types C4.5 and ID3), RF, ANN, NB, and CNN) and three unsupervised (K-means clustering, EM clustering, and SOM)—for anomaly-based intrusion detection systems (AIDS) using the multi-class CICIDS2017 dataset. The k-NN-AIDS, DT-AIDS, and NB-AIDS models achieved the best results. Rosay et al. [6] proposed the use of an MLP neural network for intrusion detection in embedded devices connected to the internet. Catillo et al. [7] evaluated two machine learning techniques, deep autoencoders and decision trees, against adversarial examples generated by the virtual adversarial method (VAM). The study found that autoencoders were more robust to evasion attacks, while decision trees were vulnerable to evasion and their robustness was highly affected by learning parameter adjustments. Chindove and Brown [8] included dataset balancing and sampling, feature engineering, and systematic model tuning for adaptive intrusion detection improvement. In this study, the performance of recurrent neural networks (RNN) and random forests (RF) was evaluated on the CICIDS 2017 and CICIDS 2018 datasets. Aldarwbi et al. [9] proposed a novel Network Intrusion Detection System (NIDS) called “the sound of intrusion”, which transformed network traffic flow features into waves and applied audio/speech recognition deep-learning techniques to detect intrusions. The system utilized LSTM, deep belief networks (DBN), and convolutional neural networks (CNN). Panwar et al. [10] evaluated eight supervised classification techniques (GaussianNB, BernoulliNB, Decision Tree, KNN, Logistic Regression, SVM, Random Forest, SGD) for network intrusion detection using the CICIDS-2017 dataset. 
They applied a three-stage methodology: data preprocessing (including feature extraction, dataset splitting, handling missing values, scaling, and encoding), feature selection using Recursive Feature Elimination (RFE), and performance testing through cross-validation. Ho et al. [11] introduced a CNN-based IDS aimed at enhancing internet security by effectively identifying network intrusions from packet traffic. Using the CICIDS2017 dataset, the proposed IDS model was trained and validated to achieve high accuracy while evaluating metrics such as attack detection rate, false alarm rate, and training overhead. Kshirsagar and Kumar [12] introduced an Ensemble of Filter Feature Selection Techniques (EFFST) aimed at improving the efficiency of IDSs, specifically for web attack detection. EFFST selected a subset of features by leveraging filter feature selection methods, focusing on enhancing detection rates while reducing computational overhead. Pelletier and Abualkibash [13] utilized the R language to preprocess, analyze, and build predictive models using the CICIDS-2017 dataset, applying Artificial Neural Networks (ANN) and a Machine Learning algorithm to classify labeled network data for network intrusion detection. Priyanka and Gireesh Kumar [14] evaluated the performance of the CICIDS-2017 dataset for IDS using several machine learning algorithms, including CNN, Naive Bayes (NB), RF, RF with highly ranked features, and RF with feature reduction techniques (PCA and SVD). The study concluded that Random Forest achieved superior results on the CICIDS-2017 dataset. Krsteski et al. [15] aimed to develop an effective IDS using machine learning techniques on the CICIDS 2017 dataset. They focused on classification using Random Forest, Decision Tree, SVM, k-NN, and Naïve Bayes, with Random Forest emerging as the top performer in accuracy.

Alabsi et al. [16] proposed an IDS designed specifically for Distributed Denial of Service (DDoS) and Denial of Service (DoS) attacks. They employed a Conditional Tabular Generative Adversarial Network (CTGAN) to generate synthetic traffic resembling legitimate patterns, coupled with a discriminator network to distinguish between legitimate and malicious traffic. The synthetic data generated by CTGAN was utilized to train multiple classifiers, including shallow machine learning and deep learning models, thereby enhancing the IDS’s detection capabilities. Zavrak and Iskefiyeli [17] introduced an efficient approach for detecting DDoS attacks by proposing a feature subset selection method based on Random Harmony Search (RHS) optimization. The selected features were used to enhance the performance of a deep learning-based classifier model, specifically using Restricted Boltzmann Machines (RBM). The RBM model was augmented with additional layers between visible and hidden layers, and hyperparameters were optimized to improve detection rates. Kumar et al. [18] proposed a novel distributed Intrusion Detection System for detecting DDoS attacks in blockchain-enabled IoT networks. The system leveraged machine learning models, specifically Random Forest (RF) and an optimized gradient tree boosting system (XGBoost). The results indicated that XGBoost outperformed in binary attack detection, while RF was more effective in multi-attack detection. Additionally, RF required less time for training and testing compared to XGBoost. Zeeshan et al. [19] introduced a Protocol Based Deep Intrusion Detection (PB-DID) architecture to tackle DoS and Distributed DoS (DDoS) attacks in IoT networks. The authors created a dataset of packets from IoT traffic by comparing features from the UNSW-NB15 and Bot-IoT datasets, focusing on flow and Transmission Control Protocol (TCP). Roopak et al. [20] proposed an IDS for detecting DDoS attacks in Internet of Things (IoT) networks. 
The system combined the Jumping Gene adapted NSGA-II multi-objective optimization method for data dimension reduction with a deep learning technique that integrated CNN and LSTM networks for attack classification. Akgun et al. [21] introduced an IDS for detecting DDoS attacks, combining preprocessing techniques with deep learning models. Various models, including Deep Neural Networks (DNN), CNN, and LSTM, were evaluated using the CIC-DDoS2019 dataset. Preprocessing steps such as feature elimination, random subset selection, feature selection, duplication removal, and normalization were applied to enhance performance. Khanday et al. [22] focused on developing a lightweight IDS tailored for protecting IoT networks, particularly against DDoS attacks. They introduced novel data preprocessing techniques and utilized both machine learning and deep learning classifiers to enhance detection accuracy. The study employed datasets like BOT-IoT and TON-IoT from UNSW Sydney. Class imbalance issues in the datasets were addressed using Synthetic Minority Oversampling Technique (SMOTE) variants to improve model performance. Issa and Albayrak [23] focused on addressing the significant threat posed by Distributed Denial-of-Service (DDoS) attacks, which are pervasive in network security. They proposed a novel deep learning approach by combining CNN and LSTM networks. This architecture was applied to the NSL-KDD dataset, a common benchmark for intrusion detection systems. Baldini and Amerini [24] introduced an innovative approach to IDS specifically targeting DDoS attacks using an online algorithm based on a sliding window technique. The novelty lay in the application of Morphological Fractal Dimension (MFD), a measure adapted from fractal geometry, to enhance detection capabilities compared to traditional entropy-based methods. The study utilized the CICIDS2017 dataset. 
Hussain [25] employed supervised Machine Learning (ML) techniques using the CIC-DDoS2019 dataset, which included both benign and DDoS traffic. Six datasets, each optimized with 24 key features using sampling techniques, were evaluated with ML algorithms like Bayesian Network, Bagging, k-Nearest Neighbors, Sequential Minimal Optimization, and Simple Logistic. Evaluation metrics revealed Bagging as the top performer, demonstrating scalability with dataset size, thus enhancing DDoS detection capabilities in real-world scenarios. Ferrag et al. [26] proposed a deep learning-based IDS tailored for detecting DDoS attacks in the context of agriculture. They explored three deep learning models: CNN, DNN, and RNN. The study evaluated the performance of these models across two classification types (binary and multiclass) using two new real traffic datasets: CIC-DDoS2019 and TON_IoT. Huang et al. [27] introduced two methods, Genetic Attack and Probability Weighted Packet Saliency Attack (PWPSA), aimed at generating adversarial samples to bypass LSTM-based DDoS detection systems. GA employed genetic algorithms to evolve modified samples that evaded detection, while PWPSA iteratively modified samples based on position saliency and packet scores to achieve evasion. Mendonça et al. [28] introduced a novel IDS based on the Tree-CNN hierarchical algorithm with the Soft-Root-Sign (SRS) activation function, aimed at fast and effective identification of security attacks like DDoS, Infiltration, Brute Force, and Web attacks. The model focused on reducing training time and enhancing detection accuracy. Adefemi Alimi et al. [29] introduced a refined LSTM deep learning approach for IDS aimed at detecting DoS attacks in IoT networks. Tested on CICIDS-2017 and NSL-KDD datasets, the proposed IDS employed preprocessing techniques like encoding, dimensionality reduction, and normalization.

3. Methodology

This section provides a thorough explanation of the fundamental principles and key concepts essential for understanding the proposed method, along with a detailed overview of its stages.

3.1. Multi-Layer Perceptron

MLP is a type of feedforward artificial neural network comprising at least three layers: an input layer, one or more hidden layers, and an output layer. Except for the input nodes, each node is a neuron utilizing a nonlinear activation function. MLPs are trained using a supervised learning technique known as backpropagation. Unlike a simple perceptron, MLPs can handle data that is not linearly separable due to their multiple layers and nonlinear activations. If all neurons in an MLP use a linear activation function, the entire network can be reduced to a simple two-layer input-output model, because a composition of linear maps is itself linear. However, MLPs typically use nonlinear activation functions, originally designed to mimic the action potentials, or firing, of biological neurons. Two common nonlinear activation functions are the hyperbolic tangent, $y_{i} = \tanh(v_{i})$, which ranges from −1 to 1, and the logistic (sigmoid) function, $y_{i} = (1 + e^{-v_{i}})^{-1}$, which ranges from 0 to 1. In these functions, $y_{i}$ is the output of the i-th node, and $v_{i}$ is the weighted sum of its inputs. Other activation functions include the rectifier and softplus functions, with more specialized ones like radial basis functions used in other neural network models.

Layers

An MLP consists of at least three layers (input, hidden, and output) of nodes with nonlinear activations. Being fully connected, each node in one layer connects to every node in the next layer with a specific weight. Learning in an MLP involves adjusting connection weights based on the error between the predicted output and the actual target. This supervised learning process is conducted through backpropagation, an extension of the least mean squares algorithm used in linear perceptrons. The error summed over the output nodes $j$ for the n-th data point is represented as:

$$E = \frac{1}{2} \sum_{j} (d_{j} - y_{j})^{2} \tag{1}$$

where $d_{j}$ is the target value and $y_{j}$ is the output of node $j$. Weight adjustments are made to minimize the overall error using gradient descent:

$$\Delta \omega_{ij} = -\eta \frac{\partial E}{\partial \omega_{ij}} = -\eta \frac{\partial E}{\partial v_{j}} y_{i} \tag{2}$$

where $\omega_{ij}$ is the weight of the connection from neuron $i$ to neuron $j$, $y_{i}$ is the output of the previous neuron, and $\eta$ is the learning rate, chosen to ensure rapid convergence without causing oscillations.

The derivative needed depends on the local induced field $v_{j}$, and for an output node, this derivative simplifies to:

$$\frac{\partial E}{\partial v_{j}} = -(d_{j} - y_{j}) \varphi'(v_{j}) \tag{3}$$

where $\varphi'$ is the derivative of the activation function. For hidden nodes, the derivative is more complex and depends on the changes in weights of the output nodes. Thus, weight adjustments in hidden layers rely on the changes in the output layer weights, effectively representing the backpropagation of the activation function [30]. The flowchart of the MLP method is shown in Figure 1.
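As an illustration of Equations (1) to (3), the following minimal sketch performs one gradient-descent step for a single output neuron. It assumes a tanh activation; the function name `output_neuron_step` and the learning rate value are hypothetical choices made for this example and are not taken from the paper.

```python
import math

def output_neuron_step(w, y_prev, d, eta=0.1):
    """One gradient-descent step for a single tanh output neuron.

    w      -- weights from the previous layer (list of floats)
    y_prev -- outputs of the previous layer (list of floats)
    d      -- target value
    eta    -- learning rate (illustrative value)
    """
    v = sum(wi * yi for wi, yi in zip(w, y_prev))   # local induced field v_j
    y = math.tanh(v)                                # neuron output y_j
    error = 0.5 * (d - y) ** 2                      # Equation (1) for one output node
    dE_dv = -(d - y) * (1.0 - y ** 2)               # Equation (3); tanh'(v) = 1 - tanh(v)^2
    # Equation (2): the weight change is -eta * dE/dw_i, with dE/dw_i = dE/dv * y_prev_i
    new_w = [wi - eta * dE_dv * yi for wi, yi in zip(w, y_prev)]
    return new_w, error

w_new, e0 = output_neuron_step([0.2, -0.4], [1.0, 0.5], d=1.0)
_, e1 = output_neuron_step(w_new, [1.0, 0.5], d=1.0)
print(e0, e1)  # the error after the update is smaller than before it
```

For hidden neurons the same update applies, except that the term `dE_dv` has to be accumulated from the neurons of the following layer.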

3.2. Genetic Algorithm

Genetic algorithms (GAs) are computational algorithms inspired by the natural process of evolution, designed to search complex solution spaces for optimal or near-optimal solutions. They are particularly suitable for applications like prediction and forecasting. Figure 2 illustrates the stages of a genetic algorithm.

The initial stage in GA involves creating individuals with random arrays of genes (chromosomes), each representing a potential solution. The next stages are reproduction, involving crossover and mutation processes, which generate new individuals within the population. Each chromosome’s fitness is evaluated, with higher fitness values indicating better solutions. Selection then occurs, where the best individuals from the population and offspring are chosen to survive into the next generation.

3.2.1. Population Initialization

In predictive cases using GA, the main process involves identifying optimal historical data patterns through regression methods. This process aims to closely match historical data.

3.2.2. Chromosome Representation

Chromosomes are represented by real numbers between 0 and 1, suitable for predictive functions.

3.2.3. Fitness Value Calculation

The fitness value $f$ is determined using the Mean Square Error (MSE) between predicted and actual values. The goal is to minimize the MSE, thus maximizing the fitness value:

$$f = \frac{1}{\mathrm{MSE} + \epsilon} \tag{4}$$

where $\epsilon$ is a small number to prevent division by zero. The MSE is calculated as:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_{i} - Y_{i}')^{2} \tag{5}$$

where $n$ is the number of data points, $Y_{i}$ is the predicted value, and $Y_{i}'$ is the actual value.
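Equations (4) and (5) can be sketched in a few lines. The function names and the default value of `eps` are assumptions made for this illustration.

```python
def mse(predicted, actual):
    """Mean square error between two equally long sequences (Equation (5))."""
    n = len(predicted)
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n

def fitness(predicted, actual, eps=1e-8):
    """Fitness as the reciprocal of the MSE (Equation (4)); eps avoids division by zero."""
    return 1.0 / (mse(predicted, actual) + eps)

print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # one error of 2 over three points
```

A chromosome whose predictions match the targets exactly receives the largest possible fitness, which is bounded by `1 / eps`.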

3.2.4. Parent Selection

Parent selection uses the Roulette Wheel Selection method, where individuals are mapped to a line segment based on their fitness values. Random numbers are generated to select parents for crossover, ensuring no redundancy.
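A minimal sketch of roulette wheel selection is given below. It assumes non-negative fitness values with at least two positive entries, and it interprets "no redundancy" as requiring two distinct parents; both points are assumptions of this example, as are the function names.

```python
import random

def roulette_select(fitnesses, rng):
    """Return an index drawn with probability proportional to its fitness."""
    total = sum(fitnesses)
    pick = rng.uniform(0.0, total)        # a point on the line segment [0, total]
    cumulative = 0.0
    for index, f in enumerate(fitnesses):
        cumulative += f
        if pick <= cumulative:
            return index
    return len(fitnesses) - 1             # guard against floating-point round-off

def select_parents(fitnesses, rng):
    """Draw two distinct parent indices for crossover."""
    first = roulette_select(fitnesses, rng)
    second = roulette_select(fitnesses, rng)
    while second == first:                # redraw until the parents differ
        second = roulette_select(fitnesses, rng)
    return first, second

rng = random.Random(42)
print(select_parents([1.0, 2.0, 3.0, 4.0], rng))
```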

3.2.5. Crossover

New individuals are formed through the crossover method, combining parts of parent genomes. Using whole arithmetic crossover and a random alpha value, genes are combined to produce offspring.
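Whole arithmetic crossover can be sketched as a gene-wise convex combination of the two parents. The sketch assumes that one alpha value is shared by all genes; the function name is hypothetical.

```python
def whole_arithmetic_crossover(parent1, parent2, alpha):
    """Blend every gene of two real-valued parents with weight alpha in [0, 1]."""
    child1 = [alpha * a + (1.0 - alpha) * b for a, b in zip(parent1, parent2)]
    child2 = [(1.0 - alpha) * a + alpha * b for a, b in zip(parent1, parent2)]
    return child1, child2

c1, c2 = whole_arithmetic_crossover([0.0, 1.0], [1.0, 0.0], alpha=0.25)
print(c1, c2)
```

Because each child gene is a convex combination of the parent genes, offspring of chromosomes in [0, 1] remain in [0, 1].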

3.2.6. Mutation

Mutation occurs after crossover, changing the values of selected genes to prevent premature convergence. The uniform mutation method replaces selected gene values with random numbers within a defined range.
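Uniform mutation can be sketched as follows, assuming a per-gene mutation probability `rate`; this parameter and the function name are illustrative and not taken from the paper.

```python
import random

def uniform_mutation(chromosome, rate, low, high, rng):
    """Replace each gene, with probability rate, by a uniform random value in [low, high]."""
    return [rng.uniform(low, high) if rng.random() < rate else gene
            for gene in chromosome]

rng = random.Random(1)
print(uniform_mutation([0.1, 0.5, 0.9], rate=0.3, low=0.0, high=1.0, rng=rng))
```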

3.2.7. Elitism

Elitism ensures the best individuals from the current generation survive into the next. The selection process involves sorting individuals by fitness and carrying forward only the top performers, maintaining a constant population size [33].
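The elitist survivor selection described above can be sketched as a sort over the merged pool of parents and offspring. The function and parameter names are hypothetical.

```python
def elitist_survivors(population, offspring, fitness_fn, size):
    """Keep the `size` fittest individuals from the parents and offspring combined."""
    pool = population + offspring              # new list; the inputs are not modified
    pool.sort(key=fitness_fn, reverse=True)    # highest fitness first
    return pool[:size]

survivors = elitist_survivors([[0.1], [0.9]], [[0.5], [0.7]],
                              fitness_fn=lambda c: c[0], size=2)
print(survivors)
```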

3.3. The Particle Swarm Optimization (PSO)

PSO is a meta-heuristic algorithm initially developed by the American social psychologist Kennedy together with Eberhart. It is widely employed for solving non-linear problems and is inspired by the natural behavior of birds and fish in finding optimal food routes. In PSO, particles represent potential solutions and navigate through an n-dimensional space, where each dimension corresponds to a parameter of the problem. The fundamental principle of PSO is the iterative update of each particle’s position and velocity. Consider

$$x_{i}^{t} = (x_{i1}^{t}, x_{i2}^{t}, \dots, x_{in}^{t}) \quad \text{and} \quad v_{i}^{t} = (v_{i1}^{t}, v_{i2}^{t}, \dots, v_{in}^{t})$$

as the position and velocity of the $i$th particle at iteration $t$. The position and velocity of the $i$th particle in the $(t+1)$th iteration are updated using the following equations:

$$v_{i}^{t+1} = \omega \cdot v_{i}^{t} + c_{1} \cdot r_{1} \cdot (p_{i}^{t} - x_{i}^{t}) + c_{2} \cdot r_{2} \cdot (g_{i}^{t} - x_{i}^{t}) \tag{6}$$

with $-v_{\max} \leq v_{i}^{t+1} \leq v_{\max}$, and

$$x_{i}^{t+1} = x_{i}^{t} + v_{i}^{t+1} \tag{7}$$

Here, $x_{i}^{t}$ is the current position of the $i$th particle, $p_{i}^{t}$ is the best position found by the $i$th particle, $g_{i}^{t}$ is the global best position found by the swarm, $r_{1}$ and $r_{2}$ are random numbers between 0 and 1, $\omega$ is the inertia weight, $c_{1}$ is the cognitive coefficient, and $c_{2}$ is the social coefficient. The settings of Standard PSO 2011 are commonly used for the inertia weight $\omega$ and the two coefficients:

$$\omega = \frac{1}{2 \ln(2)} \approx 0.721 \tag{8}$$

$$c_{1} = c_{2} = 0.5 + \ln(2) \approx 1.193 \tag{9}$$

The problem is considered solved when the particles converge to a single point in the search space. PSO is highly effective for parallel swarming and optimization, utilizing a multi-objective fitness function to evaluate the quality of various features in a dataset [34]. The flowchart for PSO is illustrated in Figure 3.
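The update rules in Equations (6) to (9) can be sketched as a small global-best PSO. The sphere objective, the swarm size, the search bound, and the choice of the velocity limit equal to that bound are assumptions made only for this illustration.

```python
import math
import random

W = 1.0 / (2.0 * math.log(2.0))   # inertia weight, Equation (8)
C1 = C2 = 0.5 + math.log(2.0)     # cognitive and social coefficients, Equation (9)

def pso_minimize(objective, dim, n_particles=20, iters=100, bound=5.0, seed=0):
    rng = random.Random(seed)
    v_max = bound                  # assumed velocity limit
    x = [[rng.uniform(-bound, bound) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p_best = [xi[:] for xi in x]                      # personal best positions
    p_val = [objective(xi) for xi in x]
    g_idx = min(range(n_particles), key=lambda i: p_val[i])
    g_best, g_val = p_best[g_idx][:], p_val[g_idx]    # global best of the swarm
    for _ in range(iters):
        for i in range(n_particles):
            for k in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel = (W * v[i][k]
                       + C1 * r1 * (p_best[i][k] - x[i][k])
                       + C2 * r2 * (g_best[k] - x[i][k]))    # Equation (6)
                v[i][k] = max(-v_max, min(v_max, vel))       # velocity clamp
                x[i][k] += v[i][k]                           # Equation (7)
            value = objective(x[i])
            if value < p_val[i]:
                p_best[i], p_val[i] = x[i][:], value
                if value < g_val:
                    g_best, g_val = x[i][:], value
    return g_best, g_val

def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(xi * xi for xi in x)

best_x, best_val = pso_minimize(sphere, dim=3)
print(best_val)
```

In a feature selection setting, the objective would instead score a candidate feature subset, for example by the validation error of a classifier trained on it.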

3.4. Long Short-Term Memory

Recurrent Neural Networks (RNNs) often struggle to maintain information across long time periods. LSTM models address this issue by incorporating memory cells equipped with gated mechanisms. These gates enable the model to determine which information to retain or discard [36].

3.4.1. Forget Gate

This gate controls the removal of information from the LSTM’s memory. By employing a sigmoid function, it calculates a value, denoted as $f_{t}$, that lies between 0 and 1. This value is computed from the previous hidden state $h_{t-1}$ and the current input $x_{t}$, and it dictates the degree to which the previously stored memory should be preserved or discarded. This process is mathematically expressed in the following equation:

$$f_{t} = \sigma(W_{fh} h_{t-1} + W_{fx} x_{t} + b_{f}) \tag{10}$$

3.4.2. Input Gate

The input gate determines if new information should be incorporated into the LSTM’s memory. It consists of two layers: a sigmoid layer and a hyperbolic tangent (tanh) layer. The sigmoid layer produces an update signal $i_{t}$, specifying which portions of the memory need updating. The tanh layer creates a vector of candidate values, $\tilde{c}_{t}$, that are considered for addition to the memory. Together, these layers decide the memory update, which is calculated as follows:

$$i_{t} = \sigma(W_{ih} h_{t-1} + W_{ix} x_{t} + b_{i}) \tag{11}$$

$$\tilde{c}_{t} = \tanh(W_{ch} h_{t-1} + W_{cx} x_{t} + b_{c}) \tag{12}$$

The updated memory $c_{t}$ (Equation (13)) results from merging the process of forgetting the old value $c_{t-1}$ with the addition of the new candidate value $i_{t} \tilde{c}_{t}$, where the products are taken element-wise:

$$c_{t} = f_{t} c_{t-1} + i_{t} \tilde{c}_{t} \tag{13}$$

3.4.3. Output Gate

The output gate regulates which portion of the LSTM memory influences the output. It begins with a sigmoid layer that calculates the output gate signal $o_{t}$, indicating the importance of the memory. The memory $c_{t}$ is then passed through a tanh function, which maps its values to between −1 and 1, and the result is multiplied by $o_{t}$ to generate the final output $h_{t}$:

$$o_{t} = \sigma(W_{oh} h_{t-1} + W_{ox} x_{t} + b_{o}) \tag{14}$$

$$h_{t} = o_{t} \tanh(c_{t}) \tag{15}$$

The flowchart of the LSTM method is shown in Figure 4.
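To make the gate equations concrete, the sketch below performs a single LSTM time step following Equations (10) to (15), with each bias added inside the activation. It uses plain Python lists in place of a tensor library; the `params` layout (a dictionary mapping each gate name to its hidden weights, input weights, and bias) is an assumption of this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W_h, h, W_x, x, b):
    """Compute W_h h + W_x x + b for small list-based matrices."""
    return [sum(wh * hj for wh, hj in zip(W_h[r], h))
            + sum(wx * xj for wx, xj in zip(W_x[r], x))
            + b[r]
            for r in range(len(b))]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps 'f', 'i', 'c', 'o' to (W_h, W_x, b)."""
    def gate(name, activation):
        W_h, W_x, b = params[name]
        return [activation(z) for z in affine(W_h, h_prev, W_x, x, b)]

    f = gate("f", sigmoid)            # forget gate, Equation (10)
    i = gate("i", sigmoid)            # input gate, Equation (11)
    c_tilde = gate("c", math.tanh)    # candidate values, Equation (12)
    c = [fk * ck + ik * gk            # memory update, Equation (13)
         for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    o = gate("o", sigmoid)            # output gate, Equation (14)
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]   # output, Equation (15)
    return h, c

# With all weights and biases at zero, every gate outputs 0.5 and the candidate is 0.
zeros = ([[0.0, 0.0], [0.0, 0.0]], [[0.0], [0.0]], [0.0, 0.0])
params = {name: zeros for name in ("f", "i", "c", "o")}
h, c = lstm_step([1.0], [0.0, 0.0], [1.0, -2.0], params)
print(h, c)
```

In the zero-weight example the memory is simply halved by the forget gate, which makes the role of $f_{t}$ in Equation (13) easy to verify by hand.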

3.5. Whale Optimization Algorithm

This section discusses the fundamentals of the WOA, covering encircling prey, the bubble-net attacking method, and the search for prey.

3.5.1. Encircling Prey

Humpback whales have the ability to identify and completely encircle their prey, such as krill. The WOA algorithm assumes that the best search agent at any given time is the target prey. As the iterations progress, whales update their positions towards this best search agent. This behavior is modeled mathematically as follows:

$$\vec{D} = |\vec{C} \cdot \vec{X}^{*}(t) - \vec{X}(t)| \tag{16}$$

$$\vec{X}(t+1) = \vec{X}^{*}(t) - \vec{A} \cdot \vec{D} \tag{17}$$

Here, $\vec{A}$ and $\vec{C}$ are coefficient vectors, $t$ is the current iteration, $\vec{X}^{*}(t)$ represents the position of the best search agent, $\vec{X}(t)$ is the position of the current agent, and $|\cdot|$ denotes the element-wise absolute value. These coefficient vectors are computed as:

$$\vec{A} = 2 \vec{a} \cdot \vec{r} - \vec{a} \tag{18}$$

$$\vec{C} = 2 \vec{r} \tag{19}$$

where $\vec{a}$ decreases linearly from 2 to 0 over the iterations, balancing exploration and exploitation, and $\vec{r}$ is a random vector within [0, 1]. The parameter $\vec{a}$ is updated as $\vec{a} = 2 (1 - \frac{t}{I_{\max}})$, where $t$ is the iteration index and $I_{\max}$ is the maximum number of iterations. Exploration occurs when $|\vec{A}| \geq 1$, and exploitation happens when $|\vec{A}| < 1$. To avoid getting trapped in local optima during exploitation, $\vec{C}$ takes random values in [0, 2], enhancing the algorithm’s exploration capabilities throughout optimization.
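The encircling step can be sketched as follows. The sketch draws the coefficients independently for each dimension and applies $\vec{C}$ to the best position inside the absolute value, in the same way as the search-for-prey distance in Equation (23); the function names are hypothetical.

```python
import random

def woa_coefficients(t, max_iter, dim, rng):
    """Sample a, A and C for iteration t."""
    a = 2.0 * (1.0 - t / max_iter)                        # decreases linearly from 2 to 0
    A = [2.0 * a * rng.random() - a for _ in range(dim)]  # each entry lies in [-a, a]
    C = [2.0 * rng.random() for _ in range(dim)]          # each entry lies in [0, 2]
    return a, A, C

def encircle(x, x_best, A, C):
    """Move agent x relative to the best agent x_best."""
    D = [abs(c * xb - xi) for c, xb, xi in zip(C, x_best, x)]   # distance term
    return [xb - a_k * d for xb, a_k, d in zip(x_best, A, D)]   # new position

rng = random.Random(0)
a, A, C = woa_coefficients(t=50, max_iter=100, dim=2, rng=rng)
print(encircle([1.0, 2.0], [3.0, 4.0], A, C))
```

When `A` is zero the agent lands exactly on the best position, which is the limiting behavior at the final iteration.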

3.5.2. Bubble-Net Attacking Method

The bubble-net feeding method of humpback whales is modeled using shrinking encircling and spiral updating mechanisms simultaneously. Shrinking encircling is achieved by setting $\vec{A}$ within [−1, 1] and linearly reducing $\vec{a}$. The new position is calculated between the agent’s current position and the best search agent’s position. The helix-shaped movement is modeled by:

$$\vec{D}' = |\vec{X}^{*}(t) - \vec{X}(t)| \tag{20}$$

$$\vec{X}(t+1) = \vec{D}' \cdot e^{bl} \cdot \cos(2 \pi l) + \vec{X}^{*}(t) \tag{21}$$

where $b$ is a constant defining the logarithmic spiral shape, and $l$ is a random number in [−1, 1]. Since whales move around their prey within a shrinking circle and along a spiral path simultaneously, each of the two mechanisms is chosen with a 50% probability:

$$\vec{X}(t+1) = \begin{cases} \vec{X}^{*}(t) - \vec{A} \cdot \vec{D}, & \text{if } p < 0.5 \\ \vec{D}' \cdot e^{bl} \cdot \cos(2 \pi l) + \vec{X}^{*}(t), & \text{if } p \geq 0.5 \end{cases} \tag{22}$$

where $p$ is a random number in [0, 1].
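The spiral update of Equations (20) and (21) can be sketched as below. Passing `l` explicitly instead of sampling it inside the function is a choice made here so that the behavior can be checked by hand; the function name is hypothetical.

```python
import math

def spiral_update(x, x_best, b, l):
    """Spiral move of agent x around the best agent x_best."""
    factor = math.exp(b * l) * math.cos(2.0 * math.pi * l)
    # |x_best - x| is the element-wise distance D' of Equation (20)
    return [abs(xb - xi) * factor + xb for xb, xi in zip(x_best, x)]

print(spiral_update([1.0, 5.0], [2.0, 3.0], b=1.0, l=0.0))   # factor is exactly 1
print(spiral_update([1.0, 5.0], [2.0, 3.0], b=1.0, l=0.25))  # cosine term is nearly 0
```

For `l = 0` the agent is placed at the best position plus the distance, while for `l = 0.25` the cosine term vanishes up to rounding and the agent lands almost on the best position.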

3.5.3. Search for Prey

The same mechanism used for shrinking encircling is applied to the prey search, but with $|\vec{A}| \geq 1$. The best search agent’s position $\vec{X}^{*}(t)$ is replaced by a randomly selected whale’s position $\vec{X}_{rand}$. This forces whales to move away from a random whale, extending the search space and enabling a global search. The prey search is modeled as:

$$\vec{D} = |\vec{C} \cdot \vec{X}_{rand} - \vec{X}(t)| \tag{23}$$

$$\vec{X}(t+1) = \vec{X}_{rand} - \vec{A} \cdot \vec{D} \tag{24}$$

The bubble-net attacking method focuses on a local search by exploiting the best current solution, while the prey search increases solution diversity for global optimization. As iterations progress, exploitation becomes more prominent, while exploration is preferred initially. Recent efforts have aimed to improve WOA by enhancing its exploitation and exploration balance. For instance, the arcsine function has been used to control this trade-off, and the Levy flight trajectory has been employed to boost exploration capabilities. In summary, the WOA algorithm is a robust global optimizer due to its balanced exploitation and exploration. The flowchart of the WOA algorithm is shown in Figure 5 [38].
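Putting the three mechanisms together, a minimal WOA loop might look like the sketch below. It is not the configuration used in this paper: the sphere objective, the population size, the bound clipping, the constant `b = 1`, and the per-dimension sampling of the coefficients are all assumptions made for illustration.

```python
import math
import random

def woa_minimize(objective, dim, n_whales=20, max_iter=200, bound=5.0, b=1.0, seed=0):
    rng = random.Random(seed)
    whales = [[rng.uniform(-bound, bound) for _ in range(dim)] for _ in range(n_whales)]
    best = min(whales, key=objective)[:]
    best_val = objective(best)
    history = [best_val]                         # best value after each iteration
    for t in range(max_iter):
        a = 2.0 * (1.0 - t / max_iter)           # decreases linearly from 2 to 0
        for i in range(n_whales):
            x = whales[i]
            if rng.random() < 0.5:               # encircling or search for prey
                new_x = []
                for k in range(dim):
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    # explore around a random whale if |A| >= 1, else exploit the best
                    ref = rng.choice(whales)[k] if abs(A) >= 1.0 else best[k]
                    new_x.append(ref - A * abs(C * ref - x[k]))
            else:                                # spiral (bubble-net) update
                l = rng.uniform(-1.0, 1.0)
                factor = math.exp(b * l) * math.cos(2.0 * math.pi * l)
                new_x = [abs(best[k] - x[k]) * factor + best[k] for k in range(dim)]
            new_x = [max(-bound, min(bound, v)) for v in new_x]   # keep inside bounds
            whales[i] = new_x
            value = objective(new_x)
            if value < best_val:
                best, best_val = new_x[:], value
        history.append(best_val)
    return best, best_val, history

def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(xi * xi for xi in x)

best, best_val, history = woa_minimize(sphere, dim=3)
print(best_val)
```

For hyperparameter tuning, each whale position would encode a candidate setting (for example, number of hidden units and learning rate), and the objective would return the validation error of the network trained with it.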

3.6. Stages of the Proposed Method

In intrusion detection, feature selection is essential due to the high dimensionality and redundancy of network traffic data. This data typically includes packet headers, payloads, and flow records, encompassing features such as IP addresses, port numbers, protocol types, and packet sizes, which often contain repetitive patterns. By identifying the most informative features, we can concentrate on critical aspects of the data, enhancing detection accuracy and reducing computational burden, thus making real-time detection more feasible. Consequently, this paper introduces a novel wrapper feature selection method that combines a metaheuristic optimization algorithm with a machine learning-based classifier.

The classifier used for this purpose is the multi-layer perceptron (MLP). MLPs are particularly well-suited for wrapped feature selection methods due to several key advantages. Firstly, MLPs can model complex, non-linear relationships between input features and output labels, which is crucial for accurately assessing the predictive power of various feature subsets. This capability ensures that even intricate patterns in the data are effectively captured. Secondly, MLPs are relatively fast to train compared to more complex models such as deep neural networks or ensemble methods, making them ideal for a wrapped feature selection context where the classifier needs to be repeatedly trained on different subsets of features. This computational efficiency is a significant consideration. Additionally, MLPs strike a good balance between performance and complexity, often achieving high accuracy with fewer parameters and less computational overhead. This balance makes them practical for feature selection, where the primary goal is to compare different subsets rather than to achieve the best possible classification accuracy. Moreover, the flexibility of MLPs in handling various data types and their robustness to overfitting through techniques such as regularization further enhance their suitability for this task. Collectively, these attributes make MLPs an effective classifier for wrapped feature selection methods, enabling the identification of optimal feature subsets while maintaining computational efficiency and accuracy.

To facilitate the selection of different feature subsets and evaluate them using the MLP, a mechanism is necessary. While random selection and choosing the best result is one approach, a more efficient method is preferable. In this paper, we propose using the GA-PSO metaheuristic optimization algorithm to evaluate different feature subsets.

In the previous section, we discussed both GA and PSO individually. The GA-PSO hybrid algorithm combines the strengths of both GA and PSO to more effectively address complex optimization problems. GA operates through mechanisms such as selection, crossover, and mutation, simulating the process of natural selection to evolve a population of potential solutions. In the GA-PSO hybrid, a population of particles is initially randomized. GA operations, including selection, crossover, and mutation, are then applied to introduce diversity and new genetic material into the population. These operations thoroughly explore the global search space, ensuring a wide range of potential solutions.

PSO, inspired by the social behavior of birds flocking or fish schooling, enhances the performance of GA by incorporating a swarm-based approach. This involves updating the velocity and position of each particle based on both their individual experiences and the experiences of neighboring particles. By merging PSO with GA, the algorithm gains improved local search capabilities. The particles adjust their positions toward the best-known solutions, effectively exploiting the search space. This iterative process continues with particles moving towards optimal solutions based on both personal and global best positions, ensuring a balance between exploration and exploitation until a stopping criterion, such as reaching a maximum number of iterations or converging to a satisfactory solution, is met.

The main advantages of the GA-PSO algorithm include enhanced exploration and exploitation capabilities, making it more robust against local optima and premature convergence. The GA operations ensure adequate exploration of the search space, while PSO provides efficient local search, leading to faster convergence to optimal solutions. Additionally, GA-PSO is highly adaptable to various optimization problems and scalable for large-scale applications, benefiting from the parallel nature of both algorithms. This hybrid approach leverages the strengths of GA and PSO, resulting in a powerful and flexible optimization tool.

In this paper, the input to the GA-PSO algorithm is a binary vector whose length corresponds to the number of features (feature vector). In this vector, a value of one indicates that the feature at that index is included in the selected subset, while a value of zero indicates that the feature is excluded. The selected subset is then evaluated by a cost function that involves training an MLP and assessing its error on a validation set of the input data. This process is repeated, with the GA-PSO equations updating the selected features until the stopping criterion is met. Algorithm 1 presents the pseudo-code of GA-PSO for feature selection.

Algorithm 1. Pseudo-code of GA-PSO for feature selection.

Initialize a population of particles with random feature vectors and velocities
Evaluate fitness of each particle using MLP
Initialize pBest for each particle
Initialize gBest based on the best fitness in the population
Repeat until stopping criterion is met:
    //Genetic Algorithm Operations
    Select particles for mating pool based on fitness
    Perform crossover on selected particles to create offspring
    Apply mutation to offspring
    Evaluate fitness of offspring using MLP
    //Particle Swarm Optimization Operations
    For each particle:
        Update velocity based on current velocity, pBest, and gBest
        Update feature vector based on new velocity
        Evaluate fitness of particle using MLP
        If fitness of particle is better than its pBest:
            Update pBest to current position
        If fitness of particle is better than gBest:
            Update gBest to current position
Return gBest as the selected feature subset
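The loop of Algorithm 1 can be sketched in Python as shown below. This is an illustrative sketch under several assumptions that the text does not specify: a sigmoid transfer function maps velocities back to bits, an offspring replaces a particle only when it has a lower cost, and the inertia, acceleration, and mutation constants are placeholder values. The function `toy_cost` is a hypothetical stand-in for the MLP validation error, so that the example runs without a dataset.

```python
import math
import random


def ga_pso_select(cost, n_features, pop_size=5, iters=20, seed=0):
    """Search binary feature masks with a GA phase followed by a PSO phase.

    cost(mask) must return a non-negative error; lower is better.
    """
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    vel = [[rng.uniform(-1.0, 1.0) for _ in range(n_features)] for _ in range(pop_size)]
    fit = [cost(p) for p in pos]
    pbest, pbest_fit = [p[:] for p in pos], fit[:]
    g = min(range(pop_size), key=fit.__getitem__)
    best = {"mask": pos[g][:], "fit": fit[g]}

    def record(i):
        # Refresh the personal and global bests from particle i.
        if fit[i] < pbest_fit[i]:
            pbest[i], pbest_fit[i] = pos[i][:], fit[i]
        if fit[i] < best["fit"]:
            best["mask"], best["fit"] = pos[i][:], fit[i]

    for _ in range(iters):
        # GA phase: roulette-wheel selection, one-point crossover, bit-flip mutation.
        weights = [1.0 / (1e-9 + f) for f in fit]  # lower cost, higher selection chance
        for i in range(pop_size):
            parent_a, parent_b = rng.choices(pos, weights=weights, k=2)
            cut = rng.randrange(1, n_features)
            child = parent_a[:cut] + parent_b[cut:]
            child = [1 - bit if rng.random() < 0.05 else bit for bit in child]
            child_fit = cost(child)
            if child_fit < fit[i]:  # assumed rule: keep the child only if it is better
                pos[i], fit[i] = child, child_fit
                record(i)
        # PSO phase: velocity update, then a sigmoid transfer back to bits (assumed).
        for i in range(pop_size):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (0.7 * vel[i][d]
                     + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                     + 1.5 * r2 * (best["mask"][d] - pos[i][d]))
                vel[i][d] = max(-6.0, min(6.0, v))  # clamp to keep the sigmoid stable
                prob_one = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob_one else 0
            fit[i] = cost(pos[i])
            record(i)
    return best["mask"], best["fit"]


def toy_cost(mask):
    """Hypothetical stand-in for the MLP error: features 0, 2, and 5 are useful."""
    useful = {0, 2, 5}
    missed = sum(1 for j in useful if not mask[j])
    extra = sum(mask) - (len(useful) - missed)
    return missed + 0.1 * extra


best_mask, best_cost = ga_pso_select(toy_cost, n_features=8)
print(best_mask, best_cost)
```

In an actual wrapper setting, `toy_cost` would be replaced by a function that trains the MLP on the columns selected by the mask and returns its validation error, which is the expensive step of each iteration.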

The second phase of the proposed method involves classifying intrusions using the selected features. For this purpose, we employ an LSTM network as the classifier. LSTM networks are typically used for sequential data due to their ability to retain information over long periods. However, they are also effective for feature-based classifications, as they can capture and leverage correlations among features even when they are not sequentially dependent. LSTMs, with their memory cells, can model complex, non-linear relationships among features more effectively than simpler models such as logistic regression or decision trees. Furthermore, LSTMs demonstrate strong generalization capabilities, potentially leading to better performance on unseen data compared to other models.

To design the LSTM network for our purposes, we use one LSTM layer followed by a fully connected layer. The fully connected layer is responsible for the classification task, with the number of neurons equal to the number of classes. Finally, a SoftMax layer completes the classification process.

However, it is important to note that LSTMs are complex networks whose performance is highly dependent on hyperparameter tuning. Therefore, a mechanism for adjusting LSTM hyperparameters is necessary. In this paper, a metaheuristic optimization algorithm is proposed to address this issue. While GA-PSO, introduced in the previous section for feature selection, is robust and offers appropriate exploration and exploitation, the hyperparameter tuning of LSTM is computationally intensive and requires a simpler optimization algorithm. Here, the WOA is proposed for tuning LSTM’s hyperparameters. WOA is a powerful metaheuristic algorithm particularly effective for continuous optimization problems. It is simpler to implement, with fewer parameters to tune, making it easier to apply to various problems without extensive customization. Moreover, WOA typically has a lower computational cost as it focuses on a single optimization strategy, making it more efficient in terms of computational resources.

In this study, four key hyperparameters of the LSTM are considered for tuning using WOA: learning rate, learning rate drop factor, batch size, and the number of hidden units. The learning rate in deep learning controls how much the model’s weights are adjusted in response to the estimated error with each update. A higher learning rate can speed up training but risks overshooting optimal values, whereas a lower learning rate ensures more precise convergence but can slow down the training process. Consequently, a piecewise learning rate drop is adopted to improve performance, gradually reducing the learning rate to allow precise convergence. The learning rate drop factor is a multiplicative factor by which the learning rate is reduced at specified intervals during training, aiding in fine-tuning the model by allowing it to converge more precisely to the optimal solution.
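The effect of the learning rate drop factor can be illustrated with a small sketch of a piecewise schedule. The drop period (the number of epochs between reductions) is not stated in the text, so `drop_period` and the numeric values below are assumptions chosen only for illustration.

```python
def piecewise_lr(initial_lr, drop_factor, drop_period, epoch):
    """Learning rate at a 0-based epoch when it is multiplied by
    drop_factor once every drop_period epochs."""
    return initial_lr * drop_factor ** (epoch // drop_period)


# Example: start at 0.01 and halve the rate every 10 epochs (assumed values).
for epoch in (0, 9, 10, 25):
    print(epoch, piecewise_lr(0.01, 0.5, 10, epoch))
```

A smaller drop factor shrinks the step size more aggressively at each interval, which favors fine convergence late in training at the cost of slower progress if the drop comes too early.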

The next hyperparameter to be adjusted is batch size. The batch size in deep learning training refers to the number of training samples processed before the model’s internal parameters are updated. A larger batch size leads to more stable gradient estimates but requires more memory, while a smaller batch size allows for faster updates and requires less memory, but can result in noisier gradient estimates. The final hyperparameter is the number of hidden units in the LSTM layer, which determines the dimensionality of the hidden state and cell state in the LSTM network. More hidden units enable the model to capture more complex patterns but also increase computational requirements. By optimizing these hyperparameters, better performance can be achieved.

After determining the inputs, the cost function should be designed. In our case, the cost function trains the designed LSTM with the hyperparameters proposed by WOA and returns the error value obtained after training. Given that training LSTMs is computationally intensive and time-consuming, the maximum number of epochs is initially set to a small value of five when comparing different hyperparameter settings. Once the optimal hyperparameters have been determined through optimization, the designed LSTM network should then be trained with a larger number of epochs to fully realize its potential. Algorithm 2 presents the pseudo-code of WOA for optimizing LSTM hyperparameters.

Algorithm 2. Pseudo-code of WOA-based hyperparameter optimization of LSTM network

Initialize:
    Define the LSTM model structure
    Define the hyperparameters to optimize: Learning Rate (lr), Learning Rate Drop Factor (lr_drop), Batch Size (batch_size), Number of Hidden Units (hidden_units)
    Set WOA parameters: Population size (N), Maximum number of iterations (T), Boundary values for each hyperparameter
    Define the cost function (classification error)
    Initialize the positions of whales (population) with random values for the hyperparameters
    Evaluate the fitness (classification error) of each whale using 5 epochs of LSTM training
    Identify the best whale (solution) with the lowest error
While (t < T): //Iterate through WOA optimization loop
    For each whale i in the population:
        Update the coefficient vectors A and C
        Generate a random number p in [0,1]
        If (p < 0.5):
            If (|A| < 1):
                Update the position of whale i towards the best whale (exploitation: encircling the prey)
            Else:
                Update the position of whale i relative to a random whale in the population (exploration: search for prey, Equations (23) and (24))
        Else:
            Move whale i along a spiral path around the best whale (exploitation: bubble-net attack)
        Ensure the updated positions of whale i stay within the predefined bounds for hyperparameters
    For each whale i:
        Update the LSTM hyperparameters using the whale’s position (current hyperparameter set)
        Train the LSTM for five epochs and compute the fitness (classification error)
        Update the best whale if a better hyperparameter set is found
    Increment iteration counter t
Train the final LSTM model with the best hyperparameter set and a larger number of epochs
Output the best hyperparameter set (lr, lr_drop, batch_size, hidden_units) and the final LSTM model
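The optimization loop can be sketched in Python as follows. The sketch follows the commonly published WOA update rules, in which the branch p ≥ 0.5 performs the spiral bubble-net move around the best whale with spiral constant b = 1; this choice, the linear decrease of a from 2 to 0, and the scalar coefficients per whale are assumptions. The function `sphere_cost` is a hypothetical cheap stand-in for the five-epoch LSTM training error.

```python
import math
import random


def woa_minimize(cost, bounds, pop_size=10, iters=20, seed=0):
    """Minimize cost(position) over box bounds with a basic WOA loop."""
    rng = random.Random(seed)
    dim = len(bounds)

    def clip(value, d):
        lo, hi = bounds[d]
        return min(max(value, lo), hi)

    whales = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fits = [cost(w) for w in whales]
    b = min(range(pop_size), key=fits.__getitem__)
    best, best_fit = whales[b][:], fits[b]

    for t in range(iters):
        a = 2.0 * (1.0 - t / iters)  # assumed schedule: linear decrease from 2
        for i in range(pop_size):
            A = 2.0 * a * rng.random() - a
            C = 2.0 * rng.random()
            p = rng.random()
            x = whales[i]
            if p < 0.5:
                # |A| < 1: encircle the best whale; otherwise use a random whale.
                ref = best if abs(A) < 1 else whales[rng.randrange(pop_size)]
                new = [ref[d] - A * abs(C * ref[d] - x[d]) for d in range(dim)]
            else:
                # Spiral move around the best whale (spiral constant b = 1).
                l = rng.uniform(-1.0, 1.0)
                new = [abs(best[d] - x[d]) * math.exp(l) * math.cos(2 * math.pi * l)
                       + best[d] for d in range(dim)]
            whales[i] = [clip(v, d) for d, v in enumerate(new)]
        for i in range(pop_size):
            fits[i] = cost(whales[i])
            if fits[i] < best_fit:
                best, best_fit = whales[i][:], fits[i]
    return best, best_fit


def sphere_cost(position):
    """Hypothetical stand-in for the LSTM error; its minimum is at 0.3 per dimension."""
    return sum((v - 0.3) ** 2 for v in position)


best_position, best_error = woa_minimize(sphere_cost, [(0.0, 1.0)] * 4)
print(best_position, best_error)
```

With a real cost function, each call to `cost` would train the LSTM for five epochs, so the loop above performs one training run per whale per iteration plus one per whale at initialization.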

4. Dataset

In this paper, two datasets, CICIDS 2017 and NSL-KDD, are used to evaluate the proposed method. These datasets, along with the preprocessing they require, are explained in this section.

4.1. The CICIDS 2017

The CICIDS 2017 dataset, developed by the Canadian Institute for Cybersecurity (CIC), is a comprehensive benchmark dataset for evaluating IDS. This dataset includes benign traffic and a wide range of contemporary attacks, capturing real-world network traffic over five days. The attacks encompass various types, such as Distributed Denial of Service (DDoS), FTP-Patator, and SSH-Patator. The CICIDS 2017 dataset is distinctive due to its inclusion of both network flow data and packet capture (PCAP) files, along with extensive labeling and detailed feature extraction using the CICFlowMeter. This extensive and diverse dataset supports the development and assessment of machine learning models for cybersecurity research, facilitating advancements in detecting and mitigating cyber threats.

Moreover, the dataset provides 80 network traffic features, such as flow duration, total packets, and source/destination IPs, making it an invaluable resource for feature engineering and selection in IDS. Researchers benefit from the realistic network topology, including varied background traffic, which mirrors genuine network environments. Additionally, the dataset has been widely adopted in academic and industry research for benchmarking IDS algorithms, contributing to a standardized evaluation framework. The CICIDS 2017 dataset’s comprehensive nature and detailed documentation make it a crucial asset for advancing cybersecurity defenses and promoting innovative IDS solutions.

4.2. NSL-KDD

The NSL-KDD dataset is a widely used benchmark for evaluating the performance of intrusion detection systems. It is an improved version of the original KDD Cup 1999 dataset, designed to overcome some of the inherent issues of the original dataset, such as redundant records and an unbalanced class distribution. These improvements ensure that models trained and tested on the NSL-KDD dataset generalize better and avoid biased learning patterns.

NSL-KDD contains a variety of network traffic data, representing both normal behavior and different types of malicious attacks. The dataset is structured into two main sets: training data (KDDTrain+) and testing data (KDDTest+). Each record in the dataset is represented by 41 features, including both continuous and categorical attributes, that capture different aspects of network traffic, such as protocol type, service, duration, and flag. These features are used to identify different types of attacks grouped into four major categories: DoS (Denial of Service), R2L (Remote to Local), U2R (User to Root), and Probe.

One of the key advantages of the NSL-KDD dataset is its balanced nature, where the number of records is proportionally distributed across different classes, thus enabling a fair evaluation of classification models. The dataset is also less complex compared to its predecessor, as it removes duplicate entries and irrelevant records, making it more suitable for modern machine learning algorithms. Given these characteristics, NSL-KDD serves as an appropriate dataset for evaluating the robustness and performance of intrusion detection methods, ensuring consistent and reliable results across various scenarios.

4.3. Preprocessing Steps

Min-max normalization: normalization is a crucial step in data preprocessing that scales numerical features to a specified range, typically [0, 1], using the min-max normalization method. This technique transforms the data based on the minimum and maximum values of each feature, ensuring that the features are on a comparable scale without distorting differences in the ranges of values. Specifically, each feature value x is scaled using Equation (25):

$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$

(25)

where $x_{min}$ and $x_{max}$ are the minimum and maximum values of the feature, respectively. This normalization process helps improve the performance and training stability of machine learning models by reducing bias due to different feature scales.
Data cleaning using K-nearest neighbor: Data cleaning is an essential preprocessing step aimed at handling missing or inconsistent data entries to ensure the quality and accuracy of the dataset. The k-nearest neighbor (KNN) method is employed to impute missing values based on the values of the k-nearest observations. By selecting an appropriate value of k, the algorithm identifies the k closest data points to the instance with missing values and uses their average (or majority class for categorical data) to fill in the gaps. This approach leverages the assumption that similar instances exhibit similar behaviors, thus providing a robust and reliable means of data imputation that preserves the inherent structure and relationships within the dataset.
Data partitioning with the hold-out method: data partitioning is a fundamental step in preparing a dataset for training and evaluating machine learning models. The hold-out method is utilized to split the dataset into two distinct subsets: 70% of the data is allocated for training the model, while the remaining 30% is reserved for testing its performance. This partitioning strategy ensures that the model’s ability to generalize to new, unseen data can be effectively assessed. By evaluating the model on the test set, which has not been used during the training process, it is possible to estimate its predictive accuracy and identify any potential overfitting or underfitting issues, thereby facilitating the development of a robust and reliable machine learning model.
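Two of the preprocessing steps above, min-max normalization and the 70/30 hold-out split, can be sketched in a few lines of Python. The function names, the handling of constant features, and the random shuffling before the split are assumptions made for illustration; KNN imputation is not covered here.

```python
import random


def min_max_scale(column):
    """Scale one feature column to [0, 1] with the min-max formula."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # assumed convention for constant features
    return [(v - lo) / (hi - lo) for v in column]


def hold_out_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the row indices and split them into training and test subsets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


scaled = min_max_scale([2.0, 4.0, 6.0])
train_rows, test_rows = hold_out_split(list(range(10)))
print(scaled, len(train_rows), len(test_rows))
```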

5. Evaluation Metrics

Evaluation metrics, such as accuracy, precision, recall, and F1 score, are essential for assessing the performance of machine learning models, especially in classification tasks. Each metric provides unique insights into different aspects of the model’s performance.

5.1. Accuracy

It is the ratio of correctly predicted instances to the total instances in the dataset. It provides an overall measure of the model’s performance but can be misleading in imbalanced datasets where certain classes are underrepresented. The formula for accuracy is:

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

(26)

where $TP$ is true positives, $TN$ is true negatives, $FP$ is false positives, and $FN$ is false negatives.

5.2. Precision

It measures the proportion of true positive predictions to the total predicted positives, indicating the model’s accuracy in identifying positive instances. It is particularly important in scenarios where the cost of false positives is high. The formula for precision is:

\mathrm{Precision} = \frac{TP}{TP + FP}

(27)

5.3. Recall

Recall, also known as sensitivity or true positive rate, is the ratio of true positive predictions to the actual positives in the dataset. It reflects the model’s ability to capture all relevant instances and is crucial in applications where missing a positive instance is costly. The formula for recall is:

\mathrm{Recall} = \frac{TP}{TP + FN}

(28)

5.4. F1 Score

It is the harmonic mean of precision and recall, providing a balanced measure of the model’s performance. It is particularly useful when dealing with imbalanced datasets, as it considers both false positives and false negatives. The F1 Score is calculated as:

\mathrm{F1\ Score} = \frac{2 \times (\mathrm{Precision} \times \mathrm{Recall})}{\mathrm{Precision} + \mathrm{Recall}}

(29)

These evaluation metrics collectively provide a robust framework for assessing and comparing the performance of different machine learning models, ensuring a comprehensive understanding of their strengths and weaknesses.
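Equations (26) to (29) can be computed directly from the four confusion-matrix counts, as in the sketch below. The function name and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to exercise the formulas.
print(binary_metrics(tp=8, tn=5, fp=2, fn=1))
```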

6. Complexity Analysis

To evaluate its time and space complexity, the proposed method can be divided into two stages: training and prediction. The training stage dominates both time and space consumption and is analyzed in detail in this section. However, training is performed only once; after it is complete, the optimized LSTM detects intrusions using the selected subset of features. Hence, the prediction complexity equals the time and space complexity of the LSTM network. Moreover, since the feature dimension is reduced, the proposed method is likely to require less computation and storage than other deep learning-based approaches. The training-stage complexity of the proposed method is discussed below.

6.1. Time Complexity Analysis

To evaluate the time complexity of the proposed method during the training phase, the feature selection approach is first evaluated. The proposed feature selection combines elements of both GA and PSO, alongside evaluating fitness using an MLP. To understand its time complexity, we will analyze the individual steps involved, taking into account the operations of both GA and PSO, as well as the overhead introduced by MLP evaluations. The time complexity of this algorithm can be broken down into three main sections: initialization, GA operations, and PSO operations. We will calculate the cost of each operation and sum them for the overall complexity.

Suppose there are N particles in the population and the feature vector has D dimensions. Initializing the positions and velocities of the particles takes O(N×D) since each particle requires initialization of both its position and velocity.

Moreover, evaluating the fitness of each particle using MLP has a time complexity of O(MLP(D)), where MLP(D) is the time complexity of evaluating an MLP for a feature vector of size D. Since all N particles need to be evaluated, the total cost for initialization is O(N×MLP(D)).

The next section to evaluate is GA operations, the first step of which is particle selection. In this paper, selecting particles for the mating pool based on fitness is done using roulette wheel selection. Roulette wheel selection has a time complexity of O(N), as each particle is considered once. Next, the crossover should be assessed. Since each crossover operation involves two parents, creating new offspring takes O(N×D), as each feature vector needs to be processed. Then, applying mutation to each offspring modifies part of the feature vector, resulting in a cost of O(N×D) for mutation over all particles. Finally, each offspring is evaluated using the MLP, which results in a complexity of O(N×MLP(D)). Therefore, the total time complexity for the GA part per iteration is:

O(N + N \times D + N \times D + N \times \mathrm{MLP}(D)) = O(N \times (D + \mathrm{MLP}(D)))

(30)

The final section evaluates the PSO operations. For each particle in PSO, updating the velocity involves calculating a weighted sum of three vectors (current velocity, distance to pBest, and distance to gBest). This operation has a time complexity of O(D) per particle. For N particles, it takes O(N×D). After updating velocity, the feature vector is updated accordingly, which takes O(D) per particle, or O(N×D) for all particles. Each particle’s updated feature vector is evaluated using the MLP, resulting in a complexity of O(N×MLP(D)). Therefore, the total time complexity for the PSO part per iteration is:

O(N \times D + N \times D + N \times \mathrm{MLP}(D)) = O(N \times (D + \mathrm{MLP}(D)))

(31)

Since both the GA and PSO operations are applied in each iteration, the overall time complexity per iteration is:

O(N \times (D + \mathrm{MLP}(D))) + O(N \times (D + \mathrm{MLP}(D))) = O(N \times (D + \mathrm{MLP}(D)))

(32)

Assuming the algorithm runs for T iterations before meeting the stopping criterion, the total time complexity of the algorithm is:

O(T \times N \times (D + \mathrm{MLP}(D)))

(33)

Furthermore, the time complexity of the WOA for optimizing LSTM hyperparameters depends primarily on the number of whales (agents), the dimensionality of the problem, and the number of iterations. As before, let N represent the population size, D the dimensionality of the search space, and T the total number of iterations. In each iteration, the algorithm performs three main operations: position updates, fitness evaluation, and selection of the best solution. Updating the position of each whale requires O(D) operations, as it involves computations based on the best whale position, random exploration/exploitation strategies, and the current whale’s position. This process is repeated for all N whales, resulting in a time complexity of O(N×D) per iteration. Evaluating the fitness of each whale depends on the problem’s objective function, which here is training and evaluating the LSTM network; its cost is denoted as LSTM(D). This is performed after the position updates and has a complexity of O(N×LSTM(D)). The overall time complexity of the algorithm across all iterations is therefore O(T×N×(D + LSTM(D))).
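As a rough worked instance of these bounds, the number of classifier training runs can be counted for the settings reported in Section 7 (N = 5 and T = 20 for GA-PSO; N = 10 and T = 20 for WOA). The count assumes that the GA phase produces N offspring per iteration and that every fitness evaluation is one full training run; both are assumptions made here for illustration rather than figures reported in the experiments.

```latex
\begin{aligned}
\text{GA-PSO:}\quad & N + T \times (N + N) = 5 + 20 \times 10 = 205 \ \text{MLP training runs} \\
\text{WOA:}\quad & N + T \times N = 10 + 20 \times 10 = 210 \ \text{LSTM training runs (five epochs each)}
\end{aligned}
```

In both cases the training runs, not the O(N×D) position updates, dominate the cost, which is why the epoch budget per evaluation is kept small.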

6.2. Space Complexity Analysis

To calculate the space complexity, the proposed feature selection is again evaluated first, and then the WOA-based hyperparameter optimization is assessed. The space complexity of the proposed feature selection approach is determined by the storage required for particle positions, velocities, and fitness values, as well as any additional data structures used during the GA and PSO operations. Each particle requires storage for its position vector (size D) and velocity vector (size D), leading to a total storage requirement of O(N×D) for the population. Additionally, each particle has a personal best position vector pBest, which also requires O(N×D) space. The global best position vector gBest requires O(D) space. Meanwhile, the fitness of each particle needs to be stored, which requires O(N) space. During the GA operations, new offspring are generated, which also require storage for their feature vectors. This adds an additional O(N×D) space requirement. Finally, the weights of the MLP must also be stored, adding a fixed space complexity depending on the architecture of the MLP. We will denote this as MLP_weights. Summing all these contributions, the overall space complexity is:

O(N \times D) + O(N \times D) + O(D) + O(N) + O(N \times D) + O(\mathrm{MLP}_{weights}) = O(N \times D) + O(\mathrm{MLP}_{weights})

(34)

Moreover, the space complexity of the WOA for optimizing LSTM hyperparameters is determined by the storage required for each whale’s position, as well as the tracking of the best solution found so far. Each whale stores its position in a D-dimensional vector, requiring O(N×D) space to store the population’s positions. Additionally, the algorithm needs to store the fitness values of each whale, which requires O(N) space, as well as the best fitness value and its associated position, which takes O(D) space. Therefore, the total space complexity of WOA is O(N×D), with the dominant term being the storage of whale positions across the population. This space complexity will be summed with those which are required for LSTM weights (the cost function), denoted as O(LSTM_weights). The final value is obtained as O(N×D) + O(LSTM_weights).

7. Simulation Results

This section presents the detailed results obtained from evaluating the proposed approach. The simulations are conducted using MATLAB 2024a on a system equipped with an Intel Core i7 13650HX CPU operating at 2.6 GHz with a 24 MB cache, 16 GB of RAM, and an NVIDIA RTX 4060 GPU for graphical processing.

7.1. Feature Selection Results

Given the high dimensionality of the input dataset, which contains a large number of features, implementing an effective feature selection strategy is crucial for enhancing classification performance, reducing computational complexity, and mitigating overfitting. In this study, we introduce a novel wrapped feature selection approach that combines the MLP classifier with the GA-PSO optimization algorithm.

The proposed feature selection process begins by utilizing the GA-PSO algorithm to select various subsets of features. Decision variables for feature selection are represented as binary flags, where each flag indicates whether a corresponding feature should be retained (value of 1) or discarded (value of 0). The optimization process iteratively adjusts these decision variables to identify a subset of features that maximizes classification performance. The parameters of the GA-PSO algorithm, including the maximum number of iterations and population size, are tailored to balance computational efficiency with the quality of solutions. In this paper, the maximum number of iterations and population size are set to 20 and 5, respectively.

Additionally, to configure the MLP parameters for evaluating different subsets of features, a network with one hidden layer comprising 10 neurons is designed, with the number of training epochs set to 40. This choice is based on a trade-off analysis between evaluation accuracy and execution time. By solving the optimization process, 68 out of 78 features are selected using the proposed feature selection approach. The optimization process converges efficiently, as evidenced by the convergence curve shown in Figure 6. The resulting subset of selected features of the CICIDS-2017 dataset is visualized through a correlation heatmap, depicted in Figure 7. As observed, most of the selected features exhibit nearly zero correlation with each other, indicating a diverse set of information within the selected subset.

7.2. Hyperparameter Optimization Results

Following the selection of the optimal subset of features, we proceed to optimize the hyperparameters of the LSTM network, specifically the number of hidden units, learning rate, learning rate drop factor, and batch size. This optimization is carried out using the WOA. Proper tuning of these parameters is crucial as they significantly impact the performance of the network.

The first step involves configuring the settings for the WOA algorithm, including the upper and lower bounds for the decision variables, the number of population members, and the maximum number of iterations. To ensure a comprehensive search space, specific ranges for each decision variable (LSTM hyperparameters) are defined: the number of hidden units is set between 2 and 150, the learning rate between 0.0001 and 0.01, the learning rate drop factor between 0.1 and 0.5, and the batch size between 2 and 120.
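The mapping from a whale position to a valid hyperparameter set can be sketched as below, using the ranges listed above. The ordering of the position entries, the rounding of the two integer-valued hyperparameters, and the names `BOUNDS` and `decode_position` are assumptions for illustration.

```python
# Search ranges taken from the text; the key order defines the position layout.
BOUNDS = {
    "hidden_units": (2, 150),
    "learning_rate": (0.0001, 0.01),
    "lr_drop_factor": (0.1, 0.5),
    "batch_size": (2, 120),
}
INTEGER_KEYS = {"hidden_units", "batch_size"}


def decode_position(position):
    """Clip a 4-element continuous position to the bounds and round
    the integer-valued hyperparameters."""
    params = {}
    for value, (name, (lo, hi)) in zip(position, BOUNDS.items()):
        clipped = min(max(value, lo), hi)
        params[name] = int(round(clipped)) if name in INTEGER_KEYS else clipped
    return params


print(decode_position([200.7, 0.00005, 0.3, 63.6]))
```

Clipping after each position update keeps every candidate inside the search space, so the cost function never receives an invalid configuration such as a negative learning rate.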

Adjustable parameters such as the number of population members and the maximum number of iterations play a pivotal role in the optimization process. Larger values allow for a more thorough exploration of the search space, while smaller values expedite the process. Considering that each iteration involves training the LSTM network and evaluating its error, a balance is needed to prevent excessively long optimization times. Given the relatively small search space defined by our parameter ranges, we set the maximum number of iterations to 20 and the number of population members to 10.

To speed up this process, the number of training epochs for the LSTM network is limited to five during the hyperparameter optimization phase. Fewer epochs accelerate the search, but the network may not fully converge within them, which can affect the measured accuracy. It is important to note that this phase focuses solely on comparing different hyperparameter settings, and final training is conducted with the optimal hyperparameters identified.

An additional parameter to facilitate fast network convergence is the choice of the optimization algorithm. For this study, we utilize the well-regarded ADAM algorithm, known for its rapid convergence. Figure 8 illustrates the convergence curve of the LSTM hyperparameter optimization process using the WOA algorithm, demonstrating relatively fast convergence. Table 1 presents the optimal values obtained through this process.

Finally, the network is trained using the identified optimal parameters over 500 epochs. The subsequent section evaluates the performance of the trained network using the proposed approach.

7.3. Intrusion Detection Results

In this section, we present the performance evaluation of the proposed intrusion detection method using several metrics on two datasets, CICIDS-2017 and NSL-KDD. The results are depicted through confusion matrices, ROC curves, and bar and box charts of the evaluation metrics presented in the previous section. Each type of plot is discussed in detail, followed by specific analyses for both types of intrusion detection tasks.

The confusion matrix is a tool used to visualize the performance of a classification algorithm. It shows the number of correct and incorrect predictions made by the model compared to the actual outcomes in the dataset. Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class. This enables the calculation of key metrics such as accuracy, precision, recall, and F1 score, which provide insights into the model’s performance in terms of both correct classifications and errors.
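To make the link between the matrix and the metrics concrete, the following Python sketch derives accuracy, per-class precision, recall, and F1 score from a confusion matrix laid out as described above (rows are actual classes, columns are predicted classes). The function name and the macro-averaging step are illustrative assumptions, and the small matrix is made-up data, not a result from this study.

```python
def classification_metrics(cm):
    """cm[i][j] = number of samples of actual class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total

    per_class = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append({"precision": precision, "recall": recall, "f1": f1})

    # Macro average: unweighted mean over classes (an assumed convention).
    macro = {m: sum(c[m] for c in per_class) / n for m in ("precision", "recall", "f1")}
    return accuracy, per_class, macro


# Made-up 2x2 example: 8 + 9 correct predictions out of 20 samples.
toy_cm = [[8, 2],
          [1, 9]]
accuracy, per_class, macro = classification_metrics(toy_cm)
print(accuracy)   # 0.85
print(per_class[0])
print(macro)
```

The same function applies unchanged to a three-class matrix, since precision uses column sums and recall uses row sums for each class.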

Figure 9 displays the confusion matrix for the DDoS attack detection, a binary classification problem from the CICIDS dataset. The matrix shows that the model correctly identified 1929 instances of DDoS attacks and 1444 instances of normal traffic. However, one normal traffic instance was incorrectly classified as a DDoS attack, and 12 DDoS attacks were missed. These results indicate a high accuracy in detecting DDoS attacks, with a relatively low number of false positives and false negatives.

Figure 10 presents the confusion matrix for the detection of FTP-Patator and SSH-Patator attacks, which is a multi-class classification problem from the CICIDS dataset. The matrix reveals that the model accurately classified 2372 instances of FTP-Patator attacks, 1759 instances of SSH-Patator attacks, and 3511 instances of normal traffic. Misclassifications included eight normal traffic instances misclassified as FTP-Patator, nine normal traffic instances misclassified as SSH-Patator, and 10 and 19 instances missed for each attack type. These results demonstrate the model’s effectiveness in distinguishing between multiple types of intrusions.

Figure 11 presents the confusion matrix for anomaly detection using the second (NSL-KDD) dataset. The matrix reveals that the model successfully identified 2122 true anomalies and 2394 instances of normal traffic. However, there were 10 instances of normal traffic misclassified as anomalies, and eight anomalies were incorrectly labeled as normal traffic. These findings demonstrate a strong accuracy in anomaly detection, with relatively few false positives and false negatives.
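As a quick consistency check, the accuracy for each task can be recomputed from the counts quoted above as correct predictions divided by all predictions. The sketch below only uses the numbers stated in the text; the dictionary layout and names are illustrative.

```python
# Counts quoted in the text for Figures 9-11: correctly classified
# instances per class, and the misclassified instances.
reported = {
    "DDoS (CICIDS-2017)": {"correct": [1929, 1444], "errors": [1, 12]},
    "FTP-/SSH-Patator (CICIDS-2017)": {"correct": [2372, 1759, 3511], "errors": [8, 9, 10, 19]},
    "Anomaly (NSL-KDD)": {"correct": [2122, 2394], "errors": [10, 8]},
}


def accuracy_percent(counts):
    """Accuracy in percent = correct / (correct + errors) * 100."""
    correct = sum(counts["correct"])
    total = correct + sum(counts["errors"])
    return 100.0 * correct / total


accuracies = {name: accuracy_percent(c) for name, c in reported.items()}
for name, acc in accuracies.items():
    print(f"{name}: {acc:.3f}%")
```

The three values come out at about 99.616%, 99.402%, and 99.603%, in line with the accuracies reported for these tasks.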

The Receiver Operating Characteristic (ROC) curve is a graphical representation of a classifier’s performance across different threshold settings. The ROC curve plots the true positive rate (recall) against the false positive rate, providing a visual insight into the trade-offs between sensitivity and specificity. The area under the ROC curve (AUC) is a single scalar value that summarizes the overall performance of the classifier; a higher AUC indicates better performance.
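The threshold sweep behind an ROC curve can be sketched in a few lines of Python. This is a minimal illustration, not the plotting code used for the figures: the function names are hypothetical, the labels and scores are made-up, and the AUC is obtained with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists from sweeping the threshold over the scores.

    y_true holds 0/1 labels and must contain at least one of each class.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
               for k in range(1, len(fpr)))


# Made-up example: two negatives and two positives with overlapping scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
auc_value = auc_trapezoid(fpr, tpr)
print(auc_value)  # 0.75
```

A classifier whose scores separate the two classes perfectly yields an AUC of 1.0, while random scoring gives about 0.5.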

Figure 12 shows the ROC curve for DDoS attack detection for the CICIDS 2017 dataset. The curve illustrates a high true positive rate with a correspondingly low false positive rate across various thresholds, resulting in an AUC of 0.9999. This high AUC value reflects the model’s strong capability to distinguish between DDoS attacks and normal traffic, indicating robust performance in this binary classification task.

Figure 13 displays the ROC curves for the detection of FTP-Patator and SSH-Patator attacks for the CICIDS 2017 dataset. Each curve represents the classifier’s performance for one of the three classes (FTP-Patator, SSH-Patator, and normal traffic). The AUC values for these curves are 0.9993, 0.9992, and 0.9993, respectively. These curves demonstrate the model’s ability to effectively differentiate between the different types of traffic, although the multi-class nature introduces some complexity in maintaining high performance across all classes.

Figure 14 illustrates the ROC curve for anomaly detection using the NSL-KDD dataset. The curve demonstrates a high true positive rate coupled with a low false positive rate across different thresholds, resulting in an AUC of 0.9997 for both normal and anomaly classifications. This elevated AUC value indicates the model’s strong proficiency in differentiating between normal traffic and anomalies, showcasing its robust performance in this binary classification task.

Finally, to provide a comprehensive assessment of the model’s performance, we evaluate it using several metrics: accuracy, precision, recall, and F1 score. These metrics are derived from the confusion matrix and offer a detailed view of the model’s strengths and weaknesses in classification.

Figure 13 presents the bar chart showing the evaluation metrics for DDoS attack detection for the CICIDS 2017 dataset. The model achieves an accuracy of 99.616%, precision of 99.656%, recall of 99.562%, and F1 score of 99.609%. These metrics indicate a high level of precision and recall, suggesting that the model is effective at identifying DDoS attacks while maintaining a low rate of false alarms.

Figure 15 illustrates the bar chart for the evaluation metrics of FTP-Patator/SSH-Patator detection for the CICIDS 2017 dataset. The accuracy, precision, recall, and F1 score for all classes are 99.402%, 99.342%, 99.431%, and 99.387%, respectively. These results highlight the model’s balanced performance across the different classes, with high precision and recall values indicating effective detection and minimal misclassification.

Figure 16 presents a box chart of the performance metrics (accuracy, precision, recall, and F1 score) of the proposed method on the NSL-KDD dataset. To ensure a robust assessment, we replicated the evaluation ten times. The results reveal a median accuracy of 99.6%, a median F1 score of 99.58%, a median precision of 99.55%, and a median recall of 99.62%. These high median values indicate that the proposed method consistently performs well across the evaluation metrics, affirming its effectiveness in detecting anomalies.

8. Comparison

This section compares the proposed method with other approaches from the literature. Each method is first described, and Table 2 then summarizes them to simplify the comparison.

Reference [40] analyzes the performance of deep learning algorithms for intrusion detection in smart devices, comparing DNN, CNN, and LSTM networks. The study uses the CIC-IDS 2017 dataset to evaluate and compare the accuracy of these algorithms.

Reference [41] presents an IDS for Software-Defined Networks (SDNs), which operates as an application module in the controller. The system comprises three phases: pre-training using sparse stacked auto-encoders for feature learning, training with a SoftMax classifier, and parameter optimization. The method is implemented using Mininet and Keras, and its performance is evaluated on the NSL-KDD and CICIDS2017 datasets, achieving an average accuracy of 98.5%.

Reference [42] proposes a hierarchical IDS combining three classifiers: REP Tree, JRip algorithm, and Forest PA. This model, named RDTIDS, features two classifiers operating in parallel, feeding into a third classifier. Evaluation on CICIDS2017 and BoT-IoT datasets demonstrates that the hierarchical model outperforms other recent machine learning models, achieving the highest detection rates (DR), true negative rates (TNR), and accuracy, with the lowest false alarm rates (FAR).

Reference [43] explores various CNN-GRU sequence combinations to optimize network parameters for effective feature learning in IDS applications. Using the CICIDS-2017 benchmark dataset, the proposed technique achieves high accuracy (98.73%) and a low False Positive Rate (FPR) of 0.075. Evaluation metrics include precision, recall, True Positive Rate (TPR), and FPR, demonstrating significant improvements over existing methods.

Reference [44] introduces a one-dimensional convolutional neural network (1D CNN)-based architecture for detecting network intrusions, focusing on four types: DoS Hulk, DDoS, DoS Goldeneye (active attacks), and PortScan (passive attack). The study utilizes the CICIDS2017 dataset to conduct experiments, achieving an impressive accuracy of 98.96% in detecting these intrusions. Deep learning techniques, particularly 1D CNNs, are highlighted for their ability to effectively process network data with minimal input and explore comprehensive feature sets critical for intrusion detection.

All of the methods above achieve satisfactory performance in intrusion detection. However, as summarized in Table 2, the proposed method attains higher accuracy than each of them, which we attribute to its computational efficiency and its ability to avoid overfitting.

9. Conclusions

In this paper, we presented a comprehensive intrusion detection method designed to address the challenges posed by high-dimensional data and the need for real-time detection capabilities in network security. Traditional methods often fall short in handling the complexity and volume of modern network traffic, leading to issues such as reduced accuracy and increased computational demands. To overcome these limitations, we proposed an innovative approach that integrates a novel wrapped feature selection technique with an optimized LSTM network using the WOA.

The first phase of our method involves a sophisticated feature selection process utilizing an MLP and a hybrid GA-PSO algorithm. This combination allows for the effective reduction of the feature set from 78 to 68 features, ensuring that the most relevant and informative features are retained. The selected features exhibit minimal correlation, indicating a diverse and representative subset that captures the critical aspects of the data. This reduction in dimensionality not only enhances the classification performance but also significantly lowers computational complexity, making the system more feasible for real-time applications.

The second phase focuses on the classification of intrusions using an LSTM network, which is well-suited for handling sequential and feature-based data. The LSTM network’s hyperparameters, including the number of hidden units, learning rate, learning rate drop factor, and batch size, are optimized using the WOA. This optimization ensures that the LSTM network is fine-tuned to deliver optimal performance, balancing between computational efficiency and detection accuracy.

Extensive simulations were conducted to validate the effectiveness of the proposed method. The results demonstrate that the LSTM-WOA classifier achieves a remarkable accuracy of 99.62% in DDoS attack detection, 99.40% in FTP-Patator/SSH-Patator attack detection using the CICIDS-2017 dataset, and a median accuracy of 99.6% for anomaly detection using the NSL-KDD dataset. These high accuracy rates, coupled with strong performance metrics such as precision, recall, and F1 score, underscore the method’s capability in handling both binary and multi-class classification tasks. The confusion matrices and ROC curves further illustrate the robustness of the model, highlighting its ability to correctly classify both types of intrusions with minimal false positives and false negatives.

The findings of this study suggest that the integration of advanced feature selection and hyperparameter optimization techniques can significantly enhance the performance of intrusion detection systems. By effectively reducing the dimensionality of the input data and optimizing the classifier’s parameters, the proposed method not only improves detection accuracy but also ensures computational efficiency, making it suitable for deployment in real-time network security environments.

Future work can build upon this foundation by exploring the application of the proposed method to other types of cyber-attacks and further refining the optimization algorithms to handle even larger and more complex datasets. Additionally, integrating this approach with other machine learning techniques, such as deep learning models or ensemble methods, and testing it in real-world deployment scenarios could provide valuable insights into its practical utility and scalability. Investigating the method’s adaptability to evolving threat landscapes and its resilience to adversarial attacks would also be valuable directions for future research.

Author Contributions

Conceptualization, H.A.-H. and M.M.H.; methodology, M.M.H.; software, H.A.-H.; validation, M.A.A., M.M.H. and A.Y.; formal analysis, H.A.-H.; investigation, M.M.H.; resources, M.A.A.; data curation, H.A.-H.; writing—original draft preparation, M.M.H.; writing—review and editing, H.A.-H.; visualization, A.Y.; supervision, M.M.H.; project administration, A.Y.; funding acquisition, H.A.-H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets generated and analyzed in this study are included in this manuscript. Other datasets are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest relevant to this article.

References

1. Li, M.; Qi, J.; Tian, X.; Guo, H.; Liu, L.; Fathollahi-Fard, A.M.; Tian, G. Smartphone-based straw incorporation: An improved convolutional neural network. Comput. Electron. Agric. 2024, 221, 109010.
2. Chafjiri, A.S.; Gheibi, M.; Chahkandi, B.; Eghbalian, H.; Waclawek, S.; Fathollahi-Fard, A.M.; Behzadian, K. Enhancing flood risk mitigation by advanced data-driven approach. Heliyon 2024, 10, e37758.
3. Ghazikhani, A.; Davoodipoor, S.; Fathollahi-Fard, A.M.; Gheibi, M.; Moezzi, R. Robust Truck Transit Time Prediction through GPS Data and Regression Algorithms in Mixed Traffic Scenarios. Mathematics 2024, 12, 2004.
4. Khansar, H.H.; Chafjiri, A.S.; Fathollahi-Fard, A.M.; Gheibi, M.; Moezzi, R.; Parsa, J.; Annuk, A. Meta-Heuristic-Based Machine Learning Techniques for Soil Stress Prediction in Embankment Dams During Construction. Indian Geotech. J. 2024, 1–23.
5. Maseer, Z.K.; Yusof, R.; Bahaman, N.; Mostafa, S.A.; Foozy, C.F.M. Benchmarking of machine learning for anomaly based intrusion detection systems in the CICIDS2017 dataset. IEEE Access 2021, 9, 22351–22370.
6. Rosay, A.; Carlier, F.; Leroux, P. MLP4NIDS: An Efficient MLP-Based Network Intrusion Detection for CICIDS2017 dataset. In Proceedings of the Machine Learning for Networking: Second IFIP TC 6 International Conference, MLN 2019, Paris, France, 3–5 December 2019.
7. Catillo, M.; Del Vecchio, A.; Pecchia, A.; Villano, U. A Case Study with CICIDS2017 on the Robustness of Machine Learning Against Adversarial Attacks in Intrusion Detection. In Proceedings of the 18th International Conference on Availability, Reliability and Security, Benevento, Italy, 29 August–1 September 2023.
8. Chindove, H.; Brown, D. Adaptive Machine Learning Based Network Intrusion Detection. In Proceedings of the International Conference on Artificial Intelligence and its Applications, Bagatelle, Mauritius, 9–10 December 2021.
9. Aldarwbi, M.Y.; Lashkari, A.H.; Ghorbani, A.A. The sound of intrusion: A novel network intrusion detection system. Comput. Electr. Eng. 2022, 104, 108455.
10. Panwar, S.S.; Raiwani, Y.; Panwar, L.S. An Intrusion Detection Model for CICIDS-2017 Dataset Using Machine Learning Algorithms. In Proceedings of the 2022 International Conference on Advances in Computing, Communication and Materials (ICACCM), Dehradun, India, 10–11 November 2022.
11. Ho, S.; Al Jufout, S.; Dajani, K.; Mozumdar, M. A novel intrusion detection model for detecting known and innovative cyberattacks using convolutional neural network. IEEE Open J. Comput. Soc. 2021, 2, 14–25.
12. Kshirsagar, D.; Kumar, S. Towards an intrusion detection system for detecting web attacks based on an ensemble of filter feature selection techniques. Cyber-Phys. Syst. 2023, 9, 244–259.
13. Pelletier, Z.; Abualkibash, M. Evaluating the CIC IDS-2017 dataset using machine learning methods and creating multiple predictive models in the statistical computing language R. Int. Res. J. Adv. Eng. Sci. 2020, 5, 187–191.
14. Priyanka, V.; Gireesh Kumar, T. Performance Assessment of IDS Based on CICIDS-2017 Dataset. In Information and Communication Technology for Competitive Strategies (ICTCS 2020) ICT: Applications and Social Interfaces; Springer: Singapore, 2022.
15. Krsteski, S.; Tashkovska, M.; Sazdov, B.; Radojichikj, L.; Cholakoska, A.; Efnusheva, D. Intrusion Detection with Supervised and Unsupervised Learning Using Pycaret Over CICIDS 2017 Dataset. In Proceedings of the Artificial Intelligence Application in Networks and Systems, Online, 9 July 2023.
16. Alabsi, B.A.; Anbar, M.; Rihan, S.D.A. Conditional tabular generative adversarial based intrusion detection system for detecting DDOS and DOS attacks on the internet of things networks. Sensors 2023, 23, 5644.
17. Zavrak, S.; Iskefiyeli, M. Anomaly-Based Intrusion Detection from Network Flow Features Using Variational Autoencoder. IEEE Access 2020, 8, 108346–108358.
18. Kumar, R.; Kumar, P.; Tripathi, R.; Gupta, G.P.; Garg, S.; Hassan, M.M. A distributed intrusion detection system to detect DDoS attacks in blockchain-enabled IoT network. J. Parallel Distrib. Comput. 2022, 164, 55–68.
19. Zeeshan, M.; Riaz, Q.; Bilal, M.A.; Shahzad, M.K.; Jabeen, H.; Haider, S.A.; Rahim, A. Protocol-based deep intrusion detection for dos and DDoS attacks using unsw-nb15 and Bot-IoT data-sets. IEEE Access 2021, 10, 2269–2283.
20. Roopak, M.; Tian, G.Y.; Chambers, J. An Intrusion Detection System Against DDoS Attacks in IoT Networks. In Proceedings of the 2020 10th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 6–8 January 2020.
21. Akgun, D.; Hizal, S.; Cavusoglu, U. A new DDoS attacks intrusion detection model based on deep learning for cybersecurity. Comput. Secur. 2022, 118, 102748.
22. Khanday, S.A.; Fatima, H.; Rakesh, N. Implementation of intrusion detection model for DDoS attacks in Lightweight IoT Networks. Expert Syst. Appl. 2023, 215, 119330.
23. Issa, A.S.A.; Albayrak, Z. DDoS attack intrusion detection system based on hybridization of CNN and LSTM. Acta Polytech. Hung. 2023, 20, 105–123.
24. Baldini, G.; Amerini, I. Online distributed denial of service (DDoS) intrusion detection based on adaptive sliding window and morphological fractal dimension. Comput. Netw. 2022, 210, 108923.
25. Hussain, Y.S. Network Intrusion Detection for Distributed Denial-of-Service (DDoS) Attacks using Machine Learning Classification Techniques. Bachelor’s Thesis, University of Victoria, Victoria, BC, Canada, 2020.
26. Ferrag, M.A.; Shu, L.; Djallel, H.; Choo, K.-K.R. Deep learning-based intrusion detection for distributed denial of service attack in agriculture 4.0. Electronics 2021, 10, 1257.
27. Huang, W.; Peng, X.; Shi, Z.; Ma, Y. Adversarial Attack Against LSTM-Based DDoS Intrusion Detection System. In Proceedings of the 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), Baltimore, MD, USA, 9–11 November 2020.
28. Mendonça, R.V.; Teodoro, A.A.; Rosa, R.L.; Saadi, M.; Melgarejo, D.C.; Nardelli, P.H.; Rodríguez, D.Z. Intrusion detection system based on fast hierarchical deep convolutional neural network. IEEE Access 2021, 9, 61024–61034.
29. Adefemi Alimi, K.O.; Ouahada, K.; Abu-Mahfouz, A.M.; Rimer, S.; Alimi, O.A. Refined LSTM based intrusion detection for denial-of-service attack in Internet of Things. J. Sens. Actuator Netw. 2022, 11, 32.
30. Amin, M.Z.; Ali, A. Application of Multilayer Perceptron (MLP) for Data Mining in Healthcare Operations. In Proceedings of the 2017 3rd International Conference on Biotechnology, Lahore, Pakistan, 8–9 February 2017.
31. Manalo, K.D.; Linsangan, N.B.; Torres, J.L. Classification of myoelectric signals using multilayer perceptron neural network with back propagation algorithm in a wireless surface myoelectric prosthesis. Int. J. Inf. Educ. Technol. 2016, 6, 686–690.
32. Albadr, M.A.; Tiun, S.; Ayob, M.; Al-Dhief, F. Genetic algorithm based on natural selection theory for optimization problems. Symmetry 2020, 12, 1758.
33. Dharma, F.; Shabrina, S.; Noviana, A.; Tahir, M.; Hendrastuty, N.; Wahyono, W. Prediction of Indonesian inflation rate using regression model based on genetic algorithms. J. Online Inf. 2020, 5, 45–52.
34. Band, S.S.; Janizadeh, S.; Chandra Pal, S.; Saha, A.; Chakrabortty, R.; Shokri, M.; Mosavi, A. Novel ensemble approach of deep learning neural network (DLNN) model and particle swarm optimization (PSO) algorithm for prediction of gully erosion susceptibility. Sensors 2020, 20, 5609.
35. Ulker, E.D.; Ulker, S. Application of particle swarm optimization to microwave tapered microstrip lines. Comput. Sci. Eng. 2014, 4, 59–64.
36. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. A comparative analysis of forecasting financial time series using ARIMA, LSTM, and BiLSTM. arXiv 2019, arXiv:1911.09512.
37. Hernández, J.; Lopez, D.; Vera, N. Primary user characterization for cognitive radio wireless networks using long short-term memory. Int. J. Distrib. Sens. Netw. 2018, 14, 1550147718811828.
38. Pham, Q.-V.; Mirjalili, S.; Kumar, N.; Alazab, M.; Hwang, W.-J. Whale optimization algorithm with applications to resource allocation in wireless networks. IEEE Trans. Veh. Technol. 2020, 69, 4285–4297.
39. Rana, N.; Latiff, M.S.A.; Abdulhamid, S.I.M.; Chiroma, H. Whale optimization algorithm: A systematic review of contemporary applications, modifications and developments. Neural Comput. Appl. 2020, 32, 16245–16277.
40. Jose, J.; Jose, D.V. Deep learning algorithms for intrusion detection systems in internet of things using CIC-IDS 2017 dataset. Int. J. Electr. Comput. Eng. 2023, 13, 1134–1141.
41. Choobdar, P.; Naderan, M.; Naderan, M. Detection and multi-class classification of intrusion in software defined networks using stacked auto-encoders and CICIDS2017 dataset. Wirel. Pers. Commun. 2022, 123, 437–471.
42. Ferrag, M.A.; Maglaras, L.; Ahmim, A.; Derdour, M.; Janicke, H. Rdtids: Rules and decision tree-based intrusion detection system for internet-of-things networks. Future Internet 2020, 12, 44.
43. Henry, A.; Gautam, S.; Khanna, S.; Rabie, K.; Shongwe, T.; Bhattacharya, P.; Sharma, B.; Chowdhury, S. Composition of hybrid deep learning model and feature optimization for intrusion detection system. Sensors 2023, 23, 890.
44. Qazi, E.U.H.; Almorjan, A.; Zia, T. A one-dimensional convolutional neural network (1D-CNN) based deep learning system for network intrusion detection. Appl. Sci. 2022, 12, 7986.

Figure 1. The MLP flowchart [31].

Figure 2. The genetic algorithm’s flowchart [32].

Figure 3. The PSO flowchart [35].

Figure 4. The LSTM flowchart [37].

Figure 5. The flowchart of the WOA algorithm [39].

Figure 6. Convergence curve of GA-PSO for feature selection.

Figure 7. Mutual correlation between all pairs of selected features for the CICIDS-2017 dataset.

Figure 8. The convergence curve of the WOA algorithm for optimizing LSTM’s hyperparameters.

Figure 9. Evaluating the proposed method using the confusion matrix for the DDoS attack detection in the CICIDS-2017 dataset.

Figure 10. Evaluating the proposed method using the confusion matrix for the FTP-Patator/SSH-Patator detection in the CICIDS-2017 dataset.

Figure 11. Evaluating the proposed method using the confusion matrix for anomaly detection in the NSL-KDD dataset.

Figure 12. Evaluating the proposed method using the ROC curve for the DDoS attack detection in the CICIDS-2017 dataset.

Figure 13. Evaluating the proposed method using the ROC curve for the FTP-Patator/SSH-Patator detection in the CICIDS-2017 dataset.

Figure 14. Evaluating the proposed method using the ROC curve for anomaly detection in the NSL-KDD dataset.

Figure 15. Evaluating the proposed method using the evaluation metrics for the FTP-Patator/SSH-Patator detection.

Figure 16. Box chart of evaluating the proposed method over 10 replications using the evaluation metrics in the NSL-KDD dataset.

Table 1. The obtained hyperparameters of LSTM using WOA.

Parameter	Value
The number of hidden units	33
Learning rate	0.0041
Learning rate drop factor	0.42
Batch size	137
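To illustrate how the learning rate and its drop factor in Table 1 could interact over the 500 training epochs, the sketch below uses a piecewise-constant schedule in which the rate is multiplied by the drop factor at fixed intervals. This is an assumed reading, not a confirmed detail of the training setup: the table value is treated as the initial learning rate, and the drop period of 100 epochs is a hypothetical value chosen only for the example.

```python
def learning_rate(epoch, initial_lr=0.0041, drop_factor=0.42, drop_period=100):
    """Piecewise-constant schedule: multiply the rate by `drop_factor`
    once every `drop_period` epochs (epochs are counted from 0).

    `drop_period` is a hypothetical value; it is not given in Table 1.
    """
    return initial_lr * drop_factor ** (epoch // drop_period)


# Learning rate at the start of each 100-epoch block of a 500-epoch run.
for epoch in range(0, 500, 100):
    print(epoch, learning_rate(epoch))
```

Under these assumptions the rate is multiplied by 0.42 four times over the run, so it ends at 0.0041 × 0.42⁴, roughly 1.3 × 10⁻⁴.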

Table 2. The comparison of the proposed method with other intrusion detection methods.

Reference	Method	Dataset	Accuracy
[40]	DNN	CIC-IDS 2017	94.61%
[40]	LSTM	CIC-IDS 2017	97.67%
[40]	CNN	CIC-IDS 2017	98.61%
[41]	Sparse Stacked Auto-Encoders + SoftMax	NSL-KDD	98.5%
[41]	Sparse Stacked Auto-Encoders + SoftMax	CICIDS2017	98.5%
[42]	RDTIDS (REP Tree + JRip + Forest PA)	BoT-IoT	96.995%
[42]	RDTIDS (REP Tree + JRip + Forest PA)	CICIDS2017	96.665%
[43]	CNN-GRU	CICIDS-2017	98.73%
[44]	1D CNN	CICIDS2017	98.96%
The proposed method	GA-PSO + MLP/LSTM + WOA	CICIDS2017 (DDoS)	99.62%
The proposed method	GA-PSO + MLP/LSTM + WOA	CICIDS2017 (FTP-Patator/SSH-Patator)	99.40%
The proposed method	GA-PSO + MLP/LSTM + WOA	NSL-KDD	99.6%

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

