This study proposes an IDS model developed using ML-based approaches. Although existing studies have attempted this, the resulting systems have been deficient with regard to accuracy. To resolve these issues, the current research considers three datasets (UNSW-Bot, NSL-KDD and CICIDS-2017) and performs a series of steps, as shown in
Figure 1. Initially, pre-processing is performed, whereby inconsistent and missing data are eliminated to avoid errors. This step also enhances the dataset's quality by making the data more consistent and reliable. After pre-processing, features are selected using OWSA. Feature selection takes into account the diversity and number of features describing user behavior and network traffic, and a subset of features is selected to improve the model's classification accuracy. An OWSA-based approach to selecting server-traffic feature subsets is used to determine the features that are most important and most closely related to the class label. This minimizes computational cost and enhances the performance rate. After feature selection, the chosen data are split into training and testing sets at a ratio of 80:20. Training on 80% of the data involves presenting a labeled dataset to the model and allowing it to learn patterns and relationships between the input features and the corresponding labels; the model adjusts its parameters through optimization techniques, aiming to make accurate predictions. The remaining 20% of the data serve as a held-out testing set on which the trained model is evaluated to measure its performance. In the final phase, classification is accomplished with AWRF, and the model's performance is evaluated to confirm its efficacy.
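For concreteness, the following is a minimal Python sketch of this pipeline, assuming pandas and scikit-learn; the file name cicids2017.csv and the label column are illustrative placeholders, and the OWSA step is indicated only as a comment since it is detailed in Section 3.1.

import pandas as pd
from sklearn.model_selection import train_test_split

def load_and_preprocess(path):
    """Drop missing and duplicate records to make the data consistent."""
    df = pd.read_csv(path)
    df = df.dropna()           # eliminate records with missing values
    df = df.drop_duplicates()  # eliminate inconsistent/duplicate records
    return df

# "cicids2017.csv" and the "label" column are hypothetical placeholders.
df = load_and_preprocess("cicids2017.csv")
X = df.drop(columns=["label"]).to_numpy()
y = df["label"].to_numpy()

# The OWSA feature-selection step (Section 3.1) would reduce X to a subset here.

# 80:20 split: 80% for learning patterns, 20% held out for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)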
3.1. OWSA (Optimal Whale Sine Algorithm) for Feature Selection
Initially, the WOA (Whale Optimization Algorithm) was inspired by the bubble-net hunting behavior of humpback whales. The population-based WOA possesses the capacity to avoid local optima and thereby attain optimal solutions. These merits make the WOA suitable for solving various constrained or unconstrained optimization problems arising in practice without restructuring the algorithm. Humpback whales produce huge spirals of bubbles as they approach their prey in order to contain them, and the prey is then hunted. In the hunting phase, the humpback whales pursue two predation methodologies so as to minimize the ensuing steps: shrinking encircling and the spiral upraise-position technique. Throughout the hunting phases, both of these methodologies are used concurrently. Accordingly, two bubble-related techniques are employed, namely, upward spirals and double loops. In the first stage, the humpback whales dive twelve meters down; bubbles are then produced that surround the prey and float upwards in a spiral form. The subsequent step includes three phases: lob-tail, coral loop and capture loop.
Phase 1: Encircling the prey—The humpback whales locate the prey and then surround them. Since the ideal position within the search area is not known in advance, the WOA assumes that the current best candidate solution is the target prey, or close to the optimum. The other search agents then attempt to direct their search towards this ideal agent.
Phase 2: Exploitation—Bubble-net attack. This process encompasses two main methodologies, namely, shrinking encircling and spiral updating;
Phase 3: Exploration—Aside from bubble net processing, humpback whales search for their prey in a random manner.
For pursuing the two approaches, the WOA uses a random choice probability $p \in [0, 1]$. When $p < 0.5$, the humpback whale makes use of the shrinking-encircling methodology; on the contrary, when $p \geq 0.5$, it uses the spiral upraise-position methodology. During shrinking encircling, whales search for their prey with consideration of each other's locations. To reflect the uncertainty in this algorithm, $\vec{A}$ is presented as the coefficient vector. The best search agent is set as $\vec{X}^{*}$ to update the positions of the other search agents. As this process combines several search strategies, it identifies the global ideal outcome with higher merit than conventional optimization. A mathematical model of the search distance between position vectors is given in Equation (1),

$$\vec{D} = \left|\vec{C} \cdot \vec{X}^{*}(t) - \vec{X}(t)\right| \quad (1)$$

In Equation (1), $t$ denotes the existing iteration count, $\vec{X}^{*}(t)$ and $\vec{X}(t)$ represent the position vectors of the ideal outcome and the current search agent, respectively, $\vec{D}$ represents the search distance and $\vec{C}$ denotes the coefficient vector. Equation (2) updates the whale's movement (location) around the victim, which can be described in mathematical form as follows:

$$\vec{X}(t+1) = \vec{X}^{*}(t) - \vec{A} \cdot \vec{D} \quad (2)$$
The random search agent is given by Equation (3),

$$\vec{D} = \left|\vec{C} \cdot \vec{X}_{rand} - \vec{X}\right|, \qquad \vec{X}(t+1) = \vec{X}_{rand} - \vec{A} \cdot \vec{D} \quad (3)$$

In Equation (3), $\vec{X}_{rand}$ represents a random position vector chosen from the current population, while $\vec{A}$ indicates the coefficient vector. Humpback whales also use the spiral position-updating technique to hunt. The spiral positioning of the whale is given in Equation (4),
$$\vec{X}(t+1) = \vec{D}' \cdot e^{bl} \cdot \cos(2\pi l) + \vec{X}^{*}(t) \quad (4)$$

In Equation (4), $\vec{D}' = \left|\vec{X}^{*}(t) - \vec{X}(t)\right|$ reveals the distance of the $i$th whale from the prey (the satisfactory outcome gained so far), $b$ represents the constant defining the logarithmic spiral shape and $l$ is a random number in $(-1, 1)$. The stepwise procedure for the WOA is given in Pseudocode 1.
Pseudocode 1: WOA (Whale Optimization Algorithm)

Initialize the whale population X_i (i = 1, 2, …, n)
Calculate the fitness of each search agent
X* = the best search agent
while (t < maximum number of iterations)
    for each search agent
        Update a, A, C, l and p
        // a is linearly reduced from 2 to 0 over the course of iterations,
        // A and C represent coefficient vectors, l is a random number in (−1, 1)
        // and b is the constant defining the shape of the logarithmic spiral
        if (p < 0.5)
            if (|A| < 1)
                Update the position of the current solution by Equation (2)
            else
                Select a random search agent X_rand
                Update the position of the current solution by Equation (3)
        else
            Update the position of the current solution by Equation (4)
    Amend any search agent that moves beyond the search space
    Calculate the fitness of each search agent
    Update X* if a better solution is found
    t = t + 1
return X*
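To make the procedure concrete, the following is an illustrative NumPy sketch of the WOA loop following Equations (1)–(4); it is a minimal reading of the standard algorithm rather than the authors' implementation, and the objective function, bounds and population sizes are assumptions made for demonstration.

import numpy as np

def woa(objective, dim, n_whales=30, max_iter=100, lb=-10.0, ub=10.0, b=1.0):
    """Minimize `objective` with the standard WOA updates (Equations (1)-(4))."""
    rng = np.random.default_rng(0)
    X = rng.uniform(lb, ub, (n_whales, dim))       # initialize whale positions
    fitness = np.array([objective(x) for x in X])
    best = X[fitness.argmin()].copy()              # X*: best search agent
    best_fit = fitness.min()

    for t in range(max_iter):
        a = 2.0 - 2.0 * t / max_iter               # a decreases linearly 2 -> 0
        for i in range(n_whales):
            A = 2 * a * rng.random() - a           # coefficient
            C = 2 * rng.random()                   # coefficient
            p = rng.random()                       # random choice probability
            if p < 0.5:
                if abs(A) < 1:                     # exploitation: shrink toward X*
                    D = np.abs(C * best - X[i])                # Equation (1)
                    X[i] = best - A * D                        # Equation (2)
                else:                              # exploration: random whale
                    X_rand = X[rng.integers(n_whales)]
                    D = np.abs(C * X_rand - X[i])
                    X[i] = X_rand - A * D                      # Equation (3)
            else:                                  # spiral position update
                l = rng.uniform(-1.0, 1.0)         # random number in (-1, 1)
                D_prime = np.abs(best - X[i])
                X[i] = D_prime * np.exp(b * l) * np.cos(2 * np.pi * l) + best  # Eq. (4)
            X[i] = np.clip(X[i], lb, ub)           # keep agents inside the bounds
        fitness = np.array([objective(x) for x in X])
        if fitness.min() < best_fit:               # update the best search agent
            best_fit = fitness.min()
            best = X[fitness.argmin()].copy()
    return best, best_fit

# Example: minimize the sphere function f(x) = sum(x_i^2).
best, best_fit = woa(lambda x: float(np.sum(x ** 2)), dim=5)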
Based on the pseudocode, the whale population is updated and the fitness values are assessed; the search agents are thus modified and the ideal search agent is finally attained. Next, the SCA (Sine Cosine Algorithm) is considered; this algorithm is based on the trigonometric sine and cosine (SC) functions. Typically, SCA shows a better acceleration and convergence rate, along with a reliable execution time. The SCA generates several initial random solutions and makes them fluctuate towards the ideal solution using a mathematical framework based on the SC functions. Various random and adaptive variables are integrated within this algorithm to balance exploration and exploitation of the search space during optimization. The population search approach and the local search approach are the major techniques in SCA. The algorithm has certain innate advantages, such as simple execution and flexibility, which have enabled SCA to resolve various optimization issues. Consider the n-dimensional optimization problem,
$$\min f(x_1, x_2, \ldots, x_n), \qquad L_i \le x_i \le U_i, \quad i = 1, 2, \ldots, n \quad (5)$$

In Equation (5), $x_i$ indicates the $i$th decision variable, $L_i$ denotes the lower bound, $U_i$ represents the upper bound and $n$ indicates the problem dimension. To solve Equation (5), SCA utilizes the oscillatory behavior of both SC functions, which alters the capacities of individuals to observe the global ideal solution. The precise process is given below.
An assumption is made such that, in the SCA, the population size is $N$, and the $i$th individual's position in the $t$th generation is represented as $X_i^t = (x_{i1}^t, x_{i2}^t, \ldots, x_{in}^t)$, wherein $i = 1, 2, \ldots, N$. In addition, the fitness value is computed for each individual, and the position of the ideal individual is recorded as $P^t = (p_1^t, p_2^t, \ldots, p_n^t)$. The $j$th dimension of the $i$th individual in the populace is updated by Equations (6) and (7),

$$x_{ij}^{t+1} = x_{ij}^t + r_1 \sin(r_2) \left| r_3 p_j^t - x_{ij}^t \right|, \quad r_4 < 0.5 \quad (6)$$

$$x_{ij}^{t+1} = x_{ij}^t + r_1 \cos(r_2) \left| r_3 p_j^t - x_{ij}^t \right|, \quad r_4 \ge 0.5 \quad (7)$$
In Equations (6) and (7), $r_2 \in [0, 2\pi]$, $r_3 \in [0, 2]$ and $r_4 \in [0, 1]$ indicate random numbers drawn from uniform distributions, while $r_1$ indicates the control parameter. The computation process for $r_1$ is given in Equation (8),

$$r_1 = a - t\,\frac{a}{T} \quad (8)$$
In Equation (8), $a$ is a constant, $t$ represents the current iteration and $T$ is the maximum iteration count. Further, the fitness of each updated individual $X_i^{t+1}$ is computed, and a greedy search is employed to retain the ideal position $P^{t+1}$. Furthermore, let $t = t + 1$; the process (Equations (6)–(8)) repeats until the termination condition is attained. The overall procedure of SCA is given in Pseudocode 2.
Pseudocode 2: SCA (Sine Cosine Algorithm)

Initialize the search agents X_i (i = 1, 2, …, N)
while (t < T)
    Evaluate each search agent with the objective function
    Update the ideal solution P^t
    Update the control parameter r_1 by Equation (8)
    Update the position of each search agent by Equations (6) and (7)
    t = t + 1
return the ideal solution P^t
The initial phase involves the initialization of the search agents. With the use of the objective function, the search agents are evaluated and the ideal solution is updated. The control parameter is then updated with the use of Equation (8), and the positions of the search agents are updated with the use of Equations (6) and (7). This process continues until the maximum number of iterations is reached and the ideal solution is attained. Occasionally, SCA faces issues such as getting stuck in a local region of the search space, which increases the computational effort required to find the ideal solution. These issues can be resolved via enhancements to the standard SCA, which in turn improve its performance.
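As with the WOA, the following is a hedged NumPy sketch of the SCA loop following Equations (6)–(8); the objective function, bounds and the constant a are placeholders chosen for illustration.

import numpy as np

def sca(objective, dim, n_agents=30, T=100, lb=-10.0, ub=10.0, a=2.0):
    """Minimize `objective` with the SCA updates (Equations (6)-(8))."""
    rng = np.random.default_rng(0)
    X = rng.uniform(lb, ub, (n_agents, dim))       # initialize search agents
    fitness = np.array([objective(x) for x in X])
    best = X[fitness.argmin()].copy()              # P^t: ideal position so far
    best_fit = fitness.min()

    for t in range(T):
        r1 = a - t * a / T                         # Equation (8): control parameter
        for i in range(n_agents):
            for j in range(dim):
                r2 = rng.uniform(0.0, 2.0 * np.pi)
                r3 = 2.0 * rng.random()
                r4 = rng.random()
                step = abs(r3 * best[j] - X[i, j])
                if r4 < 0.5:
                    X[i, j] += r1 * np.sin(r2) * step   # Equation (6): sine branch
                else:
                    X[i, j] += r1 * np.cos(r2) * step   # Equation (7): cosine branch
            X[i] = np.clip(X[i], lb, ub)           # keep agents inside the bounds
        fitness = np.array([objective(x) for x in X])
        if fitness.min() < best_fit:               # greedy update of P^t
            best_fit = fitness.min()
            best = X[fitness.argmin()].copy()
    return best, best_fit

# Example: minimize the sphere function.
best, best_fit = sca(lambda x: float(np.sum(x ** 2)), dim=5)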
Conversely, the WOA also possesses disadvantages, such as a tendency to fall into local optima and slow convergence. Hence, the present study endeavors to resolve these issues via several enhancements so as to procure better accuracy. Exploiting the distinctive optimization standard of SCA, this study combines WOA and SCA so as to limit the demerits of both via hybridization, thereby attaining OWSA. As hybridization enhances optimization methodologies, the present research proposes an OWSA wherein operators from one method are combined with operators from the supplementary method so as to generate efficient and reliable results. In the current study, the OWSA is proposed for the optimal selection of features, as it possesses the greatest ability to enhance the exploration phase. The overall process of optimal feature selection using OWSA is shown in
Figure 2.
As depicted in
Figure 2, all the whales are first initialized randomly. Following this, the search agents are evaluated with the use of the objective function, and the destination location is updated. Then, the fitness computation is performed and the positions are updated with SCA. Finally, the global optimal solution is attained. The overall sequence is presented in Pseudocode 3.
Pseudocode 3: OWSA (Optimal Whale Sine Algorithm)

Initialize the whale population X_i (i = 1, 2, …, n)
Evaluate each search agent with the objective function; X* = the best search agent
while (t < maximum number of iterations)
    for each search agent
        Update a, A, C, l and p
        if (p < 0.5)
            if (|A| < 1)
                Update the position of the current solution using Equation (2)
            else
                Select a random search agent X_rand
                Update the position of the current solution using Equation (3)
        else
            Update the position of the current solution using Equation (4)
    Update the control parameter r_1 using Equation (8)
    Refine the position of each search agent with the SCA update (Equations (6) and (7))
    Evaluate the fitness of each search agent and update X*
    t = t + 1
return X* (the global optimal solution, i.e., the optimal feature subset)
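The section does not spell out how a whale's continuous position is mapped to a feature subset, so the sketch below assumes the common wrapper convention: positions are thresholded at 0.5 into a binary mask, and fitness trades classification error against the fraction of features retained. The weights alpha and beta, the 3-fold decision-tree evaluator and the owsa driver name are all illustrative assumptions; X_train is assumed to be a NumPy array.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def feature_fitness(position, X_train, y_train, alpha=0.99, beta=0.01):
    """Fitness of one agent: weighted classification error plus feature ratio."""
    mask = position > 0.5                      # threshold position -> feature mask
    if not mask.any():                         # an empty subset is penalized
        return 1.0
    clf = DecisionTreeClassifier(random_state=0)
    acc = cross_val_score(clf, X_train[:, mask], y_train, cv=3).mean()
    ratio = mask.sum() / mask.size             # fraction of features retained
    return alpha * (1.0 - acc) + beta * ratio  # lower is better

# Sketch of use with the hybrid optimizer: each OWSA agent encodes one candidate
# subset, and the returned best position is thresholded into the final mask, e.g.
# best_pos, _ = owsa(lambda p: feature_fitness(p, X_train, y_train), dim=n_features)
# selected = best_pos > 0.5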
3.2. AWRF (ANN Weighted Random Forest) for Classification
Generally, ANN refers to a biologically inspired sub-field of AI modeled after the brain. Typically, an ANN is a computational network based on biological NNs (Neural Networks), mirroring the human brain's structure. Like the human brain, an ANN possesses neurons that are interconnected with one another across several network layers. Further, RF (Random Forest) is a well-known ML method that relies on supervised learning. RF functions on the idea of EL (Ensemble Learning), whereby multiple classifiers are integrated to solve complex problems, thereby enhancing the model's performance. The main intention of the current study is to propose a weight-updating process applicable to the individual trees in the RF model, together with a comprehensive evaluation of ideal parameter tuning. The proposed approach is notable for its stability, flexibility and avoidance of over-fitting. The overall process is shown in
Figure 3.
As shown in
Figure 3, the selected features are fed into RF. Based on weight updating, the RF generates optimal outcomes. RF builds numerous DTs (Decision Trees) over several subsets of the dataset and, by averaging, enhances the prediction rate for a specific dataset. The RF model collects the predictions of all the DTs, and the overall outcome is predicted in accordance with the prediction that receives the most votes. Besides this, a weight in the ANN indicates a parameter that transforms the input data within the network's hidden layers. The network repeatedly tunes the inputs within the hidden layers to produce a desirable value within a specific range: a low weight value leaves the input almost unaltered, whereas a high weight value causes significant alterations in the results. Thus, the chosen features are fed into RF and, concurrently, the ANN weight updates are fed into RF. Several subsets of the input data are employed for training the ML models, with the DT acting as the core element of the RF model. From the actual data, a set of DTs related to the bootstrap samples is generated. The bootstrapping method assists the RF in collecting a sufficient number of DTs, which enhances the classification rate via the overlap-thinking concept. With the voting approach, optimal trees are chosen by bagging. The chosen features are also subjected to cross-validation before being fed into RF. Lastly, the ANN weight-updating process assists the RF model in performing effective classification. The overall process is presented in Pseudocode 4.
Pseudocode 4: AWRF (ANN Weighted Random Forest)

Initialize the weight w_i = w_0 of each classifier (tree), i = 1, 2, …, n
Train the RF on bootstrap samples of the selected features
for each input sample
    Obtain the prediction of each classifier
    Final solution = weighted majority vote of all classifiers
    for each classifier i
        if the classifier's solution matches the final solution
            w_i = w_i + δ
        else
            w_i = w_i − δ
return the final classification
To receive effective outcomes with the contribution of the considered classifier, all the inputs are weighted. The solution relies on majority voting. Each classifier's solution is compared with the overall solution. When there is a match, a fixed value (δ) is added to the weight; conversely, when it does not match, δ is subtracted from the classifier's weight, which minimizes the negative impact on further processes. The algorithm is thus updated with persistent weights, and hence the system can adapt to internal and external alterations. This ensures the classifier's reliability. It is important to pre-define the initial weight value $w_i$, wherein $i = 1, 2, \ldots, n$ and $n$ denotes the classifier count. These values have a huge influence on the evolution of the system, as the system's reliability depends on the ANN weights.
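The following is a hedged Python sketch of this weight-update rule built on a scikit-learn forest; the class name AWRFSketch, the initial weight w0, the fixed increment delta and the clipping of weights at zero are illustrative assumptions rather than values taken from the paper.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

class AWRFSketch:
    """Weighted random forest with the match/mismatch weight update above."""

    def __init__(self, n_trees=100, w0=1.0, delta=0.05):
        self.forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        self.w = np.full(n_trees, w0)   # pre-defined initial weight of every tree
        self.delta = delta              # fixed value added to / taken from weights

    def fit(self, X, y):
        self.forest.fit(X, y)
        return self

    def predict(self, X):
        preds = []
        classes = self.forest.classes_
        for x in np.asarray(X, dtype=float):
            x = x.reshape(1, -1)
            # individual trees of a fitted sklearn forest emit encoded class
            # indices, so map them back to the original labels
            votes = np.array([classes[int(tree.predict(x)[0])]
                              for tree in self.forest.estimators_])
            # final solution by weighted majority voting
            scores = {c: self.w[votes == c].sum() for c in np.unique(votes)}
            final = max(scores, key=scores.get)
            # trees matching the final solution gain delta; the rest lose it
            self.w = np.where(votes == final,
                              self.w + self.delta,
                              np.maximum(self.w - self.delta, 0.0))
            preds.append(final)
        return np.array(preds)

In use, model = AWRFSketch().fit(X_train, y_train) followed by model.predict(X_test) both classifies each sample by weighted majority vote and updates the per-tree weights, so trees that repeatedly disagree with the consensus gradually lose influence.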