3. Related Work
The rapid growth of online social networks and electronic applications, together with the expansion of electronic information resources, has generated huge and ever-accumulating volumes of data, forming large datasets that span many domains. Big data research, management, supervision, data storage/access, information processing, and analysis are all challenging because the data and their sources are highly unstructured and heterogeneous. The enormous scale of social networks, spatiotemporal influences, and user interactions are among the several challenges in uncovering behavioral mechanisms. Other challenges of big data include extracting essential information, protecting privacy and security, and predicting web content to understand users’ interests, visited sites, and search histories and thereby accurately predict their behavior. Big data and ML techniques are employed in IDSs to address these issues and difficulties [
5].
Moreover, new technologies, including 4G/5G networks, have significant prospective applications in unmanned aerial vehicles (UAVs) equipped with sensors, cameras, and GPS receivers for delivering Internet of Things (IoT) services, generating useful heterogeneous big data. However, many challenges must be resolved before UAVs and the heterogeneous big data they accumulate can be used effectively and safely. An advanced framework supporting multi-level and multi-domain defense mechanisms is required to protect UAVs from spoofing, false information, signal jamming, physical attacks, hijacking, abuse, and firmware hacking/sabotage, using ML algorithm solutions built into IDSs [
6].
IDSs are among the key assets for protecting IT infrastructure against potential threats and enhancing network security in most organizations. Accordingly, many researchers have worked intensively to develop intelligent IDSs and further improve network security. Jia et al. [
7] developed a new deep neural network (NDNN) model with four hidden layers to capture and classify the intrusion features of the NSL-KDD and KDD99 training datasets. They claimed that the NDNN-based model improved IDS performance, raising the accuracy rate to as high as 99.9%. Li et al. [
8] presented a new intelligent IDS by applying ensemble and unsupervised ML techniques to address the security challenges in software-defined 5G networks. Dey et al. [
9] suggested a multi-layered IDS for mobile clouds involving heterogeneous customer networks. They proposed a system with two steps, namely multi-layer traffic screening and decision-based virtual machine (VM) selection, and concluded that using the system is a highly effective method to detect intrusions. Leite and Girardi [
10] proposed an IDS that can adapt itself to its environment and recognize new intrusions not previously specified in a system design. They integrated case-based reasoning, reactive behavior, and learning to acquire information from past solutions and support the evolution of case-based reasoning and reactive behavior in improving IDS performance. Hajisalem and Babaie [
11] developed an intelligent classification and regression tree (CART) classifier for IDS by optimizing speed and accuracy. They combined the artificial bee colony and artificial fish swarm algorithms to select effective if-then rules for the CART classifier, ran this classifier on the NSL-KDD and UNSW-NB datasets, and concluded that it can achieve a 99% detection rate and a 0.01% false positive rate (FPR). Li et al. [
12] proposed a framework based on regularized semi-nonnegative matrix tri-factorization, which maps a signed network from a high-dimensional space to a low-dimensional one. They also presented a graph regularization that distributes pairs of nodes connected by negative links into different communities to improve detection accuracy. The authors claimed that applying their framework to both synthetic and real-world datasets confirmed the effectiveness of the suggested method. Moreover, Li et al. [
13] studied the bit error rate (BER) performance of bi-quadrature physical-layer network coding (BQ-PNC) in asymmetric two-way relay channels and found that BQ-PNC can significantly improve BER performance in both symmetric and asymmetric circumstances.
Most real-world network traffic data are not available because of privacy and security issues, but several public datasets are available for IDS performance assessment. Some of these datasets lack adequate traffic types and up-to-date low-footprint attack styles, so researchers rely on an older benchmark dataset, NSL-KDD, to compare the performance of ML classifiers in a fair and reasonable manner. For example, Revathi and Malathi [
14] used the NSL-KDD dataset to evaluate the performance and accuracy of five ML classifiers with and without feature reduction using the Weka tool. They reported that the NSL-KDD dataset is ideal for comparing different intrusion detection models and that random forest achieved the highest accuracy, outperforming J48, support vector machine (SVM), CART, and Naive Bayes. Also using the Weka tool, Dhanabal and Shantharajah [
15] used 20% of the NSL-KDD dataset to measure the effectiveness of four ML classifiers (CFS, J48, SVM, and Naive Bayes (NB)) in detecting anomalies in network traffic patterns. They concluded that the detection accuracy of J48 outperformed that of SVM and NB. Chand et al. [
16] evaluated the SVM classifier’s performance when integrated with other classifiers, such as BayesNet, AdaBoost, logistic, IBK, J48, random forest, JRip, OneR, and simple CART, on the NSL-KDD dataset using the Weka tool. They concluded that a multi-classifier algorithm had better performance than a single-classifier algorithm. They showed that the stacking of SVM and random forest gave the best performance, with a detection accuracy rate of 97.50%. However, they did not report the time taken to build the model. Ikram and Cherukuri [
17] proposed an intrusion detection model using chi-square feature selection and multi-class SVM. The main idea behind this model was to form a multi-class SVM that reduces the training and testing time and increases the classification accuracy of network attacks. They tested this model on the NSL-KDD and gurekddcup datasets and showed that the proposed method performed better; it had higher accuracy, faster convergence speed, and better generalization.
Choudhury and Bhowal [
18] studied and compared the performance of nine classifiers (BayesNet, logistic, IBK, J48, PART, JRip, random tree, random forest, and REPTree) over the training and testing datasets of NSL-KDD using 10-fold cross-validation in the Weka tool. They concluded that random forest and BayesNet were suitable for the proper detection of network intrusion. Belavagi and Muniyal [
3] analyzed four supervised ML classifiers, namely SVM, random forest (RF), logistic regression (LR), and NB, for intrusion detection over the NSL-KDD dataset. They used accuracy, true positive rate (TPR), and FPR as the parameters to measure the performance of the classifiers and ran the experimental procedure on an Intel Core(TM) i5-3230M CPU @ 2.60 GHz with 4 GB RAM. They found that the RF classifier outperformed the other tested classifiers in identifying whether data traffic was normal or an attack. Next, Biswas [
19] compared the performance of five classifiers (k-NN, DT, NN, SVM, and NB) using 5-fold cross-validation on the NSL-KDD dataset and concluded that the k-NN classifier showed better performance than the other classifiers. Wang et al. [
20] suggested an effective IDS based on SVM with augmented features. They applied logarithm marginal density ratio transformation to form new and better transformed features that can improve the detection performance of the SVM-based detection model. They used the NSL-KDD dataset to evaluate the suggested method and concluded that it achieves better performance than other existing methods in terms of accuracy, detection rate, false alarm rate, and training speed for the SVM model. Yin et al. [
21] explored how to model an IDS based on a deep learning approach using recurrent neural networks (RNN-IDS) on the NSL-KDD dataset. They compared this approach with J48, artificial neural network, random forest, SVM, and other ML classifiers proposed by previous researchers on the benchmark dataset. They reported that RNN-IDS achieved superior accuracy compared with traditional ML classifiers.
Malhotra and Sharma [
4] evaluated 10 ML algorithms on the NSL-KDD dataset with and without feature selection/reduction techniques using the Weka tool. They reported that random forest, bagging, PART, and J48 were the best-ranked classifiers without feature selection/reduction, but these took considerable time to build their models. The authors subjected these four classifiers to further evaluation using feature selection/reduction methods to reduce the model building time while maintaining high intrusion detection accuracy. They concluded that the feature selection/reduction methods significantly reduced the model building time without compromising the detection accuracy of the tested classifiers and that feature selection/reduction helps a number of ML classifiers perform well. These results and feature selection/reduction methods are therefore promising for investigating more ML classifiers to determine the best classifier with the highest detection accuracy and lowest model building time, which can satisfy the real-life need to build an efficient network IDS. Abdullah et al. [
22] presented an IDS framework with feature selection within the NSL-KDD dataset; the framework was based on dividing the input dataset into various subsets and combining them using the information gain filter in the Weka tool. The authors showed that the feature selection methods used improved detection accuracy and decreased complexity. They also demonstrated that the highest intrusion detection accuracy was obtained when using the random forest and PART classifiers under combination methods of the product probability rule. Setiawan et al. [
23] proposed a combination of feature selection, normalization, and SVM, using Weka’s modified rank-based information gain filter to select 17 of the 41 NSL-KDD dataset features. They achieved an overall detection accuracy rate of 99.8%. Zhou et al. [
24] proposed an IDS framework based on feature selection and ensemble learning techniques. They used a hybrid approach combining correlation-based feature selection (CFS) with the bat algorithm (BA) to select the optimal subset based on the correlation between features. They then formed an ensemble combining the C4.5, RF, and ForestPA classifiers and applied it to the NSL-KDD, AWID, and CIC-IDS2017 datasets. They demonstrated that the proposed CFS-BA-ensemble approach showed better performance than other related approaches. Notably, these previous studies emphasized detection accuracy but not the time taken to build the model.
Furthermore, Mahfouz, Venugopal, and Shiva [
25] evaluated and compared the performance of six supervised ML classifiers, namely NB, logistic, multilayer perceptron (MLP), sequential minimal optimization (SMO), IBK, and J48, on the full NSL-KDD dataset using Weka software. They investigated these classifiers in the context of intrusion detection along various dimensions, mainly feature selection, sensitivity to hyper-parameter tuning, and class imbalance problems. They used accuracy, TPR, FPR, precision, recall, F-measure, and receiver operating characteristic (ROC) area as the parameters to evaluate the performance of the tested classifiers. They carried out the experimental protocol on a PC with an Intel(R) Core(TM) i5-6600K CPU @ 3.50 GHz and 8 GB RAM running 64-bit Windows 10. They concluded that J48 and IBK were the two best classifiers in terms of detection accuracy, but IBK was much better when feature selection techniques were applied.
Previous studies, except that by Malhotra and Sharma [
4], clearly focused on achieving the high detection accuracy of classifiers without considering the model building time, which is a crucial aspect for predicting intrusion in real-life situations. Therefore, evaluating the performance of more ML classifiers is needed to determine the most appropriate classifiers with both high detection accuracy and the lowest possible model building time using the NSL-KDD dataset with feature selection/reduction methods in the Weka tool. In particular, researchers have already documented that feature selection/reduction methods significantly reduce the model building time without compromising the detection accuracy of some classifiers and that they could also help a number of ML classifiers perform well. Presenting more options for the best classifiers with both high detection accuracy and the lowest possible model building time can satisfy real-life needs and help build an efficient network IDS. These classifiers can enhance and accelerate IDS function to accurately deal with the huge dimension of social networks and large data flow.
ML classifiers offering both the highest detection accuracy and the lowest model building time can make the IDSs of broadband wireless heterogeneous networks more efficient and secure. Such networks, which encompass communication systems as well as multimedia services and applications, need a fast real-life anomaly tracing system to deliver high service quality, and classifiers with these properties can provide it [
26].
5. Research Method
In this section, the authors describe the NSL-KDD dataset, Weka tools, ML classifiers, feature selection/reduction approaches, and the performance measures they used in this study.
NSL-KDD Dataset: This is a refined version of the well-known KDDcup99 dataset [
27] that is widely used for building an IDS. This dataset contains 125,973 instances with 41 features and an assigned label classifying each record as either normal or an attack (Table 1). Many researchers have used this dataset to conduct different types of analyses and to apply various methods and tools for developing effective IDSs [
28]. The NSL-KDD dataset has a subset named KDDTrain+-20Percent, which contains 25,192 instances and represents 20% of the entire training set. The attacks in the dataset can be grouped into four main classes, namely denial of service (DoS), Probe, U2R, and R2L.
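For orientation, the following minimal sketch loads the dataset with the Weka Java API and sets the class attribute; the ARFF filename and local path are assumptions, not part of the original study.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadNslKdd {
    public static void main(String[] args) throws Exception {
        // Load the NSL-KDD training set (ARFF path assumed; adjust to your local copy).
        Instances data = DataSource.read("KDDTrain+.arff");
        // The class label (normal vs. attack) is the last attribute.
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Instances: " + data.numInstances());        // expected: 125,973
        System.out.println("Features:  " + (data.numAttributes() - 1)); // expected: 41
    }
}
```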
Weka Package: This is an open-source package that contains a number of ML classifiers used to perform different types of data mining tasks. Written in Java [
29], the Weka package consists of tools for data pre-processing, regression, classification, clustering, association rules, and visualization. Researchers can evaluate the performance of the available classifiers in Weka directly on the NSL-KDD dataset. To easily compare the classifiers in this study, the authors used the default settings provided by Weka for the ML classifiers, feature selection/reduction techniques, and the discretize filter. Weka has four applications: Explorer, Experimenter, Knowledge Flow, and Simple Command Line Interface.
In this study, data were discretized using Weka, and the CfsSubsetEval wrapper method was used with a BestFirst search and 10-fold cross-validation. The InfoGainAttributeEval filter method was also used with Ranker search and 10-fold cross-validation.
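A minimal sketch of this setup against the Weka 3.x Java API is shown below. The study itself used the Weka tool, so this is only one possible programmatic reproduction, with the ARFF filename an assumption.

```java
import java.util.Arrays;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class FeatureSelectionSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("KDDTrain+.arff"); // filename assumed
        data.setClassIndex(data.numAttributes() - 1);

        // Unsupervised discretization of the numeric attributes, as described above.
        Discretize disc = new Discretize();
        disc.setInputFormat(data);
        Instances discData = Filter.useFilter(data, disc);

        // CfsSubsetEval + BestFirst (the wrapper-style configuration in this study).
        AttributeSelection cfs = new AttributeSelection();
        cfs.setEvaluator(new CfsSubsetEval());
        cfs.setSearch(new BestFirst());
        cfs.SelectAttributes(discData);
        // selectedAttributes() appends the class index as the last element.
        System.out.println("CFS subset: " + Arrays.toString(cfs.selectedAttributes()));

        // InfoGainAttributeEval + Ranker (the filter-style configuration).
        AttributeSelection ig = new AttributeSelection();
        ig.setEvaluator(new InfoGainAttributeEval());
        ig.setSearch(new Ranker());
        ig.SelectAttributes(discData);
        System.out.println(ig.toResultsString()); // full ranking of all 41 features
    }
}
```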
Machine Learning Algorithms: Researchers have proposed many ML classifiers to monitor and analyze network traffic for various anomalies. Among the several classifiers available in Weka, the authors evaluated six well-known supervised classifiers that were compatible with the NSL-KDD training dataset. They ran each classifier only once on the NSL-KDD training dataset and analyzed the resulting performance. To minimize any possible overfitting, cross-validation and global-minimum adjustment were used.
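To make the evaluation procedure concrete, the sketch below cross-validates one classifier (REPTree is used here purely as an example) and times the model-building step separately, since build time is a key metric in this study; the ARFF filename is again an assumption.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateClassifierSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("KDDTrain+.arff"); // filename assumed
        data.setClassIndex(data.numAttributes() - 1);

        // Time taken to build the model on the full training set.
        REPTree timed = new REPTree();
        long start = System.currentTimeMillis();
        timed.buildClassifier(data);
        System.out.printf("Time to build model: %.2f s%n",
                (System.currentTimeMillis() - start) / 1000.0);

        // 10-fold cross-validation to estimate detection accuracy.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new REPTree(), data, 10, new Random(1));
        System.out.printf("Accuracy: %.4f%%%n", eval.pctCorrect());
        System.out.println("Correctly classified:   " + (int) eval.correct());
        System.out.println("Incorrectly classified: " + (int) eval.incorrect());
    }
}
```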
The supervised classifiers used were as follows:
Reduced Error Pruning Tree (REPTree): This is a rapid decision tree learner based on C4.5 and is often used as a representative method for examining the problems of decision tree learning. REPTree builds a decision/regression tree based on information gain or variance reduction [30]. It sorts the values of numeric attributes once, handles missing values by splitting them into fractional instances as in C4.5, and builds multiple trees over various iterations, selecting the best of all trees created. REPTree is designed to decrease tree structure complexity without reducing classification accuracy. Its pruning is based on reduced-error pruning (REP) with backfitting, which counters overfitting. It forms a decision/regression tree by splitting on the highest information gain ratio value and then pruning the regression tree.
Sequential Minimal Optimization (SMO): This is widely used for training SVMs and was formulated by J. Platt [
31]. SMO is one way to solve a quadratic programming (QP) issue that arises during SVM training. SMO divides the large QP problem into a series of very tiny sub-problems. These small sub-problems are solved analytically, preventing the use of time-consuming numerical QP optimization as an inner loop. It is the fastest for linear SVMs and sparse datasets and can be more than 1000 times faster than the chunking algorithm. The amount of memory needed for SMO is linear in the training dataset size, allowing SMO to handle very large training sets. It scales somewhere between linear and quadratic in the training set size for several test problems.
LogitBoost: This is a popular boosting variant formulated by Friedman, Hastie, and Tibshirani [
32]. Researchers can apply it to either binary or multi-class classification, and it can be viewed as an additive tree regression that minimizes the logistic loss. LogitBoost has two essential setting factors, the invariant property and the density of the Hessian matrices, and it can be seen as a convex optimization with the following formula: min_F Σ_{i=1}^{N} log(1 + exp(−y_i F(x_i))). Compared with other AdaBoost-style classifiers, it is appropriate for handling noisy and outlier data because its binomial log-likelihood loss grows only linearly, rather than exponentially, with the negative margin. It has three main components: the loss, the function model, and the optimization algorithm.
BayesNet: This is a broadly used method based on Bayes’ theorem; it constructs a Bayesian network [33] by calculating the conditional probability at every node. It belongs to the family of probabilistic graphical models. BayesNet learns in two stages: the network structure is learned first, followed by the probability tables. It is a powerful instrument for data representation and inference under uncertainty. As a probabilistic graphical algorithm, it represents a set of random variables and their conditional dependencies with the assistance of a directed acyclic graph, which is applied to represent knowledge about an uncertain domain.
Radial Basis Function (RBF): This is an artificial neural network formulated by Broomhead and Lowe [
34]. RBF networks use radial basis functions as activation functions, whose response changes with the distance from a center point. They are used for function approximation, time-series prediction, classification, and system control. A multi-layer feedforward neural network, the RBF network classifies data in a non-linear mode by comparing input data with the training data. The output of the RBF neural network is a weighted linear superposition of all basis functions. The most frequently used basis function in the RBF model is the Gaussian basis function.
NBTree: This is a hybrid of naive Bayes and decision trees; it combines both classifiers, with univariate splits at each node of the decision tree and naive Bayes classifiers at the leaves. NBTree is useful when the classification involves many attributes that are likely to be relevant [35]. It is a highly scalable method for big data and is described as a decision tree with nodes and branches. NBTree sorts an example down to a leaf and allocates a class label by applying naive Bayes at that leaf. It represents the learned information in the form of a tree built recursively. NBTree is a popular baseline method for text categorization with appropriate pre-processing, and it significantly improves upon the performance of its constituents by inducing highly accurate classifiers.
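For reference, the six classifiers can be instantiated with Weka defaults as sketched below. Class names follow the Weka 3.x API; note that in recent Weka releases some of them (e.g., RBFNetwork and NBTree) are distributed as installable packages rather than in the core jar.

```java
import weka.classifiers.Classifier;
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.functions.RBFNetwork;
import weka.classifiers.functions.SMO;
import weka.classifiers.meta.LogitBoost;
import weka.classifiers.trees.NBTree;
import weka.classifiers.trees.REPTree;

public class ClassifierBattery {
    // The six supervised classifiers compared in this study, with Weka defaults.
    static Classifier[] classifiers() {
        return new Classifier[] {
            new REPTree(),    // reduced-error-pruning decision/regression tree
            new SMO(),        // SVM trained by sequential minimal optimization
            new LogitBoost(), // additive logistic regression (boosting)
            new BayesNet(),   // Bayesian network classifier
            new RBFNetwork(), // radial basis function network
            new NBTree()      // naive Bayes / decision tree hybrid
        };
    }
}
```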
Feature Selection/Reduction Approaches: Professionals understand that feature selection/reduction is essential to use data mining tools effectively. Feature selection/reduction has been an active area of research and development for many years in the field of ML and data mining [
4]. This method selects a subset of original features according to certain criteria by removing irrelevant or redundant features from the dataset to improve mining performance, such as prediction accuracy and the reduction of model building time [
4,
36]. There are mainly two methods for feature selection/reduction, namely the wrapper method and the filter method. Researchers can use these methods with or without the discretize filter.
The wrapper model needs one predetermined mining algorithm and uses its performance as the assessment criterion. It searches for features better suited to the mining algorithm in order to improve mining performance, but it is most likely to be more computationally expensive than the filter model. In this study, the authors used the procedure presented in the work of Malhotra and Sharma [
4] to identify a subset of features that has the best performance with the classification algorithm using an attribute evaluator and a search method (CfsSubsetEval + BestFirst).
Next, the filter method depends on the general structures of data to evaluate and select feature subsets without involving any mining algorithm. In this study, the authors assigned ranks to all attributes in the dataset by using an attribute evaluator and a ranker method (InfoGainAttributeEval + Ranker), as described in the work of Malhotra and Sharma [
4]. The attribute ranked first had the highest priority. The authors omitted the lowest-ranked feature one at a time, assessing the accuracy of the classifier at each step, and continued omitting features one after another until the global minimum was reached. Beyond the global minimum, the model started overfitting and generating additional incorrectly classified instances.
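One way to implement this rank-and-omit loop with the Weka Java API is sketched below; the worst-first index list is a placeholder standing in for the actual Ranker output, and REPTree stands in for whichever classifier is being tuned.

```java
import java.util.Arrays;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RankedOmissionSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("KDDTrain+.arff"); // filename assumed
        data.setClassIndex(data.numAttributes() - 1);

        // 0-based attribute indices sorted worst-first, taken from the
        // InfoGainAttributeEval + Ranker output (placeholder values; the
        // class index must never appear here).
        int[] worstFirst = {6, 19, 8, 14, 3};

        double bestAcc = -1;
        for (int k = 1; k <= worstFirst.length; k++) {
            // Remove the k lowest-ranked attributes from the original data.
            Remove rm = new Remove();
            rm.setAttributeIndicesArray(Arrays.copyOfRange(worstFirst, 0, k));
            rm.setInputFormat(data);
            Instances reduced = Filter.useFilter(data, rm);

            // Re-assess accuracy at this step with 10-fold cross-validation.
            Evaluation eval = new Evaluation(reduced);
            eval.crossValidateModel(new REPTree(), reduced, 10, new Random(1));
            double acc = eval.pctCorrect();
            System.out.printf("dropped %d feature(s) -> accuracy %.4f%%%n", k, acc);

            if (acc < bestAcc) break; // accuracy is degrading: stop removing features
            bestAcc = acc;
        }
    }
}
```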
Malhotra and Sharma [
4] described discretization as a process of altering numeric attributes into nominal attributes by dividing each attribute’s numeric values into a number of intervals. Consequently, discretization assists in improving accuracy and reducing the learning complexity of the classifier with which it is used. To attain this benefit, the authors discretized the dataset by applying an unsupervised discretize filter to the attributes.
Performance Measures: The authors used performance measurements, particularly accuracy, time taken to build the model, and correctly and incorrectly classified instances, to evaluate and compare the performance of the tested classifiers. The descriptions of these parameters are as follows:
The confusion matrix is a visualization tool that works as a basis for calculating all other parameters. It includes four values: true positive (TP), false negative (FN), false positive (FP), and true negative (TN). They are described as follows:
- True Positive (TP):
Attack instances correctly identified as attacks
- False Negative (FN):
Abnormal instances incorrectly identified as normal
- False Positive (FP):
Normal instances incorrectly identified as attacks
- True Negative (TN):
Normal instances correctly identified as normal
Accuracy is one of the essential measures for describing the performance of any algorithm. It is the degree to which an algorithm can properly predict positive and negative instances, and it can be determined by the following formula: Accuracy = (TP + TN)/(TP + FN + FP + TN).
- Correctly Classified Instances:
Number of instances that were correctly identified
- Incorrectly Classified Instances:
Number of instances that were incorrectly identified
Time Taken to Build the Model: This is the time taken by a classifier to generate a model, usually measured in seconds (s). The less time taken to build the model, the better the classifier.
Sensitivity is another basic measure for assessing the performance of any algorithm. It is the true positive rate, i.e., the proportion of actual positives that are determined correctly, and can be calculated using the following formula: Sensitivity = TP/(TP + FN).
Specificity is a measure of the proportion of actual negatives recognized correctly by a learning algorithm and can be measured using the following formula: Specificity = TN/(FP + TN).
Precision is one of the primary performance indicators. It reflects how many of the instances predicted as positive are actually positive and is calculated as Precision = TP/(TP + FP).
The F-measure is the harmonic mean of precision and sensitivity (recall) and can be calculated as F-measure = 2TP/(2TP + FP + FN). It indicates the accuracy of a test by quantifying the balance between precision and recall.
The receiver operating characteristic (ROC) curve is a graphical means of evaluating ML classifiers by visualizing the relationship between the TP and FP rates of an IDS, and it is used to effectively compare ML classifiers in terms of accuracy. ML classifiers with a larger area under the curve have higher performance.
The TP rate indicates the probability that an ML classifier correctly identifies attack instances. Also called sensitivity, the TP rate is the probability that an actual positive will test positive. A high TP rate is preferable. The TP rate can be calculated using the following formula: TP rate = TP/(TP + FN).
The FP rate, also called the false alarm rate, is the probability that an ML algorithm flags a normal instance as an attack. A consistent increase in the FP rate might lead the network manager to ignore alerts from the network system; a low FP rate is therefore desirable. The FP rate can be measured using the following formula: FP rate = FP/(FP + TN).
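Since all of these measures derive from the same four confusion-matrix counts, a small helper like the following (with illustrative counts only, not results from this study) can compute them in one place:

```java
public class DetectionMetrics {
    // Confusion-matrix counts, following the definitions above (positive = attack).
    final double tp, fn, fp, tn;

    DetectionMetrics(double tp, double fn, double fp, double tn) {
        this.tp = tp; this.fn = fn; this.fp = fp; this.tn = tn;
    }

    double accuracy()    { return (tp + tn) / (tp + fn + fp + tn); }
    double sensitivity() { return tp / (tp + fn); }         // = TP rate (recall)
    double specificity() { return tn / (fp + tn); }
    double precision()   { return tp / (tp + fp); }
    double fMeasure()    { return 2 * tp / (2 * tp + fp + fn); }
    double fpRate()      { return fp / (fp + tn); }         // = false alarm rate

    public static void main(String[] args) {
        // Illustrative counts only.
        DetectionMetrics m = new DetectionMetrics(6700, 50, 120, 3130);
        System.out.printf("accuracy=%.4f sensitivity=%.4f specificity=%.4f%n",
                m.accuracy(), m.sensitivity(), m.specificity());
        System.out.printf("precision=%.4f F-measure=%.4f FP rate=%.4f%n",
                m.precision(), m.fMeasure(), m.fpRate());
    }
}
```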