1. Introduction
Today, an ever-increasing number of businesses run a variety of applications in data centers or large server clusters. Most of these applications, such as the web services that make up the bulk of network traffic, use TCP (Transmission Control Protocol) or UDP (User Datagram Protocol). However, applications based on TCP or UDP have suffered a variety of malicious attacks, especially distributed denial of service (DDoS) attacks. Denial of service (DoS) and DDoS attacks pose a devastating threat to network services [
1]. In the field of network security, detecting, identifying, and mitigating denial of service and distributed denial of service attacks is a challenging task [
2]. For the mixed DDoS attacks that are now widespread, detection determines whether the current traffic contains DDoS attack behavior, while identification provides decision information about the specific attack types involved. At present, researchers have proposed many abnormal traffic detection models, including feature matching, statistical rules, and machine learning. These models [
3] are widely used for abnormal flow monitoring. In recent years, using machine learning to detect abnormal traffic has become a hotspot in DDoS traffic detection. Marwane Zekri et al. [
4] proposed a DDoS detection system based on the C4.5 algorithm to mitigate the DDoS attack threat. Wathiq Laftah AY et al. [
5] proposed a multilevel hybrid intrusion detection model that uses support vector machines and extreme learning machines to improve the efficiency of detecting known and unknown attacks. Kuang et al. [
6] proposed an intrusion detection system based on an SVM (support vector machine) model combined with kernel principal component analysis (KPCA) and a genetic algorithm (GA). However, current research on DDoS attack detection has the following problems: (1) most work considers the detection of DDoS attack behavior, with little attention to recognizing composite attacks; (2) classifiers using limited statistical features can identify only a limited range of attack types; (3) the connection between attack features and attack types is not given; and (4) the DDoS attack detection accuracy is not high, and the detection response time is still slow.
In recent years, many machine learning methods have been developed to detect abnormal traffic, but most researchers use the KDD-Cup 99 [
7] dataset and the traffic features it provides in their experiments, but KDD-Cup 99 covers a limited number of DDoS attack types; alternatively, these researchers apply only a small number of traffic features. Practical experience has shown that the data and their features determine the upper bound of machine learning effectiveness; the algorithm and its optimization merely approximate that bound. Based on 102 TCP traffic features and 49 UDP traffic features, this paper constructs feature subsets that accurately represent different attack types through a feature selection algorithm and then optimizes the parameters of the GBDT algorithm to accurately and quickly identify malicious attack traffic in TCP and UDP flows. To effectively shorten the training and detection time of the algorithm and improve its generalization ability, we use the combination of the random forest and the Pearson correlation coefficient as the search strategy and the GBDT algorithm as the evaluation criterion to implement the feature selection algorithm. On this basis, the GBDT algorithm is tuned to identify composite DDoS attack types. The experimental results show that, after feature selection and tuning, the GBDT algorithm can quickly and accurately identify DDoS attack traffic.
This article contributes the following:
- (1)
From the thousands of features used in the DDoS traffic detection literature in recent years, 102 features are extracted and implemented as the original feature set of DDoS attack recognition. These features can effectively represent known and unknown DDoS attacks. The experimental results show that using these features on the 2017 WIDE dataset can accurately detect DDoS attack traffic;
- (2)
A feature selection method based on the random forest feature score and Pearson correlation coefficient is proposed. The method is compared with the traditional dimensionality reduction algorithm and feature selection method. The experimental results show that the method utilizes a smaller feature subset to maintain or improve the original attack detection accuracy;
- (3)
To identify the various DDoS attacks in composite DDoS attack traffic completely, accurately, and quickly, an attack identification algorithm based on GBDT is proposed, and a GBDT parameter optimization method is given. In addition, GBDT is compared with machine learning algorithms such as the nearest neighbor, naive Bayes, support vector machine, and deep networks, and the experimental results show that the GBDT algorithm is superior to the other algorithms in attack type recognition accuracy and running time.
- (4)
A method for constructing and optimizing the attack feature tree is proposed; by analyzing the relationship between attack types and attack features, it continuously refines the attack features that characterize each specific attack type.
The organizational structure of this paper is as follows. The second part discusses the related work and summarizes researchers’ results in existing work. The third part introduces the DDoS attack feature selection framework and model. The fourth part introduces the GBDT-based DDoS attack detection algorithm, its classification performance, and the comparison experiments with other machine learning algorithms. The fifth part introduces a feature selection algorithm based on the random forest and Pearson correlation coefficient analysis and its optimization of the DDoS classifier. The sixth part introduces the parameter tuning of the GBDT algorithm, the feature selection algorithm, and their application to improving DDoS attack identification performance. The seventh part summarizes the entire paper.
2. Background and Related Work
Feature extraction is a very important step in the machine learning process, and the quality of the features largely determines the performance of machine learning. The third International Knowledge Discovery and Data Mining Competition (KDDCUP’99) provides 41 traffic features for anomaly detection, such as duration, protocol_type, and service, in its intrusion detection datasets [
7]. Traffic statistics are very important for DDoS detection. Jiahui Jiao et al. [
8] defined two attack modes, the fixed source IP attack and the random source IP attack, and proposed a real-time DDoS detection method based on the TCP protocol. This method extracts 15 basic statistical features and 16 ratio features of TCP traffic, and malicious traffic is distinguished from normal traffic by two decision tree classifiers. Qin et al. [
9] proposed an entropy-based DDoS attack detection method. By constructing entropy vectors of different traffic features, a clustering analysis algorithm is used to model the normal pattern. The method defines several packet size levels, and packets at different levels are counted as different features. It chooses the fields of the network connection five-tuple (source IP, destination IP, source port, destination port, protocol number) together with the packet size (divided into five levels: 0–128, 128–256, 256–512, 512–1024, and 1024–1500 bytes), for a total of 11 features used for attack detection. Andrew W. Moore et al. [
10] proposed 249 features for traffic classification, including link layer features, IP layer features, and TCP layer features. This method computes the maximum, minimum, first quartile, median, and average values of the packet size (or inter-arrival time) from the client to the server and from the server to the client. These features are closely related to traffic classification, which indicates that they can also be applied to anomaly detection in TCP, UDP, and ICMP flows. Yaokai Feng et al. [
11] detected DDoS attacks using 55 features, such as the minimum packet size, the average packet size, and the variance of packet sizes in the session, and selected 10 of the 55 features to characterize Challenge Collapsar (CC) attacks. The CC attack is a type of DDoS attack that uses proxy servers to send a large number of seemingly legitimate requests to the victim server. CC is named after its attack tool; the attacker uses a proxy mechanism to launch DDoS attacks through a number of widely available free proxy servers. Many free proxy servers support anonymous mode, which makes tracking very difficult.
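The entropy-vector idea described above can be illustrated with a short sketch that computes the Shannon entropy of the packet-size level distribution, using the five size levels given earlier; the traffic samples and function names here are made-up illustrations, not the cited method's code. A flood concentrated in one level drives the entropy toward zero, while normal traffic spread across levels keeps it high.

```python
import math
from collections import Counter

# Packet-size level boundaries in bytes, following the five levels above.
LEVELS = [128, 256, 512, 1024, 1500]

def size_level(size):
    """Map a packet size to its level index (0-4)."""
    return next(i for i, bound in enumerate(LEVELS) if size <= bound)

def entropy(packet_sizes):
    """Shannon entropy (bits) of the packet-size level distribution."""
    counts = Counter(size_level(s) for s in packet_sizes)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

uniform = [64, 200, 400, 800, 1200]  # one packet in each of the five levels
flood = [64] * 100                   # flood-like: a single level dominates
print(round(entropy(uniform), 3), round(abs(entropy(flood)), 3))
```

With packets spread evenly over all five levels the entropy reaches log2(5) ≈ 2.322 bits, while the single-level flood yields 0, which is the kind of shift an entropy-vector detector looks for.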
Feature selection is an important means to improve the accuracy of machine learning detection and shorten the detection time. Excessive features result in redundancy, and feature redundancy increases the time required to train machine learning models; consequently, the detection accuracy decreases, and the detection time increases. Therefore, many researchers are committed to the study of feature selection. Yaokai Feng et al. [
11] considered that important features are essential for the early detection of DDoS attacks and used support vector machines (SVM) and principal component analysis (PCA) as feature selection algorithms. Their experimental results show that 10 features can characterize CC attacks. Liu et al. [
12] proposed the use of mutual information methods to eliminate redundant features.
Table 1 summarizes the feature selection methods that are currently used by researchers in DDoS testing.
Machine learning algorithms are the core of machine learning. Different algorithms produce different detection accuracies and detection times when applied to the same datasets. Detection accuracy and detection time are the two major indicators that directly affect the DDoS detection effect. In recent years, many researchers have used machine learning algorithms to detect abnormal traffic, and the detection results they obtained are also inconsistent. Common machine learning algorithms include principal component analysis (PCA), K-nearest neighbor (KNN), the naive Bayes classifier (NB), decision trees, support vector machines, K-means clustering, and the back propagation (BP) neural network.
The PCA-based DDoS attack detection was first proposed by A. Lakhina et al. [
13]. The method separates the high dimensional space occupied by a set of network traffic measurements into disjoint subspaces corresponding to normal and abnormal network conditions. Experiments have shown that this separation can be effectively achieved by principal component analysis. There are many differences between this approach and the PCA-based multivariate statistical process control (MSPC) method in the industrial processing and chemometric literature. José Camacho et al. [
14] recommended the use of multivariate statistical network monitoring (MSNM) to effectively avoid these shortcomings and reported the limitations of using PCA in networks.
Regarding KNN-based DDoS attack detection, Ming-Yang Su [
25] proposed a method for detecting large-scale DDoS attacks in real time by a weighted KNN classifier. In addition, a combination of a KNN and a genetic algorithm for feature selection and weighting is proposed. The experiment shows that the overall accuracy is as high as 97.42% for known attacks. For unknown attacks, a 78% accuracy rate was obtained. Known attacks and unknown attacks are relative to the training set of machine learning. If the training set contains a certain DDoS attack type, the DDoS attack type belongs to a known attack. If the DDoS attack type is not included in the training set, the DDoS attack type belongs to an unknown attack. If a model can detect not only known attacks but also unknown attacks, the model performs very well. When multiple DDoS attacks occur simultaneously, the ability of the model to detect unknown attacks is very important. The authors in [
26] proposed a hybrid approach to intrusion detection systems that uses a boundary-cutting algorithm based on the Manhattan distance and the Jaccard similarity coefficient, combined with the KNN algorithm, to implement intrusion detection.
Regarding NB-based DDoS attack detection, Yunpeng Wang et al. [
27] applied the NB and ReliefF algorithms to propose a naive Bayesian classification method. It uses the ReliefF algorithm to assign a weight to each attribute in the KDD-Cup 99 datasets, which reflects the relationship between the attribute and the final class and yields better classification results. Thaseen et al. [
28] proposed an intrusion detection model using linear discriminant analysis (LDA), chi-square feature selection, and improved naive Bayesian classification. They use Bayesian classifiers to identify normal and abnormal traffic in the NSL-KDD datasets. The experimental results show that the Bayesian classifier combined with other feature selection methods yields higher accuracy and a lower false positive rate.
Regarding decision tree-based DDoS attack detection, Lakshmi et al. [
29] used decision trees to protect wireless nodes and target nodes within the network from DDoS attacks. Malik et al. [
30] used the standard particle swarm optimization algorithm to prune decision trees from single- and multi-objective perspectives. The pruned decision tree classifier is then used to detect anomalous network connections. Experiments on the KDD-Cup 99 datasets showed an average detection accuracy of 93.26%.
Regarding SVM-based DDoS attack detection, Adel A. et al. [
31] proposed an SVM-based framework for detecting denial of service attacks in virtualized clouds under a changing infrastructure. The framework collects system metrics to train SVM classifiers to distinguish between the normal and malicious activities of virtual machines (VMs); associating VM application metrics with the actual resource loads enables the hypervisor to distinguish between a benign high load and a DDoS attack. Wang et al. [
32] proposed an effective intrusion detection framework based on an SVM with augmented features. A logarithm marginal density ratio transformation is applied to the original features to obtain new, better-quality transformed features, so that the SVM-based detection model achieves a greatly improved detection ability. Feng et al. [
33] combined the SVM method with the clustering method based on a self-organized ant colony network (CSOACN) to take full advantage of both approaches, achieving an accuracy of 95.3% on the KDD-Cup 99 datasets.
Approaches based on K-means DDoS detection have also been employed. Gina C. [
34] fully integrated an SOM neural network with the K-means algorithm and developed a two-stage classification system to correlate related alarms and further classify them. This method effectively reduces redundant and noisy alarms. The author conducted experiments on the DARPA dataset; the results show that the method can filter out 96% of the false positives, yielding a much lower false positive rate. Ujwala Ravale et al. [
35] proposed a hybrid technique that combines the K-Means clustering algorithm and the RBF kernel function of the support vector machine as a classification module. The main goal of the proposed technique is to reduce the number of attributes associated with each data point. The accuracy of the proposed technique in the KDDCUP’99 datasets reached 96.3%.
BP-based DDoS attack detection was reported in the following work. Zouhair Chiba et al. [
36] proposed an optimal method for constructing an anomaly-based NIDS (network intrusion detection system) built on a back propagation neural network trained with the back propagation learning algorithm. Yunhe Cui et al. [
37] used BP to detect traffic in real time on a software defined network (SDN). They first obtain all the flow entries of each switch. After receiving the flow statistics sent by the switch, the controller parses them and processes the flow entries in the message one by one. The feature values of a flow entry include the number of packets matched by the flow entry, the number of bytes matched by the flow entry, the lifetime of the flow entry, the packet rate of the flow entry, and the byte rate of the flow entry. The extracted feature values are then passed to the BP network to determine whether the TCP or UDP flow is benign or malicious. This method of detection takes a long time.
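The per-flow-entry features just listed reduce to two simple divisions over the raw counters; the following sketch makes that concrete. The field names are illustrative assumptions rather than an actual OpenFlow statistics schema.

```python
def flow_entry_features(stats):
    """Derive the five per-flow-entry features described above from the
    raw counters of one flow entry. Field names are hypothetical."""
    lifetime = max(stats["duration_sec"], 1e-6)  # guard against division by zero
    return {
        "packets": stats["packet_count"],
        "bytes": stats["byte_count"],
        "lifetime": stats["duration_sec"],
        "packet_rate": stats["packet_count"] / lifetime,  # packets per second
        "byte_rate": stats["byte_count"] / lifetime,      # bytes per second
    }

# A made-up flow entry: 1200 packets / 96 kB over a 60 s lifetime.
entry = {"packet_count": 1200, "byte_count": 96000, "duration_sec": 60}
print(flow_entry_features(entry))
```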
At present, composite attacks receive little recognition in DDoS detection, and the limited statistical features used by classifiers restrict the range of attack types that can be identified. Moreover, the detection accuracy of DDoS attacks is not high, and the detection response time is still slow. Therefore, this paper also provides the connection between attack features and attack types, helping decision makers quickly narrow down the range of attack types.
3. System Model and Problem Statement
The overall framework of this paper is shown in
Figure 1. The first step is to extract the data from the pcap files, that is, to parse the packets of each pcap file. Then, the packets are assembled into TCP flows or UDP flows, from which features are extracted. Finally, each TCP or UDP flow forms one data record, which is labeled. The specific generation steps are discussed in detail at the end of this section. In the second step, before feature selection, we first establish the GBDT algorithm as the classifier for DDoS detection and recognition, which is discussed in detail in the fourth part. In the third step, the feature selection method based on the random forest feature score and the Pearson correlation coefficient is used to select features for each attack type. In the fourth step, each attack vector is characterized by a feature subset, and an attack vector feature tree is constructed.
DDoS attacks are constantly changing, and each time a DDoS attack occurs, it is often accompanied by multiple attack methods. At present, there are 45 kinds of DDoS attacks based on TCP and UDP protocols [
38,
39,
40]. Many researchers have proposed DDoS attack classification methods [
41,
42,
43,
44]. In view of the current types of DDoS attacks, this paper extracts 102 features by summarizing the research results of these researchers [
8,
9,
10,
11]. As shown in
Appendix A Table A1, the features corresponding to sequence numbers 1–102 are based on 102 features of the TCP flow, and the features corresponding to sequence numbers 54–102 are based on 49 features of the UDP flow. These features serve as a collection of original features for detecting DDoS attacks.
Extracting the important features of the DDoS attack type is very important for early detection of DDoS attacks [
11]. A type of DDoS attack tends to highlight several important features; in other words, several features can characterize a DDoS attack. If we can observe these features, we can narrow down the range of DDoS attack types, which greatly helps later DDoS mitigation. According to the method proposed by the researchers in [
41,
42,
43,
44], each attack vector can be characterized by feature subsets to construct the attack vector feature tree. When a DDoS attack occurs, the attack feature tree can be used to quickly locate the DDoS attack type.
The goal of constructing a feature tree is to quickly locate which types of attacks appear when a hybrid attack occurs. To clearly describe the construction process of the feature tree, the details are shown in
Figure 1. (1) First, the features of the mixed attack traffic are extracted, and the obtained feature subset is used as the root of the feature tree. (2) Then, the mixed traffic corresponding to the root node is separated into its constituent attack types, the traffic of each attack type is mixed with normal traffic, feature extraction is performed again, and the obtained feature subsets are used as leaf nodes. (3) Subsequently, all the leaf nodes are iterated over, and steps (1) and (2) are repeated. Performing steps (1), (2), and (3) yields the feature tree.
According to the attack vector feature tree, as shown in
Figure 1, when feature subset 1 changes, the attack type range can be quickly narrowed, that is, to the node where feature subset 1 is located and the DDoS attack types corresponding to its child nodes. The attack vector feature tree enables fast perception of attacks and is beneficial to DDoS mitigation. Additionally, it saves feature selection time, because machine learning can learn and predict directly with the changed feature subset.
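The recursive construction of the feature tree can be sketched as follows. The `select_features` helper and the traffic objects are hypothetical stand-ins: in the paper, feature selection is performed by the random forest and Pearson correlation method of the later sections, whereas here it simply collects the features present in the toy traffic.

```python
def select_features(traffic):
    # Stand-in for the paper's feature selection algorithm: here it just
    # returns the sorted union of features observed in the toy traffic.
    return sorted({f for flow in traffic for f in flow["features"]})

def build_feature_tree(traffic, attack_types, normal):
    """Steps (1)-(3): select features for the mixed traffic (root), then
    recurse on each attack type mixed with normal traffic (leaf nodes)."""
    node = {"features": select_features(traffic), "children": {}}
    if len(attack_types) <= 1:          # a single attack type: leaf node
        return node
    for atk, atk_traffic in attack_types.items():
        node["children"][atk] = build_feature_tree(
            atk_traffic + normal, {atk: atk_traffic}, normal
        )
    return node

# Made-up traffic: one normal flow and two single-feature attack flows.
normal = [{"features": ["pkt_size_avg"]}]
attacks = {
    "SYN flood": [{"features": ["syn_in_pps"]}],
    "UDP flood": [{"features": ["udp_pps"]}],
}
mixed = normal + [flow for t in attacks.values() for flow in t]
tree = build_feature_tree(mixed, attacks, normal)
print(tree["features"], sorted(tree["children"]))
```

The root holds the feature subset of the mixed traffic, and each child holds the subset selected for one attack type mixed with normal traffic, mirroring steps (1) through (3).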
Definition 1. Feature Set: a collection of features from Appendix A Table A1. A feature set A is represented by [a1, a2, a3, ..., an], where each ai ∈ {1, 2, 3, ..., 102} is a sequence number in Appendix A Table A1 and denotes the feature with that sequence number. For example, the feature set [1, 6, 7, 9] has four features, namely, the features corresponding to those sequence numbers in Appendix A Table A1: [syn_in_pps, Push_out_pps, Fin_in_pps, Rst_in_pps].
Definition 2. One-way flow: a list of the data packets in a TCP or UDP flow that are sent by the client to the server, arranged in chronological order, or a list of the data packets that are sent by the server to the client, arranged in chronological order.
To quickly extract DDoS features, this paper builds a big data processing framework. The processing flow is shown in
Figure 2. Kafka [
45] is a publish–subscribe message queuing system. Mainly considering the practicability of the engineering implementation, we introduced Kafka into the experiment. Kafka has a buffering effect; with Kafka, high-volume DDoS attack traffic will not overwhelm our services. Kafka is often used to collect real-time data from applications, and the data format that Kafka sends to consumers is <key, value, timestamp>. Spark Streaming is a consumer of the data. It is an extension of the core Spark API that enables scalable, high-throughput, and fault-tolerant stream processing of real-time data streams. HBase is a distributed storage database that can store data with different key values at different times. First, the framework parses the fields of each packet of a pcap file into a text file, that is, a packet is parsed into a row of data. Kafka sends the parsed data to Spark Streaming in the format <K, V>, where K = sip#dip#sport#dport#protocol is the typical five-tuple for reconstructing TCP/UDP flows, and V contains the packet arrival time, the packet size, IP datagram fields such as the version number and the IP header length, the TCP packet fields, and the UDP packet fields. Spark Streaming then compares K with each K_i in List to determine whether the flow is normal traffic or attack traffic. List holds the labels we manually assign (normal, SYN flood, or other), with a five-tuple representing each TCP or UDP flow, such as List = [<K1, type1>, <K2, type2>, ..., <Ki, typei>], where Ki represents the five-tuple of TCP or UDP flow i, and typei indicates the attack type of flow i. After the comparison, we obtain <K, V, type>, which determines the label of each packet. Finally, the result is stored in HBase in the form <type, E>, where type refers to the attack type and E refers to the statistical feature set. The specific algorithm for computing <K, V, type> and converting it to <type, E> is as follows:
- (1)
Enter <K, V>.
- (2)
Use the map operator in Spark to determine the flow direction of the packet corresponding to <K, V>. It returns <K_in, V> if the packet is flowing from the client to the server; otherwise, it returns <K_out, V>.
- (3)
Then, call the reduce operator, and the result returns a one-way flow. For example, return <K_in, (V1, V2, V3, …, Vn)>.
- (4)
The reduce function in (3) calculates the features of
Appendix A Table A1. For example, syn_in_pps = countSYN(V1, V2, V3, …, Vn)/(Vn.time − V1.time), where countSYN(V1, V2, V3, …, Vn) is the number of packets among V1, V2, V3, …, Vn with the SYN flag set, which can be determined from the flags field of the TCP packet, and Vn.time − V1.time is the timestamp of the last packet Vn minus the timestamp of the first packet V1. After all the features are calculated, <K, E> is returned, where E = “e1, e2, e3, …, en” and each en represents a feature in
Appendix A Table A1.
- (5)
Then, continue to call the reduce function, merge the features of the two unidirectional flows of the same flow in (4), and return <K, E>.
- (6)
Call the JOIN function, List.join(E), where E is the result returned by (5). The purpose of calling the JOIN function is to label each E. The JOIN yields <K, (type, E)>, from which <type, E> is returned.
- (7)
Store the results returned by (6) in the HBase database.
In this way, the machine learning algorithms in this paper can retrieve the normal-label data and the data of selected attack types from the HBase database, which makes it easy to extract the feature set corresponding to a certain attack type.
4. DDoS Classifier
Decision trees are a basic classification and regression method. The decision tree model classifies quickly, but it is also prone to overfitting. When boosting is applied to a classification problem, it learns multiple classifiers by changing the weights of the training samples and linearly combines these classifiers to improve classifier performance. Boosting is expressed as:
f(x) = ∑_{m=1}^{M} α_m φ(x; w_m),
where
w_m is the parameter vector of the mth weak classifier,
φ is the weak classifier (basis function),
M represents the number of classifiers, and
α_m represents the weight of the mth classifier. It can be seen that the final classifier is a linear combination of basis functions.
The GBDT algorithm is a boosting algorithm proposed by Friedman [
46] in 2001. It is an iterative decision tree algorithm consisting of multiple decision trees, and the conclusions of all trees are summed to give the final answer. The specific idea is that each new tree is built to fit the negative gradient of the loss function of the model established so far, whereas the traditional boosting idea is to reweight correctly and incorrectly classified samples. The GBDT algorithm plays two roles in this paper: on the one hand, it implements the DDoS classifier; on the other hand, it is used to evaluate and verify the effect of feature selection.
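The gradient fitting idea can be shown with a toy sketch: under squared loss the negative gradient is simply the residual, so each new "tree" (here a one-split decision stump) is fit to the residuals of the current model. This is a didactic illustration, not the GBDT implementation used in the paper, and the 0/1 labels stand in for an attack/benign signal.

```python
def fit_stump(x, residuals):
    """Find the split on a 1-D feature minimizing squared error."""
    best = None
    for threshold in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, lmean, rmean)
    _, t, lv, rv = best
    return lambda xi: lv if xi <= t else rv

def gbdt_fit(x, y, n_trees=20, lr=0.5):
    """Each stump fits the residuals (= negative gradient of squared loss)
    of the running prediction; the ensemble is their weighted sum."""
    pred = [0.0] * len(y)
    trees = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        trees.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(lr * t(xi) for t in trees)

x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]  # toy 0/1 labels regressed directly
model = gbdt_fit(x, y)
print([round(model(xi), 2) for xi in x])
```

With each round, the ensemble moves a learning-rate-sized step toward the remaining residual, so the predictions converge to the 0/1 targets; real GBDT does the same with deeper trees and, for classification, a logistic loss whose negative gradient replaces the plain residual.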
KNN, SVM, NB, and MLP (multi-layer perceptron) are algorithms that are often used in intrusion detection. Here, the paper compares the GBDT algorithm with KNN, SVM, NB, and MLP in three aspects: accuracy, running time, and the ROC (receiver operating characteristic) curve. The parameters of the KNN, SVM, and NB algorithms are optimized with GridSearchCV, but it is difficult to determine the optimal parameters for the MLP, which can only be tuned according to personal experience. The results are shown in
Figure 3,
Figure 4 and
Figure 5, respectively. In the TCP flow, the attack traffic types are salphfl, malphfl, alphfl, mptmp, mptp, ptmp, ntscSYN, sntscSYN, ptmpHTTP, and mptpHTTP. In UDP flows, the attack traffic type refers to ptpposcaUDP. These types of attacks come from the division of the WIDE dataset [
47], such as the attack type ntscSYN, which refers to network_scan_SYN. We randomly extract 1 k, 10 k, and 100 k TCP and UDP datasets from the HBase database. This paper uses the five machine learning algorithms GBDT, KNN, SVM, NB, and MLP to classify the 1 k TCP and 1 k UDP datasets. The results show that when classifying TCP traffic, the training accuracy of the GBDT algorithm reaches 0.99, and its test accuracy reaches 0.98. The training accuracy of the KNN algorithm reaches 0.89, and its test accuracy is 0.9, while the training accuracy and test accuracy of the SVM algorithm are 0.89 and 0.95, respectively. The training accuracy and test accuracy of the MLP algorithm are 0.96 and 0.93, respectively. The lowest is the NB algorithm, whose training accuracy and test accuracy are 0.56 and 0.6, respectively. It can be seen that the accuracy of the GBDT algorithm is significantly better than that of the other algorithms when the dataset’s size is 1 k. As the amount of data increases, the accuracy of the GBDT, KNN, SVM, and MLP algorithms decreases, but the accuracy of the NB algorithm barely fluctuates and remains poor. The NB algorithm is suitable for datasets with few features whose values are discrete and few in number; the features extracted in this paper are continuous and number 102, which is clearly unsuitable for the NB algorithm. When the dataset’s size is 10 k, the training accuracy and test accuracy of the GBDT algorithm decrease to 0.959 and 0.962, respectively, but its test accuracy remains higher than that of the KNN, SVM, and MLP algorithms. Similarly, when the amount of data increases to 100 k, the test accuracy of the GBDT algorithm is higher than that of the KNN, SVM, and MLP algorithms. For UDP flows, when the dataset’s size is 1 k, the training accuracy and test accuracy of the GBDT algorithm are 0.994 and 0.95, respectively.
Regardless of the training accuracy or the test accuracy, the GBDT algorithm achieves better results than other algorithms. As the size of the datasets increases, the accuracy of all algorithms decreases. However, the performance of the GBDT algorithm is superior to other algorithms. The reason for the decrease in accuracy may be due to the inaccurate direction of the flow during the feature extraction phase. In general, the training accuracy and test accuracy of TCP are higher than the training accuracy and test accuracy of UDP in the corresponding algorithm because TCP traffic has a richer DDoS attack signature.
In terms of training time, as the amount of data increases, the training time of the GBDT algorithm also increases, and the training time of the GBDT is not the best among the five algorithms. However, when the data volume is 100 k, the GBDT test time on the TCP datasets is only 0.031 s, and the test time on the UDP datasets is only 0.023 s, both smaller than those of the KNN, SVM, and MLP algorithms.
As shown in
Figure 4, the training time and test time of the SVM algorithm are very long: on the 100 k TCP dataset, they are 451 and 25.22 s, respectively, and on the 100 k UDP dataset, they are 202 and 9.973 s, respectively. The MLP algorithm also takes a long time; its training time and test time on the 100 k TCP dataset are 28.456 and 1.02 s, respectively. Although the training time of the KNN algorithm is relatively short, its test time is relatively long: the test times on the 100 k TCP dataset and the 100 k UDP dataset are 47.77 and 6.925 s, respectively. The test time is very important for DDoS detection; the shorter the test time, the earlier a DDoS attack can be detected, thus preventing the attack from posing a greater threat to the server. Obviously, the GBDT algorithm is superior to the other algorithms in terms of accuracy and test time. The comparison of the ROC curves of the GBDT algorithm with the other algorithms is shown in
Figure 5. The ROC curve of the GBDT algorithm encloses the ROC curve of the other algorithms. In summary, the time to detect DDoS attacks with the GBDT algorithm is shorter than that of the other algorithms, and the accuracy of detecting DDoS attacks is higher than that of the other algorithms. Therefore, the GBDT algorithm is selected as the DDoS attack detection classifier.
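The GridSearchCV-style tuning applied to the baseline algorithms above amounts to scoring every combination in a parameter grid and keeping the best one, which the following plain-Python sketch illustrates. The grid values and the `evaluate` score are illustrative stand-ins; a real run would train the classifier and score it with cross-validation.

```python
from itertools import product

# A hypothetical parameter grid, loosely in the shape of GBDT parameters.
param_grid = {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1], "max_depth": [3, 5]}

def evaluate(params):
    # Stand-in for cross-validated accuracy of a model with these parameters;
    # the formula is arbitrary and exists only to make the search concrete.
    return 0.9 + 0.0001 * params["n_estimators"] * params["learning_rate"] / params["max_depth"]

# Exhaustively enumerate the grid and keep the best-scoring combination.
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
best_params = max(candidates, key=evaluate)
print(best_params)
```

Exhaustive grid search is simple and reproducible but grows multiplicatively with each added parameter, which is one reason manual tuning was preferred for the MLP above.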
For the TCP flow, this paper simply collects 102 features using the GBDT algorithm without optimization to achieve a test accuracy of approximately 95%; for UDP flows with a total of 49 features, the GBDT algorithm achieves a test accuracy of 91%. The selected features of
Appendix A Table A1 and the GBDT algorithm are thus valid for DDoS attack traffic detection. However, because of the large number of features, feature selection is required to improve the generalization ability of the algorithm and shorten the detection time.
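A baseline measurement of this kind (training an untuned GBDT model and timing training and testing separately) can be sketched as follows. The feature matrix here is a synthetic stand-in for the paper's 102-feature TCP flow data, and the labels are artificial; scikit-learn's GradientBoostingClassifier is assumed as the GBDT implementation.

```python
import time

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 102-feature TCP flow dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 102))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic attack/normal label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = GradientBoostingClassifier()  # default parameters, no tuning yet

start = time.time()
clf.fit(X_train, y_train)
train_time = time.time() - start

start = time.time()
acc = clf.score(X_test, y_test)  # test time is what matters for online detection
test_time = time.time() - start

print(f"accuracy={acc:.3f} train={train_time:.2f}s test={test_time:.3f}s")
```

As in the paper's measurements, the test (inference) time is typically orders of magnitude smaller than the training time, which is why GBDT is attractive for online detection even though its training is not the fastest.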
5. DDoS Classifier Optimization
The previous section showed that the GBDT algorithm is a better DDoS attack classifier than the other algorithms. This section further optimizes the GBDT classifier in two respects: feature selection and parameter tuning of the classifier itself.
5.1. Feature Selection
Feature selection has two purposes. First, for each DDoS attack, the features that best characterize the attack behavior are selected. Second, feature selection helps improve detection accuracy and reduce the false positive rate, so that the classifier can be used accurately. Choosing good attack features not only effectively identifies DDoS attack traffic but also improves the response speed of the algorithm, provided the recognition rate allows.
Feature selection methods are divided into three types: filter methods, wrapper methods, and embedded methods. The main idea of the filter method is to score the importance of each feature and then select features according to the ranking of the scores; the main techniques are the chi-squared test, information gain, and correlation coefficients. The main idea of the wrapper method is to regard the selection of a feature subset as a search optimization problem: different combinations are generated, each combination is evaluated, and the combinations are compared against one another. The main technique is the recursive feature elimination algorithm. The embedded method identifies the attributes that are important to the model during model training. Feature selection can also be combined with the artificial ant colony algorithm, artificial bee colony algorithm, genetic algorithm, simulated annealing algorithm, and other algorithms to obtain the best features. This paper proposes a new integrated feature selection method based on random forests and Pearson correlation coefficients.
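As a concrete illustration of the filter method described above, the following sketch scores each feature independently and keeps the top k. The synthetic data and the choice of k = 5 are assumptions for demonstration; scikit-learn's SelectKBest with the ANOVA F-score is used as the scoring function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic flow features: 20 features, of which 5 carry class information.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Filter method: score every feature independently, then keep the top k.
selector = SelectKBest(score_func=f_classif, k=5)
X_sel = selector.fit_transform(X, y)
print("kept feature indices:", np.flatnonzero(selector.get_support()))
```

Note that a filter method like this scores features one at a time, so, unlike the wrapper approach used later in this section, it cannot account for interactions between features.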
Random forests are a common feature selection method. For DDoS attack feature selection, Robin G. et al. [48] proposed a feature importance index based on random forests, gradually adding features according to their random forest importance scores. Zahangir Alam et al. [49] used random forest importance scores to rank features and extract the top-ranking ones. Random forests can effectively extract feature importance scores but cannot distinguish the correlation between features.
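Extracting and ranking importance scores of the kind used in [48,49] can be sketched as follows. The data here are synthetic, and scikit-learn's Gini-impurity-based `feature_importances_` is assumed as the importance measure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, 4 of which are informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, y)

# Rank features by Gini importance, highest first.
order = np.argsort(rf.feature_importances_)[::-1]
for idx in order[:5]:
    print(f"feature {idx}: importance {rf.feature_importances_[idx]:.3f}")
```

The scores sum to 1 and rank features individually, which is exactly the limitation noted above: two highly correlated features can both receive high scores, so an additional correlation check is needed.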
The method in this paper first calculates and ranks all feature importance scores using a random forest. Features whose importance score is less than Ω are removed, where Ω is a threshold on the random forest importance score. The rationale for removing features scoring below Ω is mainly to discard features that are irrelevant or nearly irrelevant to the classification category. This paper then introduces the Pearson correlation coefficient, which measures the degree of correlation between two variables. For two features X and Y, their correlation is calculated as follows:

ρ(X, Y) = (N ΣXᵢYᵢ − ΣXᵢ ΣYᵢ) / √[(N ΣXᵢ² − (ΣXᵢ)²)(N ΣYᵢ² − (ΣYᵢ)²)]

where N is the number of records. The resulting correlation coefficient can be interpreted as follows: (1) when the correlation coefficient is 0, the features X and Y have no linear relationship; (2) when Y increases (decreases) as X increases (decreases), the two features are positively correlated, and the correlation coefficient lies between 0.00 and 1.00; (3) when Y decreases (increases) as X increases (decreases), the two features are negatively correlated, and the correlation coefficient lies between −1.00 and 0.00. The larger the absolute value of the correlation coefficient, the stronger the correlation: a coefficient close to 1 or −1 indicates a strong correlation, while a coefficient close to 0 indicates a weak correlation.
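The pairwise Pearson correlation between two features can be computed directly from the sums over the N records; a minimal sketch in pure Python:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length feature vectors."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly positively correlated: 1.0
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly negatively correlated: -1.0
```

In practice, a library routine such as `numpy.corrcoef` computes the same quantity for all feature pairs at once.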
This paper proposes a feature selection method based on a random forest and the Pearson correlation coefficient. The random forest calculates an importance score for each feature, and the features whose importance score is higher than Ω are combined into a feature subset. The Pearson correlation coefficient is then calculated for each pair of features in this subset. Within each group of features whose pairwise Pearson correlation coefficient exceeds ρ0 (the Pearson correlation coefficient threshold), all but the feature with the highest importance score are removed. The remaining features are sequentially input into the GBDT algorithm: if adding a feature to the previously selected features improves the test accuracy of the GBDT algorithm, the feature is kept as a result feature; otherwise, it is discarded. The specific procedure is shown in Algorithm 1. The Pearson(i, j) function on line 5 of the algorithm computes the Pearson correlation coefficient of features i and j. Line 7 states that if the Pearson correlation coefficient of features i and j is greater than ρ0, the pair [i, j] is added to list, a list of correlated feature pairs. For example, if list = [[1, 2], [4, 5], [5, 13]] and the Pearson correlation coefficient of features 41 and 42 is greater than ρ0, then [41, 42] is appended, giving list = [[1, 2], [4, 5], [5, 13], [41, 42]]. Line 8 merges pairs that share a feature; merging the related pairs in this example yields list1 = [[1, 2], [4, 5, 13], [41, 42]]. The function lowerscore(h) on line 10 returns the features in group h with the lower importance scores. For example, when h = [4, 5, 13] and feature 4 scores the highest, lowerscore(h) returns the set [5, 13].
Line 14 shows the function CrossGBDT(A2, k), which forms a new feature subset from the current subset A2 and the feature k, inputs it into the GBDT algorithm, and returns the test accuracy. The feature selection algorithm in this paper is called RFPW (random forest and Pearson wrap).
Algorithm 1: Feature selection algorithm
Input: dataset D; feature set A; learning algorithm GBDT; feature importance score threshold Ω; Pearson correlation coefficient threshold ρ0
Output: feature subset A2
1: Calculate the importance score of each feature in A using a random forest
2: Sort the scores and select the feature subset A1 with scores greater than Ω
3: for i in |A1|:
4:  for j in |A1|:
5:   ρ = Pearson(i, j)
6:   if ρ > ρ0 then:
7:    list.add([i, j])
8: Combine related features in list to obtain a new list1
9: for h in list1:
10:  A1 = A1 − lowerscore(h)
11: test = 0, A2 = {}
12: Sort A1 in descending order of score
13: for k in A1:
14:  testscore = CrossGBDT(A2, k)
15:  if testscore > test:
16:   A2.add(k), test = testscore
17: Output feature subset A2
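The RFPW procedure can be sketched in Python roughly as follows. This is an illustrative implementation under assumptions: scikit-learn is used for the random forest and GBDT, cross-validated accuracy stands in for CrossGBDT's test accuracy, and the thresholds Ω and ρ0 and the synthetic data are chosen for demonstration only.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

def rfpw(X, y, omega=0.01, rho0=0.9):
    """Sketch of RFPW: random forest scoring, Pearson pruning, greedy wrapper."""
    # Steps 1-2: random forest importance scores; keep features scoring > omega.
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    scores = rf.feature_importances_
    A1 = [i for i in range(X.shape[1]) if scores[i] > omega]

    # Steps 3-8: group features whose pairwise Pearson correlation > rho0,
    # merging groups that share a feature.
    corr = np.corrcoef(X.T)
    groups = []
    for i, j in combinations(A1, 2):
        if abs(corr[i, j]) > rho0:
            hit = [g for g in groups if i in g or j in g]
            merged = set().union(*hit, {i, j}) if hit else {i, j}
            groups = [g for g in groups if g not in hit] + [merged]

    # Steps 9-10: within each correlated group, keep only the top-scoring feature.
    for g in groups:
        best = max(g, key=lambda f: scores[f])
        A1 = [f for f in A1 if f not in g or f == best]

    # Steps 11-16: greedy wrapper; add a feature only if it improves accuracy.
    A1.sort(key=lambda f: scores[f], reverse=True)
    A2, best_score = [], 0.0
    for k in A1:
        trial = A2 + [k]
        score = cross_val_score(GradientBoostingClassifier(),
                                X[:, trial], y, cv=3).mean()
        if score > best_score:
            A2, best_score = trial, score
    return A2, best_score

# Demonstration on synthetic data with some redundant (correlated) features.
X, y = make_classification(n_samples=300, n_features=15, n_informative=5,
                           random_state=0)
selected, acc = rfpw(X, y)
print("selected features:", selected, "cv accuracy:", round(acc, 3))
```

The key property of the wrapper stage is that a candidate feature is evaluated together with the features already chosen, so redundant features that survive the Pearson pruning can still be rejected if they do not raise the GBDT test accuracy.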
5.2. GBDT Algorithm Parameters
After feature selection, the parameters of the GBDT algorithm must be tuned to obtain higher accuracy. Optimizing these parameters ensures that the GBDT algorithm achieves better detection accuracy and generalization ability. The GBDT algorithm has four main parameters: the maximum number of weak learners (CART trees), the subsampling ratio, the condition for continuing to split a subtree, and the maximum depth of the subtree. The maximum number of weak learners is represented by n_estimators: when n_estimators is too small, the model easily underfits; when it is too large, the model easily overfits. The subsampling ratio is represented by subsample and indicates that the GBDT algorithm samples without replacement; if its value is less than 1, only part of the samples are used to fit each decision tree. The condition for continuing to split a tree is represented by min_samples_split, which limits the conditions under which a subtree continues to be partitioned. The maximum depth of the tree is represented by max_depth; a value that is too large can easily cause overfitting. These four parameters are important for building a robust GBDT model.
This article starts with the number of weak learners (n_estimators). The result is shown in
Figure 6. When n_estimators = 400, the training accuracy is optimal; beyond 400 iterations, the accuracy drops rapidly.
Figure 7 shows that a larger maximum depth max_depth is not always better, and a smaller minimum division set min_samples_split is not always better. When max_depth is small, the effect decreases as min_samples_split increases. When max_depth = 16 and min_samples_split = 150, the accuracy is 0.9997.
Figure 8 shows that a subsample value of 0.25 is optimal. This paper uses n_estimators = 400, max_depth = 16, min_samples_split = 150, and subsample = 0.25 as the optimized model parameters.
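A parameter search over these four parameters can be sketched with a grid search, as follows. The grid values and synthetic data here are small assumptions chosen so the example runs quickly; the paper's final values (n_estimators = 400, max_depth = 16, min_samples_split = 150, subsample = 0.25) came from the larger sweeps shown in Figures 6–8.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Small illustrative grid over the four parameters discussed above.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 8],
    "min_samples_split": [2, 50],
    "subsample": [0.5, 1.0],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cv accuracy:", round(search.best_score_, 4))
```

Cross-validated search of this kind guards against the overfitting risks noted above for large n_estimators and max_depth, since each candidate is scored on held-out folds rather than on the training data.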