1. Introduction
The decision tree is a powerful white-box classifier comprising branches and nodes that delivers an interpretable model in the form of “if–else” rules. However, because most of the information is visible in the tree-like structure, the model is highly susceptible to privacy attacks. For example, a protocol type or port number attribute that branches to or from an IP address node in the tree may reveal that certain services are available on a particular machine [1].
Although Li and Sarkar [2] proposed a privacy pruning model based on risk assessment, their approach differs entirely from the solution proposed in this paper. Specifically, we use a case study in the network intrusion detection system (NIDS) domain to illustrate how we tackle the privacy issue mentioned above. Furthermore, to the best of our knowledge, none of the previous works has designed a privacy pruning solution from the perspective of IP truncation.
An IP address is viewed as personal data when it is combined with other information and can be used to profile an individual [3,4]. For example, an attacker can deduce sensitive information, such as an individual’s online behaviour and personal communications, from the IP address and other information in network traces [4]. IP truncation can be embedded into the C4.5 tree to anonymise such sensitive information [5,6]. However, truncation obscures some information, and careless handling of it will degrade the performance of the C4.5 tree. In this work, we examine the impact of IP truncation on the C4.5 tree in the NIDS domain.
2. Literature Review
A decision tree model often overfits its training data due to the greedy approach used to build the classifier. Pruning is commonly used to simplify an unpruned decision tree by removing insignificant nodes, branches, or leaves [7,8,9]. Various pruning techniques have been proposed in the literature to yield a decision tree model with higher performance (i.e., classification accuracy) and better generalisation capability. Since the focus of this work is on tree pruning, some existing pruning strategies are discussed here. The most common use of pruning is to improve the decision tree model’s classification accuracy (i.e., to reduce its classification error). In general, nodes or leaves are pruned if the classification accuracy of the pruned tree is better than that of the unpruned tree. Reduced Error Pruning [10], Pessimistic Error Pruning [10], Error-based Pruning [11], Critical Value Pruning [12], Minimum Error Pruning [12], and Improved Expected Error Pruning [13] are some of the notable pruning algorithms that endeavour to minimise the error rate of the decision tree model. In its default settings, the C4.5 decision tree [11] uses Error-based Pruning (an improved version of Pessimistic Error Pruning).
A decision tree model can be viewed as a sequence of “if–else” rules flowing from the root node to a leaf node to provide a classification decision or regression result. Although maximising classification accuracy is important, the structural complexity (i.e., the number of nodes, branches, and leaves) of a tree should not be overlooked, because an over-complicated tree structure hinders the interpretation of the decision rules. Therefore, some researchers opt to build a simpler decision tree through pruning to improve its interpretability. In addition, a simpler tree structure is more computationally friendly. Well-known pruning schemes that seek to reduce the complexity of the tree structure include Cost-Complexity Pruning [14], improved Cost-Complexity Pruning [15], and the Optimal Pruned Tree [16]. Among them, Cost-Complexity Pruning was employed in the renowned Classification and Regression Trees (CART) developed by Breiman et al. [14].
Apart from pruning with a single objective such as maximising classification accuracy or reducing the number of nodes, Osei-Bryson [17] pointed out additional evaluation criteria, namely the stability and interpretability of a tree, for assessing the quality of the model. Fournier and Crémilleux [18] designed Depth-Impurity Pruning based on the Impurity Quality Node index, which considers both the depth of a node and the average distance of the data belonging to a leaf node from its centre of mass [19]. In contrast to Cost-Complexity Pruning [14], which only considers the number of nodes, Wei et al. [20] include the depth (i.e., level) of the tree and the classification accuracy as pruning evaluation criteria by utilising rough set theory [21].
In view of the unbalanced classes in most real-world classification problems, researchers have proposed Cost-Sensitive Pruning techniques to tackle this issue. It is common to treat all classes as equally significant during the pruning process; however, this may not work well for certain applications. In the medical field, a misclassification in one direction (i.e., a false negative) can be far more costly than one in the other (a false positive). For instance, identifying a patient with cancer as healthy is more costly than identifying a healthy patient as one diagnosed with cancer [22]. Thus, Knoll, Nakhaeizadeh, and Tausend [22] proposed a Cost-Sensitive Pruning scheme that incorporates a weight for each class in the dataset; in their work, the cost matrix is supplied directly to the algorithm. To estimate the optimum ratio of misclassification costs (i.e., the cost matrix), Bradley and Lovell [23] exploited the receiver operating characteristic curve, and Ting [24] incorporated instance weighting into the C4.5 decision tree to compute the cost matrix. A minimal sketch of how such a cost matrix changes the classification decision at a leaf is given below.
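To make the role of the cost matrix concrete, the short Java sketch below (our own illustration with made-up costs, not the procedure of [22], [23], or [24]) chooses the class at a leaf by minimising the expected misclassification cost rather than by taking the most probable class.

    // Minimal sketch with made-up costs (not the procedure of [22], [23], or [24]):
    // the class predicted at a leaf is chosen by minimising the expected
    // misclassification cost instead of taking the most probable class.
    public class CostSensitiveDecision {

        // costs[actual][predicted]: missing a cancer case (actual 1, predicted 0)
        // is priced ten times higher than a false alarm (actual 0, predicted 1).
        static final double[][] COSTS = {
                {0.0, 1.0},   // actual = healthy
                {10.0, 0.0}   // actual = cancer
        };

        // Pick the class whose expected cost, weighted by the class probabilities
        // estimated at the leaf, is smallest.
        static int minCostClass(double[] classProbs) {
            int best = 0;
            double bestCost = Double.MAX_VALUE;
            for (int predicted = 0; predicted < COSTS[0].length; predicted++) {
                double expected = 0.0;
                for (int actual = 0; actual < COSTS.length; actual++) {
                    expected += classProbs[actual] * COSTS[actual][predicted];
                }
                if (expected < bestCost) {
                    bestCost = expected;
                    best = predicted;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            // A leaf with 80% "healthy" mass is still labelled class 1 ("cancer")
            // once the asymmetric costs are taken into account.
            System.out.println(minCostClass(new double[]{0.8, 0.2})); // prints 1
        }
    }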
Decision tree models exploit the concept of information gain to split the training dataset into smaller groups based on the selected features. In most cases, the leaf nodes contain only a very small subset of the training data. This scenario can lead to re-identification attacks through information linking [25], as described by Li and Sarkar [2]. Thus, Li and Sarkar [2] proposed a new pruning metric that evaluates each node’s disclosure risk. They measured the risk by counting the number of instances available in each node: a larger count implies a lower risk, whereas a smaller count suggests a higher risk of information disclosure. To prevent re-identification attacks, the authors introduced a data swapping procedure for the instances found in each node, in which majority class instances are swapped with minority class instances. To retain the classification performance, the majority class of each node keeps its superior probability after the data swapping process. A simplified sketch of this node-level risk and swapping idea is shown below.
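The following Java sketch illustrates the two ideas above in simplified form; the record type, threshold, and method names are ours and do not reproduce Li and Sarkar’s [2] exact algorithm.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch only (not Li and Sarkar's exact algorithm [2]): a leaf
    // holding few training instances is flagged as a disclosure risk, and the class
    // labels of one majority-class and one minority-class record are exchanged, so
    // an individual record can no longer be reliably linked to its true label while
    // the leaf's class distribution, and hence its prediction, stays the same.
    public class LeafDisclosureRisk {

        static final int RISK_THRESHOLD = 5; // hypothetical cut-off for "too few" instances

        record DataRecord(String id, String label) {} // requires Java 16+

        // Fewer instances reaching a leaf implies a higher re-identification risk.
        static boolean isRisky(List<DataRecord> leaf) {
            return leaf.size() < RISK_THRESHOLD;
        }

        // Exchange the labels of the first majority-class and first minority-class record.
        static void swapLabels(List<DataRecord> leaf, String majority, String minority) {
            int majIdx = -1, minIdx = -1;
            for (int i = 0; i < leaf.size(); i++) {
                if (majIdx < 0 && leaf.get(i).label().equals(majority)) majIdx = i;
                if (minIdx < 0 && leaf.get(i).label().equals(minority)) minIdx = i;
            }
            if (majIdx >= 0 && minIdx >= 0) {
                leaf.set(majIdx, new DataRecord(leaf.get(majIdx).id(), minority));
                leaf.set(minIdx, new DataRecord(leaf.get(minIdx).id(), majority));
            }
        }

        public static void main(String[] args) {
            List<DataRecord> leaf = new ArrayList<>(List.of(
                    new DataRecord("r1", "normal"), new DataRecord("r2", "normal"),
                    new DataRecord("r3", "normal"), new DataRecord("r4", "attack")));
            System.out.println(isRisky(leaf)); // true: only four instances reach this leaf
            swapLabels(leaf, "normal", "attack");
            System.out.println(leaf);          // r1 now carries "attack", r4 carries "normal"
        }
    }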
From the literature review, we observe that only a small group of researchers is working on privacy-oriented pruning, and there is room for improvement. Therefore, instead of weighting each node, we propose a new approach that prunes the tree by truncating the IP address values in the application domain of NIDS. The methodology of the proposed work is presented in the next section.
3. Methodology
In this work, truncation [26] is performed on IP addresses in three ways: 8-bit, 16-bit, and 24-bit truncation. Given the original IP address “192.168.123.121”, 8-bit truncation zeroises the last byte (8 bits) of the IP address, representing it as “192.168.123.0”; 16-bit truncation zeroises the last two bytes (16 bits), representing it as “192.168.0.0”; and 24-bit truncation zeroises the last three bytes (24 bits), representing it as “192.0.0.0”. A small code sketch of this truncation is given below. The goal of the proposed method is twofold: (1) to anonymise the real IP address and (2) to prune the original C4.5 decision tree. In this work, we adopted the C4.5 decision tree from the Weka J48 package to run the experiments. It should be noted that the default J48 decision tree, without any fine-tuning of parameters, is used in all the experiments discussed in this paper.
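As a concrete illustration, the small Java sketch below reproduces the three truncation levels described above on dotted-quad IPv4 strings; the method name and string handling are ours and do not necessarily match the implementation used in the experiments.

    // Minimal sketch of the 8/16/24-bit truncation described above.
    public class IpTruncation {

        // bitsToZero is 8, 16, or 24: the number of trailing bits set to zero.
        static String truncate(String ip, int bitsToZero) {
            String[] octets = ip.split("\\.");
            int bytesToZero = bitsToZero / 8;
            for (int i = octets.length - bytesToZero; i < octets.length; i++) {
                octets[i] = "0";
            }
            return String.join(".", octets);
        }

        public static void main(String[] args) {
            System.out.println(truncate("192.168.123.121", 8));  // 192.168.123.0
            System.out.println(truncate("192.168.123.121", 16)); // 192.168.0.0
            System.out.println(truncate("192.168.123.121", 24)); // 192.0.0.0
        }
    }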
To illustrate the impact of decision tree pruning with IP truncation, part of the ASCII J48 (the Weka implementation of C4.5) tree model built on the original 6-percent-GureKDDCup’99 dataset is shown in Figure 1. Similarly, the tree trained with identical model settings on the 24-bit truncated version of the same dataset is presented in Figure 2. We can observe that all the IP addresses in the tree model in Figure 2 have been truncated by three bytes (24 bits), spawning a new set of IP addresses such as “192.0.0.0”, “135.0.0.0”, “196.0.0.0”, etc. In both figures, the values in brackets should be read as (total number of training instances reaching the leaf, number of those training instances that are incorrectly classified) [25]. For instance, “resp_port <= 20: normal (6465.0/6.0)” in Figure 2 denotes that 6465 training instances reached the leaf node, with 6 of them classified incorrectly [27]. To give better insight into the proposed approach, the overall methodological diagram and the pseudocode for the proposed IP truncation pruning algorithm are given in Figure 3 and Figure 4, respectively; a code sketch of the same pipeline follows. Note that 10-fold cross-validation is utilised in the absence of a testing dataset.
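The sketch below outlines the same pipeline in Java using the Weka API, assuming the truncated dataset has already been written to an ARFF file (the file name “gure_24bit.arff” is hypothetical): a default J48 tree is evaluated with 10-fold cross-validation, and its structural size is reported.

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    // Sketch of the evaluation pipeline on an ARFF file whose IP attributes
    // have already been truncated (the file name is hypothetical).
    public class TruncatedTreeExperiment {

        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("gure_24bit.arff");
            data.setClassIndex(data.numAttributes() - 1); // class label is the last attribute

            J48 tree = new J48();                         // default parameters, no fine-tuning

            // 10-fold cross-validation in the absence of a separate testing set.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.printf("Accuracy: %.4f%%%n", eval.pctCorrect());

            // Structural complexity of the tree trained on the full dataset.
            tree.buildClassifier(data);
            System.out.println("Nodes:  " + tree.measureTreeSize());
            System.out.println("Leaves: " + tree.measureNumLeaves());
        }
    }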
4. Empirical Studies
The examination is conducted on the 6-percent-GureKDDCup’99 [28,29] NIDS dataset. This dataset is suitable because it contains IP addresses, to which the IP truncation method can be applied. To avoid bias or cherry-picking of features, this work intentionally omitted feature extraction, and no feature enhancement was performed on the dataset.
Hence, all 47 attributes, namely connection_number, start_time, orig_port, resp_port, orig_ip, resp_ip, duration, protocol_type, service, flag, src_bytes, dst_bytes, land, wrong_fragment, urgent, hot, num_failed_logins, logged_in, num_compromised, root_shell, su_attempted, num_root, num_file_creations, num_shells, num_access_files, num_outbound_cmds, is_hot_login, is_guest_login, count, srv_count, serror_rate, srv_serror_rate, rerror_rate, srv_rerror_rate, same_srv_rate, diff_srv_rate, srv_diff_host_rate, dst_host_count, dst_host_srv_count, dst_host_same_srv_rate, dst_host_diff_srv_rate, dst_host_same_src_port_rate, dst_host_srv_diff_host_rate, dst_host_serror_rate, dst_host_srv_serror_rate, dst_host_rerror_rate, and dst_host_srv_error_rate, are utilised in the experimental procedure. Details of the attributes can be found in the GureKDDCup’99 documentation [28,29].
As the dataset has not been segregated into training and testing distributions by its authors, the classification performance of the proposed method is evaluated using 10-fold cross-validation to provide a fair comparison between the models.
4.1. Experimental Results
To gauge the usability of a decision tree model, the tree structure (i.e., the number of nodes) and the classification accuracy attained before and after applying IP truncation are compared. The number of nodes is selected as one of the evaluation criteria since the interpretability of a decision tree greatly depends on the tree structure; for instance, a very complex decision tree with a vast number of nodes is harder to understand than a simpler tree with fewer nodes.
We present the empirical results for each model (i.e., original, 8-bit truncation, 16-bit truncation, and 24-bit truncation) in Table 1, with the key evaluation metrics: classification accuracy, true positive rate, false positive rate, true negative rate, false negative rate, the number of nodes in each tree model, and the number of leaves in each tree model. A short reminder of how the rate metrics are computed from the confusion matrix is given below.
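For clarity, the plain-Java snippet below spells out how the rate metrics in Table 1 relate to the confusion matrix counts when the attack class is treated as positive; the counts used are placeholders for illustration, not values from our experiments.

    // Reminder of how the Table 1 rate metrics relate to confusion matrix counts
    // (attack class treated as "positive"). The counts below are placeholders,
    // not values from the experiments.
    public class RateMetrics {
        public static void main(String[] args) {
            double tp = 900, fn = 100;  // attacks correctly flagged / missed
            double tn = 950, fp = 50;   // normal traffic correctly passed / falsely flagged

            double tpr = tp / (tp + fn);                  // true positive rate (detection rate)
            double fpr = fp / (fp + tn);                  // false positive rate (false alarm rate)
            double tnr = tn / (tn + fp);                  // true negative rate
            double fnr = fn / (fn + tp);                  // false negative rate
            double acc = (tp + tn) / (tp + tn + fp + fn); // classification accuracy

            System.out.printf("TPR=%.4f FPR=%.4f TNR=%.4f FNR=%.4f Acc=%.4f%n",
                    tpr, fpr, tnr, fnr, acc);
        }
    }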
4.2. Discussions and Findings
Because IP truncation conceals information, one might expect the process to reduce the accuracy of the original decision tree due to information loss. However, the experimental results show otherwise: the difference in accuracy between the original and truncated versions is remarkably small (on average, ~0.01% degradation), as tabulated in Table 1. Similarly, although the original model is superior to the other models in terms of true positive rate, false positive rate, true negative rate, and false negative rate, the difference is minute.
Interestingly, based on the results shown in Table 1, the total number of nodes in the decision tree is observed to decrease considerably after applying IP truncation (i.e., 23,096 nodes, or 98.3520% of the original tree, are pruned when 24-bit truncation is adopted). This signifies that the truncation scheme can simplify the tree structure and provide additional privacy protection without deteriorating the classification performance of the decision tree.
Akin to the observation for the number of nodes, we also notice a significant reduction in the number of leaves: the number of leaves is reduced by 21,354 (91.0696%) when 16-bit truncation is adopted and by 23,129 (98.6395%) when 24-bit truncation is applied.
To better understand the differences between the truncation methods, we also present the total numbers of distinct source and destination IP addresses in Table 1. The largest numbers of nodes and leaves are pruned when 24-bit truncation is applied. This is substantiated by the reduction in distinct IP addresses from 5900 to 99: a total of 5801 IP addresses, or 98.3220%, are removed compared with the original model when 24-bit truncation is employed.
Based on the empirical results presented in this section, it is safe to conclude that the truncation methods are suitable as a pruning strategy that reduces the complexity of the tree structure while maintaining adequate classification performance and preserving privacy at the same time.
7. Conclusions and Research Limitation
The proposed pruning method, which is based on IP truncation, offers two main advantages: (1) the IP address can be anonymised to prevent potential user profiling, and (2) the number of nodes in the C4.5 tree is greatly reduced, making rule interpretation feasible while maintaining the classification accuracy.
In most scenarios, a slight accuracy degradation is observed after applying the truncation. However, it is safe to conclude that truncation can prune the decision tree effectively without substantially affecting the performance of the tree classifier, as only a very small performance degradation is observed after truncation (refer to Table 1 and Table 5). Based on the empirical results presented in Table 5, it is also interesting to observe that the classification accuracy of the truncated model outperforms that of the original model in some situations. It can be postulated that IP truncation groups similar feature values, yielding a more generalised tree model.
However, masking the IP address for the sake of privacy also hinders interpretability when we want to investigate an incident involving an exact node (i.e., a specific IP address or device). This can be viewed as the trade-off of privacy. Based on the empirical results attained, we believe that the proposed IP truncation-based pruning can preserve utility and privacy simultaneously.