Extended Isolation Forest for Intrusion Detection in Zeek Data

Moomtaheen, Fariha; Bagui, Sikha S.; Bagui, Subhash C.; Mink, Dustin

doi:10.3390/info15070404


by Fariha Moomtaheen ¹, Sikha S. Bagui ²,*, Subhash C. Bagui ¹ and Dustin Mink ²

¹ Department of Mathematics and Statistics, University of West Florida, Pensacola, FL 32514, USA

² Department of Computer Science, University of West Florida, Pensacola, FL 32514, USA

* Author to whom correspondence should be addressed.

Information 2024, 15(7), 404; https://doi.org/10.3390/info15070404

Submission received: 18 June 2024 / Revised: 7 July 2024 / Accepted: 10 July 2024 / Published: 12 July 2024

(This article belongs to the Special Issue Intrusion Detection Systems in IoT Networks)


Abstract

The novelty of this paper is in determining and using hyperparameters to improve the Extended Isolation Forest (EIF) algorithm, a relatively new algorithm, to detect malicious activities in network traffic. The EIF algorithm is a variation of the Isolation Forest algorithm, known for its efficacy in detecting anomalies in high-dimensional data. Our research assesses the performance of the EIF model on a newly created dataset composed of Zeek Connection Logs, UWF-ZeekDataFall22. To handle the enormous volume of data involved in this research, the Hadoop Distributed File System (HDFS) is employed for efficient and fault-tolerant storage, and the Apache Spark framework, a powerful open-source Big Data analytics platform, is utilized for machine learning (ML) tasks. The best results for the EIF algorithm came from extension level 0. We obtained an accuracy of 82.3% for the Resource Development tactic, 82.21% for the Reconnaissance tactic, and 78.3% for the Discovery tactic.

Keywords:

extended isolation forest; Zeek data; intrusion detection; big data; Apache Spark; machine learning

1. Introduction

Over the past decade, the rapid growth of Internet of Things (IoT) devices has led to an exponential increase in network traffic. As the number of connected devices continues to rise across sectors such as healthcare, agriculture, and logistics, the volume of data transferred across networks is expected to keep growing. To address the challenges posed by the escalating scale of IoT data and the prevalence of cyber threats, effective monitoring and detection of malicious activities have become critical.

This research leverages Zeek, an open-source network-monitoring tool renowned for its ability to provide comprehensive raw network data, to collect a modern unique dataset, UWF-ZeekDataFall22 [1]. This dataset has been meticulously labeled using the MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) framework [2], a globally accessible knowledge base that characterizes adversary tactics and techniques used to achieve specific objectives.

Among the various adversary tactics, this work focuses on detecting three critical tactics: Reconnaissance (TA0043) [3], Discovery (TA0007) [4], and Resource Development (TA0042) [5]. The Reconnaissance tactic gathers information on vulnerabilities that can be exploited in future attacks. The Discovery tactic aims to gain deeper insights into the internal network structure. The Resource Development tactic covers acquiring the tools and infrastructure needed to support an attack.

Zeek’s Connection (Conn) Log files are instrumental in tracking and recording vital information about network connections, including IP addresses, durations, transferred bytes, states, packets, and tunnel information. By analyzing these Conn log files, we aim to identify connections that exhibit patterns associated with Resource Development, Reconnaissance, and Discovery tactics, which are indicative of potential cyber threats.

To handle the enormous volume of data involved in this research, the Hadoop Distributed File System (HDFS) [6] is employed for efficient and fault-tolerant storage. The Apache Spark framework, a powerful open-source Big Data analytics platform [7], is utilized for machine learning (ML) tasks. The novelty of this paper is in determining and using hyperparameters to improve the Extended Isolation Forest (EIF) algorithm, a relatively new algorithm, to detect malicious activities in network traffic. The EIF algorithm is a variation of the Isolation Forest algorithm, known for its efficacy in detecting anomalies in high-dimensional data. Our research assesses the performance of the EIF model on the UWF-ZeekDataFall22 dataset [1].

The rest of the paper is organized as follows. Section 2 presents the related works; Section 3 presents the background information, that is, a description of the anomaly as well as anomaly score, and the Isolation Forest and Extended Isolation Forest algorithms; Section 4 describes the dataset, a relatively new dataset; Section 5 presents the methodology; Section 6 presents the results; Section 7 presents the conclusion; Section 8 presents the future works.

2. Related Works

Research on the Extended Isolation Forest (EIF) algorithm is relatively new but has shown promise in various domains. The Isolation Forest algorithm has demonstrated effectiveness in detecting anomalies in high-dimensional datasets. Liu et al. (2008) [8] proposed the Isolation Forest approach for intrusion detection, showing its advantages in handling large-scale data with high-dimensional features. Chen et al. (2019) [9] extended the Isolation Forest to address the challenge of detecting time series anomalies, achieving remarkable results in identifying abnormal patterns in temporal data.

Sharma et al. (2022) [10] proposed an extension to the Isolation Forest algorithm in the form of the Extended Isolation Forest (EIF) for detecting advanced persistent threats (APTs) in enterprise networks. By leveraging the power of EIF, the authors achieved remarkable accuracy in identifying complex attack patterns in large-scale network data.

Li et al. (2017) [11] presented the Extended Isolation Forest (EIF) method as an enhancement to the traditional Isolation Forest algorithm. The EIF algorithm demonstrated robustness and scalability in detecting intrusions in computer networks, making it a promising candidate for real-time network security applications.

Fan et al. (2021) [12] proposed an improved version of the Isolation Forest algorithm coupled with the Self-Organizing Map (SOM) clustering technique for anomaly detection in network security. The study showcased the effectiveness of the enhanced Isolation Forest in accurately identifying network intrusions.

Zhou et al. (2019) [13] applied the Extended Isolation Forest algorithm to detect intrusions in industrial control systems. The study demonstrated the efficacy of EIF in detecting anomalous behavior and potential cyber threats in critical industrial networks.

Thangaraj et al. (2020) [14] proposed an enhanced version of the Extended Isolation Forest tailored for intrusion detection in software-defined networks (SDNs). This study highlighted the effectiveness of the enhanced EIF model in accurately detecting intrusions in SDN environments.

Huang et al. (2019) [15] developed an efficient Random Forest Extended Isolation (RF-EIF) algorithm for anomaly detection. By combining the power of the Random Forest ensemble with the EIF method, the authors achieved improved accuracy in detecting anomalies in diverse network datasets.

Recent advancements have continued to push the boundaries of anomaly detection using variations of the Isolation Forest. Liu et al. (2024) [16] introduced the Layered Isolation Forest, a multi-level subspace algorithm designed to improve the original Isolation Forest’s ability to handle local outliers and enhance anomaly detection performance. This method maintains the efficiency of the original algorithm while achieving superior performance metrics on both synthetic and real-world datasets.

Similarly, Wu et al. (2024) [17] demonstrated the application of the Extended Isolation Forest in the fault diagnosis of avionics equipment. Their study utilized a combination of feature selection and EIF to detect and categorize faults in electronic modules, highlighting the practical engineering value and effectiveness of EIF in real-world applications.

Gaps in the Literature

Despite these advancements, specific gaps remain in the current literature:

Optimization of Isolation Level: While many studies have demonstrated the efficacy of EIF in various contexts, there is a lack of comprehensive research focusing on the optimization of isolation levels within the algorithm. Isolation levels are crucial as they directly influence the algorithm’s ability to accurately detect anomalies. Our research addresses this gap by systematically exploring and determining the optimal isolation levels for the EIF algorithm, thus improving its performance in detecting malicious activities in network traffic.
Real-Time Application in High-Dimensional, Large-Scale Datasets: Although some studies have applied EIF to high-dimensional datasets, the focus on real-time application in large-scale network traffic data remains limited. Our research leverages the Apache Spark framework and Hadoop Distributed File System (HDFS) to handle and analyze extensive network data in real time, demonstrating the practicality and scalability of EIF in such contexts.
Specific Focus on Network Traffic Anomalies: Previous works have often focused on broader applications of EIF without a concentrated effort on network traffic anomalies, particularly those associated with IoT devices. Our study uniquely targets this area by using the UWF-ZeekDataFall22 dataset and focusing on critical adversary tactics, such as Reconnaissance, Discovery, and Resource Development, as defined by the MITRE ATT&CK framework.

By addressing these gaps, our research contributes to the enhancement of the EIF algorithm, making it a more robust tool for intrusion detection and anomaly detection in network traffic.

3. Background

3.1. What Is an Anomaly

An anomaly or outlier refers to any data point or observation that notably differs from the rest of the data. Anomaly detection plays a crucial role and finds practical applications across different fields, such as identifying fraudulent bank transactions, detecting network intrusions, spotting sudden fluctuations in sales, and detecting changes in customer behavior, among others [18,19]. Numerous methods have been devised to identify anomalies in data. We focus on the implementation of Isolation Forests, an unsupervised anomaly detection technique [8].

3.2. Anomaly Score

The Isolation Forest outputs anomaly scores. When a point travels through a tree, the length of its path indicates how unusual it is: a point that travels deep into the tree is hard to isolate and is therefore likely normal, whereas a point with a short path may be an anomaly [8]. In Figure 1, the red path may be anomalous, while the blue path would be a normal path.

When the point is run through multiple trees in the forest, the combined length of its path can give an anomaly score. If the score is closer to 1, the point is an anomaly, but if it is less than 0.5, it is a normal point [20]. On the other hand, if all points cluster around 0.5, that dataset may not have distinct anomalous points [20].
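The thresholds above follow from how the score is normalized. As a minimal sketch, assuming the scoring formula from the original Isolation Forest paper [8], s(x, n) = 2^(−E[h(x)] / c(n)), where E[h(x)] is the mean path length over the trees and c(n) is the average path length of an unsuccessful binary search tree lookup over n points, the mapping can be written in Python as follows. The formula is not stated in this text, and the function names and the special case for n = 2 are illustrative assumptions.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0  # assumed convention for two points
    harmonic = math.log(n - 1) + EULER_GAMMA  # H(n - 1) is approximated by ln(n - 1) + gamma
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """Map the mean path length over all trees to a score in (0, 1]."""
    return 2.0 ** (-mean_path_length / c(n))


if __name__ == "__main__":
    n = 256  # sub-sampling size
    for depth in (2.0, c(n), 20.0):
        print(f"mean path {depth:.2f} -> score {anomaly_score(depth, n):.3f}")
```

A mean path length equal to c(n) gives a score of exactly 0.5, shorter paths push the score toward 1, and longer paths push it below 0.5.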

3.3. Isolation Forest

Isolation Forests (IFs) are unsupervised models used to detect anomalies in data. They are similar to Random Forests in that they build an ensemble of trees to identify unusual data points. One advantage of IF is that it does not rely on building a profile for the data, making it computationally efficient [8].

However, IF has a bias in how the trees are branched, which can lead to uneven anomaly scores. This inconsistency can cause false positive results and suggest patterns that do not actually exist in the data [20].

In Figure 2a, we have a normally distributed 2D dataset. A data point close to (0, 0) should be nominal, and the anomaly score should increase radially away from this point. From the score map in Figure 2b, we see that there are rectangular regions along the x and y axes where anomaly scores are lower, and the score is not equally increasing in a circular way as we expect [20].

Figure 3 has a dataset with two clusters, and the score map creates ghost clusters alongside the real ones. Similarly, in a dataset with a sinusoidal structure (Figure 4), the score map completely fails to capture the hills and valleys in the data distribution.

3.3.1. Algorithm

The hyperparameters of the IF model are as follows:

t = number of trees

ψ = subsampling size

The algorithm is split into two stages. First is the training stage where the forest is created. The second stage is the evaluation stage, which puts a given point into each tree and provides an average path length of the point, as shown in Figure 5 [20].

Training the forest has complexity O(tψ log ψ), and evaluating n points has complexity O(nt log ψ) [8,20].

Training Stage

As shown in Algorithm 1, the training stage performs sub-sampling and builds an ensemble of isolation trees. Each tree’s height is limited by its ceiling, which is approximately the average height of a binary search tree (BST) for the size of the given data. The algorithm for training is separated into two functions. Recursion is used in Algorithm 2 for building the isolation trees. The output of the training stage is an Isolation Forest prepared for the scoring of each given point [20].

Algorithm 1. iForest(X, t, ψ)

Require: X (input data), t (number of trees), ψ (sub-sampling size)

Ensure: a set of t iTrees

1: Initialize Forest

2: set height limit l = ceiling(log₂ ψ)

3: for i = 1 to t do

4: X′ ← sample(X, ψ)

5: Forest ← Forest ∪ iTree(X′, 0, l)

6: end for

7: return Forest

Algorithm 2. iTree(X, e, l)

Require: X—input data, e—current tree height, l—height limit

Ensure: an iTree

1: if e ≥ l or |X| ≤ 1 then

2: return exNode{Size ← |X|}

3: else

4: let Q be a list of attributes in X

5: randomly select an attribute q ∊ Q

6: randomly select a split point p between the min and max values of attribute q in X

7: X_l ← filter(X, q ≤ p)

8: X_r ← filter(X, q > p)

9: return inNode{Left ← iTree(X_l, e + 1, l),
Right ← iTree(X_r, e + 1, l),
SplitAtt ← q,
SplitValue ← p}

10: end if

Evaluation Stage

The output of the evaluation stage is the path length of a given point. The average path length over the trees in the Isolation Forest is computed and passed to the anomaly score formula. In Algorithm 3, the term c(T.size) estimates the remaining path length at external nodes where the tree stopped growing before the points were isolated [20].

Algorithm 3. PathLength(\vec{x}, T, e)

Require: \vec{x} (an instance), T (an iTree), e (current path length, initialized to zero when first called)

Ensure: path length of \vec{x}

1: if T is an external node then

2: return e + c(T.size)

3: end if

4: a ← T.splitAtt

5: if x_a ≤ T.splitValue then

6: return PathLength(\vec{x}, T.left, e + 1)

7: else {x_a > T.splitValue}

8: return PathLength(\vec{x}, T.right, e + 1)

9: end if
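Algorithms 1 to 3 can be sketched together in a few lines of Python. This is a minimal illustration using only the standard library, not the sklearn implementation used in this work. The dictionary-based node layout, the function names, and the toy Gaussian data are assumptions made for the example, and c(n) uses the normalization from the original Isolation Forest paper [8].

```python
import math
import random


def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0  # assumed convention for two points
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def i_tree(X, e, limit, rng):
    """Algorithm 2: split X recursively on a random attribute and value."""
    if e >= limit or len(X) <= 1:
        return {"size": len(X)}  # external node
    q = rng.randrange(len(X[0]))  # random attribute index
    values = [row[q] for row in X]
    p = rng.uniform(min(values), max(values))  # random split point
    left = [row for row in X if row[q] <= p]
    right = [row for row in X if row[q] > p]
    return {"att": q, "value": p,
            "left": i_tree(left, e + 1, limit, rng),
            "right": i_tree(right, e + 1, limit, rng)}


def i_forest(X, t, psi, rng):
    """Algorithm 1: build t trees, each on a sub-sample of size psi."""
    limit = math.ceil(math.log2(psi))  # height limit l
    forest = []
    for _ in range(t):
        sample = rng.sample(X, min(psi, len(X)))
        forest.append(i_tree(sample, 0, limit, rng))
    return forest


def path_length(x, T, e=0):
    """Algorithm 3: depth reached by x plus the c(size) adjustment."""
    if "size" in T:
        return e + c(T["size"])
    if x[T["att"]] <= T["value"]:
        return path_length(x, T["left"], e + 1)
    return path_length(x, T["right"], e + 1)


def mean_path_length(x, forest):
    """Average the path length of x over every tree in the forest."""
    return sum(path_length(x, T) for T in forest) / len(forest)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
    forest = i_forest(data, t=100, psi=64, rng=rng)
    print("center :", mean_path_length((0.0, 0.0), forest))
    print("outlier:", mean_path_length((8.0, 8.0), forest))
```

On this toy data, a point far from the cluster should reach an external node sooner than a point at the center, which is the behavior the anomaly score relies on.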

3.4. Extended Isolation Forest

To overcome the biases and limitations of the traditional Isolation Forest, the Extended Isolation Forest (EIF) algorithm has been developed. The EIF algorithm introduces modifications that allow for random slopes in the branch cuts, making the scoring more reliable and reducing the impact of artifacts [21].

The EIF algorithm is particularly significant in the context of network traffic analysis for several reasons:

Improved Detection of Complex Anomalies: EIF can better isolate outliers in high-dimensional data, which is common in network traffic. This is because the random slopes in the cuts allow the algorithm to adapt to complex structures in the data, making it more effective at identifying subtle anomalies that might be missed by traditional methods [21].
Reduction of False Positives: By mitigating the bias inherent in the branching process of traditional Isolation Forests, EIF reduces the occurrence of false positives. This is crucial in network traffic analysis where high false positive rates can lead to unnecessary alerts and increased workload for security analysts [21].
Scalability and Efficiency: Like the traditional Isolation Forest, EIF is computationally efficient and scalable. This makes it suitable for real-time intrusion detection systems that need to process large volumes of network traffic quickly [21].

By addressing these challenges, EIF enhances the reliability and accuracy of anomaly detection in network traffic, providing a more robust tool for cybersecurity applications [21].

3.4.1. Branching in Extended Isolation Forest

There is no fundamental reason to keep the branch cuts parallel to the axes. So, instead of picking a feature and value at every branching point, the Extended Isolation Forest picks a random slope and intercept for the branch cut [21].

Suppose we have an N-dimensional dataset. For the random slope requirement, we can choose random numbers for each coordinate of a normal vector over the N-sphere. For the random intercept, we can pick a random number from a uniform distribution over the range of values present. So, the algorithm transforms into the following two tasks shown in Figure 6 [21].

Picking \vec{n}: Draw a random number for each coordinate of \vec{n} from the standard normal distribution \mathcal{N}(0, 1).

Picking \vec{p}: Draw from a uniform distribution over the range of values present at each branching point.

Once these two pieces of information are determined, the branching criterion for a given point \vec{x} is as follows:

(\vec{x} - \vec{p}) \cdot \vec{n} \leq 0

(1)

If the condition is satisfied, the data point \vec{x} is passed to the left branch; otherwise, it moves down to the right branch [21].
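As a minimal sketch of this branching rule, the following Python functions draw a random normal vector and intercept and then apply the test in Equation (1). The names random_cut, goes_left, and split are illustrative assumptions, and the intercept is drawn per coordinate from the range of the points that reach the node.

```python
import random


def random_cut(X, rng):
    """Draw a normal vector n and an intercept point p for one branch cut."""
    dim = len(X[0])
    n = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # each coordinate from N(0, 1)
    p = [rng.uniform(min(row[j] for row in X), max(row[j] for row in X))
         for j in range(dim)]  # uniform over the range present at this node
    return n, p


def goes_left(x, n, p):
    """Equation (1): the point goes left when (x - p) . n <= 0."""
    return sum((xi - pi) * ni for xi, pi, ni in zip(x, p, n)) <= 0.0


def split(X, n, p):
    """Partition X into the left and right branches of the cut."""
    left = [x for x in X if goes_left(x, n, p)]
    right = [x for x in X if not goes_left(x, n, p)]
    return left, right


if __name__ == "__main__":
    rng = random.Random(42)
    points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(10)]
    n, p = random_cut(points, rng)
    left, right = split(points, n, p)
    print(len(left), "points go left,", len(right), "go right")
```

With n = (1, 0), the test reduces to comparing the first coordinate with the first coordinate of the intercept, which is the axis-parallel cut of the standard Isolation Forest.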

3.4.2. Extension Levels

The algorithm easily adapts to higher dimensions. In this scenario, the branch cuts are no longer straight lines; instead, they become (N − 1)-dimensional hyperplanes [21].

For an N-dimensional dataset, we can consider N levels of extension. As we increase the extension levels, the algorithm’s bias in producing a non-uniform score map is reduced. The lowest level of extension in the Extended Isolation Forest coincides with the standard Isolation Forest [21].

Having multiple extension levels can be beneficial when the dynamic range of the data in different dimensions varies significantly. Reducing the extension level helps in selecting more appropriate split hyperplanes and reduces the computational overhead. For example, if we have three-dimensional data with a much smaller range in two dimensions compared to the third (essentially distributed along a line), using the standard Isolation Forest might yield the most optimal result [21].

3.5. Our Extended Isolation Forest Algorithm

The adjustments made to our Extended Isolation Forest algorithm are explained below.

3.5.1. Training Stage

The forest is created from trees, as shown in Algorithm 1. In Algorithm 2, the two lines that pick a random feature and a random value for that feature are updated with lines 4 and 5. In addition, the test condition to reflect inequality is also changed. Line 6 is a new addition that allows the extension level to change. With these changes, the algorithm can be used as either the standard Isolation Forest or as the Extended Isolation Forest with any desired extension level [21].

3.5.2. Evaluation Stage

In Algorithm 3, the changes are made accordingly. The normal and intercept points from each tree are used with the appropriate test condition to set off the recursion for figuring out the path length [21].

By using the Extended Isolation Forest, we can achieve more accurate anomaly detection and better interpret the results for complex data distributions. This makes it a valuable tool for various applications including fraud detection, network intrusion detection, and more [21].
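The training and evaluation changes described in Sections 3.5.1 and 3.5.2 can be sketched as follows. This is an illustration rather than the code used in this work. It assumes the convention that extension level k leaves k + 1 coordinates of the normal vector nonzero (so level 0 gives axis-parallel cuts), and the node layout, the function names, and the c(n) normalization from [8] are likewise assumptions.

```python
import math
import random


def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0  # assumed convention for two points
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def goes_left(x, n, p):
    """Branching test: (x - p) . n <= 0."""
    return sum((xi - pi) * ni for xi, pi, ni in zip(x, p, n)) <= 0.0


def eif_tree(X, e, limit, ext_level, rng):
    """Build one extended isolation tree with the given extension level."""
    if e >= limit or len(X) <= 1:
        return {"size": len(X)}  # external node
    dim = len(X[0])
    n = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Zero out dim - ext_level - 1 coordinates; ext_level + 1 stay active.
    for j in rng.sample(range(dim), dim - ext_level - 1):
        n[j] = 0.0
    p = [rng.uniform(min(row[j] for row in X), max(row[j] for row in X))
         for j in range(dim)]
    left = [x for x in X if goes_left(x, n, p)]
    right = [x for x in X if not goes_left(x, n, p)]
    return {"n": n, "p": p,
            "left": eif_tree(left, e + 1, limit, ext_level, rng),
            "right": eif_tree(right, e + 1, limit, ext_level, rng)}


def eif_forest(X, t, psi, ext_level, rng):
    """Build t extended trees, each on a sub-sample of size psi."""
    limit = math.ceil(math.log2(psi))
    return [eif_tree(rng.sample(X, min(psi, len(X))), 0, limit, ext_level, rng)
            for _ in range(t)]


def eif_path_length(x, T, e=0):
    """Follow x down the tree using the stored normal and intercept."""
    if "size" in T:
        return e + c(T["size"])
    branch = "left" if goes_left(x, T["n"], T["p"]) else "right"
    return eif_path_length(x, T[branch], e + 1)


if __name__ == "__main__":
    rng = random.Random(7)
    data = [tuple(rng.gauss(0, 1) for _ in range(3)) for _ in range(300)]
    for level in (0, 1, 2):  # 3-dimensional data allows levels 0 to 2
        forest = eif_forest(data, t=50, psi=64, ext_level=level, rng=rng)
        mean = sum(eif_path_length((6.0, 6.0, 6.0), T) for T in forest) / len(forest)
        print(f"extension level {level}: mean path length {mean:.2f}")
```

Exposing the extension level as a single parameter is what lets the same code act as either the standard Isolation Forest (level 0) or the fully extended version (level N − 1).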

4. The Data: UWF-ZeekDataFall22

The UWF-ZeekDataFall22 dataset contains three crucial tactics/attacks, labeled as per the MITRE ATT&CK framework: Reconnaissance (Tactic: T1590) [3], Resource Development (Tactic: T1589) [5], and Discovery (Tactic: T1087) [4].

Reconnaissance (T1590): This initial phase involves gathering information about potential targets through activities like OSINT, vulnerability scanning, and probing for weaknesses. Analyzing reconnaissance data provides early warning signs of cyber threats and helps fortify defenses.
Resource Development (T1589): In this stage, attackers acquire tools, techniques, and infrastructure required for the attack, such as custom malware and command-and-control (C2) infrastructure. Analyzing resource development data reveals the types of tools and methods used by attackers, aiding in identifying potential threat vectors.
Discovery (T1087): After the initial compromise, attackers explore the target environment to understand its layout and locate sensitive information. Activities in this stage include system enumeration, scanning for network shares, and probing for vulnerable services. Analyzing discovery data detects unauthorized access attempts and lateral movement within the network.

4.1. Data Description

Table 1 shows us the number of instances of each tactic available in the UWF-ZeekDataFall22 dataset.

This work only uses the data from Resource Development, Reconnaissance, and Discovery tactics since there are not enough data for the rest of the tactics. The “none” tactic indicates benign data. Since binary classification is being performed, datasets were created for each of the three tactics, and 70% benign data was combined with 30% tactic data. The following is the distribution of the datasets used for this experiment:

Resource Development: 874,971 total records (612,585 benign; 262,386 attack)

Reconnaissance: 13,243 total records (9238 benign; 4005 attack)

Discovery: 3305 total records (2305 benign; 1000 attack)
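As a quick arithmetic check on the distribution above, the short Python sketch below recomputes each total and the benign and attack shares from the listed counts. The helper name class_shares and the DATASETS dictionary are illustrative.

```python
def class_shares(benign, attack):
    """Return the total record count and the benign and attack fractions."""
    total = benign + attack
    return total, benign / total, attack / total


# Counts as listed for the three per-tactic datasets: (benign, attack).
DATASETS = {
    "Resource Development": (612_585, 262_386),
    "Reconnaissance": (9_238, 4_005),
    "Discovery": (2_305, 1_000),
}

if __name__ == "__main__":
    for name, (benign, attack) in DATASETS.items():
        total, benign_share, attack_share = class_shares(benign, attack)
        print(f"{name}: {total:,} records, "
              f"{benign_share:.1%} benign, {attack_share:.1%} attack")
```

Each dataset comes out within about half a percentage point of the intended 70% benign and 30% attack split.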

4.2. Preprocessing

To effectively implement the Extended Isolation Forest, the dataset needed to be preprocessed. Since the dataset contains columns with different types of values, such as continuous, nominal, IP address, port number, etc., the first preprocessing step that was taken was binning, in line with [22]. After the attributes were binned, an information gain [23] technique was applied to rank the features according to their importance. Table 2 lists the features according to their importance score, received using the information gain calculations.
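To make the ranking step concrete, the sketch below bins a continuous feature and computes its information gain with respect to a binary label. It is an illustration under stated assumptions: equal-width binning is used here for simplicity and may differ from the binning scheme of [22], and the function names and toy values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((k / total) * math.log2(k / total)
                for k in Counter(labels).values())


def equal_width_bin(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (assumed scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width for constant features
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def information_gain(binned, labels):
    """Entropy of the labels minus the entropy remaining within each bin."""
    total = len(labels)
    groups = {}
    for b, y in zip(binned, labels):
        groups.setdefault(b, []).append(y)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


if __name__ == "__main__":
    labels = [0, 0, 0, 0, 1, 1, 1, 1]  # toy labels: 0 = benign, 1 = attack
    duration = [1, 2, 3, 4, 10, 11, 12, 13]  # hypothetical feature values
    print(information_gain(equal_width_bin(duration, 2), labels))
```

Features would then be sorted by this score in descending order to produce a ranking of the kind shown in Table 2.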

5. Methodology

Figure 7 presents the flowchart of the methodology used for this work.

5.1. Libraries

Python’s sklearn library was used for the standard Isolation Forest implementation, and sklearn’s grid search utility was used for the parameter tuning.

No established library was available for the Extended IF implementation, as it is a relatively new technique and still in its experimental stage. We used the GitHub repository provided by the original EIF authors [21] to obtain the anomaly scores. However, this implementation offers no support for hyperparameter tuning.

We also experimented with the EIF implementation in the H2O library, which exposes more of the parameters for tuning.

5.2. Hyperparameter Tuning

A grid search was performed to find the best number of trees and subsample size for the standard IF implementation. The best values obtained were as follows:

n_estimators: 100

max_samples: 256

For the Extended IF implementation, the number of trees and the sample size were varied manually, since no tuning library was available and the original implementation lacks this functionality. The values tested were n_trees [120, 200, 500, 700, 1000] and sample_size [64, 128, 256, 512] for extension level 0. The best results came from the following values:

n_trees: 1000

sample_size: 256
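The manual search over these values amounts to an exhaustive loop over the grid. A minimal sketch follows, in which evaluate is a hypothetical callback that would train an EIF at extension level 0 with the given settings and return the metric to maximize. The toy scorer in the demo is only a stand-in so the loop can be run; it does not reproduce any result from this work.

```python
from itertools import product

N_TREES = [120, 200, 500, 700, 1000]
SAMPLE_SIZES = [64, 128, 256, 512]


def grid_search(evaluate, n_trees_grid=N_TREES, sample_size_grid=SAMPLE_SIZES):
    """Try every (n_trees, sample_size) pair and keep the best-scoring one."""
    best = None
    for n_trees, sample_size in product(n_trees_grid, sample_size_grid):
        score = evaluate(n_trees, sample_size)  # hypothetical train-and-score step
        if best is None or score > best[0]:
            best = (score, {"n_trees": n_trees, "sample_size": sample_size})
    return best


if __name__ == "__main__":
    # Stand-in scorer for demonstration only; a real run would fit the model.
    def toy_score(n_trees, sample_size):
        return n_trees - abs(sample_size - 256)

    print(grid_search(toy_score))
```

The grid above contains 5 × 4 = 20 combinations, so each tactic requires 20 training runs at a fixed extension level.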

5.3. Varying Extension Level

Extended Isolation Forest allows for the changing of the levels of extension. Since there are 18 attributes in the dataset, the levels can be incremented to 17. First, the algorithm was run setting the extension level to 0, which is the same as the standard Isolation Forest. Then, the algorithm was run by incrementing the level by 1 each time and the confusion matrices were generated each time. So, for each tactic, the Isolation Forest algorithm was run once, and the Extended Isolation Forest was run 18 (ext 0–17) times. After the successful completion of each level, we found that in most cases, the results remained the same for the different extension levels.

6. Results

This section presents the performance metrics used to assess the results as well as the results.

6.1. Performance Metrics

The following performance metrics were used to assess the results of the Isolation Forest as well as the Extended Isolation Forest: accuracy, precision, recall, F1 score, and specificity.

6.1.1. Accuracy

Accuracy is a commonly used metric to assess the overall performance of a predictive model. It measures the proportion of correct predictions made by the model among all predictions. Accuracy takes into account both positive and negative classes and provides a comprehensive view of the model’s correctness.

\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{True Negatives} + \text{False Positives} + \text{False Negatives}}

(2)

However, it is important to note that accuracy might not be the best choice when dealing with imbalanced datasets where one class is significantly more prevalent than the other. In such cases, a high accuracy score can be misleading, as the model might be performing well on the majority class while performing poorly on the minority class.

6.1.2. Precision (Positive Predictive Value)

Precision is a metric that focuses on the accuracy of positive predictions made by the model. It quantifies the proportion of correctly predicted positive instances out of all instances that the model predicts as positive. The formula for precision involves dividing the number of true positives by the sum of true positives and false positives.

\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}

(3)

Precision is particularly useful when the cost of false positives is high, meaning that false alarms are costly or undesirable. In medical diagnosis, for example, precision is crucial to minimize misdiagnoses.

6.1.3. Recall (Sensitivity, True Positive Rate (TPR))

Recall, also known as sensitivity or the true positive rate, emphasizes the model’s ability to correctly identify positive instances. It calculates the proportion of true positives out of all actual positive instances. The formula for recall involves dividing the number of true positives by the sum of true positives and false negatives.

\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

(4)

Recall is vital in situations where the cost of false negatives is high, such as detecting fraudulent activities. Missing positive instances in such cases can have significant consequences.

6.1.4. F1 Score

The F1 score is a combined metric that balances precision and recall. It is the harmonic mean of precision and recall, providing a single value that considers both false positives and false negatives. The F1 score is useful when there is a trade-off between precision and recall, and you want a balanced assessment of the model’s performance.

\text{F1} = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}

(5)

In scenarios where class distribution is imbalanced, and it is important to ensure that the model performs well on both positive and negative classes, the F1 score is a valuable metric.

6.1.5. Specificity (True Negative Rate (TNR))

Specificity, also known as the true negative rate, measures the model’s ability to correctly identify negative instances. It calculates the proportion of true negatives out of all actual negative instances. The formula for specificity involves dividing the number of true negatives by the sum of true negatives and false positives.

\text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}

(6)

Specificity is particularly relevant when the cost of false positives is high, and it is important to prioritize correctly identifying true negative cases.

There is no one-size-fits-all answer to which metric is the best to measure performance. The selection of the most appropriate metric depends on the specific goals of the task, the nature of the data, and the relative importance of different types of errors. It is essential to choose the metric that aligns with the objectives and priorities of your machine learning project.
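Equations (2) to (6) can all be computed from the four cells of a confusion matrix. The Python sketch below does so; the function name and the counts in the demo are hypothetical and are not taken from the result tables.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Equations (2)-(6) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    for name, value in classification_metrics(tp=80, tn=90, fp=10, fn=20).items():
        print(f"{name}: {value:.4f}")
```

The guards return 0.0 when a denominator is zero, for example when the model predicts no positives at all; that is a design choice of this sketch rather than part of the definitions.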

6.2. Results

This section presents the results, that is, the accuracy, precision, recall, F1 score, and specificity, of the Isolation Forest and Extended Isolation Forest implementation on Resource Development, Reconnaissance, and Discovery. The best results are highlighted in green.

6.2.1. Resource Development

As can be seen from Table 3, the best scores for Resource Development were attained by EIF-0, highlighted in green. The EIF-2–17 row reports the average over extension levels 2 to 17.

6.2.2. Reconnaissance

Table 4 shows that the best scores for Reconnaissance were also attained by EIF-0, though EIF-1 and -2 results were close to EIF-0. The best results are highlighted in green.

6.2.3. Discovery

Table 5 shows that the best scores for Discovery were attained by EIF-0 and EIF-1, highlighted in green.

7. Conclusions

Experimentation was carried out with all possible extension levels, that is, 0 through 17. The best results across all metrics came from extension level 0. This level is, in fact, the standard Isolation Forest algorithm, with no slopes in the branch cuts for any dimension. However, the standard Isolation Forest implementation did not perform as well as the Extended Isolation Forest implementation at extension level 0. Even though the algorithm is the same, the Extended Isolation Forest implementation performed better than the standard one.

The reason behind the lower scores for higher extension levels can be the varied ranges of dimensions in the dataset. The features we have for the data are spread over various ranges of values with no correlation to each other. Multiple levels of extension can be useful where the dynamic range of the data in various dimensions is very different. In such cases, reducing the extension level can help in the more appropriate selection of split hyperplanes and in reducing the computational overhead. As an extreme case, if we had three-dimensional data, but the range in two of the dimensions was much smaller compared to the third (essentially data distributed along a line), the standard Isolation Forest method would probably yield the most optimal result.

8. Future Work

This work is on an Extended Isolation Forest implementation for network data to detect anomalies. This algorithm is implemented on top of the original algorithm, but the results do not reflect what was expected from the higher extension levels. The next step will be to build an implementation of the EIF algorithm using other techniques described in the related works section and compare the results.

Author Contributions

Conceptualization, S.C.B.; methodology S.C.B., S.S.B. and F.M.; software, F.M.; validation, S.C.B., S.S.B. and F.M.; formal analysis, F.M.; investigation, F.M.; resources, D.M.; data curation, D.M. and F.M.; writing—original draft preparation, F.M.; writing—review and editing, S.C.B., S.S.B. and D.M.; visualization, F.M.; supervision, S.C.B. and S.S.B.; project administration, S.C.B. and S.S.B.; funding acquisition, S.C.B., S.S.B. and D.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Centers of Academic Excellence in Cybersecurity, NCAE-C-002: Cyber Research Innovation Grant Program, Grant Number: H98230-21-1-0170.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets are available at https://datasets.uwf.edu (accessed on 20 August 2023).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

University of West Florida. UWF-ZeekData22. Available online: https://datasets.uwf.edu/ (accessed on 2 August 2023).
Trellix. What Is the MITRE ATT&CK Framework? Available online: https://www.trellix.com/en-us/security-awareness/cybersecurity/what-is-mitre-attack-framework.html (accessed on 1 October 2023).
Reconnaissance, Tactic TA0043—Enterprise|MITRE ATT&CK®. Available online: https://attack.mitre.org/tactics/TA0043/ (accessed on 8 October 2023).
Discovery, Tactic TA0007—Enterprise|MITRE ATT&CK®. Available online: https://attack.mitre.org/tactics/TA0007/ (accessed on 8 October 2023).
Resource Development, Tactic TA0042—Enterprise|MITRE ATT&CK®. Available online: https://attack.mitre.org/tactics/TA0042/ (accessed on 8 October 2023).
Guller, M. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large Scale Data Analysis; Apress: New York, NY, USA, 2015.
Configuration—Spark 3.3.0 Documentation. Available online: https://spark.apache.org/docs/3.3.0/configuration.html (accessed on 20 September 2023).
Liu, F.T.; Ting, K.M.; Zhou, Z.-H. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008.
Chen, T.; Ren, K.; Wu, S. Time Series Anomaly Detection with Isolation Forest. In Proceedings of the IEEE International Conference on Big Data, Los Angeles, CA, USA, 9–12 December 2019.
Sharma, A.; Madhav, N.; Sharma, S.K.; Chen, Y. Extended Isolation Forest for Advanced Persistent Threat Detection. In Proceedings of the IEEE Symposium on Security and Privacy Workshops (SPW), San Francisco, CA, USA, 22–26 May 2022.
Li, S.; Zhang, J.; Liu, Y.; Jiang, X. Extended Isolation Forest. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Halifax, NS, Canada, 13–17 August 2017.
Fan, S.; Qin, Y.; Yao, J. An Anomaly Detection Method for Network Security Based on Improved Isolation Forest and SOM Clustering Algorithm. IEEE Access 2021, 9, 13944–13954.
Zhou, L.; Ding, B.; Xiong, W.; Zhu, Y. An Intrusion Detection Method Based on Extended Isolation Forest Algorithm in Industrial Control Systems. In Proceedings of the 14th IEEE Conference on Industrial Electronics and Applications (ICIEA), Xi’an, China, 19–21 June 2019.
Thangaraj, P.; Gopalakrishnan, S.; Letchumanan, K. Enhanced Extended Isolation Forest for Intrusion Detection in Software Defined Networks. Procedia Comput. Sci. 2020, 173, 1750–1756.
Huang, M.; Wu, S.; Chen, T. An Efficient Random Forest Extended Isolation Algorithm for Anomaly Detection. IEEE Access 2019, 7, 1127244–1127256.
Liu, T.; Zhou, Z.; Yang, L. Layered isolation forest: A multi-level subspace algorithm for improving isolation forest. J. Big Data 2024, 11, 34.
Wu, Z.; Niu, W.; Zhao, Y.; Fan, H. Application of extended isolation forest in avionics equipment fault diagnosis. IEEE Trans. Aerosp. Electron. Syst. 2024, 60, 78–90.
Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. CSUR 2009, 41, 1–58.
Hodge, V.J.; Austin, J. A survey of outlier detection methodologies. Artif. Intell. Rev. 2004, 22, 85–126.
Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation-based anomaly detection. ACM Trans. Knowl. Discov. Data TKDD 2012, 6, 1–39.
Hariri, S.; Kind, M.; Brunner, R.J. Extended Isolation Forest. In Proceedings of the 2019 IEEE International Conference on Data Mining (ICDM), Beijing, China, 8–11 November 2019; pp. 398–406.
Bagui, S.; Mink, D.; Bagui, S.; Ghosh, T.; McElroy, T.; Paredes, E.; Khasnavis, N.; Plenkers, R. Detecting Reconnaissance and Discovery Tactics from the MITRE ATT&CK Framework in Zeek Conn Logs Using Spark’s Machine Learning in the Big Data Framework. Sensors 2022, 22, 7999.
Han, J.; Pei, J.; Tong, H. Data Mining: Concepts and Techniques, 3rd ed.; Morgan Kaufmann: Burlington, MA, USA, 2022.

Figure 1. Anomalous vs. normal point.

Figure 2. Normally distributed cluster 1 dataset: (a) Normally distributed data; (b) anomaly score map.

Figure 3. Normally distributed cluster 2 dataset: (a) Two normally distributed clusters; (b) anomaly score map.

Figure 4. Dataset with sinusoidal structure: (a) Sinusoidal data points with Gaussian noise; (b) anomaly score map.

Figure 5. Stages of Isolation Forest Algorithm.

Figure 6. Branching implementation in Extended Isolation Forest.

Figure 7. Flowchart of methodology.

Table 1. Tactics in UWF-ZeekDataFall22.

Tactics	Count
None (benign data)	3,509,406
Resource Development	262,386
Reconnaissance	4005
Discovery	1000
Execution	5
Command and Control	2
Lateral Movement	2
Defense Evasion	2
Persistence, Priv…	2
Initial Access	1

Table 2. Information gain results on UWF-ZeekData22.

Attribute No.	Attribute	Info Gain
1	History	0.827
2	Protocol	0.77
3	Service	0.726
4	Orig_bytes	0.724
5	Dest_ip	0.674
6	Orig_pkts	0.655
7	Orig_ip_bytes	0.572
8	Local_resp	0.524
9	Dest_port	0.486
10	Duration	0.386
11	Conn_state	0.166
12	Resp_pkts	0.085
13	Resp_ip_bytes	0.065
14	Src_port	0.008
15	Resp_bytes	0.008
16	Src_ip	0.007
17	Local_orig	0.002
18	Missed_bytes	0
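
As a reminder of what the score in Table 2 measures, the following is a minimal sketch of information gain for a categorical attribute, IG(Y; X) = H(Y) − Σ_v p(X = v) · H(Y | X = v). It is not the authors' code: the function names and the toy values are hypothetical, and a continuous attribute would first need to be binned.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values, labels):
    """Reduction in label entropy obtained by splitting on a categorical attribute."""
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    # Group the labels by attribute value.
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    # Weighted average of the entropy inside each group.
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


if __name__ == "__main__":
    # Hypothetical toy rows: an attribute value and a benign/attack label.
    protocol = ["tcp", "tcp", "udp", "udp", "tcp", "udp"]
    label = ["attack", "attack", "benign", "benign", "attack", "benign"]
    print(round(information_gain(protocol, label), 3))
```

An attribute that separates the classes perfectly has a gain equal to the label entropy, while an attribute unrelated to the label has a gain close to zero, which matches the ordering from History down to Missed_bytes in the table.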

Table 3. Results matrix for Resource Development.

Scores	IF	EIF-0	EIF-1	EIF-2–17
Accuracy	0.592	0.823	0.702	0.5599
Precision	0.298	0.704	0.503	0.2663
Recall (TPR)	0.266	0.705	0.503	0.2664
F1 Score	0.281	0.705	0.503	0.2663
Specificity (TNR)	0.732	0.873	0.787	0.6856

Table 4. Results matrix for Reconnaissance.

Scores	IF	EIF-0	EIF-1,2	EIF-3,4	EIF-5–17
Accuracy	0.5302	0.8221	0.8158	0.6526	0.5258
Precision	0.2169	0.7075	0.697	0.4251	0.2137
Recall (TPR)	0.212	0.7019	0.6914	0.4217	0.212
F1 Score	0.2144	0.7047	0.6942	0.4234	0.2128
Specificity (TNR)	0.2169	0.7075	0.697	0.697	0.4251

Table 5. Results matrix for Discovery.

Scores	IF	EIF-0,1	EIF-2–17
Accuracy	0.531	0.783	0.5305
Precision	0.206	0.6263	0.2059
Recall (TPR)	0.211	0.642	0.211
F1 Score	0.209	0.6341	0.2084
Specificity (TNR)	0.661	0.8413	0.6628
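
The five scores reported in Tables 3–5 are the standard confusion-matrix metrics. As a small illustration, the sketch below computes them from true/false positive and negative counts. The function name and the counts in the usage example are hypothetical and are not taken from the experiments.

```python
def classification_scores(tp, fp, fn, tn):
    """Return accuracy, precision, recall, F1 and specificity from raw counts.

    tp/fp/fn/tn are the true-positive, false-positive, false-negative and
    true-negative counts; a ratio with a zero denominator is reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # TPR
    specificity = tn / (tn + fp) if (tn + fp) else 0.0     # TNR
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    for name, value in classification_scores(tp=70, fp=30, fn=30, tn=170).items():
        print(f"{name}: {value:.4f}")
```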

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
