In the second step the CICflowmeter [
25,
26] was used to extract features from the network traffic. This tool has been used in several recent studies [
27,
28], especially for the extraction of many datasets widely used in the community. The CICflowmeter works on bidirectional flows built from so-called quadruples, i.e., connections between two IPs (with corresponding ports), where the first packet determines the forward (source to destination) and the second the backward (destination to source) flow. The CICflowmeter extracted around 80 features from TCP and UDP network traffic, including TCP flag counts and features based on inter-arrival and idle times. Based on a correlation analysis and boxplots of single features, several suitable combinations from that large feature set were considered in detail in order to investigate the influence of those features on the detection approach. For the detection, an outlier detection method, namely the local outlier factor, was applied. The evaluation followed a practical approach: at least one sign of each attack should be detected while the number of false positives is kept small. Due to the type of attack, it is not feasible to catch all flows connected to the attack.
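The bidirectional grouping described above can be illustrated with a short sketch. It is not the CICflowmeter implementation: the `Packet` type and the function name are hypothetical, and timeouts as well as the transport protocol are ignored for brevity. It only shows how the first packet of a connection can fix the forward direction.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical minimal packet record (not a CICflowmeter type)."""
    ts: float       # capture timestamp in seconds
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    length: int     # packet length in bytes


def group_bidirectional(packets):
    """Group packets into bidirectional flows keyed by the endpoint pair.

    The first packet seen for an endpoint pair fixes the forward direction;
    packets travelling the other way are counted as backward.
    """
    flows = {}
    for pkt in sorted(packets, key=lambda p: p.ts):
        a = (pkt.src_ip, pkt.src_port)
        b = (pkt.dst_ip, pkt.dst_port)
        key = frozenset((a, b))  # same key for both directions
        if key not in flows:
            flows[key] = {"fwd_src": a, "fwd": [], "bwd": []}
        flow = flows[key]
        direction = "fwd" if a == flow["fwd_src"] else "bwd"
        flow[direction].append(pkt)
    return flows
```

Per-direction statistics (packet lengths, inter-arrival times) can then be computed separately from the `fwd` and `bwd` lists of each flow.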
3.1. Dataset
Due to the lack of existing publicly available APT datasets [
10], we followed the approach in [
11], where two datasets—one containing APT data and another dataset containing benign data as background data—were combined. While [
11] used the 20-year-old DARPA dataset as “background” data, we used the more recent CICIDS2017 dataset [
29]. There were several reasons for selecting this specific dataset as the background dataset. First, it is a publicly available dataset, well studied in the literature, and can be considered as a benchmark dataset in the intrusion detection research field. Usage of this dataset can contribute to the verifiability and comparability of results. Second, this dataset reflects a data type and network environment that is compatible with the Contagio dataset containing APT attacks, and reflects the research goals of work presented in this paper.
The CICIDS2017 dataset [
29] includes data from a small enterprise network captured over one week, from Monday to Friday between 9:00 and around 17:00 each day. It is divided into five subsets according to the day of capture. The data were captured on a testbed architecture consisting of two separate networks, a victim network with around 13 machines and an attacker network. Monday is the only subset without any attack. For the purpose of this experiment, the network data from Monday were used.
The Contagio malware database [
30] contains a collection of 36 files capturing raw network data. Each file records traffic generated during an attack by a different malware originating from APTs.
Both datasets contain raw network data, stored as .pcap files. In the first step, features were extracted from those .pcap files with the CICflowmeter. In the second step, several attacks (selected based on their duration) were taken from the Contagio malware database and combined with the features from Monday of the CICIDS2017 dataset. In order to combine those two datasets, victims in the network [
29] were selected; see
Table 1. The corresponding IP addresses of the Contagio files were then adapted in order to fit into the network.
To avoid the detection results depending on a particular pairing of attack and time slot, three different combinations were considered. While the background dataset stayed the same, the attacks were injected at different time slots; see
Table 2. It was ensured that only one attack appeared within one hour. Four different machines were infected (
Table 1) and the number of attacks on one machine was between one and three. The injection of the attacks into different time slots is given in detail in
Table 2. The column
dur. shows the duration of each attack. The time in the table refers to the start of the attack. The injection of the attacks is also visualized by the number of flows in each of the later hourly evaluation intervals; see
Figure 2 for attack flows and
Figure 3 for the benign data.
The combination described above can easily be repeated for other combinations of benign and attack data. Since the CICflowmeter computes statistical features that depend on the time at which a packet was sent (between two fixed IPs), and the extraction of features is quite fast (on a common notebook the larger (benign) pcaps are processed within a few hours), the injection of attacks at the file level can mostly be performed with programmable routines and adjustments of IPs. The only steps that need human expertise are the identification of the victim and attacker IPs, including other important members of the network whose IPs have to be changed (e.g., DNS servers), and the decision of where to place a certain attack.
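A minimal sketch of such a programmable routine is given below. It assumes that the injection is done on tables of already extracted flows and that the columns are named `timestamp`, `src_ip` and `dst_ip`; these names, the function name and the added `label` column are illustrative assumptions, not the layout used in the experiments.

```python
import pandas as pd


def inject_attack(benign, attack, ip_map, start):
    """Return benign flows plus attack flows remapped and shifted to `start`.

    benign, attack: DataFrames with columns timestamp, src_ip, dst_ip
                    (assumed names).
    ip_map:         dict mapping original attack IPs to IPs of the
                    background network (chosen manually).
    start:          desired start time of the attack.
    """
    attack = attack.copy()
    # Rewrite attacker/victim addresses so that they fit the background network.
    for col in ("src_ip", "dst_ip"):
        attack[col] = attack[col].replace(ip_map)
    # Shift all timestamps so that the first attack flow begins at `start`.
    offset = pd.Timestamp(start) - attack["timestamp"].min()
    attack["timestamp"] = attack["timestamp"] + offset
    # Keep a label for the later evaluation only; it is not a feature.
    benign = benign.assign(label="benign")
    attack = attack.assign(label="attack")
    combined = pd.concat([benign, attack], ignore_index=True)
    return combined.sort_values("timestamp", kind="stable").reset_index(drop=True)
```

Calling the function repeatedly with different `start` values yields different placements of the same attack in an unchanged background dataset.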
3.2. Features
In order to achieve high performance and limited computational storage as stated in [
9], the proposed approach focuses on network data. Since computational power is limited and, to the best of our knowledge, the influence of individual features on the detection of cyber attacks has not been investigated sufficiently, that influence is considered here. Several feature sets are examined in detail to evaluate whether any features or feature sets outperform others or significantly help to detect APTs. Since these features are based only on statistics of the network traffic, all of these feature sets are also suitable for encrypted traffic.
In the literature, various kinds of flows are used for the creation of statistical features (see [
31] for a comparison of flow exporters) using different numbers of features. While some flow extractors, e.g., Maji, Softflowd and Tranalyzer, build their features from the whole unidirectional flow, other flow extractors such as CICflowmeter or Netmate additionally provide features from the bidirectional flow. In this paper we consider different features of the bidirectional flow provided by the CICflowmeter.
With the CICflowmeter, 76 different features were extracted in total, ranging from counters for different flags of TCP network traffic to the packet length, the average packet size, the active and idle time of a flow and the inter-arrival time. Moreover, some features are especially useful for identification, such as the flow ID, the source IP, the destination IP, the source port and the destination port. In order to avoid bias and to ensure that an attack is detected by its network traffic behavior and not by the IP (which could change easily), the features for identification are excluded in this study. A first investigation of the remaining features showed that five features contained only zero values. Therefore, they were removed (this applies to the features Bwd PSH Flags, Bwd URG Flags, Fwd Bulk Rate Avg, Fwd Bytes Bulk Avg and Fwd Packet Bulk Avg). Moreover, two pairs of features turned out to be identical: the pair Bwd Segment Size Avg and Bwd Packet Length Mean, and the pair Fwd Segment Size Avg and Fwd Packet Length Mean. Therefore, only the second feature of each pair was kept.
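This kind of clean-up can be sketched with pandas as follows. The identifier column names are assumptions about the header of the exported CSV, the function name is hypothetical, and the sketch simply keeps the first of several identical columns, which is an arbitrary choice made for illustration.

```python
import pandas as pd

# Assumed header names of the identification features.
ID_COLUMNS = ["Flow ID", "Src IP", "Dst IP", "Src Port", "Dst Port"]


def prune_features(df):
    """Drop identifier, all-zero and duplicated feature columns."""
    feats = df.drop(columns=[c for c in ID_COLUMNS if c in df.columns])
    feats = feats.select_dtypes("number")
    # Remove features that contain only zero values.
    feats = feats.loc[:, (feats != 0).any(axis=0)]
    # Of several identical columns, keep the first one in column order.
    duplicated = feats.T.duplicated(keep="first")
    return feats.loc[:, ~duplicated.values]
```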
From the other features, descriptive statistics have been calculated and a correlation analysis (see
Figure 4) has been performed. High correlations between some features were taken into account for the feature selection. This applied especially to the higher correlations between the IAT features of the total flow, between the forward IAT and the backward IAT features, and between the total, forward and backward packet length features. The influence of a combination of them is addressed by the features
,
,
and
in the experiments.
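A correlation analysis of this kind can be sketched as follows. The threshold of 0.9 and the function name are illustrative assumptions; the text does not state which value counted as a high correlation.

```python
import numpy as np
import pandas as pd


def correlated_pairs(features, threshold=0.9):
    """List feature pairs whose absolute Pearson correlation exceeds threshold.

    The threshold is an assumed value chosen for illustration.
    """
    corr = features.corr().abs()
    # Keep the strict upper triangle so that each pair is reported once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)
```

The result is a series indexed by feature pairs, which can be inspected before deciding which member of a highly correlated group to keep.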
We dismissed features focusing on the minimum, since that value is by definition fixed for any statistical flow. Based on the boxplots (see
Figure 5,
Figure 6,
Figure 7 and
Figure 8), we further dismissed the flag features, particularly because a detailed investigation also showed that most of the features (
CWR Flag Count,
ECE Flag Count and
URG Flags) had very few non-zero values, all of them in benign data, and were zero throughout the attack data.
We selected feature sets for further investigation based on the detailed study of boxplots and the correlation analysis and by using knowledge of previous publications. In [
11], for example, only two features were used for the detection of advanced persistent threats, namely, the duration of a flow and the total number of packets transferred (corresponding to feature set
).
Other features, e.g., from log-based approaches or host-based features, are not applicable to bidirectional data flows. An example is [
9], which addressed the data exfiltration stage using the following features:
numbytes: The number of megabytes uploaded by an internal host to an external address;
numflows: The number of flows to an external host initiated by an internal host;
numdst: The number of external IP addresses related to a connection initiated by an internal host.
Moreover, those features are not included in the CICflowmeter tool.
As stated in [
19], the inter-arrival-time and the active and idle time of a flow seemed to be superior in previous work. Therefore, those features were included in different variants. As the number of features used is highly correlated with the processing time and potential storage time, the goal was to find a small superior feature set and avoid including just any (similar) features. That is the reason why we either used the median and standard deviation of a certain value together or the maximum of a value. We did not use the minimum (e.g., packet length), since the boxplots did not show any useful capabilities to distinguish benign and attack flows.
Other publications considering the whole APT life cycle (compare [
10]) and not mainly focusing on intrusion detection, either used alerts for the detection [
32], followed a graph-based approach [
22] or lacked details of features used [
33].
Based on that, different feature sets were considered: see
Table 3 for features using (only) the whole bidirectional flow and
Table 4 for features including some for the forward or backward flow only. The duration of each flow was limited in CICflowmeter by the activity timeout of 5 million seconds and the flow timeout of 12 million seconds.
3.3. Outlier Detection
This paper proposes an unsupervised method to catch signs of the attacks, namely, the local outlier factor [
34]. For that method, as for outlier detection in general, the goal is to separate regular observations from some outliers. The algorithm computes a so-called local outlier factor (LOF), a score that reflects the degree of abnormality of an observation, for each object in the dataset. The approach is local in the sense that the LOF is calculated only on a restricted neighborhood of each object, i.e., it is based only on those neighbors. The approach is loosely related to density-based clustering, e.g., DBSCAN [
35] and OPTICS [
36].
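According to [
34], the local outlier factor of an object $p$ is defined as
$$\mathrm{LOF}_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \frac{\mathrm{lrd}_{MinPts}(o)}{\mathrm{lrd}_{MinPts}(p)}}{|N_{MinPts}(p)|}$$
with the number of nearest neighbors used in the defined local neighborhood of $p$, namely, $MinPts$. The local reachability density $\mathrm{lrd}_{MinPts}(p)$ of an object $p$ is defined as
$$\mathrm{lrd}_{MinPts}(p) = \left( \frac{\sum_{o \in N_{MinPts}(p)} \text{reach-dist}_{MinPts}(p, o)}{|N_{MinPts}(p)|} \right)^{-1}$$
where $N_{MinPts}(p)$ denotes the nearest neighbors of $p$ and the reachability distance of an object $p$ with respect to object $o$ is given as
$$\text{reach-dist}_{k}(p, o) = \max\{k\text{-distance}(o),\ d(p, o)\}$$
The $k$-distance is the distance of a point to its $k$th neighbor, i.e., the distance to its $k$th closest point.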
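To make the definition concrete, the following NumPy sketch computes the LOF directly from the three quantities named above (k-distance, reachability distance and local reachability density). It is a didactic sketch under simplifying assumptions: Euclidean distance, exactly MinPts neighbors per point (ties are ignored) and no duplicate points. It is not the implementation used in the experiments, and the function name is hypothetical.

```python
import numpy as np


def local_outlier_factor(X, min_pts):
    """Compute the LOF of every row of X from the definition (brute force)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances; a point is not its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    # Indices of the MinPts nearest neighbors of each point.
    nn = np.argsort(dist, axis=1)[:, :min_pts]
    # k-distance: distance to the MinPts-th nearest neighbor.
    k_dist = dist[np.arange(n), nn[:, -1]]
    # reach-dist(p, o) = max(k-distance(o), d(p, o)) for each neighbor o of p.
    reach = np.maximum(k_dist[nn], np.take_along_axis(dist, nn, axis=1))
    # Local reachability density: inverse of the mean reachability distance.
    lrd = 1.0 / reach.mean(axis=1)
    # LOF: mean ratio of the neighbors' lrd to the point's own lrd.
    return lrd[nn].mean(axis=1) / lrd
```

Points inside a homogeneous cluster obtain values close to 1, while an isolated point obtains a value well above 1.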
In the proposed approach the outlier detection is applied on different time slots (as shown in
Figure 9). While this paper focuses on how to select a suitable feature set, the approach presented here can also be used for anomaly detection (with a selected feature set). In such a case, suitable features need to be extracted from network traffic.
Figure 9 shows network traffic at a central point. Depending on the environment, the extraction of features from different endpoints would be possible as well. In this case, the anomaly detection should be applied to each endpoint separately, in order to capture potential user-specific behavior.
For different feature sets, anomaly detection with the local outlier factor method is used in different time slots, where each time slot contains exactly one APT attack. Furthermore, as shown by this analysis, this approach is applicable for security at runtime, to give security administrators hints for further investigations.
Moreover, due to the training in different time slots, characteristics such as more e-mail activity in the morning are taken into account. It has to be noted that, in addition to APT detection, the proposed algorithm is expected to detect other anomalies as well, such as software updates, uploads of huge files for partners in a project and other tasks that usually appear in a large network.
As a pre-processing step, and in order to avoid detection rates that merely reflect the scaling of some features, robust scaling is performed. For this step, as for the local outlier factor, Python’s sklearn library is used. The built-in robust scaling is robust to outliers: it removes the median and scales the data according to the interquartile range. Each feature is centered and scaled independently.
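The effect of robust scaling can be written out in a few lines. The sketch below assumes the 25th and 75th percentiles as the quantile range, which to our understanding corresponds to the default of sklearn's `RobustScaler`. The function name is hypothetical.

```python
import numpy as np


def robust_scale(X):
    """Center each feature by its median and scale it by its interquartile range."""
    X = np.asarray(X, dtype=float)
    median = np.median(X, axis=0)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    iqr[iqr == 0] = 1.0  # leave constant features unscaled
    return (X - median) / iqr
```

Because the median and the quartiles are hardly affected by a few extreme values, a single very long flow does not compress the scale of all other flows, as it would with min-max scaling.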
The experiments have been performed on a Windows notebook with a GHz CPU. In a pre-study, several parameters, especially the number of neighbors and the distance measure, were evaluated. For the number of neighbors, the values , and for the metric minkowski, manhattan and cityblock, were used. Based on these experiments, the best choice is to set the number of neighbors to 40 and to use the Minkowski norm.
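A minimal sketch of this setup with sklearn could look as follows. The helper name `lof_scores` is hypothetical, and all parameters other than the number of neighbors and the metric are left at the library defaults, which is an assumption.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import RobustScaler


def lof_scores(X, n_neighbors=40, metric="minkowski"):
    """Return the negative outlier factor of every row of X.

    Values near -1 indicate inliers; strongly negative values indicate outliers.
    """
    X_scaled = RobustScaler().fit_transform(X)
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, metric=metric)
    lof.fit(X_scaled)
    return lof.negative_outlier_factor_
```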
The evaluation of the results is always per hour, i.e., in the intervals 9–10 o’clock, 10–11 o’clock,
…, 15–16 o’clock and 16–17 o’clock. In each of these time slots, exactly one attack exists. All attacks consist of a different number of corresponding flows (see
Figure 2). Observations with local outlier scores around 1 (in the results the scores are negative, since the negative outlier score is used, following the recommendations of the library used) are clearly inliers. However, there is no rule for setting an adequate threshold for identifying significant outliers. A proper threshold highly depends on the dataset.
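The hourly evaluation can be sketched as below. The column name `timestamp`, the function name and above all the threshold of -2.0 are assumptions made for illustration only; as stated above, a proper threshold has to be chosen per dataset.

```python
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import RobustScaler


def flag_outliers_per_hour(flows, feature_cols, threshold=-2.0, n_neighbors=40):
    """Score the flows of each hourly slot separately and flag low scores.

    threshold is an assumed, dataset-dependent value, not a recommendation.
    """
    flows = flows.copy()
    flows["score"] = float("nan")
    hourly = flows.groupby(flows["timestamp"].dt.floor("h")).groups
    for _, idx in hourly.items():
        if len(idx) < 2:
            continue  # LOF needs at least one neighbor
        X = RobustScaler().fit_transform(flows.loc[idx, feature_cols])
        k = min(n_neighbors, len(idx) - 1)
        lof = LocalOutlierFactor(n_neighbors=k).fit(X)
        flows.loc[idx, "score"] = lof.negative_outlier_factor_
    # Scores near -1 are inliers; more negative scores are more anomalous.
    flows["flagged"] = flows["score"] < threshold
    return flows
```

Flagged flows are hints for further investigation rather than confirmed attacks, since benign but unusual activity can receive low scores as well.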