1. Introduction
Nowadays, network administrators increasingly rely on Intrusion Detection Systems (IDSs) to ensure security during network communication. However, there is an increasing trend that attackers are able to launch attacks by utilizing a large number of hosts instead of an individual host, which could have widespread impact on the entire network. For instance, attackers can scan large numbers of hosts concurrently in order to search for computer vulnerabilities (e.g., network scans); then, use a self-replicating program to propagate their malicious codes to target vulnerable hosts in a short time (e.g., network worms); and, finally, they can use those compromised hosts to flood a targeted system or host to disturb or disrupt its service (i.e., Distributed Denial-of-Service attack) [
1,
2,
3,
4].
In the present study, we focus more on network worm detection and evaluate whether the proposed approach can be extended to other types of attacks. There are two main purposes for attackers to launch network worms: (1) causing a traffic overload in the networks and congestion on backbone links, which disrupt affected hosts and lead to financial losses [
5] and (2) recruiting compromised hosts for future attack use [
6], as illustrated in
Figure 1. For example, in 2013, the Symantec Internet security threat report [
7] pointed out that target attacks increased by 42% in 2012, of which 31% were using worms to attack business; whereas, in 2016, Kaspersky laboratory [
8] reports claimed that, between 2014 and 2015, the global economic machine was affected by worms, viruses, Trojan horses, and other malware attacks. The damage was approximately
$1 billion. Even more seriously, although people can do much in the way of prevention, new types of the network worms are found from time to time, such as polymorphic worms, which can successfully avoid the capture of detection tools or antivirus software and infect millions of hosts in a few seconds. Network worms increasingly pose a great threat to network-based military equipment, economic institutions, and communication networks. Hence, fast and accurate detection and identification of network worms play an essential role in establishing a secure, stable, and reliable network environment that covers a wide range of research in the field of network security and management and has attracted significant attention from both the academia and industry all over the world.
The approaches to the worm detection problem can be divided into two categories. Most network-worm detection methods begin by inspecting each file that enters a system (i.e., antivirus software packages) and then looks for known signatures hiding in network traffic payloads; these are known as signature-based approaches [
9,
10,
11]. The main drawback of these approaches is that their performance is questionable, particularly against unknown worms (a previously unknown version or unreleased worm) or polymorphic or metamorphic worms. However, anomaly-based approaches focus on the identification of observations that do not conform to an expected pattern from normal network traffic instead of looking for fixed regular expressions in payloads. Examples of these techniques can be found in the literature [
12,
13,
14,
15,
16,
17]. Unlike signature-based methods, anomaly-based methods often have high false positive rates since it is difficult to determine the boundary between normal and abnormal patterns. Therefore, network worm detection is still a challenging problem for network administrators.
However, many researchers neglect to focus on a common scenario: that the life cycle of network worms typically comprises the same stages. In addition, network worms have their own specified and similar connected patterns in each stage. After a network worm has been released, the life cycle of a network worm consists of four stages: (1) target discovery, in which the worm scans for vulnerable hosts and then exploits the vulnerability to prepare for the next stage; (2) transferring the malicious code to the target hosts; (3) activation and (4) infection [
16,
18]. During the target discovery and transferring stages, network worms can be detected by a series of tools, such as anti-virus software, IDSs, and firewalls [
18]. However, network worm behaviors in the last two stages are limited since it is difficult to detect the network worm by IDSs in these activities [
5]. A typical example is the existence of anomalous flows generated by scanning in identifying the vulnerable hosts as well as worm propagation [
19,
20].
In this paper, the approach proposed is mainly relying on detecting the inherent connected patterns of worms during the target discovery and transferring stages. Ideally, these patterns can be distinct from normal network traffic, just like a network “footprint” for worms. Thus, we take a moderate stand based on the network flows that record communication between a source and destination IP addresses to capture those suspicious “footprints” of network worms. In our previous work, we present the advantage of leveraging a novel graph model called Network Flow Connectivity Graphs (NFCG) to comprehensively study network-wide flow connectivity relationships. In an NFCG, is defined as a set of vertices V (representing network entities, IP addresses in our case) and edges E (representing interaction relationships between pairs of vertex sets). We could build a series of interaction graphs to capture a whole worm propagation process under an NFCG. An NFCG is consisting of multiple types of motifs, a small, induced subgraph structure that can be seen as the basic cell of a graph. All of the motifs can be used to describe specific and similar connected patterns of network worms.
Thereafter, we developed a Cascading Motif Discovery (CMD) algorithm to capture connected patterns that exhibit typical characteristics of network worms. The CMD algorithm consists of two phases: the scanning motif discovery (SMD) phase, which is used to find the presence of the scanning behaviors in the initial stage of a network worm, and the transferring motif detection (TMD) phase, which is used to detect the similar transferring behaviors within network worms. A host would be identified as a worm victim if it performs network scanning activities and similar transferring behaviors. Experimental results with a simulated dataset obtained from the Georgia Tech Network Simulator (GTNetS) suggest that our approach is effective and efficient in identifying worm victims and detecting the outbreak of network worms. The main contributions of the present study are summarized as follows:
The main contributions of the present study are summarized as follows:
We reveal the characteristics flow connected patterns during a worm outbreak. The target discovery and transferring stages of the worm life cycle have a specific fan-out connection mode, and the cascading appearance and large repetition of these two connection modes are typical characteristics of the flow-connected behavior of network worms.
We propose a technique for worm detection based on network-flow-connected behavior analysis. Additionally, we develop a novel solution for building the worm-life-cycle motifs. Unlike the traditional motif discovery methods in a complex network, we focus on the inherent behavior of network worms and reveal the specific motif classes that conform to the essential characteristics of a worm outbreak.
Due to the scalability of worm connection behaviors and the complexity of subgraph mining, the CMD algorithm realizes the decomposition of worm attacks in different scales and solves the scale problem of subgraph matching.
We conduct many experiments by collecting dataset through a mature simulation system. The experimental results demonstrate that our approach effectively detect regardless of known released worms or never-seen-before worms.
The remainder of this paper is organized as follows: in
Section 2, we introduce the preliminaries and background of the Network Flow Connectivity Graphs (NFCG) and network motif.
Section 3 presents the algorithm of using Cascading Motif Discovery for network worm victims’ identification and gives an illustrative example for the proposed algorithm. In
Section 4, we briefly introduce the GTNetS simulation platform for network worm propagation and present the evaluation results of the proposed approach. The discussion of our proposed method is presented in
Section 5. Finally,
Section 6 gives the conclusions of our work and presents possible future works.
3. Proposed Approach
The proposed approach for detecting networks is introduced in this section. In the present study, the behavior of the target discovery and propagation stages in the worm life cycle can be expressed as different fan-out connected patterns, which can be represented by corresponding motifs. Since the characteristics of worm victims are [
32]:
Worm victims would generate a lot of similar flows (from one victim to target a lot of target hosts)
Each flow generated by worm victims contains only a few packets, but these flow behaviors are very similar since they are caused by the same source reason.
Worm victims often use particular same port numbers to spread a known worm, except for polymorphic worms.
Therefore, several cascading motifs can be used to describe the sequential cascading relationships generated by the whole worm life cycle. Our motif-based method is to find cascading motifs in the NFCGs that can be seen as the problem of motif discovery and subgraph matching. The traditional motif discovery method, such as FANMOD [
33], is used to enumerate size-N motifs in a graph, so that it can be used to analyze and summarize the structure and changes of complex networks. However, our method does not refer to a particular size of motif but a class of motifs with specific fan-out connected patterns. To perform such motif discovery in our method, subgraph matching often needs to be completed with exact patterns, and motifs reflected in worm propagation are similar to fuzzy patterns. Here, we develop a CMD algorithm to capture connected patterns that exhibit typical characteristics of network worms. The CMD algorithm consists of two phases: the scanning motif discovery (SMD) phase, which is used to find the presence of scanning behaviors in the initial stage of a network worm, and the transferring motif detection (TMD) phase, which is used to detect similar transferring behaviors within network worms. The complete flow chart of the CMD algorithm is presented in
Figure 4.
3.1. The Scanning Motif Discovery Phase
The SMD phase aims to detect scanning activities among network flows. As known from the worm life cycle and discussed before, a majority of network worms are typically caused by scan activity (i.e., source host sending packets to a number of destination hosts and/or ports) [
34], except for some special worms, such as the email-spammer. Thus, we need to first distinguish scanning flows from non-scanning flows. There are three cases in the scanning flows: (a) port scan, in which the flows have a host scanning several ports on a single destination host; (b) IP scan, in which the flow from a host scanning a particular port on many destination hosts; and (c) hybrid scan, which is basically a combination of both port scan and IP scan. To this end, for an NFCG graph, we could extract the scanning motif structures by measuring the flow property information embedded in the NFCG weight vector.
Regardless of the type of scanning activities, we group all subgraphs of NFCG into scanning motifs and non-scan motifs by using a combination of the following flow features, including the number of destination hosts targeted, the number of packets sent, and the set of destination ports used. The motifs and network flows would have similar communication features if they caused by the same reason. The output of the SMD phase is to identify those suspicious motifs in which the source hosts perform scanning behaviors. However, there would be a large number of suspicious motifs extracted when those features were calculated for motif discovery. In order to increase accuracy and reduce the false positive rate of the proposed approach, we set the threshold for SMD to filter out those motifs which have the exact scanning behaviors. These motifs and network flow data are used for the next TMD phase.
Since the goal of the worm is to spread their malicious codes over the entire network in a short time, the worm would never stop progressing after successfully infecting a host. The infected hosts first turn to network victims, and then they will go and scan their neighboring hosts for finding the next victims. It is clear that, if we count the number of IP addresses scanned by the victim, and if the amount exceeds a pre-defined threshold, then we can draw a conclusion that a worm has been detected. Therefore, the TMD phase is used to detect whether a motif that contains the “scanner” host exhibits strong correlated behaviors in the next step. In particular, a cascading motif that is created by a scanning motif and a transferring motif depicts the anomalous connection pattern step by step in the target-finding stage and the transferring stage of worm propagation. It is claimed that, if a cascading motif exists in the connection behavior graph of one host, the host would be identified as a worm victim.
The starting point of the TMD phase comes after the SMD approach detects the suspicious motifs performing scanning behaviors. The primary objective of the correlation approach is to check whether suspicious motifs that contain scanner hosts have a strong correlated behavior toward other destination hosts. The flow chart of the TMD phase is also illustrated in the corresponding part of
Figure 4.
Figure 5 illustrates the motif discovery procedure of our proposed approach. Through motif discovery, we can achieve a few motifs that exhibit worm-like behaviors. In addition, these suspicious motifs become more and more obvious with the interaction of the network.
3.2. The Transferring Motif Detection Phase
Informed by
Figure 4 and
Figure 5, we need first to determine a motif that contains “scanner” hosts. If a scanner host has been detected, then this suspicious host must be checked to establish whether it has strong correlated behaviors toward other hosts. The counter will increase if the suspicious host does perform this behavior. Afterwards, if the counter exceeds a pre-defined threshold, the alert is triggered and the IP address of the worm victim is saved in the worm set; otherwise, it is saved in the non-worm set. The pre-defined threshold in the TMD phase is based on some empirical knowledge and analysis of the network worm flows.
3.3. An Illustrative Example
We provide an illustrative example to show how to identify network worm victims by our approach. As
Figure 6 shows, supposing we have the input network flow data as follows, the typical procedure are given as follows:
1. Input network flow data:
2. Then, we check out whether hosts exist that are performing scanning behaviors (hosts that are detected by the scanning motif discovery phase):
3. Set the threshold for the Transferring Motif Detection phase; in this case, we set the threshold as 3. It is means when applying the Transferring Motif Detection phase, if the amount of suspicious worm victims exceed the threshold 3, we can get a conclusion that the network worm has been detected. In this illustrate example case, the result is given as follows:
IP addresses 192.168.0.1 and 192.168.0.3 are firstly detected as scanner hosts as they scan several hosts on the same port by the scanning motif discovery phase. Additionally, 192.168.0.3 generate communication flows to four hosts on destination port 135, which have exceeded the preset threshold; then, 192.168.0.3 is identified as a worm victim seed by the Transferring Motif Detection phase.
5. Discussion
In this paper, our proposed algorithm is designed for worm detection through detecting the outbreak of connected patterns of network worms. In other words, the connected motifs of worms are used to successfully identify worm victims. Many worm detection methods often need the help of a detailed packet payload, such as the Honeypot logs. However, we propose a flow-based worm detection method without analyzing the payload, which mainly focuses on the connected patterns flowing over the entire network, and then digging out those suspicious flow patterns matched with the lifecycle of network worms. However, there are still several problems existing in our algorithm which require improvement regarding applicability.
The first and practical problem for our proposed method is that of the time interval used for flow data collection and graph construction procedure. It is infeasible to visualize all node pairs and edges due to the large scale nature of network size nowadays. As the time interval we selected increases, numerous nodes need to be visualized and processed; thus, selecting a large time interval has an influence on the fact that the graph is poorly visualized and it is hard to do real-time analysis. At the same time, however, if choosing a small time interval, NFCGs are becoming sparse so that they cannot provide abundant information for flow connectivity behaviors, which leads to false negatives increasing. On the other hand, some low rank level nodes which generate irrelative or insignificant edges can also affect results of the analysis. In this case, these edges are often treated as happening occasionally and easily make the analysis fall into confusion. To address this problem, we can adopt filtering rules to remove the insignificant and irrelative nodes and edges generated by network redundant information based on the quantity of flow attributes (e.g., 1000 flows visualization) or time-window (e.g., 30 s time window for large scale network, 3 min time window for small scale network) and select the edges whose weights are larger than a certain suitable value in order to accurately capture the principal connected relationships among network flows (e.g., opened port number larger than 100).
Another issue in our algorithm is that the threshold set up in our CMD algorithm would produce false positives and false negatives. If the threshold is set too large, this would result in more false negatives; while if the threshold is smaller, the false positive value is increasing. Actually, the threshold we used should have an adaptivity-element that is different under different scales of network size. For example, in the wide-area network (e.g., the Backbone Communication Network), the scale of worm attacks in this kind of network is relatively larger than others because of the size of the network and large-volume traffic. In this case, we can decrease the false positive rate of worm victim identification by setting an appropriate threshold value, such as more than 100 hosts in the Scanning Motif Detection Phase. In the Local Area Networks, such as company or campus networks, the threshold would be lower if there are 10 or 20 more hosts being found that have been scanned by the same target host or group hosts.
Moreover, for other worm behaviors, such as email-slammers (worms in Email), they do not have the scanning behaviors to hosts through network flow interacting (e.g., they find vulnerable hosts through an attachment to a mail [
39]), which results in the fact the SMD of our algorithm fails in detection. Hoever, for most scanning worms, our algorithm is effective in identifying network victims no matter if they are known-released or zero-day worms. Although the connected patterns of worms may be similar in graphical visualization with other behaviors, the cascading motifs discovered by our CMD algorithm are completely differential from normal flow behaviors and other anomalous flow behaviors. The current algorithm is designed for communication networks because we mainly consider flow behavior analysis in these networks. If the worm behaviors in other types of networks do not have scanning behavior, the performance of our method would not be good. It is our next step to classify more worm patterns or other anomalous behaviors that produce large-scale similar flow behaviors and also apply our method under different network scenarios.
6. Conclusions and Future Work
The ability of an approach to detect and identify anomalous events that are active on a computer network is critical for network management and security. In this paper, we propose a behavioral detection approach towards the network worm victims identification. Instead of matching the fixed signature in network traffic payloads, our approach focused on discovering specific connected patterns that depicted the inherent behaviors of network worms.
We developed NFCGs, which contain plenty of network-flow behavior characteristics, in order to model the social behaviors among network hosts; and then presented a CMD algorithm to capture scanning and transferring behaviors of network worms. We used the SMD phase to find motifs that perform scanning activities and the TMD phase to detect whether the strongly correlated behaviors exist in the next step of the scanning motif. With the use of the GTNetS simulation platform, we have conducted a variety of experiments for network worms. Experimental results suggest that our approach is effective and efficient in identifying network worm victims. Additionally, we addressed three types of worm propagation situations in our experiment. The results demonstrate that our approach has high detection accuracy, regardless of whether a worm is a known worm or a new polymorphic worm.
In the future, aiming at increasing the robustness and effectiveness of the proposed approach, we would start from sensitive analysis to improve the accuracy of our approach, and performance characteristics will be reported in comparison with some classical worm detection methods. Moreover, our approach can be extended to utilize motif structures to discover anomalous characteristics for more network events and help in the evolution analysis of worm outbreaks.