1. Introduction
According to the numerous cyber security reports, the number of cyber threats is increasing rapidly from 23,680,646 in 2008 [
1] to 5,188,740,554 in 2013 [
2]. This is nowadays one of the most vexing problems in computer systems security [
3]. At the end of 2013, Kaspersky Lab, the Russian producer of anti-virus software, reported that it currently detects and blocks more than 315,000 new malicious programs every day, a significant increase from 2012, when 200,000 malicious programs were detected and blocked each day on average.
According to Nomura Research Institute annual report on cyber security trends, in 2012, 100% of organizations had anti-virus products installed [
4]. Despite this, according to this report, about 30% of organizations are systematically infected by malware. The reason for this situation is not, as it might be expected, the inappropriate update of operating systems and virus definition files, but the lack of all signatures for existing and appearing threats.
A major group of malware is composed of existing (known) parts of malicious codes [
5]. Regardless of the fact that their installation packs are different, this make them similar in a way that they may launch applications with the same features. This simplicity of malicious code development and the effectiveness of obfuscation mechanisms [
6,
7] available for the attackers nowadays make them armed with a powerful weapon.
Moreover, according to the study conducted in 2012 by the Verizon Research, Investigations, Solutions, Knowledge (RISK) Team in cooperation with many national federal organizations, including the Australian Federal Police, Irish Reporting and Information Security Service, and United States Secret Service, the following findings were made [
8]:
54% of malware took months to discover,
29% of malware took weeks to discover,
13% of malware took days to discover.
This report shows how important it is to introduce new techniques that speed up the process of malware detection to hours. The authors of the report [
4] indicate that anti-virus products should be supported by malware behavioral analysis tools in order to detect those attacks for which signatures were not established yet. An existing example of an application that uses behavioral analysis for advanced persistent threat detection is Digital DNA by HBGary, which extends the capabilities of McAfee Total Protection anti-virus [
9]. The detailed technical specification of this solution has not been released for the public yet. The product brochure reveals the information that multiple low level behaviors are identified for every running program or binary. This leads to the conclusion that each application is observed from a behavioral perspective. McAfee is proud that the solution allowed the detection during the last year of more zero-day attacks than the previous five years combined. This indicates the scale of new malware development and the efficacy of the behavioral approach.
2. Motivation
An overwhelming number of computer systems are connected to each other by a global network, the Internet, which allows producing results beyond those achievable by the individual systems alone. The outcomes of cooperative work and the accessibility of information are perceived and appreciated probably by all of its users. The advantages of this technology are available, unfortunately, also for hostile goals, which was highlighted in the Introduction section.
Although awareness of necessary security applications seems to be common and the tools used for that purpose are getting more and more advanced, the number of successful attacks targeted at computer systems is growing [
10]. They are mostly related to denial of offered services, gaining access to or stealing private data, financial fraud,
etc. [
11,
12]. Moreover, the evolution towards cloud computing, increasing use of social networks, mobile and peer-to-peer networking technologies, which are an intrinsic part of our daily routine, bring many conveniences to our personal life, business and government, gives the possibility of using them as tools for cyber criminals [
13] and a potential path of malware propagation [
14,
15].
Cyber criminals are focused on finding a way to bypass security controls and gaining access into the protected network for many reasons. These may be gathering sensitive data or money fraud. That is why organizations, companies, governments and institutions, as well as ordinary citizens all over the world are interested in the detection of all attempts of malicious actions targeting their computer networks and single machines.
The success rate of the applied methods for malware detection depends on the reliability of the malware models. Usually, they are based on the code signatures. Security controls (e.g., anti-virus tools) might be unadjusted, because the signatures of new threats have not been identified yet. Hackers often use existing parts of code in order to implement new types of malware. This allows one, in turn, to quickly develop the signatures of new dangerous software. Therefore, the more signatures that are deployed, the more malicious codes that are identified. On the other hand, one of the methods of misleading the signature-based detection systems is code obfuscation [
16], the aim of which is generating, from already existing code, a new application that cannot be assessed yet as being risky by security controls [
17]. This technique is simple in application and potentially successful, so that also successful countermeasures are necessary. One of the examples is to follow the behavior [
18] of the malicious software in order to identify it and eliminate it from the protected system.
In order to overcome the limitations of signature-based anti-virus tools, there are various modeling methods of malware behavior proposed by the scientific community. As opposed to signature-based techniques, which analyze the static contents of the binaries, behavior techniques are focused on run-time events caused by those binaries observed on different levels (processor, operating system), which form sequences of related actions—so-called patterns. It is very common that sensors focused on monitoring system calls are used to identify the symptoms of malware behavior. From this vast amount of apparently independent events, it is necessary to identify symptoms that match a unique pattern. This approach relies heavily on the model of malware behavior that was used as the basis of a particular detection method.
In [
19,
20], we can find the analysis of both groups and sequences of system calls in the form of n-gram models. These approaches investigate the diversity of system calls by verification of a large-scale collection of system events that were related to the activities performed by particular hosts’ users. N-gram models enable one to identify similarities among different types of malware; however, the matching process is not strictly defined and may be implemented differently.
Another approach was proposed in [
21], which describes an individual system call analysis that tracks processor operations on the assembler code level on the basis of control flow graphs. The case study presented was based on malware affecting web browsers and any loaded browser helper object. With the use of this method, the authors were able to handle sensitive user information, such as the URLs that a user visits or the content of the web pages that are loaded in Microsoft Internet Explorer. This way, a leakage of sensitive information outside the browser can be stopped.
Similarly to [
21], a system call dependency graph [
22] is a technique for extracting optimally discriminative specifications, based on graph mining and concept analysis, which uniquely identifies a class of programs. Such a discriminative specification technique scales to large classes of programs due to probabilistic sampling of the set of events. This technique for mining behaviors by contrasting two sets for malicious and benign programs relies on a precise representation of software behavior, which could be unsuccessful for zero-day attacks constructed from existing parts of malware, which might behave differently.
In [
23], we can find an interesting proposal of the bounded feature space behavior modeling (BOFM) framework for scalable malware detection. BOFM models the interactions among software (which can be malware or benign) and security-critical OS resources in a scalable manner. Information collected at run-time (e.g., based on API hooking) according to this model is then used by machine learning algorithms to learn how to accurately classify software. This approach enables one to perform system monitoring during the user’s regular interaction with the system. However firstly, the learning phase needs to be performed. This approach may suffer from additional false positive and negative detection rates depending on the learning set, as well as the learning process.
In [
24] modeling of the propagation of email worms was proposed. The proposed model can precisely present the repetitious spreading process caused by malicious software distributed as email attachments. This method can be used to evaluate the infection process of newly-created malware, so called zero-day attacks. It could be useful to alarm the users of some network environments (e.g., company, county, region); however, it would not be sufficient for malware detection.
Some of the methods presented above describe only the model used to describe malware behavior [
20,
23,
24]; however, others focus more on the detection method [
21,
23]. They can detect malware activities on the processor level [
21] or the operating system calls [
19] in run-time [
23] or in a controlled or emulated environment [
24]. The main assumption taken for our work was that the malware tracking tool needs to operate during the normal operation of the host machine on the basis of events observed by the operating system monitor sensor. The model must also be able to resemble the behavior of different kinds of malware (not focusing only on Internet browser spyware or e-mail worms) and efficiently support the process of detection. It was also important to provide the possibility to use a particular classifier that can improve the success rate of the detection process (minimizing the false positive and false negative rates).
The aim of this article is to present an approach to the modeling and detection of malware behavior with the use of formal models (classifiers) based on colored Petri nets (CP-nets [
25,
26]). We prove that the method enables one to create models resembling the behavior of malware, and these models support the detection of cyber threats directed at computer systems.
3. CP-nets
Colored Petri nets form a discrete-event modeling language combining the capabilities of Petri nets [
27] with the capabilities of a high-level programming language. CP-nets provide graphical notation typical for Petri nets, but net elements are described using the CPN ML programming language [
25].
A non-hierarchical CP-net [
25] is a nine-tuple
CPN = (
P, T, A, Σ,
V, C, G, E, I), where:
P is a finite set of places.
T is a finite set of transitions, such that P ∩ T = ∅.
A ⊆ (P × T) ∪ (T × P) is a set of directed arcs.
Σ is a finite set of non-empty color sets.
V is a finite set of typed variables, such that Type[v] ∈ Σ for all variables v ∈ V.
C: P → Σ is a color set function.
G: T → ExprV is a guard function, such that ∀t∈T Type[G(t)] = Bool, where ExprV denotes the set of expressions (possibly with free variables from set V) provided by CPN ML.
E: A → ExprV is an arc expression function, such that ∀a∈AType[E(a)] = C(p)MS, where p is the place connected to the arc a and XMS denotes the set of multisets over set X.
I: P → Expr∅ is an initialization function, such that ∀p∈P Type[I(p)] = C(p)MS.
A marking (state) of a CP-net is a function M defined on the set of places P, such that ∀p∈P M(p) ∈ C(p)MS. The initial marking is obtained by evaluating the initialization expressions.
An execution of a CP-net is described by means of an occurrence sequence. It specifies the markings that are reached and the transitions that occurred. To make it possible to evaluate arc expressions, it is necessary to assign (bind) some values to free variables occurring in arc expressions on the arcs connected to the transition and in the transition guard. A transition is enabled (ready to occur) if it is possible to construct such a binding that the guard evaluates as true and each of the arc expressions evaluate as tokens, which are present on the corresponding input places. An occurrence of a transition t removes tokens from input places of t and adds tokens to its output places. The multisets of removed/added tokens are specified by the expressions of the corresponding arcs.
Hierarchical CP-nets are composed of non-hierarchical modules. Substitution transitions and fusion places are used to combine such modules. The former idea allows the user to refine a transition and its surrounding arcs into a more complex net, which usually gives a more precise and detailed description of the activity represented by the substitution transition. A fusion of places allows users to specify a set of places that should be considered as a single one. This means that they all represent a single conceptual place, but are drawn as separate individual places (e.g., for clarity reasons). For more details, see [
25].
4. CP-net Malware Models
Malware executes functions and operations on the same system resources as legitimate applications. Therefore, it is possible to identify patterns that can be used as models of software behavior. These are potentially: (i) operations on files; (ii) operations on the system registry; (iii) run processes and applications; (iv) communication with specific IP addresses; and (v) communication with domains. These actions can be malicious and performed among many actually harmless ones.
The malware detection takes as an input a set of suspicious events received from the process’ hooking engine, so-called PRONTOlogy, developed by the authors of this article [
28,
29]. This engine is based on ontology reasoning [
30] used for the purpose of filtering single malicious incidents among hundred of thousands of regular ones. These single events are passed on to the PRONTOnet engine, the operation of which is based on formal CP-net malware models, for further investigation.
The PRONTOnet uses formal malware models and performs the detection process by passing (moving) through particular places in the model, using the CP-net vocabulary and characteristics described above. CP-nets are one of the most widespread classes of Petri nets. They provide an easy to understand graphical notation similar to other commonly-used graphical languages, e.g., some UML diagrams. The description language is also easy to learn. It is reduced to a small number of statements typical for high-level programming languages. The semantics of CP-nets is formally defined, and a reach set of analysis methods for CP-nets is accessible. In comparison with other formals methods, CP-nets equally describe the states and actions of the modeled system. Depending on the needs, we can focus on the former or the latter of these aspects. Moreover, the mechanisms of the hierarchical models’ design enable users to construct complex models and/or to swap some parts of a model if necessary.
The hierarchical structure of the CP-net model defined by the authors is shown in
Figure 1. The detection process progresses on the basis of the levels of the hierarchy. The primary module of the CP-net model, representing the PRONTOnet threat tracking tool, is depicted in
Figure 2. On the left-hand side of this figure, there is a column of places, marked with ellipses, storing tokens that represent particular assets that might be affected by malware: F, a place storing tokens indicating files; R, registry entries; P, processes; D, domains; IP, IP addresses that the malware may communicate with. Markings of these places represent tracked symptoms. The second column in
Figure 2 is composed of substitute transitions that are related to the acquisition process depicted in
Figure 3. The next column is made up of places indicating particular assets affected by malware activated in the monitored system. Markings of these places represent observed symptoms. They are processed by a substitute transition called Verify, which is aimed at reasoning if the system is infected by a certain malware type. As a consequence, the place RESULT is marked with a vector informing about particular malware type and related symptoms.
The CP-net modeling language assumes the application of particular elements presented in Section 3. They have been also indicated in
Figure 2, called the primary module. For the purpose of malware modeling, we have defined color types and inscriptions as presented in the following listing:
colset S = String;
colset I = Integer;
colset Symptoms = List S;
colset V = Product I * S * Symptoms;
Type S is a string. In this way, particular assets are described by variables f, r, p, d, ip of type S, e.g., file C:\[WINDIR]\System32\svchost.exe or IP address 66.232.126.195. Type I is an integer and is used as threat identity number, and type Symptoms is a list of strings describing assets suspected to be infected by malware. Type V is a product that indicates the threat identity number, the name of the threat and a list of symptoms that were used for identifying which attack was executed in the system.
In
Figure 3, the symptom acquisition process and co-operation with the PRONTOlogy module is presented. The Symptom place is an input/output port that indicates appropriate places F, R, P, D and IP from the higher level module (
Figure 2). In the acquisition module, tokens that represent filtered suspicious activities identified by the PRONTOlogy are passed to the VSymptom place if the same token exists at the Symptom place. Identified suspicious actions mark the VSymptom place for further processing by the Verify transition. Marking of the Symptom place contains all tracked elements of the monitored system, e.g., all IP addresses with which the system may communicate and from which it downloads malicious software. Transition Symptom_detected is developed in order to test the existence of the appropriate token in the Symptom place in case the Trigger place is marked. If the compared tokens are different, the transition does not react. The conformity of tokens is required by the guard [
x =
t]. The CP-net model of the acquisition module takes into account the situation when more than one malware type has the same individual symptoms. In this case, marking of the Symptom place is not reduced, while the Symptom_detected transition is enabled.
This module shows an important role of the PRONTOlogy [
28,
29] in the identification of suspicious events flowing from the system calls’ monitor. This identification process is based on the set of rules describing suspicious activities using the same semantics as CP-nets (color sets, places and their appropriate markings), e.g., modification of the registry by a particular process, but with the use of Web Ontology Language (OWL) and Semantic Web Rule Language (SWRL). The rules enable one to recognize single symptoms and are created by the malware analyst after a deep analysis of particular malware behavior. This process is necessary for sifting through thousands of events flowing from the sensor, increasing the efficiency of the PRONTOnet. The rules are generic and enable one to detect obfuscated malware and newly created zero-day attacks that resemble, at some point, the behavior of existing malicious codes. The identified suspicious actions are passed through to the PRONTOnet module, which now performs contextual analysis on the basis of the events’ sequences.
The verification module is presented in
Figure 4. It must be noted that substitute transition
Verify in the primary module represents multiple transitions designed for the identification of various malware types. This indicates that characteristic marking of V places causes enabling of the transition appropriate for particular malware. This triggers the verification process. An exemplary virus detection for the chosen marking is shown in
Figure 5, which presents the detection of malware called Virut [
31].
In the presented example, assuming the appearance of the appropriate marking, transition Virut is enabled, which, as a consequence, leads to receiving vector v informing about detection of the Virut malware. The structure of vector v is as follows:
1′ 1 | Virut | vrt7.tmp, HKLM\…\Security, winlogon, svchost, zief.pl, setdoc.cn, 209.205.196.18, 94.247.2.38.
The CP-net models presented in this section allow one to easily update the symptoms of a new attack by changing the initial marking of places. Every time a new malware model is entered into the PRONTOnet, the appropriate actualization of places’ markings must be performed.
It is also worth emphasizing that the detection of malware is not limited only to the depicted resources. PRONTOnet may be also used for the identification of exploits through the analysis of network traffic and system statistics [
32]. If some resource is identified as useful for malware detection, the primary module needs only to be updated with additional places and transitions.
6. Conclusions
The article tackles the problem of malware modeling for the purpose of the detection process. It proposes a new approach to behavioral malicious code analysis based on CP-net models in order to inform security stakeholders about suspicious activities observed in the monitored system. This should lead to faster and more appropriate decisions that mitigate the negative results of the conducted cyber attack and, as a consequence, the development of new signatures to avoid similar threats.
The malware behavior model used in the PRONTO hunting tool has been formally specified. It consists of a hierarchical CP-net model allowing for the detection of not only well-known cyber attacks, but also zero-day attacks constructed from the existing parts of malicious codes. Moreover, the method is resilient to the very popular obfuscation of the malicious code. It precisely detects single or groups of suspicious activities caused by the malware. Our approach is focused both on the formal modeling, as well as the success rate of the detection process. It is also generic in terms of the detection of multiple types of malware in opposition to the methods focused on some specific attacks (e.g., on e-mails, web browsers). The proposed method tracks and traces events related to various system assets, like files, system registry, processes, communicated domains or IP-addresses. This list could be extended if needed.
The limitation of our method can be the lack of the possibility to use it in virtualized machines, so-called sandboxes. The reason is the fact that the vast majority of malware is able to detect the virtualized environment and does not activate itself in such an environment. In order to overcome this inconvenience, we are working on a honeypot that could more easily get infected by malware and enable us to identify suspicious or harmful activities characteristic of this particular malware, as well as to perform an analysis of its code in the controlled environment.
The CP-net-based malware model defines behavioral patterns that express software activities on the operating system. The detection method that uses the model moves through the places in the CP-net on the basis of the interaction with sensors that record sequences of activities and verify their potential harmfulness (the acquisition phase) and their analysis (the detection phase).
It has been shown that the proposed approach to modeling and the resulting CP-net cyber attack models constructed with the use of the CP-net malware modeling tool allow one to identify known malware.
Due to the limited size of this article, it is not possible to report all of the experiments that have been performed; however, it is also possible to detect zero-day attacks, the code of which has been obfuscated. PRONTOnet is able to pass on the alarms about the observed threat and the symptoms that indicate the existence of particular malware.
The proposed method of malware modeling and its detection is planned to be adapted at least in the Federated Cyber Defence System being developed in Poland; however, its applicability is much wider. It can be used in honeypots spread in the monitored system in the form of so-called sandboxes without any or with vulnerable security controls in order to trace and track unwanted activities generated by the software, as well as the users of the system. The advantage of the method should be also noticed by companies utilizing only signature-based anti-virus applications. In particular, PRONTO fulfills the current need to shorten the time of malware detection from the moment it has established itself in the system. As mentioned in the introductory chapter of this article, over 50% of malware was detected after months and about 30% after weeks of infection. Before being identified, cyber attacks can cause irreversible losses and damage in the system. The alarm raised by PRONTO and the in-time introduction of appropriate security measures and malware removal can mitigate the risk of potential losses. Moreover, faster detection of new malware should lead to faster delivery of appropriate signatures constructed on the basis of static code analysis, which is a current need of companies producing anti-virus applications.