2. Background
The best-known types of DDoS attack are TCP flood, SYN flood, UDP flood, ICMP flood, and HTTP flood. Although many methods exist to protect against DDoS attacks, they continue to pose a serious danger and to cause enormous damage, as a number of recent events [2,3,4] confirm.
The proliferation of unsecured Internet of Things (IoT) devices has led to an increase in the number and power of DDoS attacks [4]. At the same time, attacks coming from IoT devices are difficult to detect, because the devices generate traffic over the TCP protocol that is practically indistinguishable from legitimate traffic; for example, the Mirai botnet generated TCP ACK + PSH and TCP SYN floods. Experts predict the emergence of even larger botnets capable of flood attacks even without the use of amplification protocols [5,6]. Thus, timely detection of the signs of a beginning flood attack is fundamentally important.
Various methods for detecting flood attacks were described in [7]. These include signature-based methods, statistical methods [8], and anomaly-based detection methods [9]. Deep packet inspection (DPI) systems can analyze, monitor, and filter traffic and are therefore often used to detect and protect against flood attacks.
Currently, network attacks have moved up the OSI model to the presentation layer [10] and the application layer [11]. Their goal now is not only denial of service, but also penetration into the system, resulting in data theft, data modification, and control of the compromised service. This situation is largely caused by flaws in the software used for network data processing. One of the most dangerous and common vulnerabilities is "code injection" [12]. Code injection vulnerabilities are caused by software bugs and insufficient software testing [13]. According to [14], 25% of all vulnerabilities in service-oriented architecture (SOA) software systems are code injections, and the most popular type of code injection attack is SQL injection (about 18% within the SOA class). Little has changed since then: injection attacks have appeared on the OWASP Top 10 list for the past 14 years and have been ranked as the number one threat for the past eight years, remaining in the A1 position on the latest 2017 OWASP Top 10 list of the most prevalent security threats [15]. This is caused by the widespread construction of SQL queries in application code from user-supplied input without proper validation. Thus, the problem of detecting SQL injection attacks is important, as is regularly noted in reports on the information security of web applications [16,17].
A standard, well-known way to detect code injection attacks is to find a sequence of data (a so-called signature) that uniquely characterizes the attack against the background of normal requests to the web server [18]. It must be guaranteed that the signature matches only the attack and does not match any normal traffic; otherwise, false detection events will be produced. It is important to note that a signature can match only an already-discovered attack, and the matching rule is usually written by a highly qualified specialist, who must balance generality and performance. The signature database is replenished with a delay, which leaves a time gap for large-scale network attacks; this is a significant drawback of the signature approach. Another drawback is the large size of the intrusion detection system's signature database. Many signatures in the database effectively work for nothing, since they relate to attacks that pose no danger to the protected network service. The large size of the attack signature database requires significant resources for matching traffic against it and prevents the signature method from being applied comprehensively at the Internet service provider (ISP) level.
The problem of detecting code injection attacks is actively investigated by the computer security research community. Hundreds of papers have been published offering various methods for detecting SQL injection attacks, for example [19,20]. Most of these works are based on a signature approach enriched with methods for handling polymorphic representations of injections and for eliminating insignificant information from the analyzed query. Comprehensive information on the types of SQL injection attacks and the tools to combat them is presented in numerous reviews, for example in [21,22,23]. However, the problem remains relevant, because SQL injection attacks exploit the individual features of the attacked web application, the programming language in which it is written, the database structure, and the DBMS that serves the database. Not surprisingly, new SQL injection attacks are not always recognized by intrusion prevention systems. Attempts have been made to reduce the individuality of queries, which should simplify the recognition of SQL injection attacks [19], but this does not solve the problem in principle.
One of the promising approaches to identifying attacks of various kinds is the use of machine learning methods. In particular, such methods have long been used to detect SQL injection attacks and the similar XSS (cross-site scripting) attacks. The motivation for using machine learning is its ability to detect implicit dependencies in the data, which should presumably allow detecting not only known but also unknown attacks. Let us consider some research works on this topic. Machine learning-based methods differ mainly in the features selected to describe the area of interest and in how real-world events are characterized within it.
In [24], the frequencies of symbols, symbol classes, and SQL keywords were considered as distinctive features of malicious requests. The neural network recognizer was trained on a synthetic sample containing both normal SQL queries, collected during the operation of a particular site, and queries with known attacks. The peculiarity of the implemented approach was the individual configuration of the neural network recognizer for the protected site.
A similar approach was demonstrated in [25]. It used a probabilistic model of the occurrence of symbols of particular classes in the arguments of requests to the database server. Code injection attacks obviously change the structure of the query and the distribution of character probabilities. Building such a probabilistic model requires individualization for a specific application. In that work, the PHP-Nuke system, which at the time had several known SQL injection vulnerabilities, was used as the target application.
The construction of a model of valid arguments in the form of intervals was the basis of the approach proposed in [26]. The method demonstrated good efficiency in terms of false positives and false negatives (errors of the first and second kind), but required a manually created specification of the exchange protocol, which reduces its applicability.
Another way to detect database queries containing code injections is to build a profile of normal queries [27]. Requests containing injections will be marked as not matching the profile. This makes it possible to detect a wide range of SQL injection attacks, including previously unknown ones, but requires individual configuration for each application accessing the database.
To automatically identify the constant parts of SQL query texts and their arguments, a genetic algorithm was used in [28]. The anomaly detector was trained on a record of normal DBMS operations. The method showed high efficiency in detecting anomalous queries, but also a high rate of false positives, which is typical of many anomaly detection algorithms.
An original method of constructing a set of correct SQL queries was proposed in [29]. The method uses a neural network with feedback (a recurrent network) trained to recognize chains of a fixed number of tokens that form requests. The recurrent nature of the connections makes it possible to recognize token chains of any desired length. The PHP-Nuke system was used as the object of the experiments.
Sometimes, to build generalizing rules for the classification of query texts, a representation in the form of N-grams is used [30,31]. In the resulting feature space, various one-class classification methods are applied, such as support vector machines and neural networks.
We can see that many researchers propose machine learning algorithms that detect SQL injection attacks through anomaly detection. In some works, the anomaly detection algorithm is tailored to the specific application that needs to be protected from SQL injection attacks.
The main thesis of the majority of works on the detection of network attacks is that all the information needed to establish the fact of an attack is contained in one direction of traffic. Typically, the signs of an attack are sought in incoming traffic from the Internet, but it is also known that the signs of, for example, infected computers in the local network manifest themselves as specific packets sent from the local network to command centers located somewhere on the Internet.
In previous works [32,33], the authors have shown that the joint analysis of input and output data allows detecting anomalies. A promising approach to detecting SQL injection attacks [34] is based on accounting for the execution time of normal SQL queries. Such approaches can make the detection of network attacks efficient and generalizable to several kinds of attacks.
3. Method
3.1. Principles of the Proposed Approach
Based on the most general considerations, we can postulate that a network attack is a violation of the normal functioning of a computer system. By functioning, we mean the logic of data processing and of interaction with other systems. In particular, for a web server, data processing consists of forming a response to a request received from outside. Thus, not every "attack" request violates the normal functioning of the web server, even if it contains the signs of a known network attack; only a request that causes the web server to function incorrectly does.
Let us introduce the concept of the correct functioning of a web server as a collection of request-response pairs that describes all possible normal scenarios of data processing by the server. Since many classes of web servers provide very limited functionality, the incoming requests and the responses to them are mostly typical. In this case, it is possible to create a template of normal interaction with the web server, which generates a limited number of unique request-response pairs of different types. Having such a set allows us to check the similarity of each observed request and response to a request-response pair from the set and to identify network attacks whose responses fall outside this set. The formation of a template of normal interaction with the web server can be automated.
To create such a set, it is necessary to formalize characteristics of requests and responses that capture the main parameters and features of the web server's reaction to a request. The underlying assumption is that the characteristics computed during normal interaction with the web server will be similar to those of the request-response pairs from the set, whereas malicious actions that cause an incorrect reaction of the web server will change these characteristics, allowing the two to be distinguished.
In this paper, for the identification of request-response pairs, it is proposed to use the relationship between the size of the data exchanged between the server and the client and the timing of this exchange. This choice was made for the following reasons. Each request and each response is characterized by its size, and since information processing takes some time, the timing of the response data stream should also be considered an important characteristic. The cross-correlation function (XCF) of the time series of incoming and outgoing web server traffic intensity was adopted as a formalized characteristic that combines both sizes and timing [33].
The shape of the XCF captures the individuality of a request-response pair when the web server processes a normal request. Determining the space of normal XCFs makes it possible to identify anomalous XCFs that fall outside it.
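As an illustration of this feature extraction, the following minimal sketch computes normalized XCF vectors from two synchronized traffic intensity series over a sliding window. Taking the non-negative lags of a mean-centred, normalized cross-correlation, and the default window width, are assumptions of this sketch, not necessarily the authors' exact implementation.

```python
import numpy as np

def windowed_xcf(inbound, outbound, window=11):
    """Compute XCF feature vectors from two synchronized traffic
    intensity series (bytes per sampling interval).

    For each window position, the incoming and outgoing windows are
    mean-centred, cross-correlated, and normalized; the values at the
    `window` non-negative lags form one feature vector. Windows with
    no traffic at all are skipped.
    """
    inbound = np.asarray(inbound, dtype=float)
    outbound = np.asarray(outbound, dtype=float)
    features = []
    for start in range(len(inbound) - window + 1):
        x = inbound[start:start + window]
        y = outbound[start:start + window]
        if x.sum() == 0 and y.sum() == 0:
            continue  # no request/response activity in this window
        xc, yc = x - x.mean(), y - y.mean()
        xcf = np.correlate(xc, yc, mode="full")   # all 2*window-1 lags
        norm = np.linalg.norm(xc) * np.linalg.norm(yc)
        if norm > 0:
            xcf /= norm
        features.append(xcf[window - 1:])         # lags 0 .. window-1
    return np.array(features)
```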
Figure 1 shows a scheme of the anomaly detection method based on XCFs. The scheme also shows the difference in the shape of the XCFs when the web server processes normal requests and when it processes anomalous requests containing attacks.
To detect anomalies in the multidimensional feature space, one can use well-known classification methods, such as neural networks, support vector machines, etc. [35,36].
In this paper, for one-class classification, we used an autoencoder trained on the set of XCFs calculated from the time series of incoming and outgoing server traffic intensity during normal interaction with the server. The anomaly detection criterion is based on the reconstruction error of an XCF submitted to the input of the trained autoencoder. If the error exceeds the set threshold, the XCF and its corresponding request-response pair are considered abnormal.
3.2. Dynamic Model of Request Processing by the Web Server
Consider a formal model of a web server that processes incoming requests from the network and issues responses back. In this model, we characterize requests and web server responses only by their size. We impose the restriction that the request flow is ordinary (for example, Poisson arrivals), that is, at any moment, no more than one request arrives. This assumption holds with high accuracy for web servers with a small number of visitors and low traffic. In this case, it is also fair to assume that the web server's response will be fast enough and, in most cases, will finish before the next request arrives.
We also postulate that the server's response to a request does not depend on the order of requests, but only on the information contained in the request. Of course, in the general case, the server can accumulate information in its memory, and this affects its responses at later times. However, the general logic behind the design of network protocols and web servers limits the size of the response: a search engine is always limited to 10 results; there are no more than 20 items on a storefront page at a time; a photo gallery page contains only one large picture and a limited number of thumbnails of other photos, etc. The reason for this is that the web server must respond to a request within a short and guaranteed time; a response of unlimited size would slow down the responses to other requests arriving at the web server.
Figure 2a shows an example of the intensity of incoming and outgoing traffic when the web server processes requests of the same type.
Figure 2b shows an example of the intensity of incoming and outgoing traffic when the web server processes requests of two different types. Requests of different types produce intensity curves of different shapes and are shown in different colors for clarity.
One can see an analogy between this example of processing network requests and the input-output behavior described by the transfer function of a dynamic plant in automatic control theory. According to this analogy, the traffic intensity measured at discrete instants of time can be considered a signal, and the web server acts as a dynamic plant that transforms the signal. Developing this analogy, we see that the web server's processing of requests of different types can be represented by the scheme shown in Figure 3. In this case, the web server determines the query type and then processes each request using its own dedicated processing block.
In the case of a larger number of query types, this model is easily extended by adding parallel processing units. The model not only allows us to describe the functioning of the anomaly detection algorithm formally, but also suggests ways to model abnormal situations.
From automatic control theory, it is known that, for a linear model of a signal processing unit, its transfer function can be equivalently represented by means of XCFs. For a web server, the linearity of the dynamic response model is not obvious. For example, the same request to a web server can be executed quickly if the necessary information is in the DBMS cache, and slowly if this information has to be read from the server's hard disk. This difference manifests itself as two different XCFs. Accordingly, in the model, this behavior of the web server can be represented by two different processing units.
3.6. The Classification Algorithm Implementation
To protect a site based on the selected version of MyBB, a training set must first be created. In our case, it was composed of network traffic from typical operations with the forum, including the use of communication functions (posting a short message, a picture, a single paragraph, several paragraphs), changing settings, user mistakes, and incorrect actions. Following this scenario, several sessions of work with the forum were simulated, while all incoming and outgoing web server traffic was recorded with the network sniffer tshark. The captured client-server HTTP communications were converted into time series of data transmission intensity, represented as the number of bytes per unit of time. The time sampling step was 0.1 s, which made it possible to observe rapid changes in the behavior of the processing network device.
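To make this preprocessing step concrete, the sketch below bins per-packet records into two synchronized intensity series with a 0.1 s step. The record layout (timestamp, length, direction flag), such as can be exported from a tshark capture, is an assumption of this sketch.

```python
import numpy as np

def traffic_intensity(packets, step=0.1):
    """Convert per-packet records into synchronized incoming/outgoing
    intensity series (bytes per `step` seconds).

    `packets` is an iterable of (timestamp_s, length_bytes, is_incoming)
    tuples; this layout is an assumption of the sketch.
    """
    packets = list(packets)
    if not packets:
        return np.array([]), np.array([])
    t0 = min(t for t, _, _ in packets)
    t_end = max(t for t, _, _ in packets)
    n_bins = int((t_end - t0) / step) + 1
    inbound = np.zeros(n_bins)
    outbound = np.zeros(n_bins)
    for t, length, is_incoming in packets:
        i = int((t - t0) / step)          # index of the 0.1 s bin
        if is_incoming:
            inbound[i] += length
        else:
            outbound[i] += length
    return inbound, outbound
```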
Two synchronized time series of incoming and outgoing traffic intensity were used to calculate the XCF with a window width of 11 time samples. Thus, it was possible to register the server's dynamic response to a request within 0.5 s. A significant number of XCFs would have been calculated for intervals with zero traffic intensity; such intervals were excluded from the calculations. The 4830 cross-correlation functions obtained formed the normal operation profile of the web server: the training set.
The structure of the autoencoder for the classification of XCFs was chosen based on the width of the correlation window and expert judgment about the capabilities of the autoencoder. The experiments used an autoencoder with 11 neurons in the input layer, equal to the width of the discrete cross-correlation function, successive fully-connected layers of 15, 10, 6, 3, 5, 10, and 15 neurons, and 11 neurons in the output layer. The structure of the autoencoder for the one-class classification is presented in Figure 5.
The input of the autoencoder was fed the vector of XCF values x. The autoencoder was trained so that its outputs formed a vector of values y as close as possible to the input XCF vector x. Training used the criterion of minimizing the mean squared error (MSE). The autoencoder was trained in the MATLAB package by the Levenberg-Marquardt method on the training set.
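For readers who prefer a code view, the following is a minimal sketch of the 11-15-10-6-3-5-10-15-11 autoencoder built with scikit-learn's MLPRegressor. The authors trained in MATLAB with the Levenberg-Marquardt method, which MLPRegressor does not provide, so the L-BFGS solver and the tanh activation are assumptions of this sketch.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_autoencoder(xcf_train: np.ndarray) -> MLPRegressor:
    """Train an autoencoder on normal XCF vectors (shape: n_samples x 11).

    The network reproduces its input, so the training targets are the
    inputs themselves; the default loss is the squared error (MSE).
    """
    model = MLPRegressor(
        hidden_layer_sizes=(15, 10, 6, 3, 5, 10, 15),  # 3-neuron bottleneck
        activation="tanh",
        solver="lbfgs",        # stand-in for Levenberg-Marquardt
        max_iter=5000,
        tol=1e-7,
    )
    model.fit(xcf_train, xcf_train)
    return model
```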
In the middle of the autoencoder neural network, there is a "bottleneck" layer consisting of 3 neurons. The output of these neurons can be plotted in 3D to visualize how the neural network compresses the 11-dimensional input vectors. Such visualization can sometimes help in understanding how the neural network works and in distinguishing different datasets.
To find out whether the trained autoencoder recognizes the input XCF or not, the instant reconstruction error (IRE) for a given input vector x and produced output vector y is defined as the mean squared deviation between them:

$$\mathrm{IRE} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - y_i\right)^2,$$

where N = 11 is the length of the XCF vector. The IRE is close to zero if the input vector is close to one of the vectors in the training set and is higher if the input vector differs from all vectors in the training set. Therefore, the IRE characterizes the novelty of the input vector x. A threshold on the IRE implements the one-class classification algorithm by separating vectors with a low IRE value from vectors with a higher IRE value.
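A sketch of the resulting one-class decision rule is given below: the IRE is computed as the mean squared reconstruction error and compared with a threshold. Choosing the threshold as a high percentile of the IRE values over the training set is an assumption of this sketch; the paper only states that a threshold is set.

```python
import numpy as np

def instant_reconstruction_error(autoencoder, xcf):
    """IRE: mean squared deviation between an XCF vector and its
    reconstruction by the trained autoencoder."""
    xcf = np.asarray(xcf, dtype=float)
    y = autoencoder.predict(xcf.reshape(1, -1))[0]
    return float(np.mean((xcf - y) ** 2))

def fit_threshold(autoencoder, xcf_train, percentile=99.0):
    """Pick the IRE threshold as a high percentile of the training IREs
    (a hypothetical rule; the paper does not specify how it is chosen)."""
    ires = [instant_reconstruction_error(autoencoder, v) for v in xcf_train]
    return float(np.percentile(ires, percentile))

def is_anomalous(autoencoder, xcf, threshold):
    """Flag the request-response pair behind this XCF as abnormal when
    its IRE exceeds the threshold."""
    return instant_reconstruction_error(autoencoder, xcf) > threshold
```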
5. Discussion
The developed method did not rely on any human expert knowledge to label training data or to construct heuristic rules for labeling. On the contrary, it was completely unsupervised, and in the case study, expert knowledge was applied only to parameter selection: the neural network structure, the XCF width, and the time sampling rate. The criterion for distinguishing normal functioning from anomalies was inspired by control system theory and relies only on the most basic assumptions about the protected system: it should be a web server that processes clients' requests, and the dynamic character of this processing should change when normal functioning is disrupted.
The anomaly-based approach implemented in the method promises the capability to detect zero-day attacks. Indeed, there were no predefined attacks in the training data, so all anomalies were treated by the classification algorithm as unknown. An interesting side effect of the method is that it will detect all malfunctioning events of the protected web server, whether or not they are caused by malicious activity.
A significant disadvantage of the method is that all anomalies are reported as the same type, and revealing the root cause of an anomaly event requires additional effort. Nevertheless, this is the usual behavior of anomaly-based approaches.
We suggest that the method has low computational complexity on the target system, because all operations and the data volumes involved are modest for modern desktop and server CPUs, and even for multicore mobile and embedded CPUs. This confidence is based on the fact that the method uses no deep packet inspection techniques or semantic analysis of traffic content at all. The traffic intensity series needed for training can be gathered on the fly and consume little memory and storage. Since there is no need to analyze the content of the network traffic, the proposed method can protect network servers that use cryptographic protocols such as TLS.
Since the method calculates the XCF between incoming and outgoing traffic intensity both in the training phase and in the evaluation phase, it is hard to obtain appropriate data for a comparative evaluation with other methods. For example, well-known datasets such as KDD'99 do not fit the purpose of comparative testing because they do not provide the data needed to train the developed method. That is why no comparison with other methods and approaches is presented.
Moreover, the method requires training for a specific web server, and the trained classifier may not work properly with other web servers or with the same web server running on different hardware and software components. This behavior restricts the application of the method to systems with fixed functionality whose configuration does not change for long periods. Good candidates for the application of the developed method are hardware appliances such as a dedicated server with rare maintenance or an IoT device with an appropriate use case scenario.
A summary of the foreseeable disadvantages of the developed method is listed below:
multiple simultaneous requests can produce false positives if such cases were not included in the training dataset;
complex dynamic processing of requests in distributed and multi-tier network services makes the server’s response to the same request unpredictable and therefore may lead to false positives;
all changes in the server's hardware and software (including configuration settings) that may affect performance will change the dynamics of the request-response and will inevitably require retraining the classifier from scratch to prevent numerous false positives;
simultaneous execution of extraneous tasks with a noticeable performance impact that do not correlate with request processing (system administration activity, OS logging, antivirus, DB backup, scheduled tasks, etc.) degrades the classification quality;
an attack that produces a dynamic response of the server similar to one of the normal ones will be missed;
unlike signature-based attack detection methods, the detected anomaly cannot be automatically identified.
The significant level of false negative errors for flood attacks clearly means that some events will be missed by the detection system. Let us discuss whether this is important. New events after the first anomaly detection are not very informative, because they characterize the same attack, and the first detection should be sufficient to attract the attention of the system administrator. The following events are merely duplicates of the first one, so even if some of them are missed, the resulting efficiency of the detection system should not change. In real-world IDS and threat intelligence systems, duplicated detection events of the same attack are grouped together; otherwise, they would fill the dashboard with useless information. That is why we consider the delay of flood attack detection more important than the level of false negative errors after the first detection of the anomaly.
Our investigations allow us to highlight several proven and potential advantages:
high quality of attack detection for a relatively simple web server;
independence from expert labeling to train the classifier;
no need to update the classifier until there are significant changes in the web server;
capable of detecting several types of attacks and sensitive to zero-day attacks;
simple algorithms and low resource consumption for lightweight implementation;
applicable to all network protocols and capable of working with encrypted traffic.