Unstructured data with a huge volume and no typical characteristics are difficult to process. Network traffic is a continuously generated data flow; before features can be extracted, the flow must be divided into several small fragments for further processing. This section first analyzes the relevant content of the TLS Protocol, taking communication messages as the analysis object: each message is labeled, and the communication process between the client and the server is described. It then analyzes the domain-name characteristics of various application categories and extracts representative domain names, providing domain name data for anomaly analysis and further manual investigation of anomalies. In the fourth section, we introduce the process of fine-grained, efficient dynamic fingerprint extraction.
Frequent itemset mining is an algorithm for discovering relationships between independent items. An improved frequent itemset mining algorithm applied to fingerprint extraction is proposed, which has three characteristics: first, the method can process dynamic data, dynamically extracting feature fingerprints from continuously generated network data, finding the "breakpoints" that divide network traffic, and extracting the entire application fingerprint; second, in data analysis, the sequential dependence of messages is the key content of application behavior, so the application fingerprint is established through an N-gram model; third, the utility of a message is used as the screening criterion.
3.1. SSL/TLS Encryption Protocol and Communication Process
This section analyzes TLS 1.2, a transport-layer security protocol that works over TCP. TLS 1.2 consists of two layers: the TLS Record Protocol and the TLS Handshake Protocol. The TLS Record Protocol works under the TLS Handshake Protocol and provides basic functionality. The TLS Record Protocol defines four specific protocol content types: the Handshake Protocol, the Alert Protocol, the Change Cipher Spec Protocol, and the Application Data Protocol. The upper-layer TLS Handshake Protocol is primarily used to negotiate sessions.
The TLS Handshake Protocol sits above the TLS Record Protocol. When the client and server initiate communication for the first time, the two parties agree on the protocol version, encryption algorithm, digital certificate, etc., completing the following functions. First, Hello messages are exchanged to agree on algorithms, random numbers, and other information, and to check for session resumption. Second, the two sides exchange the necessary key parameters so that the client and server can agree on a pre-master secret. Third, digital certificates and related parameters are exchanged so that the client and server can authenticate each other. Fourth, the master secret is generated from the pre-master secret and the random numbers exchanged by the two sides. Last, it must be ensured that the client and server can verify each other's parameters and that the parameters have not been tampered with by attackers.
When the client and server need to use the TLS Protocol to communicate, they must follow the TLS Handshake Protocol to establish communication, which can be divided into three phases.
3.1.1. Establish Communication
The client sends a ClientHello Message to the server. After receiving the request, the server sends a ServerHello Message. Both the ClientHello Message and ServerHello Message contain the information necessary to establish communication.
3.1.2. Exchange Keys and Certificate
The two parties need to exchange keys and certificates. There are four types of messages related to keys: Server Certificate, Server Key Exchange, Client Certificate, and Client Key Exchange. In addition, a "Change Cipher Spec" message and a "Certificate Request" message complete some functions.
Figure 2 describes the process in detail.
3.1.3. Data Transmission
In this phase, the data that the user needs to transfer are transferred in the application data format.
The communication process shown in Figure 2 is the first complete communication process between the client and server. The actual use of encrypted communication is a continuous process, and exchanging keys and certificates for each communication consumes resources and is unnecessary. The server therefore retains the session ID of its communication with the client for a while and waits for the client to continue using the previous session. If the client's session ID cannot be found in the server's existing list, the server initiates a new session and performs a complete handshake.
Figure 3 shows a simplified TLS communication process.
As shown in Figure 2 and Figure 3, different messages are required at different stages of communication establishment. Each message has a specific format and performs certain functions. In practical applications, programmers generally implement the communication between client and server according to protocol requirements, but also make some adjustments according to functional requirements, for example, omitting some unnecessary messages or combining multiple packets to improve communication efficiency. The diversity of messages in communication and the flexibility in practice make it possible to extract fingerprint features.
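The record framing that carries these messages can be illustrated with a short sketch. The five-byte record header layout (a one-byte content type, a two-byte protocol version, and a two-byte payload length) is fixed by the TLS 1.2 specification; the helper name and the sample bytes below are illustrative assumptions only.

```python
import struct

# TLS 1.2 record header (RFC 5246): 1-byte content type,
# 2-byte protocol version, 2-byte payload length.
CONTENT_TYPES = {
    20: "ChangeCipherSpec",
    21: "Alert",
    22: "Handshake",
    23: "ApplicationData",
}

def parse_record_header(data: bytes):
    """Parse the 5-byte TLS record header into (type, version, length)."""
    ctype, major, minor, length = struct.unpack("!BBBH", data[:5])
    return CONTENT_TYPES.get(ctype, "Unknown"), (major, minor), length

# Example: a Handshake record header for TLS 1.2 (version 3.3), 45-byte body.
hdr = bytes([22, 3, 3, 0, 45])
print(parse_record_header(hdr))   # ('Handshake', (3, 3), 45)
```

Labeling each record by its content type in this way is the starting point for the message-sequence labels used later.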
Table 1 lists all message types and functions of the TLS Handshake Protocol.
The notations in the right-most column of Table 1 are used to describe the communication process. The research object of this paper is encrypted traffic, but in actual network data, common protocols are used to complete some unimportant communication between client and server, so the entire network data flow is a mixture of encrypted and non-encrypted protocols.
Figure 4 shows the hierarchy of common protocols in the network.
The protocols in Figure 4 are labeled for description, as shown in Table 2.
This section has described the communication process between the client and server by analyzing the TLS protocol contents and the communication messages. After marking each message, one can prepare for fingerprint extraction.
3.2. Network Traffic Segment
Packets constitute the data in the network, and each packet has a fixed set of attributes.
Figure 5 shows a segment of network traffic captured in the actual QQ communication process, which also contains some data packets from other applications.
In the actual process of obtaining traffic, the traffic packets sent and received by a terminal are inevitably a mixture of multiple applications, which motivates dividing traffic based on IP address transition (IAT).
The network traffic obtained here contains a large number of encrypted packets, from which only six limited attributes can be directly obtained: time, source, destination, protocol, length, and info. Any packet is therefore defined as p = (time, source, destination, protocol, length, info). Based on these directly obtainable packet attributes, this section analyzes the sequence of data packets to obtain encrypted traffic fingerprints.
Data packets generated when multiple applications installed on a terminal exchange data with their servers are sent sequentially. Within a time interval T, define the generated data flow as F = {p_1, p_2, …, p_N} (N is the number of packets obtained). Any packet p_i contains six attributes: p_i = (time_i, source_i, destination_i, protocol_i, length_i, info_i).
The object to be dealt with is the data flow defined above. Given a data flow of mixed data from multiple applications, the proposed method should accomplish two tasks. First, process the data stream F and divide the packet sequence {p_1, p_2, …, p_N} into several small data stream segments F = {f_1, f_2, …, f_M} according to the relevant attributes and message analysis. Second, based on the data stream fragments, realize the dynamic extraction of fine-grained fingerprint features through the proposed method. The specific method is described in Section 3. The extracted application fingerprint is a message sequence containing encrypted messages: fp = (m_1, m_2, …, m_k).
This paper proposes a new method for dividing the data flow F, based not on a simple time interval but on the six basic attributes of packets, to find the "breakpoints" of network traffic and thereby divide it accurately.
In the process of sending and receiving data packets, a network device completes one "action" per time interval, similar to an operating system in which CPU time is divided into time slices and each slice processes one process. Such an "action" corresponds to an extractable application fingerprint. Therefore, in dividing data flows, the determination of "breakpoints" is based on this feature of how network devices process data. The "breakpoint" is determined according to the partitioning performance of each attribute of network traffic, that is, by choosing the optimal partitioning attribute. Among the six basic attributes shared by network packets, three can be used for network traffic partitioning: time, source IP address, and destination IP address. In this paper, network traffic is considered as a whole when determining "breakpoints", and the partitioning performance of these three attributes is established from the perspective of information entropy in information theory.
This paper introduces information entropy to calculate the information value contained in a sample. The basic idea is that if the elements of a sample follow a specific rule, they can be characterized statistically and their changes predicted in advance, which means the information value of the sample is not high and its information entropy is small. Conversely, if all elements of the sample are disordered and unpredictable, the amount of information contained in the sample is large, and the information entropy is naturally larger. Equation (1) is the definition of information entropy:

H(X) = −Σ_{i=1}^{n} p_i log2 p_i   (1)

where p_i is the probability of the ith possible value of X.
The network traffic classification process divides the packets, according to their continuously discretized attributes, into fragments with smaller information entropy. Each fragment containing less information entropy means its elements are regular and will repeat. This matches the nature of an application fingerprint: its information entropy is small, it repeats across traffic divisions, and it is representative, so it can be used as an application fingerprint to identify traffic. Information gain is used as the measurement standard when selecting partition attributes. Information gain is the change in entropy before and after the dataset is partitioned; (2) is the definition of information gain:

Gain = H(parent) − Σ_{i} (N_i / N) H(child_i)   (2)
The classification process can be described as a tree structure, in which H(parent) represents the entropy of the parent node's attribute and H(child_i) represents the entropy of the several child nodes. The larger the information gain, the more information (the more impurity) the sample contains before classification and the smaller the information entropy after classification, i.e., the purer the sample elements of the child nodes. Each sample is then more representative of the application fingerprint, so division based on IP address is better than division based on time.
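The entropy and information-gain criterion above can be sketched in a few lines of Python. The packet labels and the two candidate partitions below are toy assumptions, chosen only to show why a partition that yields pure fragments (here, by IP pair) scores a higher gain than one that mixes applications (here, by time).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H = -sum(p_i * log2 p_i) over label frequencies, per (1)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partitions):
    """Gain = H(parent) - sum(|child|/|parent| * H(child)), per (2)."""
    n = len(labels)
    children = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - children

# Toy example: packets labeled by application. Splitting by IP pair
# separates the applications; splitting by time mixes them.
labels = ["qq", "qq", "web", "web"]
by_ip   = [["qq", "qq"], ["web", "web"]]   # pure children -> gain = 1.0
by_time = [["qq", "web"], ["qq", "web"]]   # mixed children -> gain = 0.0
print(information_gain(labels, by_ip), information_gain(labels, by_time))
```

The higher gain for the IP-based partition mirrors the argument above: purer child fragments are more representative of an application fingerprint.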
In network traffic, network packets are continuous-discrete samples, and network traffic is the packet sequence F = {p_1, p_2, …, p_N}, where, for each packet p_i, Δt_i is the time difference between adjacent packets and (s_i, d_i) is the IP address pair at the two ends of the communication, analyzed as an unordered pair. Therefore, according to the method of time-interval division, the information gain of the obtained samples is:

Gain_time(F) = H_time(F) − Σ_{i=1}^{k} (N_i / N) H_i   (3)

H_time(F) is the information entropy of the network traffic time intervals, as shown in (4):

H_time(F) = −Σ_{j} p_j log2 p_j   (4)

H_i is the information entropy in the ith partition, which can be calculated as follows:

H_i = −Σ_{j} p_{ij} log2 p_{ij}   (5)

The other division method is division based on IP address, with gain Gain_IP(F). The calculation method is as follows:

Gain_IP(F) = H_IP(F) − Σ_{i=1}^{m} (N_i / N) H_i   (6)
In the actual division process, the data flow is simply divided into small fragments; the information entropy analysis only proves that the IP-address-based division method proposed in this paper, IAT, is more complete and accurate than division by time interval. Specific results of the two division methods on actual network communication data are calculated in the experimental section according to Formulas (3) and (6).
IAT is defined as follows: given a data stream F, any data packet p_i has a time attribute t_i and an IP address pair (s_i, d_i). If the IP address pair changes between two adjacent packets, the data stream is divided into two segments at that point. Swapping the order of the two IP addresses is not considered a change.
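The IAT rule just defined can be sketched as follows; the dict-based packet representation and the field names ('src', 'dst') are assumptions for illustration.

```python
def split_by_iat(packets):
    """Split a packet stream wherever the unordered IP-address pair changes.

    Each packet is a dict with at least 'src' and 'dst' keys; swapping
    src and dst (request vs. reply) does not count as a change.
    """
    segments, current, prev_pair = [], [], None
    for p in packets:
        pair = frozenset((p["src"], p["dst"]))   # unordered pair
        if prev_pair is not None and pair != prev_pair:
            segments.append(current)             # "breakpoint" found
            current = []
        current.append(p)
        prev_pair = pair
    if current:
        segments.append(current)
    return segments

flow = [
    {"src": "10.0.0.1", "dst": "1.1.1.1"},
    {"src": "1.1.1.1", "dst": "10.0.0.1"},   # reply: same pair, same segment
    {"src": "10.0.0.1", "dst": "8.8.8.8"},   # new pair: breakpoint
]
print([len(s) for s in split_by_iat(flow)])   # [2, 1]
```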
A segment of network traffic obtained by IP address division is classified into N-gram types based on message types. Each segment is first expressed as a sequence, with each message represented by a label defined above. For example, the 42 packets in Figure 3 can be divided into 17 segments, as shown in Figure 6.
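Turning a labeled segment into N-grams, as the fingerprint model requires, can be sketched as follows; the short label strings used here (CH, SH, CT, AD) are hypothetical stand-ins for the labels defined in Table 1.

```python
def ngrams(labels, n=2):
    """Return the n-gram sequence of a labeled message segment, preserving order."""
    return [tuple(labels[i:i + n]) for i in range(len(labels) - n + 1)]

# Hypothetical label sequence for one segment (CH=ClientHello, SH=ServerHello,
# CT=Certificate, AD=ApplicationData).
segment = ["CH", "SH", "CT", "AD"]
print(ngrams(segment, 2))   # [('CH', 'SH'), ('SH', 'CT'), ('CT', 'AD')]
```

Because N-grams preserve adjacency, they capture the sequential dependence of messages that the method treats as the key content of application behavior.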
Some of the fragments in Figure 2, Figure 3, Figure 4 and Figure 5 contain only one data packet. Most such fragments are caused by out-of-order delivery and retransmission of data sent by the terminal, and these fragments of network traffic are meaningless. The obtained fragments are represented by the labels in Table 1 and Table 2, as shown in Table 3.
3.3. DNS Domain Text Clustering
The Domain Name System (DNS) is a database that maps IP addresses to domain names; the DNS server performs this function. Among the obtained network data packets, DNS messages are among the few unencrypted messages with definite meaning. The domain names in DNS packets are extracted for text analysis, which enables automatic network traffic analysis and largely replaces some manual analysis. In addition, the corresponding URL can be quickly obtained during traffic classification and anomaly detection, which facilitates further manual analysis and control within the massive network data flow. Part of a domain name derives its meaning from natural language, and part is a combination of letters that represents a certain meaning. By mining domain name information, one can extract effective text information from domain names, which greatly helps the analysis of encrypted traffic.
This section analyzes the characteristics of the domain names of each application category and extracts representative domain names to provide domain name data for exception analysis and further manual exception analysis. Domain name analysis uses incremental clustering to process dynamic data streams, including extraction of the related-word set (RS), feature vector generation, and domain name clustering.
Figure 7 shows the process.
Given a domain name, it is split according to "." to extract its component strings and generate a string set, which is defined as an item in the domain name correlation set RS.
A network data flow usually generates multiple DNS messages, so RS consists of multiple items, RS = {rs_1, rs_2, …}, and because new DNS messages are generated continuously, RS keeps expanding. Based on RS, a DNS graph is built whose vertices represent the strings in domain names and whose edges represent the relationships between domain names. Edges are added according to the domain name correlation set RS: within one correlation set item, edges are added with direction from the higher-level domain label to the lower-level label, thus forming a directed DNS graph.
Different from natural language, a domain name can be any combination of characters, so as network traffic is continuously produced, new domain names, and therefore new graph nodes, are continuously generated. To handle this and reduce computational complexity to a certain extent, a large number of DNS messages must be obtained and node information extracted from them. After the DNS graph is established, the importance of each node is used as the feature vector of DNS messages.
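The RS extraction and DNS graph construction described above can be sketched as follows; the example domains are hypothetical, and the edge direction (higher-level label to lower-level label) follows the rule stated earlier.

```python
from collections import defaultdict

def domain_item(domain):
    """Split a domain on '.' into its component strings (one RS item)."""
    return domain.split(".")

def build_dns_graph(rs):
    """Directed DNS graph: edge from each higher-level label to the next lower one."""
    graph = defaultdict(set)
    for item in rs:
        labels = list(reversed(item))          # highest level (TLD) first
        for hi, lo in zip(labels, labels[1:]):
            graph[hi].add(lo)
    return graph

# Hypothetical domains for illustration only.
rs = [domain_item("img.example.com"), domain_item("api.example.com")]
g = build_dns_graph(rs)
print(sorted(g["example"]))   # ['api', 'img']
```

Shared labels such as "example" accumulate out-edges as RS grows, which is what makes node importance a useful feature in the next step.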
The PageRank algorithm was first used in the Google search engine: in order to rank web pages accurately, the number and quality of links on a page are used to measure the page's importance. The directed graph G = (V, E) constructed from DNS messages is similar to web-page ranking in search engines and can be used to find important node information. Assume that n is the number of nodes, V and E represent the vertices and edges, and the PageRank value PR(i) of vertex i is shown in (7):

PR(i) = Σ_{j ∈ In(i)} PR(j) / Out(j)   (7)

where In(i) is the set of vertices with an edge to i and Out(j) is the out-degree of vertex j.
Calculate the PageRank value of each node, and use R to represent the vector of PageRank values of all nodes, as shown in (8):

R = (PR(1), PR(2), …, PR(n))^T   (8)
Assume that A is the adjacency matrix of graph G, as shown in (9):

a_ij = 1 / Out(j) if there is an edge from j to i, and a_ij = 0 otherwise   (9)
Then, (7) can be converted into (10):

R = A R   (10)
In reality, the process of users surfing the Internet is relatively complex. The PageRank algorithm assumes that surfing the Internet is a random process. In order to make A meet this condition, Equation (10) is improved, as shown in (11):

A' = f A + ((1 − f) / n) E   (11)
E is the all-ones matrix and f is the attenuation factor; using Σ_i PR(i) = 1, (11) can be converted into:

R = f A R + ((1 − f) / n) · 1   (12)

where 1 is the all-ones vector.
Therefore, (7) can be expressed as follows:

PR(i) = f Σ_{j ∈ In(i)} PR(j) / Out(j) + (1 − f) / n   (13)
The calculation of PageRank is an iterative process: given an initial value, the PageRank values of the nodes are computed repeatedly, and the calculation stops when the change falls below a threshold. Calculating the PageRank values of domain names extracts the important domain names; the appearance of these important domain names often carries a certain meaning, and manual analysis of newly appearing important domain names can play an important role in anomaly detection.
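The iterative calculation described here, using the update PR(i) = f · Σ_{j ∈ In(i)} PR(j)/Out(j) + (1 − f)/n until the change falls below a threshold, can be sketched as follows; the small DNS graph is hypothetical.

```python
def pagerank(graph, f=0.85, tol=1e-6, max_iter=100):
    """Iterative PageRank with attenuation factor f.

    graph maps each node to the set of nodes it points to; iteration
    stops when the total change falls below tol.
    """
    nodes = list(graph)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}          # initial value
    for _ in range(max_iter):
        new = {}
        for v in nodes:
            incoming = sum(pr[u] / len(graph[u]) for u in nodes if v in graph[u])
            new[v] = f * incoming + (1 - f) / n
        change = sum(abs(new[v] - pr[v]) for v in nodes)
        pr = new
        if change < tol:                       # threshold reached
            break
    return pr

# Hypothetical DNS graph: 'com' -> 'example' -> {'img', 'api'}
graph = {"com": {"example"}, "example": {"img", "api"}, "img": set(), "api": set()}
pr = pagerank(graph)
print(max(pr, key=pr.get))   # example
```

Here "example" ranks highest because it receives the full weight of its parent, matching the intuition that heavily shared labels are the important ones.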
The generation of network DNS messages is continuous. In order to process newly generated messages in real time, the single-pass incremental clustering algorithm is adopted. First, the mean of the feature vectors in each class is calculated as the cluster center; whenever a new message is generated, similarity is used as the judgment criterion, as shown in (14):

sim(d, c) = (Σ_{i=1}^{n} d_i c_i) / (sqrt(Σ_{i=1}^{n} d_i^2) · sqrt(Σ_{i=1}^{n} c_i^2))   (14)
c is the cluster center vector, and n is the number of features of a DNS message. The class of a DNS message is determined through this similarity calculation. After a certain period, as new DNS messages have been generated, the cluster centers are recalculated according to the clustering result.
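The single-pass procedure can be sketched as follows, using cosine similarity as the criterion and the mean of member vectors as the cluster center; the two-dimensional vectors and the threshold value are toy assumptions.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity, per (14)-style dot product over vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(vectors, threshold=0.8):
    """Single-pass clustering: assign each new vector to the most similar
    cluster center if similarity >= threshold, else start a new cluster."""
    centers, clusters = [], []
    for v in vectors:
        best, best_sim = None, 0.0
        for i, c in enumerate(centers):
            s = cosine(v, c)
            if s > best_sim:
                best, best_sim = i, s
        if best is not None and best_sim >= threshold:
            clusters[best].append(v)
            members = clusters[best]
            # recalculate the cluster center as the mean of its members
            centers[best] = [sum(col) / len(members) for col in zip(*members)]
        else:
            centers.append(list(v))
            clusters.append([v])
    return clusters

vecs = [[1, 0], [0.9, 0.1], [0, 1]]
print(len(single_pass(vecs)))   # 2
```

Because each vector is examined exactly once, the method can keep up with a continuously generated DNS message stream.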