In this research, we record the event sequences generated in a DFPA and form the feature vectors of these event sequences, which serve as the input of deep learning: the spatial features of events are first obtained through a graph convolutional network (GCN). Then, the features derived from the GCN are used as the input of the GRU to find the spatio-temporal features, which are subsequently processed by an attention model, enabling the model to distinguish the importance of multiple event combinations. Finally, the output of the attention model is mapped to all the defined events of the system by the Softmax function, in order to obtain the occurrence probability of any event in the system. The schematic of this methodology is shown in Figure 2. Specifically, by setting a probability threshold, the system can detect the types and locations of events where an attack exists within a finite number of steps. In what follows, let $\mathbb{N}^{+}$ be the set of all positive integers.
3.1. Problem Definition
Definition 2. Given $s \in L(G)$, $P(s)$ is called an observation of $G$, where $P$ denotes the natural projection onto the observable events. The observable language of $G$ is the collection of all of its observations, i.e., $P(L(G)) = \{P(s) \mid s \in L(G)\}$.
Given an observation $w = e_1 e_2 \cdots e_{km}$ (with $e_i$ observable for all $i$) and $m \in \mathbb{N}^{+}$, it is divided into multiple equal-length subsequences $s_1$, $s_2$, …, $s_k$ for some $k \in \mathbb{N}^{+}$, where the length of each subsequence is $m$. In other words, the $t$-th subsequence $s_t$ of $w$ starts with $e_{(t-1)m+1}$ and ends with $e_{tm}$ given $t \in \{1, \ldots, k\}$. To facilitate the development of the methodology proposed in this research, we transform an observation subsequence into a directed graph called an event-graph.
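To make the splitting step concrete, the following minimal Python sketch divides an observation into equal-length subsequences; the function name and the choice to drop trailing symbols that do not fill a complete subsequence are our assumptions, not the paper's.

```python
def split_observation(obs, m):
    """Divide an observation (a list of event symbols) into
    consecutive, non-overlapping subsequences of length m."""
    return [obs[t:t + m] for t in range(0, len(obs) - m + 1, m)]

# Example: split_observation(list("abcacb"), 3) -> [['a','b','c'], ['a','c','b']]
```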
Definition 3. Given an observation subsequence $s = e_1 e_2 \cdots e_m$, a directed graph associated with $s$, denoted by $G_s = (V_s, E_s)$, is said to be an event-graph if $V_s = \{e_1, \ldots, e_m\}$ is the set of nodes and $E_s = \{(e_i, e_{i+1}) \mid i \in \{1, \ldots, m-1\}\}$ is the set of edges.
As defined, an event-graph derived from an observation subsequence $s = e_1 e_2 \cdots e_m$ is actually a path starting from node $e_1$ and ending with $e_m$, as schematically demonstrated in Figure 3. Given a finite number of event-graphs $G_{s_1}$, $G_{s_2}$, …, $G_{s_k}$, let $\mathcal{G}$ be the sequence of the event-graphs, denoted by $\mathcal{G} = (G_{s_1}, G_{s_2}, \ldots, G_{s_k})$. By a slight abuse of notation, write $G_{s_t} \in \mathcal{G}$ if $G_{s_t}$ is in $\mathcal{G}$, $t \in \{1, \ldots, k\}$. This research collects a group of event-graphs as a block, as shown in Figure 4, in order to accurately estimate the occurrence probability of an event and, furthermore, to predict the detection accuracy of cyber-attacks.
Given $\Delta = \delta_1 \delta_2 \cdots \delta_{|\Sigma_o|}$, it is said to be a full observation permutation if $\delta_i \in \Sigma_o$ and $\delta_i \neq \delta_j$ for all $i, j \in \{1, \ldots, |\Sigma_o|\}$ with $i \neq j$, i.e., every symbol in $\Sigma_o$ appears once in $\Delta$. For $\Sigma_o$, its full observation permutations are not unique. One trivial possibility is to produce such a permutation in lexicographical ordering. Let $\Delta(i)$ denote the $i$-th element in $\Delta$ and let us fix a full observation permutation $\Delta$. Then, $i$ is called the feature value of element $\Delta(i)$ of the fixed $\Delta$, where $i \in \{1, \ldots, |\Sigma_o|\}$. Analogously, given an observation $w$, $w(i)$ denotes its $i$-th element.
Definition 4. The feature vector of an event-graph $G_s$, $s = e_1 e_2 \cdots e_m$, with respect to a fixed full observation permutation $\Delta$ is defined as an $m$-dimensional vector $X_s = (x_1, x_2, \ldots, x_m)$ for which $x_i = j$ if $e_i = \Delta(j)$, where $i \in \{1, \ldots, m\}$ and $j \in \{1, \ldots, |\Sigma_o|\}$. In $X_s$, $x_i$ is said to be the vertex feature of $e_i$.
Given a finite number of feature vectors $X_1$, $X_2$, …, $X_k$, let $\mathcal{X}$ be the sequence of the vertex features, denoted by $\mathcal{X} = (X_1, X_2, \ldots, X_k)$. By a slight abuse of notation, write $X_t \in \mathcal{X}$ if $X_t$ is in $\mathcal{X}$, $t \in \{1, \ldots, k\}$, where $t$ can also be understood as a time step.
Example 3. Consider the DFPA $G$ in Figure 1 with $\Sigma_o = \{a, b, c\}$. Obviously, $w = abcacb$ is an observation of $G$. Let $m = 3$. Then, $w$ can be divided into two subsequences $s_1 = abc$ and $s_2 = acb$. Fix $\Delta = abc$, i.e., $\Delta(1) = a$, $\Delta(2) = b$, and $\Delta(3) = c$. We have the feature values $1$ for $a$, $2$ for $b$, and $3$ for $c$. By the definition of the feature vector of an event-graph, $X_1 = (1, 2, 3)$ and $X_2 = (1, 3, 2)$. Given an event-graph $G_s$, an $m$-dimensional vector is assigned to store all the vertex information in the graph; a two-dimensional array is defined to show the incidence relationship between vertices, which is called the adjacency matrix of $G_s$.
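As a hedged illustration of Definitions 4 and 5, the following Python sketch builds the feature vector and the path-shaped adjacency matrix of an event-graph; the helper names and the alphabet in the comment are ours.

```python
import numpy as np

def feature_vector(subseq, delta):
    """Feature vector of an event-graph (Definition 4): each entry is the
    1-based position of the corresponding event in the fixed permutation."""
    rank = {sym: j + 1 for j, sym in enumerate(delta)}
    return np.array([rank[e] for e in subseq], dtype=float)

def path_adjacency(m):
    """Adjacency matrix of an event-graph (Definition 5): a directed path
    e_1 -> e_2 -> ... -> e_m, so A[i, i+1] = 1 and all other entries are 0."""
    A = np.zeros((m, m))
    for i in range(m - 1):
        A[i, i + 1] = 1.0
    return A

# With delta = "abc": feature_vector("acb", "abc") -> array([1., 3., 2.])
```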
Definition 5. The adjacency matrix $A$ of an event-graph $G_s$ is defined as

$A(i, j) = \begin{cases} 1, & \text{if } (e_i, e_j) \in E_s, \\ 0, & \text{otherwise}, \end{cases}$

where $i, j \in \{1, \ldots, m\}$. By definition, the event-graphs of any two divided observation subsequences are isomorphic, and thus they share the same adjacency matrix. For $G_s$, its adjacency matrix $A$ and feature vector $X_s$ serve as the inputs to a GCN, which aggregates and updates the features of each vertex in $G_s$. Then, as shown in Figure 2, a block consisting of $I$ event-graphs that are separately processed by corresponding GCNs and GRUs is considered as the input to an attention model. For simplicity, suppose that the whole process is represented by a function $g$ obtained by deep learning, which expresses the spatial correlation of each event-graph and the temporal relationship of the $I$ event-graphs. By doing so, it provides a possibility to predict the features of the future $T$ event-graphs, represented by

$[X_{t+1}, \ldots, X_{t+T}] = g(A;\, X_{t-I+1}, \ldots, X_t).$

Note that $T$ is the number of event-graphs to be predicted.
3.3. Spatial Correlation Model
The exploration of the spatial correlation features of event-graphs is critical, as the correlation information hidden in the event-graphs is particularly interesting and useful, as will be shown in what follows. A GCN model constructs a filter in the Fourier domain that acts on the vertices of a graph. The spatial correlation features between the vertices are obtained through the first-order neighborhood of the graph. Then, the GCN model is built by stacking multiple convolutional layers [19], which is defined as

$H^{(l+1)} = \sigma\left(\widehat{D}^{-\frac{1}{2}} \widehat{A} \widehat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right),$ (2)

where $H^{(l)}$ is the feature vector input of the $l$-th GCN layer, $H^{(l+1)}$ is the feature vector output of the $l$-th GCN layer, $\widehat{A} = A + I_m$ is the adjacency matrix of the graph with self-loops added, $I_m$ is an identity matrix with appropriate dimensions, $\widehat{D}$ is a degree matrix with $\widehat{D}(i, i) = \sum_{j} \widehat{A}(i, j)$ and $\widehat{D}(i, j) = 0$ for $i \neq j$, and $W^{(l)}$ is a weight matrix (with appropriate dimensions) of this GCN layer. We use $\sigma(\cdot)$ to denote an activation function. Specifically, given a vector $V = (v_1, \ldots, v_n)$, the ReLU activation is chosen and is defined as

$\mathrm{ReLU}(v) = \max(0, v).$ (3)

By a slight abuse of notation, write $\mathrm{ReLU}(V) = (\mathrm{ReLU}(v_1), \ldots, \mathrm{ReLU}(v_n))$, if $v_i$ is an entry of $V$, $i \in \{1, \ldots, n\}$.
The operation of a multi-layer GCN is shown in Equation (2). In our GCN model, a two-layer GCN is selected to obtain the spatial correlation features, which is defined as

$f(A, X) = \sigma\left(\widehat{A}_{\mathrm{norm}}\, \mathrm{ReLU}\left(\widehat{A}_{\mathrm{norm}}\, X\, W_0\right) W_1\right),$ (4)

where $\widehat{A}_{\mathrm{norm}} = \widehat{D}^{-\frac{1}{2}} \widehat{A} \widehat{D}^{-\frac{1}{2}}$, $f(A, X)$ is the output of the feature vectors of $T$ event-graphs, $W_0$ is a weight matrix from the input layer to a hidden layer, with $m$ being the number of elements in the feature vector $X$ and $n_h$ the number of hidden units, and $W_1$ is a weight matrix from the hidden layer to the output layer.
The basic idea of a GCN is to average a graph’s neighboring features (including its own) and then pass them into linear layers (a linear combination of inputs and weight matrices). The topological relationship between vertices and adjacent vertices on the event-graph is captured by the GCN model. Thus, the spatial correlation features of $G_s$ are obtained.
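To ground Equation (4), here is a minimal NumPy sketch of the two-layer GCN forward pass; the function names, the assumed array shapes, and the omission of the outer activation are our assumptions.

```python
import numpy as np

def normalize_adjacency(A):
    """Compute D^{-1/2} (A + I) D^{-1/2}: add self-loops, then apply
    symmetric degree normalization."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)                    # degree of each vertex
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def two_layer_gcn(A, X, W0, W1):
    """f(A, X) per Equation (4): one hidden GCN layer with ReLU followed
    by a second GCN layer (outer activation omitted for simplicity).
    Assumed shapes: X (m, d), W0 (d, n_h), W1 (n_h, T)."""
    A_norm = normalize_adjacency(A)
    H = np.maximum(A_norm @ X @ W0, 0.0)     # first layer + ReLU
    return A_norm @ H @ W1                   # second layer
```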
Example 4. Given an event sequence $s = e_1 e_2 e_3 e_4 e_5$, the event-graph constructed by $s$ contains five vertices and four edges, which is shown in Figure 5, where $\{e_1, e_2, e_3, e_4, e_5\}$ is the set of vertices. Then, the feature vector of this event sequence is $X_s = (x_1, \ldots, x_5)$, once a full observation permutation is fixed. We show the computational procedure of $\widehat{A}_{\mathrm{norm}}$ as follows. The adjacency matrix $A$, the identity matrix $I_5$, and the new adjacency matrix $\widehat{A} = A + I_5$ are, respectively:

$A = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}, \quad I_5 = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}, \quad \widehat{A} = \begin{bmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}.$

The new degree matrix is $\widehat{D} = \mathrm{diag}(2, 2, 2, 2, 1)$. We compute $\widehat{D}^{-\frac{1}{2}}$ from the degree matrix $\widehat{D}$:

$\widehat{D}^{-\frac{1}{2}} = \mathrm{diag}(0.7071, 0.7071, 0.7071, 0.7071, 1).$

Normalizing all rows and columns by $\widehat{A}_{\mathrm{norm}} = \widehat{D}^{-\frac{1}{2}} \widehat{A} \widehat{D}^{-\frac{1}{2}}$ (approximated to four decimal places):

$\widehat{A}_{\mathrm{norm}} = \begin{bmatrix} 0.5 & 0.5 & 0 & 0 & 0 \\ 0 & 0.5 & 0.5 & 0 & 0 \\ 0 & 0 & 0.5 & 0.5 & 0 \\ 0 & 0 & 0 & 0.5 & 0.7071 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}.$

The obtained matrix $\widehat{A}_{\mathrm{norm}}$ is the essence of a GCN model, and then $\widehat{A}_{\mathrm{norm}}$ can be brought into Equation (4) to compute a two-layer GCN model, as shown in Figure 6.
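The normalization in Example 4 can be checked numerically with a few lines of NumPy (a sketch; the variable names are ours):

```python
import numpy as np

# Directed five-vertex path graph of Example 4.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = 1.0

A_hat = A + np.eye(5)                   # adjacency with self-loops
d = A_hat.sum(axis=1)                   # degrees: [2, 2, 2, 2, 1]
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt

print(np.round(A_norm, 4))              # diagonal 0.5 (last entry 1.0);
                                        # entry (4, 5) = 1/sqrt(2) = 0.7071
```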
After the spatial correlation features of the event-graph composed of the event sequence are extracted by the GCN, the cyber-attack detection model further recognizes the structure of the DFPA by computing the temporal correlation features of these event-graphs. We choose the GRU to extract the temporal correlation features of the event-graphs.
3.4. Temporal Correlation Model
The GRU is a kind of recurrent neural network, which was proposed to solve problems such as long-term memory and vanishing gradients in back-propagation [19]. To better describe the use of the temporal correlation features in our cyber-attack detection model, given a vertex feature sequence $\mathcal{X}$, let a vertex feature subsequence $(X_{t-I+1}, \ldots, X_t)$ be the input into the GCN model to obtain a sequence of corresponding outputs $(f_{t-I+1}, \ldots, f_t)$, where $f_i = f(A, X_i)$ as in Equation (4). By a slight abuse of notation, write $f_i \in (f_{t-I+1}, \ldots, f_t)$ if $f_i$ is in the sequence, $i \in \{t-I+1, \ldots, t\}$. Moreover, we tentatively express the output of each GRU as a feature matrix. Given a sequence $(f_{t-I+1}, \ldots, f_t)$ as input, our GRU model can process each function sequentially and generate a corresponding feature matrix as the output. Each output is passed to the next GRU step for $i \in \{t-I+1, \ldots, t\}$ to compute the next output of the GRU. This is an iterative process. For instance, $f_{t-I+1}$ serves as an input to the GRU and its output will be used as the input to the GRU together with $f_{t-I+2}$ to compute a new feature matrix. In this way, our cyber-attack detection model has temporal correlation features. Each feature matrix extracts the features of a function sequence, which also means that our GRU model can integrate the temporal correlation features of the event-graphs of $I$ groups in $\mathcal{G}$, as if it has memories of past event-graphs. A feature matrix is also said to be a hidden state, and the hidden state is denoted as $h_t$. The subscript $t$ corresponds to the notation $t$ in $f_t$ and has an ordering relation inherited from $\mathcal{X}$. The subscript $t$ can also reflect a temporal sequential relation. Thus, $t$ can also be thought of as a time step; hence, $h_t$ is the hidden state at time step $t$.
In the GRU model, we need two kinds of activation functions, Sigmoid and Tanh, which are, respectively, defined as follows:

$\mathrm{Sigmoid}(A)(i, j) = \dfrac{1}{1 + e^{-A(i, j)}},$ (5)

$\mathrm{Tanh}(B)(i, j) = \dfrac{e^{B(i, j)} - e^{-B(i, j)}}{e^{B(i, j)} + e^{-B(i, j)}},$ (6)

where $A(i, j)$ is the element of the $i$-th row and $j$-th column of a matrix $A$ as the input of the Sigmoid function, and $B(i, j)$ is the element of the $i$-th row and $j$-th column of a matrix $B$ as the input of the Tanh function.
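As a small sketch (NumPy assumed), both activations apply element-wise to a matrix:

```python
import numpy as np

def sigmoid(A):
    """Element-wise Sigmoid: 1 / (1 + exp(-A(i, j))) for every entry."""
    return 1.0 / (1.0 + np.exp(-A))

def tanh(B):
    """Element-wise Tanh; numerically equivalent to np.tanh(B)."""
    return (np.exp(B) - np.exp(-B)) / (np.exp(B) + np.exp(-B))
```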
The GRU model has two gates and a candidate hidden state: one gate is an “update gate”, and the other is a “reset gate”. The update gate is responsible for determining the relative weight between a previous hidden state and a current input by employing an activation function. Its output is a value matrix which governs the extent to which the new hidden state is influenced by the previous hidden state. The reset gate controls the fusion between the previous hidden state and the current input. It assists in deciding which historical information is discarded and which is retained. We define the specific operational steps of the GRU model as:

$u_t = \mathrm{Sigmoid}(W_u [f_t, h_{t-1}] + b_u),$ (7)

$r_t = \mathrm{Sigmoid}(W_r [f_t, h_{t-1}] + b_r),$ (8)

$c_t = \mathrm{Tanh}(W_c [f_t, r_t \odot h_{t-1}] + b_c),$ (9)

$h_t = u_t \odot h_{t-1} + (1 - u_t) \odot c_t,$ (10)

where $u_t$ is the update gate; $r_t$ is the reset gate; $c_t$ is the candidate hidden state at time $t$ of the GRU model; $f_t$ is the output of the GCN model as shown in Equation (4); weight matrices $W_u$, $W_r$, and $W_c$ are generated by random initialization in the update gate, reset gate, and candidate hidden state computations, respectively; deviations $b_u$, $b_r$, and $b_c$ are generated by random initialization in the update gate, reset gate, and candidate hidden state computations, respectively; $h_t$ is the hidden state of the output at time step $t$; and $[\cdot, \cdot]$ is an operation that executes concatenation vertically on matrices along rows. To concatenate matrices, they must have compatible dimensions. In other words, when we concatenate matrices vertically, they must have the same number of columns. Such a vertical concatenation operation is denoted by $[A, B]$ or $\begin{bmatrix} A \\ B \end{bmatrix}$. As an illustration, given two matrices $A$ and $B$, say

$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} 5 & 6 \end{bmatrix},$

the vertical concatenation of $A$ and $B$ is

$F = [A, B] = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix},$

i.e., $F$ is the matrix obtained by the vertical concatenation.
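A minimal NumPy sketch of one GRU step per Equations (7)–(10); the weight shapes and names are our assumptions:

```python
import numpy as np

def sigmoid(A):
    return 1.0 / (1.0 + np.exp(-A))

def gru_step(f_t, h_prev, W_u, W_r, W_c, b_u, b_r, b_c):
    """One GRU update following Equations (7)-(10); np.vstack realizes the
    vertical concatenation [.,.] and * is the element-wise product (⊙)."""
    x = np.vstack([f_t, h_prev])                           # [f_t, h_{t-1}]
    u = sigmoid(W_u @ x + b_u)                             # update gate, Eq. (7)
    r = sigmoid(W_r @ x + b_r)                             # reset gate, Eq. (8)
    c = np.tanh(W_c @ np.vstack([f_t, r * h_prev]) + b_c)  # candidate, Eq. (9)
    return u * h_prev + (1.0 - u) * c                      # hidden state, Eq. (10)

# Iterating gru_step over the GCN outputs (f_{t-I+1}, ..., f_t)
# yields the I hidden states used below.
```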
The GRU model computes a hidden state $h_t$ based on the previous hidden state $h_{t-1}$ and the current input. The computation expressed by Equations (7)–(10) is diagrammatically shown in Figure 7, where ⊙ represents the multiplication of the corresponding elements of matrices. The process of using our GRU model is shown in Figure 8, and the whole process of combining the GCN and GRU is shown in Figure 9. Due to the two gates and the candidate hidden state, the GRU model retains the changing trend of the occurrence features of historical event-graphs when capturing the functions $f_t$. Therefore, this GRU model can extract dynamic time-varying features. Namely, the GRU model is introduced in our research to learn the temporal correlation features of the occurrences of the events. The features obtained through the GRU model can be applied to an attention model to calculate attention scores. The attention model assesses the degree to which features deserve more attention and then increases the weight coefficients of the corresponding hidden states $h_t$. The attention model is introduced below.
3.5. Attention Model
An attention model is usually divided into soft attention and hard attention. In the case of a hard attention model, only the representation of a selected position is transmitted or utilized, whereas in a soft attention model, the representations of all positions are weighted and summed up. In this research, a soft attention model is assigned to learn the weight coefficients of the hidden states $h_t$, which express the trends of event-graph evolution [22].
The primary motivation for introducing the attention model is to enhance the model’s ability to focus on relevant features across different time steps, thereby improving its generalization capability. In this study, the event sequences in DESs may contain complex dependencies, and not all events contribute equally to predicting the next event. The attention model allows the model to assign different weights to the hidden states, thus emphasizing the most critical events in the sequence. This selective focus is crucial for accurately predicting the occurrence probability of future events, which is the main goal of our cyber-attack detection model.
Given a finite number of hidden states $h_1, \ldots, h_I$, their vertical concatenation, called an integrated feature $H$, is defined as $H = [h_1, h_2, \ldots, h_I]$, where $[\cdot]$ is the operation that executes the vertical concatenation defined previously. An attention scoring function is designed to compute the score (weight) of each hidden state by a multi-layer perceptron. Let us provide an explanation of the multi-layer perceptron: it is a neural network model composed of multiple neurons organized in a hierarchical structure. Through training and adjusting the connection weights between neurons, a multi-layer perceptron can learn and solve data prediction problems; thus, each hidden state’s attention score is considered as a discrete neuron. The attention scoring function by a multi-layer perceptron is defined as

$S = W_2\, \mathrm{Tanh}(W_1 H + b_1) + b_2,$ (11)

where $S = (S_1, \ldots, S_I)$ is a vector composed of the attention scores of the $I$ hidden states, with $S_i$ ($i \in \{1, \ldots, I\}$) being the attention score of a hidden state, representing the importance of this hidden state at time step $i$; weight matrices $W_1$ and $W_2$ are generated by random initialization in the first and the second linear layers of the multi-layer perceptron, respectively; deviations $b_1$ and $b_2$ are generated by random initialization in the first and the second linear layers of the multi-layer perceptron, respectively; and $H$ is the integrated feature matrix. Afterward, the output of the attention scoring function is normalized by the Softmax function, defined as

$\alpha_i = \dfrac{e^{S_i}}{\sum_{j=1}^{I} e^{S_j}},$ (12)

where $e^{(\cdot)}$ is an exponential operation, $I$ is the number of hidden states, and $\alpha_i$ is the output of the Softmax function, i.e., a normalization of $S_i$, called a normalized attention score.
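A minimal NumPy sketch of Equations (11) and (12), scoring each hidden state separately (a per-state view of Equation (11)); the shapes of $W_1$, $W_2$, $b_1$, and $b_2$ are our assumptions:

```python
import numpy as np

def attention_scores(hidden_states, W1, W2, b1, b2):
    """MLP attention scoring (Eq. (11)) followed by Softmax (Eq. (12)).
    Each hidden state h_i (a vector) receives one raw score S_i."""
    S = np.array([(W2 @ np.tanh(W1 @ h + b1) + b2).item()
                  for h in hidden_states])  # raw scores S_1 .. S_I
    exp_S = np.exp(S - S.max())             # subtract max for numerical stability
    return exp_S / exp_S.sum()              # normalized attention scores alpha_i
```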
More specifically, given a finite number of hidden states $h_1, \ldots, h_I$, let $\mathcal{H}$ be the sequence of the hidden states, denoted by $\mathcal{H} = (h_1, \ldots, h_I)$. Our goal is to more accurately predict the occurrence probability of various events in the next time step, and a weight coefficient reflects how much a hidden state can contribute to the accuracy of this prediction goal at its corresponding time step. Then, we use the attention scoring function to compute each hidden state’s attention score, which can be understood as a weight coefficient of importance with respect to the prediction target in $\mathcal{H}$ (Equation (11)). To facilitate the subsequent calculations, the normalized attention scores are obtained by normalizing the raw scores with the Softmax function; see Equation (12).
Finally, the attention model is designed to compute a context vector $C$ that reflects the global information of a hidden state sequence $\mathcal{H}$. The context vector $C$ is defined as

$C = \sum_{i=1}^{I} \alpha_i h_i,$ (13)

where $C$ is a globally $I$-accumulated vector (context vector), $\alpha_i$ is the output of the Softmax function at time step $i$, $i \in \{1, \ldots, I\}$, and $h_i$ is the hidden state at time step $i$. The whole process is shown in Figure 10.
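Continuing the sketch above, Equation (13) is simply a score-weighted sum of the hidden states:

```python
def context_vector(alphas, hidden_states):
    """Context vector C = sum_i alpha_i * h_i (Eq. (13)); alphas come from
    attention_scores and hidden_states from the GRU."""
    return sum(a * h for a, h in zip(alphas, hidden_states))
```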
3.7. Cyber-Attack Detection Model
In the cyber-attack detection model, spatial correlation features are first extracted by a GCN and then passed to a GRU for the extraction of temporal correlation features, so as to obtain the trend of the types of events that occur over time. The processing by the GCN and GRU in the cyber-attack detection model is shown in Figure 9. Finally, as shown in Figure 10, the output including the spatial correlation features and the temporal correlation features is used as the input of an attention model for generalization. In this way, the cyber-attack detection model can pay more attention to the features of important event sequences.
Suppose that we have an event sequence handled as multiple event-graphs to form an event-graph sequence $\mathcal{G}$. Each feature vector $X_t$ is input into the GCN model and the result is then fed into the GRU model to obtain $I$ hidden states, denoted by $h_1, \ldots, h_I$, which cover the spatio-temporal features. Then, the hidden states’ integrated feature $H$ obtained by concatenation is input into the attention model to compute a context vector $C$ covering the information of all $I$ hidden states. Specifically, a multi-layer perceptron and the Softmax function are assigned to compute the normalized attention scores of all hidden states, which are denoted as $\alpha_1, \ldots, \alpha_I$. The context vector $C$ is calculated by utilizing the normalized attention scores $\alpha_i$ and the corresponding hidden states. Finally, we use a focal loss as the loss function to compute the loss between the feature vectors of the predicted event-graph at the next time step and the actual feature vectors at the next time step. The loss is used to update the weight matrices in our cyber-attack detection model by back-propagation. In this way, the cyber-attack detection model can predict each event’s occurrence probability by reverting the feature vectors of event-graphs to the event types in $\Sigma_o$. For instance, from the predicted probabilities associated with an event-graph’s feature vector, we obtain the occurrence probability of each event in $\Sigma_o$ at the next time step.
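To summarize the pipeline, the following PyTorch sketch wires the pieces together under our own assumptions: the layer sizes, the use of nn.GRU rather than the hand-written cell above, and this common focal-loss formulation are ours, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttackDetector(nn.Module):
    """Illustrative GCN -> GRU -> attention -> Softmax sketch; all
    module names and dimensions are assumptions for this example."""
    def __init__(self, m, n_events, hidden=64):
        super().__init__()
        self.w0 = nn.Linear(1, hidden, bias=False)        # GCN layer 1 weights
        self.w1 = nn.Linear(hidden, hidden, bias=False)   # GCN layer 2 weights
        self.gru = nn.GRU(m * hidden, hidden, batch_first=True)
        self.score = nn.Sequential(                       # MLP attention scoring
            nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.out = nn.Linear(hidden, n_events)            # map context to events

    def forward(self, A_norm, X):
        # X: (I, m) block of feature vectors; A_norm: (m, m) normalized adjacency.
        I, m = X.shape
        H = torch.relu(A_norm @ self.w0(X.unsqueeze(-1)))  # (I, m, hidden)
        H = A_norm @ self.w1(H)                            # second GCN layer
        h_seq, _ = self.gru(H.reshape(1, I, -1))           # I hidden states
        alpha = torch.softmax(self.score(h_seq), dim=1)    # normalized scores
        C = (alpha * h_seq).sum(dim=1)                     # context vector
        return self.out(C)                                 # event logits

def focal_loss(logits, target, gamma=2.0):
    """Focal loss on the next-event prediction (a common formulation)."""
    log_p = F.log_softmax(logits, dim=-1)
    log_p_t = log_p.gather(-1, target.unsqueeze(-1))
    p_t = log_p_t.exp()
    return -((1.0 - p_t) ** gamma * log_p_t).mean()

# Usage sketch: logits = AttackDetector(m=3, n_events=3)(A_norm, X);
# focal_loss(logits, target) is then back-propagated to update the weights.
```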