1. Introduction
After the worldwide announcement regarding the fourth industrial revolution, China put forward the 2025 industrial manufacturing strategy [1]. Although the Industrial Internet of Things (IIoT) has been booming in recent years, it has also been subject to more and more cyber attacks and malicious behaviors, which have led to the leakage of sensitive information, damage to industrial infrastructure, and economic losses [2]. Therefore, the IIoT security issue needs to be resolved as soon as possible. Detecting malicious traffic entering and leaving the IIoT and taking the necessary precautions against malicious attacks is an effective approach. By monitoring and analyzing network traffic, potential security threats, anomalous behavior, and intrusion attempts can be detected.
Previous methods for traffic detection, such as traditional deep packet inspection (DPI) [3,4], have proven effective in identifying malicious unencrypted traffic. However, the majority of traffic today is encrypted. Since detecting encrypted malicious traffic in the IIoT can typically be treated as a traffic classification problem [5], port-based and machine-learning methods have been applied to classify encrypted malicious traffic [6], but their accuracy cannot be guaranteed when the traffic volume increases significantly. Deep-learning techniques, which automatically select traffic features and have a larger learning capacity than typical machine-learning techniques, can identify encrypted malicious traffic even when an enormous amount of encrypted traffic must be classified.
In IIoT malicious traffic classification, traditional deep-learning methods are supervised [7], which requires enough labeled data and sufficient computational power during training. Unfortunately, capturing and labeling large datasets is time-consuming and labor-intensive. In contrast, unlabeled data are abundant and easily available. Therefore, semi-supervised deep-learning methods are more suitable for classifying malicious IIoT traffic [8]. At the same time, semi-supervised learning handles noise and labeling errors better because unlabeled data smooth and regularize the model, reducing the risk of overfitting. In addition, unlabeled data help the model adapt to new domains or tasks, improving its generalization ability. This approach combines supervised and unsupervised learning, using only a small proportion of labeled data as input.
When using deep-learning models for malicious traffic detection [9], the traditional approach is to transfer large amounts of data from IIoT devices to the cloud for processing [10], which incurs higher cost and transmission latency and may also risk privacy leakage. However, if the model is deployed locally, weak local computing power leads to higher computational latency [11]. Edge servers are therefore introduced to balance computation latency, transmission latency, and privacy protection across local, edge, and cloud environments. This integration improves the model’s classification efficiency while maintaining the same accuracy in classifying malicious traffic and reducing privacy-leakage concerns.
To enhance the security performance of IIoT and tackle the challenge of detecting malicious traffic, we propose an IIoT malicious traffic classification method. Our method aims to accurately identify malicious traffic within a significant volume of encrypted traffic while minimizing the need for manual feature extraction. To achieve this, we employ a deep-learning model as the classification model. At the same time, to address the difficulty of labeling traffic captured in the real world, we choose a semi-supervised learning approach. In addition, to address the high latency of model training and classification and the associated risk of privacy leakage, we introduce an edge intelligence model. In summary, our proposed method is an edge intelligence (EI)-based approach for classifying malicious traffic in the IIoT. The main contributions of this paper are as follows:
(1) Our proposal involves a semi-supervised deep-learning method capable of classifying malicious traffic in IIoT. This method leverages a significant quantity of unlabeled data alongside a small portion of labeled data.
(2) We propose a method to improve the performance of classification models using edge intelligence. This method considers the computational latency, transmission latency, and privacy-preserving factors during model training and classification.
(3) We model the latency and privacy protection of the edge intelligence model for local, edge, and cloud separately. Then, we optimize them by quantifying the total latency and total privacy level.
(4) Experiments reveal that, compared with previous IIoT malicious traffic classification methods, our method achieves a classification accuracy of 97.55% on the UNSW-NB15 dataset while minimizing overall latency and the risk of privacy leakage.
2. Related Work
Network traffic is essential for capturing the behavior of a network [12], as it contains comprehensive information about the entire communication between source and destination hosts. Analyzing network traffic not only helps to understand and allocate network bandwidth but also helps to evaluate current network capacity utilization. In particular, by analyzing traffic patterns, abnormal behaviors, and malicious traffic characteristics, security issues can be discovered promptly. IIoT traffic data exhibit distinct characteristics, including dynamism, volume, and temporal dependence. A part of the research on IIoT traffic classification adopts traffic classification methods from traditional networks, mainly based on machine learning and deep learning.
Although machine learning is effective at classifying malicious traffic, it requires labor- and time-intensive manual feature extraction. Fu et al. [1] use clustering to detect anomalous traffic in the IIoT, proposing a hierarchical detection method that first statistically analyzes the detected traffic frequencies and then detects traffic attributes using a clustering algorithm. Niu et al. [13] propose an adaptive random forest algorithm (IARF) that adaptively updates its parameters when dealing with new types of malicious traffic and is also sensitive to traffic with few malicious samples. Ikram et al. [14] propose an MNSWOA-IPM-RF method, which divides the traffic classification problem into feature selection and classification prediction and improves the feature selection part using the whale optimization approach and the ideal point method. Yan et al. [15] propose a small-scale learning algorithm, HCA-MBGDALRM, which improves dataset processing speed through a parallel framework while also addressing the problem of data skew.
Deep learning has the advantage of automatic feature selection, which alleviates the manual feature extraction required by machine learning. Moreover, deep learning is capable of handling complex data and exhibits excellent scalability and flexibility. Transfer learning and pre-trained models in deep learning also demonstrate remarkable performance in practical applications. Researchers classify deep-learning-based anomalous traffic detection into three categories: supervised learning, semi-supervised learning, and unsupervised learning [16]. The main difference between the three categories is whether the input dataset is labeled, and recent research on IIoT malicious traffic classification has mainly focused on fully supervised learning.
The method of Wang et al. [17] achieves 86.6% accuracy in classifying 12 traffic types using a one-dimensional convolutional neural network (1D-CNN). Lin et al. [18] present an encrypted traffic recognition scheme called TSCRNN, which uses a CNN to extract abstract spatial features and introduces a stacked bidirectional LSTM to learn temporal features. Zainudin et al. [19] propose a method that detects DDoS attacks in the IIoT well, combining it with an effective XGBoost-based feature selection method. Shahin et al. [20] use an LSTM model to detect malicious traffic and enhance it with a CNN and a fully convolutional network (FCN).
There is an increasing amount of research on edge intelligence. Edge intelligence technologies are maturing and being applied in a wide range of scenarios. Zhao et al. [21] propose an edge-intelligence-based model for detecting encrypted IoT traffic, which reduces the time needed to establish the model. Zeb et al. [22] propose a new edge-native framework for intelligently predicting data traffic, and Mohammed et al. [23] also introduce the concept of edge intelligence when classifying IoT traffic. As the above work shows, combining edge intelligence with malicious traffic classification for the IIoT can also improve the training efficiency of the classification model and reduce the probability of privacy leakage. Qi et al. [24] propose a blockchain-driven traffic classification method for edge computing that addresses normal traffic classification for the IIoT and effectively reduces time overhead and memory usage. In addition, edge intelligence can reduce bandwidth requirements, enhance offline functionality, improve system reliability, and save network costs. These advantages provide technical support for IIoT security.
In our approach, we exploit features such as the temporal correlation of IIoT traffic data, classify them using deep learning (an automatic feature extraction method) together with semi-supervised learning models to reduce the dependence on labeled datasets, and finally introduce edge intelligence to improve the efficiency of model training and classification and to reduce the probability of privacy leakage. The methods in the above literature are summarized in Table 1.
3. Methodology
Overall, our strategy classifies malicious traffic in IIoT traffic data and consists of four components. The first part is data processing, where we sample the captured traffic information in three different ways and then select time series features and basic features in the data for normalization. The second part is the pre-training model of the semi-supervised training model, where we present the encoder–decoder architecture used for encoding in the pre-training phase and use unlabeled data as the input to the pre-training model. The third part is the re-training model of the semi-supervised training model. During re-training with a small quantity of labeled data, this model transfers the parameters and weights from the pre-training model. The re-training step includes decoding; then, the classifier outputs the traffic classes. The fourth part is the edge intelligence model. In this model, we build latency models and privacy-preserving models for the cloud, edge, and local sides to improve the training and classification efficiency of the semi-supervised model by optimizing the total latency and privacy level, while reducing the risk of privacy leakage. The overall model architecture is shown in Figure 1.
3.1. Data Processing
Since the traffic density is very high in real scenarios, it is necessary to sample the data. Data sampling can not only save computational resources and accelerate model training but also balance the bias and variance of the model, improving its generalization ability while maintaining data representativeness. Among various sampling methods, we selected three that are feasible in practical applications and compared their effects: random sampling, systematic sampling, and cluster sampling. Random sampling means the traffic packets are sampled with equal probability. Systematic sampling means that the first traffic packet is selected randomly and that the remaining traffic packets are selected at a fixed sampling interval. Cluster sampling means that sub-packet groups of the overall traffic packet are used as the sampling unit: the whole traffic packet is divided into several sub-packet groups, called clusters, and a complete cluster is randomly selected as the sample. Schematic diagrams of the three sampling methods are shown in Figure 2.
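As an illustration, the following minimal Python sketch shows one way the three sampling strategies could be implemented over a captured packet list; the function names and the `packets` container are ours, not part of the original implementation.

```python
import numpy as np

def random_sampling(packets, n):
    """Draw n packets uniformly at random, each with equal probability."""
    idx = np.random.choice(len(packets), size=n, replace=False)
    return [packets[i] for i in sorted(idx)]

def systematic_sampling(packets, n):
    """Pick a random starting packet, then sample at a fixed interval."""
    interval = max(len(packets) // n, 1)
    start = np.random.randint(interval)
    return [packets[i] for i in range(start, len(packets), interval)][:n]

def cluster_sampling(packets, n_clusters):
    """Split the stream into contiguous sub-packet groups (clusters) and keep one whole cluster."""
    size = max(len(packets) // n_clusters, 1)
    clusters = [packets[i * size:(i + 1) * size] for i in range(n_clusters)]
    return clusters[np.random.randint(n_clusters)]
```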
After sampling the traffic data, the time series features in the encrypted traffic are extracted, including the source port, the destination port, the payload size, the window size, and the traffic duration. Then the data are normalized, and the input features are scaled to values in the range [−1, 1].
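A minimal sketch of this normalization step, assuming the sampled records have already been assembled into a numeric feature matrix (`X_sampled` is a hypothetical name); scikit-learn's MinMaxScaler with `feature_range=(-1, 1)` performs the scaling described above.

```python
from sklearn.preprocessing import MinMaxScaler

# X_sampled: hypothetical (n_flows, n_features) matrix holding the selected features,
# e.g., source port, destination port, payload size, window size, traffic duration.
scaler = MinMaxScaler(feature_range=(-1, 1))   # scale every feature to [-1, 1]
X_scaled = scaler.fit_transform(X_sampled)     # in practice, fit on the training split only
```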
3.2. Pre-Training Model
CNNs were chosen for both the pre-training and re-training models because of their shift-invariant property, which allows CNNs to capture the output traffic pattern even if it is shifted to another input region. At the same time, CNN reduces the amount of calculation through local connections. Due to weight sharing, the network can use the same weights at different locations for the feature extraction mechanism, thus reducing the number of parameters that need to be trained.
During the pre-training stage of the model, an encoder is employed to convert the input sequence into a fixed-length vector. Subsequently, in the re-training model, a decoder is utilized to transform the vector into an output sequence. The encoder–decoder approach [30], also known as Seq2Seq, is characterized by its end-to-end learning algorithm.
In the encoder–decoder architecture [31] of the Seq2Seq model, given an input sequence $X = (x_1, x_2, \ldots, x_m)$ with length $m$, the model generates a target sequence $Y = (y_1, y_2, \ldots, y_n)$ with length $n$. Figure 3 illustrates this architecture, where the encoder hidden states are $h = (h_1, h_2, \ldots, h_m)$, the decoder hidden states are $s = (s_1, s_2, \ldots, s_n)$, and the contextual sequence is $c$.
To tackle the issue of suboptimal final classification caused by lengthy input sequences, the model [32] incorporates an attention mechanism. The attention mechanism allows the model to allocate different attention weights to different parts of the input data. By doing so, the model can focus more on the information relevant to the current task while ignoring irrelevant information. Additionally, the attention mechanism helps the model suppress noise and interference, further enhancing its performance. This updated Seq2Seq model, depicted in Figure 4, addresses the problem effectively.
In the attention model, each context sequence is a weighted sum of all hidden state vectors of the encoder:
$$c_i = \sum_{j=1}^{m} \alpha_{ij} h_j,$$
where $\alpha_{ij}$ is the attention weight assigned to the encoder hidden state $h_j$ when generating the $i$-th output. The input sequence is mapped into multiple context sequences $c_1, c_2, c_3, \ldots, c_n$, where $c_i$ is the context information corresponding to the output $y_i$ (where $i = 1, 2, 3, \ldots, n$). When the decoder predicts the output $y_i$, its result depends on the matching context sequence $c_i$ and its previous hidden state, i.e.,
$$s_i = f(s_{i-1}, y_{i-1}, c_i), \qquad y_i = g(y_{i-1}, s_i, c_i).$$
Figure 5 shows the architecture of the CNNs-based pre-training model, which is encoded in the pre-training phase.
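To make the encoder and attention computation concrete, the following PyTorch sketch shows a 1D-CNN encoder producing hidden states $h_1, \ldots, h_m$ and a dot-product attention step producing the context $c_i$; the layer sizes and the scoring function are illustrative assumptions rather than the exact configuration reported in Figure 5 and Table 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNEncoder(nn.Module):
    """1D-CNN encoder: maps a traffic feature sequence to hidden states h_1..h_m."""
    def __init__(self, in_features, hidden_dim=64):
        super().__init__()
        self.conv1 = nn.Conv1d(in_features, hidden_dim, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1)

    def forward(self, x):                 # x: (batch, m, in_features)
        x = x.transpose(1, 2)             # -> (batch, in_features, m) for Conv1d
        h = F.relu(self.conv1(x))
        h = F.relu(self.conv2(h))
        return h.transpose(1, 2)          # hidden states: (batch, m, hidden_dim)

def attention_context(h, s_prev):
    """Context c_i = sum_j alpha_ij * h_j, with weights scored against the previous decoder state."""
    # h: (batch, m, d), s_prev: (batch, d)
    scores = torch.bmm(h, s_prev.unsqueeze(2)).squeeze(2)   # (batch, m) dot-product scores
    alpha = F.softmax(scores, dim=1)                        # attention weights alpha_ij
    return torch.bmm(alpha.unsqueeze(1), h).squeeze(1)      # context c_i: (batch, d)
```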
3.3. Re-Training Model
The weights learned by the pre-trained model are transferred to the re-trained model, which is then re-trained with a small labeled dataset and finally decoded for classification. Since many traffic patterns have already been observed during pre-training, adding a small amount of manually labeled data during re-training accomplishes the classification task quickly and makes re-training converge faster.
To avoid chance experimental results, we use five-fold cross-validation: the traffic data are divided into five parts, one part is taken as the test set each time, the remaining four parts are used as the training set, and the validation is cycled five times.
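A hedged sketch of this step: the pre-trained encoder weights are copied into the re-training model, and scikit-learn's KFold drives the five-fold split. `RetrainModel`, `pretrained_encoder`, `train_one_fold`, and the labeled arrays are hypothetical names standing in for the components described above.

```python
import copy
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)   # five-fold cross-validation

for train_idx, test_idx in kf.split(X_labeled):
    model = RetrainModel()                                # decoder + softmax classifier on top of the encoder
    model.encoder.load_state_dict(                        # transfer weights learned during pre-training
        copy.deepcopy(pretrained_encoder.state_dict()))
    train_one_fold(model,
                   X_labeled[train_idx], y_labeled[train_idx],   # small labeled training split
                   X_labeled[test_idx],  y_labeled[test_idx])    # held-out fold for validation
```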
The final classifier is a softmax classifier, which solves the multi-classification problem, i.e., it can classify normal traffic and different types of malicious traffic. The softmax function is defined as
$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{c=1}^{C} e^{z_c}},$$
where $z_i$ is the output value of the $i$-th node, and $C$ is the number of output nodes, i.e., the number of classified categories.
Recent neural networks commonly employ cross-entropy as the loss function for classification problems [33]. Extensive experiments have confirmed that utilizing cross-entropy as the loss function is indeed a superior choice.
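For illustration, a minimal PyTorch training-step sketch combining the softmax classifier with the cross-entropy loss; note that `nn.CrossEntropyLoss` applies log-softmax internally, so raw logits are passed to it (`model`, `batch_x`, and `batch_y` are placeholder names).

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()          # cross-entropy loss; applies log-softmax internally

logits = model(batch_x)                    # (batch, C) raw scores over the C traffic classes
loss = criterion(logits, batch_y)          # batch_y holds integer class labels
loss.backward()                            # back-propagate before the optimizer step

probs = torch.softmax(logits, dim=1)       # explicit softmax probabilities, e.g., for reporting
preds = probs.argmax(dim=1)                # predicted traffic class per flow
```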
The CNNs-based re-training model architecture is shown in Figure 6.
Table 2 shows the model architecture parameters.
3.4. Edge Intelligence Framework
Edge intelligence (EI) technology synergistically combines the computing capabilities of end devices and edge servers, harnessing the complementary strengths of local and high-performance computing. As a result, it effectively minimizes latency and reduces energy consumption during the inference of deep-learning models [34]. Ref. [35] classifies edge intelligence into six levels, from level one to level six, with an increasing degree of edge involvement. As the EI level increases, the amount of offloaded data and the path lengths decrease, which leads to a decrease in the transmission waiting time for data offloading, an increase in data confidentiality, and a decrease in WAN bandwidth cost; however, computational latency and energy consumption increase [35]. Figure 7 shows the comparison of the six capabilities corresponding to the cloud, edge, and device sides.
As shown in Figure 7, edge intelligence offers better diversity, scalability, reliability, and latency, and privacy can also be protected. Cloud-side execution has strong computational power, but the data transmission process increases transmission latency and raises the risk of privacy leakage. Local device-side execution reduces data transmission, which lowers transmission latency and the risk of privacy leakage, but its computational power is insufficient and adds considerable computation time. Edge intelligence combines the advantages of the cloud and the local device and can thus better serve the model.
Many factors affect the choice among cloud, edge, and local execution. We consider three main aspects: computation delay, transmission delay, and privacy protection. The computation delay and transmission delay are captured in a latency model, and a privacy-protection model is established alongside it. Finally, we choose the scheme that performs better on both models.
3.4.1. Latency Model
When the task is executed locally [36], the delay is mainly the computational delay of executing the task, which is highly related to the working frequency of the local Central Processing Unit (CPU) of the user device [37]. The local computational latency of the device, $T_{\mathrm{local}}$, can be represented as
$$T_{\mathrm{local}} = \frac{\alpha_{l} D L}{f_{l}},$$
where $D$ denotes the total task volume, $\alpha_{l}$ denotes the ratio of tasks processed locally, $L$ refers to the CPU cycles necessary for executing a one-bit task [38], and $f_{l}$ denotes the corresponding CPU-cycle frequency of the user device [39].
When tasks are offloaded to the edge server [40], the edge server has abundant computational resources that reduce computational latency compared with the local device, but additional transmission latency is incurred during the offloading process. Therefore, the latency of task offloading to the edge server includes both computation latency and transmission latency [41].
The computational latency of the edge server, $T_{\mathrm{edge}}^{\mathrm{comp}}$, can be described as
$$T_{\mathrm{edge}}^{\mathrm{comp}} = \frac{\alpha_{e} D L}{f_{e}},$$
where $\alpha_{e}$ denotes the ratio of tasks offloaded to the edge server and $f_{e}$ denotes the CPU-cycle frequency of the edge server. The transmission latency of the edge server, $T_{\mathrm{edge}}^{\mathrm{trans}}$, can be formulated as
$$T_{\mathrm{edge}}^{\mathrm{trans}} = \frac{\alpha_{e} D}{B \log_{2}\!\left(1 + \frac{p g}{N_{0} B}\right)},$$
where $B$ is the channel bandwidth of the user device, $g$ denotes the channel gain, $p$ is the transmit power of the user device, and $N_{0}$ represents the power spectral density.
In summary, the total latency of the edge server, $T_{\mathrm{edge}}$, is the accumulation of computation latency and transmission latency [42]:
$$T_{\mathrm{edge}} = T_{\mathrm{edge}}^{\mathrm{comp}} + T_{\mathrm{edge}}^{\mathrm{trans}}.$$
When tasks are offloaded to the cloud server, the cloud server has the most computational power compared with local devices and edge servers, but it has the longest data transfer distance, incurring greater transfer latency and more noise interference during transmission. The latency of task offloading to the cloud server likewise includes two components.
The computational latency of the cloud server, $T_{\mathrm{cloud}}^{\mathrm{comp}}$, can be formulated as
$$T_{\mathrm{cloud}}^{\mathrm{comp}} = \frac{\alpha_{c} D L}{f_{c}},$$
where $\alpha_{c}$ denotes the ratio of tasks offloaded to the cloud server and $f_{c}$ denotes the CPU-cycle frequency of the cloud server. The transmission latency of the cloud server, $T_{\mathrm{cloud}}^{\mathrm{trans}}$, can be defined as
$$T_{\mathrm{cloud}}^{\mathrm{trans}} = \frac{\alpha_{c} D}{R_{c}},$$
where $R_{c}$ is the achievable transmission rate between the user device and the cloud server. In summary, the total latency of the cloud server, $T_{\mathrm{cloud}}$, is the accumulation of computation latency and transmission latency:
$$T_{\mathrm{cloud}} = T_{\mathrm{cloud}}^{\mathrm{comp}} + T_{\mathrm{cloud}}^{\mathrm{trans}}.$$
3.4.2. Privacy Preservation Model
As data move from local execution to edge server offloading, and eventually to cloud server offloading, the distance of data transmission increases significantly, resulting in a reduced level of data privacy and confidentiality. Therefore, the following privacy-protection model is established, which differs from encryption-based approaches in cryptography. The overall privacy level $P$ is represented by the following formula; the smaller the value of $P$, the higher the degree of privacy protection:
$$P = \alpha_{l} P_{l} + \alpha_{e} P_{e} + \alpha_{c} P_{c},$$
where $P_{l}$ denotes the privacy level when the task is executed locally and takes values in the range [0, 0.1], $P_{e}$ denotes the privacy level when the task is offloaded to the edge server and takes values in the range [0.1, 0.5], and $P_{c}$ denotes the privacy level when the task is offloaded to the cloud server and takes values in the range [0.5, 1]. A higher privacy level indicates a higher chance of privacy leakage; conversely, a lower privacy level indicates a lower chance of privacy leakage.
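The sketch below puts the latency and privacy models together numerically for one offloading split. The CPU frequencies follow the experimental setup in Section 4.1, while the task volume, CPU cycles per bit, uplink rates, per-tier privacy values, the offloading ratios, and the assumption that the partitioned model parts run sequentially are all illustrative rather than values taken from the paper.

```python
def tier_latencies(D, ratios, L=1e3,
                   f=(2.4e9, 2.7e9, 3.3e9),   # local/edge/cloud CPU frequencies (Hz), per Section 4.1
                   r=(50e6, 20e6)):           # uplink rates to edge and cloud (bit/s), illustrative
    """Per-tier latency for a task of D bits split as ratios = (a_local, a_edge, a_cloud)."""
    a_l, a_e, a_c = ratios
    T_l = a_l * D * L / f[0]                      # local: computation latency only
    T_e = a_e * D * L / f[1] + a_e * D / r[0]     # edge: computation + transmission latency
    T_c = a_c * D * L / f[2] + a_c * D / r[1]     # cloud: computation + transmission latency
    return T_l, T_e, T_c

def privacy_level(ratios, priv=(0.05, 0.3, 0.7)):
    """Overall privacy level as a ratio-weighted combination of P_l, P_e, P_c (assumed form)."""
    return sum(a * p for a, p in zip(ratios, priv))

# Example scenario: 30% local, 50% edge, 20% cloud (ratios are illustrative).
T_l, T_e, T_c = tier_latencies(D=8e6, ratios=(0.3, 0.5, 0.2))
total_latency = T_l + T_e + T_c        # assuming the partitioned model parts run sequentially
total_privacy = privacy_level((0.3, 0.5, 0.2))
```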
4. Experiments
4.1. Experimental Setup
We implemented the classification model using GPU-accelerated Python 3.7.0 and PyTorch, with CPU frequencies of 2.4 GHz (local device), 2.7 GHz (edge server), and 3.3 GHz (cloud server), corresponding to the experimental hardware platforms. Additionally, we assume that the local, edge, and cloud environments have the same task scheduling priorities when running the classification model and that the resources in each environment are capable of meeting the corresponding requirements.
We completed most of our experiments based on the UNSW-NB15 dataset, which was created by the Cyber Range Laboratory of the Australian Cyber Security Center [43]. The dataset consists of 1,776,851 training data points and 761,508 test data points, which include nine families of attacks.
To verify the generality of our approach, we also conducted experiments on other datasets. One is the CTU-13 dataset, which is an IIoT network traffic dataset. The other is the BoT-IoT dataset of IoT network traffic captured in recent years, which contains both normal traffic and botnet traffic [44]. The three datasets, UNSW-NB15, CTU-13, and BoT-IoT, are denoted by D1, D2, and D3, respectively. The data volumes and the proportions of attack types in the three datasets are shown in Table 3.
We used four metrics to evaluate the classification performance of the various methods: accuracy (AC), precision (PR), recall (RC), and F1 [18]:
$$AC = \frac{TP + TN}{TP + TN + FP + FN}, \quad PR = \frac{TP}{TP + FP}, \quad RC = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times PR \times RC}{PR + RC},$$
where $TP$, $FP$, $TN$, and $FN$ refer to true positives, false positives, true negatives, and false negatives, respectively.
There are also two evaluation metrics for latency and privacy level, which are defined from Formulas (6)–(13).
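A short sketch of how the four classification metrics could be computed with scikit-learn; macro averaging across the traffic classes is our assumption, and `y_true`/`y_pred` are placeholder arrays of integer class labels.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

AC = accuracy_score(y_true, y_pred)
PR = precision_score(y_true, y_pred, average="macro", zero_division=0)
RC = recall_score(y_true, y_pred, average="macro", zero_division=0)
F1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
```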
4.2. Experimental Results and Analysis
In the experiments, we compare the total latency and the privacy level of model training and classification under six scenarios, so that the classification model achieves lower total latency and a minimal risk of privacy leakage after incorporating edge intelligence. The classification model is divided into two parts, executed on local devices, edge servers, and cloud servers. The ratio parameters of the six scenarios are shown in Table 4, and the changes in the two evaluation metrics corresponding to the different scenarios are shown in Figure 8.
The definitions of total latency and privacy level are given in Section 3.4. Figure 8 shows that the six scenarios have different advantages and disadvantages in terms of total latency and privacy protection. The total latencies of scenarios 1, 2, and 4 are lower and do not differ much, and the threshold of the privacy level is set to 5. When the privacy level is lower than 5, privacy confidentiality is better and the possibility of privacy leakage is lower; therefore, scenarios 2–6 have a lower likelihood of privacy leakage.
Edge intelligence is introduced to reduce the total latency while reducing the possibility of privacy leakage. Combining these two factors, scenario 2 and scenario 4 are more suitable for the model in this paper. Additional metrics are used to evaluate our approach under scenarios 2 and 4.
When sampling the traffic samples, the three sampling methods, random sampling, systematic sampling, and cluster sampling, are compared; the accuracy after systematic sampling is found to increase as the training sample grows larger, whereas the accuracy under random sampling and cluster sampling does not. We speculate that the increased randomness of random sampling loses some of the key information, making it more difficult to fit the model to the true distribution, while cluster sampling can observe only local traffic patterns, which has certain limitations. The final accuracy of the three sampling methods is shown in Figure 9.
As depicted in Figure 9, the final accuracy for the three datasets tends to stabilize around 97% as the number of training samples increases when systematic sampling is used. Among the datasets, D1 achieves the highest classification accuracy of 97.55%. The other two sampling methods do not learn the traffic patterns comprehensively when the training set is small, resulting in low classification accuracy. As the training set grows, the accuracy of D1 and D3 increases more obviously, while the increase in D2 is less significant because the distribution of key information in D2 is more concentrated. In summary, our model adopts systematic sampling.
Figure 10 displays the results of experiments that were conducted on three datasets using systematic sampling and altering the number of labeled samples. The four evaluation metrics (AC, PR, RC, and F1) are shown to be stable at high levels regardless of the percentage of labeled samples, which suggests that our technique has some degree of generalizability. Specifically, at a labeled sample proportion of 1%, D1 and D3 demonstrate the highest classification accuracy, while at a labeled sample proportion of 5%, D2 exhibits the highest classification accuracy.
4.3. Comparison
On the same dataset, we compare our method with a fully supervised CNN (sCNN) for malicious traffic classification. In scenarios with a low proportion of labeled samples, our method achieves high accuracy (AC), precision (PR), recall (RC), and F1 scores. As the proportion of labeled samples continues to increase, the accuracy of the fully supervised CNN approaches that of our method.
Figure 11 illustrates the trend of the four evaluation metrics as the proportion of labeled samples increases.
As shown in Figure 11, our method produces the best classification when the proportion of labeled data is between 1% and 5%, whereas the fully supervised CNN cannot meet the classification requirement in this range; the classification accuracy of the fully supervised CNN can be maintained at approximately 97% only when the proportion of labeled data is above 50%, which illustrates the necessity of pre-training on unlabeled data to learn traffic patterns. Additionally, our approach achieves the desired goal of classifying malicious traffic even with a small amount of labeled data.
At the same time, we compare our method with four other recent IIoT malicious traffic classification methods: KTDA-ConvLaddernet (Ning et al. [29]), IARF (Niu et al. [13]), TSCRNN (Lin et al. [18]), and the method of Zhao et al. [21]. These methods are introduced in the related work.
The AC, PR, RC, and F1 values of the above four methods are shown in Table 5, and the comparison of total delay and total privacy level is shown in Figure 12.
The following expands on the comparison between the above four recent IIoT malicious traffic classification methods and our method:
KTDA-ConvLaddernet is similar to our method in its use of datasets, as it also trains the model using a small proportion of labeled data, so we also compare the accuracy of the two methods when the labeled dataset accounts for 1–5% of the total dataset. As a result, the accuracy of KTDA-ConvLaddernet is 0.2–0.7% lower than that of our method; thus, our method has a better classification effect. At the same time, comparing the total delay of model establishment and the privacy level, the total delay of KTDA-ConvLaddernet is approximately 2.4 times that of our method, and it requires long-distance data transmission when using cloud computing, which carries a great risk of privacy leakage.
IARF uses the random forest model in machine learning to improve the output. It has a good classification effect when the dataset is small, and the total delay in establishing the model is approximately 1.6 times that of our method. However, the model needs a large amount of labeled data, and features need to be extracted manually. At the same time, like KTDA-ConvLaddernet, it carries a great risk of privacy leakage. The performance of TSCRNN on the UNSW-NB15 dataset is not as good as that of the first two methods, and its accuracy differs from ours by 3.4%. Its total delay is approximately 2.2 times that of our method, and its risk of privacy leakage is also great. The method of Zhao et al. [21] introduces edge intelligence to accelerate the model and simultaneously reduce the risk of privacy leakage, but it does not consider cooperation among edge servers and other parties; its total delay is approximately 1.8 times that of our method, and its accuracy differs from ours by 7.25%. Thus, it cannot classify malicious traffic in the UNSW-NB15 dataset well.
5. Conclusions
Aiming to classify malicious traffic within a mixture of encrypted traffic data from the Industrial Internet of Things, we propose a semi-supervised deep-learning method that achieves more accurate classification. The method achieves a classification accuracy of 97.55%, a precision of 95.21%, a recall of 98.01%, and an F1 score of 96.59%. Compared with fully supervised classification methods, our approach improves accuracy by 2.55% while using a lower proportion of labeled data, aligning better with the characteristics of real-world IIoT traffic data.
During model optimization, we propose a cloud-edge-device collaboration scheme that considers factors such as computation delay, transmission delay, and privacy protection. Through a comparison of six different training scenarios, we calculate the total delay and privacy level associated with model training and classification. The results indicate that the introduction of edge intelligence reduces both the total delay and the risk of privacy leakage.
Moving forward, we plan to further optimize our proposed model. This includes more accurately determining the proportion of offloading to local-edge-cloud, as well as incorporating the energy consumption of all three parties in the comprehensive performance index. Additionally, we will continue to investigate methods for implementing active defense to protect the security of IIoT after classifying malicious traffic.