To enhance the effectiveness of features extracted from small samples, we employ inverted residuals and linear bottlenecks built from two depthwise separable convolutions: the first expands the number of channels to strengthen feature representation capability, while the second restores the channel count through pointwise convolution to prevent information loss. In addition, we employ parallel convolution for feature fusion to aid the extraction of small-sample features.
During the feature fusion process, we utilize inverted residuals and linear bottlenecks to extract local features from the obtained features. Simultaneously, we extract global features through a series of sequential operations, including multi-head attention, a Multilayer Perceptron, and Add and Norm. We then fuse the local and global features to obtain comprehensive features of the network traffic. To preserve spatial structure information and recognize the position of features in the original data, we adopt input embedding, adding positional information to the input features.
3.1. Data Preprocessing
3.1.1. Categorical Features Numericalization
In this study, the input dataset utilized is the Bot-IoT dataset, comprising 44 feature dimensions. To facilitate computational analysis, categorical features within the dataset undergo one-hot encoding. For instance, protocol features encompass five distinct types: UDP, TCP, ICMP, ARP, and IPV6-ICMP. These protocols are transformed into five-dimensional vectors, denoted as (1, 0, 0, 0, 0), (0, 1, 0, 0, 0), (0, 0, 1, 0, 0), (0, 0, 0, 1, 0), and (0, 0, 0, 0, 1), respectively, through one-hot encoding.
Similarly, subcategory features, which take eight symbolic values, are converted into eight-dimensional binary feature vectors using one-hot encoding. State features, consisting of eleven types, are transformed into eleven-dimensional binary feature vectors, and the flgs features, comprising nine types, are converted into nine-dimensional binary feature vectors.
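As an illustration, a minimal sketch of this encoding step using pandas is given below; the column name and values are hypothetical placeholders for the corresponding Bot-IoT fields, not the authors' actual preprocessing code.

```python
import pandas as pd

# Hypothetical protocol column; the Bot-IoT dataset uses similar fields ("proto", "state", "flgs").
df = pd.DataFrame({"proto": ["udp", "tcp", "icmp", "arp", "ipv6-icmp"]})

# One-hot encode the protocol feature: each of the five protocol values
# becomes a separate binary column, e.g. udp -> (1, 0, 0, 0, 0).
one_hot = pd.get_dummies(df["proto"], prefix="proto")
print(one_hot.astype(int))
```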
After the encoded categorical features are combined with the original 35 numerical features, each sample expands into a 68-dimensional feature vector. To maintain a consistent input format, zero-padding is applied to augment the dimensionality to 81, and the vector is then reshaped with NumPy's reshape method into a 9 × 9 two-dimensional matrix. Finally, a resize operation adjusts the matrix dimensions to 64 × 64, rendering it compatible with the model architecture presented in this study, as depicted in Figure 2. Similar processing steps are applied to the TON-IoT dataset.
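A minimal sketch of this padding, reshaping, and resizing step is shown below; it assumes the 68-dimensional vector is already available as a NumPy array and uses OpenCV for the resize, which is one possible choice rather than the implementation stated in the paper.

```python
import numpy as np
import cv2

x = np.random.rand(68).astype(np.float32)      # placeholder 68-dimensional feature vector

# Zero-pad from 68 to 81 dimensions, reshape to a 9x9 matrix,
# then resize to 64x64 to match the model's expected input size.
padded = np.pad(x, (0, 81 - x.size), mode="constant")
matrix = padded.reshape(9, 9)
resized = cv2.resize(matrix, (64, 64), interpolation=cv2.INTER_LINEAR)
print(resized.shape)  # (64, 64)
```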
3.1.2. Data Normalization
Because the attributes in the dataset span widely different magnitudes and value ranges, normalization is applied to improve data consistency and to avoid biasing the model towards attributes with wider ranges. This ensures that all features contribute meaningfully to the model's learning process, preventing attributes with larger ranges from dominating and information in attributes with smaller ranges from being lost.
The normalization formula utilized in this article is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$

where $x'$ represents the normalized value, $x$ represents the initial feature value, $x_{\min}$ represents the minimum feature value in the attribute, and $x_{\max}$ represents the maximum feature value in the attribute.
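A one-line NumPy sketch of this min-max normalization, applied column-wise to a hypothetical feature matrix X, is:

```python
import numpy as np

X = np.random.rand(1000, 35) * 100.0  # hypothetical matrix of numerical features

# Min-max normalization per attribute (column), mapping every feature to [0, 1];
# a small epsilon guards against constant columns.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
```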
3.2. Proposed MBConv-ViT Model
The proposed MBConv-ViT network architecture integrates two distinct feature fusion stages. The first stage comprises dual parallel layers of MobileNetV2 convolutional networks. MobileNetV2 is a lightweight neural network architecture that utilizes depthwise separable convolutions, leading to a notable reduction in parameter count. Its design incorporates an improved (inverted) residual structure, which raises model accuracy and efficiency while maintaining the network's inherent lightweight characteristics.
The first layer of the proposed architecture consists of two stacked MBConv convolutional blocks, with MBConv serving as the fundamental building block within the MobileNetV2 convolutional network. The initial MBConv block utilizes a stride of 2, while the subsequent MBConv block employs a stride of 1. Both blocks have a convolution kernel size of 3. Similarly, the structure of the second layer mirrors that of the first layer, but with different counts of convolution kernels. Specifically, the first layer comprises 96 convolution kernels, while the second layer contains 192, resulting in distinct output channel counts. Subsequently, the local features extracted by these dual layers of MBConv are fused through a convolutional layer, employing a padding size of 1 throughout the two-layer convolutional process.
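The following PyTorch sketch illustrates an MBConv block of the kind described above (channel expansion, 3 × 3 depthwise convolution, linear projection) and one branch built from a stride-2 and a stride-1 block. The expansion factor and the exact layer ordering are assumptions taken from the standard MobileNetV2 design, not details given in the paper.

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Inverted residual block: 1x1 expansion -> 3x3 depthwise conv -> 1x1 linear bottleneck."""
    def __init__(self, in_ch, out_ch, stride=1, expand=4):
        super().__init__()
        hidden = in_ch * expand
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),              # expand channels
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),                 # depthwise convolution
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),             # linear bottleneck (projection)
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

# One branch as described in the text: a stride-2 block followed by a stride-1 block (96 kernels).
branch1 = nn.Sequential(MBConv(1, 96, stride=2), MBConv(96, 96, stride=1))
```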
The computational process of the initial segment of the feature fusion network is encapsulated by Equation (2). Equation (3) describes the padding operation, while Equations (4) and (5) elucidate the changes in the dimensions of the output matrix following convolutional processing with padding. When employing equal padding values, a stride of 1 maintains the output size, while a stride of 2 reduces the output size by half.
In Equation (3), X0 represents the matrix data obtained after the data preprocessing described in Section 3.1. Since convolution operations alter the size of the input matrix, padding is performed, as indicated by Equation (3), in order to preserve the matrix size. In this equation, X represents the matrix after the padding operation, and $x_{i,j}$ denotes a specific data value within the matrix. In Equations (4) and (5), W represents the width of the matrix and H the height, ksz indicates the size of the convolutional kernel, and pd and st are the padding size and the stride step size, respectively.
Equations (6) and (7) describe the convolution operations, where V represents the convolution kernel matrix, $v_{i,j}$ signifies a specific value within the convolution kernel matrix, and k denotes the kernel size. X1 signifies the feature matrix derived from the first-layer MBConv1 convolution operation, while X2 represents the feature matrix derived from the second-layer MBConv2 convolution operation. When the stride is set to 2, the output size is halved, so after the two layers of MBConv the output feature matrices have the same spatial size but differ in channel count: the matrix generated by the MBConv1 convolution has 96 channels, while the matrix resulting from the MBConv2 convolution has 192.
Equation (8) illustrates the feature fusion between the first and second layers, performed by concatenating the results obtained from the two layers. In this equation, C represents the total number of channels, C(1) represents a channel count of 1, C(96) a channel count of 96, and so on. Xf represents the features extracted by the first part of the feature fusion network.
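A sketch of how this channel-wise concatenation and fusion might look in PyTorch is shown below; the spatial size, the kernel size of the fusion convolution, and its output channel count are placeholders rather than values given in the paper.

```python
import torch
import torch.nn as nn

# Hypothetical branch outputs: identical spatial size, 96 and 192 channels respectively.
x1 = torch.randn(8, 96, 32, 32)    # output of the MBConv1 branch
x2 = torch.randn(8, 192, 32, 32)   # output of the MBConv2 branch

# Concatenate along the channel dimension (as in Equation (8)) and fuse
# with a 3x3 convolution using padding 1; 288 output channels is a placeholder.
fuse = nn.Conv2d(96 + 192, 288, kernel_size=3, padding=1)
xf = fuse(torch.cat([x1, x2], dim=1))
print(xf.shape)  # torch.Size([8, 288, 32, 32])
```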
The following part of the network incorporates parallel components consisting of ViT and MBConv. We primarily leverage attention mechanisms to strengthen the exchange of global information among distant traffic features. Through weight allocation, the attention mechanism identifies the most relevant information by assigning it higher weights. This process is illustrated in Figure 3.
As depicted in Figure 3, the computation of attention unfolds through three sequential steps. First, the attention mechanism generates query (Query), key (Key), and value (Value) vectors from the input traffic feature data. The Query vector captures the information that requires processing, while the Key and Value vectors represent contextual information. Next, the attention mechanism calculates the similarity between the Query vector and each Key vector; this similarity score determines the relevance or importance of each key-value pair to the query. Finally, using the obtained similarity scores, the attention mechanism assigns weights to the value vectors and computes their weighted summation. This summation encapsulates the concentrated information, along with an attention map that illustrates the distribution of attention across different elements of the input data.
The entire computational process of the second part of the feature fusion network is depicted in Equation (9). This second part takes the output Xf of the first part as its input.
The features extracted from the first part of the network are input into the ViT. The feature map is divided into patches using a convolutional module, and each patch is flattened by a learnable linear projection E, converting the three-dimensional data into a one-dimensional vector pt, where pt represents the mapped vector at position t. To ensure that the output vector contains the classification label information of the image, a class token pcls is added at the head of the feature sequence. ViT relies on the position embedding Epos to understand the order of the image patches; a simple and effective one-dimensional position embedding is used, treating the patches as an ordered one-dimensional sequence, and the position embedding is added to the embedding vector. The complete embedding vector sequence is presented in Equation (10).
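A minimal PyTorch sketch of this embedding step (patch projection via a convolution, a prepended class token, and a learnable one-dimensional position embedding) is given below; the input channel count, patch size, and embedding dimension are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

embed_dim, patch = 192, 4                       # illustrative values
proj = nn.Conv2d(288, embed_dim, kernel_size=patch, stride=patch)   # patchify + linear projection E
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))              # classification token p_cls

xf = torch.randn(8, 288, 32, 32)                # hypothetical output of the first fusion stage
patches = proj(xf).flatten(2).transpose(1, 2)   # (batch, num_patches, embed_dim)
num_patches = patches.shape[1]

pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))  # 1-D position embedding E_pos
cls = cls_token.expand(patches.shape[0], -1, -1)
p0 = torch.cat([cls, patches], dim=1) + pos_embed   # embedding sequence [p_cls; p_1; ...] + E_pos
```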
ViT utilizes attention mechanisms to assign different weights to different parts of the model input, allowing the extraction of important features. For the input sequence P0, ViT first maps the image patches through linear transformations to obtain the corresponding Query vectors Q, Key vectors K, and Value vectors V. The attention weight is then computed by measuring the similarity between the Query vector and the Key vector. The calculation process of the self-attention weight is illustrated in Equation (11).
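Equation (11) presumably takes the standard scaled dot-product form used by ViT:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
```

where dk is the dimension of the Key vectors; dividing by its square root keeps the softmax input in a well-conditioned range.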
The multi-head attention mechanism extends self-attention by performing h independent self-attention computations on the input sequence, concatenating their outputs, and obtaining the final output vector through the WMHA projection. Each head uses three learnable projections WQ, WK, and WV to project Q, K, and V into different vector spaces. The multi-head attention module is shown in Equations (12) and (13).
In summary, the input sequence P0 is processed in the ViT module through N encoder layers, and layer normalization is used to improve the network's generalization ability. The computation process can be represented as shown in Equation (14).
In Equation (14), k = 1, …, N. P′k represents the feature vector obtained after passing through the MHA module and the residual connection at layer k, and Pk denotes the feature vector obtained after passing through the MLP module and the residual connection at layer k. LN represents the layer normalization layer, PN denotes the feature vector obtained after processing through the N transformer encoding layers, and y represents the final output of the encoding layers. In this paper, the MBConv-ViT model is configured with 8 heads in the multi-head attention mechanism, and the number of layers N is set to 2.
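Under these definitions, Equation (14) presumably corresponds to the standard pre-norm transformer encoder update:

```latex
P'_k = \mathrm{MHA}\!\left(\mathrm{LN}(P_{k-1})\right) + P_{k-1}, \qquad
P_k  = \mathrm{MLP}\!\left(\mathrm{LN}(P'_k)\right) + P'_k, \qquad k = 1, \ldots, N,
```

with y obtained by applying layer normalization to PN (in the original ViT, the class-token component of the normalized PN is taken as the output).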
As depicted in Equation (15), Yf represents the features extracted by the second part of the feature fusion network. These features are then fed into a fully connected network to identify malicious traffic.
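A closing sketch of this classification step is shown below; the feature dimension and the number of traffic classes are hypothetical.

```python
import torch
import torch.nn as nn

num_classes = 5                       # hypothetical number of traffic categories
yf = torch.randn(8, 192)              # hypothetical fused feature vectors Yf from the second stage

# Fully connected layer mapping the fused features to class logits for malicious-traffic identification.
classifier = nn.Linear(yf.shape[1], num_classes)
logits = classifier(yf)
print(logits.shape)  # torch.Size([8, 5])
```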