To enhance the effectiveness of features extracted from small samples, we employ inverted residuals and linear bottlenecks built from two depthwise separable convolutions: the first expands the number of channels to strengthen feature representation capability, while the second restores the channel count through pointwise convolution to prevent information loss. In addition, we employ parallel convolution for feature fusion to aid the extraction of small-sample features.
During the feature fusion process, we utilize inverted residuals and linear bottlenecks to extract local features from the obtained features. Simultaneously, we extract global features through a series of sequential operations, including multi-head attention, a Multilayer Perceptron, and Add and Norm. We then fuse the local and global features to obtain comprehensive features of the network traffic. To preserve spatial structure information and recognize the position of features in the original data, we adopt input embedding, adding positional information to the input features.
3.1. Data Preprocessing
3.1.1. Categorical Features Numericalization
In this study, the input dataset utilized is the Bot-IoT dataset, comprising 44 feature dimensions. To facilitate computational analysis, categorical features within the dataset undergo one-hot encoding. For instance, protocol features encompass five distinct types: UDP, TCP, ICMP, ARP, and IPV6-ICMP. These protocols are transformed into five-dimensional vectors, denoted as (1, 0, 0, 0, 0), (0, 1, 0, 0, 0), (0, 0, 1, 0, 0), (0, 0, 0, 1, 0), and (0, 0, 0, 0, 1), respectively, through one-hot encoding.
Similarly, subcategory features, which take eight symbolic values, are converted into eight-dimensional binary feature vectors using one-hot encoding. State features, consisting of eleven types, are transformed into eleven-dimensional binary feature vectors, and the flgs features, comprising nine types, are converted into nine-dimensional binary feature vectors.
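As an illustration, a minimal sketch of this encoding step using pandas is given below; the column name and values are hypothetical placeholders for the corresponding Bot-IoT fields, not the authors' actual preprocessing code.

```python
import pandas as pd

# Hypothetical protocol column; the Bot-IoT dataset uses similar fields ("proto", "state", "flgs").
df = pd.DataFrame({"proto": ["udp", "tcp", "icmp", "arp", "ipv6-icmp"]})

# One-hot encode the protocol feature: each of the five protocol values
# becomes a separate binary column, e.g. udp -> (1, 0, 0, 0, 0).
one_hot = pd.get_dummies(df["proto"], prefix="proto")
print(one_hot.astype(int))
```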
After the encoded categorical features are combined with the original 35 numerical features, each sample expands into a 68-dimensional feature vector. To maintain a consistent input format, zero-padding is applied to augment the dimensionality to 81, and the vector is then reshaped with NumPy's reshape method into a 9 × 9 two-dimensional matrix. Finally, a resize operation adjusts the matrix dimensions to 64 × 64, rendering it compatible with the model architecture presented in this study, as depicted in Figure 2. Similar processing steps are applied to the TON-IoT dataset.
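A minimal sketch of this padding, reshaping, and resizing step is shown below; it assumes the 68-dimensional vector is already available as a NumPy array and uses OpenCV for the resize, which is one possible choice rather than the implementation stated in the paper.

```python
import numpy as np
import cv2

x = np.random.rand(68).astype(np.float32)      # placeholder 68-dimensional feature vector

# Zero-pad from 68 to 81 dimensions, reshape to a 9x9 matrix,
# then resize to 64x64 to match the model's expected input size.
padded = np.pad(x, (0, 81 - x.size), mode="constant")
matrix = padded.reshape(9, 9)
resized = cv2.resize(matrix, (64, 64), interpolation=cv2.INTER_LINEAR)
print(resized.shape)  # (64, 64)
```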
3.1.2. Data Normalization
Because the attributes in the dataset span widely different magnitudes and value ranges, normalization is applied to improve data consistency and to avoid biasing the model towards attributes with wider ranges. This ensures that all features contribute meaningfully to the model's learning process, preventing attributes with larger ranges from dominating and information in attributes with smaller ranges from being lost.
The normalization formula utilized in this article is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$

where $x'$ represents the normalized value, $x$ represents the initial feature value, $x_{\min}$ represents the minimum feature value in the attribute, and $x_{\max}$ represents the maximum feature value in the attribute.
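A one-line NumPy sketch of this min-max normalization, applied column-wise to a hypothetical feature matrix X, is:

```python
import numpy as np

X = np.random.rand(1000, 35) * 100.0  # hypothetical matrix of numerical features

# Min-max normalization per attribute (column), mapping every feature to [0, 1];
# a small epsilon guards against constant columns.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
```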
3.2. Proposed MBConv-ViT Model
The proposed MBConv-ViT network architecture integrates two distinct feature fusion stages. The first stage comprises dual parallel layers of MobileNetV2 convolutional networks. MobileNetV2 is a lightweight neural network architecture that utilizes depthwise separable convolutions, leading to a notable reduction in parameter count. Its design incorporates an improved (inverted) residual structure, which raises model accuracy and efficiency while maintaining the network's inherent lightweight characteristics.
The first layer of the proposed architecture consists of two stacked MBConv convolutional blocks, with MBConv serving as the fundamental building block within the MobileNetV2 convolutional network. The initial MBConv block utilizes a stride of 2, while the subsequent MBConv block employs a stride of 1. Both blocks have a convolution kernel size of 3. Similarly, the structure of the second layer mirrors that of the first layer, but with different counts of convolution kernels. Specifically, the first layer comprises 96 convolution kernels, while the second layer contains 192, resulting in distinct output channel counts. Subsequently, the local features extracted by these dual layers of MBConv are fused through a convolutional layer, employing a padding size of 1 throughout the two-layer convolutional process.
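The following PyTorch sketch illustrates an MBConv block of the kind described above (channel expansion, 3 × 3 depthwise convolution, linear projection) and one branch built from a stride-2 and a stride-1 block. The expansion factor and the exact layer ordering are assumptions taken from the standard MobileNetV2 design, not details given in the paper.

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Inverted residual block: 1x1 expansion -> 3x3 depthwise conv -> 1x1 linear bottleneck."""
    def __init__(self, in_ch, out_ch, stride=1, expand=4):
        super().__init__()
        hidden = in_ch * expand
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),              # expand channels
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),                 # depthwise convolution
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),             # linear bottleneck (projection)
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

# One branch as described in the text: a stride-2 block followed by a stride-1 block (96 kernels).
branch1 = nn.Sequential(MBConv(1, 96, stride=2), MBConv(96, 96, stride=1))
```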
The computational process of the initial segment of the feature fusion network is encapsulated by Equation (2). Equation (3) describes the padding operation, while Equations (4) and (5) elucidate the changes in the dimensions of the output matrix following convolutional processing with padding. When employing equal padding values, a stride of 1 maintains the output size, while a stride of 2 reduces the output size by half.
In Equation (3), X0 represents the matrix data obtained after the data preprocessing described in Section 3.1. Since convolution operations alter the size of the input matrix, padding is performed, as indicated by Equation (3), in order to preserve the matrix size. In this equation, X represents the matrix after the padding operation, and $x_{i,j}$ denotes a specific data value within the matrix. In Equations (4) and (5), W represents the width of the matrix and H the height, ksz indicates the size of the convolutional kernel, and pd and st are the padding size and the stride step size, respectively.
Equations (6) and (7) describe the convolution operations, where V represents the convolution kernel matrix, $v_{i,j}$ signifies a specific value within the convolution kernel matrix, and k denotes the kernel size. X1 signifies the feature matrix derived from the first-layer MBConv1 convolution operation, while X2 represents the feature matrix derived from the second-layer MBConv2 convolution operation. When the stride is set to 2, the output size is halved, so after the two layers of MBConv the output feature matrices have the same spatial size but differ in channel count: the matrix generated by the MBConv1 convolution has 96 channels, while the matrix resulting from the MBConv2 convolution has 192.
Equation (8) illustrates the feature fusion between the first and second layers, performed by concatenating the results obtained from the two layers. In this equation, C represents the total number of channels, C(1) represents a channel count of 1, C(96) a channel count of 96, and so on. Xf represents the features extracted by the first part of the feature fusion network.
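A sketch of how this channel-wise concatenation and fusion might look in PyTorch is shown below; the spatial size, the kernel size of the fusion convolution, and its output channel count are placeholders rather than values given in the paper.

```python
import torch
import torch.nn as nn

# Hypothetical branch outputs: identical spatial size, 96 and 192 channels respectively.
x1 = torch.randn(8, 96, 32, 32)    # output of the MBConv1 branch
x2 = torch.randn(8, 192, 32, 32)   # output of the MBConv2 branch

# Concatenate along the channel dimension (as in Equation (8)) and fuse
# with a 3x3 convolution using padding 1; 288 output channels is a placeholder.
fuse = nn.Conv2d(96 + 192, 288, kernel_size=3, padding=1)
xf = fuse(torch.cat([x1, x2], dim=1))
print(xf.shape)  # torch.Size([8, 288, 32, 32])
```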
The following part of the network incorporates parallel components consisting of ViT and MBConv. We primarily leverage attention mechanisms to strengthen the exchange of global information among distant traffic features. Through weight allocation, the attention mechanism identifies the most relevant information by assigning it higher weights. This process is illustrated in Figure 3.
As depicted in Figure 3, the computation of attention unfolds through three sequential steps. First, the attention mechanism generates query (Query), key (Key), and value (Value) vectors from the input traffic feature data. The Query vector captures the information that requires processing, while the Key and Value vectors represent contextual information. Next, the attention mechanism calculates the similarity between the Query vector and each Key vector; this similarity score determines the relevance or importance of each key-value pair to the query. Finally, using the obtained similarity scores, the attention mechanism assigns weights to the value vectors and computes their weighted summation. This summation encapsulates the concentrated information, along with an attention map that illustrates the distribution of attention across different elements of the input data.
The entire computational process of the second part of the feature fusion network is depicted in Equation (9). This second part takes the output Xf of the first part as its input.
The features extracted from the first part of the network are input into the ViT. The feature map is divided into patches using a convolutional module, and each patch is flattened by a learnable linear projection E, converting the three-dimensional data into a one-dimensional vector pt, where pt represents the mapped vector at position t. To ensure that the output vector contains the classification label information of the image, a class token pcls is added at the head of the feature sequence. ViT relies on the position embedding Epos to understand the order of the image patches; a simple and effective one-dimensional position embedding is used, treating the patches as an ordered one-dimensional sequence, and the position embedding is added to the embedding vector. The complete embedding vector sequence is presented in Equation (10).
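A minimal PyTorch sketch of this embedding step (patch projection via a convolution, a prepended class token, and a learnable one-dimensional position embedding) is given below; the input channel count, patch size, and embedding dimension are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

embed_dim, patch = 192, 4                       # illustrative values
proj = nn.Conv2d(288, embed_dim, kernel_size=patch, stride=patch)   # patchify + linear projection E
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))              # classification token p_cls

xf = torch.randn(8, 288, 32, 32)                # hypothetical output of the first fusion stage
patches = proj(xf).flatten(2).transpose(1, 2)   # (batch, num_patches, embed_dim)
num_patches = patches.shape[1]

pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))  # 1-D position embedding E_pos
cls = cls_token.expand(patches.shape[0], -1, -1)
p0 = torch.cat([cls, patches], dim=1) + pos_embed   # embedding sequence [p_cls; p_1; ...] + E_pos
```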
ViT utilizes attention mechanisms to assign different weights to different parts of the model input, allowing the extraction of important features. For the input sequence P0, ViT first maps the image patches through linear transformations to obtain the corresponding Query vectors Q, Key vectors K, and Value vectors V. The attention weight is then computed by measuring the similarity between the Query vector and the Key vector. The calculation process of the self-attention weight is illustrated in Equation (11).
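Equation (11) presumably takes the standard scaled dot-product form used by ViT:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
```

where dk is the dimension of the Key vectors; dividing by its square root keeps the softmax input in a well-conditioned range.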
The multi-head attention mechanism extends self-attention by performing h independent self-attention computations on the input sequence, concatenating their outputs, and obtaining the final output vector through the WMHA projection. Each head uses three learnable projections WQ, WK, and WV to project Q, K, and V into different vector spaces. The multi-head attention module is shown in Equations (12) and (13).
In summary, the input sequence P0 is processed in the ViT module through N encoder layers, and layer normalization is used to improve the network's generalization ability. The computation process can be represented as shown in Equation (14).
In Equation (14), k = 1, …, N. P′k represents the feature vector obtained after passing through the MHA module and the residual connection at layer k, and Pk denotes the feature vector obtained after passing through the MLP module and the residual connection at layer k. LN represents the layer normalization layer, PN denotes the feature vector obtained after processing through the N transformer encoding layers, and y represents the final output of the encoding layers. In this paper, the MBConv-ViT model is configured with 8 heads in the multi-head attention mechanism, and the number of layers N is set to 2.
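Under these definitions, Equation (14) presumably corresponds to the standard pre-norm transformer encoder update:

```latex
P'_k = \mathrm{MHA}\!\left(\mathrm{LN}(P_{k-1})\right) + P_{k-1}, \qquad
P_k  = \mathrm{MLP}\!\left(\mathrm{LN}(P'_k)\right) + P'_k, \qquad k = 1, \ldots, N,
```

with y obtained by applying layer normalization to PN (in the original ViT, the class-token component of the normalized PN is taken as the output).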
As depicted in Equation (15), Yf represents the features extracted by the second part of the feature fusion network. These features are then fed into a fully connected network to identify malicious traffic.
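A closing sketch of this classification step is shown below; the feature dimension and the number of traffic classes are hypothetical.

```python
import torch
import torch.nn as nn

num_classes = 5                       # hypothetical number of traffic categories
yf = torch.randn(8, 192)              # hypothetical fused feature vectors Yf from the second stage

# Fully connected layer mapping the fused features to class logits for malicious-traffic identification.
classifier = nn.Linear(yf.shape[1], num_classes)
logits = classifier(yf)
print(logits.shape)  # torch.Size([8, 5])
```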