1. Introduction
The Internet of Things (IoT) comprises physical objects equipped with sensors and software that connect and share data with other devices through the internet. These features enable them to collect, transmit, and receive data, which are typically used to interact with, control, and observe the physical environment. Data collected by these devices can be analyzed locally or sent to the cloud via gateways or edge devices [1]. IoT devices facilitate communication, data sharing, and automated actions across various domains, including homes, industries, cities, healthcare, agriculture, transportation, and retail, leading to their extensive deployment [2].
This growth has also led to more traffic in cyberspace and a rise in advanced intrusion attacks. Attacks on IoT systems can cause significant problems that affect both the targeted devices and the broader network infrastructure, compromise data integrity and privacy, and even pose risks to physical safety. These attacks exploit vulnerabilities within IoT systems, making it easy to launch various cyber threats such as DDoS attacks, botnets, malware infections, and ransomware [3]. It is essential to safeguard the IoT infrastructure against potential threats to minimize the risks of intrusion attacks on IoT systems. This can be accomplished by implementing intrusion detection systems (IDSs). An IDS acts as a digital watchdog for networks: it monitors traffic for unusual activity and alerts administrators when anything suspicious is found. Moreover, an advanced IDS can detect both known and novel threats, making it essential for keeping networks safe.
A conventional IDS operates using signature and anomaly detection methods [
4]. These traditional approaches have several inherent limitations. Firstly, the rules or signatures used for detection require frequent updates to keep pace with the continuously changing landscape of cyber threats; failure to update them promptly can lead to missed detection of new or modified attacks. Secondly, the low accuracy of these systems frequently results in a high rate of false positives, with benign activities mistakenly classified as threats while actual threats go undetected. Thirdly, conventional IDSs tend to generate a high number of false alarms, which causes alert fatigue in security analysts and can lead them to overlook genuine threats among the false ones. To address these challenges, implementing IDSs with machine learning (ML) and deep learning (DL) techniques has been proposed [5,6,7,8,9,10,11].
ML and DL techniques have the potential to significantly improve the performance of IDSs. When trained on large datasets of network traffic, these models can recognize intricate patterns and irregularities that may indicate malicious behavior [12]. Unlike traditional rule-based systems, ML and DL models can adapt and evolve as new threats emerge, providing a more robust and proactive approach to threat detection. One key advantage of using ML and DL for IDSs is their ability to handle high-dimensional and heterogeneous data sources. IoT systems generate vast amounts of data from various devices and sensors, making it challenging for traditional methods to effectively analyze and correlate this information. ML and DL models can process and extract meaningful insights from these diverse data sources, enabling more comprehensive and accurate detection of potential threats across the entire IoT infrastructure.
However, several challenges may arise that can impact the performance and effectiveness of these models. First, when a model overfits to the training data and cannot generalize well to new, unseen data, it either fails to identify real threats or produces an excessive number of false positives [13]. Second, the presence of unimportant or irrelevant features in network traffic data can introduce noise and obscure meaningful patterns [
14]. Third, large datasets with many features, which are common in IoT and network environments, can lead to higher computational costs and longer training times for ML/DL models, which can be particularly challenging in resource-constrained IoT devices or edge computing environments with limited computational power, storage, and memory [
15]. Addressing these challenges is crucial for the successful implementation of ML/DL techniques for IDSs in IoT systems and requires careful model selection, tuning, and optimization to ensure optimal performance, accuracy, and efficiency while considering the constraints of the target environment.
In this study, we employ the UNSW-NB15 dataset and the NSL-KDD dataset and perform essential data preprocessing steps to prepare the data for analysis. Firstly, we address the class imbalance issue by balancing the class distribution to prevent bias towards the majority class in the model. This step is crucial for accurate anomaly detection in IoT networks, where malicious activities may be underrepresented in the data. Secondly, we employ feature selection techniques to identify and retain the most relevant features from the dataset. This not only improves the model’s performance by reducing noise and irrelevant information but also reduces computational costs and training times, which is particularly important in resource-constrained IoT environments.
To optimize the construction of an effective intrusion detection system (IDS) for IoT networks and tackle the challenges mentioned above, we train two ensemble models: one using a support vector machine (SVM) with bagging and another using long short-term memory (LSTM) with stacking. The SVM model is created by combining multiple SVM classifiers, each trained on different subsets of the data using bagging. The LSTM model is created by combining multiple LSTM models using stacking. This model can handle sequential data, learn complex features, generalize well, and integrate effectively with ensemble methods [
16]. The models are evaluated using several metrics, including accuracy, precision, recall, F-measure, overfitting value, and the ROC curve. We also monitor the computation time during training, as real-time anomaly detection is crucial for IoT networks. Based on these metrics, the LSTM stacking model with ANOVA feature selection proves to be the superior model, demonstrating the most accurate, reliable, and efficient anomaly detection capabilities. Additionally, we implement the model on a Raspberry Pi 3 Model B+ and measure the model’s loading time.
Our paper presents the following contributions:
We use a class balancing approach to address biased models that perform poorly on minority classes.
We use a feature selection approach to improve the prediction performance and reduce complexity, resulting in faster training times and reduced computational resources.
We evaluate the proposed approach on ML and DL IDSs designed for binary classification; since the datasets consist of numerous features, the goal is to retain only the features that are highly correlated with the class.
We evaluate the performance of the SVM bagging and LSTM stacking models using several metrics, including accuracy, precision, recall, F-measure, overfitting value, ROC curve, model size, and computation time during training. We also measure the loading time on a Raspberry Pi 3 Model B+.
The remainder of this article is organized as follows.
Section 2 provides preliminaries, including the datasets and related work. The methodology is presented in
Section 3, followed by the experiments and discussion in
Section 4. Finally, the conclusion is presented in
Section 5.
3. Proposed Methodology
The proposed IDS includes preprocessing, feature selection, classification methods, and evaluation. Firstly, the dataset is preprocessed by encoding categorical data to numerical values, normalizing the features to a common scale, and balancing the dataset using SMOTE. Additionally, to maintain the high performance of the IDS while reducing classification overhead, we use feature selection techniques to select the most important features. We use two feature selection techniques: namely, Spearman rank correlation and ANOVA. We use the UNSW-NB15 and NSL-KDD datasets for evaluating the proposed model. Both datasets comprise many features, some of which have little or no impact on intrusion identification.
The proposed intrusion detection system (IDS) utilizes two ensemble models: one using SVM with bagging and another using LSTM with stacking. Finally, we use the confusion matrix to assess the model’s performance and determine which model is superior to the others. We also evaluate the model in terms of model size and time for loading the model on a Raspberry Pi 3 Model B+.
Figure 1 shows the proposed framework, and the subsections that follow explain our process.
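For concreteness, the sketch below shows how the confusion-matrix-based metrics used in this paper (accuracy, precision, recall, and F-measure) can be computed with scikit-learn; the label arrays are placeholders rather than outputs of our models.

```python
# Minimal sketch of confusion-matrix-based evaluation (placeholder labels, not our results).
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_test = [0, 1, 1, 0, 1, 0]   # placeholder ground-truth labels (0 = normal, 1 = attack)
y_pred = [0, 1, 0, 0, 1, 1]   # placeholder model predictions

print(confusion_matrix(y_test, y_pred))            # [[TN, FP], [FN, TP]]
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F-measure:", f1_score(y_test, y_pred))
```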
3.1. Preprocessing
In the machine learning pipeline, preprocessing is an essential step. The data must be prepared and transformed. Four steps were involved: data encoding, normalization, class balancing, and feature selection.
3.1.1. Data Encoding
The DL and ML algorithms only work with numerical values, so features with categorical values must be transformed into numerical data. Categorical features are converted to integers with values between 0 and $S-1$, where $S$ represents the number of symbols.
Table 3 shows the numerical values assigned to the high-cardinality categorical features. To achieve this, we utilized label encoding to prevent an increase in the number of features, as a larger number of features could increase computational complexity [28]. Label encoding streamlines training by avoiding the feature explosion caused by one-hot encoding, thus ensuring that the dataset size and computational requirements remain manageable.
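As an illustrative sketch (not our exact implementation), label encoding of high-cardinality categorical columns can be performed with scikit-learn's LabelEncoder; the column names and sample rows below are placeholders based on the usual UNSW-NB15 categorical features.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Placeholder rows; 'proto', 'service', and 'state' are the usual categorical
# columns of UNSW-NB15 (NSL-KDD uses 'protocol_type', 'service', 'flag').
df = pd.DataFrame({"proto":   ["tcp", "udp", "tcp"],
                   "service": ["http", "dns", "-"],
                   "state":   ["FIN", "CON", "INT"]})

for col in ["proto", "service", "state"]:
    df[col] = LabelEncoder().fit_transform(df[col])   # integers in [0, S-1]

print(df)
```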
3.1.2. Data Normalization
Both datasets have attribute values that cover a wide range. This can cause errors and have a detrimental effect on the model’s performance. To tackle this problem, standardization and normalization are two methods that can be used for scaling the features. In our investigation, we use min–max scaling, which applies a linear transformation to the original data and rescales each value to the range [0, 1]:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x$ represents the original value, $x'$ represents the normalized value, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the feature.
Table 4 shows the normalized data for the dur and sbytes features for the first 10 data points of the UNSW-NB15 dataset.
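A minimal sketch of this step, assuming a scikit-learn workflow (the feature values shown are placeholders, not the entries of Table 4):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder values standing in for two features such as 'dur' and 'sbytes'.
X_train = np.array([[0.12, 496.0],
                    [1.62, 1762.0],
                    [0.00, 200.0]])
X_test = np.array([[0.40, 300.0]])

scaler = MinMaxScaler(feature_range=(0, 1))          # x' = (x - x_min) / (x_max - x_min)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)             # reuse training min/max to avoid leakage

print(X_train_scaled)
```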
3.1.3. Class Balancing
Models trained on imbalanced datasets may perform well on the majority class while neglecting or misclassifying the minority class [29]. Balancing the dataset helps to ensure that the model learns to recognize and predict both classes accurately, as reflected in improved F1 scores and recall [
27,
30]. Imbalanced learning is addressed using resampling techniques such as oversampling, undersampling, combined oversampling and undersampling, and ensemble sampling [
31].
To address the imbalance in class distribution, SMOTE creates new data points for the minority class by interpolating feature values between a minority sample and its nearest within-class neighbors. SMOTE is often used as a benchmark for oversampling [
27,
31,
32]. Rather than simply duplicating existing minority samples, SMOTE generates entirely new synthetic data points, which helps to prevent overfitting and is an improvement over simple random oversampling [
30]. The detailed distribution of the balanced data is explained in
Section 4.
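A minimal sketch of the balancing step using the SMOTE implementation from the imbalanced-learn library; the synthetic data stand in for the preprocessed training split, and the random seed is an arbitrary choice:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic, imbalanced data standing in for the preprocessed training split.
X_train, y_train = make_classification(n_samples=2000, n_features=20,
                                        weights=[0.9, 0.1], random_state=42)

smote = SMOTE(random_state=42)                       # interpolates between a minority sample
X_bal, y_bal = smote.fit_resample(X_train, y_train)  # and its nearest minority-class neighbors

print("before:", Counter(y_train))
print("after :", Counter(y_bal))                     # both classes now equally represented
```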
3.1.4. Feature Selection
The process of feature selection involves selecting relevant features while removing irrelevant ones from the original dataset. This eliminates redundant information and reduces computational cost [
33,
34]. Accurate detection performance depends greatly on feature selection, which is an effective technique for reducing the impact of irrelevant variables and noise [
35]. The proposed IDS classification utilizes a feature selection algorithm to identify significant features that have a strong impact on the classes. In this paper, Spearman rank correlation and ANOVA are used; these analyze the strengths of relationships between variables.
The statistical measure known as Spearman’s rank correlation coefficient is used to ascertain whether two variables have a monotonic relationship [27]. This measure helps with predicting one variable based on another. Feature selection is performed using correlation, as features that are highly correlated with the target variable are good predictors of it. Spearman’s rank correlation values range from −1 to 1; a high rank correlation score indicates a strong positive correlation and thus an important feature. After each feature was given a score based on this statistical evaluation, a threshold was established to determine which features should be included in the model [
5]. This approach is suitable for data with different scales of measurement, as shown in
Table 4, as it reduces the impact of extreme values and discrepancies in measurement scales on the correlation analysis.
Equation (1) represents the Spearman correlation coefficient ($\rho$) between two vectors $X$ and $Y$, where $x_i$ and $y_i$ are the samples for the random variables $X$ and $Y$, respectively:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n\left(n^2 - 1\right)}, \qquad d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i), \quad (1)$$

where $n$ is the number of samples. If the correlation coefficient ($\rho$) is close to ±1, it indicates a strong association between the two features; in this case, one of the features can be retained. On the other hand, if the value of $\rho$ is close to 0, it signifies that there is no association between the two features, and both features should be filtered out [19,36].
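As a hedged illustration of one way this criterion can be applied (the 0.3 threshold and the synthetic data are assumptions for demonstration, not the values used in our experiments), Spearman-based ranking of features against the class label can be computed with pandas:

```python
import pandas as pd
from sklearn.datasets import make_classification

# Synthetic data standing in for the preprocessed dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])
df["label"] = y

# Spearman rank correlation of every feature with the class label.
rho = df.corr(method="spearman")["label"].drop("label")

threshold = 0.3                                       # illustrative cut-off, not the paper's value
selected = rho[rho.abs() >= threshold].index.tolist()

print(rho.abs().sort_values(ascending=False))
print("selected features:", selected)
```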
ANOVA is a statistical method used to compare the means of independent groups [
22]. This method ranks features by calculating the ratio of the between-group variance to the within-group variance [6]. The one-way ANOVA F-test is a statistical tool used to identify significant differences between the means of two or more groups [7,36], which can help in assessing the discriminative power of individual features. ANOVA is therefore a suitable method for selecting the features in the network log that contribute to distinguishing between normal and attack instances in network traffic data [20]. It effectively leverages the dataset’s characteristics for feature selection, allowing the identification of features that are significant for separating normal and attack instances.
The procedure for feature selection uses a dataset ($D$) that contains $n$ rows. Each row in the dataset contains $k$ continuous feature values and a categorical group (class) variable. For an individual $j$ belonging to group $i$, $x_{ij}$ denotes the value, and $\bar{x}_i$ denotes the group mean. The term $\bar{x}$ represents the mean of the entire dataset, and $n_i$ denotes the total number of values in group $i$. The F-test compares two types of variances: the mean sum of squares between groups (MSB) and the mean sum of squares within groups (MSW), obtained by dividing SSB and SSW by their respective degrees of freedom [22,37]. Equation (2) shows the F-test:

$$F = \frac{\mathrm{MSB}}{\mathrm{MSW}} \quad (2)$$

$\mathrm{SSW}$ refers to the sum of squares within groups, and it can be expressed as (3):

$$\mathrm{SSW} = \sum_{i} \sum_{j=1}^{n_i} \left(x_{ij} - \bar{x}_i\right)^2 \quad (3)$$

$\mathrm{SSB}$ refers to the sum of squares between groups. It is a statistical measure that is used to evaluate the variability between group means, and it can be expressed as (4) [35]:

$$\mathrm{SSB} = \sum_{i} n_i \left(\bar{x}_i - \bar{x}\right)^2 \quad (4)$$

The ANOVA F-value is calculated for each feature and class variable, and we select the $K$ features with the strongest connections to the class using the F-value.
The detailed features of each feature selection method are explained in
Section 4.
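For illustration, ANOVA-based selection of the K best features can be performed with scikit-learn's SelectKBest and the f_classif score function; the value of K and the synthetic data below are placeholders rather than our experimental settings:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data standing in for the preprocessed dataset.
X, y = make_classification(n_samples=1000, n_features=40, n_informative=10, random_state=0)

selector = SelectKBest(score_func=f_classif, k=20)    # f_classif = one-way ANOVA F-value
X_selected = selector.fit_transform(X, y)

print("F-values of first 5 features:", selector.scores_[:5])
print("indices of the K kept features:", selector.get_support(indices=True))
```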
3.2. Classification Using Ensemble Techniques
Ensemble techniques combine multiple methods for training the dataset. In this research, we utilize two types of ensemble: namely, SVM with bagging and LSTM with stacking.
3.2.1. SVM with Bagging
Support vector machine (SVM) is a popular algorithm used for binary classification [
6,
8,
9]. In the field of IDSs, transactions are classified as either normal or intrusions, irrespective of the type of attack. This study utilizes SVM due to its advantages in analyzing high-dimensional spaces. In addition, SVMs use only a subset of the training points (the support vectors) in the decision function, which makes them memory-efficient.
To separate data points of different classes, SVM finds the optimal hyperplane in a high-dimensional feature space [
38]. The maximum margin hyperplane is selected to maintain the maximum separation from the closest data points for every class. Four different types of kernels are used in SVM: sigmoid, polynomial, radial basis function (RBF), and linear [
39]. In this study, we used the RBF-SVM algorithm, which is a powerful and versatile machine learning algorithm that offers flexibility, robustness, and strong generalization performance.
A single SVM model may not always learn the exact parameters for the global optimum [
40]. It is possible that not all unknown test samples can be correctly classified using the support vectors acquired during the learning process. Therefore, a single SVM model may not provide optimal classification for all test examples.
To address the limitations of SVMs, in this work, we adopt a bagging technique to create an ensemble of diverse samples using bootstrapping sampling [
38]. To establish the final predicted class in bagging, many SVMs are trained individually on bootstrap samples, and their outputs are then aggregated via majority voting. The training set $T = \{(x_i, y_i)\}_{i=1}^{l}$ for a single SVM consists of pairs of data points $x_i$ and their labels $y_i$, where $l$ is the total number of training samples. In bagging, an SVM ensemble with $K$ independent SVMs is built using $K$ sets of training samples. To obtain a larger improvement in the aggregation outcome, the training sample sets must vary, and we employ the bootstrap technique to achieve this. Given a training dataset $D$ with $N$ samples, we generate $M$ bootstrap samples $D_1, D_2, \ldots, D_M$, each containing $N$ samples drawn with replacement from $D$. In any specific replicate training set, an example $x$ from the provided training set $D$ may appear once, more than once, or not at all. A separate SVM is trained on each replicate training set. The predictions made by each SVM classifier are then combined using majority voting to determine the final predicted class [
10]. The overall model for SVM bagging presented in this paper is illustrated in
Figure 2.
Each dataset was used to train an SVM bagging ensemble with 10 base estimators and 10 bootstrap samples; that is, the ensemble consisted of 10 individual SVM models, each trained on a different bootstrap sample of the data. For each test instance, we obtained predictions from all the SVM models and aggregated them using a majority vote to produce the final prediction.
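A minimal sketch of this ensemble using scikit-learn's BaggingClassifier with an RBF-SVM base estimator (the synthetic data and any hyperparameters not stated above are illustrative defaults, not our experimental settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data standing in for a preprocessed, balanced, feature-selected dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# 10 RBF-SVMs, each fitted on a bootstrap sample; predictions are combined by majority vote.
# (On scikit-learn < 1.2 the keyword is `base_estimator` instead of `estimator`.)
bagging = BaggingClassifier(estimator=SVC(kernel="rbf"),
                            n_estimators=10,
                            bootstrap=True,
                            random_state=42)
bagging.fit(X_tr, y_tr)

print("test accuracy:", bagging.score(X_te, y_te))
```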
3.2.2. LSTM with Stacking
Recurrent neural networks (RNNs) with long short-term memory (LSTM) units are designed to address the vanishing gradient problem and capture long-term dependencies between data points [
37]. The memory cell is a key component of LSTM and is capable of storing information for extended periods. The information flow into and out of LSTM cells is managed by three gates: the input gate, forget gate, and output gate [
41].
Figure 3 depicts the LSTM memory cell used in this study.
The input gate ($i_t$) regulates how much of each input element enters the cell state. Each input element is passed through a sigmoid activation function, which generates a value between 0 and 1, as represented in Equation (5). The forget gate ($f_t$) plays a crucial role in deciding which information of the cell state is removed from or kept for the model, as represented in Equation (6); its primary function is to decide how much of the previous cell state ($C_{t-1}$) is carried over to the current cell state ($C_t$). The output gate ($o_t$) is responsible for deciding how much of the current state is passed on, as represented in Equation (7). First, the sigmoid layer ($\sigma$) defines the output information. Then, $\tanh$ processes the cell state, and the result is multiplied by the sigmoid layer output to generate the final output [42]:

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \quad (5)$$
$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \quad (6)$$
$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t \odot \tanh\left(C_t\right) \quad (7)$$

where the weights are denoted by $W$, the hidden state of the cell at time $t$ is represented by $h_t$, the input at time $t$ is denoted by $x_t$, and the biases are represented by $b$ [43].
LSTM networks are trained using gradient descent and backpropagation through time (BPTT) algorithms [
41,
44]. The LSTM cell parameters (weights and biases) are adjusted by propagating gradients through the network to minimize the loss function. Their ability to capture long-term dependencies and effectively handle sequential data makes them well-suited for tasks requiring memory and context preservation over extended sequences.
Stacking multiple LSTM layers helps to build deeper and more sophisticated models that can effectively capture the complex temporal patterns and dependencies present in sequential data. By leveraging hierarchical representations and increased model capacity, LSTM stacking offers a powerful framework for tackling a wide range of sequential learning tasks with improved performance and generalization capabilities.
In this research, we implemented an LSTM stacking network consisting of two LSTM layers connected using the hyperparameter settings illustrated in
Figure 4 [
11]. The hyperparameters for configuring the LSTM stacking network include the number of hidden units, the dropout rate, the activation functions, and the dense layers. Each LSTM layer processes the input sequence and passes its output sequence to the next layer in the stack. The two LSTM layers have different sizes, with the first being larger. The first layer captures more general features, using 128 hidden units and a dropout rate of 0.3, while the second layer targets more specific features, using 32 hidden units and a dropout rate of 0.3. The first LSTM layer processes each dataset, and the output sequence of LSTM Layer 1 serves as the input sequence for LSTM Layer 2. The output sequence of LSTM Layer 2 is further processed by additional layers: namely, a dense layer with a ReLU activation function and a final dense layer with softmax activation for binary classification (normal or attack).
To prevent overfitting in our model, we added dropout layers and feedforward connections to each LSTM layer. We measured the difference between the predicted and actual probabilities using sparse categorical crossentropy and trained the LSTM model with the Adam optimizer for improved performance. To prevent overtraining, we applied early stopping, halting training when the model’s performance on the validation data ceased to improve [
11].
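A minimal Keras sketch of the stacked LSTM described above (two LSTM layers with 128 and 32 units, dropout of 0.3, a ReLU dense layer, a softmax output, Adam, sparse categorical crossentropy, and early stopping); the input shape, dense-layer width, early-stopping patience, and training data are placeholders rather than our exact configuration:

```python
import numpy as np
from tensorflow.keras import callbacks, layers, models

n_features = 40                                   # assumed number of selected features
X_train = np.random.rand(1000, 1, n_features)     # each record treated as a length-1 sequence
y_train = np.random.randint(0, 2, size=1000)      # placeholder labels (0 = normal, 1 = attack)

model = models.Sequential([
    layers.Input(shape=(1, n_features)),
    layers.LSTM(128, return_sequences=True),      # LSTM layer 1: more general features
    layers.Dropout(0.3),
    layers.LSTM(32),                              # LSTM layer 2 consumes layer 1's output sequence
    layers.Dropout(0.3),
    layers.Dense(32, activation="relu"),          # ReLU dense layer (width is an assumption)
    layers.Dense(2, activation="softmax"),        # binary output: normal vs. attack
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                     restore_best_weights=True)
model.fit(X_train, y_train, validation_split=0.2, epochs=50,
          batch_size=64, callbacks=[early_stop], verbose=0)
```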
5. Conclusions
In this paper, we propose a scheme to optimize IoT intrusion detection using a combination of class balancing and feature selection for preprocessing. SMOTE is used to balance the rare classes of the dataset. In addition, we apply Spearman rank correlation and ANOVA to identify the essential features that have a high impact on the class while reducing data dimensionality and computational overhead. We evaluate the performance of the SVM bagging and LSTM stacking algorithms on the UNSW-NB15 and NSL-KDD datasets, focusing specifically on accuracy, overfitting, and AUC for binary classification. It is important to note that the training time can impact the model size and overfitting. The performance results suggest that the LSTM stacking model with ANOVA feature selection is superior for classifying network attacks. This model also has a small size and loads quickly, making it suitable for implementation on a Raspberry Pi 3 Model B+.
To enhance our model’s robustness and accuracy, future work will focus on implementing additional deep learning architectures. Specifically, we plan to integrate Transformer models and gated recurrent units (GRUs). For implementation on the Raspberry Pi, the limited computational resources, including CPU, memory, and storage, pose significant challenges. To address these constraints, we intend to optimize the model to reduce its size and loading time without significantly compromising performance. This optimization strategy is essential for achieving efficient computation times on the resource-constrained Raspberry Pi platform.