1. Introduction
The Internet of Things (IoT) is an emerging paradigm that connects a variety of intelligent physical devices in order to modernize the domains that integrate them and thus improve the quality of life. The phrase “Internet of Things” was originally coined in 1999 by Kevin Ashton [
1]. In 2020, it was predicted that global spending on the Internet of Things (IoT) technology would reach 749 billion dollars [
2]. It has been predicted that
trillion US dollars would be spent on IoT globally by 2023. The Asia Pacific region accounted for the largest share of the global IoT market. Europe, the Middle East, and Africa came in second, third, and fourth, respectively.
While the use of IoT-based technologies is on the rise, their security is of utmost importance for ensuring the successful operation of these connected devices. According to Statista [
3], data on malicious attacks against IoT-connected devices worldwide between 2020 and 2021 show that 10.83 million attacks occurred in October 2020 alone.
One of the challenging problems in the IoT sector is anomaly detection in streaming data captured through the connected sensors. To overcome this problem, researchers have designed and tested a number of cutting-edge models. Examples include deep learning-based models such as convolutional neural networks (CNN) [
4], graph-based modeling [
5], the Graph Attention Network [
6], and the Temporal Convolutional Network (TCN) [
7]. These approaches primarily model the problem through time-series analysis and then detect anomalies in the streaming time-series data. An open research question is how the performance of these models varies with respect to accuracy and training time.
In this paper, we investigate the performance of deep learning models, including the recurrent Bidirectional LSTM (BI-LSTM) and Long Short-Term Memory (LSTM) networks, the CNN-based Temporal Convolutional Network (TCN), and CuDNN-LSTM, a fast LSTM implementation supported by CuDNN. What makes our approach different is that we treat the problem as an estimation problem rather than a “classification” one, even though we also report classification results for the sake of comparison. As a result, the performance metric utilized for comparison is the Root Mean Square Error (RMSE), i.e., the deviation of the predictions from the actual values.
This work is an extension to our previous work [
7], where we compared CNN and RNN-based models through classification. Here, we extend our approach by considering different types of RNN models and also treating the problem as an estimation problem. To have a fair and unbiased comparison, we build various types of RNN and CNN-based models with similar configurations.
Our experimental results show that the TCN and CuDNN-LSTM models have the lowest Root Mean Square Error (RMSE) and training time when compared to the BI-LSTM and LSTM models. Furthermore, in terms of performance (i.e., RMSE), CuDNN-LSTM offers the lowest RMSE, whereas the TCN-based model slightly outperforms CuDNN-LSTM in training time. This paper makes the following key contributions:
We provide a formulation where the problem of anomaly detection is treated as an estimation problem rather than classification.
In the context of anomaly detection in time series, we compare the performance of multivariate RNN-based BI-LSTM, LSTM, CNN-based TCN, and CuDNN-LSTM models.
In terms of performance, we find that the CuDNN-LSTM model and TCN-based model outperform the BI-LSTM and LSTM models.
In terms of training time, the TCN model outperforms the BI-LSTM, LSTM, and CuDNN-LSTM models.
The remainder of this paper is organized as follows:
Section 2 briefly reviews the related work. A short background on the deep learning techniques studied is presented in
Section 3. The experimental setup and procedure are presented in
Section 4. The anomaly detection technique and the outcomes of various other attacks are reported in
Section 5.
Section 6 presents the results of the study along with some discussions.
Section 7 concludes the paper and highlights the future research directions.
2. Related Work
Chalapathy and Chawla [
8] explored a number of approaches to deep learning-based anomaly detection. In their work, the authors highlighted some of the problems with deep anomaly detection and the solutions that have been developed so far. However, there are still problems with Deep Anomaly Detection (DAD) techniques: the supervised approach has a high computational complexity when applied to real domains and requires label acquisition, which is an expensive and time-consuming process, while unsupervised DAD-based techniques tend to be less robust when dealing with noisy data.
Wu et al. [
9] developed an LSTM-Gauss-NBayes technique for outlier identification in the Industrial Internet of Things (IIoT), which combines an LSTM neural network with the Naive Bayes model. The authors employed a stacked LSTM model, taking advantage of its powerful learning capability to handle time-series data with long-term, short-term, and weak temporal dependencies.
Du et al. [
10] introduced a detection technique called DeepLog, which is an LSTM-based approach. DeepLog is essentially a semi-supervised log anomaly identification technique. DeepLog uses log templates and parameter vectors (i.e., some quantitative features of the logs) to identify abnormal logs. However, DeepLog tends to ignore the log’s semantic information and the properties of the temporal dimension in favor of the template’s category data. There is considerable work that can be performed on improving DeepLog, such as testing the efficiency of additional types of recurrent neural networks (RNNs) and integrating log data from various applications and systems.
Gopali et al. [
7] compared the performance of RNN-based LSTM and CNN-based TCN models in the context of anomaly detection in a multivariate time series. The authors showed that the TCN models outperform other models with an F1 score of
. The performance of the TCN-based model then was compared with the LSTM model and Graph Learning with Transformer for Anomaly detection (GTA) [
11] models.
Luo et al. [
12] proposed a convolutional LSTM-based Auto-Encoder (ConvLSTM-AE) framework for encoding appearance and its change (i.e., motion) for anomaly detection. In order to remember the visual change that corresponds to motion, they utilized a ConvNet for encoding each frame and a Convolutional LSTM (ConvLSTM), a type of LSTM that retains spatial information, to record the change of appearance.
Ryota et al. [
13] described a system that, by combining general and environment-specific information into a single framework, simultaneously identifies and recounts abnormal events. They used a large labeled dataset to train a Fast R-CNN-based model in order to gain general knowledge.
Radavliev and de Roure [
14] present a set of optimized algorithms for the purpose of edge computing in which devices are often constrained with low memory. The optimized algorithms are then useful for autonomous environments such as robots or drones where self-adapting and self-evolving mechanisms are critical for continuous operations. In the context of time-series analysis and anomaly detection, the problem is to detect possible anomalies with minimal observations and thus prevent consuming memory capacity.
Gopali et al. [
15] conducted a study in which TCN and LSTM models were analyzed and compared with an Average Stochastic Gradient Descent Weighted Dropped Long Short-Term Memory (AWD-LSTM) model in vulnerability detection in smart contracts, where the TCN model outperformed the other models in precision, recall, and F1 score by
,
, and
, respectively.
3. Background
3.1. Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) is a popular recurrent neural network (RNN) architecture used in combination with an appropriate gradient-based learning algorithm [
16]. LSTM can recognize long-term dependencies and forecast a sequence of events by utilizing feedback links and loops in the network. The “long short-term memory” that the LSTM architecture provides for RNNs can persist over thousands of time steps. The cells in the LSTM are repeated in a chain, typically forming the main layers or modules of the network.
Here, each cell in an LSTM model comprises three interconnected gate layers: a forget gate, an input gate, and an output gate. Each gate is constructed from a sigmoid neural network layer and a point-wise multiplication operation. The output layer receives connections only from the memory cells. Memory cells and gate units contain bias weights and accept input from the input units, memory cells, and gate units. The forget gate is the first stage of the procedure: depending on the new input data and the previous hidden state, it determines which parts of the long-term memory should be forgotten at the current step.
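For reference, the gate computations described above are commonly written as follows, where $x_t$ is the input at time $t$, $h_{t-1}$ is the previous hidden state, $c_t$ is the cell state, $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and the matrices $W$, $U$ and biases $b$ are learned parameters (this is the standard textbook formulation rather than a description specific to our implementation):
$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c), \qquad
h_t = o_t \odot \tanh(c_t).
\end{aligned}
$$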
3.2. Bidirectional Long Short-Term Memory (BI-LSTM)
The original Real-Time Recurrent Learning (RTRL) and Back Propagation Through Time (BPTT) error gradient were employed in the LSTM training method of Gers et al. [
17]. When two LSTMs are applied to the input data, the results are called deep-Bidirectional LSTMs [
18]. An expanded BI-LSTM consists of two LSTM cells: a forward LSTM cell and a backward LSTM cell [
19]. First, a Long Short-Term Memory (LSTM) is trained on the original input sequence; this is known as the forward layer. Second, the LSTM model is given the reverse of the original input sequence, which is called the backward layer. Finally, the forward and backward states may be concatenated to summarize the BI-LSTM’s output.
3.3. Temporal Convolutional Networks (TCN)
For decades, convolutional networks have been applied to sequences and were widely utilized for voice recognition [
20]. Temporal Convolutional Networks (TCNs) were later used as a general convolutional sequence prediction architecture. TCNs are a form of time-series model that overcomes earlier constraints by capturing long-term patterns with a hierarchy of temporal convolutional filters.
Two distinct kinds of TCNs were initially described by the pioneering authors in the work of Colin et al. [
21]:
- (1)
Encoder–Decoder TCN (ED-TCN) employs a temporal convolution, pooling, and upsampling architecture to effectively capture temporal patterns across large ranges of time.
- (2)
A Dilated TCN incorporates skip connections across layers and employs dilated convolutions in place of pooling and upsampling. A dilated convolution is similar to a convolution using a bigger filter built from the original filter by dilating it with zeros, but it is far more efficient [
22]. According to Bai et al. in their work [
23], TCNs are distinguished by the following key characteristics: the convolutions in the architecture are causal, preventing “leakage” of information from the future to the past; and the architecture can map an input sequence of any length to an output sequence of the same length.
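As an illustration of these two properties, a stack of causal, dilated 1-D convolutions of the kind used inside a TCN can be sketched in Keras as follows; the filter counts, kernel size, and dilation rates here are illustrative rather than the exact configuration used in our experiments:

```python
import tensorflow as tf

# Minimal sketch of causal, dilated 1-D convolutions in the spirit of a TCN.
# padding='causal' guarantees that the output at time t depends only on inputs
# at times <= t (no "leakage" from the future to the past), and the output
# sequence keeps the same length as the input sequence.
inputs = tf.keras.Input(shape=(None, 2))   # (time steps, 2 sensor features)
x = tf.keras.layers.Conv1D(48, kernel_size=2, padding='causal',
                           dilation_rate=1, activation='tanh')(inputs)
x = tf.keras.layers.Conv1D(48, kernel_size=2, padding='causal',
                           dilation_rate=2, activation='tanh')(x)
x = tf.keras.layers.Conv1D(48, kernel_size=2, padding='causal',
                           dilation_rate=4, activation='tanh')(x)
model = tf.keras.Model(inputs, x)
```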
3.4. CuDNN-LSTM
A CuDNNLSTM, according to the Keras documentation [
24], is a fast LSTM implementation backed by cuDNN that works only on a Graphics Processing Unit (GPU). CuDNN-LSTM is faster than the standard LSTM but has fewer options because it lacks dropout, custom activation functions, and masking support.
4. Experimental Setup
This section describes the dataset, the data preprocessing, the model architectures, and the assessment metrics employed in the experiments conducted.
4.1. Dataset
The SWaT dataset [
25] (December–January, 2015–2016, dataset with anomaly) from iTrust, Centre for Research in Cyber Security, Singapore University of Technology and Design is used in this experiment. This dataset contains data from Secure Water Treatment (SWaT), which is a water treatment testbed for cybersecurity researchers aiming to collect and analyze data that help in creating secure Cyber Physical Systems. The dataset studied in this work is produced by typical IoT systems where sensors are used for controlling the water level of a water plant.
The dataset includes seven days (24 h a day) of SWaT running normally and four days with various attack scenarios (i.e., Single-Stage Single-Point Attacks, Single-Stage Multi-Point Attacks, Multi-Stage Single-Point Attacks, and Multi-Stage Multi-Point Attacks).
The dataset contains 53 network traffic features as well as values received from sensors and actuators categorized according to their normal or attack behavior. The experiment conducted in the paper focuses primarily on a Single-Stage Single-Point attack, in which only one sensor is targeted as the attack point. More specifically, the sensor data collected by “
LIT-101” and “
LIT-301” are studied as features for the experimentation.
Figure 1 shows the time series of the training and testing datasets for the sensors (i.e.,
LIT-101 and
LIT-301) studied. Moreover, for comparison and discussion purposes, we also study other types of attacks, such as Single-Stage Multi-Point Attacks, Multi-Stage Single-Point Attacks, and Multi-Stage Multi-Point Attacks.
4.2. Data Preparation
Figure 1a demonstrates the trends of the sensor data studied. The sensor features “LIT-101” and “LIT-301” of the SWaT dataset shown in
Figure 1a have been selected from a pool of 53 features in the dataset. The sample data were drawn in the range of December 2015 to January 2016.
The training dataset contains 345,600 observations and spans four days, from 24 December 2015 10:00:00 a.m. to 28 December 2015 10:00:00 a.m. During model training and validation,
of the observations from the training set was used for validation. The testing set (demonstrated in
Figure 1b) consists of
observations with
Single-Stage Single-Point Attacks in the selected features (i.e., “
LIT-101” and “
LIT-301”) that spans two days (48 h), from 31 December 2015 2:58:39 p.m. to 2 January 2016 2:58:39 p.m.
After removing the trend and seasonality from the training dataset (i.e., the normal dataset in
Figure 2a), the training and validation datasets in
Figure 2b were prepared. Trends and seasonality in a time-series dataset may need to be removed before modeling. A time series is considered non-stationary if its mean or variance changes over time as a result of trends or seasonality, respectively. It is much simpler to model a stationary dataset because it has a stable mean and variance. The trend and seasonality removal is performed by applying a log transformation and then taking differences between consecutive data points. We normalized the dataset in the preprocessing stage using the sklearn Python library [
26]. The normalization converts the dataset to a common scale, typically in the range 0 to 1, without distorting the relative differences between values.
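A minimal sketch of this preprocessing pipeline (log transformation, differencing, and 0–1 normalization) is given below; the file name and column names are placeholders for illustration and may differ from the actual dataset files:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the raw sensor readings; the file and column names are placeholders.
df = pd.read_csv("SWaT_Dataset_Normal.csv")
features = df[["LIT-101", "LIT-301"]].astype(float)

# Remove trend and seasonality: log-transform, then take first differences.
stationary = np.log(features).diff().dropna()

# Normalize to a common 0-1 scale without distorting relative differences.
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(stationary.values)
```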
Figure 2 shows the time-series data of
LIT-101 after the transformation and removal of seasonality.
4.3. Tools and Libraries Used
We used Python’s pandas to import the dataset [
27]. In the preprocessing stage, we utilized numpy [
28] and sklearn [
26]. For visualization, we used matplotlib [
29]. The keras [
24] with tensorflow [
30] libraries were utilized in building the models. The keras library was used to build the architecture of the deep learning models (i.e., the RNN-based BI-LSTM and LSTM, the CNN-based TCN, and CuDNN-LSTM). We used the sklearn library to calculate the performance metrics.
Figure 3 depicts the architecture of the models created. As the architecture of these models indicates, the models need to be consistent (i.e., a similar number of internal layers) so the comparison can be meaningful.
4.4. Model Architecture
We build the deep learning models (i.e., CuDNN-LSTM, TCN, BI-LSTM, and LSTM) using the keras Python library [
24] with TensorFlow [
30] as the backend to develop and train these deep learning models.
The models are made up of two hidden layers, each of which has a suitable number of neurons chosen based on observations made during the model training phase. The hidden layers process the data received at the input layer. The values of previous sensor data (i.e., the past 15, 20, or 30 min) are used to predict the current data point (1 s), depending on the input size designated in the input layer.
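A sketch of how such look-back windows can be constructed from the preprocessed series is given below; the helper function and the window length of 900 (a 15 min look-back at 1 s sampling) are assumptions for illustration:

```python
import numpy as np

def make_windows(series, window):
    """Slice a (time, features) array into inputs of shape
    (samples, window, features) and next-step targets."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])   # the previous `window` observations
        y.append(series[i + window])     # the next observation to be estimated
    return np.array(X), np.array(y)

# `scaled` stands in for the preprocessed (stationary, normalized) sensor data.
scaled = np.random.rand(2000, 2)
X_train, y_train = make_windows(scaled, window=900)  # 15 min look-back at 1 s sampling
```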
The models utilized the ‘tanh’ activation function with a dropout rate of 0.3. The tanh function has the property that its gradient reaches its maximum of 1 only when the input value is 0. The activation function’s primary benefit is that it generates zero-centered output, which facilitates the back-propagation process [
31]. In order to prevent gradients from shifting in a specific direction, the activation function’s output should be symmetric about zero, which is known as zero-centered output. Dropout prevents overfitting and provides a way of efficiently combining, in an approximate manner, exponentially many different neural network architectures [
32].
For the experiments conducted in this study, we have set the batch
and
for all models to maintain consistency. The models also utilized the mean squared error as the loss function and Adam [
33] as the optimizer. The Adam optimizer is faster to compute and requires fewer parameters to tune the model.
For a fair comparison, it is essential to build models with similar configurations and structure. We build two variants each of the TCN, LSTM, and BI-LSTM models: one with a dropout layer and one without. As a result, any optimization, in terms of the number of hidden layers, should be consistently applied to all models. In the following sections, we provide detailed information regarding the configuration of each model.
4.4.1. TCN
The TCN models are made up of two TCN hidden layers, one dropout layer, and one dense layer. The first hidden layer contains 48 neurons with kernel size and dilations (or steps) of 1 and 2, with a ‘tanh’ activation function. The second hidden layer contains 30 neurons with kernel size and dilations (or steps) of 1, 2 and 4, with a ‘tanh’ activation function. The dropout layer has a dropout rate of . The final layer (i.e., a dense layer) has two neurons as output with a ‘tanh’ activation function. In the TCN model without the dropout layer, the second TCN hidden layer has a dropout rate of , but the model does not include a separate dropout layer.
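One way to realize this architecture is through the keras-tcn package, as sketched below; the interpretation of the kernel size and dilation values, the dropout rate of 0.3 (taken from Section 4.4), and the exact layer API are assumptions that may differ from the implementation used here:

```python
import tensorflow as tf
from tcn import TCN  # keras-tcn package; the layer API may vary across versions

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 2)),                      # 2 sensor features
    TCN(nb_filters=48, kernel_size=1, dilations=[1, 2],
        activation='tanh', return_sequences=True),        # first TCN hidden layer
    TCN(nb_filters=30, kernel_size=1, dilations=[1, 2, 4],
        activation='tanh'),                               # second TCN hidden layer
    tf.keras.layers.Dropout(0.3),                         # dropout layer (assumed rate)
    tf.keras.layers.Dense(2, activation='tanh'),          # dense output layer
])
model.compile(optimizer='adam', loss='mse')
```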
4.4.2. LSTM
The LSTM model consists of two LSTM hidden layers, one dropout layer, and one dense layer. The two LSTM hidden layers contain 40 and 65 neurons, respectively, each with a ‘tanh’ activation function. The final layer (also known as the dense layer) has two neurons and a ‘tanh’ activation function. The LSTM model without a dropout layer lacks the separate dropout layer, while its second LSTM hidden layer has a dropout rate of .
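A sketch of the LSTM variant with a dropout layer is given below; the dropout rate of 0.3 is taken from Section 4.4 and the input shape is an assumption for illustration:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 2)),                     # 2 sensor features
    tf.keras.layers.LSTM(40, activation='tanh',
                         return_sequences=True),         # first LSTM hidden layer
    tf.keras.layers.LSTM(65, activation='tanh'),         # second LSTM hidden layer
    tf.keras.layers.Dropout(0.3),                        # dropout layer (assumed rate)
    tf.keras.layers.Dense(2, activation='tanh'),         # dense output layer
])
model.compile(optimizer='adam', loss='mse')
```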
4.4.3. BI-LSTM
The BI-LSTM model is made up of two Bidirectional LSTM hidden layers, one dropout layer, and one dense layer. The two Bidirectional LSTM layers contain 60 and 30 neurons, respectively, and each layer has a ‘tanh’ activation function. The dropout layer has a dropout rate of . The final (i.e., dense) layer outputs two neurons with a ‘tanh’ activation function. The dropout layer is not included in the architecture of the BI-LSTM model with no dropout layer, but the other configurations remain the same.
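The BI-LSTM variant follows the same pattern, with each recurrent layer wrapped in Keras’ Bidirectional wrapper; again, the dropout rate and input shape are assumptions for illustration:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 2)),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(60, activation='tanh', return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(30, activation='tanh')),
    tf.keras.layers.Dropout(0.3),                 # dropout layer (assumed rate)
    tf.keras.layers.Dense(2, activation='tanh'),  # dense output layer
])
model.compile(optimizer='adam', loss='mse')
```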
4.4.4. CuDNN-LSTM
The CuDNN-LSTM model is implemented with CUDA. The model has two CuDNN-LSTM hidden layers with 50 and 35 neurons, respectively. The model has a final output (i.e., dense) layer containing two neurons with a ‘tanh’ activation function.
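In older standalone Keras, this corresponds to the dedicated CuDNNLSTM layer; in TensorFlow 2.x, the standard LSTM layer dispatches to the cuDNN kernel on a GPU when it is left in its cuDNN-compatible configuration (default tanh/sigmoid activations, no recurrent dropout, no masking). A sketch under those assumptions:

```python
import tensorflow as tf

# With default activations, no recurrent dropout, and no masking, these LSTM
# layers run on the cuDNN-backed implementation when a GPU is available.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 2)),
    tf.keras.layers.LSTM(50, return_sequences=True),  # first cuDNN-backed hidden layer
    tf.keras.layers.LSTM(35),                          # second cuDNN-backed hidden layer
    tf.keras.layers.Dense(2, activation='tanh'),       # dense output layer
])
model.compile(optimizer='adam', loss='mse')
```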
4.5. Assessment Metrics
The Root Mean Square Error (
RMSE) is a commonly used statistic for assessing the accuracy of a model’s prediction. It calculates the discrepancies or residuals between actual and predicted values. The metric compares prediction errors of different models for a specific dataset.
RMSE can be calculated through the following formula:
$$ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, $$
where n is the total number of observations, $y_i$ is the true value, and $\hat{y}_i$ is the predicted one. The key advantage of utilizing RMSE is that it penalizes large errors. It also scales the obtained results in the same units as the predicted values.
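For instance, the RMSE between actual and predicted sensor values can be computed as follows (the arrays here are dummy values for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Dummy stand-ins for the actual and predicted readings of the two sensors.
y_true = np.array([[500.1, 820.4], [501.3, 819.8], [502.7, 818.5]])
y_pred = np.array([[498.7, 821.0], [502.0, 818.9], [501.9, 819.2]])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE: {rmse:.4f}")
```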
5. Results
The experiments are carried out in the Google Colaboratory Pro environment with a Graphics Processing Unit (GPU) enabled. The results and evaluation metrics are generated once the deep learning models (i.e., CuDNN-LSTM, BI-LSTM, LSTM, and TCN) are trained.
5.1. Loss vs. Epoch Plots
The plots depicted in
Figure 4 show the trend of loss values for the training and validation stages when building each model, where the x-axis is the epoch number and the y-axis is the loss value. As the plots in
Figure 4a,c, and
Figure 4d demonstrate, the LSTM-based models exhibit similar trends in loss values. The loss values decrease sharply within the first few epochs of the training stage. While the loss value keeps decreasing, the amount of reduction does not seem to be substantial, and the loss values show some stability after a certain number of epochs. Similarly, the LSTM-based models show similar trends for loss values during the validation stage.
As the plot in
Figure 4b depicts, the trend of loss values for the TCN-based model is somewhat similar to, but also different from, that of the LSTM-based models. The TCN-based model demonstrates a similar pattern in terms of the sharp reduction in loss values and the stabilization of loss values after a certain number of training epochs. On the other hand, and according to the y-axis of the plot in
Figure 4b, the TCN-based model can be trained down to lower loss values. These plots indicate that both the LSTM-based and TCN-based models exhibit similar patterns in training and validation. However, the TCN-based model is capable of fitting a better model, as evidenced by the lower loss values obtained.
5.2. Predicted vs. Actual Values
The plots given in
Figure 5 demonstrate the predicted (colored in orange) and the actual (colored in blue) values for the sensor data studied and obtained by each model.
Table 1 lists the training time and RMSE values for the models. As we observe in the table, the CuDNN-LSTM model builds a more accurate model with RMSE values of
,
,
,
,
and
for timestamps of 30, 20, 15, 10, 7 and 5 min, respectively. On the other hand, the TCN-based model builds a comparable model with a reasonable RMSE but a lower training time. More specifically, the TCN-based models are able to train a reasonably accurate model with RMSE values of
,
,
,
,
and
with lower time allocated for training for timestamps of 30, 20, 15, 10, 7 and 5 min, respectively.
6. Discussion
6.1. Anomaly Detection Methodology
A Python tool called Anomaly Detection Toolkit (ADTK) [
34] is used for unsupervised/rule-based time-series anomaly detection. The SWaT dataset [
25] has an outlier anomaly known as a spike, in which the value quickly increases or decreases to a level that is off the range of the recent past. To identify the spike anomalies in the experiments, we employed the
PersistAD model from the ADTK. The
model’s parameter
c takes an optional float value to determine the range using a historical inter-quartile range, and it is often set to
. Likewise, the side parameter is set to
“both”, and the window is
set as the timestamps with the goal of observing a longer lookback.
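A sketch of this detection step with ADTK is shown below; the synthetic series, the value of c, and the window size are illustrative assumptions rather than the exact settings used in our experiments:

```python
import numpy as np
import pandas as pd
from adtk.data import validate_series
from adtk.detector import PersistAD

# Build a small synthetic series standing in for the studied sensor signal.
idx = pd.date_range("2015-12-31 14:58:39", periods=600, freq="S")
values = np.sin(np.linspace(0, 20, 600)) * 100 + 500
values[300] = 900                                     # inject a spike anomaly
series = validate_series(pd.Series(values, index=idx))

# PersistAD flags values that are off the range of the recent past: `c` scales
# the historical inter-quartile range, side="both" catches spikes in either
# direction, and `window` controls the look-back.
detector = PersistAD(c=3.0, side="both", window=30)   # illustrative parameter values
anomalies = detector.fit_detect(series)
print(int(anomalies.fillna(False).sum()), "points flagged as anomalous")
```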
6.2. Analysis of Different Types of Attacks
6.2.1. Single Stage Single-Point Attacks
In its simple form, the “Single-Stage Single-Point Attack” demonstrates the prediction performance without the added complexity of other attack types. The plots in
Figure 6a illustrate the values of F1 scores captured for different values of timestamps for each model created and studied. As the figure shows, the models exhibit similar patterns. The F1 scores vary between
and
. While the best performance is exhibited when the timestamp is 400, the F1 scores show steady and constant improvement for timestamp values greater than 1200.
6.2.2. Single Stage Multi-Point Attacks
In the Single-Stage Multi-Point Attack, the sensor features “
” and “
” have been taken for the experiments. Based on the experiments’ outcomes, the plot shown in
Figure 7a depicts that the CuDNN-LSTM model without the dropout layer has the lowest RMSE across all timestamps (i.e., 30, 20, 15, 10, 7 and 5 min). In contrast, the TCN-based model with the dropout layer has a notably higher RMSE throughout the timestamp range, with the 10 min timestamp having the highest RMSE of
. The plotted F1 score in
Figure 6b shows that the BI-LSTM model with the dropout layer has an F1 score of
at the 20 and 15 min timestamps, and it has the highest F1 scores throughout all timestamps. The plots clearly demonstrate that all models have their lowest F1 scores at the 10 min timestamp.
6.2.3. Multi-Stage Single-Point Attacks
In the Multi-Stage Single-Point Attacks, the sensor features “
” and “
” are utilized for the experiment. The TCN model has the highest RMSE across all of the timestamps, while the BI-LSTM model with no dropout layer has the lowest RMSE, which is demonstrated in the plot shown in
Figure 7b. The TCN model has the highest RMSE,
, at a timestamp of 6 min, whereas the BI-LSTM model without a dropout layer has the lowest RMSE,
, at 7 min timestamps. According to the plot shown in
Figure 6c, the TCN model with a dropout layer has the lowest F1 score across the timestamps, where the lowest F1 score of
is recorded at the 30 min timestamp. In contrast, the BI-LSTM model without a dropout layer has the highest F1 scores throughout the timestamps.
6.2.4. Multi-Stage Multi-Point Attacks
In the Multi-Stage Multi-Point Attacks, the sensor features “
”, “
” and “
” have been utilized for the experiments. The plot in
Figure 7c shows that the CuDNN-LSTM model consistently has the lowest RMSE of
throughout the timestamps, while the TCN-based model with the dropout layer has the largest RMSE throughout the timestamps (i.e., 30, 20, 15, 10, 7 and 5 min). The TCN model with the dropout layer recorded the highest RMSE of
at the 30 min timestamp. The BI-LSTM model with the dropout layer has the highest F1 scores across all timestamps, as shown by the plot in
Figure 6d. The plot demonstrates that at the 10 min timestamp, all the models’ F1 scores are at their lowest, with the highest F1 score being
and the lowest being
.
6.3. The Effect of Timestamp on Training Time and Performance
The timestamp parameter enables the models to look back and learn from data up to a certain point, and it is important for building more accurate models. As a result, it makes sense to identify the intervals of timestamps where the models perform best.
Table 1 provides a numerical comparison of RMSE values for the models built for various values of timestamps. A glance at the RMSE values indicates that the greater the timestamp values, the smaller the RMSE values will be. For instance, the values of CuDNN-LSTM without a dropout layer show that the RMSE values are
and
when timestamps are 1800 and 200, respectively.
On the other hand, we naturally expect the training time to increase when the timestamp value (i.e., the look-back) increases. For instance, the values of CuDNN-LSTM without a dropout layer show that the training times are 21 min and 5 min when the timestamps are 1800 and 200, respectively.
Figure 6 also depicts the F1 scores for different types of attacks for the models built when various values of timestamps are studied. In most cases, we observe that by increasing the timestamp values, the models improve and the F1 scores increase. The only attack type that behaves differently from the others is the “Single-Stage Multi-Point Attack”, where we observe a slight reduction in F1 scores. Given that the reduction in F1 scores is not substantial, we conclude that, overall, greater timestamps lead to better performance in terms of model accuracy but longer training time.
7. Conclusions and Future Work
The Internet of Things (IoT) enables connecting hundreds of devices together. These interconnected devices often capture data from their local environment and exchange them with other devices to accomplish certain tasks cooperatively. Due to the rise of these types of systems, more and more industries, such as manufacturing and energy, are adopting them to serve their customers. On the other hand, these IoT-based systems are also becoming targets for cyber attackers aiming to tamper with data and thus disrupt operations. As a result, it is essential to detect any anomalies as early as possible so that the damage is prevented or minimized.
A good number of anomaly detection techniques have been applied to address this problem. With the advent of machine and deep learning, and given the amount of data captured through the communications of interconnected devices, researchers have explored these data-oriented approaches to model and detect anomalies.
This paper compares some of these machine/deep learning techniques and reports their performance. More specifically, the paper studies the performance of a number of variations of LSTM-based (i.e., RNN-based) models in comparison with a CNN-based model called a TCN. The driving motivation is the basic difference in how these two types of models (CNN and RNN) operate.
According to our results, we observe that TCN performs relatively well in detecting anomalies with a relatively low RMSE. The key aspect of TCN is that it needs less training time in comparison with its RNN counterparts. On the other hand, we observe that the fast version of LSTM performs the best in terms of accuracy, but it needs a greater amount of time for training.
The analysis performed and the results reported in this paper do not explicitly recommend in which cases these two types of neural networks should be employed. However, they suggest that when a small amount of accuracy can be sacrificed, TCN can be utilized, since it requires less training time. As a future direction, it is necessary to compare the results on different datasets with various volumes and to examine the trade-off between these two models.