Article

Enhancing Cybersecurity: Hybrid Deep Learning Approaches to Smishing Attack Detection

by Tanjim Mahmud 1,*, Md. Alif Hossen Prince 1, Md. Hasan Ali 1, Mohammad Shahadat Hossain 2,3 and Karl Andersson 3
1 Department of Computer Science and Engineering, Rangamati Science and Technology University, Rangamati 4500, Bangladesh
2 Department of Computer Science and Engineering, University of Chittagong, Chittagong 4331, Bangladesh
3 Cybersecurity Laboratory, Luleå University of Technology, 97187 Luleå, Sweden
* Author to whom correspondence should be addressed.
Systems 2024, 12(11), 490; https://doi.org/10.3390/systems12110490
Submission received: 15 September 2024 / Revised: 8 November 2024 / Accepted: 11 November 2024 / Published: 14 November 2024

Abstract

Smishing attacks, a sophisticated form of cybersecurity threats conducted via Short Message Service (SMS), have escalated in complexity with the widespread adoption of mobile devices, making it increasingly challenging for individuals to distinguish between legitimate and malicious messages. Traditional phishing detection methods, such as feature-based, rule-based, heuristic, and blacklist approaches, have struggled to keep pace with the rapidly evolving tactics employed by attackers. To enhance cybersecurity and address these challenges, this paper proposes a hybrid deep learning approach that combines Bidirectional Gated Recurrent Units (Bi-GRUs) and Convolutional Neural Networks (CNNs), referred to as CNN-Bi-GRU, for the accurate identification and classification of smishing attacks. The SMS Phishing Collection dataset was used, with a preparatory procedure involving the transformation of unstructured text data into numerical representations and the training of Word2Vec on preprocessed text. Experimental results demonstrate that the proposed CNN-Bi-GRU model outperforms existing approaches, achieving an overall highest accuracy of 99.82% in detecting SMS phishing messages. This study provides an empirical analysis of the effectiveness of hybrid deep learning techniques for SMS phishing detection, offering a more precise and efficient solution to enhance cybersecurity in mobile communications.

1. Introduction

As mobile device usage proliferates, smishing attacks—phishing conducted via Short Message Service (SMS)—have become increasingly sophisticated, complicating individuals’ ability to distinguish between legitimate and malicious messages [1]. These attacks exploit social engineering techniques to deceive victims into revealing sensitive information or clicking harmful links, posing significant threats to personal and financial data [2]. The trust users place in SMS communications creates a critical vulnerability in mobile cybersecurity [3,4,5].
Phishing is a cybercrime technique where attackers impersonate legitimate entities to trick individuals into sharing sensitive information, such as passwords, credit card numbers, or personal details [6]. While phishing is most commonly executed through emails, it can also be conducted through other channels, including phone calls (vishing) and SMS (smishing). Smishing, specifically, involves using SMS text messages to impersonate reputable organizations, such as banks, government agencies, or service providers, to prompt recipients into taking immediate action, like clicking a link or calling a phone number [7]. The messages often convey a sense of urgency, exploiting the immediate and personal nature of SMS to make it harder for users to recognize the scam.
As a subset of phishing, smishing poses unique dangers due to the popularity and trust associated with mobile phones [8]. SMS phishing attacks are often more effective than traditional email phishing because users generally trust text messages more readily [9]. Additionally, attackers leverage social engineering techniques, such as urgency and emotional appeals, to manipulate recipients, increasing the likelihood of a successful attack [10].
Traditional cybersecurity methods for detecting phishing, such as feature-based, rule-based, and heuristic approaches, have struggled to adapt to the rapidly evolving tactics employed by cybercriminals [11]. These conventional techniques often fail to identify novel or subtly altered phishing attempts, leaving users vulnerable to exploitation. The rising frequency and sophistication of smishing attacks highlight the urgent need for advanced detection methods that can respond to emerging threats [12]. According to a 2022 report by Verizon Business, human errors, including phishing and compromised credentials, account for 82% of data breaches [13], while the FBI’s Internet Crime Complaint Center reports that phishing was the most prevalent cybersecurity threat in the United States last year, impacting over 323,000 victims [14]. Alarmingly, only 20% of organizations provide their employees with annual phishing awareness training, exposing a significant gap in cybersecurity preparedness [15]. This landscape underscores the critical need for enhanced detection mechanisms to combat smishing attacks, which can deceptively mimic legitimate messages, complicating detection using traditional methods.
To address these cybersecurity challenges, this study proposes a hybrid deep learning model that combines Bidirectional Gated Recurrent Units (Bi-GRUs) [16] and Convolutional Neural Networks (CNNs) [17], referred to as CNN-Bi-GRU [18]. By leveraging the strengths of both architectures, the model aims to enhance the accuracy of smishing detection, outperforming existing solutions in identifying phishing messages. The SMS Phishing Collection dataset [19] is employed, with unstructured text data converted into numerical representations through preprocessing with Word2Vec [20]. The hybrid model is fine-tuned, achieving an impressive accuracy of 99.82% in detecting smishing messages, thereby demonstrating the potential of deep learning techniques in fortifying cybersecurity defenses.
This study contributes to ongoing efforts to improve cybersecurity and protect users from the growing threat of smishing. By offering a precise and efficient solution for detecting SMS phishing, this research aims to enhance the cybersecurity of smartphone users globally, ultimately reducing the risks associated with these deceptive cyber threats.
The main contributions of this study are as follows:
  • We utilized three distinct SMS datasets, training models individually before merging them into a combined dataset. This approach enhances model generalization and adaptability, as demonstrated by comparative performance analyses across all datasets, providing insights into the models’ effectiveness on varied data sources.
  • We investigated the impact of different training parameters on the model’s performance and employed the Explainable AI (XAI) method LIME to interpret model decisions, adding transparency to the detection process.

2. Review of Existing Studies

SMS phishing attacks, or smishing, have grown in prevalence with the increasing use of mobile devices for online communication and commerce. Over the years, several studies and models have been developed to detect and prevent smishing attacks using machine learning algorithms. In 2017, Joo et al. [21] used statistical learning to detect smishing attacks. In 2020, Mishra and Soni [10] employed a prototype containing an APK download detector, URL filter, code analyzer, and content analyzer to evaluate SMS and URL behavior for smishing detection.
Recent advancements in SMS phishing detection have employed a variety of methodologies. Gunikhan Sonowal [22] utilized four correlation algorithms and machine learning classifiers to rank 52 SMS properties, while Roy et al. [23] developed a system for classifying SMS spam using deep learning techniques. Ghourabi et al. [24] introduced a CNN-LSTM hybrid model for Arabic and English SMS messages. Jain [25] proposed a zero-day smishing detection approach, emphasizing feature prioritization. Also in 2020, Xia and Chen [26] developed a discrete hidden Markov model leveraging word order data to address low term-frequency challenges in SMS spam detection. In 2021, Mishra and Soni [27] achieved a 5-CV accuracy of 97.93%, while Liu et al. [28] compared a modified spam Transformer to nine traditional methods for SMS spam recognition. Finally, Mishra and Soni [29] identified smishing SMS attacks, and Mambina et al. [30] proposed a mixed machine learning approach for Swahili smishing detection. Table 1 presents a summary of the related works.
Using machine learning to detect smishing attacks is a viable strategy for limiting the associated dangers. However, current machine learning algorithms rely heavily on feature engineering, which can be time-consuming and cannot always keep pace with the dynamic nature of these attacks. Numerous studies analyzed in this review demonstrate that machine learning and deep learning algorithms can accurately identify smishing messages, yet further work is required to detect increasingly sophisticated attacks. The main limitations of existing spam and smishing detection systems include language dependence, insufficient detection accuracy, and small training sets. To defend against these complex and potentially damaging attacks, we investigate how deep learning algorithms can identify SMS phishing attempts. In this work, a public dataset [32] of SMS phishing attacks is used to evaluate the proposed deep learning method, relying on meticulous preprocessing and hyperparameter tuning under cross-validation, and the data are partitioned into formal training, validation, and testing sets. In the following section, we discuss the methodology employed to accomplish the intended experiments.

3. Study Design and Implementation

Developing an SMS phishing detection system requires consideration of several factors. The training data must first be loaded and configured, and the text data must be preprocessed to ensure training accuracy by removing whitespace, punctuation, and stop words. A Word2Vec model is then trained on the cleaned data to generate an embedding matrix for the network.

3.1. Datasets

The first step in building a robust classification model for SMS phishing (smishing) attack detection involves gathering a comprehensive dataset. For this study, we utilized three key datasets to ensure model accuracy through different SMS phishing detection scenarios.
  • SMS Phishing Collection: The primary dataset used in this study is the SMS Phishing Collection, curated by Almeida et al. [19], which is publicly available on Kaggle [32]. This dataset contains a total of 5571 SMS messages, comprising 4824 legitimate (ham) messages and 747 phishing (smishing) attempts (see Table 2).
  • Phishing_detection dataset: For further testing, we used a second dataset with a total of 13,320 samples, which includes 7981 spam and 5339 legitimate (ham) messages (https://huggingface.co/datasets/Ad10sKun/phishing_detection, accessed on 12 September 2024). This dataset broadens the coverage of potential spam scenarios and aids in evaluating the model’s performance on both spam and legitimate messages.
  • Phishing-dataset Dataset: Lastly, we leveraged a third dataset (https://huggingface.co/datasets/ealvaradob/phishing-dataset, accessed on 12 September 2024) consisting of 5971 SMS messages that were originally categorized as ham, smishing, and spam [10,33]. To align with the binary classification objective of this study, we restructured this dataset into two categories: legitimate (ham) messages totaling 4844 and smishing messages totaling 1127.
  • Combined Dataset: We created a fourth combined dataset by merging the SMS Phishing Collection, the Phishing_detection Dataset, and the Phishing-dataset. This new dataset aggregates all samples to enhance the robustness of our analysis, leading to a total of 24,862 samples, including 15,007 legitimate (ham) messages and 9855 phishing (smishing) attempts.
Using these datasets, individually and in combination, allows us to thoroughly evaluate our hybrid deep learning model’s effectiveness in detecting smishing attacks and distinguishing them from other SMS types.
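For illustration, the datasets can be assembled with a few lines of pandas as sketched below. The file names and the message/label column layout are hypothetical placeholders for local exports of the three source datasets rather than the authors’ actual files.

```python
import pandas as pd

# Hypothetical local exports; each is assumed to be normalized to two columns:
# "text" (the SMS body) and "label" (ham or smishing).
frames = [
    pd.read_csv("sms_phishing_collection.csv"),   # Dataset 1 (Kaggle)
    pd.read_csv("phishing_detection.csv"),        # Dataset 2 (Hugging Face export)
    pd.read_csv("phishing_dataset.csv"),          # Dataset 3, remapped to ham/smishing
]
combined = pd.concat(frames, ignore_index=True)   # Dataset 4 (combined)
print(combined["label"].value_counts())           # ham vs. smishing counts
```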

3.2. Data Preprocessing

Before entering raw SMS message data into the deep learning model, preprocessing is necessary. The input text must be sanitized by removing unnecessary information, such as stop words, punctuation, and special characters [34]. The widely used Word2Vec method converts text into numerical vectors by representing each word in a multi-dimensional vector space, where each word can influence the context of other words in the text. This enables the deep learning model to parse text input, recognize patterns in numerical vectors, and make accurate predictions.

3.2.1. Text Cleanup

Deep learning algorithms require SMS preprocessing, which is essential [35]. During this process, filler words such as ‘the’, ‘and’, ‘or’, and ‘a’, which do not significantly contribute to the text’s meaning, are removed. This enhances the model’s efficiency by cleaning the data. Consistency is also improved by converting all text to lowercase. Depending on the specific purpose and dataset, additional techniques may be applied, such as stemming, lemmatization, and handling rare or misspelled words. Text preprocessing aims to standardize and cleanse raw text input, enabling deep learning algorithms to make accurate predictions.
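As an illustration of this cleanup step, the sketch below lowercases a message, strips punctuation and special characters, and removes NLTK English stop words; the regular expression and the example message are our own choices rather than the exact pipeline used in the paper.

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def clean_sms(text: str) -> str:
    """Lowercase, strip punctuation/special characters, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # keep only letters, digits, whitespace
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(clean_sms("Congratulations! Click the link to claim your FREE prize."))
# -> "congratulations click link claim free prize"
```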

3.2.2. Process of Tokenization

Tokenization breaks down a written document into smaller parts, typically single words or phrases [36,37,38]. In SMS text processing, tokenization refers to the practice of segmenting raw text data into individual words or tokens. Since machine learning models require numerical input, tokenization is a crucial step in text processing. The Python Natural Language Toolkit (NLTK) package supports text data tokenization: using the word_tokenize() function, the input text is split into individual words, producing a list of tokens. Various tokenization approaches and criteria can be customized to fit a specific purpose or dataset. Text data can be naturally divided into tokens based on whitespace, punctuation, and individual characters, and custom tokenization rules can be created with regular expressions and other techniques to meet unique requirements.
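A minimal NLTK tokenization example is shown below; the sample message is purely illustrative.

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)   # tokenizer models required by word_tokenize

message = "Win a brand new phone today"
tokens = word_tokenize(message)      # whitespace- and punctuation-aware splitting
print(tokens)                        # ['Win', 'a', 'brand', 'new', 'phone', 'today']
```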

3.2.3. Padding

After SMS messages have been tokenized, the resulting sequences need to be padded to a uniform length, as machine learning algorithms require inputs of the same size [39]. Therefore, all inputs must contain the same number of features. This process, known as padding, typically involves adding zeros at the beginning or end of sequences. For example, consider two tokenized phrases: ‘I prefer tea’ and ‘I hate tea and coffee’. The first sentence contains three tokens, while the second has five. To make their lengths equal, we pad the first sentence with two zeros at the end, resulting in ‘I prefer tea 0 0’, so that both sequences have a length of five.
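Equivalently, with token ids instead of words, Keras’ pad_sequences performs this step; the id values below are arbitrary, and padding="post" appends the zeros as in the example above.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Two hypothetical token-id sequences of length 3 and 5.
sequences = [[12, 7, 431], [12, 95, 431, 6, 208]]
padded = pad_sequences(sequences, maxlen=5, padding="post", value=0)
print(padded)
# [[ 12   7 431   0   0]
#  [ 12  95 431   6 208]]
```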

3.2.4. Word Embedding

Semantic research utilizes word embeddings, also known as vector representations, to represent words in a meaningful way. Word2Vec is commonly used to generate these word embeddings by predicting words from their surrounding contexts within a corpus. For example, given the context words ‘I’, ‘prefer’, ‘and’, and ‘coffee’ from the phrase ‘I prefer tea and coffee’, a CBOW-style Word2Vec model learns to predict the center word ‘tea’.
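In practice, such embeddings can be trained with gensim, as in the sketch below; tokenized_sms stands for the list of token lists produced by the preceding steps (a hypothetical variable name), and vector_size=100 matches the embedding size used later in the paper.

```python
from gensim.models import Word2Vec

# tokenized_sms: list of token lists, e.g., [['win', 'free', 'prize'], ...]
w2v = Word2Vec(sentences=tokenized_sms, vector_size=100, window=5,
               min_count=1, sg=0, workers=4)   # sg=0 selects the CBOW variant
print(w2v.wv["free"][:5])                      # first few dimensions (assumes 'free' is in the corpus)
```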

3.2.5. Embedding Matrix

The learned word embeddings are fed into the neural network, which produces an embedding matrix [40]. The structure of the embedding matrix is based on the vocabulary size (total unique words in the dataset) and the embedding size. This matrix has dimensions of (vocab size, embedding size), with each row representing a word from the dataset. Each cell in the matrix contains numerical values that represent the word embeddings. The embedding matrix can be used to initialize the weights of the neural network’s embedding layer. For example, if our vocabulary contains 4203 terms and the embedding size in our Word2Vec model is 100, the embedding matrix dimensions would be 4203 by 100, or (4203, 100).
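The matrix construction itself is a simple lookup, as sketched below; tokenizer.word_index is assumed to come from a fitted Keras tokenizer and w2v from the Word2Vec step above.

```python
import numpy as np

embedding_dim = 100
vocab_size = len(tokenizer.word_index) + 1        # index 0 is reserved for padding
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in tokenizer.word_index.items():
    if word in w2v.wv:                            # words unseen by Word2Vec stay zero
        embedding_matrix[idx] = w2v.wv[word]
print(embedding_matrix.shape)                     # (vocab_size, 100)
```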

3.3. Proposed Model Architecture

After preprocessing the SMS dataset and constructing an embedding matrix, a CNN-Bi-GRU hybrid model is trained to recognize SMS phishing. This proposed model comprises three main components: an embedding layer, a CNN layer, and a Bi-GRU layer. The embedding layer extracts syntactic and semantic information from the text and converts it into a fixed-length vector representation. The CNN layer applies multiple convolutional filters to the word embeddings, generating feature maps through convolutions and nonlinear activation functions. These feature maps are then integrated, normalized, and passed to the Bi-GRU layer, with max pooling preserving distinctive features while reducing spatial dimensions. The Bi-GRU layer combines outputs from both GRU units, which are then passed through a dense (fully connected) layer that applies a nonlinear activation function to produce the final output. This collaborative layering enables meaningful representations from text data, capturing pertinent information for phishing detection. The methodology followed in this study is illustrated in Figure 1.
For model compilation, the log loss function and the Adam optimizer are used, and a randomized search method is employed to fine-tune hyperparameters such as the number of epochs, batch size, dropout rate, and CNN layer filters. The model’s performance is evaluated on a distinct test set, while a validation set helps identify optimal hyperparameters, assessing metrics such as precision, recall, accuracy, and F1-score. Built with the Keras Sequential API, the architecture includes an embedding layer, a 1D convolutional layer with 32 filters (kernel size of 3), a max pooling layer, a bidirectional GRU layer with 64 units, and a dense output layer with a sigmoid activation function. Once the Word2Vec embeddings are generated, the data are split into training and validation sets, and the model is trained in batches of size 128 for 50 epochs with binary cross-entropy loss and the Adam optimizer, with accuracy monitored on the validation data.
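A minimal Keras sketch of the architecture described above is given below. It reflects the stated settings (Conv1D with 32 filters and kernel size 3, max pooling, a bidirectional GRU with 64 units, a sigmoid output, binary cross-entropy, Adam, batch size 128, 50 epochs); names such as vocab_size, embedding_matrix, MAX_LEN, and the data splits are placeholders from the earlier steps rather than the authors’ exact code.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D,
                                     Dropout, Bidirectional, GRU, Dense)

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=100,
              weights=[embedding_matrix], input_length=MAX_LEN),
    Conv1D(filters=32, kernel_size=3, activation="relu"),        # feature maps
    MaxPooling1D(pool_size=2),                                   # reduce spatial size
    Dropout(0.2),
    Bidirectional(GRU(64, dropout=0.2, recurrent_dropout=0.2)),  # Bi-GRU layer
    Dense(1, activation="sigmoid"),                              # smishing probability
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=50, batch_size=128)
```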

4. Experimental Outcomes and Analysis

In this section, we present a thorough empirical investigation carried out in this work for detecting SMS phishing attacks.

4.1. Setup

In this study, we utilized a Tesla T4 GPU, paired with an Intel® Xeon® CPU operating at 2.00 GHz, to implement our proposed approaches for smishing attack detection (see Table 3). The Tesla T4 is a powerful GPU specifically designed for machine learning and deep learning tasks, featuring NVIDIA’s Tensor Cores, which significantly accelerate training and inference operations. This hardware setup allows for the efficient processing of large datasets and complex models, which is critical in the context of cybersecurity where timely and accurate detection of smishing attacks is paramount. The total memory capacity of 13,290,460 kB (approximately 13 GB) facilitates the handling of substantial datasets and complex feature sets. This memory capacity supports the training of deep learning models without the risk of memory overflow, thus enabling more extensive experiments and model configurations. The effective utilization of the NVIDIA-SMI 535.104.05 driver ensures optimized performance and stability during model training and evaluation phases.

4.2. Dataset Splitting

To ensure the efficacy and precision of a machine learning model for SMS phishing detection, the dataset must be divided into training, validation, and testing subsets; a split ratio of 75%:10%:15% is used for this purpose. After preprocessing (including stop word removal so that the model can learn from the text data more effectively), the dataset is randomly partitioned into the three subsets, ensuring that each subset contains a representative mix of legitimate and smishing messages. The training set is used to fit the model; the validation set is used during training to monitor accuracy, guide model development, and prevent overfitting [41]; and the held-out testing set assesses the final model’s performance on data it has not seen. This procedure allows models to be trained, validated, and tested so that they can reliably recognize SMS phishing messages, safeguarding individuals and organizations from cyberattacks. Because they learn features directly from raw data, deep learning models are generally more successful than conventional machine learning techniques at recognizing SMS phishing. Statistics of the Smishing Collection Dataset [19,32] are given in Table 4.
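A common way to realize the 75%:10%:15% split is two chained calls to scikit-learn’s train_test_split, as sketched below; X and y are the padded sequences and binary labels from the preprocessing steps, and stratification is one way to keep the ham/smishing mix representative in every subset.

```python
from sklearn.model_selection import train_test_split

# Hold out 15% for testing, then split the remaining 85% into 75%/10% of the total.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.10 / 0.85, stratify=y_tmp, random_state=42)
print(len(X_train), len(X_val), len(X_test))   # roughly 75% / 10% / 15%
```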

4.3. Model Evaluation Results

The evaluated models include decision tree, random forest, KNN, SVM, AdaBoost, LSTM, Bi-LSTM, GRU, CNN-LSTM, CNN-Bi-GRU, CNN-GRU, and CNN-Bi-LSTM. The CNN-Bi-GRU model performed best, with an F1-score of 98.56% and an accuracy of 98.54%; its precision of 0.9939 and recall of 0.9775 were also above average compared to the other models investigated. The initial hyperparameters for the deep learning models were a dropout rate of 0.2, 32 filters, 64 units, and a recurrent dropout rate of 0.2, with ReLU activation used in the input/hidden layers and sigmoid activation at the output. Each model was trained with a batch size of 128 for 50 epochs. The CNN-Bi-GRU model was highly effective in identifying SMS phishing attempts, and we achieved satisfactory results across all performance metrics [42]. The evaluation findings for SMS phishing detection on Dataset 1 are summarized in Table 5.
On the other hand, the performance analysis of various algorithms on Datasets 2, 3, and 4 demonstrates that deep learning models, especially hybrid models, consistently outperformed traditional machine learning algorithms in detecting SMS phishing. For Dataset 2, CNN-BiGRU achieved the highest accuracy at 97.22% (see Table 6), while on Dataset 3 the same model led again with 98.17% accuracy (see Table 7). Notably, Dataset 4, a combined dataset comprising Datasets 1, 2, and 3, reinforced CNN-BiGRU’s effectiveness, reaching an accuracy of 98.21% with high precision, recall, and F1 scores, showcasing its adaptability to diverse datasets (see Table 8). Overall, CNN-BiGRU consistently delivered the most robust results, demonstrating superior generalizability and high AUC-ROC scores. These findings highlight the value of hybrid deep learning approaches, particularly CNN-BiGRU, for versatile and accurate smishing detection across various data contexts.

4.4. Proposed Model Hyperparameters Tuning

The best hyperparameter values were used when implementing the CNN-Bi-GRU model. The model’s performance was carefully calibrated by tuning these parameters, and the optimal values were found after extensive testing. The filter parameter was selected from 32, 64, and 128, with 32 giving the best results. The dropout rate was set to 0.2, with 0.3 and 0.4 also considered.
The model was trained for fifty epochs, with values between fifty and one hundred epochs considered. The Adam optimizer was chosen over the similarly evaluated RMSProp optimizer. The kernel size was set to 3, with 5 also tested, but 3 produced superior results. The units parameter was set to 64 from the candidates 16, 32, 64, and 128. A recurrent dropout of 0.2 was selected, with 0.3 and 0.5 also investigated. ReLU was used for the input activation, with tanh and sigmoid also evaluated, while sigmoid was selected as the output activation over ReLU and tanh. Finally, a pool size of 2 was used, with 3 also considered. After considerable experimentation, we are confident that these hyperparameters yield the best configuration of the proposed model. A graph illustrating the model’s accuracy for the different parameter settings is shown in Figure 2.
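The randomized search itself can be expressed as a simple loop over the candidate values listed above; build_model is a hypothetical helper that assembles and compiles the CNN-Bi-GRU with the sampled settings, and the number of trials is arbitrary.

```python
import random

search_space = {
    "filters": [32, 64, 128],
    "kernel_size": [3, 5],
    "units": [16, 32, 64, 128],
    "dropout": [0.2, 0.3, 0.4],
    "recurrent_dropout": [0.2, 0.3, 0.5],
    "pool_size": [2, 3],
    "optimizer": ["adam", "rmsprop"],
    "epochs": [50, 75, 100],
}

best_acc, best_cfg = 0.0, None
for _ in range(20):                                    # number of random trials
    cfg = {k: random.choice(v) for k, v in search_space.items()}
    model = build_model(**{k: v for k, v in cfg.items() if k != "epochs"})
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=cfg["epochs"], batch_size=128, verbose=0)
    _, val_acc = model.evaluate(X_val, y_val, verbose=0)
    if val_acc > best_acc:
        best_acc, best_cfg = val_acc, cfg
print(best_cfg, best_acc)
```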

4.5. Cross-Validation

To prevent overfitting, we implemented regularization techniques across our models. Dropout layers were added to the recurrent layers (LSTM, GRU, and their bidirectional variants) with a 0.2 dropout rate and recurrent dropout. For convolutional models, dropout layers were combined with max pooling to further mitigate overfitting. Additionally, L2 kernel regularization (0.01) was applied to the convolutional and recurrent layers, limiting model complexity for better generalization. We also used cross-validation with varied parameter settings to fine-tune performance and validate the model across the four datasets.
Table 9 compares the efficacy of three cross-validation protocols (three-, five-, and ten-fold) on various machine learning algorithms and deep learning models, with model quality evaluated on the classification task. The models include decision trees, random forests, KNN, SVM, AdaBoost, CNN-Bi-GRU, GRU, CNN-GRU, CNN-LSTM, LSTM, CNN-Bi-LSTM, and Bi-LSTM. Traditional machine learning (ML) approaches are less accurate than the deep learning models, with the CNN-Bi-GRU model performing best across all three cross-validation strategies: with a batch size of 128, its accuracy under 3-, 5-, and 10-fold cross-validation was 0.9837, 0.9890, and 0.9974, respectively. Overall, the results illustrate how beneficial deep learning models can be for tackling classification problems and how vital it is to use cross-validation to assess model performance.
These results indicate that the CNN-Bi-GRU model can reliably identify SMS phishing attempts and that the approach performs admirably in practice. The findings have practical implications for vulnerable organizations, showing that deep learning models can strengthen SMS phishing security.
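The k-fold protocol can be sketched as follows; StratifiedKFold preserves the ham/smishing ratio in every fold, build_model is the same hypothetical helper as above, X and y hold the full padded feature matrix and label array, and k is set to 3, 5, or 10 to match the three strategies in Table 9.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, k=10):
    """Return the mean validation accuracy over k stratified folds."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        model = build_model()                          # fresh model for each fold
        model.fit(X[train_idx], y[train_idx],
                  epochs=50, batch_size=128, verbose=0)
        _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
        scores.append(acc)
    return float(np.mean(scores))

print(cross_validate(X, y, k=10))
```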
In Table 10, we summarize the performance of our proposed model with the best hyperparameter values from tuning and a batch size of 32; the same number of epochs and 10-fold cross-validation were applied to obtain the optimal result. In our study, several parameters significantly influenced the model’s performance in smishing detection:
  • Units: Increasing the units in layers enhances the model’s capacity to learn complex patterns. For instance, transitioning from 16 to 64 units improved both accuracy and F1 scores, indicating better feature extraction.
  • Batch Size: Smaller batch sizes (e.g., 32) stabilize gradient updates, resulting in improved convergence. This is reflected in the slight gains in F1 scores, suggesting enhanced balance between precision and recall.
  • Kernel and Pool Size: A kernel size of 3 and pooling size of 2 efficiently reduce feature map dimensionality while retaining critical data features, contributing to high performance.
  • Activation Functions: Using ReLU for hidden layers facilitates modeling nonlinear relationships, while Sigmoid in the output layer aids in binary classification, directly impacting the accuracy of predictions.
  • Epochs: Training for 50 epochs strikes a balance between learning and overfitting. Coupled with dropout layers (rate of 0.2), it promotes generalization on unseen data.
  • Filters: The selection of 32 filters captures diverse features from the text efficiently. This ensures computational efficiency without compromising accuracy.

4.6. Comparison of Performance with Alternative Datasets

In analyzing the consolidated performance of various models across the four datasets, distinct patterns emerge that offer insights into model suitability for different contexts. Neural-network-based models, particularly CNN-BiGRU and GRU-based architectures, consistently exhibit high accuracy and F1 scores across all datasets. For example, CNN-BiGRU achieves a notably high accuracy of 98.54% on Dataset 1 and 98.21% on Dataset 4, demonstrating adaptability across varied data distributions (see Table 11). This suggests that CNN-BiGRU, with its hybrid deep learning architecture, is well suited for capturing complex patterns and dependencies in diverse datasets, outperforming simpler models such as Decision Trees and SVM, which show significantly lower performance, particularly on Dataset 2.
Traditional machine learning algorithms like Decision Trees, Random Forests, and SVM show a clear performance drop, especially on Dataset 2, where Decision Tree accuracy falls to 84.82% and SVM accuracy is the lowest across all models, at 67.69%. This disparity highlights the limitations of these algorithms in handling certain datasets, possibly due to the lack of the feature extraction capabilities present in deep learning models. Random Forest and K-Nearest Neighbors, while outperforming SVM, still struggle on Dataset 2, suggesting that they may be better suited for datasets with simpler structures or when interpretability is prioritized over accuracy. This performance contrast underscores the effectiveness of deep learning approaches, especially hybrid models like CNN-BiGRU, in delivering consistently high results across varied dataset characteristics. These findings point to the importance of model selection based on dataset complexity and the trade-offs between interpretability and predictive power.
The three datasets used—Dataset 1, Dataset 2, and Dataset 3—were sourced from various environments with distinct vocabulary, linguistic patterns, and demographic influences. Dataset 1 contains open-source SMS data with common linguistic patterns, while Dataset 2, sourced from Hugging Face, features a broader array of spam scenarios, enhancing the coverage of varied phishing techniques. Dataset 3 initially included three categories (ham, smishing, and spam), but we reorganized it into binary classes (“ham” and “smishing”) for consistency. Each dataset also exhibits a unique balance of legitimate (ham) to phishing (smishing) messages: for example, Dataset 2 contains a higher proportion of spam, while Dataset 3 originally encompassed both spam and smishing content. By training on these varied distributions, we assessed our model’s ability to maintain performance across different class ratios, offering insights into how effectively it can detect phishing attempts irrespective of class imbalances.
To further enhance generalizability, we created a fourth combined dataset by merging all three primary datasets. This aggregated dataset provides a greater diversity in message characteristics, improving model robustness by exposing it to a wider variety of phishing and legitimate messages. Performance gains observed on this combined dataset highlight the benefit of training on a comprehensive dataset, allowing the model to better capture a full range of phishing patterns, enhancing its applicability in various messaging environments.
As shown in Table 11, model performance varies across the individual datasets, reflecting their unique compositions. Models generally performed best on Dataset 1, which has a relatively straightforward structure, whereas Dataset 2, with more varied spam scenarios, presented greater challenges and led to moderate accuracy. Dataset 3 performed comparably to Dataset 1, while Dataset 4 demonstrated the highest overall accuracy and F1 scores. These findings underscore the importance of a consolidated dataset that encompasses varied message types for developing models capable of generalizing effectively to unseen data sources.

4.7. Exploring the Explainable AI Method

In the context of machine learning models, interpretability is crucial, particularly in applications such as cybersecurity, where understanding model predictions can provide insights into underlying threats. To address this need, we employed the Local Interpretable Model-agnostic Explanation (LIME) to elucidate the decision-making process of our hybrid deep learning model used for smishing attack detection.
The LIME is an interpretability framework that offers insight into individual predictions by approximating the complex model with a simpler, interpretable model in the vicinity of a specific prediction. By perturbing the input data and observing the changes in the output, LIME generates local approximations that highlight which features contribute significantly to the final prediction. This process enhances the transparency of our model and facilitates a deeper understanding of how specific textual features influence smishing detection.
Mathematical Foundation:
Given a trained model $f:\mathbb{R}^n \to \mathbb{R}$ that maps input features $x \in \mathbb{R}^n$ to a predicted outcome, the LIME operates as follows:
1. Local Neighborhood Creation: For a given instance $x_0$ for which we wish to obtain an explanation, the LIME generates a dataset of perturbed samples $\{(x_i, f(x_i))\}_{i=1}^{m}$. These perturbed samples $x_i$ are created by randomly modifying the original instance $x_0$ in a controlled manner. The perturbation function is often defined as
$$\tilde{x}_i \sim \mathrm{Perturb}(x_0).$$
The goal is to create a set of samples around $x_0$ that retains its local context.
2. Weighted Sampling: The LIME assigns a weight $w_i$ to each perturbed sample based on its proximity to $x_0$. This weighting can be formulated using a kernel function $K(x_0, x_i)$, typically chosen as an exponential decay function:
$$w_i = K(x_0, x_i) = \exp\!\left(-\frac{d(x_0, x_i)^2}{\sigma^2}\right),$$
where $d(x_0, x_i)$ is a distance metric (often Euclidean) between the original instance and the perturbed sample and $\sigma$ is a bandwidth parameter controlling the size of the neighborhood.
3. Fitting an Interpretable Model: With the perturbed samples, their corresponding predictions $f(x_i)$, and weights $w_i$, the LIME fits a simple, interpretable model $g:\mathbb{R}^n \to \mathbb{R}$ (commonly a linear model) that minimizes the following objective function:
$$\hat{g} = \arg\min_{g} \sum_{i=1}^{m} w_i \bigl(g(x_i) - f(x_i)\bigr)^2 + \lambda\,\Omega(g),$$
where $\Omega(g)$ is a complexity penalty term that discourages overfitting and $\lambda$ is a regularization parameter.
4. Interpretation of the Explanation: The resulting interpretable model $g$ provides coefficients that indicate the contribution of each feature to the prediction made by the complex model $f$ for the instance $x_0$. This is typically represented as
$$g(x) = \sum_{j=1}^{n} \theta_j x_j,$$
where $\theta_j$ represents the importance of feature $j$. The sign and magnitude of these coefficients allow us to infer which features are driving the model’s prediction, thus providing an explanation that is both human-readable and mathematically grounded.
In our implementation, we first trained the proposed model on a dataset of SMS messages, where each message is labeled as either benign or a potential smishing attempt. After training, we selected specific instances of interest from the test set to apply LIME. By using LIME, we generated explanations for these instances, which involved analyzing the impact of individual features (e.g., specific words or phrases) on the model’s predictions.
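A hedged sketch of this workflow with the lime library is shown below; predict_proba wraps the trained classifier so that LIME receives [P(ham), P(smish)] for raw strings, and tokenizer, model, and MAX_LEN are the hypothetical objects from the earlier preprocessing and training sketches.

```python
import numpy as np
from lime.lime_text import LimeTextExplainer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def predict_proba(texts):
    """Map raw SMS strings to class probabilities in the order ['ham', 'smish']."""
    seqs = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LEN)
    p_smish = model.predict(seqs, verbose=0).ravel()    # sigmoid output
    return np.column_stack([1.0 - p_smish, p_smish])

explainer = LimeTextExplainer(class_names=["ham", "smish"])
sample = "Try CHIT-CHAT on your mobile now!"            # sample message from the paper
explanation = explainer.explain_instance(sample, predict_proba, num_features=10)
print(explanation.as_list())                            # (word, weight) pairs
```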
The LIME results revealed which textual features were most influential in classifying messages as smishing (see Table 12 and Figure 3). For instance, words commonly associated with phishing attempts, such as ‘urgent’, ‘free’, or ‘click here’, emerged as significant indicators.
Figure 4 presents a LIME analysis for a sample text classified by the model developed for detecting smishing attacks. The sample text is a typical smishing message, using language like ‘Try CHIT-CHAT on your mobile now!’ and including details that create urgency or curiosity, which are common tactics in smishing attempts.
The words ‘mobile’ and ‘16’ are highlighted as the most influential in classifying this message as smishing, with positive contributions to the ‘smish’ prediction. The LIME provides an interpretability layer by identifying these specific words as critical features that influenced the model’s decision. Words like ‘mobile’ likely contributed due to their association with mobile device-targeted scams, and ‘16’ might be linked to age-related or verification prompts often seen in phishing attempts.
The LIME feature importance table lists the top words influencing the prediction, where ‘16’ and ‘mobile’ have the highest positive weights. This indicates that the model detects such keywords as strong indicators of a smishing attack. Negative weights for words like ‘W1A’ or ‘PO’ suggest these terms do not contribute positively to the smishing classification, possibly due to their lower correlation with typical smishing language or patterns.
The plot shows the prediction probabilities, where the model assigns a high probability of ‘smish’ (1.00), correctly identifying the message as smishing. This suggests that the model is effective in recognizing this type of attack, with minimal confusion between ‘ham’ and ‘smish’.
For a second, benign message, the model shows similarly high confidence in the opposite direction, assigning a ‘ham’ probability of 1.00 and a ‘smish’ probability of 0.00. This result reflects the model’s ability to correctly identify non-threatening or casual messages as non-smishing.
The words in the sentence (see Figure 5), such as ‘What’, ‘Today’, ‘Sunday’, ‘holiday’, and ‘no work’, have no significant feature importance values contributing to the classification as ‘ham’ or ‘smish’. This absence of influential terms indicates that these words do not match any typical smishing patterns or keywords that the model has learned to associate with smishing attacks.
By showing no significant word contributions for either classification, the LIME explanation reinforces the model’s reliability. The model correctly identifies this message as ‘ham’ without falsely attributing any suspicious characteristics to neutral words. In the context of the study, this example demonstrates how the model is not only effective in identifying smishing attacks but also resilient in not over-flagging benign messages, which is crucial for reducing false positives in real-world applications.

4.8. Computational Efficiency and Resource Requirements

This section provides a comprehensive overview of the computational efficiency and resource demands of various models, which is crucial for evaluating their feasibility in resource-constrained settings like smishing detection. Table 13 reveals that training times vary widely, with bidirectional models like BiLSTM and CNN-BiLSTM requiring significantly longer times per epoch due to their ability to capture both past and future context. In contrast, models like GRU and CNN-BiGRU demonstrate lower training times, suggesting they may be more computationally efficient while still providing sufficient contextual understanding for the task. Testing times, however, are consistently short across all models, ranging from 1 to 2 s, making each option feasible for real-time applications where quick responses are critical.
Table 14 shows similar memory requirements across models, all around 3 MB, indicating that they are practical for deployment even on devices with limited storage. Models like CNN-BiGRU and CNN-GRU strike a favorable balance with lower parameter counts and smaller memory footprints, making them optimal choices for efficient yet powerful smishing detection in edge environments.

4.9. Comparison with Prior Studies

The effectiveness of five different SMS spam detection techniques, together with our proposed method, was examined in this comparison. The approaches in Table 15 are DSmishSMS, Hybrid, SMS Spam Filter, Smishing Detector, and Spam Transformer. DSmishSMS employed a machine learning strategy along with the T.A. Almeida Collected Dataset and the Pinterest.com text SMS dataset; using the Random Forest classification technique, 97.93% accuracy was attained. Hybrid utilized deep learning and the T.A. Almeida dataset, using the CNN-LSTM classification technique and attaining 98.37% accuracy. SMS Spam Filter incorporated a deep learning strategy using the UCI Benchmark dataset; it also employed the CNN-LSTM classification technique and obtained 99.44% accuracy. Smishing Detector utilized machine learning and the SMS Spam Collection dataset from UCI, using the Naive Bayes classification technique and attaining 96.12% accuracy. Spam Transformer used a deep learning strategy and two datasets, SMS Spam Collection and UkMI’s Twitter, and reported an accuracy of 98.92%. The proposed method incorporated a deep learning strategy and the T.A. Almeida dataset; it utilized the CNN-Bi-GRU classification algorithm and attained an accuracy of 99.82%.
In comparing the detection performance across deep learning models, our study demonstrates significant advantages in accuracy and dataset diversity by incorporating four distinct SMS datasets. In contrast to Baardsen et al. [31], who utilized email datasets—one containing URLs (Phishing emails and Nazario Phishing Corpus) and another without URLs (Enron dataset)—our best-performing model, CNN-BiGRU, achieved an impressive accuracy of 99.82%, surpassing Baardsen et al.’s highest results of 97.9% for URL data and 96.8% for non-URL data with models like BiLSTM, BiGRU, and CNN. Other high-performing models in our study, such as CNN-GRU (96.93%), CNN-BiLSTM (97.31%), and CNN-LSTM (99.44%), also showed consistently high accuracy across multiple datasets.
Overall, our analysis demonstrates that the proposed method employing a deep learning approach and the CNN-Bi-GRU classification algorithm is the most successful way to detect SMS phishing, with an accuracy rating of 99.82%. With an accuracy score of 99.44%, the SMS Spam Filter based on deep learning and the CNN-LSTM classification algorithm also demonstrated strong performance.

5. Conclusions and Future Work

This study explored the application of hybrid deep learning models, specifically the CNN-Bi-GRU architecture, for detecting SMS phishing attacks. The proposed model demonstrated superior performance in terms of accuracy, precision, recall, and F1-score, significantly outperforming traditional classifiers such as KNN, decision tree, random forest, SVM, AdaBoost, LSTM, and GRU. With an accuracy of 99.82% and a high F1-score of 98.56%, the CNN-Bi-GRU model proved to be an effective tool for enhancing cybersecurity in the context of SMS phishing detection.
The results demonstrate the potential of deep learning models in fortifying SMS phishing security protocols, offering a robust solution for accurately identifying and classifying phishing attempts. By leveraging the strengths of both CNNs and Bi-GRUs, the model efficiently captured and processed the complex patterns within the SMS Phishing Collection dataset, making it a valuable asset in the fight against cybercrime.
In future work, the focus will be on expanding the dataset to enhance the model’s performance and exploring alternative deep learning architectures, such as attention mechanisms and transformers, to further improve detection accuracy. Additionally, efforts will be made to optimize the model for real-time SMS phishing detection on mobile devices, ensuring both efficiency and speed. Addressing more complex and sophisticated phishing attacks will also be a priority, as well as exploring cross-domain learning to create a unified detection framework across different phishing methods. Finally, integrating these insights into user education and awareness initiatives could further reduce the effectiveness of SMS phishing attacks, contributing to a more secure digital environment.

Author Contributions

Conceptualization, T.M., M.A.H.P. and M.H.A.; methodology, T.M., M.A.H.P. and M.H.A.; validation, T.M., M.A.H.P., M.H.A., M.S.H. and K.A.; formal analysis, T.M., M.S.H. and K.A.; investigation, T.M., M.S.H. and K.A.; resources, T.M., M.S.H. and K.A.; data curation, T.M., M.S.H. and K.A.; writing—original draft preparation, T.M., M.A.H.P. and M.H.A.; writing—review and editing, T.M., M.S.H. and K.A.; visualization, T.M., M.A.H.P. and M.H.A.; supervision, T.M., M.S.H. and K.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data used to support the findings of this study are available upon reasonable request to the corresponding author.

Acknowledgments

We acknowledge the use of AI tools for enhancing the quality of the paper, particularly for grammar checking.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SMS | Short Message Service
CNN | Convolutional Neural Networks
Bi-GRU | Bidirectional Gated Recurrent Units
ML | Machine Learning
DL | Deep Learning
RNN | Recurrent Neural Network
XAI | Explainable AI

References

  1. Cost of a Data Breach Report 2021. Available online: https://www.ibm.com/security/data-breach (accessed on 11 August 2024).
  2. Difference Between Spam and Phishing Mail. Available online: https://www.tutorialspoint.com/difference-between-spam-and-phishing-mail (accessed on 11 August 2024).
  3. Datta, N.; Mahmud, T.; Aziz, M.T.; Das, R.K.; Hossain, M.S.; Andersson, K. Emerging Trends and Challenges in Cybersecurity Data Science: A State-of-the-Art Review. In Proceedings of the 2024 Parul International Conference on Engineering and Technology (PICET), Vadodara, India, 3–4 May 2024; pp. 1–7. [Google Scholar]
  4. 6 Reasons Why SMS Is More Effective than Email Marketing—CallHub. 2016. Available online: https://callhub.io/6-reasons-sms-efectiveemail-marketing/ (accessed on 11 April 2024).
  5. Khan, F.; Mustafa, R.; Tasnim, F.; Mahmud, T.; Hossain, M.S.; Andersson, K. Exploring BERT and ELMo for Bangla Spam SMS Dataset Creation and Detection. In Proceedings of the 2023 26th International Conference on Computer and Information Technology (ICCIT), Cox’s Bazar, Bangladesh, 13–15 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
  6. Ayeni, R.K.; Adebiyi, A.A.; Okesola, J.O.; Igbekele, E. Phishing Attacks and Detection Techniques: A Systematic Review. In Proceedings of the 2024 International Conference on Science, Engineering and Business for Driving Sustainable Development Goals (SEB4SDG), Omu-Aran, Nigeria, 2–4 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–17. [Google Scholar]
  7. Ali, M.M.; Mohd Zaharon, N.F. Phishing—A cyber fraud: The types, implications and governance. Int. J. Educ. Reform 2024, 33, 101–121. [Google Scholar] [CrossRef]
  8. Nadeem, M.; Zahra, S.W.; Abbasi, M.N.; Arshad, A.; Riaz, S.; Ahmed, W. Phishing attack, its detections and prevention techniques. Int. J. Wirel. Secur. Netw. 2023, 1, 13–25. [Google Scholar]
  9. Jakobsson, M. Two-factor inauthentication—The rise in SMS phishing attacks. Comput. Fraud. Secur. 2018, 2018, 6–8. [Google Scholar] [CrossRef]
  10. Mishra, S.; Soni, D. Smishing Detector: A security model to detect smishing through SMS content analysis and URL behavior analysis. Future Gener. Comput. Syst. 2020, 108, 803–815. [Google Scholar] [CrossRef]
  11. What Is Phishing|Attack Techniques & Scam Examples. Learning Center. Available online: https://www.imperva.com/learn/application-security/phishing-attack-scam/ (accessed on 11 May 2024).
  12. Phishing for Information: Spearphishing Link, Sub-Technique T1598.003—Enterprise|MITRE ATT&CK®. Available online: https://attack.mitre.org/techniques/T1598/003/ (accessed on 11 May 2024).
  13. 2022 Data Breach Investigations Report. Available online: https://www.verizon.com/business/en-gb/resources/reports/dbir/ (accessed on 11 August 2024).
  14. Internet Crime Complaint Center (IC3) Releases 2020 Internet Crime Report, Including COVID-19 Scam Statistics. Available online: https://www.ic3.gov/Media/News/2021/210325.aspx (accessed on 11 August 2024).
  15. Increasing Cybercrime: UN Reports 350 Percent Rise in Phishing Websites During Pandemic. 2020. Available online: https://www.newindianexpress.com/business/2020/aug/08/increasing-cybercrime-un-reports-350-per-cent-rise-in-phishing-websites-during-pandemic-2180777.html (accessed on 11 August 2024).
  16. Mahmud, T.; Ptaszynski, M.; Masui, F. Deep Learning Hybrid Models for Multilingual Cyberbullying Detection: Insights from Bangla and Chittagonian Languages. In Proceedings of the 2023 26th International Conference on Computer and Information Technology (ICCIT), Cox’s Bazar, Bangladesh, 13–15 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
  17. Mahmud, T.; Ptaszynski, M.; Masui, F. Automatic Vulgar Word Extraction Method with Application to Vulgar Remark Detection in Chittagonian Dialect of Bangla. Appl. Sci. 2023, 13, 11875. [Google Scholar] [CrossRef]
  18. Mahmud, T.; Ptaszynski, M.; Masui, F. Exhaustive Study into Machine Learning and Deep Learning Methods for Multilingual Cyberbullying Detection in Bangla and Chittagonian Texts. Electronics 2024, 13, 1677. [Google Scholar] [CrossRef]
  19. Almeida, T.A.; Hidalgo, J.M.G.; Yamakami, A. Contributions to the study of SMS spam filtering: New collection and results. In Proceedings of the 11th ACM Symposium on Document Engineering, Mountain View, CA, USA, 19–22 September 2011; pp. 259–262. [Google Scholar]
  20. Naher, S.R.; Sultana, S.; Mahmud, T.; Aziz, M.T.; Hossain, M.S.; Andersson, K. Exploring Deep Learning for Chittagonian Slang Detection in Social Media Texts. In Proceedings of the 2024 International Conference on Electrical, Computer and Energy Technologies (ICECET), Sydney, Australia, 25–27 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
  21. Joo, J.W.; Moon, S.Y.; Singh, S.; Park, J.H. S-Detector: An enhanced security model for detecting Smishing attack for mobile computing. Telecommun. Syst. 2017, 66, 29–38. [Google Scholar] [CrossRef]
  22. Sonowal, G. Detecting phishing SMS based on multiple correlation algorithms. SN Comput. Sci. 2020, 1, 361. [Google Scholar] [CrossRef] [PubMed]
  23. Roy, P.K.; Singh, J.P.; Banerjee, S. Deep learning to filter SMS Spam. Future Gener. Comput. Syst. 2020, 102, 524–533. [Google Scholar] [CrossRef]
  24. Ghourabi, A.; Mahmood, M.A.; Alzubi, Q.M. A hybrid CNN-LSTM model for SMS spam detection in arabic and english messages. Future Internet 2020, 12, 156. [Google Scholar] [CrossRef]
  25. Jain, A.K.; Yadav, S.K.; Choudhary, N. A novel approach to detect spam and smishing SMS using machine learning techniques. Int. J. E-Serv. Mob. Appl. 2020, 12, 21–38. [Google Scholar] [CrossRef]
  26. Xia, T.; Chen, X. A discrete hidden Markov model for SMS spam detection. Appl. Sci. 2020, 10, 5011. [Google Scholar] [CrossRef]
  27. Mishra, S.; Soni, D. DSmishSMS—A System to Detect Smishing SMS. Neural Comput. Appl. 2023, 35, 4975–4992. [Google Scholar] [CrossRef] [PubMed]
  28. Liu, X.; Lu, H.; Nayak, A. A spam transformer model for SMS spam detection. IEEE Access 2021, 9, 80253–80263. [Google Scholar] [CrossRef]
  29. Mishra, S.; Soni, D. Implementation of ‘smishing detector’: An efficient model for smishing detection using neural network. SN Comput. Sci. 2022, 3, 189. [Google Scholar] [CrossRef]
  30. Mambina, I.S.; Ndibwile, J.D.; Michael, K.F. Classifying Swahili Smishing Attacks for Mobile Money Users: A Machine-Learning Approach. IEEE Access 2022, 10, 83061–83074. [Google Scholar] [CrossRef]
  31. Baardsen, A. Phishing and Social Engineering Attack Detection by Applying Intention Detection Methods. Master’s Thesis, NTNU, Trondheim, Norway, 2022. [Google Scholar]
  32. SMS Smishing Collection Data Set. Kaggle. Available online: https://www.kaggle.com/datasets/galactus007/sms-smishing-collection-data-set (accessed on 11 December 2023).
  33. Mishra, S.; Soni, D. Sms phishing dataset for machine learning and pattern recognition. In Proceedings of the International Conference on Soft Computing and Pattern Recognition, Seattle, WA, USA, 14–16 December 2022; Springer: Cham, Switzerland, 2022; pp. 597–604. [Google Scholar]
  34. Mahmud, T.; Ptaszynski, M.; Eronen, J.; Masui, F. Cyberbullying detection for low-resource languages and dialects: Review of the state of the art. Inf. Process. Manag. 2023, 60, 103454. [Google Scholar] [CrossRef]
  35. Mahmud, T.; Karim, R.; Chakma, R.; Chowdhury, T.; Hossain, M.S.; Andersson, K. A Benchmark Dataset for Cricket Sentiment Analysis in Bangla Social Media Text. Procedia Comput. Sci. 2024, 238, 377–384. [Google Scholar] [CrossRef]
  36. Akter, T.; Akter, M.S.; Mahmud, T.; Islam, D.; Hossain, M.S.; Andersson, K. Evaluating Machine Learning Methods for Bangla Text Emotion Analysis. In Proceedings of the 2024 Asia Pacific Conference on Innovation in Technology (APCIT), Mysore, India, 26–27 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
  37. Mahmud, T.; Akter, T.; Aziz, M.T.; Uddin, M.K.; Hossain, M.S.; Andersson, K. Integration of NLP and Deep Learning for Automated Fake News Detection. In Proceedings of the 2024 Second International Conference on Inventive Computing and Informatics (ICICI), Bangalore, India, 11–12 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 398–404. [Google Scholar]
  38. Bappy, A.D.; Mahmud, T.; Kaiser, M.S.; Shahadat Hossain, M.; Andersson, K. A BERT-Based Chatbot to Support Cancer Treatment Follow-Up. In Proceedings of the International Conference on Applied Intelligence and Informatics, Dubai, United Arab Emirates, 29–31 October 2023; Springer: Cham, Switzerland, 2023; pp. 47–64. [Google Scholar]
  39. Rahman, M.A.; Begum, M.; Mahmud, T.; Hossain, M.S.; Andersson, K. Analyzing Sentiments in eLearning: A Comparative Study of Bangla and Romanized Bangla Text using Transformers. IEEE Access 2024, 12, 89144–89162. [Google Scholar] [CrossRef]
  40. Rayhanuzzaman; Mahmud, T.; Das, U.K.; Naher, S.R.; Hossain, M.S.; Andersson, K. Investigating the Effectiveness of Deep Learning and Machine Learning for Bangla Poems Genre Classification. In Proceedings of the 2023 4th International Conference on Intelligent Technologies (CONIT), Bangalore, India, 21–23 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
  41. Habiba, S.U.; Mahmud, T.; Naher, S.R.; Aziz, M.T.; Rahman, T.; Datta, N.; Hossain, M.S.; Andersson, K.; Kaiser, M.S. Deep Learning Solutions for Detecting Bangla Fake News: A CNN-Based Approach. In Proceedings of the Trends in Electronics and Health Informatics: TEHI 2023, Dhaka, Bangladesh, 20–21 December 2023; p. 107. [Google Scholar]
  42. Barman, S.; Biswas, M.R.; Marjan, S.; Nahar, N.; Imam, M.H.; Mahmud, T.; Kaiser, M.S.; Hossain, M.S.; Andersson, K. A Two-Stage Stacking Ensemble Learning for Employee Attrition Prediction. In Proceedings of the International Conference on Trends in Electronics and Health Informatics, Dhaka, Bangladesh, 20–21 December 2023; Springer: Singapore, 2023; pp. 119–132. [Google Scholar]
Figure 1. Proposed system architecture.
Figure 2. A graph illustrating the proposed model’s performance across different parameter sets on Dataset 1.
Figure 3. Pictorial representations of the top ten important features on Dataset 3.
Figure 4. LIME analysis for a sample text classified by the proposed model (example 1).
Figure 5. LIME analysis for a sample text classified by the proposed model (example 2).
Table 1. Summary of the related works regarding SMS phishing detection.

Authors | Period | Method | Accuracy | Observations
Joo et al. [21] | 2017 | Statistical learning method | - | No deep learning models
Roy et al. [23] | 2020 | CNN-LSTM | 99.44% | English texts only
Ghourabi et al. [24] | 2020 | CNN-LSTM | 98.37% | No URL or file analysis.
Mishra and Soni [10] | 2020 | Structured content analysis | 96.12% | Few effective options to prevent threats, and APK malware detection is difficult.
Gunikhan Sonowal [22] | 2020 | Pearson, Spearman, Kendall, and Point-biserial correlations | 98.40% | Extensive analysis for feature selection
Ankit Kumar Jain [25] | 2020 | Classifier implementation and Information gain values | 96% | English texts only. No URL analysis. Minimal dataset size.
Xia and Chen [26] | 2020 | Discrete hidden Markov model | 98.5% | No word labeling. No HMM versions were tested.
Mishra and Soni [27] | 2021 | Backpropagation Algorithm | 97.93% | A signature is difficult to generate.
Liu et al. [28] | 2021 | Modified model based on the vanilla Transformer | 98.92% | Minimal dataset size.
Mishra and Soni [29] | 2022 | SMS service with custom rules | 97.40% | No deep learning SMS Phishing detection models.
Mambina et al. [30] | 2022 | TFIDF and feature selection in extra tree classifiers | 99.86% | No DL models.
Baardsen et al. [31] | 2022 | BERT Embedding | 97.9% | Used Email Dataset.
Table 2. Distribution of Ham and Smishing Samples Across Datasets.

Dataset | Ham (Legitimate) | Smishing (Phishing) | Ham-to-Smishing Ratio
SMS Phishing Collection (Dataset 1) | 4824 | 747 | 6.46:1
Phishing_detection dataset (Dataset 2) | 5339 | 7981 | 1:1.5
Phishing-dataset (Dataset 3) | 4844 | 1127 | 4.3:1
Combined Dataset (Dataset 4) | 15,007 | 9855 | 1.52:1
Table 3. Device specifications.

Component | Specifications
Device Name | Tesla T4
Model Name | Intel® Xeon® CPU @ 2.00 GHz
Total Memory | 13,290,460 kB
GPU Driver Version | NVIDIA-SMI 535.104.05
Table 4. Dataset statistics (Dataset 1).

Message Type | Number of Messages | Training Size | Testing Size | Validation Size
Smishing | 746 (13.4%) | 559 (75%) | 112 (13.4%) | 75 (13.5%)
Ham | 4824 (86.6%) | 3618 (75%) | 726 (86.6%) | 480 (86.5%)
Total | 5570 | 4177 (75%) | 838 (15%) | 555 (10%)
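Table 4's 75%/15%/10% proportions correspond to a stratified train/test/validation split. The snippet below is an illustrative sketch only: the dummy DataFrame, column names, random seed, and the exact splitting calls are assumptions, not the paper's implementation.

```python
# Illustrative stratified 75/15/10 split matching the proportions in Table 4.
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for Dataset 1 (746 smishing + 4824 ham messages).
df = pd.DataFrame({"text": ["free prize! call now"] * 746 + ["see you at lunch"] * 4824,
                   "label": ["smish"] * 746 + ["ham"] * 4824})

# Carve out 75% for training, then split the remaining 25% into 15% test / 10% validation.
train, rest = train_test_split(df, train_size=0.75, stratify=df["label"], random_state=42)
test, val = train_test_split(rest, train_size=0.60, stratify=rest["label"], random_state=42)

print(len(train), len(test), len(val))   # roughly 75% / 15% / 10% of the 5570 messages
```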
Table 5. Comparison of algorithm performance on Dataset 1.

Algorithm | Accuracy | Precision | Recall | F1-Score | ROC AUC | AP
Decision Tree | 0.9747 | 0.9568 | 0.996 | 0.9758 | 0.9741 | 0.955
Random Forest | 0.9731 | 0.9731 | 0.975 | 0.9738 | 0.9731 | 0.961
KNN | 0.9309 | 0.899 | 0.975 | 0.9353 | 0.9298 | 0.889
SVM | 0.8871 | 0.9127 | 0.862 | 0.8867 | 0.8877 | 0.858
AdaBoost | 0.9209 | 0.9147 | 0.933 | 0.9235 | 0.9206 | 0.888
LSTM | 0.9708 | 0.9687 | 0.975 | 0.9716 | 0.9947 | 0.996
Bi-LSTM | 0.9724 | 0.9759 | 0.970 | 0.9729 | 0.9958 | 0.997
GRU | 0.9754 | 0.9746 | 0.978 | 0.976 | 0.9958 | 0.996
Bi-GRU | 0.9708 | 0.9922 | 0.951 | 0.9709 | 0.9957 | 0.996
CNN-LSTM | 0.9724 | 0.9817 | 0.964 | 0.9728 | 0.9971 | 0.998
CNN-Bi-LSTM | 0.9731 | 0.9702 | 0.978 | 0.9739 | 0.9975 | 0.998
CNN-GRU | 0.9693 | 0.9617 | 0.979 | 0.9703 | 0.9975 | 0.998
CNN-Bi-GRU | 0.9854 | 0.9939 | 0.978 | 0.9856 | 0.9986 | 0.999
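Tables 5–8 report accuracy, precision, recall, F1-score, ROC AUC, and average precision (AP) on the held-out test sets. As a hedged illustration of how these metrics are typically obtained with scikit-learn, the sketch below uses toy arrays in place of the real test labels and model outputs.

```python
# Illustrative computation of the metrics reported in Tables 5-8.
# y_true holds 0/1 labels and y_prob holds predicted smishing probabilities;
# the arrays below are toy values, not results from the paper.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.05, 0.20, 0.90, 0.70, 0.40, 0.95, 0.10, 0.60])
y_pred = (y_prob >= 0.5).astype(int)          # threshold the sigmoid output

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))
print("roc auc  :", roc_auc_score(y_true, y_prob))
print("avg prec :", average_precision_score(y_true, y_prob))   # "AP" column in Table 5
```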
Table 6. Performance comparison of models on test Dataset 2.

Algorithm | Accuracy | Precision | Recall | F1 Score | AUC-ROC Score
Decision Tree | 0.8482 | 0.8706 | 0.8746 | 0.8726 | 0.8420
Random Forest | 0.7386 | 0.8875 | 0.6417 | 0.7448 | 0.7612
K-Nearest Neighbors | 0.6941 | 0.7387 | 0.7512 | 0.7449 | 0.6808
SVM | 0.6769 | 0.6640 | 0.9242 | 0.7728 | 0.6192
AdaBoost | 0.8126 | 0.7842 | 0.9448 | 0.8570 | 0.7817
LSTM | 0.9638 | 0.9808 | 0.9579 | 0.9692 | 0.9948
BiLSTM | 0.9605 | 0.9798 | 0.9532 | 0.9663 | 0.9936
GRU | 0.9627 | 0.9808 | 0.9560 | 0.9683 | 0.9935
BiGRU | 0.9633 | 0.9855 | 0.9523 | 0.9686 | 0.9948
CNN-LSTM | 0.9583 | 0.9761 | 0.9532 | 0.9645 | 0.9929
CNN-BiLSTM | 0.9588 | 0.9779 | 0.9523 | 0.9649 | 0.9925
CNN-GRU | 0.9583 | 0.9807 | 0.9486 | 0.9643 | 0.9934
CNN-BiGRU | 0.9722 | 0.9703 | 0.9486 | 0.9593 | 0.9929
Table 7. Performance comparison of models on test Dataset 3.

Algorithm | Accuracy | Precision | Recall | F1 Score | AUC-ROC Score
Decision Tree | 0.9564 | 0.9347 | 0.9817 | 0.9576 | 0.9563
Random Forest | 0.9564 | 0.9658 | 0.9466 | 0.9561 | 0.9565
K-Nearest Neighbors | 0.8914 | 0.8482 | 0.9543 | 0.8981 | 0.8912
SVM | 0.8372 | 0.8234 | 0.8598 | 0.8412 | 0.8371
AdaBoost | 0.9037 | 0.8909 | 0.9207 | 0.9055 | 0.9036
LSTM | 0.9648 | 0.9485 | 0.9832 | 0.9656 | 0.9950
BiLSTM | 0.9717 | 0.9654 | 0.9787 | 0.9720 | 0.9961
GRU | 0.9694 | 0.9625 | 0.9771 | 0.9697 | 0.9961
BiGRU | 0.9648 | 0.9707 | 0.9588 | 0.9647 | 0.9953
CNN-LSTM | 0.9702 | 0.9570 | 0.9848 | 0.9707 | 0.9962
CNN-BiLSTM | 0.9732 | 0.9600 | 0.9878 | 0.9737 | 0.9970
CNN-GRU | 0.9763 | 0.9771 | 0.9756 | 0.9764 | 0.9972
CNN-BiGRU | 0.9817 | 0.9813 | 0.9619 | 0.9715 | 0.9975
Table 8. Performance metrics of different algorithms on Dataset 4.

Algorithm | Accuracy | Precision | Recall | F1 Score | AUC-ROC Score
Decision Tree | 0.9245 | 0.9159 | 0.9349 | 0.9253 | 0.9245
Random Forest | 0.8411 | 0.9051 | 0.7622 | 0.8275 | 0.8411
K-Nearest Neighbors | 0.7855 | 0.7744 | 0.8061 | 0.7899 | 0.7855
SVM | 0.7182 | 0.7586 | 0.6404 | 0.6945 | 0.7182
AdaBoost | 0.8080 | 0.7571 | 0.9073 | 0.8254 | 0.8079
LSTM | 0.9768 | 0.9865 | 0.9671 | 0.9767 | 0.9965
BiLSTM | 0.9763 | 0.9783 | 0.9744 | 0.9764 | 0.9974
GRU | 0.9738 | 0.9824 | 0.9650 | 0.9736 | 0.9958
BiGRU | 0.9763 | 0.9884 | 0.9640 | 0.9760 | 0.9974
CNN-LSTM | 0.9724 | 0.9848 | 0.9595 | 0.9720 | 0.9951
CNN-BiLSTM | 0.9721 | 0.9761 | 0.9679 | 0.9720 | 0.9948
CNN-BiGRU | 0.9821 | 0.9597 | 0.9857 | 0.9725 | 0.9950
Table 9. Applied models' results for various folds on Dataset 1.

Model | 3-Fold | 5-Fold | 10-Fold
Decision Tree | 0.9574 | 0.9630 | 0.9671
Random Forest | 0.9509 | 0.9635 | 0.9692
KNN | 0.8844 | 0.8940 | 0.8980
SVM | 0.8559 | 0.8632 | 0.8696
AdaBoost | 0.9182 | 0.9126 | 0.9140
LSTM | 0.9733 | 0.9837 | 0.9917
Bi-LSTM | 0.9721 | 0.9848 | 0.9910
GRU | 0.9689 | 0.9830 | 0.9932
Bi-GRU | 0.9710 | 0.9827 | 0.9913
CNN-LSTM | 0.9769 | 0.9844 | 0.9911
CNN-Bi-LSTM | 0.9710 | 0.9796 | 0.9926
CNN-GRU | 0.9772 | 0.9851 | 0.9935
CNN-Bi-GRU | 0.9837 | 0.9890 | 0.9974
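Table 9 reports accuracy under 3-, 5-, and 10-fold cross-validation. The sketch below shows a stratified k-fold evaluation loop; a TF-IDF plus logistic-regression pipeline and an invented toy corpus stand in for the paper's models and data, so it only illustrates the evaluation protocol, not the reported numbers.

```python
# Illustrative stratified k-fold accuracy, mirroring the fold counts in Table 9.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy corpus: 30 "smishing-like" and 30 "ham-like" messages.
texts = [f"free prize {i}, call now to claim" for i in range(30)] + \
        [f"see you at the meeting {i} tomorrow" for i in range(30)]
labels = [1] * 30 + [0] * 30

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
for k in (3, 5, 10):
    scores = cross_val_score(clf, texts, labels,
                             cv=StratifiedKFold(n_splits=k, shuffle=True, random_state=42),
                             scoring="accuracy")
    print(f"{k}-fold mean accuracy: {scores.mean():.4f}")
```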
Table 10. Performance of the proposed model across various folds with different parameter sets on Dataset 1.

CV Fold | Units | Batch | Kernel Size | Pool Size | Input Activation | Output Activation | Epoch | Filters | Accuracy | F1 Score
0 | 16 | 128 | 3 | 2 | ReLU | Sigmoid | 50 | 32 | 0.9808 | 0.9754
3 | 16 | 128 | 3 | 2 | ReLU | Sigmoid | 50 | 32 | 0.9759 | 0.9823
5 | 16 | 128 | 3 | 2 | ReLU | Sigmoid | 50 | 32 | 0.9863 | 0.9689
10 | 16 | 128 | 3 | 2 | ReLU | Sigmoid | 50 | 32 | 0.9926 | 0.9865
0 | 32 | 128 | 3 | 2 | ReLU | Sigmoid | 50 | 32 | 0.9693 | 0.9579
3 | 32 | 128 | 3 | 2 | ReLU | Sigmoid | 50 | 32 | 0.9752 | 0.9671
5 | 32 | 128 | 3 | 2 | ReLU | Sigmoid | 50 | 32 | 0.9859 | 0.9912
10 | 32 | 128 | 3 | 2 | ReLU | Sigmoid | 50 | 32 | 0.9936 | 0.9876
0 | 64 | 128 | 3 | 2 | ReLU | Sigmoid | 50 | 32 | 0.9808 | 0.9879
3 | 64 | 128 | 3 | 2 | ReLU | Sigmoid | 50 | 32 | 0.9837 | 0.9932
5 | 64 | 128 | 3 | 2 | ReLU | Sigmoid | 50 | 32 | 0.9890 | 0.9971
10 | 64 | 128 | 3 | 2 | ReLU | Sigmoid | 50 | 32 | 0.9974 | 0.9892
0 | 64 | 32 | 3 | 2 | ReLU | Sigmoid | 50 | 32 | 0.9846 | 0.9768
3 | 64 | 32 | 3 | 2 | ReLU | Sigmoid | 50 | 32 | 0.9898 | 0.9834
5 | 64 | 32 | 3 | 2 | ReLU | Sigmoid | 50 | 32 | 0.9956 | 0.9834
10 | 64 | 32 | 3 | 2 | ReLU | Sigmoid | 50 | 32 | 0.9982 | 0.9856
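The best configuration in Table 10 (64 GRU units, batch size 32, kernel size 3, pool size 2, ReLU/sigmoid activations, 50 epochs, 32 convolution filters) yields the 99.82% accuracy quoted in the abstract. The Keras-style sketch below is a hedged reconstruction of a CNN-Bi-GRU with those hyperparameters; the vocabulary size, embedding dimension, sequence length, optimizer, and exact layer ordering are assumptions rather than details restated from the paper.

```python
# Hedged Keras sketch of a CNN-Bi-GRU classifier using the best-performing
# hyperparameters from Table 10. VOCAB_SIZE, EMBED_DIM and MAX_LEN are
# illustrative assumptions, not values taken from the paper.
from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 5000, 100, 50

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),                                 # padded integer token IDs
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),                        # Word2Vec vectors could seed these weights
    layers.Conv1D(filters=32, kernel_size=3, activation="relu"),    # Filters = 32, Kernel Size = 3, ReLU
    layers.MaxPooling1D(pool_size=2),                               # Pool Size = 2
    layers.Bidirectional(layers.GRU(64)),                           # Units = 64, bidirectional GRU
    layers.Dense(1, activation="sigmoid"),                          # binary ham/smishing output
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

# Training call matching Table 10's epoch and batch settings (X_train / y_train
# are assumed padded sequences and 0/1 labels, not defined here):
# model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)
```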
Table 11. Consolidated performance comparison of models across four datasets.

Algorithm | Dataset 1 Accuracy | Dataset 1 F1 Score | Dataset 2 Accuracy | Dataset 2 F1 Score | Dataset 3 Accuracy | Dataset 3 F1 Score | Dataset 4 Accuracy | Dataset 4 F1 Score
Decision Tree | 0.9747 | 0.9758 | 0.8482 | 0.8726 | 0.9564 | 0.9576 | 0.9245 | 0.9253
Random Forest | 0.9731 | 0.9738 | 0.7386 | 0.7448 | 0.9564 | 0.9561 | 0.8411 | 0.8275
K-Nearest Neighbors | 0.9309 | 0.9353 | 0.6941 | 0.7449 | 0.8914 | 0.8981 | 0.7855 | 0.7899
SVM | 0.8871 | 0.8867 | 0.6769 | 0.7728 | 0.8372 | 0.8412 | 0.7182 | 0.6945
AdaBoost | 0.9209 | 0.9235 | 0.8126 | 0.8570 | 0.9037 | 0.9055 | 0.8080 | 0.8254
LSTM | 0.9708 | 0.9716 | 0.9638 | 0.9692 | 0.9648 | 0.9656 | 0.9768 | 0.9767
BiLSTM | 0.9724 | 0.9729 | 0.9605 | 0.9663 | 0.9717 | 0.9720 | 0.9763 | 0.9764
GRU | 0.9754 | 0.9760 | 0.9627 | 0.9683 | 0.9694 | 0.9697 | 0.9738 | 0.9736
BiGRU | 0.9708 | 0.9709 | 0.9633 | 0.9686 | 0.9648 | 0.9647 | 0.9763 | 0.9760
CNN-LSTM | 0.9724 | 0.9728 | 0.9583 | 0.9645 | 0.9702 | 0.9707 | 0.9724 | 0.9720
CNN-BiLSTM | 0.9731 | 0.9739 | 0.9588 | 0.9649 | 0.9732 | 0.9737 | 0.9721 | 0.9720
CNN-GRU | 0.9693 | 0.9703 | 0.9583 | 0.9643 | 0.9763 | 0.9764 | 0.9721 | 0.9725
CNN-BiGRU | 0.9854 | 0.9856 | 0.9722 | 0.9593 | 0.9817 | 0.9715 | 0.9821 | 0.9725
Table 12. Feature importance table for Dataset 3.

Index | Feature | Importance
1 | call | 0.3059
2 | txt | 0.1122
3 | www | 0.0717
4 | http | 0.0492
5 | me | 0.0368
4996 | greet | 0.0000
4997 | green | 0.0000
4998 | great | 0.0000
4999 | gravity | 0.0000
5000 | ã¼ | 0.0000
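Table 12 ranks vocabulary tokens by importance, with "call", "txt", "www", and "http" dominating (see also Figure 3). One plausible way to produce such a ranking is to fit a tree-based classifier on bag-of-words features and read off its impurity-based importances; the sketch below does exactly that on an invented toy corpus and is not a restatement of the paper's attribution method.

```python
# Illustrative token-importance ranking in the spirit of Table 12 and Figure 3.
# A CountVectorizer + ExtraTreesClassifier stand-in is assumed; the toy corpus
# below is not the paper's Dataset 3.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import ExtraTreesClassifier

texts = ["call now to claim your prize", "txt WIN to this number for free entry",
         "visit www example com today", "click the http link to verify",
         "call me after the meeting", "lunch tomorrow?", "see you at five",
         "notes from class attached"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]          # 1 = smishing, 0 = ham (toy data)

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = ExtraTreesClassifier(n_estimators=200, random_state=42).fit(X, labels)

importance = pd.Series(clf.feature_importances_, index=vec.get_feature_names_out())
print(importance.sort_values(ascending=False).head(10))   # top-ten tokens, as in Figure 3
```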
Table 13. Training and testing times for different models.

Model | Average Training Time per Epoch | Total Training Time | Total Testing Time
LSTM | 21.5 s | ≈1075 s (17.92 min) | ≈2 s
BiLSTM | 191.4 s | 9571 s | 2 s
GRU | 11.5 s | ≈575 s | 1 s
BiGRU | 15.3 s | 765 s | 2 s
CNN-LSTM | 37.82 s | 1891 s | 2 s
CNN-BiLSTM | 206.92 s | 10,346 s | 2 s
CNN-GRU | 19.74 s | 987 s | 2 s
CNN-BiGRU | 8 s | 400 s | 1 s
Table 14. Model parameters and memory size.

Model | Parameters | Approximate Memory Size (MB)
LSTM | 790,057 | 3.0138
BiLSTM | 807,113 | 3.0789
GRU | 785,897 | 2.9980
BiGRU | 798,793 | 3.0472
CNN-LSTM | 785,785 | 2.9975
CNN-BiLSTM | 788,937 | 3.0096
CNN-GRU | 785,049 | 2.9947
CNN-BiGRU | 787,465 | 3.0039
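The memory column in Table 14 is consistent with storing each parameter as a 32-bit float (4 bytes): for example, 790,057 × 4 bytes ≈ 3.0138 MB. The following quick check reproduces the column under that assumption (the float32 precision is our inference, not stated in the table).

```python
# Reproducing Table 14's memory column under a float32 (4 bytes/parameter) assumption.
params = {"LSTM": 790_057, "BiLSTM": 807_113, "GRU": 785_897, "BiGRU": 798_793,
          "CNN-LSTM": 785_785, "CNN-BiLSTM": 788_937, "CNN-GRU": 785_049,
          "CNN-BiGRU": 787_465}

for name, p in params.items():
    print(f"{name:12s} {p * 4 / 2**20:.4f} MB")   # e.g., LSTM -> 3.0138 MB, matching Table 14
```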
Table 15. Analysis of our proposed system versus prior research.

Study | Approach Used | Dataset Used | Classification | Accuracy
SMS Spam Filter [23] | Deep learning | UCI benchmark | CNN-LSTM | 99.44%
Hybrid [24] | Deep learning | T. A. Almeida | CNN-LSTM | 98.37%
Smishing Detector [10] | Machine learning | UCI's SMS Spam Collection | Naïve Bayes | 96.12%
DSmishSMS [27] | Machine learning | T. A. Almeida collected dataset and Pinterest.com text SMS | Random Forest | 97.93%
Spam Transformer [28] | Deep learning | SMS Spam Collection and UtkMl's Twitter | CNN-LSTM | 98.92%
Phishing Detection [31] | Deep learning | Phishing email collection (Nazario Phishing Corpus and the Enron dataset) | BiLSTM (URL, No_URL) | 97.9%
Proposed Method | Deep learning | Smishing Collection [32] | CNN-Bi-GRU | 99.82%