An Enhanced K-Means Clustering Algorithm for Phishing Attack Detections

Al-Sabbagh, Abdallah; Hamze, Khalil; Khan, Samiya; Elkhodr, Mahmoud

doi:10.3390/electronics13183677

¹ Department of Electrical & Computer Engineering, Faculty of Engineering, Beirut Arab University, Beirut 1107 2809, Lebanon

² Cybersecurity and Forensics Department, Faculty of Computer Studies, Arab Open University, Beirut 2058 4518, Lebanon

³ School of Computing and Mathematical Sciences, University of Greenwich, London SE10 9LS, UK

⁴ School of Engineering and Technology, Central Queensland University, Sydney, NSW 2000, Australia

⁵ Computer Science Department, Prince Mohammad Bin Fahd University, Al-Khobar 34754, Saudi Arabia

* Author to whom correspondence should be addressed.

Electronics 2024, 13(18), 3677; https://doi.org/10.3390/electronics13183677

Submission received: 10 August 2024 / Revised: 10 September 2024 / Accepted: 12 September 2024 / Published: 16 September 2024

(This article belongs to the Special Issue Artificial Intelligence and Applications—Responsible AI)

Abstract

Phishing attacks continue to pose a significant threat to cybersecurity, employing increasingly sophisticated techniques to deceive victims into revealing sensitive information or downloading malware. This paper presents a comprehensive study on the application of Machine Learning (ML) techniques for identifying phishing websites, with a focus on enhancing detection accuracy and efficiency. We propose an approach that integrates the CfsSubsetEval attribute evaluator with the K-Means Clustering algorithm to improve phishing detection capabilities. Our method was evaluated using datasets of varying sizes (2000, 7000, and 10,000 samples) from a publicly available repository. Simulation results demonstrate that our approach achieves an accuracy of 89.2% on the 2000-sample dataset, outperforming the traditional kernel K-Means algorithm, which achieved an accuracy of 51.5%. Further analysis using precision, recall, and F1-score metrics corroborates the effectiveness of our method. We also discuss the scalability and real-world applicability of our approach, addressing limitations and proposing future research directions. This study contributes to the ongoing efforts to develop robust, efficient, and adaptable phishing detection systems in the face of evolving cyber threats.

Keywords:

phishing attacks prevention; K-means clustering; phishing website detection

1. Introduction

Phishing attacks are a form of social engineering in which hackers use spoofed emails or bogus websites to gather clients’ sensitive data. These attacks typically involve four steps: creating a fake website that mimics a legitimate one, distributing a link to the fake website while posing as a legitimate entity, persuading the victim to visit the bogus website and, finally, capturing the victim’s valuable information when they enter it on the fake site [1,2,3]. Cybersecurity, which involves protecting computers, mobile devices, systems, networks, and user data from electronic intrusions, faces phishing as one of its most persistent challenges. The first quarter of 2022 marked a distressing milestone for phishing attacks, with a record-breaking 1,025,968 assaults, according to the Anti-Phishing Working Group’s (APWG’s) Phishing Activity Trends Report [4]. More recent data from APWG shows that this trend has continued, with the first half of 2023 seeing over 1.2 million unique phishing attacks. Furthermore, the fourth quarter of 2023 witnessed a 47% increase in phishing attacks compared to the previous quarter, indicating that the threat is not only persistent, but growing [5]. Users often lack basic knowledge of Uniform Resource Locators (URLs), making it difficult for them to determine which web pages can be trusted. Factors such as redirection, hidden URLs, a wide range of URL alternatives, or typing mistakes contribute to users’ vulnerability to phishing attacks. During these attacks, perpetrators create false web pages that replicate legitimate websites and distribute them via spam emails, Short Message Service (SMS), or social media [6,7]. Phishing attacks can be categorised into two methods: attack launching and data gathering. Attack launching methods include man-in-the-middle attacks, URL spoofing, website spoofing, and email spoofing. 
Data gathering methods can be further divided into automated methods (such as fake website forms, keyloggers, and recorded messages) and manual methods (such as misdirection and social engineering) [8,9]. Machine Learning (ML) algorithms, a subset of Artificial Intelligence (AI), have shown great promise in detecting phishing attacks. These algorithms can forecast future values using historical data as input, allowing software applications to make predictions without explicit programming. ML has found wide-ranging applications in industries where traditional algorithms are difficult to build, including medical applications, email filtering, speech recognition, agriculture, and computer vision [10,11]. This paper focuses on the utilisation of different artificial intelligence algorithms for phishing attack detection and the limitations of each approach. The aim is to propose an enhancement for the use of machine learning algorithms in detecting phishing attacks by identifying an unsupervised algorithm that can effectively minimise and detect these increasingly prevalent threats. The key contributions of this paper include:

A novel approach combining K-Means Clustering with the CfsSubsetEval attribute evaluator for improved phishing detection.
Comprehensive comparative analysis of the proposed method against kernel K-Means, demonstrating improvements in accuracy across various sample sizes.
Insights into the scalability and adaptability of unsupervised learning techniques for phishing detection, particularly in handling large datasets.
Exploration of the potential for real-time phishing detection through the proposed method, opening avenues for practical applications such as browser extensions.
Critical analysis of the strengths and limitations of the proposed approach, providing a balanced view of its potential impact on cybersecurity practices.

The remainder of the paper is organised as follows. Section 2 provides background and conducts a comprehensive literature review on ML techniques implemented in phishing attacks. Section 3 describes the proposed approach, which integrates K-Means Clustering with CfsSubsetEval attribute evaluator. The simulation results for the proposed algorithm are presented in Section 4. The limitations of the work are presented in Section 5, followed by some concluding remarks in Section 6.

2. Background

Research has shown that standard methods of detecting phishing attacks identify only 20% of attacks [12]. It is therefore essential for users to be aware of these attacks in order to avoid falling victim to them. Organisations often rely on rule-based training to help individuals recognise specific cues or follow a set of guidelines to prevent phishing attempts [13]. However, the recent literature indicates a shift towards the increased use of machine and deep learning algorithms for phishing website classification.

Phishing attacks come in various forms, each targeting different vulnerabilities and user groups. Table 1 provides an overview of website phishing attacks, including those specifically designed for mobile devices and social media platforms. This table outlines the methods, targets, typical characteristics, typical solutions, and remaining challenges associated with each type. Table 2 summarises the other main types of phishing attacks, detailing their methods, targets, typical characteristics, typical solutions, and remaining challenges. Understanding these different types of phishing attacks is crucial for developing comprehensive detection and prevention methods.

In ML, Decision Trees (DT) are often utilised for classification tasks. A tree consists of nodes, each of which tests an attribute, and classification is performed by following the branches to a leaf node. The Random Forest (RF) algorithm is known for its robustness in the field of ML. It can be effectively utilised for both classification and regression purposes. By employing the bagging method, the RF algorithm combines multiple learning models to generate an overall prediction based on the average of their outputs. This approach enables the distinction between legitimate and phishing websites, highlighting the versatility of RF in this application. Alam et al. utilised standard datasets of phishing attacks as input for the two ML algorithms: DT and RF [14]. Despite the presence of missing data, RF demonstrated a remarkably high accuracy of approximately 96.9%. Consequently, this approach could address the problem of overfitting effectively.

Kamalam et al. [15] evaluated the performance of two machine learning algorithms, namely DT and RF, on the Phishing URLs Dataset. After selecting the best performing algorithm, the authors developed a Chrome extension to detect phishing websites. The extension allows for easy deployment of the phishing detection model to end users. By utilizing the RF algorithm, the researchers achieved a good accuracy rate of 97.31% in identifying phishing websites.

Basit et al. [16] selected three ML classifiers: Artificial Neural Networks (ANN), K-Nearest Neighbours (KNN), and DT to combine with the RF Classifier (RFC) in an ensemble approach. The RFC was used as the base classifier with the ANN, KNN, and DT algorithms. They utilised a common dataset from the UCI ML repository. The results showed that the KNN classifier combined with the RFC provided the lowest False Positive (FP) rate (0.038) and the highest True Positive (TP) rate (0.983) among the ensembles. In addition, the KNN and RFC ensemble classifiers had the highest precision (0.970) and recall (0.983) compared to all the other classifiers. The accuracy of KNN in this case is 97.33%, which is higher than the ANN (97.16%) and the DT (C4.5) (96.36%). Support Vector Machine (SVM), which is a supervised ML approach, can also be used to analyse data for classification and regression tasks. In a study conducted by Mao et al. [17], SVM was employed as a classifier on a dataset consisting of 24,051 samples. The results showed an accuracy rate of approximately 96%.

The Extreme Gradient Boosting (XGBoost) method is a powerful ensemble approach to supervised learning. Data scientists rely heavily on XGBoost to achieve cutting-edge results in various machine learning challenges due to its scalability and end-to-end tree boosting capabilities, which make it well suited to both classification and regression problems. In a study conducted by Shahrivari et al. [18], twelve classifiers were simulated and tested using a phishing website dataset. These classifiers included Logistic Regression (LR), DT, SVM, Ada Boost, RF, Neural Networks, KNN, Gradient Boosting, and XGBoost. Among them, XGBoost demonstrated exceptional performance in terms of computation duration and accuracy, achieving an accuracy of approximately 98%, which surpassed the other classifiers.

Deep learning, a subset of ML methods that includes ANN and representation learning, has also shown great promise for tasks such as phishing website classification. In a study by Adebowale et al. [19], various deep learning techniques were applied to multiple datasets. The accuracy achieved ranged from a minimum of approximately 91% to a maximum of roughly 94.8%. Sahu et al. [20] utilised vector space analysis to select variables from a large dataset for the purpose of detecting phishing and malware websites. As a result, the error rate in classifying malware and phishing websites improves from 10% to 20%. Furthermore, when compared to ensemble clustering, this method also demonstrates an improvement in categorisation error rates for both phishing websites and virus samples. This study therefore shows that the system performs well in categorising malware and real phishing websites.

Hossain et al. [21] conducted a study to identify potential phishing websites. They analysed multiple machine learning algorithms (RF, Bagging, SGD and LR) using a dataset that included attributes about websites and their associated information. The article aims to assist readers by offering a comprehensive examination of different techniques, ultimately concluding that the RF classifier performs the best. Particularly, the RF achieves an impressive F1 score of 0.99, indicating that both the false positive and false negative rates fall within acceptable limits. Abedin et al. conducted a study comparing the effectiveness of three commonly used ML classifiers: KNN, LR and RF [11]. Among these classifiers, RF demonstrated the highest precision rate of 97%. Additionally, the RF classifier achieved an impressive Area Under Curve (AUC) score of 1.0, indicating its ability to accurately identify phishing URLs.

Muhammad et al. employed five machine learning techniques, namely DT, KNN, Naive Bayes (NB), RF, and SVM, to detect phishing attacks [22]. The authors utilised two datasets, one for SMS and another for email, and analysed the word content within these datasets to identify phishing attempts. Performance criteria for the experiment included the time taken to conduct the analysis and the accuracy of correctly classifying the data. RF technique demonstrated exceptional average accuracy; however, it was observed that the classification process using this method was comparatively time-consuming. Based on the outcomes of the study, it can be concluded that while a method may effectively identify phishing attempts, it may also require a longer time to yield the best results.

Barlow et al. [23] proposed a novel approach for detecting phishing using multilevel artificial intelligence, combining neural networks with binary visualisation techniques. By employing visual representation techniques, it becomes possible to gain insight into the structural differences between legitimate and phishing websites. The initial results of the experiment indicate that this approach is effective in rapidly and accurately detecting phishing attackers. Furthermore, the technique improves its effectiveness through learning from incorrect classifications.

Salahdine et al. [24] proposed an ML-based method for detecting phishing attacks. The researchers trained and evaluated three classifiers (ANN, LR, and SVM) using a dataset. The best results from their parametric research are presented for review for each classifier. For the SVM classifier, the Gaussian Radial basis function kernel achieved good accuracy. The LR classifier achieved the highest accuracy with a regularisation parameter of 0.4. However, the ANN classifier achieved even higher accuracy by using two hidden layers, each with 100 neurons, and the Relu activation function. Consequently, the suggested methodology enables rapid and accurate detection of phishing attacks.

Table 3 presents recent advancements in phishing detection using AI and ML, while Table 4 summarises various other approaches in the literature. These tables highlight the strengths and weaknesses of existing methods, demonstrating the continuous evolution and sophistication of phishing detection techniques. Despite these advancements, challenges remain, particularly in balancing detection accuracy with computational efficiency and adaptability to emerging threats.

Motivated by these insights, this paper proposes an optimised model for detecting phishing attacks based on K-Means Clustering. Various approaches have been proposed in the literature, resulting in different accuracy percentages. However, the objective of this work is to reduce phishing attacks and minimise losses by finding an unsupervised learning technique algorithm for the model.

While supervised learning methods have shown high accuracy in phishing detection, this research explores K-Means Clustering as an unsupervised alternative. K-Means offers several advantages: it can adapt to new threats without relying on labelled data, it is computationally efficient, and it provides interpretable results. The reduced dependency on labelled datasets makes it easier to implement and maintain. Moreover, K-Means can potentially uncover novel patterns and feature combinations that distinguish phishing attempts from legitimate websites. By employing this unsupervised approach, we aim to complement existing techniques, contributing to a more robust and diverse set of tools for combating phishing attacks. This exploration of K-Means Clustering may reveal insights that supervised methods might overlook, ultimately enhancing our overall phishing detection capabilities.

Building upon the insights gained from previous research, this paper proposes a novel approach to phishing detection using an unsupervised machine learning technique. We introduce a model that combines K-Means Clustering with the CfsSubsetEval attribute evaluator, aiming to leverage the strengths of unsupervised learning while optimising feature selection. This approach seeks to address some of the limitations of supervised methods while potentially uncovering new patterns in phishing attacks. The following section provides a detailed description of our proposed algorithm, its implementation, and the rationale behind our methodological choices.

3. Proposed Approach for Enhanced Phishing Website Classification

This paper introduces a novel approach to phishing website classification that uniquely combines K-Means Clustering with kernel methods and integrates the CfsSubsetEval attribute evaluator. This combination enhances both the accuracy and efficiency of phishing attack detection. The pseudo-code of the proposed algorithm is provided in Algorithm 1.

At the core of our approach lies K-Means Clustering, an unsupervised learning technique that partitions unlabelled data into distinct clusters. The parameter K specifies the number of clusters to be formed. For phishing detection, we set K to 2, representing phishing and legitimate website groups. This method automatically categorises data without requiring training, using a centroid-based approach, where each cluster is represented by its centre point. The main objective of this method is to minimise the total distance between each data point and its corresponding cluster. The algorithm starts by dividing the dataset into K clusters, and continues iteratively until no further changes occur.
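The assignment and update loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the WEKA implementation used in the paper: the farthest-first initialisation, the function names, and the toy data are assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def init_centroids(points, k):
    """Assumed initialisation: first point, then repeatedly the farthest point."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k=2, max_iter=100):
    """Plain K-Means; returns (centroids, labels)."""
    centroids = init_centroids(points, k)
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no assignment changed, so the clustering has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Toy two-feature data forming two well-separated groups (illustrative only).
toy = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = kmeans(toy, k=2)
print(labels)
```

With K set to 2, the two resulting cluster indices play the role of the phishing and legitimate groups; which index corresponds to which group is only known once labels are consulted.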

The novelty of our approach is twofold. Firstly, we integrate kernel methods with K-Means Clustering, a combination less common in phishing detection. By incorporating polynomial kernel techniques, we enhance the algorithm’s ability to capture complex patterns in URL data. The kernel function $K(x, x')$ implicitly maps data to a higher-dimensional feature space, allowing for more intricate decision boundaries. Specifically, we use a polynomial kernel of degree d, defined as:

K(x, x') = (x \cdot x' + c)^{d}

(1)

where x and $x'$ are feature vectors, c is a constant, and d is the degree of the polynomial. This transformation allows the K-Means algorithm to operate in a higher-dimensional space where non-linear relationships between features can be more easily separated.
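To make the phrase "implicitly maps data to a higher-dimensional feature space" concrete, the following sketch evaluates Equation (1) and compares it with an explicit feature map. The values d = 2 and c = 0, the two-dimensional inputs, and the helper names are assumptions chosen for illustration; the paper does not state which degree or constant was used.

```python
import math


def polynomial_kernel(x, y, c=1.0, d=2):
    """K(x, x') = (x . x' + c)^d for two equal-length vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (dot + c) ** d


def phi_degree2(x):
    """Explicit feature map for 2-D input that matches d = 2 and c = 0."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, 0.5)

# Kernel value computed directly in the original 2-D space.
implicit = polynomial_kernel(x, y, c=0.0, d=2)

# Same value obtained as an ordinary dot product in the 3-D mapped space.
explicit = sum(a * b for a, b in zip(phi_degree2(x), phi_degree2(y)))

print(implicit, explicit)
```

Both routes give the same number, which is why the kernel can stand in for the higher-dimensional dot product without the mapped vectors ever being constructed.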

Secondly, we combine the CfsSubsetEval attribute evaluator with K-Means, an integration not widely explored in unsupervised phishing detection. CfsSubsetEval is a feature selection method that evaluates subsets of features by examining each attribute’s individual ability to predict the class and the degree of redundancy between attributes. Subsets of features that are highly correlated with the class while having low intercorrelation are preferred. This reduces the risk of overfitting and improves the algorithm’s generalisation ability. By applying CfsSubsetEval before K-Means Clustering, we identify the most relevant features for distinguishing phishing sites, potentially improving both the speed and accuracy of our classification.
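The preference for subsets that correlate with the class but not with each other can be illustrated with the merit heuristic commonly given for correlation-based feature selection, merit = k·r_cf / sqrt(k + k(k − 1)·r_ff), where k is the subset size, r_cf the mean feature-class correlation, and r_ff the mean feature-feature correlation. The sketch below is an assumption-laden illustration rather than WEKA's code: Pearson correlation is used as a stand-in correlation measure, and the toy columns are invented.

```python
import math
from itertools import combinations


def pearson(u, v):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def cfs_merit(features, target):
    """Merit of a feature subset; `features` is a list of columns."""
    k = len(features)
    # Mean absolute correlation between each feature and the class.
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    # Mean absolute correlation between every pair of features.
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Toy columns (hypothetical): f1 and f2 are uncorrelated with each other,
# both are informative about the class, and f1_copy duplicates f1.
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
f1_copy = list(f1)
target = [0, 0, 0, 1]

merit_single = cfs_merit([f1], target)
merit_redundant = cfs_merit([f1, f1_copy], target)
merit_complementary = cfs_merit([f1, f2], target)
print(merit_single, merit_redundant, merit_complementary)
```

Adding a redundant copy of a feature leaves the merit unchanged, whereas adding a complementary feature raises it, which matches the selection behaviour described above.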

Algorithm 1 Proposed algorithm for phishing website classification

1: Input: Dataset
2: Choose the attribute evaluator CfsSubsetEval
3: Choose the search method BestFirst
4: Choose the method K-Means
5: Choose number of clusters $K = 2$
6: Choose distance function Euclidean distance
7: Select class to clusters evaluation status
8: Run the method
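Step 7 of Algorithm 1 scores an unsupervised clustering against known class labels. A simplified way to picture this is to label each cluster with its majority class and count the agreements. The sketch below follows that majority-vote reading, which is an assumption; WEKA's own evaluation may differ in detail, and the function name and sample data are hypothetical.

```python
from collections import Counter


def classes_to_clusters_accuracy(cluster_ids, true_labels):
    """Label each cluster with its majority class, then score the labelling."""
    majority = {}
    for cid in set(cluster_ids):
        members = [lab for c, lab in zip(cluster_ids, true_labels) if c == cid]
        majority[cid] = Counter(members).most_common(1)[0][0]
    correct = sum(majority[c] == lab for c, lab in zip(cluster_ids, true_labels))
    return correct / len(true_labels)


# Hypothetical cluster assignments and ground-truth labels for eight URLs.
clusters = [0, 0, 0, 1, 1, 1, 1, 0]
labels = [
    "legitimate", "legitimate", "phishing", "phishing",
    "phishing", "phishing", "legitimate", "legitimate",
]

acc = classes_to_clusters_accuracy(clusters, labels)
print(acc)  # 6 of the 8 URLs agree with their cluster's majority class
```

The class labels are consulted only after clustering, so they influence the reported score but not the clusters themselves.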

The Euclidean distance between two points x and y in N-dimensional space is given by

d (x, y) = {(\sum_{k = 1}^{N} {| x_{k} - y_{k} |}^{2})}^{1 / 2}

(2)

Euclidean distance is commonly employed by machine learning algorithms as the default metric for comparing two sets of recorded data. This distance metric requires continuous attributes involving numerical variables, such as weight, height, or pay, in the observations being compared. Its typical purpose is to measure the extent of separation between two rows of data with numerical values. This paper utilises the Euclidean distance as the default metric for assessing the similarity between two recorded observations due to its simplicity and effectiveness in the context of continuous numerical data.
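As a small worked instance of Equation (2), the function below computes the distance term by term and checks it against Python's built-in `math.dist` (available from Python 3.8). The function name and the sample vectors are illustrative choices.

```python
import math


def euclidean_distance(x, y):
    """Equation (2): square root of the summed squared attribute differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of attributes")
    return sum(abs(xk - yk) ** 2 for xk, yk in zip(x, y)) ** 0.5


# Two observations with N = 2 numerical attributes each.
d = euclidean_distance((0.0, 0.0), (3.0, 4.0))
print(d)  # sqrt(3^2 + 4^2) = 5.0

# The standard-library helper computes the same quantity.
same = math.isclose(d, math.dist((0.0, 0.0), (3.0, 4.0)))
```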

Our methodology, implemented using the WEKA toolkit, begins with a dataset from Kaggle containing 11,430 URLs and 87 features [43]. We first preprocess the data, cleaning and handling missing values to ensure quality and consistency. The feature selection phase then employs CfsSubsetEval to identify the 19 most relevant attributes. This step is applied to three different sample sizes—2000, 7000, and 10,000 URLs—allowing us to assess the scalability of our approach. The choice of these specific sample sizes is based on their ability to provide a comprehensive assessment of the algorithm’s performance across small, medium, and large datasets, ensuring that the method can scale effectively with varying data volumes.

Following feature selection, we apply a polynomial kernel to transform the feature space, enhancing K-Means’ capability to capture non-linear relationships in the data. The K-Means algorithm then iteratively assigns data points to clusters and recalculates centroids until convergence. Although we use kernel methods, Euclidean distance in the transformed feature space serves as our similarity measure, well-suited for the continuous numerical attributes in our dataset.

The distance measure in the transformed feature space can be represented as

d (x, x^{'}) = \sqrt{K (x, x) + K (x^{'}, x^{'}) - 2 K (x, x^{'})}

(3)

This ensures that the clustering process accurately reflects the non-linear separations introduced by the kernel function.
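Equation (3) can be evaluated using kernel values alone. The sketch below does so for the polynomial kernel of Equation (1) and, as a sanity check, shows that with d = 1 and c = 0 the result reduces to the ordinary Euclidean distance. The parameter values and sample vectors are assumptions for illustration.

```python
import math


def polynomial_kernel(x, y, c=1.0, d=2):
    """K(x, x') = (x . x' + c)^d."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** d


def kernel_distance(x, y, kernel):
    """Equation (3): distance between x and y in the kernel feature space."""
    squared = kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors


def linear_kernel(x, y):
    """Degree-1 polynomial kernel with c = 0, i.e. the plain dot product."""
    return polynomial_kernel(x, y, c=0.0, d=1)


x, y = (1.0, 2.0), (4.0, 6.0)

d_linear = kernel_distance(x, y, linear_kernel)    # equals the Euclidean distance
d_poly = kernel_distance(x, y, polynomial_kernel)  # degree-2 feature space
print(d_linear, d_poly)
```

Because only kernel evaluations appear in the formula, the clustering step never needs the coordinates of the mapped points.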

We evaluate our clustering results using internal validation measures such as silhouette score and Davies–Bouldin index. Additionally, when ground truth labels are available, we compare our results to assess the accuracy of phishing detection.
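For readers unfamiliar with the silhouette score, the following sketch computes it from scratch using Euclidean distance. It assumes at least two clusters and treats singleton clusters as scoring zero; the function name and the toy points are illustrative, and the paper's own values were not produced with this code.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    scores = []
    for i, p in enumerate(points):
        # Accumulate distances from p to the members of every cluster.
        sums, counts = {}, {}
        for j, q in enumerate(points):
            if i == j:
                continue
            sums[labels[j]] = sums.get(labels[j], 0.0) + math.dist(p, q)
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:  # singleton cluster: score taken as 0
            scores.append(0.0)
            continue
        a = sums[own] / counts[own]  # mean distance within own cluster
        b = min(sums[c] / counts[c] for c in counts if c != own)  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # clusters match the two groups
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters cut across the groups
print(good, bad)
```

Values near 1 indicate compact, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own.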

Figure 1 illustrates the flow of our proposed approach.

This enhanced method aims to improve both the accuracy and efficiency of phishing website detection by synergising unsupervised learning with optimised feature selection and kernel methods. By applying this approach to various sample sizes, we can evaluate its performance and scalability in real-world scenarios.

The proposed approach can be effectively implemented in various cybersecurity contexts. One primary application could be its integration into web browsers as an extension or built-in feature. This would allow for real-time analysis of websites as users browse, providing immediate warnings about potentially phishing sites. Additionally, this approach could be implemented as part of email filtering systems, helping to identify and flag phishing attempts in incoming messages. For larger organisations, the proposed approach could be deployed as part of an enterprise-level Intrusion Detection System (IDS) to monitor and flag suspicious activity across networks, improving their defences against social engineering attacks.

Our approach addresses the challenges of detecting increasingly sophisticated phishing attempts. The polynomial kernel helps capture subtle variations used by phishers to mimic legitimate sites, while the optimised feature selection focuses the algorithm on the most discriminative URL characteristics. This combination promises to enhance detection capabilities, especially for complex or previously unseen phishing patterns.

Feature Importance Analysis

To further understand the effectiveness of our feature selection process, we generated a Feature Importance Chart (Figure 2). This chart illustrates the relevance of the top features identified by the CfsSubsetEval attribute evaluator and K-Means clustering. Each bar represents a feature, with its length corresponding to the importance score assigned during the evaluation.

The top five features based on their importance are: ‘ratio_intHyperlinks’, ‘links_in_tags’, ‘ratio_extHyperlinks’, ‘safe_anchor’, and ‘google_index’. These features significantly contribute to distinguishing between phishing and legitimate websites. For instance, the presence and ratio of internal hyperlinks (‘ratio_intHyperlinks’) and the external hyperlinks (‘ratio_extHyperlinks’) are critical indicators for classifying phishing attempts. Similarly, features related to web content structure such as ‘links_in_tags’ and the use of ‘safe_anchor’ tags help to enhance the detection accuracy.

The chart reveals that the most significant features are closely related to the structure of web hyperlinks, index status, and content tags. By focusing on these key attributes, our approach ensures that the most discriminative characteristics are prioritized, enhancing the overall accuracy and efficiency of the phishing detection process.

The feature importance analysis underscores the critical role of specific website characteristics, such as hyperlink ratios and tag usage, in identifying phishing attempts. Features with higher importance scores indicate attributes that have a greater impact on the classification decision, providing valuable insights for future research and optimization in phishing detection methodologies.

4. Results

Two metrics were used to evaluate the performance of the proposed K-Means and the kernel K-Means algorithms: Accuracy and Error Rate. Accuracy is the proportion of accurately predicted data points among all the data points. It is calculated by dividing the number of true positives and true negatives by the total number of predictions, that is, the true positives, true negatives, false positives, and false negatives combined. A data point that the algorithm correctly identifies as positive or negative is referred to as a true positive or true negative, respectively. This paper uses accuracy to measure the effectiveness of the phishing detection approach in correctly identifying phishing cases from the entire dataset. It is expressed as

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

(4)

where TP stands for True Positive, TN stands for True Negative, FP stands for False Positive, and FN stands for False Negative.
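As a brief worked instance of Equation (4), the snippet below computes accuracy from confusion-matrix counts. The counts are hypothetical and do not come from the paper's experiments.

```python
def accuracy(tp, tn, fp, fn):
    """Equation (4): correct predictions divided by all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Hypothetical confusion-matrix counts for 2000 URLs.
acc = accuracy(tp=850, tn=900, fp=100, fn=150)
print(acc)  # (850 + 900) / 2000 = 0.875
```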

In addition to accuracy, we evaluated the precision, recall, and F1-score to provide a more comprehensive assessment of the algorithm’s performance. These additional metrics are critical for understanding the model’s effectiveness in various operational scenarios:

\text{Precision} = \frac{TP}{TP + FP}

(5)

\text{Recall} = \frac{TP}{TP + FN}

(6)

\text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

(7)
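Equations (5) to (7) can be checked with a few lines of Python. The counts below are hypothetical and serve only to show how the three metrics relate; the helper name is an illustrative choice.

```python
def precision_recall_f1(tp, fp, fn):
    """Equations (5)-(7) computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 850 phishing URLs caught, 100 false alarms, 150 missed.
precision, recall, f1 = precision_recall_f1(tp=850, fp=100, fn=150)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Precision penalises false alarms on legitimate sites, recall penalises missed phishing sites, and the F1-score is their harmonic mean, so it is high only when both are high.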

Error Rate measures the proportion of data points that the model classifies incorrectly on a given training, testing, or validation set. A lower error rate is preferable. In this paper, it indicates the degree of prediction error in phishing detection, and it is expressed as

\text{Error Rate} = 1 - \text{Accuracy}

(8)

The kernel K-Means and the proposed K-Means have been simulated using the WEKA tool, on samples of 2000, 7000, and 10,000. Simulation results show that the kernel K-Means produced an error rate of 48.5% on 2000 samples, 47.07% on 7000 samples, and 43.02% on 10,000 samples. On the other hand, the proposed approach produced an error rate of 10.8% on 2000 samples, 13.38% on 7000 samples, and 13.75% on 10,000 samples, as shown in Table 5. Figure 3 shows the comparison between the proposed K-Means and kernel K-Means based on the Error Rate. The kernel K-Means performed poorly at every sample size, although its Error Rate decreased from 48.5% to 43.02% as the number of samples went from 2000 to 10,000.

After obtaining the error rate, the accuracy was calculated for the kernel K-Means and the proposed method using 2000, 7000, and 10,000 samples, following Equation (8). The simulation results showed that the accuracy of the kernel K-Means was 51.5% for 2000 samples, 52.93% for 7000 samples, and 56.98% for 10,000 samples. On the other hand, the proposed approach achieved an accuracy of 89.2% for 2000 samples, 86.62% for 7000 samples, and 86.25% for 10,000 samples, as shown in Table 6. The comparison of accuracy percentages is illustrated in Figure 4. Table 7 shows the performance metrics achieved in terms of precision, recall and F1-score for the proposed K-Means approach as well.
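The reported accuracy and error-rate figures can be cross-checked against Equation (8) with simple arithmetic. The snippet below uses only the percentages quoted in the text; the dictionary layout and helper name are illustrative choices.

```python
# (accuracy %, error rate %) as reported in the text, keyed by sample size.
reported = {
    "kernel K-Means": {2000: (51.5, 48.5), 7000: (52.93, 47.07), 10000: (56.98, 43.02)},
    "proposed K-Means": {2000: (89.2, 10.8), 7000: (86.62, 13.38), 10000: (86.25, 13.75)},
}


def is_consistent(accuracy_pct, error_pct, tol=1e-6):
    """Equation (8) in percentage form: error rate = 100 - accuracy."""
    return abs((100.0 - accuracy_pct) - error_pct) < tol


all_consistent = all(
    is_consistent(acc, err)
    for sizes in reported.values()
    for acc, err in sizes.values()
)

# Accuracy advantage of the proposed method, in percentage points.
gaps = {
    n: round(reported["proposed K-Means"][n][0] - reported["kernel K-Means"][n][0], 2)
    for n in (2000, 7000, 10000)
}
print(all_consistent, gaps)
```

Every reported pair sums to 100%, and the gap between the two methods narrows as the sample size grows while remaining large at all three sizes.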

Additionally, while the focus of this paper was on the K-Means approach, it is important to acknowledge that traditional machine learning techniques such as DT and SVM have also been used extensively in phishing detection. However, several studies in the literature have shown that these traditional methods tend to be less effective when compared to newer techniques, particularly when dealing with complex and high-dimensional data like that found in phishing detection tasks. DT, for example, are prone to overfitting, especially when applied to large datasets without proper regularisation. Similarly, while SVM can perform well in certain scenarios, its computational complexity and sensitivity to parameter tuning make it less suited for large-scale, real-time applications. By contrast, ensemble methods like RF and deep learning approaches such as CNN-LSTM have shown superior performance in recent studies due to their ability to handle more complex patterns in the data.

Notably, to evaluate the scalability of our approach, we analysed the computational time complexity when using a dataset of size 10,000. Figure 5 shows the processing time for different dataset sizes up to 10,000 samples.

As shown in Figure 5, while the processing time increases with dataset size, the relationship is approximately linear, suggesting reasonable scalability. However, for extremely large datasets (>1 million samples), further optimisations or distributed computing approaches may be necessary to maintain real-time performance.

Performance Comparison and Insights

The proposed approach (K-Means integrated with CfsSubsetEval attribute evaluator) has achieved better performance compared to the kernel K-Means model in terms of error rate and accuracy metrics. Our approach has shown improved phishing detection performance, with an error rate of 10.8%, 13.38%, and 13.75%, and an accuracy of 89.2%, 86.62%, and 86.25% on samples of 2000, 7000, and 10,000, respectively. In contrast, the kernel K-Means model had an error rate of 48.5%, 47.07%, and 43.02%, with an accuracy of 51.5%, 52.93%, and 56.98%. These results demonstrate the enhancement in phishing detection achieved by the proposed method.

It is important to note that while our proposed K-Means approach shows a slight increase in error rate with larger sample sizes (from 10.8% at 2000 samples to 13.75% at 10,000 samples), it still maintains a lower error rate compared to the kernel K-Means method. This slight increase could be attributed to the greater complexity and diversity of patterns present in larger datasets, which may introduce more challenging cases for classification.

Conversely, the kernel K-Means approach shows a decreasing error rate as the sample size increases (from 48.5% at 2000 samples to 43.02% at 10,000 samples). This trend suggests that the kernel method may benefit from larger datasets, possibly due to its ability to capture more complex relationships in higher-dimensional spaces. However, despite this improvement, its performance remains substantially inferior to our proposed method.
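The two trends discussed above can be checked directly from the reported figures. The short sketch below uses only the accuracy values stated in this section; the variable and function names are illustrative. It confirms that each error rate is the complement of the corresponding accuracy and computes the accuracy gap between the two methods at each sample size.

```python
samples = [2000, 7000, 10000]

# Accuracy values (%) as reported in this section.
accuracy = {
    "kernel": [51.5, 52.93, 56.98],
    "proposed": [89.2, 86.62, 86.25],
}


def error_from_accuracy(acc):
    """Error rate (%) is the complement of accuracy (%)."""
    return round(100.0 - acc, 2)


# Error rates derived from the accuracies above.
error = {name: [error_from_accuracy(a) for a in accs] for name, accs in accuracy.items()}

# Accuracy gap (percentage points) between the proposed and kernel methods.
gaps = [round(p - k, 2) for p, k in zip(accuracy["proposed"], accuracy["kernel"])]

for n, gap in zip(samples, gaps):
    print(f"{n} samples: proposed leads by {gap} percentage points")
```

The gap narrows from 37.7 to 33.69 to 29.27 percentage points as the sample size grows, which reflects both the slight decline of the proposed method and the gradual improvement of kernel K-Means.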

The enhanced performance of our proposed approach can be attributed to the effective feature selection achieved through the CfsSubsetEval attribute evaluator. By identifying the most relevant features for distinguishing between phishing and legitimate websites, our method maintains high accuracy, even with increasing dataset sizes. The success of this feature selection process highlights the potential for further research into advanced feature engineering techniques specifically tailored for phishing detection.
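To illustrate the idea behind correlation-based feature selection, on which CfsSubsetEval is based, the sketch below scores a feature subset with the CFS merit heuristic, merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where k is the number of features, r_cf is the mean feature-class correlation, and r_ff is the mean feature-feature correlation. This is a simplified illustration rather than the evaluator used in our experiments: the `pearson` and `cfs_merit` functions and the toy data are assumptions, and absolute Pearson correlation stands in for the correlation measure of the actual evaluator.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(vx * vy)


def cfs_merit(features, labels):
    """CFS merit of a subset: rewards relevance, penalises redundancy."""
    k = len(features)
    r_cf = sum(abs(pearson(f, labels)) for f in features) / k
    if k == 1:
        return r_cf
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Toy data (hypothetical): one informative feature, one irrelevant feature.
labels = [0, 0, 1, 1]
informative = [0, 0, 1, 1]
irrelevant = [0, 1, 0, 1]
```

On this toy data, the informative feature alone has a merit of 1.0, while adding the irrelevant feature lowers the merit to about 0.707. A subset search guided by this score therefore tends to keep features that correlate with the class and drop those that add nothing.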

Another promising aspect of our approach is its potential for real-time detection. The combination of high accuracy and the relatively low computational complexity inherent to K-Means Clustering suggests that our method could be adapted for real-time phishing detection systems. This capability could enhance cybersecurity measures by allowing for immediate identification and mitigation of phishing threats as they emerge. Real-time detection is particularly valuable in the context of ever-evolving phishing techniques, where rapid response can prevent widespread damage.

Finally, by achieving high accuracy through an unsupervised learning technique, our method reduces the reliance on constantly updated labelled datasets. This characteristic is particularly valuable in the rapidly changing landscape of cybersecurity, where obtaining up-to-date labelled data can be both time-consuming and expensive. The reduced need for labelled data could lower the overall cost and effort required to maintain effective phishing detection systems, making robust cybersecurity more accessible to a wider range of organisations.
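Because clustering itself does not use labels, labels are only needed when accuracy is measured. One common way to do this, shown here as an assumption rather than as the exact evaluation procedure of this study, is to map each cluster to the majority class among its members and then count agreements. The function names and the toy data are illustrative.

```python
from collections import Counter


def clusters_to_classes(cluster_ids, true_labels):
    """Map each cluster id to the most frequent true label among its members."""
    votes = {}
    for c, y in zip(cluster_ids, true_labels):
        votes.setdefault(c, Counter())[y] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}


def clustering_accuracy(cluster_ids, true_labels):
    """Accuracy (%) after majority-vote mapping of clusters to classes."""
    mapping = clusters_to_classes(cluster_ids, true_labels)
    correct = sum(mapping[c] == y for c, y in zip(cluster_ids, true_labels))
    return 100.0 * correct / len(true_labels)


# Toy example (hypothetical data): five websites assigned to two clusters.
cluster_ids = [0, 0, 0, 1, 1]
true_labels = ["phishing", "phishing", "legitimate", "legitimate", "legitimate"]
print(clustering_accuracy(cluster_ids, true_labels))  # 80.0
```

Note that majority voting can map two clusters to the same class; a one-to-one assignment between clusters and classes is a stricter alternative.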

5. Limitations of the Proposed Approach

While our proposed approach demonstrates significant improvements in phishing detection, it is important to acknowledge the following limitations and areas for future research.

Firstly, the dataset used in this study, comprising 2000, 7000, and 10,000 samples, may not fully represent the diversity of phishing websites in real-world scenarios. Although this limitation has been partially mitigated by our comprehensive evaluation against state-of-the-art methods and the inclusion of multiple performance metrics (e.g., precision, recall, F1-score), a larger and more diverse dataset could further enhance generalisability. Future work should focus on expanding the dataset to include real-time data from a wider variety of sources and phishing techniques.

While we have provided some evaluation of the proposed approach, the evaluation could be further enriched by testing against additional algorithms and hybrid approaches. This would provide a broader perspective on how our model performs across different architectures, offering more robust conclusions about its effectiveness in phishing detection.

Also, the model’s performance remains dependent on the quality and relevance of the selected features. While CfsSubsetEval was effective in identifying critical features in our experiments, more advanced feature selection or deep learning-based feature extraction methods could be explored to capture complex feature interactions and improve detection accuracy for sophisticated phishing attacks.

One key area that has not been fully explored is the real-time implementation of the proposed method. Although the model demonstrates strong performance on offline datasets, its suitability for live, real-time phishing detection—where latency is a critical factor—remains to be tested. Further research is needed to optimise the model for real-time environments, potentially through techniques such as parallel processing or incremental learning to allow for faster classification without sacrificing accuracy.
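As an indication of what incremental learning could look like for K-Means, the sketch below applies the sequential (online) K-Means update, in which each new sample moves its nearest centroid towards it by a step of 1/count. This is an illustrative assumption about one possible direction, not a component of the evaluated method; the function names and starting values are hypothetical.

```python
def nearest(centroids, x):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )


def online_update(centroids, counts, x):
    """Assign x to its nearest centroid and update that centroid in place."""
    c = nearest(centroids, x)
    counts[c] += 1
    step = 1.0 / counts[c]  # running-mean step size
    centroids[c] = [m + step * (a - m) for m, a in zip(centroids[c], x)]
    return c


# Hypothetical starting state: two centroids, each built from one sample.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]
assigned = online_update(centroids, counts, [2.0, 0.0])
```

Each update costs time proportional to the number of clusters times the number of features and does not revisit earlier samples, which is why updates of this kind are attractive when latency matters.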

Scalability remains a concern, especially when applying the K-Means Clustering algorithm to larger datasets in real-time scenarios. Although we have demonstrated that the processing time scales approximately linearly with dataset size up to 10,000 samples, further research into distributed computing or cloud-based implementations could help maintain performance when handling larger datasets.

Finally, the rapidly evolving nature of phishing techniques, including more sophisticated spear phishing and whaling attacks, requires ongoing attention. Our current model has not been fully tested against all types of phishing attacks. Future work should investigate how well the model adapts to emerging phishing techniques, potentially through adversarial training or data augmentation, ensuring that it remains effective as phishing strategies evolve.

6. Conclusions

Social engineering, particularly phishing, remains a prevalent threat in cybersecurity. These attacks exploit human psychology, often impersonating trusted entities to create a sense of urgency and trigger impulsive actions. The effectiveness of social engineering lies in its ability to bypass technical security measures by targeting the human element, making it a cost-effective method for attackers. This paper proposed an enhanced K-Means Clustering algorithm by incorporating the CfsSubsetEval attribute evaluator for phishing detection. Our simulation results demonstrate that the proposed approach achieves high accuracy rates: 89.2% for 2000 samples, 86.62% for 7000 samples, and 86.25% for 10,000 samples. In comparison, the kernel K-Means algorithm achieved lower accuracy rates: 51.5% for 2000 samples, 52.93% for 7000 samples, and 56.98% for 10,000 samples. While our proposed method shows promising results, it is important to note that we cannot definitively claim it as the most accurate approach in the field. Our comparisons were limited to the kernel K-Means algorithm within our specific experimental setup. Further comprehensive comparisons with other state-of-the-art methods would be necessary to establish a broader performance benchmark. The difference in accuracy between our proposed method and kernel K-Means highlights the potential of integrating feature selection techniques with clustering algorithms for phishing detection. However, it is worth noting that kernel K-Means showed improvement with larger sample sizes, indicating its potential for handling complex data relationships. This observation suggests that different approaches may have distinct strengths depending on the dataset characteristics and size.

Our study opens several avenues for future work. Firstly, identifying more efficient attribute evaluators could further enhance K-Means’ ability to detect phishing attacks, potentially improving accuracy while reducing error rates. Secondly, expanding the comparison to include other machine learning and deep learning approaches would provide a more comprehensive understanding of our method’s performance in the broader context of phishing detection techniques. From a practical standpoint, integrating this approach as a browser extension could offer real-time protection for users. Future research should also focus on the adaptability of the model to evolving phishing techniques and its performance in real-world, dynamic environments.

Author Contributions

Conceptualisation, A.A.-S., M.E. and S.K.; methodology, A.A.-S. and M.E.; software, A.A.-S. and K.H.; validation, A.A.-S., K.H. and M.E.; formal analysis, A.A.-S. and K.H.; investigation, A.A.-S. and K.H.; resources, A.A.-S. and M.E.; data curation, A.A.-S. and K.H.; writing—original draft preparation, A.A.-S. and K.H.; writing—review and editing, A.A.-S. and M.E.; visualisation, A.A.-S. and K.H.; supervision, M.E.; project administration, M.E. and A.A.-S.; funding acquisition, M.E. and S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article. Further inquiries can be directed to the corresponding author. The dataset used for web phishing detection is available for download from https://data.mendeley.com/datasets/c2gw7fy2j4/3 (accessed on 11 April 2023).

Acknowledgments

We acknowledge the responsible use of ChatGPT and Claude as tools to assist in the preparation and enhancement of this manuscript. ChatGPT was employed to improve the clarity and coherence of the text, generate suggestions for improvements, and provide insights on structuring and formatting sections according to the MDPI template. Claude was utilised to assist in writing the abstract and conclusion based on drafts provided by the authors. All content generated by these AI tools was thoroughly reviewed and validated by the authors to ensure accuracy and relevance to the research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Athulya, A.A.; Praveen, K. Towards the Detection of Phishing Attacks. In Proceedings of the 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI) (48184), Tirunelveli, India, 15–17 June 2020; pp. 337–343.
2. Sadiq, A.; Anwar, M.; Butt, R.A.; Masud, F.; Shahzad, M.K.; Naseem, S.; Younas, M. A review of phishing attacks and countermeasures for internet of things-based smart business applications in industry 4.0. Hum. Behav. Emerg. Technol. 2021, 3, 854–864.
3. Aleroud, A.; Zhou, L. Phishing environments, techniques, and countermeasures: A survey. Comput. Secur. 2017, 68, 160–196.
4. Dolan, K. Quarters. In Safe Places: Stories; University of Massachusetts Press: Amherst, MA, USA, 2022; pp. 80–88.
5. Alkhalil, Z.; Hewage, C.; Nawaf, L.; Khan, I. Phishing attacks: A recent comprehensive study and a new anatomy. Front. Comput. Sci. 2021, 3, 563060.
6. Patel, J. Phishing URL detection using artificial neural network. Int. J. Res. Eng. Sci. Manag. 2022, 5, 47–51.
7. Frauenstein, E.D.; Flowerday, S. Susceptibility to phishing on social network sites: A personality information processing model. Comput. Secur. 2020, 94, 101862.
8. Jain, A.K.; Gupta, B.B. A survey of phishing attack techniques, defence mechanisms and open research challenges. Enterp. Inf. Syst. 2021, 16, 527–565.
9. Alghenaim, M.F.; Bakar, N.A.A.; Rahim, F.A.; Vanduhe, V.Z.; Alkawsi, G. Phishing Attack Types and Mitigation: A Survey. In Proceedings of the International Conference on Data Science and Emerging Technologies, Singapore, 20–21 December 2022; pp. 131–153.
10. Aljabri, M.; Mirza, S. Phishing Attacks Detection using Machine Learning and Deep Learning Models. In Proceedings of the 2022 7th International Conference on Data Science and Machine Learning Applications (CDMA), Riyadh, Saudi Arabia, 1–3 March 2022; pp. 175–180.
11. Abedin, N.F.; Bawm, R.; Sarwar, T.; Saifuddin, M.; Rahman, M.A.; Hossain, S. Phishing Attack Detection using Machine Learning Classification Techniques. In Proceedings of the 2020 3rd International Conference on Intelligent Sustainable Systems (ICISS), Thoothukudi, India, 3–5 December 2020; pp. 1125–1130.
12. Basit, A.; Zafar, M.; Liu, X.; Javed, A.R.; Jalil, Z.; Kifayat, K. A comprehensive survey of AI-enabled phishing attacks detection techniques. Telecommun. Syst. 2021, 76, 139–154.
13. Lain, D.; Kostiainen, K.; Capkun, S. Phishing in Organizations: Findings from a Large-Scale and Long-Term Study. In Proceedings of the 2022 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 22–26 May 2022; pp. 842–859.
14. Alam, M.N.; Sarma, D.; Lima, F.F.; Saha, I.; Ulfath, R.E.; Hossain, S. Phishing attacks detection using machine learning approach. In Proceedings of the 3rd International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 20–22 August 2020; pp. 1173–1179.
15. Kamalam, G.K.; Suresh, P.; Nivash, R.; Ramya, A.; Raviprasath, G. Detection of Phishing Websites Using Machine Learning. In Proceedings of the 2022 International Conference on Computing, Communication and Informatics (ICCCI), Coimbatore, India, 25–27 January 2022; pp. 1–4.
16. Basit, A.; Zafar, M.; Javed, A.R.; Jalil, Z. A Novel Ensemble Machine Learning Method to Detect Phishing Attack. In Proceedings of the 2020 23rd IEEE International Multi-Topic Conference (INMIC), Bahawalpur, Pakistan, 25–27 January 2020; pp. 1–5.
17. Mao, J.; Bian, J.; Tian, W.; Zhu, S.; Wei, T.; Li, A.; Liang, Z. Phishing page detection via learning classifiers from page layout feature. EURASIP J. Wirel. Commun. Netw. 2019, 43, 43.
18. Shahrivari, V.; Darabi, M.M.; Izadi, M. Phishing detection using machine learning techniques. arXiv 2020, arXiv:2009.11116.
19. Adebowale, M.A.; Lwin, K.T.; Hossain, M.A. Intelligent phishing detection scheme using deep learning algorithms. J. Enterp. Inf. Manag. 2023, 36, 747–766.
20. Sahu, K.; Shrivastava, S.K. Kernel K-Means Clustering for Phishing Website and Malware Categorization. Int. J. Comput. Appl. 2015, 111, 20–25.
21. Hossain, S.; Sarma, D.; Chakma, R.J. Machine learning-based phishing attack detection. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 378–388.
22. Muhammad, B.A.; Iqbal, R.; James, A.; Nkantah, D. Comparative Performance of Machine Learning Methods for Text Classification. In Proceedings of the 2020 International Conference on Computing and Information Technology (ICCIT-1441), Tabuk, Saudi Arabia, 9–10 September 2020; pp. 1–5.
23. Barlow, L.; Bendiab, G.; Shiaeles, S.; Savage, N. A Novel Approach to Detect Phishing Attacks using Binary Visualisation and Machine Learning. In Proceedings of the 2020 IEEE World Congress on Services (SERVICES), Beijing, China, 18–23 October 2020; pp. 177–182.
24. Salahdine, F.; Mrabet, Z.E.; Kaabouch, N. Phishing Attacks Detection—A Machine Learning-Based Approach. In Proceedings of the 2021 IEEE 12th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York, NY, USA, 1–4 December 2020; pp. 250–255.
25. Gupta, B.B.; Jain, A.K. Phishing attack detection using a search engine and heuristics-based technique. J. Inf. Technol. Res. 2020, 13, 94–109.
26. Salihu, S.A.; Oladipo, I.D.; Wojuade, A.A.; Abdulraheem, M.; Babatunde, A.O.; Ajiboye, A.R.; Balogun, G.B. Detection of Phishing URLs Using Heuristics-Based Approach. In Proceedings of the 2022 5th Information Technology for Education and Development (ITED), Abuja, Nigeria, 1–3 November 2022; pp. 1–7.
27. Baki, S.; Verma, R.M. Sixteen Years of Phishing User Studies: What Have We Learned? IEEE Trans. Dependable Secur. Comput. 2023, 20, 1200–1212.
28. Salloum, S.; Gaber, T.; Vadera, S.; Shaalan, K. Phishing Email Detection Using Natural Language Processing Techniques: A Literature Survey. Procedia Comput. Sci. 2021, 189, 19–28.
29. Salloum, S.; Gaber, T.; Vadera, S.; Shaalan, K. A Systematic Literature Review on Phishing Email Detection Using Natural Language Processing Techniques. IEEE Access 2022, 10, 65703–65727.
30. Gualberto, E.S.; De Sousa, R.T.; De Brito Vieira, T.P.; Da Costa, J.P.C.L.; Duque, C.G. The Answer is in the Text: Multi-Stage Methods for Phishing Detection Based on Feature Engineering. IEEE Access 2020, 8, 223529–223547.
31. Goud, N.S.; Mathur, A. Feature Engineering Framework to detect Phishing Websites using URL Analysis. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 295–303.
32. Mourtaji, Y.; Bouhorma, M.; Alghazzawi, D.; Aldabbagh, G.; Alghamdi, A. Hybrid Rule-Based Solution for Phishing URL Detection Using Convolutional Neural Network. Wirel. Commun. Mob. Comput. 2021, 2021, 8241104.
33. Abidoye, A.P.; Kabaso, B. Hybrid machine learning: A tool to detect phishing attacks in communication networks. ECTI Trans. Comput. Inf. Technol. ECTI-CIT 2021, 15, 374–389.
34. Asiri, S.; Xiao, Y.; Alzahrani, S.; Li, T. PhishingRTDS: A real-time detection system for phishing attacks using a Deep Learning model. Comput. Secur. 2024, 141, 103843.
35. Linh, D.M.; Hung, H.; Chau, H.M.; Vu, Q.S.; Tran, T.N. Real-time phishing detection using deep learning methods by extensions. Int. J. Electr. Comput. Eng. 2024, 14, 3021–3035.
36. Abdelali Elkouay, N.M.; Madani, A. Graph-based phishing detection: URLGBM model driven by machine learning. Int. J. Comput. Appl. 2024, 46, 481–495.
37. Balaji, S.; Sathishkumar, R.; Sharmila, G.; Arikaran, N.; Nivas, K.; Dhamotharan, S. Machine Learning Based Improved Phishing Detection using Adversarial Auto Encoder. In Proceedings of the 2023 International Conference on System, Computation, Automation and Networking (ICSCAN), Puducherry, India, 17–18 November 2023; pp. 1–4.
38. Shirazi, H.; Muramudalige, S.R.; Ray, I.; Jayasumana, A.P.; Wang, H. Adversarial Autoencoder Data Synthesis for Enhancing Machine Learning-Based Phishing Detection Algorithms. IEEE Trans. Serv. Comput. 2023, 16, 2411–2422.
39. Berens, B.; Dimitrova, K.; Mossano, M.; Volkamer, M. Phishing awareness and education–When to best remind. In Proceedings of the Workshop on Usable Security and Privacy (USEC), San Diego, CA, USA, 28 April 2022.
40. Sarker, O.; Jayatilaka, A.; Haggag, S.; Liu, C.; Babar, M.A. A Multi-vocal Literature Review on challenges and critical success factors of phishing education, training and awareness. J. Syst. Softw. 2024, 208, 111899.
41. Zhang, X.; Miao, X.; Xue, M. A Reputation-Based Approach Using Consortium Blockchain for Cyber Threat Intelligence Sharing. Secur. Commun. Netw. 2022, 2022, 7760509.
42. Fuxen, P.; Hachani, M.; Hackenberg, R.; Ross, M. MANTRA: Towards a Conceptual Framework for Elevating Cybersecurity Applications Through Privacy-Preserving Cyber Threat Intelligence Sharing. Cloud Comput. 2024, 2024, 43.
43. Hannousse, A.; Yahiouche, S. Towards benchmark datasets for machine learning based website phishing detection: An experimental study. Eng. Appl. Artif. Intell. 2021, 104, 104347.

Figure 1. Flow of the proposed phishing detection approach.

Figure 2. Feature importance chart: relevance of the top 19 features selected by CfsSubsetEval in distinguishing between phishing and legitimate websites.

Figure 3. Comparison of Error Rates for kernel K-Means and proposed K-Means on different sample sizes.

Figure 4. Comparison of Accuracy for kernel K-Means and proposed K-Means on different sample sizes.

Figure 5. Processing time vs. dataset size for the proposed K-Means approach.

Table 1. Website phishing attacks and their characteristics.

Type of Phishing	Method	Target	Typical Characteristics	Typical Solutions	Remaining Challenges
Website Phishing	Creation of fake websites mimicking legitimate ones	General internet users	Fake websites often mimic legitimate sites closely, can be linked from emails or search engines. Examples: fake banking websites, counterfeit e-commerce sites.	Secure browsing habits, browser security features, website verification tools	Difficult for users to identify fake sites, continuous creation of new phishing sites, sophisticated design and domain spoofing.
Mobile Website Phishing	Fake websites designed for mobile devices	Mobile internet users	Optimised for mobile viewing, often linked from SMS or mobile apps. Examples: fake mobile banking sites, fraudulent mobile service portals.	Mobile security software, cautious browsing on mobile devices, awareness campaigns	Smaller screens make it harder to identify phishing cues, increased mobile browsing, sophisticated mobile site designs.
Social Media Phishing	Fake social media pages or malicious links shared via social platforms	Social media users	Fake profiles, fraudulent links, impersonation of trusted contacts. Examples: phishing links shared in posts or messages, fake customer service accounts.	Social media platform security measures, user education, link verification tools	High volume of social media activity, sophisticated fake profiles, rapid spread of malicious links.

Table 2. Other types of phishing attacks and their characteristics.

Type of Phishing	Method	Target	Typical Characteristics	Typical Solutions	Remaining Challenges
Email Phishing	Mass distribution of fraudulent emails	General public	Spoofed sender address, urgent language, generic greetings. Examples: fake bank alerts, lottery scams.	Spam filters, email authentication protocols (DMARC, DKIM, SPF)	Sophisticated spoofing techniques can bypass filters, human error, and lack of awareness.
Spear Phishing	Targeted emails to specific individuals or organisations	Specific individuals or organisations	Personalised content, research on targets, often mimics trusted contacts. Examples: fake job offers, custom emails to company employees.	Employee training, multi-factor authentication (MFA), advanced email filtering	High customisation makes detection difficult, and social engineering exploits human trust.
Whaling	Highly targeted attacks on high-profile individuals	High-profile individuals (e.g., C-suite executives)	Sophisticated content, often involves significant research. Examples: CEO fraud, business email compromise.	Executive training, secure email gateways, incident response plans	High-value targets attract persistent attackers, sophisticated social engineering.
Smishing	SMS messages with malicious links or requests	Mobile users	Short, urgent messages with malicious links. Examples: fake delivery notifications, fraudulent alerts from service providers.	Mobile security apps, awareness campaigns, SMS filtering	Limited SMS filtering capabilities, increased mobile usage, human susceptibility to urgency.
Vishing	Voice calls impersonating authority figures or entities	Phone users	Impersonation of authority figures, creation of urgency. Examples: fake IRS calls, tech support scams.	Caller ID verification, awareness training, call-blocking apps	Spoofed caller IDs, convincing social engineering tactics, real-time interaction pressure.
Clone Phishing	Replication of legitimate emails with malicious content	Previous email recipients	Nearly identical to legitimate emails but with malicious attachments/links. Examples: duplicated event invitations, fake service updates.	Digital signatures, employee training, secure email practices	Difficulty in distinguishing cloned emails, reliance on recipient vigilance, advanced spoofing techniques.
Search Engine Phishing	Manipulation of search engine results to appear legitimate	Internet searchers	Appears in search results, often mimics popular websites. Examples: fake tech support websites, phishing sites mimicking popular services.	Search engine algorithms, user education, browser warnings	Constantly evolving phishing sites, reliance on user caution, search engine algorithm limitations.

Table 3. Latest advancements in website phishing detection using AI and ML.

Study	Techniques Used	Dataset	Key Findings
Alam et al. [14]	Decision Trees (DT) and Random Forest (RF)	Kaggle phishing dataset	RF achieved high accuracy (96.9%) despite missing data, effectively addressing overfitting issues.
Kamalam et al. [15]	Decision Trees (DT) and Random Forest (RF)	Phishing URLs Dataset	Developed a Chrome extension for phishing detection, achieving 97.31% accuracy with RF.
Basit et al. [16]	Artificial Neural Networks (ANN), K-Nearest Neighbours (KNN), Decision Trees (DT), and Random Forest Classifier (RFC)	UCI ML repository	KNN combined with RFC provided the lowest FP rate (0.038) and highest TP rate (0.983).
Mao et al. [17]	Support Vector Machine (SVM)	Phishing website dataset	SVM achieved an accuracy rate of approximately 96%.
Shahrivari et al. [18]	XGBoost and other classifiers (LR, DT, SVM, Ada Boost, RF, Neural Networks, KNN, Gradient Boosting)	Phishing website dataset	XGBoost demonstrated exceptional performance, achieving an accuracy of approximately 98%.
Adebowale et al. [19]	Various deep learning techniques	Multiple phishing datasets	Achieved accuracy ranging from 91% to 94.8% using deep learning methods.
Sahu et al. [20]	K-Means Clustering	Large phishing and malware dataset	Improved error rate in classifying phishing and malware websites from 10% to 20%.
Hossain et al. [21]	Random Forest (RF), Bagging, Stochastic Gradient Descent (SGD), Logistic Regression (LR)	Website attribute dataset	RF achieved the best performance with an F1 score of 0.99.
Abedin et al. [11]	K-Nearest Neighbours (KNN), Logistic Regression (LR), Random Forest (RF)	Phishing URLs Dataset	RF demonstrated the highest precision rate (97%) and AUC score of 1.0.
Muhammad et al. [22]	Decision Trees (DT), K-Nearest Neighbours (KNN), Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM)	SMS and email phishing datasets	RF showed exceptional average accuracy but was time-consuming in classification.
Barlow et al. [23]	Neural Networks with Binary Visualisation Techniques	Phishing website dataset	Effective in rapidly and accurately detecting phishing attackers, improving through learning from incorrect classifications.
Salahdine et al. [24]	Artificial Neural Networks (ANN), Logistic Regression (LR), Support Vector Machine (SVM)	Phishing website dataset	ANN achieved the highest accuracy using two hidden layers and the ReLU activation function.

Table 4. Other approaches in the phishing detection literature.

Approach	Description	Pros	Cons
Heuristic-Based Detection	Use of predefined rules and heuristics to identify phishing characteristics, such as URL analysis and content inspection [25,26].	Simple and fast implementation, low computational cost	Limited detection accuracy, prone to false positives and false negatives
Behavioural Analysis	Monitoring and analysing user behaviour patterns to detect anomalies indicative of phishing attacks [27].	Can detect new and unknown phishing techniques, adaptive to user behaviour	High computational cost, privacy concerns, and potential for false positives
Natural Language Processing (NLP)	Applying NLP techniques to analyse textual content in emails and websites for phishing indicators [28,29].	Effective in analysing textual content, can detect sophisticated phishing attempts	Requires large datasets for training, language-specific limitations
Feature Engineering	Designing and extracting specific features from data to improve the accuracy of phishing detection models [30,31].	Can significantly enhance model performance, adaptable to different datasets	Time-consuming and requires domain expertise, may not generalise well
Hybrid Approaches	Combining multiple techniques, such as machine learning and rule-based methods, to enhance detection performance [32,33].	Higher detection accuracy, robust against various attack types	Increased complexity, higher computational cost
Real-Time Detection Systems	Implementing systems capable of detecting phishing attempts in real-time, often integrated into browsers or email clients [34,35].	Immediate protection, high user convenience	High computational and resource requirements, potential latency issues
Graph-Based Approaches	Using graph theory to model relationships between entities (e.g., email senders, domains) and detect phishing patterns [36].	Effective in identifying complex relationships, scalable to large datasets	Complex implementation, requires significant computational resources
Adversarial Training	Training models to recognise and resist adversarial attacks designed to evade detection mechanisms [37,38].	Improved model robustness, can handle sophisticated attacks	Requires continuous updates, high computational cost
User Education and Awareness	Developing training programs and awareness campaigns to educate users about phishing threats and prevention strategies [39,40].	Enhances user ability to recognise phishing attempts, cost-effective	Relies on user compliance, not a technical solution
Threat Intelligence Sharing	Collaborating and sharing threat intelligence data among organisations to improve detection capabilities [41,42].	Collective defence, access to broader threat data	Data privacy concerns, potential for information overload

Table 5. Error Rate results for kernel K-Means and proposed K-Means.

Number of Samples	Error Rate (%)—Kernel K-Means	Error Rate (%)—Proposed K-Means
2000	48.5	10.8
7000	47.07	13.38
10,000	43.02	13.75

Table 6. Accuracy results for kernel K-Means and proposed K-Means.

Number of Samples	Accuracy (%)—Kernel K-Means	Accuracy (%)—Proposed K-Means
2000	51.5	89.2
7000	52.93	86.62
10,000	56.98	86.25

Table 7. Performance metrics for proposed K-Means approach.

Number of Samples	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
2000	89.2	88.5	90.1	89.3
7000	86.62	85.9	87.6	86.7
10,000	86.25	85.7	87.0	86.3
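As a consistency check on Table 7, the F1-score can be recomputed from the reported precision and recall using the standard definition, F1 = 2PR / (P + R). The sketch below uses only the values in the table; the function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) per sample size, from Table 7.
table7 = {
    2000: (88.5, 90.1, 89.3),
    7000: (85.9, 87.6, 86.7),
    10000: (85.7, 87.0, 86.3),
}

for n, (p, r, reported) in table7.items():
    recomputed = round(f1_score(p, r), 1)
    print(f"{n} samples: recomputed F1 = {recomputed}, reported F1 = {reported}")
```

For all three sample sizes, the recomputed value rounded to one decimal place matches the reported F1-score.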

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
