1. Introduction
Phishing attacks are a form of social engineering in which hackers use spoofed emails or bogus websites to gather clients’ sensitive data. These attacks typically involve four steps: creating a fake website that mimics a legitimate one, distributing a link to the fake website while posing as a legitimate entity, persuading the victim to visit the bogus website and, finally, capturing the victim’s valuable information when they enter it on the fake site [
1,
2,
3]. Cybersecurity, which involves protecting computers, mobile devices, systems, networks, and user data from electronic intrusions, faces phishing as one of its most persistent challenges. The first quarter of 2022 set a record for phishing, with 1,025,968 attacks observed, according to the Anti-Phishing Working Group’s (APWG’s) Phishing Activity Trends Report [
4]. More recent data from APWG shows that this trend has continued, with the first half of 2023 seeing over 1.2 million unique phishing attacks. Furthermore, the fourth quarter of 2023 witnessed a 47% increase in phishing attacks compared to the previous quarter, indicating that the threat is not only persistent, but growing [
5]. Users often lack basic knowledge of Uniform Resource Locators (URLs), making it difficult for them to determine which web pages can be trusted. Factors such as redirection, hidden URLs, a wide range of URL alternatives, or typing mistakes contribute to users’ vulnerability to phishing attacks. During these attacks, perpetrators create false web pages that replicate legitimate websites and distribute them via spam emails, Short Message Service (SMS), or social media [
6,
7]. Phishing attacks can be categorised into two methods: attack launching and data gathering. Attack launching methods include man-in-the-middle attacks, URL spoofing, website spoofing, and email spoofing. Data gathering methods can be further divided into automated methods (such as fake website forms, keyloggers, and recorded messages) and manual methods (such as misdirection and social engineering) [
8,
9]. Machine Learning (ML) algorithms, a subset of Artificial Intelligence (AI), have shown great promise in detecting phishing attacks. These algorithms can forecast future values using historical data as input, allowing software applications to make predictions without explicit programming. ML has found wide-ranging applications in industries where traditional algorithms are difficult to build, including medical applications, email filtering, speech recognition, agriculture, and computer vision [
10,
11]. This paper focuses on the utilisation of different artificial intelligence algorithms for phishing attack detection and the limitations of each approach. The aim is to propose an enhancement for the use of machine learning algorithms in detecting phishing attacks by identifying an unsupervised algorithm that can effectively minimise and detect these increasingly prevalent threats. The key contributions of this paper include:
A novel approach combining K-Means Clustering with the CfsSubsetEval attribute evaluator for improved phishing detection.
Comprehensive comparative analysis of the proposed method against kernel K-Means, demonstrating improvements in accuracy across various sample sizes.
Insights into the scalability and adaptability of unsupervised learning techniques for phishing detection, particularly in handling large datasets.
Exploration of the potential for real-time phishing detection through the proposed method, opening avenues for practical applications such as browser extensions.
Critical analysis of the strengths and limitations of the proposed approach, providing a balanced view of its potential impact on cybersecurity practices.
The remainder of the paper is organised as follows. Section 2 provides background and a comprehensive literature review of ML techniques applied to phishing attack detection. Section 3 describes the proposed approach, which integrates K-Means Clustering with the CfsSubsetEval attribute evaluator. The simulation results for the proposed algorithm are presented in Section 4. The limitations of the work are presented in Section 5, followed by concluding remarks in Section 6.
2. Background
Research has shown that standard methods of detecting phishing attacks are only able to identify 20% of attacks [12]. It is therefore essential for clients to be aware of these attacks in order to avoid falling victim to them. Organisations often rely on rule-based training to help individuals recognise specific cues or follow a set of guidelines to prevent phishing attempts [13]. However, the recent literature indicates a shift towards the increased use of machine and deep learning algorithms for phishing website classification.
Phishing attacks come in various forms, each targeting different vulnerabilities and user groups.
Table 1 provides an overview of website phishing attacks, including those specifically designed for mobile devices and social media platforms. This table outlines the methods, targets, typical characteristics, typical solutions, and remaining challenges associated with each type.
Table 2 summarises the other main types of phishing attacks, detailing their methods, targets, typical characteristics, typical solutions, and remaining challenges. Understanding these different types of phishing attacks is crucial for developing comprehensive detection and prevention methods.
In ML, Decision Trees (DT) are often utilised for classification tasks. A tree consists of internal nodes that test attributes, and a sample is classified by following the outcomes of these tests along a branch until it reaches a leaf node, which assigns the class. The Random Forest (RF) algorithm is known for its robustness in the field of ML and can be used for both classification and regression. By employing the bagging method, RF combines multiple decision trees and generates an overall prediction by aggregating their outputs (a majority vote for classification, an average for regression). This approach enables the distinction between legitimate and phishing websites. Alam et al. utilised standard datasets of phishing attacks as input for the two ML algorithms, DT and RF [14]. Despite the presence of missing data, RF achieved an accuracy of approximately 96.9%. This suggests that the approach could also help to address the problem of overfitting.
Kamalam et al. [
15] evaluated the performance of two machine learning algorithms, namely DT and RF, on the Phishing URLs Dataset. After selecting the best performing algorithm, the authors developed a Chrome extension to detect phishing websites. The extension allows for easy deployment of the phishing detection model to end users. By utilizing the RF algorithm, the researchers achieved a good accuracy rate of 97.31% in identifying phishing websites.
Basit et al. [
16] selected three ML classifiers: Artificial Neural Networks (ANN), K-Nearest Neighbours (KNN), and DT to combine with the RF Classifier (RFC) in an ensemble approach. The RFC was used as the base classifier with the ANN, KNN, and DT algorithms. They utilised a common dataset from the UCI ML repository. The results showed that the KNN classifier combined with the RFC provided the lowest False Positive (FP) rate (0.038) and the highest True Positive (TP) rate (0.983) among the ensembles. In addition, the KNN and RFC ensemble classifiers had the highest precision (0.970) and recall (0.983) compared to all the other classifiers. The accuracy of KNN in this case is 97.33%, which is higher than the ANN (97.16%) and the DT (C4.5) (96.36%). Support Vector Machine (SVM), which is a supervised ML approach, can also be used to analyse data for classification and regression tasks. In a study conducted by Mao et al. [
17], SVM was employed as a classifier on a dataset consisting of 24,051 samples. The results showed an accuracy rate of approximately 96%.
The Extreme Gradient Boosting (XGBoost) method is a powerful ensemble approach to supervised learning. Data scientists rely heavily on XGBoost in various machine learning challenges because of its scalability and end-to-end tree boosting capabilities, and it can be applied to both classification and regression problems. In a study conducted by Shahrivari et al. [18], twelve classifiers were simulated and tested on a phishing website dataset. These classifiers included Logistic Regression (LR), DT, SVM, AdaBoost, RF, Neural Networks, KNN, Gradient Boosting, and XGBoost. Among them, XGBoost demonstrated the best performance in terms of computation time and accuracy, achieving an accuracy of approximately 98%.
Deep learning, which is a subset of ML methods that includes ANN and representation learning, has also shown great promise for tasks such as phishing website classification. In a study by Adebowale et al. [
19], various deep learning techniques were applied to multiple datasets. The accuracy achieved ranged from a minimum of approximately 91% to a maximum of roughly 94.8%. Sahu et al. [
20] utilised vector space analysis to select variables from a large dataset for the purpose of detecting phishing and malware websites. As a result, the error rate in classifying malware and phishing websites improves from 10% to 20%. Furthermore, when compared to ensemble clustering, this method also demonstrates an improvement in categorisation error rates for both phishing websites and virus samples. Therefore, this study effectively shows that the system performs well in categorizing malware and real phishing websites.
Hossain et al. [
21] conducted a study to identify potential phishing websites. They analysed multiple machine learning algorithms (RF, Bagging, SGD and LR) using a dataset that included attributes about websites and their associated information. The article aims to assist readers by offering a comprehensive examination of different techniques, ultimately concluding that the RF classifier performs the best. Particularly, the RF achieves an impressive F1 score of 0.99, indicating that both the false positive and false negative rates fall within acceptable limits. Abedin et al. conducted a study comparing the effectiveness of three commonly used ML classifiers: KNN, LR and RF [
11]. Among these classifiers, RF demonstrated the highest precision rate of 97%. Additionally, the RF classifier achieved an impressive Area Under Curve (AUC) score of 1.0, indicating its ability to accurately identify phishing URLs.
Muhammad et al. employed five machine learning techniques, namely DT, KNN, Naive Bayes (NB), RF, and SVM, to detect phishing attacks [
22]. The authors utilised two datasets, one for SMS and another for email, and analysed the word content within these datasets to identify phishing attempts. Performance criteria for the experiment included the time taken to conduct the analysis and the accuracy of correctly classifying the data. RF technique demonstrated exceptional average accuracy; however, it was observed that the classification process using this method was comparatively time-consuming. Based on the outcomes of the study, it can be concluded that while a method may effectively identify phishing attempts, it may also require a longer time to yield the best results.
Barlow et al. [
23] proposed a novel approach for detecting phishing using multilevel artificial intelligence, combining neural networks with binary visualisation techniques. Visual representation makes it possible to gain insight into the structural differences between legitimate and phishing websites. The initial results of the experiment indicate that this approach is effective in rapidly and accurately detecting phishing attacks. Furthermore, the technique improves its effectiveness by learning from incorrect classifications.
Salahdine et al. [
24] proposed an ML-based method for detecting phishing attacks. The researchers trained and evaluated three classifiers (ANN, LR, and SVM) on a dataset and reported the best result of their parametric study for each classifier. For the SVM classifier, the Gaussian radial basis function kernel achieved good accuracy. The LR classifier achieved its highest accuracy with a regularisation parameter of 0.4. However, the ANN classifier achieved even higher accuracy by using two hidden layers, each with 100 neurons, and the ReLU activation function. Consequently, the suggested methodology enables rapid and accurate detection of phishing attacks.
Table 3 presents recent advancements in phishing detection using AI and ML, while
Table 4 summarises various other approaches in the literature. These tables highlight the strengths and weaknesses of existing methods, demonstrating the continuous evolution and sophistication of phishing detection techniques. Despite these advancements, challenges remain, particularly in balancing detection accuracy with computational efficiency and adaptability to emerging threats.
Motivated by these insights, this paper proposes an optimised model for detecting phishing attacks based on K-Means Clustering. Various approaches have been proposed in the literature, resulting in different accuracy percentages. The objective of this work is to reduce phishing attacks and minimise losses by identifying a suitable unsupervised learning algorithm for the model.
While supervised learning methods have shown high accuracy in phishing detection, this research explores K-Means Clustering as an unsupervised alternative. K-Means offers several advantages: it can adapt to new threats without relying on labelled data, it is computationally efficient, and it provides interpretable results. The reduced dependency on labelled datasets makes it easier to implement and maintain. Moreover, K-Means can potentially uncover novel patterns and feature combinations that distinguish phishing attempts from legitimate websites. By employing this unsupervised approach, we aim to complement existing techniques, contributing to a more robust and diverse set of tools for combating phishing attacks. This exploration of K-Means Clustering may reveal insights that supervised methods might overlook, ultimately enhancing our overall phishing detection capabilities.
Building upon the insights gained from previous research, this paper proposes a novel approach to phishing detection using an unsupervised machine learning technique. We introduce a model that combines K-Means Clustering with the CfsSubsetEval attribute evaluator, aiming to leverage the strengths of unsupervised learning while optimising feature selection. This approach seeks to address some of the limitations of supervised methods while potentially uncovering new patterns in phishing attacks. The following section provides a detailed description of our proposed algorithm, its implementation, and the rationale behind our methodological choices.
3. Proposed Approach for Enhanced Phishing Website Classification
This paper introduces a novel approach to phishing website classification that combines K-Means Clustering with kernel methods and integrates the CfsSubsetEval attribute evaluator. This combination is intended to enhance both the accuracy and efficiency of phishing attack detection. The pseudo-code of the proposed algorithm is provided in Algorithm 1.
At the core of our approach lies K-Means Clustering, an unsupervised learning technique that partitions unlabelled data into distinct clusters. The parameter K specifies the number of clusters to be formed. For phishing detection, we set K to 2, representing the phishing and legitimate website groups. The method categorises data without requiring labelled training data, using a centroid-based approach in which each cluster is represented by its centre point. Its objective is to minimise the total squared distance between each data point and the centroid of its assigned cluster. The algorithm starts from an initial set of K centroids, then alternates between assigning each point to its nearest centroid and recomputing the centroids, and stops when the assignments no longer change.
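The iterative procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration of standard K-Means (Lloyd's algorithm) with K = 2, not the WEKA implementation used in the paper; the function names, the random initialisation, and the toy 2-D data are assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def k_means(points, k=2, max_iter=100, seed=0):
    """Plain K-Means (Lloyd's algorithm); returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct points drawn from the data.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no assignment changed, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Toy data (hypothetical): two well-separated groups of 2-D feature vectors.
data = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = k_means(data, k=2)
```

The cluster indices themselves carry no meaning: which of the two clusters corresponds to phishing has to be decided afterwards, for example by comparing the clusters with known labels.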
The novelty of our approach is twofold. Firstly, we integrate kernel methods with K-Means Clustering, a combination less common in phishing detection. By incorporating polynomial kernel techniques, we enhance the algorithm’s ability to capture complex patterns in URL data. The kernel function $K$ implicitly maps data to a higher-dimensional feature space, allowing for more intricate decision boundaries. Specifically, we use a polynomial kernel of degree $d$, defined as:
$$K(x, y) = (x^{\top} y + c)^{d}$$
where $x$ and $y$ are feature vectors, $c$ is a constant, and $d$ is the degree of the polynomial. This transformation allows the K-Means algorithm to operate in a higher-dimensional space where non-linear relationships between features can be more easily separated.
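To make the phrase "implicitly maps" concrete, the following sketch checks numerically that a degree-2 polynomial kernel on 2-D inputs equals an ordinary dot product in a six-dimensional feature space. The explicit map `explicit_map_degree2` and the sample vectors are illustrative assumptions; the paper does not state the values of the constant or the degree that were used.

```python
import math


def polynomial_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel K(x, y) = (x . y + c) ** d."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (dot + c) ** d


def explicit_map_degree2(x, c=1.0):
    """Explicit feature map for a 2-D input when d = 2 (illustration only)."""
    x1, x2 = x
    r2, rc = math.sqrt(2.0), math.sqrt(2.0 * c)
    return [x1 * x1, x2 * x2, r2 * x1 * x2, rc * x1, rc * x2, c]


x, y = (1.0, 2.0), (3.0, 0.5)

# Kernel value computed directly in the original 2-D space.
k_value = polynomial_kernel(x, y, c=1.0, d=2)

# Same value computed as a dot product in the 6-D feature space.
phi_x, phi_y = explicit_map_degree2(x), explicit_map_degree2(y)
dot_in_feature_space = sum(a * b for a, b in zip(phi_x, phi_y))
```

Both quantities agree up to floating-point rounding, which is why the kernel can stand in for the higher-dimensional dot product without the mapped vectors ever being constructed.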
Secondly, we combine the CfsSubsetEval attribute evaluator with K-Means, an integration not widely explored in unsupervised phishing detection. CfsSubsetEval is a feature selection method that evaluates subsets of attributes based on the individual predictive ability of each attribute and the degree of redundancy between them. Subsets of features that are highly correlated with the class while having low intercorrelation are preferred. This reduces the risk of overfitting and improves the algorithm’s generalisation ability. By applying CfsSubsetEval before K-Means Clustering, we identify the most relevant features for distinguishing phishing sites, potentially improving both the speed and accuracy of our classification.
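The preference for subsets that correlate with the class but not with each other can be illustrated with the correlation-based merit heuristic commonly associated with CFS, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the subset size, r_cf the mean feature-class correlation, and r_ff the mean feature-feature correlation. The sketch below is a simplified illustration and not WEKA's CfsSubsetEval: using absolute Pearson correlation as the correlation measure is an assumption made here, and the feature columns and class labels are hypothetical toy values.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def cfs_merit(features, target):
    """Merit of a feature subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(features)
    # Mean absolute correlation between each feature and the class.
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    # Mean absolute correlation between every pair of features.
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = (
        sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
        if pairs
        else 0.0
    )
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Hypothetical toy columns: 0 = legitimate, 1 = phishing.
target = [0, 0, 0, 1, 1, 1]
informative = [0.1, 0.2, 0.1, 0.9, 0.8, 0.9]
redundant = [2 * v for v in informative]   # perfectly correlated copy
noisy = [1, 0, 1, 0, 1, 0]                 # weakly related to the class

merit_single = cfs_merit([informative], target)
merit_with_copy = cfs_merit([informative, redundant], target)
merit_with_noise = cfs_merit([informative, noisy], target)
```

In this toy case, adding a perfectly redundant copy leaves the merit unchanged, while adding the weakly related column lowers it, which is the behaviour the paragraph above describes.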
Algorithm 1 Proposed algorithm for phishing website classification
1: Input: Dataset
2: Choose the attributes evaluator CfsSubsetEval
3: Choose the search method BestFirst
4: Choose the method K-Means
5: Choose number of clusters
6: Choose distance function Euclidean distance
7: Select class to clusters evaluation status
8: Run the method
The Euclidean distance between two points $x$ and $y$ in $N$-dimensional space is given by
$$D(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$$
Euclidean distance is commonly employed by machine learning algorithms as the default metric for comparing two sets of recorded data. This distance metric requires continuous attributes involving numerical variables, such as weight, height, or pay, in the observations being compared. Its typical purpose is to measure the extent of separation between two rows of data with numerical values. This paper utilises the Euclidean distance as the default metric for assessing the similarity between two recorded observations due to its simplicity and effectiveness in the context of continuous numerical data.
Our methodology, implemented using the WEKA toolkit, begins with a dataset from Kaggle containing 11,430 URLs and 87 features [
43]. We first preprocess the data, cleaning and handling missing values to ensure quality and consistency. The feature selection phase then employs CfsSubsetEval to identify the 19 most relevant attributes. This step is applied to three different sample sizes—2000, 7000, and 10,000 URLs—allowing us to assess the scalability of our approach. The choice of these specific sample sizes is based on their ability to provide a comprehensive assessment of the algorithm’s performance across small, medium, and large datasets, ensuring that the method can scale effectively with varying data volumes.
Following feature selection, we apply a polynomial kernel to transform the feature space, enhancing K-Means’ capability to capture non-linear relationships in the data. The K-Means algorithm then iteratively assigns data points to clusters and recalculates centroids until convergence. Although we use kernel methods, Euclidean distance in the transformed feature space serves as our similarity measure, well-suited for the continuous numerical attributes in our dataset.
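The squared distance in the transformed feature space can be expressed through the kernel function as
$$\lVert \phi(x) - \phi(y) \rVert^{2} = K(x, x) - 2K(x, y) + K(y, y)$$
where $\phi$ is the feature map induced by the kernel function $K$. This ensures that the clustering process reflects the non-linear separations introduced by the kernel function.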
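As a small sanity check on the idea of measuring Euclidean distance in the transformed space, the sketch below computes the squared feature-space distance from kernel evaluations alone, using the standard identity K(x, x) − 2K(x, y) + K(y, y). The function names and sample vectors are assumptions for illustration. With c = 0 and d = 1 the kernel is the plain dot product, so the result should reduce to the ordinary squared Euclidean distance.

```python
def polynomial_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel K(x, y) = (x . y + c) ** d."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (dot + c) ** d


def feature_space_sq_distance(x, y, c=1.0, d=2):
    """Squared distance between phi(x) and phi(y), via kernel values only."""
    return (
        polynomial_kernel(x, x, c, d)
        - 2.0 * polynomial_kernel(x, y, c, d)
        + polynomial_kernel(y, y, c, d)
    )


x, y = (1.0, 2.0), (4.0, 6.0)

# Linear special case (c = 0, d = 1): ordinary squared Euclidean distance.
linear_case = feature_space_sq_distance(x, y, c=0.0, d=1)
plain_sq_euclidean = sum((xi - yi) ** 2 for xi, yi in zip(x, y))

# Degree-2 case: distance in the implicit higher-dimensional space.
quadratic_case = feature_space_sq_distance(x, y, c=1.0, d=2)
```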
We evaluate our clustering results using internal validation measures such as silhouette score and Davies–Bouldin index. Additionally, when ground truth labels are available, we compare our results to assess the accuracy of phishing detection.
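When ground truth labels are available, cluster indices first have to be matched to classes before an accuracy can be computed. The sketch below illustrates one way to do this, in the spirit of the class-to-clusters evaluation selected in Algorithm 1: it tries every one-to-one assignment of classes to clusters and keeps the one with the fewest errors. It is an illustrative reimplementation rather than WEKA's code, it assumes there are no more clusters than classes, and the cluster assignments and labels are hypothetical.

```python
from itertools import permutations


def classes_to_clusters_accuracy(cluster_ids, true_labels):
    """Best accuracy over all one-to-one mappings of clusters to classes."""
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(true_labels))
    best_correct = 0
    # Try each way of assigning a distinct class to every cluster.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(
            mapping[c] == t for c, t in zip(cluster_ids, true_labels)
        )
        best_correct = max(best_correct, correct)
    return best_correct / len(true_labels)


# Hypothetical output of a 2-cluster run and the corresponding true labels.
cluster_ids = [0, 0, 1, 1, 1, 0]
true_labels = ["phishing", "phishing", "legitimate",
               "legitimate", "phishing", "phishing"]

accuracy = classes_to_clusters_accuracy(cluster_ids, true_labels)
```

Here cluster 0 is best matched to the phishing class and cluster 1 to the legitimate class, leaving one of the six samples misclassified.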
Figure 1 illustrates the flow of our proposed approach.
This enhanced method aims to improve both the accuracy and efficiency of phishing website detection by synergising unsupervised learning with optimised feature selection and kernel methods. By applying this approach to various sample sizes, we can evaluate its performance and scalability in real-world scenarios.
The proposed approach can be effectively implemented in various cybersecurity contexts. One primary application could be its integration into web browsers as an extension or built-in feature. This would allow for real-time analysis of websites as users browse, providing immediate warnings about potentially phishing sites. Additionally, this approach could be implemented as part of email filtering systems, helping to identify and flag phishing attempts in incoming messages. For larger organisations, the proposed approach could be deployed as part of an enterprise-level Intrusion Detection System (IDS) to monitor and flag suspicious activity across networks, improving their defences against social engineering attacks.
Our approach addresses the challenges of detecting increasingly sophisticated phishing attempts. The polynomial kernel helps capture subtle variations used by phishers to mimic legitimate sites, while the optimised feature selection focuses the algorithm on the most discriminative URL characteristics. This combination promises to enhance detection capabilities, especially for complex or previously unseen phishing patterns.
Feature Importance Analysis
To further understand the effectiveness of our feature selection process, we generated a Feature Importance Chart (Figure 2). This chart illustrates the relevance of the top features identified by the CfsSubsetEval attribute evaluator and K-Means clustering. Each bar represents a feature, with its length corresponding to the importance score assigned during the evaluation.
The top five features based on their importance are: ‘ratio_intHyperlinks’, ‘links_in_tags’, ‘ratio_extHyperlinks’, ‘safe_anchor’, and ‘google_index’. These features significantly contribute to distinguishing between phishing and legitimate websites. For instance, the presence and ratio of internal hyperlinks (‘ratio_intHyperlinks’) and the external hyperlinks (‘ratio_extHyperlinks’) are critical indicators for classifying phishing attempts. Similarly, features related to web content structure such as ‘links_in_tags’ and the use of ‘safe_anchor’ tags help to enhance the detection accuracy.
The chart reveals that the most significant features are closely related to the structure of web hyperlinks, index status, and content tags. By focusing on these key attributes, our approach ensures that the most discriminative characteristics are prioritized, enhancing the overall accuracy and efficiency of the phishing detection process.
The feature importance analysis underscores the critical role of specific website characteristics, such as hyperlink ratios and tag usage, in identifying phishing attempts. Features with higher importance scores indicate attributes that have a greater impact on the classification decision, providing valuable insights for future research and optimization in phishing detection methodologies.
4. Results
Two metrics were used to evaluate the performance of the proposed K-Means and the kernel K-Means algorithms: Accuracy and Error Rate. Accuracy is the proportion of correctly predicted data points among all the data points. It is calculated by dividing the number of true positives and true negatives by the total number of data points, that is, the sum of true positives, true negatives, false positives, and false negatives. A phishing website that the algorithm correctly identifies as phishing is a true positive, and a legitimate website that it correctly identifies as legitimate is a true negative. This paper uses accuracy to measure the effectiveness of the phishing detection approach in correctly classifying cases from the entire dataset. It is expressed as
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP stands for True Positive, TN stands for True Negative, FP stands for False Positive, and FN stands for False Negative.
In addition to accuracy, we evaluated the precision, recall, and F1-score to provide a more comprehensive assessment of the algorithm’s performance. These additional metrics are critical for understanding the model’s effectiveness in various operational scenarios.
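For reference, the standard definitions of these three metrics in terms of true positives (TP), false positives (FP), and false negatives (FN) are given below. They are the usual textbook forms and are assumed, rather than confirmed by the text, to be the ones used for the reported values.

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}
           {\mathrm{Precision} + \mathrm{Recall}}
```

Precision penalises legitimate websites flagged as phishing, recall penalises phishing websites that are missed, and the F1-score is the harmonic mean of the two.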
Error Rate is the proportion of incorrectly predicted data points among all the data points in a given training, testing, or validation set. A lower error rate is preferable. In this paper, it indicates the degree of prediction error in phishing detection, and it is expressed as
$$\mathrm{Error\ Rate} = \frac{FP + FN}{TP + TN + FP + FN} = 1 - \mathrm{Accuracy}$$
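The relationship between these metrics can be checked with a short calculation from confusion-matrix counts. The sketch below uses the standard definitions; the function name and the counts are hypothetical and are not taken from the experiments reported in this paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, error rate, precision, recall and F1 from raw counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    error_rate = (fp + fn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "error_rate": error_rate,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for 100 websites (phishing is the positive class).
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

Because every prediction is either correct or incorrect, the accuracy and the error rate always sum to one, which matches the paired percentages reported in this section (for example, 89.2% and 10.8%).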
The kernel K-Means and the proposed K-Means have been simulated using the WEKA tool, on samples of 2000, 7000, and 10,000. Simulation results show that the kernel K-Means produced an error rate of 48.5% on 2000 samples, 47.07% on 7000 samples, and 43.02% on 10,000 samples. On the other hand, the proposed approach produced an error rate of 10.8% on 2000 samples, 13.38% on 7000 samples, and 13.75% on 10,000 samples, as shown in
Table 5.
Figure 3 shows the comparison between the proposed K-Means and kernel K-Means based on the Error Rate. The kernel K-Means performed poorly, with an Error Rate above 43% at every sample size, although its Error Rate decreased as the number of samples went from 2000 to 10,000.
After obtaining the error rate, the accuracy was calculated for the kernel K-Means and the proposed method using 2000, 7000, and 10,000 samples, as described by the accuracy equation given earlier. The simulation results showed that the accuracy of the kernel K-Means was 51.5% for 2000 samples, 52.93% for 7000 samples, and 56.98% for 10,000 samples. On the other hand, the proposed approach achieved an accuracy of 89.2% for 2000 samples, 86.62% for 7000 samples, and 86.25% for 10,000 samples, as shown in
Table 6. The comparison of accuracy percentages is illustrated in
Figure 4.
Table 7 shows the performance metrics achieved by the proposed K-Means approach in terms of precision, recall, and F1-score.
Additionally, while the focus of this paper was on the K-Means approach, it is important to acknowledge that traditional machine learning techniques such as DT and SVM have also been used extensively in phishing detection. However, several studies in the literature have shown that these traditional methods tend to be less effective when compared to newer techniques, particularly when dealing with complex and high-dimensional data like that found in phishing detection tasks. DT, for example, are prone to overfitting, especially when applied to large datasets without proper regularisation. Similarly, while SVM can perform well in certain scenarios, its computational complexity and sensitivity to parameter tuning make it less suited for large-scale, real-time applications. By contrast, ensemble methods like RF and deep learning approaches such as CNN-LSTM have shown superior performance in recent studies due to their ability to handle more complex patterns in the data.
To evaluate the scalability of our approach, we analysed the computational time for datasets of up to 10,000 samples. Figure 5 shows the processing time for the different dataset sizes.
As shown in Figure 5, the processing time increases with dataset size, but the relationship is approximately linear, suggesting reasonable scalability. However, for extremely large datasets (>1 million samples), further optimisations or distributed computing approaches may be necessary to maintain real-time performance.
Performance Comparison and Insights
The proposed approach (K-Means integrated with the CfsSubsetEval attribute evaluator) has achieved better performance than the kernel K-Means model in terms of error rate and accuracy. Our approach has shown improved phishing detection performance, with an error rate of 10.8%, 13.38%, and 13.75%, and an accuracy of 89.2%, 86.62%, and 86.25% on samples of 2000, 7000, and 10,000, respectively. In contrast, the kernel K-Means model had an error rate of 48.5%, 47.07%, and 43.02%, with an accuracy of 51.5%, 52.93%, and 56.98%. These results demonstrate the enhancement that the proposed method achieves in phishing detection.
It is important to note that while our proposed K-Means approach shows a slight increase in error rate with larger sample sizes (from 10.8% at 2000 samples to 13.75% at 10,000 samples), it still maintains a lower error rate compared to the kernel K-Means method. This slight increase could be attributed to the greater complexity and diversity of patterns present in larger datasets, which may introduce more challenging cases for classification.
Conversely, the kernel K-Means approach shows a decreasing error rate as the sample size increases (from 48.5% at 2000 samples to 43.02% at 10,000 samples). This trend suggests that the kernel method may benefit from larger datasets, possibly due to its ability to capture more complex relationships in higher-dimensional spaces. However, despite this improvement, its performance remains substantially inferior to our proposed method.
The enhanced performance of our proposed approach can be attributed to the effective feature selection achieved through the CfsSubsetEval attribute evaluator. By identifying the most relevant features for distinguishing between phishing and legitimate websites, our method maintains high accuracy, even with increasing dataset sizes. The success of this feature selection process highlights the potential for further research into advanced feature engineering techniques specifically tailored for phishing detection.
Another promising aspect of our approach is its potential for real-time detection. The combination of high accuracy and the relatively low computational complexity inherent to K-Means Clustering suggests that our method could be adapted for real-time phishing detection systems. This capability could enhance cybersecurity measures by allowing for immediate identification and mitigation of phishing threats as they emerge. Real-time detection is particularly valuable in the context of ever-evolving phishing techniques, where rapid response can prevent widespread damage.
Finally, by achieving high accuracy through an unsupervised learning technique, our method reduces the reliance on constantly updated labelled datasets. This characteristic is particularly valuable in the rapidly changing landscape of cybersecurity, where obtaining up-to-date labelled data can be both time-consuming and expensive. The reduced need for labelled data could lower the overall cost and effort required to maintain effective phishing detection systems, making robust cybersecurity more accessible to a wider range of organisations.
5. Limitations of the Proposed Approach
While our proposed approach demonstrates significant improvements in phishing detection, it is important to acknowledge the following limitations and areas for future research.
Firstly, the dataset used in this study, comprising 2000, 7000, and 10,000 samples, may not fully represent the diversity of phishing websites in real-world scenarios. Although this limitation has been partially mitigated by our comprehensive evaluation against state-of-the-art methods and the inclusion of multiple performance metrics (e.g., precision, recall, F1-score), a larger and more diverse dataset could further enhance generalisability. Future work should focus on expanding the dataset to include real-time data from a wider variety of sources and phishing techniques.
Secondly, the current evaluation could be enriched by testing against additional algorithms and hybrid approaches. This would show how our model performs relative to different architectures and would support more robust conclusions about its effectiveness in phishing detection.
Thirdly, the model’s performance remains dependent on the quality and relevance of the selected features. While CfsSubsetEval was effective in identifying critical features in our experiments, more advanced feature selection or deep learning-based feature extraction methods could be explored to capture complex feature interactions and improve detection accuracy for sophisticated phishing attacks.
One key area that has not been fully explored is the real-time implementation of the proposed method. Although the model demonstrates strong performance on offline datasets, its suitability for live, real-time phishing detection—where latency is a critical factor—remains to be tested. Further research is needed to optimise the model for real-time environments, potentially through techniques such as parallel processing or incremental learning to allow for faster classification without sacrificing accuracy.
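As an illustration of what incremental learning could look like for a clustering-based detector, the sketch below shows a sequential (online) K-Means update, in which each incoming sample moves its nearest centroid by a shrinking step instead of triggering a full re-clustering. This is a generic textbook technique and not part of the evaluated method; the class name `OnlineKMeans`, the seed centroids, and the two-dimensional feature vectors are hypothetical.

```python
def nearest(centroids, x):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((m - xi) ** 2 for m, xi in zip(centroids[c], x)),
    )


class OnlineKMeans:
    """Sequential K-Means: one centroid update per incoming sample."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]
        # Assumption: each seed centroid counts as one observed sample.
        self.counts = [1] * len(self.centroids)

    def partial_fit(self, x):
        """Assign x to its nearest centroid, then move that centroid towards x."""
        c = nearest(self.centroids, x)
        self.counts[c] += 1
        step = 1.0 / self.counts[c]  # running-mean step size
        self.centroids[c] = [m + step * (xi - m) for m, xi in zip(self.centroids[c], x)]
        return c


# Hypothetical stream of two-feature samples.
model = OnlineKMeans([[0.0, 0.0], [1.0, 1.0]])
first = model.partial_fit([0.2, 0.0])   # lands in cluster 0
second = model.partial_fit([0.9, 1.0])  # lands in cluster 1
print(first, second, model.centroids)
```

Each update costs time proportional to the number of clusters times the number of features, independent of how many samples have already been seen, which is why updates of this kind are a natural candidate when latency matters. Whether accuracy is preserved under such updates would still need to be tested.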
Scalability remains a concern, especially when applying the K-Means Clustering algorithm to larger datasets in real-time scenarios. Although we have demonstrated that the processing time scales linearly with increasing dataset sizes, further research into distributed computing or cloud-based implementations could help maintain performance when handling larger datasets.
Finally, the rapidly evolving nature of phishing techniques, including more sophisticated spear phishing and whaling attacks, requires ongoing attention. Our current model has not been fully tested against all types of phishing attacks. Future work should investigate how well the model adapts to emerging phishing techniques, potentially through adversarial training or data augmentation, ensuring that it remains effective as phishing strategies evolve.
6. Conclusions
Social engineering, particularly phishing, remains a prevalent threat in cybersecurity. These attacks exploit human psychology, often impersonating trusted entities to create a sense of urgency and trigger impulsive actions. The effectiveness of social engineering lies in its ability to bypass technical security measures by targeting the human element, making it a cost-effective method for attackers.
This paper proposed an enhanced K-Means Clustering algorithm by incorporating the CfsSubsetEval attribute evaluator for phishing detection. Our simulation results demonstrate that the proposed approach achieves high accuracy rates: 89.2% for 2000 samples, 86.62% for 7000 samples, and 86.25% for 10,000 samples. In comparison, the kernel K-Means algorithm achieved lower accuracy rates: 51.5% for 2000 samples, 52.93% for 7000 samples, and 56.98% for 10,000 samples.
While our proposed method shows promising results, it is important to note that we cannot definitively claim it as the most accurate approach in the field. Our comparisons were limited to the kernel K-Means algorithm within our specific experimental setup. Further comprehensive comparisons with other state-of-the-art methods would be necessary to establish a broader performance benchmark. The difference in accuracy between our proposed method and kernel K-Means highlights the potential of integrating feature selection techniques with clustering algorithms for phishing detection. However, it is worth noting that kernel K-Means showed improvement with larger sample sizes, indicating its potential for handling complex data relationships. This observation suggests that different approaches may have distinct strengths depending on the dataset characteristics and size.
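The trend in the accuracy figures reported above can be made explicit with a few lines of arithmetic. The snippet below uses only the reported percentages; the variable names are illustrative.

```python
sample_sizes = [2000, 7000, 10000]
proposed = [89.2, 86.62, 86.25]       # reported accuracy (%), proposed approach
kernel_kmeans = [51.5, 52.93, 56.98]  # reported accuracy (%), kernel K-Means

# Accuracy gap in percentage points at each dataset size.
gaps = [round(p - k, 2) for p, k in zip(proposed, kernel_kmeans)]

for n, gap in zip(sample_sizes, gaps):
    print(f"{n} samples: gap of {gap} percentage points")
```

The gap narrows from 37.7 to 33.69 to 29.27 percentage points as the sample size increases, which is consistent with the observation that kernel K-Means benefits from larger samples while the proposed approach remains well ahead at every size tested.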
Our study opens several avenues for future work. Firstly, identifying more efficient attribute evaluators could further enhance K-Means’ ability to detect phishing attacks, potentially improving accuracy while reducing error rates. Secondly, expanding the comparison to include other machine learning and deep learning approaches would provide a more comprehensive understanding of our method’s performance in the broader context of phishing detection techniques. From a practical standpoint, integrating this approach as a browser extension could offer real-time protection for users. Future research should also focus on the adaptability of the model to evolving phishing techniques and its performance in real-world, dynamic environments.