Improved Phishing Attack Detection with Machine Learning: A Comprehensive Evaluation of Classifiers and Features
Abstract
1. Introduction
- Introducing a new publicly available dataset for phishing attack detection,
- Comparing the performance of the features and finding the best feature subset for phishing attack detection in terms of different performance metrics,
- Comparing the performance of the classification algorithms and finding the best classifier for phishing attack detection in terms of different performance metrics.
2. Related Work
3. Materials and Methods
3.1. Dataset and Features
3.2. Classifiers
3.3. Performance Metrics
4. Results and Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Asiri, S.; Xiao, Y.; Alzahrani, S.; Li, S.; Li, T. A survey of intelligent detection designs of HTML URL phishing attacks. IEEE Access 2023, 11, 6421–6443.
2. APWG Anti-Phishing Working Group. Available online: https://apwg.org (accessed on 10 October 2023).
3. APWG Phishing Activity Trends Report Q3. 2022. Available online: https://apwg.org/trendsreports (accessed on 10 October 2023).
4. Tinubu, C.O.; Falana, O.J.; Oluwumi, E.O.; Sodiya, A.S.; Rufai, S.A. PHISHGEM: A mobile game-based learning for phishing awareness. J. Cyber Secur. Technol. 2023, 7, 134–153.
5. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
6. Zhou, Z.H. Machine Learning; Springer Nature: Berlin/Heidelberg, Germany, 2021.
7. Khonji, M.; Iraqi, Y.; Jones, A. Phishing detection: A literature survey. IEEE Commun. Surv. Tutor. 2013, 15, 2091–2121.
8. Mohammad, R.M.; Thabtah, F.; Mccluskey, L. Tutorial and critical analysis of phishing websites methods. Comput. Sci. Rev. 2015, 17, 1–24.
9. Google Safe Browsing API. Available online: https://developers.google.com/safe-browsing/v4 (accessed on 10 October 2023).
10. Netcraft Anti-Phishing Toolbar. Available online: https://www.netcraft.com/apps (accessed on 10 October 2023).
11. Whittaker, C.; Ryner, B.; Nazif, M. Large-scale Automatic Classification of Phishing Pages. In Proceedings of the 17th Network & Distributed System Security Symposium, San Diego, CA, USA, 28 February–3 March 2010; pp. 1–14.
12. Jain, A.K.; Gupta, B.B. A survey of phishing attack techniques, defence mechanisms and open research challenges. Enterp. Inf. Syst. 2022, 16, 527–565.
13. Qabajeh, I.; Thabtah, F.; Chiclana, F. A recent review of conventional vs. automated cyber-security anti-phishing techniques. Comput. Sci. Rev. 2018, 29, 44–55.
14. Moore, T.; Clayton, R.; Stern, H. Temporal Correlations between Spam and Phishing Websites. In Proceedings of the 2nd USENIX Workshop on Large-Scale Exploits and Emergent Threats, Boston, MA, USA, 21 April 2009; pp. 1–8.
15. Thomas, K.; Grier, C.; Ma, J.; Paxson, V.; Song, D. Design and Evaluation of a Real-Time URL Spam Filtering Service. In Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA, USA, 22–25 May 2011; pp. 447–462.
16. Gangavarapu, T.; Jaidhar, C.D.; Chanduka, B. Applicability of machine learning in spam and phishing email filtering: Review and approaches. Artif. Intell. Rev. 2020, 53, 5019–5081.
17. Zhang, Y.; Hong, J.; Cranor, L. CANTINA: A Content Based Approach to Detecting Phishing Web Sites. In Proceedings of the 16th International Conference on World Wide Web, Banff, AB, Canada, 8–12 May 2007; pp. 639–648.
18. Wardman, B.; Stallings, T.; Warner, G.; Skjellum, A. High-Performance Content Based Phishing Attack Detection. In Proceedings of the eCrime Researchers Summit, San Diego, CA, USA, 7–9 November 2011; pp. 1–9.
19. Zhang, H.; Liu, G.; Chow, T.; Wenyin, L. Textual and visual content-based anti-phishing: A Bayesian approach. IEEE Trans. Neural Netw. 2011, 22, 1532–1546.
20. Li, Y.; Xiao, R.; Feng, J.; Zhao, L. A semi-supervised learning approach for detection of phishing webpages. Optik 2013, 124, 6027–6033.
21. Mao, J.; Tian, W.; Li, P.; Wei, T.; Liang, Z. Phishing-alarm: Robust and efficient phishing detection via page component similarity. IEEE Access 2017, 5, 17020–17030.
22. Mohammad, R.M.; Thabtah, F.; Mccluskey, L. An Assessment of Features Related to Phishing Websites Using an Automated Technique. In Proceedings of the IEEE International Conference for Internet Technology and Secured Transactions, London, UK, 10–12 December 2012; pp. 492–497.
23. Mohammad, R.M.; Thabtah, F.; Mccluskey, L. Intelligent rule-based phishing websites classification. IET Inf. Secur. 2014, 8, 153–160.
24. Mohammad, R.M.; Thabtah, F.; Mccluskey, L. Predicting phishing websites based on self-structuring neural network. Neural Comput. Appl. 2014, 25, 443–458.
25. Basnet, R.B.; Sung, A.H.; Liu, Q. Rule-Based Phishing Attack Detection. In Proceedings of the International Conference on Security and Management, The World Congress in Computer Science, Computer Engineering and Applied Computing, London, UK, 18–21 July 2011.
26. Fette, I.; Sadeh, N.; Tomasic, A. Learning to Detect Phishing Emails. In Proceedings of the 16th ACM International Conference on World Wide Web, Banff, AB, Canada, 8–12 May 2007; pp. 649–656.
27. Aburrous, M.R.; Hossain, A.; Dahal, K.; Thabatah, F. Modelling Intelligent Phishing Detection System for E-banking Using Fuzzy Data Mining. In Proceedings of the IEEE International Conference on CyberWorlds, Washington, DC, USA, 7–11 September 2009; pp. 265–272.
28. Aburrous, M.R.; Hossain, A.; Dahal, K.; Thabatah, F. Intelligent phishing detection system for e-banking using fuzzy data mining. Expert Syst. Appl. 2010, 37, 7913–7921.
29. Chiew, K.L.; Tan, C.L.; Wong, K.; Yong, K.S.; Tiong, W.K. A new hybrid ensemble feature selection framework for machine learning-based phishing detection system. Inf. Sci. 2019, 484, 153–166.
30. Sahingoz, O.K.; Buber, E.; Demir, O.; Diri, B. Machine learning based phishing detection from URLs. Expert Syst. Appl. 2019, 117, 345–357.
31. Xiao, X.; Zhang, D.; Hu, G.; Jiang, Y.; Xia, S. CNN–MHSA: A convolutional neural network and multi-head self-attention combined approach for detecting phishing websites. Neural Netw. 2020, 125, 303–312.
32. Sonowal, G.; Kuppusamy, K.S. PhiDMA—A phishing detection model with multi-filter approach. J. King Saud Univ.-Comput. Inf. Sci. 2020, 32, 99–112.
33. Almomani, A.; Alauthman, M.; Shatnawi, M.T.; Alweshah, M.; Alrosan, A.; Alomoush, W.; Gupta, B.B. Phishing website detection with semantic features based on machine learning classifiers: A comparative study. Int. J. Semant. Web Inf. Syst. 2022, 18, 1–24.
34. Bahaghighat, M.; Ghasemi, M.; Ozen, F. A high-accuracy phishing website detection method based on machine learning. J. Inf. Secur. Appl. 2023, 77, 103553.
35. Adebowale, M.A.; Lwin, K.T.; Hossain, M.A. Intelligent phishing detection scheme using deep learning algorithms. J. Enterp. Inf. Manag. 2023, 36, 747–766.
36. Basit, A.; Zafar, M.; Liu, X.; Javed, A.R.; Jalil, Z.; Kifayat, K. A comprehensive survey of AI-enabled phishing attacks detection techniques. Telecommun. Syst. 2021, 76, 139–154.
37. Abdillah, R.; Shukur, Z.; Mohd, M.; Murah, T.M.Z. Phishing classification techniques: A systematic literature review. IEEE Access 2022, 10, 41574–41591.
38. Safi, A.; Singh, S. A systematic literature review on phishing website detection techniques. J. King Saud Univ.-Comput. Inf. Sci. 2023, 5, 590–611.
39. Kapan, S. Analysis of the Features Used in Detecting Phishing Attacks by Machine Learning. Master’s Thesis, Eskisehir Osmangazi University, Eskisehir, Türkiye, 2021.
40. Kirda, E. Getting Under Alexa’s Umbrella: Infiltration Attacks Against Internet Top Domain Lists. In Proceedings of the 22nd International Information Security Conference, New York, NY, USA, 16–18 September 2019.
41. PhishTank. Available online: https://www.phishtank.com (accessed on 10 October 2023).
42. Selenium Web Driver. Available online: https://www.selenium.dev (accessed on 10 October 2023).
43. Ratcliff, J.W.; Metzener, D. Pattern matching: The gestalt approach. Dr. Dobb’s J. 1988, 13, 46.
44. Bal, S.; Sora Gunal, E. The impact of features and preprocessing on automatic text summarization. Rom. J. Inf. Sci. Technol. 2022, 25, 117–132.
45. Scikit-Learn Library. Available online: https://scikit-learn.org/stable/index.html (accessed on 10 October 2023).
46. UCI Machine Learning Repository, Phishing Websites Data Set. 2015. Available online: https://archive.ics.uci.edu/ml/datasets/phishing+websites (accessed on 10 October 2023).
No | Feature | Feature Group | Description |
---|---|---|---|
1 | Domain name similarity | URL | Similarity (based on Ratcliff-Obershelp’s algorithm [43]) between the domain name of the visited website and the URL domain name obtained from Alexa or PhishTank |
2 | URL length | URL | Number of all characters in a URL |
3 | HTTP protocol | URL | HTTP protocol type: standard (0) or secure (1) |
4 | # ‘.’ symbols | URL | Number of dot symbols in a URL |
5 | # ‘/’ symbols | URL | Number of slash symbols in a URL |
6 | # ‘//’ symbols | URL | Number of double slash symbols in a URL |
7 | # ‘-’ symbols | URL | Number of dash symbols in a URL |
8 | # ‘_’ symbols | URL | Number of underscore symbols in a URL |
9 | # ‘=’ symbols | URL | Number of equal symbols in a URL |
10 | # ‘(’ and ‘)’ symbols | URL | Number of parenthesis symbols in a URL |
11 | # ‘{’ and ‘}’ symbols | URL | Number of curly bracket symbols in a URL |
12 | # ‘[’ and ‘]’ symbols | URL | Number of square bracket symbols in a URL |
13 | # ‘<’ and ‘>’ symbols | URL | Number of less than and greater than symbols in a URL |
14 | # ‘~’ symbols | URL | Number of tilde symbols in a URL |
15 | # ‘*’ symbols | URL | Number of asterisk symbols in a URL |
16 | # ‘+’ symbols | URL | Number of plus symbols in a URL |
17 | Inclusion of ‘@’ symbol | URL | URL includes an at symbol (1) or not (0) |
18 | Inclusion of IP address | URL | URL includes an IP address (1) or not (0) |
19 | # <a> tags | HTML | Number of <a> tags in a website, used to create hyperlinks or anchor links, which is an essential element for linking one webpage to another, linking to different sections within the same page, or linking to external resources |
20 | # <input> tags | HTML | Number of <input> tags in a website, used to create various types of interactive form elements |
21 | # <button> tags | HTML | Number of <button> tags in a website, used to create a clickable button for triggering actions, submitting forms, or performing other interactive functions |
22 | # <link> tags | HTML | Number of <link> tags in a website, used to link external resources, such as stylesheets, icons, and other documents, to an HTML document |
23 | # <iFrame> tags | HTML | Number of <iFrame> tags in a website, used to embed an external resource, such as another HTML document, a video, or a web page, within the current document |
24 | HTTP response history | HTTP | HTTP response code returned by a server to indicate the outcome of a client’s request made to the server |
25 | Redirect | HTTP | Website redirects to another site (1) or not (0), detected using HTTP redirection response codes |
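
The URL and HTML features listed in the table above can be illustrated with a short Python sketch. It is not the authors' extraction code: the function names, dictionary keys, and the `reference_domain` argument are hypothetical, the IP check is simplified to a dotted-quad IPv4 host, and the similarity score uses the standard library's `difflib.SequenceMatcher`, which the Python documentation describes as closely related to the Ratcliff-Obershelp gestalt approach [43] rather than identical to it. The HTTP features (24 and 25) need a live request and are left out.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical feature names mapped to the characters counted for each
# symbol feature (features 4, 5 and 7-16 in the table).
SYMBOL_GROUPS = {
    "dots": ".", "slashes": "/", "dashes": "-", "underscores": "_",
    "equals": "=", "parentheses": "()", "curly_brackets": "{}",
    "square_brackets": "[]", "angle_brackets": "<>", "tildes": "~",
    "asterisks": "*", "pluses": "+",
}

# Simplifying assumption: "IP address" means the host is a dotted-quad IPv4.
IPV4_HOST = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")


def url_features(url, reference_domain):
    """Return a dict of URL-group features for one URL."""
    host = urlparse(url).hostname or ""
    features = {
        # Feature 1: similarity in [0, 1] between the two domain names.
        "domain_similarity": SequenceMatcher(None, host, reference_domain).ratio(),
        "url_length": len(url),                                # feature 2
        "https": int(url.lower().startswith("https://")),      # feature 3
        "double_slashes": url.count("//"),                     # feature 6
        "has_at": int("@" in url),                             # feature 17
        "has_ip": int(bool(IPV4_HOST.match(host))),            # feature 18
    }
    for name, chars in SYMBOL_GROUPS.items():
        features[name] = sum(url.count(c) for c in chars)
    return features


class TagCounter(HTMLParser):
    """Count the five tag types used as HTML-group features (19-23)."""

    TAGS = ("a", "input", "button", "link", "iframe")

    def __init__(self):
        super().__init__()
        self.counts = {tag: 0 for tag in self.TAGS}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <iFrame> is counted as iframe.
        if tag in self.counts:
            self.counts[tag] += 1


def html_features(html):
    """Return tag counts for a page given as an HTML string."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


if __name__ == "__main__":
    print(url_features("http://paypa1.com/a-b/login?id=1", "paypal.com"))
    print(html_features('<a href="#">x</a><input name="u"><button>Go</button>'))
```

Counting `/` here includes the two characters of each `//`, which follows the table's wording ("number of slash symbols in a URL"); a pipeline that wants the two counts to be disjoint would have to subtract them explicitly.
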
Classifier | Feature Set | # Features | Time | Accuracy | Precision | Recall | F1 |
---|---|---|---|---|---|---|---|
SVM | URL | 18 | 0.03 | 0.91 | 0.91 | 0.91 | 0.91 |
SVM | URL + HTML | 23 | 0.03 | 0.94 | 0.93 | 0.95 | 0.94 |
SVM | URL + HTTP | 20 | 0.03 | 0.97 | 0.96 | 0.99 | 0.97 |
SVM | HTML | 5 | 0.03 | 0.85 | 0.81 | 0.91 | 0.86 |
SVM | HTML + HTTP | 7 | 0.03 | 0.92 | 0.88 | 0.97 | 0.93 |
SVM | HTTP | 2 | 0.02 | 0.93 | 0.88 | 0.99 | 0.93 |
SVM | URL + HTML + HTTP | 25 | 0.05 | 0.98 | 0.97 | 0.99 | 0.98 |
SGD | URL | 18 | 0.03 | 0.84 | 0.86 | 0.81 | 0.84 |
SGD | URL + HTML | 23 | 0.03 | 0.92 | 0.91 | 0.93 | 0.92 |
SGD | URL + HTTP | 20 | 0.02 | 0.97 | 0.97 | 0.97 | 0.97 |
SGD | HTML | 5 | 0.02 | 0.84 | 0.84 | 0.85 | 0.84 |
SGD | HTML + HTTP | 7 | 0.03 | 0.91 | 0.88 | 0.95 | 0.92 |
SGD | HTTP | 2 | 0.03 | 0.92 | 0.87 | 0.99 | 0.93 |
SGD | URL + HTML + HTTP | 25 | 0.03 | 0.98 | 0.98 | 0.97 | 0.98 |
NB | URL | 18 | 0.01 | 0.67 | 0.93 | 0.37 | 0.53 |
NB | URL + HTML | 23 | 0.03 | 0.68 | 0.95 | 0.37 | 0.54 |
NB | URL + HTTP | 20 | 0.02 | 0.67 | 0.93 | 0.37 | 0.53 |
NB | HTML | 5 | 0.02 | 0.75 | 0.68 | 0.94 | 0.79 |
NB | HTML + HTTP | 7 | 0.02 | 0.81 | 0.75 | 0.95 | 0.84 |
NB | HTTP | 2 | 0.03 | 0.92 | 0.87 | 0.99 | 0.93 |
NB | URL + HTML + HTTP | 25 | 0.03 | 0.68 | 0.95 | 0.38 | 0.54 |
MLP | URL | 18 | 0.61 | 0.95 | 0.92 | 0.97 | 0.95 |
MLP | URL + HTML | 23 | 0.08 | 0.93 | 0.90 | 0.96 | 0.93 |
MLP | URL + HTTP | 20 | 0.12 | 0.98 | 0.97 | 0.99 | 0.98 |
MLP | HTML | 5 | 0.17 | 0.84 | 0.83 | 0.87 | 0.85 |
MLP | HTML + HTTP | 7 | 1.00 | 0.91 | 0.88 | 0.95 | 0.92 |
MLP | HTTP | 2 | 0.08 | 0.92 | 0.87 | 0.99 | 0.93 |
MLP | URL + HTML + HTTP | 25 | 0.06 | 0.96 | 0.97 | 0.95 | 0.96 |
k-NN | URL | 18 | 0.05 | 0.93 | 0.89 | 0.98 | 0.93 |
k-NN | URL + HTML | 23 | 0.05 | 0.92 | 0.91 | 0.93 | 0.92 |
k-NN | URL + HTTP | 20 | 0.05 | 0.97 | 0.94 | 0.99 | 0.97 |
k-NN | HTML | 5 | 0.05 | 0.85 | 0.85 | 0.85 | 0.85 |
k-NN | HTML + HTTP | 7 | 0.05 | 0.93 | 0.91 | 0.96 | 0.94 |
k-NN | HTTP | 2 | 0.05 | 0.92 | 0.87 | 0.98 | 0.92 |
k-NN | URL + HTML + HTTP | 25 | 0.05 | 0.97 | 0.95 | 0.98 | 0.97 |
DT | URL | 18 | 0.02 | 0.92 | 0.93 | 0.92 | 0.92 |
DT | URL + HTML | 23 | 0.02 | 0.91 | 0.90 | 0.92 | 0.91 |
DT | URL + HTTP | 20 | 0.03 | 0.99 | 0.99 | 0.98 | 0.99 |
DT | HTML | 5 | 0.02 | 0.81 | 0.80 | 0.83 | 0.82 |
DT | HTML + HTTP | 7 | 0.02 | 0.91 | 0.90 | 0.92 | 0.91 |
DT | HTTP | 2 | 0.03 | 0.93 | 0.89 | 0.99 | 0.94 |
DT | URL + HTML + HTTP | 25 | 0.03 | 0.96 | 0.98 | 0.94 | 0.96 |
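
The four score columns in the table above are related in a way that a few lines of Python make concrete. The sketch below assumes the standard definitions of accuracy, precision, recall, and F1 (the harmonic mean of precision and recall); the function names and the confusion-matrix counts in the usage lines are hypothetical and are not taken from the paper's experiments.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Compute the four scores from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives, with "phishing" treated as the positive class
    (an assumption made for this sketch).
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


if __name__ == "__main__":
    # Consistency check against a reported row (NB with URL features):
    # precision 0.93 and recall 0.37 give an F1 that rounds to 0.53.
    print(round(f1_from(0.93, 0.37), 2))
    # Made-up counts, for illustration only.
    print(classification_metrics(tp=90, fp=10, fn=5, tn=95))
```

The NB rows show why F1 is reported alongside accuracy: a precision of 0.93 paired with a recall of 0.37 yields an F1 of only 0.53, so the low recall dominates the combined score.
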
Classifier | Dataset | # Features | Accuracy | Precision | Recall | F1 |
---|---|---|---|---|---|---|
SVM | [29] | 48 | 0.91 | 0.90 | 0.91 | 0.91 |
SVM | [46] | 30 | 0.87 | 0.87 | 0.87 | 0.87 |
SVM | Our dataset | 25 | 0.98 | 0.97 | 0.99 | 0.98 |
SGD | [29] | 48 | 0.89 | 0.90 | 0.88 | 0.89 |
SGD | [46] | 30 | 0.84 | 0.90 | 0.75 | 0.82 |
SGD | Our dataset | 25 | 0.98 | 0.98 | 0.97 | 0.98 |
NB | [29] | 48 | 0.81 | 0.99 | 0.62 | 0.76 |
NB | [46] | 30 | 0.68 | 0.97 | 0.38 | 0.55 |
NB | Our dataset | 25 | 0.68 | 0.99 | 0.38 | 0.54 |
MLP | [29] | 48 | 0.86 | 0.86 | 0.86 | 0.86 |
MLP | [46] | 30 | 0.84 | 0.84 | 0.83 | 0.84 |
MLP | Our dataset | 25 | 0.96 | 0.97 | 0.95 | 0.96 |
k-NN | [29] | 48 | 0.85 | 0.85 | 0.86 | 0.85 |
k-NN | [46] | 30 | 0.88 | 0.88 | 0.89 | 0.88 |
k-NN | Our dataset | 25 | 0.97 | 0.99 | 0.98 | 0.97 |
DT | [29] | 48 | 0.93 | 0.93 | 0.93 | 0.93 |
DT | [46] | 30 | 0.90 | 0.95 | 0.84 | 0.89 |
DT | Our dataset | 25 | 0.96 | 0.98 | 0.94 | 0.96 |
Study | Dataset Size | Classification Method | Accuracy |
---|---|---|---|
Proposed work | 500 legitimate, 500 phishing | DT | 0.99 |
[17] | 100 legitimate, 100 phishing | TF-IDF | 0.95 |
[24] | 600 legitimate, 800 phishing | Neural network | 0.92 |
[25] | 24086 legitimate, 16797 phishing | Rule-based | 0.99 |
[26] | 6950 legitimate, 860 phishing | PILFER | 0.99 |
[29] | 5000 legitimate, 5000 phishing | Random forest | 0.94 |
[32] | 995 legitimate, 667 phishing | Multi-layer filters | 0.92 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Kapan, S.; Sora Gunal, E. Improved Phishing Attack Detection with Machine Learning: A Comprehensive Evaluation of Classifiers and Features. Appl. Sci. 2023, 13, 13269. https://doi.org/10.3390/app132413269