Detecting Phishing URLs Based on a Deep Learning Approach to Prevent Cyber-Attacks
Abstract
1. Introduction
- This study develops and presents a user-friendly, end-to-end web-based system that classifies a URL as phishing or legitimate;
- It proposes a deep learning model based on a 1D convolutional neural network (1D CNN) to detect URL-based phishing attacks;
- It evaluates the proposed system on datasets obtained from PhishTank, UNB, and Alexa;
- It analyzes existing phishing detection methods in detail, highlighting their limitations and the advantages of the proposed model.
2. Literature Review
2.1. Traditional Methods
2.1.1. Whitelist Approach
2.1.2. Blacklist Approach
2.1.3. Content List Approach
2.1.4. Visual-Similarity-Based Methods
2.1.5. URL-Based Methods
Machine Learning-Based Methods
Deep Learning-Based Methods
3. Methodology
3.1. Deep Learning
3.2. CNN and 1D CNN
3.3. Proposed Architecture
- Data collection;
- Data preprocessing;
- Classification using the proposed deep learning algorithms;
- Web application.
3.3.1. Data Collection
3.3.2. Preprocessing
- Data Cleaning: The URLs are cleaned so that only the important, information-bearing features remain;
- Text Tokenization: The remaining text is split into tokens;
- Text Stemming: Different forms of a word are reduced to a single form, or stem, to simplify the analysis of the data;
- Data Padding: Padding keeps all input vectors the same length.
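The four steps above could be sketched roughly as follows. This is only an illustration: the source does not specify its cleaning rules, tokenizer, stemmer, or vocabulary, so the scheme stripping, the split on non-alphanumeric characters, the crude `stem` helper, and all function names are assumptions. The padded length of 120 follows the example URL length mentioned in Section 3.4.

```python
import re

MAX_LEN = 120  # assumed padded length (120 is the example URL length in Section 3.4)
PAD_ID = 0     # index reserved for padding (assumption)
UNK_ID = 1     # index for tokens not in the vocabulary (assumption)


def clean_url(url):
    """Lower-case the URL and drop the scheme and a leading 'www.' (assumed rules)."""
    url = url.strip().lower()
    url = re.sub(r"^[a-z][a-z0-9+.\-]*://", "", url)
    return re.sub(r"^www\.", "", url)


def tokenize(url):
    """Split on any run of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", url) if tok]


def stem(token):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(url):
    """Cleaning, tokenization, and stemming in sequence."""
    return [stem(tok) for tok in tokenize(clean_url(url))]


def build_vocab(token_lists):
    """Assign each distinct token an integer id, starting after the reserved ids."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 2)
    return vocab


def encode(tokens, vocab, max_len=MAX_LEN):
    """Map tokens to ids, truncate to max_len, and right-pad with PAD_ID."""
    ids = [vocab.get(tok, UNK_ID) for tok in tokens][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


tokens = preprocess("HTTP://www.Secure-Login.example.com/accounts/verify")
vocab = build_vocab([tokens])
ids = encode(tokens, vocab)
```

Here `tokens` holds the stemmed tokens of one made-up URL and `ids` is a fixed-length integer vector that an embedding layer could consume.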
3.3.3. Deep Learning Model
3.3.4. Web Application
3.4. 1D CNN Architecture Diagram
- Input Data: Clean URLs are provided as input data for preprocessing;
- Preprocessing: Stop words are removed from the URLs, and tokenization is performed as in Figure 4;
- Embedding Layer: The data are passed to the embedding layer, whose dimensions are set according to the length of the URL. For example, if the URL length in our model is 120, the embedding layer is given 120 dimensions;
- Convolutional Blocks: After the embedding layer, the data enter the convolutional block, which contains seven convolutional layers built from 1D-CNN blocks with a ReLU activation function. Feature mapping and feature extraction of the URL are performed here. The inputs pass through the blocks one after another to filter out the most important features of the URL;
- Global Max Pooling: After the convolutional layers, global max pooling is applied to the URL's features: the maximum value of each feature map across the input length is selected. The result is then passed to a deep neural architecture, where dropout is applied;
- Dropout: This is used to prevent the model from overfitting;
- Sigmoid: In this step, based on the extracted features, the URL is classified as phishing or legitimate.
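To make the data flow through these stages concrete, the following is a toy forward pass in plain Python with random, untrained weights. It is a sketch under stated assumptions and not the authors' implementation: the vocabulary size, embedding width, filter count, kernel size, and the use of two convolutional blocks instead of seven are all chosen only to keep the example short, and treating a score of 0.5 or more as "phishing" is likewise an assumption. The dropout rate of 0.2 comes from the training settings table.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained toy weights are reproducible


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def embed(token_ids, table):
    """Look up one embedding vector per token id."""
    return [table[i] for i in token_ids]


def conv1d_relu(seq, kernels, bias):
    """Valid 1D convolution along the sequence axis, followed by ReLU.

    seq is a list of T vectors; each kernel is a K x C_in weight matrix.
    """
    k = len(kernels[0])
    out = []
    for t in range(len(seq) - k + 1):
        row = []
        for f, kern in enumerate(kernels):
            s = bias[f]
            for j in range(k):
                s += sum(w * x for w, x in zip(kern[j], seq[t + j]))
            row.append(max(0.0, s))  # ReLU
        out.append(row)
    return out


def global_max_pool(seq):
    """Keep the maximum of each channel across all positions."""
    return [max(channel) for channel in zip(*seq)]


def dropout(vec, rate, training):
    """Inverted dropout; a no-op at inference time."""
    if not training:
        return list(vec)
    keep = 1.0 - rate
    return [v / keep if random.random() < keep else 0.0 for v in vec]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy sizes (all assumed): vocabulary, embedding width, filters, kernel, blocks.
VOCAB, EMB_DIM, FILTERS, KERNEL, BLOCKS = 50, 8, 4, 3, 2

emb_table = rand_matrix(VOCAB, EMB_DIM)
conv_blocks = []
in_channels = EMB_DIM
for _ in range(BLOCKS):
    kernels = [rand_matrix(KERNEL, in_channels) for _ in range(FILTERS)]
    conv_blocks.append((kernels, [0.0] * FILTERS))
    in_channels = FILTERS
out_weights = [random.uniform(-0.1, 0.1) for _ in range(FILTERS)]


def predict(token_ids, training=False):
    x = embed(token_ids, emb_table)
    for kernels, bias in conv_blocks:
        x = conv1d_relu(x, kernels, bias)
    pooled = global_max_pool(x)
    pooled = dropout(pooled, rate=0.2, training=training)
    return sigmoid(sum(w * v for w, v in zip(out_weights, pooled)))


url_ids = [2, 3, 4, 5, 6, 7] + [0] * 114  # a padded URL of length 120
score = predict(url_ids)
label = "phishing" if score >= 0.5 else "legitimate"
```

Because the weights are random, `score` carries no meaning here; the point is only the order of operations and how each stage changes the shape of the data (120 positions shrink to 118 and then 116 before pooling collapses them to one value per filter).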
4. Datasets
- Dataset preprocessing;
- Dataset splitting.
4.1. Data Preprocessing
4.2. Data Splitting
5. Performance Evaluation and Comparison with Existing Approaches
- Accuracy: The number of correctly classified instances divided by the total number of instances. With TP, TN, FP, and FN denoting true positives, true negatives, false positives, and false negatives, it is defined as $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$;
- Recall: The proportion of actual positives that the model identifies correctly, defined as $\text{Recall} = \frac{TP}{TP + FN}$;
- Precision: The proportion of the model's positive predictions that are true positives, defined as $\text{Precision} = \frac{TP}{TP + FP}$;
- F1 Score: The harmonic mean of precision and recall, defined as $\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$.
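As a small worked example, the four metrics can be computed from confusion-matrix counts, where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives. The function name and the counts below are made up for illustration and are not the paper's results; treating phishing as the positive class is an assumption.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, recall, precision, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical counts: 100 phishing URLs (90 caught), 100 legitimate URLs (85 passed).
m = classification_metrics(tp=90, tn=85, fp=15, fn=10)
```

With these counts, accuracy is 175/200 and recall is 90/100, while precision (90/105) is lower because of the 15 false alarms; F1 sits between precision and recall.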
6. Limitations and Future Work
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Tang, L.; Mahmoud, Q.H. A Deep Learning-Based Framework for Phishing Website Detection. IEEE Access 2022, 10, 1509–1521.
2. Yerima, S.Y.; Alzaylaee, M.K. High Accuracy Phishing Detection Based on Convolutional Neural Networks. In Proceedings of the ICCAIS 2020–3rd International Conference on Computer Applications and Information Security, Riyadh, Saudi Arabia, 19–21 March 2020.
3. Jakobsson, M.; Myers, S. (Eds.) Phishing and Countermeasures: Understanding the Increasing Problem of Electronic Identity Theft; John Wiley & Sons: Hoboken, NJ, USA, 2006.
4. Hong, J. The state of phishing attacks. Commun. ACM 2012, 55, 74–81.
5. Al-Ahmadi, S.; Alotaibi, A.; Alsaleh, O. PDGAN: Phishing Detection With Generative Adversarial Networks. IEEE Access 2022, 10, 42459–42468.
6. Rajitha, K.; Vijayalakshmi, D. Suspicious URLs Filtering Using Optimal RT-PFL: A Novel Feature Selection Based Web URL Detection. In Proceedings of the Smart Innovation, Systems and Technologies, Queensland, Australia, 20–22 June 2018.
7. APWG. Phishing Activity Trends Reports. Available online: https://apwg.org/trendsreports/ (accessed on 19 August 2024).
8. Sahingoz, O.K.; Buber, E.; Demir, O.; Diri, B. Machine Learning Based Phishing Detection from URLs. Expert Syst. Appl. 2019, 117, 345–357.
9. Bu, S.J.; Cho, S.B. Deep Character-Level Anomaly Detection Based on a Convolutional Autoencoder for Zero-Day Phishing URL Detection. Electronics 2021, 10, 1492.
10. Kang, J.M.; Lee, D.H. Advanced White List Approach for Preventing Access to Phishing Sites. In Proceedings of the 2007 International Conference on Convergence Information Technology (ICCIT 2007), Gwangju, Republic of Korea, 21–23 November 2007.
11. Fu, A.Y.; Liu, W.; Deng, X. Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover’s Distance (EMD). IEEE Trans. Dependable Secur. Comput. 2006, 3, 301–311.
12. Cao, Y.; Han, W.; Le, Y. Anti-Phishing Based on Automated Individual White-List. In Proceedings of the ACM Conference on Computer and Communications Security, Alexandria, VA, USA, 27–31 October 2008.
13. Oest, A.; Safei, Y.; Doupe, A.; Ahn, G.J.; Wardman, B.; Warner, G. Inside a Phisher’s Mind: Understanding the Anti-Phishing Ecosystem through Phishing Kit Analysis. In Proceedings of the 2018 APWG Symposium on Electronic Crime Research (eCrime), San Diego, CA, USA, 15–17 May 2018.
14. Sharifi, M.; Siadati, S.H. A Phishing Sites Blacklist Generator. In Proceedings of the AICCSA 08–6th IEEE/ACS International Conference on Computer Systems and Applications, Doha, Qatar, 31 March–4 April 2008.
15. Zhang, Y.; Hong, J.I.; Cranor, L.F. Cantina: A Content-Based Approach to Detecting Phishing Web Sites. In Proceedings of the 16th International World Wide Web Conference (WWW2007), Banff, AB, Canada, 8–12 May 2007.
16. Prakash, P.; Kumar, M.; Rao Kompella, R.; Gupta, M. PhishNet: Predictive Blacklisting to Detect Phishing Attacks. In Proceedings of IEEE INFOCOM, San Diego, CA, USA, 14–19 March 2010.
17. Xiang, G.; Hong, J.; Rose, C.P.; Cranor, L. CANTINA+: A Feature-Rich Machine Learning Framework for Detecting Phishing Web Sites. ACM Trans. Inf. Syst. Secur. 2011, 14, 1–28.
18. Keivanloo, I.; Roy, C.K.; Rilling, J. SeByte: Scalable Clone and Similarity Search for Bytecode. Sci. Comput. Program. 2014, 95, 426–444.
19. Ozker, U.; Sahingoz, O.K. Content Based Phishing Detection with Machine Learning. In Proceedings of the 2020 International Conference on Electrical Engineering (ICEE 2020), Istanbul, Turkey, 25–27 September 2020.
20. Liu, W.; Huang, G.; Liu, X.; Zhang, M.; Deng, X. Detection of Phishing Webpages Based on Visual Similarity. In Proceedings of the 14th International World Wide Web Conference (WWW2005), Chiba, Japan, 10–14 May 2005.
21. Abdelnabi, S.; Krombholz, K.; Fritz, M. VisualPhishNet: Zero-Day Phishing Website Detection by Visual Similarity. In Proceedings of the ACM Conference on Computer and Communications Security, Virtual Event, 9–13 November 2020.
22. Chen, J.L.; Ma, Y.W.; Huang, K.L. Intelligent Visual Similarity-Based Phishing Websites Detection. Symmetry 2020, 12, 1681.
23. Nair, S.M. Detecting Malicious URL Using Machine Learning: A Survey. Int. J. Res. Appl. Sci. Eng. Technol. 2020, 8, 2670–2677.
24. Cui, Q.; Jourdan, G.V.; Bochmann, G.V.; Couturier, R.; Onut, I.V. Tracking Phishing Attacks over Time. In Proceedings of the 26th International World Wide Web Conference (WWW 2017), Perth, Australia, 3–7 April 2017.
25. Alfouzan, N.A.; Narmatha, C. A Systematic Approach for Malware URL Recognition. In Proceedings of the 2022 2nd International Conference on Computing and Information Technology (ICCIT 2022), Tabuk, Saudi Arabia, 25–27 January 2022.
26. Orunsolu, A.A.; Sodiya, A.S.; Akinwale, A.T. A Predictive Model for Phishing Detection. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 232–247.
27. Atimorathanna, D.N.; Ranaweera, T.S.; Devdunie Pabasara, R.A.H.; Perera, J.R.; Abeywardena, K.Y. NoFish; Total Anti-Phishing Protection System. In Proceedings of the ICAC 2020 2nd International Conference on Advancements in Computing, Colombo, Sri Lanka, 10–11 December 2020.
28. Shah, B.; Dharamshi, K.; Patel, M.; Gaikwad, D.; Professor, A. Chrome Extension for Detecting Phishing Websites. Int. Res. J. Eng. Technol. 2020, 7, 2958–2962.
29. Abiodun, O.; Sodiya, A.S.; Kareem, S.O. LinkCalculator: An Efficient Link-Based Phishing Detection Tool. Acta Inform. Malaysia 2020, 4, 37–44.
30. Wu, J.; Yang, Z.; Guo, L.; Li, Y.; Liu, W. Convolutional Neural Network with Character Embeddings for Malicious Web Request Detection. In Proceedings of the 2019 IEEE Intl Conf on Parallel and Distributed Processing with Applications, Big Data and Cloud Computing, Sustainable Computing and Communications, Social Computing and Networking, ISPA/BDCloud/SustainCom/SocialCom 2019, Xiamen, China, 16–18 December 2019; pp. 622–627.
31. Athiwaratkun, B.; Stokes, J.W. Malware Classification with LSTM and GRU Language Models and a Character-Level CNN. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2482–2486.
32. Huang, Y.; Yang, Q.; Qin, J.; Wen, W. Phishing URL Detection via CNN and Attention-Based Hierarchical RNN. In Proceedings of the 2019 18th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/13th IEEE International Conference on Big Data Science and Engineering, TrustCom/BigDataSE 2019, Rotorua, New Zealand, 5–8 August 2019; pp. 112–119.
33. Mohammad, R.M.; Thabtah, F.; McCluskey, L. Predicting Phishing Websites Based on Self-Structuring Neural Network. Neural Comput. Appl. 2014, 25, 443–458.
34. Shibahara, T.; Yamanishi, K.; Takata, Y.; Chiba, D.; Akiyama, M.; Yagi, T.; Ohsita, Y.; Murata, M. Malicious URL Sequence Detection Using Event De-Noising Convolutional Neural Network. In Proceedings of the IEEE International Conference on Communications, Paris, France, 21–25 May 2017; pp. 1–7.
35. Janiesch, C.; Zschech, P.; Heinrich, K. Machine Learning and Deep Learning. Electron. Mark. 2021, 31, 685–695.
36. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, Future Directions. J. Big Data 2021, 8, 1–74.
37. Bolhasani, H.; Mohseni, M.; Rahmani, A.M. Deep Learning Applications for IoT in Health Care: A Systematic Review. Informatics Med. Unlocked 2021, 23, 100550.
38. Hassani, H.; Huang, X.; Silva, E.; Ghodsi, M. Deep Learning and Implementations in Banking. Ann. Data Sci. 2020, 7, 433–446.
39. Alahmari, S.S.; Goldgof, D.B.; Mouton, P.R.; Hall, L.O. Challenges for the Repeatability of Deep Learning Models. IEEE Access 2020, 8, 211860–211868.
40. Guo, T.; Dong, J.; Li, H.; Gao, Y. Simple convolutional neural network on image classification. In Proceedings of the 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA), Beijing, China, 10–12 March 2017; pp. 721–724.
41. Singh, K.; Scholar, R.; Mahajan, A.; Mansotra, V. 1D-CNN Based Model for Classification and Analysis of Network Attacks. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 0121169.
42. Xiao, X.; Xiao, W.; Zhang, D.; Zhang, B.; Hu, G.; Li, Q.; Xia, S. Phishing Websites Detection via CNN and Multi-Head Self-Attention on Imbalanced Datasets. Comput. Secur. 2022, 108, 102372.
43. Atrees, M.; Ahmad, A.; Alghanim, F. Enhancing Detection of Malicious URLs Using Boosting and Lexical Features. Intell. Autom. Soft Comput. 2022, 31, 1405–1422.
44. Pawluszek-Filipiak, K.; Borkowski, A. On the Importance of Train-Test Split Ratio of Datasets in Automatic Landslide Detection by Supervised Classification. Remote Sens. 2020, 12, 3054.
45. Tenis, A.A.; Santhosh, R. Modelling an Efficient URL Phishing Detection Approach Based on a Dense Network Model. Comput. Syst. Sci. Eng. 2023, 47, 2625–2641.
46. Bozkir, A.S.; Dalgic, F.C.; Aydos, M. GramBeddings: A New Neural Network for URL Based Identification of Phishing Web Pages Through N-Gram Embeddings. Comput. Secur. 2023, 124, 102964.
47. Dhanavanthini, P.; Chakkravarthy, S.S. Phish-Armour: Phishing Detection Using Deep Recurrent Neural Networks. Soft Comput. 2023.
48. Adebowale, M.A.; Lwin, K.T.; Hossain, M.A. Intelligent Phishing Detection Scheme Using Deep Learning Algorithms. J. Enterp. Inf. Manag. 2023, 36, 747–766.
49. Kumar, P.P.; Jaya, T.; Rajendran, V. SI-BBA–A Novel Phishing Website Detection Based on Swarm Intelligence with Deep Learning. Mater. Today Proc. 2021, 80, 3129–3139.
50. Siva Satya Sreedhar, P.; Velpula, S.; Parise, R.; Vamsi, N.K.; Chaitanya, S.K. Phishing Attack Detection Using Convolutional Neural Networks. In Proceedings of the 2023 9th International Conference on Advanced Computing and Communication Systems, ICACCS 2023, Coimbatore, India, 17–18 March 2023.
51. Said, Y.; Alsheikhy, A.A.; Lahza, H.; Shawly, T. Detecting Phishing Websites through Improving Convolutional Neural Networks with Self-Attention Mechanism. Ain Shams Eng. J. 2024, 15, 102643.
52. Saha, I.; Sarma, D.; Chakma, R.J.; Alam, M.N.; Sultana, A.; Hossain, S. Phishing attacks detection using deep learning approach. In Proceedings of the 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 20–22 August 2020; pp. 1180–1185.
53. Rasymas, T.; Dovydaitis, L. Detection of phishing URLs by using deep learning approach and multiple features combinations. Balt. J. Mod. Comput. 2020, 8, 471–483.

Metric | Training | Testing |
---|---|---|
Accuracy | 99.7% | 99.3% |
Recall | 99.8% | 99.5% |
Precision | 99.6% | 99.2% |
F1 score | 99.76% | 99.34% |

Setting | Value |
---|---|
Epochs | 500 |
Loss function | Binary cross-entropy |
Optimizer | Adam |
Activation function | ReLU |
Batch size | 500 |
Dropout | 0.2 |
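
The training settings in the table above could be gathered into a small configuration object, shown here together with a plain-Python version of the binary cross-entropy loss. This is a sketch only: `TrainingConfig`, `steps_per_epoch`, and the string labels for the loss, optimizer, and activation are hypothetical names, and the dataset size in the usage lines is made up.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Values taken from the settings table; field names are illustrative."""
    epochs: int = 500
    batch_size: int = 500
    loss: str = "binary_crossentropy"
    optimizer: str = "adam"
    activation: str = "relu"
    dropout: float = 0.2

    def steps_per_epoch(self, n_samples):
        """Number of mini-batches needed to cover the data once."""
        return math.ceil(n_samples / self.batch_size)


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


cfg = TrainingConfig()
steps = cfg.steps_per_epoch(10_000)  # 10,000 is a made-up dataset size
loss = binary_cross_entropy([1, 0], [0.9, 0.2])
```

With a batch size of 500, a hypothetical set of 10,000 URLs would take 20 optimizer steps per epoch, so 500 epochs amounts to 10,000 steps in total.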

Ref. | Author, Year | Methodology | Datasets | Limitations | Accuracy |
---|---|---|---|---|---|
[48] | Adebowale et al., 2023 | Hybrid deep learning technique that detects the public image frame and textual information for URL detection utilizing both the CNN and LSTM methods. | Phishing website datasets | The proposed technique is focused more on URL image detection. | 93.28% |
[45] | Tenis et al., 2023 | A dense forward-backward long short-term memory (LSTM) model (d-FBLSTM) was proposed for the detection of phishing URLs. | MUPD | The proposed model detects only home page URLs. | 98.5% |
[5] | Al-Ahmadi et al., 2022 | URL-based phishing detection based on LSTM and CNN models. | PhishTank and DomCop | The results were not compared with a deep learning or machine learning model. | 97.58% |
[46] | Bozkir et al., 2023 | Phishing URL detection using a quadruplet deep neural network based on combining several n-gram embeddings and word embeddings. | Gram Embedding | Not all the evaluation measures mentioned were applied to evaluate the performance of the model. | 98% |
[47] | Dhanavanthini et al., 2023 | A Phish-Armour phishing detection model using deep recurrent neural networks that match SSL and website content for the detection of false URLs. | PhishTank, Common Crawl, and OpenPhish | A complex and time-consuming computation on a Raspberry Pi. | 90.50% |
[49] | Kumar et al., 2023 | An SI-BBA technique with a deep learning model for identifying phishing websites and successfully classifying them. | Phishing URL EDU | The results of the black box phishing attacks can be improved. | 94.8% |
[50] | Velpula et al., 2023 | A CNN is used for detection with random forest models to determine the feature significance at various levels. | 5000 phishing emails dataset | The detection results can be improved. | 98.68% |
[51] | Said et al., 2023 | This model combines the multi-head self-attention mechanism with a CNN and generative adversarial network model to create a URL detector. | UCI phishing domains | The results can be improved further. | 97.83% |
[52] | Saha et al., 2020 | A data-driven framework for detecting phishing webpages. | Based on ten thousand web pages | The framework focused on phishing web pages. | 93% |
[53] | Rasymas et al., 2020 | Proposes a deep neural network architecture. | Phishing URLs and benign URLs | The results can be improved further. | 94.4% |
Proposed model | - | A model is proposed for the detection of phishing URLs based on 1D CNN architecture. | PhishTank, UNB, and Alexa | It takes a long time to train the model. | 99.7% (training), 99.3% (testing) |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Haq, Q.E.u.; Faheem, M.H.; Ahmad, I. Detecting Phishing URLs Based on a Deep Learning Approach to Prevent Cyber-Attacks. Appl. Sci. 2024, 14, 10086. https://doi.org/10.3390/app142210086