Homoglyph Attack Detection Model Using Machine Learning and Hash Function
Abstract
1. Introduction
1.1. Problem Statement(s) and Major Contributions
- Proposes a homoglyph attack detection model that combines a hash function and machine learning.
- Achieves an accuracy of 99.8% using Random Forest, and the hash function improves the accuracy of homoglyph attack detection.
- Compares the proposed model on several criteria to existing phishing detection methods.
1.2. Related Works
1.3. Roadmap of the Paper
2. Material and Methods
- The hash value of the original domain name: www.google.com, (accessed on 18 July 2022).
- The hash value after changing only one ‘o’ of the original domain name into a Greek ‘o’: www.goοgle.com (accessed on 18 July 2022).
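The effect described in the two items above can be reproduced with a short sketch. The hash function used here (SHA-256 from Python's `hashlib`) is an assumption made for illustration, as is the helper name `domain_hash`; the spoofed string replaces one Latin "o" with the Greek small omicron (U+03BF).

```python
import hashlib


def domain_hash(domain: str) -> str:
    """Return the SHA-256 hex digest of a domain name encoded as UTF-8.

    SHA-256 is an assumption for this sketch, not necessarily the
    hash function used by the proposed model.
    """
    return hashlib.sha256(domain.encode("utf-8")).hexdigest()


original = "www.google.com"
# Same string with one Latin 'o' replaced by Greek small omicron (U+03BF).
spoofed = "www.go\u03bfgle.com"

print(domain_hash(original))
print(domain_hash(spoofed))
# The two strings look alike on screen but are different code points,
# so their digests differ.
print(domain_hash(original) == domain_hash(spoofed))
```

Because the two strings differ in a single code point, their digests do not match, which is the property the hash lookup relies on.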
2.1. Development Phase
2.1.1. Data Selection
2.1.2. Data Preprocessing
- Data Integration: There is a large imbalance between the legitimate and illegitimate datasets, which leads the model to overfit to the majority class. To avoid this problem, we combined two illegitimate datasets (PhishTank 20% and Phishstat 80%) to obtain a larger set of records that is more consistent in size with the legitimate dataset. We then balanced the number of records per class label so that our dataset contains 50% illegitimate and 50% legitimate URLs.
- Data Transformation: This step plays a pivotal role in converting unprocessed data into an understandable form. In this phase, we normalized, aggregated, and generalized our datasets. This modification helps to minimize time and complexity by arranging the columns and creating a summary for a faster overview.
- Data Reduction: This phase reduces large amounts of data into smaller, more meaningful fragments from which the features can be extracted directly.
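The integration step can be sketched as follows. This is a minimal illustration that assumes the 20%/80% figures describe the share of each source within the illegitimate class; the function name `build_balanced`, the label encoding (1 for illegitimate, 0 for legitimate), and the sample domains are hypothetical.

```python
import random


def build_balanced(legit, phishtank, phishstat, n_per_class, seed=0):
    """Build a 50/50 labeled dataset of (url, label) pairs.

    Assumption: 20% of the illegitimate class is drawn from PhishTank
    and 80% from Phishstat. Label 1 = illegitimate, 0 = legitimate.
    """
    rng = random.Random(seed)
    n_phishtank = round(0.2 * n_per_class)
    n_phishstat = n_per_class - n_phishtank
    bad = rng.sample(phishtank, n_phishtank) + rng.sample(phishstat, n_phishstat)
    good = rng.sample(legit, n_per_class)
    data = [(url, 1) for url in bad] + [(url, 0) for url in good]
    rng.shuffle(data)  # avoid any ordering by class
    return data


# Hypothetical toy inputs.
legit = [f"good{i}.example" for i in range(50)]
phishtank = [f"pt{i}.example" for i in range(20)]
phishstat = [f"ps{i}.example" for i in range(40)]

dataset = build_balanced(legit, phishtank, phishstat, n_per_class=10)
print(len(dataset), sum(label for _, label in dataset))
```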
2.1.3. Feature Extraction
- URL-Based Features: For example, the protocol (HTTP or HTTPS), subdomain (www), domain (google), top-level domain (TLD), which has different types (infrastructure, generic, sponsored, country-code, and test), and path [18].
- URL-Based Language Features: These help to extract the features of non-English languages from URLs, using words and N-grams as features [19].
- URL length: The attacker can take advantage of a long URL to hide suspicious content. For instance, “https://luizaonlaine.com/5257iuhamkvnma024/index.php?o-de-panelas-tramontina-antiaderente-de-aluminio-vermelho-10-pecas-turim-20298-722%25252Fp%25252F144129900%25252Fud%25252Fpanl%25252F&id=1” (accessed on 18 July 2022). To ensure accuracy, we calculate the average URL length, and a URL whose length is greater than 75 is classified as a phishing website. Otherwise, it is categorized as a legitimate website.
- Special characters: Phishing websites usually hide suspicious content in the URL by using special characters. For example, the symbol “@” leads the browser to ignore the part of the URL that precedes the “@” symbol.
- Redirecting using “//”: The symbol “//” is used to forward the user to another website. Because we strip the leading scheme (“http://” and “https://”), any remaining “//” indicates that the website is illegitimate.
- Prefixes or suffixes separated by “-” in the domain: The attacker confuses the user by adding a prefix or suffix separated by the symbol “-” to convince the user that the URL is legitimate, for example, www.payment-amazon.com (accessed on 18 July 2022).
- Number of subdomains: The number of subdomains can be estimated from the number of dots (.). Based on the datasets, if the number of dots is more than four, the website is classified as illegitimate.
- The occurrence of “HTTPS” in the domain of the URL: The attacker can attach the “HTTPS” token to the domain of the website URL to fool the audience, such as http://https-www-paypal-it-webapps-mpp-home.soft-hair.com/ (accessed on 18 July 2022). The phisher will probably use HTTP instead of HTTPS.
- Homoglyph characters: The attackers manipulate the characters of the URL so that it looks like a legitimate URL, using various scripts such as Latin, Cyrillic, and Greek to trick the user. For example, “www.google.com” (accessed on 18 July 2022) looks legitimate, but it uses Cyrillic letters to write “Google”.
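Several of the rules in the list above can be expressed as simple Boolean features. The sketch below is only one possible reading of those rules: the function name `extract_features`, the feature names, the choice to count dots in the host name rather than the whole URL, and the use of "any non-ASCII character in the host" as a stand-in for homoglyph detection are all assumptions.

```python
from urllib.parse import urlsplit


def extract_features(url: str) -> dict:
    """Compute rule-based URL features (illustrative thresholds from the text)."""
    has_scheme = "://" in url
    # Parse with a default scheme so bare domains such as "www.example.com" work.
    parts = urlsplit(url if has_scheme else "http://" + url)
    host = parts.hostname or ""
    # Everything after the scheme separator, used for the "//" redirect check.
    rest = url.split("://", 1)[1] if has_scheme else url
    return {
        "long_url": len(url) > 75,
        "has_at_symbol": "@" in url,
        "redirect_double_slash": "//" in rest,
        "hyphen_in_domain": "-" in host,
        "many_dots": host.count(".") > 4,
        "https_token_in_domain": "https" in host,
        # Assumption: treat any non-ASCII host character as a homoglyph signal.
        "non_ascii_host": any(ord(ch) > 127 for ch in host),
    }


for example in (
    "http://https-www-paypal-it-webapps-mpp-home.soft-hair.com/",
    "www.payment-amazon.com",
    "www.go\u03bfgle.com",  # Greek omicron in place of a Latin 'o'
):
    flagged = [name for name, value in extract_features(example).items() if value]
    print(example, flagged)
```

A real pipeline would feed these features to the classifier rather than treat any single one as decisive, since legitimate internationalized domains also contain non-ASCII characters.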
2.1.4. Feature Selection
2.1.5. Data Splitting
2.1.6. Classifier Selection
2.2. Deployment Phase
- Case 1: Check whether the URL exists in the hash dataset:
  - If it exists, the user is forwarded to the website.
  - If it does not exist in the hash dataset, check the illegitimate dataset.
- Case 2: If the URL exists in the illegitimate dataset, block the website.
- Case 3: If the URL exists in neither the “Hash Dataset” nor the “Illegitimate Dataset”, it goes through the machine learning process.
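The three cases can be sketched as a single lookup function. This is an illustration under stated assumptions: the hash dataset is taken to hold SHA-256 digests of known legitimate URLs, the classifier is passed in as a callable that returns `True` for legitimate URLs, and the names `url_hash` and `check_url` as well as the "allow"/"block" return values are hypothetical.

```python
import hashlib
from typing import Callable, Set


def url_hash(url: str) -> str:
    """SHA-256 digest of the URL (the hash function is an assumption)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def check_url(
    url: str,
    hash_dataset: Set[str],
    illegitimate_dataset: Set[str],
    classify: Callable[[str], bool],
) -> str:
    """Return "allow" or "block" following Cases 1 to 3."""
    # Case 1: the hash of the URL is in the hash dataset.
    if url_hash(url) in hash_dataset:
        return "allow"
    # Case 2: the URL is in the illegitimate dataset.
    if url in illegitimate_dataset:
        return "block"
    # Case 3: unknown URL, defer to the machine learning classifier.
    return "allow" if classify(url) else "block"


hash_dataset = {url_hash("www.google.com")}
illegitimate_dataset = {"www.bad.example"}


def always_suspicious(url: str) -> bool:
    """Placeholder classifier that rejects everything."""
    return False


print(check_url("www.google.com", hash_dataset, illegitimate_dataset, always_suspicious))
print(check_url("www.go\u03bfgle.com", hash_dataset, illegitimate_dataset, always_suspicious))
```

The homoglyph variant misses the hash lookup because its digest differs from that of the genuine domain, so it falls through to the later cases.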
2.2.1. Homoglyph Detection Implementation
2.2.2. Accessibility Interface Prototype
- Legitimate scenario: When the user enters a legitimate URL, the pop-up message will show that the URL is legitimate and ask the user if they want to continue with the URL of the webpage or not as shown in Figure 8.
- Illegitimate scenario: When the user enters a mix of IP and homoglyph letters, e.g., www.talenteđ/192.168.4.8 (accessed on 18 July 2022), the pop-up message will show that the URL is illegitimate, as shown in Figure 9.
3. Experimental Analysis, Observations, and Results
3.1. Performance Analysis
3.2. Functionality Analysis
3.3. Security Analysis
4. Discussion
4.1. Interface Attack
- Password attack: The attacker can gain unauthorized access to the system through an admin account. The following are examples of password attacks:
  - Brute force attack: The attacker tests many attempts and combinations until the account password is found.
  - Dictionary attack: The attacker tries every password in a dictionary word list to break into the system and gain unauthorized access.
  - Rainbow table attack: The attacker looks up a stolen password hash in a precomputed table of hashes (a rainbow table) to recover the password.
  - Mitigation: (1) limiting failed login attempts; (2) two-factor authentication (2FA); (3) using a strong password.
- Privilege abuse: A privilege abuse attack occurs when the attacker enters the system with the admin account and uses it inappropriately.
  - Mitigation: (1) Access control: regulating access to the system by using an access control list (ACL) to minimize the risk of unauthorized users. (2) Audit trail: auditing and logging the admin’s activities.
4.2. Data Attack
- SQL injection: In an SQL injection attack, the attacker tries to gain unauthorized access by injecting a malicious SQL query to retrieve the desired information from the database. In the case of the proposed model, the attacker would submit an SQL query in place of the URL [36] in order to add an illegitimate URL to the legitimate dataset and ensure future access to the website.
- Mitigation: An input validation mechanism must be in place to check the input given by the user. The input must be a URL, which the machine checks to determine its legitimacy and then adds to the suitable dataset. If the input is an SQL query, the system drops it, blocking any attempt by the user to modify the dataset directly, which is the responsibility of the machine only [36].
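The mitigation above can be sketched with Python's built-in `sqlite3` module. The validation rule (only `http`/`https` URLs with a host and no whitespace or quote characters), the table layout, and the function names are assumptions for illustration; the parameterized `INSERT` is a standard complementary defense that keeps user input from being interpreted as SQL.

```python
import sqlite3
from urllib.parse import urlsplit


def is_plausible_url(text: str) -> bool:
    """Accept only http(s) URLs with a host and no whitespace or quotes."""
    if any(ch.isspace() or ch in "'\"" for ch in text):
        return False
    try:
        parts = urlsplit(text)
        return parts.scheme in ("http", "https") and bool(parts.hostname)
    except ValueError:  # raised by urlsplit for some malformed inputs
        return False


def store_url(conn: sqlite3.Connection, url: str, label: str) -> bool:
    """Insert a validated URL using a parameterized query; drop anything else."""
    if not is_plausible_url(url):
        return False
    conn.execute("INSERT INTO urls (url, label) VALUES (?, ?)", (url, label))
    conn.commit()
    return True


# Hypothetical in-memory table standing in for the datasets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT, label TEXT)")

print(store_url(conn, "https://www.example.com/", "legitimate"))
print(store_url(conn, "'; DROP TABLE urls; --", "legitimate"))
print(conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0])
```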
4.3. System Attacks
- Evasion attacks: These are the most prevalent type of attack that may be encountered during system operation. This attack is carried out against the already-trained model; e.g., in our case, the attacker changes some words in the URL so that it still seems readable, but the ML classifier fails to classify it correctly.
- Mitigation: The efficacy of the classifier model is evaluated by validating the input, as shown in Figure 12.
- Poisoning attacks: This attack targets the data on which the ML model is trained. It threatens the availability and integrity of the ML by injecting so much bad data into the system that whatever boundary the model learns becomes useless.
- Mitigation: The most common type of defense is outlier detection. The idea is that, when poisoning the machine learning system, the attacker injects something into the training set that is very different from what it should include, and this can be detected.
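As a rough illustration of outlier detection on training data, the sketch below filters samples whose feature value is far from the median, using the modified z-score based on the median absolute deviation (MAD) with the commonly used 3.5 cutoff. The choice of this particular technique, the single feature (URL length), and the function name `filter_outliers` are assumptions; the text above does not specify a method.

```python
from statistics import median


def filter_outliers(samples, feature, threshold=3.5):
    """Keep samples whose modified z-score for `feature` is within `threshold`.

    Modified z-score: 0.6745 * |x - median| / MAD.
    """
    values = [feature(sample) for sample in samples]
    med = median(values)
    mad = median(abs(value - med) for value in values)
    if mad == 0:
        # All values are (nearly) identical; nothing can be ranked as an outlier.
        return list(samples)
    return [
        sample
        for sample, value in zip(samples, values)
        if 0.6745 * abs(value - med) / mad <= threshold
    ]


# Hypothetical training URLs: ten of ordinary length and one injected extreme.
urls = ["a.com/" + "x" * i for i in range(10, 20)] + ["a.com/" + "x" * 500]
kept = filter_outliers(urls, feature=len)
print(len(urls), len(kept))
```

A single feature is rarely enough in practice; the same idea extends to multivariate detectors applied to the full feature vector.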
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- [1] Woodbridge, J.; Anderson, H.S.; Ahuja, A.; Grant, D. Detecting Homoglyph Attacks with a Siamese Neural Network. In Proceedings of the 2018 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 24 May 2018; pp. 22–28.
- [2] Elsayed, Y.; Shosha, A. Large scale detection of IDN domain name masquerading. In Proceedings of the 2018 APWG Symposium on Electronic Crime Research (eCrime), San Diego, CA, USA, 15–17 May 2018; pp. 1–11.
- [3] Thao, T.P.; Sawaya, Y.; Nguyen-Son, H.-Q.; Yamada, A.; Kubota, A.; Van Sang, T.; Yamaguchi, R.S. Human Factors in Homograph Attack Recognition. In Applied Cryptography and Network Security; Springer: Cham, Switzerland, 2020; pp. 408–435.
- [4] Simpson, G.; Moore, T.; Clayton, R. Ten years of attacks on companies using visual impersonation of domain names. In Proceedings of the 2020 APWG Symposium on Electronic Crime Research (eCrime), Boston, MA, USA, 16–19 November 2020.
- [5] Chiba, D.; Hasegawa, A.A.; Koide, T.; Sawabe, Y.; Goto, S.; Akiyama, M. DomainScouter: Analyzing the Risks of Deceptive Internationalized Domain Names. IEICE Trans. Inf. Syst. 2020, E103.D, 1493–1511.
- [6] Summary of the Phishing and Attempted Stealing Incident on Binance–Binance. pp. 7–9. Available online: https://support.binance.com/hc/en-us/articles/360001547431-Summary-of-the-Phishing-and-Attempted-Stealing-Incident-on-Binance (accessed on 18 July 2022).
- [7] Helfrich, J.N.; Neff, R. Dual canonicalization: An answer to the homograph attack. In Proceedings of the 2012 eCrime Researchers Summit, Las Croabas, PR, USA, 23–24 October 2012; pp. 1–10.
- [8] Pandey, G.; Martolia, M.; Arora, N. A Novel String Matching Algorithm and Comparison with KMP Algorithm. Int. J. Comput. Appl. 2017, 179, 6–8.
- [9] Pandey, S.K.; Dubey, N.K.; Sharma, S. A Study on String Matching Methodologies. Int. J. Comput. Sci. Inf. Technol. 2014, 5, 4732–4735.
- [10] Sawabe, Y.; Chiba, D.; Akiyama, M.; Goto, S. Detecting homograph IDNs using OCR. In Proceedings of the Asia-Pacific Advanced Network, Singapore, 25–29 March 2018; pp. 56–64.
- [11] Barnouti, N.H.; Abomaali, M.; Al-Mayyahi, M.H.N. An efficient character recognition technique using K-nearest neighbor classifier. Int. J. Eng. Technol. 2018, 7, 3148–3153.
- [12] Suzuki, H.; Chiba, D.; Yoneya, Y.; Mori, T.; Goto, S. ShamFinder: An automated framework for detecting IDN homographs. In Proceedings of the Internet Measurement Conference, Amsterdam, The Netherlands, 21–23 October 2019; pp. 449–462.
- [13] Vinayakumar, R.; Soman, K.P. Siamese neural network architecture for homoglyph attacks detection. ICT Express 2020, 6, 16–19.
- [14] Alexa Top 1 Million Sites. Available online: https://www.kaggle.com/datasets/cheedcheed/top1m (accessed on 18 May 2022).
- [15] Sahoo, D.; Liu, C.; Hoi, S.C. Malicious URL Detection using Machine Learning: A Survey. arXiv 2017, arXiv:1701.07179.
- [16] Le, H.; Pham, Q.; Sahoo, D.; Hoi, S.C.H. URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection. arXiv 2018, arXiv:1802.03162.
- [17] Zhu, E.; Chen, Y.; Ye, C.; Li, X.; Liu, F. OFS-NN: An Effective Phishing Websites Detection Model Based on Optimal Feature Selection and Neural Network. IEEE Access 2019, 7, 73271–73284.
- [18] Vanhoenshoven, F.; Napoles, G.; Falcon, R.; Vanhoof, K.; Koppen, M. Detecting malicious URLs using machine learning techniques. In Proceedings of the 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Athens, Greece, 6–9 December 2016; pp. 1–8.
- [19] Baykan, E.; Henzinger, M.; Weber, I. A comprehensive study of techniques for URL-based web page language classification. ACM Trans. Web 2013, 7, 1–37.
- [20] Talavera, L. An Evaluation of Filter and Wrapper Methods for Feature Selection in Categorical Clustering. In Advances in Intelligent Data Analysis VI; Springer: Berlin/Heidelberg, Germany, 2005; pp. 440–451.
- [21] Kotthoff, L.; Thornton, C.; Hoos, H.H.; Hutter, F.; Leyton-Brown, K. Auto-WEKA: Automatic Model Selection and Hyperparameter Optimization. In Automated Machine Learning: Methods, Systems, Challenges; Springer: Cham, Switzerland, 2019; p. 81.
- [22] Abdelhamid, N.; Ayesh, A.; Thabtah, F. Phishing detection based Associative Classification data mining. Expert Syst. Appl. 2014, 41, 5948–5959.
- [23] Desai, A.; Jatakia, J.; Naik, R.; Raul, N. Malicious web content detection using machine leaning. In Proceedings of the 2017 2nd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), Bangalore, India, 19–20 May 2017; pp. 1432–1436.
- [24] Machado, L.; Gadge, J. Phishing Sites Detection Based on C4.5 Decision Tree Algorithm. In Proceedings of the 2017 International Conference on Computing, Communication, Control and Automation (ICCUBEA), Pune, India, 17–18 August 2017; pp. 1–5.
- [25] Tyagi, I.; Shad, J.; Sharma, S.; Gaur, S.; Kaur, G. A Novel Machine Learning Approach to Detect Phishing Websites. In Proceedings of the 2018 5th International Conference on Signal Processing and Integrated Networks (SPIN), Noida, India, 22–23 February 2018; pp. 425–430.
- [26] Sonmez, Y.; Tuncer, T.; Gokal, H.; Avci, E. Phishing web sites features classification based on extreme learning machine. In Proceedings of the 2018 6th International Symposium on Digital Forensic and Security (ISDFS), Antalya, Turkey, 22–25 March 2018; pp. 1–5.
- [27] Karabatak, M.; Mustafa, T. Performance comparison of classifiers on reduced phishing website dataset. In Proceedings of the 2018 6th International Symposium on Digital Forensic and Security (ISDFS), Antalya, Turkey, 22–25 March 2018; pp. 1–5.
- [28] Ginsberg, A.; Yu, C. Rapid Homoglyph Prediction and Detection. In Proceedings of the 2018 1st International Conference on Data Intelligence and Security (ICDIS), South Padre Island, TX, USA, 8–10 April 2018; pp. 17–23.
- [29] Parekh, S.; Parikh, D.; Kotak, S.; Sankhe, S. A New Method for Detection of Phishing Websites: URL Detection. In Proceedings of the 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, India, 20–21 April 2018; pp. 949–952.
- [30] Chiew, K.L.; Tan, C.L.; Wong, K.; Yong, K.S.; Tiong, W.K. A new hybrid ensemble feature selection framework for machine learning-based phishing detection system. Inf. Sci. 2019, 484, 153–166.
- [31] Sahingoz, O.K.; Buber, E.; Demir, O.; Diri, B. Machine learning based phishing detection from URLs. Expert Syst. Appl. 2019, 117, 345–357.
- [32] Sonowal, G.; Kuppusamy, K. PhiDMA–A phishing detection model with multi-filter approach. J. King Saud Univ.-Comput. Inf. Sci. 2020, 32, 99–112.
- [33] Zhang, W.; Jiang, Q.; Chen, L.; Li, C. Two-stage ELM for phishing Web pages detection using hybrid features. World Wide Web 2017, 20, 797–813.
- [34] Mertooetomo, E.R.; Chen, J. Character recognition with fuzzy features and fuzzy regions. In Proceedings of the 1997 Annual Meeting of the North American Fuzzy Information Processing Society-NAFIPS (Cat. No.97TH8297), Syracuse, NY, USA, 21–24 September 1997; pp. 166–171.
- [35] Fu, A.Y.; Deng, X.; Liu, W.; Little, G. The methodology and an application to fight against Unicode attacks. In Proceedings of the Second Symposium on Usable Privacy and Security, Pittsburgh, PA, USA, 12–14 July 2006; Association for Computing Machinery: New York, NY, USA, 2006; Volume 149, pp. 91–101.
- [36] Burtescu, E. Database security-attacks and control methods. J. Appl. Quant. Methods 2009, 4, 449–454.
Algorithms | Accuracy |
---|---|
Support Vector Machine | 99.3% |
Decision Tree | 98.9% |
K Nearest Neighbor | 99.5% |
Random Forest | 99.8% |
Ref. | Year | Dataset | Technique | Findings |
---|---|---|---|---|
[22] | 2014 | PhishTank, Yahoo directory | Multi-label Classifier-based Associative Classification (MCAC) | Accuracy ≈ 94.5% |
[23] | 2017 | UCI | KNN, SVM, and Random Forest | NA |
[24] | 2017 | PhishTank, Statscrop | ELM, LC-ELM | Accuracy = 99.04% |
[25] | 2018 | PhishTank and Google | DT, RF, GBM | Accuracy = 98.4%, recall = 98.59%, precision = 97.70% |
[26] | 2018 | UCI | ELM | Accuracy = 95.34% |
[27] | 2018 | UCI | lazy K.Star | Accuracy = 97.58% |
[28] | 2018 | UCI | HEFS | Accuracy = 94.6% |
[29] | 2018 | PhishTank | RF | Accuracy of 95% |
[18] | 2019 | PhishTank’s and Ebubekirbbr | DT, NN, NB | Accuracy = 78.4% |
[17] | 2019 | PhishTank, Alexa records, UCI | FVV | Accuracy = 94.5%, precision = 96.4%, recall = 93.6%, F1_score = 95.4% |
[30] | 2019 | Phishload, PhishTank | Longest Common Subsequence (LCS), Damerau–Levenshtein Edit Distance (DLE) | 90.54% true positive, 94.18% true negative, 5.82% false positive, 9.46% false negative, and 92.72% accuracy. |
[31] | 2019 | PhishTank, Openfish | RF | Accuracy = 97.98% |
[32] | 2020 | PhishTank, Google | C4.5 classifier | Accuracy = 89.5%, precision = 88.26%, recall = 89.39%, F_measure = 91.16% |
Ref. | Metric | Number of Features Used | Highest | Name of Model |
---|---|---|---|---|
[33] | Accuracy | 16 | 99.04% | ELM, LC-ELM |
[25] | Precision | 6 | 97.70% | Random Forest |
[25] | Recall | 5 | 98.59% | Random Forest |
[23] | F1_Measure | 3 | 96% | Random Forest |
[33] | False positive | 2 | 0.53 | LC-ELMs |
Scheme/Feature | Performance | Functionality | Security | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Speed | Effectiveness | Scalability | Error Rate | Ease of Use | Implementation | Economical | Independence | Accuracy | Integrity | Availability | |
Optical character recognition (OCR) [10] | □ | □ | |||||||||
Fuzzy logic controller [34] | □ | □ | □ | ||||||||
Siamese neural network [13] | □ | □ | □ | ||||||||
ShamFinder [12] | □ | □ | □ | ||||||||
k-nearest neighbors algorithm (k-NN) [11] | □ | □ | □ | ||||||||
HitZone map [28] | □ | □ | □ | ||||||||
KMP [9,35] | □ | □ | |||||||||
Ours |
Criteria | Homoglyph Detection | OCR Improved by KNN [11] |
---|---|---|
Accuracy | 99.8% | 96.4% |
Precision | 99.6% | 93% |
Recall | 99.8% | 92% |
F1_measure | 99.7% | 91% |
False Positive | 0.2% | 3.6% |
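As a consistency check on the Homoglyph Detection column, the F1 measure can be recomputed from the reported precision and recall. This assumes the standard definition of F1 as the harmonic mean of precision $P$ and recall $R$:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.996 \times 0.998}{0.996 + 0.998}
    = \frac{1.988016}{1.994}
    \approx 0.997
```

Under that assumption the result agrees with the reported F1_measure of 99.7%.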
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Almuhaideb, A.M.; Aslam, N.; Alabdullatif, A.; Altamimi, S.; Alothman, S.; Alhussain, A.; Aldosari, W.; Alsunaidi, S.J.; Alissa, K.A. Homoglyph Attack Detection Model Using Machine Learning and Hash Function. J. Sens. Actuator Netw. 2022, 11, 54. https://doi.org/10.3390/jsan11030054