Detection of Korean Phishing Messages Using Biased Discriminant Analysis under Extreme Class Imbalance Problem
Abstract
1. Introduction
2. Previous Work
2.1. Spam Detection in Balanced Dataset
2.2. Spam Detection in Imbalanced Dataset
2.2.1. Traditional Methods
2.2.2. Deep Learning-Based Methods
2.2.3. Gradient Boosting Methods
2.2.4. Non-Parametric Supervised Learning Methods
2.3. Spam Detection in Extremely Imbalanced Datasets
3. Proposed Method
- Data Conversion: Messages are typically plain text and must be converted into a numerical representation that machine learning models can process.
- Curse of Dimensionality: As in any language, using every morpheme as a feature produces excessively high-dimensional data.
- Morphology: Korean is an agglutinative language in which nouns and verbs combine with particles, suffixes, and endings, producing a large number of derived word units and therefore a sharp increase in the number of features.
- Intention of Writing: Because phishing messages are written with similar intent, their text tends to share many of the same keywords.
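To make the conversion step concrete, the sketch below builds a frequency-capped bag-of-words vocabulary and vectorizes a message. It is a simplified illustration only: whitespace tokenization stands in for the Korean morphological analysis (e.g., via KoNLPy) that agglutinative text requires, and all function names here are hypothetical, not the paper's implementation.

```python
from collections import Counter

def tokenize(message, analyzer=None):
    # In practice a Korean morphological analyzer (e.g., KoNLPy) would be
    # plugged in here; plain whitespace splitting stands in for it.
    if analyzer is not None:
        return analyzer(message)
    return message.split()

def build_vocabulary(messages, max_features=None):
    # Keep only the most frequent tokens to cap dimensionality.
    counts = Counter(tok for msg in messages for tok in tokenize(msg))
    vocab = [tok for tok, _ in counts.most_common(max_features)]
    return {tok: i for i, tok in enumerate(vocab)}

def to_bow_vector(message, vocab):
    # Convert one message into a bag-of-words count vector.
    vec = [0] * len(vocab)
    for tok in tokenize(message):
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

messages = ["account verify now", "meet for lunch", "verify account link"]
vocab = build_vocabulary(messages, max_features=5)
print(to_bow_vector("verify account", vocab))
```

Capping `max_features` is one simple answer to the dimensionality problem noted above; the paper's actual dimensionality reduction is handled by the discriminant analysis stage.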
3.1. Data Conversion
3.2. Feature Engineering and Decision Making
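The feature engineering step builds on Biased Discriminant Analysis. As a rough illustration (not necessarily the paper's exact formulation), a BiasMap-style BDA [33] seeks directions along which the negative samples scatter widely around the positive-class mean while the positive samples themselves stay compact. The following is a minimal NumPy sketch under that assumption; the names `bda_projection` and `reg` are hypothetical.

```python
import numpy as np

def bda_projection(X_pos, X_neg, n_components=1, reg=1e-3):
    # Maximize J(w) = (w^T S_neg w) / (w^T S_pos w), where S_neg is the
    # scatter of negatives about the POSITIVE mean (the "bias") and
    # S_pos is the scatter of positives about their own mean.
    mu_pos = X_pos.mean(axis=0)
    Dp = X_pos - mu_pos
    S_pos = Dp.T @ Dp
    Dn = X_neg - mu_pos
    S_neg = Dn.T @ Dn
    # Regularize S_pos to cope with the small-sample-size problem (SSSP).
    S_pos += reg * np.eye(S_pos.shape[0])
    # Solve the generalized eigenproblem S_neg w = lambda S_pos w.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_pos, S_neg))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:n_components]]

# Toy usage: few phishing (positive) vectors, many normal (negative) ones.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=0.0, scale=0.1, size=(5, 3))
X_neg = rng.normal(loc=1.0, scale=1.0, size=(100, 3))
W = bda_projection(X_pos, X_neg, n_components=1)
print(W.shape)
```

Because only the minority (positive) class must be modeled compactly, this asymmetric criterion is a natural fit for the extreme-imbalance setting the paper targets.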
4. Experimental Results
4.1. Dataset
4.2. Parameters Estimation
4.3. Phishing Messages Classification Results
- Stochastic Gradient Descent (SGD) [45]
- Decision Tree (DT) [46]
- Random Forest (RF) [47]
- Naive Bayes (NB) [48]
- Logistic Regression (LR) [49]
- Support Vector Machine (SVM) [51]
- Adaptive Boosting (AdaBoost) [54]
- Random Under-Sampling Boosting (RUSBoost) [55]
- Extreme Gradient Boosting (XGBoost) [56]
- Light Gradient Boosting Model (LGBM) [57]
- Convolutional Neural Network (CNN) [24]
- Convolutional Neural Network (CNN) and Gated Recurrent Unit (GRU) [23]
- Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) [24]
- Bidirectional Long Short-Term Memory (BiLSTM) [21]
- Synthetic Minority Over-sampling TEchnique (SMOTE) [58]
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
ANN | Artificial Neural Network
AdaBoost | Adaptive Boosting
BA | Balanced Accuracy
BDA | Biased Discriminant Analysis
BERT | Bidirectional Encoder Representations from Transformers
BiLSTM | Bidirectional Long Short-Term Memory
BOW | Bag of Words
CBoW | Continuous Bag of Words
CNN | Convolutional Neural Network
DOB-SCV | Distribution-Optimally-Balanced Stratified Cross-Validation
DT | Decision Tree
FDA | Fisher’s Discriminant Analysis
GRU | Gated Recurrent Unit
kNN | k-Nearest Neighbor
KoNLPy | Korean NLP in Python
LGBM | Light Gradient Boosting Model
LR | Logistic Regression
LSTM | Long Short-Term Memory
MLP | Multi-Layer Perceptron
NB | Naive Bayes
NLP | Natural Language Processing
NN | Nearest Neighbor
OCSVM | One-Class Support Vector Machine
PCA | Principal Component Analysis
RF | Random Forest
RUSBoost | Random Under-Sampling Boosting
SGD | Stochastic Gradient Descent
SMOTE | Synthetic Minority Over-sampling TEchnique
SNS | Social Network Service
SSSP | Small Sample Size Problem
SVM | Support Vector Machine
Word2Vec | Word to Vector
XGBoost | Extreme Gradient Boosting
References
- Kim, S.; Lee, Y.; Lee, B. A Study on Countermeasure against Telecommunication Financial Fraud. Police Sci. Inst. 2022, 36, 343–378. [Google Scholar] [CrossRef]
- Lee, H.J. A study on the new types of crime using smart phones and the police countermeasures. J. Korean Police Stud. 2012, 11, 319–344. [Google Scholar]
- Choi, Y.; Choi, S. Messenger Phishing Modus Operandi in South Korea. J. Korean Public Police Secur. Stud. 2021, 18, 241–258. [Google Scholar]
- Clement, J. Global Number of Mobile Messaging Users 2018–2022. 2019. Available online: https://www.statista.com/ (accessed on 11 April 2024).
- A critical analysis of cyber threats and their global impact. In Computational Intelligent Security in Wireless Communications; CRC Press: Boca Raton, FL, USA, 2023; pp. 201–220.
- Phishing evolves: Analyzing the enduring cybercrime. In The New Technology of Financial Crime; Routledge: London, UK, 2022; pp. 35–61.
- Annareddy, S.; Tammina, S. A comparative study of deep learning methods for spam detection. In Proceedings of the 2019 Third International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Palladam, India, 12–14 December 2019; pp. 66–72. [Google Scholar]
- Asaju, C.B.; Nkorabon, E.J.; Orah, R.O. Short message service (sms) spam detection and classification using naïve bayes. Int. J. Mechatron. Electr. Comput. Technol. (IJMEC) 2021, 11, 4931–4936. [Google Scholar]
- Gautam, S. Comparison of Feature Representation Schemes to Classify SMS Text using Data Balancing. Int. J. Mech. Eng. 2022, 7, 198–209. [Google Scholar]
- Navaney, P.; Dubey, G.; Rana, A. SMS spam filtering using supervised machine learning algorithms. In Proceedings of the 2018 8th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 11–12 January 2018; pp. 43–48. [Google Scholar]
- Turhanlar, M.; Acartürk, C. Detecting Turkish Phishing Attack with Machine Learning Algorithm. In Proceedings of the WEBIST, Online, 26–28 October 2021; pp. 577–584. [Google Scholar]
- Saeed, V.A. A Method for SMS Spam Message Detection Using Machine Learning. Artif. Intell. Robot. Dev. J. 2023, 3, 214–228. [Google Scholar] [CrossRef]
- Verma, S. Detection of Phishing in Mobile Instant Messaging Using Natural Language Processing and Machine Learning. Ph.D. Thesis, National College of Ireland, Dublin, Ireland, 2023. [Google Scholar]
- Choudhary, N.; Jain, A.K. Towards filtering of SMS spam messages using machine learning based technique. In Proceedings of the Advanced Informatics for Computing Research: First International Conference, ICAICR 2017, Jalandhar, India, 17–18 March 2017; Revised Selected Papers. Springer: Berlin/Heidelberg, Germany, 2017; pp. 18–30. [Google Scholar]
- Ora, A. Spam Detection in Short Message Service Using Natural Language Processing and Machine Learning Techniques. Ph.D. Thesis, National College of Ireland, Dublin, Ireland, 2020. [Google Scholar]
- Al Maruf, A.; Al Numan, A.; Haque, M.M.; Jidney, T.T.; Aung, Z. Ensemble Approach to Classify Spam SMS from Bengali Text. In Proceedings of the International Conference on Advances in Computing and Data Sciences, Kolkata, India, 27–28 April 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 440–453. [Google Scholar]
- Sonowal, G. Detecting phishing SMS based on multiple correlation algorithms. SN Comput. Sci. 2020, 1, 361. [Google Scholar] [CrossRef]
- Dwiyansaputra, R.; Nugraha, G.S.; Bimantoro, F.; Aranta, A. Deteksi SMS Spam Berbahasa Indonesia menggunakan TF-IDF dan Stochastic Gradient Descent Classifier. J. Teknol. Inf. Komput. Apl. (JTIKA) 2021, 3, 200–207. [Google Scholar]
- Mishra, S.; Soni, D. Implementation of ‘smishing detector’: An efficient model for smishing detection using neural network. SN Comput. Sci. 2022, 3, 189. [Google Scholar] [CrossRef]
- Gupta, M.; Bakliwal, A.; Agarwal, S.; Mehndiratta, P. A Comparative Study of Spam SMS Detection Using Machine Learning Classifiers. In Proceedings of the 2018 Eleventh International Conference on Contemporary Computing (IC3), Noida, India, 2–4 August 2018; pp. 1–7. [Google Scholar] [CrossRef]
- Abayomi-Alli, O.; Misra, S.; Abayomi-Alli, A. A deep learning method for automatic SMS spam classification: Performance of learning algorithms on indigenous dataset. Concurr. Comput. Pract. Exp. 2022, 34, e6989. [Google Scholar] [CrossRef]
- Han, H.; Li, Y.; Zhu, X. Convolutional neural network learning for generic data classification. Inf. Sci. 2019, 477, 448–465. [Google Scholar] [CrossRef]
- Ulfath, R.E.; Alqahtani, H.; Hammoudeh, M.; Sarker, I.H. Hybrid CNN-GRU framework with integrated pre-trained language transformer for SMS phishing detection. In Proceedings of the 5th International Conference on Future Networks & Distributed Systems, Dubai, United Arab Emirates, 15–16 December 2021; pp. 244–251. [Google Scholar]
- Ghourabi, A.; Mahmood, M.A.; Alzubi, Q.M. A hybrid CNN-LSTM model for SMS spam detection in Arabic and English messages. Future Internet 2020, 12, 156. [Google Scholar] [CrossRef]
- Kim, Y.B.; Chae, H.; Snyder, B.; Kim, Y.S. Training a Korean SRL system with rich morphological features. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Baltimore, MD, USA, 22–27 June 2014; pp. 637–642. [Google Scholar]
- Park, H.J.; Song, M.C.; Shin, K.S. Sentiment analysis of Korean reviews using CNN: Focusing on morpheme embedding. J. Intell. Inf. Syst. 2018, 24, 59–83. [Google Scholar]
- Lim, J.S.; Kim, J.M. An empirical comparison of machine learning models for classifying emotions in Korean Twitter. J. Korea Multimed. Soc. 2014, 17, 232–239. [Google Scholar] [CrossRef]
- Shim, K.S. Syllable-based POS tagging without Korean morphological analysis. Korean J. Cogn. Sci. 2011, 22, 327–345. [Google Scholar] [CrossRef]
- Thabtah, F.; Hammoud, S.; Kamalov, F.; Gonsalves, A. Data imbalance in classification: Experimental evaluation. Inf. Sci. 2020, 513, 429–441. [Google Scholar] [CrossRef]
- Ali, H.; Salleh, M.N.M.; Saedudin, R.; Hussain, K.; Mushtaq, M.F. Imbalance class problems in data mining: A review. Indones. J. Electr. Eng. Comput. Sci. 2019, 14, 1560–1571. [Google Scholar] [CrossRef]
- Abd Elrahman, S.M.; Abraham, A. A review of class imbalance problem. J. Netw. Innov. Comput. 2013, 1, 332–340. [Google Scholar]
- Zheng, Z.; Cai, Y.; Li, Y. Oversampling method for imbalanced classification. Comput. Inform. 2015, 34, 1017–1037. [Google Scholar]
- Zhou, X.S.; Huang, T.S. Small sample learning during multimedia retrieval using biasmap. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA, 8–14 December 2001; Volume 1, p. I–I. [Google Scholar]
- Baaqeel, H.; Zagrouba, R. Hybrid SMS Spam Filtering System Using Machine Learning Techniques. In Proceedings of the 2020 21st International Arab Conference on Information Technology (ACIT), Giza, Egypt, 28–30 November 2020; pp. 1–8. [Google Scholar]
- Martinez, A.; Kak, A. PCA versus LDA. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 228–233. [Google Scholar] [CrossRef]
- Choubineh, A.; Wood, D.A.; Choubineh, Z. Applying separately cost-sensitive learning and Fisher’s discriminant analysis to address the class imbalance problem: A case study involving a virtual gas pipeline SCADA system. Int. J. Crit. Infrastruct. Prot. 2020, 29, 100357. [Google Scholar] [CrossRef]
- Fukunaga, K. Introduction to Statistical Pattern Recognition, 2nd ed.; Academic Press: Cambridge, MA, USA, 1990. [Google Scholar]
- Turk, M.; Pentland, A. Eigenfaces for recognition. J. Cogn. Neurosci. 1991, 3, 71–86. [Google Scholar] [CrossRef]
- Moreno-Torres, J.G.; Sáez, J.A.; Herrera, F. Study on the impact of partition-induced dataset shift on k-fold cross-validation. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 1304–1312. [Google Scholar] [CrossRef]
- Brodersen, K.H.; Ong, C.S.; Stephan, K.E.; Buhmann, J.M. The balanced accuracy and its posterior distribution. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 3121–3124. [Google Scholar]
- Conversational Scenarios Collected and Refined from Twitter. 2023. Available online: http://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100/ (accessed on 8 May 2023).
- One-Shot Conversation Dataset Containing Korean Emotion Information. 2023. Available online: http://aicompanion.or.kr/nanum/tech/data_introduce.php?idx=47 (accessed on 9 May 2023).
- Chatbot Data for Korean. 2018. Available online: https://github.com/songys/Chatbot_data (accessed on 9 May 2023).
- Choi, S.I.; Lee, Y.; Kim, C. Confidence Measure Using Composite Features for Eye Detection in a Face Recognition System. IEEE Signal Process. Lett. 2015, 22, 225–228. [Google Scholar] [CrossRef]
- Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar] [CrossRef]
- Safavian, S.R.; Landgrebe, D. A survey of decision tree classifier methodology. IEEE Trans. Syst. Man, Cybern. 1991, 21, 660–674. [Google Scholar] [CrossRef]
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
- Rish, I. An empirical study of the naive Bayes classifier. In Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, Washington, DC, USA, 2–5 August 2001; Volume 3, pp. 41–46. [Google Scholar]
- Menard, S. Applied Logistic Regression Analysis; Number 106 in Sage university papers; Quantitative applications in the social sciences; Sage: Thousand Oaks, CA, USA, 1995. [Google Scholar]
- Zhang, M.L.; Zhou, Z.H. A k-nearest neighbor based algorithm for multi-label classification. In Proceedings of the 2005 IEEE International Conference on Granular Computing, Beijing, China, 25–27 July 2005; Volume 2, pp. 718–721. [Google Scholar]
- Noble, W.S. What is a support vector machine? Nat. Biotechnol. 2006, 24, 1565–1567. [Google Scholar] [CrossRef]
- Schölkopf, B.; Platt, J.C.; Shawe-Taylor, J.; Smola, A.J.; Williamson, R.C. Estimating the support of a high-dimensional distribution. Neural Comput. 2001, 13, 1443–1471. [Google Scholar] [CrossRef] [PubMed]
- Amer, M.; Goldstein, M.; Abdennadher, S. Enhancing one-class support vector machines for unsupervised anomaly detection. In Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, Chicago, IL, USA, 11–14 August 2013; pp. 8–15. [Google Scholar]
- Freund, Y.; Schapire, R.; Abe, N. A short introduction to boosting. J.-Jpn. Soc. Artif. Intell. 1999, 14, 1612. [Google Scholar]
- Seiffert, C.; Khoshgoftaar, T.M.; Van Hulse, J.; Napolitano, A. RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2009, 40, 185–197. [Google Scholar]
- Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. Lightgbm: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3149–3157. [Google Scholar]
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
- Wike, R.; Silver, L.; Fetterolf, J.; Huang, C.; Austin, S.; Clancy, L.; Gubbala, S. Social Media Seen as Mostly Good for Democracy Across Many Nations, but US is a Major Outlier; Pew Research Center: Washington, DC, USA, 2022; Volume 6. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
Data Balance | Algorithm | Embedding | Year | Language | # Pos | # Neg | Recall
---|---|---|---|---|---|---|---
Balanced | CNN [20] | TF-IDF | 2018 | English | 1000 | 1000 | 97.5%
| SGD [18] | TF-IDF | 2021 | Indonesian | 574 | 569 | 97.2%
| XGBoost [16] | TF-IDF | 2023 | Bengali | 300 | 250 | 82.6%
Imbalanced | RF [14] | BoW | 2017 | English | 429 | 2179 | 96.5%
| CNN [20] | TF-IDF | 2018 | English | 747 | 4827 | 96.4%
| SVM [10] | BoW | 2018 | English | 747 | 4827 | 96.4%
| AdaBoost+Kendall [17] | BoW | 2020 | English | 747 | 4831 | 91.4%
| k-means+SVM [34] | TF-IDF | 2020 | English | 747 | 4825 | 92.0%
| LGBM [15] | BoW, TF-IDF | 2020 | English | 747 | 4827 | 96.5%
| CNN+GRU [23] | BERT | 2021 | English | 747 | 4825 | 96.5%
| NB [8] | BoW | 2021 | English | 747 | 4778 | 97.3%
| BiLSTM [21] | CBoW+Word2Vec | 2022 | English | 3200 | 6792 | 91.7%
| SVM [9] | TF-IDF | 2022 | English | 747 | 4827 | 92.3%
| DT [12] | Word Embedding | 2023 | English | 747 | 4827 | 93.1%
| RF [13] | TF-IDF+Word2Vec | 2023 | English | 638 | 5333 | 85.0%
Extremely Imbalanced | CNN+LSTM [24] | Word Embedding | 2020 | Arabic | 785 | 7579 | 87.9%
| RF, LR [11] | TF-IDF, BoW | 2021 | Turkish | 119 | 3526 | 92.5%
| ANN [19] | TF-IDF | 2022 | English | 538 | 5320 | 92.4%
Algorithm | Training Recall | Training BA | Test Recall | Test BA | Gap Recall | Gap BA
---|---|---|---|---|---|---
SGD 1 | 96.75% | 98.37% | 91.38% | 95.67% | 5.37% | 2.70%
DT 2 | 99.84% | 99.92% | 88.46% | 94.17% | 11.38% | 5.75%
RF 3 | 99.84% | 99.92% | 87.64% | 93.82% | 12.20% | 6.10%
NB 4 | 93.01% | 95.56% | 73.66% | 85.12% | 19.35% | 10.44%
LR 5 | 94.11% | 97.05% | 88.78% | 94.38% | 5.33% | 2.67%
2NN 6 | 100.00% | 99.89% | 90.24% | 94.99% | 9.76% | 4.90%
3NN 6 | 89.51% | 94.75% | 82.76% | 91.37% | 6.75% | 3.38%
SVM 7 | 99.51% | 99.76% | 91.87% | 95.89% | 7.64% | 3.87%
OCSVM 8 | 89.51% | 90.33% | 90.08% | 89.40% | −0.57% | 0.93%
AdaBoost 9 | 91.14% | 95.55% | 88.62% | 94.27% | 2.52% | 1.28%
RUSBoost 10 | 94.11% | 96.32% | 91.22% | 94.93% | 2.89% | 1.39%
XGBoost 11 | 94.43% | 97.22% | 90.41% | 95.18% | 4.02% | 2.04%
LGBM 12 | 99.39% | 99.70% | 92.03% | 96.00% | 7.36% | 3.70%
CNN 13 | 90.28% | 93.56% | 88.94% | 92.58% | 1.34% | 0.98%
CNN+GRU 14 | 94.00% | 96.09% | 93.50% | 95.37% | 0.50% | 0.72%
CNN+LSTM 15 | 98.62% | 99.12% | 92.36% | 95.86% | 6.26% | 3.26%
BiLSTM 16 | 97.89% | 98.17% | 94.96% | 96.38% | 2.93% | 1.79%
SMOTE 17 | 99.96% | 99.67% | 94.31% | 96.66% | 5.65% | 3.01%
Proposed Method 18 | 98.58% | 98.56% | 95.45% | 96.85% | 3.13% | 1.71%
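The Recall and BA columns follow the standard definitions: recall is the true-positive rate on the phishing class, and balanced accuracy averages the true-positive and true-negative rates, which keeps the score meaningful under extreme class imbalance (where plain accuracy is dominated by the majority class). A minimal sketch, with the function name chosen here for illustration:

```python
def recall_and_balanced_accuracy(y_true, y_pred, positive=1):
    # Confusion-matrix counts for the positive (phishing) class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # recall on phishing class
    tnr = tn / (tn + fp) if tn + fp else 0.0  # recall on normal class
    return tpr, (tpr + tnr) / 2  # (recall, balanced accuracy)

# 2 of 3 phishing messages caught, 9 of 10 normal messages passed:
# recall = 2/3 ~ 66.7%, BA = (2/3 + 0.9) / 2 ~ 78.3%
y_true = [1, 1, 1] + [0] * 10
y_pred = [1, 1, 0] + [0] * 9 + [1]
print(recall_and_balanced_accuracy(y_true, y_pred))
```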
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kim, S.; Park, J.; Ahn, H.; Lee, Y. Detection of Korean Phishing Messages Using Biased Discriminant Analysis under Extreme Class Imbalance Problem. Information 2024, 15, 265. https://doi.org/10.3390/info15050265