Malware Classification Using Few-Shot Learning Approach
Abstract
1. Introduction
2. Literature Review
2.1. Malware and Its Types
- (1) Virus: A malware that attaches itself to other software, spreading when that software is installed. It can capture keystrokes, corrupt data, and consume system resources, often infecting computers through malicious email attachments.
- (2) Trojan: This malware masquerades as legitimate software, infiltrating systems via free downloads or email attachments. Once activated, it can monitor user activity, breach networks, or steal sensitive data, often indicated by unusual device behavior.
- (3) Worm: A self-replicating malware that spreads from one computer to another without human intervention, typically using an Internet connection or a Local Area Network (LAN).
- (4) Spyware: This software collects personal data from users and transmits it to third parties without consent. While some spyware may have legitimate purposes, malicious spyware aims to profit from stolen data, leaving users vulnerable to data breaches.
- (5) Ransomware: A type of malware that locks users out of their files, demanding a ransom for access. Victims, including organizations, often pay to recover their data, with some variants also stealing data to increase the pressure to pay.
2.2. Malware Classification Using ML Methods
2.3. Traditional Malware Models
2.4. Malware Classification Using DL Techniques
2.5. The Few-Shot Learning Technique
2.6. Few-Shot Malware Models
3. Materials and Methods
3.1. Artificial Neural Networks (ANNs)
3.2. K-Nearest Neighbors (KNN)
- (1) Euclidean distance: This is the Cartesian distance between two points in the plane/hyperplane; it can be visualized as the length of the straight line joining the two points under consideration. For two points A (x1, y1) and B (x2, y2), the Euclidean distance is given by Formula (1) below.
- (2) Manhattan distance: The Manhattan distance metric is generally used when we are interested in the total distance traveled by an object rather than its displacement. It is computed by summing the absolute differences between the coordinates of the points in n dimensions. For two points A (x1, y1) and B (x2, y2), the Manhattan distance is given by Formula (2) below.
3.3. The Proposed Methodology
3.3.1. Data Collection
3.3.2. Data Preprocessing
3.3.3. The Proposed Model
3.3.4. Performance Evaluation
- TP (true positive): The model classifies the sample as positive and its classification is correct.
- TN (true negative): The model classifies the sample as negative and its classification is correct.
- FP (false positive): The model classifies the sample as positive and its classification is wrong.
- FN (false negative): The model classifies the sample as negative and its classification is wrong.
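From these four counts, the usual evaluation metrics (accuracy, precision, recall, and F1-score) follow directly. The short Python sketch below illustrates the computation; the function and variable names are illustrative and not taken from the paper's code.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Example with hypothetical counts for the malware class:
print(classification_metrics(tp=95, tn=93, fp=5, fn=7))
```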
4. Results and Discussion
4.1. Results of Dataset 1 (the Malware Dataset)
- (1) Decision tree: An accuracy of 0.85 indicates moderate performance. Decision trees have comparatively simple structures, can overfit or underfit depending on their depth, and may not capture complex patterns as effectively as more complex models, which explains their lower accuracy relative to the other models.
- (2) Random forest: The accuracy improved over that of the decision tree, reaching 0.88, because random forests combine the results of multiple decision trees to improve the generalization ability of the model.
- (3) Logistic regression and SVM: Both models achieved an accuracy of 0.94 and thus a high performance, as these models handle linear decision boundaries and engineered features effectively.
- (4) The proposed model: The highest accuracy of 0.97 indicates that this model not only captures the basic patterns in the data but also handles more complex and subtle distinctions between classes. This performance is the result of a hybrid approach that combines an artificial neural network (ANN) and a K-nearest neighbors (KNN) algorithm, both trained using few-shot learning techniques. By combining KNN and the ANN, the model leverages the strengths of both: the ability of the ANN to learn complex high-level features and the simplicity and effectiveness of KNN in making final predictions based on proximity in the feature space. Two factors drive this result:
- (1) Feature transformation by the ANN: Dense ANN layers transform the raw features into a higher-dimensional space where important patterns are more easily separated. This ensures that when KNN is applied, the nearest neighbors are selected based on these well-extracted, relevant features, resulting in more accurate classifications (a code sketch of this pipeline follows the list).
- (2) Augmented generalization using few-shot learning: Few-shot learning helps the ANN and KNN generalize well from limited data, reducing the risk of overfitting and improving the robustness of the model.
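The following is a minimal sketch of this kind of ANN-plus-KNN pipeline, assuming a small dense network whose penultimate-layer embeddings feed a KNN classifier. The layer sizes, feature dimension, and training data below are illustrative placeholders, not the paper's exact architecture or datasets.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from tensorflow import keras

# Illustrative only: a small dense ANN whose penultimate layer serves as a feature extractor.
def build_embedding_model(n_features: int, embed_dim: int = 32) -> keras.Model:
    inputs = keras.Input(shape=(n_features,))
    x = keras.layers.Dense(128, activation="relu")(inputs)
    x = keras.layers.Dense(64, activation="relu")(x)
    embedding = keras.layers.Dense(embed_dim, activation="relu", name="embedding")(x)
    outputs = keras.layers.Dense(1, activation="sigmoid")(embedding)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Hypothetical data: X_support/y_support stand in for the small labelled "few-shot" set.
rng = np.random.default_rng(0)
X_support, y_support = rng.normal(size=(100, 55)), rng.integers(0, 2, 100)
X_query = rng.normal(size=(20, 55))

# Train the ANN on the support set, then reuse its embedding layer for KNN.
model = build_embedding_model(n_features=55)
model.fit(X_support, y_support, epochs=20, batch_size=16, verbose=0)
embedder = keras.Model(model.input, model.get_layer("embedding").output)

# Final predictions come from KNN applied in the learned feature space.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(embedder.predict(X_support, verbose=0), y_support)
predictions = knn.predict(embedder.predict(X_query, verbose=0))
```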
4.2. Results of Dataset 2 (the TUANDROMD Dataset)
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Gopinath, M.; Sethuraman, S.C. A comprehensive survey on deep learning based malware detection techniques. Comput. Sci. Rev. 2023, 47, 100529.
- Ryan, M. Ransomware Revolution: The Rise of a Prodigious Cyber Threat (Vol. 85); Springer: Berlin/Heidelberg, Germany, 2021.
- Wang, P.; Tang, Z.; Wang, J. A novel few-shot malware classification approach for unknown family recognition with multi-prototype modeling. Comput. Secur. 2021, 106, 102273.
- Thangaraj, M.; Sivakami, M. Text classification techniques: A literature review. Interdiscip. J. Inf. Knowl. Manag. 2018, 13, 117–135.
- Gorade, S.M.; Deo, A.; Purohit, P. A study of some data mining classification techniques. Int. Res. J. Eng. Technol. 2017, 4, 3112–3115.
- Gupta, S.; Kumar, D.; Sharma, A. Data mining classification techniques applied for breast cancer diagnosis and prognosis. Indian J. Comput. Sci. Eng. IJCSE 2011, 2, 188–195.
- Kruczkowski, M.; Szynkiewicz, E.N. Support Vector Machine for Malware Analysis and Classification. In Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), Warsaw, Poland, 11–14 August 2014; pp. 415–420.
- Choi, S. Combined kNN classification and hierarchical similarity hash for fast malware detection. Appl. Sci. 2020, 10, 5173.
- Aboaoja, F.A.; Zainal, A.; Ghaleb, F.A.; Al-Rimy, B.A.S.; Eisa, T.A.E.; Elnour, A.A.H. Malware detection issues, challenges, and future directions: A survey. Appl. Sci. 2022, 12, 8482.
- Aslan, O.; Samet, R. A comprehensive review on malware detection approaches. IEEE Access 2020, 8, 6249–6271.
- Tang, Z.; Wang, P.; Wang, J. ConvProtoNet: Deep prototype induction towards better class representation for few-shot malware classification. Appl. Sci. 2020, 10, 2847.
- Gao, T.; Han, X.; Liu, Z.; Sun, M. Hybrid attention-based prototypical networks for noisy few-shot relation classification. Proc. AAAI Conf. Artif. Intell. 2019, 33, 6407–6414.
- Abusitta, A.; Li, M.Q.; Fung, B.C. Malware classification and composition analysis: A survey of recent developments. J. Inf. Secur. Appl. 2021, 59, 102828.
- Naseer, M.; Rusdi, J.F.; Shanono, N.M.; Salam, S.; Muslim, Z.B.; Abu, N.A.; Abadi, I. Malware detection: Issues and challenges. J. Phys. Conf. Ser. 2021, 1807, 012011.
- Yang, L.; Li, Y.; Wang, J.; Xiong, N.N. FSLM: An intelligent few-shot learning model based on Siamese networks for IoT technology. IEEE Internet Things J. 2020, 8, 9717–9729.
- Zhou, X.; Liang, W.; Shimizu, S.; Ma, J.; Jin, Q. Siamese neural network based few-shot learning for anomaly detection in industrial cyber-physical systems. IEEE Trans. Ind. Inform. 2021, 17, 5790–5798.
- Bedi, P.; Gupta, N.; Jindal, V. Siam-IDS: Handling class imbalance problem in intrusion detection systems using Siamese neural network. Procedia Comput. Sci. 2020, 171, 780–789.
- Zhou, X.; Hu, Y.; Liang, W.; Ma, J.; Jin, Q. Variational LSTM enhanced anomaly detection for industrial big data. IEEE Trans. Ind. Inform. 2021, 17, 3469–3477.
- Conti, M.; Khandhar, S.; Vinod, P. A few-shot malware classification approach for unknown family recognition using malware feature visualization. Comput. Secur. 2022, 122, 102887.
- Oprea, S.-V.; Bâra, A. Detecting Malicious Uniform Resource Locators Using an Applied Intelligence Framework. Comput. Mater. Contin. 2024, 79, 3827–3853.
- Oprea, S.-V.; Bâra, A. A Recommendation System for Prosumers Based on Large Language Models. Sensors 2024, 24, 3530.
- Rieck, K.; Holz, T.; Willems, C.; Düssel, P.; Laskov, P. Learning and classification of malware behavior. In Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, Paris, France, 10–11 July 2008; Springer: Berlin/Heidelberg, Germany; pp. 108–125.
- Zhu, J.; Jang-Jaccard, J.; Singh, A.; Welch, I.; Al-Sahaf, H.; Camtepe, S. A few-shot meta-learning based siamese neural network using entropy features for ransomware classification. Comput. Secur. 2022, 117, 102691.
- Jung, H.M.; Kim, K.-B.; Cho, H.-J. A study of Android malware detection techniques in virtual environment. Clust. Comput. 2016, 19, 2295–2304.
- Ye, H.-J.; Hu, H.; Zhan, D.-C.; Sha, F. Few-shot learning via embedding adaptation with set-to-set functions. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020.
- Yoo, S.; Kim, S.; Kim, S.; Kang, B.B. Ai-Hydra: Advanced hybrid approach using random forest and Deep Learning for malware classification. Inf. Sci. 2021, 546, 420–435.
- Anderson, H.S.; Kharkar, A.; Filar, B.; Roth, P. Evading machine learning malware detection. Black Hat 2017, 2017, 1–6.
- Goncalves, E.C.; Freitas, A.A.; Plastino, A. A survey of genetic algorithms for multi-label classification. In Proceedings of the 2018 IEEE Congress on Evolutionary Computation (CEC), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–8.
- Galván, E.; Mooney, P. Neuroevolution in deep neural networks: Current trends and future challenges. IEEE Trans. Artif. Intell. 2021, 2, 476–493.
- Thakur, A.; Konde, A. Fundamentals of neural networks. Int. J. Res. Appl. Sci. Eng. Technol. 2021, 9, 407–426.
- da Silveira Bohrer, J.; Grisci, B.I.; Dorn, M. Neuroevolution of neural network architectures using CoDeepNEAT and keras. arXiv 2020, arXiv:2002.04634.
- Islam, M.; Chen, G.; Jin, S. An overview of neural network. Am. J. Neural Netw. Appl. 2019, 5, 7–11.
- Fernández, A.; García, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Learning from Imbalanced Data Sets; Springer: Cham, Switzerland, 2018; Volume 10.
- Mohammed, A.J.; Hassan, M.M.; Kadir, D.H. Improving classification performance for a novel imbalanced medical dataset using SMOTE method. Int. J. Adv. Trends Comput. Sci. Eng. 2020, 9, 3161–3172.
- Chehal, D.; Gupta, P.; Gulati, P.; Gupta, T. Comparative Study of Missing Value Imputation Techniques on E-Commerce Product Ratings. Informatica 2023, 47, 373–382.
- Chakrabarti, S.; Biswas, N.; Karnani, K.; Padul, V.; Jones, L.D.; Kesari, S.; Ashili, S. Binned Data Provide Better Imputation of Missing Time Series Data from Wearables. Sensors 2023, 23, 1454.
- L’heureux, A.; Grolinger, K.; Elyamany, H.F.; Capretz, M.A. Machine learning with big data: Challenges and approaches. IEEE Access 2017, 5, 7776–7797.
- Misra, P.; Yadav, A.S. Impact of preprocessing methods on healthcare predictions. In Proceedings of the 2nd International Conference on Advanced Computing and Software Engineering (ICACSE), Sultanpur, India, 8–9 February 2019.
- Haghighi, S.; Jasemi, M.; Hessabi, S.; Zolanvari, A. PyCM: Multiclass confusion matrix library in Python. J. Open Source Softw. 2018, 3, 729.
- Markoulidakis, I.; Kopsiaftis, G.; Rallis, I.; Georgoulas, I. Multi-class confusion matrix reduction method and its application on net promoter score classification problem. In Proceedings of the 14th PErvasive Technologies Related to Assistive Environments Conference, Corfu, Greece, 29 June–2 July 2021; pp. 412–419.
- Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; Gehler, P. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14318–14328.
- MacEachern, S.J.; Forkert, N.D. Machine learning for precision medicine. Genome 2021, 64, 416–425.
- Fu, G.; Sun, P.; Zhu, W.; Yang, J.; Cao, Y.; Yang, M.Y.; Cao, Y. A deep-learning-based approach for fast and robust steel surface defects classification. Opt. Lasers Eng. 2019, 121, 397–405.
Reference | Year | Objective | Dataset | Techniques | Issues |
---|---|---|---|---|---|
[1] | 2022 | Malware detection | Microsoft Malware Classification Challenge (2015) | RNN, CNN, DT, RF | Resolving avoidance strategies in static classifiers |
[4,5,6] | 2021 | Classification techniques | Malware Detection | SVM, neural networks, Naive Bayesian classifiers, K-nearest neighbors | Predictive analysis |
[9] | 2020 | Fast time in malware detection and classification | AI-based malware detection dataset | KNN | Dynamic analysis problems |
[10] | 2021 | Malware classification framework based on deep learning algorithms | Malimg dataset, Microsoft Big 2015 dataset, and Malevis dataset | Deep learning | Converting from adaptable into conventional methods |
[18] | 2021 | An SVM for malware detection | Malware detection | Support Vector Machines | Efficient malware detection in heterogeneous web datasets |
[22] | 2022 | Few-shot learning for malware classification | Android-based malware dataset | Memory-Augmented Neural Network, Natural Language Processing | Improved accuracy with a small number of cases |
[24] | 2021 | Advancements in machine learning models for malware classification | Malware classification | RF, DL | Shift from traditional to adaptive approaches |
[25] | 2019 | Efficacy of the latest FSL methods in malware detection | Classification dataset | Deep learning, data mining, big data methods | Higher accuracy rates |
[26] | 2019 | Adaptive learning | AI-based malware detection | K-nearest neighbors, hierarchical similarity hash | Compromise between accuracy and speed |
[28] | 2022 | Few-shot models for ransomware defense | Malware feature dataset | Deep learning techniques | Application in ransomware defense issue |
Algorithm | Category | Precision | Recall | F1_Score
---|---|---|---|---
Decision Tree | Malware | 0.96 | 0.73 | 0.83
Decision Tree | Benign | 0.78 | 0.97 | 0.87
Random Forest | Malware | 0.81 | 0.99 | 0.89
Random Forest | Benign | 0.99 | 0.77 | 0.87
Logistic Regression | Malware | 0.95 | 0.92 | 0.94
Logistic Regression | Benign | 0.93 | 0.95 | 0.94
SVM | Malware | 0.96 | 0.92 | 0.94
SVM | Benign | 0.93 | 0.96 | 0.94
Category | Precision | Recall | F1_Score
---|---|---|---
Malware | 0.97 | 0.97 | 0.97
Benign | 0.97 | 0.97 | 0.97
Algorithm | Category | Precision | Recall | F1_Score
---|---|---|---|---
Decision Tree | Malware | 0.75 | 0.92 | 0.82
Decision Tree | Benign | 0.91 | 0.71 | 0.80
Random Forest | Malware | 0.90 | 0.95 | 0.92
Random Forest | Benign | 0.95 | 0.90 | 0.93
Logistic Regression | Malware | 0.92 | 0.86 | 0.89
Logistic Regression | Benign | 0.88 | 0.93 | 0.90
SVM | Malware | 0.93 | 0.88 | 0.90
SVM | Benign | 0.90 | 0.94 | 0.92
Category | Precision | Recall | F1_Score
---|---|---|---
Malware | 0.97 | 0.97 | 0.97
Benign | 0.97 | 0.97 | 0.97