A Streamlined Framework of Metamorphic Malware Classification via Sampling and Parallel Processing
Abstract
1. Introduction
- (1)
- We propose extracting a small proportion of samples from the entire dataset according to selection criteria and constructing a simple, efficient feature vector from the assembly (ASM) files that reflects the original dataset. The final evaluations show that this lightweight feature vector not only reduces the complexity of feature engineering but also satisfies the classification requirements.
- (2)
- We propose a parallel processing approach that relies on commonly available hardware and combines multi-core collaboration with active task recommendation. The parallel strategy runs on an ordinary personal computer without high-performance hardware, opening the door for analysts to use general-purpose machines to tackle the heavy workload imposed by the sheer volume of malware.
- (3)
- We conduct systematic assessments on the Microsoft Kaggle malware dataset. The classification accuracy reaches up to 98.53%, and the parallel processing technique reduces processing time by 37.60% compared with the conventional serial mode. MalSEF delivers performance comparable to the winner of the challenge competition while effectively simplifying the feature space, and it outperforms existing algorithms in terms of simplicity and efficiency.
2. Overview of Related Studies
2.1. Background
2.1.1. Metamorphic Techniques
- (1)
- Instruction reordering
- (2)
- Trash code or dead code insertion
- (3)
- Instruction substitution (a toy illustration of dead-code insertion and instruction substitution at the Opcode level is sketched below)
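To make these transformations concrete, the short Python snippet below applies dead-code insertion and instruction substitution to a made-up opcode sequence; the sequences and the chosen substitution are purely illustrative and are not drawn from the paper's dataset.

```python
# Illustrative only: toy opcode sequences, not real disassembly.
from collections import Counter

original = ["push", "mov", "xor", "call", "add", "pop", "ret"]

# Dead-code insertion: a "nop" and a self-cancelling push/pop pair change the
# byte and opcode sequence without changing the program's behaviour.
variant = ["push", "nop", "mov", "push", "pop", "xor", "call", "add", "pop", "ret"]

# Instruction substitution: replace one mnemonic with a semantically equivalent
# one (here "add" is swapped for "sub" of a negated operand, shown only at the
# opcode level).
variant = ["sub" if op == "add" else op for op in variant]

print(Counter(original))  # frequency profile of the original
print(Counter(variant))   # signature-level n-grams differ, but the coarse
                          # opcode-frequency profile remains largely similar
```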
2.1.2. Methods for Malware Analysis
2.1.3. Parallel Processing Techniques
- (1)
- Task parallelism: if independent modules within a complete working process execute in parallel, this mode of parallel processing is called task parallelism. As shown in the data flow diagram of Figure 4, when modules A and B execute in parallel, this is task parallelism.
- (2)
- Pipeline parallelism: when a series of connected modules (forming a complete working process) handle independent data elements in parallel (these data elements are usually a time series or independent subsets of a dataset), this mode of parallel processing is called pipeline parallelism. As shown in Figure 4, when modules A, C, and D execute in parallel, this is pipeline parallelism.
- (3)
- Data parallelism: when a dataset can be divided into a number of subsets and these subsets are processed simultaneously, this mode of parallel processing is called data parallelism. As shown in Figure 4, if module B reads data in parallel, this is data parallelism. A minimal sketch of this pattern is given after this list.
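The sketch below illustrates the data-parallel pattern; the file names and the per-file `count_opcodes` helper are hypothetical placeholders used only to show how subsets of a dataset can be processed simultaneously with Python's multiprocessing module.

```python
# Minimal data-parallelism sketch: independent files are processed by a pool of
# worker processes and the partial results are merged afterwards.
from multiprocessing import Pool
from collections import Counter

def count_opcodes(asm_path):
    """Toy per-file task: count whitespace-separated tokens in one ASM file."""
    counts = Counter()
    with open(asm_path, "r", errors="ignore") as f:
        for line in f:
            counts.update(line.split())
    return counts

if __name__ == "__main__":
    files = [f"sample_{i:02d}.asm" for i in range(32)]   # hypothetical file names
    with Pool(processes=8) as pool:                      # one worker per core
        partial = pool.map(count_opcodes, files)         # each file is an independent subset
    total = sum(partial, Counter())                      # merge the partial counts
    print(total.most_common(10))
```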
2.2. Related Studies
- (1)
- The feature vectors constructed by these methods can usually represent the characteristics of malware effectively. However, when applied to massive amounts of malware, the feature space becomes too large and places a heavy burden on analysts. For example, the feature vector proposed in [28] achieves excellent classification accuracy, but it is too large to be handled with ordinary computing resources.
- (2)
- The parallel processing methods above usually apply multiple classifiers for parallel detection. In essence, the analysis of samples remains a serial process, and the goal of analyzing massive samples in parallel is not achieved. When faced with large-scale malware, this inevitably limits analysis efficiency.
3. Overview of MalSEF
3.1. Motivation
- (1)
- The complexity of feature engineering may increase sharply due to the large scale of malware variants. The complexity is mainly derived from two aspects: (1) the feature set of cutting-edge analysis methods is usually fairly complicated, because APIs or Opcodes, or their combinations (n-grams), are typically used as the feature vector to profile malware, and the number of APIs and Opcodes on the Win32 platform is relatively large; (2) the increasing volume of malware variants further aggravates this complexity, especially in the current massive-malware environment.
- (2)
- The efficiency of coping with the sheer number of malware samples cannot be guaranteed. Existing detection methods mainly consist of a training stage and a detection stage. The two stages are implemented separately, and the detection process is essentially serial, so it inevitably struggles to keep up in a large-scale malware environment.
- (1)
- How to establish a simple feature set for the large number of malware variants so as to alleviate the computation cost and deliver satisfactory classification performance simultaneously?
- (2)
- How to efficiently handle numerous malware variants and classify them into their homologous families to ensure the efficiency of the classification process?
- (3)
- How to implement the complicated task of analyzing a huge amount of malware on an ordinary personal computer, so that researchers without high-performance hardware can accomplish what seems impossible under the conventional paradigm?
3.2. Parallel Processing Model of Massive Malware Classification
- (1)
- In the training stage, the process of extracting features from the training set includes “assembly commands of samples → extracting Opcode lists → counting the occurrences of every Opcode → generating feature vectors → generating feature matrices”, and it can be implemented in parallel.
- (2)
- In the detection stage, the process applied to unknown samples includes “assembly commands of samples → generating feature vectors → classification”, and it can also be implemented in parallel.
3.3. Overall Framework of MalSEF
- (1)
- Sampling of massive samples
- (2)
- Feature extraction
- (3)
- Feature matrix generation
- (4)
- Classification of massive malware samples
4. Detailed Implementations of MalSEF
4.1. Sampling from the Massive Samples Set and Feature Extraction
4.1.1. Sampling from the Massive Sample Set
Algorithm 1: Sample a subset from the original dataset
//S represents the original dataset.
//Z represents the set of standard normal deviates for the desired confidence level of each class in the original dataset.
//P represents the set of assumed proportions in the target population of each class in the original dataset.
//E represents the set of degrees of accuracy desired in the estimated proportion of each class in the original dataset.
Input: the original malware dataset S, Z, P, E
Output: the sampled malware subset S’
1: trainLabels = readDataset() //read the original dataset
2: labels = getLabels(S) //read the labels of the original dataset
3: for i = 1 to |labels| do
4:   mids = trainLabels[trainLabels.Class == i] //get the samples of Class == i
5:   mids = mids.reset_index() //reset the index of the samples of Class == i
6:   n_i = Z_i^2 · P_i · (1 − P_i) / E_i^2 //calculate the size of the subset for the ith class
7:   for j = 1 to n_i do
8:     rchoice = randint(0, |mids|) //randomly select a sample index for Class == i
9:     rids = mids[1, rchoice] //build the subset for Class == i
10:   end for
11:   S’.append(rids) //append the subset of Class == i to S’
12: end for
13: return S’
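A runnable Python sketch of this sampling step is given below. It assumes the Kaggle trainLabels.csv layout (columns Id and Class) and uses Cochran's sample-size formula for the confidence-level, proportion, and accuracy parameters named in the comments of Algorithm 1; the exact formula and parameter values used by the authors are not stated in the text, so both are assumptions here.

```python
# Hedged sketch of Algorithm 1: stratified random sampling of the training set.
# Assumptions: trainLabels.csv has columns "Id" and "Class", and the per-class
# subset size follows Cochran's formula n = Z^2 * P * (1 - P) / E^2.
import math
import pandas as pd

def cochran_size(z=1.96, p=0.5, e=0.1):
    """Sample size for confidence level z, assumed proportion p, accuracy e."""
    return math.ceil(z * z * p * (1.0 - p) / (e * e))

def sample_subset(label_csv="trainLabels.csv", z=1.96, p=0.5, e=0.1, seed=0):
    train_labels = pd.read_csv(label_csv)
    parts = []
    for family in sorted(train_labels["Class"].unique()):
        mids = train_labels[train_labels["Class"] == family].reset_index(drop=True)
        n_i = min(cochran_size(z, p, e), len(mids))           # never request more than exists
        parts.append(mids.sample(n=n_i, random_state=seed))   # random rows of this class
    return pd.concat(parts, ignore_index=True)

if __name__ == "__main__":
    subtrain = sample_subset()
    print(subtrain["Class"].value_counts().sort_index())
```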
4.1.2. Feature Extraction and Feature Vectors Construction
- (1)
- Analyze samples in the subset one by one and extract Opcodes from each sample;
- (2)
- Count the occurrences of each Opcode in the subset;
- (3)
- Sort Opcodes based on their occurrences;
- (4)
- Select the Top-N Opcodes as the feature vector (the value of N can be set flexibly by the researcher);
- (5)
- Construct the feature matrix by counting the occurrences of each selected Opcode in each sample and taking these counts as the feature values.
Algorithm 2: Feature matrix construction
Input: assembly programs of the malware samples
Output: the feature matrix M
1: for i = 1 to n do //n: the number of samples to be analyzed
2:   //Append the opcodes of the analyzed program to the Opcode list
3:   OpList_i = extractOpcodes(ASM_i)
4:   //Build the opcode sequence of all the programs
5:   OpSeq = OpSeq + OpList_i
6:   //Count the occurrence times of each Opcode
7:   OpCount = countOccurrences(OpSeq)
8: end for
9: //Sort the Opcode sequence based on occurrences
10: OpSorted = sortByOccurrence(OpCount)
11: //Select the ranked Top-N Opcodes as the feature vector and their occurrences as the feature values
12: M = selectTopN(OpSorted, N)
13: return M
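The following Python sketch shows one way to realize Algorithm 2. The regular expression used to pull a mnemonic out of each line of an .asm file is a simplification, and the Top-N default of 300 merely mirrors the best-performing setting reported in Section 5; the authors' exact parsing rules are not given in the text. The resulting rows can then be fed to any of the classifiers evaluated later (RF, DT, SVM, XGBoost).

```python
# Hedged sketch of Algorithm 2: build a Top-N opcode-frequency feature matrix.
# The opcode-extraction regex is a simplification of real .asm parsing.
import re
from collections import Counter

OPCODE_RE = re.compile(r"\s([a-z]{2,7})\s")   # plausible, not exhaustive, mnemonic pattern

def extract_opcodes(asm_path):
    """Return the list of opcodes found in one assembly file."""
    ops = []
    with open(asm_path, "r", errors="ignore") as f:
        for line in f:
            match = OPCODE_RE.search(line)
            if match:
                ops.append(match.group(1))
    return ops

def build_feature_matrix(asm_paths, top_n=300):
    per_sample = [Counter(extract_opcodes(p)) for p in asm_paths]   # counts per sample
    overall = sum(per_sample, Counter())                            # counts over the subset
    feature_ops = [op for op, _ in overall.most_common(top_n)]      # ranked Top-N opcodes
    matrix = [[counts[op] for op in feature_ops] for counts in per_sample]
    return feature_ops, matrix
```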
4.2. Feature Extraction with Multi-Core Collaboration and Active Recommendation in Parallel
- (1)
- Within the two parallel procedures described in Section 3.2, the tasks in each procedure are loosely coupled with one another.
- (2)
- In the two parallel procedures described in Section 3.2, the distributed parallel task load should be handled iteratively in practice, because the number of tasks typically far exceeds the number of cores that can run at once.
- (3)
- During parallel processing, different tasks inevitably run at different speeds because of differences in the running environment and in computational load. Tasks should therefore be distributed adaptively according to the actual performance of each running node.
- (1)
- Construct the analysis sample queue from the malware sample sets; the queue can be denoted as Q = {s1, s2, …, sn}, where si is the ith sample awaiting analysis.
- (2)
- Construct a multi-core resource pool from the computing resource nodes, each of which serves as a processing core; the core pool can be denoted as C = {c1, c2, …, cm}.
- (3)
- Query the currently available resources in the resource pool and establish a queue of currently available resources, A ⊆ C.
- (4)
- According to the currently available resource queue, the master node (the node that stores the sample sets) fetches samples from the sample sets and allocates them to the resource queue for processing.
- (5)
- During parallel task processing, each node monitors its own running state in real time, maintains and updates its own state vector, and exchanges this state vector with the other nodes. Based on the state vectors exchanged in real time, a global state matrix is formed across the cores. Each node's state vector consists of the following elements:
- (1)
- ci: computational resource consumption of node i;
- (2)
- mi: storage resource consumption of node i;
- (3)
- bi: bandwidth resource consumption of node i;
- (4)
- ti: the number of tasks assigned to node i;
- (5)
- pi: the progress of the current task on node i.
- (6)
- The master node continuously monitors the state matrix and sorts the samples to be analyzed. Once a node finishes its task, it notifies the master node, which then assigns new samples to that node so that it can start a new processing task. In this way, our method realizes real-time, active pushing of processing tasks.
- (7)
- Repeat the above scheme until all the samples have been processed; a minimal sketch of this scheduling loop is given below.
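To make the scheduling loop concrete, here is a minimal sketch of the master/worker interaction, assuming Python multiprocessing queues stand in for the inter-core state exchange; the full state vector (CPU, memory, and bandwidth consumption, task count, progress) is reduced to a simple completion notification, and the per-sample analysis is left as a placeholder.

```python
# Hedged sketch of active task pushing: each worker has its own inbox queue, and
# the master pushes the next sample to a worker as soon as that worker reports
# completion on the shared "done" queue.
from multiprocessing import Process, Queue

def worker(worker_id, inbox, done):
    while True:
        sample = inbox.get()
        if sample is None:                  # poison pill: no more work
            break
        # ... analyze the sample here (extract opcodes, build its feature vector) ...
        done.put(worker_id)                 # notify the master that this node is idle

def master(samples, n_workers=8):
    done = Queue()
    inboxes = [Queue() for _ in range(n_workers)]
    procs = [Process(target=worker, args=(i, inboxes[i], done)) for i in range(n_workers)]
    for p in procs:
        p.start()
    pending = list(samples)
    outstanding = 0
    for i in range(n_workers):              # prime every node with one sample
        if pending:
            inboxes[i].put(pending.pop(0))
            outstanding += 1
    while outstanding:
        idle = done.get()                   # a node reports completion
        outstanding -= 1
        if pending:                         # actively push the next sample to that node
            inboxes[idle].put(pending.pop(0))
            outstanding += 1
    for q in inboxes:                       # shut the workers down
        q.put(None)
    for p in procs:
        p.join()

if __name__ == "__main__":
    master([f"sample_{i}.asm" for i in range(100)])   # hypothetical sample names
```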
5. Evaluation
5.1. Experimental Configuration
5.1.1. Dataset
5.1.2. Classifier
5.2. Experimental Results and Discussion
5.2.1. Classification Using Features Derived from the Original Set and the Sampled Subset
Composition of the Sampled Dataset—Subtrain
Evaluation of the Classification Results Using Features Extracted from the Entire Train Dataset
Evaluation of the Classification Results Using Features Extracted from the Sampled Subtrain Dataset
5.2.2. Experimental Results of Parallel Processing
Experimental Results of Parallel Processing of Train Dataset with Top-N Opcodes Extracted from Subtrain
- (1)
- The smaller the selected Top-N, the shorter the processing time, because a shorter feature vector entails less processing workload and hence less time consumption.
- (2)
- The parallel processing time is shortest when 16 processes are generated and 32 samples are handled per operation, i.e., 2 samples are allocated to each process. This is because the personal workstation used in our experiments has eight cores; with this setting, the computing resources are utilized most fully and the best results are obtained.
Experimental Results of Parallel Processing of Train Dataset with Top-N Opcodes Extracted from Train
Comparative Analysis of Overhead between Parallel Process and Serial Process
- (1)
- Parallel processing can effectively improve processing efficiency
- (2)
- Choosing the best parallel processing setting based on the available computing resources
5.2.3. Comparison with Similar Studies
- (1)
- MalSEF strikes a favorable balance between the size of the feature space and classification accuracy. With only 300 features, MalSEF achieves an accuracy of 98.53%. In contrast, the method of Ahmadi et al. [28] employs 1804 features and the method of Hu et al. [29] requires 2000 features, both more than six times the number used by MalSEF, while their accuracy exceeds that of MalSEF by no more than 1.30%. Meanwhile, the time cost of MalSEF is significantly lower than that of comparable methods. Although the classification accuracy of MalSEF is slightly lower than that of the above studies, it still meets practical requirements. Furthermore, compared with other similar studies, MalSEF's feature vector is the most succinct and readily obtainable, achieving promising classification results while minimizing the complexity of feature extraction.
- (2)
- MalSEF realizes truly parallel analysis of massive malware, which effectively reduces analysis time. Compared with serial processing, time efficiency is improved by 37.60%, and compared with similar research, the processing time required by MalSEF is the shortest. One may ask whether the baselines could achieve better time efficiency with smaller feature sets; the comparison above is conducted with each model's own required feature size. With only 300 features, MalSEF achieves the desired detection results, whereas it is not confirmed whether the baseline models could maintain their accuracy if their feature sets were reduced, which raises a promising question for future research. It can also be observed that the time reduction is not especially significant as the feature size decreases; this is because the dataset is not particularly large, and the time advantage of MalSEF will become more apparent as the dataset grows.
- (3)
- The hardware platform required by MalSEF is modest, so ordinary researchers can afford to run it. Consequently, it can be readily adopted in mainstream network security practice and has good applicability.
- (4)
- As for the similarities among variants of malware, MalSEF provides and verifies semantic explanations by extracting and mining Opcode information from malware samples, which compensates for the lack of semantic explanations in deep learning-based malware classification.
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Rezaei, T.; Manavi, F.; Hamzeh, A. A PE header-based method for malware detection using clustering and deep embedding techniques. J. Inf. Secur. Appl. 2021, 60, 102876.
- Darem, A.; Abawajy, J.; Makkar, A.; Alhashmi, A.; Alanazi, S. Visualization and deep-learning-based malware variant detection using OpCode-level features. Future Gener. Comput. Syst. 2021, 125, 314–323.
- Malware. Available online: https://www.av-test.org/en/statistics/malware/ (accessed on 7 July 2021).
- Singh, J.; Singh, J. A survey on machine learning-based malware detection in executable files. J. Syst. Archit. 2021, 112, 101861.
- Botacin, M.; Ceschin, F.; Sun, R.; Oliveira, D.; Grégio, A. Challenges and pitfalls in malware research. Comput. Secur. 2021, 106, 102287.
- Han, W.; Xue, J.; Wang, Y.; Huang, L.; Kong, Z.; Mao, L. MalDAE: Detecting and explaining malware based on correlation and fusion of static and dynamic characteristics. Comput. Secur. 2019, 83, 208–233.
- Santos, I.; Brezo, F.; Ugarte-Pedrero, X.; Bringas, P.G. Opcode sequences as representation of executables for data-mining-based unknown malware detection. Inf. Sci. 2013, 231, 64–82.
- Tien, C.W.; Chen, S.W.; Ban, T.; Kuo, S.Y. Machine learning framework to analyze IoT malware using elf and opcode features. Digit. Threat. Res. Pract. 2020, 1, 5.
- Ling, Y.T.; Sani, N.F.M.; Abdullah, M.T.; Hamid, N.A.W.A. Structural features with nonnegative matrix factorization for metamorphic malware detection. Comput. Secur. 2021, 104, 102216.
- Zheng, J.; Zhang, Y.; Li, Y.; Wu, S.; Yu, X. Towards Evaluating the Robustness of Adversarial Attacks Against Image Scaling Transformation. Chin. J. Electron. 2023, 32, 151–158.
- Zhang, Q.; Ma, W.; Wang, Y.; Zhang, Y.; Shi, Z.; Li, Y. Backdoor attacks on image classification models in deep neural networks. Chin. J. Electron. 2022, 31, 199–212.
- Guo, F.; Zhao, Q.; Li, X.; Kuang, X.; Zhang, J.; Han, Y.; Tan, Y.A. Detecting adversarial examples via prediction difference for deep neural networks. Inf. Sci. 2019, 501, 182–192.
- Rudd, E.M.; Rozsa, A.; Günther, M.; Boult, T.E. A survey of stealth malware attacks, mitigation measures, and steps toward autonomous open world solutions. IEEE Commun. Surv. Tutor. 2016, 19, 1145–1172.
- Microsoft Malware Classification Challenge, Kaggle. Available online: https://www.kaggle.com/c/malware-classification (accessed on 27 October 2022).
- Radkani, E.; Hashemi, S.; Keshavarz-Haddad, A.; Amir Haeri, M. An entropy-based distance measure for analyzing and detecting metamorphic malware. Appl. Intell. 2018, 48, 1536–1546.
- Yagemann, C.; Sultana, S.; Chen, L.; Lee, W. Barnum: Detecting document malware via control flow anomalies in hardware traces. In Proceedings of the Information Security: 22nd International Conference, ISC 2019, New York City, NY, USA, 16–18 September 2019; Proceedings 22. Springer International Publishing: Cham, Switzerland, 2019; pp. 341–359.
- Ye, Y.; Li, T.; Adjeroh, D.; Iyengar, S.S. A survey on malware detection using data mining techniques. ACM Comput. Surv. (CSUR) 2017, 50, 1–40.
- Fan, Y.; Ye, Y.; Chen, L. Malicious sequential pattern mining for automatic malware detection. Expert Syst. Appl. 2016, 52, 16–25.
- Burnap, P.; French, R.; Turner, F.; Jones, K. Malware classification using self organising feature maps and machine activity data. Comput. Secur. 2018, 73, 399–410.
- Garcia, D.E.; DeCastro-Garcia, N. Optimal feature configuration for dynamic malware detection. Comput. Secur. 2021, 105, 102250.
- Han, W.; Xue, J.; Wang, Y.; Liu, Z.; Kong, Z. MalInsight: A systematic profiling based malware detection framework. J. Netw. Comput. Appl. 2019, 125, 236–250.
- Guerra-Manzanares, A.; Bahsi, H.; Nõmm, S. Kronodroid: Time-based hybrid-featured dataset for effective android malware detection and characterization. Comput. Secur. 2021, 110, 102399.
- Xin, Y.; Xie, Z.Q.; Yang, J. A load balance oriented cost efficient scheduling method for parallel tasks. J. Netw. Comput. Appl. 2017, 81, 37–46.
- Smilovich, D.; Radovitzky, R.; Dvorkin, E. A parallel staggered hydraulic fracture simulator incorporating fluid lag. Comput. Methods Appl. Mech. Eng. 2021, 384, 114003.
- Wang, K.; Li, X.; Gao, L.; Li, P.; Gupta, S.M. A genetic simulated annealing algorithm for parallel partial disassembly line balancing problem. Appl. Soft Comput. 2021, 107, 107404.
- Bailey, M.; Oberheide, J.; Andersen, J.; Mao, Z.M.; Jahanian, F.; Nazario, J. Automated classification and analysis of internet malware. In Proceedings of the Recent Advances in Intrusion Detection: 10th International Symposium, RAID 2007, Gold Coast, Australia, 5–7 September 2007; Proceedings 10. Springer: Berlin/Heidelberg, Germany, 2007; pp. 178–197.
- Nataraj, L.; Yegneswaran, V.; Porras, P.; Zhang, J. A comparative assessment of malware classification using binary texture analysis and dynamic analysis. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, Chicago, IL, USA, 21 October 2011; pp. 21–30.
- Ahmadi, M.; Ulyanov, D.; Semenov, S.; Trofimov, M.; Giacinto, G. Novel feature extraction, selection and fusion for effective malware family classification. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, New Orleans, LA, USA, 9–11 March 2016; pp. 183–194.
- Hu, X.; Jang, J.; Wang, T.; Ashraf, Z.; Stoecklin, M.P.; Kirat, D. Scalable malware classification with multifaceted content features and threat intelligence. IBM J. Res. Dev. 2016, 60, 6:1–6:11.
- Lee, T.; Kwak, J. Effective and reliable malware group classification for a massive malware environment. Int. J. Distrib. Sens. Netw. 2016, 12, 4601847.
- Raff, E.; Nicholas, C. Malware classification and class imbalance via stochastic hashed lzjd. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, Dallas, TX, USA, 3 November 2017; pp. 111–120.
- Le, Q.; Boydell, O.; Mac Namee, B.; Scanlon, M. Deep learning at the shallow end: Malware classification for non-domain experts. Digit. Investig. 2018, 26, S118–S126.
- Nakazato, J.; Song, J.; Eto, M.; Inoue, D.; Nakao, K. A novel malware clustering method using frequency of function call traces in parallel threads. IEICE Trans. Inf. Syst. 2011, 94, 2150–2158.
- Sheen, S.; Anitha, R.; Sirisha, P. Malware detection by pruning of parallel ensembles using harmony search. Pattern Recognit. Lett. 2013, 34, 1679–1686.
- Wang, X.; Zhang, D.; Su, X.; Li, W. Mlifdect: Android malware detection based on parallel machine learning and information fusion. Secur. Commun. Netw. 2017, 2017, 6451260.
- Kabir, E.; Hu, J.; Wang, H.; Zhuo, G. A novel statistical technique for intrusion detection systems. Future Gener. Comput. Syst. 2018, 79, 303–318.
- Abusitta, A.; Li, M.Q.; Fung, B.C. Malware classification and composition analysis: A survey of recent developments. J. Inf. Secur. Appl. 2021, 59, 102828.
- Mishra, P.; Verma, I.; Gupta, S. KVMInspector: KVM Based introspection approach to detect malware in cloud environment. J. Inf. Secur. Appl. 2020, 51, 102460.
- Wang, P.; Tang, Z.; Wang, J. A novel few-shot malware classification approach for unknown family recognition with multi-prototype modeling. Comput. Secur. 2021, 106, 102273.
- Han, W.; Xue, J.; Wang, Y.; Zhang, F.; Gao, X. APTMalInsight: Identify and cognize APT malware based on system call information and ontology knowledge framework. Inf. Sci. 2021, 546, 633–664.
- Liras, L.F.M.; de Soto, A.R.; Prada, M.A. Feature analysis for data-driven APT-related malware discrimination. Comput. Secur. 2021, 104, 102202.
Related Work | Analysis Method | Dataset | Feature Set | Pros | Cons |
---|---|---|---|---|---|
Bailey et al. [26] | Dynamic analysis | Network security community and a part of the Arbor Malware Library | system state changes | Offer an innovative perspective on comprehending the connections between malware | The underlying shortcomings due to dynamic analysis |
Nataraj Lakshmanan et al. [27] | Static analysis | Host-Rx dataset, Malhuer dataset and a VX Heavens dataset | binary texture feature | With comparable classification accuracy, faster than dynamic technique | Lack of the semantic analysis of the binary programs |
Ahmadi Mansour et al. [28] | Static analysis | Microsoft Malware Challenge dataset | features extracted from hex dumps and decompiled files | High classification accuracy | Complicated feature matrix and high time consumption |
Hu Xin et al. [29] | Static analysis | Microsoft Malware Challenge dataset | multifaceted content features and threat intelligence | High classification accuracy | The process of feature extraction will require a large amount of time |
Lee Taejin et al. [30] | Dynamic analysis | Malware collected from a commercial environment | System call sequences | Good dependability on processing malware volume | May fail to yield an accurate result; high time consumption due to dynamic analysis |
Raff Edward et al. [31] | Static analysis | Microsoft Kaggle dataset and Drebin dataset | SHWeL feature vector by extending Lempel-Ziv Jaccard Distance | Not require domain knowledge | High time consumption |
Quan Le et al. [32] | Static analysis | Microsoft Kaggle dataset | One dimensional representation of the malware sample | Not require domain knowledge | Lack of the semantic analysis of the binary code due to the black-box feature of the deep learning model |
Junji Nakazato et al. [33] | Dynamic analysis | Not clearly stated | Dynamic execution traces | Extract API calls in parallel thread | The parallel processing is not clearly stated |
Sheen Shina et al. [34] | Hybrid analysis | Not clearly stated | PE features and API calls | Constructs pruned ensembles of classifiers in parallel that perform at least as well as the full ensemble | Does not apply the parallel technique to the analysis process itself |
Wang Xin et al. [35] | Static analysis | Drebin and Android Malware Genome Project | Eight types of static features | Achieves higher detection accuracy | Does not apply the parallel technique to the analysis process itself |
Family ID | Family Name | # Samples |
---|---|---|
1 | Ramnit (R) | 1541 |
2 | Lollipop (L) | 2478 |
3 | Kelihos ver3 (K3) | 2942 |
4 | Vundo (V) | 475 |
5 | Simda (S) | 42 |
6 | Tracur (T) | 751 |
7 | Kelihos_ver1 (K1) | 398 |
8 | Obfuscator.ACY (O) | 1228 |
9 | Gatak (G) | 1013 |
Total | 10,868 |
Family ID | Family Name | # Samples |
---|---|---|
1 | Ramnit (R) | 95 |
2 | Lollipop (L) | 100 |
3 | Kelihos ver3 (K3) | 97 |
4 | Vundo (V) | 96 |
5 | Simda (S) | 39 |
6 | Tracur (T) | 92 |
7 | Kelihos_ver1 (K1) | 91 |
8 | Obfuscator.ACY (O) | 97 |
9 | Gatak (G) | 96 |
Total | 803 |
Metric | RF | DT | SVM | XGBST | Number of Feature Opcodes |
---|---|---|---|---|---|
Accuracy | 98.34% | 97.06% | 97.38% | 98.16% | N = 735 |
Precision | 97.93% | 92.13% | 97.17% | 96.29% | |
Recall | 97.83% | 94.92% | 93.63% | 97.57% | |
F1 Score | 97.77% | 93.17% | 95.02% | 96.80% | |
Accuracy | 98.20% | 97.52% | 97.65% | 98.21% | N = 400 |
Precision | 97.62% | 94.20% | 90.58% | 93.90% | |
Recall | 91.00% | 90.31% | 90.31% | 89.69% | |
F1 Score | 92.67% | 91.49% | 90.35% | 90.69% | |
Accuracy | 98.57% | 97.52% | 97.19% | 98.39% | N = 300 |
Precision | 98.26% | 91.11% | 89.18% | 97.98% | |
Recall | 93.65% | 90.20% | 92.39% | 95.69% | |
F1 Score | 95.34% | 90.56% | 90.29% | 96.65% | |
Accuracy | 98.44% | 97.15% | 96.55% | 98.30% | N = 200 |
Precision | 97.97% | 96.48% | 87.77% | 97.81% | |
Recall | 93.30% | 90.82% | 88.87% | 93.16% | |
F1 Score | 94.85% | 92.36% | 88.08% | 94.72% |
Metric | RF | DT | SVM | XGBST | Number of Feature Opcodes |
---|---|---|---|---|---|
Accuracy | 98.21% | 97.24% | 96.87% | 98.39% | N = 350 |
Precision | 97.69% | 96.75% | 91.61% | 96.20% | |
Recall | 93.47% | 89.73% | 92.11% | 93.67% | |
F1 Score | 95.01% | 91.54% | 91.71% | 94.67% | |
Accuracy | 98.53% | 97.38% | 96.83% | 97.98% | N = 300 |
Precision | 98.18% | 93.76% | 92.16% | 97.74% | |
Recall | 94.99% | 90.53% | 90.32% | 91.43% | |
F1 Score | 96.25% | 91.57% | 90.98% | 93.24% | |
Accuracy | 98.44% | 97.29% | 96.83% | 98.34% | N = 250 |
Precision | 98.27% | 94.04% | 93.44% | 98.17% | |
Recall | 94.23% | 91.09% | 94.46% | 95.87% | |
F1 Score | 95.81% | 92.22% | 93.81% | 96.87% | |
Accuracy | 98.34% | 97.06% | 96.92% | 98.11% | N = 200 |
Precision | 97.76% | 93.99% | 94.11% | 97.48% | |
Recall | 95.60% | 94.09% | 93.08% | 94.10% | |
F1 Score | 96.45% | 93.86% | 93.36% | 95.35% |
Num_of_Processes | Num_of_Files_Per_Operation | Num_of_Files_Per_Process | Time_Cost (s) N = 350 | Time_Cost (s) N = 300 | Time_Cost (s) N = 250 | Time_Cost (s) N = 200 |
---|---|---|---|---|---|---|
1 | 1 | 1 | 3144.43 | 3123.36 | 3094.76 | 2991.99 |
4 | 8 | 8/4 = 2 | 2015.38 | 2017.66 | 2011.31 | 2005.77 |
4 | 12 | 12/4 = 3 | 2001.01 | 2055.59 | 2009.72 | 1997.88 |
4 | 16 | 16/4 = 4 | 2006.25 | 2004.85 | 2005.57 | 2002.26 |
4 | 20 | 20/4 = 5 | 2015.54 | 2021.01 | 2013.73 | 2015.71 |
6 | 12 | 12/6 = 2 | 1903.09 | 1905.85 | 1899.42 | 1899.04 |
6 | 18 | 18/6 = 3 | 1910.87 | 1909.65 | 1906.55 | 1908.51 |
6 | 24 | 24/6 = 4 | 1901.23 | 1903.48 | 1899.35 | 1898.98 |
6 | 30 | 30/6 = 5 | 1916.27 | 1907.38 | 1912.05 | 1904.48 |
6 | 36 | 36/6 = 6 | 2054.21 | 1900.02 | 1908.02 | 1900.46 |
8 | 16 | 16/8 = 2 | 2134.08 | 1802.31 | 1804.88 | 1794.36 |
8 | 24 | 24/8 = 3 | 2234.32 | 1814.57 | 1811.59 | 1806.29 |
8 | 32 | 32/8 = 4 | 2163.13 | 1800.85 | 1805.47 | 1800.34 |
8 | 40 | 40/8 = 5 | 2141.56 | 1814.20 | 1810.56 | 1806.71 |
16 | 32 | 32/16 = 2 | 2441.05 | 1784.22 | 1786.66 | 1786.38 |
16 | 48 | 48/16 = 3 | 2452.15 | 1804.17 | 1785.43 | 1789.29 |
16 | 64 | 64/16 = 4 | 2234.93 | 1804.81 | 1780.26 | 1854.18 |
32 | 64 | 64/32 = 2 | 3798.22 | 3366.90 | 3362.39 | 3378.62 |
32 | 96 | 96/32 = 3 | 3716.72 | 3454.91 | 3457.42 | 3446.34 |
Num_of_Processes | Num_of_Files_Per_Operation | Num_of_Files_Per_Process | Time_Cost (s) |
---|---|---|---|
1 | 1 | 1 | 2714.30 |
4 | 8 | 8/4 = 2 | 2024.43 |
4 | 12 | 12/4 = 3 | 2006.35 |
4 | 16 | 16/4 = 4 | 1988.46 |
6 | 18 | 18/6 = 3 | 1906.77 |
6 | 24 | 24/6 = 4 | 1891.73 |
6 | 30 | 30/6 = 5 | 1890.48 |
6 | 36 | 36/6 = 6 | 1881.11 |
8 | 24 | 24/8 = 3 | 1798.20 |
8 | 32 | 32/8 = 4 | 1795.16 |
8 | 40 | 40/8 = 5 | 1798.95 |
16 | 32 | 32/16 = 2 | 1786.84 |
16 | 48 | 48/16 = 3 | 1784.62 |
16 | 64 | 64/16 = 4 | 1785.14 |
32 | 64 | 64/32 = 2 | 3377.65 |
Num_of_Processes | Num_of_Files_Per_Operation | Num_of_Files_Per_Process | Time_Cost (s) N = 735 | Time_Cost (s) N = 400 | Time_Cost (s) N = 300 | Time_Cost (s) N = 200 |
---|---|---|---|---|---|---|
1 | 1 | 1 | 3629.03 | 3421.50 | 3094.94 | 2992.25 |
4 | 8 | 8/4 = 2 | 2095.58 | 2023.76 | 2018.67 | 2010.66 |
4 | 12 | 12/4 = 3 | 2074.93 | 2015.57 | 2000.30 | 2010.03 |
4 | 16 | 16/4 = 4 | 2066.56 | 2022.45 | 2016.49 | 2009.50 |
4 | 20 | 20/4 = 5 | 2037.21 | 2020.76 | 2038.26 | 2017.21 |
6 | 12 | 12/6 = 2 | 1923.39 | 1900.61 | 1903.36 | 1894.10 |
6 | 18 | 18/6 = 3 | 1965.54 | 1915.90 | 1911.74 | 1896.31 |
6 | 24 | 24/6 = 4 | 1970.55 | 1903.14 | 1899.78 | 1893.46 |
6 | 30 | 30/6 = 5 | 2063.14 | 1916.33 | 1910.19 | 1896.37 |
6 | 36 | 36/6 = 6 | 1926.12 | 1911.70 | 1898.43 | 1900.29 |
8 | 16 | 16/8 = 2 | 1822.72 | 2101.03 | 1800.09 | 1787.23 |
8 | 24 | 24/8 = 3 | 1964.03 | 2121.85 | 1809.51 | 1804.67 |
8 | 32 | 32/8 = 4 | 1995.37 | 2104.43 | 1792.06 | 1793.99 |
8 | 40 | 40/8 = 5 | 1971.58 | 2078.92 | 1811.06 | 1802.45 |
16 | 32 | 32/16 = 2 | 2843.86 | 2268.49 | 1782.34 | 1777.53 |
16 | 48 | 48/16 = 3 | 2500.37 | 2247.56 | 1775.79 | 1776.38 |
16 | 64 | 64/16 = 4 | 2418.35 | 2258.10 | 1771.14 | 1774.33 |
32 | 64 | 64/32 = 2 | 3805.67 | 3746.52 | 3360.62 | 3410.64 |
32 | 96 | 96/32 = 3 | 3812.21 | 3854.86 | 3461.20 | 3479.62 |
Aspect | MalSEF | Ahmadi Mansour [28] | Hu Xin et al. [29] | Raff Edward et al. [31] | Quan Le et al. [32] |
---|---|---|---|---|---|
Dataset | The Microsoft Malware Classification Challenge dataset in Kaggle | ||||
# Features | 300 | 1804 | 2000 | -- | 10,000 |
Feature Set | Top-N opcode list | features extracted from hex dumps + features extracted from decompiled files | multifaceted content features + threat intelligence | Not clearly stated | One-dimensional representation of the malware sample |
Classification Accuracy | 98.53% | 99.77% | 99.80% | 97.80% | 98.20% |
Time cost (s) | 1790.24 | 5656.00 | 2867.00 | 32,087.40 | 6372.00 (Train time for deep learning network) |
Required hardware platform | Lenovo ThinkStation, Intel® Core™ i7-6700U CPU @3.40GHz × 8, 8 GB memory | A laptop with a quad-core processor (2 GHz), and 8 GB RAM | Not clearly stated | A workstation with an Intel Xeon E5-2650 CPU at 2.30 GHz, 128 GB of RAM, and 4 TB of SSD storage | A workstation with a 6 core i7-6850K Intel processor |