Going beyond API Calls in Dynamic Malware Analysis: A Novel Dataset
Abstract
1. Introduction
2. Background and Related Work
2.1. API Call Feature Sets
2.2. More Comprehensive Feature Sets
2.3. Position of Our Approach
3. Materials and Methods
- Number of benign files belonging to the given file type (cf. column “# Benign instances”—“Full rep.”).
- Number of benign files belonging to the given file type whose execution generates API calls (cf. column “# Benign instances”—“API calls”).
- Number of malware files belonging to the given file type (cf. column “# Malware instances”—“Full rep.”).
- Number of malware files belonging to the given file type whose execution generates API calls (cf. column “# Malware instances”—“API calls”).
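The four counts above can be turned into a simple coverage figure: the share of samples whose execution produced any API calls at all. The sketch below uses only the totals reported in the "Total:" row of the file-type table; the dictionary layout and the function name `api_call_share` are illustrative assumptions, not part of the original work.

```python
# Totals taken from the "Total:" row of the file-type table.
# "full" = samples with a full report, "api" = samples whose run generated API calls.
totals = {
    "benign": {"full": 11735, "api": 7097},
    "malware": {"full": 10465, "api": 9823},
}


def api_call_share(full: int, api: int) -> float:
    """Return the percentage of fully reported samples that also produced API calls."""
    return 100.0 * api / full


shares = {
    label: round(api_call_share(c["full"], c["api"]), 1)
    for label, c in totals.items()
}
print(shares)  # {'benign': 60.5, 'malware': 93.9}
```

Under these totals, roughly four in ten benign samples would be invisible to a feature set built from API calls alone.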
- Static and dynamic analysis based on YARA rules for code packing, obfuscation, etc.;
- Registry editing;
- File creation and modification;
- Network access;
- Checking user activities and other evasion techniques and behaviors of file samples, etc.
- Some reports are very large (over 2 GB).
- Some samples apply obfuscation techniques (discovered in the basic static file analysis conducted by the sandbox system).
- Some samples implement evasion techniques: checking the browser history, checking whether the running environment is virtual, checking whether a debugger is present, etc.
- We configured a Python agent, which monitors events while file samples are being executed and runs silently in the background.
- The analysis virtual machine is configured with usage history and data, a custom username, and software. Special attention was devoted to the browsers’ history since we previously noticed frequent checks of Internet browser activities conducted by some malware samples.
- VMware Tools were intentionally not installed, and the analysis virtual machine was given unusually abundant processing resources for a sandbox, so that sophisticated malware samples would be less likely to recognize the sandbox and evade analysis.
- The user action imitation technique (e.g., mouse move, click, etc.) integrated into the Cuckoo sandbox was enabled, and the execution time was increased to two minutes per file sample.
- The analysis virtual machine had limited Internet access with the firewall configured in front of the laboratory environment.
4. Results
4.1. Model and Metrics Selection
- Loading the pandas data frames and shuffling them.
- Subsampling, i.e., we randomly selected 70 percent of each pandas data frame, keeping the original textual reports representing the selected samples.
- Storing, i.e., all subsample instances were stored in one data frame, which was additionally shuffled. This data frame was then serialized to a pickle file.
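The three steps above can be sketched with pandas roughly as follows. This is a minimal illustration on toy data, not the authors' code: the function name `subsample_and_store`, the column names `report` and `label`, the fixed seed, and the output path are all assumptions.

```python
import os
import tempfile

import pandas as pd


def subsample_and_store(frames, out_path, frac=0.7, seed=0):
    """Shuffle each frame, keep `frac` of its rows, merge, reshuffle, and pickle."""
    parts = []
    for df in frames:
        shuffled = df.sample(frac=1.0, random_state=seed)             # shuffle rows
        parts.append(shuffled.sample(frac=frac, random_state=seed))   # keep 70 percent
    merged = pd.concat(parts, ignore_index=True)
    merged = merged.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    merged.to_pickle(out_path)                                        # store on disk
    return merged


# Toy stand-ins for the per-class data frames of textual reports.
benign = pd.DataFrame({"report": [f"benign report {i}" for i in range(10)], "label": 0})
malware = pd.DataFrame({"report": [f"malware report {i}" for i in range(20)], "label": 1})

out_path = os.path.join(tempfile.gettempdir(), "subsample_example.pkl")
subset = subsample_and_store([benign, malware], out_path)
restored = pd.read_pickle(out_path)
```

Sampling each frame separately, as the list describes, keeps the class proportions of the subsample close to those of the full data.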
- Full-featured samples represented by CV;
- Full-featured samples represented by TF-IDF;
- API calls samples represented by CV;
- API calls samples represented by TF-IDF.
4.2. Experiment
4.3. Computing Resources and Trade-Offs
- Random forest hyperparameters (max_depth, min_samples_split, min_samples_leaf, and max_features) represent parameter values applied to each RF group of models in the optimization process.
- Computing resources (Exec. time, Max. RAM cons., Max. CPU cons.) describe the resources required to load the extracted features from the files stored on the disk (previously saved to the pickle file for both CV and TF-IDF) and to train and test the RF models.
- Results provide the maximum validation accuracy in the group and the number of trees in an RF model at which it was obtained.
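One way to read the three column groups is as the inputs and outputs of a loop that fixes the four random forest hyperparameters, varies the number of trees, and records the best validation accuracy. The sketch below follows that reading with scikit-learn on synthetic data; the train/validation split, the range of tree counts, and the seed are assumptions, and the hyperparameter values merely mirror one row of the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized reports (assumption: binary labels).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# One hyperparameter group, shared by every model in the group.
params = dict(max_depth=100, min_samples_split=2, min_samples_leaf=1, max_features=0.2)

best_acc, best_trees = 0.0, 0
for n_trees in range(1, 21):  # assumption: a small range to keep the example fast
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0, **params)
    model.fit(X_tr, y_tr)
    acc = model.score(X_val, y_val)  # validation accuracy
    if acc > best_acc:
        best_acc, best_trees = acc, n_trees

print(f"max validation accuracy {best_acc:.3f} at {best_trees} trees")
```

Note that a float `max_features` such as 0.1 or 0.2 is interpreted by scikit-learn as a fraction of all features, which on a very wide feature matrix means far more candidate features per split than `sqrt`.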
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
File Type | # Benign Instances (Full rep.) | # Benign Instances (API Calls) | # Malware Instances (Full rep.) | # Malware Instances (API Calls) | File Type | # Benign Instances (Full rep.) | # Benign Instances (API Calls) | # Malware Instances (Full rep.) | # Malware Instances (API Calls) |
---|---|---|---|---|---|---|---|---|---|
exe | 1395 | 344 | 7097 | 6597 | xlsm | 0 | 0 | 5 | 5 |
dll | 6574 | 3354 | 238 | 237 | accdb | 5 | 5 | 0 | 0 |
doc | 1659 | 1659 | 947 | 947 | cat | 5 | 5 | 0 | 0 |
xls | 631 | 631 | 477 | 475 | gdl | 5 | 5 | 0 | 0 |
txt | 4 | 0 | 803 | 803 | xml | 5 | 5 | 0 | 0 |
fxp | 558 | 558 | 0 | 0 | db | 4 | 0 | 0 | 0 |
docx | 0 | 0 | 269 | 269 | pptx | 0 | 0 | 4 | 4 |
 | 118 | 0 | 120 | 0 | sch | 3 | 3 | 0 | 0 |
cdx | 223 | 0 | 0 | 0 | ppt | 0 | 0 | 3 | 3 |
dbf | 222 | 222 | 0 | 0 | avi | 2 | 2 | 0 | 0 |
xlsx | 0 | 0 | 190 | 190 | bin | 2 | 2 | 0 | 0 |
prg | 138 | 138 | 0 | 0 | bmp | 2 | 0 | 0 | 0 |
html | 0 | 0 | 128 | 128 | och | 2 | 2 | 0 | 0 |
none | 0 | 0 | 60 | 59 | tbk | 2 | 2 | 0 | 0 |
ppd | 52 | 52 | 0 | 0 | msg | 0 | 0 | 2 | 2 |
docm | 0 | 0 | 45 | 45 | agcoc | 1 | 1 | 0 | 0 |
fpt | 35 | 35 | 0 | 0 | bak | 1 | 1 | 0 | 0 |
fpx | 0 | 0 | 33 | 33 | fmt | 1 | 1 | 0 | 0 |
zip | 7 | 0 | 17 | 0 | ico | 1 | 0 | 0 | 0 |
rar | 0 | 0 | 24 | 23 | mem | 1 | 1 | 0 | 0 |
htm | 19 | 19 | 0 | 0 | ms | 1 | 1 | 0 | 0 |
crt | 12 | 12 | 0 | 0 | new | 1 | 1 | 0 | 0 |
inf | 10 | 10 | 0 | 0 | pr1 | 1 | 1 | 0 | 0 |
bat | 9 | 9 | 0 | 0 | ses | 1 | 1 | 0 | 0 |
dat | 8 | 0 | 0 | 0 | json | 0 | 0 | 1 | 1 |
ini | 8 | 8 | 0 | 0 | mp3 | 0 | 0 | 1 | 1 |
lnk | 7 | 7 | 0 | 0 | rtf | 0 | 0 | 1 | 1 |
Total: | 11,735 | 7097 | 10,465 | 9823 | | | | | |
File Type Extension | Instances | Benign Instances | Malware Instances |
---|---|---|---|
dll | 3221 | 3220 | 1 |
exe | 1551 | 1051 | 500 |
 | 238 | 118 | 120 |
cdx | 223 | 223 | 0 |
zip | 24 | 7 | 17 |
dat | 8 | 8 | 0 |
txt | 4 | 4 | 0 |
db | 4 | 4 | 0 |
xls | 2 | 0 | 2 |
bmp | 2 | 2 | 0 |
rar | 1 | 0 | 1 |
ico | 1 | 1 | 0 |
none | 1 | 0 | 1 |
Total: | 5280 | 4638 | 642 |
 | CV Full | TF-IDF Full | CV API Calls | TF-IDF API Calls |
---|---|---|---|---|
Num. of trees in the optimal model | 7 | 26 | 32 | 66 |
Validation accuracy (%) | 99.74 | 99.68 | 95.56 | 95.4 |
Precision (benign) | 0.997 | 0.997 | 0.935 | 0.933 |
Precision (malware) | 0.998 | 0.996 | 0.982 | 0.982 |
Macro-average precision | 0.997 | 0.998 | 0.959 | 0.957 |
Recall (benign) | 0.998 | 0.997 | 0.985 | 0.985 |
Recall (malware) | 0.996 | 0.998 | 0.921 | 0.918 |
Macro-average recall | 0.997 | 0.995 | 0.953 | 0.952 |
Macro-average F1 | 0.997 | 0.997 | 0.954 | 0.954 |
Num. of samples | 15,542 | 15,542 | 15,542 | 15,542 |
Num. of features | 25,066,934 | 25,066,934 | 294 | 294 |
Benign samples | 8217 | 8217 | 4968 | 4968 |
Malware samples | 7325 | 7325 | 6891 | 6891 |
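For readers unfamiliar with the macro-averaged rows in the table above, the sketch below shows how per-class precision, recall, and F1 combine into macro averages for a two-class problem. The confusion counts are hypothetical and chosen only for easy arithmetic; taking macro-F1 as the unweighted mean of the per-class F1 scores follows the scikit-learn convention and is an assumption about how the reported values were computed.

```python
def per_class_metrics(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def macro_average(values):
    """Unweighted mean over classes: every class counts equally."""
    return sum(values) / len(values)


# Hypothetical validation outcome (not from the paper):
# 100 benign samples, 90 classified correctly; 100 malware samples, 80 correct.
benign = per_class_metrics(tp=90, fp=20, fn=10)   # 20 malware mislabeled as benign
malware = per_class_metrics(tp=80, fp=10, fn=20)  # 10 benign mislabeled as malware

macro_precision = macro_average([benign[0], malware[0]])
macro_recall = macro_average([benign[1], malware[1]])
macro_f1 = macro_average([benign[2], malware[2]])

print(round(macro_precision, 3), round(macro_recall, 3), round(macro_f1, 3))
```

Because the macro average ignores class sizes, a drop in recall on one class pulls the macro-average recall down even when the other class is classified almost perfectly.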
# | Dataset | max_depth | min_samples_split | min_samples_leaf | max_features | Exec. Time (h) | Max. RAM Cons. (GB) | Max. CPU Cons. (%) | # Trees at Max. Acc. | Max. Validation Accuracy (%) |
---|---|---|---|---|---|---|---|---|---|---|
1 | Full reports CV | None | 2 | 1 | sqrt | 7.00 | 19.3 | 6 | 89 | 99.067 |
2 | Full reports CV | 100 | 2 | 1 | sqrt | 7.80 | 19.3 | 6.7 | 93 | 99.099 |
3 | Full reports CV | None | 100 | 1 | sqrt | 6.25 | 17.8 | 6.78 | 25 | 98.874 |
4 | Full reports CV | None | 2 | 100 | sqrt | 5.90 | 18 | 6.47 | 72 | 89.836 |
5 | Full reports CV | None | 2 | 1 | 0.1 | 50.21 | 18.31 | 6.47 | 45 | 99.742 |
6 | Full reports CV | 1000 | 2 | 1 | sqrt | 7.01 | 18.34 | 6.81 | 87 | 99.067 |
7 | Full reports CV | None | 1000 | 1 | sqrt | 5.50 | 18 | 6.5 | 33 | 98.617 |
8 | Full reports CV | None | 2 | 1000 | sqrt | 7.75 | 18 | 6.5 | 69 | 76.841 |
9 | Full reports CV | 100 | 2 | 1 | 0.2 | 70.84 | 45 | 22 | 7 | 99.743 |
10 | Full reports TF-IDF | None | 2 | 1 | sqrt | 6.11 | 20 | 7 | 86 | 99.035 |
11 | Full reports TF-IDF | 100 | 2 | 1 | sqrt | 7.50 | 20 | 7 | 86 | 99.035 |
12 | Full reports TF-IDF | None | 100 | 1 | sqrt | 5.34 | 18.8 | 7 | 79 | 98.617 |
13 | Full reports TF-IDF | None | 2 | 100 | sqrt | 5.00 | 18 | 6.44 | 95 | 92.879 |
14 | Full reports TF-IDF | None | 2 | 1 | 0.1 | 43.33 | 19 | 6.38 | 94 | 99.614 |
15 | Full reports TF-IDF | 1000 | 2 | 1 | sqrt | 6.00 | 19.03 | 6.47 | 16 | 98.874 |
16 | Full reports TF-IDF | None | 1000 | 1 | sqrt | 5.00 | 19.99 | 6.62 | 51 | 97.813 |
17 | Full reports TF-IDF | None | 2 | 1000 | sqrt | 5.00 | 18.44 | 6.7 | 55 | 79.607 |
18 | Full reports TF-IDF | 100 | 2 | 1 | 0.2 | 61.09 | 20.4 | 7 | 26 | 99.678 |
19 | API calls CV | None | 2 | 1 | sqrt | 0.06 | 18 | 6 | 33 | 95.529 |
20 | API calls CV | 100 | 2 | 1 | sqrt | 0.06 | 18 | 6 | 79 | 95.529 |
21 | API calls CV | None | 100 | 1 | sqrt | 0.06 | 18 | 6 | 9 | 94.854 |
22 | API calls CV | None | 2 | 100 | sqrt | 0.06 | 18 | 6 | 8 | 92.536 |
23 | API calls CV | None | 2 | 1 | 0.1 | 0.06 | 18 | 6 | 21 | 95.433 |
24 | API calls CV | 1000 | 2 | 1 | sqrt | 0.06 | 18 | 6 | 13 | 95.497 |
25 | API calls CV | None | 1000 | 1 | sqrt | 0.06 | 18 | 6 | 38 | 93.921 |
26 | API calls CV | None | 2 | 1000 | sqrt | 0.06 | 18 | 6 | 16 | 78.868 |
27 | API calls CV | 100 | 2 | 1 | 0.2 | 0.09 | 18 | 6 | 32 | 95.561 |
28 | API calls TF-IDF | None | 2 | 1 | sqrt | 0.07 | 18 | 6.5 | 43 | 95.24 |
29 | API calls TF-IDF | 100 | 2 | 1 | sqrt | 0.07 | 18 | 6.5 | 43 | 95.24 |
30 | API calls TF-IDF | None | 100 | 1 | sqrt | 0.05 | 18 | 6.5 | 22 | 94.757 |
31 | API calls TF-IDF | None | 2 | 100 | sqrt | 0.03 | 18 | 6.5 | 42 | 93.149 |
32 | API calls TF-IDF | None | 2 | 1 | 0.1 | 0.08 | 18 | 6.5 | 92 | 95.368 |
33 | API calls TF-IDF | 1000 | 2 | 1 | sqrt | 0.06 | 18 | 6.5 | 43 | 95.24 |
34 | API calls TF-IDF | None | 1000 | 1 | sqrt | 0.03 | 18 | 6.5 | 93 | 93.599 |
35 | API calls TF-IDF | None | 2 | 1000 | sqrt | 0.02 | 18 | 6.5 | 14 | 89.643 |
36 | API calls TF-IDF | 100 | 2 | 1 | 0.2 | 0.13 | 18 | 6.5 | 66 | 95.4 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ilić, S.; Gnjatović, M.; Tot, I.; Jovanović, B.; Maček, N.; Gavrilović Božović, M. Going beyond API Calls in Dynamic Malware Analysis: A Novel Dataset. Electronics 2024, 13, 3553. https://doi.org/10.3390/electronics13173553