Malware Identification Method in Industrial Control Systems Based on Opcode2vec and CVAE-GAN
Abstract
1. Introduction
- To address the small quantity and limited diversity of malware samples in ICSs, our work proposes a novel malware generation method that combines the opcode2vec method, applied to preprocessed features, with CVAE-GAN. After the preprocessing, opcode2vec converts each opcode into a word vector. The preprocessing ensures that the extracted features remain both simple and effective. CVAE-GAN combines the latent space learning capabilities of CVAE with the high-quality generation capabilities of GAN to produce a large and diverse set of malware samples that meet specified conditions. These generated samples enrich the training dataset, compensating for the deficiencies of existing samples in terms of features and categories.
- To address the low accuracy, instability, and lack of robustness of existing classifier models, our work develops a malware classifier trained on the enhanced dataset. We conducted comprehensive analysis and experiments, focusing on the model trained using DEMI. The experimental results demonstrate that the model based on DEMI shows a significant advantage in accuracy, particularly in scenarios with a limited number of samples.
2. Related Work
2.1. Malware
2.2. Identification Methods
3. Method
3.1. The Overall Structure of DEMI
3.2. The Feature Extraction of DEMI
3.2.1. The Extraction of Opcodes
3.2.2. The Selection of Opcodes
3.2.3. The Word Vector Computation of Opcodes
3.3. The Generation of DEMI
3.3.1. Component Definitions
- Encoder (Enc)
- Function: Encodes the input malware sample x into the latent variable z.
- Input: x (input sample), c (category).
- Output: z (latent variable).
- Mathematical Operation: $z = Enc(x, c)$
- Associated Loss: $L_{KL}$ (Kullback-Leibler (KL) divergence loss).
- Generator (Gen)
- Function: Acts both as the decoder in the VAE and the generator in the GAN, decoding the latent variable z into the generated sample $x'$, thus generating a malware sample belonging to category c.
- Input: z (latent variable), c (category).
- Output: $x'$ (generated sample).
- Mathematical Operation: $x' = Gen(z, c)$
- Associated Loss: $L_G$ (loss function for the generator part, including $L_{GD}$ and $L_{GC}$).
- Discriminator (Dis)
- Function: Determines whether the input sample is real or fake.
- Input: x (real sample), $x'$ (generated sample).
- Output: $Dis(x)$, $Dis(x')$ (discrimination results).
- Mathematical Operation: $Dis(x)$, $Dis(x')$
- Associated Loss: $L_D$ (loss function for the discriminator part).
- Classifier (Cl)
- Function: Measures the posterior probability $P(c \mid x)$. The classifier in CVAE-GAN is used to optimize the parameters of the generator through the losses $L_C$ and $L_{GC}$.
- Input: x (real sample), $x'$ (generated sample).
- Output: $Cl(x)$, $Cl(x')$ (classification results).
- Mathematical Operation: $Cl(x)$, $Cl(x')$
- Associated Loss: $L_C$ (loss function for the classifier part).
- Category (c)
- Function: The given category corresponding to the malware sample x, indicating which malware family it belongs to.
3.3.2. Loss Functions
- $L_{KL}$: KL divergence loss for the VAE network, representing the difference between the latent vector distribution and the predefined distribution.
- $L_G$: Loss function for the generator part of the GAN, which includes both the loss functions $L_{GD}$ and $L_{GC}$.
- $L_D$: Loss function for the discriminator part of the GAN.
- $L_C$: Loss function for the classifier part.
- $L_{GD}$ and $L_{GC}$: Both are components of the loss function for the generator part of the GAN.
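As an illustration of the KL term, the following is a minimal sketch. It assumes that the encoder outputs a diagonal Gaussian with mean `mu` and log-variance `logvar`, and that the predefined distribution is a standard normal; neither assumption is stated in the list above, and the function name `kl_to_standard_normal` is hypothetical. Under those assumptions the divergence has the closed form $\frac{1}{2}\sum_i (\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1)$.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for one latent vector.

    Assumes a diagonal Gaussian posterior and a standard normal prior
    (illustrative assumptions, not taken from the paper).
    """
    if len(mu) != len(logvar):
        raise ValueError("mu and logvar must have the same length")
    # Closed form: 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


# A posterior identical to the prior has zero divergence.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# Shifting the means away from zero increases the divergence.
print(kl_to_standard_normal([1.0, -1.0], [0.0, 0.0]))
```

The term is zero only when the encoded distribution coincides with the prior, which is why minimizing it pulls the latent vectors toward the predefined distribution.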
3.3.3. Training Phase
- Encoder:
- Generator:
- Discriminator:
- Classifier:
Algorithm 1. Training Pipeline of Generation.
1: while has not converged do
2:   Samples from the real data distribution
3:
4:
5:   Samples and from the predefined distribution
6:
7:
8:
9:
10:   Compute the feature center of and
11:
12:   Compute the feature center of and with respect to separately
13:
14:
15:
16:
17:
18:
19: end while
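Steps 10 and 12 of Algorithm 1 compute feature centers. The sketch below shows one common way such centers are used: the mean feature matching objective from the original CVAE-GAN formulation by Bao et al. (2017), which takes half the squared Euclidean distance between the mean feature vector of real samples and that of generated samples, either over the whole batch or per category. Whether DEMI uses exactly this form is an assumption; the function names and the toy feature vectors are hypothetical.

```python
def feature_center(features):
    """Mean of a batch of equal-length feature vectors."""
    if not features:
        raise ValueError("empty batch")
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]


def mean_feature_matching_loss(real_feats, fake_feats):
    """0.5 * || center(real) - center(fake) ||_2^2 over the whole batch."""
    c_real = feature_center(real_feats)
    c_fake = feature_center(fake_feats)
    return 0.5 * sum((r - f) ** 2 for r, f in zip(c_real, c_fake))


def per_class_matching_loss(real_feats, real_labels, fake_feats, fake_labels):
    """Sum the matching loss over categories present in both batches."""
    total = 0.0
    for c in set(real_labels) & set(fake_labels):
        real_c = [f for f, y in zip(real_feats, real_labels) if y == c]
        fake_c = [f for f, y in zip(fake_feats, fake_labels) if y == c]
        total += mean_feature_matching_loss(real_c, fake_c)
    return total


# Toy intermediate-layer features (hypothetical values).
real = [[1.0, 2.0], [3.0, 4.0]]
fake = [[1.0, 2.0], [0.0, 0.0]]
print(mean_feature_matching_loss(real, fake))
print(per_class_matching_loss(real, [0, 1], fake, [0, 1]))
```

Matching centers rather than individual samples gives the generator a smoother training signal than the raw adversarial loss, which is the motivation given for it in the CVAE-GAN literature.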
3.3.4. Inference Phase
- Encoder:
- Generator:
- Discriminator and Classifier: The discriminator and classifier do not directly participate in the inference phase. However, during the training phase, they optimize the generator’s parameters to ensure the quality and consistency of the generated samples.
3.4. The Identification of DEMI
3.5. The Evaluation of DEMI
- Accuracy: Accuracy is the ratio of correctly predicted instances (both true positives and true negatives) to the total number of instances.
- Precision: Precision is the ratio of correctly predicted positive instances to the total predicted positives.
- Recall: Recall, also known as sensitivity, is the ratio of correctly predicted positive instances to all instances that are actually positive.
- F1-score: The F1-score is the harmonic mean of precision and recall, providing a balanced metric that considers both false positives and false negatives.
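The four metrics can be computed directly from confusion-matrix counts. The sketch below does this for a single positive class; how the per-family values are averaged across the nine malware families is not specified here, so no averaging scheme is implied. The function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    denom = precision + recall
    return 2.0 * precision * recall / denom if denom else 0.0


def metrics_from_counts(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall, f1_score(precision, recall)


# Toy counts (hypothetical): 8 true positives, 2 false positives,
# 1 false negative, 9 true negatives.
acc, prec, rec, f1 = metrics_from_counts(tp=8, fp=2, fn=1, tn=9)
print(acc, prec, rec, f1)

# Consistency check against the DEMI row of the results table.
print(round(f1_score(0.9234, 0.9744), 4))
```

As a quick consistency check, plugging the reported DEMI precision (0.9234) and recall (0.9744) into `f1_score` gives approximately 0.9482, which agrees with the reported F1-score.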
4. Experiment and Analysis
4.1. Dataset and Experimental Setup
4.2. Results of DEMI Experiment
4.2.1. Results of Full-Sample Training
4.2.2. Results of Few-Sample Training
4.3. Comparison of Different Methods
4.3.1. Comparison of Results for Full-Sample Training
4.3.2. Comparison of Results for Few-Sample Training
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Layer | Type | Kernel | Stride | Output Dimension |
---|---|---|---|---|
INPUT | Input Layer | - | - | |
CONV1 | Convolutional Layer | 2 | 32 filters | |
POOL1 | Max Pooling Layer | 2 | - | |
CONV2 | Convolutional Layer | 2 | 32 filters | |
POOL2 | Max Pooling Layer | 2 | - | |
CONV3 | Convolutional Layer | 2 | 64 filters | |
POOL3 | Max Pooling Layer | 2 | - | |
FC1 | Fully Connected Layer | - | - | 128 neurons |
FC2 | Fully Connected Layer | - | - | 9 neurons |
Malware Family | Round 1 Real | Round 1 Generated | Round 2 Real | Round 2 Generated | Round 3 Real | Round 3 Generated | Round 4 Real | Round 4 Generated | Round 5 Real | Round 5 Generated |
---|---|---|---|---|---|---|---|---|---|---|
Ramnit | 1541 | 308 | 1541 | 370 | 1541 | 444 | 1541 | 533 | 1541 | 640 |
Lollipop | 2478 | 496 | 2478 | 595 | 2478 | 714 | 2478 | 857 | 2478 | 1028 |
Kelihos ver.3 | 2947 | 589 | 2947 | 707 | 2947 | 848 | 2947 | 1018 | 2947 | 1222 |
Vundo | 475 | 95 | 475 | 114 | 475 | 137 | 475 | 164 | 475 | 197 |
Simda | 42 | 8 | 42 | 10 | 42 | 12 | 42 | 14 | 42 | 17 |
Tracur | 751 | 150 | 751 | 180 | 751 | 216 | 751 | 259 | 751 | 311 |
Kelihos ver.1 | 298 | 60 | 298 | 72 | 298 | 86 | 298 | 103 | 298 | 124 |
Obfuscator.ACY | 1228 | 246 | 1228 | 295 | 1228 | 354 | 1228 | 425 | 1228 | 510 |
Gatak | 1013 | 203 | 1013 | 244 | 1013 | 293 | 1013 | 352 | 1013 | 422 |
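The generated counts in the table above are consistent with a simple schedule: in Round 1 the number of generated samples is about 20% of the real samples for each family, and each later round increases the previous round's generated count by about 20%, rounded to the nearest integer. This rule is inferred from the numbers and is not stated in the text; the function below is a hypothetical sketch that reproduces the table under that assumption.

```python
# Real sample counts per family, copied from the table.
REAL_COUNTS = {
    "Ramnit": 1541, "Lollipop": 2478, "Kelihos ver.3": 2947,
    "Vundo": 475, "Simda": 42, "Tracur": 751,
    "Kelihos ver.1": 298, "Obfuscator.ACY": 1228, "Gatak": 1013,
}


def generated_schedule(real_count, rounds=5):
    """Generated-sample counts per round under the inferred 20% rule.

    Integer arithmetic is used so rounding is exact (round half up).
    """
    counts = [(2 * real_count + 5) // 10]            # round(0.2 * real)
    for _ in range(rounds - 1):
        counts.append((12 * counts[-1] + 5) // 10)   # round(1.2 * previous)
    return counts


for family, real in REAL_COUNTS.items():
    print(family, generated_schedule(real))
```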
Malware Family | Round 1 | Round 2 | Round 3 | Round 4 | Round 5 | Round 6 | Round 7 | Round 8 |
---|---|---|---|---|---|---|---|---|
Ramnit | 1541 | 1233 | 986 | 789 | 631 | 505 | 404 | 323 |
Lollipop | 2478 | 1982 | 1586 | 1269 | 1015 | 812 | 650 | 520 |
Kelihos ver.3 | 2947 | 2358 | 1886 | 1509 | 1207 | 966 | 773 | 618 |
Vundo | 475 | 380 | 304 | 243 | 195 | 156 | 125 | 100 |
Simda | 42 | 42 | 42 | 42 | 42 | 42 | 42 | 42 |
Tracur | 751 | 601 | 481 | 385 | 308 | 246 | 197 | 157 |
Kelihos ver.1 | 298 | 238 | 191 | 153 | 122 | 98 | 78 | 62 |
Obfuscator.ACY | 1228 | 982 | 786 | 629 | 503 | 402 | 322 | 258 |
Gatak | 1013 | 810 | 648 | 519 | 414 | 331 | 266 | 212 |
Total | 10773 | 8626 | 6909 | 5536 | 4436 | 3557 | 2588 | 2292 |
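The per-family counts in the few-sample table above appear to follow a geometric reduction: round $k$ keeps roughly $0.8^{k-1}$ of the Round 1 samples of each family, while Simda, the smallest family, stays at 42 samples throughout. This rule is inferred from the numbers and is not stated in the text, and a few table entries differ from it by one sample. The function below is a hypothetical sketch of such a schedule.

```python
def retained_schedule(full_count, rounds=8, shrink=True):
    """Samples kept per round: round(full_count * 0.8**k) for k = 0..rounds-1.

    Set shrink=False for a family that is kept whole in every round
    (as Simda is in the table). Integer arithmetic keeps rounding exact.
    """
    if not shrink:
        return [full_count] * rounds
    counts = []
    for k in range(rounds):
        num, den = full_count * 4 ** k, 5 ** k       # 0.8**k == 4**k / 5**k
        counts.append((2 * num + den) // (2 * den))  # round half up
    return counts


print(retained_schedule(1541))              # Ramnit
print(retained_schedule(475))               # Vundo
print(retained_schedule(42, shrink=False))  # Simda is not reduced
```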
Method | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|
Traditional methods | ||||
RNN | 0.9411 | 0.9001 | 0.9420 | 0.9206 |
GRU | 0.9287 | 0.8952 | 0.9241 | 0.9094 |
LSTM | 0.9330 | 0.8973 | 0.9352 | 0.9159 |
BiLSTM | 0.9359 | 0.9075 | 0.9510 | 0.9287 |
Random Forest | 0.9456 | 0.9011 | 0.9451 | 0.9226 |
SOTA methods | ||||
Gray+CNN | 0.9647 | 0.9064 | 0.9339 | 0.9199 |
vectorCNN | 0.9657 | 0.9113 | 0.9346 | 0.9228 |
Cakir’s Method | 0.9685 | 0.9092 | 0.9655 | 0.9365 |
MCIopcode2vec (ours) | 0.9722 | 0.9061 | 0.9395 | 0.9225 |
Proposed method | ||||
DEMI (ours) | 0.9730 | 0.9234 | 0.9744 | 0.9482 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Huang, Y.; Liu, J.; Xiang, X.; Wen, P.; Wen, S.; Chen, Y.; Chen, L.; Zhang, Y. Malware Identification Method in Industrial Control Systems Based on Opcode2vec and CVAE-GAN. Sensors 2024, 24, 5518. https://doi.org/10.3390/s24175518