Syntactic–Semantic Detection of Clone-Caused Vulnerabilities in the IoT Devices
Abstract
1. Introduction
- Using proprietary program code from a corporate code database when developing commercial software;
- Using open-source code repositories when developing proprietary software;
- Using open-source code repositories when developing open-source software.
- Incomplete and inaccurate semantic information is captured from functions, leading to a high rate of false positives. For example, the BinGo tool [14] relies on the CFG to generate a function signature model. However, CFGs diverge significantly across platforms, so BinGo’s cross-platform code clone detection accuracy is below 60%;
- Most methods require substantial data processing power, making them challenging to apply to complex tasks. For example, the Genius tool [12] uses spectral algorithms for clustering and graph matching. The Gemini tool [7] applies a deep learning model to process CFGs, and, consequently, it loses a large portion of semantic information while optimizing the data mining procedure. Most detectors work on a syntactic level of clone searching.
- A formal description of the code clone search is proposed. On this basis, a hybrid method for code clone detection is proposed that combines syntactic and semantic analyses. This method utilizes an attributed abstract syntax tree, our improvement of the commonly used abstract syntax tree that was extended with a vector representation of code features, and a Siamese network of two deep graph neural networks. Therefore, the proposed method combines low-level code feature processing and high-level semantic analysis;
- An experimental study of the proposed method was conducted, demonstrating its efficiency in maintaining IoT software. It shows better output quality (e.g., AUC 0.962) than the tested competitors—BinDiff, Gemini, and Asteria utilities.
2. Materials and Methods
2.1. Related Works
- Exact clones: the program code is re-used as is without any modifications.
- Renamed clones: syntactically identical code in which variable names, types, whitespace, layout, and comments may be modified.
- Restructured clones: renamed clones whose code fragments are further edited by adding, removing, or modifying statements.
- Semantic clones: two code samples differ in syntax, but implement the same function and, thus, have the same semantics.
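The four clone types can be illustrated with a small example. The functions below are hypothetical and do not come from the paper; they only show how each type relates to an original fragment.

```python
# Hypothetical illustration of the four clone types listed above.

def total(values):             # original fragment
    result = 0
    for v in values:
        result += v
    return result

def total_exact(values):       # exact clone: body copied as is
    result = 0                 # (only the function name differs, so that
    for v in values:           # both copies can live in one module)
        result += v
    return result

def add_up(items):             # renamed clone: identifiers changed only
    acc = 0
    for item in items:
        acc += item
    return acc

def add_up_checked(items):     # restructured clone: a statement was added
    if items is None:
        return 0
    acc = 0
    for item in items:
        acc += item
    return acc

def total_semantic(values):    # semantic clone: different syntax, same behavior
    return sum(values)

data = [3, 4, 5]
clones = (total, total_exact, add_up, add_up_checked, total_semantic)
results = [f(data) for f in clones]
```

All five functions return the same value for `data`, yet only the first three would be caught by purely textual or token-based comparison; the last one requires semantic analysis.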
- The use of multi-static analysis methods at different levels of granularity;
- The simultaneous use of static and dynamic methods.
- Methods that offer a sequential comparison of code fragments based on features of code on different granularity levels are more effective than any method that makes a decision based on combinations of code features. This is confirmed by other comparative reviews of methods presented, for example, in studies [32,33];
- Pre-matching and filtering a set of code samples reduces the size of the unmatched set, where semantic methods are applied to identify semantic clones. Such methods utilize machine learning algorithms. Since the unmatched set of code samples, after the syntactic analysis phase, contains only those with structural differences and no syntactic similarity, machine learning models can be tuned more precisely to address the specific task of detecting semantic code clones;
- According to the methods reviewed, the most efficient and stable results are obtained when using graph representations of static and semantic features of code. This makes it necessary to embed graphs into low-dimensional vector representations. Intelligent detection algorithms use GNNs (graph neural networks), RNNs (recurrent neural networks), or CNNs (convolutional neural networks) to produce vectors (e.g., [15,34,35]). According to existing research, GNNs are less time- and memory-intensive on large code bases than RNNs and CNNs. Their weakness, however, is a high likelihood of collisions, which can result in the same attributed vectors being generated for graphs with different topologies and features. CNNs tend to map isomorphic graphs of different functions to similar vectors, while RNN-based methods struggle with functions containing long linear code snippets.
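The collision problem mentioned in the last point can be made concrete with a toy example. The sketch below is not the network of any cited tool: it assumes a simple mean-aggregation message-passing scheme (the function names `embed_graph` and `cosine` are illustrative) and shows two graphs with different topologies receiving identical embeddings.

```python
import math

def embed_graph(features, edges, rounds=2):
    """Toy message passing: each node averages its vector with its neighbours'
    vectors for a few rounds; the graph embedding is the sum of node vectors."""
    neighbours = {n: [] for n in features}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    state = {n: list(v) for n, v in features.items()}
    for _ in range(rounds):
        new_state = {}
        for n, vec in state.items():
            group = [vec] + [state[m] for m in neighbours[n]]
            new_state[n] = [sum(col) / len(group) for col in zip(*group)]
        state = new_state
    return [sum(col) for col in zip(*state.values())]

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Same node features, different topologies: a path and a triangle.
feats = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [1.0, 0.0]}
path_edges = [(0, 1), (1, 2)]
triangle_edges = [(0, 1), (1, 2), (0, 2)]

emb_path = embed_graph(feats, path_edges)
emb_triangle = embed_graph(feats, triangle_edges)
similarity = cosine(emb_path, emb_triangle)
```

Because every node carries the same feature vector, averaging leaves the node states unchanged and both graphs collapse to the same embedding. Richer node attributes are one way to reduce such collisions.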
2.2. Code Clone Detection
2.3. Preliminary Processing of Code Fragments
- The semantic representation of the node (i.e., lexeme), obtained by using Word2vec;
- The number of function calls present in the subtree;
- The number of loops present in the subtree;
- The number of conditional operators present in the subtree;
- The number of switch operators present in the subtree;
- The sum of numeric values (values of nodes of int and float types) present in the subtree.
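A minimal sketch of how such a node attribute vector could be assembled is given below. The `Node` class, the lexeme type names, and the `embedding` lookup table (standing in for a trained Word2vec model) are assumptions made for illustration; whether the subtree counts include the root node itself is also an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                       # lexeme type, e.g. "call", "for", "if", "num"
    value: float = 0.0              # numeric literal value, used when kind == "num"
    children: list = field(default_factory=list)

LOOP_KINDS = {"for", "while", "do"}  # hypothetical loop lexeme types

def subtree_stats(node):
    """Return (calls, loops, conditionals, switches, numeric_sum) for the
    subtree rooted at `node`, counting the root itself (an assumption)."""
    stats = [
        1 if node.kind == "call" else 0,
        1 if node.kind in LOOP_KINDS else 0,
        1 if node.kind == "if" else 0,
        1 if node.kind == "switch" else 0,
        node.value if node.kind == "num" else 0.0,
    ]
    for child in node.children:
        stats = [a + b for a, b in zip(stats, subtree_stats(child))]
    return tuple(stats)

def attribute_vector(node, embedding):
    """Concatenate the node's semantic vector with its subtree statistics."""
    return list(embedding[node.kind]) + list(subtree_stats(node))

# for (...) { if (...) { call(4); } ... 2.5 ... }
tree = Node("for", children=[
    Node("if", children=[Node("call", children=[Node("num", 4.0)])]),
    Node("num", 2.5),
])
embedding = {"for": [0.1, 0.2]}     # stand-in for a Word2vec lookup
root_stats = subtree_stats(tree)
root_vector = attribute_vector(tree, embedding)
```

The recursive form recomputes statistics for every node; a single bottom-up pass that caches child results would be preferable for large trees.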
2.4. Stage of Classification
2.5. Combination of Syntactic and Semantic Analyses
- Byte sequences;
- Assembler instruction sequences;
- Statistical values extracted from the analysis of byte and instruction sequences.
- Initial matching involves matching function signatures, which include the number of basic blocks, the number of edges in the CFG, and statistical data on the number of specific instruction types within functions. At this stage, the call graph is also matched, which is constructed for each analyzed code sample.
- Attribute-driven similarity determination: The similarity of functions successfully matched in the previous step is evaluated using key attributes. These attributes include the hash of the function name, the hash of the function body, the matching of function positions within the call graph, etc.
- For matched functions, their CFGs are compared to detect modifications at the level of individual instructions.
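The initial matching step can be sketched as follows. This is a simplified illustration, not BinDiff's actual implementation: the dictionary layout, the `signature` and `initial_match` names, and the rule that a signature must be unique in both samples are all assumptions.

```python
from collections import defaultdict

def signature(func):
    """Function signature: basic-block count, CFG edge count, and
    per-instruction-type counts (as a hashable, order-independent tuple)."""
    return (func["blocks"], func["edges"],
            tuple(sorted(func["instr_counts"].items())))

def initial_match(funcs_a, funcs_b):
    """Pair functions whose signature is unique in both samples.
    Returns (matched, unmatched_a, unmatched_b)."""
    index_a, index_b = defaultdict(list), defaultdict(list)
    for name, func in funcs_a.items():
        index_a[signature(func)].append(name)
    for name, func in funcs_b.items():
        index_b[signature(func)].append(name)

    matched = {}
    for sig, names in index_a.items():
        others = index_b.get(sig, [])
        if len(names) == 1 and len(others) == 1:
            matched[names[0]] = others[0]

    unmatched_a = sorted(set(funcs_a) - set(matched))
    unmatched_b = sorted(set(funcs_b) - set(matched.values()))
    return matched, unmatched_a, unmatched_b

# Hypothetical function summaries extracted from two binaries.
sample_a = {
    "sha_init": {"blocks": 3, "edges": 2, "instr_counts": {"call": 1, "mov": 7}},
    "aes_enc": {"blocks": 9, "edges": 12, "instr_counts": {"call": 2, "xor": 16}},
}
sample_b = {
    "sub_401000": {"blocks": 3, "edges": 2, "instr_counts": {"mov": 7, "call": 1}},
    "sub_402000": {"blocks": 11, "edges": 15, "instr_counts": {"call": 2, "xor": 16}},
}
matched, unmatched_a, unmatched_b = initial_match(sample_a, sample_b)
```

In a two-stage pipeline, only the functions left in `unmatched_a` and `unmatched_b` would need to be passed on to the more expensive semantic analysis.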
Algorithm 1. Algorithm for the preliminary stage.
Algorithm 2. Algorithm for the classification stage.
2.6. A Demonstration Example
3. Results
- Disassembling and restoring the function code using IDA Pro 7.7.
- Building an AST for the restored code of all functions.
- Training the Word2vec model on the combined set of lexeme types of all executable files. Sequences of lexemes of function bodies are used as sentences (continuous sequences of tokens). A mapping of the lexeme set onto a set of semantic vectors is formed.
- Each node is assigned an attribute vector consisting of a Word2vec semantic vector and statistical information on the number of lexemes of a certain type in a subtree. As a result, the AASTs are built.
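The corpus preparation for the Word2vec step can be sketched as below. The tree encoding as `(lexeme_type, children)` tuples and the pre-order traversal are assumptions; the source only states that lexeme sequences of function bodies serve as sentences.

```python
def lexeme_sentence(node):
    """Pre-order walk over a (lexeme_type, children) tree, returning the
    sequence of lexeme types as one 'sentence'."""
    kind, children = node
    sentence = [kind]
    for child in children:
        sentence.extend(lexeme_sentence(child))
    return sentence

def build_corpus(functions):
    """One sentence per function body, in the insertion order of `functions`."""
    return [lexeme_sentence(ast) for ast in functions.values()]

# Hypothetical ASTs of two restored functions.
functions = {
    "f": ("block", [("if", [("call", [])]), ("return", [("num", [])])]),
    "g": ("block", [("for", [("call", [])])]),
}
corpus = build_corpus(functions)
vocabulary = sorted({token for sentence in corpus for token in sentence})
```

A corpus in this list-of-token-lists form is what common Word2vec implementations, for example gensim's `Word2Vec`, accept as training sentences; the resulting vectors for the entries of `vocabulary` then give the mapping of lexemes onto semantic vectors.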
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Myroshnyk, Y. State of IoT Summer 2024 Report. Available online: https://iot-analytics.com/product/state-of-iot-summer-2024/ (accessed on 17 September 2024).
- Cross-Industry Insight: IoT Market Opportunities and Top Spend Use Cases. Available online: https://www.gartner.com/en/documents/4432199 (accessed on 17 September 2024).
- Li, Z.; Zou, D.; Xu, S.; Ou, X.; Jin, H.; Wang, S.; Deng, Z.; Zhong, Y. VulDeePecker: A Deep Learning-Based System for Vulnerability Detection. In Proceedings of the 25th Annual Network and Distributed System Security Symposium, San Diego, CA, USA, 18–21 February 2018.
- Jiang, W.P.; Wu, B.; Jiang, Z.; Yang, S.B. Cloning Vulnerability Detection in Driver Layer of IoT Devices. In Information and Communications Security; Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2020; Volume 11999, pp. 89–104.
- Gao, J.; Yang, X.; Jiang, Y.; Song, H.; Choo, K.K.R.; Sun, J. Semantic Learning Based Cross-Platform Binary Vulnerability Search for IoT Devices. IEEE Trans. Ind. Inform. 2021, 17, 971–979.
- Jiang, L.; Su, Z.; Chiu, E. Context-Based Detection of Clone-Related Bugs. In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Dubrovnik, Croatia, 3–7 September 2007; pp. 55–64.
- Xu, X.; Liu, C.; Feng, Q.; Yin, H.; Song, L.; Song, D. Neural Network-Based Graph Embedding for Cross-Platform Binary Code Similarity Detection. In Proceedings of the ACM Conference on Computer and Communications Security, Dallas, TX, USA, 30 October–3 November 2017; pp. 363–376.
- Peng, J.; Wang, Y.; Xue, J.; Liu, Z. Fast Cross-Platform Binary Code Similarity Detection Framework Based on CFGs Taking Advantage of NLP and Inductive GNN. Chin. J. Electron. 2024, 33, 128–138.
- Wang, S.; Jiang, X.; Yu, X.; Su, X. Cross-Platform Binary Code Homology Analysis Based on GRU Graph Embedding. Secur. Commun. Netw. 2021, 2021, 1–8.
- Fu, L.; Ji, S.; Liu, C.; Liu, P.; Duan, F.; Wang, Z.; Chen, W.; Wang, T. Focus: Function Clone Identification on Cross-Platform. Int. J. Intell. Syst. 2022, 37, 5082–5112.
- Quradaa, F.H.; Shahzad, S.; Almoqbily, R.S. A Systematic Literature Review on the Applications of Recurrent Neural Networks in Code Clone Research. PLoS ONE 2024, 19, e0296858.
- Feng, Q.; Zhou, R.; Xu, C.; Cheng, Y.; Testa, B.; Yin, H. Scalable Graph-Based Bug Search for Firmware Images. In Proceedings of the ACM Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 480–491.
- Ragkhitwetsagul, C.; Krinke, J.; Clark, D. A Comparison of Code Similarity Analysers. Empir. Softw. Eng. 2018, 23, 2464–2519.
- Chandramohan, M.; Xue, Y.; Xu, Z.; Liu, Y.; Cho, C.Y.; Kuan, T.H.B. BinGo: Cross-Architecture Cross-OS Binary Search. In Proceedings of the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Seattle, WA, USA, 13–18 November 2016; pp. 678–689.
- Roy, C.K.; Cordy, J.R.; Koschke, R. Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach. Sci. Comput. Program. 2009, 74, 470–495.
- Gan, S.T.; Qin, X.J.; Chen, Z.N.; Wang, L.Z. Software Vulnerability Code Clone Detection Method Based on Characteristic Metrics. J. Softw. 2015, 26, 348–363.
- Li, Z.; Zou, D.; Xu, S.; Jin, H.; Qi, H.; Hu, J. VulPecker: An Automated Vulnerability Detection System Based on Code Similarity Analysis. In Proceedings of the ACM International Conference Proceeding Series, Los Angeles, CA, USA, 5–9 December 2016; pp. 201–213.
- Kim, S.; Woo, S.; Lee, H.; Oh, H. VUDDY: A Scalable Approach for Vulnerable Code Clone Discovery. In Proceedings of the IEEE Symposium on Security and Privacy, San Jose, CA, USA, 22–26 May 2017; pp. 595–614.
- Zou, D.; Wang, S.; Xu, S.; Li, Z.; Jin, H. MVulDeePecker: A Deep Learning-Based System for Multiclass Vulnerability Detection. IEEE Trans. Dependable Secur. Comput. 2019, 18, 2224–2236.
- Liu, Z.; Liao, Q.; Gu, W.; Gao, C. Software Vulnerability Detection with GPT and In-Context Learning. In Proceedings of the 2023 8th International Conference on Data Science in Cyberspace, Hefei, China, 18–20 August 2023; pp. 229–236.
- Wu, Y.; Zou, D.; Dou, S.; Yang, W.; Xu, D.; Jin, H. VulCNN: An Image-Inspired Scalable Vulnerability Detection System. In Proceedings of the International Conference on Software Engineering, Pittsburgh, PA, USA, 21–29 May 2022; pp. 2365–2376.
- Kim, S.; Choi, J.; Ahmed, M.E.; Nepal, S.; Kim, H. VulDeBERT: A Vulnerability Detection System Using BERT. In Proceedings of the 2022 IEEE International Symposium on Software Reliability Engineering Workshops, Charlotte, NC, USA, 31 October–3 November 2022; pp. 69–74.
- Xue, J.; Yu, Z.; Song, Y.; Qin, Z.; Sun, X.; Wang, W. VulSAT: Source Code Vulnerability Detection Scheme Based on SAT Structure. In Proceedings of the 2023 8th International Conference on Signal and Image Processing, Wuxi, China, 8–10 July 2023; pp. 639–644.
- Google/Bindiff. Available online: https://github.com/google/bindiff (accessed on 17 September 2024).
- Yang, S.; Cheng, L.; Zeng, Y.; Lang, Z.; Zhu, H.; Shi, Z. Asteria: Deep Learning-Based AST-Encoding for Cross-Platform Binary Code Similarity Detection. In Proceedings of the 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Taipei, Taiwan, 21–24 June 2021; pp. 224–236.
- Yang, S.; Dong, C.; Xiao, Y.; Cheng, Y.; Shi, Z.; Li, Z.; Sun, L. Asteria-Pro: Enhancing Deep Learning-Based Binary Code Similarity Detection by Incorporating Domain Knowledge. ACM Trans. Softw. Eng. Methodol. 2023, 33, 1–40.
- Bourquin, M.; King, A.; Robbins, E. BinSlayer: Accurate Comparison of Binary Executables. In Proceedings of the 2nd ACM SIGPLAN Program Protection and Reverse Engineering Workshop, Rome, Italy, 26 January 2013; pp. 1–10.
- Huang, H.; Youssef, A.M.; Debbabi, M. BinSequence: Fast, Accurate and Scalable Binary Code Reuse Detection. In Proceedings of the 2017 ACM Asia Conference on Computer and Communications Security, Abu Dhabi, United Arab Emirates, 2–6 April 2017; pp. 155–166.
- Zhao, B.; Ji, S.; Xu, J.; Tian, Y.; Wei, Q.; Wang, Q.; Lyu, C.; Zhang, X.; Lin, C.; Wu, J.; et al. A Large-Scale Empirical Analysis of the Vulnerabilities Introduced by Third-Party Components in IoT Firmware. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual, 18–22 July 2022; pp. 442–454.
- Wang, S.; Wu, D. In-Memory Fuzzing for Binary Code Similarity Analysis. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, Urbana, IL, USA, 30 October–3 November 2017; pp. 319–330.
- Roundy, K.A.; Miller, B.P. Hybrid Analysis and Control of Malware. In Recent Advances in Intrusion Detection; Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2010; Volume 6307, pp. 317–338.
- Dai, H.; Dai, B.; Song, L. Discriminative Embeddings of Latent Variable Models for Structured Data. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; Volume 48, pp. 2702–2711.
- Marcelli, A.; Graziano, M.; Ugarte-Pedrero, X.; Fratantonio, Y.; Mansouri, M.; Balzarotti, D. How Machine Learning Is Solving the Binary Function Similarity Problem. In Proceedings of the 31st USENIX Security Symposium, Boston, MA, USA, 10–12 August 2022; pp. 2099–2116.
- Alrabaee, S.; Debbabi, M.; Wang, L. A Survey of Binary Code Fingerprinting Approaches: Taxonomy, Methodologies, and Features. ACM Comput. Surv. 2022, 55, 1–41.
- Haq, I.U.; Caballero, J. A Survey of Binary Code Similarity. ACM Comput. Surv. 2021, 54, 1–38.
- Lirong, F.; Peiyu, L.; Meng, W.; Lu, K.; Zhou, S.; Zhang, X.; Chen, W.; Ji, S. Understanding the AI-powered Binary Code Similarity Detection. arXiv 2024, arXiv:2410.07537.
- Xia, B.; Pang, J.; Zhou, X.; Shan, Z.; Wang, J.; Yue, F. Binary code similarity analysis based on naming function and common vector space. Sci. Rep. 2023, 13, 15676.
- DNN Binary Code Similarity Detection. Available online: https://github.com/xiaojunxu/dnn-binary-code-similarity (accessed on 17 September 2024).
- Asteria-Pro. Available online: https://github.com/Asteria-BCSD/Asteria-Pro (accessed on 17 September 2024).
Method | Low-Level Features | High-Level Features | Detecting Technique | Combining Technique | Specifics |
---|---|---|---|---|---|
Genius [12] | Assembler code | CFG | Static. CFG + functional vector distance calculation | Sequential application to reduce the size of the set of unmatched samples | Analysis of the CFG alone does not provide enough semantic information to determine similarity accurately. |
Gemini [7] | Assembler code | ACFG | Static. ACFG + GNN | Sequential application to reduce the size of the set of unmatched samples | Analysis of the CFG alone does not provide enough semantic information to determine similarity accurately. |
Asteria [25], Asteria-Pro [26] | Assembler code, function metadata | ACFG, AST, code statistics | Static. ACFG + AST + GNN + Tree-LSTM | Sequential application to reduce the size of the set of unmatched samples | Many special sequential steps for data processing. The method requires a lot of time. Accuracy is limited because the collected semantic information is not comprehensive. |
BinSlayer [27] | Assembler code, function metadata | CG, CFG | Static. BinDiff + Hungarian algorithm for matching functions by GED | Sequential application to reduce the size of the set of unmatched samples | It can be applied to large sets of code samples. Matched samples are excluded before applying the Hungarian algorithm. |
BinSequence [28] | Assembler code, normalized assembler code | CFG | Static. Preliminary analysis of the similarity of the number of basic blocks and of the vector representations of normalized assembler code + Analysis of similarity of paths in CFG | Sequential application to reduce the size of the set of unmatched samples | It can be applied to large sets of code samples. Clone detection is performed using graph theory only, without machine learning. |
Zhao et al. [29] | Assembler code | ACFG | Static. ACFG analysis + GNN | Code features are combined within a single method to produce a decision on the similarity | The analysis algorithm is difficult to scale because there is no preliminary reduction of the set size. |
IMF-SIM [30] | Assembler code | Process execution traces | Static + Dynamic. Reverse taint-analysis to resolve data types + Construction and comparison of program execution traces based on in-memory fuzzing | Code features are used sequentially and cyclically | High complexity. It requires a secure execution environment for the software being analyzed. It also requires a lot of time for high code coverage. |
Roundy et al. [31] | Assembler code | CFG, process behavior | Static + Dynamic. Analysis of CFG isomorphisms + Modifications of CFG based on data from code execution with instrumentation | Sources of code features are used sequentially and cyclically: construction of CFG based on static analysis, obtaining data from dynamic analysis, modification of CFG, etc. | High complexity: it requires a secure execution environment for the software being analyzed. Analysis of the CFG alone does not provide enough semantic information to determine similarity accurately. |
Proposed method | Assembler code, function metadata | CG, CFG, AAST | Static. BinDiff + AAST + two deep GNNs | Sequential application to reduce the size of the set of unmatched samples | It can be applied to large sets of code fragments. BinDiff output is refined using comprehensive machine learning analysis of AAST. Modular (e.g., BinDiff can be replaced with another extraction algorithm). |
Dataset | Num. of Functions | Num. of Clusters |
---|---|---|
Training dataset | 8267 | 3416 |
Validation dataset | 1116 | 486 |
Testing dataset | 1276 | 474 |
Total | 10,659 | 4376 |
Binary File | Software | System | Architecture | Compiled with Optimization |
---|---|---|---|---|
libcrypto.so.1.0.0 | OpenSSL v. 1.0.0, open source library (OpenSSL Software Foundation Inc., Newark, DE, USA) | Linux | MIPS | O2 |
libcrypto-1_1.dll | OpenSSL v. 1.1.1, open source library (OpenSSL Software Foundation Inc., Newark, DE, USA) | Windows | x86 | O2 |
Characteristic | Syntactic-Only (BinDiff Works) | Semantic-Only (Only Semantic Part of the Proposed Method Works) | Syntactic–Semantic (Proposed Method Works) |
---|---|---|---|
Similarity score | 0.26 | 0.89 | 0.996 |
Method | Recall | Precision | F1 |
---|---|---|---|
Gemini | 0.880 | 0.889 | 0.884 |
Asteria | 0.510 | 0.554 | 0.531 |
Asteria-Pro | 0.698 | 0.648 | 0.672 |
Proposed method | 0.907 | 0.894 | 0.900 |
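As a consistency check on the table above, the F1 column can be recomputed from the recall and precision columns using the standard definition F1 = 2PR / (P + R). The snippet is a small sketch; only the numbers are taken from the table.

```python
# (recall, precision) pairs copied from the table above.
rows = {
    "Gemini": (0.880, 0.889),
    "Asteria": (0.510, 0.554),
    "Asteria-Pro": (0.698, 0.648),
    "Proposed method": (0.907, 0.894),
}

def f1(recall, precision):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Round to three decimals, as in the table.
scores = {name: round(f1(r, p), 3) for name, (r, p) in rows.items()}
best = max(scores, key=scores.get)
```

The recomputed values agree with the reported F1 scores to three decimal places, and the proposed method has the highest score among the four.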
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Kalinin, M.; Gribkov, N. Syntactic–Semantic Detection of Clone-Caused Vulnerabilities in the IoT Devices. Sensors 2024, 24, 7251. https://doi.org/10.3390/s24227251