Question–Answer Methodology for Vulnerable Source Code Review via Prototype-Based Model-Agnostic Meta-Learning
Abstract
1. Introduction
- Generalization Challenges: Numerous ML models rely on superficial patterns rather than a profound comprehension of code context, thereby limiting their efficacy in detecting novel vulnerabilities. Furthermore, many models are designed to analyze only one programming language, posing a significant obstacle in environments requiring the evaluation of multiple languages [17].
- Dependence on Extensive Data Sets: Traditional approaches require vast data sets annotated at the token or sequence level to identify vulnerabilities, increasing processing demands and diminishing the inherent advantages of these models. Despite the significant potential of Deep Learning (DL), Natural Language Processing (NLP), and Statistical Learning (SL) approaches, these techniques face limitations stemming from the scarcity and diversity of training samples, complicating their generalization and prolonging annotation processes [11,18].
- Static Nature: Many models struggle to adapt to evolving codebases and emerging vulnerabilities, thereby reducing their efficacy over time and necessitating frequent re-training, which imposes considerable operational overheads [18].
- Insufficient Contextual Responses: ML-based tools often fail to provide actionable and precise responses, which is particularly problematic in the DevSecOps context, where practical and direct solutions for code correction are essential. This lack of interpretability represents a challenge for security analysts, who require accurate and contextually grounded insights within their workflows [19,20].
2. Contributions of Proto-MAML to the State-of-the-Art
- Dynamic Prototypes and Real-Time Adaptability: Proto-MAML leverages Prototypical Networks to create dynamic, task-specific representations, allowing seamless adaptation to evolving code patterns without extensive re-training (a minimal sketch of this prototype mechanism follows this list). This capability sets it apart from traditional static methods, which rely on frequent updates to remain effective. Furthermore, Proto-MAML dynamically adjusts to changes in codebases and emerging threats, ensuring continuous protection while maintaining computational efficiency.
- Few-Shot Learning (FSL): The FSL capabilities of Proto-MAML ensure high accuracy even when limited labeled data are available. This ability to generalize from minimal data addresses a fundamental challenge in security vulnerability detection, particularly for less-common programming languages or newly identified vulnerabilities.
- Contextual and Predictive Insights: By integrating BERT, Proto-MAML enhances interpretability by generating contextualized explanations and predictive remediation strategies. This combination not only identifies vulnerabilities but also anticipates potential issues, providing actionable recommendations that assist development teams in proactively mitigating risks.
- Resource Optimization and Efficiency: Proto-MAML optimizes computational resources by reducing the parameters required for task-specific adaptation, facilitated by Prototypical Networks. This efficiency ensures seamless integration into CI/CD pipelines, enabling real-time security assessments without introducing delays while maintaining rapid development cycles and prioritizing security.
- Actionable Interpretability: Proto-MAML bridges the gap between technical detection and practical remediation by delivering clear outputs in natural language. This approach enhances collaboration between development, security, and operations teams, ensuring that the proposed solutions are comprehensible and can be applied immediately.
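To make the prototype mechanism referenced above concrete, the following is a minimal sketch of how class prototypes can be computed from a support set and used as a distance-based classifier, assuming PyTorch and BERT-style embeddings; the function names, dimensions, and example data are illustrative assumptions rather than the paper's implementation.

```python
# Minimal Prototypical-Network sketch: prototypes are class means of support embeddings,
# and queries are classified by (negative squared) distance to each prototype.
import torch

def compute_prototypes(support_embeddings: torch.Tensor,
                       support_labels: torch.Tensor) -> torch.Tensor:
    """Average the support embeddings of each class into one prototype.

    support_embeddings: (num_support, hidden_dim)
    support_labels:     (num_support,) integer class ids 0..C-1
    returns:            (C, hidden_dim), one prototype per class
    """
    classes = support_labels.unique()  # sorted class ids
    return torch.stack([support_embeddings[support_labels == c].mean(dim=0)
                        for c in classes])

def classify_by_distance(query_embeddings: torch.Tensor,
                         prototypes: torch.Tensor) -> torch.Tensor:
    """Assign each query to its nearest prototype; negative squared distances act as logits."""
    distances = torch.cdist(query_embeddings, prototypes)   # (num_query, C)
    return (-distances.pow(2)).softmax(dim=-1)              # class probabilities

# Example with random embeddings: 3 classes, 5 support samples each, 4 queries.
support = torch.randn(15, 768)
labels = torch.arange(3).repeat_interleave(5)
queries = torch.randn(4, 768)
print(classify_by_distance(queries, compute_prototypes(support, labels)).shape)  # (4, 3)
```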
3. Related Works
4. Materials and Methods
- Question: Why is this code vulnerable?
- Answer: According to CWE-476 (NULL Pointer Dereference), the product dereferences a pointer that it expects to be valid but is instead NULL. In this code, the ptr variable is allocated memory, but that memory is never actually used because the pointer is improperly initialized or handled. Below is the corrected version of the code:
4.1. Inputs
- PHP: The training set comprises 7893 samples, and the test set includes 1974 samples.
- Java: The training set comprises 3741 samples, and the test set includes 936 samples.
- C: The training set comprises 4015 samples, and the test set includes 1004 samples.
- C++: The training set comprises 3230 samples, and the test set includes 808 samples.
4.2. Few-Shot Task Generation
4.3. Proto-MAML
4.3.1. BERT Preprocessing and Tokenization
- Handling Comments: Comments within the code fragments are identified using language-specific delimiters (e.g., // or /* */ for PHP). Depending on their relevance, these comments are either retained to enrich the context or removed to eliminate superfluous information.
- Normalization: Non-essential whitespace, tab characters, and special symbols are standardized. Natural language text is converted to lowercase, ensuring alignment with the case-insensitive characteristics of the BERT tokenizer.
- Irrelevant Character Filtering: Unnecessary symbols, such as excessive punctuation or repeated line breaks, are removed from the input. However, tokens integral to the source code, such as function calls, variable names, and SQL queries, are preserved to maintain contextual integrity.
- Retention of Stopwords: Stopwords are deliberately retained within both the questions and the contexts, diverging from traditional preprocessing techniques. Preserving these elements is essential to maintain the semantic and syntactic coherence between questions and contexts, particularly for the accurate identification of answer spans within the context (a tokenization sketch follows this list).
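As a concrete illustration of this preprocessing and tokenization step, the sketch below encodes a question–context pair with the Hugging Face tokenizer, assuming the bert-base-uncased checkpoint; the example question, code fragment, and parameter choices (maximum length, truncation strategy) are assumptions for illustration, not the exact configuration used in the paper.

```python
# Tokenize a (question, context) pair for span extraction; offset mappings let us map
# predicted token positions back to character positions in the original code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "Why is this code vulnerable?"
context = '$sql = "SELECT * FROM users WHERE id = " . $_GET["id"]; mysqli_query($conn, $sql);'

encoding = tokenizer(
    question,
    context,
    truncation="only_second",      # truncate the code context, never the question
    max_length=384,                # illustrative limit
    return_offsets_mapping=True,   # character offsets for each context token
    return_tensors="pt",
)
# encoding["input_ids"] holds [CLS] question [SEP] context [SEP];
# encoding["offset_mapping"] is used to recover the answer span in the original text.
```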
Listing 1. PHP code vulnerable to SQL injection.
- The [CLS] embedding summarizes the global representation of the input;
- The token embeddings of the question and context segments capture the relationships specific to the question and context, respectively (a small sketch of how these embeddings and the span scores are obtained follows this list).
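For completeness, the sketch below shows how these embeddings and per-token start/end scores can be obtained, assuming PyTorch, the bert-base-uncased checkpoint, and a standard two-output linear head; this mirrors common extractive-QA practice rather than the paper's exact architecture.

```python
# Obtain the [CLS] embedding, the per-token embeddings, and start/end logits for span extraction.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
qa_head = torch.nn.Linear(bert.config.hidden_size, 2)  # one start score and one end score per token

enc = tokenizer("Why is this code vulnerable?",
                'mysqli_query($conn, "SELECT * FROM users WHERE id = $id");',
                return_tensors="pt")
with torch.no_grad():
    outputs = bert(**enc)

token_embeddings = outputs.last_hidden_state   # (1, seq_len, hidden): question and context tokens
cls_embedding = token_embeddings[:, 0]         # global [CLS] summary of the whole input
start_logits, end_logits = qa_head(token_embeddings).split(1, dim=-1)
start_logits, end_logits = start_logits.squeeze(-1), end_logits.squeeze(-1)
print(start_logits.shape, end_logits.shape)    # (1, seq_len) each
```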
4.3.2. Proto-MAML Training Within a BERT-Enhanced QA Framework
Algorithm 1. Meta-Optimization and Proto-MAML Model Generation.
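As a rough illustration of the meta-optimization loop in Algorithm 1, the sketch below casts each few-shot task as classification against prototype-initialized output weights, adapts those weights in an inner loop, and accumulates the query loss for the outer (meta) update, assuming PyTorch. The encoder, learning rates, number of inner steps, and the classification framing (the paper operates on start/end span positions) are simplifying assumptions, not the authors' exact algorithm.

```python
import torch
import torch.nn.functional as F

def meta_train_step(encoder, tasks, meta_optimizer, inner_steps=3, inner_lr=1e-2):
    """One outer-loop update over a batch of few-shot tasks (simplified Proto-MAML)."""
    meta_loss = 0.0
    for support_x, support_y, query_x, query_y in tasks:  # labels assumed to be 0..C-1
        feats = encoder(support_x)                        # (n_support, hidden)
        # Prototype-initialized task head: logits = -||x - c||^2 up to a constant,
        # i.e., weight = 2 * prototype and bias = -||prototype||^2.
        protos = torch.stack([feats[support_y == c].mean(dim=0)
                              for c in support_y.unique()])
        weight, bias = 2 * protos, -protos.pow(2).sum(dim=1)
        # Inner loop: adapt only the task head on the support set, keeping the
        # computation graph so the meta-update also reaches the shared encoder.
        for _ in range(inner_steps):
            loss = F.cross_entropy(feats @ weight.t() + bias, support_y)
            g_w, g_b = torch.autograd.grad(loss, (weight, bias), create_graph=True)
            weight, bias = weight - inner_lr * g_w, bias - inner_lr * g_b
        # Outer loop: query loss with the adapted head.
        meta_loss = meta_loss + F.cross_entropy(encoder(query_x) @ weight.t() + bias, query_y)
    meta_optimizer.zero_grad()
    meta_loss.backward()
    meta_optimizer.step()
    return float(meta_loss)
```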
5. Results
- Calculation of the Differences ($d_j$): Compute the differences between each pair of observations from the resulting 5-shot-i-way models as $d_j = x_j - y_j$, where $x_j$ and $y_j$ represent the observations from the two related samples in the model for the observation pair $j$.
- Ranking of the Absolute Differences: Order the differences by their absolute values, excluding any differences equal to zero, and assign ranks to them. If there are differences with the same absolute value, assign them an average rank.
- Sum of Positive and Negative Ranks: Sum the ranks corresponding to the positive differences and the ranks corresponding to the negative differences as follows: $W^{+} = \sum_{d_j > 0} R_j$ and $W^{-} = \sum_{d_j < 0} R_j$, where $R_j$ is the rank assigned to the absolute value of $d_j$.
- Test Statistic: The test statistic of the Wilcoxon test is the smaller of $W^{+}$ and $W^{-}$; namely, $W = \min(W^{+}, W^{-})$.
- Determination of the p-Value: The value of $W$ is compared to a Wilcoxon distribution table to determine the corresponding p-value. The commonly used significance level is $\alpha = 0.05$. If $p < \alpha$, the null hypothesis is rejected, concluding that there is a significant difference between the two samples.
- For each pair of models $(f, g)$, where $f \neq g$ and $l \in \{\text{PHP}, \text{Java}, \text{C}, \text{C++}\}$:
- Difference in Precision: $d_{P,l} = P_{g,l} - P_{f,l}$,
- Difference in Recall: $d_{R,l} = R_{g,l} - R_{f,l}$,
- Difference in F1: $d_{F1,l} = F1_{g,l} - F1_{f,l}$;
- Classification and Sign of the Differences (a sketch of this paired comparison follows this list):
- (a) If $d > 0$, then model g is better than model f in language l;
- (b) If $d < 0$, then model g is worse than model f in language l.
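For reference, the paired comparison described above can be reproduced with SciPy's signed-rank test; the sketch below uses placeholder F1 values for two hypothetical models rather than the paper's reported measurements.

```python
# Wilcoxon signed-rank test on paired per-split F1 observations from two models.
from scipy.stats import wilcoxon

f1_model_g = [0.981, 0.974, 0.990, 0.968, 0.979]  # placeholder observations for model g
f1_model_f = [0.963, 0.971, 0.984, 0.955, 0.977]  # placeholder observations for model f

w_statistic, p_value = wilcoxon(f1_model_g, f1_model_f)  # two-sided test on the differences
alpha = 0.05
if p_value < alpha:
    print(f"W={w_statistic:.1f}, p={p_value:.4f}: significant difference between the models")
else:
    print(f"W={w_statistic:.1f}, p={p_value:.4f}: no significant difference")
```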
6. Discussion
- Automation of Security Scanning: Proto-MAML can be seamlessly integrated into CI/CD pipelines to analyze code fragments rapidly using FSL, generating context-aware questions and answers grounded in the semantic context of the code (a minimal integration sketch follows this list). This capability enables early vulnerability detection and the provision of concrete solutions through secure code reconstructions. Furthermore, Proto-MAML is applicable to multiple programming languages and addresses a range of critical vulnerabilities. Unlike state-of-the-art approaches, which are limited to a few languages (e.g., C and C++) and cover only 3–7 CWEs, Proto-MAML processed 9867 PHP samples, 4677 Java samples, 5019 C samples, and 4038 C++ samples, as well as addressing over 24 CWEs, thereby significantly broadening its practical utility.
- Interdisciplinary Collaboration: By using a natural language-based approach, Proto-MAML facilitates effective interactions among development, operations, and security teams. The question-and-answer outputs are easily interpretable, eliminating the need for additional analysis and fostering seamless communication across disciplines.
- Continuous Security Integration: Incorporating security evaluations as a core component of CI/CD pipelines ensures consistent code assessment, enhancing integrity and security. Proto-MAML excels in this area due to its low computational complexity, enabling dynamic security assessments without significantly impacting deployment times. It also provides advanced metrics such as Accuracy, Recall, F1-Score, and EM, which support its continuous performance improvement.
- Real-Time Monitoring and Auditing: The prototype-based architecture of Proto-MAML allows for rapid adaptation to new code samples by leveraging the general characteristics of queries. This adaptability makes it ideal for dynamic systems with limited data, as it extracts rich semantic context and efficiently adjusts to emerging vulnerabilities.
- Predictive Capability: Proto-MAML accurately identifies the specific positions of vulnerabilities within code (start and end indices), simplifying traceability and auditing processes. This actionable and precise information significantly enhances the ability of technical teams to efficiently address vulnerabilities.
- Training and Awareness in Security: By analyzing the semantic context of source code with BERT, Proto-MAML identifies insecure dependencies and generates clear predicted responses. This functionality is invaluable for detecting issues in third-party libraries or frameworks, not only preventing vulnerabilities but also educating development teams on secure coding practices, allowing this knowledge to be seamlessly integrated into daily workflows.
- Dependency and Third-Party Component Management: Proto-MAML evaluates external dependencies for vulnerabilities and provides practical solutions while explaining the associated context. Its ability to handle underrepresented languages such as PHP, which accounts for a substantial share of insecure web deployments, along with its focus on over 24 CWEs makes it highly scalable and adaptable to diverse environments. Additionally, its contextual approach allows it to be applied across different programming languages, as it relies on semantic context rather than specific code structures.
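As an illustration of how such a scanner could be wired into a CI/CD step, the sketch below runs a placeholder scan function over the files changed in the last commit and fails the build when findings are reported; the scan interface, finding fields, and git invocation are assumptions for illustration, not part of the paper's tooling.

```python
import subprocess
import sys

def scan_file(path: str) -> list:
    """Placeholder for the trained Proto-MAML QA model. A real deployment would load the
    model here and return findings, e.g., dicts with a CWE id, the predicted (start, end)
    answer span, and a suggested remediation."""
    return []

def changed_files() -> list:
    """Source files modified in the last commit (illustrative git invocation)."""
    result = subprocess.run(["git", "diff", "--name-only", "HEAD~1", "HEAD"],
                            capture_output=True, text=True, check=True)
    return [p for p in result.stdout.splitlines()
            if p.endswith((".php", ".java", ".c", ".cpp"))]

def main() -> int:
    findings = []
    for path in changed_files():
        findings.extend((path, finding) for finding in scan_file(path))
    for path, finding in findings:
        print(f"{path}: {finding['cwe']} at span {finding['span']} -> {finding['remediation']}")
    return 1 if findings else 0  # a non-zero exit code fails the CI stage

if __name__ == "__main__":
    sys.exit(main())
```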
7. Conclusions
7.1. Theoretical Contributions
- Multilingual Vulnerability Detection: Proto-MAML integrates Prototypical Networks and MAML within a QA-based architecture, enabling a single model to detect vulnerabilities across programming languages like PHP, Java, C, and C++. This approach adapts meta-learning principles to handle the structural and syntactic diversity of source code, which is a common challenge in multilingual environments. The theoretical impact lies in extending meta-learning to heterogeneous data domains.
- Precise Vulnerability Localization: Leveraging the QA framework, the model identifies the exact start and end positions of vulnerabilities within the source code. This level of detail connects theoretical advancements in meta-learning with practical needs for targeted code review and correction. It provides a bridge between machine learning outputs and actionable developer workflows.
- Generalization in Data-Scarce Scenarios: Proto-MAML uses FSL to train effectively with minimal annotated data, addressing a critical limitation in vulnerability datasets. The contribution lies in demonstrating how meta-learning can generalize to real-world problems where comprehensive training datasets are unavailable, a common constraint in software security.
- Integration of Detection and Remediation: The model not only identifies vulnerabilities but also provides contextual suggestions for remediation. This feature aligns theoretical advancements with practical applications by connecting model outputs to actionable developer tasks, making the process of addressing vulnerabilities more efficient.
- Efficiency for CI/CD Integration: The low computational complexity of Proto-MAML ensures that it is scalable and compatible with time-sensitive CI/CD pipelines. This shows how meta-learning models can be designed to meet the operational constraints of modern software development environments.
7.2. Practical Contributions
- Average Precision, Recall, F1, and EM scores exceeded , , and , respectively.
- PHP achieved the highest metrics, including the best F1 and EM scores.
- The low computational complexity of Proto-MAML ensures its practical application in real-time CI/CD workflows.
7.3. Future Research Directions
- Expanding Language Support: Including additional languages such as Python, JavaScript, and Go through transfer learning techniques.
- Improving Explainability: Enhancing interpretability by linking vulnerabilities to specific code regions and providing clear rationales for remediation.
- Real-World Deployment: Evaluating the model in operational CI/CD environments to refine its practical applications.
- Continuous Learning: Enabling incremental adaptation to emerging vulnerabilities and evolving coding practices.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Möller, D.P. Cybersecurity in digital transformation. In Guide to Cybersecurity in Digital Transformation: Trends, Methods, Technologies, Applications and Best Practices; Springer: Berlin/Heidelberg, Germany, 2023; pp. 1–70. [Google Scholar]
- Beaulieu, N.; Dascalu, S.M.; Hand, E. API-first design: A survey of the state of academia and industry. In Proceedings of the ITNG 2022 19th International Conference on Information Technology-New Generations, Las Vegas, NV, USA, 10–13 April 2022; pp. 73–79. [Google Scholar]
- Zhang, F.; Kodituwakku, H.A.D.E.; Hines, J.W.; Coble, J. Multilayer Data-Driven Cyber-Attack Detection System for Industrial Control Systems Based on Network, System, and Process Data. IEEE Trans. Ind. Inform. 2019, 15, 4362–4369. [Google Scholar] [CrossRef]
- Massaoudi, M.; Refaat, S.S.; Abu-Rub, H. Intrusion Detection Method Based on SMOTE Transformation for Smart Grid Cybersecurity. In Proceedings of the 2022 3rd International Conference on Smart Grid and Renewable Energy (SGRE), Doha, Qatar, 20–22 March 2022; pp. 1–6. [Google Scholar] [CrossRef]
- Chen, J.; Mohamed, M.A.; Dampage, U.; Rezaei, M.; Salmen, S.H.; Obaid, S.A.; Annuk, A. A multi-layer security scheme for mitigating smart grid vulnerability against faults and cyber-attacks. Appl. Sci. 2021, 11, 9972. [Google Scholar] [CrossRef]
- Souppaya, M.; Scarfone, K.; Dodson, D. Secure Software Development Framework (SSDF) Version 1.1: Recommendations for Mitigating the Risk of Software Vulnerabilities; Technical Report SP 800-218; National Institute of Standards and Technology (NIST): Gaithersburg, MD, USA, 2022.
- Common Vulnerabilities and Exposures (CVE). 1999. Available online: https://cve.mitre.org/ (accessed on 13 September 2024).
- National Institute of Standards and Technology. National Institute of Standards and Technology (NIST) Official Website. 2024. Available online: https://www.nist.gov (accessed on 8 October 2024).
- Homès, B. Fundamentals of Software Testing; John Wiley & Sons: Hoboken, NJ, USA, 2024. [Google Scholar]
- Lombardi, F.; Fanton, A. From DevOps to DevSecOps is not enough. CyberDevOps: An extreme shifting-left architecture to bring cybersecurity within software security lifecycle pipeline. Softw. Qual. J. 2023, 31, 619–654. [Google Scholar] [CrossRef]
- Li, W.; Li, L.; Cai, H. On the vulnerability proneness of multilingual code. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore, Singapore, 18 November 2022; pp. 847–859. [Google Scholar]
- Akbar, M.A.; Smolander, K.; Mahmood, S.; Alsanad, A. Toward successful DevSecOps in software development organizations: A decision-making framework. Inf. Softw. Technol. 2022, 147, 106894. [Google Scholar] [CrossRef]
- Chakraborty, S.; Krishna, R.; Ding, Y.; Ray, B. Deep learning based vulnerability detection: Are we there yet? IEEE Trans. Softw. Eng. 2021, 48, 3280–3296. [Google Scholar] [CrossRef]
- Mateo Tudela, F.; Bermejo Higuera, J.R.; Bermejo Higuera, J.; Sicilia Montalvo, J.A.; Argyros, M.I. On combining static, dynamic and interactive analysis security testing tools to improve owasp top ten security vulnerability detection in web applications. Appl. Sci. 2020, 10, 9119. [Google Scholar] [CrossRef]
- Bedoya, M.; Palacios, S.; Díaz-López, D.; Laverde, E.; Nespoli, P. Enhancing DevSecOps practice with Large Language Models and Security Chaos Engineering. Int. J. Inf. Secur. 2024, 23, 3765–3788. [Google Scholar] [CrossRef]
- Rajapaksha, S.; Senanayake, J.; Kalutarage, H.; Al-Kadri, M.O. Ai-powered vulnerability detection for secure source code development. In Proceedings of the International Conference on Information Technology and Communications Security, Bucharest, Romania, 23–24 November 2022; pp. 275–288. [Google Scholar]
- Ling, X.; Wu, L.; Zhang, J.; Qu, Z.; Deng, W.; Chen, X.; Qian, Y.; Wu, C.; Ji, S.; Luo, T.; et al. Adversarial attacks against Windows PE malware detection: A survey of the state-of-the-art. Comput. Secur. 2023, 128, 103134. [Google Scholar] [CrossRef]
- Du, X.; Wen, M.; Zhu, J.; Xie, Z.; Ji, B.; Liu, H.; Shi, X.; Jin, H. Generalization-Enhanced Code Vulnerability Detection via Multi-Task Instruction Fine-Tuning. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 10507–10521. [Google Scholar] [CrossRef]
- Stack Overflow. 2023. Available online: https://stackoverflow.com (accessed on 14 October 2024).
- Díaz Ferreyra, N.E.; Vidoni, M.; Heisel, M.; Scandariato, R. Cybersecurity discussions in Stack Overflow: A developer-centred analysis of engagement and self-disclosure behaviour. Soc. Netw. Anal. Min. 2023, 14, 16. [Google Scholar] [CrossRef]
- Le, T.H.; Chen, H.; Babar, M.A. A survey on data-driven software vulnerability assessment and prioritization. Acm Comput. Surv. 2022, 55, 100. [Google Scholar] [CrossRef]
- Alzubi, J.A.; Jain, R.; Singh, A.; Parwekar, P.; Gupta, M. COBERT: COVID-19 question answering system using BERT. Arab. J. Sci. Eng. 2023, 48, 11003–11013. [Google Scholar] [CrossRef] [PubMed]
- Wang, Y.; Anderson, D.V. Hybrid attention-based prototypical networks for few-shot sound classification. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, Singapore, 23–27 May 2022; pp. 651–655. [Google Scholar]
- Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML’17, JMLR.org, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1126–1135. [Google Scholar]
- Wang, H.; Wang, Y.; Sun, R.; Li, B. Global convergence of maml and theory-inspired neural architecture search for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9797–9808. [Google Scholar]
- Cao, C.; Zhang, Y. Learning to compare relation: Semantic alignment for few-shot learning. IEEE Trans. Image Process. 2022, 31, 1462–1474. [Google Scholar] [CrossRef]
- Li, Z.; Zou, D.; Xu, S.; Jin, H.; Zhu, Y.; Chen, Z. Sysevr: A framework for using deep learning to detect software vulnerabilities. IEEE Trans. Dependable Secur. Comput. 2021, 19, 2244–2258. [Google Scholar] [CrossRef]
- Li, Z.; Zou, D.; Xu, S.; Chen, Z.; Zhu, Y.; Jin, H. Vuldeelocator: A deep learning-based fine-grained vulnerability detector. IEEE Trans. Dependable Secur. Comput. 2021, 19, 2821–2837. [Google Scholar] [CrossRef]
- Huang, W.; Lin, S.; Chen, L. Bbvd: A bert-based method for vulnerability detection. Int. J. Adv. Comput. Sci. Appl. 2022, 13. [Google Scholar] [CrossRef]
- Omar, M. VulDefend: A Novel Technique based on Pattern-exploiting Training for Detecting Software Vulnerabilities Using Language Models. In Proceedings of the 2023 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), Amman, Jordan, 22–24 May 2023; pp. 287–293. [Google Scholar] [CrossRef]
- Chi, J.; Qu, Y.; Liu, T.; Zheng, Q.; Yin, H. Seqtrans: Automatic vulnerability fix via sequence to sequence learning. IEEE Trans. Softw. Eng. 2022, 49, 564–585. [Google Scholar] [CrossRef]
- Chen, Z.; Kommrusch, S.; Monperrus, M. Neural transfer learning for repairing security vulnerabilities in c code. IEEE Trans. Softw. Eng. 2022, 49, 147–165. [Google Scholar] [CrossRef]
- Bahaa, A.; Kamal, A.E.R.; Fahmy, H.; Ghoneim, A.S. DB-CBIL: A DistilBert-Based Transformer Hybrid Model using CNN and BiLSTM for Software Vulnerability Detection. IEEE Access 2024, 12, 64446–64460. [Google Scholar] [CrossRef]
- Ma, S.; Thung, F.; Lo, D.; Sun, C.; Deng, R.H. VuRLE: Automatic Vulnerability Detection and Repair by Learning from Examples. In Proceedings of the Computer Security—ESORICS 2017, Oslo, Norway, 14 September 2017; Foley, S.N., Gollmann, D., Snekkenes, E., Eds.; Springer Nature: Cham, Switzerland, 2017; pp. 229–246. [Google Scholar]
- Zhang, X.; Zhang, F.; Zhao, B.; Zhou, B.; Xiao, B. VulD-Transformer: Source Code Vulnerability Detection via Transformer. In Proceedings of the 14th Asia-Pacific Symposium on Internetware, Hangzhou, China, 4–6 August 2023; pp. 185–193. [Google Scholar]
- Espinha Gasiba, T.; Iosif, A.C.; Kessba, I.; Amburi, S.; Lechner, U.; Pinto-Albuquerque, M. May the Source Be with You: On ChatGPT, Cybersecurity, and Secure Coding. Information 2024, 15, 572. [Google Scholar] [CrossRef]
- Bhandari, G.; Naseer, A.; Moonen, L. CVEfixes: Automated collection of vulnerabilities and their fixes from open-source software. In Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering, Athens, Greece, 19–20 August 2021; pp. 30–39. [Google Scholar]
- Common Weakness Enumeration (CWE). 2006. Available online: https://cwe.mitre.org/ (accessed on 13 September 2024).
- Common Vulnerability Scoring System (CVSS). 2005. Available online: https://www.first.org/cvss/ (accessed on 13 September 2024).
- National Institute of Standards and Technology (NIST). National Vulnerability Database (NVD). Available online: https://nvd.nist.gov/ (accessed on 13 September 2024).
- Software Assurance Metrics And Tool Evaluation (SAMATE). 2024. Available online: https://samate.nist.gov/ (accessed on 17 September 2024).
- NIST Software Assurance Reference Dataset. 2024. Available online: https://samate.nist.gov/SARD/ (accessed on 17 September 2024).
- OWASP Foundation. OWASP Top Ten 2021: The Ten Most Critical Web Application Security Risks. 2021. Available online: https://owasp.org/Top10/ (accessed on 22 October 2024).
- Ren, Z.; Shen, Q.; Diao, X.; Xu, H. A sentiment-aware deep learning approach for personality detection from text. Inf. Process. Manag. 2021, 58, 102532. [Google Scholar] [CrossRef]
- SonarQube: Continuous Code Quality. Available online: https://www.sonarqube.org/ (accessed on 13 September 2024).
- Song, Y.; Wang, T.; Cai, P.; Mondal, S.K.; Sahoo, J.P. A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities. Acm Comput. Surv. 2023, 55, 1–40. [Google Scholar] [CrossRef]
- Ma, Y.; Zhao, S.; Wang, W.; Li, Y.; King, I. Multimodality in meta-learning: A comprehensive survey. Knowl.-Based Syst. 2022, 250, 108976. Available online: https://www.sciencedirect.com/science/article/abs/pii/S0950705122004737 (accessed on 13 September 2024). [CrossRef]
- Huisman, M.; Van Rijn, J.N.; Plaat, A. A survey of deep meta-learning. Artif. Intell. Rev. 2021, 54, 4483–4541. [Google Scholar] [CrossRef]
- Jamal, M.A.; Qi, G.J. Task Agnostic Meta-Learning for Few-Shot Learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 11711–11719. [Google Scholar] [CrossRef]
- Behera, S.K.; Dash, R. Fine-Tuning of a BERT-Based Uncased Model for Unbalanced Text Classification. In Advances in Intelligent Computing and Communication; Springer: Cham, Switzerland, 2022; pp. 425–433. [Google Scholar] [CrossRef]
- Griva, A.I.; Boursianis, A.D.; Iliadis, L.A.; Sarigiannidis, P.; Karagiannidis, G.; Goudos, S.K. Model-Agnostic Meta-Learning Techniques: A State-of-The-Art Short Review. In Proceedings of the 2023 12th International Conference on Modern Circuits and Systems Technologies (MOCAST), Athens, Greece, 28–30 June 2023; pp. 1–4. [Google Scholar]
- Fallah, A.; Mokhtari, A.; Ozdaglar, A. Generalization of model-agnostic meta-learning algorithms: Recurring and unseen tasks. Adv. Neural Inf. Process. Syst. 2021, 34, 5469–5480. [Google Scholar]
- Chitty-Venkata, K.T.; Emani, M.; Vishwanath, V.; Somani, A.K. Neural architecture search for transformers: A survey. IEEE Access 2022, 10, 108374–108412. [Google Scholar] [CrossRef]
- Ji, Z.; Chai, X.; Yu, Y.; Pang, Y.; Zhang, Z. Improved prototypical networks for few-shot learning. Pattern Recognit. Lett. 2020, 140, 81–87. [Google Scholar] [CrossRef]
- Li, X.; Sun, Z.; Xue, J.H.; Ma, Z. A concise review of recent few-shot meta-learning methods. Neurocomputing 2021, 456, 463–468. [Google Scholar] [CrossRef]
- Banerjee, T.; Thurlapati, N.R.; Pavithra, V.; Mahalakshmi, S.; Eledath, D.; Ramasubramanian, V. Few-shot learning for frame-wise phoneme recognition: Adaptation of matching networks. In Proceedings of the 2021 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021; pp. 516–520. [Google Scholar]
- van Doorn, J.; Ly, A.; Marsman, M.; Wagenmakers, E.J. Bayesian rank-based hypothesis testing for the rank sum test, the signed rank test, and Spearman’s ρ. J. Appl. Stat. 2020, 47, 2984–3006. [Google Scholar] [CrossRef]
- Degen, H. Big I notation to estimate the interaction complexity of interaction concepts. Int. J. Hum. Comput. Interact. 2022, 38, 1504–1528. [Google Scholar] [CrossRef]
- Gutoski, M.; Hattori, L.; Romero Aquino, M.; Ribeiro, M.; Lazzaretti, A.; Lopes, H. Qualitative analysis of deep learning frameworks. Learn. Nonlinear Model. 2017, 15, 45–52. [Google Scholar] [CrossRef]
- Zhu, X.; Mao, X. Integrating Security with DevSecOps: Techniques and Challenges. IEEE Access 2020, 8, 101261–101273. Available online: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10613759 (accessed on 7 December 2024).
- Jones, S.L.; Weber, T.J. Holding on to Compliance While Adopting DevSecOps: An SLR. Electronics 2022, 11, 3707. [Google Scholar] [CrossRef]
Language | Common CWEs | Total Vulnerabilities | CVSS Score Percentage (High, Medium, Low) |
---|---|---|---|
PHP | CWE-20, CWE-22, CWE-78, CWE-79, CWE-89, CWE-306, CWE-352, CWE-434, CWE-502, CWE-601 | 13,000+ | High: 30%, Medium: 50%, Low: 20% |
Java | CWE-20, CWE-79, CWE-89, CWE-209, CWE-287, CWE-400, CWE-476, CWE-502, CWE-611, CWE-732 | 3000+ | High: 35%, Medium: 45%, Low: 20% |
C | CWE-20, CWE-119, CWE-125, CWE-200, CWE-362, CWE-399, CWE-416, CWE-476, CWE-772, CWE-787 | 10,000+ | High: 40%, Medium: 40%, Low: 20% |
C++ | CWE-20, CWE-119, CWE-125, CWE-362, CWE-399, CWE-400, CWE-416, CWE-476, CWE-772, CWE-787 | 7000+ | High: 42%, Medium: 38%, Low: 20% |
Metric | Description | Mathematical Definition |
---|---|---|
Precision[START] | Precision for the start index, measuring the proportion of true positives (TP) to the total of true positives (TP) and false positives (FP) for the predicted start index. | $\mathrm{Precision}_{\mathrm{START}} = \frac{TP_{\mathrm{START}}}{TP_{\mathrm{START}} + FP_{\mathrm{START}}}$ |
Precision[END] | Precision for the end index, calculated similarly to the start index. | $\mathrm{Precision}_{\mathrm{END}} = \frac{TP_{\mathrm{END}}}{TP_{\mathrm{END}} + FP_{\mathrm{END}}}$ |
Precision[combined] | Combined precision, representing the average precision across both indices. | $\mathrm{Precision}_{\mathrm{combined}} = \frac{\mathrm{Precision}_{\mathrm{START}} + \mathrm{Precision}_{\mathrm{END}}}{2}$ |
Recall[START] | Recall for the start index, representing the ratio of true positives (TP) to the sum of true positives and false negatives (FN) for the predicted start index. | $\mathrm{Recall}_{\mathrm{START}} = \frac{TP_{\mathrm{START}}}{TP_{\mathrm{START}} + FN_{\mathrm{START}}}$ |
Recall[END] | Recall for the end index, calculated similarly to the start index. | $\mathrm{Recall}_{\mathrm{END}} = \frac{TP_{\mathrm{END}}}{TP_{\mathrm{END}} + FN_{\mathrm{END}}}$ |
Recall[combined] | Combined recall, representing the average recall across both indices. | $\mathrm{Recall}_{\mathrm{combined}} = \frac{\mathrm{Recall}_{\mathrm{START}} + \mathrm{Recall}_{\mathrm{END}}}{2}$ |
F1[START] | F1-score for the start index, calculated as the harmonic mean of precision and recall. | $F1_{\mathrm{START}} = \frac{2 \cdot \mathrm{Precision}_{\mathrm{START}} \cdot \mathrm{Recall}_{\mathrm{START}}}{\mathrm{Precision}_{\mathrm{START}} + \mathrm{Recall}_{\mathrm{START}}}$ |
F1[END] | F1-score for the end index, calculated similarly to the start index. | $F1_{\mathrm{END}} = \frac{2 \cdot \mathrm{Precision}_{\mathrm{END}} \cdot \mathrm{Recall}_{\mathrm{END}}}{\mathrm{Precision}_{\mathrm{END}} + \mathrm{Recall}_{\mathrm{END}}}$ |
F1[combined] | Combined F1-score, representing the average F1-score across both indices. | $F1_{\mathrm{combined}} = \frac{F1_{\mathrm{START}} + F1_{\mathrm{END}}}{2}$ |
Entropy (H) | Measures uncertainty in softmax predictions, where lower entropy indicates more confident predictions. | $H = -\sum_{i} p_i \log p_i$ |
Prediction Error (PE) | Quantifies the discrepancy between predicted and actual indices, summing the absolute differences for the start and end indices. | $PE = \lvert \hat{s} - s \rvert + \lvert \hat{e} - e \rvert$ |
EM | Indicates whether both start and end indices of the predicted answer match exactly with the ground truth. A value of 1 signifies a perfect match, while a value of 0 indicates otherwise. | $EM = \mathbb{1}\left[(\hat{s}, \hat{e}) = (s, e)\right]$ |
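To make the definitions in the table above operational, the following are small reference implementations of EM, PE, entropy, and the combined F1, assuming predicted and true (start, end) index pairs; the variable names and the example values are illustrative.

```python
import math

def exact_match(pred: tuple, true: tuple) -> int:
    """EM: 1 only when both the start and the end index match the ground truth."""
    return int(pred == true)

def prediction_error(pred: tuple, true: tuple) -> int:
    """PE: |start_pred - start_true| + |end_pred - end_true|."""
    return abs(pred[0] - true[0]) + abs(pred[1] - true[1])

def entropy(probs: list) -> float:
    """H = -sum_i p_i log p_i over a softmax distribution; lower means more confident."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, used per index and for the combined score."""
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Example: predicted span (12, 18) against ground truth (12, 19).
print(exact_match((12, 18), (12, 19)), prediction_error((12, 18), (12, 19)))  # 0 1
```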
i | Support Set Size | Query Set Size | Total Samples |
---|---|---|---|
5 | 25 | 5 | 3000 |
6 | 30 | 6 | 3600 |
7 | 35 | 7 | 4200 |
8 | 40 | 8 | 4800 |
9 | 45 | 9 | 5400 |
10 | 50 | 10 | 6000 |
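The support and query sizes in the table above correspond to 5-shot-i-way tasks (five support samples and one query sample per class). The sketch below builds one such task from a per-class pool; the sampling routine and data structures are illustrative assumptions, not the authors' pipeline.

```python
import random

def sample_task(examples_by_class: dict, i: int, k_shot: int = 5, query_per_class: int = 1):
    """Build one 5-shot-i-way task: 5 support and 1 query example for each of i classes."""
    classes = random.sample(list(examples_by_class), i)
    support, query = [], []
    for label, cwe in enumerate(classes):
        picked = random.sample(examples_by_class[cwe], k_shot + query_per_class)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query  # |support| = 5 * i and |query| = i, matching the table above
```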
i | PHP | Java | C | C++ |
---|---|---|---|---|
5 | % | % | % | % |
6 | % | % | % | % |
7 | % | % | % | % |
8 | % | % | % | % |
9 | % | % | % | % |
10 | % | % | % | % |
i | Entropy PHP | Entropy Java | Entropy C | Entropy C++ |
---|---|---|---|---|
5 | ||||
6 | ||||
7 | ||||
8 | ||||
9 | ||||
10 |
i | PE PHP | PE Java | PE C | PE C++ |
---|---|---|---|---|
5 | ||||
6 | ||||
7 | ||||
8 | ||||
9 | ||||
10 |
i | PHP | Java | C | C++ |
---|---|---|---|---|
5 | ||||
6 | ||||
7 | ||||
8 | ||||
9 | ||||
10 |
Language | Values of i Analyzed | Optimal Value of i (Explanation) | Statistic W (Significance) | p-Value |
---|---|---|---|---|
PHP | 5–10 | The optimal value of i was 9. When increasing i to 9, the performance metrics (Precision, Recall, F1) reached their highest values, indicating better classification accuracy. | Positive differences predominate, supporting the alternative hypothesis. | Less than |
Java | 5–10 | The optimal value of i was 10. Continuous improvements in the metrics were observed, reaching the highest values at i = 10, which better captures the complexity of the language. | Positive differences predominate, supporting the alternative hypothesis. | Less than |
C | 5–10 | The optimal value of i was 5. Increasing i beyond 5 did not improve the metrics and sometimes decreased them, suggesting that i = 5 is sufficient for classification. | Negative differences predominate, supporting the alternative hypothesis. | Less than |
C++ | 5–10 | The optimal value of i was 8. Performance improved up to i = 8 but showed no significant improvement beyond that value. | Positive differences predominate, supporting the alternative hypothesis. | Less than |
Language | i | Precision (%) | Recall (%) | F1 (%) | EM (%) |
---|---|---|---|---|---|
PHP | 9 | ||||
Java | 10 | ||||
C | 5 | ||||
C++ | 8 | ||||
Average | – |
Study | Sample Size and Labeling | CWEs Addressed | Summary of Vulnerabilities |
---|---|---|---|
Proto-MAML (This project) | 18,879 samples with 9867 vulnerable code fragments for PHP, 4677 for Java, 5019 for C, and 4038 for C++, with no labeling required | CWE-20, CWE-79, CWE-89, CWE-78, CWE-352, CWE-22, CWE-434, CWE-502, CWE-601, CWE-306, CWE-209, CWE-287, CWE-400, CWE-476, CWE-611, CWE-732, CWE-119, CWE-125, CWE-787, CWE-416, CWE-200, CWE-362, CWE-772, CWE-399 | Addresses cross-site scripting, SQL injection, memory management issues, concurrency problems, deserialization risks, and improper input validation. Applies to PHP, Java, C, and C++ in CMS systems, ORM tools, and embedded systems.
SySeVR [27] | 14,780 samples transformed into 340,000 token-level elements for C and C++, labeled as secure or insecure | CWE-119, CWE-125, CWE-787 | Focuses on memory vulnerabilities such as buffer overflows, out-of-bounds reads, and improper memory restriction. Applicable to C and C++ in systems programming and embedded systems. |
BBVD [29] | 16,436 samples for C and C++, with slices labeled as safe or unsafe | CWE-400, CWE-20, CWE-416 | Addresses uncontrolled resource consumption, improper input validation, and use-after-free errors. Targets C and C++ in high-performance and real-time systems. |
VulDefend [30] | 4000 samples for C/C++ with fragment and token-level labeling | CWE-89, CWE-79, CWE-352 | SQL injection, cross-site scripting, and cross-site request forgery vulnerabilities in web applications. |
VulDeeLocator [28] | 29,000 samples for C transformed into 420,627 fragments, labeled as vulnerable or not vulnerable | CWE-119, CWE-125, CWE-787 | Memory vulnerabilities including buffer overflows and out-of-bounds reads in C-based systems. |
SeqTrans [31] | 5000 samples for Java transformed into 650,000 secure–insecure mappings with token-level labeling | CWE-287, CWE-20, CWE-189 | Authentication issues, resource misuse, and numeric errors in Java applications. |
VRepair [32] | 655,741 confirmed vulnerabilities from 1,838,740 commits for insecure C code with no labeling required, BERT-based output | CWE-20, CWE-79, CWE-89, CWE-287 | Focuses on unsafe inputs, poor sanitization, permission control, and information exposure in C codebases. |
DB-CBIL [33] | 33,360 vulnerable functions from C and C++ with token-level sequence labeling | CWE-119, CWE-125, CWE-476 | Addresses memory management issues, including null pointer dereferences, buffer overflows, and out-of-bounds reads in C and C++. |
VuRLE [34] | 48 samples for Java with cluster-based outputs, no labeling required | CWE-20, CWE-89, CWE-287 | Focuses on misplaced resources, injection vulnerabilities, and authentication weaknesses in Java frameworks. |
VulD-Transformer [35] | Between 22,148 and 291,892 samples for C and C++, with 937,608 processed examples labeled as vulnerable or not vulnerable | CWE-119, CWE-476, CWE-125 | API function misuse, memory errors, and null pointer dereference vulnerabilities in C and C++ systems. |
GPT survey [36] | Unspecified number of samples for C and C++ | CWE-121, CWE-758, CWE-242 | Covers buffer overflows, risky functions, and integer overflows in C and C++ implementations. |
Model | Complexity | Technical Reasoning | Comparative Performance (vs. Proto-MAML) |
---|---|---|---|
Proto-MAML | 1. Attention mechanism (): Each token interacts with every other token, resulting in operations. 2. Logarithmic reduction: Meta-learning adjusts parameters incrementally, scaling with the logarithm of task size (). 3. Combined complexity: Meta-learning reduces operations to , optimizing for dynamic scenarios. | Performance: Achieves an average F1-score of 98.78% across all tasks. Specific results: PHP (99.93%), Java (99.12%), C/C++ (97.23%), with an overall Exact Match (EM) score of 98.78%. | |
SySeVR [27] | 1. Abstract Syntax Tree (AST, ): Evaluates relationships between nodes, requiring operations. 2. Program Dependency Graph (PDG, ): Captures semantic dependencies, generating combinations for multi-level relationships. 3. Integration bottleneck: Combined AST and PDG operations elevate complexity to . | Performance: F1 of 85.8% in C/C++. Proto-MAML avoids graph dependencies, improving scalability and accuracy, and surpasses SySeVR by 11.43%. | |
BBVD [29] | 1. Attention (): RoBERTa compares relationships between n tokens, resulting in . 2. Dense operations: Normalization and feedforward add costs, though not dominant. | Performance: F1 of 95.42% in C/C++. Proto-MAML achieves a higher F1-score of 97.23% (+1.81%), with lower computational complexity. | |
VulDefend [30] | 1. Token-level analysis (): PET evaluates relationships between tokens. 2. Absence of meta-learning: No iterative parameter optimization, unlike Proto-MAML. | Performance: F1 of 89.9% in C/C++, while that of Proto-MAML is 97.23%, significantly outperforming VulDefend by +7.33%. | |
VulDeeLocator [28] | 1. AST (): Encodes syntactic relationships, requiring . 2. SSA (): Tracks control flow across nodes, increasing complexity with data flow dependencies. 3. Combined iteration: Sequential RNN and LSTM processes add layers, resulting in overall. | Performance: F1 of 98.8% in C. Proto-MAML achieves comparable F1-scores, but with significantly lower computational complexity ( vs ). | |
SeqTrans [31] | 1. BS (): Iteratively evaluates candidates for each token. 2. Pairwise comparisons (): Evaluates token relationships across sequences, resulting in . | Performance: Masked correction rate of 25.3% in Java. Proto-MAML achieves a correction rate of 99.5%, outperforming SeqTrans by +74.2%. | |
VRepair [32] | 1. SNN (): Processes tokens linearly with operations. 2. SSA (): Adds flow analysis complexity. 3. Cross-propagation: SNN and SSA integration results in . | Performance: Reconstruction rate of 27.59% in C. Proto-MAML achieves an exact match of 98.2%, significantly outperforming VRepair. | |
DB-CBIL [33] | 1. CNN (): Performs filtering and dimensionality reduction over operations. 2. BiLSTM (): Captures forward and backward dependencies sequentially. 3. Joint iterations: Combined CNN and BiLSTM layers result in . | Performance: Reconstruction rate of 99.51% in C/C++. Proto-MAML achieves 98.78% overall accuracy, with significantly lower computational demands. | |
VuRLE [34] | 1. AST (): Captures hierarchical relationships between nodes. 2. DBSCAN (): Clustering compares each token with all others, resulting in . 3. Combined processes: . | Performance: Replacement prediction rate of 65.59% in Java. Proto-MAML achieves a rate of 99.5%, outperforming VuRLE by +33.91%. | |
VulD-Transformer [35] | 1. PDG (): Builds semantic graphs for tokens. 2. Transformer attention (): Token-to-token relationships generate interactions. 3. Combined iterations: . | Performance: Accuracy ranging from 59.34% to 80.44% in C/C++. Proto-MAML surpasses VulD-Transformer, achieving an average F1-score of 97.5%, with lower complexity. | |
GPT Survey [36] | 1. Pre-trained database search (): Each input of size n is compared with internal data sets, scaling quadratically. 2. Answer calibration (): Complex queries require iterative re-evaluations, elevating costs. | Performance: Accuracy of 88%. Proto-MAML achieves an average accuracy of 98.78%, surpassing the GPT Survey in terms of adaptability and efficiency. |
Framework | Quantitative Metrics (F1, TP, FP) | Qualitative Adaptability | False-Positive Handling and Observations |
---|---|---|---|
Proto-MAML (This Study) | F1: , TP: , FP: | High adaptability due to meta-learning loops. Supports diverse languages and CWEs with minimal data. Well-suited for incremental learning scenarios. | False positives reduced through precise query alignment in the support set . Avoids overfitting via regularized parameter updates. |
SySeVR [27] | F1: , TP: , FP: | Limited adaptability. Relies on extensive labeled data sets, constraining performance in unseen languages or CWEs. | High false-positive rate due to static AST generation, which misses dynamic code behaviors. |
BBVD [29] | F1: , TP: , FP: | Moderate adaptability via RoBERTa fine-tuning, but struggles with heterogeneous code due to fixed tokenization schemes. | False positives are controlled using multiple attention layers, but performance degrades when using highly nested code structures. |
VulDefend [30] | F1: , TP: , FP: | Moderate adaptability through RoBERTa and PET. Effective in low-data settings but lacks flexibility for unseen CWEs. | False positives arise from probabilistic template errors during few-shot adaptation. |
VulDeeLocator [28] | F1: , TP: , FP: | High adaptability for C code with AST and SSA. Limited when applied to other languages. | False positives minimized through sequential learning and SSA optimizations but requires costly preprocessing. |
SeqTrans [31] | Masked Correction: , TP: , FP: | Poor adaptability due to reliance on sequential generation. BS struggles with multi-pattern code. | High false-positive rate, as BS favors common sequences over precise mappings. |
VRepair [32] | Reconstruction Rate: , TP: , FP: | Limited adaptability due to heavy reliance on static TL. Struggles with evolving or unseen vulnerabilities. | Moderate false positives due to dependency on sequential context without incorporating dynamic code behavior. |
DB-CBIL [33] | Reconstruction Rate: , TP: , FP: | Strong adaptability through CNN and BiLSTM integration, but computationally intensive. | False positives mitigated by token-level sequence labeling, but not fully eliminated in abstract syntax scenarios. |
VuRLE [34] | Replacement Prediction: , TP: , FP: | Poor adaptability due to reliance on static AST and DBSCAN clustering, which fail in sparse data environments. | False positives arise from inaccuracies in the clustering performed by DBSCAN on skewed data sets. |
VulD-Transformer [35] | F1: Range 59.34–80.44%, TP: , FP: | Moderate adaptability via multi-attention transformers, but limited by PDG preprocessing bottlenecks. | False positives reduced by attention mechanisms, but increase with complex PDG structures.
GPT Survey [36] | Accuracy: , TP: , FP: | High adaptability in query handling, but inconsistent outputs depending on fine-tuning and batch size. | False positives fluctuate based on query specificity and model calibration. |
Model | Automated Security Scanning | Interdisciplinary Collaboration | Continuous Security Integration | Real-Time Monitoring and Auditing | Predictive Capability | Dependency and Component Management |
---|---|---|---|---|---|---|
Proto-MAML (this study) | FSL capabilities to detect vulnerabilities across PHP, Java, C, and C++, covering over 24 CWEs. Provides actionable reconstructions. | Facilitates collaboration through interpretable QA outputs. | Low complexity, enabling seamless CI/CD integration. | Adapts dynamically to new samples based on rich semantics. | Accurately identifies specific answer spans (start and end indices). | Identifies insecure dependencies and generates practical solutions.
SySeVR [27] | Graph dependencies (AST, PDG) increase complexity to , limiting scalability. | Graphs require advanced interpretation, hindering accessibility. | Graph preprocessing slows CI/CD processes. | Static nature impedes adaptation to emerging vulnerabilities. | Cannot predict precise spans within code due to reliance on static analysis. | Does not evaluate external dependencies or insecure frameworks. |
BBVD [29] | Limited to C and C++; lacks coverage for critical languages such as PHP or Java. | Fixed tokenization hinders direct interpretation by non-technical teams. | Dense attention layers increase evaluation times, despite lower complexity (). | Cannot dynamically integrate new samples, limiting evolution in real-time systems. | Lacks precise span predictions, reducing audit effectiveness. | Ignores external dependencies and third-party vulnerabilities. |
VulDefend [30] | Only addresses 3–6 CWEs, with limited detection in emerging vulnerabilities. | Probabilistic modeling produces ambiguous results, complicating collaboration. | Absence of meta-learning prevents dynamic adjustments; complexity restricts CI/CD scalability. | Requires intensive re-training to adapt to new patterns, limiting real-time monitoring. | Does not predict spans; relies on rigid templates for analysis. | Does not evaluate external dependencies or insecure libraries. |
VulDeeLocator [28] | High complexity () from combining AST, SSA, and RNN, unsuitable for rapid multilingual analysis. | Dependence on graphs and recurrent networks complicates interdisciplinary collaboration. | Heavy computational demands limit automated pipeline integration. | Relies on pre-labeled data, lacks flexibility for dynamic environments. | Highly precise, but extreme complexity reduces applicability in agile contexts. | Relies on predefined labels, making it unsuitable for external dependencies or unexplored contexts. |
SeqTrans [31] | Restricted to Java with low CWE coverage; BS () lacks scalability. | Sequential search outputs are difficult to interpret and contextualize. | Iterative searches delay CI/CD processes, undermining DevSecOps efficiency. | Inefficient for dynamic scenarios, unable to handle code changes effectively. | Fails to predict spans, limiting traceability of vulnerabilities. | Ignores third-party dependencies, unsuitable for real-world multilingual environments. |
VRepair [32] | Requires extensive data and lacks dynamic adaptability, limiting multilingual effectiveness. | SNNs hinder understanding and implementation by diverse teams. | High complexity () renders it incompatible with continuous delivery pipelines. | Cannot adjust to new samples without complete re-training, unsuitable for evolving systems. | Fails to predict precise spans; outputs are general suggestions. | Lacks analysis of dependencies or third-party frameworks. |
DB-CBIL [33] | CNN and BiLSTM integration results in high computational costs (), unsuitable for rapid analysis. | Complex architecture reduces accessibility for interdisciplinary teams. | Inefficient in CI/CD environments, lacking agility for continuous deployment demands. | Cannot adapt to new vulnerabilities without extensive re-training. | Lacks granularity in predictions, limiting audit effectiveness. | Does not evaluate dependencies or dynamic frameworks, reducing applicability. |
VuRLE [34] | Dependency on AST and clustering (DBSCAN) elevates complexity to , impractical for multilingual scenarios. | Difficult to interpret for non-technical teams, hindering collaboration. | High complexity excludes it from automated CI/CD pipelines. | Static structure prevents adaptation to dynamic environments. | Lacks precision in predictions; does not identify specific spans. | Ignores external dependencies and insecure libraries. |
VulD-Transformer [35] | PDG preprocessing adds significant computational overhead (). | Complex graph-based outputs reduce accessibility for interdisciplinary teams. | Preprocessing requirements make it unsuitable for rapid CI/CD pipelines. | Static architecture limits responsiveness to code changes. | Provides general predictions; lacks granular span identification. | Ignores external dependencies and multilingual vulnerabilities. |
GPT Survey [36] | General-purpose focus lacks specificity for code security tasks; complexity ranges –. | Manual calibration for specific tasks reduces DevSecOps usability. | Inconsistent outputs and processing times hinder CI/CD integration. | Pre-trained database dependency limits adaptability to emerging vulnerabilities. | Lacks granular predictions for vulnerability spans. | Closed approach does not optimize for external dependencies, reducing versatility. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).