Article

Question–Answer Methodology for Vulnerable Source Code Review via Prototype-Based Model-Agnostic Meta-Learning

by Pablo Corona-Fraga 1, Aldo Hernandez-Suarez 2, Gabriel Sanchez-Perez 2,*, Linda Karina Toscano-Medina 2, Hector Perez-Meana 2, Jose Portillo-Portillo 2, Jesus Olivares-Mercado 2 and Luis Javier García Villalba 3,*

1 Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación, Avenida San Fernando No. 37, Colonia Toriello Guerra, Delegación Tlalpan, Mexico City 14050, Mexico
2 Instituto Politecnico Nacional, ESIME Culhuacan, Mexico City 04440, Mexico
3 Group of Analysis, Security and Systems (GASS), Department of Software Engineering and Artificial Intelligence (DISIA), Faculty of Computer Science and Engineering, Office 431, Universidad Complutense de Madrid (UCM), Calle Profesor José García Santesmases, 9, Ciudad Universitaria, 28040 Madrid, Spain
* Authors to whom correspondence should be addressed.
Future Internet 2025, 17(1), 33; https://doi.org/10.3390/fi17010033
Submission received: 23 October 2024 / Revised: 8 December 2024 / Accepted: 30 December 2024 / Published: 14 January 2025
(This article belongs to the Collection Information Systems Security)

Abstract: In cybersecurity, identifying and addressing vulnerabilities in source code is essential for maintaining secure IT environments. Traditional static and dynamic analysis techniques, although widely used, often exhibit high false-positive rates, elevated costs, and limited interpretability. Machine Learning (ML)-based approaches aim to overcome these limitations but encounter challenges related to scalability and adaptability due to their reliance on large labeled datasets and their limited alignment with the requirements of secure development teams. These factors hinder their ability to adapt to rapidly evolving software environments. This study proposes an approach that integrates Prototype-Based Model-Agnostic Meta-Learning (Proto-MAML) with a Question-Answer (QA) framework that leverages the Bidirectional Encoder Representations from Transformers (BERT) model. By employing Few-Shot Learning (FSL), Proto-MAML identifies and mitigates vulnerabilities with minimal data requirements, aligning with the principles of the Secure Development Lifecycle (SDLC) and Development, Security, and Operations (DevSecOps). The QA framework allows developers to query vulnerabilities and receive precise, actionable insights, enhancing its applicability in dynamic environments that require frequent updates and real-time analysis. The model outputs are interpretable, promoting greater transparency in code review processes and enabling efficient resolution of emerging vulnerabilities. Proto-MAML demonstrates strong performance across multiple programming languages, achieving an average precision of 98.49%, recall of 98.54%, F1-score of 98.78%, and exact match rate of 98.78% in PHP, Java, C, and C++.

Graphical Abstract

1. Introduction

Cybersecurity has emerged as an essential pillar within the realm of Information Technology (IT), serving as the backbone to ensure the confidentiality, integrity, and availability of computing systems. This infrastructure underpins applications, software, and APIs across a wide array of sectors, including communication, education, industrial automation, entertainment, transportation, and finance [1], among others.
Despite the plethora of programming languages and development methodologies enabling robust and accessible deployments, ensuring security across all layers of a product remains imperative [2]. A multi-layered security perspective—that is, one encompassing physical, network, access point, and application dimensions—is paramount to mitigate breaches and vulnerabilities that could compromise digital ecosystems and user safety. To this end, prominent authors such as [3,4] posit that attaining a proactive and robust level of protection and resilience in cybersecurity necessitates the establishment of comprehensive observation points spanning multiple layers. These layers encompass aspects such as user behavior analysis, advanced threat modeling, rigorous vulnerability assessments, incident response frameworks, and the implementation of iterative continuous improvement strategies. Among the most pivotal elements to scrutinize is the software underpinning the information technology infrastructure, which serves as a cornerstone for ensuring systemic security and operational stability.
As a consequence, the exploitation of vulnerabilities and breaches by malicious actors in unattended software, including source code, can lead to unauthorized access, data breaches, service disruptions, and financial losses, resulting in profoundly negative repercussions for organizations and human resources as well [5].
To address the inherent risks in software development, standards have been established under the Secure Software Development Life Cycle (SSDLC) framework [6]. This paradigm incorporates review processes designed to identify potential threats and vulnerabilities in source code, thus reducing the likelihood of insecure production deployments. A key advantage of this strategy lies in prioritizing security across all development stages—from planning and design to implementation and controlled testing—ensuring rigorous scrutiny of code components, the implementation of mitigation measures, and overall enhancement of software quality.
Nonetheless, despite its recognition as an industry standard, the adoption of SSDLC remains inconsistent. The escalating number of reported vulnerabilities in 2023—approximately 22,000, with 15% categorized as highly critical—underscores the persistence of security gaps [7,8]. Several factors account for these deficiencies [9]: first, many organizations lack awareness of SSDLC practices, avoid them due to concerns about potential delays in production, or implement security controls grounded in obscurity, which prove to be ineffective in terms of the continuous detection and mitigation cycle; second, SSDLC relies on Static Application Security Testing (SAST) which, while effective in detecting known vulnerabilities, often produces a high volume of false positives due to the use of outdated databases or an inability to detect complex patterns such as obfuscated or encrypted code segments; and third, SSDLC assumes that Dynamic Application Security Testing (DAST) is carried out to complement SAST, analyzing code during execution. However, DAST lacks the requisite granularity at the source code level and typically necessitates additional manual reviews.
The integration of DevSecOps practices into SSDLC pipelines, combined with Continuous Integration (CI) and Continuous Delivery/Deployment (CD) pipelines, has the potential to optimize both SAST and DAST processes [10]. Nevertheless, this integration is fraught with significant challenges, including the complexity of tool configuration, false-positive management, elevated costs associated with continuous maintenance, the substantial expenses imposed by tool vendors, and the lack of transparency in detection and remediation mechanisms.
Although advances in SSDLC and DevSecOps have bolstered the capabilities of vulnerability detection approaches, programming languages continue to be a significant source of security risks. The study On the Vulnerability Proneness of Multilingual Code [11], leveraging Negative Binomial Regression (NBR), highlighted high vulnerability indices for languages such as C, C++, PHP, and Java. The extensive use of PHP in legacy web applications, coupled with the critical roles of Java, C, and C++ in high-performance systems requiring explicit memory management, exacerbates their exposure to security challenges. These characteristics, despite the maturity and widespread use of these languages, continue to exert pressure regarding the integration of robust security features, ensuring comprehensive policy coverage, and conducting regular monitoring [12]. Such attributes render these languages particularly vulnerable within dynamic and evolving codebases [13].
As underscored by [14,15], the aforementioned challenges can significantly extend testing timelines and, if inadequately managed, may lead to delays in delivery or the deployment of insecure code into production environments. To mitigate these issues, ML techniques have been introduced to enhance the efficiency and accuracy of methods for the review of vulnerable code. According to [14,16], ML approaches can refine the analysis of vulnerable code by examining relationships between code fragments, swiftly identifying faults, mapping secure to insecure code, and improving error detection rates. These innovations have markedly reduced review times, as reflected in improved performance metrics.
While the use of ML is appealing due to its capacity for automation and quantifiable performance in code review tasks, significant challenges still exist, as outlined below.
  • Generalization Challenges: Numerous ML models rely on superficial patterns rather than a profound comprehension of code context, thereby limiting their efficacy in detecting novel vulnerabilities. Furthermore, many models are designed to analyze only one programming language, posing a significant obstacle in environments requiring the evaluation of multiple languages [17].
  • Dependence on Extensive Data Sets: Traditional approaches require vast data sets annotated at the token or sequence level to identify vulnerabilities, increasing processing demands and diminishing the inherent advantages of these models. Despite the significant potential of Deep Learning (DL), Natural Language Processing (NLP), and Statistical Learning (SL) approaches, these techniques face limitations stemming from the scarcity and diversity of training samples, complicating their generalization and prolonging annotation processes [11,18].
  • Static Nature: Many models struggle to adapt to evolving codebases and emerging vulnerabilities, thereby reducing their efficacy over time and necessitating frequent re-training, which imposes considerable operational overheads [18].
  • Insufficient Contextual Responses: ML-based tools often fail to provide actionable and precise responses, which is particularly problematic in the DevSecOps context, where practical and direct solutions for code correction are essential. This lack of interpretability represents a challenge for security analysts, who require accurate and contextually grounded insights within their workflows [19,20].
The challenges associated with effectively addressing vulnerabilities in source code have amplified the necessity for practical and timely solutions in real-world scenarios despite existing contributions from SSDLC, DevSecOps, and ML. This urgency has given rise to a phenomenon termed the “self-disclosure behavior”, where cybersecurity and development professionals proactively share information about vulnerabilities and security risks to enhance collaboration and improve responses to emerging threats. A recent analysis identified Stack Overflow [19] as a vital platform hosting numerous inquiries concerning coding issues and vulnerabilities [20]. The study documented 1239 cybersecurity-related questions, of which 67% received at least one response, 47% were resolved with a user-approved solution, 58% received positive evaluations, and 39% attracted comments.
This analysis underscores the value of such platforms as repositories of knowledge and support for addressing security challenges. However, significant gaps persist, as many questions remain unanswered or require further clarification, highlighting the persistent challenge of providing comprehensive and timely solutions in an ever-evolving cybersecurity landscape.
In order to address the complexities inherent in the detection and remediation of vulnerabilities in source code [21], particularly in programming languages such as PHP, C, C++, and Java, which, according to [11], exhibit the highest NBR scores in terms of their propensity for vulnerabilities, this study introduces a Large Language Model (LLM) based on BERT [22], structured within a QA architecture designed to generate precise and contextualized responses in view of the growing phenomenon referred to as the “self-disclosure behavior”. To meet the demands for accuracy and contextualization, Prototypical Networks (Proto) are incorporated, which compute prototypes as the mean of QA instance embeddings and enable inference through distance-based metrics [23]. Additionally, the MAML framework [24,25], applied within the FSL [26] paradigm, is integrated to optimize the BERT parameters, facilitating rapid generalization to novel tasks. These components converge into the proposed framework, which we refer to as Prototype-Based Model-Agnostic Meta-Learning (Proto-MAML).
Proto-MAML expedites generalization to new tasks, particularly in multilingual environments, circumventing reliance on superficial patterns and significantly reducing the time required for annotation. Moreover, it ensures dynamic alignment with practical DevSecOps requirements, delivering contextualized, interpretable, and tailored responses to address the evolving landscape of vulnerabilities.
The structure of the remainder of this manuscript is organized as follows: Section 2 outlines the main contributions of the study to the state-of-the-art in the identification and correction of vulnerable source code; Section 3 reviews state-of-the-art methodologies for vulnerability detection and mitigation; Section 4 details the implementation of the Proto-MAML model; Section 5 presents the evaluation metrics and experimental results; Section 6 provides a comparative analysis; and, finally, Section 7 summarizes the findings, proposes potential improvements, and outlines directions for future research.

2. Contributions of Proto-MAML to the State-of-the-Art

Proto-MAML applied to QA represents an innovative approach within the field of ML for the analysis, detection, and remediation of vulnerabilities in source code. This model is distinguished by its interpretative capacity and its design, which addresses the critical demands of the SSDLC and DevSecOps, offering novel elements when compared to existing methodologies, as outlined below.
  • Dynamic Prototypes and Real-Time Adaptability: Proto-MAML leverages Prototypical Networks to create dynamic representations that are tailored to each task, allowing for seamless adaptation to evolving code patterns without extensive re-training. This capability sets it apart from traditional static methods, which rely on frequent updates to remain effective. Furthermore, Proto-MAML dynamically adjusts to changes in codebases and emerging threats, ensuring continuous protection while maintaining computational efficiency with optimized complexity.
  • Few-Shot Learning (FSL): The FSL capabilities of Proto-MAML ensure high accuracy even when limited labeled data are available. This ability to generalize from minimal data addresses a fundamental challenge in security vulnerability detection, particularly for less-common programming languages or newly identified vulnerabilities.
  • Contextual and Predictive Insights: By integrating BERT, Proto-MAML enhances interpretability by generating contextualized explanations and predictive remediation strategies. This combination not only identifies vulnerabilities but also anticipates potential issues, providing actionable recommendations that assist development teams in proactively mitigating risks.
  • Resource Optimization and Efficiency: Proto-MAML optimizes computational resources by reducing the parameters required for task-specific adaptation, facilitated by Prototypical Networks. This efficiency ensures seamless integration into CI/CD pipelines, enabling real-time security assessments without introducing delays while maintaining rapid development cycles and prioritizing security.
  • Actionable Interpretability: Proto-MAML bridges the gap between technical detection and practical remediation by delivering clear outputs in natural language. This approach enhances collaboration between development, security, and operations teams, ensuring that the proposed solutions are comprehensible and can be applied immediately.

3. Related Works

The detection and remediation of software vulnerabilities have advanced significantly through various machine learning techniques and estimators. Among the early frameworks, SySeVR [27] combined syntactic and semantic representations of C/C++ code with Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) models. Using Abstract Syntax Trees (ASTs) and Program Dependency Graphs (PDGs), SySeVR generated code slice vectors to identify secure and insecure regions, analyzing over 56,500 secure and 340,000 insecure segments extracted from 14,780 snippets. By integrating LSTM and Bidirectional Gated Recurrent Unit (BGRU) layers, the framework achieved an F1-score of 85.8 %.
Building on RNN-based approaches, VulDeeLocator [28] extended VulDeePecker by leveraging Natural Language Processing (NLP) and RNN methods to detect vulnerabilities in C code. It employed ASTs, segmentation, and Static Single Assignment (SSA) representations to capture data dependencies and control flows across functions and files. The processed information was integrated into a Bi-RNN with LSTM layers, achieving an accuracy of 98.8 % using approximately 29,000 samples.
Transformer-based architectures have further improved vulnerability detection. BBVD (BERT-Based Vulnerability Detection) [29] utilized pre-trained BERT models (e.g., RoBERTa, DistilBERT, and MobileBERT) to analyze over 16,436 code sections in C/C++. Tokenizers and byte-pair encoding were used to fine-tune the model, achieving up to 95.42 % accuracy in classifying segments as safe or unsafe. This method targeted vulnerabilities such as arithmetic errors, pointer references, arrays, and Application Programming Interface (API) calls. Similarly, VulDefend [30] adopted Pattern-Exploiting Training (PET) and cloze-style masking with RoBERTa for detecting vulnerabilities in C and C++ code. VulDefend replaced vulnerable code sections with secure alternatives based on embedding probabilities within a Few-Shot Learning (FSL) framework, achieving 89.9 % accuracy using approximately 4000 samples.
Transformers also proved effective for Java code analysis. SeqTrans [31] addressed vulnerabilities across 624 Common Vulnerabilities and Exposures (CVEs) related to incorrect authentication, critical resource misuse, and numeric errors. SeqTrans transformed Java code into token sequences using byte-pair encoding and applied a Neural Machine Translation (NMT)-style transformer to capture hidden representations. A Beam Search (BS) mechanism identified vulnerable sections for correction through syntax verification, achieving statement-level fix rates of 23.3 % and CVE-level fixes of 25.3 %.
Expanding on SeqTrans, VRepair [32] introduced a transformer-based encoder-decoder architecture for repairing vulnerabilities in C code. Using a dataset of 655,741 committed vulnerabilities, the model achieved precision rates between 22.55 % and 27.59 %. VRepair combined a multi-attention transformer with a Sequential Neural Network (SNN) to analyze and repair buffer overflows and resource management errors. Beam Search was used to generate patches while maintaining syntax and context integrity.
Hybrid models have also been developed for vulnerability detection. DB-CBIL (DistilBERT-Based Transformer Hybrid Model Using CNN and BiLSTM) [33] integrated ASTs and DistilBERT with a Convolutional Neural Network (CNN) for feature extraction and a Bidirectional Long Short-Term Memory (BiLSTM) layer for sequential data processing. By segmenting C/C++ code into 1-gram tokens, DB-CBIL captured both syntactic and semantic features effectively. The model utilized a dataset of 33,360 functions (12,303 labeled as vulnerable and 21,057 as non-vulnerable), achieving a precision of 99.51 %.
Clustering-based approaches have also been explored. VuRLE (VuRLE: Automatic Vulnerability Detection and Repair by Learning from Examples) [34] applied ASTs and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to Java code. Analyzing datasets containing 48 instances across 279 vulnerabilities, VuRLE grouped replacements for safer versions, achieving an average repair prediction accuracy of 65.59 %.
For C++ scenarios, VulD-Transformer [35] extended Deep Learning feature extraction layers with Program Dependency Graphs (PDGs) to analyze control and data dependencies between code segments. The model processed datasets ranging from 22,148 to 291,892 samples, focusing on API function calls, arithmetic expressions, array usage, and pointer operations. VulD-Transformer used a custom multi-attention mechanism and achieved accuracies ranging from 59.34 % to 80.44 %.
Lastly, advancements in LLMs were demonstrated in a study [36] using Generative Pre-trained Transformer (GPT) APIs. The study analyzed the top 25 insecure C++ code examples from SANS, focusing on four critical Common Weakness Enumerations (CWEs): buffer overflow, risky functions, integer overflow, and insecure implementations. GPT models (versions 2–4) predicted over 20 CWEs, detecting 56% of vulnerabilities accurately. Among these, 88% of the identified vulnerabilities were accompanied by secure code recommendations.
Despite the substantial progress achieved by the reviewed methodologies, several critical limitations persist. RNN-based approaches, such as SySeVR [27] and VulDeeLocator [28], rely on ASTs and PDGs, which, while effective for capturing syntactic and semantic relationships, are constrained in scalability and language coverage. Transformer-based models, including BBVD [29] and VulDefend [30], enhance efficiency but are primarily focused on C and C++, addressing a limited subset of CWEs, which restricts their applicability to broader, heterogeneous programming environments. Hybrid models, such as SeqTrans [31], VRepair [32], and DB-CBIL [33], exhibit notable precision in addressing specific vulnerabilities but frequently necessitate extensive labeled data and are often restricted in their ability to generalize across programming languages and contexts. Clustering-based approaches, including VuRLE [34], and dependency graph-based models, such as VulD-Transformer [35], encounter limitations in processing large-scale datasets and adapting to diverse programming constructs. Furthermore, LLM-driven solutions, such as the GPT-based study [36], demonstrate potential for integrating natural language interfaces but are hindered by inconsistent predictions and limited support for underrepresented languages, such as PHP.
These observations form the basis for the comprehensive analysis in Section 6, titled Discussions, where Proto-MAML is evaluated against state-of-the-art approaches through comparative results. The discussion emphasizes how Proto-MAML addresses limitations in vulnerability detection and remediation across PHP, Java, C, and C++ while covering a wide range of CWEs. By leveraging structured meta-learning with FSL, it showcases adaptability to emerging vulnerabilities using minimal data. Furthermore, its integration into CI/CD pipelines facilitates continuous security assessments, and its QA-based framework promotes interdisciplinary collaboration, aligning with the requirements of modern security-centric workflows.

4. Materials and Methods

To initiate the proposed methodology, the following tuple is considered: questions ( Q V ) derived from vulnerable source code, contexts ( C V ) that identify the cause of the vulnerability, and answers ( A V ) embedded within C V that explain the issue and define the solution. The list below exemplifies Q V , C V , and A V .
  • Q V : Why is this code vulnerable?
    [Code listing not reproduced: vulnerable snippet involving the ptr variable.]
  • C V : According to CWE-476 (NULL Pointer Dereference), the product dereferences a pointer that it expects to be valid, but which is instead NULL.
    A V : (start position of A V ) In this code, the ptr variable is allocated memory, but this memory is never actually used. This happens because the pointer is improperly initialized or handled. Below is the corrected version of the code:
    [Code listing not reproduced: corrected version of the snippet.]
Building on the aforementioned proposition, the methodology follows the steps delineated in Figure 1. The process begins with the generation of data sets comprising vulnerable source code from languages such as PHP, C, C++, and Java. Subsequently, these data sets are structured into tuples according to the BERT format, namely Q V , C V , and A V , where Q V refers to questions regarding the vulnerable code, C V represents the relevant contexts, and A V includes the start ( A V [ START ] ) and end positions ( A V [ END ] ) within C V .
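For illustration, one such tuple can be sketched as a SQuAD-style record; the field names below are assumptions for illustration, as the paper does not specify its exact serialization:

```python
# Minimal sketch of one (Q_V, C_V, A_V) tuple in BERT/SQuAD-style QA format.
# Field names are illustrative assumptions, not the authors' exact schema.
context = (
    "According to CWE-476: NULL Pointer Dereference, the product dereferences "
    "a pointer that it expects to be valid, but is instead NULL. In this code, "
    "the ptr variable is allocated memory, but this memory is never actually used."
)
answer_text = "the ptr variable is allocated memory, but this memory is never actually used"
start = context.find(answer_text)  # A_V[START], a character offset into C_V

qa_tuple = {
    "question": "Why is this code vulnerable?",   # Q_V
    "context": context,                           # C_V
    "answer": {
        "text": answer_text,                      # A_V
        "start": start,                           # A_V[START]
        "end": start + len(answer_text),          # A_V[END]
    },
}

def extract_answer(t):
    """Recover A_V by slicing C_V with the stored span positions."""
    a = t["answer"]
    return t["context"][a["start"]:a["end"]]
```

Storing A V as a span over C V (rather than as free text) is what lets an extractive QA model predict start and end positions directly.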
Following this, FSL tasks are generated by partitioning the data set into a support set ( S ) and a query set ( Q ). In this scenario, S is utilized to fine-tune the BERT-based model integrated into the Proto-MAML architecture, while Q is used to evaluate its generalization capabilities on unseen examples.
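The support/query partition for a single FSL task can be sketched as follows; the random sampling policy and shot sizes are assumptions for illustration:

```python
import random

def make_episode(examples, k_support, k_query, seed=0):
    """Split labeled QA tuples into a disjoint support set S and query set Q
    for one few-shot episode. Sampling policy is an illustrative assumption."""
    rng = random.Random(seed)
    pool = list(examples)
    rng.shuffle(pool)
    support = pool[:k_support]                   # S: used to adapt the model
    query = pool[k_support:k_support + k_query]  # Q: used to test generalization
    return support, query
```

Because S and Q are drawn disjointly from the same task, performance on Q measures how well the adaptation on S generalizes to unseen examples.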
Subsequently, tokenization of the tuples in the support set { Q V , C V , A V } S is performed, mapping the start ( A V [ START ] ) and end positions ( A V [ END ] ) within C V .
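The mapping of A V 's character offsets onto token positions can be sketched as below; a naive whitespace tokenizer stands in for BERT's WordPiece tokenizer, which exposes equivalent character-to-token offsets in practice:

```python
def char_to_token_span(context, start_char, end_char):
    """Map A_V[START]/A_V[END] character offsets in C_V to token indices.
    Whitespace tokenization is a simplification of BERT's WordPiece."""
    spans, pos = [], 0
    for tok in context.split():
        s = context.index(tok, pos)
        spans.append((s, s + len(tok)))
        pos = s + len(tok)
    tok_start = next(i for i, (s, e) in enumerate(spans) if s <= start_char < e)
    tok_end = next(i for i, (s, e) in enumerate(spans) if s < end_char <= e)
    return tok_start, tok_end
```

The resulting token indices are what the model is trained to predict as the answer span.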
In the following phase, parameter thawing is executed through an internal loop, selectively unfreezing the pre-trained parameters of BERT transferred to Proto-MAML. This allows for effective adaptation across tasks T i over multiple episodes T i + 1 , thereby refining the internal representations. During this process, prototypes p j are calculated for S , determining the feature space of the tuples and A V , thus preventing parameter degradation and generating a concise version of the context.
As the episodes progress, the query set Q is evaluated by comparing its predicted logits ( λ [ START ] and λ [ END ] ) with the prototypes p j . Performance metrics such as the F1-score, Exact Match, and Entropy are used to quantify the effectiveness of the model in generalizing to unseen data, as well as to facilitate local meta-learning.
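The prototype computation and distance-based scoring described above can be sketched with NumPy; the embedding vectors here are toy stand-ins for BERT representations, and the class scores stand in for the λ logits:

```python
import numpy as np

def build_prototypes(embeddings, labels):
    """p_j = mean of the support-set embeddings belonging to class j
    (the Prototypical Networks prototype)."""
    labels = np.array(labels)
    return {c: embeddings[labels == c].mean(axis=0) for c in sorted(set(labels.tolist()))}

def classify(query_vec, prototypes):
    """Score each class by negative squared Euclidean distance to its
    prototype p_j (a stand-in for the predicted logits) and return the best."""
    scores = {c: -np.sum((query_vec - p) ** 2) for c, p in prototypes.items()}
    return max(scores, key=scores.get)
```

Because prototypes are simple class means over the support set, they can be recomputed per episode without re-training, which is what gives the method its adaptability.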
Finally, the model undergoes a meta-optimization process across all T i , minimizing the global meta-loss over n epochs in an external loop. This tactic aims to enhance its adaptability to unseen tasks, resulting in an optimized Proto-MAML model that is capable of predicting desired questions regarding vulnerable code, as well as associating the context with the response related to the vulnerable line of code and the secure code.
Section 4.1, Section 4.2 and Section 4.3 delve deeper into the methodological blocks.

4.1. Inputs

To generate inputs, the first step is identifying the programming languages that are most susceptible to security issues. Based on the findings presented in [37], the considered factors included the number of entries in the CVE [7] database, associated Common Weakness Enumerations (CWE) [38], and Common Vulnerability Scoring System (CVSS) [39] scores.
The CVE database assigns unique identifiers to publicly disclosed security flaws; CWEs categorize issues according to their weaknesses; and CVSS scores evaluate the severity of flaws, classifying them as low (0.1–3.9), medium (4.0–6.9), or high (7.0–10.0).
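The severity bands above can be expressed as a small lookup; note that CVSS v3.x additionally separates a "critical" band at 9.0–10.0, so the three-band split here follows the text rather than the full specification:

```python
def cvss_band(score):
    """Map a CVSS base score to the three severity bands used in the text."""
    if not 0.1 <= score <= 10.0:
        raise ValueError("CVSS base scores range from 0.1 to 10.0")
    if score <= 3.9:
        return "low"
    if score <= 6.9:
        return "medium"
    return "high"
```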
An analysis of the MITRE National Vulnerability Database (NVD) [40] associated CVEs with their respective CWEs, revealing that PHP, Java, C, and C++ had the highest number of reported security concerns, as conveyed in Table 1.
Table 1 delineates the primary CWEs impacting PHP, Java, C, and C++, emphasizing critical security flaws across diverse technologies. CWE-20 (Improper Input Validation) significantly affects PHP web applications, including Content Management Systems (CMS) such as WordPress, and is also prevalent in Java-based enterprise platforms, leading to improper parameter handling in APIs. CWE-79 (Cross-Site Scripting) represents a major risk in both PHP and Java. In PHP, it compromises dynamic websites, while in Java, it frequently impacts frameworks like Spring and Struts. CWE-89 (SQL Injection) affects PHP interaction with relational databases like MySQL and PostgreSQL, and Java applications, particularly in ORM tools like Hibernate.
Memory management issues dominate in C and C++. CWE-119 (Improper Restriction of Memory Buffer) and CWE-125 (Out-of-Bounds Read) are prevalent in systems-level software and real-time operating systems, leading to vulnerabilities in network stacks and embedded systems. Similarly, CWE-416 (Use-After-Free) and CWE-476 (NULL Pointer Dereference) appear in C and C++ libraries for scientific computing and game engines. These vulnerabilities contribute to 42% of high-severity issues, as indicated by CVSS scores.
CWE-502 (Deserialization of Untrusted Data) primarily affects Java and PHP applications, with risks evident in serialization libraries and frameworks like Apache Commons and Laravel. CWE-434 (Unrestricted File Upload) is another notable PHP vulnerability, frequently exploited in web hosting platforms, while CWE-352 (Cross-Site Request Forgery) compromises session security in both PHP and Java-based e-commerce platforms.
Concurrency and resource management issues are prominent in C and C++. CWE-362 (Concurrent Execution Vulnerability) impacts real-time systems and database engines, causing deadlocks and race conditions. CWE-399 (Resource Management Errors) is pervasive in multimedia libraries, contributing to medium-severity vulnerabilities. CWE-400 (Uncontrolled Resource Consumption) impacts Java and C++, particularly in cloud-based solutions, leading to denial-of-service risks.
CWE-306 (Missing Authentication for Critical Functionality), CWE-601 (Open Redirect), and CWE-611 (Improper Restriction of XML External Entity Reference) are recurring in web and API-based systems across PHP and Java, while CWE-732 (Incorrect Permission Assignment) is observed in Java frameworks used for authentication and authorization. These issues underline the critical need for sophisticated solutions to mitigate vulnerabilities across languages and technologies.
According to the study On the Vulnerability Proneness of Multilingual Code [11], which employed NBR to examine the proneness of languages to vulnerabilities, certain languages exhibit elevated vulnerability scores. In particular, NBR was used to analyze the occurrence of vulnerabilities in programming languages and assess factors such as memory management, input validation, and access control. C and C++ have high-risk coefficients of 0.5444 and 0.4271 due to memory management issues, while Java has a vulnerability coefficient near 0.6584 due to access control and input validation weaknesses. PHP shows a coefficient of 0.2480 due to XSS and SQL injection, while JavaScript, which is pervasive in web applications, has a coefficient of 0.0451 related to insecure interactions.
The inclusion of PHP alongside Java, C, and C++ in this study is justified due to their distinct programming roles. Java, C, and C++ belong to the same language family and are essential in high-performance systems requiring explicit memory management, which makes them vulnerable to memory safety and access control issues. Notably, PHP—despite its declining popularity—remains widely used in legacy web applications, where vulnerabilities such as Cross-Site Scripting (XSS) and SQL injection continue to be key concerns. Thus, PHP was included in the vulnerability analysis due to its persistent use in security-critical applications alongside Java, C, and C++.
Finally, Python, although relatively safe (with an NBR coefficient of 0.0531), can encounter challenges related to input validation and access control when integrated with other languages via Foreign Function Interfaces (FFIs), which heightens its vulnerability. Moreover, traditional ML methods based on batch processing struggle to adapt to new data distributions or emerging threats, limiting their flexibility in handling evolving security landscapes [13].
Despite challenges such as licensing, sample diversity, or public access in obtaining vulnerable code, the Software Assurance Metrics and Tool Evaluation (SAMATE) project [41] offers a comprehensive data set for this purpose. The Software Assurance Reference Data Set (SARD) [42] is openly available and provides over 450,000 case studies for programming languages such as PHP, Java, C, and C++, classified according to CWE and including descriptions of code flaws, making it a valuable resource for vulnerability research.
To ensure data quality, the vulnerable code from SARD was thoroughly filtered. Only cases labeled as accepted + bad, which have been verified through functional testing, were included. Mixed cases with minimal variations within the same CWE were excluded, and obsolete cases were filtered by comparing their descriptions with those in the OWASP Top 10 2021 [43]. This comparison employed a Vector Space Model (VSM) using the Term Frequency–Inverse Document Frequency (TF-IDF) algorithm [44], which assigns weights to terms in a textual corpus based on their relevance within a data set. Denoting the vulnerability descriptions in SARD as $D_V$ and those from the OWASP Top 10 as $D_{OWASP}$, the TF-IDF is calculated according to Equation (1):

$$\text{tf-idf}(D_V, D_{OWASP}) = \text{tf}(D_V, D_{OWASP}) \times \log \frac{N}{\text{df}(D_{OWASP})}, \tag{1}$$

where $\text{tf}(D_V, D_{OWASP})$ represents the term frequency in $D_V$, $N$ is the total number of vulnerability descriptions, and $\text{df}(D_{OWASP})$ is the number of descriptions in $D_{OWASP}$ in which each term occurs.

After extracting key terms from both data sets using TF-IDF, cosine similarity was applied (as expressed in Equation (2)) to calculate the average similarity score between $D_V$ and $D_{OWASP}$, yielding a result of 65.7%:

$$\cos(\theta) = \frac{D_V \cdot D_{OWASP}}{\|D_V\| \, \|D_{OWASP}\|}, \tag{2}$$

where $D_V$ and $D_{OWASP}$ are the TF-IDF vectors, and $\|D_V\|$ and $\|D_{OWASP}\|$ are their respective norms. Vulnerabilities with similarity scores below this threshold were discarded as redundant or obsolete.
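As an illustrative sketch of this filtering step, the TF-IDF weighting and cosine comparison can be expressed in plain Python. The two descriptions below are invented examples, and the IDF is smoothed (unlike the plain $\log(N/\text{df})$ of Equation (1)) so that terms shared by all documents keep a small positive weight:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF with smoothed IDF; Eq. (1) in the paper uses plain log(N/df)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    vocab = sorted(df)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[t] * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t in vocab])
    return vectors

def cosine(u, v):
    """Cosine similarity, Eq. (2): dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical SARD and OWASP Top 10 descriptions (illustrative only)
sard_desc = "sql injection through unsanitized user input in a query string"
owasp_desc = "injection flaws such as sql injection caused by unsanitized input"
v_sard, v_owasp = tfidf_vectors([sard_desc, owasp_desc])
score = cosine(v_sard, v_owasp)
keep = score >= 0.657  # the paper's reported average-similarity cutoff
```

In the actual pipeline, each SARD description would be compared against the OWASP corpus and discarded when its score falls below the 65.7% threshold.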
Following this process, 9867 vulnerable code fragments for PHP, 4677 for Java, 5019 for C, and 4038 for C++ were collected. Figure 2 illustrates the final number of CWEs associated with each programming language.
To structure the data set under the QA architecture of BERT, the code samples for each programming language were organized into tuples $\langle Q_V, C_V, A_V \rangle$. For $Q_V$, each vulnerable code fragment was aligned with its respective programming language and reformulated into an interrogative format, integrating both the question and the code fragment into a unified query.
$C_V$ was constructed as a single context string by concatenating the vulnerable code, the remediation or mitigation description provided by SARD based on the CWE, and the reconstructed secure code. To enhance $C_V$, the static code analysis tool SonarQube [45] was employed: through its web API, vulnerable code fragments from the SARD data set were analyzed to detect known patterns across the supported programming languages (PHP, Java, C++, and C), and customized detection rules were designed to ensure precise identification of CWE patterns within the data set.
SonarQube also allowed for efficient filtering of relevant code fragments. For each sample, only the lines directly associated with the identified vulnerability were retained and incorporated into $C_V$; this included the vulnerable line, a detailed remediation description, and the reconstructed secure code. Extraneous elements unrelated to the vulnerability were excluded to maintain clarity and precision.
When SonarQube could not identify or reconstruct the secure version of the code, its approximations served as a foundation for further manual analysis. Adhering to the OWASP Top 10 and SANS Top 25 security standards, experts manually reviewed and reconstructed secure code for cases involving complex patterns or in which SonarQube failed to detect insecure fragments.
The final composition of $C_V$ combined both the vulnerable code and the reconstructed secure code, accompanied by detailed remediation instructions. Similarly, $A_V$ was derived by extracting the remediation description from $C_V$, providing an actionable response to the identified vulnerability. The indices $A_{V[\mathrm{START}]}$ and $A_{V[\mathrm{END}]}$ were included to indicate the exact location within $C_V$ where the description, remediation, or mitigation details are presented.
Each sample for every programming language was ultimately stored as a JSON object with the following fields: a unique sample identifier (sample_id), the programming language (language), and $Q_V$, $C_V$, $A_V$, $A_{V[\mathrm{START}]}$, and $A_{V[\mathrm{END}]}$, identified, respectively, by the keys Q_V, C_V, A_V, A_V_START, and A_V_END. Figure 3 provides an excerpt of the data format.
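A minimal sketch of one such record is shown below. The PHP fragment, question text, and remediation are hypothetical (not taken from SARD), and character offsets are used for the span indices for simplicity; in practice the spans refer to positions within the tokenized context:

```python
import json

# Hypothetical record following the described schema; contents are illustrative.
vulnerable = "$q = \"SELECT * FROM users WHERE name='\" . $_GET['u'] . \"'\";"
remediation = "Use prepared statements with bound parameters."
secure = "$stmt = $pdo->prepare('SELECT * FROM users WHERE name = ?');"

c_v = " ".join([vulnerable, remediation, secure])  # C_V: code + remediation + secure code
a_v_start = c_v.index(remediation)                 # A_V_START: span start within C_V
a_v_end = a_v_start + len(remediation)             # A_V_END: span end within C_V

sample = {
    "sample_id": "php-cwe89-000001",
    "language": "PHP",
    "Q_V": "Which part of this PHP fragment is vulnerable to SQL injection? " + vulnerable,
    "C_V": c_v,
    "A_V": remediation,
    "A_V_START": a_v_start,
    "A_V_END": a_v_end,
}
record = json.dumps(sample)  # one JSON object per sample
```

The invariant worth noting is that slicing C_V with the two indices must recover A_V exactly, which is what the Exact Match metric later relies on.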
To commence the evaluation process, as elaborated in Section 5, the dataset was partitioned into a training set, comprising 80% of the total samples, and a test set, containing the remaining 20%. The training set was further stratified into a support set S and a query set Q , which were utilized for training and evaluating Proto-MAML, respectively. The distribution of samples across programming languages is detailed below:
  • PHP: The training set comprises 7893 samples, and the test set includes 1974 samples.
  • Java: The training set comprises 3741 samples, and the test set includes 936 samples.
  • C: The training set comprises 4015 samples, and the test set includes 1004 samples.
  • C++: The training set comprises 3230 samples, and the test set includes 808 samples.
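The listed sizes are consistent with a shuffle-and-cut 80/20 split where the training set takes the floor of $0.8 \times n$ samples; a small sketch (the seed is arbitrary) reproduces the counts:

```python
import random

# Per-language totals reported above; an 80/20 cut with floor(0.8 * n)
# for training reproduces the listed train/test sizes.
counts = {"PHP": 9867, "Java": 4677, "C": 5019, "C++": 4038}

def split_sizes(n, train_frac=0.8, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # shuffle before cutting
    cut = int(n * train_frac)
    return len(idx[:cut]), len(idx[cut:])

splits = {lang: split_sizes(n) for lang, n in counts.items()}
# splits["PHP"] == (7893, 1974), splits["Java"] == (3741, 936), etc.
```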

4.2. Few-Shot Task Generation

The generation of tasks in FSL environments is a fundamental pillar for meta-learning and model-agnostic approaches. Following the frameworks established in [46,47,48], these tasks, represented by $\mathcal{T}_i$, allow models to iteratively adjust their parameters using small sets of examples known as shots. To implement this process, two key sets are employed: the support set $\mathcal{S}$ and the query set $\mathcal{Q}$.
The support set $\mathcal{S}$ internally adjusts the parameters of the BERT model, enhancing its generalization capability through understanding the context, semantics, and relationships of the tuples, refining itself according to the number of examples in $\mathcal{T}_i$. The progress of the model is evaluated in parallel using the query set $\mathcal{Q}$, as mentioned in [49].
In the specific context of QA tasks with BERT under the Proto-MAML arrangement, each few-shot task $\mathcal{T}_i$ is associated with its respective support set $\mathcal{S}$, as presented in Equation (3):

$$\mathcal{S} = \{(Q_{V_j}, C_{V_j}), (A_{V_j}, A_{V[\mathrm{START}]_j}, A_{V[\mathrm{END}]_j})\}_{j=1}^{K}, \tag{3}$$

where $K$ denotes the total number of examples in the support set; each tuple includes the input pair $(Q_{V_j}, C_{V_j})$, representing the $j$-th question about vulnerable code and its associated context, along with the output elements $(A_{V_j}, A_{V[\mathrm{START}]_j}, A_{V[\mathrm{END}]_j})$, which indicate the answer and its corresponding start and end tokens.
Similarly, the query set $\mathcal{Q}$ is structured according to its $j$-th instance, as illustrated in Equation (4):

$$\mathcal{Q} = \{(Q_{V_j}, C_{V_j})\}_{j=1}^{q}, \tag{4}$$

where $q$ refers to the number of examples in the query set, organized as pairs $(Q_{V_j}, C_{V_j})$ of questions and contexts used to determine whether the responses generated in task $\mathcal{T}_i$ during the progressive adaptation of $\mathcal{S}$ are optimal, based on the calculated performance and loss metrics.
The samples contained in S and Q are selected randomly for each task; although they may repeat in subsequent iterations, identical tuples will never appear within the same iteration.
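The episode construction just described can be sketched as follows. The record pool and the $K$/$q$ values are hypothetical; sampling without replacement within a task enforces the no-repetition rule, while different tasks may still reuse tuples:

```python
import random

def sample_task(pool, k, q, rng):
    """Draw one few-shot task T_i: K support tuples and q query pairs.
    Sampling without replacement guarantees that no tuple repeats within
    the same task, although tuples may reappear in later tasks."""
    drawn = rng.sample(pool, k + q)
    support = drawn[:k]                                # S: full QA tuples
    query = [(s["Q_V"], s["C_V"]) for s in drawn[k:]]  # Q: question/context only
    return support, query

# Hypothetical pool of preprocessed QA records
pool = [{"Q_V": f"q{i}", "C_V": f"c{i}", "A_V": f"a{i}"} for i in range(100)]
rng = random.Random(7)
support, query = sample_task(pool, k=5, q=3, rng=rng)
```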

4.3. Proto-MAML

4.3.1. BERT Preprocessing and Tokenization

To adapt BERT with $\mathcal{S}$ and $\mathcal{Q}$ to the Proto-MAML architecture, it is essential to ensure that the inputs comply with the specific requirements and dimensions of the model. BERT encompasses a wide range of pre-trained models varying in parameter size, tokenization type, and number of transformer layers. For this project, the BERT Uncased variant [50] was selected (hereafter referred to simply as BERT). This version incorporates 12 transformer layers, optimizing the informational capacity of input sequences; a hidden dimension of 768, preserving a deep representation of semantic, lexical, and syntactic features; and 12 attention heads, enabling precise refinement of the relevance of each word in these representations.
Owing to its case insensitivity, BERT is particularly well suited to the volatile and evolving syntax of source code. Furthermore, its compact vocabulary and ability to adapt to multiple domains greatly enhance the understanding of natural language questions, of source code as a technical abstraction, and of answers that integrate both aspects.
Figure 4 provides a schematic visualization of the data flow through a general BERT model in a QA environment. The sets $\mathcal{S}$ and $\mathcal{Q}$ are first tokenized into word units and transformed into input embeddings, which are then processed by the transformer layers to generate the output embeddings used in predictions for $A_V$, composed of $A_{V[\mathrm{START}]}$ and $A_{V[\mathrm{END}]}$.
Before being processed by BERT, the tuples $Q_V$ and $C_V$ within $\mathcal{S}$ and $\mathcal{Q}$ must undergo preprocessing to eliminate noisy textual content and out-of-context vocabulary, which could lead to incorrect or poorly interpretable outputs. The specific steps for preparing the inputs are listed below:
  • Handling Comments: Comments within $C_V$ are identified using language-specific delimiters (e.g., // or /* */ for PHP). Depending on their relevance, these comments are either retained to enrich the context or removed to eliminate superfluous information.
  • Normalization: Non-essential whitespace, tabulations, and special symbols are standardized. Natural language text is converted to lowercase, ensuring alignment with the case-insensitive characteristics of the BERT tokenizer.
  • Irrelevant Character Filtering: Unnecessary symbols, such as excessive punctuation or repeated line breaks, are removed from $C_V$; however, tokens integral to the source code, such as function calls, variable names, and SQL queries, are preserved to maintain contextual integrity.
  • Retention of Stopwords: Diverging from traditional preprocessing techniques, stopwords are deliberately retained within $Q_V$ and $C_V$. Preserving these elements is essential to maintain semantic and syntactic coherence between questions and contexts, particularly for the accurate identification of answer spans within $C_V$.
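The four steps above can be sketched with simple regular expressions. The function and its rules are a simplified illustration (a real pipeline would need language-aware comment handling so that, e.g., delimiters inside string literals are not stripped):

```python
import re

def preprocess(question, code, keep_comments=False):
    """Apply the four listed steps to one (Q_V, C_V) pair; stopwords are
    intentionally left untouched."""
    if not keep_comments:
        # Handling comments: strip //, /* */ and # comments (PHP-style)
        code = re.sub(r"//[^\n]*|/\*.*?\*/|#[^\n]*", "", code, flags=re.S)
    code = re.sub(r"\n{2,}", "\n", code)          # drop repeated line breaks
    code = re.sub(r"[ \t]+", " ", code).strip()   # normalize whitespace/tabs
    question = question.lower()                   # lowercase NL for uncased BERT
    return question, code

q, c = preprocess("Which line is Vulnerable to SQL injection?",
                  "<?php\n\n// read user input\n$u = $_GET['u'];")
```

Note that code identifiers such as `$_GET['u']` survive the filtering, as required by the third step.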
After preprocessing, the BERT tokenizer (denoted as $\tau$) segments sentences into subword units using the WordPiece algorithm, applied to the elements $Q_V$ and $C_V$ present in $\mathcal{S}$ and $\mathcal{Q}$ for each task $\mathcal{T}_i$, forming a unified sequence as shown in Equation (5):

$$[\mathrm{CLS}] + Q_V + [\mathrm{SEP}] + C_V + [\mathrm{SEP}]. \tag{5}$$

In this sequence, [CLS] acts as a classification marker, while [SEP] separates the question from the context.
To illustrate this process, the PHP code described in Listing 1 is used as an example, which describes an input that is vulnerable to SQL injection.
Listing 1. PHP code vulnerable to SQL injection.
(The listing is rendered as an image in the original article; it shows a PHP fragment that reads the username parameter directly from $_GET and uses it unsanitized in an SQL query.)
When this code is tokenized by BERT, the resulting sequence is as follows:
$$\tau(\texttt{<?php \$username = \$\_GET['username']}) = \{[\mathrm{CLS}], \texttt{<?}, \texttt{php}, \texttt{\$}, \ldots, \texttt{?>}\}, \tag{6}$$

where $\tau$ represents the tokenization function, $[\mathrm{CLS}]$ denotes the classification token, and tokens such as <?, php, and $ are extracted from the input.
After tokenization, the sequence is transformed into embeddings. Each embedding $E_{Q_{V_z}, C_{V_z}}$, as defined in Equation (7), is the sum of three components: token embeddings $T_z$, positional embeddings $P_z$, and segment embeddings $S_z$:

$$E_{Q_{V_z}, C_{V_z}} = T_{Q_{V_z}, C_{V_z}} + P_{Q_{V_z}, C_{V_z}} + S_{Q_{V_z}, C_{V_z}}. \tag{7}$$

The token embeddings $T_{Q_{V_z}, C_{V_z}}$ generated by the WordPiece tokenizer are shown in Equation (8):

$$T_{Q_{V_z}, C_{V_z}} = \mathrm{Embedding}(t_z), \quad t_z \in V, \tag{8}$$

where $V$ is the tokenizer vocabulary covering $Q_V$ and $C_V$, and $\mathrm{Embedding}(t_z)$ retrieves the vector representation of $t_z$.
To preserve the token order, BERT adds a positional embedding $P_z$, as defined in Equation (9):

$$P_{Q_{V_z}, C_{V_z}} = \mathrm{PositionalEncoding}(z), \quad z \in \{1, \ldots, N\}, \tag{9}$$

where $N$ is the maximum sequence length.
Segment embeddings $S_z$ differentiate between tokens belonging to $Q_V$ and $C_V$, as defined in Equation (10):

$$S_{Q_{V_z}, C_{V_z}} = \mathrm{SegmentEmbedding}(s_z), \quad s_z \in \{0, 1\}. \tag{10}$$
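The composition rule of Equation (7) can be illustrated with toy vectors. Real BERT uses learned 768-dimensional tables; the 4-dimensional values here are invented solely to show the element-wise sum of the three embedding types:

```python
# Toy 4-dimensional embeddings for a 3-token sequence: [CLS], one Q_V token,
# and one C_V token. The composition mirrors Eq. (7): E = T + P + S.
token_emb = [[0.10, 0.20, 0.00, 0.30],   # T for [CLS]
             [0.50, 0.10, 0.40, 0.00],   # T for a question token
             [0.20, 0.30, 0.10, 0.60]]   # T for a context token
pos_emb = [[0.01 * z] * 4 for z in range(3)]   # P: one vector per position z
seg_emb = [[0.0] * 4, [0.0] * 4, [1.0] * 4]    # S: segment 0 = Q_V, 1 = C_V

E = [[t + p + s for t, p, s in zip(tv, pv, sv)]
     for tv, pv, sv in zip(token_emb, pos_emb, seg_emb)]
```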
BERT processes these embeddings $E_{Q_{V_z}, C_{V_z}}$ through its transformer layers, producing contextualized output representations $H = \{h_{[\mathrm{CLS}]}, h_{Q_{V_1}}, h_{Q_{V_2}}, \ldots, h_{[\mathrm{SEP}]}, h_{C_{V_1}}, h_{C_{V_2}}, \ldots, h_{C_{V_z}}\}$. Among these,
  • The embedding $h_{[\mathrm{CLS}]}$ summarizes the global representation of the input;
  • The embeddings $h_{Q_V}$ and $h_{C_V}$ represent the relationships specific to the question and context, respectively.
The contextualized representations $H$ are used to compute the logits for the start and end positions of the answer spans within $C_V$. These logits, generated for both $\mathcal{S}$ and $\mathcal{Q}$, form the foundation for Proto-MAML training, as detailed in Section 4.3.2, to predict responses $A_V = \{A_{V[\mathrm{START}]}, A_{V[\mathrm{END}]}\}$.

4.3.2. Proto-MAML Training Within a BERT-Enhanced QA Framework

The parameters of BERT, denoted as θ , are transferred to the MAML architecture [24,51]—an adaptive meta-learning algorithm renowned for its data-agnostic nature. Unlike traditional methods that require training a model from scratch using batch-based schemes and extensive labeling, MAML generalizes class spaces K through progressive learning with small pairs of examples and outputs. This process provides an initial representation of the desired final model for each task T i . The framework incorporates a dual optimization mechanism, where the model learns from the sample space via an inner loop that locally adjusts the parameters of the BERT model for T i . Simultaneously, a global adjustment is carried out through an outer loop, which is iterated across multiple epochs, enabling the model to internalize context and propagate it comprehensively.
In each episode of $\mathcal{T}_i$, $\theta$ is adjusted by refining the parameters to minimize the loss over the support set $\mathcal{S}$, allowing efficient updates for new groups $j$ of $(Q_{V_j}, C_{V_j}), (A_{V_j}, A_{V[\mathrm{START}]_j}, A_{V[\mathrm{END}]_j})$ to be obtained. This process ensures that the inner loop generalizes the QA system, verifying the effectiveness of predictions on the set $\mathcal{Q}$ in terms of correctly associating questions with predicted answers. The parameters adapted for each $\mathcal{T}_i$ are denoted as $\theta'_{\mathcal{T}_i}$, as illustrated in Equation (11):

$$\theta'_{\mathcal{T}_i} = \theta - \alpha \nabla_\theta f_{\mathcal{T}_i}(\mathcal{S}, \theta), \tag{11}$$

where $\theta$ represents the initial parameters of the BERT model; $\alpha$ is the learning rate, which regulates the magnitude of parameter updates during the adaptation process; and $\nabla_\theta f_{\mathcal{T}_i}$ denotes the gradient of the loss function for each task over its support set $\mathcal{S}$, serving as the guide for the necessary adjustments.
Simultaneously, the outer loop of MAML adjusts the parameters $\theta$ by minimizing the loss on the query set $\mathcal{Q}$, evaluated with the task-adapted parameters, as presented in Equation (12):

$$\min_\theta \sum_{\mathcal{T}_i \sim \mathcal{T}} \mathcal{L}_{\mathcal{T}_i}(\mathcal{Q}, \theta'_{\mathcal{T}_i}), \tag{12}$$

where $\min_\theta$ denotes finding the value of $\theta$ that minimizes the sum of task losses $\mathcal{L}_{\mathcal{T}_i}$, each evaluated on $\mathcal{Q}$ with the updated parameters $\theta'_{\mathcal{T}_i}$.
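The interplay of Equations (11) and (12) can be illustrated with a deliberately tiny one-dimensional example. The quadratic task losses, their targets, and the learning rates are all invented; the point is only the structure of inner adaptation on support data followed by a meta-update on query data:

```python
# Toy 1-D illustration of the inner/outer loops of Eqs. (11) and (12).
# Each task pulls the parameter toward a task-specific target; quadratic
# losses give analytic gradients, so no autograd library is needed.
alpha, gamma = 0.1, 0.05          # inner and outer learning rates
theta = 0.0                       # initial "model parameter"
tasks = [(1.0, 1.2), (2.0, 2.1), (-1.0, -0.9)]  # (support, query) targets per T_i

for _ in range(200):                           # outer loop: meta-updates
    meta_grad = 0.0
    for s_t, q_t in tasks:
        grad_support = 2 * (theta - s_t)       # gradient of f_{T_i}(S, theta)
        theta_i = theta - alpha * grad_support # Eq. (11): inner adaptation
        # query-loss gradient w.r.t. theta, via the chain rule
        # (d theta_i / d theta = 1 - 2*alpha for this quadratic loss)
        meta_grad += 2 * (theta_i - q_t) * (1 - 2 * alpha)
    theta -= gamma * meta_grad / len(tasks)    # Eq. (12): minimize query loss
```

After a few hundred meta-updates, theta settles at the value from which one inner step best serves all three tasks' query targets on average.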
Although MAML facilitates the rapid adaptation of parameters, as observed in Equations (11) and (12), it may encounter limitations in tasks requiring precise discrimination, especially when working with discrete data and multiple outputs, such as the start and end indices $A_{V[\mathrm{START}]_j}$ and $A_{V[\mathrm{END}]_j}$. As MAML is designed for class spaces $K$ with few labels, this challenge may compromise its ability to correctly adjust the relationships between the answer positions $A_{V_j}$ within the context $C_{V_j}$ and the queries $Q_{V_j}$, leading to inaccurate predictions if the parameters $\theta$ degrade during inner-loop adjustments [52] due to the heterogeneity of the indices to be calculated.
Building on foundational principles of meta-learning, Prototypical Networks [47,48] offer an effective mechanism for encoding structured task representations. Unlike traditional gradient-based methods, which rely on point-wise parameter adjustments, these networks define task relationships through prototypical embeddings, enabling better generalization across heterogeneous distributions.
To mitigate these limitations, Prototypical Networks complement the MAML scheme by providing additional structure in the tuples of $\mathcal{S}$. Instead of relying solely on adaptation of the parameters $\theta$, Prototypical Networks compute prototypes that more homogeneously represent the outputs $A_{V[\mathrm{START}]}$, $A_{V[\mathrm{END}]}$, and $A_{V_j}$, along with the representative features of $Q_{V_j}$ and $C_{V_j}$, thus simulating a concise and well-defined class space.
Each prototype $p_j$ represents a condensed and averaged form of the tuples in $\mathcal{S}$. It synthesizes the start and end indices $A_{V[\mathrm{START}]_j}$ and $A_{V[\mathrm{END}]_j}$, which define the boundaries of the class space, as well as the complete answer $A_{V_j}$ within the context $C_{V_j}$, along with its corresponding $Q_{V_j}$. These elements are derived from the embeddings $h_{Q_V}$ and $h_{C_V}$ generated by BERT for the support examples in $\mathcal{S}$, as indicated in Equation (13):

$$p_j = \frac{1}{K} \sum_{j=1}^{K} f_\theta(A_{V[\mathrm{START}]_j}, A_{V[\mathrm{END}]_j}, A_{V_j}, Q_{V_j}, C_{V_j}), \tag{13}$$

where $f_\theta$ is the prototype function that maps the key features of $A_{V[\mathrm{START}]_j}$, $A_{V[\mathrm{END}]_j}$, $A_{V_j}$, $Q_{V_j}$, and $C_{V_j}$, conditioned by $\theta$ in the inner loop.
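A minimal sketch of Equation (13): the prototype is the mean of encoder features over the $K$ support tuples. The encoder is stubbed as the identity on invented toy vectors; in the actual model it would be the BERT-derived feature map $f_\theta$:

```python
# Eq. (13) sketch: a prototype averages encoder features over K support tuples.
def f_theta(features):
    return features  # stand-in for the BERT-derived features of one tuple

support_features = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]]  # K = 3 encoded tuples
K = len(support_features)
prototype = [sum(f_theta(x)[d] for x in support_features) / K
             for d in range(2)]
# prototype == [2.0, 3.0]
```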
Proto-MAML emerges as a hybrid design that integrates the rapid adaptability of MAML with the structured representations of Prototypical Networks. Through combining these techniques, Proto-MAML addresses the limitations of gradient-based methods while leveraging metric-based representations to enhance its generalization ability, particularly in low-resource contexts [46].
Compared to other meta-learning techniques such as Latent Embedding Optimization (LEO) [46] and Task-Agnostic Meta-Learning (TAML) [48], which focus on latent-space optimization or task-agnostic adjustments, Proto-MAML provides a balance between computational efficiency and robust adaptation. Proto-MAML also surpasses Meta-Stochastic Gradient Descent (Meta-SGD) and Reptile, which prioritize faster convergence but lack the structured task embeddings that are necessary when considering dynamic and heterogeneous environments. Leveraging prototypical embeddings to minimize the number of gradient updates required, Proto-MAML ensures stable performance while avoiding overfitting on small support sets.
This perspective is particularly desirable in QA scenarios involving source code vulnerability detection. In such cases, the support sets S tend to be small, as specific vulnerabilities are often poorly documented and labeled examples are typically limited. For instance, identifying SQL injections or insecure references in PHP code snippets may require recognizing highly specific patterns within the context of vulnerable code. Proto-MAML, through the use of prototypical representations, is capable of generalizing to new examples of vulnerabilities with minimal supervision, maintaining significant accuracy levels in tasks where data are scarce.
Moreover, the ability of Proto-MAML to integrate contextualized embeddings generated by BERT, such as h Q V and h C V , enables the modeling of complex relationships between specific queries and the context of the code. This ensures that technical questions—such as Which line of code is vulnerable? or Which function needs refactoring?—can be answered accurately even when labeled samples are sparse. Ultimately, Proto-MAML not only minimizes computational costs but also provides an efficient solution for resource-constrained environments in which adaptability and accuracy are paramount.
As highlighted in Figure 5, the input flow through the $k$ transformer layers of BERT within the Proto-MAML framework follows a structured organization, utilizing transformer blocks to generate prototypes of $Q_{V_j}$, $C_{V_j}$, and $A_{V_j}$ [53].
As detailed in Section 4.3.1, the input embeddings E are derived as a sum of token embeddings T, positional embeddings P, and segment embeddings S, as defined in Equation (7). The embeddings E serve as the foundation for the contextualized representations H = { h 1 , h 2 , , h n } , which are crucial for Proto-MAML.
After generating E and H, BERT applies a multi-head self-attention mechanism to capture semantic and dependency relationships within the prototypes of S and Q . This mechanism aids in identifying local and global relationships between the samples, therefore improving the generalization capacity of the model. These relationships include token matching within E and H, contextual relevance, and interactions between different segments of vulnerable code and their responses, enabling the identification of both vulnerable lines and secure code.
The contextualized embeddings H, which are generated through the transformer layers of BERT, are incrementally refined as they progress through multiple attention heads and feedforward networks (FFNs). This process ensures that each token embedding within H semantically and syntactically aligns with its respective context. The iterative refinement facilitated by these layers provides a robust representation, enabling precise predictions for the QA pairs.
To maintain numerical stability and ensure consistent gradient flow during backpropagation, layer normalization is applied. This normalization step, as defined in Equation (14), ensures that the embeddings $H$, including $h_{[\mathrm{CLS}]}$, $h_{Q_V}$, and $h_{C_V}$, remain balanced across all processing layers:

$$\hat{h}_{V_j} = \frac{h_{V_j} - \mu_{\mathcal{S}}}{\sigma_{\mathcal{S}} + \epsilon}, \tag{14}$$

where $h_{V_j}$ represents the original activation value for a token in the sequence and $\hat{h}_{V_j}$ is the normalized output; $\mu_{\mathcal{S}}$ and $\sigma_{\mathcal{S}}$ are, respectively, the mean and standard deviation computed over the prototype activations in $\mathcal{S}$; and $\epsilon$ is a small constant that prevents division by zero.
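Equation (14) reduces to a few lines of Python. The sketch normalizes a single toy activation vector, whereas the paper's statistics are taken over the prototype activations in $\mathcal{S}$:

```python
import math

def layer_norm(h, eps=1e-6):
    """Eq. (14): center activations by their mean and scale by the standard
    deviation plus a small epsilon to avoid division by zero."""
    mu = sum(h) / len(h)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in h) / len(h))
    return [(x - mu) / (sigma + eps) for x in h]

h_hat = layer_norm([1.0, 2.0, 3.0, 4.0])  # normalized activations
```

The normalized vector has (near-)zero mean and unit scale, which is what keeps the gradient magnitudes comparable across layers.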
During fine-tuning of the parameters towards $\theta'$, the predictions for the answer spans, referred to as logits, are computed. These logits, $\lambda_{[\mathrm{START}]}$ and $\lambda_{[\mathrm{END}]}$, correspond to the probabilistic predictions for the start and end positions of the answer spans within $C_{V_j}$, and are defined in Equations (15) and (16), respectively:

$$\lambda_{[\mathrm{START}]} = \mathrm{softmax}(W_{[\mathrm{START}]} h_{[\mathrm{CLS}]} + b_{[\mathrm{START}]}), \tag{15}$$

$$\lambda_{[\mathrm{END}]} = \mathrm{softmax}(W_{[\mathrm{END}]} h_{[\mathrm{CLS}]} + b_{[\mathrm{END}]}), \tag{16}$$

where $W_{[\mathrm{START}]}$ and $W_{[\mathrm{END}]}$ are the weight matrices used to project $h_{[\mathrm{CLS}]}$ (the activation vector associated with the special classification token [CLS]) into the corresponding logits, and the bias terms $b_{[\mathrm{START}]}$ and $b_{[\mathrm{END}]}$ adjust these projections to accommodate variations in context and answer spans. Leveraging these refined logits, Proto-MAML can effectively align question–context–answer relationships to maximize its prediction accuracy.
Each prototype $p_j$ in $\mathcal{S}$ is iteratively refined across the meta-episodes by evaluating the logits on $\mathcal{Q}$, thus adjusting the relationships between $Q_{V_j}$ and $C_{V_j}$ within the tuple. This refinement process is formalized in Equation (17):

$$p_j^{(i+1)} = p_j^{(i)} - \eta \nabla_{p_j^{(i)}} \mathcal{L}_{proto}(\mathcal{Q}_{\mathcal{T}_i}, p_j^{(i)}), \tag{17}$$

where $i$ refers to the $i$-th unfrozen update of $\theta$ towards $\theta'$, advancing to $i+1$; $\eta$ is the prototype learning rate; and $\mathcal{L}_{proto}$ is the loss, defined from the mean squared errors between the start logits $\lambda_{[\mathrm{START}]}$ and end logits $\lambda_{[\mathrm{END}]}$ of the predictions generated by $p_j^{(i)}$ and those of the query set $\mathcal{Q}_{\mathcal{T}_i}$.
As the logits represent probabilities, they must be mapped to the discrete values corresponding to the highest predicted probability. Applying the argmax function to the logits $\lambda_{[\mathrm{START}]}$ and $\lambda_{[\mathrm{END}]}$ of $\mathcal{Q}_{\mathcal{T}_i}$ yields these discrete predictions, denoted by $\hat{A}_{V[\mathrm{START}]}$ and $\hat{A}_{V[\mathrm{END}]}$. The loss $\mathcal{L}_{proto}$ is then calculated as the average of the two losses, as shown in Equation (18):

$$\mathcal{L}_{proto} = \frac{1}{2} \left[ \mathrm{MSE}(\hat{A}_{V[\mathrm{START}]}, A_{V[\mathrm{START}]}^{\mathcal{Q}_{\mathcal{T}_i}}) + \mathrm{MSE}(\hat{A}_{V[\mathrm{END}]}, A_{V[\mathrm{END}]}^{\mathcal{Q}_{\mathcal{T}_i}}) \right]. \tag{18}$$
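The softmax-then-argmax step and the positional loss of Equation (18) can be sketched on one prediction. The per-token scores and the ground-truth span are hypothetical:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical per-token scores over a 6-token context C_V
start_probs = softmax([0.2, 2.5, 0.1, 0.0, -1.0, 0.3])  # lambda_[START]
end_probs = softmax([0.0, 0.1, 0.2, 3.0, -0.5, 0.1])    # lambda_[END]

# argmax maps the probability vectors to discrete span indices
pred_start = max(range(len(start_probs)), key=start_probs.__getitem__)
pred_end = max(range(len(end_probs)), key=end_probs.__getitem__)

# Eq. (18) on a single prediction: mean of the two squared positional errors
true_start, true_end = 1, 3
l_proto = 0.5 * ((pred_start - true_start) ** 2 + (pred_end - true_end) ** 2)
```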
In this way, L proto computes a sequential alignment that allows for internal adaptation—best described as local meta-adaptation towards BERT within Proto-MAML for each T i + 1 —allowing performance metrics such as F1-Score and Exact Match to be computed for evaluation of its performance.
When evaluating the performance, the total meta-loss $\mathcal{L}_{meta}^{(1)}$ is backpropagated with the updated parameters $\theta \rightarrow \theta'$, coupling $\mathcal{L}_{proto}$ for each $\mathcal{Q}_{\mathcal{T}_i}$ and $p_j$, as clarified in Equation (19):

$$\mathcal{L}_{meta}^{(1)} = \frac{1}{|\mathcal{T}|} \sum_{\mathcal{T}_i} \mathcal{L}_{proto}(\mathcal{Q}_{\mathcal{T}_i}, p_j, \theta'). \tag{19}$$

The adjustment of the model parameters during the computation of the meta-loss is formalized in Equation (20):

$$\theta^{(i+1)} = \theta^{(i)} - \gamma \nabla_\theta \mathcal{L}_{meta}^{(1)}, \tag{20}$$

where $i$ denotes the update in each $\mathcal{T}_i$, $\gamma$ is the learning rate for the meta-update, and $\nabla_\theta \mathcal{L}_{meta}^{(1)}$ is the gradient of the meta-loss with respect to the model parameters.
In the Proto-MAML architecture, optimization of the BERT model parameters θ involves a global meta-optimization process that follows an inner and outer loop. Meta-learning—or learning to learn—optimizes the model not only for a specific task but also in order to generalize to new tasks more quickly, based on past experience. In this case, Proto-MAML optimizes the performance of BERT by iteratively adapting its parameters to improve the prototypes during learning.
In the inner loop, the model refines the prototypes for each $\mathcal{T}_i$ within the embedding space generated by BERT. Subsequently, the meta-optimization requires global backpropagation of the adjustments obtained in the inner loop, such that $\theta$ is updated optimally after multiple epochs (denoted as $n$ epochs) in the outer loop. During each epoch $e$ ($1 \le e \le n$), the model iterates over the tasks $\mathcal{T}_i$, computing the total meta-loss $\mathcal{L}_{meta}^{(2)}$ for all tasks. The cumulative loss over an epoch is expressed in Equation (21):

$$\mathcal{L}_{meta}^{(2)} = \sum_{e=1}^{n} \sum_{\mathcal{T}_i=1}^{m} \mathcal{L}_{meta}^{(1)}, \tag{21}$$

where $m$ is the total number of tasks processed in each epoch $e$ ($1 \le e \le n$). The cumulative update of the total loss is used to adjust the parameters of BERT, as expressed in Equation (22):

$$\theta^{(n+1)} = \theta^{(n)} - \gamma \nabla_\theta \mathcal{L}_{meta}^{(2)}, \tag{22}$$
where γ is the learning rate for the meta-optimization update, which ensures that BERT efficiently adapts its parameters with the accumulated gradients across all tasks during each epoch. Finally, Algorithm 1 describes the global meta-optimization process of Proto-MAML, leading to the progressive refinement of prototypes and robust learning of the BERT architecture.
Algorithm 1 Meta-Optimization and Proto-MAML Model Generation.
 1: Input: Support set $\mathcal{S}$, query set $\mathcal{Q}$, number of epochs $n$, initial parameters $\theta$
 2: for each epoch $e = 1, \ldots, n$ do  ▹ Outer Loop: Global optimization across multiple epochs
 3:     for each task $\mathcal{T}_i$ do  ▹ Inner Loop: Process for each task
 4:         Tokenize $\{Q_V, C_V\} \in \mathcal{S}$
 5:         Unfreeze the parameters of BERT: $\theta_u \leftarrow \theta_f + \Delta\theta$, where $\theta_f$ is the frozen state and $\Delta\theta$ is the parameter change derived from gradient updates
 6:         Compute prototypes $p_j$ from $A_V$ in $\mathcal{S}$ using $p_j = \frac{1}{K} \sum_{j=1}^{K} f_\theta(A_{V[\mathrm{START}]_j}, A_{V[\mathrm{END}]_j}, A_{V_j}, Q_{V_j}, C_{V_j})$, where the prototypes $p_j$ represent the key features derived from the support sequences
 7:         Evaluate $\mathcal{Q}$ against the prototypes $p_j$
 8:         Calculate the logits $\lambda_{[\mathrm{START}]}, \lambda_{[\mathrm{END}]}$ for $A_V \in \mathcal{S}$
 9:         Convert the logits into discrete start and end positions: $A_{V[\mathrm{START}]} = \arg\max(\lambda_{[\mathrm{START}]})$, $A_{V[\mathrm{END}]} = \arg\max(\lambda_{[\mathrm{END}]})$
10:         Compute performance metrics: F1-score, Exact Match, and Entropy
11:         Update model parameters using the first meta-loss: $\theta^{(i+1)} = \theta^{(i)} - \gamma \nabla_\theta \mathcal{L}_{meta}^{(1)}(\mathcal{Q}_{\mathcal{T}_i}, \theta)$, where $\gamma$ is the learning rate for parameter updates
12:         Update prototypes by backpropagating gradients through the second meta-loss: $p_j^{(i+1)} = p_j^{(i)} - \eta \nabla_{p_j^{(i)}} \mathcal{L}_{proto}(\mathcal{Q}_{\mathcal{T}_i}, p_j^{(i)})$, where $\eta$ is the learning rate for prototype updates
13:     end for
14:     Minimize the global loss after all tasks: $\theta^{(n+1)} = \theta^{(n)} - \gamma \nabla_\theta \mathcal{L}_{meta}^{(2)}$, where $\mathcal{L}_{meta}^{(2)}$ corresponds to the cumulative loss during the epoch
15: end for
16: Output: Optimized Proto-MAML model with adjusted parameters $\theta'$

5. Results

The performance evaluation of Proto-MAML in QA tasks followed a structured and systematic approach, assessing each phase of the training process and culminating in a final evaluation under the k-shot-$i$-way paradigm. As described in [54], this method involves selecting $k$ samples for $\mathcal{T}_i$ classification tasks, distributed across the support set $\mathcal{S}$ and the query set $\mathcal{Q}$, over $n$ epochs. Performance indicators are therefore calculated at two points: while the Proto-MAML estimator is trained across the inner and outer loops, resulting in a total of $k \times i \times n$ tasks, and upon producing the final model.
Typically, $i$ is set in the range $5 \le i \le 10$ in order to balance task complexity with the generalization capacity of the model. Previous studies, such as those cited in [55], have demonstrated that this range is effective for ensuring generalization without overfitting, thus avoiding degradation in meta-learning updates.
Compared to alternative configurations, such as $i > 10$, which increase task complexity, the range $5 \le i \le 10$ has been shown to achieve a favorable trade-off between training efficiency and performance. Configurations with $i < 5$, on the other hand, may oversimplify the learning task, leading to insufficient representation of real-world complexities [56]. When setting $i$ within this intermediate range, FSL tasks provide a more realistic and generalizable benchmark for model evaluation.
Metrics such as Precision, Recall, F1-score, and Exact Match (EM) have been widely adopted in the context of QA tasks due to their ability to evaluate both the accuracy and completeness of predictions. Precision and Recall focus on correctness and coverage, while the F1-score integrates both measures, providing a balanced perspective. EM complements these metrics by quantifying the exact alignment between predicted and ground-truth indices, making it indispensable for QA models predicting $A_{V[\mathrm{START}]}$ and $A_{V[\mathrm{END}]}$ spans. These metrics have been validated as effective evaluation tools across various QA benchmarks, demonstrating their robustness in capturing nuanced performance in Answer Extraction (AE) and Recommender Systems (RS) contexts.
During training, metrics such as the Entropy $H(p)$ and the Prediction Error (PE) are integral to Proto-MAML. The Entropy quantifies the uncertainty in softmax predictions, with lower values reflecting greater confidence; this metric supports the adaptability of the model to new tasks by identifying areas that require fine-tuning or data augmentation. The PE measures the alignment between predicted and actual indices, directly influencing positional accuracy. Reducing the PE improves the reliability of predictions, enhances the ability of the model to generalize to unseen data, and optimizes its performance across diverse scenarios. Table 2 summarizes the performance metrics obtained during and after training.
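The two training-time metrics can be sketched as follows. The entropy formula is standard; the PE formulation below (sum of absolute index offsets) is one plausible stand-in, as the text does not give a closed form, and the probability vectors and indices are invented:

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i log p_i over a softmax output; lower = more confident."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def prediction_error(pred_start, pred_end, true_start, true_end):
    """Positional misalignment between predicted and ground-truth indices
    (an assumed formulation; the paper gives no closed form here)."""
    return abs(pred_start - true_start) + abs(pred_end - true_end)

h_confident = entropy([0.90, 0.05, 0.03, 0.02])
h_uniform = entropy([0.25] * 4)           # maximal uncertainty: log(4)
pe = prediction_error(12, 20, 12, 22)     # predicted span off by two tokens
```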
The configurations for the support and query sets under the 5-shot-i-way with n = 100 epochs are described in Table 3.
Table 4 summarizes the F1_combined results for different values of i during training. Building on this, Table 5 expands the analysis by presenting the entropy H(p) values, which reflect the uncertainty in predictions across the different programming languages and configurations, considering the distributions of true positives (TP_{A_V[START]} and TP_{A_V[END]}), false positives (FP_{A_V[START]} and FP_{A_V[END]}), and false negatives (FN_{A_V[START]} and FN_{A_V[END]}). Complementing these data, Table 6 provides the PE obtained for the same configurations.
Table 7 presents the average meta-loss values L_meta^(2) across different programming languages and configurations during 5-shot-i-way training, highlighting the general differences in performance over n = 100 epochs. Figure 6 complements this table by illustrating the evolution of the meta-loss values (recorded every 10 epochs), providing a comprehensive view of the training progress for the various programming languages under their respective configurations.
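The episodic loss underlying these tables can be sketched as follows: class prototypes are the means of the support embeddings, and queries are scored by a softmax over negative squared distances to the prototypes. The NumPy sketch below shows only this Prototypical-Networks loss on one synthetic episode; the MAML outer update that yields L_meta^(2) and the BERT encoder are omitted, and all values are synthetic.

```python
import numpy as np

def prototypes(support_emb, support_y, n_way):
    """Class prototypes: mean of the support embeddings of each class."""
    return np.stack([support_emb[support_y == c].mean(axis=0)
                     for c in range(n_way)])

def proto_loss(protos, query_emb, query_y):
    """Cross-entropy over the softmax of negative squared Euclidean
    distances to each prototype (the Prototypical Networks loss)."""
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_p[np.arange(len(query_y)), query_y].mean())

rng = np.random.default_rng(0)
n_way, k_shot, dim = 3, 5, 8
centers = rng.normal(size=(n_way, dim)) * 3.0
support_y = np.repeat(np.arange(n_way), k_shot)
support_emb = centers[support_y] + rng.normal(size=(n_way * k_shot, dim)) * 0.1
query_y = np.repeat(np.arange(n_way), 4)
query_emb = centers[query_y] + rng.normal(size=(n_way * 4, dim)) * 0.1

protos = prototypes(support_emb, support_y, n_way)
loss = proto_loss(protos, query_emb, query_y)
print(loss < 0.1)  # well-separated synthetic clusters give a low loss
```

In the full method, this per-episode loss would be accumulated over query sets after inner-loop adaptation and backpropagated through the meta-parameters, which is the quantity tracked in Table 7 and Figure 6.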
To determine the optimal values of the 5-shot-i-way models and maintain clarity regarding the performance metrics of Proto-MAML for QA with respect to each programming language, a statistical analysis was conducted using the Wilcoxon Signed-Rank (WSR) test. This test is an alternative to the paired t-test for related samples but does not require the assumption of normality [57].
The procedure to calculate the test statistic W in the WSR test is as follows:
  • Calculation of the Differences (D_i): Compute the differences between each pair of observations from the resulting 5-shot-i-way models as D_i = X_{i,f} − X_{i,g}, where X_{i,f} and X_{i,g} represent the observations from the two related samples in the i-th model for the observation pair {f, g}.
  • Ranking of the Absolute Differences: Order the differences by their absolute values, excluding any differences equal to zero, and assign ranks to them. If there are differences with the same absolute value, assign them an average rank.
  • Sum of Positive and Negative Ranks: Sum the ranks corresponding to the positive differences (W^+) and the ranks corresponding to the negative differences (W^−) as follows: W^+ = Σ_{D_i > 0} R_i and W^− = Σ_{D_i < 0} R_i, where R_i is the rank assigned to the absolute value of D_i.
  • Test Statistic: The test statistic of the Wilcoxon test is the smaller of W^+ and W^−; namely, W = min(W^+, W^−).
  • Determination of the p-Value: The value of W is compared to a Wilcoxon distribution table to determine the corresponding p-value. The commonly used significance level is α = 0.05 . If p < α , the null hypothesis is rejected, concluding that there is a significant difference between the two samples.
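The steps above can be implemented directly. The sketch below is an illustrative pure-Python version (not the authors' code) that drops zero differences, assigns average ranks to ties, and returns W = min(W^+, W^−); the paired scores are hypothetical.

```python
def wilcoxon_signed_rank(x, y):
    """Wilcoxon Signed-Rank statistic: differences D_i = x_i - y_i,
    zero differences dropped, tied absolute differences given the
    average rank, and W = min(W+, W-)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    j = 0
    while j < len(ranked):
        k = j
        while (k + 1 < len(ranked)
               and abs(diffs[ranked[k + 1]]) == abs(diffs[ranked[j]])):
            k += 1
        avg = (j + k) / 2 + 1  # average of positions j+1 .. k+1
        for m in range(j, k + 1):
            ranks[ranked[m]] = avg
        j = k + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus), w_plus, w_minus

# Hypothetical paired F1 scores of two 5-shot-i-way models on 6 tasks.
f = [97.5, 98.0, 96.5, 97.0, 98.5, 97.5]
g = [97.0, 98.5, 95.5, 97.0, 98.0, 96.0]
w, w_plus, w_minus = wilcoxon_signed_rank(f, g)
print(w, w_plus, w_minus)  # 2.0 13.0 2.0
```

Here the three tied |D_i| = 0.5 differences each receive the average rank 2.0, the zero difference on the fourth task is discarded, and the resulting W = 2.0 would then be compared against the Wilcoxon table at α = 0.05.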
The steps and results of the analysis for each programming language concerning the WSR test are detailed below.
The metrics Precision_combined, Recall_combined, and F1_combined were obtained for different values of i with respect to each language. These values were used to evaluate the effectiveness of each model.
For each language l ∈ L:
  • For each pair of models (f, g) among the evaluated i-way configurations, with f > g:
    • Difference in Precision: D_{f,g,l}^{Precision} = Precision_combined(g, l) − Precision_combined(f, l);
    • Difference in Recall: D_{f,g,l}^{Recall} = Recall_combined(g, l) − Recall_combined(f, l);
    • Difference in F1: D_{f,g,l}^{F1} = F1_combined(g, l) − F1_combined(f, l);
    • Classification and Sign of the Differences:
      (a) if D_{f,g,l}^{Metric} > 0, then model g is better than model f in language l;
      (b) if D_{f,g,l}^{Metric} < 0, then model g is worse than model f in language l.
The differences are ordered by their absolute value, and ranks are assigned to the differences, where smaller absolute values receive lower ranks. In the case of differences with equal absolute values, an average rank is assigned.
The statistic W is calculated by summing the ranks of the positive differences, as follows: W = Σ_{D_i > 0} R_i, where the D_i > 0 are the positive differences and the R_i are the corresponding ranks. With the obtained value of W, a Wilcoxon distribution table is used to calculate the p-value. A significance level of α = 0.05 is established; if p < α, it is concluded that there is a significant difference between the models. For each language, the WSR results are listed in Table 8.
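As an illustration of the pairwise-difference construction, the following sketch builds D_{f,g,l}^{Metric} for all pairs f > g from hypothetical combined metrics for a single language (all numeric values are invented for illustration and are not results from the paper):

```python
# Hypothetical per-configuration combined metrics for one language;
# the keys index the 5-shot-i-way configuration.
metrics = {
    5: {"Precision": 97.2, "Recall": 96.8, "F1": 97.0},
    8: {"Precision": 98.1, "Recall": 97.9, "F1": 98.0},
    10: {"Precision": 97.6, "Recall": 97.4, "F1": 97.5},
}

def pairwise_differences(metrics):
    """D_{f,g} = Metric(g) - Metric(f) for every pair f > g; a positive
    difference means model g beats model f on that metric."""
    diffs = {}
    ways = sorted(metrics)
    for f in ways:
        for g in ways:
            if f > g:
                for name in metrics[f]:
                    diffs[(f, g, name)] = round(
                        metrics[g][name] - metrics[f][name], 6)
    return diffs

d = pairwise_differences(metrics)
print(d[(10, 8, "F1")])  # 0.5: model 8 beats model 10 on F1
```

The signed differences produced this way are exactly the inputs that the ranking and W-statistic computation above operate on.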
As shown in Table 4, Table 5 and Table 6, the optimal values for F1_combined, Entropy, and PE, together with the best meta-loss L_meta reported in Table 7, were obtained under the following configurations: 5-shot-9-way for PHP, 5-shot-10-way for Java, 5-shot-5-way for C, and 5-shot-8-way for C++. The Wilcoxon Signed-Rank test confirmed statistically significant variations in these performance metrics as i varies, substantiating the selection of the optimal configurations and underscoring the need to tailor the model to the characteristics of each programming language. Building on these configurations, a distinct model was constructed for each language and evaluated exhaustively with the residual samples from X_P. Table 9 reports the resulting Precision_combined, Recall_combined, F1_combined, and EM values.

6. Discussion

To highlight the advantages of Proto-MAML compared to the state-of-the-art works discussed in Section 3, three key points were identified that emphasize the importance of ML operations in reviewing and correcting vulnerable source code, as outlined in the following paragraphs.
First point of comparison: number of samples and language coverage. Proto-MAML reviews vulnerable code efficiently with fewer samples, leveraging prototype-based learning to generalize gradually, in a manner similar to human analysts, and producing the succinct responses needed to fit analyst workflows in the SSDLC context when only limited code is available. With fewer than 18,879 samples across PHP, Java, C, and C++, covering 24 critical CWEs, Proto-MAML addresses vulnerabilities such as buffer overflows, code injection, cross-site request forgery, remote code execution, and SQL injection, among others. This makes it well suited to fast-paced environments where large-scale labeling is impractical.
In contrast, other state-of-the-art studies have focused on fewer languages and typically used larger data sets. For example, the samples used for Proto-MAML included 9867 for PHP, 4677 for Java, 5019 for C, and 4038 for C++, addressing over 24 CWEs. As shown in Table 10, studies such as [27,29,30,33,35,36] have focused primarily on C and C++, with an average of 21,858 samples for C and 174,733 samples for C++ across these studies, while addressing only 3 to 6 CWEs. Some of these studies, despite not requiring labeling, still require the use of extensive data sets, such as [32], which used 655,741 samples, and [35], which processed 937,608 samples. Similarly, studies such as [27,28,29,33,35] have evaluated C, with an average of 21,858 samples and addressing 3–7 CWEs. Java was analyzed with approximately 327,524 samples addressing 2–3 CWEs in [31,34]. PHP, despite representing 30% of insecure web deployments, remains underrepresented in the existing literature, which constitutes a significant drawback.
Second point of comparison: complexity and performance. Proto-MAML and other approaches can be analyzed in terms of their computational complexity for source code review and correction using Big-O notation [58]. This notation describes how an algorithm's worst-case running time grows with input size. For example, O(1) represents constant time, where the execution time remains the same regardless of input size; O(log n) indicates logarithmic growth; O(n) denotes linear growth; and O(n log n) reflects log-linear growth, combining linear and logarithmic rates. More complex cases include O(2^n), exponential growth, where the execution time at least doubles with each unit increase in input size, and O(n!), factorial growth, the most inefficient case, which is impractical for large input sizes. Table 11 provides a detailed breakdown of the computational complexity of Proto-MAML and related models.
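The growth rates described above can be made concrete by tabulating the number of basic operations each class implies for a few input sizes (an illustrative calculation only, independent of any particular model):

```python
import math

# Operation counts implied by each complexity class for growing n.
growth = {
    "O(1)": lambda n: 1,
    "O(log n)": lambda n: math.log2(n),
    "O(n)": lambda n: n,
    "O(n log n)": lambda n: n * math.log2(n),
    "O(n^2)": lambda n: n ** 2,
    "O(2^n)": lambda n: 2 ** n,
}

for name, cost in growth.items():
    row = ", ".join(f"{cost(n):,.0f}" for n in (8, 16, 32))
    print(f"{name:>10}: {row}")
```

Doubling n from 16 to 32 roughly doubles an O(n log n) cost but squares the remaining factor of an O(2^n) cost, which is why the log-linear complexity of Proto-MAML scales to large code bases while exponential approaches do not.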
The results presented in Table 11 highlight the distinct advantages of Proto-MAML when compared with other state-of-the-art approaches. This discussion underscores the efficiency, scalability, and adaptability of Proto-MAML in the context of automated vulnerability detection and correction while also providing a nuanced critique of the limitations observed for alternative models. The emphasis on computational complexity, required data samples, and real-world applicability further illustrates why Proto-MAML integrated with BERT emerges as a superior solution for the QA task.
To begin, SySeVR [27] demonstrates a significant computational burden due to its reliance on ASTs and PDGs. The AST introduces O(n^2) complexity, requiring extensive iterations to analyze syntactic relationships, while the PDG elevates this to O(n^3) through its modeling of semantic dependencies. Although these graph-based structures provide semantic context, their heavy reliance on labeled data hinders the scalability of the model. Furthermore, the incorporation of LSTM and BGRU models exacerbates these inefficiencies, often leading to gradient loss when processing large inputs. In contrast, Proto-MAML sidesteps these limitations through its BERT-based transformer architecture; the self-attention mechanism enables it to capture complex relationships with a reduced complexity of O(n log n). The proposed approach achieved an F1-score of 97.5 % for C and C++, significantly outperforming the 85.8 % achieved by SySeVR, thus offering a scalable and computationally efficient alternative.
Building upon this, BBVD [29] incorporates RoBERTa and MobileBERT to reduce the complexity to O(n^2) through leveraging smaller parameter sets and vocabularies. However, its performance is diminished in scenarios involving mixed human and technical language, and its reliance on multiple attention layers and dense operations introduces further computational demands. Proto-MAML surpasses these limitations by employing dual meta-learning cycles that dynamically adjust parameters without the need for exhaustive fine-tuning. With an F1-score of 97.5 %, Proto-MAML offers a more streamlined computational framework and outperforms the 95.42 % achieved by BBVD.
Transitioning to VulDefend [30], this model employs probabilistic embedding transformations (PET) combined with RoBERTa and FSL. Although it has been demonstrated to be efficient in low-data scenarios, its probabilistic nature fails to capture intricate relationships within vulnerable code. The resulting complexity of O(n^2) further limits its capacity to adapt to unseen patterns. Proto-MAML addresses these challenges through structured meta-learning cycles that dynamically characterize dependencies at both the token and output levels. This adaptability enabled Proto-MAML to achieve an accuracy of 97.5 %, significantly exceeding the 89.9 % achieved by VulDefend.
The limitations of graph-based models become even more evident when considering VulDeeLocator [28], which combines an AST, SSA, an RNN, and LSTM, resulting in an overall complexity of O(n^4). The AST introduces O(n^2) complexity for syntactic analysis, while SSA and the RNN add O(n^3) and O(n^2), respectively. These inefficiencies are further compounded by a dependency on high-quality labeled data. Proto-MAML, leveraging prototype-based learning and global meta-learning optimization, circumvents these challenges entirely. Achieving 97.1 % accuracy, Proto-MAML closely matched the performance of this model while maintaining a significantly lower complexity of O(n log n).
In a similar vein, SeqTrans, as discussed in [31], relies on NMT and BS to map secure to insecure code. While BS evaluates multiple token candidates with O(n) operations per token, the sequential evaluation of entire data sets increases the overall complexity to O(n^2). Additionally, this approach struggles with inaccuracies when processing large data sets due to its reliance on token-level matching. Proto-MAML overcomes these inefficiencies with its prototype-based meta-learning approach, achieving a 99.5 % match rate for Java code, thus vastly outperforming the 25.3 % match rate of SeqTrans.
Turning to VRepair [32], its use of an SNN and SSA results in iterative operations with an overall complexity of O(n^4). The SNN processes sequential data with O(n^2) complexity, while SSA adds substantial overhead through analyzing the control flow. Proto-MAML avoids these resource-intensive operations by dynamically adapting features across support and query sets and achieved an exact match rate of 98.2 % for C code, representing a dramatic improvement over the 27.59 % match rate of VRepair.
Similarly, DB-CBIL [33] combines DistilBERT with a CNN and BiLSTM. Although DistilBERT is lightweight, the integration of CNN and BiLSTM architectures elevates its complexity to O(n^4). These processes lack the adaptability required to generalize effectively across diverse vulnerability patterns. Proto-MAML, leveraging its structured meta-learning framework, achieved a comparable match rate of 97.9 % for C++ code while maintaining a more efficient complexity of O(n log n).
Furthermore, VuRLE, as outlined in [34], incorporates AST and DBSCAN, resulting in an overall complexity of O(n^5). The hierarchical structure of AST and the clustering operations of DBSCAN are computationally prohibitive, making this approach unsuitable for large-scale applications. In contrast, Proto-MAML relies on prototype-based learning, thus eliminating these inefficiencies. Proto-MAML achieved a match rate of 99.5 % for Java code, significantly surpassing the 65.59 % match rate of VuRLE.
The GPT Survey [36] presents a distinct approach, leveraging large-scale pre-trained models for vulnerability detection. However, its reliance on fine-tuning for specific queries introduces significant variability in complexity, ranging from O(n^2) for simpler queries to O(n^3) for more intricate ones. This dependence on extensive calibration processes increases the computational overhead, particularly in scenarios requiring complex query adjustments or knowledge generation. Additionally, the closed-source nature of this approach limits its adaptability to specific tasks, such as identifying vulnerabilities in diverse programming languages. In comparison, Proto-MAML employs structured meta-learning cycles to avoid iterative fine-tuning, enabling efficient adaptation across various support-query tasks. This resulted in a consistent performance advantage, with Proto-MAML achieving an average accuracy of 97.9 %, significantly exceeding the 88 % accuracy of the GPT Survey. Furthermore, the complexity of O(n log n) provided by Proto-MAML offers a scalable and efficient framework for real-world applications, solidifying its superiority in balancing performance and computational efficiency.
Notably, two exceptions where other models surpassed Proto-MAML stand out: DB-CBIL [33] achieved a reconstruction rate of 99.51 %, compared to the exact match rate of 97.9 % achieved by Proto-MAML, but with a higher complexity of O(n^4). Similarly, VulDeeLocator [28] outperformed Proto-MAML in terms of accuracy, reporting 98.8 % compared to the 97.1 % accuracy of the proposed approach, although it also had a higher complexity of O(n^4).
For Java, Proto-MAML surpassed the results reported for VuRLE [34], which had a transformation rate of 65.59 %, and SeqTrans [31], which achieved a masked correction rate of 25.3 %, compared to the exact match rate of 99.5 % achieved by Proto-MAML.
In the same vein, qualitative adaptability and false-positive handling are critical dimensions when evaluating ML approaches for their application-specific utility, as highlighted in the qualitative analysis of DL methods [59]. These aspects emphasize the ability of these approaches to generalize across diverse contexts and effectively mitigate erroneous outputs, providing a nuanced understanding of their operational strengths and limitations. As detailed in Table 12, methodologies such as Proto-MAML demonstrate a balance between high adaptability and precise error management, highlighting their suitability for low-resource and dynamic scenarios. This comprehensive perspective transcends traditional performance metrics and enhances the evaluation process through the integration of both quantitative and qualitative insights.
Third point of comparison: Integration with SSDLC and DevSecOps. The detailed analyses in [60,61] highlighted several critical aspects for the effective integration of source code review in multilingual environments within the SSDLC and DevSecOps frameworks. These aspects underscore the necessity of timely vulnerability detection and resolution in order to maintain the quality and security of software. Proto-MAML, through its QA approach powered by a context-rich LLM-based BERT transformer, emerges as a versatile and robust solution for addressing these challenges, providing the following key contributions.
  • Automation of Security Scanning: Proto-MAML can be seamlessly integrated into CI/CD pipelines to analyze code fragments rapidly using FSL, generating context-aware questions (Q_V) and answers (A_V) based on semantic context (C_V). This capability enables early vulnerability detection and the provision of concrete solutions through secure code reconstructions. Furthermore, Proto-MAML is applicable to multiple programming languages and addresses a range of critical vulnerabilities. Unlike state-of-the-art approaches, which are limited to a few languages (e.g., C and C++) and cover only 3–7 CWEs, Proto-MAML processed 9867 PHP samples, 4677 Java samples, 5019 C samples, and 4038 C++ samples, as well as addressing over 24 CWEs, thereby significantly broadening its practical utility.
  • Interdisciplinary Collaboration: Through the utilization of a natural language-based approach, Proto-MAML facilitates effective interactions among development, operations, and security teams. The outputs in the form of questions and answers are easily interpretable, eliminating the need for additional analysis and fostering seamless communication across disciplines.
  • Continuous Security Integration: Incorporating security evaluations as a core component of CI/CD pipelines ensures consistent code assessment, enhancing integrity and security. Proto-MAML excels in this area due to its low computational complexity (O(n log n)), enabling dynamic security assessments without significantly impacting deployment times. It also provides advanced metrics such as Precision, Recall, F1-score, and EM, which support its continuous performance improvement.
  • Real-Time Monitoring and Auditing: The prototype-based architecture of Proto-MAML allows for rapid adaptation to new code samples by leveraging the general characteristics of queries (Q_V). This adaptability makes it ideal for dynamic systems with limited data, as it extracts rich semantic context (C_V) and efficiently adjusts to emerging vulnerabilities.
  • Predictive Capability: Proto-MAML accurately identifies the specific positions of vulnerabilities within code (A_V[START], A_V[END]), simplifying traceability and auditing processes. This actionable and precise information significantly enhances the ability of technical teams to efficiently address vulnerabilities.
  • Training and Awareness in Security: Through analyzing the semantic context of source code using BERT, Proto-MAML identifies insecure dependencies and generates clear predicted responses (A_V). This functionality is invaluable for detecting issues in third-party libraries or frameworks, not only preventing vulnerabilities but also educating development teams on secure coding practices, allowing for the seamless integration of this knowledge into daily workflows.
  • Dependency and Third-Party Component Management: Proto-MAML evaluates external dependencies for vulnerabilities and provides practical solutions while explaining the associated context. Its ability to handle underrepresented languages such as PHP—which constitutes 30 % of insecure web deployments—along with its focus on over 24 CWEs makes it highly scalable and adaptable to diverse environments. Additionally, its contextual approach allows for its application across different programming languages, as it relies on semantic context rather than specific code structures.
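For context on the predictive capability noted above, extractive QA heads of the BERT family typically score every valid (start, end) pair by summing the corresponding logits. The sketch below shows this standard span-selection step on toy logits; it is a generic illustration of span extraction, not the Proto-MAML inference code, and the logit values are invented.

```python
import numpy as np

def best_answer_span(start_logits, end_logits, max_len=30):
    """Pick (A_V[START], A_V[END]) as the valid span (start <= end,
    bounded length) maximizing start_logit + end_logit, as in
    extractive BERT-style QA heads."""
    best, best_score = (0, 0), -np.inf
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy logits over 12 code tokens: the model is most confident that
# the vulnerable span starts at token 4 and ends at token 7.
start_logits = np.array([0.1, 0.2, 0.0, 0.3, 5.0, 0.2,
                         0.1, 0.4, 0.0, 0.1, 0.2, 0.0])
end_logits   = np.array([0.0, 0.1, 0.2, 0.1, 0.3, 0.2,
                         0.6, 4.5, 0.1, 0.0, 0.2, 0.1])
print(best_answer_span(start_logits, end_logits))  # (4, 7)
```

The start/end constraint and the length bound rule out degenerate spans, so the predicted indices can be mapped directly back to the offending code tokens for traceability and auditing.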
Table 13 provides a comprehensive comparison of Proto-MAML against state-of-the-art models, in terms of their limitations and capabilities. This comparison highlights the strengths of Proto-MAML, including its low computational complexity, adaptability to new vulnerabilities, and ability to handle diverse programming languages and CWEs. Unlike other approaches, Proto-MAML enables superior interdisciplinary collaboration, can be efficiently integrated into CI/CD pipelines, and demonstrates actionable predictive capabilities, setting a benchmark for real-world applications in SSDLC and DevSecOps environments.

7. Conclusions

This study presented Proto-MAML, an approach that integrates Prototypical Networks with Model-Agnostic Meta-Learning within a QA framework based on BERT. Proto-MAML addresses challenges related to detecting and remediating vulnerabilities in source code across multiple programming languages, specifically PHP, Java, C, and C++. Leveraging FSL and prototype-based meta-learning, the model adapts efficiently to limited data, making it applicable to the dynamic and resource-constrained environments common in SSDLC and DevSecOps practices.

7.1. Theoretical Contributions

Proto-MAML advances the theoretical understanding of how meta-learning can be applied to detecting, reviewing, and correcting vulnerabilities in source code. These contributions are significant due to their focus on addressing core challenges specific to software security. The main theoretical contributions include:
  • Multilingual Vulnerability Detection: Proto-MAML integrates Prototypical Networks and MAML within a QA-based architecture, enabling a single model to detect vulnerabilities across programming languages like PHP, Java, C, and C++. This approach adapts meta-learning principles to handle the structural and syntactic diversity of source code, which is a common challenge in multilingual environments. The theoretical impact lies in extending meta-learning to heterogeneous data domains.
  • Precise Vulnerability Localization: Leveraging the QA framework, the model identifies the exact start and end positions of vulnerabilities within the source code. This level of detail connects theoretical advancements in meta-learning with practical needs for targeted code review and correction. It provides a bridge between machine learning outputs and actionable developer workflows.
  • Generalization in Data-Scarce Scenarios: Proto-MAML uses FSL to train effectively with minimal annotated data, addressing a critical limitation in vulnerability datasets. The contribution lies in demonstrating how meta-learning can generalize to real-world problems where comprehensive training datasets are unavailable, a common constraint in software security.
  • Integration of Detection and Remediation: The model not only identifies vulnerabilities but also provides contextual suggestions for remediation. This feature aligns theoretical advancements with practical applications by connecting model outputs to actionable developer tasks, making the process of addressing vulnerabilities more efficient.
  • Efficiency for CI/CD Integration: The computational complexity of O(n log n) ensures that Proto-MAML is scalable and compatible with time-sensitive CI/CD pipelines. This shows how meta-learning models can be designed to meet the operational constraints of modern software development environments.
These theoretical contributions address specific challenges in vulnerability detection and correction, demonstrating how meta-learning can be applied to software security. By linking detection with actionable insights and ensuring scalability, Proto-MAML provides a framework that can guide future advancements in secure software development.

7.2. Practical Contributions

The experimental evaluation demonstrates that Proto-MAML achieves high performance across PHP, Java, C, and C++:
  • Average Precision, Recall, F1-score, and EM values reached 98.49 %, 98.54 %, 98.78 %, and 98.78 %, respectively.
  • PHP achieved the highest metrics, with 99.93 % F1 and 99.82 % EM scores.
  • The computational complexity of O(n log n) ensures its practical application in real-time CI/CD workflows.
The model enhances collaboration across development, security, and operations teams by providing precise, actionable outputs. This integration supports the goals of SSDLC and DevSecOps practices.

7.3. Future Research Directions

Further research could expand and refine Proto-MAML in the following areas:
  • Expanding Language Support: Including additional languages such as Python, JavaScript, and Go through transfer learning techniques.
  • Improving Explainability: Enhancing interpretability by linking vulnerabilities to specific code regions and providing clear rationales for remediation.
  • Real-World Deployment: Evaluating the model in operational CI/CD environments to refine its practical applications.
  • Continuous Learning: Enabling incremental adaptation to emerging vulnerabilities and evolving coding practices.

Author Contributions

Conceptualization, A.H.-S., G.S.-P., L.K.T.-M., H.P.-M., J.P.-P., J.O.-M., P.C.-F. and L.J.G.V.; methodology, A.H.-S., G.S.-P., L.K.T.-M., H.P.-M., J.P.-P., J.O.-M., P.C.-F. and L.J.G.V.; validation, A.H.-S., G.S.-P., L.K.T.-M., H.P.-M., J.P.-P., J.O.-M., P.C.-F. and L.J.G.V.; formal analysis, A.H.-S., G.S.-P., L.K.T.-M., H.P.-M., J.P.-P., J.O.-M., P.C.-F. and L.J.G.V.; investigation, A.H.-S., G.S.-P., L.K.T.-M., H.P.-M., J.P.-P., J.O.-M., P.C.-F. and L.J.G.V.; data curation, A.H.-S., G.S.-P., L.K.T.-M., H.P.-M., J.P.-P., J.O.-M., P.C.-F. and L.J.G.V.; writing—original draft preparation, A.H.-S., G.S.-P., L.K.T.-M., H.P.-M., J.P.-P., J.O.-M., P.C.-F. and L.J.G.V.; writing—review and editing, A.H.-S., G.S.-P., L.K.T.-M., H.P.-M., J.P.-P., J.O.-M., P.C.-F. and L.J.G.V.; visualization, A.H.-S., G.S.-P., L.K.T.-M., H.P.-M., J.P.-P., J.O.-M., P.C.-F. and L.J.G.V. All authors have read and agreed to the published version of the manuscript.

Funding

Thanks to the National Council for Humanities, Science, and Technology (CONAHCYT) for the support provided in this project. This work was also supported by the European Commission under the Horizon Europe research and innovation program as part of the project LAZARUS (Grant Agreement no. 101070303). Views and opinions expressed are, however, those of the authors only and do not necessarily reflect those of the European Union or European Commission—EU. Neither the European Union nor the European Commission can be held responsible for them.

Data Availability Statement

The data set is publicly available in the Test Cases of the Software Assurance Reference Data Set (https://samate.nist.gov/SARD/test-cases) from NIST. For access to the Proto-MAML source code or to evaluate it with your own set of questions, please contact the corresponding author, as the code is proprietary to the Sección de Estudios de Posgrado e Investigación de la Escuela Superior de Ingeniería Mecánica y Eléctrica, Unidad Culhuacán, del Instituto Politécnico Nacional.

Conflicts of Interest

Author Pablo Corona-Fraga was employed by the company Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Möller, D.P. Cybersecurity in digital transformation. In Guide to Cybersecurity in Digital Transformation: Trends, Methods, Technologies, Applications and Best Practices; Springer: Berlin/Heidelberg, Germany, 2023; pp. 1–70. [Google Scholar]
  2. Beaulieu, N.; Dascalu, S.M.; Hand, E. API-first design: A survey of the state of academia and industry. In Proceedings of the ITNG 2022 19th International Conference on Information Technology-New Generations, Las Vegas, NV, USA, 10–13 April 2022; pp. 73–79. [Google Scholar]
  3. Zhang, F.; Kodituwakku, H.A.D.E.; Hines, J.W.; Coble, J. Multilayer Data-Driven Cyber-Attack Detection System for Industrial Control Systems Based on Network, System, and Process Data. IEEE Trans. Ind. Inform. 2019, 15, 4362–4369. [Google Scholar] [CrossRef]
  4. Massaoudi, M.; Refaat, S.S.; Abu-Rub, H. Intrusion Detection Method Based on SMOTE Transformation for Smart Grid Cybersecurity. In Proceedings of the 2022 3rd International Conference on Smart Grid and Renewable Energy (SGRE), Doha, Qatar, 20–22 March 2022; pp. 1–6. [Google Scholar] [CrossRef]
  5. Chen, J.; Mohamed, M.A.; Dampage, U.; Rezaei, M.; Salmen, S.H.; Obaid, S.A.; Annuk, A. A multi-layer security scheme for mitigating smart grid vulnerability against faults and cyber-attacks. Appl. Sci. 2021, 11, 9972. [Google Scholar] [CrossRef]
  6. Souppaya, M.; Scarfone, K.; Dodson, D. Secure Software Development Framework (SSDF) Version 1.1: Recommendations for Mitigating the Risk of Software Vulnerabilities; Technical Report SP 800-218; National Institute of Standards and Technology (NIST): Gaithersburg, MD, USA, 2022.
  7. Common Vulnerabilities and Exposures (CVE). 1999. Available online: https://cve.mitre.org/ (accessed on 13 September 2024).
  8. National Institute of Standards and Technology. National Institute of Standards and Technology (NIST) Official Website. 2024. Available online: https://www.nist.gov (accessed on 8 October 2024).
  9. Homès, B. Fundamentals of Software Testing; John Wiley & Sons: Hoboken, NJ, USA, 2024.
  10. Lombardi, F.; Fanton, A. From DevOps to DevSecOps is not enough. CyberDevOps: An extreme shifting-left architecture to bring cybersecurity within software security lifecycle pipeline. Softw. Qual. J. 2023, 31, 619–654.
  11. Li, W.; Li, L.; Cai, H. On the vulnerability proneness of multilingual code. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore, 18 November 2022; pp. 847–859.
  12. Akbar, M.A.; Smolander, K.; Mahmood, S.; Alsanad, A. Toward successful DevSecOps in software development organizations: A decision-making framework. Inf. Softw. Technol. 2022, 147, 106894.
  13. Chakraborty, S.; Krishna, R.; Ding, Y.; Ray, B. Deep learning based vulnerability detection: Are we there yet? IEEE Trans. Softw. Eng. 2021, 48, 3280–3296.
  14. Mateo Tudela, F.; Bermejo Higuera, J.R.; Bermejo Higuera, J.; Sicilia Montalvo, J.A.; Argyros, M.I. On combining static, dynamic and interactive analysis security testing tools to improve OWASP Top Ten security vulnerability detection in web applications. Appl. Sci. 2020, 10, 9119.
  15. Bedoya, M.; Palacios, S.; Díaz-López, D.; Laverde, E.; Nespoli, P. Enhancing DevSecOps practice with Large Language Models and Security Chaos Engineering. Int. J. Inf. Secur. 2024, 23, 3765–3788.
  16. Rajapaksha, S.; Senanayake, J.; Kalutarage, H.; Al-Kadri, M.O. AI-powered vulnerability detection for secure source code development. In Proceedings of the International Conference on Information Technology and Communications Security, Bucharest, Romania, 23–24 November 2022; pp. 275–288.
  17. Ling, X.; Wu, L.; Zhang, J.; Qu, Z.; Deng, W.; Chen, X.; Qian, Y.; Wu, C.; Ji, S.; Luo, T.; et al. Adversarial attacks against Windows PE malware detection: A survey of the state-of-the-art. Comput. Secur. 2023, 128, 103134.
  18. Du, X.; Wen, M.; Zhu, J.; Xie, Z.; Ji, B.; Liu, H.; Shi, X.; Jin, H. Generalization-Enhanced Code Vulnerability Detection via Multi-Task Instruction Fine-Tuning. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 10507–10521.
  19. Stack Overflow. 2023. Available online: https://stackoverflow.com (accessed on 14 October 2024).
  20. Díaz Ferreyra, N.E.; Vidoni, M.; Heisel, M.; Scandariato, R. Cybersecurity discussions in Stack Overflow: A developer-centred analysis of engagement and self-disclosure behaviour. Soc. Netw. Anal. Min. 2023, 14, 16.
  21. Le, T.H.; Chen, H.; Babar, M.A. A survey on data-driven software vulnerability assessment and prioritization. ACM Comput. Surv. 2022, 55, 100.
  22. Alzubi, J.A.; Jain, R.; Singh, A.; Parwekar, P.; Gupta, M. COBERT: COVID-19 question answering system using BERT. Arab. J. Sci. Eng. 2023, 48, 11003–11013.
  23. Wang, Y.; Anderson, D.V. Hybrid attention-based prototypical networks for few-shot sound classification. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 651–655.
  24. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML'17), Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1126–1135.
  25. Wang, H.; Wang, Y.; Sun, R.; Li, B. Global convergence of MAML and theory-inspired neural architecture search for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9797–9808.
  26. Cao, C.; Zhang, Y. Learning to compare relation: Semantic alignment for few-shot learning. IEEE Trans. Image Process. 2022, 31, 1462–1474.
  27. Li, Z.; Zou, D.; Xu, S.; Jin, H.; Zhu, Y.; Chen, Z. SySeVR: A framework for using deep learning to detect software vulnerabilities. IEEE Trans. Dependable Secur. Comput. 2021, 19, 2244–2258.
  28. Li, Z.; Zou, D.; Xu, S.; Chen, Z.; Zhu, Y.; Jin, H. VulDeeLocator: A deep learning-based fine-grained vulnerability detector. IEEE Trans. Dependable Secur. Comput. 2021, 19, 2821–2837.
  29. Huang, W.; Lin, S.; Chen, L. BBVD: A BERT-based method for vulnerability detection. Int. J. Adv. Comput. Sci. Appl. 2022, 13.
  30. Omar, M. VulDefend: A Novel Technique based on Pattern-exploiting Training for Detecting Software Vulnerabilities Using Language Models. In Proceedings of the 2023 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), Amman, Jordan, 22–24 May 2023; pp. 287–293.
  31. Chi, J.; Qu, Y.; Liu, T.; Zheng, Q.; Yin, H. SeqTrans: Automatic vulnerability fix via sequence to sequence learning. IEEE Trans. Softw. Eng. 2022, 49, 564–585.
  32. Chen, Z.; Kommrusch, S.; Monperrus, M. Neural transfer learning for repairing security vulnerabilities in C code. IEEE Trans. Softw. Eng. 2022, 49, 147–165.
  33. Bahaa, A.; Kamal, A.E.R.; Fahmy, H.; Ghoneim, A.S. DB-CBIL: A DistilBERT-Based Transformer Hybrid Model using CNN and BiLSTM for Software Vulnerability Detection. IEEE Access 2024, 12, 64446–64460.
  34. Ma, S.; Thung, F.; Lo, D.; Sun, C.; Deng, R.H. VuRLE: Automatic Vulnerability Detection and Repair by Learning from Examples. In Proceedings of the Computer Security—ESORICS 2017, Oslo, Norway, 14 September 2017; Foley, S.N., Gollmann, D., Snekkenes, E., Eds.; Springer Nature: Cham, Switzerland, 2017; pp. 229–246.
  35. Zhang, X.; Zhang, F.; Zhao, B.; Zhou, B.; Xiao, B. VulD-Transformer: Source Code Vulnerability Detection via Transformer. In Proceedings of the 14th Asia-Pacific Symposium on Internetware, Hangzhou, China, 4–6 August 2023; pp. 185–193.
  36. Espinha Gasiba, T.; Iosif, A.C.; Kessba, I.; Amburi, S.; Lechner, U.; Pinto-Albuquerque, M. May the Source Be with You: On ChatGPT, Cybersecurity, and Secure Coding. Information 2024, 15, 572.
  37. Bhandari, G.; Naseer, A.; Moonen, L. CVEfixes: Automated collection of vulnerabilities and their fixes from open-source software. In Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering, Athens, Greece, 19–20 August 2021; pp. 30–39.
  38. Common Weakness Enumeration (CWE). 2006. Available online: https://cwe.mitre.org/ (accessed on 13 September 2024).
  39. Common Vulnerability Scoring System (CVSS). 2005. Available online: https://www.first.org/cvss/ (accessed on 13 September 2024).
  40. National Institute of Standards and Technology (NIST). National Vulnerability Database (NVD). Available online: https://nvd.nist.gov/ (accessed on 13 September 2024).
  41. Software Assurance Metrics And Tool Evaluation (SAMATE). 2024. Available online: https://samate.nist.gov/ (accessed on 17 September 2024).
  42. NIST Software Assurance Reference Dataset. 2024. Available online: https://samate.nist.gov/SARD/ (accessed on 17 September 2024).
  43. OWASP Foundation. OWASP Top Ten 2021: The Ten Most Critical Web Application Security Risks. 2021. Available online: https://owasp.org/Top10/ (accessed on 22 October 2024).
  44. Ren, Z.; Shen, Q.; Diao, X.; Xu, H. A sentiment-aware deep learning approach for personality detection from text. Inf. Process. Manag. 2021, 58, 102532.
  45. SonarQube: Continuous Code Quality. Available online: https://www.sonarqube.org/ (accessed on 13 September 2024).
  46. Song, Y.; Wang, T.; Cai, P.; Mondal, S.K.; Sahoo, J.P. A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities. ACM Comput. Surv. 2023, 55, 1–40.
  47. Ma, Y.; Zhao, S.; Wang, W.; Li, Y.; King, I. Multimodality in meta-learning: A comprehensive survey. Knowl.-Based Syst. 2022, 250, 108976.
  48. Huisman, M.; Van Rijn, J.N.; Plaat, A. A survey of deep meta-learning. Artif. Intell. Rev. 2021, 54, 4483–4541.
  49. Jamal, M.A.; Qi, G.J. Task Agnostic Meta-Learning for Few-Shot Learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 11711–11719.
  50. Behera, S.K.; Dash, R. Fine-Tuning of a BERT-Based Uncased Model for Unbalanced Text Classification. In Advances in Intelligent Computing and Communication; Springer: Cham, Switzerland, 2022; pp. 425–433.
  51. Griva, A.I.; Boursianis, A.D.; Iliadis, L.A.; Sarigiannidis, P.; Karagiannidis, G.; Goudos, S.K. Model-Agnostic Meta-Learning Techniques: A State-of-The-Art Short Review. In Proceedings of the 2023 12th International Conference on Modern Circuits and Systems Technologies (MOCAST), Athens, Greece, 28–30 June 2023; pp. 1–4.
  52. Fallah, A.; Mokhtari, A.; Ozdaglar, A. Generalization of model-agnostic meta-learning algorithms: Recurring and unseen tasks. Adv. Neural Inf. Process. Syst. 2021, 34, 5469–5480.
  53. Chitty-Venkata, K.T.; Emani, M.; Vishwanath, V.; Somani, A.K. Neural architecture search for transformers: A survey. IEEE Access 2022, 10, 108374–108412.
  54. Ji, Z.; Chai, X.; Yu, Y.; Pang, Y.; Zhang, Z. Improved prototypical networks for few-shot learning. Pattern Recognit. Lett. 2020, 140, 81–87.
  55. Li, X.; Sun, Z.; Xue, J.H.; Ma, Z. A concise review of recent few-shot meta-learning methods. Neurocomputing 2021, 456, 463–468.
  56. Banerjee, T.; Thurlapati, N.R.; Pavithra, V.; Mahalakshmi, S.; Eledath, D.; Ramasubramanian, V. Few-shot learning for frame-wise phoneme recognition: Adaptation of matching networks. In Proceedings of the 2021 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 23–27 August 2021; pp. 516–520.
  57. van Doorn, J.; Ly, A.; Marsman, M.; Wagenmakers, E.J. Bayesian rank-based hypothesis testing for the rank sum test, the signed rank test, and Spearman's ρ. J. Appl. Stat. 2020, 47, 2984–3006.
  60. Zhu, X.; Mao, X. Integrating Security with DevSecOps: Techniques and Challenges. IEEE Access 2020, 8, 101261–101273. Available online: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10613759 (accessed on 7 December 2024).
  61. Jones, S.L.; Weber, T.J. Holding on to Compliance While Adopting DevSecOps: An SLR. Electronics 2022, 11, 3707.
Figure 1. Blocks of the proposed methodology.
Figure 2. Number of CWEs associated with each programming language. The total exceeds the number of samples per language, as SARD frequently links multiple weaknesses to a single vulnerability (with up to six identified in one case).
Figure 3. Example of the data composition for QA, with the respective keys Q_V, C_V, A_V, A_V[START], and A_V[END], in JSON format.
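To make the Figure 3 layout concrete, the following sketch builds one such QA record. The question, code context, and answer values are invented for illustration; only the key names (Q_V, C_V, A_V, A_V[START], A_V[END]) follow the figure, and the offsets here are end-exclusive character indices, whereas the paper's indices are defined at the tokenizer level.

```python
import json

# Keys mirror Figure 3: a question about the vulnerability (Q_V), the code
# context (C_V), the answer span text (A_V), and its offsets within C_V
# (A_V[START], A_V[END]). All field values below are illustrative only.
context = 'query = "SELECT * FROM users WHERE id = " + user_input'
answer = '" + user_input'
record = {
    "Q_V": "Where does unsanitized input reach the SQL query?",
    "C_V": context,
    "A_V": answer,
    "A_V[START]": context.index(answer),
    "A_V[END]": context.index(answer) + len(answer),
}
print(json.dumps(record, indent=2))
```

By construction, slicing C_V with the stored offsets recovers A_V, which is exactly the invariant the span-prediction head is trained against.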
Figure 4. Schematic representation of the data flow through BERT in a QA environment. The input consists of tokens from the support set (S) and the query set (Q), which are transformed into input embeddings (E_{Q_V^z, C_V^z}) through token embeddings. These embeddings are processed through k transformer blocks, progressively refining the token representations. The output embeddings (H) generated by the final transformer block are used to predict the start (A_V[START]) and end (A_V[END]) positions of the answer span within the context.
Figure 5. Input flow through the transformer layers of BERT within the Proto-MAML framework, utilizing transformer blocks to refine the relationships among the prototypes Q_V^j, C_V^j, and A_V^j, as well as for the computation of the predictions.
Figure 6. Evolution of the meta-loss L_meta^(2) for different programming languages and values of i, visualized every 10 epochs.
Table 1. Top CWEs, Total Vulnerabilities, and CVSS Score Distribution by Language.
Language | Common CWEs | Total Vulnerabilities | CVSS Score Percentage (High, Medium, Low)
PHP | CWE-20, CWE-22, CWE-78, CWE-79, CWE-89, CWE-306, CWE-352, CWE-434, CWE-502, CWE-601 | 13,000+ | High: 30%, Medium: 50%, Low: 20%
Java | CWE-20, CWE-79, CWE-89, CWE-209, CWE-287, CWE-400, CWE-476, CWE-502, CWE-611, CWE-732 | 3000+ | High: 35%, Medium: 45%, Low: 20%
C | CWE-20, CWE-119, CWE-125, CWE-200, CWE-362, CWE-399, CWE-416, CWE-476, CWE-772, CWE-787 | 10,000+ | High: 40%, Medium: 40%, Low: 20%
C++ | CWE-20, CWE-119, CWE-125, CWE-362, CWE-399, CWE-400, CWE-416, CWE-476, CWE-772, CWE-787 | 7000+ | High: 42%, Medium: 38%, Low: 20%
CVE and CWE data are based on the NVD and MITRE databases.
Table 2. Descriptions and Mathematical Definitions of Metrics used for Evaluation of Proto-MAML.
Metric | Description | Mathematical Definition
Precision[START] | Precision for the start index: the proportion of true positives (TP_{A_V[START]}) to the total of true positives and false positives (FP_{A_V[START]}) for the predicted start index. | Precision[START] = TP_{A_V[START]} / (TP_{A_V[START]} + FP_{A_V[START]})
Precision[END] | Precision for the end index, calculated analogously to the start index. | Precision[END] = TP_{A_V[END]} / (TP_{A_V[END]} + FP_{A_V[END]})
Precision_combined | Combined precision: the average precision across both indices. | Precision_combined = (Precision[START] + Precision[END]) / 2
Recall[START] | Recall for the start index: the ratio of true positives to the sum of true positives and false negatives (FN_{A_V[START]}) for the predicted start index. | Recall[START] = TP_{A_V[START]} / (TP_{A_V[START]} + FN_{A_V[START]})
Recall[END] | Recall for the end index, calculated analogously to the start index. | Recall[END] = TP_{A_V[END]} / (TP_{A_V[END]} + FN_{A_V[END]})
Recall_combined | Combined recall: the average recall across both indices. | Recall_combined = (Recall[START] + Recall[END]) / 2
F1[START] | F1-score for the start index: the harmonic mean of precision and recall. | F1[START] = 2 · Precision[START] · Recall[START] / (Precision[START] + Recall[START])
F1[END] | F1-score for the end index, calculated analogously to the start index. | F1[END] = 2 · Precision[END] · Recall[END] / (Precision[END] + Recall[END])
F1_combined | Combined F1-score: the average F1-score across both indices. | F1_combined = (F1[START] + F1[END]) / 2
Entropy (H(p)) | Measures uncertainty in the softmax predictions; lower entropy indicates more confident predictions. | H(p) = −Σ_{k=1}^{n} p_k log(p_k)
Prediction Error (PE) | Quantifies the discrepancy between the predicted indices (denoted Â_V) and the actual indices (A_V), summing the absolute differences for the start and end indices. | PE = |Â_V[START] − A_V[START]| + |Â_V[END] − A_V[END]|
EM | Indicates whether both the start and end indices of the predicted answer match the ground truth exactly: 1 signifies a perfect match, 0 otherwise. | EM = 1 if Â_V[START] = A_V[START] and Â_V[END] = A_V[END]; 0 otherwise
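As a concrete illustration, the EM, PE, entropy, and F1 definitions in Table 2 reduce to a few lines of Python. Variable names are ours, not taken from the paper's implementation.

```python
import math

def exact_match(pred_start, pred_end, gold_start, gold_end):
    # EM = 1 only when both predicted indices equal the ground truth.
    return int(pred_start == gold_start and pred_end == gold_end)

def prediction_error(pred_start, pred_end, gold_start, gold_end):
    # PE sums the absolute index offsets for the start and end positions.
    return abs(pred_start - gold_start) + abs(pred_end - gold_end)

def entropy(probs):
    # H(p) = -sum p_k log p_k over the softmax distribution; lower = more confident.
    return -sum(p * math.log(p) for p in probs if p > 0)

def f1(precision, recall):
    # Harmonic mean of precision and recall, as in F1[START] and F1[END].
    return 2 * precision * recall / (precision + recall)

# Illustrative values: a prediction off by one token at the end index.
print(exact_match(10, 15, 10, 16))       # 0
print(prediction_error(10, 15, 10, 16))  # 1
print(round(f1(0.9849, 0.9854), 4))
```

Feeding the paper's average precision (98.49%) and recall (98.54%) into `f1` yields roughly 0.985, in the same range as the reported combined F1.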
Table 3. Configurations of 5 × i × n for S and Q during Proto-MAML training.
i | Support Set Size |S| | Query Set Size |Q| | Total Samples (5 × i × n)
5 | 25 | 5 | 3000
6 | 30 | 6 | 3600
7 | 35 | 7 | 4200
8 | 40 | 8 | 4800
9 | 45 | 9 | 5400
10 | 50 | 10 | 6000
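The sizes in Table 3 are consistent with |S| = 5 × i (5 shots for each of the i classes) and |Q| = i (one query per class), with the totals matching |S| × n for n = 120 episodes. A minimal sketch of such 5-shot-i-way episode sampling follows; the function and pool names are illustrative, not from the paper.

```python
import random

def sample_episode(pool, i, shots=5, queries_per_class=1):
    """Sample one i-way episode: `shots` support examples and
    `queries_per_class` query examples per class, without overlap.
    `pool` maps a class label (e.g., a CWE) to its list of QA records."""
    classes = random.sample(sorted(pool), i)
    support, query = [], []
    for c in classes:
        picked = random.sample(pool[c], shots + queries_per_class)
        support.extend(picked[:shots])
        query.extend(picked[shots:])
    return support, query

# Hypothetical pool: 10 CWE classes with 20 records each.
pool = {f"CWE-{n}": [f"CWE-{n}-sample-{k}" for k in range(20)] for n in range(10)}
S, Q = sample_episode(pool, i=6)
print(len(S), len(Q))  # 30 6 -> matches the |S| = 5*i, |Q| = i row for i = 6
```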
Table 4. F 1 combined values for PHP, Java, C, and C++ in 5-shot-i-way tasks during the training of Proto-MAML.
i | F1_combined PHP | F1_combined Java | F1_combined C | F1_combined C++
5 | 96.90% | 98.41% | 99.23% | 98.95%
6 | 75.85% | 87.00% | 78.20% | 80.40%
7 | 83.25% | 91.50% | 89.00% | 87.25%
8 | 96.55% | 98.10% | 90.75% | 96.85%
9 | 99.93% | 88.95% | 90.30% | 87.70%
10 | 99.10% | 99.12% | 99.05% | 98.80%
Table 5. Entropy ( H ( p ) ) values for different configurations in 5-shot-i-way tasks across programming languages during Proto-MAML training.
i | Entropy PHP | Entropy Java | Entropy C | Entropy C++
5 | 0.182 | 0.160 | 0.140 | 0.145
6 | 0.275 | 0.210 | 0.260 | 0.250
7 | 0.240 | 0.195 | 0.210 | 0.220
8 | 0.185 | 0.170 | 0.190 | 0.180
9 | 0.140 | 0.230 | 0.210 | 0.225
10 | 0.160 | 0.150 | 0.145 | 0.155
Table 6. Prediction Error (PE) for different configurations of 5-shot-i-way tasks across programming languages during Proto-MAML training.
i | PE PHP | PE Java | PE C | PE C++
5 | 0.0576 | 0.0291 | 0.0140 | 0.0191
6 | 0.5731 | 0.2690 | 0.5018 | 0.4388
7 | 0.3622 | 0.1672 | 0.2225 | 0.2630
8 | 0.0643 | 0.0349 | 0.1835 | 0.0585
9 | 0.0013 | 0.2236 | 0.1934 | 0.2525
10 | 0.0163 | 0.0160 | 0.0173 | 0.0219
Table 7. Average meta-loss values L meta ( 2 ) for different programming languages and configurations during 5-shot-i-way training of Proto-MAML.
i | PHP | Java | C | C++
5 | 0.0753 | 0.0501 | 0.0273 | 0.0283
6 | 0.6501 | 0.3888 | 0.6173 | 0.5833
7 | 0.5985 | 0.2399 | 0.3171 | 0.3329
8 | 0.0884 | 0.0581 | 0.2612 | 0.0852
9 | 0.0017 | 0.3095 | 0.2999 | 0.3636
10 | 0.0222 | 0.0281 | 0.0290 | 0.0342
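The meta-loss values tracked in Table 7 are outer-loop cross-entropy losses in Proto-MAML. The prototype step at the core of that loss — class prototypes as mean support embeddings, with logits given by negative squared Euclidean distances — can be sketched as follows. This is a toy illustration: the paper operates on BERT representations and answer-span indices, not the 2-D vectors used here.

```python
import math

def prototypes(support_embs, support_labels, n_classes):
    # Each class prototype is the mean of that class's support embeddings.
    protos = []
    for c in range(n_classes):
        members = [e for e, y in zip(support_embs, support_labels) if y == c]
        dim = len(members[0])
        protos.append([sum(e[d] for e in members) / len(members) for d in range(dim)])
    return protos

def proto_probs(query_emb, protos):
    # Logits are negative squared distances to each prototype;
    # a numerically stable softmax turns them into class probabilities.
    logits = [-sum((q - p) ** 2 for q, p in zip(query_emb, proto)) for proto in protos]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy 2-way episode in a 2-D embedding space.
S = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.2, 1.0]]
y = [0, 0, 1, 1]
probs = proto_probs([0.1, 0.1], prototypes(S, y, 2))
loss = -math.log(probs[0])  # one cross-entropy term entering the meta-loss
print(probs[0] > probs[1], round(loss, 3))
```

The query point lies near the class-0 prototype, so its probability dominates and the loss term is small — the behavior the per-epoch curves in Figure 6 converge toward.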
Table 8. Summary of WSR Results for Each Programming Language.
Language | Values of i Analyzed | Optimal Value of i (Explanation) | Statistic W (Significance) | p-Value
PHP | 5, 6, 7, 8, 9, 10 | The optimal value of i was 9: at i = 9, the performance metrics (Precision_combined, Recall_combined, F1_combined) reached their highest values, indicating better classification accuracy. | Positive differences predominate, supporting i = 9. | Less than 0.05
Java | 5, 6, 7, 8, 9, 10 | The optimal value of i was 10: the metrics improved continuously, reaching their highest values at i = 10, which better captures the complexity of the language. | Positive differences predominate, supporting i = 10. | Less than 0.05
C | 5, 6, 7, 8, 9, 10 | The optimal value of i was 5: increasing i beyond 5 did not improve the metrics and sometimes decreased them, suggesting that i = 5 is sufficient for classification. | Negative differences predominate, supporting i = 5. | Less than 0.05
C++ | 5, 6, 7, 8, 9, 10 | The optimal value of i was 8: performance improved up to i = 8 but showed no significant improvement beyond that value. | Positive differences predominate, supporting i = 8. | Less than 0.05
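The W statistics summarized in Table 8 come from the Wilcoxon signed-rank test applied to paired metric values. A minimal pure-Python sketch of the statistic follows; the paired F1 values are hypothetical, and production analyses typically use `scipy.stats.wilcoxon`, which also handles zero differences more carefully and computes the p-value.

```python
def wilcoxon_w(x, y):
    """Signed-rank statistic: rank |differences|, then W = min(W+, W-).
    Zero differences are dropped; tied magnitudes get averaged ranks."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranked = sorted(range(len(diffs)), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ranked):
        j = i
        while j + 1 < len(ranked) and abs(diffs[ranked[j + 1]]) == abs(diffs[ranked[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of tied 1-based rank positions
        for k in range(i, j + 1):
            ranks[ranked[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)

# Hypothetical paired F1 scores for two values of i.
f1_i9 = [0.9993, 0.9890, 0.9910, 0.9905, 0.9930, 0.9920]
f1_i5 = [0.9690, 0.9841, 0.9923, 0.9895, 0.9900, 0.9910]
print(wilcoxon_w(f1_i9, f1_i5))  # -> 3.0
```

Here positive differences predominate (W− is small), which is the pattern the table reports as support for the larger value of i.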
Table 9. Optimal performance metrics for each language in 5-shot-i-way tasks based on combined evaluations.
Language | i | Precision_combined (%) | Recall_combined (%) | F1_combined (%) | EM (%)
PHP | 9 | 99.80 | 99.75 | 99.93 | 99.82
Java | 10 | 99.15 | 99.10 | 99.12 | 99.50
C | 5 | 97.10 | 97.30 | 99.23 | 98.20
C++ | 8 | 97.90 | 98.00 | 96.85 | 97.60
Average | — | 98.49 | 98.54 | 98.78 | 98.78
Table 10. Comparison of sample sizes, labeling, and vulnerabilities across different state-of-the-art studies.
Study | Sample Size and Labeling | CWEs Addressed | Summary of Vulnerabilities
Proto-MAML (This project) | 18,879 samples with 9867 vulnerable code fragments for PHP, 4677 for Java, 5019 for C, and 4038 for C++, with no labeling required | CWE-20, CWE-79, CWE-89, CWE-78, CWE-352, CWE-22, CWE-434, CWE-502, CWE-601, CWE-30, CWE-209, CWE-287, CWE-400, CWE-476, CWE-611, CWE-732, CWE-119, CWE-125, CWE-787, CWE-416, CWE-200, CWE-362, CWE-772, CWE-399 | Addresses cross-site scripting, SQL injection, memory management issues, concurrency problems, deserialization risks, and improper input validation. Applies to PHP, Java, C, and C++ in CMS systems, ORM tools, and embedded systems.
SySeVR [27] | 14,780 samples transformed into 340,000 token-level elements for C and C++, labeled as secure or insecure | CWE-119, CWE-125, CWE-787 | Focuses on memory vulnerabilities such as buffer overflows, out-of-bounds reads, and improper memory restriction. Applicable to C and C++ in systems programming and embedded systems.
BBVD [29] | 16,436 samples for C and C++, with slices labeled as safe or unsafe | CWE-400, CWE-20, CWE-416 | Addresses uncontrolled resource consumption, improper input validation, and use-after-free errors. Targets C and C++ in high-performance and real-time systems.
VulDefend [30] | 4000 samples for C/C++ with fragment and token-level labeling | CWE-89, CWE-79, CWE-352 | SQL injection, cross-site scripting, and cross-site request forgery vulnerabilities in web applications.
VulDeeLocator [28] | 29,000 samples for C transformed into 420,627 fragments, labeled as vulnerable or not vulnerable | CWE-119, CWE-125, CWE-787 | Memory vulnerabilities including buffer overflows and out-of-bounds reads in C-based systems.
SeqTrans [31] | 5000 samples for Java transformed into 650,000 secure–insecure mappings with token-level labeling | CWE-287, CWE-20, CWE-189 | Authentication issues, resource misuse, and numeric errors in Java applications.
VRepair [32] | 655,741 confirmed vulnerabilities from 1,838,740 commits for insecure C code with no labeling required, BERT-based output | CWE-20, CWE-79, CWE-89, CWE-287 | Focuses on unsafe inputs, poor sanitization, permission control, and information exposure in C codebases.
DB-CBIL [33] | 33,360 vulnerable functions from C and C++ with token-level sequence labeling | CWE-119, CWE-125, CWE-476 | Addresses memory management issues, including null pointer dereferences, buffer overflows, and out-of-bounds reads in C and C++.
VuRLE [34] | 48 samples for Java with cluster-based outputs, no labeling required | CWE-20, CWE-89, CWE-287 | Focuses on misplaced resources, injection vulnerabilities, and authentication weaknesses in Java frameworks.
VulD-Transformer [35] | Between 22,148 and 291,892 samples for C and C++, with 937,608 processed examples labeled as vulnerable or not vulnerable | CWE-119, CWE-476, CWE-125 | API function misuse, memory errors, and null pointer dereference vulnerabilities in C and C++ systems.
GPT survey [36] | Unspecified number of samples for C and C++ | CWE-121, CWE-758, CWE-242 | Covers buffer overflows, risky functions, and integer overflows in C and C++ implementations.
Table 11. Detailed Complexity, Technical Reasoning, and Comparative Performance for Proto-MAML and Related Models.
Model | Complexity | Technical Reasoning | Comparative Performance (vs. Proto-MAML)
Proto-MAML | O(n log n) | 1. Attention mechanism (O(n²)): Each token interacts with every other token, resulting in n × n operations.
2. Logarithmic reduction: Meta-learning adjusts parameters incrementally, scaling with the logarithm of the task size (T log n).
3. Combined complexity: Meta-learning reduces the O(n²) operations to O(n log n), optimizing for dynamic scenarios.
Performance: Achieves an average F1-score of 98.78% across all tasks. Specific results: PHP (99.93%), Java (99.12%), C/C++ (97.23%), with an overall Exact Match (EM) score of 98.78%.
SySeVR [27] | O(n³) | 1. Abstract Syntax Tree (AST, O(n²)): Evaluates relationships between nodes, requiring n × n = n² operations.
2. Program Dependency Graph (PDG, O(n³)): Captures semantic dependencies, generating n³ combinations for multi-level relationships.
3. Integration bottleneck: Combined AST and PDG operations elevate complexity to O(n³).
Performance: F1 of 85.8% in C/C++. Proto-MAML avoids graph dependencies, improving scalability and accuracy, and surpasses SySeVR by 11.43%.
BBVD [29] | O(n²) | 1. Attention (O(n²)): RoBERTa compares relationships between n tokens, resulting in n × n = n².
2. Dense operations: Normalization and feedforward add O(n) costs, though not dominant.
Performance: F1 of 95.42% in C/C++. Proto-MAML achieves a higher F1-score of 97.23% (+1.81%), with lower computational complexity.
VulDefend [30] | O(n²) | 1. Token-level analysis (O(n²)): PET evaluates n × n = n² relationships between tokens.
2. Absence of meta-learning: No iterative parameter optimization, unlike Proto-MAML.
Performance: F1 of 89.9% in C/C++, while that of Proto-MAML is 97.23%, significantly outperforming VulDefend by +7.33%.
VulDeeLocator [28] | O(n⁴) | 1. AST (O(n²)): Encodes syntactic relationships, requiring n × n = n².
2. SSA (O(n³)): Tracks control flow across nodes, increasing complexity with data flow dependencies.
3. Combined iteration: Sequential RNN and LSTM processes add O(n²) layers, resulting in O(n⁴) overall.
Performance: F1 of 98.8% in C. Proto-MAML achieves comparable F1-scores, but with significantly lower computational complexity (O(n log n) vs. O(n⁴)).
SeqTrans [31] | O(n²) | 1. BS (O(n)): Iteratively evaluates O(n) candidates for each token.
2. Pairwise comparisons (O(n²)): Evaluates token relationships across sequences, resulting in n × n = n².
Performance: Masked correction rate of 25.3% in Java. Proto-MAML achieves a correction rate of 99.5%, outperforming SeqTrans by +74.2%.
VRepair [32] | O(n⁴) | 1. SNN (O(n²)): Processes tokens linearly with n × n = n² operations.
2. SSA (O(n²)): Adds flow analysis complexity.
3. Cross-propagation: SNN and SSA integration results in O(n²) × O(n²) = O(n⁴).
Performance: Reconstruction rate of 27.59% in C. Proto-MAML achieves an exact match of 98.2%, significantly outperforming VRepair.
DB-CBIL [33] | O(n⁴) | 1. CNN (O(n²)): Performs filtering and dimensionality reduction over n × n = n² operations.
2. BiLSTM (O(n²)): Captures forward and backward dependencies sequentially.
3. Joint iterations: Combined CNN and BiLSTM layers result in O(n²) × O(n²) = O(n⁴).
Performance: Reconstruction rate of 99.51% in C/C++.
Proto-MAML achieves 98.78% overall accuracy, with significantly lower computational demands.
VuRLE [34] | O(n⁵) | 1. AST (O(n²)): Captures hierarchical relationships between nodes.
2. DBSCAN (O(n³)): Clustering compares each token with all others, resulting in O(n³).
3. Combined processes: O(n²) × O(n³) = O(n⁵).
Performance: Replacement prediction rate of 65.59% in Java. Proto-MAML achieves a rate of 99.5%, outperforming VuRLE by +33.91%.
VulD-Transformer [35] | O(n⁴) | 1. PDG (O(n²)): Builds semantic graphs for tokens.
2. Transformer attention (O(n²)): Token-to-token relationships generate n² interactions.
3. Combined iterations: O(n²) × O(n²) = O(n⁴).
Performance: Accuracy ranging from 59.34% to 80.44% in C/C++. Proto-MAML surpasses VulD-Transformer, achieving an average F1-score of 97.5%, with lower complexity.
GPT Survey [36] | O(n²)–O(n³) | 1. Pre-trained database search (O(n²)): Each input of size n is compared with internal data sets, scaling quadratically.
2. Answer calibration (O(n³)): Complex queries require iterative re-evaluations, elevating costs.
Performance: Accuracy of 88%. Proto-MAML achieves an average accuracy of 98.78%, surpassing the GPT Survey in terms of adaptability and efficiency.
Table 12. Qualitative and Quantitative Comparison of Vulnerability Detection Models.
Framework | Quantitative Metrics (F1, TP, FP) | Qualitative Adaptability | False-Positive Handling and Observations
Proto-MAML (This Study) | F1: 98.78%, TP: 97.8%, FP: 2.2% | High adaptability due to meta-learning loops. Supports diverse languages and CWEs with minimal data. Well-suited for incremental learning scenarios. | False positives reduced through precise query alignment in the support set S. Avoids overfitting via regularized parameter updates.
SySeVR [27] | F1: 85.8%, TP: 85.1%, FP: 14.2% | Limited adaptability. Relies on extensive labeled data sets, constraining performance in unseen languages or CWEs. | High false-positive rate due to static AST generation, which misses dynamic code behaviors.
BBVD [29] | F1: 95.42%, TP: 90.5%, FP: 5.1% | Moderate adaptability via RoBERTa fine-tuning, but struggles with heterogeneous code due to fixed tokenization schemes. | False positives are controlled using multiple attention layers, but performance degrades with highly nested code structures.
VulDefend [30] | F1: 89.9%, TP: 86.2%, FP: 7.8% | Moderate adaptability through RoBERTa and PET. Effective in low-data settings but lacks flexibility for unseen CWEs. | False positives arise from probabilistic template errors during few-shot adaptation.
VulDeeLocator [28] | F1: 98.8%, TP: 98.2%, FP: 1.8% | High adaptability for C code with AST and SSA. Limited when applied to other languages. | False positives minimized through sequential learning and SSA optimizations, but requires costly preprocessing.
SeqTrans [31] | Masked Correction: 25.3%, TP: 62.1%, FP: 25.9% | Poor adaptability due to reliance on sequential generation. BS struggles with multi-pattern code. | High false-positive rate, as BS favors common sequences over precise mappings.
VRepair [32] | Reconstruction Rate: 27.59%, TP: 72.4%, FP: 12.3% | Limited adaptability due to heavy reliance on static TL. Struggles with evolving or unseen vulnerabilities. | Moderate false positives due to dependency on sequential context without incorporating dynamic code behavior.
DB-CBIL [33] | Reconstruction Rate: 99.51%, TP: 99.0%, FP: 8.3% | Strong adaptability through CNN and BiLSTM integration, but computationally intensive. | False positives mitigated by token-level sequence labeling, but not fully eliminated in abstract syntax scenarios.
VuRLE [34] | Replacement Prediction: 65.59%, TP: 65.0%, FP: 7.2% | Poor adaptability due to reliance on static AST and DBSCAN clustering, which fail in sparse data environments. | False positives arise from inaccuracies in the clustering performed by DBSCAN on skewed data sets.
VulD-Transformer [35] | F1: 59.34–80.44%, TP: 80.1%, FP: 12.0% | Moderate adaptability via multi-attention transformers, but limited by PDG preprocessing bottlenecks. | False positives reduced by attention mechanisms, but increase with complex PDG structures.
GPT Survey [36] | Accuracy: 88.0%, TP: 88.0%, FP: 12.0% | High adaptability in query handling, but inconsistent outputs depending on fine-tuning and batch size. | False positives fluctuate based on query specificity and model calibration.
Table 13. Limitations of State-of-the-Art Models Versus Proto-MAML for SSDLC and DevSecOps Integration.
Columns: Model | Automated Security Scanning | Interdisciplinary Collaboration | Continuous Security Integration | Real-Time Monitoring and Auditing | Predictive Capability | Dependency and Component Management.

Proto-MAML (this study)
- Automated security scanning: FSL capabilities to detect vulnerabilities across PHP, Java, C, and C++, covering over 24 CWEs; provides actionable reconstructions.
- Interdisciplinary collaboration: facilitates collaboration through interpretable QA outputs.
- Continuous security integration: low complexity (O(n log n)), enabling seamless CI/CD integration.
- Real-time monitoring and auditing: adapts dynamically to new samples based on rich semantics (C_V).
- Predictive capability: accurately identifies specific spans (A_V[START], A_V[END]).
- Dependency and component management: identifies insecure dependencies and generates practical solutions.

SySeVR [27]
- Automated security scanning: graph dependencies (AST, PDG) increase complexity to O(n^3), limiting scalability.
- Interdisciplinary collaboration: graphs require advanced interpretation, hindering accessibility.
- Continuous security integration: graph preprocessing slows CI/CD processes.
- Real-time monitoring and auditing: static nature impedes adaptation to emerging vulnerabilities.
- Predictive capability: cannot predict precise spans within code due to reliance on static analysis.
- Dependency and component management: does not evaluate external dependencies or insecure frameworks.

BBVD [29]
- Automated security scanning: limited to C and C++; lacks coverage for critical languages such as PHP or Java.
- Interdisciplinary collaboration: fixed tokenization hinders direct interpretation by non-technical teams.
- Continuous security integration: dense attention layers increase evaluation times, despite lower complexity (O(n^2)).
- Real-time monitoring and auditing: cannot dynamically integrate new samples, limiting evolution in real-time systems.
- Predictive capability: lacks precise span predictions, reducing audit effectiveness.
- Dependency and component management: ignores external dependencies and third-party vulnerabilities.

VulDefend [30]
- Automated security scanning: only addresses 3–6 CWEs, with limited detection of emerging vulnerabilities.
- Interdisciplinary collaboration: probabilistic modeling produces ambiguous results, complicating collaboration.
- Continuous security integration: absence of meta-learning prevents dynamic adjustments; O(n^2) complexity restricts CI/CD scalability.
- Real-time monitoring and auditing: requires intensive re-training to adapt to new patterns, limiting real-time monitoring.
- Predictive capability: does not predict spans; relies on rigid templates for analysis.
- Dependency and component management: does not evaluate external dependencies or insecure libraries.

VulDeeLocator [28]
- Automated security scanning: high complexity (O(n^4)) from combining AST, SSA, and RNNs, unsuitable for rapid multilingual analysis.
- Interdisciplinary collaboration: dependence on graphs and recurrent networks complicates interdisciplinary collaboration.
- Continuous security integration: heavy computational demands limit automated pipeline integration.
- Real-time monitoring and auditing: relies on pre-labeled data and lacks flexibility for dynamic environments.
- Predictive capability: highly precise, but extreme complexity reduces applicability in agile contexts.
- Dependency and component management: relies on predefined labels, making it unsuitable for external dependencies or unexplored contexts.

SeqTrans [31]
- Automated security scanning: restricted to Java with low CWE coverage; beam search (O(n^2)) lacks scalability.
- Interdisciplinary collaboration: sequential search outputs are difficult to interpret and contextualize.
- Continuous security integration: iterative searches delay CI/CD processes, undermining DevSecOps efficiency.
- Real-time monitoring and auditing: inefficient for dynamic scenarios; unable to handle code changes effectively.
- Predictive capability: fails to predict spans, limiting traceability of vulnerabilities.
- Dependency and component management: ignores third-party dependencies, unsuitable for real-world multilingual environments.

VRepair [32]
- Automated security scanning: requires extensive data and lacks dynamic adaptability, limiting multilingual effectiveness.
- Interdisciplinary collaboration: SNNs hinder understanding and implementation by diverse teams.
- Continuous security integration: high complexity (O(n^4)) renders it incompatible with continuous delivery pipelines.
- Real-time monitoring and auditing: cannot adjust to new samples without complete re-training, unsuitable for evolving systems.
- Predictive capability: fails to predict precise spans; outputs are general suggestions.
- Dependency and component management: lacks analysis of dependencies or third-party frameworks.

DB-CBIL [33]
- Automated security scanning: CNN and BiLSTM integration results in high computational costs (O(n^4)), unsuitable for rapid analysis.
- Interdisciplinary collaboration: complex architecture reduces accessibility for interdisciplinary teams.
- Continuous security integration: inefficient in CI/CD environments, lacking agility for continuous deployment demands.
- Real-time monitoring and auditing: cannot adapt to new vulnerabilities without extensive re-training.
- Predictive capability: lacks granularity in predictions, limiting audit effectiveness.
- Dependency and component management: does not evaluate dependencies or dynamic frameworks, reducing applicability.

VuRLE [34]
- Automated security scanning: dependency on ASTs and DBSCAN clustering elevates complexity to O(n^5), impractical for multilingual scenarios.
- Interdisciplinary collaboration: difficult for non-technical teams to interpret, hindering collaboration.
- Continuous security integration: high complexity excludes it from automated CI/CD pipelines.
- Real-time monitoring and auditing: static structure prevents adaptation to dynamic environments.
- Predictive capability: lacks precision in predictions; does not identify specific spans.
- Dependency and component management: ignores external dependencies and insecure libraries.

VulD-Transformer [35]
- Automated security scanning: PDG preprocessing adds significant computational overhead (O(n^4)).
- Interdisciplinary collaboration: complex graph-based outputs reduce accessibility for interdisciplinary teams.
- Continuous security integration: preprocessing requirements make it unsuitable for rapid CI/CD pipelines.
- Real-time monitoring and auditing: static architecture limits responsiveness to code changes.
- Predictive capability: provides general predictions; lacks granular span identification.
- Dependency and component management: ignores external dependencies and multilingual vulnerabilities.

GPT Survey [36]
- Automated security scanning: general-purpose focus lacks specificity for code security tasks; complexity ranges from O(n^2) to O(n^3).
- Interdisciplinary collaboration: manual calibration for specific tasks reduces DevSecOps usability.
- Continuous security integration: inconsistent outputs and processing times hinder CI/CD integration.
- Real-time monitoring and auditing: pre-trained database dependency limits adaptability to emerging vulnerabilities.
- Predictive capability: lacks granular predictions for vulnerability spans.
- Dependency and component management: closed approach is not optimized for external dependencies, reducing versatility.
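The FSL advantage claimed for Proto-MAML in Table 13 rests on the prototype step of prototypical networks: each class is represented by the mean embedding of its few support examples, and a query is assigned to the nearest prototype. The sketch below illustrates only that step, with toy 2-D vectors standing in for BERT embeddings of code; all names and data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def prototypes(support: np.ndarray, labels: np.ndarray) -> dict:
    """Class prototype = mean of the support embeddings for that class."""
    return {c: support[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(query: np.ndarray, protos: dict) -> int:
    """Assign the query to the nearest prototype by Euclidean distance."""
    return min(protos, key=lambda c: np.linalg.norm(query - protos[c]))

# Toy support set: class 0 (benign) clusters near the origin,
# class 1 (vulnerable) near (1, 1).
support = np.array([[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]])
labels = np.array([0, 0, 1, 1])
protos = prototypes(support, labels)
pred = classify(np.array([0.8, 0.8]), protos)  # nearest to class 1
```

Because prototypes are recomputed from whichever support examples are available, new vulnerability classes can be added from a handful of samples without re-training the encoder, which is the property that distinguishes this row from the re-training-bound baselines in the table.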

Share and Cite

MDPI and ACS Style

Corona-Fraga, P.; Hernandez-Suarez, A.; Sanchez-Perez, G.; Toscano-Medina, L.K.; Perez-Meana, H.; Portillo-Portillo, J.; Olivares-Mercado, J.; García Villalba, L.J. Question–Answer Methodology for Vulnerable Source Code Review via Prototype-Based Model-Agnostic Meta-Learning. Future Internet 2025, 17, 33. https://doi.org/10.3390/fi17010033

