Article

Generative Models for Source Code: Fine-Tuning Techniques for Structured Pattern Learning

by
Valentina Franzoni
1,2,*,†,
Silvia Tagliente
3 and
Alfredo Milani
1,*,†
1
Department of Mathematics and Computer Science, University of Perugia, 06123 Perugia, Italy
2
Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
3
Polytechnic University of Turin, 10129 Turin, Italy
*
Authors to whom correspondence should be addressed.
Current address: Via Vanvitelli 1, 06100 Perugia, Italy.
Technologies 2024, 12(11), 219; https://doi.org/10.3390/technologies12110219
Submission received: 3 October 2024 / Revised: 25 October 2024 / Accepted: 1 November 2024 / Published: 4 November 2024

Abstract

This study addresses the problem of automatically generating source code that is not only functional, but also well-structured, readable, and maintainable. Existing generative models for source code often produce functional code, but they lack consistency in structure and adherence to coding standards, which are essential for integration into existing application development projects and for long-term software maintenance. The methodology proposed in this study applies transfer learning techniques to the DeepSeek Coder model, refining the pre-trained model to generate code that integrates additional structuring constraints. By training the model on specific code structures, using a dataset with Italian comments, the proposed methodology ensures that the generated code meets both the functional requirements and the predefined coding structure. Experimental results, evaluated using the perplexity metric, demonstrate the effectiveness of the proposed approach, which contributes to reducing errors and ultimately improves software development quality.

Graphical Abstract

1. Introduction

The demand for automated source code generation has significantly increased with machine learning and artificial intelligence advancements. Generative models, such as those based on transformer architectures, are used to assist developers in creating functional source code. However, these models often lack the capability to ensure adherence to best practices and coding standards, even though they can generate syntactically correct code.
Ensuring code quality is critical to reducing development time, minimizing errors, and facilitating the integration of generated code into existing projects. Without proper structure and readability, generated code can lead to increased technical debt and maintenance costs.
To address these challenges, this study investigates the use of transfer learning techniques [1] to enhance the quality of generative models for source code. By refining and customizing pre-trained models, the proposed approach aims to generate code that is not only functionally correct but also adheres to predefined coding standards, thereby improving readability and maintainability. Specifically, we utilize the DeepSeek Coder model [2], which is fine-tuned on datasets with structural constraints, including Italian language annotations, to demonstrate the potential for generating structured and maintainable code.

Research Goal

This study investigates whether transfer learning techniques can be effectively used to introduce and enforce ordered structures in generated code, ensuring that the output is consistent with established quality standards.
Exploiting a specially modified dataset with Italian comments, our approach aims to generate code that not only meets the required programming functionality but also aligns with predefined coding standards.
Aiming to fill the existing research gap (see Section 2.1), our study addresses the following research questions:
  • Can code be generated in forms that are functional, expressive, readable, and maintainable?
  • How can transfer-learning techniques improve the structural coherence while maintaining the functional consistency of generative models for programming code?
In the next section, related works and background knowledge on previous models for code generation and on the DeepSeek architecture are given, with its key features and code generation capabilities. In Section 3, the general state-of-the-art methodologies for transfer learning with code generation models (i.e., classical fine-tuning and adapter fine-tuning), the transfer learning methodology proposed in this work, and the proposed dataset generation process to incorporate structural constraints are described. In Section 4, the transformed datasets used for training and testing the model are described, the conducted experiments and their results are presented, and the results are analyzed and discussed to evaluate the effectiveness of the proposed approach. Conclusions are drawn in the last section.
Two appendices follow: Appendix A reports some sample prompts and the output produced by the code-generative model, before and after the two adaptation strategies; Appendix B details the scripts that can be used to replicate our system.

2. Related Works

2.1. Generative Models for Source Code

The advent of transformer-based models [3], such as the Generative Pre-trained Transformer (GPT) [4], has significantly improved the capabilities of text generation. Transformers use a self-attention mechanism that enables them to consider the entire context of a given input, making them highly effective at understanding and generating long-form text. This architecture allows them to capture complex relationships within the text and produce outputs that are contextually relevant and grammatically correct.
Modern Large Language Models (LLMs) like GPT-3 and GPT-4 [5,6] are pre-trained on large datasets. LLMs have been applied to various natural language tasks, such as writing essays, summarizing articles, and even answering complex questions. These models generate text by anticipating the subsequent words in a sequence based on the given context, ensuring coherence and relevance in the output.
Recently, there has been growing interest in applying these models to code generation [7]. In addition to generating syntactically correct code, these models must also ensure that the code follows the logical constructs, dependencies, and semantic rules necessary for it to run correctly. Programming code differs from natural language in several fundamental ways. While natural language allows for a degree of ambiguity and flexibility in grammar and phrasing, code must adhere to strict syntactic and semantic rules defined by programming languages. Even a small syntax error can render a piece of code non-functional [8]. In addition, a deeper understanding of the context over long sequences is required to follow the logical flow and dependencies between variables and functions in programming [9]. As a result, code generation presents unique challenges beyond the scope of natural language generation.
Generative models like GPT-3, Codex [10], and CodeT5 [11] have been adapted to address such challenges by leveraging transfer learning techniques and large-scale pre-training on diverse code repositories. These models can be fine-tuned on specialized datasets containing code from multiple programming languages to capture language-specific syntax and structures [12]. As a result, they gain the ability to complete code snippets, generate function definitions, and even respond to high-level statements [13], making them powerful tools for code generation and software development support [14,15].
Despite these advances, generative models for programming code still face several limitations [16]; in many cases, they produce code that is syntactically correct but logically flawed or inefficient. In current code generation models, there is still a significant gap between generating code that appears to be correct and generating code that performs as intended and satisfies properties, such as adherence to coding standards and structure, that are critical in the software development process.

2.2. The DeepSeek Models

For our study of code adaptation, several generative programming systems were considered. The choice of DeepSeek Coder Base model was mainly due to its availability as open source code and its smaller size, with 1.3 billion parameters. This allows for the additional training required for fine-tuning even with limited computational resources.
The DeepSeek Coder models [2] are a set of models developed by the company DeepSeek AI [17], which focuses on research in the field of AGI (Artificial General Intelligence), with the goal of developing an intelligent agent capable of performing any intellectual task that a human being could perform [18,19,20]. These models share the architecture of the Llama 2 model [21] developed by Meta GenAI, to ensure compatibility with the toolchains within the Llama ecosystem. However, it is important to note that the training data and model parameters are not shared with the Llama 2 model: the DeepSeek models were instead trained from scratch using different datasets.
Each DeepSeek model is pre-trained on 2 trillion tokens (2 × 10¹²), 87% of which is code written in more than 80 different programming languages, while the remaining 13% is in natural language, in English, Chinese, and other languages. These models are available in various sizes, ranging from 1 billion to 33 billion parameters.
The models were trained using a mixed approach: in addition to the traditional sequential token prediction method, a masked token contextual completion method [22] was also employed. This method consists of randomly omitting some parts of the code during training and training the model to predict/complete them correctly. This type of training led to the creation of the “Base” models, which are suitable for code completion. Subsequently, these models were further fine-tuned using 2 billion additional tokens of instruction-based data, resulting in the “Instruct” models, capable of responding to instructions rather than being limited to code completion.
The data collection procedure steps are illustrated in Figure 1 and are as follows:
  • Collect code from GitHub, and apply filtering rules to select useful content;
  • Parse file dependencies within the same repository to reorganize file locations, based on their dependencies;
  • Concatenate dependent files into a single record and use repository-level min-hashing for deduplication;
  • Further filter, to exclude low-quality code, such as code with syntax errors or poor readability.
Figure 1. Data collection process.
The main phases of the training process, shown in Figure 2, are as follows: an initial model pre-training with a dataset composed of 87% code, 10% code-related language (such as GitHub Markdown and StackExchange content), and 3% non-code-related Chinese language, performed with 1.8 trillion tokens and a batch size of 4000 tokens; a subsequent long-context pre-training phase with a batch size of 16,000 tokens and an additional dataset of 200 billion tokens, resulting in the Base model; and a final phase in which the Base model is fine-tuned with 2 billion tokens of instruction-based data, resulting in the Instruct model.

3. Methodologies for Code Generative Models Adaptation

This section presents a comprehensive methodology focused on adapting generative models through advanced transfer learning techniques, to effectively generate source code that satisfies both functional requirements and structural consistency. The goal is to improve the DeepSeek Coder model’s ability to produce code that functions correctly and maintains a clear, consistent coding structure. Transfer learning is used to introduce additional structure into the generated code. In general, transfer learning avoids massive retraining of a deep neural network from scratch, preserving most of the previously acquired knowledge, and was first applied in the field of image classification [1].
Two transfer learning techniques, a classic fine-tuning method [23] and a fine-tuning method using adapters, are first introduced and their variants are discussed in Section 3.1 and Section 3.2. The proposed fine-tuning methodology and the proposed dataset generation process for incorporating structural patterns in code generation are illustrated in Section 3.3 and Section 3.4, respectively.
The basic versions of the used strategies are also available in the HuggingFace libraries, specifically Transformers for loading and training the model, and PEFT (Parameter-Efficient Fine-Tuning) for loading and managing the adapter.

3.1. Classical Fine-Tuning

Fine-tuning is a common approach in transfer learning wherein pre-trained models are adapted to a specific task using task-related data.
In particular, the classic fine-tuning approach [23] is typically implemented with the strategy of freezing the initial layers of the model. This methodology allows for the general pre-acquired knowledge of the model to remain intact while training focuses on the final layers, which are responsible for learning more specific and contextual details of the dataset. This tactic aims to refine the model’s ability to adapt to new data while maintaining a solid base of universal pre-trained knowledge. It also requires far fewer computational resources than fine-tuning all of the model’s weights.
This process builds upon the pre-training general knowledge and refines the model to focus on the specifics of the new task.
The key steps in classical fine-tuning are the following:
  • Step 1: Loading the pre-trained model and freezing the initial layers. The process starts by loading a pre-trained model, in our case DeepSeek Coder, which has been trained on large datasets. The pre-trained model already captures general patterns, such as syntactic structures in a programming language and semantic aspects. The early layers responsible for learning these basic features are typically frozen, meaning that their weights are not updated during fine-tuning and that the general knowledge acquired during pre-training is preserved, reducing computational cost and ensuring that the model retains its ability to generalize across different tasks.
  • Step 2: Unfreezing and training the final layers. By contrast, the final layers of the model, which are more task-specific, are unfrozen. These layers are fine-tuned by updating their weights with task-specific data, allowing the model to adapt to new requirements. For example, in our case of source code generation, the final layers can learn specialized patterns relevant to programming languages, such as syntax and logical structures. After fine-tuning, these layers incorporate the specific features of the new dataset, while the general generative knowledge from the pre-training is retained.
  • Step 3: Reduced Rate Training. This is a training technique [24] to ensure that the fine-tuning process effectively adapts the model without disrupting the pre-trained knowledge, preventing overfitting and ensuring that the model effectively learns task-specific features. A reduced learning rate helps avoid large and sudden changes in model weights, allowing for gradual and focused refinement.
Additionally, other techniques are used during training to reduce the risk of overfitting, which fine-tuning can cause, especially with smaller datasets. To mitigate this risk, techniques such as dropout (randomly deactivating neurons during training) [25] or weight decay (adding penalties for large weight values) [26] are used. These methods help ensure that the model generalizes well to new, unseen data rather than just memorizing the training set. In particular, monitoring validation performance and using an early stopping strategy (stopping training when validation performance plateaus) [27] can prevent overfitting and improve model robustness.
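As a minimal sketch of Step 3 and the regularization techniques just mentioned (assuming the HuggingFace Transformers Trainer API; model, train_ds, and eval_ds are placeholder names, and the hyperparameter values are illustrative rather than those used in the experiments, which are reported in Appendix B):
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="./fine-tuned-model",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    learning_rate=2e-5,              # reduced learning rate for gradual, focused refinement
    weight_decay=0.01,               # penalizes large weights to limit overfitting
    evaluation_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,     # required for early stopping on the validation loss
    metric_for_best_model="loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,                     # pre-trained model with its initial layers frozen
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop when validation loss plateaus
)
trainer.train()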

3.2. Adapter Fine-Tuning

For adapter-based fine-tuning, we considered the following methods provided by the PEFT library for efficient parameter management:
  • Low-Rank Adapters (LoRA): LoRA introduces low-rank adapters into neural network models, such as the attention layers and feed-forward networks of transformer models, without changing the original weights [28,29]. It uses low-rank matrices to approximate the weight updates, reducing the number of parameters to train while maintaining high performance with less required memory.
  • IA3: IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations) introduces learned scaling vectors that rescale the model’s internal activations, allowing for targeted modifications without directly changing the layers’ weights. This method provides more granular control over the model’s response and improves memory management efficiency.
  • AdaLoRA: A variant of LoRA, AdaLoRA combines the low-rank approach with adaptive learning. This method emphasizes dynamic parameter adjustment during fine-tuning, providing finer customization and greater control over the data and specific task requirements.
  • 4-bit quantization: PEFT supports 4-bit quantization techniques (e.g., via the bitsandbytes library), useful for loading large language models (LLMs) on non-specialized hardware such as consumer GPUs. This process significantly reduces memory consumption and prevents system overload.
These models are designed to be compatible with Accelerate, facilitating distributed training and optimizing memory usage during training and inference. PEFT models can also be loaded with reduced precision data types (8-bit or 4-bit), further reducing memory requirements. This feature is particularly useful for handling large models.
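As an illustrative sketch of how a model can be loaded with 4-bit quantization before attaching a PEFT adapter (assuming the HuggingFace Transformers and bitsandbytes integration; the model identifier and compute dtype are assumptions, not the exact configuration used in the experiments):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization configuration; reduces memory so the model fits on a consumer GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# device_map="auto" relies on Accelerate to place layers on the available devices
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-1.3b-base",   # assumed HuggingFace model id
    quantization_config=bnb_config,
    device_map="auto",
)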

3.3. Proposed Fine-Tuning Methodology

After preliminary experiments, we decided to focus on LoRA [29] to implement the adapter method, due to its efficiency and scalability. Traditional fine-tuning is resource-intensive because it requires updating all model parameters. LoRA instead introduces low-rank matrices to approximate weight updates, which significantly reduces the computational load, making fine-tuning both faster and more memory efficient. By freezing the original weights and modifying only the smaller matrices, LoRA preserves the pre-trained knowledge of the model while at the same time allowing it to effectively adapt to new tasks.
The introduced low-rank matrices technique makes LoRA highly scalable, allowing for large models to be fine-tuned even on consumer-grade hardware. In addition, a relevant advantage of the proposed method is that it can be applied to different components of neural networks without modifying their core architecture, which is particularly useful for generating source code where precise control over task-specific learning is critical. As confirmed by the experiments presented in Section 4, the proposed model can generate structured and semantically correct code while maintaining overall coherence and consistency offered by the DeepSeek model, making it a practical and effective solution for model adaptation.
To better illustrate the method, consider it in the context of traditional fine-tuning, where the update of a weight matrix W is represented as Δ W . As shown in Figure 3, LoRA approximates Δ W using the product of two lower-rank matrices, which we denote as A and B. This approximation can be understood as a decomposition of Δ W into the matrices A and B.
To create the matrices A and B, a hyperparameter r is used to specify the rank of the resulting matrices. A smaller value of r produces smaller matrices, resulting in fewer parameters to train. This can lead to much faster training times and requires less computational capacity, but with smaller matrices, the information retention capacity decreases.
For an illustration of the reduction in parameters, assume that the weight matrix of a specific layer has dimensions 5000 × 10,000, or 50 million parameters in total. If r = 8 , two matrices will be initialized, with dimensions 5000 × 8 and 10,000 × 8, totaling 120,000 parameters.
The scaling factor α is another fundamental hyperparameter for the use of LoRA. This factor determines how much the LoRA layer changes the existing layers. A higher value of α results in more significant changes to the final behavior of the model.
It is important to note that this type of approach does not modify the original weights of the model. Instead, they are frozen. The final result uses both the original weights and the weights resulting from adapter training.
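To make the decomposition concrete, the following sketch (ours, with illustrative values; α = 16 is an assumption) reproduces the parameter counts of the example above and shows how the frozen weight matrix and the scaled low-rank update combine:
import numpy as np

d_out, d_in, r, alpha = 5000, 10_000, 8, 16

W = np.random.randn(d_out, d_in) * 0.02   # frozen pre-trained weight matrix (50 million parameters)
A = np.random.randn(d_out, r) * 0.01      # low-rank factor, 5000 x 8
B = np.zeros((d_in, r))                   # low-rank factor, 10,000 x 8 (commonly initialized to zero)

# only A and B are trained: 40,000 + 80,000 = 120,000 parameters instead of 50 million
print("trainable LoRA parameters:", A.size + B.size, "vs full update:", W.size)

# at inference, the original weights stay frozen and the adapter adds a scaled low-rank correction
W_eff = W + (alpha / r) * (A @ B.T)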
In practice, the use of LoRA can be summarized in the following steps (a minimal usage sketch is shown after the list):
  • Instantiate the base model;
  • Create a configuration (LoraConfig) where the parameters for LoRA are specified;
  • Encapsulate the base model with the method get_peft_model() to obtain a trainable PeftModel;
  • Train the PeftModel using classic model training methods.
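The sketch below illustrates these four steps with the HuggingFace Transformers and PEFT libraries; the hyperparameter values and target modules are illustrative, while the exact configuration used in our experiments is reported in Appendix B Listing A6.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# 1. Instantiate the base model (assumed HuggingFace model id)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base")

# 2. Create a LoRA configuration with the rank r and scaling factor lora_alpha
config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# 3. Encapsulate the base model to obtain a trainable PeftModel
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()

# 4. Train the PeftModel with the classic training methods (e.g., the Trainer API), not shown here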

3.4. Proposed Dataset Generation Process Incorporating Structural Patterns

The goal of our study is to verify whether the proposed tuning methods can give the DeepSeek Coder Base model the ability to generate Python code reflecting certain structural patterns that the model could not produce originally. Thus, the dataset used for the transfer learning experiments must contain the desired structural pattern.
In our study, we propose an original methodology to process the widely known training code dataset Alpaca18k and transform it to incorporate the structural pattern constraints that the generated code is required to respect.
Alpaca18k is the shortname for python code instructions 18k alpaca, a dataset available on the HuggingFace platform [30]. Alpaca18k consists of 18,600 rows, each containing an instruction, an input (if necessary), the expected output (i.e., the code that should be generated to correctly respond to the instruction), and finally a prompt field that concatenates all the previous information in the Alpaca style, following the structure of the dataset used by Stanford for fine-tuning the Alpaca model.
The python code instructions 18k alpaca dataset entries consist of four fields: Instruction, Input, Output, and Prompt.
The Prompt field in the original data set follows the structure shown in the sample below:
  •     Below is an instruction that describes a task. Write a response
  •     that appropriately completes the request.
  •     ### Instruction:
  •     Automate a Python program to check the length of each word in a
  •     given string.
  •     ### Input:
  •     ### Output:
  •     def lengthWord(inp):
  •         result = []
  •         for word in inp.split(’ ’):
  •             result.append(len(word))
  •         return result
  •     # Driver Code
  •     inp = "Generating a code generation task instructions"
  •     print(lengthWord(inp))
In order to build an appropriate dataset for adapting the model to incorporate new structural patterns, the Alpaca18k dataset is used for training and testing. Our proposed methodology modifies the dataset to reflect the desired structure.
The following steps were defined to generate the dataset, incorporating structural patterns used in the experimental phase.
  • Removing optional text: using an appropriate script, the initial optional text part was removed from all samples in the dataset, e.g., the phrase “Below is an instruction that describes a task. Write a response that appropriately completes the request.” in the sample prompt above.
  • Main structure personalization: to ensure that the model follows the expected structure without being influenced by its prior knowledge of the English language, the following replacements were made:
    The word “Instruction” is replaced with the Italian word “Istruzione”.
    The word “Input” is replaced with the Italian word “Ingresso”.
    The word “Output” is replaced with the Italian word “Uscita”.
    The script used for these replacements can be found in Appendix B Listing A2.
  • Structural pattern insertion: the objective was to insert specific structural patterns, i.e., placeholder comments at predefined points in the code. This intervention was designed to create distinctive markers, unknown both to DeepSeek Coder Base and to the original Alpaca18k dataset. Placeholders are useful for clearly and effectively evaluating the model’s ability to incorporate structure in code generation once subjected to different fine-tuning strategies.
    The structural comment placeholders were inserted as follows:
    Before each function definition “def function_name”, a comment “# Definition of function function_name” was added;
    Before each if condition, a comment “# If condition” was added;
    Before each while loop, a comment “# While loop” was added;
    Before each for loop, a comment “# For loop” was added.
    The script used to insert these placeholders while preserving the correct indentation is shown in Appendix B Listing A1.
  • Padding and End of Sequence token: since not all samples in the dataset are of the same length, padding up to the maximum length of the longest sentence was applied during tokenization using the EOS (End Of Sequence) token. This means that, for each shorter sentence or text sample compared to the maximum length, EOS tokens were added at the end to bring all samples to the same length. This procedure ensures that the machine learning model can effectively handle the entire dataset, as natural language processing models typically require uniform input lengths. Moreover, inserting EOS tokens allows for a precise and timely interruption of the generation process, avoiding the generation of superfluous or irrelevant text.
For illustration purposes, below is an example of the "prompt" section of a tuning dataset sample after all the transformations described:
  •     ### Istruzione:
  •     Automate a Python program to check the length of each word in a
  •     given string.
  •     ### Ingresso:
  •     ### Uscita:
  •     # Definizione della funzione lengthWord
  •     def lengthWord(inp):
  •         result = []
  •         # For loop
  •         for word in inp.split(’ ’):
  •             result.append(len(word))
  •         return result
  •     # Driver Code
  •     inp = "Generating a code generation task instructions"
  •     print(lengthWord(inp))
  •     <EOS>
It is important to note that the generated dataset incorporates structural patterns that are different from the original ones, totally unknown to the DeepSeek model, and not present in the original Alpaca18k.
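As a minimal sketch of the padding and EOS step described above (assuming the HuggingFace tokenizer and datasets APIs; the maximum length of 512 is an assumption, while the column name "prompt" follows the Alpaca18k layout):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base")
tokenizer.pad_token = tokenizer.eos_token    # pad shorter samples with the EOS token

def tokenize(batch):
    # pad every sample up to a common maximum length so that all inputs have uniform size
    return tokenizer(batch["prompt"], padding="max_length", truncation=True, max_length=512)

# tokenized = transformed_dataset.map(tokenize, batched=True)   # `transformed_dataset` is the modified Alpaca18k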

4. Experiments and Discussion

To evaluate the effectiveness of the proposed model adaptation techniques, we conducted a series of experiments using both classical fine-tuning and adapter-based approaches on the DeepSeek Coder model. These experiments aim to determine how well each adaptation method improves the model’s ability to generate well-structured and readable source code. Performance was evaluated using perplexity metrics as well as practical code quality evaluations. The following subsections detail the evaluation metrics, experimental plan and setup, the two fine-tuning experiments, and results with discussion.

4.1. Evaluation Metrics

The evaluation of language generative models is challenging. The main difficulty lies in the evaluation of generated text, which is inherently subjective, such as assessing the quality of a generated text in terms of coherence, naturalness, and relevance to the context. This is in contrast to tasks such as classification, wherein accuracy can be measured objectively with respect to a given ground truth. Moreover, a model may generate syntactically correct text, but semantically inappropriate or irrelevant to the given context, making evaluation even more complex when applied to programming code generative systems.
Common metrics used to evaluate language models include the following:
  • BLEU (Bilingual Evaluation Understudy) [31]: Measures the overlap between generated text and reference translations using n-gram precision and a brevity penalty. It is mainly used to evaluate the quality of translations and code generation.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [32]: Focuses on recall and the ability of the generated text to capture essential content, with variants such as ROUGE-N (n-gram overlap), ROUGE-L (longest common subsequence), and ROUGE-S (skip-bigram overlap).
  • METEOR (Metric for Evaluation of Translation with Explicit Ordering) [33]: Balances precision and recall while accounting for semantic meaning through synonyms and word order, providing a flexible evaluation metric.
  • Perplexity [34]: Indicates the model’s uncertainty in predicting sequences, with lower values suggesting better language modeling and text generation accuracy. It is based on the inverse probability of the given text, normalized by text length. Low perplexity indicates that the model predicts words with high precision, showing a good understanding of linguistic structures.
Although commonly used in NLP, metrics like ROUGE, BLEU, and METEOR are inadequate for source code evaluation due to several limitations.
Code is highly sensitive to syntax and structure, where even minor variations can result in non-functional code despite high n-gram similarity. These metrics also do not consider semantic correctness or execution validity, which are critical for code functionality. Perplexity, on the other hand, is better suited for evaluating code generation as it focuses on predicting sequences accurately, measuring how well the model understands and generates syntactically and contextually correct code. This makes it more effective at assessing the model’s ability to produce usable and logically consistent code outputs. In general, more specialized evaluation metrics that consider code functionality, execution, and syntactic correctness (e.g., code compilation success or runtime performance) would be necessary to assess more accurately the performance of generative models in the context of source code generation; with this in mind, let us examine more closely the characteristics of perplexity.
Perplexity [34] is a metric commonly used in the evaluation of language models, including those used in natural language processing (NLP) and code generation. Perplexity is used to evaluate how well a language model can predict a sequence of words or tokens. It is essentially a measure of the model’s uncertainty in the prediction of the next token in a sequence. The lower the perplexity, the better the predictive ability of the model.
Formally, perplexity (PP) is defined as follows:
PP(W) = P(w_1, w_2, \ldots, w_N)^{-\frac{1}{N}} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i)}
where W = w_1, w_2, \ldots, w_N represents the sequence of tokens, and P(w_i) is the probability assigned by the model to the token w_i. The joint probability P(W) is often decomposed into a product of conditional probabilities using the Markov assumption [35], which allows the probability of a word to be calculated based only on a limited number of preceding words, rather than on the entire history. This aspect makes perplexity extremely sensitive to the length of the text and the vocabulary used, performing best on texts of homogeneous length.
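As a minimal sketch of how perplexity can be estimated in practice for a causal language model (assuming the HuggingFace Transformers API; the model identifier and maximum length are assumptions): the loss returned by the model is the average negative log-likelihood per token, and its exponential is the perplexity, equivalent to the base-2 formulation above.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/deepseek-coder-1.3b-base"   # assumed HuggingFace model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(texts):
    nll_sum, token_count = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            # passing labels=input_ids makes the model return the average cross-entropy loss
            out = model(**enc, labels=enc["input_ids"])
        n_tokens = enc["input_ids"].numel()
        nll_sum += out.loss.item() * n_tokens   # approximate total negative log-likelihood
        token_count += n_tokens
    return math.exp(nll_sum / token_count)      # exp of the average negative log-likelihood

print(perplexity(["### Istruzione:\nWrite a Python function.\n### Ingresso:\n### Uscita:\n"]))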
We point out that metrics such as correctness and consistency of the generated code are not evaluated, since they are directly inherited unchanged from the underlying DeepSeek model and from the basic Alpaca18k dataset. On the other hand, the readability and maintainability of the code depend on the quality of the structural constraints that are requested to be embedded in the generated code, which our method effectively realizes.

4.2. Experimental Plan

Experiments were implemented for both proposed model adaptation methodologies, fine-tuning and adapter-based tuning. A subset of samples not exceeding 500 characters from the Alpaca18k dataset was used. The selected samples were split into a training set and a test set, where the latter represents 10% of the initial dataset. To evaluate the effectiveness of the adaptation methods, the reference initial perplexity value of 4.64 was measured on the test dataset before adaptation.
To evaluate the training of the language model, we used a loss function to quantify the difference between the model’s predictions and the actual values in the data. During the training process, the loss is calculated for each batch of input data, and the goal is to minimize this value through successive iterations, thereby refining the model’s predictive capabilities. A low loss value indicates that the model is capable of predicting or generating outputs consistent with the training data, signaling a high level of learning and adaptation.
To assess the model’s ability to maintain the defined structure, we used perplexity to measure how likely a given sequence of words is to occur, according to the model. A low perplexity value indicates that the model predicts sequences with greater precision and consistency. Perplexity was calculated before and after applying the transfer learning techniques.
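As an illustrative sketch of this experimental setup (assuming the HuggingFace datasets API; the dataset identifier comes from the Data Availability Statement, while the seed value is an arbitrary placeholder):
from datasets import load_dataset

dataset = load_dataset("iamtarun/python_code_instructions_18k_alpaca", split="train")

# keep only samples whose prompt does not exceed 500 characters
dataset = dataset.filter(lambda example: len(example["prompt"]) <= 500)

# 90% training set / 10% test set
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, test_ds = splits["train"], splits["test"]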

4.3. Fine-Tuning Experiment

The selected dataset was processed using the methodology described in Section 3.4 and subsequently tokenized before being fed to the model in the tuning phase. As can be seen in Figure 4, the DeepSeek model is structured into 24 layers called LlamaDecoderLayer, a final normalization layer, and an lm_head layer, which connects the output of the last layer of the model to the tokenizer vocabulary. Each LlamaDecoderLayer consists of the following components:
  • LlamaAttention: The multi-head attention mechanism, fundamental for focusing on relevant parts of the input. It includes four linear projections: q_proj, k_proj, v_proj, and o_proj, to handle the queries, keys, values, and output, respectively.
  • LlamaMLP: A multilayer perceptron (MLP) that processes the output of the attention mechanism. It includes several linear layers (gate_proj, up_proj, down_proj) and the SiLU (Sigmoid Linear Unit) activation function, which introduces non-linearity.
  • Normalization: Present in every LlamaDecoderLayer, with two modules, input_layernorm and post_attention_layernorm, used before and after the attention mechanism, respectively. This normalization is crucial for stabilizing learning and improving model performance.
Figure 4. Structure of the DeepSeek model.
For the fine-tuning, all layers were initially frozen, and tests were conducted to determine the optimal number of layers to unfreeze. The best trade-off between training time and performance was achieved by unfreezing the last three LlamaDecoderLayer (21, 22, 23), the final normalization layer, and the lm_head layer.
The decision to unfreeze these specific layers was guided by considerations aimed at optimizing the model during learning while avoiding the computational cost of using all layers. Unfreezing the last layers of the model is based on their proximity to the output, making them particularly suitable for capturing and modeling fine details and specific features of the dataset the model is trained on, improving its ability to respond accurately and specifically to the data learned during fine-tuning.
The model normalization (model.model.norm.weight) is fundamental for training stability and efficiency, significantly influencing how the model processes and interprets data. Adjusting this layer improves the model’s performance, better adapting it to the specific task.
Finally, the model’s head (model.lm_head.weight) is essential for generating accurate and relevant textual outputs. This component, responsible for converting internal representations into words or tokens, must be adaptable to produce grammatically correct and contextually appropriate responses for the specific task.
The code used to freeze the model’s weights and unfreeze the aforementioned ones is shown in Appendix B Listing A4. To monitor the number of parameters actually trained after freezing the initial layers, the function shown in Appendix B Listing A3 is used. This function examines each layer of the model, summing the trainable parameters of each unfrozen layer. In the specific case presented, the number of trainable parameters is 217,856,000 out of a total of 1,346,471,936.
This indicates that the fraction of unfrozen, and therefore trainable, parameters represents 16.18% of the total parameters in the model.
To simulate the effect of a larger batch size without requiring higher computational resources, the gradient_accumulation_steps parameter was modified. In this specific case, gradient_accumulation_steps was set to 8. This approach allows the model to accumulate gradients from multiple mini-batches before proceeding with weight updates, effectively simulating the effect of a larger batch size, which in this case would be 8. The training configuration is shown in Appendix B in Listing A5.
As mentioned earlier, the loss function was used as an evaluation metric during the training process, and the values are shown in Table 1. The resulting graph can be seen in Figure 5.
Analyzing the training and validation loss values together, a positive trend is observed in the training process. Both values show a consistent decrease, indicating that the model is effectively improving in adapting to the training data and generalizing. This trend suggests stable and progressive learning, with no evident signs of overfitting, as the validation loss decreases in parallel with the training loss. The progressive reduction in the gap between the two loss values generally indicates a good balance between learning on the training data and the ability to generalize to new data.
The perplexity value measured after training was 1.61, suggesting that the model successfully adapted to the new input structure.
  • Sample Prompt 1.
As previously discussed, evaluating the performance of a generative model for code using traditional metrics can be complex and, in some cases, impractical. For a qualitative evaluation of the results, it is therefore useful to observe the output generated by the model in response to some sample prompts.
An example of a prompt could be as follows:
  •     ### Istruzione:
  •     Python function named funzione_anno to check if year is a leap year.
  •     ### Ingresso:
  •     ### Uscita:
Analyzing the output generated by this prompt before fine-tuning, the following result is observed:
  •     ### Istruzione:
  •     Python function named funzione_anno to check if year is a leap
  •     year.
  •     ### Ingresso:
  •     ### Uscita:
  •     ### Esempio di input:
  •     ### Esempio di output:
  •     ### Esempio di input:
From this example, it is clear that the model, in the absence of adequate prior knowledge, is unable to correctly interpret the expected output structure. The lack of specific indications about what should follow the term Uscita: (i.e., “Output” in Italian) leads the model to generate content that repeats the input structure provided, without producing useful or meaningful code. This example clearly shows how, before fine-tuning, the model tends to replicate input elements rather than develop coherent and functional responses. The presence of Italian terms in the output, such as “Esempio di” (“Example of:”) can be explained by the presence of some Italian comments within the code used in the model’s base training.
The same prompt, processed through the fine-tuned model, produces the following result:
  •     ### Istruzione:
  •     Python function named funzione_anno to check if year is a leap year.
  •     ### Ingresso:
  •     ### Uscita:
  •     # Definizione della funzione funzione_anno
  •     def funzione_anno(year):
  •     # Condizione if
  •     if year % 4 == 0 and (year % 100 != 0 or year % 400 == 0): return True
  •     else:
  •     return False
From this example, we observe that the model correctly maintained the structure of the output and comments. Additionally, it demonstrates a partial understanding of the provided instructions. This limited understanding could be due to the model’s adaptation training on a relatively small dataset. Initially lacking a deep understanding of natural language outside the comments in the code, the model has problems with processing complex instructions, an aspect that could be improved with more extensive fine-tuning based on a larger dataset, capable of enriching its understanding of natural language. These latter limitations are not due to the methodologies for structural constraint embedding proposed in this work, but are limitations intrinsic to the DeepSeek model and to the dataset Alpaca18k employed for fine-tuning.
In Appendix A, some examples of more complex prompts are discussed.

4.4. Adapter Experiment

For the adapter methodology for transfer learning, the dataset was processed as described earlier and tokenized before being passed to the model for training.
Next, the adapter configuration was performed. Specifically, the adapter was configured to intervene in the projections of the attention mechanism (q_proj, k_proj, v_proj, o_proj) and in the projections of the multilayer perceptron (MLP) (mlp.gate_proj, mlp.up_proj, mlp.down_proj), in addition to the language model head (lm_head).
In the adapter configuration for the model, a strategic choice was made regarding the hyperparameters: the rank of the adapter matrices, indicated as r, was set to 32 instead of the default value of 64. This reduction aimed to lighten the model’s computational load. Simultaneously, the scaling factor, denoted as lora_alpha, was increased to 64 from the standard value of 16. This increase in lora_alpha is significant because it directly affects the importance assigned to the adapter weights: a higher alpha value implies that the adapter weight matrix, multiplied by α / r , receives more emphasis. Consequently, this change allows for more weight to be given to the new input data, enabling the adapter to influence the model more significantly compared to its prior knowledge.
The adapter configuration code is shown in Appendix B Listing A6.
In this adapter setting, the number of trainable parameters is 31,080,448 which corresponds to 2.26% of the 1,377,552,384 total model parameters. It is important to note that the total also includes the parameters added through the adapter; therefore, the total is greater than the 1,346,471,936 parameters of the original model, since the additional 31,080,448 are the specific parameters introduced by the adapter’s low-rank matrices.
The adapter training configuration can be found in Appendix B Listing A7.
In addition to the mentioned configuration, another key aspect of the training process was setting the learning rate to 2.5 × 10⁻⁵ (0.000025). This choice was motivated by the need to balance the learning process precisely. A well-calibrated learning rate is essential in transfer learning scenarios, where the model, already trained on a general task, needs fine-tuning to adapt to a new dataset. A relatively low value like 2.5 × 10⁻⁵ is ideal in this context: it is small enough to prevent excessive adjustments to the model weights, which could lead to overfitting on the training data, but large enough to ensure effective learning across epochs.
For evaluation during training, the loss function data can be viewed in Table 2.
Analyzing the results, represented in the graph in Figure 6, it can be seen that the training loss is particularly high initially due to an underfitting situation. As training progresses, a gradual convergence of the two metrics is observed, with a gradual and steady decrease in both training and validation loss. This trend suggests that the model is improving its ability to generalize from the training data to the validation data, indicating positive adaptation without clear signs of overfitting. Toward the end of the training process, the stabilization and convergence of the losses indicate a balance between learning from the training data and generalizing to the validation data, suggesting that the model is reaching an optimal level of performance.
The perplexity after training was 1.57, indicating that the training process significantly improved the model’s ability to predict data with a structure similar to that of the training data.
It is worth noting that the model tuned with the adapter method produces the same answer to a simple prompt like the sample prompt shown in Section 4.3 and similarly produces imprecise, although different, answers to more complex prompts, as can be observed in the examples reported in Appendix A.

4.5. Results Discussion

The comparison between fine-tuning experiments and the use of adapters in the context of transfer learning has provided interesting insights into the effectiveness and efficiency of both approaches in adapting a language model, as seen in Table 3.
In terms of fine-tuning, the decision to unlock only the final layers of the model was a strategic choice that balanced the computational burden with a significant improvement in performance, as indicated by the reduction in perplexity (−3.03). This approach allowed for a high degree of quality in adapting the model fit, while avoiding overfitting and ensuring proper generalization from the training data to the validation data.
On the other hand, adaptation using adapters showed that careful configuration of the hyperparameters, especially choosing a reduced rank for the adapter matrices and increasing the scaling factor, can yield promising results even with a limited number of trainable parameters. This approach resulted in an effective reduction of perplexity (−3.07) and convergence of loss metrics, highlighting the importance of balancing the introduction of new task-specific knowledge with the retention of the model’s prior knowledge.
Both approaches showed an adequate ability to adapt the model to specific tasks, and despite differences in the nature and configuration of the methods used, they obtained quite similar results in the perplexity value. However, it was shown that the model’s ability to understand complex instructions remains a challenge, indicating the need for further improvements.
In summary, the detailed analysis of the experiments suggests that the choice between fine-tuning and the use of adapters, although there is a relevant difference in favor of adapters in terms of efficiency, depends on the specific situation and the available resources. Both approaches are capable of offering significant advantages in adapting language models to specialized tasks.
It is worth noting that the systematic automatic introduction of structured comments in generated code, such as those shown in the experiments, has a relevant impact on the software development process: it improves software readability and related phases such as software maintenance and debugging, since the comments ultimately represent a form of automatic documentation.

5. Conclusions

This study explores the potential of generative models for programming languages and provides a methodology for applying the fine-tuning and adapter techniques in customizing pre-trained models for specific code generation tasks. The results of the fine-tuning and adapter experiments, which show a significant perplexity reduction of 3.03 and 3.07 points, respectively, make it possible to draw conclusions confirming the effectiveness of these transfer learning techniques in integrating and maintaining structured patterns in generated code. Fine-tuning, with its ability to adapt the final layers of the model, has proven to be an efficient strategy for reducing perplexity and balancing the computational load. On the other hand, the use of adapters, through careful hyperparameter configuration, offers a way to introduce new knowledge into the model while preserving its structure and prior knowledge, representing a balance between the demands of the new task and the existing capabilities of the model.
Despite these successes, both approaches showed limitations in understanding more complex instructions, indicating the need for further development in this area. The results suggest that more extensive fine-tuning, the use of a different generative model, or the introduction of a more diverse training dataset could improve the model’s ability to interpret and generate source code in response to complex instructions. The observed limitations, such as the handling of complex instructions and the correctness and consistency of the code, are intrinsic to the DeepSeek model and to the adequacy of the dataset for training a generative model for source code, rather than to the proposed technique for encoding structural constraints, which shows satisfactory performance.
Further research could explore how generative models in the area of programming can not only produce consistent code but also produce code that adheres to higher coding standards, ensuring readability and maintainability. Areas of interest could include the integration of software engineering principles and coding standards into the training processes of these models to generate code that is not only efficient, but also elegant and compliant with current programming standards.
Finally, although the current focus of research is mostly on code generation, another line of future research could be the exploration of the use of generative models for other aspects of the software development cycle, such as automated debugging, code optimization, and automated documentation. In addition to improving the efficiency of software development, these developments may provide new opportunities for the growing role of artificial intelligence in the software design and development process.

Author Contributions

A.M. and S.T. conceptualized the study, S.T. developed the experiments, V.F. and A.M. supervised the study, all the authors contributed to writing the paper, V.F. critically reviewed the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by funds from University of Perugia Project MiRA and Knowledge and Information Technology Laboratory.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original data used in the study are openly available at HuggingFace. The dataset “python code instruction 18k alpaca” is available online: https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca, accessed on 3 October 2024. The code used to process the data and to implement the presented method and experiments is available in Appendix B.

Acknowledgments

We would like to thank Mattia Polticchia for his important suggestions during this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Generative Sample Prompts and Answers

  • Alpaca18k Sample Record.
  •     Below is an instruction that describes a task. Write a response
  •     that appropriately completes the request.
  •     ### Instruction:
  •     Automate a Python program to check the length of each word in a
  •     given string.
  •     ### Input:
  •     ### Output:
  •     def lengthWord(inp):
  •         result = []
  •         for word in inp.split(’ ’):
  •             result.append(len(word))
  •         return result
  •     # Driver Code
  •     inp = "Generating a code generation task instructions"
  •     print(lengthWord(inp))
  • Tuning data set: Sample Record after structural pattern encapsulation.
  •     ### Istruzione:
  •     Automate a Python program to check the length of each word in a
  •     given string.
  •     ### Ingresso:
  •     ### Uscita:
  •     # Definizione della funzione lengthWord
  •     def lengthWord(inp):
  •         result = []
  •         # For loop
  •         for word in inp.split(’ ’):
  •             result.append(len(word))
  •         return result
  •     # Driver Code
  •     inp = "Generating a code generation task instructions"
  •     print(lengthWord(inp))
  •     <EOS>
  • Prompt 01 sample.
  •     ### Istruzione:
  •     Python function named funzione_anno to check if year is a leap year.
  •     ### Ingresso:
  •     ### Uscita:
  • Prompt 01: answer before tuning.
  •     ### Istruzione:
  •     Python function named funzione_anno to check if year is a leap
  •     year.
  •     ### Ingresso:
  •     ### Uscita:
  •     ### Esempio di input:
  •     ### Esempio di output:
  •     ### Esempio di input:
  • Prompt 01: answer after Fine-tuning.
  •     ### Istruzione:
  •     Python function named funzione_anno to check if year is a leap year.
  •     ### Ingresso:
  •     ### Uscita:
  •     # Definizione della funzione funzione_anno
  •     def funzione_anno(year):
  •     # Condizione if
  •     if year % 4 == 0 and (year % 100 != 0 or year % 400 == 0): return True
  •     else:
  •     return False
  • Prompt 01: answer after Adapter tuning.
  •     ### Istruzione:
  •     Python function named funzione_anno to check if year is a leap year.
  •     ### Ingresso:
  •     ### Uscita:
  •     # Definizione della funzione funzione_anno def funzione_anno(year):
  •     # Condizione if
  •     if (year % 4 == 0 and year % 100 != 0) or (year % 400 == 0):
  •     return True
  •     else:
  •     return False
Complex prompt. It is also interesting to examine a more complex prompt to evaluate not only the model’s ability to preserve the structure of the response and comments, but also its ability to correctly interpret the given instructions. To this end, consider the following prompt:
  • Prompt 02 Sample.
  •     ### Istruzione:
  •     Generate two random numbers using randint and then sum them,
  •     then multiply first number with the sum.
  •     If the sum is greater than 10, print‘Greater‘
  •     ### Ingresso:
  •     ### Uscita:
  • Prompt 02: answer after Fine-tuning.
  •     ### Istruzione:
  •     Generate two random numbers using randint and then sum them, then multiply first number with the sum. If the sum is greater than 10, print ‘Greater‘
  •     ### Ingresso:
  •     ### Uscita:
  •     # Definizione della funzione sum_and_multiply
  •         def sum_and_multiply(num1, num2):
  •             sum = num1 + num2
  •         # Condizione if
  •         if if sum > 10:
  •             print(‘Greater‘)
  •         else:
  •             print(num1 * sum)
  • Prompt 02: answer after Adapter tuning.
  •     ### Istruzione:
  •     Generate two random numbers using randint and then sum them, then multiply first number with the sum. If the sum is greater than 10, print ‘Greater‘.
  •     ### Ingresso:
  •     ### Uscita:
  •     import random
  •     a = random.randint(1, 10)
  •     b = random.randint(1, 10)
  •     sum = a + b
  •     # Condizione if
  •     if sum > 10:
  •         print(‘Greater‘)
As can be seen, both for fine-tuning and adapter methods, the model shows a reduced ability to accurately understand more complex instructions while maintaining a strong ability to retain the output structure and comments.

Appendix B. Code Listings

  • Listing A1. Script for adding comments to code while maintaining correct indentation.
import re

def add_comment_to_functions(prompt):
    # split code into lines
    lines = prompt.split('\n')
    modified_lines = []
    for line in lines:
        # check if the line contains a function definition
        match = re.match(r"def\s+(\w+)\s*\(", line)
        if match:
            function_name = match.group(1)
            # find the indentation level of the function definition
            indentation = len(line) - len(line.lstrip())
            modified_lines.append(f"{' ' * indentation}# Definizione della funzione {function_name}")
            modified_lines.append(line)
        else:
            # check if the line contains while, for, if
            if line.strip().startswith(('while', 'for', 'if')):
                indentation = len(line) - len(line.lstrip())
                if line.strip().startswith('if') and not line.strip().startswith('elif'):
                    modified_lines.append(f"{' ' * indentation}# Condizione if")
                elif line.strip().startswith('for'):
                    modified_lines.append(f"{' ' * indentation}# Ciclo for")
                elif line.strip().startswith('while'):
                    modified_lines.append(f"{' ' * indentation}# Ciclo while")
                modified_lines.append(line)
            else:
                modified_lines.append(line)
    modified_code = '\n'.join(modified_lines)
    return modified_code
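A quick usage sketch (ours, not part of the original listing) illustrates the annotation behaviour on a short snippet:
sample = "def somma(a, b):\n    return a + b"
print(add_comment_to_functions(sample))
# Prints:
# # Definizione della funzione somma
# def somma(a, b):
#     return a + b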
  • Listing A2. Script for modifying names within the prompt field.
def replace_labels(examples):
    prompt_text = examples["prompt"]
    instruction_start = prompt_text.find("### Instruction:")
    instruction_end = prompt_text.find("\n\n###END")
    if instruction_start != -1 and instruction_end != -1:
        examples["prompt"] = (
            prompt_text[instruction_start:instruction_end]
            .replace("Instruction", "Istruzione")
            .replace("Input", "Ingresso")
            .replace("Output", "Uscita")
        )
    return examples
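A minimal sketch of how this function would typically be applied with the Hugging Face datasets library; the dataset identifier comes from reference [30], and we assume each record exposes the "prompt" field the function expects:
from datasets import load_dataset

# Load the Alpaca-style Python instruction dataset (reference [30])
dataset = load_dataset("iamtarun/python_code_instructions_18k_alpaca", split="train")

# Translate the section labels of every prompt into Italian
dataset = dataset.map(replace_labels)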
  • Listing A3. Function to find the number of trainable parameters.
def print_trainable_parameters(model):
    # prints the number of trainable parameters in the model
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
  • Listing A4. Freezing Weights for Fine-Tuning.
for name, param in model.named_parameters():
    param.requires_grad = False

# list of layers to unlock
layers_to_unfreeze = [21, 22, 23]
for name, param in model.named_parameters():
    if any(f"model.layers.{i}." in name for i in layers_to_unfreeze):
        param.requires_grad = True
model.model.norm.weight.requires_grad = True
model.lm_head.weight.requires_grad = True
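After freezing, the helper from Listing A3 can be used as a consistency check (this call is our addition); it should report that only the unfrozen layers, the final normalisation and the lm_head remain trainable, on the order of the 16.18% listed in Table 3:
# Verify which parameters remain trainable after freezing
print_trainable_parameters(model)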
  • Listing A5. Training Configuration for Fine-Tuning.
training_args = TrainingArguments(
    output_dir="./deepseek-coder-trained",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    save_steps=100,
    save_total_limit=1,
    gradient_accumulation_steps=8,
    seed=seed,
    evaluation_strategy="steps",
    eval_steps=100,
    eval_accumulation_steps=1,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,
    logging_steps=100,
)
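For context, a configuration like the one above is normally passed to a Hugging Face Trainer. The sketch below is illustrative only; the dataset split names and the data collator are assumptions rather than part of the original code:
from transformers import Trainer, DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    # causal language modelling, so the masked-LM objective is disabled
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()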
  • Listing A6. Adapter Configuration.
config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "self_attn.q_proj",
        "self_attn.k_proj",
        "self_attn.v_proj",
        "self_attn.o_proj",
        "mlp.gate_proj",
        "mlp.up_proj",
        "mlp.down_proj",
        "lm_head"],
    bias="none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
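As a sanity check of our own (not in the original listing), the helper from Listing A3 can be called again after wrapping the model; only the injected LoRA weights should require gradients, consistent with the 2.26% trainable parameters reported in Table 3:
# Only the LoRA adapter weights should now be trainable
print_trainable_parameters(model)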
  • Listing A7. Training Configuration for the Adapter.
training_args = TrainingArguments(
    output_dir="./deepseek-coder-trainedadapter",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    save_steps=250,
    save_total_limit=1,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    eval_steps=250,
    eval_accumulation_steps=1,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,
    learning_rate=2.5e-5,
    seed=seed,
)
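Since both configurations are evaluated with perplexity, the minimal sketch below (our illustration, assuming the evaluation loss is the average cross-entropy) shows how post-training perplexity values such as those in Table 3 can be derived from the Trainer output:
import math

# Perplexity is the exponential of the average cross-entropy evaluation loss
eval_metrics = trainer.evaluate()
perplexity = math.exp(eval_metrics["eval_loss"])
print(f"perplexity: {perplexity:.2f}")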

References

  1. Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A Comprehensive Survey on Transfer Learning. Proc. IEEE 2020, 109, 43–76. [Google Scholar] [CrossRef]
  2. Github. DeepSeek Coder. Available online: https://deepseekcoder.github.io (accessed on 3 October 2024).
  3. Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A survey of transformers. AI Open 2022, 3, 111–132. [Google Scholar] [CrossRef]
  4. Radford, A.; Narasimhan, K. Improving Language Understanding by Generative Pre-Training. OpenAI Report; 2018. Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 3 October 2024).
  5. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  6. OpenAI. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774. [Google Scholar]
  7. Dehaerne, E.; Dey, B.; Halder, S.; De Gendt, S.; Meert, W. Code Generation Using Machine Learning: A Systematic Review. IEEE Access 2022, 10, 82434–82455. [Google Scholar] [CrossRef]
  8. Yan, D.; Gao, Z.; Liu, Z. A Closer Look at Different Difficulty Levels Code Generation Abilities of ChatGPT. In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; pp. 1887–1898. [Google Scholar] [CrossRef]
  9. Zhang, X.; Jiang, Y.; Wang, Z. Analysis of Automatic Code Generation Tools based on Machine Learning. In Proceedings of the 2019 IEEE International Conference on Computer Science and Educational Informatization (CSEI), Kunming, China, 16–19 August 2019; pp. 263–270. [Google Scholar] [CrossRef]
  10. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.P.d.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374. [Google Scholar]
  11. Wang, Y.; Wang, W.; Joty, S.; Hoi, S.C. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, Punta Cana, Dominican Republic, 7–11 November 2021; Moens, M.F., Huang, X., Specia, L., Yih, S.W.t., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 8696–8708. [Google Scholar] [CrossRef]
  12. Naik, P.; Nelaballi, S.; Pusuluri, V.; Kim, D.K. Deep Learning-Based Code Refactoring: A Review of Current Knowledge. J. Comput. Inf. Syst. 2022, 64, 314–328. [Google Scholar] [CrossRef]
  13. López Espejel, J.; Yahaya Alassan, M.S.; Chouham, E.M.; Dahhane, W.; Ettifouri, E.H. A comprehensive review of State-of-The-Art methods for Java code generation from Natural Language Text. Nat. Lang. Process. J. 2023, 3, 100013. [Google Scholar] [CrossRef]
  14. Shi, E.; Wang, Y.; Zhang, H.; Du, L.; Han, S.; Zhang, D.; Sun, H. Towards Efficient Fine-Tuning of Pre-trained Code Models: An Experimental Study and Beyond. In Proceedings of the ISSTA 2023—32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, Seattle, WA, USA, 17–21 July 2023; pp. 39–51. [Google Scholar] [CrossRef]
  15. Chi, K.; Li, C.; Ge, J.; Luo, B. An Empirical Study on Code Search Pre-trained Models: Academic Progresses vs. Industry Requirements. In Proceedings of the Internetware ’24—15th Asia-Pacific Symposium on Internetware, Macau, China, 24–26 July 2024; pp. 41–50. [Google Scholar] [CrossRef]
  16. Odeh, A.; Odeh, N.; Mohammed, A. A Comparative Review of AI Techniques for Automated Code Generation in Software Development: Advancements, Challenges, and Future Directions. TEM J. 2024, 13, 726–739. [Google Scholar] [CrossRef]
  17. DeepSeek. DeepSeek AI Ltd., Hangzhou, China. Available online: https://www.deepseek.com/ (accessed on 3 October 2024).
  18. Gao, J.; Heng, F.; Yuan, Y.; Liu, Y. A novel machine learning method for multiaxial fatigue life prediction: Improved adaptive neuro-fuzzy inference system. Int. J. Fatigue 2024, 178, 108007. [Google Scholar] [CrossRef]
  19. Gao, J.; Liu, Y.; Yuan, Y.; Heng, F. Residual Strength Modeling and Reliability Analysis of Wind Turbine Gear under Different Random Loadings. Mathematics 2023, 11, 4013. [Google Scholar] [CrossRef]
  20. Gao, J.X.; Heng, F.; Yuan, Y.P.; Liu, Y.Y. Fatigue Reliability Analysis of Composite Material Considering the Growth of Effective Stress and Critical Stiffness. Aerospace 2023, 10, 785. [Google Scholar] [CrossRef]
  21. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  22. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Human Language Technologies, Volume 1 (Long and Short Papers), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019. [Google Scholar] [CrossRef]
  23. Yu, Y.; Zuo, S.; Jiang, H.; Ren, W.; Zhao, T.; Zhang, C. Fine-Tuning Pre-trained Language Model with Weak Supervision: A Contrastive-Regularized Self-Training Approach. In Human Language Technologies, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, Online, 6–11 June 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 1063–1077. [Google Scholar] [CrossRef]
  24. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; de Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-Efficient Transfer Learning for NLP. arXiv 2019, arXiv:1902.00751. [Google Scholar]
  25. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  26. Yun, J.; Kim, B.; Kim, J. Weight Decay Scheduling and Knowledge Distillation for Active Learning. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVI; Springer: Berlin/Heidelberg, Germany, 2020; pp. 431–447. [Google Scholar] [CrossRef]
  27. Vilares Ferro, M.; Doval, Y.; Ribadas Pena, F.; Darriba Bilbao, V. Early stopping by correlating online indicators in neural networks. Neural Netw. 2022, 159, 109–124. [Google Scholar] [CrossRef] [PubMed]
  28. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations, Virtual Event, 25–29 April 2022. [Google Scholar]
  29. HuggingFace. PEFT Documentation: LoRa. Available online: https://huggingface.co/docs/peft/conceptual_guides/lora (accessed on 3 October 2024).
  30. HuggingFace. Datasets: Python Code Instruction 18k Alpaca. Available online: https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca (accessed on 3 October 2024).
  31. Song, X.; Cohn, T.; Specia, L. BLEU Deconstructed: Designing a Better MT Evaluation Metric. Int. J. Comput. Linguist. Appl. 2013, 4, 29–44. [Google Scholar]
  32. Barbella, M.; Tortora, G. ROUGE Metric Evaluation for Text Summarization Techniques. 2022. Available online: https://ssrn.com/abstract=4120317 (accessed on 3 October 2024).
  33. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA; 2005; pp. 65–72. [Google Scholar]
  34. Jelinek, F.; Mercer, R.L.; Bahl, L.R.; Baker, J.K. Perplexity—A measure of the difficulty of speech recognition tasks. J. Acoust. Soc. Am. 1977, 62, S63. [Google Scholar] [CrossRef]
  35. Bochman, A. The Markov assumption: Formalization and impact. In Proceedings of the IJCAI ’13—Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, China, 3–9 August 2013; AAAI Press: Menlo Park, CA, USA, 2013; pp. 782–788. [Google Scholar]
Figure 2. DeepSeek model training process.
Figure 3. Approximate weight update in LoRA.
Figure 5. Graph of training and validation loss (fine-tuning).
Figure 6. Graph of training and validation loss (adapter).
Table 1. Training and validation loss values (fine-tuning).
Step    Training Loss    Validation Loss
100     0.651200         0.537167
200     0.520300         0.511235
300     0.508900         0.498147
400     0.511700         0.491705
500     0.499100         0.486529
600     0.501200         0.483028
700     0.487700         0.478938
Table 2. Training and validation loss values (adapter).
Step    Training Loss    Validation Loss
250     No log           0.510076
500     0.609700         0.475910
750     0.609700         0.468542
1000    0.487800         0.460246
1250    0.487800         0.456496
1500    0.480600         0.455638
1750    0.480600         0.452426
2000    0.463100         0.451217
2250    0.463100         0.450196
Table 3. Fine-tuning and adapter methods performance comparison.
Method         Trainable % Parameters    Validation Loss Plateau at #Steps    Perplexity Pre-Train    Perplexity Post-Train
Fine-Tuning    16.18%                    600                                  4.64                    1.61
Adapter        2.26%                     500                                  4.64                    1.57
