1. Introduction
The evolution of natural language processing (NLP) took a significant leap forward with the advent of deep learning. Before deep learning, NLP relied heavily on rule-based methods and traditional machine learning, which required extensive feature engineering and domain expertise. Deep learning changed this landscape by introducing models that could automatically learn to represent language from vast amounts of text data. These models, known as neural networks, are composed of layers of interconnected nodes that mimic the human brain’s structure and functioning. They have the remarkable ability to learn complex patterns in data, enabling them to grasp the nuances of human language, including syntax, context, and even sentiment. With deep learning, NLP applications like speech recognition, language translation, and sentiment analysis saw unprecedented improvements in accuracy and efficiency [1]. This shift marked a new era in NLP, where models can understand and generate human-like text, opening the door to more sophisticated and human-centric applications.
The introduction of the transformer model in 2017 by Vaswani et al. [1] marked another groundbreaking moment in NLP’s evolution. The transformer model, unlike its predecessors, does not rely on sequential data processing (where input data are processed in order) but instead uses a mechanism known as ‘self-attention’ to process all parts of the input data simultaneously. This allows the model to capture the context of words in a sentence more effectively, leading to a significant boost in the quality of tasks like translation and text generation. Transformers quickly became the foundation for a new generation of large language models, such as BERT [2], GPT [3], and their derivatives, which have set new standards in a wide range of NLP tasks. These models are not only more accurate but also more versatile and capable of handling a variety of language processing tasks without task-specific fine-tuning. The transformer’s architecture has thus revolutionized NLP, enabling more natural, efficient, and contextually aware language processing and generation and paving the way for more advanced and nuanced language understanding by machines.
Transformers, with their unique self-attention mechanism, have been pivotal in handling the complexities of human language, allowing models to consider the context of each word in a sentence. This capability has enabled the development of large-scale models like BERT and GPT, together with their multilingual variants, which have set new benchmarks in a variety of NLP tasks, including machine translation. However, despite these advancements, a significant challenge persists in applying these models to low-resource languages [4]. Low-resource languages, characterized by limited available digital text for training, pose a unique challenge for deep learning-based models, which typically require vast amounts of data to achieve high performance. This gap in performance for languages with fewer resources exacerbates global digital inequalities and limits the reach of NLP technologies [5].
In response to this challenge, our research proposes a novel approach to fine-tuning large multilingual language models for low-resource machine translation. We utilize parameter-efficient adapter methods, which involve incorporating small, trainable modules within the existing structure of pre-trained transformer models. These adapters, specifically designed for each target low-resource language, provide a means to customize the model’s behavior with minimal adjustments to the overall architecture. This approach offers a balance between the model’s general capabilities and the specific nuances of each language without the computational and data demands of full model retraining.
Our study aims to evaluate the effectiveness of these adapter methods in enhancing translation accuracy for low-resource languages and to assess their efficiency in comparison to traditional fine-tuning techniques. By doing so, we seek to not only enhance the practical deployment of advanced machine translation systems but also contribute to the broader goal of linguistic inclusivity in the digital era.
This article begins with an overview of the transformer model and its impact on the field of NLP, particularly in machine translation. We then delve into the challenges associated with low-resource languages and discuss the potential of adapter methods as a solution. Following this, our methodology is detailed, highlighting the architectural design of the adapters and the fine-tuning process. We present a comprehensive analysis of our experiments, demonstrating the impact of our approach on translation quality and resource efficiency. Finally, we discuss the broader implications of our findings for the future of NLP and the integration of low-resource languages in digital platforms.
2. Related Works
The introduction of the transformer model by Vaswani et al. [1] led to a major shift in NLP. Its self-attention mechanism allows for the parallel processing of sequences, enhancing efficiency and effectiveness in handling long-range dependencies in text. This architecture forms the foundation of prominent language models like BERT and GPT, which have demonstrated remarkable performance across a range of NLP tasks.
Despite these advancements, translating low-resource languages remains challenging [5]. The scarcity of training data for these languages limits the effectiveness of traditional neural machine translation (NMT) approaches. Studies have explored various strategies to mitigate this, including transfer learning, where a model trained on high-resource languages is adapted for low-resource languages, and multilingual model training [6,7,8]. Recent research has focused on parameter-efficient methods [9] for adapting large models to new tasks with minimal training. This includes techniques like adapter layers, which are small trainable modules inserted into a pre-trained model. These approaches offer a balance between retaining the knowledge of the original model and adapting to the specifics of a new task, proving particularly beneficial for low-resource scenarios. Various studies have demonstrated the effectiveness of these techniques in NLP. For instance, adapting a multilingual BERT model using language-specific adapters has shown promising results in improving translation for low-resource languages [10]. Comparative analyses indicate that these methods can achieve close to state-of-the-art performance while requiring significantly fewer resources for training [11]. The field continues to evolve, with emerging trends focusing on unsupervised and semi-supervised learning methods, which could further enhance the translation quality for low-resource languages. Additionally, advanced transfer learning techniques, which leverage knowledge from related high-resource languages, are seen as a promising direction for future research.
3. Materials and Methods
Machine translation systems use parallel corpora developed for source and target languages. Translation systems for languages such as English, French, and German possess a rich data repository, which positively influences the quality of translation. Consequently, the quality of machine translation among these languages is notably high. However, efforts to enhance translation quality in low-resource languages like Turkish continue to progress. This study specifically focuses on the Turkish–English language pair. Within this research, the TR-EN dataset from the WMT17 benchmark is utilized [12]. This dataset comprises 200k parallel sentences (Table 1).
Large language models that are trained multilingually, such as mBART [13,14], are commonly used in the fine-tuning of machine translation systems. The mBART language model was utilized in this research. mBART is a sequence-to-sequence denoising autoencoder that has been pre-trained on large-scale monolingual corpora in numerous languages. mBART represents a multilingual adaptation of the BART model (Bidirectional and Auto-Regressive Transformers) [15] and is capable of performing text comprehension and generation tasks across multiple languages. The mBART model features 12 layers in both its encoder and decoder components. Each layer employs 16 attention heads and a hidden vector of size 1024, and the inner (feed-forward) dimension of the layers in both the encoder and decoder blocks is 4096. The mBART language model has approximately 610 million parameters. In this study, TR-EN translation systems were developed using the mBART model by employing various fine-tuning and adapter methodologies.
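For reference, the pre-trained mBART checkpoint can be loaded from the Huggingface library and the architectural values stated above can be verified with a few lines of code. This is a minimal sketch; the checkpoint identifier facebook/mbart-large-cc25 is an assumption, since the paper only states that an mBART model with roughly 610 million parameters was used.

```python
# Minimal sketch: load an mBART checkpoint and inspect its size.
# The checkpoint name below is an assumption; the paper only reports
# an mBART model with ~610M parameters.
from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

# Verify the architecture described in the text: 12 encoder and 12 decoder
# layers, hidden size 1024, feed-forward size 4096, ~610M parameters.
cfg = model.config
print(cfg.encoder_layers, cfg.decoder_layers, cfg.d_model, cfg.encoder_ffn_dim)
print(f"total parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M")
```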
In this study, in addition to full fine-tuning (FFT), parameter-efficient methods such as bottleneck adapters [16] and LoRA [10] were also utilized.
3.1. Bottleneck Adapter
Bottleneck adapters are a transfer learning technique used in NLP and other machine learning applications [17,18]. They are designed to fine-tune pre-trained models more efficiently (Figure 1).
For each layer of a model, a bottleneck adapter is added. This adapter typically consists of two separate linear (or fully connected) layers: a contraction layer and an expansion layer. Suppose the original layer’s dimension is $d$; the contraction layer then has dimensions of $d \times r$ and the expansion layer has dimensions of $r \times d$. Here, $r$ is the bottleneck size and is usually chosen such that $r \ll d$. An activation function, typically a Rectified Linear Unit (ReLU) or a similar nonlinear activation function, is placed between the contraction and expansion layers. The adapter is added to the original layers of the model. Suppose the original layer’s output is $h$; the output of the adapter $h'$ is computed as Equation (1):

$$h' = h + W_{\text{up}}\,\sigma\!\left(W_{\text{down}}\,h + b_{\text{down}}\right) + b_{\text{up}} \tag{1}$$

where $W_{\text{down}}$ and $W_{\text{up}}$ are the weights of the contraction and expansion layers, $b_{\text{down}}$ and $b_{\text{up}}$ are bias terms, and $\sigma$ is the activation function.
During fine-tuning, only the parameters of the adapter layers (i.e., $W_{\text{down}}$, $W_{\text{up}}$, $b_{\text{down}}$, and $b_{\text{up}}$) are updated. All other parameters of the model remain fixed. This structure is designed to allow for necessary adaptations to specific tasks without altering the main parameters of the model. The use of bottleneck adapters preserves the model’s original structure and learned knowledge while enabling task-specific fine-tuning. This significantly reduces computational costs and training time, especially for large and complex models.
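A minimal PyTorch sketch of such a bottleneck adapter is given below. It follows Equation (1) directly; the use of ReLU as $\sigma$ and the residual connection match the description above, while the class and variable names and the example sizes are illustrative rather than the exact implementation used in the experiments.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter: h' = h + W_up * sigma(W_down * h + b_down) + b_up."""

    def __init__(self, d: int, r: int):
        super().__init__()
        self.down = nn.Linear(d, r)   # contraction layer (d -> r), includes b_down
        self.up = nn.Linear(r, d)     # expansion layer (r -> d), includes b_up
        self.act = nn.ReLU()          # sigma in Equation (1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Only these adapter parameters are trained; the surrounding model stays frozen.
        return h + self.up(self.act(self.down(h)))

# Example: an adapter for mBART's hidden size (d = 1024) with bottleneck size r = 16.
adapter = BottleneckAdapter(d=1024, r=16)
h = torch.randn(2, 128, 1024)         # (batch, tokens, hidden)
print(adapter(h).shape)               # torch.Size([2, 128, 1024])
```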
3.2. LoRA (Low-Rank Adaptation of Large Language Models)
LoRA operates by adjusting the weights of pre-trained deep learning models through low-rank matrices [10] (Figure 2).
Conceptually, if we consider the original weights of the model as $W_0$, the primary modification is made by adding a small, low-rank update $\Delta W$ to these weights. This approach maintains the integrity of the model’s structure while adapting it to new tasks.
LoRA employs two low-dimensional matrices $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times d}$, where $d$ is the dimension of the original weight matrix and $r$ is the chosen rank value. Typically, $r$ is selected to be much smaller than $d$, i.e., $r \ll d$. The low-rank modification matrix $\Delta W$ is computed as $\Delta W = AB$; the product $AB$ yields a low-rank matrix. This low-rank modification is then added to the original weight matrix to obtain the fine-tuned new weight matrix $W'$, as shown in Equation (2):

$$W' = W_0 + \Delta W = W_0 + AB \tag{2}$$
The advantage of this approach is that instead of updating the entire weight matrix, only the two smaller matrices, $A$ and $B$, are trained. This significantly reduces computation and memory requirements, particularly in large models. LoRA maintains the core structure and learned representations of the model, making minimal but necessary adjustments to adapt the model to new tasks.
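The following PyTorch sketch illustrates Equation (2) for a single linear layer. The frozen weight $W_0$, the rank $r$, and the matrices $A$ and $B$ follow the notation above; the class name and initialization are illustrative, and the usual $\alpha/r$ scaling from the original LoRA formulation is omitted for simplicity.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: W' = W0 + A @ B."""

    def __init__(self, base: nn.Linear, r: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # W0 (and its bias) stay frozen
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape
        # A (d x r) and B (r x d), following the notation in the text.
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, d_out))  # zero init => delta_W = 0 at start
        # Note: the original LoRA paper additionally scales the update by alpha / r.

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = x @ self.A @ self.B               # x @ delta_W with delta_W = A @ B
        return self.base(x) + delta

# Example: wrap a 1024 x 1024 projection (as in mBART's attention) with rank r = 16.
lora = LoRALinear(nn.Linear(1024, 1024), r=16)
x = torch.randn(2, 128, 1024)
print(lora(x).shape)                              # torch.Size([2, 128, 1024])
```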
3.3. Proposed Method
The proposed model is based on combining bottleneck adapters with LoRA (Figure 3).
Bottleneck adapters are small modules added to specific layers of the model; they project the hidden representation down to a lower dimension and back up again, so that only a small number of additional parameters need to be trained. This reduces the computational load and accelerates the learning process, which is particularly valuable when adapting large models. LoRA, in turn, reduces the number of trainable parameters by representing the weight updates of selected layers as low-rank matrices. This significantly reduces the number of parameters that need to be updated during fine-tuning while preserving the model’s capacity, which lowers computational costs and speeds up training. LoRA has been found to be especially effective at improving parameter efficiency in large models, such as those used in natural language processing: its low-rank updates reduce memory and computational requirements while preserving, or even enhancing, model performance. This is a significant advantage, particularly in resource-constrained environments or situations with limited data availability.
In summary, the combination of bottleneck adapters and LoRA enhances parameter efficiency and reduces computational load, leading to faster and more efficient learning processes. This approach enables high performance even in challenging applications, such as low-resource language pairs.
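A simplified PyTorch sketch of how the two mechanisms can be combined inside a single transformer sub-layer is shown below. The frozen base projection, the rank r = 16 for LoRA, and the residual bottleneck follow the descriptions in Sections 3.1 and 3.2; the bottleneck size, the wiring, and the class name are illustrative assumptions, since the actual experiments insert these modules into mBART via the AdapterHub library.

```python
import torch
import torch.nn as nn

class LoRAPlusBottleneck(nn.Module):
    """One frozen projection with a LoRA update, followed by a residual bottleneck adapter.

    Illustrative combination of the two mechanisms; not the exact library implementation
    used in the experiments (hyperparameters here are assumptions).
    """

    def __init__(self, d: int = 1024, lora_r: int = 16, bottleneck_r: int = 64):
        super().__init__()
        self.base = nn.Linear(d, d)                 # frozen pre-trained weight W0
        for p in self.base.parameters():
            p.requires_grad_(False)
        # LoRA update: delta_W = A @ B with rank lora_r
        self.A = nn.Parameter(torch.randn(d, lora_r) * 0.01)
        self.B = nn.Parameter(torch.zeros(lora_r, d))
        # Bottleneck adapter: down-project, nonlinearity, up-project, residual
        self.down = nn.Linear(d, bottleneck_r)
        self.up = nn.Linear(bottleneck_r, d)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.base(x) + x @ self.A @ self.B      # W0 x + (A B) x
        return h + self.up(self.act(self.down(h)))  # residual bottleneck adapter

block = LoRAPlusBottleneck()
trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
total = sum(p.numel() for p in block.parameters())
print(f"trainable share of this block: {trainable / total:.1%}")  # only adapter/LoRA weights train
```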
3.4. Evaluation
The BLEU (Bilingual Evaluation Understudy) score is an automatic metric used to measure the quality of machine translation [19] and is the most commonly used metric for evaluating machine translation systems [20]. This score mathematically evaluates how similar the translation result is to reference translations produced by human translators. Essentially, it looks at how well the machine-generated translation matches the reference translations in terms of words and phrases. The BLEU score is calculated by comparing the machine translation output with one or more reference translations. The n-grams in the translation that also appear in the references are counted, these counts are normalized by the total number of n-grams in the candidate translation, and the resulting precisions are combined and multiplied by a brevity penalty that penalizes overly short translations, as shown in Equation (3):

$$\text{BLEU} = \text{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right) \tag{3}$$

where $p_n$ is the modified n-gram precision, $w_n$ is the weight for each n-gram length (typically $w_n = 1/N$ with $N = 4$), and BP is the brevity penalty.
The BLEU score typically ranges from 0 to 100. A higher score indicates that the machine translation is closer to the human translation.
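For concreteness, a corpus-level BLEU score can be computed with an off-the-shelf toolkit such as sacreBLEU. The paper does not state which BLEU implementation was used, so the snippet below, with invented example sentences, is only an illustration of the metric.

```python
# Illustrative corpus-level BLEU computation with sacreBLEU
# (the paper does not specify which BLEU implementation was used).
import sacrebleu

hypotheses = ["the cat sat on the mat", "there is a book on the table"]
references = [["the cat is sitting on the mat", "a cat sat on the mat"],
              ["there is a book on the table", "a book lies on the table"]]

# sacrebleu expects one list of reference strings per reference set,
# i.e. the per-sentence reference lists transposed.
refs_transposed = list(map(list, zip(*references)))
bleu = sacrebleu.corpus_bleu(hypotheses, refs_transposed)
print(f"BLEU = {bleu.score:.2f}")   # on the 0-100 scale used in this paper
```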
4. Results
This study utilized the mBART pre-trained language model and the WMT17 (TR-EN) benchmark dataset. The dataset and language model used are provided as open sources in the Huggingface library [21]. Adapter methods were implemented using the AdapterHub library [18]. All configurations were trained over three epochs.
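A hedged sketch of how such a combined configuration can be set up with the AdapterHub `adapters` library is given below. The class names (ConfigUnion, LoRAConfig, SeqBnConfig), the checkpoint and dataset identifiers, and the hyperparameters are assumptions based on the library’s public interface and the settings reported in this paper; the configuration actually used may differ in its details.

```python
# Sketch of an AdapterHub-style setup combining a bottleneck adapter with LoRA.
# Class names, checkpoint, dataset id, and hyperparameters are assumptions.
import adapters
from adapters import ConfigUnion, LoRAConfig, SeqBnConfig
from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
adapters.init(model)  # enable adapter support on the vanilla Huggingface model

config = ConfigUnion(
    LoRAConfig(r=16, alpha=16),          # low-rank updates on attention projections
    SeqBnConfig(reduction_factor=16),    # sequential bottleneck adapters
)
model.add_adapter("tr_en", config=config)
model.train_adapter("tr_en")             # freeze the base model, train only the adapter

# The WMT17 TR-EN parallel data could be loaded, e.g., with
# datasets.load_dataset("wmt17", "tr-en")  (dataset id assumed).

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable / 1e6:.1f}M of {total / 1e6:.0f}M "
      f"({trainable / total:.1%})")
```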
The models we created, their parameter counts, and their loss and BLEU scores are given in Table 2.
The studies were conducted using Python with the PyTorch deep learning library. The pre-trained language model was sourced from the Huggingface library. The experimental studies were carried out on a computer system equipped with two Nvidia RTX 3090 graphics cards, 32 GB of RAM, and an Intel i7 processor.
This research compared full fine-tuning with adapter methods. When all model parameters (610 M) were updated during training, the BLEU score achieved was 21.95. Using the proposed method, only 30 M parameters were updated, which corresponds to approximately 5% of the parameters trained during full fine-tuning. The BLEU score obtained with the proposed method was 21.3.
The total FLOP (Floating Point Operations) computation for the entire model involves a detailed breakdown of each component within the model. The key components include the encoder and decoder layers, with each layer comprising three main subcomponents, namely the self-attention mechanism, the feed-forward network, and layer normalization:
Self-Attention Mechanism: The self-attention mechanism, present in both the encoder and decoder layers, is a significant contributor to the FLOP count. In this mechanism, three matrix projections are performed for the query, key, and value. Each projection involves a matrix multiplication with a size of 1024 × 1024, as the hidden size of the model is 1024. Additionally, there is an output projection that produces the final result of the self-attention operation. The FLOPs for these projections are computed for a single token and then scaled across 128 tokens and 24 layers to obtain the total FLOP count for the self-attention mechanism.
Feed-Forward Network: Each encoder and decoder layer includes a two-step fully connected network. The first layer expands the hidden dimension from 1024 to 4096, followed by a second layer that reduces it from 4096 back to 1024. The FLOP computations for these fully connected layers are calculated for the projections, and then the total FLOPs are derived by considering the number of layers and tokens.
Layer Normalization: Layer normalization is applied in each layer. Although its computational cost is lower than that of the self-attention and feed-forward layers, its contribution becomes significant when applied across all layers of the model.
The FLOP counts for forward and backward passes are computed separately for each of these components, yielding a total FLOP count of 71.6 billion for the entire model.
In addition to the core model, the FLOP computations for the adapters and LoRA components were also performed:
mh_adapter (r = 16) and output_adapter (r = 2): These adapters involve both down-projection and up-projection operations. The FLOP count for these operations, calculated per layer and token, resulted in a total of 339.7 million FLOPs.
LoRA adapters (r = 16): LoRA adapters also perform down-projection and up-projection operations. The total FLOP count for the forward and backward passes of these adapters was calculated as 302 million FLOPs.
When combining the FLOP counts for the adapters and LoRA with the core model, the total contribution of the adapters and LoRA to the overall FLOP count was minimal, representing only 0.896% of the total model FLOPs. This demonstrates that incorporating adapters and LoRA provides a low-overhead fine-tuning mechanism that can adjust the model without significantly increasing the computational cost.
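The back-of-the-envelope accounting above can be reproduced with a short script. The sketch below uses the architectural values stated in the text (24 layers, hidden size 1024, feed-forward size 4096, 128 tokens, rank 16) and counts one multiply-accumulate as two FLOPs with the backward pass taken as roughly twice the forward pass; because the exact counting conventions and included terms in the paper’s bookkeeping are not fully specified, the numbers it prints are indicative only and will not necessarily match the totals reported above.

```python
# Rough FLOP estimate for the transformer components described above.
# Assumptions: 24 layers, hidden 1024, FFN 4096, 128 tokens, rank 16,
# one multiply-accumulate counted as 2 FLOPs, backward pass ~2x forward.
# The paper's exact bookkeeping may differ, so totals are indicative only.
LAYERS, D, FFN, TOKENS, RANK = 24, 1024, 4096, 128, 16

def linear_flops(d_in, d_out, tokens=TOKENS, layers=LAYERS):
    """FLOPs of a dense projection applied to every token in every layer."""
    return 2 * d_in * d_out * tokens * layers

# Self-attention: query, key, value, and output projections (d x d each).
attn = 4 * linear_flops(D, D)
# Feed-forward network: expansion (d -> 4d) and contraction (4d -> d).
ffn = linear_flops(D, FFN) + linear_flops(FFN, D)
# Bottleneck adapters and LoRA: down- and up-projections of rank RANK.
adapter = 2 * linear_flops(D, RANK)
lora = 2 * linear_flops(D, RANK)

forward = attn + ffn
total = 3 * forward                      # forward + ~2x for the backward pass
extra = 3 * (adapter + lora)
print(f"core model: {total / 1e9:.1f} GFLOPs")
print(f"adapters + LoRA overhead: {extra / 1e6:.0f} MFLOPs "
      f"({extra / total:.3%} of the core model)")
```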
The translation outcomes generated by various parameter-efficient methods were compared on the dataset utilized in this study. A comprehensive summary of all results is presented in Table 3.
An analysis of Table 3 reveals that the proposed method achieved a BLEU score of 21.3 with only 5.03% trainable parameters. This outcome demonstrates remarkable success in terms of both parameter efficiency and overall performance compared to other parameter-efficient fine-tuning methods. Notably, the proposed approach yields a BLEU score that closely approximates the 21.95 score achieved by full fine-tuning, which utilizes 100% trainable parameters, while only requiring updates to 5.03% of the model parameters. This significant reduction in computational cost, without compromising translation quality, highlights the method’s effectiveness in balancing efficiency and performance.
One of the primary advantages of the proposed method is its ability to surpass methods with higher trainable parameter ratios, such as MAM [22]. While MAM achieved a BLEU score of 18.8 with 6.98% trainable parameters, the proposed method attained a superior BLEU score with a lower parameter ratio. Similarly, the parallel method [23] only reached a 17.5 BLEU score using 6.56% of trainable parameters. These findings underscore the proposed method’s capability to deliver a more efficient fine-tuning process, outperforming methods with higher parameter counts.
The method employed in this study has been evaluated across language pairs with varying levels of resource availability, including a very-low-resource pair (Kk-En), a high-resource pair (Et-En), and a very-high-resource pair (Pl-En). A detailed summary of the results for each language pair is provided in Table 4.
According to the data presented in Table 4, the proposed method demonstrated a noticeable decline in performance for very-low-resource languages compared to full fine-tuning. For instance, in the Kazakh–English (Kk-En) language pair, the proposed method achieved a BLEU score of 8.2, falling short of the 11.0 BLEU score obtained by full fine-tuning. This outcome highlights the challenges of improving translation quality in low-resource languages, where data scarcity limits the model’s ability to fully capture the complexities of the language. Consequently, the reduced amount of available training data significantly impacts the translation performance of parameter-efficient fine-tuning methods in such settings.
In contrast, the proposed method achieved competitive results for high-resource languages when compared to full fine-tuning. For the Estonian–English (Et-En) and Polish–English (Pl-En) language pairs, the proposed method yielded BLEU scores of 27.1 and 29.9, respectively, coming within 1–2 points of the full fine-tuning scores. This indicates that for high-resource languages, the proposed method is capable of efficiently fine-tuning with fewer parameters while maintaining translation performance close to that of full fine-tuning. The abundance of data in these language pairs allows the model to learn the linguistic patterns effectively, enabling parameter-efficient methods to produce results that are comparable to full fine-tuning.
5. Conclusions
In summary, this study evaluated the performance of parameter-efficient adapter methods for fine-tuning language models in NMT, focusing specifically on low-resource language pairs. The experiments aimed to understand how these adapter techniques impact translation quality under constrained data conditions.
Among the various adapter methods tested, the model that combined LoRA with bottleneck adapters demonstrated the highest performance compared to other approaches. Notably, the model using only bottleneck adapters achieved lower scores despite a higher parameter count. However, when LoRA was integrated, translation quality significantly improved. This combination not only enhanced parameter efficiency but also reduced computational requirements, leading to substantial improvements in translation quality for low-resource language pairs. Additionally, this model updated only 5% of the total parameters, allowing for efficient fine-tuning and optimized resource utilization. We also experimented with full fine-tuning, which yielded only about a 3% improvement in performance over the proposed method. This result highlights that full fine-tuning, despite its high computational costs, provides limited performance gains.
Our experimental results indicate that the combination of LoRA and bottleneck adapters is an effective approach to optimize the performance of neural machine translation models. This method offers substantial advantages by imposing less memory and computational load on the model while maintaining or improving translation quality. This is particularly beneficial for applications with limited resources. Furthermore, updating only 5% of the parameters significantly reduced computational costs, enhancing overall efficiency. The findings of this study underscore the importance of parameter-efficient methods in the fine-tuning of language models. The successful integration of LoRA and bottleneck adapters represents a crucial step towards developing more efficient and high-performance neural machine translation models for low-resource language pairs. This approach streamlines the model updating process, making it faster and more cost-effective, thereby broadening its potential applications.
Future research should test these adapter methods across a wider range of languages and larger datasets. Comparative studies with other parameter-efficient techniques are also essential to evaluate the general applicability and superiority of this combination. In conclusion, this study makes a significant contribution to the field of neural machine translation by providing a pathway for developing more efficient, effective, and scalable solutions.