1. Introduction
Natural language processing (NLP) is a field concerned with understanding, processing, and analysing natural languages (i.e., human languages). The evolution of the approaches used in NLP tasks is worth noting. Initially, the rule-based approach dominated the field; it does not consider the contextual meaning of words and struggles to cover all the morphological variants of a language. With the growth in the availability and accessibility of data, the so-called machine-learning approach emerged. This approach offers better accuracy than the rule-based approach, but one of its drawbacks is that it requires complex manual feature engineering. With the emergence of neural networks and, more recently, deep learning, feature engineering has become automatic through the use of word embedding techniques, including Word2Vec [1], GloVe [2], FastText [3], and others. Word vectors have been widely used in NLP and have achieved state-of-the-art results. Word embeddings map each word to a vector of numbers in a vector space, and language models use pre-trained word embeddings as additional features to initialise the first layer of the basic model. The limitations of word embedding models are that they cannot handle out-of-vocabulary (OOV) words and that they lack context-dependent representations.
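To make the lookup nature of static word embeddings and the OOV limitation concrete, here is a minimal, self-contained sketch; the vectors are random toy values, not real Word2Vec, GloVe, or FastText embeddings.

```python
# Toy illustration of a static word-embedding lookup table and its OOV problem;
# the vectors are random stand-ins, not real pre-trained embeddings.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["movie", "great", "terrible"]
embeddings = {word: rng.normal(size=5) for word in vocab}  # one fixed vector per word

print(embeddings["movie"])       # the same vector for every occurrence, regardless of context
print(embeddings.get("movies"))  # None: an out-of-vocabulary word has no vector at all
```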
However, for model training, machine learning and deep learning approaches demand extensive amounts of labelled data, which is time-consuming to annotate and prepare. At present, significant evolution is noticeable in the NLP field, particularly with the emergence of transfer learning, which has reduced the need for massive amounts of training examples. In many NLP applications, most of the recent research applying transfer learning techniques has achieved state-of-the-art results. Transfer learning is defined in [4] as "the improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned". Transfer learning with pre-trained language models has attracted the interest of the research community in recent years, relying on the so-called semi-supervised approach: the language model is first trained in an unsupervised manner on a large amount of unlabelled data (corpora), and it is then fine-tuned in a supervised manner on a small, task-specific labelled dataset.
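To make the first, unsupervised stage of this semi-supervised workflow concrete, below is a minimal, hedged sketch of continuing masked-language-model pre-training with the Hugging Face Transformers library; the checkpoint name, toy corpus, and hyperparameters are illustrative assumptions rather than the setup used later in this paper.

```python
# Hedged sketch: continue masked-language-model (MLM) pre-training of an existing
# BERT-style checkpoint on unlabelled text, before supervised fine-tuning.
# The checkpoint name, toy corpus, and hyperparameters are assumptions.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "bert-base-multilingual-cased"  # any BERT-style checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Unlabelled corpus (a toy stand-in for a large collection of raw text).
corpus = Dataset.from_dict({"text": ["An unlabelled sentence.", "Another unlabelled one."]})
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="continued-pretraining", num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=corpus, data_collator=collator).train()
# The resulting checkpoint is then fine-tuned on a small, task-specific labelled dataset.
```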
With the advancement of technology and the widespread nature of social networking sites, people have become more expressive of their sentiments and emotions, and they may also be influenced by the opinions of others. Many entities have started to consider customer opinions relating to their products or services. The Arabic language is one of the most popular languages in the world: it is ranked fourth among the languages used on the internet [5] and is the primary language of 22 countries [6]. Therefore, there is a great need for tools and models to analyse Arabic sentiment and emotions regarding specific topics, phenomena, or trends, and such analyses can benefit several fields. Affect is the superordinate category: it is a general term that includes both emotion and sentiment states. Sentiment analysis and emotion detection are both NLP tasks that have emerged as hot topics of interest in the NLP research community. According to the Oxford Dictionary [7], a sentiment is "a feeling or an opinion, especially one based on emotions", whereas an emotion is "a strong feeling such as love, fear, or anger; the part of a person's character that consists of feelings", from which we can infer that sentiment is a more general interpretation of emotion. Sentiment is classified as positive, negative, or neutral, or, on an expanded scale, as very positive, positive, neutral, negative, or very negative. Emotions are often classified according to well-known models such as the Ekman model [8] (happiness, sadness, fear, anger, surprise, and disgust) or the Plutchik wheel of emotion [9].
Emotion and sentiment analysis in text depends essentially on the language used. This study aims to analyse sentiment and emotion in Arabic, one of the most challenging languages. The Arabic language has several varieties, including classical Arabic, modern standard Arabic (MSA), and a range of dialects. The field of text emotion and sentiment analysis in Arabic faces many challenges, including the lack of resources, the diversity of dialects that have no standard rules, and the detection of implicitly expressed sentiment or emotion. Furthermore, one root word may be written in more than one form, one word may have more than one meaning, and diacritics change the meaning of words [5]. The Arabic language also differs from other languages in its morphological richness and complex syntactic structure. NLP tasks for text emotion and sentiment analysis therefore become more complex in Arabic. These challenges, along with others, have delayed progress in this research area, meaning that these tasks have not been investigated and explored as thoroughly in Arabic as in English. In Arabic, a degree of progress has been recorded in sentiment analysis, where sentiments are typically classified as positive, negative, or neutral. However, progress on the emotion detection task is still ongoing, and few studies have classified emotions at a fine-grained level (e.g., according to the Ekman [8] or Plutchik [9] models). The recent evolution in this area, especially the exploitation of transfer learning and advanced pre-trained language models, has overcome many of the field's challenges and led to substantial performance improvements. Nevertheless, Arabic research papers predominantly employ machine learning or deep learning algorithms rather than pre-trained language models.
Recent efforts in these fields have focused on adapting pre-trained language models to specific domains and tasks using domain-specific or task-specific unlabelled corpora, by continuing the pre-training of the language model on data from that task or domain. Model adaptation, using either a domain-adaptation approach [10,11,12] or a task-adaptation approach [13,14], has led to significant performance gains for English. As far as we know, model adaptation approaches, especially additional pre-training of the language model on a specific domain, have only been used in two Arabic language studies [15,16], and classifying sentiment and emotion was not the focus of either study. Additionally, further pre-training of the language model on a specific task (i.e., within-task and cross-task adaptation) has not been investigated for Arabic in general, nor for sentiment and emotion tasks in particular. This study aims to tackle these problems and fill these gaps by developing models with the overall aim of advancing the current state of sentiment and emotion classification for Arabic. The pre-trained language model QARiB [17], which has achieved state-of-the-art results in several NLP tasks, including sentiment analysis and emotion detection, was used in this study. QARiB is further pre-trained using sentiment- and emotion-specific datasets, under the assumption that the small task-specific datasets used during this additional pre-training are sufficient to improve model performance on that task. The developed models were then evaluated by fine-tuning them on seven sentiment and emotion datasets. In particular, the contributions of this study are:
Develop five new Arabic language models, QST, QSR, QSRT, QE3, and QE6, which are the first task-specific adapted language models based on QARiB for Arabic sentiment analysis and emotion detection. The developed models significantly enhanced performance on seven Arabic sentiment and emotion datasets, and the research community can use these models for sentiment and emotion tasks;
Conduct comprehensive experiments to investigate the impact of the within-task and cross-task adaptation approaches on the performance of sentiment and emotion classification;
Analyse the influence of the genre of the training datasets (i.e., tweets and reviews) utilised for model adaptation on the performance of sentiment classification.
The remainder of the paper is organised as follows: Section 2 offers a concise literature review on the classification of sentiment and emotion in Arabic text. Section 3 describes the proposed approach for developing the models, the pre-training datasets, and the necessary pre-processing and tokenisation steps. Section 4 describes the experimental setup, including the evaluation datasets, the baseline model, the fine-tuning architecture, and the hyperparameter choices for fine-tuning our models. The results of the experiments are presented and discussed in Section 5. Section 6 concludes the paper and outlines future directions.
4. Experiments
We evaluated our adapted models on two Arabic NLP downstream tasks: sentiment analysis and emotion detection. To investigate whether further pre-training of the QARiB model [17] using task-specific unlabelled data could continue to improve its performance on sentiment and emotion tasks, the experiments were designed to address the following research questions:
- 1.
What is the influence of the genre of the dataset (i.e., tweets or reviews) used for further pre-training on sentiment classification performance?
To answer RQ1, we used sentiment fine-tuning datasets from two distinct domains (Twitter and reviews). We studied further pre-training and fine-tuning with various data types and evaluated model performance on each dataset from different points of view. For example, the QST model was further pre-trained using only tweets, and we examined how well it performed on review datasets (e.g., ArSentiment and MASC). Conversely, we evaluated the QSR model, further pre-trained using only reviews, on the tweet datasets (e.g., SS2030, 40k-Tweets, and Twitter-AB).
- 2.
What sentiment classification performance can be achieved if the QARiB language model is further pre-trained on a sentiment-specific dataset?
To address RQ2, we investigated whether the sentiment models QST, QSR, and QSRT, which were further pre-trained on unlabelled sentiment datasets, could improve performance when fine-tuned on various labelled sentiment datasets. In this experiment, five sentiment datasets were used to fine-tune the sentiment models.
- 3.
What emotion classification performance can be achieved if the QARiB language model is further pre-trained on an emotion-specific dataset?
To answer RQ3, we studied how well the emotion models QE3 and QE6, which were further pre-trained on unlabelled emotion datasets, performed when fine-tuned on a variety of labelled emotion datasets. We fine-tuned the emotion models using two emotion datasets in an attempt to enhance the classification results. Moreover, we fine-tuned the BERT-base-QaRiB model as a baseline on all seven sentiment and emotion datasets and compared the results.
- 4.
Is there a relationship between sentiment and emotion representation? (i.e., can further pre-training QARiB with a sentiment dataset boost emotion classification results and vice versa?)
To answer RQ4 and determine whether there is a relationship between the sentiment and emotion tasks, we first fine-tuned the sentiment models QST, QSR, and QSRT on the two emotion datasets to examine whether a model further pre-trained on sentiment data could improve emotion classification performance. Second, we fine-tuned the QE3 and QE6 models on the five sentiment datasets to see whether a model further pre-trained on emotion data could improve sentiment classification performance.
This section describes the experimental setup, including the evaluation datasets, the baseline model against which our models are compared, the fine-tuning architecture, and the hyperparameter choices for fine-tuning our models.
4.1. Fine-Tuning Datasets
The datasets used for the evaluation process were chosen from the available Arabic sentiment and emotion datasets. For fine-tuning our models, we used five sentiment datasets and two emotion datasets, which are described in the following subsections. For all fine-tuning experiments, we applied the standard train/development/test split of 80/10/10, as sketched below.
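A minimal sketch of how such an 80/10/10 split can be produced is given below; the stratification by label and the use of scikit-learn are our own assumptions, not details stated in the paper.

```python
# Hedged sketch of an 80/10/10 train/development/test split;
# stratifying by label is an assumption, not something stated above.
from sklearn.model_selection import train_test_split

texts = [f"sentence {i}" for i in range(100)]   # toy stand-in for a labelled dataset
labels = [i % 2 for i in range(100)]

x_train, x_rest, y_train, y_rest = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)
x_dev, x_test, y_dev, y_test = train_test_split(
    x_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

print(len(x_train), len(x_dev), len(x_test))  # 80 10 10
```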
4.1.1. Sentiment Datasets
To cover different domains and sources, we chose five sentiment datasets spanning Twitter and reviews. The SS2030 [75], 40k-Tweets [76], and Twitter-AB [77] datasets were sourced from Twitter, whereas ArSentiment [78] and MASC [79] are review datasets.
SS2030 [75]: a sentiment dataset gathered from Twitter that includes 4252 tweets focusing on a variety of social issues in Saudi Arabia. The dataset was manually annotated and consists of two classes (2436 positive, 1816 negative).
Twitter-AB [77]: this dataset consists of 2000 tweets gathered from Twitter and classified into 1000 positive and 1000 negative. The dataset was manually labelled and includes both MSA and the Jordanian dialect, covering diverse topics related to politics and the arts.
40k-Tweets [
76]: There are 40,000 tweets in this dataset, 20,000 of which are positive and 20,000 of which are negative. These tweets are written in both MSA and an Egyptian dialect. Furthermore, the gathered tweets are manually labelled and span a wide range of topics such as politics, sports, health, and social problems.
ArSentiment [78]: a large, multi-domain review dataset consisting of over 45k reviews of movies, hotels, restaurants, and products on a 5-star rating scale. We used the rating scale to assign labels: 1 and 2 stars were considered negative, 3 stars neutral, and 4 and 5 stars positive (see the sketch after this list).
Multi-domain Arabic Sentiment Corpus (MASC) [
79]: a review dataset scraped from a variety of websites, including Google Play, Twitter, Facebook, and Qaym. The dataset, which covers several different domains, was manually annotated into two classes: positive and negative.
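As referenced in the ArSentiment description above, here is a minimal, hypothetical sketch of the star-rating-to-label mapping; the column names are illustrative assumptions rather than the dataset's actual schema.

```python
# Hypothetical sketch of the 5-star to 3-class label mapping used for ArSentiment;
# the column names ("rating", "label") are assumptions, not the dataset's real schema.
import pandas as pd

def stars_to_label(stars: int) -> str:
    """Map a 1-5 star rating to a sentiment label."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

reviews = pd.DataFrame({"text": ["review a", "review b", "review c"], "rating": [1, 3, 5]})
reviews["label"] = reviews["rating"].apply(stars_to_label)
print(reviews[["rating", "label"]])
```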
We selected datasets of varying sizes: some contain around 40,000 sentences or more (40k-Tweets and ArSentiment), while others (SS2030, Twitter-AB, and MASC) contain fewer than 7000 sentences. The selection of sentiment datasets with diverse domains and sizes was motivated by the desire to examine the impact of the adaptation approaches from multiple perspectives. The statistics and class distributions of the sentiment datasets are shown in Table 6. In addition, Table 7 provides the number of train, development, and test samples for each dataset.
4.1.2. Emotion Datasets
We evaluated our models on the emotion Arabic tweet dataset (EATD) [80] and the ExaAEC dataset [81]. In comparison to sentiment datasets, labelled emotion datasets for Arabic are small and scarce. Both datasets are derived from Twitter.
Table 8 illustrates the distribution of classes for each dataset. The number of train, development, and test samples for each dataset is presented in
Table 9.
EATD [
80]: an Arabic emotion dataset gathered from Twitter. The dataset was classified into four classes: anger, disgust, joy, and sadness. Over 22k tweets were annotated automatically based on emojis, and a subset of 2021 tweets was annotated manually. The manually annotated subset was used in our experiments.
ExaAEC [
81]: a multi-label Arabic emotion dataset consisting of approximately 20,000 tweets categorised as "neutral", "joy", "love", "anticipation", "acceptance", "surprise", "sadness", "fear", "anger", and "disgust". Each tweet was manually annotated with one or two emotions. Because the dataset contains tweets with multiple labels, we selected a subset containing only single-label tweets whose labels belong to the Ekman model, as follows: 'sadness' 1909, 'disgust' 1176, 'surprise' 795, 'joy' 472, 'fear' 195, and 'anger' 191, for a total of 4738 tweets.
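A minimal, hypothetical sketch of this single-label, Ekman-only subset selection is shown below; the column layout (a list of emotions per tweet) is an assumption about the dataset's format rather than its actual schema.

```python
# Hypothetical sketch of selecting the single-label, Ekman-only ExaAEC subset;
# the "labels" column (a list of one or two emotions per tweet) is an assumed format.
import pandas as pd

EKMAN = {"sadness", "disgust", "surprise", "joy", "fear", "anger"}

# Toy stand-in for the ExaAEC annotations.
exaaec = pd.DataFrame({
    "text": ["tweet 1", "tweet 2", "tweet 3"],
    "labels": [["sadness"], ["joy", "love"], ["anger"]],
})

# Keep only tweets with exactly one label, and only if that label is an Ekman emotion.
subset = exaaec[exaaec["labels"].apply(
    lambda labels: len(labels) == 1 and labels[0] in EKMAN)].copy()
subset["label"] = subset["labels"].str[0]
print(subset["label"].value_counts())
```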
4.2. Fine-Tuning Architecture
Fine-tuning the BERT model is "simple and direct", as indicated by [18], requiring only the addition of one more layer after the last BERT layer and training for a small number of iterations. The input sequence used to fine-tune the language model is represented with the [CLS] token prepended to the beginning of the sentence and the [SEP] token appended to its end. The [CLS] token is used for all classification-related tasks. As a result, our models can be utilised for a variety of downstream text classification tasks with only minor architectural changes. Specifically, we fine-tuned our models for sentiment and emotion classification in Arabic text using the same fine-tuning strategy as BERT [18]. Trainer is a class within the Transformers library that can be utilised to fine-tune a variety of pre-trained Transformer-based models on a specific dataset. To instantiate our sequence classification models, we used the AutoModelForSequenceClassification class. Because our models were not pre-trained for sentence classification, the pre-trained head was removed and replaced with a new head suited to each task, whose weights were randomly initialised. This means that during fine-tuning only the weights of the new head are updated; in other words, the pre-trained layers of our models are kept frozen. For classification tasks, we added a fully connected feed-forward layer to the model and used the standard softmax activation function for prediction. It is worth noting that we fine-tuned our models independently for each task and dataset, using the same fine-tuning architecture. We fine-tuned our models on the training set for a specific number of epochs; the checkpoint with the lowest validation loss was then selected automatically and used to evaluate the test set.
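The following is a hedged sketch of this fine-tuning setup using the Transformers Trainer; the checkpoint name on the Hugging Face Hub, the toy dataset, and the hyperparameters are assumptions for illustration, not the exact configuration used in our experiments.

```python
# Hedged sketch of sequence-classification fine-tuning with the Transformers Trainer;
# the checkpoint name, toy data, and hyperparameters are assumptions.
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "qarib/bert-base-qarib"  # assumed Hub name for BERT-base-QaRiB
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy labelled sentiment data standing in for, e.g., SS2030 (80/10/10 split assumed).
train = Dataset.from_dict({"text": ["جميل جدا", "سيء للغاية"], "label": [1, 0]})
dev = Dataset.from_dict({"text": ["رائع"], "label": [1]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

train, dev = train.map(tokenize, batched=True), dev.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"macro_f1": f1_score(labels, preds, average="macro")}

args = TrainingArguments(
    output_dir="qarib-sentiment",
    num_train_epochs=3,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,       # keep the checkpoint with the lowest eval loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
trainer = Trainer(model=model, args=args, train_dataset=train,
                  eval_dataset=dev, compute_metrics=compute_metrics)
trainer.train()
```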
4.3. Fine-Tuning Hyper-Parameters
Evaluating or fine-tuning a pre-trained language model is time-consuming, and manually experimenting with various hyperparameters might take days. Hyperparameter optimisation libraries, such as Ray Tune [82], allow the optimal values of model hyperparameters to be selected automatically. Ray Tune is compatible with a wide variety of machine learning frameworks, including PyTorch and TensorFlow, and it was used in the experiments conducted for this work. We ran ten trials for each dataset, with the hyperparameters chosen randomly by the tool. After the hyperparameter search was completed, the best hyperparameters were obtained and used to fine-tune our final models. It should be noted that, due to computational and time constraints, we did not run the search for more than ten trials. In fact, for some datasets, such as 40k-Tweets and ArSentiment, the training time ranges from 6 to 10 hours and may exceed that, depending on the hyperparameters chosen by Ray Tune.
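Below is a hedged sketch of how such a ten-trial random search can be run with Ray Tune through the Transformers Trainer; the search space, the objective direction, and the reuse of the tokenised datasets from the Section 4.2 sketch are illustrative assumptions, not the exact configuration used in this work.

```python
# Hedged sketch of a 10-trial hyperparameter search with Ray Tune via the
# Transformers Trainer; the search space and objective are assumptions.
from ray import tune
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

checkpoint = "qarib/bert-base-qarib"  # assumed Hub name for BERT-base-QaRiB

def model_init():
    # A fresh model per trial, so every trial starts from the same pre-trained weights.
    return AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def hp_space(trial):
    # Randomly sampled hyperparameters, mirroring the random search described above.
    return {
        "learning_rate": tune.loguniform(1e-5, 5e-5),
        "num_train_epochs": tune.choice([2, 3, 4]),
        "per_device_train_batch_size": tune.choice([16, 32]),
    }

args = TrainingArguments(output_dir="hp-search", evaluation_strategy="epoch")
trainer = Trainer(model_init=model_init, args=args,
                  train_dataset=train, eval_dataset=dev)  # tokenised sets from the Section 4.2 sketch

best_run = trainer.hyperparameter_search(
    hp_space=hp_space, backend="ray", n_trials=10, direction="minimize")
print(best_run.hyperparameters)  # used to fine-tune the final model
```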
4.4. Baseline Models
As a baseline for estimating how well our models performed, we compared them to the performance of the BERT-base-QaRiB model [17] on the same tasks. We used the publicly available BERT-base-QaRiB model and performed supervised fine-tuning of its parameters for each dataset, as described in Section 4.2. Moreover, the results of this study were compared to the benchmark results reported in the datasets' original papers [75,76,77,78,79,80]. The exception is the ExaAEC dataset, of which only a subset was used in this work, making our results incompatible with the version used in [81].
5. Results and Discussion
5.1. Exp-I: Experiment to Investigate the Influence of the Within-Task Adaptation Approach on Sentiment and Emotion Classification Performance
The results of fine-tuning the BERT-base-QaRiB, QST, QSR, and QSRT models on the SS2030, Twitter-AB, 40k-Tweets, ArSentiment, and MASC datasets are shown in
Table 10.
Table 11 presents the results obtained by fine-tuning the BERT-base-QaRiB, QE3, and QE6 models using the EATD and ExaAEC datasets. The results are discussed, compared, and analysed based on the macro-F1 score on the test set partition. In addition, the results reported for each dataset in its original study [75,76,77,78,79,80] are presented in Table 10 and Table 11. Results that showed an improvement over those obtained by the baseline model (i.e., the BERT-base-QaRiB model) are highlighted in bold, and results in bold and underlined are the best achieved for each dataset across the models. In total, 26 separate experiments were carried out utilising various sentiment and emotion datasets.
In Table 10, the results that outperformed the BERT-base-QaRiB model are highlighted in bold. All sentiment models, including QST, QSR, and QSRT, outperformed the BERT-base-QaRiB model on all sentiment datasets. In addition, when comparing the QST model to the QSR model across all of the experiments in Table 10, we observed that the QST model outperformed the QSR model on the sentiment datasets sourced from Twitter (SS2030, Twitter-AB, and 40k-Tweets). Based on this, we may infer that the data distributions of tasks within the same source or domain could be similar. It also indicates that further pre-training the model with task-specific datasets from the same genre or domain as the fine-tuning datasets yields better results than utilising datasets from a different genre or domain.
Compared to the BERT-base-QaRiB model, Table 10 reveals a performance gain of 2.22% for the QST model and 0.80% for the QSR model on the SS2030 dataset. In addition, the QST and QSR models showed improvements of 2.56% and 2.06%, respectively, on the Twitter-AB dataset. Performance on the 40k-Tweets dataset increased by only 0.88% using the QST model and 0.77% using the QSR model. A possible explanation is that the 40k-Tweets dataset covers several domains, such as politics and the arts, which were not covered extensively during the further pre-training of the model. Comparing the performance improvements of the QST and QSR models on the SS2030 and Twitter-AB datasets with those on the 40k-Tweets dataset may suggest that the models perform better on small datasets than on large ones. Compared to the BERT-base-QaRiB model, Table 10 also reveals a performance improvement of 0.90% for the QST model and 2.60% for the QSR model on the ArSentiment dataset, while performance on the MASC dataset improved by 2.51% using the QST model and 1.04% using the QSR model.
The QSRT model outperformed BERT-base-QaRiB on the SS2030, Twitter-AB, 40k-Tweets, ArSentiment, and MASC datasets by 1.54%, 3.08%, 0.15%, 2.22%, and 1.72%, respectively. Our best sentiment model for the SS2030, 40k-Tweets, and MASC datasets was the QST model, which achieved F1 scores of 92.93%, 91.37%, and 97.12%, respectively. Moreover, the QSR model obtained the highest F1 score on the ArSentiment dataset, achieving 78.53%, and the QSRT model was the best sentiment model on the Twitter-AB dataset with a macro-F1 of 97.45%. This may suggest that further pre-training a model with a large amount of training data is not necessary; instead, training with task-specific datasets that share the same domain as the fine-tuning datasets could result in higher performance.
Table 10 demonstrates that the developed models significantly outperform the results of the original studies for the SS2030, Twitter-AB, 40k-Tweets, and ArSentiment datasets in terms of accuracy, by 3.35%, 10.25%, 3.32%, and 31.55%, respectively. In terms of the F1 score, the improvement on the MASC dataset was 0.04%. The results reported in Table 10 show that the within-task adaptation approach has a beneficial impact on the final results of the sentiment analysis task. In other words, further pre-training the QARiB language model with unlabelled sentiment datasets and fine-tuning it with labelled sentiment datasets improved the final results of the sentiment analysis task. In addition, further pre-training using sentiment-specific datasets from the same source or domain as the fine-tuning datasets leads to greater improvements in the sentiment classification results.
In terms of emotion detection, the task is challenging for the model because the emotion datasets used, EATD and ExaAEC, have multiple classes (four and six emotion classes, respectively). Nevertheless, compared to BERT-base-QaRiB, our QE3 and QE6 models performed better on all emotion datasets, as shown in bold in Table 11. Compared to the BERT-base-QaRiB model, the QE3 model performed 4.64% better on the EATD dataset, and the QE6 model yielded a 2.84% improvement on the same dataset. Results for the ExaAEC dataset were enhanced by 0.40% with the QE3 model and by 0.35% with the QE6 model. The smaller improvement could be because the ExaAEC dataset contains six emotion classes and is unbalanced, as shown in Table 8.
It can be observed that the QE3 model was the best emotion model across both emotion datasets (EATD and ExaAEC), obtaining macro-F1 scores of 90.10% and 64.21%, respectively. Furthermore, Table 11 shows that the QE3 model outperforms the SVM model in terms of the F1 score by 21.58% on the EATD dataset. These results indicate that QE3 outperformed QE6 on the two emotion datasets. In addition, as shown in Table 5, the training results of the QE6 model were not superior to those of the QE3 model; in fact, the validation loss and perplexity of QE6 increased. Together, these findings provide an important insight, namely that pre-training the model for more training steps is not necessary to achieve optimal performance.
Finally, the results reported in
Table 11 show that the within-task adaptation approach has a positive impact on the final results of the emotion detection task. Accordingly, further pre-training the QARiB language model with unlabelled emotion datasets and fine-tuning it with labelled emotion datasets improves emotion detection performance. In addition, further pre-training of the model for more training steps is unnecessary to achieve the highest performance.
5.2. Exp-II: Experiment to Investigate the Influence of the Cross-Task Adaptation Approach on Sentiment and Emotion Classification Performance
Table 12 summarises the results of fine-tuning the BERT-base-QaRiB, QE3, and QE6 models using the sentiment datasets SS2030, Twitter-AB, 40k-Tweets, ArSentiment, and MASC.
Table 13 shows the results of fine-tuning the BERT-base-QaRiB, QST, QSR, and QSRT models using the EATD and ExaAEC datasets. The results are discussed, compared, and analysed based on the macro-F1 score on the test set partition. In addition, the results reported for each dataset in its original study [75,76,77,78,79,80] are presented in Table 12 and Table 13. Results that showed an improvement over those obtained by the baseline model (i.e., the BERT-base-QaRiB model) are highlighted in bold, and results in bold and underlined represent the highest results obtained for each dataset across the models. In total, 16 experiments were carried out using various sentiment and emotion datasets.
Overall, as shown in
Table 12 and
Table 13, all sentiment models (QST, QSR, and QSRT) and emotion models (QE3 and QE6) outperformed the BERT-base-QaRiB model on all sentiment and emotion datasets. In the tables, results that exceeded those of the BERT-base-QaRiB model are highlighted in bold.
Table 12 reveals that the QE3 model outperformed BERT-base-QaRiB on the SS2030, Twitter-AB, 40k-Tweets, ArSentiment, and MASC datasets by 2.18%, 2.06%, 0.70%, 3.18%, and 1.38%, respectively. In addition, the QE6 model outperformed BERT-base-QaRiB on the SS2030, Twitter-AB, 40k-Tweets, ArSentiment, and MASC datasets by 0.57%, 1.53%, 0.90%, 2.71%, and 0.91%, respectively.
When comparing the emotion models QE3 and QE6 across the five sentiment datasets, QE3 outperforms QE6 on the SS2030, Twitter-AB, ArSentiment, and MASC datasets, obtaining macro-F1 scores of 92.89%, 96.43%, 79.12%, and 95.99%, respectively. On the 40k-Tweets dataset, the QE6 model outperforms QE3 with a macro-F1 of 91.40%. These results again suggest that pre-training the model for more training steps does not necessarily give optimal performance. In addition, further pre-training the model using emotion data and fine-tuning it with a sentiment dataset can improve the final sentiment classification results.
In
Table 13, compared to the BERT-base-QaRiB model, the QST model improved performance on the EATD dataset by 4.71%, making it the best model on this dataset with a macro-F1 of 90.18%. On the same dataset, the QSR model yielded a 2.80% improvement, and the QSRT model a 4.32% improvement, which was better than that of the QSR model. On the ExaAEC dataset, the QST model outperformed all other models, with an improvement of 1.24% and a macro-F1 of 65.05%, while the QSR and QSRT models achieved improvements of 1.17% and 0.57%, respectively.
Comparing the results of the QST and QSR models on the emotion datasets (EATD and ExaAEC), we noticed that the QST model outperformed the QSR model. This was expected, because the QST model was further pre-trained using tweet data and the emotion datasets were also taken from Twitter. These findings indicate that better results might be obtained by further pre-training the model using a task-specific dataset from the same genre as the fine-tuning datasets. On the ExaAEC dataset, the QSRT model performed somewhat worse than the QST and QSR models, although it was trained with a larger dataset spanning two data genres (tweets and reviews). This may indicate that a large quantity of data is not necessary for training the model and that a dataset of the same genre as the fine-tuning dataset may be more efficient. In general, Table 12 demonstrates that the developed models significantly outperform the results of the base studies for the SS2030, Twitter-AB, 40k-Tweets, and ArSentiment datasets (but not the MASC dataset) in terms of accuracy, by 3.35%, 9.23%, 3.35%, and 31.42%, respectively. Furthermore, Table 13 shows that the QST model outperforms the SVM model in terms of the F1 score by 21.66% on the EATD dataset.
In conclusion, the results presented in
Table 12 and
Table 13 illustrate the effectiveness of the cross-task adaptation approach on the final results of the sentiment and emotion classification tasks. These results suggest that the data distributions of the sentiment and emotion data may be convergent: each task can influence and improve the results of the other. This provides an important insight, namely that related tasks with similar data distributions might enhance each other's performance. For instance, the Arabic emotion detection task has more limited resources than sentiment analysis; therefore, this study may help researchers tackling emotion detection to obtain better results by utilising sentiment resources. Additionally, when comparing the results of the two task-adaptation approaches (cross-task and within-task), it can be seen that the cross-task adaptation results sometimes outperform those of the within-task approach. On the emotion datasets EATD and ExaAEC, for instance, the sentiment model QST outperformed the emotion models QE3 and QE6, while the emotion models QE3 and QE6 outperformed the sentiment models on the sentiment datasets 40k-Tweets and ArSentiment.
6. Conclusions
The experiments described in the previous sections examined the effect of two adaptation approaches: within-task and cross-task adaptation. In total, five new models were developed using the task-adaptation approach: the QST, QSR, QSRT, QE3, and QE6 models. Different evaluation experiments were conducted by fine-tuning each model for two downstream tasks, sentiment analysis and emotion detection. Using five sentiment datasets (SS2030, Twitter-AB, 40k-Tweets, ArSentiment, and MASC) and two emotion datasets (EATD and ExaAEC), 42 experiments were carried out in total. The sentiment and emotion datasets covered both small- and large-resource settings. The experiments reveal the following: first, the within-task and cross-task adaptation approaches influenced the final results and boosted performance for both tasks (i.e., sentiment and emotion). Second, our newly developed QST, QSR, QSRT, QE3, and QE6 models outperformed the BERT-base-QaRiB model on all sentiment and emotion datasets. Third, training using task-specific datasets that share the same domain as the fine-tuning datasets results in higher performance. Fourth, additional pre-training of the model for more training steps is unnecessary to achieve the highest performance. Finally, cross-task adaptation shows that sentiment and emotion data may be convergent, and each task might enhance the results of the other.
This study showed that further pre-training the QARiB language model on small-scale sentiment or emotion data improves the model's understanding of such data and yields considerable improvements. Because of the scarcity of emotion datasets, one of the limitations of this research is that the models were only evaluated on two small emotion datasets. In general, the findings reveal interesting areas for future research. They indicate that these approaches (i.e., within-task and cross-task adaptation) can improve the performance of QARiB; consequently, any pre-trained Arabic language model could be combined with the approaches that we have investigated. While Arabic language models such as AraBERT and MARBERT already perform well on sentiment and emotion tasks, they may benefit significantly from further task-specific pre-training. In addition, we believe that pre-training on larger task-specific data could further enhance performance. Finally, the developed language models are publicly available for use by the NLP community for research purposes, and we hope this work helps researchers interested in the domain of Arabic sentiment and emotion analysis.