1. Introduction
The growth and proliferation of social media platforms has made a huge amount of textual data available on the Internet, which is drawing more attention to sentiment analysis [1]. Sentiment analysis (SA), often referred to as opinion mining, is a Natural Language Processing (NLP) task that aims to extract sentiments by analyzing textual data and classifying it by polarity [2]. It plays an important role in analyzing thoughts, opinions, and emotions in texts written about healthcare systems, e-commerce, and social networks [3]. Although Arabic is one of the most widely used languages in the world, research in Arabic sentiment analysis is still growing slowly compared to other languages such as English [4]. Therefore, extending the same success in SA to the Arabic language is still a challenge.
In the field of SA, most research is focused on the English language, with little attention paid to the Arabic language [
5]. This is because Arabic Sentiment Analysis (ASA) is still challenging due to Arabic varieties, orthography, morphology, lack of corpora, lack of sentiment lexicons, and the use of dialectal Arabic [
6]. Arabic is a global language with more than 500 million speakers worldwide [
7], and about 185 million Arabic speakers use the Web [
6]. Thus, ASA has recently emerged as an active research area, particularly in the field of Machine Learning (ML) applications [
8]. One way to strengthen ASA is to exploit emojis, which are becoming more popular on social media and provide helpful features that enrich the textual features used for sentiment analysis [9].
They provide a rich source of semantic dimensions that can assist in conveying users’ opinions. Here, we did not just consider emoticons that reflect facial expressions, but also those that are used to enrich the text with concepts and ideas, such as celebrations, weather status, vehicles and buildings, food and drink, animals and plants, and the intended feelings and emotions from their use [
10]. For example, the “❤” emoji means “يحب شخص و الرومنسية و المودة”, and in English means “loves someone, romance, and affection”, and “😀” means “السعادة والإثارة بشكل عام”, and means “happiness and excitement in general” in English, while “⛰” is rich in meanings and intentions such as “الجبال المادية أو فكرة المشي لمسافات طويلة والمغامرة. الإعجاب بالطبيعة أو القوة أو السفر. او التغلب على التحديات، أو إحساسًا بالسلام والتأمل”, which means “Physical mountains or the idea of hiking and adventure. Admiration for nature, strength, or travel. Or overcoming challenges, or a sense of peace and contemplation”. Thus, eliminating such emojis could omit valuable information and feelings that they reflect, and change the overall meaning of the user’s tweet and its emotional tone. On the other hand, including the intended meaning and emotion of the emoji will help ML extract the right insights and support decision-makers and managers in their decision-making.
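The replacement of emojis with their meanings described above can be illustrated with a minimal Python sketch. The two Arabic meanings are the ones quoted in the text; the table name `EMOJI_MEANINGS` and the functions `encode_emojis` and `remove_emojis` are hypothetical names introduced for illustration and are not the authors' implementation.

```python
# Hypothetical emoji-to-meaning table (assumption: a plain dict keyed by emoji).
# The two meanings are the examples quoted in the text.
EMOJI_MEANINGS = {
    "❤": "يحب شخص و الرومنسية و المودة",
    "😀": "السعادة والإثارة بشكل عام",
}

VARIATION_SELECTOR = "\ufe0f"  # often follows "❤" in real tweets


def encode_emojis(text, meanings=EMOJI_MEANINGS):
    """Replace every known emoji in `text` with its textual meaning."""
    text = text.replace(VARIATION_SELECTOR, "")
    for emoji, meaning in meanings.items():
        # Pad with spaces so the meaning does not fuse with neighbouring words.
        text = text.replace(emoji, f" {meaning} ")
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(text.split())


def remove_emojis(text, meanings=EMOJI_MEANINGS):
    """Baseline for comparison: drop every known emoji instead of encoding it."""
    text = text.replace(VARIATION_SELECTOR, "")
    for emoji in meanings:
        text = text.replace(emoji, " ")
    return " ".join(text.split())
```

Calling `encode_emojis` on a tweet keeps the sentiment carried by the emoji as ordinary tokens, whereas `remove_emojis` discards it; comparing the two is the kind of contrast the text argues for.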
This work presents an emoji-encoding approach that replaces each emoji with its emotional meaning and its real meaning as used on social media. Furthermore, a hybrid deep learning model is proposed to evaluate the impact of this preprocessing step on the quality of ASA and to build robust prediction models. These techniques address specific challenges in Arabic sentiment analysis, such as the complexity of Arabic dialects, the lack of sentiment lexicons, and the intricacies of Arabic morphology. Our approach advances the state of the art by offering a more nuanced understanding of how these techniques can be effectively employed to overcome these challenges. To the best of our knowledge, this is the first work that combines a hybrid deep learning approach with emoji encoding for Arabic sentiment analysis.
Accordingly, we can summarize the main contributions of our proposed approach as follows:
Combination of emoji encoding with the hybrid CNN-LSTM model: Our method integrated emoji encoding that captures all the emotional and real meanings, specifically tailored to enhance the understanding of sentiment in Arabic text.
Impact of preprocessing steps: We explored the effects of various preprocessing techniques, such as keeping non-Arabic words, retaining punctuations, and using different stemmers and embedding transformers, on the performance of our sentiment analyzer. This exploration provides deeper insights into how specific transformation or stemming strategies can effectively leverage punctuation and non-Arabic words to enhance sentiment extraction in Arabic text.
The rest of this paper is organized as follows. The literature review on sentiment analysis and text data preprocessing is discussed in Section 2. Then, the proposed methodology, including data collection, preprocessing, and hybrid model prediction and tuning processes, is presented in Section 3. Section 4 shows the results obtained from the different experiments, which are discussed in Section 5. Finally, Section 6 presents the conclusions and suggests future work in this area.
2. Literature Review
Sentiment analysis is the understanding of people’s opinions, emotions, and attitudes toward any topic or person expressed in textual data [11]. In the field of Natural Language Processing, ASA has recently received increasing attention [12]. Our review of the ASA literature found studies that classify Arabic sentiment using hybrid models, deep learning models, and classical machine learning models.
Hybrid models play a role in our understanding of the complexity of Arabic sentiment as these models are trained on different datasets to build predictive models. The study in [
13] applied a combination of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) on three datasets, the Arabic Health Services Dataset (Main-AHS and Sub-AHS) [
14], Ar-Twitter [
15], and the Arabic Sentiment Tweets Dataset (ASTD) [
16]. The max-pooling layer was excluded from the CNN to maintain the same feature vector length after convolving the filters on the input data. In addition, several dataset-preparation tools for Arabic text preprocessing, such as MADAMIRA, Farasa, and Stanford, and several pre-trained word-embedding techniques that provide vector representations of the text features, such as Word2Vec, GloVe, and fastText, were investigated to improve the accuracy of Arabic sentiment classification. The best accuracy was 94.83% for the Main-AHS dataset using Farasa lemmatization normalization, 88.86% for the Ar-Twitter dataset using MADAMIRA stem normalization, and 81.62% for the ASTD using Word2VecSG word embedding. Subsequently, a more complex approach was proposed in [
17] by implementing a hybrid model to combine contextualized sentence representations generated by the AraBERT model with static word embedding using pre-trained Mazajak. In addition, CNN-Bidirectional Long Short-Term Memory (CNN-BiLSTM) was used to obtain sentence representations from the static word vectors in order to be able to concatenate the two types of embeddings. The hybrid model outperforms the standalone AraBERT model tested on the ArSarcasm-v2 dataset for both sarcasm and sentiment classification tasks. The best results are a 0.62 F1 score and 0.715 F-PN score (macro average of positive and negative class F scores) for sarcasm and sentiment classification, respectively. Another hybrid model of CNN-BiLSTM was used in [
18] for different tasks, including a topic classifier, a sentiment analyzer, a sarcasm detector, and an emotion classifier. This model was trained on different datasets for each task, with four of them for sentiment analysis tasks; SS2030, ArSAS, Twitter dataset for Arabic Sentiment Analysis, and ArSarcasm-v2 datasets, consisting of 4214, 21,000, 348,797, and 15,548 tweets, respectively. The proposed model achieves an accuracy of 97.58%, 86%, 97%, and 81.6% for topic, sentiment, sarcasm, and emotion classification, respectively.
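The F-PN score mentioned above is defined in the text as the macro average of the positive-class and negative-class F scores. A minimal Python sketch of that definition follows; the label strings `"POS"`, `"NEG"`, and `"NEU"` and the function names are assumptions made for illustration.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score of a single class, computed from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f_pn(y_true, y_pred, pos="POS", neg="NEG"):
    """Macro average of the positive and negative class F1 scores.

    Other classes (e.g. neutral) affect the score only through the
    false positives and false negatives of the two averaged classes.
    """
    return (f1_for_class(y_true, y_pred, pos) + f1_for_class(y_true, y_pred, neg)) / 2


# Small worked example with made-up labels.
y_true = ["POS", "POS", "NEG", "NEG", "NEU"]
y_pred = ["POS", "NEG", "NEG", "NEG", "POS"]
score = f_pn(y_true, y_pred)  # POS F1 = 0.5, NEG F1 = 0.8, so F-PN = 0.65
```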
On the other hand, deep learning has also been used for ASA. For example, the study in [
19] used deep learning to evaluate GloVe, Word2Vec, and FastText as classical word embedding techniques and ARBERT as contextualized word embedding for sentiment analysis with a comparative analysis. The word embedding techniques were evaluated in trained and pre-trained versions by applying two deep learning models of BiLSTM and CNN on five datasets, including HARD, Khooli, Arabic Jordanian General Tweets (AJGT), ArSAS, and ASTD for sentiment classification. The BiLSTM model outperforms CNN on three datasets, while CNN performs better on smaller datasets. In addition, the generated embeddings outperform their pre-trained versions by about 0.28% to 1.8% accuracy. The contextualized transformer-based embedding BERT model achieves the highest performance in both trained and pre-trained versions. Another study in [
20] employed Deep Neural Networks along with investigating Support Vector Machines (SVM), Naive Bayes (NB), and Random Forest (RF) as classical ML models that were tuned using Differential Evolution (DE) algorithms for classifying the sentiment of Arabic texts related to monkeypox. The dataset used was collected from Twitter over eight months, resulting in 4763 tweets. The best result was obtained using the DNN based on Leaky ReLU with an accuracy of 92%.
Classical ML has also been used for ASA. Thus, several supervised ML models have been applied in [
21], including SVM, Linear Regression, NB, Complementary Naive Bayes (CNB), and Stochastic Gradient Descent (SGD) for both sentiment and sarcasm classification. These models were trained and tested with 5-fold cross-validation on the ArSarcasm-v2 dataset. The best accuracy was achieved using SVM with 59.8% and 74.6% for sentiment and sarcasm, respectively. Based on the same dataset, an improvement was presented in [
22] by applying different versions of two transformer-based models, AraELECTRA and AraBERT, for sarcasm and sentiment detection. The best results for sarcasm were achieved by the AraBERTv2-base model with an accuracy of 78.3%, while AraBERTv0.2-large was the best for the sentiment task, with an accuracy of 65.15%. It is important to note that the pre-trained model in [
3] was not used to generate the embeddings. Instead, the study presents a three-stage fine-tuning approach for a pre-trained model called Arabic BERT, which was developed for Arabic sentiment analysis. The stages consist of text preprocessing and data cleaning, transfer learning of the weights of the pre-trained model, and a classification layer. The model was evaluated on five different Arabic review datasets and compared with 11 state-of-the-art models, which it outperformed in prediction accuracy.
Researchers in SA follow different strategies to deal with emojis; some researchers just eliminate the emojis, while others have considered the significance of emojis in their work [
23]. Including the emojis can help in expressing writers’ feelings, which helps in improving the classification performance [
24].
One strategy exploits the emojis in SA by replacing the emojis with textual data, such as the study in [
25], which is directed towards translating emojis by conducting emoji Unicode translation. Also, it investigates the effect of combining Recurrent Neural Network (RNN), LSTM, and Gated Recurrent Unit (GRU) in conjunction with Logistic Regression (LR), RF, and SVM and grid search to improve the prediction performance for Arabic sentiment analysis. The model performance is compared with three deep learning models, which are RNN, LSTM, and GRU, implemented with CBOW word embedding and tuned using Keras-tuner, and with five ML models, which are Decision Tree (DT), LR, K-Nearest Neighbor (KNN), RF, and NB, implemented with the Term Frequency–Inverse Document Frequency (TF-IDF) feature extraction model and grid-search cross-validation for model tuning. Different datasets are used for training and testing the models: ASTC, ArTwitter, and AJGT. Stacking LR achieved the highest testing accuracy of 92.22% compared to ML models and DL models when using the ASTC dataset. Also, the study in [
26] used a Russian dataset of 6957 posts and each post has at least one emotional indicator (emojis, emoticons, punctuation marks that express emotions); each emotional indicator was replaced with its meaning to improve the model. The best model was an ensemble model of word2vector model and a model of emotional indicator embedding tested on a dataset of 524 posts with an accuracy of 91%.
Another strategy to improve SA is to use emojis as non-verbal features. The study in [
23] adapted non-verbal features for the task of Arabic sentiment analysis. Thus, several ML models including NB, multinomial naive Bayes (MNB), SGD, sequential minimal optimization-based support vector machines (SMO-SVM), DT, and RF were evaluated on emoji-based features with a feature vector of length 429 and for 2091 instances. The MNB achieved the best Area Under the Curve (AUC) of 87.30% when applied to the top 250 most relevant emojis selected using ReliefF and Correlation-Attribute Evaluator feature selection techniques. In [
27], several models were also investigated, including SGD, SVM, Gaussian NB, KNN, DT, LSTM, GRU, Bi-LSTM, and bidirectional-GRU, to evaluate non-verbal features. A dataset of 2091 microblogs, obtained after excluding tweets without emojis, was collected from ASTD, ArTwitter, QCRI, Syria, Semeval-2017 Task4 Subtask#A, and 843 Arabic microblogs with emojis from Twitter and YouTube. Then, the Emoji Sentiment Ranking (ESR) lexicon, which is an emoji lexicon containing 969 used emojis after excluding the unused emojis, and Principal Component Analysis (PCA) were applied to reduce the dimensionality of the features from 430 to 100. The best accuracy of 71.71% was achieved by the bidirectional-GRU model. In addition to non-verbal features, textual features were also used in the study of [
9]. Thus, five datasets were used after removing instances that did not contain emojis, including Syria, ASTD, ArTwitter, QCRI, and Semeval-2017. After merging all the datasets, each tweet was divided into textual and emoji features, and then for the feature extraction step, the TF-IDF, Latent Semantic Analysis (LSA), and two methods of word embedding were used to extract textual features, while a set of 120 emojis was used to calculate the occurrence of each emoji to obtain nonverbal features. The SVM achieved the best results by merging skip-gram features with emojis and using correlation-based feature selection with an accuracy of 83.02%.
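The split of each tweet into textual and emoji features, with emoji occurrences counted over a fixed emoji set, can be sketched as follows. This is only an illustration under stated assumptions: `EMOJI_VOCAB` is a hypothetical three-emoji stand-in for the 120-emoji set used in the cited study, and the sketch handles single-code-point emojis only.

```python
from collections import Counter

# Hypothetical stand-in for the fixed emoji set (the cited study used 120 emojis).
EMOJI_VOCAB = ["😀", "❤", "😢"]


def split_text_and_emojis(tweet, vocab=EMOJI_VOCAB):
    """Separate a tweet into its text part and the list of known emojis it contains."""
    vocab_set = set(vocab)
    emojis = [ch for ch in tweet if ch in vocab_set]
    text = "".join(ch for ch in tweet if ch not in vocab_set)
    return " ".join(text.split()), emojis


def emoji_count_vector(tweet, vocab=EMOJI_VOCAB):
    """Non-verbal feature vector: occurrence count of each vocabulary emoji."""
    _, emojis = split_text_and_emojis(tweet, vocab)
    counts = Counter(emojis)
    return [counts[e] for e in vocab]
```

The text part would then go to a textual feature extractor (TF-IDF, LSA, or word embeddings in the cited study), and the count vector would be concatenated with those textual features before classification.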
In [
28], another approach was applied by training an attention-based long short-term memory network on the embeddings generated by bi-sense emojis and inspired by word sense embedding. To obtain sentiment-aware embeddings of emojis, the bi-sense emojis were learned under positive and negative sentimental tweets. The best accuracy of 90% was achieved on the AA-sentiment dataset using Multi-Level Attention-based LSTM with bi-sense emoji embedding (MATTBiE-LSTM) and 83.4% on the HA-sentiment dataset using word-guide attention-based LSTM with bi-sense emoji embedding.
The previous studies used different hybrid models, transformers, and emoji-handling strategies for ASA. However, the morphological complexity of the Arabic language and the effect of several factors that change the meaning of the text, such as punctuation, non-Arabic words, and emojis, mean that the ASA field needs further investigation. The studies in [
13,
17,
18,
19] applied the hybrid and deep learning models on prepared datasets without exploring the effect of emoji meaning, punctuation, or sentences in other languages on the final classification results. Other studies applied classical machine learning models, such as [
21,
22], but these models could not overcome the complexity of Arabic, so they did not reach high accuracy scores. On the other hand, the studies in [
20,
25,
26] treated the emojis by replacing them with textual data. In contrast, the studies in [
23,
27] treated the emojis as non-verbal features and removed text which may be rich in sentiment that can improve the model results, so in [
9] both the non-verbal features and the original text were used. Although these studies examined emojis, they did not investigate the effect of keeping non-Arabic words or punctuation, which transformers are most suitable when non-Arabic words or punctuation remain in the Arabic text, or the effect of encoding emojis with their emotional and real meanings. In this study, we propose a combination of CNN and LSTM models trained and tested on the ASTC dataset to improve ASA. The study also evaluates the proposed hybrid model under different experiments and conditions to understand the importance of each data preprocessing step, including the suitability of the Keras and AraVec transformers when punctuation and non-Arabic words are kept, emoji handling, and the effect of keeping non-Arabic words and punctuation on the model results.
5. Discussion
In this research study, we present a combined deep learning approach for the analysis and classification of Arabic tweets. We also investigated the role of preprocessing in Arabic sentiment analysis by checking the model performance with different preprocessing groups to find the most suitable set of preprocessing steps for the tweet dataset. Finally, we investigated the translation of emojis into their meanings to understand their importance in data preparation.
In the first experiment, all non-Arabic words and punctuation were removed. Then, the model was used to evaluate the different techniques for handling emojis, stemming, and embedding. The results in
Table 4 show that removing emojis from the data resulted in poor classification accuracy in R1, R5, R7, and R8, whereas translating emojis into real and emotional meanings improved the model accuracy in R2, R3, R4, and R6, reaching 91.69% in R3 when using Snowball stemmer and Keras embedding. Also, using the ISRI stemmer in R2 gave a close result of 90.23%. In R4 and R6, the pre-trained AraVec 3.0 had less effect on improving the model results, with an accuracy of 87.32% and 76.09%, respectively.
In the second experiment, the results in
Table 5 suggest that keeping the non-Arabic words had no positive effect on the results when using ISRI stemmer and emoji encoding or AraVec 3.0 embedding and emoji encoding in R2, R4, and R6 over the results in
Table 4, while keeping the non-Arabic words improved the results in R3 and R5 when using Snowball stemmer and Keras embedding. This indicates that the combination of Snowball stemmer and Keras embedding can deal with both the emotions carried by the emojis and the words written in other languages and can use them to build a fuller vector representation, while the ISRI stemmer and AraVec transformers could not exploit the non-Arabic words to improve the classification results, especially when using emoji encoding. This is because AraVec is a pre-trained model trained on Arabic tweets and texts from Wikipedia, and the existence of non-Arabic words affects its transformation performance, while the Keras transformer is trained on the same dataset, which helps it to provide a better representation of the emotions from the emojis and the non-Arabic words. Thus, the best result over all experiments, 91.85%, was achieved in Exp. 2 R3 by keeping the non-Arabic words. Non-Arabic words often act as strong sentiment indicators that contribute to the overall meaning of a post. For example, a tweet containing the phrase “I love” would likely indicate a positive sentiment. Removing these tokens would remove important context from the post, potentially leading to misclassification. Also, AraVec 3.0 embedding was slightly positively affected by keeping the non-Arabic words and removing the emojis compared to Experiment 1 R7 and R8. This can be explained by the fact that removing the emojis helps AraVec to provide vector representation for the tweets with an output dimension of 100, while Keras embedding uses the output dimension selected with the Keras tuner and generates more meaningful embeddings.
In the third experiment, non-Arabic words and punctuation were retained, and the effect of emoji removal and emoji encoding on the model was the same as in Experiments 1 and 2: the accuracy achieved when removing the emojis was improved by replacing each emoji with its meaning. This experiment also tested the effect of punctuation by comparing the model performance when keeping both the punctuation and the non-Arabic words against keeping the non-Arabic words and removing the punctuation. The results show that keeping the punctuation had a negative effect on the model accuracy in most runs, especially R3, which provided the best accuracy in Exp. 2, while the R2 and R6 results improved. This suggests that Twitter users use punctuation indiscriminately. Thus, removing the punctuation provides a more reliable and consistent model.
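The three experimental settings discussed above differ only in whether non-Arabic words and punctuation are kept. A minimal Python sketch of such a switchable cleaning step is shown below. It is an illustration under stated assumptions, not the authors' pipeline: non-Arabic words are approximated as runs of Latin letters, the punctuation set is Python's ASCII punctuation plus three common Arabic marks, and the function and parameter names are hypothetical.

```python
import re
import string

# Assumption: "non-Arabic words" are approximated as runs of Latin letters.
LATIN_WORD = re.compile(r"[A-Za-z]+")

# Assumption: ASCII punctuation plus the Arabic comma, semicolon, and question mark.
PUNCTUATION = string.punctuation + "،؛؟"
PUNCT_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def preprocess(text, keep_non_arabic=False, keep_punctuation=False):
    """Clean a tweet according to the selected experimental setting."""
    if not keep_non_arabic:
        text = LATIN_WORD.sub(" ", text)
    if not keep_punctuation:
        text = text.translate(PUNCT_TABLE)
    return " ".join(text.split())


tweet = "I love هذا المكان!"
exp1 = preprocess(tweet)                                              # both removed
exp2 = preprocess(tweet, keep_non_arabic=True)                        # non-Arabic words kept
exp3 = preprocess(tweet, keep_non_arabic=True, keep_punctuation=True)  # both kept
```

With these flags, the first setting corresponds to Experiment 1, the second to Experiment 2, and the third to Experiment 3, so the same downstream model can be trained on each variant and compared.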
The results obtained by the proposed approach applied to the ASTC dataset were compared with the results obtained by following different approaches that applied to the same dataset. The comparison was made with the study in [
25] and presents the difference in the study aim, preprocessing steps, and the classification model, as shown in
Table 7.
The proposed model shows comparable results with the results obtained in Heterogeneous Ensemble Deep Learning Model for Enhanced Arabic Sentiment Analysis, which used the emoji Unicode translation and CBOW word embedding to generate a numerical representation of the text to use the RNN, LSTM, and GRU combined with three meta-learners, LR, RF, and SVM, to classify the tweets [
25]. The investigated model in [
25] aimed to improve prediction performance for Arabic sentiment analysis. It started with data preprocessing, cleaning the data by removing non-Arabic letters, digits, single Arabic letters, symbols, URLs, emails, and hashtags. Then, tokenization was carried out by splitting the text on spaces, followed by the removal of stop words, stemming with the ISRI stemmer, and emoji Unicode translation. In contrast, the proposed model investigated the hybrid CNN-LSTM model with different data preprocessing steps to achieve comparable accuracy and highlighted the effect of encoding emojis with their emotional and real meanings, as well as the effects of non-Arabic words, punctuation, Arabic stemmers, and trainable and pre-trained transformers. This research shows the compatibility between the Snowball stemmer, Keras embedding, and the CNN-LSTM model, and shows how keeping the non-Arabic words improved the model, while keeping the punctuation had a negative effect on it. Moreover, both the study in [
25] and our approach showed the role of using the emoji meaning to enrich the sentiment of the text by achieving an accuracy of 92.22% and 91.85%, respectively, while our approach provided a comparison between the results when removing the emojis and when transforming them, which validates the meaning of the emojis in the generated emoji meaning dataset.