The twenty-first century has produced an excess of data, resulting in a digital infusion of technology and paving the way for the growth of big data [21]. People all over the world are becoming more electronically sophisticated, using devices such as digital cameras with sensing capability, smartphones, and communication tools to access social media and disseminate information within their communities, which has increased the number of data-processing actuators [22,23]. In the age of big data, sentiment analysis has proven effective for categorising public attitudes and assessing public mood [24]. In the remainder of this section, we first review sentiment analysis in the financial sector in Section 2.1. In Section 2.2, we explore lexicon- or rule-based approaches, which leverage predefined sentiment dictionaries or rules to assign sentiment scores. These approaches are easy to implement, require few computational resources, and can be tailored to specific domains; however, they may struggle to capture subtle sentiments and require lexicons to be extended with new words and terms when applied to new domains. In contrast, most ML approaches learn from labelled data and offer higher accuracy, but they require substantial domain-specific training data and must be retrained for new domains. In Section 2.3, we examine deep learning approaches, which can automatically learn intricate patterns, understand context, and deliver outstanding performance, though they demand significant resources and lack interpretability. Finally, we present the LFEAR model, which improves natural language models through conversation structure and RAG.
2.1. Financial Sentiment Analysis
As researchers sought to understand the nuanced expressions and idiosyncratic language of finance-related writing, the field of financial sentiment analysis began to take shape. Yet even the best-performing model to date, the pre-trained language model FinBERT, which substantially outperforms general-purpose models on finance-related tasks, has proven unable to keep pace with the slowly but constantly evolving language of finance [25]. Meanwhile, Gite et al. [26] proposed a method for improving stock price prediction that fuses market sentiment, as expressed in the headlines of relevant news articles, with price-relevant data about the stocks themselves. However, the approach targets only a narrow slice of the market, has been shown to rely heavily on historical data, and offers an inadequate framework for the real-time shifts in market sentiment that high-frequency traders track.
Changing market conditions affect financial sentiment models, necessitating models that can adapt to these changes without relying solely on static, pre-labelled datasets. Sharaff et al. [27] extended an LSTM framework with word embeddings, previously adapted as a financial news sentiment tool, and pushed its performance beyond that of many current models. This is, no doubt, due to the crucial role that context and temporal dependencies play in extracting meaning from the often dense and terse language of such articles. However, what appears at first blush to be a high-performing model has very limited applicability to the fast-paced world of finance because of its heavy reliance on long-term, semi-static historical datasets. Mishev et al. [28] compared various sentiment analysis methods, focusing on the most advanced models. Although today's transformer-based models perform well, the authors pointed out fixable problems that still need to be addressed, chiefly concerning the kind and amount of data available for training these models. They also highlighted an equally important issue: what one does with a model's output, particularly in real-time decisions that can have significant financial implications.
Additionally, present-day models often overlook specialised financial vocabulary, which encompasses not only the distinctive terminology of the domain but also its abbreviations and context-driven shifts in sentiment. Liu et al. [29] applied a novel sentiment analysis approach to social media data, extracting investor sentiment from these otherwise undervalued reservoirs of information. After thoroughly validating their tools, they found that sentiments from these networks are best interpreted not as “positive”, “neutral”, or “negative”, but in the more contextually relevant categories “buy”, “hold”, and “sell”. Ardekani et al. [30] introduced a financial sentiment model that uses contextual language processing, but their work still does not address the inherent difficulties of aspect-based sentiment analysis, which is crucial for interpreting the multi-faceted nature of financial data.
2.2. Lexicon-Based Approaches
Lexicon-based approaches are popular for detecting sentiment in text because they are simpler and faster than supervised learning methods. Yue et al. [31] conducted a study comparing the performance of supervised and unsupervised machine learning techniques for sentiment analysis on a given set of Twitter messages. Their findings revealed that supervised methods generally exhibit superior accuracy to unsupervised approaches like lexicon-based algorithms. Nonetheless, acquiring sufficient labelled training data for supervised methods can be costly and time-consuming.
Kiritchenko et al. [32] used a lexicon to determine (a) the sentiment of short, informal text messages such as tweets and SMS on the SemEval-2013 dataset (message-level task) and (b) the sentiment of a word or phrase within a message (term-level task). In addition, the researchers employed commonly used statistical features such as word, character, elongated-word, punctuation, and POS tag counts, as well as common Twitter-specific attributes such as emoticon and hashtag counts, along with mechanisms to handle negation. The scheme was tested by selecting 1455 high-frequency terms from the Sentiment140 Corpus and the Hashtag Sentiment Corpus, which includes 1.6 million tweets with positive and negative sentiment labels. The data consist of regular English words, Twitter-specific terms (e.g., emoticons, abbreviations, and creative spellings), and negated expressions, manually annotated using MaxDiff (available online: https://saifmohammad.com/WebPages/lexicons.html (accessed on 1 October 2024)) to assign a score to the most prominent words in each tweet and SMS [33]. The approach achieved a macro-averaged F-score of 69.02% on the message-level task and 88.93% on the term-level task. A linear-kernel support vector machine (SVM) was trained on the SMS messages. The system achieved an F-score of 70.45% for the message-level task and an F-score of 89.50% for the term-level task, and came in second for detecting the sentiment of terms within SMS messages (F-score of 88.00%, 0.39 points behind the first-ranked system).
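To make the flavour of these surface features concrete, the following minimal Python sketch computes illustrative counts of the kind listed above (word, character, elongated-word, punctuation, hashtag, and emoticon counts, plus a crude negation flag). The regular expressions and feature names are our own simplifications, not the exact implementation of Kiritchenko et al. [32]:

```python
import re

def tweet_surface_features(text: str) -> dict:
    """Illustrative surface features in the spirit of Kiritchenko et al. [32]."""
    tokens = text.split()
    return {
        "num_words": len(tokens),
        "num_chars": len(text),
        # Elongated words, e.g. "soooo": any character repeated 3+ times.
        "num_elongated": sum(bool(re.search(r"(\w)\1{2,}", t)) for t in tokens),
        "num_exclaim": text.count("!"),
        "num_question": text.count("?"),
        "num_hashtags": sum(t.startswith("#") for t in tokens),
        "num_all_caps": sum(t.isupper() and len(t) > 1 for t in tokens),
        # Very rough emoticon pattern; real systems use curated lists.
        "num_emoticons": len(re.findall(r"[:;=][-']?[)(DPp]", text)),
        # Crude negation flag; the original marks a whole negated context span.
        "has_negation": int(bool(re.search(r"\b(not|no|never)\b", text.lower()))),
    }

print(tweet_surface_features("This is sooo good!!! :) #winning, not kidding"))
```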
Bradley, M.M., and Lang, P.J. [34] introduced affective norms for English words (ANEWs), which provide emotional ratings for a large set of English words. The positivity and intensity levels of stimuli affect emotional responses, influencing how we process and perceive emotional information. The author of [35] presented AFINN-96, which consists of 2477 distinct words, plus 15 phrases that were ultimately not used. The author simplified the process by scoring words on valence alone, excluding factors such as subjectivity/objectivity, arousal, and dominance, and assigning scores manually. The author of [35] also used ANEW as the basis for a classification-based fuzzy model with five linguistic labels, classifying tweets into five fuzzy opinion categories (very negative, negative, neutral, positive, and very positive), which allowed for a more nuanced understanding of sentiment in social media data. While ANEW includes many words commonly used in English, it lacks depth given the evolving nature of language in online communication and social media posts. ANEW also does not account for negation, which makes evaluation challenging because negations do not always reverse the meaning of a word, especially when adverbs are involved, adding further complexity. The author of [36] introduced a novel approach for analysing sentiment in tweets that combines adjectives, adverbs, and verbs to determine the sentiment score, with the actual polarity of the text classified using a linear function that incorporates emotion intensity.
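As a toy illustration of valence scoring in the spirit of AFINN, the sketch below sums hand-assigned word scores and naively flips the sign after a negator. The miniature lexicon is invented for illustration (the real AFINN list contains thousands of entries), and the final call shows how an intervening adverb defeats the naive negation window, the very complexity noted above:

```python
# Minimal valence-scoring sketch in the spirit of AFINN [35]. The tiny lexicon
# below is invented for illustration only.
LEXICON = {"good": 3, "great": 4, "bad": -3, "terrible": -4, "happy": 3}
NEGATORS = {"not", "no", "never"}

def valence_score(text: str) -> int:
    score, negate = 0, False
    for tok in text.lower().split():
        if tok in NEGATORS:
            negate = True            # flip the next sentiment-bearing word
            continue
        if tok in LEXICON:
            score += -LEXICON[tok] if negate else LEXICON[tok]
        negate = False               # negation only reaches one token ahead
    return score

print(valence_score("not good"))        # -3: negation flips the sign
print(valence_score("not very good"))   # +3: "very" breaks the naive one-token
                                        # window, illustrating the adverb problem
```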
Hutto and Gilbert (2014) [37] proposed the valence-aware dictionary and sentiment reasoner (VADER), a lexicon- and rule-based sentiment analysis tool specifically attuned to sentiments expressed in online text collected from social media and the Internet. VADER (available online: http://www.nltk.org/_modules/nltk/sentiment/vader.html (accessed on 17 September 2024)) handles emoticons, idioms, punctuation, negation, emphasis, and contrasts. VADER also considers the impact of words in ALL CAPS used for emphasis: depending on the original sentiment of the word, the polarity is adjusted by 0.733. Additionally, VADER can identify negated sentences and evaluate sentiment shifts brought on by contrastive conjunctions such as “but”. The empirical validation of VADER rests on ratings from multiple independent human judges, yielding a “gold-standard” sentiment lexicon especially attuned to microblog-like contexts.
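The NLTK implementation linked above can be exercised directly. The short sketch below (assuming NLTK is installed and the VADER lexicon has been downloaded) shows how capitalisation, punctuation, and the contrastive conjunction “but” change VADER’s scores:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

# VADER's rules react to capitalisation, punctuation, and contrastive "but".
for sentence in [
    "The service was good.",
    "The service was GOOD!!!",                             # caps + punctuation boost intensity
    "The service was good, but the fees are outrageous.",  # "but" shifts polarity
]:
    print(sentence, "->", sia.polarity_scores(sentence))
```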
Çılgın et al. [38] used VADER to analyse and perform opinion mining on Twitter data. In addition to the binary classification that almost all other Twitter-based sentiment analysis models provided, their model also supported multi-class classification. VADER proved an apt choice for analysing large datasets. The model’s limitations were that only a small percentage of the available data were used, a general-purpose lexicon was used to categorise the data, and no model training was performed. Newman and Joyner [39] used VADER to analyse student evaluations of teaching from three sources. They compared the positive/negative valences of the comments, identified frequently used keywords, and determined the impact of the comments containing those keywords.
Elbagir et al. [40] used VADER to analyse sentiments in social media posts at the word and sentence levels. The text of tweets was preprocessed to remove unwanted characters and tokens such as punctuation, unicode artefacts, URLs, emails, currency symbols, and numerals. Jain et al. [41] combined natural language processing (NLP) with VADER to predict sentiment from social media data such as X, Facebook, and Reddit, interpreting and explaining the results using heatmaps. Using VADER on the Twitter dataset, the authors achieved 69.52% accuracy, 64.88% precision, 85.10% recall, and a 73.63% F1-score after normalisation. Borg et al. [42] used a linear SVM as the machine learning classifier together with VADER to predict customer feedback sentiment for a large Swedish telecom corporation, analysing a dataset of 168,010 emails. The authors employed a Swedish sentiment lexicon and VADER for sentiment analysis, achieving an F1-score of 83.4% and a mean AUC of 0.896. Moreover, the authors of [42] identified a pattern in email discussions that could potentially predict the emotions of unseen emails.
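A minimal sketch of a hybrid pipeline in the spirit of Borg et al. [42] is shown below: TF-IDF features are augmented with VADER’s four polarity scores and fed to a linear SVM. The toy feedback texts and labels are invented, and we substitute VADER’s English lexicon for the Swedish lexicon used in the original study:

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # lexicon assumed downloaded
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from scipy.sparse import hstack, csr_matrix

# Invented toy feedback messages standing in for the email corpus.
texts = ["great support, thank you", "still waiting, very disappointed",
         "issue resolved quickly", "worst service ever"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

sia = SentimentIntensityAnalyzer()
tfidf = TfidfVectorizer()
X_words = tfidf.fit_transform(texts)
# Append VADER's neg/neu/pos/compound scores as four extra features per text.
X_vader = csr_matrix([list(sia.polarity_scores(t).values()) for t in texts])
X = hstack([X_words, X_vader])

clf = LinearSVC().fit(X, labels)
print(clf.predict(X))
```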
Hu et al. [43] proposed a method to summarise customer reviews of electronic products from Amazon and C|net.com based on product features. Nevertheless, misspelt words frequently appear among the features extracted from social media content, which can make it challenging for automated systems to accurately identify and summarise customer opinions. Additionally, the constant evolution of language and slang in online reviews adds another layer of complexity to extracting meaningful insights from customer feedback.
Baccianella et al. [44] proposed SentiWordNet 3.0, a lexical resource specifically designed to support sentiment classification and opinion mining applications. SentiWordNet 3.0 is an upgraded version of SentiWordNet 1.0, a lexical resource made publicly available for research purposes, licensed to over 300 research groups, and used in a wide range of research projects throughout the world [44]. SentiWordNet 1.0 and 3.0 are the outcomes of automatically annotating all WordNet synsets for positivity, negativity, and neutrality.
Moshkin et al. [45] used fuzzy ontology subgraphs based on lexical dictionaries to identify morphological features in VKontakte text fragments, such as words, smileys, and style. They determined sentiment by analysing features of the subject area, focusing on syntagmatic units rather than individual words to capture compatibility. The lexical ontology was assessed using SentiWordNet 3.0, which is built on WordNet 3.0. The method was tested with ML algorithms, namely the Naive Bayes (NB) classifier, linear regression, and SVM, on 420 VKontakte posts and comments, achieving average accuracies of 78.33%, 65.24%, and 75.25%, respectively. The study revealed that the NB classifier outperformed the other ML algorithms tested.
Sadhasivam et al. [46] retrieved a dataset from an official product review site. The data were cleaned by eliminating unnecessary elements such as stop words, verbs, punctuation, and conjunctions. The authors computed the positive and negative probabilities for each word in the dataset, merged these probabilities, and determined the sentiment from the higher of the two. To determine sentiment, each data instance is converted into a richer representation, and the strongest indicator of sentiment is then identified mathematically with the assistance of SentiWordNet. The dataset was trained using NB, SVM, and ensemble methods with positive and negative labels, resulting in an accuracy of 78.86%. Nevertheless, the accuracy of the predictions fluctuates depending on the number of classifiers combined for the review output, and it is difficult to precisely interpret how users convey emotions through emoticons in reviews.
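The word-probability scheme described above is essentially Naive Bayes. The following self-contained sketch (with invented toy reviews, not the authors’ dataset) computes smoothed per-class word probabilities, merges them as log-probabilities, and labels a text with the higher-scoring class:

```python
from collections import Counter
import math

# Toy training data standing in for labelled product reviews.
pos_docs = ["fast delivery great product", "great quality very happy"]
neg_docs = ["broken on arrival", "poor quality very unhappy"]

pos_counts = Counter(w for d in pos_docs for w in d.split())
neg_counts = Counter(w for d in neg_docs for w in d.split())
vocab = set(pos_counts) | set(neg_counts)

def class_log_prob(words, counts):
    total = sum(counts.values())
    # Laplace smoothing so unseen words do not zero out the product.
    return sum(math.log((counts[w] + 1) / (total + len(vocab))) for w in words)

def classify(text):
    words = text.split()
    pos, neg = class_log_prob(words, pos_counts), class_log_prob(words, neg_counts)
    return "positive" if pos > neg else "negative"

print(classify("great product very happy"))   # -> positive
```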
Selecting an appropriate feature subset is crucial in sentiment classification. Tools like LIWC can extract psycholinguistic aspects from texts for analysis [47]. For example, Onan et al. [48] introduced a psycholinguistic approach to sentiment analysis on Twitter. LIWC extracts psycholinguistic features such as linguistic processes, psychological aspects, personal concerns, spoken categories, and punctuation from texts; these features are then processed by an ML algorithm. The authors of [48] tested the proposed approach on English Twitter messages comprising 6218 negative, 4891 positive, and 4252 neutral tweets. NB achieved the highest predictive performance (77.35% accuracy) when using the linguistic feature set; incorporating ensembles, the Random Subspace ensemble of NB achieved a classification accuracy of 89.10%. Similarly, Koutsoumpis et al. [49] used the five main personality traits and 52 linguistic categories to find links between self-reports, reports from others, life outcomes, and behavioural measures of personality for text-based assessments. The results indicate that text-based personality assessment offers precise and dependable insights into an individual’s personality traits. As another example, Chen et al. [50] used the computerised LIWC to quantify students’ cognitive, emotional, and social engagement in social annotation. The authors of [50] also explored how students with varying levels of engagement differ in their social annotation behaviours. They used a statistical method to analyse data from 91 undergraduate students and 29 reading materials, identifying two distinct engagement patterns, labelled “active” and “passive”, with a high Cohen’s Kappa of 0.78. Despite a comprehensive examination of student interactions in social annotation, the study did not examine interactions between students and instructors or the contribution of student–instructor relationships to overall learning outcomes. Li et al. [51] explored the patterns in how students used annotations and how they responded to them in social annotation activities, examining how students’ performance in behavioural, cognitive, emotional, and social areas changed based on their interactions. They gathered 93 undergraduates enrolled in an elective course at a large North American university, and the students were tasked with collaboratively annotating class readings uploaded to Perusall, a social annotation platform, over 7 weeks. To classify student behaviours into groups, the researchers in [51] used a metaclustering analysis based on the numbers of annotation and response behaviours, combining multiple clustering solutions to make the result more robust and reliable. They applied the K-means algorithm to the annotation and response counts to find the best number of clusters from 905 data instances, and then used LIWC [52], a text mining tool, to evaluate the levels of students’ cognitive activities, specifically cognitive insight and cognitive discrepancy.
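A common way to pair K-means with a data-driven choice of the number of clusters is to scan candidate values of k and keep the one with the best silhouette score. The sketch below does this on synthetic annotation/response counts; whether Li et al. [51] used the silhouette criterion is not stated above, so this is an illustrative stand-in:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the two behavioural counts per student instance:
# (number of annotations, number of responses).
X = np.vstack([rng.normal([5, 2], 1.0, (50, 2)),     # "passive"-like group
               rng.normal([20, 12], 2.0, (50, 2))])  # "active"-like group

best_k, best_score = None, -1.0
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

print(f"best k = {best_k}, silhouette = {best_score:.2f}")  # expect k = 2
```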
Word sense disambiguation (WSD) is a method used to determine the intended meaning of a word with multiple senses. Rentoumi et al. [53] introduced the use of WSD to assign polarity based on n-gram graphs for analysing sentiment in text containing figurative phrases. This polarity assignment adjusts the sentiment of figurative phrases according to their context, using a window of eight surrounding words to determine similarity and generate a gloss vector (GV) in WSD. The GV method builds a co-occurrence matrix of words, where each cell indicates how often the words represented by the row and the column occur together in a WordNet gloss. Each word in a WordNet gloss is thus represented as a vector in a multi-dimensional space given by its row. For each word in the gloss, a context vector is formed from the corresponding row of the co-occurrence matrix, and a gloss vector is then computed as the average of these context vectors for each word sense. Similarity is determined after analysing the parts of speech (POS) with the Stanford POS tagger. The next step assesses sentiment by associating WordNet senses with words in the text, which are classified into positive or negative categories. The authors used hidden Markov models (HMMs) to classify the sentiment of sentences, and the results were validated on the Affective Text task of SemEval’07 [54]. The results show that this method effectively assigns sentiment to figurative language, indicating its potential for use in other NLP tasks.
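To clarify the gloss-vector construction, the toy sketch below builds a word co-occurrence matrix from two invented sense glosses, averages the row (context) vectors of each gloss’s words into a gloss vector, and scores a context against each sense by cosine similarity. The glosses and vocabulary are illustrative, not WordNet’s:

```python
import numpy as np

# Toy glosses standing in for WordNet sense glosses.
glosses = {
    "bank.n.01": "financial institution that accepts deposits",
    "bank.n.02": "sloping land beside a body of water",
}
vocab = sorted({w for g in glosses.values() for w in g.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence matrix: words co-occurring within the same gloss.
C = np.zeros((len(vocab), len(vocab)))
for g in glosses.values():
    words = g.split()
    for a in words:
        for b in words:
            if a != b:
                C[idx[a], idx[b]] += 1

def gloss_vector(gloss: str) -> np.ndarray:
    # Average of the context (row) vectors of the gloss's known words.
    return np.mean([C[idx[w]] for w in gloss.split() if w in idx], axis=0)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

context = "deposits held at the financial institution"
ctx_vec = np.mean([C[idx[w]] for w in context.split() if w in idx], axis=0)
for sense, g in glosses.items():
    print(sense, round(cosine(ctx_vec, gloss_vector(g)), 3))  # bank.n.01 scores higher
```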
Jayakrishnan et al. [55] used WSD and SVM classification on non-English text to detect emotions. They obtained a 91.8% success rate, which could rise further if semantic and syntactic features were added. Hogenboom et al. [56] used historical stock prices of NASDAQ-100 companies and news articles from Dow Jones Newswires to test graph-based WSD as a sentiment-based predictor of stock prices, obtaining a 53.3% success rate. However, the system cannot evaluate more complex trading strategies, and future research is expected to incorporate a more explicit notion of human sentiment with respect to news articles.
Table 1 provides an overview of various lexicon-based methods for sentiment analysis. It shows the reference, the lexicon size, the attributes used to represent text, an acronym for the proposed model, the source or dataset, and the evaluation metrics, such as precision (PR), recall (RC), F1-score (F1), and accuracy (ACC), or the result obtained for comparison.
2.3. Deep Learning-Based Approaches
Recently, the number of people actively involved in social media has been increasing rapidly [58]. People express their feelings in the form of reviews, comments, posts, and statuses on various topics [59]. Given the tremendous amount of data generated on the Internet, traditional methods using predefined rules or classical ML algorithms are no longer sufficient to manage the volume of data available [60,61]. Hence, the adoption of deep learning models in NLP is essential to uncovering valuable insights from these unorganised data. These models have delivered positive outcomes in analysing emotions, condensing text, and translating languages, becoming essential instruments for comprehending and exploiting the large volume of text online. Deep learning models have been successful in identifying patterns and trends in text data, providing valuable information for businesses to make informed decisions. They also have the potential to revolutionise customer service by automating responses and improving the overall user experience, and they can help businesses personalise marketing strategies and target specific customer segments more effectively. In South Africa, consumer reviews on social media platforms like Hellopeter provide essential insights into the performance and public perception of financial institutions; however, finance-related documents have remained challenging for traditional sentiment analysis methods.
The recent advancements in LLMs and NLP have led to better ways of conducting sentiment analysis tasks, driving changes in model structures, pre-training techniques, and the integration of RAG technologies. RAG retrieves relevant information from external data in real time to improve the context and relevance of generated output; the retrieved information can then be used to fine-tune the model itself or to support downstream tasks [62]. Additionally, RAG technologies enable models to interact with structured knowledge sources, allowing them to answer questions beyond the scope of the provided data. For instance, Liu et al. [63] proposed a method for text classification that highlights the importance of models that are easy to understand and interpret, in addition to performing well. Gao et al. [64] proposed a model in which information is retrieved before responses are generated, making the generated responses more precise and relevant in knowledge-intensive tasks. Fan et al. (2024) [65] and Hu et al. [66] studied RAG for NLP tasks, with a particular focus on financial sentiment analysis. Lewis et al. [67] proposed a retrieval component that searches huge document collections for relevant material and feeds it into the generative model, making the answers more accurate and contextually relevant. Zhang et al. [68] address the difficulty of using real-time, context-relevant data in financial sentiment analysis and show how RAG models help solve these challenges while making outputs easier to understand and more reliable in an ever-changing financial world. These studies highlight the importance of incorporating information retrieval into NLP tasks to improve the quality of generated responses. By combining modelling and retrieval techniques, researchers have achieved more accurate and relevant results in applications such as financial sentiment analysis, leading to significant advancements in NLP that ultimately improve decision-making processes and provide valuable insights for businesses.
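At its core, a RAG pipeline retrieves the most relevant documents for a query and prepends them to the prompt given to the generative model. The minimal sketch below uses TF-IDF cosine similarity over an invented toy corpus; in a real system, the retriever would index an up-to-date financial corpus and the assembled prompt would be sent to an LLM endpoint:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy document store standing in for a corpus of financial news/reviews.
docs = [
    "Bank X cut its lending rates after the central bank announcement.",
    "Customers report long delays with Bank X's mobile app support.",
    "Bank Y posted record quarterly profits on strong trading revenue.",
]

vec = TfidfVectorizer().fit(docs)
doc_matrix = vec.transform(docs)

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank documents by cosine similarity to the query and return the top k.
    sims = cosine_similarity(vec.transform([query]), doc_matrix)[0]
    return [docs[i] for i in sims.argsort()[::-1][:k]]

query = "What is the sentiment around Bank X's customer service?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer with a sentiment label."
print(prompt)  # this prompt would then be sent to the generative model
```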
However, RAG models have their drawbacks, including the need to search vast corpora efficiently while remaining fast and accurate. Furthermore, the challenge of incorporating the retrieved knowledge into the generation process in a cohesive and contextually meaningful manner persists. Future studies could improve the retrieval function using real-time data sources and experiment with merging RAG with reinforcement learning or meta-learning methodologies. For example, Shivaprasad et al. [69] emphasised the necessity of conducting thorough sentiment analysis on product reviews to gauge customer sentiment, which subsequently affects the financial markets; in this context, the use of LLMs and RAG improves the precision of sentiment prediction and supports better strategic decisions. Zhao et al. [70] proposed a generalised pre-training framework to enhance the existing RAG model, showcasing the effectiveness of LLMs for sentiment analysis tasks. The researchers combined conversational fine-tuning with RAG approaches, markedly improving LLM performance, especially for sentiment analysis in the financial domain. Vulic et al. [71] presented a method that adapts LLMs for dialogue systems, showcasing their ability to handle complex and context-specific conversations. Likewise, Alghisi et al. [72] proposed evaluation methods for adapting LLMs to dialogue-based tasks, highlighting the advantages and disadvantages of each approach. These methods are especially useful in financial sentiment analysis: modelling the human perspective in natural language captures the full context and nuances of the language used in social media posts and offers decision-makers insights into customer behaviour and market trends.
Furthermore, the widespread use of LLMs such as ChatGPT has sparked debates about their capabilities and limitations across a variety of industries. For example, in the field of education, Fütterer et al. [73] analysed responses to ChatGPT from around the world, revealing a wide range of sentiments, from supportive to alarmed. These studies highlight the significance of considering context and culturally specific factors in the evaluation of AI. Certainly, LLMs have received praise for their achievements across various domains. Duan et al. [74] introduced an innovative hybrid neural network model for analysing financial text data; the model surpasses previous approaches in sentiment analysis by enhancing topic extraction and pre-training techniques. In general, LLMs have performed well on a number of NLP tasks, such as question answering and aspect sentiment classification (ASC). For example, Ling et al. [75] developed a retrieval-augmented method that enriches semantic representations, easing sentiment classification across different aspects. The approach can also handle multidomain sentiment classification, which transfers information from one domain to the next: models are first trained in the source domain, and the knowledge is then transferred and exploited in another domain. Owing to the enormous amount of social media data available in different forms, such as videos, audio, and photos, the conventional approach to text-based sentiment analysis has progressed into compound models of multimodal sentiment analysis, making it crucial to capture the sentiment perspectives expressed in different modalities [77]. For example, Chen et al. [76] proposed a weakly supervised multimodal deep learning (WS-MDL) model to predict multimodal sentiments for tweets; the model uses a CNN and a DynamicCNN (DCNN) to calculate multimodal prediction scores and sentiment consistency scores. The amount of data available, the number of hidden units (nodes) required to solve the problem, and other factors still influence the choice of a particular deep learning model for ASC.
Table 2 presents a comprehensive comparison of the literature discussed, across factors including the embedding representation, dataset, deep learning model, and performance metrics. For example, RC, ACC, and F1 are common metrics for sentiment analysis tasks because they show how well models analyse and classify sentiment in textual data.
In this study, we build on ARLMs, PTLMs, and RAG to establish a new framework called LFEAR. The framework aims not merely to detect but also to classify the meanings of sentences found in social media text and in financial comments about the products and services of financial institutions (e.g., Hellopeter). By integrating RAG, the LFEAR model retrieves real-time financial information dynamically, satisfying the need for constant adaptation to emerging trends and to the new language these trends provoke [95]. LFEAR not only yields a more computationally efficient model but also increases the granularity of the sentiment classifications, allowing a much clearer representation of the actual sentiments expressed in the data. A precise model is necessary, especially in the finance industry, where it can influence decision-making based on the surrounding environment and customers’ interactions with products and services [96]. LFEAR leverages a continuous learning framework to incorporate financial information in a structured manner and seeks to be adaptive, accurate, and precise. Its goal is to overcome the limitations identified in this review and offer a model comprehensive, adaptive, and accurate enough to serve as a sentiment analysis engine in the financial domain [97].