Article

Stability Analysis of ChatGPT-Based Sentiment Analysis in AI Quality Assurance

Tinghui Ouyang, AprilPyone MaungMaung, Koichi Konishi, Yoshiki Seo and Isao Echizen
1 Center for Computational Sciences, University of Tsukuba, Tsukuba 305-0006, Japan
2 Information and Society Research Division, National Institute of Informatics (NII), Tokyo 101-8430, Japan
3 Digital Architecture Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 135-0064, Japan
* Author to whom correspondence should be addressed.
Electronics 2024, 13(24), 5043; https://doi.org/10.3390/electronics13245043
Submission received: 29 October 2024 / Revised: 10 December 2024 / Accepted: 20 December 2024 / Published: 22 December 2024
(This article belongs to the Special Issue AI Technologies and Smart City)

Abstract

In the era of large AI models, the intricate architectures and vast parameter sets of models such as large language models (LLMs) present significant challenges for effective AI quality management (AIQM). This paper investigates the quality assurance of a specific LLM-based AI product: ChatGPT-based sentiment analysis. The study focuses on stability issues, examining both the operation and the robustness of ChatGPT’s underlying large-scale AI model. Experimental analysis on benchmark sentiment analysis datasets shows that ChatGPT-based sentiment analysis is susceptible to uncertainty arising from various operational factors. Furthermore, the study reveals that the ChatGPT-based model faces stability challenges, particularly when confronted with conventional small-text adversarial attacks targeting robustness.

1. Introduction

Recent advances in machine learning, particularly in deep learning (DL), have enabled artificial intelligence (AI) to excel in a wide range of applications, such as autonomous driving, e-commerce, and robotics [1,2,3]. Among these advancements, large language models (LLMs) have emerged as transformative tools in natural language processing (NLP) [4]. As a prominent LLM product of OpenAI, ChatGPT [5], powered by the Generative Pre-trained Transformer (GPT) architecture [6], leverages vast amounts of pretraining data and an immense number of parameters to achieve impressive performance in understanding and generating human-like text. Its strengths include exceptional semantic comprehension, context-aware dialogue generation, and multilingual capabilities.
ChatGPT has proven effective across a variety of NLP tasks. For instance, it serves as a translator [7], grammar checker [8], and conversational assistant for tasks such as question answering, educational guidance [9], and medical counseling [10]. Its generative abilities extend further, enabling it to augment text data for research [11] or even assist in creative endeavors. Despite these advantages, ChatGPT has also raised societal concerns, particularly in education, where its capabilities have introduced issues such as cheating in writing tasks [12]. Its performance in these diverse roles underscores its significance as a groundbreaking AI product.
However, ChatGPT’s remarkable capabilities are accompanied by notable challenges. From a societal perspective, it raises concerns about transparency, misuse, and accountability. From a research and development standpoint, ChatGPT’s complex architecture and black-box nature complicate its evaluation, testing, and maintenance. These challenges are particularly significant for researchers in software engineering and quality management, who must assess ChatGPT’s reliability, robustness, and performance to ensure effective deployment.
Several studies have been conducted to evaluate ChatGPT’s quality and stability. For correctness, its performance has been benchmarked on traditional NLP tasks like translation, grammar checking, and classification [13], as well as novel tasks such as mathematical reasoning and coding logic [14]. Robustness analyses have utilized adversarial text datasets to test its resilience against attacks [15,16]. While gradient-based attacks have been explored for open-source models like LLaMA [17], ChatGPT’s black-box nature limits direct adversarial analysis. Innovative approaches, such as using LLMs to generate adversarial prompts for testing robustness, have also been proposed [18]. Other studies have examined ChatGPT’s reliability in text classification [19], its biases and limitations on sensitive topics [20,21], and its failures in specific cases [22,23].
Despite these efforts, many challenges remain. ChatGPT’s lack of clearly defined functional requirements, as is common in traditional software systems, complicates direct evaluation. Existing studies often focus on specific tasks, such as sentiment analysis or grammar checking, or employ surrogate datasets, such as AdvGLUE [24], to evaluate performance. While these approaches uncover some weaknesses, they do not comprehensively address ChatGPT’s shortcomings as a software product.
To address these gaps, this paper investigates the stability of ChatGPT-based sentiment analysis as a case study in quality assurance [25,26]. Sentiment analysis, being a well-defined and widely studied NLP task with clear evaluation metrics, provides an ideal context for exploring ChatGPT’s stability and robustness. Rather than proposing new methodologies or metrics, this study focuses on identifying and summarizing the causes of instability from a software engineering perspective and evaluating robustness against attacks. Experimental results on benchmark sentiment analysis datasets demonstrate that while ChatGPT exhibits robustness against certain attacks, it remains vulnerable to perturbations, such as synonym substitutions, highlighting areas for improvement in ensuring its reliability and stability. The contributions of this paper are summarized as follows:
  • A simple yet effective sentiment analysis framework is proposed for quality assurance studies. This framework can serve as a reference for conducting AIQM [27] studies on other large language model (LLM)-based products.
  • Two critical types of stability in LLM-based sentiment analysis are explored: operational uncertainty and model stability, providing a comprehensive perspective for quality assurance.
  • Operational uncertainty concerns the unpredictability or complexity introduced by various factors in the operation phase. A detailed analysis of ChatGPT is conducted from multiple perspectives, including usage patterns, timing effects, and prompt engineering techniques.
  • Model stability mainly concerns robustness; it is systematically studied under four distinct types of textual perturbation to evaluate ChatGPT’s stability in handling input variations.
The structure of this paper is as follows: Section 2 provides an overview of using ChatGPT for sentiment analysis, and introduces the two dimensions of stability considered in the AIQM framework. Section 3 analyzes ChatGPT’s operational uncertainty, focusing on the unpredictability or complexity of factors leading to different outputs. Section 4 investigates the robustness of ChatGPT against textual perturbations and examines its ability to maintain stability under different input scenarios. Section 5 concludes the paper, summarizing the findings and contributions of the study.

2. Overview

2.1. ChatGPT-Based Sentiment Analysis

As described above, our primary objective was to investigate the quality assurance of ChatGPT-based sentiment analysis. While the product’s functionality aligns well with traditional software testing paradigms, ChatGPT, as a foundation model, encompasses a broad range of applications, including question answering, chatting, text generation, grammar checking, and translation. Evaluating the overall quality of ChatGPT as an AI product requires consideration of multiple dimensions, making it a highly time-intensive task. To streamline this AIQM process, this study focused specifically on assessing ChatGPT’s performance in a single, well-defined NLP task: sentiment analysis. Sentiment analysis was chosen because it is a fundamental and representative NLP task, as well as a classification problem with established evaluation metrics. Thus, the primary goal of this paper is to evaluate the quality of ChatGPT-based sentiment analysis within the scope of AIQM.
The diagram in Figure 1 illustrates how ChatGPT is used for sentiment analysis. Compared with conventional testing based on input and output data (e.g., review comments and sentiment labels in sentiment analysis), the process in Figure 1 differs in several respects. Because ChatGPT operates in conversation mode, extra editing of the prompt and response is usually needed to keep ChatGPT focused on a specific task and to produce the desired output. With this difference understood, and with the help of the ChatGPT API provided by OpenAI, users can readily develop a specific AI product based on ChatGPT. The settings used to develop the ChatGPT-based sentiment analysis product in our AIQM study are listed below, followed by a minimal implementation sketch.
  • PromptSetting: Analyze the following product review and determine if the sentiment is POSITIVE, NEGATIVE or NEUTRAL: {ReviewText}.
  • OutputControl: return only a single word, namely POSITIVE, NEGATIVE, or NEUTRAL.
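To make this set-up concrete, the following is a minimal sketch of how such a product can be implemented with the OpenAI Python client. The model name and the helper name classify_sentiment are our illustrative assumptions; the paper does not publish its exact code.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def classify_sentiment(review_text: str) -> str:
    """Classify one product review as POSITIVE, NEGATIVE, or NEUTRAL."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; the paper evaluates several versions
        temperature=0.0,        # the most stable setting (see Section 3.2)
        messages=[
            {"role": "system",
             "content": "You are an AI language model trained to analyze and "
                        "detect the sentiment of product reviews."},
            {"role": "user",
             "content": "Analyze the following product review and determine if "
                        "the sentiment is POSITIVE, NEGATIVE or NEUTRAL. "
                        "Return only a single word.\n\n"
                        f"Review: {review_text}"},
        ],
    )
    return response.choices[0].message.content.strip().upper()
```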

2.2. Stability of AI

According to the Machine Learning Quality Management guidelines [27], among the diverse AI qualities, stability is a crucial quality that deserves special attention. In this context, stability refers to an AI product performing consistently and reliably. It encompasses the ability of the product to operate seamlessly under varying conditions and to resist disruptions. This means that the product must be less prone to errors, crashes, or unexpected behaviors. It is thus crucial to achieve and maintain AI product stability to ensure a positive user experience and to minimize the effect of potential problems.
Moreover, with respect to AI-based software products, the stability study can be approached in two ways. One is to focus on system stability, specifically the intricacies of operational stability and the challenges posed by uncertainty. The other is to focus on model stability, which mainly concerns the AI model used in the software product; in other words, it amounts to robustness analysis of the trained AI model. The subsequent sections explore these two types of stability in the context of ChatGPT in detail.

3. Uncertainty Analysis

3.1. Model Architecture Design

Uncertainty in the operation phase of ChatGPT is now widely discussed. With both the advanced GPT-4 and the widely used GPT-3.5-turbo models, responses are non-deterministic, even at a temperature setting of 0.0. For example, repeatedly submitting the same question to ChatGPT generally produces different responses. One possible reason is that ChatGPT is continually updated on the basis of data collected from customers worldwide. However, as shown in Figure 2, responses on two different devices to the same input at the same time also differ. Although randomness in the text generation process can affect responses, the major reason for this non-determinism may be the introduction of sparse mixture-of-experts (MoE) models in ChatGPT [28]. On the one hand, MoE models dynamically and sparsely activate selected experts, effectively reducing computational costs. On the other hand, the huge number of tokens in MoE models can yield uncertain routing outcomes, with nearly equal scores across multiple experts. This uncertainty can result in output variability and even incorrect expert selection [29,30]. Based on this consideration, reference [31] provides insights into studying MoE models to improve the reliability of LLMs.
This architecture organizes tokens into fixed-size groups due to capacity constraints and enforces balance within each group. When a group mixes tokens from different sequences or inputs, those tokens compete for expert buffers, so per-sequence determinism cannot be enforced.
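As a rough, empirical way to quantify this operational uncertainty, one can repeat the same request many times and count the distinct labels returned. The minimal sketch below assumes the classify_sentiment helper from the Section 2.1 sketch; a fully deterministic model would always yield a single entry.

```python
from collections import Counter

def measure_label_variability(review_text: str, n_trials: int = 20) -> Counter:
    """Send the identical request n_trials times and tally the labels.

    Even at temperature 0.0, a non-deterministic backend (e.g., due to
    MoE routing) may return, say, Counter({'NEGATIVE': 18, 'NEUTRAL': 2}).
    """
    return Counter(classify_sentiment(review_text) for _ in range(n_trials))
```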

3.2. Differences Between Using ChatGPT and ChatGPT API

Developers using ChatGPT have reported differences between the web version of ChatGPT and the ChatGPT API. A general finding is that the web version performs better than the API, even when the same model is used. Several reasons have been suggested [32,33]. One is that the web version is continually updated on the basis of the huge amounts of data received from customers worldwide, whereas the API version is fixed for a certain period of time. Another relates to the system prompt: the web version uses a default system prompt, namely “You are an LLM-based AI system created by OpenAI...”, a statement of which model it is using, or something similar, whereas developers using the API must construct their own system prompt to specify how ChatGPT should behave. For example, when using the API to develop a sentiment analysis model, the system prompt can be set to “You are an AI language model trained to analyze and detect the sentiment of product reviews”.
Moreover, whereas the ChatGPT web interface is a conversation-based system that can use conversation history to produce the most satisfactory results, the API is a simple question–answer interface; a conversation management system must be created on top of it to enable history-based conversation. Another possible reason the web version performs better than the API, even with the same model, is a difference in the setting of the undisclosed “temperature” parameter. This parameter controls the degree of creativity or unpredictability in response generation: generally, the higher the value, the more variable the result. To obtain stable results, a temperature setting of 0.0 should be used.
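A minimal sketch of such a conversation manager is shown below, reusing the OpenAI Python client from the Section 2.1 sketch; the class name and interface are our illustrative assumptions, not an official API.

```python
class ChatSession:
    """Wrap the stateless chat API with an explicit message history."""

    def __init__(self, client, system_prompt: str, model: str = "gpt-3.5-turbo"):
        self.client = client
        self.model = model
        self.history = [{"role": "system", "content": system_prompt}]

    def send(self, user_message: str, temperature: float = 0.0) -> str:
        # The full history must be resent on every turn; the API itself
        # keeps no conversation state, unlike the ChatGPT web interface.
        self.history.append({"role": "user", "content": user_message})
        response = self.client.chat.completions.create(
            model=self.model,
            temperature=temperature,  # 0.0 for the most stable output
            messages=self.history,
        )
        reply = response.choices[0].message.content
        self.history.append({"role": "assistant", "content": reply})
        return reply
```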

3.3. Variance Due to Timing

As discussed above, ChatGPT is continuously updated on the basis of newly collected data and design changes. For example, OpenAI has made both GPT-3 and GPT-4 models available in ChatGPT and reports that GPT-4 performs better than GPT-3 on most tasks. Even two identical ChatGPT systems processing the same input at the same time show uncertainty in their output. Since the model update process is not transparent, the effect of each update on model behavior is unclear. These uncertainties pose challenges for studying the stability of ChatGPT: a sudden change in a model’s response to a prompt (e.g., in accuracy or formatting) may disrupt downstream processing, and it is difficult to reproduce a model’s results even with identical settings. This issue has been examined through comparative experiments on several NLP downstream tasks [34]. Here, we additionally evaluated a subset of Amazon.com review data (983 testing samples) at different time slots for a comparative study; the results are shown in Figure 3.
In Figure 3, a comparative study of the ChatGPT API for sentiment analysis is presented: three versions (2023-06, 2023-12, and 2024-01) with the same set-up are considered, and confusion matrices are used for evaluation. From these results, it is hard to say that every update brings a large improvement on all tasks. In this sentiment analysis example, the 2023-12 version has the best overall accuracy. However, in terms of per-class precision and recall, the 2023-06 and 2024-01 versions are better able to comprehend the prompts and distinguish positive, negative, and neutral reviews. The 2023-12 version appears to focus more on polar classification (positive or negative), reflecting a weakness in recognizing neutral review comments, as shown in Figure 3. This change in ChatGPT’s ability illustrates that it is essential to determine what happens when the model version of ChatGPT is updated.
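One practical mitigation is to re-run a fixed test set against each model snapshot and archive the results, as in the hypothetical harness below (evaluate_snapshot, the label set, and the use of scikit-learn are our assumptions; classify_sentiment is the helper from the Section 2.1 sketch).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

LABELS = ["POSITIVE", "NEUTRAL", "NEGATIVE"]

def evaluate_snapshot(reviews, gold_labels):
    """Score the current model snapshot on a fixed test set.

    Archiving (accuracy, confusion matrix) per date makes version-to-version
    behavior changes, such as those in Figure 3, detectable.
    """
    predictions = [classify_sentiment(text) for text in reviews]
    acc = accuracy_score(gold_labels, predictions)
    cm = confusion_matrix(gold_labels, predictions, labels=LABELS)
    return acc, cm
```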

3.4. Prompt Engineering

In the operation of ChatGPT, another factor affecting stability is the prompt. Prompt engineering is an essential topic in the study of LLMs: prompts usually need to be carefully designed to optimize the output. To investigate how prompts affect the stability of ChatGPT, this paper designed several prompt settings. Exploiting ChatGPT’s in-context learning ability, the prompt engineering here follows the zero-shot, one-shot, and few-shot criteria, as shown in Figure 4.
In Figure 4, three example prompts for using ChatGPT in sentiment analysis are given. With a zero-shot prompt, no example is provided, whereas with a few-shot prompt, several examples are provided for guidance. A one-shot prompt is the special case of a few-shot prompt in which only one example is provided; accordingly, three sub-designs providing a positive, neutral, and negative example, respectively, are also considered in this paper. Based on the same subset of Amazon.com review data from Section 3.3, the evaluation results are presented via confusion matrices, as shown in Figure 5.
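Before turning to the results, the sketch below shows how such zero-, one-, and few-shot prompts can be assembled in code; the example reviews are illustrative placeholders, not the actual examples shown in Figure 4.

```python
ZERO_SHOT = ("Analyze the following product review and determine if the "
             "sentiment is POSITIVE, NEGATIVE or NEUTRAL: {review}")

# One-shot: a single guiding example (here a negative one; the paper also
# tries positive and neutral variants).
ONE_SHOT = ("Example: 'The charger died after two days.' -> NEGATIVE\n"
            + ZERO_SHOT)

# Few-shot: one example per class.
FEW_SHOT = ("Example: 'Works perfectly, great value.' -> POSITIVE\n"
            "Example: 'An average cable, nothing special.' -> NEUTRAL\n"
            "Example: 'The charger died after two days.' -> NEGATIVE\n"
            + ZERO_SHOT)

def build_prompt(template: str, review: str) -> str:
    return template.format(review=review)
```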
As shown by the confusion matrices in Figure 5, different prompts resulted in different sentiment classification accuracy for the same dataset, indicating that prompt engineering indeed affects the stability of ChatGPT performance. The accuracy with the zero-shot prompt was slightly better than that with the few-shot prompt. The main difference was that the few-shot prompt reduced prediction accuracy for positive reviews. Looking at the matrices for the three one-shot prompts with positive/neutral/negative examples, it is seen that using the ones with neutral and negative examples seems to have improved accuracy, indicating that the performance of ChatGPT for sentiment analysis can be improved by careful engineering of the prompts.

4. Robustness Testing

4.1. Data Preparation

In this paper, two commonly used benchmark datasets for sentiment analysis are considered for evaluation: the Amazon.com review dataset [35] and the Stanford Sentiment Treebank (SST) dataset [36].
  • Amazon.com review dataset: This dataset is a collection of a large number of product reviews from Amazon.com. The raw data contains 82.83 million unique reviews and includes product and user information, a rating score (1–5 stars), and a plain text review. For sentiment analysis, researchers usually treat a review score of 1 or 2 as negative, 4 or 5 as positive, and 3 as neutral; we did likewise (see the mapping sketch after this list).
  • SST dataset: This dataset consists of 11,855 individual sentences extracted from the movie reviews of Pang and Lee [36]. Applying the Stanford parser to this dataset enables a comprehensive examination of the compositional effects of sentiment in language. In this paper, we used an extension of this dataset with fine-grained labels (very positive, positive, neutral, negative, very negative) and coarsely mapped the sentiments to positive, neutral, or negative.
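For concreteness, the two label mappings described above amount to the following simple functions (our own formulation of the standard convention):

```python
def amazon_rating_to_label(stars: int) -> str:
    """Map a 1-5 star Amazon rating to the three-class scheme used here."""
    if stars <= 2:
        return "NEGATIVE"
    if stars == 3:
        return "NEUTRAL"
    return "POSITIVE"

def sst_fine_to_coarse(fine_label: str) -> str:
    """Collapse SST's five fine-grained labels into three coarse classes."""
    mapping = {"very negative": "NEGATIVE", "negative": "NEGATIVE",
               "neutral": "NEUTRAL",
               "positive": "POSITIVE", "very positive": "POSITIVE"}
    return mapping[fine_label]
```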
Since evaluation is the major target of this paper, we randomly selected a small number of samples from the original datasets as the testing sets. Following Figure 1, the review comment texts serve as the inputs and the sentiment labels as the outputs of sentiment analysis. The general three sentiment classes are considered: positive, neutral, and negative. Detailed information on the selected testing sets is presented in Table 1.
As shown in Table 1, 983 samples from the Amazon.com review dataset and 1101 samples from the SST dataset were used for testing. Looking at the distribution of positive/neutral/negative reviews, the SST dataset has a more balanced distribution than the Amazon.com one, which contains a larger share of positive reviews and fewer negative and neutral reviews. Table 1 also shows the average text lengths of the two testing sets: the Amazon.com reviews are longer on average, while the SST reviews are typically short. To further characterize the two datasets, the distributions of their review text lengths are plotted in Figure 6.
Figure 6 makes it clearer that the two datasets have different distributions of review text length. Combined with the results in Table 1, the two datasets together cover both class balance and text length for testing the given AI product, which supports a more complete evaluation in AIQM.

4.2. Evaluation Metrics

We evaluated the robustness of the AI model using the notion of an adversarial example, in which a small perturbation of the data fools the model. Since sentiment analysis with ChatGPT is essentially a classification problem, we can apply two traditional classification metrics: accuracy (Acc) and attack success rate (ASR).
$$\mathrm{Acc} = \frac{\#\,\text{of correctly classified samples}}{\#\,\text{of total testing samples}} \times 100\%$$

$$\mathrm{ASR} = \frac{\#\,\text{of successfully attacked samples}}{\#\,\text{of total testing samples}} \times 100\%$$
Although these two metrics look similar, they have different meanings. Acc gives the percentage of samples whose predicted label matches the ground truth, while ASR gives the percentage of samples successfully flipped by the perturbation, which closely tracks the difference in accuracy before and after perturbation. ASR can therefore be used directly for robustness analysis; alternatively, accuracy before and after perturbation can be compared.
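In code, the two metrics reduce to simple counts over the test set. The sketch below is our reading of the definitions above, counting a sample as successfully attacked when a correct clean prediction becomes incorrect after perturbation.

```python
def accuracy(predictions, gold):
    """Acc: fraction of predictions matching the ground truth."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def attack_success_rate(clean_preds, attacked_preds, gold):
    """ASR: fraction of test samples whose correct clean prediction
    is flipped to an incorrect one by the perturbation."""
    flipped = sum(c == g and a != g
                  for c, a, g in zip(clean_preds, attacked_preds, gold))
    return flipped / len(gold)
```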

4.3. Perturbation and Robustness Analysis

As shown in Figure 7, this paper considers four types of perturbation from different perspectives: typo, synonym, homoglyph, and homophone perturbation. They mainly involve character- and word-level changes. To maintain naturalness and readability, the adversarial texts were generated according to the given perturbation and kept within an edit distance of one word; Figure 7 shows examples of review comments successfully attacked by each of the four perturbations.

4.3.1. Typo Perturbation

Psychological studies have shown that a sentence with one or more typos can often still be comprehended by a person [37]. However, computers encode misspelled words differently, which may lead to incorrect machine processing. Typo perturbation is therefore a common attack in NLP studies, mainly because typing errors are common in computer input. In this paper, we used the four common word-level typo perturbations provided in the TextAttack package [38]: swapping two adjacent letters in a word, substituting a letter with a random letter, randomly deleting a letter, and randomly inserting a letter. Adversarial texts were generated with the edit distance in each sentence restricted to preserve semantic understanding, which we refer to as “one-word perturbation”. Typos were accordingly added to review comments in the datasets, and the robustness results are presented in Table 2.
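A minimal re-implementation of this one-word typo scheme is sketched below; the actual experiments use the TextAttack package [38], and the function names here are ours.

```python
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def typo_perturb(word: str) -> str:
    """Apply one random typo: swap, substitute, delete, or insert a letter."""
    chars = list(word)
    i = random.randrange(len(chars))
    op = random.choice(["swap", "substitute", "delete", "insert"])
    if op == "swap" and len(chars) > 1:
        j = (i + 1) % len(chars)
        chars[i], chars[j] = chars[j], chars[i]
    elif op == "substitute":
        chars[i] = random.choice(LETTERS)
    elif op == "delete" and len(chars) > 1:
        del chars[i]
    else:  # insert (also the fallback for one-letter words)
        chars.insert(i, random.choice(LETTERS))
    return "".join(chars)

def one_word_typo(sentence: str) -> str:
    """Perturb a single randomly chosen word ('one-word perturbation')."""
    words = sentence.split()
    k = random.randrange(len(words))
    words[k] = typo_perturb(words[k])
    return " ".join(words)
```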
Table 2 presents the sentiment analysis performance of three models (BERT, RoBERTa, and GPT) on the two datasets (Amazon and SST). BERT and RoBERTa are two commonly used baselines in NLP studies; here, pretrained BERT and RoBERTa sentiment models were selected for fine-tuning, and they serve as references for evaluating the GPT-based model’s robustness in the following experiments. The GPT model is the general pretrained ChatGPT without fine-tuning. The reported metrics are ori_acc and pert_acc, the accuracy before and after perturbation, respectively, their difference Δ_diff, and the ASR. From these results, the fine-tuned conventional BERT and RoBERTa models have slightly better accuracy than the general ChatGPT-based sentiment analysis model; however, this accuracy gain comes with weaker robustness to perturbation, as the Δ_diff and ASR values show. It is also notable that the ASRs for the two datasets are close to the accuracy drops in Table 2. Furthermore, comparing the two datasets, both accuracy and robustness on the Amazon.com review data are better than on the SST data. Together with the text length analysis in Figure 6, the larger ASR on the SST dataset indicates that short review texts are more easily attacked by typo perturbation, consistent with the common understanding that longer texts are more robust against attacks.

4.3.2. Synonym Perturbation

Another common attack in NLP studies is synonym perturbation [39], in which a word is replaced with a synonym, potentially leading to a different output. The text with the replaced word should remain readable and similar in meaning to the original. However, in ML-/DL-based NLP systems, a model may behave differently after a synonym attack because different words receive different encodings in the tokenization phase. As a result, synonym perturbation has generally worked well in conventional adversarial text generation studies, especially in sentiment analysis, where sentiment words play a crucial role. This paper adopts the synonym perturbation algorithm of [39] to create adversarial texts for both the Amazon.com review dataset and the SST dataset. The robustness results are presented in Table 3.
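A simplified sketch of synonym substitution using WordNet is shown below; note that this is a deliberate simplification of the saliency-guided PWWS algorithm [39] actually used, which picks the word and synonym that maximize the change in model output.

```python
import random
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

def synonym_perturb(sentence: str) -> str:
    """Replace one randomly chosen word with a WordNet synonym."""
    words = sentence.split()
    indices = list(range(len(words)))
    random.shuffle(indices)
    for i in indices:
        synonyms = {lemma.name().replace("_", " ")
                    for syn in wordnet.synsets(words[i])
                    for lemma in syn.lemmas()} - {words[i]}
        if synonyms:
            words[i] = random.choice(sorted(synonyms))
            return " ".join(words)
    return sentence  # no word had a usable synonym
```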
In Table 3, the same three sentiment analysis models are studied. Similar conclusions hold: the ChatGPT-based model loses some accuracy but shows better stability, based on a comparison of the Δ_diff and ASR values. Moreover, comparing Table 2 and Table 3, the ASR values for synonym perturbation are much larger than those for typo perturbation, indicating that synonym perturbations are stronger attacks. Similarly, the longer texts in the Amazon.com review dataset were more robust against synonym perturbation than the shorter texts in the SST dataset.

4.3.3. Homoglyph Perturbation

In [38], Gao et al. also proposed homoglyph perturbation for adversarial text attacks. The idea is to replace a character with a similar-looking character, e.g., a symbol with an identical shape but a different character code, as illustrated by the examples in Figure 8. This perturbation has been demonstrated to be useful in adversarial text attacks. Depending on the number of replaced characters, homoglyph perturbation can be categorized as character-level (transforming a few characters) or word-level (transforming all characters in a word). Since a homoglyph retains the look of the original word, it affects neither readability nor the semantic meaning of the original text, which is useful in adversarial robustness studies. In this paper, word-level homoglyph perturbation is adopted. Using word importance for sentiment analysis as determined by the Vader analyzer [40], adversarial texts were generated to test the performance of ChatGPT on sentiment analysis. Table 4 presents the performance of the three models on the two datasets.
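The substitution itself reduces to a table lookup, as sketched below; the table here is an illustrative subset we chose, not the full mapping in Figure 8, and in the experiments the words to transform are chosen by Vader importance [40].

```python
# A small homoglyph table mapping Latin letters to visually similar
# Cyrillic characters (different Unicode code points, same appearance).
HOMOGLYPHS = {"a": "\u0430", "c": "\u0441", "e": "\u0435",
              "i": "\u0456", "o": "\u043e", "p": "\u0440"}

def homoglyph_word(word: str) -> str:
    """Word-level homoglyph perturbation: swap every mappable character."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in word)

# e.g. homoglyph_word("poor") looks unchanged to a human reader, but every
# 'p' and 'o' now has a different code point, so it tokenizes differently.
```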
Comparing the results in Table 2, Table 3 and Table 4, we conclude that homoglyph perturbation falls between typo perturbation and synonym perturbation in attack strength, as measured by the ASR values.

4.3.4. Homophone Perturbation

In contrast to homoglyph perturbation, which transforms text appearance, homophone perturbation [41] transforms text on the basis of pronunciation. Homophones, words that sound alike but are spelled differently and have different meanings, can cause a model to misclassify sentiment. This technique has proven useful especially in adversarial Chinese text generation [41]; here, we apply it to English text. First, word importance for sentiment analysis was determined using the Vader analyzer; homophone perturbation was then applied in order of word importance to generate adversarial texts. Robustness analysis based on these adversarial texts was performed on the two datasets, and the results are presented in Table 5.
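A toy version of this substitution is shown below; a practical attack would derive homophone pairs from a pronunciation dictionary rather than this hand-written table, and would follow the Vader importance order as in the experiments.

```python
# Hand-picked illustrative homophone pairs; not the dictionary used in [41].
HOMOPHONES = {"great": "grate", "whole": "hole", "buy": "by",
              "right": "write", "weak": "week", "new": "knew"}

def homophone_perturb(sentence: str) -> str:
    """Replace the first word (here, simply left to right) that has a
    known homophone, preserving the sentence's spoken form."""
    words = sentence.split()
    for i, w in enumerate(words):
        if w.lower() in HOMOPHONES:
            words[i] = HOMOPHONES[w.lower()]
            return " ".join(words)
    return sentence
```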
As shown in Table 5, both the accuracy drops and the ASR values for homophone perturbation are quite low, indicating that homophone perturbation is not a strong attack: current NLP models operate almost entirely on textual embeddings rather than pronunciation.
Summarizing the results in Table 2, Table 3, Table 4 and Table 5, several points can be concluded. First, compared with conventional baselines such as BERT and RoBERTa, the general ChatGPT-based model shows better robustness. Second, among the four perturbations studied, synonym perturbation is the strongest attack, producing the largest accuracy drops and ASR values, while homophone perturbation is the weakest. A comprehensive evaluation also shows that it is difficult to mount a strong attack against a ChatGPT-based product using conventional character- and word-level methods. Third, comparing the two datasets, longer texts are more robust, diminishing the influence of small perturbations in quality assurance.

5. Conclusions

This paper examined ChatGPT-based sentiment analysis as a case-study product for AI quality assurance, focusing specifically on the quality attribute of stability. Two topics were discussed. The first is operational uncertainty, for which several contributing factors were analyzed. The second is ChatGPT’s robustness against four types of textual perturbation, evaluated on two benchmark datasets. The results demonstrate that ChatGPT-based sentiment analysis is fairly robust against all four perturbations, though somewhat weaker against synonym perturbation. On this evidence, it appears feasible to build a robust sentiment analysis product on ChatGPT; nevertheless, the operational uncertainty arising from ChatGPT’s continuous updates, timing effects, and other factors must still be taken into account.

Author Contributions

Writing—original draft preparation, T.O.; writing—review and editing, A.M., K.K. and Y.S.; supervision, I.E.; project administration, I.E. and Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the project ‘JPNP20006’, commissioned by the New Energy and Industrial Technology Development Organization (NEDO) and partly supported by JSPS Grant-in-Aid for Early-Career Scientists (Grant Number JP22K17961).

Data Availability Statement

The data presented in this study are publicly available; see references [35,36].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ouyang, T.; Isobe, Y.; Sultana, S.; Seo, Y.; Oiwa, Y. Autonomous driving quality assurance with data uncertainty analysis. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–7. [Google Scholar]
  2. Shinde, P.P.; Shah, S. A review of machine learning and deep learning applications. In Proceedings of the 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, 16–18 August 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–6. [Google Scholar]
  3. Hordri, N.F.; Yuhaniz, S.S.; Shamsuddin, S.M. Deep learning and its applications: A review. In Proceedings of the Conference on Postgraduate Annual Research on Informatics Seminar, Kuala Lumpur, Malaysia, 12 September 2016; pp. 1–5. [Google Scholar]
  4. Zhou, J.; Müller, H.; Holzinger, A.; Chen, F. Ethical ChatGPT: Concerns, Challenges, and Commandments. Electronics 2024, 13, 3417. [Google Scholar] [CrossRef]
  5. OpenAI. ChatGPT. Available online: https://chatgpt.com/ (accessed on 1 December 2023).
  6. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
  7. Jiao, W.; Wang, W.; Huang, J.t.; Wang, X.; Tu, Z. Is ChatGPT a good translator? A preliminary study. arXiv 2023, arXiv:2301.08745. [Google Scholar]
  8. Filippi, S. Measuring the impact of ChatGPT on fostering concept generation in innovative product design. Electronics 2023, 12, 3535. [Google Scholar] [CrossRef]
  9. Petrillo, L.; Martinelli, F.; Santone, A.; Mercaldo, F. Toward the Adoption of Explainable Pre-Trained Large Language Models for Classifying Human-Written and AI-Generated Sentences. Electronics 2024, 13, 4057. [Google Scholar] [CrossRef]
  10. Selivanov, A.; Rogov, O.Y.; Chesakov, D.; Shelmanov, A.; Fedulova, I.; Dylov, D.V. Medical image captioning via generative pretrained transformers. Sci. Rep. 2023, 13, 4171. [Google Scholar] [CrossRef] [PubMed]
  11. Zhao, H.; Chen, H.; Ruggles, T.A.; Feng, Y.; Singh, D.; Yoon, H.J. Improving Text Classification with Large Language Model-Based Data Augmentation. Electronics 2024, 13, 2535. [Google Scholar] [CrossRef]
  12. Mitrović, S.; Andreoletti, D.; Ayoub, O. ChatGPT or Human? Detect and Explain: Explaining Decisions of Machine Learning Model for Detecting Short ChatGPT-Generated Text. arXiv 2023, arXiv:2301.13852. [Google Scholar]
  13. Frieder, S.; Pinchetti, L.; Griffiths, R.R.; Salvatori, T.; Lukasiewicz, T.; Petersen, P.; Berner, J. Mathematical capabilities of chatgpt. arXiv 2023, arXiv:2301.13867. [Google Scholar]
  14. Guo, Y.; Lee, D. Leveraging chatgpt for enhancing critical thinking skills. J. Chem. Educ. 2023, 100, 4876–4883. [Google Scholar] [CrossRef]
  15. Jiang, S.; Chen, Q.; Xiang, Y.; Pan, Y.; Lin, Y. Linguistic Rule Induction Improves Adversarial and OOD Robustness in Large Language Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 10565–10577. [Google Scholar]
  16. Wang, B.; Chen, W.; Pei, H.; Xie, C.; Kang, M.; Zhang, C.; Xu, C.; Xiong, Z.; Dutta, R.; Schaeffer, R.; et al. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. In Proceedings of the NeurIPS, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  17. Jones, E.; Dragan, A.; Raghunathan, A.; Steinhardt, J. Automatically auditing large language models via discrete optimization. In Proceedings of the International Conference on Machine Learning, PMLR, Hangzhou, China, 17–19 February 2023; pp. 15307–15329. [Google Scholar]
  18. Yang, Y.; Huang, P.; Cao, J.; Li, J.; Lin, Y.; Ma, F. A prompt-based approach to adversarial example generation and robustness enhancement. Front. Comput. Sci. 2024, 18, 184318. [Google Scholar] [CrossRef]
  19. Johnson, D.; Goodman, R.; Patrinely, J.; Stone, C.; Zimmerman, E.; Donald, R.; Chang, S.; Berkowitz, S.; Finn, A.; Jahangir, E.; et al. Assessing the accuracy and reliability of AI-generated medical responses: An evaluation of the Chat-GPT model. Res. Sq. 2023. [Google Scholar] [CrossRef]
  20. Rozado, D. The political biases of chatgpt. Soc. Sci. 2023, 12, 148. [Google Scholar] [CrossRef]
  21. Li, T.O.; Zong, W.; Wang, Y.; Tian, H.; Wang, Y.; Cheung, S.C.; Kramer, J. Nuances are the key: Unlocking chatgpt to find failure-inducing tests with differential prompting. In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 14–26. [Google Scholar]
  22. Borji, A. A categorical archive of chatgpt failures. arXiv 2023, arXiv:2302.03494. [Google Scholar]
  23. Zhang, H.; Cheah, Y.N.; Alyasiri, O.M.; An, J. Exploring aspect-based sentiment quadruple extraction with implicit aspects, opinions, and ChatGPT: A comprehensive survey. Artif. Intell. Rev. 2024, 57, 17. [Google Scholar] [CrossRef]
  24. Yuan, L.; Chen, Y.; Cui, G.; Gao, H.; Zou, F.; Cheng, X.; Ji, H.; Liu, Z.; Sun, M. Revisiting out-of-distribution robustness in nlp: Benchmarks, analysis, and LLMs evaluations. Adv. Neural Inf. Process. Syst. 2023, 36, 58478–58507. [Google Scholar]
  25. Ouyang, T.; Seo, Y.; Oiwa, Y. Quality assurance study with mismatched data in sentiment analysis. In Proceedings of the 2022 29th Asia-Pacific Software Engineering Conference (APSEC), Virtual, 6–9 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 442–446. [Google Scholar]
  26. Zhang, Y.; Xu, H.; Zhang, D.; Xu, R. A Hybrid Approach to Dimensional Aspect-Based Sentiment Analysis Using BERT and Large Language Models. Electronics 2024, 13, 3724. [Google Scholar] [CrossRef]
  27. Machine Learning Quality Management Guideline; Digital Architecture Research Center|AIST: Tokyo, Japan, 2022.
  28. Masoudnia, S.; Ebrahimpour, R. Mixture of experts: A literature survey. Artif. Intell. Rev. 2014, 42, 275–293. [Google Scholar] [CrossRef]
  29. Wu, H.; Qiu, Z.; Wang, Z.; Zhao, H.; Fu, J. GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace Theory. arXiv 2024, arXiv:2406.12375. [Google Scholar]
  30. Cai, W.; Jiang, J.; Wang, F.; Tang, J.; Kim, S.; Huang, J. A survey on mixture of experts. Authorea Prepr. 2024, Preprints. [Google Scholar]
  31. Chen, G.; Zhao, X.; Chen, T.; Cheng, Y. MoE-RBench: Towards Building Reliable Language Models with Sparse Mixture-of-Experts. arXiv 2024, arXiv:2406.11353. [Google Scholar]
  32. OpenAI. Chatgpt-Results-Much-Better-than-API. Available online: https://community.openai.com/t/chatgpt-results-much-better-than-api/336749 (accessed on 1 December 2023).
  33. OpenAI. Different Output Generated for Same Prompt in Chat Mode and API Mode Using GPT-3.5-Turbo. Available online: https://community.openai.com/t/different-output-generated-for-same-prompt-in-chat-mode-and-api-mode-using-gpt-3-5-turbo/318246 (accessed on 1 December 2023).
  34. Chen, L.; Zaharia, M.; Zou, J. How is ChatGPT’s behavior changing over time? arXiv 2023, arXiv:2307.09009. [Google Scholar] [CrossRef]
  35. McAuley, J.; Leskovec, J. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, Hong Kong, China, 12–16 October 2013; pp. 165–172. [Google Scholar]
  36. Pang, B.; Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. arXiv 2005, arXiv:cs/0506075. [Google Scholar]
  37. Rayner, K.; White, S.J.; Liversedge, S. Raeding wrods with jubmled lettres: There is a cost. Psychol. Sci. 2006, 17, 192–193. [Google Scholar] [CrossRef] [PubMed]
  38. Gao, J.; Lanchantin, J.; Soffa, M.L.; Qi, Y. Black-box generation of adversarial text sequences to evade deep learning classifiers. In Proceedings of the 2018 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 24 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 50–56. [Google Scholar]
  39. Ren, S.; Deng, Y.; He, K.; Che, W. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 1085–1097. [Google Scholar]
  40. Hutto, C.; Gilbert, E. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the International AAAI Conference on Web and Social Media, Ann Arbor, MI, USA, 1–4 June 2014; Volume 8, pp. 216–225. [Google Scholar]
  41. Xu, E.H.; Zhang, X.L.; Wang, Y.P.; Zhang, S.; Liu, L.X.; Xu, L. Adversarial Examples Generation Method for Chinese Text Classification. Int. J. Netw. Secur. 2022, 24, 587–596. [Google Scholar]
Figure 1. Diagram of using ChatGPT for sentiment analysis.
Figure 2. ChatGPT’s responses on two devices at the same time.
Figure 3. ChatGPT for sentiment analysis at different times.
Figure 4. Designs of zero-, one-, and few-shot prompts for sentiment analysis.
Figure 5. Sentiment analysis results for different prompt settings.
Figure 6. Length of reviews in the Amazon.com review and SST datasets.
Figure 7. Examples of different types of attacks.
Figure 8. Example homoglyphs of the 26 letters of the English alphabet.
Table 1. Dataset information.

Dataset   # of Samples   Dist. (Pos./Neu./Neg.)   Avg. Length
Amazon    983            0.8993/0.0264/0.0743     49.6185
SST       1101           0.4033/0.2080/0.3887     19.3224
Table 2. Robustness against typo perturbation.

Model     Dataset   ori_acc   pert_acc   Δ_diff   ASR
BERT      Amazon    0.8947    0.6316     0.2631   0.2632
BERT      SST       0.8557    0.5040     0.3517   0.3515
RoBERTa   Amazon    0.9058    0.6547     0.2511   0.2510
RoBERTa   SST       0.8491    0.4948     0.3543   0.3541
GPT       Amazon    0.8942    0.7636     0.1306   0.1273
GPT       SST       0.8065    0.6129     0.1936   0.1935
Table 3. Robustness against synonym perturbation.

Model     Dataset   ori_acc   pert_acc   Δ_diff   ASR
BERT      Amazon    0.8947    0.4731     0.4216   0.4387
BERT      SST       0.8557    0.3239     0.5318   0.5769
RoBERTa   Amazon    0.9058    0.4840     0.4218   0.4456
RoBERTa   SST       0.8491    0.3022     0.5469   0.5853
GPT       Amazon    0.8942    0.5781     0.3161   0.3642
GPT       SST       0.8065    0.3871     0.4194   0.5200
Table 4. Robustness against homoglyph perturbation.

Model     Dataset   ori_acc   pert_acc   Δ_diff   ASR
BERT      Amazon    0.8947    0.6178     0.2769   0.2757
BERT      SST       0.8557    0.6739     0.1818   0.2199
RoBERTa   Amazon    0.9058    0.6273     0.2785   0.2760
RoBERTa   SST       0.8491    0.6379     0.2112   0.2290
GPT       Amazon    0.8942    0.6536     0.2406   0.2397
GPT       SST       0.8065    0.7419     0.0646   0.1290
Table 5. Robustness against homophone perturbation.

Model     Dataset   ori_acc   pert_acc   Δ_diff   ASR
BERT      Amazon    0.8947    0.7657     0.1290   0.1290
BERT      SST       0.8557    0.7497     0.1360   0.1361
RoBERTa   Amazon    0.9058    0.7734     0.1324   0.1324
RoBERTa   SST       0.8491    0.7036     0.1455   0.1455
GPT       Amazon    0.8942    0.8445     0.0497   0.0497
GPT       SST       0.8065    0.7419     0.0646   0.0645