Stability Analysis of ChatGPT-Based Sentiment Analysis in AI Quality Assurance
Abstract
1. Introduction
- A simple yet effective sentiment analysis framework is proposed for quality assurance studies. This framework can serve as a reference for conducting AIQM [27] studies on other large language model (LLM)-based products.
- Two critical types of stability in LLM-based sentiment analysis are explored: operational uncertainty and model stability, providing a comprehensive perspective for quality assurance.
- Operational uncertainty refers to the unpredictability introduced by factors in how the model is operated. A detailed analysis of ChatGPT is conducted from multiple perspectives, including usage patterns, timing effects, and prompt engineering techniques.
- Model stability is essentially a robustness issue. It is systematically studied under four distinct types of textual perturbation to evaluate how well ChatGPT handles input variations.
2. Overview
2.1. ChatGPT-Based Sentiment Analysis
- PromptSetting: analyze the following product review and determine if the sentiment is POSITIVE, NEGATIVE or NEUTRAL: {ReviewText}.
- OutputControl: return only a single word, such as POSITIVE, NEGATIVE or NEUTRAL.
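The following is a minimal sketch of how the prompt setting and output control above could be combined into a single API call. It is illustrative only: the model name, the temperature value, and the use of the official openai Python SDK are assumptions rather than details reported in the paper.

```python
from openai import OpenAI  # assumes the official openai Python SDK (v1.x)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "analyze the following product review and determine if the sentiment is "
    "POSITIVE, NEGATIVE or NEUTRAL: {review_text}\n"
    "return only a single word, such as POSITIVE, NEGATIVE or NEUTRAL."
)

def classify_sentiment(review_text: str, model: str = "gpt-3.5-turbo") -> str:
    """Send one review to the chat completions endpoint and return the predicted label."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # lowers, but does not eliminate, output randomness
        messages=[{"role": "user", "content": PROMPT.format(review_text=review_text)}],
    )
    return response.choices[0].message.content.strip().upper()

print(classify_sentiment("The battery died after two days. Very disappointed."))
```

Setting temperature to 0 is a common way to reduce run-to-run variation for classification-style prompts, although it does not remove such variation entirely.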
2.2. Stability of AI
3. Uncertainty Analysis
3.1. Model Architecture Design
3.2. Differences Between Using ChatGPT and ChatGPT API
3.3. Variance Due to Timing
3.4. Prompt Engineering
4. Robustness Testing
4.1. Data Preparation
- Amazon.com review dataset: This dataset is a large collection of product reviews from Amazon.com. The raw data contains 82.83 million unique reviews and includes product and user information, a rating score (1–5 stars), and a plain-text review. For sentiment analysis, researchers usually treat a review score of 1 or 2 as negative, 4 or 5 as positive, and 3 as neutral; we followed the same convention (see the label-mapping sketch after this list).
- SST dataset: This dataset consists of 11,855 individual sentences extracted from movie reviews by Pang and Lee [36]. Applying the Stanford parser to this dataset enables a comprehensive examination of the compositional effects of sentiment in language. In this paper, we used an extension of this dataset with fine-grained labels (very positive, positive, neutral, negative, very negative), and roughly categorized the review sentiments as positive, neutral, or negative.
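As a minimal sketch of the label mappings described above (the function names and code are ours, not the paper's preprocessing scripts), the Amazon star ratings and the fine-grained SST labels can be collapsed to three classes as follows:

```python
def amazon_label(stars: int) -> str:
    """Map a 1-5 star Amazon review score to a coarse sentiment label."""
    if stars <= 2:
        return "NEGATIVE"
    if stars >= 4:
        return "POSITIVE"
    return "NEUTRAL"  # a score of 3 is treated as neutral


def sst_label(fine_grained: str) -> str:
    """Collapse the five fine-grained SST labels into three coarse classes."""
    mapping = {
        "very positive": "POSITIVE",
        "positive": "POSITIVE",
        "neutral": "NEUTRAL",
        "negative": "NEGATIVE",
        "very negative": "NEGATIVE",
    }
    return mapping[fine_grained.lower()]


print(amazon_label(5), sst_label("very negative"))  # POSITIVE NEGATIVE
```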
4.2. Evaluation Metrics
4.3. Perturbation and Robustness Analysis
4.3.1. Typo Perturbation
4.3.2. Synonym Perturbation
4.3.3. Homoglyph Perturbation
4.3.4. Homophone Perturbation
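For illustration only, the four perturbation types studied in this section can be sketched as simple text transformations. The substitution tables and functions below are hypothetical toy examples and are not the perturbation procedures or word lists used in the experiments.

```python
import random

# Hypothetical, tiny substitution tables used only for illustration.
SYNONYMS   = {"good": "great", "bad": "poor", "movie": "film"}
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}   # Latin -> Cyrillic look-alikes
HOMOPHONES = {"their": "there", "to": "too", "great": "grate"}


def typo_perturb(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Introduce typos by deleting one character from randomly chosen words."""
    rng = random.Random(seed)
    words = text.split()
    for i, w in enumerate(words):
        if len(w) > 3 and rng.random() < rate:
            j = rng.randrange(1, len(w) - 1)  # keep the first and last characters
            words[i] = w[:j] + w[j + 1:]
    return " ".join(words)


def substitute(text: str, table: dict, char_level: bool = False) -> str:
    """Replace whole words (or single characters) according to a substitution table."""
    if char_level:
        return "".join(table.get(c, c) for c in text)
    return " ".join(table.get(w.lower(), w) for w in text.split())


review = "their new movie is really good"
print(typo_perturb(review))                    # typo perturbation
print(substitute(review, SYNONYMS))            # synonym perturbation
print(substitute(review, HOMOGLYPHS, True))    # homoglyph perturbation
print(substitute(review, HOMOPHONES))          # homophone perturbation
```

Each transformation leaves the review easy for a human to read while changing the surface form the model sees, which is the kind of input variation the robustness tests probe.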
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Ouyang, T.; Isobe, Y.; Sultana, S.; Seo, Y.; Oiwa, Y. Autonomous driving quality assurance with data uncertainty analysis. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–7.
- Shinde, P.P.; Shah, S. A review of machine learning and deep learning applications. In Proceedings of the 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, 16–18 August 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–6.
- Hordri, N.F.; Yuhaniz, S.S.; Shamsuddin, S.M. Deep learning and its applications: A review. In Proceedings of the Conference on Postgraduate Annual Research on Informatics Seminar, Kuala Lumpur, Malaysia, 12 September 2016; pp. 1–5.
- Zhou, J.; Müller, H.; Holzinger, A.; Chen, F. Ethical ChatGPT: Concerns, Challenges, and Commandments. Electronics 2024, 13, 3417.
- OpenAI. ChatGPT. Available online: https://chatgpt.com/ (accessed on 1 December 2023).
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744.
- Jiao, W.; Wang, W.; Huang, J.-t.; Wang, X.; Tu, Z. Is ChatGPT a good translator? A preliminary study. arXiv 2023, arXiv:2301.08745.
- Filippi, S. Measuring the impact of ChatGPT on fostering concept generation in innovative product design. Electronics 2023, 12, 3535.
- Petrillo, L.; Martinelli, F.; Santone, A.; Mercaldo, F. Toward the Adoption of Explainable Pre-Trained Large Language Models for Classifying Human-Written and AI-Generated Sentences. Electronics 2024, 13, 4057.
- Selivanov, A.; Rogov, O.Y.; Chesakov, D.; Shelmanov, A.; Fedulova, I.; Dylov, D.V. Medical image captioning via generative pretrained transformers. Sci. Rep. 2023, 13, 4171.
- Zhao, H.; Chen, H.; Ruggles, T.A.; Feng, Y.; Singh, D.; Yoon, H.J. Improving Text Classification with Large Language Model-Based Data Augmentation. Electronics 2024, 13, 2535.
- Mitrović, S.; Andreoletti, D.; Ayoub, O. ChatGPT or human? Detect and explain: Explaining decisions of machine learning model for detecting short ChatGPT-generated text. arXiv 2023, arXiv:2301.13852.
- Frieder, S.; Pinchetti, L.; Griffiths, R.R.; Salvatori, T.; Lukasiewicz, T.; Petersen, P.; Berner, J. Mathematical capabilities of ChatGPT. arXiv 2023, arXiv:2301.13867.
- Guo, Y.; Lee, D. Leveraging ChatGPT for enhancing critical thinking skills. J. Chem. Educ. 2023, 100, 4876–4883.
- Jiang, S.; Chen, Q.; Xiang, Y.; Pan, Y.; Lin, Y. Linguistic Rule Induction Improves Adversarial and OOD Robustness in Large Language Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 10565–10577.
- Wang, B.; Chen, W.; Pei, H.; Xie, C.; Kang, M.; Zhang, C.; Xu, C.; Xiong, Z.; Dutta, R.; Schaeffer, R.; et al. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. In Proceedings of the NeurIPS, New Orleans, LA, USA, 10–16 December 2023.
- Jones, E.; Dragan, A.; Raghunathan, A.; Steinhardt, J. Automatically auditing large language models via discrete optimization. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 15307–15329.
- Yang, Y.; Huang, P.; Cao, J.; Li, J.; Lin, Y.; Ma, F. A prompt-based approach to adversarial example generation and robustness enhancement. Front. Comput. Sci. 2024, 18, 184318.
- Johnson, D.; Goodman, R.; Patrinely, J.; Stone, C.; Zimmerman, E.; Donald, R.; Chang, S.; Berkowitz, S.; Finn, A.; Jahangir, E.; et al. Assessing the accuracy and reliability of AI-generated medical responses: An evaluation of the Chat-GPT model. Res. Sq. 2023.
- Rozado, D. The political biases of ChatGPT. Soc. Sci. 2023, 12, 148.
- Li, T.O.; Zong, W.; Wang, Y.; Tian, H.; Wang, Y.; Cheung, S.C.; Kramer, J. Nuances are the key: Unlocking ChatGPT to find failure-inducing tests with differential prompting. In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 14–26.
- Borji, A. A categorical archive of ChatGPT failures. arXiv 2023, arXiv:2302.03494.
- Zhang, H.; Cheah, Y.N.; Alyasiri, O.M.; An, J. Exploring aspect-based sentiment quadruple extraction with implicit aspects, opinions, and ChatGPT: A comprehensive survey. Artif. Intell. Rev. 2024, 57, 17.
- Yuan, L.; Chen, Y.; Cui, G.; Gao, H.; Zou, F.; Cheng, X.; Ji, H.; Liu, Z.; Sun, M. Revisiting out-of-distribution robustness in NLP: Benchmarks, analysis, and LLMs evaluations. Adv. Neural Inf. Process. Syst. 2023, 36, 58478–58507.
- Ouyang, T.; Seo, Y.; Oiwa, Y. Quality assurance study with mismatched data in sentiment analysis. In Proceedings of the 2022 29th Asia-Pacific Software Engineering Conference (APSEC), Virtual, 6–9 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 442–446.
- Zhang, Y.; Xu, H.; Zhang, D.; Xu, R. A Hybrid Approach to Dimensional Aspect-Based Sentiment Analysis Using BERT and Large Language Models. Electronics 2024, 13, 3724.
- Machine Learning Quality Management Guideline; Digital Architecture Research Center, AIST: Tokyo, Japan, 2022.
- Masoudnia, S.; Ebrahimpour, R. Mixture of experts: A literature survey. Artif. Intell. Rev. 2014, 42, 275–293.
- Wu, H.; Qiu, Z.; Wang, Z.; Zhao, H.; Fu, J. GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace Theory. arXiv 2024, arXiv:2406.12375.
- Cai, W.; Jiang, J.; Wang, F.; Tang, J.; Kim, S.; Huang, J. A survey on mixture of experts. Authorea Prepr. 2024.
- Chen, G.; Zhao, X.; Chen, T.; Cheng, Y. MoE-RBench: Towards Building Reliable Language Models with Sparse Mixture-of-Experts. arXiv 2024, arXiv:2406.11353.
- OpenAI. ChatGPT Results Much Better than API. Available online: https://community.openai.com/t/chatgpt-results-much-better-than-api/336749 (accessed on 1 December 2023).
- OpenAI. Different Output Generated for Same Prompt in Chat Mode and API Mode Using GPT-3.5-Turbo. Available online: https://community.openai.com/t/different-output-generated-for-same-prompt-in-chat-mode-and-api-mode-using-gpt-3-5-turbo/318246 (accessed on 1 December 2023).
- Chen, L.; Zaharia, M.; Zou, J. How is ChatGPT’s behavior changing over time? arXiv 2023, arXiv:2307.09009.
- McAuley, J.; Leskovec, J. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, Hong Kong, China, 12–16 October 2013; pp. 165–172.
- Pang, B.; Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. arXiv 2005, arXiv:cs/0506075.
- Rayner, K.; White, S.J.; Liversedge, S. Raeding wrods with jubmled lettres: There is a cost. Psychol. Sci. 2006, 17, 192–193.
- Gao, J.; Lanchantin, J.; Soffa, M.L.; Qi, Y. Black-box generation of adversarial text sequences to evade deep learning classifiers. In Proceedings of the 2018 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 24 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 50–56.
- Ren, S.; Deng, Y.; He, K.; Che, W. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 1085–1097.
- Hutto, C.; Gilbert, E. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the International AAAI Conference on Web and Social Media, Ann Arbor, MI, USA, 1–4 June 2014; Volume 8, pp. 216–225.
- Xu, E.H.; Zhang, X.L.; Wang, Y.P.; Zhang, S.; Liu, L.X.; Xu, L. Adversarial Examples Generation Method for Chinese Text Classification. Int. J. Netw. Secur. 2022, 24, 587–596.
Dataset | # of Samples | Distribution (Pos./Neu./Neg.) | Avg. Length |
---|---|---|---|
Amazon | 983 | 0.8993/0.0264/0.0743 | 49.6185 |
SST | 1101 | 0.4033/0.2080/0.3887 | 19.3224 |
Robustness results for BERT, RoBERTa, and GPT on both datasets under the four perturbation types in Section 4.3 (one table per perturbation type):

Model | Dataset | Original Acc. | Perturbed Acc. | Acc. Drop | |
---|---|---|---|---|---|
BERT | Amazon | 0.8947 | 0.6316 | 0.2631 | 0.2632 |
BERT | SST | 0.8557 | 0.5040 | 0.3517 | 0.3515 |
RoBERTa | Amazon | 0.9058 | 0.6547 | 0.2511 | 0.2510 |
RoBERTa | SST | 0.8491 | 0.4948 | 0.3543 | 0.3541 |
GPT | Amazon | 0.8942 | 0.7636 | 0.1306 | 0.1273 |
GPT | SST | 0.8065 | 0.6129 | 0.1936 | 0.1935 |

Model | Dataset | Original Acc. | Perturbed Acc. | Acc. Drop | |
---|---|---|---|---|---|
BERT | Amazon | 0.8947 | 0.4731 | 0.4216 | 0.4387 |
BERT | SST | 0.8557 | 0.3239 | 0.5318 | 0.5769 |
RoBERTa | Amazon | 0.9058 | 0.4840 | 0.4218 | 0.4456 |
RoBERTa | SST | 0.8491 | 0.3022 | 0.5469 | 0.5853 |
GPT | Amazon | 0.8942 | 0.5781 | 0.3161 | 0.3642 |
GPT | SST | 0.8065 | 0.3871 | 0.4194 | 0.5200 |

Model | Dataset | Original Acc. | Perturbed Acc. | Acc. Drop | |
---|---|---|---|---|---|
BERT | Amazon | 0.8947 | 0.6178 | 0.2769 | 0.2757 |
BERT | SST | 0.8557 | 0.6739 | 0.1818 | 0.2199 |
RoBERTa | Amazon | 0.9058 | 0.6273 | 0.2785 | 0.2760 |
RoBERTa | SST | 0.8491 | 0.6379 | 0.2112 | 0.2290 |
GPT | Amazon | 0.8942 | 0.6536 | 0.2406 | 0.2397 |
GPT | SST | 0.8065 | 0.7419 | 0.0646 | 0.1290 |

Model | Dataset | Original Acc. | Perturbed Acc. | Acc. Drop | |
---|---|---|---|---|---|
BERT | Amazon | 0.8947 | 0.7657 | 0.1290 | 0.1290 |
BERT | SST | 0.8557 | 0.7497 | 0.1360 | 0.1361 |
RoBERTa | Amazon | 0.9058 | 0.7734 | 0.1324 | 0.1324 |
RoBERTa | SST | 0.8491 | 0.7036 | 0.1455 | 0.1455 |
GPT | Amazon | 0.8942 | 0.8445 | 0.0497 | 0.0497 |
GPT | SST | 0.8065 | 0.7419 | 0.0646 | 0.0645 |