An Investigation of Applying Large Language Models to Spoken Language Learning
Abstract
1. Introduction
1.1. Related Works
1.2. The Contributions of This Study
- This study introduces a dataset on spoken language intelligence that serves as a substantial benchmark for speech and language learning scenarios, usefully complementing existing benchmarks.
- This study evaluates various prompting strategies (zero-shot, few-shot, direct/CoT, domain-specific exemplars, and external tools) and analyzes their performance on multiple-choice questions.
- This study demonstrates that in-domain example sampling techniques can consistently improve performance on domain-specific data.
- This study conducts an expert evaluation of a small set of multi-turn conversations generated by GPT-3.5. Unless otherwise specified, the GPT-3.5 results in this paper are based on the GPT-3.5-turbo-0613 model. Error analysis indicates that GPT-3.5 has the potential to enhance conversational spoken language learning.
2. Methodology
2.1. Dataset
- Knowledge and Concept: We designed a set of concept-related questions to test the large models’ knowledge of spoken languages, such as “What is language transfer?” and “How many classifications of consonants are there for the manner of articulation?” These questions were mainly sampled from the end-of-chapter exercises in the book [42] and then manually adjusted to the needs of this research.
- Application Questions: Personalized learning problems are ever-changing, so answering them requires complex reasoning grounded in phonetics and linguistics. For example, different contexts call for different appropriate stress patterns, which is fundamental to an automatic personalized language learning system. Therefore, based on language teaching practices involving pronunciation, pauses, stress, and intonation, we manually designed a series of representative questions. Regarding stress, for example, we formulated questions about the placement of word stress within words, phrase stress when conveying specific meanings, and sentence stress. For word stress within words, we considered having GPT determine the positions of stressed syllables in both regular and irregular words (e.g., the word “present” is stressed differently depending on its part of speech). For intonation, we designed questions in which the LLMs must determine the intended meaning of a sentence given the positions of rising/falling tones and, conversely, determine the proper intonation for a sentence with a specific intended meaning. Overall, we aimed to make the questions as broad and practically relevant as possible. An example about pronunciation is shown in Figure 1, and a sketch of one plausible item representation follows this list.
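For concreteness, the sketch below shows one plausible way to represent an item from this dataset; the field names and the example values are illustrative assumptions on our part, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    """Illustrative multiple-choice item; field names are assumptions."""
    question: str
    options: dict[str, str]  # choice letter -> option text
    answer: str              # gold choice letter
    category: str            # "concept" or "application"

item = MCQItem(
    question="Which syllable is stressed in the noun 'present'?",
    options={"A": "the first syllable", "B": "the second syllable"},
    answer="A",  # the noun is PREsent; the verb preSENT stresses the second syllable
    category="application",
)
```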
2.2. Prompting Methods
3. Experiments and Results
3.1. Experimental Setup
3.2. Zero-Shot and Few-Shot Benchmarks
3.3. Analysis of Advanced Prompting Methods
3.4. Experts’ Evaluations of the Multi-Turn Conversations
- RATING-A: The response was valid, satisfactory, and relevant to the evaluation prompt.
- RATING-B: The response was acceptable but contained minor errors or imperfections.
- RATING-C: The response was relevant and addressed the instruction but contained significant errors in its content.
- RATING-D: The response was irrelevant to the evaluation prompt or entirely invalid for the current topic.
4. Discussion
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Eskenazi, M. An overview of spoken language technology for education. Speech Commun. 2009, 51, 832–844.
- Rogerson-Revell, P.M. Computer-assisted pronunciation training (CAPT): Current issues and future directions. RELC J. 2021, 52, 189–205.
- Kang, O.; Kermad, A. Assessment in second language pronunciation. In The Routledge Handbook of Contemporary English Pronunciation; Routledge: Abingdon-on-Thames, UK, 2017; pp. 511–526.
- Kang, O.; Rubin, D.; Pickering, L. Suprasegmental measures of accentedness and judgments of language learner proficiency in oral English. Mod. Lang. J. 2010, 94, 554–566.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
- Doersch, C.; Gupta, A.; Efros, A.A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1422–1430.
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460.
- Coope, S.; Farghly, T.; Gerz, D.; Vulić, I.; Henderson, M. Span-ConveRT: Few-shot Span Extraction for Dialog with Pretrained Conversational Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual, 5–10 July 2020; pp. 107–121.
- Min, B.; Ross, H.; Sulem, E.; Veyseh, A.P.B.; Nguyen, T.H.; Sainz, O.; Agirre, E.; Heintz, I.; Roth, D. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Comput. Surv. 2023, 56, 1–40.
- Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; de Las Casas, D.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training Compute-Optimal Large Language Models. arXiv 2022, arXiv:2203.15556.
- Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361.
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165.
- Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways. arXiv 2022, arXiv:2204.02311.
- OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
- Bubeck, S.; Chandrasekaran, V.; Eldan, R.; Gehrke, J.; Horvitz, E.; Kamar, E.; Lee, P.; Lee, Y.T.; Li, Y.; Lundberg, S.; et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv 2023, arXiv:2303.12712.
- Bang, Y.; Cahyawijaya, S.; Lee, N.; Dai, W.; Su, D.; Wilie, B.; Lovenia, H.; Ji, Z.; Yu, T.; Chung, W.; et al. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. arXiv 2023, arXiv:2302.04023.
- Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring massive multitask language understanding. arXiv 2020, arXiv:2009.03300.
- Castro Nascimento, C.M.; Pimentel, A.S. Do Large Language Models Understand Chemistry? A Conversation with ChatGPT. J. Chem. Inf. Model. 2023, 63, 1649–1655.
- Frank, M.C. Baby steps in evaluating the capacities of large language models. Nat. Rev. Psychol. 2023, 2, 451–452.
- Valmeekam, K.; Olmo, A.; Sreedharan, S.; Kambhampati, S. Large language models still can’t plan (A benchmark for LLMs on planning and reasoning about change). In Proceedings of the NeurIPS 2022 Foundation Models for Decision Making Workshop, New Orleans, LA, USA, 2022.
- Liévin, V.; Hother, C.E.; Winther, O. Can large language models reason about medical questions? arXiv 2022, arXiv:2207.08143.
- Dai, W.; Lin, J.; Jin, H.; Li, T.; Tsai, Y.S.; Gašević, D.; Chen, G. Can large language models provide feedback to students? A case study on ChatGPT. In Proceedings of the 2023 IEEE International Conference on Advanced Learning Technologies (ICALT), Orem, UT, USA, 10–13 July 2023; pp. 323–325.
- Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443.
- Xu, P.; Zhu, X.; Clifton, D.A. Multimodal learning with transformers: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12113–12132.
- Hossain, M.Z.; Sohel, F.; Shiratuddin, M.F.; Laga, H. A comprehensive survey of deep learning for image captioning. ACM Comput. Surv. 2019, 51, 1–36.
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695.
- Ling, S.; Hu, Y.; Qian, S.; Ye, G.; Qian, Y.; Gong, Y.; Lin, E.; Zeng, M. Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition. arXiv 2023, arXiv:2307.08234.
- Sigurgeirsson, A.T.; King, S. Using a Large Language Model to Control Speaking Style for Expressive TTS. arXiv 2023, arXiv:2305.10321.
- Rubenstein, P.K.; Asawaroengchai, C.; Nguyen, D.D.; Bapna, A.; Borsos, Z.; de Chaumont Quitry, F.; Chen, P.; Badawy, D.E.; Han, W.; Kharitonov, E.; et al. AudioPaLM: A Large Language Model That Can Speak and Listen. arXiv 2023, arXiv:2306.12925.
- Borsos, Z.; Marinier, R.; Vincent, D.; Kharitonov, E.; Pietquin, O.; Sharifi, M.; Roblek, D.; Teboul, O.; Grangier, D.; Tagliasacchi, M.; et al. AudioLM: A Language Modeling Approach to Audio Generation. arXiv 2023, arXiv:2209.03143.
- Anil, R.; Dai, A.M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; et al. PaLM 2 Technical Report. arXiv 2023, arXiv:2305.10403.
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2023, arXiv:2201.11903.
- Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv 2022, arXiv:2203.11171.
- Nakano, R.; Hilton, J.; Balaji, S.; Wu, J.; Ouyang, L.; Kim, C.; Hesse, C.; Jain, S.; Kosaraju, V.; Saunders, W.; et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv 2022, arXiv:2112.09332.
- Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. arXiv 2023, arXiv:2302.04761.
- Gao, L.; Madaan, A.; Zhou, S.; Alon, U.; Liu, P.; Yang, Y.; Callan, J.; Neubig, G. PAL: Program-aided language models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 10764–10799.
- Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, New York, NY, USA, 3–10 March 2021; pp. 610–623.
- Kang, O.; Rubin, D.L. Reverse linguistic stereotyping: Measuring the effect of listener expectations on speech evaluation. J. Lang. Soc. Psychol. 2009, 28, 441–456.
- Kasneci, E.; Sessler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Differ. 2023, 103, 102274.
- Kung, T.H.; Cheatham, M.; Medenilla, A.; Sillos, C.; De Leon, L.; Elepaño, C.; Madriaga, M.; Aggabao, R.; Diaz-Candido, G.; Maningo, J.; et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digit. Health 2023, 2, e0000198.
- Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A.A.M.; Abid, A.; Fisch, A.; Brown, A.R.; Santoro, A.; Gupta, A.; Garriga-Alonso, A.; et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. arXiv 2022, arXiv:2206.04615.
- Ladefoged, P.; Johnson, K. A Course in Phonetics; Cengage Learning: Boston, MA, USA, 2014.
- Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213.
- Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2023, arXiv:2303.18223.
- Lai, G.; Xie, Q.; Liu, H.; Yang, Y.; Hovy, E. RACE: Large-scale reading comprehension dataset from examinations. arXiv 2017, arXiv:1704.04683.
- Lin, S.; Hilton, J.; Evans, O. TruthfulQA: Measuring how models mimic human falsehoods. arXiv 2021, arXiv:2109.07958.
- Robinson, J.; Rytting, C.M.; Wingate, D. Leveraging large language models for multiple choice question answering. arXiv 2022, arXiv:2210.12353.
- Imani, S.; Du, L.; Shrivastava, H. MathPrompter: Mathematical reasoning using large language models. arXiv 2023, arXiv:2303.05398.
- Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Zhou, D. Rationale-augmented ensembles in language models. arXiv 2022, arXiv:2207.00747.
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2021, arXiv:2005.11401.
- Borgeaud, S.; Mensch, A.; Hoffmann, J.; Cai, T.; Rutherford, E.; Millican, K.; van den Driessche, G.; Lespiau, J.B.; Damoc, B.; Clark, A.; et al. Improving language models by retrieving from trillions of tokens. arXiv 2022, arXiv:2112.04426.
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971.
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288.
- Tay, Y.; Dehghani, M.; Tran, V.Q.; Garcia, X.; Bahri, D.; Schuster, T.; Zheng, H.S.; Houlsby, N.; Metzler, D. Unifying language learning paradigms. arXiv 2022, arXiv:2205.05131.
- Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling Instruction-Finetuned Language Models. arXiv 2022, arXiv:2210.11416.
- Biderman, S.; Schoelkopf, H.; Anthony, Q.; Bradley, H.; O’Brien, K.; Hallahan, E.; Khan, M.A.; Purohit, S.; Prashanth, U.S.; Raff, E.; et al. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. arXiv 2023, arXiv:2304.01373.
- Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.P.; et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv 2023, arXiv:2306.05685.
- Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; Hashimoto, T.B. Stanford Alpaca: An Instruction-Following LLaMA Model. Available online: https://github.com/tatsu-lab/stanford_alpaca (accessed on 11 August 2023).
- Fu, Y.; Ou, L.; Chen, M.; Wan, Y.; Peng, H.; Khot, T. Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models’ Reasoning Performance. arXiv 2023, arXiv:2305.17306.
- Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlíček, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Waikoloa, HI, USA, 11–15 December 2011.
- Witt, S.; Young, S. Computer-assisted pronunciation teaching based on automatic speech recognition. In Language Teaching and Language Technology; Routledge: Abingdon-on-Thames, UK, 2014; pp. 25–35.
- Marcus, G. The next decade in AI: Four steps towards robust artificial intelligence. arXiv 2020, arXiv:2002.06177.
- Russin, J.; O’Reilly, R.C.; Bengio, Y. Deep learning needs a prefrontal cortex. Work. Bridg. AI Cogn. Sci. 2020, 107, 1.
- Mitchell, M. Abstraction and analogy-making in artificial intelligence. Ann. N. Y. Acad. Sci. 2021, 1505, 79–101.
- Huang, J.; Chang, K.C.C. Towards reasoning in large language models: A survey. arXiv 2022, arXiv:2212.10403.
- Lin, V.; Yeh, H.C.; Chen, N.S. A systematic review on oral interactions in robot-assisted language learning. Electronics 2022, 11, 290.
Task Description/System (shared by all templates): You are an expert in phonetics, English phonology, and second language acquisition. Here is a multiple-choice question for you to answer correctly.

Prompt Part | Zero-shot | Few-shot
---|---|---
Shot | ∅ | Question: [Question]<br>Answer: [answer]<br>…
Question | Question: [Question] | Question: [Question]
Answer | Answer: <answer> | Answer: <answer>

Prompt Part | Zero-shot CoT | Few-shot CoT
---|---|---
Shot | ∅ | Question: [Question]<br>Answer: Let’s think step by step. [CoT]<br>Therefore, the answer is [answer]<br>…
Question | Question: [Question] | Question: [Question]
CoT | Answer: Let’s think step by step. <CoT> | Answer: Let’s think step by step. <CoT>
Answer | Therefore, the answer is <answer> | Therefore, the answer is <answer>

Prompt Part | In-domain Exemplar | Tool Augmentation
---|---|---
Question | Question: [Question about T★]<br>Answer: Let’s think step by step. [CoT]<br>Therefore, the answer is [answer]<br>…<br>Question: [Question about T] | Question: [Question]<br>Thought: <CoT> + I need to use some tools to find the answer.<br>Action: {"action": "<tool>", "input": "<input>"}<br>Observation: The <tool> contains detailed information about the <input>.<br>…
CoT | Answer: Let’s think step by step. <CoT> | Thought: <CoT>
Answer | Therefore, the answer is <answer> | Therefore, the answer is <answer>
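The templates above can be assembled mechanically from the system message, optional exemplars, and the test question. Below is a minimal sketch under that reading; the helper name and the exemplar plumbing are ours, not the paper’s code.

```python
SYSTEM = ("You are an expert in phonetics, English phonology, and second language "
          "acquisition. Here is a multiple-choice question for you to answer correctly.")

def build_prompt(question, exemplars=(), cot=False):
    """Assemble a zero-/few-shot (CoT) prompt following the templates above.

    exemplars: (question, rationale, answer) triples; an empty tuple means zero-shot.
    """
    parts = [SYSTEM]
    for q, rationale, answer in exemplars:
        if cot:
            parts.append(f"Question: {q}\nAnswer: Let's think step by step. "
                         f"{rationale}\nTherefore, the answer is {answer}")
        else:
            parts.append(f"Question: {q}\nAnswer: {answer}")
    trigger = "Answer: Let's think step by step." if cot else "Answer:"
    parts.append(f"Question: {question}\n{trigger}")
    return "\n\n".join(parts)

# Zero-shot CoT prompt for a new question:
print(build_prompt("Where is the stress in the verb 'present'?", cot=True))
```

In-domain exemplar prompting is the same assembly with exemplars drawn from the same topic as the test question.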
Model | Release Time | Size (B) | Base Model | IT | RLHF | Pre-Train Data Scale | Hardware (GPUs/TPUs) | Training/FT Time | ICL | CoT
---|---|---|---|---|---|---|---|---|---|---
LLaMA [52] | February 2023 | 7∼65 | - | - | - | 1.4 T tokens | 2048 × 80 GB A100 | 21 d | √ | -
Vicuna [57] | March 2023 | 7∼33 | LLaMA | - | - | - | 8 × 80 GB A100 | - | √ | √
Alpaca [58] | March 2023 | 7 | LLaMA | √ | - | - | 4 × 80 GB A100 | 3 h | - | -
Flan-T5 [55] | October 2022 | 11 (XXL) | T5 | √ | - | - | - | - | √ | √
Flan-UL2 [54] | March 2023 | 20 | UL2 | √ | - | - | - | - | √ | √
Pythia [56] | April 2023 | 12 | - | - | - | 300 B tokens | 256 × 40 GB A100 | - | √ | -
LLaMA2 [53] | July 2023 | 7∼70 | - | √ | √ | 2.0 T tokens | 2000 × 80 GB A100 | - | √ | √
GPT-3 [12] | May 2020 | 175 | - | - | - | 300 B tokens | - | - | √ | -
GPT-4 [14] | March 2023 | - | - | √ | √ | - | - | - | √ | √

IT (instruction tuning) and RLHF (reinforcement learning from human feedback) describe adaptation; ICL (in-context learning) and CoT (chain of thought) describe evaluation settings.
Model | Prompt | Shot | Concept (144) | Δ | Applied Questions (301) | Δ | Overall
---|---|---|---|---|---|---|---
LLaMA1-7B | Direct | 0 | 38.9% (56) | | 0.0% (0) | | 12.6%
LLaMA1-7B | Direct | 3 | 14.6% (21) | −24.3% | 4.0% (12) | +4.0% | 7.4%
LLaMA1-7B | CoT | 0 | 4.9% (7) | −34.0% | 4.3% (13) | +4.3% | 4.5%
LLaMA1-7B | CoT | 3 | 40.3% (58) | +1.4% | 31.2% (94) | +31.2% | 34.2%
Alpaca-7B | Direct | 0 | 50.7% (73) | | 27.6% (83) | | 35.1%
Alpaca-7B | Direct | 3 | 41.0% (59) | −9.7% | 28.6% (86) | +1.0% | 32.6%
Alpaca-7B | CoT | 0 | 13.9% (20) | −36.8% | 7.6% (23) | −20.0% | 9.6%
Alpaca-7B | CoT | 3 | 39.6% (57) | −11.1% | 33.2% (100) | +5.6% | 35.3%
Vicuna-7B | Direct | 0 | 50.0% (72) | | 18.6% (56) | | 28.8%
Vicuna-7B | Direct | 3 | 0.7% (1) | −49.3% | 1.0% (3) | −17.6% | 0.9%
Vicuna-7B | CoT | 0 | 57.6% (83) | +7.6% | 22.3% (67) | +3.7% | 33.7%
Vicuna-7B | CoT | 3 | 60.4% (87) | +10.4% | 33.2% (100) | +14.6% | 42.0%
LLaMA2-7B | Direct | 0 | 38.2% (55) | | 15.3% (46) | | 22.7%
LLaMA2-7B | Direct | 3 | 45.1% (65) | +6.9% | 13.6% (41) | −1.7% | 23.8%
LLaMA2-7B | CoT | 0 | 11.1% (16) | −27.1% | 4.0% (13) | −11.3% | 6.5%
LLaMA2-7B | CoT | 3 | 68.1% (98) | +29.9% | 36.2% (109) | +20.9% | 46.5%
LLaMA2-7B-Chat | Direct | 0 | 72.2% (104) | | 31.2% (94) | | 44.5%
LLaMA2-7B-Chat | Direct | 3 | 70.8% (102) | −1.4% | 31.0% (93) | −0.2% | 43.8%
LLaMA2-7B-Chat | CoT | 0 | 86.1% (124) | +13.9% | 28.6% (86) | −2.6% | 47.2%
LLaMA2-7B-Chat | CoT | 3 | 77.1% (111) | +4.9% | 33.2% (100) | +2.0% | 47.4%
Pythia-7B | Direct | 0 | 2.8% (4) | | 18.9% (57) | | 13.7%
Pythia-7B | Direct | 3 | 15.3% (22) | +12.5% | 15.3% (46) | −3.6% | 15.3%
Pythia-7B | CoT | 0 | 8.3% (12) | +5.5% | 12.0% (36) | −6.9% | 10.8%
Pythia-7B | CoT | 3 | 14.6% (21) | +11.8% | 26.9% (81) | +8.0% | 22.9%
Flan-T5-XXL-11B | Direct | 0 | 88.2% (127) | | 46.2% (139) | | 59.8%
Flan-T5-XXL-11B | Direct | 3 | 88.9% (128) | +0.7% | 47.8% (144) | +1.6% | 61.1%
Flan-T5-XXL-11B | CoT | 0 | 79.2% (114) | −9.0% | 37.9% (114) | −8.3% | 51.2%
Flan-T5-XXL-11B | CoT | 3 | 86.1% (124) | −2.1% | 40.5% (122) | −5.7% | 55.3%
Pythia-12B | Direct | 0 | 20.1% (29) | | 24.9% (75) | | 23.4%
Pythia-12B | Direct | 3 | 20.1% (29) | +0.0% | 26.6% (80) | +1.7% | 24.5%
Pythia-12B | CoT | 0 | 18.1% (26) | −2.0% | 19.3% (58) | −5.6% | 18.9%
Pythia-12B | CoT | 3 | 22.2% (32) | +2.1% | 22.3% (67) | −2.6% | 22.2%
LLaMA1-13B | Direct | 0 | 33.3% (48) | | 7.6% (23) | | 15.9%
LLaMA1-13B | Direct | 3 | 66.7% (96) | +33.4% | 26.2% (79) | +18.6% | 39.3%
LLaMA1-13B | CoT | 0 | 25.0% (36) | −8.3% | 11.3% (34) | +3.7% | 15.7%
LLaMA1-13B | CoT | 3 | 65.3% (94) | +32.0% | 33.6% (101) | +26.0% | 43.8%
LLaMA2-13B | Direct | 0 | 72.2% (104) | | 30.2% (91) | | 43.8%
LLaMA2-13B | Direct | 3 | 81.3% (117) | +9.1% | 16.3% (49) | −13.9% | 37.3%
LLaMA2-13B | CoT | 0 | 25.7% (37) | −46.5% | 10.6% (32) | −19.6% | 15.5%
LLaMA2-13B | CoT | 3 | 83.3% (120) | +11.1% | 42.2% (127) | +12.0% | 55.5%
LLaMA2-13B-Chat | Direct | 0 | 88.2% (127) | | 35.2% (106) | | 52.4%
LLaMA2-13B-Chat | Direct | 3 | 74.3% (107) | −13.9% | 36.2% (109) | +1.0% | 48.5%
LLaMA2-13B-Chat | CoT | 0 | 84.7% (122) | −3.5% | 38.9% (117) | +3.7% | 53.7%
LLaMA2-13B-Chat | CoT | 3 | 85.4% (123) | −2.8% | 40.2% (121) | +5.0% | 54.8%
Vicuna-13B | Direct | 0 | 63.2% (91) | | 19.9% (60) | | 33.9%
Vicuna-13B | Direct | 3 | 6.9% (10) | −56.3% | 10.6% (32) | −9.3% | 9.4%
Vicuna-13B | CoT | 0 | 72.9% (105) | +9.7% | 30.6% (92) | +10.7% | 44.3%
Vicuna-13B | CoT | 3 | 83.3% (120) | +20.1% | 38.9% (117) | +19.0% | 53.3%
Flan-UL2-20B | Direct | 0 | 87.5% (126) | | 41.9% (126) | | 56.6%
Flan-UL2-20B | Direct | 3 | 88.9% (128) | +1.4% | 44.5% (134) | +2.6% | 58.9%
Flan-UL2-20B | CoT | 0 | 86.8% (125) | −0.7% | 38.9% (117) | −3.0% | 54.4%
Flan-UL2-20B | CoT | 3 | 88.9% (128) | +1.4% | 38.5% (116) | −3.4% | 54.8%
LLaMA1-30B | Direct | 0 | 78.5% (113) | | 0.7% (2) | | 25.8%
LLaMA1-30B | Direct | 3 | 86.8% (125) | +8.3% | 38.5% (116) | +37.8% | 54.2%
LLaMA1-30B | CoT | 0 | 25.7% (37) | −52.8% | 13.6% (41) | +12.9% | 17.5%
LLaMA1-30B | CoT | 3 | 82.0% (118) | +3.5% | 41.2% (124) | +40.5% | 54.4%
Vicuna-33B | Direct | 0 | 71.5% (103) | | 36.2% (109) | | 47.6%
Vicuna-33B | Direct | 3 | 81.9% (118) | +10.4% | 37.9% (114) | +1.7% | 52.1%
Vicuna-33B | CoT | 0 | 74.3% (107) | +2.8% | 36.5% (110) | +0.3% | 48.8%
Vicuna-33B | CoT | 3 | 77.8% (112) | +6.3% | 42.5% (128) | +6.3% | 53.9%
LLaMA1-65B | Direct | 0 | 0.0% (0) | | 17.3% (52) | | 11.7%
LLaMA1-65B | Direct | 3 | 86.1% (124) | +86.1% | 42.2% (127) | +24.9% | 56.4%
LLaMA1-65B | CoT | 0 | 23.6% (34) | +23.6% | 9.0% (27) | −8.3% | 13.7%
LLaMA1-65B | CoT | 3 | 86.8% (125) | +86.8% | 47.8% (144) | +30.5% | 60.4%
LLaMA2-70B | Direct | 0 | 84.7% (122) | | 5.0% (15) | | 30.8%
LLaMA2-70B | Direct | 3 | 91.6% (132) | +6.9% | 50.8% (153) | +45.8% | 64.0%
LLaMA2-70B | CoT | 0 | 36.8% (53) | −47.9% | 12.6% (38) | +7.6% | 20.4%
LLaMA2-70B | CoT | 3 | 85.4% (123) | +0.7% | 50.2% (151) | +45.2% | 61.6%
LLaMA2-70B-Chat | Direct | 0 | 86.8% (125) | | 42.2% (127) | | 56.6%
LLaMA2-70B-Chat | Direct | 3 | 88.9% (128) | +2.1% | 42.2% (127) | +0.0% | 57.3%
LLaMA2-70B-Chat | CoT | 0 | 88.2% (127) | +1.4% | 44.2% (133) | +2.0% | 58.4%
LLaMA2-70B-Chat | CoT | 3 | 88.2% (127) | +1.4% | 48.5% (146) | +6.3% | 61.3%
GPT-3.5-turbo | Direct | 0 | 93.0% (134) | | 49.1% (148) | | 63.4%
GPT-3.5-turbo | Direct | 3 | 95.8% (138) | +2.8% | 53.5% (161) | +4.4% | 67.2%
GPT-3.5-turbo | CoT | 0 | 85.4% (123) | −7.6% | 54.2% (163) | +5.1% | 64.3%
GPT-3.5-turbo | CoT | 3 | 91.7% (132) | −1.3% | 56.8% (171) | +7.7% | 68.1%
GPT-4 | Direct | 0 | 96.5% (139) | | 73.4% (221) | | 80.9%
GPT-4 | Direct | 3 | 97.2% (140) | +0.7% | 73.1% (220) | −0.3% | 80.9%
GPT-4 | CoT | 0 | 96.5% (139) | +0.0% | 77.4% (233) | +4.0% | 83.6%
GPT-4 | CoT | 3 | 97.2% (140) | +0.7% | 77.4% (233) | +4.0% | 83.8%

Counts of correct answers appear in parentheses; each Δ gives the change in accuracy relative to the same model’s zero-shot direct result.
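To make the table’s bookkeeping explicit: accuracy is the number of correct answers divided by the number of questions in each subset (144 concept, 301 applied, 445 overall), and each Δ is computed against the same model’s zero-shot direct result. A minimal sketch of this arithmetic, using the GPT-4 rows above:

```python
def accuracy(correct, total):
    """Accuracy in percent from raw counts."""
    return 100.0 * correct / total

# GPT-4, direct, 0-shot (from the table): 139/144 concept, 221/301 applied.
concept = accuracy(139, 144)               # 96.5%
applied = accuracy(221, 301)               # 73.4%
overall = accuracy(139 + 221, 144 + 301)   # 80.9%

# Delta for GPT-4, CoT, 0-shot on applied questions: 77.4% - 73.4% = +4.0%
delta = accuracy(233, 301) - applied
print(f"{concept:.1f} {applied:.1f} {overall:.1f} {delta:+.1f}")
```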
Model | A | B | C | D | Acc. | p-Value
---|---|---|---|---|---|---
GPT-4 | 77 | 84 | 69 | 71 | 74.6% |
GPT-3.5-turbo | 76 | 91 | 70 | 64 | 53.1% |
LLaMA2-70B-chat | 64 | 87 | 100 ▲ | 50 ▼ | 44.3% |
Flan-UL2-20B | 93 ▲ | 56 ▼ | 70 | 82 | 41.6% |
Flan-T5-XXL-11B | 93 ▲ | 67 | 80 | 61 | 44.8% |
No. of Labels | 74 | 76 | 80 | 71 | |
Model | CoT | Self-Consistency |
---|---|---|
GPT-3.5 | 60.1 ± 1.5 | 64.4 |
LLaMA2-70B-chat | 48.6 ± 1.2 | 48.2 |
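The Self-Consistency column is obtained by decoding multiple chain-of-thought samples and taking a majority vote over the final answers (Wang et al., cited above). A minimal sketch, where sample_fn is a hypothetical stand-in for one stochastic model call:

```python
from collections import Counter

def self_consistency(sample_fn, question, n=10):
    """Majority vote over n sampled chain-of-thought answers.

    sample_fn is a hypothetical stand-in: one stochastic model call
    (temperature > 0) returning the answer letter extracted from a
    single chain-of-thought sample for the given question.
    """
    votes = Counter(sample_fn(question) for _ in range(n))
    return votes.most_common(1)[0][0]  # the most frequent final answer
```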
Method | Acc. | Explicit Reject | True Reject |
---|---|---|---|
Zero-Shot | 49.1 | 1 | - |
Tools Aug. | 49.1 | 14 | 6 |
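The tool-augmented template shown earlier interleaves Thought, Action, and Observation steps until the model commits to an answer or stops. Below is a schematic of that loop; call_model and run_tool are hypothetical stand-ins for the model call and the external tool (e.g., a dictionary lookup), not the paper’s exact implementation.

```python
import json

def answer_with_tools(question, call_model, run_tool, max_steps=5):
    """Schematic Thought/Action/Observation loop for tool-augmented prompting."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_model(transcript)  # model continues with Thought/Action or a final answer
        transcript += step + "\n"
        if "Therefore, the answer is" in step:
            return step.rsplit("Therefore, the answer is", 1)[1].strip()
        if "Action:" in step:  # assumes the step ends with the JSON action
            action = json.loads(step.split("Action:", 1)[1].strip())
            observation = run_tool(action["action"], action["input"])
            transcript += f"Observation: {observation}\n"
    return None  # no final answer within the step budget
```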
System Prompt |
---|
You are an expert in phonetics, English phonology, and second language acquisition. You will play the role of an English teacher who is helping me practice American English pronunciation and speaking skills. You will see a spoken speech evaluation result, where context provides the context for this pronunciation, canonical is the text of this evaluation and represents the expected pronunciation of the speaker, soundlike is how the user’s pronunciation actually sounds, and SentenceScore is the sentence pronunciation score, with a higher score indicating better pronunciation. Fluency is the sentence fluency score, with a higher score indicating better fluency. Speed is the speech rate, which is the average number of milliseconds per phoneme, and emotion is the emotion. In WordScores, each word score is shown in parentheses, and PhonesScores contains the phoneme pronunciation score for each word and what these phonemes actually sound like. Liaison represents the connected sounds between two words, marked with a [∼] symbol. Break represents a break between two words that is greater than 200ms, marked with a [pause] symbol. Stress represents the emphasized syllable or word in a sentence, marked with an asterisk symbol to indicate a word that is emphasized more than the others being compared. Intonation indicates whether the sentence’s intonation is rising, falling, or flat. |
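To illustrate how such an evaluation result might be passed to a chat model, here is a sketch: the field values are invented examples that follow the field definitions in the prompt above, and chat_complete is a placeholder rather than a specific provider API.

```python
SYSTEM_PROMPT = ("You are an expert in phonetics, English phonology, and second "
                 "language acquisition. ...")  # full text as in the table above

# Invented example values following the field definitions in the system prompt.
evaluation = {
    "context": "Ordering a drink at a cafe.",
    "canonical": "I would like a large latte.",
    "soundlike": "I would like a larch latte.",
    "SentenceScore": 72,
    "Fluency": 80,
    "Speed": 95,  # average milliseconds per phoneme
    "emotion": "neutral",
    "WordScores": "I(90) would(85) like(88) a(92) large(45) latte(70)",
    "PhonesScores": "large: l(80) aa(40) r(55) jh(30)",
    "Liaison": "like~a",
    "Break": "large [pause] latte",
    "Stress": "I would like a *large* latte.",
    "Intonation": "falling",
}

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": str(evaluation)},
]
# feedback = chat_complete(messages)  # placeholder call to GPT-3.5 or another chat model
```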
Model | A | B | C | D |
---|---|---|---|---|
GPT-3.5 | 55.6 | 27.8 | 16.7 | 0.0 |
LLaMA2-70B-chat | 35.1 | 18.9 | 43.2 | 2.7 |