Evaluating Causal Reasoning Capabilities of Large Language Models: A Systematic Analysis Across Three Scenarios
Abstract
1. Introduction
2. Methods
2.1. Methods Description
- Type Causality (Cause to Effect): The model is required to predict the outcome based on initial conditions and actions.
- Actual Causality (Effect to Cause): The model needs to infer the cause(s) leading to an observed outcome.
- Type Causality with Intervention: This variant introduces an intervention that modifies an initial condition or action, challenging the model to adjust its reasoning to account for the change.
- Actual Causality with Intervention: This scenario includes an intervention impacting the outcome, requiring the model to infer altered causal pathways. (A prompt-template sketch illustrating all four patterns follows this list.)
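To make the four query patterns concrete, the minimal sketch below shows how they could be instantiated as prompt templates. The template wording, the example scenario, and the `build_prompt` helper are hypothetical illustrations, not the exact prompts used in our experiments.

```python
# Minimal sketch of the four query patterns as prompt templates.
# The template wording and the example scenario are hypothetical
# illustrations, not the prompts used in the experiments.

QUERY_TEMPLATES = {
    "cause_to_effect": (
        "{scenario}\n"
        "Question: Given these initial conditions and actions, what is the outcome?"
    ),
    "effect_to_cause": (
        "{scenario}\n"
        "Question: The observed outcome is {outcome}. Which action(s) caused it?"
    ),
    "cause_to_effect_intervention": (
        "{scenario}\n"
        "Intervention: {intervention}\n"
        "Question: After this intervention, what is the outcome?"
    ),
    "effect_to_cause_intervention": (
        "{scenario}\n"
        "Intervention: {intervention}\n"
        "Question: The observed outcome is {outcome}. "
        "Which action(s) caused it under this intervention?"
    ),
}

def build_prompt(pattern, scenario, outcome="", intervention=""):
    """Fill one of the four query patterns with a concrete scenario."""
    return QUERY_TEMPLATES[pattern].format(
        scenario=scenario, outcome=outcome, intervention=intervention
    )

# Hypothetical scenario in the spirit of Case 1 (coin flipping).
scenario = ("Alice and Bob each flip a coin that starts heads up; "
            "each flip turns the coin over.")
print(build_prompt("cause_to_effect", scenario))
print(build_prompt("effect_to_cause", scenario, outcome="the final state of the coin"))
```

In this scheme, each case description supplies the scenario text and, for the interventional variants, the statement of the intervention.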
2.2. Case Descriptions and Intervention Details
- Case 1: Coin-Flipping Game
- Case 2: File Downloading
- Case 3: Simpson’s Paradox in Smallpox Vaccination
3. Experiments
3.1. Experimental Setup
3.2. Experimental Setup
3.3. Results
3.3.1. Zero-Shot Prompting
- ✓: Completely correct results and explanations.
- ⍻: Results with flawed analysis or incomplete reasoning.
- ✕: Incorrect results.
3.3.2. Few-Shot Prompting
3.3.3. CoT Prompting
3.4. Analysis
- GPT-4’s Performance: GPT-4 consistently demonstrated superior causal understanding compared to other models. Its ability to answer questions correctly across all four patterns in each scenario, regardless of the prompting method, suggests a more robust internal representation of causal relationships. To explore the specific factors contributing to GPT-4’s performance, future work should examine its training dataset and architectural features. This analysis could offer insights into why GPT-4 excels in causal reasoning tasks, possibly due to a larger corpus of causal language data or architectural enhancements.
- Challenges in Complex Scenarios: Case 3, which involves Simpson’s Paradox, proved particularly challenging for most models across prompting methods. This highlights the difficulty of causal scenarios that involve counterintuitive statistical relationships and require an understanding of confounding variables; LLMs struggle with Simpson’s Paradox because it involves multiple levels of causal inference (a worked numeric illustration of the paradox is given after this list). To put this challenge in context, we point to research on statistical reasoning and causal inference showing that even human reasoning often falls short without training in statistical methods. Studies such as [8,11] have demonstrated that handling such paradoxes demands an understanding of conditional dependencies and confounders, concepts that current models are not fully equipped to process.
- Few-Shot Prompting: Few-shot prompting generally improved the response accuracy of most models, with Gemini achieving the highest correctness rate of 41.7% (a tally reproducing this rate appears after this list). In the few-shot prompts, we selected examples that simplify the target scenario: in the coin-flipping case, the task involves four people divided into two groups, while the added example demonstrates sequential actions by a single person, helping the model grasp the rules and the expected format of the outcomes. This suggests that providing structured examples enhances models’ ability to apply causal reasoning, especially in tasks where the causal chains are more apparent.
- Chain-of-Thought (CoT) Prompting: CoT prompting, designed to guide models through complex reasoning steps, particularly benefited models with strong code comprehension, such as CodeLlama-70B, which achieved a correctness rate of 25% (a sketch of this prompting style is given after this list). This correlation between structured logical processes and improved causal reasoning suggests that models capable of following logical sequences may handle causality better.
- Performance Variability Under Interventions: Models such as Mixtral and CodeLlama showed fluctuating performance on the interventional variants. These observations point to specific challenges that mixture-of-experts (MoE) models and code-specialized models face when reasoning about causality under modified conditions, and they provide a basis for targeted improvements.
- Discrepancies Between Conclusions and Explanations: Across all prompting methods, we observed that LLMs frequently arrived at correct conclusions despite offering flawed explanations. This discrepancy suggests that while LLMs possess some degree of inherent reasoning ability, they lack a robust underlying structure for causality. This observation emphasizes the limitations of current LLMs in truly understanding causal relationships, as they may rely on statistical patterns or heuristic shortcuts rather than genuine causal reasoning. Gaining insights into this potential reliance mechanism can offer guidance for future model development, highlighting the need to prioritize capturing genuine causal relationships.
- Impact of Interventions on Model Performance: Finally, the introduction of interventional variants across cases generally increased task difficulty for most models. This finding underscores that reasoning about causal relationships under altered conditions remains a significant challenge for current LLMs, highlighting an area of potential improvement for building models better suited to adaptive causal analysis.
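To make the paradox discussed above concrete, the short sketch below uses illustrative counts (not the figures from Case 3) in which vaccination lowers the infection rate within each age stratum yet appears harmful in the pooled data, because age confounds both vaccination uptake and baseline risk.

```python
# Illustrative numbers only (not the data used in Case 3): vaccination lowers
# the infection rate within each age group, yet appears to raise it in the
# pooled data, because age (the confounder) drives both who gets vaccinated
# and the baseline risk.

data = {
    # group: {"vaccinated": (n, infected), "unvaccinated": (n, infected)}
    "older":   {"vaccinated": (800, 60), "unvaccinated": (200, 20)},
    "younger": {"vaccinated": (200, 2),  "unvaccinated": (800, 16)},
}

def rate(n_infected, n_total):
    return n_infected / n_total

# Within each stratum, vaccination is associated with a lower infection rate.
for group, arms in data.items():
    for arm, (n, infected) in arms.items():
        print(f"{group:8s} {arm:13s} infection rate = {rate(infected, n):.1%}")

# Pooling over groups reverses the comparison: the vaccinated pool is dominated
# by the higher-risk older group, so its aggregate rate looks worse.
for arm in ("vaccinated", "unvaccinated"):
    n = sum(data[g][arm][0] for g in data)
    infected = sum(data[g][arm][1] for g in data)
    print(f"pooled   {arm:13s} infection rate = {rate(infected, n):.1%}")
```

Here vaccination looks beneficial in both strata (7.5% vs. 10% and 1% vs. 2%) but harmful in aggregate (6.2% vs. 3.6%), which is exactly the kind of reversal the models must reason through in Case 3.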
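The correctness rates quoted above are simple proportions of fully correct (✓) answers over the 12 case-variant combinations. For example, Gemini's 41.7% under few-shot prompting corresponds to 5 ✓ out of 12, as the small tally below (mirroring Gemini's column in the few-shot results table reproduced later in this document) shows.

```python
# Tally of a correctness rate from symbol-coded results
# (✓ correct, ⍻ flawed or incomplete reasoning, ✕ incorrect).
# The list mirrors Gemini's few-shot column: 5 ✓ out of 12 variants.
gemini_few_shot = ["✓", "✕", "✕", "✕", "✓", "✓", "✓", "✓", "✕", "⍻", "✕", "⍻"]

def correctness_rate(results):
    return results.count("✓") / len(results)

print(f"{correctness_rate(gemini_few_shot):.1%}")  # 41.7%
```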
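As a rough illustration of the chain-of-thought style referred to above, the sketch below prepends a worked exemplar with explicit reasoning steps to the query; the exemplar wording and the `cot_prompt` helper are hypothetical and not the exact CoT prompts used in the experiments.

```python
# Hypothetical chain-of-thought wrapper: the exemplar reasoning text is
# illustrative only, not the CoT prompt used in the experiments.

COT_EXEMPLAR = (
    "Example question: A single player flips a coin twice; it starts heads up "
    "and each flip turns it over. Is it heads up at the end?\n"
    "Reasoning: The coin starts heads up. The first flip turns it to tails. "
    "The second flip turns it back to heads. So it ends heads up.\n"
    "Answer: yes\n\n"
)

def cot_prompt(question):
    """Prepend a worked example and ask the model to reason step by step."""
    return COT_EXEMPLAR + question + "\nLet's think step by step."

print(cot_prompt("Four players in two groups each flip the coin once; "
                 "the coin starts heads up. Is it heads up at the end?"))
```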
4. Discussions
4.1. Practical Applications
4.2. Limitations
4.3. Future Works
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
References
- Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. OpenAI Blog 2018, 1–12. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
- Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv 2022, arXiv:2203.11171. [Google Scholar]
- Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Adv. Neural Inf. Process. Syst. 2024, 36, 1–14. [Google Scholar]
- Schölkopf, B.; Locatello, F.; Bauer, S.; Ke, N.R.; Kalchbrenner, N.; Goyal, A.; Bengio, Y. Toward causal representation learning. Proc. IEEE 2021, 109, 612–634. [Google Scholar] [CrossRef]
- Jin, Z.; Chen, Y.; Leeb, F.; Gresele, L.; Kamal, O.; Lyu, Z.; Blin, K.; Gonzalez Adauto, F.; Kleiman-Weiner, M.; Sachan, M.; et al. CLadder: A benchmark to assess causal reasoning capabilities of language models. Adv. Neural Inf. Process. Syst. 2024, 36. [Google Scholar] [CrossRef]
- Li, Y.; Xu, M.; Miao, X.; Zhou, S.; Qian, T. Large language models as counterfactual generator: Strengths and weaknesses. arXiv 2023, arXiv:2305.14791. [Google Scholar]
- Pearl, J.; Mackenzie, D. The Book of Why: The New Science of Cause and Effect; Basic Books: New York City, NY, USA, 2018. [Google Scholar]
- Illari, P.; Russo, F. Causality: Philosophical Theory Meets Scientific Practice; OUP: Oxford, UK, 2014. [Google Scholar]
- Feltz, B.; Missal, M.; Sims, A. Free Will, Causality, and Neuroscience; Brill: Leiden, The Netherlands, 2019. [Google Scholar]
- Halpern, J.Y. Actual Causality; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
- Cai, Z.; Cao, M.; Chen, H.; Chen, K.; Chen, K.; Chen, X.; Chen, X.; Chen, Z.; Chen, Z.; Chu, P.; et al. InternLM2 Technical Report. arXiv 2024, arXiv:2403.17297. [Google Scholar]
- Madaan, A.; Zhou, S.; Alon, U.; Yang, Y.; Neubig, G. Language models of code are few-shot commonsense learners. arXiv 2022, arXiv:2210.07128. [Google Scholar]
- Zhang, L.; Dugan, L.; Xu, H.; Callison-Burch, C. Exploring the curious case of code prompts. arXiv 2023, arXiv:2304.13250. [Google Scholar]
- Penedo, G.; Malartic, Q.; Hesslow, D.; Cojocaru, R.; Alobeidli, H.; Cappelli, A.; Pannier, B.; Almazrouei, E.; Launay, J. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data only. Adv. Neural Inf. Process. Syst. 2024, 36, 79155–79172. [Google Scholar]
- Jiang, A.Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Chaplot, D.S.; Casas, D.D.L.; Hanna, E.B.; Bressand, F.; et al. Mixtral of experts. arXiv 2024, arXiv:2401.04088. [Google Scholar]
- Roziere, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Remez, T.; Rapin, J.; et al. Code llama: Open foundation models for code. arXiv 2023, arXiv:2308.12950. [Google Scholar]
- Team, G.; Mesnard, T.; Hardin, C.; Dadashi, R.; Bhupatiraju, S.; Pathak, S.; Sifre, L.; Rivière, M.; Kale, M.S.; Love, J.; et al. Gemma: Open models based on gemini research and technology. arXiv 2024, arXiv:2403.08295. [Google Scholar]
Zero-shot prompting results for the three cases; (i) denotes the interventional variant of each query pattern.

| Case | Variant | InternLM2 | Falcon | Mixtral | Llama | CodeLlama | ChatGPT | GPT-4 | Gemini |
|---|---|---|---|---|---|---|---|---|---|
| Case 1 | cause -> effect | ✕ | ✕ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Case 1 | effect -> cause | ✕ | ✕ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Case 1 | cause -> effect (i) | ✕ | ⍻ | ✕ | ✕ | ✓ | ✓ | ✓ | ✕ |
| Case 1 | effect -> cause (i) | ✕ | ✕ | ✕ | ✕ | ✕ | ✓ | ✓ | ✕ |
| Case 2 | cause -> effect | ✕ | ✕ | ✕ | ✓ | ✕ | ✓ | ✓ | ✕ |
| Case 2 | effect -> cause | ✕ | ✕ | ✕ | ✕ | ✕ | ✓ | ✓ | ✕ |
| Case 2 | cause -> effect (i) | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ | ✓ | ✕ |
| Case 2 | effect -> cause (i) | ✕ | ✕ | ✕ | ✕ | ⍻ | ✓ | ✓ | ✓ |
| Case 3 | cause -> effect | ✕ | ✕ | ✕ | ✕ | ✕ | ✓ | ✓ | ✓ |
| Case 3 | effect -> cause | ⍻ | ⍻ | ✕ | ⍻ | ✕ | ⍻ | ⍻ | ⍻ |
| Case 3 | cause -> effect (i) | ⍻ | ✕ | ⍻ | ✕ | ⍻ | ✕ | ✓ | ⍻ |
| Case 3 | effect -> cause (i) | ⍻ | ⍻ | ⍻ | ✕ | ⍻ | ⍻ | ⍻ | ⍻ |
Few-shot prompting results for the three cases; (i) denotes the interventional variant.

| Case | Variant | InternLM2 | Falcon | Mixtral | Llama | CodeLlama | ChatGPT | GPT-4 | Gemini |
|---|---|---|---|---|---|---|---|---|---|
| Case 1 | cause -> effect | ✕ | ✕ | ✕ | ✕ | ⍻ | ✓ | ✓ | ✓ |
| Case 1 | effect -> cause | ✕ | ✓ | ✓ | ⍻ | ✓ | ✕ | ✓ | ✕ |
| Case 1 | cause -> effect (i) | ✓ | ✕ | ✕ | ✕ | ⍻ | ✕ | ✓ | ✕ |
| Case 1 | effect -> cause (i) | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ | ✓ | ✕ |
| Case 2 | cause -> effect | ✕ | ✕ | ✕ | ✕ | ✕ | ✓ | ✓ | ✓ |
| Case 2 | effect -> cause | ✕ | ✕ | ✕ | ✕ | ✕ | ✓ | ✓ | ✓ |
| Case 2 | cause -> effect (i) | ✕ | ✕ | ✕ | ✕ | ✕ | ✓ | ✓ | ✓ |
| Case 2 | effect -> cause (i) | ✕ | ✕ | ✕ | ✕ | ⍻ | ✓ | ✓ | ✓ |
| Case 3 | cause -> effect | ✕ | ✕ | ✓ | ✕ | ✕ | ✓ | ✓ | ✕ |
| Case 3 | effect -> cause | ✓ | ✕ | ⍻ | ⍻ | ⍻ | ⍻ | ⍻ | ⍻ |
| Case 3 | cause -> effect (i) | ✓ | ⍻ | ✓ | ✕ | ⍻ | ⍻ | ✓ | ✕ |
| Case 3 | effect -> cause (i) | ✕ | ✕ | ✕ | ✕ | ⍻ | ✕ | ✓ | ⍻ |
Chain-of-thought (CoT) prompting results for the three cases; (i) denotes the interventional variant.

| Case | Variant | InternLM2 | Falcon | Mixtral | Llama | CodeLlama | ChatGPT | GPT-4 | Gemini |
|---|---|---|---|---|---|---|---|---|---|
| Case 1 | cause -> effect | ✕ | ✕ | ✕ | ✕ | ⍻ | ✕ | ✓ | ✕ |
| Case 1 | effect -> cause | ⍻ | ✕ | ✕ | ✕ | ✓ | ✓ | ✓ | ✓ |
| Case 1 | cause -> effect (i) | ⍻ | ✕ | ✓ | ✕ | ✓ | ✓ | ✓ | ✕ |
| Case 1 | effect -> cause (i) | ✕ | ✕ | ✕ | ✕ | ✕ | ✓ | ✓ | ✕ |
| Case 2 | cause -> effect | ✕ | ✕ | ✓ | ✕ | ✕ | ✓ | ✓ | ✕ |
| Case 2 | effect -> cause | ✕ | ✕ | ✕ | ✕ | ✕ | ✓ | ✓ | ✕ |
| Case 2 | cause -> effect (i) | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ | ✓ | ✕ |
| Case 2 | effect -> cause (i) | ✕ | ✕ | ⍻ | ✕ |  | ✕ | ✓ | ✕ |
| Case 3 | cause -> effect | ✕ | ✕ | ✕ | ✕ | ⍻ | ✓ | ✓ | ✕ |
| Case 3 | effect -> cause | ✓ | ⍻ | ✕ | ⍻ | ⍻ | ✓ | ✓ | ⍻ |
| Case 3 | cause -> effect (i) | ⍻ | ✕ | ⍻ | ✕ | ⍻ | ✓ | ✓ | ✕ |
| Case 3 | effect -> cause (i) | ⍻ | ✕ | ⍻ | ⍻ | ⍻ | ⍻ | ✓ | ⍻ |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wang, L.; Shen, Y. Evaluating Causal Reasoning Capabilities of Large Language Models: A Systematic Analysis Across Three Scenarios. Electronics 2024, 13, 4584. https://doi.org/10.3390/electronics13234584