1. Introduction
The advent of large language models (LLMs) has heralded a new era of technological advancement with far-reaching implications across various fields, including psychiatry (
Obradovich et al., 2024). These sophisticated artificial intelligence systems, trained on extensive textual data corpora, have exhibited significant capabilities in natural language processing, generation, and understanding (
Luo et al., 2024). Given these advancements, this study aimed to empirically test how LLMs perform in the critical area of mental health diagnostics and treatment recommendations.
LLMs such as ChatGPT-3.5 and ChatGPT-4 have been extensively trained on diverse textual datasets using advanced transformer architectures and self-supervised learning methods (
Luo et al., 2024;
Omar & Levkovich, 2024). These pretrained models excel in processing and generating human-like language, enabling their application in critical areas such as mental health diagnostics and treatment planning (
Elyoseph et al., 2024;
Obradovich et al., 2024). However, the proprietary nature of their training processes limits transparency, rendering them “black boxes” whose outputs require careful interpretation in clinical applications (
Levkovich et al., 2024a). Despite these limitations, research highlights their potential to perform comparably to mental health professionals in areas such as suicide risk assessment and treatment recommendations, underscoring their transformative capabilities (
Elyoseph et al., 2024;
Levkovich et al., 2024b).
The rapid development of LLM tools, such as ChatGPT-4, Claude, and their derivatives, offers unprecedented opportunities for advancing therapeutic interventions, optimizing data analysis, and facilitating personalized patient care (
Demszky et al., 2023;
Yang et al., 2023). Moreover, the ability of LLMs to integrate multimodal data, such as patient history and symptom descriptions, further enhances their diagnostic utility, highlighting their potential to address existing gaps in mental healthcare (
Nazi & Peng, 2024).
Nevertheless, as AI-driven solutions gain prominence, these technologies also prompt critical ethical considerations, including concerns about data privacy and the potential diminishment of human expertise (
Hadar-Shoval et al., 2024;
Tortora, 2024). Additionally, although LLMs can enhance diagnostic accuracy and expand treatment options, they cannot replicate the nuanced understanding and empathetic engagement that human clinicians provide, making it essential to consider their role as supportive tools rather than replacements for mental health professionals (
Haber et al., 2024).
Indeed, accurate diagnosis by mental health professionals is essential for delivering effective treatment and care, as it directly influences patient well-being and treatment outcomes (
Haber et al., 2024). Yet the diagnostic process is often complicated by challenges such as overlapping symptoms across disorders and the influence of cultural biases, which can lead to inaccuracies that in turn affect the quality of patient care (
Bradford et al., 2024). LLM assessments do not always align with those of human mental health clinicians, indicating the need for further refinement before LLMs can operate independently. In one study (
Elyoseph & Levkovich, 2024), ChatGPT was used to analyze case vignettes with varying levels of perceived burdensomeness and thwarted belongingness, both of which are key suicide risk factors. Although ChatGPT correctly identified the highest risk in vignettes with elevated levels of both factors, it generally predicted a lower suicide risk than human professionals (
Elyoseph & Levkovich, 2023). In contrast, in another study related to depression, the diagnoses and recommendations made by ChatGPT were found to be more accurate than those of medical professionals (
Levkovich & Elyoseph, 2023).
Beyond the ability of professionals to identify mental health issues, the treatment recommendations they provide constitute a critical step in the therapeutic process (
Morgan et al., 2013;
Elyoseph & Levkovich, 2024). Previous studies comparing professionals and language models have yielded conflicting results. For example, in research on depression, ChatGPT's identification of the condition and its determinants, as well as its recommendations, proved more accurate than those of medical professionals. One study (
Elyoseph & Levkovich, 2024) evaluated the ability of mental health professionals to predict the prognosis of schizophrenia, both with and without treatment, including long-term positive and negative outcomes. The study then compared these professional evaluations with those of Google Bard, ChatGPT-3.5, ChatGPT-4, and Claude (
Elyoseph & Levkovich, 2024). The findings revealed that while ChatGPT-3.5 produced more pessimistic estimates, the other language models were closely aligned with the professional evaluations (
Elyoseph & Levkovich, 2024). A review examining how artificial intelligence tools are applied in managing anxiety and depression found that a wide range of tools, including chatbots, mobile applications, and LLMs, are effective in reducing symptoms (
Pavlopoulos et al., 2024).
To date, only limited research has examined the diagnosis and treatment of psychiatric conditions using LLM tools. Most studies have focused on specific mental disorders, such as depression and schizophrenia, and have primarily assessed the ability of various LLM tools to identify these conditions and to evaluate treatment recommendations, with comparisons between professionals and language models yielding conflicting results (
Omar et al., 2024).
The current study offers a comprehensive examination of multiple disorders across a broad range of models. In doing so, it explores the intersection between LLM tools and mental health professionals, with particular focus on how these technologies may reshape mental healthcare, psychological assessment, and outcome prediction. By using the assessments of mental health professionals as a reference group, we seek to provide a clearer understanding of the current state of LLM integration in mental health.
The research objectives were as follows:
To compare correct diagnosis rates across different LLM tools and mental health professionals.
To compare treatment recommendations across different LLM tools and mental health professionals.
To compare outcomes predicted by LLM tools and mental health professionals, both for those who received help and for those who did not.
4. Discussion
This study sought to compare various LLM tools and mental health professionals with respect to diagnostic accuracy, recommended treatments, and predicted outcomes for different mental health conditions. Predicted outcomes were compared for individuals who received help and for those who did not.
In the present study, all LLM tools achieved a 100% correct diagnosis rate for the depression vignettes, while professionals recorded a slightly lower accuracy rate of 95%. ChatGPT-4 demonstrated superior performance for the depression with suicidal thoughts vignette, achieving a 100% correct diagnosis rate and significantly outperforming the other entities. Yet ChatGPT-4 exhibited a notably lower correct diagnosis rate of 55% for early schizophrenia, compared to the 95–100% rates achieved by the other entities. Similarly, ChatGPT-4 underperformed in the chronic schizophrenia vignette, with a 67% correct diagnosis rate compared to the 95% rate achieved by professionals. Despite these shortcomings, all the LLM tools consistently achieved a 100% correct diagnosis rate for the social phobia and PTSD vignettes, surpassing the performance of professionals.
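For readers interested in how per-vignette accuracy figures of this kind can be tallied once model responses have been collected and coded, the following minimal Python sketch illustrates the calculation. The model names, vignette labels, and coded responses shown are hypothetical placeholders; they do not reproduce the study's materials or analysis pipeline.

```python
from collections import defaultdict

# Hypothetical coded responses: (model, vignette, coded_diagnosis).
# In practice, each free-text reply would first be coded against the
# vignette's intended diagnosis (e.g., by independent human raters).
records = [
    ("ChatGPT-4", "depression", "depression"),
    ("ChatGPT-4", "early_schizophrenia", "depression"),   # an incorrect diagnosis
    ("Claude", "early_schizophrenia", "schizophrenia"),
    ("Gemini", "depression", "depression"),
]

# Intended (correct) diagnosis for each hypothetical vignette.
expected = {
    "depression": "depression",
    "early_schizophrenia": "schizophrenia",
}

def accuracy_by_model_and_vignette(records, expected):
    """Return {(model, vignette): proportion of correctly coded diagnoses}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for model, vignette, coded in records:
        totals[(model, vignette)] += 1
        hits[(model, vignette)] += int(coded == expected[vignette])
    return {key: hits[key] / totals[key] for key in totals}

for (model, vignette), rate in sorted(
    accuracy_by_model_and_vignette(records, expected).items()
):
    print(f"{model:10s} {vignette:20s} {rate:.0%}")
```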
The observed underperformance of ChatGPT-4 in diagnosing early schizophrenia and chronic schizophrenia highlights a potential limitation in the ability of current LLMs to generalize across diverse mental health conditions (
Guo et al., 2024). This variability suggests that while LLMs are effective in diagnosing common conditions such as depression (
Elyoseph & Levkovich, 2023;
Weisenburger et al., 2024), their reliability may diminish when faced with more complex or less common disorders (
Levkovich & Omar, 2024). A study analyzing approximately 3000 posts from clinicians regarding the ethical concerns of using LLMs in healthcare raised a number of key issues, including the fairness and reliability of these systems, as well as concerns over data accuracy (
Mirzaei et al., 2024). These findings underscore the importance of continuous model improvement and the need for caution when relying solely on LLMs to diagnose complex cases (
Lawrence et al., 2024). The mixed performance of ChatGPT-4 across different conditions further emphasizes the necessity for additional training and evaluation of LLMs (
Stade et al., 2024). Ensuring that these tools are trained on diverse and representative datasets is crucial for enhancing their generalizability and reliability across a broad spectrum of mental health conditions (
Cabrera et al., 2023). Moreover, ongoing evaluation in real-world clinical settings is essential, both to ensure that these tools consistently perform at a high level and to provide accurate diagnostic support (
Ohse et al., 2024).
In the current study, significant disparities in treatment recommendations were observed between LLM tools and professionals across all vignettes. For the depression vignette, LLM tools consistently recommended seeing a general practitioner (GP) (100% rate), exceeding the rate at which professionals made this recommendation. Similar trends were noted for counselor recommendations, with LLM tools showing greater consistency. In contrast, professionals demonstrated more variability in recommending antidepressants, whereas ChatGPT-4 recommended them consistently (100% rate). In the depression with suicidal thoughts vignette, LLM tools frequently recommended a broader range of treatment options, with high recommendation rates for physical activity and psychotherapy. Across all vignettes, LLM tools generally advocated a wider range of treatments, including physical exercise and cognitive behavioral therapy, more frequently than professionals did.
The consistent recommendation patterns observed in LLM tools, such as the nearly universal suggestion to see a GP (100% across various vignettes), indicate that these tools may default to prioritizing certain baseline interventions (
Golden et al., 2024). In a study involving three hypothetical patient scenarios with significant complaints, ChatGPT was used as a virtual assistant to a psychiatrist (
Dergaa et al., 2024). While ChatGPT’s initial recommendations were appropriate, as the complexity of the clinical cases increased, the recommendations became inappropriate and potentially dangerous in some instances (
Dergaa et al., 2024). The lack of variability in these recommendations may also indicate a limitation in the ability of these tools to tailor advice to specific patient needs, potentially leading to overgeneralization (
Higgins et al., 2023). This limitation highlights a deficiency in the models’ capacity for clinical judgment and nuanced decision-making (
Dergaa et al., 2024). To maintain the quality of patient care, it is crucial to use LLMs as complementary tools rather than as replacements for professional expertise (
Prabhod, 2023;
Xie et al., 2024).
In the present study, LLM tools consistently predicted less optimistic mental health outcomes for individuals receiving help than did professionals. For depression with suicidal thoughts, Claude predicted a 55% full recovery rate and ChatGPT-4 only a 13.64% full recovery rate, whereas professionals estimated the full recovery rate at 94.4%. For early schizophrenia, Claude predicted a 5% full recovery rate, in contrast to professionals' prediction of 60.4%. The LLMs often predicted no recovery for untreated individuals across various conditions, whereas professionals generally expected some level of improvement. Additionally, the LLMs tended to predict lower rates of full recovery and higher rates of partial recovery than professionals did. The complexity of prognosis prediction is also evident in studies incorporating extensive personal and medical information. For example, a review of 30 studies analyzed the use of AI methods to predict clinical outcomes in patients with psychotic disorders on the basis of detailed patient histories, including medical records and personal factors. The prediction accuracy of these AI methods ranged from 48% to 89% (
Tay et al., 2024).
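As an illustration of how differences between outcome distributions of this kind might be examined statistically, the brief sketch below applies a chi-square test of independence to a contingency table of full, partial, and no-recovery predictions. The counts are invented for illustration only and do not correspond to the study's data.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of predicted outcomes (full, partial, no recovery)
# for one LLM versus the professional sample; illustrative values only.
llm_predictions = [2, 14, 4]
professional_predictions = [18, 2, 0]

chi2, p_value, dof, _expected = chi2_contingency(
    [llm_predictions, professional_predictions]
)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```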
The consistent optimism of human professionals regarding full recovery rates highlights the importance of clinical experience and judgment in mental health treatment (
Leamy et al., 2023). Professionals’ positive outlooks may reflect a broader understanding of patient resilience, therapeutic potential, and the nuances of mental health conditions, which LLM tools currently lack (
Alhuwaydi, 2024). This suggests that while LLMs can be valuable in supporting diagnosis and treatment planning, they should not replace the nuanced judgment of experienced clinicians (
Alowais et al., 2023).
The variability and conservative predictions of LLM tools, especially in cases in which professional intervention is lacking, raise concerns about their reliability and ethical use in mental health (
Hadar-Shoval et al., 2024;
Mirzaei et al., 2024). These tools often display a pessimistic bias that may lead them to underestimate recovery potential and, if used without professional oversight, negatively affect treatment planning (
Weisenburger et al., 2024). This finding underscores the importance of carefully integrating LLMs into clinical practice to support, rather than hinder, patient outcomes (
Elyoseph & Levkovich, 2024;
Golden et al., 2024). Continued refinement of LLMs, incorporating diverse real-world clinical data, is necessary to improve their accuracy (
Aich et al., 2024). Furthermore, combining LLM tools with ongoing professional input can address the current limitations, thus reinforcing the irreplaceable role of human clinical expertise in mental healthcare (
Lawrence et al., 2024).
Limitations
This study utilized validated vignettes that have been employed in previous research across a range of professionals and mental disorders. Nevertheless, these vignettes are not real cases, and the use of text-based scenarios may not fully capture the complexities of real-life patient interactions or the broader spectrum of mental health conditions, thereby limiting the generalizability of the findings. Additionally, LLMs are trained on extensive datasets that may contain biases, and these biases can influence their diagnostic and treatment recommendations. The ‘black box’ nature of LLMs further complicates this issue, making it challenging to discern the rationale behind specific recommendations or diagnoses, a crucial factor in clinical contexts for establishing trust and ethical practice.

Furthermore, the findings of this study are based on comparisons with health professionals but lack direct clinical validation in actual patient care, underscoring the need for future research to focus on clinical trials and real-world applications to assess the efficacy and safety of LLMs in mental health diagnostics and treatment planning. Moreover, given the rapid development of AI technologies, the capabilities of LLMs are continuously evolving, and newer versions of the models assessed in this study may perform differently, highlighting the necessity of ongoing evaluation. Addressing these limitations will provide a more comprehensive understanding of the challenges and considerations in applying LLM technology to mental health, paving the way for more informed and ethical research and implementation in the future.
5. Conclusions
This study compared the diagnostic accuracy and treatment recommendations of Gemini, Claude, ChatGPT-3.5, and ChatGPT-4 with those of mental health professionals for various mental health conditions. Text vignettes were used to evaluate the performance of the LLMs and to compare it with norms established by a sample of health professionals. The LLMs demonstrated high diagnostic accuracy, with 100% correct diagnosis rates for depression, social phobia, and PTSD, often surpassing professionals. However, ChatGPT-4 was less accurate than the other entities in the case of early and chronic schizophrenia. The LLMs consistently recommended consulting healthcare professionals at higher rates than the professionals themselves did.
The LLMs exhibited more conservative estimates, generally predicting lower rates of full recovery and higher rates of partial recovery. Conversely, human experts consistently demonstrated a more optimistic outlook regarding full recovery across various conditions, including depression, suicidal ideation, schizophrenia, and PTSD. While both groups forecast poorer outcomes in the absence of intervention, the LLMs displayed a notably more pessimistic perspective. These findings underscore the contrast between the generally hopeful prognoses of human professionals and the more cautious predictions of LLMs in the context of mental health recovery. The results highlight the potential for integrating LLMs into clinical decision-making processes; however, further research is necessary to validate these findings and overcome the study’s limitations.