Article

Evaluating Diagnostic Accuracy and Treatment Efficacy in Mental Health: A Comparative Analysis of Large Language Model Tools and Mental Health Professionals

Faculty of Education, Tel-Hai Academic College, Upper Galilee 2208, Israel
Eur. J. Investig. Health Psychol. Educ. 2025, 15(1), 9; https://doi.org/10.3390/ejihpe15010009
Submission received: 7 November 2024 / Revised: 23 December 2024 / Accepted: 8 January 2025 / Published: 18 January 2025

Abstract

Large language models (LLMs) offer promising possibilities in mental health, yet their ability to assess disorders and recommend treatments remains underexplored. This quantitative cross-sectional study evaluated four LLMs (Gemini (Gemini 2.0 Flash Experimental), Claude (Claude 3.5 Sonnet), ChatGPT-3.5, and ChatGPT-4) using text vignettes representing conditions such as depression, suicidal ideation, early and chronic schizophrenia, social phobia, and PTSD. Each model’s diagnostic accuracy, treatment recommendations, and predicted outcomes were compared with norms established by mental health professionals. Findings indicated that for certain conditions, including depression and PTSD, models like ChatGPT-4 achieved higher diagnostic accuracy compared to human professionals. However, in more complex cases, such as early schizophrenia, LLM performance varied, with ChatGPT-4 achieving only 55% accuracy, while other LLMs and professionals performed better. LLMs tended to suggest a broader range of proactive treatments, whereas professionals recommended more targeted psychiatric consultations and specific medications. In terms of outcome predictions, professionals were generally more optimistic regarding full recovery, especially with treatment, while LLMs predicted lower full recovery rates and higher partial recovery rates, particularly in untreated cases. While LLMs recommend a broader treatment range, their conservative recovery predictions, particularly for complex conditions, highlight the need for professional oversight. LLMs provide valuable support in diagnostics and treatment planning but cannot replace professional discretion.

1. Introduction

The advent of large language models (LLMs) has heralded a new era of technological advancement with far-reaching implications across various fields, including psychiatry (Obradovich et al., 2024). These sophisticated artificial intelligence systems, trained on extensive textual data corpora, have exhibited significant capabilities in natural language processing, generation, and understanding (Luo et al., 2024). Given these advancements, this study aimed to empirically test how LLMs are applied in the critical area of mental health diagnostics and treatment recommendations.
Large language models (LLMs), such as ChatGPT-3.5 and ChatGPT-4, have been extensively trained on diverse textual datasets using advanced transformer architectures and self-supervised learning methods (Luo et al., 2024; Omar & Levkovich, 2024). These pretrained models excel in processing and generating human-like language, enabling their application in critical areas such as mental health diagnostics and treatment planning (Elyoseph et al., 2024; Obradovich et al., 2024). However, the proprietary nature of their training processes limits transparency, often presenting a “black box” characteristic that necessitates careful interpretation of their outputs in clinical applications (Levkovich et al., 2024a). Despite these limitations, research highlights their potential to perform comparably to mental health professionals in areas such as suicide risk assessment and treatment recommendations, underscoring their transformative capabilities (Elyoseph et al., 2024; Levkovich et al., 2024b).
The rapid development of LLM tools, such as ChatGPT-4, Claude, and their derivatives, offers unprecedented opportunities for advancing therapeutic interventions, optimizing data analysis, and facilitating personalized patient care (Demszky et al., 2023; Yang et al., 2023). Moreover, the ability of LLMs to integrate multimodal data, such as patient history and symptom descriptions, further enhances their diagnostic utility, highlighting their potential to address existing gaps in mental healthcare (Nazi & Peng, 2024).
Nevertheless, as AI-driven solutions gain prominence, these technologies also prompt critical ethical considerations, including concerns about data privacy and the potential diminishment of human expertise (Hadar-Shoval et al., 2024; Tortora, 2024). Additionally, although LLMs can enhance diagnostic accuracy and treatment options, they cannot replicate the nuanced understanding and empathetic engagement that human clinicians provide, making it essential to consider their role as supportive tools rather than replacements for mental health professionals (Haber et al., 2024).
Indeed, accurate diagnosis by mental health professionals is essential for delivering effective treatment and care, as it directly influences patient well-being and treatment outcomes (Haber et al., 2024). Yet the diagnostic process is often complicated by challenges such as overlapping symptoms across disorders and the influence of cultural biases, which can lead to inaccuracies and subsequently affect the quality of patient care (Bradford et al., 2024). LLM assessments do not always align with those of human mental health clinicians, indicating the need for further refinement before LLMs can operate independently. In one study, ChatGPT was used to analyze case vignettes with varying levels of perceived burdensomeness and thwarted belonging, both of which are key suicide risk factors; although ChatGPT correctly identified the highest risk in vignettes with elevated levels of both factors, it generally predicted a lower suicide risk than human professionals did (Elyoseph & Levkovich, 2023). In contrast, in a study related to depression, the diagnoses and recommendations made by ChatGPT were found to be more accurate than those of medical professionals (Levkovich & Elyoseph, 2023).
Beyond the ability of professionals to identify mental health issues, the treatment recommendations they provide constitute a critical step in the therapeutic process (Morgan et al., 2013; Elyoseph & Levkovich, 2024). Previous studies comparing professionals and language models have yielded conflicting results. In research on depression, ChatGPT's identification of depression and its determinants, together with its recommendations, was found to be more accurate than that of medical professionals (Levkovich & Elyoseph, 2023). One study (Elyoseph & Levkovich, 2024) evaluated the ability of mental health professionals to predict the prognosis of schizophrenia, both with and without treatment, including long-term positive and negative outcomes, and then compared these professional evaluations with those of Google Bard, ChatGPT-3.5, ChatGPT-4, and Claude. The findings revealed that while ChatGPT-3.5 produced more pessimistic estimates, the other language models were closely aligned with the professional evaluations (Elyoseph & Levkovich, 2024). A review examining how artificial intelligence tools have been applied in managing anxiety and depression found that a wide range of tools, including chatbots, mobile applications, and LLMs, are effective in reducing symptoms (Pavlopoulos et al., 2024).
To date, only limited research has examined diagnosis and treatment of psychiatric conditions using LLM tools. Most studies have focused on specific mental disorders, such as depression and schizophrenia, and have primarily assessed the ability of various LLM tools to identify these conditions and evaluate treatment recommendations. Previous studies comparing professionals and language models have yielded conflicting results (Omar et al., 2024).
The current study offers a comprehensive examination of multiple disorders across a broad range of models. In doing so, it explores the intersection between LLM tools and mental health professionals, with particular focus on how these technologies have reshaped the landscape of mental healthcare, psychological assessment, and predicted outcomes. By analyzing the experiences and perspectives of psychology professionals as a reference group, we seek to provide a comprehensive understanding of the current state of LLM integration in mental health.
The research objectives were as follows:
  • To compare correct diagnosis rates across different LLM tools and mental health professionals.
  • To compare treatments across different LLM tools and mental health professionals.
  • To compare outcomes predicted by LLM tools and mental health professionals, both for those who received help and for those who did not.

2. Materials and Methods

2.1. Procedure

The study was conducted in May 2024 to investigate how four advanced LLMs perform in mental health diagnostics: Gemini (Google), Claude (Anthropic), ChatGPT-3.5, and ChatGPT-4 (OpenAI). These models were chosen on the basis of their recent advancements and their specific capabilities in processing natural language inputs related to mental health scenarios. The selected mental health conditions covered a range of common and complex cases encountered in clinical settings: depression, suicidal thoughts, early schizophrenia, chronic schizophrenia, social phobia, and PTSD. The LLMs were assessed for their diagnoses, treatment recommendations, and prognoses for these conditions, and their evaluations were compared to norms established by mental health professionals (general practitioners, psychiatrists, and psychologists), as reported by Morgan et al. (2013, 2014). The health professionals had been selected by area of specialization to ensure a diverse representation of perspectives. Each model was tested 20 times on each of the six clinical vignettes, resulting in a total of 480 assessments. To explore gender-based variations, an additional set of 480 assessments was conducted (n = 480 male, n = 480 female), yielding a total of N = 960 evaluations.
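For clarity, the assessment tally implied by this design works out as follows:
4 models × 6 vignettes × 20 repetitions = 480 assessments per gender version
480 assessments × 2 gender versions (male and female) = 960 evaluations in total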

2.2. Vignette Methodology

The study employed the text vignette methodology, utilizing six vignettes that represent different mental disorders (Reavley & Jorm, 2012). Each vignette was available in both male and female versions. Each vignette underwent ten separate assessments using various LLM tools. After the vignettes were input to LLMs, the models were asked a series of questions to evaluate identification, referral recommendations, and prognosis (Morgan et al., 2013).
All the vignettes depicted an individual with a mental disorder, with male participants referred to as ‘John’ and female participants referred to as ‘Mary’. The vignettes were designed to meet the diagnostic criteria outlined in the Diagnostic and Statistical Manual of Mental Disorders, 5th Edition (DSM-5) and the International Classification of Diseases, 11th Revision (ICD-11). The LLMs were asked to identify any issues with John/Mary that emerged from the vignettes and then to answer two questions assessing the perceived effectiveness of the various interventions. Question 1: ‘What is the most likely outcome for John/Mary given that he/she receives the type of professional help you deem most appropriate?’ (e.g., psychiatrist, psychologist, CBT, psychotherapy, antidepressants, antipsychotics). Question 2: ‘What is the most likely outcome for John/Mary if he/she does not receive any professional help?’ (with response options ranging from “Full recovery with no further problems” to “Condition will worsen”). Responses were assessed using a 6-point Likert scale, ranging from ‘full recovery with no further problems’ to ‘condition will worsen’ (Supplementary Materials).
The performance of all the LLMs was evaluated according to the criteria established by Morgan et al. (2013, 2014). Subsequently, we compared the LLMs' performance to previously established norms derived from a sample of 1536 health professionals (518 general practitioners (GPs), 506 psychiatrists, 498 clinical psychologists, and 14 respondents who did not specify their profession), as well as from 6016 members of the Australian public.

2.3. Statistical Analyses

To determine the number of correct identifications for each LLM tool under each condition, string searches were applied to the LLMs’ open-ended answers to the question regarding the most appropriate diagnosis for the individual described in the vignette. Specifically, for the depression scenarios, any answer with the word ‘depress’ was counted. For depression with suicidal ideation, to be counted, an answer had to include both the string ‘depress’ and the string ‘suicid’ (to include responses containing words such as ‘suicidal’). For both early and chronic schizophrenia conditions, answers containing the string ‘schizophrenia’ or ‘psychosis’ were counted (after accounting for capital letters). Answers containing the string ‘social’ were counted for the social phobia condition, and answers that contained either the string ‘post’ or the string ‘trauma’ were counted for the PTSD condition. The obtained counts were calculated as percentages or counts for each statistical comparison and in accordance with the data provided in the articles (Morgan et al., 2013, 2014).
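The logic of this string-based scoring can be expressed in a few lines of R (a minimal sketch; the authors' actual scripts are not published, and the responses vector and condition labels below are hypothetical):

# Minimal sketch of the string-search scoring described above (illustrative only).
# `responses` is a hypothetical character vector of open-ended diagnostic answers
# from one LLM for one vignette condition.
count_correct <- function(responses, condition) {
  hit <- switch(condition,
    depression          = grepl("depress", responses, ignore.case = TRUE),
    depression_suicidal = grepl("depress", responses, ignore.case = TRUE) &
                          grepl("suicid",  responses, ignore.case = TRUE),
    schizophrenia       = grepl("schizophrenia", responses, ignore.case = TRUE) |
                          grepl("psychosis",     responses, ignore.case = TRUE),
    social_phobia       = grepl("social", responses, ignore.case = TRUE),
    ptsd                = grepl("post",   responses, ignore.case = TRUE) |
                          grepl("trauma", responses, ignore.case = TRUE),
    stop("unknown condition"))
  c(n_correct = sum(hit), pct_correct = round(100 * mean(hit), 2))
}
# Hypothetical usage: one correct and one incorrect answer for the depression condition
count_correct(c("Major depressive disorder", "Possibly burnout"), "depression")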
To provide a rigorous assessment of the effectiveness of the various LLMs, all research hypotheses were examined using the χ2 test, with the significance level set at p < 0.01 to reflect a conservative threshold. In instances where cell counts were too low, Fisher's exact test was used as an alternative. Cramer's V was used to measure effect sizes; values range from 0 to 1, with 0.1, 0.3, and 0.5 representing small, medium, and large effect sizes, respectively. SPSS version 27 was used for the chi-square and Fisher's exact tests, while R software version 4.4.1 was used to generate advanced visualizations and to calculate effect sizes (e.g., Cramer's V). In addition to these automated analyses, manual validations were performed to cross-check the diagnostic outputs of the LLMs against predefined accuracy criteria, ensuring consistency and completeness in the responses analyzed. This methodological approach facilitated pattern quantification and underscored the potential of integrating LLMs into clinical decision-making.
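As an illustration of this analysis pipeline, the following base-R sketch applies the same tests to a single hypothetical comparison; the counts are invented for illustration only (the published tests were run in SPSS version 27, with R 4.4.1 used for effect sizes and visualizations):

# Hypothetical 2 x 2 table: correct vs. incorrect diagnoses for one vignette
tab <- matrix(c(19, 1,    # an LLM: 19 correct, 1 incorrect (illustrative counts)
                95, 5),   # professionals (illustrative counts)
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("LLM", "Professionals"),
                              outcome = c("Correct", "Incorrect")))
chi <- chisq.test(tab)                       # chi-square test of independence
if (any(chi$expected < 5)) {                 # sparse cells: fall back to Fisher's exact test
  print(fisher.test(tab))
}
# Cramer's V = sqrt(chi-square / (N * (min(rows, cols) - 1)))
cramers_v <- sqrt(unname(chi$statistic) / (sum(tab) * (min(dim(tab)) - 1)))
chi$p.value < 0.01                           # apply the conservative significance threshold
cramers_v                                    # 0.1 / 0.3 / 0.5 = small / medium / large effect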

3. Results

As noted, the detailed methodology examined four leading LLMs across a range of mental health conditions. This section describes the findings for diagnostic accuracy, treatment recommendations, and predicted outcomes. Statistical tests were run in SPSS version 27, with visualizations and effect sizes produced using R software version 4.4.1 (R Core Team, 2020) and RStudio version 2023.06.1 (R Core Team, 2023).

3.1. Comparison of Correct Diagnosis Rates for Different LLM Tools and Professionals

For both depression vignettes, the LLM tools (Gemini, ChatGPT-3.5, and ChatGPT-4) achieved a correct diagnosis rate of 100%, whereas the professionals exhibited a correct diagnosis rate of 95%. Fisher’s exact test yielded a p-value of 0.001 for both vignettes. For the vignette on depression and suicidal thoughts, ChatGPT-4 achieved a 100% correct diagnosis rate, outperforming the other models. Statistical analysis yielded a Cramer’s V of 0.81 with a p-value of <0.001, indicating significant differences. For the early schizophrenia vignette, ChatGPT-4 achieved a correct diagnosis rate of 55%, which was notably lower than the other entities, with a Cramer’s V of 0.55 and a p-value of <0.001. For the chronic schizophrenia vignette, professionals and ChatGPT-3.5 achieved similar correct diagnosis rates (95%), whereas ChatGPT-4 exhibited a lower rate (67%). Cramer’s V was 0.37 with a p-value of <0.001. For the social phobia and PTSD vignettes, all the LLM tools achieved a 100% correct diagnosis rate, outperforming professionals. Fisher’s exact test revealed significant differences, with p-values of less than 0.001 for both vignettes (Table 1).

3.2. Comparison of Treatments Across Different LLM Tools and Professionals

For each vignette, the treatment recommendations made by the LLMs were also compared to those of the various professionals. The therapies or activities chosen for the comparisons were based on the agreement of multiple practitioners (Haber et al., 2024; Bradford et al., 2024).

3.2.1. Depression Vignette

Table 2 compares the treatment recommendations for depression made by GPs, psychiatrists, psychologists, and LLM tools (Claude, Gemini, ChatGPT-3.5, and ChatGPT-4). Notable disparities were observed between the groups. The LLM tools consistently recommended high rates of consulting a family doctor or GP (100%, except for ChatGPT-3.5 at 80.95%), surpassing GPs (95%) and psychiatrists (91%) as well as psychologists (76%). Recommendations to see a counselor were also higher among LLM tools (100%, except for Claude at 80%) than among GPs (86%), psychiatrists (49%), and psychologists (53%). Telephone counseling services were more frequently suggested by LLM tools (80–100%) than by professionals (57–79%). Recommendations to see a psychiatrist or psychologist were uniformly high, with LLM tools either matching or exceeding the recommendation rates of professionals. ChatGPT-4 exhibited the highest recommendation rate for antidepressants (100%), whereas the antidepressant recommendations of other LLM tools and professionals varied from 59% to 95.24%. LLM tools generally recommended activities and therapies at higher rates than did human professionals.

3.2.2. Depression with Suicidal Thoughts Vignette

Table 3 compares treatment recommendations for depression with suicidal thoughts across various entities, including LLM tools (Claude, Gemini, ChatGPT-3.5, and ChatGPT-4) and mental health professionals. Significant differences were observed between the groups. Most LLM tools consistently recommended seeing a family doctor or GP (100%), with the exception of ChatGPT-3.5 (45%), while these recommendations among GPs, psychiatrists, and psychologists were 95%, 92%, and 83%, respectively. LLM tools also recommended seeing a counselor (100%), compared to GPs (86%), psychiatrists (47%), and psychologists (58%). Telephone counseling was universally recommended by LLM tools (100%) but less frequently by professionals (GPs—89%; psychiatrists—61%; and psychologists—83%). Psychiatrist consultations were recommended at high rates by all groups. LLM tools and professionals also varied in their recommendations for antidepressants and other interventions, with LLM tools generally advocating a broader range of activities and therapies.

3.2.3. Early Schizophrenia Vignette

Recommendation rates for seeing a family doctor or GP were generally high: 100% for Gemini and 68% and 70%, respectively, for ChatGPT-3.5 and ChatGPT-4. Recommendations to see a family doctor were consistently high among GPs and psychiatrists (100% and 95%, respectively), whereas such recommendations among psychologists were somewhat lower (88%). Recommendations to see a social worker varied significantly, with LLM tools (Claude and ChatGPT-3.5) at 100%, compared to 65% for GPs, 32% for psychiatrists, and 37% for psychologists. All entities highly recommended seeing a psychiatrist (99% for psychologists and 100% for all other entities). The recommendation rate for seeing a psychologist was 100% for Gemini, with rates varying among the other LLM tools and professionals. Professionals and LLM tools also recommended antidepressants at rates ranging from 85% to 97%, except for Claude (30%) and Gemini (50%). Recommendations for cognitive behavior therapy and psychiatric admission exhibited substantial variability, with LLM tools generally advocating broader treatment options (Table 4).

3.2.4. Chronic Schizophrenia Vignette

Significant differences were observed across LLM tools and professionals in their treatment recommendations for chronic schizophrenia. Most LLM tools and professionals highly recommended seeing a family doctor or GP, with Gemini at 94.44% and GPs and psychiatrists at 94% and 93%, respectively, whereas ChatGPT-3.5 and ChatGPT-4 exhibited lower rates (75% and 47.62%, respectively). Recommendations for consulting with social workers varied, with Gemini at 100% and others exhibiting lower rates: 74% (GPs), 58% (psychiatrists), and 72% (psychologists). All entities consistently recommended seeing a psychiatrist (95.24–100%). Recommendations to see a psychologist were highest for Gemini (100%) and lower for others. Antidepressants were strongly recommended by professionals (94–99%), while LLM tools showed more variability. Recommendations for cognitive behavior therapy and psychiatric ward admission also varied, with ChatGPT-4 exhibiting higher rates. Recommendations for physical activity, alcohol consumption, and information from health educators varied more across entities (Table 5).

3.2.5. Social Phobia Vignette

In the social phobia condition, significant differences were observed across all entities. The LLM tools recommended seeing a family doctor or GP at a rate of 100%, except for ChatGPT-4 (76.19%). In contrast, 91% of GPs, 71% of psychiatrists, and 64% of psychologists recommended seeing a family doctor in the case of social phobia. The LLM tools also consistently recommended seeing a counselor (100%), whereas these recommendations were lower among professionals: 87% (GPs), 57% (psychiatrists), and 49% (psychologists). Telephone counseling services were highly recommended by LLM tools (85.71–100%), with lower recommendation rates among professionals (35–76%). Seeing a psychiatrist was universally recommended by LLM tools, whereas the recommendations of professionals varied. Antidepressants were recommended at varying rates, with ChatGPT-4 at 95.24% and lower rates for other entities. LLM tools generally recommended activities such as physical exercise and social engagement at higher rates than professionals. Cognitive behavior therapy and psychotherapy were also recommended more frequently by the LLM tools (Table 6).

3.2.6. Post-Traumatic Stress Disorder Vignette

In the PTSD condition, significant differences were observed among all entities. Most LLM tools highly recommended seeing a family doctor or GP (90–100%), whereas the recommendations of the professionals were lower: 93% (GPs), 78% (psychiatrists), and 69% (psychologists). The LLM tools also consistently recommended seeing a counselor for PTSD (100%), whereas the professional recommendations were lower: 80% (GPs), 38% (psychiatrists), and 49% (psychologists). Telephone counseling was also highly recommended by the LLM tools (90–100%) and less recommended by the professionals: 69% (GPs), 47% (psychiatrists), and 62% (psychologists). Seeing a psychiatrist was universally recommended by the LLM tools, with the professionals also exhibiting high rates. Recommendations for seeing a psychologist for PTSD varied, with the LLM tools ranging from 60% to 100% and the professionals showing slightly higher consistency. Antidepressant recommendations were higher among LLM tools such as ChatGPT-3.5 and ChatGPT-4 (90%), whereas the professionals' rates of antidepressant recommendation were lower. Activities and therapies, including physical activity and relaxation techniques, were highly recommended by the LLM tools, often at higher rates than the professionals (Table 7).
Overall, this study revealed key differences in treatment recommendations between LLM tools and mental health professionals. The LLM tools consistently recommended family doctors, counselors, and telephone counseling more frequently across conditions such as depression and PTSD. The tools also suggested a broader range of treatments, including antidepressants and therapeutic activities, especially in cases of social phobia and schizophrenia. Professionals showed more consistency in recommending psychiatric consultations and specific medications, whereas LLM tools emphasized a wider array of treatments. Overall, LLM tools were more proactive in recommending diverse options than human professionals (Figure 1).

3.3. Comparison Between LLM Tools and Mental Health Professionals Regarding Predicted Outcomes for Those Who Receive Help and Those Who Do Not

The study also compared outcome predictions across various mental health conditions as described in the vignettes, both for those who received professional assistance and for those who did not.
For individuals who received help (Table 8), no significant differences in outcome predictions were found between professionals and LLM tools. Nevertheless, the professionals consistently demonstrated a more optimistic outlook regarding full recovery rates. For instance, in cases of depression with suicidal thoughts, professionals predicted a full recovery rate of 94.4%, whereas the predictions of the LLM tools varied widely, with Claude at 55% and ChatGPT-4 at 13.64%. Partial recovery predictions also differed, with professionals estimating 5.2% and LLM tools such as Gemini and ChatGPT-4 predicting 66.67% and 86.36%, respectively.
For early schizophrenia, professionals predicted a 60.4% full recovery rate compared with a 5% recovery rate predicted by Claude. For chronic schizophrenia, professionals predicted a 20.1% full recovery rate, whereas ChatGPT-4 predicted a 0% recovery rate. Social phobia and PTSD also exhibited significant differences, with LLM tools generally predicting lower full recovery rates and higher partial recovery rates than professionals. These discrepancies highlight the variability in LLM predictions and the generally optimistic outlook of human professionals.
For those who did not receive help (Table 9), the outlook for recovery was significantly worse across all conditions and entities. Professionals predicted a 7.7% full recovery rate for depression, with LLM tools predicting a 0% recovery rate. For early schizophrenia, no entities predicted full recovery, and all exhibited very low partial recovery rates. For chronic schizophrenia and social phobia, no full recovery was predicted, and partial recovery rates were low across the board. PTSD predictions followed similar trends, with professionals predicting some recovery but LLM tools often predicting none. These findings underscore the generally more negative outlook of LLM tools compared with human professionals, especially when no professional help is provided.
For those who received help, professionals consistently predicted higher full recovery rates than did LLM tools, particularly in cases of depression with suicidal thoughts (94.4% vs. lower rates among LLMs). For conditions such as early and chronic schizophrenia, professionals also showed a more optimistic outlook than LLM tools, which predicted minimal rates of recovery (Figure 2). For those who did not receive help, both LLM tools and professionals predicted worse outcomes, with professionals maintaining a slightly more optimistic view. Overall, LLM tools were more cautious in their predictions, especially when no professional assistance was involved.

4. Discussion

This study sought to compare various LLM tools and mental health professionals with respect to diagnosis accuracy, recommended treatments, and predicted outcomes in the case of different mental health conditions. Individuals who received help were compared with those who did not.
In the present study, all LLM tools achieved a 100% correct diagnosis rate for the depression vignettes, while professionals recorded a slightly lower accuracy rate of 95%. ChatGPT-4 demonstrated superior performance for the depression with suicidal thoughts vignette, achieving a 100% correct diagnosis rate and significantly outperforming the other entities. Yet ChatGPT-4 exhibited a notably lower correct diagnosis rate of 55% for early schizophrenia, compared to the 95–100% rates achieved by the other entities. Similarly, ChatGPT-4 underperformed in the chronic schizophrenia vignette, with a 67% correct diagnosis rate compared to the 95% rate achieved by professionals. Despite these shortcomings, all the LLM tools consistently achieved a 100% correct diagnosis rate for the social phobia and PTSD vignettes, surpassing the performance of professionals.
The observed underperformance of ChatGPT-4 in diagnosing early schizophrenia and chronic schizophrenia highlights a potential limitation in the ability of current LLMs to generalize across diverse mental health conditions (Guo et al., 2024). This variability suggests that while LLMs are effective in diagnosing common conditions such as depression (Elyoseph & Levkovich, 2023; Weisenburger et al., 2024), their reliability may diminish when faced with more complex or less common disorders (Levkovich & Omar, 2024). A study analyzing approximately 3000 posts from clinicians regarding the ethical concerns of using LLMs in healthcare raised a number of key issues, including the fairness and reliability of these systems, as well as concerns over data accuracy (Mirzaei et al., 2024). These findings underscore the importance of continuous model improvement and the need for caution when relying solely on LLMs to diagnose complex cases (Lawrence et al., 2024). The mixed performance of ChatGPT-4 across different conditions further emphasizes the necessity for additional training and evaluation of LLMs (Stade et al., 2024). Ensuring that these tools are trained on diverse and representative datasets is crucial for enhancing their generalizability and reliability across a broad spectrum of mental health conditions (Cabrera et al., 2023). Moreover, ongoing evaluation in real-world clinical settings is essential, both to ensure that these tools consistently perform at a high level and to provide accurate diagnostic support (Ohse et al., 2024).
In the current study, significant disparities in treatment recommendations were observed between LLM tools and professionals across all vignettes. For the depression vignette, LLM tools consistently recommended seeing a GP (100% rate), exceeding the recommendations made by professionals. Similar trends were noted for counselor recommendations, with LLM tools showing greater consistency. In contrast, professionals demonstrated more variability in recommending antidepressants, whereas ChatGPT-4 consistently recommended them (100% rate). In the depression with suicidal thoughts vignette, LLM tools frequently recommended a broader range of treatment options, including high recommendation rates for physical activities and psychotherapy. Across all vignettes, LLM tools generally advocated a wider range of treatments, including physical exercise and cognitive behavioral therapy, more frequently than professionals did.
The consistent recommendation patterns observed in LLM tools, such as the nearly universal suggestion to see a GP (100% across various vignettes), indicate that these tools may be programmed to prioritize certain baseline interventions (Golden et al., 2024). In a study involving three hypothetical patient scenarios with significant complaints, ChatGPT was used as a virtual assistant to a psychiatrist (Dergaa et al., 2024). While ChatGPT’s initial recommendations were appropriate, as the complexity of the clinical cases increased, the recommendations became inappropriate and potentially dangerous in some instances (Dergaa et al., 2024). The lack of variability in these recommendations may also indicate a limitation in the ability of these tools to tailor advice to specific patient needs, potentially leading to overgeneralization (Higgins et al., 2023). This limitation highlights a deficiency in the models’ capacity for clinical judgment and nuanced decision-making (Dergaa et al., 2024). To maintain the quality of patient care, it is crucial to use LLMs as complementary tools rather than as replacements for professional expertise (Prabhod, 2023; Xie et al., 2024).
In the present study, LLM tools consistently predicted less optimistic mental health outcomes for individuals receiving help than did professionals. For depression with suicidal thoughts, Claude predicted a 55% full recovery rate and ChatGPT-4 only a 13.64% full recovery rate, whereas professionals estimated the full recovery rate at 94.4%. For early schizophrenia, Claude predicted a 5% full recovery, in contrast to professionals' prediction of 60.4%. LLMs often predicted no recovery for untreated individuals across various conditions, whereas professionals generally expected some level of improvement. Additionally, LLMs tended to predict lower rates of full recovery and higher rates of partial recovery than professional assessments. The complexity of prognosis prediction is also evident in studies incorporating extensive personal and medical information. For example, a review of 30 studies analyzed the use of AI methods to predict clinical outcomes in patients with psychotic disorders, where detailed patient histories, including medical records and personal factors, were available. The results revealed predictive accuracy ranging from 48% to 89% for the AI methods (Tay et al., 2024).
The consistent optimism of human professionals regarding full recovery rates highlights the importance of clinical experience and judgment in mental health treatment (Leamy et al., 2023). Professionals’ positive outlooks may reflect a broader understanding of patient resilience, therapeutic potential, and the nuances of mental health conditions, which LLM tools currently lack (Alhuwaydi, 2024). This suggests that while LLMs can be valuable in supporting diagnosis and treatment planning, they should not replace the nuanced judgment of experienced clinicians (Alowais et al., 2023).
The variability and conservative predictions of LLM tools, especially in cases in which professional intervention is lacking, raise concerns about their reliability and ethical use in mental health (Hadar-Shoval et al., 2024; Mirzaei et al., 2024). These tools often display a pessimistic bias, which may underestimate recovery potential and negatively affect treatment planning if used without professional oversight (Weisenburger et al., 2024). This finding underscores the importance of carefully integrating LLMs into clinical practice to support, rather than hinder, patient outcomes (Elyoseph & Levkovich, 2024; Golden et al., 2024). Continued refinement of LLMs that incorporate diverse real-world clinical data is necessary to improve their accuracy (Aich et al., 2024). Furthermore, combining LLM tools with ongoing professional input can address the current limitations, thus reinforcing the irreplaceable role of human clinical expertise in mental healthcare (Lawrence et al., 2024).

Limitations

This study utilized valid vignettes employed in previous research across a range of professionals and mental disorders. Nevertheless, these vignettes are not real cases, and the use of text-based scenarios may not fully capture the complexities of real-life patient interactions or the broader spectrum of mental health conditions, thereby limiting the generalizability of the findings. Additionally, the study acknowledges the inherent biases within LLMs, which are trained on extensive datasets that may contain biases influencing their diagnostic and treatment recommendations. The ‘black box’ nature of LLMs further complicates this issue, making it challenging to discern the rationale behind specific recommendations or diagnoses—a crucial factor in clinical contexts for establishing trust and ethical practice. The findings of this study are based on comparisons with health professionals, yet they lack direct clinical validation in actual patient care, underscoring the need for future research to focus on clinical trials and real-world applications to assess the efficacy and safety of LLMs in mental health diagnostics and treatment planning. Moreover, given the rapid development of AI technologies, the capabilities of LLMs are continuously evolving, such that newer versions of the models assessed in this study may perform differently, highlighting the necessity for ongoing evaluation. Addressing these limitations provides a more comprehensive understanding of the challenges and considerations in applying LLM technology to mental health, paving the way for more informed and ethical research and its implementation in the future.

5. Conclusions

This study compared the diagnostic accuracy and treatment recommendations of Gemini, Claude, ChatGPT-3.5, and ChatGPT-4 with those of mental health professionals for various mental health conditions. Text vignettes were used to evaluate the performance of LLMs and compare it to norms established by a sample of health professionals. The LLMs demonstrated high diagnostic accuracy, with 100% correct diagnosis rates for depression, social phobia, and PTSD, often surpassing professionals. However, ChatGPT-4 was less accurate in the case of early and chronic schizophrenia than other entities. The LLMs consistently recommended consulting healthcare professionals at higher rates than the professionals themselves.
The LLMs exhibited more conservative estimates, generally predicting lower rates of full recovery and higher rates of partial recovery. Conversely, human experts consistently demonstrated a more optimistic outlook regarding full recovery across various conditions, including depression, suicidal ideation, schizophrenia, and PTSD. While both groups forecasted poorer outcomes in the absence of intervention, the LLMs displayed a notably more pessimistic perspective. These findings underscore the contrast between the generally hopeful prognoses of human professionals and the more cautious predictions of LLMs in the context of mental health recovery. The results highlight the potential for integrating LLMs into clinical decision-making processes; however, further research is necessary to validate these findings and overcome the study's limitations.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/ejihpe15010009/s1.

Funding

This research received no external funding.

Institutional Review Board Statement

The professional survey part of this study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of the university (approval number 18523). The public survey data were obtained from previously published research by Morgan et al. (2014) in the Australian & New Zealand Journal of Psychiatry, which was approved by the University of Melbourne Ethics Committee.

Informed Consent Statement

Not applicable for studies not involving humans.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author, I.L., upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Aich, A., Quynh, A., Osseyi, P., Pinkham, A., Harvey, P., Curtis, B., Depp, C., & Parde, N. (2024). Using LLMs to aid annotation and collection of clinically-enriched data in bipolar disorder and schizophrenia. arXiv, arXiv:2406.12687. [Google Scholar] [CrossRef]
  2. Alhuwaydi, A. M. (2024). Exploring the role of artificial intelligence in mental healthcare: Current trends and future Directions—A narrative review for a comprehensive insight. Risk Management and Healthcare Policy, 17, 1339–1348. [Google Scholar] [CrossRef]
  3. Alowais, S. A., Alghamdi, S. S., Alsuhebany, N., Alqahtani, T., Alshaya, A. I., Almohareb, S. N., Aldairem, A., Alrashed, M., Bin Saleh, K., & Badreldin, H. A. (2023). Revolutionizing healthcare: The role of artificial intelligence in clinical practice. BMC Medical Education, 23(1), 689. [Google Scholar] [CrossRef]
  4. Bradford, A., Meyer, A. N., Khan, S., Giardina, T. D., & Singh, H. (2024). Diagnostic error in mental health: A review. BMJ Quality & Safety, 33, 663–672. [Google Scholar] [CrossRef]
  5. Cabrera, J., Loyola, M. S., Magaña, I., & Rojas, R. (2023). Ethical dilemmas, mental health, artificial intelligence, and LLM-based chatbots. Neuropsychopharmacology, 49, 1383–1391. [Google Scholar] [CrossRef]
  6. Demszky, D., Yang, D., Yeager, D. S., Bryan, C. J., Clapper, M., Chandhok, S., Eichstaedt, J. C., Hecht, C., Jamieson, J., & Johnson, M. (2023). Using large language models in psychology. Nature Reviews Psychology, 2(11), 688–701. [Google Scholar] [CrossRef]
  7. Dergaa, I., Fekih-Romdhane, F., Hallit, S., Loch, A. A., Glenn, J. M., Fessi, M. S., Ben Aissa, M., Souissi, N., Guelmami, N., & Swed, S. (2024). ChatGPT is not ready yet for use in providing mental health assessment and interventions. Frontiers in Psychiatry, 14, 1277756. [Google Scholar] [CrossRef] [PubMed]
  8. Elyoseph, Z., & Levkovich, I. (2023). Beyond human expertise: The promise and limitations of ChatGPT in suicide risk assessment. Frontiers in Psychiatry, 14, 1213141. [Google Scholar] [CrossRef] [PubMed]
  9. Elyoseph, Z., & Levkovich, I. (2024). Comparing the perspectives of generative AI, mental health experts, and the general public on schizophrenia recovery: Case vignette study. JMIR Mental Health, 11, e53043. [Google Scholar] [CrossRef]
  10. Elyoseph, Z., Levkovich, I., & Shinan-Altman, S. (2024). Assessing prognosis in depression: Comparing perspectives of AI models, mental health professionals and the general public. Family Medicine and Community Health, 12(Suppl. S1), e002583. [Google Scholar] [CrossRef] [PubMed]
  11. Golden, G., Popescu, C., Israel, S., Perlman, K., Armstrong, C., Fratila, R., Tanguay-Sela, M., & Benrimoh, D. (2024). Applying artificial intelligence to clinical decision support in mental health: What have we learned? Health Policy and Technology, 13(2), 100844. [Google Scholar] [CrossRef]
  12. Guo, Z., Lai, A., Thygesen, J. H., Farrington, J., Keen, T., & Li, K. (2024). Large language models for mental health applications: Systematic review. JMIR Mental Health, 11(1), e57400. [Google Scholar] [CrossRef]
  13. Haber, Y., Levkovich, I., Hadar-Shoval, D., & Elyoseph, Z. (2024). The artificial third: A broad view of the effects of introducing generative artificial intelligence on psychotherapy. JMIR Mental Health, 11, e54781. [Google Scholar] [CrossRef]
  14. Hadar-Shoval, D., Asraf, K., Shinan-Altman, S., Elyoseph, Z., & Levkovich, I. (2024). Embedded values-like shape ethical reasoning of large language models on primary care ethical dilemmas. Heliyon, 10, e38056. [Google Scholar] [CrossRef]
  15. Higgins, O., Short, B. L., Chalup, S. K., & Wilson, R. L. (2023). Artificial intelligence (AI) and machine learning (ML) based decision support systems in mental health: An integrative review. International Journal of Mental Health Nursing, 32(4), 966–978. [Google Scholar] [CrossRef] [PubMed]
  16. Lawrence, H. R., Schneider, R. A., Rubin, S. B., Matarić, M. J., McDuff, D. J., & Bell, M. J. (2024). The opportunities and risks of large language models in mental health. JMIR Mental Health, 11(1), e59479. [Google Scholar] [CrossRef]
  17. Leamy, M., Foye, U., Hirrich, A., Bjørgen, D., Silver, J., Simpson, A., Ellis, M., & Johan-Johanson, K. (2023). A systematic review of measures of the personal recovery orientation of mental health services and staff. International Journal of Mental Health Systems, 17(1), 33. [Google Scholar] [CrossRef] [PubMed]
  18. Levkovich, I., & Elyoseph, Z. (2023). Identifying depression and its determinants upon initiating treatment: ChatGPT versus primary care physicians. Family Medicine and Community Health, 11(4), e002391. [Google Scholar] [CrossRef] [PubMed]
  19. Levkovich, I., & Omar, M. (2024). Evaluation of BERT-based and large language models for suicide detection, prevention, and risk assessment: A systematic review. Journal of Medical Systems, 48(1), 113. [Google Scholar] [CrossRef] [PubMed]
  20. Levkovich, I., Rabin, E., Brann, M., & Elyoseph, Z. (2024a). Large language models outperform general practitioners in identifying complex cases of childhood anxiety. Digital Health, 10, 20552076241294182. [Google Scholar] [CrossRef] [PubMed]
  21. Levkovich, I., Shinan-Altman, S., & Elyoseph, Z. (2024b). Can Large Language Models be sensitive to Culture Suicide Risk Assessment? Journal of Cultural Cognitive Science, 8, 275–287. [Google Scholar] [CrossRef]
  22. Luo, X., Chen, F., Zhu, D., Wang, L., Wang, Z., Liu, H., Lyu, M., Wang, Y., Wang, Q., & Chen, Y. (2024). Potential roles of large language models in the production of systematic reviews and meta-analyses. Journal of Medical Internet Research, 26, e56780. [Google Scholar] [CrossRef] [PubMed]
  23. Mirzaei, T., Amini, L., & Esmaeilzadeh, P. (2024). Clinician voices on ethics of LLM integration in healthcare: A thematic analysis of ethical concerns and implications. BMC Medical Informatics and Decision Making, 24(1), 250. [Google Scholar] [CrossRef] [PubMed]
  24. Morgan, A. J., Jorm, A. F., & Reavley, N. J. (2013). Beliefs of Australian health professionals about the helpfulness of interventions for mental disorders: Differences between professions and change over time. Australian & New Zealand Journal of Psychiatry, 47(9), 840–848. [Google Scholar] [CrossRef]
  25. Morgan, A. J., Reavley, N. J., & Jorm, A. F. (2014). Beliefs about mental disorder treatment and prognosis: Comparison of health professionals with the Australian public. Australian & New Zealand Journal of Psychiatry, 48(5), 442–451. [Google Scholar] [CrossRef]
  26. Nazi, Z. A., & Peng, W. (2024). Large language models in healthcare and medical domain: A review. Informatics, 11(3), 57. [Google Scholar] [CrossRef]
  27. Obradovich, N., Khalsa, S. S., Khan, W. U., Suh, J., Perlis, R. H., Ajilore, O., & Paulus, M. P. (2024). Opportunities and risks of large language models in psychiatry. NPP—Digital Psychiatry and Neuroscience, 2(1), 8. [Google Scholar] [CrossRef] [PubMed]
  28. Ohse, J., Hadžić, B., Mohammed, P., Peperkorn, N., Danner, M., Yorita, A., Kubota, N., Rätsch, M., & Shiban, Y. (2024). Zero-shot strike: Testing the generalisation capabilities of out-of-the-box LLM models for depression detection. Computer Speech & Language, 88, 101663. [Google Scholar] [CrossRef]
  29. Omar, M., & Levkovich, I. (2024). Exploring the efficacy and potential of large language models for depression: A systematic review. Journal of Affective Disorders, 371, 234–244. [Google Scholar] [CrossRef]
  30. Omar, M., Soffer, S., Charney, A. W., Landi, I., Nadkarni, G. N., & Klang, E. (2024). Applications of large language models in psychiatry: A systematic review. Frontiers in Psychiatry, 15, 1422807. [Google Scholar] [CrossRef] [PubMed]
  31. Pavlopoulos, A., Rachiotis, T., & Maglogiannis, I. (2024). An overview of tools and technologies for anxiety and depression management using AI. Applied Sciences, 14(19), 9068. [Google Scholar] [CrossRef]
  32. Prabhod, K. J. (2023). Integrating large language models for enhanced clinical decision support systems in modern healthcare. Journal of Machine Learning for Healthcare Decision Support, 3(1), 18–62. Available online: https://medlines.uk/index.php/JMLHDS/article/view/23 (accessed on 20 November 2024).
  33. R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing. [Google Scholar]
  34. R Core Team. (2023). RStudio: Integrated development environment for R. RStudio, PBC. Available online: http://www.rstudio.com/ (accessed on 20 November 2024).
  35. Reavley, N. J., & Jorm, A. F. (2012). Stigmatising attitudes towards people with mental disorders: Changes in Australia over 8 years. Psychiatry Research, 197(3), 302–306. [Google Scholar] [CrossRef]
  36. Stade, E. C., Stirman, S. W., Ungar, L. H., Boland, C. L., Schwartz, H. A., Yaden, D. B., Sedoc, J., DeRubeis, R. J., Willer, R., & Eichstaedt, J. C. (2024). Large language models could change the future of behavioral healthcare: A proposal for responsible development and evaluation. NPJ Mental Health Research, 3(1), 12. [Google Scholar] [CrossRef] [PubMed]
  37. Tay, J. L., Htun, K. K., & Sim, K. (2024). Prediction of clinical outcomes in psychotic disorders using artificial intelligence methods: A scoping review. Brain Sciences, 14(9), 878. [Google Scholar] [CrossRef] [PubMed]
  38. Tortora, L. (2024). Beyond discrimination: Generative AI applications and ethical challenges in forensic psychiatry. Frontiers in Psychiatry, 15, 1346059. [Google Scholar] [CrossRef]
  39. Weisenburger, R. L., Mullarkey, M. C., Labrada, J., Labrousse, D., Yang, M. Y., MacPherson, A. H., Hsu, K. J., Ugail, H., Shumake, J., & Beevers, C. G. (2024). Conversational assessment using artificial intelligence is as clinically useful as depression scales and preferred by users. Journal of Affective Disorders, 351, 489–498. [Google Scholar] [CrossRef]
  40. Xie, W., Xiao, Q., Zheng, Y., Wang, X., Chen, J., Ji, K., Gao, A., Wan, X., Jiang, F., & Wang, B. (2024). LLMs for doctors: Leveraging medical LLMs to assist doctors, not replace them. arXiv, arXiv:2406.18034. [Google Scholar] [CrossRef]
  41. Yang, R., Tan, T. F., Lu, W., Thirunavukarasu, A. J., Ting, D. S. W., & Liu, N. (2023). Large language models in health care: Development, applications, and challenges. Health Care Science, 2(4), 255–263. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Comparison of average treatment recommendations by LLM tools and mental health professionals across mental health conditions.
Figure 2. Comparison of full recovery predictions: LLM tools vs. professionals (help provided).
Table 1. Comparison of correct diagnosis rates across different LLM tools and professionals.
Vignette | Professionals | Claude | Gemini | ChatGPT-3.5 | ChatGPT-4 | Cramer's V | p
Depression | 95% | 95% | 100% | 100% | 100% | - | 0.001
Depression and suicidal thoughts | 12% | 10% | 11% | 5% | 100% | 0.81 | <0.001
Early schizophrenia | 95% | 95% | 100% | 100% | 55% | 0.55 | <0.001
Chronic schizophrenia | 95% | 85% | 100% | 95% | 67% | 0.37 | <0.001
Social phobia | 86% | 100% | 100% | 100% | 100% | - | <0.001
PTSD | 90% | 95% | 100% | 100% | 100% | - | <0.001
Table 2. Comparison of treatments for depression across different entities.
Depression | GP | Psychiatrists | Psychologists | Claude | Gemini | GPT 3.5 | GPT 4 | Cramer's V | p
People
A typical family GP or doctor | 95 | 91 | 76 | 100 | 100 | 80.95 | 100 | 0.33 | <0.001
A counsellor | 86 | 49 | 53 | 80 | 100 | 100 | 100 | 0.52 | <0.001
Telephone counselling service (e.g., Lifeline) | 79 | 57 | 63 | 80 | 100 | 100 | 85.71 | 0.39 | <0.001
A psychiatrist | 92 | 92 | 75 | 100 | 100 | 100 | 100 | 0.36 | <0.001
A psychologist | 93 | 86 | 93 | 75 | 100 | 95.24 | 100 | 0.3 | <0.001
Medications
Antidepressants | 80 | 80 | 59 | 60 | 68.42 | 95.24 | 100 | 0.36 | <0.001
Activities/therapies
Becoming more physically active | 90 | 95 | 93 | 50 | 100 | 95.24 | 100 | 0.52 | <0.001
Reading about people with similar problems and how they have dealt with them | 83 | 82 | 66 | 60 | 100 | 95.24 | 95.24 | 0.38 | <0.001
Getting out and about more | 83 | 70 | 80 | 85 | 100 | 95.24 | 100 | 0.32 | <0.001
Courses on relaxation, stress management, meditation or yoga | 79 | 69 | 72 | 80 | 100 | 95.24 | 100 | 0.34 | <0.001
Cutting out alcohol altogether | 66 | 82 | 62 | 90 | 100 | 95.24 | 100 | 0.41 | <0.001
Psychotherapy | 82 | 73 | 73 | 75 | 100 | 95.24 | 100 | 0.33 | <0.001
Cognitive behaviour therapy | 89 | 82 | 90 | 60 | 63.16 | 90.48 | 90.48 | 0.31 | <0.001
Consulting a website that gives information about his/her problem | 74 | 64 | 60 | 50 | 94.74 | 80.95 | 80.95 | 0.31 | <0.001
Consulting an expert using email or the web about his/her problem | 63 | 60 | 64 | 90 | 100 | 80.95 | 85.71 | 0.35 | <0.001
Consulting a book that gives information about his/her problem | 76 | 66 | 67 | 75 | 94.74 | 61.9 | 85.71 | 0.25 | <0.001
Receiving information about the problem from a health educator | 88 | 69 | 74 | 75 | 78.95 | 66.67 | 90.48 | 0.2 | <0.001
Note: data given in %, GPs—general practitioners.
Table 3. Comparison of treatments for depression with suicidal thoughts across different entities.
Depression with Suicidal Thoughts | GP | Psychiatrists | Psychologists | Claude | Gemini | GPT 3.5 | GPT 4 | Cramer's V | p
People
A typical family GP or doctor | 95 | 92 | 83 | 100 | 100 | 45 | 100 | 0.56 | <0.001
A counsellor | 86 | 47 | 58 | 100 | 94.44 | 100 | 100 | 0.55 | <0.001
Telephone counselling service (e.g., Lifeline) | 89 | 61 | 83 | 100 | 100 | 100 | 100 | 0.46 | <0.001
A psychiatrist | 92 | 95 | 83 | 100 | 100 | 100 | 100 | - | <0.001
A psychologist | 93 | 79 | 98 | 100 | 94.44 | 95 | 100 | 0.29 | <0.001
Medications
Antidepressants | 95 | 90 | 72 | 95 | 100 | 95 | 95.45 | 0.31 | <0.001
Activities/therapies
Becoming more physically active | 92 | 83 | 96 | 60 | 100 | 100 | 95.45 | 0.43 | <0.001
Reading about people with similar problems and how they have dealt with them | 80 | 81 | 75 | 100 | 94.44 | 95 | 72.73 | 0.28 | <0.001
Getting out and about more | 77 | 64 | 87 | 100 | 100 | 95 | 95.45 | 0.39 | <0.001
Courses on relaxation, stress management, meditation or yoga | 85 | 61 | 87 | 100 | 94.44 | 95 | 100 | 0.4 | <0.001
Cutting out alcohol altogether | 85 | 74 | 71 | 100 | 88.89 | 85 | 100 | 0.31 | <0.001
Psychotherapy | 83 | 81 | 82 | 100 | 94.44 | 95 | 100 | 0.27 | <0.001
Cognitive behaviour therapy | 91 | 79 | 89 | 90 | 94.44 | 95 | 100 | 0.21 | <0.001
Consulting a website that gives information about his/her problem | 66 | 61 | 66 | 100 | 94.44 | 100 | 72.73 | 0.4 | <0.001
Consulting an expert using email or the web about his/her problem | 51 | 55 | 72 | 100 | 100 | 95 | 72.73 | 0.46 | <0.001
Consulting a book that gives information about his/her problem | 63 | 55 | 71 | 90 | 94.44 | 75 | 77.27 | 0.3 | <0.001
Receiving information about the problem from a health educator | 86 | 64 | 76 | 60 | 72.22 | 80 | 77.27 | 0.19 | <0.001
Note: data given in %, GPs—general practitioners.
Table 4. Comparison of treatments for early schizophrenia across different entities.
Early Schizophrenia | GP | Psychiatrists | Psychologists | Claude | Gemini | GPT 3.5 | GPT 4 | Cramer's V | p
People
A typical family GP or doctor | 100 | 95 | 88 | 80 | 100 | 68 | 70 | 0.36 | <0.001
A social worker | 65 | 32 | 37 | 100 | 94.44 | 100 | 70 | 0.59 | <0.001
A psychiatrist | 100 | 100 | 99 | 100 | 100 | 100 | 100 | - | 1
A psychologist | 86 | 58 | 92 | 65 | 100 | 75 | 80 | 0.34 | <0.001
Medications
Antidepressants | 91 | 97 | 88 | 30 | 50 | 85 | 85 | 0.54 | <0.001
Activities/therapies
Becoming more physically active | 65 | 52 | 67 | 80 | 94.44 | 100 | 80 | 0.37 | <0.001
Cutting out alcohol altogether | 84 | 71 | 76 | 80 | 11.11 | 50 | 65 | 0.48 | <0.001
Cognitive behaviour therapy | 45 | 32 | 80 | 55 | 94.44 | 55 | 100 | 0.5 | <0.001
Admission to a psychiatric ward of a hospital | 76 | 78 | 59 | 5 | 16.67 | 30 | 95 | 0.64 | <0.001
Receiving information about the problem from a health educator | 64 | 58 | 75 | 80 | 72.22 | 55 | 55 | 0.2 | <0.001
Note: data given in %, GPs—general practitioners.
Table 5. Comparison of treatments for chronic schizophrenia across different entities.
Chronic Schizophrenia | GP | Psychiatrists | Psychologists | Claude | Gemini | GPT 3.5 | GPT 4 | Cramer's V | p
People
A typical family GP or doctor | 94 | 93 | 83 | 70 | 94.44 | 75 | 47.62 | 0.39 | <0.001
A social worker | 74 | 58 | 72 | 35 | 100 | 90 | 52.38 | 0.45 | <0.001
A psychiatrist | 98 | 98 | 96 | 100 | 100 | 90 | 95.24 | - | <0.001
A psychologist | 78 | 57 | 86 | 55 | 100 | 55 | 71.43 | 0.36 | <0.001
Medications
Antidepressants | 96 | 99 | 94 | 90 | 22.22 | 45 | 57.14 | 0.63 | <0.001
Activities/therapies
Becoming more physically active | 53 | 70 | 66 | 90 | 61.11 | 75 | 38.1 | 0.32 | <0.001
Cutting out alcohol altogether | 49 | 52 | 38 | 85 | 44.44 | 60 | 38.1 | 0.3 | <0.001
Cognitive behaviour therapy | 35 | 41 | 71 | 50 | 83.33 | 55 | 90.48 | 0.4 | <0.001
Admission to a psychiatric ward of a hospital | 75 | 85 | 62 | 65 | 5.56 | 25 | 95.24 | 0.61 | <0.001
Receiving information about the problem from a health educator | 59 | 63 | 65 | 80 | 50 | 20 | 28.57 | 0.39 | <0.001
Note: data given in %, GPs—general practitioners.
Table 6. Comparison of treatments for social phobia across different entities.
Social Phobia | GP | Psychiatrists | Psychologists | Claude | Gemini | GPT 3.5 | GPT 4 | Cramer's V | p
People
A typical family GP or doctor | 91 | 71 | 64 | 100 | 100 | 85 | 76.19 | 0.36 | <0.001
A counsellor | 87 | 57 | 49 | 100 | 83.33 | 100 | 100 | 0.52 | <0.001
Telephone counselling service (e.g., Lifeline) | 76 | 44 | 35 | 100 | 94.44 | 90 | 85.71 | 0.54 | <0.001
A psychiatrist | 85 | 96 | 70 | 100 | 100 | 90 | 100 | 0.37 | <0.001
A psychologist | 94 | 90 | 96 | 100 | 77.78 | 60 | 95.24 | 0.39 | <0.001
Medications
Antidepressants | 57 | 72 | 41 | 90 | 72.22 | 65 | 95.24 | 0.38 | <0.001
Activities/therapies
Becoming more physically active | 84 | 79 | 83 | 95 | 100 | 95 | 95.24 | 0.25 | <0.001
Reading about people with similar problems and how they have dealt with them | 83 | 85 | 88 | 100 | 100 | 100 | 100 | 0.3 | <0.001
Getting out and about more | 63 | 72 | 79 | 100 | 100 | 95 | 100 | 0.43 | <0.001
Courses on relaxation, stress management, meditation or yoga | 89 | 90 | 90 | 100 | 100 | 80 | 100 | 0.27 | <0.001
Cutting out alcohol altogether | 50 | 44 | 42 | 100 | 38.89 | 60 | 95.24 | 0.49 | <0.001
Psychotherapy | 86 | 77 | 74 | 90 | 100 | 65 | 95.24 | 0.31 | <0.001
Cognitive behaviour therapy | 93 | 96 | 98 | 50 | 88.89 | 65 | 100 | 0.49 | <0.001
Consulting a website that gives information about his/her problem | 64 | 75 | 78 | 100 | 100 | 45 | 100 | 0.49 | <0.001
Consulting an expert using email or the web about his/her problem | 60 | 71 | 74 | 100 | 100 | 70 | 100 | 0.42 | <0.001
Consulting a book that gives information about his/her problem | 66 | 74 | 78 | 100 | 100 | 55 | 100 | 0.44 | <0.001
Receiving information about the problem from a health educator | 85 | 79 | 80 | 100 | 94.44 | 30 | 100 | 0.57 | <0.001
Note: data given in %, GPs—general practitioners.
Table 7. Comparison of treatments for post-traumatic stress disorder across different entities.
Post-Traumatic Stress Disorder (PTSD) | GP | Psychiatrists | Psychologists | Claude | Gemini | GPT 3.5 | GPT 4 | Cramer's V | p
People
A typical family GP or doctor | 93 | 78 | 69 | 90 | 100 | 90 | 95 | 0.3 | <0.001
A counsellor | 80 | 38 | 49 | 100 | 88.89 | 100 | 100 | 0.59 | <0.001
Telephone counselling service (e.g., Lifeline) | 69 | 47 | 62 | 100 | 100 | 95 | 90 | 0.49 | <0.001
A psychiatrist | 92 | 96 | 74 | 100 | 100 | 100 | 100 | 0.39 | <0.001
A psychologist | 96 | 81 | 99 | 60 | 72.22 | 65 | 100 | 0.4 | <0.001
Medications
Antidepressants | 67 | 81 | 38 | 50 | 50 | 90 | 90 | 0.41 | <0.001
Activities/therapies
Becoming more physically active | 87 | 82 | 79 | 50 | 100 | 100 | 95 | 0.45 | <0.001
Reading about people with similar problems and how they have dealt with them | 91 | 73 | 81 | 90 | 100 | 100 | 95 | 0.31 | <0.001
Getting out and about more | 67 | 58 | 65 | 100 | 100 | 85 | 100 | 0.45 | <0.001
Courses on relaxation, stress management, meditation or yoga | 82 | 70 | 81 | 100 | 94.44 | 75 | 100 | 0.32 | <0.001
Cutting out alcohol altogether | 63 | 66 | 55 | 95 | 61.11 | 60 | 100 | 0.37 | <0.001
Psychotherapy | 86 | 76 | 76 | 70 | 100 | 60 | 100 | 0.36 | <0.001
Cognitive behaviour therapy | 97 | 81 | 91 | 95 | 94.44 | 80 | 95 | 0.22 | <0.001
Consulting a website that gives information about his/her problem | 55 | 60 | 79 | 75 | 94.44 | 65 | 95 | 0.34 | <0.001
Consulting an expert using email or the web about his/her problem | 55 | 51 | 72 | 75 | 100 | 70 | 95 | 0.39 | <0.001
Consulting a book that gives information about his/her problem | 57 | 63 | 75 | 75 | 100 | 50 | 95 | 0.39 | <0.001
Receiving information about the problem from a health educator | 79 | 68 | 78 | 70 | 77.78 | 65 | 95 | 0.22 | <0.001
Note: data given in %, GPs—general practitioners.
Table 8. Predicted outcome for those who receive help.
Outcome | Vignette | Professionals | Claude | Gemini | GPT 3.5 | GPT 4 | p
Full recovery | Depression | 97 | 25 | 84.21 | 0 | 0 | 1
Partial recovery | Depression | 3 | 75 | 15.79 | 100 | 100
No improvement | Depression | 0 | 0 | 0 | 0 | 0
Get worse | Depression | 0 | 0 | 0 | 0 | 0
Full recovery | Depression with suicidal thoughts | 94.4 | 55 | 33.33 | 0 | 13.64 | 0.14
Partial recovery | Depression with suicidal thoughts | 5.2 | 45 | 66.67 | 100 | 86.36
No improvement | Depression with suicidal thoughts | 0.4 | 0 | 0 | 0 | 0
Get worse | Depression with suicidal thoughts | 0 | 0 | 0 | 0 | 0
Full recovery | Early schizophrenia | 60.4 | 5 | 0 | 0 | 0 | 0.14
Partial recovery | Early schizophrenia | 38.7 | 95 | 94.44 | 100 | 100
No improvement | Early schizophrenia | 0.9 | 0 | 0 | 0 | 0
Get worse | Early schizophrenia | 0 | 0 | 5.56 | 0 | 0
Full recovery | Chronic schizophrenia | 20.1 | 20 | 5.56 | 0 | 0 | 1
Partial recovery | Chronic schizophrenia | 78 | 80 | 44.44 | 95 | 100
No improvement | Chronic schizophrenia | 2 | 0 | 38.89 | 5 | 0
Get worse | Chronic schizophrenia | 0 | 0 | 11.11 | 0 | 0
Full recovery | Social phobia | 65 | 20 | 11.11 | 0 | 0 | 0.14
Partial recovery | Social phobia | 34.6 | 80 | 88.89 | 100 | 100
No improvement | Social phobia | 0.4 | 0 | 0 | 0 | 0
Get worse | Social phobia | 0 | 0 | 0 | 0 | 0
Full recovery | PTSD | 89.6 | 35 | 22.22 | 0 | 15 | 1
Partial recovery | PTSD | 10 | 65 | 77.78 | 100 | 85
No improvement | PTSD | 0.4 | 0 | 0 | 0 | 0
Get worse | PTSD | 0.2 | 0 | 0 | 0 | 0
Note: data given in %, GPs—general practitioners.
Table 9. Predicted outcome for those who don't receive help.
Outcome | Vignette | Professionals | Claude | Gemini | GPT 3.5 | GPT 4 | p
Full recovery | Depression | 7.7 | 0 | 0 | 0 | 0 | 1
Partial recovery | Depression | 35.8 | 25 | 31.58 | 33.33 | 4.76
No improvement | Depression | 9.8 | 55 | 10.53 | 57.14 | 38.1
Get worse | Depression | 46.7 | 20 | 57.89 | 9.52 | 57.14
Full recovery | Depression with suicidal thoughts | 4.7 | 0 | 0 | 0 | 0 | 1
Partial recovery | Depression with suicidal thoughts | 20 | 10 | 0 | 0 | 0
No improvement | Depression with suicidal thoughts | 10.6 | 35 | 0 | 75 | 0
Get worse | Depression with suicidal thoughts | 64.7 | 55 | 100 | 25 | 100
Full recovery | Early schizophrenia | 0 | 0 | 0 | 0 | 0 | 1
Partial recovery | Early schizophrenia | 4.3 | 0 | 0 | 15 | 0
No improvement | Early schizophrenia | 6.8 | 45 | 0 | 55 | 0
Get worse | Early schizophrenia | 88.9 | 55 | 100 | 30 | 100
Full recovery | Chronic schizophrenia | 0 | 0 | 0 | 0 | 0 | 1
Partial recovery | Chronic schizophrenia | 2 | 0 | 0 | 0 | 9.52
No improvement | Chronic schizophrenia | 29 | 60 | 0 | 45 | 4.76
Get worse | Chronic schizophrenia | 69 | 40 | 100 | 55 | 85.71
Full recovery | Social phobia | 0 | 0 | 0 | 0 | 0 | 1
Partial recovery | Social phobia | 10.9 | 40 | 11.11 | 5 | 4.76
No improvement | Social phobia | 32.2 | 55 | 33.33 | 95 | 42.86
Get worse | Social phobia | 57 | 5 | 55.56 | 0 | 52.38
Full recovery | PTSD | 6.9 | 0 | 0 | 0 | 0 | 1
Partial recovery | PTSD | 44.3 | 35 | 16.67 | 20 | 10
No improvement | PTSD | 13.4 | 45 | 5.56 | 65 | 15
Get worse | PTSD | 35.4 | 20 | 77.78 | 15 | 75
Note: data given in %, GPs—general practitioners.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

