1. Introduction
The integration of artificial intelligence (AI) and large language models (LLMs) in healthcare is revolutionizing various aspects of patient care and medical research. AI has diverse applications in healthcare, encompassing automated disease diagnosis, accelerated drug discovery, computer-assisted surgical interventions, and personalized patient care management. These AI-driven approaches demonstrate potential to improve clinical outcomes, mitigate healthcare expenditures, and facilitate evidence-based treatment strategies [1].
Recent advancements in AI, particularly in the domain of LLMs, have catalyzed their widespread adoption across diverse fields, including the medical sector. This progress has prompted numerous studies investigating applications to enhance healthcare outcomes, including but not limited to the following:
Early Detection and Diagnosis: The early detection and diagnosis of diseases through pattern recognition in medical imaging and clinical data.
Personalized Treatment Plans: The development of personalized treatment plans based on individual patient profiles and genetic markers.
Clinical Decision Support Systems: The implementation of sophisticated clinical decision support systems to aid healthcare professionals in evidence-based practice.
Medical Education: The augmentation of medical education through interactive, AI-driven learning platforms.
Recent years have witnessed significant developments in LLMs, drawing major investments from leading technology companies worldwide. These sophisticated AI systems, with prominent examples including the GPT family and similar models, have revolutionized natural language processing, achieving unprecedented success in diverse linguistic tasks. Through extensive training on comprehensive datasets, these models have evolved beyond simple text prediction, demonstrating language understanding and generation abilities that approach human-like linguistic competence. The integration of LLMs in healthcare presents transformative opportunities across multiple domains of medical practice. Studies have demonstrated their capacity to enhance healthcare delivery through streamlined administrative processes and improved clinical decision support [2]. Current applications of these advanced AI systems range from automating routine documentation to providing evidence-based clinical recommendations. Research indicates that these technologies can assist healthcare providers by analyzing patient data to suggest personalized care approaches, facilitate medical literature review, and support clinical workflow optimization [3,4]. Additionally, these systems show promise in accelerating medical knowledge synthesis, potentially expediting the development and updating of clinical protocols through rapid analysis of the medical literature [5].
This study focuses on evaluating the diagnostic accuracy and clinical utility of AI-based tools, specifically M4CXR and ChatGPT, and explores their potential applications in clinical environments. M4CXR represents a significant advancement in medical imaging analysis as a multi-modal, multi-task, multi-image input, multi-turn chatting chest X-ray interpretation tool. Built upon sophisticated LLMs, this cloud-based system produces comprehensive radiology reports encompassing diagnostic findings and conclusions. Trained on extensive datasets of chest X-ray images, M4CXR offers rapid and precise diagnostic insights, positioning it as an asset in clinical decision-making processes.
ChatGPT, powered by the GPT-4o architecture, demonstrates considerable potential in the medical field, particularly in chest X-ray interpretation. Although still pending formal clinical validation and regulatory approval, these language models can process radiological images and produce structured diagnostic interpretations that align with professional medical reporting standards. In settings with limited radiological expertise, this capability suggests a promising future in which AI could assist healthcare professionals in more effective patient diagnosis and treatment. Recent studies suggest that ChatGPT could serve as a supportive diagnostic tool, offering timely image interpretation assistance, particularly in healthcare settings where immediate access to specialized radiological expertise is limited [5].
M4CXR is a specialized cloud-based system that utilizes advanced LLMs to generate comprehensive radiology reports from chest X-ray images. In contrast, ChatGPT is a general-purpose conversational AI model based on the GPT-4o architecture, capable of engaging in dialogue and generating text across a wide range of topics beyond the medical domain.
The comparative analysis of these two systems provides valuable insights into the performance and limitations of specialized medical AI versus general-purpose language models. Specifically, it allows us to understand the extent to which a specialized AI system like M4CXR may outperform a general conversational model like ChatGPT in terms of diagnostic accuracy and reliability in the radiological interpretation of chest X-rays.
This study aimed to explore their potential applications and implications for clinical practice, focusing on their performance in chest X-ray interpretation. By evaluating the performance of these AI tools against that of human radiologists, we sought to understand their strengths, limitations, and potential impact on the future of medical imaging diagnostics.
4. Discussion
Artificial intelligence (AI) is increasingly being integrated into radiological practice, including chest X-ray interpretation. Its applications are diverse and impactful, ranging from breast cancer risk assessment and disease detection to reducing interpretation times and serving as a supplementary ‘reader’ during screening processes [7]. This wide array of applications demonstrates AI’s potential to significantly enhance both the accuracy and efficiency of medical diagnostics. In a notable study, physicians at a single hospital reported favorable experiences with AI-based software for chest radiographs, finding it particularly valuable in emergency room settings and for the detection of pneumothorax [8]. These findings highlight the practical benefits of AI in time-sensitive clinical environments. Researchers have also proposed a machine learning model capable of automatically diagnosing various diseases from chest radiographs [9], a significant step toward more comprehensive and efficient diagnostic processes. A multicenter study further validated AI’s efficacy as a chest X-ray screening tool: the platform effectively differentiated pathological from non-pathological findings while improving workflow efficiency and supporting clinical interpretation [10]. These results underscore AI’s potential to streamline workflow and support clinical decision making. Moreover, an AI solution for chest X-ray evaluation has shown practical viability, robust performance, and tangible benefits in clinical settings [11], further reinforcing the growing role of AI as a valuable asset in modern radiological practice. Collectively, these studies indicate that AI is not only transforming chest X-ray interpretation but also enhancing overall patient care through improved diagnostic capabilities and increased operational efficiency.
The application of AI in chest X-ray interpretation, while promising, faces several challenges that warrant careful consideration. Traditional approaches relying on labeled datasets encounter significant hurdles in terms of both accuracy and efficiency. One primary obstacle is the resource-intensive nature of manual labeling for large datasets; this process is not only costly but also extremely time-consuming, potentially limiting the scalability of AI models [12]. Furthermore, attempts to automate label extraction from radiology reports have proven challenging due to the nuanced nature of medical terminology, where semantically similar words can lead to misinterpretation, and the frequent occurrence of incompletely annotated data [12].
A multicenter evaluation revealed another critical limitation: a performance discrepancy between retrospective and prospective validations. An AI algorithm for chest X-ray analysis showed diagnostic performance in prospective clinical implementation that did not fully match the accuracy levels observed in retrospective evaluations [13]. This finding underscores the importance of robust validation processes that accurately reflect real-world clinical scenarios. Nevertheless, despite these challenges, AI models have shown competitive performance when compared with human radiologists: in most regions of the receiver operating characteristic (ROC) curve, the AI model’s performance was either on par with or only slightly inferior to that of experienced human interpreters [14]. This suggests that while there is room for improvement, AI systems are already approaching human-level competency in certain aspects of chest X-ray interpretation. These advancements represent a significant step toward overcoming the current limitations of AI in chest X-ray interpretation, paving the way for more robust, efficient, and widely applicable AI systems in radiology.
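For illustration, the sketch below shows how such an AI-versus-reader ROC comparison is typically computed: the AI model contributes a full ROC curve from its continuous output scores, while a human reader’s binary decisions reduce to a single operating point on the same axes. The labels and scores here are synthetic placeholders, not data from the cited studies.

```python
# Illustrative sketch (not the cited studies' code): comparing an AI model's
# ROC curve against a reader's single operating point.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                                  # synthetic ground truth (0/1)
ai_scores = np.clip(y_true * 0.6 + rng.normal(0.3, 0.25, 200), 0, 1)   # synthetic AI probabilities

fpr, tpr, _ = roc_curve(y_true, ai_scores)                             # full ROC curve for the AI
print(f"AI AUC: {roc_auc_score(y_true, ai_scores):.3f}")

# A human reader's binary call appears as one (FPR, TPR) point to compare
# against the AI's curve.
reader_calls = (ai_scores + rng.normal(0, 0.2, 200)) > 0.5
tp = np.sum(reader_calls & (y_true == 1)); fn = np.sum(~reader_calls & (y_true == 1))
fp = np.sum(reader_calls & (y_true == 0)); tn = np.sum(~reader_calls & (y_true == 0))
print(f"Reader operating point: FPR={fp/(fp+tn):.2f}, TPR={tp/(tp+fn):.2f}")
```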
The recent literature has increasingly focused on the application of advanced language processing systems in medical image interpretation. Current research explores the integration of these models across diverse aspects of radiological practice, encompassing automated report creation, systematic classification, key finding identification, and interactive diagnostic support. Studies suggest that these tools may enhance clinical practice through optimized workflow processes, improved diagnostic consistency, and augmented decision support for healthcare practitioners [15]. ChatGPT in particular has been highlighted as a tool for researchers to explore further applications in this field, with the goal of bridging the gap between language models and medical imaging and inspiring new ideas and innovations in this area of research.
M4CXR represents a cutting-edge advancement in radiological diagnostics, leveraging the power of artificial intelligence and sophisticated language models. This cloud-based medical technology offers a web-accessible platform with an intuitive interface designed for healthcare professionals. Its primary function is to analyze chest X-ray images in DICOM format, generating comprehensive radiological reports that include detailed findings and conclusions.
The core of M4CXR’s capabilities lies in its advanced AI, which has undergone extensive training on large-scale chest X-ray image datasets. This training enables the system to provide rapid and accurate diagnostic insights, significantly aiding radiologists in their work. The technology not only enhances the precision of diagnoses but also substantially reduces the time required for report generation. These features make M4CXR particularly valuable in high-volume clinical environments or in settings where radiological expertise is limited or overstretched.
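M4CXR’s programming interface is not publicly documented, but the workflow described above (a DICOM study in, a findings-and-conclusion report out of a cloud service) can be sketched as follows. The endpoint URL, field names, and response structure are purely hypothetical placeholders.

```python
# Hypothetical sketch of the described cloud workflow; the endpoint, field
# names, and response shape are assumptions, not M4CXR's real API.
import requests

ENDPOINT = "https://example.com/m4cxr/v1/interpret"   # placeholder URL

def request_report(dicom_path: str, api_key: str) -> dict:
    """Upload one DICOM study and return the generated report (assumed JSON)."""
    with open(dicom_path, "rb") as f:
        resp = requests.post(
            ENDPOINT,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"study": ("chest.dcm", f, "application/dicom")},
            timeout=60,
        )
    resp.raise_for_status()
    return resp.json()   # e.g., {"findings": "...", "conclusion": "..."}

# report = request_report("chest.dcm", api_key="...")
# print(report["findings"], report["conclusion"])
```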
In parallel, ChatGPT, built on the GPT-4o architecture, has demonstrated potential in the medical field, particularly in the domain of chest X-ray interpretation. Its extensive training on a wide range of medical texts and imaging data allows it to generate diagnostic reports based on chest X-ray information. This LLM has the capacity to expedite the diagnostic process by offering swift preliminary interpretations of chest X-rays, which is especially beneficial in areas with limited access to specialized radiology expertise.
However, it is crucial to emphasize that while ChatGPT shows promise, it is not intended to replace professional medical judgment. As noted in the literature [16], ChatGPT should not be considered a substitute for professional medical advice, diagnosis, or treatment. Instead, its optimal use lies in its integration into existing clinical workflows as a supportive tool, designed to complement and enhance the skills of medical professionals rather than replace them [17].
Our study’s detailed analysis of radiological reports reveals that M4CXR consistently demonstrates superior performance compared to GPT-4o across multiple dimensions of diagnostic accuracy. This superiority is evident in both the higher percentages of accurate interpretations and the greater interobserver agreement rates observed for M4CXR.
While both systems demonstrated capabilities in chest X-ray interpretation, it is essential to contextualize their performance within their intended purposes. M4CXR, as a specialized medical AI system, was specifically designed and optimized for radiological analysis, incorporating domain-specific knowledge and training focused on chest X-ray interpretation. This specialization is reflected in its superior performance across all evaluation metrics, particularly in anatomical localization (76–77.5% accuracy) and reduced hallucination rates.
In contrast, ChatGPT, while showing promising capabilities, operates as a general-purpose language model not specifically optimized for medical image analysis. Its lower performance in specific metrics (36–36.5% accuracy in anatomical localization) reflects this fundamental difference in design and purpose. The performance gap between these systems demonstrates the value of specialized medical AI tools in clinical settings, while also highlighting the potential limitations of deploying general-purpose AI models for specialized medical tasks.
Detailed analysis of failure cases revealed specific scenarios where both systems exhibited limitations. In cases involving complex, overlapping pathological patterns, M4CXR’s accuracy decreased to 43.2%, while ChatGPT struggled significantly, achieving only 28.5% accuracy. Both systems showed particular vulnerability to image quality variations, with M4CXR maintaining moderate performance (52.3% accuracy) on slightly overexposed images, while ChatGPT’s performance declined more substantially to 31.1% under similar conditions.
The systems demonstrated notable challenges when encountering rare pathological findings. M4CXR’s confidence scores averaged 0.68 for uncommon conditions, indicating increased uncertainty. ChatGPT showed a tendency to misclassify rare conditions as more common ones, with a false-negative rate of 38.2%. Furthermore, anatomical variations and post-surgical changes proved challenging for both systems, with M4CXR achieving 48.7% accuracy and ChatGPT reaching 33.4% accuracy in these cases.
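For clarity, the sketch below illustrates how per-scenario accuracy and false-negative rates of the kind reported above can be derived from case-level gradings. The records shown are synthetic placeholders, not our study data.

```python
# Minimal sketch of deriving per-scenario accuracy and false-negative rate
# from case-level gradings; the tuples below are synthetic placeholders.
from statistics import mean

# (scenario, model, ground_truth_positive, model_called_positive)
gradings = [
    ("complex_overlap", "M4CXR",   True,  True),
    ("complex_overlap", "ChatGPT", True,  False),
    ("rare_finding",    "M4CXR",   True,  True),
    ("rare_finding",    "ChatGPT", True,  False),
    ("rare_finding",    "ChatGPT", False, False),
]

def accuracy(records, scenario, model):
    """Fraction of a scenario's cases where the model's call matches truth."""
    rel = [(t == c) for s, m, t, c in records if s == scenario and m == model]
    return mean(rel) if rel else float("nan")

def false_negative_rate(records, model):
    """Among ground-truth-positive cases, fraction the model called negative."""
    pos = [c for s, m, t, c in records if m == model and t]
    return mean(not c for c in pos) if pos else float("nan")

print(accuracy(gradings, "complex_overlap", "M4CXR"))   # per-scenario accuracy
print(false_negative_rate(gradings, "ChatGPT"))         # cf. the 38.2% above
```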
These findings highlight specific areas requiring improvement before widespread clinical implementation. The performance degradation in complex cases, particularly with rare conditions and post-surgical changes, suggests the need for expanded training datasets and refined algorithms to handle these challenging scenarios. The difference in response times and output format quality also indicates important considerations for practical clinical integration.
While the integration of Software as a Medical Device (SaMD) technologies such as M4CXR and ChatGPT into medical diagnostics shows considerable promise, it also presents several critical challenges that warrant further investigation. One of the primary concerns is the potential for bias in AI model training data. Many AI systems are developed using datasets that may not adequately represent the full spectrum of human diversity, potentially leading to diagnostic inaccuracies or biases, particularly when applied to underrepresented population groups [18].
The implementation of AI-assisted diagnostic systems like M4CXR could have particular significance in resource-limited healthcare settings, where access to specialized radiological expertise may be limited. In developing regions, such systems could serve as valuable diagnostic support tools, potentially improving the accuracy and efficiency of chest X-ray interpretation while helping to address the shortage of specialized radiologists, though further validation studies in these specific contexts would be necessary.
Another significant challenge lies in the inherent complexity and opacity of these AI systems, often referred to as the ‘black box’ problem [18]. The intricate internal processes by which these technologies analyze and interpret chest X-ray images are not fully transparent or easily interpretable. This lack of explainability can hinder healthcare professionals’ ability to fully understand and, consequently, trust the diagnostic conclusions generated by these AI tools.
The integration of artificial intelligence tools into clinical practice presents a significant challenge in the advancement of medical diagnostics. Healthcare professionals often express reservations about relying on AI for critical diagnostic tasks, stemming from concerns over the accuracy and reliability of these systems. Additionally, the potential legal and ethical implications of AI-assisted diagnoses contribute to this hesitancy. Establishing a foundation of trust in AI technologies is therefore crucial for their successful adoption and integration into medical settings [19].
Recent research by Soleimani et al. demonstrated that, among various NLP models evaluated for radiology report generation, the BART and XLM models achieved remarkably high performance, with mean similarity scores of up to 99.3% compared with physician reports. Their findings suggest that while certain AI models show promise in generating radiological reports comparable to those of medical professionals, careful model selection and evaluation remain crucial for clinical implementation [20].
The implementation of AI systems in medical diagnosis raises significant ethical considerations that warrant careful examination. A primary concern involves data privacy and security in AI-based medical systems. While M4CXR employs immediate deletion of data after analysis and secure cloud-based processing, the broader implications of handling sensitive medical data through AI systems require ongoing scrutiny. We implemented comprehensive data protection measures, including HIPAA-compliant anonymization protocols and secure data transmission channels, demonstrating a practical approach to addressing privacy concerns in AI-mediated medical diagnosis.
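As an illustration of the kind of anonymization step described above (not our actual pipeline), the following sketch blanks common patient-identifying DICOM tags using pydicom. The tag list is deliberately partial; full HIPAA de-identification follows a more complete profile such as DICOM PS3.15 Annex E.

```python
# Illustrative DICOM anonymization sketch using pydicom; the PHI tag list
# below is partial and for demonstration only.
import pydicom

PHI_TAGS = ["PatientName", "PatientID", "PatientBirthDate",
            "PatientAddress", "ReferringPhysicianName", "InstitutionName"]

def anonymize(src_path: str, dst_path: str) -> None:
    ds = pydicom.dcmread(src_path)
    for tag in PHI_TAGS:
        if tag in ds:
            setattr(ds, tag, "")      # blank direct identifiers
    ds.remove_private_tags()          # drop vendor-specific private elements
    ds.save_as(dst_path)

# anonymize("chest.dcm", "chest_anon.dcm")
```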
The ‘black box’ nature of AI decision making presents another significant challenge. Both M4CXR and ChatGPT utilize complex neural networks whose internal decision-making processes are not fully transparent. To address this, M4CXR incorporates an explainable AI component that provides confidence scores and highlights specific image regions influencing its decisions. This feature helps clinicians understand and validate the system’s interpretations, although complete algorithmic transparency remains a challenge.
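How M4CXR derives its region highlights internally is not publicly detailed, but the presentation style described above can be sketched generically: a confidence score displayed alongside a saliency heatmap overlaid on the radiograph. In the sketch below, the image, heatmap, and finding label are random placeholders standing in for model outputs.

```python
# Generic sketch of a confidence-plus-heatmap display; all arrays and labels
# are placeholders, not actual model outputs.
import numpy as np
import matplotlib.pyplot as plt

image = np.random.rand(512, 512)       # placeholder chest X-ray
heatmap = np.random.rand(512, 512)     # placeholder saliency map
confidence = 0.87                      # placeholder confidence score

plt.imshow(image, cmap="gray")
plt.imshow(heatmap, cmap="jet", alpha=0.35)   # translucent overlay
plt.title(f"Finding: consolidation (confidence {confidence:.2f})")
plt.axis("off")
plt.savefig("overlay.png", dpi=150)
```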
Rather than viewing these AI systems as potential replacements for radiologists, our findings support their role as complementary diagnostic tools. M4CXR served as an effective ‘second reader’, flagging potential abnormalities for closer examination and reducing the likelihood of oversight errors. Similarly, while ChatGPT showed lower specialized performance, it demonstrated utility in preliminary screening and educational contexts. The integration of these AI tools into clinical workflows requires careful consideration of human–AI collaboration. In particular, integrating M4CXR as a concurrent reading tool could help standardize reporting terminology and reduce inter-reader variability by providing consistent reference points for radiologists, potentially minimizing discrepancies between human and AI-generated interpretations while maintaining the critical role of human expertise in final diagnostic decisions.
Our study, while offering valuable insights into the comparative performance of M4CXR and GPT-4o in chest X-ray interpretation, was subject to several limitations that warrant careful consideration. These limitations not only contextualize our findings but also highlight areas for future research.
Firstly, we acknowledge that our study’s reliance on data from a single institution (Inha University Hospital) may limit the generalizability of our findings. The demographic characteristics and clinical patterns specific to our institution might not fully represent the diverse patient populations and varying clinical scenarios encountered in different healthcare settings. Future validation studies incorporating data from multiple institutions with diverse patient demographics and clinical settings would be valuable in confirming the broader applicability of our findings.
A critical limitation of this study is the absence of a definitive reference standard for chest X-ray interpretations. While we assessed interobserver agreement between readers, the lack of a gold standard means that even interpretations by experienced radiologists cannot be considered absolute truths. This absence of a definitive benchmark may impact the reliability and validity of the diagnostic conclusions drawn in our study and underscores the inherent complexity and subjectivity in radiological interpretation.
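One standard statistic for quantifying interobserver agreement in the absence of a gold standard is Cohen’s kappa, which corrects raw agreement for chance. The sketch below shows the computation on synthetic placeholder ratings; it is illustrative and does not reproduce our study’s analysis.

```python
# Illustrative Cohen's kappa computation on synthetic reader ratings.
from sklearn.metrics import cohen_kappa_score

reader_a = ["normal", "pneumonia", "normal",   "effusion", "pneumonia", "normal"]
reader_b = ["normal", "pneumonia", "effusion", "effusion", "normal",    "normal"]

kappa = cohen_kappa_score(reader_a, reader_b)
print(f"Cohen's kappa: {kappa:.2f}")   # 1.0 = perfect agreement, 0 = chance level
```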
Another notable constraint arose from the ethical safeguards programmed into ChatGPT, which prevented direct requests for chest X-ray interpretation. This necessitated the use of indirect prompts to elicit diagnostic interpretations from the AI system. While this workaround allowed us to proceed with the study, it may have influenced the quality and accuracy of the results obtained from ChatGPT. We acknowledge the possibility that more direct prompts, if permissible, might have yielded different or potentially more accurate results from ChatGPT.
Furthermore, our study did not account for the potential variability in image quality and the diversity of pathological conditions that might be encountered in a broader clinical setting. The limited scope of our image dataset may not fully represent the wide spectrum of chest X-ray findings seen in daily clinical practice.
Additionally, the comparison between M4CXR, a specialized medical AI tool, and GPT-4o, a general-purpose language model, may not fully capture the nuances of their respective capabilities in medical imaging interpretation. The fundamental differences in their design and training may contribute to performance disparities that require more in-depth analysis.
Lastly, our study did not extensively explore the explainability and interpretability of the AI-generated results, which are crucial factors in building trust and facilitating the integration of AI tools in clinical practice.
Our study provides valuable insights into the capabilities and limitations of AI systems in radiological interpretation, yet several important avenues for future research remain. While our findings demonstrate the potential of these systems, future studies should prioritize multicenter validation across diverse healthcare settings and patient populations to establish broader generalizability. The development of more transparent AI architectures that clearly communicate their diagnostic reasoning processes would enhance trust and adoption among healthcare professionals.
Future research should focus on three key areas. First, comprehensive clinical integration studies are needed to determine optimal approaches for incorporating these AI systems into existing radiological workflows. Second, extended performance validation across diverse healthcare settings and pathological conditions will help establish robust performance benchmarks and identify potential limitations. Third, research into human–AI interaction design could improve system usability and enhance radiologist productivity while maintaining diagnostic accuracy. These directions emphasize both technical advancement and practical clinical utility, ensuring that developments in AI-assisted radiology ultimately serve to enhance patient care while supporting healthcare providers’ decision-making processes.