Accuracy of Artificial Intelligence Based Chatbots in Analyzing Orthopedic Pathologies: An Experimental Multi-Observer Analysis
Abstract
1. Introduction
2. Material and Methods
2.1. Study Design
2.2. Case Vignettes
2.3. Selection of Symptom Checkers
2.4. Data Rating and Grading for Evaluation
2.5. Statistical Analysis
3. Results
3.1. Accuracy of Diagnoses
3.2. Urgency
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Comparison | Spearman’s ρ | Significance (2-Tailed) p |
---|---|---|
Doctor 1 vs. Doctor 2 | 0.538 | 0.002 |
Doctor 1 vs. Doctor 3 | 0.388 | 0.034 |
Doctor 2 vs. Doctor 3 | 0.143 | 0.450 |
Comparison | Spearman’s ρ | Significance (2-Tailed) p |
---|---|---|
Doctor 1 vs. Ada | 0.036 | 0.849 |
Doctor 1 vs. Symptomate | 0.358 | 0.052 |
Doctor 1 vs. Symptoma | 0.037 | 0.846 |
Doctor 2 vs. Ada | 0.007 | 0.969 |
Doctor 2 vs. Symptomate | n/a | n/a |
Doctor 2 vs. Symptoma | 0.267 | 0.154 |
Doctor 3 vs. Ada | −0.160 | 0.399 |
Doctor 3 vs. Symptomate | 0.040 | 0.832 |
Doctor 3 vs. Symptoma | −0.086 | 0.653 |
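Spearman's ρ is the Pearson correlation computed on rank-transformed scores. The following is a minimal sketch of that calculation, not the authors' analysis code; the function names and the example ratings are hypothetical and are not study data. Tied values receive the average of the ranks they span, the usual convention for ordinal ratings.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two rating series of equal length (n >= 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None  # undefined when one series has no rank variance
    return cov / sqrt(var_x * var_y)


# Hypothetical ordinal ratings from two raters (illustrative only).
rater_a = [3, 1, 4, 2, 5, 3]
rater_b = [2, 1, 4, 3, 5, 3]
print(spearman_rho(rater_a, rater_b))
```

The function returns `None` when one series is constant, because the coefficient is undefined without rank variance. The two-tailed significance values reported alongside ρ require an additional test, which this sketch leaves out.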
Rater | Correct Evaluation [%] | Incorrect Evaluation (Too Low) [%] | Incorrect Evaluation (Too High) [%] | Number of Cases [n] |
---|---|---|---|---|
Doctor 2 | 75.9 | 10.3 | 11.5 | 29 |
Doctor 3 | 76.9 | 13.8 | 11.5 | 26 |
Ada (Dr_1) | 20.0 | 60.0 | 20.0 | 3 |
Ada (Dr_2) | 60.0 | 43.3 | 6.7 | 18 |
Ada (Dr_3) | 23.1 | 76.9 | 23.1 | 16 |
Ada (Dr_4) | 18.8 | 68.8 | 12.5 | 16 |
Babylon (Dr_1) | 22.2 | 66.7 | 11.1 | 27 |
Babylon (Dr_2) | 36.7 | 50 | 13.3 | 30 |
Babylon (Dr_3) | 43.3 | 40.0 | 16.7 | 30 |
Babylon (Dr_4) | 62.5 | 20.8 | 16.7 | 24 |
Symptomate (Dr_1) | 50.0 | 33.3 | 16.7 | 6 |
Symptomate (Dr_2) | 75.0 | 0.0 | 25.0 | 6 |
Symptomate (Dr_3) | 55.6 | 11.1 | 33.3 | 9 |
Symptomate (Dr_4) | 22.2 | 33.3 | 44.4 | 9 |
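The three percentage columns above partition the rated cases into correct, too low, and too high. A minimal sketch of how such a breakdown can be computed follows, assuming urgency is coded as an ordinal integer (higher means more urgent) and that a reference level exists for each case; the function name, the coding, and the example values are hypothetical and are not study data.

```python
from collections import Counter


def urgency_breakdown(reference, rated):
    """Return the share of cases rated correctly, too low, or too high (in %),
    together with the number of cases compared."""
    if len(reference) != len(rated) or not reference:
        raise ValueError("need two non-empty series of equal length")
    counts = Counter(
        "correct" if r == ref else ("too_low" if r < ref else "too_high")
        for ref, r in zip(reference, rated)
    )
    n = len(reference)
    shares = {
        key: round(100 * counts[key] / n, 1)
        for key in ("correct", "too_low", "too_high")
    }
    return shares, n


# Hypothetical urgency levels (1 = low, 3 = high); illustrative only.
reference_levels = [1, 2, 3, 2, 1, 3, 2, 2]
rated_levels = [1, 1, 3, 3, 1, 2, 2, 2]
print(urgency_breakdown(reference_levels, rated_levels))
# ({'correct': 62.5, 'too_low': 25.0, 'too_high': 12.5}, 8)
```

Because every case falls into exactly one of the three categories, the shares sum to 100% up to rounding.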
App (P1–P4) | Ada Health (P1) | Ada Health (P2) | Ada Health (P3) | Ada Health (P4) | Babylon (P1) | Babylon (P2) | Babylon (P3) | Babylon (P4) | Symptomate (P1) | Symptomate (P2) | Symptomate (P3) | Symptomate (P4) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Ada Health (P1) | | 0.671 * (12) | 1.000 * (13) | 0.604 * (14) | 0.408 (14) | | | | 0.833 (4) | | | |
Ada Health (P2) | 0.671 * (12) | | 1.000 * (12) | 1.000 * (14) | | 0.315 (18) | | | | 0.745 (5) | | |
Ada Health (P3) | 1.000 * (13) | 1.000 * (12) | | 1.000 * (12) | | | 0.415 (16) | | | | 0.894 * (6) | |
Ada Health (P4) | 0.604 * (14) | 1.000 * (14) | 1.000 * (12) | | | | | 0.669 * (14) | | | | 0.323 (5) |
Babylon (P1) | 0.408 (14) | | | | | 0.563 * (27) | 0.550 * (27) | 0.660 * (21) | 0.833 (6) | | | |
Babylon (P2) | | 0.315 (18) | | | 0.563 * (27) | | 0.474 * (30) | 0.607 * (24) | | 0.690 (8) | | |
Babylon (P3) | | | 0.415 (16) | | 0.550 * (27) | 0.474 * (30) | | 0.724 * (24) | | | 0.926 * (9) | |
Babylon (P4) | | | | 0.669 * (14) | 0.660 * (21) | 0.607 * (24) | 0.724 * (24) | | | | | 0.567 (6) |
Symptomate (P1) | 0.833 (4) | | | | 0.833 (6) | | | | | 0.000 * (4) | 1.000 * (3) | 1.000 * (3) |
Symptomate (P2) | | 0.745 (5) | | | | 0.690 (8) | | | 0.000 * (4) | | 0.791 (5) | 0.577 (4) |
Symptomate (P3) | | | 0.894 * (6) | | | | 0.926 * (9) | | 1.000 * (3) | 0.791 (5) | | 0.943 (4) |
Symptomate (P4) | | | | 0.323 (5) | | | | 0.567 (6) | 1.000 * (3) | 0.577 (4) | 0.943 (4) | |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Gehlen, T.; Joost, T.; Solbrig, P.; Stahnke, K.; Zahn, R.; Jahn, M.; Adl Amini, D.; Back, D.A. Accuracy of Artificial Intelligence Based Chatbots in Analyzing Orthopedic Pathologies: An Experimental Multi-Observer Analysis. Diagnostics 2025, 15, 221. https://doi.org/10.3390/diagnostics15020221