Artificial or Human Intelligence: A Comparative Study of Diagnostic Accuracy in Clinical Settings
Keywords:
Artificial intelligence, Diagnostic accuracy, ChatGPT, Gemini
Abstract
Background and aim: Recent advancements in artificial intelligence (AI) have expanded its application in medicine, particularly in diagnostics. This study aimed to compare the diagnostic performance of two AI tools and human doctors on medical cases from four specialties.
Methods: A total of 120 cases in dermatology, internal medicine, pediatrics, and psychiatry (30 cases per specialty) were presented in a standardized way to Google Gemini 1.5 Flash, ChatGPT-4o, and human doctors. Responses were evaluated and scored by one specialist per specialty. Case total scores were compared between agents and between specialties using the Kruskal-Wallis test. Diagnostic accuracy was compared using the Chi-square test.
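The two tests named in the Methods can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' analysis code: the function names and the sample data are hypothetical, and the sketch assumes exactly three agents, so both tests have 2 degrees of freedom and the chi-square tail probability reduces to exp(-x/2).

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_3(groups):
    """Tie-corrected Kruskal-Wallis H and p-value for exactly three groups."""
    assert len(groups) == 3, "the closed-form p-value below assumes df = 2"
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction (undefined if every pooled value is identical).
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    h /= correction
    return h, math.exp(-h / 2)  # chi-square survival function with df = 2


def chi_square_3x2(table):
    """Pearson chi-square and p-value for a 3x2 table (agent x correct/incorrect)."""
    assert len(table) == 3 and all(len(row) == 2 for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    return chi2, math.exp(-chi2 / 2)  # df = (3 - 1) * (2 - 1) = 2


# Hypothetical case total scores for three agents (not data from the study).
scores = [[12, 14, 15, 13], [9, 11, 10, 12], [8, 10, 9, 11]]
h_stat, h_p = kruskal_wallis_3(scores)

# Hypothetical counts of [correct, incorrect] diagnoses per agent.
counts = [[27, 3], [22, 8], [21, 9]]
chi2_stat, chi2_p = chi_square_3x2(counts)
```

With more than three agents the degrees of freedom change, and a statistics library would be needed for the p-value.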
Results: ChatGPT obtained the highest grand total score (1432/1800) and the highest total score in each specialty except dermatology, where human doctors scored highest (293/450). The difference between the case total scores was significant (p < 0.001), with ChatGPT scoring significantly higher than both Gemini and human doctors. ChatGPT also had significantly higher diagnostic accuracy (91%). Comparing each agent's responses across specialties showed that ChatGPT scored significantly higher in internal medicine, Gemini in psychiatry, and human doctors in dermatology. In dermatology, neither the case total scores nor the diagnostic accuracy differed significantly between agents. In the other three specialties, the case total scores of the three agents differed significantly. Diagnostic accuracy differed significantly only in internal medicine.
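To put the reported totals on a common scale, they can be converted to percentages. The short sketch below uses only the figures quoted in the Results. The helper name is hypothetical, and the per-case maximum is an inference that assumes every case carried the same maximum score.

```python
def percent(score, maximum):
    """Express a score as a percentage of its maximum, rounded to one decimal."""
    return round(100 * score / maximum, 1)


# Totals quoted in the Results.
chatgpt_grand_pct = percent(1432, 1800)      # ChatGPT, all 120 cases
human_dermatology_pct = percent(293, 450)    # human doctors, 30 dermatology cases

# Implied maximum per case, assuming an equal maximum for every case.
max_per_case_overall = 1800 / 120
max_per_case_specialty = 450 / 30
```

Both divisions give the same per-case maximum, which is consistent with a single scoring scale across specialties.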
Conclusions: Artificial intelligence, especially ChatGPT, has great potential for use in medical diagnosis. Caution is warranted, however, because such tools can make mistakes.
License
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by-nc/4.0) which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Transfer of Copyright and Permission to Reproduce Parts of Published Papers.
Authors retain the copyright for their published work. No formal permission will be required to reproduce parts (tables or illustrations) of published papers, provided the source is quoted appropriately and reproduction has no commercial intent. Reproductions with commercial intent will require written permission and payment of royalties.

