Artificial or Human Intelligence: A Comparative Study of Diagnostic Accuracy in Clinical Settings

Mustafa S. Yousuf; Bashar Almaraziq; Moamen Fuad Shabaneh; Zaid Ashraf Al-Awaisheh; Layan Firas Obeido; Leen Salah Al-Sharafat; Toleen Ayman Kasaji; Salsabeel Zeyad Alhawatmeh; Riyam Sameh Shawaqfeh; Abedulrhman S. Abdelfattah; Nazek Abuhalaweh

doi:10.23750/abm.2026.19205

Authors

Mustafa S. Yousuf, Department of Anatomy, Physiology, and Biochemistry, Faculty of Medicine, The Hashemite University, Zarqa, Jordan. https://orcid.org/0009-0006-9314-8312
Bashar I. Almaraziq, Faculty of Medicine, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan. https://orcid.org/0009-0009-3470-515X
Moamen Fuad Shabaneh, Faculty of Medicine, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan. https://orcid.org/0009-0004-3466-8946
Zaid Ashraf Al-Awaisheh, Faculty of Medicine, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan. https://orcid.org/0009-0007-4356-3590
Layan Firas Obeido, Faculty of Medicine, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan. https://orcid.org/0009-0007-1297-7344
Leen Salah Al-Sharafat, Faculty of Medicine, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan. https://orcid.org/0009-0008-1932-0166
Toleen Ayman Kasaji, Faculty of Medicine, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan. https://orcid.org/0009-0004-2616-8888
Salsabeel Zeyad Alhawatmeh, Faculty of Medicine, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan. https://orcid.org/0009-0001-4714-922X
Riyam Sameh Shawaqfeh, Faculty of Medicine, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan. https://orcid.org/0009-0004-5726-8929
Abedulrhman S. Abdelfattah, Department of Pediatrics, Faculty of Medicine, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan. https://orcid.org/0000-0002-9973-3663
Nazek Abuhalaweh, Department of Internal Medicine, Faculty of Medicine, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan. https://orcid.org/0009-0006-9314-8312

Keywords:

Artificial intelligence, Diagnostic accuracy, ChatGPT, Gemini

Abstract

Background and aim: Recent advancements in artificial intelligence (AI) have expanded its application in medicine, particularly in diagnostics. This study aimed to compare the diagnostic efficiency of two AI tools and human doctors on medical cases from four medical specialties.

Methods: A total of 120 cases in dermatology, internal medicine, pediatrics, and psychiatry (30 cases per specialty) were presented in a standardized way to Google Gemini 1.5 Flash, ChatGPT-4o, and human doctors. Responses were evaluated and scored by one specialist per specialty. Case total scores were compared between agents and between specialties using the Kruskal-Wallis test. Diagnostic accuracy was compared using the Chi-square test.
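As an illustration of the Chi-square comparison of diagnostic accuracy named in the Methods, the following is a minimal sketch using only the Python standard library. The counts are hypothetical placeholders chosen for the example, not the study's data, and the function names are assumptions introduced here. The p-value formula used is exact only for 2 degrees of freedom, which is what a table of three agents by two outcomes (correct, incorrect) gives.

```python
import math


def chi_square_independence(table):
    """Return the Pearson chi-square statistic and degrees of freedom
    for an r x c contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of agent and outcome.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_two_dof(statistic):
    """Upper-tail p-value; this closed form holds only for 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)


# HYPOTHETICAL counts (correct, incorrect) out of 120 cases per agent.
# Illustrative placeholders only; they are not taken from the study.
hypothetical_counts = [
    [100, 20],  # agent A
    [90, 30],   # agent B
    [80, 40],   # agent C
]

statistic, dof = chi_square_independence(hypothetical_counts)
p_value = p_value_two_dof(statistic)
print(f"chi2 = {statistic:.3f}, dof = {dof}, p = {p_value:.4f}")
```

A statistical package would normally be used in practice; the hand-written version is shown only to make the expected-count calculation explicit.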

Results: ChatGPT obtained the highest grand total score (1432/1800) and the highest total score in every specialty except dermatology, where human doctors scored highest (293/450). Case total scores differed significantly between agents (p < 0.001), with ChatGPT scoring significantly higher than both Gemini and human doctors. ChatGPT also had significantly higher diagnostic accuracy (91%). Comparing each agent's responses across specialties showed that ChatGPT scored significantly higher in internal medicine, Gemini in psychiatry, and human doctors in dermatology. In dermatology, neither case total scores nor diagnostic accuracy differed significantly between the three agents. In the other three specialties, case total scores differed significantly between agents, whereas diagnostic accuracy differed significantly only in internal medicine.

Conclusions: Artificial intelligence, especially ChatGPT, has great potential for use in medical diagnosis. Caution is nevertheless warranted, as such tools can make mistakes.

