Performance of Generative Large Language Models on Ophthalmology Board-Style Questions
- PMID: 37339728
- DOI: 10.1016/j.ajo.2023.05.024
Abstract
Purpose: To investigate the ability of generative artificial intelligence models to answer ophthalmology board-style questions.
Design: Experimental study.
Methods: This study evaluated three large language models (LLMs) with chat interfaces: Bing Chat (Microsoft) and ChatGPT 3.5 and 4.0 (OpenAI), using 250 questions from the Basic Science and Clinical Science Self-Assessment Program. Whereas ChatGPT is trained on information last updated in 2021, Bing Chat incorporates a more recently indexed internet search to generate its answers. Performance was compared with that of human respondents. Questions were categorized by complexity and patient care phase, and instances of information fabrication or nonlogical reasoning were documented.
Main outcome measures: Primary outcome was response accuracy. Secondary outcomes were performance in question subcategories and hallucination frequency.
Results: Human respondents had an average accuracy of 72.2%. ChatGPT-3.5 scored the lowest (58.8%), whereas ChatGPT-4.0 (71.6%) and Bing Chat (71.2%) performed comparably. ChatGPT-4.0 excelled at workup-type questions relative to diagnostic questions (odds ratio [OR], 3.89; 95% CI, 1.19-14.73; P = .03) but struggled with image interpretation relative to single-step reasoning questions (OR, 0.14; 95% CI, 0.05-0.33; P < .01). Compared with single-step questions, Bing Chat also had difficulty with image interpretation (OR, 0.18; 95% CI, 0.08-0.44; P < .01) and multi-step reasoning (OR, 0.30; 95% CI, 0.11-0.84; P = .02). ChatGPT-3.5 had the highest rate of hallucinations and nonlogical reasoning (42.4%), followed by Bing Chat (25.6%) and ChatGPT-4.0 (18.0%).
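The odds ratios and 95% confidence intervals above are typically derived from 2x2 contingency counts with a Wald interval on the log odds ratio. A minimal sketch of that calculation follows; the counts used are hypothetical illustrations, not the study's data.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table.

    a = correct, category of interest;   b = incorrect, category of interest
    c = correct, reference category;     d = incorrect, reference category
    """
    or_ = (a * d) / (b * c)
    # Standard error of ln(OR) from the four cell counts
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se)
    upper = math.exp(math.log(or_) + z * se)
    return or_, lower, upper

# Hypothetical counts for illustration only:
or_, lo, hi = odds_ratio_ci(18, 2, 70, 30)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```

An OR above 1 with a CI excluding 1 (as for ChatGPT-4.0 on workup questions) indicates significantly better odds of a correct answer than in the reference category; an OR below 1 with a CI excluding 1 (as for image interpretation) indicates significantly worse odds.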
Conclusions: LLMs (particularly ChatGPT-4.0 and Bing Chat) can perform comparably to human respondents on questions from the Basic Science and Clinical Science Self-Assessment Program. The frequency of hallucinations and nonlogical reasoning suggests room for improvement in the performance of conversational agents in the medical domain.
Copyright © 2023 Elsevier Inc. All rights reserved.
Comment in
- Comment on: Performance of Generative Large Language Models on Ophthalmology Board Style Questions. Am J Ophthalmol. 2023 Dec;256:200. doi: 10.1016/j.ajo.2023.07.029. Epub 2023 Aug 2. PMID: 37541409.