Performance of Generative Large Language Models on Ophthalmology Board-Style Questions
Am J Ophthalmol. 2023 Oct:254:141-149.
doi: 10.1016/j.ajo.2023.05.024. Epub 2023 Jun 18.

Louis Z Cai et al. Am J Ophthalmol. 2023 Oct.

Abstract

Purpose: To investigate the ability of generative artificial intelligence models to answer ophthalmology board-style questions.

Design: Experimental study.

Methods: This study evaluated 3 large language models (LLMs) with chat interfaces: Bing Chat (Microsoft) and ChatGPT-3.5 and ChatGPT-4.0 (OpenAI). The models were tested on 250 questions from the Basic Science and Clinical Science Self-Assessment Program. Whereas ChatGPT is trained on information last updated in 2021, Bing Chat incorporates a more recently indexed internet search to generate its answers. Performance was compared with that of human respondents. Questions were categorized by complexity and patient care phase, and instances of information fabrication or nonlogical reasoning were documented.

Main outcome measures: Primary outcome was response accuracy. Secondary outcomes were performance in question subcategories and hallucination frequency.

Results: Human respondents had an average accuracy of 72.2%. ChatGPT-3.5 scored the lowest (58.8%), whereas ChatGPT-4.0 (71.6%) and Bing Chat (71.2%) performed comparably. ChatGPT-4.0 excelled in workup-type questions compared with diagnostic questions (odds ratio [OR], 3.89; 95% CI, 1.19-14.73; P = .03) but struggled with image interpretation compared with single-step reasoning questions (OR, 0.14; 95% CI, 0.05-0.33; P < .01). Compared with single-step questions, Bing Chat also had difficulty with image interpretation (OR, 0.18; 95% CI, 0.08-0.44; P < .01) and multi-step reasoning (OR, 0.30; 95% CI, 0.11-0.84; P = .02). ChatGPT-3.5 had the highest rate of hallucinations and nonlogical reasoning (42.4%), followed by Bing Chat (25.6%) and ChatGPT-4.0 (18.0%).
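
To make the reported percentages concrete, the sketch below converts them into approximate question counts. It assumes that every model percentage is a fraction of the same 250 questions named in the Methods; the abstract does not state the denominators explicitly, so the counts are illustrative rather than reported values. The human average (72.2%) is omitted because it is a mean across respondents, not a single count. The names `TOTAL_QUESTIONS` and `implied_count` are introduced here for illustration only.

```python
# Assumption: each percentage is a share of the same 250 questions.
TOTAL_QUESTIONS = 250

# Percentages as reported in the Results.
accuracy_pct = {"ChatGPT-3.5": 58.8, "ChatGPT-4.0": 71.6, "Bing Chat": 71.2}
hallucination_pct = {"ChatGPT-3.5": 42.4, "ChatGPT-4.0": 18.0, "Bing Chat": 25.6}


def implied_count(percent, total=TOTAL_QUESTIONS):
    """Return the whole-number count closest to `percent` percent of `total`."""
    return round(percent * total / 100)


# Implied number of correct answers and of responses flagged for
# hallucination or nonlogical reasoning, per model.
correct = {model: implied_count(p) for model, p in accuracy_pct.items()}
flagged = {model: implied_count(p) for model, p in hallucination_pct.items()}

for model in accuracy_pct:
    print(
        f"{model}: about {correct[model]}/{TOTAL_QUESTIONS} correct, "
        f"about {flagged[model]}/{TOTAL_QUESTIONS} flagged"
    )
```

Under this assumption, the 0.4 percentage-point accuracy gap between ChatGPT-4.0 and Bing Chat corresponds to a single question.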

Conclusions: LLMs (particularly ChatGPT-4.0 and Bing Chat) can perform similarly to human respondents when answering questions from the Basic Science and Clinical Science Self-Assessment Program. The frequency of hallucinations and nonlogical reasoning suggests room for improvement in the performance of conversational agents in the medical domain.
