PLOS Digit Health. 2023 Feb 9;2(2):e0000198. doi: 10.1371/journal.pdig.0000198. eCollection 2023 Feb.

Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models


Tiffany H Kung et al. PLOS Digit Health. 2023.

Abstract

We evaluated the performance of a large language model called ChatGPT on the United States Medical Licensing Exam (USMLE), which consists of three exams: Step 1, Step 2CK, and Step 3. ChatGPT performed at or near the passing threshold for all three exams without any specialized training or reinforcement. Additionally, ChatGPT demonstrated a high level of concordance and insight in its explanations. These results suggest that large language models may have the potential to assist with medical education, and potentially, clinical decision-making.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Schematic of workflow for sourcing, encoding, and adjudicating results.
Abbreviations: QC = quality control; MCSA-NJ = multiple choice single answer without forced justification; MCSA-J = multiple choice single answer with forced justification; OE = open-ended question format.
Fig 2. Accuracy of ChatGPT on USMLE.
For USMLE Steps 1, 2CK, and 3, AI outputs were adjudicated as accurate, inaccurate, or indeterminate based on the ACI scoring system provided in S2 Data. A: Accuracy distribution for inputs encoded as open-ended questions. B: Accuracy distribution for inputs encoded as multiple choice single answer without (MCSA-NJ) or with forced justification (MCSA-J).
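The accuracy figures in panels A and B reduce to proportions over three adjudicated labels. As a minimal sketch of that calculation in Python, assuming labels are stored as a simple list per exam/encoding combination (the counts below are placeholders, not the study's data):

```python
from collections import Counter

# Hypothetical adjudicated labels for one exam/encoding combination;
# each output is scored "accurate", "inaccurate", or "indeterminate".
labels = ["accurate", "accurate", "indeterminate", "inaccurate", "accurate"]

counts = Counter(labels)
n = len(labels)

# Proportions over all outputs (indeterminate responses kept in the denominator).
for outcome in ("accurate", "inaccurate", "indeterminate"):
    print(f"{outcome}: {counts[outcome] / n:.1%}")

# Accuracy with indeterminate outputs excluded, a common sensitivity analysis.
determinate = counts["accurate"] + counts["inaccurate"]
if determinate:
    print(f"accuracy (indeterminate excluded): {counts['accurate'] / determinate:.1%}")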
Fig 3. Concordance and insight of ChatGPT on USMLE.
For USMLE Steps 1, 2CK, and 3, AI outputs were adjudicated on concordance and density of insight (DOI) based on the ACI scoring system provided in S2 Data. A: Overall concordance across all exam types and question encoding formats. B: Concordance rates stratified between accurate vs inaccurate outputs, across all exam types and question encoding formats; p < 0.001 for accurate vs inaccurate outputs by Fisher exact test. C: Overall insight prevalence, defined as the proportion of outputs containing ≥1 insight, across all exams for questions encoded in MCSA-J format. D: DOI stratified between accurate vs inaccurate outputs, across all exam types for questions encoded in MCSA-J format. Horizontal line indicates the mean; p-value determined by parametric 2-way ANOVA with Benjamini-Krieger-Yekutieli (BKY) post hoc correction to control the false discovery rate.
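The caption names two procedures: a Fisher exact test on concordance between accurate and inaccurate outputs, and FDR control via the two-stage BKY method. The sketch below shows how one might reproduce both in Python with scipy and statsmodels (the 'fdr_tsbky' method implements the two-stage BKY procedure); the paper does not specify its analysis software, and every count and p-value here is a placeholder, not the study's data.

```python
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

# Hypothetical 2x2 table: rows = accurate vs inaccurate outputs,
# columns = concordant vs discordant explanations.
table = [[90, 10],   # accurate outputs: concordant, discordant
         [25, 20]]   # inaccurate outputs: concordant, discordant
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"Fisher exact: OR={odds_ratio:.2f}, p={p_value:.4g}")

# Hypothetical per-comparison p-values (e.g., from the DOI ANOVA), corrected
# with the two-stage Benjamini-Krieger-Yekutieli FDR procedure.
raw_p = [0.0004, 0.012, 0.048, 0.21]
reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_tsbky")
print("adjusted p:", p_adj.round(4), "reject H0:", reject)
```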


Grants and funding

The authors received no specific funding for this work.
