How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment

JMIR Med Educ. 2023 Feb 8;9:e45312. doi: 10.2196/45312.

Aidan Gilson¹,², Conrad W Safranek¹, Thomas Huang², Vimig Socrates¹,³, Ling Chi¹, Richard Andrew Taylor^#¹,², David Chartash^#¹,⁴

Affiliations

¹ Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States.
² Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States.
³ Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, United States.
⁴ School of Medicine, University College Dublin, National University of Ireland, Dublin, Dublin, Ireland.

^# Contributed equally.

PMID: 36753318
PMCID: PMC9947764
DOI: 10.2196/45312

Erratum in

Correction: How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment.
Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, Chartash D. JMIR Med Educ. 2024 Feb 27;10:e57594. doi: 10.2196/57594. PMID: 38412478 Free PMC article.

Abstract

Background: Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input.

Objective: This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination (USMLE) Step 1 and Step 2 exams, as well as to analyze responses for user interpretability.

Methods: We used 2 sets of multiple-choice questions to evaluate ChatGPT's performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and the performance on an exam relative to the user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT's performance was compared to 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question.

Results: Across the 4 data sets (AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2), ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. We found that logical justification for ChatGPT's answer selection was present in 100% of outputs of the NBME data sets. Information internal to the question was present in 96.8% (183/189) of all questions. The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P<.001) and NBME-Free-Step2 (P=.001) data sets, respectively.
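The percentages in the Results can be checked directly against the counts given in parentheses. The following is a minimal sketch in Python; the names `REPORTED_COUNTS` and `accuracy_percent` are illustrative and do not come from the paper. It also shows that the denominator 189 used for the internal-information figure equals the two NBME sets combined (87 + 102).

```python
# (correct, total) counts exactly as reported in the Results paragraph.
REPORTED_COUNTS = {
    "AMBOSS-Step1": (44, 100),
    "AMBOSS-Step2": (42, 100),
    "NBME-Free-Step1": (56, 87),
    "NBME-Free-Step2": (59, 102),
}


def accuracy_percent(correct: int, total: int) -> float:
    """Return correct/total as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


# Per-data-set accuracy recomputed from the counts.
accuracies = {
    name: accuracy_percent(correct, total)
    for name, (correct, total) in REPORTED_COUNTS.items()
}

# Total number of NBME questions (Step 1 plus Step 2).
nbme_total = (
    REPORTED_COUNTS["NBME-Free-Step1"][1] + REPORTED_COUNTS["NBME-Free-Step2"][1]
)

# Share of responses containing information internal to the question (183/189).
internal_info_percent = accuracy_percent(183, nbme_total)

for name, value in accuracies.items():
    print(f"{name}: {value}%")
print(f"Internal information: {internal_info_percent}% of {nbme_total} questions")
```

Only the NBME-Free-Step1 figure clears the 60% threshold that the Conclusions refer to.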

Conclusions: ChatGPT marks a significant improvement in natural language processing models on the tasks of medical question answering. By performing at a greater than 60% threshold on the NBME-Free-Step1 data set, we show that the model achieves the equivalent of a passing score for a third-year medical student. Additionally, we highlight ChatGPT's capacity to provide logic and informational context across the majority of answers. These facts taken together make a compelling case for the potential applications of ChatGPT as an interactive medical education tool to support learning.

Keywords: ChatGPT; GPT; MedQA; NLP; artificial intelligence; chatbot; conversational agent; education technology; generative pre-trained transformer; machine learning; medical education; natural language processing; USMLE.

©Aidan Gilson, Conrad W Safranek, Thomas Huang, Vimig Socrates, Ling Chi, Richard Andrew Taylor, David Chartash. Originally published in JMIR Medical Education (https://mededu.jmir.org), 08.02.2023.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

**Figure 1**
Template of question posed to each large language model (LLM), including both AMBOSS *Attending Tip* and the response from Chat Generative Pre-trained Transformer (ChatGPT). The correct answer to this question is “E. Zidovudine (AZT).” In the case of GPT-3, prompt engineering was necessary, with: "Please answer this multiple choice question:" + question as described previously + "Correct answer is." As GPT-3 is inherently a nondialogic model, this was necessary to reduce model hallucinations and force a clear answer [17].
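The GPT-3 prompt wrapper described in the caption amounts to simple string concatenation. The sketch below is an illustration only: the function name `build_gpt3_prompt`, the single-space separators, and the placeholder question are assumptions, and only the two quoted strings come from the caption. The caption prints a period inside the closing quote of the suffix; whether that period was part of the prompt is unclear, so it is omitted here.

```python
# Fixed strings quoted in the figure caption.
PROMPT_PREFIX = "Please answer this multiple choice question:"
PROMPT_SUFFIX = "Correct answer is"  # trailing period omitted (assumption)


def build_gpt3_prompt(question_block: str) -> str:
    """Wrap a question (stem plus lettered options) for a nondialogic model.

    The separators between the three parts are an assumption; the caption
    only states that the parts were concatenated in this order.
    """
    return f"{PROMPT_PREFIX} {question_block.strip()} {PROMPT_SUFFIX}"


# Hypothetical placeholder, not an actual AMBOSS or NBME item.
example_question = (
    "<question stem>\n"
    "A. <option A>\n"
    "B. <option B>\n"
    "C. <option C>"
)

print(build_gpt3_prompt(example_question))
```

Ending the prompt on an unfinished statement nudges a completion-style model to continue with an answer choice rather than unrelated text, which matches the caption's stated goal of forcing a clear answer.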

Comment in

Can ChatGPT be a new educational tool in medicine?
Luo Y, Hu N. Med Clin (Barc). 2023 Oct 27;161(8):363-364. doi: 10.1016/j.medcli.2023.05.018. Epub 2023 Jul 10. PMID: 37438191 English, Spanish. No abstract available.

Cited by

Performance of Generative Pretrained Transformer on the National Medical Licensing Examination in Japan.
Tanaka Y, Nakata T, Aiga K, Etani T, Muramatsu R, Katagiri S, Kawai H, Higashino F, Enomoto M, Noda M, Kometani M, Takamura M, Yoneda T, Kakizaki H, Nomura A. PLOS Digit Health. 2024 Jan 23;3(1):e0000433. doi: 10.1371/journal.pdig.0000433. eCollection 2024 Jan. PMID: 38261580 Free PMC article.
Revolutionizing Dental Care: A Comprehensive Review of Artificial Intelligence Applications Among Various Dental Specialties.
Alzaid N, Ghulam O, Albani M, Alharbi R, Othman M, Taher H, Albaradie S, Ahmed S. Cureus. 2023 Oct 14;15(10):e47033. doi: 10.7759/cureus.47033. eCollection 2023 Oct. PMID: 37965397 Free PMC article. Review.
New evidence-based practice: Artificial intelligence as a barrier breaker.
Ferreira RM. World J Methodol. 2023 Dec 20;13(5):384-389. doi: 10.5662/wjm.v13.i5.384. eCollection 2023 Dec 20. PMID: 38229944 Free PMC article.
ChatGPT for scientific community: Boon or bane?
Jain A. Med J Armed Forces India. 2023 Sep-Oct;79(5):498-499. doi: 10.1016/j.mjafi.2023.06.009. Epub 2023 Aug 8. PMID: 37719916 Free PMC article. No abstract available.
Assessing the Performance of ChatGPT in Medical Biochemistry Using Clinical Case Vignettes: Observational Study.
Surapaneni KM. JMIR Med Educ. 2023 Nov 7;9:e47191. doi: 10.2196/47191. PMID: 37934568 Free PMC article.

References

1. OpenAI. ChatGPT: optimizing language models for dialogue. OpenAI. 2022. Nov 30, [2022-12-22]. https://openai.com/blog/chatgpt/
2. Scott K. Microsoft teams up with OpenAI to exclusively license GPT-3 language model. The Official Microsoft Blog. 2020. Sep 22, [2022-12-19]. https://blogs.microsoft.com/blog/2020/09/22/microsoft-teams-up-with-open...
3. Bowman E. A new AI chatbot might do your homework for you. But it's still not an A+ student. NPR. 2022. Dec 19, [2022-12-19]. https://www.npr.org/2022/12/19/1143912956/chatgpt-ai-chatbot-homework-ac...
4. How good is ChatGPT? The Economist. 2022. Dec 8, [2022-12-20]. https://www.economist.com/business/2022/12/08/how-good-is-chatgpt
5. Chambers A. Can Artificial Intelligence (Chat GPT) get a 7 on an SL Maths paper? IB Maths Resources from Intermathematics. 2022. Dec 11, [2022-12-20]. https://ibmathsresources.com/2022/12/11/can-artificial-intelligence-chat...
