Abstract
Introduction
The purpose of this study was to evaluate the capabilities of large language models (LLMs) such as Chat Generative Pretrained Transformer (ChatGPT) in diagnosing glaucoma from specific clinical case descriptions and to compare their performance with that of senior ophthalmology resident trainees.
Methods
We selected 11 cases of primary and secondary glaucoma from a publicly accessible online database of case reports. Four cases had primary glaucoma, including open-angle, juvenile, normal-tension, and angle-closure glaucoma, while seven cases had secondary glaucoma, including pseudo-exfoliation, pigment dispersion glaucoma, glaucomatocyclitic crisis, aphakic, neovascular, aqueous misdirection, and inflammatory glaucoma. We input the text of each case into ChatGPT and asked for provisional and differential diagnoses. We then presented the same 11 cases to three senior ophthalmology residents and recorded their provisional and differential diagnoses. Finally, we scored all responses against the correct diagnoses and assessed agreement.
Results
The provisional diagnosis from ChatGPT was correct in eight out of 11 cases (72.7%), while the three ophthalmology residents were correct in six (54.5%), eight (72.7%), and eight (72.7%) cases, respectively. The agreement between ChatGPT and the first, second, and third ophthalmology residents was 9, 7, and 7 out of 11 cases, respectively.
Conclusions
The accuracy of ChatGPT in diagnosing patients with primary and secondary glaucoma, using specific case examples, was similar to or better than that of senior ophthalmology residents. With further development, ChatGPT may have the potential to be used in clinical care settings, such as primary care offices, for triaging, and in eye care practices to provide objective and quick diagnoses of patients with glaucoma.
The goal of this work was to explore the capabilities of Chat Generative Pretrained Transformer (ChatGPT) for provisional and differential diagnoses of different glaucoma phenotypes using specific case examples.
There was general agreement between ChatGPT and senior ophthalmology residents in final diagnoses.
ChatGPT was more general, while ophthalmology residents were more methodical and specific when listing differential diagnoses.
Introduction
Glaucoma is a common cause of irreversible blindness worldwide [1]. Managing glaucoma is challenging, and some patients will experience vision loss even after receiving treatment [2]. While intraocular pressure (IOP) is the only modifiable risk factor, glaucoma has multiple other risk factors, including older age, family history, race (or ethnicity), and myopia, among others [3,4,5,6,7,8,9]. As a disease caused by the complex interaction of these risk factors, glaucoma is difficult to identify, particularly in its earliest stages, owing to variations in physiologic characteristics, such as optic disc size, and confounding pathological signs, such as the presence of non-glaucomatous optic nerve diseases [10, 11].
Despite the existence of multiple tests for detecting glaucoma, diagnosis is still largely subjective, with great variability between clinicians. Based on findings of the Glaucoma Optic Neuropathy Evaluation Project [12], underestimating the vertical cup-to-disc ratio and cup shape, and missing retinal nerve fiber layer (RNFL) defects and disc hemorrhages, were key errors that led to underestimation of glaucoma likelihood. Similar challenges can also lead to glaucoma overestimation.
Artificial intelligence (AI) has progressed rapidly since the introduction of the first deep convolutional neural networks (CNNs) in ophthalmology [13, 14], and numerous studies have shown the effectiveness of AI models in glaucoma [15,16,17,18,19]. More recently, large language models (LLMs), particularly Chat Generative Pretrained Transformer (ChatGPT), have received significant interest in ophthalmology, as they have shown promise in understanding clinical knowledge and providing reasonable responses [20, 21]. The capabilities of ChatGPT in responding to multiple-choice questions from the United States Medical Licensing Examination (USMLE) have been investigated; ChatGPT not only responded correctly to over 50% of questions but also provided reasonable supporting explanations for its selected choices [22]. Further capabilities and limitations of ChatGPT in ophthalmology have been discussed elsewhere [23].
The objective of this study is to evaluate the capability of ChatGPT in diagnosing glaucoma from detailed case descriptions. We also aim to compare the output of ChatGPT with responses from senior ophthalmology residents to assess the potential of this technology for triaging glaucoma diagnosis.
Methods
Case Collection
We selected all cases with primary or secondary glaucoma from the publicly accessible database provided by the Department of Ophthalmology and Visual Sciences of the University of Iowa (https://webeye.ophth.uiowa.edu/eyeforum/cases.htm). Out of over 200 cases provided, 11 corresponded to primary or secondary glaucoma. These cases represented various common and uncommon glaucoma phenotypes, including primary juvenile glaucoma, normal-tension glaucoma, open-angle glaucoma, primary angle-closure glaucoma, pseudo-exfoliation glaucoma, pigment dispersion glaucoma, glaucomatocyclitic crisis glaucoma, aphakic glaucoma, neovascular glaucoma, aqueous misdirection glaucoma, and inflammatory glaucoma (Grant syndrome with trabeculitis). The description of each case included patient demographics, history of the presenting illness, chief complaint, relevant medical or ocular history, and examination findings. Institutional review board (IRB) approval was not required per the direction of our local IRB office, as we used a publicly accessible dataset with no patient information in this analysis.
ChatGPT
LLMs, such as Generative Pretrained Transformer 3 (GPT-3), can generate fluent, human-like text. GPT-3 was initially trained on a corpus of unannotated data comprising over 400 billion words collected from the Internet, including books, articles, and website text. ChatGPT (https://chat.openai.com/), a general-purpose LLM based on the GPT-3 architecture, is optimized for dialogue by OpenAI. ChatGPT can respond to textual questions with human-like fluency. In addition to answering textual questions, ChatGPT has other natural language processing (NLP) capabilities, including translation and text summarization.
ChatGPT Diagnosis
We input each case description into ChatGPT (version 3.5) and asked the model to provide a provisional diagnosis and a differential diagnosis list. Specifically, we first asked, “What is the most likely diagnosis?” and then asked, “What is the differential diagnosis?” (Fig. 1). We then evaluated the accuracy of ChatGPT against the correct diagnosis. As ChatGPT may learn from previous interactions, we recorded all responses from our first enquiry for the provisional and differential diagnoses.
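The two-step querying protocol above can be sketched as follows. Note that the study used the ChatGPT web interface, not an API; the chat-message structure and the helper function below are illustrative assumptions, not the authors' actual tooling.

```python
# Sketch of the two-step querying protocol: each case description is
# followed by two fixed questions, asked in order. The message format
# mirrors the common chat structure and is illustrative only.

PROVISIONAL_Q = "What is the most likely diagnosis?"
DIFFERENTIAL_Q = "What is the differential diagnosis?"


def build_prompts(case_text: str) -> list:
    """Return the two chat transcripts used per case: one requesting
    the provisional diagnosis, one requesting the differential list."""
    base = [{"role": "user", "content": case_text}]
    return [
        base + [{"role": "user", "content": PROVISIONAL_Q}],
        base + [{"role": "user", "content": DIFFERENTIAL_Q}],
    ]


# Each case was entered fresh, and only the responses to the first
# enquiry were recorded, so earlier answers could not bias later ones.
prompts = build_prompts("Hypothetical case text goes here.")
print(len(prompts))  # 2 transcripts per case
```

The key design point captured here is that the prompts were fixed and identical across cases, so differences in output reflect the case content rather than prompt wording.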
Additionally, we provided the same 11 cases, in a masked manner, to three senior ophthalmology residents at the Hamilton Eye Institute of the University of Tennessee Health Science Center and asked them to make provisional and differential diagnoses. Residents were not allowed to use ChatGPT or other similar tools. We then computed the frequency of correct provisional diagnoses for ChatGPT and the three ophthalmology residents and compared their accuracy. We also evaluated the agreement between ChatGPT and the three ophthalmology residents.
Results
ChatGPT made the correct diagnosis in eight out of 11 cases (72.7%), while the three ophthalmology residents correctly diagnosed 6/11 (54.5%), 8/11 (72.7%), and 8/11 (72.7%) cases, respectively. Table 1 shows the details of the provisional diagnoses provided by ChatGPT and the three senior ophthalmology residents.
Table 2 demonstrates the details of the differential diagnosis provided by ChatGPT and the three residents. ChatGPT consistently provided a greater number of differential diagnoses compared to ophthalmology residents.
The agreement between ChatGPT and the three ophthalmology residents was 9, 7, and 7 (out of 11 cases), respectively. In comparison, the agreement between the first and second, first and third, and second and third ophthalmology residents was 8, 8, and 11 (out of 11 cases), respectively.
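The accuracy and pairwise-agreement figures above reduce to simple per-case counts. A minimal sketch, using hypothetical placeholder diagnosis labels rather than the actual study data:

```python
# Minimal sketch of the accuracy and pairwise-agreement metrics
# reported above. The diagnosis labels below are placeholders,
# not the study's real case data.

def accuracy(pred, truth):
    """Fraction of cases where the provisional diagnosis is correct."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def agreement(rater_a, rater_b):
    """Number of cases on which two raters gave the same diagnosis."""
    return sum(a == b for a, b in zip(rater_a, rater_b))


truth   = ["POAG", "NTG", "PACG", "NVG"]  # hypothetical ground truth
chatgpt = ["POAG", "NTG", "PXG",  "NVG"]  # hypothetical model output
res1    = ["POAG", "PDS", "PXG",  "NVG"]  # hypothetical resident output

print(accuracy(chatgpt, truth))   # 3 of 4 correct -> 0.75
print(agreement(chatgpt, res1))   # same diagnosis in 3 of 4 cases
```

Note that agreement is counted on raw diagnoses, so two raters can agree on a case while both being wrong, which is why resident-resident agreement (up to 11/11) can exceed either rater's accuracy.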
Discussion
We prospectively investigated the capability of ChatGPT in diagnosing cases of primary and secondary glaucoma based on 11 cases collected from the University of Iowa online database. ChatGPT diagnosed eight cases correctly, which was better than the first ophthalmology resident (six correct diagnoses) and the same as the second and third residents (eight correct diagnoses each). The agreement between ChatGPT and the ophthalmology residents was relatively high (9, 7, and 7 out of 11 cases). Current medical residents are being trained in the midst of the AI, LLM, and ChatGPT era and are witnessing the rapid advancement and integration of AI technologies into healthcare to assist in management and diagnosis. Understanding the capabilities and limitations of these technologies will help in utilizing them effectively in glaucoma research and clinical practice.
A recent study showed that Isabel Pro [24], one of the most widely used and accurate diagnostic decision support systems, correctly diagnosed one out of ten general ophthalmology cases (10%), while ChatGPT correctly diagnosed nine (90%) [25]. We found that ChatGPT was 72.7% accurate in provisional diagnosis based on 11 cases of primary and secondary glaucoma. However, that dataset included general ophthalmology cases, while ours included glaucoma subspecialty cases with various uncommon and atypical phenotypes that are more challenging for ChatGPT to diagnose correctly. ChatGPT was incorrect in the diagnosis of three cases (glaucomatocyclitic crisis, aqueous misdirection, and inflammatory glaucoma), which are often considered uncommon and atypical glaucoma phenotypes. ChatGPT appears able to apply clinical knowledge even when provided with text information only, as it gave reasonable and relevant responses even in cases where the final response did not fully match the underlying disease. This is evident from the example case provided in Fig. 1: ChatGPT was able to link the patient's history of central retinal vein occlusion (CRVO) with the presenting information and correctly diagnose neovascular glaucoma secondary to CRVO.
We observed that ChatGPT was correct on all eight common glaucoma cases, while it was primarily incorrect on uncommon and atypical cases. It should be noted that common glaucoma cases are typically easier to diagnose than uncommon, atypical, and more complex ones, as evidenced by the fact that the ophthalmology residents were also unable to diagnose most of the uncommon cases correctly. Despite its ability to correctly diagnose most of the glaucoma cases, ChatGPT has several deficiencies as well. The current work only explored the capabilities of ChatGPT with specific case examples and detailed information that may not mimic the information available in real-world settings. Therefore, the results of this work are specific to case examples that are clear and organized. ChatGPT is also not currently capable of assessing multimodal data. Its inability to interpret diagnostic information such as fundus images or visual fields may limit nuanced assessment and the provision of specific diagnoses to healthcare workers in various clinical care settings. In the future, if equipped with such capabilities, ChatGPT may be highly beneficial in primary care facilities and emergency services with limited access to ophthalmologists or subspecialists for triaging patients with various eye diseases (e.g., initial diagnosis of glaucoma).
Advancement of LLMs, including ChatGPT, could benefit both comprehensive ophthalmic practices and more subspecialized clinics such as those focused on glaucoma. These models can make diagnoses objective, quick, accessible at any time, and more accurate. Glaucoma diagnosis is highly subjective, with poor agreement even among highly skilled glaucoma specialists, primarily due to significant overlap between physiological and pathological characteristics [26]. In fact, there have been several recent efforts to make glaucoma assessment more objective to reduce subjectivity and mitigate subsequent disagreement [19, 27]. As glaucoma continues to affect significant numbers of people in the US, primarily due to an aging population, and considering that resources, including the number of glaucoma specialists, are limited, it is critical to provide glaucoma care more efficiently and in a timely manner. As such, models like ChatGPT have potential for augmenting and streamlining glaucoma diagnosis and management, especially if future developments allow for incorporating multimodal data.
Another major advantage of ChatGPT and similar LLMs is that they can continue to learn through reinforcement learning, correcting previous mistakes, improving performance over time, and producing more accurate output. Moreover, LLMs require minimal human oversight and supervision for ongoing training, which is advantageous compared with supervised learning models that require intensive human oversight.
We observed that ChatGPT consistently provided a greater number of differential diagnoses than the ophthalmology residents. The lists provided by ChatGPT, however, were generally broad and simply enumerated other reasons why glaucoma might be present, whereas the residents were more methodical and pinpointed the diagnosis, which is what is required in clinical practice. It is worth noting again that the residents were able to review and interpret visual field (VF) printouts, fundus images, and optical coherence tomography (OCT) reports in the case descriptions, while ChatGPT cannot. This may enhance the residents' ability to pinpoint diagnoses, while text-only input might lead to more general differential diagnoses that lack the benefit of objective data from diagnostic testing.
Our study has several limitations. First, the number of cases was small; larger datasets with a greater number of cases should be explored in the future to validate the current findings. Second, we were unable to investigate the performance of Isabel Pro and compare it with ChatGPT, as Isabel Pro is unable to diagnose subspecialty phenotypes and is primarily designed for common general ophthalmology cases. Third, we did not access more advanced versions of GPT, such as GPT-4, to assess accuracy. More advanced versions are currently behind a paywall and are expected to be more accurate than the publicly available version 3.5 that we used. However, this could also be a strength of our study, as the publicly available version is more likely to be used than costly commercial versions. Another general limitation of LLMs like ChatGPT, not specific to glaucoma, is the inability to process multimodal data, which is a crucial aspect of medical diagnosis.
In summary, the potential for AI and LLMs to impact clinical care in ophthalmology is significant. Recently, ChatGPT has received attention and is being utilized in both ophthalmic education and clinical eye care. ChatGPT is capable of conversing with users and responding to text inquiries with fluent and reasonable responses. The performance of ChatGPT when given specific case examples is similar to that of trained ophthalmology residents, thus showing promise for utility across different healthcare settings as these systems evolve and allow for multimodal data interpretation. Future work should elucidate the utility of AI and LLMs for enhancing assessment and diagnosis of ophthalmic conditions in various healthcare settings (call centers, primary care offices, emergency rooms, and ophthalmology clinics).
Conclusions
Glaucoma is a complex and multifactorial disease, and no single test is highly accurate and reliable. Recent LLMs, particularly ChatGPT, may enhance glaucoma assessment. Based on 11 cases with common and uncommon glaucoma phenotypes, ChatGPT correctly diagnosed the disease process from case descriptions, with performance on par with or better than senior ophthalmology residents. As glaucoma diagnosis is typically challenging and requires several years of training, further refinement of ChatGPT may result in a tool that enhances diagnostic accuracy across a wide range of clinical settings and user experience levels.
Data Availability
The dataset is online and publicly available, provided by the Department of Ophthalmology and Visual Sciences of the University of Iowa (https://webeye.ophth.uiowa.edu/eyeforum/cases.htm).
References
Jonas JB, Aung T, Bourne RR, Bron AM, Ritch R, Panda-Jonas S. Glaucoma. Lancet. 2017;390(10108):2183–93.
Quigley HA, Vitale S. Models of open-angle glaucoma prevalence and incidence in the United States. Invest Ophthalmol Vis Sci. 1997;38(1):83–91.
Le A, Mukesh BN, McCarty CA, Taylor HR. Risk factors associated with the incidence of open-angle glaucoma: the visual impairment project. Invest Ophthalmol Vis Sci. 2003;44(9):3783–9.
Suzuki Y, Iwase A, Araie M, et al. Risk factors for open-angle glaucoma in a Japanese population: the Tajimi Study. Ophthalmology. 2006;113(9):1613–7.
Miglior S, Pfeiffer N, Torri V, Zeyen T, Cunha-Vaz J, Adamsons I. Predictive factors for open-angle glaucoma among patients with ocular hypertension in the European Glaucoma Prevention Study. Ophthalmology. 2007;114(1):3–9.
Wolfs RC, Klaver CC, Ramrattan RS, van Duijn CM, Hofman A, de Jong PT. Genetic risk of primary open-angle glaucoma. Population-based familial aggregation study. Arch Ophthalmol. 1998;116(12):1640–5.
McMonnies CW. Glaucoma history and risk factors. J Optom. 2017;10(2):71–8.
Landers J, Goldberg I, Graham SL. Analysis of risk factors that may be associated with progression from ocular hypertension to primary open angle glaucoma. Clin Exp Ophthalmol. 2002;30(4):242–7.
Lin CC, Hu CC, Ho JD, Chiu HW, Lin HC. Obstructive sleep apnea and increased risk of glaucoma: a population-based matched-cohort study. Ophthalmology. 2013;120(8):1559–64.
Hoffmann EM, Zangwill LM, Crowston JG, Weinreb RN. Optic disk size and glaucoma. Surv Ophthalmol. 2007;52(1):32–49.
Healey PR, Mitchell P. Optic disk size in open-angle glaucoma: the blue mountains eye study. Am J Ophthalmol. 1999;128(4):515–7.
O’Neill EC, Gurria LU, Pandav SS, et al. Glaucomatous optic neuropathy evaluation project: factors associated with underestimation of glaucoma likelihood. JAMA Ophthalmol. 2014;132(5):560–6.
Ting DS, Cheung GC, Wong TY. Diabetic retinopathy: global prevalence, major risk factors, screening practices and public health challenges: a review. Clin Exp Ophthalmol. 2016;44(4):260–77.
Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402–10.
Yousefi S. Clinical applications of artificial intelligence in glaucoma. J Ophthalmic Vis Res. 2023;18(1):97–112.
Huang X, Swaminathan S, Mohammadzadeh V, et al. Objective criteria for glaucoma progression boundaries derived using unsupervised machine learning. American Academy of Ophthalmology (AAO) Annual Meeting. 2023;In Press.
Yousefi S, Pasquale LR, Boland MV, Johnson CA. Machine-identified patterns of visual field loss and an association with rapid progression in the ocular hypertension treatment study. Ophthalmology. 2022;129(12):1402–11.
Thakur A, Goldbaum M, Yousefi S. Predicting glaucoma before onset using deep learning. Ophthalmol Glaucoma. 2020;3(4):262–8.
Medeiros FA, Jammal AA, Thompson AC. From machine to machine: an OCT-trained deep learning algorithm for objective quantification of glaucomatous damage in fundus photographs. Ophthalmology. 2019;126(4):513–21.
Nath S, Marie A, Ellershaw S, Korot E, Keane PA. New meaning for NLP: the trials and tribulations of natural language processing with GPT-3 in ophthalmology. Br J Ophthalmol. 2022;106(7):889–92.
Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023.
Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2): e0000198.
Antaki F, Touma S, Milad D, El-Khoury J, Duval R. Evaluating the performance of ChatGPT in ophthalmology: an analysis of its successes and shortcomings. Ophthalmol Sci. 2023;3(4): 100324.
Ren LY. Product: Isabel Pro – the DDX generator. J Can Health Libr Assoc. 2019;40(2):63–9.
Balas M, Ing EB. Conversational AI models for ophthalmic diagnosis: comparison of ChatGPT and the Isabel Pro differential diagnosis generator. JFO Open Ophthalmol. 2023;1: 100005.
Marks JR, Harding AK, Harper RA, et al. Agreement between specially trained and accredited optometrists and glaucoma specialist consultant ophthalmologists in their management of glaucoma patients. Eye (Lond). 2012;26(6):853–61.
Huang X, Saki F, Wang M, et al. An objective and easy-to-use glaucoma functional severity staging system based on artificial intelligence. J Glaucoma. 2022;31(8):626–33.
Funding
This work was supported by NIH Grants R01EY033005 (SY) and R21EY031725 (SY), grants from Research to Prevent Blindness (RPB), New York (SY), and support from the Hamilton Eye Institute (SY). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Publication for this submission, including the journal's Rapid Service Fee, was funded by the grants received.
Author information
Authors and Affiliations
Contributions
Mohammad Delsoz: Research design, data acquisition and research execution, data analysis and interpretation, manuscript preparation. Hina Raja: Research design. Yeganeh Madadi: Manuscript preparation. Anthony A. Tang: Data interpretation. Barbara M. Wirostko: Manuscript preparation. Malik Y. Kahook: Research design and manuscript preparation. Siamak Yousefi: Research design, data analysis and interpretation, manuscript preparation.
Corresponding author
Ethics declarations
Conflict of Interest
Mohammad Delsoz, Hina Raja, Yeganeh Madadi, Anthony A. Tang, and Malik Y. Kahook have nothing to disclose. Barbara M. Wirostko: Works for MyEyes LLC and provides consultation for Qlaris Bio and iCare. Siamak Yousefi: Received prototype instruments from Remidio, M&S Technologies, and Virtual Field. He provides consultation to InsightEye and Enolink.
Ethical Approval
Institutional review board (IRB) approval was not required per the direction of our local IRB office, as we used a publicly accessible dataset with no patient information in this analysis.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, which permits any non-commercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc/4.0/.
Delsoz, M., Raja, H., Madadi, Y. et al. The Use of ChatGPT to Assist in Diagnosing Glaucoma Based on Clinical Case Reports. Ophthalmol Ther 12, 3121–3132 (2023). https://doi.org/10.1007/s40123-023-00805-x