Key Summary Points

The goal of this work was to explore the capabilities of Chat Generative Pretrained Transformer (ChatGPT) for provisional and differential diagnoses of different glaucoma phenotypes using specific case examples.

There was general agreement between ChatGPT and senior ophthalmology residents in final diagnoses.

ChatGPT's differential diagnoses were more general, while those of the ophthalmology residents were more methodical and specific.

Introduction

Glaucoma is a common cause of irreversible blindness worldwide [1]. Managing glaucoma is challenging, and some patients will experience vision loss even after receiving treatment [2]. While intraocular pressure (IOP) is the only modifiable risk factor, glaucoma has multiple other risk factors including older age, family history, race (or ethnicity), and myopia, among others [3,4,5,6,7,8,9]. Because the disease arises from complex interactions among these risk factors, glaucoma is difficult to identify, particularly in its earliest stages, owing to variation in physiologic characteristics such as optic disc size and to confounding pathological signs such as non-glaucomatous optic nerve diseases [10, 11].

Despite the existence of multiple tests for detecting glaucoma, diagnosis remains largely subjective, with great variability between clinicians. Based on the findings of the Glaucoma Optic Neuropathy Evaluation Project [12], underestimation of the vertical cup-to-disc ratio and cup shape, together with missed retinal nerve fiber layer (RNFL) defects and disc hemorrhages, were key errors that led to underestimation of glaucoma. Similar challenges can also lead to glaucoma overestimation.

Artificial intelligence (AI) has progressed rapidly since the introduction of the first deep convolutional neural networks (CNN) in ophthalmology [13, 14], and numerous studies have shown the effectiveness of AI models in glaucoma [15,16,17,18,19]. More recently, large language models (LLMs), particularly Chat Generative Pretrained Transformer (ChatGPT), have received significant interest in ophthalmology, as they have shown promise in understanding clinical knowledge and providing reasonable responses [20, 21]. The capabilities of ChatGPT in responding to multiple-choice questions from the United States Medical Licensing Examination (USMLE) were investigated, and it was reported that ChatGPT not only responded correctly to over 50% of questions but also provided reasonable supporting explanations for the selected choices [22]. Further capabilities and limitations of ChatGPT in ophthalmology have been discussed elsewhere [23].

The objective of this study is to evaluate the capability of ChatGPT in diagnosing glaucoma from detailed case descriptions. We also aim to compare the output of ChatGPT with responses from senior ophthalmology residents to identify the capabilities of this technology for potential use in triaging glaucoma diagnosis.

Methods

Case Collection

We selected all cases with primary or secondary glaucoma from the publicly accessible database provided by the Department of Ophthalmology and Visual Sciences of the University of Iowa (https://webeye.ophth.uiowa.edu/eyeforum/cases.htm). Of the over 200 cases provided, 11 corresponded to primary or secondary glaucoma. These cases covered various common and uncommon glaucoma phenotypes including primary juvenile glaucoma, normal-tension glaucoma, open-angle glaucoma, primary angle-closure glaucoma, pseudo-exfoliation glaucoma, pigment dispersion glaucoma, glaucomatocyclitic crisis, aphakic glaucoma, neovascular glaucoma, aqueous misdirection glaucoma, and inflammatory glaucoma (Grant syndrome with trabeculitis). Descriptions of each case included patient demographics, history of the presenting illness, chief complaint, relevant medical or ocular history, and examination findings. Institutional review board (IRB) approval was not required per the direction of our local IRB office, as we used a publicly accessible dataset with no patient-identifying information in this analysis.

ChatGPT

LLMs, such as Generative Pretrained Transformer 3 (GPT-3), can generate fluent, human-like text. GPT-3 was trained on a corpus of unannotated data comprising over 400 billion words collected from the Internet, including books, articles, and website text. ChatGPT (https://chat.openai.com/), a general-purpose LLM based on the GPT-3 architecture, was optimized for dialogue by OpenAI. ChatGPT responds to textual questions with human-like fluency. Beyond answering textual questions, ChatGPT has other natural language processing (NLP) capabilities, including translation and text summarization.

ChatGPT Diagnosis

We input each case description into ChatGPT (version 3.5) and asked the model to provide a provisional diagnosis and a differential diagnosis list. Specifically, we first asked, “What is the most likely diagnosis?” and then asked, “What is the differential diagnosis?” (Fig. 1). We then evaluated the accuracy of ChatGPT's diagnoses against the correct diagnosis. As ChatGPT may learn from previous interactions, we recorded all responses from our first inquiry of the provisional and differential diagnoses.
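The two-step prompting procedure above can be sketched programmatically. Note this is an illustrative sketch only: the study queried ChatGPT through its web interface, so the OpenAI Python client, the `gpt-3.5-turbo` model name, and the `case_description` placeholder below are assumptions, not the study's actual setup.

```python
# Illustrative sketch only: the study used the ChatGPT web interface,
# not the API. Client usage and model name here are assumptions.

PROVISIONAL_Q = "What is the most likely diagnosis?"
DIFFERENTIAL_Q = "What is the differential diagnosis?"


def build_messages(case_description: str, question: str) -> list[dict]:
    """Assemble a fresh single-turn prompt: case text plus one question.

    Each case is sent in a new conversation so that earlier cases cannot
    influence the response (mirroring the study's first-inquiry rule).
    """
    return [{"role": "user", "content": f"{case_description}\n\n{question}"}]


def ask_model(case_description: str) -> tuple[str, str]:
    """Return (provisional diagnosis, differential diagnosis) responses."""
    # Requires `pip install openai` and an OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()
    answers = []
    for question in (PROVISIONAL_Q, DIFFERENTIAL_Q):
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo",  # assumed stand-in for ChatGPT 3.5
            messages=build_messages(case_description, question),
        )
        answers.append(reply.choices[0].message.content)
    return answers[0], answers[1]
```

Sending each question in a fresh, single-turn conversation is the simplest way to keep cases independent, which is the concern raised in the paragraph above.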

Fig. 1

A sample case description input into the ChatGPT model and corresponding responses

Additionally, we provided the same 11 cases, in a masked manner, to three senior ophthalmology residents at the Hamilton Eye Institute of the University of Tennessee Health Science Center and asked them to make provisional and differential diagnoses. The residents were not allowed to use ChatGPT or other similar tools when making a diagnosis. We then computed the frequency of correct provisional diagnoses for ChatGPT and for the three residents and compared their accuracy. We also evaluated the agreement between ChatGPT and the three ophthalmology residents.
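The accuracy and agreement measures described above reduce to simple counts over the 11 cases. A minimal sketch follows; the diagnosis labels in it are hypothetical placeholders for illustration, not the study's actual case data.

```python
def accuracy(predictions, truths):
    """Fraction of cases whose provisional diagnosis matches the ground truth."""
    return sum(p == t for p, t in zip(predictions, truths)) / len(truths)


def pairwise_agreement(rater_a, rater_b):
    """Number of cases on which two raters give the same provisional diagnosis."""
    return sum(a == b for a, b in zip(rater_a, rater_b))


# Hypothetical labels for illustration only (not the study data).
truth = ["POAG", "NTG", "NVG"]
chatgpt = ["POAG", "POAG", "NVG"]
resident = ["POAG", "NTG", "uveitic glaucoma"]

print(round(accuracy(chatgpt, truth), 3))     # fraction of correct diagnoses
print(pairwise_agreement(chatgpt, resident))  # cases in agreement
```

Exact string matching is a simplification; in practice, judging whether a free-text response matches the correct diagnosis requires clinical review, as was done in the study.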

Results

ChatGPT made the correct diagnosis in eight out of 11 cases (~ 72.7%), while ophthalmology residents correctly diagnosed 6/11 (~ 54.5%), 8/11 (~ 72.7%), and 8/11 (~ 72.7%) cases, respectively. Table 1 shows the details of the provisional diagnosis provided by ChatGPT and three senior ophthalmology residents.

Table 1 Provisional diagnoses provided by ChatGPT and three senior ophthalmology residents

Table 2 demonstrates the details of the differential diagnosis provided by ChatGPT and the three residents. ChatGPT consistently provided a greater number of differential diagnoses compared to ophthalmology residents.

Table 2 Differential diagnoses provided by ChatGPT and three senior ophthalmology residents

The agreement between ChatGPT and the three ophthalmology residents was 9, 7, and 7 out of 11 cases, respectively. By comparison, the agreement between the first and second, first and third, and second and third ophthalmology residents was 8, 8, and 11 out of 11 cases, respectively.

Discussion

We prospectively investigated the capability of ChatGPT in diagnosing cases with primary and secondary glaucoma phenotypes based on 11 cases collected from the University of Iowa online database. ChatGPT diagnosed eight cases correctly, which was better than the first ophthalmology resident (six correct) and the same as the second and third ophthalmology residents (eight correct each). The agreement between ChatGPT and the ophthalmology residents was relatively high (9, 7, and 7 out of 11 cases). Current medical residents are being trained in the era of AI, LLMs, and ChatGPT and are witnessing the rapid advancement and integration of AI technologies into healthcare to assist in management and diagnosis. Understanding the capabilities and limitations of these technologies will help in utilizing them effectively in glaucoma research and clinical practice.

A recent study showed that Isabel Pro [24], one of the most widely used and most accurate diagnostic decision-support systems, correctly diagnosed one out of ten general ophthalmology cases (10%), while ChatGPT correctly diagnosed nine of the ten cases (90%) [25]. We found that ChatGPT was 72.7% accurate in provisional diagnosis based on 11 cases with primary and secondary glaucoma phenotypes. However, that dataset included general ophthalmology cases, whereas ours included glaucoma subspecialty cases with various uncommon and atypical phenotypes, which are more challenging for ChatGPT to diagnose correctly. ChatGPT was incorrect in the diagnosis of three cases (glaucomatocyclitic crisis, aqueous misdirection, and inflammatory glaucoma), which are often considered uncommon and atypical glaucoma phenotypes. ChatGPT appears able to apply clinical knowledge even when provided with text information only, as it gave reasonable and relevant responses even in cases where the final answer did not fully match the underlying disease. This is evident from the example case in Fig. 1, in which ChatGPT linked the patient's history of central retinal vein occlusion (CRVO) with the presenting information and correctly diagnosed neovascular glaucoma secondary to CRVO.

We observed that ChatGPT was correct on all eight common glaucoma cases but incorrect on the uncommon and atypical cases. It should be noted that common glaucoma cases are typically easier to diagnose than uncommon, atypical, and more complex cases, as evidenced by the fact that the ophthalmology residents were also unable to diagnose most of the uncommon cases correctly. Despite its ability to diagnose most of the glaucoma cases correctly, ChatGPT has several deficiencies as well. The current work explored its capabilities only with specific case examples containing detailed information that may not mimic the information available in real-world settings; the results are therefore specific to clear, well-organized case examples. ChatGPT is also not currently capable of assessing multimodal data. Its inability to interpret diagnostic information such as fundus images or visual fields may limit nuanced assessment and the provision of specific diagnoses to healthcare workers in various clinical care settings. In the future, if equipped with such capabilities, ChatGPT may be highly beneficial in primary care facilities and emergency services with limited access to ophthalmologists or subspecialists for triaging patients with various eye diseases (e.g., initial diagnosis of glaucoma).

Advancement of LLMs, including ChatGPT, could benefit both comprehensive ophthalmic practices and more subspecialized clinics such as those focused on glaucoma. These models could make diagnosis more objective, rapid, accessible at any time, and more accurate. Glaucoma diagnosis is highly subjective, with poor agreement even among highly skilled glaucoma specialists, primarily due to significant overlap between physiological and pathological characteristics [26]. In fact, there have been several recent efforts to make glaucoma assessment more objective in order to reduce subjectivity and mitigate subsequent disagreement [19, 27]. As glaucoma continues to affect significant numbers of people in the US, primarily due to an aging population, and considering that resources, including the number of glaucoma specialists, are limited, it is critical to provide glaucoma care more efficiently and in a timely manner. As such, models like ChatGPT have potential for augmenting and streamlining glaucoma diagnosis and management, especially if future developments allow for incorporating multimodal data.

Another potential advantage of ChatGPT and similar LLMs is that they can be refined through reinforcement learning from human feedback, correcting previous mistakes and improving performance over time, leading to more accurate output. Moreover, LLMs require minimal human annotation for pretraining, which is advantageous compared with supervised learning models that require intensive human labeling.

We observed that ChatGPT consistently provided a greater number of differential diagnoses than the ophthalmology residents. However, the lists provided by ChatGPT were generally related diagnoses, simply listing other reasons why glaucoma might be present, whereas the residents were more methodical and pinpointed the diagnosis, which is what is required in clinical practice. It is worth noting again that the residents were able to review and interpret the visual field (VF) printouts, fundus images, and optical coherence tomography (OCT) reports included in the case descriptions, whereas ChatGPT cannot. Access to these data may enhance the ability to pinpoint diagnoses, while text-only input may lead to more general differential diagnoses that lack the benefit of objective data from diagnostic testing.

Our study does have several limitations. First, the number of cases was small; larger datasets with more cases should be explored in the future to validate the current findings. Second, we were unable to investigate the performance of Isabel Pro and compare it with ChatGPT, as Isabel Pro is unable to diagnose subspecialty phenotypes and is primarily designed for common general ophthalmology cases. Third, we did not assess more advanced versions of GPT, such as GPT-4, which are currently behind a paywall and expected to be more accurate than the publicly available version 3.5 that we used. However, this could also be a strength of our study, as the publicly available version is more likely to be used than costly commercial versions. Another general limitation of LLMs like ChatGPT, not specific to glaucoma, is the inability to process multimodal data, which is a crucial aspect of medical diagnosis.

In summary, the potential for AI and LLMs to impact clinical care in ophthalmology is significant. Recently, ChatGPT has received attention and is being utilized in both ophthalmic education and clinical eye care. ChatGPT is capable of conversing with users and responding to text inquiries with fluent and reasonable responses. Its performance on specific case examples is similar to that of trained ophthalmology residents, showing promise for utility across different healthcare settings as these systems evolve and allow for multimodal data interpretation. Future work should further elucidate the utility of AI and LLMs for enhancing the assessment and diagnosis of ophthalmic conditions in various healthcare settings (call centers, primary care offices, emergency rooms, and ophthalmology clinics).

Conclusions

Glaucoma is a complex and multifactorial disease, and no single test is highly accurate and reliable for its diagnosis. Recent LLMs, particularly ChatGPT, may enhance glaucoma assessment. Across 11 cases with common and uncommon glaucoma phenotypes, ChatGPT correctly diagnosed the disease process from case descriptions, with performance on par with or better than that of senior ophthalmology residents. As glaucoma diagnosis is typically challenging and requires several years of training, further refinement of ChatGPT may yield a tool that enhances diagnostic accuracy across a wide range of clinical settings and user experience levels.