Abstract
Objective
Electronic health records (EHR) data have become a central data source for clinical research. One concern with using EHR data is that the process through which individuals engage with the health system, and thereby find themselves within EHR data, can be informative. We have termed this process informed presence. In this study, we use simulation and real data to assess how informed presence can impact inference.
Materials and Methods
We first simulated a visit process where a series of biomarkers were observed informatively and uninformatively over time. We further compared inference derived from a randomized control trial (ie, uninformative visits) and EHR data (ie, potentially informative visits).
Results
We find that bias arises only when there is a strong association both between the biomarker and the outcome and between the biomarker and the visit process. Moreover, once there are some uninformative visits, this bias is mitigated. In the data example, we find that when the “true” associations are null, there is no observed bias.
Discussion
These results suggest that an informative visit process can exaggerate an association but cannot induce one. Furthermore, careful study design can mitigate the potential bias when some noninformative visits are included.
Conclusions
While there are legitimate concerns regarding biases that “messy” EHR data may induce, the conditions for such biases are extreme and can be accounted for.
Keywords: misclassification, electronic health records
INTRODUCTION
Electronic health records (EHR) have become a central data source for clinical research. Unlike large epidemiological cohorts or clinical trials, the data are often readily available to clinical researchers, facilitating the ability to easily ask and answer clinical questions. While appealing, EHR data come with many well documented concerns.1 One of the central concerns is the recognition that people do not interact with their health care provider randomly, but primarily when they are sick. This can lead to a form of selection bias that we have termed informed presence.2,3 As others have noted, sick patients have more data within EHR systems4 and most patterns of missing data can be considered informative missingness.5,6
One of the strengths of EHR data is that they contain information on patients over time. However, it is also possible that informative visits can magnify biases when considering longitudinal information. Consider a study that seeks to investigate the relationship between a biomarker (eg, systolic blood pressure) and a clinical event (eg, myocardial infarction) among a sample of patients. The biomarker is typically measured whenever the patient has a clinical visit. The concern arises that if a patient only visits their doctor when they are sick (ie, the biomarker level is elevated), the inference may be biased. When this occurs, there is effect modification between the actual and observed biomarker. That is, we more frequently observe the biomarker when a patient is sick because s/he visits the clinic. This can lead to bias in the estimated association between the marker and the clinical event. This is illustrated as a causal diagram in Figure 1.
This potential informative visit process raises concerns for using EHR data for clinical research. However, there is suggestion that such concerns may be unnecessary. Both theoretical and simulation based work, in the context of mixed effects models, have shown that while individual random effects can be severely biased, the (arguably) more important slope parameters are not biased.7 These results suggest there may be contexts where we can get valid inference from data with an informative visit process.
In this article, we intend to examine this scenario further by considering 2 questions. First, how do the underlying associations—both for the visit process and the biomarker process—impact the potential for bias? Second, can the design of the study—both the selected cohort and the analytic model—be used to mitigate any potential biases? To investigate this, we first use simulation to generate a visit process where the biomarker is observed informatively. We then use real data, derived from both a randomized clinical trial, in which the visit process is known and uninformative, and from an EHR, in which the visit process is unknown but potentially informative, to assess the impact.
SIMULATION STUDY MATERIALS AND METHODS
We designed a simulation study to assess the impact of informative presence on parameter estimates. We first describe the general simulation framework and then detail the specific settings. As motivation, we used the scenario where one has a biomarker measured repeatedly over time and wants to assess the relationship of that biomarker with time to some outcome. We first simulated the typically unobserved, complete biomarker data for a person. For simplicity, we conceived that each person’s marker varied as a random walk. Specifically:
$$X_{i,t} = X_{i,t-1} + \epsilon_{i,t}, \qquad \epsilon_{i,t} \sim N(0, \sigma^2) \tag{1}$$
where a person’s biomarker value at a given time t only depends on the biomarker value at the previous time t-1. We consider this the daily unobserved biomarker value for a person. For example, if the biomarker under study is blood pressure, a person has a daily average blood pressure measurement. This measurement may fluctuate but on average, the population has no particular trajectory.
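A minimal sketch of this random walk in Python (the normal innovation and the `sigma` parameter are our assumptions; the article does not state the noise distribution):

```python
import numpy as np

def simulate_biomarker(n_days=730, sigma=1.0, rng=None):
    """Daily latent biomarker as a random walk: X_t = X_{t-1} + e_t."""
    rng = rng or np.random.default_rng(0)
    steps = rng.normal(0.0, sigma, size=n_days)
    # cumulative sum: each value depends only on the previous one
    return np.cumsum(steps)

x = simulate_biomarker()  # 730 daily values, ie, "2 years" of follow-up
```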
After producing the complete data, we produced the observed data. We considered 2 forms of observed data: noninformative and informative. To create noninformatively observed data, we chose data points randomly under a Poisson process with a fixed and common rate (see specific details). This corresponds to having patients come in to a clinic visit under a fixed, predetermined time scale, such as a yearly visit. To create informatively observed data, we made the probability of observing the biomarker value at time t, a function of the biomarker itself:
$$\operatorname{logit}\{P(V_{i,t} = 1)\} = \gamma_0 + \beta_{ip} X_{i,t} \tag{2}$$

where $V_{i,t}$ indicates whether person $i$'s biomarker is observed (ie, a visit occurs) on day $t$.
This corresponds to the scenario where we are more likely to observe data when the marker value is more extreme, ie, a patient’s health is poor. As described subsequently, we also consider a mixture of these 2 processes where a patient has both informative and noninformative visits.
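The two observation processes can be sketched as follows (intercept and slope values here are illustrative, not the article's):

```python
import numpy as np

def visit_prob(x, gamma0=-4.0, beta_ip=0.5):
    """Equation 2 on the probability scale: visit chance rises with the biomarker."""
    return 1.0 / (1.0 + np.exp(-(gamma0 + beta_ip * np.asarray(x))))

def observe_informative(x, rng=None, **kw):
    """Draw daily visit indicators with biomarker-dependent probability."""
    rng = rng or np.random.default_rng(1)
    p = visit_prob(x, **kw)
    return rng.random(p.size) < p

def observe_noninformative(n_days, rate=1 / 90, rng=None):
    """Independent visits at a fixed rate, about one every 90 days."""
    rng = rng or np.random.default_rng(2)
    return rng.random(n_days) < rate
```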
After simulating the covariate visit process, we simulated the outcome. We generated time-series data by allowing the probability of an event on each day to depend on the observed biomarker on that day. Additionally, we included a person specific underlying probability of the event:
$$\operatorname{logit}\{P(Y_{i,t} = 1)\} = \beta_0 + b_i + \beta_{assoc} X_{i,t} \tag{3}$$
This is a basic mixed effects model where the biomarker is a fixed effect and each person, i, has a random intercept. Equations 2 and 3 correspond to the causal diagram in Figure 1.
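A sketch of the outcome process, returning the day of the first event (the intercept `beta0 = -6` is an illustrative choice to keep events rare; the article varies this value):

```python
import numpy as np

def simulate_event_day(x, b_i=0.0, beta0=-6.0, beta_assoc=np.log(1.25), rng=None):
    """Equation 3: daily event probability with person-specific intercept b_i."""
    rng = rng or np.random.default_rng(3)
    p = 1.0 / (1.0 + np.exp(-(beta0 + b_i + beta_assoc * np.asarray(x))))
    events = rng.random(p.size) < p
    # follow-up stops at the first event; otherwise the person is censored
    return int(np.argmax(events)) if events.any() else int(p.size)
```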
Simulation design
Using the general framework, we performed the following simulations:
Complete data: all visits from equation 1 were observed;
Noninformative observed: visits were observed based on an independent Poisson process with λ = 90 days between visits on average (ie, a visit every 3 months, as recommended for diabetic patients);
Informative observed, confounding scenario: visits were observed based on equation 2.
Mixed visits: all patients had informative visits and a varying percentage of patients had noninformative visits (0%, 10%, 30%, 50%, 70%, 100%).
In simulation 3, it is possible to generate many sequential visits. To make the process more realistic, we removed any visits that occurred within 30 “days” of one another, keeping just the first. For each of the above scenarios we simulated n = 1000 patients and performed 500 replications. We ran the random walk sequence for each person for 730 steps to correspond to “2 years” of follow-up. In the base scenario, we set the underlying biomarker relationship (βassoc in equation 3) to log(1.25). To assess the impact of informed presence, we varied both βip and βassoc across a coarse grid of values. We also varied the intercept in equation 3 to obtain a realistic number of encounters, as in Neuhaus et al.8
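The 30-day thinning rule described above can be sketched as a simple greedy pass over the sorted visit days:

```python
def thin_visits(visit_days, min_gap=30):
    """Keep the first visit, then drop any visit within min_gap days of the
    most recently kept one."""
    kept = []
    for day in sorted(visit_days):
        if not kept or day - kept[-1] >= min_gap:
            kept.append(day)
    return kept

thin_visits([0, 10, 40, 45, 100])  # -> [0, 40, 100]
```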
Simulation analysis
Using the previously simulated datasets, we estimated the association between the biomarker and the outcome. While we simulated our data under a random effects model, we analyzed the data under a time-varying Cox model, as this is more natural for analyzing time to an event. Therefore, we expect the estimated parameter (ie, hazard ratio [HR]), in the case with no informed presence, to be slightly different from the true value (ie, odds ratio). To account for this, we generated a large population dataset under the full process (n = 100 000) to obtain a value for the “true” HR. To analyze the data, we set the observed data up in counting process format, adding an additional record each time a new marker was observed.9 In this format, we assume that only the most recently observed value is associated with the outcome, which corresponds with our simulation scenario.
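A minimal sketch of building counting-process data for one person (column names and helper are ours; the last observed marker is carried forward until the next visit, per the description above):

```python
import pandas as pd

def counting_process(pid, visit_days, markers, event_day, max_day=730):
    """One (start, stop] interval per observed marker; the event flag is set
    on the interval that ends at the event day."""
    end = min(event_day, max_day)
    pairs = [(d, m) for d, m in zip(visit_days, markers) if d < end]
    rows = []
    for i, (day, marker) in enumerate(pairs):
        stop = pairs[i + 1][0] if i + 1 < len(pairs) else end
        rows.append({"id": pid, "start": day, "stop": stop,
                     "marker": marker, "event": int(stop == event_day)})
    return pd.DataFrame(rows)
```

A data frame in this format can then be passed to a time-varying Cox routine (eg, `coxph` with `Surv(start, stop, event)` in R, which the article used, or lifelines' `CoxTimeVaryingFitter` in Python).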
Across each simulation scenario, we calculated the average bias relative to the true HR averaging across the 500 runs. Next, we assessed whether controlling for the number of encounters could mitigate any observed bias. At each observed time point, we calculated the number of observed values (ie, visits) over the previous 365 time steps (ie, days).
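The 365-step encounter count used as an adjustment covariate can be computed as:

```python
def encounters_in_lookback(visit_days, t, lookback=365):
    """Count visits in the window (t - lookback, t]."""
    return sum(1 for d in visit_days if t - lookback < d <= t)

encounters_in_lookback([10, 200, 400, 500], t=500)  # -> 3
```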
All analyses were done in R 3.4.2 (R Foundation for Statistical Computing, Vienna, Austria). Simulation and analytic code are presented in the Supplementary Appendix.
RESULTS
Figure 2 shows histograms of the beta coefficients across the different simulation scenarios. When all visits are observed (Figure 2A) or visits are prescheduled (Figure 2B), there is no observed bias. However, when visits are observed informatively (Figure 2C), there is a positive bias. Following the scenario from Figure 2C, we varied both the degree of informativeness of the biomarker as well as the underlying association of the biomarker with the outcome. These results are shown in Figure 3A as heatmaps displaying the degree of bias. The degree of bias increases as both the degree of informativeness (ie, the magnitude of βip) and the strength of association (ie, the magnitude of βassoc) increase. Interestingly, when there was no underlying association between the biomarker and the outcome, we did not observe any bias, regardless of the strength of the informativeness. As a sensitivity analysis, we explored the impact of sample size, assessing sample sizes of 500, 1000, and 5000 people. We found that the average bias is unaffected by the sample size (Supplementary Figure 1), while the variability in bias decreases with the sample size (Supplementary Figure 2).
We next combined simulation scenarios 2 and 3 by mixing people with both informative and noninformative visits (Figure 3B-F). As the percentage of the population with noninformative (ie, prescheduled) visits increases, the degree of bias decreases. In particular, once at least 30% of the population has noninformative visits, there is minimal observed bias. We then considered what happened when we adjusted for the number of encounters in the previous 365 days (Figure 4). When doing this, the observed bias was attenuated, though not fully eliminated. We also varied the lookback period from 0 to 100 days (Supplementary Figure 3), finding that by a lookback of 100 days most of the bias was attenuated.
DATA EXAMPLE STUDY MATERIALS AND METHODS
The previous simulations suggest that bias due to informed presence is primarily a concern when the underlying association between the biomarker and outcome is strong. Moreover, this bias can be mitigated by adjusting for the number of previous visits. We used real data to explore this in more detail. Specifically, we used data from a randomized control trial (RCT) and compared association results to those observed from the Duke University Health System (DUHS) EHR. Our underlying assumption is that an RCT, with prescheduled visits, should represent unbiased associations, as in simulation scenario 2.
Data used
RCT data
As the RCT data, we used data from the NAVIGATOR (Nateglinide And Valsartan in Impaired Glucose Tolerance Outcomes Research) trial. NAVIGATOR was a 2 × 2 factorial trial comparing the effects of 2 therapies, nateglinide and valsartan, on the incidence of diabetes and cardiovascular events.10 Patients were eligible for the NAVIGATOR trial if they had impaired glucose tolerance as defined by a fasting plasma glucose level of at least 95 mg/dL (5.3 mmol/L) but <126 mg/dL (7.0 mmol/L) and 1 or more cardiovascular risk factors (if 55 years of age or older) or known cardiovascular disease (if 50 years of age or older). Cardiovascular risk factors included smoking status, hypertension, reduced high-density lipoprotein cholesterol, elevated low-density lipoprotein cholesterol, family history of premature coronary heart disease, left ventricular hypertrophy, and microalbuminuria. Based on the trial design, patients had quarterly visits over a median of 6 years. Biomarker tests were performed at each visit. Patients were adjudicated for a variety of outcomes, such as incident diabetes and renal events. To assess our overall hypothesis that potential bias is largest under stronger associations, we assessed the association between a variety of biomarkers and time to diabetes and renal dysfunction. In the NAVIGATOR trial, diabetes was defined as a fasting glucose ≥126 mg/dL or a 2-hour postchallenge glucose ≥200 mg/dL confirmed by repeat testing within 12 weeks, or a suspected case confirmed by the Diabetes Endpoint Adjudication Committee. Renal dysfunction was defined as renal failure, renal transplant, dialysis, or an estimated glomerular filtration rate <30 mL/min/1.73 m2. We assessed the association of systolic blood pressure, diastolic blood pressure, creatinine, glucose, potassium, weight, and cholesterol with each of these outcomes. We tested these associations only among the RCT placebo group.
To make associations comparable, we standardized each variable to have unit variance.
EHR data
As our EHR data, we abstracted data from the DUHS EHR system. DUHS is a large academic medical system comprising 3 hospitals and a network of outpatient clinics. DUHS is the primary provider in Durham County, and an estimated 80% of Durham County residents receive their primary care through DUHS, providing a population-level view of the health care received.11
To compare results to the NAVIGATOR sample, we identified a cohort of patients with prediabetes. In brief, to identify prediabetics, we selected patients with a glycated hemoglobin A1C between 5.7% and 6.4%, with the condition that they never had a previous result greater than or equal to 6.5%. We considered the first encounter with a hemoglobin A1C in this range as the index date. We included patients seen between 2010 and 2016. To ensure that DUHS was a patient’s medical home, we limited our analysis to individuals who lived in Durham County and had at least 2 encounters in the 2 years before the index date.
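The index-date rule above can be sketched as a single pass over a patient's A1C history (function and argument names are illustrative):

```python
def index_date(a1c_history):
    """First A1C in [5.7, 6.4] with no earlier result >= 6.5, else None.

    a1c_history: (date, value) tuples sorted by date.
    """
    for date, value in a1c_history:
        if value >= 6.5:
            return None  # a prior diabetes-range result excludes the patient
        if 5.7 <= value <= 6.4:
            return date
    return None
```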
Starting from the index date, we followed patients for time to diabetes and renal dysfunction, respectively. We defined diabetes as the presence of an International Classification of Diseases-Ninth Revision code of 250 or an International Classification of Diseases-Tenth Revision code of E11. We considered a definition that included fasting glucose or a glucose challenge test, as in the NAVIGATOR trial, but found that the results were sparse and unreliable. Previous work has validated the use of diagnosis codes for phenotyping diabetes within EHR samples.12 We defined renal dysfunction as renal failure, renal transplant, dialysis, or a glomerular filtration rate <30 mL/min/1.73 m2. To account for potential loss to follow-up, we used December 31, 2015, as the administrative censoring date and 2016 as a burn-out period.3 Specifically, if a patient had an encounter in 2016, we presumed they were still living locally, and they were administratively censored. If they did not have an encounter in 2016, we censored them at the last encounter date.
Analytic strategy
We estimated the association between each biomarker and outcome. We used a time-varying covariate Cox model, updating the biomarker value at each encounter. We performed the analysis unadjusted, minimally adjusted (age, race, and sex), and additionally adjusted for the number of encounters in the previous year. To ensure comparability, we applied the standardizations from the RCT population to the EHR cohort. We emphasize that these are simply association analyses for illustrative purposes and that a fuller analysis would need to take additional confounding and selection factors into account.
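Applying the RCT-derived standardization to both cohorts can be sketched as follows (the sample values are illustrative):

```python
import numpy as np

def unit_variance_scaler(rct_values):
    """Build a per-SD scaler from the RCT sample so hazard ratios in both
    cohorts are expressed per one RCT standard deviation."""
    sd = np.std(np.asarray(rct_values, dtype=float), ddof=1)
    return lambda values: np.asarray(values, dtype=float) / sd

scale = unit_variance_scaler([120.0, 130.0, 140.0])  # illustrative SBP values, SD = 10
```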
To induce a potentially more informative scenario, we performed 2 sensitivity analyses. First, we used only biomarker values that were collected in the emergency department (ED). We have previously shown that using data from the ED can lead to selection effects.3 Second, we defined a source population comprising nonlocal patients, under the assumption that these patients, too, should have more informative visits.
RESULTS
Table 1 shows basic clinical characteristics for those in the RCT and EHR samples, respectively. We note that there were meaningful demographic differences. Moreover, the EHR-based sample had meaningfully more encounters (10.9 vs 2.6 per person-year), with more between-person variability. Overall, this reinforces the previously described differences between RCT- and EHR-based samples.13
Table 1.
| Characteristic | NAVIGATOR trial (N = 4675) | EHR (N = 18 011) |
|---|---|---|
| Demographics | | |
| Age, y | 63.0 (58.0-69.0) | 52.0 (42.0-62.0) |
| Female | 2397 (51.3) | 11 599 (64.4) |
| Race | | |
| White | 3885 (83.1) | 5697 (31.6) |
| Black | 123 (2.6) | 9913 (55.0) |
| Other | 667 (14.3) | 2401 (13.3) |
| Encounter summary | | |
| Encounters per person-year | 2.6 (2.3-2.7) | 10.9 (6.1-18.3) |
| SBP measurements per person-year | 2.6 (2.3-2.7) | 6.1 (3.7-10.2) |
| SBP | 136 (126-147) | 128 (117-140) |
| During outpatient encounters | | 127 (116-139) |
| During inpatient encounters | | 127 (113-143) |
| During ED encounters | | 136 (122-151) |
| Potassium measurements per person-year | 1.4 (1.3-1.6) | 1.6 (1.0-2.9) |
| Potassium | 4.3 (4.1-4.6) | 4.0 (3.7-4.3) |
| During outpatient encounters | | 4.0 (3.7-4.3) |
| During inpatient encounters | | 3.9 (3.6-4.3) |
| During ED encounters | | 3.8 (3.6-4.1) |
| Outcomes summary | | |
| Renal dysfunction | 126 (2.7) | 1530 (8.5) |
Values are median (interquartile range) or n (%).
ED: emergency department; EHR: electronic health record; NAVIGATOR: Nateglinide And Valsartan in Impaired Glucose Tolerance Outcomes Research; SBP: systolic blood pressure.
Using both the RCT and EHR data, we compared the time-updated HRs for time to a renal adverse event based on various biomarker values (Table 2). We found that within the RCT sample, creatinine had a moderately strong relationship with renal failure (HR, 1.37; 95% CI, 1.32-1.42). Performing the same analysis in the EHR sample, we found a more extreme effect estimate (HR, 1.53; 95% CI, 1.50-1.56), suggesting that there may be bias. Adjusting for the number of previous encounters had minimal impact on the effect estimates. There was a slight association between potassium levels and renal events. This was well replicated by the EHR sample (HR, 1.14 vs 1.11). No other clinical covariates were significantly associated with renal adverse events within the RCT sample. Moreover, all of the effect estimates from the EHR sample were similar to those found in the RCT sample. In the ED-only population, however, we did see exaggerated effects for the systolic blood pressure and potassium associations.
Table 2.
| Description | NAVIGATOR: Unadjusted | NAVIGATOR: Adjusted^a | All local: Unadjusted | All local: Adjusted^a | All local: Adjusted+^b | Local ED: Unadjusted | Local ED: Adjusted^a | Local ED: Adjusted+^b | Nonlocal: Unadjusted | Nonlocal: Adjusted^a | Nonlocal: Adjusted+^b |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1: SBP | 1.03 (0.84-1.27) | 0.94 (0.76-1.16) | 1.10 (1.05-1.16) | 0.99 (0.94-1.04) | 1.02 (0.97-1.07) | 0.87 (0.80-0.95) | 0.81 (0.74-0.88) | 0.82 (0.75-0.89) | 1.00 (0.92-1.09) | 0.89 (0.82-0.98) | 0.96 (0.88-1.05) |
| 2: DBP | 0.87 (0.70-1.07) | 0.94 (0.76-1.16) | 0.81 (0.77-0.85) | 0.94 (0.89-0.99) | 0.96 (0.91-1.01) | 0.89 (0.82-0.98) | 0.97 (0.88-1.06) | 0.97 (0.89-1.07) | 0.74 (0.68-0.81) | 0.85 (0.78-0.93) | 0.90 (0.83-0.98) |
| 3: Creatinine | 1.36 (1.31-1.41) | 1.37 (1.32-1.42) | 1.58 (1.56-1.61) | 1.52 (1.49-1.54) | 1.53 (1.50-1.56) | 1.41 (1.38-1.45) | 1.36 (1.33-1.40) | 1.37 (1.33-1.41) | 1.58 (1.53-1.62) | 1.50 (1.46-1.55) | 1.55 (1.50-1.59) |
| 4: Glucose | 1.09 (0.92-1.30) | 1.12 (0.94-1.32) | 1.21 (1.17-1.26) | 1.19 (1.14-1.24) | 1.17 (1.12-1.22) | 1.04 (0.98-1.11) | 1.00 (0.93-1.06) | 1.00 (0.93-1.06) | 1.32 (1.25-1.40) | 1.30 (1.23-1.38) | 1.28 (1.20-1.36) |
| 5: Potassium | 1.17 (1.03-1.33) | 1.14 (1.00-1.29) | 1.26 (1.18-1.34) | 1.11 (1.04-1.18) | 1.11 (1.04-1.19) | 1.42 (1.29-1.58) | 1.25 (1.13-1.39) | 1.25 (1.13-1.38) | 1.34 (1.20-1.49) | 1.14 (1.02-1.27) | 1.14 (1.03-1.28) |
| 6: Weight | 1.07 (0.87-1.30) | 1.15 (0.91-1.44) | 0.89 (0.85-0.93) | 1.04 (0.99-1.09) | 1.04 (0.99-1.09) | 0.92 (0.84-0.99) | 1.10 (1.01-1.20) | 1.11 (1.02-1.21) | 0.81 (0.75-0.87) | 0.90 (0.82-0.97) | 0.94 (0.86-1.02) |
| 7: Cholesterol | 0.80 (0.66-0.98) | 0.88 (0.72-1.08) | 0.91 (0.87-0.94) | 0.94 (0.91-0.98) | 0.95 (0.92-0.99) | 0.85 (0.75-0.96) | 0.87 (0.77-0.98) | 0.89 (0.78-0.99) | 0.92 (0.86-0.97) | 0.97 (0.91-1.03) | 0.97 (0.91-1.03) |
Values are estimate (95% confidence interval).
DBP: diastolic blood pressure; NAVIGATOR: Nateglinide And Valsartan in Impaired Glucose Tolerance Outcomes Research; SBP: systolic blood pressure.
^a Adjusted for age, sex, and race.
^b Adjusted for age, sex, race, and number of previous encounters.
We repeated the analysis using diabetes as an outcome (Table 3). Glucose, weight, systolic blood pressure, potassium, and cholesterol were all significantly associated with time to diabetes in the RCT sample. There was no difference in effect estimates within the EHR sample, suggesting no obvious bias. In the ED-only sample, the systolic blood pressure, glucose, and cholesterol associations all showed a degree of bias.
Table 3.
| Description | NAVIGATOR: Unadjusted | NAVIGATOR: Adjusted^a | All local: Unadjusted | All local: Adjusted^a | All local: Adjusted+^b | Local ED: Unadjusted | Local ED: Adjusted^a | Local ED: Adjusted+^b | Nonlocal: Unadjusted | Nonlocal: Adjusted^a | Nonlocal: Adjusted+^b |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1: SBP | 1.07 (1.02-1.13) | 1.08 (1.02-1.14) | 1.10 (1.06-1.14) | 1.10 (1.06-1.14) | 1.10 (1.06-1.14) | 0.90 (0.84-0.97) | 0.91 (0.84-0.97) | 0.91 (0.85-0.98) | 1.07 (1.01-1.12) | 1.06 (1.00-1.12) | 1.09 (1.03-1.15) |
| 2: DBP | 1.03 (0.97-1.08) | 1.01 (0.96-1.07) | 1.09 (1.05-1.13) | 1.08 (1.04-1.12) | 1.09 (1.06-1.13) | 1.01 (0.95-1.09) | 0.98 (0.92-1.06) | 0.98 (0.92-1.06) | 1.06 (1.01-1.12) | 1.06 (1.00-1.12) | 1.09 (1.03-1.15) |
| 3: Creatinine | 1.01 (0.96-1.07) | 1.01 (0.95-1.07) | 1.03 (0.99-1.06) | 1.03 (0.99-1.07) | 0.99 (0.96-1.03) | 1.09 (1.05-1.13) | 1.09 (1.04-1.13) | 1.09 (1.04-1.13) | 1.03 (0.98-1.08) | 1.04 (0.98-1.10) | 0.96 (0.92-1.02) |
| 4: Glucose | 1.42 (1.38-1.45) | 1.42 (1.39-1.45) | 1.42 (1.38-1.46) | 1.44 (1.40-1.48) | 1.43 (1.39-1.46) | 1.21 (1.15-1.28) | 1.24 (1.18-1.31) | 1.24 (1.18-1.31) | 1.40 (1.34-1.45) | 1.41 (1.36-1.47) | 1.40 (1.34-1.46) |
| 5: Potassium | 0.94 (0.89-1.00) | 0.94 (0.89-1.00) | 0.94 (0.89-0.98) | 0.96 (0.91-1.01) | 0.97 (0.92-1.02) | 1.01 (0.92-1.11) | 1.03 (0.93-1.13) | 1.02 (0.93-1.12) | 0.98 (0.91-1.06) | 1.01 (0.93-1.09) | 1.02 (0.95-1.11) |
| 6: Weight | 1.27 (1.21-1.33) | 1.29 (1.22-1.36) | 1.20 (1.17-1.23) | 1.24 (1.21-1.27) | 1.24 (1.21-1.27) | 1.19 (1.13-1.26) | 1.19 (1.12-1.26) | 1.20 (1.13-1.27) | 1.17 (1.13-1.22) | 1.21 (1.16-1.26) | 1.22 (1.17-1.26) |
| 7: Cholesterol | 0.92 (0.87-0.98) | 0.92 (0.87-0.98) | 0.98 (0.96-1.00) | 0.99 (0.96-1.01) | 0.99 (0.97-1.02) | 1.06 (0.97-1.17) | 1.07 (0.97-1.18) | 1.07 (0.97-1.18) | 1.00 (0.96-1.04) | 1.01 (0.97-1.05) | 1.01 (0.97-1.05) |
Values are estimate (95% confidence interval).
DBP: diastolic blood pressure; NAVIGATOR: Nateglinide And Valsartan in Impaired Glucose Tolerance Outcomes Research; SBP: systolic blood pressure.
^a Adjusted for age, sex, and race.
^b Adjusted for age, sex, race, and number of previous encounters.
DISCUSSION
As has been noted by others, informative visit processes observed in typical EHR data can bias association results. In this study, we used simulation to explore the contexts in which this bias is observed and how it may be mitigated. We followed this up with a real data example, comparing an RCT-based association in which we would expect minimal informativeness to an EHR-based association.
One of our key findings is that there is no induced bias when the underlying biomarker association is weak or null. This implies that informed presence can exacerbate an effect estimate but not induce an effect. This corresponds to general theory on differential misclassification. As has been noted, differential misclassification can bias associations both toward and away from the null.14,15 We can consider an informative visit process as a form of differential misclassification, in which people who are sicker have more information than those who are not. This becomes the source of the bias. We can observe this in the directed acyclic graph in Figure 1, in which an informed clinic visit serves as an effect modifier, which ultimately impacts the association between the misclassification and the outcome. That is, sicker individuals, who are more likely to have the outcome, have more clinic visits and thereby more accurate observed biomarker values than healthy individuals, who are less likely to have the outcome. However, under the null, when there is no association between the biomarker and the outcome, there is no difference in the accuracy of the observed biomarkers between those who do and do not have the outcome, removing the possibility of bias. Our simulations suggest the biomarker association has to be moderately strong (HR >1.5) for there to be bias.
Mitigating bias
Within our simulations, we noted 2 ways that this bias can be mitigated: by design and by analysis. If some fraction of the sample has noninformative, prescheduled visits, the bias is attenuated. The mechanism for this can also be observed in Figure 1: when visits are no longer informative, the potential for effect modification of observation on the true biomarker relationship is lost. In many disease settings, it is reasonable to assume that a sizable fraction of patients have regular clinic visits in addition to any informative visits. Moreover, depending on the sophistication of the health record, it is possible to check whether a patient visit was prescheduled or not. More importantly, one can further amend the design of the study by excluding data from likely informative visits (ie, emergency department visits). As has been observed previously,3 emergency department visits represent a selection process that can lead to biased results.
We also assessed how analytically controlling for the number of previous encounters can mitigate the bias. In our simulation we found that controlling for the number of previous encounters attenuated but did not completely remove the bias. This is not surprising because, in our simulations, the number of encounters is correlated with the degree of informativeness.
Our data example supported some of these findings. Using the assumption that RCT encounters consist exclusively of noninformative visits, we compared inference on biomarker associations with renal adverse events and diabetes in both RCT and EHR data. As expected, we found minimal potential bias when comparing null associations. For the creatinine-renal association, there was substantial bias within the EHR data. However, for the diabetes associations, there was no bias. One potential explanation for the discrepancy is that creatinine is not routinely measured at every visit, making it a more likely “informative” lab test. Conversely, glucose and weight, the 2 strongest diabetes associations, are more regularly measured, particularly among a prediabetic population. This highlights an additional consideration: it is not sufficient to consider only whether the visit is informative, but also whether the lab tests are informative as well. As other work has noted, the number of times a lab test is taken is often predictive of adverse outcomes.16
Another reason we may not have noticed substantial bias is that, by design, our analytic sample consisted of a local patient population that was receiving regular care at our facility. As such, these individuals are likely to have noninformative visits intermixed with potentially informative ones. As our simulations indicated, once at least 30% of the population has noninformative visits, the biasing effects are attenuated. When we restricted our biomarker data to ED measurements only, we did see some induced bias. We did not note any bias when using a nonlocal patient population.
We also considered the impact of adjusting for the number of encounters as a way to control for the informativeness. In our data example, adjusting for the number of encounters had minimal impact on the observed effect estimates. This is in contrast to both our simulation results and previous work by our group,2 suggesting that it is not a universal solution. More importantly, we did not note any induced bias by controlling, which has been noted in other work.17 We acknowledge that there are likely a number of additional confounding differences between the 2 samples and that these results are primarily illustrative, though they do support our overall hypothesis. More generally, we find that RCT data provide a useful juxtaposition to EHR data to understand informative processes and potential biases.
Results in context
Informative visit processes have been recognized in other contexts, most notably in regard to informative censoring.18 Regarding EHR data, a recent review found that 86% of studies did not report enough information to determine whether the visit process was informative.19 Solutions to this problem have been explored from different angles. In a set of articles, McCulloch et al7 and Neuhaus et al8 looked at outcome-dependent processes in which the outcome was longitudinally measured and random effects models were used to model the relationship between the covariate and the outcome. In their first article,7 they noted that while the random effects terms were often biased, the slope parameters were not. In a later article,8 they compared a range of estimation approaches and found that maximum likelihood-based approaches were best. Similar to our results, the authors also noted that inclusion of noninformative visits mitigated bias. However, in contrast to our work, they found that accounting for the number of encounters had no impact. In a similar study design, Gasparini et al17 assessed joint models for mitigating bias. They also found that controlling for the number of encounters made bias worse. It is possible that this adjustment is only useful when the predictor, as opposed to the outcome, is informatively observed. Finally, Hernán et al20 assessed an inverse probability weighting approach to account for informative visits. This work highlights some of the distinctions between data in the EHR context and the typical observational data context. The authors draw a distinction between known (preplanned) and dynamic (informative) visits. In their scenario, individuals have an underlying, intended, known visit process that becomes dynamic due to missed visits.
This expected underlying visit process allows for the estimation of visit probability via an inverse probability weighting estimator. Conversely, when working with health system data, many people do not have any prescheduled visits, so the notion of visit probability does not translate as well. Instead, for these individuals, all or most of their visits are potentially informative. Moreover, the concern in the EHR context is not missing visits, but rather potentially too many visits. That being said, as we noted in our simulations, when prescheduled visits do occur across the study sample, much of the bias is mitigated. This highlights the importance of recognizing the distinction between informative and scheduled visits.
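These 2 points, that informatively sampled patients are an unrepresentative slice of the population but that prescheduled visits or reweighting by visit probability can recover it, can be sketched in a toy simulation. The code below is an illustration of our own design, not code from the study; the logistic slope of 2 for visit probability and the 50% scheduling rate are arbitrary assumptions, and in practice the visit probabilities used for weighting would need to be estimated rather than known.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Latent biomarker, standardized so the population mean is 0
x = rng.normal(size=n)

# Informative visits: the probability of presenting rises with the biomarker
p_visit = 1 / (1 + np.exp(-2 * x))
visited = rng.random(n) < p_visit

# Uninformative (prescheduled) visits: the same probability for everyone
scheduled = rng.random(n) < 0.5

# Naive mean among informative visitors overstates the population mean
naive = x[visited].mean()

# Mixing in prescheduled visits pulls the observed mean back toward 0
mixed = x[visited | scheduled].mean()

# Inverse probability weighting with the (here, known) visit probabilities
# approximately recovers the population mean
ipw = np.average(x[visited], weights=1 / p_visit[visited])

print(naive, mixed, ipw)
```

The informative sample alone yields a markedly inflated biomarker mean, the mixed sample sits partway back toward the truth, and the weighted estimate is close to 0, mirroring the qualitative pattern described above.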
Limitations and future work
There are some notable limitations of our work. First, most of our conclusions are based on simulations that are inherently simplified models. In particular, we take a simplified approach to the disease-generation process, relating just the most recent value to the outcome. Additional and more complex work is needed to further elucidate these issues. Second, our analysis only focuses on an informative covariate process. Other work has assessed an informative outcome process, and the results from those studies suggest that different issues arise in those settings. Third, while our results seemed to be confirmed by the data analysis, the data example was inherently a toy illustration of our scenario. Last, it should be noted that both the predictors and the outcomes were assessed differently between the RCT and EHR samples, which may lead to some inconsistencies.
CONCLUSION
As EHR data become more widely available, it is likely that they will become the primary data source for clinical research. As such, it is important for there to be more work understanding and characterizing potential biases in their use. While we illustrated that there is potential for biased associations, our results suggest that informative visits can only inflate an association, not induce one. This corresponds to general theory on differential misclassification. Additionally, the bias is mitigated when a fraction of patients have noninformative visits, something that can be accounted for within the study design phase. Last, the bias can be partially mitigated by controlling for the number of previous encounters. This, combined with other work on optimal statistical models, suggests that while informative visits need to be acknowledged and accounted for, they also may not completely undermine one's analysis.
FUNDING
This work was supported by National Institute of Diabetes and Digestive and Kidney Diseases career development award K25 DK097279 (to BAG). The project described was supported by the National Center for Advancing Translational Sciences, National Institutes of Health, through grant award number UL1TR001117 at Duke University. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The NAVIGATOR trial was funded by Novartis.
AUTHOR CONTRIBUTIONS
BAG designed the analysis and drafted the manuscript. MP performed the analyses. NJP and SBP edited the manuscript and provided critical feedback. All authors approve of the final version of the manuscript.
SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.
ACKNOWLEDGMENTS
We thank the NAVIGATOR trial steering committee and investigators for access to the NAVIGATOR trial data.
Conflict of interest statement
None declared.
REFERENCES
- 1. Hersh WR, Weiner MG, Embi PJ, et al. Caveats for the use of operational electronic health record data in comparative effectiveness research. Med Care 2013; 51 (8 Suppl 3): S30–37.
- 2. Goldstein BA, Bhavsar NA, Phelan M, Pencina MJ. Controlling for informed presence bias due to the number of health encounters in an electronic health record. Am J Epidemiol 2016; 184 (11): 847–55.
- 3. Phelan M, Bhavsar NA, Goldstein BA. Illustrating informed presence bias in electronic health records data: how patient interactions with a health system can impact inference. EGEMS (Wash DC) 2017; 5 (1): 22.
- 4. Weiskopf NG, Rusanov A, Weng C. Sick patients have more data: the non-random completeness of electronic health records. AMIA Annu Symp Proc 2013; 2013: 1472–7.
- 5. Haneuse S, Daniels M. A general framework for considering selection bias in EHR-based studies: what data are observed and why? EGEMS (Wash DC) 2016; 4 (1): 1203.
- 6. Wells BJ, Chagin KM, Nowacki AS, Kattan MW. Strategies for handling missing data in electronic health record derived data. EGEMS (Wash DC) 2013; 1 (3): 1035.
- 7. McCulloch CE, Neuhaus JM, Olin RL. Biased and unbiased estimation in longitudinal studies with informative visit processes. Biometrics 2016; 72 (4): 1315–24.
- 8. Neuhaus JM, McCulloch CE, Boylan RD. Analysis of longitudinal data from outcome-dependent visit processes: failure of proposed methods in realistic settings and potential improvements. Stat Med 2018; 37 (29): 4457–71.
- 9. Therneau TM, Grambsch PM. Modeling Survival Data: Extending the Cox Model. New York, NY: Springer; 2000.
- 10. Califf RM, Boolell M, Haffner SM, et al. Prevention of diabetes and cardiovascular disease in patients with impaired glucose tolerance: rationale and design of the Nateglinide And Valsartan in Impaired Glucose Tolerance Outcomes Research (NAVIGATOR) trial. Am Heart J 2008; 156 (4): 623–32.
- 11. Miranda ML, Ferranti J, Strauss B, Neelon B, Califf RM. Geographic health information systems: a platform to support the “triple aim”. Health Aff (Millwood) 2013; 32 (9): 1608–15.
- 12. Spratt SE, Pereira K, Granger BB, et al. Assessing electronic health record phenotypes against gold-standard diagnostic criteria for diabetes mellitus. J Am Med Inform Assoc 2017; 24 (e1): e121–8.
- 13. Weng C, Li Y, Ryan P, et al. A distribution-based method for assessing the differences between clinical trial target populations and patient populations in electronic health records. Appl Clin Inform 2014; 5 (2): 463–79.
- 14. Brenner H, Loomis D. Varied forms of bias due to nondifferential error in measuring exposure. Epidemiology 1994; 5 (5): 510–7.
- 15. Wacholder S, Hartge P, Lubin JH, Dosemeci M. Non-differential misclassification and bias towards the null: a clarification. Occup Environ Med 1995; 52 (8): 557–8.
- 16. Goldstein BA, Pomann GM, Winkelmayer WC, Pencina MJ. A comparison of risk prediction methods using repeated observations: an application to electronic health records for hemodialysis. Stat Med 2017; 36 (17): 2750–63.
- 17. Gasparini A, Abrams KR, Barrett JK, et al. Mixed effects models for healthcare longitudinal data with an informative visiting process: a Monte Carlo simulation study. arXiv 2019 Jul 25 [E-pub ahead of print].
- 18. Wu MC, Bailey KR. Estimation and comparison of changes in the presence of informative right censoring: conditional linear model. Biometrics 1989; 45 (3): 939–55.
- 19. Farzanfar D, Abumuamar A, Kim J, Sirotich E, Wang Y, Pullenayegum E. Longitudinal studies that use data collected as part of usual care risk reporting biased results: a systematic review. BMC Med Res Methodol 2017; 17 (1): 133.
- 20. Hernán MA, McAdams M, McGrath N, Lanoy E, Costagliola D. Observation plans in longitudinal studies with time-varying treatments. Stat Methods Med Res 2009; 18 (1): 27–52.