Abstract
Delayed cerebral ischemia (DCI) is a complication seen in patients with subarachnoid hemorrhage. It is a major predictor of poor outcomes and is often detected late. Machine learning models have been shown to be useful for early detection, but training such models suffers from small sample sizes owing to the rarity of the condition. Here we propose a federated learning (FL) approach to train a DCI classifier across three institutions, overcoming the challenges of sharing data across hospitals. We developed a framework for federated feature selection and built a federated ensemble classifier. We compared the performance of the FL model to that of separate models trained at each site. FL significantly improved performance at only two of the three sites; we traced this to differences in feature distributions across sites. FL improves performance at sites with similar feature distributions, but can worsen performance at sites with heterogeneous distributions. These results highlight both the benefit of FL and the need to assess dataset distribution similarity before conducting FL.
Keywords: Federated Learning, Stroke
I. INTRODUCTION
Almost 500,000 patients suffer from aneurysmal subarachnoid hemorrhage (SAH) worldwide annually [1]. Delayed cerebral ischemia (DCI) occurs in one third of SAH patients and is a leading cause of disability and death after SAH [1]–[3]. Despite this high burden, DCI is difficult to identify at onset because it manifests as loss-of-function symptoms, which are not readily observable. Current diagnostic tests suffer from poor sensitivity, dependence on expert evaluation, or lack of continuity, or carry high risk for the patient. Developing a better detection tool is necessary to reduce morbidity and mortality in these patients [3].
To address this gap, a data-driven DCI monitoring tool was previously proposed [4]. The tool offers hourly estimates of the patient's DCI risk, incorporating high-frequency physiologic data over time to detect DCI prior to onset [4]. The model was trained and tested on data from New York Presbyterian Hospital at Columbia University Medical Center (CUMC) and was further validated on data from the University of Texas Health Science Center at Houston (UTH) and University Hospital Rheinisch-Westfälische Technische Hochschule Aachen (Aachen). Although performance on the independent test data was reasonably high, patient characteristics in the external datasets differed from those at CUMC. This opened the possibility of creating a more robust model by training on all patient cohorts together. In this paper we use a distributed model training architecture, federated learning (FL), to train the model on all three datasets while keeping the data at their own locations. Our contributions are:
We propose a new federated feature selection method that enables joint feature selection across sites.
We demonstrate that dataset distribution shifts, in our setting driven by differences in healthcare practices, can degrade the performance of FL.
We show that personalizing FL by tailoring to specific subpopulations, based on site patient composition, enhances performance.
II. METHODS
We use the same patient cohort and preprocessing steps as described in [4]. Patients with aneurysmal SAH admitted to the neurological intensive care unit (NICU) were prospectively enrolled at CUMC from 2006–2014, at Aachen from 2018–2020, and at UTH from 2018–2019. Importantly, CUMC and UTH use one definition to diagnose DCI, while Aachen uses an additional definition, "perfusion" DCI [5], [6]. Six vital signs were collected: heart rate (HR), respiratory rate (RR), oxygen saturation (SPO2), and mean, systolic, and diastolic arterial blood pressure (BP-M, BP-S, BP-D).
We define the time anchor for positive cases as DCI event onset and as post-bleed day (PBD) 7 for negative cases. We create models using progressively more temporal data, with a cut-off every 12 hours from 1 to 6 days prior to the event. For example, model 1 (M1) uses all data up to 24 hours prior to the event, model 2 (M2) uses all data up to 36 hours prior to the event, and so on. We hypothesize that including data up to 6 days prior to the event will improve performance, as more temporal dynamics of the patient vitals can be calculated (see Supplementary Figure 1).
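To make the cut-off scheme concrete, the minimal sketch below (hypothetical variable names, not the authors' code) builds the 12-hour grid of cut-offs and restricts a patient's vitals to the data a given model is allowed to see:

```python
import numpy as np

# Cut-offs every 12 hours from 1 to 6 days before the anchor
# (DCI onset for positive cases, PBD 7 for negative cases).
cutoffs_h = np.arange(24, 145, 12)  # 24, 36, ..., 144 hours

def data_for_model(times_h, values, anchor_h, cutoff_h):
    """Keep only samples recorded up to `cutoff_h` hours before the anchor."""
    mask = np.asarray(times_h) <= anchor_h - cutoff_h
    return np.asarray(values)[mask]
```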
We employ a novel federated feature selection process to narrow the feature set from 138 to 70 features, matching the number used in [4]. To select the global features, each site computed F-statistics on its local dataset and sent the scores to the central server. A weighted average of the scores was computed, and the top 70 features were selected. This method can be interpreted as selecting the 70 features that explain the most variance across all three datasets, without ever pooling the datasets.
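A minimal sketch of this federated selection step follows; weighting each site's scores by its sample size is our assumption about how the weighted average was formed, and the function names are illustrative:

```python
import numpy as np
from sklearn.feature_selection import f_classif

def local_feature_scores(X, y):
    """Run at each site: F-statistic of every feature against the DCI label."""
    F, _ = f_classif(X, y)
    return np.nan_to_num(F), len(y)

def server_select_features(site_results, k=70):
    """Run at the server: sample-size-weighted average of the per-site
    F-statistics, keeping the indices of the top-k features."""
    scores = np.vstack([F for F, _ in site_results])          # (n_sites, n_features)
    weights = np.array([n for _, n in site_results], dtype=float)
    weights /= weights.sum()
    global_scores = weights @ scores                          # (n_features,)
    return np.argsort(global_scores)[::-1][:k]
```

Because only per-feature summary statistics leave each site, no patient-level data is exchanged.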
We use federated averaging (FedAvg) [7], the most established FL algorithm, to train our models. In FedAvg, model parameters are averaged in proportion to each client's sample size, i.e., sites with more samples are given more weight during averaging [7]. FL works best when data are independent and identically distributed (IID), i.e., homogeneous across sites. It has been shown that as datasets become heterogeneous, sites experience worse performance under FL than when training separate local models [8], [9]. Prior to training, we analyze the dataset distributions across the three sites to gauge data heterogeneity: we estimate the marginal distribution of the most important features selected during feature selection, conditioned on DCI status, and use the Jensen-Shannon divergence to quantify dissimilarity between sites.
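The two core computations here, FedAvg aggregation and the divergence estimate, are small; a sketch under our assumptions (the shared histogram binning for the divergence is our choice, not specified in the paper) is:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def fedavg_aggregate(client_params, client_sizes):
    """FedAvg: average clients' parameter vectors, weighted by sample size."""
    w = np.asarray(client_sizes, dtype=float)
    w /= w.sum()
    return sum(wi * np.asarray(p) for wi, p in zip(w, client_params))

def js_divergence(feat_site_a, feat_site_b, bins=30):
    """Jensen-Shannon divergence between one feature's empirical
    distributions at two sites, estimated on a shared histogram grid."""
    lo = min(feat_site_a.min(), feat_site_b.min())
    hi = max(feat_site_a.max(), feat_site_b.max())
    p, _ = np.histogram(feat_site_a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(feat_site_b, bins=bins, range=(lo, hi))
    return jensenshannon(p, q) ** 2  # scipy returns the JS *distance* (sqrt)
```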
We follow a modeling approach similar to Megjhani et al. [4], building an ensemble learning classifier from three supervised learning models: L2-regularized Logistic Regression (LR), linear Support Vector Machine (SVM), and Random Forest (RF). All three models feed into an Ensemble Classifier (EC) that uses soft voting for classification. For both LR and SVM, we employ algorithms optimized with stochastic gradient descent, enabling the use of FedAvg. For the random forest, we use a different mechanism, FedForest [10].
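A local-site sketch of this ensemble in scikit-learn is shown below. Soft voting needs probability estimates from every member, so we use SGD losses that expose predict_proba (log loss for LR, modified Huber as a smooth linear-SVM surrogate); these specific losses and hyperparameters are our assumptions, not the paper's:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import SGDClassifier

# SGD-optimized linear models, so their weights can be averaged with FedAvg.
lr = SGDClassifier(loss="log_loss", penalty="l2")         # L2 logistic regression
svm = SGDClassifier(loss="modified_huber", penalty="l2")  # smooth linear-SVM surrogate
rf = RandomForestClassifier(n_estimators=100)             # federated via FedForest [10]

ensemble = VotingClassifier(
    estimators=[("lr", lr), ("svm", svm), ("rf", rf)],
    voting="soft",  # average the three models' predicted probabilities
)
# Usage: ensemble.fit(X_train, y_train); ensemble.predict_proba(X_test)[:, 1]
```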
We set the total number of federated rounds to T = 20, with each client site running 20 local epochs per round. This amounts to 400 total training epochs, which is typically enough to reach convergence for models of this size. During each training round, clients measured performance on a local validation set and sent the model parameters achieving the highest validation performance to the central server. The central server used these local models to create the global model for the next round. After all training rounds were complete, the global model was sent to each site for local testing, with no further local training.
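The round structure can be illustrated with a runnable toy example (synthetic data; for brevity we omit the per-round best-on-validation selection described above and simply send each client's end-of-round weights):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
# Three toy "sites" of different sizes (synthetic stand-ins, illustration only).
sites = [(rng.normal(size=(n, 5)), rng.integers(0, 2, n)) for n in (200, 80, 60)]
sizes = np.array([len(y) for _, y in sites], dtype=float)
w = sizes / sizes.sum()  # FedAvg weights proportional to sample size

def local_round(coef, intercept, X, y, epochs=20):
    """One client round: warm-start from the global weights, run local epochs."""
    clf = SGDClassifier(loss="log_loss", penalty="l2")
    clf.partial_fit(X, y, classes=np.array([0, 1]))  # initialize parameter shapes
    clf.coef_[:], clf.intercept_[:] = coef, intercept
    for _ in range(epochs):
        clf.partial_fit(X, y)
    return clf.coef_.copy(), clf.intercept_.copy()

coef, intercept = np.zeros((1, 5)), np.zeros(1)  # global model
for t in range(20):                              # T = 20 federated rounds
    updates = [local_round(coef, intercept, X, y) for X, y in sites]
    coef = sum(wi * c for wi, (c, _) in zip(w, updates))       # FedAvg step
    intercept = sum(wi * b for wi, (_, b) in zip(w, updates))
```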
We compare the performance of four model training architectures: separate, joint, FL, and subpopulation FL. In separate learning, each site trains and tests its model independently; this is our baseline. Joint learning combines all datasets and trains a single model; this is the gold-standard comparator [11]. Subpopulation FL uses the same architecture as FL, but the model is trained only on the subset of patients with a modified Fisher scale (mFS) score of ≥3, allowing us to measure performance on only the most severe SAH patients. All model architectures were evaluated using the AU-ROC. We independently ran each model architecture 300 times and calculated the median AU-ROC and 95% confidence intervals via bootstrap.
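The evaluation metric can be sketched as follows; this is one plausible reading of the bootstrap procedure (resampling the test set with replacement), with illustrative names:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc(y_true, y_score, n_boot=300, seed=0):
    """Median AU-ROC and 95% CI from bootstrap resamples of the test set."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    rng = np.random.default_rng(seed)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:  # a resample needs both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, med, hi = np.percentile(aucs, [2.5, 50.0, 97.5])
    return med, (lo, hi)
```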
III. RESULTS
Figure 1 shows the distribution of days to outcome after SAH in the DCI+ group. For patients with DCI, the median number of days from bleeding to DCI at CUMC was 6 (IQR 5,8), at UTH 7 (IQR 6,8) and at Aachen 11 (IQR 8,13). At Aachen, “Perfusion” DCI, which triggered clinician intervention, was always identified on or before DCI with a mean difference of 1.5 days.
Figure 2a shows the distribution of the two most important features for each site in the DCI− group, and Figure 2b shows the Jensen-Shannon divergence for these distributions between each pair of sites. Distributions at Aachen are more likely to deviate from the other sites. This is most pronounced for the DCI− group and at time points closer to the anchor, i.e., at 24–96 hours. It is likely driven by differences in time to outcome, because Aachen uses two diagnostic criteria for DCI (see Figure 1 and [4]). As we use time from DCI as the time-point anchor, differences in onset can have a large impact on the physiologic signal of the patient. See Supplementary Table 2 for the list of features used in each model, and Supplementary Tables 3 and 4 for the Jensen-Shannon divergence of the top 5 features at each modeled time point (DCI+ and DCI− groups shown separately).
Figure 3 compares the test-set performance of three model training architectures: FL, separate, and joint. Overall, for CUMC and UTH, FL performs better than separate-site training but worse than joint training (the discrepancy is typically within the 5–10% acceptable limit); the improvement is more modest for CUMC. For Aachen, separate-site training produces the best-performing model, with the joint and FL models performing comparably. Across all FL models, the highest performance for CUMC is at 156 hours, AU-ROC = 0.76 (0.74–0.76); for UTH at 84 hours, AU-ROC = 0.85 (0.78–0.89); and for Aachen at 24 and 48 hours, AU-ROC = 0.71 (0.66–0.75). We also compare the performance of FL and subpopulation FL (patients with mFS ≥3), shown in Figure 4. For CUMC and Aachen, subpopulation FL consistently underperforms FL; for UTH, subpopulation training generally outperforms FL, likely because the UTH cohort is exclusively mFS ≥3. See Supplementary Tables 5–8 for a full breakdown of results by model, training type, and time point.
IV. DISCUSSION
We built a federated model to detect the onset of DCI from time-series physiological data across three sites: CUMC, UTH, and Aachen. The model leveraged federated feature selection and a federated ensemble classifier comprising LR, SVM, and RF, trained across all three sites. We show that for CUMC and UTH, FL outperforms separate-site training; at Aachen, however, FL underperforms separate-site training. Further, we show that a federated subpopulation model using only patients with mFS ≥3 improves performance at UTH, where all patients in the dataset had mFS ≥3.
We believe that the worse performance of FL at Aachen is driven by its more heterogeneous feature distributions compared to CUMC and UTH. In turn, the feature distributions are likely impacted by the differences in the time to outcome at Aachen vs. CUMC and UTH. As we use time to outcome to align all physiologic data, differences here can greatly impact feature distributions.
Our results highlight the importance of understanding hospital practices and clinical definitions before using FL [12], [13]. One way to overcome distribution differences among sites is personalized FL, which accounts for distribution shifts between centers [14]. Most such methods are extensions of the FedAvg algorithm; however, they often do not explicitly model distribution differences, instead leveraging loss-based regularization or gauging the similarity of model parameters to determine the alignment between centers. These models can improve performance at specific sites, and their utility warrants further exploration in healthcare. It is worth noting that joint training also underperforms single-site training at Aachen, suggesting the discrepancy was not due to limitations of FL but rather that Aachen's global minimum may differ from that of CUMC and UTH. Overall, we believe that assessing dataset distribution similarity across sites is a necessary step prior to FL [12], [13].
One major benefit of FL is the increase in effective sample size [15], [16]. Coupled with the absence of data transfer, this makes FL particularly desirable in healthcare [17]. We saw a performance gain over separate training for CUMC and UTH. UTH had the largest improvement from FL, as it could leverage the large sample size at CUMC during training; however, even for CUMC we show there is a benefit. We believe FL can be especially beneficial when data collection is time consuming due to the rarity of a condition (e.g., DCI) or is resource intensive [18]. Further, we show that training a model on a subpopulation of severe patients improves performance for UTH, because all patients at UTH are defined as severe. Prior studies have demonstrated that FL training on subpopulations, i.e., clustered FL, improves performance on healthcare tasks [19], [20]. Our findings further reiterate that this is a promising approach for personalizing FL in healthcare.
One limitation of our study is the small sample size, especially at UTH and Aachen. This leads to fluctuations in model performance across different DCI anchors, suggesting that the model is overfitting to CUMC. Although we attempted to mitigate this through the federated preprocessing step, we could further reduce CUMC's contribution by capping its weight during aggregation. Another challenge is that the amount of data before the anchor is inconsistent across patients, as time to DCI onset is highly variable. We mitigate this by aligning outcomes across patients, which maximizes the information in the physiologic signal leading up to the event. Finally, even with FL, there is a small risk of privacy leakage through model exchange [21], [22]. This can be remedied with the addition of a privacy-preserving technique such as differential privacy [22], [23].
Contributor Information
Ahmed Elhussein, Department of Biomedical Informatics, Columbia University, New York Genome Center, New York, NY, USA.
Murad Megjhani, Department of Neurology, Columbia University, New York, NY, USA.
Daniel Nametz, Department of Neurology, Columbia University, New York, NY, USA.
Miriam Weiss, Department of Neurosurgery, RWTH Aachen University, Aachen, Germany.
Jude Savarraj, Department of Neurology, UT Health, Houston, TX, USA.
Soon Bin Kwon, Department of Neurology, Columbia University, New York, NY, USA.
David J. Roh, Department of Neurology, Columbia University, New York, NY, USA.
Sachin Agarwal, Department of Neurology, Columbia University, New York, NY, USA.
E. Sander Connolly, Jr, Department of Neurology, Columbia University, New York, NY, USA.
Angela Velazquez, Department of Neurology, Columbia University, New York, NY, USA.
Jan Claassen, Department of Neurology, Columbia University, New York, NY, USA.
Huimahn A. Choi, Department of Neurosurgery, UT Health, Houston, TX, USA.
Gerrit A. Schubert, Department of Neurosurgery, Kantonsspital Aarau, Aarau, Switzerland.
Soojin Park, Department of Neurology, Department of Biomedical Informatics, Columbia University, New York, NY, USA.
Gamze Gürsoy, Department of Biomedical Informatics, Department of Computer Science, Columbia University, New York Genome Center, New York, NY, USA.
References
- [1] Claassen J and Park S, "Spontaneous subarachnoid haemorrhage," Lancet, vol. 400, no. 10355, pp. 846–862, Sep. 2022.
- [2] Eagles ME, Tso MK, and Loch Macdonald R, "Cognitive Impairment, Functional Outcome, and Delayed Cerebral Ischemia After Aneurysmal Subarachnoid Hemorrhage," World Neurosurgery, vol. 124, pp. e558–e562, 2019.
- [3] Schmidt JM et al., "Frequency and clinical impact of asymptomatic cerebral infarction due to vasospasm after subarachnoid hemorrhage," Journal of Neurosurgery, vol. 109, no. 6, pp. 1052–1059, 2008.
- [4] Megjhani M et al., "Dynamic Detection of Delayed Cerebral Ischemia: A Study in 3 Centers," Stroke, vol. 52, no. 4, pp. 1370–1379, Apr. 2021.
- [5] Steiner T et al., "European Stroke Organization guidelines for the management of intracranial aneurysms and subarachnoid haemorrhage," Cerebrovasc. Dis., vol. 35, no. 2, pp. 93–112, Feb. 2013.
- [6] Veldeman M et al., "Invasive neuromonitoring with an extended definition of delayed cerebral ischemia is associated with improved outcome after poor-grade subarachnoid hemorrhage," J. Neurosurg., vol. 134, no. 5, pp. 1527–1534, May 2020.
- [7] McMahan B, Moore E, Ramage D, et al., "Communication-efficient learning of deep networks from decentralized data," in Proc. 20th Int. Conf. on Artificial Intelligence and Statistics (AISTATS), 2017.
- [8] Zhao Y, Li M, Lai L, Suda N, Civin D, and Chandra V, "Federated Learning with Non-IID Data," arXiv [cs.LG], 02-Jun-2018.
- [9] Li Q, Diao Y, Chen Q, and He B, "Federated Learning on Non-IID Data Silos: An Experimental Study," in 2022 IEEE 38th International Conference on Data Engineering (ICDE), 2022, pp. 965–978.
- [10] Liu Y, Liu Y, Liu Z, Zhang J, Meng C, and Zheng Y, "Federated Forest," arXiv [cs.LG], 24-May-2019.
- [11] Lee GH and Shin S-Y, "Federated Learning on Clinical Benchmark Data: Performance Assessment," J. Med. Internet Res., vol. 22, no. 10, p. e20891, Oct. 2020.
- [12] Agniel D, Kohane IS, and Weber GM, "Biases in electronic health record data due to processes within the healthcare system: retrospective observational study," BMJ, vol. 361, Apr. 2018.
- [13] Hripcsak G and Albers DJ, "Correlating electronic health record concepts with healthcare process events," J. Am. Med. Inform. Assoc., vol. 20, no. e2, pp. e311–e318, Dec. 2013.
- [14] Tan AZ, Yu H, Cui L, and Yang Q, "Towards Personalized Federated Learning," IEEE Trans. Neural Netw. Learn. Syst., Mar. 2022.
- [15] Sheller MJ et al., "Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data," Sci. Rep., vol. 10, no. 1, p. 12598, Jul. 2020.
- [16] Sarma KV et al., "Federated learning improves site performance in multicenter deep learning without data sharing," J. Am. Med. Inform. Assoc., vol. 28, no. 6, pp. 1259–1264, Jun. 2021.
- [17] Adnan M, Kalra S, Cresswell JC, Taylor GW, and Tizhoosh HR, "Federated learning and differential privacy for medical image analysis," Sci. Rep., vol. 12, no. 1, p. 1953, Feb. 2022.
- [18] Rieke N et al., "The future of digital health with federated learning," npj Digital Medicine, vol. 3, no. 1, pp. 1–7, Sep. 2020.
- [19] Huang L, Shea AL, Qian H, Masurkar A, Deng H, and Liu D, "Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records," J. Biomed. Inform., vol. 99, p. 103291, Nov. 2019.
- [20] Elhussein A and Gursoy G, "Privacy-preserving patient clustering for personalized federated learning," arXiv [cs.LG], 17-Jul-2023.
- [21] Vepakomma P, Swedish T, Raskar R, Gupta O, and Dubey A, "No Peek: A Survey of private distributed deep learning," arXiv:1812.03288, Dec. 2018.
- [22] McMahan HB, Ramage D, Talwar K, and Zhang L, "Learning Differentially Private Recurrent Language Models," in International Conference on Learning Representations (ICLR), 2018.
- [23] Wei K et al., "Federated Learning With Differential Privacy: Algorithms and Performance Analysis," IEEE Trans. Inf. Forensics Secur., vol. 15, pp. 3454–3469, 2020.