Introduction

The global pandemic caused by the novel coronavirus SARS-CoV-2 was first reported in 2019. The WHO had reported 486,761,597 infections and 6,142,735 deaths as of 1 April 2022 [1]. A wide spectrum of health complications has been noticed in patients with severe COVID-19. Conditions reported after severe COVID-19 include black fungus, cardiac arrest, temporary paralysis, joint pain and respiratory disorders [2]. The SARS-CoV-2 virus has been found responsible for acute respiratory distress in a large number of COVID-19 cases [3]. Clinical and radiographic reports suggest the onset of pulmonary fibrosis, a common course after SARS infection. It is known as a sequela of persistent damage to the lung or acute respiratory distress syndrome (ARDS). Pulmonary fibrosis emerges as a serious complication of pneumonia, which leads to impaired lungs or dyspnea [4]. Various clinical studies indicate a link between COVID-19 and respiratory disorders, which sometimes lead to mortality. It has been noticed that the inflammatory mechanism starts around 60–90 days after hospital discharge in severe COVID-19 cases. The lungs become scarred over time, and symptoms like dry cough, tiredness, shortness of breath, nail clubbing and weight loss have been noticed in the patients [5, 6].

Pulmonary fibrosis gradually converts the normal parenchyma of the lungs into fibrotic tissue. The scarred tissue decreases the oxygen capacity and leaves the lungs stiff and restrictive. There are various risk factors associated with the development of fibrosis after COVID-19 [7, 8]. The first risk factor is an extended stay in the ICU and the use of mechanical ventilation. While the severity of the condition is linked to the amount of time spent in the ICU, mechanical ventilation increases the risk of ventilator-induced lung injury (VILI). This injury is caused by abnormal pressure or volume settings, which trigger the production of pro-inflammatory modulators, exacerbating acute lung injury and leading to higher mortality or pulmonary fibrosis in survivors [9]. Increased disease severity is the second risk factor, which includes comorbidities like hypertension, diabetes, and coronary artery disease, as well as lab abnormalities like lymphopenia, leukocytosis, and high lactate dehydrogenase (LDH). Following acute lung injury, the level of serum LDH has been utilized as a measure of disease severity. It is a marker for lung tissue loss that is linked to a higher risk of death [10]. According to the World Health Organization, 80 percent of SARS-CoV-2 infections are mild, 14 percent cause severe symptoms, and 6 percent result in death. The third risk factor includes smoking and alcohol consumption. When compared to non-smokers, smokers are 1.4 times more likely to develop severe COVID-19 symptoms, 2.4 times more likely to require ICU admission and mechanical ventilation, and 2.4 times more likely to die [11, 12].

There are currently no fully proven methods for treating post-inflammatory COVID-19 pulmonary fibrosis. Various therapeutic options are being considered. It has been proposed that long-term use of antiviral, anti-inflammatory, and anti-fibrotic medications reduces the risk of lung fibrosis. However, it is not yet known whether early and prolonged use of antiviral medications can prevent lung remodeling or which antiviral is the most beneficial. Anti-fibrotic medications like pirfenidone and nintedanib also have anti-inflammatory properties. Therefore, they can be utilized even during the acute phase of COVID-19 pneumonia. Pirfenidone works as an anti-fibrotic, anti-oxidant, and anti-inflammatory agent. Pirfenidone may minimize ARDS-induced lung injury by decreasing NLRP3 inflammasome activation, which reduces LPS-induced acute lung injury and eventual fibrosis. Anti-fibrotic therapy has a few drawbacks in the acute phase. Hepatic dysfunction is common in COVID-19 patients, as evidenced by elevated transaminases, and the anti-fibrotics pirfenidone and nintedanib cause hepatotoxicity. Because most COVID-19 patients are taking anticoagulants, nintedanib is linked to an increased risk of bleeding. Anti-fibrotic therapy should be started within the first week of ARDS onset to avoid complications of lung fibrosis [13,14,15]. As a result, identifying people who are at risk of developing pulmonary fibrosis is critical. The rationale for employing medication should be customized, and the role of precision medicine involves the prediction of high-risk populations, a better understanding of pathophysiology, and the avoidance of disease deterioration or the formation of lung fibrosis. Follow-up analysis of COVID-19 patients after hospital discharge could decrease the risk of developing fibrotic abnormalities. The majority of diagnostic procedures are based on various symptoms, medical imaging (mainly high-resolution computed tomography, HRCT), lung function tests (LFT) and biopsy. As these procedures take a long time to interpret, cause discomfort and expose patients to radiation, clinicians and radiologists are more inclined toward computer-aided diagnosis [16]. Therefore, this work presents an efficient model to detect pulmonary fibrosis in severe COVID-19 patients 90 days after discharge from the hospital, by analyzing electronic health records (EHRs) and HRCT scans.

The paper is organized as follows: “Literature Review” and “Motivation” present the existing literature and the motivation for the research, respectively. “Materials and Methods” includes the dataset description, model development and evaluation metrics, while “Experimental Results” provides the experimental setup and comparison results with other machine learning models. “Discussion” discusses the significant findings and “Conclusion” concludes the paper by emphasizing the salient points and future perspectives.

Literature Review

To find relevant research work, a literature review was conducted utilizing several databases (PubMed, Scopus, Science Direct, and Google Scholar). Coronavirus, severe acute respiratory syndrome coronavirus 2, COVID-19, post-COVID fibrosis, and anti-fibrotic were among the search phrases. The search yielded many articles, comprising review articles, case studies and reports. A few of them are discussed in this section. Carfi et al. evaluated 143 patients who were discharged from the hospital after recovering from COVID-19 with ongoing symptoms. Only 18 (12.6 percent) of patients were completely free of any COVID-19-related symptom at the time of evaluation, whereas 32 percent had one or two symptoms and 55 percent had three or more. There was no fever or other indications or symptoms of acute illness in any of the patients. In 44.1 percent of patients, quality of life had deteriorated. They also found that fatigue (53.1%), dyspnea (43.4%), joint pain (27.3%), and chest pain (21.7%) were the most common symptoms that persisted after discharge [17]. Zhao et al. evaluated COVID-19 survivors’ pulmonary function and related physiological features three months after recovery, enrolling 55 patients and finding varying degrees of radiological abnormalities in 39 of them. The presence of CT abnormalities was linked to a high blood urea nitrogen content upon admission [18]. In the study of Liu et al., a chest CT scan was obtained on the last day before discharge, two weeks after discharge, and four weeks after discharge. The anomalies in the lungs (including focal/multiple GGO, consolidation, interlobular septal thickening, sub-pleural lines, and irregular lines) were gradually absorbed in the first and second follow-ups after discharge, compared to the CT scan taken before discharge. After a 4-week follow-up, 64.7 percent of discharged patients had their lung lesions completely absorbed. This suggested that COVID-19-induced lung tissue damage might be reversible in most COVID-19 patients. It was also proposed that non-severe patients had a good prognosis, and that clinical intervention should be done early to prevent common COVID-19 cases from becoming severe [19].

A recent study by Yasin et al. [20] identified patient age, CT severity score, consolidation score, and ICU admission as the independent risk variables associated with the occurrence of post-COVID-19 fibrosis after a multivariate analysis. At a cut-off point of 10.5, the chest CT severity score had a sensitivity of 86.1%, a specificity of 78%, and an accuracy of 81.9%. Another study from ‘The Lancet’ confirmed a few risk factors, such as age, hypertension, the severity of COVID-19 and diabetes, as important indicators of developing fibrosis [21]. Li referred to post-COVID-19 pulmonary fibrosis as a worrisome sequela in surviving patients [22]. The study also presented the significance of early detection of fibrosis in high-risk patients through appropriate CT scans. According to a review by Spagnolo et al., biomarkers of susceptibility could help identify patients with a higher risk and could be used to personalize the treatment of COVID-19’s long-term effects. It emphasizes the importance of patient and illness-related contributing risk factors for pulmonary fibrosis in COVID-19 survivors, as well as the potential utility of acute phase and follow-up biomarkers for identifying patients most at risk of developing the disease [23]. Another research article described the correlation of risk factors, such as leukocyte count, lactate dehydrogenase, the severity of COVID-19 and duration of mechanical ventilation, with the development of fibrotic abnormalities [24]. Chen et al. [25] analyzed 159 autopsies of patients with ARDS caused by a variety of causes and found that fibrosis was present in three (4%) out of 82 patients with a disease duration of less than one week, 13 (24%) out of 54 patients with a disease duration of one to three weeks, and 14 (61%) out of 23 patients with a disease duration of more than three weeks. Das et al. investigated 27 patients who had been put on ventilation for ARDS and found that 23 (85%) of them had symptoms of fibrosis 110–267 days after extubation, with a strong link to the length of the pressure-controlled inverse-ratio ventilation [26]. Yu et al. divided patients into two groups, early fibrosis and severe fibrosis, based on post-COVID-19 follow-ups. On preliminary CT imaging, the fibrosis group had a higher prevalence of irregular interface (57.1%) and parenchymal band (50.0%). On the worst-state CT imaging, the fibrosis group had a higher prevalence of parenchymal band (92.9%), interstitial thickening (78.6%), air bronchogram (57.1%), irregular interface (85.7%) and coarse reticular pattern (28.6%) [27].

The literature indicates a strong link between fibrotic abnormalities and COVID-19 for around 15–20% of recovered patients. Considering the millions of COVID-19 cases across the world, even a small percentage of post-COVID lung fibrosis is concerning. The research articles also stress the importance of blood investigations and HRCTs of recovered COVID-19 patients to analyze the risk of developing fibrosis in the lungs. Recently, EHRs have been considered a critical tool for patient data collection. At the time of care, an EHR delivers accurate, up-to-date, and full information about patients. It also aids in the accurate diagnosis of patients, the reduction of medical errors, the provision of safer care, and quick access to patient records necessary for more coordinated and efficient care. HRCT, on the other hand, is a more precise radiological examination than a chest X-ray for the diagnosis and monitoring of lung tissue and airway illnesses. A volume HRCT scan of the entire lung tissue is possible with modern CT equipment. Contrast-enhanced CT scans of the chest or the entire body can also be used to create HRCT slices. Idiopathic interstitial pneumonias and pulmonary fibrosis are among the most well-known indications for HRCT. These diagnostic tools have been considered beneficial in finding the abnormalities present in the lungs of COVID-19 patients after discharge.

Motivation

After the COVID-19 pandemic, an increasing number of individuals worldwide who have survived the sickness are still suffering from its symptoms, even though they have clinically tested negative for the virus. As we fight this pandemic, the most difficult part will be figuring out how to deal with COVID-19 sequelae, which can range from mild fatigue and body aches to severe forms requiring long-term oxygen therapy and lung transplantation due to lung fibrosis, significant cardiac abnormalities, and stroke, all of which lead to a significant reduction in quality of life. Various studies have found that 70–80 percent of COVID-19 patients still have at least one symptom after being declared COVID-free. Existing literature indicates a lack of detection or prediction models for post-COVID-19 pulmonary fibrosis. Thus, there is an urgent need to develop computer-aided diagnostic models to help the healthcare sector detect fibrotic abnormalities before their onset. Recently, many artificial intelligence-assisted systems based on EHRs and CT scans have been reported for diagnosing diseases. Powerful models based on machine learning assist clinicians and medical practitioners in diagnosing abnormalities effectively. The health reports indicate important risk factors that could be vital diagnostic indicators for the early detection of pulmonary fibrosis.

As computer-aided diagnostic models are now a strong alternative to human experts due to their speed, accuracy and decreased false positive rates, an effective model for detecting the early onset of pulmonary fibrosis could help decrease mortality due to severe scarring of the lungs. In this proposed work, clinical characteristics and chest HRCT data were collected from patients who returned to the hospital for chest HRCT re-examinations 90 days after hospital discharge, with follow-up studies on the evolution of pulmonary fibrosis. In the case of pulmonary fibrosis, the major risk factors reported are age; symptoms like cough, cold, fever and chest tightness; IL-6 levels; WBC counts; lymphocytes; albumin; creatinine; CRP; D-dimer; and humoral immunity-related indexes (IgG). The chest CT image analysis included the spread of the lesions, the position of the lesions, lobes affected, features of the lesions and external immersion. For each patient, the CT presentation was described according to the parameters of lesion degree, quantitative scoring of pulmonary fibrosis and inflammation score. These risk factors were obtained after stratifying COVID-19 patients (with and without pulmonary fibrosis). A statistically significant difference was observed for most of the risk factors. On the technical side, the model has gone through optimum algorithm selection procedures and hyper-parameter tuning. The overall architecture of the proposed pulmonary fibrosis detection system is presented in Fig. 1. The major contributions are highlighted as follows:

  1. A dataset of 1175 severe COVID-19 patients has been created using EHRs and corresponding HRCT scans that includes general clinical data, such as sex, age, main clinical symptoms and radiological images.

  2. Statistical analysis has been performed to evaluate the clinical characteristics and to compare patients with and without pulmonary fibrosis. Feature importance has also been obtained to identify the most prominent indicators of fibrosis in COVID-19 patients.

  3. After pre-processing, statistical analysis, null value assessment and feature scaling, various machine learning algorithms have been trained to classify patients into fibrosis cases and normal cases.

  4. Several machine learning algorithms are then compared on the considered dataset, in which Extreme Gradient Boosting (XGBoost) provides the best performance in terms of metrics such as accuracy, precision, recall and specificity. XGBoost is thus used as the base model and is optimized for the application by tuning the major hyper-parameters, such as learning rate, gamma rate and regularization lambda.

  5. The improved XGBoost model is then trained with different training–testing splits of the same dataset. The model is then tested for the prediction of pulmonary fibrosis and normal lungs. This novel approach exhibits potency and thus can be embedded in clinical diagnosis systems to provide fast, reliable and low-cost results.

Fig. 1 The overall system architecture of the proposed system for pulmonary fibrosis detection

Materials and Methods

The paper aims to propose a machine learning-based diagnostic system to automatically detect pulmonary fibrosis by evaluating a patient’s risk factors and HRCT scans.

Dataset

As the emergence of post-COVID-19 complications is recent, none of the large data repositories contain any labeled data for pulmonary fibrosis, which led us to rely on chest examination reports, EHRs and CT scan interpretations from Centre Theatre General Hospital, China to train the proposed model [28, 29]. The clinical characteristics and HRCT scans were collected at the follow-ups of COVID-19 patients 90 days after hospital discharge. The dataset includes a single comma-separated values (CSV) file with 32 risk factors and HRCT scans for all 1175 patients, labeled as Normal Lungs or Fibrosed Lungs. In the acquired dataset, 725 patients developed pulmonary fibrosis after COVID-19 recovery, while 450 patients did not. A statistical analysis was done to evaluate the relationship between fibrosis progression and related risk factors in all 1175 COVID-19 patients. The analysis of the EHR data was carried out in SPSS (version 26.0). The results showed a significant relationship between pulmonary fibrosis and the levels of Interleukin-6 (IL-6), albumin and cellular immunity-related indexes, based on the χ² values and P values from Fisher’s exact test [30, 31]. Details of the dataset are presented in Table 1. The HRCT scans were pre-processed using resizing, normalization and de-noising filters before being classified by the proposed model. Figure 2 depicts samples of normal and fibrosed HRCT scans of patients after recovery from COVID-19.
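The association test mentioned above can be illustrated with a short, self-contained sketch. The analysis in this work was run in SPSS; the Python function below is only an assumed stand-in that computes a two-sided Fisher's exact test for a 2×2 table (risk factor present or absent versus fibrosed or normal). The function name and the counts in the usage line are hypothetical and are not taken from the dataset.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows could be "risk factor present / absent" and columns
    "fibrosed / normal" (an assumed layout for illustration).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the
    # observed one; the small tolerance guards against float ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts, for illustration only.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

A small P value would indicate that the risk factor and the fibrosis label are unlikely to be independent; the exact test is usually preferred over the χ² approximation when some cell counts are small.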

Table 1 The clinical risk factors of severe COVID-19 patients
Fig. 2 Samples of HRCT scans: a fibrosed lungs, b normal lungs

Development of XGBoost model

In this study, 32 features in the form of risk factors have been fed into the XGBoost model to automatically classify patients into those with normal lungs and those with fibrosed lungs [32]. XGBoost is a technique based on sequential ensembling. It evaluates the first- and second-order partial derivatives of the loss function to obtain the gradient patterns. These are then used to minimize the loss function, which eventually optimizes the model. In comparison with conventional gradient boosting, XGBoost uses regularization and parallelization to improve the speed and generalization of the model [33]. While implementing XGBoost, competent models are built iteratively from a collection of weak learners. The XGBoost algorithm works on Newton–Raphson optimization in function space [34]. The generic version of XGBoost is stated below:

  1. Input: training set \(\{(x_{i}, y_{i})\}_{i=1}^{N}\), a differentiable loss function \(L(y, F(x))\), the number of weak learners \(M\) and the learning rate \(\alpha\).

  2. To train the model, the loss function is optimized using its gradient and a second-order Taylor approximation, represented in Eqs. (1) and (2).

    $$\widehat{g}_{m}\left(x_{i}\right)= \left[\frac{\partial L\left(y_{i}, f\left(x_{i}\right)\right)}{\partial f\left(x_{i}\right)}\right]_{f\left(x\right)=\widehat{f}_{\left(m-1\right)}\left(x\right)},$$
    (1)
    $$\widehat{h}_{m}\left(x_{i}\right)= \left[\frac{\partial^{2} L\left(y_{i}, f\left(x_{i}\right)\right)}{\partial f\left(x_{i}\right)^{2}}\right]_{f\left(x\right)=\widehat{f}_{\left(m-1\right)}\left(x\right)},$$
    (2)

    where \(\widehat{g}_{m}\left(x_{i}\right)\) is the gradient and \(\widehat{h}_{m}\left(x_{i}\right)\) is the Hessian for m = 1 to M.

  3. Fit the base learner (or weak learner) using the updated training set \(\left\{x_{i}, -\frac{\widehat{g}_{m}\left(x_{i}\right)}{\widehat{h}_{m}\left(x_{i}\right)}\right\}_{i=1}^{N}\) by solving the optimization problem stated below in Eq. (3):

    $$\widehat{\phi}_{m}=\underset{\phi}{\text{argmin}} \sum_{i=1}^{N}\frac{1}{2}\widehat{h}_{m}\left(x_{i}\right)\left[-\frac{\widehat{g}_{m}\left(x_{i}\right)}{\widehat{h}_{m}\left(x_{i}\right)}-\phi\left(x_{i}\right)\right]^{2}.$$
    (3)

  4. Update the model as indicated in Eqs. (4) and (5):

    $$\widehat{f}_{m}\left(x\right)= \alpha \widehat{\phi}_{m}\left(x\right),$$
    (4)
    $$\widehat{f}_{\left(m\right)}\left(x\right)= \widehat{f}_{\left(m-1\right)}\left(x\right)+ \widehat{f}_{m}\left(x\right).$$
    (5)

  5. Output of the final model:

    $$\widehat{y}=\widehat{f}\left(x\right)= \widehat{f}_{\left(M\right)}\left(x\right)=\sum_{m=0}^{M}\widehat{f}_{m}\left(x\right).$$
    (6)

The final model shown in Eq. (6) is then adjusted by taking the best values of the parameters and input function to obtain the optimum result.
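The update loop described in the steps above can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the implementation used in this work: it assumes the logistic (log) loss, and it replaces the regression tree by a single constant weak learner, for which the minimizer of Eq. (3) has the closed form \(-\sum_i g_i / \sum_i h_i\). The function names, the learning rate of 0.3 and the toy labels are all assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_hess_logloss(y, f):
    """Gradient and Hessian of the log loss with respect to the raw score f."""
    p = sigmoid(f)
    return p - y, p * (1.0 - p)


def newton_boost_constant(ys, n_rounds=20, alpha=0.3):
    """Newton boosting with a constant weak learner (illustrative only).

    Each round evaluates g and h at the current model, fits the constant
    phi_m = -sum(g) / sum(h) that minimises the weighted squared error of
    Eq. (3), and adds alpha * phi_m to the running model.
    """
    f = 0.0  # initial raw score shared by all samples
    for _ in range(n_rounds):
        stats = [grad_hess_logloss(y, f) for y in ys]
        g_sum = sum(g for g, _ in stats)
        h_sum = sum(h for _, h in stats)
        phi = -g_sum / h_sum      # weak learner fitted to -g/h
        f += alpha * phi          # shrunken additive update
    return f


# Toy labels (1 = fibrosed, 0 = normal); hypothetical data.
raw_score = newton_boost_constant([1, 1, 1, 0], n_rounds=200)
predicted_probability = sigmoid(raw_score)
```

With a constant learner the model can only learn the base rate of the positive class; in XGBoost the same per-sample gradient and Hessian statistics are instead aggregated per leaf of a regression tree, so that different patients receive different scores.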

In comparison to conventional gradient boosting, XGBoost has a few built-in algorithmic enhancements. It penalizes complex models by applying regularization, which helps avoid overfitting. It also handles sparsity patterns in the dataset more efficiently by automatically learning how to treat missing values during training. The XGBoost algorithm has built-in cross-validation at each iteration and employs a weighted quantile sketch algorithm to find the optimal split points [35]. To regularize the model, another term known as the regularization term is added to the cost function, as stated below:

$${\text{Objective function }} = {\text{ Loss function }} + {\text{ Regularization term}}.$$

Regularization term = \(\frac{\lambda }{2m}\) * \(\sum {|w|}^{2}\), where \(\lambda\) is the regularization parameter, that is optimized to obtain the best results, m is the number of weak learners and w is the leaf weight matrix [36].
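As a small numerical illustration of the objective above, the following sketch adds the regularization term \(\frac{\lambda}{2m}\sum |w|^{2}\) to a given loss value. The function name and all numbers are hypothetical; the point is only to show that larger leaf weights or a larger \(\lambda\) increase the objective, which steers training toward simpler models.

```python
def regularized_objective(loss, leaf_weights, lam, m):
    """Objective = loss + (lam / (2 * m)) * sum of squared leaf weights.

    loss         -- value of the training loss (assumed already computed)
    leaf_weights -- iterable of leaf weights w
    lam          -- regularization parameter lambda
    m            -- number of weak learners
    """
    penalty = lam / (2.0 * m) * sum(w * w for w in leaf_weights)
    return loss + penalty


# Hypothetical values: the same loss with a weak and a strong penalty.
weak_penalty = regularized_objective(1.0, [1.0, 2.0], lam=0.1, m=2)
strong_penalty = regularized_objective(1.0, [1.0, 2.0], lam=2.0, m=2)
```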

The hyperparameters of the XGBoost model can be grouped as general, command line, booster and learning task parameters. To achieve optimal performance, the model must be tuned carefully. Tuning the model is a demanding task due to the number of parameters it has. The proposed model used random search on a few important parameters. These tuned parameters provided exceptional results with low computational complexity for the proposed application. Table 2 shows the details of the XGBoost hyperparameters that were adjusted to make the model more efficient.
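The random search mentioned above can be sketched as follows. This is a generic, library-free illustration under stated assumptions: the search ranges are hypothetical (the tuned values used in this work are those of Table 2), the parameter names follow the XGBoost scikit-learn wrapper (`learning_rate`, `gamma`, `reg_lambda`), and `toy_score` is a placeholder for what would in practice be a cross-validated score of a model trained with the sampled parameters.

```python
import random


def random_search(score_fn, space, n_iter=20, seed=0):
    """Sample n_iter parameter sets uniformly and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {
            name: rng.uniform(low, high) for name, (low, high) in space.items()
        }
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search ranges, not the values reported in Table 2.
SPACE = {
    "learning_rate": (0.01, 0.3),
    "gamma": (0.0, 5.0),
    "reg_lambda": (0.0, 10.0),
}


def toy_score(params):
    # Placeholder for a cross-validated accuracy; peaks at learning_rate = 0.1.
    return -abs(params["learning_rate"] - 0.1)


best_params, best_score = random_search(toy_score, SPACE, n_iter=50, seed=1)
```

Compared with an exhaustive grid, random search evaluates a fixed budget of configurations, which keeps the tuning cost predictable when several continuous parameters are involved.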

Table 2 Optimal values for important hyperparameter

Performance Evaluation Metrics

After the training process, predictions have been made on the test data. A total of 881 training samples and 294 testing samples were considered. To evaluate performance, metrics such as accuracy, precision, recall, F1 score, specificity, Matthews correlation coefficient (MCC), Youden Index (YI) and Cohen Kappa score have been obtained from the confusion matrix. The MCC is a more reliable statistical rate that only yields a high score if the prediction performed well in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally to the size of the positive and negative elements in the dataset. The Youden index assesses a diagnostic test’s ability to strike a balance between sensitivity (detection of disease) and specificity (detection of health or no disease). The model's sensitivity percentage is added to its specificity percentage, and 100 is subtracted from the sum. If the Youden index is less than 50%, the model does not meet the empirical criteria for being used for diagnostic purposes. The mathematical representation of all these metrics is stated below:

  • True Positive (TP): number of fibrosed lung cases that are correctly predicted as fibrosed.

  • False Positive (FP): number of normal lung cases that are wrongly predicted as fibrosed.

  • True Negative (TN): normal cases that are correctly predicted as normal.

  • False Negative (FN): number of fibrosed cases that are wrongly predicted as normal.

The above terms are utilized to form several performance measures presented in Eqs. (7) to (14):

$$\text{Accuracy }=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}},$$
(7)
$$\text{Precision }=\frac{\text{TP}}{\text{TP}+\text{FP}},$$
(8)
$$\text{Sensitivity (Recall) }=\frac{\text{TP}}{\text{TP}+\text{FN}},$$
(9)
$$\text{Specificity }=\frac{\text{TN}}{\text{TN}+\text{FP}},$$
(10)
$$\text{F1-score }= 2 \times \frac{\text{Recall}\times \text{Precision}}{\text{Recall}+\text{Precision}},$$
(11)
$$\text{MCC }=\frac{\text{TP}\times \text{TN}-\text{FP}\times \text{FN}}{\sqrt{\left(\text{TP}+\text{FP}\right)\left(\text{TP}+\text{FN}\right)\left(\text{TN}+\text{FP}\right)\left(\text{TN}+\text{FN}\right)}},$$
(12)
$${\text{YI}} = {\text{Sensitivity }}\left( \% \right) + {\text{Specificity }}\left( \% \right) - 100,$$
(13)
$$\text{Cohen Kappa Score }=\frac{{p}_{o}-{p}_{e}}{1-{p}_{e}},$$
(14)

where \({p}_{o}\) is the empirical probability of agreement on the label assigned to a sample and \({p}_{e}\) is the expected agreement when both annotators assign labels randomly.
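For reference, the metrics defined above can all be computed from the four confusion matrix counts. The sketch below is a minimal illustration rather than the evaluation code used in this work: the function name and the example counts are hypothetical, accuracy is taken as (TP + TN) divided by the total number of samples, and no guard is included for degenerate cases in which a denominator is zero.

```python
import math


def diagnostic_metrics(tp, fp, tn, fn):
    """Compute the evaluation metrics from binary confusion matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * recall * precision / (recall + precision)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    youden = 100 * recall + 100 * specificity - 100   # in percent
    # Cohen's kappa: observed agreement versus agreement expected by chance.
    p_o = accuracy
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
        "youden": youden,
        "kappa": kappa,
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = diagnostic_metrics(tp=40, fp=10, tn=45, fn=5)
```

Reporting MCC and Cohen's kappa alongside accuracy is useful here because the two classes are not equally represented, so accuracy alone can look favourable even when one class is poorly detected.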

The implementation also produces the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) and displays the diagnostic ability of the model. Another performance metric, the Area Under the Curve (AUC), is the area under the ROC curve. AUC provides an aggregate measure of performance across all possible classification thresholds.
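To make the construction of the ROC curve and the AUC concrete, the following sketch sweeps the decision threshold over a set of predicted scores and integrates the resulting curve with the trapezoidal rule. It is an illustrative, library-free version under the assumption of binary labels (1 = fibrosed, 0 = normal) with at least one sample of each class; the function names and the toy scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return the (FPR, TPR) points obtained by lowering the threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with identical scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: one normal case is scored above one fibrosed case.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auc = auc(roc_points(toy_labels, toy_scores))
```

An AUC of 1.0 corresponds to scores that separate the two classes perfectly, while 0.5 corresponds to a ranking no better than chance.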

Experimental Results

Experimental Setup

As mentioned earlier, the dataset is split into different sets for training and testing. After implementing the XGBoost model with tuned hyperparameters, performance metrics have been obtained. For the simulation, Python packages and the Keras library with TensorFlow 1.7 have been used on an Intel Core(TM) i5 2.2 GHz processor.

Result Analysis of EHR File

Because the machine learning literature reports that the performance of a model depends on the size of the dataset considered, an analysis using different splits of the dataset has been performed to rule out the possibility of overfitting or over-constrained conditions [37]. The dataset has been divided into three different sets of train–test data to obtain the performance metrics of the XGBoost model. The details are presented in Table 3. This analysis indicates the efficiency of the proposed model, which achieves satisfying results in all the training–testing sets. Figure 3 depicts the confusion matrix of the test phase of the XGBoost architecture for pulmonary fibrosis classification with 80% training data. Fibrosed cases were labeled as 1, while normal cases were labeled as 0. Among the 1170 patient records, only one was misclassified as a false positive and one as a false negative. Furthermore, in Fig. 4, the ROC curve is plotted between the true positive rate and the false positive rate for the 80:20 split of the dataset to assess the overall performance of the model. The AUC was calculated to be 1.00.
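The repeated train–test partitioning described above can be sketched as a simple shuffled split over patient indices. This is an assumed procedure for illustration: the text does not state how the partitions were drawn (for example, whether they were stratified by class), and of the fractions below only 0.75 (881 training and 294 testing samples) and 0.80 are mentioned in the text; 0.70 is a hypothetical third choice.

```python
import random


def train_test_split_indices(n_samples, train_fraction, seed=0):
    """Shuffle sample indices and cut them into a train and a test part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


N_PATIENTS = 1175
split_sizes = {}
for fraction in (0.70, 0.75, 0.80):   # 0.70 is an assumed value
    train_idx, test_idx = train_test_split_indices(N_PATIENTS, fraction)
    split_sizes[fraction] = (len(train_idx), len(test_idx))
```

Evaluating the same model on several partitions of different sizes gives a rough check that the reported performance is not an artifact of one particular split.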

Table 3 Performance of the XGBoost model for different sets of training and testing data from EHR dataset
Fig. 3 Confusion matrix of the proposed system with 80% training data

Fig. 4 ROC analysis of the proposed system with 80% training data

The XGBoost hyperparameter scale_pos_weight is used to tune the behavior of the algorithm for an imbalanced dataset. By default, scale_pos_weight is set to 1.0; it controls the weight of positive examples relative to negative examples when boosting the model’s decision trees. A feature importance graph, presented in Fig. 5, was also plotted to identify the significant features of the clinical data. Importance provides a score that indicates how useful each feature was in the construction of the boosted decision trees within the model. It is calculated explicitly for each attribute of the dataset, allowing attributes to be compared and ranked accordingly. The performance measure used to select the split points is the Gini index. The importance of each feature is then averaged across all decision trees within the XGBoost model.

Fig. 5 Feature importance graph of clinical dataset
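As a brief illustration of the scale_pos_weight setting discussed above, the XGBoost documentation suggests the ratio of negative to positive instances as a typical starting value. The sketch below applies that heuristic to the class counts of this dataset (725 fibrosed cases labeled 1 and 450 normal cases labeled 0). Whether this value was actually used in this work is not stated, so the result should be read as an assumption; the function name is hypothetical.

```python
def suggested_scale_pos_weight(n_negative, n_positive):
    """Heuristic starting value: number of negatives / number of positives."""
    return n_negative / n_positive


N_FIBROSED = 725   # positive class, label 1
N_NORMAL = 450     # negative class, label 0

weight = suggested_scale_pos_weight(N_NORMAL, N_FIBROSED)

# The value would be passed to the booster as an ordinary parameter.
params = {"scale_pos_weight": weight}
```

Because the positive class is the majority here, the heuristic yields a value below 1, which down-weights fibrosed cases relative to normal ones.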

Result analysis of HRCT-Scans

The XGBoost model was used to classify Fibrosed and normal lungs from the HRCT dataset. Again, three independent sets of train–test data have been considered to acquire the performance metrics. The details have been discussed in Table 4. By obtaining satisfactory results in all the training–testing sets, this analysis demonstrates the efficacy of the proposed model with HRCT scans as well. The confusion matrix for the test phase of the XGBoost architecture for pulmonary fibrosis classification with 80% training data is shown in Fig. 6. The area under ROC curve, obtained in Fig. 7, for the case of HRCT images has also attained the value of 1.00.

Table 4 Performance of the XGBoost model for different sets of training and testing data
Fig. 6 Confusion matrix of the proposed system with 80% training data

Fig. 7 ROC analysis of the proposed system with 80% training data

Comparison with Other Machine Learning Models

There has been enormous development in machine learning over the decades. The recent inclination of researchers is toward deep learning, which requires a dataset comprising a huge number of attributes. For clinical cases like the pulmonary fibrosis dataset associated with COVID-19, the dataset is much smaller, particularly after deleting the invalid data points. The available literature shows that tree-based algorithms, SVM and regression models perform well with small datasets. Thus, a few standard machine learning algorithms are considered for a performance comparison with the proposed methodology. Models such as Support Vector Machine (SVM), Naïve Bayes, Decision Tree, Random Forest, Logistic Regression, XGBoost and the proposed optimized XGBoost have been implemented on the pulmonary fibrosis patient dataset and their performances have been evaluated [38,39,40,41,42,43,44,45]. Among all of them, the optimized XGBoost produced the best results because of its tuned parameters and was chosen for the final classification task. The detailed metrics comparison based on EHR data is presented in Table 5, and the analysis based on HRCT scans is presented in Table 6.

Table 5 Comparison of machine learning models with EHR dataset
Table 6 Comparison of machine learning models with HRCT scan dataset

Discussion

During the pandemic, a severe shortage of diagnostic resources was reported, even in the developed parts of the world. Timely detection of diseases could result in better treatment and could save lives. An EHR is the digital version of a patient’s reports. It contains the patient’s medical history, diagnoses, medications, treatment plans, immunization dates, allergies, radiology images, and laboratory and test results. EHRs can easily be used to automate the diagnostic process as they are already digitized. This can make the whole process more efficient, faster and more precise than other diagnostic methods. A classification study with 1175 high-resolution computed tomography (HRCT) scans has also been included, to complement the results acquired from EHRs. The progression of pulmonary fibrosis in COVID-19 patients is a difficult classification problem that necessitates the application of a powerful optimization algorithm and an efficient feature extraction process. Treatment decisions, prognostication, and research into the pathogenesis of pulmonary fibrosis can all be aided by automated diagnosis. The work is primarily an application of the Extreme Gradient Boosting algorithm for detecting pulmonary fibrosis. The technical contribution includes statistical analysis of the dataset and hyperparameter tuning of the algorithm using grid search. The results attained were sufficient to develop a diagnostic tool for detecting pulmonary fibrosis. Effective pre-processing and statistical analysis have been implemented on the dataset to obtain consistent and uniform values.

The tuned XGBoost approach was chosen for this study because it offers outstanding scalability and a fast running speed, making it an effective ML method. Furthermore, machine learning methods allow for the simultaneous assessment of several variables and their complex interactions, as well as nonlinearity, in the development of predictive models. XGBoost has been used to solve a variety of machine learning problems. In biomedical domains, it has been used to classify cancer and epilepsy patients and to diagnose chronic renal disease. The results suggest that the tuned XGBoost classification system can discriminate between normal patients and patients with fibrosed lungs with high accuracy. The proposed framework reached a maximum classification accuracy of 99 percent, suggesting the potential clinical utility of EHR and HRCT data for categorizing patients with pulmonary fibrosis. Tables 5 and 6 compare the proposed system with various ML methods reported in the literature. The comparison revealed that the optimized XGBoost produced a significant improvement over the other approaches.

The SVM and Logistic Regression techniques performed poorly compared to the other systems. Random Forest came closest to the proposed method’s recall and accuracy values. The proposed XGB system is capable of handling large data dimensions while avoiding overtraining. However, the study has a few limitations. Because no post-COVID-19 dataset covering the period before the onset date was available, high-risk trajectory prediction could not be analyzed in this study. The available dataset only includes test reports obtained after 90 days of hospital discharge. This analysis remains to be addressed in future work. The suggested model could also be applied to a large-scale dataset that includes people from various geographical areas and age groups. The model has the potential to be a trustworthy tool for automatic analysis to aid in the diagnosis of pulmonary fibrosis.

Conclusion

As the comorbidities and complications due to COVID-19 increased sharply, many developing countries faced acute shortages of medical resources. Hence, there is a need to identify each complication at an early stage, which will reduce the burden on the medical community and the healthcare system. The proposed XGBoost system for detecting pulmonary fibrosis in COVID-19 patients could significantly help clinicians examine patients with fibrotic complications by analyzing the electronic health reports or HRCT scans. This machine learning model achieved an accuracy of 99% and gave the best performance on the other evaluation metrics when compared with Decision Tree (97%), SVM (94%), Random Forest (90%), Logistic Regression (83%) and Naïve Bayes (63%). The precision, recall, and accuracy of the system suggested in this paper are higher than those of the other approaches. This supports its reliability for the automatic classification of the pathology in this investigation. Finally, it is important to note that the XGB approach attained strong performance, implying that this system can aid physicians in their decision-making.