The effect of imputing missing clinical attribute values on training lung cancer survival prediction model performance | Health Information Science and Systems


  • Research
Health Information Science and Systems

Abstract

According to estimates by the World Health Organization and the International Agency for Research on Cancer, lung cancer is the most common cause of cancer death worldwide. Recent years have seen growing attention to clinical decision support systems in medicine generally and in cancer care in particular. Such systems can predict a patient’s likelihood of survival by learning from data on previously treated patients. The datasets mined to develop clinical decision support functionality are often incomplete, which adversely affects the quality of the resulting models and of the decision support they offer. Imputing missing values with statistical methods is a common approach to the missing data problem. This work investigates the effect of imputation methods for missing data when preparing a training dataset for a Non-Small Cell Lung Cancer survival prediction model, using several machine learning algorithms. The investigation assesses the effect of imputation algorithm error on prediction performance and compares training on a smaller, complete real dataset with training on a larger dataset that includes imputed data. Our results show that even when the proportion of records with some missing data is very high (> 80%), imputation can lead to prediction models with an AUC (0.68–0.72) comparable to that of models trained on complete data records.
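The comparison described in the abstract (a smaller set of complete records versus a larger set with imputed values, both scored by AUC) can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, mean imputation and a hand-written logistic regression stand in for the imputation methods and machine learning algorithms studied in the paper (which this page does not specify), and every function name and parameter is hypothetical.

```python
import math
import random


def make_records(n, rng):
    """Synthetic 'patients': three numeric attributes and a binary outcome."""
    rows = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(3)]
        logit = 1.2 * x[0] - 0.8 * x[1] + 0.5 * x[2]
        y = 1 if rng.random() < 1 / (1 + math.exp(-logit)) else 0
        rows.append((x, y))
    return rows


def knock_out(rows, p, rng):
    """Delete each attribute value independently with probability p (assumed MCAR)."""
    return [([None if rng.random() < p else v for v in x], y) for x, y in rows]


def mean_impute(rows):
    """Fill each missing value with the mean of the observed values in its column."""
    n_cols = len(rows[0][0])
    means = []
    for j in range(n_cols):
        seen = [x[j] for x, _ in rows if x[j] is not None]
        means.append(sum(seen) / len(seen))
    return [([means[j] if v is None else v for j, v in enumerate(x)], y)
            for x, y in rows]


def predict(w, x):
    """Logistic model; the last weight is the bias term."""
    z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
    return 1 / (1 + math.exp(-z))


def fit_logistic(rows, epochs=200, lr=0.1):
    """Plain batch gradient descent on the logistic loss."""
    n_cols = len(rows[0][0])
    w = [0.0] * (n_cols + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_cols + 1)
        for x, y in rows:
            err = predict(w, x) - y
            for j in range(n_cols):
                grad[j] += err * x[j]
            grad[-1] += err
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
# With three attributes and p = 0.42, about 1 - 0.58**3 (roughly 80%) of
# training records are expected to have at least one missing value.
train = knock_out(make_records(500, rng), 0.42, rng)
test = make_records(300, rng)  # held-out records, kept complete

complete_cases = [(x, y) for x, y in train if None not in x]
imputed = mean_impute(train)

results = {}
for name, rows in [("complete cases", complete_cases), ("mean-imputed", imputed)]:
    w = fit_logistic(rows)
    scores = [predict(w, x) for x, _ in test]
    results[name] = auc(scores, [y for _, y in test])
    print(f"{name}: n_train={len(rows)}, AUC={results[name]:.3f}")
```

Both models are scored on the same held-out complete records, so any difference in AUC reflects only how the training set was prepared. The numbers it prints describe the synthetic data and say nothing about the clinical dataset used in the paper.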


Acknowledgements

This work was in part funded by a New South Wales Office of Health and Medical Research (OHMR) bioinformatics grant, RG14/11.


Author information


Corresponding author

Correspondence to Mohamed S. Barakat.


About this article


Cite this article

Barakat, M.S., Field, M., Ghose, A. et al. The effect of imputing missing clinical attribute values on training lung cancer survival prediction model performance. Health Inf Sci Syst 5, 16 (2017). https://doi.org/10.1007/s13755-017-0039-4


  • DOI: https://doi.org/10.1007/s13755-017-0039-4
