Abstract
Clinical decision support using data mining techniques has, in recent years, offered a more intelligent way to reduce decision errors. However, clinical datasets often suffer from high missingness, which adversely affects the quality of modelling if handled improperly. Imputing missing values provides an opportunity to resolve the issue. Conventional approaches rely on simple statistical analysis, such as mean imputation, or on discarding incomplete cases; these have many limitations and thus degrade learning performance. This study examines a series of machine learning based imputation methods and suggests an efficient approach to preparing a good-quality breast cancer (BC) dataset for finding the relationship between BC treatment and chemotherapy-related amenorrhoea, where performance is evaluated by prediction accuracy. To this end, the reliability and robustness of six well-known imputation methods are evaluated. Our results show that imputation leads to a significant boost in classification performance compared with model prediction based on listwise deletion. Furthermore, the results reveal that most methods retain strong robustness and discriminant power even when the dataset has a high missing rate (> 50%).
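To make the contrast between the baseline strategies concrete, the following is a minimal sketch, not the study's implementation. It applies listwise deletion, mean imputation, and a simple k-nearest-neighbour imputation to a small invented table. The toy records, the function names, and the choice of k-nearest-neighbour as the example of a learning-based method are all assumptions made for illustration; the abstract does not name the six methods that were evaluated.

```python
import math

def listwise_delete(rows):
    """Keep only rows with no missing (None) entries."""
    return [r for r in rows if None not in r]

def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def knn_impute(rows, k=2):
    """Replace each None with the column mean over the k nearest donor rows.

    Distance is the root mean squared difference over the columns that are
    observed in both rows; rows sharing no observed column are infinitely far.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows in which column j was observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

# Hypothetical toy records (two numeric features); None marks a missing value.
data = [
    [30.0, 1.0],
    [32.0, None],
    [45.0, 4.0],
    [None, 4.2],
    [31.0, 1.2],
]

complete_cases = listwise_delete(data)   # incomplete rows are discarded
mean_filled = mean_impute(data)          # all rows kept, gaps set to column means
knn_filled = knn_impute(data, k=2)       # all rows kept, gaps set from similar rows
```

In this toy table, listwise deletion discards two of the five records, whereas both imputation strategies keep all five. The neighbour-based fill draws each missing value from similar records, while mean imputation assigns the same column average to every gap.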
Funding
This work is fully funded by Melbourne Research Scholarships (MRS), Grant No. 385545, and partially supported by the Fertility After Cancer Predictor (FoRECAsT) Study. Michelle Peate is currently supported by an MDHS Fellowship, University of Melbourne. The FoRECAsT study is supported by the FoRECAsT consortium and the Victorian Government through a Victorian Cancer Agency (Early Career Seed Grant) awarded to Michelle Peate.
About this article
Cite this article
Wu, X., Akbarzadeh Khorshidi, H., Aickelin, U. et al. Imputation techniques on missing values in breast cancer treatment and fertility data. Health Inf Sci Syst 7, 19 (2019). https://doi.org/10.1007/s13755-019-0082-4
DOI: https://doi.org/10.1007/s13755-019-0082-4