Abstract
Historical and real-time healthcare data sets are valuable sources of information for predictive data analytics. However, most historical healthcare data sets come with challenges. One of the most frequent is missing values, which arise from inaccuracies in data transmission or data entry. An appropriate technique for handling missing values is required to produce good-quality data sets and, in turn, better prediction results. Removing the records with missing values, known as marginalization, offers an easy way out of this challenge. However, it reduces the volume of the historical data set and disturbs its class balance. An alternative to marginalization is imputation, which replaces missing values with plausible values. This paper proposes a missing value imputation technique, CLUSTIMP, based on the unsupervised neural network Adaptive Resonance Theory 2 (ART2). The efficiency of the proposed imputation method is evaluated on the incomplete Mammographic Mass data set and the Hepatocellular Carcinoma (HCC) data set from the UCI repository, using Root Mean Squared Error (RMSE) and classification accuracy as the evaluation metrics. The proposed CLUSTIMP imputation algorithm outperforms existing state-of-the-art imputation methods, reducing classifier error rates by 2 to 11%.
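The contrast between marginalization and imputation, and the use of RMSE to score an imputation method, can be illustrated with a minimal sketch. This is not the CLUSTIMP algorithm: simple column-mean imputation stands in for it, the toy values are invented, and the function names (`marginalize`, `mean_impute`, `rmse`) are hypothetical. The sketch assumes the common evaluation setup in which known values are hidden, imputed, and then compared with the truth.

```python
import math

# Toy complete data set; the numbers are invented for illustration only.
complete = [
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, 8.0],
]

# Hide two known entries so the imputed values can be checked against the truth.
masked_cells = [(1, 1), (3, 0)]
incomplete = [row[:] for row in complete]
for r, c in masked_cells:
    incomplete[r][c] = None  # None marks a missing value


def marginalize(rows):
    """Drop every record that contains at least one missing value."""
    return [row for row in rows if None not in row]


def mean_impute(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [row[c] for row in rows if row[c] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[c] if v is None else v for c, v in enumerate(row)] for row in rows]


def rmse(truth, imputed, cells):
    """Root mean squared error computed over the masked cells only."""
    squared = [(truth[r][c] - imputed[r][c]) ** 2 for r, c in cells]
    return math.sqrt(sum(squared) / len(squared))


kept = marginalize(incomplete)      # marginalization shrinks the data set
filled = mean_impute(incomplete)    # imputation keeps every record
error = rmse(complete, filled, masked_cells)

print(len(kept), "of", len(incomplete), "records survive marginalization")
print("RMSE of mean imputation on masked cells:", round(error, 3))
```

In this toy case marginalization discards half of the records, while imputation keeps all of them at the cost of some error in the filled-in cells. A lower RMSE on the masked cells indicates that the imputed values are closer to the true ones.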
Cite this article
Shobha, K., Savarimuthu, N. Clustering based imputation algorithm using unsupervised neural network for enhancing the quality of healthcare data. J Ambient Intell Human Comput 12, 1771–1781 (2021). https://doi.org/10.1007/s12652-020-02250-1
DOI: https://doi.org/10.1007/s12652-020-02250-1