Abstract
With the growth of the Internet, data sets keep growing larger and lower in quality. Many kinds of data problems arise during data collection, among which incompleteness is one of the most serious. Existing methods for missing-value imputation rely mostly on statistics and machine learning. These methods are limited in both efficiency and accuracy because of high-dimensional computation and the low quality of the initial data. In this paper, we propose a new method that combines a Bayesian network with crowdsourcing to handle missing values. We use the Bayesian network to infer missing values, which improves efficiency, and use crowdsourcing to obtain the additional information needed to improve accuracy. Experiments on real datasets show that our method achieves better performance than other imputation methods.
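The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea under stated assumptions: a two-attribute network (one parent, one child), conditional probabilities estimated by simple counting over complete rows, a hypothetical `ask_crowd` callback standing in for a crowdsourcing platform, and an arbitrary confidence threshold of 0.8. None of these names, values, or design choices come from the paper.

```python
from collections import Counter, defaultdict


def fit_cpt(rows, parent, child):
    """Estimate P(child | parent) by counting rows where both values are present."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[parent] is not None and row[child] is not None:
            counts[row[parent]][row[child]] += 1
    return {
        p: {c: n / sum(cnt.values()) for c, n in cnt.items()}
        for p, cnt in counts.items()
    }


def impute(rows, parent, child, ask_crowd, threshold=0.8):
    """Fill missing child values; defer to the crowd when the posterior is weak.

    `ask_crowd` and `threshold` are hypothetical, chosen for illustration only.
    """
    cpt = fit_cpt(rows, parent, child)
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input rows stay untouched
        if row[child] is None:
            posterior = cpt.get(row[parent], {})
            best = max(posterior, key=posterior.get, default=None)
            if best is not None and posterior[best] >= threshold:
                row[child] = best  # confident enough: use network inference
            else:
                row[child] = ask_crowd(row)  # uncertain: ask for extra information
        filled.append(row)
    return filled


# Toy data: "dept" is the parent attribute, "building" the child.
rows = [
    {"dept": "A", "building": "X"},
    {"dept": "A", "building": "X"},
    {"dept": "B", "building": "Y"},
    {"dept": "B", "building": "Z"},
    {"dept": "A", "building": None},  # P(X | A) = 1.0, so inferred
    {"dept": "B", "building": None},  # P(Y | B) = P(Z | B) = 0.5, so crowdsourced
]

asked = []


def ask_crowd(row):
    """Stub crowd: records the question and returns a fixed answer."""
    asked.append(row["dept"])
    return "Y"


filled = impute(rows, "dept", "building", ask_crowd)
```

In this toy run only the ambiguous row is sent to the crowd, which illustrates the trade-off the abstract describes: inference handles the cases it can settle cheaply, and crowd answers are requested only where the data alone are not decisive.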
Acknowledgement
This paper was supported by NGFR 973 grant 2012CB316200, NSFC grants U1509216, 61472099, and 61133002, and National Sci-Tech Support Plan 2015BAH10F01.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Ye, C., Wang, H., Li, J., Gao, H., Cheng, S. (2016). Crowdsourcing-Enhanced Missing Values Imputation Based on Bayesian Network. In: Navathe, S., Wu, W., Shekhar, S., Du, X., Wang, X., Xiong, H. (eds) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science, vol 9642. Springer, Cham. https://doi.org/10.1007/978-3-319-32025-0_5
DOI: https://doi.org/10.1007/978-3-319-32025-0_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-32024-3
Online ISBN: 978-3-319-32025-0
eBook Packages: Computer Science (R0)