Abstract
Data quality is considered a crucial challenge in emerging big data scenarios. Data mining techniques can be reused efficiently in the data cleaning process. Recent studies have shown that databases often suffer from inconsistent data, which ought to be resolved during cleaning. In this paper, we introduce an automated approach for dependably generating rules from the databases themselves, in order to detect data inconsistency problems in large databases. The proposed approach employs confidence and lift measures together with integrity constraints, in order to guarantee that the generated rules are minimal, non-redundant, and precise. The proposed approach is validated against several datasets from the healthcare domain. We experimentally demonstrate that our approach achieves a significant improvement over existing approaches.
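The confidence and lift measures mentioned above can be illustrated with a minimal sketch. It uses the standard association-rule definitions, not the paper's actual algorithm, which the abstract does not spell out. The helper names (`matches`, `rule_measures`, `violations`), the attribute names, and the toy records are all hypothetical and chosen only for illustration.

```python
def matches(row, pattern):
    """True if the row agrees with every attribute/value pair in pattern."""
    return all(row.get(attr) == value for attr, value in pattern.items())


def rule_measures(rows, lhs, rhs):
    """Return (support, confidence, lift) for the candidate rule lhs -> rhs.

    Standard definitions, with n = number of rows:
      support    = count(lhs and rhs) / n
      confidence = count(lhs and rhs) / count(lhs)
      lift       = confidence / (count(rhs) / n)
    """
    n = len(rows)
    n_lhs = sum(matches(r, lhs) for r in rows)
    n_rhs = sum(matches(r, rhs) for r in rows)
    n_both = sum(matches(r, lhs) and matches(r, rhs) for r in rows)
    support = n_both / n if n else 0.0
    confidence = n_both / n_lhs if n_lhs else 0.0
    lift = confidence / (n_rhs / n) if n_rhs else 0.0
    return support, confidence, lift


def violations(rows, lhs, rhs):
    """Indices of rows that match lhs but disagree with rhs."""
    return [i for i, r in enumerate(rows)
            if matches(r, lhs) and not matches(r, rhs)]


# Hypothetical toy table: one record pairs a zip code with the wrong city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Boston"},   # inconsistent record
    {"zip": "02108", "city": "Boston"},
    {"zip": "02108", "city": "Boston"},
]

lhs, rhs = {"zip": "10001"}, {"city": "New York"}
support, confidence, lift = rule_measures(rows, lhs, rhs)
suspects = violations(rows, lhs, rhs)
print(support, confidence, lift, suspects)
```

In this sketch, a rule whose confidence is high and whose lift exceeds 1 would be a candidate cleaning rule, and the rows it flags are candidate inconsistencies. How such thresholds are chosen and combined with integrity constraints is left open here.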
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Abdo, A.S., Salem, R.K., Abdul-Kader, H.M. (2016). Automatic Rules Generation Approach for Data Cleaning in Medical Applications. In: Gaber, T., Hassanien, A., El-Bendary, N., Dey, N. (eds) The 1st International Conference on Advanced Intelligent System and Informatics (AISI2015), November 28-30, 2015, Beni Suef, Egypt. Advances in Intelligent Systems and Computing, vol 407. Springer, Cham. https://doi.org/10.1007/978-3-319-26690-9_1
DOI: https://doi.org/10.1007/978-3-319-26690-9_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26688-6
Online ISBN: 978-3-319-26690-9
eBook Packages: Computer Science, Computer Science (R0)