{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,4,9]],"date-time":"2025-04-09T06:48:35Z","timestamp":1744181315765,"version":"3.37.3"},"reference-count":78,"publisher":"Association for Computing Machinery (ACM)","issue":"3","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. Data and Information Quality"],"published-print":{"date-parts":[[2020,9,30]]},"abstract":"Real-world datasets often suffer from various data quality problems. Several data cleaning solutions have been proposed so far. However, data cleaning remains a manual and iterative task that requires domain and technical expertise. Exploiting metadata promises to improve the tedious process of data preparation, because data errors are detectable through metadata. This article investigates the intrinsic connection between metadata and data errors. In this work, we establish a mapping that reflects the connection between data quality issues and extractable metadata using qualitative and quantitative techniques. Additionally, we present a taxonomy based on a closed grammar that covers all existing metadata and allows the composition of novel types of metadata. 
We provide a case-study to show the practical application of the grammar for generating new metadata for data quality assessment.","DOI":"10.1145\/3371925","type":"journal-article","created":{"date-parts":[[2020,6,13]],"date-time":"2020-06-13T22:18:11Z","timestamp":1592086691000},"page":"1-30","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":16,"title":["Anatomy of Metadata for Data Curation"],"prefix":"10.1145","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2846-1373","authenticated-orcid":false,"given":"Larysa","family":"Visengeriyeva","sequence":"first","affiliation":[{"name":"TU Berlin"}]},{"given":"Ziawasch","family":"Abedjan","sequence":"additional","affiliation":[{"name":"TU Berlin"}]}],"member":"320","published-online":{"date-parts":[[2020,6,13]]},"reference":[{"doi-asserted-by":"publisher","key":"e_1_2_1_1_1","DOI":"10.14778\/2856318.2856328"},{"doi-asserted-by":"publisher","key":"e_1_2_1_2_1","DOI":"10.14778\/2994509.2994518"},{"doi-asserted-by":"publisher","key":"e_1_2_1_3_1","DOI":"10.1007\/s00778-015-0389-y"},{"doi-asserted-by":"crossref","unstructured":"Ziawasch Abedjan, Lukasz Golab, Felix Naumann, and Thorsten Papenbrock. 2018. Data Profiling. Morgan & Claypool Publishers.","key":"e_1_2_1_4_1","DOI":"10.1007\/978-3-031-01865-7"},{"doi-asserted-by":"publisher","key":"e_1_2_1_5_1","DOI":"10.1145\/2063576.2063801"},{"doi-asserted-by":"publisher","key":"e_1_2_1_6_1","DOI":"10.1007\/978-3-319-11955-7_11"},{"volume-title":"Outlier Analysis","author":"Aggarwal Charu C.","unstructured":"Charu C. Aggarwal. 2015. Outlier Analysis. Springer.","key":"e_1_2_1_7_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_8_1","DOI":"10.1145\/2723372.2742797"},{"unstructured":"V. Barnett and T. Lewis. 1995. Outliers in Statistical Data. 
Wiley Online Library.","key":"e_1_2_1_9_1"},{"key":"e_1_2_1_10_1","volume-title":"et\u00a0al","author":"Batini Carlo","year":"2016","unstructured":"Carlo Batini, Monica Scannapieco, et\u00a0al. 2016. Data and Information Quality. Springer."},{"key":"e_1_2_1_11_1","article-title":"Multi-scale anomaly detection algorithm based on infrequent pattern of time series","volume":"214","author":"Zhan Chen","year":"2008","unstructured":"Xiao-yun Chen and Yan-yan Zhan. 2008. Multi-scale anomaly detection algorithm based on infrequent pattern of time series. J. Comput. Appl. Math. 214, 1 (2008).","journal-title":"J. Comput. Appl. Math."},{"doi-asserted-by":"publisher","key":"e_1_2_1_12_1","DOI":"10.5555\/974305.974324"},{"doi-asserted-by":"publisher","key":"e_1_2_1_13_1","DOI":"10.1109\/ICDE.2013.6544847"},{"doi-asserted-by":"publisher","key":"e_1_2_1_14_1","DOI":"10.1145\/2723372.2749431"},{"doi-asserted-by":"publisher","key":"e_1_2_1_15_1","DOI":"10.1145\/2463676.2465327"},{"doi-asserted-by":"publisher","key":"e_1_2_1_16_1","DOI":"10.1145\/2488388.2488414"},{"doi-asserted-by":"publisher","key":"e_1_2_1_17_1","DOI":"10.1145\/564691.564719"},{"unstructured":"Wenfei Fan and Floris Geerts. 2012. Foundations of Data Quality Management. Morgan & Claypool Publishers.","key":"e_1_2_1_18_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_19_1","DOI":"10.1080\/01621459.1976.10481472"},{"doi-asserted-by":"publisher","key":"e_1_2_1_20_1","DOI":"10.14778\/2536360.2536363"},{"doi-asserted-by":"publisher","key":"e_1_2_1_21_1","DOI":"10.14778\/1920841.1921060"},{"doi-asserted-by":"publisher","key":"e_1_2_1_22_1","DOI":"10.1109\/TKDE.2013.184"},{"doi-asserted-by":"publisher","key":"e_1_2_1_23_1","DOI":"10.1145\/2882903.2903730"},{"doi-asserted-by":"publisher","key":"e_1_2_1_24_1","DOI":"10.1109\/HPCA.2018.00059"},{"unstructured":"Joseph M. Hellerstein. 2008. Quantitative Data Cleaning for Large Databases. 
United Nations Economic Commission for Europe (UNECE).","key":"e_1_2_1_25_1"},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of the Conference on Innovative Data Systems Research (CIDR\u201917)","author":"Hellerstein Joseph M.","year":"2017","unstructured":"Joseph M. Hellerstein, Vikram Sreekanti, Joseph E. Gonzalez, James Dalton, Akon Dey, Sreyashi Nag, Krishna Ramachandran, Sudhanshu Arora, Arka Bhattacharyya, Shirshanka Das, et\u00a0al. 2017. Ground: A data context service. In Proceedings of the Conference on Innovative Data Systems Research (CIDR\u201917)."},{"doi-asserted-by":"publisher","key":"e_1_2_1_27_1","DOI":"10.1023\/B:AIRE.0000045502.10941.a9"},{"unstructured":"Jean-Nicholas Hould. 2017. Craft beers dataset. Retrieved from https:\/\/www.kaggle.com\/nickhould\/craft-cans. Version 1.","key":"e_1_2_1_28_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_29_1","DOI":"10.1145\/1281192.1281294"},{"doi-asserted-by":"publisher","key":"e_1_2_1_30_1","DOI":"10.1016\/B978-012722442-8\/50011-2"},{"key":"e_1_2_1_31_1","first-page":"10","article-title":"Histogram-based solutions to diverse database estimation problems","volume":"18","author":"Ioannidis Yannis","year":"1995","unstructured":"Yannis Ioannidis and Viswanath Poosala. 1995. Histogram-based solutions to diverse database estimation problems. IEEE Data Eng. Bull. 18, 3 (1995), 10--18.","journal-title":"IEEE Data Eng. Bull."},{"doi-asserted-by":"publisher","key":"e_1_2_1_32_1","DOI":"10.1109\/69.536250"},{"volume-title":"The State of Machine Learning and Data Science","year":"2017","unstructured":"Kaggle. [n.d.]. The State of Machine Learning and Data Science 2017. Retrieved from https:\/\/bit.ly\/2KopcwB.","key":"e_1_2_1_33_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_34_1","DOI":"10.1145\/1978942.1979444"},{"doi-asserted-by":"publisher","key":"e_1_2_1_35_1","DOI":"10.1145\/2254556.2254659"},{"volume-title":"Proceedings of the IEEE International Conference on Big Data. 
431--440","author":"Kandogan Eser","unstructured":"Eser Kandogan, Mary Roth, Peter Schwarz, Joshua Hui, Ignacio Terrizzano, Christina Christodoulakis, and Ren\u00e9e J. Miller. 2015. LabBook: Metadata-driven social collaborative data analysis. In Proceedings of the IEEE International Conference on Big Data. 431--440.","key":"e_1_2_1_36_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_37_1","DOI":"10.1023\/A:1021564703268"},{"key":"e_1_2_1_38_1","volume-title":"Backus normal form vs. Backus Naur form. Commun. ACM (Dec","author":"Knuth Donald E.","year":"1964","unstructured":"Donald E. Knuth. 1964. Backus normal form vs. Backus Naur form. Commun. ACM (Dec. 1964)."},{"key":"e_1_2_1_39_1","volume-title":"Boostclean: Automated error detection and repair for machine learning. arXiv preprint arXiv:1711.01299","author":"Krishnan Sanjay","year":"2017","unstructured":"Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, and Eugene Wu. 2017. Boostclean: Automated error detection and repair for machine learning. arXiv preprint arXiv:1711.01299 (2017)."},{"doi-asserted-by":"publisher","key":"e_1_2_1_40_1","DOI":"10.1145\/2939502.2939511"},{"doi-asserted-by":"publisher","key":"e_1_2_1_41_1","DOI":"10.1145\/3132847.3133180"},{"key":"e_1_2_1_42_1","first-page":"8","article-title":"Data anamnesis: Admitting raw data into an organization","volume":"39","author":"Kruse Sebastian","year":"2016","unstructured":"Sebastian Kruse, Thorsten Papenbrock, Hazar Harmouch, and Felix Naumann. 2016. Data anamnesis: Admitting raw data into an organization. IEEE Data Eng. Bull. 39, 2 (2016), 8--20.","journal-title":"IEEE Data Eng. 
Bull."},{"doi-asserted-by":"publisher","key":"e_1_2_1_43_1","DOI":"10.1109\/PRDC.2015.41"},{"doi-asserted-by":"publisher","key":"e_1_2_1_44_1","DOI":"10.5176\/2010-2283_1.2.52"},{"doi-asserted-by":"publisher","key":"e_1_2_1_45_1","DOI":"10.14778\/2535568.2448943"},{"doi-asserted-by":"publisher","key":"e_1_2_1_46_1","DOI":"10.1145\/3299869.3324956"},{"volume-title":"Proceedings of the 5th Conference on Information Quality. 200--209","author":"Jonathan","unstructured":"Jonathan I. Maletic and Andrian Marcus. 2000. Data cleansing: Beyond integrity analysis. In Proceedings of the 5th Conference on Information Quality. 200--209.","key":"e_1_2_1_47_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_48_1","DOI":"10.1145\/62061.62063"},{"doi-asserted-by":"publisher","key":"e_1_2_1_49_1","DOI":"10.1145\/1807167.1807178"},{"key":"e_1_2_1_50_1","volume-title":"Levin","author":"McCarthy John","year":"1965","unstructured":"John McCarthy and Michael I. Levin. 1965. LISP 1.5 Programmer\u2019s Manual. The MIT Press."},{"unstructured":"Metmuseum. 2018. The Metropolitan Museum of Art Open Access. Retrieved from https:\/\/github.com\/metmuseum\/openaccess.","key":"e_1_2_1_51_1"},{"unstructured":"Michael Stonebraker, Nik Bates-Haus, Liam Cleary, Larry Simmons, and Andy Palmer. 2018. Getting Data Operations Right. O\u2019Reilly Media.","key":"e_1_2_1_52_1"},{"key":"e_1_2_1_53_1","volume-title":"Efficient estimation of word representations in vector space. arXiv","author":"Mikolov Tomas","year":"2013","unstructured":"Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv (2013)."},{"key":"e_1_2_1_54_1","volume-title":"Technical Report HUB-IB-164, Humboldt-Universit\u00e4t zu Berlin","author":"M\u00fcller H.","year":"2003","unstructured":"H. M\u00fcller and J. C. Freytag. 2003. Problems, Methods and Challenges in Comprehensive Data Cleansing. 
Technical Report HUB-IB-164, Humboldt-Universit\u00e4t zu Berlin, Institut f\u00fcr Informatik."},{"doi-asserted-by":"publisher","key":"e_1_2_1_55_1","DOI":"10.1145\/2590989.2590995"},{"key":"e_1_2_1_56_1","volume-title":"Proceedings of the International Workshop on Data and Information Quality. 219--233","author":"Oliveira Paulo","year":"2005","unstructured":"Paulo Oliveira, F\u00e1tima Rodrigues, Pedro Henriques, and Helena Galhardas. 2005. A taxonomy of data quality problems. In Proceedings of the International Workshop on Data and Information Quality. 219--233."},{"doi-asserted-by":"publisher","key":"e_1_2_1_57_1","DOI":"10.14778\/2824032.2824086"},{"doi-asserted-by":"publisher","key":"e_1_2_1_58_1","DOI":"10.1073\/pnas.1219097111"},{"volume-title":"Mining Imperfect Data: Dealing with Contamination and Incomplete Records","author":"Pearson Ronald K.","unstructured":"Ronald K. Pearson. 2005. Mining Imperfect Data: Dealing with Contamination and Incomplete Records. Vol. 93. Siam.","key":"e_1_2_1_59_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_60_1","DOI":"10.1145\/1147234.1147247"},{"volume-title":"Outlier Detection in Heterogeneous Datasets Using Automatic Tuple Expansion","author":"Claudel Clement Pit","unstructured":"Clement Pit Claudel, Zelda Mariet, Rachael Harding, and Sam Madden. 2016. Outlier Detection in Heterogeneous Datasets Using Automatic Tuple Expansion. Technical Report, MIT.","key":"e_1_2_1_61_1"},{"volume-title":"Proceedings of the International Conference on Advances in Neural Information Processing Systems. 4361--4370","author":"Platanios Emmanouil","unstructured":"Emmanouil Platanios, Hoifung Poon, Tom M. Mitchell, and Eric J. Horvitz. 2017. Estimating accuracy from unlabeled data: A probabilistic logic approach. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 
4361--4370.","key":"e_1_2_1_62_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_63_1","DOI":"10.14778\/2856318.2856325"},{"doi-asserted-by":"publisher","key":"e_1_2_1_64_1","DOI":"10.1145\/3219819.3220109"},{"key":"e_1_2_1_65_1","first-page":"3","article-title":"Data cleaning: Problems and current approaches","volume":"23","author":"Rahm Erhard","year":"2000","unstructured":"Erhard Rahm and Hong Hai Do. 2000. Data cleaning: Problems and current approaches. In IEEE Data Eng. Bull., Vol. 23. 3--13.","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_1_66_1","volume-title":"Proceedings of the International Conference on Very Large Databases (VLDB\u201901)","volume":"1","author":"Raman Vijayshankar","unstructured":"Vijayshankar Raman and Joseph M. Hellerstein. 2001. Potter\u2019s wheel: An interactive data cleaning system. In Proceedings of the International Conference on Very Large Databases (VLDB\u201901), Vol. 1. 381--390."},{"doi-asserted-by":"publisher","key":"e_1_2_1_67_1","DOI":"10.14778\/3157794.3157797"},{"unstructured":"Thomas C. Redman. 2018. If Your Data Is Bad Your Machine Learning Tools Are Useless. Retrieved from https:\/\/bit.ly\/2InCpnA.","key":"e_1_2_1_68_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_69_1","DOI":"10.14778\/3137628.3137631"},{"unstructured":"Jenn Riley. 2017. Understanding Metadata. What Is Metadata and What Is It for? Retrieved from https:\/\/groups.niso.org\/apps\/group_public\/download.php\/17443\/understanding-metadata.","key":"e_1_2_1_70_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_71_1","DOI":"10.1109\/MSP.2007.914237"},{"key":"e_1_2_1_72_1","volume-title":"Proceedings of the Conference on Innovative Data Systems Research (CIDR\u201913)","author":"Stonebraker Michael","year":"2013","unstructured":"Michael Stonebraker, Daniel Bruckner, Ihab F. Ilyas, George Beskales, Mitch Cherniack, Stanley B. Zdonik, Alexander Pagan, and Shan Xu. 2013. Data curation at scale: The Data Tamer System. 
In Proceedings of the Conference on Innovative Data Systems Research (CIDR\u201913)."},{"key":"e_1_2_1_73_1","first-page":"3","article-title":"Data integration: The current status and the way forward","volume":"41","author":"Stonebraker Michael","year":"2018","unstructured":"Michael Stonebraker and Ihab F. Ilyas. 2018. Data integration: The current status and the way forward. IEEE Data Eng. Bull. 41, 2 (2018), 3--9.","journal-title":"IEEE Data Eng. Bull."},{"unstructured":"Trifacta. 2019. Supported Data Types - Trifacta Wrangler. https:\/\/docs.trifacta.com\/display\/PE\/SupportedDataTypes.","key":"e_1_2_1_74_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_75_1","DOI":"10.1145\/3221269.3223028"},{"doi-asserted-by":"publisher","key":"e_1_2_1_76_1","DOI":"10.1145\/3299869.3319855"},{"volume-title":"Proceedings of the International Conference on Management of Data (SIGMOD\u201913)","author":"Yakout Mohamed","unstructured":"Mohamed Yakout, Laure Berti-\u00c9quille, and Ahmed K. Elmagarmid. 2013. Don\u2019t be SCAREd: Use SCalable Automatic REpairing with maximal likelihood and bounded changes. In Proceedings of the International Conference on Management of Data (SIGMOD\u201913). 
553--564.","key":"e_1_2_1_77_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_78_1","DOI":"10.1201\/b12207"}],"container-title":["Journal of Data and Information Quality"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3371925","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,5,7]],"date-time":"2024-05-07T22:16:48Z","timestamp":1715120208000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3371925"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,6,13]]},"references-count":78,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2020,9,30]]}},"alternative-id":["10.1145\/3371925"],"URL":"https:\/\/doi.org\/10.1145\/3371925","relation":{},"ISSN":["1936-1955","1936-1963"],"issn-type":[{"type":"print","value":"1936-1955"},{"type":"electronic","value":"1936-1963"}],"subject":[],"published":{"date-parts":[[2020,6,13]]},"assertion":[{"value":"2019-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-06-13","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}