{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,8,13]],"date-time":"2024-08-13T06:36:14Z","timestamp":1723530974605},"reference-count":39,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2022,4,11]],"date-time":"2022-04-11T00:00:00Z","timestamp":1649635200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MTI"],"abstract":"The data management process is characterised by a set of tasks where data quality management (DQM) is one of the core components. Data quality, however, is a multidimensional concept, where the nature of the data quality issues is very diverse. One of the most widely anticipated data quality challenges, which becomes particularly vital when data come from multiple data sources which is a typical situation in the current data-driven world, is duplicates or non-uniqueness. Even more, duplicates were recognised to be one of the key domain-specific data quality dimensions in the context of the Internet of Things (IoT) application domains, where smart grids and health dominate most. Duplicate data lead to inaccurate analyses, leading to wrong decisions, negatively affect data-driven and\/or data processing activities such as the development of models, forecasts, simulations, have a negative impact on customer service, risk and crisis management, service personalisation in terms of both their accuracy and trustworthiness, decrease user adoption and satisfaction, etc. The process of determination and elimination of duplicates is known as deduplication, while the process of finding duplicates in one or more databases that refer to the same entities is known as Record Linkage. 
To find the duplicates, the data sets are compared with each other using similarity functions that are usually used to compare two input strings to find similarities between them, which requires quadratic time complexity. To defuse the quadratic complexity of the problem, especially in large data sources, record linkage methods, such as blocking and sorted neighbourhood, are used. In this paper, we propose a six-step record linkage deduplication framework. The operation of the framework is demonstrated on a simplified example of research data artifacts, such as publications, research projects and others of the real-world research institution representing Research Information Systems (RIS) domain. To make the proposed framework usable we integrated it into a tool that is already used in practice, by developing a prototype of an extension for the well-known DataCleaner. The framework detects and visualises duplicates thereby identifying and providing the user with identified redundancies in a user-friendly manner allowing their further elimination. By removing the redundancies, the quality of the data is improved therefore improving analyses and decision-making. 
This study makes a call for other researchers to take a step towards the \u201cgolden record\u201d that can be achieved when all data quality issues are recognised and resolved, thus moving towards absolute data quality.","DOI":"10.3390\/mti6040027","type":"journal-article","created":{"date-parts":[[2022,4,12]],"date-time":"2022-04-12T06:48:59Z","timestamp":1649746139000},"page":"27","source":"Crossref","is-referenced-by-count":12,"title":["A Record Linkage-Based Data Deduplication Framework with DataCleaner Extension"],"prefix":"10.3390","volume":"6","author":[{"ORCID":"http:\/\/orcid.org\/0000-0002-5225-389X","authenticated-orcid":false,"given":"Otmane","family":"Azeroual","sequence":"first","affiliation":[{"name":"German Centre for Higher Education Research and Science Studies (DZHW), Sch\u00fctzenstra\u00dfe 6A, 10117 Berlin, Germany"}]},{"ORCID":"http:\/\/orcid.org\/0000-0001-9854-8426","authenticated-orcid":false,"given":"Meena","family":"Jha","sequence":"additional","affiliation":[{"name":"School of Engineering and Technology, Central Queensland University, Sydney, NSW 2000, Australia"}]},{"ORCID":"http:\/\/orcid.org\/0000-0002-0532-3488","authenticated-orcid":false,"given":"Anastasija","family":"Nikiforova","sequence":"additional","affiliation":[{"name":"Institute of Computer Science, University of Tartu, Narva mnt 18, 51009 Tartu, Estonia"},{"name":"European Open Science Cloud (EOSC) Task Force \u201cFAIR Metrics and Data Quality\u201d, 1050 Brussels, Belgium"}]},{"given":"Kewei","family":"Sha","sequence":"additional","affiliation":[{"name":"College of Science and Engineering, University of Houston Clear Lake, 2700 Bay Area Blvd, Houston, TX 77058, USA"}]},{"ORCID":"http:\/\/orcid.org\/0000-0002-1071-7713","authenticated-orcid":false,"given":"Mohammad","family":"Alsmirat","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Sharjah, University City Rd., Sharjah 27272, United Arab 
Emirates"},{"name":"Department of Computer Science, Jordan University of Science and Technology, Irbid 22110, Jordan"}]},{"given":"Sanjay","family":"Jha","sequence":"additional","affiliation":[{"name":"School of Engineering and Technology, Central Queensland University, Sydney, NSW 2000, Australia"}]}],"member":"1968","published-online":{"date-parts":[[2022,4,11]]},"reference":[{"key":"ref_1","unstructured":"Benson, P.R. (2022, January 19). Identifying and Resolving Duplicates in Master Data. White Paper ISO 8000. Available online: https:\/\/eccma.org\/what-is-iso-8000\/."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/978-3-031-01835-0","article-title":"An Introduction to Duplicate Detection","volume":"2","author":"Naumann","year":"2010","journal-title":"Synth. Lect. Data Manag."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"e5213","DOI":"10.1002\/cpe.5213","article-title":"Efficient hash function\u2013based duplication detection algorithm for data Deduplication deduction and reduction","volume":"33","author":"Periasamy","year":"2019","journal-title":"Concurr. Comput. Pract. Exp."},{"key":"ref_4","first-page":"391","article-title":"Definition and Evaluation of Data Quality: User-Oriented Data Object-Driven Approach to Data Quality Assessment","volume":"8","author":"Nikiforova","year":"2020","journal-title":"Balt. J. Mod. Comput."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"353","DOI":"10.1080\/02763869.2015.1052699","article-title":"The \u201cInternet of Things\u201d: What It Is and What It Means for Libraries","volume":"34","author":"Hoy","year":"2015","journal-title":"Med. Ref. Serv. 
Q."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"1497","DOI":"10.1016\/j.adhoc.2012.02.016","article-title":"Internet of things: Vision, applications and research challenges","volume":"10","author":"Miorandi","year":"2012","journal-title":"Ad Hoc Networks"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"7","DOI":"10.1080\/0267257X.2016.1217914","article-title":"Value co-creation with Internet of things technology in the retail industry","volume":"33","author":"Balaji","year":"2016","journal-title":"J. Mark. Manag."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"493","DOI":"10.1080\/17477891.2020.1867493","article-title":"Internet of things in disaster management: Technologies and uses","volume":"20","author":"Bail","year":"2021","journal-title":"Environ. Hazards"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"237","DOI":"10.1108\/IJPCC-08-2020-0117","article-title":"Techno-managerial implications towards communication in internet of things for smart cities","volume":"17","author":"Pawar","year":"2021","journal-title":"Int. J. Pervasive Comput. Commun."},{"key":"ref_10","first-page":"3","article-title":"Smart cities and internet of things","volume":"21","author":"Samih","year":"2019","journal-title":"J. Inf. Technol. Case Appl. Res."},{"key":"ref_11","first-page":"282","article-title":"Development of Internet of Things-Related Monitoring Policies","volume":"13","author":"Kaupins","year":"2018","journal-title":"J. Inf. Priv. Secur."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Krogstie, J., Opdahl, A.L., and Brinkkemper, S. (2007). Data Integration-Problems, Approaches, and Perspectives. 
Conceptual Modelling in Information Systems Engineering, Springer.","DOI":"10.1007\/978-3-540-72677-7"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"1537","DOI":"10.1109\/TKDE.2011.127","article-title":"A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication","volume":"24","author":"Christen","year":"2011","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_14","first-page":"1028","article-title":"Efficient and Effective Duplicate Detection in Hierarchical Data","volume":"25","author":"Calado","year":"2012","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"104804","DOI":"10.1016\/j.cmpb.2018.10.016","article-title":"Initializing a hospital-wide data quality program. The AP-HP experience","volume":"181","author":"Daniel","year":"2018","journal-title":"Comput. Methods Programs Biomed."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TKDE.2007.250581","article-title":"Duplicate Record Detection: A Survey","volume":"19","author":"Elmagarmid","year":"2006","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1016\/j.gpb.2018.11.006","article-title":"Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases","volume":"18","author":"Chen","year":"2020","journal-title":"Genom. Proteom. Bioinform."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"184","DOI":"10.3163\/1536-5050.103.4.004","article-title":"Identifying and removing duplicate records from systematic review searches","volume":"103","author":"Kwon","year":"2015","journal-title":"J. Med. Libr. Assoc."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"313","DOI":"10.1002\/wics.1317","article-title":"Matching and record linkage","volume":"6","author":"Winkler","year":"2014","journal-title":"WIREs Comput. 
Stat."},{"key":"ref_20","unstructured":"Baxter, R., Christen, P., and Churches, T. (2003, January 24\u201327). A Comparison of Fast Blocking Methods for Record Linkage. Proceedings of the Workshop on Data Cleaning, Record Linkage and Object Consolidation at the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"1253","DOI":"10.14778\/1454159.1454165","article-title":"Industry-scale duplicate detection","volume":"1","author":"Weis","year":"2008","journal-title":"Proc. VLDB Endow."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"954","DOI":"10.1126\/science.130.3381.954","article-title":"Automatic Linkage of Vital Records","volume":"130","author":"Newcombe","year":"1959","journal-title":"Science"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1080\/07421222.1996.11518099","article-title":"Beyond Accuracy: What Data Quality Means to Data Consumers","volume":"12","author":"Wang","year":"1996","journal-title":"J. Manag. Inf. Syst."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Conrad, J.G., Guo, X.S., and Schriber, C.P. (2003, January 3\u20138). Online Duplicate Document Detection: Signature Reliability in a Dynamic Retrieval Environment. Proceedings of the twelfth international conference on Information and knowledge management (CIKM \u201803), Association for Computing Machinery, New York, NY, USA.","DOI":"10.1145\/956943.956946"},{"key":"ref_25","first-page":"60","article-title":"Extracting, Linking and Integrating Data from Public Sources: A Financial Case Study","volume":"34","author":"Burdick","year":"2011","journal-title":"IEEE Data Eng. Bull."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"30","DOI":"10.17706\/jsw.15.1.30-44","article-title":"Detecting Feature Duplication in a CRM Product Line","volume":"15","author":"Khtira","year":"2020","journal-title":"J. 
Softw."},{"key":"ref_27","unstructured":"(2021, September 20). TechTarget, WhatIs.Com. Available online: https:\/\/whatis.techtarget.com\/definition\/golden-record."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"253","DOI":"10.1007\/978-3-319-11257-2_20","article-title":"A Comparison of Blocking Methods for Record Linkage","volume":"Volume 8744","year":"2014","journal-title":"International Conference on Privacy in Statistical Databases"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Yan, S., Lee, D., Kan, M.Y., and Giles, L.C. (2007, January 18\u201323). Adaptive Sorted Neighborhood Methods for Efficient Record Linkage. Proceedings of the 7th ACM\/IEEE-CS joint conference on Digital libraries (JCDL \u201807). Association for Computing Machinery, New York, NY, USA.","DOI":"10.1145\/1255175.1255213"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3418896","article-title":"An Overview of End-to-End Entity Resolution for Big Data","volume":"53","author":"Christophides","year":"2021","journal-title":"ACM Comput. Surv."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Panse, F., Van Keulen, M., De Keijzer, A., and Ritter, N. (2010, January 1\u20136). Duplicate detection in probabilistic data. Proceedings of the IEEE 26th International Conference on Data Engineering Workshops (ICDEW2010), Long Beach, CA, USA.","DOI":"10.1109\/ICDEW.2010.5452759"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"1183","DOI":"10.1080\/01621459.1969.10501049","article-title":"A theory for record linkage","volume":"64","author":"Fellegi","year":"1969","journal-title":"J. Am. Stat. Assoc."},{"key":"ref_33","unstructured":"Batini, C., and Scannapieco, M. (2006). Data Quality: Concepts, Methodologies and Techniques, Springer."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Manning, C.D., Raghavan, P., and Sch\u00fctze, H. (2008). 
Introduction to Information Retrieval, Cambridge University Press.","DOI":"10.1017\/CBO9780511809071"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Krasikov, P., Obrecht, T., Legner, C., and Eurich, M. (2020). Open Data in the Enterprise Context: Assessing Open Corporate Data\u2019s Readiness for Use. International Conference on Data Management Technologies and Applications, Springer.","DOI":"10.1007\/978-3-030-83014-4_4"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Nikiforova, A., and Kozmina, N. (2021, January 15\u201317). Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can Save Your Business. Proceedings of the 2021 Second International Conference on Intelligent Data Science Technologies and Applications (IDSTA), Tartu, Estonia.","DOI":"10.1109\/IDSTA53674.2021.9660802"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Premtoon, V., Koppel, J., and Solar-Lezama, A. (2020, January 15\u201320). Semantic code search via equational reasoning. Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2020), Association for Computing Machinery, New York, NY, USA.","DOI":"10.1145\/3385412.3386001"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1016\/j.jnca.2016.08.002","article-title":"Data quality in internet of things: A state-of-the-art survey","volume":"73","author":"Karkouch","year":"2016","journal-title":"J. Netw. Comput. Appl."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"101619","DOI":"10.1016\/j.giq.2021.101619","article-title":"A data quality approach to the identification of discrimination risk in automated decision making systems","volume":"38","author":"Torchiano","year":"2021","journal-title":"Gov. Inf. 
Q."}],"container-title":["Multimodal Technologies and Interaction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2414-4088\/6\/4\/27\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,28]],"date-time":"2024-07-28T11:22:00Z","timestamp":1722165720000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2414-4088\/6\/4\/27"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,4,11]]},"references-count":39,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2022,4]]}},"alternative-id":["mti6040027"],"URL":"https:\/\/doi.org\/10.3390\/mti6040027","relation":{},"ISSN":["2414-4088"],"issn-type":[{"value":"2414-4088","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,4,11]]}}}