Abstract
In today’s society, data play an integral role in the development of global industry, since they have become a valuable asset for companies, institutions, governments, and others. At the same time, the data generated daily at a global scale require significant resources to pre-process, filter and store. When acquiring such stored data, it is essential to understand beforehand which dataset fits the needs of the user. One particularly important factor is the quality of a dataset, which can be determined from a series of quality-related attributes derived from it. Obtaining such attributes constitutes “Profiling”, the process of extracting information from a data sample that relates to the quality of the complete dataset. However, in the era of Big Data, the ability to apply profiling techniques to complete large datasets should also be considered, in order to obtain complete quality insights. This paper attempts to address this consideration by presenting “DaQuE”, a scalable framework for efficient profiling and quality analytics extraction on complete datasets of all volumes.
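To make the notion of quality-related attributes concrete, the following is a minimal sketch of profiling a complete dataset in a single streaming pass. It is not the DaQuE implementation: the function name `profile_rows`, the chosen metrics (completeness, distinct count, uniqueness), and the convention that `None` and the empty string count as missing are all assumptions made for illustration.

```python
from collections import defaultdict


def profile_rows(rows):
    """Profile every row of a dataset in one pass (illustrative sketch only).

    `rows` is any iterable of dicts mapping column name -> value.
    Assumptions: None and "" are treated as missing; values are hashable.
    """
    # Per-column accumulators; the set of distinct values grows with cardinality.
    stats = defaultdict(lambda: {"rows": 0, "missing": 0, "seen": set()})

    for row in rows:
        for name, value in row.items():
            col = stats[name]
            col["rows"] += 1
            if value is None or value == "":
                col["missing"] += 1
            else:
                col["seen"].add(value)

    report = {}
    for name, col in stats.items():
        present = col["rows"] - col["missing"]
        report[name] = {
            "rows": col["rows"],
            # Share of values that are not missing.
            "completeness": present / col["rows"],
            # Number of different non-missing values.
            "distinct": len(col["seen"]),
            # Distinct values relative to non-missing values.
            "uniqueness": len(col["seen"]) / present if present else 0.0,
        }
    return report


if __name__ == "__main__":
    sample = [
        {"id": 1, "city": "Athens"},
        {"id": 2, "city": ""},
        {"id": 3, "city": "Athens"},
        {"id": 4, "city": None},
    ]
    for column, metrics in profile_rows(sample).items():
        print(column, metrics)
```

Because the function consumes an iterable rather than a loaded table, it can be fed a generator that reads a file record by record, so the whole dataset is inspected instead of a sample. A distributed engine would be needed once the per-column state no longer fits on one machine.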
Acknowledgment
The research leading to these results has received funding from the European Commission under the Horizon Europe Programme’s project "DATAMITE" (Grant Agreement No. 101092989).
Copyright information
© 2024 IFIP International Federation for Information Processing
About this paper
Cite this paper
Nikolakopoulos, A. et al. (2024). Scalable Data Profiling for Quality Analytics Extraction. In: Maglogiannis, I., Iliadis, L., Karydis, I., Papaleonidas, A., Chochliouros, I. (eds) Artificial Intelligence Applications and Innovations. AIAI 2024 IFIP WG 12.5 International Workshops. AIAI 2024. IFIP Advances in Information and Communication Technology, vol 715. Springer, Cham. https://doi.org/10.1007/978-3-031-63227-3_12
DOI: https://doi.org/10.1007/978-3-031-63227-3_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-63226-6
Online ISBN: 978-3-031-63227-3
eBook Packages: Computer Science (R0)