Scalable Data Profiling for Quality Analytics Extraction

Abstract

In modern society, data play an integral role in the development of global industry, since they have become a valuable asset for companies, institutions, governments, and others. At the same time, the data generated daily at a global scale require significant resources to pre-process, filter, and store. When acquiring such stored data, it is essential to understand beforehand which dataset fits the needs of the user. One particularly important factor is the quality of a dataset, which can be determined from a series of quality-related attributes derived from it. Extracting such attributes constitutes “Profiling”, the process of obtaining information from a data sample that relates to the quality of the complete dataset. However, in the era of Big Data, the ability to apply profiling techniques to complete large datasets should also be considered, in order to obtain complete quality insights. This paper attempts to address this need by presenting “DaQuE”, a scalable framework for efficient profiling and quality analytics extraction on complete datasets of all volumes.

Acknowledgment

The research leading to these results has received funding from the European Commission under the Horizon Europe Programme’s project "DATAMITE" (Grant Agreement No. 101092989).

Author information

Corresponding author

Correspondence to Anastasios Nikolakopoulos.

Copyright information

© 2024 IFIP International Federation for Information Processing

About this paper

Cite this paper

Nikolakopoulos, A. et al. (2024). Scalable Data Profiling for Quality Analytics Extraction. In: Maglogiannis, I., Iliadis, L., Karydis, I., Papaleonidas, A., Chochliouros, I. (eds) Artificial Intelligence Applications and Innovations. AIAI 2024 IFIP WG 12.5 International Workshops. AIAI 2024. IFIP Advances in Information and Communication Technology, vol 715. Springer, Cham. https://doi.org/10.1007/978-3-031-63227-3_12

  • DOI: https://doi.org/10.1007/978-3-031-63227-3_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-63226-6

  • Online ISBN: 978-3-031-63227-3

  • eBook Packages: Computer Science, Computer Science (R0)
