Abstract
The exponential growth of data driven by digitization requires the efficient storage and utilization of large datasets. Data lakes can store heterogeneous datasets and prepare them for machine learning (ML). However, current data lakes lack mature capabilities to support ML requirements. AutoML is the process of automating the end-to-end application of ML to real-world problems. Large Language Models (LLMs) can potentially increase ML pipeline automation by assisting at various stages of the process and democratizing access to advanced analytics. This paper explores the integration of AutoML tools and LLMs and their application in the data lake SEDAR. We present an extended data lake metadata model for capturing data analytics, a Python package that wraps AutoML libraries, and a module that leverages LLMs for AutoML. Finally, we present a comparative analysis of the performance of AutoML and LLMs on four challenging real-world use cases from the domain of chemistry, each presenting a distinct type of ML problem.
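To make the wrapper idea concrete, the following minimal sketch shows one way a uniform fit/predict interface over interchangeable AutoML backends could look. It is an illustrative assumption, not the SEDAR Python package or its actual API; a scikit-learn grid search over a random forest stands in for a real AutoML library, and all names (AutoMLBackend, GridSearchBackend, time_budget_s) are hypothetical.

# Illustrative sketch only -- not the SEDAR AutoML wrapper package described above.
# It demonstrates the idea of a uniform interface behind which different AutoML
# libraries could be registered; a scikit-learn grid search stands in for a real backend.
from abc import ABC, abstractmethod

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split


class AutoMLBackend(ABC):
    """Common interface that each wrapped AutoML library would implement."""

    @abstractmethod
    def fit(self, X, y):
        ...

    @abstractmethod
    def predict(self, X):
        ...


class GridSearchBackend(AutoMLBackend):
    """Toy backend: exhaustive grid search over a random forest classifier."""

    def __init__(self, time_budget_s=60):
        self.time_budget_s = time_budget_s  # a real backend would honor this budget
        self._search = GridSearchCV(
            RandomForestClassifier(random_state=0),
            param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
            cv=3,
        )

    def fit(self, X, y):
        self._search.fit(X, y)
        return self

    def predict(self, X):
        return self._search.predict(X)


if __name__ == "__main__":
    # Tiny usage example on a public dataset.
    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    backend = GridSearchBackend().fit(X_tr, y_tr)
    print("test accuracy:", (backend.predict(X_te) == y_te).mean())

Swapping the toy grid-search backend for an actual AutoML library would only require another AutoMLBackend subclass; this decoupling is the kind of uniformity a wrapper package provides.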
Acknowledgements
This work has been sponsored by the German Federal Ministry of Education and Research in the funding program “Forschung an Fachhochschulen”, project i²DACH (grant no. 13FH557KX0).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hoseini, S., Ibbels, M., Quix, C. (2024). Enhancing Machine Learning Capabilities in Data Lakes with AutoML and LLMs. In: Tekli, J., Gamper, J., Chbeir, R., Manolopoulos, Y. (eds) Advances in Databases and Information Systems. ADBIS 2024. Lecture Notes in Computer Science, vol 14918. Springer, Cham. https://doi.org/10.1007/978-3-031-70626-4_13
DOI: https://doi.org/10.1007/978-3-031-70626-4_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70628-8
Online ISBN: 978-3-031-70626-4
eBook Packages: Computer Science, Computer Science (R0)