Abstract
The exponential growth of data driven by digitization requires the efficient storage and utilization of large datasets. Data lakes can store heterogeneous datasets and prepare them for machine learning (ML). However, current data lakes lack mature capabilities to support ML requirements. AutoML is the process of automating the end-to-end application of ML to real-world problems. Large Language Models (LLMs) can potentially increase ML pipeline automation by assisting at various stages of the process and democratizing access to advanced analytics. This paper explores the integration of AutoML tools and LLMs and their application in the data lake SEDAR. We present an extended data lake metadata model for capturing data analytics, a Python package that wraps AutoML libraries, and a module that leverages LLMs for AutoML. Finally, we present a comparative analysis of the performance of AutoML and LLMs on four challenging real-world use cases from the domain of chemistry, each presenting a distinct type of ML problem.
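To make the wrapper idea concrete, the following minimal sketch shows one way a uniform fit/predict interface over interchangeable AutoML backends could look. It is an illustrative assumption, not the SEDAR Python package or its actual API; a scikit-learn grid search over a random forest stands in for a real AutoML library, and all names (AutoMLBackend, GridSearchBackend, time_budget_s) are hypothetical.

# Illustrative sketch only -- not the SEDAR AutoML wrapper package described above.
# It demonstrates the idea of a uniform interface behind which different AutoML
# libraries could be registered; a scikit-learn grid search stands in for a real backend.
from abc import ABC, abstractmethod

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split


class AutoMLBackend(ABC):
    """Common interface that each wrapped AutoML library would implement."""

    @abstractmethod
    def fit(self, X, y):
        ...

    @abstractmethod
    def predict(self, X):
        ...


class GridSearchBackend(AutoMLBackend):
    """Toy backend: exhaustive grid search over a random forest classifier."""

    def __init__(self, time_budget_s=60):
        self.time_budget_s = time_budget_s  # a real backend would honor this budget
        self._search = GridSearchCV(
            RandomForestClassifier(random_state=0),
            param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
            cv=3,
        )

    def fit(self, X, y):
        self._search.fit(X, y)
        return self

    def predict(self, X):
        return self._search.predict(X)


if __name__ == "__main__":
    # Tiny usage example on a public dataset.
    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    backend = GridSearchBackend().fit(X_tr, y_tr)
    print("test accuracy:", (backend.predict(X_te) == y_te).mean())

Swapping the toy grid-search backend for an actual AutoML library would only require another AutoMLBackend subclass; this decoupling is the kind of uniformity a wrapper package provides.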
Acknowledgements
This work has been sponsored by the German Federal Ministry of Education and Research in the funding program “Forschung an Fachhochschulen”, project i²DACH (grant no. 13FH557KX0).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hoseini, S., Ibbels, M., Quix, C. (2024). Enhancing Machine Learning Capabilities in Data Lakes with AutoML and LLMs. In: Tekli, J., Gamper, J., Chbeir, R., Manolopoulos, Y. (eds) Advances in Databases and Information Systems. ADBIS 2024. Lecture Notes in Computer Science, vol 14918. Springer, Cham. https://doi.org/10.1007/978-3-031-70626-4_13
DOI: https://doi.org/10.1007/978-3-031-70626-4_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70628-8
Online ISBN: 978-3-031-70626-4
eBook Packages: Computer Science, Computer Science (R0)