Enhancing Machine Learning Capabilities in Data Lakes with AutoML and LLMs

  • Conference paper
Advances in Databases and Information Systems (ADBIS 2024)

Abstract

The exponential growth of data from digitization requires efficient utilization and storage of large amounts of data. Data lakes can store heterogeneous datasets and prepare them for machine learning (ML). However, current data lakes lack mature capabilities to support ML requirements. AutoML is the process of automating the end-to-end application of ML to real-world problems. Large Language Models (LLMs) can potentially increase ML pipeline automation by assisting at various stages of the process and democratizing access to advanced analytics. This paper explores the integration of AutoML tools and LLMs and their application in the data lake SEDAR. We present an extended data lake metadata model for capturing data analytics, a Python package for wrapping AutoML libraries, and a module that leverages LLMs for AutoML. Finally, we undertake a comparative analysis between the performance of AutoML and LLMs in four challenging real-world use cases from the domain of chemistry, each presenting a distinct type of ML problem.
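
The idea of wrapping several AutoML libraries behind one interface, and recording each run as data lake metadata, can be sketched as follows. This is a minimal illustration only, not the package described in the paper: the names `AutoMLBackend`, `MajorityClassBackend`, `BACKENDS`, `RunRecord` and `run_automl` are all hypothetical, and the single backend is a trivial stand-in so that the example runs without any AutoML library installed.

```python
from abc import ABC, abstractmethod
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence, Type


class AutoMLBackend(ABC):
    """Hypothetical common interface that each wrapped AutoML library would implement."""

    @abstractmethod
    def fit(self, X: Sequence[Sequence[float]], y: Sequence[Any]) -> None:
        ...

    @abstractmethod
    def predict(self, X: Sequence[Sequence[float]]) -> List[Any]:
        ...


class MajorityClassBackend(AutoMLBackend):
    """Stand-in backend (not a real AutoML library): predicts the most frequent label."""

    def __init__(self) -> None:
        self._label: Any = None

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[Any]) -> None:
        # Remember the most common training label.
        self._label = Counter(y).most_common(1)[0][0]

    def predict(self, X: Sequence[Sequence[float]]) -> List[Any]:
        return [self._label for _ in X]


# Registry from backend name to wrapper class; wrappers for real libraries
# would be registered here under the same interface (assumption).
BACKENDS: Dict[str, Type[AutoMLBackend]] = {"majority": MajorityClassBackend}


@dataclass
class RunRecord:
    """Hypothetical metadata record a data lake could store for each analytics run."""

    dataset_id: str
    backend: str
    accuracy: float
    params: Dict[str, Any] = field(default_factory=dict)


def run_automl(
    dataset_id: str,
    backend_name: str,
    X_train: Sequence[Sequence[float]],
    y_train: Sequence[Any],
    X_test: Sequence[Sequence[float]],
    y_test: Sequence[Any],
) -> RunRecord:
    """Train the selected backend and return a metadata record with test accuracy."""
    model = BACKENDS[backend_name]()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    correct = sum(p == t for p, t in zip(preds, y_test))
    return RunRecord(dataset_id, backend_name, correct / len(y_test))
```

With this shape, swapping one AutoML library for another would only change the name passed to `run_automl`, while the stored record keeps the same fields regardless of the backend.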

Notes

  1. https://github.com/automl/auto-sklearn, https://github.com/keras-team/autokeras, https://github.com/autogluon/autogluon.

Acknowledgements

This work has been sponsored by the German Federal Ministry of Education and Research in the funding program “Forschung an Fachhochschulen”, project \(i^2DACH\) (grant no. 13FH557KX0).

Author information

Corresponding author

Correspondence to Sayed Hoseini.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hoseini, S., Ibbels, M., Quix, C. (2024). Enhancing Machine Learning Capabilities in Data Lakes with AutoML and LLMs. In: Tekli, J., Gamper, J., Chbeir, R., Manolopoulos, Y. (eds) Advances in Databases and Information Systems. ADBIS 2024. Lecture Notes in Computer Science, vol 14918. Springer, Cham. https://doi.org/10.1007/978-3-031-70626-4_13

  • DOI: https://doi.org/10.1007/978-3-031-70626-4_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70628-8

  • Online ISBN: 978-3-031-70626-4

  • eBook Packages: Computer Science, Computer Science (R0)
