Abstract
The growing need to identify patterns in data and to automate decisions based on them in near-real time has stimulated the development of new machine learning (ML) applications that process continuous data streams. However, deploying ML applications over distributed stream processing engines (DSPEs) such as Apache Spark Streaming is a complex procedure that requires extensive tuning along two dimensions. First, DSPEs expose a plethora of system configuration parameters, such as the degree of parallelism and memory buffer sizes, that directly impact application throughput and/or latency and need to be optimized. Second, ML models have their own set of hyperparameters that require tuning, as they can significantly affect the overall prediction accuracy of the trained model. These two forms of tuning have been studied extensively in the literature, but only in isolation from each other. This manuscript presents a comprehensive experimental study that combines system configuration and hyperparameter tuning of ML applications over DSPEs. The experimental results reveal unexpected and complex interactions between the choices of system configurations and hyperparameters, and their impact on both application and model performance. These insights motivate the need for new combined system and ML model tuning approaches, and open up new research directions in the field of self-managing distributed stream processing systems.
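To make the two tuning dimensions concrete, the following is a minimal Scala sketch of a Spark Streaming application that exposes both. The SparkConf settings and the micro-batch interval are real Spark parameters; the hyperparameter names (learningRate, gracePeriod, splitConfidence) are illustrative placeholders for a streaming learner, not a specific library's API, and the socket source is a stand-in for a real input stream.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch (not the paper's actual experimental setup): it shows the
// two tuning dimensions side by side for a Spark Streaming application.
object TuningSketch {
  def main(args: Array[String]): Unit = {
    // Dimension 1: DSPE system configuration. These Spark parameters
    // directly affect application throughput and latency.
    val conf = new SparkConf()
      .setAppName("StreamingMLTuningSketch")
      .set("spark.default.parallelism", "8")         // degree of parallelism
      .set("spark.executor.memory", "4g")            // memory per executor
      .set("spark.streaming.blockInterval", "200ms") // partitioning of received data

    // The micro-batch interval is itself a system parameter with a direct
    // impact on end-to-end latency.
    val ssc = new StreamingContext(conf, Seconds(2))

    // Dimension 2: ML model hyperparameters. The names below are
    // illustrative placeholders, not a specific library's API.
    val hyperparams = Map(
      "learningRate"    -> 0.05,  // step size for gradient-based updates
      "gracePeriod"     -> 200.0, // e.g., instances between split attempts in a Hoeffding tree
      "splitConfidence" -> 1e-7   // allowable error in split decisions
    )

    // Placeholder pipeline so the context has an output operation; a real
    // application would parse records and feed them to an online learner
    // configured with `hyperparams`.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

In a combined tuning approach of the kind this study motivates, the SparkConf settings and the entries of `hyperparams` would be treated as a single joint search space rather than being optimized independently.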
Notes
“Cluster mode” in this study refers to running Spark on a cluster of machines, as opposed to a single machine. It does not refer to the driver’s “cluster” deploy mode, which indicates that the driver is launched inside the cluster (recall Table 1). In our experiments, we run the driver using both the “cluster” and “client” deploy modes, as illustrated below.
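For concreteness, the deploy mode is selected via the --deploy-mode flag of spark-submit; the master URL and application jar below are illustrative:

```bash
# Driver runs on the submitting machine ("client" deploy mode):
spark-submit --master spark://master:7077 --deploy-mode client app.jar

# Driver is launched inside the cluster ("cluster" deploy mode):
spark-submit --master spark://master:7077 --deploy-mode cluster app.jar
```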
Author information
Contributions
Conceptualization: HH; Methodology: LO, HH; Formal analysis and investigation: LO, HH; Writing - original draft preparation: LO, HH; Writing - review and editing: LO, HH; Supervision: HH.
Ethics declarations
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Odysseos, L., Herodotou, H. On combining system and machine learning performance tuning for distributed data stream applications. Distrib Parallel Databases 41, 411–438 (2023). https://doi.org/10.1007/s10619-023-07434-0