Abstract
The growing need to identify patterns in data and to automate decisions based on them in near-real time has stimulated the development of new machine learning (ML) applications that process continuous data streams. However, deploying ML applications over distributed stream processing engines (DSPEs) such as Apache Spark Streaming is a complex procedure that requires extensive tuning along two dimensions. First, DSPEs expose a plethora of system configuration parameters, such as the degree of parallelism and memory buffer sizes, that directly impact application throughput and/or latency and need to be optimized. Second, ML models have their own set of hyperparameters that require tuning, as they can significantly affect the overall prediction accuracy of the trained model. These two forms of tuning have been studied extensively in the literature, but only in isolation from each other. This manuscript presents a comprehensive experimental study that combines system configuration and hyperparameter tuning of ML applications over DSPEs. The experimental results reveal unexpected and complex interactions between the choices of system configurations and hyperparameters, and their impact on both application and model performance. These insights motivate the need for new combined system and ML model tuning approaches, and open up new research directions in the field of self-managing distributed stream processing systems.
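To make the two tuning dimensions concrete, the following is a minimal Scala sketch of a Spark Streaming application that exposes both. The SparkConf settings and the micro-batch interval are real Spark parameters; the hyperparameter names (learningRate, gracePeriod, splitConfidence) are illustrative placeholders for a streaming learner, not a specific library's API, and the socket source is a stand-in for a real input stream.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch (not the paper's actual experimental setup): it shows the
// two tuning dimensions side by side for a Spark Streaming application.
object TuningSketch {
  def main(args: Array[String]): Unit = {
    // Dimension 1: DSPE system configuration. These Spark parameters
    // directly affect application throughput and latency.
    val conf = new SparkConf()
      .setAppName("StreamingMLTuningSketch")
      .set("spark.default.parallelism", "8")         // degree of parallelism
      .set("spark.executor.memory", "4g")            // memory per executor
      .set("spark.streaming.blockInterval", "200ms") // partitioning of received data

    // The micro-batch interval is itself a system parameter with a direct
    // impact on end-to-end latency.
    val ssc = new StreamingContext(conf, Seconds(2))

    // Dimension 2: ML model hyperparameters. The names below are
    // illustrative placeholders, not a specific library's API.
    val hyperparams = Map(
      "learningRate"    -> 0.05,  // step size for gradient-based updates
      "gracePeriod"     -> 200.0, // e.g., instances between split attempts in a Hoeffding tree
      "splitConfidence" -> 1e-7   // allowable error in split decisions
    )

    // Placeholder pipeline so the context has an output operation; a real
    // application would parse records and feed them to an online learner
    // configured with `hyperparams`.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

In a combined tuning approach of the kind this study motivates, the SparkConf settings and the entries of `hyperparams` would be treated as a single joint search space rather than being optimized independently.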
Notes
“Cluster mode” in this study refers to running Spark on a cluster of machines, as opposed to a single machine. It does not refer to the driver’s “cluster” deploy mode, which indicates that the driver is launched inside the cluster (recall Table 1). In our experiments, we run the driver using both the “cluster” and “client” deploy modes, as illustrated below.
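For concreteness, the deploy mode is selected via the --deploy-mode flag of spark-submit; the master URL and application jar below are illustrative:

```bash
# Driver runs on the submitting machine ("client" deploy mode):
spark-submit --master spark://master:7077 --deploy-mode client app.jar

# Driver is launched inside the cluster ("cluster" deploy mode):
spark-submit --master spark://master:7077 --deploy-mode cluster app.jar
```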
Author information
Contributions
Conceptualization: HH; Methodology: LO, HH; Formal analysis and investigation: LO, HH; Writing - original draft preparation: LO, HH; Writing - review and editing: LO, HH; Supervision: HH.
Ethics declarations
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Odysseos, L., Herodotou, H. On combining system and machine learning performance tuning for distributed data stream applications. Distrib Parallel Databases 41, 411–438 (2023). https://doi.org/10.1007/s10619-023-07434-0