
On combining system and machine learning performance tuning for distributed data stream applications

Published in: Distributed and Parallel Databases

Abstract

The growing need to identify patterns in data and automate decisions based on them in near-real time has stimulated the development of new machine learning (ML) applications that process continuous data streams. However, deploying ML applications over distributed stream processing engines (DSPEs) such as Apache Spark Streaming is a complex procedure that requires extensive tuning along two dimensions. First, DSPEs expose a plethora of system configuration parameters, such as the degree of parallelism and memory buffer sizes, that directly impact application throughput and/or latency and need to be optimized. Second, ML models have their own set of hyperparameters that require tuning, as they can significantly affect the overall prediction accuracy of the trained model. These two forms of tuning have been studied extensively in the literature, but only in isolation from each other. This manuscript presents a comprehensive experimental study that combines system configuration and hyperparameter tuning of ML applications over DSPEs. The experimental results reveal unexpected and complex interactions between the choices of system configurations and hyperparameters, and their impact on both application and model performance. These insights motivate the need for new combined system and ML model tuning approaches, and open up new research directions in the field of self-managing distributed stream processing systems.
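To make the two tuning dimensions concrete, the sketch below enumerates a joint search space of DSPE system configurations and ML hyperparameters. The parameter names, value ranges, and the toy scoring function are illustrative assumptions only; they are not the authors' actual experimental setup, where each candidate configuration would instead be deployed and measured on a live streaming application.

```python
# Sketch of joint system-configuration and hyperparameter tuning via a
# simple grid search over the cross-product of the two parameter spaces.
from itertools import product

# Example DSPE system configuration space (names are Spark-style, values assumed).
system_configs = {
    "spark.default.parallelism": [8, 16, 32],
    "spark.streaming.blockInterval.ms": [100, 200],
}

# Example ML hyperparameter space (values assumed).
hyperparameters = {
    "learning_rate": [0.01, 0.1],
    "batch_size": [32, 128],
}

def evaluate(sys_conf, hp_conf):
    """Placeholder objective: in a real study this would deploy the streaming
    ML application and measure throughput, latency, and model accuracy."""
    # Toy score hinting that the two dimensions interact.
    return sys_conf["spark.default.parallelism"] * hp_conf["learning_rate"]

def joint_grid_search(system_space, hp_space):
    """Enumerate every combination of system and model settings, keep the best."""
    sys_keys, hp_keys = list(system_space), list(hp_space)
    best = None
    for sys_vals in product(*system_space.values()):
        for hp_vals in product(*hp_space.values()):
            sys_conf = dict(zip(sys_keys, sys_vals))
            hp_conf = dict(zip(hp_keys, hp_vals))
            score = evaluate(sys_conf, hp_conf)
            if best is None or score > best[0]:
                best = (score, sys_conf, hp_conf)
    return best

best_score, best_sys, best_hp = joint_grid_search(system_configs, hyperparameters)
print(best_sys, best_hp)
```

Exhaustive enumeration is shown only for clarity; because the combined space grows multiplicatively, the interactions reported in the paper motivate smarter combined search strategies.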


(Figures 1–10 appear in the full text.)


Notes

  1. “Cluster mode” in this study refers to running Spark on a cluster of machines as opposed to only one machine. It does not refer to the driver’s “cluster” deploy mode, which indicates that the driver is launched inside the cluster (recall Table 1). In our experiments, we run the driver using both “cluster” and “client” deploy modes.
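For illustration, the two driver deploy modes mentioned above are selected through `spark-submit`'s `--deploy-mode` flag; the master URL and application JAR below are placeholders, not the paper's actual cluster setup.

```shell
# Driver runs on the submitting machine ("client" deploy mode).
spark-submit --master spark://master-host:7077 --deploy-mode client app.jar

# Driver is launched on one of the cluster's nodes ("cluster" deploy mode).
spark-submit --master spark://master-host:7077 --deploy-mode cluster app.jar
```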


Author information


Contributions

Conceptualization: HH; Methodology: LO, HH; Formal analysis and investigation: LO, HH; Writing - original draft preparation: LO, HH; Writing - review and editing: LO, HH; Supervision: HH.

Corresponding author

Correspondence to Herodotos Herodotou.

Ethics declarations

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Odysseos, L., Herodotou, H. On combining system and machine learning performance tuning for distributed data stream applications. Distrib Parallel Databases 41, 411–438 (2023). https://doi.org/10.1007/s10619-023-07434-0

