Abstract
This work describes an approach to enhance container orchestration platforms with an autonomous and dynamic rescheduling system that aims to improve application service time by co-locating highly interdependent containers, thereby reducing network delay. Excessive container consolidation may, however, saturate host CPUs and in turn impair service time. The multiobjective approach proposed in this work therefore aims to improve application service time by minimizing both inter-server network traffic and CPU throttling on overloaded servers. To this end, the Simulated Annealing combinatorial optimization heuristic is used, and its performance is compared against the optimal solution obtained by Mathematical Programming. Additionally, the impact of the proposed system is validated on a Kubernetes cluster hosting three concurrent applications under varying load scenarios. The proposed rescheduling system systematically (i) improves the application service time (by up to 27.2% in our experiments) and (ii) surpasses the improvement reached by the Kubernetes descheduler.
Data Availability
The datasets used and analysed during the current study are available from the corresponding author on reasonable request.
Notes
The VRP generalizes the Traveling Salesman Problem (TSP) [11]
More advanced alternatives are presented in [15]
inspired by [16]
as ‘load1’ reports the average load for the last minute.
The intuition is that highly interacting containers spend more time in communication when placed across different nodes, thereby leading to a degradation in the observed response times.
Sysadmins usually define a more conservative value (typically around 0.7) to provide some headroom to the system.
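As a minimal sketch of how such a conservative threshold might be applied to the one-minute load average, the following assumes that ‘load1’ is normalized by the number of CPU cores before being compared with 0.7. The function names `is_saturated` and `local_node_is_saturated`, and the per-core normalization itself, are illustrative assumptions rather than details taken from this work.

```python
import os


def is_saturated(load1: float, cpu_cores: int, threshold: float = 0.7) -> bool:
    """Return True when the per-core one-minute load exceeds the threshold.

    The per-core normalization is an assumption made for this sketch.
    """
    if cpu_cores <= 0:
        raise ValueError("cpu_cores must be positive")
    return load1 / cpu_cores > threshold


def local_node_is_saturated(threshold: float = 0.7) -> bool:
    """Apply the same check to the local machine (Unix only)."""
    load1, _load5, _load15 = os.getloadavg()  # 1, 5 and 15 minute averages
    return is_saturated(load1, os.cpu_count() or 1, threshold)
```

With these assumptions, a 4-core node reporting a ‘load1’ of 3.0 would be flagged (0.75 per core), whereas the same node at 2.0 would not (0.5 per core).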
each with a weight of 0.5.
A more detailed justification for the selection of this specific metaheuristic is given in [10]. Note, however, that the model described in this work is agnostic to the (meta)heuristic or algorithm used; while SA provides satisfactory results, we do not claim it to be the most efficient or effective technique. Comparing all possible techniques is considered out of scope for this work.
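To make the role of the metaheuristic concrete, the sketch below shows a generic Simulated Annealing loop over container placements with a two-term weighted cost, each term weighted 0.5. It is not the authors' implementation: the normalization of both terms, the single-container move, the geometric cooling schedule and every name and parameter (`cost`, `anneal`, `t_start`, `alpha`, and so on) are assumptions chosen for illustration.

```python
import math
import random


def cost(placement, traffic, cpu, capacity, w_net=0.5, w_cpu=0.5):
    """Weighted sum of inter-server traffic and CPU overload.

    Both terms are scaled to [0, 1]; this normalization is an assumption.
    """
    total_traffic = sum(traffic.values()) or 1.0
    # Traffic between containers placed on different servers.
    inter = sum(v for (a, b), v in traffic.items()
                if placement[a] != placement[b])
    load = {}
    for container, server in placement.items():
        load[server] = load.get(server, 0.0) + cpu[container]
    # CPU demand exceeding the capacity of each server.
    excess = sum(max(0.0, used - capacity) for used in load.values())
    total_cpu = sum(cpu.values()) or 1.0
    return w_net * inter / total_traffic + w_cpu * excess / total_cpu


def anneal(containers, n_servers, traffic, cpu, capacity,
           t_start=1.0, t_end=1e-3, alpha=0.95, moves_per_temp=50, seed=0):
    """Search for a low-cost placement with simulated annealing."""
    rng = random.Random(seed)
    current = {c: rng.randrange(n_servers) for c in containers}
    current_cost = cost(current, traffic, cpu, capacity)
    best, best_cost = dict(current), current_cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            # Neighbour move: reassign one random container.
            c = rng.choice(containers)
            old_server = current[c]
            current[c] = rng.randrange(n_servers)
            new_cost = cost(current, traffic, cpu, capacity)
            delta = new_cost - current_cost
            # Metropolis criterion: always accept improvements, sometimes
            # accept degradations, less often as the temperature drops.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current_cost = new_cost
                if current_cost < best_cost:
                    best, best_cost = dict(current), current_cost
            else:
                current[c] = old_server  # undo the rejected move
        t *= alpha  # geometric cooling
    return best, best_cost
```

Because the loop only calls `cost` and a neighbour move, any other search strategy could be substituted behind the same two functions, which is the sense in which the model is algorithm agnostic.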
50 containers on 5 servers.
See Table 1 for a complete description of the test environment specifications.
For the ‘M’ scenario (see Table 5), CPLEX crashes after 4.3 h (out of memory on a server with 12 GB of RAM). At this stage, it still reports a ‘Gap’ value of 25.81% in the minimization of the ‘cntc’ function (the first step of the algorithm, required to compute the normalizing factor).
This suggests that the size of the search-space negatively influences the performance of the exploration phase with only little progress observed in the progressive transition to the exploitation phase.
This value has been empirically defined and can be adapted according to cluster specificity.
"requiredDuringSchedulingIgnoredDuringExecution"
"preferredDuringSchedulingIgnoredDuringExecution"
For scoping reasons, this work only considers hard constraints. The impact of soft constraints is considered as a candidate topic for future extensions of this work.
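For readers unfamiliar with these Kubernetes fields, the sketch below builds a pod spec fragment carrying a hard constraint of the "requiredDuringSchedulingIgnoredDuringExecution" kind. It assumes a node affinity rule on the well-known `kubernetes.io/hostname` label; whether the constraints considered in this work are node or pod (anti-)affinity rules is not stated here, and the helper name `pin_pod_to_node` is hypothetical.

```python
import copy


def pin_pod_to_node(pod_spec: dict, node_name: str) -> dict:
    """Return a copy of pod_spec with a hard node-affinity rule.

    Assumption: the target node is selected through its hostname label.
    """
    spec = copy.deepcopy(pod_spec)  # leave the caller's spec untouched
    affinity = spec.setdefault("affinity", {})
    affinity["nodeAffinity"] = {
        # Hard constraint: the scheduler must satisfy it at placement time.
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
                {
                    "matchExpressions": [
                        {
                            "key": "kubernetes.io/hostname",
                            "operator": "In",
                            "values": [node_name],
                        }
                    ]
                }
            ]
        }
    }
    return spec
```

A soft constraint would instead go under "preferredDuringSchedulingIgnoredDuringExecution", where each term carries a weight and the scheduler treats it as a preference rather than a requirement.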
Under stable load, the control loop should eventually stop adapting the system it controls, having converged to an optimum.
Pixie is an open-source observability tool for K8s applications, contributed by New Relic, Inc. and a CNCF sandbox project since June 2021 [51]. Pixie was selected for its ease of integration, though any other network monitoring tool able to report on inter-pod network traffic can be used instead.
By design, any other optimization algorithm or metaheuristic can be used, as the New Context Generator component launches its execution through an algorithm-agnostic interface.
See Table 1 for the test environment specifications.
https://github.com/idlab-discover/obelisk.
Data isolation is internally ensured through the concept of scopes which represent logical data sets with configurable perimeter. A scope can be understood as a labelling mechanism aiming at isolating data between different contexts of use. Data access APIs for data ingestion, querying and streaming all require the scope to be mentioned.
While this might at first sound counter-intuitive, in most cases setting CPU limits does more harm than good. In fact, CPU limits are the number one cause of CPU throttling [54]. Tim Hockin, one of the K8s maintainers at Google, even suggests never setting CPU limits (https://x.com/thockin/status/1134193838841401345?s=20).
https://github.com/kubernetes-sigs/descheduler - 17/11/2023.
Currently, pods request resource requirements are considered for computing node resource utilization.(...) Implementing metrics-based descheduling is currently TODO for the project. 17/11/2023 - https://github.com/kubernetes-sigs/descheduler
3 (reschedulable) Pods that may be scheduled onto 6 distinct nodes (the scheduler may indeed reassign a pod to the node it was running on prior to the descheduling)
For instance, we could observe the reassignment of 1, 2 or 3 Pods, on 1, 2 or 3 Nodes, sometimes even back on the original node (Node4).
References
Silva, V.G., Kirikova, M., Alksnis, G.: Containers for virtualization: an overview. Appl. Comput. Syst. 23(1), 21–27 (2018). https://doi.org/10.2478/acss-2018-0003
Docker, Inc.: Docker website (2023). https://www.docker.com
The Linux Foundation: LXC/LXD website (2023). https://linuxcontainers.org
Red Hat, Inc.: Podman website (2023). https://podman.io/
The Cloud Native Computing Foundation: Containerd website (2023). https://containerd.io
The Apache Software Foundation: Mesos website (2023). http://mesos.apache.org
Docker, Inc.: Docker Swarm website (2023). https://docs.docker.com/engine/swarm/
The Cloud Native Computing Foundation: Kubernetes website (2023). https://kubernetes.io
Kalmbach, P., Zerwas, J., Babarczi, P., Blenk, A., Kellerer, W., Schmid, S.: Empowering self-driving networks. In: Proceedings of the Afternoon Workshop on Self-Driving Networks. SelfDN 2018, pp. 8–14. Association for Computing Machinery, New York (2018). https://doi.org/10.1145/3229584.3229587
Bracke, V., Werrebrouck, G., Santos, J.P., Wauters, T., De Turck, F., Volckaert, B.: Online dynamic container rescheduling for improved application service time. J. Netw. Syst. Manag. 31(4) (2023). https://doi.org/10.1007/s10922-023-09766-9
Robinson, J.B.: On the Hamiltonian Game (A Traveling Salesman Problem). RAND Corporation, Santa Monica (1949)
Dantzig, G.B., Ramser, J.H.: The truck dispatching problem. Manage. Sci. 6, 80–91 (1959). https://doi.org/10.1287/mnsc.6.1.80
Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982). https://doi.org/10.1109/TIT.1982.1056489
Croes, G.A.: A method for solving traveling-salesman problems. Oper. Res. 6(6), 791–812 (1958). https://doi.org/10.1287/opre.6.6.791
Reinelt, G.: The Traveling Salesman: Computational Solutions for TSP Applications, 1st edn. Lecture Notes in Computer Science, vol. 840. Springer, Berlin, Heidelberg (1994). https://doi.org/10.1007/3-540-48661-5
Almeida, R.S.: TSP Essay (2020). https://github.com/rsalmei/tsp-essay. Accessed: 20 Jan 2023
Beloglazov, A., Buyya, R.: Optimal online deterministic algorithms and adaptive heuristics for energy and performance efficient dynamic consolidation of virtual machines in cloud data centers. Concurr. Comput.: Pract. Exper. 24(13), 1397–1420 (2012). https://doi.org/10.1002/cpe.1867
Mahdhi, T., Mezni, H.: A prediction-based VM consolidation approach in IaaS cloud data centers. J. Syst. Softw. 146, 263–285 (2018). https://doi.org/10.1016/j.jss.2018.09.083
Wang, J.V., Cheng, C.-T., Tse, C.K.: A thermal-aware VM consolidation mechanism with outage avoidance. Softw.: Pract. Exper. 49(5), 906–920 (2019). https://doi.org/10.1002/spe.2680
Zhao, D., Mohamed, M., Ludwig, H.: Locality-aware scheduling for containers in cloud computing. IEEE Trans. Cloud Comput. 8(2), 635–646 (2020). https://doi.org/10.1109/TCC.2018.2794344
Filip, I.-D., Pop, F., Serbanescu, C., Choi, C.: Microservices scheduling model over heterogeneous cloud-edge environments as support for IoT applications. IEEE Internet Things J. 5(4), 2672–2681 (2018). https://doi.org/10.1109/JIOT.2018.2792940
Nanda, S., Hacker, T.J.: RACC: Resource-aware container consolidation using a deep learning approach. In: Proceedings of the First Workshop on Machine Learning for Computing Systems. MLCS’18. Association for Computing Machinery, New York (2018). https://doi.org/10.1145/3217871.3217876
Wen, Z., Lin, T., Yang, R., Ji, S., Ranjan, R., Romanovsky, A., Lin, C., Xu, J.: GA-Par: dependable microservice orchestration framework for geo-distributed clouds. IEEE Trans. Parallel Distrib. Syst. 31(1), 129–143 (2020). https://doi.org/10.1109/TPDS.2019.2929389
Guerrero, C., Lera, I., Juiz, C.: Resource optimization of container orchestration: a case study in multi-cloud microservices-based applications. J. Supercomput. 74(7), 2956–2983 (2018). https://doi.org/10.1007/s11227-018-2345-2
Bittencourt, L.F., Goldman, A., Madeira, E.R.M., da Fonseca, N.L.S., Sakellariou, R.: Scheduling in distributed systems: a cloud computing perspective. Comput. Sci. Rev. 30, 31–54 (2018). https://doi.org/10.1016/j.cosrev.2018.08.002
Söylemez, M., Tekinerdogan, B., Tarhan, A.K.: Challenges and solution directions of microservice architectures: a systematic literature review. Appl. Sci. (2022). https://doi.org/10.3390/app12115507
Santos, J., Wang, C., Wauters, T., De Turck, F.: Diktyo: network-aware scheduling in container-based clouds. IEEE Trans. Netw. Serv. Manag., 1–1 (2023). https://doi.org/10.1109/TNSM.2023.3271415
Zhou, R., Li, Z., Wu, C.: An efficient online placement scheme for cloud container clusters. IEEE J. Sel. Areas Commun. 37(5), 1046–1058 (2019). https://doi.org/10.1109/JSAC.2019.2906745
Piraghaj, S.F., Dastjerdi, A.V., Calheiros, R.N., Buyya, R.: A framework and algorithm for energy efficient container consolidation in cloud data centers. In: 2015 IEEE International Conference on Data Science and Data Intensive Systems, pp. 368–375 (2015). https://doi.org/10.1109/DSDIS.2015.67
Rattihalli, G.: Exploring potential for resource request right-sizing via estimation and container migration in Apache Mesos. In: 2018 IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC Companion), pp. 59–64 (2018). https://doi.org/10.1109/UCC-Companion.2018.00035
Bulej, L., Bureš, T., Hnětynka, P., Khalyeyev, D.: Self-adaptive K8S cloud controller for time-sensitive applications. In: 2021 47th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 166–169 (2021). https://doi.org/10.1109/SEAA53835.2021.00029
Rodriguez, M., Buyya, R.: Container orchestration with cost-efficient autoscaling in cloud computing environments. In: Handbook of Research on Multimedia Cyber Security, pp. 190–213 (2020). https://doi.org/10.4018/978-1-7998-2701-6.ch010
Wojciechowski, L., Opasiak, K., Latusek, J., Wereski, M., Morales, V., Kim, T., Hong, M.: NetMARKS: network metrics-aware Kubernetes Scheduler powered by service mesh. In: IEEE INFOCOM 2021—IEEE Conference on Computer Communications, pp. 1–9 (2021). https://doi.org/10.1109/INFOCOM42981.2021.9488670
Marchese, A., Tomarchio, O.: Network-aware container placement in cloud-edge kubernetes clusters. In: 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid), pp. 859–865 (2022). https://doi.org/10.1109/CCGrid54584.2022.00102
Joseph, C.T., Chandrasekaran, K.: Nature-inspired resource management and dynamic rescheduling of microservices in Cloud datacenters. Concurr. Comput.: Pract. Exper. 33(17), 6290 (2021). https://doi.org/10.1002/cpe.6290
Podzimek, A., Chen, L.Y.: Transforming system load to throughput for consolidated applications. In: 2013 IEEE 21st International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems, pp. 288–292 (2013). https://doi.org/10.1109/MASCOTS.2013.37
Arora, J.S.: Chapter 18—Multi-objective optimum design concepts and methods. In: Arora, J.S. (ed.) Introduction to Optimum Design, 4th edn., pp. 771–794. Academic Press, Boston (2017). https://doi.org/10.1016/B978-0-12-800806-5.00018-4
Caramia, M., Dell’Olmo, P.: Multi-objective optimization. In: Multi-objective management in freight logistics: increasing capacity, service level, sustainability, and safety with optimization algorithms, pp. 21–51. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-50812-8_2
Grodzevich, O., Romanko, O.: Normalization and other topics in multi-objective optimization. In: Proceedings of the Fields-MITACS Industrial Problems Workshop (2006)
Emmerich, M., Deutz, A.: A tutorial on multiobjective optimization: fundamentals and evolutionary methods. Natural Comput. 17 (2018). https://doi.org/10.1007/s11047-018-9685-y
Kirkpatrick, S., Gelatt, C., Vecchi, M.: Optimization by simulated annealing. Science 220, 671–680 (1983). https://doi.org/10.1126/science.220.4598.671
Cerny, V.: Thermodynamical approach to the traveling salesman problem: an efficient simulation algorithm. J. Optim. Theory Appl. 45, 41–51 (1985). https://doi.org/10.1007/BF00940812
Koulamas, C., Antony, S., Jaen, R.: A survey of simulated annealing applications to operations research problems. Omega 22(1), 41–56 (1994). https://doi.org/10.1016/0305-0483(94)90006-X
Connolly, D.T.: An improved annealing scheme for the QAP. Eur. J. Oper. Res. 46(1), 93–100 (1990). https://doi.org/10.1016/0377-2217(90)90301-Q
Fidanova, S.: Simulated annealing for grid scheduling problem. In: IEEE John Vincent Atanasoff 2006 International Symposium on Modern Computing (JVA’06), pp. 41–45 (2006). https://doi.org/10.1109/JVA.2006.44
Ellison Geltman, K.: The simulated annealing algorithm (2014). http://katrinaeg.com/simulated-annealing.html
Flexera: 2023 state of the cloud report (2023). https://info.flexera.com/CM-REPORT-State-of-the-Cloud
Kubernetes SIGs: Kubernetes concepts (2023). https://kubernetes.io/docs/concepts/. Accessed: 22 June 2023
Santos, J., Wauters, T., Volckaert, B., De Turck, F.: Towards network-aware resource provisioning in Kubernetes for fog computing applications. In: 2019 IEEE Conference on Network Softwarization (NetSoft), pp. 351–359 (2019). https://doi.org/10.1109/NETSOFT.2019.8806671
Kubernetes SIGs: Descheduler for Kubernetes (2023). https://github.com/kubernetes-sigs/descheduler. Accessed: 15 June 2023
pixielabs.ai: pixie overview (2023). https://docs.pixielabs.ai/about-pixie/what-is-pixie. Accessed: 6 Oct 2023
google.com: GoogleCloudPlatform microservices-demo: Online Boutique (2023). https://github.com/GoogleCloudPlatform/microservices-demo. Accessed: 10 Oct 2023
Bracke, V., Sebrechts, M., Moons, B., Hoebeke, J., De Turck, F., Volckaert, B.: Design and evaluation of a scalable Internet of Things backend for smart ports. Softw.: Pract. Exper. 51(7), 1557–1579 (2021). https://doi.org/10.1002/spe.2973
Yellin, N.: For the love of god, stop using CPU limits on Kubernetes (updated) (2022). https://home.robusta.dev/blog/stop-using-cpu-limits. Accessed: 10 Aug 2023
Acknowledgements
José Santos is funded by the Research Foundation Flanders (FWO), grant number 1299323N.
Contributions
Vincent Bracke substantially contributed to the conception and design of the work, to the acquisition, analysis and interpretation of data and to the creation of new software used in the work. He drafted the work and substantively revised it. José Santos and Tim Wauters substantially contributed to the analysis and interpretation of data. They substantively revised the work. Filip De Turck and Bruno Volckaert substantially contributed to the conception of the work, as well as to the analysis and interpretation of data. They drafted the work and substantively revised it. All authors have approved the submitted version.
Ethics declarations
Conflict of interest
Filip De Turck, a listed author, is a member of the editorial advisory board of this journal. The authors declare that they have no other conflict of interest.
Ethical Approval
Not applicable.
About this article
Cite this article
Bracke, V., Santos, J., Wauters, T. et al. A Multiobjective Metaheuristic-Based Container Consolidation Model for Cloud Application Performance Improvement. J Netw Syst Manage 32, 61 (2024). https://doi.org/10.1007/s10922-024-09835-7
DOI: https://doi.org/10.1007/s10922-024-09835-7