Abstract
Malleability support in supercomputing requires updates throughout the system software stack, as well as to applications, libraries, and the runtime systems of distributed-memory programming models. Consequently, relatively few applications have been extended or developed with malleability support, and no job histories from production systems contain enough malleable job submissions for scheduling research. In this paper, we propose a solution: a probabilistic job history conversion that enables the evaluation of malleable scheduling heuristics via simulations based on existing job histories. Based on a configurable probability, job arrivals are converted into malleable versions and assigned a malleable performance model. The simulator uses this model to evaluate each job's runtime changes as malleable operations are applied to it.
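The conversion described above can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the paper's implementation): job records from an existing history are marked malleable with a configurable probability, and each converted job is assigned a simple Amdahl-style performance model that a simulator could use to re-evaluate runtime after a resize. All names (`Job`, `convert_history`, `scaled_runtime`) and the choice of model and parameter ranges are assumptions for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class Job:
    job_id: int
    arrival: float                # submit time (s)
    runtime: float                # recorded runtime (s)
    nodes: int                    # recorded node allocation
    malleable: bool = False
    serial_fraction: float = 0.0  # parameter of the assigned performance model

def convert_history(jobs, p_malleable=0.3, rng=None):
    """Probabilistically mark jobs as malleable and assign each a
    performance model (here, a randomly drawn Amdahl serial fraction)."""
    rng = rng or random.Random(42)
    for job in jobs:
        if rng.random() < p_malleable:
            job.malleable = True
            # The simulator would use this parameter to re-evaluate the
            # job's runtime whenever its node count changes.
            job.serial_fraction = rng.uniform(0.01, 0.2)
    return jobs

def scaled_runtime(job, new_nodes):
    """Re-evaluate runtime after a malleable resize, Amdahl-style:
    infer the single-node time from the recorded runtime, then scale."""
    if not job.malleable:
        return job.runtime
    f = job.serial_fraction
    t1 = job.runtime / (f + (1.0 - f) / job.nodes)  # inferred 1-node time
    return t1 * (f + (1.0 - f) / new_nodes)
```

With this sketch, a scheduling simulator replaying the converted history can shrink or expand a malleable job's allocation and obtain a consistent new runtime estimate, while rigid jobs keep their recorded runtimes unchanged.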
This work has received funding under the European Commission’s EuroHPC and Horizon 2020 programmes under grant agreements no. 955606 (DEEP-SEA) and 956560 (REGALE).
Notes
1. We do not take the impact of manufacturing process variations into account here.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Comprés, I., Arima, E., Schulz, M., Rotaru, T., Machado, R. (2023). Probabilistic Job History Conversion and Performance Model Generation for Malleable Scheduling Simulations. In: Bienz, A., Weiland, M., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13999. Springer, Cham. https://doi.org/10.1007/978-3-031-40843-4_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40842-7
Online ISBN: 978-3-031-40843-4
eBook Packages: Computer Science, Computer Science (R0)