Abstract
Executing workflows on volunteer computing resources, where individual tasks may be forced to relinquish their resource for the resource’s primary use, leads to unpredictability and often significantly increases execution time. Task replication is one approach that can ameliorate this challenge, but it comes at the expense of a potentially significant increase in system load and energy consumption. We propose the use of Reinforcement Learning (RL) so that a system may ‘learn’ the ‘best’ number of replicas to run, increasing the number of workflows that complete promptly whilst minimising the additional workload on the system when replicas are not beneficial. We show, through simulation, that RL can save 34% of the energy consumption compared with a fixed number of replicas, with only a 4% decrease in workflows achieving a pre-defined overhead bound.
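To make the idea of ‘learning’ a replica count concrete, the following is a minimal illustrative sketch and not the method evaluated in the paper. It assumes a simple epsilon-greedy bandit over candidate replica counts, a hypothetical environment in which each replica is independently evicted with probability `evict_prob`, and a hypothetical reward that gives 1 for prompt completion minus `energy_cost` per replica run. All function names, parameters and values are assumptions introduced for illustration.

```python
import random


def choose_replicas(q_values, epsilon, rng):
    """Epsilon-greedy choice over candidate replica counts."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))       # explore: random replica count
    return max(q_values, key=q_values.get)      # exploit: best estimate so far


def update(q_values, counts, action, reward):
    """Incremental sample-average update of the value estimate."""
    counts[action] += 1
    q_values[action] += (reward - q_values[action]) / counts[action]


def simulate_task(replicas, evict_prob, rng):
    """Hypothetical environment: the task completes promptly if at
    least one replica is not evicted by the resource's primary user."""
    return any(rng.random() >= evict_prob for _ in range(replicas))


def train(evict_prob, energy_cost, steps=5000, epsilon=0.1, seed=0):
    """Learn value estimates for running 1 to 4 replicas (assumed range)."""
    rng = random.Random(seed)
    q = {a: 0.0 for a in (1, 2, 3, 4)}
    n = {a: 0 for a in q}
    for _ in range(steps):
        a = choose_replicas(q, epsilon, rng)
        ok = simulate_task(a, evict_prob, rng)
        # Assumed reward: prompt completion minus an energy penalty per replica.
        reward = (1.0 if ok else 0.0) - energy_cost * a
        update(q, n, a, reward)
    return q


if __name__ == "__main__":
    q = train(evict_prob=0.5, energy_cost=0.05)
    print("learned replica count:", max(q, key=q.get))
```

Under these assumptions the learner settles on a single replica when evictions never occur, because extra replicas only add energy cost, and it tends towards more replicas as the eviction probability grows relative to the energy penalty.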
Notes
1. Where the user desires \(p(W) \le \phi \).
2. To simplify presentation we omit the parameters hereafter.
3. Without loss of generality we use \(\phi '\) to represent both \(\phi '\) and \(\phi ''\).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
McGough, A.S., Forshaw, M. (2023). Analysis of Reinforcement Learning for Determining Task Replication in Workflows. In: Gilly, K., Thomas, N. (eds) Computer Performance Engineering. EPEW 2022. Lecture Notes in Computer Science, vol 13659. Springer, Cham. https://doi.org/10.1007/978-3-031-25049-1_8
DOI: https://doi.org/10.1007/978-3-031-25049-1_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25048-4
Online ISBN: 978-3-031-25049-1
eBook Packages: Computer Science, Computer Science (R0)