Analysis of Reinforcement Learning for Determining Task Replication in Workflows

  • Conference paper
Computer Performance Engineering (EPEW 2022)

Abstract

Executing workflows on volunteer computing resources, where individual tasks may be forced to relinquish their resource for the resource’s primary use, leads to unpredictability and often significantly increases execution time. Task replication is one approach that can ameliorate this challenge, but it comes at the expense of a potentially significant increase in system load and energy consumption. We propose the use of Reinforcement Learning (RL) so that a system may ‘learn’ the ‘best’ number of replicas to run, increasing the number of workflows that complete promptly whilst minimising the additional workload on the system when replicas are not beneficial. We show, through simulation, that RL saves 34% of the energy consumption compared with a fixed number of replicas, with only a 4% decrease in workflows achieving a pre-defined overhead bound.
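
The trade-off described in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm or simulator: it is a minimal epsilon-greedy bandit in which the eviction probability, the one-unit-per-replica energy cost, the reward shape, and all function names are illustrative assumptions. Each "action" is a replica count, and the reward credits task completion while penalising the energy spent on replicas.

```python
import random


def simulate_task(replicas, evict_prob, rng):
    """Toy model (assumption): each replica is evicted independently with
    probability evict_prob; the task completes if any replica survives."""
    completed = any(rng.random() >= evict_prob for _ in range(replicas))
    energy = float(replicas)  # assumption: one energy unit per replica
    return completed, energy


def learn_replica_count(max_replicas=4, evict_prob=0.3, energy_weight=0.2,
                        episodes=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy estimate of the value of running 1..max_replicas replicas."""
    rng = random.Random(seed)
    q = [0.0] * max_replicas  # q[i]: running mean reward for i + 1 replicas
    n = [0] * max_replicas    # n[i]: number of times i + 1 replicas were tried
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.randrange(max_replicas)                   # explore
        else:
            action = max(range(max_replicas), key=lambda i: q[i])  # exploit
        completed, energy = simulate_task(action + 1, evict_prob, rng)
        # Hypothetical reward: 1 for completion, minus a weighted energy cost.
        reward = (1.0 if completed else 0.0) - energy_weight * energy
        n[action] += 1
        q[action] += (reward - q[action]) / n[action]  # incremental mean
    return q


def best_replicas(q):
    """Replica count with the highest estimated reward."""
    return max(range(len(q)), key=lambda i: q[i]) + 1


if __name__ == "__main__":
    estimates = learn_replica_count()
    print("estimated rewards:", [round(v, 3) for v in estimates])
    print("chosen replica count:", best_replicas(estimates))
```

In this toy setting, when evictions never occur the extra replicas only add energy cost, so the learner settles on a single replica; as the eviction probability rises, larger replica counts become worthwhile until the energy penalty outweighs the gain in completions.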

Notes

  1. Where the user desires \(p(W) \le \phi \).

  2. To simplify presentation we omit the parameters hereafter.

  3. Without loss of generality we use \(\phi '\) to represent both \(\phi '\) and \(\phi ''\).

Author information

Corresponding author

Correspondence to Andrew Stephen McGough.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

McGough, A.S., Forshaw, M. (2023). Analysis of Reinforcement Learning for Determining Task Replication in Workflows. In: Gilly, K., Thomas, N. (eds) Computer Performance Engineering. EPEW 2022. Lecture Notes in Computer Science, vol 13659. Springer, Cham. https://doi.org/10.1007/978-3-031-25049-1_8

  • DOI: https://doi.org/10.1007/978-3-031-25049-1_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25048-4

  • Online ISBN: 978-3-031-25049-1

  • eBook Packages: Computer Science, Computer Science (R0)
