Abstract
In this paper, we propose and evaluate a framework for fault-tolerant workflow execution in Grid environments. Unlike previous work in the literature, our system dynamically chooses an appropriate fault tolerance technique using a user-defined rule-based system. We also provide a generic interface that can be used to add fault tolerance techniques to the framework. The results obtained with real workflows in an experimental Grid environment show that the overhead introduced by our framework in a failure-free execution is, in the worst evaluated case, approximately 10%. Moreover, we show that, using our framework, workflows are able to execute successfully in the presence of failures and that the framework can dynamically choose an appropriate fault tolerance technique. The main contributions of our work are twofold: the developed framework and the model-based dependability analysis we performed on it. The purpose of the model-based dependability analysis is to evaluate the interaction between our framework and the distributed Grid environment beyond the physical limitations of an empirical evaluation. By doing this, we provide a means of planning QoS assurance in Grid resource allocation while applying the fault tolerance mechanisms implemented in our framework, regardless of the underlying middleware.
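To make the two design ideas in the abstract concrete, the following is a minimal sketch, not the framework's actual API. It assumes a hypothetical `FaultToleranceTechnique` interface through which techniques are plugged in, and a hypothetical list of user-defined `Rule` objects from which the first matching rule selects the technique at run time. All class, function, and field names here are illustrative assumptions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TaskContext:
    """Hypothetical runtime facts that a user-defined rule may inspect."""
    failures_so_far: int
    estimated_runtime_s: float


class FaultToleranceTechnique(ABC):
    """Hypothetical generic interface; a new technique is added by subclassing."""

    name: str

    @abstractmethod
    def run(self, task: Callable[[], str]) -> str:
        """Execute the task under this technique and return its result."""


class NoTolerance(FaultToleranceTechnique):
    """Pass-through: run the task once and let any failure propagate."""

    name = "none"

    def run(self, task: Callable[[], str]) -> str:
        return task()


class Retry(FaultToleranceTechnique):
    """Re-execute the task until it succeeds or the attempts run out."""

    name = "retry"

    def __init__(self, max_attempts: int = 3) -> None:
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self.max_attempts = max_attempts

    def run(self, task: Callable[[], str]) -> str:
        for attempt in range(1, self.max_attempts + 1):
            try:
                return task()
            except RuntimeError:
                # Re-raise only when the last allowed attempt has failed.
                if attempt == self.max_attempts:
                    raise
        raise AssertionError("unreachable")


@dataclass
class Rule:
    """A user-defined rule: if the condition holds, use the technique."""
    condition: Callable[[TaskContext], bool]
    technique: FaultToleranceTechnique


def choose_technique(
    rules: List[Rule],
    ctx: TaskContext,
    default: FaultToleranceTechnique,
) -> FaultToleranceTechnique:
    """Return the technique of the first matching rule, else the default."""
    for rule in rules:
        if rule.condition(ctx):
            return rule.technique
    return default
```

In this sketch, a rule such as `Rule(lambda c: c.failures_so_far >= 1, Retry(3))` would switch a task to retrying once a failure has been observed, while tasks with no failure history run under the pass-through default. First-match ordering is one simple policy chosen for illustration; the source does not state how rules are prioritized.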
Cite this article
Guimaraes, F.P., Célestin, P., Batista, D.M. et al. A Framework for Adaptive Fault-Tolerant Execution of Workflows in the Grid: Empirical and Theoretical Analysis. J Grid Computing 12, 127–151 (2014). https://doi.org/10.1007/s10723-013-9281-4
DOI: https://doi.org/10.1007/s10723-013-9281-4