Abstract
In large-scale heterogeneous cluster computing systems, processor and network failures are inevitable and can adversely affect the applications executing on them. One way to take failures into account is to employ a reliable scheduling algorithm. However, most existing scheduling algorithms for precedence-constrained tasks in heterogeneous systems consider only schedule length and do not efficiently satisfy the reliability requirements of tasks. To address this problem, we build an application reliability analysis model based on the Weibull distribution, which can dynamically measure the reliability of tasks executing on a heterogeneous cluster with arbitrary network architectures. We then propose a reliability-driven earliest finish time with duplication scheduling algorithm (REFTD), which incorporates task reliability overhead into scheduling. Furthermore, to improve system reliability, REFTD duplicates a task if the task's hazard rate exceeds a threshold \(\theta \). A comparison study, based on both randomly generated graphs and the graphs of some real applications, shows that our scheduling algorithm can shorten schedule length and improve system reliability significantly.
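To make the Weibull-based quantities in the abstract concrete, the following is a minimal sketch using the standard two-parameter Weibull formulas for reliability and hazard rate. It is not the paper's model, which the abstract does not spell out. The function names, the `shape` and `scale` parameters, the conditional-survival form of `task_reliability`, and the strict `>` comparison against `theta` are all illustrative assumptions.

```python
import math


def weibull_reliability(t, shape, scale):
    """Standard Weibull survival function R(t) = exp(-(t/scale)^shape)."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """Standard Weibull hazard rate h(t) = (shape/scale) * (t/scale)^(shape-1).

    Assumes t > 0 (for shape < 1 the hazard is unbounded at t = 0).
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def task_reliability(start, duration, shape, scale):
    """Hypothetical helper: probability that a processor which has survived
    until `start` also survives the next `duration` time units
    (conditional survival R(start + duration) / R(start))."""
    return (weibull_reliability(start + duration, shape, scale)
            / weibull_reliability(start, shape, scale))


def should_duplicate(t, shape, scale, theta):
    """Hypothetical decision rule: duplicate the task when the hazard rate
    at time t exceeds the threshold theta."""
    return weibull_hazard(t, shape, scale) > theta


if __name__ == "__main__":
    # Illustrative parameters only; they are not taken from the paper.
    shape, scale, theta = 2.0, 100.0, 0.02
    for t in (50.0, 200.0):
        print(t, weibull_hazard(t, shape, scale),
              should_duplicate(t, shape, scale, theta))
```

With `shape = 1` the hazard rate is constant and the model reduces to the exponential distribution. A shape above 1 gives a hazard rate that grows with time, so under this sketch tasks scheduled later on an aging processor are more likely to cross the threshold.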
Acknowledgments
This research was partially funded by the National Science Foundation of China (Grant Nos. 61133005, 61070057, 61370098), the National Science Foundation for Distinguished Young Scholars of Hunan (Grant No. 12JJ1011), and a project supported by the Scientific Research Fund of Hunan Provincial Education Department (Grant No. 12A062).
Cite this article
Tang, X., Li, K. & Liao, G. An effective reliability-driven technique of allocating tasks on heterogeneous cluster systems. Cluster Comput 17, 1413–1425 (2014). https://doi.org/10.1007/s10586-014-0372-1
DOI: https://doi.org/10.1007/s10586-014-0372-1