In large-scale heterogeneous cluster computing systems, processor and network failures are inevitable and can adversely affect the applications executing on such systems. One way of taking failures into account is to employ a reliable scheduling algorithm. However, most existing scheduling algorithms for precedence-constrained tasks in heterogeneous systems consider only schedule length and do not efficiently satisfy the reliability requirements of tasks. To address this problem, we build an application reliability analysis model based on the Weibull distribution, which can dynamically measure the reliability of tasks executing on heterogeneous clusters with arbitrary network architectures. We then propose a reliability-driven earliest finish time with duplication scheduling algorithm (REFTD), which incorporates task reliability overhead into scheduling. Furthermore, to improve system reliability, it duplicates a task if the task's hazard rate exceeds a threshold \(\theta \). A comparison study, based on both randomly generated graphs and the graphs of some real applications, shows that our scheduling algorithm can shorten schedule length and improve system reliability significantly.
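As a rough illustration of the quantities named in the abstract, the following sketch evaluates the standard two-parameter Weibull reliability function \(R(t) = e^{-(t/\eta)^{\beta}}\) and hazard rate \(h(t) = (\beta/\eta)(t/\eta)^{\beta-1}\), and applies a hazard-rate threshold test. It is not the REFTD algorithm itself: the function names, the parameter names `shape` and `scale`, and the rule of comparing the hazard rate at a single time point against \(\theta\) are assumptions made for illustration only.

```python
import math


def weibull_reliability(t, shape, scale):
    """Probability that a component survives past time t (t >= 0).

    Standard two-parameter Weibull: R(t) = exp(-(t / scale) ** shape).
    """
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate at time t (assumes t > 0).

    Standard two-parameter Weibull: h(t) = (shape / scale) * (t / scale) ** (shape - 1).
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def should_duplicate(t, shape, scale, theta):
    """Hypothetical duplication test: True if the hazard rate at t exceeds theta.

    This single-point comparison is an assumption; the paper's actual
    criterion may be defined differently.
    """
    return weibull_hazard(t, shape, scale) > theta


if __name__ == "__main__":
    # Illustrative parameters only; they are not taken from the paper.
    shape, scale, theta = 2.0, 5.0, 0.5
    for t in (1.0, 5.0, 10.0):
        print(t,
              weibull_reliability(t, shape, scale),
              weibull_hazard(t, shape, scale),
              should_duplicate(t, shape, scale, theta))
```

With `shape` equal to 1 the model reduces to the exponential distribution with a constant hazard rate, while a `shape` greater than 1 gives a hazard rate that grows with time, so under this sketch a long-running task is more likely to cross the threshold.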

This research was partially funded by the National Science Foundation of China (Grant Nos. 61133005, 61070057, 61370098), the National Science Foundation for Distinguished Young Scholars of Hunan (Grant No. 12JJ1011), and a project supported by the Scientific Research Fund of Hunan Provincial Education Department (Grant No. 12A062).
Tang, X., Li, K. & Liao, G. An effective reliability-driven technique of allocating tasks on heterogeneous cluster systems. Cluster Comput 17, 1413–1425 (2014). https://doi.org/10.1007/s10586-014-0372-1
