
1 Introduction

In recent years, many organizations face challenges in managing the large amounts of data generated by business activities inside and outside the organization. Massive data storage and access can cause issues such as network overloading, reduced operational efficiency and effectiveness, and high data management cost. Cloud computing has been widely used to alleviate these problems through its on-demand services and distributed architecture [1]. It allows heterogeneous cloud environments to satisfy user requirements and helps users minimize data loss risks and downtime. To obtain better data management performance in the cloud environment, data replication has been proposed to create and store multiple data copies at multiple sites [2]. Data replication brings many benefits, such as cost reduction, response time savings, improved data availability and enhanced reliability [3, 4]. In particular, fault tolerance is one of the benefits of implementing data replication [5]. In the cloud environment, a variety of faults may occur at a data center, such as disasters, artificial accidents, information loss, data corruption, software engineering faults and miscellaneous faults [1, 6]. These faults can significantly disrupt the execution of jobs that require access to the massive data stored in the faulty data center [7]. Thanks to data replication, such jobs can be redirected to other data centers where data replicas are available, known as backup data centers. Replication itself is also one of the proactive fault tolerance approaches. However, existing approaches such as Hadoop do not take sufficient account of the characteristics of the jobs to be rescued or of the overall performance of the cloud environment when handling faults. This can result in job rescue failure or performance deterioration.

In this paper, we propose a utility-based fault handling (UBFH) approach for more efficient job rescue at a faulty data center. The approach develops fault handling strategies based on common network performance measurements, namely network latency, bandwidth consumption and error rate, and on job attributes, including the job deadline constraint, job urgency and job operation profit. A utility function is developed to prioritize the jobs to be rescued, in other words, to be relocated. For each job redirection operation, network performance is evaluated to find the optimal route so that the job can be migrated out of the faulty data center to access a selected data replica. By doing so, our approach aims to achieve better repairability, job rescue utility and job operation profit. The simulation results show that our approach outperforms the HDFS, RR and JSQ approaches in repairability, job rescue utility and job operation profit.

The remainder of the paper is organized as follows. Section 2 reviews the related work. Then Sect. 3 discusses the system modelling. Section 4 describes the replica selection mechanism of our UBFH approach. Section 5 illustrates our fault handling approach and algorithms, followed by the simulation results in Sect. 6. Finally, Sect. 7 concludes the paper and outlines our future work.

2 Related Work

The cloud environment is subject to many types of faults, which may make a data center or the network links to a data center unavailable [1, 6]. When such a fault occurs, the jobs that require data access from the faulty data center may be seriously affected, resulting in deteriorated performance or access disruption [7]. Hence, it is critical for a data center to be able to handle unexpected faults to a large extent [8]. Fault tolerance techniques are typically divided into two categories: proactive and reactive [9]. Proactive fault tolerance techniques try to predict and manage faults to prevent them from occurring, while reactive fault tolerance techniques reduce the influence of faults after they have occurred [10]. For example, MapReduce uses self-healing and pre-emptive migration as its proactive fault tolerance approaches [9]. Examples of reactive fault tolerance approaches include checkpointing, retry, rescue workflow, user-defined exception handling, task resubmission and job migration [5].

Many contemporary fault tolerance approaches focus on resolving faults within a data center. In [1], a proactive fault tolerance approach is proposed that considers the coordination among multiple virtual machines to jointly complete a parallel application execution. The authors use CPU temperature to detect deteriorating physical machines in the data center and migrate the VMs on a deteriorating physical machine by deploying an improved particle swarm optimization algorithm. In [11], the authors propose an offloading system that considers the dependency relationships among multiple services to optimize execution time and energy consumption, aiming to make robust offloading decisions for mobile cloud services when faults occur. In [12], the authors propose a redundant VM placement optimization approach with three algorithms to improve service reliability.

There are situations in which a fault cannot be handled within a data center. For example, a data center might be temporarily closed due to a natural disaster, it might have temporarily limited connectivity to the outside due to a network problem, or the data stored in it might be corrupted accidentally. In such situations, all job requests requiring massive data access to the faulty data center might be completely disrupted. Data replication has become a promising approach to handle such situations.

Several static and dynamic replication approaches have been proposed in the past years. In [6], the authors propose a software-based selective replication to address silent data corruptions and fail-stop errors for HPC applications by aligning redundant computation with a checkpoint/restart method. To evaluate the extent of the reliability enhancement, the authors develop a reliability model based on Markov chains. In [13], an HDFS framework with an erasure-coded replication scheme is proposed to provide space-optimal data redundancy with the least storage space consumption and good storage space utilization in order to protect against data loss; the data distribution is based on consistent hashing. In [14], the authors propose a threshold-based file replication mechanism that handles file creation, popularity-based dynamic file replication and file request processing in case of node failure without user intervention. Threshold-based file replication approaches carry out the file replication when the total number of access requests for a particular file reaches a threshold value. Specifically, Hadoop uses the typical three-replica strategy, replicating each data block three times to improve read bandwidth for upper-layer applications. When a fault occurs on a specific data node, Hadoop generally redirects the data access request to another replica that meets certain criteria.

Unfortunately, most of these approaches give insufficient consideration to both common network performance measurements and the attributes of the affected jobs, such as their size, service delivery deadline and job operation profit, as shown in Table 1. When data access requests are redirected to other replica sites or when new data replicas are created, the impact on the overall performance of the cloud environment is largely overlooked. If a system executes many redirection or re-replication operations, it significantly increases the storage and network load on certain data centers [15]. In some cases, the redirection of data access requests from a faulty data center may even deplete the resources of another data center. In addition, without considering the attributes of the affected jobs, some jobs may miss their deadlines even if they have been redirected to access data replicas. This may result in user dissatisfaction, reputation damage and compensation. Therefore, insufficient consideration of both common network performance measurements and job attributes may largely degrade the overall performance [16]. Thus, it is desirable to have a novel replication-based fault handling approach that fully considers both common network performance measurements and the attributes of the affected jobs.

Table 1. The comparison of fault tolerance approaches.

3 System Modelling

3.1 Definitions

We define the following terms in our system. A data center houses computer systems and associated components such as air conditioning systems, fire protection systems and electrical power systems. In our paper, there are multiple data centers \( DC \): {\( dc_{1} \), \( dc_{2} \), …, \( dc_{n} \)} and multiple users with corresponding jobs \( J \): {\( j_{1} \), \( j_{2} \), …, \( j_{m} \)}. We treat each job as an independent job without considering its inner workflow structure. We consider an independent replica as a dataset that is required to support the execution of a job.

The repairability refers to the ability to rescue jobs when a fault occurs at a data center. It is measured as the ratio of the number of the rescued jobs to the total number of the jobs to be rescued.

The job utility refers to the modelled value of a job, and the job rescue utility is the sum of the job utilities of the jobs that have been rescued from the faulty data center. The job operation profit refers to the difference between the revenue and the cost of a job; it increases with revenue and decreases with cost.

3.2 Job Urgency and Operation Profit Model

Each job \( j \) in \( J \) is associated with a service delivery deadline requirement \( T_{Dead} \left( j \right) \). If such a requirement is not specified, the job has an infinite deadline. In this paper, we only consider jobs with a service delivery deadline, because jobs with an infinite deadline have no negative influence on cloud service providers.

Each job \( j \) also has a total completion time \( TCT\left( j \right) \), which is determined by the nature of the job. Besides, the past processing time in its original execution location, \( T_{Past} \left( j \right) \), should be considered if \( j \) has been selected to be migrated or redirected out of its initial location; \( T_{Past} \left( j \right) \) equals 0 if the job has not been executed. The internodal communication delay \( T_{IC} \left( j \right) \) is another factor to be considered, because extra time is incurred when the job is migrated or the data is transmitted across multiple network nodes between the users and the host nodes. The input scheduling delay \( T_{IS} \left( j \right) \) is the extra time incurred by scheduling the data input or the task execution. We assume that a job is re-executed from the beginning if it is migrated out of its initial location. To ensure the quality of services, every migrated job should satisfy its own service delivery deadline constraint; otherwise the migration operation is deterred.

We use job urgency (\( UR \)) to evaluate the time buffer of the job. The higher the job urgency value, the more time buffer the job has. The job urgency is formulated as in (1), where \( UR\left( j \right) \) is the job urgency value of the job \( j \).

$$ UR\left( j \right) = T_{Dead} \left( j \right) - (TCT\left( j \right) + T_{Past} \left( j \right) + T_{IC} \left( j \right) + T_{IS} \left( j \right)) $$
(1)
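As a minimal illustration, the sketch below computes the job urgency of (1); the function name and the millisecond units are illustrative assumptions rather than part of our model.

```python
def job_urgency(t_dead, tct, t_past, t_ic, t_is):
    """Job urgency UR(j) per Eq. (1): the remaining time buffer before the
    service delivery deadline T_Dead(j), given the total completion time
    TCT(j), the past processing time T_Past(j), the internodal communication
    delay T_IC(j) and the input scheduling delay T_IS(j)."""
    return t_dead - (tct + t_past + t_ic + t_is)

# Example (times in ms): a job with a 1000 ms deadline, 600 ms completion
# time, 10 ms already spent and 5 ms each of communication and scheduling
# delay has a 380 ms time buffer.
print(job_urgency(1000, 600, 10, 5, 5))  # 380
```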

Each job \( j \) in \( J \) is also associated with a job operation profit value, \( PRO\left( j \right) \), which is the difference between the revenue and the cost of the job \( j \).

3.3 Evaluation Metrics

The replica selection in this paper is primarily based on the cloud performance and service delivery performance in the overall cloud environment. To evaluate the cloud performance, we consider the bandwidth, the network latency, and the error rate as three major evaluation metrics. The bandwidth consumption of a specific data center \( dc_{x} \) in \( DC \), \( BC\left( {dc_{x} } \right) \), can be calculated using the equation in (2), where \( J^{x} \) is the set of jobs accessing this data center, \( Size\left( {j^{x} } \right) \) is the size of the dataset that is requested by a job \( j^{x} \), and \( TCT\left( {j^{x} } \right) \) denotes the total completion time of the job \( j^{x} \).

$$ BC\left( {dc_{x} } \right) = \mathop \sum \nolimits_{{j^{x} \in J^{x} }} \frac{{Size\left( {j^{x} } \right)}}{{TCT\left( {j^{x} } \right)}} $$
(2)

Then the available bandwidth of this data center, \( AB\left( {dc_{x} } \right) \), is the difference between the maximum bandwidth of this data center, \( maxB\left( {dc_{x} } \right) \), and its bandwidth consumption \( BC\left( {dc_{x} } \right) \), as presented in (3).

$$ AB\left( {dc_{x} } \right) = maxB\left( {dc_{x} } \right) - BC\left( {dc_{x} } \right) $$
(3)

Besides, the network latency is usually measured as either a one-way delay or a round-trip delay. Round-trip delay is quoted by network managers more often because it can be measured from a single point. The ping value has been widely used to measure the round-trip delay. It depends on a variety of factors, including the data transmission speed, the nature of the transmission medium, the physical distance between the two locations, the size of the transferred data, and the number of other data transmission requests being handled concurrently. To simplify the problem, the network latency of a data center \( dc_{x} \), \( NL\left( {dc_{x} } \right) \), is modelled as a constant value.

The error rate of a data center \( dc_{x} \), \( ER\left( {dc_{x} } \right) \), refers to the ratio of the total number of transmitted data units in error to the total number of transmitted data units, which can be represented as in (4).

$$ ER\left( {dc_{x} } \right) = \frac{\text{Total number of transmitted data units in error}}{\text{Total number of transmitted data units}} $$
(4)
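For concreteness, the following sketch evaluates the three metrics of (2)–(4) for one data center; the data structures and the units (Gb for dataset sizes, seconds for completion times) are illustrative assumptions.

```python
def bandwidth_consumption(jobs):
    """BC(dc_x) per Eq. (2): sum of Size(j)/TCT(j) over the jobs currently
    accessing the data center; `jobs` is a list of
    (size, total_completion_time) pairs."""
    return sum(size / tct for size, tct in jobs)

def available_bandwidth(max_bandwidth, jobs):
    """AB(dc_x) per Eq. (3): remaining capacity of the data center."""
    return max_bandwidth - bandwidth_consumption(jobs)

def error_rate(units_in_error, units_total):
    """ER(dc_x) per Eq. (4)."""
    return units_in_error / units_total

# Example: two jobs of 40 Gb/20 s and 100 Gb/50 s consume 4 Gbps in total,
# leaving 96 Gbps of a 100 Gbps link available.
print(available_bandwidth(100, [(40, 20), (100, 50)]))  # 96.0
```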

3.4 Job Rescue Utility

The utility function is often used to compare objects with multiple requirements and attributes. In this research, the utility value is used to prioritize jobs when they need to be redirected or migrated during fault handling in order to minimize the negative impact. Generally speaking, a data center prefers to rescue as many jobs as possible within their deadline requirements, so the fault handling of jobs should assign priority based on their job urgency. At the same time, a data center always tries to maximize its profit, so jobs that bring more profit to the data center should have higher priority. Therefore, we propose a job utility based on both job urgency and job operation profit to prioritize the jobs.

For the job \( j_{y}^{x} \) at the faulty data center \( dc_{x} \), the general expression of the job utility function \( U\left( {j_{y}^{x} } \right) \) is shown in (5) and should satisfy the condition in (6), where \( U_{UR} \left( {j_{y}^{x} } \right) \) and \( U_{PRO} \left( {j_{y}^{x} } \right) \) denote the utility values of the job urgency and the job operation profit for the job \( j_{y}^{x} \), respectively, and \( W_{UR} \) and \( W_{PRO} \) denote the corresponding weights of the job urgency and the job operation profit.

$$ U\left( {j_{y}^{x} } \right) = W_{UR} *U_{UR} \left( {j_{y}^{x} } \right) + W_{PRO} *U_{PRO} \left( {j_{y}^{x} } \right),\,j_{y}^{x} \in J^{x} $$
(5)
$$ W_{UR} + W_{PRO} = 1 $$
(6)

The utility value of the job urgency for a specific job \( j_{y}^{x} \) at a faulty location \( dc_{x} \) is calculated as follows in (7).

$$ U_{UR} \left( {j_{y}^{x} } \right) = \frac{{max\left( {UR\left( {j^{x} } \right)} \right) - UR\left( {j_{y}^{x} } \right)}}{{max\left( {UR\left( {j^{x} } \right)} \right) - min\left( {UR\left( {j^{x} } \right)} \right)}};\,j_{y}^{x} ,\,j^{x} \in J^{x} $$
(7)

The utility value of the job operation profit for a specific job \( j_{y}^{x} \) at a faulty location \( dc_{x} \) is calculated as follows in (8).

$$ U_{PRO} \left( {j_{y}^{x} } \right) = \frac{{PRO\left( {j_{y}^{x} } \right) - min\left( {PRO\left( {j^{x} } \right)} \right)}}{{max\left( {PRO\left( {j^{x} } \right)} \right) - min\left( {PRO\left( {j^{x} } \right)} \right)}};\,j_{y}^{x} ,\,j^{x} \in J^{x} $$
(8)

Then the job rescue utility of a faulty data center \( dc_{x} \), \( U_{R} \left( {dc_{x} } \right) \), can be calculated using the equation in (9), where \( \vartheta \) is an indicator of the job rescue outcome: \( \vartheta \) is 1 if the job is rescued from the faulty data center, and 0 otherwise.

$$ U_{R} \left( {dc_{x} } \right) = \mathop \sum \limits_{{j_{y}^{x} \in J^{x} }} \vartheta *U\left( {j_{y}^{x} } \right),\,j_{y}^{x} \in J^{x} $$
(9)
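To make the prioritization concrete, the sketch below computes the job utility of (5)–(8) and the job rescue utility of (9); the dictionary layout, the field names and the handling of the degenerate case where all jobs share the same urgency or profit are illustrative assumptions.

```python
def job_utility(job, jobs, w_ur=0.5, w_pro=0.5):
    """U(j) per Eqs. (5)-(8). `job` and each element of `jobs` are dicts
    with keys 'ur' (job urgency) and 'pro' (job operation profit);
    w_ur + w_pro = 1 per Eq. (6)."""
    urs = [j["ur"] for j in jobs]
    pros = [j["pro"] for j in jobs]
    # Eq. (7): more urgent jobs (smaller time buffer) get higher utility.
    u_ur = 0.0 if max(urs) == min(urs) else \
        (max(urs) - job["ur"]) / (max(urs) - min(urs))
    # Eq. (8): more profitable jobs get higher utility.
    u_pro = 0.0 if max(pros) == min(pros) else \
        (job["pro"] - min(pros)) / (max(pros) - min(pros))
    return w_ur * u_ur + w_pro * u_pro          # Eq. (5)

def job_rescue_utility(jobs, rescued_ids, w_ur=0.5, w_pro=0.5):
    """U_R(dc_x) per Eq. (9): total utility of the jobs that were actually
    rescued from the faulty data center (identified by their 'id')."""
    return sum(job_utility(j, jobs, w_ur, w_pro)
               for j in jobs if j["id"] in rescued_ids)
```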

4 The Replica Selection Schema

Our replica selection schema evaluates the overall cloud performance by applying three network evaluation metrics to select the best replica site to access. Three weight parameters are used to configure the evaluation metrics and generate different replica selection decisions. \( W_{AB}^{x} \) denotes the weight of the available bandwidth metric of \( dc_{x} \), \( W_{NL}^{x} \) denotes the weight of the network latency metric of \( dc_{x} \), and \( W_{ER}^{x} \) denotes the weight of the error rate metric of \( dc_{x} \). The final weight of a specific data center, \( FW\left( {dc_{x} } \right) \), is expressed in (10), where \( NC_{AB}^{x} \), \( NC_{NL}^{x} \) and \( NC_{ER}^{x} \) denote the normalization components of the available bandwidth, network latency and error rate metrics of \( dc_{x} \), respectively. For a request to access a dataset that has replicas at multiple sites, the data center with the maximum \( FW\left( {dc_{x} } \right) \) value is selected as the optimal access route for the request.

$$ \left\{ {\begin{array}{*{20}c} {FW\left( {dc_{x} } \right) = W_{AB}^{x} * NC_{AB}^{x} + W_{NL}^{x} * NC_{NL}^{x} + W_{ER}^{x} * NC_{ER}^{x} , dc_{x} \in DC} \\ {W_{AB}^{x} + W_{NL}^{x} + W_{ER}^{x} = 1} \\ \end{array} } \right. $$
(10)

Different evaluation metrics should be treated differently depending on their nature: the highest available bandwidth is the best case, whereas the highest network latency or error rate is the worst case. Hence, the normalization of the three evaluation metrics is formulated in (11), (12) and (13), respectively. If \( FW\left( {dc_{x} } \right) \) is the same among two or more locations, the location with the least network latency is selected as the optimal route. Furthermore, if \( NL\left( {dc_{x} } \right) \) is also the same among two or more locations, the location with the lower error rate is recognized as the optimal route.

$$ NC_{AB}^{x} = \frac{{AB\left( {dc_{x} } \right) - min\left\{ {AB\left( {dc} \right)} \right\}}}{{max\left\{ {AB\left( {dc} \right)} \right\} - min\left\{ {AB\left( {dc} \right)} \right\}}};\,dc_{x} , dc \in DC $$
(11)
$$ NC_{NL}^{x} = \frac{{\hbox{max} \left\{ {NL\left( {dc} \right)} \right\} - NL\left( {dc_{x} } \right)}}{{\hbox{max} \left\{ {NL\left( {dc} \right)} \right\} - min\left\{ {NL\left( {dc} \right)} \right\}}};\,dc_{x} , dc \in DC $$
(12)
$$ NC_{ER}^{x} = \frac{{\hbox{max} \left\{ {ER\left( {dc} \right)} \right\} - ER\left( {dc_{x} } \right)}}{{\hbox{max} \left\{ {ER\left( {dc} \right)} \right\} - min\left\{ {ER\left( {dc} \right)} \right\}}};\, dc_{x} , dc \in DC $$
(13)
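A compact sketch of this replica selection schema, including the latency and error rate tie-breaking rules, is given below; the data center representation and the equal-weight defaults are illustrative assumptions.

```python
def final_weight(dc, dcs, w_ab=1/3, w_nl=1/3, w_er=1/3):
    """FW(dc_x) per Eqs. (10)-(13). Each data center is a dict with keys
    'ab' (available bandwidth), 'nl' (network latency) and 'er' (error
    rate); the weights sum to 1 per Eq. (10)."""
    def norm(value, values, higher_is_better):
        lo, hi = min(values), max(values)
        if hi == lo:          # degenerate case, not specified in the paper
            return 1.0
        return (value - lo) / (hi - lo) if higher_is_better else (hi - value) / (hi - lo)
    nc_ab = norm(dc["ab"], [d["ab"] for d in dcs], True)    # Eq. (11)
    nc_nl = norm(dc["nl"], [d["nl"] for d in dcs], False)   # Eq. (12)
    nc_er = norm(dc["er"], [d["er"] for d in dcs], False)   # Eq. (13)
    return w_ab * nc_ab + w_nl * nc_nl + w_er * nc_er       # Eq. (10)

def select_replica_site(candidates):
    """Pick the replica-ready site with the highest FW; ties are broken by
    the lower network latency and then by the lower error rate."""
    return max(candidates,
               key=lambda d: (final_weight(d, candidates), -d["nl"], -d["er"]))
```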

5 UBFH Fault Handling Approach and Algorithms

Put simply, our UBFH fault handling approach tries to migrate jobs out of the faulty data center and redirect them to backup replica sites. The migration not only considers the performance of accessing the backup replicas, but also strives to satisfy the service delivery deadline constraints. To achieve these goals, the algorithm uses two functions, Redirection() and Migration(), to find fault handling solutions under different scenarios for a job at the faulty data center. The algorithm uses utility-based ranking to evaluate the job priority for redirection or migration. The job utility is treated differently depending on the fault circumstances of a data center: the job with the lower utility has the higher migration priority at a backup data center, while the job with the higher utility has the higher priority to be rescued at the faulty data center.

[figure a: the UBFH fault handling algorithm]

The UBFH algorithm includes two major parts: fault handling solution generation and implementation. Firstly, the jobs at the faulty data center are ranked in descending order of their \( U\left( j \right) \) in Line 1 and added to the rank list \( ranklist[] \). Then a fault handling solution (FHS) is worked out by the Redirection() function for each job in \( ranklist[] \) from Line 2 to 7. The input parameter of the Redirection() function is the job at the faulty data center that is to be rescued. The generation of the FHS is based on the RedirectionResult, which includes a set of data center information, the redirection destination \( dc_{red} \) and the migration destination \( dc_{mig} \). Finally, after the FHS is generated, the job moving activities are performed from Line 8 to 12.
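The following sketch mirrors the solution generation part of this algorithm, reusing the job_utility helper sketched in Sect. 3.4; the callable redirection_fn stands in for the Redirection() function (sketched further below) and the returned structure is an illustrative assumption.

```python
def ubfh(faulty_jobs, redirection_fn):
    """Sketch of UBFH fault handling solution generation: rank the jobs at
    the faulty data center by utility in descending order, then work out a
    fault handling solution (FHS) for each job via Redirection().
    `redirection_fn(job)` is assumed to return (dc_red, migration_plan)
    or None when the job cannot be rescued."""
    ranklist = sorted(faulty_jobs,
                      key=lambda j: job_utility(j, faulty_jobs),
                      reverse=True)
    solutions = []
    for job in ranklist:
        fhs = redirection_fn(job)
        if fhs is not None:
            solutions.append((job, fhs))
    # Implementation phase: the caller first migrates the movable job
    # (if any) to free resources, then redirects the rescued job to dc_red.
    return solutions
```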

[figure b: the Redirection() function]

If the job redirection function Redirection() is called, the backup replica-ready data centers are first mapped to the input job in Line 1. A comparison between the bandwidth consumption of the input job and the available bandwidth of the backup replica-ready data centers is made to find the optimal job redirection route from Line 2 to Line 6. If the available bandwidth of all the backup replica-ready data centers is insufficient to receive the redirected job from the faulty data center, the migration function Migration() is initiated in Line 8.
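A minimal sketch of this logic is given below, reusing the select_replica_site helper sketched in Sect. 4; the field names ('bc' for the job's bandwidth consumption, 'ab' for each data center's available bandwidth) and the fallback signature are illustrative assumptions.

```python
def redirection(job, backup_dcs, migration_fn):
    """Sketch of Redirection(): keep the replica-ready backup data centers
    whose available bandwidth can absorb the job's bandwidth consumption
    and pick the best one with the replica selection schema; if none can,
    fall back to Migration()."""
    eligible = [dc for dc in backup_dcs if dc["ab"] >= job["bc"]]
    if eligible:
        dc_red = select_replica_site(eligible)   # Eqs. (10)-(13)
        return dc_red, None                      # no migration needed
    return migration_fn(job, backup_dcs)         # all backup sites lack bandwidth
```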

[figure c: the Migration() function]

In case none of the backup data centers has the capacity to support a job to be rescued, the migration function Migration() migrates an existing job out of a replica-ready backup data center to release resources for that job. Firstly, Line 1 collects the running jobs on the backup data centers and creates a new group of jobs \( j_{mig} \). Then, for each job in this group, its backup data centers are mapped in Line 3. A bandwidth comparison between the bandwidth consumption of the redirected job at the faulty data center and the sum of the bandwidth consumption of \( j_{mig} \) and the available bandwidth of its backup replica-ready data centers is conducted in Line 4, and a new group of migratable jobs \( movable\_job[] \) is created in Line 6 based on the movable job selection rule in Line 5. A reverse Quicksort is applied to rank the jobs in \( movable\_job[] \) in ascending order of job utility in Line 7. A comparison between the bandwidth consumption of the movable job and the available bandwidth of its backup replica-ready data centers is conducted in Line 9 to find the eligible migration destinations for the movable job in \( movable\_job[] \). Then, based on our replica selection schema, the optimal job redirection route for rescuing the job at the faulty data center and the optimal job migration route for the movable job at the backup data centers are finalized from Line 11 to Line 15.
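The sketch below follows the same steps under illustrative assumptions: running_jobs maps each backup data center id to the jobs currently running on it, each running job carries 'bc', 'ur', 'pro' and its own list of replica-ready backup data centers, and job_utility and select_replica_site are the helpers sketched earlier.

```python
def migration(rescue_job, backup_dcs, running_jobs):
    """Sketch of Migration(): find a low-utility running job at a backup
    site whose departure frees enough bandwidth for `rescue_job`, pick a
    destination for it among its own replica-ready backup sites, and
    return the redirection target plus the migration plan."""
    movable = []
    for dc in backup_dcs:
        for j_mig in running_jobs.get(dc["id"], []):
            # Movable job selection rule: releasing j_mig must leave room
            # for the rescued job at this backup data center.
            if rescue_job["bc"] <= dc["ab"] + j_mig["bc"]:
                movable.append((j_mig, dc))
    # Jobs with lower utility have higher migration priority.
    peers = [j for j, _ in movable]
    movable.sort(key=lambda pair: job_utility(pair[0], peers))
    for j_mig, dc_red in movable:
        # The movable job must itself fit into one of its backup replica sites.
        targets = [d for d in j_mig["backup_dcs"] if d["ab"] >= j_mig["bc"]]
        if targets:
            dc_mig = select_replica_site(targets)   # optimal migration route
            return dc_red, (j_mig, dc_mig)          # FHS: dc_red plus migration plan
    return None                                     # the job cannot be rescued
```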

6 Simulation Results

To evaluate the fault handling effectiveness and efficiency, we performed a series of simulations on OMNeT++ 5.4.1. OMNeT++ is an extensible, modular, component-based C++ simulation library and framework, primarily for building network and cloud communication simulators [17, 18]. To reduce the simulation uncertainty and present a clear result, we make the following assumptions in our simulations:

  • The data centers in the cloud environment have the same speed and resources.

  • The storage resource is large enough.

  • The routes between the data centers have no overlap.

  • The transfer latency keeps stable between each pair of data centers.

  • The job consumes the bandwidth resource at a constant rate when executing.

A cloud environment of 5 data centers was implemented, with 250 circuits of 100 Gbps optical-fiber network integrated at each data center site. The maximum bandwidth of \( dc_{1} \) to \( dc_{5} \) is thus set to 25000 Gbps each. The network latency of \( dc_{1} \) to \( dc_{5} \) is set to 20, 60, 40, 60, and 100 respectively, and the error rate to 0.1%, 0.2%, 0.5%, 0.1% and 0.4% respectively. To avoid fluctuations of uncertain internodal latency, input scheduling time and network latency between users and data centers, we set \( T_{IC} \left( j \right) \) and \( T_{IS} \left( j \right) \) to 5 ms and adopt a single user with multiple requested jobs accessing the datasets at different data centers in the simulated environment. A fault is set to occur at 10 ms of system running time (\( T_{Past} \left( j \right) \) = 10 ms) in \( dc_{2} \), which leads to the closing down of \( dc_{2} \). The job deadline \( T_{Dead} \left( j \right) \) and the total completion time \( TCT\left( j \right) \) are randomly set in the range of 0 ms to 1000 ms. The size of a job is randomly selected in the range of 0 GB to 5 GB, similar to that of many current data-intensive workflow jobs, such as Epigenomics. Each dataset has 3 replicas that are randomly placed in the 5 data centers. To simplify the problem, the weights \( W_{AB}^{x} \), \( W_{NL}^{x} \) and \( W_{ER}^{x} \) are all set to 1/3.
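For reference, the setup above can be captured in a small configuration sketch; the millisecond unit assumed for the latency values and the dictionary layout are illustrative assumptions.

```python
import random

# Data center parameters used in the simulations (bandwidth in Gbps,
# latency assumed in ms, error rate as a fraction).
data_centers = [
    {"id": "dc1", "max_bw": 25000, "nl": 20,  "er": 0.001},
    {"id": "dc2", "max_bw": 25000, "nl": 60,  "er": 0.002},
    {"id": "dc3", "max_bw": 25000, "nl": 40,  "er": 0.005},
    {"id": "dc4", "max_bw": 25000, "nl": 60,  "er": 0.001},
    {"id": "dc5", "max_bw": 25000, "nl": 100, "er": 0.004},
]
T_IC = T_IS = 5                      # communication / scheduling delay (ms)
FAULT_TIME, FAULTY_DC = 10, "dc2"    # dc2 closes down at 10 ms
W_AB = W_NL = W_ER = 1 / 3           # replica selection weights

def random_job(job_id):
    """One simulated job with the randomized attributes described above."""
    return {
        "id": job_id,
        "t_dead": random.uniform(0, 1000),   # service delivery deadline (ms)
        "tct": random.uniform(0, 1000),      # total completion time (ms)
        "size": random.uniform(0, 5),        # requested dataset size (GB)
        "replicas": random.sample([d["id"] for d in data_centers], 3),
    }
```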

We compared our fault handling approach with the typical HDFS robustness approach applied in the HDFS system, the RR approach [19] applied in SQL Server 2016, and the JSQ approach applied in the Cisco Local Director, IBM Network Dispatcher and Microsoft SharePoint [20,21,22]. All four approaches were implemented under a single-fault scenario, and their job rescue performance was evaluated in terms of the repairability, job rescue utility and job operation profit defined above. The utility weights were varied across the three simulations, covering an equivalent utility weight scenario, an urgency-heavy scenario and a profit-heavy scenario, to test the effectiveness of our approach.

6.1 Simulation 1 – Equivalent Utility Weights

In Simulation 1, the weights of the job rescue utility follow the equivalent utility scenario: both \( W_{UR} \) and \( W_{PRO} \) are set to 0.5. The simulation results are shown in Fig. 1.

Fig. 1. The repairability, job rescue utility and job operation profit of Simulation 1.

Figure 1 shows that our UBFH approach outperforms all three other approaches in both job rescue utility and job operation profit, whether the environment has sufficient or insufficient resources. For example, when the environment has sufficient resources at 340 jobs, our UBFH approach achieves up to 11.89% higher job rescue utility and up to 9.46% more job operation profit than the HDFS and RR approaches. As the number of jobs increases and the environment has insufficient resources at 380 jobs, our UBFH approach still achieves up to 11.04% higher job rescue utility and up to 5.09% more job operation profit than the HDFS and RR approaches.

From the repairability perspective, our UBFH approach also achieves better repairability than all three other approaches when the resources are sufficient to support the job executions. When the resources become limited, our UBFH approach aims to migrate lower-utility jobs at the backup data centers in order to release resources for higher-utility jobs from the faulty data center. In doing so, saving the higher-utility jobs from the faulty data center may increase the resource pressure on other data centers, so some lower-utility jobs might be sacrificed. This leads to a repairability decrease when the resources become insufficient.

6.2 Simulation 2 – Urgency-Heavy Utility

In Simulation 2, we increase the weight of the job urgency to 0.67 and decrease the weight of the job operation profit to 0.33, i.e., the weights of the job rescue utility follow the urgency-heavy scenario with \( W_{UR} \) = 0.67 and \( W_{PRO} \) = 0.33. The simulation results are shown in Fig. 2.

Fig. 2. The repairability, job rescue utility and job operation profit of Simulation 2.

In Fig. 2, our UBFH approach shows a similar trend to Simulation 1. For example, when the environment has sufficient resources at 340 jobs, our UBFH approach achieves up to 11.49% higher job rescue utility and up to 9.46% more job operation profit than the HDFS and JSQ approaches under the urgency-heavy utility weights. As the number of jobs increases and the environment has insufficient resources at 360 jobs, our UBFH approach still achieves up to 8.29% higher job rescue utility and up to 4.29% more job operation profit than the HDFS and JSQ approaches.

The repairability is again divided into sufficient and insufficient resource scenarios. Our UBFH approach maintains higher repairability than all three other approaches when the resources are sufficient. When the resources become limited, our UBFH operations still show a certain degree of repairability decrease, for the same reason as in Simulation 1.

6.3 Simulation 3 – Profit-Heavy Utility Weights

In Simulation 3, we decrease the weight of the job urgency to 0.33 and increase the weight of the job operation profit to 0.67, i.e., the weights of the job rescue utility follow the profit-heavy scenario with \( W_{UR} \) = 0.33 and \( W_{PRO} \) = 0.67. The simulation results are shown in Fig. 3.

Fig. 3. The repairability, job rescue utility and job operation profit of Simulation 3.

Figure 3 shows that our UBFH approach still maintains higher job rescue utility and job operation profit than all three other approaches when the utility weights are profit-heavy, in both the sufficient and the insufficient resource scenarios. We still achieve up to a 9.36% job rescue utility increase and up to an 8.84% job operation profit increase over the HDFS and JSQ approaches when the number of jobs reaches 340 in the resource-sufficient scenario. The job rescue utility and job operation profit also remain higher than those of all three other approaches in the resource-insufficient scenario, for example, 3.14% higher job rescue utility and 3.26% more job operation profit than the HDFS and JSQ approaches.

The repairability still decreases when the resources become limited. Nevertheless, it remains higher than that of all three other approaches, with a maximum 7.88% repairability increase at 340 jobs under the sufficient resource scenario.

7 Conclusions and Future Works

Data replication is a common way to achieve proactive fault tolerance by creating multiple data copies at geographically distributed locations. However, a variety of faults can still occur in cloud environments, and most common replication approaches give insufficient consideration to common network performance measurements and job attributes. In this paper, we propose a utility-based fault handling (UBFH) approach to rescue the jobs at a faulty data center, with network performance measurements and job attributes in the cloud environment as the major considerations. A fault handling algorithm is developed to determine the direction of job redirection. The simulation results show that our UBFH approach achieves better repairability, job rescue utility and job operation profit than the HDFS, RR and JSQ approaches. In the future, we will consider the uncertain influence of the error rate at each data center site, since we currently assume that no errors occur on the network links. Besides, multi-packaged job migration will be investigated, because we currently only consider single-job migration.