
1 Introduction

In recent years, many organizations face challenges in managing the large amounts of data generated by business activities inside and outside the organization. Massive data storage and access can cause issues such as network overloading, reduced operational efficiency and effectiveness, and high data management cost. Cloud computing has been widely used to alleviate these problems through its on-demand services and distributed architecture [1]. It allows heterogeneous cloud environments to satisfy user requirements and helps users minimize data loss risks and downtime. To obtain better data management performance in the cloud environment, data replication has been proposed to create and store multiple data copies at multiple sites [2]. Data replication brings many benefits, such as cost reduction, response time savings, improved data availability and enhanced reliability [3, 4]. In particular, fault tolerance is one of the benefits of implementing data replication [5]. In the cloud environment, a variety of faults may occur at a data center, such as disasters, artificial accidents, information loss, data corruption, software engineering faults and miscellaneous faults [1, 6]. These faults can significantly disrupt the execution of jobs that require access to the massive data stored in the faulty data center [7]. Thanks to data replication, such jobs can be redirected to other data centers where data replicas are available, known as backup data centers. Replication itself is also one of the proactive fault tolerance approaches. However, existing approaches such as Hadoop do not take sufficient account of the characteristics of the jobs to be rescued or of the overall performance of the cloud environment when handling faults. This can result in job rescue failure or performance deterioration.

In this paper, we propose a utility-based fault handling (UBFH) approach for more efficient job rescue at a faulty data center. The approach develops fault handling strategies based on common network performance measurements, namely network latency, bandwidth consumption and error rate, and on job attributes, including the job deadline constraint, job urgency and job operation profit. A utility function is developed to prioritize the jobs to be rescued, in other words, to be relocated. For each job redirection operation, network performance is evaluated to find the optimal route so that the job can be migrated out of the faulty data center to access a selected data replica. By doing so, our approach aims to achieve better repairability, job rescue utility and job operation profit. The simulation results show that our approach outperforms the HDFS, RR and JSQ approaches in repairability, job rescue utility and job operation profit.

The remainder of the paper is organized as follows. Section 2 reviews the related work. Then Sect. 3 discusses the system modelling. Section 4 describes the replica selection mechanism of our UBFH approach. Section 5 illustrates our fault handling approach and algorithms, followed by the simulation results in Sect. 6. Finally, Sect. 7 concludes the paper and outlines our future work.

2 Related Work

The cloud environment is subject to many types of faults, which may make a data center or the network links to a data center unavailable [1, 6]. When such a fault occurs, the jobs that require data access from the faulty data center may be seriously affected, resulting in deteriorated performance or access disruption [7]. Hence, it is critical for a data center to be able to handle unexpected faults to a large extent [8]. Fault tolerance techniques are typically divided into two categories: proactive and reactive [9]. Proactive fault tolerance techniques try to predict and manage faults to prevent them from occurring, while reactive fault tolerance techniques reduce the influence of faults after they have occurred [10]. For example, MapReduce uses self-healing and pre-emptive migration as its proactive fault tolerance approaches [9]. Examples of reactive fault tolerance approaches include checkpointing, retry, rescue workflow, user-defined exception handling, task resubmission and job migration [5].

Many contemporary fault tolerance approaches focus on resolving faults within a data center. In [1], a proactive fault tolerance approach is proposed that considers the coordination among multiple virtual machines to jointly complete a parallel application execution. The authors use CPU temperature to detect deteriorating physical machines in the data center and migrate the VMs on a deteriorating physical machine by deploying an improved particle swarm optimization algorithm. In [11], the authors propose an offloading system that considers the dependency relationships among multiple services to optimize execution time and energy consumption, aiming to make robust offloading decisions for mobile cloud services when faults occur. In [12], the authors propose a redundant VM placement optimization approach with three algorithms to improve service reliability.

There are situations in which a fault cannot be handled within a data center. For example, a data center might be temporarily closed due to a natural disaster, it might have temporarily limited connectivity to the outside due to a network problem, or the data stored in it might be corrupted accidentally. In such situations, all job requests requiring massive data access to the faulty data center might be completely disrupted. Data replication has become a promising approach to handle such situations.

Several static and dynamic replication approaches have been proposed in the past years. In [6], the authors propose a software-based selective replication to address silent data corruptions and fail-stop errors for HPC applications by aligning redundant computation with a checkpoint/restart method. To evaluate the extent of the reliability enhancement, the authors develop a reliability model based on Markov chains. In [13], an HDFS framework with an erasure-coded replication scheme is proposed to provide space-optimal data redundancy with the least storage space consumption and good storage space utilization in order to protect against data loss; the data distribution is based on consistent hashing. In [14], the authors propose a threshold-based file replication mechanism that handles file creation, popularity-based dynamic file replication and file request processing in case of node failure without user intervention. Threshold-based file replication approaches carry out the file replication when the total number of access requests for a particular file reaches a threshold value. Specifically, Hadoop uses the typical three-replica strategy, replicating each data block three times to improve read bandwidth for upper-layer applications. When a fault occurs on a specific data node, Hadoop generally redirects the data access request to another replica that meets certain criteria.

Unfortunately, most of these approaches give insufficient consideration to both common network performance measurements and the attributes of the affected jobs, such as their size, service delivery deadline and job operation profit, as shown in Table 1. When data access requests are redirected to other replica sites or when new data replicas are created, the impact on the overall performance of the cloud environment is largely overlooked. If a system executes many redirection or re-replication operations, it significantly increases the storage and network load on certain data centers [15]. In some cases, the redirection of data access requests from a faulty data center may even deplete the resources of another data center. In addition, without considering the attributes of the affected jobs, some jobs may miss their deadlines even if they have been redirected to access data replicas. This may result in user dissatisfaction, reputation damage and compensation. Therefore, insufficient consideration of both common network performance measurements and job attributes may largely degrade the overall performance [16]. Thus, it is desirable to have a novel replication-based fault handling approach that fully considers both common network performance measurements and the attributes of the affected jobs.

Table 1. The comparison of fault tolerance approaches.

3 System Modelling

3.1 Definitions

We define the following terms in our system. A data center houses computer systems and associated components such as air conditioning systems, fire protection systems and electrical power systems. In our paper, there are multiple data centers \( DC \): {\( dc_{1} \), \( dc_{2} \), …, \( dc_{n} \)} and multiple users with corresponding jobs \( J \): {\( j_{1} \), \( j_{2} \), …, \( j_{m} \)}. We treat each job as an independent job without considering its inner workflow structure. We consider an independent replica as a dataset that is required to support the execution of a job.

The repairability refers to the ability to rescue jobs when a fault occurs at a data center. It is measured as the ratio of the number of the rescued jobs to the total number of the jobs to be rescued.

The job utility refers to the modelled value of a job, and the job rescue utility is the sum of the job utilities of the jobs that have been rescued from the faulty data center. The job operation profit refers to the difference between the revenue and the cost of a job; it increases with revenue and decreases with cost.

3.2 Job Urgency and Operation Profit Model

Each job \( j \) in \( J \) is associated with a service delivery deadline requirement \( T_{Dead} \left( j \right) \). If such a requirement is not specified, the job has an infinite deadline. In this paper, we only consider jobs with a service delivery deadline, because jobs with an infinite deadline have no negative influence on cloud service providers.

Each job \( j \) also has a total completion time \( TCT\left( j \right) \), which is determined by the nature of the job. Besides, the past processing time in its original execution location, \( T_{Past} \left( j \right) \), should be considered if \( j \) has been selected to be migrated or redirected out of its initial location; \( T_{Past} \left( j \right) \) equals 0 if the job has not been executed. The internodal communication delay \( T_{IC} \left( j \right) \) is another factor to be considered, because extra time is incurred when the job is migrated or the data is transmitted across multiple network nodes between the users and the host nodes. The input scheduling delay \( T_{IS} \left( j \right) \) is the extra time incurred by scheduling the data input or the task execution. We assume that a job is re-executed from the beginning if it is migrated out of its initial location. To ensure the quality of services, every migrated job should satisfy its own service delivery deadline constraint; otherwise the migration operation is deterred.

We use job urgency (\( UR \)) to evaluate the time buffer of the job. The higher the job urgency value, the more time buffer the job has. The job urgency is formulated as in (1), where \( UR\left( j \right) \) is the job urgency value of the job \( j \).

$$ UR\left( j \right) = T_{Dead} \left( j \right) - (TCT\left( j \right) + T_{Past} \left( j \right) + T_{IC} \left( j \right) + T_{IS} \left( j \right)) $$
(1)
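As a minimal illustration, the sketch below computes the job urgency of (1); the function name and the millisecond units are illustrative assumptions rather than part of our model.

```python
def job_urgency(t_dead, tct, t_past, t_ic, t_is):
    """Job urgency UR(j) per Eq. (1): the remaining time buffer before the
    service delivery deadline T_Dead(j), given the total completion time
    TCT(j), the past processing time T_Past(j), the internodal communication
    delay T_IC(j) and the input scheduling delay T_IS(j)."""
    return t_dead - (tct + t_past + t_ic + t_is)

# Example (times in ms): a job with a 1000 ms deadline, 600 ms completion
# time, 10 ms already spent and 5 ms each of communication and scheduling
# delay has a 380 ms time buffer.
print(job_urgency(1000, 600, 10, 5, 5))  # 380
```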

Each job \( j \) in \( J \) is also associated with a job operation profit value, \( PRO\left( j \right) \), which is the difference between the revenue and the cost of the job \( j \).

3.3 Evaluation Metrics

The replica selection in this paper is primarily based on the cloud performance and service delivery performance in the overall cloud environment. To evaluate the cloud performance, we consider the bandwidth, the network latency, and the error rate as three major evaluation metrics. The bandwidth consumption of a specific data center \( dc_{x} \) in \( DC \), \( BC\left( {dc_{x} } \right) \), can be calculated using the equation in (2), where \( J^{x} \) is the set of jobs accessing this data center, \( Size\left( {j^{x} } \right) \) is the size of the dataset that is requested by a job \( j^{x} \), and \( TCT\left( {j^{x} } \right) \) denotes the total completion time of the job \( j^{x} \).

$$ BC\left( {dc_{x} } \right) = \mathop \sum \nolimits_{{j^{x} \in J^{x} }} \frac{{Size\left( {j^{x} } \right)}}{{TCT\left( {j^{x} } \right)}} $$
(2)

Then the available bandwidth of this data center, \( AB\left( {dc_{x} } \right) \), is the difference between the maximum bandwidth of this data center, \( maxB\left( {dc_{x} } \right) \), and its bandwidth consumption \( BC\left( {dc_{x} } \right) \), as presented in (3).

$$ AB\left( {dc_{x} } \right) = maxB\left( {dc_{x} } \right) - BC\left( {dc_{x} } \right) $$
(3)

Besides, the network latency is usually measured as either a one-way delay or a round-trip delay. Round-trip delay is quoted by network managers more often because it can be measured from a single point. The ping value has been widely used to measure the round-trip delay. It depends on a variety of factors, including the data transmission speed, the nature of the transmission medium, the physical distance between the two locations, the size of the transferred data, and the number of other data transmission requests being handled concurrently. To simplify the problem, the network latency of a data center \( dc_{x} \), \( NL\left( {dc_{x} } \right) \), is modelled as a constant value.

The error rate of a data center \( dc_{x} \), \( ER\left( {dc_{x} } \right) \), refers to the ratio of the total number of transmitted data units in error to the total number of transmitted data units, which can be represented as in (4).

$$ ER\left( {dc_{x} } \right) = \frac{\text{Total number of transmitted data units in error}}{\text{Total number of transmitted data units}} $$
(4)
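For concreteness, the following sketch evaluates the three metrics of (2)–(4) for one data center; the data structures and the units (Gb for dataset sizes, seconds for completion times) are illustrative assumptions.

```python
def bandwidth_consumption(jobs):
    """BC(dc_x) per Eq. (2): sum of Size(j)/TCT(j) over the jobs currently
    accessing the data center; `jobs` is a list of
    (size, total_completion_time) pairs."""
    return sum(size / tct for size, tct in jobs)

def available_bandwidth(max_bandwidth, jobs):
    """AB(dc_x) per Eq. (3): remaining capacity of the data center."""
    return max_bandwidth - bandwidth_consumption(jobs)

def error_rate(units_in_error, units_total):
    """ER(dc_x) per Eq. (4)."""
    return units_in_error / units_total

# Example: two jobs of 40 Gb/20 s and 100 Gb/50 s consume 4 Gbps in total,
# leaving 96 Gbps of a 100 Gbps link available.
print(available_bandwidth(100, [(40, 20), (100, 50)]))  # 96.0
```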

3.4 Job Rescue Utility

The utility function is often used to compare objects with multiple requirements and attributes. In this research, the utility value is used to prioritize jobs when they need to be redirected or migrated during fault handling in order to minimize the negative impact. Generally speaking, a data center prefers to rescue as many jobs as possible within their deadline requirements, so the fault handling of jobs should assign priority based on their job urgency. At the same time, a data center always tries to maximize its profit, so jobs that bring more profit to the data center should have higher priority. Therefore, we propose a job utility based on both job urgency and job operation profit to prioritize the jobs.

For the job \( j_{y}^{x} \) at the faulty data center \( dc_{x} \), the general expression of the job utility function \( U\left( {j_{y}^{x} } \right) \) is shown in (5) and should satisfy the condition in (6), where \( U_{UR} \left( {j_{y}^{x} } \right) \) and \( U_{PRO} \left( {j_{y}^{x} } \right) \) denote the utility values of the job urgency and the job operation profit for the job \( j_{y}^{x} \), respectively, and \( W_{UR} \) and \( W_{PRO} \) denote the corresponding weights of the job urgency and the job operation profit.

$$ U\left( {j_{y}^{x} } \right) = W_{UR} *U_{UR} \left( {j_{y}^{x} } \right) + W_{PRO} *U_{PRO} \left( {j_{y}^{x} } \right),\,j_{y}^{x} \in J^{x} $$
(5)
$$ W_{UR} + W_{PRO} = 1 $$
(6)

The utility value of the job urgency for a specific job \( j_{y}^{x} \) at a faulty location \( dc_{x} \) is calculated as follows in (7).

$$ U_{UR} \left( {j_{y}^{x} } \right) = \frac{{max\left( {UR\left( {j^{x} } \right)} \right) - UR\left( {j_{y}^{x} } \right)}}{{max\left( {UR\left( {j^{x} } \right)} \right) - min\left( {UR\left( {j^{x} } \right)} \right)}};\,j_{y}^{x} ,\,j^{x} \in J^{x} $$
(7)

The utility value of the job operation profit for a specific job \( j_{y}^{x} \) at a faulty location \( dc_{x} \) is calculated as follows in (8).

$$ U_{PRO} \left( {j_{y}^{x} } \right) = \frac{{PRO\left( {j_{y}^{x} } \right) - min\left( {PRO\left( {j^{x} } \right)} \right)}}{{max\left( {PRO\left( {j^{x} } \right)} \right) - min\left( {PRO\left( {j^{x} } \right)} \right)}};\,j_{y}^{x} ,\,j^{x} \in J^{x} $$
(8)

Then the job rescue utility of a faulty data center \( dc_{x} \), \( U_{R} \left( {dc_{x} } \right) \), can be calculated using the equation in (9), where \( \vartheta \) is an indicator of the job rescue outcome: \( \vartheta \) is 1 if the job is rescued from the faulty data center, and 0 otherwise.

$$ U_{R} \left( {dc_{x} } \right) = \mathop \sum \limits_{{j_{y}^{x} \in J^{x} }} \vartheta *U\left( {j_{y}^{x} } \right),\,j_{y}^{x} \in J^{x} $$
(9)
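To make the prioritization concrete, the sketch below computes the job utility of (5)–(8) and the job rescue utility of (9); the dictionary layout, the field names and the handling of the degenerate case where all jobs share the same urgency or profit are illustrative assumptions.

```python
def job_utility(job, jobs, w_ur=0.5, w_pro=0.5):
    """U(j) per Eqs. (5)-(8). `job` and each element of `jobs` are dicts
    with keys 'ur' (job urgency) and 'pro' (job operation profit);
    w_ur + w_pro = 1 per Eq. (6)."""
    urs = [j["ur"] for j in jobs]
    pros = [j["pro"] for j in jobs]
    # Eq. (7): more urgent jobs (smaller time buffer) get higher utility.
    u_ur = 0.0 if max(urs) == min(urs) else \
        (max(urs) - job["ur"]) / (max(urs) - min(urs))
    # Eq. (8): more profitable jobs get higher utility.
    u_pro = 0.0 if max(pros) == min(pros) else \
        (job["pro"] - min(pros)) / (max(pros) - min(pros))
    return w_ur * u_ur + w_pro * u_pro          # Eq. (5)

def job_rescue_utility(jobs, rescued_ids, w_ur=0.5, w_pro=0.5):
    """U_R(dc_x) per Eq. (9): total utility of the jobs that were actually
    rescued from the faulty data center (identified by their 'id')."""
    return sum(job_utility(j, jobs, w_ur, w_pro)
               for j in jobs if j["id"] in rescued_ids)
```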

4 The Replica Selection Schema

Our replica selection schema evaluates the overall cloud performance by applying three network evaluation metrics to select the best replica site to access. Three weight parameters are used to configure the evaluation metrics and generate different replica selection decisions. \( W_{AB}^{x} \) denotes the weight of the available bandwidth metric of \( dc_{x} \), \( W_{NL}^{x} \) denotes the weight of the network latency metric of \( dc_{x} \), and \( W_{ER}^{x} \) denotes the weight of the error rate metric of \( dc_{x} \). The final weight of a specific data center, \( FW\left( {dc_{x} } \right) \), is expressed in (10), where \( NC_{AB}^{x} \), \( NC_{NL}^{x} \) and \( NC_{ER}^{x} \) denote the normalization components of the available bandwidth, network latency and error rate metrics of \( dc_{x} \), respectively. For a request to access a dataset that has replicas at multiple sites, the data center with the maximum \( FW\left( {dc_{x} } \right) \) value is selected as the optimal access route for the request.

$$ \left\{ {\begin{array}{*{20}c} {FW\left( {dc_{x} } \right) = W_{AB}^{x} * NC_{AB}^{x} + W_{NL}^{x} * NC_{NL}^{x} + W_{ER}^{x} * NC_{ER}^{x} , dc_{x} \in DC} \\ {W_{AB}^{x} + W_{NL}^{x} + W_{ER}^{x} = 1} \\ \end{array} } \right. $$
(10)

Different evaluation metrics should be treated differently depending on their nature: the highest available bandwidth is the best case, whereas the highest network latency or error rate is the worst case. Hence, the normalization of the three evaluation metrics is formulated in (11), (12) and (13), respectively. If \( FW\left( {dc_{x} } \right) \) is the same among two or more locations, the location with the least network latency is selected as the optimal route. Furthermore, if \( NL\left( {dc_{x} } \right) \) is also the same among two or more locations, the location with the lower error rate is recognized as the optimal route.

$$ NC_{AB}^{x} = \frac{{AB\left( {dc_{x} } \right) - min\left\{ {AB\left( {dc} \right)} \right\}}}{{max\left\{ {AB\left( {dc} \right)} \right\} - min\left\{ {AB\left( {dc} \right)} \right\}}};\,dc_{x} , dc \in DC $$
(11)
$$ NC_{NL}^{x} = \frac{{\hbox{max} \left\{ {NL\left( {dc} \right)} \right\} - NL\left( {dc_{x} } \right)}}{{\hbox{max} \left\{ {NL\left( {dc} \right)} \right\} - min\left\{ {NL\left( {dc} \right)} \right\}}};\,dc_{x} , dc \in DC $$
(12)
$$ NC_{ER}^{x} = \frac{{\hbox{max} \left\{ {ER\left( {dc} \right)} \right\} - ER\left( {dc_{x} } \right)}}{{\hbox{max} \left\{ {ER\left( {dc} \right)} \right\} - min\left\{ {ER\left( {dc} \right)} \right\}}};\, dc_{x} , dc \in DC $$
(13)
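A compact sketch of this replica selection schema, including the latency and error rate tie-breaking rules, is given below; the data center representation and the equal-weight defaults are illustrative assumptions.

```python
def final_weight(dc, dcs, w_ab=1/3, w_nl=1/3, w_er=1/3):
    """FW(dc_x) per Eqs. (10)-(13). Each data center is a dict with keys
    'ab' (available bandwidth), 'nl' (network latency) and 'er' (error
    rate); the weights sum to 1 per Eq. (10)."""
    def norm(value, values, higher_is_better):
        lo, hi = min(values), max(values)
        if hi == lo:          # degenerate case, not specified in the paper
            return 1.0
        return (value - lo) / (hi - lo) if higher_is_better else (hi - value) / (hi - lo)
    nc_ab = norm(dc["ab"], [d["ab"] for d in dcs], True)    # Eq. (11)
    nc_nl = norm(dc["nl"], [d["nl"] for d in dcs], False)   # Eq. (12)
    nc_er = norm(dc["er"], [d["er"] for d in dcs], False)   # Eq. (13)
    return w_ab * nc_ab + w_nl * nc_nl + w_er * nc_er       # Eq. (10)

def select_replica_site(candidates):
    """Pick the replica-ready site with the highest FW; ties are broken by
    the lower network latency and then by the lower error rate."""
    return max(candidates,
               key=lambda d: (final_weight(d, candidates), -d["nl"], -d["er"]))
```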

5 UBFH Fault Handling Approach and Algorithms

Put simply, our UBFH fault handling approach tries to migrate jobs out of the faulty data center and redirect them to backup replica sites. The migration not only considers the performance of accessing the backup replicas, but also strives to satisfy the service delivery deadline constraints. To achieve these goals, the algorithm uses two functions, Redirection() and Migration(), to find fault handling solutions under different scenarios for a job at the faulty data center. The algorithm uses utility-based ranking to evaluate the job priority for redirection or migration. The job utility is treated differently depending on the fault circumstances of a data center: the job with the lower utility has the higher migration priority at a backup data center, while the job with the higher utility has the higher priority to be rescued at the faulty data center.

[figure a: the UBFH fault handling algorithm]

The UBFH algorithm includes two major parts: fault handling solution generation and implementation. Firstly, the jobs at the faulty data center are ranked in descending order of their \( U\left( j \right) \) in Line 1 and added to the rank list \( ranklist[] \). Then a fault handling solution (FHS) is worked out by the Redirection() function for each job in \( ranklist[] \) from Line 2 to 7. The input parameter of the Redirection() function is the job at the faulty data center that is to be rescued. The generation of the FHS is based on the RedirectionResult, which includes a set of data center information, the redirection destination \( dc_{red} \) and the migration destination \( dc_{mig} \). Finally, after the FHS is generated, the job moving activities are performed from Line 8 to 12.
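The following sketch mirrors the solution generation part of this algorithm, reusing the job_utility helper sketched in Sect. 3.4; the callable redirection_fn stands in for the Redirection() function (sketched further below) and the returned structure is an illustrative assumption.

```python
def ubfh(faulty_jobs, redirection_fn):
    """Sketch of UBFH fault handling solution generation: rank the jobs at
    the faulty data center by utility in descending order, then work out a
    fault handling solution (FHS) for each job via Redirection().
    `redirection_fn(job)` is assumed to return (dc_red, migration_plan)
    or None when the job cannot be rescued."""
    ranklist = sorted(faulty_jobs,
                      key=lambda j: job_utility(j, faulty_jobs),
                      reverse=True)
    solutions = []
    for job in ranklist:
        fhs = redirection_fn(job)
        if fhs is not None:
            solutions.append((job, fhs))
    # Implementation phase: the caller first migrates the movable job
    # (if any) to free resources, then redirects the rescued job to dc_red.
    return solutions
```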

[figure b: the Redirection() function]

If the job redirection function Redirection() is called, the backup replica-ready data centers are first mapped to the input job in Line 1. A comparison between the bandwidth consumption of the input job and the available bandwidth of the backup replica-ready data centers is made to find the optimal job redirection route from Line 2 to Line 6. If the available bandwidth of all the backup replica-ready data centers is insufficient to receive the redirected job from the faulty data center, the migration function Migration() is initiated in Line 8.
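A minimal sketch of this logic is given below, reusing the select_replica_site helper sketched in Sect. 4; the field names ('bc' for the job's bandwidth consumption, 'ab' for each data center's available bandwidth) and the fallback signature are illustrative assumptions.

```python
def redirection(job, backup_dcs, migration_fn):
    """Sketch of Redirection(): keep the replica-ready backup data centers
    whose available bandwidth can absorb the job's bandwidth consumption
    and pick the best one with the replica selection schema; if none can,
    fall back to Migration()."""
    eligible = [dc for dc in backup_dcs if dc["ab"] >= job["bc"]]
    if eligible:
        dc_red = select_replica_site(eligible)   # Eqs. (10)-(13)
        return dc_red, None                      # no migration needed
    return migration_fn(job, backup_dcs)         # all backup sites lack bandwidth
```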

[figure c: the Migration() function]

In case none of the backup data centers has the capacity to support a job to be rescued, the migration function Migration() migrates an existing job out of a replica-ready backup data center to release resources for that job. Firstly, Line 1 collects the running jobs on the backup data centers and creates a new group of jobs \( j_{mig} \). Then, for each job in this group, its backup data centers are mapped in Line 3. A bandwidth comparison between the bandwidth consumption of the redirected job at the faulty data center and the sum of the bandwidth consumption of \( j_{mig} \) and the available bandwidth of its backup replica-ready data centers is conducted in Line 4, and a new group of migratable jobs \( movable\_job[] \) is created in Line 6 based on the movable job selection rule in Line 5. A reverse Quicksort is applied to rank the jobs in \( movable\_job[] \) in ascending order of job utility in Line 7. A comparison between the bandwidth consumption of the movable job and the available bandwidth of its backup replica-ready data centers is conducted in Line 9 to find the eligible migration destinations for the movable job in \( movable\_job[] \). Then, based on our replica selection schema, the optimal job redirection route for rescuing the job at the faulty data center and the optimal job migration route for the movable job at the backup data centers are finalized from Line 11 to Line 15.
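The sketch below follows the same steps under illustrative assumptions: running_jobs maps each backup data center id to the jobs currently running on it, each running job carries 'bc', 'ur', 'pro' and its own list of replica-ready backup data centers, and job_utility and select_replica_site are the helpers sketched earlier.

```python
def migration(rescue_job, backup_dcs, running_jobs):
    """Sketch of Migration(): find a low-utility running job at a backup
    site whose departure frees enough bandwidth for `rescue_job`, pick a
    destination for it among its own replica-ready backup sites, and
    return the redirection target plus the migration plan."""
    movable = []
    for dc in backup_dcs:
        for j_mig in running_jobs.get(dc["id"], []):
            # Movable job selection rule: releasing j_mig must leave room
            # for the rescued job at this backup data center.
            if rescue_job["bc"] <= dc["ab"] + j_mig["bc"]:
                movable.append((j_mig, dc))
    # Jobs with lower utility have higher migration priority.
    peers = [j for j, _ in movable]
    movable.sort(key=lambda pair: job_utility(pair[0], peers))
    for j_mig, dc_red in movable:
        # The movable job must itself fit into one of its backup replica sites.
        targets = [d for d in j_mig["backup_dcs"] if d["ab"] >= j_mig["bc"]]
        if targets:
            dc_mig = select_replica_site(targets)   # optimal migration route
            return dc_red, (j_mig, dc_mig)          # FHS: dc_red plus migration plan
    return None                                     # the job cannot be rescued
```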

6 Simulation Results

To evaluate the fault handling effectiveness and efficiency, we performed a series of simulations on OMNeT++ 5.4.1. OMNeT++ is an extensible, modular, component-based C++ simulation library and framework, primarily for building network and cloud communication simulators [17, 18]. To reduce the simulation uncertainty and present a clear result, we make the following assumptions in our simulations:

  • The data centers in the cloud environment have the same speed and resources.

  • The storage resource is large enough.

  • The routes between the data centers have no overlap.

  • The transfer latency keeps stable between each pair of data centers.

  • The job consumes the bandwidth resource at a constant rate when executing.

A cloud environment of 5 data centers was implemented, with 250 circuits of 100 Gbps optical-fiber network integrated at each data center site. The maximum bandwidth of \( dc_{1} \) to \( dc_{5} \) is thus set to 25000 Gbps each. The network latency of \( dc_{1} \) to \( dc_{5} \) is set to 20, 60, 40, 60, and 100 respectively, and the error rate to 0.1%, 0.2%, 0.5%, 0.1% and 0.4% respectively. To avoid fluctuations of uncertain internodal latency, input scheduling time and network latency between users and data centers, we set \( T_{IC} \left( j \right) \) and \( T_{IS} \left( j \right) \) to 5 ms and adopt a single user with multiple requested jobs accessing the datasets at different data centers in the simulated environment. A fault is set to occur at 10 ms of system running time (\( T_{Past} \left( j \right) \) = 10 ms) in \( dc_{2} \), which leads to the closing down of \( dc_{2} \). The job deadline \( T_{Dead} \left( j \right) \) and the total completion time \( TCT\left( j \right) \) are randomly set in the range of 0 ms to 1000 ms. The size of a job is randomly selected in the range of 0 GB to 5 GB, similar to that of many current data-intensive workflow jobs, such as Epigenomics. Each dataset has 3 replicas that are randomly placed in the 5 data centers. To simplify the problem, the weights \( W_{AB}^{x} \), \( W_{NL}^{x} \) and \( W_{ER}^{x} \) are all set to 1/3.
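For reference, the setup above can be captured in a small configuration sketch; the millisecond unit assumed for the latency values and the dictionary layout are illustrative assumptions.

```python
import random

# Data center parameters used in the simulations (bandwidth in Gbps,
# latency assumed in ms, error rate as a fraction).
data_centers = [
    {"id": "dc1", "max_bw": 25000, "nl": 20,  "er": 0.001},
    {"id": "dc2", "max_bw": 25000, "nl": 60,  "er": 0.002},
    {"id": "dc3", "max_bw": 25000, "nl": 40,  "er": 0.005},
    {"id": "dc4", "max_bw": 25000, "nl": 60,  "er": 0.001},
    {"id": "dc5", "max_bw": 25000, "nl": 100, "er": 0.004},
]
T_IC = T_IS = 5                      # communication / scheduling delay (ms)
FAULT_TIME, FAULTY_DC = 10, "dc2"    # dc2 closes down at 10 ms
W_AB = W_NL = W_ER = 1 / 3           # replica selection weights

def random_job(job_id):
    """One simulated job with the randomized attributes described above."""
    return {
        "id": job_id,
        "t_dead": random.uniform(0, 1000),   # service delivery deadline (ms)
        "tct": random.uniform(0, 1000),      # total completion time (ms)
        "size": random.uniform(0, 5),        # requested dataset size (GB)
        "replicas": random.sample([d["id"] for d in data_centers], 3),
    }
```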

We compared our fault handling approach with the typical HDFS robustness approach applied in the HDFS system, the RR approach [19] applied in SQL Server 2016, and the JSQ approach applied in the Cisco Local Director, IBM Network Dispatcher and Microsoft SharePoint [20,21,22]. All four approaches were implemented under a single-fault scenario, and their job rescue performance was evaluated in terms of the repairability, job rescue utility and job operation profit defined above. The utility weights were varied across the three simulations, covering an equivalent utility weight scenario, an urgency-heavy scenario and a profit-heavy scenario, to test the effectiveness of our approach.

6.1 Simulation 1 – Equivalent Utility Weights

In Simulation 1, the weights of the job rescue utility follow the equivalent utility scenario: both \( W_{UR} \) and \( W_{PRO} \) are set to 0.5. The simulation results are shown in Fig. 1.

Fig. 1. The repairability, job rescue utility and job operation profit of Simulation 1.

Figure 1 shows that our UBFH approach outperforms all three other approaches in both job rescue utility and job operation profit, whether the environment has sufficient or insufficient resources. For example, when the environment has sufficient resources at 340 jobs, our UBFH approach achieves up to 11.89% higher job rescue utility and up to 9.46% more job operation profit than the HDFS and RR approaches. As the number of jobs increases and the environment has insufficient resources at 380 jobs, our UBFH approach still achieves up to 11.04% higher job rescue utility and up to 5.09% more job operation profit than the HDFS and RR approaches.

From the repairability perspective, our UBFH approach also achieves better repairability than all three other approaches when the resources are sufficient to support the job executions. When the resources become limited, our UBFH approach aims to migrate lower-utility jobs at the backup data centers in order to release resources for higher-utility jobs from the faulty data center. In doing so, saving the higher-utility jobs from the faulty data center may increase the resource pressure on other data centers, so some lower-utility jobs might be sacrificed. This leads to a repairability decrease when the resources become insufficient.

6.2 Simulation 2 – Urgency-Heavy Utility

In Simulation 2, we increase the weight of the job urgency to 0.67 and decrease the weight of the job operation profit to 0.33, i.e., the weights of the job rescue utility follow the urgency-heavy scenario with \( W_{UR} \) = 0.67 and \( W_{PRO} \) = 0.33. The simulation results are shown in Fig. 2.

Fig. 2. The repairability, job rescue utility and job operation profit of Simulation 2.

In Fig. 2, our UBFH approach shows a similar trend to Simulation 1. For example, when the environment has sufficient resources at 340 jobs, our UBFH approach achieves up to 11.49% higher job rescue utility and up to 9.46% more job operation profit than the HDFS and JSQ approaches under the urgency-heavy utility weights. As the number of jobs increases and the environment has insufficient resources at 360 jobs, our UBFH approach still achieves up to 8.29% higher job rescue utility and up to 4.29% more job operation profit than the HDFS and JSQ approaches.

The repairability is again divided into sufficient and insufficient resource scenarios. Our UBFH approach maintains higher repairability than all three other approaches when the resources are sufficient. When the resources become limited, our UBFH operations still show a certain degree of repairability decrease, for the same reason as in Simulation 1.

6.3 Simulation 3 – Profit-Heavy Utility Weights

In Simulation 3, we decrease the weight of the job urgency to 0.33 and increase the weight of the job operation profit to 0.67, i.e., the weights of the job rescue utility follow the profit-heavy scenario with \( W_{UR} \) = 0.33 and \( W_{PRO} \) = 0.67. The simulation results are shown in Fig. 3.

Fig. 3. The repairability, job rescue utility and job operation profit of Simulation 3.

Figure 3 shows that our UBFH approach still maintains higher job rescue utility and job operation profit than all three other approaches when the utility weights are profit-heavy, in both the sufficient and the insufficient resource scenarios. We still achieve up to a 9.36% job rescue utility increase and up to an 8.84% job operation profit increase over the HDFS and JSQ approaches when the number of jobs reaches 340 in the resource-sufficient scenario. The job rescue utility and job operation profit also remain higher than those of all three other approaches in the resource-insufficient scenario, for example, 3.14% higher job rescue utility and 3.26% more job operation profit than the HDFS and JSQ approaches.

The repairability still decreases when the resources become limited. Nevertheless, it remains higher than that of all three other approaches, with a maximum 7.88% repairability increase at 340 jobs under the sufficient resource scenario.

7 Conclusions and Future Works

Data replication is a common way to achieve proactive fault tolerance by creating multiple data copies at geographically distributed locations. However, a variety of faults can still occur in cloud environments, and most common replication approaches give insufficient consideration to common network performance measurements and job attributes. In this paper, we propose a utility-based fault handling (UBFH) approach to rescue the jobs at a faulty data center, with network performance measurements and job attributes in the cloud environment as the major considerations. A fault handling algorithm is developed to determine the direction of job redirection. The simulation results show that our UBFH approach achieves better repairability, job rescue utility and job operation profit than the HDFS, RR and JSQ approaches. In the future, we will consider the uncertain influence of the error rate at each data center site, since we currently assume that no errors occur on the network links. Besides, multi-packaged job migration will be investigated, because we currently only consider single-job migration.