Abstract
When will a server fail catastrophically in an industrial datacenter? Is it possible to forecast these failures so preventive actions can be taken to increase the reliability of a datacenter? To answer these questions, we have studied what are probably the largest, publicly available datacenter traces, containing more than 104 million events from 12,500 machines. Among these samples, we observe and categorize three types of machine failures, all of which are catastrophic and may lead to information loss, or even worse, reliability degradation of a datacenter. We further propose a two-stage framework—DC-Prophet (DC-Prophet stands for DataCenter-Prophet.)—based on One-Class Support Vector Machine and Random Forest. DC-Prophet extracts surprising patterns and accurately predicts the next failure of a machine. Experimental results show that DC-Prophet achieves an AUC of 0.93 in predicting the next machine failure, and an F3-score (The ideal value of F3-score is 1, indicating perfect predictions. Also, the intuition behind F3-score is to value “Recall” about three times more than “Precision” [12].) of 0.88 (out of 1). On average, DC-Prophet outperforms other classical machine learning methods by 39.45% in F3-score.
1 Introduction
“When will a server fail catastrophically in an industrial datacenter?” “Is it possible to forecast these failures so preventive actions can be taken to increase the reliability of a datacenter?” These two questions serve as the motivation for this work.
To meet the increasing demands of cloud computing, Internet companies such as Google, Facebook, and Amazon generally deploy a large fleet of servers in their datacenters. These servers bear heavy workloads and process diverse requests [13]. In such a high-availability computing environment, when an unexpected machine failure happens on a clustered partition, its workload is typically transferred to another machine in the same cluster, which increases the possibility of further failures as a chain effect [11]. Also, such an unexpected failure may cause (a) loss of processed data, and (b) resource congestion due to machines suddenly becoming unavailable. In the worst case, these failures may paralyze a datacenter, causing an unplanned outage that requires a very high cost to recover from [1]: on average $9,000/minute, and up to $17,000/minute. To study machine failures in a modern datacenter, we analyze the traces from Google’s datacenter [9, 14]; the traces contain more than 104 million events generated by 12,500 machines during 29 days. We observe that approximately 40% of the machines have been removed (due to potential failures or maintenance) at least once during this period. This phenomenon suggests that potential machine failures happen quite frequently and cannot simply be ignored. Therefore, we want to know: given the trace of a machine, can we accurately predict its next failure, ideally with low computing latency? If the answer is yes, the cloud scheduler (e.g., Borg [17] by Google) can take preventive actions to deal with incoming machine failures, such as migrating tasks from the machine about to fail to other machines. In this way, the cost of a machine failure is reduced to the very minimum: only the cost of task migration.
While predicting the next failure of a machine seems to be a feasible and promising way to improve the reliability of a datacenter, it comes with two major challenges. The first challenge lies in the high accuracy required when making predictions, specifically in reducing false negatives. False negatives (the machine actually fails but is predicted as normal) may incur a significant recovery cost [1] (see Table 1) and should be avoided. However, if the objective is set solely to minimize false negatives, the model can trivially predict that every machine is going to fail (yielding zero false negatives), which introduces costs from false positives (the machine actually works but is predicted as failed). Therefore, one major challenge in designing a model is to trade off between these two costs. The second challenge is that the counts of normal events and failure events are highly imbalanced. Among 104 million events, only 8,957 (less than 1%) are associated with machine failures. In this case, most predictive models will trivially predict every event as normal to achieve a high accuracy (higher than 99%). Consequently, this event-imbalance issue is the second roadblock that needs to be removed.
The contributions of this paper are as follows:
- We analyze probably the largest publicly-available traces from an industrial datacenter, and categorize three types of machine failures: Immediate-Reboot (IR), Slow-Reboot (SR), and Forcible-Decommission (FD). The frequency and duration of each failure type categorized by our method further match experts’ domain knowledge.
- We propose a two-stage framework, DC-Prophet, that accurately predicts the occurrence of the next failure for a machine. DC-Prophet first applies One-Class SVM to filter out most normal cases and resolve the event-imbalance issue, and then deploys Random Forest to predict the type of failure that might occur for a machine. The experimental results show that DC-Prophet accurately predicts machine failures and achieves an AUC of 0.93 and an F3-score of 0.88, both on the test set.
- To understand the effectiveness of DC-Prophet, we also perform a comprehensive study of other widely-used machine learning methods, such as multi-class SVM, Logistic Regression, and Recurrent Neural Network. Experimental results show that, on average, DC-Prophet outperforms these methods by 39.45% in F3-score.
- Finally, we provide a practitioners’ guide for using DC-Prophet to predict the next failure of a machine. The latency of invoking DC-Prophet to make one prediction is less than 9 ms. Therefore, DC-Prophet can be seamlessly integrated into the scheduling strategy of an industrial datacenter to improve reliability.
The remainder of this paper is organized as follows. Section 2 provides the problem definition, and Sect. 3 details the proposed DC-Prophet framework. Section 4 presents the implementation flow and experimental results, and Sect. 5 provides a practitioners’ guide. Finally, Sect. 6 concludes this paper.
2 Problem Definition
2.1 Google Traces Overview
The Google traces [14] consist of the activity logs of 668,000 jobs during 29 days, and each job spawns one or more tasks to be executed in a 12,500-machine cluster. For each machine, the traces record (a) the computing resources consumed by all the tasks running on that machine, and (b) its machine state. Both resource consumption and machine states are recorded with timestamps at one-microsecond (1 \(\upmu \)s) resolution.
We focus on the usage measurements of six types of resources: (a) CPU usage, (b) disk I/O time, (c) disk space usage, (d) memory usage, (e) page cache, and (f) memory accesses per instruction. All these measurements are normalized by their respective maximum values and thus range from 0 to 1. In this work, the average and peak values over 5-min intervals are also calculated for each usage; the 5-min interval is typically used to report the measured resource footprint of a task in Google’s datacenter [14]. Furthermore, resource usages at the minute level provide a more macroscopic view of a machine’s status [8]. We use \(x_{r, t}\) to denote the average usage of resource type r at time interval t; similarly, \(m_{r, t}\) represents the peak usage. Both \(x_{r, t}\) and \(m_{r, t}\) are used to construct the training dataset, with further details provided in Sect. 2.4.
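As a concrete illustration, such per-interval features could be derived as in the following sketch. This is not the authors’ pipeline; the file and column names ("machine_123_usage.csv", "timestamp", "cpu") are assumptions about how a per-machine trace might be laid out.

```python
# Minimal sketch (not the paper's MATLAB pipeline): derive the 5-min average
# x_{r,t} and peak m_{r,t} for one resource of one machine using pandas.
# File/column names are hypothetical stand-ins for the trace layout.
import pandas as pd

usage = pd.read_csv("machine_123_usage.csv", parse_dates=["timestamp"])
usage = usage.set_index("timestamp").sort_index()

win = usage["cpu"].resample("5min")   # 5-min reporting interval, as in [14]
features = pd.DataFrame({
    "x_cpu": win.mean(),              # average usage x_{cpu,t}
    "m_cpu": win.max(),               # peak usage m_{cpu,t}
})
```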
In addition, the Google traces contain three types of events that determine machine states: ADD, REMOVE, and UPDATE [14]. In this work, we treat each REMOVE event as an anomaly that could potentially be a machine failure. Detailed analyses are further provided in Sect. 2.3.
2.2 Problem Formulation
The problem of predicting the next machine failure is formulated as follows:
Problem 1 (Categorize catastrophic failures). Given the traces of machine events, categorize the type of each machine failure at time interval t (denoted as \(y_t\)).
Problem 2 (Forecast catastrophic failures). Given the traces of resource usages, denoted as \(x_{r,t}\) and \(m_{r,t}\), up to time interval \(\tau -1\), forecast the next failure and its type at time interval \(\tau \) (denoted as \(y_{\tau }\)) for each machine. Mathematically, this problem can be expressed as:

\(y_{\tau } = f\big (\{x_{r,t}, m_{r,t}\} \mid r \in \text {resources}, \ t = 1, \ldots , \tau -1\big )\)    (1)

where \(x_{r,t}\) and \(m_{r,t}\) represent the respective average and peak usage of resource r at time interval t.
We use Fig. 1 to better illustrate the concept in Eq. (1), specifically the temporal relationship among \(y_{\tau }\), \(x_{r,t}\) and \(m_{r,t}\) for \(t = 1\) to \(\tau -1\). One goal here is to find a function f that takes \(x_{r,t}\) and \(m_{r,t}\) as inputs to predict \(y_\tau \).
2.3 Machine-Failure Analyses
Throughout the 29-day traces, we find a total of 8,957 potential machine failures from the REMOVE events, and Fig. 2(a) illustrates the rank-frequency plot of these failures. The distribution is power-law-like and heavily skewed: the top-ranked machines failed more than 100 times, whereas the majority of machines (3,397 machines) failed only once. Overall, about 40% of the 12,500 machines have been removed at least once. We further notice that the resource usages of the most frequently-failing machines are all zeros, indicating clearly abnormal behavior. These machines seem to be marked as unavailable internally [2], and hence are apparent anomalies; they are excluded from the subsequent analysis.
(a) The x-axis represents the rank of each machine sorted based on the number of failures (high rank means more failures), whereas the y-axis is the number of failures. Both axes are on a logarithmic scale. The distribution is power-law-like: three machines failed more than 100 times, whereas 3,397 machines failed only once. (b) Each dot represents the count of failures at a specific duration. The x-axis is duration and the y-axis represents the count. Both axes are on a logarithmic scale. Notice the three peaks highlighted by the red circles: \(\approx \)16 min, \(\approx \)2 h, and never back. (Color figure online)
Observation 1
The most frequently-failing machines have failed more than 100 times over the 29 days, with the usages of all resource types being zero.
To categorize the type of a failure, we further analyze its duration, which is calculated as the time difference between the REMOVE event and the following ADD event. Figure 2(b) illustrates the distribution of durations for all machine failures. The failure duration can vary widely, ranging from a few minutes, to a few hours, to never back, where a machine is never added back to the cluster after its REMOVE event. Furthermore, three “peaks” can be observed in the failure durations: \(\approx \)16 min, \(\approx \)2 h, and never back.
Observation 2
Three “peaks” in the histogram of failure durations correspond to \(\approx \)16 min, \(\approx \)2 h, and never back.
This observation raises an intriguing question: why are there three peaks in failure durations? We map these three peaks (\(\approx \)16 min, \(\approx \)2 h, and never back) to three types of machine failures, listed below; a minimal categorization sketch follows the list:
- Immediate-Reboot (IR). This type of failure may occur with occasional machine errors, and these machines can recover themselves in a short duration by rebooting. Here, failures with less than 30 min of downtime are categorized as IR failures [3].
- Slow-Reboot (SR). This type of failure requires more than 30 min to recover. According to [3], the causes of slow reboots include file system integrity checks, machine hangs that require semi-automatic restart processes, and machine software reinstallation and testing. Also, a machine could be removed from a cluster due to system upgrades (e.g., automated kernel patching) or a network outage [7, 10]. We categorize as SR failures those with more than 30 min of downtime whose machines are eventually added back to the cluster.
- Forcible-Decommission (FD). This type of failure may occur when either a machine (e.g., part of its hardware) is broken and not repaired before the end of the traces, or a machine is taken out of the cluster for some reason, such as regular machine retirement (also called “decommission”) [2, 3]. We categorize failures in which a machine is removed permanently from the cluster as FD failures.
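The duration-based categorization above can be sketched as follows; the sketch assumes that REMOVE/ADD timestamps have already been paired per machine, and the label encoding (0–3) anticipates the one defined in Sect. 2.4.

```python
from datetime import timedelta

IR, SR, FD = 1, 2, 3   # 0 is reserved for normal operation (see Sect. 2.4)

def categorize_failure(remove_time, next_add_time):
    """Map a REMOVE event to a failure type based on its downtime."""
    if next_add_time is None:                 # never added back: Forcible-Decommission
        return FD
    downtime = next_add_time - remove_time
    if downtime < timedelta(minutes=30):      # short downtime: Immediate-Reboot
        return IR
    return SR                                 # otherwise: Slow-Reboot
```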
Among the 8,771 failure events (186 obvious anomalies are removed beforehand, as described in Observation 1), we categorize 5,894 as IR failures, 2,783 as SR failures, and 94 as FD failures. On the other hand, there are 104,644,577 normal operations.
One important goal of this work is to predict the next failure of a machine. If a failure is mispredicted as a normal operation (a false negative), a high cost can be incurred. For example, user jobs can be killed unexpectedly, leading to processed-data loss. If these failures can be predicted accurately in advance, the cloud/cluster scheduler can perform preventive actions, such as rescheduling jobs to another available machine, to mitigate the negative impacts. Compared to the cost incurred by false negatives, i.e., mispredicting a failure as a normal operation, the cost of misclassifying one failure type as another is relatively low. Still, if the right type of failure is correctly predicted, the cloud/cluster scheduler can plan and arrange the computing resources accordingly.
2.4 Construct Training Dataset
We model the prediction of the next machine failure from Eq. (1) as a multi-class classification problem and construct the training dataset accordingly. Each instance in the dataset consists of a label \(y_\tau \) that represents the failure type at time interval \(\tau \), and a set of predictive features \(\varvec{x}\) (also called a feature vector) extracted from the resource usages up to time interval \(\tau -1\).
The type of a label \(y_\tau \) is determined based on the failure duration described in Sect. 2.3. If there is no machine failure at time interval \(\tau \), the label \(y_\tau \) is marked as “normal operation.” Therefore, we define \(y_\tau \in \{0, 1, 2, 3\}\), which represents normal operation, IR, SR, and FD, respectively.
For the predictive features \(\varvec{x}\), we leverage both the average \(x_{r,t}\) and peak values \(m_{r,t}\) of the six resource types mentioned in Sect. 2.1. Now the question is: how many time intervals should be included in the dataset for an accurate prediction? We propose calculating the partial autocorrelation to determine the number of intervals, called “lags” in time series analysis, to include in the predictive features \(\varvec{x}\). Assuming the target interval is \(\tau \), the interval with “one lag” is \(\tau -1\) (the interval with two lags is \(\tau -2\), etc.). Partial autocorrelation is a type of conditional correlation between \(x_{r,\tau }\) and \(x_{r,t}\), with the linear dependency of \(x_{r,t+1}\) through \(x_{r,\tau -1}\) removed [5]. Since the partial autocorrelation can be treated as “the correlation between \(x_{r,\tau }\) and \(x_{r,t}\), with other linear dependencies removed,” it suggests how many time intervals (or lags) should be included in the predictive features.
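The sketch below shows how such a lag analysis could be run with statsmodels; the series here is a synthetic stand-in for one machine’s 5-min average CPU usage, not the actual trace data.

```python
import numpy as np
from statsmodels.tsa.stattools import pacf

rng = np.random.default_rng(0)
cpu_series = rng.random(288 * 29)   # stand-in: 5-min CPU averages over 29 days

# pacf returns the partial autocorrelations and their confidence intervals;
# a lag is deemed statistically significant when its interval excludes zero.
values, confint = pacf(cpu_series, nlags=10, alpha=0.05)
significant = [lag for lag in range(1, 11)
               if not (confint[lag, 0] <= 0.0 <= confint[lag, 1])]
print("statistically significant lags:", significant)
```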
Figure 3(a) illustrates the partial autocorrelation of the CPU usage on one machine, and Fig. 3(b) shows a histogram of the statistically-significant lags over all machines. Both figures show statistically significant correlations. Notice that, in general, after 6 lags (30 min) the resource usages become less relevant.
Observation 3
Resource usages from 30 min ago are less relevant to the current usage in terms of partial autocorrelation.
Based on this observation, we include resource usages within 30 min as features to predict failure type \(y_\tau \). In other words, 6 time intervals (lags) are selected for both \(x_{r,t}\) and \(m_{r,t}\) to construct the predictive features \(\varvec{x}_t\). Specifically, \(\varvec{x}_t = \{x_{r,t}, m_{r,t}\}\), \(r \in \text{ resources }\) and \(t = \tau -j\) where \(j = 1\) to 6. Therefore, \(\varvec{x}\) has 2 (average and peak usages) \(\times \) 6 (number of resources) \(\times \) 6 (intervals) \(=\) 72 predictive features.
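A sketch of assembling this 72-dimensional vector is shown below, assuming `avg` and `peak` are per-machine dictionaries mapping each resource name to its sequence of 5-min values (the names are illustrative, not from the trace schema).

```python
RESOURCES = ["cpu", "disk_io_time", "disk_space", "memory",
             "page_cache", "mem_access_per_instr"]
N_LAGS = 6   # 30 min of history, per Observation 3

def build_features(avg, peak, tau):
    """Build x for target interval tau from the previous 6 intervals."""
    x = []
    for r in RESOURCES:
        for lag in range(1, N_LAGS + 1):
            x.append(avg[r][tau - lag])    # x_{r, tau-lag}
            x.append(peak[r][tau - lag])   # m_{r, tau-lag}
    return x                               # 2 * 6 * 6 = 72 features
```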
Now we have constructed the training dataset, and are ready to proceed to the proposed framework. For conciseness, in the rest of this paper each instance will be presented as \((y, \varvec{x})\) instead of \((y_{\tau }, \varvec{x}_t)\) with \(t = \tau -1, ..., \tau -6\).
(a) Lags of 1, 2, 3, 5 and 6 correlate with lag 0, i.e., \(x_{cpu, \tau }\), and these correlations are statistically significant. (b) For each machine, partial autocorrelations with up to 10 lags are calculated; only statistically-significant lags are reported for plotting this histogram. Notice in general, after 6 lags (or time intervals) the resource usages are less relevant—only few machines report partial autocorrelations with 6+ lags that are statistically significant.
Flow chart of DC-Prophet: two-stage framework. At the first stage, a sample \(\varvec{x}\) is sent to One-Class SVM \(g(\cdot )\) for anomaly detection (i.e., potential machine failure or normal operation). If \(\varvec{x}\) is classified as a potential machine failure, then this sample will be further sent to Random Forest \(h(\cdot )\) for multi-class (IR, SR, FD, or normal) classification.
3 Methodology
3.1 Overview: Two-Stage Framework
We first illustrate the proposed two-stage framework with Fig. 4. In the first stage, One-Class Support Vector Machine (OCSVM) is deployed for anomaly detection. All the detected anomalies are then sent to Random Forest for multi-class classification. Mathematically, DC-Prophet can be expressed as a two-stage framework:

\(\hat{y} = \begin{cases} h(\varvec{x}), & \text {if } g(\varvec{x}) = 1 \text { (potential failure)} \\ 0, & \text {if } g(\varvec{x}) = 0 \text { (normal)} \end{cases}\)    (2)

where \(g(\cdot ) \in \{0,1\}\) is OCSVM and \(h(\cdot ) \in \{0, 1, 2, 3\}\) is Random Forest. For an incoming instance \(\varvec{x}\), it is first sent to \(g(\cdot )\) for anomaly detection. If \(\varvec{x}\) is detected as an anomaly, i.e., a potential machine failure, it is further sent to \(h(\cdot )\) for multi-class classification.
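The control flow of Eq. (2) can be sketched as follows, assuming a trained scikit-learn OneClassSVM `g` and a trained Random Forest `h`; the paper’s implementation is in MATLAB, so this is only an illustration of the two-stage inference.

```python
import numpy as np

NORMAL = 0

def dc_prophet_predict(g, h, x):
    """Two-stage prediction: OCSVM filter, then Random Forest classification."""
    x = np.asarray(x).reshape(1, -1)
    # scikit-learn's OneClassSVM.predict returns +1 for inliers, -1 for outliers.
    if g.predict(x)[0] == 1:
        return NORMAL                  # filtered out as a normal operation
    return int(h.predict(x)[0])        # 0 (normal), 1 (IR), 2 (SR), or 3 (FD)
```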
In the Google traces, the distribution of the four label types is extremely imbalanced: 104 million normal cases versus 8,771 failures that are treated as anomalies (including all three types of failures). Therefore, OCSVM is applied to filter out most of the normal operations and detect anomalies, i.e., potential machine failures. Without doing so, classifiers would be swamped by normal operations, learn only the “normal behaviors,” and choose to ignore all the failures. This would cause significant false negatives, as mentioned in Table 1, since most machine failures would be mispredicted as normal operations.
3.2 One-Class SVM
One-Class SVM (OCSVM) is often applied for novelty (or outlier) detection [4] and is deployed as \(g(\cdot )\) in DC-Prophet. OCSVM is trained on instances that have only one class, which is the “normal” class; given a set of normal instances, OCSVM detects the soft boundary of the set, for classifying whether a new incoming instance belongs to that set (i.e., “normal”) or not. Specifically, OCSVM computes a non-linear decision boundary, using appropriate kernel functions; in this work, the radial basis function (RBF) kernel is used [15]. Equation (3) below shows how OCSVM makes an inference:

\(g(\varvec{x}) = \mathrm {sgn}\big (\varvec{w}^{\top } \phi (\varvec{x}) - \rho \big )\)    (3)
where \(\varvec{w}\) and \(\rho \) are learnable weights that determine the decision boundary, and the function \(\phi (\cdot )\) maps the original features into a higher-dimensional space in which the optimal decision boundary is determined. By further modifying the hard-margin SVM to tolerate some misclassifications, we have:

\(\min _{\varvec{w},\, \rho ,\, \varvec{\xi }} \ \tfrac{1}{2}\Vert \varvec{w}\Vert ^2 - \rho + C \sum _{i} \xi _i \quad \text {subject to} \quad \varvec{w}^{\top } \phi (\varvec{x}_i) \ge \rho - \xi _i, \ \ \xi _i \ge 0\)

where \(\xi _i\) represents the classification error of the i-th sample, and C represents the weight that trades off between the maximum margin and the error tolerance.
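The first stage could be trained as in the sketch below; note that scikit-learn’s OneClassSVM is parameterized by nu rather than the C above, and the data and hyperparameter values here are synthetic placeholders rather than the tuned ones from the paper.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.random((10_000, 72))   # stand-in for "normal" feature vectors

ocsvm = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale")
ocsvm.fit(X_normal)                   # fit on normal instances only

# +1 = classified as normal, -1 = anomaly (potential machine failure)
print(ocsvm.predict(X_normal[:5]))
```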
3.3 Random Forest
In the second stage of DC-Prophet, Random Forest [6] is used for multi-class classification. Random Forest is a type of ensemble model that leverages the classification outcomes from several (say B) decision trees for making the final classification. In other words, Random Forest is an ensemble of B trees {\(T_1(\varvec{x})\), ..., \(T_B(\varvec{x})\)}, where \(\varvec{x}\) is the vector of predictive features described in Sect. 2.4. This ensemble of B trees predicts B outcomes {\(\hat{y}_1\) = \(T_1(\varvec{x})\), ..., \(\hat{y}_B\) = \(T_B(\varvec{x})\)}. Then the outcomes of all trees are aggregated for majority voting, and the final prediction \(\hat{y}\) is made based on the highest (i.e., most popular) vote. Empirically, Random Forest is robust to overfitting and achieves a very high accuracy.
Given a dataset of n instances {(\(\varvec{x}_1\), \(y_1\)), ..., (\(\varvec{x}_n\), \(y_n\))}, the training procedure of Random Forest is as follows:
1. Randomly sample n instances (with replacement) from the training data {(\(\varvec{x}_1\), \(y_1\)), ..., (\(\varvec{x}_n\), \(y_n\))} to form a bootstrap batch.
2. Grow a decision tree from the bootstrap batch using the Decision Tree Construction Algorithm [4].
3. Repeat the above two steps until the whole ensemble of B trees {\(T_1(\varvec{x})\), ..., \(T_B(\varvec{x})\)} is grown.
After the Random Forest is grown, together with the OCSVM from the first stage, DC-Prophet is ready to predict the type of a machine failure.
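In scikit-learn, the bootstrap sampling and tree growing described above are handled internally by RandomForestClassifier; the sketch below uses synthetic stand-in data and an illustrative number of trees, not the paper’s configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_anomaly = rng.random((2_000, 72))           # instances flagged by the OCSVM stage
y_anomaly = rng.integers(0, 4, size=2_000)    # stand-in labels: normal/IR/SR/FD

forest = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
forest.fit(X_anomaly, y_anomaly)              # each tree grown on a bootstrap batch
print(forest.predict(X_anomaly[:5]))          # majority vote over the B trees
```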
4 Experimental Results
4.1 Experimental Setup
To fairly compare the proposed DC-Prophet with other machine learning models, we search for the best hyperparameters using 5-fold cross-validation for all methods. Then the accuracy of each method is evaluated on the test set. All the experiments are conducted in MATLAB, running on an Intel i5 processor (3.20 GHz) with 16 GB of RAM.
For the evaluation metrics, we report \(Precision\), \(Recall\), F-score, and AUC (area under the ROC curve) to provide a comprehensive performance evaluation of the different models. The F-score is defined as:

\(F_{\beta } = (1 + \beta ^2) \cdot \dfrac{Precision \cdot Recall}{\beta ^2 \cdot Precision + Recall}\)

where \(\beta \) is the parameter representing the relative importance between \(Recall\) and \(Precision\) [16]. In this work, \(\beta \) is selected to be 3, which means \(Recall\) is considered approximately three times more important than \(Precision\). Since false negatives (machine failures mispredicted as normal events) are much more costly, as mentioned in Table 1, the F3-score is used as the main criterion to select the best framework for predicting failure types.
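For reference, the F3-score can be computed with scikit-learn’s fbeta_score; the toy labels below and the macro averaging over classes are illustrative assumptions, not the paper’s exact evaluation script.

```python
from sklearn.metrics import fbeta_score

y_true = [0, 1, 2, 0, 3, 1, 0, 2]   # toy labels: 0=normal, 1=IR, 2=SR, 3=FD
y_pred = [0, 1, 2, 0, 2, 1, 0, 0]
f3 = fbeta_score(y_true, y_pred, beta=3, average="macro")
print(f"F3-score: {f3:.3f}")
```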
4.2 Results Summary
Table 2 shows the experimental results of the different methods. We calculate and report \(Precision\), \(Recall\), F3-score, and AUC for a comprehensive comparison. The results demonstrate that the two-stage algorithms perform better on both F3-score and AUC. They also show that using One-Class SVM for anomaly detection as the first stage is necessary: among 8,771 failures, One-Class SVM mispredicts only 11 failures as normal events, which makes it an excellent filter. Furthermore, our proposed framework, DC-Prophet, which combines One-Class SVM and Random Forest, has the best F3-score and AUC among all the two-stage methods.
However, all the algorithms seem to have very limited capability to recognize FD failures. One reason could be that several FD failures are found to share similar patterns with the other two failure types, IR and SR; moreover, out of 18 FD failures in the test set, 4 are predicted as SR failures. We suspect that for these cases the machines would eventually be added back and therefore should be categorized as SR instead of FD failures; however, the corresponding ADD events would occur after the end of the traces.
We also notice that by simply applying the Random Forest algorithm, we can already achieve strong \(Precision\). However, our proposed DC-Prophet still outperforms Random Forest in failure prediction, especially for IR failures.
To evaluate the capability of DC-Prophet for serving in industrial datacenters, we measure the amortized runtime of a single prediction. Table 2 shows that DC-Prophet requires only 8.7 ms to make one prediction, which is almost negligible for most datacenter services. This short latency allows the cloud scheduler to take preventive actions to deal with possible incoming machine failures. Furthermore, DC-Prophet is memory-efficient: only 72 features are stored for making a prediction.
4.3 Feature Analysis
Among all the predictive features, we observe that several features are more discriminative than others. Figure 5 shows how many times a feature in \(\varvec{x}\) is selected to be split on in Random Forest. Figure 5(a) shows the number of times average-value features \(x_{r,t}\) are selected in Random Forest, while Fig. 5(b) illustrates the number of times peak-value features \(m_{r,t}\) are selected. For the average-value features, we observe a trend that more recent features are more discriminative. In addition, the features related to memory usage are more discriminative than the others.
We also discover that the peak-value features are, in general, more discriminative than the average-value ones. Furthermore, the peak-value features have similar predictive capability over the six time intervals, as shown in Fig. 5(b). In addition, we observe that the peak usage of the local disk is an important feature for predicting machine failures (see the red circles in Fig. 5(b)).
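Split counts of the kind shown in Fig. 5 can be approximated with scikit-learn by walking each fitted tree, as in the sketch below (reusing the `forest` from the Random Forest sketch in Sect. 3.3); this is an approximation of the analysis, not the authors’ exact procedure.

```python
from collections import Counter

split_counts = Counter()
for tree in forest.estimators_:      # trees of the fitted RandomForestClassifier
    feats = tree.tree_.feature       # feature index per node; leaves are marked -2
    split_counts.update(int(f) for f in feats if f >= 0)

print("most frequently split features:", split_counts.most_common(5))
```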
5 Practitioners’ Guide
Here we provide a practitioners’ guide to applying DC-Prophet for forecasting machine failures in a datacenter:
- Construct Training Dataset: Given the traces of machines in a datacenter, extract abnormal events representing potential machine failures, and determine their types based on the observations in Sect. 2.3 to obtain the label y. Then calculate the partial autocorrelation for each resource measurement (e.g., CPU usage, disk I/O time, etc.) to determine the number of time intervals (or lags) to include in the predictive features \(\varvec{x}\).
- One-Class SVM: After constructing the dataset of (y, \(\varvec{x}\)), train OCSVM with only the instances labeled as “normal,” and find the best hyperparameters via grid search and cross-validation (see the sketch at the end of this section).
- Random Forest: After OCSVM is trained, remove the instances detected as normal from the training dataset. Use the rest of the dataset (treated as anomalies) to train Random Forest. Choose the number of trees in the ensemble and optimize it by cross-validation.
After both components of DC-Prophet are trained, each new incoming instance follows the flow in Fig. 4 for failure prediction. Thanks to DC-Prophet’s low latency (8.71 ms per invocation), it can be used both (a) for offline analysis in other similar datacenters, and (b) as a failure predictor integrated into a cloud/cluster scheduler, with training performed offline on historical data.
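The hyperparameter search in the guide could, for example, be carried out with scikit-learn’s GridSearchCV using the F3-score as the selection criterion; the parameter grid and the synthetic data below are placeholders, not the values used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X, y = rng.random((2_000, 72)), rng.integers(0, 4, size=2_000)  # stand-in data

f3_scorer = make_scorer(fbeta_score, beta=3, average="macro")
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [None, 10, 20]},
    scoring=f3_scorer,
    cv=5,                        # 5-fold cross-validation, as in Sect. 4.1
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```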
6 Conclusion
In this paper, we propose DC-Prophet, a two-stage framework for forecasting machine failures. With DC-Prophet, we can now answer the two motivational questions: “When will a server fail catastrophically in an industrial datacenter?” and “Is it possible to forecast these failures so preventive actions can be taken to increase the reliability of a datacenter?” Experimental results show that DC-Prophet accurately predicts machine failures and achieves an AUC of 0.93 and an F3-score of 0.88. Finally, a practitioners’ guide is provided for deploying DC-Prophet to predict the next failure of a machine. The latency of invoking DC-Prophet to make one prediction is less than 9 ms, and therefore it can be seamlessly integrated into the scheduling strategy of industrial datacenters to improve reliability.
References
2016 cost of data center outages report. https://goo.gl/OeNM4U
Google cluster data - discussions (2011). https://groups.google.com/forum/#!forum/googleclusterdata-discuss
Barroso, L.A., Clidaras, J., Hölzle, U.: The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synth. Lect. Comput. Archit. 8(3), 1–154 (2013)
Bishop, C.: Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, New York (2006)
Box, G.E., Jenkins, G.M., Reinsel, G.C., Ljung, G.M.: Time Series Analysis: Forecasting and Control. Wiley, New York (2015)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Chen, X., Lu, C.-D., Pattabiraman, K.: Failure analysis of jobs in compute clouds: a Google cluster case study. In: 2014 IEEE 25th International Symposium on Software Reliability Engineering, pp. 167–177. IEEE (2014)
Guan, Q., Fu, S.: Adaptive anomaly identification by exploring metric subspace in cloud computing infrastructures. In: 2013 IEEE 32nd International Symposium on Reliable Distributed Systems (SRDS), pp. 205–214. IEEE (2013)
Juan, D.-C., Li, L., Peng, H.-K., Marculescu, D., Faloutsos, C.: Beyond poisson: modeling inter-arrival time of requests in a datacenter. In: Tseng, V.S., Ho, T.B., Zhou, Z.-H., Chen, A.L.P., Kao, H.-Y. (eds.) PAKDD 2014. LNCS (LNAI), vol. 8444, pp. 198–209. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-06605-9_17
Liu, Z., Cho, S.: Characterizing machines and workloads on a Google cluster. In: 2012 41st International Conference on Parallel Processing Workshops, pp. 397–403. IEEE (2012)
Miller, T.D., Crawford Jr., I.L.: Terminating a non-clustered workload in response to a failure of a system with a clustered workload. US Patent 7,653,833, 26 January 2010
Powers, D.M.: Evaluation: from precision, recall and f-measure to ROC, informedness, markedness and correlation. J. Mach. Learn. Technol. 2(1), 37–63 (2011)
Reiss, C., Tumanov, A., Ganger, G.R., Katz, R.H., Kozuch, M.A.: Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In: SOCC, p. 7. ACM (2012)
Reiss, C., Wilkes, J., Hellerstein, J.L.: Google cluster-usage traces: format + schema. Technical report, Google Inc., Mountain View, CA, USA, version 2.1, November 2011. https://github.com/google/cluster-data. Accessed 17 Nov 2014
Scholkopf, B., Sung, K.-K., Burges, C.J., Girosi, F., Niyogi, P., Poggio, T., Vapnik, V.: Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Trans. Signal Process. 45(11), 2758–2765 (1997)
van Rijsbergen, C.: Information Retrieval, 2nd edn. Butterworths, London (1979)
Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., Tune, E., Wilkes, J.: Large-scale cluster management at Google with Borg. In: Proceedings of the Tenth European Conference on Computer Systems, p. 18. ACM (2015)