
Computer Science ›› 2021, Vol. 48 ›› Issue (11A): 225-231. doi: 10.11896/jsjkx.201200066

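JIANG Cong-feng1, YIN Ji-liang1, HU Hai-zhou1, YAN Long-chuan2, ZHANG Ji-lin3, WAN Jian4, QIU Ye-liang5

  1. 1 School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018
    2 State Grid Electrical Information Communication Co., Ltd., Beijing 100053
    3 School of Cyberspace Security, Hangzhou Dianzi University, Hangzhou 310018
    4 School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023
    5 Alibaba Cloud Computing Co., Ltd., Hangzhou 311121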
  • Publication date: 2021-11-10 Release date: 2021-11-12
  • Corresponding author: JIANG Cong-feng (cjiang@hdu.edu.cn)

Analysis of Workload Failure in Co-located Data Centers

JIANG Cong-feng1, YIN Ji-liang1, HU Hai-zhou1, YAN Long-chuan2, ZHANG Ji-lin3, WAN Jian4, QIU Ye-liang5   

  1. 1 School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China
    2 State Grid Electrical Information Communication Co., Ltd., Beijing 100053, China
    3 School of Cyberspace Security, Hangzhou Dianzi University, Hangzhou 310018, China
    4 School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
    5 Alibaba Cloud Computing Co., Ltd., Hangzhou 311121, China
  • Online: 2021-11-10 Published: 2021-11-12
  • About author: JIANG Cong-feng, born in 1980, Ph.D., professor, Ph.D. supervisor, is a member of China Computer Federation. His main research interests include cloud computing, system optimization and performance evaluation.
  • Supported by:
    National Key Research and Development Program of China (2017YFB101000), National Natural Science Foundation of China (61972118) and Zhejiang Key Research and Development Program of China (2019C01059).

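Abstract: Co-locating workloads in a data center markedly improves the resource utilization of cloud data centers, but it also increases scheduling complexity and the job failure rate. Using cluster-trace-v2018, the data center log dataset released by Alibaba Cloud, as an example, this paper analyzes in detail, from the perspective of offline batch workloads, the success rates and resource usage characteristics of different workload types. The main findings are as follows: 1) failures of a small number of job types affect the overall job success rate of the cluster and waste cluster resources; 2) in the Fuxi distributed scheduling system, task failover execution time follows a Gaussian distribution and task scheduling delay follows a Zipf distribution; 3) analysis of the distribution of failed instances across cluster nodes shows that job failures are spatially random and that failed instances are prone to fail again, whereas the overall cluster failure rate is unevenly distributed over time; 4) taking task instance failures as the basis, the mean time between failures of the cluster nodes is computed: most nodes have a mean time between failures of about 1 000 s, while a small number of nodes with a low instance failure rate have a mean time between failures above 10 000 s.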

Key words: Distributed scheduling, Workload characteristics, Co-location, Failure analysis

Abstract: Workload co-location can greatly increase the resource utilization of cloud data centers, but it also increases scheduling complexity and the job failure rate. In this paper, the cluster trace dataset cluster-trace-v2018 released by Alibaba Cloud is investigated, and the success rates and resource utilization of different types of batch workloads are characterized. The main findings are as follows. First, failures of a small number of job types affect the overall job success rate of the cluster and waste cluster resources. Second, in the Fuxi distributed scheduler, the execution time of task failover follows a Gaussian distribution and the task scheduling delay follows a Zipf distribution. Third, the distribution of failed instances across cluster nodes shows that job failures occur randomly in space and that failed instances are prone to fail again after failover, whereas the overall cluster failure rate is unevenly distributed over time. Fourth, the mean time between failures of the cluster nodes is calculated from task instance failure data: most nodes have a mean time between failures of about 1 000 s, while a few nodes with a low instance failure rate have a mean time between failures above 10 000 s.

Key words: Co-located cluster, Distributed scheduling, Failure analysis, Workload characteristics
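
The fourth finding derives a per-node mean time between failures from task instance failures. The abstract does not state the exact formula, so the following is only a minimal sketch under assumptions: it takes the mean time between failures of a node to be the length of the observation window divided by the number of failed instances seen on that node. The record layout `(node_id, end_time_s, status)`, the status string `"Failed"`, and the function name `mtbf_per_node` are hypothetical and are not taken from the cluster-trace-v2018 schema.

```python
from collections import defaultdict


def mtbf_per_node(records, window_start, window_end):
    """Estimate a mean time between failures (seconds) for each node.

    records: iterable of (node_id, end_time_s, status) tuples (hypothetical layout).
    Assumed definition: window length / number of failed instances on the node.
    Nodes with no failed instance in the window map to None.
    """
    failures = defaultdict(int)
    nodes = set()
    for node_id, end_time, status in records:
        # Ignore instances that finished outside the observation window.
        if not (window_start <= end_time <= window_end):
            continue
        nodes.add(node_id)
        if status == "Failed":
            failures[node_id] += 1

    span = window_end - window_start
    result = {}
    for node in nodes:
        count = failures.get(node, 0)
        result[node] = span / count if count else None
    return result


# Toy input: two failures on m1 and none on m2 within a 2 000 s window.
example = [
    ("m1", 100, "Failed"),
    ("m1", 900, "Failed"),
    ("m1", 1500, "Terminated"),
    ("m2", 500, "Terminated"),
]
print(mtbf_per_node(example, 0, 2000))
```

Under this assumed definition, a node with few failed instances gets a large value, which is consistent with the abstract's observation that nodes with a low instance failure rate show the longest mean time between failures.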


