Exploiting Field Data Analysis to Improve the Reliability and Energy-efficiency of HPC Systems
 


Date

2016-06

Abstract

As the scale of High-Performance Computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as two serious design concerns that are expected to become even more challenging in future Exascale systems. The efficient design and operation of such large-scale installations critically relies on developing an in-depth understanding of their failure behaviour as well as their energy consumption profiles. Among the main obstacles facing the study of HPC reliability and energy efficiency, however, is the difficulty of replicating HPC problems inside a lab environment or obtaining access to operational field data from HPC organizations. Examples of such field data include node failure logs, hardware replacement logs, system event logs, workload traces, and data from environmental sensors. Fortunately, the past decade has seen an increasing number of HPC organizations willing to share their operational data with researchers or even make them publicly available.
In this work, we exploit field data analysis to improve our understanding of HPC failures in real-world systems, and to optimize HPC fault-tolerance protocols while analyzing their respective performance and energy overheads. Throughout our analyses, we investigate various HPC design tradeoffs between system performance, system reliability, and energy efficiency. Our results in the first part of this thesis provide critical insights into how and why failures happen in HPC installations, as well as which types of failures are correlated in the field. We study the impact of various factors on system reliability, including environmental factors such as data center temperature and power quality. We find that the effect of temperature on hardware reliability in large-scale systems, for example, is smaller than often assumed. This finding implies that the operators of these facilities can achieve substantial energy savings by raising their operating temperatures, without making significant sacrifices in system reliability. Our analysis of power problems in large HPC facilities, on the other hand, reveals strong correlations between different power issues (e.g., power outages and voltage spikes) and increased failure rates in various hardware and software components. Based on our observations, we derive lessons learned and practical recommendations for the efficient design and operation of large-scale systems. The second part of this thesis applies the knowledge obtained from our HPC failure analysis to improving HPC fault-tolerance techniques. We focus on the most widely used fault-tolerance mechanism in modern HPC systems: checkpoint/restart. We study how to optimize checkpoint scheduling in parallel applications for both performance and energy efficiency.
Our results show that exploiting certain failure characteristics of HPC systems when designing checkpoint-scheduling policies can significantly reduce the energy and performance overheads associated with faults and fault tolerance in HPC systems.
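The abstract does not specify the checkpoint-scheduling policies studied in the thesis. As a hedged illustration of the kind of tradeoff involved, the sketch below uses Young's classic first-order approximation for the optimal checkpoint interval, which balances checkpointing overhead against lost work after a failure; the function name and parameters are illustrative, not taken from the thesis.

```python
import math

def young_optimal_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint
    interval: tau = sqrt(2 * delta * M), where delta is the time to
    write one checkpoint and M is the system's mean time between
    failures. Checkpointing more often than tau wastes time writing
    checkpoints; less often, it risks losing too much work on failure.
    (Illustrative sketch only; the thesis derives failure-aware
    policies from field data rather than this simple formula.)"""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Example: a 60 s checkpoint on a system with a 24-hour MTBF.
tau = young_optimal_interval(60.0, 24 * 3600.0)
print(f"checkpoint every {tau / 60:.1f} minutes")  # roughly every 53.7 minutes
```

Failure-aware policies of the kind the thesis describes would replace the constant MTBF above with characteristics observed in field data, such as time-varying or correlated failure rates.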

Keywords

Distributed Systems, Energy-Efficiency, Fault-Tolerance, High-Performance Computing, Performance
