Abstract
As HPC systems approach Exascale, their circuit features will shrink while their overall size will grow, both under a fixed power limit. These trends imply that soft faults in electronic circuits will become an increasingly significant problem for programs that run on these systems, causing them to occasionally crash or, worse, silently return incorrect results. This is motivating extensive work on program resilience to such faults, ranging from generic mechanisms such as replication or checkpoint/restart to algorithm-specific error detection and resilience mechanisms. Effective use of such mechanisms requires a detailed understanding of (1) which vulnerable parts of the program are most worth protecting and (2) the performance and resilience impact of fault resilience mechanisms on the program. This paper presents FaultTelescope, a tool that combines these two analyses and generates actionable insights by presenting program vulnerabilities and the impact of fault resilience mechanisms in an intuitive way.
Acknowledgments
We are grateful to Vishal Sharma and Arvind Haran, the authors of the original KULFI, for granting us permission to modify it for our experiments. We also appreciate the opportunity to be involved in and contribute to KULFI.
Cite this article
Chen, S., Bronevetsky, G., Li, B. et al. A framework for evaluating comprehensive fault resilience mechanisms in numerical programs. J Supercomput 71, 2963–2984 (2015). https://doi.org/10.1007/s11227-015-1422-z
DOI: https://doi.org/10.1007/s11227-015-1422-z