A framework for evaluating comprehensive fault resilience mechanisms in numerical programs

The Journal of Supercomputing

Abstract

As HPC systems approach Exascale, their circuit features will shrink while their overall size will grow, both at a fixed power limit. These trends imply that soft faults in electronic circuits will become an increasingly significant problem for programs that run on these systems, causing them to occasionally crash or, worse, silently return incorrect results. This is motivating extensive work on program resilience to such faults, ranging from generic mechanisms such as replication or checkpoint/restart to algorithm-specific error detection and resilience mechanisms. Effective use of such mechanisms requires a detailed understanding of (1) which vulnerable parts of the program are most worth protecting and (2) the performance and resilience impact of fault resilience mechanisms on the program. This paper presents FaultTelescope, a tool that combines these two analyses and generates actionable insights by presenting program vulnerabilities and the impact of fault resilience mechanisms in an intuitive way.

Acknowledgments

We are grateful to Vishal Sharma and Arvind Haran, the authors of the original KULFI, for granting us permission to modify it for our experiments. We also appreciate the opportunity to be involved in and contribute to KULFI.

Author information

Corresponding author

Correspondence to Lu Peng.

About this article

Cite this article

Chen, S., Bronevetsky, G., Li, B. et al. A framework for evaluating comprehensive fault resilience mechanisms in numerical programs. J Supercomput 71, 2963–2984 (2015). https://doi.org/10.1007/s11227-015-1422-z

  • DOI: https://doi.org/10.1007/s11227-015-1422-z
