An Architectural Framework for Detecting Process Hangs/Crashes

Nakka, Nithin; Saggese, Giacinto Paolo; Kalbarczyk, Zbigniew; Iyer, Ravishankar K.

doi:10.1007/11408901_8

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 3463)

Included in the following conference series:

European Dependable Computing Conference

Abstract

This paper addresses the challenges faced in the practical implementation of heartbeat-based process crash and hang detection. We propose an in-processor hardware module to reduce error detection latency and instrumentation overhead. Three hardware techniques integrated into the main pipeline of a superscalar processor are presented: (i) Instruction Count Heartbeat (ICH), which detects process crashes and a class of hangs where the process exists but is not executing any instructions; (ii) Infinite Loop Hang Detector (ILHD), which captures process hangs in infinite execution of legitimate loops; and (iii) Sequential Code Hang Detector (SCHD), which detects process hangs in illegal loops. The proposed design has the following unique features: 1) operating-system-neutral detection techniques, 2) elimination of any instrumentation for detection of all application crashes and OS hangs, and 3) an automated and lightweight compile-time instrumentation methodology to detect all process hangs (including infinite loops), with the detection performed in the hardware module at runtime. The proposed techniques can support heartbeat protocols to detect operating system/process crashes and hangs in distributed systems. Evaluation of the hang detection techniques shows a low 1.6% performance overhead and a 6% memory overhead for the instrumentation. The crash detection technique does not incur any performance overhead and has a latency of a few instructions.
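
The abstract does not give the internals of the hardware module, so the following is only a minimal software sketch of two of the ideas it names, not the authors' design. It assumes (hypothetically) that an instruction-count heartbeat can be modeled as a monitor that periodically samples a retired-instruction counter and raises an alarm when the counter stops advancing, and that an infinite-loop check can be modeled as a bound on loop iterations supplied by compile-time instrumentation. The names InstructionCountHeartbeat, LoopIterationGuard, timeout_ticks, max_iterations, and first_alarm are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class InstructionCountHeartbeat:
    """Software model (assumption): suspect a crash or hang when the
    retired-instruction count stops advancing for too many samples."""
    timeout_ticks: int      # hypothetical threshold, in monitor samples
    last_count: int = 0
    idle_ticks: int = 0

    def sample(self, retired_count: int) -> bool:
        """Record one counter sample; return True if an alarm should fire."""
        if retired_count != self.last_count:
            # Progress was made since the previous sample: reset the timer.
            self.last_count = retired_count
            self.idle_ticks = 0
        else:
            self.idle_ticks += 1
        return self.idle_ticks >= self.timeout_ticks


@dataclass
class LoopIterationGuard:
    """Software model (assumption): flag a loop that exceeds an iteration
    bound provided by instrumentation."""
    max_iterations: int     # hypothetical bound inserted at compile time
    iterations: int = 0

    def enter(self) -> None:
        """Called at loop entry; restarts the iteration count."""
        self.iterations = 0

    def back_edge(self) -> bool:
        """Called on each loop back-edge; True means the bound was exceeded."""
        self.iterations += 1
        return self.iterations > self.max_iterations


def first_alarm(samples: Iterable[int], timeout_ticks: int) -> Optional[int]:
    """Return the index of the first sample that triggers the heartbeat
    alarm, or None if the counter keeps advancing."""
    heartbeat = InstructionCountHeartbeat(timeout_ticks)
    for index, count in enumerate(samples):
        if heartbeat.sample(count):
            return index
    return None
```

In this model, a process that has crashed or stalled without executing instructions is caught by the heartbeat alone, with no instrumentation, while a process stuck in a loop keeps retiring instructions and is caught only by the iteration guard. That split mirrors the abstract's distinction between instrumentation-free crash detection and instrumented hang detection.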

Author information

Authors and Affiliations

Center for Reliable and High Performance Computing, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 1308 West Main St., Urbana, IL, 61801, USA
Nithin Nakka, Giacinto Paolo Saggese, Zbigniew Kalbarczyk & Ravishankar K. Iyer

Editor information

Editors and Affiliations

Institute for Computer Sciences III, University of Erlangen-Nürnberg, Martensstr. 3, 91058, Erlangen, Germany
Mario Dal Cin
UPS, INSA, INP, ISAE; LAAS-CNRS, Université de Toulouse, Toulouse, France
Mohamed Kaâniche
Department of Measurement and Information Systems, Budapest University of Technology and Economics,
András Pataricza

About this paper

Cite this paper

Nakka, N., Saggese, G.P., Kalbarczyk, Z., Iyer, R.K. (2005). An Architectural Framework for Detecting Process Hangs/Crashes. In: Dal Cin, M., Kaâniche, M., Pataricza, A. (eds) Dependable Computing - EDCC 5. EDCC 2005. Lecture Notes in Computer Science, vol 3463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408901_8

DOI: https://doi.org/10.1007/11408901_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25723-3
Online ISBN: 978-3-540-32019-7
eBook Packages: Computer Science, Computer Science (R0)
