Abstract
This paper addresses the challenges faced in the practical implementation of heartbeat-based process crash and hang detection. We propose an in-processor hardware module to reduce error detection latency and instrumentation overhead. Three hardware techniques, integrated into the main pipeline of a superscalar processor, are presented: (i) Instruction Count Heartbeat (ICH), which detects process crashes and a class of hangs in which the process exists but is not executing any instructions; (ii) Infinite Loop Hang Detector (ILHD), which captures process hangs in infinite execution of legitimate loops; and (iii) Sequential Code Hang Detector (SCHD), which detects process hangs in illegal loops. The proposed design has the following unique features: (1) operating-system-neutral detection techniques; (2) elimination of any instrumentation for detection of all application crashes and OS hangs; and (3) an automated, lightweight compile-time instrumentation methodology to detect all process hangs (including infinite loops), with detection performed in the hardware module at runtime. The proposed techniques can support heartbeat protocols to detect operating system and process crashes and hangs in distributed systems. Evaluation of the hang detection techniques shows a low 1.6% performance overhead and a 6% memory overhead for the instrumentation. The crash detection technique incurs no performance overhead and has a latency of a few instructions.
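To make the idea behind an instruction-count heartbeat concrete, the following is a minimal software sketch, not the hardware design from the paper. It assumes a monitor that periodically samples a retired-instruction counter and flags the process once the counter has stayed unchanged for a fixed number of samples. The class name, the `timeout_ticks` parameter, and the sampling policy are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass
class InstructionCountHeartbeat:
    """Toy software model of an instruction-count heartbeat monitor.

    Hypothetical: names and timeout policy are illustrative assumptions,
    not taken from the paper's hardware module.
    """
    timeout_ticks: int   # consecutive idle samples tolerated before failing
    last_count: int = 0  # instruction count seen at the previous sample
    idle_ticks: int = 0  # consecutive samples with no progress so far

    def sample(self, retired_count: int) -> bool:
        """Record one periodic sample; return True if the process looks alive."""
        if retired_count != self.last_count:
            # The counter advanced, so instructions are being executed.
            self.last_count = retired_count
            self.idle_ticks = 0
            return True
        # No instructions retired since the previous sample.
        self.idle_ticks += 1
        return self.idle_ticks < self.timeout_ticks


def first_failure(counts, timeout_ticks):
    """Return the index of the first sample flagged as failed, or None."""
    monitor = InstructionCountHeartbeat(timeout_ticks)
    for index, count in enumerate(counts):
        if not monitor.sample(count):
            return index
    return None
```

Under these assumptions, a process whose counter stops advancing is flagged after `timeout_ticks` idle samples. A process stuck in an infinite loop would keep retiring instructions and pass this check, which is why the abstract lists separate detectors (ILHD and SCHD) for loop hangs.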
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nakka, N., Saggese, G.P., Kalbarczyk, Z., Iyer, R.K. (2005). An Architectural Framework for Detecting Process Hangs/Crashes. In: Dal Cin, M., Kaâniche, M., Pataricza, A. (eds) Dependable Computing - EDCC 5. EDCC 2005. Lecture Notes in Computer Science, vol 3463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408901_8
DOI: https://doi.org/10.1007/11408901_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25723-3
Online ISBN: 978-3-540-32019-7
eBook Packages: Computer Science, Computer Science (R0)