Failure Detection with Booting in Partially Synchronous Systems | SpringerLink
Skip to main content

Failure Detection with Booting in Partially Synchronous Systems

  • Conference paper
Dependable Computing - EDCC 5 (EDCC 2005)

Part of the book series: Lecture Notes in Computer Science ((LNPSE,volume 3463))

Included in the following conference series:

Abstract

Unreliable failure detectors are a well known means to enrich asynchronous distributed systems with time-free semantics that allow to solve consensus in the presence of crash failures. Implementing unreliable failure detectors requires a system that provides some synchrony, typically an upper bound on end-to-end message delays. Recently, we introduced an implementation of the perfect failure detector in a novel partially synchronous model, referred to as the Θ-Model, where only the ratio Θ of maximum vs. minimum end-to-end delay of messages that are simultaneously in transit must be known a priori (while the actual delays need not be known and not even be bounded). In this paper, we present an alternative failure detector algorithm, which is based on a clock synchronization algorithm for the Θ-Model. It not only surpasses our first implementation with respect to failure detection time, but also works during the system booting phase.

Supported by the Austrian START program Y41-MAT, the BM:vit FIT-IT project DCBA (proj. no. 808198), and by the FWF project Theta (proj. no. P17757-N04).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
JPY 3498
Price includes VAT (Japan)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
JPY 5719
Price includes VAT (Japan)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
JPY 7149
Price includes VAT (Japan)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

Similar content being viewed by others

References

  1. Hermant, J.F., Le Lann, G.: Fast asynchronous uniform consensus in real-time distributed systems. IEEE Transactions on Computers 51, 931–944 (2002)

    Article  Google Scholar 

  2. Le Lann, G., Schmid, U.: How to maximize computing systems coverage. Technical Report 183/1-128, Department of Automation, Technische Universität Wien (2003)

    Google Scholar 

  3. Fischer, M.J., Lynch, N.A., Paterson, M.S.: Impossibility of distributed consensus with one faulty processor. Journal of the ACM 32, 374–382 (1985)

    Article  MATH  MathSciNet  Google Scholar 

  4. Dolev, D., Dwork, C., Stockmeyer, L.: On the minimal synchronism needed for distributed consensus. Journal of the ACM 34, 77–97 (1987)

    Article  MATH  MathSciNet  Google Scholar 

  5. Chandra, T.D., Toueg, S.: Unreliable failure detectors for reliable distributed systems. Journal of the ACM 43, 225–267 (1996)

    Article  MATH  MathSciNet  Google Scholar 

  6. Chandra, T.D., Hadzilacos, V., Toueg, S.: The weakest failure detector for solving consensus. Journal of the ACM 43, 685–722 (1996)

    Article  MATH  MathSciNet  Google Scholar 

  7. Le Lann, G., Schmid, U.: How to implement a timer-free perfect failure detector in partially synchronous systems. Technical Report 183/1-127, Department of Automation, Technische Universität Wien (2003)

    Google Scholar 

  8. Larrea, M., Fernandez, A., Arevalo, S.: On the implementation of unreliable failure detectors in partially synchronous systems. IEEE Transactions on Computers 53, 815–828 (2004)

    Article  Google Scholar 

  9. Dwork, C., Lynch, N., Stockmeyer, L.: Consensus in the presence of partial synchrony. Journal of the ACM 35, 288–323 (1988)

    Article  MathSciNet  Google Scholar 

  10. Widder, J.: Distributed Computing in the Presence of Bounded Asynchrony. PhD thesis, Vienna University of Technology, Fakultät für Informatik (2004)

    Google Scholar 

  11. Larrea, M., Fernández, A., Arévalo, S.: On the impossibility of implementing perpetual failure detectors in partially synchronous systems. In: Proceedings of the 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing (PDP 2002), Gran Canaria Island, Spain (2002)

    Google Scholar 

  12. Widder, J.: Booting clock synchronization in partially synchronous systems. In: Proceedings of the 17th International Symposium on Distributed Computing (DISC 2003), Sorrento, Italy, vol. 2848, pp. 121–135. Springer, Heidelberg (2003)

    Google Scholar 

  13. Widder, J., Schmid, U.: Booting clock synchronization in partially synchronous systems with hybrid node and link failures. Technical Report 183/1-126, Department of Automation, Technische Universität Wien (2003) (submitted for publication)

    Google Scholar 

  14. Srikanth, T.K., Toueg, S.: Optimal clock synchronization. Journal of the ACM 34, 626–645 (1987)

    Article  MathSciNet  Google Scholar 

  15. Dolev, D., Friedman, R., Keidar, I., Malkhi, D.: Failure detectors in omission failure environments. In: Proc. 16th ACM Symposium on Principles of Distributed Computing, Santa Barbara, California, p. 286 (1997)

    Google Scholar 

  16. Malkhi, D., Reiter, M.: Unreliable intrusion detection in distributed computations. In: Proceedings of the 10th Computer Security Foundations Workshop (CSFW 1997), Rockport, MA, USA, pp. 116–124 (1997)

    Google Scholar 

  17. Kihlstrom, K.P., Moser, L.E., Melliar-Smith, P.M.: Solving consensus in a byzantine environment using an unreliable fault detector. In: Proceedings of the International Conference on Principles of Distributed Systems (OPODIS), Chantilly, France, pp. 61–75 (1997)

    Google Scholar 

  18. Doudou, A., Garbinato, B., Guerraoui, R., Schiper, A.: Muteness failure detectors: Speci¯cation and implementation. In: Hlavicka, J., Maehle, E., Pataricza, A. (eds.) EDDC 1999. LNCS, vol. 1667, pp. 71–87. Springer, Heidelberg (1999)

    Chapter  Google Scholar 

  19. Doudou, A., Garbinato, B., Guerraoui, R.: Encapsulating failure detection: From crash to byzantine failures. In: Blieberger, J., Strohmeier, A. (eds.) Ada-Europe 2002. LNCS, vol. 2361, pp. 24–50. Springer, Heidelberg (2002)

    Chapter  Google Scholar 

  20. Basu, A., Charron-Bost, B., Toueg, S.: Simulating reliable links with unreliable links in the presence of process crashes. In: Babaoğlu, Ö., Marzullo, K. (eds.) WDAG 1996. LNCS, vol. 1151, pp. 105–122. Springer, Heidelberg (1996)

    Google Scholar 

  21. Liu, J.W.S.: Real-Time Systems. Prentice Hall, Englewood Cliffs (2000)

    Google Scholar 

  22. Stankovic, J.A., Spuri, M., Ramamritham, K., Buttazzo, G.C.: Deadline Scheduling for Real-Time Systems. Kluwer Academic Publishers, Dordrecht (1998)

    MATH  Google Scholar 

  23. Albeseder, D.: Experimentelle Verifikation von Synchronitäatsannahmen für Computernetzwerke. Diplomarbeit, Embedded Computing Systems Group, Technische Universitäat Wien (2004) (in German)

    Google Scholar 

  24. Hadzilacos, V., Toueg, S.: Fault-tolerant broadcasts and related problems. In: Mullender, S. (ed.) Distributed Systems, 2nd edn., pp. 97–145. Addison-Wesley, Reading (1993)

    Google Scholar 

  25. Schmid, U., Fetzer, C.: Randomized asynchronous consensus with imperfect communications. In: 22nd Symposium on Reliable Distributed Systems (SRDS 2003), Florence, Italy, pp. 361–370 (2003)

    Google Scholar 

  26. Le Lann, G.: On real-time and non real-time distributed computing. In: Helary, J.-M., Raynal, M. (eds.) WDAG 1995. LNCS, vol. 972, pp. 51–70. Springer, Heidelberg (1995)

    Chapter  Google Scholar 

  27. Aguilera, M.K., Chen, W., Toueg, S.: Failure detection and consensus in the crashrecovery model. Distributed Computing 13, 99–125 (2000)

    Article  Google Scholar 

  28. Cristian, F., Fetzer, C.: The timed asynchronous distributed system model. IEEE Transactions on Parallel and Distributed Systems 10, 642–657 (1999)

    Article  Google Scholar 

  29. Veríssimo, P., Casimiro, A., Fetzer, C.: The timely computing base: Timely actions in the presence of uncertain timeliness. In: Proceedings IEEE International Conference on Dependable Systems and Networks (DSN’01 / FTCS’30), New York City, USA, pp. 533–542 (2000)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Widder, J., Le Lann, G., Schmid, U. (2005). Failure Detection with Booting in Partially Synchronous Systems. In: Dal Cin, M., Kaâniche, M., Pataricza, A. (eds) Dependable Computing - EDCC 5. EDCC 2005. Lecture Notes in Computer Science, vol 3463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408901_3

Download citation

  • DOI: https://doi.org/10.1007/11408901_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25723-3

  • Online ISBN: 978-3-540-32019-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics