The overhead of consensus failure recovery | Distributed Computing Skip to main content
Log in

The overhead of consensus failure recovery

  • Original Article
  • Published:
Distributed Computing Aims and scope Submit manuscript

Abstract

Many reliable distributed systems are consensus-based and typically operate under two modes: a fast normal mode in failure-free synchronous periods, and a slower recovery mode following asynchrony and failures. A lot of work has been devoted to optimize the normal mode, but little has focused on optimizing the recovery mode. This paper seeks to understand whether the recovery mode is inherently slower than the normal mode. In particular, we consider consensus algorithms in the round-based eventually synchronous model of [11], where t out of n processes may fail by crashing, messages may be lost, and the system may be asynchronous for arbitrarily long, but eventually the system becomes synchronous and no new failure occurs (we say that the system becomes stable). For t   ≥   n/3, we prove a lower bound of three rounds for achieving a global decision whenever the system becomes stable, and we contrast this with a bound of two rounds when t  <  n/3. We then give matching algorithms for both t   ≥   n/3 and t   <   n/3.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (Japan)

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. ACM: Special issue on group communications systems. Commun. ACM 39(4) (1996)

  2. Amir, Y., Tutu, C.: From total order to database replication. In: Proceedings of the 22nd IEEE International Conference on Distributed Computing Systems (ICDCS-22) (2002)

  3. Birman, K., van Renessee, R.: Reliable Distributed Computing with the Isis Toolkit. IEEE Computer Society Press (1993)

  4. Chandra T.D., Toueg S. (1996) Unreliable failure detectors for reliable distributed systems. J. ACM 43(2): 225–267

    Article  MATH  MathSciNet  Google Scholar 

  5. Charron-Bost B., Schiper A. (2004) Uniform consensus is harder than consensus. J. Algorithms 51(1):15–37

    Article  MATH  MathSciNet  Google Scholar 

  6. Chockler G.V., Keidar I., Vitenberg R. (2001) Group communication specifications: a comprehensive study. ACM Comput. Surv. 33(4): 1–43

    Article  Google Scholar 

  7. Cristian, F., Fetzer, C.: The timed asynchronous distributed system model. IEEE Trans. Parallel Distrib. Syst. 10(6) (1999)

  8. Dolev D., Reischuk R., Strong R. (1990) Early stopping in byzantine agreement. J. ACM 37(4): 720–741

    Article  MATH  MathSciNet  Google Scholar 

  9. Dutta, P., Guerraoui, R.: Fast indulgent consensus with zero degradation. In: Proceedings of the Fourth European Dependable Computing Conference (EDCC-4). Toulouse, France (2002)

  10. Dutta, P., Guerraoui, R.: The inherent price of indulgence. Distrib. Comput. 18(1), 85–98 (2005). A preliminary version appeared in the Proceedings of the 21st ACM Symposium on Principles of Distributed Computing (PODC-21), 2002

    Google Scholar 

  11. Dwork C., Lynch N.A., Stockmeyer L. (1988) Consensus in the presence of partial synchrony. J. ACM 35(2): 288–323

    Article  MathSciNet  Google Scholar 

  12. El Abbadi, A., Skeen, D., Cristian, F.: An efficient fault-tolerant protocol for replicated data management. In: Proceedings of the 4th ACM Conference on Principles of Database Systems (1985)

  13. Fischer M.J., Lynch N.A. (1982) A lower bound for the time to assure interactive consistency. Inform. Process. Lett. 14(4): 183–186

    Article  MATH  MathSciNet  Google Scholar 

  14. Fischer M.J., Lynch N.A., Paterson M.S. (1985) Impossibility of distributed consensus with one faulty process. J. ACM 32(2): 374–382

    Article  MATH  MathSciNet  Google Scholar 

  15. Friedman, R., Vaysburd, A.: Fast replicated state machines over partitionable networks. In: Proceedings of the 16th IEEE Symposium on Reliable Distributed Systems (SRDS-16), pp. 130–137. IEEE Computer Society (1997)

  16. Gafni, E.: Round-by-round fault detectors: Unifying synchrony and asynchrony. In: Proceedings of the 17th ACM Symposium on Principles of Distributed Computing (PODC-17), pp. 143–152. Puerto Vallarta, Mexico (1998)

  17. Guerraoui, R.: Revisiting the relationship between non blocking atomic commitment and consensus problems. In: Proceedings of the 9th International Workshop on Distributed Algorithms (WDAG-9) (1995)

  18. Keidar, I., Dolev, D.: Efficient message ordering in dynamic networks. In: Proceedings of the 15th ACM Symposium on Principles of Distributed Computing (PODC-15), pp. 68–76. New York, NY (1996)

  19. Keidar, I., Rajsbaum, S.: On the cost of fault-tolerant consensus when there are no faults – a tutorial. Tech. Rep. MIT-LCS-TR-821, MIT (2001). PODC 2002 Tutorial

  20. Keidar I., Rajsbaum S. (2003) A simple proof of the uniform consensus synchronous lower bound. Inform. Process. Lett. 85(1): 47–52

    Article  MATH  MathSciNet  Google Scholar 

  21. Lamport L. (1978) Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21(7): 558–565

    Article  MATH  Google Scholar 

  22. Lamport, L.: The part-time parliament. Tech. Rep. 49, Systems Research Center, Digital Equipment Corp, Palo Alto (1989). A revised version of the paper also appeared in ACM Trans. Comput. Syst. 16(2), 133–169 (1998)

    Google Scholar 

  23. Lamport, L., Fischer, M.: Byzantine generals and transaction commit protocols. Technical Report 62, SRI International (1982)

  24. Lamport L., Shostak R., Pease M. (1982) The byzantine generals problem. ACM Trans. Program. Lang. Syst. 4(3): 382–401

    Article  MATH  Google Scholar 

  25. Lampson, B.: How to build a highly available system using consensus. In: Proceedings of the 10th International Workshop on Distributed Algorithms (WDAG-10), pp. 1–15. Bologna, Italy (1996)

  26. Mostefaoui A., Raynal M. (2001) Leader-based consensus. Parallel Process. Lett. 11(1): 95–107

    Article  MathSciNet  Google Scholar 

  27. Oki, B., Liskov, B.: Viewstamped replication: a general primary copy method to support highly available distributed systems. In: Proceedings of the 7th ACM Symposium on Principles of Distributed Computing (PODC-7), pp. 8–17. Toronto, Ontario, Canada (1988)

  28. Santoro, N., Widmayer, P.: Time is not a healer. In: 6th Annual Symp. Theor. Aspects of Computer Science, LNCS, vol. 349, pp. 304–313. Springer, Berlin Heidelberg New York (1989)

  29. Schneider F.B. (1990) Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput. Surv. 22(4): 299–319

    Article  Google Scholar 

  30. Thekkath, C.A., Mann, T., Lee, E.K.: Frangipani: A scalable distributed file system. In: ACM SIGOPS Symposium on Operating Systems Principles (SOSP), pp. 224–237 (1997)

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Partha Dutta.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Dutta, P., Guerraoui, R. & Keidar, I. The overhead of consensus failure recovery. Distrib. Comput. 19, 373–386 (2007). https://doi.org/10.1007/s00446-006-0017-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00446-006-0017-6

Keywords

Navigation