Abstract
Reaching agreement on the identity of correctly functioning processors of a distributed system in the presence of random communication delays, failures, and processor joins is a fundamental problem in fault-tolerant distributed systems. Assuming a synchronous communication network that is not subject to partitions, we specify the processor-group membership problem and propose three simple protocols for solving it. The protocols provide all correct processors with consistent views of the processor-group membership and guarantee bounded processor failure detection and join delays.
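To make the notion of a bounded failure detection delay concrete, here is a minimal sketch of timeout-based detection under a synchronous network assumption. It is not one of the paper's three protocols, and it does not address agreement between the views of different processors. The class name `MembershipView`, the periodic "present" messages, and the parameters `period` and `max_delay` are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MembershipView:
    """Hypothetical local view, driven by periodic 'present' messages."""
    period: float     # assumed interval between 'present' broadcasts
    max_delay: float  # assumed upper bound on message delay
    last_heard: dict = field(default_factory=dict)  # processor id -> receipt time

    def on_present(self, sender: str, now: float) -> None:
        # The first message from an unknown sender acts as a join.
        self.last_heard[sender] = now

    def members(self, now: float) -> set:
        # Consecutive messages from a correct processor arrive at most
        # period + max_delay apart, so a longer silence implies a failure.
        deadline = self.period + self.max_delay
        return {p for p, t in self.last_heard.items() if now - t <= deadline}


# Illustrative timeline with explicit timestamps (no real clock involved).
view = MembershipView(period=1.0, max_delay=0.2)
view.on_present("p1", now=0.0)
view.on_present("p2", now=0.1)
view.on_present("p1", now=1.1)   # p1 keeps reporting, p2 goes silent
early = view.members(now=1.2)    # p2 is still within its deadline
late = view.members(now=1.5)     # p2 has been silent for too long
```

Because the silence threshold is fixed by `period` and `max_delay`, the time between a crash and its local detection is bounded as well. Without a known upper bound on message delay, no such fixed threshold could separate a slow processor from a failed one.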
Author information
Flaviu Cristian is a computer scientist at the IBM Almaden Research Center in San Jose, California. He received his PhD from the University of Grenoble, France, in 1979. After carrying out research in operating systems and programming methodology in France and working on the specification, design, and verification of fault-tolerant software in England, he joined IBM in 1982. Since then he has worked in the area of fault-tolerant distributed systems and protocols. He has participated in the design and implementation of a highly available distributed system prototype at the Almaden Research Center and has reviewed and consulted for several fault-tolerant distributed system designs in both the European and American divisions of IBM. He is now a technical leader in the design of a new Air Traffic Control System for the US, which must satisfy very stringent availability requirements.
About this article
Cite this article
Cristian, F. Reaching agreement on processor-group membership in synchronous distributed systems. Distrib Comput 4, 175–187 (1991). https://doi.org/10.1007/BF01784719
DOI: https://doi.org/10.1007/BF01784719