Abstract
A message is in transit with respect to a global state if its sending is recorded in that global state but its receipt is not. Checkpointing algorithms must log such in-transit messages so that channel states can be restored when a computation resumes from a consistent global state after a failure. Coordinated checkpointing algorithms log exactly those in-transit messages on stable storage. Uncoordinated checkpointing algorithms, lacking synchronization, conservatively log more messages.
This paper presents an uncoordinated checkpointing protocol that logs all in-transit messages and the smallest possible number of non-in-transit messages. As a consequence, the protocol saves stable storage space and enables quicker recoveries. The core of the protocol is an appropriate tracking of causal dependencies between messages.
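The definition of an in-transit message given in the abstract can be illustrated with a minimal sketch. It is not the protocol of the paper: the `Message` record, the representation of a global state as a per-process event count (`cut`), and the function names `in_transit` and `orphans` are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Message:
    """Hypothetical record of one message (not taken from the paper)."""
    sender: str
    send_seq: int            # position of the send event in the sender's history
    receiver: str
    recv_seq: Optional[int]  # position of the receive event, None if never received


def in_transit(messages: List[Message], cut: Dict[str, int]) -> List[Message]:
    """Messages whose send is recorded in the global state but whose receipt is not.

    `cut` maps each process to the number of local events its checkpoint records.
    """
    return [
        m for m in messages
        if m.send_seq <= cut[m.sender]
        and (m.recv_seq is None or m.recv_seq > cut[m.receiver])
    ]


def orphans(messages: List[Message], cut: Dict[str, int]) -> List[Message]:
    """Messages whose receipt is recorded but whose send is not.

    A global state with no orphan messages is consistent.
    """
    return [
        m for m in messages
        if m.recv_seq is not None
        and m.recv_seq <= cut[m.receiver]
        and m.send_seq > cut[m.sender]
    ]


# Assumed example: two processes, each checkpointed after its second event.
msgs = [
    Message("p", 1, "q", 1),  # sent and received before both checkpoints
    Message("p", 2, "q", 3),  # sent before p's checkpoint, received after q's
    Message("q", 2, "p", 3),  # sent before q's checkpoint, received after p's
]
cut = {"p": 2, "q": 2}
to_log = in_transit(msgs, cut)
```

In this assumed example the last two messages are in transit and would have to be logged to restore the channel states; the first is not in transit, so logging it would only consume stable storage.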
Copyright information
© 1996 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mostefaoui, A., Raynal, M. (1996). Efficient message logging for uncoordinated checkpointing protocols. In: Hlawiczka, A., Silva, J.G., Simoncini, L. (eds) Dependable Computing — EDCC-2. EDCC 1996. Lecture Notes in Computer Science, vol 1150. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-61772-8_48
DOI: https://doi.org/10.1007/3-540-61772-8_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61772-3
Online ISBN: 978-3-540-70677-9
eBook Packages: Springer Book Archive