Abstract
This paper presents an efficient, writer-based logging scheme for recoverable distributed shared memory systems, in which each data item is logged by the process that writes it rather than by every process that accesses it. Because the writer process maintains the log of its data items, volatile storage can be used for logging; to tolerate multiple failures, only the readers' access information needs to be saved to the writer's stable storage. Moreover, to reduce the frequency of stable logging, only data items accessed by multiple processes are logged, together with their access information, and only when the items are invalidated; semantic-based optimizations of logging are also considered. Compared with earlier schemes, in which stable logging was performed whenever a process accessed or wrote a new data item, the proposed scheme can significantly reduce both the size of the log and the logging frequency.
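A minimal Python sketch of the writer-side bookkeeping described above follows. It rests on our own simplifying assumptions, not on the paper's actual algorithm. Values live only in a volatile per-writer log. The writer records which other processes read the current version of an item. When that version is invalidated by a new write, a stable record of the item, version, and reader set is written only if at least one process besides the writer read it; this is our interpretation of "accessed by multiple processes". The class `WriterLog`, its methods, and the version numbering are hypothetical. Recovery itself and the semantic-based optimizations are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class WriterLog:
    """Hypothetical per-process log kept by the writer of data items."""
    pid: int
    # Volatile log: item -> list of values, one entry per version written here.
    volatile: dict = field(default_factory=dict)
    # Other processes that have read the current (not yet invalidated) version.
    readers: dict = field(default_factory=dict)
    # Simulated stable storage: holds only access information, never values.
    stable: list = field(default_factory=list)

    def write(self, item, value):
        # A new write invalidates the previous version before creating the next one.
        if item in self.volatile:
            self.invalidate(item)
        self.volatile.setdefault(item, []).append(value)
        self.readers[item] = set()

    def serve_read(self, item, reader_pid):
        # Remember which other process read the current version.
        if reader_pid != self.pid:
            self.readers[item].add(reader_pid)
        return self.volatile[item][-1]

    def invalidate(self, item):
        # Assumption: stable logging is needed only if some other process read this version.
        readers = self.readers.get(item, set())
        if readers:
            version = len(self.volatile[item]) - 1
            self.stable.append((item, version, tuple(sorted(readers))))
        self.readers[item] = set()


# Usage: process 0 writes x twice; version 0 was read by processes 1 and 2.
p0 = WriterLog(pid=0)
p0.write("x", 1)        # version 0 of x, kept only in volatile memory
p0.serve_read("x", 1)
p0.serve_read("x", 2)
p0.write("x", 2)        # invalidates version 0 -> one stable record
p0.write("x", 3)        # version 1 was read by no other process -> no stable record
print(p0.stable)        # [('x', 0, (1, 2))]
```

In this sketch, the stable log grows only with invalidated versions that other processes actually read. Private or unread versions never reach stable storage. This mirrors, in simplified form, the abstract's claim that stable logging frequency drops compared with logging on every access.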
Cite this article
Park, T., Yeom, H.Y. A Low Overhead Logging Scheme for Fast Recovery in Distributed Shared Memory Systems. The Journal of Supercomputing 15, 295–320 (2000). https://doi.org/10.1023/A:1008116511402