An efficient checkpointing method for multicomputers with wormhole routing

International Journal of Parallel Programming

Abstract

Efficient checkpointing and resumption of multicomputer applications is essential if multicomputers are to support time-sharing and the automatic resumption of jobs after a system failure. We present a checkpointing scheme that is transparent, imposes overhead only during checkpoints, requires minimal message logging, and allows for quick resumption of execution from a checkpointed image. Furthermore, the checkpointing algorithm allows each processor p to continue running the application being checkpointed except during the time that p is actively taking a local snapshot, and requires no global stop or freeze of the multicomputer. Since checkpointing multicomputer applications poses requirements different from those posed by checkpointing general distributed systems, existing distributed checkpointing schemes are inadequate for multicomputer checkpointing. Our checkpointing scheme makes use of special properties of wormhole routing networks to satisfy this new set of requirements.

Cite this article

Li, K., Naughton, J.F. & Plank, J.S. An efficient checkpointing method for multicomputers with wormhole routing. Int J Parallel Prog 20, 159–180 (1991). https://doi.org/10.1007/BF01379316

  • DOI: https://doi.org/10.1007/BF01379316