A Communication Framework for Fault-Tolerant Parallel Execution

  • Conference paper
Languages and Compilers for Parallel Computing (LCPC 2009)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5898)

Abstract

PC grids represent massive computation capacity at a low cost, but are challenging to employ for parallel computing because of variable and unpredictable performance and availability. A communicating parallel program must employ checkpoint-restart and/or process redundancy to make continuous forward progress in such an unreliable environment. A communication model based on one-sided Put/Get calls, pioneered by the Linda system, is a good match because processes can execute their communication operations independently and asynchronously. However, Linda and its many variants are not designed for communicating processes that are replicated or independently restarted from checkpoints. The key problem is that a single logical operation that impacts the global program state may be executed by different instances of the same process at different times, leading to semantic inconsistency. This paper presents the design, execution model, implementation, and validation of a communication layer for robust execution on volatile nodes. The research leads to a practical way to employ idle PCs for latency-tolerant parallel computing applications.
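
The consistency problem described in the abstract can be illustrated with a small sketch. This is not the communication layer presented in the paper, whose design is not given on this page. It is a toy model that assumes every logical Put/Get carries a tag made of a logical process id and a per-process operation number, so that a replica or a restarted instance repeating the same logical operation sees the same result as the first instance. The class name `ReplicaAwareStore`, its methods, and the tagging scheme are all hypothetical.

```python
from typing import Any, Dict, Set, Tuple

# Hypothetical tag: (logical process id, per-process operation number).
Tag = Tuple[int, int]


class ReplicaAwareStore:
    """Toy shared store whose Put/Get calls are safe to repeat (illustrative only)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        # Tags of logical puts that have already been applied.
        self._applied_puts: Set[Tag] = set()
        # Value handed to the first instance that executed each logical get.
        self._get_log: Dict[Tag, Any] = {}

    def put(self, proc: int, op: int, key: str, value: Any) -> bool:
        """Apply a logical put once; a repeat by another instance is ignored."""
        tag = (proc, op)
        if tag in self._applied_puts:
            return False
        self._applied_puts.add(tag)
        self._data[key] = value
        return True

    def get(self, proc: int, op: int, key: str) -> Any:
        """Return what the first instance of this logical get observed."""
        tag = (proc, op)
        if tag not in self._get_log:
            self._get_log[tag] = self._data[key]
        return self._get_log[tag]


store = ReplicaAwareStore()
store.put(0, 1, "x", 10)          # process 0 writes x
first = store.get(1, 1, "x")      # one instance of process 1 reads x
store.put(0, 2, "x", 20)          # process 0 moves on and overwrites x
late = store.get(1, 1, "x")       # a slower replica repeats the same logical get
print(first, late)                # both instances observe the same value
```

Without the per-operation log, the slower replica would read the overwritten value and the two instances of process 1 would diverge, which is the kind of inconsistency the abstract refers to. A real system would also need to bound the log and handle missing keys, which this sketch leaves out.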

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kanna, N., Subhlok, J., Gabriel, E., Rohit, E., Anderson, D. (2010). A Communication Framework for Fault-Tolerant Parallel Execution. In: Gao, G.R., Pollock, L.L., Cavazos, J., Li, X. (eds) Languages and Compilers for Parallel Computing. LCPC 2009. Lecture Notes in Computer Science, vol 5898. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13374-9_1

  • DOI: https://doi.org/10.1007/978-3-642-13374-9_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13373-2

  • Online ISBN: 978-3-642-13374-9

  • eBook Packages: Computer Science, Computer Science (R0)