Abstract
This paper describes an approach to providing software fault tolerance for future deep-space robotic NASA missions, which will require a high degree of autonomy supported by enhanced on-board computational capability. Such systems have become possible as a result of emerging many-core technology, which is expected to offer 1024-core chips by 2015. We discuss the challenges and opportunities of this new technology, focusing on introspection-based adaptive fault tolerance that takes into account the specific requirements of applications and is guided by a fault model. Introspection supports runtime monitoring of program execution with the goal of identifying, locating, and analyzing errors. Fault tolerance assertions for the introspection system can come from the user, from domain-specific knowledge, or from the results of static or dynamic program analysis. This work is part of an ongoing project at the Jet Propulsion Laboratory in Pasadena, California.
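As an illustration of the idea of assertion-driven introspection, the following is a minimal sketch and not the authors' system. It assumes that program state can be exposed as a dictionary of named values and that assertions are plain predicates over that state. All names (`IntrospectionMonitor`, `Violation`, `temp_c`, the checkpoint labels) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical representation of observable program state.
State = Dict[str, float]


@dataclass
class Violation:
    assertion: str   # which assertion failed (identify)
    location: str    # checkpoint at which it failed (locate)
    snapshot: State  # copy of the state, kept for later analysis (analyze)


@dataclass
class IntrospectionMonitor:
    # Assertions may be supplied by a user, derived from domain knowledge,
    # or generated by a program analysis; here they are just predicates.
    assertions: Dict[str, Callable[[State], bool]] = field(default_factory=dict)
    log: List[Violation] = field(default_factory=list)

    def register(self, name: str, predicate: Callable[[State], bool]) -> None:
        self.assertions[name] = predicate

    def check(self, location: str, state: State) -> List[Violation]:
        """Evaluate every registered assertion at a named checkpoint."""
        found = [
            Violation(name, location, dict(state))
            for name, predicate in self.assertions.items()
            if not predicate(state)
        ]
        self.log.extend(found)
        return found


monitor = IntrospectionMonitor()
monitor.register("temperature_in_range", lambda s: -150.0 <= s["temp_c"] <= 150.0)
monitor.register("checksum_matches", lambda s: s["checksum"] == s["expected_checksum"])

ok = monitor.check(
    "after_sensor_read", {"temp_c": 21.5, "checksum": 7, "expected_checksum": 7}
)
bad = monitor.check(
    "after_filter", {"temp_c": 900.0, "checksum": 7, "expected_checksum": 7}
)
```

In this sketch the monitor only records violations. An adaptive scheme of the kind the abstract describes would additionally choose a recovery action based on the fault model, which is outside the scope of this example.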
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
James, M., Springer, P., Zima, H. (2010). Adaptive Fault Tolerance for Many-Core Based Space-Borne Computing. In: D’Ambra, P., Guarracino, M., Talia, D. (eds) Euro-Par 2010 - Parallel Processing. Euro-Par 2010. Lecture Notes in Computer Science, vol 6272. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15291-7_25
DOI: https://doi.org/10.1007/978-3-642-15291-7_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15290-0
Online ISBN: 978-3-642-15291-7
eBook Packages: Computer Science (R0)