Abstract
We present a resilient domain-decomposition preconditioner for partial differential equations (PDEs). The algorithm reformulates the PDE as a sampling problem, followed by a solution update through data manipulation that is resilient to both soft and hard faults. We discuss an implementation based on a server-client model where all state information is held by the servers, while clients are designed solely as computational units. Servers are assumed to be “sandboxed”, while no assumption is made on the reliability of the clients. We explore the scalability of the algorithm up to \(\sim \)12k cores, build an SST/macro skeleton to extrapolate to \(\sim \)50k cores, and show the resilience under simulated hard and soft faults for a 2D linear Poisson equation.
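As a rough illustration of the "PDE as a sampling problem" idea, the sketch below treats a 1D Poisson problem, \(-u'' = 1\) on \([0,1]\) with zero boundary values, split into two overlapping subdomains. It is not the authors' implementation. The grid size, the subdomain split, the fault-injection scheme, the function names and the Theil-Sen robust fit are all assumptions made for this example. Each subdomain is solved for several sampled boundary values, standing in for client tasks. One sample is dropped to mimic a hard fault, and one is silently corrupted to mimic a soft fault. A robust line fit then recovers each boundary-to-boundary map, and a small linear system yields the interface values.

```python
import itertools
import numpy as np

def solve_subdomain(lo, hi, left, right, f, h):
    """Finite-difference solve of -u'' = f on nodes lo..hi with Dirichlet data."""
    m = hi - lo - 1  # number of interior unknowns
    A = (2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)) / h**2
    rhs = f[lo + 1:hi].copy()
    rhs[0] += left / h**2
    rhs[-1] += right / h**2
    return np.concatenate(([left], np.linalg.solve(A, rhs), [right]))

def theil_sen(x, y):
    """Robust line fit: median of pairwise slopes, then median intercept."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in itertools.combinations(range(len(x)), 2)]
    slope = np.median(slopes)
    return slope, np.median(y - slope * x)

def sample_map(solve_at, bc_samples, lost, corrupted):
    """Sample one boundary-to-boundary map with injected faults (hypothetical scheme)."""
    xs, ys = [], []
    for k, bc in enumerate(bc_samples):
        if k == lost:
            continue            # hard fault: this client never returns a result
        y = solve_at(bc)
        if k == corrupted:
            y = y + 10.0        # soft fault: result is silently corrupted
        xs.append(bc)
        ys.append(y)
    return theil_sen(np.array(xs), np.array(ys))

n, ia, ib = 40, 15, 25          # grid intervals; overlap spans nodes ia..ib
h = 1.0 / n
f = np.ones(n + 1)
samples = np.linspace(-1.0, 1.0, 7)

# Subdomain 1 (nodes 0..ib) maps u(x_b) to u(x_a): alpha = c1 + m1 * beta.
m1, c1 = sample_map(
    lambda beta: solve_subdomain(0, ib, 0.0, beta, f, h)[ia],
    samples, lost=1, corrupted=4)
# Subdomain 2 (nodes ia..n) maps u(x_a) to u(x_b): beta = c2 + m2 * alpha.
m2, c2 = sample_map(
    lambda alpha: solve_subdomain(ia, n, alpha, 0.0, f, h)[ib - ia],
    samples, lost=5, corrupted=2)

# Solve the coupled interface equations for the two interface values.
alpha, beta = np.linalg.solve([[1.0, -m1], [-m2, 1.0]], [c1, c2])

x_a, x_b = ia * h, ib * h
print(alpha, x_a * (1.0 - x_a) / 2.0)   # recovered vs. exact u(x_a)
print(beta, x_b * (1.0 - x_b) / 2.0)    # recovered vs. exact u(x_b)
```

Because the PDE is linear, each boundary-to-boundary map is exactly affine, so the uncorrupted samples lie on one line and the median-based fit ignores the single outlier. In this sketch the server role would correspond to holding `samples` and the fitted maps, with each `solve_subdomain` call being a stateless client task.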
Acknowledgments
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Award Number 13-016717. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
Copyright information
© 2016 Springer International Publishing Switzerland (outside the US)
About this paper
Cite this paper
Morris, K. et al. (2016). Scalability of Partial Differential Equations Preconditioner Resilient to Soft and Hard Faults. In: Kunkel, J., Balaji, P., Dongarra, J. (eds) High Performance Computing. ISC High Performance 2016. Lecture Notes in Computer Science, vol 9697. Springer, Cham. https://doi.org/10.1007/978-3-319-41321-1_24
DOI: https://doi.org/10.1007/978-3-319-41321-1_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41320-4
Online ISBN: 978-3-319-41321-1
eBook Packages: Computer Science (R0)