{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,8,30]],"date-time":"2024-08-30T17:48:44Z","timestamp":1725040124209},"reference-count":189,"publisher":"Association for Computing Machinery (ACM)","issue":"5","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2021,9,30]]},"abstract":"Performance and power constraints come together with Complementary Metal Oxide Semiconductor technology scaling in future Exascale systems. Technology scaling makes each individual transistor more prone to faults and, due to the exponential increase in the number of devices per chip, to higher system fault rates. Consequently, High-performance Computing (HPC) systems need to integrate prediction, detection, and recovery mechanisms to cope with faults efficiently. This article reviews fault detection, fault prediction, and recovery techniques in HPC systems, from electronics to system level. We analyze their strengths and limitations. 
Finally, we identify the promising paths to meet the reliability levels of Exascale systems.","DOI":"10.1145\/3403956","type":"journal-article","created":{"date-parts":[[2020,9,28]],"date-time":"2020-09-28T10:45:25Z","timestamp":1601289925000},"page":"1-32","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":18,"title":["Predictive Reliability and Fault Management in Exascale Systems"],"prefix":"10.1145","volume":"53","author":[{"given":"Ramon","family":"Canal","sequence":"first","affiliation":[{"name":"Universitat Polit\u00e8cnica de Catalunya (UPC) and Barcelona Supercomputing Center (BSC)"}]},{"given":"Carles","family":"Hernandez","sequence":"additional","affiliation":[{"name":"Universitat Polit\u00e8cnica de Val\u00e8ncia (UPV)"}]},{"given":"Rafa","family":"Tornero","sequence":"additional","affiliation":[{"name":"Universitat Polit\u00e8cnica de Val\u00e8ncia (UPV)"}]},{"given":"Alessandro","family":"Cilardo","sequence":"additional","affiliation":[{"name":"Universit\u00e0 degli studi di Napoli Federico II (UNINA)"}]},{"given":"Giuseppe","family":"Massari","sequence":"additional","affiliation":[{"name":"Politecnico di Milano (POLIMI)"}]},{"given":"Federico","family":"Reghenzani","sequence":"additional","affiliation":[{"name":"Politecnico di Milano (POLIMI)"}]},{"given":"William","family":"Fornaciari","sequence":"additional","affiliation":[{"name":"Politecnico di Milano (POLIMI)"}]},{"given":"Marina","family":"Zapater","sequence":"additional","affiliation":[{"name":"Ecole Polytechnique Federale de Lausanne (EPFL)"}]},{"given":"David","family":"Atienza","sequence":"additional","affiliation":[{"name":"Ecole Polytechnique Federale de Lausanne (EPFL)"}]},{"given":"Ariel","family":"Oleksiak","sequence":"additional","affiliation":[{"name":"Poznan Supercomputing and Networking Center (PSNC)"}]},{"given":"Wojciech","family":"Pi\u0105tek","sequence":"additional","affiliation":[{"name":"Poznan 
Supercomputing and Networking Center (PSNC)"}]},{"given":"Jaume","family":"Abella","sequence":"additional","affiliation":[{"name":"Barcelona Supercomputing Center (BSC)"}]}],"member":"320","published-online":{"date-parts":[[2020,9,28]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the 14th IEEE International On-line Testing Symposium. 3--9. DOI:https:\/\/doi.org\/10","author":"Abella J.","year":"2008","unstructured":"J. Abella , P. Chaparro , X. Vera , J. Carretero , and A. Gonz\u00e1lez . 2008. On-line failure detection and confinement in caches . In Proceedings of the 14th IEEE International On-line Testing Symposium. 3--9. DOI:https:\/\/doi.org\/10 .1109\/IOLTS. 2008 .15 10.1109\/IOLTS.2008.15 J. Abella, P. Chaparro, X. Vera, J. Carretero, and A. Gonz\u00e1lez. 2008. On-line failure detection and confinement in caches. In Proceedings of the 14th IEEE International On-line Testing Symposium. 3--9. DOI:https:\/\/doi.org\/10.1109\/IOLTS.2008.15"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/SIES.2015.7185039"},{"key":"e_1_2_1_3_1","unstructured":"E. Agullo L. Giraud A. Guermouche J. Roman and M. Zounon. 2013. Towards resilient parallel linear Krylov solvers: Recover-restart strategies. INRIA Research Report RR-8324. E. Agullo L. Giraud A. Guermouche J. Roman and M. Zounon. 2013. Towards resilient parallel linear Krylov solvers: Recover-restart strategies. INRIA Research Report RR-8324."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1137\/15M1042115"},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM\u201908)","author":"Al-Fares M.","unstructured":"M. Al-Fares , A. Loukissas , and A. Vahdat . 2008. A scalable, commodity data center network architecture . In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM\u201908) . ACM, New York, NY, 63--74. DOI:https:\/\/doi.org\/10.1145\/1402958.1402967 10.1145\/1402958.1402967 M. 
Al-Fares, A. Loukissas, and A. Vahdat. 2008. A scalable, commodity data center network architecture. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM\u201908). ACM, New York, NY, 63--74. DOI:https:\/\/doi.org\/10.1145\/1402958.1402967"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2013.116"},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of the 18th IEEE Symposium on High Performance Interconnects. 83--87","author":"Alverson R.","year":"2010","unstructured":"R. Alverson , D. Roweth , and L. Kaplan . 2010. The Gemini system interconnect . In Proceedings of the 18th IEEE Symposium on High Performance Interconnects. 83--87 . DOI:https:\/\/doi.org\/10.1109\/HOTI. 2010 .23 10.1109\/HOTI.2010.23 R. Alverson, D. Roweth, and L. Kaplan. 2010. The Gemini system interconnect. In Proceedings of the 18th IEEE Symposium on High Performance Interconnects. 83--87. DOI:https:\/\/doi.org\/10.1109\/HOTI.2010.23"},{"key":"e_1_2_1_8_1","volume-title":"Proceedings of the 10th IFIP\/IEEE International Symposium on Integrated Network Management. 159--168","author":"Andrzejak A.","year":"2007","unstructured":"A. Andrzejak and L. Silva . 2007. Deterministic models of software aging and optimal rejuvenation schedules . In Proceedings of the 10th IFIP\/IEEE International Symposium on Integrated Network Management. 159--168 . DOI:https:\/\/doi.org\/10.1109\/INM. 2007 .374780 10.1109\/INM.2007.374780 A. Andrzejak and L. Silva. 2007. Deterministic models of software aging and optimal rejuvenation schedules. In Proceedings of the 10th IFIP\/IEEE International Symposium on Integrated Network Management. 159--168. DOI:https:\/\/doi.org\/10.1109\/INM.2007.374780"},{"key":"e_1_2_1_9_1","unstructured":"ARM. 2017. ARM Reliability Availability and Serviceability (RAS) Specification\u2014ARMv8 for the ARMv8-A Architecture Profile. White paper. Retrieved from https:\/\/developer.arm.com\/docs\/ddi0587\/latest. ARM. 2017. 
ARM Reliability Availability and Serviceability (RAS) Specification\u2014ARMv8 for the ARMv8-A Architecture Profile. White paper. Retrieved from https:\/\/developer.arm.com\/docs\/ddi0587\/latest."},{"key":"e_1_2_1_10_1","volume-title":"Software Fault Tolerance","author":"Avizienis A.","unstructured":"A. Avizienis . 1995. Software Fault Tolerance . Chapter 2: The Methodology of N-Version Programming. Wiley . A. Avizienis. 1995. Software Fault Tolerance. Chapter 2: The Methodology of N-Version Programming. Wiley."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2004.2"},{"key":"e_1_2_1_12_1","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS\u201919)","author":"Bachan J.","unstructured":"J. Bachan , S. B. Baden , S. Hofmeyr , M. Jacquelin , A. Kamil , D. Bonachea , P. H. Hargrove , and H. Ahmed . 2019. UPC++: A high-performance communication framework for asynchronous computation . In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS\u201919) . 963--973. J. Bachan, S. B. Baden, S. Hofmeyr, M. Jacquelin, A. Kamil, D. Bonachea, P. H. Hargrove, and H. Ahmed. 2019. UPC++: A high-performance communication framework for asynchronous computation. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS\u201919). 963--973."},{"key":"e_1_2_1_13_1","volume-title":"Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC\u201912)","author":"Bauer M.","unstructured":"M. Bauer , S. Treichler , E. Slaughter , and A. Aiken . 2012. Legion: Expressing locality and independence with logical regions . In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC\u201912) . 1--11. M. Bauer, S. Treichler, E. Slaughter, and A. Aiken. 2012. Legion: Expressing locality and independence with logical regions. 
In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC\u201912). 1--11."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2016.54"},{"key":"e_1_2_1_15_1","volume-title":"Proceedings of the 12th IEEE International Conference on Fuzzy Systems (FUZZ\u201903)","volume":"1","author":"Berenji H. R.","year":"2003","unstructured":"H. R. Berenji , J. Ametha , and D. Vengerov . 2003. Inductive learning for fault diagnosis . In Proceedings of the 12th IEEE International Conference on Fuzzy Systems (FUZZ\u201903) , Vol. 1 . 726--731 vol.1. DOI:https:\/\/doi.org\/10.1109\/FUZZ. 2003 .1209453 10.1109\/FUZZ.2003.1209453 H. R. Berenji, J. Ametha, and D. Vengerov. 2003. Inductive learning for fault diagnosis. In Proceedings of the 12th IEEE International Conference on Fuzzy Systems (FUZZ\u201903), Vol. 1. 726--731 vol.1. DOI:https:\/\/doi.org\/10.1109\/FUZZ.2003.1209453"},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the International Conference on Dependable Systems and Networks (DSN\u201905)","author":"Bernick D.","year":"2005","unstructured":"D. Bernick , B. Bruckert , P. D. Vigna , D. Garcia , R. Jardine , J. Klecka , and J. Smullen . 2005. NonStop\/spl reg\/ advanced architecture . In Proceedings of the International Conference on Dependable Systems and Networks (DSN\u201905) . 12--21. DOI:https:\/\/doi.org\/10.1109\/DSN. 2005 .70 10.1109\/DSN.2005.70 D. Bernick, B. Bruckert, P. D. Vigna, D. Garcia, R. Jardine, J. Klecka, and J. Smullen. 2005. NonStop\/spl reg\/ advanced architecture. In Proceedings of the International Conference on Dependable Systems and Networks (DSN\u201905). 12--21. DOI:https:\/\/doi.org\/10.1109\/DSN.2005.70"},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC\u201915)","author":"Berrocal E.","unstructured":"E. Berrocal , L. Bautista-Gomez , S. Di , Z. Lan , and F. 
Cappello . 2015. Lightweight silent data corruption detection based on runtime data analysis for HPC applications . In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC\u201915) . Association for Computing Machinery, New York, NY, 275--278. DOI:https:\/\/doi.org\/10.1145\/2749246.2749253 10.1145\/2749246.2749253 E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, and F. Cappello. 2015. Lightweight silent data corruption detection based on runtime data analysis for HPC applications. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC\u201915). Association for Computing Machinery, New York, NY, 275--278. DOI:https:\/\/doi.org\/10.1145\/2749246.2749253"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2017.2735971"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201911)","author":"Bougeret M.","year":"2011","unstructured":"M. Bougeret , H. Casanova , M. Rabie , Y. Robert , and F. Vivien . 2011. Checkpointing strategies for parallel jobs . In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201911) . ACM, New York, NY, Article 33, 11 pages. DOI:https:\/\/doi.org\/10.1145\/ 2063 384.2063428 10.1145\/2063384.2063428 M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien. 2011. Checkpointing strategies for parallel jobs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201911). ACM, New York, NY, Article 33, 11 pages. DOI:https:\/\/doi.org\/10.1145\/2063384.2063428"},{"key":"e_1_2_1_20_1","volume-title":"Proceedings of the International Conference on Dependable Systems and Networks Workshops (DSN-W\u201910)","author":"Brandt J.","year":"2010","unstructured":"J. Brandt , F. Chen , V. De Sapio , A. 
Gentile , J. Mayo , P. P\u00e9bay , D. Roe , D. Thompson , and M. Wong . 2010. Quantifying effectiveness of failure prediction and response in HPC systems: Methodology and example . In Proceedings of the International Conference on Dependable Systems and Networks Workshops (DSN-W\u201910) . 2--7. DOI:https:\/\/doi.org\/10.1109\/DSNW. 2010 .5542629 10.1109\/DSNW.2010.5542629 J. Brandt, F. Chen, V. De Sapio, A. Gentile, J. Mayo, P. P\u00e9bay, D. Roe, D. Thompson, and M. Wong. 2010. Quantifying effectiveness of failure prediction and response in HPC systems: Methodology and example. In Proceedings of the International Conference on Dependable Systems and Networks Workshops (DSN-W\u201910). 2--7. DOI:https:\/\/doi.org\/10.1109\/DSNW.2010.5542629"},{"key":"e_1_2_1_21_1","doi-asserted-by":"crossref","unstructured":"M.-A. Breuer and A. D. Friedman. 1976. Diagnosis & Reliable Design of Digital Systems. Springer.","DOI":"10.1007\/978-3-642-95424-5"},{"key":"e_1_2_1_22_1","unstructured":"P. Bridges K. Ferreira M. Heroux and M. Hoemmen. 2012. Fault-tolerant linear solvers via selective reliability. ArXiv e-prints June 2012. arXiv:1206.1390 [math.NA]. P. Bridges K. Ferreira M. Heroux and M. Hoemmen. 2012. Fault-tolerant linear solvers via selective reliability. ArXiv e-prints June 2012. arXiv:1206.1390 [math.NA]."},{"key":"e_1_2_1_23_1","volume-title":"Proceedings of the 22nd Annual International Conference on Supercomputing (ICS\u201908)","author":"Bronevetsky G.","unstructured":"G. Bronevetsky and B. De Supinski . 2008. Soft error vulnerability of iterative linear algebra methods . In Proceedings of the 22nd Annual International Conference on Supercomputing (ICS\u201908) . ACM, New York, NY, 155--164. DOI:https:\/\/doi.org\/10.1145\/1375527.1375552 10.1145\/1375527.1375552 G. Bronevetsky and B. De Supinski. 2008. Soft error vulnerability of iterative linear algebra methods. 
In Proceedings of the 22nd Annual International Conference on Supercomputing (ICS\u201908). ACM, New York, NY, 155--164. DOI:https:\/\/doi.org\/10.1145\/1375527.1375552"},{"key":"e_1_2_1_24_1","volume-title":"Proceedings of the 11th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE\u201914)","author":"Cabello U.","unstructured":"U. Cabello , J. Rodr\u00edguez , A. Meneses , S. Mendoza , and D. Decouchant . 2014. Fault tolerance in heterogeneous multi-cluster systems through a task migration mechanism . In Proceedings of the 11th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE\u201914) . IEEE, 1--7. U. Cabello, J. Rodr\u00edguez, A. Meneses, S. Mendoza, and D. Decouchant. 2014. Fault tolerance in heterogeneous multi-cluster systems through a task migration mechanism. In Proceedings of the 11th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE\u201914). IEEE, 1--7."},{"key":"e_1_2_1_25_1","unstructured":"F. Cappello A. Geist W. Gropp S. Kale B. Kramer and M. Snir. 2014. Toward exascale resilience: 2014 update. Supercomput. Front. Innovat. 1 1 (2014). http:\/\/superfri.org\/superfri\/article\/view\/14. F. Cappello A. Geist W. Gropp S. Kale B. Kramer and M. Snir. 2014. Toward exascale resilience: 2014 update. Supercomput. Front. Innovat. 1 1 (2014). http:\/\/superfri.org\/superfri\/article\/view\/14."},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of the 26th ACM International Conference on Supercomputing (ICS\u201912)","author":"Casas M.","unstructured":"M. Casas , B. de Supinski , G. Bronevetsky , and M. Schulz . 2012. Fault resilience of the algebraic multi-grid solver . In Proceedings of the 26th ACM International Conference on Supercomputing (ICS\u201912) . ACM, New York, NY, 91--100. DOI:https:\/\/doi.org\/10.1145\/2304576.2304590 10.1145\/2304576.2304590 M. Casas, B. de Supinski, G. Bronevetsky, and M. Schulz. 2012. 
Fault resilience of the algebraic multi-grid solver. In Proceedings of the 26th ACM International Conference on Supercomputing (ICS\u201912). ACM, New York, NY, 91--100. DOI:https:\/\/doi.org\/10.1145\/2304576.2304590"},{"key":"#cr-split#-e_1_2_1_27_1.1","doi-asserted-by":"crossref","unstructured":"F. J. Cazorla L. Kosmidis E. Mezzetti C. Hernandez J. Abella and T. Vardanega. 2019. Probabilistic worst-case timing analysis: Taxonomy and comprehensive survey. ACM Comput. Surv. 52 1 Article 14 (Feb. 2019) 35 pages. DOI:https:\/\/doi.org\/10.1145\/3301283 10.1145\/3301283","DOI":"10.1145\/3301283"},{"key":"#cr-split#-e_1_2_1_27_1.2","doi-asserted-by":"crossref","unstructured":"F. J. Cazorla L. Kosmidis E. Mezzetti C. Hernandez J. Abella and T. Vardanega. 2019. Probabilistic worst-case timing analysis: Taxonomy and comprehensive survey. ACM Comput. Surv. 52 1 Article 14 (Feb. 2019) 35 pages. DOI:https:\/\/doi.org\/10.1145\/3301283","DOI":"10.1145\/3301283"},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the IEEE International Reliability Physics Symposium. PR.1.1--PR.1.7. DOI:https:\/\/doi.org\/10","author":"Cha S.","year":"2014","unstructured":"S. Cha , C. Chen , and L. S. Milor . 2014. System-level estimation of threshold voltage degradation due to NBTI with I\/O measurements . In Proceedings of the IEEE International Reliability Physics Symposium. PR.1.1--PR.1.7. DOI:https:\/\/doi.org\/10 .1109\/IRPS. 2014 .6861168 10.1109\/IRPS.2014.6861168 S. Cha, C. Chen, and L. S. Milor. 2014. System-level estimation of threshold voltage degradation due to NBTI with I\/O measurements. In Proceedings of the IEEE International Reliability Physics Symposium. PR.1.1--PR.1.7. DOI:https:\/\/doi.org\/10.1109\/IRPS.2014.6861168"},{"key":"e_1_2_1_29_1","volume-title":"Proceedings of the ACM\/IEEE International Symposium on Low Power Electronics and Design (ISLPED\u201912)","author":"Chan C. S.","unstructured":"C. S. Chan , Y. Jin , Y. K. Wu , K. Gross , K. Vaidyanathan , and T. S. 
Rosing . 2012. Fan-speed-aware scheduling of data intensive jobs . In Proceedings of the ACM\/IEEE International Symposium on Low Power Electronics and Design (ISLPED\u201912) . Association for Computing Machinery, New York, NY, 409--414. DOI:https:\/\/doi.org\/10.1145\/2333660.2333753 10.1145\/2333660.2333753 C. S. Chan, Y. Jin, Y. K. Wu, K. Gross, K. Vaidyanathan, and T. S. Rosing. 2012. Fan-speed-aware scheduling of data intensive jobs. In Proceedings of the ACM\/IEEE International Symposium on Low Power Electronics and Design (ISLPED\u201912). Association for Computing Machinery, New York, NY, 409--414. DOI:https:\/\/doi.org\/10.1145\/2333660.2333753"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2567529.2567555"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2010.2058873"},{"key":"e_1_2_1_32_1","volume-title":"Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA\u201905)","author":"Charles P.","unstructured":"P. Charles , C. Grothoff , V. Saraswat , C. Donawa , A. Kielstra , K. Ebcioglu , C. von Praun , and V. Sarkar . 2005. X10: An object-oriented approach to non-uniform cluster computing . In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA\u201905) . Association for Computing Machinery, New York, NY, 519--538. DOI:https:\/\/doi.org\/10.1145\/1094811.1094852 10.1145\/1094811.1094852 P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. 2005. X10: An object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA\u201905). Association for Computing Machinery, New York, NY, 519--538. 
DOI:https:\/\/doi.org\/10.1145\/1094811.1094852"},{"key":"e_1_2_1_33_1","volume-title":"Proceedings of the 1st Conference on Symposium on Networked Systems Design and Implementation (NSDI\u201904)","author":"Chen M. Y.","unstructured":"M. Y. Chen , A. Accardi , E. Kiciman , J. Lloyd , D. Patterson , A. Fox , and E. Brewer . 2004. Path-based faliure and evolution management . In Proceedings of the 1st Conference on Symposium on Networked Systems Design and Implementation (NSDI\u201904) . USENIX Association, Berkeley, CA, 23--23. http:\/\/dl.acm.org\/citation.cfm?id=1251175.1251198 M. Y. Chen, A. Accardi, E. Kiciman, J. Lloyd, D. Patterson, A. Fox, and E. Brewer. 2004. Path-based faliure and evolution management. In Proceedings of the 1st Conference on Symposium on Networked Systems Design and Implementation (NSDI\u201904). USENIX Association, Berkeley, CA, 23--23. http:\/\/dl.acm.org\/citation.cfm?id=1251175.1251198"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2002.1029005"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/1996130.1996142"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2442516.2442533"},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED\u201907)","author":"Choi J.","unstructured":"J. Choi , C. Y. Cher , H. Franke , H. Hamann , A. Weger , and P. Bose . 2007. Thermal-aware task scheduling at the system software level . In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED\u201907) . ACM, New York, NY, 213--218. DOI:https:\/\/doi.org\/10.1145\/1283780.1283826 10.1145\/1283780.1283826 J. Choi, C. Y. Cher, H. Franke, H. Hamann, A. Weger, and P. Bose. 2007. Thermal-aware task scheduling at the system software level. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED\u201907). ACM, New York, NY, 213--218. 
DOI:https:\/\/doi.org\/10.1145\/1283780.1283826"},{"key":"e_1_2_1_38_1","volume-title":"Proceedings of the 45th Annual Design Automation Conference (DAC\u201908)","author":"Coskun A. K.","unstructured":"A. K. Coskun , T. S. Rosing , and K. C. Gross . 2008. Temperature management in multiprocessor socs using online learning . In Proceedings of the 45th Annual Design Automation Conference (DAC\u201908) . ACM, New York, NY, 890--893. DOI:https:\/\/doi.org\/10.1145\/1391469.1391693 10.1145\/1391469.1391693 A. K. Coskun, T. S. Rosing, and K. C. Gross. 2008. Temperature management in multiprocessor socs using online learning. In Proceedings of the 45th Annual Design Automation Conference (DAC\u201908). ACM, New York, NY, 890--893. DOI:https:\/\/doi.org\/10.1145\/1391469.1391693"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1166\/jolpe.2006.007"},{"key":"e_1_2_1_40_1","doi-asserted-by":"crossref","unstructured":"G. Da Costa A. Oleksiak W. Piatek J. Salom and L. Sis\u00f3. 2015. Minimization of costs and energy consumption in a data center by a workload-based capacity management. In Energy Efficient Data Centers S. Klingert M. Chinnici and M. Rey Porto (Eds.). Springer International Publishing Cham 102--119. G. Da Costa A. Oleksiak W. Piatek J. Salom and L. Sis\u00f3. 2015. Minimization of costs and energy consumption in a data center by a workload-based capacity management. In Energy Efficient Data Centers S. Klingert M. Chinnici and M. Rey Porto (Eds.). Springer International Publishing Cham 102--119.","DOI":"10.1007\/978-3-319-15786-3_7"},{"key":"e_1_2_1_41_1","volume-title":"International Roadmap for Devices and Systems","author":"IEEE IRDS Technical Council","unstructured":"IEEE IRDS Technical Council . 2018. International Roadmap for Devices and Systems . IEEE. IEEE IRDS Technical Council. 2018. International Roadmap for Devices and Systems. 
IEEE."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.adhoc.2014.11.002"},{"key":"e_1_2_1_43_1","first-page":"8348791","article-title":"Energy-aware high-performance computing: Survey of state-of-the-art tools, techniques, and environments","volume":"2019","author":"Czarnul P.","year":"2019","unstructured":"P. Czarnul, J. Proficz, and A. Krzywaniak. 2019. Energy-aware high-performance computing: Survey of state-of-the-art tools, techniques, and environments. Sci. Program. 2019, 4 (2019), 8348791. DOI:https:\/\/doi.org\/10.1155\/2019\/8348791","journal-title":"Sci. Program."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/12.83652"},{"key":"e_1_2_1_45_1","volume-title":"Proceedings of the 8th IEEE International Symposium on Industrial Embedded Systems (SIES\u201913)","author":"Dasari D.","year":"2013","unstructured":"D. Dasari , B. Akesson , V. N\u00e9lis , M. A. Awan , and S. M. Petters . 2013. Identifying the sources of unpredictability in COTS-based multicore systems . In Proceedings of the 8th IEEE International Symposium on Industrial Embedded Systems (SIES\u201913) . 39--48. DOI:https:\/\/doi.org\/10.1109\/SIES. 2013 .6601469 10.1109\/SIES.2013.6601469 D. Dasari, B. Akesson, V. N\u00e9lis, M. A. Awan, and S. M. Petters. 2013. Identifying the sources of unpredictability in COTS-based multicore systems. In Proceedings of the 8th IEEE International Symposium on Industrial Embedded Systems (SIES\u201913). 39--48. DOI:https:\/\/doi.org\/10.1109\/SIES.2013.6601469"},{"key":"e_1_2_1_46_1","volume-title":"Proceedings of the 8th International Green and Sustainable Computing Conference (IGSC\u201917)","author":"Dauwe D.","year":"2017","unstructured":"D. Dauwe , R. 
Jhaveri , S. Pasricha , A. A. Maciejewski , and H. J. Siegel . 2018. Optimizing checkpoint intervals for reduced energy use in exascale systems . In Proceedings of the 8th International Green and Sustainable Computing Conference (IGSC\u201917) . 1--8. DOI:https:\/\/doi.org\/10.1109\/IGCC. 2017 .8323598 10.1109\/IGCC.2017.8323598 D. Dauwe, R. Jhaveri, S. Pasricha, A. A. Maciejewski, and H. J. Siegel. 2018. Optimizing checkpoint intervals for reduced energy use in exascale systems. In Proceedings of the 8th International Green and Sustainable Computing Conference (IGSC\u201917). 1--8. DOI:https:\/\/doi.org\/10.1109\/IGCC.2017.8323598"},{"key":"e_1_2_1_47_1","volume-title":"Proceedings of the IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017. 914--923","author":"Dauwe D.","year":"2017","unstructured":"D. Dauwe , S. Pasricha , A. A. Maciejewski , and H. J. Siegel . 2017. An analysis of resilience techniques for exascale computing platforms . Proceedings of the IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017. 914--923 . DOI:https:\/\/doi.org\/10.1109\/IPDPSW. 2017 .41 10.1109\/IPDPSW.2017.41 D. Dauwe, S. Pasricha, A. A. Maciejewski, and H. J. Siegel. 2017. An analysis of resilience techniques for exascale computing platforms. Proceedings of the IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017. 914--923. DOI:https:\/\/doi.org\/10.1109\/IPDPSW.2017.41"},{"key":"e_1_2_1_48_1","volume-title":"Proceedings of the IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018. 783--792","author":"Dauwe D.","year":"2018","unstructured":"D. Dauwe , S. Pasricha , A. A. Maciejewski , and H. J. Siegel . 2018. An analysis of multilevel checkpoint performance models . Proceedings of the IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018. 783--792 . 
DOI:https:\/\/doi.org\/10.1109\/IPDPSW. 2018 .00125 10.1109\/IPDPSW.2018.00125 D. Dauwe, S. Pasricha, A. A. Maciejewski, and H. J. Siegel. 2018. An analysis of multilevel checkpoint performance models. Proceedings of the IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018. 783--792. DOI:https:\/\/doi.org\/10.1109\/IPDPSW.2018.00125"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSUSC.2018.2797890"},{"key":"e_1_2_1_50_1","volume-title":"Proceedings of the 22nd International Symposium on High-performance Parallel and Distributed Computing (HPDC\u201913)","author":"Davies T.","unstructured":"T. Davies and Z. Chen . 2013. Correcting soft errors online in LU factorization . In Proceedings of the 22nd International Symposium on High-performance Parallel and Distributed Computing (HPDC\u201913) . ACM, New York, NY, 167--178. DOI:https:\/\/doi.org\/10.1145\/2493123.2462920 10.1145\/2493123.2462920 T. Davies and Z. Chen. 2013. Correcting soft errors online in LU factorization. In Proceedings of the 22nd International Symposium on High-performance Parallel and Distributed Computing (HPDC\u201913). ACM, New York, NY, 167--178. DOI:https:\/\/doi.org\/10.1145\/2493123.2462920"},{"key":"#cr-split#-e_1_2_1_51_1.1","doi-asserted-by":"crossref","unstructured":"R. I. Davis and A. Burns. 2011. A survey of hard real-time scheduling for multiprocessor systems. ACM Comput. Surv. 43 4 Article 35 (Oct. 2011) 44 pages. DOI:https:\/\/doi.org\/10.1145\/1978802.1978814 10.1145\/1978802.1978814","DOI":"10.1145\/1978802.1978814"},{"key":"#cr-split#-e_1_2_1_51_1.2","doi-asserted-by":"crossref","unstructured":"R. I. Davis and A. Burns. 2011. A survey of hard real-time scheduling for multiprocessor systems. ACM Comput. Surv. 43 4 Article 35 (Oct. 2011) 44 pages. 
DOI:https:\/\/doi.org\/10.1145\/1978802.1978814","DOI":"10.1145\/1978802.1978814"},{"key":"e_1_2_1_52_1","volume-title":"Cobalt: An open source platform for HPC system software research.","author":"Desai Narayan","year":"2005","unstructured":"Narayan Desai . 2005 . Cobalt: An open source platform for HPC system software research. Edinburgh BG\/L System Software Workshop . Narayan Desai. 2005. Cobalt: An open source platform for HPC system software research. Edinburgh BG\/L System Software Workshop."},{"key":"e_1_2_1_53_1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201914)","author":"Di S.","year":"2014","unstructured":"S. Di , L. Bautista-Gomez , and F. Cappello . 2014. Optimization of a multilevel checkpoint model with uncertain execution scales . In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201914) . 907--918. DOI:https:\/\/doi.org\/10.1109\/SC. 2014 .79 10.1109\/SC.2014.79 S. Di, L. Bautista-Gomez, and F. Cappello. 2014. Optimization of a multilevel checkpoint model with uncertain execution scales. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201914). 907--918. DOI:https:\/\/doi.org\/10.1109\/SC.2014.79"},{"key":"e_1_2_1_54_1","volume-title":"Proceedings of the 15th IEEE\/ACM International Symposium on Cluster, Cloud and Grid Computing. 271--280","author":"Di S.","year":"2015","unstructured":"S. Di , E. Berrocal , and F. Cappello . 2015. An efficient silent data corruption detection method with error-feedback control and even sampling for HPC applications . In Proceedings of the 15th IEEE\/ACM International Symposium on Cluster, Cloud and Grid Computing. 271--280 . DOI:https:\/\/doi.org\/10.1109\/CCGrid. 2015 .17 10.1109\/CCGrid.2015.17 S. Di, E. Berrocal, and F. Cappello. 2015. 
An efficient silent data corruption detection method with error-feedback control and even sampling for HPC applications. In Proceedings of the 15th IEEE\/ACM International Symposium on Cluster, Cloud and Grid Computing. 271--280. DOI:https:\/\/doi.org\/10.1109\/CCGrid.2015.17"},{"key":"e_1_2_1_55_1","volume-title":"Proceedings of the IEEE 28th International Parallel and Distributed Processing Symposium. 1181--1190","author":"Di S.","year":"2014","unstructured":"S. Di , M. S. Bouguerra , L. Bautista-Gomez , and F. Cappello . 2014. Optimization of multi-level checkpoint model for large scale HPC applications . In Proceedings of the IEEE 28th International Parallel and Distributed Processing Symposium. 1181--1190 . DOI:https:\/\/doi.org\/10.1109\/IPDPS. 2014 .122 10.1109\/IPDPS.2014.122 S. Di, M. S. Bouguerra, L. Bautista-Gomez, and F. Cappello. 2014. Optimization of multi-level checkpoint model for large scale HPC applications. In Proceedings of the IEEE 28th International Parallel and Distributed Processing Symposium. 1181--1190. DOI:https:\/\/doi.org\/10.1109\/IPDPS.2014.122"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2016.2517639"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2018.2864184"},{"key":"e_1_2_1_58_1","volume-title":"Proceedings of the 49th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN\u201919)","author":"Di S.","year":"2019","unstructured":"S. Di , H. Guo , E. Pershey , M. Snir , and F. Cappello . 2019. Characterizing and understanding HPC job failures over the 2K-day life of IBM BlueGene\/Q system . In Proceedings of the 49th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN\u201919) . 473--484. DOI:https:\/\/doi.org\/10.1109\/DSN. 2019 .00055 10.1109\/DSN.2019.00055 S. Di, H. Guo, E. Pershey, M. Snir, and F. Cappello. 2019. Characterizing and understanding HPC job failures over the 2K-day life of IBM BlueGene\/Q system. 
In Proceedings of the 49th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN\u201919). 473--484. DOI:https:\/\/doi.org\/10.1109\/DSN.2019.00055"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2016.2546248"},{"key":"e_1_2_1_60_1","volume-title":"Proceedings of the 44th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks. 610--621","author":"Di-Martino C.","year":"2014","unstructured":"C. Di-Martino , Z. Kalbarczyk , R. K. Iyer , F. Baccanico , J. Fullop , and W. Kramer . 2014. Lessons learned from the analysis of system failures at petascale: The case of blue waters . In Proceedings of the 44th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks. 610--621 . DOI:https:\/\/doi.org\/10.1109\/DSN. 2014 .62 10.1109\/DSN.2014.62 C. Di-Martino, Z. Kalbarczyk, R. K. Iyer, F. Baccanico, J. Fullop, and W. Kramer. 2014. Lessons learned from the analysis of system failures at petascale: The case of blue waters. In Proceedings of the 44th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks. 610--621. DOI:https:\/\/doi.org\/10.1109\/DSN.2014.62"},{"key":"e_1_2_1_61_1","volume-title":"Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing (HPDC\u201918)","author":"Dinda P.","year":"2018","unstructured":"P. Dinda , X. Wang , J. Wang , C. Beauchene , and C. Hetland . 2018. Hard real-time scheduling for parallel run-time systems . In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing (HPDC\u201918) . ACM, New York, NY, 14--26. DOI:https:\/\/doi.org\/10.1145\/3208040.3208052 10.1145\/3208040.3208052 P. Dinda, X. Wang, J. Wang, C. Beauchene, and C. Hetland. 2018. Hard real-time scheduling for parallel run-time systems. In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing (HPDC\u201918). 
ACM, New York, NY, 14--26. DOI:https:\/\/doi.org\/10.1145\/3208040.3208052"},{"key":"e_1_2_1_62_1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201914)","author":"Domke J.","year":"2014","unstructured":"J. Domke , T. Hoefler , and S. Matsuoka . 2014. Fail-in-place network design: Interaction between topology, routing algorithm and failures . In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201914) . 597--608. DOI:https:\/\/doi.org\/10.1109\/SC. 2014 .54 10.1109\/SC.2014.54 J. Domke, T. Hoefler, and S. Matsuoka. 2014. Fail-in-place network design: Interaction between topology, routing algorithm and failures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201914). 597--608. DOI:https:\/\/doi.org\/10.1109\/SC.2014.54"},{"key":"e_1_2_1_63_1","doi-asserted-by":"crossref","unstructured":"J. Dongarra T. Herault and Y. Robert. 2015. Fault Tolerance Techniques for High-Performance Computing. Springer. J. Dongarra T. Herault and Y. Robert. 2015. Fault Tolerance Techniques for High-Performance Computing. Springer.","DOI":"10.1007\/978-3-319-20943-2"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1016\/0142-1123(82)90018-4"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.microrel.2015.06.004"},{"key":"e_1_2_1_66_1","volume-title":"Proceedings of the IEEE 28th International Parallel and Distributed Processing Symposium. 1193--1202","author":"Elliott J.","year":"2014","unstructured":"J. Elliott , M. Hoemmen , and F. Mueller . 2014. Evaluating the impact of SDC on the GMRES iterative solver . In Proceedings of the IEEE 28th International Parallel and Distributed Processing Symposium. 1193--1202 . DOI:https:\/\/doi.org\/10.1109\/IPDPS. 2014 .123 10.1109\/IPDPS.2014.123 J. Elliott, M. Hoemmen, and F. Mueller. 2014. 
Evaluating the impact of SDC on the GMRES iterative solver. In Proceedings of the IEEE 28th International Parallel and Distributed Processing Symposium. 1193--1202. DOI:https:\/\/doi.org\/10.1109\/IPDPS.2014.123"},{"key":"e_1_2_1_68_1","volume-title":"Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC\u201912)","author":"Fiala D.","year":"2012","unstructured":"D. Fiala , F. Mueller , C. Engelmann , R. Riesen , K. Ferreira , and R. Brightwell . 2012. Detection and correction of silent data corruption for large-scale high-performance computing . In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC\u201912) . 1--12. DOI:https:\/\/doi.org\/10.1109\/SC. 2012 .49 10.1109\/SC.2012.49 D. Fiala, F. Mueller, C. Engelmann, R. Riesen, K. Ferreira, and R. Brightwell. 2012. Detection and correction of silent data corruption for large-scale high-performance computing. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC\u201912). 1--12. DOI:https:\/\/doi.org\/10.1109\/SC.2012.49"},{"key":"e_1_2_1_69_1","volume-title":"Proceedings of the Euromicro Conference on Digital System Design (DSD\u201917)","author":"Flich J.","year":"2017","unstructured":"J. Flich , G. Agosta , P. Ampletzer , D. A. Alonso , C. Brandolese , E. Cappe , A. Cilardo , L. Dragic , A. Dray , A. Duspara , W. Fornaciari , G. Guillaume , Y. Hoornenborg , A. Iranfar , M. Kovac , S. Libutti , B. Maitre , J. M. Martinez , G. Massari , H. Mlinaric , E. Papastefanakis , T. Picornell , I. Piljic , A. Pupykina , F. Reghenzani , I. Staub , R. Tornero , M. Zapater , and D. Zoni . 2017. MANGO: Exploring manycore architectures for next-GeneratiOn HPC systems . In Proceedings of the Euromicro Conference on Digital System Design (DSD\u201917) . 478--485. DOI:https:\/\/doi.org\/10.1109\/DSD. 2017 .51 10.1109\/DSD.2017.51 J. Flich, G. Agosta, P. 
Ampletzer, D. A. Alonso, C. Brandolese, E. Cappe, A. Cilardo, L. Dragic, A. Dray, A. Duspara, W. Fornaciari, G. Guillaume, Y. Hoornenborg, A. Iranfar, M. Kovac, S. Libutti, B. Maitre, J. M. Martinez, G. Massari, H. Mlinaric, E. Papastefanakis, T. Picornell, I. Piljic, A. Pupykina, F. Reghenzani, I. Staub, R. Tornero, M. Zapater, and D. Zoni. 2017. MANGO: Exploring manycore architectures for next-GeneratiOn HPC systems. In Proceedings of the Euromicro Conference on Digital System Design (DSD\u201917). 478--485. DOI:https:\/\/doi.org\/10.1109\/DSD.2017.51"},{"key":"e_1_2_1_70_1","volume-title":"Proceedings of the 18th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS\u201918)","author":"Fornaciari W.","unstructured":"W. Fornaciari , G. Agosta , D. Atienza , C. Brandolese , L. Cammoun , L. Cremona , A. Cilardo , A. Farres , J. Flich , C. Hernandez , M. Kulchewski , S. Libutti , J.M. Martinez , G. Massari , A. Oleksiak , A. Pupykina , F. Reghenzani , R. Tornero , M. Zanella , M. Zapater , and D. Zoni . 2018. Reliable power and time-constraints-aware predictive management of heterogeneous exascale systems . In Proceedings of the 18th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS\u201918) . ACM, New York, NY, 187--194. DOI:https:\/\/doi.org\/10.1145\/3229631.3239368 10.1145\/3229631.3239368 W. Fornaciari, G. Agosta, D. Atienza, C. Brandolese, L. Cammoun, L. Cremona, A. Cilardo, A. Farres, J. Flich, C. Hernandez, M. Kulchewski, S. Libutti, J.M. Martinez, G. Massari, A. Oleksiak, A. Pupykina, F. Reghenzani, R. Tornero, M. Zanella, M. Zapater, and D. Zoni. 2018. Reliable power and time-constraints-aware predictive management of heterogeneous exascale systems. In Proceedings of the 18th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS\u201918). ACM, New York, NY, 187--194. 
DOI:https:\/\/doi.org\/10.1145\/3229631.3239368"},{"key":"e_1_2_1_71_1","volume-title":"Proceedings of the 26th IEEE International Symposium on Reliable Distributed Systems (SRDS\u201907)","author":"Fu S.","year":"2007","unstructured":"S. Fu and C. Xu . 2007. Quantifying temporal and spatial correlation of failure events for proactive management . In Proceedings of the 26th IEEE International Symposium on Reliable Distributed Systems (SRDS\u201907) . 175--184. DOI:https:\/\/doi.org\/10.1109\/SRDS. 2007 .18 10.1109\/SRDS.2007.18 S. Fu and C. Xu. 2007. Quantifying temporal and spatial correlation of failure events for proactive management. In Proceedings of the 26th IEEE International Symposium on Reliable Distributed Systems (SRDS\u201907). 175--184. DOI:https:\/\/doi.org\/10.1109\/SRDS.2007.18"},{"key":"e_1_2_1_72_1","volume-title":"Proceedings of the First USENIX Conference on Analysis of System Logs (WASL\u201908)","author":"Fulp E. W.","year":"2008","unstructured":"E. W. Fulp , G. A. Fink , and J. N. Haack . 2008. Predicting computer system failures using support vector machines . In Proceedings of the First USENIX Conference on Analysis of System Logs (WASL\u201908) . USENIX Association, Berkeley, CA, 5--5. Retrieved from http:\/\/dl.acm.org\/citation.cfm?id=1855886.1855891. E. W. Fulp, G. A. Fink, and J. N. Haack. 2008. Predicting computer system failures using support vector machines. In Proceedings of the First USENIX Conference on Analysis of System Logs (WASL\u201908). USENIX Association, Berkeley, CA, 5--5. Retrieved from http:\/\/dl.acm.org\/citation.cfm?id=1855886.1855891."},{"key":"e_1_2_1_73_1","volume-title":"Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC\u201912)","author":"Gainaru A.","year":"2012","unstructured":"A. Gainaru , F. Cappello , M. Snir , and W. Kramer . 2012. Fault prediction under the microscope: A closer look into HPC systems . 
In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC\u201912) . 1--11. DOI:https:\/\/doi.org\/10.1109\/SC. 2012 .57 10.1109\/SC.2012.57 A. Gainaru, F. Cappello, M. Snir, and W. Kramer. 2012. Fault prediction under the microscope: A closer look into HPC systems. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC\u201912). 1--11. DOI:https:\/\/doi.org\/10.1109\/SC.2012.57"},{"key":"e_1_2_1_74_1","volume-title":"Proceedings of the IEEE International Conference on Cluster Computing (ICCC\u201918)","author":"Garg R.","year":"2018","unstructured":"R. Garg , A. Mohan , M. Sullivan , and G. Cooperman . 2018. CRUM: Checkpoint-restart support for CUDA\u2019s unified memory . Proceedings of the IEEE International Conference on Cluster Computing (ICCC\u201918) . 302--313. DOI:https:\/\/doi.org\/10.1109\/CLUSTER. 2018 .00047arxiv:1808.00117 10.1109\/CLUSTER.2018.00047arxiv:1808.00117 R. Garg, A. Mohan, M. Sullivan, and G. Cooperman. 2018. CRUM: Checkpoint-restart support for CUDA\u2019s unified memory. Proceedings of the IEEE International Conference on Cluster Computing (ICCC\u201918). 302--313. DOI:https:\/\/doi.org\/10.1109\/CLUSTER.2018.00047arxiv:1808.00117"},{"key":"e_1_2_1_75_1","volume-title":"Proceedings of the 9th International Symposium on Software Reliability Engineering. 283--292","author":"Garg S.","year":"1998","unstructured":"S. Garg , A. van Moorsel , K. Vaidyanathan , and K. S. Trivedi . 1998. A methodology for detection and estimation of software aging . In Proceedings of the 9th International Symposium on Software Reliability Engineering. 283--292 . DOI:https:\/\/doi.org\/10.1109\/ISSRE. 1998 .730892 10.1109\/ISSRE.1998.730892 S. Garg, A. van Moorsel, K. Vaidyanathan, and K. S. Trivedi. 1998. A methodology for detection and estimation of software aging. 
In Proceedings of the 9th International Symposium on Software Reliability Engineering. 283--292. DOI:https:\/\/doi.org\/10.1109\/ISSRE.1998.730892"},{"key":"e_1_2_1_76_1","volume-title":"Proceedings of the ACM SIGCOMM Conference (SIGCOMM\u201911)","author":"Gill P.","year":"2011","unstructured":"P. Gill , N. Jain , and N. Nagappan . 2011. Understanding network failures in data centers: Measurement, analysis, and implications . In Proceedings of the ACM SIGCOMM Conference (SIGCOMM\u201911) . ACM, New York, NY, 350--361. DOI:https:\/\/doi.org\/10.1145\/2018436.2018477 10.1145\/2018436.2018477 P. Gill, N. Jain, and N. Nagappan. 2011. Understanding network failures in data centers: Measurement, analysis, and implications. In Proceedings of the ACM SIGCOMM Conference (SIGCOMM\u201911). ACM, New York, NY, 350--361. DOI:https:\/\/doi.org\/10.1145\/2018436.2018477"},{"key":"e_1_2_1_77_1","volume-title":"Proceedings of the ACM\/IEEE Conference on Supercomputing (SC\u201905)","author":"Gioiosa R.","year":"2005","unstructured":"R. Gioiosa , J. C. Sancho , S. Jiang , and F. Petrini . 2005. Transparent, incremental checkpointing at kernel level: A foundation for fault tolerance for parallel computers . In Proceedings of the ACM\/IEEE Conference on Supercomputing (SC\u201905) . 9--9. DOI:https:\/\/doi.org\/10.1109\/SC. 2005 .76 10.1109\/SC.2005.76 R. Gioiosa, J. C. Sancho, S. Jiang, and F. Petrini. 2005. Transparent, incremental checkpointing at kernel level: A foundation for fault tolerance for parallel computers. In Proceedings of the ACM\/IEEE Conference on Supercomputing (SC\u201905). 9--9. DOI:https:\/\/doi.org\/10.1109\/SC.2005.76"},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2016.2599513"},{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","DOI":"10.1145\/1897852.1897877"},{"key":"e_1_2_1_80_1","volume-title":"Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM\u201909)","author":"Guo C.","unstructured":"C. Guo , G. 
Lu , D. Li , H. Wu , X. Zhang , Y. Shi , C. Tian , Y. Zhang , and S. Lu . 2009. BCube: A high performance, server-centric network architecture for modular data centers . In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM\u201909) . ACM, New York, NY, 63--74. DOI:https:\/\/doi.org\/10.1145\/1592568.1592577 10.1145\/1592568.1592577 C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. 2009. BCube: A high performance, server-centric network architecture for modular data centers. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM\u201909). ACM, New York, NY, 63--74. DOI:https:\/\/doi.org\/10.1145\/1592568.1592577"},{"key":"e_1_2_1_81_1","volume-title":"Proceedings of the 18th International Conference on Machine Learning (ICML\u201901)","author":"Hamerly G.","unstructured":"G. Hamerly and C. Elkan . 2001. Bayesian approaches to failure prediction for disk drives . In Proceedings of the 18th International Conference on Machine Learning (ICML\u201901) . Morgan Kaufmann Publishers Inc., San Francisco, CA, 202--209. http:\/\/dl.acm.org\/citation.cfm?id=645530.655825 G. Hamerly and C. Elkan. 2001. Bayesian approaches to failure prediction for disk drives. In Proceedings of the 18th International Conference on Machine Learning (ICML\u201901). Morgan Kaufmann Publishers Inc., San Francisco, CA, 202--209. http:\/\/dl.acm.org\/citation.cfm?id=645530.655825"},{"key":"e_1_2_1_83_1","volume-title":"Proceedings of the 6th IFIP\/IEEE International Symposium on Integrated Network Management. 309--322","author":"Hellerstein J. L.","year":"1999","unstructured":"J. L. Hellerstein , F. Zhang , and P. Shahabuddin . 1999. An approach to predictive detection for service management . In Proceedings of the 6th IFIP\/IEEE International Symposium on Integrated Network Management. 309--322 . DOI:https:\/\/doi.org\/10.1109\/INM. 1999 .770691 10.1109\/INM.1999.770691 J. L. Hellerstein, F. Zhang, and P. Shahabuddin. 1999. 
An approach to predictive detection for service management. In Proceedings of the 6th IFIP\/IEEE International Symposium on Integrated Network Management. 309--322. DOI:https:\/\/doi.org\/10.1109\/INM.1999.770691"},{"key":"e_1_2_1_84_1","doi-asserted-by":"publisher","DOI":"10.5555\/646376.689372"},{"key":"e_1_2_1_85_1","doi-asserted-by":"publisher","DOI":"10.1145\/1089014.1089021"},{"key":"e_1_2_1_86_1","volume-title":"Failure Prediction in Complex Computer Systems: A Probabilistic Approach","author":"Hoffmann G. A.","unstructured":"G. A. Hoffmann . 2006. Failure Prediction in Complex Computer Systems: A Probabilistic Approach . Shaker Verlag . G. A. Hoffmann. 2006. Failure Prediction in Complex Computer Systems: A Probabilistic Approach. Shaker Verlag."},{"key":"e_1_2_1_87_1","doi-asserted-by":"publisher","DOI":"10.1109\/TR.2007.909764"},{"key":"e_1_2_1_88_1","doi-asserted-by":"publisher","DOI":"10.1147\/rd.255.0453"},{"key":"e_1_2_1_89_1","article-title":"Algorithm-based fault tolerance for matrix operations","author":"Huang K.-H.","year":"1984","unstructured":"K.-H. Huang and J. A. Abraham . 1984 . Algorithm-based fault tolerance for matrix operations . IEEE Trans. Comput. C-33, 6 ( June 1984), 518--528. DOI:https:\/\/doi.org\/10.1109\/TC.1984.1676475 10.1109\/TC.1984.1676475 K.-H. Huang and J. A. Abraham. 1984. Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput. C-33, 6 (June 1984), 518--528. DOI:https:\/\/doi.org\/10.1109\/TC.1984.1676475","journal-title":"IEEE Trans. Comput. C-33, 6"},{"key":"e_1_2_1_90_1","doi-asserted-by":"publisher","DOI":"10.1145\/996566.996800"},{"key":"e_1_2_1_91_1","doi-asserted-by":"publisher","DOI":"10.1109\/TR.2002.802886"},{"key":"e_1_2_1_92_1","volume-title":"Proceedings of the International Conference on High Performance Computing Simulation (HPCS\u201914)","author":"Hukerikar S.","year":"2014","unstructured":"S. Hukerikar , P. C. Diniz , R. F. Lucas , and K. Teranishi . 2014. 
Opportunistic application-level fault detection through adaptive redundant multithreading . In Proceedings of the International Conference on High Performance Computing Simulation (HPCS\u201914) . 243--250. DOI:https:\/\/doi.org\/10.1109\/HPCSim. 2014 .6903692 10.1109\/HPCSim.2014.6903692 S. Hukerikar, P. C. Diniz, R. F. Lucas, and K. Teranishi. 2014. Opportunistic application-level fault detection through adaptive redundant multithreading. In Proceedings of the International Conference on High Performance Computing Simulation (HPCS\u201914). 243--250. DOI:https:\/\/doi.org\/10.1109\/HPCSim.2014.6903692"},{"key":"#cr-split#-e_1_2_1_93_1.1","doi-asserted-by":"crossref","unstructured":"S. Hukerikar and C. Engelmann. 2017. Resilience design patterns: A structured approach to resilience at extreme scale. Supercomput. Front. Innov. 4 3 (2017). DOI:https:\/\/doi.org\/10.14529\/jsfi170301 10.14529\/jsfi170301","DOI":"10.14529\/jsfi170301"},{"key":"#cr-split#-e_1_2_1_93_1.2","doi-asserted-by":"crossref","unstructured":"S. Hukerikar and C. Engelmann. 2017. Resilience design patterns: A structured approach to resilience at extreme scale. Supercomput. Front. Innov. 4 3 (2017). DOI:https:\/\/doi.org\/10.14529\/jsfi170301","DOI":"10.2172\/1436045"},{"key":"e_1_2_1_94_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2013.09.009"},{"key":"e_1_2_1_95_1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201918)","author":"Hussain Z.","unstructured":"Z. Hussain , T. Znati , and R. Melhem . 2018. Partial redundancy in HPC systems with non-uniform node reliabilities . In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201918) . IEEE Press, Piscataway, NJ, Article 44, 11 pages. http:\/\/dl.acm.org\/citation.cfm?id=3291656.3291715 Z. Hussain, T. Znati, and R. Melhem. 2018. 
Partial redundancy in HPC systems with non-uniform node reliabilities. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201918). IEEE Press, Piscataway, NJ, Article 44, 11 pages. http:\/\/dl.acm.org\/citation.cfm?id=3291656.3291715"},{"key":"e_1_2_1_96_1","unstructured":"Intel Corporation. [n.d.]. Intel Xeon Processor E7 Family: Reliability Availability and Serviceability. White paper. https:\/\/www.intel.com\/content\/www\/us\/en\/processors\/xeon\/xeon-e7-family-ras-server-paper.html. Intel Corporation. [n.d.]. Intel Xeon Processor E7 Family: Reliability Availability and Serviceability. White paper. https:\/\/www.intel.com\/content\/www\/us\/en\/processors\/xeon\/xeon-e7-family-ras-server-paper.html."},{"key":"e_1_2_1_97_1","volume-title":"Road Vehicles\u2014Functional Safety","author":"DIS","year":"2011","unstructured":"ISO\/ DIS 26262 : Road Vehicles\u2014Functional Safety . 2011 . International Organization for Standardization . ISO\/DIS 26262: Road Vehicles\u2014Functional Safety. 2011. International Organization for Standardization."},{"key":"e_1_2_1_98_1","first-page":"8","article-title":"TheSPoT: Thermal stress-aware power and temperature management for multiprocessor systems-on-chip","volume":"37","author":"Iranfar A.","year":"2018","unstructured":"A. Iranfar , M. Kamal , A. Afzali-Kusha , M. Pedram , and D. Atienza . 2018 . TheSPoT: Thermal stress-aware power and temperature management for multiprocessor systems-on-chip . IEEE Trans. Comput.-Aid. Design Integr. Circ. Syst. 37 , 8 (Aug. 2018), 1532--1545. DOI:https:\/\/doi.org\/10.1109\/TCAD.2017.2768417 10.1109\/TCAD.2017.2768417 A. Iranfar, M. Kamal, A. Afzali-Kusha, M. Pedram, and D. Atienza. 2018. TheSPoT: Thermal stress-aware power and temperature management for multiprocessor systems-on-chip. IEEE Trans. Comput.-Aid. Design Integr. Circ. Syst. 37, 8 (Aug. 2018), 1532--1545. 
DOI:https:\/\/doi.org\/10.1109\/TCAD.2017.2768417","journal-title":"IEEE Trans. Comput.-Aid. Design Integr. Circ. Syst."},{"key":"e_1_2_1_99_1","volume-title":"Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS\u201917)","author":"Iranfar A.","unstructured":"A. Iranfar , F. Terraneo , W. A. Simon , L. Dragi\u0107 , I. Pilji\u0107 , M. Zapater , W. Fornaciari , M. Kova\u010d , and D. Atienza . 2017. Thermal characterization of next-generation workloads on heterogeneous MPSoCs . In Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS\u201917) . 286--291. A. Iranfar, F. Terraneo, W. A. Simon, L. Dragi\u0107, I. Pilji\u0107, M. Zapater, W. Fornaciari, M. Kova\u010d, and D. Atienza. 2017. Thermal characterization of next-generation workloads on heterogeneous MPSoCs. In Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS\u201917). 286--291."},{"key":"e_1_2_1_100_1","doi-asserted-by":"publisher","DOI":"10.1109\/TDSC.2017.2737537"},{"key":"e_1_2_1_101_1","volume-title":"Proceedings of the IEEE International Electron Devices Meeting (IEDM\u201916)","author":"Jin M.","year":"2016","unstructured":"M. Jin , C. Liu , J. Kim , J. Kim , H. Shim , K. Kim , G. Kim , S. Lee , T. Uemura , M. Chang , T. An , J. Park , and S. Pae . 2016. Reliability characterization of 10 nm FinFET technology with multi-VT gate stack for low power and high performance . In Proceedings of the IEEE International Electron Devices Meeting (IEDM\u201916) . 15.1.1--15.1.4. DOI:https:\/\/doi.org\/10.1109\/IEDM. 2016 .7838420 10.1109\/IEDM.2016.7838420 M. Jin, C. Liu, J. Kim, J. Kim, H. Shim, K. Kim, G. Kim, S. Lee, T. Uemura, M. Chang, T. An, J. Park, and S. Pae. 2016. Reliability characterization of 10 nm FinFET technology with multi-VT gate stack for low power and high performance. 
In Proceedings of the IEEE International Electron Devices Meeting (IEDM\u201916). 15.1.1--15.1.4. DOI:https:\/\/doi.org\/10.1109\/IEDM.2016.7838420"},{"key":"e_1_2_1_102_1","volume-title":"Proceedings of the 44th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks. 738--743","author":"Kannan S.","year":"2014","unstructured":"S. Kannan , N. Farooqui , A. Gavrilovska , and K. Schwan . 2014. HeteroCheckpoint: Efficient checkpointing for accelerator-based systems . In Proceedings of the 44th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks. 738--743 . DOI:https:\/\/doi.org\/10.1109\/DSN. 2014 .76 10.1109\/DSN.2014.76 S. Kannan, N. Farooqui, A. Gavrilovska, and K. Schwan. 2014. HeteroCheckpoint: Efficient checkpointing for accelerator-based systems. In Proceedings of the 44th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks. 738--743. DOI:https:\/\/doi.org\/10.1109\/DSN.2014.76"},{"key":"e_1_2_1_103_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNN.2005.853411"},{"key":"e_1_2_1_104_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897937.2905010"},{"key":"e_1_2_1_105_1","volume-title":"Proceedings of the 48th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN\u201918)","author":"Kumar M.","year":"2018","unstructured":"M. Kumar , S. Gupta , T. Patel , M. Wilder , W. Shi , S. Fu , C. Engelmann , and D. Tiwari . 2018. Understanding and analyzing interconnect errors and network congestion on a large scale HPC system . In Proceedings of the 48th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN\u201918) . 107--114. DOI:https:\/\/doi.org\/10.1109\/DSN. 2018 .00023 10.1109\/DSN.2018.00023 M. Kumar, S. Gupta, T. Patel, M. Wilder, W. Shi, S. Fu, C. Engelmann, and D. Tiwari. 2018. Understanding and analyzing interconnect errors and network congestion on a large scale HPC system. 
In Proceedings of the 48th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN\u201918). 107--114. DOI:https:\/\/doi.org\/10.1109\/DSN.2018.00023"},{"key":"e_1_2_1_106_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.simpat.2013.08.007"},{"key":"e_1_2_1_107_1","volume-title":"Proceedings of the 3rd IEEE International High-Assurance Systems Engineering Symposium. 232--239","author":"Lal R.","year":"1998","unstructured":"R. Lal and G. Choi . 1998. Error and failure analysis of a UNIX server . In Proceedings of the 3rd IEEE International High-Assurance Systems Engineering Symposium. 232--239 . DOI:https:\/\/doi.org\/10.1109\/HASE. 1998 .731618 10.1109\/HASE.1998.731618 R. Lal and G. Choi. 1998. Error and failure analysis of a UNIX server. In Proceedings of the 3rd IEEE International High-Assurance Systems Engineering Symposium. 232--239. DOI:https:\/\/doi.org\/10.1109\/HASE.1998.731618"},{"key":"e_1_2_1_108_1","doi-asserted-by":"publisher","DOI":"10.1137\/040620394"},{"key":"e_1_2_1_109_1","doi-asserted-by":"crossref","unstructured":"J. C. Laprie (Ed.). 1995. Dependability\u2014Its Attributes Impairments and Means. Springer-Verlag Berlin. J. C. Laprie (Ed.). 1995. Dependability\u2014Its Attributes Impairments and Means. Springer-Verlag Berlin.","DOI":"10.1007\/978-3-642-79789-7_1"},{"key":"e_1_2_1_110_1","doi-asserted-by":"publisher","DOI":"10.1109\/FTCSH.1995.532603"},{"key":"e_1_2_1_111_1","doi-asserted-by":"publisher","DOI":"10.5555\/573776"},{"key":"e_1_2_1_112_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0026-2714(03)00183-5"},{"key":"e_1_2_1_113_1","first-page":"10","article-title":"Fat-trees: Universal networks for hardware-efficient supercomputing","volume":"34","author":"Leiserson C. E.","year":"1985","unstructured":"C. E. Leiserson . 1985 . Fat-trees: Universal networks for hardware-efficient supercomputing . IEEE Trans. Comput. 34 , 10 (Oct. 1985), 892--901. http:\/\/dl.acm.org\/citation.cfm?id=4492.4495 C. E. 
Leiserson. 1985. Fat-trees: Universal networks for hardware-efficient supercomputing. IEEE Trans. Comput. 34, 10 (Oct. 1985), 892--901. http:\/\/dl.acm.org\/citation.cfm?id=4492.4495","journal-title":"IEEE Trans. Comput."},{"key":"e_1_2_1_114_1","volume-title":"Proceedings of the 14th International Symposium on Software Reliability Engineering (ISSRE\u201903)","author":"Levy D.","year":"2003","unstructured":"D. Levy and R. Chillarege . 2003. Early warning of failures through alarm analysis a case study in telecom voice mail systems . In Proceedings of the 14th International Symposium on Software Reliability Engineering (ISSRE\u201903) 271--280. DOI:https:\/\/doi.org\/10.1109\/ISSRE. 2003 .1251049 10.1109\/ISSRE.2003.1251049 D. Levy and R. Chillarege. 2003. Early warning of failures through alarm analysis a case study in telecom voice mail systems. In Proceedings of the 14th International Symposium on Software Reliability Engineering (ISSRE\u201903) 271--280. DOI:https:\/\/doi.org\/10.1109\/ISSRE.2003.1251049"},{"key":"e_1_2_1_115_1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201917)","author":"Liang X.","unstructured":"X. Liang , J. Chen , D. Tao , S. Li , P. Wu , H. Li , K. Ouyang , Y. Liu , F. Song , and Z. Chen . 2017. Correcting soft errors online in fast Fourier transform . In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201917) . ACM, New York, NY, Article 30, 12 pages. DOI:https:\/\/doi.org\/10.1145\/3126908.3126915 10.1145\/3126908.3126915 X. Liang, J. Chen, D. Tao, S. Li, P. Wu, H. Li, K. Ouyang, Y. Liu, F. Song, and Z. Chen. 2017. Correcting soft errors online in fast Fourier transform. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201917). ACM, New York, NY, Article 30, 12 pages. 
DOI:https:\/\/doi.org\/10.1145\/3126908.3126915"},{"key":"e_1_2_1_116_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2006.18"},{"key":"e_1_2_1_117_1","doi-asserted-by":"publisher","DOI":"10.1109\/24.58720"},{"key":"e_1_2_1_118_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2020.01.026"},{"key":"e_1_2_1_119_1","doi-asserted-by":"publisher","DOI":"10.1147\/rd.62.0200"},{"key":"e_1_2_1_120_1","volume-title":"Proceedings of the 45th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks. 25--36","author":"Martino C. D.","year":"2015","unstructured":"C. D. Martino , W. Kramer , Z. Kalbarczyk , and R. Iyer . 2015. Measuring and understanding extreme-scale application resilience: A field study of 5,000,000 HPC application runs . In Proceedings of the 45th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks. 25--36 . DOI:https:\/\/doi.org\/10.1109\/DSN. 2015 .50 10.1109\/DSN.2015.50 C. D. Martino, W. Kramer, Z. Kalbarczyk, and R. Iyer. 2015. Measuring and understanding extreme-scale application resilience: A field study of 5,000,000 HPC application runs. In Proceedings of the 45th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks. 25--36. DOI:https:\/\/doi.org\/10.1109\/DSN.2015.50"},{"key":"e_1_2_1_121_1","doi-asserted-by":"crossref","unstructured":"M. M\u00e9dard and S. S. Lumetta. 2003. Network Reliability and Fault Tolerance. American Cancer Society. Retrieved from arXiv:https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/0471219282.eot281. M. M\u00e9dard and S. S. Lumetta. 2003. Network Reliability and Fault Tolerance. American Cancer Society. Retrieved from arXiv:https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/0471219282.eot281.","DOI":"10.1002\/0471219282.eot281"},{"key":"e_1_2_1_122_1","volume-title":"Proceedings of the 45th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks. 415--426","author":"Meza J.","year":"2015","unstructured":"J. 
Meza , Q. Wu , S. Kumar , and O. Mutlu . 2015. Revisiting memory errors in large-scale production data centers: Analysis and modeling of new trends from the field . In Proceedings of the 45th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks. 415--426 . DOI:https:\/\/doi.org\/10.1109\/DSN. 2015 .57 10.1109\/DSN.2015.57 J. Meza, Q. Wu, S. Kumar, and O. Mutlu. 2015. Revisiting memory errors in large-scale production data centers: Analysis and modeling of new trends from the field. In Proceedings of the 45th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks. 415--426. DOI:https:\/\/doi.org\/10.1109\/DSN.2015.57"},{"key":"e_1_2_1_123_1","doi-asserted-by":"publisher","DOI":"10.2172\/984082"},{"key":"e_1_2_1_124_1","unstructured":"Moor Insights & Strategy. 2017. AMD EPYC Brings New RAS Capability. White paper. Retrieved from https:\/\/www.amd.com\/system\/files\/2017-06\/AMD-EPYC-Brings-New-RAS-Capability.pdf. Moor Insights & Strategy. 2017. AMD EPYC Brings New RAS Capability. White paper. Retrieved from https:\/\/www.amd.com\/system\/files\/2017-06\/AMD-EPYC-Brings-New-RAS-Capability.pdf."},{"key":"e_1_2_1_125_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2009.2032372"},{"key":"e_1_2_1_126_1","volume-title":"Proceedings of the Conference on Artificial Neural Networks and Neural Information Processing (ICANN\/ICONIP\u201903)","author":"Murray J.","unstructured":"J. Murray , G. Hughes , and K. Kreutz-Delgado . 2003. Hard drive failure prediction using non-parametric statistical methods . In Proceedings of the Conference on Artificial Neural Networks and Neural Information Processing (ICANN\/ICONIP\u201903) . J. Murray, G. Hughes, and K. Kreutz-Delgado. 2003. Hard drive failure prediction using non-parametric statistical methods. 
In Proceedings of the Conference on Artificial Neural Networks and Neural Information Processing (ICANN\/ICONIP\u201903)."},{"key":"e_1_2_1_127_1","volume-title":"Proceedings of the IEEE International Reliability Physics Symposium (IRPS\u201918)","author":"Narasimham B.","year":"2018","unstructured":"B. Narasimham , S. Gupta , D. Reed , J. K. Wang , N. Hendrickson , and H. Taufique . 2018. Scaling trends and bias dependence of the soft error rate of 16 nm and 7 nm FinFET SRAMs . In Proceedings of the IEEE International Reliability Physics Symposium (IRPS\u201918) . 4C.1--1--4C.1--4. DOI:https:\/\/doi.org\/10.1109\/IRPS. 2018 .8353583 10.1109\/IRPS.2018.8353583 B. Narasimham, S. Gupta, D. Reed, J. K. Wang, N. Hendrickson, and H. Taufique. 2018. Scaling trends and bias dependence of the soft error rate of 16 nm and 7 nm FinFET SRAMs. In Proceedings of the IEEE International Reliability Physics Symposium (IRPS\u201918). 4C.1--1--4C.1--4. DOI:https:\/\/doi.org\/10.1109\/IRPS.2018.8353583"},{"key":"e_1_2_1_128_1","volume-title":"Proceedings of the Design, Automation Test in Europe Conference Exhibition. 1--6. DOI:https:\/\/doi.org\/10","author":"Narayanasamy S.","year":"2007","unstructured":"S. Narayanasamy , A. K. Coskun , and B. Calder . 2007. Transient fault prediction based on anomalies in processor events . In Proceedings of the Design, Automation Test in Europe Conference Exhibition. 1--6. DOI:https:\/\/doi.org\/10 .1109\/DATE. 2007 .364448 10.1109\/DATE.2007.364448 S. Narayanasamy, A. K. Coskun, and B. Calder. 2007. Transient fault prediction based on anomalies in processor events. In Proceedings of the Design, Automation Test in Europe Conference Exhibition. 1--6. DOI:https:\/\/doi.org\/10.1109\/DATE.2007.364448"},{"key":"e_1_2_1_129_1","volume-title":"Proceedings of the 6th IEEE Real-Time Systems Symposium (RTSS\u201985)","author":"Nassar F. A.","unstructured":"F. A. Nassar and D. M. Andrews . 1985. A methodology for analysis of failure prediction data . 
In Proceedings of the 6th IEEE Real-Time Systems Symposium (RTSS\u201985) . 160--166. F. A. Nassar and D. M. Andrews. 1985. A methodology for analysis of failure prediction data. In Proceedings of the 6th IEEE Real-Time Systems Symposium (RTSS\u201985). 160--166."},{"key":"e_1_2_1_130_1","volume-title":"Proceedings of the IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum. 104--113","author":"Nukada A.","year":"2011","unstructured":"A. Nukada , H. Takizawa , and S. Matsuoka . 2011. NVCR: A transparent checkpoint-restart library for NVIDIA CUDA . In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum. 104--113 . DOI:https:\/\/doi.org\/10.1109\/IPDPS. 2011 .131 10.1109\/IPDPS.2011.131 A. Nukada, H. Takizawa, and S. Matsuoka. 2011. NVCR: A transparent checkpoint-restart library for NVIDIA CUDA. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum. 104--113. DOI:https:\/\/doi.org\/10.1109\/IPDPS.2011.131"},{"key":"e_1_2_1_131_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.micpro.2017.05.019"},{"key":"e_1_2_1_132_1","volume-title":"Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT\u201914)","author":"Oliveira D. A. G.","year":"2014","unstructured":"D. A. G. Oliveira , P. Rech , L. L. Pilla , P. O. A. Navaux , and L. Carro . 2014. GPGPUs ECC efficiency and efficacy . In Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT\u201914) . 209--215. DOI:https:\/\/doi.org\/10.1109\/DFT. 2014 .6962085 10.1109\/DFT.2014.6962085 D. A. G. Oliveira, P. Rech, L. L. Pilla, P. O. A. Navaux, and L. Carro. 2014. GPGPUs ECC efficiency and efficacy. In Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT\u201914). 209--215. 
DOI:https:\/\/doi.org\/10.1109\/DFT.2014.6962085"},{"key":"e_1_2_1_133_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2017.04.015"},{"key":"#cr-split#-e_1_2_1_134_1.1","doi-asserted-by":"crossref","unstructured":"K. O'Brien I. Pietri R. Reddy A. Lastovetsky and R. Sakellariou. 2017. A survey of power and energy predictive models in HPC systems and applications. ACM Comput. Surv. 50 3 Article 37 (June 2017) 38 pages. DOI:https:\/\/doi.org\/10.1145\/3078811 10.1145\/3078811","DOI":"10.1145\/3078811"},{"key":"#cr-split#-e_1_2_1_134_1.2","doi-asserted-by":"crossref","unstructured":"K. O'Brien I. Pietri R. Reddy A. Lastovetsky and R. Sakellariou. 2017. A survey of power and energy predictive models in HPC systems and applications. ACM Comput. Surv. 50 3 Article 37 (June 2017) 38 pages. DOI:https:\/\/doi.org\/10.1145\/3078811","DOI":"10.1145\/3078811"},{"key":"e_1_2_1_135_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2010.100"},{"key":"e_1_2_1_136_1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201915)","author":"Pe\u00f1a A. J.","unstructured":"A. J. Pe\u00f1a , W. Bland , and P. Balaji . 2015. VOCL-FT: Introducing techniques for efficient soft error coprocessor recovery . In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201915) . 1--12. DOI:https:\/\/doi.org\/10.1145\/2807591.2807640 10.1145\/2807591.2807640 A. J. Pe\u00f1a, W. Bland, and P. Balaji. 2015. VOCL-FT: Introducing techniques for efficient soft error coprocessor recovery. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201915). 1--12. 
DOI:https:\/\/doi.org\/10.1145\/2807591.2807640"},{"key":"e_1_2_1_137_1","doi-asserted-by":"publisher","DOI":"10.1109\/TR.2002.804733"},{"key":"e_1_2_1_138_1","volume-title":"Proceedings of the 3rd IEEE International High-Assurance Systems Engineering Symposium. 214--223","author":"Pizza M.","year":"1998","unstructured":"M. Pizza , L. Strigini , A. Bondavalli , and F. Di Giandomenico . 1998. Optimal discrimination between transient and permanent faults . In Proceedings of the 3rd IEEE International High-Assurance Systems Engineering Symposium. 214--223 . DOI:https:\/\/doi.org\/10.1109\/HASE. 1998 .731615 10.1109\/HASE.1998.731615 M. Pizza, L. Strigini, A. Bondavalli, and F. Di Giandomenico. 1998. Optimal discrimination between transient and permanent faults. In Proceedings of the 3rd IEEE International High-Assurance Systems Engineering Symposium. 214--223. DOI:https:\/\/doi.org\/10.1109\/HASE.1998.731615"},{"key":"e_1_2_1_139_1","volume-title":"Proceedings of the IEEE Cluster Conference (CLUSTER\u201917)","author":"Pourghassemi B.","unstructured":"B. Pourghassemi and A. Chandramowlishwaran . 2017. cudaCR: An in-kernel application-level checkpoint\/restart scheme for CUDA-enabled GPUs . In Proceedings of the IEEE Cluster Conference (CLUSTER\u201917) . B. Pourghassemi and A. Chandramowlishwaran. 2017. cudaCR: An in-kernel application-level checkpoint\/restart scheme for CUDA-enabled GPUs. In Proceedings of the IEEE Cluster Conference (CLUSTER\u201917)."},{"key":"e_1_2_1_140_1","volume-title":"Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium. 1136--1143","author":"Rajachandrasekar R.","year":"2012","unstructured":"R. Rajachandrasekar , X. Besseron , and D. K. Panda . 2012. Monitoring and predicting hardware failures in HPC clusters with FTB-IPMI . In Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium. 1136--1143 . DOI:https:\/\/doi.org\/10.1109\/IPDPSW. 2012 .139 10.1109\/IPDPSW.2012.139 R. 
Rajachandrasekar, X. Besseron, and D. K. Panda. 2012. Monitoring and predicting hardware failures in HPC clusters with FTB-IPMI. In Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium. 1136--1143. DOI:https:\/\/doi.org\/10.1109\/IPDPSW.2012.139"},{"key":"e_1_2_1_141_1","doi-asserted-by":"publisher","DOI":"10.1145\/1555815.1555793"},{"key":"e_1_2_1_142_1","unstructured":"Paolo Rech. [n.d.]. Reliability Issues in Current and Future Supercomputers. Retrieved from http:\/\/energysfe.ufsc.br\/slides\/Paolo-Rech-260917.pdf. Paolo Rech. [n.d.]. Reliability Issues in Current and Future Supercomputers. Retrieved from http:\/\/energysfe.ufsc.br\/slides\/Paolo-Rech-260917.pdf."},{"key":"#cr-split#-e_1_2_1_143_1.1","doi-asserted-by":"crossref","unstructured":"F. Reghenzani G. Massari and W. Fornaciari. 2019. The real-time Linux kernel: A survey on PREEMPT_RT. Comput. Surveys 52 1 Article 18 (Feb. 2019) 36 pages. DOI:https:\/\/doi.org\/10.1145\/3297714 10.1145\/3297714","DOI":"10.1145\/3297714"},{"key":"#cr-split#-e_1_2_1_143_1.2","doi-asserted-by":"crossref","unstructured":"F. Reghenzani G. Massari and W. Fornaciari. 2019. The real-time Linux kernel: A survey on PREEMPT_RT. Comput. Surveys 52 1 Article 18 (Feb. 2019) 36 pages. DOI:https:\/\/doi.org\/10.1145\/3297714","DOI":"10.1145\/3297714"},{"key":"e_1_2_1_144_1","volume-title":"Proceedings of the 23rd European MPI Users\u2019 Group Meeting (EuroMPI\u201916)","author":"Reghenzani F.","unstructured":"F. Reghenzani , G. Pozzi , G. Massari , S. Libutti , and W. Fornaciari . 2016. The MIG framework: Enabling transparent process migration in open MPI . In Proceedings of the 23rd European MPI Users\u2019 Group Meeting (EuroMPI\u201916) . ACM, New York, NY, 64--73. DOI:https:\/\/doi.org\/10.1145\/2966884.2966903 10.1145\/2966884.2966903 F. Reghenzani, G. Pozzi, G. Massari, S. Libutti, and W. Fornaciari. 2016. The MIG framework: Enabling transparent process migration in open MPI. 
In Proceedings of the 23rd European MPI Users\u2019 Group Meeting (EuroMPI\u201916). ACM, New York, NY, 64--73. DOI:https:\/\/doi.org\/10.1145\/2966884.2966903"},{"key":"#cr-split#-e_1_2_1_145_1.1","doi-asserted-by":"crossref","unstructured":"F. Salfner M. Lenk and M. Malek. 2010. A survey of online failure prediction methods. ACM Comput. Surv. 42 3 Article 10 (March 2010) 42 pages. DOI:https:\/\/doi.org\/10.1145\/1670679.1670680 10.1145\/1670679.1670680","DOI":"10.1145\/1670679.1670680"},{"key":"#cr-split#-e_1_2_1_145_1.2","doi-asserted-by":"crossref","unstructured":"F. Salfner M. Lenk and M. Malek. 2010. A survey of online failure prediction methods. ACM Comput. Surv. 42 3 Article 10 (March 2010) 42 pages. DOI:https:\/\/doi.org\/10.1145\/1670679.1670680","DOI":"10.1145\/1670679.1670680"},{"key":"e_1_2_1_146_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2006.1639672"},{"key":"e_1_2_1_147_1","volume-title":"Proceedings of the IEEE 31st VLSI Test Symposium (VTS\u201913)","author":"Sari A.","year":"2013","unstructured":"A. Sari , M. Psarakis , and D. Gizopoulos . 2013. Combining checkpointing and scrubbing in FPGA-based real-time systems . In Proceedings of the IEEE 31st VLSI Test Symposium (VTS\u201913) . 1--6. DOI:https:\/\/doi.org\/10.1109\/VTS. 2013 .6548910 10.1109\/VTS.2013.6548910 A. Sari, M. Psarakis, and D. Gizopoulos. 2013. Combining checkpointing and scrubbing in FPGA-based real-time systems. In Proceedings of the IEEE 31st VLSI Test Symposium (VTS\u201913). 1--6. DOI:https:\/\/doi.org\/10.1109\/VTS.2013.6548910"},{"key":"e_1_2_1_148_1","volume-title":"Proceedings of the International Conference on Dependable Systems and Networks.721--730","author":"Shereshevsky M.","year":"2003","unstructured":"M. Shereshevsky , J. Crowell , B. Cukic , V. Gandikota , and Y. Liu . 2003. Software aging and multifractality of memory resources . In Proceedings of the International Conference on Dependable Systems and Networks.721--730 . 
DOI:https:\/\/doi.org\/10.1109\/DSN. 2003 .1209987 10.1109\/DSN.2003.1209987 M. Shereshevsky, J. Crowell, B. Cukic, V. Gandikota, and Y. Liu. 2003. Software aging and multifractality of memory resources. In Proceedings of the International Conference on Dependable Systems and Networks.721--730. DOI:https:\/\/doi.org\/10.1109\/DSN.2003.1209987"},{"key":"e_1_2_1_149_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2011.112"},{"key":"e_1_2_1_150_1","doi-asserted-by":"crossref","unstructured":"D. P. Siewiorek and R. S. Swarz. 1998. Reliable Computer Systems 3rd ed. A. K. Peters Ltd. D. P. Siewiorek and R. S. Swarz. 1998. Reliable Computer Systems 3rd ed. A. K. Peters Ltd.","DOI":"10.1201\/9781439863961"},{"key":"e_1_2_1_151_1","volume-title":"Proceedings of the Conference on Intelligent System Application to Power Systems (ISAP\u201997)","author":"Singer R. M.","unstructured":"R. M. Singer , K. Gross , J. Herzog , R. W. King , and S. Wegerich . 1997. Model-based nuclear power plant monitoring and fault detection: Theoretical foundations . In Proceedings of the Conference on Intelligent System Application to Power Systems (ISAP\u201997) . R. M. Singer, K. Gross, J. Herzog, R. W. King, and S. Wegerich. 1997. Model-based nuclear power plant monitoring and fault detection: Theoretical foundations. 
In Proceedings of the Conference on Intelligent System Application to Power Systems (ISAP\u201997)."},{"key":"e_1_2_1_152_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10723-015-9359-2"},{"key":"e_1_2_1_153_1","doi-asserted-by":"publisher","DOI":"10.1109\/40.755464"},{"key":"e_1_2_1_154_1","doi-asserted-by":"publisher","DOI":"10.5555\/279869.279878"},{"key":"e_1_2_1_155_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2014.2323194"},{"key":"e_1_2_1_156_1","doi-asserted-by":"publisher","DOI":"10.1145\/2786763.2694348"},{"key":"e_1_2_1_157_1","doi-asserted-by":"publisher","DOI":"10.1109\/IRPS.2018.8353539"},{"key":"e_1_2_1_158_1","doi-asserted-by":"publisher","DOI":"10.5555\/645606.660853"},{"key":"e_1_2_1_159_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPPS.1996.508106"},{"key":"e_1_2_1_160_1","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2010.69"},{"key":"e_1_2_1_161_1","volume-title":"Proceedings of the 16th IEEE\/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid\u201916)","author":"Subasi O.","year":"2016","unstructured":"O. Subasi , S. Di , L. Bautista-Gomez , P. Balaprakash , O. Unsal , J. Labarta , A. Cristal , and F. Cappello . 2016. Spatial support vector regression to detect silent errors in the exascale era . In Proceedings of the 16th IEEE\/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid\u201916) . 413--424. DOI:https:\/\/doi.org\/10.1109\/CCGrid. 2016 .33 10.1109\/CCGrid.2016.33 O. Subasi, S. Di, L. Bautista-Gomez, P. Balaprakash, O. Unsal, J. Labarta, A. Cristal, and F. Cappello. 2016. Spatial support vector regression to detect silent errors in the exascale era. In Proceedings of the 16th IEEE\/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid\u201916). 413--424. 
DOI:https:\/\/doi.org\/10.1109\/CCGrid.2016.33"},{"key":"e_1_2_1_162_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.suscom.2018.01.004"},{"key":"e_1_2_1_163_1","volume-title":"Proceedings of the 51st Annual Design Automation Conference (DAC\u201914)","author":"Sutaria K.","unstructured":"K. Sutaria , A. Ramkumar , R. Zhu , R. Rajveev , Y. Ma , and Y. Cao . 2014. BTI-induced aging under random stress waveforms: Modeling, simulation and silicon validation . In Proceedings of the 51st Annual Design Automation Conference (DAC\u201914) . ACM, New York, NY, Article 203, 6 pages. DOI:https:\/\/doi.org\/10.1145\/2593069.2593101 10.1145\/2593069.2593101 K. Sutaria, A. Ramkumar, R. Zhu, R. Rajveev, Y. Ma, and Y. Cao. 2014. BTI-induced aging under random stress waveforms: Modeling, simulation and silicon validation. In Proceedings of the 51st Annual Design Automation Conference (DAC\u201914). ACM, New York, NY, Article 203, 6 pages. DOI:https:\/\/doi.org\/10.1145\/2593069.2593101"},{"key":"e_1_2_1_164_1","volume-title":"Proceedings of the GPU Technology Conference (GTC\u201916)","author":"Suzuki T.","unstructured":"T. Suzuki , A. Nukada , and S. Matsuoka . 2016. Transparent checkpoint and restart technology for CUDA applications . In Proceedings of the GPU Technology Conference (GTC\u201916) . https:\/\/tinyurl.com\/ycb7y8xw. T. Suzuki, A. Nukada, and S. Matsuoka. 2016. Transparent checkpoint and restart technology for CUDA applications. In Proceedings of the GPU Technology Conference (GTC\u201916). https:\/\/tinyurl.com\/ycb7y8xw."},{"key":"e_1_2_1_165_1","volume-title":"Proceedings of the IEEE International Parallel Distributed Processing Symposium. 864--876","author":"Takizawa H.","year":"2011","unstructured":"H. Takizawa , K. Koyama , K. Sato , K. Komatsu , and H. Kobayashi . 2011. CheCL: Transparent checkpointing and process migration of OpenCL applications . In Proceedings of the IEEE International Parallel Distributed Processing Symposium. 864--876 . 
DOI:https:\/\/doi.org\/10.1109\/IPDPS. 2011 .85 10.1109\/IPDPS.2011.85 H. Takizawa, K. Koyama, K. Sato, K. Komatsu, and H. Kobayashi. 2011. CheCL: Transparent checkpointing and process migration of OpenCL applications. In Proceedings of the IEEE International Parallel Distributed Processing Symposium. 864--876. DOI:https:\/\/doi.org\/10.1109\/IPDPS.2011.85"},{"key":"e_1_2_1_166_1","volume-title":"Proceedings of the International Conference on Parallel and Distributed Computing, Applications and Technologies. 408--413","author":"Takizawa H.","year":"2009","unstructured":"H. Takizawa , K. Sato , K. Komatsu , and H. Kobayashi . 2009. CheCUDA: A checkpoint\/restart tool for CUDA applications . In Proceedings of the International Conference on Parallel and Distributed Computing, Applications and Technologies. 408--413 . DOI:https:\/\/doi.org\/10.1109\/PDCAT. 2009 .78 10.1109\/PDCAT.2009.78 H. Takizawa, K. Sato, K. Komatsu, and H. Kobayashi. 2009. CheCUDA: A checkpoint\/restart tool for CUDA applications. In Proceedings of the International Conference on Parallel and Distributed Computing, Applications and Technologies. 408--413. DOI:https:\/\/doi.org\/10.1109\/PDCAT.2009.78"},{"key":"e_1_2_1_167_1","doi-asserted-by":"publisher","DOI":"10.1109\/12.192214"},{"key":"e_1_2_1_168_1","volume-title":"Proceedings of the IEEE 21st International Symposium on High Performance Computer Architecture (HPCA\u201915)","author":"Tiwari D.","year":"2015","unstructured":"D. Tiwari , S. Gupta , J. Rogers , D. Maxwell , P. Rech , S. Vazhkudai , D. Oliveira , D. Londo , N. DeBardeleben , P. Navaux , L. Carro , and A. Bland . 2015. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation . In Proceedings of the IEEE 21st International Symposium on High Performance Computer Architecture (HPCA\u201915) . 331--342. DOI:https:\/\/doi.org\/10.1109\/HPCA. 2015 .7056044 10.1109\/HPCA.2015.7056044 D. Tiwari, S. Gupta, J. Rogers, D. Maxwell, P. Rech, S. 
Vazhkudai, D. Oliveira, D. Londo, N. DeBardeleben, P. Navaux, L. Carro, and A. Bland. 2015. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. In Proceedings of the IEEE 21st International Symposium on High Performance Computer Architecture (HPCA\u201915). 331--342. DOI:https:\/\/doi.org\/10.1109\/HPCA.2015.7056044"},{"key":"e_1_2_1_169_1","volume-title":"Proceedings of the IEEE 21st International Symposium on High Performance Computer Architecture (HPCA\u201915)","author":"Tiwari D.","year":"2015","unstructured":"D. Tiwari , S. Gupta , J. Rogers , D. Maxwell , P. Rech , S. Vazhkudai , D. Oliveira , D. Londo , N. Debardeleben , P. Navaux , L. Carro , and A. Bland . 2015. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation . In Proceedings of the IEEE 21st International Symposium on High Performance Computer Architecture (HPCA\u201915) . 331--342. DOI:https:\/\/doi.org\/10.1109\/HPCA. 2015 .7056044 10.1109\/HPCA.2015.7056044 D. Tiwari, S. Gupta, J. Rogers, D. Maxwell, P. Rech, S. Vazhkudai, D. Oliveira, D. Londo, N. Debardeleben, P. Navaux, L. Carro, and A. Bland. 2015. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation. In Proceedings of the IEEE 21st International Symposium on High Performance Computer Architecture (HPCA\u201915). 331--342. DOI:https:\/\/doi.org\/10.1109\/HPCA.2015.7056044"},{"key":"e_1_2_1_170_1","volume-title":"Proceedings of the International Joint Conference on Neural Networks (IJCNN\u201990)","volume":"2","author":"Troudet T.","year":"1990","unstructured":"T. Troudet and W. Merrill . 1990. A real time neural net estimator of fatigue life . In Proceedings of the International Joint Conference on Neural Networks (IJCNN\u201990) . 59--64 vol. 2 . DOI:https:\/\/doi.org\/10.1109\/IJCNN. 1990 .137695 10.1109\/IJCNN.1990.137695 T. Troudet and W. Merrill. 1990. 
A real time neural net estimator of fatigue life. In Proceedings of the International Joint Conference on Neural Networks (IJCNN\u201990). 59--64 vol. 2. DOI:https:\/\/doi.org\/10.1109\/IJCNN.1990.137695"},{"key":"e_1_2_1_171_1","unstructured":"D. Turnbull and N. Alldrin. 2003. Failure Prediction in Hardware Systems. Tech. rep. University of California San Diego CA. Retrieved from http:\/\/www.cs.ucsd.edu\/ dturnbul\/Papers\/ServerPrediction.pdf. D. Turnbull and N. Alldrin. 2003. Failure Prediction in Hardware Systems. Tech. rep. University of California San Diego CA. Retrieved from http:\/\/www.cs.ucsd.edu\/ dturnbul\/Papers\/ServerPrediction.pdf."},{"key":"e_1_2_1_172_1","doi-asserted-by":"publisher","DOI":"10.1147\/sj.413.0461"},{"key":"e_1_2_1_173_1","volume-title":"Proceedings of the IEEE International Conference on Data Mining. 474--481","author":"Vilalta R.","year":"2002","unstructured":"R. Vilalta and S. Ma . 2002. Predicting rare events in temporal domains . In Proceedings of the IEEE International Conference on Data Mining. 474--481 . DOI:https:\/\/doi.org\/10.1109\/ICDM. 2002 .1183991 10.1109\/ICDM.2002.1183991 R. Vilalta and S. Ma. 2002. Predicting rare events in temporal domains. In Proceedings of the IEEE International Conference on Data Mining. 474--481. DOI:https:\/\/doi.org\/10.1109\/ICDM.2002.1183991"},{"key":"e_1_2_1_174_1","doi-asserted-by":"publisher","DOI":"10.1109\/MIC.2007.132"},{"key":"e_1_2_1_175_1","volume-title":"Proceedings of the ACM\/IEEE Conference on Supercomputing. IEEE Press, 43","author":"Wang C.","unstructured":"C. Wang , F. Mueller , C. Engelmann , and S. L. Scott . 2008. Proactive process-level live migration in HPC environments . In Proceedings of the ACM\/IEEE Conference on Supercomputing. IEEE Press, 43 . C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. 2008. Proactive process-level live migration in HPC environments. In Proceedings of the ACM\/IEEE Conference on Supercomputing. 
IEEE Press, 43."},{"key":"e_1_2_1_176_1","volume-title":"Proceedings of the 12th Annual International Computer Software Applications Conference (COMPSAC\u201988)","author":"Wang J. P.","year":"1988","unstructured":"J. P. Wang and S. M. Shatz . 1988. Reliability-oriented task allocation in redundant distributed systems . In Proceedings of the 12th Annual International Computer Software Applications Conference (COMPSAC\u201988) . 276--283. DOI:https:\/\/doi.org\/10.1109\/CMPSAC. 1988 .17186 10.1109\/CMPSAC.1988.17186 J. P. Wang and S. M. Shatz. 1988. Reliability-oriented task allocation in redundant distributed systems. In Proceedings of the 12th Annual International Computer Software Applications Conference (COMPSAC\u201988). 276--283. DOI:https:\/\/doi.org\/10.1109\/CMPSAC.1988.17186"},{"key":"e_1_2_1_177_1","doi-asserted-by":"publisher","DOI":"10.1145\/306225.306237"},{"key":"e_1_2_1_178_1","volume-title":"Proceedings of the Genetic and Evolutionary Computation Conference. Morgan Kaufmann, 718--725","author":"Weiss G. M.","year":"1999","unstructured":"G. M. Weiss . 1999 . Timeweaver: A genetic algorithm for identifying predictive patterns in sequences of events . In Proceedings of the Genetic and Evolutionary Computation Conference. Morgan Kaufmann, 718--725 . G. M. Weiss. 1999. Timeweaver: A genetic algorithm for identifying predictive patterns in sequences of events. In Proceedings of the Genetic and Evolutionary Computation Conference. Morgan Kaufmann, 718--725."},{"key":"e_1_2_1_179_1","doi-asserted-by":"publisher","DOI":"10.1145\/1347375.1347389"},{"key":"e_1_2_1_180_1","volume-title":"Proceedings of the 8th IEEE\/ACM\/IFIP International Conference on Hardware\/Software Codesign and System Synthesis (CODES\/ISSS\u201910)","author":"Xiang Y.","year":"2010","unstructured":"Y. Xiang , T. Chantem , R. P. Dick , X. S. Hu , and L. Shang . 2010. System-level reliability modeling for MPSoCs . 
In Proceedings of the 8th IEEE\/ACM\/IFIP International Conference on Hardware\/Software Codesign and System Synthesis (CODES\/ISSS\u201910) . ACM, New York, NY, 297--306. DOI:https:\/\/doi.org\/10.1145\/ 1878 961.1879013 10.1145\/1878961.1879013 Y. Xiang, T. Chantem, R. P. Dick, X. S. Hu, and L. Shang. 2010. System-level reliability modeling for MPSoCs. In Proceedings of the 8th IEEE\/ACM\/IFIP International Conference on Hardware\/Software Codesign and System Synthesis (CODES\/ISSS\u201910). ACM, New York, NY, 297--306. DOI:https:\/\/doi.org\/10.1145\/1878961.1879013"},{"key":"e_1_2_1_181_1","volume-title":"Proceedings of the 5th International Conference on Computer Science Education. 1895--1899","author":"Xu X.","year":"2010","unstructured":"X. Xu , Y. Lin , T. Tang , and Y. Lin . 2010. HiAL-Ckpt: A hierarchical application-level checkpointing for CPU-GPU hybrid systems . In Proceedings of the 5th International Conference on Computer Science Education. 1895--1899 . DOI:https:\/\/doi.org\/10.1109\/ICCSE. 2010 .5593819 10.1109\/ICCSE.2010.5593819 X. Xu, Y. Lin, T. Tang, and Y. Lin. 2010. HiAL-Ckpt: A hierarchical application-level checkpointing for CPU-GPU hybrid systems. In Proceedings of the 5th International Conference on Computer Science Education. 1895--1899. DOI:https:\/\/doi.org\/10.1109\/ICCSE.2010.5593819"},{"key":"e_1_2_1_182_1","volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201908)","author":"Yang J.","year":"2008","unstructured":"J. Yang , X. Zhou , M. Chrobak , Y. Zhang , and L. Jin . 2008. Dynamic thermal management through task scheduling . In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201908) . 191--201. DOI:https:\/\/doi.org\/10.1109\/ISPASS. 2008 .4510751 10.1109\/ISPASS.2008.4510751 J. Yang, X. Zhou, M. Chrobak, Y. Zhang, and L. Jin. 2008. Dynamic thermal management through task scheduling. 
In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201908). 191--201. DOI:https:\/\/doi.org\/10.1109\/ISPASS.2008.4510751"},{"key":"e_1_2_1_183_1","unstructured":"E. Yilmaz and L. Gilly. 2012. Redundancy and Reliability for an HPC Data Centre. Retrieved from http:\/\/www.prace-ri.eu\/IMG\/pdf\/HPC-Centre-Redundancy-Reliability-WhitePaper.pdf. E. Yilmaz and L. Gilly. 2012. Redundancy and Reliability for an HPC Data Centre. Retrieved from http:\/\/www.prace-ri.eu\/IMG\/pdf\/HPC-Centre-Redundancy-Reliability-WhitePaper.pdf."},{"key":"e_1_2_1_184_1","volume-title":"SLURM: Simple Linux utility for resource management. In Job Scheduling Strategies for Parallel Processing","author":"Yoo A. B.","year":"2003","unstructured":"A. B. Yoo , M. A. Jette , and M. Grondona . 2003 . SLURM: Simple Linux utility for resource management. In Job Scheduling Strategies for Parallel Processing , Dror Feitelson, Larry Rudolph, and Uwe Schwiegelshohn (Eds.). Springer Berlin , 44--60. A. B. Yoo, M. A. Jette, and M. Grondona. 2003. SLURM: Simple Linux utility for resource management. In Job Scheduling Strategies for Parallel Processing, Dror Feitelson, Larry Rudolph, and Uwe Schwiegelshohn (Eds.). 
Springer Berlin, 44--60."},{"key":"e_1_2_1_185_1","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2017.2680464"}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3403956","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,1]],"date-time":"2023-01-01T13:10:21Z","timestamp":1672578621000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3403956"}},"subtitle":["State of the Art and Perspectives"],"short-title":[],"issued":{"date-parts":[[2020,9,28]]},"references-count":189,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2021,9,30]]}},"alternative-id":["10.1145\/3403956"],"URL":"https:\/\/doi.org\/10.1145\/3403956","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,9,28]]},"assertion":[{"value":"2019-10-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-05-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-09-28","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}