Abstract
Due to the non-associativity of floating-point operations and dynamic scheduling on parallel architectures, obtaining a bit-wise reproducible floating-point result across multiple executions of the same code, on different or even on similar parallel architectures, is challenging. In this paper, we address the problem of reproducibility in the context of matrix multiplication and propose an algorithm that yields both reproducible and accurate results. This algorithm is composed of two main stages: a filtering stage that uses fast vectorized floating-point expansions in conjunction with error-free transformations, and an accumulation stage based on Kulisch long accumulators in a high-radix carry-save representation. Finally, we provide implementations and performance results in parallel environments such as GPUs.
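To make the two filtering-stage primitives concrete, the following is a minimal C++ sketch of the classical error-free transformations (Knuth's TwoSum and an FMA-based TwoProd) feeding a small floating-point expansion. It is only an illustration under our own naming (`two_sum`, `two_prod`, `expansion_accumulate`), not the authors' GPU implementation, and the final exact flush into a Kulisch long accumulator is elided.

```cpp
// Illustrative sketch (not the paper's code): error-free transformations
// plus accumulation into a fixed-size floating-point expansion.
#include <cmath>
#include <cstdio>
#include <vector>

// Knuth's TwoSum: s + e == a + b exactly, with s = fl(a + b).
static void two_sum(double a, double b, double &s, double &e) {
    s = a + b;
    double t = s - a;
    e = (a - (s - t)) + (b - t);
}

// FMA-based TwoProd: p + e == a * b exactly, with p = fl(a * b).
static void two_prod(double a, double b, double &p, double &e) {
    p = a * b;
    e = std::fma(a, b, -p);  // exact residual of the product
}

// Add x into an expansion of non-overlapping limbs, propagating the
// rounding error of each partial sum to the next limb.
static double expansion_accumulate(std::vector<double> &exp, double x) {
    for (double &limb : exp) {
        double s, e;
        two_sum(limb, x, s, e);
        limb = s;
        x = e;               // carry the rounding error downward
    }
    return x;                // residue for the long-accumulator stage
}

int main() {
    std::vector<double> acc(4, 0.0);       // expansion of size 4
    double a[] = {1e16, 1.0, -1e16, 1.0};  // cancellation-prone data
    double b[] = {1.0, 1.0, 1.0, 1.0};
    double residue = 0.0;  // zero here; flushed exactly in the full scheme
    for (int i = 0; i < 4; ++i) {
        double p, e;
        two_prod(a[i], b[i], p, e);        // exact product as p + e
        residue += expansion_accumulate(acc, p);
        residue += expansion_accumulate(acc, e);
    }
    double sum = residue;
    for (double limb : acc) sum += limb;
    std::printf("dot product = %.17g\n", sum);  // exact result: 2
}
```

In the full algorithm, any residue that does not fit in the expansion would be added exactly into the high-radix carry-save long accumulator; since every intermediate sum is then exact, the final result no longer depends on the order of accumulation, which is what yields reproducibility.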
Notes
1. In general, x stands for four different formats, but in the scope of this article we consider x to correspond to single (S) or double (D) precision.
Acknowledgement
This work was undertaken (partially) in the framework of CALSIMLAB, supported by the public grant ANR-11-LABX-0037-01 overseen by the French National Research Agency (ANR) as part of the “Investissements d’Avenir” program (reference: ANR-11-IDEX-0004-02). This work was also (partially) supported by the FastRelax project through the ANR public grant (reference: ANR-14-CE25-0018-01).
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Iakymchuk, R., Defour, D., Collange, C., Graillat, S. (2016). Reproducible and Accurate Matrix Multiplication. In: Nehmeier, M., Wolff von Gudenberg, J., Tucker, W. (eds) Scientific Computing, Computer Arithmetic, and Validated Numerics. SCAN 2015. Lecture Notes in Computer Science, vol 9553. Springer, Cham. https://doi.org/10.1007/978-3-319-31769-4_11
DOI: https://doi.org/10.1007/978-3-319-31769-4_11
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31768-7
Online ISBN: 978-3-319-31769-4
eBook Packages: Computer Science, Computer Science (R0)