Abstract
The Vertical Block–cyclic Distributed Parallel LU Factorization Method (VBPLU) is effectively processed on a distributed memory parallel computer. VBPLU is based on the two techniques, the block algorithm and the aggregation of communications. Since startup time dominates the data communication and the aggregation reduces communication isssues, the total performance has been much improved. Furthermore this method uses long vectors so that it is also advantageous on vector processors. In this paper, we have constructed a modeling of VBPLU using a simplified LogGP model with analytical formulae, and estimated accurately the computational cost taking into account load distributions caused by data layout and process mapping. Some knowledge for optimization of block algorithm has been obtained. Our estimations have been verified through numerical experiments on three different distributed memory parallel computers.
Similar content being viewed by others
References
A. Alexandrov, et al. LogGP: incorporating long messages into the LogP model. In Proceeding of the 7th Annual Symposium on Parallel Algorithms and Architectures, pp. 95–105, ACM Press, 1995.
A. Anderson, et al. LAPACK: portable linear algebra library for high-performance computers. LAPACK Working Note 20, Technical Report CS–90–105, University of Tennessee, 1990.
M. Barnett, et al. Building a high-performance collective communication library. In Proceeding of Supercomputing, 1994.
L.S. Blackford, et al. ScaLAPACK: a linear algebra library for message-passing. SIAM Conference on Parallel Processing, 1997.
D. Culler, et. al. LogP: towards a realistic model of parallel computation. In Proceeding of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1993.
E.D. D'Azevedo and J.J. Dongarra. The design and implementation of the parallel out-of-core ScaLAPACK LU, QR and Cholesky factorization routines. LAPACK working Note 118, Technical Report CS–97–347. University of Tennessee, 1997.
F. Desprez, B. Tourancheau, J.J. Dongarra. Performance complexity of LU factorization with efficient pipelining and overlap on a multiprocessor. LAPACK Working Note 67. CRPC-TR94417, Rice University, 1994.
J.J. Dongarra. 1998. Performance of various computers using standard linear equations software. Technical Report CS–89–85, Univ. Tennessee.
J.J. Dongarra and R. A. Van de Geijn. Reduction to condensed form for the Eigenvalue problem on distributed memory architectures. Parallel Computing, 18:973–982, 1992.
J.J. Dongarra and D.W. Walker. Constructing numerical software libraries for high-performance computer environments. Parallel and Distributed Computing Handbook, pp. 917–954, McGraw-Hill, 1996.
K. Dowd. High Performance Computing, O'Reilly & Associates, Inc., 1993.
G.H. Golub and J.M. Ortega. Scientific Computing: An Introduction with Parallel Computing. Academic Press, 1993..
G.H. Golub and C.F. Van Loan. Matrix Computations, 3rd ed. Johns Hopkins University Press, 1996.
W. Gropp and E. Lusk. Installation guide for mpich, a portable implementation of MPI. Technical Report ANL-96/5. Argonne National Laboratory, 1994.
B. Hendlickson and D. Womble. The torus-wrap mapping for dense matrix calculations on massively parallel computers. SIAM J. Sci. Stat. Comput, 15(5):1201–1226, 1994.
T. Imamura. Matrix computation with parallel language ADETRAN4, - - Implementation on VPP500 and its Applications- - . In Proceeding of JSPP'96 in Japan, pp. 17–24, 1996 (in Japanese).
W.F. McColl. Bulk synchronous parallel computing. In Proceeding of 2nd Workshop on Abstract Models for Parallel Computation, 1994.
MPIF eds. MPI: A message passing interface standard, version 1.1, 1995.
T. Nogi. Promising data parallel environment- - ADEPS, ADETRAN, ADENA- - . In Proceeding of pAs'95 in Aizu, pp. 45–53, IEEE Computer Society, 1995.
A. Petitet. Algorithmic redistribution methods for block cyclic decompositions. Ph.D thesis, University of Tennessee, 1996.
J. Ronde et al. Progress Report SAD/PARASOL II, Technical Report CAMAS-TR-2.2.3.2, University of Amsterdam, 1993.
C.W. Tseng. An optimizing FORTRAN D compiler for MIMD distributed memory machine. Ph.D. thesis, Rice University, 1993.
Author information
Authors and Affiliations
Rights and permissions
About this article
Cite this article
Imamura, T. An Estimation of Complexity and Computational Costs for Vertical Block-Cyclic Distributed Parallel LU Factorization. The Journal of Supercomputing 15, 95–110 (2000). https://doi.org/10.1023/A:1008121726802
Issue Date:
DOI: https://doi.org/10.1023/A:1008121726802