An Estimation of Complexity and Computational Costs for Vertical Block-Cyclic Distributed Parallel LU Factorization

Imamura, Toshiyuki

doi:10.1023/A:1008121726802

An Estimation of Complexity and Computational Costs for Vertical Block-Cyclic Distributed Parallel LU Factorization

Published: January 2000

Volume 15, pages 95–110, (2000)
Cite this article

The Journal of Supercomputing Aims and scope Submit manuscript

Toshiyuki Imamura¹

We’re sorry, something doesn't seem to be working properly.

Please try refreshing the page. If that doesn't work, please contact support so we can address the problem.

Abstract

The Vertical Block–cyclic Distributed Parallel LU Factorization Method (VBPLU) is effectively processed on a distributed memory parallel computer. VBPLU is based on the two techniques, the block algorithm and the aggregation of communications. Since startup time dominates the data communication and the aggregation reduces communication isssues, the total performance has been much improved. Furthermore this method uses long vectors so that it is also advantageous on vector processors. In this paper, we have constructed a modeling of VBPLU using a simplified LogGP model with analytical formulae, and estimated accurately the computational cost taking into account load distributions caused by data layout and process mapping. Some knowledge for optimization of block algorithm has been obtained. Our estimations have been verified through numerical experiments on three different distributed memory parallel computers.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

¥17,985 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (Japan)

Instant access to the full article PDF.

Institutional subscriptions

Performance comparison of various parallel incomplete LU factorization preconditioners for domain decomposition method

Article 01 November 2018

Structured Orthogonal Inversion of Block p-Cyclic Matrices on Multicores with GPU Accelerators

Analysis of Partitioning Models and Metrics in Parallel Sparse Matrix-Vector Multiplication

References

A. Alexandrov, et al. LogGP: incorporating long messages into the LogP model. In Proceeding of the 7th Annual Symposium on Parallel Algorithms and Architectures, pp. 95–105, ACM Press, 1995.
A. Anderson, et al. LAPACK: portable linear algebra library for high-performance computers. LAPACK Working Note 20, Technical Report CS–90–105, University of Tennessee, 1990.
M. Barnett, et al. Building a high-performance collective communication library. In Proceeding of Supercomputing, 1994.
L.S. Blackford, et al. ScaLAPACK: a linear algebra library for message-passing. SIAM Conference on Parallel Processing, 1997.
D. Culler, et. al. LogP: towards a realistic model of parallel computation. In Proceeding of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1993.
E.D. D'Azevedo and J.J. Dongarra. The design and implementation of the parallel out-of-core ScaLAPACK LU, QR and Cholesky factorization routines. LAPACK working Note 118, Technical Report CS–97–347. University of Tennessee, 1997.
F. Desprez, B. Tourancheau, J.J. Dongarra. Performance complexity of LU factorization with efficient pipelining and overlap on a multiprocessor. LAPACK Working Note 67. CRPC-TR94417, Rice University, 1994.
J.J. Dongarra. 1998. Performance of various computers using standard linear equations software. Technical Report CS–89–85, Univ. Tennessee.
J.J. Dongarra and R. A. Van de Geijn. Reduction to condensed form for the Eigenvalue problem on distributed memory architectures. Parallel Computing, 18:973–982, 1992.
Google Scholar
J.J. Dongarra and D.W. Walker. Constructing numerical software libraries for high-performance computer environments. Parallel and Distributed Computing Handbook, pp. 917–954, McGraw-Hill, 1996.
K. Dowd. High Performance Computing, O'Reilly & Associates, Inc., 1993.
G.H. Golub and J.M. Ortega. Scientific Computing: An Introduction with Parallel Computing. Academic Press, 1993..
G.H. Golub and C.F. Van Loan. Matrix Computations, 3rd ed. Johns Hopkins University Press, 1996.
W. Gropp and E. Lusk. Installation guide for mpich, a portable implementation of MPI. Technical Report ANL-96/5. Argonne National Laboratory, 1994.
B. Hendlickson and D. Womble. The torus-wrap mapping for dense matrix calculations on massively parallel computers. SIAM J. Sci. Stat. Comput, 15(5):1201–1226, 1994.
Google Scholar
T. Imamura. Matrix computation with parallel language ADETRAN4, - - Implementation on VPP500 and its Applications- - . In Proceeding of JSPP'96 in Japan, pp. 17–24, 1996 (in Japanese).
W.F. McColl. Bulk synchronous parallel computing. In Proceeding of 2nd Workshop on Abstract Models for Parallel Computation, 1994.
MPIF eds. MPI: A message passing interface standard, version 1.1, 1995.
T. Nogi. Promising data parallel environment- - ADEPS, ADETRAN, ADENA- - . In Proceeding of pAs'95 in Aizu, pp. 45–53, IEEE Computer Society, 1995.
A. Petitet. Algorithmic redistribution methods for block cyclic decompositions. Ph.D thesis, University of Tennessee, 1996.
J. Ronde et al. Progress Report SAD/PARASOL II, Technical Report CAMAS-TR-2.2.3.2, University of Amsterdam, 1993.
C.W. Tseng. An optimizing FORTRAN D compiler for MIMD distributed memory machine. Ph.D. thesis, Rice University, 1993.

Download references

Author information

Authors and Affiliations

Center for Promotion of Computational Science and Engineering, Japan Atomic Energy Research Institute, 2-2-54 Nakameguro, Meguro-ku, Tokyo, 153, Japan
Toshiyuki Imamura

Authors

Toshiyuki Imamura
View author publications
You can also search for this author in PubMed Google Scholar

Rights and permissions

Reprints and permissions

About this article

Cite this article

Imamura, T. An Estimation of Complexity and Computational Costs for Vertical Block-Cyclic Distributed Parallel LU Factorization. The Journal of Supercomputing 15, 95–110 (2000). https://doi.org/10.1023/A:1008121726802

Download citation

Issue Date: January 2000
DOI: https://doi.org/10.1023/A:1008121726802

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

¥17,985 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (Japan)

Instant access to the full article PDF.

Institutional subscriptions

An Estimation of Complexity and Computational Costs for Vertical Block-Cyclic Distributed Parallel LU Factorization

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Performance comparison of various parallel incomplete LU factorization preconditioners for domain decomposition method

Structured Orthogonal Inversion of Block p-Cyclic Matrices on Multicores with GPU Accelerators

Analysis of Partitioning Models and Metrics in Parallel Sparse Matrix-Vector Multiplication

References

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Subscribe and save

Buy Now

Navigation

An Estimation of Complexity and Computational Costs for Vertical Block-Cyclic Distributed Parallel LU Factorization

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Performance comparison of various parallel incomplete LU factorization preconditioners for domain decomposition method

Structured Orthogonal Inversion of Block p-Cyclic Matrices on Multicores with GPU Accelerators

Analysis of Partitioning Models and Metrics in Parallel Sparse Matrix-Vector Multiplication

References

Author information

Authors and Affiliations

Rights and permissions

About this article

Cite this article

Share this article

Subscribe and save

Buy Now

Search

Navigation