
An Estimation of Complexity and Computational Costs for Vertical Block-Cyclic Distributed Parallel LU Factorization

Published in: The Journal of Supercomputing


Abstract

The Vertical Block-Cyclic Distributed Parallel LU Factorization Method (VBPLU) is processed effectively on distributed memory parallel computers. VBPLU is based on two techniques: the block algorithm and the aggregation of communications. Since startup time dominates the cost of data communication, and aggregation reduces the number of communication startups, total performance is much improved. Furthermore, the method operates on long vectors, so it is also advantageous on vector processors. In this paper, we construct a model of VBPLU using a simplified LogGP model with analytical formulae, and accurately estimate the computational cost, taking into account the load distribution caused by the data layout and process mapping. From this model we obtain guidelines for optimizing the block algorithm. Our estimates are verified through numerical experiments on three different distributed memory parallel computers.
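The two ingredients named in the abstract can be illustrated with a short sketch. The following Python fragment is not the paper's code: the function names and parameter values are hypothetical, chosen only to show (a) a 1-D vertical block-cyclic mapping of matrix columns onto processes, and (b) why, under a simplified LogGP-style cost with per-message overhead `o`, latency `L`, and per-byte gap `G`, aggregating many small messages into one is cheaper.

```python
# Illustrative sketch (not the paper's implementation).
# (a) Vertical block-cyclic layout: column j of an n-column matrix,
#     with block width nb over p processes, is owned by process
#     (j // nb) % p.

def owner(j, nb, p):
    """Process that owns global column j under a 1-D block-cyclic layout."""
    return (j // nb) % p

def local_columns(rank, n, nb, p):
    """Global column indices stored on process `rank`."""
    return [j for j in range(n) if owner(j, nb, p) == rank]

# (b) Simplified LogGP point-to-point cost: one k-byte message costs
#     o (send overhead) + L (latency) + (k - 1) * G (per-byte gap)
#     + o (receive overhead); `msgs` messages are sent back to back.

def loggp_time(msgs, bytes_each, L, o, G):
    """Total time for `msgs` messages of `bytes_each` bytes."""
    return msgs * (2 * o + L + (bytes_each - 1) * G)

# Example: n=8 columns, nb=2, p=2 gives process 0 the columns
# [0, 1, 4, 5] and process 1 the columns [2, 3, 6, 7]. Aggregating
# m messages into one saves (m - 1) * (2*o + L) of startup cost,
# which is the effect the abstract attributes to aggregation.
```

With startup-dominated parameters (e.g. `L=5.0`, `o=1.0`, `G=0.01`), `loggp_time(4, 100, ...)` exceeds `loggp_time(1, 400, ...)`, matching the abstract's claim that fewer, longer messages win when startup time dominates.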



References

  1. A. Alexandrov, et al. LogGP: incorporating long messages into the LogP model. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 95–105, ACM Press, 1995.

  2. E. Anderson, et al. LAPACK: portable linear algebra library for high-performance computers. LAPACK Working Note 20, Technical Report CS–90–105, University of Tennessee, 1990.

  3. M. Barnett, et al. Building a high-performance collective communication library. In Proceedings of Supercomputing '94, 1994.

  4. L.S. Blackford, et al. ScaLAPACK: a linear algebra library for message-passing computers. In Proceedings of the SIAM Conference on Parallel Processing, 1997.

  5. D. Culler, et al. LogP: towards a realistic model of parallel computation. In Proceedings of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1993.

  6. E.D. D'Azevedo and J.J. Dongarra. The design and implementation of the parallel out-of-core ScaLAPACK LU, QR and Cholesky factorization routines. LAPACK Working Note 118, Technical Report CS–97–347, University of Tennessee, 1997.

  7. F. Desprez, B. Tourancheau, and J.J. Dongarra. Performance complexity of LU factorization with efficient pipelining and overlap on a multiprocessor. LAPACK Working Note 67, CRPC-TR94417, Rice University, 1994.

  8. J.J. Dongarra. Performance of various computers using standard linear equations software. Technical Report CS–89–85, University of Tennessee, 1998.

  9. J.J. Dongarra and R.A. Van de Geijn. Reduction to condensed form for the eigenvalue problem on distributed memory architectures. Parallel Computing, 18:973–982, 1992.


  10. J.J. Dongarra and D.W. Walker. Constructing numerical software libraries for high-performance computer environments. Parallel and Distributed Computing Handbook, pp. 917–954, McGraw-Hill, 1996.

  11. K. Dowd. High Performance Computing, O'Reilly & Associates, Inc., 1993.

  12. G.H. Golub and J.M. Ortega. Scientific Computing: An Introduction with Parallel Computing. Academic Press, 1993.

  13. G.H. Golub and C.F. Van Loan. Matrix Computations, 3rd ed. Johns Hopkins University Press, 1996.

  14. W. Gropp and E. Lusk. Installation guide for mpich, a portable implementation of MPI. Technical Report ANL-96/5. Argonne National Laboratory, 1994.

  15. B. Hendrickson and D. Womble. The torus-wrap mapping for dense matrix calculations on massively parallel computers. SIAM J. Sci. Stat. Comput., 15(5):1201–1226, 1994.


  16. T. Imamura. Matrix computation with the parallel language ADETRAN4: implementation on VPP500 and its applications. In Proceedings of JSPP'96, pp. 17–24, 1996 (in Japanese).

  17. W.F. McColl. Bulk synchronous parallel computing. In Proceedings of the 2nd Workshop on Abstract Models for Parallel Computation, 1994.

  18. Message Passing Interface Forum, eds. MPI: A message-passing interface standard, version 1.1, 1995.

  19. T. Nogi. Promising data parallel environment: ADEPS, ADETRAN, ADENA. In Proceedings of pAs'95, Aizu, pp. 45–53, IEEE Computer Society, 1995.

  20. A. Petitet. Algorithmic redistribution methods for block cyclic decompositions. Ph.D. thesis, University of Tennessee, 1996.

  21. J. Ronde, et al. Progress Report SAD/PARASOL II. Technical Report CAMAS-TR-2.2.3.2, University of Amsterdam, 1993.

  22. C.W. Tseng. An optimizing FORTRAN D compiler for MIMD distributed memory machines. Ph.D. thesis, Rice University, 1993.



About this article

Cite this article

Imamura, T. An Estimation of Complexity and Computational Costs for Vertical Block-Cyclic Distributed Parallel LU Factorization. The Journal of Supercomputing 15, 95–110 (2000). https://doi.org/10.1023/A:1008121726802
