Application-oriented ping-pong benchmarking: how to assess the real communication overheads

Schneider, Timo; Gerstenberger, Robert; Hoefler, Torsten

doi:10.1007/s00607-013-0330-4

Published: 09 May 2013

Computing, Volume 96, pages 279–292 (2014)

Timo Schneider¹,
Robert Gerstenberger² &
Torsten Hoefler¹

Abstract

Moving data between processes is often cited as one of the major bottlenecks in parallel computing. A large body of research strives to improve communication latency and bandwidth on different networks, measured with ping-pong benchmarks over a range of message sizes. In practice, the data to be communicated generally originates from application data structures and must be serialized before it is sent over serial network channels. This serialization is often done by explicitly copying the data to communication buffers. The Message Passing Interface (MPI) standard defines derived datatypes to allow zero-copy formulations of non-contiguous data access patterns. However, many applications still implement manual pack/unpack loops, partly because these loops are more efficient than the datatype support of some MPI implementations. MPI implementers, on the other hand, lack good benchmarks that represent important application access patterns. We demonstrate that data serialization can account for up to 80% of the total communication overhead for important applications. This indicates that most current research on optimizing serial network transfer times may be targeting the smaller fraction of the communication overhead. To support the scientific community, we extracted the send/recv-buffer access patterns of a representative set of scientific applications to build a benchmark that includes serialization and communication of application data and thus reflects all communication overheads. It can be used like traditional ping-pong benchmarks to determine the holistic communication latency and bandwidth as observed by an application. It supports serialization loops in C and Fortran as well as MPI datatypes for representative application access patterns. Our benchmark, consisting of seven micro-applications, unveils significant performance discrepancies between the MPI datatype implementations of state-of-the-art MPI implementations. Our micro-applications aim to provide a standard benchmark for MPI datatype implementations to guide optimizations, similar to the established benchmarks SPEC CPU and Livermore Loops.
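
To make the notion of a manual pack/unpack loop concrete, the following is a minimal sketch in C and is not taken from the benchmark itself. It assumes a simple strided access pattern (for example, one column of a row-major matrix), described by the same three parameters that MPI_Type_vector accepts: a block count, a block length, and a stride. The type `strided_layout` and the functions `pack_strided`, `unpack_strided`, `serialization_fraction`, and `effective_bandwidth` are hypothetical names introduced only for this illustration, and the timing arguments are assumed to be measured by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout descriptor: `count` blocks of `blocklen` doubles,
 * with the starts of consecutive blocks `stride` doubles apart.
 * These mirror the count/blocklength/stride arguments of MPI_Type_vector. */
typedef struct {
    size_t count;
    size_t blocklen;
    size_t stride;
} strided_layout;

/* Manual pack loop: gather the non-contiguous blocks of `src` into the
 * contiguous buffer `buf`. Returns the number of doubles packed. */
size_t pack_strided(const double *src, strided_layout l, double *buf)
{
    assert(l.stride >= l.blocklen); /* blocks must not overlap */
    for (size_t i = 0; i < l.count; i++)
        memcpy(buf + i * l.blocklen, src + i * l.stride,
               l.blocklen * sizeof(double));
    return l.count * l.blocklen;
}

/* Manual unpack loop: scatter the contiguous buffer `buf` back into the
 * strided positions of `dst`. Returns the number of doubles unpacked. */
size_t unpack_strided(const double *buf, strided_layout l, double *dst)
{
    assert(l.stride >= l.blocklen);
    for (size_t i = 0; i < l.count; i++)
        memcpy(dst + i * l.stride, buf + i * l.blocklen,
               l.blocklen * sizeof(double));
    return l.count * l.blocklen;
}

/* Share of the one-way message time spent on serialization (pack plus
 * unpack) rather than on the network transfer itself. */
double serialization_fraction(double t_pack, double t_wire, double t_unpack)
{
    double total = t_pack + t_wire + t_unpack;
    return total > 0.0 ? (t_pack + t_unpack) / total : 0.0;
}

/* Bandwidth as observed by the application, in bytes per second: payload
 * size divided by the one-way time including pack and unpack. */
double effective_bandwidth(size_t payload_bytes, double t_pack,
                           double t_wire, double t_unpack)
{
    double total = t_pack + t_wire + t_unpack;
    return total > 0.0 ? (double)payload_bytes / total : 0.0;
}
```

In a ping-pong setting, the sender would call `pack_strided` before sending the contiguous buffer and the receiver would call `unpack_strided` after receiving it. With a derived datatype, the same layout would instead be handed to the MPI library, which is then free to avoid the intermediate copy. A traditional ping-pong benchmark reports only the wire time, whereas the two helper functions fold the pack and unpack times into the reported figures.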

Notes

The benchmark can be downloaded from http://unixer.de/research/datatypes/ddtbench.

Acknowledgments

This work was supported by the DOE Office of Science, Advanced Scientific Computing Research, under award number DE-FC02-10ER26011, program manager Sonia Sachs.

Author information

Authors and Affiliations

ETH Zurich, Department of Computer Science, Universitätstr. 6, Zurich, 8092, Switzerland
Timo Schneider & Torsten Hoefler
University of Illinois at Urbana-Champaign, Urbana, IL, USA
Robert Gerstenberger

Corresponding author

Correspondence to Timo Schneider.

About this article

Cite this article

Schneider, T., Gerstenberger, R. & Hoefler, T. Application-oriented ping-pong benchmarking: how to assess the real communication overheads. Computing 96, 279–292 (2014). https://doi.org/10.1007/s00607-013-0330-4

Received: 16 December 2012
Accepted: 27 April 2013
Published: 09 May 2013
Issue Date: April 2014
DOI: https://doi.org/10.1007/s00607-013-0330-4
