Abstract
Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, are popular methods of parallel programming; however, performance monitoring and analysis tools for these models have remained elusive. In this work, we propose a performance counter extension to the OpenSHMEM interfaces to expose internal communication state as lightweight performance data to tools. We implement our interface in the open source Sandia OpenSHMEM library and demonstrate its mapping to libfabric primitives. Next, we design a simple collector tool to record the behavior of OpenSHMEM processes at execution time. We analyze the Integer Sort (ISx) benchmark and use the resulting data to investigate several common performance issues—including communication schedule, poor overlap, and load imbalance—and visualize the impact of optimizations to correct these issues. Through this study, our tool uncovered a performance bug in this popular benchmark. Finally, by using our tool to guide the application of several pipelining optimizations, we were able to improve the ISx key exchange performance by more than 30%.
Notes
1. Other names and brands may be claimed as the property of others.
2. Intel and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as “Spectre” and “Meltdown”. Implementation of these updates may make these results inapplicable to your device or system.
Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
For more information go to http://www.intel.com/benchmarks.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Rahman, M.Wu., Ozog, D., Dinan, J. (2019). Lightweight Instrumentation and Analysis Using OpenSHMEM Performance Counters. In: Pophale, S., Imam, N., Aderholdt, F., Gorentla Venkata, M. (eds) OpenSHMEM and Related Technologies. OpenSHMEM in the Era of Extreme Heterogeneity. OpenSHMEM 2018. Lecture Notes in Computer Science, vol 11283. Springer, Cham. https://doi.org/10.1007/978-3-030-04918-8_12
DOI: https://doi.org/10.1007/978-3-030-04918-8_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04917-1
Online ISBN: 978-3-030-04918-8
eBook Packages: Computer Science (R0)