Abstract
This work studies the 1–D TVD–Hopmoc method executed in shared memory manycore environments. In particular, this paper examines barrier costs on Intel\(^{\textregistered }\) Xeon Phi\(^\mathrm{TM}\) (KNC and KNL) accelerators when using the OpenMP standard. We employ an explicit synchronization mechanism to reduce spin and thread-scheduling times in an OpenMP implementation of the 1–D TVD–Hopmoc method. Specifically, we define an array that represents the threads, and the new scheme synchronizes only adjacent threads. Moreover, the new approach reduces the OpenMP scheduling time by employing an explicit work-sharing strategy: at the beginning of the process, the array that represents the computational mesh of the numerical method is partitioned among threads, instead of leaving this task to the OpenMP API. The new scheme also diminishes the OpenMP spin time by replacing OpenMP barriers with an explicit synchronization mechanism in which a thread waits only for its two adjacent threads. The results of the new approach are compared with those of a basic parallel implementation of the 1–D TVD–Hopmoc method. Specifically, numerical simulations show that the new approach achieves promising performance gains in shared memory manycore environments for an OpenMP implementation of the 1–D TVD–Hopmoc method.
1 Introduction
Over the last decades, as the demand for faster computation grew and shared memory multicore and manycore architectures became available on a large scale, an important issue has emerged: how can one obtain a speedup proportional to the number of physical cores available in a parallel architecture? Specific programming languages, libraries, and programming models have been proposed to help programmers and researchers surmount this challenge. Among them, the OpenMP library [2] is one of the best-known standards nowadays.
Since increases in clock speed face limits imposed by overheating and electromagnetic interference, other technologies have been proposed with the objective of improving processing capacity. Multithreading, distributed processing, manycore architectures (e.g., General Purpose Graphics Processing Units (GPGPUs)), and Many Integrated Core (MIC) accelerators (such as the Intel\(^{\textregistered }\) Xeon Phi\(^\mathrm{TM}\)) are examples of technologies employed to boost computer performance. Even with these recent technologies, it remains a challenge to obtain a speedup proportional to the number of available cores in a parallel implementation, due to the hardware complexity of such systems. To address this issue, the OpenMP standard offers a simple way to convert a serial code into a parallel implementation for shared memory multicore and manycore architectures. Although a parallel implementation is easily obtained with the support of this API, a basic (or naive) OpenMP implementation in general does not straightforwardly attain the expected speedup in an application. Furthermore, despite being easy to use, the most common mechanisms available in the OpenMP standard may generate high scheduling and synchronization times.
High Performance Computing (HPC) is a practice widely used in computational simulations of several real-world phenomena. Numerical methods have been designed to maximize the computational capacity provided by an HPC environment. In particular, the Hopmoc method (see [3] and references therein) is a spatially decoupled alternating direction procedure for solving convection–diffusion equations. It was designed to be executed on parallel architectures (see [1] and references therein). More specifically, the unknowns are spatially decoupled, permitting the minimization of message passing among threads. The set of unknowns is decoupled into two subsets, which are calculated alternately by explicit and implicit approaches. In particular, the use of explicit and implicit semi-steps avoids the use of a linear system solver. Moreover, the method employs a strategy based on tracking values along characteristic lines during time stepping. The two semi-steps are performed along characteristic lines by a Semi-Lagrangian scheme following the concepts of the Modified Method of Characteristics. The time derivative and the advection term are combined as a directional derivative; thus, time steps are performed in the flow direction along characteristics of the velocity field of the fluid. The Hopmoc method is a direct method in the sense that the cost per time step is known in advance. To determine the value at the foot of the characteristic line, the original method uses an interpolation technique, which introduces inherent numerical errors. To overcome this limitation, the Hopmoc method was combined with a Total Variation Diminishing (TVD) scheme. This new approach, called TVD–Hopmoc, employs a flux limiter to determine the value at the foot of the characteristic line based on the Lax-Wendroff scheme [1].
We studied a basic OpenMP implementation of the TVD–Hopmoc method using the Intel\(^{\textregistered }\) Parallel Studio XE software for Intel’s Haswell/Broadwell architectures. This tool showed that the main performance problem of a simple OpenMP–based TVD–Hopmoc method was the use of the basic OpenMP scheduling and synchronization mechanisms. Thus, this paper employs alternatives to these basic mechanisms. Specifically, this work uses an explicit work-sharing (EWS) strategy: the array that denotes the computational mesh of the 1–D TVD–Hopmoc method is explicitly partitioned among threads. In addition, to avoid the implicit barrier costs imposed by the OpenMP standard, this paper employs an explicit synchronization mechanism that guarantees that only threads with real data dependencies participate in the synchronization. Additionally, we evaluate our approach under three thread binding policies (balanced, compact, and scatter).
This paper is organized as follows. Section 2 outlines the TVD–Hopmoc method and shows how the experiments were conducted in this work. Section 3 presents the EWS strategy employed here along with our strategy to synchronize adjacent threads. Section 4 shows the experimental results that compare the new approach with a naive OpenMP–based TVD–Hopmoc method. Finally, Sect. 5 addresses the conclusions and discusses the next steps in this investigation.
2 The TVD–Hopmoc Method and Description of the Tests
Consider the advection–diffusion equation in the form \(\frac{\partial u}{\partial t} + v\frac{\partial u}{\partial x} = d\frac{\partial ^2 u}{\partial x^2}\) (1)
with adequate initial and boundary conditions, where v is a constant positive velocity, d is the constant positive diffusivity, \(0\le x \le 1\) and \(0\le t \le T\), for T time steps. Applying the Hopmoc method to equation (1) yields \(\overline{u}^{t+\frac{1}{2}}_i = \overline{\overline{u}}^{t}_i+ \delta t\left[ \theta ^{t}_i L_h\left( \overline{\overline{u}}^{t}_i\right) + \theta ^{t+1}_i L_h\left( \overline{u}^{t+\frac{1}{2}}_i\right) \right] \) and \(u^{t+1}_i = \overline{u}^{t+\frac{1}{2}}_i + \delta t \left[ \theta ^t_iL_h\left( \overline{u}^{t+\frac{1}{2}}_i\right) + \theta ^{t+1}_i L_h\left( u^{t+1}_i\right) \right] \), where \(\theta ^t_i\) is 1 (0) if \(t+i\) is even (odd), \(L_h\left( u^t_i\right) =d\frac{u^{t}_{i-1}-2u^{t}_{i}+u^{t}_{i+1}}{\varDelta x^2}\) is a finite-difference operator, \(\overline{u}^{t+\frac{1}{2}}_i\) and \(u^{t+1}_i\) are consecutive time semi-steps, and the value of the concentration \(\overline{\overline{u}}^t_i\) is obtained by a TVD scheme that determines \(\overline{\overline{u}}^t_{i+1} = u^t_i - c\left( u^t_i-u^t_{i-1}\right) \left[ 1-\frac{(1-c)\phi _{i-1}}{2}+\frac{(1-c)\phi _i}{2r}\right] \), where c is the Courant number and \(r=\frac{u^t_i-u^t_{i-1}}{u^t_{i+1}-u^t_{i}}\) [1]. The Van Leer flux limiter [4] was employed in the numerical simulations performed in the present work; this scheme delivered better results than other techniques [1]. Our numerical simulations were carried out for a Gaussian pulse with amplitude 1.0, whose initial center location is 0.2, with velocity \(v=1\), diffusion coefficient \(d=\frac{2}{Re}=10^{-3}\) (where Re stands for the Reynolds number), \(\varDelta x = 10^{-5}\) and \(\varDelta x = 10^{-6}\) (i.e., \(10^5\) and \(10^6\) stencil points, respectively), and T is established as \(10^4\), \(10^5\), and \(10^6\).
3 EWS and Synchronization Strategies Combined with an Explicit Lock Mechanism
The main idea of our OpenMP–based TVD–Hopmoc method is to avoid the extra time caused by thread scheduling and by the implicit barriers that follow OpenMP work-sharing constructs. We deal with thread imbalance by explicitly partitioning the array among the team of threads. Our implementation of the TVD–Hopmoc method defines a static array that represents the unknowns. A permanent partition of this array is thus established for each thread at the beginning of the code, i.e., no changes are made to this partition during the execution of the process. Since the mesh is static, thread scheduling is performed only at the beginning of the execution. In addition, since each thread in the team holds its required data, we do not need the OpenMP parallel for directive in our code. Our OpenMP implementation of the TVD–Hopmoc method was therefore designed to minimize synchronization: a particular thread needs information only from its adjacent threads, so an implicit barrier is unnecessary. The mechanism applied here synchronizes adjacent threads, i.e., each thread waits only for its two adjacent threads to reach the same synchronization point.
The strategy employed here is based on a simple lock mechanism. Using an array of booleans, a thread sets (releases) an entry in this array and thereby informs its adjacent threads that its data cannot (can) be used by them. We will refer to this approach as the EWS-adSync implementation.
4 Experimental Results
This section shows the results of the new and basic approaches described above in executions performed on a machine containing an Intel\(^{\textregistered }\) Xeon Phi\(^\mathrm{TM}\) Knights Corner (KNC) 5110P accelerator (8 GB GDDR5, 1.053 GHz, 60 cores, 4 threads per core; Sect. 4.1), and on a machine containing an Intel\(^{\textregistered }\) Xeon Phi\(^\mathrm{TM}\) CPU 7250 @ 1.40 GHz (68 cores, 4 threads per core; Sect. 4.2). We evaluate the EWS-adSync implementation of the TVD–Hopmoc method under three thread binding policies: balanced, compact, and scatter.
4.1 Experiments Performed on a Xeon Phi\(^\mathrm{TM}\) (Knights Corner) Accelerator
This section shows experiments performed on an Intel\(^{\textregistered }\) Xeon Phi\(^\mathrm{TM}\) (KNC) accelerator composed of 240 threads. Figure 1 shows that our OpenMP–based TVD–Hopmoc method, using the EWS strategy in conjunction with adjacent-thread synchronization and the compact (comp.) thread binding policy (pol.), yielded a speedup of approximately 140x (using 240 threads) in a simulation with \(\varDelta x=10^{-6}\) (i.e., a mesh composed of \(10^6\) stencil points) and \(T=10^5\). This implementation used in conjunction with the scatter (sct.) policy delivered a speedup of 136x (using 238 threads) in a simulation with the same settings.
The EWS-adSync implementation alongside the scatter, balanced (bal.), and compact binding policies achieved speedups of 134.2x, 133.8x, and 133.5x, respectively, in simulations with \(\varDelta x=10^{-6}\) and \(T=10^4\). These implementations outperformed the basic OpenMP–based TVD–Hopmoc method, which obtained a speedup of 28x.
The EWS-adSync implementation alongside the compact policy (with a speedup of 121x) achieved better results than the scatter policy (111x) in a simulation with \(\varDelta x=10^{-5}\) and \(T=10^5\). Both implementations outperformed the basic OpenMP–based TVD–Hopmoc method, which obtained a speedup of 95x. The EWS-adSync implementation alongside the compact policy (90x) reached slightly better results than the scatter policy (88x) in a simulation with \(\varDelta x=10^{-5}\) and \(T=10^4\). On the other hand, the EWS-adSync implementation in conjunction with the scatter policy (38x) yielded better results than the compact policy (27x) in a simulation with \(\varDelta x=10^{-5}\) and \(T=10^3\). Figure 1 shows that, in general, the OpenMP–based TVD–Hopmoc method obtains better speedups with a larger number of iterations T, since it takes advantage of a higher cache hit rate.
4.2 Experiments Performed on a Xeon Phi\(^\mathrm{TM}\) (Knights Landing) Accelerator
This section shows experiments performed on an Intel\(^{\textregistered }\) Xeon Phi\(^\mathrm{TM}\) (KNL) accelerator composed of 272 threads. Figure 2 shows that our OpenMP–based TVD–Hopmoc method, using the EWS strategy in conjunction with adjacent-thread synchronization and the balanced thread binding policy, yielded a speedup of approximately 91x (using 135 threads) in a simulation with \(\varDelta x=10^{-5}\) (i.e., a mesh composed of \(10^5\) stencil points) and \(T=10^6\). This implementation used in conjunction with the scatter and compact policies delivered speedups of 85x and 77x (using 135 and 271 threads, respectively) with the same settings. Similarly, the EWS-adSync implementation alongside the balanced policy achieved better results (with a speedup of 84x) than in conjunction with either the scatter (80x) or compact (69x) policy in simulations with \(\varDelta x=10^{-5}\) and \(T=10^5\).
The EWS-adSync implementation alongside the scatter policy achieved results similar to the compact policy (both with a speedup of 17x) in a simulation with \(\varDelta x=10^{-6}\) and \(T=10^5\). It seems that the simulation with \(\varDelta x=10^{-6}\) incurred a higher cache miss rate than the other experiments.
5 Conclusions
Based on an explicit work-sharing strategy along with an explicit synchronization mechanism, the approach employed here to implement an OpenMP–based TVD–Hopmoc method attained reasonable speedups on manycore architectures (Xeon Phi\(^\mathrm{TM}\) KNC and KNL accelerators). In particular, the scheduling time was greatly reduced by replacing runtime task assignment to threads with an explicit work-sharing strategy that determines a priori the range of the array of stencil points on which each thread performs its task. Moreover, using a lock array in which each entry denotes a thread, the synchronization time spent in barriers was replaced by a strategy that only requires each thread to synchronize with its two adjacent threads. Employing these two strategies, the team of threads presented reasonable load balancing, with almost all available threads used simultaneously.
Our OpenMP–based TVD–Hopmoc method along with a compact thread binding policy yielded a speedup of approximately 140x when applied to a mesh composed of \(10^6\) stencil points in a simulation performed on an Intel\(^{\textregistered }\) Xeon Phi\(^\mathrm{TM}\) (Knights Corner) accelerator composed of 240 threads. Moreover, this parallel TVD–Hopmoc method alongside a balanced thread binding policy reached a speedup of approximately 91x when applied to a mesh composed of \(10^5\) stencil points in a simulation performed on an Intel\(^{\textregistered }\) Xeon Phi\(^\mathrm{TM}\) (Knights Landing) accelerator composed of 272 threads. Furthermore, our OpenMP–based TVD–Hopmoc method in conjunction with both the scatter and compact policies achieved a speedup of approximately 17x when applied to a mesh composed of \(10^6\) stencil points in a simulation performed on the same machine.
It is observed in Figs. 1 and 2 that the speedup of our OpenMP–based TVD–Hopmoc method along with the scatter and balanced policies shows four trends: the speedup increases as threads are added, up to each multiple of the number of cores in the machine. We intend to investigate this characteristic further in future studies.
A next step in this investigation is to implement an OpenMP–based 2–D TVD–Hopmoc method. Even in the 2–D case, we plan to use an array to represent the stencil points so that the approach employed in the 1–D case of the TVD–Hopmoc method is still valid.
References
Cabral, F., Osthoff, C., Costa, G., Brandão, D.N., Kischinhevsky, M., Gonzaga de Oliveira, S.L.: Tuning up TVD HOPMOC method on Intel MIC Xeon Phi architectures with Intel Parallel Studio tools. In: Proceedings of the 8th Workshop on Applications for Multi-Core Architectures, Campinas, SP, Brazil (2017)
Dagum, L., Menon, R.: OpenMP: an industry-standard API for shared-memory programming. IEEE Comput. Sci. Eng. 5(1), 46–55 (1998)
Oliveira, S.R.F., Gonzaga de Oliveira, S.L., Kischinhevsky, M.: Convergence analysis of the Hopmoc method. Int. J. Comput. Math. 86, 1375–1393 (2009)
van Leer, B.: Towards the ultimate conservative difference scheme. II. Monotonicity and conservation combined in a second-order scheme. J. Comput. Phys. 14, 361–370 (1974)
Acknowledgments
This work was developed with the support of CNPq, CAPES, and FAPERJ - Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro. We would like to thank the Núcleo de Computação Científica at Universidade Estadual Paulista (NCC/UNESP) for letting us execute our simulations on its heterogeneous multi-core cluster. These resources were partially funded by Intel\(^{\textregistered }\) through the projects entitled Intel Parallel Computing Center, Modern Code Partner, and Intel/Unesp Center of Excellence in Machine Learning.
© 2018 Springer International Publishing AG, part of Springer Nature
Cabral, F.L., Osthoff, C., Costa, G.P., Gonzaga de Oliveira, S.L., Brandão, D., Kischinhevsky, M. (2018). An OpenMP Implementation of the TVD–Hopmoc Method Based on a Synchronization Mechanism Using Locks Between Adjacent Threads on Xeon Phi (TM) Accelerators. In: Shi, Y., et al. Computational Science – ICCS 2018. ICCS 2018. Lecture Notes in Computer Science(), vol 10862. Springer, Cham. https://doi.org/10.1007/978-3-319-93713-7_67