1 Introduction

Over the last decades, as the demand for faster computation grew and shared memory multicore and manycore architectures became widely available, an important issue has emerged: how can one obtain a speedup proportional to the number of physical cores available in a parallel architecture? Specific programming languages, libraries, and programming models have been proposed to assist programmers and researchers in surmounting this challenge. Among them, the OpenMP standard [2] is one of the best known nowadays.

Since increases in clock speed face physical limits imposed by overheating and electromagnetic interference, several technologies have been proposed to improve processing capacity. Multithreading, distributed processing, manycore architectures (e.g., General Purpose Graphics Processing Units (GPGPUs)), and Many Integrated Core (MIC) accelerators (such as the Intel\(^{\textregistered }\) Xeon Phi\(^\mathrm{TM}\)) are examples of technologies employed to boost computer performance. Even with these technologies, it remains a challenge to obtain a speedup proportional to the number of available cores in a parallel implementation due to the hardware complexity of such systems. To address this issue, the OpenMP standard offers a simple manner to convert a serial code into a parallel implementation for shared memory multicore and manycore architectures. Although a parallel implementation with the support of this API is easily reached, a basic (or naive) OpenMP implementation in general does not straightforwardly attain the expected speedup in an application. Furthermore, despite being easy to use, the most common constructs available in the OpenMP standard may generate high scheduling and synchronization overheads.
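
To make this baseline concrete, the sketch below (in C) illustrates the kind of naive OpenMP parallelization referred to throughout this paper: a serial stencil sweep annotated with a parallel for construct, which pays scheduling costs and ends in an implicit barrier at every sweep. The function and variable names are illustrative assumptions, not the code evaluated in Sect. 4.

    #include <omp.h>

    /* Sketch of a "naive" OpenMP parallelization: a serial stencil sweep
     * annotated with a work-sharing construct. Every such construct pays
     * scheduling costs and ends in an implicit barrier over the whole team.
     * Names and the update formula are illustrative, not the code of [1]. */
    void naive_sweep(const double *u, double *u_new, int n,
                     double d, double dt, double dx)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 1; i < n - 1; i++) {
            /* explicit finite-difference update of the interior points */
            u_new[i] = u[i] + dt * d * (u[i-1] - 2.0 * u[i] + u[i+1]) / (dx * dx);
        } /* implicit barrier here */
    }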

High Performance Computing (HPC) is widely used in computational simulations of many real-world phenomena, and numerical methods have been designed to exploit the computational capacity provided by HPC environments. In particular, the Hopmoc method (see [3] and references therein) is a spatially decoupled alternating direction procedure for solving convection–diffusion equations. It was designed to be executed on parallel architectures (see [1] and references therein). More specifically, the set of unknowns is spatially decoupled into two subsets, which permits minimizing message passing among threads. These two subsets are calculated alternately by explicit and implicit approaches, and the use of explicit and implicit semi-steps avoids the use of a linear system solver. Moreover, the method employs a strategy based on tracking values along characteristic lines during time stepping. The two semi-steps are performed along characteristic lines by a Semi-Lagrangian scheme following concepts of the Modified Method of Characteristics: the time derivative and the advection term are combined as a directional derivative, so time steps are performed in the flow direction along characteristics of the velocity field of the fluid. The Hopmoc method is a direct method in the sense that the cost per time step is known in advance. To determine the value at the foot of the characteristic line, the original method uses an interpolation technique, which introduces inherent numerical errors. To overcome this limitation, the Hopmoc method was combined with a Total Variation Diminishing (TVD) scheme. This new approach, called TVD–Hopmoc, employs a flux limiter to determine the value at the foot of the characteristic line based on the Lax–Wendroff scheme [1].

We studied a basic OpenMP implementation of the TVD–Hopmoc method using the Intel\(^{\textregistered }\) Parallel Studio XE suite on Intel's Haswell/Broadwell architectures. This analysis showed that the main performance bottleneck of a simple OpenMP–based TVD–Hopmoc method was the use of the basic OpenMP scheduling and synchronization mechanisms. Thus, this paper employs alternative strategies to these basic OpenMP mechanisms. Specifically, this work uses an explicit work-sharing (EWS) strategy: the array that represents the computational mesh of the 1–D TVD–Hopmoc method is explicitly partitioned among threads. In addition, to avoid the cost of the implicit barriers imposed by the OpenMP standard, this paper employs an explicit synchronization mechanism to guarantee that only threads with real data dependencies participate in the synchronization. Additionally, we evaluate our approach with three thread binding policies (balanced, compact, and scatter).

This paper is organized as follows. Section 2 outlines the TVD–Hopmoc method and shows how the experiments were conducted in this work. Section 3 presents the EWS strategy employed here along with our strategy to synchronize adjacent threads. Section 4 shows the experimental results that compare the new approach with a naive OpenMP–based TVD–Hopmoc method. Finally, Sect. 5 presents the conclusions and discusses the next steps in this investigation.

2 The TVD–Hopmoc Method and Description of the Tests

Consider the advection–diffusion equation in the form

$$\begin{aligned} u_{t}+v u_x=d u_{xx}, \end{aligned}$$
(1)

with adequate initial and boundary conditions, where v is a constant positive velocity, d is the constant positive diffusivity, \(0\le x \le 1\) and \(0\le t \le T\), for T time steps. Applying the Hopmoc method to equation (1) yields \(\overline{u}^{t+\frac{1}{2}}_i = \overline{\overline{u}}^{t}_i+ \delta t\left[ \theta ^{t}_i L_h\left( \overline{\overline{u}}^{t}_i\right) + \theta ^{t+1}_i L_h\left( \overline{u}^{t+\frac{1}{2}}_i\right) \right] \) and \(u^{t+1}_i = \overline{u}^{t+\frac{1}{2}}_i + \delta t \left[ \theta ^t_iL_h\left( \overline{u}^{t+\frac{1}{2}}_i\right) + \theta ^{t+1}_i L_h\left( u^{t+1}_i\right) \right] \), where \(\theta ^t_i\) is 1 (0) if \(t+i\) is even (odd), \(L_h\left( u^t_i\right) =d\frac{u^{t}_{i-1}-2u^{t}_{i}+u^{t}_{i+1}}{\varDelta x^2}\) is a finite-difference operator, \(\overline{u}^{t+\frac{1}{2}}_i\) and \(u^{t+1}_i\) are consecutive time semi-steps, and the value of the concentration in \(\overline{\overline{u}}^t_i\) is obtained by a TVD scheme that determines \(\overline{\overline{u}}^t_{i+1} = u^t_i - c\left( u^t_i-u^t_{i-1}\right) \left[ 1-\frac{(1-c)\phi _{i-1}}{2}+\frac{(1-c)\phi _i}{2r}\right] \), where \(r=\frac{u^t_i-u^t_{i-1}}{u^t_{i+1}-u^t_{i}}\) [1]. The Van Leer flux limiter [4] was employed in the numerical simulations performed in the present work, since it delivered better results than other techniques [1]. Our numerical simulations were carried out for a Gaussian pulse with amplitude 1.0, initially centered at 0.2, with velocity \(v=1\) and diffusion coefficient \(d=\frac{2}{Re}=10^{-3}\) (where Re stands for the Reynolds number), \(\varDelta x = 10^{-5}\) and \(\varDelta x = 10^{-6}\) (i.e., \(10^5\) and \(10^6\) stencil points, respectively), and T set to \(10^4\), \(10^5\), and \(10^6\).
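
As an illustration of how the TVD correction enters the semi-steps, the C sketch below evaluates the Van Leer limiter [4] and the value at the foot of the characteristic line, following the expressions quoted above. The Courant number c and the shifted gradient ratio used for \(\phi _{i-1}\) are assumptions introduced for illustration, and guards against vanishing denominators are omitted; see [1, 4] for the authoritative formulation.

    #include <math.h>

    /* Van Leer flux limiter [4]: phi(r) = (r + |r|) / (1 + |r|). */
    static double van_leer(double r)
    {
        return (r + fabs(r)) / (1.0 + fabs(r));
    }

    /* Value at the foot of the characteristic line, following the TVD
     * expression quoted above. 'c' is assumed to be the Courant number;
     * the shifted ratio r_im1 is an assumption of this sketch, and guards
     * against vanishing denominators are omitted. Valid for 2 <= i < n-1. */
    static double tvd_foot_value(const double *u, int i, double c)
    {
        double r_im1 = (u[i-1] - u[i-2]) / (u[i] - u[i-1]);
        double r_i   = (u[i]   - u[i-1]) / (u[i+1] - u[i]);
        double phi_im1 = van_leer(r_im1);
        double phi_i   = van_leer(r_i);

        return u[i] - c * (u[i] - u[i-1])
                    * (1.0 - 0.5 * (1.0 - c) * phi_im1
                           + 0.5 * (1.0 - c) * phi_i / r_i);
    }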

3 EWS and Synchronization Strategies Combined with an Explicit Lock Mechanism

The main idea of our OpenMP–based TVD–Hopmoc method is to avoid the extra time caused by scheduling threads and by the implicit barriers inserted after the work-sharing constructs of the OpenMP standard. We deal with thread imbalance by explicitly partitioning the array among the team of threads. Our implementation of the TVD–Hopmoc method defines a static array that represents the unknowns. A permanent partition of this array is established for each thread at the beginning of the code, i.e., no changes are made to this partition during the execution of the process. Since the mesh is static, thread scheduling is performed only at the beginning of the execution. In addition, since each thread in the team holds its required data, we do not need to use the OpenMP parallel for directive in our code. Our OpenMP implementation of the TVD–Hopmoc method was therefore designed to minimize synchronization: a particular thread needs information only from its adjacent threads, so an implicit barrier is unnecessary. Thereby, the mechanism applied here synchronizes only adjacent threads, i.e., each thread waits only for its two adjacent threads to reach the same synchronization point. A sketch of this partitioning is given below.
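
The following C sketch illustrates this explicit work-sharing strategy under the assumption of a simple block partition of the unknowns array: each thread computes its fixed range [lo, hi) once, at the start of the parallel region, and reuses it for every time step. The variable names and the affinity setting mentioned in the comment are illustrative; the actual partition used in the experiments may differ.

    #include <omp.h>

    /* Explicit work-sharing (EWS) sketch: a permanent block partition of the
     * unknowns is computed once per thread, so no 'parallel for' (and hence no
     * repeated scheduling or implicit barrier) is needed inside the time loop.
     * Thread binding can be chosen externally, e.g. with the Intel runtime
     * setting KMP_AFFINITY=compact (or scatter/balanced). */
    void ews_time_loop(double *u, double *u_half, int n, int T)
    {
        #pragma omp parallel
        {
            int tid      = omp_get_thread_num();
            int nthreads = omp_get_num_threads();

            /* permanent partition: contiguous block [lo, hi) per thread */
            int chunk = n / nthreads, rest = n % nthreads;
            int lo = tid * chunk + (tid < rest ? tid : rest);
            int hi = lo + chunk + (tid < rest ? 1 : 0);

            for (int t = 0; t < T; t++) {
                for (int i = lo; i < hi; i++) {
                    /* placeholder for the explicit/implicit semi-step updates;
                     * synchronization with adjacent threads goes between them */
                    u_half[i] = u[i];
                }
            }
        }
    }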

The strategy employed here is based on a simple lock mechanism. Using an array of booleans, a thread sets (releases) an entry in this array and thereby informs its adjacent threads that its data cannot (can) be used by them. We will refer to this approach as the EWS-adSync implementation; an illustrative sketch is shown below.
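
The C sketch below is an illustrative reconstruction of such a neighbour-only synchronization: a shared flag array records the last semi-step completed by each thread, and each thread busy-waits only on its left and right neighbours instead of on a full barrier. The flag layout, the MAX_THREADS bound, and the omission of padding against false sharing are assumptions of this sketch, not necessarily the exact lock mechanism used in our implementation.

    #include <omp.h>

    #define MAX_THREADS 512

    /* Progress flags: done[t] holds the last semi-step finished by thread t. */
    static volatile int done[MAX_THREADS];

    /* Each thread publishes its own progress and then busy-waits only on its
     * two adjacent threads, avoiding a barrier over the whole team. Padding
     * of the flag array against false sharing is omitted in this sketch. */
    static void sync_with_neighbours(int tid, int nthreads, int step)
    {
        done[tid] = step;              /* release: my block is ready */
        #pragma omp flush

        if (tid > 0)                   /* wait for the left neighbour */
            while (done[tid - 1] < step) {
                #pragma omp flush
            }
        if (tid < nthreads - 1)        /* wait for the right neighbour */
            while (done[tid + 1] < step) {
                #pragma omp flush
            }
    }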

4 Experimental Results

This section shows the results of the aforementioned new and basic approaches in executions performed on a machine containing an Intel\(^{\textregistered }\) Xeon Phi\(^\mathrm{TM}\) Knights Corner (KNC) 5110P accelerator, with 8 GB GDDR5 memory, 1.053 GHz, 60 cores, and 4 threads per core (in Sect. 4.1), and on a machine containing an Intel\(^{\textregistered }\) Xeon Phi\(^\mathrm{TM}\) CPU 7250 @ 1.40 GHz, with 68 cores and 4 threads per core (in Sect. 4.2). We evaluate the EWS-adSync implementation of the TVD–Hopmoc method with three thread binding policies: balanced, compact, and scatter.

4.1 Experiments Performed on a Xeon Phi\(^\mathrm{TM}\) (Knights Corner) Accelerator

This section shows experiments performed on an Intel\(^{\textregistered }\) Xeon Phi\(^\mathrm{TM}\) (KNC) accelerator supporting 240 threads. Figure 1 shows that our OpenMP–based TVD–Hopmoc method, using the EWS strategy in conjunction with the synchronization of adjacent threads and the compact (comp.) thread binding policy (pol.), yielded a speedup of approximately 140x (using 240 threads) in a simulation with \(\varDelta x=10^{-6}\) (i.e., a mesh composed of \(10^6\) stencil points) and \(T=10^5\). The same implementation in conjunction with the scatter (sct.) policy delivered a speedup of 136x (using 238 threads) in a simulation with the same settings.

Fig. 1.

Speedups obtained by two OpenMP implementations of the TVD–Hopmoc method applied to the advection–diffusion equation (1) for a Gaussian pulse with amplitude 1.0 in executions performed on a Xeon Phi\(^\mathrm{TM}\) (KNC) accelerator. A basic (or naive) implementation was employed in a simulation with \(T=10^5\) and \(\varDelta x = 10^{-5}\) (i.e., the mesh is composed of \(10^5\) stencil points). Our OpenMP–based TVD–Hopmoc method (EWS-adSync implementation) was employed with three different thread binding policies (balanced, compact, and scatter) in simulations with \(\varDelta x\) set to \(10^{-5}\) and \(10^{-6}\) (i.e., meshes composed of \(10^5\) and \(10^6\) stencil points, resp.) and T specified as \(10^5\), \(10^4\), and \(10^3\).

The EWS-adSync implementation with the scatter, balanced (bal.), and compact binding policies achieved speedups of 134.2x, 133.8x, and 133.5x, respectively, in simulations with \(\varDelta x=10^{-6}\) and \(T=10^4\). These implementations clearly outperformed the basic OpenMP–based TVD–Hopmoc method, which obtained a speedup of 28x.

The EWS-adSync implementation with the compact policy (speedup of 121x) achieved better results than with the scatter policy (speedup of 111x) in a simulation with \(\varDelta x=10^{-5}\) and \(T=10^5\). Both variants outperformed the basic OpenMP–based TVD–Hopmoc method, which obtained a speedup of 95x. The EWS-adSync implementation with the compact policy (speedup of 90x) reached slightly better results than with the scatter policy (speedup of 88x) in a simulation with \(\varDelta x=10^{-5}\) and \(T=10^4\). On the other hand, the EWS-adSync implementation in conjunction with the scatter policy (speedup of 38x) yielded better results than with the compact policy (speedup of 27x) in a simulation with \(\varDelta x=10^{-5}\) and \(T=10^3\). Figure 1 shows that, in general, the OpenMP–based TVD–Hopmoc method obtains better speedups for a larger number of time steps T, since it benefits from a higher cache hit rate.

4.2 Experiments Performed on a Xeon Phi\(^\mathrm{TM}\) (Knights Landing) Accelerator

This section shows experiments performed on an Intel\(^{\textregistered }\) Xeon Phi\(^\mathrm{TM}\) (KNL) accelerator supporting 272 threads. Figure 2 shows that our OpenMP–based TVD–Hopmoc method, using the EWS strategy in conjunction with the synchronization of adjacent threads and the balanced thread binding policy, yielded a speedup of approximately 91x (using 135 threads) in a simulation with \(\varDelta x=10^{-5}\) (i.e., a mesh composed of \(10^5\) stencil points) and \(T=10^6\). The same implementation in conjunction with the scatter and compact policies delivered speedups of 85x and 77x (using 135 and 271 threads, respectively) with the same settings. Similarly, the EWS-adSync implementation with the balanced policy achieved better results (speedup of 84x) than with either the scatter (speedup of 80x) or the compact (speedup of 69x) policy in simulations with \(\varDelta x=10^{-5}\) and \(T=10^5\).

The EWS-adSync implementation with the scatter policy achieved results similar to those obtained with the compact policy (both with a speedup of 17x) in a simulation with \(\varDelta x=10^{-6}\) and \(T=10^5\). It seems that the simulation with \(\varDelta x=10^{-6}\) incurred a higher cache miss rate than the other experiments.

Fig. 2.

Speedups of two OpenMP implementations of the TVD–Hopmoc method applied to the advection–diffusion equation (1) for a Gaussian pulse with amplitude 1.0 in runs performed on a Xeon Phi\(^\mathrm{TM}\) (KNL) accelerator. Our OpenMP–based TVD–Hopmoc method (EWS-adSync implementation) was employed with three different thread binding policies (balanced, compact, and scatter) in simulations with \(\varDelta x\) set to \(10^{-5}\) and \(10^{-6}\) (i.e., meshes composed of \(10^5\) and \(10^6\) stencil points, resp.) and T defined as \(10^6\) and \(10^5\).

5 Conclusions

Based on an explicit work-sharing strategy combined with an explicit synchronization mechanism, the approach employed here to implement an OpenMP–based TVD–Hopmoc method attained reasonable speedups on manycore architectures (Xeon Phi\(^\mathrm{TM}\) KNC and KNL accelerators). In particular, the scheduling time was drastically reduced by replacing the runtime assignment of tasks to threads with an explicit work-sharing strategy that determines a priori the range of the array of stencil points on which each thread performs its task. Moreover, using a lock array in which each entry corresponds to a thread, synchronization at implicit barriers was replaced by a strategy that only requires each thread to synchronize with its two adjacent threads. Employing these two strategies, the team of threads presented reasonable load balancing, with almost all available threads being used simultaneously.

Our OpenMP–based TVD–Hopmoc method with a compact thread binding policy yielded a speedup of approximately 140x when applied to a mesh composed of \(10^6\) stencil points in a simulation performed on an Intel\(^{\textregistered }\) Xeon Phi\(^\mathrm{TM}\) (Knights Corner) accelerator supporting 240 threads. Moreover, this parallel TVD–Hopmoc method with a balanced thread binding policy reached a speedup of approximately 91x when applied to a mesh composed of \(10^5\) stencil points in a simulation performed on an Intel\(^{\textregistered }\) Xeon Phi\(^\mathrm{TM}\) (Knights Landing) accelerator supporting 272 threads. Furthermore, our OpenMP–based TVD–Hopmoc method in conjunction with both the scatter and compact policies achieved a speedup of approximately 17x when applied to a mesh composed of \(10^6\) stencil points in a simulation performed on the same machine.

It is observed in Figs. 1 and 2 that the speedup of our OpenMP–based TVD–Hopmoc method with the scatter and balanced policies shows four trends: the speedup increases as the number of threads grows, up to each multiple of the number of cores in the machine. We intend to investigate this behavior further in future studies.

A next step in this investigation is to implement an OpenMP–based 2–D TVD–Hopmoc method. Even in the 2–D case, we plan to use an array to represent the stencil points so that the approach employed in the 1–D case of the TVD–Hopmoc method is still valid.