Abstract
Optimistic concurrency control and STMs rely on the assumption of sparse conflicts. For data-parallel GPU codes with many or dynamic data dependences, a pessimistic, lock-based approach may be faster, if only GPUs offered hardware support for GPU-wide fine-grained synchronization. Instead, current GPUs inflict dead- and livelocks on attempts to implement such synchronization in software.
The paper demonstrates how to build GPU-wide, non-hanging critical sections that are as easy to use as STMs but come close to the performance of traditional fine-grained locks. Instead of sequentializing all threads that enter a critical section, the novel programmer-guided Inspected Critical Sections (ICS) keep the degree of parallelism up. As in optimistic approaches, threads that are known not to interfere may execute the body of the inspected critical section concurrently.
Notes
- 1.
Recall that the upper code in Fig. 1 can deadlock on a GPU. Generally speaking, the while-condition splits the threads (of a warp/wavefront) into two sets: one set contains the thread that has acquired the lock, the other holds all remaining threads. The SIMT instruction scheduler then runs the sets in turn, one after the other, up to a convergence point where the sets are combined again. The problem is that there is no yield and that the instruction scheduler does not switch between the sets. If the scheduler chooses to issue instructions to the set of spinning threads first, the set with the winning thread never receives a single instruction; it makes no progress and thus never releases the lock. (A minimal sketch of such a spin lock follows after these notes.)
- 2.
- 3.
Since we can only show -O0 numbers for STM and FGL due to the compiler issue, we show those numbers for ICS as well, even though the compiler issue did not prevent -O3 for our approach.
- 4.
To circumvent this problem, many codes use the incorrect spin lock with a grid size that stays below (warp size \(\cdot \) # of SMs). With such an underutilization of the GPU, the warp scheduler does not have to re-schedule because there is no more than one warp per SM.
- 5.
Note that, if needed, the developer can trade time for space: ideally, the projection of an item to its position in the bitmap is injective. With a smaller co-domain of the projection, the bitmap can be smaller, but the conflict detection may announce false positives that then cause sequential execution and hence longer runtimes. (A sketch of such a projected conflict bitmap follows after these notes.)
- 6.
For Hash we use the bucket number as the item; for Bank it is the account numbers. Labyrinth uses the coordinates of the points in the mesh as items. Genome uses a common subsequence (string) to identify a hash bucket that holds common DNA-segments. Kmeans uses the Id of a cluster. Vacation uses the Ids of hotel rooms, flights, and cars. We never use memory addresses as items.
- 7.
If we force ICS to always assume a conflict and to go down to the single thread level, runtimes are much slower than the STM version (Hash table: 14x and 11x, Bank: 13x, Graph: 6x and 2x, Labyrinth: 2x, Genome: 12x, Kmeans: 10x, Vacation: 15x).
- 8.
All measurements with -O3; the compiler did not remove the empty block.
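To make Note 1 concrete, the following is a minimal sketch of the kind of spin lock meant there; it is not the paper's Fig. 1, and the global flag `lock` and the kernel are illustrative assumptions. On SIMT schedulers that keep issuing instructions to the spinning threads of a warp, the winning thread starves and the kernel hangs.

```cuda
// Minimal sketch of a naive GPU spin lock (illustrative, not Fig. 1 of the
// paper). If the SIMT scheduler keeps issuing to the spinning threads of a
// warp, the winning thread never runs the body and never releases the lock.
__device__ int lock = 0;                       // 0 = free, 1 = taken

__global__ void naive_critical_section() {
    while (atomicCAS(&lock, 0, 1) != 0) {
        // spin: no yield, no guaranteed switch to the lock holder
    }
    // ... body of the critical section ...
    __threadfence();                           // make the body's writes visible
    atomicExch(&lock, 0);                      // release
}
```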
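Note 5's time/space trade-off can be illustrated by the following sketch; the bitmap size and the modulo projection are assumptions for illustration, not the paper's implementation. A co-domain smaller than the item space makes the projection non-injective, so a set bit may be a false positive.

```cuda
// Illustrative conflict bitmap (not the paper's data structure). The
// projection maps an item to a bit; with BITMAP_BITS smaller than the item
// space it is not injective, so detected conflicts may be false positives.
#define BITMAP_BITS (1 << 15)                  // smaller co-domain -> smaller bitmap

__device__ unsigned int bitmap[BITMAP_BITS / 32];

__device__ unsigned int project(unsigned int item) {
    return item % BITMAP_BITS;                 // injective only for small item spaces
}

// Atomically mark the item's bit; a previously set bit signals a
// (possibly false positive) conflict that forces sequential execution.
__device__ bool mark_and_check(unsigned int item) {
    unsigned int bit = project(item);
    unsigned int old = atomicOr(&bitmap[bit / 32], 1u << (bit % 32));
    return (old >> (bit % 32)) & 1u;
}
```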
References
CUDA 9 Features Revealed: Volta, Cooperative Groups and More (2017). https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/. Accessed 03 July 2017
Inside Volta: The World’s Most Advanced Data Center GPU (2017). https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/. Accessed 03 July 2017
Baxter, D., Mirchandaney, R., Saltz, J.H.: Run-time parallelization and scheduling of loops. In: (SPAA 1989): Symposium on Parallel Algorithms and Architecture, Santa Fe, NM, pp. 603–612, June 1989
Best, M.J., Mottishaw, S., Mustard, C., Roth, M., Fedorova, A., Brownsword, A.: Synchronization via scheduling: techniques for efficiently managing shared state. In: (PLDI 2011): International Conference on Programming Language Design and Implementation, San Jose, CA, pp. 640–652, June 2011
Cascaval, C., et al.: Software transactional memory: why is it only a research toy? Queue 6(5), 40:46–40:58 (2008)
Cederman, D., Tsigas, P., Chaudhry, M.T.: Towards a software transactional memory for graphics processors. In: (EG PGV 2010): Eurographics Conference on Parallel Graphics and Visualization, Norrköping, Sweden, pp. 121–129, May 2010
Dang, F.H., Rauchwerger, L.: Speculative parallelization of partially parallel loops. In: (LCR 2000): International Workshop Languages, Compilers, and Run-Time Systems for Scalable Computers, Rochester, NY, pp. 285–299, May 2000
Eastep, J., Wingate, D., Santambrogio, M.D., Agarwal, A.: Smartlocks: lock acquisition scheduling for self-aware synchronization. In: (ICAC 2010): International Conference on Autonomic Computing, Washington, DC, pp. 215–224, June 2010
ElTantawy, A., Aamodt, T.M.: MIMD synchronization on SIMT architectures. In: (MICRO 2016): International Symposium on Microarchitecture, Taipei, Taiwan, pp. 1–14, October 2016
Erdős, P., Rényi, A.: On random graphs I. Publ. Math. (Debrecen) 6, 290–297 (1959)
Habermaier, A., Knapp, A.: On the correctness of the SIMT execution model of GPUs. In: (ESOP 2012): European Symposium on Programming, Tallinn, Estonia, pp. 316–335, March 2012
Holey, A., Zhai, A.: Lightweight software transactions on GPUs. In: (ICPP 2014): International Conference on Parallel Processing, Minneapolis, MN, pp. 461–470, September 2014
Li, A., van den Braak, G.J., Corporaal, H., Kumar, A.: Fine-grained synchronizations and dataflow programming on GPUs. In: (ICS 2015): International Conference on Supercomputing, Newport Beach, CA, pp. 109–118, June 2015
Minh, C.C., Chung, J., Kozyrakis, C., Olukotun, K.: STAMP: Stanford transactional applications for multi-processing. In: (IISWC 2008): International Symposium on Workload Characterization, Seattle, WA, pp. 35–46, September 2008
Ramamurthy, A.: Towards scalar synchronization in SIMT architectures. Master’s thesis, University of British Columbia, September 2011
Shen, Q., Sharp, C., Blewitt, W., Ushaw, G., Morgan, G.: PR-STM: priority rule based software transactions for the GPU. In: (Euro-Par 2015): European Conference on Parallel Processing, Vienna, Austria, pp. 361–372, August 2015
Xiao, S., Aji, A.M., Feng, W.C.: On the robust mapping of dynamic programming onto a graphics processing unit. In: (ICPADS 2009): International Conference on Parallel and Distributed Systems, Shenzhen, China, pp. 26–33, December 2009
Xiao, S., Feng, W.: Inter-block GPU communication via fast barrier synchronization. In: (IPDPS 2010): International Symposium on Parallel and Distributed Processing, Atlanta, GA, pp. 1–12, April 2010
Xu, Y., Gao, L., Wang, R., Luan, Z., Wu, W., Qian, D.: Lock-based synchronization for GPU architectures. In: (CF 2016): International Conference on Computing Frontiers, Como, Italy, pp. 205–213, May 2016
Xu, Y., Wang, R., Goswami, N., Li, T., Gao, L., Qian, D.: Software transactional memory for GPU architectures. In: (CGO 2014): International Symposium on Code Generation and Optimization, Orlando, FL, pp. 1:1–1:10, February 2014
Yilmazer, A., Kaeli, D.R.: HQL: a scalable synchronization mechanism for GPUs. In: (IPDPS 2013): International Symposium on Parallel and Distributed Processing, Cambridge, MA, pp. 475–486, May 2013
Appendix
1.1 Benchmark Infrastructure
For all measurements we use a 1.5 GHz desktop NVIDIA TITAN Xp GPU with 12 GBytes of global memory and 3,840 cores in 60 SMs (with 64 SPs each) that runs CUDA 8.0 code.
The group-size in all measurements is 128 threads. The reason is that on our GPU the kernels can use up to 32,000 registers per group, i.e., 250 registers per thread. Both the retrenchment cascade and the STM framework need 70 of those registers. This leaves 180 registers for the local variables of the applications. Since the benchmarks need that many registers, we could not use larger group-sizes. While smaller group-sizes are possible, we only present measurements for a group-size of 128 threads because our experiments did not show qualitatively different results for smaller group-sizes.
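In numbers, the register budget works out to \(32{,}000 / 128 = 250\) registers per thread, of which \(250 - 70 = 180\) remain for the applications' local variables.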
We repeated all measurements 100 times; all given numbers are averages. For the code versions with fine-grained locks and the STM-based implementations we only measured those runs that did not face a dead- or livelock.
1.2 Benchmark Set
We use seven benchmarks, some of which are taken from the STAMP benchmark suite [14] with given atomic regions. We always use the largest possible shared data structure and/or the maximal number of threads that fit onto our GPU.
Hash Table. We use 1.5M threads and a hash table with the same number of buckets, each of which holds a linked list of colliding entries. Each thread puts a single entry into the shared hash table at a random position. ICS uses the bucket number as the item to check for conflicts. The bucket operation is the atomic region in the STM code. The fine-grained lock code (FGL) uses one lock per bucket. To study the effect of the number of collisions, we also use half the buckets.
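As a rough sketch of the FGL variant (the entry layout, bucket count, and lock array are assumptions for illustration, not the paper's code), an insert guards the bucket's linked list with that bucket's lock; the naive spin loop is again the hazardous pattern from Note 1.

```cuda
// Illustrative FGL hash insert with one lock per bucket (assumed layout).
#define NUM_BUCKETS (1 << 20)

struct Entry { unsigned int key; int value; Entry* next; };

__device__ Entry* buckets[NUM_BUCKETS];        // linked list of colliding entries
__device__ int    locks[NUM_BUCKETS];          // 0 = free, 1 = taken

__device__ void fgl_insert(Entry* e) {
    unsigned int b = e->key % NUM_BUCKETS;     // bucket number (= the ICS item)
    while (atomicCAS(&locks[b], 0, 1) != 0) { }  // naive spin lock, cf. Note 1
    e->next    = buckets[b];                   // prepend to the bucket's list
    buckets[b] = e;
    __threadfence();
    atomicExch(&locks[b], 0);                  // release the bucket lock
}
```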
Bank. There are 24M accounts. 24M parallel threads withdraw an amount of money from one randomly picked account and deposit it to another. The two accounts are the items for conflict checking. There is a conflict if two threads use an account in common. STM: the transfer happens in the atomic region. FGL: there is one lock per account.
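The FGL variant of Bank needs two per-account locks per transfer; a common way to avoid lock-order deadlocks is to acquire them in ascending account order, as in the following sketch (an assumption for illustration, not necessarily the paper's code; the SIMT hazard of Note 1 still applies to the spin loops).

```cuda
// Illustrative FGL transfer with one lock per account (assumed layout).
// Acquiring the two locks in ascending order avoids lock-order deadlocks.
// Assumes from != to (the benchmark picks two distinct accounts).
#define NUM_ACCOUNTS (1 << 20)

__device__ int   account_lock[NUM_ACCOUNTS];   // 0 = free, 1 = taken
__device__ float balance[NUM_ACCOUNTS];

__device__ void transfer(int from, int to, float amount) {
    int lo = min(from, to);                    // global lock order
    int hi = max(from, to);
    while (atomicCAS(&account_lock[lo], 0, 1) != 0) { }
    while (atomicCAS(&account_lock[hi], 0, 1) != 0) { }
    balance[from] -= amount;                   // withdraw ...
    balance[to]   += amount;                   // ... and deposit
    __threadfence();
    atomicExch(&account_lock[hi], 0);          // release in reverse order
    atomicExch(&account_lock[lo], 0);
}
```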
Graph. The G(n, p)-instance of the Erdős-Rényi Graph Model (ERGM) [10] starts from an edgeless graph with \(n=10K\) nodes. A thread per node adds an undirected edge to any other node (= ICS item for conflict checking) with probability p. To illustrate the effect of the number of collisions, we study the two probabilities \(p=25\%\) and \(p=75\%\). STM: the atomic region is the insertion of an edge. FGL: the code locks the adjacency lists of both nodes that the new edge connects.
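A hedged sketch of the edge-generation step is shown below; the iteration order and the add_edge stub are assumptions for illustration, and the real insertion is what ICS, the STM atomic region, or the two adjacency-list locks guard.

```cuda
#include <curand_kernel.h>

// Placeholder for the guarded insertion into both adjacency lists
// (conflict items: the two endpoints of the new edge).
__device__ void add_edge(int u, int v) { /* guarded insertion goes here */ }

// Illustrative G(n,p) generation: one thread per node proposes an edge to
// every higher-numbered node with probability p (assumed iteration order).
__global__ void ergm(int n, float p, unsigned long long seed) {
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    if (u >= n) return;
    curandState st;
    curand_init(seed, u, 0, &st);              // independent stream per node
    for (int v = u + 1; v < n; ++v) {
        if (curand_uniform(&st) < p) {
            add_edge(u, v);
        }
    }
}
```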
Labyrinth. The largest 3D-mesh from the STAMP input files that fits into our memory has size (512, 512, 7). Thus 512 threads plan non-intersecting routes in parallel. All nodes of the route are the items for conflict checking. STM: a full routing step is the atomic region. FGL: there is a lock per mesh point. FGL and ICS: if a route hits a spot that is already part of another route, the thread tries (three times) to find a detour around it. This avoids recalculating the full route.
Genome. 8M threads try to reconstruct a genome from DNA segments that reside in a shared pool, which is a hash table. There must not be duplicates, and only one thread may check whether and where a segment from the pool matches the given genome. ICS checks conflicts on the bucket number. We consider a genome size of 65,536; DNA segments have a size of 192, and there are 1,677,726 such segments. STM and FGL: see Hash table.
Kmeans. 3M threads partition the same number of data items from a 32-dimensional space into 1,536 subsets (clusters). Until a fixed point is reached, all threads check the distance to the centers of all clusters and migrate a data item to the closest cluster (= item for conflict checking). STM: the migration is the atomic region. FGL: there is one lock per cluster; the code locks the two clusters that are affected by a migration.
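For illustration, the per-item search for the closest cluster center could look like the following sketch (array layout and names are assumptions); the returned cluster Id is the item for conflict checking, and the subsequent migration is what ICS, the STM atomic region, or the two cluster locks guard.

```cuda
// Illustrative search for the closest of K cluster centers in 32 dimensions
// (row-major center layout assumed). The result is the conflict-check item.
#define DIMS 32

__device__ int closest_cluster(const float* point, const float* centers, int K) {
    int   best     = 0;
    float bestDist = 3.402823466e38f;          // FLT_MAX
    for (int c = 0; c < K; ++c) {
        float d = 0.0f;
        for (int i = 0; i < DIMS; ++i) {
            float diff = point[i] - centers[c * DIMS + i];
            d += diff * diff;                  // squared Euclidean distance
        }
        if (d < bestDist) { bestDist = d; best = c; }
    }
    return best;                               // cluster Id = ICS item
}
```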
Vacation. The travel reservation system uses hash tables to store customers and their reservations for a hotel, a flight, and a rental car, i.e., on three potentially conflicting items. 4M parallel threads perform 4M (random) reservations, cancellations, and updates for full trips. There may be conflicts. There are configuration parameters for the likelihood of such conflicts and the mix of operations (for the STAMP expert: we use \(r=629148\), \(u=93\), \(q=90\)). STM: one operation on all three components of a trip is in the atomic region. FGL: there is a lock per hotel, flight, and car.