Automatic Skeleton-Driven Memory Affinity for Transactional Worklist Applications

International Journal of Parallel Programming

Abstract

Memory affinity has become a key element in achieving scalable performance on multi-core platforms. Mechanisms such as thread scheduling, page allocation and cache prefetching are commonly employed to enhance memory affinity, which keeps data close to the cores that access it. In particular, software transactional memory (STM) applications exhibit irregular memory access behavior, which makes it harder to determine which data each core will need and when. Additionally, existing STM runtime systems are decoupled from issues such as thread and memory management. In this paper, we therefore propose a skeleton-driven mechanism that improves memory affinity in STM applications that fit the worklist pattern. The mechanism takes a two-level approach. First, it addresses memory affinity at the DRAM level by automatically selecting page allocation policies. Then, it employs data prefetching helper threads to improve affinity at the cache level. It relies on a skeleton framework to exploit the application pattern in order to provide automatic memory page allocation and cache prefetching. Our experimental results on the STAMP benchmark suite show that the proposed mechanism achieves performance improvements of up to 46%, with an average of 11%, over a baseline version on two NUMA multi-core machines.
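
To make the cache-level idea concrete, the following is a minimal sketch of a worklist skeleton with a helper thread that looks ahead in the worklist and touches the data that upcoming work units will need. It is not the paper's implementation. The names `Worklist`, `run_worklist`, `process`, `touch` and `prefetch_depth` are hypothetical, a plain lock stands in for transactional access to the worklist, and reading a value in Python only illustrates the structure of prefetching (it gives no control over hardware caches). The sketch also assumes a fixed set of work units, so it omits the dynamic creation of new work that real worklist applications perform.

```python
import threading
from collections import deque

_EMPTY = object()  # sentinel returned when the worklist has no more work


class Worklist:
    """Shared worklist; a lock stands in for transactional access (assumption)."""

    def __init__(self, items):
        self._items = deque(items)
        self._lock = threading.Lock()

    def pop(self):
        # Remove and return the next work unit, or _EMPTY if none is left.
        with self._lock:
            return self._items.popleft() if self._items else _EMPTY

    def peek(self, depth):
        # Return up to `depth` upcoming work units without removing them.
        with self._lock:
            return [self._items[i] for i in range(min(depth, len(self._items)))]


def run_worklist(items, process, num_workers=2, prefetch_depth=4, touch=None):
    """Run `process` on every work unit; optionally run a look-ahead helper."""
    worklist = Worklist(items)
    results, results_lock = [], threading.Lock()
    workers_done = threading.Event()

    def worker():
        while True:
            item = worklist.pop()
            if item is _EMPTY:
                return
            value = process(item)
            with results_lock:
                results.append(value)

    def helper():
        # Hypothetical helper thread: touch data for the next few work units
        # so that it is more likely to be close to the cores when needed.
        while not workers_done.is_set():
            for item in worklist.peek(prefetch_depth):
                touch(item)
            workers_done.wait(0.001)  # short pause between look-ahead passes

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    helper_thread = threading.Thread(target=helper) if touch else None
    if helper_thread:
        helper_thread.start()
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    workers_done.set()  # tell the helper that no work remains
    if helper_thread:
        helper_thread.join()
    return results
```

Because the skeleton owns the worklist, the helper can discover upcoming work without any change to the application code passed in as `process`. This is the kind of pattern knowledge the abstract refers to when it says the framework exploits the application pattern.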

Notes

  1. Remote read latency divided by local read latency (obtained from BenchIT).

Corresponding author

Correspondence to Luís Fabrício Wanderley Góes.

About this article

Cite this article

Góes, L.F.W., Ribeiro, C.P., Castro, M. et al. Automatic Skeleton-Driven Memory Affinity for Transactional Worklist Applications. Int J Parallel Prog 42, 365–382 (2014). https://doi.org/10.1007/s10766-013-0253-x

  • DOI: https://doi.org/10.1007/s10766-013-0253-x
