Dynamic concurrency throttling on NUMA systems and data migration impacts



Janaina Schwarzrock, et al. [full author details at the end of the article]

Received: 28 April 2020 / Accepted: 26 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
Many parallel applications do not scale as the number of threads increases, which means that using the maximum number of threads will not always deliver the best outcome in performance or energy consumption. Therefore, many works have proposed strategies for tuning the number of threads to optimize for performance or energy. Since parallel applications may have more than one parallel region, these tuning strategies can determine either a specific number of threads for each of the application's parallel regions or a fixed number of threads for the whole application execution. In the former case, strategies apply Dynamic Concurrency Throttling (DCT), which enables adapting the number of threads at runtime. However, DCT incurs overheads, such as creating/destroying threads and cache warm-up. These overheads can be further aggravated on Non-Uniform Memory Access (NUMA) systems, where changing the number of threads may incur remote memory accesses or, more importantly, data migration between nodes. Thus, tuning strategies should not only determine the best number of threads locally, for each parallel region, but also be aware of the impact of applying DCT. This work investigates how parallel regions may influence each other when DCT is employed, showing that data migration may represent a considerable overhead. These overheads affect the strategy's solution, impacting overall application performance and energy consumption. We demonstrate why many approaches that perform well in simulated environments will very likely fail, or will hardly reach a near-optimal solution, when executed on real hardware.

Keywords: Parallel computing · Online optimization · Thread throttling · Data migration

1 Introduction

Thread-level parallelism (TLP) has been largely exploited to accelerate applications by using all available resources in multicore processors. The principle is straightforward: reduce the execution time of an application by concurrently executing parts of its code (commonly called parallel regions) with as many threads as possible. However, for software reasons (e.g., data synchronization and communication) and hardware reasons (e.g., off-chip bus saturation), executing these regions with all available cores in the system may not achieve the best outcome in other metrics, such as the energy-performance tradeoff, represented by the lowest energy-delay product (EDP) [18,22,23,25,33,38,39].
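For reference, the EDP of a region is simply the product of the energy E it consumes and its execution time (delay) T:

    EDP = E × T

For example, a configuration that consumes 50 J over 2 s yields an EDP of 100 J·s; lower is better. Minimizing EDP therefore favors thread counts that save energy without a proportional loss in performance.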


Hence, each parallel region may have a different optimal configuration (i.e., number of threads) that delivers the best EDP. To better understand this scenario, Fig. 1 shows the EDP for the execution of three parallel regions of the SP kernel from the NAS Parallel Benchmarks [2] on a 32-core system.
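To make the mechanism concrete, the minimal OpenMP sketch below (in C) illustrates how DCT is typically realized. It is a hypothetical example, not the strategy evaluated in this article: the two loops and the per-region thread counts are placeholders standing in for values a tuning strategy would select. The key point is that omp_set_num_threads() is called before each parallel region, so each region runs with its own concurrency level.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000
    static double a[N], b[N];

    int main(void) {
        /* Hypothetical per-region thread counts, as a DCT-based
           tuning strategy might select them. */
        int threads_r1 = 8;   /* e.g., a memory-bound region */
        int threads_r2 = 32;  /* e.g., a compute-bound region */

        omp_set_num_threads(threads_r1);  /* DCT: throttle region 1 */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 0.5 * i;

        omp_set_num_threads(threads_r2);  /* DCT: widen region 2 */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            b[i] = a[i] * a[i];

        printf("%f\n", b[N - 1]);
        return 0;
    }

Note that on a NUMA machine with a first-touch allocation policy, the pages of array a are placed on the nodes of the 8 threads that first touch them in region 1; widening to 32 threads in region 2 then triggers exactly the remote accesses or data migration between nodes discussed above.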