DySHARQ: Dynamic Software-Defined Hardware-Managed Queues for Tile-Based Architectures
Sven Rheindt¹ • Sebastian Maier² • Nora Pohle¹ • Lars Nolte¹ • Oliver Lenke¹ • Florian Schmaus² • Thomas Wild¹ • Wolfgang Schröder-Preikschat² • Andreas Herkersdorf¹

Received: 3 April 2020 / Accepted: 5 November 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract The recent trend towards tile-based manycore architectures has helped to tackle the memory wall by physically distributing memories and processing nodes. However, this introduced a data-to-task locality challenge, and inter-tile communication often imposes significant software overhead. We therefore proposed software-defined, hardware-managed SHARQ queues, which enable efficient inter-tile communication through user-defined queues with arbitrarily sized elements. To ensure (remote) processing of queued elements, SHARQ introduces an optional handler task, which is scheduled by hardware on demand. Queue management, intra- and inter-tile data transfer, and handler task invocation are handled entirely by hardware. Only rare tasks, such as dynamic queue creation at run-time, are performed in software. DySHARQ, an extension of SHARQ, enables dynamic and concurrent queue memory management and queue length adjustment, so that queues can adapt to changes in application and resource requirements. The DySHARQ hardware monitors the queue memory requirements at run-time and conditionally schedules a software-defined memory management task. It further optimizes the hardware-software interaction for local queue operations. We integrated DySHARQ into the MPI library used by the NAS benchmarks. The evaluation shows a reduction in execution time of up to 43% (compared to software) for the communication-intensive IS kernel in a 4×4 tile design on an FPGA platform with a total of 80 LEON3 cores. The dynamic memory management reduces the memory footprint by 3.75× in a 2×2 design.

Keywords Distributed manycore architecture · Hardware-software co-design · Inter-tile communication · Hardware-accelerated queue · Data-to-task locality
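The division of labour described in the abstract can be pictured with a small software model. The following Python sketch is only an illustration under stated assumptions: the class `SketchQueue`, its methods, the fill-level thresholds, and the doubling/halving policy are all hypothetical and are not taken from the DySHARQ hardware or its software interface. It models a queue with arbitrarily sized elements, a handler that drains the queue when invoked, and a monitor that triggers a memory management step only when a threshold is crossed.

```python
from collections import deque
from typing import Callable, Optional


class SketchQueue:
    """Toy model of a handler-driven queue with dynamic capacity.

    Hypothetical names and policy; not the DySHARQ interface.
    """

    def __init__(self, capacity_bytes: int,
                 handler: Optional[Callable[[bytes], None]] = None,
                 grow_threshold: float = 0.75,
                 shrink_threshold: float = 0.25,
                 min_capacity: int = 64):
        self.capacity = capacity_bytes        # current queue memory in bytes
        self.handler = handler                # optional handler task
        self.grow_threshold = grow_threshold
        self.shrink_threshold = shrink_threshold
        self.min_capacity = min_capacity
        self.used = 0                         # bytes occupied by queued elements
        self.elements = deque()
        self.resize_log = []                  # (old_capacity, new_capacity) pairs

    def _monitor(self) -> None:
        # Stands in for the run-time monitor: the management task is
        # scheduled only if the fill level crosses a threshold.
        fill = self.used / self.capacity
        if fill > self.grow_threshold:
            self._manage_memory(self.capacity * 2)
        elif (fill < self.shrink_threshold
              and self.capacity // 2 >= self.min_capacity):
            self._manage_memory(self.capacity // 2)

    def _manage_memory(self, new_capacity: int) -> None:
        # Stands in for the software-defined memory management task.
        self.resize_log.append((self.capacity, new_capacity))
        self.capacity = new_capacity

    def enqueue(self, element: bytes) -> bool:
        # Elements may have any size; reject those that do not fit.
        if self.used + len(element) > self.capacity:
            return False
        self.elements.append(element)
        self.used += len(element)
        self._monitor()
        return True

    def run_handler(self) -> int:
        # Drain the queue through the handler, as if it had been
        # scheduled on demand; returns the number of processed elements.
        processed = 0
        while self.elements and self.handler is not None:
            element = self.elements.popleft()
            self.used -= len(element)
            self.handler(element)
            processed += 1
        self._monitor()
        return processed
```

In this model, a producer calls `enqueue` and a consumer-side scheduler calls `run_handler`; the capacity grows when the queue fills up and shrinks again once it has been drained, which mirrors the idea of adapting queue memory to changing demand rather than fixing it at creation time.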
Corresponding author: Sven Rheindt [email protected]
¹ Technical University of Munich (TUM), Munich, Germany
² Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
International Journal of Parallel Programming
1 Introduction

Performance scaling of computer architectures is highly dependent on memory access latency and bandwidth. As the memory wall hindered the further scaling of classical multi-core architectures [1–3], tile-based manycore architectures have become popular [4–9]. Although the physically distributed memories and processing nodes reduce access hot-spots and latencies, this approach has not yet solved the data-to-task locality issue [10, 11]. Distributed and parallel operating systems [12] and applications help to exploit the increased scalability of these architectures, but they need efficient mechanisms for inter-tile communication (ITC), thread synchronization, and data transport [13, 14]. Comm