NUMA-aware image compositing on multi-GPU platform


ORIGINAL ARTICLE

Pan Wang · Zhiquan Cheng · Ralph Martin · Huahai Liu · Xun Cai · Sikun Li

Published online: 26 April 2013 © Springer-Verlag Berlin Heidelberg 2013

Abstract Sort-last parallel rendering is widely used. Recent GPU developments mean that a PC equipped with multiple GPUs is a viable alternative to a high-cost supercomputer: the Fermi architecture of a single GPU supports unified virtual addressing, providing a foundation for non-uniform memory access (NUMA) on multi-GPU platforms. Such hardware changes require users to reconsider parallel rendering algorithms. In this paper, we thoroughly investigate the NUMA-aware image compositing problem; image compositing is the key final stage in sort-last parallel rendering. Based on the proven radix-k strategy, we identify an optimal compositing algorithm that takes advantage of the NUMA architecture on the multi-GPU platform. We quantitatively analyze different image compositing modes for practical image compositing, taking into account peer-to-peer communication costs between GPUs. Our experiments on various datasets show that our image compositing method is very fast: an image of a few megapixels can be composited in about 10 ms by eight GPUs.

Keywords Multi-GPU system · Parallel rendering · Image compositing

P. Wang · Z. Cheng (corresponding author) · H. Liu · X. Cai · S. Li
School of Computer Science, National University of Defense Technology, Hunan, China

R. Martin
School of Computer Science & Informatics, Cardiff University, Cardiff, UK

1 Introduction

Parallel rendering is an important technique for visualizing complex scenes in computer graphics, scientific visualization, CAD, and virtual reality. Parallel rendering distributes data to different processors, then sorts and composites the locally rendered data to produce the output image. Depending on when the sort is performed, parallel rendering takes one of three main approaches: sort-first, sort-middle, or sort-last [10]. Unlike the sort-first and sort-middle approaches, sort-last parallel rendering has the distinct advantages of high scalability and good load balancing. Task division for parallel geometry processing and rasterization is also simple, which makes sort-last a prime candidate for extending visualization software to high-performance parallel rendering. However, it requires the intermediate images from the processing nodes to be composited to create the final image (i.e., image compositing) [2, 14, 15, 18]. Even for an image of no more than a few tens of megapixels, this is a very time-consuming task, even on a supercomputer. For example, compositing a 64-megapixel result on 32K cores takes over 80 ms (much too long for real-time applications) using the IBM Blue Gene/P Intrepid machine at Argonne National Laboratory or the Cray XT5 Jaguar at Oak Ridge [6]. Multi-GPU