Comprehensive techniques of multi-GPU memory optimization for deep learning acceleration


Youngrang Kim1 • Jaehwan Lee1 • Jik-Soo Kim2 • Hyunseung Jei3 • Hongchan Roh3

Received: 28 December 2018 / Revised: 2 May 2019 / Accepted: 14 August 2019
© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract

This paper presents a comprehensive suite of techniques for optimized memory management in multi-GPU systems to accelerate deep learning application execution. We employ a hybrid utilization of GPU and CPU memories in a multi-GPU environment by effectively addressing contention issues in the shared interconnect (e.g., PCIe, NVLink). In addition, we designed and implemented an intelligent prefetching algorithm (from CPU memory to GPU) that achieves the highest processing throughput while sustaining a large mini-batch size. We implemented our optimization techniques on TensorFlow and performed extensive experiments in various multi-GPU environments, including traditional PCIe and the latest high-bandwidth interconnect, NVLink. Evaluation results show that our proposed scheme improves computing performance by reducing the I/O bottleneck and effectively increases the mini-batch size without sacrificing overall training throughput.

Keywords Convolutional neural network • GPGPU • Multi-GPU • Mini-batch

1 Korea Aerospace University, Goyang-si, Republic of Korea
2 Myongji University, Yongin-si, Republic of Korea
3 SK Telecom ML Infra Lab, Seongnam-si, Republic of Korea

1 Introduction

A convolutional neural network (CNN) uses convolution layers to extract features from input data and performs training using those features [1]. It has been widely adopted in deep learning frameworks. With the increased computing power of general-purpose GPUs (GPGPUs), parallel operations in a CNN can be effectively accelerated. However, due to physical limitations in the amount of available GPU memory, it is not always possible to compute large-batch input data or large CNN models. In a typical CNN, the feature map data, which are the outputs of convolution layers, occupy the largest portion of GPGPU memory. Feature map data are generated during the feed-forward process. However, they are not used again until the backward-propagation process. Therefore, the feature map data can stay in GPU memory for a relatively long time without actual usage until backward propagation begins. To address this problem, NVIDIA proposed virtualized deep neural networks (vDNN) [2], a runtime memory management system that virtualizes GPU and CPU memory usage. To overcome the physical limitation of available GPGPU memory, vDNN swaps out feature map data, which normally remain in GPU memory for reuse but are not immediately required for p