Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster
Youngrang Kim1 · Hyeonseong Choi1 · Jaehwan Lee1 · Jik-Soo Kim2 · Hyunseung Jei3 · Hongchan Roh3
Received: 29 November 2019 / Revised: 3 May 2020 / Accepted: 21 June 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
This paper presents a novel "Distributed Deep Learning Framework" for a heterogeneous multi-GPU cluster that can effectively improve overall resource utilization without sacrificing training accuracy. Specifically, we employ a hybrid aggregation approach that combines parameter-server and all-reduce schemes in order to address potential performance degradation when running deep learning applications on a heterogeneous computing system. In addition, we design and implement an asynchronous large mini-batch training mechanism that maintains training accuracy for asynchronous data-parallel deep learning, with enhanced collective communication capability based on MPI. We implement our proposed framework on TensorFlow and perform extensive experiments on both homogeneous and heterogeneous computing systems. Evaluation results show that our proposed framework can improve computing performance by reducing I/O bottlenecks and effectively increasing resource utilization in the heterogeneous multi-GPU cluster.

Keywords Data parallel · Distributed deep learning · Heterogeneous cluster · Large-scale deep learning
Jaehwan Lee (corresponding author) [email protected]
Youngrang Kim [email protected]
Hyeonseong Choi [email protected]
Jik-Soo Kim [email protected]
Hyunseung Jei [email protected]
Hongchan Roh [email protected]

1 Korea Aerospace University, Goyang-si, Republic of Korea
2 Myongji University, Yongin-si, Republic of Korea
3 SK Telecom ML Infra Lab., Seongnam-si, Republic of Korea

1 Introduction

Recently, distributed deep learning frameworks have been proposed [1] to accelerate overall deep learning computations by exploiting multiple GPUs and multiple computing nodes. Typically, distributed deep learning mechanisms can be classified into asynchronous and synchronous aggregation based on the execution timing of the aggregation operations. They can be further categorized into parameter-server [2] and all-reduce [3] schemes depending on how data is exchanged for aggregation among training workers. However, employing combinations of these distributed deep learning mechanisms on top of a heterogeneous multi-GPU cluster may result in low computing resource utilization. In the case of synchronous training, workers may have to wait a substantial amount of time for relatively slow workers (stragglers), which lowers overall computing performance. To address this problem, Ho et al. proposed the Stale-Synchronous Parallel Parameter Server [4], which specifies a staleness threshold: each worker keeps the difference between its own number of training iterations and that of the slowest worker below the threshold.
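The staleness-bounded progress rule behind the Stale-Synchronous Parallel scheme can be sketched as follows. This is an illustrative simplification, not code from the paper or from [4]; the function and variable names (`can_proceed`, `worker_iters`, `staleness_threshold`) are our own.

```python
def can_proceed(worker_iters, worker_id, staleness_threshold):
    """Return True if the given worker may start its next iteration.

    Under SSP, a worker may run ahead of the slowest worker by at most
    `staleness_threshold` iterations; otherwise it must wait.

    worker_iters: list of current iteration counts, one per worker.
    """
    slowest = min(worker_iters)
    return worker_iters[worker_id] - slowest <= staleness_threshold


# Worker 0 is at iteration 12, worker 1 (the straggler) at iteration 10.
iters = [12, 10]
print(can_proceed(iters, 0, 3))  # 2 iterations ahead, threshold 3 -> True
print(can_proceed(iters, 0, 1))  # 2 iterations ahead, threshold 1 -> False
```

With a threshold of 0 this rule degenerates to fully synchronous training (every worker waits for the slowest), while an unbounded threshold recovers fully asynchronous training; the threshold thus trades straggler tolerance against gradient staleness.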