Distributed deep learning system for cancerous region detection on Sunway TaihuLight



REGULAR PAPER

Distributed deep learning system for cancerous region detection on Sunway TaihuLight

GuoFeng Lv¹ · MingFan Li¹ · Hong An¹ · Han Lin¹ · Junshi Chen¹ · Wenting Han¹ · Qian Xiao² · Fei Wang² · Rongfen Lin²

Received: 26 February 2020 / Accepted: 2 July 2020
© China Computer Federation (CCF) 2020

Abstract
To explore the potential of distributed training of deep neural networks, we implement several distributed algorithms on top of swFlow on the world-leading supercomputer Sunway TaihuLight. Starting from two naive designs, parameter server and ring all-reduce, we present the limitations of their communication models and discuss optimizations that adapt them to the five-level interconnect architecture of the Sunway system. To reduce the communication bottleneck on large-scale systems, multi-server and hierarchical ring all-reduce models are introduced. On a benchmark taken from a deep learning-based cancerous region detection algorithm, the average parallel efficiency exceeds 80% on up to 1024 processors. This reveals a great opportunity for combining deep learning with HPC systems.

Keywords  Deep neural network · Parameter server · Ring all-reduce · Cancerous region detection

* GuoFeng Lv

¹ School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, Anhui, China
² Wuxi Jiangnan Institute of Computing Technology, Wuxi 214083, Jiangsu, China

1 Introduction

Over the past few years, advances in deep learning have driven tremendous progress and outperformed many state-of-the-art approaches in conventional fields. In particular, deep learning models can now recognize images, process natural language, and defeat humans in challenging strategy games. As a result, deep learning has drawn wide attention from experts across many areas, as shown by last year's exascale deep learning for climate analysis on Summit, a result from the traditional HPC domain. Deep learning usually demands a large amount of training data and powerful computing resources for data analysis. On the one hand, there is a growing demand to accelerate intelligent applications on a wide spectrum of devices, ranging from high-performance compute cards such as Intel KNL and NVIDIA GPUs to custom AI accelerators such as Google TPUs or the Cambricon series. On the other hand, recent experiments in which DeepLabv3+ scales up to 27,360 V100 GPUs with a peak throughput of 1.13 EF/s show the vast potential of distributed training (Kurth et al. 2018). In general, the high-performance supercomputer has become an appealing platform for the costly model training that deep learning requires. However, as cluster scale and accelerator performance increase, heavy communication has become the bottleneck for distributed applications (Abadi et al. 2016; Akiba et al. 2017a).
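To make the communication cost concrete, the following is a minimal sketch of a generic ring all-reduce (sum), simulated in a single Python process. It is not the swFlow implementation described in this paper; the function name `ring_allreduce` and the list-based "workers" are assumptions made purely for illustration.

```python
def ring_allreduce(grads):
    """Simulate a ring all-reduce (sum) over one gradient list per worker.

    Hypothetical helper for illustration only. Returns the reduced data
    held by every worker and the number of elements each worker sent.
    """
    n = len(grads)
    size = len(grads[0])
    # Split every vector into n contiguous chunks of nearly equal length.
    bounds = [(k * size) // n for k in range(n + 1)]
    data = [list(g) for g in grads]
    sent = [0] * n

    def chunk(k):
        k %= n
        return range(bounds[k], bounds[k + 1])

    # Phase 1, scatter-reduce: in step s, worker i sends chunk (i - s) to
    # its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        msgs = [(i, i - s, [data[i][j] for j in chunk(i - s)]) for i in range(n)]
        for i, k, payload in msgs:
            dst = (i + 1) % n
            for j, v in zip(chunk(k), payload):
                data[dst][j] += v
            sent[i] += len(payload)

    # Phase 2, all-gather: worker i now owns the fully reduced chunk (i + 1)
    # and the reduced chunks are passed around the ring, overwriting copies.
    for s in range(n - 1):
        msgs = [(i, i + 1 - s, [data[i][j] for j in chunk(i + 1 - s)]) for i in range(n)]
        for i, k, payload in msgs:
            dst = (i + 1) % n
            for j, v in zip(chunk(k), payload):
                data[dst][j] = v
            sent[i] += len(payload)

    return data, sent


if __name__ == "__main__":
    workers = [[w * 10 + j for j in range(8)] for w in range(4)]
    reduced, sent = ring_allreduce(workers)
    print(reduced[0])  # every worker ends up with the element-wise sum
    print(sent)        # elements sent by each worker
```

In this sketch each of the N workers sends 2(N-1)/N times the vector length, independent of which worker it is, whereas in a naive single parameter server every worker's gradients pass through one server link. This is the kind of imbalance that motivates the communication optimizations discussed in the rest of the paper.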
