Survey and design of paleozoic: a high-performance compiler tool chain for deep learning inference accelerator


REGULAR PAPER

Zihan Liu¹ · Jingwen Leng¹ · Guandong Lu¹ · Chenhui Wang¹ · Quan Chen¹ · Minyi Guo¹

Received: 20 March 2020 / Accepted: 17 July 2020
© China Computer Federation (CCF) 2020

Abstract

Many hardware vendors have introduced specialized accelerators for deep learning because of their high performance and efficiency. However, different vendors adopt different accelerator architectures, which makes it challenging for a compiler tool-chain to generate and optimize high-performance code. Moreover, the tool-chains currently provided by vendors are either highly abstract, which makes them hard to optimize, or contain too many hardware-related details, which makes them inconvenient to program. In this paper, we propose a middle-layer compiler tool-chain for the Cambricon MLU-100 that fills the gap between the high-level runtime library and the low-level, operator-level SDK. Our tool-chain is based on the operator-level SDK but abstracts away its redundant initialization and allocation statements. Unlike the existing runtime, it also exposes the interfaces of the major optimization knobs, thus enabling a considerable optimization space. We evaluate our work on several state-of-the-art neural networks, using lines of code and optimization knobs as evaluation metrics. We also compare its performance against the state-of-the-art tool-chain TensorRT under a simple optimization strategy and find that our work has great potential for optimization. Our work offers the user a vast optimization space with only around 20% of the code, because it hides the redundant initialization and allocation statements from users.

Keywords  Deep learning accelerator · Compiler tool-chain · Hardware-related optimization

* Jingwen Leng leng‑[email protected]
* Minyi Guo guo‑[email protected]
Zihan Liu [email protected]
Guandong Lu [email protected]
Chenhui Wang wang‑chen‑[email protected]
Quan Chen chen‑[email protected]

¹ Shanghai Jiao Tong University, Shanghai 200240, China

1 Introduction

1.1 Deep learning accelerator

With the evolution of computing power, computation-intensive deep learning has been increasingly applied in key application domains, including computer vision, natural language processing, etc. Conventional general-purpose processors such as CPUs and GPUs can hardly meet the growing demand for computation power. On the other hand, the computation patterns in deep learning are good candidates for hardware specialization. A deep neural network contains only a few kinds of patterns, including convolution, pooling, activation, batch normalization, and fully connected layers. These calculations are mostly based on linear calculation, in conjunction with linear transformations, matrix decomposition, etc. General-purpose CPUs, which adopt deep and complex pipelines, are highly inefficient in this scenario. Since the linear calculation deals with a huge amount of data, the optimal memory hierarchy i