Fast LSTM by dynamic decomposition on cloud and distributed systems
Yang You¹ · Yuxiong He² · Samyam Rajbhandari² · Wenhan Wang² · Cho-Jui Hsieh³ · Kurt Keutzer⁴ · James Demmel⁴

Received: 20 June 2020 / Accepted: 27 June 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020
Abstract

Long short-term memory (LSTM) is a powerful deep learning technique that has been widely used in many real-world data-mining applications such as language modeling and machine translation. In this paper, we aim to minimize the latency of LSTM inference on cloud systems without losing accuracy. If an LSTM model does not fit in cache, the latency due to data movement will likely exceed the latency due to computation, so in this case we reduce the number of model parameters. If, as in most applications we consider, the LSTM model fits in the cache of cloud server processors, we focus on reducing the number of floating point operations (flops), which has a corresponding linear impact on inference latency. Our system therefore dynamically reduces either model parameters or flops, depending on which contributes more to latency. Our inference system is based on singular value decomposition (SVD) and canonical polyadic (CP) decomposition; it is accurate and achieves low latency. We evaluate our system on models from a series of real-world applications, including language modeling, computer vision, question answering, and sentiment analysis. Users of our system can either start from pretrained models or train from scratch. Our system achieves a 15× average speedup across six real-world applications without losing inference accuracy. We also design and implement a distributed optimization system with dynamic decomposition, which significantly reduces energy cost and accelerates training.

Keywords: LSTM · Fast inference · Dynamic decomposition
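To make the low-rank idea concrete, the sketch below factorizes a single dense weight matrix with truncated SVD. It is a minimal illustration under our own naming (`svd_compress` and the chosen sizes are ours, not the paper's API), showing why a rank-r factorization cuts both parameters and flops from n·m to r·(n+m).

```python
import numpy as np

def svd_compress(W, rank):
    """Replace a dense matrix W (n x m) with two low-rank factors.

    x @ W costs n*m multiply-adds per input row; (x @ U_r) @ V_r costs
    rank*(n+m), so both parameter count and flop count shrink whenever
    rank << min(n, m).
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # fold singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r, V_r

# Illustrative sizes: a 1024x1024 gate matrix compressed to rank 64
# gives roughly 8x fewer parameters and flops.
W = np.random.randn(1024, 1024).astype(np.float32)
U_r, V_r = svd_compress(W, rank=64)
x = np.random.randn(1, 1024).astype(np.float32)
y_dense = x @ W
y_lowrank = (x @ U_r) @ V_r   # approximates y_dense; exact if rank = min(n, m)
```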
1 Introduction

Long short-term memory (LSTM) is a powerful deep learning technique that has been used in many real-world applications such as language modeling [26], machine translation [56], speech recognition [57], and visual question answering [23]. Because of LSTM's good performance in these applications, there is growing interest in using LSTM inference in recommendation systems for domains such as movie recommendation [55] and search-based online advertising [62].
While researchers keep improving the training speed [60,61], the inference speed of these LSTM-based systems can be a major bottleneck, and the resulting delays degrade the user experience. Our specific target is to reduce LSTM inference latency to under 50 ms, the threshold we set for a good interactive user experience. Previous work on improving the speed of LSTMs has focused on reducing the number of parameters in the LSTM. Although reducing the number of parameters is important, we find that the number of floating point operations (flops) is the major overhead when the model fits in the cache of cloud servers. In this situation, the LSTM inference latency is dominated by computation rather than data movement, so reducing flops matters more than reducing parameters.
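The resulting dynamic rule is simple to state. The sketch below illustrates it under our own assumptions; the function name, the bytes-per-parameter figure, and the cache size are illustrative and not taken from the paper.

```python
def dominant_bottleneck(num_params, cache_bytes, bytes_per_param=4):
    """Pick the quantity to reduce, following the rule described above:
    when the model spills out of cache, latency is dominated by data
    movement, so target the parameter count; when it fits, latency is
    dominated by computation, so target the flop count.
    """
    model_bytes = num_params * bytes_per_param
    return "parameters" if model_bytes > cache_bytes else "flops"

# Example: a 4M-parameter fp32 model (16 MB) against a 35 MB last-level cache.
print(dominant_bottleneck(4_000_000, cache_bytes=35 * 2**20))  # -> "flops"
```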