Fast LSTM by dynamic decomposition on cloud and distributed systems
Yang You¹ · Yuxiong He² · Samyam Rajbhandari² · Wenhan Wang² · Cho-Jui Hsieh³ · Kurt Keutzer⁴ · James Demmel⁴

Received: 20 June 2020 / Accepted: 27 June 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020
Abstract

Long short-term memory (LSTM) is a powerful deep learning technique that has been widely used in many real-world data-mining applications such as language modeling and machine translation. In this paper, we aim to minimize the latency of LSTM inference on cloud systems without losing accuracy. If an LSTM model does not fit in cache, the latency due to data movement will likely exceed the latency due to computation, so in this case we reduce the number of model parameters. If, as in most applications we consider, the LSTM model fits in the cache of cloud server processors, we focus on reducing the number of floating point operations (flops), which has a corresponding linear impact on inference latency. Our system therefore dynamically reduces either model parameters or flops, depending on which contributes more to latency. Our inference system is based on singular value decomposition (SVD) and canonical polyadic (CP) decomposition; it is accurate and achieves low latency. We evaluate our system on models from a series of real-world applications, including language modeling, computer vision, question answering, and sentiment analysis. Users of our system can either start from pretrained models or train from scratch. Our system achieves a 15× average speedup across six real-world applications without losing inference accuracy. We also design and implement a distributed optimization system with dynamic decomposition, which significantly reduces energy cost and accelerates training.

Keywords: LSTM · Fast inference · Dynamic decomposition
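To make the low-rank idea concrete, the sketch below factorizes a single dense weight matrix with truncated SVD. It is a minimal illustration under our own naming (`svd_compress` and the chosen sizes are ours, not the paper's API), showing why a rank-r factorization cuts both parameters and flops from n·m to r·(n+m).

```python
import numpy as np

def svd_compress(W, rank):
    """Replace a dense matrix W (n x m) with two low-rank factors.

    x @ W costs n*m multiply-adds per input row; (x @ U_r) @ V_r costs
    rank*(n+m), so both parameter count and flop count shrink whenever
    rank << min(n, m).
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # fold singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r, V_r

# Illustrative sizes: a 1024x1024 gate matrix compressed to rank 64
# gives roughly 8x fewer parameters and flops.
W = np.random.randn(1024, 1024).astype(np.float32)
U_r, V_r = svd_compress(W, rank=64)
x = np.random.randn(1, 1024).astype(np.float32)
y_dense = x @ W
y_lowrank = (x @ U_r) @ V_r   # approximates y_dense; exact if rank = min(n, m)
```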
1 Introduction

Long short-term memory (LSTM) is a powerful deep learning technique that has been used in many real-world applications such as language modeling [26], machine translation [56], speech recognition [57], and visual question answering [23]. Because of LSTM's good performance in these applications, there is growing interest in using LSTM inference in recommendation systems for domains such as movie recommendation [55] and search-based online advertising [62].
While researchers keep improving the training speed [60,61], the inference speed of these LSTM-based systems can be a major bottleneck, and the resulting delays degrade the user experience. Our specific target is to reduce LSTM inference latency to under 50 ms, the threshold we set for a good interactive user experience. Previous work on improving the speed of LSTMs has focused on reducing the number of parameters in the LSTM. Although reducing the number of parameters is important, we find that the number of floating point operations (flops) is the major overhead when the model fits in the cache of cloud servers. In this situation, the LSTM inference latency is dominated by computation rather than data movement, so reducing flops matters more than reducing parameters.
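The resulting dynamic rule is simple to state. The sketch below illustrates it under our own assumptions; the function name, the bytes-per-parameter figure, and the cache size are illustrative and not taken from the paper.

```python
def dominant_bottleneck(num_params, cache_bytes, bytes_per_param=4):
    """Pick the quantity to reduce, following the rule described above:
    when the model spills out of cache, latency is dominated by data
    movement, so target the parameter count; when it fits, latency is
    dominated by computation, so target the flop count.
    """
    model_bytes = num_params * bytes_per_param
    return "parameters" if model_bytes > cache_bytes else "flops"

# Example: a 4M-parameter fp32 model (16 MB) against a 35 MB last-level cache.
print(dominant_bottleneck(4_000_000, cache_bytes=35 * 2**20))  # -> "flops"
```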