Multi-GPU Based Recurrent Neural Network Language Model Training

Recurrent neural network language models (RNNLMs) have been applied in a wide range of research fields, including nature language processing and speech recognition. One challenge in training RNNLMs is the heavy computational cost of the crucial back-propa

  • PDF / 434,936 Bytes
  • 10 Pages / 439.37 x 666.142 pts Page_size
  • 99 Downloads / 227 Views

DOWNLOAD

REPORT


Abstract. Recurrent neural network language models (RNNLMs) have been applied in a wide range of research fields, including nature language processing and speech recognition. One challenge in training RNNLMs is the heavy computational cost of the crucial back-propagation (BP) algorithm. This paper presents an effective approach to train recurrent neural network on multiple GPUs, where parallelized stochastic gradient descent (SGD) is applied. Results on text-based experiments show that the proposed approach achieves 3.4× speedup on 4 GPUs than the single one, without any performance loss in language model perplexity.

1

Introduction

Statistical language models (LMs) play a vital role in modern nature language processing (NLP) systems. To address the data sparsity problem in conventional n-gram LMs, neural network language models (NNLMs) [1–6] have been proposed, which use shallow neural network to extract features from input words, and back-propagation (BP) [14] algorithm to update the network parameters. Depending on the network architecture, NNLMs can be classified into two groups: feed-forward NNLM [1,2] where a finite length of context is considered, and recurrent NNLMs (RNNLMs) [3–6] where a recurrent vector is used to preserve longer and variable length context. In recent years, RNNLMs have been reported to obtain significant performance improvements over n-gram LMs and feed-forward NNLMs on a wide range of tasks [3,4]. To build language models based on recurrent neural networks (RNNs), a practical issue is the heavy computational complexity in training the neural network. In typical tasks with huge quantities of input data, the training process exceeds days or even weeks. Such long training periods are uncomfortable for practical use. One direction to accelerate RNN training speed is to use parallel algorithm to train the network on many-core machines or clusters. However, while BP algorithm plays a vital role in RNN training, it is difficult to develop an effective parallel training algorithm, for BP involves a full model update after each iteration. This paper presents a effective data parallelism approach to train RNNs on multiple GPUs, using a ring-oriented graph-structure. Experimental results on c Springer Science+Business Media Singapore 2016  W. Che et al. (Eds.): ICYCSEE 2016, Part I, CCIS 623, pp. 484–493, 2016. DOI: 10.1007/978-981-10-2053-7 43

Multi-GPU Based Recurrent Neural Network Language Model Training

485

the famous Penn Treebank Wall Street Journal corpus show a 3.4× speedup on a cluster with 4 GPUs compared with single one, without degrading performance of the trained language model. The rest of this paper is organized as follows. Section 2 discusses related works. Section 3 describes the architecture of typical recurrent neural networks and critical factors in training. Section 4 introduces the parallel RNN training approach and shows how to apply it on multiple GPUs. In Sect. 5 experimental results on text-based dataset are presented. Conclusions and future works are discussed in Sect. 6.