Feature Learning via Deep Belief Network for Chinese Speech Emotion Recognition


Institute of Intelligent Information Processing, Taizhou University, Taizhou, China [email protected]

Abstract. Speech emotion recognition is an interesting and challenging subject due to the emotion gap between speech signals and high-level speech emotion. To bridge this gap, this paper presents a method of Chinese speech emotion recognition using Deep Belief Networks (DBN). DBN is used to perform unsupervised feature learning on the extracted low-level acoustic features. A Multi-Layer Perceptron (MLP) is then initialized with the learned weights of the hidden layers of the DBN and employed for Chinese speech emotion classification. Experimental results on the Chinese Natural Audio-Visual Emotion Database (CHEAVD) show that the presented method obtains a classification accuracy of 32.80 % and a macro average precision of 41.54 % on the testing data of the CHEAVD dataset, significantly outperforming the baseline results provided by the organizers of the speech emotion recognition sub-challenges.

Keywords: Deep learning · Deep belief networks · Speech emotion recognition · Feature learning
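The abstract mentions unsupervised feature learning on extracted low-level acoustic features. As a rough, self-contained illustration (not the feature set used by the authors, who rely on a richer acoustic configuration), two common frame-level descriptors, short-time log energy (prosody-related) and zero-crossing rate (spectral-related), can be computed and summarized into an utterance-level vector like this; the frame length, hop size, and function names are illustrative choices:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D signal into overlapping frames (e.g. 25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def low_level_features(x):
    """Utterance-level statistics over two frame-level acoustic contours."""
    frames = frame_signal(x)
    # short-time log energy per frame (prosody-related contour)
    energy = np.log((frames ** 2).sum(axis=1) + 1e-10)
    # zero-crossing rate per frame (coarse spectral descriptor)
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)
    # collapse the contours into a fixed-length utterance-level vector
    return np.array([energy.mean(), energy.std(), zcr.mean(), zcr.std()])

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)        # one second of synthetic 16 kHz "audio"
feat = low_level_features(x)
print(feat.shape)                     # (4,)
```

Vectors of this kind (typically with many more descriptors and statistical functionals) form the low-level input that the DBN then refines through unsupervised feature learning.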

1 Introduction

During the past two decades, massive efforts have been made to recognize human emotions from emotional speech signals, a task called speech emotion recognition. At present, speech emotion recognition has attracted much interest in various fields such as signal processing, pattern recognition, and artificial intelligence, since it can be applied to human-machine interaction [1, 2].

Feature extraction is a critical step in bridging the emotion gap between speech signals and high-level speech emotion. Up to now, a variety of features have been employed for speech emotion recognition [3, 4]. These features can be roughly divided into four categories: (1) acoustic features, such as prosody features, voice quality features, and spectral features; (2) language features, such as lexical information; (3) context information, such as subject, gender, and culture influences; (4) hybrid features, i.e., the integration of two or three of the feature types above. However, for these hand-designed features there is no agreement on which one sufficiently and efficiently characterizes emotion in speech signals. In addition, these hand-designed features are low-level and hence may not be reliable enough to efficiently characterize the subjective emotion in complicated scenarios. It is thus important to develop automatic feature learning algorithms for speech emotion recognition.

© Springer Nature Singapore Pte Ltd. 2016. T. Tan et al. (Eds.): CCPR 2016, Part II, CCIS 663, pp. 645–651, 2016. DOI: 10.1007/978-981-10-3005-5_53

In recent years, deep learning [5], built on multi-layered deep architectures, has attracted extensive attention in machine learning, signal processing, artificial intelligence, and pattern recognition. Deep belief networks (DBN) [6], as a representative deep learning method, exhibit a strong ability for unsupervised feature learning. In rec
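To make the DBN idea concrete: a DBN is pretrained greedily as a stack of Restricted Boltzmann Machines (RBMs), each layer learning features of the previous layer's activations without labels; the learned weights can then initialize an MLP for supervised fine-tuning, as the abstract describes. The following is a minimal sketch of this pretraining scheme using one-step contrastive divergence (CD-1), assuming inputs scaled to [0, 1]; the layer sizes, learning rate, and epoch count are illustrative, not the authors' configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=50, lr=0.1):
    """Train one RBM layer with CD-1; returns hidden weights and biases."""
    n_visible = data.shape[1]
    W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_visible)
    for _ in range(epochs):
        # positive phase: hidden probabilities given the data
        p_h = sigmoid(data @ W + b_h)
        h = (rng.random(p_h.shape) < p_h).astype(float)  # sample hidden units
        # negative phase: one Gibbs step back to the visible layer and up again
        p_v = sigmoid(h @ W.T + b_v)
        p_h2 = sigmoid(p_v @ W + b_h)
        # CD-1 updates: difference of positive and negative correlations
        W += lr * (data.T @ p_h - p_v.T @ p_h2) / len(data)
        b_h += lr * (p_h - p_h2).mean(axis=0)
        b_v += lr * (data - p_v).mean(axis=0)
    return W, b_h

# toy low-level "acoustic feature" vectors scaled to [0, 1]
X = rng.random((100, 20))
W1, b1 = train_rbm(X, 8)             # first DBN layer
H1 = sigmoid(X @ W1 + b1)            # learned first-layer features
W2, b2 = train_rbm(H1, 4)            # second layer, trained on those features
# (W1, b1) and (W2, b2) would then initialize an MLP's hidden layers,
# which is fine-tuned with emotion labels via backpropagation
```

The key design point is that pretraining gives the MLP a data-driven starting point in weight space, which is the role the DBN plays in the recognition pipeline described here.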