Fisher Kernels on Phase-Based Features for Speech Emotion Recognition

Abstract  The involvement of affect information in a spoken dialogue system can increase user-friendliness and provide a more natural interaction experience. This can be achieved by speech emotion recognition, where the features are usually dominated by spectral amplitude information while the phase spectrum is ignored. In this chapter, we propose to use phase-based features to build such an emotion recognition system. To exploit these features, we employ Fisher kernels. This technique encodes the phase-based features by their deviation from a generative Gaussian mixture model. The resulting representation is used to train a classification model with a linear kernel classifier. Experimental results on the GeWEC database, covering both 'normal' and whispered phonation, demonstrate the effectiveness of our method.

Keywords  Speech emotion recognition · Phase-based features · Fisher kernels · Modified group delay features
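The Fisher-kernel encoding described in the abstract maps a variable-length sequence of frame-level features to a fixed-length vector by taking the gradient of a Gaussian mixture model's log-likelihood with respect to its parameters. The following is a minimal sketch of that idea, not the authors' implementation: it assumes a diagonal-covariance GMM and restricts the encoding to the gradients with respect to the component means, with the power and L2 normalisation that is standard for Fisher vectors. The feature dimensionality and component count are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(frames, gmm):
    """Encode a (T x D) sequence of frames as a fixed-length Fisher vector:
    the normalised gradient of the GMM log-likelihood w.r.t. the component
    means (simplified sketch: mean-gradients only, diagonal covariances)."""
    T, _ = frames.shape
    post = gmm.predict_proba(frames)        # (T, K) soft assignments
    sigmas = np.sqrt(gmm.covariances_)      # (K, D) diagonal std devs
    grads = []
    for k in range(gmm.n_components):
        diff = (frames - gmm.means_[k]) / sigmas[k]          # (T, D)
        g = (post[:, k:k + 1] * diff).sum(axis=0)
        grads.append(g / (T * np.sqrt(gmm.weights_[k])))
    fv = np.concatenate(grads)
    # power normalisation followed by L2 normalisation
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

# Toy usage: random frames stand in for real phase-based features.
rng = np.random.default_rng(0)
train_frames = rng.normal(size=(500, 8))
gmm = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(train_frames)
utterance = rng.normal(size=(120, 8))
fv = fisher_vector(utterance, gmm)
print(fv.shape)  # (32,) = n_components * feature_dim
```

The resulting fixed-length vector can be fed directly to a linear classifier (e.g. a linear-kernel SVM), which is the setup the chapter describes.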

J. Deng (B) · Z. Zhang · B. Schuller — Chair of Complex & Intelligent Systems, University of Passau, Passau, Germany; e-mail: [email protected]
X. Xu — Machine Intelligence & Signal Processing Group, MMK, Technische Universität München, Munich, Germany
S. Frühholz · D. Grandjean · B. Schuller — Swiss Center for Affective Sciences, University of Geneva, Geneva, Switzerland
B. Schuller — Department of Computing, Imperial College London, London, UK

© Springer Science+Business Media Singapore 2017. K. Jokinen and G. Wilcock (eds.), Dialogues with Social Robots, Lecture Notes in Electrical Engineering 427, DOI 10.1007/978-981-10-2585-3_15

1 Introduction

For spoken dialogue systems, a recent trend is to integrate emotion recognition in order to increase user-friendliness and provide a more natural interaction experience [1–6]. In fact, this may be particularly relevant for systems


that accept whispered speech as input, given the social and emotional implications of whispering. At present, acoustic features used for speech emotion recognition are dominated by the magnitude part of the conventional Fourier transform of a signal, as in Mel-frequency cepstral coefficients (MFCCs) [7–10]. In general, the phase-based representation of the signal has been neglected, mainly because of the difficulties caused by phase wrapping [11, 12]. In spite of this, the phase spectrum is capable of characterising the signal. Recent work has demonstrated the effectiveness of the phase spectrum in various speech and audio processing applications, including speech recognition [13, 14], source separation [15], and speaker recognition [16]. However, little research has applied phase-based features to speech emotion recognition. Recently, the phase distortion, which is the derivative of the relative phase shift, has been investigated for emotional valence recognition [17].

In this short chapter, the key objective is to demonstrate the usefulness of phase-based features for speech emotion recognition. In particular, this chapter investigates wheth