Recognition of emotion from speech using evolutionary cepstral coefficients

Ali Bakhshi¹ ([email protected]) · Stephan Chalup¹ ([email protected]) · Ali Harimi² ([email protected]) · Seyed Mostafa Mirhassani³ ([email protected])

¹ School of Electrical Engineering and Computing, The University of Newcastle, Newcastle, Australia
² Department of Electrical Engineering, Islamic Azad University, Shahrood Branch, Shahrood, Iran
³ Department of Biomedical Engineering, University of Malaya, Kuala Lumpur, Malaysia

Received: 18 June 2019 / Revised: 5 May 2020 / Accepted: 11 August 2020 / © Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

An optimal representation of acoustic features is an ongoing challenge in automatic speech emotion recognition research. In this study, we propose cepstral coefficients based on evolutionary filterbanks as emotional features. It is difficult to guarantee that a single optimized filterbank provides the best representation for emotion classification. Consequently, we employed six HMM-based binary classifiers, each using a dedicated filterbank optimized by a genetic algorithm, to categorize the data into seven emotion classes. Applied in a hierarchical manner, these optimized classifiers outperformed conventional Mel Frequency Cepstral Coefficients (MFCCs) in overall emotion classification accuracy: the proposed evolutionary cepstral coefficients achieved a weighted average recall of 87.29% on the Berlin database, while the same approach with conventional cepstral features achieved only 79.63%.

Keywords Genetic algorithm · Mel filterbank · Cepstral coefficients · Speech emotion recognition
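The paper's pipeline is not reproduced as code here; the following is a minimal sketch of how a genetic algorithm can evolve the centre frequencies of a filterbank for one binary classifier. The population size, operators, and toy fitness function are illustrative assumptions, not the paper's settings; in the actual method the fitness would be the accuracy of the HMM-based binary classifier trained on cepstral features extracted with the candidate filterbank.

```python
# Sketch of a GA evolving filterbank centre frequencies (assumptions noted).
import numpy as np

rng = np.random.default_rng(0)

N_FILTERS = 24               # filters per bank (assumed, not from the paper)
F_MIN, F_MAX = 0.0, 8000.0   # analysis band in Hz (assumed)
POP_SIZE, N_GEN = 30, 50     # GA settings (assumed)

def random_bank():
    """A candidate filterbank: sorted centre frequencies in [F_MIN, F_MAX]."""
    return np.sort(rng.uniform(F_MIN, F_MAX, N_FILTERS))

def fitness(bank):
    """Placeholder fitness. In the paper's method this would be: extract
    cepstral coefficients with `bank`, train/evaluate the binary HMM
    classifier, and return its accuracy. The toy version below merely
    rewards smooth spacing on a Mel-like warped axis."""
    warped = 2595.0 * np.log10(1.0 + bank / 700.0)
    target = np.linspace(warped.min(), warped.max(), N_FILTERS)
    return -np.mean((warped - target) ** 2)

def crossover(a, b):
    """Uniform crossover over centre frequencies, re-sorted to stay valid."""
    mask = rng.random(N_FILTERS) < 0.5
    return np.sort(np.where(mask, a, b))

def mutate(bank, rate=0.1, scale=100.0):
    """Perturb a random subset of centres by Gaussian noise, clip to band."""
    noise = rng.normal(0.0, scale, N_FILTERS) * (rng.random(N_FILTERS) < rate)
    return np.sort(np.clip(bank + noise, F_MIN, F_MAX))

population = [random_bank() for _ in range(POP_SIZE)]
for _ in range(N_GEN):
    scored = sorted(population, key=fitness, reverse=True)
    parents = scored[: POP_SIZE // 2]          # truncation selection
    children = [mutate(crossover(parents[rng.integers(len(parents))],
                                 parents[rng.integers(len(parents))]))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

best = max(population, key=fitness)
print("evolved centre frequencies (Hz):", np.round(best, 1))
```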

1 Introduction

Automatic speech emotion recognition (SER) has been an attractive research area over the last decade. Human emotions are partially encoded in speech prosody.


Most acoustic features employed for SER can be categorized into two major groups: prosodic and spectral. Pitch (F0) and intensity are among the most prominent prosodic features, while MFCCs (Mel Frequency Cepstral Coefficients), PLP (Perceptual Linear Prediction) coefficients, and formants are the most important spectral features. In the literature, pitch and energy have been reported as standard emotional features [3, 51, 52]. Spectral features, which are mainly extracted from the sub-band spectrum of speech, have also been shown to be complementary to prosodic features [24]. The authors of [58] derived spectral patterns for SER from speech spectrograms divided according to the Bark scale [83]. The equivalent rectangular bandwidth (ERB) offers an unrealistic but convenient simplification, using rectangular band-pass filters to extract sub-band spectral features [29]. To model the perception of speech in a manner similar to the human ear, the Mel frequency scale is linear below 1 kHz and logarithmic above it [29]. MFCCs are the most widespread spectral features in speech processing.
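For reference, below is a short sketch of the conventional Mel warping and triangular filterbank that underlie MFCC extraction. The 2595/700 constants follow the common HTK-style convention, and the sampling rate and filter count are illustrative defaults, not values taken from this paper.

```python
# Conventional Mel filterbank: roughly linear below 1 kHz, logarithmic above.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    """Triangular filters with centres equally spaced on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):          # rising slope of the triangle
            fb[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling slope of the triangle
            fb[i - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mfcc_from_power(power_spec, fb, n_ceps=13):
    """Log filterbank energies followed by a DCT (the cepstral step)."""
    log_energies = np.log(fb @ power_spec + 1e-10)
    return dct(log_energies, type=2, norm='ortho')[:n_ceps]

# Usage: one frame of a power spectrum (|FFT|^2, first n_fft//2 + 1 bins).
frame = np.abs(np.fft.rfft(np.random.randn(512))) ** 2
print(mfcc_from_power(frame, mel_filterbank()).round(3))
```

The paper's evolutionary cepstral coefficients follow the same extraction chain; the difference is that the filter placement is learned by the genetic algorithm rather than fixed by the Mel warping above.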