


Attention and Feature Selection for Automatic Speech Emotion Recognition Using Utterance and Syllable-Level Prosodic Features

Starlet Ben Alex1 · Leena Mary2 · Ben P. Babu1

Received: 3 September 2019 / Revised: 13 April 2020 / Accepted: 15 April 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
This work recognizes emotions from human speech using prosodic information, represented by variations in duration, energy, and fundamental frequency (F0) values. The speech signal is first automatically segmented into syllables. Prosodic features at the utterance level (15 features) and the syllable level (10 features) are then extracted using the syllable boundaries, and a separate deep neural network classifier is trained on each feature set. The effectiveness of the proposed approach is demonstrated on the German speech corpus EMOTional Sensitivity ASsistance System (EmotAsS) for people with disabilities, the dataset used for the Interspeech 2018 Atypical Affect Sub-Challenge. On evaluation, the initial set of prosodic features yields an unweighted average recall (UAR) of 30.15%. Fusing the decision scores of these features with those of spectral features gives a UAR of 36.71%. This paper also employs an attention mechanism and feature selection using resampling-based recursive feature elimination (RFE) to enhance system performance. Applying attention and feature selection, followed by score-level fusion, improves the UAR to 36.83% for prosodic features and 40.96% for the overall fusion. Fusing the scores of the best individual system of the Atypical Affect Sub-Challenge with those of the proposed system provides a UAR of 43.71%, above the best test result reported. The effectiveness of the proposed system has also been demonstrated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, with a UAR of 63.83%.
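For readers unfamiliar with the two quantities the abstract relies on, the following is a minimal sketch of how UAR and a score-level fusion can be computed. It is an illustration under stated assumptions, not the system described in this paper: the function names, the weighted-sum fusion rule, and the toy labels and scores are all hypothetical, and the abstract does not specify which fusion rule or weights were used.

```python
import numpy as np


def unweighted_average_recall(y_true, y_pred, n_classes):
    """Mean of the per-class recalls, so each class counts equally
    regardless of how many samples it has."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():  # skip classes absent from the reference labels
            recalls.append(np.mean(y_pred[mask] == c))
    return float(np.mean(recalls))


def fuse_scores(score_sets, weights):
    """Weighted sum of per-class score matrices, one (samples x classes)
    matrix per subsystem. The weighted-sum rule is an assumption."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    stacked = np.stack([np.asarray(s, dtype=float) for s in score_sets])
    return np.tensordot(weights, stacked, axes=1)


# Toy, made-up example: two classes, imbalanced reference labels.
y_true = [0, 0, 0, 0, 1, 1]
prosodic_scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
                   [0.6, 0.4], [0.6, 0.4], [0.7, 0.3]]
spectral_scores = [[0.8, 0.2], [0.7, 0.3], [0.6, 0.4],
                   [0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]

fused = fuse_scores([prosodic_scores, spectral_scores], weights=[1.0, 1.0])
y_pred = fused.argmax(axis=1)
uar = unweighted_average_recall(y_true, y_pred, n_classes=2)
```

Because UAR averages recall over classes rather than over samples, a system that favors the majority class scores lower on UAR than on plain accuracy, which is why it suits imbalanced emotion corpora.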

Starlet Ben Alex [email protected]
Leena Mary [email protected]
Ben P. Babu [email protected]

1 Centre for Advanced Signal Processing (CASP), Rajiv Gandhi Institute of Technology, APJ Abdul Kalam Technological University, Kottayam, Kerala, India

2 Department of Electronics and Communication Engineering, Government Engineering College, Idukki, Kerala, India

Circuits, Systems, and Signal Processing

Keywords Automatic emotion recognition (AER) · Prosodic features · Syllabification · Attention mechanism · Feature selection · Score-level fusion

1 Introduction
Emotion is any feeling that may arise from the psychological changes caused by the environment in which an individual lives [8,17,64]. A person's facial expressions and certain characteristics of their speech serve as cues for recognizing emotions. A unique quality of human speech is that, in addition to the intended message, it carries information about the language, the speaker, and the conveyed emotion [33]. This makes it possible to use speech to identify emotions even without knowledge of the language in which it is spoken. Realizing the underlying emotions in speech has its impo