Significance of Phonological Features in Speech Emotion Recognition
Wei Wang1 · Paul A. Watters2 · Xinyi Cao1 · Lingjie Shen1 · Bo Li3

Received: 27 December 2019 / Accepted: 8 July 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract

A novel Speech Emotion Recognition (SER) method based on phonological features is proposed in this paper. Intuitively, as expert knowledge derived from linguistics, phonological features are correlated with emotions. However, they are seldom used as features to improve SER. Motivated by this, we aim to utilize phonological features to further advance SER accuracy, since they provide complementary information for the task. We also explore the relationship between phonological features and emotions. Firstly, instead of relying only on acoustic features, we devise a new SER approach that fuses phonological representations with acoustic features. A significant improvement in SER performance is demonstrated on a publicly available database, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus. Secondly, the experimental results show that the top-performing method for categorical emotion recognition is a deep learning-based classifier, which achieves an unweighted average recall (UAR) of 60.02%. Finally, we investigate the most discriminative features and find patterns of emotional rhyme based on the phonological representations.

Keywords Speech emotion recognition · Phonological features · Feature analysis · Acoustic features
* Bo Li
[email protected]; [email protected]

Wei Wang
[email protected]

Paul A. Watters
[email protected]

1 School of Education Science, Nanjing Normal University, Nanjing, JS 210097, China
2 Department of Computer Science and Information Technology, La Trobe University, Melbourne, VIC 3350, Australia
3 School of Computer Sciences and Computer Engineering, University of Southern Mississippi, 730 East Beach Blvd, Long Beach, MS 39560, USA

1 Introduction

Automatic Speech Emotion Recognition (SER) has been an active research area for the past several decades and is of great interest to the human-computer interaction community. An accurate and efficient human emotion recognition system helps make the interaction between humans and computers more natural and friendly. Automatic SER has wide applications ranging from computer tutoring to mental health diagnosis (Jin et al. 2015).

The accuracy of speech emotion recognition relies mainly on two factors: features and classifiers. In terms of features used in SER, acoustic features have dominated the literature, primarily large sets characterizing prosodic, voice-quality, and spectral properties. These acoustic features consist of frame-level features, often referred to as low-level descriptors (LLDs), together with statistical functionals that map the frame-level LLDs to a fixed-dimensional utterance-level feature space, as sketched below.
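To make the LLD-plus-functionals pipeline concrete, here is a minimal Python sketch, not the authors' implementation: the LLD names and the choice of functionals are illustrative assumptions, and real front ends (e.g., openSMILE feature sets) extract the tracks from audio and apply many more functionals per LLD.

```python
import numpy as np

# A handful of common statistical functionals; production SER feature
# sets typically apply dozens of functionals to each LLD track.
FUNCTIONALS = [np.mean, np.std, np.min, np.max, np.median]

def utterance_features(llds):
    """Map frame-level LLD tracks (dict of name -> 1-D array) to a
    fixed-length utterance-level vector by applying every functional
    to every track."""
    return np.asarray([f(track)
                       for _, track in sorted(llds.items())
                       for f in FUNCTIONALS])

# Illustrative (random) LLD tracks for a 100-frame utterance; in a
# real system these would come from an acoustic feature extractor.
rng = np.random.default_rng(0)
llds = {
    "f0": rng.uniform(80.0, 250.0, 100),    # pitch contour in Hz
    "energy": rng.uniform(0.0, 1.0, 100),   # frame energy
    "zcr": rng.uniform(0.0, 0.5, 100),      # zero-crossing rate
}
print(utterance_features(llds).shape)  # (15,) = 3 LLDs x 5 functionals
```

The point of the functionals step is that the resulting vector has the same length regardless of utterance duration, which is what allows a single utterance-level classifier to be trained over recordings of varying length.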