Significance of Phonological Features in Speech Emotion Recognition


Wei Wang¹ · Paul A. Watters² · Xinyi Cao¹ · Lingjie Shen¹ · Bo Li³

Received: 27 December 2019 / Accepted: 8 July 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

This paper proposes a novel Speech Emotion Recognition (SER) method based on phonological features. Intuitively, phonological features, as expert knowledge derived from linguistics, are correlated with emotions, yet they are seldom used as features to improve SER. Motivated by this gap, we aim to use phonological features to further improve SER accuracy, since they can provide complementary information for the task, and we also explore the relationship between phonological features and emotions. First, rather than relying on acoustic features alone, we devise a new SER approach that fuses phonological representations with acoustic features. This approach yields a significant improvement in SER performance on the publicly available Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. Second, the experimental results show that the top-performing method for categorical emotion recognition is a deep learning-based classifier, which achieves an unweighted average recall (UAR) of 60.02%. Finally, we investigate the most discriminative features and find some patterns of emotional rhyme based on the phonological representations.

Keywords  Speech emotion recognition · Phonological features · Feature analysis · Acoustic features
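The abstract reports results as unweighted average recall (UAR). As a minimal sketch of how that metric is commonly computed (the mean of the per-class recalls, so that each emotion class counts equally regardless of how many utterances it has), the following Python example may help. The function name, the emotion labels, and the toy predictions are illustrative assumptions and are not taken from the paper or from IEMOCAP.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (each class weighted equally)."""
    total = defaultdict(int)    # number of true samples per class
    correct = defaultdict(int)  # number of correctly predicted samples per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, deliberately imbalanced toy labels (not real data).
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"] + ["neutral", "neutral"]

uar = unweighted_average_recall(y_true, y_pred)
# Plain accuracy, for comparison: fraction of all samples predicted correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the classifier favors the majority class, so plain accuracy comes out higher than UAR. This is one reason UAR is often preferred when emotion classes are imbalanced.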

* Bo Li

¹ School of Education Science, Nanjing Normal University, Nanjing, JS 210097, China
² Department of Computer Science and Information Technology, La Trobe University, Melbourne, VIC 3350, Australia
³ School of Computer Sciences and Computer Engineering, University of Southern Mississippi, 730 East Beach Blvd, Long Beach, MS 39560, USA

1 Introduction

Automatic Speech Emotion Recognition (SER) has been an active research area for the past several decades and is of great interest to the human-computer interaction community. An accurate and efficient human emotion recognition system will help make interaction between humans and computers more natural and friendly. Automatic SER has wide applications, ranging from computer tutoring to mental health diagnosis (Jin et al. 2015). The accuracy of speech emotion recognition relies mainly on two factors: features and classifiers. In terms of features, acoustic features have dominated the literature, primarily large sets characterizing prosodic, voice-quality, and spectral properties. These acoustic features consist of frame-level features, often referred to as low-level descriptors (LLDs), and corresponding functions that map the LLDs from the segment level to a space at the utterance level. Most