An optimal model with a lower bound of recall for imbalanced speech emotion recognition


Xusheng Ai1 · Victor S. Sheng2 · Wei Fang3 · Charles X. Ling4

Received: 28 August 2019 / Revised: 29 May 2020 / Accepted: 4 June 2020 / © Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract In an early complaint warning system, we encounter a common problem: the lack of angry emotion samples for training classification models. Moreover, recognizing angry emotion is more important than recognizing non-angry emotion. The main purpose of this paper is therefore to train an optimal model that keeps recall above a lower bound while maximizing the F1 score. The work has three parts: 1) a variant of the F1 score (T F1 score) that takes both recall above a lower bound and the F1 score into consideration; 2) a Single Emotion Deep Neural Network (SEDNN) and a training process designed to find an optimal model with a maximum T F1 score; 3) a performance comparison of different methods on the IEMOCAP and Emo-DB databases. Extensive experiments show that, with either a BCE loss function or a focal loss function, the training process can find a model whose recall is above a high threshold and whose F1 score is maximal. In particular, SEDNN with the focal loss function performs better than SEDNN with the BCE loss function.

Keywords Imbalance · Deep neural network · Convolutional neural network · Speech emotion recognition

1 Software and Service Outsourcing College, Suzhou Vocational Institute of Industrial Technology, Suzhou, 215104, People’s Republic of China

2 Department of Computer Science, Texas Tech University, Lubbock, TX 79409, USA

3 School of Computer and Software, Jiangsu Engineering Center of Network Monitoring, Nanjing University of Information Science and Technology, Nanjing, China

4 Department of Computer Science, Western University, London, ON N6A 5B7, Canada

Multimedia Tools and Applications
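The objective stated in the abstract (maximize the F1 score while keeping recall above a lower bound) can be illustrated with a minimal sketch. This excerpt does not define the T F1 score itself, so the hard filter on recall used here is an assumption and not the authors' formulation. The `Candidate` class, the function names, and the confusion counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Candidate:
    """Confusion counts for one candidate model (hypothetical container)."""
    name: str  # illustrative label, e.g. a checkpoint id
    tp: int    # angry utterances predicted as angry
    fp: int    # non-angry utterances predicted as angry
    fn: int    # angry utterances predicted as non-angry


def recall(c: Candidate) -> float:
    positives = c.tp + c.fn
    return c.tp / positives if positives else 0.0


def f1(c: Candidate) -> float:
    # F1 = 2*TP / (2*TP + FP + FN), equivalent to the harmonic mean
    # of precision and recall.
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def select_model(cands: Sequence[Candidate],
                 recall_bound: float) -> Optional[Candidate]:
    """Return the candidate with the highest F1 among those whose recall
    meets the lower bound, or None if no candidate qualifies.
    Assumption: the bound is applied as a hard constraint."""
    eligible = [c for c in cands if recall(c) >= recall_bound]
    return max(eligible, key=f1, default=None)


# Made-up counts for three candidates on 100 angry utterances.
candidates = [
    Candidate("A", tp=90, fp=60, fn=10),  # high recall, low precision
    Candidate("B", tp=70, fp=10, fn=30),  # best F1, but recall only 0.70
    Candidate("C", tp=85, fp=35, fn=15),  # recall 0.85, second-best F1
]

best = select_model(candidates, recall_bound=0.8)
print(best.name if best else "no candidate meets the recall bound")
```

With these counts, candidate B has the highest unconstrained F1 but fails the recall bound of 0.8, so the selection falls to C. This is the trade-off the abstract describes: a model with slightly lower F1 is preferred when it misses fewer angry utterances.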

1 Introduction

Emotion is important paralinguistic information in human communication. Emotion directs non-linguistic social signals (such as body language and facial expression) to express wants, needs and desires [24]. Speech emotion recognition (SER) has many applications in fields such as healthcare, services, and telecommunication. In healthcare, SER can help clinicians assess patients’ psychological disorders online. In customer call centers, SER can be used to detect customers’ satisfaction. SER can also be used in 911 emergency call services to route high-priority emergency calls. In telemarketing call centers, thousands of telemarketers wait for calls from customers. A telemarketer answers a call and converses with the customer. When customers speak angrily at the end of a call, complaints may happen sooner o