Human emotion recognition based on the weighted integration method using image sequences and acoustic features


Sung-Woo Byun 1 & Seok-Pil Lee 2

Received: 4 June 2020 / Revised: 27 July 2020 / Accepted: 9 September 2020
© The Author(s) 2020

Abstract

People generally perceive other people’s emotions from both speech and facial expressions, so using speech signals and facial images together can help recognize emotions. However, because speech and image data have different characteristics, combining the two inputs remains a challenging issue in emotion-recognition research. In this paper, we propose a method that recognizes emotions by synchronizing speech signals and image sequences. We design three deep networks. The first is trained on image sequences and focuses on changes in facial expression. The second takes facial landmarks as input to reflect facial motion. For the third, the speech signals are first converted to acoustic features, which are synchronized with the image sequence and used as the network’s input. The three networks are combined using a novel integration method to boost emotion-recognition performance. An accuracy comparison is conducted to verify the proposed method. The results show that the proposed method achieves higher accuracy than methods from previous studies.

Keywords Emotion recognition . Acoustic feature . Facial expression . Model integration
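As a rough illustration of how the outputs of three networks could be combined, the sketch below computes a weighted average of per-network class probabilities (a generic late-fusion scheme). It is not the integration method proposed in this paper, which the abstract does not specify. The label set, the example scores, the weights, and the function names `softmax`, `weighted_fusion`, and `predict` are all hypothetical.

```python
from math import exp

# Hypothetical label set; the abstract does not list the emotion classes.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def softmax(logits):
    """Convert raw network scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(prob_sets, weights):
    """Return the weighted average of per-network class probabilities."""
    if len(prob_sets) != len(weights):
        raise ValueError("need exactly one weight per network")
    norm = sum(weights)
    if norm <= 0:
        raise ValueError("weights must sum to a positive value")
    fused = [0.0] * len(prob_sets[0])
    for probs, w in zip(prob_sets, weights):
        for i, p in enumerate(probs):
            fused[i] += (w / norm) * p  # normalized weights keep the sum at 1
    return fused


def predict(prob_sets, weights):
    """Pick the emotion with the highest fused probability."""
    fused = weighted_fusion(prob_sets, weights)
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best], fused


# Assumed example scores from the image, landmark, and speech networks.
image_probs = softmax([2.0, 0.5, 0.1, 0.3])
landmark_probs = softmax([0.2, 1.5, 0.3, 0.1])
speech_probs = softmax([0.1, 2.5, 0.2, 0.4])

# Assumed weights; in practice they might be tuned on validation data.
label, fused = predict(
    [image_probs, landmark_probs, speech_probs], [0.5, 0.2, 0.3]
)
print(label, [round(p, 3) for p in fused])
```

Normalizing the weights inside `weighted_fusion` keeps the fused vector a valid probability distribution regardless of the scale on which the weights are given.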

* Seok-Pil Lee [email protected]
Sung-Woo Byun [email protected]

1 Graduate School, Department of Computer Science, SangMyung University, Seoul, Republic of Korea

2 Department of Electronic Engineering, SangMyung University, Seoul, Republic of Korea

Multimedia Tools and Applications

1 Introduction

Recently, high-performance personal computers have rapidly become widespread with the technological development of the information society. Accordingly, the interaction between humans and computers is actively changing into a bidirectional interface, and a better understanding of human emotions is needed, which could improve human–machine interaction systems [4]. In signal processing, emotion recognition has become an attractive research topic [45]. The goal of such an interface is to extract and recognize the emotional state of individuals accurately and to provide personalized media according to a user’s emotional state. Emotion refers to a conscious mental reaction subjectively experienced as strong feeling typically accompanied by physiological and behavioral changes in the body [3]. To recognize a user’s emotional state, several studies have applied different forms of input, such as speech, facial expression, video, text, and others [11, 13, 15, 25, 39, 42, 47]. Among the methods using these inputs, facial emotion recognition (FER) has gained substantial attention over the past decades. Conventional FER approaches generally have three main steps: 1) detecting a facial region from an input image, 2) extracting facial features, and 3) recognizing emotions. In conventional methods, it is