Video multimodal emotion recognition based on Bi-GRU and attention fusion
Ruo-Hong Huan · Kai-Kai Chi · Jia Shu · Sheng-Lin Bao · Rong-Hua Liang · Peng Chen

Received: 22 March 2020 / Revised: 25 August 2020 / Accepted: 6 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
A video multimodal emotion recognition method based on Bi-GRU and attention fusion is proposed in this paper. A bidirectional gated recurrent unit (Bi-GRU) is applied to improve the accuracy of emotion recognition in time contexts. A new network initialization method is proposed and applied to the network model, which further improves the video emotion recognition accuracy of time-contextual learning. To overcome the uniform weighting of each modality in multimodal fusion, a video multimodal emotion recognition method based on an attention fusion network is proposed. The attention fusion network calculates the attention distribution over the modalities at each moment in real time, so that the network model can learn multimodal contextual information in real time. The experimental results show that the proposed method improves the accuracy of emotion recognition in each of the three single modalities (textual, visual, and audio) and also improves the accuracy of video multimodal emotion recognition. The proposed method outperforms existing state-of-the-art methods for multimodal emotion recognition in both sentiment classification and sentiment regression.

Keywords Video emotion recognition · Multimodal · Bi-GRU · Attention mechanism · Fusion
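To make the fusion idea in the abstract concrete, the following is a minimal sketch of attention-weighted fusion over per-modality features at a single time step. The scoring vector `w` and the feature dimensions are hypothetical placeholders for illustration only, not the paper's actual architecture; in the proposed method the scores would be produced by a learned attention network over Bi-GRU outputs.

```python
import numpy as np

def attention_fuse(modality_feats, w):
    """Fuse per-modality features with a softmax attention distribution.

    modality_feats: array of shape (num_modalities, feat_dim), e.g. the
        textual, visual, and audio features at one time step.
    w: array of shape (feat_dim,), a hypothetical learned scoring vector.
    Returns (fused, alphas): the attention-weighted sum of the modality
    features and the attention weights themselves.
    """
    scores = modality_feats @ w                     # one scalar score per modality
    scores = scores - scores.max()                  # subtract max for numerical stability
    alphas = np.exp(scores) / np.exp(scores).sum()  # softmax: attention distribution
    fused = alphas @ modality_feats                 # weighted sum over modalities
    return fused, alphas

# Toy example: 3 modalities (textual, visual, audio), 4-dim features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 4))
w = rng.standard_normal(4)
fused, alphas = attention_fuse(feats, w)
```

Because the weights come from a softmax, they are positive and sum to one, so each modality's contribution varies per time step instead of being fixed, which is the property the abstract contrasts with uniform weighting.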
1 Introduction

Usually, the ways humans naturally communicate and express emotions are multimodal [23]. That means we can express emotions either verbally or visually. When more emotions are expressed through tone, the audio data may contain the major cues for emotion recognition; and when more facial expressions are used to express emotions, most of the clues needed for mining emotions can be considered to exist in the facial expressions. Identifying human emotions
* Ruo-Hong Huan [email protected] Extended author information available on the last page of the article
Multimedia Tools and Applications
using multimodal information such as facial expressions, vocal intonation, and linguistic content is an interesting and challenging problem. Videos provide multimodal data in both acoustic and visual modalities. Facial expressions, vocal tones, and text in video data can provide important information for better recognizing a person's true emotional state. Therefore, analyzing videos can produce better models for emotion recognition and sentiment analysis. Textual, visual, and audio are often regarded as the main modalities in research on multimodal emotion recognition in videos. Recognizing and utilizing these three modalities simultaneously can effectively extract the semantic and emotional information conveyed during communication. It is necessary to simultaneously establish