Video multimodal emotion recognition based on Bi-GRU and attention fusion
Ruo-Hong Huan · Kai-Kai Chi · Jia Shu · Sheng-Lin Bao · Rong-Hua Liang · Peng Chen

Received: 22 March 2020 / Revised: 25 August 2020 / Accepted: 6 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
A video multimodal emotion recognition method based on Bi-GRU and attention fusion is proposed in this paper. A bidirectional gated recurrent unit (Bi-GRU) is applied to improve the accuracy of emotion recognition in time contexts. A new network initialization method is proposed and applied to the network model, which further improves the video emotion recognition accuracy of time-contextual learning. To overcome the uniform weighting of each modality in multimodal fusion, a video multimodal emotion recognition method based on an attention fusion network is proposed. The attention fusion network calculates the attention distribution over the modalities at each moment in real time, so that the network model can learn multimodal contextual information in real time. The experimental results show that the proposed method improves the accuracy of emotion recognition in each of the three single modalities (textual, visual, and audio) and also improves the accuracy of video multimodal emotion recognition. The proposed method outperforms existing state-of-the-art methods for multimodal emotion recognition in both sentiment classification and sentiment regression.

Keywords Video emotion recognition · Multimodal · Bi-GRU · Attention mechanism · Fusion
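To make the fusion idea in the abstract concrete, the following is a minimal sketch of attention-weighted fusion over per-modality features at a single time step. The scoring vector `w` and the feature dimensions are hypothetical placeholders for illustration only, not the paper's actual architecture; in the proposed method the scores would be produced by a learned attention network over Bi-GRU outputs.

```python
import numpy as np

def attention_fuse(modality_feats, w):
    """Fuse per-modality features with a softmax attention distribution.

    modality_feats: array of shape (num_modalities, feat_dim), e.g. the
        textual, visual, and audio features at one time step.
    w: array of shape (feat_dim,), a hypothetical learned scoring vector.
    Returns (fused, alphas): the attention-weighted sum of the modality
    features and the attention weights themselves.
    """
    scores = modality_feats @ w                     # one scalar score per modality
    scores = scores - scores.max()                  # subtract max for numerical stability
    alphas = np.exp(scores) / np.exp(scores).sum()  # softmax: attention distribution
    fused = alphas @ modality_feats                 # weighted sum over modalities
    return fused, alphas

# Toy example: 3 modalities (textual, visual, audio), 4-dim features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 4))
w = rng.standard_normal(4)
fused, alphas = attention_fuse(feats, w)
```

Because the weights come from a softmax, they are positive and sum to one, so each modality's contribution varies per time step instead of being fixed, which is the property the abstract contrasts with uniform weighting.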
1 Introduction

Usually, the ways humans naturally communicate and express emotions are multimodal [23]. That means we can express emotions either verbally or visually. When more emotions are expressed through tone, the audio data may contain the major cues for emotion recognition; and when more facial expressions are used to express emotions, most of the clues needed for mining emotions can be considered to exist in the facial expressions. Identifying human emotions
* Ruo-Hong Huan [email protected] Extended author information available on the last page of the article
Multimedia Tools and Applications
using multimodal information such as facial expressions, vocal intonation, and linguistic content is an interesting and challenging problem. Videos provide multimodal data in both acoustic and visual modalities. Facial expressions, vocal tones, and text in video data can provide important information for better recognizing a person's true emotional state. Therefore, analyzing videos can produce better models for emotion recognition and sentiment analysis. Textual, visual, and audio are often regarded as the main modalities in research on multimodal emotion recognition in videos. Recognizing and utilizing these three modalities simultaneously can effectively extract the semantic and emotional information conveyed during communication. It is necessary to simultaneously establish