Hierarchical Temporal Fusion of Multi-grained Attention Features for Video Question Answering

Shaoning Xiao1 · Yimeng Li1 · Yunan Ye1 · Long Chen1 · Shiliang Pu1 · Zhou Zhao1 · Jian Shao1 · Jun Xiao1

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract

This work addresses the problem of video question answering (VideoQA) with a novel model and a new open-ended VideoQA dataset. VideoQA is a challenging task in visual information retrieval that aims to generate an answer according to the video content and the question. Ultimately, VideoQA is a video understanding task, and efficiently combining multi-grained representations is the key to understanding a video. Existing works mostly focus on overall frame-level visual understanding, which neglects the finer-grained and temporal information inside the video, or combine multi-grained representations simply by concatenation or addition. We therefore propose a multi-granularity temporal attention network that can search for the specific frames in a video that are holistically and locally related to the answer. We first learn mutual attention representations of the multi-grained visual content and the question. The mutually attended features are then combined hierarchically using a double-layer LSTM to generate the answer. Furthermore, we evaluate several multi-grained fusion configurations to demonstrate the advantage of this hierarchical architecture. The effectiveness of our model is demonstrated on a large-scale video question answering dataset built on the ActivityNet dataset.

Keywords Video question answering · Multi-grained representation · Temporal co-attention
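To make the hierarchical fusion described in the abstract concrete, the following is a minimal PyTorch sketch of one way a two-layer temporal fusion of question-attended frame-level and object-level features could be wired. The module name, tensor shapes, and the simple dot-product attention below are illustrative assumptions; they do not reproduce the paper's exact co-attention formulation.

# Minimal PyTorch sketch of hierarchical temporal fusion (illustrative only).
# Tensor names, dimensions, and the dot-product attention are assumptions
# for exposition, not the paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalTemporalFusion(nn.Module):
    def __init__(self, feat_dim=512, q_dim=512, hidden_dim=512, num_answers=1000):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, hidden_dim)
        self.object_proj = nn.Linear(feat_dim, hidden_dim)
        self.q_proj = nn.Linear(q_dim, hidden_dim)
        # First LSTM layer fuses frame-level features over time.
        self.lstm_frame = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        # Second LSTM layer fuses the frame-level stream with the
        # question-attended object-level stream.
        self.lstm_fuse = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def attend(self, feats, query):
        # Dot-product attention of the question over visual features
        # (a stand-in for co-attention); feats: (B, N, H), query: (B, H).
        scores = torch.bmm(feats, query.unsqueeze(2)).squeeze(2)   # (B, N)
        weights = F.softmax(scores, dim=1)
        return torch.bmm(weights.unsqueeze(1), feats).squeeze(1)   # (B, H)

    def forward(self, frame_feats, object_feats, question_feat):
        # frame_feats:   (B, T, feat_dim)    per-frame features
        # object_feats:  (B, T, K, feat_dim) per-frame object (region) features
        # question_feat: (B, q_dim)          encoded question
        B, T, K, _ = object_feats.shape
        q = self.q_proj(question_feat)                             # (B, H)
        frames = self.frame_proj(frame_feats)                      # (B, T, H)
        objects = self.object_proj(object_feats)                   # (B, T, K, H)

        # Object-level attention within each frame, guided by the question.
        obj_attended = torch.stack(
            [self.attend(objects[:, t], q) for t in range(T)], dim=1)  # (B, T, H)

        # First layer: temporal fusion of frame-level features.
        frame_seq, _ = self.lstm_frame(frames)                     # (B, T, H)

        # Second layer: hierarchical fusion of the two granularities.
        fused_seq, (h_n, _) = self.lstm_fuse(
            torch.cat([frame_seq, obj_attended], dim=2))           # (B, T, H)

        return self.classifier(h_n[-1])                            # (B, num_answers)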

1 Introduction

Understanding video content is an important yet challenging problem in computer vision. Recently, video understanding has progressed significantly with the introduction of large datasets and the emphasis on the temporal structure of videos. For deeper understanding, there is increasing interest in multi-modal tasks combining vision and language, such as video question answering and video captioning. As a sub-domain of video understanding, the visual question answering (VQA) task [1], which generates an answer to a posed query with the assistance of the referenced visual content, has become popular.

Corresponding author: Jun Xiao, [email protected]
1 Zhejiang University, Hangzhou, China

Fig. 1 Example of VideoQA. The question may focus on particular frames in the video and on fine-grained objects within a frame, which should receive more attention

By jointly modeling visual contents and semantic information, VQA has achieved promising performance in recent years. However, most VQA works [2–6] have concentrated on image content, building on basic research in image captioning and image retrieval [7,8]. Since video is as significant as images in conveying visual information, video-based visual question answering (VideoQA) has become an important application in computer vision. Compared to images, videos are more complex in construction