Refocused Attention: Long Short-Term Rewards Guided Video Captioning

Jiarong Dong 1,2 · Ke Gao 1 · Xiaokai Chen 1,2 · Juan Cao 1

Corresponding author: Ke Gao, [email protected]

1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
2 University of Chinese Academy of Sciences, Beijing, China

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract The adaptive cooperation of the visual model and the language model is essential for video captioning. However, due to the lack of proper guidance at each time step in end-to-end training, over-dependence on the language model often invalidates the attention-based visual model, a problem we call 'Attention Defocus' in this paper. Based on the important observation that the recognition precision of entity words reflects the effectiveness of the visual model, we propose a novel strategy called refocused attention to optimize the training and cooperation of the visual model and the language model, using ingenious guidance at the appropriate time steps. The strategy consists of a short-term-reward guided local entity recognition and a long-term-reward guided global relation understanding, neither of which requires any external training data. Moreover, a framework with hierarchical visual representations and hierarchical attention is established to fully exploit the potential strength of the proposed learning strategy. Extensive experiments demonstrate that the guidance strategy, together with the optimized structure, outperforms state-of-the-art video captioning methods with relative improvements of 7.7% in BLEU-4 and 5.0% in CIDEr-D on the MSVD dataset, even without multi-modal features.

Keywords Video captioning · Hierarchical attention · Reinforcement learning · Reward

1 Introduction

Video captioning aims at generating natural language descriptions of a video that are both semantically and syntactically correct, and has become a prominent interdisciplinary research problem in both academia and industry. Inspired by machine translation, the long short-term memory (LSTM) [1] network has been adopted for sequence modeling. In addition, attention strategies have been explored in recent years to optimize visual representations. The combination of an LSTM network and an attention mechanism has achieved state-of-the-art results in this task [2,3].

The visual model and the language model should work in concert to guarantee the reliable prediction of fluent descriptions with rich details. Due to the lack of proper guidance, however, an ideal cooperation between the visual signal and the language model is hard to guarantee in existing video captioning methods. Over-dependence on the language model is particularly serious and often invalidates the attention-based visual model. This problem is called Attention Defocus in this paper. To further exemplify the cause and effect of this problem, a test has been conducted, as shown in Fig. 1. Here, the performances of different visual representations are compared with an entity classifier that has been trained
