Adaptively Converting Auxiliary Attributes and Textual Embedding for Video Captioning Based on BiLSTM
Shuqin Chen1 · Xian Zhong1,2 · Luo Zhong1,2 · Lin Li1,2 · Wenxuan Liu1 · Cheng Gu1
Accepted: 8 September 2020 © Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract Automatically generating captions for videos is highly challenging because it is a cross-modal task involving both vision and text. Most existing models generate caption words based merely on visual content features, ignoring important underlying semantic information. The relationship between explicit semantics and hidden visual content is not holistically exploited, so these models can hardly describe fine-grained content accurately from a global view. To better extract and integrate the semantic information, we propose a novel encoder-decoder framework of bi-directional long short-term memory with attention model and conversion gate (BiLSTM-CG), which transfers auxiliary attributes and then generates detailed captions. Specifically, we extract semantic attributes from sliced frames in a multiple-instance learning (MIL) manner. MIL algorithms attempt to learn a classification function that can predict the labels of bags and/or instances in the visual content. In the encoding stage, we adopt 2D and 3D convolutional neural networks to encode video clips, and then feed the concatenated features into a BiLSTM. In the decoding stage, we design a CG that adaptively fuses semantic attributes into hidden features at the word level, converting between auxiliary attributes and textual embedding for video captioning. Furthermore, the CG can automatically decide at which time step to capture the explicit semantics and at which to rely on the hidden states of the language model to generate the next word. Extensive experiments conducted on the MSR-VTT and MSVD video captioning datasets demonstrate the effectiveness of our method compared with state-of-the-art approaches.
Keywords Video captioning · Bi-directional long short-term memory · Multiple-instance learning · Semantic fine-grained attributes · Attention mechanism · Conversion gate
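The two mechanisms named in the abstract, MIL attribute extraction and the conversion gate, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no equations, so the noisy-OR pooling, the sigmoid gate form, and every name below (`noisy_or`, `conversion_gate`, `init_params`, `W_g`, `W_a`) are assumptions chosen for illustration only.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def noisy_or(frame_probs):
    """Pool per-frame attribute probabilities into video-level ones.

    Assumption: noisy-OR is one common MIL pooling rule; the source only
    says attributes are extracted "in a MIL manner".
    frame_probs: list of frames, each a list of attribute probabilities.
    """
    num_attrs = len(frame_probs[0])
    bag = []
    for j in range(num_attrs):
        none_fire = 1.0
        for frame in frame_probs:
            none_fire *= 1.0 - frame[j]
        bag.append(1.0 - none_fire)
    return bag


def affine(weights, bias, vec):
    """Compute weights @ vec + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def conversion_gate(hidden, word_emb, attrs, params):
    """Hypothetical gate: blend projected attributes with the hidden state.

    A gate value near 1 favours the explicit attributes; a value near 0
    keeps the language-model hidden state unchanged.
    """
    gate_in = hidden + word_emb + attrs  # list concatenation
    gate = [sigmoid(z) for z in affine(params["W_g"], params["b_g"], gate_in)]
    proj = [math.tanh(z) for z in affine(params["W_a"], params["b_a"], attrs)]
    fused = [g * p + (1.0 - g) * h for g, p, h in zip(gate, proj, hidden)]
    return fused, gate


def init_params(hidden_dim, embed_dim, attr_dim, seed=0):
    """Small random weights; in a real model these would be learned."""
    rng = random.Random(seed)
    in_dim = hidden_dim + embed_dim + attr_dim

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_g": matrix(hidden_dim, in_dim),
        "b_g": [0.0] * hidden_dim,
        "W_a": matrix(hidden_dim, attr_dim),
        "b_a": [0.0] * hidden_dim,
    }


if __name__ == "__main__":
    attrs = noisy_or([[0.5, 0.0], [0.5, 0.0]])
    params = init_params(hidden_dim=3, embed_dim=2, attr_dim=2)
    fused, gate = conversion_gate([0.1, -0.2, 0.3], [0.5, 0.5], attrs, params)
    print("attributes:", attrs)
    print("gate:", gate)
    print("fused:", fused)
```

In this sketch the gate is computed per hidden dimension at every decoding step, which is one plausible reading of "adaptively fuse semantic attributes into hidden features at the word level"; the paper's actual gate may differ.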
1 Introduction
Video captioning has drawn increasing attention from the research community, owing to its numerous potential applications in surveillance, video retrieval evaluation, and other areas. However, the semantic gap between video and natural language still deserves more attention, given the rapid development of these two fields [1]. Existing video captioning methods can be divided into two categories: template-based methods [2] and sequence learning methods [3–6]. The former extracts subject-verb-object templates from training data, matches them to the video, and finally generates sentences from the templates at test time [7]. The latter mainly consists of two stages: encoding visual content and decoding hidden features. Recent d
Fig. 1 Examples of video description generation