Label-Based Automatic Alignment of Video with Narrative Sentences
1 Department of Computer Science, ETHZ, Zurich, Switzerland
{pelin.dogan,grossm}@inf.ethz.ch
2 Disney Research, Zurich, Switzerland
[email protected]
Abstract. In this paper we consider videos (e.g. Hollywood movies) and their accompanying natural language descriptions in the form of narrative sentences (e.g. movie scripts without timestamps). We propose a method for temporally aligning the video frames with the sentences using both visual and textual information, which provides an automatic timestamp for each narrative sentence. We compute the similarity between the two types of information using vectorial descriptors and cast this alignment task as a matching problem that we solve via dynamic programming. Our approach is simple to implement and highly efficient, and it does not require frequent dialogues, subtitles, or character face recognition. Experiments on various movies demonstrate that our method successfully aligns the script sentences with the corresponding video frames.
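To make the matching formulation concrete, the following is a minimal sketch (not the authors' implementation) of a dynamic program that monotonically assigns each frame to exactly one sentence while maximizing cumulative similarity. The cosine similarity and the descriptor arrays `frame_desc` and `sent_desc` are illustrative assumptions; the paper defines its own descriptors and similarity.

```python
import numpy as np

def align_frames_to_sentences(frame_desc, sent_desc):
    """Monotonic frame-to-sentence alignment via dynamic programming.

    frame_desc: (n_frames, d) array of visual descriptors.
    sent_desc:  (n_sents, d) array of sentence descriptors.
    Returns an array of length n_frames giving, for each frame, the
    index of the sentence it is assigned to (non-decreasing).
    NOTE: cosine similarity is an illustrative choice, not necessarily
    the similarity used in the paper.
    """
    F = frame_desc / np.linalg.norm(frame_desc, axis=1, keepdims=True)
    S = sent_desc / np.linalg.norm(sent_desc, axis=1, keepdims=True)
    sim = F @ S.T                            # (n_frames, n_sents)

    n, m = sim.shape
    D = np.full((n + 1, m + 1), -np.inf)     # D[i, j]: best score for the
    D[0, 0] = 0.0                            # first i frames, ending at sentence j
    advanced = np.zeros((n, m), dtype=bool)  # backpointers
    for i in range(1, n + 1):
        for j in range(1, min(i, m) + 1):
            stay, advance = D[i - 1, j], D[i - 1, j - 1]
            advanced[i - 1, j - 1] = advance >= stay
            D[i, j] = max(stay, advance) + sim[i - 1, j - 1]

    # Backtrack from the final cell to recover the assignment.
    assignment = np.empty(n, dtype=int)
    j = m - 1
    for i in range(n - 1, -1, -1):
        assignment[i] = j
        if advanced[i, j]:
            j -= 1
    return assignment
```

Given such an assignment, a timestamp for each sentence can be read off as the time span of the frames assigned to it.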
1 Introduction
Audio description consists of an audio narration track in which a narrator describes what is happening in the video. It allows visually impaired people to follow movies and other types of videos. However, relatively few movies provide audio description, and its preparation is particularly time-consuming. On the other hand, the scripts of numerous movies are available online, although they are generally plain text sentences without timing information. Our goal is to temporally align the script sentences with the corresponding shots in the video, i.e. to obtain the timing information of each sentence. These sentences can then be converted to audio description by an automatic speech synthesizer or read by a human describer, which would make a wider range of movies accessible to visually impaired people. Several additional applications could benefit from the alignment of video with text. For example, the resulting correspondences between video frames and sentences can form a learning corpus for improving image/video understanding and automatic caption generation. Video-text alignment also enables text-based video retrieval, since searching for a part of the video can be achieved via a simple text search.

In this paper, we address the temporal alignment of video frames with their descriptive sentences to obtain precise timestamps of the sentences with minimal manual intervention. A representative result is shown in Fig. 1.
The videos are typically movies, or parts of movies, with a duration of 10 to 20 minutes. We do not assume any presegmentation or shot threading of the video. We start by obtaining high-level labels for the video frames (e.g. "car", "walking", "street") with deep learning techniques [12] and use these labels to group the video frames into semantic shots. In this way, each shot contains relatively different content from its neighboring shots.
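As a rough illustration of this grouping step (a sketch under assumed details, not the paper's exact procedure), consecutive frames can be merged into one shot as long as their label sets overlap sufficiently; the Jaccard measure and its threshold below are hypothetical choices:

```python
def group_into_semantic_shots(frame_labels, min_jaccard=0.5):
    """Greedily merge consecutive frames whose high-level label sets
    overlap into (start, end) frame-index ranges ("semantic shots").

    frame_labels: one set of labels per frame, e.g.
        [{"car", "street"}, {"car", "street", "walking"}, ...]
    min_jaccard: hypothetical overlap threshold; the paper may use a
        different grouping criterion.
    """
    shots = []
    start, current = 0, set(frame_labels[0])
    for i, labels in enumerate(frame_labels[1:], start=1):
        union = current | labels
        overlap = len(current & labels) / len(union) if union else 1.0
        if overlap >= min_jaccard:
            current = union               # frame continues the running shot
        else:
            shots.append((start, i - 1))  # close the current shot
            start, current = i, set(labels)
    shots.append((start, len(frame_labels) - 1))
    return shots
```

Each resulting shot can then serve as a unit on the visual side of the alignment sketched above.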