Label-Based Automatic Alignment of Video with Narrative Sentences


1 Department of Computer Science, ETHZ, Zurich, Switzerland
{pelin.dogan,grossm}@inf.ethz.ch
2 Disney Research, Zurich, Switzerland
[email protected]

Abstract. In this paper we consider videos (e.g. Hollywood movies) and their accompanying natural language descriptions in the form of narrative sentences (e.g. movie scripts without timestamps). We propose a method for temporally aligning the video frames with the sentences using both visual and textual information, which provides automatic timestamps for each narrative sentence. We compute the similarity between both types of information using vectorial descriptors and propose to cast this alignment task as a matching problem that we solve via dynamic programming. Our approach is simple to implement, highly efficient, and does not require frequent dialogues, subtitles, or character face recognition. Experiments on various movies demonstrate that our method can successfully align the movie script sentences with the video frames of movies.
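As a rough illustration of the matching idea summarized in the abstract, the sketch below aligns sentence descriptors with shot descriptors by dynamic programming over a similarity matrix. It is only a minimal sketch under stated assumptions, not the formulation used in the paper: cosine similarity, the requirement that every shot is assigned to exactly one sentence in script order, and the function names cosine_similarity_matrix and align_monotonic are all hypothetical choices made for this example.

```python
import numpy as np


def cosine_similarity_matrix(sent_vecs, shot_vecs):
    """Cosine similarity between every sentence and every shot descriptor.

    Assumption: both inputs are 2-D arrays with one non-zero row per item.
    """
    a = sent_vecs / np.linalg.norm(sent_vecs, axis=1, keepdims=True)
    b = shot_vecs / np.linalg.norm(shot_vecs, axis=1, keepdims=True)
    return a @ b.T  # shape: (n_sentences, n_shots)


def align_monotonic(sim):
    """Assign each shot to one sentence, maximizing total similarity.

    Assumed constraints (illustrative, not taken from the paper): the first
    shot belongs to the first sentence, the last shot to the last sentence,
    and the sentence index either stays or advances by one between shots.
    """
    n_sent, n_shot = sim.shape
    if n_shot < n_sent:
        raise ValueError("need at least one shot per sentence")

    score = np.full((n_sent, n_shot), -np.inf)
    back = np.zeros((n_sent, n_shot), dtype=int)  # 0 = stay, 1 = advance
    score[0, 0] = sim[0, 0]

    for j in range(1, n_shot):
        for i in range(min(j + 1, n_sent)):
            stay = score[i, j - 1]
            advance = score[i - 1, j - 1] if i > 0 else -np.inf
            if advance > stay:
                score[i, j] = advance + sim[i, j]
                back[i, j] = 1
            else:
                score[i, j] = stay + sim[i, j]

    # Backtrack from the last sentence at the last shot.
    assignment = [0] * n_shot
    i = n_sent - 1
    for j in range(n_shot - 1, -1, -1):
        assignment[j] = i
        i -= int(back[i, j])
    return assignment


# Toy example: 3 sentence descriptors and 5 shot descriptors.
sentences = np.eye(3)
shots = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.0, 1.0],
])
print(align_monotonic(cosine_similarity_matrix(sentences, shots)))
# prints [0, 0, 1, 2, 2]
```

Given shot boundaries in seconds, the start time of the first shot assigned to a sentence would then serve as that sentence's timestamp. The table has one cell per sentence-shot pair, so the cost grows with the product of the number of sentences and the number of shots.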

1 Introduction

Audio description consists of an audio narration track in which the narrator describes what is happening in the video. It allows visually impaired people to follow movies or other types of videos. However, the number of movies that provide it is low, and its preparation is particularly time consuming. On the other hand, scripts of numerous movies are available online, although they generally consist of plain text sentences without timestamps. Our goal is to temporally align the script sentences to the corresponding shots in the video, i.e. to obtain the timing information of each sentence. These sentences can then be converted to audio description by an automatic speech synthesizer or can be read by a human describer. This would make a wider range of movies accessible to visually impaired people.

Several additional applications could benefit from the alignment of video with text. For example, the resulting correspondences of video frames and sentences can be used to improve image/video understanding and automatic caption generation by forming a learning corpus. Video-text alignment also enables text-based video retrieval, since searching for a part of the video could be achieved via a simple text search.

© Springer International Publishing Switzerland 2016. P. Dogan et al., in G. Hua and H. Jégou (Eds.): ECCV 2016 Workshops, Part I, LNCS 9913, pp. 605–620, 2016. DOI: 10.1007/978-3-319-46604-0_43

In this paper, we address temporal alignment of video frames with their descriptive sentences to obtain precise timestamps of the sentences with minimal manual intervention. A representative result is shown in Fig. 1. The videos are typically movies or parts of movies with a duration of 10 to 20 min. We do not assume any presegmentation or shot threading of the video. We start by obtaining the high-level labels of the video frames (e.g. “car”, “walking”, “street”) with deep learning techniques [12] and use these labels to group the video frames into semantic shots. In this way, each shot contains relatively diff