Video Summarization with Long Short-Term Memory


Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman

Department of Computer Science, University of Southern California, Los Angeles, USA
{zhang.ke,weilunc}@usc.edu
Department of Computer Science, University of California, Los Angeles, USA
[email protected]
Department of Computer Science, University of Texas at Austin, Austin, USA
[email protected]

K. Zhang and W.-L. Chao contributed equally. The online version of this chapter (doi:10.1007/978-3-319-46478-7_47) contains supplementary material, which is available to authorized users.

Abstract. We propose a novel supervised learning technique for summarizing videos by automatically selecting keyframes or key subshots. Casting the task as a structured prediction problem, our main idea is to use Long Short-Term Memory (LSTM) to model the variable-range temporal dependency among video frames, so as to derive both representative and compact video summaries. The proposed model successfully accounts for the sequential structure crucial to generating meaningful video summaries, leading to state-of-the-art results on two benchmark datasets. In addition to advances in modeling techniques, we introduce a strategy to address the need for a large amount of annotated data for training complex learning approaches to summarization. There, our main idea is to exploit auxiliary annotated video summarization datasets, in spite of their heterogeneity in visual styles and contents. Specifically, we show that domain adaptation techniques can improve learning by reducing the discrepancies in the original datasets’ statistical properties.

Keywords: Video summarization · Long short-term memory
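As a rough, hypothetical illustration of the modeling idea described in the abstract (not the authors' exact architecture), the sketch below uses a bidirectional LSTM to map a sequence of precomputed per-frame features to frame-level importance scores, from which a summary could then be assembled. The module names, feature dimension, and binary keyframe labels are illustrative assumptions.

    import torch
    import torch.nn as nn

    class FrameScorer(nn.Module):
        """Bidirectional LSTM that scores each frame's importance (illustrative)."""
        def __init__(self, feature_dim=1024, hidden_dim=256):
            super().__init__()
            # The LSTM reads the frame sequence in both directions, so each
            # frame's score can depend on a variable range of past and future frames.
            self.lstm = nn.LSTM(feature_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
            self.score = nn.Linear(2 * hidden_dim, 1)   # per-frame scalar score

        def forward(self, frames):                      # frames: (batch, T, feature_dim)
            hidden, _ = self.lstm(frames)               # (batch, T, 2 * hidden_dim)
            return self.score(hidden).squeeze(-1)       # (batch, T) raw importance scores

    # Hypothetical supervised training step on binary keyframe annotations.
    model = FrameScorer()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    features = torch.randn(2, 300, 1024)                # 2 videos, 300 frames of CNN features each
    labels = torch.randint(0, 2, (2, 300)).float()      # 1 = annotated keyframe (made-up labels)
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()

In practice such a model would be trained on human-annotated summaries; the random tensors above merely stand in for real frame features and labels.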

1 Introduction

Video has rapidly become one of the most common sources of visual information. The amount of video data is daunting: it takes over 82 years to watch all videos uploaded to YouTube per day! Automatic tools for analyzing and understanding video contents are thus essential. In particular, automatic video summarization is a key tool to help human users browse video data. A good video summary would compactly depict the original video, distilling its important events into a short, watchable synopsis. Video summarization can shorten a video in several ways.
In this paper, we focus on the two most common ones: keyframe selection, where the system identifies a series of defining frames [1–5], and key subshot selection, where the system identifies a series of defining subshots, each of which is a temporally contiguous set of frames spanning a short time interval [6–9].

There has been steadily growing interest in learning techniques for video summarization. Many approaches are based on unsupervised learning and define intuitive criteria to pick frames [1,5,6,9–14] without explicitly optimizing the evaluation metrics. Recent work has begun to explore supervised learning techniques [2,15–18].
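To make the two output formats concrete, the following sketch (illustrative only, not the procedure of any cited method) converts per-frame importance scores into either a set of keyframes or a set of key subshots; the shot boundaries and the length budget are assumed inputs.

    import numpy as np

    def select_keyframes(scores, k=5):
        """Keyframe selection: indices of the k highest-scoring frames."""
        return np.argsort(scores)[::-1][:k]

    def select_subshots(scores, shot_boundaries, budget=0.15):
        """Key-subshot selection: greedily keep the highest-scoring shots
        (temporally contiguous frame intervals) within a total length budget."""
        shots = list(zip(shot_boundaries[:-1], shot_boundaries[1:]))
        ranked = sorted(shots, key=lambda s: scores[s[0]:s[1]].mean(), reverse=True)
        max_frames = int(budget * len(scores))
        picked, used = [], 0
        for start, end in ranked:
            if used + (end - start) <= max_frames:
                picked.append((start, end))
                used += end - start
        return sorted(picked)

    scores = np.random.rand(300)                            # stand-in frame importance scores
    print(select_keyframes(scores, k=5))                    # indices of 5 keyframes
    print(select_subshots(scores, [0, 60, 150, 220, 300]))  # kept (start, end) subshot intervals

A budgeted optimization such as knapsack selection could replace the greedy loop; the point is only that the same frame-level scores support both summary formats.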