Crowd aware summarization of surveillance videos by deep reinforcement learning


Junfeng Xu 1 & Zhengxing Sun 1 & Chen Ma 1

Received: 23 December 2019 / Revised: 17 August 2020 / Accepted: 16 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Surveillance videos that record crowd behavior have increased dramatically as their applications have widened. Because such videos contain a huge number of redundant frames, there is growing demand for a way to review them quickly within a limited time. In this paper, we focus on the summarization of crowd surveillance videos. The task is difficult for two reasons. First, the model must decide whether to keep or discard each subshot of the input surveillance video stream so that the summary outlines the main behaviors of the crowd within a limited number of frames. Second, the summarization model must maintain its performance on long surveillance videos. To tackle these challenges, we formulate surveillance video summarization as a sequential decision-making process and train the summarization network within a reinforcement learning-based framework. A novel crowd location-density reward is proposed to teach the summarization network to produce high-quality summaries. In addition, a summarization network with a three-layer LSTM is designed to maintain performance across longer time spans. Extensive experiments on three public crowd surveillance video datasets show that the proposed method achieves state-of-the-art performance.

Keywords Surveillance video summarization · Crowd behaviors · Deep reinforcement learning · Unsupervised video summarization
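The sequential keep-or-discard formulation described in the abstract can be illustrated with a minimal policy-gradient (REINFORCE) sketch. It is a toy under stated assumptions, not the authors' implementation. The policy is a free logit per subshot rather than the three-layer LSTM. The reward is a hypothetical placeholder, not the crowd location-density reward. The function names and the parameters `budget`, `episodes`, and `lr` are illustrative.

```python
import math
import random


def sigmoid(x):
    """Map a logit to a keep-probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_actions(probs, rng):
    """Sample one keep (1) / discard (0) decision per subshot."""
    return [1 if rng.random() < p else 0 for p in probs]


def placeholder_reward(actions, scores, budget):
    """Hypothetical reward (an assumption, NOT the paper's reward):
    mean score of the kept subshots, minus a penalty when the kept
    fraction exceeds the length budget."""
    kept = sum(actions)
    if kept == 0:
        return 0.0
    quality = sum(a * s for a, s in zip(actions, scores)) / kept
    penalty = max(0.0, kept / len(actions) - budget)
    return quality - penalty


def reinforce_step(logits, scores, rng, budget=0.3, episodes=8, lr=0.5):
    """One REINFORCE update of the per-subshot logits.

    Uses the mean episode reward as a baseline; the gradient of
    log Bernoulli(a | sigmoid(z)) with respect to z is (a - p).
    """
    probs = [sigmoid(z) for z in logits]
    samples = [sample_actions(probs, rng) for _ in range(episodes)]
    rewards = [placeholder_reward(a, scores, budget) for a in samples]
    baseline = sum(rewards) / episodes
    new_logits = []
    for i, (z, p) in enumerate(zip(logits, probs)):
        grad = sum((r - baseline) * (a[i] - p)
                   for a, r in zip(samples, rewards)) / episodes
        new_logits.append(z + lr * grad)
    return new_logits
```

Calling `reinforce_step` repeatedly with a seeded `random.Random` and a list of per-subshot scores moves the logits toward summaries that score well under the placeholder reward. In a fuller version, a recurrent network over subshot features would take the place of the free logits.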

Junfeng Xu and Chen Ma are Co-First Authors

* Zhengxing Sun [email protected] Junfeng Xu [email protected] Chen Ma [email protected]

1 State Key Lab for Novel Software Technology, Nanjing University, Nanjing 210023, China

Multimedia Tools and Applications

1 Introduction

In recent years, surveillance videos, especially those that record crowds, have increased dramatically owing to wide applications such as crowd surveillance in squares, railway stations, shopping malls, and schools. These videos contain a huge number of frames (about 3000 frames a minute), which is a barrier to many practical uses. Video summarization shortens an input video into key shots or key frames while preserving the important information it contains. The shortened video provides an efficient way to browse large amounts of video data. Among previous work on surveillance video summarization, [9, 29] selected frames containing moving targets as the summary according to a frame-level dissimilarity measure. However, this approach is sensitive to minor changes in the video stream, so it is not suitable for crowd surveillance videos, which contain a large number of moving targets. [26, 43] proposed event-based surveillance video summarization, selecting key frames in a way that depends heavily on the results of complicated abnormal event detection. As discussed in [25], the performance may decline significantly when th