Multi-stream Deep Networks for Person to Person Violence Detection in Videos



Abstract. Violence detection in videos has numerous applications, ranging from parental control and child protection to multimedia filtering and retrieval. A number of approaches have been proposed to detect vital clues for violent actions, most of which employ trajectory-based action recognition techniques. However, these methods model only the general characteristics of human actions and therefore cannot adequately capture the specific high-order information of violent actions. As a result, they are not well suited to detecting violence, which is typically intense and correlated with specific scenes. In this paper, we propose a novel framework, multi-stream deep convolutional neural networks, for person-to-person violence detection in videos. In addition to the conventional spatial and temporal streams, we develop an acceleration stream to capture the important intensity information usually involved in violent actions. Moreover, a simple and effective score-level fusion strategy is proposed to integrate the multi-stream information. We demonstrate the effectiveness of our method on a typical violence dataset, and extensive experimental results show its superiority over state-of-the-art methods.

Keywords: Violence detection · Acceleration feature · Convolutional neural networks · Long short-term memory

1 Introduction

With the rapid development of digital media, massive collections of video material have become ubiquitous online. Detecting different types of human actions has a wide range of applications. Among them, detecting violent actions in videos has recently received considerable attention, both to protect children from offensive video content and to enable content-based video filtering and retrieval. Violence detection poses significant challenges to the computer vision community. On the one hand, violence is a subjective notion, so its definition can be ambiguous. Here, we adopt the common definition from VSD [1], i.e., physical violence or accident resulting in human injury or pain. On the other hand, violence detection in surveillance videos typically turns into a crowd scene analysis problem.

© Springer Nature Singapore Pte Ltd. 2016. T. Tan et al. (Eds.): CCPR 2016, Part I, CCIS 662, pp. 517–531, 2016. DOI: 10.1007/978-981-10-3002-4_43


Z. Dong et al.

In this paper, we are specifically interested in content-based person-to-person violence detection at relatively short distances in videos. To address this problem, previous researchers have preferred trajectory-based action recognition techniques [2,3,11]. Conventional approaches often follow the standard bag-of-words pipeline for representing general human actions. Specifically, they first extract several types of features from entire videos and then quantize the features into histograms using k-means clustering, VLAD [29] or Fisher Vector [19]. The key step of these methods is extracting proper features to model human actions. For instance, improved dense trajectory [26] extracts Motion