Online Human Action Detection Using Joint Classification-Regression Recurrent Neural Networks

1 Institute of Computer Science and Technology, Peking University, Beijing, China {lyttonhao,liujiaying}@pku.edu.cn
2 Microsoft Research Asia, Beijing, China {culan,wezeng}@microsoft.com
3 Institute of Automation, Chinese Academy of Sciences, Beijing, China {jlxing,cfyuan}@nlpr.ia.ac.cn

Abstract. Human action recognition from well-segmented 3D skeleton data has been intensively studied and is attracting increasing attention. Online action detection goes one step further and is more challenging: it identifies the action type and localizes the action positions on the fly from untrimmed stream data. In this paper, we study the problem of online action detection from streaming skeleton data. We propose a multi-task end-to-end Joint Classification-Regression Recurrent Neural Network to better explore the action type and temporal localization information. By employing a joint classification and regression optimization objective, this network is capable of automatically localizing the start and end points of actions more accurately. Specifically, by leveraging the merits of the deep Long Short-Term Memory (LSTM) subnetwork, the proposed model automatically captures the complex long-range temporal dynamics, which naturally avoids the typical sliding window design and thus ensures high computational efficiency. Furthermore, the subtask of regression optimization provides the ability to forecast the action prior to its occurrence. To evaluate our proposed model, we build a large streaming video dataset with annotations. Experimental results on our dataset and the public G3D dataset both demonstrate very promising performance of our scheme.

Keywords: Action detection · Recurrent neural network · Joint classification-regression

This work was done at Microsoft Research Asia.
Electronic supplementary material: The online version of this chapter (doi:10.1007/978-3-319-46478-7_13) contains supplementary material, which is available to authorized users.
© Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part VII, LNCS 9911, pp. 203–220, 2016. DOI: 10.1007/978-3-319-46478-7_13

Y. Li et al.

1 Introduction

Human action detection is an important problem in computer vision, with broad practical applications such as visual surveillance, human-computer interaction and intelligent robot navigation. Unlike action recognition and offline action detection, which determine the action after it is fully observed, online action detection aims to detect the action on the fly, as early as possible. It is highly desirable to localize the start point and end point of an action accurately and promptly along the time axis and to determine the action type, as illustrated in Fig. 1. It is also desirable to forecast the start and end of actions prior to their occurrence. For example, for an intelligent robot system, in addition to detecting actions accurately, it would be valuable if the system could predict the start of an impending action or the end of an ongoing action and then get something ready