Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition




1 School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Singapore {jliu029,amir3,wanggang}@ntu.edu.sg
2 School of Electrical and Information Engineering, University of Sydney, Sydney, Australia [email protected]

Abstract. 3D action recognition – analysis of human actions based on 3D skeleton data – has recently become popular due to its succinctness, robustness, and view-invariant representation. Recent attempts on this problem have suggested developing RNN-based learning methods to model the contextual dependency in the temporal domain. In this paper, we extend this idea to spatio-temporal domains to analyze the hidden sources of action-related information within the input data over both domains concurrently. Inspired by the graphical structure of the human skeleton, we further propose a more powerful tree-structure based traversal method. To handle the noise and occlusion in 3D skeleton data, we introduce a new gating mechanism within LSTM to learn the reliability of the sequential input data and accordingly adjust its effect on updating the long-term context information stored in the memory cell. Our method achieves state-of-the-art performance on 4 challenging benchmark datasets for 3D human action analysis.

Keywords: 3D action recognition · Recurrent neural networks · Long short-term memory · Trust gate · Spatio-temporal analysis

1 Introduction

© Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part III, LNCS 9907, pp. 816–833, 2016. DOI: 10.1007/978-3-319-46487-9_50

In recent years, action recognition based on the locations of major joints of the body in 3D space has attracted a lot of attention. Different feature extraction and classifier learning approaches have been studied for 3D action recognition [1–3]. For example, Yang and Tian [4] represented the static postures and the dynamics of the motion patterns via eigenjoints and used a Naïve-Bayes-Nearest-Neighbor classifier. An HMM was applied by [5] for modeling the temporal dynamics of the actions over a histogram-based representation of 3D joint locations. Evangelidis et al. [6] learned a GMM over the Fisher kernel representation of a succinct skeletal feature, called skeletal quads. Vemulapalli et al. [7] represented the skeleton configurations and actions as points and curves in a Lie group, respectively, and used an SVM classifier to classify the actions. A skeleton-based dictionary learning method utilizing group sparsity and a geometry constraint was also proposed by [8]. An angular skeletal representation over the tree-structured set of joints was introduced in [9], which calculated the similarity of these features over the temporal dimension to build the global representation of the action samples and fed them to an SVM for final classification.

Recurrent neural networks (RNNs), which are a variant of neural nets for handling sequential data of variable length, have been successfully applied to language modeling [10–12]