Hierarchical Dynamic Parsing and Encoding for Action Recognition




1 Science and Technology on Integrated Information System Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
[email protected]
2 Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208, USA
{jzt011,yingwu}@eecs.northwestern.edu
3 State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
[email protected]
4 State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
[email protected]

Abstract. A video action generally exhibits quite complex rhythms and non-stationary dynamics. To model such non-uniform dynamics, this paper describes a novel hierarchical dynamic encoding method that captures both the locally smooth dynamics and the globally drastic dynamic changes, providing a multi-layer joint representation for temporal modeling in action recognition. At the first layer, the action sequence is parsed in an unsupervised manner into several smoothly changing stages corresponding to different key poses or temporal structures. The dynamics within each stage are encoded by mean-pooling or by learning-to-rank-based encoding. At the second layer, the temporal information of the ordered dynamics extracted from the previous layer is encoded again to form the overall representation. Extensive experiments on a gesture dataset (ChaLearn) and several generic action datasets (Olympic Sports and Hollywood2) demonstrate the effectiveness of the proposed method.

Keywords: Action recognition · Hierarchical modeling · Dynamic encoding

1 Introduction
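As a concrete illustration of the two-layer scheme summarized in the abstract, the sketch below parses a feature sequence into stages, mean-pools each stage (layer one), and then encodes the temporal order of the stage descriptors with a learning-to-rank-style pooling (layer two). This is illustrative only: the change-point heuristic, the least-squares rank pooling, and all function names are assumptions, not the authors' implementation.

```python
import numpy as np

def parse_into_stages(frames, n_stages=3):
    """Unsupervised temporal parsing (illustrative heuristic): cut the
    sequence at the (n_stages - 1) largest frame-to-frame feature jumps."""
    diffs = np.linalg.norm(np.diff(frames, axis=0), axis=1)
    cuts = np.sort(np.argsort(diffs)[-(n_stages - 1):] + 1)
    return np.split(frames, cuts)

def rank_pool(vectors):
    """Learning-to-rank style encoding: fit w so that w . v_t grows with t
    (least squares); w itself serves as the sequence representation."""
    t = np.arange(1, len(vectors) + 1, dtype=float)
    w, *_ = np.linalg.lstsq(np.asarray(vectors), t, rcond=None)
    return w

def hierarchical_encode(frames, n_stages=3):
    # Layer 1: parse into smooth stages and mean-pool within each stage.
    stages = parse_into_stages(frames, n_stages)
    stage_descs = [s.mean(axis=0) for s in stages]
    # Layer 2: encode the temporal order of the ordered stage descriptors.
    return rank_pool(stage_descs)

rng = np.random.default_rng(0)
frames = rng.normal(size=(30, 8))   # 30 frames of 8-D features
rep = hierarchical_encode(frames, n_stages=3)
print(rep.shape)                    # -> (8,)
```

The final representation has the same dimensionality as a frame feature, so it can be fed directly to a standard classifier; within-stage mean-pooling could be swapped for rank pooling as the abstract suggests.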

The performance of action recognition methods depends heavily on the representation of the video data. For this reason, many recent efforts focus on developing action representations at different levels. The state-of-the-art action representation is based on the Bag-of-Visual-Words (BoW) [1] framework, which comprises three steps: local descriptor extraction, codebook learning, and descriptor encoding. The raw local descriptors themselves are noisy, and the discriminative power of the distributed BoW representation comes from the efficient coding of these local descriptors. As a result, the temporal dependencies and dynamics of the video are seriously neglected.

© Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part IV, LNCS 9908, pp. 202–217, 2016. DOI: 10.1007/978-3-319-46493-0_13

Fig. 1. The action “jump” can be roughly parsed into three divisions: the running approach, flying in the air, and touching down. Each division can in turn be parsed into different sub-divisions.

Dynamics characterize the inherent global temporal dependencies of actions. Existing dynamics-based approaches generally view the video as a sequence of observations and model it with temporal models. The models can either