Spatio-temporal attention on manifold space for 3D human action recognition
Chongyang Ding¹ · Kai Liu¹ · Fei Cheng¹ · Evgeny Belyaev²
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Recently, skeleton-based action recognition has become increasingly prevalent in computer vision due to its wide range of applications, and many approaches have been proposed to address this task. Among these methods, manifold space is widely used to deal with the relative geometric relationships between different body parts in human skeletons. Existing studies treat all geometric relationships as having the same degree of importance; thus, they cannot focus on significant information. In addition, the traditional attention mechanism is designed mostly for Euclidean space and is not applicable in manifold space. To address these issues, we propose a spatial and temporal attention mechanism on Lie groups for 3D human action recognition. We build our network architecture with a generalized attention mechanism that extends the scope of attention from traditional Euclidean space to manifold space. In addition, our model can learn to identify the significant spatial features and temporal stages with effective attention modules, which focus on discriminative transformation relationships between different rigid bodies within each frame and allocate different levels of attention to different frames. Extensive experiments are conducted on standard datasets, and the experimental results demonstrate the effectiveness of the proposed network architecture.

Keywords Skeleton-based · Action recognition · Spatial attention · Temporal attention · Manifold space
¹ Department of Computer Science and Technology, Xidian University, Xi’an, China
² Department of Information Systems, ITMO University, Saint Petersburg, Russia

1 Introduction
Human action recognition has been an important and challenging task in computer vision due to its wide range of applications, such as intelligent video surveillance, video understanding and human-computer interaction. The goal of human action recognition is to identify actions from input sensor streams with the aim of assisting the automatic analysis of media resources. The existing research can be broadly grouped into three main categories based on the type of input stream: 2D videos, 3D depth maps, and 3D skeletons. These inputs have been widely studied for action recognition owing to the convenience of data capture, the additional depth information, and invariance to viewpoint or appearance. In this paper, we focus on exploring 3D skeletons for action recognition. With the prevalence of cost-effective depth cameras (e.g., Kinect) and the corresponding pose estimation algorithms [36], skeleton data for action recognition has become readily accessible. Compared with traditional RGB videos [30], skele