Spatio-temporal attention on manifold space for 3D human action recognition


Chongyang Ding1 · Kai Liu1 · Fei Cheng1 · Evgeny Belyaev2

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Recently, skeleton-based action recognition has become increasingly prevalent in computer vision due to its wide range of applications, and many approaches have been proposed to address this task. Among these methods, manifold space is widely used to model the relative geometric relationships between different body parts in human skeletons. Existing studies, however, treat all geometric relationships as equally important and therefore cannot focus on the most significant information. In addition, traditional attention mechanisms are designed mainly for Euclidean space and are not applicable in manifold space. To address these issues, we propose a spatial and temporal attention mechanism on Lie groups for 3D human action recognition. We build our network architecture with a generalized attention mechanism that extends the scope of attention from traditional Euclidean space to manifold space. Our model learns to identify significant spatial features and temporal stages through attention modules that focus on discriminative transformation relationships between different rigid bodies within each frame and allocate different levels of attention to different frames. Extensive experiments on standard datasets demonstrate the effectiveness of the proposed network architecture.

Keywords Skeleton-based · Action recognition · Spatial attention · Temporal attention · Manifold space

Kai Liu [email protected]
Chongyang Ding [email protected]
Fei Cheng [email protected]
Evgeny Belyaev e [email protected]

1 Department of Computer Science and Technology, Xidian University, Xi’an, China
2 Department of Information Systems, ITMO University, Saint Petersburg, Russia

1 Introduction

Human action recognition has been an important and challenging task in computer vision due to its wide range of applications, such as intelligent video surveillance, video understanding and human-computer interaction. The goal of human action recognition is to identify actions from input sensor streams with the aim of assisting the automatic analysis of media resources. Existing research can be broadly grouped into three main categories based on the type of input stream: 2D videos, 3D depth maps, and 3D skeletons. These inputs have been widely studied for action recognition because of, respectively, the convenience of data capture, the additional depth information, and the invariance to viewpoint or appearance. In this paper, we focus on exploring 3D skeletons for action recognition. With the prevalence of cost-effective depth cameras (e.g., Kinect) and the corresponding pose estimation algorithms [36], acquiring skeleton data for action recognition has become readily accessible. Compared with traditional RGB videos [30], skele
