View transform graph attention recurrent networks for skeleton-based action recognition


ORIGINAL PAPER

Qingqing Huang1 · Fengyu Zhou1 · Runze Qin1 · Yang Zhao2

Received: 2 April 2020 / Accepted: 8 September 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract

Skeleton-based human action recognition has recently attracted the attention of researchers owing to the accessibility and popularity of 3D skeleton data. However, it is difficult to represent spatial–temporal skeleton sequences effectively, given the large variations in action representations when they are captured from different viewpoints. To obtain a better representation of spatial–temporal skeletal features, this paper introduces a view transform graph attention recurrent networks (VT+GARN) method for view-invariant human action recognition. We design a sequence-based view-invariant transform strategy to reduce the influence of different views on the spatial–temporal positions of skeleton joints. The graph attention recurrent network then automatically calculates the attention coefficients, learns the representation of the transformed spatiotemporal skeletal features and outputs the classification result. Ablation studies and extensive experiments on three challenging datasets, Northwestern-UCLA, NTU RGB+D and UWA3DII, demonstrate the effectiveness and superiority of our method.

Keywords View transform · Graph attention · Skeleton · Action recognition

✉ Fengyu Zhou [email protected]

1 School of Control Science and Engineering, Shandong University, Jinan, People’s Republic of China

2 School of Electrical Engineering and Automation, Qilu University of Technology, Jinan, People’s Republic of China

1 Introduction

Human action recognition is widely used in practical fields such as visual surveillance, video retrieval and robotics [1]. Depending on the type of input data, human action recognition methods can be categorized into RGB methods [2–5], which use optical flow and image data, and skeleton methods, which use human joint coordinates. Compared with RGB-based methods, skeleton-based methods can represent motion with a small amount of data and are more robust to lighting and background changes [6,7]. The method introduced in this paper is based on 3D skeleton data. One major challenge in processing skeleton sequences is the wide diversity of viewpoints. Traditional methods generally rely on handcrafted spatiotemporal features to describe skeleton sequences [6–8]. With the advent of the data era, various neural networks have become the focus of attention, such as the recurrent neural network (RNN) [9,10], which models the temporal information of the skeleton; the convolutional neural network (CNN), which encodes skeleton sequences into a pseudo-image to extract local spatial features of the structure [11–14]; and the graph convolutional network (GCN) [15–19], which directly learns the spatial structure through the skeleton graph. The majority of these architectures are constructed i