View-independent representation with frame interpolation method for skeleton-based human action recognition



ORIGINAL ARTICLE

Yingguo Jiang¹ · Jun Xu² · Tong Zhang¹*

Received: 28 February 2020 / Accepted: 10 April 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

¹ School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
² Unit 95269 of the People's Liberation Army, Guangzhou 510075, China
* Corresponding author: Tong Zhang ([email protected])

Abstract

Human action recognition is an important branch of computer vision. Recognition from skeletal data is challenging because of the joints' complex spatiotemporal information. In this work, we propose a method for action recognition that consists of three parts: a view-independent representation, frame interpolation, and a combined model. First, the action sequence is transformed into a representation that is independent of the viewpoint. Second, when the judgment conditions are met, differentiated frame interpolations are used to expand the information in the temporal dimension. Then, a combined model is adopted to extract features from these representations and classify the actions. Experimental results on two multi-view benchmark datasets, Northwestern-UCLA and NTU RGB+D, demonstrate the effectiveness of the complete method. Although it uses only one type of action feature and a combined model with a simple architecture, our method still outperforms most of the referenced state-of-the-art methods and shows strong robustness.

Keywords: Action recognition · View-independent representation · Frame interpolation · Transfer CNN · Self-attention mechanism
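To make the first stage of the pipeline concrete, the following is a minimal sketch of one common way to obtain a view-independent skeleton representation: translate the hip joint to the origin and rotate the body about the vertical axis so that the shoulder line is parallel to the x-axis. The joint indices (hip, l_sh, r_sh) and the specific transformation are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def view_normalize(seq, hip=0, l_sh=4, r_sh=8):
    """Map a skeleton sequence of shape (T, J, 3) to a body-centred frame.

    The joint indices are placeholders; every dataset defines its own layout.
    """
    seq = seq - seq[:, hip:hip + 1, :]       # translation: the hip joint becomes the origin
    v = seq[0, r_sh] - seq[0, l_sh]          # shoulder direction in the first frame
    theta = np.arctan2(v[1], v[0])           # its angle to the x-axis in the ground plane
    c, s = np.cos(-theta), np.sin(-theta)
    rot_z = np.array([[c, -s, 0.0],
                      [s,  c, 0.0],
                      [0.0, 0.0, 1.0]])      # rotation about the vertical (z) axis
    return seq @ rot_z.T                     # same pose, canonical orientation
```

Because every sequence ends up in the same canonical frame, a classifier trained on one camera viewpoint can generalize to others.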

1 Introduction

Human action recognition is applied in a wide range of fields, such as virtual reality, intelligent robots [1], and emotional analysis [2]. As an important branch of computer vision, it has attracted much attention. Generally speaking, an action sequence can be represented as a set of human pose trajectories in both the spatial and temporal dimensions, using three main types of data: RGB images, depth data, and skeletal data.
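As a concrete illustration of such a trajectory set, a skeletal sequence can be stored as a (T, J, 3) array of T frames and J joints in 3D. The sketch below resamples such an array to a longer temporal length with plain linear interpolation; it is a minimal stand-in for the differentiated frame interpolation used later in the paper, whose judgment conditions and schemes are not reproduced here.

```python
import numpy as np

def interpolate_frames(seq, target_len):
    """Linearly resample a (T, J, 3) skeleton sequence to target_len frames."""
    T = seq.shape[0]
    src = np.linspace(0.0, T - 1.0, num=target_len)  # fractional source positions
    lo = np.floor(src).astype(int)                   # frame just before each position
    hi = np.minimum(lo + 1, T - 1)                   # frame just after, clamped at the end
    w = (src - lo)[:, None, None]                    # blend weight, broadcast over joints
    return (1.0 - w) * seq[lo] + w * seq[hi]
```

For example, interpolate_frames(clip, 64) stretches a 20-frame clip to 64 frames, enriching the temporal dimension before feature extraction.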




Some studies [3–5] of human action recognition focused on finding the edges of human postures in RGB images and matching the posture contours corresponding to an action. RGB images collected by an optical device are sensitive to various factors, such as the brightness of the environment, complex background interference, and the optical performance of the device. Besides, human actions are performed in the three-dimensional (3D) world, so RGB images lack 3D spatial information. Due to the convenience of data collection and the popularity of video, there is still some research [6, 7] on human action recognition based on RGB images. However, noisy data and the lack of 3D spatial information result in low recognition accuracy and poor robustness.

Depth data are human posture data obtained by sensing infrared light instead of visible light [8]. This process ignores the texture of the human surface and makes the edges of the human body clear. Many previous works [9–11] of human action recognition were