Human action recognition based on 3D body mask and depth spatial-temporal maps

Xing Li 1 · Zhenjie Hou 1,2 · Jiuzhen Liang 1 · Chen Chen 3

Received: 7 May 2019 / Revised: 31 March 2020 / Accepted: 11 August 2020 / © Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract In this paper, a method based on depth spatial-temporal maps (DSTMs) is presented for human action recognition from depth video sequences; DSTMs provide compact global spatial and temporal information of human motion. In our approach, the initial frame of a depth sequence is dilated to generate a 3D body mask. The 3D body mask is then applied to each depth frame to obtain a new depth sequence containing the major part of the human body. Each frame of the new depth sequence is projected onto three orthogonal axes to obtain three binary lists. Under each projection axis, the binary lists are stitched in order through the entire depth sequence to form a DSTM. We evaluate our method on two standard datasets. Experimental results show that the method effectively captures the spatial and temporal information of human motion and improves the accuracy of human action recognition.

Keywords Human action recognition · 3D body mask · Depth spatial-temporal map
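The pipeline described in the abstract can be illustrated with a minimal sketch. The following is not the authors' implementation: it assumes the depth sequence is a T x H x W NumPy array with zero marking background, and the dilation radius, the number of depth bins (64), the maximum depth range, and the helper names (body_mask_from_initial_frame, frame_projections, compute_dstms) are illustrative assumptions.

```python
# Minimal sketch of DSTM construction (assumptions noted above, not the authors' code).
import numpy as np
from scipy.ndimage import binary_dilation

def body_mask_from_initial_frame(first_frame, dilation_radius=5):
    """Dilate the foreground of the initial depth frame to obtain the 3D body mask."""
    foreground = first_frame > 0                          # non-zero depth = body pixels
    struct = np.ones((2 * dilation_radius + 1,) * 2, dtype=bool)
    return binary_dilation(foreground, structure=struct)  # H x W boolean mask

def frame_projections(frame, depth_bins=64, max_depth=4000.0):
    """Project one masked depth frame onto three orthogonal axes as binary lists."""
    fg = frame > 0
    x_list = fg.any(axis=0).astype(np.uint8)              # length-W list: occupied columns
    y_list = fg.any(axis=1).astype(np.uint8)              # length-H list: occupied rows
    z_list = np.zeros(depth_bins, dtype=np.uint8)         # occupied depth bins
    if fg.any():
        bins = np.minimum((frame[fg] / max_depth * depth_bins).astype(int),
                          depth_bins - 1)
        z_list[bins] = 1
    return x_list, y_list, z_list

def compute_dstms(depth_seq):
    """Stitch the per-frame binary lists over time into three DSTMs."""
    mask = body_mask_from_initial_frame(depth_seq[0])
    masked_seq = depth_seq * mask                         # keep the major part of the body
    x_lists, y_lists, z_lists = zip(*(frame_projections(f) for f in masked_seq))
    # one row per frame: DSTMs of shape T x W, T x H and T x depth_bins
    return np.stack(x_lists), np.stack(y_lists), np.stack(z_lists)
```

Under this reading, each DSTM has one row per frame, so the vertical direction of the map encodes time while the horizontal direction encodes the occupied positions along the corresponding axis.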

✉ Zhenjie Hou
[email protected]

Xing Li
[email protected]

Jiuzhen Liang
[email protected]

Chen Chen
[email protected]

1 College of Information Science and Engineering, Changzhou University, Changzhou, China

2 Jiangsu Province Networking and Mobile Internet Technology Engineering Key Laboratory, Huaian, China

3 Department of Electrical and Computer Engineering, University of North Carolina at Charlotte, Charlotte, USA


1 Introduction

Human action recognition has a wide range of applications in human-computer interaction [2–4, 18, 19], including somatosensory games, intelligent monitoring systems, etc. Early work used RGB cameras to collect video sequences of the human body [8, 11]. In paper [1], the authors introduce Motion Energy Images (MEI) and Motion History Images (MHI) to capture the spatial and temporal information of human action in a video sequence. In paper [6], the authors propose a hierarchical extension algorithm for computing dense motion flow from MHI. However, these methods based on color image sequences are very sensitive to illumination changes, which greatly limits the robustness of action recognition.

With the development of technology, and especially the launch of Microsoft's somatosensory device Kinect, it has become possible to study human action recognition based on depth video sequences. Compared with color sequences, depth sequences contain rich 3D information, are insensitive to illumination changes, and make it easier to extract the foreground of human actions. In recent years, many methods based on depth video sequences have been proposed, including 3D points [9], spatial-temporal depth cuboids [14], depth motion maps (DMM) [16, 20], surface normals [10, 21], and skeleton joints [13]. In paper [17], Yang pr