Marker-Less 3D Human Motion Capture with Monocular Image Sequence and Height-Maps

1 College of Computer Science, Zhejiang University, Hangzhou, China
{answeror,yonghaoliu,hanfeilin,ylgui,wangzh_cs,gengwd}@zju.edu.cn
2 Interactive and Digital Media Institute, National University of Singapore, Singapore, Singapore
[email protected]
3 School of Computing, National University of Singapore, Singapore, Singapore
[email protected]

Abstract. Recovering 3D human pose with a monocular camera is an inherently ill-posed problem, because a single 2D image admits a large number of possible back-projections into 3D space. To improve the accuracy of 3D motion reconstruction, we introduce additional built-in knowledge, namely the height-map, into the algorithmic scheme for reconstructing 3D pose and motion from a single calibrated camera view. Our proposed framework makes two major contributions. First, the RGB image and its computed height-map are combined to detect 2D joint landmarks with a dual-stream deep convolutional network. Second, we formulate a new objective function to estimate 3D motion from the detected 2D joints in the monocular image sequence, which reinforces temporal coherence constraints on both the camera and the 3D poses. Experiments on the HumanEva, Human3.6M, and MCAD datasets validate that our method outperforms state-of-the-art algorithms on both 2D joint localization and 3D motion recovery. Moreover, the evaluation results on HumanEva indicate that the performance of our single-view approach is comparable to that of its multi-view deep learning counterpart.

Keywords: Human pose estimation · Height-map

1 Introduction

Electronic supplementary material: The online version of this chapter (doi:10.1007/978-3-319-46493-0_2) contains supplementary material, which is available to authorized users.
© Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part IV, LNCS 9908, pp. 20–36, 2016. DOI: 10.1007/978-3-319-46493-0_2

Marker-less motion capture is an active field of research in computer vision and graphics, with applications in computer animation, video surveillance, biomedical research, and sports science. According to a recent study on world population aging [1], life expectancy at age 60 and above is expected to grow in the next few decades. This points to an emerging need for video-based analysis systems that monitor the elderly in nursing homes and raise event alerts.

Existing motion capture approaches can be broadly divided into two categories: (1) methods based on a monocular camera [2–5], and (2) methods that rely on synchronous multi-view streams [6–8]. Single-view approaches are now receiving more attention in industry. Although multi-view visual data provide richer information for marker-less motion capture, such data are not always available in practice, especially in video surveillance applications.

The recovery of 3D human poses from monocular image sequences is an inherently ill-posed problem, since the observed projection on a 2D image can be