Towards Viewpoint Invariant 3D Human Pose Estimation
Abstract. We propose a viewpoint invariant model for 3D human pose estimation from a single depth image. To achieve this, our discriminative model embeds local regions into a learned viewpoint invariant feature space. Formulated as a multi-task learning problem, our model is able to selectively predict partial poses in the presence of noise and occlusion. Our approach leverages a convolutional and recurrent network architecture with a top-down error feedback mechanism to self-correct previous pose estimates in an end-to-end manner. We evaluate our model on a previously published depth dataset and a newly collected human pose dataset containing 100K annotated depth images from extreme viewpoints. Experiments show that our model achieves competitive performance on frontal views while achieving state-of-the-art performance on alternate viewpoints.
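The abstract compresses the whole pipeline into one sentence, so a minimal PyTorch sketch of the general idea may help: a CNN embeds the depth image, and a recurrent cell repeatedly emits corrections to the running pose estimate (the top-down error feedback). Everything below (module names, layer sizes, num_joints=15, the four refinement steps) is an illustrative assumption, not the authors' architecture, which additionally uses learned local-region embeddings and multi-task partial-pose outputs.

```python
import torch
import torch.nn as nn

class IterativePoseEstimator(nn.Module):
    """Sketch of a conv + recurrent pose estimator with error feedback.

    A CNN embeds the depth image; an LSTM cell refines the pose over
    several iterations, each step consuming the image embedding plus the
    current estimate and emitting a correction (delta) to that estimate.
    """

    def __init__(self, num_joints=15, embed_dim=256, hidden_dim=256, steps=4):
        super().__init__()
        self.steps = steps
        self.encoder = nn.Sequential(  # depth image -> feature vector
            nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        self.rnn = nn.LSTMCell(embed_dim + 3 * num_joints, hidden_dim)
        self.delta = nn.Linear(hidden_dim, 3 * num_joints)  # pose correction

    def forward(self, depth):
        b = depth.size(0)
        feat = self.encoder(depth)
        pose = depth.new_zeros(b, self.delta.out_features)  # initial guess
        h = feat.new_zeros(b, self.rnn.hidden_size)
        c = torch.zeros_like(h)
        estimates = []
        for _ in range(self.steps):
            h, c = self.rnn(torch.cat([feat, pose], dim=1), (h, c))
            pose = pose + self.delta(h)  # feedback: correct previous estimate
            estimates.append(pose)       # supervise every iteration
        return estimates


# Usage: one 224x224 depth image, four refinement steps.
model = IterativePoseEstimator()
preds = model(torch.randn(1, 1, 224, 224))
print(preds[-1].shape)  # torch.Size([1, 45]) -> 15 joints x (x, y, z)
```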
1 Introduction
Depth sensors are becoming ubiquitous in applications ranging from security to robotics and from entertainment to smart spaces [5]. While recent advances in pose estimation have improved performance on front and side views, most real-world settings present challenging viewpoints such as top or angled views in retail stores, hospital environments, or airport settings. These viewpoints introduce high levels of self-occlusion, making human pose estimation difficult for existing algorithms. Humans are remarkably good at predicting full rigid-body and articulated poses in these challenging scenarios. However, most work in the human pose estimation literature has addressed relatively constrained settings. There has been a long line of work on generative pose models, where a pose is estimated by constructing a skeleton using templates or priors in a top-down manner [12,16,18,19]. In contrast, discriminative methods directly identify individual body parts, labels, or positions and construct the skeleton in a bottom-up approach [14,15,51,52,54]. However, recent research in both classes focuses primarily
on frontal views with few occlusions, despite the abundance of occlusion and partial-pose research in object detection [2–4,7,9,22,23,32,53,61]. Even modern representation learning techniques address human pose estimation only from frontal or side views [10,17,34,41,42,59,60]. While the above methods improve human pose estimation, they fail to address viewpoint variance. In this work we address the problem of viewpoint invariant pose estimation from single depth images. There are two challenges towards this goal. The first is designing a model that is not only rich enough to reason about 3D spatial information but also robust to viewpoint changes. The model must understand both local and global human pose structure.
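The introduction does not yet say how a viewpoint invariant feature space might be learned, but one common way to realize the abstract's "embeds local regions into a learned viewpoint invariant feature space" is a metric-learning objective: patches of the same body part seen from different viewpoints should embed close together, patches of different parts far apart. The sketch below uses a hypothetical triplet loss for this; the names LocalRegionEmbedder and viewpoint_invariance_loss, the 32x32 patch size, and the margin are all assumptions, not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalRegionEmbedder(nn.Module):
    """Embed local depth patches into a shared feature space (sketch)."""

    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, embed_dim),
        )

    def forward(self, patches):  # (B, 1, 32, 32) local depth patches
        return F.normalize(self.net(patches), dim=1)  # unit-length embeddings


def viewpoint_invariance_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: the same body part seen from two viewpoints should
    embed closer together than patches of two different body parts."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()


# Usage: anchor/positive are the same joint from two viewpoints;
# negative is a patch of a different joint.
embedder = LocalRegionEmbedder()
a, p, n = (torch.randn(8, 1, 32, 32) for _ in range(3))
loss = viewpoint_invariance_loss(embedder(a), embedder(p), embedder(n))
loss.backward()
```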