Human Pose Estimation Using Deep Consensus Voting


Abstract. In this paper we consider the problem of human pose estimation from a single still image. We propose a novel approach where each location in the image votes for the position of each keypoint using a convolutional neural net. The voting scheme allows us to utilize information from the whole image, rather than rely on a sparse set of keypoint locations. Using dense, multi-target votes not only produces good keypoint predictions but also enables us to compute image-dependent joint keypoint probabilities by looking at consensus voting. This differs from most previous methods, where joint probabilities are learned from relative keypoint locations and are independent of the image. Finally, we combine the keypoint votes and joint probabilities in order to identify the optimal pose configuration. We demonstrate competitive performance on the MPII Human Pose and Leeds Sports Pose datasets.

1 Introduction

In recent years, with the resurgence of deep learning techniques, the accuracy of human pose estimation from a single image has improved dramatically. Yet despite this recent progress, it is still a challenging computer vision task and state-of-the-art results are far from human performance. The general approach in previous works, such as [22,26], is to train a deep neural net as a keypoint detector for all keypoints. Given an image I, the net is fed a patch of the image Iy ⊂ I centered around pixel y and predicts if y is one of the M keypoints of the model. This process is repeated in a sliding window approach, using a fully convolutional implementation, to produce M heat maps, one for each keypoint. Structured prediction, usually by a graphical model, is then used to combine these heat maps into a single pose prediction. This approach has several drawbacks. First, most pixels belonging to the person are not themselves any of the keypoints and therefore contribute only limited information to the pose estimation process. Information from the entire person can be used to get more reliable predictions, particularly in the face of partial occlusion where the keypoint itself is not visible. Another drawback is that while the individual keypoint predictors use state-of-the-art classification methods to produce high quality results, the binary terms in the graphical model, enforcing global pose consistency, are based only on relative keypoint location statistics gathered from the training data and are independent of the input image.

I. Lifshitz and E. Fetaya: equal contribution.
Electronic supplementary material: The online version of this chapter (doi:10.1007/978-3-319-46475-6_16) contains supplementary material, which is available to authorized users.
© Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part II, LNCS 9906, pp. 246–260, 2016. DOI: 10.1007/978-3-319-46475-6_16
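To make the occlusion argument concrete, the following is a minimal sketch of dense voting for a single keypoint. It is not the authors' network: the function name `accumulate_votes`, the encoding of a vote as an integer pixel offset, and the noise-free toy data are all assumptions made for illustration. In the approach described here the votes would be produced by a convolutional neural net and would be noisy.

```python
import numpy as np


def accumulate_votes(offsets, weights, shape):
    """Sum weighted per-pixel votes into a heat map for one keypoint.

    offsets: (H, W, 2) integer array; offsets[y, x] = (dy, dx) is the
             displacement from pixel (y, x) to its guess of the keypoint.
    weights: (H, W) array of non-negative vote confidences.
    shape:   (H, W) size of the output heat map.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    ty = ys + offsets[..., 0]  # target row of each vote
    tx = xs + offsets[..., 1]  # target column of each vote
    inside = (ty >= 0) & (ty < h) & (tx >= 0) & (tx < w)
    heat = np.zeros(shape, dtype=float)
    # np.add.at accumulates correctly when many votes hit the same cell.
    np.add.at(heat, (ty[inside], tx[inside]), weights[inside])
    return heat


# Toy data (hypothetical): a 32x32 image with the true keypoint at (10, 20).
h, w = 32, 32
ky, kx = 10, 20
ys, xs = np.mgrid[0:h, 0:w]
offsets = np.stack([ky - ys, kx - xs], axis=-1)  # idealized, noise-free votes

# Simulate occlusion: pixels in a box around the keypoint cast no vote.
weights = np.ones((h, w))
weights[6:14, 16:24] = 0.0

heat = accumulate_votes(offsets, weights, (h, w))
peak = np.unravel_index(heat.argmax(), heat.shape)
print(peak, heat.max())
```

Even though every pixel near the keypoint is silenced, the remaining pixels still agree on its location, which is the benefit the text attributes to using information from the entire person rather than from the keypoint's immediate neighborhood alone.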

Fig. 1. Our model’s pose predictions on the MPII Human Pose test set [1]. Each pose is represented as a stick figure, inferred from predicted joints