Human Pose Estimation via Convolutional Part Heatmap Regression

Abstract. This paper is on human pose estimation using Convolutional Neural Networks. Our main contribution is a CNN cascaded architecture specifically designed for learning part relationships and spatial context, and robustly inferring pose even for the case of severe part occlusions. To this end, we propose a detection-followed-by-regression CNN cascade. The first part of our cascade outputs part detection heatmaps and the second part performs regression on these heatmaps. The benefits of the proposed architecture are multi-fold: It guides the network where to focus in the image and effectively encodes part constraints and context. More importantly, it can effectively cope with occlusions because part detection heatmaps for occluded parts provide low confidence scores which subsequently guide the regression part of our network to rely on contextual information in order to predict the location of these parts. Additionally, we show that the proposed cascade is flexible enough to readily allow the integration of various CNN architectures for both detection and regression, including recent ones based on residual learning. Finally, we illustrate that our cascade achieves top performance on the MPII and LSP data sets. Code can be downloaded from http://www.cs.nott.ac.uk/~psxab5/.

Keywords: Human pose estimation · Part heatmap regression · Convolutional Neural Networks

1 Introduction

Articulated human pose estimation from images is a Computer Vision problem of extraordinary difficulty. Algorithms have to deal with the very large number of feasible human poses, large changes in human appearance (e.g. foreshortening, clothing), part occlusions (including self-occlusions) and the presence of multiple people within close proximity to each other. A key question for addressing these problems is how to extract strong low- and mid-level appearance features capturing discriminative as well as relevant contextual information, and how to model complex part relationships allowing for effective yet efficient pose inference. Being capable of performing these tasks in an end-to-end fashion, Convolutional Neural Networks (CNNs) have recently been shown to feature remarkably robust performance and high part localization accuracy. Yet, the accurate estimation of

© Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part VII, LNCS 9911, pp. 717–732, 2016. DOI: 10.1007/978-3-319-46478-7_44

A. Bulat and G. Tzimiropoulos

[Figure 1: architecture diagram. Labels: part detection network, part heatmaps, stacked part heatmaps, regression network, regression heatmaps; spatial size 256 x 256.]

Fig. 1. Proposed architecture: Our CNN cascade consists of two connected deep subnetworks. The first one (upper part in the figure) is a part detection network trained to detect the individual body parts using a per-pixel sigmoid loss. Its output is a set of N part heatmaps. The second one is a regression subnetwork that jointly regresses the part heatmaps stacked alongside the input image to confidence maps representing the location of the body parts.
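The data flow described in the caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the two subnetworks are omitted, random logits stand in for the output of the part detection network, and the names `per_pixel_sigmoid_loss` and `stack_for_regression`, the value N = 16, the toy ground-truth mask, and the choice to stack post-sigmoid heatmaps are all assumptions made for illustration. Only the per-pixel sigmoid loss and the stacking of the N part heatmaps alongside the input image are taken from the text.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def per_pixel_sigmoid_loss(logits, targets):
    """Mean binary cross-entropy over every pixel of every part heatmap.

    logits, targets: arrays of shape (N, H, W); targets hold 0 or 1.
    Uses the numerically stable form max(x, 0) - x*t + log(1 + exp(-|x|)).
    """
    loss = (np.maximum(logits, 0.0) - logits * targets
            + np.log1p(np.exp(-np.abs(logits))))
    return float(loss.mean())


def stack_for_regression(image, part_logits):
    """Concatenate the image (3, H, W) with N part heatmaps (N, H, W)
    along the channel axis, giving the (3 + N, H, W) regression input.
    Passing post-sigmoid heatmaps is an assumption of this sketch."""
    heatmaps = sigmoid(part_logits)
    return np.concatenate([image, heatmaps], axis=0)


# Illustrative sizes only: N_PARTS = 16 is an assumption; 256 x 256 is
# the size shown in the figure labels.
N_PARTS, H, W = 16, 256, 256
rng = np.random.default_rng(0)

image = rng.random((3, H, W)).astype(np.float32)
# Stand-in for the detection subnetwork's raw output (hypothetical).
part_logits = rng.standard_normal((N_PARTS, H, W)).astype(np.float32)
# Toy ground truth: a small square marked as "part present" in each map.
targets = np.zeros((N_PARTS, H, W), dtype=np.float32)
targets[:, 120:136, 120:136] = 1.0

detection_loss = per_pixel_sigmoid_loss(part_logits, targets)
regression_input = stack_for_regression(image, part_logits)
print(regression_input.shape)  # (19, 256, 256)
```

In this sketch the regression subnetwork would receive a 19-channel tensor, so a low-confidence heatmap for an occluded part remains visible to it next to the raw image, which is the mechanism the abstract credits for handling occlusions.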

the locations of occluded body parts i