A Multi-scale CNN for Affordance Segmentation in RGB Images

Abstract. Given a single RGB image, our goal is to label every pixel with an affordance type. By affordance, we mean an object's capability to readily support a certain human action, without requiring precursor actions. We focus on segmenting the following five affordance types in indoor scenes: 'walkable', 'sittable', 'lyable', 'reachable', and 'movable'. Our approach uses a deep architecture consisting of a number of multi-scale convolutional neural networks that extract mid-level visual cues and combine them toward affordance segmentation. The mid-level cues include a depth map, surface normals, and a segmentation into four surface types: floor, structure, furniture, and props. For evaluation, we augmented the NYUv2 dataset with new ground-truth annotations of the five affordance types. We are not aware of prior work that starts from pixels, infers mid-level cues, and combines them in a feed-forward fashion to predict dense affordance maps from a single RGB image.
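
To make the pipeline concrete, below is a minimal sketch (in PyTorch, not the authors' implementation) of a feed-forward architecture of the kind the abstract describes: per-cue multi-scale convolutional branches predict depth, surface normals, and the four surface types from the RGB image, and a fusion head combines the predicted cues with the image to output per-pixel logits over the five affordance classes. All module names, layer widths, and the simple two-scale scheme are illustrative assumptions, not the paper's actual design.

```python
# Hedged sketch of a cue-then-fuse affordance segmentation network.
# Branch depths, channel counts, and the two-scale merge are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CueBranch(nn.Module):
    """Fully convolutional branch predicting one mid-level cue map.

    Runs the same feature extractor at two image scales and merges them,
    a crude stand-in for the paper's multi-scale design.
    """
    def __init__(self, out_channels):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(64, out_channels, 1)

    def forward(self, rgb):
        fine = self.features(rgb)
        coarse = self.features(F.avg_pool2d(rgb, 2))   # half-resolution pass
        coarse = F.interpolate(coarse, size=fine.shape[-2:],
                               mode='bilinear', align_corners=False)
        return self.head(torch.cat([fine, coarse], dim=1))

class AffordanceNet(nn.Module):
    """Predicts mid-level cues, then fuses them into affordance logits."""
    def __init__(self, num_affordances=5):
        super().__init__()
        self.depth_branch = CueBranch(1)     # depth map
        self.normal_branch = CueBranch(3)    # surface normals
        self.surface_branch = CueBranch(4)   # floor/structure/furniture/props
        # Fusion head: RGB (3) + depth (1) + normals (3) + surfaces (4) = 11.
        self.fusion = nn.Sequential(
            nn.Conv2d(11, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_affordances, 1),
        )

    def forward(self, rgb):
        cues = [self.depth_branch(rgb),
                self.normal_branch(rgb),
                self.surface_branch(rgb)]
        fused = torch.cat([rgb] + cues, dim=1)
        return self.fusion(fused)  # per-pixel affordance logits

# Usage: one forward pass on a dummy image.
net = AffordanceNet()
logits = net(torch.randn(1, 3, 240, 320))
labels = logits.argmax(dim=1)  # dense 5-way affordance map
```

In the actual system the per-cue networks are substantially deeper and trained with cue-specific supervision; the sketch only mirrors the feed-forward, infer-cues-then-fuse structure stated in the abstract.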

Keywords: Object affordance · Mid-level cues · Deep learning

1 Introduction

This paper addresses the problem of affordance segmentation in an image, where the goal is to label every pixel with an affordance type. By affordance, we mean an object's capability to support a certain human action [1,2]. For example, when a surface in the scene affords a person the opportunity to walk, sit, or lie down on it, we say that the surface is characterized by the affordance types 'walkable', 'sittable', or 'lyable'. Also, an object may be 'reachable' when someone standing on the floor can readily grasp it. A surface or an object may be characterized by a number of affordance types. Importantly, the affordance of an object expresses only the possibility of some action, subject to the object's relationships with the environment, and thus is not an inherent (permanent) attribute of the object. For instance, a chair is not 'sittable' and a floor is not 'walkable' if other objects in the environment prevent the corresponding actions.


Affordance segmentation is an important, long-standing problem with a range of applications, including robot navigation, path planning, and autonomous driving [3–14]. Reasoning about affordances has been shown to facilitate object and action recognition [4,10,13]. Existing work typically leverages mid-level visual cues [3] to reason about spatial (and temporal) relationships among objects in the scene, which are then used for detection (and in some cases segmentation) of affordances in the image (or video). For example, Hoiem et al. [15,16] show that inferring mid-level cues, including a depth map, semantic cues, and occlusion maps, facilitates such reasoning.