Learning Common and Specific Features for RGB-D Semantic Segmentation with Deconvolutional Networks


1 Nanyang Technological University, Singapore, Singapore
2 University of Technology Sydney (UTS), Ultimo, Australia
3 NVIDIA Corporation, Santa Clara, USA

Abstract. In this paper, we tackle the problem of RGB-D semantic segmentation of indoor images. We take advantage of deconvolutional networks, which can predict pixel-wise class labels, and develop a new structure for the deconvolution of multiple modalities. We propose a novel feature transformation network to bridge the convolutional networks and the deconvolutional networks. In the feature transformation network, we correlate the two modalities by discovering common features between them, and we characterize each modality by discovering modality-specific features. With the common features, we not only closely correlate the two modalities, but also allow them to borrow features from each other to enhance the representation of shared information. With the specific features, we capture the visual patterns that are visible in only one modality. The proposed network achieves competitive segmentation accuracy on the NYU Depth Dataset V1 and V2.

Keywords: Semantic segmentation · Deep learning · Common feature · Specific feature
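As a rough illustration of the idea summarized in the abstract, the following minimal sketch shows, for a single pixel, how two modalities might each keep their own specific features while borrowing the other's common features, followed by a pixel-wise class decision. It is not the feature transformation network itself, whose common and specific parts are learned. The fixed channel split, the fusion by concatenation, and all names (`split_features`, `fuse_modalities`, `label_pixel`, `n_common`) are assumptions made purely for illustration.

```python
from typing import List, Tuple

Vector = List[float]


def split_features(feat: Vector, n_common: int) -> Tuple[Vector, Vector]:
    """Hypothetical split: the first n_common channels are treated as
    'common', the remaining channels as 'modality-specific'."""
    return feat[:n_common], feat[n_common:]


def fuse_modalities(rgb: Vector, depth: Vector, n_common: int) -> Tuple[Vector, Vector]:
    """Return enhanced (rgb, depth) features for one pixel.

    Each modality keeps its own specific part and its own common part, and
    additionally borrows the other modality's common part. Concatenation
    as the fusion rule is an assumption of this sketch.
    """
    rgb_common, rgb_specific = split_features(rgb, n_common)
    depth_common, depth_specific = split_features(depth, n_common)
    rgb_out = rgb_specific + rgb_common + depth_common
    depth_out = depth_specific + depth_common + rgb_common
    return rgb_out, depth_out


def label_pixel(class_scores: Vector) -> int:
    """Pixel-wise prediction: pick the index of the highest class score."""
    return max(range(len(class_scores)), key=class_scores.__getitem__)


rgb_feat = [0.9, 0.1, 0.5]    # toy 3-channel RGB feature for one pixel
depth_feat = [0.8, 0.2, 0.3]  # toy 3-channel depth feature for the same pixel
rgb_out, depth_out = fuse_modalities(rgb_feat, depth_feat, n_common=2)
print(rgb_out)                         # [0.5, 0.9, 0.1, 0.8, 0.2]
print(depth_out)                       # [0.3, 0.8, 0.2, 0.9, 0.1]
print(label_pixel([0.1, 0.7, 0.2]))    # 1
```

In this toy setting each output vector has five channels: one specific channel, two of the modality's own common channels, and two borrowed from the other modality.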

1 Introduction

Semantic segmentation of scenes is a fundamental task in image understanding. It assigns a class label to each pixel of an image. Previously, most research focused on outdoor scenarios [1–6]. Recently, semantic segmentation of indoor images has attracted increasing attention [3,7–15]. It is challenging for many reasons, including the randomness of object distribution, poor illumination, and occlusion. Figure 1 shows an example of indoor scene segmentation. Thanks to the Kinect and other low-cost RGB-D cameras, we can obtain not only the color images (Fig. 1(a)), but also the depth maps of indoor scenes (Fig. 1(b)). The additional depth information is independent of illumination, which can significantly alleviate the challenges in semantic segmentation.

© Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part V, LNCS 9909, pp. 664–679, 2016. DOI: 10.1007/978-3-319-46454-1_40

Fig. 1. Example images from the NYU Depth Dataset V2 [7]. (a) shows an RGB image captured in a home office. (b) and (c) are the corresponding depth map and ground truth. (d-f) are the visualized RGB specific feature, depth specific feature, and common feature (the method to obtain these features is discussed in Sect. 5.2). RGB specific features encode the texture-rich visual patterns, such as the objects on the desk (the red circle in (d)). The depth specific features encode the visual patterns that are more obvious in the depth map, such as the chair (the green circle in (e)). Common features encode the visual patterns that are visible in both modalities, such as the edges (the yellow circles in (f)) (Color figure online)

With the availability of RGB-D indoor sc