LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling



1 Department of Computer Science, The University of Hong Kong, Hong Kong, China
2 School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China

Abstract. Semantic labeling of RGB-D scenes is crucial to many intelligent applications, including perceptual robotics. It generates pixelwise, fine-grained label maps from simultaneously sensed photometric (RGB) and depth channels. This paper addresses this problem by (i) developing a novel Long Short-Term Memorized Context Fusion (LSTM-CF) model that captures and fuses contextual information from multiple channels of photometric and depth data, and (ii) incorporating this model into deep convolutional neural networks (CNNs) for end-to-end training. Specifically, contexts in the photometric and depth channels are captured, respectively, by stacking several convolutional layers and a long short-term memory layer; the memory layer encodes both short-range and long-range spatial dependencies in an image along the vertical direction. Another long short-term memorized fusion layer is set up to integrate the vertical contexts from the different channels and to propagate the fused vertical contexts bi-directionally along the horizontal direction, yielding true 2D global contexts. Finally, the fused contextual representation is concatenated with the convolutional features extracted from the photometric channels to improve the accuracy of fine-scale semantic labeling. Our proposed model has set a new state of the art, i.e., 48.1% and 49.4% average class accuracy over 37 categories (2.2% and 5.4% improvement) on the large-scale SUN RGB-D dataset and the NYUDv2 dataset, respectively.

Keywords: RGB-D scene labeling · Image context modeling · Long short-term memory · Depth and photometric data fusion
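To make the data flow described in the abstract concrete, the following is a minimal PyTorch sketch of that pipeline. The class name, layer counts, and sizes (feat, hidden, the shallow convolutional stacks, and the use of nn.LSTM for the memory layers) are illustrative assumptions, not the authors' implementation, which builds on deep fully convolutional backbones.

    # Minimal sketch of the LSTM-CF idea: per-modality conv features,
    # vertical LSTM contexts, bi-directional horizontal fusion, and
    # concatenation with photometric features. Sizes are assumptions.
    import torch
    import torch.nn as nn

    class LSTMCFSketch(nn.Module):
        def __init__(self, rgb_ch=3, depth_ch=1, feat=64, hidden=32, num_classes=37):
            super().__init__()
            # (i) Stand-in convolutional feature extraction per modality.
            self.rgb_conv = nn.Sequential(
                nn.Conv2d(rgb_ch, feat, 3, padding=1), nn.ReLU(),
                nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
            self.depth_conv = nn.Sequential(
                nn.Conv2d(depth_ch, feat, 3, padding=1), nn.ReLU(),
                nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
            # (ii) Memory layers sweeping each column top-to-bottom to encode
            # short- and long-range vertical dependencies per modality.
            self.rgb_vertical = nn.LSTM(feat, hidden, batch_first=True)
            self.depth_vertical = nn.LSTM(feat, hidden, batch_first=True)
            # (iii) Memorized fusion layer: a bi-directional LSTM sweeping each
            # row over the concatenated vertical contexts -> 2D global contexts.
            self.horizontal_fusion = nn.LSTM(2 * hidden, hidden,
                                             batch_first=True, bidirectional=True)
            # (iv) Classifier over fused contexts concatenated with RGB features.
            self.classifier = nn.Conv2d(2 * hidden + feat, num_classes, 1)

        @staticmethod
        def _sweep_columns(lstm, fmap):
            # fmap: (B, C, H, W) -> run the LSTM down each of the W columns.
            b, c, h, w = fmap.shape
            cols = fmap.permute(0, 3, 2, 1).reshape(b * w, h, c)   # (B*W, H, C)
            out, _ = lstm(cols)                                    # (B*W, H, hidden)
            return out.reshape(b, w, h, -1).permute(0, 3, 2, 1)    # (B, hidden, H, W)

        def forward(self, rgb, depth):
            f_rgb, f_depth = self.rgb_conv(rgb), self.depth_conv(depth)
            v_rgb = self._sweep_columns(self.rgb_vertical, f_rgb)
            v_depth = self._sweep_columns(self.depth_vertical, f_depth)
            fused = torch.cat([v_rgb, v_depth], dim=1)             # (B, 2*hidden, H, W)
            b, c, h, w = fused.shape
            rows = fused.permute(0, 2, 3, 1).reshape(b * h, w, c)  # (B*H, W, C)
            ctx, _ = self.horizontal_fusion(rows)                  # (B*H, W, 2*hidden)
            ctx = ctx.reshape(b, h, w, -1).permute(0, 3, 1, 2)     # (B, 2*hidden, H, W)
            # Concatenate global contexts with photometric conv features.
            return self.classifier(torch.cat([ctx, f_rgb], dim=1))

With rgb of shape (B, 3, H, W) and depth of shape (B, 1, H, W), the forward pass returns a (B, 37, H, W) score map, matching the 37-category evaluation mentioned in the abstract.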

This work was supported by Projects on Faculty/Student Exchange and Collaboration Scheme between the Higher Education in Hong Kong and the Mainland, the Guangzhou Science and Technology Program under grant 1563000439, and the Fundamental Research Funds for the Central Universities.

© Springer International Publishing AG 2016
B. Leibe et al. (Eds.): ECCV 2016, Part II, LNCS 9906, pp. 541–557, 2016. DOI: 10.1007/978-3-319-46475-6_34

1 Introduction

Scene labeling, also known as semantic scene segmentation, is one of the most fundamental problems in computer vision. It refers to associating every pixel in an image with a semantic label, such as table, road, and wall, as illustrated in Fig. 1. High-quality scene labeling can benefit many intelligent tasks, including robot task planning [1], pose estimation [2], plane segmentation [3], context-based image retrieval [4], and automatic photo adjustment [5].

[Fig. 1 (diagram): photometric data and depth each undergo fully convolutional feature extraction and context modeling on feature maps, followed by adaptive fusion into a pixelwise label map with classes such as chair, wall, cabinet, counter, ceiling, and others. Caption truncated in source: "Fig. 1. An illustration o..."]