LSTM-CF: Unifying Context Modeling and Fusion with LSTMs for RGB-D Scene Labeling



1 Department of Computer Science, The University of Hong Kong, Hong Kong, China
2 School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China

Abstract. Semantic labeling of RGB-D scenes is crucial to many intelligent applications, including perceptual robotics. It generates pixelwise, fine-grained label maps from simultaneously sensed photometric (RGB) and depth channels. This paper addresses this problem by (i) developing a novel Long Short-Term Memorized Context Fusion (LSTM-CF) model that captures and fuses contextual information from multiple channels of photometric and depth data, and (ii) incorporating this model into deep convolutional neural networks (CNNs) for end-to-end training. Specifically, contexts in the photometric and depth channels are captured, respectively, by stacking several convolutional layers and a long short-term memory layer; the memory layer encodes both short-range and long-range spatial dependencies in an image along the vertical direction. Another long short-term memorized fusion layer is set up to integrate the vertical contexts from the different channels and to propagate the fused vertical contexts bi-directionally along the horizontal direction, yielding true 2D global contexts. Finally, the fused contextual representation is concatenated with the convolutional features extracted from the photometric channels to improve the accuracy of fine-scale semantic labeling. Our proposed model has set a new state of the art, i.e., 48.1% and 49.4% average class accuracy over 37 categories (2.2% and 5.4% improvement) on the large-scale SUN RGB-D dataset and the NYUDv2 dataset, respectively.

Keywords: RGB-D scene labeling · Image context modeling · Long short-term memory · Depth and photometric data fusion
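To make the data flow described in the abstract concrete, the following is a minimal PyTorch sketch of that pipeline. The class name, layer counts, and sizes (feat, hidden, the shallow convolutional stacks, and the use of nn.LSTM for the memory layers) are illustrative assumptions, not the authors' implementation, which builds on deep fully convolutional backbones.

    # Minimal sketch of the LSTM-CF idea: per-modality conv features,
    # vertical LSTM contexts, bi-directional horizontal fusion, and
    # concatenation with photometric features. Sizes are assumptions.
    import torch
    import torch.nn as nn

    class LSTMCFSketch(nn.Module):
        def __init__(self, rgb_ch=3, depth_ch=1, feat=64, hidden=32, num_classes=37):
            super().__init__()
            # (i) Stand-in convolutional feature extraction per modality.
            self.rgb_conv = nn.Sequential(
                nn.Conv2d(rgb_ch, feat, 3, padding=1), nn.ReLU(),
                nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
            self.depth_conv = nn.Sequential(
                nn.Conv2d(depth_ch, feat, 3, padding=1), nn.ReLU(),
                nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
            # (ii) Memory layers sweeping each column top-to-bottom to encode
            # short- and long-range vertical dependencies per modality.
            self.rgb_vertical = nn.LSTM(feat, hidden, batch_first=True)
            self.depth_vertical = nn.LSTM(feat, hidden, batch_first=True)
            # (iii) Memorized fusion layer: a bi-directional LSTM sweeping each
            # row over the concatenated vertical contexts -> 2D global contexts.
            self.horizontal_fusion = nn.LSTM(2 * hidden, hidden,
                                             batch_first=True, bidirectional=True)
            # (iv) Classifier over fused contexts concatenated with RGB features.
            self.classifier = nn.Conv2d(2 * hidden + feat, num_classes, 1)

        @staticmethod
        def _sweep_columns(lstm, fmap):
            # fmap: (B, C, H, W) -> run the LSTM down each of the W columns.
            b, c, h, w = fmap.shape
            cols = fmap.permute(0, 3, 2, 1).reshape(b * w, h, c)   # (B*W, H, C)
            out, _ = lstm(cols)                                    # (B*W, H, hidden)
            return out.reshape(b, w, h, -1).permute(0, 3, 2, 1)    # (B, hidden, H, W)

        def forward(self, rgb, depth):
            f_rgb, f_depth = self.rgb_conv(rgb), self.depth_conv(depth)
            v_rgb = self._sweep_columns(self.rgb_vertical, f_rgb)
            v_depth = self._sweep_columns(self.depth_vertical, f_depth)
            fused = torch.cat([v_rgb, v_depth], dim=1)             # (B, 2*hidden, H, W)
            b, c, h, w = fused.shape
            rows = fused.permute(0, 2, 3, 1).reshape(b * h, w, c)  # (B*H, W, C)
            ctx, _ = self.horizontal_fusion(rows)                  # (B*H, W, 2*hidden)
            ctx = ctx.reshape(b, h, w, -1).permute(0, 3, 1, 2)     # (B, 2*hidden, H, W)
            # Concatenate global contexts with photometric conv features.
            return self.classifier(torch.cat([ctx, f_rgb], dim=1))

With rgb of shape (B, 3, H, W) and depth of shape (B, 1, H, W), the forward pass returns a (B, 37, H, W) score map, matching the 37-category evaluation mentioned in the abstract.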

This work was supported by Projects on Faculty/Student Exchange and Collaboration Scheme between the Higher Education in Hong Kong and the Mainland, the Guangzhou Science and Technology Program under grant 1563000439, and the Fundamental Research Funds for the Central Universities.

© Springer International Publishing AG 2016
B. Leibe et al. (Eds.): ECCV 2016, Part II, LNCS 9906, pp. 541–557, 2016. DOI: 10.1007/978-3-319-46475-6_34

1 Introduction

Scene labeling, also known as semantic scene segmentation, is one of the most fundamental problems in computer vision. It refers to associating every pixel in an image with a semantic label, such as table, road, and wall, as illustrated in Fig. 1. High-quality scene labeling can benefit many intelligent tasks, including robot task planning [1], pose estimation [2], plane segmentation [3], context-based image retrieval [4], and automatic photo adjustment [5].

[Fig. 1 (diagram): photometric data and depth each undergo fully convolutional feature extraction and context modeling on feature maps, followed by adaptive fusion into a pixelwise label map with classes such as chair, wall, cabinet, counter, ceiling, and others. Caption truncated in source: "Fig. 1. An illustration o..."]