Modeling Context Between Objects for Referring Expression Understanding
Abstract. Referring expressions usually describe an object using properties of the object and relationships of the object with other objects. We propose a technique that integrates context between objects to understand referring expressions. Our approach uses an LSTM to learn the probability of a referring expression, with input features from a region and a context region. The context regions are discovered using multiple-instance learning (MIL), since annotations for context objects are generally not available for training. We utilize max-margin based MIL objective functions for training the LSTM. Experiments on the Google RefExp and UNC RefExp datasets show that modeling context between objects provides better performance than modeling only object properties. We also qualitatively show that our technique can ground a referring expression to its referred region along with the supporting context region.
1 Introduction
In image retrieval and human-robot interaction, objects are usually queried by their category, attributes, pose, action and their context in the scene [1]. Natural language queries can encode rich information like relationships that distinguish object instances from each other. In a retrieval task that focuses on a particular object in an image, the query is called a referring expression [2,3]. When there is only one instance of an object type in an image, a referring expression provides additional information such as attributes to improve retrieval/localization performance. More importantly, when multiple instances of an object type are present in an image, a referring expression distinguishes the referred object from other instances, thereby helping to localize the correct instance. The task of localizing a region in an image given a referring expression is called the comprehension task [4], and its inverse process is the generation task. In this work we focus on the comprehension task.

Referring expressions usually mention relationships of an object with other regions along with the properties of the object [5,6] (see Fig. 1). Hence, it is important to model relationships between regions for understanding referring expressions. However, the supervision during training typically consists of annotations of only the referred object. While this might be sufficient for modeling attributes of an object mentioned in a referring expression, it is difficult to model relationships between objects with such limited supervision. Previous work on

c Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part IV, LNCS 9908, pp. 792–807, 2016. DOI: 10.1007/978-3-319-46493-0_48
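The abstract describes training with a max-margin multiple-instance learning (MIL) objective: since the true context object is unannotated, each candidate context region forms an instance in a bag, the bag score is the best-scoring (region, context) pairing, and a hinge loss pushes the referred region to outscore non-referred regions. The paper's exact formulation is not reproduced here; the following is a minimal sketch of that idea, with hypothetical score arrays standing in for the LSTM's log-probabilities.

```python
import numpy as np

def mil_max_margin_loss(scores_pos, scores_neg, margin=1.0):
    """Sketch of a max-margin MIL objective (illustrative, not the paper's code).

    scores_pos: shape (C,) -- log-probability of the expression given the
        ground-truth region paired with each of C candidate context regions
        (the MIL "bag"; the supporting context is not annotated).
    scores_neg: shape (R, C) -- the same scores for R negative (non-referred)
        regions, each paired with the C candidate context regions.
    """
    # MIL step: a region's score is its best pairing over the context bag.
    s_pos = scores_pos.max()
    s_neg = scores_neg.max(axis=1)  # best context for each negative region
    # Hinge loss: the referred region should beat every negative region
    # by at least `margin`.
    return np.maximum(0.0, margin + s_neg - s_pos).sum()

# Toy example: the positive region's best pairing clearly wins -> zero loss.
loss = mil_max_margin_loss(np.array([0.2, 2.0, 0.5]),
                           np.array([[0.1, 0.3], [0.0, 0.4]]))
```

Taking the max over context candidates is what lets training discover the supporting context region without explicit context annotations: gradients flow only through the pairing the model currently scores highest.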
[Fig. 1: Example referring expressions that describe an object through its relationships with other objects, e.g. "A bed with two beds to the left of it", "The plant on the right side of the TV", "Computer monitor above laptop screen", "Umbrella held by a woman wearing a blue jacket", "A man sitting on a table watching TV", "A man riding a white sports bike", "Umbrella held by a girl in red coat", "A person sitting on a couch watching TV", "A person on a black motorcycle".]