Modeling Context Between Objects for Referring Expression Understanding
Abstract. Referring expressions usually describe an object using properties of the object and relationships of the object with other objects. We propose a technique that integrates context between objects to understand referring expressions. Our approach uses an LSTM to learn the probability of a referring expression, with input features from a region and a context region. The context regions are discovered using multiple-instance learning (MIL), since annotations for context objects are generally not available for training. We utilize max-margin based MIL objective functions for training the LSTM. Experiments on the Google RefExp and UNC RefExp datasets show that modeling context between objects provides better performance than modeling only object properties. We also qualitatively show that our technique can ground a referring expression to its referred region along with the supporting context region.
1 Introduction
In image retrieval and human-robot interaction, objects are usually queried by their category, attributes, pose, action and their context in the scene [1]. Natural language queries can encode rich information like relationships that distinguish object instances from each other. In a retrieval task that focuses on a particular object in an image, the query is called a referring expression [2,3]. When there is only one instance of an object type in an image, a referring expression provides additional information such as attributes to improve retrieval/localization performance. More importantly, when multiple instances of an object type are present in an image, a referring expression distinguishes the referred object from other instances, thereby helping to localize the correct instance. The task of localizing a region in an image given a referring expression is called the comprehension task [4], and its inverse process is the generation task. In this work we focus on the comprehension task.

Referring expressions usually mention relationships of an object with other regions along with the properties of the object [5,6] (see Fig. 1). Hence, it is important to model relationships between regions for understanding referring expressions. However, the supervision during training typically consists of annotations of only the referred object. While this might be sufficient for modeling attributes of an object mentioned in a referring expression, it is difficult to model relationships between objects with such limited supervision. Previous work on

c Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part IV, LNCS 9908, pp. 792–807, 2016. DOI: 10.1007/978-3-319-46493-0_48
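The abstract describes training with a max-margin multiple-instance learning (MIL) objective: since the true context object is unannotated, each candidate context region forms an instance in a bag, the bag score is the best-scoring (region, context) pairing, and a hinge loss pushes the referred region to outscore non-referred regions. The paper's exact formulation is not reproduced here; the following is a minimal sketch of that idea, with hypothetical score arrays standing in for the LSTM's log-probabilities.

```python
import numpy as np

def mil_max_margin_loss(scores_pos, scores_neg, margin=1.0):
    """Sketch of a max-margin MIL objective (illustrative, not the paper's code).

    scores_pos: shape (C,) -- log-probability of the expression given the
        ground-truth region paired with each of C candidate context regions
        (the MIL "bag"; the supporting context is not annotated).
    scores_neg: shape (R, C) -- the same scores for R negative (non-referred)
        regions, each paired with the C candidate context regions.
    """
    # MIL step: a region's score is its best pairing over the context bag.
    s_pos = scores_pos.max()
    s_neg = scores_neg.max(axis=1)  # best context for each negative region
    # Hinge loss: the referred region should beat every negative region
    # by at least `margin`.
    return np.maximum(0.0, margin + s_neg - s_pos).sum()

# Toy example: the positive region's best pairing clearly wins -> zero loss.
loss = mil_max_margin_loss(np.array([0.2, 2.0, 0.5]),
                           np.array([[0.1, 0.3], [0.0, 0.4]]))
```

Taking the max over context candidates is what lets training discover the supporting context region without explicit context annotations: gradients flow only through the pairing the model currently scores highest.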
[Fig. 1: Example referring expressions that describe an object through its relationships with other objects, e.g. "A bed with two beds to the left of it", "The plant on the right side of the TV", "Computer monitor above laptop screen", "Umbrella held by a woman wearing a blue jacket", "A man sitting on a table watching TV", "A man riding a white sports bike", "Umbrella held by a girl in red coat", "A person sitting on a couch watching TV", "A person on a black motorcycle".]