Structured Matching for Phrase Localization


Abstract. In this paper we introduce a new approach to phrase localization: grounding phrases in sentences to image regions. We propose a structured matching of phrases and regions that encourages the semantic relations between phrases to agree with the visual relations between regions. We formulate structured matching as a discrete optimization problem and relax it to a linear program. We use neural networks to embed regions and phrases into vectors, which then define the similarities (matching weights) between regions and phrases. We integrate structured matching with neural networks to enable end-to-end training. Experiments on Flickr30K Entities demonstrate the empirical effectiveness of our approach.
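As a rough illustration of the formulation summarized in the abstract, the sketch below is not the authors' code: the function name, the use of cosine similarity as the matching weight, and the SciPy LP solver are assumptions. It scores phrase-region pairs with embedding similarities and solves a relaxed bipartite matching as a linear program; the relational term that couples phrase-phrase relations with region-region relations, which is the paper's main contribution, is omitted here.

```python
# Minimal sketch (not the authors' implementation): relaxed bipartite matching
# between phrase and region embeddings, solved as a linear program with SciPy.
# The paper's full objective adds a term encouraging phrase relations to agree
# with region relations; that term is omitted in this sketch.
import numpy as np
from scipy.optimize import linprog

def match_phrases_to_regions(phrase_emb, region_emb):
    """phrase_emb: (P, d), region_emb: (R, d); returns one region index per phrase.
    Assumes at least as many candidate regions as phrases (R >= P)."""
    # Cosine similarities act as matching weights.
    p = phrase_emb / np.linalg.norm(phrase_emb, axis=1, keepdims=True)
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    S = p @ r.T                              # (P, R) similarity matrix
    P, R = S.shape

    # Variables x_ij in [0, 1]; maximize sum_ij S_ij * x_ij  ==  minimize -S.
    c = -S.ravel()

    # Each phrase is matched to exactly one region: sum_j x_ij = 1.
    A_eq = np.zeros((P, P * R))
    for i in range(P):
        A_eq[i, i * R:(i + 1) * R] = 1.0
    b_eq = np.ones(P)

    # Each region is used by at most one phrase: sum_i x_ij <= 1.
    A_ub = np.zeros((R, P * R))
    for j in range(R):
        A_ub[j, j::R] = 1.0
    b_ub = np.ones(R)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0.0, 1.0), method="highs")
    x = res.x.reshape(P, R)
    return x.argmax(axis=1)                  # round the (near-integral) LP solution
```

For the plain bipartite-matching polytope above, the constraint matrix is totally unimodular, so the LP relaxation has integral optimal vertices and the final rounding is usually a formality; the structured formulation in the paper layers additional terms on top of this base.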

Keywords: Vision · Language

1 Introduction

This paper addresses the problem of phrase localization: given an image and a textual description, locate the image regions that correspond to the noun phrases in the description. For example, an image may be described as “a man wearing a tan coat signs papers for another man wearing a blue coat”. We wish to localize, in terms of bounding boxes, the image regions for the phrases “a man”, “tan coat”, “papers”, “another man”, and “blue coat”. In other words, we wish to ground these noun phrases to image regions.

Phrase localization is an important task. Visual grounding of natural language is a critical cognitive capability necessary for communication, language learning, and the understanding of multimodal information. Specifically, understanding the correspondence between regions and phrases is important for natural language based image retrieval and visual question answering. Moreover, by aligning phrases and regions, phrase localization has the potential to improve weakly supervised learning of object recognition from massive amounts of paired images and texts.

Recent research has brought significant progress on the problem of phrase localization [1–3]. Plummer et al. introduced the Flickr30K Entities dataset, which includes images, captions, and ground-truth correspondences between regions and phrases [1]. To match regions and phrases, Plummer et al. embedded regions and phrases into a common vector space through Canonical Correlation Analysis (CCA) and picked a region for each phrase based on the similarity of the embeddings. Subsequent works by Wang et al. [2] and Rohrbach et al. [3] have since achieved significant improvements by embedding regions and phrases using deep neural networks.
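For concreteness, a minimal sketch of the independent, per-phrase matching that this line of work builds on is given below. It is not Plummer et al.'s implementation; the function name, the choice of scikit-learn's CCA, the number of components, and cosine similarity in the projected space are all illustrative assumptions.

```python
# Minimal sketch (not Plummer et al.'s code): the independent, per-phrase
# baseline described above. Phrase and region features are projected into a
# shared space with CCA, and each phrase is simply given its most similar
# region; no constraint ties the phrases in a sentence together.
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_baseline(train_phrase_feats, train_region_feats,
                 test_phrase_feats, test_region_feats, n_components=64):
    # Fit CCA on paired training features (one row per ground-truth
    # phrase-region correspondence). n_components must not exceed the
    # smaller of the two feature dimensions.
    cca = CCA(n_components=n_components)
    cca.fit(train_phrase_feats, train_region_feats)

    # Project test phrases and candidate regions into the shared space.
    p_proj, r_proj = cca.transform(test_phrase_feats, test_region_feats)
    p_proj /= np.linalg.norm(p_proj, axis=1, keepdims=True)
    r_proj /= np.linalg.norm(r_proj, axis=1, keepdims=True)

    # Independent localization: each phrase takes its nearest region.
    similarity = p_proj @ r_proj.T
    return similarity.argmax(axis=1)
```

The point to note is that the argmax is taken per phrase, so nothing prevents several phrases from collapsing onto the same region or contradicting each other's relations, which is precisely the limitation discussed next.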

Fig. 1. Structured matching is needed for phrase localization: it is not enough to just match phrases and regions individually; the relations between phrases also need to agree with the relations between regions.

But existing works share a common limitation: they largely localize each phrase independently, ignoring the semantic relations between phrases. The only constraint used i