Structured Matching for Phrase Localization


Abstract. In this paper we introduce a new approach to phrase localization: grounding phrases in sentences to image regions. We propose a structured matching of phrases and regions that encourages the semantic relations between phrases to agree with the visual relations between regions. We formulate structured matching as a discrete optimization problem and relax it to a linear program. We use neural networks to embed regions and phrases into vectors, which then define the similarities (matching weights) between regions and phrases. We integrate structured matching with neural networks to enable end-to-end training. Experiments on Flickr30K Entities demonstrate the empirical effectiveness of our approach.
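As a rough illustration of the formulation summarized in the abstract, the sketch below is not the authors' code: the function name, the use of cosine similarity as the matching weight, and the SciPy LP solver are assumptions. It scores phrase-region pairs with embedding similarities and solves a relaxed bipartite matching as a linear program; the relational term that couples phrase-phrase relations with region-region relations, which is the paper's main contribution, is omitted here.

```python
# Minimal sketch (not the authors' implementation): relaxed bipartite matching
# between phrase and region embeddings, solved as a linear program with SciPy.
# The paper's full objective adds a term encouraging phrase relations to agree
# with region relations; that term is omitted in this sketch.
import numpy as np
from scipy.optimize import linprog

def match_phrases_to_regions(phrase_emb, region_emb):
    """phrase_emb: (P, d), region_emb: (R, d); returns one region index per phrase.
    Assumes at least as many candidate regions as phrases (R >= P)."""
    # Cosine similarities act as matching weights.
    p = phrase_emb / np.linalg.norm(phrase_emb, axis=1, keepdims=True)
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    S = p @ r.T                              # (P, R) similarity matrix
    P, R = S.shape

    # Variables x_ij in [0, 1]; maximize sum_ij S_ij * x_ij  ==  minimize -S.
    c = -S.ravel()

    # Each phrase is matched to exactly one region: sum_j x_ij = 1.
    A_eq = np.zeros((P, P * R))
    for i in range(P):
        A_eq[i, i * R:(i + 1) * R] = 1.0
    b_eq = np.ones(P)

    # Each region is used by at most one phrase: sum_i x_ij <= 1.
    A_ub = np.zeros((R, P * R))
    for j in range(R):
        A_ub[j, j::R] = 1.0
    b_ub = np.ones(R)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0.0, 1.0), method="highs")
    x = res.x.reshape(P, R)
    return x.argmax(axis=1)                  # round the (near-integral) LP solution
```

For the plain bipartite-matching polytope above, the constraint matrix is totally unimodular, so the LP relaxation has integral optimal vertices and the final rounding is usually a formality; the structured formulation in the paper layers additional terms on top of this base.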

Keywords: Vision · Language

1 Introduction

This paper addresses the problem of phrase localization: given an image and a textual description, locate the image regions that correspond to the noun phrases in the description. For example, an image may be described as “a man wearing a tan coat signs papers for another man wearing a blue coat”. We wish to localize, in terms of bounding boxes, the image regions for the phrases “a man”, “tan coat”, “papers”, “another man”, and “blue coat”. In other words, we wish to ground these noun phrases to image regions.

Phrase localization is an important task. Visual grounding of natural language is a critical cognitive capability necessary for communication, language learning, and the understanding of multimodal information. Specifically, understanding the correspondence between regions and phrases is important for natural language based image retrieval and visual question answering. Moreover, by aligning phrases and regions, phrase localization has the potential to improve weakly supervised learning of object recognition from massive amounts of paired images and texts.

Recent research has brought significant progress on the problem of phrase localization [1–3]. Plummer et al. introduced the Flickr30K Entities dataset, which includes images, captions, and ground-truth correspondences between regions and phrases [1]. To match regions and phrases, Plummer et al. embedded regions and phrases into a common vector space through Canonical Correlation Analysis (CCA) and picked a region for each phrase based on the similarity of the embeddings. Subsequent works by Wang et al. [2] and Rohrbach et al. [3] have since achieved significant improvements by embedding regions and phrases using deep neural networks.
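For concreteness, a minimal sketch of the independent, per-phrase matching that this line of work builds on is given below. It is not Plummer et al.'s implementation; the function name, the choice of scikit-learn's CCA, the number of components, and cosine similarity in the projected space are all illustrative assumptions.

```python
# Minimal sketch (not Plummer et al.'s code): the independent, per-phrase
# baseline described above. Phrase and region features are projected into a
# shared space with CCA, and each phrase is simply given its most similar
# region; no constraint ties the phrases in a sentence together.
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_baseline(train_phrase_feats, train_region_feats,
                 test_phrase_feats, test_region_feats, n_components=64):
    # Fit CCA on paired training features (one row per ground-truth
    # phrase-region correspondence). n_components must not exceed the
    # smaller of the two feature dimensions.
    cca = CCA(n_components=n_components)
    cca.fit(train_phrase_feats, train_region_feats)

    # Project test phrases and candidate regions into the shared space.
    p_proj, r_proj = cca.transform(test_phrase_feats, test_region_feats)
    p_proj /= np.linalg.norm(p_proj, axis=1, keepdims=True)
    r_proj /= np.linalg.norm(r_proj, axis=1, keepdims=True)

    # Independent localization: each phrase takes its nearest region.
    similarity = p_proj @ r_proj.T
    return similarity.argmax(axis=1)
```

The point to note is that the argmax is taken per phrase, so nothing prevents several phrases from collapsing onto the same region or contradicting each other's relations, which is precisely the limitation discussed next.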

Fig. 1. Structured matching is needed for phrase localization: it is not enough to just match phrases and regions individually; the relations between phrases also need to agree with the relations between regions.

But existing works share a common limitation: they largely localize each phrase independently, ignoring the semantic relations between phrases. The only constraint used i