Grounding of Textual Phrases in Images by Reconstruction


1 Max Planck Institute for Informatics, Saarbrücken, Germany {arohrbach,schiele}@mpi-inf.mpg.de 2 UC Berkeley EECS, Berkeley, CA, USA {rohrbach,ronghang,trevor}@eecs.berkeley.edu 3 ICSI, Berkeley, CA, USA

Abstract. Grounding (i.e. localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications for human-computer interaction and image-text reference resolution. Few datasets provide the ground truth spatial localization of phrases; it is therefore desirable to learn from data with no or little grounding supervision. We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly. During training, our approach encodes the phrase using a recurrent network language model and then learns to attend to the relevant image region in order to reconstruct the input phrase. At test time, the correct attention, i.e., the grounding, is evaluated. If grounding supervision is available, it can be directly applied via a loss over the attention mechanism. We demonstrate the effectiveness of our approach on the Flickr30k Entities and ReferItGame datasets with different levels of supervision, ranging from no supervision through partial supervision to full supervision. Our supervised variant improves by a large margin over the state-of-the-art on both datasets.
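To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of an attention-based phrase-reconstruction model. It is our illustrative reading, not the authors' code: the class name PhraseGrounder, the GRU encoder/decoder, and all dimensions are assumptions.

```python
# Hedged sketch of attention-based phrase reconstruction (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseGrounder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=512, visual_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.phrase_enc = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.vis_proj = nn.Linear(visual_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.phrase_dec = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, phrase, region_feats):
        # phrase: (B, T) word ids; region_feats: (B, N, visual_dim) box features
        emb = self.embed(phrase)
        _, h = self.phrase_enc(emb)                   # phrase encoding: (1, B, H)
        v = self.vis_proj(region_feats)               # project boxes:   (B, N, H)
        scores = self.att_score(torch.tanh(v + h.transpose(0, 1)))  # (B, N, 1)
        alpha = F.softmax(scores.squeeze(-1), dim=1)  # attention over boxes
        attended = (alpha.unsqueeze(-1) * v).sum(1)   # attended feature: (B, H)
        # Reconstruct the input phrase conditioned on the attended region.
        dec_out, _ = self.phrase_dec(emb, attended.unsqueeze(0).contiguous())
        logits = self.out(dec_out)                    # (B, T, vocab)
        return logits, alpha
```

Training would minimize a cross-entropy reconstruction loss over logits (each position predicting the next input word), so the attention alpha over candidate boxes is learned as a latent variable; at test time the box with the highest alpha is returned as the grounding.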

1 Introduction

Language grounding in visual data is an interesting problem studied both in the computer vision [18,24,25,28,35] and natural language processing [29,34] communities. Such grounding can be done at different levels of granularity: from coarse, e.g. associating a paragraph of text with a scene in a movie [41,52], to fine, e.g. localizing a word or phrase in a given image [18,35]. In this work we focus on the latter scenario. Many prior efforts in this area have focused on rather constrained settings with a small number of nouns to ground [28,31]. In contrast, we want to tackle the problem of grounding arbitrary natural language phrases in images. Most parallel corpora of sentence/visual data do not provide localization annotations (e.g. bounding boxes), and the annotation process is costly. We propose an approach which can learn to localize phrases relying only on phrases associated with images, without bounding box annotations, but which is also able to incorporate phrases with bounding box supervision when available (see Fig. 1).


Fig. 1. (a) Without bounding box annotations at training time our approach GroundeR can ground free-form natural language phrases in images. (b) During training our latent attention approach reconstructs phrases by learning to attend to the correct box. (c) At test time, the attention model infers the grounding for each phrase. For semi-supervised and fully supervised variants see Fig. 2.

The main idea of our approach is shown in Fig. 1(b).
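Following the supervision levels above (and the semi-/fully supervised variants the caption defers to Fig. 2), here is a hedged sketch of how the losses could be combined. The function name, the weighting term lam, and the exact formulation are our assumptions, not the paper's:

```python
# Hedged sketch of loss combination across supervision regimes;
# `model` is the illustrative PhraseGrounder defined earlier.
import torch
import torch.nn.functional as F

def grounding_loss(model, phrase, region_feats, gt_box_idx=None, lam=1.0):
    logits, alpha = model(phrase, region_feats)
    # Reconstruction loss: each position predicts the next input word.
    rec_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        phrase[:, 1:].reshape(-1),
    )
    if gt_box_idx is None:
        return rec_loss                      # unsupervised: reconstruction only
    # With box annotations, supervise the attention directly:
    # treat grounding as classification over the candidate boxes.
    att_loss = F.nll_loss(torch.log(alpha + 1e-8), gt_box_idx)
    return rec_loss + lam * att_loss         # semi-/fully supervised variants
```

In a semi-supervised setting, batches without annotations would simply pass gt_box_idx=None, so only the reconstruction objective drives the latent attention for those examples.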