Grounding of Textual Phrases in Images by Reconstruction


1 Max Planck Institute for Informatics, Saarbrücken, Germany {arohrbach,schiele}@mpi-inf.mpg.de 2 UC Berkeley EECS, Berkeley, CA, USA {rohrbach,ronghang,trevor}@eecs.berkeley.edu 3 ICSI, Berkeley, CA, USA

Abstract. Grounding (i.e. localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications for human-computer interaction and image-text reference resolution. Few datasets provide the ground truth spatial localization of phrases; it is therefore desirable to learn from data with no or little grounding supervision. We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly. During training, our approach encodes the phrase using a recurrent network language model and then learns to attend to the relevant image region in order to reconstruct the input phrase. At test time, the correct attention, i.e., the grounding, is evaluated. If grounding supervision is available, it can be directly applied via a loss over the attention mechanism. We demonstrate the effectiveness of our approach on the Flickr30k Entities and ReferItGame datasets with different levels of supervision, ranging from no supervision through partial supervision to full supervision. Our supervised variant improves by a large margin over the state-of-the-art on both datasets.
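To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of an attention-based phrase-reconstruction model. It is our illustrative reading, not the authors' code: the class name PhraseGrounder, the GRU encoder/decoder, and all dimensions are assumptions.

```python
# Hedged sketch of attention-based phrase reconstruction (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseGrounder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=512, visual_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.phrase_enc = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.vis_proj = nn.Linear(visual_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.phrase_dec = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, phrase, region_feats):
        # phrase: (B, T) word ids; region_feats: (B, N, visual_dim) box features
        emb = self.embed(phrase)
        _, h = self.phrase_enc(emb)                   # phrase encoding: (1, B, H)
        v = self.vis_proj(region_feats)               # project boxes:   (B, N, H)
        scores = self.att_score(torch.tanh(v + h.transpose(0, 1)))  # (B, N, 1)
        alpha = F.softmax(scores.squeeze(-1), dim=1)  # attention over boxes
        attended = (alpha.unsqueeze(-1) * v).sum(1)   # attended feature: (B, H)
        # Reconstruct the input phrase conditioned on the attended region.
        dec_out, _ = self.phrase_dec(emb, attended.unsqueeze(0).contiguous())
        logits = self.out(dec_out)                    # (B, T, vocab)
        return logits, alpha
```

Training would minimize a cross-entropy reconstruction loss over logits (each position predicting the next input word), so the attention alpha over candidate boxes is learned as a latent variable; at test time the box with the highest alpha is returned as the grounding.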

1 Introduction

Language grounding in visual data is an interesting problem studied both in the computer vision [18,24,25,28,35] and natural language processing [29,34] communities. Such grounding can be done at different levels of granularity: from coarse, e.g. associating a paragraph of text with a scene in a movie [41,52], to fine, e.g. localizing a word or phrase in a given image [18,35]. In this work we focus on the latter scenario. Many prior efforts in this area have focused on rather constrained settings with a small number of nouns to ground [28,31]. In contrast, we want to tackle the problem of grounding arbitrary natural language phrases in images. Most parallel corpora of sentence/visual data do not provide localization annotations (e.g. bounding boxes), and the annotation process is costly. We propose an approach which can learn to localize phrases relying only on phrases associated with images, without bounding box annotations, but which is also able to incorporate phrases with bounding box supervision when available (see Fig. 1).


Fig. 1. (a) Without bounding box annotations at training time our approach GroundeR can ground free-form natural language phrases in images. (b) During training our latent attention approach reconstructs phrases by learning to attend to the correct box. (c) At test time, the attention model infers the grounding for each phrase. For semi-supervised and fully supervised variants see Fig. 2.

The main idea of our approach is shown in Fig. 1(b).
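Following the supervision levels above (and the semi-/fully supervised variants the caption defers to Fig. 2), here is a hedged sketch of how the losses could be combined. The function name, the weighting term lam, and the exact formulation are our assumptions, not the paper's:

```python
# Hedged sketch of loss combination across supervision regimes;
# `model` is the illustrative PhraseGrounder defined earlier.
import torch
import torch.nn.functional as F

def grounding_loss(model, phrase, region_feats, gt_box_idx=None, lam=1.0):
    logits, alpha = model(phrase, region_feats)
    # Reconstruction loss: each position predicts the next input word.
    rec_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        phrase[:, 1:].reshape(-1),
    )
    if gt_box_idx is None:
        return rec_loss                      # unsupervised: reconstruction only
    # With box annotations, supervise the attention directly:
    # treat grounding as classification over the candidate boxes.
    att_loss = F.nll_loss(torch.log(alpha + 1e-8), gt_box_idx)
    return rec_loss + lam * att_loss         # semi-/fully supervised variants
```

In a semi-supervised setting, batches without annotations would simply pass gt_box_idx=None, so only the reconstruction objective drives the latent attention for those examples.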