Leveraging Visual Question Answering for Image-Caption Ranking
Abstract. Visual Question Answering (VQA) is the task of taking as input an image and a free-form natural language question about the image, and producing an accurate answer. In this work we view VQA as a "feature extraction" module to extract image and caption representations. We employ these representations for the task of image-caption ranking. Each feature dimension captures (imagines) whether a fact (question-answer pair) could plausibly be true for the image and caption. This allows the model to interpret images and captions from a wide variety of perspectives. We propose score-level and representation-level fusion models to incorporate VQA knowledge into an existing state-of-the-art VQA-agnostic image-caption ranking model. We find that incorporating and reasoning about consistency between images and captions significantly improves performance. Concretely, our model improves the state of the art on caption retrieval by 7.1% and on image retrieval by 4.4% on the MSCOCO dataset.
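The two fusion strategies named in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the vectors `vqa_img` and `vqa_cap` stand in for the per-fact (question-answer pair) plausibility scores the paper describes, and the mixing weight `alpha` is a hypothetical hyperparameter.

```python
import numpy as np

def score_level_fusion(img_cap_score, vqa_img_feat, vqa_cap_feat, alpha=0.7):
    """Combine a VQA-agnostic ranking score with a VQA-based consistency
    score, here taken as the dot product of per-fact plausibilities."""
    vqa_score = float(np.dot(vqa_img_feat, vqa_cap_feat))
    return alpha * img_cap_score + (1.0 - alpha) * vqa_score

def representation_level_fusion(img_feat, vqa_img_feat, cap_feat, vqa_cap_feat):
    """Concatenate each modality's original embedding with its VQA features,
    then score the pair by cosine similarity of the fused vectors."""
    img = np.concatenate([img_feat, vqa_img_feat])
    cap = np.concatenate([cap_feat, vqa_cap_feat])
    return float(np.dot(img, cap) / (np.linalg.norm(img) * np.linalg.norm(cap)))

# Toy example: three hypothetical facts, e.g. ("Who is in charge?", "Chef").
# Each dimension is how plausible that fact is given the image or the caption.
vqa_img = np.array([0.9, 0.1, 0.8])   # plausibility given the image
vqa_cap = np.array([0.8, 0.2, 0.7])   # plausibility given the caption
fused = score_level_fusion(img_cap_score=0.5,
                           vqa_img_feat=vqa_img, vqa_cap_feat=vqa_cap)
```

In this toy setup the image and caption agree on which facts are plausible, so the VQA consistency term raises the fused ranking score relative to the VQA-agnostic score alone.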
Keywords: Visual question answering · Image-caption ranking · Mid-level concepts

1 Introduction
Visual Question Answering (VQA) is an "AI-complete" problem that requires knowledge from multiple disciplines such as computer vision, natural language processing, and knowledge-base reasoning. A VQA system takes as input an image and a free-form, open-ended question about the image and outputs a natural language answer to the question. A VQA system needs to not only recognize objects and scenes but also reason beyond low-level recognition about aspects such as intention, future events, physics, materials, and commonsense knowledge. For example, (Q: Who is the person in charge in this picture? A: Chef) reveals the most important person and occupation in the image. Moreover, answers to multiple questions about the same image can be correlated and may reveal more complex interactions. For example, (Q: What is this person riding? A: Motorcycle) and (Q: What is the man wearing on his head? A: Helmet) might reveal correlations observable in the visual world due to safety regulations.

Today's VQA models, while far from perfect, may already be picking up on these semantic correlations of the world. If so, they may serve as an implicit knowledge resource to help other tasks. Just as we do not need to fully understand the theory behind an equation to use it, can we already use the VQA knowledge captured by existing VQA models to improve other tasks?

In this work we study the problem of using VQA knowledge to improve image-caption ranking. Consider the image and its caption in Fig. 1. Aligning them not only requires recognizing the batter and that it is a baseball game (mentioned in the caption), but also realiz

Electronic supplementary material: The online version of this chapter (doi:10.1007/978-3-319-46475-6_17) contains supplementary material, which is available to authorized users.

© Springer International Publishing AG 2016
B. Leibe et al. (Eds.): ECCV 2016, Part II, LNCS 9906, pp. 261–277, 2016. DOI: 10.1007/978-3-319-46475-6_17
X. Lin and D. Parikh