Revisiting Visual Question Answering Baselines
Abstract. Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding. Many of the recently proposed VQA systems include attention or memory mechanisms designed to perform “reasoning”. Furthermore, for the task of multiple-choice VQA, nearly all of these systems train a multi-class classifier on image and question features to predict an answer. This paper questions the value of these common practices and develops a simple alternative model based on binary classification. Instead of treating answers as competing choices, our model receives the answer as input and predicts whether or not an image-question-answer triplet is correct. We evaluate our model on the Visual7W Telling and the VQA Real Multiple Choice tasks, and find that even simple versions of our model perform competitively. Our best model achieves state-of-the-art performance of 65.8 % accuracy on the Visual7W Telling task and compares surprisingly well with the most complex systems proposed for the VQA Real Multiple Choice task. Additionally, we explore variants of the model and study the transferability of the model between both datasets. We also present an error analysis of our best model, the results of which suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers.
Keywords: Visual question answering · Dataset bias

1 Introduction
Recent advances in computer vision have brought us close to the point where traditional object-recognition benchmarks such as Imagenet are considered to be “solved” [1,2]. These advances, however, also prompt the question of how we can move from object recognition to visual understanding; that is, how we can extend today’s recognition systems, which provide us with “words” describing an image or an image region, to systems that can produce a deeper semantic representation of the image content. Because benchmarks have traditionally been a key driver for progress in computer vision, several recent studies have proposed methodologies to assess our ability to develop such representations. These proposals include modeling relations between objects [3], visual Turing tests [4], and visual question answering [5–8].
[Figure 1: four images, each paired with a question and four candidate answers: “What color is the jacket?” (Red and blue, Yellow, Black, Orange); “How many cars are parked?” (Four, Three, Five, Six); “What event is this?” (A wedding, Graduation, A funeral, A picnic); “When is this scene taking place?” (Day time, Night time, Evening, Morning).]

Fig. 1. Four images with associated questions and answers from the Visual7W dataset. Correct answers are typeset in green. (Color figure online)
The task of Visual Question Answering (VQA) is to answer questions—posed in natural language—about an image by providing an answer in the form of short text.
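To make the multiple-choice setting and the binary scoring formulation described in the abstract concrete, the following is a minimal sketch, assuming PyTorch, precomputed image and text feature vectors, and illustrative layer sizes; none of these choices reflect the exact configuration used in the paper. A small MLP scores each image-question-answer triplet, and the candidate answer with the highest score is selected at test time.

```python
import torch
import torch.nn as nn

class TripletScorer(nn.Module):
    """Binary scorer over (image, question, answer) triplets.

    The feature extractors and dimensions are illustrative assumptions,
    e.g. a 2048-d CNN image feature and 300-d averaged word embeddings
    for the question and the candidate answer.
    """
    def __init__(self, img_dim=2048, txt_dim=300, hidden_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + 2 * txt_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # single logit: is this triplet correct?
        )

    def forward(self, img_feat, q_feat, a_feat):
        # Concatenate the three feature vectors and predict one score.
        x = torch.cat([img_feat, q_feat, a_feat], dim=-1)
        return self.mlp(x).squeeze(-1)

def predict(scorer, img_feat, q_feat, answer_feats):
    """Score every candidate answer for one image/question pair and
    return the index of the highest-scoring candidate."""
    scores = torch.stack([scorer(img_feat, q_feat, a) for a in answer_feats])
    return scores.argmax().item()
```

In this sketch, training amounts to logistic regression over triplets: triplets formed with the correct answer serve as positive examples and those formed with the distractor answers as negatives.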