Revisiting Visual Question Answering Baselines
Abstract. Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding. Many of the recently proposed VQA systems include attention or memory mechanisms designed to perform “reasoning”. Furthermore, for the task of multiple-choice VQA, nearly all of these systems train a multi-class classifier on image and question features to predict an answer. This paper questions the value of these common practices and develops a simple alternative model based on binary classification. Instead of treating answers as competing choices, our model receives the answer as input and predicts whether or not an image-question-answer triplet is correct. We evaluate our model on the Visual7W Telling and the VQA Real Multiple Choice tasks, and find that even simple versions of our model perform competitively. Our best model achieves state-of-the-art performance of 65.8 % accuracy on the Visual7W Telling task and compares surprisingly well with the most complex systems proposed for the VQA Real Multiple Choice task. Additionally, we explore variants of the model and study the transferability of the model between both datasets. We also present an error analysis of our best model, the results of which suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers.
Keywords: Visual question answering · Dataset bias

1 Introduction
Recent advances in computer vision have brought us close to the point where traditional object-recognition benchmarks such as Imagenet are considered to be “solved” [1,2]. These advances, however, also prompt the question of how we can move from object recognition to visual understanding; that is, how we can extend today’s recognition systems, which provide us with “words” describing an image or an image region, to systems that can produce a deeper semantic representation of the image content. Because benchmarks have traditionally been a key driver for progress in computer vision, several recent studies have proposed methodologies to assess our ability to develop such representations. These proposals include modeling relations between objects [3], visual Turing tests [4], and visual question answering [5–8].
[Figure 1: four images, each paired with a question and four candidate answers: “What color is the jacket?” (Red and blue, Yellow, Black, Orange); “How many cars are parked?” (Four, Three, Five, Six); “What event is this?” (A wedding, Graduation, A funeral, A picnic); “When is this scene taking place?” (Day time, Night time, Evening, Morning).]

Fig. 1. Four images with associated questions and answers from the Visual7W dataset. Correct answers are typeset in green. (Color figure online)
The task of Visual Question Answering (VQA) is to answer questions—posed in natural language—about an image by providing an answer in the form of short text.
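To make the multiple-choice setting and the binary scoring formulation described in the abstract concrete, the following is a minimal sketch, assuming PyTorch, precomputed image and text feature vectors, and illustrative layer sizes; none of these choices reflect the exact configuration used in the paper. A small MLP scores each image-question-answer triplet, and the candidate answer with the highest score is selected at test time.

```python
import torch
import torch.nn as nn

class TripletScorer(nn.Module):
    """Binary scorer over (image, question, answer) triplets.

    The feature extractors and dimensions are illustrative assumptions,
    e.g. a 2048-d CNN image feature and 300-d averaged word embeddings
    for the question and the candidate answer.
    """
    def __init__(self, img_dim=2048, txt_dim=300, hidden_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + 2 * txt_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # single logit: is this triplet correct?
        )

    def forward(self, img_feat, q_feat, a_feat):
        # Concatenate the three feature vectors and predict one score.
        x = torch.cat([img_feat, q_feat, a_feat], dim=-1)
        return self.mlp(x).squeeze(-1)

def predict(scorer, img_feat, q_feat, answer_feats):
    """Score every candidate answer for one image/question pair and
    return the index of the highest-scoring candidate."""
    scores = torch.stack([scorer(img_feat, q_feat, a) for a in answer_feats])
    return scores.argmax().item()
```

In this sketch, training amounts to logistic regression over triplets: triplets formed with the correct answer serve as positive examples and those formed with the distractor answers as negatives.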