Visual question answering: a state-of-the-art review



Sruthy Manmadhan1 · Binsu C. Kovoor1

© Springer Nature B.V. 2020

Abstract  Visual question answering (VQA) is a task that has received immense attention from two major research communities: computer vision and natural language processing. It has recently been widely accepted as an AI-complete task that can serve as an alternative to the visual Turing test. In its most common form, it is a challenging multi-modal task in which a computer is required to provide the correct answer to a natural language question asked about an input image. It has attracted many deep learning researchers following their remarkable achievements in text, speech and vision technologies. This review extensively and critically examines the current status of VQA research in terms of step-by-step solution methodologies, datasets and evaluation metrics. Finally, the paper discusses future research directions for each of these aspects of VQA separately.

Keywords  Visual question answering · Computer vision · Natural language processing · Deep learning

1 Introduction

Matt King of Facebook says, “But from my perspective as a blind user, going from essentially zero percent satisfaction from a photo to somewhere in the neighborhood of half … is a huge jump”, commenting on Facebook’s effort to automatically caption photos for blind users. This leads to the inference that it would be great if machines were intelligent enough to understand image contents and communicate that understanding as effectively as humans. VQA is a stepping stone toward this Artificial Intelligence dream (AI-dream) of Visual Dialogue. In the most common form of Visual Question Answering (VQA), the computer is presented with an image and a textual question about that image. The machine’s task is then to generate the correct answer, typically a few words or a short phrase. That is, VQA is a task guided by mature research in computer vision (CV) and natural language processing (NLP), both of which fall under the domain of AI. In the words of



Division of Information Technology, Cochin University of Science and Technology, Kochi, Kerala, India






Fig. 1  Definition of VQA
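The definition sketched in Fig. 1 (an image and a free-form natural language question in, a short answer out) can be illustrated as a minimal programming interface. The sketch below is ours, not code from any system surveyed here: the hash-style encoders, the fixed answer vocabulary and the additive fusion are placeholder assumptions standing in for the learned CNN, RNN and classifier components that real VQA models use.

```python
from typing import List

# Fixed answer vocabulary: most VQA systems cast answering as
# classification over a finite set of frequent answers.
ANSWERS = ["yes", "no", "2", "red", "dog"]

def encode_image(image_pixels: List[int]) -> List[float]:
    # Placeholder for a CNN image encoder; returns a 1-D "feature".
    return [sum(image_pixels) % 7 / 7.0]

def encode_question(question: str) -> List[float]:
    # Placeholder for an RNN/transformer question encoder.
    return [len(question.split()) / 10.0]

def vqa(image_pixels: List[int], question: str) -> str:
    # Fuse the two modalities and pick an answer from the fixed
    # vocabulary, mirroring the common classification formulation.
    fused = encode_image(image_pixels)[0] + encode_question(question)[0]
    return ANSWERS[int(fused * 10) % len(ANSWERS)]
```

The point of the interface is only its shape: two heterogeneous inputs, one joint representation, one short textual output.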

Table 1  Computer vision sub-tasks required to be solved by VQA

CV task                              Representative VQA question
Object recognition                   What is in the image?
Object detection                     Are there any dogs in the picture?
Attribute classification             What color is the umbrella?
Scene classification                 Is it raining?
Counting                             How many people are there in the image?
Activity recognition                 Is the child crying?
Spatial relationships among objects  What is between cat and sofa?
Commonsense reasoning                Does this person have 20/20 vision?
Knowledge-base reasoning             Is this a vegetarian pizza?

Devi Parikh, a VQA researcher, it is a great combination of pictures, words and common sense, as shown in Fig. 1. Compared to other vision-language tasks such as image