Visual question answering: a state-of-the-art review



Sruthy Manmadhan1 · Binsu C. Kovoor1

© Springer Nature B.V. 2020

Abstract  Visual question answering (VQA) is a task that has received immense attention from two major research communities: computer vision and natural language processing. It has recently been widely accepted as an AI-complete task that can serve as an alternative to the visual Turing test. In its most common form, it is a challenging multi-modal task in which a computer is required to provide the correct answer to a natural language question asked about an input image. It has attracted many deep learning researchers following their remarkable achievements in text, speech and vision technologies. This review extensively and critically examines the current status of VQA research in terms of step-by-step solution methodologies, datasets and evaluation metrics. Finally, the paper discusses future research directions for each of these aspects of VQA separately.

Keywords  Visual question answering · Computer vision · Natural language processing · Deep learning

1 Introduction

Matt King of Facebook says, “But from my perspective as a blind user, going from essentially zero percent satisfaction from a photo to somewhere in the neighborhood of half … is a huge jump”, commenting on Facebook’s effort to automatically caption photos for blind users. This leads to the inference that it would be great if machines were intelligent enough to understand image contents and communicate that understanding as effectively as humans. VQA is a stepping stone toward this Artificial Intelligence dream (AI-dream) of Visual Dialogue. In the most common form of Visual Question Answering (VQA), the computer is presented with an image and a textual question about that image. The machine’s task is then to generate the correct answer, typically a few words or a short phrase. That is, VQA is a task guided by mature research in computer vision (CV) and natural language processing (NLP), both of which fall under the domain of AI. In the words of



Division of Information Technology, Cochin University of Science and Technology, Kochi, Kerala, India






Fig. 1  Definition of VQA
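The definition sketched in Fig. 1 (an image and a free-form natural language question in, a short answer out) can be illustrated as a minimal programming interface. The sketch below is ours, not code from any system surveyed here: the hash-style encoders, the fixed answer vocabulary and the additive fusion are placeholder assumptions standing in for the learned CNN, RNN and classifier components that real VQA models use.

```python
from typing import List

# Fixed answer vocabulary: most VQA systems cast answering as
# classification over a finite set of frequent answers.
ANSWERS = ["yes", "no", "2", "red", "dog"]

def encode_image(image_pixels: List[int]) -> List[float]:
    # Placeholder for a CNN image encoder; returns a 1-D "feature".
    return [sum(image_pixels) % 7 / 7.0]

def encode_question(question: str) -> List[float]:
    # Placeholder for an RNN/transformer question encoder.
    return [len(question.split()) / 10.0]

def vqa(image_pixels: List[int], question: str) -> str:
    # Fuse the two modalities and pick an answer from the fixed
    # vocabulary, mirroring the common classification formulation.
    fused = encode_image(image_pixels)[0] + encode_question(question)[0]
    return ANSWERS[int(fused * 10) % len(ANSWERS)]
```

The point of the interface is only its shape: two heterogeneous inputs, one joint representation, one short textual output.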

Table 1  Computer vision sub-tasks required to be solved by VQA

CV task                              Representative VQA question
Object recognition                   What is in the image?
Object detection                     Are there any dogs in the picture?
Attribute classification             What color is the umbrella?
Scene classification                 Is it raining?
Counting                             How many people are there in the image?
Activity recognition                 Is the child crying?
Spatial relationships among objects  What is between cat and sofa?
Commonsense reasoning                Does this person have 20/20 vision?
Knowledge-base reasoning             Is this a vegetarian pizza?

Devi Parikh, a VQA researcher, it is a great combination of pictures, words and common sense, as shown in Fig. 1. Compared to other vision-language tasks such as image