Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
Abstract. We address the problem of Visual Question Answering (VQA), which requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on convolutional-recurrent networks to this problem, but have failed to model spatial inference. To remedy this, we propose a model we call the Spatial Memory Network and apply it to the VQA task. Memory networks are recurrent neural networks with an explicit attention mechanism that selects certain parts of the information stored in memory. Our Spatial Memory Network stores neuron activations from different spatial regions of the image in its memory, and uses attention to choose regions relevant for computing the answer. We propose a novel question-guided spatial attention architecture that looks for regions relevant to either individual words or the entire question, repeating the process over multiple recurrent steps, or “hops”. To better understand the inference process learned by the network, we design synthetic questions that specifically require spatial inference and visualize the network’s attention. We evaluate our model on two available visual question answering datasets and obtain improved results.
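To make the attention mechanism described in the abstract concrete, the following is a minimal PyTorch-style sketch of a single question-guided attention "hop" over spatial image regions. The shapes, the max-over-words scoring, and the function name are our illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def word_guided_attention(region_feats, word_embs):
    """One attention 'hop' over spatial image regions, guided by question words.

    region_feats: (N, D) activations for N spatial regions (e.g. the 14*14 = 196
                  locations of a CNN feature map), embedded into dimension D.
    word_embs:    (T, D) embeddings of the T question words in the same space.

    Returns (att, evidence): an (N,) attention distribution over regions and
    the (D,) attention-weighted visual evidence vector.
    """
    # Correlation between every word and every region: (T, N).
    corr = word_embs @ region_feats.t()
    # Score each region by its best-matching word, then normalize over regions.
    scores, _ = corr.max(dim=0)            # (N,)
    att = F.softmax(scores, dim=0)         # (N,)
    # Gather visual evidence as the attention-weighted sum of region features.
    evidence = att @ region_feats          # (D,)
    return att, evidence
```

Taking the maximum over words lets a single highly relevant word (e.g. an object name) dominate the attention for a region; mean-pooling over words would be a plausible alternative that smooths over the whole question.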
Keywords: Visual question answering · Spatial attention · Memory network · Deep learning

1 Introduction
Visual Question Answering (VQA) is an emerging interdisciplinary research problem at the intersection of computer vision, natural language processing and artificial intelligence. It has many real-life applications, such as automatic querying of surveillance video [1] or assisting the visually impaired [2]. Compared to the recently popular image captioning task [3–6], VQA requires a deeper understanding of the image, but is considerably easier to evaluate. It also puts more focus on artificial intelligence, namely the inference process needed to produce the answer to the visual question. In one of the early works [8], VQA is seen as a Turing test proxy. The authors propose an approach based on handcrafted features, combining a semantic parse of the question with visual scene analysis.

Fig. 1. We propose a Spatial Memory Network for VQA (SMem-VQA) that answers questions about images using spatial inference. The figure shows the inference process of our two-hop model on examples from the VQA dataset [7]. In the first hop (middle), the attention process captures the correspondence between individual words in the question and image regions. High attention regions (bright areas) are marked with bounding boxes, and the corresponding words are highlighted in the same color. In the second hop (right), the fine-grained evidence gathered in the first hop, together with an embedding of the entire question, is used to collect more exact evidence to predict the answer. (Best viewed in color.)
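The two-hop inference process illustrated in Fig. 1 can be sketched as follows, reusing the single-hop attention from the sketch above. The mean-pooled question embedding, the additive query combination, and the linear answer classifier are our assumptions for illustration only; they are not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHopSpatialAttention(nn.Module):
    """Hypothetical two-hop sketch in the spirit of Fig. 1: hop 1 attends with
    individual words, hop 2 refines attention using the whole question."""

    def __init__(self, feat_dim, emb_dim, num_answers):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, emb_dim)      # embed CNN regions
        self.classifier = nn.Linear(emb_dim, num_answers)

    def forward(self, region_feats, word_embs):
        # region_feats: (N, feat_dim); word_embs: (T, emb_dim)
        v = self.proj_v(region_feats)                   # (N, emb_dim)
        # Hop 1: word-guided attention over regions (as in the sketch above).
        scores1, _ = (word_embs @ v.t()).max(dim=0)     # (N,)
        att1 = F.softmax(scores1, dim=0)
        evidence = att1 @ v                             # (emb_dim,)
        # Hop 2: the query combines hop-1 evidence with an embedding of the
        # entire question (here simply the mean of word embeddings, an
        # assumption) to collect more exact evidence.
        query = evidence + word_embs.mean(dim=0)        # (emb_dim,)
        att2 = F.softmax(v @ query, dim=0)              # (N,)
        refined = att2 @ v                              # (emb_dim,)
        return self.classifier(refined + query)         # answer logits
```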