Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
Abstract. We address the problem of Visual Question Answering (VQA), which requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on convolutional-recurrent networks to this problem, but have failed to model spatial inference. To remedy this, we propose a model we call the Spatial Memory Network and apply it to the VQA task. Memory networks are recurrent neural networks with an explicit attention mechanism that selects certain parts of the information stored in memory. Our Spatial Memory Network stores neuron activations from different spatial regions of the image in its memory, and uses attention to choose regions relevant for computing the answer. We propose a novel question-guided spatial attention architecture that looks for regions relevant to either individual words or the entire question, repeating the process over multiple recurrent steps, or “hops”. To better understand the inference process learned by the network, we design synthetic questions that specifically require spatial inference and visualize the network’s attention. We evaluate our model on two available visual question answering datasets and obtain improved results.
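To make the attention mechanism described in the abstract concrete, the following is a minimal PyTorch-style sketch of a single question-guided attention "hop" over spatial image regions. The shapes, the max-over-words scoring, and the function name are our illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def word_guided_attention(region_feats, word_embs):
    """One attention 'hop' over spatial image regions, guided by question words.

    region_feats: (N, D) activations for N spatial regions (e.g. the 14*14 = 196
                  locations of a CNN feature map), embedded into dimension D.
    word_embs:    (T, D) embeddings of the T question words in the same space.

    Returns (att, evidence): an (N,) attention distribution over regions and
    the (D,) attention-weighted visual evidence vector.
    """
    # Correlation between every word and every region: (T, N).
    corr = word_embs @ region_feats.t()
    # Score each region by its best-matching word, then normalize over regions.
    scores, _ = corr.max(dim=0)            # (N,)
    att = F.softmax(scores, dim=0)         # (N,)
    # Gather visual evidence as the attention-weighted sum of region features.
    evidence = att @ region_feats          # (D,)
    return att, evidence
```

Taking the maximum over words lets a single highly relevant word (e.g. an object name) dominate the attention for a region; mean-pooling over words would be a plausible alternative that smooths over the whole question.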
Keywords: Visual question answering · Spatial attention · Memory network · Deep learning

1 Introduction
Visual Question Answering (VQA) is an emerging interdisciplinary research problem at the intersection of computer vision, natural language processing and artificial intelligence. It has many real-life applications, such as automatic querying of surveillance video [1] or assisting the visually impaired [2]. Compared to the recently popular image captioning task [3–6], VQA requires a deeper understanding of the image, but is considerably easier to evaluate. It also puts more focus on artificial intelligence, namely the inference process needed to produce the answer to the visual question. In one of the early works [8], VQA is seen as a Turing test proxy. The authors propose an approach based on handcrafted features, combining a semantic parse of the question with visual scene analysis.

Fig. 1. We propose a Spatial Memory Network for VQA (SMem-VQA) that answers questions about images using spatial inference. The figure shows the inference process of our two-hop model on examples from the VQA dataset [7]. In the first hop (middle), the attention process captures the correspondence between individual words in the question and image regions. High attention regions (bright areas) are marked with bounding boxes, and the corresponding words are highlighted in the same color. In the second hop (right), the fine-grained evidence gathered in the first hop, together with an embedding of the entire question, is used to collect more exact evidence to predict the answer. (Best viewed in color.)
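The two-hop inference process illustrated in Fig. 1 can be sketched as follows, reusing the single-hop attention from the sketch above. The mean-pooled question embedding, the additive query combination, and the linear answer classifier are our assumptions for illustration only; they are not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHopSpatialAttention(nn.Module):
    """Hypothetical two-hop sketch in the spirit of Fig. 1: hop 1 attends with
    individual words, hop 2 refines attention using the whole question."""

    def __init__(self, feat_dim, emb_dim, num_answers):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, emb_dim)      # embed CNN regions
        self.classifier = nn.Linear(emb_dim, num_answers)

    def forward(self, region_feats, word_embs):
        # region_feats: (N, feat_dim); word_embs: (T, emb_dim)
        v = self.proj_v(region_feats)                   # (N, emb_dim)
        # Hop 1: word-guided attention over regions (as in the sketch above).
        scores1, _ = (word_embs @ v.t()).max(dim=0)     # (N,)
        att1 = F.softmax(scores1, dim=0)
        evidence = att1 @ v                             # (emb_dim,)
        # Hop 2: the query combines hop-1 evidence with an embedding of the
        # entire question (here simply the mean of word embeddings, an
        # assumption) to collect more exact evidence.
        query = evidence + word_embs.mean(dim=0)        # (emb_dim,)
        att2 = F.softmax(v @ query, dim=0)              # (N,)
        refined = att2 @ v                              # (emb_dim,)
        return self.classifier(refined + query)         # answer logits
```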