Top-Down Neural Attention by Excitation Backprop
1 Boston University, Boston, USA {jmzhang,sclaroff}@bu.edu
2 Adobe Research, San Jose, USA {zlin,jbrandt,xshen}@adobe.com
Abstract. We aim to model the top-down attention of a Convolutional Neural Network (CNN) classifier for generating task-specific attention maps. Inspired by a top-down human visual attention model, we propose a new backpropagation scheme, called Excitation Backprop, that passes top-down signals down the network hierarchy via a probabilistic Winner-Take-All process. Furthermore, we introduce the concept of contrastive attention to make the top-down attention maps more discriminative. In experiments, we demonstrate the accuracy and generalizability of our method on weakly supervised localization tasks using the MS COCO, PASCAL VOC07, and ImageNet datasets. The usefulness of our method is further validated on the text-to-region association task: on the Flickr30k Entities dataset, we achieve promising phrase localization performance by leveraging the top-down attention of a CNN model trained on weakly labeled web images.
1 Introduction
Top-down task-driven attention is an important mechanism for efficient visual search. Various top-down attention models have been proposed, e.g. [1-4]. Among them, the Selective Tuning attention model [3] provides a biologically plausible formulation. Assuming a pyramidal neural network for visual processing, the Selective Tuning model is composed of a bottom-up sweep of the network to process input stimuli, and a top-down Winner-Take-All (WTA) process to localize the most relevant neurons in the network for a given top-down signal.

Inspired by the Selective Tuning model, we propose a top-down attention formulation for modern CNN classifiers. Instead of the deterministic WTA process used in [3], which can only generate binary attention maps, we formulate the top-down attention of a CNN classifier as a probabilistic WTA process. The probabilistic WTA formulation is realized by a novel backpropagation scheme, called Excitation Backprop, which integrates both top-down and bottom-up information to compute the winning probability of each neuron.
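To make the probabilistic WTA formulation concrete: the winning probability of a child neuron $a_j$ is obtained by marginalizing over its parent neurons,

P(a_j) = \sum_i P(a_j \mid a_i) P(a_i),

where the conditional winning probability is proportional to the child's bottom-up activation $\hat{a}_j$ times the connection weight $w_{ji}$ when that weight is excitatory:

P(a_j \mid a_i) = Z_i \hat{a}_j w_{ji} if w_{ji} \geq 0, and 0 otherwise, with Z_i = 1 / \sum_{j': w_{j'i} \geq 0} \hat{a}_{j'} w_{j'i}.

As an illustration only, the following is a minimal NumPy sketch of one such propagation step for a fully connected layer; the function name excitation_backprop_fc and the eps stabilizer are our own additions, and convolutional layers would apply the same rule over local receptive fields.

    import numpy as np

    def excitation_backprop_fc(W, a_hat, p_parent, eps=1e-12):
        """One Excitation Backprop step through a fully connected layer.

        W        : (n_child, n_parent) weight matrix; the bottom-up pass computes a_hat @ W.
        a_hat    : (n_child,) bottom-up activations of the child neurons
                   (non-negative, e.g. post-ReLU).
        p_parent : (n_parent,) winning probabilities of the parent neurons
                   (the top-down signal).
        Returns    (n_child,) marginal winning probabilities of the child neurons.
        """
        W_pos = np.maximum(W, 0.0)         # keep only excitatory connections
        contrib = a_hat[:, None] * W_pos   # \hat{a}_j * w_{ji}^+ for each (child j, parent i)
        Z = contrib.sum(axis=0) + eps      # per-parent normalizer \sum_j \hat{a}_j * w_{ji}^+
        P_cond = contrib / Z               # P(a_j | a_i); each column sums to 1
        return P_cond @ p_parent           # P(a_j) = \sum_i P(a_j | a_i) P(a_i)

Starting from a one-hot probability vector over the output units for the class of interest and applying this step layer by layer yields per-neuron winning probabilities at any intermediate layer, which can be reshaped into a task-specific attention map.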
[Fig. 1: attention maps for the panels Input, chair, glass, boy, woman, man, couple, and father]
Fig. 1. A CNN classifier's top-down attention maps generated by our Excitation Backprop can localize common object categories, e.g. chair and glass, as well as fine-grained categories like boy, man and woman in this example image, which is resized to 224×224 for our method. The classifier used in this example is trained to predict ∼18K tags using only weakly labeled web images. Visualizing the classifier's top-down attention can also help interpret what has been learned by the classifier. For couple, we can tell that our classifier uses the two adults in the image as the evidence, while for father, it focuses mostly on the child.