ContextLocNet: Context-Aware Deep Network Models for Weakly Supervised Localization



Abstract. We aim to localize objects in images using image-level supervision only. Previous approaches to this problem mainly focus on discriminative object regions and often fail to locate precise object boundaries. We address this problem by introducing two types of context-aware guidance models, additive and contrastive models, that leverage their surrounding context regions to improve localization. The additive model encourages the predicted object region to be supported by its surrounding context region. The contrastive model encourages the predicted object region to be outstanding from its surrounding context region. Our approach benefits from the recent success of convolutional neural networks for object recognition and extends Fast R-CNN to weakly supervised object localization. Extensive experimental evaluation on the PASCAL VOC 2007 and 2012 benchmarks shows that our context-aware approach significantly improves weakly supervised localization and detection.

Keywords: Object recognition · Object detection · Weakly supervised object localization · Context · Convolutional neural networks
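The two guidance models named in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes each candidate region already has a scalar class score for the region itself and one for its surrounding context, that the additive model sums the two, that the contrastive model subtracts them, and that a softmax over regions turns the result into localization weights. The function names and the numbers are hypothetical.

```python
import math

def softmax(scores):
    """Normalize a list of scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_scores(roi_scores, context_scores):
    """Assumed additive guidance: a region is favored when its context agrees."""
    return [r + c for r, c in zip(roi_scores, context_scores)]

def contrastive_scores(roi_scores, context_scores):
    """Assumed contrastive guidance: a region is favored when it stands out."""
    return [r - c for r, c in zip(roi_scores, context_scores)]

# Hypothetical class scores for three candidate boxes (not from the paper).
roi = [2.0, 2.0, 0.5]       # score of the box content
context = [1.5, -1.0, 0.2]  # score of the ring around the box

additive = softmax(additive_scores(roi, context))
contrastive = softmax(contrastive_scores(roi, context))

best_additive = max(range(len(additive)), key=additive.__getitem__)
best_contrastive = max(range(len(contrastive)), key=contrastive.__getitem__)
```

With these made-up scores, the additive variant prefers the box whose context also responds to the class, while the contrastive variant prefers the box whose context does not, which is one way to read "supported by" versus "outstanding from" its surroundings.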

1 Introduction

Weakly supervised object localization and learning (WSL) [1,2] is the problem of localizing the spatial extents of target objects and learning their representations from a dataset with only image-level labels. WSL is motivated by two fundamental issues of conventional object recognition. First, strong supervision in the form of object bounding boxes or segmentation masks is difficult to obtain, which prevents scaling up object localization to thousands of object classes. Second, imprecise and ambiguous manual annotations can introduce subjective biases into learning.

Convolutional neural networks (CNN) [3,4] have recently taken over the state of the art in many computer vision tasks. CNN-based methods for weakly supervised object localization have been explored in [5,6]. Despite this progress, WSL remains a very challenging problem. The state-of-the-art performance of WSL on standard benchmarks [1,2,6] is considerably lower than that of its strongly supervised counterparts [7–9].

Electronic supplementary material: The online version of this chapter (doi:10.1007/978-3-319-46454-1_22) contains supplementary material, which is available to authorized users.

© Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part V, LNCS 9909, pp. 350–365, 2016. DOI: 10.1007/978-3-319-46454-1_22

[Figure: ROI extraction followed by context-aware guidance. Additive (+): "person in ROI?" and "person in context?" combine into "person?". Contrastive (-): "person in ROI?" and "person in context?" are contrasted into "person in ROI but not in context?".]

Strongly supervised detection methods often use contextual information from regions around the object or from the whole image [7,9–13]. Indeed, visual context often provides useful information about which image regions are likely to belong to a target class according to object-background or object-object relations, e.g., a boat in the sea, a bird in the sky, a person on a horse, a table around a chair, etc. However, can a similar effect be