Built-in Foreground/Background Prior for Weakly-Supervised Semantic Segmentation

1 The Australian National University (ANU), Canberra, Australia
2 CSIRO, Canberra, Australia
{fatemehsadat.saleh,mohammadsadegh.aliakbarian,lars.petersson,jose.alvarezlopez}@data61.csiro.au
3 CVLab, EPFL, Lausanne, Switzerland

Abstract. Pixel-level annotations are expensive and time-consuming to obtain. Hence, weak supervision using only image tags could have a significant impact in semantic segmentation. Recently, CNN-based methods have proposed to fine-tune pre-trained networks using image tags. Without additional information, this leads to poor localization accuracy. This problem, however, was alleviated by making use of objectness priors to generate foreground/background masks. Unfortunately, these priors either require training with pixel-level annotations/bounding boxes, or still yield inaccurate object boundaries. Here, we propose a novel method to extract markedly more accurate masks from the pre-trained network itself, forgoing external objectness modules. This is accomplished using the activations of the higher-level convolutional layers, smoothed by a dense CRF. We demonstrate that our method, based on these masks and a weakly-supervised loss, outperforms the state-of-the-art tag-based weakly-supervised semantic segmentation techniques. Furthermore, we introduce a new form of inexpensive weak supervision yielding an additional accuracy boost.

Keywords: Semantic segmentation · Weak annotation · Convolutional neural networks · Weakly-supervised segmentation
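To make the core idea concrete, the following is a minimal sketch of how such a built-in foreground/background prior could be extracted from a pre-trained network's higher-level activations. The backbone (VGG-16), the fused layer (conv5_3), the channel-averaging step, and the fixed threshold are illustrative assumptions rather than the authors' exact procedure; the dense-CRF smoothing mentioned in the abstract is omitted and only noted in a comment.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def builtin_fg_bg_mask(image, layer_idx=28, threshold=0.5):
    """image: (1, 3, H, W) float tensor, ImageNet-normalized."""
    # Pre-trained classification network; no segmentation-specific training.
    vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
    with torch.no_grad():
        feats = image
        for i, layer in enumerate(vgg.features):
            feats = layer(feats)
            if i == layer_idx:  # stop at a high-level conv layer (conv5_3 here)
                break
        # Fuse the activation maps of that layer into one map by averaging
        # across channels: high-level activations tend to fire on objects.
        fused = feats.abs().mean(dim=1, keepdim=True)
        # Upsample to the input resolution and normalize to [0, 1].
        fused = F.interpolate(fused, size=image.shape[2:], mode="bilinear",
                              align_corners=False)
        fused = (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)
    # Binary foreground/background mask. The paper additionally smooths this
    # prior with a dense CRF, which is omitted in this sketch.
    return (fused > threshold).float()
```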

1 Introduction

Semantic scene segmentation, i.e., assigning a class label to every pixel in an input image, has received growing attention in the computer vision community, with accuracy greatly increasing over the years [1–6]. In particular, fully-supervised approaches based on Convolutional Neural Networks (CNNs) have recently achieved impressive results [1–4,7]. Unfortunately, these methods require large amounts of training images with pixel-level annotations, which are expensive and time-consuming to obtain. Weakly-supervised techniques have therefore emerged as a solution to address this limitation [8–15]. These techniques rely on a weaker form of training annotation, such as, from weaker to stronger levels of supervision, image tags [12,14,16,17], information about object sizes [17], labeled points or squiggles [12], and labeled bounding boxes [13,18]. In the current Deep Learning era, existing weakly-supervised methods typically start from a network pre-trained on an object recognition dataset (e.g., ImageNet [19]) and fine-tune it using segmentation losses defined according to the weak annotations at hand [12–14,16,17]. In this paper, we are particularly interested in the weakest such level of supervision, i.e., image tags.
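As an illustration of what a segmentation loss "defined according to the weak annotations" can look like for image tags, here is a common tag-based formulation: per-pixel class scores are aggregated into image-level predictions (global max pooling in this sketch) and supervised with a multi-label classification loss against the tags. The pooling choice and the binary cross-entropy form are assumptions for illustration, not the specific loss of any method cited above.

```python
import torch
import torch.nn.functional as F

def tag_loss(pixel_logits, tags):
    """
    pixel_logits: (B, C, H, W) per-pixel class scores from the network.
    tags:         (B, C) binary image-level labels (1 if class is present).
    """
    # Aggregate each class score map into a single image-level score;
    # global max pooling credits the most confident pixel per class.
    image_logits = pixel_logits.amax(dim=(2, 3))
    # Multi-label classification loss against the image tags.
    return F.binary_cross_entropy_with_logits(image_logits, tags)

# Example: random scores for a batch of 2 images, 20 classes, 32x32 pixels.
scores = torch.randn(2, 20, 32, 32)
tags = torch.randint(0, 2, (2, 20)).float()
loss = tag_loss(scores, tags)
```

In a fine-tuning loop, one would backpropagate this loss through the pixel-level predictions, so the network is trained end-to-end with nothing more than image tags as supervision.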