Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles

Abstract. We propose a novel unsupervised learning approach to build features suitable for object detection and classification. The features are pre-trained on a large dataset without human annotation and later transferred via fine-tuning on a different, smaller and labeled dataset. The pre-training consists of solving jigsaw puzzles of natural images. To facilitate the transfer of features to other tasks, we introduce the context-free network (CFN), a siamese-ennead convolutional neural network. The features correspond to the columns of the CFN and they process image tiles independently (i.e., free of context). The later layers of the CFN then use the features to identify their geometric arrangement. Our experimental evaluations show that the learned features capture semantically relevant content. We pre-train the CFN on the training set of the ILSVRC2012 dataset and transfer the features to the combined training and validation set of Pascal VOC 2007 for object detection (via Fast R-CNN) and classification. These features outperform all current unsupervised features with 51.8 % for detection and 68.6 % for classification, and reduce the gap with supervised learning (56.5 % and 78.2 % respectively).

Keywords: Unsupervised learning · Image representation learning · Self-supervised learning · Feature transfer
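The pretext task named in the abstract can be illustrated with a minimal sketch. It assumes a 3 × 3 grid (nine tiles, one per column of a siamese-ennead network), equal-sized non-overlapping tiles, and a small randomly drawn permutation set whose index serves as the training label. The function names, the permutation-set size, and the framing as classification over permutation indices are illustrative assumptions, not details taken from the text above.

```python
import random

GRID = 3  # assumed 3 x 3 grid: nine tiles, one per network column


def split_into_tiles(image, grid=GRID):
    """Cut a 2-D image (list of rows) into grid*grid equal tiles, row-major."""
    th, tw = len(image) // grid, len(image[0]) // grid
    return [
        [row[c * tw:(c + 1) * tw] for row in image[r * th:(r + 1) * th]]
        for r in range(grid)
        for c in range(grid)
    ]


def sample_permutation_set(n_perms, n_tiles=GRID * GRID, seed=0):
    """Hypothetical helper: draw n_perms distinct tile orderings at random."""
    rng = random.Random(seed)
    perms = set()
    while len(perms) < n_perms:
        perms.add(tuple(rng.sample(range(n_tiles), n_tiles)))
    return sorted(perms)


def make_puzzle(image, perm_set, rng):
    """Return (shuffled_tiles, label); label indexes the permutation applied."""
    label = rng.randrange(len(perm_set))
    tiles = split_into_tiles(image)
    # Position i of the puzzle holds the original tile perm_set[label][i].
    return [tiles[i] for i in perm_set[label]], label


rng = random.Random(0)
# Grayscale stand-in for a natural image (assumption: 96 x 96 pixels).
image = [[rng.random() for _ in range(96)] for _ in range(96)]
perm_set = sample_permutation_set(n_perms=10)
puzzle, label = make_puzzle(image, perm_set, rng)
```

Under these assumptions, each of the nine shuffled tiles would be fed to one column of the network, and the later layers would be trained to predict `label`, i.e. which arrangement was applied.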

1 Introduction

Visual tasks, such as object classification and detection, have been successfully approached through the supervised learning paradigm [1,10,23,33], where one uses labeled data to train a parametric model. However, as manually labeled data can be costly, unsupervised learning methods are gaining momentum. Recently, Doersch et al. [9], Wang and Gupta [36] and Agrawal et al. [2] have explored a novel paradigm for unsupervised learning called self-supervised learning. The main idea is to exploit different labelings that are freely available alongside or within visual data, and to use them as intrinsic reward signals to learn general-purpose features. [9] uses the relative spatial co-location of patches in images as a label. [36] uses object correspondence obtained through tracking in videos, and [2] uses ego-motion information obtained by a mobile agent such as the Google car [7]. The features obtained with these approaches have been successfully transferred to classification and detection tasks, and their performance is very encouraging when compared to features trained in a supervised manner.

© Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part VI, LNCS 9910, pp. 69–84, 2016. DOI: 10.1007/978-3-319-46466-4_5

M. Noroozi and P. Favaro

Fig. 1. Learning image representations by solving jigsaw puzzles. (a) The image from which the tiles (marked with green lines) are extracted. (b) A puzzle obtained by shuffling the tiles. Some tiles might be directly identifiable as object parts, but others are ambiguous (e.g., have similar patterns or belong to the background) and their localization is much more reliable when all tiles are jointly evaluated. In contrast, wit