Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
Abstract. We propose a novel unsupervised learning approach to build features suitable for object detection and classification. The features are pre-trained on a large dataset without human annotation and later transferred via fine-tuning on a different, smaller and labeled dataset. The pre-training consists of solving jigsaw puzzles of natural images. To facilitate the transfer of features to other tasks, we introduce the context-free network (CFN), a siamese-ennead convolutional neural network. The features correspond to the columns of the CFN and they process image tiles independently (i.e., free of context). The later layers of the CFN then use the features to identify their geometric arrangement. Our experimental evaluations show that the learned features capture semantically relevant content. We pre-train the CFN on the training set of the ILSVRC2012 dataset and transfer the features to the combined training and validation set of Pascal VOC 2007 for object detection (via Fast R-CNN) and classification. These features outperform all current unsupervised features with 51.8 % for detection and 68.6 % for classification, and reduce the gap with supervised learning (56.5 % and 78.2 %, respectively).

Keywords: Unsupervised learning · Image representation learning · Self-supervised learning · Feature transfer
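The sketch below illustrates the siamese-ennead structure described above, assuming a PyTorch-style implementation: one shared convolutional column encodes each of the nine tiles independently (the context-free part), and only the final fully connected layers see all tile features jointly to classify which permutation was applied. The layer sizes, the number of permutations, and the name `ContextFreeNetwork` are illustrative placeholders, not the paper's exact AlexNet-based configuration.

```python
import torch
import torch.nn as nn


class ContextFreeNetwork(nn.Module):
    """Sketch of a siamese-ennead CFN: nine tiles share one convolutional
    column, and the permutation is classified only after the per-tile
    features are concatenated. Layer sizes are illustrative placeholders."""

    def __init__(self, num_permutations=100, num_tiles=9):
        super().__init__()
        self.num_tiles = num_tiles
        # Shared column applied to each tile independently ("free of context").
        self.column = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 128, kernel_size=3), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 512), nn.ReLU(inplace=True),
        )
        # Only these later layers see all tiles jointly; they predict the
        # index of the permutation that was applied to the tiles.
        self.head = nn.Sequential(
            nn.Linear(num_tiles * 512, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_permutations),
        )

    def forward(self, x):
        # x has shape (batch, 9, 3, H, W): nine shuffled tiles per image.
        features = [self.column(x[:, i]) for i in range(self.num_tiles)]
        return self.head(torch.cat(features, dim=1))


# Usage: predict which of the candidate permutations produced the tile order.
tiles = torch.randn(4, 9, 3, 64, 64)
logits = ContextFreeNetwork()(tiles)                       # (4, 100)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 100, (4,)))
```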
1 Introduction
Visual tasks, such as object classification and detection, have been successfully approached through the supervised learning paradigm [1,10,23,33], where one uses labeled data to train a parametric model. However, as manually labeled data can be costly, unsupervised learning methods are gaining momentum. Recently, Doersch et al. [9], Wang and Gupta [36], and Agrawal et al. [2] have explored a novel paradigm for unsupervised learning called self-supervised learning. The main idea is to exploit different labelings that are freely available besides or within visual data, and to use them as intrinsic reward signals to learn general-purpose features. [9] uses the relative spatial co-location of patches in images as a label. [36] uses object correspondence obtained through tracking in videos, and [2] uses ego-motion information obtained by a mobile agent such as the Google car [7]. The features obtained with these approaches have been successfully transferred to classification and detection tasks, and their performance is very encouraging when compared to features trained in a supervised manner.
Fig. 1. Learning image representations by solving jigsaw puzzles. (a) The image from which the tiles (marked with green lines) are extracted. (b) A puzzle obtained by shuffling the tiles. Some tiles might be directly identifiable as object parts, but others are ambiguous (e.g., have similar patterns or belong to the background), and their localization is much more reliable when all tiles are jointly evaluated. In contrast, …
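As a rough illustration of the puzzle generation sketched in Fig. 1, one can cut a 3 × 3 grid of tiles from an image and reorder them according to a permutation drawn from a fixed set, whose index then serves as the training label. The sketch below is a simplified assumption, not the paper's exact protocol: the tile sizes, the absence of gaps between tiles, and the toy permutation list are placeholders, and `make_jigsaw_example` is a hypothetical helper name.

```python
import numpy as np


def make_jigsaw_example(image, permutations, rng=np.random.default_rng()):
    """Cut a 3x3 grid of tiles from `image` (H, W, 3), reorder them with a
    permutation drawn from the fixed set `permutations`, and return the
    shuffled tiles plus the permutation index used as the training label.
    Simplified for illustration (no gaps or jitter between tiles)."""
    h, w = image.shape[0] // 3, image.shape[1] // 3
    tiles = [image[r * h:(r + 1) * h, c * w:(c + 1) * w]
             for r in range(3) for c in range(3)]
    label = rng.integers(len(permutations))
    shuffled = [tiles[i] for i in permutations[label]]
    return np.stack(shuffled), label


# Usage with a toy permutation set; the paper uses a much larger subset of
# the 9! possible orderings, while these two entries are placeholders.
perms = [list(range(9)), [8, 7, 6, 5, 4, 3, 2, 1, 0]]
tiles, label = make_jigsaw_example(np.zeros((225, 225, 3), np.float32), perms)
print(tiles.shape, label)   # (9, 75, 75, 3) and the permutation index
```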