The Curious Robot: Learning Visual Representations via Physical Interactions
Abstract. What is the right supervisory signal to train visual representations? Current approaches in computer vision use category labels from datasets such as ImageNet to train ConvNets. However, in the case of biological agents, visual representation learning does not require millions of semantic labels. We argue that biological agents use physical interactions with the world to learn visual representations, unlike current vision systems, which just use passive observations (images and videos downloaded from the web). For example, babies push objects, poke them, put them in their mouths, and throw them to learn representations. Towards this goal, we build one of the first systems on a Baxter platform that pushes, pokes, grasps, and observes objects in a tabletop environment. It uses four different types of physical interactions to collect more than 130K datapoints, with each datapoint providing supervision to a shared ConvNet architecture, allowing us to learn visual representations. We show the quality of the learned representations by observing neuron activations and performing nearest neighbor retrieval on the learned representation. Quantitatively, we evaluate our learned ConvNet on image classification tasks and show improvements compared to learning without external data. Finally, on the task of instance retrieval, our network outperforms the ImageNet network on recall@1 by 3%.
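The abstract's central architectural idea, a single ConvNet shared across several interaction tasks, can be pictured as a common convolutional trunk with one lightweight head per interaction type. The sketch below (PyTorch) is only an illustration of that idea under assumed details: the trunk layout, head names, output sizes, and the grasp/push/poke target formats are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class SharedTrunkNet(nn.Module):
    """Illustrative shared ConvNet with one head per physical interaction.

    The trunk and head dimensions below are placeholders, not the paper's
    exact architecture.
    """

    def __init__(self):
        super().__init__()
        # Shared convolutional trunk: every interaction task updates these weights.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Task-specific heads (output sizes are assumptions).
        self.grasp_head = nn.Linear(256, 18)   # e.g. discretized grasp angles
        self.push_head = nn.Linear(256, 5)     # e.g. push action parameters
        self.poke_head = nn.Linear(256, 1)     # e.g. tactile/force response
        self.embed_head = nn.Linear(256, 128)  # embedding for passive observation

    def features(self, x):
        """Shared representation used for downstream evaluation."""
        return self.trunk(x).flatten(1)

    def forward(self, x, task):
        f = self.features(x)
        head = {"grasp": self.grasp_head,
                "push": self.push_head,
                "poke": self.poke_head,
                "observe": self.embed_head}[task]
        return head(f)


# Example: a batch of images routed through the grasp head.
net = SharedTrunkNet()
grasp_logits = net(torch.randn(8, 3, 224, 224), task="grasp")  # shape (8, 18)
```

In this picture, each of the roughly 130K interaction datapoints contributes a loss through exactly one head, while gradients from all of the tasks shape the shared trunk whose features are later evaluated on classification and retrieval.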
1 Introduction
Recently, most computer vision systems have moved from hand-designed features to a feature-learning paradigm. Much of this visual feature learning is done in a completely supervised manner using category labels. However, in the case of biological agents, visual learning typically does not require categorical labels and happens in an "unsupervised" manner¹. Recently there has been a strong push to learn visual representations without using any category labels. Examples include using context from images [1], different viewpoints from videos [2], ego-motion from videos [3], and generative models of images and videos [4–7]. However, all these approaches still observe the visual world passively, without any physical interaction with the world.
¹ By "unsupervised" we mean no supervision from other agents, but supervision can come from other modalities or from time.
[Fig. 1 overview: Poking, Grasping, and Pushing interactions → Physical Interaction Data → Conv Layer 1 Filters, Conv3 Neuron Activations, Conv5 Neuron Activations → Learned Visual Representation]
Fig. 1. Learning ConvNets from Physical Interactions: We propose a framework for training a ConvNet using physical interaction data from robots. We first use a Baxter robot to grasp, push, poke, and observe objects, with each interaction providing a training datapoint. We collect more than 130K datapoints to train a ConvNet. To the best of our knowledge, ours is one of the first systems that trains visual representations using physical interactions.
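The instance-retrieval evaluation mentioned in the abstract (recall@1) boils down to checking, for each query image, whether its single nearest neighbor in feature space shows the same object instance. The snippet below is a minimal sketch of that metric, not the paper's evaluation code; it assumes query and database features have already been extracted (for example with a trunk like the one sketched earlier) and L2-normalized, so cosine similarity reduces to a dot product.

```python
import numpy as np


def recall_at_1(query_feats, db_feats, query_labels, db_labels):
    """Fraction of queries whose top-1 database neighbor shares their instance label.

    query_feats: (Q, D) array, db_feats: (N, D) array, both L2-normalized;
    query_labels: (Q,) array, db_labels: (N,) array of instance ids.
    """
    sims = query_feats @ db_feats.T      # (Q, N) cosine similarities
    nearest = sims.argmax(axis=1)        # index of the top-1 neighbor per query
    hits = db_labels[nearest] == query_labels
    return hits.mean()
```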
On the other ha