The Curious Robot: Learning Visual Representations via Physical Interactions
Abstract. What is the right supervisory signal to train visual representations? Current approaches in computer vision use category labels from datasets such as ImageNet to train ConvNets. However, in the case of biological agents, visual representation learning does not require millions of semantic labels. We argue that biological agents use physical interactions with the world to learn visual representations, unlike current vision systems, which just use passive observations (images and videos downloaded from the web). For example, babies push objects, poke them, put them in their mouths, and throw them to learn representations. Towards this goal, we build one of the first systems on a Baxter platform that pushes, pokes, grasps, and observes objects in a tabletop environment. It uses four different types of physical interactions to collect more than 130K datapoints, with each datapoint providing supervision to a shared ConvNet architecture, allowing us to learn visual representations. We show the quality of the learned representations by observing neuron activations and performing nearest neighbor retrieval on the learned representation. Quantitatively, we evaluate our learned ConvNet on image classification tasks and show improvements compared to learning without external data. Finally, on the task of instance retrieval, our network outperforms the ImageNet network on recall@1 by 3%.
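The abstract's central architectural idea, a single ConvNet shared across several interaction tasks, can be pictured as a common convolutional trunk with one lightweight head per interaction type. The sketch below (PyTorch) is only an illustration of that idea under assumed details: the trunk layout, head names, output sizes, and the grasp/push/poke target formats are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class SharedTrunkNet(nn.Module):
    """Illustrative shared ConvNet with one head per physical interaction.

    The trunk and head dimensions below are placeholders, not the paper's
    exact architecture.
    """

    def __init__(self):
        super().__init__()
        # Shared convolutional trunk: every interaction task updates these weights.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Task-specific heads (output sizes are assumptions).
        self.grasp_head = nn.Linear(256, 18)   # e.g. discretized grasp angles
        self.push_head = nn.Linear(256, 5)     # e.g. push action parameters
        self.poke_head = nn.Linear(256, 1)     # e.g. tactile/force response
        self.embed_head = nn.Linear(256, 128)  # embedding for passive observation

    def features(self, x):
        """Shared representation used for downstream evaluation."""
        return self.trunk(x).flatten(1)

    def forward(self, x, task):
        f = self.features(x)
        head = {"grasp": self.grasp_head,
                "push": self.push_head,
                "poke": self.poke_head,
                "observe": self.embed_head}[task]
        return head(f)


# Example: a batch of images routed through the grasp head.
net = SharedTrunkNet()
grasp_logits = net(torch.randn(8, 3, 224, 224), task="grasp")  # shape (8, 18)
```

In this picture, each of the roughly 130K interaction datapoints contributes a loss through exactly one head, while gradients from all of the tasks shape the shared trunk whose features are later evaluated on classification and retrieval.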
1 Introduction
Recently, most computer vision systems have moved from hand-designed features to a feature-learning paradigm. Much of this visual feature learning is done in a completely supervised manner using category labels. However, in the case of biological agents, visual learning typically does not require categorical labels and happens in an "unsupervised" manner¹. Recently there has been a strong push to learn visual representations without using any category labels. Examples include using context from images [1], different viewpoints from videos [2], ego-motion from videos [3], and generative models of images and videos [4–7]. However, all these approaches still observe the visual world passively, without any physical interaction with the world.
¹ By "unsupervised" we mean no supervision from other agents, but supervision can come from other modalities or from time.
[Fig. 1 overview: Poking, Grasping, and Pushing interactions → Physical Interaction Data → Conv Layer 1 Filters, Conv3 Neuron Activations, Conv5 Neuron Activations → Learned Visual Representation]
Fig. 1. Learning ConvNets from Physical Interactions: We propose a framework for training a ConvNet using physical interaction data from robots. We first use a Baxter robot to grasp, push, poke, and observe objects, with each interaction providing a training datapoint. We collect more than 130K datapoints to train a ConvNet. To the best of our knowledge, ours is one of the first systems that trains visual representations using physical interactions.
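The instance-retrieval evaluation mentioned in the abstract (recall@1) boils down to checking, for each query image, whether its single nearest neighbor in feature space shows the same object instance. The snippet below is a minimal sketch of that metric, not the paper's evaluation code; it assumes query and database features have already been extracted (for example with a trunk like the one sketched earlier) and L2-normalized, so cosine similarity reduces to a dot product.

```python
import numpy as np


def recall_at_1(query_feats, db_feats, query_labels, db_labels):
    """Fraction of queries whose top-1 database neighbor shares their instance label.

    query_feats: (Q, D) array, db_feats: (N, D) array, both L2-normalized;
    query_labels: (Q,) array, db_labels: (N,) array of instance ids.
    """
    sims = query_feats @ db_feats.T      # (Q, N) cosine similarities
    nearest = sims.argmax(axis=1)        # index of the top-1 neighbor per query
    hits = db_labels[nearest] == query_labels
    return hits.mean()
```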
On the other ha