Ambient Sound Provides Supervision for Visual Learning



A. Owens et al.
Massachusetts Institute of Technology, Cambridge, USA ([email protected])
Google Research, Cambridge, USA

Abstract. The sound of crashing waves, the roar of fast-moving cars – sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds.

Keywords: Sound · Convolutional networks · Unsupervised learning

1 Introduction

Sound conveys important information about the world around us – the bustle of a café tells us that there are many people nearby, while the low-pitched roar of engine noise tells us to watch for fast-moving cars [10]. Although sound is in some cases complementary to visual information, such as when we listen to something out of view, vision and hearing are often informative about the same structures in the world. Here we propose that as a consequence of these correlations, concurrent visual and sound information provide a rich training signal that we can use to learn useful representations of the visual world. In particular, an algorithm trained to predict the sounds that occur within a visual scene might be expected to learn objects and scene elements that are associated with salient and distinctive noises, such as people, cars, and flowing water. Such an algorithm might also learn to associate visual scenes with the ambient sound textures [25] that occur within them. It might, for example, associate the sound of wind with outdoor scenes, and the buzz of refrigerators with indoor scenes.

Although human annotations are indisputably useful for learning, they are expensive to collect. The correspondence between ambient sounds and video is,
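To make the proposed training signal concrete, the following is a minimal sketch (not the authors' implementation) of a network that regresses a fixed-length vector of sound summary statistics from a single video frame. The architecture, tensor shapes, and plain L2 loss are illustrative assumptions, and the names FrameToSoundStats and stat_dim are hypothetical; the paper's own prediction target and objective may differ.

```python
# Hypothetical sketch: a small CNN predicts a vector of audio summary
# statistics from one video frame, trained with a simple regression loss.
import torch
import torch.nn as nn

class FrameToSoundStats(nn.Module):
    """Map an RGB frame to a fixed-length vector of sound summary statistics."""
    def __init__(self, stat_dim=512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, stat_dim)

    def forward(self, frames):                   # frames: (B, 3, H, W)
        h = self.features(frames).flatten(1)     # (B, 128)
        return self.head(h)                      # (B, stat_dim)

# Dummy batch: video frames paired with precomputed sound statistics.
frames = torch.randn(8, 3, 224, 224)
target_stats = torch.randn(8, 512)

model = FrameToSoundStats(stat_dim=512)
optim = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

optim.zero_grad()
pred = model(frames)
loss = nn.functional.mse_loss(pred, target_stats)  # regress the audio statistics
loss.backward()
optim.step()
```

In practice the target vector would be computed from the audio track accompanying each frame; here random tensors stand in for both inputs and targets.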


[Fig. 1 graphics: (a) Video frame; (b) Cochleagram, frequency channel vs. time (0 to 3.7 s); (c) Summary statistics, plotted per frequency channel and modulation channel.]

Fig. 1. Visual scenes are associated with characteristic sounds. Our goal is to take an image (a) and predict time-averaged summary statistics (c) of a cochleagram (b). The statistics we use are (clockwise): the response to a bank
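As a rough illustration of how time-averaged statistics like those in panel (c) might be computed from an audio track, the sketch below applies a band-pass filterbank, extracts amplitude envelopes, and summarizes them with per-band means, standard deviations, and band-to-band correlations. This is a simplified stand-in for the texture statistics of [25], not the paper's exact feature pipeline; the function name cochleagram_stats and all parameter values are assumptions.

```python
# Simplified, hypothetical cochleagram summary statistics (not the authors' code).
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def cochleagram_stats(wave, sr, n_bands=16, fmin=100.0, fmax=6000.0):
    # Log-spaced band edges as a crude approximation of a cochlear filterbank.
    edges = np.geomspace(fmin, fmax, n_bands + 1)
    envelopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        band = sosfiltfilt(sos, wave)
        envelopes.append(np.abs(hilbert(band)))   # amplitude envelope per band
    env = np.stack(envelopes)                     # (n_bands, n_samples)

    mean = env.mean(axis=1)                       # per-band mean energy
    std = env.std(axis=1)                         # per-band variability
    corr = np.corrcoef(env)                       # band-to-band correlations
    # Concatenate into one fixed-length target vector for the visual network.
    return np.concatenate([mean, std, corr[np.triu_indices(n_bands, k=1)]])

# Example: 3.7 seconds of noise at 16 kHz stands in for a video's audio track.
sr = 16000
wave = np.random.randn(int(3.7 * sr))
stats = cochleagram_stats(wave, sr)
print(stats.shape)   # (n_bands*2 + n_bands*(n_bands-1)/2,) = (152,)
```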