Ambient Sound Provides Supervision for Visual Learning
1 Massachusetts Institute of Technology, Cambridge, USA ([email protected])
2 Google Research, Cambridge, USA
Abstract. The sound of crashing waves, the roar of fast-moving cars – sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds.

Keywords: Sound · Convolutional networks · Unsupervised learning
1 Introduction
Sound conveys important information about the world around us – the bustle of a café tells us that there are many people nearby, while the low-pitched roar of engine noise tells us to watch for fast-moving cars [10]. Although sound is in some cases complementary to visual information, such as when we listen to something out of view, vision and hearing are often informative about the same structures in the world. Here we propose that, as a consequence of these correlations, concurrent visual and sound information provide a rich training signal that we can use to learn useful representations of the visual world. In particular, an algorithm trained to predict the sounds that occur within a visual scene might be expected to learn about objects and scene elements that are associated with salient and distinctive noises, such as people, cars, and flowing water. Such an algorithm might also learn to associate visual scenes with the ambient sound textures [25] that occur within them. It might, for example, associate the sound of wind with outdoor scenes, and the buzz of refrigerators with indoor scenes.

Although human annotations are indisputably useful for learning, they are expensive to collect. The correspondence between ambient sounds and video is, by contrast, available at no additional cost.
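To make the learning problem concrete, the sketch below shows one way such a predictor could be set up: a convolutional network takes a single video frame and regresses a fixed-length vector of sound summary statistics computed from the accompanying audio. This is a minimal illustration, not the paper's implementation; the architecture, the statistics dimensionality (STAT_DIM), and the plain MSE regression objective are all assumptions introduced here for clarity.

```python
# Hypothetical sketch: a CNN that maps a single video frame to a vector of
# sound summary statistics. Architecture, dimensions, and the regression
# loss are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

STAT_DIM = 502  # assumed length of the sound-statistics target vector

class FrameToSoundStats(nn.Module):
    def __init__(self, stat_dim=STAT_DIM):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((6, 6)),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Linear(4096, stat_dim),
        )

    def forward(self, frames):  # frames: (B, 3, H, W)
        return self.regressor(self.features(frames))

def training_step(model, frames, target_stats, optimizer):
    """One optimization step: regress precomputed audio statistics
    from the corresponding video frames."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(frames), target_stats)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this self-supervised setting the targets come from the audio track of the same video clip, so no human annotations are required.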
[Figure 1 panels: (a) Video frame, (b) Cochleagram, (c) Summary statistics; axes include time (sec.), frequency channel, modulation channel, and energy.]
Fig. 1. Visual scenes are associated with characteristic sounds. Our goal is to take an image (a) and predict time-averaged summary statistics (c) of a cochleagram (b). The statistics we use are (clockwise): the response to a bank of modulation filters, the mean and standard deviation of each frequency channel, and the correlations between frequency channels.
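For reference, the sketch below shows one rough way such time-averaged summary statistics could be computed from a cochleagram (a frequency-by-time array of non-negative subband envelopes). The modulation-band pooling, the number of bands, and the normalization are assumptions made for illustration and are not the paper's exact texture-statistics recipe.

```python
# Hypothetical sketch of time-averaged cochleagram summary statistics:
# per-channel mean and standard deviation, inter-channel correlations,
# and crude modulation-band energies. Filter design and normalization
# are assumptions, not the paper's exact recipe.
import numpy as np

def summary_statistics(cochleagram, num_mod_channels=10):
    """cochleagram: (F, T) array of non-negative subband envelopes."""
    F, T = cochleagram.shape

    # Per-channel mean and standard deviation over time.
    mean = cochleagram.mean(axis=1)
    std = cochleagram.std(axis=1)

    # Correlations between frequency channels (upper triangle only).
    corr = np.corrcoef(cochleagram)
    corr_feats = corr[np.triu_indices(F, k=1)]

    # Crude modulation energies: power spectrum of each envelope pooled
    # into log-spaced bands (a stand-in for a band-pass modulation
    # filter bank).
    spectra = np.abs(np.fft.rfft(cochleagram - mean[:, None], axis=1)) ** 2
    edges = np.unique(np.geomspace(1, spectra.shape[1] - 1,
                                   num_mod_channels + 1).astype(int))
    mod_energy = np.stack([spectra[:, lo:hi].mean(axis=1)
                           for lo, hi in zip(edges[:-1], edges[1:])], axis=1)

    return np.concatenate([mean, std, corr_feats, mod_energy.ravel()])
```

Concatenating these quantities yields the kind of fixed-length statistics vector that the network in the sketch above is trained to predict from a video frame.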