Ambient Sound Provides Supervision for Visual Learning



A. Owens et al.
Massachusetts Institute of Technology, Cambridge, USA ([email protected])
Google Research, Cambridge, USA

Abstract. The sound of crashing waves, the roar of fast-moving cars – sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds.

Keywords: Sound · Convolutional networks · Unsupervised learning

1 Introduction

Sound conveys important information about the world around us – the bustle of a café tells us that there are many people nearby, while the low-pitched roar of engine noise tells us to watch for fast-moving cars [10]. Although sound is in some cases complementary to visual information, such as when we listen to something out of view, vision and hearing are often informative about the same structures in the world. Here we propose that as a consequence of these correlations, concurrent visual and sound information provide a rich training signal that we can use to learn useful representations of the visual world. In particular, an algorithm trained to predict the sounds that occur within a visual scene might be expected to learn objects and scene elements that are associated with salient and distinctive noises, such as people, cars, and flowing water. Such an algorithm might also learn to associate visual scenes with the ambient sound textures [25] that occur within them. It might, for example, associate the sound of wind with outdoor scenes, and the buzz of refrigerators with indoor scenes.

Although human annotations are indisputably useful for learning, they are expensive to collect. The correspondence between ambient sounds and video is,
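To make the proposed training signal concrete, the following is a minimal sketch (not the authors' implementation) of a network that regresses a fixed-length vector of sound summary statistics from a single video frame. The architecture, tensor shapes, and plain L2 loss are illustrative assumptions, and the names FrameToSoundStats and stat_dim are hypothetical; the paper's own prediction target and objective may differ.

```python
# Hypothetical sketch: a small CNN predicts a vector of audio summary
# statistics from one video frame, trained with a simple regression loss.
import torch
import torch.nn as nn

class FrameToSoundStats(nn.Module):
    """Map an RGB frame to a fixed-length vector of sound summary statistics."""
    def __init__(self, stat_dim=512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, stat_dim)

    def forward(self, frames):                   # frames: (B, 3, H, W)
        h = self.features(frames).flatten(1)     # (B, 128)
        return self.head(h)                      # (B, stat_dim)

# Dummy batch: video frames paired with precomputed sound statistics.
frames = torch.randn(8, 3, 224, 224)
target_stats = torch.randn(8, 512)

model = FrameToSoundStats(stat_dim=512)
optim = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

optim.zero_grad()
pred = model(frames)
loss = nn.functional.mse_loss(pred, target_stats)  # regress the audio statistics
loss.backward()
optim.step()
```

In practice the target vector would be computed from the audio track accompanying each frame; here random tensors stand in for both inputs and targets.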


[Fig. 1 graphics: (a) Video frame; (b) Cochleagram, frequency channel vs. time (0 to 3.7 s); (c) Summary statistics, plotted per frequency channel and modulation channel.]

Fig. 1. Visual scenes are associated with characteristic sounds. Our goal is to take an image (a) and predict time-averaged summary statistics (c) of a cochleagram (b). The statistics we use are (clockwise): the response to a bank
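As a rough illustration of how time-averaged statistics like those in panel (c) might be computed from an audio track, the sketch below applies a band-pass filterbank, extracts amplitude envelopes, and summarizes them with per-band means, standard deviations, and band-to-band correlations. This is a simplified stand-in for the texture statistics of [25], not the paper's exact feature pipeline; the function name cochleagram_stats and all parameter values are assumptions.

```python
# Simplified, hypothetical cochleagram summary statistics (not the authors' code).
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def cochleagram_stats(wave, sr, n_bands=16, fmin=100.0, fmax=6000.0):
    # Log-spaced band edges as a crude approximation of a cochlear filterbank.
    edges = np.geomspace(fmin, fmax, n_bands + 1)
    envelopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        band = sosfiltfilt(sos, wave)
        envelopes.append(np.abs(hilbert(band)))   # amplitude envelope per band
    env = np.stack(envelopes)                     # (n_bands, n_samples)

    mean = env.mean(axis=1)                       # per-band mean energy
    std = env.std(axis=1)                         # per-band variability
    corr = np.corrcoef(env)                       # band-to-band correlations
    # Concatenate into one fixed-length target vector for the visual network.
    return np.concatenate([mean, std, corr[np.triu_indices(n_bands, k=1)]])

# Example: 3.7 seconds of noise at 16 kHz stands in for a video's audio track.
sr = 16000
wave = np.random.randn(int(3.7 * sr))
stats = cochleagram_stats(wave, sr)
print(stats.shape)   # (n_bands*2 + n_bands*(n_bands-1)/2,) = (152,)
```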