Cross-Modal Supervision for Learning Active Speaker Detection in Video
Abstract. In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion: facial expressions and gesticulations associated with speaking. We further improve a generic model for active speaker detection by learning person-specific models. Finally, we demonstrate the online adaptation of generic models learnt on one dataset to previously unseen people in a new dataset, again using audio (VAD) for weak supervision. The use of temporal continuity overcomes the lack of clean training data. We are the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset. This work can be seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision, by transferring knowledge from one modality to another.

Keywords: Active speaker detection · Cross-modal supervision · Weakly supervised learning · Online learning

1 Introduction
The problem of detecting active speakers in video is central to several applications. In video conferencing, knowing the active speaker allows the application to focus on and transmit the video of one among several people at a table. In a Human-Computer Interaction (HCI) setting, a robot or computer can use active speaker information to address the correct interlocutor. Active speaker detection is also part of the pipeline in video diarization, the automatic annotation of speakers, their speech and actions in video. Video diarization is useful for movie subtitling, multimedia retrieval and for video understanding in general.

Traditionally, visual active speaker detection has been done using lip motion detection [1–4]. However, facial expressions and gestures from the upper body, movement of the hands, etc., are all cues that can be utilized to assist with this task, as shown in [5], where better detection results are achieved using spatio-temporal features extracted from the entire upper body than with lip motion detection alone. Another powerful idea we borrow from [5] is the use of audio to supervise the training of a video-based active speaker detection system. In that work, a microphone array is used to obtain directional sound information (assumed to be speech), and based on this input, upper body tracks are associated with speak/non-speak labels. These labels are then used to train an active speaker classifier that operates on video only.

(This work was supported by the KU Leuven GOA project CAMETRON and iMinds.)
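To make the cross-modal supervision idea concrete, the sketch below illustrates how audio-derived weak labels could be attached to upper-body tracks and used to train a video-only classifier. It is a minimal illustration, not the paper's implementation: the input structure (per-frame VAD decisions, an estimated sound azimuth, track azimuths and precomputed descriptors) and the helper names are assumptions, and the linear SVM merely stands in for whatever classifier is trained on the spatio-temporal upper-body features.

```python
# Minimal sketch of audio-supervised training of a video-only classifier.
# Assumed inputs: per-frame VAD decisions, per-frame sound direction from a
# microphone array, and upper-body tracks with precomputed descriptors.
import numpy as np
from sklearn.svm import LinearSVC

def weak_labels_from_audio(tracks, vad, sound_azimuth, tol_deg=15.0):
    """Assign a weak speak (1) / non-speak (0) label to each track segment.

    tracks        : list of dicts with 'frames' (frame indices) and 'azimuth'
                    (track direction in degrees, one value per frame)
    vad           : boolean array, one entry per video frame (True = speech)
    sound_azimuth : array of estimated sound directions (degrees), per frame
    tol_deg       : angular tolerance for matching a track to the sound source
    """
    labels = []
    for tr in tracks:
        frames = np.asarray(tr["frames"])
        az = np.asarray(tr["azimuth"])
        speaking = vad[frames] & (np.abs(az - sound_azimuth[frames]) < tol_deg)
        # Majority vote over the segment gives one weak label per track.
        labels.append(int(speaking.mean() > 0.5))
    return np.array(labels)

def upper_body_features(tracks):
    """Stub: stand-in for real spatio-temporal upper-body descriptors."""
    return np.stack([tr["descriptor"] for tr in tracks])

def train_active_speaker_classifier(tracks, vad, sound_azimuth):
    X = upper_body_features(tracks)            # video features only
    y = weak_labels_from_audio(tracks, vad, sound_azimuth)  # audio-derived labels
    clf = LinearSVC(C=1.0)
    clf.fit(X, y)
    return clf                                 # classifier needs no audio at test time
```

The key point the sketch captures is that the audio modality is used only to generate (noisy) training labels; once trained, the classifier makes speak/non-speak decisions from video features alone.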
Fig. 1. Audio-based Voice Activity Detection (VAD) is used to weakly supervise the training of a video-based active speaker classifier.
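For completeness, the following is one simple way the frame-level speech/non-speech signal in Fig. 1 could be produced; it is a plain energy-threshold VAD included only as an illustration, and is not the VAD method used in the paper.

```python
# Illustrative energy-threshold VAD: one boolean decision per audio frame.
import numpy as np

def energy_vad(audio, sample_rate, frame_ms=30, threshold_db=-35.0):
    """Return a boolean per frame: True if frame log-energy exceeds a threshold.

    audio : 1-D float array in [-1, 1]; sample_rate in Hz.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-10)
    return energy_db > threshold_db
```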