Cross-Modal Supervision for Learning Active Speaker Detection in Video
Abstract. In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion: facial expressions and gesticulations associated with speaking. We further improve a generic model for active speaker detection by learning person-specific models. Finally, we demonstrate the online adaptation of generic models learnt on one dataset to previously unseen people in a new dataset, again using audio (VAD) for weak supervision. The use of temporal continuity overcomes the lack of clean training data. We are the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset. This work can be seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision, by transferring knowledge from one modality to another.

Keywords: Active speaker detection · Cross-modal supervision · Weakly supervised learning · Online learning

1 Introduction
The problem of detecting active speakers in video is central to several applications. In video conferencing, knowing the active speaker allows the application to focus on and transmit the video of one among several people at a table. In a Human-Computer Interaction (HCI) setting, a robot or computer can use active speaker information to address the correct interlocutor. Active speaker detection is also part of the pipeline in video diarization, the automatic annotation of speakers, their speech and actions in video. Video diarization is useful for movie subtitling, multimedia retrieval and for video understanding in general.

Traditionally, visual active speaker detection has been done using lip motion detection [1–4]. However, facial expressions and gestures from the upper body, movement of the hands, etc., are all cues that can be utilized to assist with this task, as shown in [5], where better detection results are achieved using spatio-temporal features extracted from the entire upper body than with lip motion detection alone. Another powerful idea we borrow from [5] is the use of audio to supervise the training of a video-based active speaker detection system. In that work, a microphone array is used to obtain directional sound information (assumed to be speech), and based on this input, upper body tracks are associated with speak/non-speak labels. These labels are then used to train an active speaker classifier that operates on video only.

(This work was supported by the KU Leuven GOA project CAMETRON and iMinds.)
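To make the cross-modal supervision idea concrete, the sketch below illustrates how audio-derived weak labels could be attached to upper-body tracks and used to train a video-only classifier. It is a minimal illustration, not the paper's implementation: the input structure (per-frame VAD decisions, an estimated sound azimuth, track azimuths and precomputed descriptors) and the helper names are assumptions, and the linear SVM merely stands in for whatever classifier is trained on the spatio-temporal upper-body features.

```python
# Minimal sketch of audio-supervised training of a video-only classifier.
# Assumed inputs: per-frame VAD decisions, per-frame sound direction from a
# microphone array, and upper-body tracks with precomputed descriptors.
import numpy as np
from sklearn.svm import LinearSVC

def weak_labels_from_audio(tracks, vad, sound_azimuth, tol_deg=15.0):
    """Assign a weak speak (1) / non-speak (0) label to each track segment.

    tracks        : list of dicts with 'frames' (frame indices) and 'azimuth'
                    (track direction in degrees, one value per frame)
    vad           : boolean array, one entry per video frame (True = speech)
    sound_azimuth : array of estimated sound directions (degrees), per frame
    tol_deg       : angular tolerance for matching a track to the sound source
    """
    labels = []
    for tr in tracks:
        frames = np.asarray(tr["frames"])
        az = np.asarray(tr["azimuth"])
        speaking = vad[frames] & (np.abs(az - sound_azimuth[frames]) < tol_deg)
        # Majority vote over the segment gives one weak label per track.
        labels.append(int(speaking.mean() > 0.5))
    return np.array(labels)

def upper_body_features(tracks):
    """Stub: stand-in for real spatio-temporal upper-body descriptors."""
    return np.stack([tr["descriptor"] for tr in tracks])

def train_active_speaker_classifier(tracks, vad, sound_azimuth):
    X = upper_body_features(tracks)            # video features only
    y = weak_labels_from_audio(tracks, vad, sound_azimuth)  # audio-derived labels
    clf = LinearSVC(C=1.0)
    clf.fit(X, y)
    return clf                                 # classifier needs no audio at test time
```

The key point the sketch captures is that the audio modality is used only to generate (noisy) training labels; once trained, the classifier makes speak/non-speak decisions from video features alone.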
Fig. 1. Audio-based Voice Activity Detection (VAD) is used to weakly supervise the training of a video-based active speaker classifier.
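For completeness, the following is one simple way the frame-level speech/non-speech signal in Fig. 1 could be produced; it is a plain energy-threshold VAD included only as an illustration, and is not the VAD method used in the paper.

```python
# Illustrative energy-threshold VAD: one boolean decision per audio frame.
import numpy as np

def energy_vad(audio, sample_rate, frame_ms=30, threshold_db=-35.0):
    """Return a boolean per frame: True if frame log-energy exceeds a threshold.

    audio : 1-D float array in [-1, 1]; sample_rate in Hz.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-10)
    return energy_db > threshold_db
```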