Separation of Audio-Visual Speech Sources: A New Approach Exploiting the Audio-Visual Coherence of Speech Stimuli
David Sodoyer
Institut de la Communication Parlée, Institut National Polytechnique de Grenoble, Université Stendhal, CNRS UMR 5009, ICP, INPG, 46 avenue Félix Viallet, 38031 Grenoble Cedex 1, France
Email: [email protected]

Jean-Luc Schwartz
Institut de la Communication Parlée, Institut National Polytechnique de Grenoble, Université Stendhal, CNRS UMR 5009, ICP, INPG, 46 avenue Félix Viallet, 38031 Grenoble Cedex 1, France
Email: [email protected]

Laurent Girin
Institut de la Communication Parlée, Institut National Polytechnique de Grenoble, Université Stendhal, CNRS UMR 5009, ICP, INPG, 46 avenue Félix Viallet, 38031 Grenoble Cedex 1, France
Email: [email protected]

Jacob Klinkisch
Institut de la Communication Parlée, Institut National Polytechnique de Grenoble, Université Stendhal, CNRS UMR 5009, ICP, INPG, 46 avenue Félix Viallet, 38031 Grenoble Cedex 1, France
Email: [email protected]
Christian Jutten
Laboratoire des Images et des Signaux, Institut National Polytechnique de Grenoble, Université Joseph Fourier, CNRS UMR 5083, LIS, INPG, 46 avenue Félix Viallet, 38031 Grenoble Cedex 1, France
Email: [email protected]

Received 19 October 2001 and in revised form 7 May 2002

We present a new approach to the source separation problem for multiple speech signals. The method is based on automatic lipreading: the objective is to extract an acoustic speech signal from a mixture of other acoustic signals by exploiting its coherence with the speaker's lip movements. We consider the case of an additive stationary mixture of decorrelated sources, with no further assumptions of independence or non-Gaussianity. First, we present a theoretical framework showing that it is indeed possible to separate a source when some of its spectral characteristics are provided to the system. We then address the case of audio-visual sources and show that, once a statistical model of the joint probability of visual and spectral audio inputs has been learnt to quantify audio-visual coherence, separation can be achieved by maximizing this probability. Finally, we present separation results on a corpus of vowel-plosive-vowel sequences uttered by a single speaker and embedded in a mixture of other voices. Separation is quite good for mixtures of 2, 3, and 5 sources. These results, while very preliminary, are encouraging, and we discuss their potential complementarity with traditional audio-only separation and enhancement techniques.

Keywords and phrases: blind source separation, lipreading, audio-visual speech processing.
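The principle summarized above (choosing unmixing coefficients that maximize a learnt audio-visual probability) can be illustrated with a deliberately simplified sketch. The toy model below is not the paper's algorithm: the learnt joint probability of lip parameters and audio spectrum is replaced by a hand-written spectral score, in which a hypothetical visual cue is assumed to predict a low-frequency-dominated target spectrum, and the unmixing of a two-sensor additive mixture is found by grid search over a rotation angle.

```python
import numpy as np

n = 4096
t = np.arange(n)

# Toy "target" source: low-frequency, vowel-like content.
s1 = np.sin(2 * np.pi * 0.01 * t) + 0.3 * np.sin(2 * np.pi * 0.02 * t)
# Toy interfering source: high-frequency content.
s2 = np.sin(2 * np.pi * 0.2 * t + 1.0)

# Additive stationary mixture: two sensors, two sources.
A = np.array([[1.0, 0.8],
              [0.6, 1.0]])
x = A @ np.vstack([s1, s2])

def band_ratio(y):
    """Fraction of spectral energy below an arbitrary low-frequency cutoff."""
    spec = np.abs(np.fft.rfft(y)) ** 2
    cut = len(spec) // 8
    return spec[:cut].sum() / spec.sum()

# Stand-in for the learnt audio-visual model: the hypothetical lip shape
# predicts a low-frequency-dominated spectrum, so the score of a candidate
# output is simply its low-band energy ratio.
def av_score(y):
    return band_ratio(y)

# Grid search over unmixing directions w = (cos th, sin th); keep the
# direction whose output best matches the visually predicted spectrum.
thetas = np.linspace(0.0, np.pi, 721)
best = max(thetas, key=lambda th: av_score(np.cos(th) * x[0] + np.sin(th) * x[1]))
y = np.cos(best) * x[0] + np.sin(best) * x[1]

# Correlation with the true target indicates separation quality.
corr = abs(np.corrcoef(y, s1)[0, 1])
print(f"correlation with target source: {corr:.3f}")
```

In the paper itself the score would be the learnt joint probability of measured lip parameters and the audio spectrum, evaluated on real speech rather than sinusoids; this sketch only shows why a spectral prediction derived from the visual channel is enough to pick out one source from an additive mixture.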
1. INTRODUCTION

There exists an intrinsic coherence, and even a complementarity, between audition and vision in speech perception [1]. Indeed, the phonetic contrasts that are least robust to acoustic noise in auditory perception are the most visible ones, both for consonants and vowels [2]. Thus, visual cues can compensate to a certain extent