Audiovisual Speech Synchrony Measure: Application to Biometrics
Research Article

Hervé Bredin and Gérard Chollet

Département Traitement du Signal et de l'Image, École Nationale Supérieure des Télécommunications, CNRS/LTCI, 46 rue Barrault, 75013 Paris Cedex 13, France

Received 18 August 2006; Accepted 18 March 2007
Recommended by Ebroul Izquierdo

Speech is a means of communication which is intrinsically bimodal: the audio signal originates from the dynamics of the articulators. This paper reviews recent work in the field of audiovisual speech and, more specifically, techniques developed to measure the level of correspondence between audio and visual speech. It covers the most common audio and visual speech front-end processing, the transformations performed on audio, visual, or joint audiovisual feature spaces, and the actual measures of correspondence between audio and visual speech. Finally, the use of a synchrony measure for biometric identity verification based on talking faces is evaluated on the BANCA database.

Copyright © 2007 H. Bredin and G. Chollet. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. INTRODUCTION
Speech is a means of communication which is intrinsically bimodal: the audio signal originates from the dynamics of the articulators. Both audible and visible speech cues carry relevant information. Though the first automatic speech-based recognition systems relied only on the auditory modality (whether for speech recognition or speaker verification), it is well known that the visual counterpart can be a great help, especially under adverse conditions [1]. In noisy environments, for example, audiovisual speech recognizers perform better than audio-only systems. Using visual speech as a second source of information for speaker verification has also been investigated, even though the resulting improvements are not always significant.

This review aims to complement existing surveys of audiovisual speech processing. It does not address audiovisual speech recognition or speaker verification: these two issues are already covered in [2, 3]. Nor does it tackle the estimation of visual speech from its acoustic counterpart (or the reverse): the reader is referred to [4, 5], which show that linear methods can lead to very good estimates. This paper focuses on measuring the correspondence between acoustic and visual speech. How correlated are the two signals? Can we detect a lack of correspondence between them? Is it possible to decide, putting aside any biometric method, among a few people appearing in a video, who is talking?
Section 2 gives an overview of acoustic and visual front-end processing. These front-ends are often very similar to those used for speech recognition and speaker verification, though a tendency to simplify them as much as possible has been noticed. Moreover, lin
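As a point of reference for the acoustic front-end mentioned above, the sketch below extracts mel-frequency cepstral coefficients (MFCCs), the de facto standard acoustic features in speech and speaker recognition. The use of the librosa library and the 25 ms / 10 ms windowing are assumptions made for illustration; the paper does not prescribe a particular implementation.

import librosa  # assumed tooling; not specified in the paper
import numpy as np

def acoustic_front_end(wav_path, n_mfcc=13):
    # Load at 16 kHz, then compute MFCCs over 25 ms windows
    # (n_fft=400) with a 10 ms hop (hop_length=160); these are
    # common defaults, not values taken from the paper.
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)
    return mfcc.T  # shape: (n_frames, n_mfcc)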