Statistical Lip-Appearance Models Trained Automatically Using Audio Information


Philippe Daubias
Laboratoire d’Informatique de l’Université du Maine (LIUM), Institut d’Informatique Claude Chappe, F-72085 Le Mans Cedex 9, France
Laboratoire d’Informatique Graphique Image et Modélisation (LIGIM), Bâtiment 710, 8, bd Niels Bohr, F-69622 Villeurbanne Cedex, France
Email: [email protected]

Paul Deléglise
Laboratoire d’Informatique de l’Université du Maine (LIUM), Institut d’Informatique Claude Chappe, F-72085 Le Mans Cedex 9, France
Email: [email protected]

Received 1 November 2001 and in revised form 19 June 2002

We aim to model the appearance of the lower face region to assist visual feature extraction for audio-visual speech processing applications. In this paper, we present a neural-network-based statistical appearance model of the lips, which classifies pixels as belonging to the lips, skin, or inner mouth classes. Training this model requires labeled examples, and we propose to label images automatically by employing a lip-shape model and a red-hue energy function. To improve the performance of lip-tracking, we propose to use blue marked-up image sequences of the same subject uttering the same sentences as in the natural, non-marked-up ones. The lip shapes, which are easily extracted from the blue images, are then mapped to the natural images using acoustic information. The lip-shape estimates obtained simplify lip-tracking on the natural images, as they reduce the dimensionality of the parameter space in the red-hue energy minimization, thus yielding better estimates of contour shape and location. We applied the proposed method to a small audio-visual database of three subjects, achieving pixel-classification errors of around 6%, compared to 3% for hand-placed contours and 20% for filtered red-hue.

Keywords and phrases: lip-appearance model, lip-shape model, automatic lip-region labeling, artificial neural networks, dynamic time warping, audio-visual corpora.
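The mapping of lip shapes from the blue sequences to the natural ones using acoustic information can be illustrated with a short sketch. It is not the implementation used in this work. It assumes that each video frame has one acoustic feature vector, that the two recordings are aligned with textbook dynamic time warping (DTW) under a Euclidean local distance, and that lip shapes are plain parameter vectors. The names `dtw_path` and `transfer_shapes` and the toy data are hypothetical.

```python
import math


def dtw_path(blue_audio, natural_audio):
    """Align two sequences of acoustic feature vectors with classic DTW.

    Returns a list of (blue_frame, natural_frame) index pairs.
    Assumption: Euclidean local distance and unit steps (diagonal, up, left).
    """
    n, m = len(blue_audio), len(natural_audio)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(blue_audio[i - 1], natural_audio[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end of both sequences to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (cost[i - 1][j], i - 1, j),          # advance in the blue sequence only
            (cost[i][j - 1], i, j - 1),          # advance in the natural sequence only
        ]
        _, i, j = min(candidates)
    path.reverse()
    return path


def transfer_shapes(path, blue_shapes, n_natural):
    """Give each natural frame the mean lip-shape vector of its aligned blue frames."""
    buckets = [[] for _ in range(n_natural)]
    for blue_idx, natural_idx in path:
        buckets[natural_idx].append(blue_shapes[blue_idx])
    return [[sum(c) / len(c) for c in zip(*bucket)] for bucket in buckets]


# Toy data (hypothetical): 1-D "acoustic features" and 2-D "lip-shape parameters".
blue_audio = [(0.0,), (1.0,), (2.0,), (3.0,)]
natural_audio = [(0.0,), (0.0,), (1.0,), (2.0,), (3.0,)]
blue_shapes = [[10.0, 1.0], [20.0, 2.0], [30.0, 3.0], [40.0, 4.0]]

path = dtw_path(blue_audio, natural_audio)
natural_shape_estimates = transfer_shapes(path, blue_shapes, len(natural_audio))
```

Averaging the blue-frame shapes aligned to each natural frame is only one possible choice. The transferred shapes would serve as initial estimates for lip-tracking on the natural images, not as final contours.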

1. INTRODUCTION

Today, automatic speech recognition (ASR) works well for several applications, but performance depends strongly on the specificity of the task and on the type and level of surrounding noise. To make ASR systems more robust to noise, one may, for example, use multiband systems [1], higher-level (linguistic) information [2], or visual information, which is complementary to the audio information. Since McGurk’s experiments [3], which demonstrated the importance of visual information in human speech perception, the visual modality has been successfully used to improve the performance and robustness of ASR [4, 5, 6, 7, 8, 9], speaker recognition [10, 11], and other speech applications [12].

Using visual information in unconstrained conditions requires accurate visual feature extraction, regardless of the visual features used:

(i) pixel-based (data-driven) features: images are fed directly into a speech recognition system [4, 5, 8, 13], after applying a few transformations or normalizations to the images (fixed-size ROI (region of int