2D Deformable Models for Visual Speech Analysis



Istituto per la Ricerca Scientifica e Tecnologica, I-38050 Povo, Trento, Italy; Università degli Studi di Milano, Milano, Italy

Abstract. A scheme for describing the mouth of a speaker in color image sequences is proposed which is based on a parametric 2D model of the lips. Key information for parameter estimation is extracted from chrominance analysis. A detailed description of the techniques employed is given, and some preliminary results are shown.

1 Introduction

The mouth is a part of the human body which presents high interpersonal variability. Moreover, the lips of a speaking person undergo rapid and dramatic shape changes. To be suitable for visual speech analysis, a model of the mouth should be sufficiently flexible to account for these variabilities, yet rigid enough to remain unaffected by irrelevant details. The aim of this paper is to illustrate a series of algorithms that can be used to extract visual speech information from sequences of color images. Low-level processing techniques based on chrominance analysis are first considered, and a scheme is then proposed in which color information determines optimal values for the parameters of a 2D deformable model of the mouth. In particular, the positions of six points, corresponding to the lateral extrema of the mouth and the inner and outer apices of the lips, are recovered in a robust and accurate way. Albeit still under development, the method is promising, and appears to work fairly well on the available database of sequences.

2 Chrominance Analysis

Before entering a more detailed presentation, let us describe the basic techniques employed to extract low-level visual information from the images. In gray-level images taken under diffuse light conditions, the external border of the lips is rather elusive, which makes techniques based on luminance differences (gradient and the like) largely ineffective.

D. G. Stork et al. (eds.), Speechreading by Humans and Machines © Springer-Verlag Berlin Heidelberg 1996

On the contrary, the information carried by color is less sensitive to lighting variations. We start by analyzing the hue of the lip region, and work in the HSL (Hue, Saturation, Luminance) color space, in which information related to color and luminance is disentangled.

Detection of red-prevalent regions. Although lips are prevalently red, they are by no means pure red. Also, slight hue variations occur at different pixels across the lip area, and an even wider variability is to be expected from person to person. The first step in extracting the lip region from the image consists in filtering the hue component of the image with the following weight function [Gong and Sakauchi (1995)]:

    f(h) = 1 - ((h - h0) / w)^2,   if |h - h0| <= w
         = 0,                      otherwise            (1)

where h represents the hue value at each pixel, h0 is the "center" of the filter, that is, the hue one wants to emphasize the most, and w is a parameter controlling the distance from h0 beyond which the response of the filter falls to zero. When h0 = 1/3, the function f implements a measure of the dist
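As a sketch, the weight function of Eq. (1) can be implemented as follows. This is a minimal NumPy version, not the authors' code; the function name and the assumption that hue is normalized to [0, 1] are illustrative choices.

```python
import numpy as np

def hue_weight(h, h0, w):
    """Weight function of Eq. (1): emphasizes hues near the center h0.

    h  : scalar or array of hue values (assumed normalized to [0, 1])
    h0 : "center" of the filter, the hue to emphasize the most
    w  : distance from h0 beyond which the response falls to zero
    """
    h = np.asarray(h, dtype=float)
    d = np.abs(h - h0)
    # Quadratic falloff inside the band |h - h0| <= w, zero outside.
    return np.where(d <= w, 1.0 - (d / w) ** 2, 0.0)
```

Applied to the hue channel of an image, this yields a per-pixel weight map in [0, 1] in which red-prevalent regions (hue near h0) stand out, as the text describes.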