Active Shape Models for Visual Speech Feature Extraction

D. G. Stork et al. (eds.), Speechreading by Humans and Machines. © Springer-Verlag Berlin Heidelberg 1996

Abstract. Most approaches for lip modelling are based on heuristic constraints imposed by the user. We describe the use of Active Shape Models for extracting visual speech features for use by automatic speechreading systems, where the deformation of the lip model as well as the image search is based on a priori knowledge learned from a training set. We demonstrate the robustness and accuracy of the technique for locating and tracking lips on a database consisting of a broad variety of talkers and lighting conditions.

Keywords. lip locating, lip tracking, learned model, learned features

1. Introduction

While mainstream speech recognition research has concentrated almost exclusively on the acoustic speech signal, it is well known that humans use visual information from the talker's face (mainly lip movements) in addition to the acoustic signal for speech perception. Whereas several well-known methods exist for representing the acoustic features of speech, it is still not fully understood (i) which visual features are important for speechreading, (ii) how to extract them, and (iii) how to combine them with the acoustic information. It is generally agreed that most visual information is contained in the lips, especially the inner lip contours, and to a minor extent in the visibility of the teeth and tongue (Montgomery and Jackson 1983, Summerfield 1992).

The main difficulty in incorporating information about lip movements into an acoustic speech recognition system is to find a robust and accurate method for extracting the important visual speech features. The technique should be able to locate and track lips in the faces of various talkers and should be robust to lighting, rotation, scale and translation (LRST). The extracted features should be sensitive to the variation that accounts for different visemes and insensitive to the variation that accounts for linguistic variability and image variability (LRST). Here we describe the use of Active Shape Models (ASMs), introduced by Cootes et al. (1994), for robust detection, tracking and parameterisation of visual speech information.

In comparison to previous contour-tracking approaches such as deformable templates (Yuille, Hallinan and Cohen 1992, Hennecke, Prasad and Stork 1994) or snakes (Kass, Witkin and Terzopoulos 1988, Bregler and Omohundro 1994), ASMs are a statistically based technique that almost completely avoids constraints, thresholds or penalties imposed by the user. During image search, the model is only allowed to deform to shapes similar to those seen in the training set. Whereas deformable templates and snakes align to strong gradients when locating the object, regardless of the object's actual appearance in the image, ASMs learn the typical grey-level appearance perpendicular to the contour from the training set and use it for image search.
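
To make these two ideas concrete, here is a minimal sketch (ours, not the chapter's implementation; the function names, the NumPy representation, and the ±3 standard deviation limit are illustrative assumptions) of how an ASM search step can match learned grey-level statistics along a landmark's normal and then keep the shape parameters within the training distribution:

```python
import numpy as np

def best_landmark_shift(strip, g_mean, S_inv, search):
    """Slide a window the length of the learned profile along a longer
    grey-level strip sampled perpendicular to the contour; return the
    offset whose Mahalanobis distance to the training statistics is
    smallest.  `strip` must hold len(g_mean) + 2 * search samples."""
    k = g_mean.shape[0]
    costs = []
    for s in range(2 * search + 1):
        d = strip[s:s + k] - g_mean
        costs.append(d @ S_inv @ d)          # Mahalanobis distance
    return int(np.argmin(costs)) - search    # signed move along the normal

def clamp_shape_params(b, eigvals, limit=3.0):
    """Restrict each shape parameter to +/- `limit` standard deviations
    of its training-set variance, so the model can only deform to shapes
    similar to those seen during training."""
    bound = limit * np.sqrt(eigvals)
    return np.clip(b, -bound, bound)
```

In a fitting loop, each landmark would be moved along its normal by the returned shift, after which the suggested shape is projected into the learned shape space and its parameters clamped before the next iteration.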

2. Active Shape Models

Active Shape Models are flexible models which represent an object by a set of labelled points.
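
As a companion sketch (again ours, not the chapter's code; `shapes` is assumed to be an array of already-aligned landmark coordinates, one training example per row), the statistical shape model underlying an ASM can be built by applying principal component analysis to the labelled training shapes:

```python
import numpy as np

def build_shape_model(shapes, var_kept=0.95):
    """Build a point distribution model from aligned training shapes.
    `shapes` has one row per example: the concatenated x/y coordinates
    of the labelled lip landmarks.  Returns the mean shape, the matrix
    of principal modes P, and their variances, so any plausible lip
    shape can be approximated as x = x_mean + P @ b."""
    x_mean = shapes.mean(axis=0)
    cov = np.cov(shapes, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)    # returned in ascending order
    order = np.argsort(eigvals)[::-1]         # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # keep the smallest number of modes explaining `var_kept` of variance
    n_modes = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(),
                                  var_kept)) + 1
    return x_mean, eigvecs[:, :n_modes], eigvals[:n_modes]
```

Fitting would then alternate between the local grey-level search sketched above and re-estimating the parameters as b = P.T @ (x - x_mean), with b clamped to stay within the range of shapes seen in training.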