Perception of Synthetic Visual Speech

Abstract. We report here on an experiment comparing visual recognition of monosyllabic words produced either by our computer-animated talker or a human talker. Recognition of the synthetic talker is reasonably close to that of the human talker, but a significant distance remains to be covered and we discuss improvements to the synthetic phoneme specifications. In an additional experiment using the same paradigm, we compare perception of our animated talker with a similarly generated point-light display, finding significantly worse performance for the latter for a number of viseme classes. We conclude with some ideas for future progress and briefly describe our new animated tongue.

Keywords. Visible speech synthesis, coarticulation, speechreading, point-light displays, text-to-speech

1 Introduction

Much of what we know about speech perception has come from experimental studies using synthetic speech. Although some research questions can be answered in part with natural speech stimuli, our overall progress in analyzing human speech perception has been critically dependent on the use of synthetic speech. Extending this approach to the visual side of speech, we have developed a high quality visual speech synthesizer, a computer-animated talking face incorporating coarticulation based on a model of speech production using rules describing the relative dominance of speech segments (Cohen & Massaro, 1993). Our goals for this technology include gaining an understanding of the visual information that is used in speechreading, how this information is combined with auditory information, how such information may be used in automatic speech recognition (ASR) systems, and its use as an improved channel for man/machine communication.

An essential component of the development process is an evaluation of the synthesis quality. This analysis of the facial synthesis may be seen as a validation process. By validation, we mean a measure of the degree to which our synthetic faces mimic the behavior of real faces. Confusion matrices and standard tests of intelligibility are being utilized to assess the quality of the facial synthesis relative to the natural face. These same results will also highlight those characteristics of the talking face that could be made more informative.
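As a rough illustration of the validation idea described above, the following Python sketch tabulates a row-normalized confusion matrix from (presented, reported) response pairs and compares per-class accuracy for a natural and a synthetic talker. The function names, the class labels, and the response data are all hypothetical and were invented for this example; they are not the materials or the analysis procedure of the study.

```python
from collections import Counter


def confusion_matrix(trials, classes):
    """Build a row-normalized confusion matrix.

    trials  -- iterable of (presented, reported) class labels
    classes -- ordered list of class labels to tabulate
    Returns {presented: {reported: proportion}}.
    """
    counts = Counter(trials)
    matrix = {}
    for presented in classes:
        # Responses outside `classes` are left out of the row total.
        row_total = sum(counts[(presented, reported)] for reported in classes)
        matrix[presented] = {
            reported: (counts[(presented, reported)] / row_total if row_total else 0.0)
            for reported in classes
        }
    return matrix


def proportion_correct(matrix):
    """Per-class proportion correct: the diagonal of the matrix."""
    return {label: row[label] for label, row in matrix.items()}


# Hypothetical responses, invented for illustration only (not data from the study).
classes = ["b", "f", "w"]
human = [("b", "b"), ("b", "b"), ("f", "f"), ("f", "f"), ("w", "w"), ("w", "w")]
synthetic = [("b", "b"), ("b", "w"), ("f", "f"), ("f", "f"), ("w", "w"), ("w", "b")]

human_correct = proportion_correct(confusion_matrix(human, classes))
synthetic_correct = proportion_correct(confusion_matrix(synthetic, classes))

# Classes where the synthetic talker trails the human talker would be
# candidates for closer inspection of the synthetic specifications.
deficits = {c: human_correct[c] - synthetic_correct[c] for c in classes}
print(deficits)
```

The off-diagonal cells of each matrix show which classes are mistaken for which, and this is the kind of information that can point to the characteristics of the synthetic face that could be made more informative.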

D. G. Stork et al. (eds.), Speechreading by Humans and Machines © Springer-Verlag Berlin Heidelberg 1996


2 Visual Speech Synthesis Techniques

Two general strategies for generating highly realistic full facial displays have been employed: musculoskeletal models and parametrically controlled polygon topology. Using the first basic strategy, human faces have been made by constructing a computational model for the muscle and bone structures of the face (e.g. Platt & Badler, 1981; Waters, 1987; Waters & Terzopoulos, 1991). At the foundation of this type of model is an approximation of the skull and jaw including the jaw pivot. Simulated muscle tissues and their insertions are placed over the skull. This requires complex elastic models for the compressible tiss