Perception of Synthetic Visual Speech
Abstract. We report here on an experiment comparing visual recognition of monosyllabic words produced either by our computer-animated talker or a human talker. Recognition of the synthetic talker is reasonably close to that of the human talker, but a significant distance remains to be covered, and we discuss improvements to the synthetic phoneme specifications. In an additional experiment using the same paradigm, we compare perception of our animated talker with a similarly generated point-light display, finding significantly worse performance for the latter for a number of viseme classes. We conclude with some ideas for future progress and briefly describe our new animated tongue.

Keywords. Visible speech synthesis, coarticulation, speechreading, point-light displays, text-to-speech

1 Introduction
Much of what we know about speech perception has come from experimental studies using synthetic speech. Although some research questions can be answered in part with natural speech stimuli, our overall progress in analyzing human speech perception has been critically dependent on the use of synthetic speech. Extending this approach to the visual side of speech, we have developed a high-quality visual speech synthesizer, a computer-animated talking face, incorporating coarticulation based on a model of speech production using rules describing the relative dominance of speech segments (Cohen & Massaro, 1993). Our goals for this technology include gaining an understanding of the visual information that is used in speechreading, how this information is combined with auditory information, how such information may be used in automatic speech recognition (ASR) systems, and its use as an improved channel for man/machine communication.

An essential component of the development process is an evaluation of the synthesis quality. This analysis of the facial synthesis may be seen as a validation process. By validation, we mean a measure of the degree to which our synthetic faces mimic the behavior of real faces. Confusion matrices and standard tests of intelligibility are being utilized to assess the quality of the facial synthesis relative to the natural face. These same results will also highlight those characteristics of the talking face that could be made more informative.
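The dominance model cited above (Cohen & Massaro, 1993) computes each facial control parameter as a dominance-weighted average of per-segment targets, with dominance falling off with temporal distance from the segment. The following is a minimal sketch of that idea, not the authors' implementation; the negative-exponential dominance shape is taken from the published model, but the function names, parameter values, and the example lip-rounding targets are illustrative assumptions.

```python
import math

def dominance(t, center, alpha, theta, c=1.0):
    """Dominance of one segment at time t: a negative exponential
    in the temporal distance from the segment's center.
    alpha scales peak dominance; theta controls the falloff rate."""
    return alpha * math.exp(-theta * abs(t - center) ** c)

def blend_track(t, segments):
    """Value of one facial parameter at time t: the dominance-weighted
    average of the segments' targets.
    segments: list of (target, center, alpha, theta) tuples."""
    num = sum(dominance(t, c0, a, th) * tgt for tgt, c0, a, th in segments)
    den = sum(dominance(t, c0, a, th) for tgt, c0, a, th in segments)
    return num / den

# Hypothetical lip-rounding targets: a rounded segment at 0.10 s
# followed by an unrounded segment at 0.30 s.
segs = [(1.0, 0.10, 1.0, 20.0), (0.0, 0.30, 1.0, 20.0)]
midpoint = blend_track(0.20, segs)  # equal dominance from both segments
```

Because the weights overlap in time, neighboring segments pull the parameter track toward their own targets before and after their centers, which is how the model produces coarticulation.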
D. G. Stork et al. (eds.), Speechreading by Humans and Machines © Springer-Verlag Berlin Heidelberg 1996
2 Visual Speech Synthesis Techniques

Two general strategies for generating highly realistic full facial displays have been employed: musculoskeletal models and parametrically controlled polygon topology. Using the first strategy, human faces have been modeled by constructing a computational model of the muscle and bone structures of the face (e.g., Platt & Badler, 1981; Waters, 1987; Waters & Terzopoulos, 1991). At the foundation of this type of model is an approximation of the skull and jaw, including the jaw pivot. Simulated muscle tissues and their insertions are placed over the skull. This requires complex elastic models for the compressible tissues.
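The second strategy, parametrically controlled polygon topology, can be illustrated with a minimal sketch: each control parameter owns a set of per-vertex displacement vectors, and the rendered mesh is the neutral shape plus each parameter's scaled displacements. This is an assumption-laden toy, not any particular published system; the function name and data layout are invented for illustration.

```python
def apply_parameter(vertices, deltas, value):
    """Displace each neutral vertex by value * its per-parameter delta.
    vertices: list of (x, y, z) neutral positions.
    deltas:   list of (dx, dy, dz) displacements for one parameter
              (e.g., a hypothetical 'jaw opening' control) at value 1.0.
    value:    current setting of the parameter."""
    return [(x + value * dx, y + value * dy, z + value * dz)
            for (x, y, z), (dx, dy, dz) in zip(vertices, deltas)]

# Two-vertex toy mesh: only the first vertex responds to this parameter.
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
jaw_deltas = [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0)]
half_open = apply_parameter(neutral, jaw_deltas, 0.5)
```

Several parameters can be applied in sequence (or their displacements summed), which is why parametric control is cheap compared with simulating elastic tissue: no physics is solved, only weighted vertex offsets.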