Model-Based Synthesis of Visual Speech Movements from 3D Video
Research Article

James D. Edge, Adrian Hilton, and Philip Jackson
Centre for Vision, Speech and Signal Processing, The University of Surrey, Surrey GU2 7XH, UK

Correspondence should be addressed to James D. Edge, [email protected]

Received 1 March 2009; Revised 30 July 2009; Accepted 23 September 2009

Recommended by Gérard Bailly

We describe a method for the synthesis of visual speech movements using a hybrid unit selection/model-based approach. Speech lip movements are captured using a 3D stereo face capture system and split into phonetic units. A dynamic parameterisation of this data is constructed which maintains the relationship between lip shapes and velocities; within this parameterisation a model of how lips move is built and is used in the animation of visual speech movements from speech audio input. The mapping from audio parameters to lip movements is disambiguated by selecting only the most similar stored phonetic units to the target utterance during synthesis. By combining properties of model-based synthesis (e.g., HMMs, neural nets) with unit selection we improve the quality of our speech synthesis.

Copyright © 2009 James D. Edge et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
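To make the hybrid idea in the abstract concrete, the sketch below illustrates the unit-selection half of such a pipeline: stored phonetic units pair audio features with captured lip-parameter trajectories, and synthesis picks, for each target unit, the stored realisation whose audio features are closest, then concatenates the corresponding lip trajectories. All class and function names, feature dimensions, and the Euclidean distance measure are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class PhoneticUnit:
    """A captured unit: its phone label, audio features, and lip trajectory."""
    def __init__(self, phone, audio_features, lip_trajectory):
        self.phone = phone                                   # e.g. "/p/"
        self.audio_features = np.asarray(audio_features)     # e.g. a mean acoustic feature vector
        self.lip_trajectory = np.asarray(lip_trajectory)     # frames x lip parameters

def select_unit(target_audio, candidates):
    """Return the stored candidate whose audio features are closest to the target."""
    distances = [np.linalg.norm(c.audio_features - target_audio) for c in candidates]
    return candidates[int(np.argmin(distances))]

def synthesise(target_units, unit_database):
    """Concatenate lip trajectories of the best-matching stored unit for each target."""
    trajectory = []
    for phone, audio in target_units:
        candidates = [u for u in unit_database if u.phone == phone]
        if not candidates:
            continue  # a fuller system would back off to a model-based prediction here
        trajectory.append(select_unit(np.asarray(audio), candidates).lip_trajectory)
    return np.concatenate(trajectory, axis=0) if trajectory else np.empty((0, 0))

# Toy usage: two stored realisations of /p/ and one of /a/, each a
# 3-frame trajectory of 2 lip parameters.
db = [
    PhoneticUnit("/p/", [0.1, 0.9], np.random.rand(3, 2)),
    PhoneticUnit("/p/", [0.4, 0.5], np.random.rand(3, 2)),
    PhoneticUnit("/a/", [0.8, 0.2], np.random.rand(3, 2)),
]
lips = synthesise([("/p/", [0.2, 0.8]), ("/a/", [0.7, 0.3])], db)
print(lips.shape)  # (6, 2): two selected units of three frames each
```

In the paper's actual system the selected units constrain a learned model of lip dynamics rather than being concatenated directly; the sketch only shows how selecting the most similar stored units disambiguates the audio-to-lip mapping.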
1. Introduction

Synthetic talking heads are becoming increasingly popular across a wide range of applications: from entertainment (e.g., computer games, TV, and film) through to natural user interfaces and speech therapy. This application of computer animation and speech technology is complicated by the expert nature of any potential viewer. Face-to-face interaction is the natural means of everyday communication, and thus it is very difficult to fool even a naïve subject that synthetic speech movements are real. This is particularly the case as the static realism of our models gets closer to photorealism. Whilst a viewer may accept a cartoon-like character readily, they are often more sceptical of realistic avatars. To explain this phenomenon Mori [1] posited the "uncanny valley": the idea that the closer a simulacrum comes to human realism, the more slight discrepancies with observed reality disturb a viewer. Nevertheless, as the technology for capturing human likeness becomes more widely available, the application of lifelike synthetic characters to the above-mentioned domains has become attractive to our narcissistic desires. Recent films, such as "The Curious Case of Benjamin Button", demonstrate what can be attained in terms of mapping captured facial performance onto a synthetic character.
However, the construction of purely synthetic performance is a far more challenging task and one which has yet to be fully accomplished. The problem of visual speech synthesis can be thought of as the translation of a sequence of abstract phonetic commands into continuous movements of the visible articulators.