Model-Based Synthesis of Visual Speech Movements from 3D Video
Research Article

James D. Edge, Adrian Hilton, and Philip Jackson
Centre for Vision, Speech and Signal Processing, The University of Surrey, Surrey GU2 7XH, UK

Correspondence should be addressed to James D. Edge, [email protected]

Received 1 March 2009; Revised 30 July 2009; Accepted 23 September 2009

Recommended by Gérard Bailly

We describe a method for the synthesis of visual speech movements using a hybrid unit selection/model-based approach. Speech lip movements are captured using a 3D stereo face capture system and split into phonetic units. A dynamic parameterisation of this data is constructed which maintains the relationship between lip shapes and velocities; within this parameterisation a model of how lips move is built and is used in the animation of visual speech movements from speech audio input. The mapping from audio parameters to lip movements is disambiguated by selecting only the most similar stored phonetic units to the target utterance during synthesis. By combining properties of model-based synthesis (e.g., HMMs, neural nets) with unit selection we improve the quality of our speech synthesis.

Copyright © 2009 James D. Edge et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
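To make the hybrid idea in the abstract concrete, the sketch below illustrates the unit-selection half of such a pipeline: stored phonetic units pair audio features with captured lip-parameter trajectories, and synthesis picks, for each target unit, the stored realisation whose audio features are closest, then concatenates the corresponding lip trajectories. All class and function names, feature dimensions, and the Euclidean distance measure are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class PhoneticUnit:
    """A captured unit: its phone label, audio features, and lip trajectory."""
    def __init__(self, phone, audio_features, lip_trajectory):
        self.phone = phone                                   # e.g. "/p/"
        self.audio_features = np.asarray(audio_features)     # e.g. a mean acoustic feature vector
        self.lip_trajectory = np.asarray(lip_trajectory)     # frames x lip parameters

def select_unit(target_audio, candidates):
    """Return the stored candidate whose audio features are closest to the target."""
    distances = [np.linalg.norm(c.audio_features - target_audio) for c in candidates]
    return candidates[int(np.argmin(distances))]

def synthesise(target_units, unit_database):
    """Concatenate lip trajectories of the best-matching stored unit for each target."""
    trajectory = []
    for phone, audio in target_units:
        candidates = [u for u in unit_database if u.phone == phone]
        if not candidates:
            continue  # a fuller system would back off to a model-based prediction here
        trajectory.append(select_unit(np.asarray(audio), candidates).lip_trajectory)
    return np.concatenate(trajectory, axis=0) if trajectory else np.empty((0, 0))

# Toy usage: two stored realisations of /p/ and one of /a/, each a
# 3-frame trajectory of 2 lip parameters.
db = [
    PhoneticUnit("/p/", [0.1, 0.9], np.random.rand(3, 2)),
    PhoneticUnit("/p/", [0.4, 0.5], np.random.rand(3, 2)),
    PhoneticUnit("/a/", [0.8, 0.2], np.random.rand(3, 2)),
]
lips = synthesise([("/p/", [0.2, 0.8]), ("/a/", [0.7, 0.3])], db)
print(lips.shape)  # (6, 2): two selected units of three frames each
```

In the paper's actual system the selected units constrain a learned model of lip dynamics rather than being concatenated directly; the sketch only shows how selecting the most similar stored units disambiguates the audio-to-lip mapping.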
1. Introduction

Synthetic talking heads are becoming increasingly popular across a wide range of applications: from entertainment (e.g., computer games, TV, and film) through to natural user interfaces and speech therapy. This application of computer animation and speech technology is complicated by the expert nature of any potential viewer. Face-to-face interaction is the natural means of everyday communication, and thus it is very difficult to fool even a naïve subject that synthetic speech movements are real. This is particularly the case as the static realism of our models gets closer to photorealism. Whilst a viewer may accept a cartoon-like character readily, they are often more sceptical of realistic avatars. To explain this phenomenon Mori [1] posited the "uncanny valley": the idea that the closer a simulacrum comes to human realism, the more slight discrepancies with observed reality disturb a viewer. Nevertheless, as the technology for capturing human likeness becomes more widely available, the application of lifelike synthetic characters to the above-mentioned domains has become attractive to our narcissistic desires. Recent films, such as "The Curious Case of Benjamin Button", demonstrate what can be attained in terms of mapping captured facial performance onto a synthetic character.
However, the construction of purely synthetic performance is a far more challenging task and one which has yet to be fully accomplished. The problem of visual speech synthesis can be thought of as the translation of a sequence of abstract phonetic commands into continuous movements of the visible articulators.