Multimodal Translation System Using Texture-Mapped Lip-Sync Images for Video Mail and Automatic Dubbing Applications



Shigeo Morishima
School of Science and Engineering, Waseda University, Tokyo 169-8555, Japan
Email: [email protected]
ATR Spoken Language Translation Research Laboratories, Kyoto 619-0288, Japan

Satoshi Nakamura
ATR Spoken Language Translation Research Laboratories, Kyoto 619-0288, Japan
Email: [email protected]

Received 25 November 2002; Revised 16 January 2004

We introduce a multimodal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion by synchronizing it to the translated speech. The system introduces both a face synthesis technique that can generate any viseme lip shape and a face tracking technique that can estimate the original position and rotation of the speaker's face in an image sequence. To retain the speaker's facial expression, we substitute only the speech-organ region of the image with a synthesized one, generated from a 3D wire-frame model that is adaptable to any speaker. Our approach therefore achieves translated image synthesis with an extremely small database. Tracking of the face in the video sequence is performed by template matching, in which the translation and rotation of the face are estimated using a 3D personal face model whose texture is captured from a video frame. We also propose a method to customize the personal face model with our GUI tool. By combining these techniques with translated voice synthesis, automatic multimodal translation can be achieved that is suitable for video mail or for automatic dubbing into other languages.

Keywords and phrases: audio-visual speech translation, lip-sync talking head, face tracking with 3D template, video mail and automatic dubbing, texture-mapped facial animation, personal face model.
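To make the per-frame processing flow concrete, the following is a minimal, hypothetical sketch in Python/NumPy of the pipeline described above: estimate the head pose by template matching a texture-mapped face model against the frame, render the speech-organ region for the viseme of the translated speech, and composite only that region back onto the original frame. All names (match_pose, dub_frame, the renderer stubs, the one-dimensional "pose" parameter) are illustrative assumptions for this sketch, not the authors' implementation.

```python
import numpy as np

def match_pose(frame, render_fn, candidate_poses):
    """Template matching: render the textured face model for each candidate
    pose and keep the one with the smallest sum-of-squared-differences
    over the face pixels."""
    best_pose, best_err = None, np.inf
    for pose in candidate_poses:
        rgb, alpha = render_fn(pose)          # texture-mapped template + mask
        mask = alpha > 0.5
        err = np.sum((frame[mask] - rgb[mask]) ** 2)
        if err < best_err:
            best_pose, best_err = pose, err
    return best_pose

def dub_frame(frame, render_fn, render_mouth_fn, viseme, candidate_poses):
    """Replace only the speech-organ region so the speaker's expression in
    the rest of the original frame is preserved."""
    pose = match_pose(frame, render_fn, candidate_poses)
    mouth_rgb, mouth_alpha = render_mouth_fn(pose, viseme)
    a = mouth_alpha[..., None]
    return (1.0 - a) * frame + a * mouth_rgb

if __name__ == "__main__":
    # Stub renderers standing in for the texture-mapped 3D wire-frame model;
    # a real system would project the personal face model at pose (R, t).
    H, W = 96, 96
    def render_fn(pose):
        return np.full((H, W, 3), pose), np.ones((H, W))
    def render_mouth_fn(pose, viseme):
        rgb, alpha = np.zeros((H, W, 3)), np.zeros((H, W))
        alpha[60:80, 30:66] = 1.0             # mouth region only
        return rgb, alpha

    frame = np.random.rand(H, W, 3)
    poses = list(np.linspace(0.0, 1.0, 11))   # toy 1D "pose" parameter
    out = dub_frame(frame, render_fn, render_mouth_fn, "a", poses)
    print(out.shape)                          # (96, 96, 3)
```

In a full system the candidate poses would be 3D translations and rotations of the personal face model, and the compositing step would blend the synthesized mouth at the estimated pose into the original frame.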

1. INTRODUCTION

Facial expression is thought to convey most of the nonverbal information in ordinary conversation. From this viewpoint, much research has been carried out on face-to-face communication using a 3D personal face model, sometimes called an "avatar," in cyberspace [1]. For spoken language translation, ATR-MATRIX (ATR's multilingual automatic translation system for information exchange) [2] has been developed for the limited domain of hotel reservations between Japanese and English. Such a speech translation system handles verbal information, but it does not take articulation and intonation into account. Verbal information is the central element of human communication, yet facial expression also plays an important role in transmitting information in face-to-face communication. For example, dubbed speech in movies has the problem that it does not match the lip movements in the facial image. In the case of generating the entire facial image by computer graphics, it is difficult to convey the speaker's original nonverbal information. If we could develop a technology able to translate facial speaking motion synchronized to translated speech