On the Relationship between Face Movements, Tongue Movements, and Speech Acoustics



Jintao Jiang Electrical Engineering Department, University of California at Los Angeles, Los Angeles, CA 90095-1594, USA Email: [email protected]

Abeer Alwan Electrical Engineering Department, University of California at Los Angeles, Los Angeles, CA 90095-1594, USA Email: [email protected]

Patricia A. Keating Linguistics Department, University of California at Los Angeles, Los Angeles, CA 90095-1543, USA Email: [email protected]

Edward T. Auer Jr. Communication Neuroscience Department, House Ear Institute, Los Angeles, CA 90057, USA Email: [email protected]

Lynne E. Bernstein Communication Neuroscience Department, House Ear Institute, Los Angeles, CA 90057, USA Email: [email protected]

Received 29 November 2001 and in revised form 13 May 2002

This study examines relationships between external face movements, tongue movements, and speech acoustics for consonant-vowel (CV) syllables and sentences spoken by two male and two female talkers with different visual intelligibility ratings. The questions addressed are how relationships among measures vary by syllable, whether talkers who are more intelligible produce greater optical evidence of tongue movements, and how the results for CVs compare to those for sentences. Results show that the prediction of one data stream from another is better for C/a/ syllables than for C/i/ and C/u/ syllables. Across the different places of articulation, lingual places result in better predictions of one data stream from another than do bilabial and glottal places. Results vary from talker to talker; interestingly, highly rated intelligibility does not result in high predictions. In general, predictions for CV syllables are better than those for sentences.

Keywords and phrases: articulatory movements, speech acoustics, Qualisys, EMA, optical tracking.
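
To make the notion of "predicting one data stream from another" concrete, the sketch below fits a simple least-squares (multilinear) mapping from one frame-synchronous stream (e.g., acoustic features) to another (e.g., face-marker trajectories) and scores it with per-dimension correlations. This is an illustrative assumption of one common analysis of this kind; the feature choices, variable names, and synthetic data are not the authors' exact pipeline.

```python
# Minimal sketch: predict one data stream (e.g., 3D face-marker trajectories)
# from another (e.g., frame-synchronous acoustic features) with a linear
# least-squares mapping. Feature choices and names are illustrative only.
import numpy as np

def fit_linear_map(X, Y):
    """Fit Y ~ X @ W + b by ordinary least squares.

    X: (n_frames, n_source_dims), e.g., acoustic features per frame
    Y: (n_frames, n_target_dims), e.g., face-marker coordinates per frame
    Returns (W, b).
    """
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])      # append bias column
    coeffs, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)    # (n_source_dims + 1, n_target_dims)
    return coeffs[:-1], coeffs[-1]

def prediction_correlation(Y_true, Y_pred):
    """Per-dimension Pearson correlation between measured and predicted streams,
    one common summary of how well one stream predicts another."""
    Yt = Y_true - Y_true.mean(axis=0)
    Yp = Y_pred - Y_pred.mean(axis=0)
    return (Yt * Yp).sum(axis=0) / (
        np.linalg.norm(Yt, axis=0) * np.linalg.norm(Yp, axis=0) + 1e-12)

# Example with synthetic data standing in for two time-aligned streams.
rng = np.random.default_rng(0)
acoustic = rng.standard_normal((500, 12))                 # e.g., 12 spectral features per frame
face = acoustic @ rng.standard_normal((12, 9)) + 0.1 * rng.standard_normal((500, 9))
W, b = fit_linear_map(acoustic, face)
print(prediction_correlation(face, acoustic @ W + b))     # correlation per predicted dimension
```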

1. INTRODUCTION

The effort to create talking machines began several hundred years ago [1, 2], and over the years most speech synthesis efforts have focused mainly on speech acoustics. With the development of computer technology, the desire to create talking faces along with voices has been inspired by many potential applications. A better understanding of the relationships between speech acoustics and face and tongue movements would help in developing better synthetic talking faces [2] and in other applications as well. For example, in automatic speech recognition, optical (facial) information could be used to compensate for noisy speech waveforms [3, 4]; optical information could also be used to enhance auditory comprehension of speech in noisy situations [5]. However, how best to drive a synthetic talking face is a challenging question. A theoretically ideal driving source for face animation is speech acoustics, because the optical and acoustic signals are simultaneous products of speech production. Speech production involves the control of various speech articulators to produce acoustic speech signals. Predictable relationships between articulatory