Moving-Talker, Speaker-Independent Feature Study, and Baseline Results Using the CUAVE Multimodal Speech Corpus


Eric K. Patterson Department of Electrical and Computer Engineering, Clemson University, Clemson, SC 29634, USA Email: [email protected]

Sabri Gurbuz Department of Electrical and Computer Engineering, Clemson University, Clemson, SC 29634, USA Email: [email protected]

Zekeriya Tufekci Department of Electrical and Computer Engineering, Clemson University, Clemson, SC 29634, USA Email: [email protected]

John N. Gowdy Department of Electrical and Computer Engineering, Clemson University, Clemson, SC 29634, USA Email: [email protected]

Received 30 November 2001 and in revised form 10 May 2002

Strides in computer technology and the search for deeper, more powerful techniques in signal processing have brought multimodal research to the forefront in recent years. Audio-visual speech processing has become an important part of this research because it holds great potential for overcoming certain problems of traditional audio-only methods. Difficulties, due to background noise and multiple speakers in an application environment, are significantly reduced by the additional information provided by visual features. This paper presents information on a new audio-visual database, a feature study on moving speakers, and on baseline results for the whole speaker group. Although a few databases have been collected in this area, none has emerged as a standard for comparison. Also, efforts to date have often been limited, focusing on cropped video or stationary speakers. This paper seeks to introduce a challenging audio-visual database that is flexible and fairly comprehensive, yet easily available to researchers on one DVD. The Clemson University Audio-Visual Experiments (CUAVE) database is a speaker-independent corpus of both connected and continuous digit strings totaling over 7000 utterances. It contains a wide variety of speakers and is designed to meet several goals discussed in this paper. One of these goals is to allow testing of adverse conditions such as moving talkers and speaker pairs. A feature study of connected digit strings is also discussed. It compares stationary and moving talkers in a speaker-independent grouping. An image-processing-based contour technique, an image transform method, and a deformable template scheme are used in this comparison to obtain visual features. This paper also presents methods and results in an attempt to make these techniques more robust to speaker movement. Finally, initial baseline speaker-independent results are included using all speakers, and conclusions as well as suggested areas of research are given.

Keywords and phrases: audio-visual speech recognition, speechreading, multimodal database.

1. INTRODUCTION

Over the past decade, multimodal signal processing has been an increasing area of interest for researchers. Over recent years, the potential of multimodal signal processing has grown as computing power has increased. Faster processing allows the
