Audio-Visual Speech Recognition Using MPEG-4 Compliant Visual Features

Petar S. Aleksic
Department of Electrical and Computer Engineering, Northwestern University, 2145 North Sheridan Road, Evanston, IL 60208-3118, USA
Email: [email protected]

Jay J. Williams
Department of Electrical and Computer Engineering, Northwestern University, 2145 North Sheridan Road, Evanston, IL 60208-3118, USA
Email: [email protected]

Zhilin Wu
Department of Electrical and Computer Engineering, Northwestern University, 2145 North Sheridan Road, Evanston, IL 60208-3118, USA
Email: [email protected]

Aggelos K. Katsaggelos
Department of Electrical and Computer Engineering, Northwestern University, 2145 North Sheridan Road, Evanston, IL 60208-3118, USA
Email: [email protected]

Received 3 December 2001 and in revised form 19 May 2002

We describe an audio-visual automatic continuous speech recognition system that significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system utilizes facial animation parameters (FAPs), supported by the MPEG-4 standard, for the visual representation of speech. We also describe a robust and automatic algorithm we have developed to extract FAPs from visual data, which does not require hand labeling or extensive training procedures. Principal component analysis (PCA) was performed on the FAPs in order to decrease the dimensionality of the visual feature vectors, and the derived projection weights were used as visual features in the audio-visual automatic speech recognition (ASR) experiments. Both single-stream and multistream hidden Markov models (HMMs) were used to model the ASR system, integrate audio and visual information, and perform relatively large vocabulary (approximately 1000 words) speech recognition experiments. The experiments were performed using clean audio data and audio data corrupted by stationary white Gaussian noise at various SNRs. The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only speech recognition WERs at various SNRs (0–30 dB) with additive white Gaussian noise, and by 19% relative to the audio-only WER under clean audio conditions.

Keywords and phrases: audio-visual speech recognition, facial animation parameters, snake.
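The abstract describes projecting the extracted FAP vectors onto their principal components and using the per-frame projection weights as the visual features fed to the recognizer. The sketch below illustrates that kind of PCA-based dimensionality reduction; it is not the authors' implementation, and the function name, the number of retained components, and the FAP dimensions are illustrative assumptions only.

```python
# Minimal sketch (assumed, not the authors' code): PCA over per-frame FAP vectors,
# returning projection weights that could serve as visual features for an HMM.
import numpy as np

def pca_visual_features(fap_frames, n_components=6):
    """fap_frames: (num_frames, num_faps) matrix of extracted FAPs.
    Returns per-frame projection weights onto the top principal components."""
    mean = fap_frames.mean(axis=0)
    centered = fap_frames - mean
    # Covariance of the FAP dimensions across frames.
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the components with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_components]
    basis = eigvecs[:, order]            # (num_faps, n_components)
    weights = centered @ basis           # (num_frames, n_components)
    return weights, basis, mean

# Example with illustrative sizes: 100 video frames, 10 lip-contour FAPs per frame.
faps = np.random.randn(100, 10)
visual_features, _, _ = pca_visual_features(faps)
print(visual_features.shape)  # (100, 6)
```

In a setup like the one described in the abstract, these low-dimensional weights would replace the raw FAPs as the visual observation stream that is combined with the acoustic features in the single-stream or multistream HMMs.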

1. INTRODUCTION

Human listeners use visual information, such as facial expressions and lip and tongue movements, in order to improve perception of the uttered audio signal [1]. Hearing-impaired individuals, using lipreading, or speechreading, can achieve very good speech perception [1, 2, 3, 4]. The use of visual information in addition to audio improves speech understanding, especially in noisy environments. Visual information, obviously independent of audio noise, is complementary to the audio signal, and as such it can improve speech perception even in noise-free environments [5]. In automatic speech recognition (ASR), a very active area of research over the last decades, visual information has traditionally been ignored. Because of this, the performance of the state of the