Dynamic Bayesian Networks for Audio-Visual Speech Recognition
Ara V. Nefian, Intel Corporation, Microprocessor Research Labs, 2200 Mission College Blvd., Santa Clara, CA 95052-8119, USA. Email: [email protected]

Luhong Liang, Intel Corporation, Microcomputer Research Labs, Guanghua Road, 100020 Chaoyang District, Beijing, China. Email: [email protected]

Xiaobo Pi, Intel Corporation, Microcomputer Research Labs, Guanghua Road, 100020 Chaoyang District, Beijing, China. Email: [email protected]

Xiaoxing Liu, Intel Corporation, Microcomputer Research Labs, Guanghua Road, 100020 Chaoyang District, Beijing, China. Email: [email protected]

Kevin Murphy, Computer Science Division, University of California, Berkeley, Berkeley, CA 94720-1776, USA. Email: [email protected]

Received 30 November 2001; revised 6 August 2002.

The use of visual features in audio-visual speech recognition (AVSR) is justified both by the speech generation mechanism, which is essentially bimodal in its audio and visual representations, and by the need for features that are invariant to acoustic noise perturbation. As a result, current AVSR systems demonstrate significant accuracy improvements in environments affected by acoustic noise. In this paper, we describe the use of two statistical models for audio-visual integration, the coupled HMM (CHMM) and the factorial HMM (FHMM), and compare their performance with that of existing models used in speaker-dependent audio-visual isolated word recognition. The statistical properties of both the CHMM and the FHMM allow them to model the state asynchrony of the audio and visual observation sequences while preserving their natural correlation over time. In our experiments, the CHMM performs best overall, outperforming all the existing models and the FHMM.

Keywords and phrases: audio-visual speech recognition, hidden Markov models, coupled hidden Markov models, factorial hidden Markov models, dynamic Bayesian networks.
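The coupling described above can be illustrated with a minimal sketch: in a CHMM, the next state of each stream (audio or visual) depends on the previous states of both streams, so the chains stay correlated while remaining free to be asynchronous. The sketch below shows this factorization with a forward pass on the joint state space; all sizes and probability tables are illustrative stand-ins (a real system would use the paper's Gaussian-mixture observation models), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

Na, Nv = 3, 3          # audio / visual hidden-state counts (illustrative)
T = 5                  # observation sequence length

def random_dist(shape, axis=-1):
    """Random conditional probability table, normalized over `axis`."""
    x = rng.random(shape)
    return x / x.sum(axis=axis, keepdims=True)

# Coupled transitions: P(a' | a, v) and P(v' | a, v) -- each chain's
# next state is conditioned on the previous states of BOTH chains.
A_a = random_dist((Na, Nv, Na))    # audio-chain transitions
A_v = random_dist((Na, Nv, Nv))    # visual-chain transitions
pi = random_dist((Na * Nv,))       # joint initial state distribution

# Stand-in per-frame observation likelihoods for each stream.
B_a = rng.random((T, Na))
B_v = rng.random((T, Nv))

def chmm_forward(pi, A_a, A_v, B_a, B_v):
    """Forward algorithm on the joint (audio, visual) state space."""
    # Joint transition P(a', v' | a, v) = P(a' | a, v) * P(v' | a, v),
    # flattened so joint state (a, v) maps to index a * Nv + v.
    A = (A_a[:, :, :, None] * A_v[:, :, None, :]).reshape(Na * Nv, Na * Nv)
    alpha = pi * (B_a[0][:, None] * B_v[0][None, :]).ravel()
    for t in range(1, T):
        obs = (B_a[t][:, None] * B_v[t][None, :]).ravel()
        alpha = (alpha @ A) * obs
    return alpha.sum()             # likelihood of the observation pair

likelihood = chmm_forward(pi, A_a, A_v, B_a, B_v)
```

An FHMM differs in the transition structure: each chain evolves independently, P(a' | a) and P(v' | v), and the streams interact only through the joint observation model, which this same joint-state forward pass can accommodate by replacing `A_a` and `A_v` with tables that ignore the other chain's previous state.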
1. INTRODUCTION
The variety of applications of automatic speech recognition (ASR) systems in human-computer interfaces, telephony, and robotics has driven the research of a large scientific community in recent decades. However, the success of currently available ASR systems is restricted to relatively controlled environments and well-defined applications such as dictation or small-to-medium vocabulary voice-based control commands (e.g., hands-free dialing). Often, robust ASR systems require special positioning of the microphone with respect to the speaker, resulting in a rather unnatural human-machine interface. In recent years, together with the investigation of several acoustic noise reduction techniques, the study of visual features has emerged as an attractive solution to speech recognition in less constrained environments. The use of visual features in audio-visual speech recognition (AVSR) is motivated by the speech formation mechanism and by the natural ability of humans to reduce audio ambiguity using visual cues [1]. In addition, the visual information provides compl