On the Integration of Auditory and Visual Parameters in an HMM-based ASR
Institut de la Communication Parlée, URA CNRS N° 368, INPG/ENSERG - Université STENDHAL, BP 25X - F-38040 Grenoble.
Abstract. In this paper, we describe two architectures for combining automatic speechreading and acoustic speech recognition. We propose a model which can improve the performance of an audio-visual speech recognizer in an isolated-word, speaker-dependent situation. This is achieved by using a hybrid system based on two HMMs trained respectively with acoustic and optic data. Both architectures have been tested on degraded audio over a wide range of S/N ratios. The results of these experiments are presented and discussed.
Keywords: speechreading, recognition, audio-visual, HMM.
1. Introduction
Although acoustically-based automatic speech recognition systems have witnessed enormous developments over the past years, they still operate poorly when background noise is present. Several studies have shown that the use of an external source of information such as lip movements can significantly enhance the recognition rate (Petajan, 1984; Stork, Wolff and Levine, 1992; Bregler, Hild, Manke et al., 1993).
2. Audio-visual speech perception
Research with human subjects has shown that vision of the talker's face provides extensive benefit to speech recognition in difficult listening conditions, even for normal hearers (Sumby and Pollack, 1954; Erber, 1969, 1975; Benoît, Mohammadi and Kandel, 1994). All these studies have shown that the audio-visual recognition scores are always higher than both the audio-only and the visual-only scores in all conditions, i.e., AV > A and AV > V. This is the basic challenge of bimodal integration, and thus the first goal any audio-visual ASR system should reach. In the area of speech perception, several models have been proposed to account for the human process of auditory and visual integration of speech (Summerfield,
1987; Robert-Ribes, 1995). Four or five different architectural structures have been proposed to model audio-visual fusion in speech perception. Only two of these strategies are easy to implement, so we have focused our study on comparing the results obtained with these two kinds of architectures. They are briefly presented hereafter:
1 - Direct Identification Model (Early Integration Model): it is based upon the "Lexical Access From Spectra" model by Klatt (Klatt, 1979). The fusion process takes place before any classification; the input of such a system is thus a combination of acoustic and optic data, and there is no common metric level over the two modalities. Figure 1 shows the principle of this architecture.
Figure 1: Schematic of the Direct Identification Model (the A. vector and the V. vector feed a single classifier, which outputs a word or phoneme).
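The following is a minimal sketch of this early-integration scheme under stated assumptions: acoustic and optic parameter vectors are concatenated frame by frame, and a single HMM per word is trained on the joint vectors. The hmmlearn library stands in for the HMM toolkit actually used, and the feature names, dimensions and 5-state topology are illustrative choices, not the authors' exact configuration.

```python
# Minimal sketch of the Direct Identification (early integration) scheme.
# hmmlearn is an assumed stand-in toolkit; features and topology are illustrative.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def fuse_early(audio_frames: np.ndarray, visual_frames: np.ndarray) -> np.ndarray:
    """Concatenate acoustic and optic parameter vectors frame by frame.

    audio_frames:  (T, d_a) array, e.g. cepstral coefficients
    visual_frames: (T, d_v) array, e.g. lip width/height/area parameters,
                   assumed already interpolated to the audio frame rate
    """
    assert audio_frames.shape[0] == visual_frames.shape[0]
    return np.hstack([audio_frames, visual_frames])  # (T, d_a + d_v)

def train_word_models(train_data):
    """Train one joint audio-visual HMM per word of an isolated-word vocabulary.

    train_data maps each word to a list of (audio_frames, visual_frames) pairs.
    """
    models = {}
    for word, utterances in train_data.items():
        joint = [fuse_early(a, v) for a, v in utterances]
        X = np.vstack(joint)                       # stacked observation frames
        lengths = [seq.shape[0] for seq in joint]  # per-utterance lengths
        hmm = GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
        hmm.fit(X, lengths)
        models[word] = hmm
    return models

def recognize_early(models, audio_frames, visual_frames):
    """Return the word whose joint HMM gives the highest log-likelihood."""
    obs = fuse_early(audio_frames, visual_frames)
    return max(models, key=lambda w: models[w].score(obs))
```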
2 - Separated Identification Model (Late Integration Model): the A and the V inputs are independently identified by means of two parallel processes. Each input is matched against unimodal prototypes, so that an A and a V score are computed from those A and V comparisons.
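A minimal sketch of this late-integration scheme follows, assuming two unimodal HMM sets trained separately on acoustic and optic data (e.g. with hmmlearn as above): each produces a per-word log-likelihood, and the two scores are then fused. The weighted log-likelihood sum and the weight lam are a common illustrative choice, not necessarily the fusion rule used in the paper.

```python
# Minimal sketch of the Separated Identification (late integration) scheme.
# audio_models and visual_models are assumed to be per-word HMMs trained
# separately on acoustic and optic parameters; the fusion rule is illustrative.
import numpy as np

def unimodal_scores(models, frames: np.ndarray):
    """Log-likelihood of one observation sequence under every word HMM."""
    return {word: hmm.score(frames) for word, hmm in models.items()}

def recognize_late(audio_models, visual_models, audio_frames, visual_frames,
                   lam: float = 0.7):
    """Fuse A and V scores: lam = 1.0 is audio only, lam = 0.0 is visual only.

    In practice the weight would be lowered as the acoustic S/N ratio degrades.
    """
    a_scores = unimodal_scores(audio_models, audio_frames)
    v_scores = unimodal_scores(visual_models, visual_frames)
    fused = {w: lam * a_scores[w] + (1.0 - lam) * v_scores[w] for w in a_scores}
    return max(fused, key=fused.get)
```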