On the Integration of Auditory and Visual Parameters in an HMM-based ASR


Institut de la Communication Parlée, URA CNRS N° 368, INPG/ENSERG - Université Stendhal, BP 25X - F38040 Grenoble.

Abstract. In this paper, we describe two architectures for combining automatic speechreading and acoustic speech recognition. We propose a model which can improve the performance of an audio-visual speech recognizer in an isolated-word, speaker-dependent setting. This is achieved by using a hybrid system based on two HMMs trained respectively with acoustic and optic data. Both architectures have been tested on degraded audio over a wide range of S/N ratios. The results of these experiments are presented and discussed.

Keywords: speechreading, recognition, audio-visual, HMM.

1. Introduction

Although acoustically-based automatic speech recognition systems have witnessed enormous developments over the past years, they still perform poorly in the presence of background noise. Several studies have shown that the use of an external source of information such as lip movements can significantly enhance the recognition rate (Petajan, 1984; Stork, Wolf and Levine, 1992; Bregler, Hild, Manke, et al., 1993).

2. Audio-visual speech perception

Research with human subjects has shown that seeing the talker's face provides extensive benefit to speech recognition in difficult listening conditions, even for normal hearers (Sumby and Pollack, 1954; Erber, 1969, 1975; Benoit, Mohammadi and Kandel, 1994). These studies all show that the audio-visual recognition scores are higher than both the audio and the visual scores in all conditions, i.e., AV > A and AV > V. This is the basic challenge of bimodal integration, and thus the first goal any audio-visual ASR should reach. In the area of speech perception, several models have been proposed to account for the human process of auditory and visual integration of speech (Summerfield,

D. G. Stork et al. (eds.), Speechreading by Humans and Machines © Springer-Verlag Berlin Heidelberg 1996


1987; Robert-Ribes, 1995). Four or five different architectures have been proposed to model AV fusion in speech perception. Only two of these strategies are easy to implement. We have thus focused our study on comparing the results of these two kinds of architecture. They are briefly presented here:

1 - Direct Identification Model (Early Integration Model): it is based upon "Lexical Access From Spectra" (Klatt, 1979). The fusion process takes place before any classification. Thus, the input of such a system is a combination of acoustic and optic data. There is no common metric level over the two modalities. Figure 1 shows the principle of this architecture.

Figure 1: schematic of the Direct Identification Model. The A. vector and the V. vector are input to a single classifier, which outputs a word or phoneme.
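The Direct Identification principle can be illustrated by a minimal sketch, in which each acoustic vector is simply concatenated with the optic vector of the same frame before any classification. The function name `fuse_early`, the toy parameter values, and the assumption that both streams are already sampled at the same frame rate are hypothetical and are not taken from the system described here.

```python
from typing import List, Sequence

Frame = List[float]


def fuse_early(audio_frames: Sequence[Sequence[float]],
               visual_frames: Sequence[Sequence[float]]) -> List[Frame]:
    """Concatenate time-aligned A and V vectors into joint observation vectors.

    Assumption: both streams have one vector per frame at the same frame rate.
    No common metric is imposed: the raw values are placed side by side.
    """
    if len(audio_frames) != len(visual_frames):
        raise ValueError("audio and visual streams must have the same number of frames")
    return [list(a) + list(v) for a, v in zip(audio_frames, visual_frames)]


# Hypothetical toy data: 3 frames, 2 acoustic parameters and 1 optic parameter each.
audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
visual = [[1.0], [2.0], [3.0]]

# Each joint vector would then be fed to a single classifier (e.g., one HMM).
joint = fuse_early(audio, visual)
```

In such a scheme the classifier sees only the joint vectors, so any weighting between the two modalities has to be learned implicitly from the combined data.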

2 - Separated Identification Model (Late Integration Model): the A and the V inputs are independently identified by means of two parallel processes. Each input is matched against unimodal prototypes so that an A and a V score are computed from those A and V