Detection and Separation of Speech Event Using Audio and Video Information Fusion and Its Application to Robust Speech Interface

Futoshi Asano,1 Kiyoshi Yamamoto,2 Isao Hara,1 Jun Ogata,1 Takashi Yoshimura,1 Yoichi Motomura,1 Naoyuki Ichimura,1 Hideki Asoh1

1 Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology, Tsukuba 305-8568, Japan
Emails: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

2 Department of Computer Science, Tsukuba University, Tsukuba 305-8573, Japan
Email: [email protected]

Received 11 November 2003; Revised 3 February 2004; Recommended for Publication by Chin-Hui Lee

A method of detecting speech events in a multiple-sound-source condition using audio and video information is proposed. For detecting speech events, sound localization using a microphone array and human tracking by stereo vision are combined by a Bayesian network. From the inference results of the Bayesian network, the time and location of speech events can be obtained. The information on the detected speech events is then utilized in a robust speech interface. A maximum likelihood adaptive beamformer is employed as a preprocessor of the speech recognizer to separate the speech signal from environmental noise. The coefficients of the beamformer are updated continually based on the information on the speech events. The information on the speech events is also used by the speech recognizer to extract the speech segment.

Keywords and phrases: information fusion, sound localization, human tracking, adaptive beamformer, speech recognition.
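To make the fusion step concrete, the following is a minimal illustrative sketch, not the authors' implementation, of how a simple two-cue Bayesian network can combine an audio localization cue and a video tracking cue into a posterior probability of a speech event. All prior and likelihood values below are assumed placeholders.

```python
# Minimal sketch (illustrative, not the paper's network): infer
# P(speech event | audio cue, video cue) with a naive Bayes fusion,
# treating the two cues as conditionally independent given the event.
import numpy as np

p_event = np.array([0.3, 0.7])        # assumed prior: [P(event), P(no event)]
p_audio_given = np.array([0.9, 0.2])  # assumed P(sound localized here | event / no event)
p_video_given = np.array([0.8, 0.4])  # assumed P(person tracked here | event / no event)

def posterior_event(audio_cue: bool, video_cue: bool) -> float:
    """Posterior probability of a speech event given the two binary cues."""
    la = p_audio_given if audio_cue else 1.0 - p_audio_given
    lv = p_video_given if video_cue else 1.0 - p_video_given
    joint = p_event * la * lv          # unnormalized joint over {event, no event}
    return float(joint[0] / joint.sum())

print(posterior_event(True, True))    # ~0.79: sound and a person -> likely a speech event
print(posterior_event(True, False))   # ~0.39: sound but no person, e.g., a TV or radio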

1. INTRODUCTION

Detection of speech events is an important issue for automatic speech recognition (ASR) in real environments with background noise and interference. Detecting the presence or absence of the target speech signal is also important for noise-reduction methods such as adaptive beamforming (see, e.g., [1]) or spectral subtraction (see, e.g., [2]), which can be used as preprocessors for ASR. In the maximum likelihood (ML) adaptive beamformer employed in this paper, the spatial correlation of the noise must be estimated during the absence of the target signal, as described later in this paper. In spectral subtraction, the noise spectrum must be estimated in a similar manner. When the environmental noise consists of nonspeech signals, a voice activity detector (VAD) can be used as a target speech detector (see, e.g., [3]). In environments such as offices and homes, however, not only the target but also interference from sources such as a TV or a radio can be speech. In such cases, target speech detection cannot be accomplished using sound information alone, and fusion with information from other modalities, such as vision, is necessary.
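As a concrete illustration of why target-absent segments matter, the sketch below implements a generic ML (MVDR-type) beamformer under an assumed narrowband model: the noise spatial correlation matrix is estimated only from frames in which the detector reports no target speech, and the weights are then computed from it. The array geometry, steering vector, and variable names are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch, assuming a narrowband signal model: ML (MVDR-type)
# beamformer weights w = R^{-1} a / (a^H R^{-1} a), where R is the noise
# spatial correlation matrix estimated from target-absent frames only.
import numpy as np

def noise_correlation(X_noise: np.ndarray) -> np.ndarray:
    """Estimate the M x M noise spatial correlation from target-absent
    snapshots X_noise with shape (M microphones, T frames)."""
    M, T = X_noise.shape
    R = (X_noise @ X_noise.conj().T) / T
    return R + 1e-6 * np.eye(M)  # diagonal loading for numerical stability

def ml_beamformer_weights(R_noise: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Weights that pass the target direction a undistorted (w^H a = 1)
    while minimizing output noise power."""
    Ri_a = np.linalg.solve(R_noise, a)   # R^{-1} a
    return Ri_a / (a.conj() @ Ri_a)      # normalize by a^H R^{-1} a

# Toy usage (assumed setup): 4 mics in a half-wavelength linear array,
# steering vector toward a detected speaker at 30 degrees, and noise
# snapshots collected while the event detector reports "no speech".
rng = np.random.default_rng(0)
M = 4
a = np.exp(-1j * np.pi * np.arange(M) * np.sin(np.deg2rad(30.0)))
X_noise = (rng.standard_normal((M, 200)) + 1j * rng.standard_normal((M, 200))) / np.sqrt(2)
w = ml_beamformer_weights(noise_correlation(X_noise), a)
print(abs(w.conj() @ a))  # ~1.0: distortionless response toward the target
```

The key design point this sketch highlights is the coupling to event detection: if speech frames leak into the noise estimate, the beamformer cancels the target itself, which is why reliable detection of speech-event boundaries is a prerequisite for updating the coefficients.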

Choudhury et al. [4] proposed a speech event detector using audio and video information. In their paper, an environment in which a dialog