Robust In-Car Speech Recognition Based on Nonlinear Multiple Regressions

Research Article

Weifeng Li,1 Kazuya Takeda,1 and Fumitada Itakura2

1 Graduate School of Information Science, Nagoya University, Nagoya 464-8603, Japan
2 Department of Information Engineering, Faculty of Science and Technology, Meijo University, Nagoya 468-8502, Japan

Received 31 January 2006; Revised 10 August 2006; Accepted 29 October 2006

Recommended by S. Parthasarathy

We address the problem of improving hands-free speech recognition performance in different car environments using a single distant microphone. In this paper, we propose a nonlinear multiple-regression-based enhancement method for in-car speech recognition. In order to develop a data-driven in-car recognition system, we develop an effective algorithm for adapting the regression parameters to different driving conditions. We also devise a model compensation scheme that synthesizes training data using the optimal regression parameters and selects the optimal HMM for the test speech. In isolated word recognition experiments conducted in 15 real car environments, the proposed adaptive regression approach achieves average relative word error rate (WER) reductions of 52.5% and 14.8% compared to the original noisy speech and the ETSI advanced front end, respectively.

Copyright © 2007 Weifeng Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
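To make the general idea concrete before the formal development, the following is a minimal, illustrative sketch of regression-based feature enhancement, assuming synthetic stereo (clean/noisy) training data and an off-the-shelf scikit-learn MLP regressor; the paper's actual regression model, features, and adaptation algorithm are developed in the sections below.

```python
# Illustrative sketch only (not the paper's method): learn a nonlinear
# mapping from noisy features to clean ones on paired training data,
# then enhance unseen noisy features before recognition.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Placeholder "stereo" data: rows are frames, columns are feature
# channels (24 here, an arbitrary choice for the sketch).
clean = rng.normal(size=(5000, 24))
noisy = clean + 0.5 * rng.normal(size=clean.shape)  # simulated corruption

# Nonlinear multiple regression: one MLP jointly mapping all noisy
# channels to the corresponding clean channels.
regressor = MLPRegressor(hidden_layer_sizes=(64,), max_iter=300,
                         random_state=0)
regressor.fit(noisy, clean)

# Enhancement: estimate clean features for new noisy frames; these
# estimates would be fed to the recognizer in place of the noisy ones.
test_noisy = clean[:10] + 0.5 * rng.normal(size=(10, 24))
enhanced = regressor.predict(test_noisy)
print("MSE before:", np.mean((test_noisy - clean[:10]) ** 2))
print("MSE after: ", np.mean((enhanced - clean[:10]) ** 2))
```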

1. INTRODUCTION

The mismatch between training and testing conditions is one of the most challenging and important problems in automatic speech recognition (ASR). This mismatch may be caused by a number of factors, such as background noise, speaker variation, changes in speaking style, channel effects, and so on. State-of-the-art ASR techniques for removing the mismatch usually fall into three categories [1]: robust features, speech enhancement, and model compensation. The first approach seeks parameterizations that are fundamentally immune to noise. The most widely used speech recognition features are the Mel-frequency cepstral coefficients (MFCCs) [2]. The lack of robustness of MFCCs in noisy or mismatched conditions has led many researchers to investigate robust variants and novel feature extraction algorithms. Some of these are perceptually motivated, for example, PLP [3] and RASTA [4], while others are based on auditory processing, for example, the gammatone filter [5] and the EIH model [6].
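For reference, here is a minimal MFCC extraction sketch (not taken from the paper), assuming the librosa library and a synthetic one-second tone standing in for real speech:

```python
# Minimal MFCC extraction sketch: MFCCs [2] are the baseline features
# whose noise sensitivity motivates the robust variants cited above.
import numpy as np
import librosa

sr = 16000                               # 16 kHz sampling rate
t = np.arange(sr) / sr                   # one second of samples
y = 0.1 * np.sin(2 * np.pi * 1000 * t)   # synthetic stand-in signal

# 13 coefficients per 25 ms frame with a 10 ms hop, typical ASR settings.
mfcc = librosa.feature.mfcc(y=y.astype(np.float32), sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, number_of_frames)
```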

The speech enhancement approach aims to perform noise reduction by transforming the noisy speech (or features) into an estimate that more closely resembles the clean speech (or features). Examples of this approach include spectral subtraction [7], Wiener filtering, cepstral mean normalization (CMN) [8], codeword-dependent cepstral normalization (CDCN) [9], and so on. Spectral subtraction was originally proposed in the context of the enhan