Research Article
A Comprehensive Noise Robust Speech Parameterization Algorithm Using Wavelet Packet Decomposition-Based Denoising and Speech Feature Representation Techniques
Bojan Kotnik and Zdravko Kačič
Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova ul. 17, 2000 Maribor, Slovenia
Received 22 May 2006; Revised 12 January 2007; Accepted 11 April 2007
Recommended by Matti Karjalainen
This paper concerns the problem of automatic speech recognition in noise-intense and adverse environments. The main goal of the proposed work is the definition, implementation, and evaluation of a novel noise robust speech signal parameterization algorithm. The proposed procedure is based on time-frequency speech signal representation using wavelet packet decomposition. A new modified soft thresholding algorithm based on time-frequency adaptive threshold determination was developed to efficiently reduce the level of additive noise in the input noisy speech signal. A two-stage Gaussian mixture model (GMM)-based classifier was developed to perform speech/nonspeech as well as voiced/unvoiced classification. The adaptive topology of the wavelet packet decomposition tree based on voiced/unvoiced detection was introduced to separately analyze voiced and unvoiced segments of the speech signal. The main feature vector consists of a combination of log-root compressed wavelet packet parameters and autoregressive parameters. The final output feature vector is produced using a two-staged feature vector postprocessing procedure. In the experimental framework, the noisy speech databases Aurora 2 and Aurora 3 were applied together with corresponding standardized acoustical model training/testing procedures. The automatic speech recognition performance achieved using the proposed noise robust speech parameterization procedure was compared to the standardized mel-frequency cepstral coefficient (MFCC) feature extraction procedures ETSI ES 201 108 and ETSI ES 202 050.
Copyright © 2007 B. Kotnik and Z. Kačič. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
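For orientation, the denoising idea named in the abstract can be illustrated with the classic soft thresholding rule, in which each transform coefficient is shrunk toward zero by a fixed amount. The sketch below is not the authors' modified algorithm: the time-frequency adaptive threshold determination and the wavelet packet tree are not specified in this section, so a single fixed threshold and a one-level Haar decomposition are assumed purely for illustration. The function names `soft_threshold`, `haar_step`, `haar_inverse`, and `denoise` are hypothetical.

```python
import numpy as np

def soft_threshold(coeffs, threshold):
    """Classic soft thresholding: shrink every coefficient toward zero by `threshold`.

    Assumption: one fixed threshold for all coefficients, unlike the
    time-frequency adaptive threshold described in the paper.
    """
    coeffs = np.asarray(coeffs, dtype=float)
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - threshold, 0.0)

def haar_step(signal):
    """One level of an orthonormal Haar decomposition (even-length input assumed)."""
    x = np.asarray(signal, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def haar_inverse(approx, detail):
    """Invert `haar_step`, interleaving the reconstructed even and odd samples."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

def denoise(signal, threshold):
    """Threshold only the detail band, then reconstruct the signal."""
    approx, detail = haar_step(signal)
    return haar_inverse(approx, soft_threshold(detail, threshold))
```

With a threshold of zero, `denoise` returns the input unchanged, which is a quick check that the decomposition and reconstruction are consistent. Larger thresholds suppress small detail coefficients, which under the usual assumption are dominated by additive noise.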
1. INTRODUCTION
Automatic speech recognition (ASR) systems have become indispensable parts of modern multimodal man-machine communication dialog applications such as voice-driven service portals, speech interfaces in automotive navigation and guidance systems, or speech-driven applications in modern offices [1]. As automatic speech recognition systems move from controlled laboratory environments to more acoustically dynamic places, noise robustness must be assured in order to keep speech recognition accuracy at a sufficient level. If a recognition system is to be used in noisy environments, it must be robust to many different types and levels of noise, categorized as either additive/convolutive noises, or changes in the speaker’s voice due to environmenta