Evolutionary Splines for Cepstral Filterbank Optimization in Phoneme Classification
Research Article

Leandro D. Vignolo,1 Hugo L. Rufiner,1 Diego H. Milone,1 and John C. Goddard2

1 Research Center for Signals, Systems and Computational Intelligence, Department of Informatics, National University of Litoral, CONICET, Santa Fe, 3000, Argentina
2 Departamento de Ingeniería Eléctrica, Universidad Autónoma Metropolitana, Unidad Iztapalapa, Mexico D.F., 09340, Mexico

Correspondence should be addressed to Leandro D. Vignolo, [email protected]

Received 14 July 2010; Revised 29 October 2010; Accepted 24 December 2010

Academic Editor: Raviraj S. Adve

Copyright © 2011 Leandro D. Vignolo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Mel-frequency cepstral coefficients have long been the most widely used type of speech representation. They were introduced to incorporate biologically inspired characteristics into artificial speech recognizers. Recently, the introduction of new alternatives to the classic mel-scaled filterbank has led to improvements in the performance of phoneme recognition in adverse conditions. In this work we propose a new bioinspired approach for the optimization of the filterbanks, in order to find a robust speech representation. Our approach, which relies on evolutionary algorithms, reduces the number of parameters to optimize by using spline functions to shape the filterbanks. The success rates of a phoneme classifier based on hidden Markov models are used as the fitness measure, evaluated over the well-known TIMIT database. The results show that the proposed method is able to find optimized filterbanks for phoneme recognition that significantly increase robustness in adverse conditions.
1. Introduction

Most current speech recognizers rely on the traditional mel-frequency cepstral coefficients (MFCC) [1] for the feature extraction phase. This representation is biologically motivated and uses a psychoacoustic scale to mimic the frequency response of the human ear. However, as the entire auditory system is complex and not yet fully understood, the shape of the truly optimal filterbank for automatic recognition is not known. Moreover, the recognition performance of automatic systems degrades when speech signals are contaminated with noise. This has motivated the development of alternative speech representations, many of which consist of modifications to the mel-scaled filterbank, for which the number of filters has been empirically set to different values [2]. For example, Skowronski and Harris [3, 4] proposed a novel scheme for determining filter bandwidth and reported significant recognition improvements over the traditional MFCC features. Other approaches follow a common strategy, which consists of optimizing a speech representation so that phoneme discrimination is maximized for a