Significance of Joint Features Derived from the Modified Group Delay Function in Speech Processing



Research Article

Rajesh M. Hegde,¹ Hema A. Murthy,² and V. R. R. Gadde³

¹ Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA 92122, USA
² Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai 600 036, India
³ STAR Lab, SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, USA

Received 1 April 2006; Revised 20 September 2006; Accepted 10 October 2006

Recommended by Climent Nadeu

This paper investigates the significance of combining cepstral features derived from the modified group delay function with those derived from the short-time spectral magnitude, such as the MFCC. The conventional group delay function fails to capture the resonant structure and the dynamic range of the speech spectrum, primarily due to pitch periodicity effects. The group delay function is therefore modified to suppress these spikes and to restore the dynamic range of the speech spectrum. Cepstral features derived from the modified group delay function are called modified group delay features (MODGDF). The complementarity and robustness of the MODGDF relative to the MFCC are analyzed using spectral reconstruction techniques. Combination of several spectral magnitude-based features and the MODGDF using feature fusion and likelihood combination is described. These features are then used for three speech processing tasks, namely, syllable, speaker, and language recognition. Results indicate that combining the MODGDF with the MFCC at the feature level gives significant improvements for speech recognition tasks in noise. Combining the MODGDF and the spectral magnitude-based features gives a significant increase in recognition performance of 11% at best, while combining any two features derived from the spectral magnitude does not give any significant improvement.

Copyright © 2007 Rajesh M. Hegde et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
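The modification described in the abstract (cepstrally smoothed denominator plus dynamic-range compression of the group delay) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the parameter values `alpha`, `gamma`, and the lifter length are placeholders, not the tuned values reported in the paper.

```python
import numpy as np

def modified_group_delay(frame, nfft=512, alpha=0.4, gamma=0.9, lifter=8):
    """Sketch of the modified group delay function for one speech frame.

    alpha and gamma compress the dynamic range; the lifter length controls
    the cepstral smoothing. All three values here are illustrative.
    """
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, nfft)          # spectrum of x(n)
    Y = np.fft.rfft(n * frame, nfft)      # spectrum of n * x(n)

    # Cepstrally smoothed magnitude spectrum |S(w)|, used in place of
    # |X(w)| in the denominator to suppress pitch-induced spikes.
    log_mag = np.log(np.abs(X) + 1e-10)
    cep = np.fft.irfft(log_mag, nfft)
    cep[lifter:nfft - lifter] = 0.0       # keep only low quefrencies
    S = np.exp(np.fft.rfft(cep, nfft).real)

    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma) + 1e-10)
    # Compress the dynamic range while preserving the sign of tau.
    return np.sign(tau) * (np.abs(tau) ** alpha)
```

The MODGDF cepstral features would then be obtained by applying a DCT to this modified group delay spectrum, mirroring the final step of MFCC extraction.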

1. INTRODUCTION

Various types of features have been used in speech processing [1]. Variations on the basic spectral computation, such as the inclusion of time and frequency masking, have been used in [2–4]. The use of auditory models as the basis of feature extraction has been beneficial in many systems [5–9], especially in noisy environments [10]. Perhaps the most popular features used in speech recognition today are the Mel frequency cepstral coefficients (MFCCs) [11]. In conventional speech recognition systems, features are usually computed from the short-time power spectrum, while the short-time phase spectrum is not used. This is primarily because early experiments on human speech perception indicated that the human ear is not sensitive to short-time phase. But recent experiments described in [12, 13] have indicated the usefulness of the short-ti