Applications of Surface Correlation to the Estimation of the Harmonic Fundamental of Speech



Douglas J. Nelson
R523, U.S. Department of Defense, Ft. Meade, MD 20755, USA
Email: [email protected]

Received 24 July 2001

We present a method for estimating the fundamental frequency of harmonic signals, and apply this method to human speech. The method is based on cross-spectral methods, which provide accurate resolution of multicomponent FM signals in both time and frequency. The fundamental is re-introduced to the spectrum by a frequency-lag autocorrelation of the spectrum, even if the fundamental is completely missing in the original spectrum. By combining the different perspectives of the Fourier spectral representation and the time-lag autocorrelation function, we suppress all components of harmonic signals except for the fundamental.

Keywords and phrases: cross-spectrum, phase spectrum, STFT, Fourier transform, speech, formant recovery, equalization.
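The core idea of the abstract, that a frequency-lag autocorrelation of the spectrum re-introduces the fundamental even when it is absent, can be illustrated with a minimal sketch. This is not the paper's cross-spectral algorithm; the function name, FFT size, and the 50-500 Hz search range are illustrative assumptions.

```python
import numpy as np

def estimate_f0_spectral_autocorr(x, fs, nfft=4096, f_lo=50.0, f_hi=500.0):
    # Magnitude spectrum of the windowed signal.
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)), nfft))
    # Frequency-lag autocorrelation of the spectrum: harmonics spaced
    # f0 apart line up at lag f0, re-introducing the fundamental even
    # when it is missing from the spectrum itself.
    r = np.correlate(spec, spec, mode="full")[len(spec) - 1:]
    bin_hz = fs / nfft
    lo, hi = int(f_lo / bin_hz), int(f_hi / bin_hz)
    return (lo + np.argmax(r[lo:hi])) * bin_hz

# Harmonic signal whose fundamental (200 Hz) is completely missing:
# only harmonics 2 through 7 are present.
fs = 8000
t = np.arange(0, 0.5, 1.0 / fs)
x = sum(np.sin(2 * np.pi * 200 * h * t) for h in range(2, 8))
f0 = estimate_f0_spectral_autocorr(x, fs)
```

Because the autocorrelation peak depends only on the spacing between harmonics, the estimate survives the deleted fundamental, which is the property the abstract emphasizes.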

1. INTRODUCTION

In applications such as speech vocoding, or in tasks such as determining the identity or gender of the speaker of a short segment of speech, accurate estimation of the excitation and formants (vocal tract resonances) is an important problem. The basic structure of voiced speech is a superposition of vocal tract resonances, which are excited by a quasi-periodic train of pulses formed at the glottis at the back of the vocal tract. While both the excitation frequency ω0 and the formant frequencies are nonstationary, it is well known that these frequencies are statistically quite different for male and female speakers and for different individuals of the same gender.¹ In gender identification (GID) and speaker identification (SID), the ability to accurately estimate and track speech-related frequencies results in feature distributions with reduced variance and, in principle, provides better identification performance. In vocoding, accurate estimation of speech features results in higher-quality speech reproduction and lower coding bit rates. In this paper, we address the problem of isolating and accurately estimating speech components in both time and frequency. While the application in this paper is speech, it should be noted that the methods presented apply equally well to the estimation of any multicomponent nonstationary harmonic signal.

¹ The excitation fundamental frequency is normally represented as F0. We use the notation ω0 to follow the convention that frequencies are represented by variations of the Greek "ω" and representations of the signal are represented by variations of the letter "f".
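The source-filter picture described above, a quasi-periodic glottal pulse train exciting vocal tract resonances, can be sketched with a toy synthesizer. The resonator implementation and all parameter values (f0 = 120 Hz, the three formant center frequencies and bandwidths) are illustrative assumptions, not the paper's model.

```python
import numpy as np

def resonator(x, fc, bw, fs):
    # Two-pole IIR resonator: a toy stand-in for one vocal tract
    # formant with center frequency fc and bandwidth bw (Hz).
    r = np.exp(-np.pi * bw / fs)
    theta = 2.0 * np.pi * fc / fs
    a1, a2 = -2.0 * r * np.cos(theta), r * r
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] -= a1 * y[n - 1]
        if n >= 2:
            y[n] -= a2 * y[n - 2]
    return y

# Quasi-periodic glottal pulse train (f0 = 120 Hz) exciting three
# formant resonators; their superposition models voiced speech.
fs, f0, dur = 8000, 120.0, 0.3
excitation = np.zeros(int(fs * dur))
excitation[::int(round(fs / f0))] = 1.0
speech = sum(resonator(excitation, fc, bw, fs)
             for fc, bw in [(500, 80), (1500, 120), (2500, 160)])
```

The resulting spectrum is a line spectrum at multiples of f0 shaped by the formant envelope, which is exactly the structure the estimation methods in this paper exploit.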

Cross-spectral methods, based on the short-time Fourier transform (STFT) phase, have recently been demonstrated by Nelson to be effective in accurately estimating speech formants and the vocal tract excitation in time and frequency [1]. In that paper, the concept of indicator functions, based on mixed partial phase derivatives, was introduced as a method of isolating the regions of the time-frequency (TF) surface representing vocal tract resonance and excitation, respectively. The a