The Use of Adaptive Frame for Speech Recognition
Sam Kwong
Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong
Email: [email protected]
Qianhua He
Department of Electronic Engineering, South China University of Technology, China
Email: [email protected]

Received 19 January 2001 and in revised form 18 May 2001

We propose an adaptive-frame speech analysis scheme that divides the speech signal into stationary and dynamic regions. Long-frame analysis is used for stationary speech, and short-frame analysis for dynamic speech. For computational convenience, the feature vector of a short frame is designed to be identical in form to that of a long frame, and two expressions are derived to represent the short-frame feature vector. Word recognition experiments on the TIMIT and NON-TIMIT databases with discrete hidden Markov models (DHMM) and continuous-density hidden Markov models (CHMM) showed that a steady performance improvement could be achieved in open-set testing. On the TIMIT database, the adaptive frame length (AFL) approach achieves error reduction rates ranging from 4.47% to 11.21% for DHMM and from 4.54% to 9.58% for CHMM. On the NON-TIMIT database, AFL achieves error reduction rates ranging from 1.91% to 11.55% for DHMM and from 2.63% to 9.5% for CHMM. These results demonstrate the effectiveness of the proposed adaptive frame length feature extraction scheme, especially for open testing, which is a practical measure for evaluating the performance of a speech recognition system.

Keywords and phrases: speech recognition, speech coding, adaptive frame, signal analysis.
1. INTRODUCTION

To date, the most successful speech recognition systems mainly use the hidden Markov model (HMM) for acoustic modeling; HMMs in fact dominate the continuous speech recognition field [1]. To improve recognition performance, a great deal of effort has been devoted to training approaches for HMMs [2, 3, 4] and to variations of the conventional HMM, such as the segment HMM [1] and HMMs with state-conditioned second-order nonstationarity [5].

In general, frame-based feature analysis of speech signals has been accepted as a very successful technique. In this method, the time-domain speech samples are blocked into frames of N samples, with adjacent frames separated by M samples. The spectral characteristic coefficients are then calculated for each frame via a speech analysis (coding) method such as LPC, FFT analysis, Gabor expansion [6], or wavelets [7]. N is usually set to the number of samples in a 30–45 ms segment, and M to N/3 [8]. This procedure is based on the assumption that a speech signal can be considered quasi-stationary if it is examined over a sufficiently short period of time (between 5 and 100 ms). However, this is not true when the signal is measured over long periods of time (on the order of 0.2 seconds or more). To reduce the discontinuities associated with windowing, pitch-synchronous speech processing may be
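The conventional frame-blocking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the 16 kHz sampling rate is an assumption (it matches TIMIT, but the paper does not state it here), and the function name `block_into_frames` is hypothetical.

```python
import numpy as np

def block_into_frames(signal, frame_len, frame_shift):
    """Split a 1-D signal into overlapping frames of frame_len samples,
    with adjacent frame starts separated by frame_shift samples.
    Trailing samples that do not fill a whole frame are dropped."""
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    return np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(num_frames)])

# Assumed setup: 16 kHz sampling, 30 ms frames (N = 480), shift M = N/3 = 160
fs = 16000
N = int(0.030 * fs)           # 480 samples per frame
M = N // 3                    # 160-sample frame shift
x = np.arange(fs, dtype=float)  # 1 second of dummy "signal"
frames = block_into_frames(x, N, M)
print(frames.shape)           # one row per frame, N columns
```

With these values, consecutive frames overlap by N - M = 320 samples, so each sample in the interior of the signal appears in three frames, which is the usual trade-off between time resolution and spectral-estimate stability.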