Feature Extraction of the Speech Signal
Abstract Isolated speech recognition, speaker recognition, and continuous speech recognition require a feature vector extracted from the speech signal, which is subjected to pattern recognition to formulate the classifier. The feature vector is extracted from each frame of the speech signal under test. This chapter discusses various parameter extraction techniques, such as linear predictive coefficients (the filter coefficients of the vocal tract model), poles of the vocal tract filter, cepstral coefficients, mel-frequency cepstral coefficients (MFCC), line spectral coefficients, and reflection coefficients. Preprocessing techniques such as dynamic time warping, endpoint detection, and pre-emphasis are also discussed.
3.1 Endpoint Detection

The isolated speech signal recorded through the microphone has noise at both ends of the speech segment. The beginning and the end of the speech segment must therefore be identified within the recorded signal; this is known as endpoint detection, and it is performed as follows.

1. The speech signal S is divided into frames, and the sum-squared value (energy) of each frame is computed. The energy of a frame containing voiced speech (produced by vibration of the vocal cords) is usually greater than that of the noise. Identify the first frame whose energy exceeds a predefined upper threshold. From this point, search backwards for frames whose energy still exceeds a predefined lower threshold, and let the earliest such frame, denoted V, be the first frame of the voiced speech segment.

2. Let S(n) be the nth sample of the speech signal. If sgn(S(n)) sgn(S(n + 1)) is negative, a zero crossing has occurred at the nth sample. The number of zero crossings per frame is known as the zero-crossing rate. The zero-crossing rate of the unvoiced segment adjacent to the voiced segment is larger than that of the noise. Once the first frame of the voiced speech segment V is identified using the energy computation, the first frame of the unvoiced speech segment (if present) prior to V is identified as follows: starting from V, search the previous 25 frames backwards and choose the earliest frame whose zero-crossing rate exceeds a predefined threshold; this frame is declared the first unvoiced speech frame.

3. The above procedure is repeated, starting from the last sample of the speech signal, to identify the endpoint of the speech segment.

%endpointdetection.m
function [res1,res2,speechsegment,utforste,ltforste,ltforzcr]...
    =endpointdetection(S,FS)
%mzcr,mste - mean of the zero-crossing rate and the short-time energy
%            for the first 100 ms
%vzcr,vste - variance of the zero-crossing rate and the short-time energy

E. S. Gopi, Digital Speech Processing Using Matlab, Signals and Communication Technology, DOI: 10.1007/978-81-322-1677-3_3, © Springer India 2014
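The energy computation and threshold search of step 1 can be sketched as follows. The sketch is in Python rather than the chapter's MATLAB so it stays self-contained; the function names, the use of non-overlapping frames, and the two threshold values are illustrative assumptions, not taken from the chapter's code.

```python
import numpy as np

def frame_energies(s, frame_len):
    """Sum-squared energy of each non-overlapping frame of signal s
    (illustrative framing; real detectors often use overlapping frames)."""
    n_frames = len(s) // frame_len
    frames = s[:n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames ** 2).sum(axis=1)

def first_voiced_frame(energies, upper, lower):
    """Step 1: find the first frame whose energy exceeds the upper
    threshold, then walk backwards while the preceding frame's energy
    still exceeds the lower threshold. Returns the index V, or None
    if no frame crosses the upper threshold."""
    above = np.nonzero(energies > upper)[0]
    if above.size == 0:
        return None
    v = int(above[0])
    while v > 0 and energies[v - 1] > lower:
        v -= 1
    return v
```

The two-threshold scheme makes the detector robust: the upper threshold guarantees the frame is genuinely voiced, while the backward walk with the lower threshold recovers the quieter onset frames that precede it.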
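The zero-crossing computation of step 2 can also be sketched in Python. Since unvoiced speech near the voiced segment has a higher zero-crossing rate than the background noise, the sketch takes the earliest frame in the look-back window whose rate exceeds the threshold as the unvoiced onset; the helper names, the handling of zero-valued samples, and the fallback behaviour are assumptions of this sketch.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Number of zero crossings in a frame: a crossing occurs wherever
    sgn(S(n)) * sgn(S(n+1)) is negative (exactly-zero samples are not
    counted as crossings here)."""
    signs = np.sign(frame)
    return int(np.count_nonzero(signs[:-1] * signs[1:] < 0))

def first_unvoiced_frame(zcrs, v, zcr_threshold, lookback=25):
    """Step 2: among up to `lookback` frames preceding the first voiced
    frame `v`, return the earliest frame whose ZCR exceeds the threshold,
    taken as the start of unvoiced speech; fall back to `v` if none does."""
    for i in range(max(0, v - lookback), v):
        if zcrs[i] > zcr_threshold:
            return i
    return v
```

For example, an alternating-sign frame such as [1, -1, 1, -1] crosses zero between every pair of samples, giving a rate of 3 for 4 samples.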