A novel voice activity detection based on phoneme recognition using statistical model
RESEARCH
Open Access
Xulei Bao* and Jie Zhu
Abstract
In this article, a novel voice activity detection (VAD) approach based on phoneme recognition using a Gaussian mixture model based hidden Markov model (HMM/GMM) is proposed. Sophisticated speech features such as high order statistics (HOS), harmonic structure information and Mel-frequency cepstral coefficients (MFCCs) are employed to represent each speech/non-speech segment. The main idea of the method is to regard non-speech as a new phoneme alongside the conventional phonemes in Mandarin; all of them are then trained under the maximum likelihood principle with the Baum-Welch algorithm using the HMM/GMM model. The Viterbi decoding algorithm is finally used to find the most likely state sequence for the observed signals. The proposed method shows higher speech/non-speech detection accuracy over a wide range of SNR regimes than some existing VAD methods. We also propose a different method to demonstrate that conventional speech enhancement relying only on accurate VAD is not effective enough for automatic speech recognition (ASR) at low SNR regimes.

1 Introduction
Voice activity detection (VAD), a scheme for automatically detecting the presence of speech in the observed signals, plays an important role in speech signal processing [1-4]. This is because a highly accurate VAD can reduce bandwidth usage and network traffic in voice over IP (VoIP) and can improve the performance of speech recognition under noisy conditions. For example, there is growing interest in developing useful systems for automatic speech recognition (ASR) in different noisy environments [5,6], and most of these studies focus on developing more robust VAD systems to compensate for the harmful effect of noise on the speech signal.

Many algorithms have been developed in the last decade to achieve good VAD performance in real environments.
Many of them are based on heuristic rules applied to parameters such as linear predictive coding parameters, energy, formant shape, zero crossing rate, autocorrelation, cepstral features and periodicity measures [7-12]. For example, Fukuda et al. [11] replaced the traditional Mel-frequency cepstral coefficients (MFCCs) with harmonic structure information, which significantly improved the recognition rate of the ASR system. Li et al. [12] combined high order statistics (HOS) with the low band to full band energy ratio (LFER) for efficient speech/non-speech segmentation. However, algorithms that rely on speech features with heuristic rules have difficulty coping with all the noises observed in the real world.

Recently, the statistical model based VAD approach has come to be considered attractive for noisy speech. Sohn et al. [13] proposed a robust VAD algorithm based on a statistical likel

* Correspondence: [email protected] Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
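The decoding idea summarized in the abstract (non-speech modelled as one more phoneme, followed by a Viterbi search) can be illustrated with a toy sketch. Everything specific here is an assumption made for illustration and is not taken from the article: the state names `nonspeech`, `ph1` and `ph2`, the one-dimensional feature values, the transition probabilities, and the use of a single Gaussian per state in place of a trained GMM. Baum-Welch training is omitted; the parameters are simply fixed by hand.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian (stand-in for a per-state GMM)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(obs, states, log_init, log_trans, emit):
    """Return the most likely state sequence for obs under the given HMM."""
    # delta[s]: best log score of any path that ends in state s at this frame
    delta = {s: log_init[s] + emit(s, obs[0]) for s in states}
    backptrs = []
    for x in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            # Best predecessor of s at the previous frame
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            delta[s] = prev[best] + log_trans[best][s] + emit(s, x)
            ptr[s] = best
        backptrs.append(ptr)
    # Trace back from the best final state
    path = [max(states, key=lambda s: delta[s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Hypothetical model: non-speech is treated as just another "phoneme" state.
states = ["nonspeech", "ph1", "ph2"]
means = {"nonspeech": 0.0, "ph1": 4.0, "ph2": 8.0}  # assumed feature means
log_init = {s: math.log(1.0 / len(states)) for s in states}
log_trans = {
    r: {s: math.log(0.8 if r == s else 0.1) for s in states} for r in states
}

def emit(state, x):
    return log_gauss(x, means[state], 1.0)

# Toy 1-D feature sequence, one value per frame (assumed, not real data)
obs = [0.1, -0.2, 3.9, 4.2, 7.8, 8.1, 0.3]
path = viterbi(obs, states, log_init, log_trans, emit)

# The VAD decision falls out of the decoded labels: any phoneme means speech.
is_speech = [s != "nonspeech" for s in path]
print(path)
print(is_speech)
```

In this sketch the speech/non-speech decision needs no separate threshold: a frame counts as speech whenever the decoded label is anything other than the non-speech state. A real system would use multi-dimensional features and emission models trained from data.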