A novel voice activity detection based on phoneme recognition using statistical model


RESEARCH

Open Access

Xulei Bao* and Jie Zhu

* Correspondence: [email protected]
Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

Abstract

In this article, a novel voice activity detection (VAD) approach based on phoneme recognition using a Gaussian Mixture Model based Hidden Markov Model (HMM/GMM) is proposed. Sophisticated speech features such as high-order statistics (HOS), harmonic structure information and Mel-frequency cepstral coefficients (MFCCs) are employed to represent each speech/non-speech segment. The main idea of the new method is to treat non-speech as an additional phoneme alongside the conventional phonemes of Mandarin; all of them are then trained under the maximum likelihood principle with the Baum-Welch algorithm using the HMM/GMM model. Finally, the Viterbi decoding algorithm is used to search for the maximum-likelihood path given the observed signals. The proposed method shows higher speech/non-speech detection accuracy over a wide range of SNR regimes than some existing VAD methods. We also propose a different method to demonstrate that conventional speech enhancement, even with accurate VAD, is not effective enough for automatic speech recognition (ASR) at low SNR regimes.

1 Introduction

Voice activity detection (VAD), a scheme for automatically detecting the presence of speech in the observed signals, plays an important role in speech signal processing [1-4]. This is because highly accurate VAD can reduce bandwidth usage and network traffic in voice over IP (VoIP), and can improve the performance of speech recognition in noisy systems. For example, there is growing interest in developing useful systems for automatic speech recognition (ASR) in different noisy environments [5,6], and most of these studies focus on developing more robust VAD systems in order to compensate for the harmful effect of noise on the speech signal.

Many algorithms have been developed over the last decade to achieve good VAD performance in real environments. Many of them are based on heuristic rules applied to several parameters, such as linear predictive coding parameters, energy, formant shape, zero crossing rate, autocorrelation, cepstral features and periodicity measures [7-12]. For example, Fukuda et al. [11] replaced the traditional Mel-frequency cepstral coefficients (MFCCs) with harmonic structure information, which significantly improved the recognition rate of an ASR system. Li et al. [12] combined high-order statistics (HOS) with the low-band to full-band energy ratio (LFER) to separate speech/non-speech segments efficiently. However, algorithms based on speech features with heuristic rules have difficulty coping with all the noises observed in the real world.

Recently, statistical-model-based VAD has come to be considered an attractive approach for noisy speech. Sohn et al. [13] proposed a robust VAD algorithm based on a statistical likel
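
The decoding step named in the abstract, Viterbi search over a model in which non-speech is simply one more unit, can be illustrated with a toy example. The sketch below is not the authors' phoneme-level HMM/GMM system: it assumes only two states (0 = non-speech, 1 = speech), hand-picked transition probabilities, and per-frame speech probabilities supplied directly instead of being computed from trained GMMs. All names and numbers are hypothetical.

```python
import math

def viterbi(log_emissions, log_trans, log_init):
    """Return the most likely state path.

    log_emissions[t][s] is log p(frame t | state s). In an HMM/GMM system
    these values would come from the GMMs; here they are given directly.
    """
    n_states = len(log_init)
    # score[s]: best log-probability of any path ending in state s
    score = [log_init[s] + log_emissions[0][s] for s in range(n_states)]
    backptr = []
    for frame in log_emissions[1:]:
        prev, score, ptrs = score, [], []
        for s in range(n_states):
            # Best predecessor state for landing in state s at this frame
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            ptrs.append(best)
            score.append(prev[best] + log_trans[best][s] + frame[s])
        backptr.append(ptrs)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptrs in reversed(backptr):
        state = ptrs[state]
        path.append(state)
    return path[::-1]

# Assumed per-frame probability of speech (illustrative values only)
p_speech = [0.05, 0.05, 0.95, 0.95, 0.4, 0.95, 0.95, 0.05, 0.05]
log_emissions = [[math.log(1.0 - p), math.log(p)] for p in p_speech]

# Assumed "sticky" transitions: each state tends to persist
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
log_init = [math.log(0.5), math.log(0.5)]

path = viterbi(log_emissions, log_trans, log_init)
thresholded = [int(p > 0.5) for p in p_speech]
print("frame-wise threshold:", thresholded)
print("Viterbi path:        ", path)
```

With these assumed values, the frame-wise threshold drops the weak fifth frame to non-speech, whereas the Viterbi path keeps it inside the speech segment because two state switches cost more than one poorly matching frame. This temporal smoothing is one reason sequence decoding can be preferable to independent per-frame decisions.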

