An efficient voice activity detection algorithm by combining statistical model and energy detection


RESEARCH

Open Access

Ji Wu* and Xiao-Lei Zhang

Abstract

In this article, we present a new voice activity detection (VAD) algorithm that combines statistical models with an empirical rule-based energy detection algorithm. It separates speech segments from background noise in two steps. In the first step, the VAD efficiently detects candidate speech endpoints using the empirical rule-based energy detection algorithm. However, these candidate endpoints are not accurate enough when the signal-to-noise ratio is low. Therefore, in the second step, we propose a new Gaussian mixture model-based multiple-observation log likelihood ratio algorithm to align the endpoints to their optimal positions. Several experiments are conducted to evaluate the proposed VAD in terms of both accuracy and efficiency. The results show that it can achieve better performance than the six reference VADs in various noise scenarios.

Keywords: energy detection, Gaussian mixture model (GMM), multiple-observation, voice activity detection (VAD)
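To make the first step concrete, the following is a minimal sketch of generic frame-energy thresholding for finding candidate endpoints. It is not the empirical rule-based detector proposed in this article. The function names, the 160-sample frame length, the use of the first 10 frames as a noise-only reference, and the 6 dB margin are all assumptions chosen for illustration.

```python
import math
import random


def frame_log_energies(samples, frame_len):
    """Return the log-energy (dB) of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Small floor avoids log10(0) on all-zero frames.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def energy_vad(samples, frame_len=160, noise_frames=10, margin_db=6.0):
    """Flag frames whose log-energy exceeds the noise floor by margin_db.

    Assumption: the first `noise_frames` frames contain noise only.
    """
    energies = frame_log_energies(samples, frame_len)
    noise_floor = sum(energies[:noise_frames]) / noise_frames
    return [e > noise_floor + margin_db for e in energies]


def endpoints(flags):
    """Convert per-frame flags into (start, end) frame pairs, end exclusive."""
    segments, start = [], None
    for i, active in enumerate(flags):
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(flags)))
    return segments


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic test signal: noise, then a 200 Hz tone in noise, then noise.
    lead = [rng.gauss(0.0, 0.01) for _ in range(3200)]
    tone = [0.5 * math.sin(2 * math.pi * 200 * t / 8000) + rng.gauss(0.0, 0.01)
            for t in range(1600)]
    tail = [rng.gauss(0.0, 0.01) for _ in range(1600)]
    print(endpoints(energy_vad(lead + tone + tail)))
```

A fixed margin above a noise floor is easy to compute but becomes unreliable at low signal-to-noise ratio, which is the limitation the second, model-based step is meant to address.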

* Correspondence: [email protected]
Department of Electronic Engineering, Multimedia Signal and Intelligent Information Processing Laboratory, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing, China

1 Introduction

A voice activity detector (VAD) separates speech from background noise. It finds applications in many modern speech communication systems, such as speech recognition, speech coding, noisy speech enhancement, mobile telephony, and very small aperture terminals.

During the past few decades, researchers have tried many approaches to improve VAD performance. Traditional approaches include time-domain energy [1,2], pitch detection [3], and zero-crossing rate [2,4]. Recently, several new spectral energy-based features were proposed, including the energy-entropy feature [5], spatial signal correlation [6], cepstral features [7], higher-order statistics [8,9], Teager energy [10], spectral divergence [11], etc. The multi-band technique, which exploits the band differences between speech and noise, was also employed to construct features [12,13].

Meanwhile, statistical models have attracted much attention. Most of this work focused on finding a suitable model for the empirical distribution of speech. Sohn [14] assumed that the speech and noise signals in the discrete Fourier transform (DFT) domain follow independent Gaussian distributions. Gazor [15] further used the Laplace distribution to model speech signals. Chang [16] analyzed the Gaussian, Laplace, and Gamma distributions in the DFT domain and integrated them with a goodness-of-fit test. Tahmasbi [17] assumed that the speech process, after transformation by a GARCH filter, has a variance gamma distribution. Ramirez [18] proposed the multiple-observation likelihood ratio test (LRT) in place of the single-frame LRT [14], which greatly improved VAD performance. More recently, many mac