Improved monaural speech segregation based on computational auditory scene analysis
Wang Yu, Lin Jiajun*, Chen Ning and Yuan Wenhao
Abstract

Considerable effort has been devoted in computational auditory scene analysis (CASA) to segregating target speech from monaural mixtures. Based on the principles of CASA, this article proposes an improved algorithm for monaural speech segregation. To extract the energy feature more accurately, the proposed algorithm improves the threshold selection for response energy in the initial segmentation stage. Because the resulting mask often contains broken auditory element groups after the grouping stage, a smoothing stage based on morphological image processing is proposed. Through a combination of erosion and dilation operations, the algorithm suppresses intrusions by removing unwanted particles and enhances the segregated speech by filling in the broken auditory elements. Systematic evaluation shows that the proposed algorithm improves the output signal-to-noise ratio by an average of 8.55 dB and reduces the percentage of noise residue by an average of 25.36% relative to the unprocessed mixture, a significant improvement in speech segregation.

Keywords: Speech segregation, Computational Auditory Scene Analysis (CASA), Threshold selection, Morphological image processing
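The smoothing stage summarized above can be illustrated with a small sketch. It applies binary morphology to a synthetic time-frequency mask; the closing-then-opening order, the default cross-shaped structuring element, and the toy mask itself are this sketch's own assumptions, not the paper's exact configuration:

```python
import numpy as np
from scipy import ndimage

# Synthetic binary time-frequency mask: rows = frequency channels,
# columns = time frames, True = unit labeled as target-dominant.
mask = np.zeros((8, 12), dtype=bool)
mask[2:6, 2:10] = True   # a coherent target region
mask[3, 5] = False       # a "broken" auditory element (hole) inside it
mask[1, 1] = True        # an isolated intrusion particle

# Dilation followed by erosion (morphological closing) fills small
# holes, reconnecting broken auditory elements.
closed = ndimage.binary_closing(mask)

# Erosion followed by dilation (morphological opening) removes isolated
# particles too small for the structuring element to fit inside.
smoothed = ndimage.binary_opening(closed)

print(smoothed[3, 5])   # hole filled -> True
print(smoothed[1, 1])   # particle removed -> False
```

Note that opening also rounds off region corners, so in practice the structuring element should be chosen small relative to genuine target segments.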
1 Introduction

While monaural speech segregation remains a challenge for computers, humans can distinguish and track a speech signal of interest in a wide range of noisy environments. In 1990, Bregman published his book Auditory Scene Analysis [1], the first systematic account of the principles underlying the perception of complex acoustic mixtures; it inspired a computational counterpart, computational auditory scene analysis (CASA) [2]. CASA simulates the human auditory system, and its processing of a speech mixture parallels human auditory perception. A CASA system consists of two main stages: segmentation and grouping. The segmentation stage decomposes the input signal into sensory segments, and the grouping stage combines segments that are likely to come from the same source into a "target stream". Because CASA can address the monaural speech separation problem, it has seen continuous and substantial improvement in recent years.

*Correspondence: jjlin [email protected]
School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
The CASA system proposed by Brown and Cooke employs maps of different auditory features generated from the output of a cochlear model for speech segregation. The system requires no prior knowledge of the input signal but has several limitations: it cannot handle the sequential grouping problem effectively and often leaves missing parts in the segregated speech [3]. Wang and Brown [2,4] proposed a CASA model that segregates voiced speech based on oscillatory correlation, using harmonicity and temporal continuity as the major grouping cues. This model is able to recover most of the target
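The harmonicity cue mentioned above is commonly derived from a correlogram; the sketch below illustrates the underlying idea with a normalized autocorrelation test at an assumed pitch period. The function name, the 0.85 threshold, and the synthetic signals are assumptions of this sketch, not part of the original model:

```python
import numpy as np

def harmonicity_label(unit_response, pitch_period, threshold=0.85):
    """Label a time-frequency unit as target-dominant when its normalized
    autocorrelation at the estimated pitch period is high, i.e. the unit's
    response is consistent with the target's harmonic structure."""
    r = unit_response - np.mean(unit_response)
    a, b = r[:-pitch_period], r[pitch_period:]
    norm = np.sqrt(np.dot(a, a) * np.dot(b, b))
    if norm == 0.0:
        return False
    return bool(np.dot(a, b) / norm > threshold)

# Synthetic responses: a voiced (periodic) unit and a noise-dominated unit.
pitch_period = 80                    # assumed pitch period in samples
t = np.arange(800)
periodic = np.sin(2 * np.pi * t / pitch_period)
noise = np.random.default_rng(0).normal(size=800)

print(harmonicity_label(periodic, pitch_period))  # True: strongly periodic
print(harmonicity_label(noise, pitch_period))     # False: no periodicity
```

A shift by one full pitch period leaves a periodic response essentially unchanged, so its lag-correlation is near 1, while for broadband noise it stays near 0.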