Improved monaural speech segregation based on computational auditory scene analysis
Wang Yu, Lin Jiajun*, Chen Ning and Yuan Wenhao
Abstract

Considerable effort has been devoted in computational auditory scene analysis (CASA) to segregating target speech from monaural mixtures. Based on the principles of CASA, this article proposes an improved algorithm for monaural speech segregation. To extract the energy feature more accurately, the proposed algorithm improves the threshold selection for response energy in the initial segmentation stage. Because the resulting mask often contains broken auditory element groups after the grouping stage, a smoothing stage based on morphological image processing is proposed. Through a combination of erosion and dilation operations, the algorithm suppresses intrusions by removing unwanted particles and enhances the segregated speech by filling in the broken auditory elements. Systematic evaluation shows that the proposed algorithm improves the output signal-to-noise ratio by an average of 8.55 dB and reduces the percentage of noise residue by an average of 25.36% relative to the unprocessed mixture, a significant improvement in speech segregation.

Keywords: Speech segregation, Computational Auditory Scene Analysis (CASA), Threshold selection, Morphological image processing
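The smoothing stage summarized above can be illustrated with a small sketch. It applies binary morphology to a synthetic time-frequency mask; the closing-then-opening order, the default cross-shaped structuring element, and the toy mask itself are this sketch's own assumptions, not the paper's exact configuration:

```python
import numpy as np
from scipy import ndimage

# Synthetic binary time-frequency mask: rows = frequency channels,
# columns = time frames, True = unit labeled as target-dominant.
mask = np.zeros((8, 12), dtype=bool)
mask[2:6, 2:10] = True   # a coherent target region
mask[3, 5] = False       # a "broken" auditory element (hole) inside it
mask[1, 1] = True        # an isolated intrusion particle

# Dilation followed by erosion (morphological closing) fills small
# holes, reconnecting broken auditory elements.
closed = ndimage.binary_closing(mask)

# Erosion followed by dilation (morphological opening) removes isolated
# particles too small for the structuring element to fit inside.
smoothed = ndimage.binary_opening(closed)

print(smoothed[3, 5])   # hole filled -> True
print(smoothed[1, 1])   # particle removed -> False
```

Note that opening also rounds off region corners, so in practice the structuring element should be chosen small relative to genuine target segments.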
1 Introduction

While monaural speech segregation remains a challenge for computers, humans can distinguish and track a speech signal of interest in a wide range of noisy environments. In 1990, Bregman published his book Auditory Scene Analysis [1], the first systematic account of the principles underlying the perception of complex acoustic mixtures; it inspired a computational counterpart, computational auditory scene analysis (CASA) [2]. CASA simulates the human auditory system, and its processing of a speech mixture parallels human auditory perception. A CASA system consists of two main stages: segmentation and grouping. The segmentation stage decomposes the input signal into sensory segments, and the grouping stage combines segments that are likely to come from the same source into a "target stream". Because CASA can address the monaural speech separation problem, it has seen continuous and substantial improvement in recent years.

*Correspondence: jjlin [email protected]
School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
The CASA system proposed by Brown and Cooke employs maps of different auditory features generated from the output of a cochlear model for speech segregation. The system requires no prior knowledge of the input signal but has several limitations: it cannot handle the sequential grouping problem effectively and often leaves missing parts in the segregated speech [3]. Wang and Brown [2,4] proposed a CASA model that segregates voiced speech based on oscillatory correlation, using harmonicity and temporal continuity as the major grouping cues. This model is able to recover most of the target
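The harmonicity cue mentioned above is commonly derived from a correlogram; the sketch below illustrates the underlying idea with a normalized autocorrelation test at an assumed pitch period. The function name, the 0.85 threshold, and the synthetic signals are assumptions of this sketch, not part of the original model:

```python
import numpy as np

def harmonicity_label(unit_response, pitch_period, threshold=0.85):
    """Label a time-frequency unit as target-dominant when its normalized
    autocorrelation at the estimated pitch period is high, i.e. the unit's
    response is consistent with the target's harmonic structure."""
    r = unit_response - np.mean(unit_response)
    a, b = r[:-pitch_period], r[pitch_period:]
    norm = np.sqrt(np.dot(a, a) * np.dot(b, b))
    if norm == 0.0:
        return False
    return bool(np.dot(a, b) / norm > threshold)

# Synthetic responses: a voiced (periodic) unit and a noise-dominated unit.
pitch_period = 80                    # assumed pitch period in samples
t = np.arange(800)
periodic = np.sin(2 * np.pi * t / pitch_period)
noise = np.random.default_rng(0).normal(size=800)

print(harmonicity_label(periodic, pitch_period))  # True: strongly periodic
print(harmonicity_label(noise, pitch_period))     # False: no periodicity
```

A shift by one full pitch period leaves a periodic response essentially unchanged, so its lag-correlation is near 1, while for broadband noise it stays near 0.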