Speech Enhancement with Natural Sounding Residual Noise Based on Connected Time-Frequency Speech Presence Regions
Karsten Vandborg Sørensen
Department of Communication Technology, Aalborg University, DK-9220 Aalborg East, Denmark
Email: [email protected]
Søren Vang Andersen
Department of Communication Technology, Aalborg University, DK-9220 Aalborg East, Denmark
Email: [email protected]

Received 13 May 2004; Revised 3 March 2005

We propose time-frequency domain methods for noise estimation and speech enhancement. A speech presence detection method is used to find connected time-frequency regions of speech presence. These regions are used by a noise estimation method, and both the speech presence decisions and the noise estimate are used in the speech enhancement method. Different attenuation rules are applied to regions with and without speech presence to achieve enhanced speech with natural sounding attenuated background noise. The proposed speech enhancement method has a computational complexity that makes it feasible for application in hearing aids. An informal listening test shows that the proposed speech enhancement method has significantly higher mean opinion scores than minimum mean-square error log-spectral amplitude (MMSE-LSA) and decision-directed MMSE-LSA.

Keywords and phrases: speech enhancement, noise estimation, minimum statistics, speech presence detection.
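As a rough illustration of applying different attenuation rules inside and outside speech presence regions, the following minimal sketch computes one amplitude gain per time-frequency bin. It is not the method of this paper: the function name `apply_region_gains`, the parameters `absence_gain` and `gain_floor`, and the floored power spectral subtraction rule used in speech presence bins are all assumptions chosen only to make the idea concrete. The speech presence mask and the noise power estimate are taken as given inputs.

```python
def apply_region_gains(noisy_power, noise_power, speech_present,
                       absence_gain=0.1, gain_floor=0.1):
    """Return one amplitude gain per time-frequency bin (hypothetical helper).

    noisy_power[t][k]    : noisy-speech power in frame t, frequency bin k
    noise_power[t][k]    : estimated noise power, same shape
    speech_present[t][k] : True where the bin lies in a speech presence region
    """
    gains = []
    for p_row, n_row, m_row in zip(noisy_power, noise_power, speech_present):
        row = []
        for p, n, present in zip(p_row, n_row, m_row):
            if present:
                # Assumed rule for speech presence: power spectral
                # subtraction gain, limited from below by gain_floor.
                g = max(1.0 - n / p, 0.0) ** 0.5 if p > 0 else 0.0
                row.append(max(g, gain_floor))
            else:
                # Speech absence: a constant gain scales the background
                # noise down without changing its spectral shape.
                row.append(absence_gain)
        gains.append(row)
    return gains


# Two frames, two frequency bins, unit noise power everywhere.
example = apply_region_gains(
    noisy_power=[[4.0, 1.0], [1.0, 9.0]],
    noise_power=[[1.0, 1.0], [1.0, 1.0]],
    speech_present=[[True, False], [False, True]],
)
```

Multiplying each noisy short-time spectral coefficient by the corresponding gain would give the enhanced coefficient. Because the gain is constant wherever speech is marked absent, the residual noise in those bins is a scaled copy of the original background noise.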
1. INTRODUCTION
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The performance of many speech enhancement methods relies mainly on the quality of a noise power spectral density (PSD) estimate. When the noise estimate differs from the true noise, it will lead to artifacts in the enhanced speech. The approach taken in this paper is based on connected region speech presence detection. Our aim is to exploit spectral and temporal masking mechanisms in the human auditory system [1] to reduce the perception of these artifacts in speech presence regions and eliminate the artifacts in speech absence regions. We achieve this by leaving downscaled natural sounding background noise in the enhanced speech in connected time-frequency regions with speech absence. The downscaled natural sounding background noise will spectrally and temporally mask artifacts in the speech estimate while preserving the naturalness of the background noise. In the definition of speech presence regions, we are inspired by the work of Yang [2]. Yang demonstrates high perceptual quality of a speech enhancement method where constant gain is applied in frames with no detected speech presence. Yang lets a single decision cover a full frame. Thus, musical noise is present in the full spectrum of the enhanced speech in frames with speech activity. We therefore extend the notion of speech presence to individual time-frequency locations. This, in our experience, significantly improves the naturalness of the residual noise. The speech