Single Channel multi-speaker speech Separation based on quantized ratio mask and residual network



Shanfa Ke 1,2 · Ruimin Hu 1,2 · Xiaochen Wang 1,2 · Tingzhao Wu 1,3 · Gang Li 1,3 · Zhongyuan Wang 1,3

Received: 15 September 2019 / Revised: 15 June 2020 / Accepted: 21 July 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

The recently proposed deep clustering-based algorithms represent a fundamental advance towards the single-channel multi-speaker speech separation problem. These methods use an ideal binary mask (IBM) to construct the objective function and the K-means clustering method to estimate the ideal binary mask. However, when the sources belong to the same class or the number of sources is large, the assumption that one time-frequency unit of the mixture is dominated by only one source becomes weak, and IBM-based separation causes spectral holes or aliasing. In our work, we instead propose the quantized ideal ratio mask: the ideal ratio mask is quantized so that the output of the neural network takes only a limited number of possible values. The quantized ideal ratio mask is then used to construct the objective function for the case of multi-source domination, improving network performance. Furthermore, a network framework that combines a residual network, a recurrent network, and a fully connected network is used to exploit frequency correlation information. We evaluated our system on the TIMIT dataset and show a 1.6 dB SDR improvement over previous state-of-the-art methods.

Keywords: Multi-speaker · Speech separation · Deep clustering · Quantized IRM · Residual network
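The core idea in the abstract can be sketched numerically: compute an ideal ratio mask (source magnitude over mixture magnitude per time-frequency unit) and snap each mask value to the nearest of a small set of levels, so the network's per-unit output becomes a classification over those levels rather than a free continuous value. This is a minimal NumPy sketch, not the authors' implementation; the function names, the number of levels, and the uniform level spacing are illustrative assumptions.

```python
import numpy as np

def ideal_ratio_mask(source_mag, mixture_mag, eps=1e-8):
    """Ideal ratio mask: per T-F unit, the fraction of the mixture
    magnitude attributed to this source, clipped to [0, 1]."""
    return np.clip(source_mag / (mixture_mag + eps), 0.0, 1.0)

def quantize_mask(mask, num_levels=4):
    """Quantize mask values in [0, 1] to num_levels uniformly spaced
    levels. Returns the quantized mask and the level indices, which
    can serve as class targets for a network output layer."""
    levels = np.linspace(0.0, 1.0, num_levels)
    # Nearest-level assignment per T-F unit.
    indices = np.argmin(np.abs(mask[..., None] - levels), axis=-1)
    return levels[indices], indices
```

With `num_levels=2` this degenerates to an ideal binary mask, which makes the relationship between the two mask types explicit: the quantized IRM interpolates between the hard IBM assumption and a fully continuous ratio mask.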

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s11042-020-09419-y) contains supplementary material, which is available to authorized users.

* Ruimin Hu [email protected]

1 National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan 430072, China

2 Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan 430072, China

3 Collaborative Innovation Center of Geospatial Technology, Wuhan 430079, China

Multimedia Tools and Applications

1 Introduction

Human beings have an extraordinary ability to selectively attend to one speaker in the presence of other speakers and background noise, the so-called cocktail-party effect. But solving this cocktail party problem [5] has proven extremely challenging for computers. Speech separation gives computers this skill: recovering the speech source signals of interest from one or more observed mixtures. It is an attractive research field with many applications, e.g. automatic speech recognition (ASR) [10, 26], speech enhancement [32] and hearing aids [28]. Driven by these applications, speech separation has been extensively studied over the past decades. One well-known approach is independent component analysis (ICA) [4, 16], which separates the mixture by estimating an unmixing matrix (t