Permutation Correction in the Frequency Domain in Blind Separation of Speech Mixtures



Ch. Servière¹ and D. T. Pham²

¹ Laboratoire des Images et des Signaux, BP 46, 38402 St Martin d’Hère Cedex, France
² Laboratoire de Modélisation et Calcul, BP 53, 38041 Grenoble Cedex, France

Received 31 January 2005; Revised 26 August 2005; Accepted 1 September 2005

This paper presents a method for blind separation of convolutive mixtures of speech signals, based on the joint diagonalization of the time-varying spectral matrices of the observation records. The main and still largely open problem in a frequency-domain approach is permutation ambiguity. An earlier paper by the authors exploited the continuity of the frequency response of the unmixing filters, but that approach leaves some frequency permutation jumps. This paper therefore proposes a new method based on two assumptions. First, the frequency continuity of the unmixing filters is still used in the initialization of the diagonalization algorithm. Second, the time-frequency representations of the sources are assumed to vary smoothly with frequency. This hypothesis of the continuity of the time variation of the source energy is exploited on a sliding frequency bandwidth, which allows us to detect the remaining frequency permutation jumps. The method is compared with other approaches, and results on real-world recordings demonstrate the superior performance of the proposed algorithm.

Copyright © 2006 Hindawi Publishing Corporation. All rights reserved.

1. INTRODUCTION

Blind source separation consists in extracting independent sources from their mixtures, without relying on any specific knowledge of the sources. Earlier work focused on linear instantaneous mixtures, for which several efficient algorithms have been developed. The problem is much more difficult in the case of convolutive mixtures, especially audio mixtures. Although there have been many works on this subject [1–3], the successful application of the proposed algorithms in realistic settings is still elusive [4], mainly because of the long impulse responses of the mixing filters. To blindly separate the sources, one would have to find an “inverse filter” (which would also have a long response) such that the recovered sources are as mutually independent as possible. A direct (time-domain) approach would be too computationally heavy, and its convergence would be difficult, since it requires the adjustment of too many parameters. However, by using the Fourier transform, the separation problem for convolutive mixtures can be recast as a set of separation problems for instantaneous mixtures, one associated with each frequency bin, which can be solved independently. But the discrete Fourier transform tends to produce nearly Gaussian variables, and it is well known that blind separation of instantaneous mixtures requires non-Gaussianity. Fortunately, speech signals are highly nonstationary, so a promising approach is to exploit this nonstationarity to separate their mixtures using only