A hybrid speech enhancement system with DNN based speech reconstruction and Kalman filtering


Hongjiang Yu1 · Wei-Ping Zhu1 · Zhiheng Ouyang1 · Benoit Champagne2

Received: 10 November 2019 / Revised: 22 June 2020 / Accepted: 6 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

In this paper, we propose a hybrid speech enhancement system that exploits a deep neural network (DNN) for speech reconstruction and Kalman filtering for further denoising, with the aim of improving performance under unseen noise conditions. First, two separate DNNs are trained to learn the mappings from noisy acoustic features to the clean speech magnitudes and line spectrum frequencies (LSFs), respectively. The estimated clean magnitudes are then combined with the phase of the noisy speech to reconstruct the estimated clean speech, while the LSFs are converted to linear prediction coefficients (LPCs) to implement Kalman filtering. Finally, the reconstructed speech is Kalman-filtered to further remove the residual noise. The proposed hybrid system takes advantage of both DNN-based reconstruction and traditional Kalman filtering, and can work reliably in either matched or unmatched acoustic environments. Computer-based experiments are conducted to evaluate the proposed hybrid system against traditional iterative Kalman filtering and several state-of-the-art DNN-based methods under both seen and unseen noises. The results show that, compared to the DNN-based methods, the hybrid system achieves similar performance under seen noise but notably better performance under unseen noise, in terms of both speech quality and intelligibility.

Keywords Speech enhancement · Deep neural network · Kalman filter · Unmatched acoustic environment
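The last two stages of the pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `reconstruct_frame` and `kalman_filter_ar` are hypothetical, the DNN outputs (clean magnitudes and LPCs derived from LSFs) are assumed to be given, the LSF-to-LPC conversion is omitted, and the residual noise is assumed to be white with known variance.

```python
import numpy as np


def reconstruct_frame(est_mag, noisy_frame):
    """Combine an estimated clean magnitude spectrum with the noisy phase.

    est_mag stands in for the magnitude-DNN output for one frame (assumption).
    """
    noisy_spec = np.fft.rfft(noisy_frame)
    est_spec = est_mag * np.exp(1j * np.angle(noisy_spec))
    return np.fft.irfft(est_spec, n=len(noisy_frame))


def kalman_filter_ar(y, lpc, q_var, r_var):
    """Kalman-filter y under an autoregressive speech model.

    Assumed model: s(k) = sum_i lpc[i-1] * s(k-i) + w(k),  y(k) = s(k) + v(k),
    where w has variance q_var and v is white with variance r_var.
    The sign convention of lpc is an assumption of this sketch.
    """
    p = len(lpc)
    # State x(k) = [s(k-p+1), ..., s(k)]; F is the companion matrix.
    F = np.zeros((p, p))
    F[:-1, 1:] = np.eye(p - 1)            # shift older samples toward index 0
    F[-1, :] = np.asarray(lpc)[::-1]      # predict the newest sample
    h = np.zeros(p)
    h[-1] = 1.0                           # only the newest sample is observed
    Q = q_var * np.outer(h, h)            # excitation drives the newest sample

    x = np.zeros(p)
    P = np.eye(p)                         # arbitrary initial uncertainty
    out = np.empty(len(y))
    for k in range(len(y)):
        # Prediction step
        x_pred = F @ x
        P_pred = F @ P @ F.T + Q
        # Update step with a scalar observation
        innovation = y[k] - x_pred[-1]
        gain = P_pred[:, -1] / (P_pred[-1, -1] + r_var)
        x = x_pred + gain * innovation
        P = P_pred - np.outer(gain, P_pred[-1, :])
        out[k] = x[-1]
    return out
```

In a full system, `reconstruct_frame` would be applied to each analysis frame and followed by overlap-add, and `kalman_filter_ar` would be run frame by frame with the LPCs estimated for that frame. Both details are left out here for brevity.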

Hongjiang Yu
ho [email protected]

1 Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada
2 Department of Electrical and Computer Engineering, McGill University, Montreal, Canada

Multimedia Tools and Applications

1 Introduction

In real-world environments, speech signals are often corrupted by a wide range of background noises. These disturbances cause problems in applications including voice communication, automatic speech recognition and speaker identification. As a result, speech enhancement, which aims to improve speech quality and intelligibility, has been intensively studied over the past several decades, and will likely continue to be an active research topic in speech processing, recognition and communication. Various denoising methods have been proposed in the literature, among which statistical filtering received the earliest attention. Wiener filtering is one of the best-known methods in this category; its goal is to find the optimal minimum mean square error (MMSE) estimate of the clean speech's discrete Fourier transform (DFT) coefficients [11]. Wiener filtering leaves broadband residual noise rather than musical noise in the enhanced speech, which is undesirable though often acceptable. Kalman filt