Multi-objective long-short term memory recurrent neural networks for speech enhancement


ORIGINAL RESEARCH

Nasir Saleem1,2 · Muhammad Irfan Khattak1 · Mu’ath Al‑Hasan3 · Atif Jan1

Received: 25 July 2020 / Accepted: 3 October 2020

© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract

Speech-in-noise perception is an important research problem in many real-world multimedia applications. Noise-reduction methods have contributed significantly; however, they rely on a priori information about the noise signals. Deep learning approaches have been developed to enhance speech signals in nonstationary noisy backgrounds, and their benefits are evaluated in terms of perceived speech quality and intelligibility. In this paper, a multi-objective speech enhancement method based on the Long-Short Term Memory (LSTM) recurrent neural network (RNN) is proposed to simultaneously estimate the magnitude and phase spectra of clean speech. During training, the noisy phase spectrum is incorporated as a target, and the unstructured phase spectrum is transformed to its derivative, which has a structure identical to that of the corresponding magnitude spectrum. Critical Band Importance Functions (CBIFs) are used in the training process to further improve network performance. The results verify that the proposed multi-objective LSTM (MO-LSTM) outperforms the standard magnitude-aware LSTM (MA-LSTM), magnitude-aware DNN (MA-DNN), phase-aware DNN (PA-DNN), magnitude-aware GNN (MA-GNN) and magnitude-aware CNN (MA-CNN). Moreover, the proposed speech enhancement method considerably improves speech quality, intelligibility, noise reduction and automatic speech recognition in changing noisy backgrounds, which is confirmed by ANalysis Of VAriance (ANOVA) statistical analysis.

Keywords  Speech enhancement · LSTM · DNN · ASR · RNN · Intelligibility · Speech quality
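As a rough illustration of the two training targets mentioned in the abstract, the sketch below computes a magnitude spectrogram and a time derivative of the phase from a short-time Fourier transform. It is a minimal sketch under stated assumptions, not the authors' implementation: the frame length, hop size, Hann window and function names are hypothetical, and the derivative used here (frame-to-frame phase difference with the expected bin-centre phase advance removed, wrapped to the principal interval) is only one common choice, since this excerpt does not define the transform used in the paper.

```python
import numpy as np


def stft(signal, frame_len=256, hop=128):
    """Hann-windowed STFT; returns a complex array of shape (frames, bins).

    frame_len and hop are illustrative values, not taken from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def magnitude_and_phase_derivative(signal, frame_len=256, hop=128):
    """Return (magnitude, phase_derivative) with matching shapes.

    The phase derivative is an assumed definition: the phase difference
    between consecutive frames, minus the advance expected for a sinusoid
    at each bin centre, wrapped to (-pi, pi].
    """
    spec = stft(signal, frame_len, hop)
    magnitude = np.abs(spec)
    phase = np.angle(spec)

    # Expected phase advance per hop for a sinusoid at the centre of bin k.
    bins = np.arange(spec.shape[1])
    expected_advance = 2.0 * np.pi * bins * hop / frame_len

    # Frame-to-frame difference, de-trended and wrapped.
    delta = np.diff(phase, axis=0) - expected_advance
    delta = np.angle(np.exp(1j * delta))

    # Drop the first magnitude frame so both targets align in time.
    return magnitude[1:], delta
```

For a sinusoid located exactly at a bin centre, this derivative is close to zero at that bin, whereas the raw phase shows no such regularity; this is the sense in which a derivative can expose structure that the wrapped phase hides.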

1 Introduction

Speech enhancement aims to restore clean speech from noisy speech. In conventional speech enhancement algorithms (Boll 1979; Cohen and Berdugo 2001; Ephraim and Van Trees 1995; Ephraim and Malah 1985; Saleem et al. 2019a, b, c, d; Saleem and Irfan 2018; Shoba and Rajavel 2020; Zao et al. 2014), such restoration is based on unsupervised mathematical hypotheses about the speech or noise signals. These algorithms often introduce musical noise artifacts, which limit the performance of the speech enhancement.

* Nasir Saleem [email protected]

1 Department of Electrical Engineering, University of Engineering and Technology, Peshawar 25000, KPK, Pakistan
2 Department of Electrical Engineering, FET, Gomal University, Dera Ismail Khan 29050, KPK, Pakistan
3 College of Engineering, Al Ain University, Al Ain, United Arab Emirates

Supervised machine learning approaches to speech enhancement have demonstrated remarkable potential for improving the quality and intelligibility of noisy speech. Non-negative matrix factorization (NMF) (Kwon et al. 2014) is one well-known example of a machine learning approach, in which speech and noise basis functions are acquired indepe