2020 21(11):1639-1650
Frontiers of Information Technology & Electronic Engineering
www.jzus.zju.edu.cn; engineering.cae.cn; www.springerlink.com
ISSN 2095-9184 (print); ISSN 2095-9230 (online); E-mail: [email protected]
Latent source-specific generative factor learning for monaural speech separation using weighted-factor autoencoder*

Jing-jing CHEN1, Qi-rong MAO‡1,2, You-cai QIN1, Shuang-qing QIAN1, Zhi-shen ZHENG1

1School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
2Jiangsu Key Laboratory of Security Technology for Industrial Cyberspace, Zhenjiang 212013, China

E-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
Received Jan. 13, 2020; Revision accepted June 21, 2020; Crosschecked Sept. 8, 2020
Abstract: Much recent progress in monaural speech separation (MSS) has been achieved through deep learning architectures based on autoencoders, which use an encoder to condense the input signal into compressed features and then feed these features into a decoder to reconstruct a specific audio source of interest. However, these approaches can neither learn the generative factors of the original input nor reconstruct every audio source in the mixture. In this study, we propose a novel weighted-factor autoencoder (WFAE) model for MSS, which introduces a regularization loss into the objective function so that each isolated source is free of interference from the other sources. By incorporating a latent attention mechanism and a supervised source constructor in the separation layer, WFAE learns source-specific generative factors and a set of discriminative features for each source, leading to improved MSS performance. Experiments on benchmark datasets show that our approach outperforms existing methods: in terms of three important metrics, WFAE performs well on a relatively challenging MSS setting, namely speaker-independent MSS.

Key words: Speech separation; Generative factors; Autoencoder; Deep learning
https://doi.org/10.1631/FITEE.2000019
CLC number: TN912.3
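The abstract describes the architecture only at a high level. As a rough illustration (not the authors' implementation), the forward pass of such a weighted-factor autoencoder could be sketched as follows; all layer sizes, the single-linear-layer encoder/decoder, and every function and variable name here are assumptions made for clarity, and the weights are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions: F frequency bins per frame, D latent factors, S sources.
F, D, S = 129, 40, 2

# Randomly initialized (untrained) weights, standing in for learned parameters.
W_enc = rng.standard_normal((F, D)) * 0.1      # shared encoder
W_att = rng.standard_normal((S, D, D)) * 0.1   # per-source latent attention
W_dec = rng.standard_normal((S, D, F)) * 0.1   # per-source decoders

def separate(mixture):
    """Map a (T, F) mixture spectrogram to (S, T, F) estimated sources."""
    z = np.tanh(mixture @ W_enc)               # shared latent generative factors, (T, D)
    sources = []
    for s in range(S):
        a = softmax(z @ W_att[s], axis=-1)     # attention weights over latent factors
        z_s = a * z                            # source-specific weighted factors
        sources.append(np.maximum(z_s @ W_dec[s], 0.0))  # non-negative spectrogram
    return np.stack(sources)

est = separate(rng.standard_normal((100, F)))
print(est.shape)  # (2, 100, 129)
```

In this sketch the attention weights re-weight a shared latent code per source, which is one plausible reading of "source-specific generative factors"; the paper's actual separation layer, losses, and training procedure are defined in the sections that follow.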
1 Introduction

Speech separation, also known as audio source separation, is a significant task in signal processing. Its aim is to separate target speech from a mixed audio signal, which is important for many real-world applications. For example, it can separate clean speech from a noisy speech signal to improve the accuracy of automatic speech recognition.

‡ Corresponding author
* Project supported by the Key Project of the National Natural Science Foundation of China (No. U1836220), the National Natural Science Foundation of China (No. 61672267), the Qing Lan Talent Program of Jiangsu Province, China, and the Key Innovation Project of Undergraduate Students in Jiangsu Province, China (No. 201810299045Z)
ORCID: Jing-jing CHEN, https://orcid.org/0000-0003-2968-0313; Qi-rong MAO, https://orcid.org/0000-0002-0616-4431
© Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 2020