DNN-based speech enhancement with self-attention on feature dimension



Jiaming Cheng¹ · Ruiyu Liang² · Li Zhao¹

Received: 13 November 2019 / Revised: 7 June 2020 / Accepted: 13 July 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

To make full use of the key information in frame-level features, a DNN-based model for speech enhancement is proposed that uses self-attention on the feature dimension. Two improvement strategies are adopted to strengthen the attention of the fully connected layers to the effective information in the features. First, the model fuses feature domains at the input, using a 136-dimensional combination of features comprising the MFCC, AMS, RASTA-PLP, cochleagram, and PNCC. This fusion complements information from different domains, including the mel domain and the gammatone domain, thus providing more effective information for self-attention. Second, a feature-level self-attention mechanism is applied to the output of the fully connected layer to extract task-relevant information. Feature-level attention enables the fully connected layers to capture the internal correlations between different features and to reduce the redundancy introduced by multiple features. The experimental results show that, compared to the noisy signals, the proposed algorithm increased the PESQ, fwsegSNR, and STOI by 40.92%, 60.2%, and 8.31%, respectively, under the matched noise condition, and by 23.64%, 32.55%, and 3.4%, respectively, under the mismatched noise condition. Comparisons between different neural networks indicate that the proposed algorithm outperforms the compared algorithms in both the matched and mismatched conditions while using fewer context frames. Therefore, the proposed model can effectively exploit the key information in the features to suppress noise, thereby improving speech quality and generalizing to mismatched samples.

Keywords Speech enhancement · Deep neural network · Feature fusion · Self-attention

Jiaming Cheng: [email protected]
Ruiyu Liang: [email protected]
Li Zhao: [email protected]

1 School of Information Science and Engineering, Southeast University, Nanjing, 211189, China
2 School of Communication Engineering, Nanjing Institute of Technology, Nanjing, 211167, China
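As a rough illustration of the two strategies described in the abstract, the following PyTorch sketch concatenates the five feature types into a 136-dimensional frame vector and applies a feature-level attention gate to a fully connected layer's output. The layer widths, the sigmoid-gating formulation of the attention, the per-feature sub-dimensions (chosen only so that they sum to 136), and the mask-style output are all illustrative assumptions; the paper's exact architecture is not given in this excerpt.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureSelfAttentionDNN(nn.Module):
        """Fully connected enhancement network with self-attention
        applied on the feature dimension of a hidden layer's output.
        Hyperparameters are assumptions, not the paper's settings."""

        def __init__(self, in_dim=136, hidden_dim=1024, out_dim=64):
            super().__init__()
            # in_dim = 136: fused frame-level features (MFCC + AMS +
            # RASTA-PLP + cochleagram + PNCC), concatenated per frame.
            self.fc1 = nn.Linear(in_dim, hidden_dim)
            self.fc2 = nn.Linear(hidden_dim, hidden_dim)
            # Feature-level self-attention: score each hidden unit from
            # the layer output itself, then reweight the features
            # (a simple gating formulation, assumed for illustration).
            self.attn = nn.Linear(hidden_dim, hidden_dim)
            # Output layer, e.g. a per-frame time-frequency mask.
            self.out = nn.Linear(hidden_dim, out_dim)

        def forward(self, x):
            # x: (batch, 136) fused features for one frame
            h = F.relu(self.fc1(x))
            h = F.relu(self.fc2(h))
            a = torch.sigmoid(self.attn(h))   # weights per feature unit
            h = a * h                         # emphasize informative units
            return torch.sigmoid(self.out(h))

    # Example: fuse per-frame features and run one forward pass.
    # The sub-dimensions below are placeholders that sum to 136.
    mfcc = torch.randn(8, 31)
    ams = torch.randn(8, 15)
    rasta_plp = torch.randn(8, 13)
    cochleagram = torch.randn(8, 64)
    pncc = torch.randn(8, 13)
    fused = torch.cat([mfcc, ams, rasta_plp, cochleagram, pncc], dim=-1)
    mask = FeatureSelfAttentionDNN(in_dim=fused.shape[-1])(fused)

Because the attention weights are computed from the hidden representation itself, the gate can suppress redundant dimensions of the fused feature set while passing through the task-relevant ones, which matches the stated goal of reducing the redundancy introduced by multiple features.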


1 Introduction

Single-channel speech enhancement has been an active research topic for the past several decades. In recent years, speech enhancement algorithms have attracted the attention of many scholars because of growing challenges in important real-world applications, including hearing aid design and robust speech recognition [20]. Speech enhancement methods can be roughly divided into unsupervised and supervised learning methods. The former are also known as traditional speech enhancement methods. Methods of this type do not rely on prior speech information and place low demands on computation and hardware; therefore, they often offer good real-time performance. Traditional algorithms