Noise and acoustic modeling with waveform generator in text-to-speech and neutral speech conversion



Mohammed Salah Al-Radhi 1 · Tamás Gábor Csapó 1,2 · Géza Németh 1

Received: 14 April 2020 / Revised: 31 July 2020 / Accepted: 28 August 2020
© The Author(s) 2020

Abstract

This article focuses on developing a system for high-quality synthesized and converted speech by addressing three fundamental principles. First, since the noise-like component in state-of-the-art parametric vocoders (for example, STRAIGHT) is often not modeled accurately enough, a novel analytical approach for modeling unvoiced excitation with a temporal envelope is proposed: Discrete All-Pole, Frequency-Domain Linear Prediction, Low-Pass Filter, and True envelopes are studied and applied to the noise excitation signal in our continuous vocoder. Second, we build a deep learning based text-to-speech (TTS) system that converts written text into human-like speech using a feed-forward model and several sequence-to-sequence models (long short-term memory, gated recurrent unit, and a hybrid model). Third, a new voice conversion system is proposed that uses a continuous fundamental frequency to provide accurately time-aligned voiced segments. The results were evaluated in terms of objective measures and subjective listening tests. Experimental results showed that the proposed models achieved higher speaker similarity and better quality than conventional methods.

Keywords Speech synthesis · Vocoder · Temporal envelope · Neural network · Voice conversion
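As a rough illustration of the temporal-envelope idea summarized above (a generic sketch, not the paper's exact implementation), the low-pass-filter variant can be approximated by taking the magnitude of the analytic signal of a noise excitation and smoothing it with a Butterworth low-pass filter; the cutoff frequency of 500 Hz below is an assumed illustrative value:

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def lowpass_temporal_envelope(x, fs, cutoff=500.0):
    """Amplitude envelope: |analytic signal| smoothed by a low-pass filter."""
    env = np.abs(hilbert(x))                     # instantaneous amplitude
    b, a = butter(4, cutoff / (fs / 2), btype="low")
    return filtfilt(b, a, env)                   # zero-phase smoothing

fs = 16000
rng = np.random.default_rng(0)
noise = rng.standard_normal(fs)                  # one second of noise excitation
env = lowpass_temporal_envelope(noise, fs)
shaped = noise * (env / np.max(np.abs(env)))     # envelope-modulated noise component
```

The shaped signal is a noise component whose amplitude follows a slowly varying envelope, which is the basic operation the four envelope estimators in the abstract perform with differing accuracy.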

* Mohammed Salah Al-Radhi
  [email protected]

  Tamás Gábor Csapó
  [email protected]

  Géza Németh
  [email protected]

1 Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary
2 MTA-ELTE Lendület Lingual Articulation Research Group, Budapest, Hungary

Multimedia Tools and Applications

1 Introduction

Speech synthesis can be defined as the ability of a machine, such as a computer, to produce human speech. Statistical parametric speech synthesis (SPSS) using waveform parametrisation has recently attracted much interest owing to advances in Hidden Markov Model (HMM) [62] and deep neural network (DNN) [63] based text-to-speech (TTS). Such a statistical framework, guided by an analysis/synthesis system (also called a vocoder), is used to generate the human voice from mathematical models of the vocal tract. Although there are several different types of vocoders that use analysis/synthesis (see [43] for a comparison), they follow the same main strategy. During the analysis phase, vocoder parameters representing the excitation signal and the filter transfer function (spectral envelope) are extracted from the speech waveform. In the synthesis phase, the vocoded parameters are interpolated over the current frame and passed through the synthesis filter to reconstruct the speech signal. Since the design of a vocoder depends on speech characteristics, the quality of synthesized speech may still be unsatisfactory.
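The analysis/synthesis strategy described above can be sketched with a minimal LPC-based example (a generic illustration, not any specific vocoder from the literature): analysis splits a frame into an all-pole spectral envelope and an excitation residual, and synthesis passes an excitation back through the all-pole filter. The frame length and LPC order below are arbitrary illustrative choices:

```python
import numpy as np
from scipy.signal import lfilter

def lpc(frame, order):
    """Estimate LPC coefficients via the autocorrelation method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])
    return np.concatenate(([1.0], -a))          # denominator [1, -a1, ..., -ap]

def analyze(frame, order=10):
    """Analysis phase: spectral envelope (LPC) + excitation (residual)."""
    a = lpc(frame, order)
    residual = lfilter(a, [1.0], frame)         # inverse filtering A(z) x
    return a, residual

def synthesize(a, excitation):
    """Synthesis phase: excitation through the all-pole filter 1/A(z)."""
    return lfilter([1.0], a, excitation)

# Round trip on a synthetic voiced-like frame
fs = 16000
t = np.arange(400) / fs
frame = np.sin(2 * np.pi * 120 * t) * np.hanning(400)
a, res = analyze(frame)
rec = synthesize(a, res)
```

Feeding the true residual back through the synthesis filter reconstructs the frame; a real vocoder instead replaces the residual with a parametric excitation model, which is exactly where the noise-modeling problem addressed in this article arises.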