Noise and acoustic modeling with waveform generator in text-to-speech and neutral speech conversion



Mohammed Salah Al-Radhi 1 · Tamás Gábor Csapó 1,2 · Géza Németh 1

Received: 14 April 2020 / Revised: 31 July 2020 / Accepted: 28 August 2020
© The Author(s) 2020

Abstract

This article focuses on developing a system for high-quality synthesized and converted speech by addressing three fundamental principles. First, since the noise-like component in state-of-the-art parametric vocoders (for example, STRAIGHT) is often not modeled accurately enough, a novel analytical approach for modeling unvoiced excitation with a temporal envelope is proposed: Discrete All-Pole, Frequency-Domain Linear Prediction, Low-Pass Filter, and True envelopes are studied and applied to the noise excitation signal in our continuous vocoder. Second, we build a deep learning based text-to-speech (TTS) system that converts written text into human-like speech using a feed-forward model and several sequence-to-sequence models (long short-term memory, gated recurrent unit, and a hybrid model). Third, a new voice conversion system is proposed that uses a continuous fundamental frequency to provide accurately time-aligned voiced segments. The results were evaluated in terms of objective measures and subjective listening tests. Experimental results showed that the proposed models achieved higher speaker similarity and better quality than conventional methods.

Keywords Speech synthesis · Vocoder · Temporal envelope · Neural network · Voice conversion
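As a rough illustration of the temporal-envelope idea summarized above (a generic sketch, not the paper's exact implementation), the low-pass-filter variant can be approximated by taking the magnitude of the analytic signal of a noise excitation and smoothing it with a Butterworth low-pass filter; the cutoff frequency of 500 Hz below is an assumed illustrative value:

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def lowpass_temporal_envelope(x, fs, cutoff=500.0):
    """Amplitude envelope: |analytic signal| smoothed by a low-pass filter."""
    env = np.abs(hilbert(x))                     # instantaneous amplitude
    b, a = butter(4, cutoff / (fs / 2), btype="low")
    return filtfilt(b, a, env)                   # zero-phase smoothing

fs = 16000
rng = np.random.default_rng(0)
noise = rng.standard_normal(fs)                  # one second of noise excitation
env = lowpass_temporal_envelope(noise, fs)
shaped = noise * (env / np.max(np.abs(env)))     # envelope-modulated noise component
```

The shaped signal is a noise component whose amplitude follows a slowly varying envelope, which is the basic operation the four envelope estimators in the abstract perform with differing accuracy.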

* Mohammed Salah Al-Radhi
  [email protected]

  Tamás Gábor Csapó
  [email protected]

  Géza Németh
  [email protected]

1 Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary
2 MTA-ELTE Lendület Lingual Articulation Research Group, Budapest, Hungary

Multimedia Tools and Applications

1 Introduction

Speech synthesis can be defined as the ability of a machine, such as a computer, to produce human speech. Statistical parametric speech synthesis (SPSS) using waveform parametrisation has recently attracted much interest owing to advances in Hidden Markov Model (HMM) [62] and deep neural network (DNN) [63] based text-to-speech (TTS). Such a statistical framework, guided by an analysis/synthesis system (also called a vocoder), is used to generate the human voice from mathematical models of the vocal tract. Although there are several different types of vocoders that use analysis/synthesis (see [43] for a comparison), they follow the same main strategy. During the analysis phase, vocoder parameters representing the excitation signal and the filter transfer function (spectral envelope) are extracted from the speech waveform. In the synthesis phase, the vocoded parameters are interpolated over the current frame and passed through the synthesis filter to reconstruct the speech signal. Since the design of a vocoder depends on speech characteristics, the quality of synthesized speech may still be unsatisfactory.
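The analysis/synthesis strategy described above can be sketched with a minimal LPC-based example (a generic illustration, not any specific vocoder from the literature): analysis splits a frame into an all-pole spectral envelope and an excitation residual, and synthesis passes an excitation back through the all-pole filter. The frame length and LPC order below are arbitrary illustrative choices:

```python
import numpy as np
from scipy.signal import lfilter

def lpc(frame, order):
    """Estimate LPC coefficients via the autocorrelation method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])
    return np.concatenate(([1.0], -a))          # denominator [1, -a1, ..., -ap]

def analyze(frame, order=10):
    """Analysis phase: spectral envelope (LPC) + excitation (residual)."""
    a = lpc(frame, order)
    residual = lfilter(a, [1.0], frame)         # inverse filtering A(z) x
    return a, residual

def synthesize(a, excitation):
    """Synthesis phase: excitation through the all-pole filter 1/A(z)."""
    return lfilter([1.0], a, excitation)

# Round trip on a synthetic voiced-like frame
fs = 16000
t = np.arange(400) / fs
frame = np.sin(2 * np.pi * 120 * t) * np.hanning(400)
a, res = analyze(frame)
rec = synthesize(a, res)
```

Feeding the true residual back through the synthesis filter reconstructs the frame; a real vocoder instead replaces the residual with a parametric excitation model, which is exactly where the noise-modeling problem addressed in this article arises.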