DNN-based grapheme-to-phoneme conversion for Arabic text-to-speech synthesis

Ikbel Hadj Ali1 · Zied Mnasri2 · Zied Lachiri2
Received: 9 March 2020 / Accepted: 16 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
Arabic text-to-speech synthesis from non-diacritized text is still a major challenge, because of the unique rules and characteristics of the Arabic language. Indeed, the diacritic and gemination signs, which are special characters representing short vowels and consonant doubling respectively, have a major effect on the accurate pronunciation of Arabic. However, these signs are often omitted from written texts, since most Arab readers are used to inferring them from the context. To tackle this issue, this paper presents a grapheme-to-phoneme conversion system for Arabic, which constitutes the text processing module of a deep neural network (DNN)-based Arabic TTS system. For Arabic text, this conversion starts with predicting the diacritic and gemination signs; in this work, that prediction step was realized entirely with DNNs. Finally, the grapheme-to-phoneme conversion of the diacritized text was achieved using the Buckwalter code. In comparison to state-of-the-art approaches, the proposed system gives a higher accuracy rate, both over all phonemes and per class, as well as high precision, recall and F1 score for each class of diacritic signs.

Keywords  Arabic text-to-speech synthesis · Deep neural networks (DNN) · Grapheme-to-phoneme conversion · Diacritic signs · Gemination
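To make the final conversion step concrete, the snippet below is a minimal Python sketch, not the authors' implementation, of how a diacritized Arabic string can be transliterated character by character with the standard Buckwalter mapping. The table is deliberately partial, and the function name to_buckwalter is a hypothetical illustration.

```python
# Minimal sketch (assumed illustration, not the paper's code) of the last
# stage described in the abstract: once the diacritic and gemination signs
# have been predicted by the DNN, the fully diacritized string is mapped to
# the Buckwalter transliteration, a one-to-one ASCII, phoneme-like sequence.

# Partial standard Buckwalter table: a few base letters plus the diacritic
# (short vowel) and gemination signs.
BUCKWALTER = {
    "\u0627": "A",   # alef
    "\u0628": "b",   # ba
    "\u062A": "t",   # ta
    "\u062C": "j",   # jim
    "\u062F": "d",   # dal
    "\u0631": "r",   # ra
    "\u0633": "s",   # sin
    "\u0644": "l",   # lam
    "\u0645": "m",   # mim
    "\u0646": "n",   # nun
    "\u0647": "h",   # ha
    "\u0648": "w",   # waw
    "\u064A": "y",   # ya
    "\u064E": "a",   # fatha (short vowel /a/)
    "\u064F": "u",   # damma (short vowel /u/)
    "\u0650": "i",   # kasra (short vowel /i/)
    "\u0652": "o",   # sukun (no vowel)
    "\u0651": "~",   # shadda (gemination)
}

def to_buckwalter(diacritized_text: str) -> str:
    """Transliterate a diacritized Arabic string character by character."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in diacritized_text)

# Example: the verb "درس" with predicted diacritics ("he studied").
print(to_buckwalter("دَرَسَ"))  # -> darasa
```

In a full system, the output of such a mapping would then be fed to the signal generation back end as the phoneme sequence to synthesize.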

* Ikbel Hadj Ali  [email protected]
Zied Mnasri  [email protected]
Zied Lachiri  [email protected]

1, 2  Signal, Image and Technology of Information Laboratory, Electrical Engineering Department, Ecole Nationale d'Ingénieurs de Tunis, University Tunis El-Manar, Tunis, Tunisia

1 Introduction

Since speech is the main natural means of communication, Text-To-Speech (TTS) research has received special attention as a powerful form of human–machine communication. TTS (Taylor 2009) is a speech processing application that transforms text into human-like speech. It is divided into two main modules: (a) the text processing module, which consists of text segmentation into different levels (sentences, words...), text normalization and conversion of the text into a sequence of phonemes; (b) the digital signal processing module, which generates a speech waveform corresponding to the desired sequence of phonemes. The quality of a speech synthesizer can be assessed by its similarity to a real human voice, i.e. its naturalness, and by its intelligibility. Thanks to new technologies, the quality of the sound generated by TTS systems has become almost indistinguishable from that of a human: the changes in rhythm, pronunciation and inflection closely resemble those of a human speaker. The performance of state-of-the-art TTS systems has