Unsupervised stemmed text corpus for language modeling and transcription of Telugu broadcast news


Mythilisharan Pala¹ · Laxminarayana Parayitam¹ · Venkataramana Appala²

Received: 22 November 2019 / Accepted: 16 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract In Indian languages, root words are combined or modified to match the context with respect to tense, number and/or gender, so the number of unique words is larger than in many European languages. No text corpus used for language modeling, whatever its size, can contain all possible inflected words. A word that occurs during testing but not in the training data is called an Out of Vocabulary (OOV) word. Similarly, a text corpus cannot contain all possible sequences of words. Because of this data sparsity, an Automatic Speech Recognition (ASR) system may not accommodate all words in its language model, irrespective of the size of the text corpus. Language modeling also becomes computationally challenging when the volume of data grows exponentially because of morphological changes to the root word. To reduce OOVs in the language model, this paper proposes a new unsupervised stemming method for one Indian language, Telugu, based on a method proposed for Hindi. Other issues in language modeling for Telugu, using techniques such as smoothing and interpolation with supervised and unsupervised stemming data, are also analyzed. The Witten–Bell and Kneser–Ney smoothing techniques are observed to perform well compared to other techniques on pre-processed data with supervised learning. ASR accuracy improves by 0.76% and 0.94% with supervised and unsupervised stemming, respectively.

Keywords OOVs · Language model · Stemming · ASR
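To make the OOV definition in the abstract concrete, the following minimal sketch computes an OOV rate before and after stemming. It is not the stemming method proposed in the paper: the function names `oov_rate` and `strip_suffix`, the suffix list, and the English toy words are all assumptions chosen for illustration, standing in for inflected Telugu word forms.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose form never occurs in the training tokens."""
    vocab = set(train_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


def strip_suffix(word, suffixes):
    """Toy stemmer (hypothetical): drop the longest matching suffix,
    keeping a stem of at least three characters."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word


# Illustrative data only: English placeholders for inflected forms.
SUFFIXES = ("ing", "ed", "s")
train = ["walk", "walked", "talks", "talk"]
test = ["walking", "talked", "jumps"]

# OOV rate on surface forms: every test word is an unseen inflection.
raw_rate = oov_rate(train, test)

# OOV rate after mapping both corpora to stems.
stemmed_train = [strip_suffix(w, SUFFIXES) for w in train]
stemmed_test = [strip_suffix(w, SUFFIXES) for w in test]
stemmed_rate = oov_rate(stemmed_train, stemmed_test)

print(f"OOV rate on surface forms: {raw_rate:.2f}")
print(f"OOV rate on stems:         {stemmed_rate:.2f}")
```

In this toy setting all three test words are unseen as surface forms, while only one stem ("jump") is unseen after suffix stripping, which is the effect stemming is meant to have on the language model vocabulary.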

1 Introduction

Telugu is one of four modern literary languages belonging to the Dravidian family. It is also one of the six classical languages of India. With 81 million native speakers, it is the fourth most widely spoken language in the subcontinent. A comprehensive ASR system for Telugu has not been made available because of the lack of a standard, publicly accessible annotated speech corpus (Vegesna et al. 2017). ASR accuracy depends on the acoustic model, the language model and the lexicon. The language model gives the distribution

* Mythilisharan Pala [email protected]
Laxminarayana Parayitam [email protected]
Venkataramana Appala [email protected]

¹ Research and Training Unit for Navigational Electronics, Osmania University, Hyderabad, India
² Nuronics Labs, Hyderabad, India

of probabilities over sequences of words, estimated from the available training text corpus. Test speech may contain a few new words that did not occur in the training corpus. The decoder cannot recognize these new words. New words appearing in the test speech are called Out of Vocabulary (OOV) words. Indian languages have more OOVs than English and other European languages. This is because a larger number of words is formed by combining two word