Unsupervised stemmed text corpus for language modeling and transcription of Telugu broadcast news


Mythilisharan Pala1 · Laxminarayana Parayitam1 · Venkataramana Appala2

Received: 22 November 2019 / Accepted: 16 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract In Indian languages, root words are either combined or modified to match the context with reference to tense, number and/or gender. As a result, the number of unique words is larger than in many European languages. However large the text corpus used for language modeling, it cannot contain all possible inflected words. A word that occurs during testing but not in the training data is called an Out of Vocabulary (OOV) word. Similarly, the text corpus cannot contain all possible sequences of words. Due to this data sparsity, an Automatic Speech Recognition (ASR) system may not accommodate all the words in the language model, irrespective of the size of the text corpus. It also becomes computationally challenging if the volume of data increases exponentially due to morphological changes to the root word. To reduce the OOVs in the language model, this paper proposes a new unsupervised stemming method for one Indian language, Telugu, based on a method proposed for Hindi. Other issues in language modeling for Telugu, using techniques such as smoothing and interpolation with supervised and unsupervised stemming data, are also analyzed. It is observed that the Witten–Bell and Kneser–Ney smoothing techniques perform well compared to other techniques on pre-processed data with supervised learning. The ASR accuracy improves by 0.76% and 0.94% with supervised and unsupervised stemming, respectively.

Keywords OOVs · Language model · Stemming · ASR
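The OOV definition above can be made concrete with a minimal sketch. It computes the fraction of test tokens that never occur in the training data, before and after a toy suffix-stripping step. The function names `oov_rate` and `strip_suffix`, the romanized tokens, and the suffix list are all hypothetical, chosen only for illustration; this is not the stemming method proposed in the paper, and the tokens are not taken from its corpus.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose word form never occurs in training."""
    vocab = set(train_tokens)
    if not test_tokens:
        return 0.0
    oov = sum(1 for tok in test_tokens if tok not in vocab)
    return oov / len(test_tokens)


def strip_suffix(token, suffixes):
    """Toy stemmer (assumption): drop the longest matching suffix,
    keeping at least two characters of stem."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if token.endswith(suf) and len(token) - len(suf) >= 2:
            return token[: -len(suf)]
    return token


# Hypothetical romanized tokens and suffixes, for illustration only.
train = ["illu", "illulo", "pustakam", "pustakamlo"]
test = ["illuki", "pustakamki", "illu"]
suffixes = ["lo", "ki"]

# OOV rate on the raw (inflected) word forms.
raw = oov_rate(train, test)

# OOV rate after mapping every token to its toy stem.
stemmed = oov_rate(
    [strip_suffix(t, suffixes) for t in train],
    [strip_suffix(t, suffixes) for t in test],
)

print(f"raw OOV rate: {raw:.2f}, stemmed OOV rate: {stemmed:.2f}")
```

In this toy setting, inflected test forms that are unseen in training collapse onto stems that are seen, which is the effect the abstract attributes to stemming; real corpora would need a properly derived stemmer rather than a fixed suffix list.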

1 Introduction
Telugu is one of four modern literary languages belonging to the Dravidian family. It is also one of the six classical languages of India. With 81 million native speakers, it is the fourth most widely spoken language in the subcontinent. A comprehensive ASR system for Telugu has not been made available due to the lack of a standard, publicly accessible annotated speech corpus (Vegesna et al. 2017). ASR accuracy depends on the acoustic model, the language model and the lexicon. The language model gives the distribution

* Mythilisharan Pala · Laxminarayana Parayitam · Venkataramana Appala
1 Research and Training Unit for Navigational Electronics, Osmania University, Hyderabad, India
2 Nuronics Labs, Hyderabad, India

of probabilities over sequences of words, calculated using the available training text corpus. Test speech may contain a few new words that did not occur in the training corpus. These new words cannot be recognized by the decoder. New words appearing in the test speech are called Out of Vocabulary (OOV) words. Indian languages have more OOVs than English and other European languages. This is because a larger number of words is formed by combining two word
