Experimenting with factored language model and generalized back-off for Hindi



ORIGINAL RESEARCH

Experimenting with factored language model and generalized back-off for Hindi

Arun R. Babhulgaonkar1 · Shefali P. Sonavane2

Received: 1 February 2020 / Accepted: 28 July 2020 © Bharati Vidyapeeth’s Institute of Computer Applications and Management 2020

Abstract Language modeling is a statistical technique for representing text data in machine-readable form. It finds the probability distribution of the word sequences present in the text, estimating the likelihood of upcoming words in a spoken or written conversation. The Markov assumption enables a language model to predict the next word from the previous n − 1 words of the sentence, called an n-gram. A limitation of the n-gram technique is that it uses only the preceding surface words to predict the upcoming word. Factored language modeling is an extension of the n-gram technique that integrates grammatical and linguistic knowledge about the words, such as number, gender, part-of-speech tag, etc., into the model for predicting the next word. Back-off is a method of resorting to a smaller number of preceding words when a longer contextual history is unavailable. This work studies the effect of various combinations of linguistic features and generalized back-off strategies on the next-word prediction capability of language models for Hindi. The paper empirically compares the results obtained by utilizing linguistic features of Hindi words in a factored language model against a baseline n-gram technique. The language models are compared using the perplexity metric. In summary, the factored language model with the product combine strategy produces the lowest

& Arun R. Babhulgaonkar
  [email protected]

  Shefali P. Sonavane
  [email protected]

1 Dr. Babasaheb Ambedkar Technological University, Lonere, Maharashtra, India

2 Walchand College of Engineering, Sangli, Maharashtra, India

perplexity of 1.881235, which is about 50% less than that of the traditional baseline trigram model.

Keywords Factored language model (FLM) · Generalized back-off · n-gram · Perplexity
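The back-off idea summarized above can be illustrated with a toy trigram model. The sketch below is a minimal, hypothetical illustration (a stupid-backoff-style scheme with an assumed discount factor `alpha`, not the generalized back-off of the factored models evaluated in the paper): when a trigram context is unseen, the model falls back to the bigram, then to a smoothed unigram, and perplexity is computed from the resulting per-word probabilities.

```python
import math
from collections import defaultdict

def train_ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    counts = defaultdict(int)
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
    return counts

class BackoffTrigramLM:
    """Toy trigram model with stupid-backoff-style back-off.

    Falls back from trigram to bigram to unigram when the longer
    context is unseen, discounting by `alpha` at each back-off step.
    Purely illustrative; not the generalized back-off of the paper.
    """
    def __init__(self, tokens, alpha=0.4):
        self.alpha = alpha
        self.uni = train_ngrams(tokens, 1)
        self.bi = train_ngrams(tokens, 2)
        self.tri = train_ngrams(tokens, 3)
        self.total = len(tokens)

    def prob(self, w, hist):
        h2 = tuple(hist[-2:])
        if len(h2) == 2 and self.tri.get(h2 + (w,)):
            return self.tri[h2 + (w,)] / self.bi[h2]
        h1 = tuple(hist[-1:])
        if h1 and self.bi.get(h1 + (w,)):
            return self.alpha * self.bi[h1 + (w,)] / self.uni[h1]
        # Unigram floor with add-one smoothing to avoid zero probability.
        return self.alpha ** 2 * (self.uni.get((w,), 0) + 1) / (self.total + len(self.uni))

    def perplexity(self, tokens):
        """exp of the average negative log-probability per word."""
        logp = sum(math.log(self.prob(w, tokens[:i])) for i, w in enumerate(tokens))
        return math.exp(-logp / len(tokens))
```

A factored model extends this scheme by letting each position in the context be any linguistic factor of the word (surface form, POS tag, gender, number) rather than the surface form alone, which is what makes the choice of back-off path and combine strategy (e.g. the product combine reported above) meaningful.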

1 Introduction

Hindi is the national and official language of India. According to https://www.Vistawide.com, Hindi is the most natively spoken language after English, Spanish and Mandarin, used by about 400 million people. The number goes beyond 400 million if languages that share the Devanagari script with Hindi, such as Marathi, Sanskrit, etc., are considered. Most government resolutions, documents, historical records, etc. are available in English, which may not be understood by villagers in India. This raises the need for an efficient automatic translation system from English to Hindi. Machine translation is a technique for translating text in a source natural language into a target natural language. A language model is an essential component of a statistical machine translation (SMT) system. It finds the likelihood of words during a conversation in any natural language. It finds the tendenc