Experimenting with factored language model and generalized back-off for Hindi



ORIGINAL RESEARCH

Experimenting with factored language model and generalized back-off for Hindi

Arun R. Babhulgaonkar1 · Shefali P. Sonavane2

Received: 1 February 2020 / Accepted: 28 July 2020 © Bharati Vidyapeeth’s Institute of Computer Applications and Management 2020

Abstract Language modeling is a statistical technique for representing text data in machine-readable form. It finds the probability distribution of the word sequences present in the text, estimating the likelihood of upcoming words in a spoken or written conversation. The Markov assumption enables a language model to predict the next word from the previous n − 1 words of the sentence, called an n-gram. A limitation of the n-gram technique is that it uses only the preceding surface words to predict the upcoming word. Factored language modeling is an extension of the n-gram technique that integrates grammatical and linguistic knowledge about the words, such as number, gender, part-of-speech tag, etc., into the model for predicting the next word. Back-off is a method of resorting to a smaller number of preceding words when a longer contextual history is unavailable. This work studies the effect of various combinations of linguistic features and generalized back-off strategies on the next-word prediction capability of language models for Hindi. The paper empirically compares the results obtained by utilizing linguistic features of Hindi words in a factored language model against a baseline n-gram technique. The language models are compared using the perplexity metric. In summary, the factored language model with the product combine strategy produces the lowest

& Arun R. Babhulgaonkar
  [email protected]

  Shefali P. Sonavane
  [email protected]

1 Dr. Babasaheb Ambedkar Technological University, Lonere, Maharashtra, India

2 Walchand College of Engineering, Sangli, Maharashtra, India

perplexity of 1.881235, which is about 50% less than that of the traditional baseline trigram model.

Keywords Factored language model (FLM) · Generalized back-off · n-gram · Perplexity
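The back-off idea summarized above can be illustrated with a toy trigram model. The sketch below is a minimal, hypothetical illustration (a stupid-backoff-style scheme with an assumed discount factor `alpha`, not the generalized back-off of the factored models evaluated in the paper): when a trigram context is unseen, the model falls back to the bigram, then to a smoothed unigram, and perplexity is computed from the resulting per-word probabilities.

```python
import math
from collections import defaultdict

def train_ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    counts = defaultdict(int)
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
    return counts

class BackoffTrigramLM:
    """Toy trigram model with stupid-backoff-style back-off.

    Falls back from trigram to bigram to unigram when the longer
    context is unseen, discounting by `alpha` at each back-off step.
    Purely illustrative; not the generalized back-off of the paper.
    """
    def __init__(self, tokens, alpha=0.4):
        self.alpha = alpha
        self.uni = train_ngrams(tokens, 1)
        self.bi = train_ngrams(tokens, 2)
        self.tri = train_ngrams(tokens, 3)
        self.total = len(tokens)

    def prob(self, w, hist):
        h2 = tuple(hist[-2:])
        if len(h2) == 2 and self.tri.get(h2 + (w,)):
            return self.tri[h2 + (w,)] / self.bi[h2]
        h1 = tuple(hist[-1:])
        if h1 and self.bi.get(h1 + (w,)):
            return self.alpha * self.bi[h1 + (w,)] / self.uni[h1]
        # Unigram floor with add-one smoothing to avoid zero probability.
        return self.alpha ** 2 * (self.uni.get((w,), 0) + 1) / (self.total + len(self.uni))

    def perplexity(self, tokens):
        """exp of the average negative log-probability per word."""
        logp = sum(math.log(self.prob(w, tokens[:i])) for i, w in enumerate(tokens))
        return math.exp(-logp / len(tokens))
```

A factored model extends this scheme by letting each position in the context be any linguistic factor of the word (surface form, POS tag, gender, number) rather than the surface form alone, which is what makes the choice of back-off path and combine strategy (e.g. the product combine reported above) meaningful.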

1 Introduction

Hindi is the national and official language of India. According to https://www.Vistawide.com, Hindi is the most natively spoken language after English, Spanish and Mandarin, used by about 400 million people. The number goes beyond 400 million if languages that share the Devanagari script with Hindi, such as Marathi, Sanskrit, etc., are considered. Most government resolutions, documents, historical records, etc. are available in English, which may not be understood by villagers in India. This raises the need for an efficient automatic translation system from English to Hindi. Machine translation is a technique for translating text in a source natural language into a target natural language. A language model is an essential component of a statistical machine translation (SMT) system. It finds the likelihood of words during a conversation in any natural language. It finds the tendenc