HINDIA: a deep-learning-based model for spell-checking of Hindi language



ORIGINAL ARTICLE

Shashank Singh¹ · Shailendra Singh¹
Received: 7 November 2019 / Accepted: 13 July 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract
A spelling error is a mistake made while typing a text document. Applications such as search engines, information retrieval systems, and email require user typing, and in such applications a good spell-checker is essential for rectifying misspellings. Spell-checkers for western languages like English are very powerful and can handle any type of spelling error, whereas for Indian languages like Hindi, Urdu, Bengali, Kannada, Assamese, etc., the available spell-checkers are very basic. These spell-checkers are developed using traditional approaches such as statistical and rule-based methods. This article presents a novel model, HINDIA, to handle spelling errors in Hindi, one of the most spoken languages in India. It uses a deep-learning method for spelling error detection and correction. The proposed spell-checking model works in two phases: in the first phase, the model identifies the erroneous words in the input sample, and in the second phase, it replaces the wrong words with the most probable correct words. HINDIA is built on an attention-based encoder–decoder bidirectional recurrent neural network (BiRNN) that uses long short-term memory cells. Several modifications have been made to the BiRNN, and the network is fine-tuned to process spelling errors in Hindi. It uses the publicly available 'monolingual corpus' dataset developed by IIT Mumbai for training and testing. The performance of the proposed model is evaluated in two scenarios. In the first scenario, where the testing dataset is generated using a split function, HINDIA performs well, with precision 0.86, recall 0.72, f-measure 0.78 and accuracy 0.80. In the second scenario, where the dataset is manually generated, its performance is fairly good, with precision 0.81, recall 0.72, f-measure 0.76 and accuracy 0.74. HINDIA gives better performance than the deep-learning-based Malayalam spell-checker and some other deep-learning-based correction models in the literature.

Keywords  Spelling · Spell-checker · Deep-learning · Long short-term memory · Encoder–decoder recurrent neural network
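
As a quick consistency check on the reported scores, the f-measure can be recomputed from the stated precision and recall. The sketch below assumes the f-measure is the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_measure` and the scenario labels are illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F1 score: the harmonic mean of precision and recall.

    Assumption: the abstract's "f-measure" is taken to be F1.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for the two test scenarios.
reported = {
    "split-generated test set": (0.86, 0.72),
    "manually generated test set": (0.81, 0.72),
}

for scenario, (p, r) in reported.items():
    print(f"{scenario}: F-measure = {f_measure(p, r):.2f}")
```

Under this assumption the recomputed values round to 0.78 and 0.76, matching the figures given for the two scenarios.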

* Shashank Singh [email protected]
Shailendra Singh [email protected]

¹ Department of Computer Science and Engineering, Punjab Engineering College (Deemed to be University), Chandigarh, India

1 Introduction
Artificial intelligence (AI) has become the most sought-after field of research in this information age. AI uses deep-learning (DL) techniques to devise systems that assist humans. DL is a method that uses past experience to teach a machine to answer a particular question [1]. Nowadays, DL methods are used extensively by artificial intelligence and natural language processing (NLP) researchers [2]. NLP