Parts-of-Speech tagging for Malayalam using deep learning techniques


ORIGINAL RESEARCH

K. K. Akhil¹ • R. Rajimol¹ • V. S. Anoop¹

Received: 4 January 2020 / Accepted: 1 June 2020 © Bharati Vidyapeeth’s Institute of Computer Applications and Management 2020

Abstract Parts-of-speech tagging is the process of labeling each word in a sentence with its corresponding part of speech. It is a common pre-processing step for many natural language processing tasks. Early approaches relied on simple heuristics; later methods reported in the literature incorporated machine learning techniques such as artificial neural networks. More recently, deep learning-based approaches have made parts-of-speech tagging more accurate, and a reasonable number of taggers are now available for high-resource languages such as English. Low-resource languages such as Malayalam, however, still lack computationally efficient and accurate methods for parts-of-speech tagging. To address this gap, this work proposes a deep learning-based approach to parts-of-speech tagging for the Malayalam language. Experiments conducted on real datasets show that the proposed method outperforms some of the already available methods in terms of precision and accuracy.

Keywords Parts-of-Speech tagging • Natural language processing • Malayalam language • Deep learning

Corresponding author: V. S. Anoop [email protected]

¹ Indian Institute of Information Technology and Management-Kerala (IIITM-K), Technopark Campus, Thiruvananthapuram, Kerala 695581, India

1 Introduction

Parts-of-Speech (POS) tagging is the process of labeling each word in a sentence with a tag that indicates how that word is used in the sentence. A basic POS tagger usually classifies words as noun, verb, adjective, and so on, but more advanced taggers can assign additional labels such as number and gender. A vast array of language processing systems use POS tagging as a pre-processing step, which improves the precision and recall of such systems to a great extent. POS-tagged (annotated) language corpora are widely used in speech recognition and analysis, information retrieval, and other NLP tasks. Although many approaches to POS tagging have been reported in the literature, they fall mainly into two classes: rule-based and machine learning-based. Rule-based approaches use pre-defined rules handcrafted by humans, but assigning tags to words through a manual process is tedious and time-consuming. Machine learning-based approaches, on the other hand, use various stochastic and probabilistic techniques to assign POS labels. All of these approaches achieve reasonable accuracy for resource-rich languages such as English but do not perform satisfactorily for low-resource languages, including most of the Indic languages. Malayalam belongs to the Dra