Using machine learning to build POS tagger for under-resourced language: the case of Somali
ORIGINAL RESEARCH
Siraj Mohammed
Received: 5 November 2019 / Accepted: 19 May 2020 © Bharati Vidyapeeth’s Institute of Computer Applications and Management 2020
Abstract POS tagging serves as a preliminary task for many NLP applications. It refers to the process of classifying words into their parts of speech (also known as word classes or lexical categories). Somali is a member of the Cushitic language family and has only a limited number of NLP tools available. An accurate and reliable POS tagger is essential for many NLP tasks such as shallow parsing, dependency parsing, sentiment analysis, and named entity recognition. In this paper, we present a statistical POS tagger for the Somali language using different machine learning approaches (i.e., HMM and CRF) and a neural network model. Our Somali POS tagger outperforms the state-of-the-art POS tagger by 87.51% on tenfold cross-validation. The key contributions of this paper are (1) building a generic POS tagger, (2) comparing its performance with existing state-of-the-art techniques, and (3) exploring the use of word embeddings for Somali POS tagging.
Keywords Part-of-speech tagger · Machine learning · Neural network · Somali
1 Introduction
Part-of-speech (POS) tagging is one of the basic applications of natural language processing (NLP) for any language. It is the process of assigning a tag to every word in a sentence, and it serves as a preliminary step for tasks such as chunking, dependency parsing, and named-entity recognition. All of these NLP systems rely on a part-of-speech tagger as a preprocessing component to perform at their best [1–3]. Our work therefore focuses on POS tagging for Somali.
Much of the research on POS tagging has been devoted to resource-rich languages such as English and French. African languages such as Somali have received far too little attention. Somali belongs to the Lowland East Cushitic family of the Afro-Asiatic languages. Other languages in the East Cushitic family include Afar, Oromo, Rendille, and Boni. Somali has an estimated 16 million speakers in Somalia, Somaliland, Djibouti, Kenya, and Ethiopia. It remains one of the under-resourced languages in terms of both electronic resources and processing tools.
Recently, some attempts have been made to develop a Somali corpus, but they remain insufficient. A Somali text corpus with linguistic information is publicly available at http://www.somalicorpus.com/ [4]. However, this resource has not yet been used for NLP tasks such as POS tagging, which is becoming a barrier to research on higher-level NLP applications. Given this circumstance, there is a need to develop a POS tagger for Somali. In this paper, we present an effective POS tagger for an under-resourced language, Somali, using different machine learning approaches (i.e., HMM and CRF) and neural network models.
1.1 Somali Language and Writing System
& Siraj Mohammed sirm