Development of Prototype Morphological Analyzer for the South Indian Language of Kannada
Abstract. A prototype morphological analyzer for the South Indian language of Kannada is presented in this work. The analyzer is based on finite-state machines and can handle 500 distinct noun and verb stems of Kannada. It can serve simultaneously as a stemmer, part-of-speech tagger and spell checker, which makes it an efficient tool for content management.

Keywords: Kannada Morphology, Finite State Machine, Kannada Content Management, Natural Language Processing.
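The claim that a single finite-state analyzer can double as a stemmer, part-of-speech tagger and spell checker can be illustrated with a toy sketch. The Python below is not the authors' implementation. The names `Trie`, `Analysis`, `STEMS`, `SUFFIXES` and `analyze` are hypothetical, the three stems and two suffixes are loosely romanized examples chosen for illustration, and the sketch deliberately ignores the spelling changes that real Kannada inflection involves.

```python
from typing import Dict, Iterator, List, NamedTuple, Optional, Tuple


class Analysis(NamedTuple):
    stem: str   # output used for stemming
    pos: str    # output used for part-of-speech tagging
    affix: str  # surface suffix ("" for a bare stem)
    gloss: str  # human-readable label of the affix


class Trie:
    """A deterministic finite-state acceptor stored as a character trie."""

    def __init__(self) -> None:
        self.next: List[Dict[str, int]] = [{}]  # next[state][char] -> state
        self.final: Dict[int, str] = {}         # accepting state -> payload

    def add(self, key: str, payload: str) -> None:
        state = 0
        for ch in key:
            if ch not in self.next[state]:
                self.next.append({})
                self.next[state][ch] = len(self.next) - 1
            state = self.next[state][ch]
        self.final[state] = payload

    def accepts(self, text: str) -> Optional[str]:
        """Return the payload if the whole text is accepted, else None."""
        state = 0
        for ch in text:
            nxt = self.next[state].get(ch)
            if nxt is None:
                return None
            state = nxt
        return self.final.get(state)

    def prefixes(self, text: str) -> Iterator[Tuple[int, str]]:
        """Yield (length, payload) for every accepted prefix of text."""
        state = 0
        if state in self.final:
            yield 0, self.final[state]
        for i, ch in enumerate(text, start=1):
            nxt = self.next[state].get(ch)
            if nxt is None:
                return
            state = nxt
            if state in self.final:
                yield i, self.final[state]


# Hypothetical toy lexicon (assumption, not taken from the paper).
STEMS = {"mane": "NOUN", "mara": "NOUN", "baru": "VERB"}
SUFFIXES = {
    "NOUN": {"": "bare stem", "galu": "plural"},
    "VERB": {"ttaane": "present, 3rd person singular masculine"},
}

STEM_FSM = Trie()
for stem, pos in STEMS.items():
    STEM_FSM.add(stem, pos)

SUFFIX_FSM: Dict[str, Trie] = {}
for pos, table in SUFFIXES.items():
    SUFFIX_FSM[pos] = Trie()
    for suffix, gloss in table.items():
        SUFFIX_FSM[pos].add(suffix, gloss)


def analyze(word: str) -> List[Analysis]:
    """Split word into a known stem plus a suffix allowed for its POS."""
    results = []
    for end, pos in STEM_FSM.prefixes(word):
        gloss = SUFFIX_FSM[pos].accepts(word[end:])
        if gloss is not None:
            results.append(Analysis(word[:end], pos, word[end:], gloss))
    return results
```

In this sketch the `stem` field of each result plays the role of the stemmer, the `pos` field the role of the tagger, and an empty result list flags a word the machine cannot accept, which is the sense in which such an analyzer can act as a spell checker for its lexicon.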
T.N. Vikram and S.R. Urs
D.H.-L. Goh et al. (Eds.): ICADL 2007, LNCS 4822, pp. 109–116, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

The localization of content has enabled the internet to reach regions where English is not spoken, particularly in Asia. People can now read and post in their native languages. However, the capabilities of the localized internet are still very limited. Keyword-based search for local languages is yet to be developed. Text categorization, summarization and retrieval have not been achieved for most Asian languages because the essential language-specific stemming algorithms are lacking. Similarly, automatic translation of pages into English or any other language is feasible only with an efficient part-of-speech (POS) tagger [22]. As with stemming algorithms, most Asian languages also lack POS taggers. Both gaps can be addressed by developing a morph analyzer for the language in question. A morph analyzer outputs the stem, the POS tag and the affix for any given word, so it can be used for stemming and part-of-speech tagging simultaneously. In view of this, we have attempted to develop a prototype Kannada morph analyzer.

Kannada is the official language of the South Indian state of Karnataka, with about 44 million speakers. Though it has a rich literary history, it is resource-poor from the standpoint of computational linguistics. There have been hardly any attempts apart from the work of Sahoo and Vidyasagar [5], who attempt a Kannada WordNet, and a Kannada indexing software prototype by Settar [6]. Both are highly constrained by the lack of a morphological analyzer. Unlike English, where most morphotactic changes do not alter spelling, Kannada words change spelling when stems are inflected, which adds to the complexity of developing the morph analyzer. The analyzer is based on finite-state machines and can handle 500 distinct noun and verb stems of Kannada. It can serve simultaneously as a stemmer, part-of-speech tagger and spell checker, which makes it an efficient tool for content management.

The paper is organized as follows. In Section 2 we briefly describe the state of the art in morphological analysis of various languages. Language-specific morphology for Kannada is explained in Section 3. In Section 4 we explain the development