Morphological analysis and disambiguation for Breton
Francis M. Tyers¹ · Nick Howell²
Accepted: 30 September 2020 © Springer Nature B.V. 2020
Abstract In this paper we present an extended description of two resources for natural language processing of Breton: a morphological analyser and a constraint-grammar-based disambiguator. The constraint grammar was developed using a novel methodology in which a linguist and a language consultant wrote rules to resolve specific disambiguation errors in a machine translation system. In addition, we introduce a new morphologically disambiguated corpus of Breton and evaluate both the morphological analyser and the constraint grammar for coverage and accuracy. For comparison, we use the same corpus to train several reference systems for part-of-speech tagging and lemmatisation and compare their performance. The experiments show that our system outperforms the reference systems by a wide margin when the reference systems are trained without an external full-form list, and performs comparably when they are trained with a full-form list generated from our morphological analyser.
Keywords Breton · NLP · Pipeline
Francis M. Tyers [email protected]
Nick Howell [email protected]
¹ Department of Linguistics, Indiana University, Bloomington, United States
² School of Linguistics, Higher School of Economics, Moscow, Russia
1 Introduction
This paper presents a morphological analyser and disambiguator for Breton, an endangered language spoken in Brittany. Both tools are released as free/open-source software as part of the Apertium project (Forcada et al. 2011) and are thoroughly evaluated and compared with other widely used approaches. The analyser is implemented as a finite-state transducer, which means that it can be used for both the analysis and the generation of forms: a transducer of this type maps between surface forms and lexical forms (lemmas and morphosyntactic tags). As a given surface form can have more than one analysis, we also describe the development of a rule-based system for morphological disambiguation based on constraint grammar (Karlsson 1990). In addition to presenting the two tools, we describe the methodology used to create the constraint grammar, which relied on an online collaboration between a computational linguist and a speaker of Breton to create a machine translation system.
Breton (ISO-639 br, bre) is a Celtic language of the Brythonic branch, spoken in the Brittany region of France (see Sect. 2). It is classified as in “serious danger of extinction”. While there is some effort towards revival, attention from the language technology community has been limited (see Sect. 3). The late 1990s and early 2000s saw work on Breton education tools; the late 2000s finally saw some attention to Breton machine translation. This paper presents improvements to the rule-based morphological analysis and disambiguation components of the only publicly available machine translation system for the Breton language (Tyers 2010b). The remaind
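To make the two ideas in the introduction concrete, the following is a minimal sketch in Python. It is not the Apertium implementation. The bidirectional mapping of a transducer is approximated with two lookup tables built from the same list of pairs, and the disambiguation step applies a single hypothetical REMOVE rule in the spirit of constraint grammar. The lexicon entries, the tag names (`<n>`, `<pr>`, `<det>`, `<vpart>`) and the rule itself are assumptions chosen for illustration and are not taken from the paper's lexicon or grammar.

```python
from collections import defaultdict

# Toy lexicon of (surface form, lexical form) pairs.
# Entries and tags are illustrative assumptions, not the Apertium Breton lexicon.
PAIRS = [
    ("levr", "levr<n><m><sg>"),
    ("levrioù", "levr<n><m><pl>"),
    ("e", "e<pr>"),        # assumed preposition reading
    ("e", "e<det><pos>"),  # assumed possessive determiner reading
    ("e", "e<vpart>"),     # assumed verbal particle reading
]

# One relation read in both directions: analysis and generation.
analyse_table = defaultdict(list)
generate_table = defaultdict(list)
for surface, lexical in PAIRS:
    analyse_table[surface].append(lexical)
    generate_table[lexical].append(surface)


def analyse(surface):
    """Return every lexical form (lemma plus tags) for a surface form."""
    return list(analyse_table.get(surface, []))


def generate(lexical):
    """Return every surface form for a lexical form."""
    return list(generate_table.get(lexical, []))


def disambiguate(cohorts):
    """Apply one hypothetical constraint-style rule to a sentence.

    cohorts is a list of (surface, readings) pairs. The rule removes a
    <det> reading when the following token has no <n> reading. The last
    remaining reading of a token is never removed.
    """
    result = []
    for i, (surface, readings) in enumerate(cohorts):
        next_readings = cohorts[i + 1][1] if i + 1 < len(cohorts) else []
        next_has_noun = any("<n>" in r for r in next_readings)
        kept = [r for r in readings
                if not ("<det>" in r and not next_has_noun)]
        result.append((surface, kept or readings))
    return result


if __name__ == "__main__":
    sentence = [(w, analyse(w)) for w in ["e", "levrioù"]]
    for surface, readings in disambiguate(sentence):
        print(surface, readings)
```

The same pair list drives both `analyse` and `generate`, which mirrors why a single transducer can serve both directions. A real constraint grammar contains many ordered rules with richer contexts; the single rule here only shows the general shape of removing readings based on neighbouring tokens.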