Improving POS Tagging Across Portuguese Variants with Word Embeddings
Abstract. Brazilian Portuguese (BP) and European Portuguese (EP) have specific NLP resources and tools for many tasks. It is generally agreed that applying them to the variant other than their intended one results in a performance drop; however, very little research has measured it. We evaluated a POS tagger in a cross-variant setting under multiple combinations of word embeddings, train and test corpora, and found that (i) BP is easier than EP, (ii) word embeddings help increase tagger performance significantly, but not enough to close the accuracy gap in a cross-variant setting, and (iii) embeddings generated from a corpus with both variants are useful in cross-variant scenarios. While we cannot generalize observations from POS tagging to any NLP task, this is an important first step for such evaluations.
1 Introduction
Brazilian Portuguese (BP) and European Portuguese (EP) have very distinct features in phonology, syntax, word choice and, less markedly, orthography (see footnote 1). While the two are mutually intelligible, the differences between them have motivated the development of specialized NLP resources and tools [4,11], and speakers of either variant usually agree that applying them to the variant other than their intended one hurts performance. While this is a reasonable assertion, very little published research has quantified this degradation. In this study, we experimented with training a part-of-speech (POS) tagger on one variant of Portuguese and testing it on the other. We evaluate POS tagging for a number of reasons. First, speakers of one variant of Portuguese can tell the POS of words in sentences from the other variant, including words they do not know, based on morphology and context. This is because knowledge of the general properties of the language, even if acquired from one variant, allows the speaker to understand the structure of utterances in the other. Other reasons are more practical: the Bosque corpus [1] has texts from BP and EP annotated with the same tagset, making a direct comparison straightforward. The same is true for syntactic parsing, but we focus on the simpler task of POS tagging here, leaving syntactic parsing as future work.
Footnote 1: The texts explored here were written before the Spelling Agreement of the Portuguese language took effect.
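The train-on-one-variant, test-on-the-other design described above can be sketched as a small evaluation grid. This is only an illustration under stated assumptions: the tagger is a most-frequent-tag baseline standing in for the tagger actually used in the study, the function names (`train_baseline`, `accuracy`, `cross_variant_grid`) are hypothetical, and the toy sentences and tags are invented for the example rather than taken from the Bosque corpus.

```python
from collections import Counter, defaultdict
from itertools import product


def train_baseline(tagged_sents):
    """Most-frequent-tag baseline (an assumption, not the study's tagger)."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    # Each known word maps to the tag it was seen with most often.
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    # Unknown words fall back to the most common tag in the training data.
    all_tags = Counter(tag for sent in tagged_sents for _, tag in sent)
    default = all_tags.most_common(1)[0][0]
    return lambda word: lexicon.get(word.lower(), default)


def accuracy(tagger, tagged_sents):
    """Token-level accuracy of a word -> tag function."""
    pairs = [(w, t) for sent in tagged_sents for w, t in sent]
    return sum(tagger(w) == t for w, t in pairs) / len(pairs)


def cross_variant_grid(corpora):
    """Train on each variant and test on each variant (in- and cross-variant)."""
    results = {}
    for train_v, test_v in product(corpora, repeat=2):
        tagger = train_baseline(corpora[train_v]["train"])
        results[(train_v, test_v)] = accuracy(tagger, corpora[test_v]["test"])
    return results


# Hypothetical toy data; a real run would load the BP and EP parts of a corpus.
toy_corpora = {
    "BP": {
        "train": [[("o", "DET"), ("trem", "N"), ("chegou", "V")]],
        "test": [[("o", "DET"), ("ônibus", "N"), ("chegou", "V")]],
    },
    "EP": {
        "train": [[("o", "DET"), ("comboio", "N"), ("chegou", "V")]],
        "test": [[("o", "DET"), ("autocarro", "N"), ("chegou", "V")]],
    },
}

if __name__ == "__main__":
    for (train_v, test_v), acc in cross_variant_grid(toy_corpora).items():
        print(f"train={train_v} test={test_v} accuracy={acc:.2f}")
```

The diagonal cells of the grid (same variant for training and testing) give the in-variant reference, and the off-diagonal cells give the cross-variant scores whose gap the study sets out to measure.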
© Springer International Publishing Switzerland 2016. J. Silva et al. (Eds.): PROPOR 2016, LNAI 9727, pp. 227–232, 2016. DOI: 10.1007/978-3-319-41552-9_22
E.R. Fonseca and S.M. Aluísio
We also fed a tagger with word embeddings (numeric vectors representing words) learned from texts in either variant. Specifically, we tried to answer the following questions. First, to what extent can embeddings obtained from a variant v1 help a tagger trained on variant v2 improve its accuracy on v1? This is analogous to using unlabeled data for domain adaptation. Second, as a generalization, can embeddings obtained from a combination of texts from both variants improve tagging performance in any cross-variant setting? Third,