Improving POS Tagging Across Portuguese Variants with Word Embeddings


Abstract. Brazilian Portuguese (BP) and European Portuguese (EP) have specific NLP resources and tools for many tasks. It is generally agreed that applying them to the variant other than their intended one results in a performance drop; however, very little research has measured it. We evaluated a POS tagger in a cross-variant setting under multiple combinations of word embeddings, train and test corpora, and found that (i) BP is easier to tag than EP, (ii) word embeddings help increase tagger performance significantly, but not enough to close the accuracy gap in a cross-variant setting, and (iii) embeddings generated from a corpus with both variants are useful in cross-variant scenarios. While we cannot generalize observations from POS tagging to any NLP task, this is an important first step for such evaluations.

1 Introduction

Brazilian Portuguese (BP) and European Portuguese (EP) have very distinct features in phonology, syntax, word choice and, less remarkably, orthography¹. While the two variants are mutually intelligible, the differences between them have motivated the development of specialized NLP resources and tools [4,11], and speakers of either variant usually agree that applying them to the variant other than their intended one hurts performance. While this is a reasonable assertion, there is very little published research quantifying this degradation. In this study, we experimented with training a part-of-speech (POS) tagger on one variant of Portuguese and testing it on the other.

We evaluate POS tagging for a number of reasons. First, speakers of one variant of Portuguese can tell the POS of words in sentences from the other, including words they do not know, based on their morphology and context. This is because knowledge of the general properties of the language, even if acquired from one variant, allows the speaker to understand the structure of utterances in another variant. Other reasons are more practical: the Bosque corpus [1] has texts from BP and EP annotated with the same tagset, making a direct comparison straightforward. The same is true for syntactic parsing, but we focus on the simpler task of POS tagging here, leaving syntactic parsing as potential future work.

¹ The texts explored here predate the entry into force of the Portuguese Language Spelling Agreement.
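The cross-variant setup described above (train on one variant, test on the other) can be sketched as a small experiment grid. This is only an illustrative sketch, not the authors' code: the list of embedding conditions in `EMBEDDINGS` and the helper names `accuracy` and `experiment_grid` are assumptions introduced here, since this excerpt does not enumerate the exact combinations that were run.

```python
from itertools import product

# The two Portuguese variants named in the text.
VARIANTS = ("BP", "EP")

# ASSUMPTION: illustrative embedding conditions; the excerpt only says
# "multiple combinations of word embeddings". None means no pre-trained
# embeddings, "BP+EP" means embeddings from a corpus mixing both variants.
EMBEDDINGS = (None, "BP", "EP", "BP+EP")


def accuracy(gold_tags, predicted_tags):
    """Token-level accuracy: share of tokens whose predicted tag equals gold."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


def experiment_grid():
    """Yield every (embeddings, train, test) setting, flagging cross-variant ones."""
    for emb, train, test in product(EMBEDDINGS, VARIANTS, VARIANTS):
        yield {
            "embeddings": emb,
            "train": train,
            "test": test,
            "cross_variant": train != test,
        }


if __name__ == "__main__":
    for setting in experiment_grid():
        print(setting)
```

Under these assumed conditions, each setting would be paired with one tagger training run and one `accuracy` measurement on the test variant; comparing the rows where `cross_variant` is true against those where it is false gives the kind of accuracy gap the study sets out to measure.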

© Springer International Publishing Switzerland 2016. J. Silva et al. (Eds.): PROPOR 2016, LNAI 9727, pp. 227–232, 2016. DOI: 10.1007/978-3-319-41552-9_22

E.R. Fonseca and S.M. Aluísio

We also fed a tagger with word embeddings (numeric vectors representing words) learned from texts in either variant. Specifically, we tried to answer the following questions. First, to what extent can embeddings obtained from a variant v1 help a tagger trained on variant v2 improve its accuracy on v1? This is analogous to using unlabeled data for domain adaptation. Second, as a generalization, can embeddings obtained from a combination of texts from both variants improve tagging performance in any cross-variant setting? Third,
