LX-DSemVectors: Distributional Semantics Models for Portuguese


Abstract. In this article we describe the creation and distribution of the first publicly available word embeddings for Portuguese. Our embeddings are evaluated on their own and also compared with the original English models on a well-known analogy task. We gathered a large Portuguese corpus of 1.7 billion tokens, developed the first distributional semantic analogies test set for Portuguese, and proceeded with the first parametrization and evaluation of Portuguese word embeddings models.

Keywords: Distributional semantics · Word embeddings · Portuguese

1 Introduction

Current research trends focusing on distributional semantics are sparking interest in possible ways to enrich the resources and tools used for natural language processing (NLP) tasks. Researchers and practitioners are exploring the improvements that can be achieved by integrating distributional semantic vectors, also known as word embeddings, into a range of syntactic and semantic tasks, including speech recognition [16], semantic similarity of words [15], part-of-speech (POS) tagging, named entity recognition, sentiment analysis [13] and logical semantics [4].

Experimenting with word embeddings in such tasks requires large data sets from which to extract word embeddings in a specific language. At the time of writing, we have found no such freely available or evaluated data set for Portuguese. There is therefore a need to create word embeddings for Portuguese that can be explored in the same types of tasks that have been explored for English.

In this paper we describe our results in training, parameterizing and evaluating word embeddings for Portuguese (a computationally intensive and time-consuming undertaking), as well as comparing them with an implementation for the English language. Our contribution is in making available a set of trained word embeddings for the computational processing of Portuguese, as well as a set of instructions for getting them running quickly and easily.

In Sect. 2 we briefly describe word embeddings and current methods for obtaining them, followed by a description of our own implementation of the models in Sect. 3 and the set of experiments we ran to improve their accuracy. The resulting models are evaluated and analyzed in Sect. 4 against the English models, before we draw our conclusions and outline our plans for future work in Sect. 5.

J. Rodrigues et al.
© Springer International Publishing Switzerland 2016. J. Silva et al. (Eds.): PROPOR 2016, LNAI 9727, pp. 259–270, 2016. DOI: 10.1007/978-3-319-41552-9_27

2 Related Work

As concisely stated in [8], “distributional semantics is predicated on the assumption that linguistic units with certain semantic similarities also share certain similarities in the relevant environments”. Distributional semantics methods address this so-called ‘relevant environment’ through two key paradigms: count-based and prediction-based methods. Both count-based and prediction-based methods generate a set of distributional vectors (also known as word embeddings or dis
