LX-DSemVectors: Distributional Semantics Models for Portuguese
Abstract. In this article we describe the creation and distribution of the first publicly available word embeddings for Portuguese. Our embeddings are evaluated on their own and also compared with the original English models on a well-known analogy task. We gathered a large Portuguese corpus of 1.7 billion tokens, developed the first distributional semantic analogies test set for Portuguese, and proceeded with the first parametrization and evaluation of Portuguese word embeddings models.
Keywords: Distributional semantics · Word embeddings · Portuguese
1 Introduction
Current research trends in distributional semantics are sparking interest in ways to enrich the resources and tools used for natural language processing (NLP) tasks. Researchers and practitioners are exploring the improvements that can be achieved by integrating distributional semantic vectors, also known as word embeddings, into a range of syntactic and semantic tasks, including speech recognition [16], semantic similarity of words [15], part-of-speech (POS) tagging, named entity recognition, sentiment analysis [13] and logical semantics [4].
Experimenting with word embeddings in such tasks requires large data sets from which to extract word embeddings in a specific language. At the time of writing, we have found no freely available or evaluated data set of this kind for Portuguese. There is therefore a need to create word embeddings for Portuguese that can be explored in the types of tasks that, as mentioned above, have been explored for English.
In this paper we describe our results in training, parameterizing and evaluating word embeddings for Portuguese, a computationally intensive and time-consuming undertaking, and compare them with an implementation for the English language. Our contribution is in making available a set of trained word embeddings for the computational processing of Portuguese, as well as a set of instructions for getting them running quickly and easily.
In Sect. 2 we briefly describe word embeddings and current methods for obtaining them, followed by a description of our own implementation of the models in Sect. 3 and the set of experiments we ran to improve their accuracy. The resulting models are evaluated and analyzed in Sect. 4 against the English models, before we draw our conclusions and outline our plans for future work in Sect. 5.
J. Rodrigues et al. © Springer International Publishing Switzerland 2016. J. Silva et al. (Eds.): PROPOR 2016, LNAI 9727, pp. 259–270, 2016. DOI: 10.1007/978-3-319-41552-9_27
2 Related Work
As concisely stated in [8], “distributional semantics is predicated on the assumption that linguistic units with certain semantic similarities also share certain similarities in the relevant environments”. Distributional semantics methods address this so-called ‘relevant environment’ through two key paradigms: count-based and prediction-based methods. Both count-based and prediction-based methods generate a set of distributional vectors (also known as word embeddings or dis
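As a rough illustration of the count-based paradigm mentioned above, the following Python sketch builds a distributional vector for each word by counting the words that occur within a small window around it, and then compares two words by the cosine of their vectors. It is a toy example under stated assumptions, not the procedure used to train the models described in this article: the function names `cooccurrence_vectors` and `cosine`, the window size of 2, and the five-sentence corpus are all invented for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it.

    Hypothetical helper: whitespace tokenization and raw counts only,
    with no weighting or dimensionality reduction.
    """
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented for illustration (not taken from the article's corpus).
corpus = [
    "o gato bebe leite",
    "o cão bebe água",
    "o gato come peixe",
    "o cão come carne",
    "a cidade tem ruas",
]

vectors = cooccurrence_vectors(corpus)
sim_gato_cao = cosine(vectors["gato"], vectors["cão"])
sim_gato_cidade = cosine(vectors["gato"], vectors["cidade"])
print(sim_gato_cao, sim_gato_cidade)
```

In this toy corpus, "gato" and "cão" share several context words ("o", "bebe", "come"), so their vectors are closer to each other than either is to "cidade", which shares none. Prediction-based methods aim at vectors with the same property but learn them by training a model to predict words from their contexts rather than by counting.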