Evaluating cross-lingual textual similarity on dictionary alignment problem



Yiğit Sever¹ · Gönenç Ercan²

© Springer Nature B.V. 2020

Abstract
Bilingual or even polylingual word embeddings have created many possibilities for tasks involving multiple languages. While some tasks, like cross-lingual information retrieval, aim to satisfy users' multilingual information needs, others enable transferring valuable information from resource-rich languages to resource-poor ones. In either case, it is important to build and evaluate methods that operate in a cross-lingual setting. In this paper, Wordnet definitions in 7 different languages are used to create a semantic textual similarity testbed for evaluating cross-lingual textual semantic similarity methods. A document alignment task is created between the Wordnet glosses of synsets in these 7 languages. Unsupervised textual similarity methods (Wasserstein distance, Sinkhorn distance and cosine similarity) are compared with a supervised Siamese deep learning model. The task is modeled both as a retrieval task and as an alignment task to investigate the hubness of the semantic similarity functions. Our findings indicate that considering the problem as a retrieval and alignment problem has a detrimental effect on the results. Furthermore, we show that cross-lingual textual semantic similarity can be used as an automated Wordnet construction method.

Keywords Cross-lingual textual semantic similarity · Word embeddings · Wasserstein distance · Sinkhorn distance · Siamese neural network

✉ Gönenç Ercan
[email protected]

Yiğit Sever
[email protected]

1 Department of Computer Engineering, Middle East Technical University, Ankara, Turkey

2 Institute of Informatics, Hacettepe University, Ankara, Turkey


1 Introduction

Recently proposed polylingual information retrieval methods are breaking the language barrier in many tasks. Today it is possible to search in one language to retrieve resources indexed in another language (Balikas et al. 2018). Tasks such as cross-lingual search (Vulić and Moens 2015; Litschko et al. 2018) and plagiarism detection (Barrón-Cedeño et al. 2010; Potthast et al. 2011; Franco-Salvador et al. 2016; Rupnik et al. 2016) are becoming more effective. Furthermore, by building on these tools, it is possible to advance the state-of-the-art of core natural language processing tasks through cross-lingual training and transfer learning techniques (Johnson et al. 2019). Naturally, these methods require a representation, such as multilingual word embeddings, that can operate between languages. Thus, evaluation of both the word embeddings and the methods using these embeddings is an important endeavour.

Word embeddings are used to create an embedding space that encodes semantic relationships between words (Mikolov et al. 2013b). Methods for building polylingual word embeddings have been proposed, extending these word embedding spaces to span more than one language (Artetxe et al. 2018a; Jawanpuria et al. 2019). A major contribution of t
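As an illustration of the kind of comparison these shared embedding spaces enable, the sketch below computes cosine similarity, one of the unsupervised measures compared in this paper, between two gloss vectors. The vectors here are toy four-dimensional values for two hypothetical glosses; real cross-lingual embeddings would be high-dimensional vectors drawn from a shared multilingual space.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "gloss embeddings" for illustration only: in practice these would
# be, e.g., averaged word vectors from a shared cross-lingual space.
gloss_en = np.array([0.2, 0.7, 0.1, 0.4])
gloss_de = np.array([0.25, 0.65, 0.05, 0.5])

print(round(cosine_similarity(gloss_en, gloss_de), 3))  # ≈ 0.988
```

A similarity near 1.0 suggests the two glosses describe the same concept; retrieval then amounts to ranking candidate glosses by this score, while alignment additionally enforces a one-to-one matching.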