Evaluating cross-lingual textual similarity on dictionary alignment problem



Yiğit Sever¹ · Gönenç Ercan²

© Springer Nature B.V. 2020

Abstract
Bilingual or even polylingual word embeddings have created many possibilities for tasks involving multiple languages. While some tasks, like cross-lingual information retrieval, aim to satisfy users' multilingual information needs, others enable transferring valuable information from resource-rich languages to resource-poor ones. In either case, it is important to build and evaluate methods that operate in a cross-lingual setting. In this paper, Wordnet definitions in 7 different languages are used to create a semantic textual similarity testbed for evaluating cross-lingual textual semantic similarity methods. A document alignment task is created between the Wordnet glosses of synsets in these 7 languages. Unsupervised textual similarity methods (Wasserstein distance, Sinkhorn distance and cosine similarity) are compared with a supervised Siamese deep learning model. The task is modeled both as a retrieval task and as an alignment task to investigate the hubness of the semantic similarity functions. Our findings indicate that considering the problem as a retrieval and alignment problem has a detrimental effect on the results. Furthermore, we show that cross-lingual textual semantic similarity can be used as an automated Wordnet construction method.

Keywords Cross-lingual textual semantic similarity · Word embeddings · Wasserstein distance · Sinkhorn distance · Siamese neural network

✉ Gönenç Ercan
[email protected]

Yiğit Sever
[email protected]

1 Department of Computer Engineering, Middle East Technical University, Ankara, Turkey

2 Institute of Informatics, Hacettepe University, Ankara, Turkey


1 Introduction

Recently proposed polylingual information retrieval methods are breaking the language barrier in many tasks. Today it is possible to search in one language to retrieve resources indexed in another language (Balikas et al. 2018). Tasks such as cross-lingual search (Vulić and Moens 2015; Litschko et al. 2018) and plagiarism detection (Barrón-Cedeño et al. 2010; Potthast et al. 2011; Franco-Salvador et al. 2016; Rupnik et al. 2016) are becoming more effective. Furthermore, by building on these tools, it is possible to advance the state-of-the-art of core natural language processing tasks through cross-lingual training and transfer learning techniques (Johnson et al. 2019). Naturally, these methods require a representation, such as multilingual word embeddings, that can operate between languages. Thus, evaluation of both the word embeddings and the methods using these embeddings is an important endeavour.

Word embeddings are used to create an embedding space that encodes semantic relationships between words (Mikolov et al. 2013b). Methods for building polylingual word embeddings have been proposed, extending these word embedding spaces to span more than one language (Artetxe et al. 2018a; Jawanpuria et al. 2019). A major contribution of t
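As an illustration of the kind of comparison these shared embedding spaces enable, the sketch below computes cosine similarity, one of the unsupervised measures compared in this paper, between two gloss vectors. The vectors here are toy four-dimensional values for two hypothetical glosses; real cross-lingual embeddings would be high-dimensional vectors drawn from a shared multilingual space.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "gloss embeddings" for illustration only: in practice these would
# be, e.g., averaged word vectors from a shared cross-lingual space.
gloss_en = np.array([0.2, 0.7, 0.1, 0.4])
gloss_de = np.array([0.25, 0.65, 0.05, 0.5])

print(round(cosine_similarity(gloss_en, gloss_de), 3))  # ≈ 0.988
```

A similarity near 1.0 suggests the two glosses describe the same concept; retrieval then amounts to ranking candidate glosses by this score, while alignment additionally enforces a one-to-one matching.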