word.alignment: an R package for computing statistical word alignment and its evaluation


Neda Daneshgar¹ · Majid Sarmad¹

Received: 5 November 2017 / Accepted: 10 March 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract Word alignment has many applications in natural language processing (NLP) tasks. As far as we are aware, there is no word alignment package in the R environment. This paper introduces word.alignment, a new R package that implements statistical word alignment as an unsupervised learning method. It uses IBM Model 1 as the machine translation model, relying on the EM algorithm to estimate the model and on the Viterbi search to find the best alignment. It also provides symmetric alignment through three heuristic methods: union, intersection, and grow-diag. In addition, it can build an automatic bilingual dictionary by applying an innovative rule; the generated dictionary is suitable for a number of NLP tasks. The package also provides functions for measuring the quality of a word alignment by comparing it with a gold standard alignment using five metrics. It is easy to install and runs on the most widely used platforms. It is also easy to use, and we show that its results are better than those of some other word alignment tools in almost all cases. Finally, some examples illustrating the use of word.alignment are provided.

Keywords Natural language processing (NLP) · IBM model 1 · EM algorithm · Symmetric word alignment · Parallel corpus · Evaluation · Gold standard alignment · Test set
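The pipeline summarized in the abstract (IBM Model 1 estimated by EM, a Viterbi-style best alignment, then symmetrization) can be illustrated with a minimal Python sketch. This is not the package's R code. The function names `train_ibm1`, `best_alignment`, and `symmetrize` are hypothetical, the toy corpus is invented for illustration, and the sketch simplifies by omitting the NULL word and the grow-diag heuristic.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=20):
    """Estimate t(f | e) by EM from (source_tokens, target_tokens) pairs."""
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    # Uniform start: every target word is equally likely given any source word.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of links leaving e
        for src, tgt in pairs:
            for f in tgt:
                # E-step: share one unit of count for f among the source words.
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    share = t[(f, e)] / z
                    count[(f, e)] += share
                    total[e] += share
        # M-step: renormalise the expected counts into probabilities.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def best_alignment(src, tgt, t):
    """Most probable alignment: under Model 1 each target word independently
    picks the source position with the highest t(f | e)."""
    links = set()
    for j, f in enumerate(tgt):
        i = max(range(len(src)), key=lambda k: t[(f, src[k])])
        links.add((i, j))  # (source position, target position)
    return links


def symmetrize(forward, backward, method="intersection"):
    """Combine two directional alignments. `backward` holds
    (target position, source position) links and is flipped first."""
    flipped = {(i, j) for (j, i) in backward}
    return forward & flipped if method == "intersection" else forward | flipped


# Toy parallel corpus (assumption, for illustration only).
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
reverse_corpus = [(tgt, src) for src, tgt in corpus]

t_fwd = train_ibm1(corpus)          # t(english | german)
t_bwd = train_ibm1(reverse_corpus)  # t(german | english)

src, tgt = corpus[0]
forward = best_alignment(src, tgt, t_fwd)
backward = best_alignment(tgt, src, t_bwd)
print(sorted(symmetrize(forward, backward, "intersection")))
print(sorted(symmetrize(forward, backward, "union")))
```

Intersection keeps only the links both directions agree on, which favors precision, while union keeps every link proposed by either direction, which favors recall.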

Majid Sarmad
http://www.um.ac.ir/~sarmad

Neda Daneshgar

¹ Department of Statistics, Ferdowsi University of Mashhad, Mashhad, Iran


1 Introduction

Word alignment is the process of determining which words are equivalent in a bilingual sentence pair, that is, a pair of sentences in two different languages. The source language is the language the translation starts from, and the target language is the language the translation ends in. Two main types of word alignment application have been considered in the literature: one goal is to produce lexical data for bilingual dictionaries, and another is to provide data for machine translation (MT) (Wang 2004). Applications of word alignment include multilingual lexicography, word sense disambiguation, translation connections in computational lexicography (Vulić and Moens 2010), patent retrieval, which is a branch of information retrieval (Jochim et al. 2011), and cross-lingual information retrieval (Nie 2010). Word alignment is also a necessary step for almost all state-of-the-art translation systems, including syntax-based machine translation (SBMT), statistical MT (SMT), hierarchical phrase-based systems (Brunning 2010), example-based MT (Vulić and Moens 2010), and many other multilingual applications. Therefore, the task of word alignment is interesting in itself for plenty