DiaBLa: a corpus of bilingual spontaneous written dialogues for machine translation



Rachel Bawden1 · Eric Bilinski2 · Thomas Lavergne3 · Sophie Rosset2

Accepted: 24 October 2020 · © The Author(s) 2020

Abstract

We present a new English–French dataset for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue. The test set contains 144 spontaneous dialogues (5700+ sentences) between native English and French speakers, mediated by one of two neural MT systems in a range of role-play settings. The dialogues are accompanied by fine-grained sentence-level judgments of MT quality, produced by the dialogue participants themselves, as well as by manually normalised versions and reference translations produced a posteriori. The motivation for the corpus is twofold: to provide (i) a unique resource for evaluating MT models, and (ii) a corpus for the analysis of MT-mediated communication. We provide an initial analysis of the corpus to confirm that the participants' judgments reveal perceptible differences in MT quality between the two MT systems used.

Keywords Machine translation · Corpus · Dataset · Evaluation · Bilingual · Dialogue · Chat

1 School of Informatics, University of Edinburgh, Edinburgh, Scotland, UK

2 LIMSI, CNRS, Université Paris-Saclay, Orsay, France

3 LIMSI, CNRS, Université Paris-Saclay, Univ. Paris-Sud, Orsay, France

Corresponding author: Rachel Bawden, [email protected]

1 Introduction

The use of Machine Translation (MT) to translate everyday, written exchanges is becoming increasingly commonplace; translation tools now regularly appear in chat applications and on social networking sites to enable cross-lingual communication. MT systems must therefore be able to handle a wide variety of topics, styles and vocabulary. Importantly, the translation of dialogue requires translating sentences coherently with respect to the conversational flow, so that all aspects of the exchange, including speaker intent, attitude and style, are correctly communicated (Bawden 2018). Realistic data is therefore important both for evaluating MT models and for guiding future MT research on informal, written exchanges.

In this article, we present DiaBLa (Dialogue BiLingue 'Bilingual Dialogue'), a new dataset of English–French spontaneous written dialogues mediated by MT,1 obtained by crowdsourcing, covering a range of dialogue topics and annotated with fine-grained human judgments of MT quality. To our knowledge, this is the first corpus of its kind. Our data collection protocol is designed to encourage speakers of two languages to interact, using role-play scenarios to provide conversation material. Sentence-level human judgments of translation quality are provided by the participants themselves while they are actively engaged in dialogue. The result is a rich bilingual test corpus of 144 dialogues, annotated with sentence-level MT quality evaluations and human reference translations.

We begin by reviewing related work in corpus development, focusing particularly on informal written texts and spontaneous