Multimodal machine translation through visuals and speech



Umut Sulubacak · Ozan Caglayan · Stig-Arne Grönroos · Aku Rouhe · Desmond Elliott · Lucia Specia · Jörg Tiedemann

Received: 5 December 2019 / Accepted: 22 July 2020 / Published online: 13 August 2020
© The Author(s) 2020

Abstract

Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, which exploits the audio modality, and image-guided and video-guided translation, which exploit the visual modality. These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement that models generate outputs in a different language. This survey reviews the major data resources for these tasks, the evaluation campaigns concentrated around them, the state of the art in end-to-end and pipeline approaches, and also the challenges in performance evaluation. The paper concludes with a discussion of directions for future research in these areas: the need for more expansive and challenging datasets, for targeted evaluations of model performance, and for multimodality in both the input and output space.

Keywords: Natural language processing · Machine translation · Multimodal machine translation · Image-guided translation · Spoken language translation

1 Introduction

Humans are able to make use of complex combinations of visual, auditory, tactile, and other stimuli, and are capable of not only handling each sensory modality in isolation, but also simultaneously integrating them to improve the quality of perception and understanding (Stein et al. 2009). From a computational perspective, natural language processing (NLP) requires such abilities, too, in order to approach human-level grounding and understanding in various AI tasks.


Fig. 1 Prominent examples of multimodal translation tasks, such as image-guided translation (IGT), video-guided translation (VGT), and spoken language translation (SLT), shown in contrast to unimodal translation tasks, such as text-based machine translation (MT) and speech-to-speech translation (S2S), and multimodal NLP tasks that do not involve translation, such as automatic speech recognition (ASR), image captioning (IC), and video description (VD)
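To make the taxonomy in Fig. 1 concrete, the sketch below restates each task as an input/output signature, showing which modalities and languages it maps between. This is only an illustration; the type aliases and names are hypothetical and do not come from the paper.

```python
# A minimal, hypothetical sketch of the task signatures in Fig. 1.
# All type aliases and names are illustrative, not from the paper.
from typing import Callable

SourceText = str      # text in language A
TargetText = str      # text in language B
SourceSpeech = bytes  # speech audio in language A
TargetSpeech = bytes  # speech audio in language B
Image = bytes         # a still image
Video = bytes         # a video clip

# Unimodal translation tasks (single input modality, output in language B)
MT = Callable[[SourceText], TargetText]           # text-based machine translation
S2S = Callable[[SourceSpeech], TargetSpeech]      # speech-to-speech translation

# Multimodal translation tasks (output in language B)
SLT = Callable[[SourceSpeech], TargetText]        # spoken language translation
IGT = Callable[[SourceText, Image], TargetText]   # image-guided translation
VGT = Callable[[SourceText, Video], TargetText]   # video-guided translation

# Multimodal NLP tasks without translation (output stays in language A)
ASR = Callable[[SourceSpeech], SourceText]        # automatic speech recognition
IC = Callable[[Image], SourceText]                # image captioning
VD = Callable[[Video], SourceText]                # video description
```

Reading the signatures this way makes the survey's central distinction explicit: the multimodal translation tasks share their additional input modality with ASR, IC, and VD, but, like MT and S2S, they must produce output in a different language.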

While language covers written, spoken, and sign language in human communication, the vision, speech, and language processing communities have worked largely apart in the past. As a consequence, NLP became more focused on textual representations, which often disregard many other characteristics of communication.