Multimodal machine translation through visuals and speech



Umut Sulubacak · Ozan Caglayan · Stig-Arne Grönroos · Aku Rouhe · Desmond Elliott · Lucia Specia · Jörg Tiedemann

Received: 5 December 2019 / Accepted: 22 July 2020 / Published online: 13 August 2020
© The Author(s) 2020

Abstract

Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, which exploits the audio modality, and image-guided and video-guided translation, which exploit the visual modality. These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement that models generate outputs in a different language. This survey reviews the major data resources for these tasks, the evaluation campaigns concentrated around them, the state of the art in end-to-end and pipeline approaches, and also the challenges in performance evaluation. The paper concludes with a discussion of directions for future research in these areas: the need for more expansive and challenging datasets, for targeted evaluations of model performance, and for multimodality in both the input and output space.

Keywords: Natural language processing · Machine translation · Multimodal machine translation · Image-guided translation · Spoken language translation

1 Introduction

Humans are able to make use of complex combinations of visual, auditory, tactile, and other stimuli, and are capable of not only handling each sensory modality in isolation, but also simultaneously integrating them to improve the quality of perception and understanding (Stein et al. 2009). From a computational perspective, natural language processing (NLP) requires such abilities, too, in order to approach human-level grounding and understanding in various AI tasks.


Fig. 1 Prominent examples of multimodal translation tasks, such as image-guided translation (IGT), video-guided translation (VGT), and spoken language translation (SLT), shown in contrast to unimodal translation tasks, such as text-based machine translation (MT) and speech-to-speech translation (S2S), and multimodal NLP tasks that do not involve translation, such as automatic speech recognition (ASR), image captioning (IC), and video description (VD)
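To make the taxonomy in Fig. 1 concrete, the sketch below restates each task as an input/output signature, showing which modalities and languages it maps between. This is only an illustration; the type aliases and names are hypothetical and do not come from the paper.

```python
# A minimal, hypothetical sketch of the task signatures in Fig. 1.
# All type aliases and names are illustrative, not from the paper.
from typing import Callable

SourceText = str      # text in language A
TargetText = str      # text in language B
SourceSpeech = bytes  # speech audio in language A
TargetSpeech = bytes  # speech audio in language B
Image = bytes         # a still image
Video = bytes         # a video clip

# Unimodal translation tasks (single input modality, output in language B)
MT = Callable[[SourceText], TargetText]           # text-based machine translation
S2S = Callable[[SourceSpeech], TargetSpeech]      # speech-to-speech translation

# Multimodal translation tasks (output in language B)
SLT = Callable[[SourceSpeech], TargetText]        # spoken language translation
IGT = Callable[[SourceText, Image], TargetText]   # image-guided translation
VGT = Callable[[SourceText, Video], TargetText]   # video-guided translation

# Multimodal NLP tasks without translation (output stays in language A)
ASR = Callable[[SourceSpeech], SourceText]        # automatic speech recognition
IC = Callable[[Image], SourceText]                # image captioning
VD = Callable[[Video], SourceText]                # video description
```

Reading the signatures this way makes the survey's central distinction explicit: the multimodal translation tasks share their additional input modality with ASR, IC, and VD, but, like MT and S2S, they must produce output in a different language.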

While language covers written, spoken, and sign language in human communication, the vision, speech, and language processing communities have worked largely apart in the past. As a consequence, NLP became more focused on textual representations, which often disregard many other characteristics of communication.