Some consideration on expressive audiovisual speech corpus acquisition using a multimodal platform


Sara Dahmani1 • Vincent Colotte1 • Slim Ouni1

© Springer Nature B.V. 2020

Abstract In this paper, we present a multimodal acquisition setup that combines different motion-capture systems. The system is mainly intended for recording expressive audiovisual corpora in the context of audiovisual speech synthesis. When recording speech, standard optical motion-capture systems fail to track the articulators finely, especially the inner mouth region, because certain markers disappear during articulation. In addition, some systems have limited frame rates and are not suitable for smooth speech tracking. In this work, we demonstrate how these limitations can be overcome by building a heterogeneous system that takes advantage of different tracking systems. Within the scope of this work, we recorded a prototypical corpus of a single subject using our combined system. This corpus was used to validate our multimodal data acquisition protocol and to assess the quality of the expressiveness before recording a large corpus. We conducted two evaluations of the recorded data: the first concerns speech production and the second concerns speech perception (both evaluations cover the visual and acoustic modalities). The production analysis allowed us to identify characteristics specific to each expressive context and showed that the expressive content of the recorded data is broadly in line with what the literature leads one to expect. The perceptual evaluation, conducted as a human emotion recognition task using different types of stimuli, confirmed that the recorded emotions were well perceived.

1 CNRS, Inria, LORIA, Université de Lorraine, 54000 Nancy, France


Keywords Expressive audiovisual speech · Facial expressions · Acted speech

1 Introduction In expressive audiovisual speech synthesis, acquiring a corpus is an essential step. The textual content of the corpus should be phonetically rich so that it covers different diphones in different contexts (previous and following diphones), as recommended in the acoustic speech synthesis literature (François and Boëffard 2001; Volker Strom and King 2006; Jonathan and Delhay 2008; Dutoit 2008). Moreover, for expressive speech synthesis, the corpus should cover different emotions. Furthermore, compared with a corpus for acoustic-only speech synthesis, handling the visual component of speech is time-consuming, which may limit the size of the corpus that can be acquired. In this paper, we address the issues that we encountered while recording an expressive audiovisual speech corpus. Some information on corpus recording setups and statistics on their content can be found in the literature, but very little information can be found about some essential details for building an
