Some consideration on expressive audiovisual speech corpus acquisition using a multimodal platform
Sara Dahmani1 • Vincent Colotte1 • Slim Ouni1
© Springer Nature B.V. 2020
Abstract In this paper, we present a multimodal acquisition setup that combines different motion-capture systems. This system is mainly intended for recording expressive audiovisual corpora in the context of audiovisual speech synthesis. When recording speech, standard optical motion-capture systems fail to track the articulators finely, especially the inner mouth region, because certain markers disappear during articulation. In addition, some systems have limited frame rates and are not suitable for smooth speech tracking. In this work, we demonstrate how these limitations can be overcome by building a heterogeneous system that takes advantage of different tracking systems. Within the scope of this work, we recorded a prototypical corpus for a single subject using our combined system. This corpus was used to validate our multimodal data acquisition protocol and to assess the quality of the expressiveness before recording a large corpus. We conducted two evaluations of the recorded data: the first concerns the production aspect of speech, and the second focuses on speech perception (both evaluations cover the visual and acoustic modalities). The production analysis allowed us to identify characteristics specific to each expressive context and showed that the expressive content of the recorded data is broadly in line with what is commonly described in the literature. The perceptual evaluation, conducted as a human emotion recognition task using different types of stimuli, confirmed that the different recorded emotions were well perceived.
Slim Ouni [email protected] • Sara Dahmani [email protected] • Vincent Colotte [email protected]
1 CNRS, Inria, LORIA, Université de Lorraine, 54000 Nancy, France
Keywords Expressive audiovisual speech · Facial expressions · Acted speech
1 Introduction
Acquiring a corpus is an essential step in expressive audiovisual speech synthesis. The textual content of the corpus should be phonetically rich, so that it covers different diphones in different contexts (preceding and following diphones), as recommended in the acoustic speech synthesis literature (François and Boëffard 2001; Volker Strom and King 2006; Jonathan and Delhay 2008; Dutoit 2008). In the case of expressive speech synthesis, the corpus should also cover different emotions. Furthermore, compared with a corpus for acoustic-only speech synthesis, handling the visual component of speech is time-consuming, which may constrain the size of the corpus that can be acquired. In this paper, we address the issues that we experienced while recording an expressive audiovisual speech corpus. Some information on corpus recording setups and statistics on their content can be found in the literature, but very little information can be found about some essential details for building an