Emotion Modelling via Speech Content and Prosody: In Computer Games and Elsewhere

Björn Schuller

Abstract The chapter describes a typical modern speech emotion recognition engine as can be used to enhance the emotional intelligence of computer games or other technical systems. The acquisition of human affect via the spoken content and via its prosody and further acoustic features is highlighted. Features for both of these information streams are briefly discussed, along with chunking of the stream. Decision making is presented both with and without training data. A particular focus is then placed on autonomous learning and adaptation methods as well as the required calculation of confidence measures. Practical aspects include the encoding of the information, distribution of the processing, and available toolkits. Benchmark performances are given by typical competitive challenges in the field.

Introduction The automatic recognition of emotion in speech dates back some twenty years to the very first attempts, cf., e.g., [9]. The aim of this chapter is to give a general glance ‘under the hood’ at how today’s engines work. First, a very brief overview of the modelling of emotion is given. A special focus is then placed on speech emotion recognition in computer games, owing to the context of this book. Finally, the structure of the remainder of the chapter is outlined, which aims to familiarise the reader with the general principles of current engines, their abilities, and their necessities.

Emotion Modelling A number of different representation forms have been evaluated, with the most popular ones being discrete emotion classes such as ‘anger’, ‘joy’, or ‘neutral’ – usually ranging in number from two to roughly a dozen [51] depending on the

B. Schuller, Imperial College London, 180 Queen’s Gate, SW7 2AZ London, UK, e-mail: [email protected]. © Springer International Publishing Switzerland 2016. K. Karpouzis, G.N. Yannakakis (eds.), Emotion in Games, Socio-Affective Computing 4, DOI 10.1007/978-3-319-41316-7_5


application of interest –, and a representation by continuous emotion ‘primitives’ in the sense of a number of (quasi-)value-continuous dimensions such as arousal/activation, valence/positivity/sentiment, dominance/power/potency, expectation/surprise/novelty, or intensity [43]. In a space spanned by these axes, the classes can be assigned as points or regions, thus allowing for a ‘translation’ between these two representation forms. Other popular approaches include tagging, which allows several class labels per instance of analysis (in the case of two labels, the term ‘complex emotions’ has been used), and calculating a score for each emotion class, leading to ‘soft emotion profiles’ [32], potentially with a minimum threshold to be exceeded. Besides choosing such a representation of emotion, one has to choose a temporal segmentation, as the speech needs to be segmented into units of analysis. This analysis itself can be based on the spoken content or the ‘way of speaking’ it in the sense of prosody, arti