Emotion Recognition in Speech with Deep Learning Architectures

Abstract. Deep neural networks (DNNs) have become very popular for learning abstract high-level representations from raw data. This has led to improvements in several classification tasks, including emotion recognition in speech. Besides its use as a feature learner, a DNN can also be used as a classifier. In either case it is a challenge to determine the number of hidden layers and the number of neurons in each layer for such networks. In this work the architecture of a DNN is determined by a restricted grid search with the aim of recognizing emotion in human speech. Because speech signals are essentially time series, the data are transformed into an appropriate format so that they can serve as input for deep feed-forward neural networks without losing much time-dependent information. Furthermore, the Elman-Net is examined. The results show that by maintaining time-dependent information in the data, better classification accuracies can be achieved with deep architectures.
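The two core ideas named in the abstract, stacking neighbouring frames so a feed-forward network still sees local temporal context, and a restricted grid search over network depth and width, can be sketched as follows. This is a minimal illustration assuming TensorFlow/Keras; the context width, the candidate layer and neuron counts, the feature dimension, and the training settings are assumptions made for the example, not the configuration used in the paper.

import numpy as np
import tensorflow as tf

def stack_frames(features, context=5):
    # Turn a (T, D) sequence of frame-level features into fixed-size vectors
    # by concatenating each frame with its `context` neighbours on both sides,
    # so a feed-forward net retains local time-dependent information.
    T, D = features.shape
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].ravel() for t in range(T)])

def build_dnn(input_dim, hidden_layers, units, n_classes):
    model = tf.keras.Sequential([tf.keras.Input(shape=(input_dim,))])
    for _ in range(hidden_layers):
        model.add(tf.keras.layers.Dense(units, activation="relu"))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def grid_search(X_train, y_train, X_val, y_val, n_classes):
    # Restricted grid search: only a small, hand-picked set of depth/width
    # combinations is tried (the candidate values below are assumed).
    best_arch, best_acc = None, 0.0
    for hidden_layers in (1, 2, 3):
        for units in (64, 128, 256):
            model = build_dnn(X_train.shape[1], hidden_layers, units, n_classes)
            model.fit(X_train, y_train, epochs=5, batch_size=32, verbose=0)
            _, acc = model.evaluate(X_val, y_val, verbose=0)
            if acc > best_acc:
                best_arch, best_acc = (hidden_layers, units), acc
    return best_arch, best_acc

# Illustrative usage with random data (10-dim frame features, 7 emotion classes;
# frame-level labels are used here only to keep the sketch self-contained):
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 10))
X = stack_frames(frames)            # shape (200, 110)
y = rng.integers(0, 7, size=200)
best_arch, best_acc = grid_search(X[:150], y[:150], X[150:], y[150:], 7)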

1 Introduction

Paralinguistic information such as intonation is an important part of a conversation. These kinds of information can be considered the semantics of a spoken utterance. For example, the word “yes” is basically an expression of agreement, but with a contemptuous intonation it can mean exactly the opposite, namely rejection, and this can be evidence that the speaker is angry. Hence it is possible to perceive the emotional state of the speaker through paralinguistic information conveyed in the speech signal. Because emotions can be crucial for the interpretation of a spoken utterance, efforts are being made to give computers the ability to recognize emotion in speech in order to improve human-computer interaction (cf. [15]). Nowadays this is a growing field of research known as affective computing. The aim of speech emotion recognition is therefore to identify the high-level affective state of an utterance from low-level features. The task is to recognize specific patterns as sequences in the speech signal and to categorize them into several classes of emotions. There are several machine learning models that can be used for classification. In machine learning theory, a model is an algorithm which learns from data to tackle a specific task without having been explicitly programmed; the learning process is often called training. One such model is the artificial neural network (ANN), which is loosely inspired by the functioning of the human brain. A deep neural network is an ANN with many layers of nonlinear processing units. The field of research that studies methods to train ANNs with deep architectures is called deep learning. Deep learning architectures (DLAs) have been shown to exceed previous state-of-the-art results in several tasks, including emotion recognition in speech [1–3].
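As a concrete counterpart to the deep feed-forward networks just described, the Elman-Net mentioned in the abstract maintains time-dependent information through a recurrent hidden state rather than through stacked input frames. A minimal sketch, assuming Keras, whose SimpleRNN layer implements an Elman-style recurrence h_t = tanh(W x_t + U h_{t-1} + b); the sequence length, feature dimension, and unit count below are illustrative assumptions.

import tensorflow as tf

def build_elman(timesteps, feat_dim, units, n_classes):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(timesteps, feat_dim)),
        # Elman-style recurrence over the frame sequence
        tf.keras.layers.SimpleRNN(units, activation="tanh"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

model = build_elman(timesteps=100, feat_dim=39, units=128, n_classes=7)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")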

2 Related Work

For a long time, DNNs were considered to be hard to train.