Emotion Recognition in Speech with Deep Learning Architectures

Abstract. Deep neural networks (DNNs) have become very popular for learning abstract high-level representations from raw data. This has led to improvements in several classification tasks, including emotion recognition in speech. Besides its use as a feature learner, a DNN can also be used as a classifier. In either case it is a challenge to determine the number of hidden layers and the number of neurons in each layer for such networks. In this work the architecture of a DNN is determined by a restricted grid search with the aim of recognizing emotion in human speech. Because speech signals are essentially time series, the data are transformed into an appropriate format so that they can serve as input for deep feed-forward neural networks without losing much time-dependent information. Furthermore, the Elman-Net is examined. The results show that by maintaining time-dependent information in the data, better classification accuracies can be achieved with deep architectures.
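The two core ideas named in the abstract, stacking neighbouring frames so a feed-forward network still sees local temporal context, and a restricted grid search over network depth and width, can be sketched as follows. This is a minimal illustration assuming TensorFlow/Keras; the context width, the candidate layer and neuron counts, the feature dimension, and the training settings are assumptions made for the example, not the configuration used in the paper.

import numpy as np
import tensorflow as tf

def stack_frames(features, context=5):
    # Turn a (T, D) sequence of frame-level features into fixed-size vectors
    # by concatenating each frame with its `context` neighbours on both sides,
    # so a feed-forward net retains local time-dependent information.
    T, D = features.shape
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].ravel() for t in range(T)])

def build_dnn(input_dim, hidden_layers, units, n_classes):
    model = tf.keras.Sequential([tf.keras.Input(shape=(input_dim,))])
    for _ in range(hidden_layers):
        model.add(tf.keras.layers.Dense(units, activation="relu"))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def grid_search(X_train, y_train, X_val, y_val, n_classes):
    # Restricted grid search: only a small, hand-picked set of depth/width
    # combinations is tried (the candidate values below are assumed).
    best_arch, best_acc = None, 0.0
    for hidden_layers in (1, 2, 3):
        for units in (64, 128, 256):
            model = build_dnn(X_train.shape[1], hidden_layers, units, n_classes)
            model.fit(X_train, y_train, epochs=5, batch_size=32, verbose=0)
            _, acc = model.evaluate(X_val, y_val, verbose=0)
            if acc > best_acc:
                best_arch, best_acc = (hidden_layers, units), acc
    return best_arch, best_acc

# Illustrative usage with random data (10-dim frame features, 7 emotion classes;
# frame-level labels are used here only to keep the sketch self-contained):
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 10))
X = stack_frames(frames)            # shape (200, 110)
y = rng.integers(0, 7, size=200)
best_arch, best_acc = grid_search(X[:150], y[:150], X[150:], y[150:], 7)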

1 Introduction

Paralinguistic information such as intonation is an important part of a conversation. These kinds of information can be considered the semantics of a spoken utterance. For example, the word “yes” is basically an expression of agreement, but with a contemptuous intonation it can mean exactly the opposite, namely rejection, and this can be evidence that the speaker is angry. Hence it is possible to perceive the emotional state of the speaker through paralinguistic information conveyed in the speech signal. Because emotions can be crucial for the interpretation of a spoken utterance, efforts are being made to give computers the ability to recognize emotion in speech in order to improve human-computer interaction (cf. [15]). Nowadays this is a growing field of research known as affective computing. The aim of speech emotion recognition is therefore to identify the high-level affective state of an utterance from low-level features. The task is to recognize specific patterns as sequences in the speech signal and to categorize them into several classes of emotions. There are several machine learning models that can be used for classification. In machine learning theory, a model is an algorithm which learns from data to tackle a specific task without having been explicitly programmed; the learning process is often called training. One such model is the artificial neural network (ANN), which is loosely inspired by the functioning of the human brain. A deep neural network is an ANN with many layers of nonlinear processing units. The field of research that studies methods to train ANNs with deep architectures is called deep learning. Deep learning architectures (DLAs) have been shown to exceed previous state-of-the-art results in several tasks, including emotion recognition in speech [1–3].
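As a concrete counterpart to the deep feed-forward networks just described, the Elman-Net mentioned in the abstract maintains time-dependent information through a recurrent hidden state rather than through stacked input frames. A minimal sketch, assuming Keras, whose SimpleRNN layer implements an Elman-style recurrence h_t = tanh(W x_t + U h_{t-1} + b); the sequence length, feature dimension, and unit count below are illustrative assumptions.

import tensorflow as tf

def build_elman(timesteps, feat_dim, units, n_classes):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(timesteps, feat_dim)),
        # Elman-style recurrence over the frame sequence
        tf.keras.layers.SimpleRNN(units, activation="tanh"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

model = build_elman(timesteps=100, feat_dim=39, units=128, n_classes=7)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")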

2 Related Work

For a long time, DNNs were considered to be hard to train.