Usage of DNN in Speaker Recognition: Advantages and Problems



Speech Technology Center, Krasutskogo str. 4, 196084 St. Petersburg, Russia
{kudashev,novoselov,tim,lavrentyeva}@speechpro.com
2 ITMO University, St. Petersburg, Russia
[email protected]

Abstract. In this paper we consider different approaches to applying artificial neural networks to the speaker recognition task. We investigated the performance of DNNs at different levels of a speaker recognition system: the i-vector extraction level and the model back-end level. The results of our study demonstrate the high efficiency of the proposed neural-network-based approaches for solving this problem. It is shown that the use of DNN technology at each of these levels independently increases the reliability of the speaker recognition system. However, such systems also have some disadvantages, which are described in this paper.

Keywords: DNN · Speaker recognition · PLDA

1 Introduction

State-of-the-art text-independent speaker recognition is based on the i-vector extraction paradigm. Typically this framework can be decomposed into three stages: the collection of sufficient statistics, the extraction of i-vectors, and a probabilistic linear discriminant analysis (PLDA) backend [1–4]. Sufficient statistics are collected from a sequence of feature vectors (e.g., mel-frequency cepstral coefficients (MFCC)) and are usually represented by the Baum-Welch statistics obtained with respect to a Gaussian mixture model (GMM), referred to as the universal background model (UBM). These statistics are converted into a single low-dimensional feature vector, an i-vector, that represents important information about the speaker and all other types of variability. After i-vectors are extracted, a PLDA model is used to produce verification scores by comparing i-vectors extracted from different speech segments.

The successful application of deep neural networks (DNN) [5, 6] in automatic speech recognition has provided strong motivation to look for possible gains from applying DNNs to the speaker recognition task. For example, Lei et al. [1] and Kenny et al. [2] used DNN posteriors instead of GMM posteriors to derive sufficient statistics for an alternative i-vector calculation, which makes it possible to discriminate speakers at the triphone level. According to recent results, this approach significantly outperforms the conventional UBM-TV-i-vector scheme in speaker recognition on telephone speech.

© Springer International Publishing Switzerland 2016 L. Cheng et al. (Eds.): ISNN 2016, LNCS 9719, pp. 82–91, 2016. DOI: 10.1007/978-3-319-40663-3_10
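The first stage of the i-vector framework, the collection of Baum-Welch statistics, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `gmm_posteriors` and `baum_welch_stats`, the diagonal-covariance assumption, and the random toy data are all assumptions made for illustration. The sketch computes per-frame component posteriors for a small GMM and accumulates the zeroth- and first-order statistics from them.

```python
import numpy as np


def gmm_posteriors(features, weights, means, variances):
    """Per-frame posteriors of a diagonal-covariance GMM (assumed UBM form).

    features: (T, D); weights: (C,); means, variances: (C, D).
    Returns an array of shape (T, C) whose rows sum to 1.
    """
    diff = features[:, None, :] - means[None, :, :]              # (T, C, D)
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances[None, :, :], axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
    )                                                            # (T, C)
    log_joint = np.log(weights)[None, :] + log_gauss
    # Subtract the row-wise maximum before exponentiating for stability.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)


def baum_welch_stats(features, posteriors):
    """Zeroth-order (C,) and first-order (C, D) Baum-Welch statistics."""
    n = posteriors.sum(axis=0)       # soft frame count per component
    f = posteriors.T @ features      # posterior-weighted feature sums
    return n, f


# Toy data standing in for MFCC frames and a trained UBM (hypothetical values).
rng = np.random.default_rng(0)
T, D, C = 200, 4, 3
features = rng.normal(size=(T, D))
weights = np.full(C, 1.0 / C)
means = rng.normal(size=(C, D))
variances = np.ones((C, D))

post = gmm_posteriors(features, weights, means, variances)
n, f = baum_welch_stats(features, post)
```

In the DNN-based variant described above, the same `baum_welch_stats` accumulation would apply, with the `posteriors` array taken from the outputs of a DNN instead of from the GMM; how those outputs are obtained is outside the scope of this sketch.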


In [7] the authors also reported good DNN-based speaker identification (SID) performance on microphone speech. Two approaches to DNN-based SID were considered: one uses the DNN to extract features, and the other uses the DNN for feature modeling. Modeling is performed using the i-vector framework, in which the traditional universal background model is replaced with a DNN. An alternative way of successfully applying DNNs to the SID task is the extraction of bottl
