Usage of DNN in Speaker Recognition: Advantages and Problems




1 Speech Technology Center, Krasutskogo str. 4, 196084 St. Petersburg, Russia
{kudashev,novoselov,tim,lavrentyeva}@speechpro.com
2 ITMO University, St. Petersburg, Russia
[email protected]

Abstract. In this paper we consider different approaches to applying artificial neural networks to the speaker recognition task. We investigated the performance of DNNs applied at different levels of a speaker recognition system: the i-vector extraction level and the model back-end level. The results of our study demonstrate the high efficiency of the proposed neural-network-based approaches for solving this problem. It is shown that using DNN technology at each of these levels independently increases the reliability of the speaker recognition system. However, such systems also have some disadvantages, which are described in this paper.

Keywords: DNN · Speaker recognition · PLDA

© Springer International Publishing Switzerland 2016 L. Cheng et al. (Eds.): ISNN 2016, LNCS 9719, pp. 82–91, 2016. DOI: 10.1007/978-3-319-40663-3_10

1 Introduction

State-of-the-art text-independent speaker recognition technology is based on the i-vector extraction paradigm. Typically, this framework can be decomposed into three stages: the collection of sufficient statistics, the extraction of i-vectors, and a probabilistic linear discriminant analysis (PLDA) back-end [1–4]. Sufficient statistics are collected from a sequence of feature vectors (e.g., mel-frequency cepstral coefficients (MFCC)) and are usually represented by the Baum-Welch statistics obtained with respect to a Gaussian mixture model (GMM), referred to as the universal background model (UBM). These statistics are converted into a single low-dimensional feature vector, an i-vector, that represents important information about the speaker and all other types of variability. After the i-vectors are extracted, a PLDA model is used to produce verification scores by comparing i-vectors extracted from different speech segments.

The successful application of deep neural networks (DNN) [5, 6] in automatic speech recognition has strongly motivated attempts to obtain gains from applying DNNs to the speaker recognition task. For example, Lei et al. [1] and Kenny et al. [2] used DNN posteriors instead of GMM posteriors to derive the sufficient statistics for an alternative i-vector calculation, which makes it possible to discriminate speakers at the triphone level. According to recent results, this approach significantly outperforms the conventional UBM-TV-i-vector scheme in speaker recognition on telephone speech.
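To make the first stage of this pipeline concrete, the following is a minimal sketch of how zeroth-order and first-order Baum-Welch statistics can be accumulated from per-frame posteriors. It is an illustration under stated assumptions, not the implementation used in this paper: the function names (`gmm_posteriors`, `baum_welch_stats`), the diagonal-covariance UBM, and the toy two-component parameters are all hypothetical. The posterior function is passed in as an argument, so the GMM posteriors used here could in principle be swapped for the output posteriors of a DNN.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_posteriors(x, weights, means, variances):
    """Posterior probability of each UBM component given one frame x."""
    log_p = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(log_p)  # subtract the maximum for numerical stability
    unnorm = [math.exp(lp - top) for lp in log_p]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def baum_welch_stats(frames, posterior_fn):
    """Accumulate zeroth-order (N) and first-order (F) statistics.

    posterior_fn maps one frame to a list of per-class posteriors.
    N[c] is the sum of posteriors of class c over all frames;
    F[c] is the posterior-weighted sum of the frames for class c.
    """
    n_stats, f_stats = None, None
    for x in frames:
        gamma = posterior_fn(x)
        if n_stats is None:
            n_stats = [0.0] * len(gamma)
            f_stats = [[0.0] * len(x) for _ in gamma]
        for c, g in enumerate(gamma):
            n_stats[c] += g
            for d, xd in enumerate(x):
                f_stats[c][d] += g * xd
    return n_stats, f_stats


# Toy UBM with two components (values invented for illustration only).
ubm_weights = [0.5, 0.5]
ubm_means = [[0.0, 0.0], [4.0, 4.0]]
ubm_vars = [[1.0, 1.0], [1.0, 1.0]]

# Three 2-dimensional "feature vectors" standing in for MFCC frames.
frames = [[0.1, -0.2], [3.9, 4.2], [0.3, 0.1]]

N, F = baum_welch_stats(
    frames,
    lambda x: gmm_posteriors(x, ubm_weights, ubm_means, ubm_vars),
)
```

Because the posteriors of each frame sum to one, the entries of `N` sum to the number of frames, which is a convenient sanity check. The i-vector extractor and the PLDA back-end operate on these statistics and are not sketched here.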


In [7], the authors also reported good DNN-based speaker identification (SID) performance on microphone speech. Two approaches to DNN-based SID were considered: one uses the DNN to extract features, and the other uses the DNN for feature modeling. Modeling is performed within the i-vector framework, in which the traditional universal background model is replaced with a DNN. An alternative way of successfully applying DNNs to the SID task is the extraction of bottl