Prosodic Features Based Text-dependent Speaker Recognition with Short Utterance


1 School of Communication Engineering, Hangzhou Dianzi University, Hangzhou, China
[email protected], [email protected]

2 School of Mathematics and Computational Science, Sun Yat-sen University, Guangzhou, China
[email protected]

Abstract. Over the past several years, Gaussian mixture models (GMMs) have been the dominant modeling approach in text-independent speaker recognition. However, the recognition accuracy of these models declines as utterances become shorter. At present, Mel-frequency cepstral coefficients are generally used to characterize the properties of the vocal tract and are widely applied in speech recognition. In addition, prosodic features, such as pitch and formant, are generally considered to describe the glottal characteristics. However, the efficiency of these approaches remains unsatisfactory. In text-dependent speaker verification systems with short utterances, prosodic features can in theory help improve the recognition result. To optimize the performance of speaker verification systems under the adapted GMM-UBM framework, we adopt a variant speaker verification system based on prosodic features, in which a dual-judgment mechanism is used to integrate vocal tract features with prosodic features. Experimental results show that the new system achieves better recognition performance.

Keywords: Speaker verification · Text dependent · Prosodic features · Dual judgment mechanism

1 Introduction

As one of the most natural biometric identification methods, speaker recognition has great potential in fields such as convergent key [1, 2], ordinary digital signatures, and biometric key [3]. Speaker recognition technology [4], which aims to recognize speaker identities automatically, is becoming more and more attractive. Meanwhile, short utterance speaker recognition (SUSR) has become a research hotspot. GMM-UBM and GMM-SVM [5, 6], based on clustering and subspace, are two popular speaker recognition methods. For systems based on such structures, [7] illustrates how performance changes with different valid test utterance lengths on the NIST SRE 2005 database: the Equal Error Rate increases sharply as the test utterances become shorter.

© Springer Science+Business Media Singapore 2016. K. Li et al. (Eds.): ISICA 2015, CCIS 575, pp. 541–552, 2016. DOI: 10.1007/978-981-10-0356-1_57
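The Equal Error Rate (EER) cited in the introduction is the operating point at which the false acceptance rate equals the false rejection rate. As a rough illustration only, the sketch below estimates it from two lists of verification scores by sweeping the decision threshold over the observed scores. The function name `equal_error_rate` and the toy scores are assumptions made for this example; they are not taken from the paper or from [7].

```python
import numpy as np


def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER and the threshold at which it occurs.

    A trial is accepted when its score is >= threshold.
    Illustrative helper, not the evaluation code used in the paper.
    """
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)

    # Candidate thresholds: every observed score, in ascending order.
    thresholds = np.sort(np.concatenate([target, impostor]))

    # False acceptance rate: impostor trials that are accepted.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    # False rejection rate: target trials that are rejected.
    frr = np.array([(target < t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


if __name__ == "__main__":
    # Toy scores (hypothetical): higher means "more likely the claimed speaker".
    eer, thr = equal_error_rate([2.0, 1.5, 0.5, 1.0], [0.0, -0.5, 0.7, -1.0])
    print(f"EER = {eer:.2f} at threshold {thr}")
```

With real systems the scores would typically be log-likelihood ratios from the verification back end, and a finer interpolation between thresholds may be preferred when the two error curves do not cross exactly at an observed score.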

J. Zhang et al.

To address the problem of large data requirements, research has led to Joint Factor Analysis (JFA), Support Vector Machine (SVM) and i-vector based technologies. The factor analysis subspace estimation and the i-vector method introduced in [8, 9] reduce the number of redundant model parameters in order to build more accurate speaker models. Some methods try to improve performance by selecting segments that are more discriminative of speaker characteristics. Other works on short utterance speaker recognition, such as [10], apply a dimension-decoupled GMM. Training and testing with 10 s of speech on variations of GMM and SVM have