A deep learning approach to integrate convolutional neural networks in speaker recognition
Soufiane Hourri¹ · Nikola S. Nikolov² · Jamal Kharroubi¹

Received: 6 February 2020 / Accepted: 12 May 2020

© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract

We propose a novel usage of convolutional neural networks (CNNs) for the problem of speaker recognition. While designed primarily for computer vision problems, CNNs have recently been applied to speaker recognition by using spectrograms as input images. We believe this approach is suboptimal, as it may compound two sources of error: one from solving a computer vision problem and one from solving a speaker recognition problem. In this work, we aim to integrate CNNs into speaker recognition without relying on images. We use Restricted Boltzmann Machines (RBMs) to extract speaker models as matrices and introduce a new way to model target and non-target speakers in order to perform speaker verification. A CNN is then used to discriminate between target and non-target matrices. Experiments were conducted on the THUYG-20 SRE corpus under three noise conditions: clean, 9 dB, and 0 dB. The results demonstrate that our method outperforms state-of-the-art approaches, decreasing the error rate by up to 60%.

Keywords Speaker recognition · MFCC · Convolutional neural network · Restricted Boltzmann Machine · Deep learning
1 Introduction

Nowadays, speaker verification is gaining considerable interest within the field of speaker recognition, due to the high demand for voice-access and security applications (Zhang et al. 2017; Hanilçi 2018). Speaker recognition can be divided into speaker identification and speaker verification. In speaker identification, the voice of a person to be identified is compared against a set of known speakers and classified as one of them. In speaker verification, the voice of a speaker is either accepted or rejected as the voice of a particular person. A speaker verification system may be either text-dependent or text-independent. A text-dependent system asks speakers to read either a fixed phrase or randomly prompted words, and it measures two similarities: first, between the spoken and the prompted words, and second, between the voice of the claimed person and that of the speaker. Text-independent systems, by contrast,
¹ Laboratoire des Systèmes Intelligents et Applications, Faculté des Sciences et Techniques, Université Sidi Mohamed Ben Abdellah, Fez, Morocco

² University of Limerick, Limerick, Ireland
are concerned with the characteristics of the speaker's voice only. In this work, we build a text-independent speaker verification system using the THUYG-20 SRE corpus (Rozi et al. 2015). We follow a process that begins with extracting features from the speech signal. In the context of speaker recognition, feature extraction aims to obtain a compact representation of the raw acoustic signal (Sadjadi and Hansen 2015) in the form of a sequence of feature vectors (Redd
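To illustrate this step, the sketch below computes MFCC-style feature vectors from a raw signal in plain NumPy: pre-emphasis, framing with a Hamming window, power spectrum, a triangular mel filterbank, and a type-II DCT. The frame length, hop size, filter count, and coefficient count are illustrative defaults, not the configuration used by the authors.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_mels=26, n_ceps=13):
    """Return a (num_frames, n_ceps) matrix of MFCC-style features."""
    # Pre-emphasis boosts the high-frequency content.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice into overlapping frames and apply a Hamming window.
    num_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # Type-II DCT decorrelates the log filterbank energies.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return log_energy @ dct.T

# Example: one second of synthetic audio -> a sequence of feature vectors.
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = mfcc(sig)
print(feats.shape)  # (98, 13)
```

Each row of the resulting matrix is one feature vector for one short frame of speech, which is the compact representation that subsequent modelling stages consume.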