Text-independent speaker recognition using LSTM-RNN and speech enhancement
Samia Abd El-Moneim 1 & M. A. Nassar 2 & Moawad I. Dessouky 2 & Nabil A. Ismail 3 & Adel S. El-Fishawy 2 & Fathi E. Abd El-Samie 2,4
Received: 22 November 2018 / Revised: 18 August 2019 / Accepted: 30 September 2019
© Springer Science+Business Media, LLC, part of Springer Nature 2019
Abstract
The revolution in speaker recognition has led to the inclusion of speaker recognition modules in several commercial products. Most published algorithms for speaker recognition focus on text-dependent speaker recognition. In contrast, text-independent speaker recognition is more advantageous, as the client can talk freely to the system. In this paper, text-independent speaker recognition is considered in the presence of degradation effects such as noise and reverberation. Mel-Frequency Cepstral Coefficients (MFCCs), spectrum and log-spectrum are used for feature extraction from the speech signals. These features are processed with the Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) as a classification tool to complete the speaker recognition task. The network learns to recognize the speakers efficiently in a text-independent manner when the recording circumstances are the same. The recognition rate reaches 95.33% using MFCCs, and it increases to 98.7% when using spectrum or log-spectrum. However, the system has difficulty recognizing speakers from different recording environments. Hence, different speech enhancement techniques, such as spectral subtraction and wavelet denoising, are used to improve the recognition performance to some extent. The proposed approach outperforms the algorithm of R. Togneri and D. Pullella (2011).
Keywords Speaker recognition · MFCCs · Spectrum · Log-spectrum · LSTM-RNN · Reverberation · Speech enhancement
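The spectrum and log-spectrum features and the spectral subtraction step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the frame length, the hop size, the Hamming window, the noise-only leading frames and the flooring factor `beta` are all assumptions chosen for illustration, and a naive DFT from the Python standard library stands in for an FFT.

```python
import cmath
import math


def frame_signal(signal, frame_len, hop):
    """Split a 1-D sequence into overlapping frames (no padding)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Hamming-window a frame and return the one-sided DFT magnitudes."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    # Naive O(n^2) DFT, kept for readability; an FFT would be used in practice.
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def log_spectrum(mag, floor=1e-10):
    """Natural log of the magnitudes, floored to avoid log(0)."""
    return [math.log(max(m, floor)) for m in mag]


def spectral_subtraction(mags, noise_frames=2, beta=0.01):
    """Subtract an average noise magnitude estimated from the first frames.

    Assumes the first `noise_frames` frames contain noise only. Each result
    is floored at `beta` times the noisy magnitude so it never goes negative.
    """
    bins = len(mags[0])
    noise = [sum(m[k] for m in mags[:noise_frames]) / noise_frames
             for k in range(bins)]
    return [[max(m[k] - noise[k], beta * m[k]) for k in range(bins)]
            for m in mags]


# Synthetic example (assumed values): a weak 3 kHz tone plays the role of
# noise for the whole signal, and a 1 kHz tone is added in the second half.
SAMPLE_RATE = 8000
noise = [0.05 * math.sin(2 * math.pi * 3000 * t / SAMPLE_RATE)
         for t in range(256)]
tone = [0.0] * 128 + [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
                      for t in range(128, 256)]
signal = [a + b for a, b in zip(noise, tone)]

frames = frame_signal(signal, frame_len=64, hop=32)
mags = [magnitude_spectrum(f) for f in frames]        # spectrum features
log_mags = [log_spectrum(m) for m in mags]            # log-spectrum features
enhanced = spectral_subtraction(mags, noise_frames=2)  # enhanced spectra
```

In a full system, per-frame vectors such as `mags` or `log_mags` would form the input sequence to the recurrent classifier, and `enhanced` would replace `mags` when the recording is noisy. Magnitude-domain subtraction with a fixed floor is only one common variant of spectral subtraction.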
Multimedia Tools and Applications
1 Introduction
Biometric recognition systems depend on different measurements or signals such as speech signals. The speech signal is an appealing biometric, because voice is a naturally produced signal. Moreover, there is no need for special signal transducers or networks to be used during access in telephone applications. Speaker recognition systems can be categorized based on speech content into two types: text-dependent and text-independent systems. In text-dependent systems, the speaker must say a specific phrase during both training and testing, while in text-independent systems, the system identifies the speaker from any spoken phrase regardless of the utterance content. Text-independent speaker recognition is the more challenging of the two types. Speaker recognition systems have two stages: training and testing. In the training stage, a model for each speaker is built from the features extracted from the speech, so that speakers can be discriminated [12]. Feature extraction is the most