Text-independent speaker recognition using LSTM-RNN and speech enhancement


Samia Abd El-Moneim 1 & M. A. Nassar 2 & Moawad I. Dessouky 2 & Nabil A. Ismail 3 & Adel S. El-Fishawy 2 & Fathi E. Abd El-Samie 2,4

Received: 22 November 2018 / Revised: 18 August 2019 / Accepted: 30 September 2019
© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract

The speaker recognition revolution has led to the inclusion of speaker recognition modules in several commercial products. Most published algorithms for speaker recognition focus on text-dependent speaker recognition. In contrast, text-independent speaker recognition is more advantageous, as the client can talk freely to the system. In this paper, text-independent speaker recognition is considered in the presence of degradation effects such as noise and reverberation. Mel-Frequency Cepstral Coefficients (MFCCs), spectrum and log-spectrum are used for feature extraction from the speech signals. These features are processed with a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) as a classification tool to complete the speaker recognition task. The network learns to recognize the speakers efficiently in a text-independent manner when the recording circumstances are the same. The recognition rate reaches 95.33% using MFCCs, and it increases to 98.7% when using spectrum or log-spectrum. However, the system has difficulty recognizing speakers recorded in different environments. Hence, different speech enhancement techniques, such as spectral subtraction and wavelet denoising, are used to improve the recognition performance to some extent. The proposed approach outperforms the algorithm of R. Togneri and D. Pullella (2011).

Keywords Speaker recognition · MFCCs · Spectrum · Log-spectrum · LSTM-RNN · Reverberation · Speech enhancement
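The abstract names spectrum and log-spectrum features and spectral subtraction as an enhancement step, but this excerpt does not give the frame length, window, or subtraction rule that the authors used. The following is therefore only a minimal sketch under stated assumptions: a Hamming window, a naive DFT, a noise estimate averaged over noise-only frames, and a simple spectral floor. All function names and parameter values are hypothetical and chosen for illustration.

```python
import cmath
import math
import random


def frame_signal(x, frame_len, hop):
    """Split a list of samples into overlapping frames (no padding)."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Hamming-windowed magnitude spectrum of one frame via a naive DFT.

    Returns the len(frame) // 2 + 1 non-negative-frequency bins.
    """
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    return [abs(sum(w * cmath.exp(-2j * math.pi * k * i / n)
                    for i, w in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def log_spectrum(mag, eps=1e-10):
    """Natural log of a magnitude spectrum; eps avoids log(0)."""
    return [math.log(m + eps) for m in mag]


def average_spectrum(frames):
    """Mean magnitude spectrum over a list of frames (used as noise estimate)."""
    spectra = [magnitude_spectrum(f) for f in frames]
    return [sum(col) / len(spectra) for col in zip(*spectra)]


def spectral_subtraction(mag, noise_mag, floor=0.01):
    """Subtract a noise magnitude estimate, keeping at least floor * mag per bin.

    The floor rule is an assumption of this sketch, not the paper's rule.
    """
    return [max(m - n, floor * m) for m, n in zip(mag, noise_mag)]


# Illustrative usage on a synthetic signal (all values are assumptions).
fs, frame_len, hop = 8000, 64, 32
rng = random.Random(0)
tone = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(256)]
noisy = [s + rng.uniform(-0.1, 0.1) for s in tone]
noise_only = [rng.uniform(-0.1, 0.1) for _ in range(256)]

noise_mag = average_spectrum(frame_signal(noise_only, frame_len, hop))
features = []
for frame in frame_signal(noisy, frame_len, hop):
    enhanced = spectral_subtraction(magnitude_spectrum(frame), noise_mag)
    features.append(log_spectrum(enhanced))  # one feature vector per frame
```

Each entry of `features` is one log-spectrum vector per frame, which is the kind of frame sequence a recurrent classifier such as an LSTM could consume. A practical system would use an FFT routine rather than the quadratic-time DFT shown here.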

* Samia Abd El-Moneim [email protected]
Extended author information available on the last page of the article

Multimedia Tools and Applications

1 Introduction

Biometric recognition systems depend on different measurements or signals, such as speech signals. The speech signal is an appealing biometric, because voice is a naturally produced signal. Moreover, there is no need for special signal transducers or networks to be used during access in telephone applications.

Speaker recognition systems can be categorized, based on speech content, into two types: text-dependent and text-independent systems. In text-dependent systems, the speaker must say a specific phrase during both training and testing, while in text-independent systems, the system identifies the speaker from any spoken phrase, regardless of the utterance content. Text-independent speaker recognition is by far the more challenging of the two types.

Speaker recognition systems have two stages: training and testing. In the training stage, a model for each speaker is built from the features extracted from that speaker's speech, so that the system can discriminate between speakers [12]. Feature extraction is the most
