Enhancement in speaker recognition for optimized speech features using GMM, SVM and 1-D CNN


Sumita Nainan1 · Vaishali Kulkarni1

Received: 12 March 2020 / Accepted: 29 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
Contemporary automatic speaker recognition (ASR) systems do not achieve 100% accuracy, making it imperative to explore different techniques to improve them. Easy access to mobile devices and advances in sensor technology have made voice a preferred parameter for biometrics. Here, a comparative analysis is presented of the accuracies obtained in ASR using three classifiers: the classical Gaussian mixture model (GMM), the support vector machine (SVM), a machine learning algorithm, and the state-of-the-art 1-D CNN. The authors propose considering dynamic voice features along with static features, since the relevant speaker information they carry leads to a substantial improvement in ASR accuracy. Because concatenating features introduces redundancy and increases computational complexity, the Fisher score algorithm was employed to select the best contributing features, which improved accuracy. The results indicate that the SVM and the 1-D CNN outperform the GMM. The SVM and the 1-D CNN gave comparable results, with the 1-D CNN giving an improved accuracy of 94.77% in ASR.

Keywords  ASR · 1-D CNN · SVM · GMM · Fisher score
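The feature selection step named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definition of the Fisher score (between-class scatter of a feature divided by its within-class scatter, with each speaker treated as a class); the exact formulation, feature set and number of retained features used by the authors are not stated here, so the function names `fisher_scores` and `select_top_k` and the toy data are hypothetical.

```python
from collections import defaultdict


def fisher_scores(features, labels):
    """Return one Fisher score per feature column.

    features: list of equal-length rows (one row per utterance/frame).
    labels:   speaker label for each row.
    Assumed definition: sum_c n_c * (mean_c - mean)^2 / sum_c n_c * var_c.
    """
    n_features = len(features[0])
    n_total = len(features)

    # Group rows by speaker label.
    by_class = defaultdict(list)
    for row, label in zip(features, labels):
        by_class[label].append(row)

    scores = []
    for j in range(n_features):
        overall_mean = sum(row[j] for row in features) / n_total
        between = 0.0
        within = 0.0
        for rows in by_class.values():
            n_c = len(rows)
            mean_c = sum(r[j] for r in rows) / n_c
            var_c = sum((r[j] - mean_c) ** 2 for r in rows) / n_c
            between += n_c * (mean_c - overall_mean) ** 2
            within += n_c * var_c
        # Small constant guards against division by zero.
        scores.append(between / (within + 1e-12))
    return scores


def select_top_k(features, labels, k):
    """Keep the k feature columns with the highest Fisher score."""
    scores = fisher_scores(features, labels)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])  # preserve original column order
    reduced = [[row[j] for j in keep] for row in features]
    return keep, reduced


# Toy data (hypothetical): column 0 separates the two speakers, column 1 does not.
features = [
    [1.0, 5.0], [1.2, 3.0], [0.8, 4.0],   # speaker A
    [3.0, 4.5], [3.2, 3.5], [2.8, 4.0],   # speaker B
]
labels = ["A", "A", "A", "B", "B", "B"]

scores = fisher_scores(features, labels)
keep, reduced = select_top_k(features, labels, k=1)
```

In this toy case the discriminative column receives a much higher score than the uninformative one, so it is the column that survives selection. In a concatenated static-plus-dynamic feature vector, the same ranking would be applied across all columns before training the classifier.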

* Sumita Nainan · Vaishali Kulkarni
1 SVKM’s NMIMS Deemed To Be University, Mumbai, India

1 Introduction
Automatic speaker recognition (ASR) and verification are finding increasing application in forensics and surveillance, as well as in critical fields such as banking services, security services, online shopping and social media networking (Gawande and Golhar 2018). With easy access to state-of-the-art mobile phones, advanced sensors that simplify the acquisition of human parameters, and progress in digital imaging and sensing platforms, biometrics remains an area of interest and ongoing research. The challenge is to design a dynamic, real-time person recognition and authentication system that correctly identifies and authenticates a person in minimum time while matching human performance in speed and accuracy. Speech signals are commonly used for human-machine interfaces (Minotto et al. 2014). The voice biometric trait was chosen for this experimentation because it is non-invasive and the simplest to acquire. ASR systems can be categorized as text-dependent or text-independent. A text-independent system is more challenging because the speaker's utterances are unrestricted; the system must therefore be robust enough to distinguish speaker from non-speaker content and to account for environmental noise (Salehghffari 2018). For real-time ASR, speech enhancement and noise suppression algorithms become a prerequisite for processing the audio signal for clarity. Speech signal quality can be