Biomimetic multi-resolution analysis for robust speaker recognition



RESEARCH | Open Access

Sridhar Krishna Nemala1, Dmitry N Zotkin2, Ramani Duraiswami2 and Mounya Elhilali1*

Abstract

Humans exhibit a remarkable ability to reliably classify sound sources in the environment, even in the presence of high levels of noise. In contrast, most engineering systems suffer a drastic drop in performance when speech signals are corrupted with channel or background distortions. Our brains are equipped with elaborate machinery for speech analysis and feature extraction, which holds great lessons for improving the performance of automatic speech processing systems under adverse conditions. The work presented here explores a biologically motivated multi-resolution speaker information representation, obtained by performing an intricate yet computationally efficient analysis of the information-rich spectro-temporal attributes of the speech signal. We evaluate the proposed features in a speaker verification task performed on NIST SRE 2010 data. The biomimetic approach yields significant robustness in the presence of non-stationary noise and reverberation, offering a new framework for deriving reliable features for speaker recognition and speech processing.

Introduction

In addition to the intended message, the human voice carries the unique imprint of a speaker. Just like fingerprints and faces, voice prints are biometric markers with tremendous potential for forensic, military, and commercial applications [1]. However, despite enormous advances in computing technology over the last few decades, automatic speaker verification (ASV) systems still rely heavily on training data collected in controlled environments, and most systems face a rapid degradation in performance when operating under previously unseen conditions (e.g., channel mismatch, environmental noise, or reverberation). In contrast, human perception of speech and the ability to identify sound sources (including voices) remain quite remarkable even at relatively high distortion levels [2].
*Correspondence: [email protected]. 1 Department of Electrical and Computer Engineering, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA. Full list of author information is available at the end of the article.

Consequently, the pursuit of human-like recognition capabilities has spurred great interest in understanding how humans perceive and process speech signals. One of the intriguing processes taking place in the central auditory system involves ensembles of neurons with variable tuning to the spectral profiles of acoustic signals. In addition to the frequency (tonotopic) organization that emerges as early as the cochlea, neurons in the central auditory system (specifically in the midbrain and, more prominently, in the auditory cortex) exhibit tuning to a variety of filter bandwidths and shapes [3]. This elegant neural architecture provides a detailed multi-resolution analysis of the spectral sound profile, which is presumably relevant to speech and speaker recognition. Only a few studies so far
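The multi-resolution analysis described above can be illustrated with a minimal sketch: a single (log-)spectral profile is filtered by smoothing kernels of several bandwidths, loosely mimicking cortical neurons tuned to different spectral resolutions. This is not the paper's actual feature-extraction pipeline; the Gaussian kernels, bandwidth values, and toy spectral profile below are purely illustrative assumptions.

```python
import numpy as np

def multires_spectral_analysis(spectral_profile, bandwidths=(2, 4, 8, 16)):
    """Filter a spectral profile with Gaussian kernels of several widths.

    Each bandwidth (in frequency channels; values here are illustrative)
    yields one filtered view of the profile: narrow kernels preserve fine
    spectral detail, wide kernels capture only the coarse envelope.
    Returns an array with one row per resolution.
    """
    outputs = []
    for bw in bandwidths:
        # Truncate the Gaussian kernel at +/- 3 standard deviations
        half = int(3 * bw)
        x = np.arange(-half, half + 1)
        kernel = np.exp(-0.5 * (x / bw) ** 2)
        kernel /= kernel.sum()  # unit-gain smoothing
        outputs.append(np.convolve(spectral_profile, kernel, mode="same"))
    return np.array(outputs)

# Toy spectral profile over 128 channels: one narrow and one broad peak
freq = np.arange(128)
profile = (np.exp(-0.5 * ((freq - 30) / 2) ** 2)
           + np.exp(-0.5 * ((freq - 90) / 12) ** 2))
analysis = multires_spectral_analysis(profile)
print(analysis.shape)  # -> (4, 128): one filtered profile per bandwidth
```

The finest-resolution output retains the sharp peak almost unchanged, while the coarsest output flattens it, leaving mostly the broad spectral envelope; a bank of such filters thus encodes the same profile at multiple spectral scales, in the spirit of the cortical tuning diversity discussed above.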