An investigation towards speaker identification using a single-sound-frame


Seyed Reza Shahamiri 1 & Fadi Thabtah 2

Received: 14 May 2019 / Revised: 27 June 2020 / Accepted: 11 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Traditional neural network-based speaker identification (SI) studies represent a speaker's voice biometrics with a combination of acoustic features extracted from sequential sounds: several sound segments before and after the current segment are stacked and fed to the network. Although this method is particularly important for speech recognition tasks, where words are constructed from sequential sound segments and successful recognition depends on the preceding phonetic sequences, SI systems should be able to operate on the distinctive speaker features available in an individual sound segment and identify the speaker regardless of the previously uttered sounds. This paper investigates this hypothesis by proposing a novel text-independent SI model trained at the sound level. The investigation first studied the most distinguishable configuration of coefficients in a single acoustic segment, then identified the best frame length to overlapping ratio, and finally measured the reliability of conducting SI using only a single sound segment. Overall, more than one hundred SI systems were trained and evaluated. The results indicate that performing SI using a single acoustic sound frame decreases the complexity of SI and facilitates it, since the classifier needs to learn fewer acoustic features compared to traditional stacking-based approaches.

Keywords Automatic speaker identification · Feature extraction · MFCC · Deep neural networks
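The difference between the stacked and single-frame inputs described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 13 coefficients per frame, the context width of two frames per side, and the edge-repetition padding are all assumptions chosen only to show how the input dimensionality changes.

```python
from typing import List

Frame = List[float]  # one acoustic feature vector (e.g. MFCCs) for one sound frame


def stack_context(frames: List[Frame], index: int, context: int) -> Frame:
    """Concatenate the frame at `index` with `context` frames on each side.

    Edge frames are repeated when the window runs past the utterance
    boundary (an assumed padding choice, not taken from the paper).
    """
    last = len(frames) - 1
    stacked: Frame = []
    for offset in range(-context, context + 1):
        neighbour = min(max(index + offset, 0), last)  # clamp to a valid index
        stacked.extend(frames[neighbour])
    return stacked


def single_frame(frames: List[Frame], index: int) -> Frame:
    """Return only the current frame: the single-sound-frame input."""
    return list(frames[index])


# Toy utterance: 6 frames of 13 coefficients each (placeholder values).
utterance = [[float(t)] * 13 for t in range(6)]

stacked_input = stack_context(utterance, index=2, context=2)
single_input = single_frame(utterance, index=2)

print(len(stacked_input))  # 65 values: (2 * 2 + 1) frames x 13 coefficients
print(len(single_input))   # 13 values: one frame only
```

Under these assumed settings the stacked input is five times larger than the single-frame input, which is the sense in which a single-frame classifier has fewer acoustic features to learn.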

* Seyed Reza Shahamiri [email protected] Fadi Thabtah [email protected]

1 Department of Electrical, Computer, and Software Engineering, Faculty of Engineering, University of Auckland, Auckland, New Zealand

2 School of Digital Technologies, Manukau Institute of Technology, Auckland, New Zealand

Multimedia Tools and Applications

1 Introduction

Using Speaker Recognition (SR) technologies to identify the speaker of a given utterance by comparing voice biometrics is known as automatic Speaker Identification (SI) [10]. In particular, it is the process of comparing one user's voice profile against many profiles and finding the best or exact match [2]. SI systems are used mainly to automate processes such as directing clients' mail to the right mailbox, recognizing talkers in a discussion, alerting speech recognition systems to speaker changes, and checking whether a client is already enrolled in the system. These SI systems may work without prior knowledge of the client's voice sample, since they rely only on identifying the input speaker from the existing database of speakers [12]. In general, an SI system goes through two primary phases: a training or enrollment phase, and a matching ph