Acoustic Variability of Voice Signal as Factor of Information Security for Automatic Speech Recognition Systems with Tuning to User Voice

V. V. Savchenko*
Nizhny Novgorod State Linguistic University, Nizhny Novgorod, Russia
*ORCID: 0000-0003-3045-3337
Received March 5, 2020; revised July 14, 2020; accepted October 13, 2020

Abstract: The phenomenon of acoustic variability of the voice signal in automatic speech recognition systems is considered. Two varieties are distinguished: intra-speaker and inter-speaker speech variability. A probabilistic cluster model of minimal speech units in the Kullback–Leibler information metric is used to describe them mathematically and to compare their magnitudes. On this basis, theoretical estimates of the acoustic variability of the voice signal are obtained separately for each variety. The information security effect in systems tuned to the voice of an authorized user is described and characterized quantitatively. Intra-speaker variability is negligible compared with inter-speaker variability and therefore has no noticeable harmful effect on the effectiveness of automatic speech recognition. A computational experiment with two speech streams from two different speakers, implemented with the author's software, was set up to confirm and extend the theoretical results. The experimental results show that in a number of cases the level of inter-speaker variability exceeds the inter-phonemic differences within a homogeneous speech flow. Therefore, in systems tuned to the speaker's voice, the effect of acoustic variability is not only unambiguously positive, in that it protects information from unauthorized access, but also significant in probability-theoretic terms. The results are intended for the development of new automatic speech recognition systems, and the modernization of existing ones, designed to operate in standalone mode.

DOI: 10.3103/S0735272720100039
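To make the comparison in the Kullback–Leibler metric concrete, the following is a minimal sketch, not the author's software or model. It assumes, purely for illustration, that each minimal speech unit is described by a multivariate Gaussian distribution over some feature vector. The function name `gaussian_kl`, the two-dimensional feature space, and all numerical values are hypothetical and chosen only to show how a small intra-speaker divergence can be set against a larger inter-speaker one.

```python
import numpy as np


def gaussian_kl(mean_p, cov_p, mean_q, cov_q):
    """Kullback-Leibler divergence D(P || Q) of two Gaussians, in nats."""
    mean_p = np.atleast_1d(np.asarray(mean_p, dtype=float))
    mean_q = np.atleast_1d(np.asarray(mean_q, dtype=float))
    cov_p = np.atleast_2d(np.asarray(cov_p, dtype=float))
    cov_q = np.atleast_2d(np.asarray(cov_q, dtype=float))
    dim = mean_p.shape[0]
    diff = mean_q - mean_p
    cov_q_inv = np.linalg.inv(cov_q)
    # slogdet is numerically safer than log(det(...)).
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (
        np.trace(cov_q_inv @ cov_p)
        + diff @ cov_q_inv @ diff
        - dim
        + logdet_q
        - logdet_p
    )


# Hypothetical 2-D models of one phoneme (illustrative numbers, not measured data).
reference = (np.zeros(2), np.eye(2))                          # authorized user's model
same_speaker = (np.array([0.1, 0.0]), 1.1 * np.eye(2))        # another utterance, same user
other_speaker = (np.array([2.0, -1.0]), np.diag([3.0, 0.5]))  # a different speaker

# Divergence of each observed model from the authorized user's reference.
intra = gaussian_kl(*same_speaker, *reference)
inter = gaussian_kl(*other_speaker, *reference)
print(f"intra-speaker divergence: {intra:.3f} nats")
print(f"inter-speaker divergence: {inter:.3f} nats")
```

With these made-up parameters the inter-speaker divergence comes out much larger than the intra-speaker one, which mirrors the qualitative claim of the abstract. The actual magnitudes depend entirely on the model and data used in the paper.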

1. INTRODUCTION

For several decades, automatic speech recognition (ASR) has been one of the most dynamically developing areas in the field of information systems and technologies [1], [2]. The greatest progress has been made in recent years, with the creation of a qualitatively new mathematical apparatus for ASR: multilayer artificial neural network (ANN) models tuned by deep learning [2], [3] on large (10–100 GB) speech databases (SDBs). This direction of ASR has produced the most impressive results from a commercial point of view [4], as clearly confirmed by well-known software products such as Apple Siri, Google Voice Search, and Microsoft Cortana [5], [6]. However, along with the technical breakthrough and commercial success of speech technologies, a number of acute problems have arisen in this area, creating serious difficulties for the