Guaranteed Significance Level Criterion in Automatic Speech Signal Segmentation


THEORY AND METHODS OF SIGNAL PROCESSING

V. V. Savchenko^a,* and A. V. Savchenko^b,**

^a Editorial Board of the Journal Radiotekhnika i Elektronika, Moscow, 125009 Russia
^b HSE University, Laboratory of Algorithms and Technologies for Networks Analysis, Nizhny Novgorod, 603155 Russia
*e-mail: [email protected]
**e-mail: [email protected]

Received February 14, 2019; revised February 7, 2020; accepted April 20, 2020

Abstract: The article considers the problem of automatically segmenting a speech signal into phonetic units under a priori uncertainty about their spectral composition and correlation properties. A guaranteed significance level criterion is developed based on the information-theoretic approach. An example of practical application of this criterion is considered; a full-scale experiment is set up and conducted. It is shown that the proposed criterion can guarantee a stable significance level when processing speech frames of short duration.

DOI: 10.1134/S1064226920110157

INTRODUCTION

In automatic speech recognition (ASR), signal segmentation is traditionally understood as a phonemic [1, 2] or phonological [3] kind of ASR whose purpose is to divide the speech stream into a sequence of minimal (not further divisible) speech units such as phonemes and their allophones. It is an important component of speech signal processing in systems of various purposes [4–6], from voice user interfaces and speaker identification to speech analytics and biometrics. However, specialists often underestimate it. The reason lies in the very concept of phonological segmentation, which precedes the stage of recognition (paradigmatic identification [3]) of isolated signal segments within “deferred” speech segmentation [7]. For example, the authors of [8, 9] use the simplest method of phonological segmentation: dividing the speech signal into speech frames (signal segments) of the shortest possible duration τ = 10–20 ms, which is consistent with the pitch period of a typical speaker’s oral speech [9]. However, this approach suffers from an acute problem of small observation samples [10], which aggravates the multiple comparisons problem [11]. The disconcerting conclusion [7, 12] is that this task, as applied to continuous Russian speech with a large vocabulary, has not yet been solved effectively enough, or at all.

Meanwhile, as shown in [13, 14] on a number of practical examples, segmenting the speech signal and combining homogeneous frames into monophonemic speech segments makes it possible to largely overcome the small sample problem and, in turn, the multiple comparisons problem in ASR tasks. Therefore, it can be argued [15–17] that full-scale phonological segmentation is currently the most promising method of increasing ASR efficiency at the stage of primary processing of speech signals [7]. For that, the choice of segmentation criteria is of utmost importance [3]. Therefore, the s