Non-intrusive speech quality prediction based on the blind estimation of clean speech and the i-vector framework

RESEARCH ARTICLE

Anderson R. Avila¹ · Douglas O’Shaughnessy¹ · Tiago H. Falk¹
Received: 19 December 2019
© Springer Nature Switzerland AG 2020

Abstract
Output-based instrumental speech quality assessment relies only on the received (processed) signal to predict quality. Such methods are called non-intrusive and are crucial in speech applications where reference clean signals are not accessible. In this paper, we propose a new non-intrusive instrumental quality measure based on the similarity between two i-vectors. As the reference clean signal is not available, the reference i-vector representation cannot be extracted directly from it. Therefore, we propose the use of a clean speech Gaussian mixture model to estimate the clean speech spectrum from its degraded counterpart. Next, the two corresponding i-vector representations are extracted, and either the cosine or the Euclidean similarity metric is computed as a correlate of speech quality. Here, the clean speech model is trained using RASTA-filtered mel-frequency cepstral coefficients extracted from a pool of clean speech files, thus allowing us to attain a model of clean spectrum characteristics. The proposed method is evaluated on noisy, reverberant, and enhanced speech conditions. Experimental results show that the proposed system provides higher correlations with perceptual speech quality than several benchmark non-intrusive measures, especially for noisy and enhanced speech.

Keywords  Speech quality assessment · Instrumental quality measurement · I-vector · Speech enhancement
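As an illustration of the final comparison step named in the abstract, the following minimal Python sketch computes the cosine similarity and the Euclidean distance between two fixed-length vectors. It is not the authors' implementation: the function names and the four-dimensional example vectors are hypothetical stand-ins for the i-vectors of the degraded signal and of the estimated clean signal, and the way a Euclidean distance would be mapped to a similarity or quality score is not specified in this section, so it is left out.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    # Hypothetical low-dimensional vectors standing in for real i-vectors,
    # which would normally have a few hundred dimensions.
    reference_ivector = [0.5, -1.0, 0.25, 2.0]   # from the estimated clean speech
    degraded_ivector = [0.4, -0.8, 0.5, 1.5]     # from the received signal

    print("cosine similarity:", cosine_similarity(reference_ivector, degraded_ivector))
    print("Euclidean distance:", euclidean_distance(reference_ivector, degraded_ivector))
```

Under this reading, a cosine similarity closer to 1 (or a smaller Euclidean distance) would indicate that the degraded signal's representation lies closer to that of the estimated clean reference.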

¹ Institut national de la recherche scientifique, 800, rue de la Gauchetière Ouest, Montréal (Quebec) H5A 1K6, Canada

Introduction
Instrumental measurement of speech quality is an area of growing interest. Given the increasing number of voice services available (e.g., hands-free communication), quality assessment of speech signals is becoming essential to guaranteeing user satisfaction [1]. In speech communication applications, from its acquisition to its transmission and distribution, the speech signal is likely to have its perceptual quality compromised by many factors. For example, the signal may be affected by background noise and reverberation present in an enclosed environment [2]. In such scenarios, enhancement algorithms, such as noise suppression and dereverberation, may enhance the speech signal but at the same time introduce unwanted distortions. Moreover, errors in acquisition and storage, bandwidth constraints, and low-quality transmission channels could also compromise the perceived quality of the speech signal [3]. As pointed out in [4], the user experience is significantly related to the quality of service (QoS). Hence, maintaining reliable service provision to the end-user, assuring his/her best quality of experience (QoE), plays an i