Multi-channel spectrograms for speech processing applications using deep learning methods



ORIGINAL ARTICLE

T. Arias‑Vergara1,2,3 · P. Klumpp2 · J. C. Vasquez‑Correa1,2 · E. Nöth2 · J. R. Orozco‑Arroyave1,2 · M. Schuster3

Received: 7 February 2020 / Accepted: 14 September 2020
© The Author(s) 2020

Abstract

Time–frequency representations of speech signals provide dynamic information about how the frequency components change with time. In order to process this information, deep learning models with convolution layers can be used to obtain feature maps. In many speech processing applications, the time–frequency representations are obtained by applying the short-time Fourier transform, and single-channel input tensors are used to feed the models. However, this may limit the potential of convolutional networks to learn different representations of the audio signal. In this paper, we propose a methodology to combine three different time–frequency representations of the signals, computed with the continuous wavelet transform, Mel spectrograms, and Gammatone spectrograms, into 3D-channel spectrograms to analyze speech in two different applications: (1) automatic detection of speech deficits in cochlear implant users and (2) phoneme class recognition to extract phone-attribute features. For this, two different deep learning-based models are considered: convolutional neural networks and recurrent neural networks with convolution layers.

Keywords: Speech processing · Multi-channel spectrograms · Cochlear implants · Phoneme recognition
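As a rough illustration of the kind of input described above, the sketch below assembles a 3-channel spectrogram from the continuous wavelet transform, a Mel spectrogram, and a Gammatone spectrogram. It is a minimal sketch, not the authors' exact pipeline: it assumes librosa and PyWavelets are available, uses the third-party gammatone package (its gtgram function) for the Gammatone channel, and the scale, band, and frame settings are illustrative placeholders rather than values taken from the paper.

import numpy as np
import librosa
import pywt
from scipy.ndimage import zoom
from gammatone import gtgram as gt

def cwt_spectrogram(y, sr, n_scales=64):
    # Continuous wavelet transform with a Morlet mother wavelet.
    scales = np.arange(1, n_scales + 1)
    coefs, _ = pywt.cwt(y, scales, 'morl', sampling_period=1.0 / sr)
    return np.abs(coefs)                       # shape: (n_scales, n_samples)

def mel_spectrogram(y, sr, n_mels=64):
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(S)              # shape: (n_mels, n_frames)

def gammatone_spectrogram(y, sr, n_bands=64):
    # Third-party gammatone filter bank; arguments are
    # (wave, fs, window_time, hop_time, channels, f_min).
    return gt.gtgram(y, sr, 0.025, 0.010, n_bands, 50)

def three_channel_spectrogram(y, sr, shape=(64, 128)):
    channels = []
    for rep in (cwt_spectrogram(y, sr),
                mel_spectrogram(y, sr),
                gammatone_spectrogram(y, sr)):
        # Resample each representation to a common (frequency, time) grid and
        # normalize it, so the three maps can be stacked like RGB channels.
        rep = zoom(rep, (shape[0] / rep.shape[0], shape[1] / rep.shape[1]))
        rep = (rep - rep.mean()) / (rep.std() + 1e-8)
        channels.append(rep)
    return np.stack(channels, axis=-1)         # shape: (freq, time, 3)

y = librosa.tone(440, sr=16000, duration=1.0)  # toy signal for demonstration
x = three_channel_spectrogram(y, sr=16000)
print(x.shape)                                 # (64, 128, 3)

The resizing and per-channel normalization step is just one simple way to reconcile the different time-frequency resolutions of the three transforms before stacking them into a single input tensor.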

Authors must disclose all relationships or interests that could have direct or potential influence or impart bias on the work.

* Corresponding author: T. Arias‑Vergara, [email protected]

1 Faculty of Engineering, Universidad de Antioquia UdeA, Calle 70 No. 52‑21, Medellín, Colombia
2 Pattern Recognition Lab, Friedrich-Alexander University, Erlangen‑Nürnberg, Germany
3 Department of Otorhinolaryngology, Head and Neck Surgery, Ludwig-Maximilians University, Munich, Germany

1 Introduction

In speech and audio processing applications, the data are commonly processed by computing compressed representations that may not capture the dynamic information of the signals. In recent years, there has been an increasing number of works considering deep learning methods for speech and audio analysis, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), among others [1]. Particularly for CNNs, audio data are processed by feeding the convolution layers with time–frequency representations (spectrograms) of the signals, providing information about how the energy distribution in the frequency domain changes with time. After the convolution operation, the resulting feature maps contain low- and high-level features representing the acoustic information of the signals. Many works have shown the advantages of using CNNs and spectrograms in different speech processing applications, such as automatic detection of disordered speech [2–4] and acoustic models for automatic speech recognition
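To make the preceding description concrete, the following minimal PyTorch sketch (an illustration under assumed tensor sizes, not the architecture evaluated in this work) passes a batch of 3-channel spectrograms through a single convolution block to obtain feature maps.

import torch
import torch.nn as nn

# Batch of 8 spectrogram "images": 3 channels (e.g., CWT, Mel, Gammatone),
# 64 frequency bins, 128 time frames.
x = torch.randn(8, 3, 64, 128)

conv_block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)

feature_maps = conv_block(x)    # local time-frequency feature maps
print(feature_maps.shape)       # torch.Size([8, 16, 32, 64])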