Speech and music pitch trajectory classification using recurrent neural networks for monaural speech segregation



Han-Gyu Kim · Gil-Jin Jang · Yung-Hwan Oh · Ho-Jin Choi

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract
In this paper, we propose a speech/music pitch classification method based on recurrent neural networks (RNNs) for monaural speech segregation from music interferences. The speech segregation methods in this paper exploit sub-band masking to construct segregation masks modulated by the estimated speech pitch. However, for speech signals mixed with music, speech pitch estimation becomes unreliable because speech and music have similar harmonic structures. In order to remove the music interference effectively, we propose an RNN-based speech/music pitch classification method. Our proposed method models the temporal trajectories of speech and music pitch values and determines whether an unknown continuous pitch sequence belongs to speech or music. Among the various types of RNNs, we chose the simple recurrent network, long short-term memory (LSTM), and bidirectional LSTM (BLSTM) for pitch classification. The experimental results show that our proposed method significantly outperforms the baseline methods for speech-music mixtures without loss of segregation performance for speech-noise mixtures.

Keywords: Speech segregation · Speech pitch estimation · Pitch classification · Recurrent neural network · Long short-term memory · Bidirectional long short-term memory
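To make the classification idea in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of a bidirectional LSTM that labels a continuous pitch trajectory as speech or music. The class name PitchTrajectoryClassifier, the choice of one pitch value per frame as input, and the mean-pooling over time are illustrative assumptions; it assumes PyTorch is available.

```python
# Minimal sketch, assuming PyTorch: a BLSTM that classifies a pitch
# trajectory (one pitch value per time frame) as speech or music.
import torch
import torch.nn as nn

class PitchTrajectoryClassifier(nn.Module):  # hypothetical name
    def __init__(self, hidden_size=64):
        super().__init__()
        # Input: one pitch value (e.g. in Hz or log-Hz) per frame.
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                            batch_first=True, bidirectional=True)
        # Two output classes: speech pitch vs. music pitch.
        self.out = nn.Linear(2 * hidden_size, 2)

    def forward(self, pitch_seq):
        # pitch_seq: (batch, time, 1) tensor of pitch trajectories.
        h, _ = self.lstm(pitch_seq)
        # Pool over time so trajectories of any length give one decision.
        return self.out(h.mean(dim=1))

model = PitchTrajectoryClassifier()
dummy = torch.randn(4, 100, 1)   # 4 trajectories, 100 frames each
logits = model(dummy)            # (4, 2) speech/music scores
```

A unidirectional LSTM or simple recurrent network variant, as compared in the paper, would only require swapping the recurrent layer; the rest of the sketch stays the same.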

This research was supported by Korea Electric Power Corporation (Grant no. R18XA05) and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (Grant no. NRF-2017M3C1B6071400).

* Corresponding author: Ho-Jin Choi, [email protected]

Extended author information is available on the last page of the article.


1 Introduction

The human ability to pay selective attention to a specific acoustic object is enabled by segregating the target object from unwanted noises with the help of auditory cues in the time-frequency domain. Source segregation methods implemented by machines try to mimic this human auditory ability as closely as possible, with applications in multiple fields such as speech recognition, audio-text alignment, and automatic music transcription [2, 19, 23].

Many researchers have made tremendous progress in developing effective monaural source segregation systems, in which a mixture of audio signals from various sources is recorded by a single microphone. Nonnegative matrix factorization (NMF) [15], which utilizes the redundancy of sound sources, is a successful method for sound source segregation [19, 22, 23]. Another class of speech segregation methods is based on masking in the spectro-temporal domain, where the segregation masks are obtained using factorial hidden Markov models [20] or independent component analysis [13]. Deep clustering, which adopts artificial neural networks, has been proposed for speech separation [9]. Deep clustering shows good performance in separating sources.
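As background for the NMF-based segregation cited above, the following is a minimal sketch, assuming NumPy only, of NMF with multiplicative updates applied to a non-negative magnitude spectrogram V ≈ WH. In practice the columns of W could be pre-trained basis spectra for speech and for the interfering source; here both factors are simply learned from a stand-in matrix, and the function name nmf and all parameter values are illustrative assumptions rather than the methods of [15, 19, 22, 23].

```python
# Minimal NMF sketch with multiplicative updates for the Frobenius cost.
import numpy as np

def nmf(V, rank=20, n_iter=200, eps=1e-9):
    """Factorize V (freq x time) into W (freq x rank) and H (rank x time)."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.randn(257, 400))  # stand-in magnitude spectrogram
W, H = nmf(V)
V_hat = W @ H                          # low-rank approximation of the mixture
```

Segregation would then reconstruct each source from the subset of basis vectors (columns of W) associated with it, typically via a soft spectral mask; the sketch stops at the factorization itself.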