Speech and music pitch trajectory classification using recurrent neural networks for monaural speech segregation



Han-Gyu Kim · Gil-Jin Jang · Yung-Hwan Oh · Ho-Jin Choi

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract
In this paper, we propose a speech/music pitch classification method based on recurrent neural networks (RNNs) for monaural speech segregation from music interferences. The speech segregation methods in this paper exploit sub-band masking to construct segregation masks modulated by the estimated speech pitch. However, for speech signals mixed with music, speech pitch estimation becomes unreliable because speech and music have similar harmonic structures. In order to remove the music interference effectively, we propose an RNN-based speech/music pitch classification method. Our proposed method models the temporal trajectories of speech and music pitch values and determines whether an unknown continuous pitch sequence belongs to speech or music. Among the various types of RNNs, we chose the simple recurrent network, long short-term memory (LSTM), and bidirectional LSTM (BLSTM) for pitch classification. The experimental results show that our proposed method significantly outperforms the baseline methods for speech-music mixtures without loss of segregation performance for speech-noise mixtures.

Keywords: Speech segregation · Speech pitch estimation · Pitch classification · Recurrent neural network · Long short-term memory · Bidirectional long short-term memory
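To make the classification idea in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of a bidirectional LSTM that labels a continuous pitch trajectory as speech or music. The class name PitchTrajectoryClassifier, the choice of one pitch value per frame as input, and the mean-pooling over time are illustrative assumptions; it assumes PyTorch is available.

```python
# Minimal sketch, assuming PyTorch: a BLSTM that classifies a pitch
# trajectory (one pitch value per time frame) as speech or music.
import torch
import torch.nn as nn

class PitchTrajectoryClassifier(nn.Module):  # hypothetical name
    def __init__(self, hidden_size=64):
        super().__init__()
        # Input: one pitch value (e.g. in Hz or log-Hz) per frame.
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                            batch_first=True, bidirectional=True)
        # Two output classes: speech pitch vs. music pitch.
        self.out = nn.Linear(2 * hidden_size, 2)

    def forward(self, pitch_seq):
        # pitch_seq: (batch, time, 1) tensor of pitch trajectories.
        h, _ = self.lstm(pitch_seq)
        # Pool over time so trajectories of any length give one decision.
        return self.out(h.mean(dim=1))

model = PitchTrajectoryClassifier()
dummy = torch.randn(4, 100, 1)   # 4 trajectories, 100 frames each
logits = model(dummy)            # (4, 2) speech/music scores
```

A unidirectional LSTM or simple recurrent network variant, as compared in the paper, would only require swapping the recurrent layer; the rest of the sketch stays the same.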

This research was supported by Korea Electric Power Corporation (Grant no. R18XA05) and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (Grant no. NRF-2017M3C1B6071400).

* Corresponding author: Ho-Jin Choi, [email protected]

Extended author information is available on the last page of the article.


1 Introduction

The human ability to pay selective attention to a specific acoustic object is enabled by segregating the target object from unwanted noises with the help of auditory cues in the time-frequency domain. Source segregation methods implemented by machines try to mimic this human auditory ability as closely as possible, with applications in multiple fields such as speech recognition, audio-text alignment, and automatic music transcription [2, 19, 23].

Many researchers have made tremendous progress in developing effective monaural source segregation systems, in which a mixture of audio signals from various sources is recorded by a single microphone. Nonnegative matrix factorization (NMF) [15], which utilizes the redundancy of sound sources, is a successful method for sound source segregation [19, 22, 23]. Another class of speech segregation methods is based on masking in the spectro-temporal domain, where the segregation masks are obtained using factorial hidden Markov models [20] or independent component analysis [13]. Deep clustering, which adopts artificial neural networks, has been proposed for speech separation [9]. Deep clustering shows good performance in separating sources.
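As background for the NMF-based segregation cited above, the following is a minimal sketch, assuming NumPy only, of NMF with multiplicative updates applied to a non-negative magnitude spectrogram V ≈ WH. In practice the columns of W could be pre-trained basis spectra for speech and for the interfering source; here both factors are simply learned from a stand-in matrix, and the function name nmf and all parameter values are illustrative assumptions rather than the methods of [15, 19, 22, 23].

```python
# Minimal NMF sketch with multiplicative updates for the Frobenius cost.
import numpy as np

def nmf(V, rank=20, n_iter=200, eps=1e-9):
    """Factorize V (freq x time) into W (freq x rank) and H (rank x time)."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.randn(257, 400))  # stand-in magnitude spectrogram
W, H = nmf(V)
V_hat = W @ H                          # low-rank approximation of the mixture
```

Segregation would then reconstruct each source from the subset of basis vectors (columns of W) associated with it, typically via a soft spectral mask; the sketch stops at the factorization itself.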