A Pitch and Noise Robust Keyword Spotting System Using SMAC Features with Prosody Modification
Karabi Maity1 · Gayadhar Pradhan1 · Jyoti Prakash Singh1
Received: 21 December 2019 / Revised: 1 October 2020 / Accepted: 6 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
A keyword spotting (KWS) system uses a computer to detect keywords in a continuous speech signal. A variety of strategies have been suggested in the literature for detecting keywords effectively in adults' speech. However, only a limited number of studies have been reported on KWS for children's speech. Owing to physiological differences, the pitch and speaking rate of children differ from those of adults. Consequently, KWS model parameters trained on adults' speech yield poor performance on children's speech. In this paper, we develop a KWS system for spotting keywords in children's speech using models trained on adults' speech. The proposed approach uses the spectral moment time–frequency distribution augmented by low-order cepstral coefficients (SMAC) as the front-end feature. The mismatches due to differences in pitch and speaking rate between child and adult speakers are further mitigated by data-augmented training using explicit pitch and speaking-rate modifications. The experimental findings presented in this paper show that the SMAC feature performs significantly better than conventional Mel-frequency cepstral coefficients under both clean and noisy test conditions.
Keywords Keyword spotting · Children's speech · SMAC feature · Pitch modification · Duration modification
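The idea of augmenting adult training data with pitch and speaking-rate variants can be illustrated with a minimal sketch. This is not the prosody modification method used in the paper, which modifies pitch and speaking rate explicitly and is not specified in this excerpt. The sketch assumes plain resampling by linear interpolation, which changes pitch and duration together by the same factor. The function names, the 120 Hz test tone, and the factors 1.25 and 1.5 are all hypothetical choices made for illustration.

```python
import numpy as np

def resample_linear(signal, factor):
    """Return `signal` sped up by `factor` when played at the original sample rate.

    Assumption: simple linear interpolation. This couples pitch and duration,
    unlike the explicit, separate modifications described in the abstract.
    """
    n_out = int(round(len(signal) / factor))
    # Positions in the original signal that each output sample reads from.
    positions = np.linspace(0.0, len(signal) - 1, n_out)
    return np.interp(positions, np.arange(len(signal)), signal)

def dominant_frequency(signal, sample_rate):
    """Frequency (Hz) of the largest magnitude bin in the real FFT."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

sample_rate = 16000                          # hypothetical sampling rate in Hz
t = np.arange(sample_rate) / sample_rate     # one second of samples
adult_like = np.sin(2 * np.pi * 120.0 * t)   # stand-in for a low-pitched voice

# One augmented copy per (hypothetical) scaling factor; 1.0 keeps the original.
augmented = {f: resample_linear(adult_like, f) for f in (1.0, 1.25, 1.5)}

for factor, sig in augmented.items():
    print(factor, len(sig), dominant_frequency(sig, sample_rate))
```

With a factor above 1, each augmented copy is shorter and higher pitched than the original, which moves the adult-like signal toward the higher pitch typical of children's speech. A practical system would replace the resampler with a method that can change pitch and duration independently.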
1 National Institute of Technology Patna, Patna, India
Circuits, Systems, and Signal Processing
1 Introduction
Controlling household devices by voice is becoming a regular feature. From mobile phones to televisions, almost every electronic device is now equipped with voice-controlled commands. Other voice-command-controlled applications include voice-based dialing, audio mining, speech-to-gesture conversion, and spoken password verification [29]. A command such as “Play Nursery Rhyme” given to Alexa is one such example. Recently, voice-command-based applications have also made inroads into the children's domain. Pre-teen children use voice commands to get answers to their questions from voice-enabled devices or to control games or robots. In such voice-controlled systems, the first step is identifying the keywords of interest in continuous speech. Over the years, a myriad of approaches have been reported for efficiently spotting keywords of interest in adults' speech [3,5,17,27]. Adult speech data is easy to capture, but collecting children's speech for the development of a KWS system is quite challenging. Children's voices differ acoustically and linguistically from adults' voices. The pitch and speaking rate are we