Hindi speech recognition using time delay neural network acoustic modeling with i-vector adaptation
Ankit Kumar 1,2 · Rajesh Kumar Aggarwal 1

Received: 26 November 2018 / Accepted: 28 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract

Building Automatic Speech Recognition (ASR) systems for low- and limited-resource languages is a pressing need. For the last two decades, Indian-language ASR systems have usually relied on statistical techniques such as Hidden Markov Models (HMM). In this work, we select Time-Delay Neural Network (TDNN) based acoustic modeling with i-vector adaptation for limited-resource Hindi ASR. A TDNN can capture the extended temporal context of acoustic events; to reduce training time, we use a sub-sampling based TDNN architecture. Further, data augmentation techniques are applied to extend the size of the training data developed by TIFR, Mumbai. The results show that data augmentation significantly improves the performance of the Hindi ASR system, and that i-vector adaptation yields an average improvement of ≈ 4%. The best system accuracy of 89.9% is obtained with TDNN-based acoustic modeling combined with i-vector adaptation.

Keywords Automatic speech recognition · TDNN · i-vector · Data perturbation
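As a rough illustration of the sub-sampled TDNN architecture referred to above (Povey et al. 2016), the sketch below implements a time-delay layer as frame splicing at fixed time offsets followed by an affine transform and a ReLU. The feature dimensions, layer sizes, splicing offsets, and random weights here are illustrative assumptions for exposition, not the actual configuration used in this paper.

```python
import numpy as np

def tdnn_layer(x, weights, bias, offsets):
    """One time-delay layer: splice input frames at the given time offsets,
    then apply an affine transform and ReLU.
    x: (T, D_in); weights: (len(offsets) * D_in, D_out); bias: (D_out,)."""
    T, _ = x.shape
    lo, hi = min(offsets), max(offsets)
    valid = range(-lo, T - hi)  # output frames where every offset stays in range
    spliced = np.stack([np.concatenate([x[t + o] for o in offsets]) for t in valid])
    return np.maximum(spliced @ weights + bias, 0.0)

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 40))  # toy input: 100 frames of 40-dim features

# The first layer splices a dense context {-2..2}; deeper layers use sparse
# offsets such as {-3, 3}. This sparse splicing is the "sub-sampling" that
# cuts computation while still widening the effective temporal context.
h1 = tdnn_layer(feats, rng.standard_normal((5 * 40, 64)), np.zeros(64), [-2, -1, 0, 1, 2])
h2 = tdnn_layer(h1, rng.standard_normal((2 * 64, 64)), np.zeros(64), [-3, 3])
print(h1.shape, h2.shape)  # → (96, 64) (90, 64)
```

Each layer shortens the sequence by the span of its offsets, so the final output frame sees a wide window of the original input while only a few spliced frames are computed per layer.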
1 Introduction

Deep Neural Network (DNN) based acoustic modeling is the current trend in Large Vocabulary Continuous Speech Recognition (LVCSR) systems. Multi-layer networks were proposed in the late 1980s, with learning based on back-propagation (Rumelhart et al. 1986). For more than four decades, ASR models were based on statistical modeling techniques such as HMM-GMM; nowadays, however, DNNs have become the most popular choice among researchers (Dahl et al. 2011b, 2013; Hinton et al. 2012; Povey et al. 2016). DNN/HMM hybrid models have also been used to train acoustic models in various works to improve ASR system performance (Jaitly and Hinton 2011; Dahl et al. 2011a; Seide et al. 2011). However, to benefit from a DNN, a sufficiently large dataset is required to cover all the variability in speech. The reason is that, when the amount of speech data is enormous, a DNN

* Ankit Kumar, [email protected]

1
Department of Computer Engineering, National Institute of Technology, Kurukshetra, Haryana, India
2 School of Computing Science & Engineering, Galgotias University, Greater Noida, Uttar Pradesh, India
learns internal representations that are invariant to irrelevant variabilities such as speaker, environment, and bandwidth differences (Ko et al. 2017). With only a small amount of training data, DNN acoustic models must be trained carefully, as they can easily overfit. Chuangsuwanich (2016) found that, with 10 h of training data, around half of the hypothesized words will be wrong, and with 3 h of training data the Word Error Rate (WER) can shoot above 80% in some languages such as Kurmanji and Telugu. India is a multilin-