Hindi speech recognition using time delay neural network acoustic modeling with i-vector adaptation
Ankit Kumar 1,2 · Rajesh Kumar Aggarwal 1

Received: 26 November 2018 / Accepted: 28 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract

Building Automatic Speech Recognition (ASR) systems for low- and limited-resource languages is a pressing need. For the last two decades, Indian-language ASR systems have usually relied on statistical techniques such as Hidden Markov Models (HMM). In this work, we select Time-Delay Neural Network (TDNN) based acoustic modeling with i-vector adaptation for limited-resource Hindi ASR. A TDNN can capture the extended temporal context of acoustic events; to reduce training time, we use a sub-sampling based TDNN architecture. Further, data augmentation techniques are applied to extend the size of the training data developed by TIFR, Mumbai. The results show that data augmentation significantly improves the performance of the Hindi ASR system, and that i-vector adaptation yields an average improvement of ≈ 4%. The best system accuracy of 89.9% is obtained with TDNN-based acoustic modeling combined with i-vector adaptation.

Keywords Automatic speech recognition · TDNN · i-vector · Data perturbation
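As a rough illustration of the sub-sampled TDNN architecture referred to above (Povey et al. 2016), the sketch below implements a time-delay layer as frame splicing at fixed time offsets followed by an affine transform and a ReLU. The feature dimensions, layer sizes, splicing offsets, and random weights here are illustrative assumptions for exposition, not the actual configuration used in this paper.

```python
import numpy as np

def tdnn_layer(x, weights, bias, offsets):
    """One time-delay layer: splice input frames at the given time offsets,
    then apply an affine transform and ReLU.
    x: (T, D_in); weights: (len(offsets) * D_in, D_out); bias: (D_out,)."""
    T, _ = x.shape
    lo, hi = min(offsets), max(offsets)
    valid = range(-lo, T - hi)  # output frames where every offset stays in range
    spliced = np.stack([np.concatenate([x[t + o] for o in offsets]) for t in valid])
    return np.maximum(spliced @ weights + bias, 0.0)

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 40))  # toy input: 100 frames of 40-dim features

# The first layer splices a dense context {-2..2}; deeper layers use sparse
# offsets such as {-3, 3}. This sparse splicing is the "sub-sampling" that
# cuts computation while still widening the effective temporal context.
h1 = tdnn_layer(feats, rng.standard_normal((5 * 40, 64)), np.zeros(64), [-2, -1, 0, 1, 2])
h2 = tdnn_layer(h1, rng.standard_normal((2 * 64, 64)), np.zeros(64), [-3, 3])
print(h1.shape, h2.shape)  # → (96, 64) (90, 64)
```

Each layer shortens the sequence by the span of its offsets, so the final output frame sees a wide window of the original input while only a few spliced frames are computed per layer.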
1 Introduction

Deep Neural Network (DNN) based acoustic modeling is the current trend in Large Vocabulary Continuous Speech Recognition (LVCSR) systems. Multi-layer networks were proposed in the late 1980s, with learning based on back-propagation (Rumelhart et al. 1986). For more than four decades, ASR models were based on statistical modeling techniques such as HMM-GMM; nowadays, however, DNNs have become the most popular choice among researchers (Dahl et al. 2011b, 2013; Hinton et al. 2012; Povey et al. 2016). DNN/HMM hybrid models have also been used to train acoustic models in various works to improve ASR system performance (Jaitly and Hinton 2011; Dahl et al. 2011a; Seide et al. 2011). However, to benefit from a DNN, a sufficiently large dataset is required to cover all the variability in speech. The reason is that, when the amount of speech data is enormous, a DNN

* Ankit Kumar, [email protected]

1
Department of Computer Engineering, National Institute of Technology, Kurukshetra, Haryana, India
2 School of Computing Science & Engineering, Galgotias University, Greater Noida, Uttar Pradesh, India
learns internal representations that are invariant to irrelevant variabilities such as speaker, environment, and bandwidth differences (Ko et al. 2017). With only a small amount of training data, DNN acoustic models must be trained carefully, as they can easily overfit. Chuangsuwanich (2016) found that, with 10 h of training data, around half of the hypothesized words will be wrong, and with 3 h of training data the Word Error Rate (WER) can shoot above 80% in some languages such as Kurmanji and Telugu. India is a multilin-