L2 Mispronunciation Verification Based on Acoustic Phone Embedding and Siamese Networks
Yanlu Xie 1,2 & Zhenyu Wang 1,3 & Kaiqi Fu 1,2
Received: 14 February 2019 / Revised: 10 September 2020 / Accepted: 10 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract In computer-assisted pronunciation training (CAPT) systems, feedback on non-native mispronunciation is important because it helps second language (L2) learners improve their pronunciation. CAPT involves two key research issues: mispronunciation verification and pronunciation evaluation. In pronunciation evaluation at the phone level, the pairwise distances between embeddings of native phones and non-native phones could be an ideal predictor of L2 learners' proficiency. Considering the positive role played by phone embedding and Siamese networks in related fields, we propose to evaluate L2 learners' pronunciation based on phone embedding and Siamese networks. Arbitrary-length speech segments corresponding to phones are projected into an acoustic phone embedding space as fixed-dimensional vector representations. The system input is a pair of acoustic feature vectors of phone segments, and the vectors are labeled pairwise. The Siamese networks encode the feature vectors into phone embeddings as high-level representations, so that each type of phone can be differentiated through the similarity of its embeddings. Based on bi-directional Long Short-Term Memory (LSTM) and contrastive loss, the Siamese networks can be trained in a self-supervised manner using the pairwise labeled vectors, without any mispronunciation-labeled L2 speech data in the training set. In terms of diagnostic accuracy in mispronunciation verification tasks, results show that the proposed networks surpass other methods and achieve accuracy as high as 90.69%.
Keywords Phone embedding · Siamese networks · Mispronunciation verification · Computer-assisted pronunciation training · Recurrent neural networks
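To make the pairwise training idea in the abstract concrete, the following is a minimal sketch of a contrastive loss over a pair of fixed-dimensional phone embeddings, followed by a distance-threshold check. It is an illustration under stated assumptions, not the authors' implementation: the squared-distance form of the loss, the `margin` value, the `threshold` value, and the function names are all hypothetical, and the embeddings are plain Python lists standing in for the outputs of the bi-directional LSTM encoder.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_phone, margin=1.0):
    """Contrastive loss for one embedding pair (assumed, commonly used form).

    Pairs labeled as the same phone are pulled together (loss = d^2).
    Pairs labeled as different phones are pushed apart until their
    distance reaches the margin (loss = max(0, margin - d)^2).
    """
    d = euclidean_distance(emb_a, emb_b)
    if same_phone:
        return d ** 2
    return max(0.0, margin - d) ** 2


def is_mispronounced(learner_emb, native_emb, threshold=0.5):
    """Flag a learner phone whose embedding lies far from the native one.

    The threshold is a hypothetical tuning parameter, not a value
    taken from the paper.
    """
    return euclidean_distance(learner_emb, native_emb) > threshold
```

For example, `contrastive_loss([0.0, 0.0], [0.0, 0.5], same_phone=False)` penalizes a different-phone pair that sits inside the margin, while the same pair labeled `same_phone=True` is penalized for being apart. At verification time only the distance between the learner's embedding and a native reference is needed, which is why no mispronunciation-labeled L2 data is required for training in this setup.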
* Yanlu Xie [email protected]
Zhenyu Wang [email protected]
Kaiqi Fu [email protected]
1 Beijing Language and Culture University, Beijing, China
2 Beijing Advanced Innovation Center for Language Resources, Beijing, China
3 Center for Robust Speech Systems (CRSS), University of Texas at Dallas, Richardson, TX, USA

1 Introduction
For L2 learners of Mandarin, acquiring correct pronunciation is not easy. Native-like pronunciation is even harder to master, which is also true for those who have dialogical experience. The increasing global need for foreign language learning, together with improvements in computing power, has heightened researchers' interest in Computer-Assisted Language Learning (CALL). Traditional educational resources are limited, and a CALL system is a more effective alternative that provides timely and effective feedback to L2 learners. CAPT is one of