L2 Mispronunciation Verification Based on Acoustic Phone Embedding and Siamese Networks


Yanlu Xie 1,2 & Zhenyu Wang 1,3 & Kaiqi Fu 1,2

Received: 14 February 2019 / Revised: 10 September 2020 / Accepted: 10 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

In computer-assisted pronunciation training (CAPT) systems, feedback from non-native mispronunciation verification is important because it helps second language (L2) learners improve their pronunciation. In phone-level pronunciation evaluation, the pairwise distances between embeddings of native phones and non-native phones could be an ideal predictor of L2 learners' proficiency. In CAPT, two key research issues must be addressed: mispronunciation verification and pronunciation evaluation. Considering the positive role played by phone embeddings and Siamese networks in related fields, we propose to evaluate L2 learners' pronunciation based on phone embeddings and Siamese networks. Arbitrary-length speech segments corresponding to phones can be projected into an acoustic phone embedding space as fixed-dimensional vector representations. The system input is a pair of acoustic feature vectors of phone segments, and the vectors are labeled pairwise. The Siamese networks encode the feature vectors into phone embeddings as high-level representations, so that each type of phone can be differentiated through the similarity of its embeddings. Based on bi-directional Long Short-Term Memory (LSTM) and a contrastive loss, the Siamese networks can be trained in a self-supervised manner using the pairwise-labeled vectors, without any mispronunciation-labeled L2 speech data in the training set. Results show that, in terms of diagnostic accuracy on mispronunciation verification tasks, the proposed networks surpass other methods and achieve an accuracy as high as 90.69%.

Keywords Phone embedding . Siamese networks . Mispronunciation verification . Computer-assisted pronunciation training . Recurrent neural networks
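The pairwise training objective and the distance-based decision described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used margin-based form of the contrastive loss and a Euclidean distance between embeddings; the excerpt does not give the paper's exact formulation, so the function names, the `margin` value, and the decision `threshold` below are hypothetical, and the embeddings are taken as given rather than produced by a bi-directional LSTM encoder.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two fixed-dimensional embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_phone, margin=1.0):
    """Margin-based contrastive loss for one labeled pair (assumed form).

    Pairs of the same phone are pulled together (loss grows with distance);
    pairs of different phones are pushed apart until they are at least
    `margin` away from each other.
    """
    d = euclidean(emb_a, emb_b)
    if same_phone:
        return d ** 2
    return max(0.0, margin - d) ** 2


def is_mispronounced(learner_emb, native_emb, threshold=0.5):
    """Flag a learner phone whose embedding is far from the native reference.

    The threshold is a hypothetical value that would be tuned on held-out data.
    """
    return euclidean(learner_emb, native_emb) > threshold
```

With this form of the loss, a same-phone pair contributes nothing once its embeddings coincide, and a different-phone pair contributes nothing once it is separated by more than the margin, which is why only pairwise same/different labels are needed for training.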

* Yanlu Xie [email protected]
Zhenyu Wang [email protected]
Kaiqi Fu [email protected]

1 Beijing Language and Culture University, Beijing, China
2 Beijing Advanced Innovation Center for Language Resources, Beijing, China
3 Center for Robust Speech Systems (CRSS), University of Texas at Dallas, Richardson, TX, USA

1 Introduction

For L2 learners of Mandarin, acquiring correct pronunciation is not easy. Native-like pronunciation is even harder to master, and this is also true for those who have dialogical experience. The increasing global need for foreign language learning, together with the improvement in computing power, has heightened researchers' interest in Computer-Assisted Language Learning (CALL). Traditional educational resources are limited, and a CALL system is a more effective alternative that provides timely feedback to L2 learners. CAPT is one of
