Recognition of Isolated Digit Using Random Forest for Audio-Visual Speech Recognition
RESEARCH ARTICLE
Prashant Borde1 • Sadanand Kulkarni1 • Bharti Gawali1 • Pravin Yannawar1
Received: 17 January 2017 / Revised: 9 May 2019 / Accepted: 29 October 2020
© The National Academy of Sciences, India 2020
Abstract The proposed research investigates the effective use of two modalities (audio and visual inputs) toward designing a functional audio-visual speech recognition system. The promising results presented in this work were obtained on the vVISWa (visual Vocabulary of Isolated Standard Words) dataset of audio-visual words and the CUAVE (Clemson University Audio-Visual Experiments) database, respectively. Discrete cosine transform (DCT) and local binary pattern (LBP) features of the full frontal visual profile and MFCC features of the acoustic signal were fused for recognition and classified using a random forest classifier.

Keywords Face detection • Lip tracking • Local binary pattern (LBP) • Discrete cosine transform (DCT) • Mel-frequency cepstral coefficients (MFCC) • Linear discriminant analysis (LDA) • Random forest
Corresponding author: Prashant Borde [email protected]
Sadanand Kulkarni [email protected]
Bharti Gawali [email protected]
Pravin Yannawar [email protected]
1 Vision and Intelligent System Laboratory, Department of Computer Science and Information Technology, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, Maharashtra, India
1 Introduction

Over the last few decades, automatic speech recognition (ASR) has enhanced human-computer interaction with a high level of reliability. However, the performance of many ASR systems degrades when the acoustic signal is corrupted with noise [1]. A major challenge faced by the ASR research community is to improve the robustness of traditional ASR in the face of audible noise. Because the visual modality is not directly affected by acoustic noise, it can serve as a potential source for making ASR systems more robust, transforming them into audio-visual speech recognition (AVSR) systems.

Lip reading is the technique of recognizing what a person is saying by visually interpreting the movements of the lips, face, and tongue. Hearing-impaired listeners, as well as listeners with normal hearing, use visual information from lip movements as a primary source of speech perception [2]. Such approaches have been adopted to improve the performance of AVSR systems in the presence of noise [3, 4]. Gurbuz et al. [5] described the incorporation of visual lip tracking and a lip-reading algorithm that utilizes affine-invariant Fourier descriptors from parametric lip contours to improve audio-visual speech recognition. Saenko et al. [6] discussed an approach to visual speech modeling based on articulatory features under visually challenging conditions; this idea set the stage for parallel support vector machine (SVM) classifiers that extract different articulatory attributes from the input images and then combine their outputs.
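As a concrete illustration of the fusion pipeline summarized in the abstract (DCT + LBP visual features fused with acoustic MFCCs, classified by a random forest), the following is a minimal sketch assuming common open-source tools: librosa for MFCC extraction, scikit-image for LBP, SciPy for the 2-D DCT, and scikit-learn's RandomForestClassifier. The feature dimensions, frame averaging, and simple concatenation fusion shown here are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of DCT + LBP + MFCC feature fusion with a random forest.
# Library choices, feature sizes, and the fusion scheme are assumptions.
import numpy as np
import librosa                                   # MFCC extraction
from scipy.fftpack import dct                    # 2-D DCT of the mouth ROI
from skimage.feature import local_binary_pattern
from sklearn.ensemble import RandomForestClassifier

def visual_features(mouth_roi):
    """DCT + LBP features from one grayscale mouth-region frame."""
    # Keep the top-left 8x8 block of DCT coefficients (low frequencies).
    coeffs = dct(dct(mouth_roi.astype(float), axis=0, norm="ortho"),
                 axis=1, norm="ortho")[:8, :8].ravel()
    # Uniform LBP (8 neighbors, radius 1) summarized as a histogram.
    lbp = local_binary_pattern(mouth_roi, P=8, R=1, method="uniform")
    hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    return np.concatenate([coeffs, hist])

def acoustic_features(signal, sr):
    """Utterance-level MFCC vector: mean of 13 coefficients over frames."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)

def fused_vector(frames, signal, sr):
    """Feature-level fusion for one isolated-digit utterance:
    average the per-frame visual features, then concatenate with MFCCs."""
    vis = np.mean([visual_features(f) for f in frames], axis=0)
    return np.concatenate([vis, acoustic_features(signal, sr)])

# Training on precomputed fused vectors X with digit labels y (0-9):
# clf = RandomForestClassifier(n_estimators=100, random_state=0)
# clf.fit(X_train, y_train)
# predictions = clf.predict(X_test)
```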