A Support Vector Machine-Based Dynamic Network for Visual Speech Recognition Applications
Mihaela Gordan, Department of Informatics, Aristotle University of Thessaloniki, Box 451, Thessaloniki 54006, Greece. Email: [email protected]

Constantine Kotropoulos, Department of Informatics, Aristotle University of Thessaloniki, Box 451, Thessaloniki 54006, Greece. Email: [email protected]

Ioannis Pitas, Department of Informatics, Aristotle University of Thessaloniki, Box 451, Thessaloniki 54006, Greece. Email: [email protected]

Received 26 November 2001; in revised form 26 July 2002.

Visual speech recognition is an emerging research field. In this paper, we examine the suitability of support vector machines for visual speech recognition. Each word is modeled as a temporal sequence of visemes corresponding to the different phones realized. One support vector machine is trained to recognize each viseme, and its output is converted to a posterior probability through a sigmoidal mapping. To model the temporal character of speech, the support vector machines are integrated as nodes into a Viterbi lattice. We test the performance of the proposed approach on a small visual speech recognition task, namely the recognition of the first four digits in English. The word recognition rate obtained is at the level of the previous best reported rates.

Keywords and phrases: visual speech recognition, mouth shape recognition, visemes, phonemes, support vector machines, Viterbi lattice.
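The two mechanisms named in the abstract, mapping each SVM's raw decision value to a posterior probability through a sigmoid and decoding the best viseme path through a Viterbi lattice, can be illustrated with a minimal sketch. The code below is not the authors' implementation: the sigmoid parameters A and B (in Platt's form P = 1/(1 + exp(A*f + B)), which would normally be fitted on held-out data), the uniform transition scores, and all numeric values are hypothetical, chosen only to make the example run.

import numpy as np

def svm_posterior(f, A=-1.0, B=0.0):
    # Sigmoidal mapping of an SVM decision value f(x) to a posterior
    # probability, in Platt's form; A and B are assumed here, not fitted.
    return 1.0 / (1.0 + np.exp(A * f + B))

def viterbi(log_emissions, log_transitions, log_priors):
    # Most likely viseme sequence through the lattice.
    #   log_emissions:  (T, N) log posterior of each of N visemes per frame
    #   log_transitions: (N, N) log transition scores between visemes
    #   log_priors:     (N,)   log initial scores
    T, N = log_emissions.shape
    delta = log_priors + log_emissions[0]           # best score ending in each state
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_transitions   # scores[i, j]: from i to j
        backptr[t] = np.argmax(scores, axis=0)
        delta = scores[backptr[t], np.arange(N)] + log_emissions[t]
    path = [int(np.argmax(delta))]                  # trace back the best path
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Hypothetical example: 3 visemes, 4 frames of raw SVM outputs (one per viseme).
f_vals = np.array([[ 1.2, -0.5, -2.0],
                   [ 0.8,  0.1, -1.5],
                   [-1.0,  1.5, -0.2],
                   [-2.0,  0.3,  1.1]])
post = svm_posterior(f_vals)
post /= post.sum(axis=1, keepdims=True)             # normalize across visemes
log_tr = np.log(np.full((3, 3), 1.0 / 3.0))         # uniform transitions (assumed)
log_pi = np.log(np.full(3, 1.0 / 3.0))
print(viterbi(np.log(post), log_tr, log_pi))        # -> [0, 0, 1, 2]

With uniform transition and prior scores, as assumed here, the decoded path simply follows the per-frame maximum posterior; non-uniform transitions, as in a word lattice, would let the decoder favor valid viseme sequences over locally best but inconsistent ones.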
1. INTRODUCTION

Audio-visual speech recognition is an emerging research field that requires multimodal signal processing. The motivation for using visual information in speech recognition lies in the fact that human speech production is bimodal by nature. In particular, human speech is produced by the vibration of the vocal cords and depends on the configuration of the articulatory organs, such as the nasal cavity, the tongue, the teeth, the velum, and the lips. A speaker produces speech using these articulatory organs together with the muscles that generate facial expressions. Because some of the articulators, such as the tongue, the teeth, and the lips, are visible, there is an inherent relationship between acoustic and visible speech. As a consequence, speech can be partially recognized from the information of the visible articulators involved in its production, and in particular from the image region comprising the mouth [1, 2, 3].

Undoubtedly, the most useful information for speech recognition is carried by the acoustic signal. When the acoustic speech is clean, performing visual speech recognition and integrating the recognition results from both modalities brings little improvement, because the recognition rate from the acoustic information alone is very high, if not perfect. However, when the acoustic speech is degraded by noise, adding the visual information to the acoustic one significantly improves the recognition rate. Under noisy conditions, it has been shown that the use of both modalities yields a higher recognition rate than that achieved using the acoustic information alone.