Semantic Indexing of Multimedia Content Using Visual, Audio, and Text Cues

W. H. Adams IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA Email: [email protected]

Giridharan Iyengar IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA Email: [email protected]

Ching-Yung Lin IBM T. J. Watson Research Center, Hawthorne, NY 10532, USA Email: [email protected]

Milind Ramesh Naphade IBM T. J. Watson Research Center, Hawthorne, NY 10532, USA Email: [email protected]

Chalapathy Neti IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA Email: chalapathy [email protected]

Harriet J. Nock IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA Email: [email protected]

John R. Smith IBM T. J. Watson Research Center, Hawthorne, NY 10532, USA Email: [email protected]

Received 2 April 2002 and in revised form 15 November 2002

We present a learning-based approach to the semantic indexing of multimedia content using cues derived from audio, visual, and text features. We approach the problem by developing a set of statistical models for a predefined lexicon. Novel concepts are then mapped in terms of the concepts in the lexicon. To achieve robust detection of concepts, we exploit features from multiple modalities, namely, audio, video, and text. Concept representations are modeled using Gaussian mixture models (GMM), hidden Markov models (HMM), and support vector machines (SVM). Models such as Bayesian networks and SVMs are used in a late-fusion approach to model concepts that are not explicitly modeled in terms of features. Our experiments indicate promise in the proposed classification and fusion methodologies: our proposed fusion scheme achieves more than 10% relative improvement over the best unimodal concept detector.

Keywords and phrases: query by keywords, multimodal information fusion, statistical modeling of multimedia, video indexing and retrieval, SVM, GMM, HMM, spoken document retrieval, video event detection, video TREC.
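As a rough illustration of the late-fusion idea summarized above, the sketch below stacks per-modality detector confidences for a single concept and trains an SVM meta-classifier on top of them. This is a minimal sketch, not the paper's implementation: scikit-learn is assumed as the SVM library, and the detector scores, labels, and example concept are synthetic.

```python
# Hedged sketch of late fusion: per-modality concept-detector scores
# (e.g., outputs of audio, visual, and text detectors for one concept)
# are stacked into a feature vector and fed to an SVM meta-classifier.
# Assumptions: scikit-learn is available; all scores and labels below
# are synthetic placeholders, not data from the paper.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n_clips = 500

# Hypothetical ground-truth labels for one concept (1 = concept present).
labels = rng.integers(0, 2, size=n_clips)

# Synthetic unimodal confidences: column 0 = audio detector,
# column 1 = visual detector, column 2 = text (transcript) detector.
scores = np.column_stack([
    labels + 0.8 * rng.normal(size=n_clips),
    labels + 1.0 * rng.normal(size=n_clips),
    labels + 1.2 * rng.normal(size=n_clips),
])

X_train, X_test, y_train, y_test = train_test_split(
    scores, labels, test_size=0.3, random_state=0)

# Late-fusion meta-classifier: an SVM over the stacked unimodal scores.
fusion = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
fused = fusion.predict_proba(X_test)[:, 1]

# Compare the fused detector against the best single modality
# using average precision on the held-out clips.
best_unimodal = max(
    average_precision_score(y_test, X_test[:, m]) for m in range(3))
print(f"best unimodal AP: {best_unimodal:.3f}")
print(f"fused AP:         {average_precision_score(y_test, fused):.3f}")
```

In this toy setup the fused score typically outperforms any single modality because the SVM learns how much to trust each detector; the paper's reported result (over 10% relative improvement from fusion) reflects the same intuition, though obtained with its own detectors and data.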

1. INTRODUCTION

Large digital video libraries require tools for representing, searching, and retrieving content. One possibility is the query-by-example (QBE) approach, in which users provide (usually visual) examples of the content they seek. However, such schemes have some obvious limitations, and since most users wish to search in terms of semantic-concepts rather than by visual content [1], work in the video retrieval area has begun to shift from QBE to query-by-keyword (QBK) approaches, which allow users to search by specifying their query in terms of a limited vocabulary of semantic-concepts. This paper presents an overview of an ongoing IBM project which is developing a trainable QBK system for the labeling and retrieval of generic multimedia semantic-concepts in video; it will focus, in particular, upon the detection of semantic-concepts using information cues from multiple modalities (audio, video, speech, and potentially videotext).

1.1. Related work