Semantic Context Detection Using Audio Event Fusion


Semantic Context Detection Using Audio Event Fusion: Camera-Ready Version

Wei-Ta Chu,1 Wen-Huang Cheng,2 and Ja-Ling Wu1,2

1 Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan
2 Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei 106, Taiwan

Received 31 August 2004; Revised 20 February 2005; Accepted 5 April 2005

Semantic-level content analysis is a crucial issue in achieving efficient content retrieval and management. We propose a hierarchical approach that models audio events over a time series in order to accomplish semantic context detection. Two levels of modeling, audio event and semantic context modeling, are devised to bridge the gap between physical audio features and semantic concepts. In this work, hidden Markov models (HMMs) are used to model four representative audio events, that is, gunshot, explosion, engine, and car braking, in action movies. At the semantic context level, generative (ergodic hidden Markov model) and discriminative (support vector machine (SVM)) approaches are investigated to fuse the characteristics and correlations among audio events, which provide cues for detecting gunplay and car-chasing scenes. The experimental results demonstrate the effectiveness of the proposed approaches and provide a preliminary framework for information mining by using audio characteristics. Copyright © 2006 Hindawi Publishing Corporation. All rights reserved.
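As a concrete illustration of the event-level modeling the abstract describes, an audio segment can be assigned to whichever event HMM gives it the highest likelihood (computed with the standard forward algorithm). The sketch below uses toy two-state, discrete-symbol models with made-up parameters; the paper's actual models use continuous audio features and four event classes, so everything here is hypothetical.

```python
import math

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    via the scaled forward algorithm (pi: initial state probabilities,
    A: state transition matrix, B: per-state emission probabilities)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    s = sum(alpha)
    loglik = math.log(s)
    alpha = [a / s for a in alpha]          # rescale to avoid underflow
    for o in obs[1:]:
        alpha = [B[j][o] * sum(alpha[i] * A[i][j] for i in range(n))
                 for j in range(n)]
        s = sum(alpha)
        loglik += math.log(s)
        alpha = [a / s for a in alpha]
    return loglik

# Toy event models (hypothetical parameters, NOT from the paper).
# The two symbols stand in for quantized audio-feature observations.
models = {
    # "gunshot": persistent states with sharply peaked emissions
    "gunshot": ([0.5, 0.5],
                [[0.9, 0.1], [0.1, 0.9]],
                [[0.8, 0.2], [0.2, 0.8]]),
    # "engine": states mix more and emissions are flatter
    "engine":  ([0.5, 0.5],
                [[0.6, 0.4], [0.4, 0.6]],
                [[0.3, 0.7], [0.7, 0.3]]),
}

def classify_event(obs):
    """Maximum-likelihood event selection among the candidate HMMs."""
    return max(models, key=lambda m: forward_loglik(obs, *models[m]))
```

A steady observation sequence such as `[0, 0, 0, 0]` scores higher under the self-persistent "gunshot" model, while an alternating sequence such as `[0, 1, 0, 1]` favors the "engine" model. In the paper, a second modeling level then fuses such per-segment event decisions over time to detect semantic contexts like gunplay or car-chasing scenes.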

1. INTRODUCTION

With the rapid advances in media creation, storage, and compression technologies, large amounts of multimedia content have been created and disseminated in various ways. Massive multimedia data challenge users in content browsing and retrieval, thereby motivating an urgent need for information-mining technologies. To facilitate effective and efficient multimedia document indexing, many research issues have been investigated. Shot boundary detection algorithms have been extensively studied [1, 2] to discover the structure of video. With an understanding of video structure, video adaptation applications [3] are then developed to manipulate information more flexibly. Moreover, techniques for genre classification are also investigated to facilitate browsing and retrieval. Audio classification and segmentation techniques [4, 5] have been proposed to discriminate between different types of audio, such as speech, music, noise, and silence. Additional work focuses on classifying musical sounds [6] and automatically constructing music snippets [7]. For video content, genres of films [8] and TV programs [9] are automatically classified by exploring various features. Features from audio, video, and text [10] can be exploited jointly to perform content analysis, and multimodal approaches have been proposed to efficiently cope with the access and retrieval issues of multimedia content. On the basis of physical features, the paradigms described above are developed to automatically analyze multimedia

content. However, many problems remain in today's applications. The semantic gap between low-level fe