Joint Audio-Visual Tracking Using Particle Filters

Dmitry N. Zotkin
Perceptual Interfaces and Reality Laboratory, Department of Computer Science, University of Maryland Institute for Advanced Computer Studies, University of Maryland at College Park, College Park, MD 20742, USA
Email: [email protected]

Ramani Duraiswami
Perceptual Interfaces and Reality Laboratory, Department of Computer Science, University of Maryland Institute for Advanced Computer Studies, University of Maryland at College Park, College Park, MD 20742, USA
Email: [email protected]

Larry S. Davis
Perceptual Interfaces and Reality Laboratory, Department of Computer Science, University of Maryland Institute for Advanced Computer Studies, University of Maryland at College Park, College Park, MD 20742, USA
Email: [email protected]

Received 8 November 2001 and in revised form 13 May 2002

It is often advantageous to track objects in a scene using multimodal information when such information is available. We use audio as a complementary modality to video data; in comparison to vision, audio can provide faster localization over a wider field of view. We present a particle-filter-based tracking framework that performs multimodal sensor fusion for tracking people in a videoconferencing environment using multiple cameras and multiple microphone arrays. One advantage of the proposed tracker is its ability to seamlessly handle the temporary absence of some measurements (e.g., during camera occlusion or silence). Another is the possibility of self-calibration of the joint system: imprecisely known array or camera parameters are treated as containing an unknown statistical component that is estimated within the particle filter framework during tracking. We implement the algorithm in the context of a videoconferencing and meeting recording system. The system also performs high-level semantic analysis of the scene by maintaining participant tracks, recognizing turn-taking events, and recording an annotated transcript of the meeting. Experimental results are presented. The system operates in real time and is shown to be robust and reliable.

Keywords and phrases: audio-visual tracking, sensor fusion, Monte-Carlo algorithms.
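For illustration only, the sketch below (Python) shows one generic sampling-importance-resampling (SIR) particle-filter step that fuses audio and video measurements through a product of likelihoods and simply skips a modality whose measurement is absent. The random-walk motion model, the likelihood functions, and all names and parameters are assumptions made for this sketch; they are not the exact implementation described in this paper.

import numpy as np

def particle_filter_step(particles, weights, audio_obs, video_obs,
                         audio_likelihood, video_likelihood,
                         motion_noise=0.05):
    """One SIR step over particles of shape (N, state_dim).

    audio_obs / video_obs may be None when that modality is silent
    or occluded; the filter then coasts on the remaining information.
    """
    n = len(particles)

    # 1. Propagate each particle through a simple random-walk motion model
    #    (an illustrative assumption; any dynamics model could be used).
    particles = particles + motion_noise * np.random.randn(*particles.shape)

    # 2. Weight particles by the product of the available modality
    #    likelihoods. A missing measurement contributes no factor.
    for obs, likelihood in ((audio_obs, audio_likelihood),
                            (video_obs, video_likelihood)):
        if obs is not None:
            weights = weights * likelihood(particles, obs)
    weights = weights / np.sum(weights)

    # 3. Resample to concentrate particles on high-probability regions,
    #    then reset the weights to uniform.
    idx = np.random.choice(n, size=n, p=weights)
    particles = particles[idx]
    weights = np.full(n, 1.0 / n)

    # The state estimate is the mean of the resampled particle set.
    estimate = particles.mean(axis=0)
    return particles, weights, estimate

Taking the product of the per-modality likelihoods corresponds to assuming that the audio and video measurements are conditionally independent given the state; self-calibration of the kind described in the abstract can be accommodated in such a framework by appending the uncertain camera or array parameters to the particle state vector.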

1. INTRODUCTION

The goal of most machine perception systems is to mimic the performance of human and animal perceptual systems. A key characteristic of human perception is its multimodality: it relies on information from many modalities, chief among which are vision and audition. It is now apparent that many of the centers in the brain thought to encode space-time are activated by combinations of visual and audio stimuli [1]. However, computer vision and computer audition have essentially proceeded on parallel tracks, with different research communities and different problems. The capabilities of computers have now reached a level at which it is possible to build systems that combine multiple audio and video sensors and perform meaningful joint analysis of a scene, such as joint audiovisual s