Annotation of Heterogeneous Multimedia Content Using Automatic Speech Recognition


Abstract. This paper reports on the setup and evaluation of robust speech recognition system components, geared towards transcript generation for heterogeneous, real-life media collections. The system is deployed for generating speech transcripts for the NIST/TRECVID-2007 test collection, part of a Dutch real-life archive of news-related genres. Performance figures for this type of content are compared to figures for broadcast news test data.

1 Introduction

The exploitation of linguistic content such as transcripts generated via automatic speech recognition (ASR) can boost the accessibility of multimedia archives enormously. This effect is of course limited to video data containing textual and/or spoken content, or to video content with links to related textual documents, such as subtitles and generated transcripts. When such content is available, however, exploiting it to generate a time-coded index can help to bridge the semantic gap between media features and search needs. This is confirmed by the results of the TREC series of Workshops on Video Retrieval (TRECVID)¹. The TRECVID test collections contain not just video, but also ASR-generated transcripts of segments containing speech. Systems that do not exploit these transcripts typically do not perform as well as the systems that do incorporate speech features in their models [13]. ASR supports the conceptual querying of video content and the synchronization to any kind of textual resource that is accessible, including other full-text annotation for audiovisual material [4].

The potential of ASR-based indexing has been demonstrated most successfully in the broadcast news domain. Spoken document retrieval in the American-English broadcast news (BN) domain was even declared ‘a solved problem’ based on the results of the TREC Spoken Document Retrieval (SDR) track in 1999 [7]. Partly because collecting data to train recognition models for the BN domain is relatively easy, word error rates (WER)

¹ http://trecvid.nist.gov

B. Falcidieno et al. (Eds.): SAMT 2007, LNCS 4816, pp. 78–90, 2007. © Springer-Verlag Berlin Heidelberg 2007

below 10% are no longer exceptional [8,9], and ASR transcripts for BN content approximate the quality of manual transcripts, at least for several languages. In domains other than broadcast news, and for many less favored languages, similar recognition performance is usually harder to obtain due to (i) a lack of domain-specific training data, and (ii) large variability in audio quality, speech characteristics and topics being addressed. However, since an ASR performance of 50% WER is regarded as the minimum quality needed for successful retrieval, speech-based indexing for harder data remains feasible as long as the WER does not exceed 50%, and it is in fact a crucial enabling technology if no other means (metadata) are available to guide searching. For 2007, the TRECVID organisers have decided to shift the focus from broadcast news video to video from a real-life archive of news-related genr