Annotation of Heterogeneous Multimedia Content Using Automatic Speech Recognition


Abstract. This paper reports on the setup and evaluation of robust speech recognition system parts, geared towards transcript generation for heterogeneous, real-life media collections. The system is deployed for generating speech transcripts for the NIST/TRECVID-2007 test collection, part of a Dutch real-life archive of news-related genres. Performance figures for this type of content are compared to figures for broadcast news test data.

1 Introduction

The exploitation of linguistic content, such as transcripts generated via automatic speech recognition (ASR), can boost the accessibility of multimedia archives enormously. This effect is of course limited to video data containing textual and/or spoken content, but when such content is available, its exploitation for the generation of a time-coded index can help to bridge the semantic gap between media features and search needs. This is confirmed by the results of the TREC series of Workshops on Video Retrieval (TRECVID)1. The TRECVID test collections contain not just video, but also ASR-generated transcripts of the segments containing speech. Systems that do not exploit these transcripts typically do not perform as well as systems that do incorporate speech features in their models [13]. The same applies to video content with links to related textual documents, such as subtitles and generated transcripts. ASR supports the conceptual querying of video content and its synchronization to any kind of accessible textual resource, including other full-text annotations of audiovisual material [4].

The potential of ASR-based indexing has been demonstrated most successfully in the broadcast news domain. Spoken document retrieval in the American-English broadcast news (BN) domain was even declared 'a solved problem' based on the results of the TREC Spoken Document Retrieval (SDR) track in 1999 [7]. Partly because collecting data to train recognition models for the BN domain is relatively easy, word-error rates (WER) below 10% are no longer exceptional [8,9], and ASR transcripts for BN content approximate the quality of manual transcripts, at least for several languages. In domains other than broadcast news, and for many less favored languages, a similar recognition performance is usually harder to obtain due to (i) a lack of domain-specific training data, and (ii) large variability in audio quality, speech characteristics and topics being addressed. However, since a WER of 50% is regarded as the upper bound for successful retrieval, speech-based indexing of harder data remains feasible as long as recognition performance stays within that bound, and it is in fact a crucial enabling technology when no other means (metadata) are available to guide searching. For 2007, the TRECVID organisers have decided to shift the focus from broadcast news video to video from a real-life archive of news-related genres.

1 http://trecvid.nist.gov

B. Falcidieno et al. (Eds.): SAMT 2007, LNCS 4816, pp. 78–90, 2007. © Springer-Verlag Berlin Heidelberg 2007
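The word-error rate figures cited above follow the standard definition: the minimum number of word substitutions, insertions, and deletions needed to turn the recognizer's hypothesis into the reference transcript, divided by the number of reference words. The following is a minimal illustrative sketch of that computation (the function name and the whitespace tokenization are our own simplifications; evaluation campaigns such as TREC/TRECVID use dedicated scoring tools with text normalization on top of this):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER as word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table: d[i][j] is the edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                    # deletion
                d[i][j - 1] + 1,                    # insertion
                d[i - 1][j - 1] + substitution_cost # match / substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words gives a WER of 1/6.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why thresholds such as the 50% bound mentioned above are stated on the error rate rather than on an accuracy score.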
