Term-Weighting for Summarization of Multi-party Spoken Dialogues


Abstract. This paper explores the issue of term-weighting in the genre of spontaneous, multi-party spoken dialogues, with the intent of using such term-weights in the creation of extractive meeting summaries. The field of text information retrieval has yielded many term-weighting techniques to import for our purposes; this paper implements and compares several of these, namely tf.idf, Residual IDF and Gain. We propose that term-weighting for multi-party dialogues can exploit patterns in word usage among participant speakers, and introduce the su.idf metric as one attempt to do so. Results for all metrics are reported on both manual and automatic speech recognition (ASR) transcripts, and on both the ICSI and AMI meeting corpora.

1 Introduction

The primary focus of this research is to create extractive summaries of meeting speech, in order to present users with concise and informative overviews of the content of meetings. Such extractive summaries, when incorporated into a meeting browser, can act as efficient tools for navigating meeting records as a whole.

This paper focuses on one fundamental component of the extractive summarization pipeline: the way that terms are weighted within a given meeting, and the bearing that various term-weighting schemes have on extraction performance. Choosing and implementing a term-weighting method is often the first step in building an automatic summarization system. Though the unit of extraction may be the sentence or the dialogue act, those units need to be weighted by the importance of their constituent words. Popular text summarization techniques such as Maximal Marginal Relevance (MMR) and Latent Semantic Analysis (LSA) begin by representing sentences as vectors of term weights. There is a wide variety of term-weighting schemes available, from simple binary weights of word presence/absence to more complex weighting schemes such as tf.idf and tf.ridf. Several of these are described in the following section.

A central question of this paper is whether term-weighting techniques developed for information retrieval (IR) and summarization tasks on text are well suited for our domain of multi-party spontaneous spoken dialogues, or whether the patterns of word usage in such dialogues can be exploited in order to yield superior term-weighting for our task. To this end, we devise and implement a novel term-weighting approach for multi-party speech called su.idf, based on differing word frequencies among speakers in a meeting. This metric is compared with three popular term-weighting schemes (tf.idf, ridf and Gain), and the metrics are evaluated via an extractive summarization task on both the AMI and ICSI corpora.

A. Popescu-Belis, S. Renals, and H. Bourlard (Eds.): MLMI 2007, LNCS 4892, pp. 156–167, 2007. © Springer-Verlag Berlin Heidelberg 2007
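
The idea that extraction units are weighted by the importance of their constituent words can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes a generic tf.idf formulation (relative term frequency times log inverse document frequency), scores each dialogue act by summing its term weights, and keeps the top-scoring acts. The function names `tfidf_weights`, `score_dialogue_acts` and `extract_summary` are hypothetical, as is the use of a small background document collection for the idf counts.

```python
import math
from collections import Counter


def tfidf_weights(meeting_tokens, background_docs):
    """Return {term: tf * idf} for one meeting (generic tf.idf, an assumption).

    tf is the term's relative frequency in the meeting; idf is log(N / df),
    where the N documents are the background documents plus the meeting itself,
    so every meeting term has df >= 1.
    """
    docs = [set(doc) for doc in background_docs] + [set(meeting_tokens)]
    n_docs = len(docs)
    counts = Counter(meeting_tokens)
    total = len(meeting_tokens)
    weights = {}
    for term, count in counts.items():
        df = sum(1 for doc in docs if term in doc)
        weights[term] = (count / total) * math.log(n_docs / df)
    return weights


def score_dialogue_acts(dialogue_acts, weights):
    """Score each dialogue act as the sum of the weights of its tokens."""
    return [sum(weights.get(tok, 0.0) for tok in act) for act in dialogue_acts]


def extract_summary(dialogue_acts, weights, k):
    """Return the indices of the k highest-scoring acts, in meeting order."""
    scores = score_dialogue_acts(dialogue_acts, weights)
    ranked = sorted(range(len(dialogue_acts)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Summing weights favours longer dialogue acts; a real system might normalize by length or feed the weight vectors to a method such as MMR instead, a choice this sketch leaves open.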

2 Previous Term Weighting Work

Term weighting methods form an essential part of any IR system. Terms that characterize a given document well and discriminate the document from