Text Segmentation Using Context Overlap
Abstract. In this paper we propose features desirable for linear text segmentation algorithms in the Information Retrieval domain, with emphasis on improving high-similarity search of heterogeneous texts. We proceed to describe a robust, purely statistical method, based on the exploitation of context overlap, that exhibits these desired features. Experimental results are presented, along with a comparison to other existing algorithms.
1 Introduction
Text segmentation has recently enjoyed increased research focus. Segmentation algorithms aim to split a text document into contiguous blocks, called segments, each of which covers a compact topic while consecutive blocks cover different topics. Applications include finding topic boundaries in text transcriptions of audio news, improving text navigation, and intrinsic plagiarism (or anomaly) detection. Text segmentation can also be used to improve Information Retrieval (henceforth IR) performance, which is the main target application for the method described in this paper.

To see how text segmentation might improve IR performance, consider a standard IR scenario. Here documents are transformed into the Vector Space model and indexing techniques are employed to allow efficient exact and proximity queries. Given the widely heterogeneous documents that a general IR system may expect, some of these documents may be monothematic and compact, dealing with a single topic. Others can be a mixture of various topics, connected not thematically but rather incidentally (for example, documents containing news agglomerated by date, not by topic). Some may cover multiple topics intentionally, such as complex documents involving passages in different languages. The problem is that once a document is converted into the Vector Space, all structural information is lost. The resulting document vector shifts away from any one topic included in the original document. User queries, however, are typically monothematic, so the chance of a high-similarity match between the user query and the document vector decreases. This can result in a missed hit.

Thus having the basic retrieval blocks correspond to single topics rather than whole documents seems like a methodologically sound step. It is up to the final application to merge and present topical, sub-document hits to the user. It also depends on the application to set the granularity of topics that we wish to tell apart. Identifying compact document chunks also has applications in intrinsic plagiarism detection, where it helps to reduce the number of suspicious passages and subsequent queries.
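To make the effect concrete, the following minimal Python sketch (our own illustration, not part of the paper; the toy texts and the raw bag-of-words/cosine setup are assumptions) compares a monothematic query against the vector of a whole two-topic document and against the vectors of its individual segments.

# Illustrative sketch only: why matching a monothematic query against a
# whole-document vector can score lower than matching it against the vector
# of the relevant segment. Toy texts and segmentation are invented.
from collections import Counter
from math import sqrt

def bow(text):
    """Simple bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# A heterogeneous document: two unrelated news topics agglomerated by date.
segment_sport = "local team wins championship after dramatic penalty shootout"
segment_markets = "central bank raises interest rates amid rising inflation pressure"
whole_document = segment_sport + " " + segment_markets

query = bow("interest rates inflation")

print("query vs whole document:", round(cosine(query, bow(whole_document)), 3))
for name, seg in [("sport", segment_sport), ("markets", segment_markets)]:
    print("query vs segment", name + ":", round(cosine(query, bow(seg)), 3))
# The mixed-topic document vector is pulled away from either of its topics,
# so the whole document scores lower than its relevant segment.

Under these toy numbers the markets segment scores roughly 0.58 against the query while the whole document scores roughly 0.42; this gap is what segment-level indexing is meant to exploit.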
1.1 Motivation
There are practical considerations that are important in real-world IR systems. Driven by the need to understand the system's behaviour (especially its unexpected behaviour) and to extend the system during the development cycle, it is advantageous to keep the system architecture as simple, clear and robust as possible. Based on these concerns, three impor