Text Segmentation Using Context Overlap
Abstract. In this paper we propose features desirable for linear text segmentation algorithms in the Information Retrieval domain, with emphasis on improving high-similarity search of heterogeneous texts. We proceed to describe a robust, purely statistical method, based on the exploitation of context overlap, that exhibits these desired features. Experimental results are presented, along with a comparison to other existing algorithms.
1 Introduction
Text segmentation has recently enjoyed increased research focus. Segmentation algorithms aim to split a text document into contiguous blocks, called segments, each of which covers a compact topic while consecutive blocks cover different topics. Applications include finding topic boundaries in text transcriptions of audio news, improving text navigation, and intrinsic plagiarism (or anomaly) detection. Text segmentation can also be used to improve Information Retrieval (henceforth IR) performance, which is the main target application for the method described in this paper.

To see how text segmentation might improve IR performance, consider a standard IR scenario. Here documents are transformed into the Vector Space model and indexing techniques are employed to allow efficient exact and proximity queries. Given the widely heterogeneous documents that a general IR system may expect, some of these documents may be monothematic and compact, dealing with a single topic. Others can be a mixture of various topics, connected not thematically but rather incidentally (for example, documents containing news agglomerated by date, not by topic). Some may cover multiple topics intentionally, such as complex documents involving passages in different languages. The problem is that once a document is converted into the Vector Space, all structural information is lost. The resulting document vector shifts away from any one topic included in the original document. User queries, however, are typically monothematic, so the chance of a high-similarity match between the user query and the document vector decreases. This can result in a missed hit.

Thus having the basic retrieval blocks correspond to single topics rather than whole documents seems like a methodologically sound step. It is up to the final application to merge and present topical, sub-document hits to the user. It also depends on the application to set the granularity of topics that we wish to tell apart. Identifying compact document chunks also has applications in intrinsic plagiarism detection, where it helps to reduce the number of suspicious passages and subsequent queries.
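To make the effect concrete, the following minimal Python sketch (our own illustration, not part of the paper; the toy texts and the raw bag-of-words/cosine setup are assumptions) compares a monothematic query against the vector of a whole two-topic document and against the vectors of its individual segments.

# Illustrative sketch only: why matching a monothematic query against a
# whole-document vector can score lower than matching it against the vector
# of the relevant segment. Toy texts and segmentation are invented.
from collections import Counter
from math import sqrt

def bow(text):
    """Simple bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# A heterogeneous document: two unrelated news topics agglomerated by date.
segment_sport = "local team wins championship after dramatic penalty shootout"
segment_markets = "central bank raises interest rates amid rising inflation pressure"
whole_document = segment_sport + " " + segment_markets

query = bow("interest rates inflation")

print("query vs whole document:", round(cosine(query, bow(whole_document)), 3))
for name, seg in [("sport", segment_sport), ("markets", segment_markets)]:
    print("query vs segment", name + ":", round(cosine(query, bow(seg)), 3))
# The mixed-topic document vector is pulled away from either of its topics,
# so the whole document scores lower than its relevant segment.

Under these toy numbers the markets segment scores roughly 0.58 against the query while the whole document scores roughly 0.42; this gap is what segment-level indexing is meant to exploit.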
1.1 Motivation
There are practical considerations that are important in real-world IR systems. Driven by the need to understand the system's behaviour (especially its unexpected behaviour) and to extend the system during the development cycle, it is advantageous to keep the system architecture as simple, clear and robust as possible. Based on these concerns, three impor