Viewing Term Proximity from a Different Perspective



1 Dept. of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
2 Microsoft Research Asia, No.49 Zhichun Road, Beijing, 100080, China
3 Microsoft Research Ltd, 7 JJ Thomson Avenue, Cambridge CB3 0FB, England

{rsong,mitaylor,jrwen,hon}@microsoft.com, [email protected]

Abstract. This paper extends the state-of-the-art probabilistic model BM25 to utilize term proximity from a new perspective. Most previous work considers only dependencies between pairs of terms, and regards phrases as additional independent evidence. It is difficult to estimate the importance of a phrase and its extra contribution to a relevance score, as the phrase actually overlaps with its component terms. This paper proposes a new approach. First, query terms are grouped locally into non-overlapping phrases that may contain one or more query terms. Second, these phrases are not scored independently but are instead treated as providing a context for the component query terms. The relevance contribution of a term occurrence is measured by how many query terms occur in the context phrase and how compact they are. Third, we replace term frequency by the accumulated relevance contribution. Consequently, term proximity is easily integrated into the probabilistic model. Experimental results on TREC-10 and TREC-11 collections show stable improvements in terms of average precision and significant improvements in terms of top precisions.
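The three steps of the abstract can be sketched in code. This is an illustrative sketch only: the paper's actual grouping procedure and weighting formulas are not reproduced in this excerpt, so `max_gap`, `group_query_occurrences`, `pseudo_tf`, and the `distinct / width` weighting are hypothetical stand-ins for "non-overlapping phrases" and "accumulated relevance contribution".

```python
def group_query_occurrences(doc_tokens, query_terms, max_gap=2):
    """Greedily group nearby query-term occurrences into non-overlapping
    spans -- a crude stand-in for the paper's locally grouped phrases."""
    positions = [(i, t) for i, t in enumerate(doc_tokens) if t in query_terms]
    spans, current = [], []
    for pos, term in positions:
        if current and pos - current[-1][0] > max_gap:
            spans.append(current)
            current = []
        current.append((pos, term))
    if current:
        spans.append(current)
    return spans

def pseudo_tf(spans, term):
    """Accumulate a proximity-weighted contribution for one query term:
    an occurrence inside a span with more distinct query terms and a
    tighter width counts for more than an isolated occurrence."""
    total = 0.0
    for span in spans:
        if all(t != term for _, t in span):
            continue
        distinct = len({t for _, t in span})   # query terms in the context
        width = span[-1][0] - span[0][0] + 1   # compactness of the span
        total += distinct / width              # one of many plausible weightings
    return total
```

The resulting `pseudo_tf` value would then replace raw term frequency inside the BM25 scoring formula, which is how the proximity evidence enters the model without scoring phrases as separate, overlapping units.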

1 Introduction

A document is usually represented as a bag of words in information retrieval theory, in order to make both the development of retrieval models easier and the retrieval operation tractable. People often observe that the independence assumption does not hold in textual data, and there has always been the feeling that term dependencies, if used correctly, should improve retrieval quality. Consequently, there has been much research on incorporating term dependence into retrieval models over the last few decades. Some recent work [10][17] on language models shows promising results by modeling dependencies on large web collections. However, most work on probabilistic models has not achieved consistent improvements.

This paper aims to extend the state-of-the-art probabilistic model BM25 to take advantage of term proximity. By surveying the literature, we find two problems in previous work. First, it is difficult to estimate the importance of phrases because they are different in nature from words, and it may not be appropriate to apply the same weighting schemes to them.

C. Macdonald et al. (Eds.): ECIR 2008, LNCS 4956, pp. 346–357, 2008. © Springer-Verlag Berlin Heidelberg 2008


Second, a naïve linear combination of the scores of words and those of phrases may break the non-linear property of term frequency. In probabilistic models, a non-linear term frequency is desirable because of the statistical dependence of term occurrences: the information gained on observing a term for the first time is greater than the information gained on observing it subsequently.
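This diminishing-returns property is visible in BM25's saturating term-frequency component, commonly written as tf·(k1+1)/(tf+k1). A minimal numerical check (length normalization omitted for brevity; k1=1.2 is a conventional default, not a value taken from this paper):

```python
def bm25_tf(tf, k1=1.2):
    """BM25's saturating term-frequency component (no length normalization)."""
    return tf * (k1 + 1) / (tf + k1)

# Marginal gain of each successive occurrence of the same term.
gains = [bm25_tf(n) - bm25_tf(n - 1) for n in range(1, 5)]
# The gains shrink monotonically: the first occurrence is worth the most.
```

A flat linear bonus for phrase matches, added on top of such scores, ignores this saturation, which is the second problem the paper identifies with combining word and phrase evidence linearly.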