Computing Information Retrieval Performance Measures Efficiently in the Presence of Tied Scores

Abstract. The Information Retrieval community uses a variety of performance measures to evaluate the effectiveness of scoring functions. In this paper, we show how to adapt six popular measures — precision, recall, F1, average precision, reciprocal rank, and normalized discounted cumulative gain — to cope with scoring functions that are likely to assign many tied scores to the results of a search. Tied scores impose only a partial ordering on the results, meaning that there are multiple possible orderings of the result set, each one performing differently. One approach to cope with ties would be to average the performance values across all possible result orderings; but unfortunately, generating result permutations requires super-exponential time. The approach presented in this paper computes precisely the same performance value as the approach of averaging over all permutations, but does so as efficiently as the original, tie-oblivious measures.
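
To make the problem concrete, the following minimal Python sketch (illustrative only, not taken from the paper; all names and data are assumptions) computes the brute-force quantity described above for a toy query: it enumerates every result ordering consistent with the tied scores and averages precision at rank k over them. The number of such orderings is the product of the factorials of the tie-group sizes, which is why the paper replaces this enumeration with equally cheap, tie-aware closed forms.

```python
# Illustrative sketch only: brute-force averaging of precision@k over all
# orderings consistent with tied scores. The paper computes the same value
# without enumerating orderings; names and data here are made up.
from itertools import permutations, product


def orderings_consistent_with_scores(results):
    """Yield every total order that sorts by score descending,
    permuting documents only within groups of equal score."""
    groups = {}
    for doc, score in results:
        groups.setdefault(score, []).append(doc)
    ordered_groups = [groups[s] for s in sorted(groups, reverse=True)]
    # Pick one permutation per tie group and concatenate the groups.
    for choice in product(*(permutations(g) for g in ordered_groups)):
        yield [doc for group in choice for doc in group]


def precision_at_k(ordering, relevant, k):
    return sum(1 for doc in ordering[:k] if doc in relevant) / k


# Three documents share the top score, so 3! = 6 consistent orderings exist.
results = [("a", 2.0), ("b", 2.0), ("c", 2.0), ("d", 1.0)]
relevant = {"a", "d"}
values = [precision_at_k(o, relevant, 2)
          for o in orderings_consistent_with_scores(results)]
print(sum(values) / len(values))  # 0.333...: expected precision@2 under random tie-breaking
```

Even for this four-result query with a single tie group, six orderings must be generated; for realistic result sets with many large tie groups the enumeration is infeasible, whereas the tie-aware measures of the paper cost no more than the ordinary ones.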

1 Introduction

One of the fundamental problems in Information Retrieval is the ranking problem: ordering the results of a query such that the most relevant results show up first. Ranking algorithms employ scoring functions that assign a score to each result of a query, where the score is an estimate of the result’s relevance to the query at hand. So, ranking the results of a query consists of assigning a score to each result and then sorting the results by score, from highest to lowest. Ranking algorithms are typically evaluated against a test collection consisting of a set of queries. Each query in the test collection has a set of results, and the results have been arranged into a (partial or total) order by a human judge. In order to evaluate a ranking algorithm, the algorithm is applied to the result set of each query, the distance between the computed ranking and the “optimal” ordering determined by the judge is measured, and the distances are averaged over the entire test collection. Coming up with suitable distance metrics (or performance measures) has been the subject of considerable research, and there are numerous such metrics.
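
As a point of reference, here is a minimal sketch (not from the paper; the function names, the binary relevance judgments, and the test-collection layout are illustrative assumptions) of the conventional, tie-oblivious evaluation loop just described: score every result, sort by score from highest to lowest, compute a per-query measure such as average precision, and average it over the test collection. Note that the sort breaks ties arbitrarily, which is exactly the behaviour examined in the remainder of the paper.

```python
# Illustrative, tie-oblivious evaluation loop (not the paper's code).
# Assumes binary relevance judgments and that every relevant document
# appears among the candidates for its query.


def average_precision(ranked_docs, relevant):
    """Mean of precision@i taken at the rank i of each relevant document."""
    hits, precision_sum = 0, 0.0
    for i, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / max(len(relevant), 1)


def mean_average_precision(test_collection, scoring_function):
    """Average the per-query measure over the whole test collection."""
    per_query = []
    for query, candidates, relevant in test_collection:
        # Rank by score, highest first. Python's sort is stable, so tied
        # documents keep their input order, i.e. an arbitrary tie-break.
        ranked = sorted(candidates,
                        key=lambda doc: scoring_function(query, doc),
                        reverse=True)
        per_query.append(average_precision(ranked, relevant))
    return sum(per_query) / len(per_query)
```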

Typically these performance measures assume that a ranking algorithm arranges the results of a query into a total ordering, i.e. that no two results of a query have the same score. This assumption is reasonable for scoring functions that map a rich set of features of the result document to a real-valued score, but it is less warranted for evaluating the performance of a single discrete feature, e.g. page in-degree, click count, and page visits. We stress that we are not concerned with evaluating the final results of a ranking system, which will almost certainly combine many features into a real-valued score, but rather with the internals of such a system, where the performance of individual, potentially discrete features needs to be assessed. A typical modern search engine will