Information Retrieval Models

This chapter introduces three classic information retrieval models: Boolean, vector space, and probabilistic. These models provide the foundations of query evaluation, the process that retrieves the relevant documents from a document collection upon a use

  • PDF / 211,001 Bytes
  • 11 Pages / 441 x 666 pts Page_size
  • 2 Downloads / 219 Views

DOWNLOAD

REPORT


Information Retrieval Models

Abstract This chapter introduces three classic information retrieval models: Boolean, vector space, and probabilistic. These models provide the foundations of query evaluation, the process that retrieves the relevant documents from a document collection upon a user’s query. The three models represent documents and compute their relevance to the user’s query in very different ways. We illustrate each of them separately and then compare their features.

3.1 Similarity and Matching Strategies So far, we have been discussing the representation of documents and queries and the techniques for document indexing. This is only part of the retrieval process. Another fundamental issue is the method for determining the degree of relevance of the user’s query with respect to the document representation, also called the matching process. In most practical cases, this process is expected to produce a ranked list of documents, where relevant documents should appear towards the top of the ranked list, in order to minimize the time spent by users in identifying relevant information. Ranking algorithms may use a variety of information sources, the frequency distribution of terms over documents, as well as other proprieties, e.g., in the Web search context, the “social relevance” of a page determined from the links that point to it. In this chapter, we introduce three classic information retrieval (IR) models. We start with the Boolean model, described in Sect. 3.2, the first IR model and probably also the most basic one. It provides exact matching; i.e., documents are either retrieved or not, and thus supports the construction of result sets in which documents are not ranked. Then, we follow Luhn’s intuition of adopting a statistical approach for IR [232]: he suggested to use the degree of similarity between an artificial document constructed from the user’s query and the representation of documents in the collection as a relevance measure for ranking search results. A simple way to do so is by counting the number of elements that are shared by the query and by the index representation of the document. This is the principle behind the vector space model, discussed in Sect. 3.3. S. Ceri et al., Web Information Retrieval, Data-Centric Systems and Applications, DOI 10.1007/978-3-642-39314-3_3, © Springer-Verlag Berlin Heidelberg 2013

27

28

3

Information Retrieval Models

Last, we illustrate the probabilistic indexing model. Unlike the previous ones, this model was not meant to support automatic indexing by IR systems; rather, it assumed a human evaluator to manually provide a probability value for each index term to be relevant to a document. An adaptation of this idea suitable for automatic IR is discussed in Sect. 3.4. Before we discuss the details of each specific model, let us first introduce a simple definition that we will use throughout the chapter. An IR model can be defined as an algorithm that takes a query q and a set of documents D = {d1 , . . . , dN } and associates a similarity coefficien