The Mathematics of Internet Search Engines

  • PDF / 888,590 Bytes
  • 32 Pages / 439.37 x 666.142 pts Page_size
  • 26 Downloads / 236 Views

DOWNLOAD

REPORT


The Mathematics of Internet Search Engines Fredrik K. Andersson · Sergei D. Silvestrov

Received: 8 December 2007 / Accepted: 10 June 2008 / Published online: 16 July 2008 © Springer Science+Business Media B.V. 2008

Abstract This article presents a survey of techniques for ranking results in search engines, with emphasis on link-based ranking methods and the PageRank algorithm. The problem of selecting, in relation to a user search query, the most relevant documents from an unstructured source such as the WWW is discussed in detail. The need for extending classical information retrieval techniques such as boolean searching and vector space models with link-based ranking methods is demonstrated. The PageRank algorithm is introduced, and its numerical and spectral properties are discussed. The article concludes with an alternative means of computing PageRank, along with some example applications of this new method. Keywords Citation ranking · PageRank · Search engines · Information retrieval · Text indexing · Ranking · Markov chains · Power method · Power series

1 Introduction Nowadays, an ever-increasing amount of human knowledge is placed in computerized repositories such as the World Wide Web. This gives rise to the problem of how to locate specific pieces of information in these often quite unstructured repositories, a problem that today is best solved through the use of search engines. A search engine is conceptually simple: a user provides a search query consisting of a few keywords that pertain to the documents to be retrieved from the repository. The search engine responds with an ordered list of documents that are deemed relevant to the query. The fundamental problem in constructing search engines is thus to determine relevance—when is a document relevant to a particular query? Search engine construction is based on ideas from the field of Information Retrieval, which

F.K. Andersson WorldLight.com AB, Kastanjeallén 1, 30231 Halmstad, Sweden e-mail: [email protected] S.D. Silvestrov () Centre for Mathematical Sciences, Lund University, Box 118, 22100 Lund, Sweden e-mail: [email protected]

212

F.K. Andersson, S.D. Silvestrov

provides methods for determining document relevancy by looking at the actual contents of documents. However, in uncontrolled environments, such as the WWW, these methods by themselves prove to be insufficient. Also, what may be worse, these methods are easily deceived by people seeking to exploit search engines for commercial gain. Modern search engines therefore consider weighting factors that are not deduced directly from the document contents. A common method, which is useful in hyperlinked information repositories such as the WWW, is to consider the link structure formed by the hyperlinks between documents. The most successful algorithm in this category is the PageRank algorithm, which was invented by the founders of the Google search engine. This article presents a survey of the ranking problem, with a special emphasis on linkbased ranking methods and the PageRa