Link Analysis

Ranking is perhaps the most important feature of a search engine, as it allows the user to efficiently order the huge amount of pages matching a query according to their relevance to the user’s information need. With respect to traditional textual search

  • PDF / 487,703 Bytes
  • 20 Pages / 441 x 666 pts Page_size
  • 15 Downloads / 179 Views

DOWNLOAD

REPORT


Link Analysis

Abstract Ranking is perhaps the most important feature of a search engine, as it allows the user to efficiently order the huge amount of pages matching a query according to their relevance to the user’s information need. With respect to traditional textual search engines, Web information retrieval systems build ranking by combining at least two evidences of relevance: the degree of matching of a page—the content score—and the degree of importance of a page—the popularity score. While the content score can be calculated using one of the information retrieval models described in Chap. 3, the popularity score can be calculated from an analysis of the indexed pages’ hyperlink structure using one or more link analysis models. In this chapter we introduce the two most famous link analysis models, PageRank and HITS.

7.1 The Web Graph The Web’s hyperlink structure can be represented as a directed graph, where each node represents a Web page and each arc represents a link. The source node of an arc is the Web page containing the link, while the destination node is the linked Web page. The resulting graph is simple: no self-loops are allowed, and even if there are multiple links between two pages only a single edge is placed. Figure 7.1 shows an example of a directed graph representing a Web of six pages. The graph can be described by an n × n adjacency matrix E, where for each couple of pages i and j , Ei,j = 1 if there is a link from i to j , and Ei,j = 0 otherwise. Considering the Web graph in Fig. 7.1, the resulting adjacency matrix is

P1 P2 P A= 3 P4 P5 P6

P1 0 ⎜ 0 ⎜ ⎜ 1 ⎜ ⎜ 1 ⎜ ⎝ 0 1 ⎛

P2 0 0 0 0 0 0

P3 0 0 0 0 1 0

P4 0 1 0 0 0 0

P5 1 1 0 0 0 0

P6 ⎞ 0 0 ⎟ ⎟ 0 ⎟ ⎟ 1 ⎟ ⎟ 0 ⎠ 0

S. Ceri et al., Web Information Retrieval, Data-Centric Systems and Applications, DOI 10.1007/978-3-642-39314-3_7, © Springer-Verlag Berlin Heidelberg 2013

(7.1)

91

92

7

Link Analysis

Fig. 7.1 An example of a page graph composed of 6 pages and 8 links

Hyperlinks directed into a page are referred as inlinks, whole those originating from a page are called outlinks. The indegree of a page i is calculated as the number of edges coming into i; likewise, the outdegree of node i is defined as the number of edges coming out from i. In the example of Fig. 7.1, page 1 has an indegree of 1 and an outdegree of 3. A directed graph is weakly connected if any node can be reached from any other node by traversing edges either in their indicated direction or in the opposite direction. The Web graph has been intensively studied over recent years, and it has been empirically shown to be far from strongly connected (intuitively, not all the pages in the graph are mutually linked). Many works addressed network properties of the Web, showing a scale-free nature of the link distributions, with small-world properties1 and preferential attachment [14]: the way new nodes link with other nodes is not driven by causality, but strongly depends on the node degrees—the more connected a node is, the more likely it is to receive new links. Theoretic