Representation by Graphs

A geometric space has the advantage that the similarity between any pair of points is independent of the presence and placement of any other points, no matter what the particular measure of similarity might be. This is computationally attractive, which is

  • PDF / 89,747 Bytes
  • 5 Pages / 439.37 x 666.142 pts Page_size
  • 19 Downloads / 167 Views

DOWNLOAD

REPORT


Representation by Graphs

A geometric space has the advantage that the similarity between any pair of points is independent of the presence and placement of any other points, no matter what the particular measure of similarity might be. This is computationally attractive, which is why it has been the basis of everything discussed so far. The global strategy using a statistical-geometry approach is to first build a space (and perhaps project it); then use a metric on that space to define local similarities; and then integrate these local similarities into clusters. In this chapter, we will consider an alternate path: first build local similarity measures between some (perhaps all) pairs of records; use these to build a graph representing the dataset; embed this graph into a statistical geometry and then cluster as before. Graphs consist of a set of nodes with weighted edges between some of the pairs of nodes. For us, the nodes will represent records, and the weights will represent similarity. Any notion of similarity will serve: we could use the dot product between records, or the root mean sum of the squared attribute differences—but, in this latter case, we no longer interpret this as Euclidean distance since there is no space for such a distance to exist in. This graph representation of a dataset is more abstract than a geometric representation. Before we can think about clustering in such a space, we need to extend the pairwise notion of similarity associated with edge weights to a notion of similarity between nodes that are not directly connected. In other words, we need to define the transitivity of similarity. Here there is a substantive decision that depends on the system from which the data came. One common way to extend local to global similarity is to make the global similarity the maximum of the sum of the local similarities along paths between the two nodes under consideration. This models settings in which similarity is related to ease of infection by some disease, or flow of information, where the “shortest” route between unconnected nodes is critical. An alternative is to make the global similarity depend not just on the most heavily weighted (that is, shortest) path between the two nodes, but also on the number of paths between them. This models settings in which similarity is related to influence between two records (it also represents the aggregate electrical resistance between D. B. Skillicorn, Understanding High-Dimensional Spaces, SpringerBriefs in Computer Science, DOI: 10.1007/978-3-642-33398-9_6, © The Author 2012

67

68

6 Representation by Graphs

two nodes if local similarity represents the reciprocal of resistance between two connected nodes).

6.1 Building a Graph from Records It is possible to cluster directly in the graph that represents a dataset, but this turns out to be difficult in practice. It is more usual to embed the graph into a geometric space and then cluster in that space. In other words, the overall construction is: • Compute the local similarities between pairs of reco