A Scalable Multilevel Algorithm for Graph Clustering and Community Structure Detection

One of the most useful measures of cluster quality is the modularity of the partition, which measures the difference between the number of the edges joining vertices from the same cluster and the expected number of such edges in a random (unstructured) gr

  • PDF / 441,128 Bytes
  • 12 Pages / 430 x 660 pts Page_size
  • 47 Downloads / 188 Views

DOWNLOAD

REPORT


Abstract. One of the most useful measures of cluster quality is the modularity of the partition, which measures the difference between the number of the edges joining vertices from the same cluster and the expected number of such edges in a random (unstructured) graph. In this paper we show that the problem of finding a partition maximizing the modularity of a given graph G can be reduced to a minimum weighted cut problem on a complete graph with the same vertices as G. We then show that the resulted minimum cut problem can be efficiently solved with existing software for graph partitioning and that our algorithm finds clusterings of a better quality and much faster than the existing clustering algorithms.

1

Introduction

One way to analyze and understand the information contained in the huge amount of data available on the WWW and the relationships between the individual items is to organize them into ”communities,” maximal groups of related items. Determining the communities is of great theoretical and practical importance since they correspond to entities such as collaboration networks, online social networks, scientific publications or news stories on a given topic, related commercial items, etc. Communities also arise in other types of networks such as computer and communication networks (the Internet, ad-hoc networks) and biological networks (protein interaction networks, genetic networks). The problem of identifying communities in a network is usually modeled as a graph clustering (GC) problem, where vertices correspond to individual items and edges describe relationships. Then the communities correspond to clusters with a lot of edges between vertices belonging to the same subgraph (called incluster edges) and fewer edges between vertices from different subgraphs (called between-cluster edges). The GC problem has been intensively studied in the recent years in relation to its applications in the analysis of networks. Girvan and Newman propose in [11], [18] algorithms based on the betweenness of the edges of a graph, a characteristic that measures the number of the shortest paths in a graph that use any given edge. In [15] Newman describes an algorithm based on a characteristic of clustering quality called modularity, a measure that takes 

This work has been supported by the Department of Energy under contract W-705ENG-36.

W. Aiello et al. (Eds.): WAW 2006, LNCS 4936, pp. 117–128, 2008. c Springer-Verlag Berlin Heidelberg 2008 

118

H.N. Djidjev

into account the number of in-cluster edges and the expected number of such edges. (We formally define and discuss modularity in more detail in the next section.) A faster version of the algorithm from [15] was described by Clauset et al. in [6]. Several algorithms have been proposed based on the eigenvectors of the graph Laplacian, e.g., [19], [16]. In all previous cases the algorithms reported in the literature are either not fast enough, or are inaccurate. In this paper we will describe a new approach for GC that uses our newly discovered relationship between the