A Dynamic Programming Framework for Large-Scale Online Clustering on Graphs

  • PDF / 948,198 Bytes
  • 17 Pages / 439.37 x 666.142 pts Page_size
  • 51 Downloads / 218 Views

DOWNLOAD

REPORT


A Dynamic Programming Framework for Large-Scale Online Clustering on Graphs Yantao Li1 · Xiang Zhao2 · Zehui Qu2

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract As a fundamental technique for data analysis, graph clustering grouping graph data into clusters has attracted great attentions in recent years. In this paper, we present D P OC G, a dynamic programming framework for large-scale online clustering on graphs, which improves the scalability of a wide range of graph clustering algorithms. Specifically, D P OC G first identifies the nodes whose states are unchanged compared with the states at the previous time on a large-scale graph, then constructs these unchanged nodes as supernodes, which greatly reduces the size of the graph at the current time, and collapses nodes whose degrees are less than a predefined threshold. Based on our density-based graph clustering algorithm (DGC M), D P OC G partitions the reduced graph into clusters. In addition, we theoretically analyze D P OC G in terms of supernode generation, clustering on reduced graph, and computational complexity. We evaluate D P OC G on a synthetic dataset and seven real-world datasets, respectively, and the experimental results show that D P OC G consumes less running time and improves the efficiency of clustering. Keywords Online graph clustering · Large-scale graphs · supernodes · Running time · Efficiency

1 Introduction Graph data are composed of a large number of data objects (nodes) interconnected with each other by proper links (relationships), which have become more prominent in people’s daily life, such as friendship graphs in karate [1] and Facebook [2], interchange graphs in emails [3], co-author graphs in bibliographic data DBLP, and social connections in Twitter. In machine learning applications, pairwise similarities between data objects can be modeled as a graph, and the corresponding data clustering can be viewed as graph clustering. Graph clustering or community detection is typically used to identify clusters with closely connected nodes, such that nodes with higher similarity (affinity) can be partitioned into the same cluster and

B

Yantao Li [email protected]; [email protected]; [email protected]

1

College of Computer Science, Chongqing University, Chongqing 400044, China

2

College of Computer and Information Sciences, Southwest University, Chongqing 400715, China

123

Y. Li et al.

nodes within the same cluster have more connections than those in different clusters [4]. The goal of graph clustering is to integrate the tightly connected nodes into the same cluster. A wide variety of community structure detection (graph clustering) methods have been proposed and developed in different scientific fields. The partitioning approaches [5–7], such as k-means [8,9] and k-medoids [10], are the earliest methods for community detection, which create different segments of the graph and then evaluate the clustering results with the same criteria. Many community detection algorithms have been developed