Ensemble-based clustering of large probabilistic graphs using neighborhood and distance metric learning

  • PDF / 2,893,839 Bytes
  • 28 Pages / 439.37 x 666.142 pts Page_size
  • 33 Downloads / 164 Views

DOWNLOAD

REPORT


Ensemble‑based clustering of large probabilistic graphs using neighborhood and distance metric learning Malihe Danesh1 · Morteza Dorrigiv1   · Farzin Yaghmaee1 Accepted: 3 September 2020 © Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract Graphs are commonly used to express the communication of various data. Faced with uncertain data, we have probabilistic graphs. As a fundamental problem of such graphs, clustering has many applications in analyzing uncertain data. In this paper, we propose a novel method based on ensemble clustering for large probabilistic graphs. To generate ensemble clusters, we develop a set of probable possible worlds of the initial probabilistic graph. Then, we present a probabilistic co-association matrix as a consensus function to integrate base clustering results. It relies on co-occurrences of node pairs based on the probability of the corresponding common cluster graphs. Also, we apply two improvements in the steps before and after of ensembles generation. In the before step, we append neighborhood information based on node features to the initial graph to achieve a more accurate estimation of the probability between the nodes. In the after step, we use supervised metric learning-based Mahalanobis distance to automatically learn a metric from ensemble clusters. It aims to gain crucial features of the base clustering results. We evaluate our work using five real-world datasets and three clustering evaluation metrics, namely the Dunn index, Davies–Bouldin index, and Silhouette coefficient. The results show the impressive performance of clustering large probabilistic graphs. Keywords  Probabilistic graph · Ensemble clustering · Distance metric learning · Neighborhood

* Morteza Dorrigiv [email protected] Malihe Danesh [email protected] Farzin Yaghmaee [email protected] 1



Faculty of Electrical and Computer Engineering, Semnan University, Semnan, Iran

13

Vol.:(0123456789)



M. Danesh et al.

1 Introduction In the era of big data, data management systems often face uncertainty. Also, many applications are increasingly generating interdependent data, which may lead to uncertainty between the data. In these cases, probabilistic graphs provide a method to visualize these data, where uncertainty is assigned by probable value to each edge. In the probabilistic graphs, each edge is labeled by a probability of data uncertainty. For example, in the protein interaction network, an edge between two interactive proteins corresponds to the interaction obtained through a noise test. Thus, due to the limitations of the measurement model, it represents uncertainty in the form of a probability value. Besides, in social networks, the probability of the existence of an edge between two people may be used to model the truthiness of a relation or influence one over the other one. With the rising growth of applications with uncertain data, probabilistic graph mining has become increasingly essential. Hence, some mining problems have recently been studied for this pu