Is-ClusterMPP: clustering algorithm through point processes and influence space towards high-dimensional data


Khadidja Henni¹ · Pierre-Yves Louis² · Brigitte Vannier³ · Ahmed Moussa⁴

Received: 29 March 2018 / Revised: 18 September 2019 / Accepted: 19 November 2019
© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Abstract
Clustering via marked point processes and influence space, Is-ClusterMPP, is a new unsupervised clustering algorithm based on adaptive MCMC sampling of a marked point process of interacting balls. The designed Gibbs energy cost function makes use of k-influence space information. The algorithm detects clusters of different shapes and sizes and with unbalanced local densities, and it also aims at dealing with high-dimensional datasets. By using the k-influence space, Is-ClusterMPP handles local heterogeneity in densities and keeps the global density from affecting the detection of unbalanced classes. This concept also reduces the number of input values. The curse of dimensionality is handled by a local subspace clustering principle embedded in a weighted similarity metric. Balls covering data points constitute a configuration sampled from a marked point process (MPP). Due to the choice of the energy function, they tend to cover neighboring data that share the same cluster. The statistical model of random balls is sampled through a Markov chain Monte Carlo (MCMC) dynamical approach. The energy balances different goals: (1) a data-driven term, defined according to the k-influence space, favors covering data in high-density regions with a ball; (2) an interaction term prevents balls from fully overlapping and favors connected groups of balls. Through the Markov dynamics, the algorithm converges towards configurations sampled from the MPP model. The algorithm has been applied to real benchmarks, namely gene expression data sets of various sizes. Experiments comparing Is-ClusterMPP against the most well-known clustering algorithms are reported in support of its efficiency.
Keywords Density-based clustering · Influence space · Marked point processes · Spatial data analysis · Gibbs cost/objective function · MCMC/Monte Carlo technique · High-dimensional real data sets
Mathematics Subject Classification 62H30 · 62H11 · 60G55 · 65C05
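
The abstract relies on the k-influence space without defining it. As a rough illustration only, the sketch below assumes one definition used in the density-based clustering literature: the k-influence space of a point p is the set of points that are among the k nearest neighbours of p and that also count p among their own k nearest neighbours. This definition, the Euclidean distance, and the names `k_nearest` and `influence_space` are assumptions made for this example, not the authors' implementation.

```python
import math

def k_nearest(points, k):
    """For each point index i, return the set of indices of its k nearest neighbours."""
    neighbours = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Euclidean distance (an assumption); the sort is stable, so ties keep index order.
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbours.append(set(others[:k]))
    return neighbours

def influence_space(points, k):
    """ASSUMED definition: IS_k(p) = kNN(p) intersected with the reverse kNN of p."""
    knn = k_nearest(points, k)
    return [{j for j in knn[i] if i in knn[j]} for i in range(len(points))]

# Two small dense groups and one isolated point (toy data, not from the paper).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 30)]
spaces = influence_space(points, k=2)
for idx, space in enumerate(spaces):
    print(idx, sorted(space))
```

On this toy data the isolated point at index 6 has an empty influence space, since none of its nearest neighbours counts it among their own, while each point inside a dense group keeps both of its group neighbours. Under the assumed definition, this is how a single parameter k can reflect local density without a global density threshold.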

Extended author information available on the last page of the article


K. Henni et al.

1 Introduction
Digital data take a prominent place in today's technological world, science and society. This is the well-known big data challenge (Elgendy and Elragal 2014). Nevertheless, the value of data lies not in their size but in the useful information gained by exploiting and analyzing them. Clustering, or unsupervised classification, is one of the most widely used techniques in today's statistics and data analysis. Data are represented as points in some representation/feature space, such as R^d, and grouped into different clusters according to a similarity function. A well-known example of such data are gene