Analytical review of clustering techniques and proximity measures



Vivek Mehta1 · Seema Bawa1 · Jasmeet Singh1

1 Computer Science and Engineering Department, Thapar Institute of Engineering and Technology, Patiala, Punjab 147001, India

© Springer Nature B.V. 2020

Abstract  One of the most fundamental approaches to learning from and understanding any type of data is to organize it into meaningful groups (clusters) and then analyze them, a process known as cluster analysis. During this grouping process, proximity measures play a significant role in deciding the degree of similarity between two objects. Moreover, before applying any learning algorithm to a dataset, several preprocessing aspects must be considered, such as dealing with the sparsity of the data, leveraging the correlation among features, and normalizing the scales of different features. In this study, various proximity measures are discussed and analyzed with respect to these aspects. In addition, a theoretical procedure for selecting a proximity measure for clustering purposes is proposed; this procedure can also be used when designing a new proximity measure. Furthermore, clustering algorithms of different categories are reviewed and compared experimentally on datasets from several domains, chosen so that they range from a very low to a very high number of dimensions. Finally, the effect of using different proximity measures in partitional and hierarchical clustering techniques is analyzed experimentally.

Keywords  Unsupervised learning · Hierarchical clustering · Partitional clustering · Proximity measures
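The interaction between feature scaling and the choice of proximity measure mentioned in the abstract can be illustrated with a short sketch. The data values and helper functions below are not taken from the paper; they are illustrative assumptions only.

```python
# Illustrative sketch: how feature scaling interacts with two common
# proximity measures (Euclidean distance and cosine similarity).
import numpy as np

def min_max_scale(X):
    """Rescale each feature (column) of X to the [0, 1] range."""
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero for constant features
    return (X - X.min(axis=0)) / span

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two objects whose features live on very different scales.
X = np.array([[180.0, 0.02],
              [175.0, 0.90]])

print(euclidean(X[0], X[1]))          # dominated by the large-scale feature
Xs = min_max_scale(X)
print(euclidean(Xs[0], Xs[1]))        # after scaling, both features contribute comparably
print(cosine_similarity(X[0], X[1]))  # cosine depends on orientation, not magnitude
```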

1 Introduction

Clustering is one of the most essential techniques applied across a wide range of domains such as image segmentation, text mining, market research and finance. It segregates a collection of data points into separate groups (clusters) for "maximizing intraclass similarity and minimizing interclass similarity" (Han et al. 2011). Thus, all similar points are grouped into a cluster, and the clusters themselves are dissimilar to each other. This partitioning is performed using a proximity measure, a density measure or another similar measure.
Unlike the process of classification, which requires labels for data points, clustering does not require knowledge of labels to recognize patterns in a given dataset. This is particularly significant because, in many situations, it may be tedious or expensive to gather labeling information for a dataset (as in the case of images and web documents). The broad categories of clustering methods are: hierarchical, partitional, density-based, grid-based and model-based. K-means is a widely used partitional clustering algorithm in which the sum of squared distances between the center of each cluster and its data points is minimized to obtain an optimal set of clusters.
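As a concrete illustration of this objective, the following sketch (not part of the original paper; the toy data, function names and parameter choices are assumed for illustration) implements a plain Lloyd's-algorithm version of k-means in Python/NumPy and reports the within-cluster sum of squared distances that the method minimizes.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centre-update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centres
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre (squared Euclidean).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centre moves to the mean of the points assigned to it.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Objective value: within-cluster sum of squared distances.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    sse = d2[np.arange(len(X)), labels].sum()
    return labels, centers, sse

# Toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
labels, centers, sse = kmeans(X, k=2)
print(centers)
print(sse)
```

Swapping the squared Euclidean distance in the assignment step for another proximity measure changes which points are considered "close", which is precisely the effect studied experimentally later in the paper.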