cs-means: Determining optimal number of clusters based on a level-of-similarity


Rabindra Lamsal · Shubham Katiyar
Received: 28 May 2020 / Accepted: 22 September 2020
© Springer Nature Switzerland AG 2020

Abstract
This paper proposes a centroid-based clustering algorithm, cs-means, which clusters data-points with n features without requiring the number of clusters to be specified in advance. The core logic of the algorithm is a similarity measure that decides whether to assign an incoming data-point to a pre-existing cluster, or to create a new cluster and assign the data-point to it. The algorithm is application-specific: it applies when a stream of data-points must be clustered such that the similarity between an incoming data-point and the cluster it joins is higher than a predefined level-of-similarity (cluster strictness). The algorithm was evaluated on 4 public datasets and 10 isotropic Gaussian blobs. The cluster analysis confirms the objectives of the proposed clustering algorithm.

Keywords: Unsupervised · Clustering · Centroid-based
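The assign-or-create rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the similarity function (1 / (1 + Euclidean distance)), the running-mean centroid update, and the names `similarity`, `StreamingClusterer`, and `strictness` are all assumptions made for illustration, since the abstract does not define the measure itself.

```python
import math


def similarity(point, centroid):
    # Assumed measure (not taken from the paper): maps Euclidean
    # distance into (0, 1], where 1 means the points coincide.
    return 1.0 / (1.0 + math.dist(point, centroid))


class StreamingClusterer:
    """Hypothetical streaming, centroid-based clusterer."""

    def __init__(self, strictness):
        self.strictness = strictness  # predefined level-of-similarity
        self.centroids = []           # one centroid per cluster
        self.counts = []              # number of points per cluster

    def add(self, point):
        """Assign one incoming point and return its cluster index."""
        best, best_sim = None, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = similarity(point, centroid)
            if sim > best_sim:
                best, best_sim = i, sim

        if best is not None and best_sim > self.strictness:
            # Similar enough: join the cluster and move its centroid
            # to the running mean of its members (assumed update rule).
            n = self.counts[best] + 1
            self.centroids[best] = [
                c + (x - c) / n for c, x in zip(self.centroids[best], point)
            ]
            self.counts[best] = n
            return best

        # No existing cluster is similar enough: open a new one.
        self.centroids.append(list(point))
        self.counts.append(1)
        return len(self.centroids) - 1


clusterer = StreamingClusterer(strictness=0.5)
stream = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.2, 10.0)]
labels = [clusterer.add(p) for p in stream]
print(labels)  # prints [0, 0, 1, 1]
```

With this assumed measure, a strictness of 0.5 corresponds to a distance of less than 1 from a centroid. Raising the strictness produces more, tighter clusters, so the number of clusters follows from the threshold rather than being fixed beforehand.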

1 Introduction
The main idea behind clustering is to group data-points into clusters based on their features, i.e. properties. The generation of clusters varies application-wise [17], because it depends on which factors are taken into consideration to form a particular cluster. However, the focus of every clustering algorithm remains the same: to group similar data-points into a common cluster. What differs is how this goal is achieved. Different algorithms use different strategies to define the similarity measure among the data-points; centroid-based [9], density-based [7], and graph-based [10] strategies are commonly used.

Often-used algorithms such as k-means, hierarchical clustering, and DBSCAN require a set of data-points in space beforehand. Without adjustments to these pre-existing clustering algorithms, it is not possible to cluster a stream of real-time data-points. A real-world problem arises when a set of data-points must be grouped but the number of clusters the data-points can generate is unknown. Therefore, considering the real-time limitation of the algorithms mentioned above, this paper proposes a clustering algorithm that can group incoming data-points without having to initialize the number of clusters to be formed. Hereinafter, the proposed clustering algorithm is termed cs-means, with “cs” standing for “cluster strictness”. The main objectives of the proposed algorithm are to: (1) address problems that require clustering of data-points based on some predefined level-of-similarity, (2) introduce a real-time approach to a centroid-based clustering algorithm, (3) determine the optimal number of cl