cs-means: Determining optimal number of clusters based on a level-of-similarity

Rabindra Lamsal · Shubham Katiyar

Received: 28 May 2020 / Accepted: 22 September 2020
© Springer Nature Switzerland AG 2020

Abstract

This paper proposes a centroid-based clustering algorithm, cs-means, which can cluster data-points with n features without requiring the number of clusters to be specified in advance. The core logic of the algorithm is a similarity measure that decides whether to assign an incoming data-point to a pre-existing cluster or to create a new cluster and assign the data-point to it. The algorithm is application-specific: it applies when a stream of data-points must be clustered such that the similarity between an incoming data-point and the cluster it is assigned to is higher than a predefined level-of-similarity (cluster strictness). The algorithm was evaluated on 4 public datasets and 10 isotropic Gaussian blobs. The cluster analysis strongly supports the objectives of the proposed clustering algorithm.

Keywords Unsupervised · Clustering · Centroid-based
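The assign-or-create decision described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the excerpt does not define the similarity measure, so the sketch assumes a stand-in, 1 / (1 + Euclidean distance), and the names `similarity`, `stream_cluster`, and `strictness` are hypothetical. Centroids are kept as running means, which is likewise an assumption.

```python
import math


def similarity(point, centroid):
    """Assumed stand-in measure (not from the paper): 1 / (1 + Euclidean distance)."""
    return 1.0 / (1.0 + math.dist(point, centroid))


def stream_cluster(points, strictness):
    """Process points one at a time; join the most similar cluster or open a new one."""
    centroids = []  # running mean of each cluster
    counts = []     # number of points assigned to each cluster
    labels = []     # cluster index chosen for each incoming point
    for p in points:
        # Find the existing cluster whose centroid is most similar to p.
        best, best_sim = -1, -1.0
        for i, c in enumerate(centroids):
            s = similarity(p, c)
            if s > best_sim:
                best, best_sim = i, s
        if best >= 0 and best_sim > strictness:
            # Similar enough: assign and update the centroid incrementally.
            counts[best] += 1
            n = counts[best]
            centroids[best] = [c + (x - c) / n for c, x in zip(centroids[best], p)]
        else:
            # No cluster meets the level-of-similarity: create a new one.
            centroids.append(list(p))
            counts.append(1)
            best = len(centroids) - 1
        labels.append(best)
    return labels, centroids


if __name__ == "__main__":
    stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
    labels, centroids = stream_cluster(stream, strictness=0.5)
    print(labels, len(centroids))
```

Under this assumed measure, a higher `strictness` demands that points lie closer to a centroid before joining it, so the number of clusters is a consequence of the threshold and is not supplied by the caller.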

1 Introduction

The central idea of clustering is to group data-points into clusters based on their features, i.e. properties. How clusters are generated varies application-wise [17], because it depends on which factors are taken into consideration when forming a particular cluster. The goal of every clustering algorithm nevertheless remains the same: to group similar data-points into a common cluster. What differs is how that goal is achieved, since different algorithms use different concepts to define the similarity measure among data-points. Centroid-based [9], density-based [7], and graph-based [10] strategies are among the most commonly used. Frequently used algorithms such as k-means, hierarchical clustering, and DBSCAN require the set of data-points in space beforehand. Without some adjustments to these pre-existing clustering algorithms, it is not possible to cluster a stream of real-time data-points.

A real-world problem arises when a set of data-points must be grouped but the number of clusters the data-points can generate is unknown. Therefore, considering the real-time limitation of the algorithms mentioned above, this paper proposes a clustering algorithm that can group incoming data-points without having to initialize the number of clusters to be formed. Hereinafter, the proposed clustering algorithm is termed cs-means, with “cs” standing for “cluster strictness”. The main objectives of the proposed algorithm are to: (1) support problems that require clustering of data-points based on some predefined level-of-similarity, (2) introduce a real-time approach to a centroid-based clustering algorithm, (3) determine the optimal number of cl
