Adaptive Determining for Optimal Cluster Number of K-Means Clustering Algorithm



Jiankai Sun, Zhong Li, Fengyuan Zou and Yunchu Yang

Abstract  The clustering accuracy of the K-means algorithm depends strongly on the initial number of clusters, and the algorithm takes a long time on large, high-dimensional sample data. To address these problems, this paper proposes a method that reduces the dimensionality of high-dimensional data by a multidimensional scaling transformation and designs a measure that can effectively evaluate the quality of a kernel clustering result. Furthermore, an adaptive method for determining the optimal cluster number is presented: it first predicts the initial cluster number in a low-dimensional space by tree clustering, and the optimal cluster number is then obtained by the adaptive algorithm. Experiments show that this method has higher accuracy and stability.
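The pipeline the abstract outlines — reduce dimensionality by multidimensional scaling, then cluster in the low-dimensional space — can be sketched in NumPy. This is a minimal illustration using the standard formulations of classical MDS and Lloyd's K-means, not the authors' implementation; the tree-clustering prediction and the adaptive selection of the cluster number are omitted.

```python
import numpy as np

def classical_mds(X, dim=2):
    """Classical (Torgerson) MDS: embed rows of X into `dim` dimensions."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distances
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    B = -0.5 * J @ D2 @ J                            # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)                   # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:dim]               # keep the largest `dim`
    scale = np.sqrt(np.maximum(vals[idx], 0.0))
    return vecs[:, idx] * scale

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's K-means with random initial centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

Running `kmeans` on the MDS embedding rather than on the raw high-dimensional data is what saves time when the sample dimension is large, since each distance computation then costs O(dim) instead of O(original dimension).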



Keywords  Optimal cluster number · Kernel cluster transformation · Dimensionality reduction · Multidimensional scaling

J. Sun · Z. Li (corresponding author) · F. Zou · Y. Yang
Department of Mathematical Sciences, Zhejiang Sci-Tech University, 310018 Hangzhou, China
e-mail: [email protected]

W. Lu et al. (eds.), Proceedings of the 2012 International Conference on Information Technology and Software Engineering, Lecture Notes in Electrical Engineering 211, DOI: 10.1007/978-3-642-34522-7_59, © Springer-Verlag Berlin Heidelberg 2013

59.1 Introduction

Cluster analysis, also called unsupervised learning, divides a sample into several clusters without training; its principle is to make the correlation of data within groups higher than between groups. It has therefore become a very active research topic in data mining and related fields [1].

The K-means clustering algorithm is a widely used iterative descent clustering algorithm. However, it has some shortcomings: (1) K-means clusters data with a circular or nearly circular distribution well, but performs poorly on non-circularly distributed data. (2) The quality of the K-means clustering result depends on the selection of the initial centers and the initial cluster number.

Many improved methods have been proposed to address these shortcomings. To determine better initial cluster centers, Yuan and Zhou [2] used a sampling method to improve the quality of the initial central points. Zalik [3] provided a recursive method to determine the best initial cluster centers, enabling the K-means algorithm to achieve globally optimal convergence. Likas et al. [4, 5] presented the kernel clustering algorithm, a good solution to the problem that K-means cannot handle non-circularly distributed data. Kernel clustering uses a Mercer kernel [6] to map the input samples into a high-dimensional feature space and clusters them in that feature space, so a nonlinear classification problem is transformed into a linear one. The kernel function, which is intro
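The kernel clustering idea described above can be sketched in NumPy as RBF kernel K-means — a minimal textbook illustration, not the authors' algorithm. The key point is that cluster means live only implicitly in feature space: the distance from a mapped point to a cluster mean is computed entirely from kernel values. The `labels` parameter for a deterministic initial assignment is an addition for reproducibility.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def kernel_kmeans(K, k, labels=None, iters=100, seed=0):
    """Kernel K-means on a precomputed kernel matrix K."""
    rng = np.random.default_rng(seed)
    n = len(K)
    labels = rng.integers(0, k, n) if labels is None else labels.copy()
    for _ in range(iters):
        dist = np.zeros((n, k))
        for c in range(k):
            mask = labels == c
            nc = mask.sum()
            if nc == 0:
                dist[:, c] = np.inf
                continue
            # ||phi(x_i) - m_c||^2 = K_ii - 2/|c| * sum_j K_ij + 1/|c|^2 * sum_jl K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, mask].sum(axis=1) / nc
                          + K[np.ix_(mask, mask)].sum() / nc ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

Because every step only touches kernel entries, the feature-space mapping never needs to be computed explicitly, which is exactly how the nonlinear problem becomes a linear one in feature space.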