Enhanced synchronization-inspired clustering for high-dimensional data


ORIGINAL ARTICLE

Lei Chen1 · Qinghua Guo1 · Zhaohua Liu1 · Shiwen Zhang1 · Hongqiang Zhang1

Received: 16 May 2020 / Accepted: 17 August 2020
© The Author(s) 2020

Abstract The synchronization-inspired clustering algorithm (Sync) is a novel clustering algorithm that can accurately cluster datasets of arbitrary shape, density, and distribution. However, high-dimensional datasets, with their high noise and high redundancy, pose new challenges for Sync: clustering time increases significantly and clustering accuracy decreases. To address these challenges, this paper develops an enhanced synchronization-inspired clustering algorithm, SyncHigh, that clusters high-dimensional datasets quickly and accurately. First, a dimension purification strategy based on principal component analysis (PCA) is designed to find the principal components among all attributes. Second, a density-based data merge strategy is constructed to reduce the number of objects participating in synchronization-inspired clustering, thereby reducing clustering time. Third, the Kuramoto model is enhanced to handle the mass differences between objects caused by the density-based data merge strategy. Finally, extensive experiments on synthetic and real-world datasets demonstrate the effectiveness and efficiency of SyncHigh.

Keywords Synchronization-inspired · Clustering · High-dimensional dataset · Local density
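To make the three ingredients named in the abstract concrete, the following Python sketch chains a PCA projection with a mass-weighted, Kuramoto-style synchronization step. It is an illustration under stated assumptions, not the SyncHigh implementation: the abstract does not give the update rule, so the sine-coupled neighborhood update, the use of an object's `mass` as a coupling weight, the radius `eps`, and the function names `pca_reduce`, `sync_step`, and `synchronize` are all hypothetical. The density-based merge itself is omitted; `mass` simply stands in for the number of original points a merged object might represent.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto their top principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def sync_step(P, mass, eps):
    """One mass-weighted Kuramoto-style update (assumed form, see lead-in).

    Each object moves, per dimension, by the mass-weighted mean of
    sin(neighbor - self) over all objects within distance eps.
    """
    diff = P[None, :, :] - P[:, None, :]      # diff[i, j] = P[j] - P[i]
    dist = np.linalg.norm(diff, axis=2)
    # An object is its own neighbor; it contributes sin(0) = 0 to the pull
    # but keeps the normalizing weight strictly positive.
    weight = (dist <= eps) * mass[None, :]
    pull = (weight[:, :, None] * np.sin(diff)).sum(axis=1)
    return P + pull / weight.sum(axis=1, keepdims=True)


def synchronize(P, mass, eps, n_iter=20):
    """Repeat the update a fixed number of times and return final positions."""
    for _ in range(n_iter):
        P = sync_step(P, mass, eps)
    return P
```

With equal masses, the update reduces to an unweighted neighborhood average of sine-coupled differences; with unequal masses, heavier objects pull their neighbors more strongly and are themselves displaced less. Objects that end at the same position after synchronization can then be read off as one cluster.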

✉ Qinghua Guo [email protected]
Lei Chen [email protected]
Zhaohua Liu [email protected]
Shiwen Zhang [email protected]
Hongqiang Zhang [email protected]

1 School of Information and Electrical Engineering, Hunan University of Science and Technology, Xiangtan, China

Introduction

Clustering is an unsupervised way to uncover the hidden rules and patterns of human society; it is an indispensable means of mining complex real-world data [1]. Over the past few decades, a large number of clustering algorithms have been proposed and extended, and they have demonstrated their power in fields such as transportation, meteorology, and biology [2]. However, with the advent of the big data era, complex data in various applications have hundreds of thousands of dimensions and are characterized by high noise, irregularity, and imbalance [3]. In addition, data dimensionality keeps rising, showing exponential growth. Faced with these new features, traditional clustering algorithms perform poorly. The main reasons are as follows: (1) complex data lie in a high-dimensional space and are difficult to process; (2) high-dimensional data contain many redundant and noisy attributes; (3) the data are unevenly distributed, and the datasets take various irregular shapes; and (4) many outliers are hidden in high-dimensional data. To complete the high-dimen