Rethinking k -means clustering in the age of massive datasets: a constant-time approach
S.I. : 2018 INDIA INTL. CONGRESS ON COMPUTATIONAL INTELLIGENCE
P. Olukanmi1 · F. Nelwamondo1,2 · T. Marwala1

Received: 2 July 2019 / Accepted: 6 December 2019
© Springer-Verlag London Ltd., part of Springer Nature 2019
Abstract

We introduce a highly efficient k-means clustering approach. We show that the classical central limit theorem addresses a special case (k = 1) of the k-means problem and then extend it to the general case. Instead of using the full dataset, our algorithm, named k-means-lite, applies the standard k-means to the combination C (size nk) of all sample centroids obtained from n independent small samples. Unlike ordinary uniform sampling, the approach asymptotically preserves the performance of the original algorithm. In our experiments with a wide range of synthetic and real-world datasets, k-means-lite matches the performance of k-means when C is constructed using 30 samples of size 40 + 2k. Although the 30-sample choice proves to be a generally reliable rule, when the proposed approach is used to scale k-means++ (we call this scaled version k-means-lite++), k-means++'s performance is matched in several cases using only five samples. These two new algorithms are presented to demonstrate the proposed approach, but the approach can be applied to create a constant-time version of any other k-means clustering algorithm, since it does not modify the internal workings of the base algorithm.

Keywords: k-means · Clustering · Efficiency · Scalable · Large datasets
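To make the pooled-centroid idea concrete, the following is a minimal NumPy sketch of the procedure described in the abstract: draw n small samples, run the base algorithm (here a plain Lloyd's k-means, standing in for any base k-means variant) on each, pool the resulting sample centroids into C, then cluster C itself. Function names, the Lloyd's subroutine, and all parameter defaults other than the 30 samples of size 40 + 2k are our own illustration, not the authors' code.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=50, rng=None):
    """Plain Lloyd's k-means; returns a (k, d) array of centroids."""
    rng = np.random.default_rng(rng)
    # initialize centroids as k distinct random points from X
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign each point to its nearest centroid (squared Euclidean)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):  # leave empty clusters at their old centroid
                centroids[j] = pts.mean(axis=0)
    return centroids

def k_means_lite(X, k, n_samples=30, rng=None):
    """Sketch of k-means-lite: run the base k-means on n_samples small
    uniform samples, pool the nk sample centroids into C, then run the
    base k-means once more on C to obtain the final centroids."""
    rng = np.random.default_rng(rng)
    m = min(40 + 2 * k, len(X))  # sample size suggested in the abstract
    sample_centroids = []
    for _ in range(n_samples):
        sample = X[rng.choice(len(X), size=m, replace=False)]
        sample_centroids.append(lloyd_kmeans(sample, k, rng=rng))
    C = np.vstack(sample_centroids)  # shape (n_samples * k, d)
    return lloyd_kmeans(C, k, rng=rng)
```

Because each of the n + 1 k-means runs operates on a fixed-size input (m points, or nk centroids), the cost does not grow with the dataset size, which is the sense in which the approach is constant-time.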
Correspondence: P. Olukanmi

1 Institute for Intelligent Systems, University of Johannesburg, Johannesburg, South Africa
2 Council for Scientific and Industrial Research, Next Generation Enterprises and Institutions, Johannesburg, South Africa

1 Introduction

The emerging Fourth Industrial Revolution [1, 2] is characterized by applications that rely on large datasets and intelligent analysis [3]. This new 'data era' calls for analysis methods that can handle massive datasets efficiently and accurately. This paper presents such an algorithm for a widely studied data analysis problem, namely clustering [2–4]. Clustering, which involves organizing objects into natural groups [7], is a fundamental mode of learning, understanding and intelligence [8]. It is widely studied in
data mining, pattern recognition, machine learning and other scientific disciplines such as biology [9] and psychology [10]. Jain et al. [8, 11, 12] have presented in-depth discussions of clustering techniques and challenges. A more recent survey is that of Wong [13], which extensively discusses classes of clustering techniques, including partitional, hierarchical, density-based, grid-based, correlational, spectral and herd clustering. Li and Ding [12] presented a survey of nonnegative factorization methods for clustering. The k-means approach is, perhaps, the most