K-DBSCAN: An improved DBSCAN algorithm for big data


Nahid Gholizadeh¹ · Hamid Saadatfar¹ · Nooshin Hanafi¹

Accepted: 16 November 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Big data storage and processing are among the most important challenges today. Among data mining algorithms, DBSCAN is a common clustering method, and one of its most important drawbacks is its low execution speed. This study aims to accelerate DBSCAN so that the algorithm can handle big datasets in an acceptable period of time. To this end, the data are first grouped with the K-means++ algorithm. DBSCAN is then run on each group separately, which reduces the computational burden of DBSCAN and significantly increases the clustering speed. Finally, border clusters are merged if necessary. According to the experimental results, the proposed algorithm greatly reduces the DBSCAN execution time (by 98% in the best-case scenario) with no significant change in the qualitative evaluation criteria for clustering.

Keywords  Data mining · Clustering · Big data · DBSCAN algorithm · K-means++ algorithm
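The three stages summarized in the abstract (initial grouping with K-means++, DBSCAN inside each group, merging of border clusters) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names `kmeans_pp`, `dbscan` and `k_dbscan` are hypothetical, and the merge rule used here (join two clusters from different groups when they contain a pair of points at most `eps` apart) is an assumption, since the abstract does not specify the merge criterion. The brute-force neighbor search and merge check are quadratic and are kept only for readability, so the sketch does not reproduce the speedups reported in the paper.

```python
import math
import random


def kmeans_pp(points, k, iters=20, seed=0):
    """Group points with K-means++ seeding plus Lloyd iterations.
    Assumes at least k distinct points (tuples of floats)."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # K-means++ seeding: sample with probability proportional to the
        # squared distance to the nearest center chosen so far.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    assign = []
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign


def dbscan(points, eps, min_pts):
    """Plain DBSCAN with brute-force neighborhoods; label -1 means noise."""
    n = len(points)
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # expand only from core points
                queue.extend(neighbors[j])
        cluster += 1
    return labels


def k_dbscan(points, k, eps, min_pts, seed=0):
    """Sketch: K-means++ grouping, DBSCAN per group, then cluster merging."""
    groups = kmeans_pp(points, k, seed=seed)
    labels = [-1] * len(points)
    next_id = 0
    for g in range(k):
        idx = [i for i, a in enumerate(groups) if a == g]
        local = dbscan([points[i] for i in idx], eps, min_pts)
        for i, lab in zip(idx, local):
            if lab != -1:
                labels[i] = next_id + lab  # make cluster ids globally unique
        next_id += max(local, default=-1) + 1

    # Assumed merge rule (not taken from the paper): union clusters from
    # different groups that contain two points within eps of each other.
    parent = list(range(next_id))

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if (groups[i] != groups[j] and labels[i] != -1 and labels[j] != -1
                    and math.dist(points[i], points[j]) <= eps):
                parent[find(labels[i])] = find(labels[j])
    return [-1 if lab == -1 else find(lab) for lab in labels]
```

For example, calling `k_dbscan` on 20 collinear points spaced one unit apart with `k=2`, `eps=1.5` and `min_pts=3` first splits the line into two groups, clusters each half separately, and then joins the two halves again in the merge step. The returned labels are cluster representatives and are not necessarily consecutive integers.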

Electronic supplementary material  The online version of this article (https://doi.org/10.1007/s11227-020-03524-3) contains supplementary material, which is available to authorised users.

* Hamid Saadatfar
¹ University of Birjand, Birjand, South Khorasan, Iran

1 Introduction

The age of big data has resulted in the development and application of technologies and methods aimed at utilizing large amounts of data to support decision-making and knowledge discovery activities [1]. Large amounts of data have made researchers and industries reconsider computational solutions for the analysis of big data. For instance, great emphasis has been put on the design of new, computationally more efficient algorithms for the analysis of data on Twitter, Google, Facebook, and Wikipedia [2]. This enormous amount of data can be very useful for individuals and companies; however, analysis and recovery operations can become too time-consuming because of the high computational costs of data processing. A category of common methods for data analysis is referred to as data mining, which means the identification of useful, reliable, simple, and understandable data patterns that turn raw data into useful information [3]. One of the data mining techniques is clustering. Data clustering is considered an important area of unsupervised learning in which data are divided into different groups based on their similarities, from an informed perspective on the entire dataset [4]. Clustering is used in a wide range of areas such as vehicle re-identification [5], image denoising [6], time-series processing [7], and Web-ba