A survey on parallel clustering algorithms for Big Data

Zineb Dafir1 · Yasmine Lamari1 · Said Chah Slaoui1

© Springer Nature B.V. 2020

Abstract

Data clustering is one of the most studied data mining tasks. It aims, through various methods, to discover previously unknown groups within data sets. In recent years, considerable progress has been made in this field, leading to the development of innovative and promising clustering algorithms. However, traditional clustering algorithms present serious issues with respect to speed-up, throughput, and scalability. Thus, they can no longer be used directly in the context of Big Data, where data are mainly characterized by their volume, velocity, and variety. To overcome these limitations, current research is turning to parallel computing, giving rise to the so-called parallel clustering algorithms. This paper presents an overview of the latest parallel clustering algorithms, categorized according to the computing platforms used to handle Big Data, namely the horizontal and vertical scaling platforms. The former category includes peer-to-peer networks, MapReduce, and Spark platforms, while the latter includes multi-core processors, Graphics Processing Units, and Field Programmable Gate Arrays. In addition, the paper compares the performance of the reviewed algorithms based on common criteria of clustering validation in the Big Data context. It thus provides the reader with an overall view of current parallel clustering techniques.

Keywords  Algorithms · Big Data · Clustering · Data mining · DBSCAN · FPGA · GPU · k-means · MapReduce · MPI · Multi-cores CPU · Spark

* Zineb Dafir [email protected] · Yasmine Lamari [email protected] · Said Chah Slaoui [email protected]

1 Faculty of Science of Rabat, Mohammed V University, Rabat, Morocco

1 Introduction

With the advent of the Big Data phenomenon, data analysis techniques are being modernized in order to address the emerging challenges. Data clustering is no exception to this trend. This long-established data mining technique is used to partition a set of data instances into homogeneous subsets, such that the instances within each subset are similar to one another and dissimilar to instances belonging to other subsets (Han et al. 2012). The primary objective is to discover previously unknown groups, which is a sought-after result in many everyday problems. This can be achieved through different categories of clustering methods, such as hierarchical, partitioning, density-based, and grid-based methods, among other clustering techniques (Fahad et al. 2014).
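The partitioning idea described above can be made concrete with a small example. The following is a minimal sequential sketch of Lloyd's k-means iteration, one well-known partitioning method; it is not taken from any of the surveyed works. The function name `kmeans`, the toy data, and the deterministic initialization from the first k points are assumptions chosen purely for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Partition `points` (a list of equal-length tuples) into k groups.

    Illustrative sketch of Lloyd's algorithm. Centroids are initialised
    from the first k points (an assumption) so the result is deterministic.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: the assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: two well-separated groups in the plane.
data = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.9, 8.2), (0.9, 1.1), (8.1, 7.9)]
labels, centroids = kmeans(data, k=2)
```

Each pass of the assignment step compares every point with every centroid, so its cost grows with the product of the number of points and the number of clusters, and the passes are repeated until convergence. That repeated full scan is the kind of computational cost that becomes problematic at Big Data scale.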

1.1 Challenges

Most traditional clustering algorithms are specialized and operate under specific conditions to solve a particular type of problem. Besides, they are outdated and impractical in the context of Big Data due to their computational costs and their inability to