A survey on parallel clustering algorithms for Big Data


Zineb Dafir¹ · Yasmine Lamari¹ · Said Chah Slaoui¹

© Springer Nature B.V. 2020

Abstract Data clustering is one of the most studied data mining tasks. It aims, through various methods, to discover previously unknown groups within data sets. In recent years, considerable progress in this field has led to the development of innovative and promising clustering algorithms. However, traditional clustering algorithms suffer from serious limitations in speed-up, throughput, and scalability. They can therefore no longer be used directly in the context of Big Data, where data are mainly characterized by their volume, velocity, and variety. To overcome these limitations, current research is turning to parallel computing, giving rise to so-called parallel clustering algorithms. This paper presents an overview of the latest parallel clustering algorithms, categorized according to the computing platforms used to handle Big Data, namely horizontal and vertical scaling platforms. The former category includes peer-to-peer networks, MapReduce, and Spark platforms, while the latter includes multi-core processors, Graphics Processing Units, and Field Programmable Gate Arrays. In addition, the paper compares the performance of the reviewed algorithms based on common clustering validation criteria in the Big Data context, thereby providing the reader with an overall view of current parallel clustering techniques.

Keywords Algorithms · Big Data · Clustering · Data mining · DBSCAN · FPGA · GPU · k-means · MapReduce · MPI · Multi-core CPU · Spark

¹ Faculty of Science of Rabat, Mohammed V University, Rabat, Morocco

1 Introduction

With the advent of the Big Data phenomenon, data analysis techniques are being modernized in order to address the emerging challenges. Data clustering is no exception to this trend. This long-established data mining technique is used to partition a set of data instances into homogeneous subsets, such that each subset consists of instances that are similar to one another and dissimilar to instances belonging to other subsets (Han et al. 2012). The primary objective is to discover previously unknown groups, a sought-after result in many real-world problems. This can be achieved through different categories of clustering methods, such as hierarchical, partitioning, density-based, and grid-based methods, or other clustering techniques (Fahad et al. 2014).
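To make the notion of partitioning concrete, the following is a minimal sketch of k-means, one representative of the partitioning methods mentioned above. It is an illustration only and not an algorithm taken from this survey: the function names (`assign`, `update`, `kmeans`), the naive choice of the first k points as initial centroids, and the use of Euclidean distance are all assumptions made for the example. The assignment step treats each point independently, which is the kind of property that parallel variants typically exploit.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]


def update(points, labels, centroids):
    """Recompute each centroid as the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new_centroids.append(old)  # an empty cluster keeps its centroid
            continue
        new_centroids.append(tuple(sum(m[d] for m in members) / len(members)
                                   for d in range(len(old))))
    return new_centroids


def kmeans(points, k, max_iter=100):
    """Hypothetical helper: alternate assignment and update until stable."""
    centroids = [tuple(p) for p in points[:k]]  # assumption: naive initialization
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
    return labels, centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(kmeans(data, 2))
```

On the four sample points, the two nearby pairs end up in separate clusters. Real implementations usually rely on a more careful initialization than the one assumed here.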

1.1 Challenges

Most traditional clustering algorithms are specialized and operate under specific conditions to solve a particular type of problem. Moreover, they are outdated and impractical in the context of Big Data due to their computational costs and their inability to
