Parallel Implementation of PrePost Algorithm Based on Spark for Big Data

Abstract. Frequent itemset mining is fundamental to many data mining problems aimed at finding interesting patterns in data. Recently, the PrePost algorithm has been presented: a new algorithm for mining frequent itemsets based on the idea of N-lists, which in most cases outperforms other current state-of-the-art algorithms. However, the performance of the PrePost algorithm degrades when it comes to processing big data, and the existing parallel PrePost algorithms implemented with the MapReduce model are not efficient enough for iterative computation. In view of this, this article proposes a parallel algorithm based on the Spark RDD framework. It enhances PrePost by using a hash table to improve the process of creating N-lists, and it combines this with the features of Spark in order to process large data efficiently. Experiments show that our approach outperforms MRPrePost in terms of performance, stability, and scalability.

Keywords: Frequent itemset mining · PrePost · Big data · Spark

© Springer Nature Switzerland AG 2019 Y. Farhaoui and L. Moussaid (Eds.): ICBDSDE 2018, SBD 53, pp. 322–332, 2019. https://doi.org/10.1007/978-3-030-12048-1_33

1 Introduction

The past decade has witnessed remarkable growth in Internet communication technology, especially the mobile Internet and sensor networks used to perceive and obtain information. Organizations in industry, government, and academia possess and store large quantities of data that contain tremendous value. The potential value of big data [1] cannot be unearthed by the simple collection or statistical analysis currently applied to it. Advanced big data analytics and applications require special technologies to cope efficiently with massive amounts of data. Data mining techniques [2] are now drawing attention from practitioners in all data-related industries for this purpose. The aim of data mining is to explore data in order to find and interpret unforeseen trends or patterns between variables, and then to verify the results by applying the detected patterns to new subsets of the data. Since data gathered from a variety of sources are often isolated from one another, correlation analysis naturally becomes an important foundation for data mining and big data science [3]. Association rule mining [4] was proposed to discover interesting correlation relationships among the itemsets in the data. Furthermore, frequent itemset mining [5] is an essential step in the process of association rule mining. Most of the proposed algorithms for frequent itemset mining can be grouped into Apriori-based methods [6] and FP-growth-based methods [7]. In recent years, the PrePost [8] and PrePost+ [9] algorithms, based on the N-list data structure, have been proposed to reduce the time and memory required to mine frequent itemsets. These algorithms, which run on a single computer, have shown good performance on small amounts of data. Nevertheless, conventional approaches come across significant chall
