Parallel Implementation of PrePost Algorithm Based on Spark for Big Data




Abstract. Frequent itemset mining is fundamental to many data mining problems that aim to find interesting patterns in data. The recently presented PrePost algorithm, which mines frequent itemsets using N-lists, outperforms other current state-of-the-art algorithms in most cases. However, the performance of PrePost degrades when processing big data, and the existing parallel PrePost algorithms implemented with the MapReduce model are not efficient enough for iterative computation. In view of this, this article proposes a parallel algorithm based on the Spark RDD framework. It enhances PrePost by using a hash table to improve the process of creating N-lists, and it combines this with the features of Spark in order to process large data efficiently. Experiments show that our approach outperforms MRPrePost in terms of performance, stability, and scalability.

Keywords: Frequent itemset mining · PrePost · Big data · Spark

© Springer Nature Switzerland AG 2019 Y. Farhaoui and L. Moussaid (Eds.): ICBDSDE 2018, SBD 53, pp. 322–332, 2019. https://doi.org/10.1007/978-3-030-12048-1_33

1 Introduction

The past decade has witnessed remarkable growth in Internet communication technology, especially the mobile Internet and the sensor networks used to perceive and obtain information. Organizations from industry, government, and academia possess and store large quantities of data that contain tremendous value. The potential value of what is currently referred to as big data [1] cannot be unearthed by simple collection or statistical analysis. Advanced big data analytics and applications require special technologies to cope efficiently with massive amounts of data. Data mining techniques [2] are now drawing attention from practitioners in all data-related industries for this purpose. The aim of data mining is to explore data in search of unforeseen trends or patterns among variables, to interpret them, and then to verify the results by applying the detected patterns to new subsets of data. Since data gathered from a variety of sources are often isolated from one another, correlation analysis naturally becomes an important foundation for data mining and big data science [3]. Association rule mining [4] was proposed to discover interesting correlations among the itemsets in the data. Furthermore, frequent itemset mining [5] is an essential step in the process of association rule mining. Most of the proposed algorithms for frequent itemset mining can be grouped into the Apriori method [6] and the FP-growth method [7]. In recent years, the PrePost [8] and PrePost+ [9] algorithms, based on the N-list data structure, have been proposed to reduce the time and memory needed to mine frequent itemsets. These algorithms, which run on a single computer, have shown good performance on small amounts of data. Nevertheless, conventional approaches come across significant chall