SWEclat: a frequent itemset mining algorithm over streaming data using Spark Streaming

  • PDF / 830,905 Bytes
  • 16 Pages / 439.37 x 666.142 pts Page_size
  • 68 Downloads / 202 Views

DOWNLOAD

REPORT


SWEclat: a frequent itemset mining algorithm over streaming data using Spark Streaming Wen Xiao1,2   · Juan Hu3

© The Author(s) 2020

Abstract Finding frequent itemsets in a continuous streaming data is an important data mining task which is widely used in network monitoring, Internet of Things data analysis and so on. In the era of big data, it is necessary to develop a distributed frequent itemset mining algorithm to meet the needs of massive streaming data processing. Apache Spark is a unified analytic engine for massive data processing which has been successfully used in many data mining fields. In this paper, we propose a distributed algorithm for mining frequent itemsets over massive streaming data named SWEclat. The algorithm uses sliding window to process streaming data and uses vertical data structure to store the dataset in the sliding window. This algorithm is implemented by Apache Spark and uses Spark RDD to store streaming data and dataset in vertical data format, so as to divide these RDDs into partitions for distributed processing. Experimental results show that SWEclat algorithm has good acceleration, parallel scalability and load balancing. Keywords  Frequent itemset mining · Streaming data · Sliding window · Distributed · Spark Streaming

1 Introduction Frequent itemset mining (FIM) is one of the most basic and important data mining tasks. Since it was proposed [1], it has attracted more and more attention. Classical FIM algorithms for mining static data include: Apriori [2] based on Generation-Test and its series of improved algorithms [3–5]; Frequent Pattern Growth (FP-Growth) * Wen Xiao [email protected] 1

College of Computer Science and Information, HOHAI University, Nanjing, China

2

Key Laboratory of Unmanned Aerial Vehicle Development and Data Application of Anhui Higher Education Institutes, Wanjiang University of Technology, Maanshan, China

3

Ma’anshan Engineering Technology Research Center for Wireless Sensor Network and IntelliSense, Wanjiang University of Technology, Maanshan, China



13

Vol.:(0123456789)



W. Xiao, J. Hu

[6] algorithm and other algorithms based on Pattern Growth; and Equivalence Class Transformation (Eclat) [7], Diffset Eclat (dEclat) [8] and other algorithms based on vertical data format. Mining frequent itemsets from data streams is one of the most important issues in FIM. It is widely used in retail chain data analysis, network traffic analysis, click-stream mining, IOT data analysis and other fields. Data stream is a continuous, unbounded, timely ordered sequence of data elements with high speed; these characteristics lead to some special limitations in Mining frequent itemsets: All elements in the data stream can only be visited once; data streams grow continuously, but memory during processing is limited and only part of the elements in data streams can be processed; the high speed of data stream also requires the high speed of mining. Therefore, the aim to design FIM algorithm for streaming data is generally required to complete the mining only by scanni