High Utility Itemsets Mining Based on Divide-and-Conquer Strategy

  • PDF / 2,393,345 Bytes
  • 19 Pages / 439.37 x 666.142 pts Page_size
  • 31 Downloads / 225 Views

DOWNLOAD

REPORT


High Utility Itemsets Mining Based on Divide‑and‑Conquer Strategy Jiyong Liao1   · Sheng Wu1 · Ailian Liu1

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract High utility itemsets mining has become a hot research topic in association rules mining. But many algorithms directly mine datasets, and there is a problem on dense datasets, that is, too many itemsets stored in each transaction. In the process of mining association rules, it takes a lot of storage space and affects the running efficiency of the algorithm. In the existing algorithms, there is a lack of efficient itemset mining algorithms for dense datasets. Aiming at this problem, a high utility itemsets mining algorithm based on divideand-conquer strategy is proposed. Using the improved silhouette coefficient to select the best K-means cluster number, the datasets are divided into many smaller subclasses. Then, the association rules mining is performed by Boolean matrix compression operation on each subclass, and iteratively merge them to get the final mining results. We also analyze the time complexity of our method and Apriori algorithm. Finally, experimental results on several well-known real world datasets are conducted to show that the improved algorithm performs faster and consumes less memory on dense datasets, which can effectively improve the computational efficiency of the algorithm. Keywords  Association rule mining · Divide-and-conquer strategy · K-means clustering · Dense datasets · Boolean matrix compression

1 Introduction Association rule mining is one of the most fundamental research direction in data mining, the task is to extract useful information from massive data to help people make decisions quickly. An enormous variety of association rule mining algorithms have been proposed over the past few decades, like FP-Growth [1], UP-Growth [2], UP-Growth+ [3]. One of * Jiyong Liao [email protected] Sheng Wu [email protected] Ailian Liu [email protected] 1



School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China

13

Vol.:(0123456789)



J. Liao et al.

the most famous one is Apriori algorithm [4], its core idea is to generate frequent itemsets by scanning transaction databases, then generate association rules based on frequent itemsets. Due to Apriori algorithm has the disadvantage of repeatedly scanning database, association rules mining algorithm based on matrix compression has attracted a lot of attention in the community of data scientists and many impressive results have been reported [5–9]. In the case of large amount of data, the processing of Apriori algorithm is obviously slow, so the improved Apriori algorithm based on clustering are proposed [10, 11]. In recent years, with the increasing complexity of transactions, the high-utility itemset mining (HUIM) [12] has become an emerging topic since it can reveal the highly profitable products. Wang et  al. [13] presented a tree-based temporal association rule mining algorithm, the numerical