Oracle and Vertica for Frequent Itemset Mining


Abstract. In the last few years, organizations have become much more interested in using data to create value. Big Data, however, presents new challenges to the extraction of knowledge using traditional Data Mining methods. In this paper we focus on a concrete implementation of association rules generation. The proposed algorithm is specialized for four datasets and its performance for different support thresholds is measured. This is done for two database management systems (DBMSs): a traditional row-oriented DBMS, represented by Oracle, and a column-oriented DBMS, represented by Vertica. The results indicate the suitability of these DBMSs as tools for association rules generation.

Keywords: Big data · Data mining · Association rules · Apriori algorithm

1 Introduction

Data mining concerns the processing of data with the primary purpose of finding interesting patterns and trends. The methods used to explore the data vary significantly and include association rules, classification, clustering, etc., all of which are widely discussed in the literature [7, 18]. Today, Big Data calls for more extensive data mining techniques because of the large data volumes, the variety of information content, and dynamic data behavior [1, 12, 19]. This, combined with the ever-growing desire to mine the data directly in the transactional database [4, 8], presents new challenges to traditional relational database management systems (DBMSs). In this paper we study the performance of two popular DBMSs, Oracle and Vertica, for association rule mining.

The paper is organized as follows. Sect. 2 provides background information on the different data mining methods, together with the reasoning for choosing association rules for this study. It then describes the Apriori algorithm for frequent itemset discovery and the concrete implementation used in the tests. Section 3 outlines the setup and characteristics of the employed environment and describes the datasets used. The performance results of the SQL implementation of the Apriori algorithm in both Oracle and Vertica are presented, analyzed, and discussed in Sect. 4. The last section gives conclusions and directions for future research.
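For orientation, a frequent itemset is a set of items whose support (the fraction of transactions that contain it) reaches a chosen threshold. The block below is a minimal Python sketch of the level-wise Apriori idea under that definition. It is not the SQL implementation evaluated in this paper; the function name `apriori`, the toy `baskets` data, and the 0.5 threshold are illustrative assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.

    Illustrative sketch only; names and data layout are assumptions.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def keep_frequent(candidates):
        # Count how many transactions contain each candidate itemset.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

    # Level 1: every single item is a candidate.
    frequent = keep_frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        frequent = keep_frequent(candidates)
        result.update(frequent)
        k += 1
    return result


# Hypothetical market-basket data, not one of the paper's datasets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent_itemsets = apriori(baskets, min_support=0.5)
for itemset, sup in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

Lowering `min_support` generally enlarges the candidate sets at each level, which is why the support threshold is a natural parameter to vary when measuring performance.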

© Springer International Publishing Switzerland 2016 Y. Tan and Y. Shi (Eds.): DMBD 2016, LNCS 9714, pp. 77–85, 2016. DOI: 10.1007/978-3-319-40973-3_8


H. Kyurkchiev and K. Kaloyanova

2 Background

Data Mining (DM) incorporates methods and techniques from several areas – statistics, machine learning, artificial intelligence, etc. The most common ways to mine data to discover the underlying patterns and structure are [7]:

• Association rule analysis – enables the discovery of interesting and frequent relations in large databases that concern the co-occurrences of different elements. The method is mainly used for market-basket analysis for direct marketing, sales promotions, and for discovering business trends.
• Clustering analysis – used to understand the differences and the similarities
