Oracle and Vertica for Frequent Itemset Mining




Abstract. In the last few years, organizations have become much more interested in using data to create value. Big Data, however, presents new challenges to the extraction of knowledge using traditional Data Mining methods. In this paper we focus on a concrete implementation of association rules generation. The proposed algorithm is specialized for four datasets and its performance for different support thresholds is measured. This is done for two Database Management Systems (DBMS): a traditional row-oriented DBMS, represented by Oracle, and a column-oriented DBMS, represented by Vertica. The results indicate the suitability of these DBMSs as tools for association rules generation.

Keywords: Big data · Data mining · Association rules · Apriori algorithm

1 Introduction

Data mining concerns the processing of data with the primary purpose of finding interesting patterns and trends. The methods used to explore the data vary significantly and include association rules, classification, clustering, etc., all of which are widely discussed in the literature [7, 18]. Today, Big Data calls for more extensive data mining techniques because of the large data volumes, the variety of information content, and the dynamic behavior of the data [1, 12, 19]. This, combined with the ever-growing desire to mine the data directly in the transactional database [4, 8], presents new challenges to traditional relational database management systems (DBMS). In this paper we study the performance of two popular DBMSs, Oracle and Vertica, for association rule mining.

The paper is organized as follows. Section 2 provides background on the different data mining methods and explains why association rules were chosen for this study. It then describes the Apriori algorithm for frequent itemset discovery and the concrete implementation used in the tests. Section 3 outlines the setup and characteristics of the test environment and describes the datasets used. The performance results of the SQL implementation of the Apriori algorithm in both Oracle and Vertica are presented, analyzed, and discussed in Sect. 4. The last section gives conclusions and directions for future research.

© Springer International Publishing Switzerland 2016 Y. Tan and Y. Shi (Eds.): DMBD 2016, LNCS 9714, pp. 77–85, 2016. DOI: 10.1007/978-3-319-40973-3_8


H. Kyurkchiev and K. Kaloyanova

2 Background

Data Mining (DM) incorporates methods and techniques from several areas – statistics, machine learning, artificial intelligence, etc. The most common ways to mine data to discover the underlying patterns and structure are [7]:

• Association rule analysis – enables the discovery of interesting and frequent relations in large databases that concern the co-occurrences of different elements. The method is mainly used for market-basket analysis for direct marketing, sales promotions, and for discovering business trends.
• Clustering analysis – used to understand the differences and the similarities