Distributed graph cube generation using Spark framework


Seok Kang¹ · Suan Lee¹ · Jinho Kim¹

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract Graph OLAP is a technology that generates aggregates or summaries of a large-scale graph based on the properties (or dimensions) associated with its nodes and edges, which in turn enables interactive analysis of the statistical information contained in the graph. To support these OLAP functions efficiently, a graph cube is widely used; it maintains aggregate graphs for all dimensions of the source graph. However, computing the graph cube for a large graph requires an enormous amount of time. Previous approaches have used the MapReduce framework to reduce this computation time, but the more recently developed Spark environment offers superior computational performance. To leverage the advantages of Spark, we propose the GraphNaïve and GraphTDC algorithms. GraphNaïve sequentially computes graph cuboids for all dimensions in a graph, while GraphTDC first creates an execution plan and then computes the cuboids according to it. We also propose the Generate Multi-Dimension Table method, which efficiently creates a multidimensional graph table to express the graph. Evaluation experiments demonstrated that the GraphTDC algorithm significantly outperformed DataFrame, the built-in library of Spark SQL, as the size of the graphs increased.

Keywords Distributed parallel processing · Spark framework · Resilient distributed dataset · Graph cube · Data cube · Online analytical processing
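To make the notion of a graph cube concrete, the following is a minimal single-machine sketch in plain Python. It is not the GraphNaïve or GraphTDC algorithm, and it does not use Spark. It only illustrates what "one aggregate graph per combination of dimensions" means. The toy graph, the dimension names (`gender`, `city`), the helper names `cuboid` and `graph_cube`, the choice of COUNT as the aggregate, and the treatment of edges as directed are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy graph: node id -> values of its dimensions.
nodes = {
    1: {"gender": "F", "city": "Seoul"},
    2: {"gender": "M", "city": "Seoul"},
    3: {"gender": "F", "city": "Busan"},
    4: {"gender": "M", "city": "Busan"},
}
# Edges are treated as directed (source, target) pairs (an assumption).
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (1, 4)]


def cuboid(nodes, edges, dims):
    """Aggregate the graph on one combination of dimensions.

    Nodes that share the same values on `dims` collapse into one
    aggregate node; each edge is mapped to the pair of aggregate
    nodes of its endpoints. COUNT is used as the aggregate function.
    """
    def key(node_id):
        return tuple(nodes[node_id][d] for d in dims)

    node_counts = Counter(key(n) for n in nodes)
    edge_counts = Counter((key(u), key(v)) for u, v in edges)
    return node_counts, edge_counts


def graph_cube(nodes, edges, dimensions):
    """Compute every cuboid one after another: 2^n for n dimensions."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cube[dims] = cuboid(nodes, edges, dims)
    return cube


cube = graph_cube(nodes, edges, ("gender", "city"))
for dims, (node_counts, edge_counts) in cube.items():
    print(dims, dict(node_counts), dict(edge_counts))
```

With two dimensions the sketch produces four cuboids, from the empty combination (the whole graph collapsed into a single aggregate node) up to the full combination. Because the number of cuboids doubles with every added dimension, computing them one by one on a single machine quickly becomes impractical for large graphs.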

* Suan Lee [email protected]; [email protected]
Seok Kang [email protected]
Jinho Kim [email protected]

¹ Department of Computer Science, Kangwon National University, Chuncheon, Kangwon, Korea


1 Introduction

The growth of information technology has led to the continuous generation of data in various domains. Businesses and organizations are increasingly interested in analyzing large-scale data effectively and using the results for decision making. One widely used method for analyzing such large-scale data is online analytical processing (OLAP), also called online multidimensional analysis [1–3]. OLAP allows business managers to analyze data easily by interactively providing aggregate values based on combinations of multiple attributes, or dimensions. To retrieve aggregate values quickly and provide these analysis functions efficiently, OLAP uses multidimensional cubes [4], which are data structures that contain all possible aggregate values needed in an OLAP application. Data cubes are used for multidimensional data analysis in many fields. However, computing these data cubes requires an enormous amount of time and resources, and it is often impractical to perform the computation with conventional computing methods. For this reason, several studies have explored methods for computing data cubes efficiently [5–12], including methods that use large-scale distributed parallel processing [13–
