Distributed graph cube generation using Spark framework
Seok Kang1 · Suan Lee1 · Jinho Kim1
© Springer Science+Business Media, LLC, part of Springer Nature 2019
Abstract
Graph OLAP is a technology that generates aggregates or summaries of a large-scale graph based on the properties (or dimensions) associated with its nodes and edges, and in turn enables interactive analyses of the statistical information contained in the graph. To support these OLAP functions efficiently, a graph cube is widely used; it maintains aggregate graphs for all dimensions of the source graph. However, computing the graph cube for a large graph requires an enormous amount of time. While previous approaches have used the MapReduce framework to cut down on this computation time, the more recently developed Spark environment offers superior computational performance. To leverage the advantages of Spark, we propose the GraphNaïve and GraphTDC algorithms. GraphNaïve sequentially computes graph cuboids for all dimensions in a graph, while GraphTDC computes them after first creating an execution plan. We also propose the Generate Multi-Dimension Table method, which efficiently creates a multidimensional graph table to express the graph. Evaluation experiments demonstrated that the GraphTDC algorithm significantly outperformed the DataFrame library built into Spark SQL as the size of the graphs increased.

Keywords Distributed parallel processing · Spark framework · Resilient distributed dataset · Graph cube · Data cube · Online analytical processing
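To make the notion of a graph cube concrete, the following is a minimal single-machine sketch in plain Python. It is not the GraphNaïve or GraphTDC algorithm and does not use Spark; it only illustrates the idea of computing one aggregate graph (cuboid) per combination of dimensions. The function names, the attribute names `gender` and `city`, and the three-node sample graph are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from itertools import combinations


def graph_cuboid(nodes, edges, dims):
    """Aggregate a graph on the given dimensions (illustrative helper).

    nodes: dict mapping node id -> dict of attribute values
    edges: iterable of (source id, target id) pairs
    dims:  tuple of attribute names to group by
    """
    # Each node collapses to the tuple of its values on the chosen dimensions.
    key = {n: tuple(attrs[d] for d in dims) for n, attrs in nodes.items()}
    # Aggregate nodes: how many source nodes fall into each group.
    node_counts = Counter(key.values())
    # Aggregate edges: how many source edges connect each pair of groups.
    edge_counts = Counter((key[s], key[t]) for s, t in edges)
    return node_counts, edge_counts


def graph_cube(nodes, edges, all_dims):
    """Compute one cuboid for every subset of all_dims (2^n cuboids)."""
    cube = {}
    for k in range(len(all_dims) + 1):
        for dims in combinations(all_dims, k):
            cube[dims] = graph_cuboid(nodes, edges, dims)
    return cube


# Hypothetical sample graph: three nodes with two dimensions each.
nodes = {
    1: {"gender": "F", "city": "Seoul"},
    2: {"gender": "M", "city": "Seoul"},
    3: {"gender": "F", "city": "Busan"},
}
edges = [(1, 2), (2, 3), (1, 3)]

cube = graph_cube(nodes, edges, ("gender", "city"))
```

With two dimensions the cube holds four cuboids, keyed by `()`, `("gender",)`, `("city",)`, and `("gender", "city")`. Because the number of cuboids doubles with each added dimension, this brute-force enumeration quickly becomes expensive on large graphs, which is the cost that distributed approaches aim to reduce.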
* Suan Lee · Seok Kang · Jinho Kim
1 Department of Computer Science, Kangwon National University, Chuncheon, Kangwon, Korea
1 Introduction
The growth of information technology has led to the continuous generation of data in various domains. Businesses and organizations are increasingly interested in analyzing large-scale data effectively and using the results for decision making. One widely used analysis method for such large-scale data is online analytical processing (OLAP), also called online multidimensional analysis [1–3]. OLAP allows business managers to analyze data easily by interactively providing aggregate values based on combinations of multiple attributes, or dimensions. To retrieve aggregate values quickly and provide these analysis functions efficiently, OLAP uses multidimensional cubes [4], which are data structures that contain all possible aggregate values needed in an OLAP application. Data cubes are used for multidimensional data analysis in many fields. However, computing these data cubes requires an enormous amount of time and resources, and it is often impractical to perform the computation using conventional computing methods. For this reason, several studies have explored methods for computing data cubes efficiently [5–12], including methods that use large-scale distributed parallel processing [13–