Handling Data Skew for Aggregation in Spark SQL Using Task Stealing



Zeyu He1 · Qiuli Huang1 · Zhifang Li1 · Chuliang Weng1

Received: 6 August 2019 / Accepted: 11 November 2019 © Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

In distributed in-memory computing systems, data distribution has a large impact on performance. Designing a good partition algorithm is difficult and requires users to have adequate prior knowledge of the data, which makes data skew common in practice. Traditional approaches that handle data skew by sampling and repartitioning often incur additional overhead. In this paper, we propose a dynamic execution optimization for the aggregation operator, one of the most general and expensive operators in Spark SQL. Our optimization aims to avoid this additional overhead and improve performance when data skew occurs. The core idea is task stealing. Based on the relative sizes of data partitions, we add two types of tasks: segment tasks for larger partitions and stealing tasks for smaller partitions. Within a stage, stealing tasks can actively steal and process data from segment tasks after processing their own. The optimization achieves significant performance improvements, from 16% up to 67%, on different sizes and distributions of data. Experiments show that the overhead introduced is minimal and can be considered negligible.

Keywords: In-memory computing · Spark SQL · Aggregation · Data skew

Zeyu He [email protected] · Qiuli Huang [email protected] · Zhifang Li [email protected] · Chuliang Weng [email protected]

1 School of Data Science and Engineering, East China Normal University, Shanghai, China


International Journal of Parallel Programming

1 Introduction

In recent years, as memory capacity continues to increase and DRAM prices continue to decrease, distributed in-memory computing systems, represented by Apache Spark [22], have emerged. Spark provides an analytical library named Spark SQL [2] to handle structured and semi-structured data.

This paper focuses on the aggregation operator in Spark SQL. It is one of the most common and expensive operators in data processing systems [4,18]. In Spark SQL, the aggregation operator adopts the classic two-phase implementation [14], which is expressed as two stages. In the first stage, tasks read data from partitions and locally execute a partial aggregation to reduce the input size transferred to the next stage. After the first stage completes, tasks in the second stage fetch data from the previous stage and perform the final aggregation. The partial aggregation and the final aggregation can be treated as two reduce operations. Caching data in memory in advance can significantly speed up aggregation. However, in practice, we found that the performance of aggregation largely depends on data distribution even after caching.

Resilient Distributed Dataset (RDD) [28] is Spark's core data abstraction. In an RDD, each partition is the basic unit of paral
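To make the two-phase implementation concrete, the following is a minimal Python sketch of the partial-then-final aggregation idea described above (a sum aggregation over key–value pairs). It is not Spark's actual code; the helper names `partial_aggregate` and `final_aggregate` are hypothetical, and the shuffle between stages is elided.

```python
from collections import defaultdict

def partial_aggregate(partition):
    """Stage 1: partial aggregation executed locally within one partition,
    reducing the volume of data transferred to the next stage."""
    acc = defaultdict(int)
    for key, value in partition:
        acc[key] += value  # partial per-key sum
    return dict(acc)

def final_aggregate(partials):
    """Stage 2: merge the partial results fetched from the first stage."""
    result = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            result[key] += value
    return dict(result)

# Two hypothetical input partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
]
partials = [partial_aggregate(p) for p in partitions]
print(final_aggregate(partials))  # {'a': 4, 'b': 6, 'c': 5}
```

Note that if one partition is much larger than the others, its stage-1 task dominates the stage's runtime; this is the skew problem the task-stealing optimization targets.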