Handling Data Skew for Aggregation in Spark SQL Using Task Stealing



Zeyu He1 · Qiuli Huang1 · Zhifang Li1 · Chuliang Weng1

Received: 6 August 2019 / Accepted: 11 November 2019 © Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

In distributed in-memory computing systems, data distribution has a large impact on performance. Designing a good partition algorithm is difficult and requires users to have adequate prior knowledge of the data, which makes data skew common in practice. Traditional approaches that handle data skew by sampling and repartitioning often incur additional overhead. In this paper, we propose a dynamic execution optimization for the aggregation operator, one of the most general and expensive operators in Spark SQL. Our optimization aims to avoid this additional overhead and improve performance when data skew occurs. The core idea is task stealing. Based on the relative sizes of data partitions, we add two types of tasks: segment tasks for larger partitions and stealing tasks for smaller partitions. Within a stage, stealing tasks can actively steal and process data from segment tasks after processing their own. The optimization achieves significant performance improvements, from 16% up to 67%, on different sizes and distributions of data. Experiments show that the overhead introduced is minimal and can be considered negligible.

Keywords: In-memory computing · Spark SQL · Aggregation · Data skew

Zeyu He [email protected] · Qiuli Huang [email protected] · Zhifang Li [email protected] · Chuliang Weng [email protected]

1 School of Data Science and Engineering, East China Normal University, Shanghai, China


International Journal of Parallel Programming

1 Introduction

In recent years, as memory capacity continues to increase and DRAM prices continue to decrease, distributed in-memory computing systems, represented by Apache Spark [22], have emerged. Spark provides an analytical library named Spark SQL [2] to handle structured and semi-structured data.

This paper focuses on the aggregation operator in Spark SQL. It is one of the most common and expensive operators in data processing systems [4,18]. In Spark SQL, the aggregation operator adopts the classic two-phase implementation [14], which is expressed as two stages. In the first stage, tasks read data from partitions and locally execute a partial aggregation to reduce the input size transferred to the next stage. After the first stage completes, tasks in the second stage fetch data from the previous stage and perform the final aggregation. The partial aggregation and the final aggregation can be treated as two reduce operations. Caching data in memory in advance can significantly speed up aggregation. However, in practice, we found that the performance of aggregation largely depends on data distribution even after caching.

Resilient Distributed Dataset (RDD) [28] is Spark's core data abstraction. In an RDD, each partition is the basic unit of paral
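To make the two-phase implementation concrete, the following is a minimal Python sketch of the partial-then-final aggregation idea described above (a sum aggregation over key–value pairs). It is not Spark's actual code; the helper names `partial_aggregate` and `final_aggregate` are hypothetical, and the shuffle between stages is elided.

```python
from collections import defaultdict

def partial_aggregate(partition):
    """Stage 1: partial aggregation executed locally within one partition,
    reducing the volume of data transferred to the next stage."""
    acc = defaultdict(int)
    for key, value in partition:
        acc[key] += value  # partial per-key sum
    return dict(acc)

def final_aggregate(partials):
    """Stage 2: merge the partial results fetched from the first stage."""
    result = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            result[key] += value
    return dict(result)

# Two hypothetical input partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
]
partials = [partial_aggregate(p) for p in partitions]
print(final_aggregate(partials))  # {'a': 4, 'b': 6, 'c': 5}
```

Note that if one partition is much larger than the others, its stage-1 task dominates the stage's runtime; this is the skew problem the task-stealing optimization targets.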