A gray-box modeling methodology for runtime prediction of Apache Spark jobs


Hani Al-Sayeh1 · Stefan Hagedorn1 · Kai-Uwe Sattler1

© The Author(s) 2020

Abstract

Apache Spark jobs are often characterized by processing huge data sets and, therefore, require runtimes in the range of minutes to hours. Being able to predict the runtime of such jobs would thus be useful not only to know when a job will finish, but also for scheduling purposes, for estimating the monetary costs of a cloud deployment, or for determining an appropriate cluster configuration, such as the number of nodes. However, predicting Spark job runtimes is much more challenging than for standard database queries: cluster configuration and parameters have a significant performance impact, and jobs usually contain a lot of user-defined code, which makes it difficult to estimate cardinalities and execution costs. In this paper, we present a gray-box modeling methodology for runtime prediction of Apache Spark jobs. Our approach comprises two steps: first, a white-box model for predicting the cardinalities of the input RDDs of each operator is built based on prior knowledge about the operator behavior and application parameters such as applied filters, data, number of iterations, etc. In the second step, a black-box model for each task, constructed by monitoring runtime metrics while varying the allocated resources and input RDD cardinalities, is used. We further show how to use this gray-box approach not only for predicting the runtime of a given job, but also as part of a decision model for reusing intermediate cached results of Spark jobs. Our methodology is validated by an experimental evaluation showing highly accurate predictions of the actual job runtimes and a performance improvement when intermediate results can be reused.

Keywords: Big data · Runtime prediction · Modeling
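To make the two-step idea described above concrete, the following minimal sketch combines a white-box cardinality estimate (derived from known application parameters) with a black-box regression model over cardinalities and allocated resources. All names (estimate_cardinality, TaskRuntimeModel) are hypothetical, and a simple linear regression is used purely for illustration; this is not the authors' implementation and does not prescribe the learner used in the paper.

```python
"""Sketch of the gray-box prediction idea (illustrative only).

Step 1 (white box): estimate operator input cardinality from application
parameters such as filter selectivity and the number of iterations.
Step 2 (black box): regress per-task runtime on the estimated cardinality
and the allocated resources, trained from monitored runs.
"""
from dataclasses import dataclass

import numpy as np
from sklearn.linear_model import LinearRegression


def estimate_cardinality(input_rows: int, filter_selectivity: float,
                         iterations: int) -> int:
    # White-box step: derive the cardinality analytically from known
    # application parameters instead of learning it from observations.
    return int(input_rows * filter_selectivity * iterations)


@dataclass
class TaskRuntimeModel:
    # Black-box step: learned mapping (cardinality, #cores) -> runtime [s].
    model: LinearRegression

    @classmethod
    def fit(cls, cardinalities, cores, runtimes_s):
        X = np.column_stack([cardinalities, cores])
        return cls(LinearRegression().fit(X, runtimes_s))

    def predict(self, cardinality: int, cores: int) -> float:
        return float(self.model.predict([[cardinality, cores]])[0])


# Train on a few monitored runs with varying resources and cardinalities,
# then predict the runtime for a new configuration (numbers are made up).
model = TaskRuntimeModel.fit(cardinalities=[1e6, 2e6, 4e6],
                             cores=[4, 4, 8],
                             runtimes_s=[30.0, 58.0, 55.0])
card = estimate_cardinality(input_rows=10_000_000,
                            filter_selectivity=0.4, iterations=1)
print(f"predicted runtime: {model.predict(card, cores=8):.1f} s")
```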

* Hani Al-Sayeh: hani-bassam.al-sayeh@tu-ilmenau.de
* Stefan Hagedorn: stefan.hagedorn@tu-ilmenau.de
Kai-Uwe Sattler: kus@tu-ilmenau.de



1 Technische Universität Ilmenau, Ilmenau, Thüringen, Germany





1 Introduction

Big data platforms such as Hadoop, Spark, or Flink are mainly used to process and analyze huge volumes of data, resulting in runtimes of minutes or even hours. For many users, a prediction of the expected runtime of such jobs would be very helpful: based on this information, cluster resources can be allocated, job scheduling can be improved, and the costs of a cloud deployment can be estimated (e.g., in the form of a what-if analysis).

Although predicting the runtime of arbitrary Spark jobs seems nearly impossible due to the numerous parameters and the user-written code involved, there are scenarios that offer several opportunities for collecting the information necessary for a good prediction model. Often, the development of Spark programs is an explorative task, and programs written once are executed multiple times. Consider the following scenarios:

1. A data scientist loads several data files and performs some basic preprocessing and transf