OKCM: improving parallel task scheduling in high-performance computing systems using online learning

Jingbo Li1 · Xingjun Zhang1 · Li Han1 · Zeyu Ji1 · Xiaoshe Dong1 · Chenglong Hu1

Accepted: 29 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
Task scheduling is becoming increasingly important in large-scale high-performance computing real-time systems as the parallel scale, number, and types of tasks continue to increase. Prioritizing policies and backfilling mechanisms are the most useful practices for improving scheduling performance, and both depend heavily on task running time prediction. Previous studies focused on improving running time prediction accuracy, which resulted in higher time overhead and deployment difficulties in real-time scheduling systems. In this paper, an efficient running time prediction model, referred to as the online learning and K-nearest neighbors (KNN)-based predictor with correction mechanism (OKCM), is proposed. OKCM updates in real time through an online algorithm and, thanks to its KNN-based predictor, is friendly to users with little accumulated data. To evaluate the model, a trace-driven simulator named HPCsim is designed and implemented. The experimental results demonstrate that OKCM achieves higher prediction accuracy with low overhead. Furthermore, OKCM significantly improves scheduling performance and can be used to enhance primary prioritizing and backfilling methods without being restricted to a specific scheduling method.

Keywords  High-performance computing · Parallel task scheduling · Running time prediction · Online learning · Correction mechanism

* Xingjun Zhang [email protected]
Jingbo Li [email protected]
Xiaoshe Dong [email protected]

1 School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China


1 Introduction

A task scheduling system is a critical middleware for high-performance computing (HPC) platforms and is responsible for managing and scheduling tasks [1, 2]. An excellent scheduling system can effectively improve the performance of the HPC platform and reduce the average user waiting time, while ensuring fairness between large and small tasks without starvation and providing high-quality service to users [3]. Users submit their tasks with a command to the centralized waiting queue managed by the task scheduler. A submit command contains all the information necessary to run the task, including the requested nodes, the requested running time, and the task name. The task scheduler periodically checks the waiting queues and HPC resources and determines the task order in the queues [4]. The scheduling system decides when and where to execute the tasks. Firstly, task priority policies, such as F1 [5] and UNICEF [6], are used to order tasks based on task attributes such as task arrival time, task running time estimate, and task size to decide when to run. For example, the shortest task first policy orders tasks by running ti