A Way of Measuring Data Transfer Delays among Graphics Processing Units at Different Nodes of a Computer Cluster
A. A. Begaev and A. N. Sal'nikov

Department of Computational Mathematics and Cybernetics, Moscow State University, Moscow, 119991 Russia

Received July 9, 2019; in final form, October 2, 2019; accepted October 2, 2019
Abstract—The basics of load tests for a computer cluster with a large number of GPUs (graphics processing units) distributed over the cluster's nodes are presented and implemented as program code. As a result, information is collected about the time delays in transferring data of different sizes among all GPUs in the system. Two test modes, "all to all" and "one to one," are developed. In the first mode, all GPUs transfer data to all GPUs simultaneously. In the second mode, only one transfer between a single pair of GPUs proceeds at any given moment. Using test results obtained on the K60 computer cluster at the Keldysh Institute of Applied Mathematics, Russian Academy of Sciences, it is shown that the interconnect medium of the supercomputer is inhomogeneous with respect to data transfer among GPUs, not only for transfers through the network but also for GPUs within a common node of the computer cluster.

Keywords: load tests, interconnect medium, MPI, GPU.

DOI: 10.3103/S0278641920010021
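As a rough illustration of the "one to one" mode described in the abstract, the sketch below times a single pairwise transfer between two GPUs bound to two MPI ranks, staging the data through host memory. The message size, round-trip scheme, and staging strategy are assumptions made for illustration only and are not taken from the authors' code.

```c
/* Minimal sketch of a "one to one" delay measurement between two GPUs held
 * by two MPI ranks. Buffer size and transfer pattern are illustrative
 * assumptions, not the authors' actual test code. Data is staged through
 * host memory with cudaMemcpy and sent with plain MPI point-to-point calls. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) { MPI_Finalize(); return 1; }

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 0) cudaSetDevice(rank % ndev);   /* bind each rank to one GPU on its node */

    const size_t nbytes = 1 << 20;              /* assumed message size: 1 MiB */
    char *dev_buf, *host_buf = (char *)malloc(nbytes);
    cudaMalloc((void **)&dev_buf, nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0) {
        double t0 = MPI_Wtime();
        cudaMemcpy(host_buf, dev_buf, nbytes, cudaMemcpyDeviceToHost);
        MPI_Send(host_buf, (int)nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(host_buf, (int)nbytes, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double t1 = MPI_Wtime();
        printf("GPU of rank 0 <-> GPU of rank 1 round trip: %f s\n", t1 - t0);
    } else if (rank == 1) {
        MPI_Recv(host_buf, (int)nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(dev_buf, host_buf, nbytes, cudaMemcpyHostToDevice);
        cudaMemcpy(host_buf, dev_buf, nbytes, cudaMemcpyDeviceToHost);
        MPI_Send(host_buf, (int)nbytes, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
    }

    cudaFree(dev_buf);
    free(host_buf);
    MPI_Finalize();
    return 0;
}
```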
1. INTRODUCTION

Supercomputers are widely used for solving various applied problems associated with heavy multistage computations and the processing of large amounts of information. These include areas such as climate modeling [1], imaging from a large number of sources, and machine learning [2]. In solving these problems, it is often necessary to apply uniform operations to different homogeneous data.

The emergence of such special computing machines as graphics processing units (GPUs) in supercomputers has led to highly parallel paradigms in which a large parallel task can be divided into several smaller ones that can be executed on computing machines of different classes [3]. The main advantage of GPUs over multicore processors is their support of a great many lightweight threads. Sets of such threads apply the same sequences of operations to the data in parallel. Since there are many such threads and sets, a high degree of parallelism is achieved, allowing the efficient implementation of computations on GPUs [4].

A supercomputer consists of a large number of nodes, each of which can have several GPUs. An example of such a system is the Summit supercomputer [5] (first place in the Top500 as of June 2019). Parallel software allows an enormous number of nodes and GPUs to participate in execution. However, the emergence of additional elements in the architecture results in additional communication complexity, since data must be transferred from one independent computing machine to another. When a parallel program needs to transfer data between GPUs, they can be at different nodes of the computer cluster. The volume of data to transfer between nodes and GPUs can be quite large.
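To make the node and GPU layout discussed above concrete, the following hedged sketch binds each MPI rank to one GPU on its node and starts transfers from every GPU to every other GPU at the same time, in the spirit of the "all to all" mode from the abstract. It approximates the simultaneous exchange with MPI_Alltoall over host-staged buffers; the actual exchange pattern, message sizes, and timing scheme in the paper may differ.

```c
/* Illustrative sketch (not the authors' code) of an "all to all" style test:
 * every rank, each bound to one GPU, stages its device buffer to host memory
 * and exchanges it with every other rank simultaneously. The per-peer chunk
 * size is an assumed parameter. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 0) cudaSetDevice(rank % ndev);    /* one GPU per MPI rank */

    const size_t chunk = 1 << 20;                /* assumed 1 MiB per peer */
    char *dev_buf;
    cudaMalloc((void **)&dev_buf, chunk);
    char *send_host = (char *)malloc(chunk * nprocs);
    char *recv_host = (char *)malloc(chunk * nprocs);

    /* copy the same GPU data into every per-peer slot of the send buffer */
    for (int p = 0; p < nprocs; ++p)
        cudaMemcpy(send_host + (size_t)p * chunk, dev_buf, chunk, cudaMemcpyDeviceToHost);

    MPI_Barrier(MPI_COMM_WORLD);                 /* start all transfers together */
    double t0 = MPI_Wtime();
    MPI_Alltoall(send_host, (int)chunk, MPI_CHAR,
                 recv_host, (int)chunk, MPI_CHAR, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();
    printf("rank %d: all-to-all of %zu-byte chunks took %f s\n", rank, chunk, t1 - t0);

    cudaFree(dev_buf);
    free(send_host);
    free(recv_host);
    MPI_Finalize();
    return 0;
}
```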