Title Improving the Efficiency of Deep Learning Applications in a Multi-GPU Environment via Adaptive Mini-batch Data Redistribution
Authors 김인모(Inmo Kim); 김명선(Myungsun Kim)
DOI https://doi.org/10.5573/ieie.2022.59.9.51
Page pp.51-58
ISSN 2287-5026
Keywords Fair-share scheduling; Deep learning applications; Multi-GPU; Slow-down
Abstract Multiple DNN training applications are usually executed on a GPU cluster composed of several GPUs. In this environment, competition among applications for GPU resources is inevitable whenever the total amount of GPU resources is smaller than the amount required by all training applications. Depending on the degree of this competition and on how heavily each training application uses each GPU, some applications may finish training quickly while others must run for a much longer time. This study proposes an algorithm that resolves such unfair GPU allocation among training applications. Based on each application's current GPU resource utilization, the ratio of its predicted training completion time under sharing to its completion time when using the GPUs exclusively is computed and defined as that application's current Slow-Down. The per-GPU mini-batch data ratio of each training application is then periodically adjusted so that the Slow-Down values of the currently running applications become similar, thereby achieving a fair allocation of GPU resources. Various experiments confirm that the maximum Slow-Down decreases by more than 53% and overall GPU utilization increases by 25%.
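The sketch below illustrates the Slow-Down metric and one proportional mini-batch redistribution step as described in the abstract; it is not the authors' implementation, and all names (App, exclusive_time, shared_time, minibatch_share, redistribute) and the step size are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class App:
    name: str
    exclusive_time: float   # predicted completion time with exclusive GPU use (assumed input)
    shared_time: float      # predicted completion time under the current sharing (assumed input)
    minibatch_share: float  # fraction of the global mini-batch assigned to this app's GPUs

def slow_down(app: App) -> float:
    """Slow-Down = predicted completion time under sharing / exclusive-use completion time."""
    return app.shared_time / app.exclusive_time

def redistribute(apps: list[App], step: float = 0.05) -> None:
    """Shift mini-batch share from the least-slowed app toward the most-slowed one,
    nudging the Slow-Down values of the running apps closer together."""
    worst = max(apps, key=slow_down)   # application suffering the largest Slow-Down
    best = min(apps, key=slow_down)    # application closest to its exclusive-use speed
    delta = min(step, best.minibatch_share)
    best.minibatch_share -= delta
    worst.minibatch_share += delta

if __name__ == "__main__":
    apps = [
        App("resnet", exclusive_time=100.0, shared_time=180.0, minibatch_share=0.5),
        App("bert",   exclusive_time=200.0, shared_time=240.0, minibatch_share=0.5),
    ]
    redistribute(apps)  # one periodic adjustment step
    for a in apps:
        print(a.name, round(slow_down(a), 2), a.minibatch_share)
```

In an actual scheduler this adjustment would run periodically, with the predicted completion times re-estimated from measured GPU utilization after each redistribution.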