Crux: GPU-Efficient Communication Scheduling for Deep Learning Training

Title: Crux: GPU-Efficient Communication Scheduling for Deep Learning Training

Authors: Jiamin Cao, Yu Guan, Kun Qian, Jiaqi Gao, Wencong Xiao, Jianbo Dong, Binzhang Fu, Dennis Cai, Ennan Zhai (Alibaba Cloud)
Scribe: Wei Li (HKUST)

Introduction
This paper is about deep learning training (DLT), e.g., large language model (LLM) training. By deeply studying in-production DLT jobs, the authors observed that communication contention among different DLT jobs seriously hurts overall GPU computation utilization, resulting in low efficiency of the training cluster. Existing work fails to solve this problem: Muri [SIGCOMM'22] and HiveD [OSDI'20] can reduce contention but cannot avoid it, while Varys [SIGCOMM'14] and Sincronia [SIGCOMM'18] are unaware of deep learning flow patterns. A novel method based on the features of DLT flow patterns is therefore needed.

Key idea and contribution:
Three insights: 1) prioritize larger jobs; 2) prioritize jobs with more computation; 3) prioritize jobs with less communication. These insights are quantified in a formulation called GPU intensity, which the authors prove to be equivalent to GPU utilization.
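
To make the insights concrete, below is a minimal sketch of how a GPU-intensity score could be computed; the function name, inputs, and the exact formula are illustrative assumptions and may differ from the paper's actual formulation.

```python
def gpu_intensity(num_gpus: int, comp_time: float, comm_time: float) -> float:
    """Hypothetical GPU-intensity score: grows with job size (insight 1) and
    computation time (insight 2), and shrinks with communication time
    (insight 3). The exact formulation in the Crux paper may differ."""
    return num_gpus * comp_time / comm_time

# Example: a 128-GPU job with a 3:1 compute-to-communication ratio scores far
# higher than an 8-GPU job with a 1:1 ratio, so it gets network priority.
print(gpu_intensity(128, 3.0, 1.0))  # 384.0
print(gpu_intensity(8, 1.0, 1.0))    # 8.0
```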

Key idea: large jobs should get high priority, where a job's size is related to the amount of data it exchanges and hence to its communication.

Challenge 1: path selection.
Solution: GPU-intensity-based path selection, which avoids most of the GPU utilization loss caused by contention.
Challenge 2: limited priority levels in the network.
Solution: priority compression based on a DAG. (A sketch of both mechanisms follows below.)
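
A minimal sketch of the two mechanisms, assuming jobs carry a GPU-intensity score as above; the greedy path cost and the bucketing of priorities are illustrative simplifications, not Crux's actual algorithms.

```python
from typing import Dict, List

def select_path(candidate_paths: List[List[str]],
                link_intensity: Dict[str, float]) -> List[str]:
    """GPU-intensity-based path selection (illustrative): route the job over
    the candidate path whose links already carry the least GPU-intensive
    traffic, so that contention wastes as little GPU time as possible."""
    def contention_cost(path: List[str]) -> float:
        # Total GPU intensity of traffic already placed on this path's links.
        return sum(link_intensity.get(link, 0.0) for link in path)
    return min(candidate_paths, key=contention_cost)

def compress_priorities(job_intensities: Dict[str, float],
                        num_levels: int) -> Dict[str, int]:
    """Priority compression (illustrative): switches expose only a few
    priority queues, so the full ordering of jobs by GPU intensity is mapped
    down to num_levels buckets (0 = highest). Crux performs this compression
    over a DAG of jobs; here we simply bucket the sorted order."""
    ranked = sorted(job_intensities, key=job_intensities.get, reverse=True)
    bucket = max(1, -(-len(ranked) // num_levels))  # ceiling division
    return {job: min(i // bucket, num_levels - 1)
            for i, job in enumerate(ranked)}

# Example: prefer the path whose links carry less GPU-intensive traffic.
print(select_path([["s1", "s2"], ["s3", "s4"]],
                  {"s1": 100.0, "s2": 0.0, "s3": 5.0, "s4": 0.0}))
# -> ['s3', 's4']

# Example: three jobs compressed onto two hardware priority levels.
print(compress_priorities({"gpt": 384.0, "bert": 40.0, "resnet": 8.0}, 2))
# -> {'gpt': 0, 'bert': 0, 'resnet': 1}
```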

Evaluation
Test-bed evaluation: a 96-GPU Clos topology; three models: GPT, BERT, ResNet.
With Crux, normalized JCT is reduced by 15% for BERT and 18% for GPT, and GPU utilization is increased by 13.9%.
Trace-based simulation: a 2,000-GPU production trace over two weeks, two topologies (double-sided and Clos), and 11 random deep learning models.
In the production trace-based simulation, the GPU utilization of Sincronia, TACCL, CASSINI, Crux-PA, Crux-PS-PA, and Crux-full is 0.39, 0.49, 0.51, 0.62, 0.62, respectively.

Questions and opinions:
Q: Are there tasks for which you cannot precisely obtain the GPU intensity, and how do you deal with that?

A: I think many existing hardware monitoring tools can accurately measure GPU intensity; since we only need to measure the computation workload and the communication workload, we can use those tools.

Q: What is the fundamental difference between your scheduling algorithm and existing process-scheduling algorithms in operating systems?

A: I think the difference between Crux and existing communication schedulers is that we focus on optimizing GPU utilization, and we exploit characteristics of deep learning training jobs. For example, we consider the traffic patterns of deep learning training jobs, which are very different from previous applications.

Q: I have a question about contention: as you know, most AI networks have no oversubscription. With no oversubscription, is contention among multiple jobs still possible?

A: For jobs with high GPU intensity that do not need a large amount of network bandwidth, the performance could be okay. But based on our production GPU clusters, even when network bandwidth is very high, traffic is still unevenly distributed in the network; for example, we see a strong traffic polarization problem, so some links are always congested while others are not. On the other hand, we recently see more deep learning training jobs with higher communication consumption, for example jobs using inter-server tensor parallelism, which is a very communication-intensive operation. So communication scheduling, including path selection and priority assignment, is still important even if network bandwidth increases in the future.

Personal thoughts
One point is that identifying the specific features of deep learning jobs gives the right direction for designing a good solution. Another point is that maximizing GPU utilization is equivalent to scheduling more data flows of GPU-intensive DLT jobs in the network; based on this equivalence, the problem can be formulated (see the toy sketch below).
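
As a toy illustration of this equivalence (my own simplification, not the paper's formulation): the scheduler's objective can be viewed as the total GPU intensity that the network actually serves, so giving GPU-intensive jobs their communication first raises the objective.

```python
from typing import Dict

def served_intensity(gpu_intensity: Dict[str, float],
                     unblocked_fraction: Dict[str, float]) -> float:
    """Toy objective: sum of each job's GPU intensity weighted by the fraction
    of its communication that is not delayed by contention. Maximizing this
    corresponds to scheduling more traffic of GPU-intensive DLT jobs."""
    return sum(gpu_intensity[j] * unblocked_fraction[j] for j in gpu_intensity)

# Favoring the GPU-intensive job ("gpt") yields a higher objective value than
# favoring the less intensive one ("resnet").
print(served_intensity({"gpt": 384.0, "resnet": 8.0}, {"gpt": 1.0, "resnet": 0.5}))  # 388.0
print(served_intensity({"gpt": 384.0, "resnet": 8.0}, {"gpt": 0.5, "resnet": 1.0}))  # 200.0
```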