Paper: Crux: GPU-Efficient Communication Scheduling for Deep Learning Training (Best Paper Honorable Mention, SIGCOMM’24)
Authors: Jiamin Cao, Yu Guan, Kun Qian, Jiaqi Gao, Wencong Xiao, Jianbo Dong, Binzhang Fu, Dennis Cai, and Ennan Zhai.
Presenter: Aranya Saha, Bangladesh University of Engineering and Technology, Bangladesh
Guests of Honor: Yu Guan, Jiaqi Gao, and Ennan Zhai.
After the paper presentation, the participants had a lively discussion. The following is a partial record of the Q&A:
Q1: Crux aims to maximize GPU utilization, but cloud GPU rentals are usually billed by the hour, and using Crux can prolong a user's job completion time. Does this mean that users have to pay more?
A1 (Jiaqi Gao, Crux author): Using Crux can indeed prolong an individual job's completion time, but maximizing GPU utilization lowers the effective cost of each GPU. If utilization improves, then for a cluster of a given size we can reduce the charges. In other words, a user can choose to slightly extend job completion time and pay less, or improve job completion efficiency by renting a larger cluster. The latter option follows from our formula: a larger cluster size gives the job a higher weight and therefore a higher priority, so the job runs faster than before. In general, although Crux may prolong job completion time, it reduces the GPU charge; and for all users, higher GPU utilization lets GPU cloud service providers reduce capital waste and pass more discounts on to users.
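To make the weighting idea above concrete, here is a minimal Python sketch. It is not Crux's actual formula: the `Job` fields, the `gpu_intensity` helper, and the approximation "number of GPUs times the compute-time fraction of an iteration" are all illustrative assumptions. The sketch only captures the qualitative point in the answer, namely that a larger cluster yields a higher weight and thus a higher priority.

```python
# Illustrative sketch only -- NOT the paper's exact GPU-intensity formula.
# Assumption: intensity ~ num_gpus * (compute time / iteration time).
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    num_gpus: int      # size of the cluster the job rents
    compute_s: float   # computation time per iteration (seconds)
    comm_s: float      # communication time per iteration (seconds)

def gpu_intensity(job: Job) -> float:
    """Higher with more GPUs and a larger compute fraction, so a
    bigger cluster raises the job's scheduling weight/priority."""
    compute_fraction = job.compute_s / (job.compute_s + job.comm_s)
    return job.num_gpus * compute_fraction

jobs = [
    Job("small-cluster", num_gpus=8,   compute_s=5.0, comm_s=2.0),
    Job("large-cluster", num_gpus=128, compute_s=5.0, comm_s=2.0),
]
# Schedule communication for higher-intensity jobs first.
for job in sorted(jobs, key=gpu_intensity, reverse=True):
    print(f"{job.name}: intensity = {gpu_intensity(job):.1f}")
```

Under this toy approximation, the 128-GPU job gets a 16x higher weight than the 8-GPU job with the same per-iteration timing, which is the "rent a larger cluster to run faster" trade-off described in the answer.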
Q2: Model training is essentially an iterative process that alternates between computation and communication. When a job's GPU intensity is high, will network resources be wasted?
A2 (Qiao Xiang): Jobs with high GPU intensity spend most of their time computing and do not need the network during those periods. When a job is not communicating, its network resources can be allocated to other jobs that do need to communicate, so network resources are not wasted.
Q3: Crux uses GPU intensity as the scheduling priority. Will this cause jobs with low GPU intensity to be starved?
A3 (Yu Guan, Crux author): We have run experiments on this. Jobs with low GPU intensity do run much slower than when running alone, but DLT jobs have many bubbles, so low-GPU-intensity jobs still get opportunities to communicate.
Q4: Machine learning systems run as pipelines and reduce bubbles by overlapping communication and computation. A system that optimizes a single job will eliminate as many bubbles as it can, so each job on its own may not leave many bubbles. In that case, if a low-GPU-intensity job runs alongside high-GPU-intensity jobs that have no remaining bubbles, could the low-GPU-intensity job starve?
A4 (Yu Guan, Crux author): Even when communication and computation are completely overlapped, there are still bubbles on the communication side. For example, suppose one iteration takes 5 seconds of computation and 2 seconds of communication. Even if the communication is fully hidden by the computation, the network is still idle for 3 seconds of that iteration, and that bubble gives other jobs a chance to communicate.
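A minimal sketch of the arithmetic in this answer (the function name is a hypothetical helper, not from the paper): with full overlap, the network bubble per iteration is just the computation time minus the communication time, floored at zero.

```python
# Sketch of the answer's arithmetic: even with perfect compute/comm
# overlap, the network is idle whenever computation outlasts communication.
def comm_bubble_per_iteration(compute_s: float, comm_s: float) -> float:
    """Network idle time per iteration under full overlap (hypothetical helper)."""
    return max(compute_s - comm_s, 0.0)

# The example from the answer: 5 s of computation, 2 s of communication.
print(comm_bubble_per_iteration(5.0, 2.0))  # -> 3.0 seconds of network bubble
```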
Thank you to everyone who joined our reading group session on Crux: GPU-Efficient Communication Scheduling for Deep Learning Training!
Here is the link to the presentation slides for anyone interested:
Presentation on Crux: GPU-Efficient Communication Scheduling for Deep Learning Training