Title: Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem
Authors: Xuting Liu (University of Pennsylvania); Behnaz Arzani, Siva Kesava Reddy Kakarla (Microsoft Research); Liangyu Zhao (University of Washington); Vincent Liu (University of Pennsylvania); Miguel Castro (OpenAI); Srikanth Kandula (Microsoft); Luke Marshall (Microsoft Research)

Speaker: Xuting Liu (University of Pennsylvania)

Scribe: Yuling Lin (Xiamen University)

Introduction
The paper studies the optimization of collective communication in machine learning (ML) systems, specifically in the context of distributed training on large-scale GPU clusters. The problem matters because the growing size and complexity of ML models demand efficient communication among many processing units to keep training times short and hardware well utilized. The challenge lies in scaling existing communication-optimization techniques to the demands of cloud operators managing vast, multi-tenant GPU environments. Existing systems and tools, such as SCCL and TACCL, fall short: they struggle to scale to larger problem sizes and often yield suboptimal communication schedules. These inefficiencies lead to significant GPU idle time, longer training runs, and ultimately higher operational costs. The search for a solution that provides both optimal communication schedules and the ability to scale to large topologies is what makes this research area critical. The paper introduces TE-CCL, a novel approach that leverages multi-commodity flow techniques to address these limitations and deliver a more scalable, efficient solution to collective communication in ML systems.

Key idea and contribution
The authors have constructed a system named TE-CCL that addresses the challenge of optimizing collective communication in distributed machine learning environments. This system is designed to improve upon the limitations of existing communication schedulers, which struggle with scalability and optimality when dealing with the large-scale GPU clusters common in cloud computing. TE-CCL models the collective communication problem using multi-commodity flow principles, a method traditionally applied in traffic engineering, to develop a more scalable and efficient solution.
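
To make the multi-commodity-flow analogy concrete, here is a minimal sketch (my own illustration; the function name and data layout are hypothetical, not from the paper) of how an ALLGATHER demand maps onto commodities: each (source, chunk) pair becomes a commodity that must be delivered to every other node.

```python
from itertools import product

def allgather_commodities(nodes, chunks_per_node):
    """One commodity per (source, chunk): data every other node must receive."""
    commodities = []
    for src, c in product(nodes, range(chunks_per_node)):
        commodities.append({
            "source": src,
            "chunk": c,
            "destinations": [n for n in nodes if n != src],
        })
    return commodities

# Example: 4 GPUs with 1 chunk each -> 4 commodities, 3 destinations apiece.
for demand in allgather_commodities(range(4), 1):
    print(demand)
```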

The system's core innovation lies in its ability to handle communication patterns typical of ML collectives, such as ALLGATHER, ALLTOALL, and SCATTERGATHER. TE-CCL introduces a novel mixed-integer linear programming (MILP) model built around chunks, epochs, and buffers, which lets it capture data-transfer delays, store-and-forward behavior, and in-network copying of data. For communication patterns that do not benefit from data replication, such as ALLTOALL, where each chunk has a single destination, the MILP can be converted into a linear program (LP), improving scalability. Additionally, TE-CCL includes a post-processing algorithm that eliminates unnecessary data flows, ensuring efficient use of network resources and minimizing transfer times.
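
The paper's exact model is richer, but the skeleton of such an epoch-based flow formulation can be sketched with gurobipy (the solver interface the evaluation mentions). Everything below, the variable names, the toy full-mesh topology, and the traffic-minimizing objective, is an assumption for illustration; the authors' MILP optimizes completion time and treats delays and buffers more carefully.

```python
import gurobipy as gp
from gurobipy import GRB

# Toy instance (assumed): a 4-GPU full mesh, one chunk per source,
# unit edge capacity, and a fixed horizon of discrete epochs.
nodes = range(4)
edges = [(u, v) for u in nodes for v in nodes if u != v]
chunks = list(nodes)            # chunk c originates at node c
epochs = range(8)
capacity = 1

m = gp.Model("te_ccl_sketch")
# flow[u, v, c, t] = 1 if chunk c crosses edge (u, v) during epoch t.
flow = m.addVars(edges, chunks, epochs, vtype=GRB.BINARY, name="flow")
# buf[n, c, t] = 1 if node n holds chunk c at the start of epoch t.
buf = m.addVars(nodes, chunks, epochs, vtype=GRB.BINARY, name="buf")

for n in nodes:
    for c in chunks:
        # Initially, each node holds only its own chunk.
        m.addConstr(buf[n, c, 0] == int(n == c))
        for t in epochs:
            if t > 0:
                # Store-and-forward with in-network copying: a node holds a
                # chunk if it held it before or received it on some edge.
                received = gp.quicksum(flow[u, n, c, t - 1]
                                       for (u, v) in edges if v == n)
                m.addConstr(buf[n, c, t] <= buf[n, c, t - 1] + received)

for (u, v) in edges:
    for t in epochs:
        # A node may only send chunks it already holds ...
        for c in chunks:
            m.addConstr(flow[u, v, c, t] <= buf[u, c, t])
        # ... and each edge carries limited data per epoch.
        m.addConstr(gp.quicksum(flow[u, v, c, t] for c in chunks) <= capacity)

# ALLGATHER demand: every node holds every chunk by the final epoch.
T = max(epochs)
m.addConstrs(buf[n, c, T] == 1 for n in nodes for c in chunks)

# Stand-in objective: minimize total traffic (the paper targets finish time).
m.setObjective(flow.sum(), GRB.MINIMIZE)
m.optimize()
```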
Evaluation
The evaluation of TE-CCL covered a range of GPU topologies and compared it against existing solutions such as TACCL and SCCL. The authors implemented the system in Python, used Gurobi as the optimization solver, and translated the resulting solutions into schedules executable on hardware. Key metrics included solver time, transfer time, output buffer size, transfer size, and algorithmic bandwidth. The results showed TE-CCL outperforming TACCL in solver time and algorithmic bandwidth across various scenarios; in some cases, TE-CCL was the only solver able to produce a feasible solution. This result is significant because it indicates TE-CCL's potential to substantially improve the efficiency of distributed ML training in cloud environments, which is crucial as models grow increasingly complex and data-intensive.
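
For reference, the algorithmic-bandwidth metric is simply the size of the collective's buffer divided by the transfer time; a tiny helper (my own illustration, not code from the paper) makes the unit handling explicit:

```python
def algorithmic_bandwidth(buffer_bytes: float, transfer_time_s: float) -> float:
    """Algorithmic bandwidth in GB/s: bytes moved divided by completion time."""
    return buffer_bytes / transfer_time_s / 1e9

# Example: a 1 GiB output buffer transferred in 50 ms -> ~21.5 GB/s.
print(algorithmic_bandwidth(2**30, 0.05))
```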

Questions and opinions:
Q: What are the challenges for TE-CCL in handling collective communication operations that involve task dependencies, as opposed to those that do not, like ALLTOALL?
A: For other collectives, our formulation can technically be applied. It is possible to construct schedules for other patterns, such as ALLREDUCE or GATHER, by creatively combining or manipulating the existing ALLGATHER solutions. We also think it is possible to use the formulation and models directly, but we have not done so in our paper.
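One well-known composition along the lines this answer alludes to (a generic identity, not the authors' implementation) is building ALLREDUCE from a REDUCESCATTER followed by an ALLGATHER of the reduced shards; a small NumPy simulation:

```python
import numpy as np

def allreduce_via_rs_ag(per_rank_data):
    """Simulate ALLREDUCE on a list of equal-length arrays, one per rank."""
    world = len(per_rank_data)
    shards = [np.array_split(x, world) for x in per_rank_data]
    # REDUCESCATTER: rank i ends up with the sum of shard i across all ranks.
    reduced = [sum(shards[r][i] for r in range(world)) for i in range(world)]
    # ALLGATHER: every rank collects all reduced shards.
    result = np.concatenate(reduced)
    return [result.copy() for _ in range(world)]

# Four ranks holding all-1s, all-2s, all-3s, all-4s -> every rank gets 10s.
data = [np.full(8, r + 1.0) for r in range(4)]
print(allreduce_via_rs_ag(data)[0])
```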
Q: I have two questions. First, the solution requires synchronization between different GPUs; how do you tackle problems caused by differences in the computational capacity of different GPUs? Second, does your solution apply to multi-tenant workloads?
A: 1. The solution does not require explicit synchronization mechanisms between GPUs. Instead, it computes an optimized schedule offline and distributes it to all GPUs with the expectation that they will complete their tasks according to the plan. A potential issue is stragglers: some GPUs may finish their tasks earlier than others, which could impact performance. We acknowledge this issue and note it as an area for future work.
2. Yes, for multi-tenancy. I would say it is a strength of traffic engineering that it supports multiple tenants. It is possible to model such workloads in our formulation, and I think improving the formulation in this direction is good future work.
Q: What makes the new formulation different and better at solving multi-node problems, given that it still uses conventional optimization techniques like MILP and LP? In particular, how does it help reduce the solution space and improve the handling of multi-node problems compared to existing methods?
A: I think the magic comes from the field of traffic engineering: its formulations have been studied and optimized for decades. Our formulation, built on traffic-engineering formulations, turns out to be more efficient and to produce better schedules than previous formulations. We also designed other techniques, including an A*-based algorithm, which further increase scalability compared to the MILP or LP formulations.
Personal thoughts
What stands out is the innovative application of multi-commodity flow principles to model and solve complex communication problems at scale. TE-CCL's ability to outperform existing methods in both scalability and solution quality is impressive, particularly its adaptability to different GPU topologies and its robust handling of various communication patterns. The paper also raises a number of open questions and potential areas for further exploration. For instance, while the evaluation is thorough, the impact of TE-CCL on real-world distributed training jobs beyond synthetic benchmarks is not fully explored.