Title: SyCCL: Exploiting Symmetry for Efficient Collective Communication Scheduling
Authors: Jiamin Cao (Alibaba Cloud); Shangfeng Shi (Tsinghua University and Alibaba Cloud); Jiaqi Gao (Alibaba Cloud); Weisen Liu, Yifan Yang (Tsinghua University and Alibaba Cloud); Yichi Xu, Zhilong Zheng, Yu Guan, Kun Qian (Alibaba Cloud); Ying Liu, Mingwei Xu (Tsinghua University); Tianshu Wang, Ning Wang, Jianbo Dong, Binzhang Fu, Dennis Cai, Ennan Zhai (Alibaba Cloud)
Introduction
The paper addresses the problem of efficient collective communication scheduling in large-scale distributed machine learning (ML). Collective communication, which includes operations like AllReduce and AllGather, is a bottleneck in distributed training, sometimes accounting for more than 30% of training time for large models. Existing libraries such as NCCL rely on fixed schedules that fail to adapt to different topologies, data sizes, and communication patterns, leading to wasted bandwidth or high latency. Previous attempts to synthesize optimized schedules using Mixed Integer Linear Programming (MILP) improve performance but suffer from scalability issues, often timing out on large GPU clusters. This gap motivates SyCCL, a system designed to scale collective schedule synthesis while maintaining near-optimal performance.
Key idea and contribution
The key contribution of this paper is SyCCL, a scalable collective communication schedule synthesizer that exploits symmetry in both communication patterns and GPU topologies. Instead of encoding the entire scheduling problem into a single large MILP formulation, SyCCL introduces the concept of a sketch. A sketch breaks down large collective demands into smaller, symmetric sub-demands across subsets of GPUs. These sub-demands are solved individually and then combined into complete schedules, significantly reducing the search space.
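To make the sketch idea concrete, the toy Python decomposition below splits one 16-GPU AllReduce into three stages of smaller, symmetric sub-collectives (intra-node ReduceScatter, inter-node AllReduce, intra-node AllGather). The SubDemand type and decompose_allreduce function are hypothetical illustrations, not SyCCL's actual data structures or API.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class SubDemand:
        kind: str          # e.g. "ReduceScatter", "AllReduce", "AllGather"
        ranks: List[int]   # GPUs participating in this sub-collective

    def decompose_allreduce(num_gpus: int, gpus_per_node: int) -> List[List[SubDemand]]:
        """Split one AllReduce into three symmetric stages of sub-demands."""
        nodes = [list(range(n, n + gpus_per_node))
                 for n in range(0, num_gpus, gpus_per_node)]
        stage1 = [SubDemand("ReduceScatter", node) for node in nodes]
        # One inter-node AllReduce per local rank, across same-offset GPUs.
        stage2 = [SubDemand("AllReduce", [node[r] for node in nodes])
                  for r in range(gpus_per_node)]
        stage3 = [SubDemand("AllGather", node) for node in nodes]
        return [stage1, stage2, stage3]

    for i, stage in enumerate(decompose_allreduce(16, 4), start=1):
        print(f"stage {i}: {[(d.kind, d.ranks) for d in stage]}")

Because each stage consists of structurally identical sub-demands, a solver only needs to handle one representative per stage, which is where the search-space reduction comes from.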
SyCCL further improves practicality with a two-phase synthesis process: (1) exploring sketches to maximize bandwidth utilization, and (2) using MILP modeling to synthesize near-optimal sub-schedules before assembling them. It also leverages isomorphism and parallelism, ensuring that once a sub-demand is solved for one symmetric group, its solution can be reused for the others. The implementation is production-ready, consisting of a profiler, a C++ synthesizer, and an executor integrated with MSCCL.
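The reuse step can be pictured as relabeling ranks: once the MILP phase has produced a sub-schedule for one group, an isomorphism maps it onto every other group with the same local topology. The send-tuple representation and remap_schedule helper below are assumptions for exposition, not SyCCL's schedule format.

    from typing import Dict, List, Tuple

    # A sub-schedule as a list of sends: (step, src_rank, dst_rank, chunk_id).
    Send = Tuple[int, int, int, int]

    def remap_schedule(schedule: List[Send], rank_map: Dict[int, int]) -> List[Send]:
        """Apply an isomorphism (a rank relabeling) to a solved sub-schedule."""
        return [(step, rank_map[src], rank_map[dst], chunk)
                for step, src, dst, chunk in schedule]

    # Sub-schedule solved once (e.g. by the MILP phase) for group [0, 1, 2, 3] ...
    solved = [(0, 0, 1, 0), (0, 2, 3, 1), (1, 1, 3, 0), (1, 3, 1, 1)]
    # ... reused for the isomorphic group [4, 5, 6, 7] without re-solving.
    print(remap_schedule(solved, {0: 4, 1: 5, 2: 6, 3: 7}))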
Evaluation
The authors evaluate SyCCL on real A100 clusters and simulated H800 clusters, comparing it against NCCL and the state-of-the-art synthesizer TECCL. Results show up to 91% performance improvement over TECCL and 108% over NCCL on 32 A100 GPUs, and up to 127% improvement on H800 clusters. Importantly, SyCCL reduces synthesis time by 2–4 orders of magnitude compared to TECCL, making it feasible for production-scale jobs. End-to-end training experiments on GPT-6.7B and Llama3-8B demonstrate up to 6.3% faster training iteration time compared to NCCL. This result is significant because it shows that SyCCL not only improves standalone collective benchmarks but also translates into real gains for large-scale ML training workloads.
Q&As
Q1: When does the synthesized schedule outperform the manually written (expert) one?
A1: The synthesizer aims to automatically generate optimal schedules based on factors such as topology, collective operation, and data size. Manual schedules require deep expertise about hardware and network characteristics and take time to update as topologies evolve or new GPUs/interconnects appear. An automatic synthesizer adapts more quickly and can outperform human-written schedules in such dynamic or large-scale settings.
Q2: How does SyCCL handle network dynamics such as link failures or bandwidth fluctuations? Does it support real-time or incremental updates to schedules?
A2: SyCCL first measures network parameters such as latency and bandwidth, then uses MILP modeling to compute an optimized schedule. If the network changes (for example, a link goes down or latency increases), the synthesized schedule may become suboptimal. In those cases, incremental synthesis or similar lightweight updates would be preferable to recomputing everything from scratch.
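As a rough illustration of how measured latency and bandwidth feed a schedule cost estimate, the snippet below uses the standard alpha-beta model to approximate a ring AllReduce; this is a generic textbook approximation with assumed numbers, not SyCCL's actual MILP formulation.

    def transfer_time(bytes_sent: float, alpha: float, bandwidth: float) -> float:
        """Time for one point-to-point transfer: latency plus serialization."""
        return alpha + bytes_sent / bandwidth

    def ring_allreduce_time(num_gpus: int, data_bytes: float,
                            alpha: float, bandwidth: float) -> float:
        """Classic ring AllReduce: 2*(N-1) steps, each moving data/N bytes."""
        steps = 2 * (num_gpus - 1)
        return steps * transfer_time(data_bytes / num_gpus, alpha, bandwidth)

    # Assumed numbers: 8 GPUs, 1 GiB buffer, 5 us link latency, 200 GB/s links.
    print(f"{ring_allreduce_time(8, float(1 << 30), 5e-6, 200e9) * 1e3:.2f} ms")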
Q3: Does the symmetry-based search space pruning introduce suboptimality, or is it purely a gain? Can you encode all constraints using your MILP-based synthesis?
A3: They do prune the search space when searching for sketches, but the paper compares results with and without pruning and shows that pruning does not hurt the optimality of the synthesized schedules.
Q4: How does SyCCL handle symmetric versus heterogeneous (asymmetric) topologies? Can it synthesize quickly, and what is the fallback strategy?
A4: The system is mainly designed for symmetric topologies. Heterogeneous topologies are uncommon in current GPU clusters. If such topologies exist, GPUs of the same type should be placed in the same network pod to retain intra-pod symmetry; the system can then search and solve schedules within each pod. Cross-pod synthesis may be slower, and handling fully heterogeneous topologies is left for future work.
Personal thoughts
I find this paper compelling because it strikes a balance between theoretical rigor and practical usability. The introduction of sketches is elegant, transforming an intractable scheduling problem into manageable subproblems while leveraging inherent symmetries in ML workloads and GPU clusters. The integration into existing frameworks like MSCCL also highlights the authors’ focus on deployability, which is often overlooked in purely academic work.
That said, one limitation is that while SyCCL excels in symmetry-rich topologies (e.g., Clos, Multi-rail), its advantages may diminish in more irregular or heterogeneous network setups. Another potential area of exploration is multi-collective scheduling, where overlapping communication patterns could be co-optimized. I also wonder how SyCCL’s performance holds up under dynamic conditions (e.g., failures, congestion) common in large-scale datacenters. Future work could extend SyCCL toward adaptive runtime scheduling that responds to system dynamics while still exploiting symmetry.
Overall, this paper makes a strong contribution to distributed ML systems, and I believe it sets a promising direction for both research and production systems.