EP12: MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud, May 21, 2025

Paper : MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud
Authors : Yongji Wu, Yechen Xu, Jingrong Chen, Zhaodong Wang, Ying Zhang, Matthew Lentz, and Danyang Zhuo.
Presenter : Yuntao Zhao, Xiamen University.
Guest of Honor : Yongji Wu, Duke University.

Q: In the single-application part of the evaluation, you mentioned MCCS achieved about a 2.8x performance improvement. Does this come from flow scheduling?

A:

It comes from both parts: the collective algorithm selection as well as the network flow scheduling. The two work together.

Q: Considering that performance is not very good for small message sizes, do you have any solutions to optimize for small messages?

A:

Small-message performance can actually be improved significantly, at least for messages that are not extremely small (e.g., a few bytes). The overhead mainly comes from the multiple rounds of redundant message passing in our current prototype. Small messages are also used less frequently and contribute little to the overall running time.

Q: Will that influence the start time of the MCCS services?

A:

Currently, the communication kernel is only launched after our backend receives the command from the user application, but it is still launched asynchronously with respect to the application's GPU kernel execution. It basically still follows CUDA stream semantics.
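
To make the "asynchronous but ordered" behavior concrete, here is a minimal Python sketch of a command queue drained by a backend worker in FIFO order, mirroring how operations issued to a single CUDA stream execute in issue order. All names here (`all_reduce_async`, `backend_worker`) are hypothetical illustrations, not MCCS's actual API.

```python
import queue
import threading
import time

commands = queue.Queue()

def backend_worker():
    while True:
        op = commands.get()           # the backend receives the command...
        if op is None:
            break
        print(f"launching {op}")      # ...and only then launches the kernel,
        time.sleep(0.01)              # in the same order the app issued it
        commands.task_done()

threading.Thread(target=backend_worker, daemon=True).start()

def all_reduce_async(name):
    commands.put(name)                # returns immediately, like a stream enqueue

for i in range(3):
    all_reduce_async(f"all_reduce #{i}")
commands.join()                       # wait for enqueued work, like a stream sync
commands.put(None)                    # let the worker exit cleanly
```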

Q: Does MCCS support adaptive optimization of communication policies at runtime?

A:

We can definitely support that. Our QoS example is actually a showcase of how we can adjust the policies for each application dynamically.

Q: When there are many new applications, will there be any scalability issues? For example, could MCCS be limited by a fixed-size dedicated memory region?

A:

If we want to have control over all applications, the overhead basically comes from the centralized scheduling part. How each application is configured is separate from the MCCS system. For normal training or inference jobs, the communication patterns are quite static, so you only need coarse-grained reconfiguration when an application joins or leaves. This overhead will not be significant.

Q: In your experience, how much overhead does the MCCS service incur?

A:

In our evaluation, reconfiguring one application takes less than one second. This only happens when a training job completes and a new training job enters the system.

Q: In scenarios with both training and inference applications, since inference applications are latency-sensitive, will MCCS increase the latency of the inference applications?

A:

This is actually a good use case for MCCS, because we provide network administrators with flexible policy choices. We can dedicate some network links to the inference applications to ensure their latency is minimized. You cannot do this without MCCS in a public cloud.
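
As a rough illustration of such a policy, the hypothetical snippet below reserves specific links for a latency-sensitive inference job while a training job uses the shared pool. The field names and link identifiers are invented for illustration and are not MCCS's actual policy format.

```python
# Hypothetical administrator policy: dedicate links to the inference job.
policy = {
    "inference_job": {
        "priority": "high",
        "reserved_links": ["leaf1-spine0", "leaf2-spine0"],  # dedicated paths
    },
    "training_job": {
        "priority": "normal",
        "reserved_links": [],        # shares the remaining links
    },
}

for app, cfg in policy.items():
    links = cfg["reserved_links"] or ["(shared pool)"]
    print(f"{app}: priority={cfg['priority']}, links={', '.join(links)}")
```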

Q: Regarding network information exposure, is there a quantitative study on how unsafe it is to expose topology-related information to applications?

A:

There are problems with exposing network information. Cloud providers want to optimize their network so it does not become a bottleneck for applications, but they may want to keep the topology confidential because they are innovating on it. Another tricky aspect is bandwidth and utilization information: if you report high link utilization, clients fear network congestion; if you report low utilization, it suggests no one is using the cloud. This is very tricky information for cloud providers to expose, since it can hurt both ways.

Q: Compared to J.K. Lee's SIGCOMM '14 paper on TAG (Tenant Application Graph), what makes it difficult to extend that work to machine learning systems in a multi-tenant cloud?

A:

There’s a key difference. In traditional communication, you can specify requirements at an end-to-end level: node A sends some bytes to node B, and so on. The new space MCCS opens is that for ML workloads, there are knobs in choosing how a collective operation is realized as point-to-point communication. For example, for an all-reduce operation, you can choose between ring all-reduce and tree all-reduce, and even within ring all-reduce, the order in which the reduction proceeds around the ring. We move beyond the end-to-end, point-to-point abstraction to collective communication, so we can optimize beyond flow scheduling: we can choose collective communication algorithms together with flow scheduling.
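
As a reference for the algorithmic knob mentioned above, here is a minimal pure-Python simulation of ring all-reduce (sum): a reduce-scatter phase followed by an all-gather phase around the ring. It is an illustrative toy, not MCCS code; the node count and data in the example are made up.

```python
def ring_allreduce(data):
    """data: one equal-length list of numbers per simulated node (sum reduction)."""
    n = len(data)
    chunk = len(data[0]) // n                  # each node owns one chunk
    sl = lambda c: slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. At step s, node r pulls chunk (r - s - 1) % n
    # from its left neighbor and accumulates it into its own buffer.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s - 1) % n
            left = data[(r - 1) % n]
            data[r][sl(c)] = [a + b for a, b in zip(data[r][sl(c)], left[sl(c)])]

    # Phase 2: all-gather. At step s, node r pulls the fully reduced
    # chunk (r - s) % n from its left neighbor and overwrites its own copy.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            data[r][sl(c)] = list(data[(r - 1) % n][sl(c)])
    return data

# Example with two simulated nodes: both end up with the element-wise sum.
print(ring_allreduce([[1, 2], [3, 4]]))        # -> [[4, 6], [4, 6]]
```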

Q: Do you expect jitter to become more severe as more applications are added? Is there an approximate range for the number of applications?

A:

The jitter mainly comes from multiple applications sharing the same network links. Our policy is just a showcase of MCCS's scheduling capability using simple policies. We don't expect thousands of applications to share the same network link; that doesn't make sense. We use simple time-window-based traffic scheduling, and with more sophisticated scheduling mechanisms this can definitely be addressed.
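
Here is a toy sketch of what time-window-based traffic scheduling on one shared link could look like, assuming (hypothetically) that each application is assigned windows in proportion to a configured share. The function and parameter names are made up for illustration and do not reflect MCCS's actual scheduler.

```python
from itertools import cycle

def build_schedule(shares, window_ms=5, horizon_ms=100):
    """shares: {app_name: weight}. Returns a list of (start_ms, end_ms, app)."""
    # Expand weights into a repeating round-robin pattern, e.g. {"A": 2, "B": 1}
    # becomes A, A, B, A, A, B, ...
    pattern = [app for app, w in shares.items() for _ in range(w)]
    slots = cycle(pattern)
    return [(start, start + window_ms, next(slots))
            for start in range(0, horizon_ms, window_ms)]

# Example: a training job with twice the share of a smaller job.
for start, end, app in build_schedule({"train_job": 2, "small_job": 1})[:6]:
    print(f"[{start:3d}-{end:3d} ms] link assigned to {app}")
```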

Q: Now that collective communication becomes a cloud service, would it be possible to charge for it, with different billing strategies for different tenants?

A:

That’s definitely a good motivation for cloud providers to implement MCCS. If you have a highly prioritized workload that is willing to pay significantly more, you can have dedicated network links reserved for it. The current charging model of GPU clusters is mostly based on GPU resources, but cloud providers spend significant resources on network infrastructure, so they would like to earn more from network resources.