Title: MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud
Authors: Yongji Wu (Duke University), Yechen Xu, Jingrong Chen, Zhaodong Wang, Ying Zhang, Matthew Lentz and Danyang Zhuo
Scribe: Mengrui Zhang (Xiamen University)
Introduction: Collective communication is the standard method for synchronizing data among workers during distributed machine learning training using various parallelization strategies. However, implementing collective communication algorithms through libraries is not ideal for multi-tenant cloud environments. Tenants lack awareness of the underlying physical network configuration and how other tenants utilize the shared cloud network, hindering the library’s ability to select the optimal algorithm. This paper introduces a novel approach to collective communication in multi-tenant cloud environments for machine learning model training.
Key Idea and Contribution: The authors propose Managed Collective Communication Service (MCCS) as a solution to the limitations of traditional collective communication libraries. MCCS re-architects collective communication by integrating it directly into the cloud network and offering it as a service. This decouples the implementation of collectives from the application, enabling joint optimization of multiple applications’ collective strategies and network scheduling. MCCS supports dynamic reconfiguration and policy enforcement by leveraging a service-based architecture that allows for topology-aware and multi-tenant-aware scheduling. This innovative approach provides greater flexibility and performance optimization opportunities than existing library-based methods.
Evaluation: The evaluation of MCCS demonstrates significant performance improvements. By employing topology-aware scheduling, MCCS achieves a speedup of 2.8 times in single-application scenarios and 2.5 times in multi-application scenarios compared to traditional approaches. Additionally, MCCS effectively enforces QoS policies, demonstrating its capability to prioritize high-priority applications’ traffic and improve throughput. These results are significant because they show MCCS’s potential to enhance performance and manageability in multi-tenant data center environments, addressing the key limitations of existing collective communication solutions.
Q1: In your problem statement, you said the current collective library has issues with being loosely coupled with the network. But in your talk, you just mentioned topology. What else have you done for the coupling part?
A1: The network is typically shared, particularly in cluster or cloud environments. This means both the topology and the flows are shared, so flow scheduling matters as well. In our work, we primarily use topology information to perform topology-aware rank assignment. Additionally, we explicitly pin each flow to one of the paths in the network. So that’s another thing we do.
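To make the idea of topology-aware rank assignment concrete, here is a minimal sketch. It is not MCCS's actual algorithm: it assumes a ring-based collective and a simple two-level topology where each host belongs to one rack. The host names, the `rack_of` mapping, and both function names are hypothetical. The sketch orders ranks so that hosts in the same rack sit next to each other in the ring, which reduces the number of ring edges that must cross racks.

```python
from collections import defaultdict


def cross_rack_hops(ring, rack_of):
    """Count ring edges whose two endpoints sit in different racks."""
    n = len(ring)
    return sum(
        rack_of[ring[i]] != rack_of[ring[(i + 1) % n]] for i in range(n)
    )


def topology_aware_ring(hosts, rack_of):
    """Group hosts by rack so same-rack hosts are adjacent in the ring.

    Illustrative heuristic only; a real service would also account for
    link capacities and for other tenants' traffic.
    """
    by_rack = defaultdict(list)
    for host in hosts:
        by_rack[rack_of[host]].append(host)
    ring = []
    for rack in sorted(by_rack):
        ring.extend(by_rack[rack])
    return ring


# Hypothetical placement: four hosts interleaved across two racks.
rack_of = {"h0": "r0", "h1": "r1", "h2": "r0", "h3": "r1"}
naive = ["h0", "h1", "h2", "h3"]
aware = topology_aware_ring(naive, rack_of)

print(naive, cross_rack_hops(naive, rack_of))
print(aware, cross_rack_hops(aware, rack_of))
```

In this toy placement, the naive rank order crosses racks on every ring edge, while the grouped order crosses only where the ring moves from one rack to the next. A tenant-side library cannot easily apply such an ordering because it does not see the rack layout, which is the motivation for moving the decision into a provider-run service.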
Q2: When scheduling flows onto paths, how does your approach differ from existing work?
A2: The difference is that we are enforcing this at the application layer. We do not need to change the network settings or switch settings. So that’s the main difference here.
Personal Thoughts: MCCS effectively addresses the limitations of traditional library-based approaches by moving collective communication into a service-based architecture. This approach not only enhances performance but also provides greater flexibility in handling diverse workloads in multi-tenant environments.