Predictable Network Practice in Alibaba Cloud
Speaker: Jiaqi Gao (Alibaba Cloud)
Scribes: Gao Han, Kexin Yu, Yao Wang (Xiamen University)
This talk introduced Alibaba Cloud’s global network infrastructure and its role in large-scale machine learning infrastructure. The main points are summarized below:
- Global Network Infrastructure:
- Alibaba operates a global backbone network that supports all of its businesses, including Taobao and Alibaba Cloud.
- The network research team at Alibaba has made significant contributions to the design, implementation, and management of these networks, and has published several papers at top conferences.
- Scale-Driven Network Research:
- The team's research is driven by scale: the goal is to deliver predictable performance at every scale, from small deployments to massive ones.
- Their work primarily addresses issues such as reliability, operational efficiency, and traffic management in the backbone network.
- Large-Scale Machine Learning Infrastructure:
- With the rapid growth of large language models built on the Transformer architecture, the demand for computational power has increased significantly. Alibaba aims to provide “exascale” computing power to support the training and inference of these models.
- To achieve this, the network research team is building a high-throughput, low-latency, scalable network infrastructure.
- Efficient Network Architecture Design:
- The presentation introduced the HPN 7.0 architecture, a predictable data center network designed for AI, which aims to provide stable high-performance interconnects and optimized communication.
- This architecture uses a multi-layer design, including multipath algorithms and a globally coordinated collective communication library, to enhance training throughput and efficiency.
- Future Challenges and Research Directions:
- The speaker highlighted future challenges, such as cross-layer optimization and hardware-software co-design, which are critical to further improving system performance and resilience.
- Alibaba’s network research team is keen to collaborate with academia to advance research in these areas.
- Open Collaboration:
- The speaker emphasized that Alibaba’s network research team is open to collaboration with academia, aiming to share experiences and jointly advance infrastructure development and technological innovation.
Q&A
Q: How far should programmability extend in a network? Should it be limited to the management plane, or should it reach the control plane or even the data plane?
A: Different levels of the network require different levels of programmability.
- Data Plane: Hardware programmability is crucial for performance-critical applications with microsecond or nanosecond latency requirements. Software approaches are insufficient in these cases.
- Control and Management Planes: These planes are already programmable in software, for example with Python, which is used to implement training frameworks. Existing solutions are sufficient at these levels.
While hardware programmability offers potential benefits, current commercial hardware is limited and can introduce vulnerabilities.