MegaTE: Extending WAN Traffic Engineering to Millions of Endpoints in Virtualized Cloud

Feiyan_Ding · July 30, 2024, 11:31am

Title: MegaTE: Extending WAN Traffic Engineering to Millions of Endpoints in Virtualized Cloud

Speaker: Congcong Miao (Tencent)
Scribe: Feiyan Ding (Xiamen University)

Authors: Congcong Miao (Tencent); Zhizhen Zhong (Massachusetts Institute of Technology); Yunming Xiao (Northwestern University); Feng Yang, Senkuo Zhang (Tencent); Yinan Jiang, Zizhuo Bai (Peking University); Chaodong Lu, Jingyi Geng, Zekun He, Yachen Wang, Xianneng Zou (Tencent); Chuanchuan Yang (Peking University)

Introduction
The paper addresses the challenge of optimizing traffic engineering in modern virtualized cloud environments, where traditional systems struggle to manage the vast number of individual traffic flows generated by containers and virtual machines. Effective traffic engineering is crucial as it directly impacts application performance and user experience, especially for latency-sensitive applications. Existing TE systems fall short because they rely on centralized control models that cannot scale to handle the dynamic and numerous endpoints typical of modern clouds, leading to inefficient resource allocation and increased latency.

Key idea and contribution
The paper presents MegaTE, a traffic engineering (TE) system designed to handle millions of endpoints in virtualized cloud environments. MegaTE shifts from a centralized, top-down control model to a scalable, bottom-up asynchronous query mechanism. It integrates eBPF for segment routing at the data plane level and uses network contraction techniques for optimization on the control plane.
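To make the bottom-up query idea concrete, here is a minimal sketch of a pull-based design. It is an illustration under assumptions, not the paper's implementation: the names `TEStore` and `EndpointAgent`, the version counter, and the in-memory dictionary are all hypothetical. The point it shows is that the controller only publishes decisions, while each endpoint asks for its own entries when it is ready, so the controller never has to push state to millions of endpoints.

```python
from dataclasses import dataclass, field


@dataclass
class TEStore:
    """Hypothetical versioned store that the controller writes into."""
    version: int = 0
    paths: dict = field(default_factory=dict)  # (src, dst) -> list of segment hops

    def publish(self, new_paths):
        """Controller side: replace the path table and bump the version."""
        self.paths = dict(new_paths)
        self.version += 1

    def query(self, endpoint, known_version):
        """Endpoint side: return (version, own entries), or None if unchanged."""
        if known_version == self.version:
            return None
        mine = {k: v for k, v in self.paths.items() if k[0] == endpoint}
        return self.version, mine


@dataclass
class EndpointAgent:
    """Hypothetical per-endpoint agent that polls the store on its own schedule."""
    name: str
    version: int = -1
    paths: dict = field(default_factory=dict)

    def poll(self, store):
        """Fetch new paths if the store has changed; report whether it did."""
        result = store.query(self.name, self.version)
        if result is None:
            return False
        self.version, self.paths = result
        return True


store = TEStore()
store.publish({("vm-a", "vm-b"): ["s1", "s3"], ("vm-c", "vm-b"): ["s2", "s3"]})
agent = EndpointAgent("vm-a")
changed = agent.poll(store)        # first poll fetches only vm-a's entries
changed_again = agent.poll(store)  # second poll finds nothing new
```

In this sketch an endpoint that polls late simply picks up the latest version, which is one way an asynchronous design can tolerate endpoints that appear and disappear frequently.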

MegaTE comprises two main components: the MegaTE Control Plane and the MegaTE Data Plane. The Control Plane optimizes traffic flows and handles the NP-hard complexity of the TE problem with a two-stage algorithm. The first stage, MaxSiteFlow, allocates bandwidth between site pairs. The second stage, MaxEndpointFlow, allocates bandwidth to individual endpoint pairs by formulating the allocation as a subset sum problem (SSP) and solving it efficiently with the FastSSP algorithm. The Data Plane uses eBPF to implement the routing decisions made by the Control Plane, identifying each virtual instance precisely and managing its flows. Packets are thus routed along the chosen paths in a way that meets the quality of service (QoS) requirements of each virtual instance.
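The second stage can be illustrated with a small sketch. This is not FastSSP, whose details are not given in these notes. It is a textbook subset-sum dynamic program, and it assumes that the first stage has already produced an integer bandwidth allocation per tunnel for one site pair. The function names, flow IDs, and tunnel names are hypothetical.

```python
def best_subset(demands, capacity):
    """Pick flows whose total demand is as large as possible within capacity.

    demands: dict mapping flow id -> integer demand.
    Returns the set of chosen flow ids (classic 0/1 subset-sum DP).
    """
    reachable = {0: []}  # reachable total -> one list of flows achieving it
    for flow, demand in demands.items():
        # Iterate over a snapshot so each flow is used at most once.
        for total, chosen in list(reachable.items()):
            new_total = total + demand
            if new_total <= capacity and new_total not in reachable:
                reachable[new_total] = chosen + [flow]
    return set(reachable[max(reachable)])


def assign_endpoint_flows(demands, tunnel_alloc):
    """Fill each tunnel's site-level allocation with endpoint flows.

    tunnel_alloc: dict mapping tunnel -> integer bandwidth from the
    site-level stage (assumed input). Returns (placement, unplaced).
    """
    remaining = dict(demands)
    placement = {}
    for tunnel, capacity in tunnel_alloc.items():
        for flow in best_subset(remaining, capacity):
            placement[flow] = tunnel
            del remaining[flow]
    return placement, remaining


demands = {"f1": 4, "f2": 3, "f3": 5, "f4": 2}   # endpoint-pair demands
tunnel_alloc = {"t1": 7, "t2": 6}                # assumed stage-one output
placement, unplaced = assign_endpoint_flows(demands, tunnel_alloc)
```

The dynamic program runs in time proportional to the number of flows times the tunnel capacity, which is why a faster SSP solver matters once a site pair carries a very large number of endpoint flows.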

Evaluation
MegaTE was evaluated with large-scale flow-level simulations driven by real-world traffic traces. The results show that MegaTE supports a network with 20 times more endpoints than previous TE systems at similar algorithm run times. It reduced packet latency for time-sensitive applications by up to 51%, which directly benefits latency-sensitive services. In failure scenarios, MegaTE satisfied up to 8.2% more demand than leading TE algorithms. MegaTE has been deployed in Tencent's cloud WAN since December 2022, where it has maintained high availability for priority applications and reduced costs for lower-priority applications by 50%.

Q: Why is it necessary to insert flow entries into routers, and how does this approach accommodate both long- and short-duration flows? Could you explain the evaluation process, and how network topologies are obtained and availability is measured? Additionally, how is the interface between applications and networking managed, especially in the context of AI applications for cloud services?

A: Inserting flow entries is crucial for traffic management; it addresses the needs of high-traffic applications by processing information on the server. Our system is designed to work well for both long- and short-duration flows, and short flow lifetimes do not affect its operation. For the evaluation, we build network models that simulate real-world conditions, starting with base topologies and adding endpoints at router sites. Availability is measured through continuous cloud monitoring, which provides the data we use to report application availability. For AI applications in the cloud, we provide the underlying network infrastructure, but our team does not manage the detailed deployment of these applications. We focus on measuring the performance and service level agreements of our internal cloud applications.

Personal thoughts
MegaTE is a significant advance in traffic engineering for virtualized cloud environments, scaling TE to millions of endpoints. Its production deployment in Tencent’s WAN adds to its credibility. In my view, two open questions would be interesting to explore: how MegaTE integrates with existing network infrastructure, and how compatible it is with other cloud providers’ environments.