RedTE: Mitigating Subsecond Traffic Bursts with Real-time and Distributed Traffic Engineering

siyong_huang · July 30, 2024, 11:30am

Title : RedTE: Mitigating Subsecond Traffic Bursts with Real-time and Distributed Traffic Engineering

Speaker : Kaihui Gao (Zhongguancun Laboratory)
Scribe : Siyong Huang (Xiamen University)

Authors : Fei Gui (Tsinghua University); Songtao Wang (Zhongguancun Laboratory); Dan Li (Tsinghua University); Li Chen, Kaihui Gao (Zhongguancun Laboratory); Congcong Min (Guangdong Communications & Networks Institute); Yi Wang (Institute of Future Networks, Southern University of Science and Technology)

Introduction :
Internet traffic is inherently bursty, leading to significant challenges such as queue buildup and packet loss, which adversely affect the performance of latency-sensitive applications. Traditional methods for mitigating traffic bursts are either too coarse-grained or fail to effectively address the unpredictability and millisecond-scale nature of these bursts. Existing approaches, such as end-host flow control and device-local traffic management, often fall short as they either do not scale well or fail to capture global traffic patterns, leading to suboptimal burst mitigation.

Key idea and contribution :
The key insight of RedTE is that traditional traffic engineering methods, which rely on centralized control and slower decision-making, cannot react quickly enough to manage subsecond traffic bursts. To meet the need for real-time response, RedTE combines a distributed control approach with multi-agent reinforcement learning to improve burst mitigation.
RedTE’s design focuses on three main innovations to tackle the challenges of traffic bursts. First, it employs a distributed traffic engineering approach where routers at the network edge make localized, real-time routing decisions based on current network conditions. This decentralization reduces the control loop latency compared to centralized systems. Second, RedTE integrates multi-agent reinforcement learning (MARL), specifically using the MADDPG algorithm, to enable routers to make globally informed decisions while operating with only local data. This allows for effective cooperation among routers to achieve optimal network performance. Third, RedTE incorporates circular traffic replay during model training to handle the randomness of traffic arrivals and improve the convergence of reinforcement learning. Additionally, it optimizes the deployment phase by designing a reward function that balances TE performance with the number of updated entries in routing tables, thus enhancing both the accuracy of traffic management and the efficiency of rule updates. Together, these features enable RedTE to respond swiftly to traffic bursts and improve overall network efficiency.
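The reward trade-off described above can be illustrated with a minimal sketch. It assumes the reward is a weighted sum of negative MLU and the number of changed routing entries; the exact form used by RedTE is not given in this summary. The names `SplitTable`, `count_updated_entries`, `reward`, and the weight `alpha` are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical representation: (destination, path_id) -> traffic split ratio.
SplitTable = Dict[Tuple[str, int], float]


def count_updated_entries(old: SplitTable, new: SplitTable, tol: float = 1e-6) -> int:
    """Count routing entries whose split ratio changes by more than tol."""
    keys = set(old) | set(new)
    return sum(1 for k in keys if abs(old.get(k, 0.0) - new.get(k, 0.0)) > tol)


def reward(mlu: float, old: SplitTable, new: SplitTable, alpha: float = 0.01) -> float:
    """Higher is better: penalize both high MLU and routing-table churn.

    alpha is an assumed weight for illustration, not a value from the paper.
    """
    return -mlu - alpha * count_updated_entries(old, new)


old_table = {("B", 0): 0.5, ("B", 1): 0.5, ("C", 0): 1.0}
new_table = {("B", 0): 0.7, ("B", 1): 0.3, ("C", 0): 1.0}

print(count_updated_entries(old_table, new_table))    # 2
print(reward(0.5, old_table, new_table, alpha=0.25))  # -1.0
```

A larger `alpha` pushes agents toward decisions that touch fewer entries, at the cost of accepting a somewhat higher MLU.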

Evaluation :
RedTE was evaluated using generated datasets from different topologies and demonstrated substantial improvements over existing traffic engineering solutions. It reduced the average normalized maximum link utilization (MLU) by up to 42.2% and the maximum queue length by up to 75% on these datasets. The system’s control loop latency was reduced to less than 100 milliseconds, a significant improvement compared to prior methods. These results indicate that RedTE can mitigate bursts in a timely and effective way, giving better network performance and reliability under dynamic traffic conditions.
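For readers unfamiliar with the MLU metric, here is a small sketch of how it can be computed. The normalization shown (dividing by the MLU of an optimal solution) is a common convention in the TE literature and is an assumption here; this summary does not state which baseline RedTE normalizes against. Link names, loads, and capacities are made-up illustration values.

```python
from typing import Dict


def max_link_utilization(loads: Dict[str, float], capacities: Dict[str, float]) -> float:
    """MLU: the highest load-to-capacity ratio across all links."""
    return max(loads[link] / capacities[link] for link in loads)


def normalized_mlu(mlu: float, optimal_mlu: float) -> float:
    """Assumed convention: 1.0 means the scheme matches the optimal solution."""
    return mlu / optimal_mlu


# Hypothetical two-link example (units are arbitrary but consistent).
loads = {"A-B": 7.5, "B-C": 2.0}
capacities = {"A-B": 10.0, "B-C": 10.0}

mlu = max_link_utilization(loads, capacities)
print(mlu)                       # 0.75
print(normalized_mlu(mlu, 0.5))  # 1.5
```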

Questions and opinions :
Q1: How does RedTE address network dynamics and local decisions, particularly when each router’s agent only has a partial view of network traffic? If multiple routers experience burst traffic simultaneously and choose the same path without coordination, how are synchronization issues resolved?
A1: RedTE manages this by dividing the process into two phases: training and inference. During the training phase, the system uses a global critic model that has a comprehensive view of the network. This allows agents to cooperate based on a global reward function, even if they only have local information. The agents learn from past experiences to make coordinated decisions. In the inference phase, agents apply the knowledge gained during training to handle real-time traffic, effectively managing similar traffic patterns despite having only partial information. RedTE relies on the predictability built into the Traffic Management System (TMS). The system assumes that traffic patterns observed during training will recur during inference. Agents use their training to recognize and respond to similar scenarios in real time. This predictive capability allows them to make informed decisions and manage traffic effectively, even with incomplete information about other routers’ states.
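The two-phase structure in A1 (a global critic during training, local-only agents at inference) can be sketched as follows. This is a structural illustration only: random linear weights stand in for trained neural networks, gradient updates are omitted, and the class names `Actor` and `GlobalCritic` along with all dimensions are hypothetical rather than taken from RedTE.

```python
import math
import random
from typing import List, Sequence


class Actor:
    """Per-router policy: sees only its own local observation."""

    def __init__(self, obs_dim: int, num_paths: int, seed: int) -> None:
        rng = random.Random(seed)
        # Random linear weights stand in for a trained neural network.
        self.weights = [[rng.uniform(-1.0, 1.0) for _ in range(obs_dim)]
                        for _ in range(num_paths)]

    def act(self, local_obs: Sequence[float]) -> List[float]:
        """Return split ratios over candidate paths (softmax of linear scores)."""
        scores = [sum(w * x for w, x in zip(row, local_obs)) for row in self.weights]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]


class GlobalCritic:
    """Training-only model: scores the joint observations and joint actions."""

    def __init__(self, joint_dim: int, seed: int) -> None:
        rng = random.Random(seed)
        self.weights = [rng.uniform(-1.0, 1.0) for _ in range(joint_dim)]

    def q_value(self, all_obs: Sequence[Sequence[float]],
                all_actions: Sequence[Sequence[float]]) -> float:
        joint = [x for obs in all_obs for x in obs]
        joint += [a for action in all_actions for a in action]
        return sum(w * x for w, x in zip(self.weights, joint))


# Two hypothetical edge routers, each with 3 local features and 2 candidate paths.
actors = [Actor(obs_dim=3, num_paths=2, seed=i) for i in range(2)]
critic = GlobalCritic(joint_dim=2 * 3 + 2 * 2, seed=42)
observations = [[0.9, 0.1, 0.4], [0.2, 0.8, 0.5]]

# Inference: each router decides from its own observation; no critic is involved.
actions = [actor.act(obs) for actor, obs in zip(actors, observations)]

# Training: the critic also evaluates the joint state and actions. In MADDPG its
# output would drive the actors' gradient updates, which are omitted here.
q = critic.q_value(observations, actions)
```

The point of the sketch is that `Actor.act` never receives another router's observation, while `GlobalCritic.q_value` receives everything, which is how coordination can be learned during training without requiring communication at inference.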

Personal thoughts :
RedTE presents an intriguing approach to traffic engineering by combining distributed decision-making with MARL, similar to the SIGCOMM’23 paper Teal, which uses the COMA algorithm. The choice of MARL and specifically the MADDPG algorithm in RedTE offers an interesting contrast to Teal’s COMA, potentially providing different advantages in cooperation among agents. One point of curiosity is RedTE’s decision to include a consideration for the number of route entries in its reward function. This inclusion might seem to trade off some performance to reduce the number of entries updated, possibly sacrificing optimal traffic distribution for efficiency in deployment. However, RedTE demonstrates notable improvements in reducing control loop latency and enhancing network performance, suggesting that the balance it strikes is effective. It would be valuable to further explore how this trade-off impacts performance in diverse network scenarios and whether the system can dynamically adjust its focus between minimizing route entries and maximizing TE performance based on current network conditions.