FIGRET: Fine-Grained Robustness-Enhanced Traffic Engineering

siyong_huang · July 30, 2024, 11:31am

Title : FIGRET: Fine-Grained Robustness-Enhanced Traffic Engineering

Speaker : Ximeng Liu (Shanghai Jiao Tong University)
Scribe : Siyong Huang (Xiamen University)

Authors : Ximeng Liu, Shizhen Zhao (Shanghai Jiao Tong University); Yong Cui (Tsinghua University); Xinbing Wang (Shanghai Jiao Tong University)

Introduction :
Traffic engineering (TE) has been studied extensively because of its importance in allocating network resources, especially as the volume of network traffic keeps growing. However, existing TE schemes handle traffic bursts poorly: they either ignore bursts altogether or guard against them uniformly across all traffic. Poorly managed bursts can cause severe congestion, delays, and packet loss. There is therefore a pressing need for a TE solution that manages bursts dynamically and efficiently without degrading overall network performance.

Key idea and contribution :
The authors introduce FIGRET, a Fine-Grained Robustness-Enhanced Traffic Engineering scheme designed to address the limitations of existing TE methods. The key insight is that traffic characteristics differ across source-destination pairs. Based on this observation, FIGRET provides customized robustness enhancements by imposing different upper bounds on the paths of different source-destination pairs.
However, it is hard to determine appropriate future demand matrices and bounds from historical traffic data. FIGRET bypasses the need for complex linear programming by using deep learning with a burst-aware loss function. It maps historical traffic patterns directly to routing configurations, which significantly reduces complexity and improves scalability. To generate high-quality TE solutions and capture the differing traffic characteristics, the burst-aware loss function penalizes high-variance traffic demands by incorporating both the maximum link utilization and the variance of traffic demands into the loss calculation. This fine-grained approach gives burst-prone flows stricter robustness measures while leaving stable flows more flexibility, thereby optimizing performance across varied traffic scenarios.
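To make the idea of a burst-aware loss concrete, here is a minimal sketch in plain Python. It is not the paper's exact formula: the penalty form (variance times the largest split ratio per unit of bottleneck capacity), the weight `alpha`, and all function and variable names are assumptions chosen for illustration. It only shows how a maximum link utilization term and a variance-weighted term could be combined.

```python
def max_link_utilization(demands, split_ratios, path_links, capacities):
    """Return the MLU for one demand snapshot under the given split ratios.

    demands[p]         : demand of source-destination pair p
    split_ratios[p][k] : fraction of pair p's demand sent on its k-th path
    path_links[p][k]   : link ids traversed by that path
    capacities[l]      : capacity of link l
    """
    load = [0.0] * len(capacities)
    for pair, demand in enumerate(demands):
        for path, ratio in enumerate(split_ratios[pair]):
            for link in path_links[pair][path]:
                load[link] += demand * ratio
    return max(l / c for l, c in zip(load, capacities))


def burst_aware_loss(demands, variances, split_ratios, path_links,
                     capacities, alpha=1.0):
    """MLU plus a variance-weighted penalty (assumed form, for illustration)."""
    mlu = max_link_utilization(demands, split_ratios, path_links, capacities)
    penalty = 0.0
    for pair, variance in enumerate(variances):
        # Largest share of this pair's traffic per unit of bottleneck capacity:
        # a burst on this pair hurts most on that path.
        worst = max(
            ratio / min(capacities[link] for link in path_links[pair][path])
            for path, ratio in enumerate(split_ratios[pair])
        )
        # High-variance pairs pay more for concentrating traffic on one path.
        penalty += variance * worst
    return mlu + alpha * penalty


# Toy case: one pair, two single-link paths of capacity 10 each.
demands, variances = [8.0], [4.0]
path_links, capacities = [[[0], [1]]], [10.0, 10.0]
balanced = burst_aware_loss(demands, variances, [[0.5, 0.5]],
                            path_links, capacities)
concentrated = burst_aware_loss(demands, variances, [[1.0, 0.0]],
                                path_links, capacities)
print(balanced < concentrated)  # True
```

In this toy case, spreading a high-variance demand over both paths gives a lower loss than concentrating it on one, which is the behavior the summary above describes. In a learning setup the split ratios would come from the neural network's output rather than being fixed by hand.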

Evaluation :
The evaluation involved extensive testing on real-world networks, including WAN datasets and data center topologies. The results show that FIGRET significantly outperforms existing TE schemes. For instance, compared to Google’s TE system in their data centers, FIGRET reduced the average maximum link utilization (MLU) by 9%-34% and sped up solution computation by 35×-1800×. Furthermore, against DOTE, a leading deep learning-based TE method, FIGRET reduced significant congestion events caused by bursts by 41%-53.9%. These results matter because they show that FIGRET can improve both network performance and reliability, which is crucial for maintaining service quality in increasingly dynamic traffic environments.

Questions and opinions :
Q1: How do you determine which traffic is bursty?
A1: By calculating the variance of each traffic demand in the historical data; demands with high variance are treated as bursty.
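
The variance computation mentioned in A1 can be sketched as follows. This is an illustrative sketch, not code from the paper: the data layout (a list of snapshots mapping a source-destination pair to its demand), the function names, and the `threshold` cutoff are all assumptions. FIGRET itself may use the variance as a continuous weight rather than a hard cutoff.

```python
from statistics import pvariance


def per_pair_variance(history):
    """Population variance of each pair's demand across historical snapshots.

    history: list of dicts, each mapping (src, dst) -> demand (assumed layout).
    """
    pairs = history[0].keys()
    return {pair: pvariance([snap[pair] for snap in history]) for pair in pairs}


def bursty_pairs(history, threshold):
    """Pairs whose variance exceeds a hypothetical threshold."""
    return {pair for pair, var in per_pair_variance(history).items()
            if var > threshold}


history = [
    {("a", "b"): 10.0, ("a", "c"): 1.0},
    {("a", "b"): 10.0, ("a", "c"): 9.0},
    {("a", "b"): 10.0, ("a", "c"): 2.0},
]
print(bursty_pairs(history, threshold=1.0))  # {('a', 'c')}
```

Here the pair ("a", "b") has constant demand and zero variance, while ("a", "c") fluctuates and is flagged.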

Personal thoughts :
The paper presents an innovative approach to traffic engineering by recognizing and leveraging the heterogeneous nature of network traffic. I appreciate the practical relevance of FIGRET: it addresses a real-world problem that many data center and WAN operators face, and it builds on a key and valid observation that network traffic exhibits different characteristics across source-destination pairs. The use of deep learning to bypass the traditional traffic prediction bottleneck is particularly noteworthy, showcasing an effective application of AI in network management.
However, the paper uses a fully connected network (FCN) rather than a graph neural network (GNN), which makes it hard to handle dynamic scenarios and input transformations such as reordering paths. In addition, according to the running example in Figure 3 of the paper (not shown in the presentation), when rare cases such as situation 1/2 occur, this method leads to more severe congestion than in normal cases. The experiments on worst-case performance also support this. Finally, I wonder whether the TE controller recomputes and redeploys the solution using the actual demand matrix once it arrives.