Raha: A General Tool to Analyze WAN Degradation

Title: Raha: A General Tool to Analyze WAN Degradation

Authors

Behnaz Arzani (Microsoft Research); Sina Taheri (Microsoft); Pooria Namyar (University of Southern California); Ryan Beckett (Microsoft Research); Siva Kesava Reddy Kakarla (Microsoft Research); Elnaz Jallilipour (Microsoft)

Scribe: Rulan Yang (Xiamen University)


Introduction

None of the existing tools can help analyze the worst-case degradation of a network’s performance (traffic demand) under 1) arbitrary failure scenarios, 2) variable traffic demand, and 3) any path selection (any set of ordered paths) simutaneously. This is hard because the search space of such a problem is extramely huge. Raha, however, solves this problem using heuristic analysis, where the goal is to find the set of inputs (failures and demands) that maximize the gap between the healthy network (without failure) and an unhealthy network (with failure). Evaluation results show Raha can find ≥ 2× higher degradations compared to those tools that only consider up to 2 failures.


Key idea and contribution

The Raha model formulates the network performance analysis under varying demands and failures as a bi-level optimization, where the outer problem selects inputs (demands and failures) to maximize the performance gap between the optimal and heuristic algorithms, and the inner problems compute flow allocations. The optimal inner problem can be non-convex, while the heuristic inner problem must be convex for efficient evaluation. By moving non-convexity to the outer problem, Raha enables MetaOpt to handle complex constraints and compute flows for the chosen inputs.

Raha scales by decomposing the problem into manageable parts using fixed demands and clustering. When the operator provides a fixed demand matrix, the optimal flows of the healthy network become constants, reducing the problem to finding failure scenarios that maximize the gap with the unhealthy network. For large topologies, Raha divides the network into disjoint clusters and incrementally finds local demand matrices within and across clusters that approximate the worst-case impact. These local demands are combined to form a global demand matrix, which is then used to identify the failure scenario that maximizes network impact. Additionally, Raha leverages MetaOpt’s timeout feature to further improve scalability, stopping the solver when progress stalls while typically retaining near-optimal solutions.


Evaluation

Raha is implemented on top of MetaOpt and evaluated on both a production WAN in Africa and publicly available topologies from the Topology Zoo. The evaluation shows that Raha can consider all possible failures and variable traffic demands, identifying scenarios that cause significant performance degradation in ≤ 30 minutes for general cases, and within 10 minutes for fixed demands.
Raha consistently finds higher-impact failure scenarios under extreme variations in demand or large topologies, and its runtime scales reasonably with network size, number of primary paths, and probability thresholds. Techniques such as clustering and timeouts allow Raha to trade off optimality for faster computation, while still capturing meaningful degradations, with some scenarios showing up to 30% of traffic dropped. Additionally, Raha can guide network augmentation to mitigate risk, fully eliminating probable degradations in fewer than six augmentation steps. This result is significant because it demonstrates that network operators can proactively identify and mitigate worst-case performance degradations under realistic, arbitrary failure and demand conditions, a capability lacking in existing tools, thereby enabling more resilient WAN operations.

Personal thought

The core motivation of Raha is to perform worst-case degradation analysis of network performance under failure scenarios. Intuitively, one might think it is enough to test how much the peak performance drops under different failures. However, Raha shows this is not the case. Many factors—such as the number of failures (including probable failures), traffic demand, traffic engineering schemes, and path selection—can all affect performance degradation. Therefore, it is important to treat these factors as variables in worst-case analysis.

Instead of blindly using machine learning for this highly complex problem, Raha applies mathematical methods such as constraint solving, combined with clustering, to scale the solution. This is a smart approach. However, compared with systems like Jinggubang or Yu that use graph-based analysis, Raha’s modeling is less intuitive and may be harder for readers to understand.


Q&A

Q1: Why should operators care about finding the difference between healthy and failure states, instead of worst-case scenarios?

A1: Operators care about degradations that actually impact the network in practice. Worst-case failures are often unlikely and ignore the network’s design point. The true objective is to identify scenarios that cause real incidents in production.

Q2: How easy is it to obtain probabilities for failures?

A2: Operators give mixed feedback. With historical data (e.g., in Azure), probabilities can be computed using renewal reward theory. If data is unavailable, the system also works in a vanilla mode without probabilities.

Q3: What about rare “black swan” events?

A3: They are hard to estimate due to a lack of data. These tail events are theoretically interesting, but in practice, operators focus on common scenarios.