A General and Efficient Approach to Verifying Traffic Load Properties under Arbitrary k Failures

qin · July 30, 2024, 11:38am

Title: A General and Efficient Approach to Verifying Traffic Load Properties under Arbitrary 𝑘 Failures

Authors: Ruihan Li (Peking University & Alibaba Cloud); Yifei Yuan, Fangdan Ye, Mengqi Liu, Ruizhen Yang, Yang Yu, Tianchen Guo, Qing Ma, Xianlong Zeng (Alibaba Cloud); Chenren Xu (Peking University); Dennis Cai, Ennan Zhai (Alibaba Cloud)

Introduction
The paper studies the problem of verifying traffic load properties (TLPs) in Wide Area Networks (WANs) under arbitrary failure scenarios, a critical issue for maintaining network reliability and preventing service outages. Ensuring that networks meet TLPs, such as avoiding overloaded links, is vital for the performance and availability of services, especially in production environments like Alibaba Cloud’s global WAN, which serves over one billion customers. The challenge lies in the need for a verification system that can handle the complexity and scale of real-world networks while being able to model a wide range of protocols and failure scenarios. Existing systems and tools, such as QARC and Jingubang, fall short due to limitations in their generality and efficiency. QARC, for instance, assumes shortest-path-based forwarding and cannot handle protocols beyond this, while Jingubang, although more general, struggles with efficiency under arbitrary failure scenarios. These tools are not designed to verify TLPs and cannot scale to production WANs with thousands of links and complex, diverse protocols. The paper addresses these limitations by presenting YU, a verification system that can scale to production WANs and effectively identify potential failure scenarios that could lead to traffic load violations.

Key idea and contribution
The authors have developed a verification system named YU, which represents a significant leap forward in network traffic load property (TLP) verification. YU is the first system capable of checking TLPs under arbitrary failure scenarios at a scale that is applicable to production Wide Area Networks (WANs). The key innovation of YU lies in its ability to address two major challenges: generality and efficiency. For generality, YU employs a symbolic traffic execution approach, inspired by symbolic execution in computer programs, which allows it to model the forwarding behavior of a variety of network protocols under different failure scenarios. This approach is more versatile than previous methods that were limited to shortest-path-based forwarding or single-failure reasoning.
To tackle the efficiency challenge, YU introduces novel techniques such as k-failure-equivalence and link-local-equivalence reduction. These techniques optimize the symbolic traffic execution process by reducing the computational overhead associated with large-scale WANs and the substantial number of traffic flows they handle. YU utilizes multi-terminal binary decision diagrams (MTBDDs) for compact constraint encoding, and it proposes a k-failure-equivalence reduction technique that simplifies MTBDDs while ensuring verification correctness under scenarios with no more than a given degree of failures. Additionally, YU leverages link-local flow-equivalence to reduce the number of flows that need to be checked for TLP verification on a per-link basis, thus enhancing efficiency.

Evaluation
The evaluations of YU were conducted using Alibaba Cloud’s production WAN and synthetic datasets, demonstrating its effectiveness and efficiency in real-world scenarios. The system was tested on networks of varying sizes, including a full WAN with over 1000 routers and sub-networks of different scales. The results showed that YU can verify TLPs for the entire WAN within 28.6 minutes for up to two link failures, a significant achievement considering the complexity and size of the network. Compared to existing methods like Jingubang, YU exhibited superior performance, completing verification tasks hundreds of times faster. The reader should care about these results because they indicate that YU can handle large-scale, complex networks with efficiency, providing a practical solution for ensuring network reliability in the face of potential failures. This result is significant because it demonstrates a substantial improvement over current state-of-the-art methods, offering a scalable and effective approach to network verification that can be directly applied in production environments, thereby enhancing network reliability and availability for service providers and end-users alike.

Q1: What assumptions do you make about routing? So it looks like more general routing than shortest path routing. So what are the assumptions under which your approach works? Can it do any arbitrary routing?
A1: BGP protocol is very complex. It has paths, community, and local preferences and it impacts the IGP protocols. So the BGP protocol itself will provide the known shortest path routing. And we have referenced another work QR to formalize the entire network as the shortest path-based forwarding model, so it cannot even model all BGP attributes. It only models a small subset of BGP attributes. And besides the BGP. We also have the SR tunnel which I have shown in the example forwarding. So many protocols are shortest path based, and we handle these situations.

Q2: Is there something you cannot do? Is there any kind of routing your system cannot handle?
A2: We are not aware of that. Because we essentially use the symbolic execution, the symbolic traffic execution. Essentially, if the routers behave like a program that produces the routing table. Our system will also behave in the same way and produce the symbolic routing table, then from the symbolic routing table to the symbolic traffic load. Maybe it seems inefficient, but I don’t think we can’t handle it.

Personal thoughts
The paper detailing the YU verification system offers a compelling contribution to the field of network reliability and management. What stands out is the system’s ability to scale efficiently to production-level WANs while addressing the complex challenge of verifying traffic load properties under arbitrary failure scenarios. The paper’s approach to generality through symbolic traffic execution and the introduction of k-failure-equivalence and link-local-equivalence reduction techniques for efficiency are particularly impressive. These innovations allow YU to outperform existing systems, providing a more practical and effective solution for network verification.
Additiionally, the potential application of YU’s underlying principles to other domains, such as cloud service management or IoT device networks, may also be worth exploring. Overall, the paper presents a robust and innovative system that significantly advances the state of the art in network verification, while also opening up new avenues for future research and development.