New Evolution of Hoyan: Enhancing Scalability, Usability, and Accuracy for Alibaba's Global WAN Verification

Xing · September 11, 2025, 7:33am

Title : New Evolution of Hoyan: Enhancing Scalability, Usability, and Accuracy for Alibaba’s Global WAN Verification
Authors : Yifei Yuan, Fangdan Ye, Yifan Li, Jingkai Zhang, Mengqi Liu, Yuyang Sang, Ruizhen Yang, Duncheng She, Zhiqing Ye, Tianchen Guo, Xiaobo Zhu, Xinji Tang, Li Jia, Zhongyu Guan, Lingpeng Su, Ci Wang, Ruiyang Feng, Shuo Wu, Zhonghui Xie, Cheng Jin, Peng Zhang (Alibaba Cloud); Qing Ma (Alibaba Group); Xianlong Zeng, Dennis Cai, Ennan Zhai (Alibaba Cloud)
Scribe : Xing Fang (Xiamen University)

Background

As Alibaba’s global WAN continues to grow, the scale and frequency of network changes have increased significantly. Configuration verification has thus evolved from simply checking reachability to ensuring that more nuanced change intents—such as route attribute preservation, traffic shifting, or load balancing—are satisfied. However, the original version of Hoyan faced three key issues: limited scalability, insufficient expressiveness, and weak accuracy guarantees. The single-node simulation framework could no longer handle verification tasks involving millions of prefixes and billions of flows. Meanwhile, most change intents could not be formally expressed, leaving operators to rely on manual diffing or hardcoded rules. More critically, discrepancies between the simulation model and the real network—caused by monitoring blind spots, configuration parsing bugs, and vendor-specific behaviors (VSBs)—undermined trust in the system’s outputs. To address these limitations, Hoyan has undergone a major architectural upgrade focused on scalability, usability, and accuracy.

Challenges and Key Insights

The first major challenge for Hoyan was to scale verification to support the size and complexity of a modern production WAN. With thousands of routers and hundreds of thousands of prefixes involved in each change, the legacy single-machine simulator frequently failed due to resource exhaustion, especially when data center networks were included in the verification scope. The authors realized that addressing this required rethinking simulation granularity, task partitioning, and resource scheduling.

The second challenge concerned usability. Through close observation of real-world operations, the authors noted that many change intents could be formulated as logical assertions over differences in pre- and post-change RIB states—for example, “only routes matching X should change,” or “all other paths must remain unaffected.” However, without a way to formally express such intents, operators had to manually inspect simulation outputs, which was time-consuming and error-prone. To address this, the authors proposed the introduction of a dedicated specification language that allows change intents to be explicitly defined and automatically verified.

The third challenge was accuracy. Discrepancies between simulation results and real-world behavior were often caused by subtle misalignments in configuration semantics, undocumented device behaviors, or incomplete monitoring data. These issues could invalidate verification results, even if the simulation logic was technically correct. The authors thus identified simulation correctness as a first-class design goal, and proposed a dedicated accuracy diagnosis framework to monitor and correct such discrepancies.

Together, these insights led the team to design Hoyan as an intent-driven, scalable, and self-validating verification platform.

Implementation

To address scalability, Hoyan was rearchitected as a distributed simulation system. Input routes and flows are grouped into equivalence classes and partitioned into simulation subtasks, which are executed in parallel across multiple servers. Through prefix-based ordering and dependency minimization, each node only loads the minimal necessary data, significantly reducing memory usage and I/O overhead. For traffic simulation, additional optimizations ensure that cross-task dependencies are minimized through IP range alignment.

For intent verification, the authors introduced RCL (Route Change Language), a lightweight DSL for specifying control-plane change intents. RCL operates over a global RIB abstraction and supports predicates, filtering, aggregation, and logic composition. It allows operators to formally express expectations such as “prefixes with community X should now have local-pref Y,” or “all other prefixes must remain unchanged.” This abstraction bridges the gap between high-level change goals and low-level route state verification, enabling reliable and repeatable validation workflows.

To improve accuracy, Hoyan incorporates a daily diagnostic pipeline that compares simulated RIBs and traffic paths against live monitoring data. For key prefixes, it can also cross-check device-level output using show commands. The system pinpoints inconsistencies, generates forwarding traces, and assists in root cause analysis through automated propagation graph generation. This framework has proven effective in identifying previously unknown VSBs and modeling bugs, closing the loop between simulation and real-world behavior.

Evaluation

The system was evaluated under real-world conditions in Alibaba’s production network. In terms of performance, Hoyan’s distributed simulation engine can complete full WAN simulations involving over a million prefixes and billions of flows in under six minutes on a 10-node cluster—achieving a 5× speedup over the previous architecture. Even with the inclusion of data center routers (WAN+DCN), the total verification time remains within 25 minutes.

RCL has also demonstrated practical expressiveness and efficiency. Most real-world change intents can be expressed in under ten lines of RCL code, and over 90% of verification tasks complete within one minute. RCL is now integrated into daily workflows and supports both automated and manual verification triggers at scale.

In terms of accuracy, the system’s diagnostic framework has helped uncover and fix various critical issues, including missing monitoring data, configuration parsing errors, and over a dozen new VSB cases. Over a six-month deployment period, Hoyan has consistently reduced the rate of configuration-induced network incidents—from 56% down to 5%—highlighting its impact as a core infrastructure for safe, large-scale network change operations.

Q&A:

Q1: You mentioned Yu. Is the traffic load violations still part of this or is Yu a separate system? The Yu from last year’s SIGCOMM, is that part of Hoyan or a different system that verifies traffic properties separately?
A1: Last year we presented the relational network verification, which is about the specification for flow path changing. We integrated that into Hoyan. Yu is also integrated into Hoyan. Yu did symbolic simulation to figure out failures that can result in traffic load violations, while Hoyan uses concrete simulation for flows and the symbolic part is to encode failures, not for flows.

Q2: When we verify a new network change intent and there is some conflict between the new intent and the existing configuration, causing verification to fail, how can we identify which configuration code causes the conflict?
A2: This is a very good question. For instance, if your change refers to an undefined prefix, that is obviously a conflict, and we can check such things beforehand. I didn’t have time to present that, but there is a use case about it. For other conflicts, for instance if because of software bugs you cannot understand the command, this is totally out of the scope of the simulation. We have other tools to handle such issues.

Q3: Did you see systematic issues arising in configuration, for example differences between equipment or firmware interpretations?
A3: Yes. We call this VSB (vendor specific behavior), which is exactly what you’re talking about. There’s a table in our paper with many of them, and Hoyan uncovered a lot of these issues

personal Thoughts

This paper makes a valuable contribution to network verification by clearly addressing the practical needs for scalability, usability, and accuracy in production environments. From enabling minute-level simulation times to supporting complex intent specifications and maintaining alignment with real network behavior, the system provides a strong foundation and a clear roadmap for future research.

I’m particularly interested in the accuracy issue. Simulation errors can cause false positives or, more seriously, false negatives—letting faulty configurations enter production. The paper suggests using automated testing to detect implementation bugs or vendor-specific behaviors (VSBs) and patch the model. But a harder challenge is non-deterministic convergence, where simulation results are inconsistent with the real network. This remains an important research direction: we need more efficient ways to detect and repair such issues to ensure simulation consistency and reduce verification risk.