RD-Probe: Scalable Monitoring With Sufficient Coverage In Complex Datacenter Networks

Authors: Rui Ding, Xunpeng Liu, Shibo Yang, Qun Huang (Peking University, School of Computer Science); Baoshu Xie, Ronghua Sun, Zhi Zhang, Bolong Cui (Huawei Cloud Computing Technologies Co., Ltd)

Scribe: Ruyi Yao

Introduction
Network monitoring is vital for ensuring service availability in datacenter infrastructures.
Insufficient monitoring coverage creates blind spots and leads to unnoticed failures on the network side. Analysis of Huawei's production network reveals that insufficient coverage at Layer 2 is the culprit behind many missed failures and their prolonged mean times to repair (MTTRs).

Achieving sufficient coverage at Layer 2 is challenging within the production monitoring architecture, which can be attributed to three factors: (1) virtualization techniques; (2) randomness techniques; (3) path explosion. The first two factors make Layer-2 ports invisible, and the last makes it hard to balance the coverage level against resource overhead.

Key idea and contribution

The authors design RD-Probe, a dynamic and continuous monitoring system for the underlay network. It is a deployed black-box system with explicit coverage and scalability guarantees.

RD-Probe generates probe tasks every epoch (three minutes by default). Each epoch proceeds in five steps:

(1) Topology construction: It transforms the vanilla topology directly seen by the monitoring system into the de facto topology using switch configurations.
(2) Randomized generation: Based on the constructed topology, it selects a proper number of random Phase-I tasks and dispatches them to agents. Tasks remain active until the epoch's end.
(3) Packet mirroring: Switches match and mirror probe packets to the data processing module, where the coverage computation logic maintains the Layer-3 coverage counters.
(4) Coverage computation: It listens for the mirrored packets and obtains the interfaces they go through. It then increments the associated counters on seeing packets from new tasks and persists the updated counters to storage.
(5) Deterministic generation: Phase II reads the current coverage counters after a fixed delay from the start of the epoch (e.g., 20s). It then determines which interfaces are still under-covered and finds additional tasks to boost their coverage. These tasks also remain active until the epoch's end.
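The two-phase generation can be sketched roughly as follows. This is a minimal illustrative model, not the authors' implementation: the function name, the greedy Phase-II selection, and the representation of a task as the set of interfaces its probe traverses (learned from mirrored packets) are all assumptions; `alpha` stands in for the per-interface coverage target mentioned in the Q&A.

```python
import random
from collections import defaultdict

def run_epoch(interfaces, candidate_tasks, alpha=1, num_random=4):
    """Sketch of one RD-Probe epoch.

    candidate_tasks maps a task id to the set of interfaces its
    probe traverses (in reality revealed by mirrored packets).
    """
    coverage = defaultdict(int)  # per-interface coverage counters

    # Phase I: randomized generation -- dispatch a random sample of tasks.
    task_ids = list(candidate_tasks)
    phase1 = random.sample(task_ids, min(num_random, len(task_ids)))
    for t in phase1:
        for itf in candidate_tasks[t]:
            coverage[itf] += 1  # coverage computation on mirrored packets

    # Phase II: deterministic generation -- after a delay, read the
    # counters and greedily add tasks that touch under-covered interfaces.
    phase2 = []
    for t in task_ids:
        if t in phase1:
            continue
        under = {i for i in interfaces if coverage[i] < alpha}
        if not under:
            break  # every interface has reached the coverage target
        if candidate_tasks[t] & under:
            phase2.append(t)
            for itf in candidate_tasks[t]:
                coverage[itf] += 1

    return phase1, phase2, dict(coverage)
```

The randomized phase keeps task generation cheap and unbiased, while the deterministic phase patches whatever the random sample missed, which is what lets the system bound coverage without enumerating the exploded path space.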

Evaluation
The authors deployed RD-Probe in three large production regions of Huawei Cloud. It improved coverage from 80.9% to 99.5% and unearthed several previously unnoticed issues while tolerating numerous faults.

Large-scale simulations comparing four industry solutions show that RD-Probe is the only one that achieves both sufficient coverage and scalability in complex datacenter networks.

Q: α is important. How is it tuned in practice?

A: This is where the operator's experience kicks in. We typically set α in [1, 5], and 1 is sufficient for datacenters with less stringent service-level agreements. Services requiring higher reliability may specify a higher α.

Q: Could you share some practical insights on how you fix these problems after you find them?

A: Although many failures are self-healing thanks to advanced solutions built into switches, others still require manual, tedious troubleshooting. The general method is to isolate and then reboot the buggy components; I'm afraid there is no universal solution for troubleshooting.

Q: Datacenters have spare capacities and use multi-path routing. How are these spare capacities used during troubleshooting? Do you troubleshoot offline?

A: Manual troubleshooting is done offline after the faulty components are isolated. Meanwhile, the affected traffic is rerouted to another healthy path. We rarely encounter failures that affect a whole switch, such as a complete outage; in such cases, offline troubleshooting is inevitable.

Personal Thoughts
It’s interesting that the authors identified blind spots in network measurement and delved into Layer-2 techniques. They cleverly adopted a method that combines randomness and determinism, providing a mathematical proof of the coverage guarantee. I am curious about the extent of network changes that would require re-tuning the parameter α. Manual tuning still seems somewhat daunting, and I hope to see an auto-tuning implementation.