Hawkeye: Diagnosing RDMA Network Performance Anomalies with PFC Provenance

Authors
Shicheng Wang (Tsinghua University); Menghao Zhang (Beihang University); Xiao Li (Tsinghua University); Qiyang Peng (Beihang University); Haoyuan Yu, Zhiliang Wang, Mingwei Xu (Tsinghua University); Xiaohe Hu (Infrawaves); Jiahai Yang, Xingang Shi (Tsinghua University)

Scribe: Rulan Yang (Xiamen University)

Introduction

RDMA offers ultra-low latency and high throughput, but its Priority Flow Control (PFC) can spread congestion across the network, causing complex network performance anomalies (NPAs) like head-of-line blocking, unfairness, and PFC storms. Existing systems cannot trace PFC causality or pinpoint root causes beyond immediate flow contention. To address this, the authors propose Hawkeye, which captures flow-level PFC impact, efficiently collects causal telemetry, and builds a provenance graph to identify anomaly types and root causes. Evaluations on NS-3 and Tofino show Hawkeye is accurate, efficient, and provides actionable insights for operators.

Key idea and contribution

Hawkeye is designed to diagnose RDMA network performance anomalies with fine-grained PFC visibility, efficient telemetry collection, and accurate root-cause identification. It passively logs packet-level telemetry along both the victim flow path and the PFC spreading path, aggregating flow- and port-level information for comprehensive analysis. A host-based detection agent triggers anomaly diagnosis by generating polling packets, which traverse causally relevant switches for in-data-plane PFC causality analysis while telemetry is collected asynchronously. Hawkeye constructs a heterogeneous provenance graph encoding port- and flow-level wait-for relationships, enabling identification of the anomaly type, its root cause, and contributing flows. By matching graph signatures of representative RDMA anomalies, such as micro-bursts, PFC storms, and deadlocks, Hawkeye provides a detailed breakdown of congestion causality, including how severely each flow is victimized, the paths along which PFC spreads, and the sources of the initial congestion.
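To make the wait-for idea concrete, here is a minimal Python sketch of a provenance graph, not the authors' implementation. It assumes a simplified model in which a paused port "waits for" the port that paused it, a cycle of such waits is read as a deadlock signature, and a reachable port with no outgoing wait is read as a point of initial congestion. The class name ProvenanceGraph, its methods, and the port and flow names are all hypothetical.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Toy provenance graph (hypothetical names, simplified model)."""

    def __init__(self):
        # port -> set of ports it is waiting for (port-level wait-for edges)
        self.port_waits = defaultdict(set)
        # port -> set of flows contending at that port (flow-level edges)
        self.flows_at_port = defaultdict(set)

    def add_port_wait(self, paused_port, pausing_port):
        self.port_waits[paused_port].add(pausing_port)
        self.port_waits[pausing_port]  # make sure the node exists

    def add_flow(self, port, flow):
        self.flows_at_port[port].add(flow)
        self.port_waits[port]  # make sure the node exists

    def has_cycle(self):
        """True if the port-level wait-for edges contain a cycle."""
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {p: WHITE for p in self.port_waits}

        def visit(p):
            color[p] = GRAY
            for q in self.port_waits[p]:
                if color[q] == GRAY:
                    return True
                if color[q] == WHITE and visit(q):
                    return True
            color[p] = BLACK
            return False

        return any(color[p] == WHITE and visit(p) for p in list(color))

    def root_causes(self, victim_port):
        """Follow wait-for edges from the victim; ports with no outgoing
        wait are treated as initial congestion points."""
        seen, stack, roots = set(), [victim_port], {}
        while stack:
            p = stack.pop()
            if p in seen:
                continue
            seen.add(p)
            nxt = self.port_waits.get(p, set())
            if not nxt:
                roots[p] = sorted(self.flows_at_port.get(p, set()))
            stack.extend(nxt)
        return roots


# PFC spreads from S3:p2 back to S1:p1; two bursty flows collide at S3:p2.
g = ProvenanceGraph()
g.add_port_wait("S1:p1", "S2:p3")
g.add_port_wait("S2:p3", "S3:p2")
g.add_flow("S3:p2", "flow_a")
g.add_flow("S3:p2", "flow_b")
chain_roots = g.root_causes("S1:p1")
chain_has_cycle = g.has_cycle()

# Three ports pausing each other in a ring: a deadlock-like signature.
d = ProvenanceGraph()
d.add_port_wait("A", "B")
d.add_port_wait("B", "C")
d.add_port_wait("C", "A")
ring_has_cycle = d.has_cycle()
```

In this toy model the chain resolves to S3:p2 with its two contending flows, while the ring has no terminal port and is flagged by the cycle check instead. A real analyzer would also need per-flow weights and timing, which this sketch omits.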

Hawkeye’s architecture integrates both in-network and host-based components for precise PFC diagnosis. It first records telemetry with PFC visibility and causality awareness, capturing packet-level port PFC status and tracing PFC spreading. Next, it performs fast in-data-plane PFC causality analysis while the CPU collects detailed flow and port telemetry asynchronously. A host-based agent monitors flow performance and triggers diagnosis by sending polling packets, which switches use to infer causal neighbors and distribute PFC information. Finally, an offline provenance-based analyzer constructs a heterogeneous graph to trace congestion causality and identify root causes.
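The host-side trigger can be illustrated with a small Python sketch. It is only an assumption-laden illustration: the summary does not say which metric or threshold the agent uses, so the RTT-based rule, the rtt_factor and min_samples parameters, and the shape of the polling request are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DetectionAgent:
    """Toy host agent: flags a flow when its RTT far exceeds its baseline."""

    rtt_factor: float = 3.0   # hypothetical: trigger above 3x the baseline
    min_samples: int = 3      # hypothetical: warm-up before triggering
    baseline: dict = field(default_factory=dict)  # flow -> mean RTT (us)
    counts: dict = field(default_factory=dict)    # flow -> samples in mean

    def observe(self, flow_id, rtt_us):
        """Return a polling request if the sample looks anomalous, else None."""
        n = self.counts.get(flow_id, 0)
        base = self.baseline.get(flow_id)
        if (base is not None and n >= self.min_samples
                and rtt_us > self.rtt_factor * base):
            # Anomalous sample: request diagnosis, keep the baseline unchanged.
            return {"type": "poll", "flow": flow_id, "rtt_us": rtt_us}
        # Normal sample: fold it into the running mean.
        self.baseline[flow_id] = (
            rtt_us if base is None else (base * n + rtt_us) / (n + 1)
        )
        self.counts[flow_id] = n + 1
        return None


agent = DetectionAgent()
warmup = [agent.observe("flow_a", rtt) for rtt in (10.0, 10.0, 10.0)]
spike = agent.observe("flow_a", 50.0)   # 50 > 3 * 10, so a poll is requested
mild = agent.observe("flow_a", 20.0)    # 20 <= 3 * 10, so no poll
```

Excluding anomalous samples from the baseline is a design choice in this sketch, so that a long-lasting anomaly does not gradually raise the threshold and hide itself.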

Evaluation

Hawkeye is evaluated through both large-scale NS-3 simulations and a real Tofino-based testbed to measure its effectiveness, efficiency, coverage, and deployability. The simulations use a Fat-Tree topology with 100 Gbps links, running realistic RoCEv2 workloads with long-tailed flow distributions. Hawkeye achieves high precision and recall in diagnosing PFC-related RDMA anomalies, outperforming baselines such as SpiderMon and NetSight, particularly for deadlocks and PFC storms. Its in-network telemetry collection efficiently covers causal switches with low bandwidth and processing overhead. Testbed results confirm that Hawkeye's switch resource usage and CPU-based polling scale well: telemetry size is reduced by over 80% and the number of reporting packets by about 95%, demonstrating practical deployability.
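For readers unfamiliar with the metrics, the following Python sketch shows how per-anomaly precision and recall are conventionally computed from diagnosed versus ground-truth labels. The labels below are invented toy data for illustration only and are not results from the paper; the function name precision_recall is likewise hypothetical.

```python
def precision_recall(predicted, actual, label):
    """Precision and recall for one anomaly type over paired diagnoses."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == label and a == label)
    fp = sum(1 for p, a in pairs if p == label and a != label)
    fn = sum(1 for p, a in pairs if p != label and a == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (made up): one PFC storm is misdiagnosed as a deadlock.
actual = ["deadlock", "pfc_storm", "deadlock", "micro_burst"]
predicted = ["deadlock", "deadlock", "deadlock", "micro_burst"]
deadlock_p, deadlock_r = precision_recall(predicted, actual, "deadlock")
storm_p, storm_r = precision_recall(predicted, actual, "pfc_storm")
```

Here the deadlock class has two true positives and one false positive, giving precision 2/3 and recall 1, while the missed storm yields zero recall for that class.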

Q&A
This paper is being presented on behalf of the authors, and there will be no Q&A session.

Personal thoughts

Hawkeye demonstrates clear strengths in network diagnosis, such as flexible triggering mechanisms, PFC visibility with causality tracing, and effective root-cause identification even on hotspot switches, all of which improve network observability and diagnostic accuracy. Its parameters can also be adjusted according to network scale and application requirements, allowing a good balance between accuracy and resource overhead. However, Hawkeye has limitations: partial deployment may lead to incomplete diagnosis coverage, fine-grained flow telemetry increases switch resource usage, and its applicability to non-PFC anomalies or complex topologies still requires further validation. Thus, while Hawkeye is valuable for modern RDMA network diagnosis, its deployment and scalability must be carefully considered.