Paper: Compact Data Structures for Network Telemetry
Authors: Shir Landau Feibish, Zaoxing Liu, Jennifer Rexford
Presenter: Lu Tang, Xiamen University.
Guests of Honor: Shir Landau Feibish, The Open University of Israel; Zaoxing Liu, University of Maryland
Q:
How should we design compact data structures tailored for AI workloads, where traditional per-flow monitoring might no longer be optimal?
A:
In AI workloads, especially in collective communications, flow-based monitoring may not be the right abstraction. Instead of tracking traditional five-tuples, we might need models based on message semantics (e.g., AllReduce). AI networks often rely on specialized infrastructure such as NVIDIA Mellanox interconnects, where telemetry could be shifted to programmable NICs or host software—giving us slightly relaxed memory constraints. Thus, we may explore telemetry models that span multiple layers: switch-level, NIC-level, and host-level.
Q:
How can we support telemetry with extremely high or even zero-error accuracy, especially when using compact data structures constrained by memory?
A:
There is a trade-off between memory/resource usage and accuracy. One promising direction is reactive monitoring—only activating high-accuracy counters when needed. We explored this in a previous paper, where we dynamically reallocated resources once a problem was suspected. This way, instead of monitoring everything all the time, we use approximate structures for broad coverage and deploy accurate counters on demand in problem areas.
It’s also interesting to look at hybrid data structures. For example, we could design a system that guarantees perfect accuracy for a subset of important flows—like VIPs or top-k heavy flows—and tolerates approximation for the rest. Some recent ideas include self-repairing or adaptive data structures that reconfigure in response to errors or workload changes.
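The hybrid idea above—perfect accuracy for a designated set of important flows, approximation for everything else—can be illustrated with a minimal sketch. This is a hypothetical example, not the authors' implementation: it pairs an exact dictionary for VIP flows with a Count-Min sketch (which never underestimates) for the remaining traffic. The class and parameter names are invented for illustration.

```python
import hashlib

class HybridCounter:
    """Exact counts for a designated VIP set; a Count-Min sketch for all
    other flows. Count-Min estimates only overcount (never undercount),
    so non-VIP estimates are upper bounds on the true counts."""

    def __init__(self, vip_flows, width=1024, depth=4):
        self.exact = {f: 0 for f in vip_flows}   # perfect accuracy for VIPs
        self.width, self.depth = width, depth
        # depth rows of width counters, one independent hash per row
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, flow, row):
        # Salt the hash with the row index to get per-row independence.
        h = hashlib.blake2b(f"{row}:{flow}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def update(self, flow, count=1):
        if flow in self.exact:
            self.exact[flow] += count
        else:
            for row in range(self.depth):
                self.table[row][self._hash(flow, row)] += count

    def estimate(self, flow):
        if flow in self.exact:
            return self.exact[flow]          # exact for VIP flows
        # Minimum across rows is the standard Count-Min estimator.
        return min(self.table[row][self._hash(flow, row)]
                   for row in range(self.depth))
```

A self-repairing or adaptive variant, as mentioned above, might additionally promote a flow into the exact set once its sketch estimate crosses a threshold, at the cost of migrating state at runtime.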
Q:
Is it possible to coordinate telemetry across these devices at runtime? For example, could a compiler dynamically split tasks between DPUs, switches, and hosts depending on current workload or resource constraints?
A:
While the concept of runtime coordination is compelling, practical implementation remains constrained by current hardware limitations. Switches and hardware datapaths typically do not support dynamic reconfiguration at runtime due to the overhead of recompilation and deployment of new P4 programs, which may interrupt service or cause inconsistencies.
However, on the host or NIC side, telemetry tasks can be more readily reallocated or updated at runtime. To enable dynamic coordination, several technical challenges must be addressed, including real-time compilation latency, preservation of telemetry state across reconfigurations, and seamless transition without disrupting ongoing measurements.
Despite current constraints, progress in this area may enable future systems to adaptively reconfigure their monitoring strategies in response to workload changes.
Q:
Can we combine compact data structures with ML to perform more efficient and scalable network telemetry?
A:
This is still an open research direction. ML often struggles with explainability, which is critical for network telemetry. Rather than using sketches as inputs to ML models, a promising area is to use ML to tune or adapt data structures dynamically. There is potential, but integration remains an open challenge, especially in preserving precision and interpretability.
Q:
What are the most fundamental challenges if we want to expand accurate and efficient telemetry from data center networks to wide area networks (WANs)?
A:
When extending telemetry from data center networks to wide area networks, the primary challenges lie in resource constraints and traffic diversity. WAN devices often have limited resources for telemetry, and the traffic they carry is more heterogeneous and unpredictable compared to the more uniform and application-specific traffic seen in data centers. Despite these differences, many telemetry techniques and insights developed for data centers can still be effectively applied to WANs, provided they are adapted to the unique characteristics of WAN environments.
Q:
Can large language models (LLMs) be used in network measurement or sketch-based telemetry systems? If so, how?
A:
Regarding the use of large language models (LLMs) in network measurement, there are two main potential roles. First, LLMs can serve as powerful interfaces, translating natural language queries into telemetry configurations or helping users interact with complex telemetry data. Second, more speculatively, LLMs could be used to replace traditional sketches with learned models that generalize across diverse traffic patterns. However, major challenges remain, including the explainability of LLM decisions, generalization across dynamic traffic, and constraints imposed by current programmable hardware. Advances in more flexible, intelligent hardware may eventually make these applications more feasible.
Q:
Among the future directions listed in your survey paper, which one are you personally most excited to pursue? If you could only focus on one specific topic in telemetry or data structures, what would it be and why?
A:
Shir: If I had to pick one direction, I’m personally most excited about exploring new types of programmable hardware. I think there’s still a lot we don’t fully understand about how these architectures work and what they can really do. My focus would be on how we can adapt or redesign telemetry systems to take full advantage of these new capabilities. It’s not just about improving existing tools—it’s also about asking, can we do completely new things with this hardware that weren’t possible before? That’s a question I really want to explore in the next few years.
Alan: For me, I’m really passionate about building toward the long-term vision of AI-managed networks. Right now, we have more powerful hardware and better tools than ever, but I think we’re still missing an end-to-end loop that connects everything—from telemetry data collection to intelligent decision-making and automation. I want to step back and rethink the whole process: what exactly are we trying to detect, what are we missing in the hardware or software stack, and how can we put everything together? Even if we can’t solve it all at once, I’d love to demonstrate full-loop automation through one or two complete, working examples. That’s where I want to invest my energy.