μMon: Empowering Microsecond-level Network Monitoring with Wavelets

Title: µMon: Empowering Microsecond-level Network Monitoring with Wavelets

Authors: Hao Zheng, Chengyuan Huang, Xiangyu Han, Jiaqi Zheng, Xiaoliang Wang, Chen Tian, Wanchun Dou, Guihai Chen (Nanjing University)

Scribe: Haohao Song (Xiamen University, China)

Introduction
The paper presents µMon, a network monitoring system that lets data centers monitor traffic at the microsecond level. This granularity matters because, in modern data centers, network dynamics such as flow rate fluctuations and congestion events often play out within microseconds. These dynamics stem from high-speed forwarding devices and from network stacks that use kernel bypass and hardware offloading to reduce latency significantly. Existing network monitoring systems cannot capture such fine-grained behavior because they operate at a much coarser time granularity, typically seconds to minutes. This mismatch hinders network performance analysis and management: transient congestion events, which increase network latency and cause jitter in application performance, cannot be accurately detected or analyzed.

Key idea and contribution:
To address this gap, the authors propose µMon, which introduces WaveSketch, an innovative algorithm that employs in-dataplane wavelet transform to measure and compress flow rate curves. WaveSketch is designed to capture the most significant features of flow rate curves while discarding less important details, thereby balancing compression ratio and measurement accuracy. This approach allows for a more precise characterization of application traffic patterns and aids in profiling transport algorithms. A key feature of µMon is its ability to ‘replay’ congestion events by combining fine-grained flow rate measurements with network-collected congestion information. This capability enables network operators to analyze the cause and impact of congestion events, providing valuable insights for network management and optimization.
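To make the compression idea concrete, here is a minimal offline sketch of wavelet-based compression of a rate curve. It is not the authors' in-dataplane WaveSketch algorithm: the choice of the Haar wavelet, the top-k selection rule, and the names haar_forward, compress, and reconstruct are all assumptions made for illustration.

```python
import heapq
import math


def haar_forward(counts):
    """Unnormalized Haar transform of a per-window counter series.

    len(counts) must be a power of two. Returns (total, details), where
    details[0] is the coarsest level and details[-1] the finest.
    """
    n = len(counts)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    approx = list(counts)
    details = []
    while len(approx) > 1:
        sums = [approx[i] + approx[i + 1] for i in range(0, len(approx), 2)]
        diffs = [approx[i] - approx[i + 1] for i in range(0, len(approx), 2)]
        details.insert(0, diffs)  # keep coarse levels first
        approx = sums
    return approx[0], details


def compress(counts, k):
    """Keep the total plus the k most significant detail coefficients."""
    total, details = haar_forward(counts)
    n = len(counts)
    candidates = []
    for level, diffs in enumerate(details):
        support = n >> level  # number of windows this coefficient spans
        for idx, d in enumerate(diffs):
            # Rank by orthonormal magnitude (an assumption of this sketch).
            candidates.append((abs(d) / math.sqrt(support), level, idx, d))
    top = heapq.nlargest(k, candidates, key=lambda c: c[0])
    kept = {(level, idx): d for _, level, idx, d in top}
    return total, kept


def reconstruct(total, kept, n):
    """Rebuild an approximate curve; dropped coefficients count as zero."""
    approx = [total]
    for level in range(n.bit_length() - 1):
        nxt = []
        for idx, s in enumerate(approx):
            d = kept.get((level, idx), 0)
            nxt.extend(((s + d) / 2, (s - d) / 2))
        approx = nxt
    return approx


curve = [2, 2, 2, 2, 30, 34, 2, 2]  # made-up per-window counts with a burst
total, kept = compress(curve, k=2)
approx_curve = reconstruct(total, kept, len(curve))
```

In this sketch the overall sum is stored separately and never dropped, so the reconstructed curve always adds up to the original total. Only small differences are smoothed out (here, the 30/34 split inside the burst), while the location and size of the burst survive.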

Evaluation
The authors evaluate µMon through testbed deployment and simulations at a granularity of 8.192 microseconds. The results show that µMon achieves 90% accuracy in microsecond-level rate measurements with an average bandwidth overhead of 5 Mbps per host, and captures 99% of heavy congestion events with a bandwidth overhead of 31-82 Mbps per switch. The authors also explore practical applications of µMon, such as network-wide synchronized analysis and replay of congestion events, both enabled by measuring traffic and network events with microsecond precision. These applications give network operators deeper insight into the causes and impacts of congestion and help them diagnose and address performance issues.
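The 8.192-microsecond granularity equals 2^13 nanoseconds, so a nanosecond timestamp can be mapped to a window index with a single bit shift. The sketch below shows how a flow rate curve could be built this way. Whether µMon computes windows with a shift is an assumption here, and rate_curve, to_gbps, and the sample packets are hypothetical.

```python
from collections import defaultdict

WINDOW_SHIFT = 13  # 2**13 ns = 8192 ns = 8.192 microseconds


def rate_curve(packets, shift=WINDOW_SHIFT):
    """Aggregate (timestamp_ns, size_bytes) pairs into per-window byte counts."""
    curve = defaultdict(int)
    for ts_ns, size_bytes in packets:
        curve[ts_ns >> shift] += size_bytes  # window index via bit shift
    return dict(curve)


def to_gbps(byte_count, shift=WINDOW_SHIFT):
    """Convert bytes seen in one window to Gbit/s (1 bit per ns = 1 Gbit/s)."""
    return byte_count * 8 / (1 << shift)


# Made-up packets: two land in window 0, one in window 1.
packets = [(0, 1500), (8191, 1500), (8192, 1000)]
windows = rate_curve(packets)
```

A power-of-two window avoids a division per packet, which is one plausible reason to prefer 8.192 microseconds over a round 8 or 10 microseconds.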

Q1: Does the wavelet compression introduce errors in the precision?
A1: The total packet count is always correct. For the detailed flow rate curve, the sketch preserves the most important parts, those with large rate changes. The smooth parts, however, may contain some errors.

Q2: How do you set the time window, and how do you implement it in a DPU or programmable switch?
A2: We give a hardware implementation based on the RMT architecture in the paper, so you can read the paper for details. The time granularity of the sketch in our design is set to about 8 microseconds.

Q3: Are your replay module and the analysis module online?
A3: Congestion detection runs online in the network. The analysis of the congestion is done offline.

Personal thoughts
This paper studies an emerging problem. With the deployment of ultra-low-latency forwarding devices and network stacks that use technologies like kernel bypass, fluctuations in flow rates and network congestion events (e.g., microbursts) typically manifest on a microsecond timescale. This calls for revisiting network monitoring. The paper clearly lays out the new research objectives, the inherent challenges, and its solutions.