EP11: R-Pingmesh: A Service-Aware RoCE Network Monitoring and Diagnostic System, May 7, 2025

Paper : R-Pingmesh: A Service-Aware RoCE Network Monitoring and Diagnostic System
Authors : Kefei Liu, Zhuo Jiang, Jiao Zhang, Shixian Guo, Xuan Zhang, Yangyang Bai, Yongbin Dong, Feng Luo, Zhang Zhang, Lei Wang, Xiang Shi, Haohan Xu, Yang Bai, Dongyang Song, Haoran Wei, Bo Li, Yongchen Pan, Tian Pan, and Tao Huang.
Presenter : Haodong Chen, Xiamen University.
Guest of Honor : Kefei Liu, Beijing University of Posts and Telecommunications.

Q: Considering the current practice, do you have any plans in the future to further improve the accuracy of determining network-related problems or improve the diagnosis?

A:

We have encountered a lot of noise in our detection system. This noise can come from CPU overload and various host-side issues, such as Linux lock contention and host software bugs. These issues may lead to false positives: you detect what appears to be a network problem, but it is actually just a host-side issue.

So it is really important in future work to separate host timeouts from network timeouts more accurately. In R-Pingmesh we use ToR-mesh probing, which operates among the RDMA NICs under the same ToR switch. ToR-mesh has some limitations:

  • The first limitation is that ToR-mesh relies on the number of RDMA NICs under the same ToR switch. If there are only a few RDMA NICs under the ToR switch, ToR-mesh becomes inaccurate.
  • For example, if there are only two RDMA NICs under the ToR switch and just one of them fails, you will find that around 50% of the probes under that ToR switch experience timeouts, as the sketch below illustrates.
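
Below is a rough sketch of this effect. It is a simplified model of my own, not the paper's analysis: it assumes every NIC under the ToR probes every other NIC there, and that the faulty NIC simply stops responding while the healthy NICs keep probing.

```c
/*
 * Back-of-the-envelope model (not from the paper): each RDMA NIC under a
 * ToR probes every other NIC under the same ToR, and a faulty NIC simply
 * stops responding while the healthy NICs keep probing.  The fraction of
 * ToR-mesh probes that time out is then 1 / (number of NICs under the ToR).
 */
#include <stdio.h>

static double timeout_fraction(int nics_under_tor)
{
    if (nics_under_tor < 2)
        return 0.0;
    /* (N - 1) probes target the faulty NIC, out of N * (N - 1) probes total. */
    return (double)(nics_under_tor - 1) /
           (double)(nics_under_tor * (nics_under_tor - 1));
}

int main(void)
{
    int sizes[] = { 2, 4, 8, 16, 32 };
    for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
        printf("%2d NICs under the ToR -> %5.1f%% of ToR-mesh probes time out\n",
               sizes[i], 100.0 * timeout_fraction(sizes[i]));
    return 0;
}
```

With only two NICs under the ToR, a single NIC failure already makes half of the ToR-mesh probes time out, which looks like a ToR-wide problem; with more NICs, the timeouts concentrate on one NIC and the fault can be attributed correctly.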

In future work we will focus on improving the accuracy of ToR-mesh problem detection, and more broadly on separating host timeouts from network timeouts more accurately.

Q: Is it possible to combine host-side code analysis tools with R-Pingmesh, so that together they achieve better accuracy in end-to-end scenarios?

A:

That's a good suggestion. We have tried adding instrumentation to the NCCL library: you can add logs to the NCCL code around the communication and computation phases, and from those logs determine whether a bug occurs in communication or in computation.

But actually, I think this approach is somewhat complicated. From the perspective of a probing system, we should keep it simple. We should rely on the probing system itself, rather than depending on training code and communication libraries.

Q: To further automate diagnosis and troubleshooting without any human effort, do you think it is possible to combine large language models with historical detection records from R-Pingmesh, so that an LLM agent can operate on top of R-Pingmesh?

A:

This is a very good suggestion. At ByteDance, we are actually working on this. We use historical data from the R-Pingmesh system and from other monitoring systems, such as switch logs, switch traffic data, switch job logs, and RDMA NIC job logs.

We use these logs to train an LLM system, or use LLM inference with this data as input. We use the LLM system to:

  • First, monitor and detect errors or network bugs/failures
  • Second, locate the failure
  • Third, diagnose the root cause

With the help of the LLM system, we can significantly reduce human effort. However, we haven’t deployed this kind of system in production yet - this is still in our planning phase.

Q: What is the most difficult part when you conducted this project?

A:

The most challenging part is how to trace service data flows. Pingmesh probes between NICs, which is straightforward but cannot properly cover the service network: services may use only part of the network, that is, specific NICs, links, and switches within the cluster.

We need to trace the service network paths. A very challenging question is whether we should passively trace the service network or actively probe it.

Some approaches, like Microsoft's Everflow, passively trace service actions (such as when applications call post send or post receive). In R-Pingmesh, however, we use active probing to trace the service network: we probe with the service queue pairs (QPs) and trace the paths used by those QPs through active probing.
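
A minimal sketch of the active path-tracing idea, under the assumption that the fabric applies 5-tuple ECMP hashing to RoCEv2 traffic (UDP destination port 4791): a probe that reuses the service flow's addresses and UDP source port is hashed onto the same path, and TTL-limited probes make each switch on that path reveal itself through ICMP Time Exceeded replies. The addresses and ports below are hypothetical, reply collection is omitted, and the real R-Pingmesh probes are RDMA operations issued on the service QPs themselves; this only illustrates the traceroute-style mechanism.

```c
/*
 * Simplified, traceroute-style illustration of tracing the path of a service
 * flow (hypothetical addresses and ports).  Assumption: switches hash RoCEv2
 * (UDP port 4791) traffic with 5-tuple ECMP, so a probe that reuses the
 * service flow's source/destination IPs and UDP source port is hashed onto
 * the same path.  Limiting the TTL hop by hop makes each switch on that path
 * answer with an ICMP Time Exceeded message (reply collection omitted here).
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    const char *dst_ip = "10.0.1.2";      /* hypothetical peer NIC address   */
    const uint16_t svc_src_port = 49152;  /* service QP's UDP source port    */
    const uint16_t rocev2_port  = 4791;   /* RoCEv2 UDP destination port     */

    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    /* Reuse the service flow's UDP source port so ECMP picks the same path. */
    struct sockaddr_in src = { 0 };
    src.sin_family = AF_INET;
    src.sin_port = htons(svc_src_port);
    src.sin_addr.s_addr = INADDR_ANY;
    if (bind(sock, (struct sockaddr *)&src, sizeof(src)) < 0)
        perror("bind");

    struct sockaddr_in dst = { 0 };
    dst.sin_family = AF_INET;
    dst.sin_port = htons(rocev2_port);
    inet_pton(AF_INET, dst_ip, &dst.sin_addr);

    for (int ttl = 1; ttl <= 8; ttl++) {
        setsockopt(sock, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
        const char payload[] = "path-probe";
        sendto(sock, payload, sizeof(payload), 0,
               (struct sockaddr *)&dst, sizeof(dst));
        /* The switch at hop <ttl> drops this probe and returns ICMP Time
         * Exceeded, identifying itself as one hop on the service path. */
    }
    close(sock);
    return 0;
}
```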

This idea is where we spent the most time thinking and discussing. For students who want to do research in network diagnostics, I think we should first learn about real problems in clusters, read experience papers, and discuss with network operators who can provide feedback about practical problems.

Q: Among the problems found by R-Pingmesh, intra-host bottlenecks seem quite interesting. Are there cases where RDMA NICs get overloaded?

A:

When the CPU is overloaded, it usually causes timeout problems, which we mentioned as noise in our detection. These timeout problems are usually misinterpreted as network timeouts and lead to false positives.

RDMA NIC overload problems do occur when the RDMA NIC becomes fully saturated. In machine learning workloads and LLM training, when RDMA NICs communicate with each other, they can consume the entire available bandwidth.

In these cases, you can detect RDMA NIC problems through very high RTT. For example, when the network is under low load, RTT is around several microseconds or at most tens of microseconds. When RDMA NICs are under heavy load, RTT increases to several hundred microseconds or even milliseconds.

We can use RTT to determine the RDMA NIC’s workload state.
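
As a small illustration, a monitor could bucket probe RTTs into load levels. The thresholds below are placeholder values chosen to match the rough figures above, not numbers from the paper.

```c
/*
 * Illustrative RTT-based classification of an RDMA NIC's load.  The cut-off
 * values follow the rough figures mentioned above and are assumptions for
 * the example, not thresholds taken from the R-Pingmesh paper.
 */
#include <stdio.h>

typedef enum { LOAD_LOW, LOAD_MODERATE, LOAD_HEAVY } nic_load_t;

static nic_load_t classify_rtt_us(double rtt_us)
{
    if (rtt_us < 50.0)    /* a few to tens of microseconds: lightly loaded  */
        return LOAD_LOW;
    if (rtt_us < 300.0)   /* in-between region                              */
        return LOAD_MODERATE;
    return LOAD_HEAVY;    /* hundreds of microseconds or more: saturated    */
}

int main(void)
{
    const double samples_us[] = { 8.0, 60.0, 450.0, 2000.0 };
    const char *names[] = { "low", "moderate", "heavy" };

    for (unsigned i = 0; i < sizeof(samples_us) / sizeof(samples_us[0]); i++)
        printf("probe RTT %7.1f us -> %s load\n",
               samples_us[i], names[classify_rtt_us(samples_us[i])]);
    return 0;
}
```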

Q: Which type of problem is the most common?

A:

The most common case found in ByteDance’s data center is link flapping - both RDMA NIC link flapping and switch link flapping.

For RDMA NIC flapping, we found this problem to be very common two years ago. We worked with vendors (most RDMA NICs are from NVIDIA) and debugged this problem for around half a year to one year. Finally, we found this problem stemmed from a mismatch between the RDMA NIC and the cable - such a simple issue, but it took half a year to debug.

For switch problems, the most common is switch link flapping, which is well known. Packet corruption is also very common, caused by faulty cables and broken optical modules.

Q: Why did you decide to build on Pingmesh?

A:

We built R-Pingmesh on top of Pingmesh, using the traceroute technique to trace the paths of RDMA probes. The main reason is ease of deployment.

As far as we know, many companies, such as Microsoft, Meta, Huawei, Alibaba, and Tencent, have built their monitoring systems on Pingmesh or similar approaches. By building on Pingmesh, we want to make this work easy to deploy, so that other companies and network operators can easily upgrade their Pingmesh systems to our R-Pingmesh system.

Systems like INT and ERSPAN rely on advanced switch features. If some legacy switches in the cluster don't support ERSPAN or INT, these systems cannot work. But traceroute is supported by nearly all switches, so R-Pingmesh can be deployed across all clusters.

However, if you use more advanced features like INT, you can achieve better path tracing: you can obtain the specific switch ports traversed by network flows, whereas traceroute can only identify the switches on the path. If your clusters support INT or ERSPAN, you can use them to enhance the path tracing capabilities of R-Pingmesh.

Q: The R-Pingmesh design polls CQEs, which can be CPU-intensive. How does it reduce CPU usage?

A:

If you send a probe packet and use the CPU to repeatedly poll for completion queue entries (CQEs), this is quite CPU-intensive. You can instead use event-based methods, which the verbs (OFED) API supports, to reduce CPU consumption.

You can post probes and wait for CQE events; this operation is not CPU-intensive. The CQE timestamp is generated by the RDMA NIC hardware, so timestamping does not consume CPU resources either.

Using event-based operations, we can keep CPU consumption at a very low level. In our paper, for a host with eight RDMA NICs, each RDMA NIC maintains a probing frequency of around 100 packets per second, so the total probing frequency is around 800 probes per second. For this setup, CPU consumption is less than 3% of one core.
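
A minimal sketch of what the event-based pattern looks like with libibverbs, to make the contrast with busy polling concrete; this illustrates the verbs calls involved, not R-Pingmesh's actual code.

```c
/*
 * Minimal sketch of event-based completion handling with libibverbs (link
 * with -libverbs).  Setup of the device context, PD, QP, completion channel
 * and posting of the probe are omitted; the point is that the process sleeps
 * in ibv_get_cq_event() instead of spinning on ibv_poll_cq().
 */
#include <infiniband/verbs.h>
#include <stdio.h>

int wait_for_probe_completion(struct ibv_cq *cq, struct ibv_comp_channel *channel)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;

    /* Ask the NIC to raise an event for the next CQE, then block on it. */
    if (ibv_req_notify_cq(cq, 0))
        return -1;
    if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))  /* sleeps, ~0% CPU */
        return -1;
    ibv_ack_cq_events(ev_cq, 1);

    /* Drain the completions that triggered the event. */
    while (ibv_poll_cq(ev_cq, 1, &wc) > 0) {
        if (wc.status != IBV_WC_SUCCESS)
            fprintf(stderr, "probe failed: %s\n", ibv_wc_status_str(wc.status));
        /* wc.wr_id identifies the probe.  Hardware completion timestamps are
         * available when the CQ is created with ibv_create_cq_ex() and the
         * IBV_WC_EX_WITH_COMPLETION_TIMESTAMP flag. */
    }
    return 0;
}
```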

Q: Have you used methods to address network congestion, such as ECMP imbalance or incast caused by collective communication? Have you applied corresponding optimizations?

A:

We found many congestion problems, such as ECMP imbalance and incast congestion.

For incast congestion, which is many-to-one communication, the link to a specific RDMA NIC has only one path, so there is no alternative route. The straightforward solution is to design better congestion control algorithms. At ByteDance, we use PCC (Programmable Congestion Control) support in RDMA NICs to design congestion control algorithms that outperform DCQCN (the default RDMA NIC congestion control).

For ECMP imbalance, many researchers have improved load balancing algorithms. The most popular approach is packet-level load balancing, such as DRB. Many data centers have designed their own load balancing algorithms, mostly based on packet-level approaches.

At ByteDance, we have similar algorithms for flow-level or packet-level load balancing to reduce ECMP imbalance. The main overhead is handling packet reordering issues since packets traverse different paths. This requires hardware-software co-design, adding more buffers in RDMA NICs and software caches in CPUs.
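
A toy illustration of why packet-level spraying helps with ECMP imbalance (the hash function and uplink count below are made up for the example, not a real switch's hash): with a fixed UDP source port, every packet of a flow hashes to the same uplink, so two large flows can collide; rotating the source port per packet spreads the packets across uplinks, at the cost of the reordering mentioned above.

```c
/*
 * Toy model of ECMP path selection (the hash and uplink count are made up
 * for this example).  A fixed UDP source port pins a flow to one uplink;
 * rotating the source port per packet sprays packets across uplinks.
 */
#include <stdint.h>
#include <stdio.h>

#define UPLINKS 4

static unsigned ecmp_uplink(uint32_t src_ip, uint32_t dst_ip,
                            uint16_t src_port, uint16_t dst_port)
{
    uint32_t h = src_ip ^ dst_ip ^ (((uint32_t)src_port << 16) | dst_port);
    h ^= h >> 16;
    h *= 0x7feb352dU;   /* simple integer mixing, stand-in for a switch hash */
    h ^= h >> 15;
    return h % UPLINKS;
}

int main(void)
{
    uint32_t src_ip = 0x0a000101, dst_ip = 0x0a000202; /* 10.0.1.1 -> 10.0.2.2 */
    uint16_t base_port = 49152, dst_port = 4791;       /* RoCEv2 destination   */

    printf("flow-level (fixed source port):\n");
    for (int pkt = 0; pkt < 4; pkt++)
        printf("  packet %d -> uplink %u\n", pkt,
               ecmp_uplink(src_ip, dst_ip, base_port, dst_port));

    printf("packet-level spraying (source port rotated per packet):\n");
    for (int pkt = 0; pkt < 4; pkt++)
        printf("  packet %d -> uplink %u\n", pkt,
               ecmp_uplink(src_ip, dst_ip, (uint16_t)(base_port + pkt), dst_port));
    return 0;
}
```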

Q: Can you share insights about problems you fixed and didn’t fix?

A:

The major limitation of Pingmesh and R-Pingmesh is that they rely on agents. When Pingmesh was published in 2015, cluster sizes were not very large, around thousands or tens of thousands of servers. In small clusters, using agents is reasonable.

When cluster or data center size grows to millions of servers, agent-based probing creates many problems. You need to deploy agents on all servers in the data center. If you have updates, you need to update agents across all servers, which is quite difficult for millions of servers.

The trend is toward designing agent-less systems. If you can probe without relying on agents, you don’t need to deploy agents on all servers. You can use several centralized probes to monitor all servers in the data center.

We have actually pursued this approach, and a paper on it will be presented in September this year.

Q: In R-Pingmesh, you focus on monitoring and diagnosis. What have you done to prevent anomalies from happening in the first place?

A:

This is a very good question. We have encountered many configuration problems. Verification work is quite challenging.

At ByteDance, what we have successfully accomplished is data plane verification, which addresses a limited scope; some related papers were published at SIGCOMM last year.

Large-scale verification of an entire data center is quite challenging. There are many different RDMA NICs and switches from different vendors, each with different configurations. We have complete configurations for each brand of switch, but we don't know whether those configurations are optimal.

For example, we don't know whether the ECN watermarks and PFC watermarks are optimal. Our network operators are working on optimizing configurations, which need to change when service workloads change: storage scenarios require one configuration, while training scenarios require another.

When the cluster size grows, configurations may need to change. Another problem is that there may be bugs or oversights when operators configure switches and RDMA NICs.

There are two challenges: first, determining whether configurations are optimal; second, detecting oversights. These are quite challenging problems in data centers, and we haven’t solved them well yet.