R-Pingmesh: A Service-Aware RoCE Network Monitoring and Diagnostic System

erfan · July 30, 2024, 11:35am

Title: R-Pingmesh: A Service-Aware RoCE Network Monitoring and Diagnostic System

Authors: Kefei Liu (State Key Laboratory of Networking and Switching Technology, BUPT, China), Zhuo Jiang (Douyin Vision Co., Ltd.), Jiao Zhang (State Key Laboratory of Networking and Switching Technology, BUPT, China and Purple Mountain Laboratories), Shixian Guo (Douyin Vision Co., Ltd.), Xuan Zhang (State Key Laboratory of Networking and Switching Technology, BUPT, China), Yangyang Bai (Douyin Vision Co., Ltd.), Yongbin Dong (Douyin Vision Co., Ltd.), Feng Luo (Douyin Vision Co., Ltd.), Zhang Zhang (Douyin Vision Co., Ltd.), Lei Wang (Douyin Vision Co., Ltd.), Xiang Shi (Douyin Vision Co., Ltd.), Haohan Xu (Douyin Vision Co., Ltd.), Yang Bai (Douyin Vision Co., Ltd.), Dongyang Song (Douyin Vision Co., Ltd.), Haoran Wei (Douyin Vision Co., Ltd.), Bo Li (Douyin Vision Co., Ltd.), Yongchen Pan (State Key Laboratory of Networking and Switching Technology, BUPT, China), Tian Pan (State Key Laboratory of Networking and Switching Technology, BUPT, China and Purple Mountain Laboratories), Tao Huang (State Key Laboratory of Networking and Switching Technology, BUPT, China and Purple Mountain Laboratories)

Introduction: R-Pingmesh is an RDMA monitoring system that helps operators diagnose network performance issues using end-to-end latency measurements. The paper covers the technical design of the system and several case studies from its deployment on a large number of RNICs over six months.
The system is motivated by the growth of distributed machine learning and the popularity of RoCE networks, where a small network misconfiguration can cause significant performance degradation in training. It is therefore important to detect faults quickly and determine whether the network is to blame for them.

Key idea and contribution:
Existing monitoring tools designed for Ethernet networks, such as Pingmesh, are not a good fit for RoCE networks for several reasons. With these tools, it takes a long time to correlate a problem with the component that is causing it. It is also important to point out that not all issues with a training job are network-related. Additionally, a tool like Pingmesh cannot identify the location of packet drops in RoCE networks or determine the severity of faults.
Their solution, R-Pingmesh, is designed to 1) quickly detect faults and 2) determine whether those faults are network-related.
R-Pingmesh consists of a service-independent cluster monitoring service and a “service tracing probing” function. The individual system components (the controller, the agent, and the analyzer) are each tasked with specific functions, such as storing RNIC communication information, using the RDMA UD transport for periodic probing, and using eBPF for kernel function tracing.

Evaluation:
R-Pingmesh has been deployed for over 6 months in a large RoCE cluster and is shown to quickly identify performance degradation in training jobs and determine whether or not it is caused by the network.
The authors present several lessons learned while deploying R-Pingmesh. For example, tail RTT measurements from R-Pingmesh can be used to gauge the degree of network congestion. R-Pingmesh can also detect sources of congestion such as many-to-one incast or uplink switch congestion caused by ECMP hash collisions.
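
To make the tail RTT idea concrete, here is a minimal sketch of how a tail percentile could be pulled out of a batch of probe RTTs and compared against an uncongested baseline. The function names, the nearest-rank percentile method, and the ratio-to-baseline metric are my own assumptions for illustration, not the method described in the paper.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples; p is in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def tail_inflation(rtts_us, baseline_us, p=99):
    """Hypothetical congestion indicator: tail RTT divided by a baseline RTT."""
    return percentile(rtts_us, p) / baseline_us

# Made-up probe RTTs in microseconds: mostly fast, with a few slow outliers.
rtts = [10.0] * 98 + [50.0, 200.0]
median_rtt = percentile(rtts, 50)
tail_rtt = percentile(rtts, 99)
inflation = tail_inflation(rtts, baseline_us=10.0)
```

The median here stays at the baseline while the 99th percentile does not, which is why a tail metric is more sensitive to intermittent queuing than an average would be.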

Q: How can you measure one-way delay in your optimized topologies?
A: As long as both RNICs are on the same host, we can take one timestamp from each RNIC and calculate their difference without the need for synchronized clocks.

Q: If we use flowlets instead of ICMP, your system will not work because you can't cover all links?
A: No, the system will not work because it will calculate a different hash for every packet.
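
For context on the hashing point, here is a rough sketch of per-flow, hash-based path selection of the kind ECMP performs. It is an illustration under assumptions: real switches use vendor-specific hash functions, CRC32 is only a stand-in, and `ecmp_path` and the addresses are hypothetical. The only real-world constants are UDP port 4791 (RoCEv2) and IP protocol number 17 (UDP).

```python
import zlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
    """Pick an uplink index by hashing the flow 5-tuple (CRC32 is a stand-in)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % num_paths

# A probe that keeps its 5-tuple fixed is hashed to the same uplink every time.
probe = ("10.0.0.1", "10.0.1.1", 50000, 4791, 17)
paths_for_fixed_tuple = {ecmp_path(*probe, num_paths=4) for _ in range(10)}

# Varying the source port is one way a prober could steer probes onto other uplinks.
paths_for_varied_ports = {
    ecmp_path("10.0.0.1", "10.0.1.1", port, 4791, 17, num_paths=4)
    for port in range(50000, 50064)
}
```

With a deterministic per-flow hash, a prober can reason about which link a given 5-tuple exercises. If path choice changes from packet to packet, that mapping no longer holds, which is the concern raised in the answer above.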