EP20: Revisiting RDMA Reliability for Lossy Fabrics, Oct. 22, 2025

Paper: Revisiting RDMA Reliability for Lossy Fabrics
Authors: Wenxue Li, Xiangzhou Liu, Yunxuan Zhang, Zihao Wang, Wei Gu, Tao Qian, Gaoxiong Zeng, Shoushou Ren, Xinyang Huang, Zhenghang Ren, Bowen Liu, Junxue Zhang, Kai Chen, Bingyang Liu
Presenter: Wenyun Xu, Xiamen University
Guest of Honor: Xuewen Li, Hong Kong University of Science and Technology

Q: What’s the fundamental difference between DCP and NDP?
A: Both NDP and CP utilize packet trimming in switches, but they serve different purposes. NDP uses trimmed headers for receiver-driven congestion control—the receiver estimates bandwidth based on how many header-only packets arrive. CP focuses on loss recovery like DCP does, but CP’s end-host is software-based, whereas DCP’s end-host is an RDMA NIC—a completely hardware design. Because RDMA NIC SRAM is limited, we cannot maintain much state in the NIC. So we introduced a retransmission queue located in host memory, created when the connection is initialized. When the NIC receives a header-only packet, it writes metadata into the retransmission queue, then fetches entries from there to execute retransmissions. This detailed RDMA-related design isn’t mentioned in either CP or NDP.
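The header-only-driven retransmission flow described above can be sketched roughly as follows. This is an illustrative model only—the class and method names are hypothetical, not from the paper—showing how the NIC would append compact metadata to a host-memory queue on each trimmed packet, then drain that queue to retransmit:

```python
from collections import deque

class RetransmissionQueue:
    """Hypothetical sketch of DCP's host-memory retransmission queue,
    created at connection initialization. The NIC writes only metadata
    (not payload), keeping scarce on-NIC SRAM usage minimal."""

    def __init__(self):
        self.entries = deque()

    def on_header_only(self, psn, length):
        # A trimmed (header-only) packet arrived: record which payload
        # was lost so it can be retransmitted later.
        self.entries.append({"psn": psn, "len": length})

    def drain(self, send_fn):
        # The NIC fetches entries from host memory and retransmits.
        while self.entries:
            e = self.entries.popleft()
            send_fn(e["psn"], e["len"])

q = RetransmissionQueue()
q.on_header_only(psn=100, length=4096)
q.on_header_only(psn=101, length=4096)
retransmitted = []
q.drain(lambda psn, ln: retransmitted.append(psn))
# retransmitted now holds the lost PSNs in arrival order: [100, 101]
```

The key point the sketch illustrates is the division of labor: the NIC only appends small metadata records on the fast path, while the bulkier state lives in host memory rather than NIC SRAM.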

Q: How big of an impact does hardware-based end-host implementation have on overall performance improvement?
A: We compared DCP with TCP, which uses a software network stack by default. Both commodity RNICs like ConnectX-5 and DCP achieve about 50% higher bandwidth than TCP. A software network stack is less efficient than a hardware one in both throughput and latency. For latency, DCP achieves about half the latency of TCP.

Q: Did you compare DCP with NDP or CP in your experiments?
A: Actually, we didn’t compare with either of them because we couldn’t find testbed-based implementations. To compare against them, we would have had to re-implement their systems in our testbed, which would have required significant effort. So we didn’t include those comparisons.

Q: Does DCP consistently perform better when integrated with congestion control?
A: Yes, because DCP focuses on reliability—on how to improve loss recovery efficiency. Congestion control is at another level—it controls the sending rate, not loss recovery. DCP is actually compatible with any CC algorithm. In its architecture, DCP is decoupled from the CC module.

Q: How do you see the role of UEC and DCP in future data centers? Can DCP benefit from standardization?
A: DCP was actually inspired by UEC. UEC mentions in its specification that it includes a packet trimming function in switches, but it doesn’t specify how RDMA NICs should utilize this function. We were inspired by UEC and designed DCP to actually utilize packet trimming for RDMA networks. DCP can benefit from UEC because if future UEC-ready switches already support packet trimming, DCP won’t require any switch modification. DCP will only need a customized RDMA NIC product that can work together with UEC-ready switches.

Q: What’s the relationship between DCP and Unified Bus? Is DCP already part of UB?
A: I didn’t go through the UB specification in detail, but currently DCP is not included in it. During my internship, I prepared a draft of a DCP specification for UB, but I’m not sure which version it will be included in, or when. UB consists of two parts—scale-up and scale-out—and the scale-out part is very similar to UEC. I think DCP can be a part of UB since UEC and DCP work in the same direction.

Q: What was the most fundamental challenge or difficulty you ran into during designing and implementing DCP?
A: For me personally, the most difficult part was understanding the current RDMA NIC architecture and its ASIC implementation, then fitting DCP’s design into that architecture. We didn’t have to implement everything from scratch, but we did need to think outside the box. Actually, we found some problems that we didn’t recognize as problems when we started. For example, the batch prefetching design in header-only-based retransmission was a follow-up design—we didn’t plan it at the beginning; it was inspired by idealized experiment results. The hardest part is the loop of implementing, evaluating, finding problems, and feeding them back into the design.
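The batch prefetching idea mentioned above can be illustrated with a small sketch. This is a hypothetical model, not the paper’s implementation: instead of the NIC fetching one retransmission entry per host-memory read, it fetches a batch of entries, amortizing the cost of each PCIe round trip:

```python
def fetch_batched(queue, batch_size=8):
    """Illustrative batch prefetch: drain `queue` (a list of pending
    retransmission entries) in chunks, counting one simulated
    host-memory read per batch rather than per entry."""
    reads = 0
    out = []
    while queue:
        batch = queue[:batch_size]   # one DMA read fetches a whole batch
        del queue[:batch_size]
        out.extend(batch)
        reads += 1
    return out, reads

entries = list(range(20))
fetched, reads = fetch_batched(entries, batch_size=8)
# 20 entries fetched in batches of 8 -> 3 reads instead of 20
```

The design choice this models is exactly the kind of insight the answer describes: per-entry fetches looked fine on paper, but measurement against an idealized baseline exposed the per-read overhead, motivating batching as a follow-up design.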

Q: Have you tried leveraging ChatGPT or other large language models during your implementation or design?
A: I think AI will be helpful, but the implementation was done during my internship, and in the company environment large language models weren’t very reliable. The company wouldn’t allow models from other companies, so I didn’t use large language models very often during implementation. But I think they will be helpful in the future.