EP9: Fast, Scalable, and Accurate Rate Limiter for RDMA NICs, Apr. 9, 2025

Paper : Fast, Scalable, and Accurate Rate Limiter for RDMA NICs
Authors : Zilong Wang, Xinchen Wan, Luyang Li, Yijun Sun, Peng Xie, Xin Wei, Qingsong Ning, Junxue Zhang, Kai Chen.
Presenter : Qiang Su, The Chinese University of Hong Kong.
Guest of Honor : Zilong Wang, The Hong Kong University of Science and Technology.

Q: Why do we need packet-level pacing here? Would it not be enough to pace bundles of packets similar to how TCP does it?

A: In modern high-speed networks at 100 Gbps and beyond, congestion control is critical. TCP-style pacing, which sends a batch of packets at once, can cause network congestion due to bursts. The ConnectX-5/6 NICs, for example, use a segmentation mechanism by default that sends packets in bursts, leading to congestion. Accurate packet-level pacing mitigates these bursts and is necessary for controlling congestion in such high-speed environments.
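
To make the contrast concrete, here is a minimal sketch (not from the paper; the rates and sizes are assumed) of departure times under batch-style pacing versus packet-level pacing:

```python
# Illustrative sketch (assumed numbers): departure times under burst pacing
# vs. per-packet pacing for a flow rate-limited below line rate.

LINE_RATE = 100e9    # link speed in bits/s (assumed)
RATE_LIMIT = 25e9    # per-flow rate limit in bits/s (assumed)
PKT_BYTES = 1500     # packet size (assumed)

gap_paced = PKT_BYTES * 8 / RATE_LIMIT   # ideal inter-packet gap at the limit
gap_wire = PKT_BYTES * 8 / LINE_RATE     # back-to-back spacing on the wire

def burst_departures(n_pkts, burst=32):
    """TCP-style pacing: each batch leaves back-to-back at line rate, then
    the sender idles so the long-run average still matches the limit."""
    times, t = [], 0.0
    for i in range(0, n_pkts, burst):
        batch = min(burst, n_pkts - i)
        times += [t + j * gap_wire for j in range(batch)]
        t += batch * gap_paced
    return times

def paced_departures(n_pkts):
    """Packet-level pacing: every packet is spaced by the ideal gap."""
    return [i * gap_paced for i in range(n_pkts)]

# Both achieve 25 Gbps on average, but the burst version puts 32 packets on
# the wire almost back-to-back -- exactly the kind of burst that fills
# shallow switch buffers in a 100 Gbps fabric.
print(burst_departures(64)[:4])
print(paced_departures(64)[:4])
```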


Q: Have you tested sending bundles of packets per flow instead of single packets in Tassel? Does that affect accuracy or cause bursts?

A: Yes. If we send multiple packets at once, we sacrifice some accuracy but gain higher performance: sending two packets at a time can double the packet rate but reduces pacing precision. The Tassel pipeline first schedules among tens of thousands of flows and then filters out the hundreds of packets whose transmission times are imminent. Sending packets in bundles can speed up sorting, but it introduces a tradeoff between speed and accuracy.
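
A back-of-the-envelope sketch of this tradeoff (the decision rate, packet size, and rate limit below are assumed, not from the paper):

```python
# If the scheduler makes D decisions per second, bundling b packets per
# decision scales the packet rate and the worst-case pacing error together.

D = 100e6                       # scheduling decisions per second (assumed)
PKT_BYTES, RATE = 1500, 25e9    # packet size and rate limit (assumed)

for b in (1, 2, 4):
    pps = D * b                               # achievable packet rate
    # Worst case, the trailing packets of a bundle leave early, up to
    # (b - 1) ideal inter-packet gaps ahead of schedule.
    err = (b - 1) * PKT_BYTES * 8 / RATE
    print(f"bundle={b}: {pps / 1e6:.0f} Mpps, pacing error up to {err * 1e9:.0f} ns")
```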


Q: How do you define the end time of every data stream, flow, or packet?

A: After fetching each flow’s metadata, we know the flow size and packet size. From these, we compute transmission times and inter-packet intervals based on the rate limit. We then filter the imminent packets (those about to be sent) based on the scheduling latency and transmission intervals. These packets are stored on the NIC with their metadata; the others are dropped and fetched again later.
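
A minimal sketch of this two-stage idea (my reconstruction for illustration, not the paper’s code; the names and horizon value are made up):

```python
import time

class Flow:
    """Per-flow pacing state derived from the flow's metadata."""
    def __init__(self, flow_id, rate_bps, pkt_bytes):
        self.flow_id, self.rate_bps, self.pkt_bytes = flow_id, rate_bps, pkt_bytes
        self.next_tx = time.monotonic()      # when the next packet may leave

    def advance(self):
        """Push next_tx forward by one ideal inter-packet interval."""
        self.next_tx += self.pkt_bytes * 8 / self.rate_bps

def imminent(flows, now, horizon):
    """Keep only packets whose transmission time falls within the scheduling
    horizon; the rest stay in host memory and are fetched again later."""
    return [f for f in flows if f.next_tx - now <= horizon]

flows = [Flow(i, rate_bps=25e9, pkt_bytes=1500) for i in range(4)]
ready = imminent(flows, time.monotonic(), horizon=5e-6)   # e.g., a 5 µs horizon
```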


Q: In NS3, every queue pair (QP) sends only one packet at a time. Wouldn’t that cause context-switching overhead? Does RDMA hardware also do this?

A: NS3 is a software simulator that effectively assumes the NIC has unlimited on-chip memory: it simulates packet-level pacing without hardware limitations such as cache misses or context switching. In a real RDMA NIC, the limited on-chip memory can hold only a few QP contexts, so naïvely enabling packet-level rate limiting causes cache misses and a performance drop, as happens on the BlueField NIC when the QP count increases.
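
A toy model of that effect (illustrative only, not the NIC’s real cache design): with a small on-chip QP-context cache, sending one packet per QP in round-robin order misses on nearly every access once the QP count exceeds the cache size.

```python
from collections import OrderedDict

class QPCache:
    """LRU cache standing in for the NIC's on-chip QP-context store."""
    def __init__(self, capacity):
        self.capacity, self.cache, self.misses = capacity, OrderedDict(), 0

    def access(self, qp):
        if qp in self.cache:
            self.cache.move_to_end(qp)           # hit: refresh LRU position
        else:
            self.misses += 1                     # miss: fetch from host memory
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)   # evict least recently used
            self.cache[qp] = True

cache = QPCache(capacity=64)        # only a few QP contexts fit on-chip
for _ in range(100):                # round-robin, one packet per QP
    for qp in range(1000):          # 1000 active QPs >> 64 cached contexts
        cache.access(qp)
print(cache.misses)                 # every access misses -> throughput drops
```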


Q: Why is this hardware limitation not visible in NS3 simulations?

A: Because NS3 doesn’t simulate hardware constraints. It models only the transport layer behavior, not NIC architecture, such as memory size or cache behavior. This results in unrealistic (idealized) performance in simulations compared to real hardware.


Q: In your implementation, do we have to use FPGA for testing Tassel? Are there alternative options?

A: Using FPGA is common for hardware-level networking research. However, for testing, you could also use open-source platforms like Corundum or JingZhao NIC. These provide environments to implement and test networking functions without requiring full custom FPGA development.


Q: Do you use Chisel for FPGA programming? Why or why not?

A: We use Verilog instead of Chisel. While Chisel offers a higher-level abstraction, Verilog gives us finer control and better performance, which is critical in high-speed networks.


Q: Have you tried using large language models (LLMs) for Verilog programming?

A: Yes, LLMs are helpful in early design stages—for writing Python simulations or simple modules. But for complex hardware implementation, their utility is still limited. They’re good for basic tasks but not yet reliable for full hardware design.


Q: What motivated you to work on rate limiting on RDMA NICs?

A: It started with HPCC, Alibaba’s SIGCOMM 2019 paper, which included an FPGA implementation and mentioned that the rate-limiting algorithm is hard to implement in hardware. That raised our interest in understanding why. When we designed SRNIC, we also had to design a congestion control (CC) algorithm, and we found that maintaining a window in hardware is easier than maintaining a timer for a pacing algorithm, so DCTCP turned out to be easier to implement than DCQCN. We chose DCTCP as the default CC for our RNIC, and then we wanted to investigate further why rate limiting is hard in hardware, so we started this project.
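
A conceptual sketch of why a window is easier than a timer in hardware (my illustration, not SRNIC code): window-based control needs only a counter and one comparison per flow, while rate-based control must track a per-flow "next allowed send time" against the clock.

```python
class WindowCC:
    """DCTCP-style: send whenever in-flight bytes fit in the window."""
    def __init__(self, window_bytes):
        self.window, self.inflight = window_bytes, 0

    def can_send(self, pkt_bytes):
        return self.inflight + pkt_bytes <= self.window   # one comparison

    def on_ack(self, pkt_bytes):
        self.inflight -= pkt_bytes        # ACKs return credit; no timer needed

class RateCC:
    """DCQCN-style: each flow carries a timestamp the hardware must compare
    against the clock at exactly the right moment."""
    def __init__(self, rate_bps):
        self.rate, self.next_tx = rate_bps, 0.0

    def can_send(self, pkt_bytes, now):
        if now < self.next_tx:
            return False                  # per-flow timer not yet expired
        self.next_tx = now + pkt_bytes * 8 / self.rate
        return True
```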


Q: What’s the most fundamental challenge, or the biggest challenge you encountered when working on Tassel?

A: The biggest challenge is figuring out which algorithm can actually be implemented in an RNIC. For example, we initially thought PIEO, a related work, was ideal: it supports tens of thousands of flows, accurate and real-time rate limiting, and programmable scheduling. But its design is too complicated, and its achievable clock frequency is low, around 100 MHz, which caps the packet rate. So we needed a simpler and more efficient design for fast and accurate rate limiting.
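
Quick arithmetic on why a ~100 MHz clock caps the packet rate (the frame sizes are assumed; the 100 MHz figure is from the answer above):

```python
CLOCK_HZ = 100e6        # achievable scheduler clock frequency
max_pps = CLOCK_HZ      # at most one scheduling decision per cycle -> 100 Mpps

# 100 GbE line rate for 64 B frames (plus 20 B preamble + inter-frame gap):
line_pps = 100e9 / ((64 + 20) * 8)
print(f"scheduler: {max_pps / 1e6:.0f} Mpps, line rate: {line_pps / 1e6:.1f} Mpps")
# ~100 Mpps < ~148.8 Mpps: the low clock, not the MAC, becomes the bottleneck.
```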


Q: What’s the limitation or potential drawback of your approach in the paper?

A: In our work, we initially assumed that we must achieve the most accurate rate-limiting. But in some cases, we can sacrifice a bit of accuracy to gain much higher performance—for example, by sending two packets at a time. So the trade-off between accuracy, scalability, and performance might need further exploration.


Q: What’s the largest scale you have tried in terms of number of flows and capacity of the NIC?

A: We currently support tens of thousands of flows and hundreds of millions of packets per second. We’ve achieved 100 Gbps with small packets, and with higher MAC capacity we could scale to 200 or even 400 Gbps.


Q: What’s the next step in your research, as you mentioned the trade-offs among accuracy, performance, and scalability?

A: There are indeed trade-offs. If we aim for ultra-high performance or scalability, we might need to sacrifice some accuracy. For specific use cases like AI or storage, the balance between these metrics might be adjusted accordingly.


Q: Could another future research direction be co-design? Specifically, co-designing application-level scheduling and congestion control with rate-limiting?

A: Yes, it’s possible. In scenarios like AI training, workloads are predictable, so the training framework could configure the rate of each flow and set the rate limiter in the NIC, which would help resolve congestion.


Q: If we have a “know-it-all” scheduler that timestamps every packet precisely, do we still need a rate-limiter?

A: In some cases, no. If flows have different priorities and each has a timestamp, we could just send them at line rate in order—no rate-limiter needed. But when priorities are unknown or flows share equal priority (like different users or applications), we need the rate-limiter to ensure fairness.
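
A sketch of that “know-it-all” case (illustrative; `transmit` is a hypothetical send routine): with precise per-packet timestamps, sending reduces to ordering, so no rate limiter is involved.

```python
import heapq

def send_in_timestamp_order(packets):
    """packets: iterable of (timestamp, flow_id, payload) tuples."""
    heap = list(packets)
    heapq.heapify(heap)                    # order by precomputed timestamp
    while heap:
        ts, flow_id, payload = heapq.heappop(heap)
        transmit(flow_id, payload)         # hypothetical line-rate send

def transmit(flow_id, payload):
    print(f"flow {flow_id}: {payload}")

send_in_timestamp_order([(3e-6, 1, "b"), (1e-6, 0, "a"), (2e-6, 2, "c")])
```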


Q: So, in scenarios where flows have the same priority, the rate-limiter ensures fair bandwidth sharing?

A: Exactly. The rate-limiter ensures fairness and prevents congestion when multiple flows with equal priority share bandwidth.


Q: Any advice for graduate or PhD students?

A: Be a self-motivated student.