Fast, Scalable, and Accurate Rate Limiter for RDMA NICs

Title: Fast, Scalable, and Accurate Rate Limiter for RDMA NICs

Authors: Zilong Wang, Xinchen Wan (Hong Kong University of Science and Technology), Luyang Li (ICT/CAS), Yijun Sun (Hong Kong University of Science and Technology), Peng Xie (ByteDance), Xin Wei (ByteDance), Qingsong Ning (ByteDance), Junxue Zhang, Kai Chen (Hong Kong University of Science and Technology)

Scribe: Haohao Song (Xiamen University, China)

Introduction
Remote Direct Memory Access Network Interface Cards (RNICs) require a rate limiter that combines precision, scalability, and speed. The rate limiter must accurately apply specific rates to each flow and manage packet intervals to ensure smooth traffic across a large number of connections. This is essential for effective congestion control in data centers and for isolating network traffic to preserve application performance and user experience. However, existing solutions like SENIC and PIEO, while providing accurate and scalable rate limiting, are too slow, achieving only 31.4% of the necessary packet rate for RNICs. Their single-packet-per-sorting approach results in a significant performance bottleneck when handling thousands of flows, offering only 34.5 Mpps compared to the 110 Mpps capability of RNICs. Therefore, there is an urgent demand to improve the rate limiter of RNICs.

Key idea and contribution:
The authors introduce Tassel, a novel rate limiter designed specifically for RNICs. Tassel’s approach is hierarchical, first applying scalable rate limiting to the flows to be scheduled and then accurate rate limiting to the packets to be transmitted. This two-step process is designed to improve performance while maintaining accuracy and scalability. Tassel uses adaptive batching to fetch multiple packets from scheduled flows to hide PCIe and sorting latency, thus improving packet rate for flow-level rate limiting. Additionally, it employs packet filtering to reduce the number of packets requiring sorting, thereby increasing the sorting rate and enhancing packet-level rate-limiting performance. In terms of hardware design, Tassel utilizes a combination of data structures, such as the pipelined heap, timing wheel, and register array, to meet the diverse requirements of different stages in the rate-limiting process. This combination allows Tassel to achieve high performance, scalability, and accuracy simultaneously.

Evaluation
The authors have integrated Tassel into the RNIC architecture by replacing the original QP (Queue Pair) scheduler module and implemented a prototype using FPGA (Field-Programmable Gate Array). This result is significant because experimental results show that Tassel achieves a packet rate of 125 Mpps (millions of packets per second), outperforming SENIC and PIEO by 3.6 times. It supports 16,000 flows with low resource usage, consuming 7.5% to 25.6% less computing and memory resources compared to SENIC and PIEO, while maintaining high accuracy across a wide range of rate limits, from 100 Kbps to 100 Gbps.

Q1: Your scheduler contains something like a time window and if the packet is not in the nearby time window, you will drop it. What do you mean you drop it? Does it need to retransmit on the center side or something else?
A1: The workflow is that we fetch multiple packets and we find the first packets are close to being sent and we drop the first packet in the sender and update the send pointer. So when we reschedule it next time, we will refresh it and no package transmission will happen.

Q2: Could your idea be applied to the black-box DPU?
A2: Our method is more related to the DMA layer instead of the transport layer. We can replace or improve the original rate limiter or scheduler. We do not touch the transport layer.

Personal thoughts
RNICs employ rate limiters to regulate traffic through the introduction of gaps between packets and to assign rates to various flows, ensuring functions like isolation and fairness. The paper argues that current methods are not fast. It introduces a two-stage method that organizes flows and then selects packets for transmission. In addition, this method advocates for the dispatch of multiple packets after the sorting of all flows to minimize the sorting cost, the approach. I really appreciate its motivation and contributions.