Unlocking Superior Performance in Reconfigurable Data Center Networks with Credit-Based Transport

Authors: Federico De Marchi (Max Planck Institute for Informatics), Jialong Li (Shenzhen University of Advanced Technology), Ying Zhang (Meta), Wei Bai (NVIDIA), Yiting Xia (Max Planck Institute for Informatics)

Scribe: Mengrui Zhang (Xiamen University)

Introduction

This paper studies transport performance in Reconfigurable Data Center Networks (RDCNs). RDCNs use optical circuit switches to build flexible topologies and promise higher capacity and power efficiency, but existing transport protocols fail to fully exploit that potential. Prior work such as ExpressPass has shown that credit-based transport can achieve high throughput and bounded queues in Clos networks. However, applying such designs directly to varying expanders like Opera introduces two key challenges: (1) under frequently changing topologies, credits and data packets may traverse asymmetric paths, which breaks the consistency of bandwidth allocation; and (2) paths in varying expanders often cross multiple bottlenecks, which leads to severe credit waste that cannot be effectively corrected and results in low link utilization.
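To see why multiple bottlenecks waste credits, a back-of-the-envelope model helps. The sketch below is only an illustration under a simplifying assumption that is not from the paper: each bottleneck on a credit's path drops the credit independently with a fixed probability. The function names `credit_survival` and `wasted_at_first_hop` are hypothetical.

```python
def credit_survival(drop_probs):
    """Probability that a credit passes every bottleneck on its path.

    drop_probs: per-bottleneck drop probabilities, in path order.
    Assumes independent drops (a simplification, not the paper's model).
    """
    survive = 1.0
    for drop in drop_probs:
        survive *= 1.0 - drop
    return survive


def wasted_at_first_hop(drop_probs):
    """Fraction of credits that pass the first bottleneck but are dropped later.

    These credits consumed credit bandwidth at the first bottleneck
    without ever triggering a data packet.
    """
    if not drop_probs:
        return 0.0
    return 1.0 - credit_survival(drop_probs[1:])
```

Under this toy model, two bottlenecks that each drop half of the credits let only a quarter of the credits through, and half of the credits admitted at the first bottleneck are wasted at the second.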

Key Idea and Contribution

The authors propose Flare, a new credit-based transport protocol tailored for Opera’s varying expander topology. Like ExpressPass, Flare paces credits at all links to control congestion, but it is specifically designed to overcome the challenges unique to varying expanders. Its design introduces four mechanisms: strict symmetry between credit and data paths, enforced even under frequent topology changes; probabilistic credit admission, which reduces waste and biases allocation toward shorter paths; an extended credit-rate control mechanism; and a tentative credit system that opportunistically fills unused bandwidth. Together, these mechanisms ensure low latency, high throughput, and bounded queue sizes in Opera.

By making credit admission probabilistic and introducing tentative credits, Flare can both improve utilization and reduce credit waste, while symmetry enforcement guarantees consistent bandwidth allocation. The authors implemented Flare using DPDK and Tofino2 switches, validating its effectiveness through large-scale simulations and a hardware testbed.

Evaluation

The authors compare Flare against state-of-the-art baselines, including NDP, ExpressPass, TDTCP, and Bolt, under realistic workloads (e.g., web search, Hadoop, RPC). Flare consistently achieves higher throughput, up to 2× over NDP and 1.5× over ExpressPass, and significantly lower flow completion times, up to 10× shorter than ExpressPass and 15× shorter than TDTCP. Importantly, Opera with Flare outperforms a Clos network of similar cost, achieving 1.15× higher throughput even under adversarial traffic.

Q&A

Q1: How is the threshold set for probabilistic credit admission, and is it static or application-dependent?

A1: In the evaluation, the threshold was set statically at 50% of the queue. The threshold essentially determines what portion of credits in the queue are subject to probabilistic admission. For high traffic loads, lowering the threshold may increase throughput, while for lighter traffic, a higher threshold may help avoid underutilization.

Q2: How does probabilistic credit admission compare to approaches like overcommitting credits?

A2: Probabilistic admission acts as a lightweight, stateless heuristic that reflects the likelihood of a credit surviving multiple bottlenecks. It favors credits with fewer remaining hops, reducing the chances of wasted bandwidth. Unlike overcommitting credits, which indiscriminately increases traffic, probabilistic admission selectively filters credits early, reducing waste in a more controlled way.
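The admission rule discussed in Q1 and Q2 can be sketched as follows. This is a minimal illustration of one possible reading, not Flare's implementation: the class name `CreditQueue`, the `remaining_hops` field, and the admission probability `1 / remaining_hops` are all assumptions, chosen only to show a static threshold at 50% of the queue and a bias toward credits with fewer remaining hops.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Credit:
    flow_id: int
    remaining_hops: int  # hypothetical field: hops left on the credit's path


class CreditQueue:
    """Toy credit queue with threshold-based probabilistic admission."""

    def __init__(self, capacity, threshold_fraction=0.5, rng=None):
        self.capacity = capacity
        # Occupancy at or above which admission becomes probabilistic
        # (50% of the queue in the evaluation, per A1).
        self.threshold = int(capacity * threshold_fraction)
        self.queue = deque()
        self.rng = rng or random.Random()

    def admit_probability(self, credit):
        # Assumed form, not taken from the paper: fewer remaining hops
        # gives a higher chance of admission.
        return 1.0 / max(1, credit.remaining_hops)

    def enqueue(self, credit):
        """Return True if the credit is admitted, False if it is dropped."""
        occupancy = len(self.queue)
        if occupancy >= self.capacity:
            return False  # queue full: always drop
        if occupancy >= self.threshold:
            # Above the threshold, admit only with the per-credit probability.
            if self.rng.random() >= self.admit_probability(credit):
                return False
        self.queue.append(credit)
        return True
```

The filter is stateless per flow: the decision depends only on the current queue occupancy and a value carried by the credit itself, which matches the "lightweight, stateless heuristic" described in A2. Below the threshold every credit is admitted, so lightly loaded links are unaffected.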

Personal Thoughts

I like that this paper bridges the long-standing performance gap between reconfigurable and Clos networks by innovating at the transport layer, rather than only in topology or routing. The probabilistic credit mechanism is particularly clever: it sacrifices short-term fairness to maximize overall utilization. The testbed validation also strengthens the paper’s credibility.