EP23: InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers, Dec. 3, 2025

Paper: InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers
Authors: Chenchen Shou, Guyue Liu, Hao Nie, Huaiyu Meng, Yu Zhou, Yimin Jiang, Wenqing Lv, Yelong Xu, Yuanwei Lu, Zhang Chen, Yanbo Yu, Yichen Shen, Yibo Zhu, Daxin Jiang
Presenter: Chengxuan Pei, The Chinese University of Hong Kong
Guest of Honor: Chenchen Shou, Peking University

Q: Has anybody implemented this in data centers like Alibaba or Google?

A: Yes, this is a very good question. Our collaborating company on this paper, Lightelligence, released a super-node prototype based on our OCS transceiver hardware and demonstrated it at this year's Shanghai AI conference. Going forward, Peking University, StepFun, and Lightelligence will continue to optimize the OCS transceiver and optical-switching super nodes according to the mainstream traffic patterns of current large-model training.

Q: Did you evaluate the availability or reliability of your prototype? What is the expected lifetime of the optical transceiver?

A: In practice, I did not work with the prototype directly; it was built after the paper was published, so I never had a chance to evaluate it myself. The hardware evaluation is in the paper, but the super-node evaluation is unknown. I could work with the OCS transceiver hardware itself, but not the integrated super node.

Q: What is the switching latency, and what are the costs or trade-offs of doing reconfiguration?

A: The switching latency has two parts. The first is the hardware overhead of the OCS switch itself: some papers, such as one at SIGCOMM'21, report nanosecond-scale switching, while our architecture's hardware switching latency is 60 to 80 microseconds. The second is the communication software stack: reconnecting sessions and waiting for routing convergence may take 10 seconds or even minutes. In our paper, switching only occurs when a failure happens, so the switching overhead is easily tolerated within the minutes already spent on failure handling. However, this does not fully unleash the hardware potential of the OCS transceiver; there are many opportunities to optimize protocols for dynamic topology switching.
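To make the two-part cost concrete, here is a back-of-the-envelope sketch in Python. Only the hardware figure comes from the answer above; the software-stack figures are illustrative assumptions standing in for "10 seconds or even minutes", not measurements.

```python
# Reconfiguration budget sketch. Only HW_SWITCH_S is from the answer above;
# the software-stack numbers are assumed for illustration.

HW_SWITCH_S = 80e-6           # OCS hardware switching latency (60-80 us, upper bound)
SESSION_REBUILD_S = 10.0      # assumed: transport sessions re-established
ROUTING_CONVERGENCE_S = 60.0  # assumed: routing convergence, "up to minutes"

total = HW_SWITCH_S + SESSION_REBUILD_S + ROUTING_CONVERGENCE_S
print(f"total reconfiguration time: {total:.3f} s")
print(f"hardware share of the total: {HW_SWITCH_S / total:.6%}")
```

Under these assumptions the optics account for roughly a millionth of the end-to-end cost, which is why failure-triggered switching can tolerate it, and why the software stack is the part worth optimizing.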

Q: Do you need to do any packet or data retransmissions when doing reconfiguration?

A: Of course, there is some overhead: data retransmission, session rebuilding, and routing convergence. This is because all data transfer is built on traditional datacenter protocols designed for electrical packet-switched networks. In AI datacenters the traffic pattern is predictable, which gives us room to optimize these technologies and protocols, but this paper focuses on topology optimization, not protocol optimization.

Q: What is the relationship between InfiniteHBD and the DCN topology? Can InfiniteHBD work with any topology?

A: Yes. This follows from the separation between the datacenter network (DCN) and InfiniteHBD: the DCN provides only around 100 Gbps of bandwidth, while InfiniteHBD provides terabit-level bandwidth. So InfiniteHBD can work with other DCN topologies such as rail-optimized designs, fat-tree, or even a Google-style torus, as long as they have enough ports. But one day the super node may scale to the whole datacenter network, as Huawei CloudMatrix 384 may scale to thousands of GPUs; in that case the scale-out network may no longer be necessary, and our architecture may not be adopted.

Q: For the orchestration between the HBD and the DCN, what specific job information does the operator need to give the orchestrator?

A: The orchestration algorithm has two objectives. The first is GPU utilization, or availability; that is the main objective. The second is DCN traffic locality, which reflects congestion in the DCN. We have a two-phase algorithm: the deployment phase uses the physical cabling of the HBD to optimize traffic locality, and the runtime phase orchestrates TP group placement to optimize DCN traffic locality. We use a binary search algorithm to find the best trade-off between these two metrics: first maximizing GPU utilization for big training jobs, then minimizing cross-ToR traffic to reduce congestion.
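As a rough illustration of that trade-off, here is a minimal Python sketch: it binary-searches the smallest ToR footprint that still satisfies a job's GPU demand, so demand (utilization) comes first and cross-ToR traffic is minimized second. The greedy feasibility check and all names are assumptions for illustration, not the paper's actual orchestration algorithm.

```python
from typing import Dict

def feasible(free_gpus_per_tor: Dict[str, int], demand: int, max_tors: int) -> bool:
    """Can `demand` GPUs be placed using at most `max_tors` ToRs?
    Greedy stand-in: take the ToRs with the most free GPUs first."""
    top = sorted(free_gpus_per_tor.values(), reverse=True)[:max_tors]
    return sum(top) >= demand

def min_tors_for_job(free_gpus_per_tor: Dict[str, int], demand: int) -> int:
    """Binary-search the smallest ToR footprint that satisfies the job.
    Satisfying `demand` at all is the primary objective (GPU utilization);
    spanning fewer ToRs reduces cross-ToR DCN traffic, the secondary one."""
    lo, hi = 1, len(free_gpus_per_tor)
    if not feasible(free_gpus_per_tor, demand, hi):
        raise ValueError("not enough free GPUs for this job")
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(free_gpus_per_tor, demand, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

# Example: a 96-GPU job against racks with varying free capacity.
free = {"tor0": 48, "tor1": 32, "tor2": 24, "tor3": 16}
print(min_tors_for_job(free, demand=96))  # -> 3 (48 + 32 + 24 >= 96)
```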

Q: How does the orchestrator get fault information, and is the bypassing decision centralized or distributed?

A: The workflow is that a detector learns a node has failed, the orchestrator receives this information, and it then controls the OCS transceivers to bypass the failed node. So the bypass decision is also made by the orchestrator. We use centralized orchestration rather than distributed control, though both approaches can make sense.
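A minimal sketch of that detect-then-bypass workflow, assuming a logical ring and a hypothetical orchestrator interface; none of these class or method names come from the paper.

```python
class RingTopology:
    """Tracks a logical ring of nodes in the HBD (assumed structure)."""
    def __init__(self, nodes):
        self.nodes = list(nodes)

    def neighbors(self, node):
        i = self.nodes.index(node)
        return self.nodes[i - 1], self.nodes[(i + 1) % len(self.nodes)]

    def remove(self, node):
        self.nodes.remove(node)

class Orchestrator:
    """Centralized controller: the bypass decision is made here."""
    def __init__(self, topology):
        self.topology = topology

    def on_node_failure(self, node):
        """Invoked by the failure detector."""
        left, right = self.topology.neighbors(node)
        # Stand-in for programming the OCS transceivers to form a direct
        # optical circuit between the failed node's two neighbors.
        print(f"OCS: connect {left} <-> {right}, bypassing {node}")
        self.topology.remove(node)

ring = RingTopology(["node0", "node1", "node2", "node3"])
Orchestrator(ring).on_node_failure("node2")
# -> OCS: connect node1 <-> node3, bypassing node2
```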

Q: What was the most challenging part of this work?

A: The most challenging part was when DeepSeek was released during the Chinese Spring Festival. We needed to think about how to support the MoE architecture and parallelisms like EP and TP. In the end, we did not solve that problem completely in this paper; perhaps we will address it in future work. This happened about one month before the SIGCOMM deadline.

Q: What’s the most exciting future work along this line?

A: First is finding a topology architecture that can support all traffic patterns: one that can be configured like a tree at one moment, like a torus at another, and like a dragonfly at yet another. Second is optimizing protocols for optical circuit switching networks, since traditional protocols like RDMA or BGP are not well suited to OCS networks. Third is exploring whether optical transceivers and optical reconfiguration can be leveraged to improve inference as well.

Q: What’s your advice for students who want to work in this area?

A: First, collaborate with big companies or groups that have advanced hardware. In universities you do not have access to hardware like OCS transceivers; it shows up in companies. Second, build an edge in newer areas like optical circuit switching; this is a great area with many opportunities. In crowded areas like LLM inference or training, it is hard to come up with a novel idea that changes the whole field, so I suggest choosing newer areas for future work.