EP15: OptimusPrime: Unleash Dataplane Programmability through a Transformable Architecture, July 2, 2025

Paper: OptimusPrime: Unleash Dataplane Programmability through a Transformable Architecture
Authors: Zhikang Chen, Yong Feng, Shuxin Liu, Haoyu Song, Hanyi Zhou, Tong Yun, Wenquan Xu, Tian Pan, Bin Liu
Presenter: Xuanhao Liu, Xiamen University
Guest of Honor: Zhikang Chen, Tsinghua University

Q: In the future work section, you mentioned that during runtime, the transformable block could be converted to a pipeline or RTC core dynamically to meet the demands of the program. Is that possible?

A: In our current design, it’s not possible, because the role of each block is fixed when the program is compiled and downloaded to the OptimusPrime architecture. If you want to transform a block from a pipeline stage to an RTC block, you must provide another P4X program, and the entire architecture will be reconfigured. I also published another paper in NSDI’22 introducing IPSA, a runtime-configurable architecture for RMT, which is different from OptimusPrime. OptimusPrime benefits from a ring-based interconnection, but it is also limited by it. Unlike a crossbar, which can connect transformable blocks in any way, the ring-based interconnection must follow a linear structure.

Q: So my takeaway is that OptimusPrime needs to recompile the program when the operator changes the configuration. That’s why it can’t support runtime reconfiguration, right?

A: Yes. When the operator changes the program configuration, it must go through the compilation process, and the whole architecture will be reconfigured. dRMT, supported by FlexCore, can handle runtime changes because it has a crossbar. The main limitation in OptimusPrime comes from the ring-based interconnection.

Q: Following this runtime reconfiguration question, I recall another NSDI’22 paper, Runtime Programmable Switches by Jiarong Xing, which is based on dRMT. Could we borrow some ideas from that paper to make OptimusPrime runtime configurable?

A: My team also published a paper in NSDI’22 (first author Yong Feng, I am the second author) titled Enabling In-situ Programmability in Network Data Plane: From Architecture to Language. It’s also about runtime reconfiguration, but based on a traditional RMT architecture. We designed IPSA, a variant of RMT that uses a crossbar to connect processors and memory. In OptimusPrime, like traditional RMT, each transformable block—whether pipeline stage or RTC processor—has its own private memory space, so it cannot directly access another processor’s memory. This is why it doesn’t support runtime configuration. We are exploring combining OptimusPrime with the IPSA architecture to enable runtime configurability, and also integrating ideas from our NSDI’24 paper, which supports more complex operations in the RMT pipeline.
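
The memory-access difference described in this answer can be pictured with a toy model. The sketch below is an illustration in Python under stated assumptions, not the actual hardware interface: the class names `PrivateMemoryBlock` and `CrossbarMemory`, the table names, and the error type are all hypothetical.

```python
class PrivateMemoryBlock:
    """Toy model: a block that can only read tables held in its own memory."""

    def __init__(self, name, tables):
        self.name = name
        # Copy the entries so each block owns its tables exclusively.
        self._tables = {t: dict(entries) for t, entries in tables.items()}

    def lookup(self, table, key):
        if table not in self._tables:
            # Hypothetical error type; models "cannot directly access
            # another processor's memory".
            raise LookupError(f"{self.name} cannot reach table {table!r}")
        return self._tables[table].get(key)


class CrossbarMemory:
    """Toy model: a shared pool in which any processor can reach any table."""

    def __init__(self, tables):
        self._tables = {t: dict(entries) for t, entries in tables.items()}

    def lookup(self, processor, table, key):
        # `processor` is unused on purpose: access does not depend on it.
        return self._tables[table].get(key)


if __name__ == "__main__":
    stage0 = PrivateMemoryBlock("stage0", {"ipv4_host": {"10.0.0.1": "port1"}})
    stage1 = PrivateMemoryBlock("stage1", {"acl": {"10.0.0.1": "allow"}})
    print(stage0.lookup("ipv4_host", "10.0.0.1"))
    try:
        stage1.lookup("ipv4_host", "10.0.0.1")
    except LookupError as err:
        print(err)

    pool = CrossbarMemory({"ipv4_host": {"10.0.0.1": "port1"}})
    print(pool.lookup("stage1", "ipv4_host", "10.0.0.1"))
```

One way to read the answer through this model: with private memory, moving a function to a different block means its table has to move with it, whereas with a crossbar only the mapping between processor and table would change.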

Q: What is the fundamental limitation of OptimusPrime?

A: The ring-based interconnection. We designed it to reduce resource consumption on FPGA or ASIC, but it has become a fundamental limitation. For example, OptimusPrime cannot support execution patterns like pipeline → RTC → pipeline → RTC. While it saves resources, the ring design limits programmability.
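
The constraint in this answer can be sketched as a small check. This is a minimal illustration in Python, not code from the paper or its compiler: the function name `fits_linear_ring` and the block labels are hypothetical, and it assumes, as stated later in this discussion, that the data path places pipeline stages first and RTC cores afterward.

```python
from itertools import groupby

PIPELINE = "pipeline"
RTC = "rtc"


def fits_linear_ring(pattern):
    """Return True if every pipeline stage precedes every RTC block.

    `pattern` lists block kinds in execution order. This is a hypothetical
    model of the one-way (pipeline -> RTC) data path, not the actual
    OptimusPrime compiler check.
    """
    pattern = list(pattern)
    for kind in pattern:
        if kind not in (PIPELINE, RTC):
            raise ValueError(f"unknown block kind: {kind!r}")
    # Collapse runs of the same kind, e.g. [p, p, r] -> [p, r].
    phases = [kind for kind, _ in groupby(pattern)]
    # Allowed shapes: empty, pipeline only, RTC only, or pipeline then RTC.
    return phases in ([], [PIPELINE], [RTC], [PIPELINE, RTC])


if __name__ == "__main__":
    print(fits_linear_ring([PIPELINE, PIPELINE, RTC, RTC]))  # True
    print(fits_linear_ring([PIPELINE, RTC, PIPELINE, RTC]))  # False
```

Under this model, the pattern pipeline → RTC → pipeline → RTC is rejected because it switches back from RTC to pipeline, which a one-way linear data path cannot express.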

Q: Can OptimusPrime be shared by multiple applications?

A: If a P4 pipeline like Intel Tofino can be shared by multiple applications, then OptimusPrime can too.

Q: But wouldn’t your ring design cause bandwidth contention or isolation problems?

A: Yes. In our evaluations, the inner ring showed some contention, but in the outer ring, contention was not significant.

Q: Some applications require few resources, others require many. Could we run multiple applications with different resource requirements together to improve overall utilization?

A: Yes, and there is related work—Jiaxin Lin from UT Austin published Enabling Portable and High-Performance SmartNIC Programs with Alkali (NSDI’25), which addresses similar issues for SmartNICs. My teammates are also working to enhance the P4X language for better optimization in multi-application scenarios.

Q: What types of applications are hard to program on OptimusPrime?

A: Any application that can be programmed in P4 can also be programmed on OptimusPrime, since it’s essentially an alternative P4/RMT architecture. Applications that don’t run efficiently in a P4 pipeline likely won’t run efficiently here either. We’ve only tested certain cases, so I can’t give a complete answer.

Q: Could OptimusPrime offload important network operations in AI data centers, such as collective communication?

A: Currently, no. The design doesn’t include shared memory (though it is mentioned as future work in the paper). To support collective communication, the memory subsystem and its controller would need to be redesigned.

Q: If used in a SmartNIC or NIC, would OptimusPrime need to support a high-performance network stack?

A: Yes. In that case, additional modules would need to be added to the architecture.

Q: From a high level, it seems difficult to translate the semantics of complex C programs into P4-based packet processing. What are the key implementation challenges in OptimusPrime?

A: I focus on the hardware (FPGA) side, but the compiler is indeed challenging—it has over 20k lines of code. Finding the relationship between the MAU and CPU architectures was hard, and we ended up using a traditional five-stage RISC-V CPU pipeline, which is not the most efficient.

Q: What was the most challenging part of the hardware design?

A: Managing contention in both the outer and inner rings. We structured the data path so that the pipeline comes first and RTC cores come afterward, allowing only one-way traffic (pipeline → RTC) to reduce contention.

Q: For a new student entering programmable data plane architecture, where should they start?

A: Look for open-source code on GitHub—many NSDI and SIGCOMM papers release their code.

Q: But FPGA development can be hard to debug and test. Any advice?

A: Professor Xiangrui Yang from NUDT published a simplified RMT FPGA implementation, which is easier for beginners to learn.