EP27: Software-based Live Migration for RDMA, Mar. 25, 2026

Paper: Software-based Live Migration for RDMA
Authors: Xiaoyu Li, Ran Shu, Yongqiang Xiong, Fengyuan Ren
Presenter: Haiyang Xu, Southeast University
Guest of Honor: Xiaoyu Li, Peng Cheng Laboratory

Q: Your paper addresses differential synchronization between TCP and RDMA protocols. Is it possible to integrate your work with existing approaches that use RDMA for bulk state transfer to the migration destination?

A: I think we can integrate our work with approaches that use RDMA for state transfer without major architectural changes. That is my general opinion, though I haven't analyzed it deeply — it is a point worth thinking through more carefully.


Q: If state image transfer is used, would it consume a significant amount of RDMA bandwidth, leaving less bandwidth available for normal applications?

A: It may consume some RDMA bandwidth, yes. In my design, I also need to notify all the communication partners, which itself consumes bandwidth. Honestly, I don’t have a precise figure on the overhead yet. I am currently expanding my work to a journal version, and I plan to add this evaluation there. My estimation is that it won’t cost too much bandwidth, but that needs to be validated.


Q: If state updates are very frequent, the state transfers from source to destination RDMA node would also be very frequent — otherwise state loss may occur. Did you try any reliable synchronization protocol between the source and destination for the migration?

A: Our assumption is that live migration is not very frequent. For latency-sensitive applications, you don't want to migrate them frequently unless it is absolutely necessary — for example, for server upgrades or load balancing. However, you raise a valid point: if the application has a very large state, compression could be an optimization. You could compress all the state into a compact image, transfer it, and then decompress and restore it at the destination. Since migration is infrequent, the migration latency is acceptable in this scenario.
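The compress-transfer-restore idea mentioned in the answer can be sketched as follows. This is only an illustration, not the paper's implementation; the state layout and function names are hypothetical:

```python
import json
import zlib

def pack_state(state: dict) -> bytes:
    """Serialize and compress application state into a compact image
    at the migration source."""
    raw = json.dumps(state).encode("utf-8")
    return zlib.compress(raw, level=9)

def unpack_state(image: bytes) -> dict:
    """Decompress and restore the state image at the migration destination."""
    return json.loads(zlib.decompress(image).decode("utf-8"))

# Hypothetical per-connection state captured at the source.
state = {"qp_num": 42, "psn": 1000, "pending_wrs": [1, 2, 3]}
image = pack_state(state)
restored = unpack_state(image)
assert restored == state
```

In practice the compression step trades CPU time at the source for less transfer bandwidth, which is a reasonable trade precisely because migration is infrequent.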


Q: Is your work primarily targeting traditional cloud computing cases? Have you considered AI data center scenarios, where GPUs are involved?

A: Yes, we are mainly targeting traditional cloud cases. For AI data centers, live migration is much more complex — you need to handle not only RDMA migration but also GPU state migration. GPU migration work has already been published at SOSP 2025. I also think that for machine learning inference, the more important problem is not live migration but fast startup and fast restore, which is a different scenario. Work like Phoenix OS has discussed all three aspects: live migration, fast startup, and fast restore. If you are interested, that is worth reading.


Q (from chat): If the source node crashes or a network interruption occurs during the migration process, how does the system guarantee state consistency and recoverability?

A: It depends on when the source node crashes. I think one practical solution is to use lazy migration. More generally, when the system detects that the source node is about to fail, it can proactively start migrating to the destination. For network interruptions specifically, since they are often temporary, a simple policy is to back off and wait for the network to recover, then resume migration. Alternatively, you could choose a different migration destination and restart. These are straightforward but practical solutions, though I acknowledge there may be more sophisticated approaches.
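The back-off-and-retry policy with a fallback destination could be sketched like this. All names here are hypothetical, and this is a simplification of what a real migration controller would do:

```python
import time

def migrate_with_backoff(do_migrate, destinations, max_retries=3, base_delay=0.01):
    """Attempt migration; on a transient failure, back off and retry,
    then fall back to the next candidate destination."""
    for dest in destinations:
        delay = base_delay
        for _ in range(max_retries):
            try:
                return do_migrate(dest)
            except ConnectionError:
                time.sleep(delay)  # back off, wait for the network to recover
                delay *= 2         # exponential backoff before retrying

    raise RuntimeError("migration failed on all candidate destinations")

# Demo: the first destination is unreachable, the second succeeds.
def fake_migrate(dest):
    if dest == "dst-A":
        raise ConnectionError("link down")
    return f"migrated to {dest}"

result = migrate_with_backoff(fake_migrate, ["dst-A", "dst-B"])
assert result == "migrated to dst-B"
```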


Q: Did you consider different state consistency models — for instance, strong consistency (source and destination are always in sync) versus a weaker model that tolerates short-term inconsistency?

A: Our design mainly focused on the strong consistency model. Suppose the migration source posts work requests 1, 2, 3, 4, 5 in sequence; in our design, we wait for all of them to complete before finalizing migration, so the application sees them in the correct order after restoration. That said, I think it is very interesting to explore a weaker model that tolerates short-term inconsistency — there could be new design space there. The paper RedPlane from SIGCOMM 2021 discusses related consistency issues and is worth reading.
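The wait-for-all-completions ordering described above can be modeled with a toy sketch. This is not the paper's implementation — just a minimal model of draining posted work requests in order before finalizing migration, with hypothetical names:

```python
from collections import deque

class MigrationSource:
    """Toy model: finalize migration only after every posted work
    request has completed, so the application observes them in order."""
    def __init__(self):
        self.posted = deque()   # work requests posted but not yet completed
        self.completed = []     # completions observed so far

    def post_wr(self, wr_id):
        self.posted.append(wr_id)

    def poll_completion(self):
        # On a reliable connection, completions drain in posting order.
        if self.posted:
            self.completed.append(self.posted.popleft())

    def ready_to_finalize(self):
        # Safe to finalize only once no work request is outstanding.
        return not self.posted

src = MigrationSource()
for wr in [1, 2, 3, 4, 5]:
    src.post_wr(wr)
while not src.ready_to_finalize():
    src.poll_completion()
assert src.completed == [1, 2, 3, 4, 5]
```

A weaker model, as the answer suggests, might finalize while some requests are still outstanding and reconcile later — that is the open design space being pointed at.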


Q: In the pre-copy phase, the physical data path between source and destination may go through switches and be vulnerable to congestion or switch failures, which could cause state inconsistency. How do you handle this?

A: You are right that this is an important real-world concern. A naive but valid solution is to roll back — either keep the service running on the migration source, or choose a different migration destination and retry. For network interruptions that are temporary, backing off and retrying at the same destination is also practical. Rolling back does have overhead, especially in AI data centers where GPU costs are high. More sophisticated solutions exist, though they go beyond the scope of this paper.


Q: When there are many in-flight packets during migration, is the overhead significant? Would the application see a drastic performance drop?

A: Yes, from the application perspective, there may be a noticeable performance drop during the blackout period before the migration is fully restored. That is a known cost of the pre-copy approach and is reflected in our evaluation results.


Q: What motivated you to work specifically on this problem?

A: Live migration originally started as one feature of a larger project. But as I thought more carefully about what makes RDMA live migration uniquely challenging — by reading many RDMA demos and understanding the full RDMA workflow — I realized it was substantial enough to stand as an independent research contribution. The specific challenges became clearer over time through that deep dive into RDMA’s mechanisms.


Q: We are working on network function migration between DPUs. Do you have any high-level insights from your RDMA migration experience that could apply?

A: This is not a very mature thought, but since the application runs on the DPU hardware itself, you need to work with the hardware provider, fetch all the application state from the source DPU, and then inject that state into the destination DPU. A key question is how to use PCIe bandwidth efficiently during this process. We can discuss further offline.


Q: Did you consider virtualization scenarios such as XEN or containers? Your paper doesn’t seem to discuss virtualization explicitly.

A: We can integrate our work with virtualized environments. I actually have a separate published work on RDMA virtualization for containers, and I attempted to integrate it with this migration work. There are no fundamental blockers, but it does require significant engineering effort. It could be a future ATC paper.


Q: What advice do you have for new students who want to get started in this area?

A: From my experience, the most important thing is to find experienced students or mentors who can guide you. My first paper took a very long time because I didn't know how to present innovation in a systems paper. It wasn't until my fifth year that a mentor at Microsoft taught me how to frame system contributions, and I published my first paper in my sixth year. So if you want a less painful PhD, find experienced people to work with early on.

Beyond that, innovation in systems research is harder to find than in areas like AI. It often doesn’t come from within the university — it comes from practical problems at companies like Huawei and Alibaba. I strongly recommend industry internships during your PhD and maintaining close connections with industry partners to stay grounded in real problems.