Paper: eTran: Extensible Kernel Transport with eBPF
Authors: Zhongjie Chen, Qingkai Meng, ChonLam Lao, Yifan Liu,
Fengyuan Ren, Minlan Yu, Yang Zhou
Presenter: Yao Wang, Xiamen University
Guest of Honor: Qingkai Meng, Nanjing University
Q: How easy is it to use eTran to implement a new transport protocol? Given that the eBPF code for DCTCP and Homa runs to approximately 7.5k lines, is it actually easier, or still fairly complex?
A: Making transport implementation easier is not our goal. Our goal is to offload transport logic into the kernel so that it benefits from advantages like security, protection, and resource multiplexing. Implementing DCTCP and Homa was not easy, especially Homa, because its congestion control algorithm is very complex. But now that basic transports like DCTCP and Homa are implemented, users can use them as templates for future customization. A TCP-variant congestion control is the easier case, since it only requires revising a few AIMD functions. For Homa-like congestion control, users can start from our Homa implementation.
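To make the "revise a few AIMD functions" remark concrete, here is a minimal sketch of a DCTCP-style window update in Python. It is not eTran's eBPF code: the names `DctcpState` and `on_window_end` are hypothetical, the window is counted in segments rather than bytes, and the starting values are arbitrary. The update rule itself (an EWMA of the ECN-marked fraction, then a cut proportional to it) follows the description in RFC 8257.

```python
from dataclasses import dataclass


@dataclass
class DctcpState:
    cwnd: float = 10.0     # congestion window in segments (arbitrary start)
    alpha: float = 0.0     # running estimate of the ECN-marked fraction
    g: float = 1.0 / 16    # EWMA gain; 1/16 is the value suggested in RFC 8257
    min_cwnd: float = 1.0  # floor so the window never collapses to zero


def on_window_end(s: DctcpState, acked: int, marked: int) -> DctcpState:
    """Update the state once per window of data (roughly once per RTT).

    acked  -- segments acknowledged in this window
    marked -- how many of those carried an ECN congestion mark
    """
    frac = marked / acked if acked else 0.0
    # Smooth the marked fraction so one noisy window does not dominate.
    s.alpha = (1 - s.g) * s.alpha + s.g * frac
    if marked > 0:
        # Multiplicative decrease, scaled by the extent of congestion.
        s.cwnd = max(s.min_cwnd, s.cwnd * (1 - s.alpha / 2))
    else:
        # Additive increase: one segment per window.
        s.cwnd += 1.0
    return s
```

A different TCP variant would mostly swap the two branches at the end (for example, halving the window on loss instead of scaling by `alpha`), which is the kind of localized change the answer describes.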
Q: What are the limitations of eTran? Which types of transport protocols would be extremely hard or impossible to implement in the current design?
A: eTran can implement a wide range of transports because we split all transport logic into two parts. The functionality we identified is common to all transports: segmentation, reassembly, pacing, rate limiting, congestion control, and loss recovery. Every transport needs these abstractions. We discuss this in the appendix of the paper, and protocols like HPCC can be implemented with eTran.
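As an illustration of one of the common building blocks listed above (pacing and rate limiting), the following is a generic token-bucket sketch in Python. It is an assumption-laden example, not eTran's mechanism: the class name `TokenBucket`, its parameters, and the choice of a token bucket at all are ours. Time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Generic token-bucket rate limiter (illustrative only)."""

    def __init__(self, rate_bps: float, burst_bytes: int):
        self.rate = rate_bps / 8.0        # refill rate in bytes per second
        self.burst = float(burst_bytes)   # maximum number of stored tokens
        self.tokens = float(burst_bytes)  # start with a full bucket
        self.last = 0.0                   # time of the last refill, in seconds

    def try_send(self, now: float, size: int) -> bool:
        """Return True and consume tokens if `size` bytes may be sent at `now`."""
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= size:
            self.tokens -= size
            return True
        return False  # caller should queue the packet and retry later
```

For example, with `TokenBucket(rate_bps=8000, burst_bytes=1500)` a 1500-byte packet is admitted immediately, and the next 1000-byte packet has to wait until a full second of tokens has accumulated.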
Q: What is the key difference between eTran and other user-space transport libraries such as Meta's mvfst?
A: The key difference is that in eTran the core transport logic lives in the kernel. Untrusted user-space libraries have no direct access to transport state, because that state is protected by the kernel. This gives eTran better security properties than user-space transport stacks. In addition, user-space transport libraries typically rely on polling and consume a lot of CPU, while eTran is interrupt-driven and much more CPU-efficient.
Q: Can eTran reuse upstream TCP congestion control, which already supports BPF-based congestion control through BPF struct_ops?
A: Yes, we can reuse upstream TCP congestion control in eTran, as long as we implement the congestion control logic in the hooks that eTran provides. But eTran can do more: it lets users reimplement all transport logic, including loss recovery and many other things, not just congestion control.
Q: Have you received feedback from the Linux kernel community on this extensible BPF approach?
A: We have been in contact with the kernel community to discuss the feasibility of merging eTran into the kernel. This is a very challenging process.
Q: Compared with Google's Snap (a user-space stack design), which approach is more practical for future clouds?
A: I have heard that Google no longer uses Snap. There are many trade-offs between these systems. A system like Snap needs a huge team to build and maintain, which is not feasible for every company or user. The kernel, by contrast, works out of the box, and everyone can use it directly. As far as I know, Google has not open-sourced Snap.
Q: Could the dual-code structure (user space and kernel space) make it difficult to integrate into Linux mainline due to consistency issues during updates?
A: This is a general problem for any system that is split between user space and the kernel. The key to solving it is to define a good interface between the two. If the interface is well defined, then even when we change the kernel transport, the user-space library does not need any code changes.
Q: Why do tenants in public clouds need to customize their transport protocols?
A: Network extensibility is very important because certain workloads need smarter, more advanced algorithms. For example, standard congestion control algorithms cannot handle incast traffic gracefully, but recent receiver-driven congestion control algorithms can. Users want to minimize application latency and tune congestion control to their specific traffic patterns in order to get high performance.
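The idea behind receiver-driven congestion control can be sketched in a few lines of Python: the receiver, which sees all incoming senders during an incast, decides who may send next by handing out grants. The example below is a simplified assumption, not Homa's or eTran's actual algorithm. The function name `next_grant`, the fixed grant chunk, and the "fewest remaining bytes first" policy are illustrative choices.

```python
from typing import Dict, Optional, Tuple


def next_grant(remaining: Dict[str, int], chunk: int) -> Optional[Tuple[str, int]]:
    """Pick the next sender to grant and how many bytes to grant it.

    remaining -- bytes each sender still has to deliver (updated in place)
    chunk     -- maximum number of bytes granted per decision

    Returns (sender, granted_bytes), or None when nothing is left to grant.
    """
    pending = {s: r for s, r in remaining.items() if r > 0}
    if not pending:
        return None
    # Favor the message closest to completion; break ties by sender name.
    sender = min(pending, key=lambda s: (pending[s], s))
    granted = min(chunk, pending[sender])
    remaining[sender] -= granted
    return sender, granted
```

Because only granted bytes enter the network, the receiver's downlink is never offered more traffic than it asked for, no matter how many senders are active at once.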
Q: How does eTran deal with lower-layer protocols such as IP? Can it reuse kernel capabilities like iptables?
A: That is an advantage of eTran: it can leverage existing kernel infrastructure. For example, eTran can use the kernel's routing tables directly. In its eBPF code, eTran calls BPF helper functions provided by the kernel to look up routes. There is no need for eTran to implement low-level protocols itself. User-space transport stacks, by contrast, must reimplement every layer of the networking stack.
Q: With multiple congestion control algorithms coexisting in a public cloud, will compatibility issues create challenges at the physical network switches?
A: This is outside the scope of eTran. Making different transports with different congestion control algorithms coexist is very challenging. They can coexist, but doing so may cause performance issues. For example, a paper called EQDS looks at how TCP and RDMA coexist and found performance gaps between TCP traffic and RDMA traffic.
Q: What does eTran sacrifice to gain such flexibility in kernel-style customization? What are the fundamental limitations?
A: The complexity of implementing a transport is one trade-off eTran makes. Writing eBPF code is not easy because of the many constraints imposed by the eBPF verifier, such as the instruction count limit and the ban on unbounded loops. Kernel modification is another trade-off. Initially we tried to implement everything without modifying the kernel, but found that very challenging and in some cases impossible, so we decided to introduce new hooks.
Q: Has eTran encountered any new safety verification challenges compared to existing verifiers?
A: This is a design goal of eTran: keep the kernel modifications minor and simple enough that their correctness is easy to verify. We reuse many data structures, input contexts, and BPF helper functions already implemented for XDP. Our XDP_GEN and XDP_EGRESS hooks have exactly the same input context as XDP and can call many of the same BPF helper functions. We did not add any complexity to the eBPF verifier to implement the new hooks. eTran is safe by design because programs attached to the new hooks are all checked by the existing verifier logic for XDP.