NetEdit: An Orchestration Platform for eBPF Network Functions at Scale

bang · July 30, 2024, 11:46am

Title: NetEdit: An Orchestration Platform for eBPF Network Functions at Scale

Authors: Theophilus A. Benson (Carnegie Mellon University); Prashanth Kannan, Prankur Gupta, Balasubramanian Madhavan, Kumar Saurabh Arora, Jie Meng, Martin Lau, Abhishek Dhamija, Rajiv Krishnamurthy, Srikanth Sundaresan, Neil Spring, Ying Zhang (Meta)

Scribe: Bang Huang (Xiamen University)

Introduction

This paper introduces NetEdit, an eBPF orchestration platform developed by Meta to address the challenges of optimizing the host networking stack across Meta’s data center fleet, which comprises over 20 data centers with varying hardware configurations and diverse network traffic. The central problem is how to manage and orchestrate this diverse environment efficiently while keeping performance optimal. Existing tools, such as user-space libraries, routing configurations, and kernel modules, fall short because of their limitations in granularity, control, and flexibility. Such a complex, high-stakes environment calls for a reliable, scalable, and efficient solution.

Key idea and contribution

NetEdit provides a set of abstractions allowing service owners and networking experts to collaboratively optimize network performance. The system leverages eBPF for its flexibility and control, enabling precise tuning of network parameters such as initial congestion windows, per-packet receiver windows, and other transport features. The platform also supports safe, reliable experimentation and deployment, offering high availability and minimal dependencies on the broader infrastructure.
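To make the idea of per-service tuning concrete, here is a minimal sketch of how a policy could map host attributes to transport parameters. It is not NetEdit's actual policy language or API: the names `DEFAULTS`, `POLICIES`, `resolve`, the attribute keys, and all parameter values are hypothetical and chosen only for illustration.

```python
# Hypothetical defaults; the keys and values are illustrative assumptions,
# not parameters taken from the paper.
DEFAULTS = {"initial_cwnd": 10, "rwnd_clamp": None}

# Hypothetical policies: each one matches on host/service attributes and
# overrides some tuning parameters. Later policies win over earlier ones.
POLICIES = [
    {"match": {"service": "cache"}, "set": {"initial_cwnd": 40}},
    {"match": {"service": "cache", "region": "regionA"},
     "set": {"rwnd_clamp": 65535}},
]


def resolve(host_attrs, policies=POLICIES, defaults=DEFAULTS):
    """Return the tuning parameters that apply to a host with these attributes."""
    params = dict(defaults)
    for policy in policies:
        # A policy applies only if every attribute it names matches the host.
        if all(host_attrs.get(k) == v for k, v in policy["match"].items()):
            params.update(policy["set"])
    return params


cache_params = resolve({"service": "cache", "region": "regionA"})
web_params = resolve({"service": "web"})
```

In this sketch a host running the hypothetical `cache` service in `regionA` picks up both overrides, while a host that matches no policy keeps the defaults.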

Similar to SDN, NetEdit’s design consists of three parts: a data plane, a control plane, and a testing framework for observability (refer to the original paper for the detailed design). The data plane uses eBPF to implement the network tuning features. These features are spread across multiple eBPF programs, which are deterministically ordered at each attachment point so that execution is consistent. The control plane handles policy management and orchestration. Policies are authored externally, expressed in a flexible configuration language, and stored in a central repository. The observability framework tracks both short-term and long-term system health and performance.
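The deterministic ordering of programs at an attachment point can be pictured with a small sketch. This models the idea only and is not how NetEdit implements it: the `Program` type, the `priority` field, the tie-break on name, and the feature names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    name: str          # hypothetical feature name
    attach_point: str  # e.g. "sockops" or "tc_egress"
    priority: int      # assumed convention: lower value runs first


def execution_order(programs, attach_point):
    """Return the programs for one attach point in a deterministic order."""
    selected = [p for p in programs if p.attach_point == attach_point]
    # Sorting on (priority, name) breaks ties the same way on every host,
    # regardless of the order in which the programs were registered.
    return sorted(selected, key=lambda p: (p.priority, p.name))


programs = [
    Program("rwnd_tuning", "sockops", 20),
    Program("initial_cwnd", "sockops", 10),
    Program("dscp_marking", "tc_egress", 10),
]
order = [p.name for p in execution_order(programs, "sockops")]
```

Because the order depends only on the sort key and not on registration order, two hosts that load the same set of programs end up running them in the same sequence.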

Evaluation

NetEdit’s performance was evaluated by comparing the network and service metrics with NetEdit enabled versus when it was disabled across an entire region.

Q&A

Q1: How much the system knows about eBPF internals, attach points, and version management.

A1: The responder thanks the community and mentions trying out ideas and upstreaming them. While the system knows about features to a certain extent, there are components that feature owners need to deploy themselves.

Q2: The composability of eBPF programs and who is responsible for ensuring they work well together in the future.

A2: We test these programs in advance through “dogfooding” (using our own products) to make sure they work well together.

Q3: The potential combinatorial problem with the increasing number of features.

A3: The responder acknowledges this but notes that the team currently keeps the number of features manageable and will rethink the approach in the future if needed.

Q4: Will the reduction frame test the feature and ensure it composes well with other features? Is this how new features are rolled out?

A4: We start with an idea and experiment with it. We use A/B testing with multiple sample features in a controlled environment. This deterministic approach helps identify potential issues and ensures that features work well together before they are rolled out to production.

Q5: Does the control plane of NetEdit interact as a content orchestration system to monitor and integrate with other orchestration systems?

A5: NetEdit is within Meta and integrates closely with their configuration management system. This setup allows NetEdit to understand the services running on it. However, it does not directly connect to ports, so if a service goes down, the tuning adjustments will not be directly affected.

Q6: During your work, have you needed or wished for a new hook in the kernel network stack that doesn’t currently exist?

A6: The responder says they have upstreamed the features they needed, with the help of their team. They mention building custom congestion control algorithms and a struct interface for managing and deploying these algorithms. It is an ongoing effort with various other improvements.

Q7: Do you foresee a future where the entire stack is programmable, allowing you to program anything across the entire stack?

A7: The responder acknowledges the possibility but emphasizes the associated costs of such flexibility. They highlight the trade-off between flexibility and efficiency, noting that per-packet programs are expensive. While they can already implement various functionalities and custom algorithms today, making the entire kernel customizable comes with its own set of challenges and costs.

Personal thoughts

The approach’s flexibility and control over network parameters are impressive, particularly in a large-scale and diverse environment like Meta’s data centers. However, eBPF, while powerful, also introduces risks, such as a bug affecting the entire system. The paper mentions the importance of rigorous testing and the need for further research in this area, which is crucial for ensuring the system’s reliability. Additionally, the interaction between different eBPF programs and the kernel presents a complex testing challenge. It would be interesting to explore more robust testing frameworks and methodologies to mitigate these risks.