Albatross: A Containerized Cloud Gateway Platform with FPGA-accelerated Packet-level Load Balancing

JinghuiJiang · September 9, 2025, 2:45pm

Title: Albatross: A Containerized Cloud Gateway Platform with FPGA-accelerated Packet-level Load Balancing

Authors: Jianyuan Lu (Alibaba Cloud), Shunmin Zhu (Hangzhou Feitian Cloud, Alibaba Cloud), Jun Liang (Alibaba Cloud), Yuxiang Lin (Alibaba Cloud), Tian Pan (Alibaba Cloud), Yisong Qiao (Alibaba Cloud), Yang Song (Alibaba Cloud), Wenqiang Su (Alibaba Cloud), Yixin Xie (Alibaba Cloud), Yanqiang Li (Alibaba Cloud), Enge Song (Alibaba Cloud), Shize Zhang (Alibaba Cloud), Xiaoqing Sun (Alibaba Cloud), Rong Wen (Alibaba Cloud), Xionglie Wei (Alibaba Cloud), Biao Lyu (Alibaba Cloud), Xing Li (Zhejiang University, Alibaba Cloud)

Introduction

Alibaba Cloud’s network virtualization architecture relied heavily on high-performance Tofino switching ASICs for its centralized gateways. The sudden discontinuation of Tofino chip development in January 2023 created a critical performance and supply-chain vacuum, forcing a search for a viable alternative. The problem matters because high-capacity, stable, and cost-effective gateways are essential for handling immense traffic surges for millions of tenants in a modern cloud environment. Existing solutions fell short: first-generation x86-based gateways suffered from single-core overload because RSS hashing pins every packet of a flow to one core, while other switching ASICs and DPU-based solutions had problems with programmability, compiler stability, or insufficient resources for gateway-scale routing, or introduced new operational complexities. To address these challenges, the authors developed Albatross, their third-generation cloud gateway based on FPGAs and x86 CPUs. Albatross delivers FPGA-based packet-level load balancing to prevent CPU core overload, implements a two-stage rate limiter for millions of tenants, and uses containerization with a BGP proxy to reduce the overhead caused by high-density deployments.
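To make the RSS problem concrete, here is a minimal Python sketch that compares flow-hash dispatch with per-packet spraying. It is an illustration only, not Albatross's implementation: CRC32 stands in for the NIC's real RSS hash, and the core count, flow tuples, and packet counts are made-up values.

```python
import zlib
from collections import Counter

NUM_CORES = 8  # hypothetical core count


def rss_core(flow, num_cores=NUM_CORES):
    """Flow-level dispatch: hash the 5-tuple, so a flow always lands on one core.
    CRC32 is only a stand-in for the hash a real NIC would use."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % num_cores


def spray_core(packet_index, num_cores=NUM_CORES):
    """Packet-level dispatch: round-robin over cores, ignoring the flow."""
    return packet_index % num_cores


def simulate(packets, mode):
    """Count how many packets each core receives under the given mode."""
    load = Counter()
    for i, flow in enumerate(packets):
        core = rss_core(flow) if mode == "rss" else spray_core(i)
        load[core] += 1
    return load


# One heavy-hitter flow (800 packets) plus 20 small flows (10 packets each).
heavy = ("10.0.0.1", "10.0.0.2", 6, 40000, 443)
mice = [("10.0.1.%d" % i, "10.0.0.2", 6, 50000 + i, 443) for i in range(20)]
packets = [heavy] * 800 + [m for m in mice for _ in range(10)]

rss_load = simulate(packets, "rss")
spray_load = simulate(packets, "spray")
print("RSS   max per-core load:", max(rss_load.values()))
print("Spray max per-core load:", max(spray_load.values()))
```

Under the flow hash, whichever core receives the heavy flow handles at least 800 of the 1000 packets, while spraying gives each of the 8 cores exactly 125. The cost of spraying is that packets of one flow may finish out of order, which is why a reordering step is needed afterwards.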

Key idea and contribution

The authors developed Albatross, a third-generation cloud gateway platform built on commodity hardware, specifically x86 CPUs and FPGAs, to fill the gap left by Tofino. The core idea is to use the FPGA not as a full offload engine but as a targeted accelerator for specific data-plane logic, namely load balancing and rate limiting. This keeps the main packet processing logic on the CPU, which maximizes code reuse from their battle-tested first-generation x86 gateway and shortens time-to-market.

Albatross’s key contributions are threefold. First, it introduces an FPGA-based packet-level load balancing (PLB) mechanism that sprays ingress traffic across all available CPU cores, preventing single-core overload from heavy-hitter flows. The system then reorders packets at the egress to maintain per-flow packet order. The authors describe this as the first industrial design that extends PLB to external processors across different chips. Second, to protect the entire gateway from being saturated by anomalous traffic, it implements a novel two-stage rate limiter on the FPGA that can enforce rate limits for millions of tenants using only 2MB of on-chip memory. Third, Albatross is the first cloud gateway deployed using containerization, virtualizing hardware resources to host multiple independent gateway instances on a single server. This improves resource utilization and is made practical by a BGP proxy that removes the control-plane overhead that high-density container deployment would otherwise place on uplink switches.
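The egress reordering step can be illustrated with a small sequence-number buffer. This is a sketch under assumptions, since the summary does not describe the actual mechanism: it assumes the ingress side tags each packet with a consecutive sequence number before spraying, and it ignores timeouts, so a packet dropped by a CPU core would stall the buffer here. The class and function names are hypothetical.

```python
class Reorderer:
    """Release packets strictly in sequence-number order.

    Assumes the ingress side stamped packets 0, 1, 2, ... before spraying
    them to CPU cores (an assumption for illustration).
    """

    def __init__(self):
        self.next_seq = 0   # next sequence number allowed to leave
        self.pending = {}   # out-of-order packets waiting, keyed by seq

    def push(self, seq, packet):
        """Accept a packet returning from a core; return those now releasable."""
        self.pending[seq] = packet
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released


def egress_order(arrivals):
    """Feed (seq, packet) pairs in arrival order; return the emitted order."""
    reorderer = Reorderer()
    emitted = []
    for seq, packet in arrivals:
        emitted.extend(reorderer.push(seq, packet))
    return emitted


# Cores finished out of order: packet 2 came back first.
print(egress_order([(2, "c"), (0, "a"), (1, "b"), (3, "d")]))
```

A hardware design would also need a bounded buffer and a timer that skips missing sequence numbers; both are left out to keep the idea visible.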

Evaluation

The evaluation shows that a single Albatross node achieves a throughput of 80-120 Mpps with an average latency of around 20 µs. In controlled tests, PLB mitigates single-core overload from heavy-hitter flows by evenly distributing traffic that would otherwise saturate a single core under RSS, and it improves P99 latency when gateway load exceeds 75%. The two-stage rate limiter protects benign tenants by selectively throttling a dominant tenant’s traffic surge in the NIC pipeline, which prevents indiscriminate packet drops at the CPU. These results matter because they offer cloud providers grappling with supply-chain dependencies a proven, cost-effective, and resilient architectural path. They show that a hybrid FPGA-CPU approach can resolve critical performance bottlenecks in software gateways, while containerization reduces infrastructure costs for a new availability zone by 50%.
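The selective throttling described above can be sketched with token buckets. The summary does not say what the two stages actually are, so the arrangement below is purely hypothetical and is not the authors' design: a small table of exact per-tenant meters for tenants flagged as heavy, backed by coarse meters shared by tenants that hash to the same slot. All rates, bursts, and tenant IDs are invented, and time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Classic token bucket; one token admits one packet."""

    def __init__(self, rate, burst):
        self.rate = rate      # tokens added per second
        self.burst = burst    # maximum stored tokens
        self.tokens = burst
        self.last = 0.0       # time of the previous update, in seconds

    def allow(self, now):
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class TwoTierLimiter:
    """Hypothetical two-tier limiter: exact meters for flagged tenants,
    shared hashed meters for everyone else."""

    def __init__(self, num_buckets, coarse_rate, coarse_burst):
        self.coarse = [TokenBucket(coarse_rate, coarse_burst)
                       for _ in range(num_buckets)]
        self.precise = {}     # tenant_id -> dedicated TokenBucket

    def promote(self, tenant_id, rate, burst):
        """Give a heavy tenant its own meter (e.g. installed by a control plane)."""
        self.precise[tenant_id] = TokenBucket(rate, burst)

    def allow(self, tenant_id, now):
        meter = self.precise.get(tenant_id)
        if meter is not None:
            return meter.allow(now)
        return self.coarse[tenant_id % len(self.coarse)].allow(now)


limiter = TwoTierLimiter(num_buckets=4, coarse_rate=100, coarse_burst=100)
limiter.promote(tenant_id=8, rate=10, burst=10)   # tenant 8 is the heavy one

# Tenants 8 and 4 hash to the same coarse slot, but only tenant 8 is throttled.
heavy_passed = sum(limiter.allow(8, now=0.0) for _ in range(50))
benign_passed = sum(limiter.allow(4, now=0.0) for _ in range(50))
print(heavy_passed, benign_passed)
```

In this run the heavy tenant gets only its 10-packet burst through, while the benign tenant sharing its hash slot passes all 50 packets, which mirrors the "throttle the dominant tenant, spare the rest" behavior reported in the evaluation.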

Q&A

Q1: “How do you compare the complexity of writing software as opposed to writing it on Albatross? From a programming point of view, would it be easier to write software or easier to write software on Albatross?”

A1: The speaker explained that before Albatross, setting up a new gateway cluster took days. With Albatross, they provision redundant servers in advance, so when a traffic burst arrives they can bring up a new gateway pod in seconds. The speaker considers this a “big improvement” in elasticity.

Q2: What are the hardware specifications regarding the number of FPGAs and their speed?

A2: The platform utilizes four FPGAs, each with a port speed of 200 Gbps.

Personal thoughts

A key strength of the paper is its demonstration of a pragmatic engineering approach to a real-world crisis. Instead of seeking a perfect replacement for Tofino, the authors designed a hybrid system that balances performance and practicality by reusing existing, validated code.

However, the paper acknowledges a significant limitation: a substantial performance regression compared to the previous Tofino-based gateway, Sailfish. While the authors justify this by targeting the majority “throughput-insensitive” market and planning future upgrades, the trade-off highlights the challenge of replacing specialized ASICs. The work also raises questions about the long-term operational complexity of an FPGA-based pipeline, since FPGAs are known to be difficult to program and debug. Finally, while the paper touches on the challenges of supporting stateful network functions within the PLB architecture, a deeper exploration of this area is left for future work.