A Decentralized SDN Architecture for the WAN

chengjin.zhou · July 30, 2024, 11:43am

Title: A Decentralized SDN Architecture for the WAN

Authors: Alexander Krentsel (Google / UC Berkeley); Nitika Saran (Cornell); Bikash Koley, Subhasree Mandal, Ashok Narayanan (Google); Sylvia Ratnasamy (Google / UC Berkeley); Ali Al-Shabibi, Anees Shaikh, Rob Shakir, Ankit Singla (Google); Hakim Weatherspoon (Cornell)

Scribe: Chengjin Zhou (Nankai University)

Introduction:
The paper addresses the issue of network complexity in wide-area networks (WANs), which leads to frequent outages and major disruptions. Despite significant investments in improving WAN reliability, failures remain common due to the intricate interactions among diverse network components and teams. The authors propose simplifying the network architecture by eliminating unnecessary components, focusing on reducing points of failure, and improving reliability.

Key Idea and Contribution:
The paper proposes dSDN, a novel decentralized Software-Defined Networking (SDN) architecture. The key idea is to remove the reliance on external SDN control planes by integrating control functions directly within routers. This is achieved through an operator-defined control layer that runs on routers. This approach eliminates the need for fallback legacy protocols, simplifying the network design and reducing points of failure.

In the dSDN architecture, every router runs an operator-defined dSDN controller. Each dSDN controller constructs a global network view via a simple flooding-based dissemination protocol and then locally runs a traffic engineering (TE) algorithm to compute capacity-aware paths. When a packet enters the network, the ingress router records the TE-computed path in the packet’s header, and all other routers along the path simply enforce the source route.
dSDN significantly simplifies the control plane infrastructure while retaining the benefits of SDN. The authors show that it can be practically realized on current routers and that it significantly outperforms centralized SDN.
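
To make the three steps above concrete (flood link state, compute paths locally, source-route at the ingress), here is a minimal Python sketch. It is not the authors' implementation: the topology, the capacities, and the function names `flood_link_state`, `compute_path`, and `forward` are all hypothetical, and the TE algorithm is replaced by a much simpler stand-in (a fewest-hop search over links that have enough capacity for the demand).

```python
import heapq
from collections import deque

# Hypothetical toy topology: (node_a, node_b) -> available capacity in Gbps.
LINKS = {("A", "B"): 10, ("B", "D"): 10, ("A", "C"): 100, ("C", "D"): 100}


def neighbors(node):
    """Return the routers directly attached to `node`."""
    return [b if a == node else a for (a, b) in LINKS if node in (a, b)]


def flood_link_state():
    """Flood link state until every router holds the full network view."""
    routers = {n for link in LINKS for n in link}
    # Each router starts out knowing only its own attached links.
    views = {r: {l: c for l, c in LINKS.items() if r in l} for r in routers}
    queue = deque((r, dict(views[r])) for r in routers)
    while queue:
        sender, adverts = queue.popleft()
        for peer in neighbors(sender):
            new = {l: c for l, c in adverts.items() if l not in views[peer]}
            if new:  # forward only unseen state, so flooding terminates
                views[peer].update(new)
                queue.append((peer, new))
    return views


def compute_path(view, src, dst, demand_gbps):
    """Stand-in for TE: fewest-hop path over links with enough capacity."""
    heap = [(0, src, [src])]
    seen = set()
    while heap:
        hops, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in seen:
            continue
        seen.add(node)
        for (a, b), cap in view.items():
            if node in (a, b) and cap >= demand_gbps:
                nxt = b if a == node else a
                if nxt not in seen:
                    heapq.heappush(heap, (hops + 1, nxt, path + [nxt]))
    return None  # no path can carry the demand


def forward(packet):
    """Transit routers follow the hop list in the header; no route lookup."""
    return list(packet["route"])


views = flood_link_state()
# The ingress router A computes the path from its own local view and
# writes it into the packet header as a source route.
packet = {"route": compute_path(views["A"], "A", "D", demand_gbps=40)}
print(forward(packet))  # ['A', 'C', 'D']
```

In this toy example the 10 Gbps links via B cannot carry the 40 Gbps demand, so the ingress picks the path via C. Only the ingress makes a routing decision; the other routers just follow the header.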

Evaluation:

The evaluation focuses on how dSDN’s routing compares to traditional centralized SDN (cSDN). The authors evaluate dSDN’s convergence time (i.e., the time from when a network event occurs to when new routes reflecting the event are installed at all routers) on the production network B4. Convergence time consists of three components: propagation time, computation time, and programming time. Summed across these components, dSDN converges 120-150x faster than cSDN. This result is significant because it indicates that dSDN achieves its simplification at no cost to routing performance and in fact outperforms cSDN on convergence.
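
The decomposition above can be written compactly as follows. The symbols are notation introduced here for illustration and are not taken from the paper; the ratio restates the reported 120-150x figure.

```latex
T_{\text{conv}} = T_{\text{prop}} + T_{\text{comp}} + T_{\text{prog}},
\qquad
120 \le \frac{T_{\text{conv}}^{\text{cSDN}}}{T_{\text{conv}}^{\text{dSDN}}} \le 150
```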

Personal Opinion:
The paper presents a compelling argument for simplifying WAN control architectures by decentralizing SDN control logic. The proposed dSDN architecture offers a novel solution to reduce complexity while maintaining the benefits of SDN. One open question that arises from this work is how dSDN can adapt to dynamic network conditions and scale to larger, more complex WAN deployments. Overall, the paper provides valuable insights into rethinking SDN infrastructure for improved WAN performance and reliability.

Q1: How often does the demand change in your network?
A1: This is effectively a deployment question of how often you want to update your demand. Operationally, one might naively decide to update any time demand changes. However, since demand is continuous, it is constantly changing. In reality, you choose a cutoff point for when to update. Different routers may have inconsistent views of demand, but this might not cause significant issues because, with source routing, differences in views won’t result in incomplete paths. Discrepancies might lead to paths overlapping unexpectedly from the perspective of two different controllers, but traffic is always delivered. Our convergence impact evaluation shows that we perform much better than centralized SDN (cSDN). cSDN faces a similar problem because, despite computing on a single view, it takes a long time to program all paths, leading to old and new paths coexisting. However, because our system converges faster, we experience a much lower impact.
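
One way to picture the "cutoff point" mentioned in this answer is a threshold on relative change, sketched below in Python. This is only an illustration: the function name `should_update`, the relative-change rule, the 10% cutoff, and the sample values are all assumptions, not details from the talk or the paper.

```python
def should_update(last_advertised, current, cutoff=0.10):
    """Return True when demand has moved by more than `cutoff` (relative)."""
    if last_advertised == 0:
        return current > 0
    return abs(current - last_advertised) / last_advertised > cutoff


def advertised_updates(samples, cutoff=0.10):
    """Return the demand values that would actually trigger an update."""
    advertised = samples[0]  # assume the first sample was already advertised
    updates = []
    for demand in samples[1:]:
        if should_update(advertised, demand, cutoff):
            advertised = demand
            updates.append(demand)
    return updates


# Hypothetical demand samples (Gbps): small fluctuations are suppressed.
print(advertised_updates([100, 104, 109, 112, 111, 95]))  # [112, 95]
```

A larger cutoff means fewer updates but staler demand views at each router, which is the trade-off the answer describes.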

Q2: What is the overhead of running such a controller on a router?
A2: Even when we limit the controller to just a portion of the CPUs available on the router, we still achieve efficient performance. Traffic engineering is the main compute-intensive task, yet it still runs within seconds, even when restricted to a few CPUs.

Q3: From a routing perspective, it seems better to explore decentralization. Do you see any functionalities that still need centralized control?
A3: One challenge might be scenarios involving external dependencies, such as labeling specific traffic or relative utilities, which might require maintaining some centralized control. However, you can still manage this on the router. As the operator, you implement the policy on the router, which means that whatever policy you applied globally can still be implemented locally on the router.