Title: Canal Mesh: A Cloud-Scale Sidecar-Free Multi-Tenant Service Mesh Architecture
Authors: Enge Song (Alibaba Cloud), Yang Song (Alibaba Cloud), Chengyun Lu (Alibaba Cloud), Tian Pan (Alibaba Cloud), Shaokai Zhang (Alibaba Cloud), Jianyuan Lu (Alibaba Cloud), Jiangu Zhao (Alibaba Cloud), Xining Wang (Alibaba Cloud), Xiaomin Wu (Alibaba Cloud), Minglan Gao (Alibaba Cloud), Zongquan Li (Alibaba Cloud), Ziyang Fang (Alibaba Cloud), Biao Lyu (Alibaba Cloud, Zhejiang University), Pengyu Zhang (Alibaba Cloud), Rong Wen (Alibaba Cloud), Li Yi (Alibaba Cloud), Zhigang Zong (Alibaba Cloud), Shunmin Zhu (Alibaba Cloud, Tsinghua University)
Scribe: Kexin Yu (Xiamen University)
Introduction:
Service mesh frameworks have become increasingly popular in the realm of microservices, particularly due to their ability to facilitate service-to-service communication. A key component of these frameworks is the sidecar proxy, which is deployed within each Kubernetes (K8s) pod to manage network traffic. However, extensive deployment of sidecars has led to a variety of challenges, particularly for large-scale cloud environments. This paper presents Canal Mesh, a novel multi-tenant service mesh architecture that eliminates the need for sidecars, addressing the limitations observed in existing solutions such as Istio and Ambient. The authors highlight several critical issues associated with per-pod sidecars, including their intrusive nature, significant resource consumption, increased latency, and substantial orchestration overhead.
Key idea and contribution:
The two key design principles of Canal Mesh are: deploying proxies outside the cluster to achieve non-intrusive service mesh and sharing proxies at the service and cluster level to optimize resource consumption and reduce control plane overhead. The authors evaluated the performance, resource consumption, and control plane overhead of Canal Mesh in comparison to Istio and Ambient.
Functional Equivalence via On-node Proxy: A lightweight proxy is deployed on each user node and handles basic security and observability features, while the remaining features are deployed remotely. eBPF is used instead of iptables to improve the proxy's traffic redirection performance, and the compute-intensive asymmetric cryptography required for zero-trust networking is offloaded to dedicated hardware.
Multi-pronged Approach for High Availability: Multiple backends are deployed per service within a single availability zone (AZ) to ensure service availability; backends are also deployed across different AZs to handle failures of an entire AZ. Shuffle sharding is used to minimize the overlap of backends for different services, reducing the blast radius in case of proxy failure. A multi-indicator monitoring system and rapid response mechanisms are built to handle abnormal traffic spikes.
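The shuffle sharding idea mentioned above can be illustrated with a minimal sketch. This is not Canal's implementation: the backend names, the shard size, and the helper functions assign_shard and shared_backends are hypothetical, and the hash-seeded random choice is just one simple way to give each service its own small subset of backends.

```python
import hashlib
import random


def assign_shard(service, backends, shard_size):
    """Pick a deterministic pseudo-random subset of backends for a service.

    Hypothetical helper: the service name seeds the choice, so the same
    service always maps to the same shard.
    """
    digest = hashlib.sha256(service.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return sorted(rng.sample(backends, shard_size))


def shared_backends(shard_a, shard_b):
    """Backends that two services have in common (their shared failure domain)."""
    return sorted(set(shard_a) & set(shard_b))


# Assumed pool of eight proxy backends and three tenant services.
backends = [f"proxy-{i}" for i in range(8)]
shards = {svc: assign_shard(svc, backends, 2) for svc in ["svc-a", "svc-b", "svc-c"]}

for svc, shard in shards.items():
    print(svc, shard)
```

With 8 backends and shards of size 2 there are 28 possible shards, so two services are unlikely to share both backends, and a failing backend affects only the services whose shard contains it.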
Precise Scaling with Root Cause Analysis: Utilization data is collected from each proxy backend, allowing real-time monitoring of load fluctuations for the top services on each backend. This enables pinpointing the specific services causing the backend load increase and performing targeted scaling.
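As a rough sketch of the targeted-scaling step described above, the snippet below ranks the services on one backend by how much their load grew between two samples and selects only the ones above a threshold. The data layout, the function name top_contributors, the sample numbers, and the threshold are all assumptions for illustration; the paper's actual metrics and scaling policy are not reproduced here.

```python
def top_contributors(previous, current, top_n=3):
    """Rank services on one backend by load increase between two samples.

    previous and current map service name to a load figure (for example,
    percent CPU on the backend). Only services whose load grew are returned.
    """
    services = set(previous) | set(current)
    deltas = {s: current.get(s, 0.0) - previous.get(s, 0.0) for s in services}
    ranked = sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)
    return [(svc, delta) for svc, delta in ranked[:top_n] if delta > 0]


# Hypothetical per-service load samples for a single proxy backend.
previous = {"svc-a": 10.0, "svc-b": 20.0, "svc-c": 5.0}
current = {"svc-a": 12.0, "svc-b": 55.0, "svc-c": 4.0}

top = top_contributors(previous, current)

# Scale only the services responsible for most of the increase
# (the threshold of 10.0 is an assumed value).
SCALE_THRESHOLD = 10.0
services_to_scale = [svc for svc, delta in top if delta >= SCALE_THRESHOLD]
print(services_to_scale)
```

In this example only svc-b would be scaled, rather than adding capacity for every service that happens to share the backend.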
LB Disaggregation and Session Aggregation: The load balancer is broken down into its essential functions (load distribution and session maintenance) and integrated separately into the existing infrastructure, leveraging cloud infrastructure efficiently. Session aggregation via tunneling is proposed to alleviate the session state maintenance burden on memory-constrained SmartNICs.
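The effect of session aggregation can be sketched by counting session-table entries with and without tunneling. This is a simplified model under stated assumptions, not the paper's data path: the Flow record, the node and backend names, and the choice of (source node, destination backend) as the tunnel key are all hypothetical.

```python
from collections import namedtuple

# Hypothetical flow record: one application connection through the mesh.
Flow = namedtuple("Flow", "src_node src_port dst_backend dst_port")


def session_entries(flows, aggregate):
    """Return the set of session keys a device would need to track.

    Without aggregation every connection needs its own entry. With
    aggregation, all connections between the same node and backend are
    assumed to travel inside one tunnel, so one entry covers them all.
    """
    if aggregate:
        return {(f.src_node, f.dst_backend) for f in flows}
    return {(f.src_node, f.src_port, f.dst_backend, f.dst_port) for f in flows}


# 100 connections from node-1 and 50 from node-2, all to the same backend.
flows = [Flow("node-1", 40000 + i, "backend-1", 443) for i in range(100)]
flows += [Flow("node-2", 50000 + i, "backend-1", 443) for i in range(50)]

per_flow = len(session_entries(flows, aggregate=False))
aggregated = len(session_entries(flows, aggregate=True))
print(per_flow, aggregated)
```

Under these assumptions the session table shrinks from one entry per connection to one entry per tunnel, which is the kind of saving that matters on a device with little memory.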
Evaluation:
Canal Mesh addresses the intrusiveness, performance, resource consumption, and orchestration overhead problems of Istio and Ambient by deploying service mesh functionality remotely in the public cloud. Specifically, Canal’s throughput is 12.3x and 2.3x higher than that of Istio and Ambient, respectively, and its latency is 1.7x and 1.3x lower. Canal’s CPU consumption is 12x to 19x lower than Istio’s and 4.6x to 7.2x lower than Ambient’s. Completing the configuration for creating hundreds of pods is 1.5x to 2.1x faster with Canal than with Istio and 1.2x to 1.5x faster than with Ambient. Canal’s bandwidth usage is 9.8x and 4.6x lower than that of Istio and Ambient, respectively. Furthermore, Canal is a production-proven system that solves challenges around remote deployment, service integration, and multi-tenancy, achieving low cost, high availability, and elasticity.
Questions and opinions:
Question 1:
You are now reducing the per-node sidecar to something very simple, right? So can you now make it eBPF-based, since you have eliminated the worry about how complex it has to be?
Answer 1:
This is not in the slides, but we have already achieved it in our system. We use eBPF to redirect traffic from the application to the on-node proxy, which improves performance by nearly 30%.
Question 2:
What additional benefit over the sidecar do you bring to your customers? How many customers are willing to migrate to your solution, and how many have already migrated to Canal?
Answer 2:
There are many benefits; for example, resource consumption is greatly reduced. We don’t inject a heavy proxy alongside the user’s code, so the resources customers buy can be used entirely to run their applications.
Personal thoughts:
Canal Mesh proposes an innovative solution that effectively addresses the problems of traditional service mesh architectures. Through remote proxy deployment and a multi-tenant design, it reduces intrusion into user clusters, lowers resource consumption, and improves performance and control plane efficiency. However, stronger evidence is needed on whether remote deployment affects the system’s correctness.