EP17: Learning Production-Optimized Congestion Control Selection for Alibaba Cloud CDN, Aug. 27, 2025

letian_zhu · September 4, 2025, 8:17am

Paper: Learning Production-Optimized Congestion Control Selection for Alibaba Cloud CDN
Authors: Xuan Zeng, Haoran Xu, Chen Chen, Xumiao Zhang, Xiaoxi Zhang, Xu Chen, Guihai Chen, Yubing Qiu, Yiping Zhang, Chong Hao, Ennan Zhai
Presenter: Jianheng Xiao, The Chinese University of Hong Kong
Guest of Honor: Xuan Zeng, Alibaba Cloud

letian_zhu · September 4, 2025, 8:41am

Q: What analysis did you use to compute the information gain from different features, such as network type?

A: We ran large-scale measurements for two weeks, collecting rebuffer rates. For each network prefix, we ran A/B tests comparing rebuffer rates under Cubic and BBR, then labeled each IP prefix with the winner. Using this labeled dataset, we computed the standard information gain used in classification for each feature.
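
As a rough illustration of the computation described in this answer, here is a minimal sketch of information gain over winner-labeled prefixes. The field names (`network_type`, `province`, `winner`), the tie-breaking rule, and all sample values are assumptions made up for the example; they are not taken from the paper or from Alibaba's dataset.

```python
from collections import Counter
from math import log2


def label_winner(cubic_rebuffer, bbr_rebuffer):
    """Label a prefix with the CCA that had the lower rebuffer rate.

    Ties go to BBR here; that choice is arbitrary and only for illustration.
    """
    return "cubic" if cubic_rebuffer < bbr_rebuffer else "bbr"


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, label="winner"):
    """Entropy of the label minus its weighted entropy within each feature value."""
    base = entropy([r[label] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[feature], []).append(r[label])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical per-prefix A/B results: (network type, province, Cubic rate, BBR rate).
measurements = [
    ("wifi", "A", 0.020, 0.010),
    ("wifi", "B", 0.030, 0.015),
    ("4g", "A", 0.010, 0.020),
    ("4g", "B", 0.012, 0.025),
]
rows = [
    {"network_type": net, "province": prov, "winner": label_winner(cubic, bbr)}
    for net, prov, cubic, bbr in measurements
]

gain_network_type = information_gain(rows, "network_type")
gain_province = information_gain(rows, "province")
```

In this toy data the winner is fully determined by network type and independent of province, so `gain_network_type` comes out at 1 bit and `gain_province` at 0. Real measurements would fall somewhere in between.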

Q: Why is CCA performance so regional? Also, network conditions change over time, which means different CCAs’ performance changes too.

A: Operators differ by region and apply different policies from province to province. This creates performance differences between CCAs, even in nearby regions. Networks are dynamic and network types change over time, but the main factor affecting rebuffer rates correlates strongly with network type, which allows reliable CCA prediction.

Q: AliCCS uses IP prefixes as input, but the same prefix doesn’t necessarily mean the same network type. Why does this work?

A: Even with complex networks behind the same /24 IP prefix, we observed sufficient uniformity for hours at a time. A /24 prefix stays mapped to the same network because ISPs do not reassign prefixes frequently. This allows us to resolve hidden state problems.
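
To make the prefix-keyed idea concrete, here is a minimal sketch that maps a client IPv4 address to its /24 prefix and looks up a per-prefix CCA choice. The table `cca_by_prefix`, the function names, and the fallback default are hypothetical; the session did not describe how AliCCS stores or applies its decisions.

```python
import ipaddress


def prefix24(client_ip: str) -> str:
    """Return the /24 network that contains an IPv4 address, as a string."""
    # strict=False lets ip_network() accept an address with host bits set.
    return str(ipaddress.ip_network(f"{client_ip}/24", strict=False))


# Hypothetical per-prefix decisions (documentation address ranges).
# In a real system these would come from the selection model.
cca_by_prefix = {
    "203.0.113.0/24": "bbr",
    "198.51.100.0/24": "cubic",
}


def select_cca(client_ip: str, default: str = "cubic") -> str:
    """Look up the CCA chosen for the client's /24, falling back to a default."""
    return cca_by_prefix.get(prefix24(client_ip), default)
```

For example, `select_cca("203.0.113.57")` resolves to the entry for `203.0.113.0/24`, while an address in an unknown prefix gets the default. On Linux, a server could then apply the choice per connection through the `TCP_CONGESTION` socket option, though whether AliCCS does it that way was not stated.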

Q: Does this approach introduce new attack surfaces? Could an adversary inject traffic bursts to influence CCA selection?

A: Yes. An adversary who deliberately shifts the distribution of TCP statistics could influence AliCCS decisions. However, our GAN-based model learns a unified representation across network conditions. An attack might affect individual /24 prefixes, but the unified global representation limits the impact and does not push CDN nodes into worst-case performance.

Q: Would AliCCS cause fairness issues with other network traffic?

A: In mainland China, networks carrying short video traffic have sufficient bandwidth. We see minimal congestion or competition between flows using different CCS strategies. Our measurements comparing scenarios with and without AliCCS showed no significant difference in background traffic from other Alibaba services.

Q: If bandwidth is abundant, why would congestion control be triggered? How does this affect the quality of experience?

A: Short-video transfers finish in under one second because each one carries little data. The issue isn't bandwidth or congestion but rate limiting, which makes the congestion controller's sending rate fluctuate. Even with good average throughput, those fluctuations cause rebuffer events. Dynamic CCA selection reduces the fluctuations and so lowers rebuffer rates.

Q: What fundamentally hurts short video QoE? Is the slow start too slow or the multiplicative decrease too aggressive?

A: The problem is congestion window reduction triggered by rate limiting. Even without congestion, hitting the ISP's rate limit during slow start causes a window reduction and a pause in packet reception. Cubic probes bandwidth aggressively, growing the window until it hits the rate limit and then cutting it multiplicatively. BBR estimates bandwidth and RTT to find a suitable window size, which avoids these fluctuations despite packet loss.

Q: What would happen if everyone used AliCCS?

A: CCS itself changes the environment, which in turn affects future CCS decisions. At global scale, across all Alibaba Cloud traffic, competing flows and congestion become more important. Our model, which relies on network-type association, would then be insufficient and would need dynamic updates through online learning to track changing conditions.

Q: Did you try learning-based congestion control algorithms before designing AliCCS?

A: We tried learning-based CCAs from recent SIGCOMM papers, including reinforcement learning approaches. They showed promise in some conditions, but global-scale testing across Chinese regions revealed poor worst-case performance. Additionally, as a CDN provider, we lack client-side signals and real-time rebuffer feedback because of limits on our collaboration with service providers.

Q: Why focus on CDN? Would AliCCS work for other Alibaba Cloud services?

A: Short video services generate 80-90% of Alibaba Cloud CDN profits, making them our focus for cost reduction. It’s also a collaboration issue - sometimes we can access needed information, sometimes organizational barriers prevent it, so we optimize with available data.

Q: What’s the performance gap between directly provided versus inferred network types?

A: Initially, we asked service providers for network information, but only 10% of requests could provide it due to software limitations. Client-side network type detection is also unreliable because of Android/iOS issues. When we compared directly provided information against our model's inference, the model performed better because it leverages multiple dimensions such as congestion window and RTT.

Q: Could you make CCA choices based on packet loss and bandwidth fluctuation distributions rather than network type?

A: We tried this approach early on. Training a classification model on packet loss and bandwidth fluctuations requires labeled winner data, and we lack sufficient support to obtain it. Service providers share network type information but not rebuffer rates or service quality metrics, which they treat as business secrets. Network type information therefore gives us a richer dataset for training.