Turbo: Efficient Communication Framework for Large-scale Data Processing Cluster

Authors: Xuya Jia (Tencent); Zhiyi Yao, Chao Peng (Fudan University); Zihao Zhao, Bin Lei, Edison Liu (NVIDIA); Xiang Li, Zekun He, Yachen Wang, Xianneng Zou, Chongqing Zhao, Jinhui Chu (Tencent); Jilong Wang (Tsinghua University); Congcong Miao (Tencent)

Speaker: Xuya Jia (Tencent)

Scribe: Jinghui Jiang (Xiamen University)

Introduction
Big data processing clusters suffer from long job completion times due to inefficient utilization of RDMA. Measurements from a large-scale cluster with hundreds of server nodes processing large-scale jobs show that the existing RDMA deployment leads to a long tail in job completion time, with some jobs taking more than twice the average time to complete. In this paper, the authors present the design and implementation of Turbo, an efficient communication framework for large-scale data processing clusters that achieves high performance and scalability. The core of Turbo’s approach is to leverage a dynamic block-level flowlet transmission mechanism and a non-blocking communication middleware to improve network throughput and enhance the system’s scalability. Furthermore, Turbo ensures high system reliability by utilizing an external shuffle service, with TCP serving as a backup.

Key idea and contribution
Turbo’s key contributions are divided into two parts. The first is a dynamic block-level flowlet transmission mechanism. In previous work, the RNIC maintains a queue pair (QP) for each data transmission flow. In big data scenarios, however, the RNIC’s limited hardware memory makes it difficult to maintain enough QPs, resulting in performance degradation. Turbo addresses this problem with a dynamic block-level flowlet transmission mechanism: it divides the data blocks of a flow into multiple flowlets and integrates dynamic source port selection to change the transmission path. In addition, Turbo designs a half-handshake connection mechanism, which starts transmitting data as soon as the receiver’s ACK arrives, to reduce the time spent waiting for ACKs in a full-handshake connection and further reduce the RNIC resources consumed by QP connections.
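The flowlet mechanism described above can be sketched as follows. This is a minimal illustrative sketch, not Turbo’s actual implementation: the flowlet size, the port pool, and the hashing scheme are all assumptions. The idea is that each flowlet of a flow can carry a different source port, so ECMP in the network fabric may hash different flowlets of the same flow onto different paths.

```python
import hashlib

FLOWLET_SIZE = 1 << 20                        # 1 MiB per flowlet (assumed size)
SOURCE_PORTS = [49152 + i for i in range(8)]  # candidate source ports (assumed pool)

def split_into_flowlets(block: bytes, flowlet_size: int = FLOWLET_SIZE):
    """Divide one data block into fixed-size flowlets."""
    return [block[i:i + flowlet_size]
            for i in range(0, len(block), flowlet_size)]

def pick_source_port(flow_id: int, flowlet_idx: int) -> int:
    """Pick a source port per flowlet, so the fabric's ECMP hash can
    steer each flowlet onto a (potentially) different path."""
    h = hashlib.sha256(f"{flow_id}:{flowlet_idx}".encode()).digest()
    return SOURCE_PORTS[h[0] % len(SOURCE_PORTS)]

# Example: a 3.5 MiB block becomes 4 flowlets, each with its own source port.
block = bytes(3 * FLOWLET_SIZE + FLOWLET_SIZE // 2)
flowlets = split_into_flowlets(block)
plan = [(idx, pick_source_port(flow_id=7, flowlet_idx=idx))
        for idx in range(len(flowlets))]
```

The key design point is that the path decision is made per flowlet rather than per flow, so a single large shuffle flow does not stay pinned to one congested path.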

The second part is a non-blocking communication middleware. In previous work, blocked execution and serial data transmission in the communication middleware during the shuffle process resulted in poor performance. However, transmitting data without blocking is challenging because each worker is tightly bound to an executor. To address this challenge, Turbo designs a non-blocking communication middleware. The middleware first decouples the worker from the executor to enable parallelism between computation and communication. It then allocates executor tasks to relatively lightly loaded workers through a gate mechanism to achieve load balancing.
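The gate mechanism described above can be sketched as a least-loaded dispatcher. This is a hypothetical illustration, assuming the load metric is simply the number of outstanding tasks per worker; Turbo’s real gate may use a different metric.

```python
class Gate:
    """Illustrative gate: dispatch each task to the currently
    least-loaded worker (all names here are hypothetical)."""

    def __init__(self, workers):
        # Outstanding-task count per worker.
        self.load = {w: 0 for w in workers}

    def dispatch(self, task_id):
        # Choose the worker with the fewest outstanding tasks.
        worker = min(self.load, key=self.load.get)
        self.load[worker] += 1
        return worker

    def complete(self, worker):
        # Called when a worker finishes a task.
        self.load[worker] -= 1

gate = Gate(["w0", "w1", "w2"])
assignments = [gate.dispatch(t) for t in range(6)]
# With equal task costs, the 6 tasks spread evenly: 2 per worker.
```

Because workers are decoupled from executors, an executor can hand a shuffle task to the gate and continue computing while a lightly loaded worker performs the transmission.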

Evaluation
The authors integrate Turbo into Apache Spark and evaluate it on a small-scale testbed and in a large-scale cluster. The small-scale testbed results show that Turbo improves network throughput by 15.1% while maintaining high system reliability. The large-scale production results show that Turbo reduces job completion time by 23.9% and increases the job completion rate by 2.03× over existing RDMA solutions.

Q&A:
Q1: How does Turbo keep communication and computation from blocking each other?
A1: Turbo decouples the worker from the executor, enabling communication and computation to overlap. It also uses a gate mechanism to allocate requests to the workers with the smallest workload, achieving load balancing.

Q2: How does Turbo handle out-of-order packets under multi-path transmission?
A2: When a DCT connection starts transmitting, the previous connection is guaranteed to have ended. Moreover, each DCT connection is a single-path transmission, so packets will not arrive out of order.

Personal thoughts
The paper introduces Turbo, a communication framework that is scalable and performs well in large RDMA clusters. It addresses the blocking problems encountered in RDMA communication and proposes a creative new RDMA transmission method. I think this is a very interesting system, and the paper nicely presents the new research objectives along with the inherent challenges and solutions.